Datasets and software used in papers on approximate string matching

Datasets and software used in the following papers:

SISAP 2012: Leonid Boytsov. 2012. Super-Linear Indices for Approximate Dictionary Searching. SISAP 2012: 162-176.
JEA ACM 2011: Leonid Boytsov. 2011. Indexing methods for approximate dictionary searching: Comparative analysis. J. Exp. Algorithmics 16, 1, Article 1 (May 2011).

You can download the articles on this page.
There are virtually no restrictions on using the software and data that I designed (see details here). I would appreciate it if you cite my work.
However, the archives contain several third-party packages, which I used for comparison. These packages may be subject to different licenses and may be covered by patents (I believe that agrep and the underlying shift-and algorithm are patented). They include at least the following:

G. Navarro's implementation of the lazy Levenshtein automaton (the folder NavarroDFA).
NR-grep (developed by G. Navarro).
agrep (developed by Sun Wu and Udi Manber).
FastSS.

Download and build instructions.

The data and sources used in the JEA ACM 2011 paper are also available here (check the tab "Source Materials").
1. Download and unpack the source file;
2. Download the datasets to the source file directory;
3. Check the README file for building/testing instructions.
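
The three steps above can also be scripted. The following is only a minimal sketch, not part of the distributed software: the helper names unpack_into and find_readmes are hypothetical, the archive format is assumed to be one that Python's shutil.unpack_archive recognizes by file extension (for example .zip or .tar.gz), and the archive file names in the usage comment are placeholders rather than the real download names.

```python
import shutil
from pathlib import Path


def unpack_into(archive, dest):
    """Unpack an archive into dest, creating dest if needed.

    Hypothetical helper: the format is inferred from the file extension
    by shutil.unpack_archive.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(dest))
    return dest


def find_readmes(root):
    """Return all files under root whose name starts with README."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name.upper().startswith("README")
    )


# Example usage (placeholder file names, not the actual archive names):
#   work = unpack_into("sources.tar.gz", "work")   # step 1
#   unpack_into("datasets.tar.gz", work)           # step 2
#   for readme in find_readmes(work):              # step 3
#       print(readme)
```

Unpacking both archives into the same directory mirrors step 2, which asks for the datasets to sit in the source file directory; the README located in step 3 remains the authoritative guide for building and testing.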
Source files.
- Version 1.1 (used in SISAP 2012 article).
- Version 1.0 (used in JEA ACM 2011 article).
Data sets.
- Version 1.1 (used in SISAP 2012). Size: almost 300 MB.
- Version 1.0 (used in JEA ACM 2011). Size: almost 200 MB.
You can also download individual datasets: Russian and English synthetic data sets (these can also be generated from the source files), ClueWeb09 words, and random sequences from the human genome. Note that you may need to create a Google account to access individual data sets.
