System: FGASQ
Introduction
FGASQ is an efficient approximate substring query tool with support for local optimal matching. It was implemented using GNU C++.
Features
The released source supports the following features:
- build a q-gram index for the text
- approximate substring search (locally optimal similar substrings)
  - with the q-gram index
  - without the q-gram index
Download
The source code of FGASQ is available here for download. The software has been tested on a Linux environment (Ubuntu).
FGASQ.tar.gz
To compile the software:
make
The following executable file will be generated:
subsearch Perform approximate substring search
We also provide example data here: example data
The example data includes a 10M file text10M.txt as a text sequence and a file query5-100.txt containing 5 query sequences.
Requirements
The code can be run on Linux using the g++ compiler.
Step-by-Step Instruction
Run "make" to compile the code and generate the executable file.
The program parameters are listed by typing "./subsearch".
Syntax: subsearch <text file> <query file>
    -O <output file>
    -q <gram length> (default: 11)
    -H <threshold of edit distance> (default: 5)
    -G <use gram index or not>: 0 - do not use
                                1 - use gram index (default)
For example, run "./subsearch text10M.txt query5-100.txt -O result.txt" to perform approximate substring search on the example data. The q-gram index is built automatically if the gram index option is enabled (-G 1, the default), and the final results are written to the file result.txt.
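To make the idea of a q-gram index concrete, here is a minimal sketch that maps every length-q substring of a text to the positions where it starts. It is an illustration only and is not taken from the FGASQ source: the names QGramIndex and build_qgram_index are hypothetical, and the real index may be laid out quite differently (for example, as packed integer codes rather than strings).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: q-gram -> start positions of that q-gram in the text.
using QGramIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// Illustrative sketch (not the FGASQ implementation): slide a window of
// length q over the text and record where each q-gram occurs.
QGramIndex build_qgram_index(const std::string& text, std::size_t q) {
    assert(q > 0);
    QGramIndex index;
    // A text shorter than q contains no q-grams; the loop body never runs.
    for (std::size_t i = 0; i + q <= text.size(); ++i) {
        index[text.substr(i, q)].push_back(i);
    }
    return index;
}
```

With such an index, the q-grams of a query can be looked up directly to find candidate regions of the text instead of scanning the whole text.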
Note:
1. Every single line in the query file is treated as a query sequence.
2. In our algorithm, we use a lower bound l+1-(k+1)*q, where l is the length of the query, k is the edit distance (ED) threshold, and q is the gram length. The parameter q must therefore be adjusted so that this lower bound is above 0; otherwise, the q-gram index cannot speed up the query process. For example, the queries in our example data have length 100, so if the threshold is fixed at 10, q should be set to 9 or below.
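The lower bound in note 2 can be checked before running a search. The sketch below evaluates l+1-(k+1)*q and derives the largest q that keeps it above 0. The function names gram_lower_bound and max_useful_q are hypothetical helpers written for this illustration and are not part of the FGASQ source.

```cpp
#include <cassert>

// Lower bound from note 2: l + 1 - (k + 1) * q, where
// l = query length, k = edit distance threshold, q = gram length.
long gram_lower_bound(long l, long k, long q) {
    assert(l >= 0 && k >= 0 && q >= 1);
    return l + 1 - (k + 1) * q;
}

// Largest q for which the bound stays above 0 (returns 0 if none exists).
// For integers: l + 1 - (k + 1) * q > 0  <=>  (k + 1) * q <= l
//                                        <=>  q <= l / (k + 1).
long max_useful_q(long l, long k) {
    assert(l >= 0 && k >= 0);
    return l / (k + 1);
}
```

For the example data, max_useful_q(100, 10) evaluates to 9, which matches the note: with q = 9 the bound is 2, while with q = 10 it drops to -9. With the default threshold of 5, the default q = 11 gives a bound of 35.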
Datasets
We used human genomes in our experiments.