This is a 2-BWT implementation based on FM-Index, based on High Throughput Short Read Alignment via Bi-directional BWT
(http://www.cs.bgu.ac.il/~dekelts/tsds14/Lam.pdf)
A simple C++ based FM-Index [1] implementation using RRR [4] wavelet trees [5]
which allows to build a full-text index over a given text T
of size n
supporting the following operations:
count(P,m)
: count the number of occurences of patternP
of sizem
inT
.count1(P,m)
: count the number of occurences of patternP
of sizem
inT
with exactly one error.count2(P,m)
: count the number of occurences of patternP
of sizem
inT
with exactly two errors.locate(P,m)
: locate the text positions of all occurences ofP
of sizem
inT
.locate1(P,m)
: locate the text positions of all occurences ofP
of sizem
inT
with exactly one error.locate2(P,m)
: locate the text positions of all occurences ofP
of sizem
inT
with exactly two errors.extract(A,B)
: extractT[A,B]
from the index.recover()
: recoverT
from the index.
The constructed index uses nH_k + o(n log sigma)
bits of space [3] which is roughly the size of
the compressed representation of T
and can perform the above operations without the need
to store T
. An empirical evaluation of the index is sown in the Benchmark section below.
Drawbacks of the FM-Index is long construction time and high memory requirements during construction.
make
./fmbuild alice29.txt alice29.txt.fm
Builds and writes the FM-Index alice29.txt.fm
.
./fmcount -i alice29.txt.fm alice.qry [0-2]
The queries are stored in a new line seperated file:
the
house
keep
Alice
and
The index returns the number of occurrences for each query:
./fmcount -i alice29.txt.fm alice.qry
the : 2101
house : 20
keep : 11
Alice : 395
and : 880
./fmlocate -i alice29.txt.fm alice.qry [0-2]
The index returns a sorted list of the locations of all occurences for each query:
./fmlocate -i alice29.txt.fm alice.qry 0
Read 3 queries
keep (11) : 46385 51125 69491 74680 81562 83046 104830 105180 133621 149966 151623
poison (3) : 8151 8619 8731
tomorrow (1) : 63637
./fmextract -i alice29.txt.fm alice.extract
The queries are stored in a new line seperated file:
118 147
1213 1245
24 55
The index returns the extracted text snippet for each query:
./fmextract -i alice29.txt.fm alice.extract
118 - 147 : 'THE MILLENNIUM FULCRUM EDITION'
1213 - 1245 : 'TOOK A WATCH OUT OF ITS WAISTCOAT'
24 - 55 : 'ALICE'S ADVENTURES IN WONDERLAND'
./fmrecover -i alice29.txt.fm
The index outputs the original text to stdout
.
The -v
command line parameter enables verbose messages:
./fmbuild -v alice29.txt
building index.
- remapping alphabet.
- creating cumulative counts C[].
- performing bwt.
- sample SA locations.
- creating bwt output.
- create RRR wavelet tree over bwt.
build FM-Index done. (0.101 sec)
space usage:
- remap_reverse: 75 bytes (0.07%)
- C: 1028 bytes (0.90%)
- Suffixes: 9508 bytes (8.31%)
- Positions: 9512 bytes (8.31%)
- Sampled: 7948 bytes (6.95%)
- T_bwt: 86088 bytes (75.23%)
input Size n = 152090 bytes
index Size = 114431 bytes (0.75 n)
writing FM Index to file 'alice29.txt.fm'
A small code example illustrating the use of the FM class:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "FM.h"
int
main(int argc,char** argv) {
uint8_t* T = /* read text */
uint32_t n = /* sizeof T */
FM* = new FM(T,n);
if(FM) {
/* count the occurences of 'house' in T */
uint32_t cnt = FM->count("house",strlen("house"));
/* get all locations offsets of 'house' in T */
uint32_t matches; /* number of matches */
uint32_t* locs; /* list of location offsets */
locs = FM->locate("house",strlen("house"),&matches);
/* extract text snippet from T */
uint8_t* snippet = FM->extract(5,251);
/* recover T from index */
uint32_t nnew; /* size of Tnew */
uint8_t* Tnew = FM->reconstructText(&nnew);
delete FM;
}
}
Compile with g++ -o test main.cpp FM.cpp util.c libcds.a libdivsufsort.a
Experiments are similar to the ones described in [1]. All experiments were run on a Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz
using 4GB of RAM.
The default samplerate s=64
was used in all experiments.
Test data was taken from the The Pizza&Chili Site and TREC.
File | Description | Alphabet size | Entropy (bps) |
---|---|---|---|
wsj | English Text taken from the TREC wsj collection | 90 | 4.60 |
src | Concatenated source code (.c,.h,.C,.java) of linux-2.6.11.6 and gcc-4.0.0 | 230 | 5.47 |
proteins | Sequence of newline-separated protein sequences | 25 | 4.20 |
dna | Gene DNA sequences | 4 | 1.97 |
xml | XML that provides bibliographic info on compsci pubs(dblp) | 96 | 5.26 |
Peak memory usage was measured using the valgrind --tool=massif
tool. Running time was measured using the gettimeofday()
system call.
File | Size [MB] | Time [sec] | Memory [MB] |
---|---|---|---|
wsj | 3 | 2.157 | 20.02 |
wsj | 6 | 4.487 | 39.73 |
wsj | 12 | 9.293 | 79.15 |
wsj | 25 | 19.028 | 158.0 |
wsj | 50 | 38.699 | 315.7 |
wsj | 100 | 78.287 | 631.2 |
wsj | 200 | 158.341 | 1233 |
src | 50 | 39.909 | 315.7 |
src | 100 | 80.489 | 631.2 |
src | 200 | 163.072 | 1233 |
proteins | 50 | 35.374 | 315.7 |
proteins | 100 | 72.824 | 631.2 |
proteins | 200 | 146.173 | 1233 |
dna | 50 | 25.941 | 315.7 |
dna | 100 | 53.287 | 631.2 |
dna | 200 | 109.449 | 1233 |
xml | 50 | 36.601 | 315.7 |
xml | 100 | 74.104 | 631.2 |
xml | 200 | 150.050 | 1233 |
Construction time increases linearly O(n)
with size n
of T
. Memory requirement is roughly 6n
independent of the file type.
File | Size [MB] | Index Size [MB] | Index Size [relative to file size] |
---|---|---|---|
wsj | 25 | 15 | 0.6 |
wsj | 50 | 29 | 0.58 |
wsj | 100 | 57 | 0.57 |
wsj | 200 | 112 | 0.56 |
src | 50 | 30 | 0.77 |
src | 100 | 59 | 0.77 |
src | 200 | 117 | 0.75 |
proteins | 50 | 36 | 0.72 |
proteins | 100 | 71 | 0.71 |
proteins | 200 | 137 | 0.685 |
dna | 50 | 24 | 0.48 |
dna | 100 | 48 | 0.48 |
dna | 200 | 95 | 0.475 |
xml | 50 | 25 | 0.50 |
xml | 100 | 50 | 0.50 |
xml | 200 | 99 | 0.495 |
Index size depends on the compressability of T
. For each specific file type, as n
increases,
the size of the auxillary data becomes less significant which leads to better overall compression ratio.
50000 random patterns for each length m=5 to 50
were extracted from each 200MB file.
The graph shows the time in seconds it took the FM-Index to answer all 50000 queries on different file types. Note that the query
time grows linearly O(m)
with the pattern length. Files with smaller alphabet show faster query times overall.
Also note that the query time (m=20
) does not depend on the size n
of the text T
:
File | Size [MB] | Query Time [sec] |
---|---|---|
wsj | 3 | 13.37 |
wsj | 6 | 11.96 |
wsj | 12 | 12.74 |
wsj | 25 | 12.74 |
wsj | 50 | 12.07 |
wsj | 100 | 14.59 |
wsj | 200 | 14.34 |
Overall, searching for 50000 patterns in T
using a FM-Index is fast but requieres considerable amount of upfront work (time+space).
k
queries pf length 5
are randomly selected from T
so they roughly amount to a certain number of total occurences over all queries.
Note the running time increases linearly with the number of occurrences. Similar to count()
, files with small alphabet size perform
better most likely due to decreased height of the wavelet tree. Comparing the running times of both count()
and locate()
we observe
that locate()
is significantly slower than count()
as patterns (especially short ones), on large text collections, tend to have many occurrences.
The samplerate s
(default = 64) determines how often the underlying suffix array is sampled to enable faster locate()
calls. For different samplerates
the index achieves different time and space trade-offs during the locate()
operation:
The graph above shows the time required to retrieve the locations of 50000 occurrences compared to the size of the FM-Index for different suffix samplerates.
Smaller samplerate decrease the locate()
running time but the space used by the index increases. For s=8
the index is 1.5 times the sizes of the original text.
For s=512
, that is only every 512th position in the underlying suffix array is sampled, the index is less than 50% of the original text. The above graph shows that the
best time-space trade-off is achieved for samplerates between 64 and 128.
Extracting snippets of T
from the index shows the same properties as the locate()
operation. Like locate()
, the running time of extract()
grows linearly with the size of the
size of the snippet. The running time is also influenced by the samplerate s
which provides the same time and space trade-offs described above.
The following libraries are needed to use the index. Sourcecode for both libaries is included.
- libcds -- succinct low level data structures
- libdivsufsort -- fast suffix sorting (requires cmake to build)
GPL v3 (see LICENCE file)
- Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
- Francisco Claude and Gonzalo Navarro. Practical Rank/Select Queries over Arbitrary Sequences. SPIRE'08 176-187, 2008.
- Veli M"akinen and Gonzalo Navarro. Implicit Compression Boosting with Applications to Self-Indexing. SPIRE'07 214-226.
- R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. SODA'02, 233-242.
- R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. SODA'03, 841-850.
- The Pizza&Chili Site.
Matthias Petri [email protected] see github.