Skip to content

Inducing enhanced suffix arrays for string collections [DCC'16, TCS 2017]

License

Notifications You must be signed in to change notification settings

felipelouza/gsa-is

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

#gsa-is

This code is an implementation of gSAIS and gSACA-K [1], which extend the linear­time suffix sorting algorithms SAIS [2] and SACA-K [3] to compute the generalized suffix array.

Overall, SACA-K's time-space trade-off is Pareto optimal compared to the all other algorithms in the experiments.

-- ##run:

To run a test type:

make
make run DIR=dataset INPUT=uniprot.100.fasta K=100 MODE=6

One can change to 32 bits integers (when n < 2^31) in lib/utils.h, setting m64 to 0.

-- Settings:

MODE parameter specifies which algorithm is called by main.c:

  • 1: SAIS*
  • 2: SACA-K*
  • 3: SAIS
  • 4: SACA-K
  • 5: gSAIS
  • 6: gSACA-K

* SAIS and SACA-K versions that receive an integer alphabet as input.

-- ##experiments:

Collections:

Collection size (GB) d n n/d available at
Revision 0.39 20,433 419,437,293 20,527 http://jltsiren.kapsi.fi/data/fiwiki.bz2
Page 3.74 1,000 4,019,585,128 4,019,585 http://jltsiren.kapsi.fi/data/fiwiki.bz2
Influenza 0.56 394,217 597,471,768 1,516 ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz
Enwiki 8.32 3,903,703 8,933,518,792 2,288 http://algo2.iti.kit.edu/gog/projects/ALENEX15/collections/ENWIKIBIG/text_SURF.sdsl
DNA reads 2.87 32,621,862 3,082,739,100 94 http://gage.cbcb.umd.edu/data/Hg_chr14/Data.quakeCor.tgz
Uniprot 15.77 50,825,784 16,931,428,229 333 ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/

Datasets:

Dataset d n size (MB) avgLen
pages 1 2 2,371,241 2.26 1,185,621
pages 2 7 14,707,622 14.03 2,101,089
pages 3 15 35,043,296 33.42 2,336,220
pages 4 31 52,890,374 50.44 1,706,141
pages 5 62 119,466,260 113.93 1,926,875
pages 6 125 412,148,014 393.05 3,297,184
pages 7 250 840,886,051 801.93 3,363,544
pages 8 450 2,127,332,977 2,028.78 4,727,407
pages 9 650 2,635,913,891 2,513.80 4,055,252
pages 10 1,000 4,019,585,128 3,833.38 4,019,585
revesion 1 40 99,138 0.09 2,478
revesion 2 80 260,366 0.25 3,255
revesion 3 159 914,542 0.87 5,752
revesion 4 319 2,091,943 2.00 6,558
revesion 5 638 4,294,715 4.10 6,732
revesion 6 1,277 16,058,904 15.31 12,575
revesion 7 2,554 34,168,931 32.59 13,379
revesion 8 5,108 59,888,778 57.11 11,725
revesion 9 10,217 144,241,776 137.56 14,118
revesion 10 20,433 419,437,293 400.01 20,527
influenza 1 769 924,268 0.88 1,202
influenza 2 1,539 1,773,953 1.69 1,153
influenza 3 3,079 3,600,436 3.43 1,169
influenza 4 6,159 7,056,864 6.73 1,146
influenza 5 12,319 14,100,766 13.45 1,145
influenza 6 24,638 32,959,239 31.43 1,338
influenza 7 49,277 70,800,914 67.52 1,437
influenza 8 98,554 142,930,737 136.31 1,450
influenza 9 197,108 289,558,572 276.14 1,469
influenza 10 394,217 597,471,768 569.79 1,516
wikipedia 1 7,624 13,303,449 12.69 1,745
wikipedia 2 15,248 25,737,808 24.55 1,688
wikipedia 3 30,497 53,002,468 50.55 1,738
wikipedia 4 60,995 101,073,689 96.39 1,657
wikipedia 5 121,990 189,880,688 181.08 1,557
wikipedia 6 243,981 478,842,959 456.66 1,963
wikipedia 7 487,963 1,071,827,039 1,022.17 2,197
wikipedia 8 975,926 2,315,093,668 2,207.85 2,372
wikipedia 9 1,951,852 4,459,594,679 4,253.00 2,285
wikipedia 10 3,903,703 8,933,518,792 8,519.67 2,288
reads 1 100,000 9,030,390 8.61 90
reads 2 200,000 17,703,050 16.88 89
reads 3 400,000 36,808,318 35.10 92
reads 4 800,000 75,089,460 71.61 94
reads 5 1,600,000 149,953,139 143.01 94
reads 6 3,200,000 302,041,259 288.05 94
reads 7 6,400,000 604,318,315 576.32 94
reads 8 12,800,000 1,209,578,418 1,153.54 94
reads 9 25,600,000 2,419,646,287 2,307.55 95
reads 10 32,621,862 3,082,739,100 2,939.93 94
protein 1 100,000 26,198,280 24.98 262
protein 2 200,000 52,503,813 50.07 263
protein 3 400,000 106,286,416 101.36 266
protein 4 800,000 219,589,476 209.42 274
protein 5 1,600,000 531,194,430 506.59 332
protein 6 3,200,000 1,253,483,804 1,195.42 392
protein 7 6,400,000 2,667,015,568 2,543.46 417
protein 8 12,800,000 5,162,366,462 4,923.22 403
protein 9 25,600,000 9,070,832,678 8,650.62 354
protein 10 50,825,784 16,931,428,229 16,147.07 333

Running time (seconds):

Dataset SAIS* SACA-K* SAIS SACA-K gSAIS gSACA-K
pages 1 2.90E-01 4.10E-01 1.90E-01 1.20E-01 1.90E-01 1.20E-01
pages 2 3.43E+00 3.35E+00 3.38E+00 2.09E+00 2.22E+00 2.06E+00
pages 3 1.06E+01 1.01E+01 8.61E+00 7.60E+00 9.12E+00 7.65E+00
pages 4 1.75E+01 1.59E+01 1.50E+01 1.25E+01 1.48E+01 1.23E+01
pages 5 4.44E+01 3.89E+01 3.78E+01 3.21E+01 3.82E+01 3.17E+01
pages 6 1.94E+02 1.57E+02 1.76E+02 1.35E+02 1.75E+02 1.33E+02
pages 7 4.09E+02 3.26E+02 3.74E+02 2.79E+02 3.73E+02 2.76E+02
pages 8 1.12E+03 8.88E+02 1.00E+03 9.22E+02 1.00E+03 7.31E+02
pages 9 1.54E+03 1.33E+03 1.31E+03 1.02E+03 1.31E+03 9.89E+02
pages 10 2.57E+03 2.17E+03 2.09E+03 1.61E+03 2.09E+03 1.56E+03
revesion 1 1.20E-02 1.00E-02 9.00E-03 8.00E-03 8.00E-03 7.00E-03
revesion 2 2.00E-02 1.50E-02 1.40E-02 1.20E-02 1.40E-02 1.20E-02
revesion 3 6.60E-02 6.70E-02 5.60E-02 5.30E-02 5.70E-02 5.30E-02
revesion 4 2.30E-01 2.40E-01 1.60E-01 1.50E-01 1.50E-01 1.50E-01
revesion 5 6.30E-01 6.80E-01 3.80E-01 3.10E-01 3.40E-01 3.20E-01
revesion 6 3.89E+00 4.32E+00 3.37E+00 3.07E+00 3.12E+00 2.92E+00
revesion 7 1.04E+01 1.01E+01 8.83E+00 7.72E+00 9.52E+00 7.70E+00
revesion 8 1.96E+01 1.83E+01 1.73E+01 1.48E+01 1.71E+01 1.47E+01
revesion 9 5.56E+01 4.87E+01 4.98E+01 4.07E+01 4.89E+01 3.99E+01
revesion 10 2.12E+02 1.70E+02 1.93E+02 1.45E+02 1.87E+02 1.39E+02
influenza 1 8.40E-02 6.80E-02 6.60E-02 6.50E-02 6.70E-02 6.50E-02
influenza 2 1.58E-01 1.61E-01 1.23E-01 1.24E-01 1.30E-01 1.24E-01
influenza 3 4.23E-01 4.24E-01 2.68E-01 2.46E-01 2.69E-01 2.54E-01
influenza 4 1.21E+00 1.17E+00 6.30E-01 6.20E-01 6.70E-01 6.10E-01
influenza 5 2.96E+00 2.92E+00 1.67E+00 1.58E+00 1.75E+00 1.54E+00
influenza 6 8.89E+00 8.70E+00 6.67E+00 6.16E+00 6.70E+00 6.08E+00
influenza 7 2.19E+01 2.13E+01 1.84E+01 1.70E+01 1.85E+01 1.67E+01
influenza 8 4.94E+01 4.57E+01 4.29E+01 3.81E+01 4.34E+01 3.76E+01
influenza 9 1.10E+02 9.64E+01 9.66E+01 8.23E+01 9.85E+01 8.16E+01
influenza 10 2.60E+02 2.12E+02 2.30E+02 1.82E+02 2.32E+02 1.80E+02
wikipedia 1 3.41E+00 3.37E+00 1.98E+00 1.92E+00 2.41E+00 2.12E+00
wikipedia 2 8.12E+00 7.77E+00 6.71E+00 6.25E+00 6.48E+00 5.99E+00
wikipedia 3 1.95E+01 1.83E+01 1.78E+01 1.50E+01 1.65E+01 1.47E+01
wikipedia 4 4.24E+01 3.79E+01 4.02E+01 3.23E+01 3.77E+01 3.15E+01
wikipedia 5 9.05E+01 7.54E+01 8.52E+01 6.68E+01 8.33E+01 6.52E+01
wikipedia 6 2.72E+02 2.03E+02 2.53E+02 1.83E+02 2.47E+02 1.81E+02
wikipedia 7 6.86E+02 4.98E+02 6.15E+02 4.32E+02 6.24E+02 4.27E+02
wikipedia 8 1.96E+03 1.43E+03 1.54E+03 1.09E+03 1.54E+03 1.06E+03
wikipedia 9 4.40E+03 3.47E+03 3.25E+03 2.33E+03 3.62E+03 2.35E+03
wikipedia 10 7.89E+03 6.18E+03 7.07E+03 5.25E+03 7.08E+03 5.08E+03
reads 1 1.76E+00 1.79E+00 1.06E+00 1.08E+00 1.07E+00 1.05E+00
reads 2 4.63E+00 4.52E+00 4.22E+00 4.04E+00 3.24E+00 3.02E+00
reads 3 1.20E+01 1.12E+01 1.11E+01 9.84E+00 1.02E+01 9.16E+00
reads 4 2.85E+01 2.46E+01 2.74E+01 2.16E+01 2.54E+01 2.13E+01
reads 5 6.29E+01 5.10E+01 6.24E+01 4.81E+01 5.86E+01 4.61E+01
reads 6 1.38E+02 1.07E+02 1.39E+02 1.03E+02 1.31E+02 9.74E+01
reads 7 3.08E+02 2.22E+02 2.99E+02 2.18E+02 2.81E+02 2.48E+02
reads 8 6.90E+02 4.88E+02 6.52E+02 4.55E+02 6.10E+02 4.19E+02
reads 9 2.07E+03 1.83E+03 1.49E+03 1.03E+03 1.39E+03 9.36E+02
reads 10 2.16E+03 1.68E+03 1.94E+03 1.37E+03 1.81E+03 1.24E+03
protein 1 1.01E+01 9.32E+00 6.94E+00 6.15E+00 8.30E+00 7.30E+00
protein 2 2.35E+01 2.08E+01 1.88E+01 1.62E+01 2.07E+01 1.74E+01
protein 3 5.51E+01 4.56E+01 4.78E+01 3.89E+01 5.10E+01 3.98E+01
protein 4 1.27E+02 9.93E+01 1.19E+02 9.03E+01 1.24E+02 9.01E+01
protein 5 3.63E+02 2.55E+02 3.44E+02 2.39E+02 3.39E+02 2.34E+02
protein 6 9.71E+02 6.63E+02 8.99E+02 5.85E+02 8.97E+02 5.73E+02
protein 7 2.52E+03 1.88E+03 2.22E+03 1.47E+03 2.20E+03 1.41E+03
protein 8 5.13E+03 3.88E+03 5.34E+03 3.51E+03 4.60E+03 3.04E+03
protein 9 9.45E+03 7.17E+03 1.02E+04 7.34E+03 8.71E+03 5.96E+03
protein 10 2.13E+04 1.65E+04 1.87E+04 1.26E+04 1.86E+04 1.22E+04

Space (MB):

Dataset SAIS* SACA-K* SAIS SACA-K gSAIS gSACA-K
pages 1 18.50 18.09 11.72 11.31 11.72 11.31
pages 2 114.76 112.21 72.68 70.13 72.68 70.13
pages 3 273.43 267.36 173.17 167.10 173.17 167.10
pages 4 412.69 403.52 261.37 252.20 261.37 252.20
pages 5 932.11 911.46 590.32 569.66 590.32 569.66
pages 6 3,215.52 3,144.44 2,036.36 1,965.28 2,036.36 1,965.28
pages 7 6,560.70 6,415.45 4,154.91 4,009.66 4,154.91 4,009.66
pages 8 16,597.26 16,230.27 10,510.91 10,143.92 10,510.91 10,143.92
pages 9 40,680.40 40,220.86 23,083.78 22,624.23 23,083.78 22,624.23
pages 10 62,034.13 61,334.01 35,200.50 34,500.38 35,200.50 34,500.38
revesion 1 0.78 0.76 0.49 0.47 0.49 0.47
revesion 2 2.03 1.99 1.29 1.24 1.29 1.24
revesion 3 7.14 6.98 4.52 4.36 4.52 4.36
revesion 4 16.33 15.96 10.34 9.98 10.34 9.98
revesion 5 33.52 32.77 21.23 20.48 21.24 20.48
revesion 6 125.30 122.53 79.36 76.58 79.36 76.58
revesion 7 266.61 260.70 168.85 162.93 168.85 162.93
revesion 8 467.31 456.94 295.96 285.57 295.96 285.57
revesion 9 1,125.41 1,100.52 712.73 687.80 712.73 687.80
revesion 10 3,272.45 3,200.13 2,072.43 2,000.03 2,072.43 2,000.03
influenza 1 7.30 7.06 4.65 4.41 4.65 4.41
influenza 2 13.97 13.54 8.88 8.46 8.89 8.46
influenza 3 28.27 27.48 17.95 17.17 17.97 17.17
influenza 4 55.30 53.86 35.08 33.65 35.11 33.65
influenza 5 110.36 107.63 69.96 67.24 70.01 67.24
influenza 6 257.69 251.55 163.27 157.16 163.39 157.16
influenza 7 553.21 540.36 350.40 337.61 350.64 337.61
influenza 8 1,116.44 1,090.85 707.00 681.55 707.51 681.55
influenza 9 2,261.16 2,209.91 1,431.85 1,380.72 1,432.73 1,380.72
influenza 10 4,665.13 4,559.85 2,953.86 2,848.97 2,955.75 2,848.97
wikipedia 1 107.45 101.53 69.39 63.44 69.39 63.44
wikipedia 2 207.41 196.42 133.77 122.73 133.78 122.73
wikipedia 3 426.01 404.49 274.35 252.74 274.37 252.74
wikipedia 4 810.54 771.36 521.34 481.96 521.36 481.96
wikipedia 5 1,519.14 1,449.14 975.82 905.42 975.88 905.42
wikipedia 6 3,820.79 3,654.21 2,450.67 2,283.30 2,450.81 2,283.30
wikipedia 7 8,530.61 8,179.25 5,463.79 5,110.87 5,464.09 5,110.87
wikipedia 8 36,372.60 35,332.97 20,916.24 19,870.61 20,917.68 19,870.61
wikipedia 9 69,951.59 68,062.91 40,180.41 38,277.01 40,180.59 38,277.01
wikipedia 10 140,078.79 136,344.46 80,440.72 76,677.01 80,441.12 76,677.01
reads 1 72.91 69.28 47.04 43.06 47.07 43.06
reads 2 142.65 135.83 91.90 84.42 92.00 84.42
reads 3 295.69 282.35 175.52 175.52 190.38 175.52
reads 4 600.92 575.94 385.35 358.06 386.09 358.06
reads 5 1,194.90 1,150.16 763.98 715.03 765.88 715.03
reads 6 2,396.03 2,316.60 1,527.06 1,440.25 1,531.88 1,440.25
reads 7 4,777.69 4,635.00 3,036.75 2,881.62 3,048.72 2,881.62
reads 8 9,543.64 9,277.18 6,065.95 5,767.72 6,083.01 5,767.72
reads 9 37,781.02 37,116.19 21,555.19 20,767.99 21,628.14 20,767.99
reads 10 48,114.93 47,287.75 27,450.08 26,459.36 27,535.43 26,459.36
protein 1 212.12 200.26 137.01 124.92 137.16 124.92
protein 2 424.63 401.34 274.11 250.36 274.41 250.36
protein 3 857.04 812.43 552.33 506.81 552.95 506.81
protein 4 1,768.69 1,678.39 1,139.24 1,047.09 1,140.44 1,047.09
protein 5 4,288.93 4,058.80 2,767.10 2,532.93 2,769.17 2,532.93
protein 6 10,081.31 9,575.53 6,490.38 5,977.08 6,495.06 5,977.08
protein 7 42,240.02 40,744.26 24,411.26 22,891.18 24,435.77 22,891.18
protein 8 81,798.41 78,869.12 47,289.48 44,308.95 47,335.89 44,308.95
protein 9 143,689.08 138,605.24 83,036.60 77,855.58 83,134.74 77,855.58
protein 10 267,478.92 258,740.86 154,213.88 145,323.62 154,449.44 145,323.62

Workspace (MB, but SACA-K and gSACA-K):

Dataset SAIS* SACA-K* SAIS SACA-K gSAIS gSACA-K
pages 1 0.41 0.00 0.41 1,024 0.41 1,024
pages 2 2.55 0.00 2.55 1,024 2.55 1,024
pages 3 6.07 0.00 6.07 1,024 6.07 1,024
pages 4 9.17 0.00 9.17 1,024 9.17 1,024
pages 5 20.66 0.00 20.66 1,024 20.66 1,024
pages 6 71.08 0.00 71.08 1,024 71.08 1,024
pages 7 145.25 0.00 145.25 1,024 145.25 1,024
pages 8 367.00 0.00 367.00 1,024 367.00 1,024
pages 9 459.55 0.01 459.55 1,024 459.55 1,024
pages 10 700.13 0.01 700.13 1,024 700.13 1,024
revesion 1 0.02 0.00 0.02 1,024 0.02 1,024
revesion 2 0.05 0.00 0.05 1,024 0.05 1,024
revesion 3 0.16 0.00 0.16 1,024 0.16 1,024
revesion 4 0.37 0.00 0.37 1,024 0.37 1,024
revesion 5 0.76 0.00 0.76 1,024 0.76 1,024
revesion 6 2.78 0.01 2.78 1,024 2.78 1,024
revesion 7 5.92 0.01 5.92 1,024 5.92 1,024
revesion 8 10.39 0.02 10.39 1,024 10.39 1,024
revesion 9 24.93 0.04 24.93 1,024 24.93 1,024
revesion 10 72.40 0.08 72.40 1,024 72.40 1,024
influenza 1 0.25 0.00 0.24 1,024 0.25 1,024
influenza 2 0.43 0.01 0.43 1,024 0.43 1,024
influenza 3 0.80 0.01 0.78 1,024 0.80 1,024
influenza 4 1.46 0.02 1.43 1,024 1.46 1,024
influenza 5 2.78 0.05 2.72 1,024 2.78 1,024
influenza 6 6.23 0.09 6.11 1,024 6.23 1,024
influenza 7 13.04 0.19 12.80 1,024 13.04 1,024
influenza 8 25.96 0.38 25.46 1,024 25.96 1,024
influenza 9 52.00 0.75 51.13 1,024 52.00 1,024
influenza 10 106.78 1.50 104.89 1,024 106.78 1,024
wikipedia 1 5.96 0.03 5.96 1,024 5.96 1,024
wikipedia 2 11.05 0.06 11.04 1,024 11.05 1,024
wikipedia 3 21.63 0.12 21.62 1,024 21.63 1,024
wikipedia 4 39.41 0.23 39.38 1,024 39.41 1,024
wikipedia 5 70.46 0.47 70.40 1,024 70.46 1,024
wikipedia 6 167.51 0.93 167.37 1,024 167.51 1,024
wikipedia 7 353.22 1.86 352.92 1,024 353.22 1,024
wikipedia 8 1,047.08 7.45 1,045.63 1,024 1,047.08 1,024
wikipedia 9 1,903.58 14.89 1,903.40 1,024 1,903.58 1,024
wikipedia 10 3,764.11 29.78 3,763.71 1,024 3,764.11 1,024
reads 1 4.01 0.38 3.98 1,024 4.01 1,024
reads 2 7.58 0.76 7.48 1,024 7.58 1,024
reads 3 14.86 1.53 14.58 1,024 14.86 1,024
reads 4 28.04 3.05 27.30 1,024 28.04 1,024
reads 5 50.85 6.10 48.94 1,024 50.85 1,024
reads 6 91.63 12.21 86.82 1,024 91.63 1,024
reads 7 167.11 24.42 155.14 1,024 167.11 1,024
reads 8 315.29 48.83 298.24 1,024 315.29 1,024
reads 9 860.14 195.31 787.20 1,024 860.14 1,024
reads 10 1,076.07 248.89 990.72 1,024 1,076.07 1,024
protein 1 12.24 0.38 12.09 1,024 12.24 1,024
protein 2 24.06 0.76 23.75 1,024 24.06 1,024
protein 3 46.14 1.53 45.52 1,024 46.14 1,024
protein 4 93.36 3.05 92.16 1,024 93.36 1,024
protein 5 236.24 6.10 234.17 1,024 236.24 1,024
protein 6 517.99 12.21 513.30 1,024 517.99 1,024
protein 7 1,544.59 48.83 1,520.08 1,024 1,544.59 1,024
protein 8 3,026.95 97.66 2,980.53 1,024 3,026.95 1,024
protein 9 5,279.16 195.31 5,181.02 1,024 5,279.16 1,024
protein 10 9,125.83 387.77 8,890.27 1,024 9,125.83 1,024

##references

[1] Louza, F. A., Gog, S., Telles, G. P, "Induced Suffix Sorting for String Collections", submitted to DCC, 2016.

[2] Nong G., Zhang S., Chan W. H., "Two efficient algorithms for linear time suffix array construction", IEEE Trans. Comput., vol. 60, no. 10, pp. 1471–1484, 2011

[3] Nong, G., "Practical linear-time O(1)-workspace suffix sorting for constant alphabets", ACM Trans. Inform. Syst., vol. 31, no. 3, pp. 1–15, 2013

About

Inducing enhanced suffix arrays for string collections [DCC'16, TCS 2017]

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published