Skip to content

Inducing enhanced suffix arrays for string collections [DCC'16, TCS 2017]

License

Notifications You must be signed in to change notification settings

felipelouza/gsa-is

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

#gsa-is

This code is an implementation of gSAIS and gSACA-K, which are modifications of SAIS [1] and SACA-K [2] algorithms to compute the generalized suffix array, maintaining their theoretical bounds and improving their practical performance

-- Settings:

We have implemented six MODEs in main.c:

  • 1: SAIS*
  • 2: SACA-K*
  • 3: SAIS
  • 4: SACA-K
  • 5: gSAIS
  • 6: gSACA-K

* uses integer alphabet as input.

-- ##run:

To run a test type:

make
make run DIR=dataset INPUT=uniprot.100.fasta K=100 MODE=6

##Experimental results

Collections:

Collection size (GiB) d n n/d
Revision 0.39 20,433 419,437,293 20,527
Page 3.74 1,000 4,019,585,128 4,019,585
Influenza 0.56 394,217 597,471,768 1,516
Enwiki 8.32 3,903,703 8,933,518,792 2,288
DNA reads 2.87 32,621,862 3,082,739,100 94
Uniprot 15.77 50,825,784 16,931,428,229 333

Datasets:

Dataset d n size (MiB) avgLen
pages 1 2 2,371,241 2.26 1,185,621
pages 2 7 14,707,622 14.03 2,101,089
pages 3 15 35,043,296 33.42 2,336,220
pages 4 31 52,890,374 50.44 1,706,141
pages 5 62 119,466,260 113.93 1,926,875
pages 6 125 412,148,014 393.05 3,297,184
pages 7 250 840,886,051 801.93 3,363,544
pages 8 450 2,127,332,977 2,028.78 4,727,407
pages 9 650 2,635,913,891 2,513.80 4,055,252
pages 10 1,000 4,019,585,128 3,833.38 4,019,585
revesion 1 40 99,138 0.09 2,478
revesion 2 80 260,366 0.25 3,255
revesion 3 159 914,542 0.87 5,752
revesion 4 319 2,091,943 2.00 6,558
revesion 5 638 4,294,715 4.10 6,732
revesion 6 1,277 16,058,904 15.31 12,575
revesion 7 2,554 34,168,931 32.59 13,379
revesion 8 5,108 59,888,778 57.11 11,725
revesion 9 10,217 144,241,776 137.56 14,118
revesion 10 20,433 419,437,293 400.01 20,527
influenza 1 769 924,268 0.88 1,202
influenza 2 1,539 1,773,953 1.69 1,153
influenza 3 3,079 3,600,436 3.43 1,169
influenza 4 6,159 7,056,864 6.73 1,146
influenza 5 12,319 14,100,766 13.45 1,145
influenza 6 24,638 32,959,239 31.43 1,338
influenza 7 49,277 70,800,914 67.52 1,437
influenza 8 98,554 142,930,737 136.31 1,450
influenza 9 197,108 289,558,572 276.14 1,469
influenza 10 394,217 597,471,768 569.79 1,516
wikipedia 1 7,624 13,303,449 12.69 1,745
wikipedia 2 15,248 25,737,808 24.55 1,688
wikipedia 3 30,497 53,002,468 50.55 1,738
wikipedia 4 60,995 101,073,689 96.39 1,657
wikipedia 5 121,990 189,880,688 181.08 1,557
wikipedia 6 243,981 478,842,959 456.66 1,963
wikipedia 7 487,963 1,071,827,039 1,022.17 2,197
wikipedia 8 975,926 2,315,093,668 2,207.85 2,372
wikipedia 9 1,951,852 4,459,594,679 4,253.00 2,285
wikipedia 10 3,903,703 8,933,518,792 8,519.67 2,288
reads 1 100,000 9,030,390 8.61 90
reads 2 200,000 17,703,050 16.88 89
reads 3 400,000 36,808,318 35.10 92
reads 4 800,000 75,089,460 71.61 94
reads 5 1,600,000 149,953,139 143.01 94
reads 6 3,200,000 302,041,259 288.05 94
reads 7 6,400,000 604,318,315 576.32 94
reads 8 12,800,000 1,209,578,418 1,153.54 94
reads 9 25,600,000 2,419,646,287 2,307.55 95
reads 10 32,621,862 3,082,739,100 2,939.93 94
protein 1 100,000 26,198,280 24.98 262
protein 2 200,000 52,503,813 50.07 263
protein 3 400,000 106,286,416 101.36 266
protein 4 800,000 219,589,476 209.42 274
protein 5 1,600,000 531,194,430 506.59 332
protein 6 3,200,000 1,253,483,804 1,195.42 392
protein 7 6,400,000 2,667,015,568 2,543.46 417
protein 8 12,800,000 5,162,366,462 4,923.22 403
protein 9 25,600,000 9,070,832,678 8,650.62 354
protein 10 50,825,784 16,931,428,229 16,147.07 333

Running time (seconds):

Dataset SAIS (int) SACA-K (int) SAIS SACA-K gSAIS gSACA-K
pages 1 2.90E-01 4.10E-01 1.90E-01 1.20E-01 1.90E-01 1.20E-01
pages 2 3.43E+00 3.35E+00 3.38E+00 2.09E+00 2.22E+00 2.06E+00
pages 3 1.06E+01 1.01E+01 8.61E+00 7.60E+00 9.12E+00 7.65E+00
pages 4 1.75E+01 1.59E+01 1.50E+01 1.25E+01 1.48E+01 1.23E+01
pages 5 4.44E+01 3.89E+01 3.78E+01 3.21E+01 3.82E+01 3.17E+01
pages 6 1.94E+02 1.57E+02 1.76E+02 1.35E+02 1.75E+02 1.33E+02
pages 7 4.09E+02 3.26E+02 3.74E+02 2.79E+02 3.73E+02 2.76E+02
pages 8 1.12E+03 8.88E+02 1.00E+03 9.22E+02 1.00E+03 7.31E+02
pages 9 1.54E+03 1.33E+03 1.31E+03 1.02E+03 1.31E+03 9.89E+02
pages 10 2.57E+03 2.17E+03 2.09E+03 1.61E+03 2.09E+03 1.56E+03
revesion 1 1.20E-02 1.00E-02 9.00E-03 8.00E-03 8.00E-03 7.00E-03
revesion 2 2.00E-02 1.50E-02 1.40E-02 1.20E-02 1.40E-02 1.20E-02
revesion 3 6.60E-02 6.70E-02 5.60E-02 5.30E-02 5.70E-02 5.30E-02
revesion 4 2.30E-01 2.40E-01 1.60E-01 1.50E-01 1.50E-01 1.50E-01
revesion 5 6.30E-01 6.80E-01 3.80E-01 3.10E-01 3.40E-01 3.20E-01
revesion 6 3.89E+00 4.32E+00 3.37E+00 3.07E+00 3.12E+00 2.92E+00
revesion 7 1.04E+01 1.01E+01 8.83E+00 7.72E+00 9.52E+00 7.70E+00
revesion 8 1.96E+01 1.83E+01 1.73E+01 1.48E+01 1.71E+01 1.47E+01
revesion 9 5.56E+01 4.87E+01 4.98E+01 4.07E+01 4.89E+01 3.99E+01
revesion 10 2.12E+02 1.70E+02 1.93E+02 1.45E+02 1.87E+02 1.39E+02
influenza 1 8.40E-02 6.80E-02 6.60E-02 6.50E-02 6.70E-02 6.50E-02
influenza 2 1.58E-01 1.61E-01 1.23E-01 1.24E-01 1.30E-01 1.24E-01
influenza 3 4.23E-01 4.24E-01 2.68E-01 2.46E-01 2.69E-01 2.54E-01
influenza 4 1.21E+00 1.17E+00 6.30E-01 6.20E-01 6.70E-01 6.10E-01
influenza 5 2.96E+00 2.92E+00 1.67E+00 1.58E+00 1.75E+00 1.54E+00
influenza 6 8.89E+00 8.70E+00 6.67E+00 6.16E+00 6.70E+00 6.08E+00
influenza 7 2.19E+01 2.13E+01 1.84E+01 1.70E+01 1.85E+01 1.67E+01
influenza 8 4.94E+01 4.57E+01 4.29E+01 3.81E+01 4.34E+01 3.76E+01
influenza 9 1.10E+02 9.64E+01 9.66E+01 8.23E+01 9.85E+01 8.16E+01
influenza 10 2.60E+02 2.12E+02 2.30E+02 1.82E+02 2.32E+02 1.80E+02
wikipedia 1 3.41E+00 3.37E+00 1.98E+00 1.92E+00 2.41E+00 2.12E+00
wikipedia 2 8.12E+00 7.77E+00 6.71E+00 6.25E+00 6.48E+00 5.99E+00
wikipedia 3 1.95E+01 1.83E+01 1.78E+01 1.50E+01 1.65E+01 1.47E+01
wikipedia 4 4.24E+01 3.79E+01 4.02E+01 3.23E+01 3.77E+01 3.15E+01
wikipedia 5 9.05E+01 7.54E+01 8.52E+01 6.68E+01 8.33E+01 6.52E+01
wikipedia 6 2.72E+02 2.03E+02 2.53E+02 1.83E+02 2.47E+02 1.81E+02
wikipedia 7 6.86E+02 4.98E+02 6.15E+02 4.32E+02 6.24E+02 4.27E+02
wikipedia 8 1.96E+03 1.43E+03 1.54E+03 1.09E+03 1.54E+03 1.06E+03
wikipedia 9 4.40E+03 3.47E+03 3.25E+03 2.33E+03 3.62E+03 2.35E+03
wikipedia 10 7.89E+03 6.18E+03 7.07E+03 5.25E+03 7.08E+03 5.08E+03
reads 1 1.76E+00 1.79E+00 1.06E+00 1.08E+00 1.07E+00 1.05E+00
reads 2 4.63E+00 4.52E+00 4.22E+00 4.04E+00 3.24E+00 3.02E+00
reads 3 1.20E+01 1.12E+01 1.11E+01 9.84E+00 1.02E+01 9.16E+00
reads 4 2.85E+01 2.46E+01 2.74E+01 2.16E+01 2.54E+01 2.13E+01
reads 5 6.29E+01 5.10E+01 6.24E+01 4.81E+01 5.86E+01 4.61E+01
reads 6 1.38E+02 1.07E+02 1.39E+02 1.03E+02 1.31E+02 9.74E+01
reads 7 3.08E+02 2.22E+02 2.99E+02 2.18E+02 2.81E+02 2.48E+02
reads 8 6.90E+02 4.88E+02 6.52E+02 4.55E+02 6.10E+02 4.19E+02
reads 9 2.07E+03 1.83E+03 1.49E+03 1.03E+03 1.39E+03 9.36E+02
reads 10 2.16E+03 1.68E+03 1.94E+03 1.37E+03 1.81E+03 1.24E+03
protein 1 1.01E+01 9.32E+00 6.94E+00 6.15E+00 8.30E+00 7.30E+00
protein 2 2.35E+01 2.08E+01 1.88E+01 1.62E+01 2.07E+01 1.74E+01
protein 3 5.51E+01 4.56E+01 4.78E+01 3.89E+01 5.10E+01 3.98E+01
protein 4 1.27E+02 9.93E+01 1.19E+02 9.03E+01 1.24E+02 9.01E+01
protein 5 3.63E+02 2.55E+02 3.44E+02 2.39E+02 3.39E+02 2.34E+02
protein 6 9.71E+02 6.63E+02 8.99E+02 5.85E+02 8.97E+02 5.73E+02
protein 7 2.52E+03 1.88E+03 2.22E+03 1.47E+03 2.20E+03 1.41E+03
protein 8 5.13E+03 3.88E+03 5.34E+03 3.51E+03 4.60E+03 3.04E+03
protein 9 9.45E+03 7.17E+03 1.02E+04 7.34E+03 8.71E+03 5.96E+03
protein 10 2.13E+04 1.65E+04 1.87E+04 1.26E+04 1.86E+04 1.22E+04

Space (MiB):

Dataset SAIS (int) SACA-K (int) SAIS SACA-K gSAIS gSACA-K
pages 1 18.50 18.09 11.72 11.31 11.72 11.31
pages 2 114.76 112.21 72.68 70.13 72.68 70.13
pages 3 273.43 267.36 173.17 167.10 173.17 167.10
pages 4 412.69 403.52 261.37 252.20 261.37 252.20
pages 5 932.11 911.46 590.32 569.66 590.32 569.66
pages 6 3,215.52 3,144.44 2,036.36 1,965.28 2,036.36 1,965.28
pages 7 6,560.70 6,415.45 4,154.91 4,009.66 4,154.91 4,009.66
pages 8 16,597.26 16,230.27 10,510.91 10,143.92 10,510.91 10,143.92
pages 9 40,680.40 40,220.86 23,083.78 22,624.23 23,083.78 22,624.23
pages 10 62,034.13 61,334.01 35,200.50 34,500.38 35,200.50 34,500.38
revesion 1 0.78 0.76 0.49 0.47 0.49 0.47
revesion 2 2.03 1.99 1.29 1.24 1.29 1.24
revesion 3 7.14 6.98 4.52 4.36 4.52 4.36
revesion 4 16.33 15.96 10.34 9.98 10.34 9.98
revesion 5 33.52 32.77 21.23 20.48 21.24 20.48
revesion 6 125.30 122.53 79.36 76.58 79.36 76.58
revesion 7 266.61 260.70 168.85 162.93 168.85 162.93
revesion 8 467.31 456.94 295.96 285.57 295.96 285.57
revesion 9 1,125.41 1,100.52 712.73 687.80 712.73 687.80
revesion 10 3,272.45 3,200.13 2,072.43 2,000.03 2,072.43 2,000.03
influenza 1 7.30 7.06 4.65 4.41 4.65 4.41
influenza 2 13.97 13.54 8.88 8.46 8.89 8.46
influenza 3 28.27 27.48 17.95 17.17 17.97 17.17
influenza 4 55.30 53.86 35.08 33.65 35.11 33.65
influenza 5 110.36 107.63 69.96 67.24 70.01 67.24
influenza 6 257.69 251.55 163.27 157.16 163.39 157.16
influenza 7 553.21 540.36 350.40 337.61 350.64 337.61
influenza 8 1,116.44 1,090.85 707.00 681.55 707.51 681.55
influenza 9 2,261.16 2,209.91 1,431.85 1,380.72 1,432.73 1,380.72
influenza 10 4,665.13 4,559.85 2,953.86 2,848.97 2,955.75 2,848.97
wikipedia 1 107.45 101.53 69.39 63.44 69.39 63.44
wikipedia 2 207.41 196.42 133.77 122.73 133.78 122.73
wikipedia 3 426.01 404.49 274.35 252.74 274.37 252.74
wikipedia 4 810.54 771.36 521.34 481.96 521.36 481.96
wikipedia 5 1,519.14 1,449.14 975.82 905.42 975.88 905.42
wikipedia 6 3,820.79 3,654.21 2,450.67 2,283.30 2,450.81 2,283.30
wikipedia 7 8,530.61 8,179.25 5,463.79 5,110.87 5,464.09 5,110.87
wikipedia 8 36,372.60 35,332.97 20,916.24 19,870.61 20,917.68 19,870.61
wikipedia 9 69,951.59 68,062.91 40,180.41 38,277.01 40,180.59 38,277.01
wikipedia 10 140,078.79 136,344.46 80,440.72 76,677.01 80,441.12 76,677.01
reads 1 72.91 69.28 47.04 43.06 47.07 43.06
reads 2 142.65 135.83 91.90 84.42 92.00 84.42
reads 3 295.69 282.35 175.52 175.52 190.38 175.52
reads 4 600.92 575.94 385.35 358.06 386.09 358.06
reads 5 1,194.90 1,150.16 763.98 715.03 765.88 715.03
reads 6 2,396.03 2,316.60 1,527.06 1,440.25 1,531.88 1,440.25
reads 7 4,777.69 4,635.00 3,036.75 2,881.62 3,048.72 2,881.62
reads 8 9,543.64 9,277.18 6,065.95 5,767.72 6,083.01 5,767.72
reads 9 37,781.02 37,116.19 21,555.19 20,767.99 21,628.14 20,767.99
reads 10 48,114.93 47,287.75 27,450.08 26,459.36 27,535.43 26,459.36
protein 1 212.12 200.26 137.01 124.92 137.16 124.92
protein 2 424.63 401.34 274.11 250.36 274.41 250.36
protein 3 857.04 812.43 552.33 506.81 552.95 506.81
protein 4 1,768.69 1,678.39 1,139.24 1,047.09 1,140.44 1,047.09
protein 5 4,288.93 4,058.80 2,767.10 2,532.93 2,769.17 2,532.93
protein 6 10,081.31 9,575.53 6,490.38 5,977.08 6,495.06 5,977.08
protein 7 42,240.02 40,744.26 24,411.26 22,891.18 24,435.77 22,891.18
protein 8 81,798.41 78,869.12 47,289.48 44,308.95 47,335.89 44,308.95
protein 9 143,689.08 138,605.24 83,036.60 77,855.58 83,134.74 77,855.58
protein 10 267,478.92 258,740.86 154,213.88 145,323.62 154,449.44 145,323.62

Workspace (MiB, but SACA-K and gSACA-K):

Dataset SAIS (int) SACA-K (int) SAIS SACA-K gSAIS gSACA-K
pages 1 0.41 0.00 0.41 1,024 0.41 1,024
pages 2 2.55 0.00 2.55 1,024 2.55 1,024
pages 3 6.07 0.00 6.07 1,024 6.07 1,024
pages 4 9.17 0.00 9.17 1,024 9.17 1,024
pages 5 20.66 0.00 20.66 1,024 20.66 1,024
pages 6 71.08 0.00 71.08 1,024 71.08 1,024
pages 7 145.25 0.00 145.25 1,024 145.25 1,024
pages 8 367.00 0.00 367.00 1,024 367.00 1,024
pages 9 459.55 0.01 459.55 1,024 459.55 1,024
pages 10 700.13 0.01 700.13 1,024 700.13 1,024
revesion 1 0.02 0.00 0.02 1,024 0.02 1,024
revesion 2 0.05 0.00 0.05 1,024 0.05 1,024
revesion 3 0.16 0.00 0.16 1,024 0.16 1,024
revesion 4 0.37 0.00 0.37 1,024 0.37 1,024
revesion 5 0.76 0.00 0.76 1,024 0.76 1,024
revesion 6 2.78 0.01 2.78 1,024 2.78 1,024
revesion 7 5.92 0.01 5.92 1,024 5.92 1,024
revesion 8 10.39 0.02 10.39 1,024 10.39 1,024
revesion 9 24.93 0.04 24.93 1,024 24.93 1,024
revesion 10 72.40 0.08 72.40 1,024 72.40 1,024
influenza 1 0.25 0.00 0.24 1,024 0.25 1,024
influenza 2 0.43 0.01 0.43 1,024 0.43 1,024
influenza 3 0.80 0.01 0.78 1,024 0.80 1,024
influenza 4 1.46 0.02 1.43 1,024 1.46 1,024
influenza 5 2.78 0.05 2.72 1,024 2.78 1,024
influenza 6 6.23 0.09 6.11 1,024 6.23 1,024
influenza 7 13.04 0.19 12.80 1,024 13.04 1,024
influenza 8 25.96 0.38 25.46 1,024 25.96 1,024
influenza 9 52.00 0.75 51.13 1,024 52.00 1,024
influenza 10 106.78 1.50 104.89 1,024 106.78 1,024
wikipedia 1 5.96 0.03 5.96 1,024 5.96 1,024
wikipedia 2 11.05 0.06 11.04 1,024 11.05 1,024
wikipedia 3 21.63 0.12 21.62 1,024 21.63 1,024
wikipedia 4 39.41 0.23 39.38 1,024 39.41 1,024
wikipedia 5 70.46 0.47 70.40 1,024 70.46 1,024
wikipedia 6 167.51 0.93 167.37 1,024 167.51 1,024
wikipedia 7 353.22 1.86 352.92 1,024 353.22 1,024
wikipedia 8 1,047.08 7.45 1,045.63 1,024 1,047.08 1,024
wikipedia 9 1,903.58 14.89 1,903.40 1,024 1,903.58 1,024
wikipedia 10 3,764.11 29.78 3,763.71 1,024 3,764.11 1,024
reads 1 4.01 0.38 3.98 1,024 4.01 1,024
reads 2 7.58 0.76 7.48 1,024 7.58 1,024
reads 3 14.86 1.53 14.58 1,024 14.86 1,024
reads 4 28.04 3.05 27.30 1,024 28.04 1,024
reads 5 50.85 6.10 48.94 1,024 50.85 1,024
reads 6 91.63 12.21 86.82 1,024 91.63 1,024
reads 7 167.11 24.42 155.14 1,024 167.11 1,024
reads 8 315.29 48.83 298.24 1,024 315.29 1,024
reads 9 860.14 195.31 787.20 1,024 860.14 1,024
reads 10 1,076.07 248.89 990.72 1,024 1,076.07 1,024
protein 1 12.24 0.38 12.09 1,024 12.24 1,024
protein 2 24.06 0.76 23.75 1,024 24.06 1,024
protein 3 46.14 1.53 45.52 1,024 46.14 1,024
protein 4 93.36 3.05 92.16 1,024 93.36 1,024
protein 5 236.24 6.10 234.17 1,024 236.24 1,024
protein 6 517.99 12.21 513.30 1,024 517.99 1,024
protein 7 1,544.59 48.83 1,520.08 1,024 1,544.59 1,024
protein 8 3,026.95 97.66 2,980.53 1,024 3,026.95 1,024
protein 9 5,279.16 195.31 5,181.02 1,024 5,279.16 1,024
protein 10 9,125.83 387.77 8,890.27 1,024 9,125.83 1,024

##references

[1] G. Nong, S. Zhang, and W. H. Chan, “Two efficient algorithms for linear time suffix array construction,” IEEE Trans. Comput., vol. 60, no. 10, pp. 1471–1484, 2011

[2] Nong, G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets,” ACM Trans. Inform. Syst., vol. 31, no. 3, pp. 1–15, 2013

About

Inducing enhanced suffix arrays for string collections [DCC'16, TCS 2017]

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published