For 10 x 10 matrices

Method | MatrixSize | Mean | Error | StdDev | Median | SpeedUp |
---|---|---|---|---|---|---|
SSEDLL | 10 | 0.6794 | 0.005906 | 0.005235 | 0.679 | 1 |
VectorSharp | 10 | 0.764 | 0.013935 | 0.012353 | 0.7623 | 1.124521637 |
AVX2DLL | 10 | 1.0042 | 0.020026 | 0.018732 | 1.0027 | 1.478068884 |
Multiply1dWithTranspose | 10 | 1.186 | 0.0225 | 0.0221 | | 1.745657933 |
Multiply1dWithTransposeAndUnrolled | 10 | 1.286 | 0.0273 | 0.0373 | | 1.892846629 |
Multiply1d | 10 | 1.331 | 0.0265 | 0.0498 | | 1.959081543 |
MultiplyJaggedSharp | 10 | 1.531 | 0.0306 | 0.0618 | | 2.253458934 |
Multiply2d | 10 | 2.477 | 0.0494 | 0.0607 | | 3.645863998 |
Multiply1dDLLFirstFor | 10 | 3.06 | 0.0598 | 0.0948 | 3.052 | 4.503974095 |
Multiply1dWithTransposeAndUnrolledAndParallelDLL | 10 | 3.337 | 0.0667 | 0.1838 | | 4.911686782 |
AVX2DLLParallel | 10 | 3.6387 | 0.095875 | 0.267261 | 3.5417 | 5.355755078 |
OpenMPParallel | 10 | 3.6543 | 0.090168 | 0.258708 | 3.5584 | 5.378716515 |
OpenMPParallel | 10 | 3.68 | 0.1112 | 0.3082 | 3.655 | 5.416544009 |
SSEDLLParallel | 10 | 3.6822 | 0.077172 | 0.216399 | 3.6528 | 5.419782161 |
Multiply1dSharp | 10 | 5.574 | 0.0377 | 0.0352 | 5.554 | 8.20429791 |
Multiply1dWithTransposeAndUnrolledAndParallelSharp | 10 | 5.82 | 0.1144 | 0.1272 | 5.774 | 8.566382102 |
VectorSharpParallel | 10 | 5.9281 | 0.1183 | 0.26459 | 5.9272 | 8.725493082 |
Multiply1dWithTranspose | 10 | 7.84 | 0.39191 | 0.45132 | | 11.53959376 |
Multiply1dWithTransposeAndUnrolled | 10 | 9.208 | 0.08588 | 0.07613 | | 13.55313512 |
Multiply1d | 10 | 9.278 | 0.18502 | 0.39428 | | 13.65616721 |
Multiply1dDLLSecondFor | 10 | 34.132 | 0.6811 | 1.4366 | 33.929 | 50.23844569 |
CUDASecondMultiplyWithoutCopy | 10 | 70.879 | 3.9748 | 11.7199 | 63.806 | 104.3258758 |
CUDAFirstMultiplyWithoutCopy | 10 | 72.128 | 3.2209 | 9.4968 | 70.66 | 106.1642626 |
CUDASecondMultiply | 10 | 307.212 | 9.2096 | 25.9758 | 295.513 | 452.1813365 |
CUDAFirstMultiply | 10 | 329.753 | 12.5671 | 36.857 | 322.609 | 485.3591404 |
Multiply1dDLLThirdFor | 10 | 341.777 | 6.8053 | 10.1859 | 339.846 | 503.0571092 |

For 100 x 100 matrices

Method | MatrixSize | Mean | Error | StdDev | Median | SpeedUp |
---|---|---|---|---|---|---|
OpenMPParallel | 100 | 90.71 | 1.7972 | 3.5475 | 90.368 | 1 |
CUDASecondMultiplyWithoutCopy | 100 | 91.069 | 4.0229 | 11.8617 | 84.786 | 1.003957667 |
CUDAFirstMultiplyWithoutCopy | 100 | 101.229 | 4.7851 | 14.109 | 93.281 | 1.115962959 |
OpenMPParallel | 100 | 123.5062 | 1.394218 | 1.304152 | 123.7963 | 1.361549994 |
AVX2DLLParallel | 100 | 127.5145 | 1.979355 | 1.85149 | 127.4217 | 1.405738066 |
SSEDLLParallel | 100 | 151.2455 | 2.689533 | 2.515791 | 150.4583 | 1.667352001 |
Multiply1dWithTransposeAndUnrolledAndParallelDLL | 100 | 161.152 | 3.1758 | 5.3928 | | 1.776562672 |
VectorSharpParallel | 100 | 167.6437 | 3.27149 | 4.99591 | 166.6681 | 1.848128101 |
Multiply1dDLLFirstFor | 100 | 213.945 | 4.9133 | 13.6962 | 209.672 | 2.358560247 |
SSEDLL | 100 | 283.0659 | 5.559858 | 7.03144 | 279.1983 | 3.120558924 |
VectorSharp | 100 | 291.4436 | 3.41663 | 3.028753 | 291.4584 | 3.212915886 |
AVX2DLL | 100 | 308.838 | 6.140071 | 9.559351 | 307.6695 | 3.404674237 |
CUDASecondMultiply | 100 | 380.426 | 7.5718 | 19.9472 | 372.659 | 4.193870577 |
CUDAFirstMultiply | 100 | 418.775 | 16.2783 | 47.7414 | 405.02 | 4.616635432 |
Multiply1dWithTransposeAndUnrolledAndParallelSharp | 100 | 452.535 | 8.9537 | 19.2738 | 447.901 | 4.988810495 |
Multiply1dSharp | 100 | 466.95 | 9.1219 | 8.5327 | 466.327 | 5.147723514 |
Multiply1dDLLSecondFor | 100 | 496.614 | 17.0064 | 47.6879 | 481.255 | 5.474743689 |
Multiply1dWithTransposeAndUnrolled | 100 | 1,076.45 | 12.0103 | 11.2345 | | 11.86696064 |
MultiplyJaggedSharp | 100 | 1,357.60 | 23.2452 | 21.7436 | | 14.96639841 |
Multiply1dWithTranspose | 100 | 1,383.89 | 16.8266 | 14.051 | | 15.25624518 |
Multiply1d | 100 | 1,448.66 | 28.6429 | 53.0914 | | 15.97024584 |
Multiply2d | 100 | 2,338.70 | 24.3722 | 21.6053 | | 25.78218499 |
Multiply1dWithTransposeAndUnrolled | 100 | 5052.179 | 96.05974 | 94.34351 | | 55.69594312 |
Multiply1dWithTranspose | 100 | 10679.086 | 238.10122 | 233.84723 | | 117.7277698 |
Multiply1d | 100 | 11062.934 | 257.56857 | 240.9298 | | 121.959365 |
Multiply1dDLLThirdFor | 100 | 34,115.24 | 674.722 | 1,466.79 | 34,069.34 | 376.091313 |

For 250 x 250 matrices

Method | MatrixSize | Mean | Error | StdDev | Median | SpeedUp |
---|---|---|---|---|---|---|
CUDASecondMultiplyWithoutCopy | 250 | 278.101 | 5.5423 | 8.1238 | 276.07 | 1 |
CUDAFirstMultiplyWithoutCopy | 250 | 466.271 | 9.3062 | 19.2189 | 464.37 | 1.67662468 |
OpenMPParallel | 250 | 977.838 | 36.1275 | 105.9556 | 940.499 | 3.516125436 |
AVX2DLLParallel | 250 | 1277.137 | 23.474968 | 21.9585 | 1275.5604 | 4.592349542 |
OpenMPParallel | 250 | 1319.3033 | 14.792703 | 13.837104 | 1323.224 | 4.743971794 |
VectorSharpParallel | 250 | 1451.0278 | 28.88237 | 33.26098 | 1446.1795 | 5.217628847 |
CUDASecondMultiply | 250 | 1,494.41 | 17.6265 | 16.4878 | 1,494.41 | 5.373608869 |
CUDAFirstMultiply | 250 | 1,587.01 | 32.2718 | 37.1642 | 1,575.51 | 5.706606593 |
SSEDLLParallel | 250 | 1886.6368 | 35.732465 | 35.094058 | 1888.9606 | 6.783998619 |
Multiply1dWithTransposeAndUnrolledAndParallelDLL | 250 | 2,579.18 | 125.206 | 369.1726 | | 9.274249284 |
Multiply1dDLLFirstFor | 250 | 3,608.78 | 119.0656 | 341.6215 | 3,516.38 | 12.97652292 |
AVX2DLL | 250 | 3709.3901 | 74.124676 | 76.120585 | 3702.7734 | 13.33828393 |
VectorSharp | 250 | 4023.0501 | 79.92504 | 186.82219 | 4019.5895 | 14.46614755 |
Multiply1dDLLSecondFor | 250 | 4,113.06 | 81.2145 | 160.3095 | 4,085.47 | 14.78980299 |
SSEDLL | 250 | 4887.964 | 96.258633 | 166.041412 | 4848.0801 | 17.57621871 |
Multiply1dWithTransposeAndUnrolledAndParallelSharp | 250 | 5,679.70 | 106.9903 | 100.0788 | 5,644.94 | 20.42315921 |
Multiply1dSharp | 250 | 6,107.17 | 121.2515 | 199.2196 | 6,088.18 | 21.96025545 |
Multiply1dWithTransposeAndUnrolled | 250 | 16,644.40 | 283.8997 | 251.6696 | | 59.85020191 |
Multiply1dWithTranspose | 250 | 20,767.40 | 411.7288 | 364.9868 | | 74.6757473 |
Multiply1d | 250 | 22,007.59 | 388.9259 | 344.7727 | | 79.13524223 |
MultiplyJaggedSharp | 250 | 23,476.33 | 511.5211 | 717.08 | | 84.41654291 |
Multiply2d | 250 | 37,923.64 | 556.8028 | 520.8337 | | 136.3664388 |
Multiply1dWithTransposeAndUnrolled | 250 | 81530.829 | 1627.87636 | 3538.86924 | | 293.169852 |
Multiply1d | 250 | 211245.979 | 3626.30218 | 3392.04531 | | 759.6016519 |
Multiply1dWithTranspose | 250 | 211646.275 | 4120.88527 | 4905.62077 | | 761.0410426 |
Multiply1dDLLThirdFor | 250 | 230,850.58 | 7,159.01 | 20,308.92 | 227,076.13 | 830.0961701 |

For 500 x 500 matrices

Method | MatrixSize | Mean | Error | StdDev | Median | SpeedUp |
---|---|---|---|---|---|---|
CUDASecondMultiplyWithoutCopy | 500 | 1,617.57 | 5.3356 | 4.4555 | 1,616.60 | 1 |
CUDAFirstMultiplyWithoutCopy | 500 | 2,982.11 | 13.5538 | 12.6783 | 2,977.83 | 1.843570584 |
CUDASecondMultiply | 500 | 4,885.78 | 97.2554 | 213.478 | 4,794.18 | 3.020438645 |
CUDAFirstMultiply | 500 | 6,329.02 | 41.373 | 34.5483 | 6,325.76 | 3.912665456 |
OpenMPParallel | 500 | 6,909.16 | 138.3472 | 405.7483 | 6,831.61 | 4.271312021 |
AVX2DLLParallel | 500 | 7626.2035 | 193.681383 | 568.033968 | 7579.8578 | 4.714596188 |
OpenMPParallel | 500 | 7767.8566 | 207.585201 | 608.81146 | 7787.6031 | 4.802167568 |
VectorSharpParallel | 500 | 10103.3084 | 170.65822 | 159.63381 | 10064.9031 | 6.245967508 |
SSEDLLParallel | 500 | 10665.1787 | 351.561003 | 1031.067565 | 10706.7508 | 6.593321414 |
Multiply1dWithTransposeAndUnrolledAndParallelDLL | 500 | 17,061.96 | 371.5151 | 1,089.59 | | 10.54787636 |
AVX2DLL | 500 | 28751.9743 | 557.92099 | 572.94378 | 28723.85 | 17.77476151 |
Multiply1dDLLFirstFor | 500 | 31,504.49 | 673.7689 | 1,878.20 | 31,006.38 | 19.4763921 |
Multiply1dDLLSecondFor | 500 | 31,892.80 | 642.258 | 1,725.38 | 31,841.49 | 19.71645298 |
VectorSharp | 500 | 35725.6557 | 1153.30357 | 3252.91318 | 34896.4154 | 22.08596193 |
SSEDLL | 500 | 42266.4917 | 804.890291 | 790.50989 | 42298.7542 | 26.12957295 |
Multiply1dWithTransposeAndUnrolledAndParallelSharp | 500 | 43,081.55 | 416.9136 | 389.9812 | 43,112.64 | 26.63345024 |
Multiply1dSharp | 500 | 52,213.44 | 956.1917 | 894.4223 | 51,962.30 | 32.27887829 |
Multiply1dWithTransposeAndUnrolled | 500 | 133,871.48 | 1,386.78 | 1,297.19 | | 82.76070261 |
Multiply1dWithTranspose | 500 | 168,821.07 | 3,231.48 | 3,173.75 | | 104.3668947 |
Multiply1d | 500 | 186,925.08 | 3,728.21 | 8,261.46 | | 115.5589763 |
MultiplyJaggedSharp | 500 | 214,642.38 | 4,247.82 | 4,891.79 | | 132.6940917 |
Multiply2d | 500 | 382,662.14 | 7,518.85 | 8,658.72 | | 236.5656079 |
Multiply1dWithTransposeAndUnrolled | 500 | 621591.97 | 11740.53246 | 10982.10135 | | 384.274447 |
Multiply1dDLLThirdFor | 500 | 923,522.94 | 18,484.37 | 40,573.64 | 912,753.65 | 570.9312285 |
Multiply1dWithTranspose | 500 | 1776699.222 | 27481.4902 | 25706.20298 | | 1098.373441 |
Multiply1d | 500 | 1874828.175 | 36478.4177 | 43424.96138 | | 1159.037753 |

For 1000 x 1000 matrices

Method | MatrixSize | Mean | Error | StdDev | Median | SpeedUp |
---|---|---|---|---|---|---|
CUDASecondMultiplyWithoutCopy | 1000 | 11,579.21 | 19.2134 | 17.9722 | 11,570.90 | 1 |
CUDASecondMultiply | 1000 | 18,479.25 | 173.4195 | 162.2167 | 18,427.10 | 1.595898564 |
CUDAFirstMultiplyWithoutCopy | 1000 | 25,541.90 | 30.1455 | 28.1981 | 25,530.24 | 2.205840742 |
CUDAFirstMultiply | 1000 | 32,335.12 | 159.2015 | 141.128 | 32,367.82 | 2.792514241 |
OpenMPParallel | 1000 | 42849.9545 | 599.573745 | 468.107742 | 42699.2955 | 3.700592674 |
AVX2DLLParallel | 1000 | 47126.6893 | 2643.418823 | 7710.976781 | 43572.2417 | 4.069938538 |
OpenMPParallel | 1000 | 50,522.67 | 1,702.64 | 5,020.27 | 50,222.83 | 4.363221286 |
SSEDLLParallel | 1000 | 65991.0517 | 1309.21226 | 1607.83023 | 65553.3 | 5.699095958 |
VectorSharpParallel | 1000 | 68496.049 | 1358.07578 | 1765.88212 | 68741.7562 | 5.915431652 |
Multiply1dWithTransposeAndUnrolledAndParallelDLL | 1000 | 115,897.95 | 2,299.41 | 4,949.72 | | 10.00913784 |
AVX2DLL | 1000 | 253923.7516 | 4909.293764 | 7497.010086 | 253550.85 | 21.92927358 |
Multiply1dWithTransposeAndUnrolledAndParallelSharp | 1000 | 327,605.85 | 4,459.02 | 4,170.97 | 327,616.40 | 28.29258074 |
VectorSharp | 1000 | 362379.0806 | 20471.89341 | 59717.47399 | 341778.4 | 31.29565449 |
SSEDLL | 1000 | 389636.52 | 7713.942335 | 8883.387464 | 387519.9 | 33.64965187 |
Multiply1dDLLFirstFor | 1000 | 775,414.91 | 21,458.89 | 63,272.02 | 770,391.65 | 66.9661091 |
Multiply1dDLLSecondFor | 1000 | 1,020,197.32 | 26,687.63 | 77,425.67 | 1,006,458.80 | 88.10592118 |
Multiply1dWithTransposeAndUnrolled | 1000 | 1,072,467.98 | 18,017.41 | 15,971.97 | | 92.62010176 |
Multiply1dWithTranspose | 1000 | 1,368,079.88 | 43,202.23 | 44,365.51 | | 118.1496323 |
Multiply1d | 1000 | 1,771,000.48 | 23,732.90 | 22,199.77 | | 152.9465195 |
Multiply1dSharp | 1000 | 2,301,483.30 | 45,097.46 | 57,033.84 | 2,308,407.40 | 198.7598896 |
Multiply2d | 1000 | 3,966,262.18 | 79,468.64 | 234,315.07 | | 342.5329369 |
Multiply1dDLLThirdFor | 1000 | 4,218,862.42 | 79,396.25 | 70,382.71 | 4,236,403.45 | 364.3479101 |
Multiply1dWithTransposeAndUnrolled | 1000 | 5368611.738 | 106780.2314 | 195253.691 | | 463.6421555 |
MultiplyJaggedSharp | 1000 | 10,104,715.06 | 175,000.92 | 163,695.97 | | 872.659842 |
Multiply1dWithTranspose | 1000 | 14726615.4 | 149368.5863 | 139719.4683 | | 1271.814771 |
Multiply1d | 1000 | 16993129.82 | 439113.3627 | 450937.0939 | | 1467.554691 |
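
The SpeedUp column in each table appears to be the row's Mean divided by the smallest Mean for the same matrix size, so the fastest method reads 1 and larger values mean slower runs. The following is a minimal Python sketch of that arithmetic, using three Mean values copied from the 1000 x 1000 table. The function name `relative_to_fastest` is hypothetical and is not part of the benchmark code.

```python
def relative_to_fastest(means):
    """Divide every mean by the smallest mean in the group.

    Assumption: this mirrors how the SpeedUp column was derived;
    the fastest entry maps to 1.0 and slower entries to values > 1.
    """
    fastest = min(means.values())
    return {name: mean / fastest for name, mean in means.items()}


# Mean values copied from three rows of the 1000 x 1000 table.
means_1000 = {
    "CUDASecondMultiplyWithoutCopy": 11579.21,
    "CUDASecondMultiply": 18479.25,
    "OpenMPParallel": 42849.9545,
}

ratios = relative_to_fastest(means_1000)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

The ratios computed this way agree with the tabulated SpeedUp values to several decimal places; small differences in the last digits are expected if the original column was computed from unrounded means.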