This project helps to explore your hardware and achieve the peak performance for your needs.
- OS: ubuntu 20.04
- Compiler: GCC 9.3.0
- optimization level: -O3
- Platform: x86_x64
- Hardware info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
Stepping: 10
CPU MHz: 2808.010
BogoMIPS: 5616.02
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 1.5 MiB
L3 cache: 9 MiB
Vulnerability Itlb multihit: KVM: Vulnerable
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xt
opology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd
ibrs ibpb stibp fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves md_clear flush_l1d arch_capabilities
-
Miroarch: Coffee Lake (AVX2)
- registers: 16 x 256bit
- cache line size: 64bytes(8 FP64)
-
Instruction Timing
Intrinsics Instruction Latency Throughput _mm256_load_pd vmovapd 7 0.5 _mm256_store_pd vmovapd 5 1 _mm256_fmadd_pd vfmadd213pd 4 0.5 _mm256_permute4x64_pd vpermpd 3 1 _mm256_permute2f128_pd vperm2f128 3 1 _mm256_shuffle_pd vshufpd 1 1 -
Coffee Lake microarch
Coffee Lake microarch has two issue ports each has a FMA
unit(see the previous diagram), so at most two floating point instructions(add
and mul
) could be done per-cycle, in addition the CPU supports avx2
extension with 256bit register which could hold 4 FP64(8 FP32) values, as a result 16 FP64 instructions could be executed per-cycle. The CPU could boost it's frequency from 2.8GHz to at most 4GHz, so the peak performance could be 64GFLOPS(FP64).
Also the following formula could be used:
N_CORES * FREQUENCY * FMA * UNITS * (SIZE_OF_VECTOR/64)
for my hardware in single thread
1 * 4 * 2 * 2 * (256/64) = 64
Method | GFLOPS(FP64) |
---|---|
Theory | 64 |
MKL | 56.971941 |
OpenBlas | 51.170771 |
naive-ijk | 1.960400 |
manual | 46.719804 |