Merge pull request #743 from AaltoSciComp/yu/update-deeplearning · AaltoSciComp/scicomp-docs@dc55af0

Could not lex literal_block as "bash". Highlighting skipped. The literal block reads:

    Calculating pi using 1000000000 stochastic trials
    ==PROF== Connected to process 3944692 (/home/username/hpc-examples/slurm/pi-gpu)
    ==PROF== Profiling "throw_dart": 0%....50%....100% - 19 passes
    Throws: 785390400/1000000000 Pi: 3.141561508
    ==PROF== Disconnected from process 3944692
    [3944692] [email protected]
    throw_dart(curandStateXORWOW *, int *, unsigned long *) (512, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 7.0
      Section: GPU Speed Of Light Throughput
      ----------------------- ------------- ------------
      Metric Name             Metric Unit   Metric Value
      ----------------------- ------------- ------------
      DRAM Frequency          cycle/usecond 877.96
      SM Frequency            cycle/nsecond 1.24
      Elapsed Cycles          cycle         10544719
      Memory Throughput       %             49.01
      DRAM Throughput         %             0.04
      Duration                msecond       8.52
      L1/TEX Cache Throughput %             73.75
      L2 Cache Throughput     %             49.01
      SM Active Cycles        cycle         8850096.11
      Compute (SM) Throughput %             37.09
      ----------------------- ------------- ------------

      OPT   This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full
            waves across all SMs. Look at Launch Statistics for more details.

      Section: Launch Statistics
      -------------------------------- --------------- ---------------
      Metric Name                      Metric Unit     Metric Value
      -------------------------------- --------------- ---------------
      Block Size                                       128
      Function Cache Configuration                     CachePreferNone
      Grid Size                                        512
      Registers Per Thread             register/thread 21
      Shared Memory Configuration Size byte            0
      Driver Shared Memory Per Block   byte/block      0
      Dynamic Shared Memory Per Block  byte/block      0
      Static Shared Memory Per Block   byte/block      0
      Threads                          thread          65536
      Waves Per SM                                     0.40
      -------------------------------- --------------- ---------------

      Section: Occupancy
      ------------------------------- ----------- ------------
      Metric Name                     Metric Unit Metric Value
      ------------------------------- ----------- ------------
      Block Limit SM                  block       32
      Block Limit Registers           block       21
      Block Limit Shared Mem          block       32
      Block Limit Warps               block       16
      Theoretical Active Warps per SM warp        64
      Theoretical Occupancy           %           100
      Achieved Occupancy              %           39.97
      Achieved Active Warps Per SM    warp        25.58
      ------------------------------- ----------- ------------

      OPT   Estimated Speedup: 60.03%
            This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
            theoretical (100.0%) and measured achieved occupancy (40.0%) can be the result of warp scheduling overheads
            or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
            as well as across blocks of the same kernel. See the CUDA Best Practices Guide
            (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
            optimizing occupancy.

check-warnings (3.12): triton/usage/quotas.rst#L11

more than one target found for 'any' cross-reference 'smallfiles': could be :doc:`the page on small files` or :std:ref:`the page on small files`

Merge pull request #743 from AaltoSciComp/yu/update-deeplearning #2728

checkwarnings.yaml