nightduck/GPUMemManSurvey

 
 

GPUMemManSurvey

Evaluating different memory managers for dynamic GPU memory allocation.

Requirements

The framework was tested on Windows 10, Arch Linux <5.9.9> as well as Manjaro <5.4>

  • CUDA Toolkit

    • Tested on 10.1, 10.2, 11.0 and 11.1
      • Windows Download
      • Arch Linux (pacman -S cuda)
  • C++ Compiler

    • Tested on
      • gcc 9.0 and gcc 10.2
        • Arch Linux (pacman -S gcc)
      • VS 2019
  • boost (required for ScatterAlloc)

    • Tested with boost 1.66 and 1.74
      • Windows Download
        • Set the installed location in BaseCMake.cmake
      • Arch Linux (pacman -S boost)
  • CMake

    • Version >= 3.16, tested with 3.18
      • Windows Download
      • Arch Linux (pacman -S cmake)
  • Python

    • Tested with Python 3.8 and Python 3.9
      • Windows Download or download via Windows Store
      • Arch Linux (pacman -S python)
    • Requires packages
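      • argparse (python -m pip install argparse); argparse has been part of the Python standard library since 3.2, so a separate install is usually not needed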

Setup Instructions

  • Make sure all requirements are installed and configured correctly
    • On Windows also set the correct boost path in BaseCMake.cmake
  • Setup
    • Option A: Setup from Archive
      • Extract archive
      • In top-level directory, call
        • git submodule init
        • git submodule update
    • Option B: Setup from GitHub
      • git clone --recursive -b AEsubmission https://github.com/GPUPeople/GPUMemManSurvey.git
  • python init.py
  • Install
    • On Windows use the Developer PowerShell for VS 20XX (msbuild is needed) to call the scripts
    • Option A:
      • To build everything, call python setupAll.py --cc XX, where XX is the compute capability (CC) of your GPU (tested with 61, 70 and 75)
    • Option B:
      • To build each testcase separately, use the setup.py in each test's folder
        • python setup.py --cc XX, with XX set to the correct CC (tested with 61, 70 and 75)
  • To clean/reset the build folders, simply call python cleanAll.py
    • Once again, there is a separate clean.py in every test subfolder

Testcase Instructions

To run a representative testsuite, simply call

  • python testAll.py -mem_size 8 -device 0 -runtest -genres
    • -mem_size is the memory size in GB
    • -device is the ID of the device to use (it has to match the CC passed in the build stage)

These runtimes were measured for the limited testcase as set up in testAll.py on a TITAN V and an Intel Core i7-8700X on Windows 10 and Manjaro respectively.

| Task | Time (min:sec) - Linux | Time (min:sec) - Windows |
|---|---|---|
| Overall | 28 min 28 sec | 1 h 3 min 47 sec |
| Build | 9 min 45 sec | 28 min 15 sec |
| Test All | 18 min 43 sec | 35 min 32 sec |
| - | - | - |
| Allocation | 1 min 48 sec | 2 min 39 sec |
| Mixed Allocation | 0 min 35 sec | 2 min 46 sec |
| Scaling | 2 min 52 sec | 4 min 03 sec |
| Fragmentation | 1 min 47 sec | 2 min 36 sec |
| Out-of-Memory | 7 min 15 sec | 8 min 21 sec |
| Graph Init | 0 min 11 sec | 1 min 44 sec |
| Graph Update | 0 min 11 sec | 1 min 44 sec |
| Graph Update Range | 0 min 11 sec | 1 min 44 sec |
| Register Footprint | 0 min 03 sec | 0 min 05 sec |
| Initialization | 0 min 05 sec | 0 min 12 sec |
| Synthetic Workload | 1 min 58 sec | 4 min 50 sec |
| Synthetic Workload Write | 1 min 57 sec | 4 min 48 sec |

The framework does not perform many sanity checks. If something is not working as expected, for example because a parameter was configured incorrectly, please read the documentation first.

Folder Structure

  • frameworks -> includes code for all frameworks
  • externals -> not all CUDA versions have CUB yet
  • include / src / scripts -> framework code
  • tests -> all test implementations
    • alloc_test -> all allocation tests
      • test_allocation.py
      • test_mixed_allocation.py
      • test_scaling.py
    • frag_test -> all memory/fragmentation tests
      • test_fragmentation.py
      • test_oom.py
    • graph_test -> all graph tests
      • test_graph_init.py
      • test_graph_update.py
    • synth_test -> all synthetic tests
      • test_registers.py
      • test_synth_init.py
      • test_synth_workload.py

Frameworks

| Framework | Status | Paper | Code |
|---|---|---|---|
| CUDA Device Allocator | ✔️ | - | - |
| XMalloc (2010) | ✔️ | Webpage | - |
| ScatterAlloc (2012) | ✔️ | Webpage | GitHub - Repository |
| FDGMalloc (2013) | | Webpage | Webpage |
| Register Efficient (2014) | ✔️ | Webpage | Webpage |
| Halloc (2014) | ✔️ | Presentation | GitHub - Repository |
| DynaSOAr (2019) | | Webpage | GitHub - Repository |
| Bulk-Semaphore (2019) | | Webpage | - |
| Ouroboros (2020) | ✔️ | Paper | GitHub - Repository |

Testcases

Each testcase is controlled and executed via Python scripts. All scripts share a few conventions: pass -runtest to run the testcase, pass -genres to gather all results into one file, and pass -h to print a help screen with all parameters. All testcases also take a -device parameter, which controls which device executes the GPU code (e.g. 0), and an -allocsize parameter, which specifies how much memory on this device is reserved for the memory manager (size in GB).
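
The shared flags described above can be sketched with argparse. This is only an illustration of the single-dash flag style, not the parser used by the actual scripts; the function name build_parser, the help texts and the default values are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags described in this README;
    # the real scripts may define them differently.
    parser = argparse.ArgumentParser(description="Sketch of a testcase CLI")
    parser.add_argument("-t", type=str, default="",
                        help="frameworks to test, e.g. o+s+h+c+r+x")
    parser.add_argument("-runtest", action="store_true",
                        help="execute the testcase")
    parser.add_argument("-genres", action="store_true",
                        help="gather existing results into one file")
    parser.add_argument("-allocsize", type=int, default=8,
                        help="manageable memory in GB (default is assumed)")
    parser.add_argument("-device", type=int, default=0,
                        help="GPU device ID (default is assumed)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-t", "o+s", "-runtest", "-allocsize", "8", "-device", "0"]
)
print(args.t, args.runtest, args.genres, args.allocsize, args.device)
```

Because -runtest and -genres are plain boolean switches in this sketch, passing only -genres would skip execution and just collect results, which matches the behaviour described above.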

Data to Plot - Map

This table shows which test file can be used to generate which plot used in the paper.

| Figure/Section | Script | Command |
|---|---|---|
| Sec. 4.1 | test_registers.py | python test_registers.py -t o+s+h+c+r+x -runtest -genres -allocsize 8 -device 0 |
| Sec. 4.1 | test_synth_init.py | python test_synth_init.py -t o+s+h+c+r+x -runtest -genres -allocsize 8 -device 0 |
| Fig. 9.a | test_allocation.py | python test_allocation.py -t o+s+h+c+r+x -num 100000 -range 4-8192 -iter 100 -runtest -genres -timeout 120 -allocsize 8 -device 0 |
| Fig. 9.b | test_allocation.py | python test_allocation.py -t o+s+h+c+r+x -num 100000 -range 4-8192 -iter 100 -runtest -genres -timeout 120 -allocsize 8 -device 0 |
| Fig. 9.c | test_allocation.py | python test_allocation.py -t o+s+h+c+r+x -num 10000 -range 4-8192 -iter 100 -runtest -genres -warp -timeout 120 -allocsize 8 -device 0 |
| Fig. 9.d | test_mixed_allocation.py | python test_mixed_allocation.py -t o+s+h+c+r+x -num 10000 -range 4-8192 -iter 100 -runtest -genres -timeout 120 -allocsize 8 -device 0 |
| Fig. 10.x | test_scaling.py | python test_scaling.py -t o+s+h+c+r+x -byterange 4-8192 -threadrange 0-20 -iter 100 -runtest -genres -timeout 300 -allocsize 8 -device 0 |
| Fig. 11.a | test_fragmentation.py | python test_fragmentation.py -t o+s+h+c+r+x -num 100000 -range 4-8192 -iter 100 -runtest -genres -timeout 60 -allocsize 8 -device 0 |
| Fig. 11.b | test_oom.py | python test_oom.py -t o+s+h+c+r+x -num 100000 -range 4-8192 -runtest -genres -timeout 3600 -allocsize 2 -device 0 |
| Fig. 11.c | test_synth_workload.py | python test_synth_workload.py -t o+s+h+c+r+x -threadrange 0-20 -range 4-64 -iter 100 -runtest -genres -timeout 300 -allocsize 8 -device 0 |
| Fig. 11.d | test_synth_workload.py | python test_synth_workload.py -t o+s+h+c+r+x -threadrange 0-20 -range 4-4096 -iter 100 -runtest -genres -timeout 300 -allocsize 8 -device 0 |
| Fig. 11.e | test_synth_workload.py | python test_synth_workload.py -t o+s+h+c+r+x -threadrange 0-20 -range 4-64 -iter 100 -runtest -genres -timeout 300 -allocsize 8 -device 0 -testwrite |
| Fig. 11.f | test_graph_init.py | python test_graph_init.py -t o+s+h+c+r+x -configfile config_init.json -runtest -genres -timeout 600 -allocsize 8 -device 0 |
| Fig. 11.g | test_graph_update.py | python test_graph_update.py -t o+s+h+c+r+x -configfile config_update_range.json -runtest -genres -timeout 600 -allocsize 8 -device 0 |

Allocation Testcases

Single Threaded / Single Warp Allocation Performance

To test single threaded or single warp performance, navigate to tests/alloc_tests and call the script test_allocation.py

  • python test_allocation.py -t o+s+h+c+r+x -num 10000 -range 4-64 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0
    • This will start 10000 threads, each of them will start by allocating 4 Bytes and then increase linearly up to 64 Bytes

This will generate one csv file for each approach with the mean, min, max and median performance over the number of iterations. To generate one file with the results of all approaches that were already executed, pass the option -genres instead of or in addition to -runtest.
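
As a rough illustration of the four statistics written to each csv file, the following sketch summarizes a list of per-iteration timings. The function name summarize and the sample values are made up for this example; the actual scripts may compute and store the values differently.

```python
import statistics


def summarize(timings_ms):
    """Return mean, min, max and median of per-iteration timings."""
    return {
        "mean": statistics.mean(timings_ms),
        "min": min(timings_ms),
        "max": max(timings_ms),
        "median": statistics.median(timings_ms),
    }


# Four hypothetical iteration timings in milliseconds.
summary = summarize([1.0, 2.0, 4.0, 9.0])
print(summary)
```

With an even number of iterations, statistics.median averages the two middle values, so mean and median can differ noticeably when a single iteration is an outlier.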

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -num | 10000 | How many threads/warps to start, e.g. 10000 |
| -range | 4-64 | Which allocation range to test, e.g. 4-64 Bytes |
| -iter | 50 | How often to run the test and average over runs, e.g. 50 |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -warp | | Pass this flag to start 1 warp instead of 1 thread per allocation |
| -timeout | 120 | Timeout in seconds, each individual testcase run will be canceled after this timeout, default is 600 |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |
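
The -t value is a list of single letters joined by +. A minimal sketch of splitting such a string is shown below. Only the letters c (cuda) and s (scatteralloc) are confirmed by the table; the other entries of the mapping are assumptions derived from the names in the Frameworks section, and parse_frameworks is a hypothetical helper.

```python
# "c" and "s" come from the option table; all other letters are assumed
# from the framework names and may not match the real scripts.
FRAMEWORKS = {
    "c": "cuda",
    "s": "scatteralloc",
    "o": "ouroboros",
    "h": "halloc",
    "r": "reg-eff",
    "x": "xmalloc",
    "f": "fdgmalloc",
}


def parse_frameworks(spec):
    """Split a string such as 'c+s' into a list of framework names."""
    names = []
    for letter in spec.split("+"):
        if letter not in FRAMEWORKS:
            raise ValueError(f"unknown framework letter: {letter!r}")
        names.append(FRAMEWORKS[letter])
    return names


print(parse_frameworks("c+s"))
```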

Mixed Range Allocation Performance

To test allocation performance when threads are allocating with different sizes (constrained by a maximum/minimum allocation size), navigate to tests/alloc_tests and call the script test_mixed_allocation.py

  • python test_mixed_allocation.py -t o+s+h+c+r+x -num 10000 -range 4-64 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0
    • This will start 10000 threads, each of them will allocate in the range of 4-64 Bytes

This will generate one csv file for each approach with the mean, min, max and median performance over the number of iterations. To generate one file with the results of all approaches that were already executed, pass the option -genres instead of or in addition to -runtest.

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -num | 10000 | How many threads/warps to start, e.g. 10000 |
| -range | 4-64 | Which allocation range to test, e.g. 4-64 Bytes |
| -iter | 50 | How often to run the test and average over runs, e.g. 50 |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -warp | | Pass this flag to start 1 warp instead of 1 thread per allocation |
| -timeout | 120 | Timeout in seconds, each individual testcase run will be canceled after this timeout, default is 600 |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |

Performance Scaling

To test performance scaling over a changing number of threads, navigate to tests/alloc_tests and call the script test_scaling.py

  • python test_scaling.py -t o+s+h+c+r+x -byterange 4-64 -threadrange 0-10 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0
    • This will start with 2⁰ threads up to 2¹⁰ threads, testing all powers of 2 in-between, and for each number of threads test the range 4-64 Bytes

This will generate one csv file for each approach and for each number of threads with the mean, min, max and median performance over the number of iterations. To generate one file with the results of all approaches that were already executed, pass the option -genres instead of or in addition to -runtest. The test can also be started with one warp per allocation by passing -warp.
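
The -threadrange value is a pair of exponents, not thread counts. The following sketch expands such a value into the list of thread counts that would be tested; expand_threadrange is a hypothetical helper written for this README, not a function from the scripts.

```python
def expand_threadrange(spec):
    """Expand 'low-high' exponents into the thread counts 2^low .. 2^high."""
    low, high = (int(part) for part in spec.split("-"))
    return [2 ** exponent for exponent in range(low, high + 1)]


counts = expand_threadrange("0-10")
print(counts)
```

For -threadrange 0-20, as used for the plots in the paper, the same expansion yields 21 configurations, the largest with 2²⁰ threads.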

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -threadrange | 0-10 | The range of threads to test, given as a power of 2, e.g. 0-10 would test 2⁰, 2¹, ..., 2¹⁰ threads for the given -byterange |
| -byterange | 4-64 | Which allocation range to test, e.g. 4-64 Bytes |
| -iter | 50 | How often to run the test and average over runs, e.g. 50 |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -warp | | Pass this flag to start 1 warp instead of 1 thread per allocation |
| -timeout | 120 | Timeout in seconds, each individual testcase run will be canceled after this timeout, default is 600 |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |

Fragmentation Testcases

Memory Fragmentation Testcase

This testcase tests the fragmentation of the returned addresses of a given allocation by reporting the maximum address range returned by each allocating thread. It also tracks the static maximum over a number of iterations. It continues to allocate and free a number of allocations for the number of -iter and returns those ranges.

  • python test_fragmentation.py -t o+s+h+c+r+x -num 10000 -range 4-64 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0
    • This will start 10000 threads, each of them will start by allocating 4 Bytes and then increase linearly up to 64 Bytes, reporting the current range and static maximum range

This will generate one csv file for each approach with min address range, max address range, min address range (static) and max address range (static). To generate one file with the results of all approaches that were already executed, pass the option -genres instead of or in addition to -runtest.
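
One plausible reading of "current range" and "static maximum range" is sketched below: the current range is the distance between the lowest and highest address returned in one iteration, and the static range is the same distance over all addresses seen so far. This interpretation, the helper name track_ranges and the sample addresses are assumptions; the testcase itself may define the ranges differently (for example by including the allocation size).

```python
def track_ranges(iterations):
    """For each iteration, return (current range, static range so far)."""
    static_min = None
    static_max = None
    results = []
    for addresses in iterations:
        lowest, highest = min(addresses), max(addresses)
        # The static bounds only ever widen across iterations.
        static_min = lowest if static_min is None else min(static_min, lowest)
        static_max = highest if static_max is None else max(static_max, highest)
        results.append((highest - lowest, static_max - static_min))
    return results


# Two hypothetical iterations with made-up byte addresses.
ranges = track_ranges([[100, 164, 228], [96, 160]])
print(ranges)
```

Under this reading, a memory manager that reuses freed memory keeps the static range close to the current range, while one that keeps handing out fresh addresses lets the static range grow with every iteration.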

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -num | 10000 | Starts 10000 threads |
| -range | 4-64 | Which allocation range to test, e.g. 4-64 Bytes |
| -iter | 50 | How often to run the test and average over runs, e.g. 50 |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -timeout | 120 | Timeout in seconds, each individual testcase run will be canceled after this timeout, default is 600 |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |

Out-of-Memory Testcase

Tests out-of-memory behavior for a range of allocation sizes, and hence how efficiently the memory is utilized. The range is sampled at each power of 2 within the given -range.

  • python test_oom.py -t o+s+h+c+r+x -num 10000 -range 4-64 -runtest -timeout 60 -allocsize 8 -device 0
    • This starts 10000 allocating threads, tests powers of 2 in the range 4-64 and continues to allocate until out-of-memory is reported, recording the number of iterations in the csv file

This will generate one csv file for each approach, recording the number of successful iterations. To generate one file with the results of all approaches that were already executed, pass the option -genres instead of or in addition to -runtest.
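
The sampling of -range at powers of 2 can be sketched as follows. The helper powers_of_two_in_range is hypothetical, and the handling of bounds that are not themselves powers of 2 is an assumption made for this example.

```python
def powers_of_two_in_range(spec):
    """List all powers of two between the bounds of 'low-high', inclusive."""
    low, high = (int(part) for part in spec.split("-"))
    size = 1
    # Advance to the first power of two that is not below the lower bound.
    while size < low:
        size *= 2
    sizes = []
    while size <= high:
        sizes.append(size)
        size *= 2
    return sizes


sizes = powers_of_two_in_range("4-64")
print(sizes)
```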

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -num | 10000 | Starts 10000 threads |
| -range | 4-64 | Which allocation range to test, e.g. 4-64 Bytes |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -timeout | 120 | Timeout in seconds, each individual testcase run will be canceled after this timeout, default is 600 |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |

Dynamic Graph Testcases

Graph testcases require a config.json file, which has the following parameters

| Parameter | Value-Example | Description |
|---|---|---|
| iterations | 10 | How many iterations to run, each of which initializes the graph anew, e.g. 10 |
| update_iterations | 10 | How many edge update iterations to perform |
| batch_size | 10000 | How many edges to insert each iteration |
| range | 0 | If range is 0, the edge sources are randomly distributed amongst the available vertices; if > 0, updates will be focused on this smaller range, which is shifted over the graph update_iterations times |
| test_init | true | If this is set to true, only initialization will be measured |
| verify | false | If this is set to true, each operation will be verified against a host dynamic graph, which takes quite a long time |
| realistic_deletion | false | If this is set to false, the deletion operation will delete exactly the same edges that were introduced during the insertion operation. Otherwise, random edges will be selected from the graph |
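
A config file with the parameters above could be generated as in the following sketch. The values are the examples from the table; the flat key layout and the use of JSON booleans for test_init, verify and realistic_deletion are assumptions, so compare the result with the config files shipped in the repository before relying on it.

```python
import json

# Example values taken from the parameter table; the flat structure and
# the boolean types are assumed, not confirmed by the repository.
config = {
    "iterations": 10,
    "update_iterations": 10,
    "batch_size": 10000,
    "range": 0,
    "test_init": True,
    "verify": False,
    "realistic_deletion": False,
}

config_text = json.dumps(config, indent=2)
print(config_text)
```

Saving config_text to a file and passing that file via -configfile would then select this configuration.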

The testcase can handle .mtx (Matrix Market Format) files, which can be downloaded from the SuiteSparse Collection, and will automatically convert each file into a more efficient binary format, which greatly improves load times for multiple runs.
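
To illustrate why a binary copy loads faster than the text format, the sketch below parses a small Matrix Market coordinate file and packs its edges into fixed-width integers. The binary layout (three little-endian uint32 header fields followed by uint32 pairs) and both function names are hypothetical; the format actually written by the testcase is not documented here.

```python
import io
import struct


def read_mtx_edges(text_stream):
    """Parse a Matrix Market coordinate file into (rows, cols, edges)."""
    header = None
    edges = []
    for line in text_stream:
        line = line.strip()
        # Lines starting with % are the banner or comments.
        if not line or line.startswith("%"):
            continue
        fields = line.split()
        if header is None:
            # First data line: rows, columns, number of entries.
            header = (int(fields[0]), int(fields[1]), int(fields[2]))
            continue
        # Matrix Market indices are 1-based; store 0-based vertex IDs.
        edges.append((int(fields[0]) - 1, int(fields[1]) - 1))
    rows, cols, _ = header
    return rows, cols, edges


def to_binary(rows, cols, edges):
    """Pack into a hypothetical layout: header, then (source, target) pairs."""
    blob = struct.pack("<III", rows, cols, len(edges))
    for source, target in edges:
        blob += struct.pack("<II", source, target)
    return blob


sample = io.StringIO(
    "%%MatrixMarket matrix coordinate pattern general\n3 3 2\n1 2\n3 1\n"
)
rows, cols, edges = read_mtx_edges(sample)
blob = to_binary(rows, cols, edges)
print(rows, cols, edges, len(blob))
```

A reader of such a file can compute the position of every edge directly and skip all text parsing, which is where the load-time saving for repeated runs comes from.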

Graph Initialization

This testcase tests dynamic graph initialization. A config file has to be passed as described above; the list of graphs to test is given at the top of test_graph_init.py.

  • python test_graph_init.py -t o+s+h+c+r+x -configfile config_init.json -runtest -timeout 120 -allocsize 8 -device 0
    • Tests initialization performance for all graphs noted in test_graph_init.py, configured according to config_init.json

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -configfile | config_init.json | All the configuration details for this testcase, as described above |
| -graphstats | | Writes out graph statistics, does not run the actual testcase afterwards |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -timeout | 120 | Timeout in seconds, each individual testcase run will be canceled after this timeout, default is 600 |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |

Graph Edge Updates

This testcase tests dynamic graph updates. A config file has to be passed as described above; the list of graphs to test is given at the top of test_graph_update.py.

  • python test_graph_update.py -t o+s+h+c+r+x -configfile config_update.json -runtest -timeout 120 -allocsize 8 -device 0
    • Tests edge update performance for all graphs noted in test_graph_update.py, configured according to config_update.json; this tests random edge updates
  • python test_graph_update.py -t o+s+h+c+r+x -configfile config_update_range.json -runtest -timeout 120 -allocsize 8 -device 0
    • Tests edge update performance for all graphs noted in test_graph_update.py, configured according to config_update_range.json; this tests pressured edge updates with a given range of source vertices shifted over the graph

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -configfile | config_update.json | All the configuration details for this testcase, as described above |
| -graphstats | | Writes out graph statistics, does not run the actual testcase afterwards |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -timeout | 120 | Timeout in seconds, each individual testcase run will be canceled after this timeout, default is 600 |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |

Synthetic Testcases

Register Requirements

This testcase will report the number of registers required for a respective call to malloc or free.

  • python test_registers.py -t o+s+h+c+r+x -runtest -allocsize 8 -device 0

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |

Memory Manager Initialization

This testcase will test how long it takes to initialize each memory manager.

  • python test_synth_init.py -t o+s+h+c+r+x -runtest -allocsize 8 -device 0

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x | Specify which frameworks to test, first letter of approach separated by +, e.g. c : cuda or s : scatteralloc |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |

Workload Testcase

This testcase tests the classic case of a number of threads producing varying numbers of output elements and compares it to a baseline implemented with a CUB::ExclusiveSum.
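
The idea behind the exclusive-sum baseline can be sketched on the host: once each thread's output count is known, an exclusive prefix sum yields the offset at which each thread writes into one shared output buffer, so no dynamic allocation is needed. The code below is a sequential Python illustration with hypothetical names, not the CUB-based implementation used by the testcase.

```python
def exclusive_sum(counts):
    """Return (offsets, total), where offsets[i] is the sum of counts[:i]."""
    offsets = []
    running = 0
    for count in counts:
        offsets.append(running)
        running += count
    return offsets, running


def scatter_outputs(counts):
    """Let each 'thread' fill its own slice of one preallocated buffer."""
    offsets, total = exclusive_sum(counts)
    output = [None] * total
    for thread_id, (offset, count) in enumerate(zip(offsets, counts)):
        for k in range(count):
            output[offset + k] = thread_id
    return output


offsets, total = exclusive_sum([3, 0, 2, 5])
print(offsets, total)
print(scatter_outputs([3, 0, 2, 5]))
```

The baseline needs the counts before any output is written, whereas a dynamic memory manager lets each thread allocate its output on demand; that difference is what the comparison measures.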

  • python test_synth_workload.py -t o+s+h+c+r+x -threadrange 0-10 -range 4-64 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0
    • This will start with 2⁰ threads up to 2¹⁰ threads, testing all powers of 2 in-between, and for each number of threads test the range 4-64 Bytes
    • The option -testwrite will test write performance to this memory area

| Option | Parameter-Example | Description |
|---|---|---|
| -t | o+s+h+c+f+r+x+b | Specify which frameworks to test, first letter of approach separated by +, e.g. b : baseline (CUB exclusive sum) or c : cuda or s : scatteralloc |
| -threadrange | 0-10 | The range of threads to test, given as a power of 2, e.g. 0-10 would test 2⁰, 2¹, ..., 2¹⁰ threads for the given -range |
| -range | 4-64 | Which allocation range to test, e.g. 4-64 Bytes |
| -iter | 50 | How often to run the test and average over runs, e.g. 50 |
| -runtest | | Pass this flag to execute the testcase and run the approaches |
| -genres | | Pass this flag to gather all results from existing csv files into one |
| -allocsize | 8 | How large the manageable memory area per memory manager should be in GB |
| -device | 0 | Which GPU device to use |
| -testwrite | | If this flag is passed, the write performance to the allocations is measured instead of the allocation performance |

Test table TITAN V

Build Init Reg. Perf. 10K Perf. 100K Warp 10K Warp 100K Mix 10K Mix 100K Scale Frag. 1 OOM Graph Init. Graph Up. Graph Range Synth.4-64 Synth.4-4096 Synth. Write
CUDA 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XMalloc 🅰️ ✔️ ✔️ ✔️ 💥 ✔️ ✔️ ✔️ 💥 💥 💥 💥 💥 💥 ✔️ ✔️ ✔️
ScatterAlloc 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Halloc 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Reg-Eff - AW 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Reg-Eff - C 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 💥 💥 ✔️ ✔️ ✔️ ✔️ 💥 💥 💥 💥 💥
Reg-Eff - CF 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Reg-Eff - CM 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 💥 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Reg-Eff - CFM 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - P - S 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - P - VA 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - P - VL 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - C - S 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - C - VA 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - C - VL 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️

Test table 2080Ti

Build Init Reg. Perf. 10K Perf. 100K Warp 10K Warp 100K Mix 10K Mix 100K Scale Frag. 1 OOM Graph Init. Graph Up. Graph Range Synth.4-64 Synth.4-4096 Synth. Write
CUDA 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XMalloc 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 💥 💥 ✔️ ✔️ ✔️
ScatterAlloc 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Halloc 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Reg-Eff - AW 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Reg-Eff - C 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 💥 💥 ✔️ ✔️ ✔️
Reg-Eff - CF 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 💥 💥 ✔️ ✔️ ✔️
Reg-Eff - CM 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 💥 💥 ✔️ ✔️ ✔️
Reg-Eff - CFM 🅰️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 💥 💥 ✔️ ✔️ ✔️
Oro - P - S 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - P - VA 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - P - VL 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - C - S 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Oro - C - VA 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 💥
Oro - C - VL 🆎 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
