This project provides a free and open-source benchmark suite for evaluating GPU performance on AI tasks. The current version supports standard deep learning tests across three major architectures: Transformers (BERT), RNNs (LSTM), and CNNs (ResNet50). It also includes inference tests for Ollama model variants. The benchmark suite is configurable, allowing users to test different model parameters and GPU configurations, with plans to expand coverage to additional architectures and models in future updates.
Link To YouTube Video Explanation of Standard DL Test
Link To YouTube Video Explanation of Ollama Inference Tests
`standard_dl_test`
- `config`
  - `bert.yaml`: Configuration file for the BERT model.
  - `lstm.yaml`: Configuration file for the LSTM model.
  - `resnet50.yaml`: Configuration file for the ResNet50 model.
- `bert_train_test.py`: Benchmark script for evaluating GPUs on the BERT model with Hugging Face's publicly available IMDB dataset for text classification.
- `lstm_train_test.py`: Benchmark script for evaluating GPUs on the LSTM model with UCI's publicly available Individual Household Electric Power Consumption dataset for time series forecasting.
- `resnet50_train_test.py`: Benchmark script for evaluating GPUs on the ResNet50 model with PyTorch's available CIFAR10 dataset for image classification. The images are resized to 256x256 to make the benchmarking scenario more realistic (see the sketch after this list).
- `run.py`: Main script to run the standard DL test suite.
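As a rough point of reference for the ResNet50 item above, the input pipeline it describes can be approximated with torchvision as below. This is a minimal sketch: the batch size, normalization values, and loader settings are assumptions, not the values used by `resnet50_train_test.py`.

```python
# Minimal sketch of a CIFAR10 pipeline with images resized to 256x256,
# approximating what resnet50_train_test.py is described as doing.
# Batch size and normalization constants are assumptions, not repo values.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize((256, 256)),  # upscale the 32x32 CIFAR10 images
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(num_classes=10).to(device)
```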
`ollama_test`
- `config.yaml`: Configuration file for the Ollama test suite.
- `log_2_csv.py`: Converts the output in ./logs (created when `test_all.py` is run) to a more easily ingestible CSV file.
- `test_all.py`: The main script to run the Ollama tests. It automates running Docker containers for different GPU and model combinations, pulling models via Ollama's API, and ensuring they are accessible for prompt testing. It manages container lifecycle events, GPU selection, and volume handling based on a YAML configuration file (see the sketch after this list).
- `test_model_prompts.py`: Loads test prompts from a YAML file, runs multiple inference tests on a specified LLM model via a local API, and logs key performance metrics such as total duration, tokens per second, and GPU utilization. The results are averaged across test runs, saved to uniquely named log files, and include example responses from the model for each prompt.
- `test_prompts.yaml`: A YAML file containing all prompts to be tested.
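To make the `test_all.py` description above more concrete, the overall flow of starting an Ollama container on a chosen GPU and pulling a model through Ollama's HTTP API could look roughly like the sketch below. It uses the Docker SDK for Python and Ollama's public `/api/pull` endpoint; the container name, port, volume path, and model name are illustrative assumptions rather than values from `config.yaml`.

```python
# Rough sketch: run an Ollama container pinned to GPU 0, then pull a model.
# Image tag, container name, port, volume path, and model are assumptions.
import time
import docker
import requests

client = docker.from_env()
container = client.containers.run(
    "ollama/ollama",                 # official Ollama image
    detach=True,
    name="ollama_bench",
    ports={"11434/tcp": 11434},      # Ollama's default API port
    volumes={"/opt/ollama": {"bind": "/root/.ollama", "mode": "rw"}},
    device_requests=[docker.types.DeviceRequest(device_ids=["0"], capabilities=[["gpu"]])],
)
time.sleep(5)  # give the Ollama server a moment to start

# Pull a model through the local API so it is available for prompt testing.
resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"name": "llama3", "stream": False},
    timeout=600,
)
resp.raise_for_status()

# ... run the prompt tests here, then tear the container down.
container.stop()
container.remove()
```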
- `gpu_max_load.py`: Script to drive the GPU(s) to maximum load, useful for checking thermals or server noise under full load (a sketch of the idea follows this list).
- `requirements.txt`: List of required packages and dependencies.
- `setup.py`: Script for setting up the project environment. It requires a Conda installation, as it creates a Conda environment.
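The idea behind `gpu_max_load.py` (noted in the list above) can be illustrated with a simple loop of large matrix multiplications; the actual script may use a different workload, matrix size, or duration, so the values below are arbitrary.

```python
# Minimal sketch of saturating one GPU with dense matrix multiplications
# for a fixed duration. Matrix size and run time are arbitrary choices.
import time
import torch

def max_load(device_index: int = 0, seconds: int = 60, size: int = 8192) -> None:
    device = torch.device(f"cuda:{device_index}")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    end = time.time() + seconds
    while time.time() < end:
        a = a @ b                    # keep the GPU busy with matmuls
    torch.cuda.synchronize(device)   # wait for outstanding kernels to finish

if __name__ == "__main__":
    max_load()
```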
Ensure you have the following installed:
- Python 3.8 or higher (3.10 recommended)
- CUDA compatible GPU(s)
- NVIDIA CUDA Toolkit and cuDNN
- Docker. The links below outline the installation process if needed.
  - Docker Desktop Installation Instructions: https://www.docker.com/get-started/
  - Docker Engine (No GUI) Installation Instructions: https://docs.docker.com/engine/install/ubuntu/
- Clone the repository:
  ```
  git clone git@github.com:thedatadaddi/BenchDaddi.git
  cd BenchDaddi
  ```
- To use Conda, install Miniconda or Anaconda, then run:
  ```
  python setup.py
  conda activate gpu_bm_venv
  ```
- Alternatively, create a virtual environment with `venv` or `virtualenv` and install the required packages:
  ```
  python3 -m venv gpu_bm_venv
  # Windows
  gpu_bm_venv\Scripts\activate
  # Linux or Mac
  source gpu_bm_venv/bin/activate
  pip install -r requirements.txt
  ```
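Either way, it can be worth confirming that PyTorch sees your GPU(s) before running the benchmarks. This is just an optional sanity check, not part of the repository's scripts.

```python
# Optional sanity check that the environment is CUDA-ready.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```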
- To benchmark with the BERT model:
  ```
  python bert_train_test.py
  ```
- To benchmark with the LSTM model:
  ```
  python lstm_train_test.py
  ```
- To benchmark with the ResNet50 model:
  ```
  python resnet50_train_test.py
  ```
- To execute all benchmarks and obtain overall benchmark scores:
  ```
  python run.py
  ```
- Change into the Ollama test directory:
  ```
  cd ./ollama_test
  ```
- Adjust `config.yaml` if needed.
- Add or remove prompts in `test_prompts.yaml`.
- Run the test suite:
  ```
  python test_all.py
  ```
The project dependencies are listed in the `requirements.txt` file and can be installed using pip.
The default settings evaluate each GPU using FP32 precision, 3 epochs, and reasonable configuration values for each model.
- Memory Allocated: Reflects the memory currently in use by tensors, models, and other GPU data structures that are actively holding data.
- Memory Reserved: Includes both the memory currently allocated and additional memory that has been set aside for future allocations to avoid fragmentation and allocation overhead.
These metrics are measured just after the first model is loaded onto the GPU(s) and again after every `batch_logging_output_inc` (defaults to 100) batches are processed.
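These two values correspond to PyTorch's CUDA memory counters. A minimal sketch of reading them is shown below; the benchmark scripts' actual logging format may differ.

```python
# Minimal sketch of reading the two memory metrics from PyTorch.
# The scripts log these after model load and every batch_logging_output_inc
# batches; this helper simply prints them for one device.
import torch

def log_gpu_memory(device_index: int = 0) -> None:
    allocated_mib = torch.cuda.memory_allocated(device_index) / 1024**2
    reserved_mib = torch.cuda.memory_reserved(device_index) / 1024**2
    print(f"GPU {device_index}: allocated {allocated_mib:.1f} MiB, "
          f"reserved {reserved_mib:.1f} MiB")
```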
Measures the time taken for each operation: batch, epoch, etc. This is averaged and reported for each GPU per epoch and globally averaged for training and testing.
Measures the time taken to load data onto the GPU(s). This metric is measured for each batch, epoch, etc., and is also expressed as a percentage of total execution time to help identify bottlenecks.
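A minimal sketch of how per-batch time and the data-loading share can be measured in PyTorch follows; the benchmark scripts' actual bookkeeping is likely more detailed, and the `synchronize()` calls are included only so the timings reflect completed GPU work.

```python
# Sketch of timing data loading vs. total batch time for one training epoch.
# loader, model, criterion, optimizer, and device are assumed to exist.
import time
import torch

def timed_epoch(loader, model, criterion, optimizer, device):
    data_time = 0.0
    total_time = 0.0
    batch_start = time.perf_counter()
    for inputs, targets in loader:
        # Time spent fetching the batch and moving it onto the GPU.
        inputs, targets = inputs.to(device), targets.to(device)
        torch.cuda.synchronize(device)
        data_time += time.perf_counter() - batch_start

        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize(device)

        total_time += time.perf_counter() - batch_start
        batch_start = time.perf_counter()

    print(f"data loading: {100.0 * data_time / total_time:.1f}% of total batch time")
```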
Indicates how many samples a GPU can process per second for a particular model, dataset, and task. A sample could be any piece of information ingested by the model, such as text tokens or images. The training and testing throughput are summed to provide the benchmark score for each model. The global score of GPU performance across all models is the sum of each model's benchmark score. These scores are recorded in the `./logs/results_*.log` file.
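The scoring arithmetic itself is simple; the sketch below shows how the per-model and global scores combine. All sample counts and timings here are placeholders, not measured results.

```python
# Illustrative scoring arithmetic; every number below is a placeholder.
def throughput(num_samples: int, seconds: float) -> float:
    """Samples processed per second for one phase (training or testing)."""
    return num_samples / seconds

# Per-model benchmark score = training throughput + testing throughput.
bert_score = throughput(25_000, 300.0) + throughput(25_000, 90.0)
lstm_score = throughput(100_000, 60.0) + throughput(20_000, 10.0)
resnet50_score = throughput(50_000, 400.0) + throughput(10_000, 70.0)

# Global score = sum of the per-model scores (as recorded in ./logs/results_*.log).
global_score = bert_score + lstm_score + resnet50_score
print(f"global benchmark score: {global_score:.1f} samples/sec")
```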
The Ollama test evaluates each model's performance by focusing on the duration and efficiency of processing prompt tokens and generating responses. Ollama reports each duration metric in nanoseconds by default; values are converted to seconds by dividing by 10^9.
- Total Duration: The overall time spent generating the full response from the model, including both the time spent evaluating the prompt and generating the response tokens.
- Load Duration: The time taken to load the model into memory, which includes the initialization of all necessary weights and parameters.
- Prompt Eval Count: The number of tokens in the input prompt that need to be evaluated before the model generates the response.
- Prompt Eval Duration: The time taken to evaluate the input prompt. This includes tokenizing the prompt and processing it through the model to understand the context.
- Response Eval Count: The number of tokens generated by the model as part of the response.
- Response Eval Duration: The time spent generating the response tokens. This measures the speed of the model during the generation phase.
Tokens per second is calculated by dividing the API response parameter `eval_count` by `eval_duration` and multiplying by 10^9, which yields the rate of token generation in tokens per second.
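Putting the nanosecond conversion and the tokens-per-second formula together, the metrics above can be derived from a single non-streaming `/api/generate` response roughly as follows. The field names match Ollama's public API; the model name and prompt are arbitrary examples.

```python
# Sketch: query a local Ollama server and derive the metrics described above.
# Field names (total_duration, load_duration, prompt_eval_duration, eval_count,
# eval_duration) come from Ollama's API; model and prompt are just examples.
import requests

NS_PER_S = 1e9  # Ollama reports durations in nanoseconds

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain GPUs in one sentence.", "stream": False},
    timeout=300,
).json()

total_s = resp["total_duration"] / NS_PER_S
load_s = resp["load_duration"] / NS_PER_S
prompt_eval_s = resp["prompt_eval_duration"] / NS_PER_S
tokens_per_second = resp["eval_count"] / resp["eval_duration"] * NS_PER_S

print(f"total: {total_s:.2f}s  load: {load_s:.2f}s  "
      f"prompt eval: {prompt_eval_s:.2f}s  generation: {tokens_per_second:.1f} tok/s")
```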
This project is licensed under the MIT License. See the LICENSE.txt file for details.
Contributions are welcome! Please submit a pull request or open an issue to discuss any changes.
If you use this benchmark in your research, please cite the following:
@misc{tdd_gpu_benchmark_suite,
author = {TheDataDaddi},
title = {TDD GPU Benchmark Suite},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/thedatadaddi/gpu_bm_suite.git}},
}
For any questions or issues, please contact [email protected].