setting --gpu_device is not working for multiple GPUs #1063

Open
menghaowei opened this issue Feb 2, 2025 · 1 comment

@menghaowei

I have a server with 8 × A100 40 GB GPUs, and I am trying to run AlphaFold3 with Docker or bash, but no matter how I set --gpu_device, only the first GPU is used.

Here are some tests and their outputs.

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:3D:00.0 Off |                  Off |
| N/A   31C    P0             33W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:3E:00.0 Off |                  Off |
| N/A   33C    P0             36W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  |   00000000:40:00.0 Off |                  Off |
| N/A   32C    P0             37W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          On  |   00000000:41:00.0 Off |                  Off |
| N/A   32C    P0             35W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          On  |   00000000:B1:00.0 Off |                  Off |
| N/A   32C    P0             39W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          On  |   00000000:B2:00.0 Off |                  Off |
| N/A   32C    P0             36W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          On  |   00000000:B4:00.0 Off |                  Off |
| N/A   33C    P0             35W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          On  |   00000000:B5:00.0 Off |                  Off |
| N/A   33C    P0             33W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Here is my test command:

input_data="/home/menghw/data2/AF3_test_json/"
output_data="/home/menghw/data2/02.AF3_output/"
model_dir="/home/menghw/07.AF3_local_test/alphafold3/weights/"
database_dir="/home/menghw/07.AF3_local_test/database/"
docker run -it \
    --volume $input_data:/root/af_input \
    --volume $output_data:/root/af_output \
    --volume $model_dir:/root/models \
    --volume $database_dir:/root/public_databases \
    --gpus 8 \
    alphafold3 \
    python run_alphafold.py \
    --json_path=/root/af_input/06.test_multi_GPU8.json \
    --model_dir=/root/models \
    --output_dir=/root/af_output \
    --gpu_device=7
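
For context, --gpus 8 exposes all eight GPUs to the container, while --gpu_device only selects one of the JAX devices that end up visible inside it. A possible workaround for pinning the run to a single card is to restrict visibility at the Docker level instead. This is a sketch based on Docker's documented --gpus device syntax, not something verified against this AlphaFold3 release:

# expose only physical GPU 7; inside the container it is enumerated as device 0
docker run -it \
    --volume $input_data:/root/af_input \
    --volume $output_data:/root/af_output \
    --volume $model_dir:/root/models \
    --volume $database_dir:/root/public_databases \
    --gpus device=7 \
    alphafold3 \
    python run_alphafold.py \
    --json_path=/root/af_input/06.test_multi_GPU8.json \
    --model_dir=/root/models \
    --output_dir=/root/af_output \
    --gpu_device=0

Because only physical GPU 7 is exposed, it shows up inside the container as device 0, which is why --gpu_device is left at 0 in this sketch.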

Here is the running log of the original command above:

use the model parameters.
Found local devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)], using device 7: cuda:7
Building model from scratch...
Processing fold inputs.
Processing fold input #1
Processing fold input test_pre_load_GPU8
Checking we can load the model parameters...

Although it looks like it is running well, only the first GPU is actually being used!

nvidia-smi
Sun Feb  2 16:14:26 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:3D:00.0 Off |                  Off |
| N/A   32C    P0             35W /  250W |   38855MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:3E:00.0 Off |                  Off |
| N/A   33C    P0             38W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  |   00000000:40:00.0 Off |                  Off |
| N/A   33C    P0             39W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          On  |   00000000:41:00.0 Off |                  Off |
| N/A   32C    P0             38W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          On  |   00000000:B1:00.0 Off |                  Off |
| N/A   33C    P0             41W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          On  |   00000000:B2:00.0 Off |                  Off |
| N/A   33C    P0             39W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          On  |   00000000:B4:00.0 Off |                  Off |
| N/A   33C    P0             37W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          On  |   00000000:B5:00.0 Off |                  Off |
| N/A   33C    P0             35W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3802743      C   python                                      38846MiB |
|    1   N/A  N/A   3802743      C   python                                        416MiB |
|    2   N/A  N/A   3802743      C   python                                        416MiB |
|    3   N/A  N/A   3802743      C   python                                        416MiB |
|    4   N/A  N/A   3802743      C   python                                        416MiB |
|    5   N/A  N/A   3802743      C   python                                        416MiB |
|    6   N/A  N/A   3802743      C   python                                        416MiB |
|    7   N/A  N/A   3802743      C   python                                        416MiB |
+-----------------------------------------------------------------------------------------+
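
A quick way to check what JAX actually sees inside the container is a rough sketch like the following, using standard JAX and CUDA mechanisms rather than anything AlphaFold3-specific:

# list the CUDA devices JAX enumerates with all GPUs exposed
docker run -it --gpus 8 alphafold3 \
    python -c "import jax; print(jax.devices())"

# with CUDA_VISIBLE_DEVICES set, JAX should report a single CudaDevice(id=0)
docker run -it --gpus 8 -e CUDA_VISIBLE_DEVICES=7 alphafold3 \
    python -c "import jax; print(jax.devices())"

The ~425 MiB on each idle card is consistent with JAX simply initializing a context on every visible GPU, independent of which device --gpu_device selects.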

Can anyone help me?

@MominIqbal-1234

MominIqbal-1234 commented Feb 4, 2025

I am facing this error, do you know how to solve it?
First I installed all dependencies, then I created a protein_sequence.fasta file and ran this command:

python3 run_alphafold.py --fasta_paths=protein_sequence.fasta --output_dir=output --model_names=model_1,model_2 --num_recycles=3

and then it returned this error:

RuntimeError: jaxlib version 0.4.33 is newer than and incompatible with jax version 0.4.26. Please update your jax and/or jaxlib packages.

OS: Ubuntu
Python: 3.12.3
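
That jaxlib/jax mismatch is usually resolved by reinstalling a matching pair. A minimal sketch, assuming a pip-based install on a CUDA 12 machine (the exact versions AlphaFold3 pins are not checked here):

# see which jax and jaxlib versions are currently installed
python3 -c "import jax, jaxlib; print(jax.__version__, jaxlib.__version__)"

# the jax[cuda12] extra pulls in a jaxlib that matches the installed jax version
pip install --upgrade "jax[cuda12]"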
