setting --gpu_device is not working for multiple GPUs #1063

Open
menghaowei opened this issue Feb 2, 2025 · 1 comment

@menghaowei

I have a server with 8 × A100 40 GB GPUs, and I am trying to run AlphaFold3 with Docker or bash, but no matter how I set --gpu_device, only the first GPU is used.

Here are some tests and their outputs.

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:3D:00.0 Off |                  Off |
| N/A   31C    P0             33W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:3E:00.0 Off |                  Off |
| N/A   33C    P0             36W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  |   00000000:40:00.0 Off |                  Off |
| N/A   32C    P0             37W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          On  |   00000000:41:00.0 Off |                  Off |
| N/A   32C    P0             35W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          On  |   00000000:B1:00.0 Off |                  Off |
| N/A   32C    P0             39W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          On  |   00000000:B2:00.0 Off |                  Off |
| N/A   32C    P0             36W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          On  |   00000000:B4:00.0 Off |                  Off |
| N/A   33C    P0             35W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          On  |   00000000:B5:00.0 Off |                  Off |
| N/A   33C    P0             33W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Here is my test command:

input_data="/home/menghw/data2/AF3_test_json/"
output_data="/home/menghw/data2/02.AF3_output/"
model_dir="/home/menghw/07.AF3_local_test/alphafold3/weights/"
database_dir="/home/menghw/07.AF3_local_test/database/"
docker run -it \
    --volume $input_data:/root/af_input \
    --volume $output_data:/root/af_output \
    --volume $model_dir:/root/models \
    --volume $database_dir:/root/public_databases \
    --gpus 8 \
    alphafold3 \
    python run_alphafold.py \
    --json_path=/root/af_input/06.test_multi_GPU8.json \
    --model_dir=/root/models \
    --output_dir=/root/af_output \
    --gpu_device=7
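
For context, --gpus 8 exposes all eight GPUs to the container, while --gpu_device only selects one of the JAX devices that end up visible inside it. A possible workaround for pinning the run to a single card is to restrict visibility at the Docker level instead. This is a sketch based on Docker's documented --gpus device syntax, not something verified against this AlphaFold3 release:

# expose only physical GPU 7; inside the container it is enumerated as device 0
docker run -it \
    --volume $input_data:/root/af_input \
    --volume $output_data:/root/af_output \
    --volume $model_dir:/root/models \
    --volume $database_dir:/root/public_databases \
    --gpus device=7 \
    alphafold3 \
    python run_alphafold.py \
    --json_path=/root/af_input/06.test_multi_GPU8.json \
    --model_dir=/root/models \
    --output_dir=/root/af_output \
    --gpu_device=0

Because only physical GPU 7 is exposed, it shows up inside the container as device 0, which is why --gpu_device is left at 0 in this sketch.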

Here is the running log of the original command above:

use the model parameters.
Found local devices: [CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)], using device 7: cuda:7
Building model from scratch...
Processing fold inputs.
Processing fold input #1
Processing fold input test_pre_load_GPU8
Checking we can load the model parameters...

Although it looks like it is running well, only the first GPU is actually being used!

nvidia-smi
Sun Feb  2 16:14:26 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:3D:00.0 Off |                  Off |
| N/A   32C    P0             35W /  250W |   38855MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:3E:00.0 Off |                  Off |
| N/A   33C    P0             38W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  |   00000000:40:00.0 Off |                  Off |
| N/A   33C    P0             39W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          On  |   00000000:41:00.0 Off |                  Off |
| N/A   32C    P0             38W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          On  |   00000000:B1:00.0 Off |                  Off |
| N/A   33C    P0             41W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          On  |   00000000:B2:00.0 Off |                  Off |
| N/A   33C    P0             39W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          On  |   00000000:B4:00.0 Off |                  Off |
| N/A   33C    P0             37W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          On  |   00000000:B5:00.0 Off |                  Off |
| N/A   33C    P0             35W /  250W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3802743      C   python                                      38846MiB |
|    1   N/A  N/A   3802743      C   python                                        416MiB |
|    2   N/A  N/A   3802743      C   python                                        416MiB |
|    3   N/A  N/A   3802743      C   python                                        416MiB |
|    4   N/A  N/A   3802743      C   python                                        416MiB |
|    5   N/A  N/A   3802743      C   python                                        416MiB |
|    6   N/A  N/A   3802743      C   python                                        416MiB |
|    7   N/A  N/A   3802743      C   python                                        416MiB |
+-----------------------------------------------------------------------------------------+
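
A quick way to check what JAX actually sees inside the container is a rough sketch like the following, using standard JAX and CUDA mechanisms rather than anything AlphaFold3-specific:

# list the CUDA devices JAX enumerates with all GPUs exposed
docker run -it --gpus 8 alphafold3 \
    python -c "import jax; print(jax.devices())"

# with CUDA_VISIBLE_DEVICES set, JAX should report a single CudaDevice(id=0)
docker run -it --gpus 8 -e CUDA_VISIBLE_DEVICES=7 alphafold3 \
    python -c "import jax; print(jax.devices())"

The ~425 MiB on each idle card is consistent with JAX simply initializing a context on every visible GPU, independent of which device --gpu_device selects.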

Can anyone help me?

@MominIqbal-1234

MominIqbal-1234 commented Feb 4, 2025

I am facing this error, do you know how to solve it?
First I installed all dependencies, then I created a protein_sequence.fasta file and ran this command:

python3 run_alphafold.py --fasta_paths=protein_sequence.fasta --output_dir=output --model_names=model_1,model_2 --num_recycles=3

and then it returned this error:

RuntimeError: jaxlib version 0.4.33 is newer than and incompatible with jax version 0.4.26. Please update your jax and/or jaxlib packages.

OS: Ubuntu
Python: 3.12.3
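
That jaxlib/jax mismatch is usually resolved by reinstalling a matching pair. A minimal sketch, assuming a pip-based install on a CUDA 12 machine (the exact versions AlphaFold3 pins are not checked here):

# see which jax and jaxlib versions are currently installed
python3 -c "import jax, jaxlib; print(jax.__version__, jaxlib.__version__)"

# the jax[cuda12] extra pulls in a jaxlib that matches the installed jax version
pip install --upgrade "jax[cuda12]"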
