I am trying to train a force-field model using a variation of the following command from the README, adjusted to match my directories:
train_folder_ff.py --root_dir "alignn/examples/sample_data_ff" --config "alignn/examples/sample_data_ff/config_example_atomwise.json" --output_dir=temp
However, training is very slow and does not appear to use the GPU at all. This can be confirmed by running nvidia-smi during training and viewing the output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02 Driver Version: 535.43.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:07:00.0 Off | N/A |
| 0% 42C P8 13W / 170W | 71MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1405 G /usr/lib/xorg/Xorg 56MiB |
| 0 N/A N/A 1571 G /usr/bin/gnome-shell 5MiB |
+---------------------------------------------------------------------------------------+
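As an additional sanity check, running something like the following in the same conda environment confirms whether PyTorch itself can see the GPU (a minimal sketch, independent of ALIGNN, using only standard torch.cuda calls):

import torch

# Check that this environment's PyTorch build can see the GPU at all.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
    # A small allocation; it should appear as a python process in nvidia-smi.
    x = torch.randn(1000, 1000, device="cuda")
    print("Tensor device:", x.device)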
If I train a model that does not use force fields, the GPU is used. For example, running
train_folder.py --root_dir "alignn/examples/sample_data" --config "alignn/examples/sample_data/config_example.json" --output_dir=temp
and simultaneously running nvidia-smi gives the following output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02 Driver Version: 535.43.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:07:00.0 Off | N/A |
| 0% 46C P2 62W / 170W | 921MiB / 12288MiB | 39% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1405 G /usr/lib/xorg/Xorg 56MiB |
| 0 N/A N/A 1571 G /usr/bin/gnome-shell 5MiB |
| 0 N/A N/A 29095 C .../miniconda3/envs/version/bin/python 848MiB |
+---------------------------------------------------------------------------------------+
I have done my best to check that all the dependencies are compatible, and I can confirm that the device is set to cuda in the train_folder_ff.py script.
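To illustrate what I mean by that check: setting the device variable to cuda is not by itself enough, since the model and every batch also have to be moved there. The snippet below is only a minimal stand-in (a torch.nn.Linear and a random tensor, not the actual ALIGNN model or data loader), but the same .device attributes can be printed inside the training loop of train_folder_ff.py:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 1)   # stand-in for the ALIGNN model
batch = torch.randn(4, 8)       # stand-in for one training batch

model = model.to(device)
print("model parameters on:", next(model.parameters()).device)  # cuda:0 if moved
print("batch tensor on:", batch.device)                         # still cpu until moved

batch = batch.to(device)
print("batch after .to(device):", batch.device)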
I have tried a wide range of batch sizes, including very large ones such as 1028, but performance was unaffected. I also tried passing batch_size as an argument, and the problem still persisted.
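For reference, the batch size can also be changed through the config file before launching training; the sketch below assumes the example JSON accepts a top-level "batch_size" key (the path matches the command above):

import json

cfg_path = "alignn/examples/sample_data_ff/config_example_atomwise.json"

# Load the example config, change the batch size, and write it back.
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["batch_size"] = 64  # the exact value is arbitrary here

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)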