
use lightning or pytorch-lightning #438

Open
better629 opened this issue Dec 10, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@better629

Describe the bug

Installed pytorch-lightning==2.4.0 and nemo_toolkit==2.1.0rc0 under NeMo-Aligner 0.5.0.

Training fails with:

[rank0]:   File "/tf/NeMo-Aligner/./examples/nlp/gpt/train_reward_model.py", line 137, in main
[rank0]:     init_using_ptl(trainer, ptl_model, train_dataloader, train_ds)
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/nemo_aligner/utils/train_script_utils.py", line 107, in init_using_ptl
[rank0]:     call._call_configure_model(ptl_trainer)
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 111, in _call_configure_model
[rank0]:     if is_overridden("configure_sharded_model", trainer.lightning_module):
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/utilities/model_helpers.py", line 42, in is_overridden
[rank0]:     raise ValueError("Expected a parent")
[rank0]: ValueError: Expected a parent

It seems that resolve_and_create_trainer initializes a lightning Trainer, but init_using_ptl then checks it against the pytorch_lightning Trainer.
Has anyone else run into this problem?
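The error is consistent with the two-package split: `lightning.pytorch` and `pytorch_lightning` define distinct class objects, so a model subclassing one package's `LightningModule` is not recognized by checks written against the other, even though the classes look identical. A minimal illustration with stand-in classes (not the real Lightning code):

```python
# Two identically-shaped classes standing in for
# lightning.pytorch.LightningModule and pytorch_lightning.LightningModule.
class LightningModuleA:
    def configure_sharded_model(self):
        pass

class LightningModuleB:
    def configure_sharded_model(self):
        pass

# A model written against "package A".
class MyModel(LightningModuleA):
    pass

model = MyModel()
# isinstance/issubclass compare class objects, not names or method sets,
# so a check written against the other package's class fails.
print(isinstance(model, LightningModuleA))  # True
print(isinstance(model, LightningModuleB))  # False
```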

I updated the imports inside nemo_aligner/utils/train_script_utils.py as follows, and training proceeds past this point:

# Original imports in train_script_utils.py:
# from pytorch_lightning.trainer import call
# from pytorch_lightning.trainer.states import TrainerFn
from lightning.pytorch.trainer import call
from lightning.pytorch.trainer.states import TrainerFn
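A more defensive variant could resolve the module at import time so the same file works with either distribution. This is only a sketch; `import_first` is a hypothetical helper, not part of NeMo-Aligner:

```python
import importlib

def import_first(*module_names):
    """Return the first importable module among the given dotted names.

    Handy when a dependency was renamed, e.g. the standalone
    pytorch_lightning package vs. the unified lightning.pytorch namespace.
    """
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate importable:\n" + "\n".join(errors))

# Hypothetical usage in train_script_utils.py:
# call = import_first("lightning.pytorch.trainer.call",
#                     "pytorch_lightning.trainer.call")
```

Note this only papers over the symptom: if the Trainer was built from one package, the model-side checks must come from the same package, so a single consistent import site is the real fix.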

Steps/Code to reproduce bug

Expected behavior


Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo-Aligner install: pip

Environment details


  • OS version: Ubuntu 22.04
  • PyTorch version: 2.5.1
  • Python version: 3.10


@better629 better629 added the bug Something isn't working label Dec 10, 2024
@terrykong
Collaborator

Hi, thanks for trying out Aligner.

Can you try the Dockerfile plus the code on the dev branch? This has been resolved there.

@better629
Author

better629 commented Dec 11, 2024

My environment is inside Docker and supports docker-in-docker, so I will try the steps in the Dockerfile.

It seems the Dockerfile on the dev branch uses ALIGNER_COMMIT=main; should we update it to ALIGNER_COMMIT=dev? @terrykong

BTW, can you check #436?

@terrykong
Collaborator

should we update to ALIGNER_COMMIT=dev

Yes, exactly. If you're running docker build, you can just add --build-arg ALIGNER_COMMIT=dev

@better629
Author

should we update to ALIGNER_COMMIT=dev

Yea, exactly. If you're running docker build you can just add on --build-arg ALIGNER_COMMIT=dev

It seems TensorRT-LLM on the dev branch requires CUDA 12.5. Have you tried TensorRT-LLM v0.7.1 with CUDA 12.2?

@better629
Author

better629 commented Dec 11, 2024

@terrykong
I use ./examples/nlp/gpt/train_reward_model.py and installed following the Dockerfile on the dev branch (TensorRT-LLM / TransformerEngine hit errors, but they don't seem to affect reward training).

I still get OOM with a single A800 80 GB card, or get stuck at Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2 with multiple cards.

command

CUDA_VISIBLE_DEVICES=2 python3 ./examples/nlp/gpt/train_reward_model.py \
      trainer.num_nodes=1 \
      trainer.devices=1 \
      ++model.micro_batch_size=1 \
      ++model.global_batch_size=1 \
      ++model.data.data_impl=json \
      ++model.data.seq_length=400 \
      ++model.encoder_seq_length=400 \
      pretrained_checkpoint.restore_from_path=/tf/model/mistral-7b.nemo \
      "model.data.data_prefix={train: ["train.json"], validation: ["test.json"], test: ["val.json"]}" \
      exp_manager.explicit_log_dir=./results/reward_model_7b \
      trainer.rm.val_check_interval=10 \
      exp_manager.create_wandb_logger=False \
      trainer.rm.save_interval=100 \
      trainer.rm.max_steps=1000 \
      +model.tensor_model_parallel_size=1 \
      ++model.pipeline_model_parallel_size=1 \
      ++model.activations_checkpoint_granularity="selective" \
      model.reward_model_type="regression" \
      model.regression.num_attributes=1
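For what it's worth, the single-card OOM is expected for full-parameter training of a 7B model regardless of batch size. A back-of-envelope estimate, assuming bf16 weights and gradients plus fp32 Adam states (activations excluded; NeMo's distributed optimizer changes the exact per-GPU breakdown):

```python
# Rough lower bound on training-state memory for a 7B model with Adam.
params = 7e9        # Mistral-7B parameter count, roughly
bf16_weights = 2    # bytes per parameter
bf16_grads = 2
fp32_master = 4     # fp32 master copy of the weights
adam_m = 4          # Adam first moment (fp32)
adam_v = 4          # Adam second moment (fp32)
bytes_total = params * (bf16_weights + bf16_grads + fp32_master + adam_m + adam_v)
print(f"{bytes_total / 2**30:.0f} GiB")  # ~104 GiB, above a single 80 GB A800
```

So sharding the optimizer states and model across several cards (tensor/pipeline parallelism or a distributed optimizer) is needed; shrinking micro_batch_size or seq_length alone cannot fit this on one 80 GB device.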
