
use lightning or pytorch-lightning #438

Open
better629 opened this issue Dec 10, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@better629

Describe the bug

Installed pytorch-lightning==2.4.0 and nemo_toolkit==2.1.0rc0 under NeMo-Aligner 0.5.0.

Training fails with:

[rank0]:   File "/tf/NeMo-Aligner/./examples/nlp/gpt/train_reward_model.py", line 137, in main
[rank0]:     init_using_ptl(trainer, ptl_model, train_dataloader, train_ds)
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/nemo_aligner/utils/train_script_utils.py", line 107, in init_using_ptl
[rank0]:     call._call_configure_model(ptl_trainer)
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 111, in _call_configure_model
[rank0]:     if is_overridden("configure_sharded_model", trainer.lightning_module):
[rank0]:   File "/tf/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/utilities/model_helpers.py", line 42, in is_overridden
[rank0]:     raise ValueError("Expected a parent")
[rank0]: ValueError: Expected a parent

It seems that resolve_and_create_trainer initializes a lightning Trainer, but init_using_ptl then checks it against the pytorch_lightning Trainer.
Has anyone else run into this problem?
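The error is consistent with the two-package split: `lightning.pytorch` and `pytorch_lightning` define distinct class objects, so a model subclassing one package's `LightningModule` is not recognized by checks written against the other, even though the classes look identical. A minimal illustration with stand-in classes (not the real Lightning code):

```python
# Two identically-shaped classes standing in for
# lightning.pytorch.LightningModule and pytorch_lightning.LightningModule.
class LightningModuleA:
    def configure_sharded_model(self):
        pass

class LightningModuleB:
    def configure_sharded_model(self):
        pass

# A model written against "package A".
class MyModel(LightningModuleA):
    pass

model = MyModel()
# isinstance/issubclass compare class objects, not names or method sets,
# so a check written against the other package's class fails.
print(isinstance(model, LightningModuleA))  # True
print(isinstance(model, LightningModuleB))  # False
```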

I updated the imports inside nemo_aligner/utils/train_script_utils.py as follows, and training proceeds past this point:

# Original imports in train_script_utils.py:
# from pytorch_lightning.trainer import call
# from pytorch_lightning.trainer.states import TrainerFn
from lightning.pytorch.trainer import call
from lightning.pytorch.trainer.states import TrainerFn
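A more defensive variant could resolve the module at import time so the same file works with either distribution. This is only a sketch; `import_first` is a hypothetical helper, not part of NeMo-Aligner:

```python
import importlib

def import_first(*module_names):
    """Return the first importable module among the given dotted names.

    Handy when a dependency was renamed, e.g. the standalone
    pytorch_lightning package vs. the unified lightning.pytorch namespace.
    """
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate importable:\n" + "\n".join(errors))

# Hypothetical usage in train_script_utils.py:
# call = import_first("lightning.pytorch.trainer.call",
#                     "pytorch_lightning.trainer.call")
```

Note this only papers over the symptom: if the Trainer was built from one package, the model-side checks must come from the same package, so a single consistent import site is the real fix.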

Steps/Code to reproduce bug

Expected behavior


Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo-Aligner install: pip

Environment details


  • OS version: Ubuntu 22.04
  • PyTorch version: 2.5.1
  • Python version: 3.10


@better629 better629 added the bug Something isn't working label Dec 10, 2024
@terrykong
Collaborator

Hi, thanks for trying out Aligner.

Can you try the Dockerfile plus the code on the dev branch? This has been resolved there.

@better629
Author

better629 commented Dec 11, 2024

My environment is inside Docker and supports docker-in-docker, so I will try the steps in the Dockerfile.

It seems the Dockerfile on the dev branch uses ALIGNER_COMMIT=main; should we update it to ALIGNER_COMMIT=dev? @terrykong

BTW, can you check #436?

@terrykong
Collaborator

should we update to ALIGNER_COMMIT=dev

Yes, exactly. If you're running docker build, you can just add --build-arg ALIGNER_COMMIT=dev

@better629
Author

should we update to ALIGNER_COMMIT=dev

Yea, exactly. If you're running docker build you can just add on --build-arg ALIGNER_COMMIT=dev

It seems TensorRT-LLM on the dev branch requires CUDA 12.5. Have you tried TensorRT-LLM v0.7.1 with CUDA 12.2?

@better629
Author

better629 commented Dec 11, 2024

@terrykong
I use ./examples/nlp/gpt/train_reward_model.py and installed following the Dockerfile on the dev branch (TensorRT-LLM / TransformerEngine hit errors, but they don't seem to affect reward training).

I still get OOM with a single A800 80 GB card, or get stuck at Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2 with multiple cards.

command

CUDA_VISIBLE_DEVICES=2 python3 ./examples/nlp/gpt/train_reward_model.py \
      trainer.num_nodes=1 \
      trainer.devices=1 \
      ++model.micro_batch_size=1 \
      ++model.global_batch_size=1 \
      ++model.data.data_impl=json \
      ++model.data.seq_length=400 \
      ++model.encoder_seq_length=400 \
      pretrained_checkpoint.restore_from_path=/tf/model/mistral-7b.nemo \
      "model.data.data_prefix={train: ["train.json"], validation: ["test.json"], test: ["val.json"]}" \
      exp_manager.explicit_log_dir=./results/reward_model_7b \
      trainer.rm.val_check_interval=10 \
      exp_manager.create_wandb_logger=False \
      trainer.rm.save_interval=100 \
      trainer.rm.max_steps=1000 \
      +model.tensor_model_parallel_size=1 \
      ++model.pipeline_model_parallel_size=1 \
      ++model.activations_checkpoint_granularity="selective" \
      model.reward_model_type="regression" \
      model.regression.num_attributes=1
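For what it's worth, the single-card OOM is expected for full-parameter training of a 7B model regardless of batch size. A back-of-envelope estimate, assuming bf16 weights and gradients plus fp32 Adam states (activations excluded; NeMo's distributed optimizer changes the exact per-GPU breakdown):

```python
# Rough lower bound on training-state memory for a 7B model with Adam.
params = 7e9        # Mistral-7B parameter count, roughly
bf16_weights = 2    # bytes per parameter
bf16_grads = 2
fp32_master = 4     # fp32 master copy of the weights
adam_m = 4          # Adam first moment (fp32)
adam_v = 4          # Adam second moment (fp32)
bytes_total = params * (bf16_weights + bf16_grads + fp32_master + adam_m + adam_v)
print(f"{bytes_total / 2**30:.0f} GiB")  # ~104 GiB, above a single 80 GB A800
```

So sharding the optimizer states and model across several cards (tensor/pipeline parallelism or a distributed optimizer) is needed; shrinking micro_batch_size or seq_length alone cannot fit this on one 80 GB device.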
