Commit 2b53a90

[docs] troubleshooting guide (huggingface#2133)
* first take at troubleshooting guide * logging moved to the troubleshooting guide * TOC updates and gudie edits * minor edits * moved to tutorials * feedback addressed * batch size clarifications * typo * kernel, early stopping hanging, feedback
1 parent 39d255b commit 2b53a90

5 files changed

+226
-173
lines changed

docs/source/_toctree.yml

+2-4
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,8 @@
1515
title: Launching distributed code
1616
- local: basic_tutorials/notebook
1717
title: Launching distributed training from Jupyter Notebooks
18+
- local: basic_tutorials/troubleshooting
19+
title: Troubleshooting guide
1820
title: Tutorials
1921
- sections:
2022
- local: usage_guides/explore
@@ -37,10 +39,6 @@
3739
title: Saving and loading training states
3840
- local: usage_guides/tracking
3941
title: Using experiment trackers
40-
- local: usage_guides/debug
41-
title: Debugging timeout errors
42-
- local: usage_guides/memory
43-
title: How to avoid CUDA Out-of-Memory
4442
- local: usage_guides/mps
4543
title: How to use Apple Silicon M1 GPUs
4644
- local: usage_guides/deepspeed
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,222 @@
1+
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
12+
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13+
rendered properly in your Markdown viewer.
14+
-->
15+
16+
# Troubleshooting guide
17+
18+
This guide aims to provide you with the tools and knowledge required to navigate some common issues. However,
19+
as 🤗 Accelerate continuously evolves and the use cases and setups are diverse, you might encounter an issue not covered in this
20+
guide. If the suggestions listed in this guide do not cover your situation, please refer to the final section of
21+
the guide, [Ask for help](#ask-for-help), to learn where to find help with your specific issue.
22+
23+
## Logging
24+
25+
When facing an error, logging can help narrow down where it is coming from. In a distributed setup with multiple processes,
26+
logging can be a challenge, but 🤗 Accelerate provides a utility that streamlines the logging process and ensures that
27+
logs are synchronized and managed effectively across the distributed setup.
28+
29+
To troubleshoot an issue, use `accelerate.logging` instead of the standard Python `logging` module:
30+
31+
```diff
32+
- import logging
33+
+ from accelerate.logging import get_logger
34+
- logger = logging.getLogger(__name__)
35+
+ logger = get_logger(__name__)
36+
```
37+
38+
To set the log level (`INFO`, `DEBUG`, `WARNING`, `ERROR`, `CRITICAL`), export it as the `ACCELERATE_LOG_LEVEL` environment variable,
39+
or pass it as `log_level` to `get_logger`:
40+
41+
```python
42+
from accelerate.logging import get_logger
43+
44+
logger = get_logger(__name__, log_level="INFO")
45+
```
46+
47+
By default, messages are logged on the main process only. To log on all processes, pass `main_process_only=False`.
48+
To log on all processes and in order, also pass `in_order=True`.
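
To make the effect of `main_process_only` concrete, here is a standard-library-only sketch of a rank-aware logger. `RankAwareAdapter` and its `rank` argument are hypothetical names used for illustration. This is not 🤗 Accelerate's implementation, which reads the process rank from its own state and also supports `in_order`.

```python
import logging


class RankAwareAdapter(logging.LoggerAdapter):
    """Hypothetical stand-in for a rank-aware logger (not Accelerate's class)."""

    def __init__(self, logger, rank):
        super().__init__(logger, {})
        self.rank = rank  # assumed to be supplied by the launcher

    def log(self, level, msg, *args, main_process_only=True, **kwargs):
        # Drop the message on non-main ranks unless the caller opts in.
        if main_process_only and self.rank != 0:
            return
        super().log(level, f"[rank {self.rank}] {msg}", *args, **kwargs)


logging.basicConfig(level=logging.INFO)
base = logging.getLogger(__name__)

# Emulate two processes in a single interpreter.
for rank in range(2):
    log = RankAwareAdapter(base, rank)
    log.info("checkpoint loaded")  # emitted by rank 0 only
    log.info("local batch ready", main_process_only=False)  # emitted by both ranks
```

With the real logger, the same keyword is passed per call, for example `logger.info("message", main_process_only=False)`.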
49+
50+
## Hanging code and timeout errors
51+
52+
### Mismatched tensor shapes
53+
54+
If your code seems to be hanging for a significant amount of time on a distributed setup, a common cause is mismatched shapes of tensors on different
55+
devices.
56+
57+
When running scripts in a distributed fashion, functions such as [`Accelerator.gather`] and [`Accelerator.reduce`] are
58+
necessary to grab tensors across devices to perform operations on them collectively. These (and other) functions rely on
59+
`torch.distributed` performing a `gather` operation, which requires that tensors have the **exact same shape** across all processes.
60+
When the tensor shapes don't match, you will experience hanging code, and eventually hit a timeout exception.
61+
62+
If you suspect this to be the case, use Accelerate's operational debug mode to immediately catch the issue.
63+
64+
The recommended way to enable Accelerate's operational debug mode is during `accelerate config` setup.
65+
Alternative ways to enable debug mode are:
66+
67+
* From the CLI:
68+
69+
```bash
70+
accelerate launch --debug {my_script.py} --arg1 --arg2
71+
```
72+
73+
* As an environment variable (which avoids the need for `accelerate launch`):
74+
75+
```bash
76+
ACCELERATE_DEBUG_MODE="1" torchrun {my_script.py} --arg1 --arg2
77+
```
78+
79+
* Manually changing the `config.yaml` file:
80+
81+
```diff
82+
compute_environment: LOCAL_MACHINE
83+
+debug: true
84+
```
85+
86+
Once you enable debug mode, you should get a traceback similar to the following, which points to the tensor shape mismatch:
87+
88+
```py
89+
Traceback (most recent call last):
90+
File "/home/zach_mueller_huggingface_co/test.py", line 18, in <module>
91+
main()
92+
File "/home/zach_mueller_huggingface_co/test.py", line 15, in main
93+
broadcast_tensor = broadcast(tensor)
94+
File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/utils/operations.py", line 303, in wrapper
95+
accelerate.utils.operations.DistributedOperationException:
96+
97+
Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.
98+
99+
Operation: `accelerate.utils.operations.broadcast`
100+
Input shapes:
101+
- Process 0: [1, 5]
102+
- Process 1: [1, 2, 5]
103+
```
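
The check behind this error can be pictured with a small, library-free sketch. In a real run the shapes have to be collected from every process before they can be compared. Here they are passed in as a plain list, and `describe_shape_mismatch` is a hypothetical helper, not part of 🤗 Accelerate.

```python
def describe_shape_mismatch(operation, shapes_by_process):
    """Return a report if the shapes differ across processes, otherwise None."""
    distinct_shapes = {tuple(shape) for shape in shapes_by_process}
    if len(distinct_shapes) <= 1:
        return None  # every process holds a tensor of the same shape
    lines = [
        "Cannot apply desired operation due to shape mismatches.",
        f"Operation: `{operation}`",
        "Input shapes:",
    ]
    for rank, shape in enumerate(shapes_by_process):
        lines.append(f"  - Process {rank}: {list(shape)}")
    return "\n".join(lines)


# The shapes from the traceback above: process 1 has an extra dimension.
print(describe_shape_mismatch("broadcast", [[1, 5], [1, 2, 5]]))
```

Failing fast with a report like this is what turns a silent hang into an actionable error.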
104+
105+
### Early stopping leads to hanging
106+
107+
When doing early stopping in distributed training, if each process has a specific stopping condition (e.g. validation loss),
108+
it may not be synchronized across all of them. As a result, a break can happen on process 0 but not on process 1.
109+
This will cause the code to hang until a timeout occurs.
110+
111+
If you have early stopping conditionals, use the `set_breakpoint` and `check_breakpoint` methods to make sure all the processes
112+
are ended correctly:
113+
114+
```py
115+
# Assume `should_do_breakpoint` is a custom defined function that returns a conditional,
116+
# and that conditional might be true only on process 1
117+
if should_do_breakpoint(loss):
118+
    accelerator.set_breakpoint()
119+
120+
# Later in the training script when we need to check for the breakpoint
121+
if accelerator.check_breakpoint():
122+
    break
123+
```
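
Why a shared flag avoids the hang can be seen in a single-process simulation. The sketch below assumes the flag is combined across processes with a sum-style reduction, so every process sees the same decision. All function names and the loss values are made up for illustration, and no distributed backend is involved.

```python
def synchronized_stop(local_flags):
    """Emulate a sum-reduction of per-process flags: stop if any process asked to."""
    return sum(1 for flag in local_flags if flag) > 0


def steps_at_which_processes_stop(losses_per_process, threshold, synchronize):
    """Return, per process, the step index at which it leaves the training loop."""
    stopped_at = [None] * len(losses_per_process)
    for step in range(len(losses_per_process[0])):
        # Each process evaluates its own stopping condition on its own loss.
        flags = [losses[step] < threshold for losses in losses_per_process]
        if synchronize:
            # Every process receives the same, combined decision.
            flags = [synchronized_stop(flags)] * len(flags)
        for rank, flag in enumerate(flags):
            if flag and stopped_at[rank] is None:
                stopped_at[rank] = step
    return stopped_at


losses = [[0.9, 0.6, 0.4], [0.9, 0.4, 0.3]]  # process 0, process 1
print(steps_at_which_processes_stop(losses, threshold=0.5, synchronize=False))  # [2, 1]
print(steps_at_which_processes_stop(losses, threshold=0.5, synchronize=True))   # [1, 1]
```

In the unsynchronized case process 1 leaves the loop one step before process 0. In a real run process 0 would then wait in its next collective operation for a peer that is no longer there.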
124+
125+
### Hanging on low kernel versions on Linux
126+
127+
This is a known issue. On Linux with kernel version < 5.5, hanging processes have been reported. To avoid
128+
encountering this problem, we recommend upgrading your system to a later kernel version.
129+
130+
## CUDA out of memory
131+
132+
One of the most frustrating errors when it comes to running training scripts is hitting "CUDA Out-of-Memory",
133+
as the entire script needs to be restarted, progress is lost, and typically a developer would want to simply
134+
start their script and let it run.
135+
136+
To address this problem, 🤗 Accelerate offers the `find_executable_batch_size` utility, which is heavily based on [toma](https://github.com/BlackHC/toma).
137+
The utility retries code that fails due to OOM (out-of-memory) conditions and lowers batch sizes automatically.
138+
139+
### find_executable_batch_size
140+
141+
This algorithm operates with exponential decay, halving the batch size after each failed run of the
142+
training script. To use it, restructure your training function to include an inner function decorated with this utility,
143+
and build your dataloaders inside it. At a minimum, this could look like 4 new lines of code.
144+
145+
<Tip warning={true}>
146+
147+
The inner function *must* take in the batch size as the first parameter, but we do not pass one to it when called. The wrapper handles this for us.
148+
149+
</Tip>
150+
151+
It should also be noted that anything which will consume CUDA memory and is passed to the `accelerator` **must** be declared inside the inner function,
152+
such as models and optimizers.
153+
154+
```diff
155+
def training_function(args):
156+
    accelerator = Accelerator()
157+
158+
+   @find_executable_batch_size(starting_batch_size=args.batch_size)
159+
+   def inner_training_loop(batch_size):
160+
+       nonlocal accelerator # Ensure they can be used in our context
161+
+       accelerator.free_memory() # Free all lingering references
162+
        model = get_model()
163+
        model.to(accelerator.device)
164+
        optimizer = get_optimizer()
165+
        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
166+
        lr_scheduler = get_scheduler(
167+
            optimizer,
168+
            num_training_steps=len(train_dataloader)*num_epochs
169+
        )
170+
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
171+
            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
172+
        )
173+
        train(model, optimizer, train_dataloader, lr_scheduler)
174+
        validate(model, eval_dataloader)
175+
+   inner_training_loop()
176+
```
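
The retry-and-halve idea can be sketched without any GPU. The decorator below is a simplified illustration rather than the library's implementation: `halve_batch_size_on_oom` is a hypothetical name, and Python's built-in `MemoryError` stands in for a CUDA out-of-memory error.

```python
import functools


def halve_batch_size_on_oom(starting_batch_size):
    """Retry the wrapped function, halving the batch size after each simulated OOM."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size > 0:
                try:
                    # The wrapper, not the caller, supplies the first argument.
                    return function(batch_size, *args, **kwargs)
                except MemoryError:
                    batch_size //= 2
            raise RuntimeError("No executable batch size found, reached zero.")

        return wrapper

    return decorator


attempts = []


@halve_batch_size_on_oom(starting_batch_size=128)
def train(batch_size):
    attempts.append(batch_size)
    if batch_size > 32:  # pretend that anything above 32 does not fit in memory
        raise MemoryError("simulated out-of-memory")
    return batch_size


print(train(), attempts)  # 32 [128, 64, 32]
```

As with the real utility, the decorated function is called without a batch size, and everything that consumes memory should be created inside it so that each retry starts fresh.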
177+
178+
To find out more, check the documentation [here](../package_reference/utilities#accelerate.find_executable_batch_size).
179+
180+
## Non-reproducible results between device setups
181+
182+
If you have changed the device setup and are observing different model performance, this is likely due to the fact that
183+
you have not updated your script when moving from one setup to another. The same script with the same batch size across TPU,
184+
multi-GPU, and single-GPU with Accelerate will have different results.
185+
186+
For example, if you were previously training on a single GPU with a batch size of 16, when moving to a two-GPU setup,
187+
you need to change the batch size to 8 to have the same effective batch size. This is because when training with Accelerate,
188+
the batch size passed to the dataloader is the **batch size per GPU**.
189+
190+
To make sure you can reproduce the results between the setups, make sure to use the same seed, adjust the batch size
191+
accordingly, and consider scaling the learning rate.
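
The batch size arithmetic can be written down as a small helper. This is a sketch with hypothetical function names. It assumes one process per GPU and no gradient accumulation.

```python
def effective_batch_size(per_device_batch_size, num_processes):
    """Number of samples that contribute to one optimizer step across all devices."""
    return per_device_batch_size * num_processes


def per_device_batch_size(target_effective_batch_size, num_processes):
    """Batch size to pass to the dataloader so the effective batch size is preserved."""
    if target_effective_batch_size % num_processes != 0:
        raise ValueError("The effective batch size must be divisible by the number of processes.")
    return target_effective_batch_size // num_processes


# Single GPU with batch size 16, then the same effective batch size on two GPUs.
print(effective_batch_size(16, num_processes=1))   # 16
print(per_device_batch_size(16, num_processes=2))  # 8
print(effective_batch_size(8, num_processes=2))    # 16
```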
192+
193+
For more details and a quick reference for batch sizes, check out the [Comparing performance between different device setups](../concept_guides/performance) guide.
194+
195+
## Performance issues on different GPUs
196+
197+
If your multi-GPU setup consists of different GPUs, you may hit some limitations:
198+
199+
- There may be an imbalance in GPU memory between the GPUs. In this case, the GPU with the least memory will limit the batch size or the size of the model that can be loaded onto the GPUs.
200+
- If you are using GPUs with different performance profiles, the performance will be driven by the slowest GPU that you are using as the other GPUs will have to wait for it to complete its workload.
201+
202+
Vastly different GPUs within the same setup can lead to performance bottlenecks.
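
A back-of-the-envelope sketch shows the cost of waiting for the slowest GPU. It assumes the devices synchronize once per training step. The function names and timings are made up for illustration.

```python
def synchronized_step_time(per_gpu_step_seconds):
    """With a synchronization point in every step, the slowest GPU sets the pace."""
    return max(per_gpu_step_seconds)


def idle_fractions(per_gpu_step_seconds):
    """Fraction of each step that every GPU spends waiting for the slowest one."""
    step_time = synchronized_step_time(per_gpu_step_seconds)
    return [1 - seconds / step_time for seconds in per_gpu_step_seconds]


# Hypothetical timings: a fast GPU (0.10 s per step) paired with a slow one (0.25 s per step).
timings = [0.10, 0.25]
print(synchronized_step_time(timings))
print(idle_fractions(timings))
```

With these numbers the faster GPU is idle for more than half of every step, which is the bottleneck described above.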
203+
204+
## Ask for help
205+
206+
If the above troubleshooting tools and advice did not help you resolve your issue, reach out for help to the community
207+
and the team.
208+
209+
### Forums
210+
211+
Ask for help on the Hugging Face forums - post your question in the [🤗Accelerate category](https://discuss.huggingface.co/c/accelerate/18).
212+
Make sure to write a descriptive post with relevant context about your setup and reproducible code to maximize the likelihood that your problem is solved!
213+
214+
### Discord
215+
216+
Post a question on [Discord](http://hf.co/join/discord), and let the team and the community help you.
217+
218+
### GitHub Issues
219+
220+
Create an Issue on the 🤗 Accelerate [GitHub repository](https://github.com/huggingface/accelerate/issues) if you suspect
221+
that you have found a bug related to the library. Include context regarding the bug and details about your distributed setup
222+
to help us better figure out what's wrong and how we can fix it.

docs/source/package_reference/logging.md

+2-18
Original file line numberDiff line numberDiff line change
@@ -15,23 +15,7 @@ rendered properly in your Markdown viewer.
1515

1616
# Logging with Accelerate
1717

18-
Accelerate has its own logging utility to handle logging while in a distributed system.
19-
To utilize this replace cases of `logging` with `accelerate.logging`:
20-
```diff
21-
- import logging
22-
+ from accelerate.logging import get_logger
23-
- logger = logging.getLogger(__name__)
24-
+ logger = get_logger(__name__)
25-
```
26-
27-
## Setting the log level
28-
29-
The log level can be set with the `ACCELERATE_LOG_LEVEL` environment variable or by passing
30-
`log_level` to `get_logger`:
31-
```python
32-
from accelerate.logging import get_logger
33-
34-
logger = get_logger(__name__, log_level="INFO")
35-
```
18+
Refer to the [Troubleshooting guide](../basic_tutorials/troubleshooting#logging) or to the example below to learn
19+
how to use 🤗 Accelerate's logger.
3620

3721
[[autodoc]] logging.get_logger

docs/source/usage_guides/debug.md

-93
This file was deleted.
