Memory overhead in multiprocessing #161
Comments
Indeed, there might be some complications. I would be curious, however, to know the performance (speed) implications of loading the model from shared memory. Have you tested this?
I have a question regarding memory overhead. I created and ran an executor designed to count tokens on approximately 2 TB of text (jsonl), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills the 256 GB I have available, and after getting stuck, CPU usage drops from 99% to 0%. The problem is that there are no error messages in the log, making it impossible to diagnose the issue. Does anyone have suggestions on how to address this?
Hi, is your problem solved now? I encountered similar issues (unexpected OOMs resulting in failed jobs). I also suspect the source of the unexpected OOMs may be a few documents with very long contexts.
Same issue here. Applying a 20k-word limit per document entry beforehand solves most of it, but I still hit a few OOMs (possibly also due to the size of specific ingestion files?). It would be nice to circumvent the issue, as it fails silently…
So I found a fix: the best option is to lower the tokenizer batch size. 1000 runs fine (or go even lower to keep all available CPUs busy on long texts). In tokenizer.py, 1000 seems to have been intended as the default value, whereas the current default is larger. Maybe bringing back 1000 would be safer? A sketch of setting it is below.
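For reference, a minimal sketch of what lowering the batch size can look like in a local pipeline. This is only an illustration: the reader path, output folder, and task/worker counts are made up, and the exact `DocumentTokenizer` signature may differ between datatrove versions, so check tokenizer.py in your install.

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/"),  # illustrative input folder with jsonl files
        DocumentTokenizer(
            output_folder="tokenized/",  # illustrative output folder
            batch_size=1000,  # smaller batches bound per-worker memory on long documents
        ),
    ],
    tasks=64,   # illustrative
    workers=8,  # fewer concurrent workers also lowers peak memory
)
executor.run()
```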
My OOM case happened in the text extractor, but I do not know how to fix it. Sad.
Reducing workers or batch_size works around the memory overflows, but the real issue is that the module cannot detect these problems itself. Better memory accounting is needed for stable, efficient performance.
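As a stopgap until the library reports this itself, a small helper (a sketch assuming psutil is available; the threshold is an arbitrary choice) can be called from a custom pipeline step to log each worker's resident memory, so the logs at least show which worker was ballooning before the kernel killed it.

```python
import logging
import os

import psutil

logger = logging.getLogger("memwatch")
RSS_WARN_BYTES = 8 * 1024**3  # warn above ~8 GiB per worker; tune for your machine


def log_memory(context: str = "") -> None:
    """Log the current process's resident set size and warn when it gets large."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    logger.info("rss=%.2f GiB pid=%d %s", rss / 1024**3, os.getpid(), context)
    if rss > RSS_WARN_BYTES:
        logger.warning("worker is close to the memory budget; consider lowering batch_size or workers")
```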
Got it. Typically, I start n tasks to process data (a pipeline might consist of a WARC reader, URL filter, text extractor, and writer). However, many of these tasks, sometimes more than half, fail due to OOM. I have to rerun the script to resume them, which takes more time and more memory, and I do not know the exact memory budget that would keep them from failing again. I'm struggling with it. Any suggestions are welcome! cc @guipenedo
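One way to at least make these failures visible (a sketch assuming Linux workers, not a datatrove feature): cap each worker's address space with resource.setrlimit so an over-budget task raises MemoryError, which ends up in that task's log, instead of being killed silently by the kernel OOM killer. The 16 GiB cap is just an example value.

```python
import resource


def cap_worker_memory(max_bytes: int = 16 * 1024**3) -> None:
    """Call at the start of a worker process to bound its virtual address space.

    Allocations beyond the cap then fail with MemoryError inside Python,
    which can be caught and logged, rather than triggering the kernel OOM killer.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
```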
When using the fasttext filter, I find that the fasttext model is copied by each process, which introduces significant memory overhead. However, to my knowledge, the fasttext model is read-only and could be stored in shared memory across all processes.
Can we optimize the current code to save memory? I found that mp.Manager can create shared memory and avoid the copies, but it is quite hard to integrate into the current code, as the manager is initialized at the executor level and is not passed to each pipeline step.
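For what it's worth, a minimal sketch (not datatrove code) of the copy-on-write alternative to an explicit mp.Manager: with the fork start method on Linux, loading the fasttext model once in the parent lets every worker read the same memory pages without a physical copy, because the model is never written to after loading. The model path and pool size are illustrative.

```python
import multiprocessing as mp

import fasttext

MODEL = None  # module-level, so forked children inherit the already-loaded model


def load_model(path: str) -> None:
    global MODEL
    MODEL = fasttext.load_model(path)


def classify(text: str):
    # Children reuse the parent's pages copy-on-write: no per-process reload or copy.
    labels, scores = MODEL.predict(text.replace("\n", " "))
    return labels[0], float(scores[0])


if __name__ == "__main__":
    load_model("lid.176.bin")    # illustrative path to a fasttext model
    mp.set_start_method("fork")  # copy-on-write sharing requires fork (Linux)
    with mp.Pool(processes=8) as pool:
        print(pool.map(classify, ["Hello world", "Bonjour le monde"]))
```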