
Memory overhead in multiprocessing #161

Open · jordane95 opened this issue Apr 24, 2024 · 8 comments

@jordane95 (Contributor) commented Apr 24, 2024

When using the fasttext filter, I find that the fasttext model is copied by each process, which introduces significant memory overhead. To my knowledge, however, the fasttext model is read-only and could be stored in shared memory accessible to all processes.

Can we optimize the current code to save memory? I find that mp.Manager can create shared memory and avoid the copying, but it is quite hard to integrate into the current code, since the manager is initialized at the executor level and is not passed to each pipeline step.
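
For context, one common alternative to an mp.Manager (whose proxy objects go through a server process and add IPC overhead) is to load the model once in the parent and rely on the "fork" start method's copy-on-write, so child processes inherit it without duplicating it. Below is a minimal, hypothetical sketch outside of datatrove's API; the model path is an example:

```python
# Not datatrove's actual code: a hypothetical sketch of sharing a large,
# read-only fasttext model across workers by loading it once in the parent
# and forking (Unix only), so children inherit it via copy-on-write instead
# of each loading its own copy. CPython reference counting can still dirty
# some pages, but the bulk of the model lives in C++ memory and is not copied.
import multiprocessing as mp

import fasttext

_MODEL = None  # set in the parent before the pool is forked


def _classify(text: str) -> str:
    # _MODEL is available here without re-loading, because the worker
    # process was forked from a parent that already had it in memory.
    labels, _scores = _MODEL.predict(text)
    return labels[0]


if __name__ == "__main__":
    _MODEL = fasttext.load_model("lid.176.bin")  # example model path
    with mp.get_context("fork").Pool(processes=8) as pool:
        print(pool.map(_classify, ["hello world", "bonjour le monde"]))
```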

@guipenedo (Collaborator)

Indeed, there might be some complications. I would be curious, however, to know what the speed implications of loading the model from shared memory would be. Have you tested this?

@justHungryMan (Contributor) commented May 24, 2024

I have a question regarding memory overhead. I created and ran an executor designed to count tokens on approximately 2 TB of text (jsonl), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills the 256 GB I have available, and after it gets stuck, CPU usage drops from 99% to 0%.

The problem is that there are no error messages in the log, which makes the issue impossible to diagnose. Does anyone have suggestions on how to address this? I suspect this might be a memory overhead issue.

@SinclairCoder

Hi, is your problem solved now? I have also encountered similar issues (unexpected OOMs resulting in failed jobs). I suspect the source of the unexpected OOMs may be a few documents with very long contexts.

@Pclanglais

Same issue here. I applied a 20k-word limit per document beforehand, which solved most of it, but I still get a few OOMs (possibly also due to the size of specific ingestion files?). It would be nice to circumvent this, as it fails silently…
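
For reference, a minimal sketch of such a limit as a pipeline step, assuming datatrove's LambdaFilter behaves as documented (class and argument names may differ between versions; the 20k threshold follows the comment above):

```python
# Hypothetical sketch: drop documents longer than ~20k whitespace-separated
# words before tokenization, so a handful of huge documents cannot blow up
# memory in downstream steps.
from datatrove.pipeline.filters import LambdaFilter

max_words_filter = LambdaFilter(
    filter_function=lambda doc: len(doc.text.split()) <= 20_000,
)
# Insert `max_words_filter` into the pipeline before the tokenization step.
```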

@Pclanglais

So I found a fix: the best option is to lower the tokenizer batch size. 1000 runs fine (or go even lower to use all available CPUs on long texts).

Actually, in tokenizer.py 1000 was meant to be the default value:

`batch_size (int): batch size for tokenization (default: 1000)`

while we actually have:

`batch_size: int = 10000,  # batch size for tokenization`

Maybe bringing back 1000 would be safer?
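
Until the default changes, a sketch of overriding it when building the pipeline; this assumes the class and argument names from the datatrove docs (DocumentTokenizer, output_folder, batch_size), which may differ between versions, and all paths and counts are placeholders:

```python
# Hedged sketch: pass a smaller batch_size to the tokenization step so that
# fewer documents are buffered per batch, lowering peak memory.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer

if __name__ == "__main__":
    executor = LocalPipelineExecutor(
        pipeline=[
            JsonlReader("data/"),            # example input path
            DocumentTokenizer(
                output_folder="tokenized/",  # example output path
                batch_size=1000,             # lower than the 10000 default discussed above
            ),
        ],
        tasks=8,
    )
    executor.run()
```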

@SinclairCoder

My OOM case happened in the text extractor. But I do not know how to fix it. Sad.

@justHungryMan (Contributor)

Reducing workers or batch_size temporarily fixes the memory overflows, but the real issue is that the library cannot detect these problems: the jobs just fail silently. Enhancements are needed here for stable, efficient performance.

@SinclairCoder commented Aug 31, 2024

Got it. Typically I start n tasks to process the data (a pipeline might consist of a WARC reader, URL filter, text extractor, and writer). However, quite a few of the tasks (e.g., half of them) fail due to OOM. I have to rerun the script to resume those tasks, which takes more time and more memory, and I do not know how much memory would be enough to keep them from failing again. I'm struggling with it.

Any suggestions are welcome!

cc @guipenedo
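
One pattern that may help here (hedged sketch, following the executor behavior described in the datatrove README, where completions are tracked under logging_dir and already-finished tasks are skipped on a rerun): cap workers so fewer tasks run concurrently, and rerun the same script with the same logging_dir so only the failed tasks are retried. Paths and counts below are placeholders:

```python
# Hedged sketch of the pipeline described above (WARC reader -> URL filter ->
# text extractor -> writer). `workers` caps how many tasks run at once, which
# bounds peak memory; rerunning with the same logging_dir should retry only
# the tasks that did not complete (e.g. those killed by OOM).
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.filters import URLFilter
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.writers import JsonlWriter

if __name__ == "__main__":
    executor = LocalPipelineExecutor(
        pipeline=[
            WarcReader("warcs/"),        # example input path
            URLFilter(),
            Trafilatura(),
            JsonlWriter("output/"),      # example output path
        ],
        tasks=100,
        workers=16,                      # fewer concurrent tasks -> lower peak memory
        logging_dir="logs/extract",      # completions are tracked here across reruns
    )
    executor.run()
```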

@hynky1999 self-assigned this Nov 4, 2024