
Memory overhead in multiprocessing #161

Open · jordane95 opened this issue Apr 24, 2024 · 8 comments

@jordane95 (Contributor) commented Apr 24, 2024

When using the fasttext filter, I find that the fasttext model is copied by each process, which introduces significant memory overhead. To my knowledge, however, the fasttext model is read-only and could be stored in shared memory accessible to all processes.

Can we optimize the current code to save memory? I find that mp.Manager can create shared memory and avoid the copying, but it is quite hard to integrate into the current code, since the manager is initialized at the executor level and is not passed to each pipeline step.
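
For context, one common alternative to an mp.Manager (whose proxy objects go through a server process and add IPC overhead) is to load the model once in the parent and rely on the "fork" start method's copy-on-write, so child processes inherit it without duplicating it. Below is a minimal, hypothetical sketch outside of datatrove's API; the model path is an example:

```python
# Not datatrove's actual code: a hypothetical sketch of sharing a large,
# read-only fasttext model across workers by loading it once in the parent
# and forking (Unix only), so children inherit it via copy-on-write instead
# of each loading its own copy. CPython reference counting can still dirty
# some pages, but the bulk of the model lives in C++ memory and is not copied.
import multiprocessing as mp

import fasttext

_MODEL = None  # set in the parent before the pool is forked


def _classify(text: str) -> str:
    # _MODEL is available here without re-loading, because the worker
    # process was forked from a parent that already had it in memory.
    labels, _scores = _MODEL.predict(text)
    return labels[0]


if __name__ == "__main__":
    _MODEL = fasttext.load_model("lid.176.bin")  # example model path
    with mp.get_context("fork").Pool(processes=8) as pool:
        print(pool.map(_classify, ["hello world", "bonjour le monde"]))
```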

@guipenedo (Collaborator)

Indeed, there might be some complications. I would be curious, however, to know what the speed implications of loading the model from shared memory would be. Have you tested this?

@justHungryMan (Contributor) commented May 24, 2024

I have a question regarding memory overhead. I created and ran an executor designed to count tokens on approximately 2 TB of text (jsonl), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills the 256 GB I have available, and after it gets stuck, CPU usage drops from 99% to 0%.

The problem is that there are no error messages in the log, which makes the issue impossible to diagnose. Does anyone have suggestions on how to address this? I suspect this might be a memory overhead issue.

@SinclairCoder

Hi, is your problem solved now? I have also encountered similar issues (unexpected OOMs resulting in failed jobs). I suspect the source of the unexpected OOMs may be a few documents with very long contexts.

@Pclanglais

Same issue here. I applied a 20k-word limit per document beforehand, which solved most of it, but I still get a few OOMs (possibly also due to the size of specific ingestion files?). It would be nice to circumvent this, as it fails silently…
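
For reference, a minimal sketch of such a limit as a pipeline step, assuming datatrove's LambdaFilter behaves as documented (class and argument names may differ between versions; the 20k threshold follows the comment above):

```python
# Hypothetical sketch: drop documents longer than ~20k whitespace-separated
# words before tokenization, so a handful of huge documents cannot blow up
# memory in downstream steps.
from datatrove.pipeline.filters import LambdaFilter

max_words_filter = LambdaFilter(
    filter_function=lambda doc: len(doc.text.split()) <= 20_000,
)
# Insert `max_words_filter` into the pipeline before the tokenization step.
```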

@Pclanglais

So I found a fix: the best option is to lower the tokenizer batch size. 1000 runs fine (or go even lower to use all available CPUs on long texts).

Actually, in tokenizer.py 1000 was meant to be the default value:

`batch_size (int): batch size for tokenization (default: 1000)`

while we actually have:

`batch_size: int = 10000,  # batch size for tokenization`

Maybe bringing back 1000 would be safer?
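
Until the default changes, a sketch of overriding it when building the pipeline; this assumes the class and argument names from the datatrove docs (DocumentTokenizer, output_folder, batch_size), which may differ between versions, and all paths and counts are placeholders:

```python
# Hedged sketch: pass a smaller batch_size to the tokenization step so that
# fewer documents are buffered per batch, lowering peak memory.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer

if __name__ == "__main__":
    executor = LocalPipelineExecutor(
        pipeline=[
            JsonlReader("data/"),            # example input path
            DocumentTokenizer(
                output_folder="tokenized/",  # example output path
                batch_size=1000,             # lower than the 10000 default discussed above
            ),
        ],
        tasks=8,
    )
    executor.run()
```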

@SinclairCoder

My OOM case happened in the text extractor. But I do not know how to fix it. Sad.

@justHungryMan (Contributor)

Reducing workers or batch_size temporarily fixes the memory overflows, but the real issue is that the library cannot detect these problems: the jobs just fail silently. Enhancements are needed here for stable, efficient performance.

@SinclairCoder commented Aug 31, 2024

Got it. Typically I start n tasks to process the data (a pipeline might consist of a WARC reader, URL filter, text extractor, and writer). However, quite a few of the tasks (e.g., half of them) fail due to OOM. I have to rerun the script to resume those tasks, which takes more time and more memory, and I do not know how much memory would be enough to keep them from failing again. I'm struggling with it.

Any suggestions are welcome!

cc @guipenedo
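
One pattern that may help here (hedged sketch, following the executor behavior described in the datatrove README, where completions are tracked under logging_dir and already-finished tasks are skipped on a rerun): cap workers so fewer tasks run concurrently, and rerun the same script with the same logging_dir so only the failed tasks are retried. Paths and counts below are placeholders:

```python
# Hedged sketch of the pipeline described above (WARC reader -> URL filter ->
# text extractor -> writer). `workers` caps how many tasks run at once, which
# bounds peak memory; rerunning with the same logging_dir should retry only
# the tasks that did not complete (e.g. those killed by OOM).
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.filters import URLFilter
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.writers import JsonlWriter

if __name__ == "__main__":
    executor = LocalPipelineExecutor(
        pipeline=[
            WarcReader("warcs/"),        # example input path
            URLFilter(),
            Trafilatura(),
            JsonlWriter("output/"),      # example output path
        ],
        tasks=100,
        workers=16,                      # fewer concurrent tasks -> lower peak memory
        logging_dir="logs/extract",      # completions are tracked here across reruns
    )
    executor.run()
```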

@hynky1999 self-assigned this Nov 4, 2024