Rep_removal for large data files crashes on 16GB memory #1035

shahrokhDaijavad · 2025-02-10T22:25:17Z

Search before asking

I searched the issues and found no similar issues.

Component

Other, Transforms/Other

What happened + What you expected to happen

import os
from dpk_rep_removal.runtime import RepRemoval

# file1 is the local path returned by hf_hub_download in the reproduction script below
RepRemoval(input_folder=os.path.dirname(file1),
           output_folder="files-rep_removal",
           rep_removal_contents_column_name='text',
           rep_removal_num_threads=1,
           ).transform()

12:11:53 INFO - pipeline id pipeline_id
12:11:53 INFO - code location None
12:11:53 INFO - data factory data_ is using local data access: input_folder - /Users/shahrokhdaijavad/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - files-rep_removal
12:11:53 INFO - data factory data_ max_files -1, n_sample -1
12:11:53 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:11:53 INFO - orchestrator rep_removal started at 2025-02-10 12:11:53
12:11:53 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}
12:12:16 INFO - encoding parquet
12:51:53 INFO - making suffix array
12:51:53 INFO - Starting the deduplication process for file: /var/folders/7f/dcj_kvt1153fqpphsj6jj8w40000gn/T/tmp2_wwama8/save_dir/parquet

cpu speed: 3228 MHz, Cores: 10

12:51:53 INFO - timeout is: 45743.31654275093
12:51:53 INFO - Scheduling 96 jobs to create dataset parts.

gpu_usage: 0.00%, GPU speed: 0 MHz

Reproduction script

Run the following on a Mac M1 with 16GB memory:

import os
from huggingface_hub import hf_hub_download
from dpk_rep_removal.runtime import RepRemoval

# Download one FineWeb parquet file (about 2 GB according to the log above)
REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"
file1 = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

# Run rep_removal directly on the folder containing the downloaded file
RepRemoval(input_folder=os.path.dirname(file1),
           output_folder="files-rep_removal",
           rep_removal_contents_column_name='text',
           rep_removal_num_threads=1,
           ).transform()

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!


shivdeep-singh-ibm · 2025-02-12T15:08:31Z

Since we run out of memory while using this transform, we should run two transforms in sequence:

a) Resize
b) RepRemoval

We can choose resize_max_rows_per_table so that it does not cause an OOM error. Something like:

import os
from huggingface_hub import hf_hub_download
from dpk_resize.runtime import Resize
from dpk_rep_removal.runtime import RepRemoval

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"
file1 = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

# Step a) split the large parquet file into smaller tables.
# 1000 rows is an example value; tune it to fit the available memory.
Resize(input_folder=os.path.dirname(file1),
       output_folder="output",
       resize_max_rows_per_table=1000).transform()

# Step b) run rep_removal on the resized files
RepRemoval(input_folder="output",
           output_folder="files-rep_removal",
           rep_removal_contents_column_name='text',
           rep_removal_num_threads=1,
           ).transform()
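As a rough aid for choosing resize_max_rows_per_table, here is a minimal sketch that estimates a row cap from a file's size and row count. It is not part of data-prep-kit: the helper name max_rows_for_target_size, the row count, and the 64 MiB target are all hypothetical, and the estimate assumes rows are roughly uniform in size, which web text often is not. On-disk parquet size also understates in-memory size, so treat the result as a starting point and lower it if the OOM persists.

```python
def max_rows_for_target_size(total_rows: int, total_bytes: int, target_bytes: int) -> int:
    """Estimate a row cap so each output table is about target_bytes or smaller.

    Hypothetical helper for illustration; assumes roughly uniform row sizes.
    """
    if total_rows <= 0 or total_bytes <= 0 or target_bytes <= 0:
        raise ValueError("all arguments must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, (target_bytes * total_rows) // total_bytes)


# Hypothetical numbers: a 2 GiB file with 1,000,000 rows, targeting ~64 MiB per table.
rows = max_rows_for_target_size(
    total_rows=1_000_000,
    total_bytes=2 * 1024**3,
    target_bytes=64 * 1024**2,
)
print(rows)  # 31250
```

The resulting value could then be passed as resize_max_rows_per_table in the Resize call above. The real row count would have to be read from the parquet file's metadata.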

shahrokhDaijavad · 2025-02-12T15:11:47Z

Thank you, @shivdeep-singh-ibm ! We talked about this exact solution yesterday. Thanks for spelling it out.

@Hajar-Emami This is exactly what we were talking about yesterday!
If you add resize to the list of data-prep-toolkit-transforms that you pip install, then you can use it exactly as Shivdeep has shown above to split the file into smaller tables (e.g., 1000 rows each) before rep_removal.

Hajar-Emami · 2025-02-12T19:04:55Z

Many thanks, @shivdeep-singh-ibm. Yes, as we discussed with @shahrokhDaijavad and @touma-I, we should include the Resize step before running any of the GneissWeb recipe's components.

agoyal26 · 2025-03-06T07:17:27Z

Should we add this to the official documentation for the benefit of users?

shahrokhDaijavad · 2025-03-06T20:13:09Z

Sure, @agoyal26. I submitted PR #1106 to mention this. Please approve.

shahrokhDaijavad added the bug label Feb 10, 2025

shahrokhDaijavad mentioned this issue Mar 6, 2025

Update README.md for rep_removal to mention memory requirement #1106

Open
