Fixed Out of Memory issue when processing large datasets #49

Merged 1 commit into simplescaling:main on Feb 12, 2025

Conversation

@peidaqi (Contributor) commented on Feb 12, 2025

The collect_data.py script will crash when processing large datasets if there is not enough memory, especially datasets larger than 1 GB (LiveCodeBench, MATH, USACO).

Added an option to set a smaller writer_batch_size for these datasets.

Tested on a machine with 24 GB of RAM: the default writer_batch_size of 1000 crashes, while lowering it to 200 works.
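
For context, here is a minimal sketch of how a smaller writer_batch_size can be threaded through to a Hugging Face `datasets` map() call. The `--writer_batch_size` flag name, the placeholder dataset path, and the identity map function are illustrative assumptions, not the exact contents of this patch:

```python
import argparse
from datasets import load_dataset  # Hugging Face `datasets`

def main():
    parser = argparse.ArgumentParser()
    # Hypothetical flag name; the merged patch may expose this option differently.
    parser.add_argument("--writer_batch_size", type=int, default=1000,
                        help="Rows buffered per Arrow write; lower (e.g. 200) for datasets >1 GB.")
    args = parser.parse_args()

    # Placeholder dataset path, for illustration only.
    ds = load_dataset("org/some-large-dataset", split="train")

    # writer_batch_size limits how many processed rows are held in memory
    # before being flushed to the on-disk Arrow cache during map().
    ds = ds.map(lambda example: example, writer_batch_size=args.writer_batch_size)

if __name__ == "__main__":
    main()
```

Lowering writer_batch_size trades some processing speed for a smaller peak memory footprint, since fewer rows are buffered before each flush to disk.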

@Muennighoff merged commit 678550f into simplescaling:main on Feb 12, 2025