Commit

nanogpt: allow multithreading in load dataset
okuvshynov committed Jun 17, 2023
1 parent 7339b90 commit bb7e967
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion data/openwebtext/prepare.py
@@ -11,8 +11,13 @@
 # good number to use is ~order number of cpu cores // 2
 num_proc = 8

+# number of workers in load_dataset() call
+# best number might be different from num_proc above as it also depends on NW speed.
+# it is better than 1 usually though
+num_proc_load_dataset = num_proc
+
 # takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
-dataset = load_dataset("openwebtext")
+dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)

 # owt by default only contains the 'train' split, so create a test split
 split_dataset = dataset["train"].train_test_split(test_size=0.0005, seed=2357, shuffle=True)
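
The change above hard-codes both worker counts to 8. As a rough sketch only (not part of this commit), the "cpu cores // 2" rule of thumb from the comment could be derived from the machine instead. `pick_num_proc` and `load_openwebtext` are hypothetical names introduced here for illustration; the only library call assumed is `load_dataset(..., num_proc=...)`, the same call the diff uses.

```python
import os


def pick_num_proc(cpu_count=None):
    """Return roughly half the available CPU cores, never less than 1.

    Hypothetical helper: it only encodes the "cpu cores // 2" rule of
    thumb from the comment in prepare.py.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpu_count // 2)


def load_openwebtext(num_proc_load_dataset):
    """Load openwebtext with parallel workers (large download, ~54GB cache)."""
    # Imported lazily so pick_num_proc() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("openwebtext", num_proc=num_proc_load_dataset)


if __name__ == "__main__":
    num_proc = pick_num_proc()
    print(f"would call load_dataset with num_proc={num_proc}")
    # dataset = load_openwebtext(num_proc)  # uncomment to actually download
```

As the commit's comment notes, the best value for the download step also depends on network speed, so half the core count is only a starting point and may be worth tuning separately from the tokenization `num_proc`.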