This repository has been archived by the owner on Apr 5, 2023. It is now read-only.

openwebtext dataset

after running prepare.py (the preprocessing script) we get:

  • train.bin is ~17GB, val.bin ~8.5MB
  • train has ~9B tokens (9,035,582,198)
  • val has ~4M tokens (4,434,897)

this came from 8,013,769 documents in total.
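
as a rough sanity check on the numbers above, the file sizes can be derived from the token counts. This sketch assumes each token id is stored as a 2-byte unsigned integer (uint16); that storage width is an assumption not stated in this section, and the helper name `expected_size_bytes` is hypothetical. It also assumes the document count covers train and val together.

```python
# Token and document counts quoted in this README.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
TOTAL_DOCS = 8_013_769

# Assumption (not stated above): one token id occupies 2 bytes (uint16).
BYTES_PER_TOKEN = 2


def expected_size_bytes(num_tokens, bytes_per_token=BYTES_PER_TOKEN):
    """Return the size in bytes of a flat binary file holding num_tokens ids."""
    return num_tokens * bytes_per_token


train_bytes = expected_size_bytes(TRAIN_TOKENS)
val_bytes = expected_size_bytes(VAL_TOKENS)

# Report in binary units (GiB = 2**30 bytes, MiB = 2**20 bytes).
print(f"train.bin: {train_bytes / 2**30:.1f} GiB")
print(f"val.bin:   {val_bytes / 2**20:.1f} MiB")

# Average document length, assuming the document count spans both splits.
avg_tokens_per_doc = (TRAIN_TOKENS + VAL_TOKENS) / TOTAL_DOCS
print(f"avg tokens per document: {avg_tokens_per_doc:.0f}")
```

under these assumptions the script reports about 16.8 GiB and 8.5 MiB, which lines up with the ~17GB and ~8.5MB figures if those are read as binary units, and an average of roughly 1,128 tokens per document.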

references: