Stars
The WordScape repository contains the code for the WordScape pipeline, which creates datasets for training document understanding models.
bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
The RedPajama-Data repository contains code for preparing large datasets for training large language models.