Name | Source | # Docs | # Words | # Labels |
---|---|---|---|---|
20NewsGroup | 20Newsgroup | 16309 | 1612 | 20 |
BBC_News | BBC-News | 2225 | 2949 | 5 |
DBLP | DBLP | 54595 | 1513 | 4 |
M10 | M10 | 8355 | 1696 | 10 |
Load one of the preprocessed datasets as follows:
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")
Use one of the dataset names listed in the table above. Note: names are case-sensitive!
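Because the names are case-sensitive, it can help to check a name before fetching. The following is a minimal sketch and not part of OCTIS: `PREPROCESSED_DATASETS` and `resolve_dataset_name` are hypothetical names, and the tuple only contains the datasets listed in the table above.

```python
# Names copied from the table above. This helper is hypothetical, not an OCTIS API.
PREPROCESSED_DATASETS = ("20NewsGroup", "BBC_News", "DBLP", "M10")


def resolve_dataset_name(name: str) -> str:
    """Return `name` if it exactly matches a listed dataset, else raise with a hint."""
    if name in PREPROCESSED_DATASETS:
        return name
    # Look for a listed name that differs only in letter case.
    by_lower = {known.lower(): known for known in PREPROCESSED_DATASETS}
    hint = by_lower.get(name.lower())
    if hint is not None:
        raise ValueError(f"Unknown dataset {name!r}; did you mean {hint!r}?")
    raise ValueError(
        f"Unknown dataset {name!r}; expected one of {PREPROCESSED_DATASETS}"
    )
```

The returned name can then be passed to `dataset.fetch_dataset(...)` as shown above.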
Alternatively, you can load a custom preprocessed dataset as follows:
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("../path/to/the/dataset/folder")
Make sure that the dataset is in the following format:
- corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
- vocabulary: a .txt file where each line represents a word of the vocabulary
The partition can be "training", "test" or "validation". An example dataset can be found here: sample_dataset.
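The format described above can be produced with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the file names `corpus.tsv` and `vocabulary.txt` are assumed here (check the sample dataset for the names the loader expects), and `write_custom_dataset` is a hypothetical helper, not an OCTIS function.

```python
from pathlib import Path

ALLOWED_PARTITIONS = {"training", "test", "validation"}


def write_custom_dataset(folder, documents, vocabulary):
    """Write a corpus file and a vocabulary file into `folder`.

    documents: iterable of (text, partition, label) tuples.
    vocabulary: iterable of words, written one per line.
    File names are assumptions; see the lead-in above.
    """
    lines = []
    for text, partition, label in documents:
        if partition not in ALLOWED_PARTITIONS:
            raise ValueError(f"Invalid partition: {partition!r}")
        # A tab or newline inside a document would break the column layout.
        if "\t" in text or "\n" in text:
            raise ValueError("Documents must not contain tabs or newlines")
        lines.append("\t".join((text, partition, label)))

    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "corpus.tsv").write_text("\n".join(lines) + "\n", encoding="utf-8")
    (folder / "vocabulary.txt").write_text(
        "\n".join(vocabulary) + "\n", encoding="utf-8"
    )
    return folder


if __name__ == "__main__":
    import tempfile

    docs = [
        ("stock market falls", "training", "business"),
        ("team wins final", "test", "sport"),
    ]
    vocab = ["stock", "market", "falls", "team", "wins", "final"]
    out = write_custom_dataset(tempfile.mkdtemp(), docs, vocab)
    print((out / "corpus.tsv").read_text(encoding="utf-8"))
```

The resulting folder path can then be passed to `dataset.load_custom_dataset_from_folder(...)`.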