According to the readme, input in the partition column for a custom dataset should be of the type 'training', 'validation', 'test', which I can't get to yield a partition:
Make sure that the dataset is in the following format:
corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
vocabulary: a .txt file where each line represents a word of the vocabulary
The partition can be "training", "test" or "validation". An example of dataset can be found here: sample_dataset_.
However, it seems the right format is 'train', 'val', 'test', which does work for me - just passing this on to make the ReadMe clearer.
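As a possible workaround until the ReadMe is updated, a corpus that already uses the ReadMe's labels could be relabeled before loading. This is only a sketch: `LABEL_MAP` and `normalize_partitions` are hypothetical names of my own, not part of the library, and the only assumption taken from the loader is that the corpus is read without a header so the partition sits in column 1.

```python
import pandas as pd

# Hypothetical mapping from the ReadMe's labels to the ones that worked for me.
LABEL_MAP = {"training": "train", "validation": "val"}


def normalize_partitions(df):
    """Return a copy of a header-less corpus frame with column 1 relabeled."""
    out = df.copy()
    if len(out.columns) > 1:
        # A dict passed to Series.replace maps whole cell values; "test" is untouched.
        out[1] = out[1].replace(LABEL_MAP)
    return out


corpus = pd.DataFrame({0: ["a b", "c d", "e f"],
                       1: ["training", "validation", "test"]})
fixed = normalize_partitions(corpus)
print(fixed[1].tolist())  # ['train', 'val', 'test']
```

The relabeled frame could then be written back with `fixed.to_csv(path, sep='\t', header=False, index=False)` before pointing the loader at the folder.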
def load_custom_dataset_from_folder(self, path):
    """
    Loads all the dataset from a folder
    Parameters
    ----------
    path : path of the folder to read
    """
    self.dataset_path = path
    try:
        if exists(self.dataset_path + "/metadata.json"):
            self._load_metadata(self.dataset_path + "/metadata.json")
        else:
            self.__metadata = dict()
        df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
        if len(df.keys()) > 1:
            df[1] = df[1].replace("train", "a_train")
            df[1] = df[1].replace("val", "b_val")
            df = df.sort_values(1).reset_index(drop=True)
            self.__metadata['last-training-doc'] = len(df[df[1] == 'a_train'])
            self.__metadata['last-validation-doc'] = len(df[df[1] == 'b_val']) + len(df[df[1] == 'a_train'])
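To show why the ReadMe's labels yield no partition, here is a small standalone sketch that repeats only the replace, sort, and count steps from the snippet above on a toy frame. `partition_boundaries` is a hypothetical helper written for this illustration, not a library function. The key point is that `Series.replace` with a plain string matches whole cell values, so "training" is never rewritten to "a_train" and both counts come out as 0.

```python
import pandas as pd


def partition_boundaries(labels):
    """Repeat the loader's replace/sort/count steps on a list of partition labels."""
    df = pd.DataFrame({0: [f"doc {i}" for i in range(len(labels))], 1: labels})
    # Whole-value match: "train" -> "a_train", but "training" stays as it is.
    df[1] = df[1].replace("train", "a_train")
    df[1] = df[1].replace("val", "b_val")
    df = df.sort_values(1).reset_index(drop=True)
    n_train = len(df[df[1] == "a_train"])
    n_val = len(df[df[1] == "b_val"])
    return {"last-training-doc": n_train, "last-validation-doc": n_train + n_val}


short_labels = partition_boundaries(["train", "test", "val", "train"])
readme_labels = partition_boundaries(["training", "test", "validation", "training"])
print(short_labels)   # {'last-training-doc': 2, 'last-validation-doc': 3}
print(readme_labels)  # {'last-training-doc': 0, 'last-validation-doc': 0}
```

With 'train'/'val'/'test' the boundaries fall where expected; with 'training'/'validation'/'test' both boundaries are 0, which matches what I see.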