vocabulary: a .txt for custom dataset #92

Closed
SaraAmd opened this issue Feb 1, 2023 · 1 comment

Comments

@SaraAmd

SaraAmd commented Feb 1, 2023

How can we generate a vocabulary file from our CSV/TSV dataset?

@silviatti
Collaborator

Hi, you can load the TSV file, split each document into words on whitespace, and keep only the unique words. Like this:

import pandas as pd

# dataset_path is the folder that contains corpus.tsv; the text is in the first column
df = pd.read_csv(dataset_path + "/corpus.tsv", sep='\t', header=None)
vocabulary = set()
for document in df[0].tolist():
    for word in document.split():
        vocabulary.add(word)
# Write one word per line
with open(dataset_path + "/vocabulary.txt", 'w') as fw:
    for word in vocabulary:
        fw.write(word + "\n")
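
If you would rather not depend on pandas, the same idea can be sketched with the standard library's csv module. This is only a minimal sketch under a few assumptions: the text sits in the first column, one word per line is the layout you want in vocabulary.txt, and the helper names build_vocabulary and write_vocabulary are made up for this example (they are not part of the library). The words are sorted so the output file is the same on every run.

```python
import csv
from pathlib import Path


def build_vocabulary(corpus_path, text_column=0, delimiter="\t"):
    """Return the sorted unique whitespace-separated words from one column.

    Hypothetical helper for illustration; pass delimiter="," for a CSV file.
    """
    vocabulary = set()
    with open(corpus_path, newline="", encoding="utf-8") as fr:
        for row in csv.reader(fr, delimiter=delimiter):
            # Skip empty or short rows that have no text column
            if len(row) <= text_column:
                continue
            vocabulary.update(row[text_column].split())
    return sorted(vocabulary)


def write_vocabulary(words, vocabulary_path):
    """Write one word per line (assumed layout for the vocabulary file)."""
    text = "".join(word + "\n" for word in words)
    Path(vocabulary_path).write_text(text, encoding="utf-8")
```

Usage would look like write_vocabulary(build_vocabulary(dataset_path + "/corpus.tsv"), dataset_path + "/vocabulary.txt"), with dataset_path pointing at your dataset folder as in the snippet above.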

Best,

Silvia
