Commit

Made the large corpus even larger
lafrancef committed May 25, 2017
1 parent 8bf495e commit d09af8a
Showing 5 changed files with 391,869 additions and 96,815 deletions.
6 changes: 5 additions & 1 deletion europarl_data.py
@@ -2,6 +2,7 @@
 import nltk
 from unidecode import unidecode
 from collections import Counter
+import sys

 def filter_sentence(s):
     words = nltk.tokenize.word_tokenize(s)
@@ -52,4 +53,7 @@ def build(europath):
     for fname in os.listdir(europath):
         docs.append(create_doc(europath + '/' + fname, word_to_id))

-    write_all(id_to_word, docs)
+    write_all(id_to_word, docs)
+
+if __name__ == '__main__':
+    build(sys.argv[1])
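
The new entry point reads the corpus directory straight from sys.argv[1], so running the script without an argument raises an IndexError. Below is a minimal sketch of a more defensive variant, assuming argparse is acceptable here. The names parse_args, main, and build_fn are hypothetical and not part of this commit; in the real script, build_fn would be the build function from europarl_data.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line (hypothetical helper, not part of the commit)."""
    parser = argparse.ArgumentParser(
        description="Build corpus files from a Europarl directory.")
    # Positional argument that plays the role of sys.argv[1] in the commit.
    parser.add_argument("europath",
                        help="directory containing the Europarl files")
    return parser.parse_args(argv)


def main(argv=None, build_fn=print):
    """Parse argv and pass the directory to build_fn.

    build_fn defaults to print only so that this sketch runs on its own;
    the assumption is that the real script would pass build instead.
    """
    args = parse_args(argv)
    build_fn(args.europath)
```

With this variant, a missing argument produces a usage message and a non-zero exit status instead of a traceback.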
20 changes: 10 additions & 10 deletions fast-lda/europarl_fi/europarl_fi.dat

Large diffs are not rendered by default.
