Language Detection

This project identifies the language in which a given sentence is written.

Model

The model is based on TextCNN ^[1]^[2].
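For readers unfamiliar with the architecture, here is a minimal PyTorch sketch of a TextCNN classifier. It is not the code from this repository: the class name, the hyperparameter defaults (embed_dim, kernel_sizes, num_filters, dropout) and the use of index 0 for padding are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNN(nn.Module):
    """Embedding -> parallel 1D convolutions -> max-over-time pooling -> linear layer."""

    def __init__(self, vocab_size, num_classes, embed_dim=128,
                 kernel_sizes=(2, 3, 4), num_filters=100, dropout=0.5):
        super().__init__()
        # Randomly initialized embeddings; id 0 is assumed to be the padding token.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One convolution per kernel size, each sliding over the token axis.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor, seq_len >= max(kernel_sizes)
        x = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)           # (batch, embed_dim, seq_len), as Conv1d expects
        # Max-over-time pooling keeps the strongest response of every filter.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(features))  # (batch, num_classes) logits
```

With one output class per language, `TextCNN(vocab_size=50, num_classes=403)` applied to a batch of shape (8, 20) returns logits of shape (8, 403).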

Requirements

Python 3.7
PyTorch 1.5
CUDA (recommended version >= 10.0)
torchtext 0.11.0

Getting Started

Data

The training data is the Tatoeba sentence dataset, which covers 403 languages. Download and extract it in the data directory:

wget http://downloads.tatoeba.org/exports/sentences.tar.bz2
bunzip2 sentences.tar.bz2
tar xvf sentences.tar
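The extracted file, sentences.csv, holds one sentence per line in three tab-separated columns: id, language code and text. The sketch below reads rows in that layout and counts sentences per language. The function name read_tatoeba and the inline sample are illustrative, and treating `\N` as the marker for an unknown language is an assumption.

```python
import csv
import io
from collections import Counter


def read_tatoeba(lines):
    """Yield (lang, text) pairs from tab-separated rows of the form id, lang, text."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed rows
        _id, lang, text = row
        if lang and lang != "\\N":  # assumed marker for "language unknown"
            yield lang, text


# Tiny in-memory stand-in for sentences.csv.
sample = (
    "1\tcmn\t我們試試看！\n"
    "2\tdeu\tIch muss schlafen gehen.\n"
    "3\teng\tLet's try something.\n"
    "4\tdeu\tWas ist das?\n"
)
counts = Counter(lang for lang, _ in read_tatoeba(io.StringIO(sample)))
```

To run it on the real file, pass `open("sentences.csv", encoding="utf-8", newline="")` in place of the `io.StringIO` object.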

Train

First, process the data:

python main.py --data_process

This command calls data_process.py to split the data into train_process.csv and test_process.csv. Before training, you can adjust the parameters in args.py. Then train the model as follows:

python main.py --train

Training saves the model in the model directory and the vocabulary in the data directory. Training takes a while, so you can skip it and directly use the vocabulary in the data directory and the model in the model directory.
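The vocabulary has to be saved together with the model because the token ids fed to the network at test time must match those seen during training. The sketch below shows the idea. It assumes character-level tokens, `<pad>` and `<unk>` special tokens, and JSON as the storage format. None of these is confirmed by this repository, and the function names are hypothetical.

```python
import json
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_vocab(texts, min_freq=1):
    """Map every character seen at least min_freq times to an integer id."""
    freq = Counter(ch for text in texts for ch in text)
    itos = [PAD, UNK] + sorted(ch for ch, n in freq.items() if n >= min_freq)
    return {ch: i for i, ch in enumerate(itos)}


def encode(text, vocab, max_len=8):
    """Convert a sentence to a fixed-length id list, truncating or padding as needed."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in text[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab(["hola", "hallo"])
encoded = encode("halt", vocab, max_len=6)   # "t" is unseen, so it maps to <unk>
restored = json.loads(json.dumps(vocab))     # the mapping survives a JSON round trip
```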

Test

To evaluate the trained model on test_process.csv, run:

python main.py --test

The model I trained achieves 95.93% accuracy on the test set (by default, 10000 randomly selected samples).
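Accuracy on a random sample is the fraction of sampled sentences whose predicted language matches the label. The sketch below shows that computation. The function sampled_accuracy, its seed parameter and the toy predictor are illustrative stand-ins, not the repository's evaluation code.

```python
import random


def sampled_accuracy(examples, predict, sample_size=10000, seed=0):
    """Accuracy of predict(text) on a random subset of (text, label) pairs."""
    rng = random.Random(seed)
    k = min(sample_size, len(examples))   # use everything if fewer examples exist
    subset = rng.sample(examples, k)
    correct = sum(predict(text) == label for text, label in subset)
    return correct / k


# Toy stand-in for the trained model: a hypothetical keyword rule.
def toy_predict(text):
    return "deu" if "ich" in text.lower() else "eng"


examples = [
    ("Ich bin hier.", "deu"),
    ("I am here.", "eng"),
    ("Ich auch.", "deu"),
    ("Was ist das?", "deu"),   # the toy rule gets this one wrong
]
acc = sampled_accuracy(examples, toy_predict)
```

Because the sample is random, the reported figure can vary slightly between runs unless the seed is fixed.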

Test any single sentence

python main.py --test_single

Then follow the prompts to enter a single sentence and see the predicted language.

Improvement

At present, the word vectors are only randomly initialized. In the future, word vectors pretrained with FastText ^[3]^[4] could be used in this model.
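One possible way to make that change is sketched below. It parses fastText's text (.vec) format, where a header line gives the count and dimension and each following line gives a token and its values. It then builds an embedding matrix in which tokens without a pretrained vector keep a random initialization. The function names, the toy vocabulary and the uniform range are assumptions, and vocabulary ids are assumed to run from 0 to len(vocab) - 1.

```python
import io
import random


def load_vec(lines):
    """Parse .vec text: a 'count dim' header, then 'token v1 ... v_dim' per line."""
    lines = iter(lines)
    dim = int(next(lines).split()[1])
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        token, values = parts[0], parts[1:dim + 1]
        if len(values) == dim:
            vectors[token] = [float(v) for v in values]
    return vectors, dim


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """Row i holds the vector for the token with id i; unknown tokens stay random."""
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(len(vocab))]
    for token, idx in vocab.items():
        if token in vectors:
            matrix[idx] = vectors[token]
    return matrix


sample = "2 3\nhello 0.1 0.2 0.3\nhallo 0.4 0.5 0.6\n"
vectors, dim = load_vec(io.StringIO(sample))
vocab = {"<unk>": 0, "hallo": 1, "bonjour": 2}   # toy vocabulary
matrix = build_embedding_matrix(vocab, vectors, dim)
```

In PyTorch, a matrix like this could be handed to `torch.nn.Embedding.from_pretrained(torch.tensor(matrix), freeze=False)` so the vectors are still fine-tuned during training.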

lzw108/Language-Detection
