This project implements language identification with a TextCNN model written in PyTorch.

lzw108/Language-Detection

Language Detection

This is a project on language detection.

Model

The model is based on TextCNN [1][2].
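
For readers unfamiliar with TextCNN, the following is a minimal sketch of the general architecture (embedding, parallel 1-D convolutions, max-over-time pooling, linear classifier). It is not the code from this repository: the class name, layer sizes, kernel sizes, and vocabulary size are all assumptions chosen for illustration; only the 403 output classes come from the dataset description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNN(nn.Module):
    """Illustrative TextCNN classifier; all sizes are assumed, not the repo's."""

    def __init__(self, vocab_size, num_classes, embed_dim=128,
                 kernel_sizes=(2, 3, 4), num_filters=100, dropout=0.5):
        super().__init__()
        # Randomly initialized embeddings; index 0 is reserved for padding.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One convolution per kernel size, each sliding along the sequence.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> x: (batch, embed_dim, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)
        # Max-over-time pooling keeps the strongest response of each filter.
        pooled = [F.relu(conv(x)).max(dim=2)[0] for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(features)  # (batch, num_classes) logits


model = TextCNN(vocab_size=5000, num_classes=403)
model.eval()
batch = torch.randint(1, 5000, (8, 32))  # 8 fake sentences of 32 token ids
with torch.no_grad():
    logits = model(batch)
print(logits.shape)
```

Each input sequence must be at least as long as the largest kernel size, so short sentences need padding before they reach the convolutions.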

Requirements

  • Python 3.7
  • PyTorch 1.5
  • CUDA (Recommended version >=10.0)
  • torchtext 0.11.0

Getting Started

Data

Training uses the Tatoeba sentence dataset, which covers 403 languages. Download and extract it inside the data directory:

wget http://downloads.tatoeba.org/exports/sentences.tar.bz2
bunzip2 sentences.tar.bz2
tar xvf sentences.tar
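
To sanity-check the download, a short script can count sentences per language. The sketch below assumes the extracted file is tab-separated with three columns (sentence id, language code, text) and that it lives at data/sentences.csv; both the path and the helper name `count_languages` are assumptions, not part of this repository.

```python
import csv
import io
from collections import Counter


def count_languages(lines):
    """Count sentences per language code in tab-separated rows (id, lang, text)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed rows
        counts[row[1]] += 1
    return counts


# Tiny in-memory sample standing in for the real export.
sample = io.StringIO(
    "1\tcmn\t我們試試看！\n"
    "2\tdeu\tLass uns etwas versuchen!\n"
    "3\teng\tLet's try something.\n"
    "4\teng\tI have to go to sleep.\n"
)
counts = count_languages(sample)
print(counts.most_common())

# For the real file (assumed path):
# with open("data/sentences.csv", encoding="utf-8", newline="") as f:
#     print(len(count_languages(f)), "languages")
```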

Train

First, preprocess the data:

python main.py --data_process 

This calls data_process.py, which splits the data into train_process.csv and test_process.csv. Before training, you can adjust the parameters in args.py. Then train the model as follows:

python main.py --train 

Training writes the model to the model directory and the vocabulary to the data directory. Training takes a while, so you can instead use the vocabulary and model already provided in those directories.
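
The train/test split performed in the preprocessing step can be pictured roughly as follows. This is a hedged sketch, not the contents of data_process.py: the function names, the 80/20 ratio, the fixed seed, and the `lang,text` column layout are all assumptions.

```python
import csv
import io
import random


def split_rows(rows, test_fraction=0.1, seed=0):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    rows = list(rows)
    rng = random.Random(seed)
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def write_csv(rows, handle):
    """Write (lang, text) rows with a header line (assumed column names)."""
    writer = csv.writer(handle)
    writer.writerow(["lang", "text"])
    writer.writerows(rows)


# Synthetic rows standing in for the processed Tatoeba sentences.
rows = [("eng", f"sentence {i}") for i in range(50)]
rows += [("deu", f"Satz {i}") for i in range(50)]

train_rows, test_rows = split_rows(rows, test_fraction=0.2)
buffer = io.StringIO()  # a real run would open test_process.csv instead
write_csv(test_rows, buffer)
print(len(train_rows), len(test_rows))
```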

Test

Evaluate the trained model on test_process.csv:

python main.py --test

The model I trained reaches 95.93% accuracy on the test set (by default, 10000 samples are randomly selected).
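
Accuracy on a random sample can be computed along these lines. The sketch assumes a `predict(text)` callable that returns a language code; `sampled_accuracy` and `toy_predict` are hypothetical names, and the toy rule below merely stands in for the trained model.

```python
import random


def sampled_accuracy(examples, predict, sample_size=10000, seed=0):
    """Accuracy of predict() on at most sample_size random (label, text) pairs."""
    if not examples:
        raise ValueError("examples must not be empty")
    rng = random.Random(seed)
    if len(examples) > sample_size:
        examples = rng.sample(examples, sample_size)
    correct = sum(1 for label, text in examples if predict(text) == label)
    return correct / len(examples)


def toy_predict(text):
    # Placeholder rule standing in for the trained model.
    return "deu" if "ß" in text or "ü" in text else "eng"


examples = [
    ("eng", "Good morning"),
    ("deu", "Grüß Gott"),
    ("deu", "Guten Tag"),  # misclassified by the toy rule
    ("eng", "Thank you"),
]
accuracy = sampled_accuracy(examples, toy_predict)
print(f"{accuracy:.2%}")
```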

Test any single sentence

python main.py --test_single

Then follow the prompts to enter a single sentence.

Improvement

At present, the word vectors are only randomly initialized. In the future, word vectors trained with FastText [3][4] could be used in this model.
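
One possible way to do this is to parse fastText's text `.vec` format (a header line with the vector count and dimension, then one token and its values per line) and build an embedding matrix aligned with the model vocabulary. The sketch below is an assumption about how that could look, not code from this repository; `load_vec`, `build_embedding_matrix`, and the sample vocabulary are hypothetical.

```python
import io
import random


def load_vec(handle):
    """Parse a .vec stream: header 'count dim', then 'token v1 ... vd' lines."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors, dim


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """One row per vocab token; tokens without a vector get small random values."""
    rng = random.Random(seed)
    matrix = []
    for token in vocab:
        if token in vectors:
            matrix.append(vectors[token])
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix


# Tiny in-memory stand-in for a real .vec file.
sample = io.StringIO("3 2\nhello 0.1 0.2\nhallo 0.3 0.4\nbonjour 0.5 0.6\n")
vectors, dim = load_vec(sample)
vocab = ["<pad>", "hello", "bonjour"]  # hypothetical model vocabulary
matrix = build_embedding_matrix(vocab, vectors, dim)
print(len(matrix), dim)
```

In PyTorch, such a matrix could then seed the embedding layer via `torch.nn.Embedding.from_pretrained(torch.tensor(matrix), freeze=False)`, so that the vectors are still fine-tuned during training.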
