PERT is a solution to the pinyin-to-character conversion task, which is at the core of Chinese input methods. It adopts a transformer-based network, specifically the NEZHA language model, and outperforms n-gram and RNN models significantly. In addition, it can incorporate an n-gram model and an external lexicon to adapt to out-of-domain text.
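For illustration, here is a toy sketch of the task itself (not PERT): each pinyin token is greedily mapped to a character from a tiny hand-made lexicon, whereas PERT resolves homophone ambiguity with sentence-level context from its NEZHA-based transformer. The lexicon and sentence below are made up for this example.

# Toy illustration of pinyin-to-character conversion (not the PERT model).
# A real converter must choose among many homophones per pinyin token;
# PERT does this with a NEZHA-based transformer over the whole sentence.
toy_lexicon = {"pin": "拼", "yin": "音", "shu": "输", "ru": "入", "fa": "法"}

def convert_greedy(pinyin_tokens):
    # look up one candidate per token; unknown tokens are marked with "?"
    return "".join(toy_lexicon.get(tok, "?") for tok in pinyin_tokens)

print(convert_greedy(["pin", "yin", "shu", "ru", "fa"]))  # prints 拼音输入法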
# Create python environment (optional)
conda create -n pert python=3.6.5
source activate pert
# Install python dependencies
pip install -r requirements.txt
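After installation, you can sanity-check the environment with the one-liner below (this assumes PyTorch is among the packages pinned in requirements.txt):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"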
We use Trainer.py to train the PERT model. Here is an example:
CUDA_VISIBLE_DEVICES='0' nohup python Trainer.py \
--vocab "./Corpus/CharListFrmC4P.txt" \
--pyLex "./Corpus/pinyinList.txt" \
--chardata "./Corpus/train_texts_CharSeg_1k.txt" \
--pinyindata "./Corpus/train_texts_pinyin_1k.txt" \
--num_loading_workers 2 \
--prefetch_factor 1 \
--bert_config "./Configs/bert_config_tiny_nezha_py.json" \
--train_batch_size 2048 \
--seq_length 16 \
--num_epochs 10 \
--continue_train_index 0 \
--save "./Models/pert_tiny_py_lr5e4_2kBs_10e/" \
--save_per_n_epoches 1 \
> "./Logs/Training_pert_tiny_py_lr5e4_2kBs_10e_log.txt" 2>&1 &
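The character corpus (--chardata) and the pinyin corpus (--pinyindata) are expected to be parallel line by line, as in the example files. Before training on your own data, an alignment check such as the sketch below can catch mismatches early; the whitespace-tokenized, one-sentence-per-line format is an assumption based on the example corpus, not a documented spec.

# Sketch: verify that the character corpus and the pinyin corpus are line-aligned.
# Assumes both files are whitespace-tokenized with one sentence per line.
char_path = "./Corpus/train_texts_CharSeg_1k.txt"
py_path = "./Corpus/train_texts_pinyin_1k.txt"
with open(char_path, encoding="utf-8") as fc, open(py_path, encoding="utf-8") as fp:
    for i, (c_line, p_line) in enumerate(zip(fc, fp), start=1):
        chars, pys = c_line.split(), p_line.split()
        if len(chars) != len(pys):
            print(f"line {i}: {len(chars)} characters vs {len(pys)} pinyin tokens")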
We use py2wordPert.py to evaluate the PERT model. Here is an example:
CUDA_VISIBLE_DEVICES='0' nohup python py2wordPert.py \
--charLex "./Corpus/CharListFrmC4P.txt" \
--pyLex "./Corpus/pinyinList.txt" \
--pinyin2PhrasePath "./Corpus/ModernChineseLexicon4PinyinMapping.txt" \
--bigramModelPath "./Models/Bigram/Bigram_CharListFrmC4P.json" \
--modelPath "./Models/pert_tiny_py_lr5e4_2kBs_10e/" \
--charFile "./Corpus/train_texts_CharSeg_1k.txt" \
--pinyinFile "./Corpus/train_texts_pinyin_1k.txt" \
--conversionRsltFile "./Logs/Eval_rslt_pert_tiny_py_lr5e4_2kBs_10e_log.txt" \
> "./Logs/Eval_pert_tiny_py_lr5e4_2kBs_10e_log.txt" 2>&1 &
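Conversion quality is usually reported as sentence- and character-level accuracy against the reference character file. The sketch below scores a result file against the reference, under the assumption that the file given by --conversionRsltFile holds one converted sentence per line, parallel to --charFile; check the actual output of py2wordPert.py before relying on it.

# Sketch: compute sentence and character accuracy of a conversion result file.
# The one-sentence-per-line result format is an assumption, not a documented spec.
ref_path = "./Corpus/train_texts_CharSeg_1k.txt"
hyp_path = "./Logs/Eval_rslt_pert_tiny_py_lr5e4_2kBs_10e_log.txt"
sent_ok = sent_total = char_ok = char_total = 0
with open(ref_path, encoding="utf-8") as fr, open(hyp_path, encoding="utf-8") as fh:
    for ref_line, hyp_line in zip(fr, fh):
        ref = "".join(ref_line.split())  # drop segmentation spaces
        hyp = "".join(hyp_line.split())
        sent_total += 1
        sent_ok += int(ref == hyp)
        char_total += len(ref)
        char_ok += sum(r == h for r, h in zip(ref, hyp))
print(f"sentence accuracy: {sent_ok / sent_total:.4f}, character accuracy: {char_ok / char_total:.4f}")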
PERT
├── Trainer.py (To train PERT)
├── py2wordPert.py (To perform the pinyin-to-character conversion task with PERT)
├── PinyinCharDataProcesser.py (To provide the dataset)
├── run_train.sh (Shell script to train PERT)
├── run_eval.sh (Shell script to evaluate PERT)
│
├── NEZHA (The NEZHA language model)
│
├── Configs (The configurations to train PERT at various scales)
│
├── Corpus (The necessary lexicon and the example corpus)
│ ├── CharListFrmC4P.txt (The list of Chinese characters)
│ ├── pinyinList.txt (The list of pinyin tokens)
│ │
│ ├── ModernChineseLexicon4PinyinMapping.txt (The word entries and their corresponding pinyin tokens from the Modern Chinese Lexicon)
│ │
│ ├── train_texts_CharSeg_1k.txt (The example corpus of Chinese characters)
│ │
│ └── train_texts_pinyin_1k.txt (The example corpus of pinyin)
│
└── Models
├── Bigram (The bigram model trained on a news corpus)
└── pert_tiny_py_lr5e4_2kBs_10e (The PERT model trained on a news corpus with learning rate 5e-4, batch size 2048, and 10 epochs)
Please cite our paper if you use our models in your work:
@article{DBLP:journals/corr/abs-2205-11737,
author = {Jinghui Xiao and
Qun Liu and
Xin Jiang and
Yuanfeng Xiong and
Haiteng Wu and
Zhe Zhang},
title = {{PERT:} {A} New Solution to Pinyin to Character Conversion Task},
journal = {CoRR},
volume = {abs/2205.11737},
year = {2022},
url = {https://doi.org/10.48550/arXiv.2205.11737},
doi = {10.48550/arXiv.2205.11737},
eprinttype = {arXiv},
eprint = {2205.11737},
timestamp = {Mon, 30 May 2022 15:47:29 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2205-11737.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This open-source project is not an official Huawei product, and Huawei is not expected to provide support for this project.