This directory contains language-specific data files. Most importantly, you will find here:
- A list of unique characters for the target language (e.g. English) in data/alphabet.txt. After installing the training code, you can run python -m coqui_stt_training.util.check_characters --help to see the options of a tool that creates an alphabet file from a list of training CSV files.
- A script used to generate a binary n-gram language model: data/lm/generate_lm.py.
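
To illustrate what "a list of unique characters" means in practice, here is a minimal sketch that collects the distinct characters found in the transcripts of some training CSV files. It is not the implementation of check_characters. The function name collect_alphabet and the sample rows are hypothetical, and the sketch assumes each CSV has a header row with a transcript column.

```python
import csv
import io


def collect_alphabet(csv_files, column="transcript"):
    """Return the sorted unique characters found in `column` of each CSV.

    `csv_files` is an iterable of open text-mode file objects. The column
    name "transcript" is an assumption about the training CSV layout.
    """
    chars = set()
    for handle in csv_files:
        for row in csv.DictReader(handle):
            # Adding a string to a set with update() adds each character.
            chars.update(row[column])
    return sorted(chars)


# Hypothetical in-memory CSV standing in for a real training file.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1234,hello world\n"
    "b.wav,5678,good day\n"
)

alphabet = collect_alphabet([sample])

# One character per line, as a plain-text listing of the result.
print("\n".join(alphabet))
```

With real data, you would pass files opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample. Note that the space character is part of the result whenever transcripts contain more than one word.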
For more information on how to build these resources from scratch, see the External scorer scripts
section on stt.readthedocs.io.