cmusphinx/SimpleLM at master · DamonGuzman/cmusphinx


README

 To run you will need:

  -> binaries for the CMU-Cambridge Language Model Toolkit,
     I'm using v2.03
  -> the CMU pronouncing dictionary (cmudict)
  -> punctrem.sed
  -> SimpleLM.pl

  Install: punctrem.sed expects to live in the same directory as the
  CMU-Cambridge binaries.



 To run (after editing SimpleLM.pl so that its paths are correct):

   SimpleLM.pl foo.txt foo_lm


# INPUT: a text file

  # The input text can be just about any text document. It attempts to
  # deal with punctuation and other pieces. It doesn't require special
  # formatting.

# OUTPUT: a .dict, a .arpabo, and a .arpabo.DMP file
  
  # These are the files that SPHINX needs.




 Please let me know if this is useful or if you have additions.

 [email protected]
  
---------------------------------------------------------------


SimpleLM.pl - notes from header


# Ricky Houghton, Carnegie Mellon University (February 2nd, 2000)
# This is based on a version that Alex Hauptmann wrote.

# This script should work for most simple cases; however, there are
# many problems/concerns that need to be addressed for larger sets of
# data.
#

# 0.) This is an early version; it could use a bit of cleaning
# up. Hopefully many will find it useful as is.

# 1.) This script cannot deal with words not in cmu_dict. A future
# version will create pronunciations for OOV words on the fly.
# 
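
Because the script cannot handle words missing from the dictionary, it can help to list those words before running it. The following is a minimal Python sketch of such a check, not part of SimpleLM.pl. It assumes the dictionary uses the usual CMU layout (one `WORD  PHONES...` entry per line, `;;;` comment lines, alternates written as `WORD(2)`); the function names and the sample entries are made up for illustration.

```python
import re

def load_dict_words(lines):
    """Collect headwords from CMU-dictionary-style lines (assumed layout)."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        head = line.split()[0]
        # Alternate pronunciations look like WORD(2); drop the suffix.
        head = re.sub(r"\(\d+\)$", "", head)
        words.add(head.upper())
    return words

def find_oov(text, dict_words):
    """Return words in text that are not in dict_words, in first-seen order."""
    seen = set()
    oov = []
    for tok in re.findall(r"[A-Za-z']+", text):
        word = tok.upper()
        if word not in dict_words and word not in seen:
            seen.add(word)
            oov.append(word)
    return oov

# Tiny made-up dictionary excerpt, for illustration only.
sample_dict = [
    ";;; comment line",
    "HELLO  HH AH0 L OW1",
    "HELLO(2)  HH EH0 L OW1",
    "WORLD  W ER1 L D",
]
dict_words = load_dict_words(sample_dict)
print(find_oov("Hello, world! Hello, Sphinx.", dict_words))  # ['SPHINX']
```

Any word this reports would have to be removed from the corpus or given a pronunciation by hand before building the LM.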

# 2.) This script does not merge a smaller corpus with a general
# corpus. This merging step is actually quite important. Even with a
# tight corpus, there is a real benefit to merging with a general
# language model. I will release MergeLM once I find a good general
# text set for merging and we have a pronunciation generator.
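
MergeLM has not been released, so its method is unknown. As an illustration only, one common way to combine a small domain model with a general one is linear interpolation of their probabilities. The Python sketch below does this for unigram probabilities; the function names, the sample texts, and the 0.7 weight are assumptions chosen for the example, not values from this script.

```python
from collections import Counter

def unigram_probs(text):
    """Maximum-likelihood unigram probabilities from whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def interpolate(domain, general, weight=0.7):
    """Mix two unigram models: weight * domain + (1 - weight) * general."""
    vocab = set(domain) | set(general)
    return {
        word: weight * domain.get(word, 0.0)
        + (1.0 - weight) * general.get(word, 0.0)
        for word in vocab
    }

# Made-up example texts standing in for a tight corpus and a general one.
domain = unigram_probs("open the pod bay doors")
general = unigram_probs("the cat sat on the mat")
merged = interpolate(domain, general, weight=0.7)

# Words seen only in the general text now have non-zero probability.
print(merged["cat"] > 0.0)
```

The benefit is that words absent from the small corpus still receive some probability mass, while the domain text keeps most of the weight.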

# 3.) This script does not deal with text normalization. For
# hand-crafted corpora this should not be a problem. However, if you
# want to build an LM for recognizing something like NPR, and would
# like to use text from the Web as a source of language data, it
# should be normalized. This process would convert the number
# "100,000" to "one hundred thousand", and "www.cmu.edu" would become
# "w w w dot c m u dot e d u", or maybe "w w w dot c m u dot ed u".
# That is, it would attempt to convert numbers and symbols to text
# strings that the recognizer might return. (Note: awb has a text
# normalizer that can be used; however, I didn't have time to
# incorporate it into the script tonight. Other things popped up.)
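
The two conversions described in note 3 can be sketched in Python as follows. This is a rough illustration under simple assumptions, not awb's normalizer and not part of SimpleLM.pl: it only handles non-negative whole numbers below 10^12 (with optional thousands separators) and dotted names such as host names, and all function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def below_thousand(n):
    """Words for 0 < n < 1000 as a list (empty list for 0)."""
    parts = []
    if n >= 100:
        parts += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10])
        n %= 10
    if n > 0:
        parts.append(ONES[n])
    return parts

def number_to_words(n):
    """Spell out a non-negative integer below 10**12."""
    if not 0 <= n < 10**12:
        raise ValueError("this sketch only covers 0 <= n < 10**12")
    if n == 0:
        return "zero"
    parts = []
    for value, name in ((10**9, "billion"), (10**6, "million"), (1000, "thousand")):
        if n >= value:
            parts += below_thousand(n // value) + [name]
            n %= value
    parts += below_thousand(n)
    return " ".join(parts)

def spell_out(token):
    """Spell a dotted name character by character, reading '.' as 'dot'."""
    words = []
    for ch in token:
        if ch == ".":
            words.append("dot")
        elif ch.isdigit():
            words.append(ONES[int(ch)])
        else:
            words.append(ch.lower())
    return " ".join(words)

def normalize(text):
    """Normalize whitespace-separated tokens; leave other tokens unchanged."""
    out = []
    for tok in text.split():
        if re.fullmatch(r"\d{1,3}(,\d{3})+|\d+", tok):
            out.append(number_to_words(int(tok.replace(",", ""))))
        elif (re.fullmatch(r"[A-Za-z0-9]+(\.[A-Za-z0-9]+)+", tok)
              and any(c.isalpha() for c in tok)):
            out.append(spell_out(tok))
        else:
            out.append(tok)
    return " ".join(out)

print(normalize("see www.cmu.edu for 100,000 words"))
# see w w w dot c m u dot e d u for one hundred thousand words
```

This sketch always spells names letter by letter, which gives the first of the two readings in note 3; producing "ed u" instead would need extra rules for pronounceable chunks.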

# 4.) I've made no real effort to allow for parameter passing.  I ran
# out of time and I'm committed for the next week training new
# acoustic models.
