Pure Python spell checking based on Peter Norvig's blog post on setting up a simple spell checking algorithm.
It uses a Levenshtein Distance algorithm to generate every string within
an edit distance of 2 from the original word. It then compares these
candidates (produced by insertions, deletions, replacements, and
transpositions) to known words in a word frequency list. Words that appear
more often in the frequency list are more likely to be the correct result.
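The candidate generation described above can be sketched roughly as follows. This is a minimal illustration in the spirit of Norvig's post, not the library's actual code: the names `edits1` and `edits2` are hypothetical, and the sketch assumes a lowercase ASCII alphabet.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """Return every string one edit away from `word` (illustrative helper)."""
    # Every way of splitting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """Return every string reachable by applying two single edits in a row."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


# 'teh' is one transposition away from 'the'.
print("the" in edits1("teh"))
# 'hapenning' needs two edits (insert 'p', delete 'n') to reach 'happening'.
print("happening" in edits2("hapenning"))
```

Most of the generated strings are not real words, which is why the next step filters them against the word frequency list.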
The easiest way to install is with pip:

    pip install pyspellchecker
To install from source:
    git clone https://github.com/barrust/pyspellchecker.git
    cd pyspellchecker
    python setup.py install
As always, I highly recommend using the Pipenv package to help manage dependencies!
After installation, using pyspellchecker should be fairly straightforward:
    from spellchecker import SpellChecker

    spell = SpellChecker()

    # find those words that may be misspelled
    misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

    for word in misspelled:
        # Get the one `most likely` answer
        print(spell.correction(word))

        # Get a list of `likely` options
        print(spell.candidates(word))
If the Word Frequency list is not to your liking, you can add additional text to generate a more appropriate list for your use case.
    from spellchecker import SpellChecker

    spell = SpellChecker()  # loads default word frequency list
    spell.word_frequency.load_text_file('./my_free_text_doc.txt')

    # if I just want to make sure some words are not flagged as misspelled
    spell.word_frequency.load_words(['microsoft', 'apple', 'google'])
    spell.known(['microsoft', 'google'])  # will return both now!
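To give a feel for what generating a frequency list from free text involves, here is a small self-contained sketch. It is an assumption-laden illustration, not how pyspellchecker tokenizes or stores words: the helper `build_word_frequency` is hypothetical, and it simply counts runs of lowercase ASCII letters.

```python
import re
from collections import Counter


def build_word_frequency(text):
    """Count lowercase alphabetic tokens in free text (hypothetical helper)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


freq = build_word_frequency("The cat sat. The dog sat too.")

# Adding extra words so they count as known, each with a count of one.
freq.update(["microsoft", "apple"])

print(freq["the"], freq["sat"], freq["microsoft"])
```

Words that appear more often in the text end up with higher counts, which is what makes them preferred when ranking corrections.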
More work in storing and loading word frequency lists is planned; stay tuned.
Online documentation is planned; until then, the main SpellChecker methods are summarized here:
correction(word)
: Returns the most probable result for the misspelled word
candidates(word)
: Returns a set of possible candidates for the misspelled word
known([words])
: Returns those words that are in the word frequency list
unknown([words])
: Returns those words that are not in the frequency list
word_probability(word)
: The frequency of the given word out of all words in the frequency list
edit_distance_1(word)
: Returns a set of all strings at a Levenshtein Distance of one
edit_distance_2(word)
: Returns a set of all strings at a Levenshtein Distance of two
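As a rough mental model of how the lookup methods relate to one another, the sketch below reimplements simplified stand-ins over a toy frequency table. Everything here is hypothetical (the `FREQ` counts, the `most_probable` helper, and the function bodies); the real methods live on the `SpellChecker` object and may behave differently.

```python
from collections import Counter

# Toy frequency table; the counts are made up for illustration.
FREQ = Counter({"the": 50, "they": 12, "then": 10, "than": 8})
TOTAL = sum(FREQ.values())


def known(words):
    """Words that appear in the toy frequency table."""
    return {w for w in words if w in FREQ}


def unknown(words):
    """Words that do not appear in the toy frequency table."""
    return {w for w in words if w not in FREQ}


def word_probability(word):
    """Share of all counted occurrences that belong to `word`."""
    return FREQ[word] / TOTAL


def most_probable(candidates):
    """Pick the known candidate with the highest probability, if any."""
    return max(known(candidates), key=word_probability, default=None)


print(known(["the", "teh"]))
print(unknown(["the", "teh"]))
print(word_probability("the"))
print(most_probable({"then", "than", "thn"}))
```

In this model, a correction is simply the most probable known word among the strings produced by the edit-distance functions.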