
Stop auto-correcting Numbers and special characters #27

Closed
fahadshery opened this issue Feb 13, 2019 · 20 comments

@fahadshery

Hi again,

If I have the following string:
The support is 24/7, awesome!

It removes 24/7. Is there a way to avoid deleting numbers and other special characters such as "$" and "£"?

@mammothb
Owner

lookup() has an ignore_token argument which lets you avoid making changes to any word that matches the ignore_token pattern.

For example, \d{2}\w*\b ignores 24th in 24th December, and \d+\W+\d+\b should ignore the 24/7 in your example. So I think you could combine both patterns into \d{2}\w*\b|\d+\W+\d+\b to ignore both cases.
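
For illustration, a minimal sketch of lookup() with ignore_token (this assumes the frequency dictionary bundled with symspellpy, loaded as shown later in this thread; note the raw-string prefix, which turns out to matter further down):

import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# "24/7" matches \d+\W+\d+\b, so lookup returns it unchanged instead of
# trying to correct it.
suggestions = sym_spell.lookup("24/7", Verbosity.TOP, max_edit_distance=2,
                               ignore_token=r"\d{2}\w*\b|\d+\W+\d+\b")
print(suggestions[0].term)  # 24/7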

@fahadshery
Author

For me, word_segmentation is more relevant for this work. Is it possible to do this with that method?

@mammothb
Owner

word_segmentation has the ignore_token argument as well.
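
Continuing the sketch above, word_segmentation accepts the same kind of pattern (the exact output depends on the dictionary, but the ignored token should survive intact):

# Tokens matching ignore_token are passed through by word_segmentation
# unchanged; everything else is segmented and spell-corrected.
result = sym_spell.word_segmentation(
    phrase="the support is 24/7", max_edit_distance=2,
    ignore_token=r"\d+\W+\d+\b")
print(result.corrected_string)  # "24/7" is preserved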

@RajaParikshit

lookup_compound() has an ignore_non_words argument which works fine for numbers and abbreviations, but it still excludes special characters like : and /. How can this problem be overcome?

@mammothb
Owner

The original author has described a possible solution but has not implemented it in the original code. I am also not sure how to implement it in the current code.

@fahadshery
Author

fahadshery commented Feb 22, 2019

Also, the regex option is not working. I created this regex on regex101 for Python:

result_segmented = sym_spell.word_segmentation(line, ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")

Check out the regex here: https://regex101.com/r/2nJYy7/1

word_segmentation is still removing special characters and numbers.
lookup_compound is somewhat working, but it fails on the following string: "it took 24hours to fix the issue".

@mammothb
Owner

What is your expected output for each line?

@fahadshery
Author

fahadshery commented Feb 26, 2019

I have updated the page with the expected output for each line: https://regex101.com/r/2nJYy7/3

@fahadshery
Author

Please suggest a solution for word_segmentation.

@mammothb
Owner

I realized the way I implemented ignore_token is to skip lookup on a word that matches the pattern and return it as-is. So I don't think it can be used to ignore punctuation, since it looks for the pattern in brodband, as a whole, rather than in brodband and , separately. You could preserve the , by using something like \w+, as the pattern, but then brodband, will not get corrected to broadband,.
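
A quick sketch of that behavior, reusing the setup from the earlier example (brodband, is just an illustrative misspelled token):

# A token matching ignore_token is returned as-is: the comma survives,
# but so does the misspelling, because lookup is skipped entirely.
suggestions = sym_spell.lookup("brodband,", Verbosity.TOP, max_edit_distance=2,
                               ignore_token=r"\w+,")
print(suggestions[0].term)  # brodband, (not corrected)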

@fahadshery
Author

Understood about the commas.

But why is it still deleting the numbers? I definitely want to keep the numbers, and preferably the commas at least. Here is the code:

import csv
import os
from symspellpy.symspellpy import SymSpell, Verbosity  # import the module

edit_distance_max = 2  # maximum edit distance per dictionary precalculation
prefix_length = 7

sym_spell = SymSpell(max_dictionary_edit_distance=edit_distance_max, prefix_length=prefix_length)

# get the path to the dictionary which is saved in the same project folder
dictionary_path = os.path.join(os.getcwd(), "frequency_dictionary_en_82_765.txt")

# which columns in the dictionary hold the term/word and its frequency
term_index = 0
count_index = 1

# load the dictionary
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

corrected_verbatim = []
total_corrections = 0

# read in the input file
input_file_path = os.path.join(os.getcwd(), "input_words.txt")
print(input_file_path)

with open(input_file_path, "r") as infile:
    for line in infile:
        line = line.rstrip()

        result_segmented = sym_spell.word_segmentation(
            phrase=line, max_edit_distance=edit_distance_max,
            ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")

        if not result_segmented:
            correction = [line, line]
        else:
            correction = [line, result_segmented.corrected_string]
            total_corrections += result_segmented.distance_sum

        results_compound = sym_spell.lookup_compound(
            line, 2,
            ignore_non_words="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")

        if not results_compound:
            correction.append(line)
        else:
            correction.append(results_compound[0].term)

        results_lookup = sym_spell.lookup(
            line, Verbosity.TOP, max_edit_distance=2,
            ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")

        if not results_lookup:
            correction.append(line)
        else:
            correction.append(results_lookup[0].term)

        corrected_verbatim.append(correction)

print('Total number of spell corrections made = ' + str(total_corrections))

colnames = [["original_text", "spellchecked_text_word_segmented",
             "spellchecked_text_word_compound", "spellchecked_text_word_lookup"]]
colnames.extend(corrected_verbatim)

print('Creating an output csv called "spell_corrected.csv" in the current directory\n')

with open(os.path.join(os.getcwd(), "spell_corrected.csv"), "w", newline="") as outfile:
    wrtr = csv.writer(outfile)
    wrtr.writerows(colnames)

print('Spell check completed and output generated.\n')

I have also attached input_words.txt for reference. I think the regex has no effect on word_segmentation.

input_words.txt

@mammothb
Owner

I am able to preserve the numbers by using Python's raw string notation for the regular expression pattern, i.e.,

ignore_token=r"\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]"

instead of

ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]"

I think your backslashes (\) were treated as escape characters since you did not use raw string notation.
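
A quick way to see the difference (plain Python, no symspellpy involved): in a non-raw string literal, \b collapses to the backspace control character instead of the two characters the regex engine needs for a word boundary.

# In a plain string, "\b" becomes a single backspace character; the raw
# string keeps backslash + "b", which regex reads as a word boundary.
print(len("\b"), len(r"\b"))          # 1 2
print("\d{2}\w*\b" == r"\d{2}\w*\b")  # False, because of \b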

@fahadshery
Author

Gosh, this was super helpful. I was able to keep the results as intended.
Thank you so much for your help.
Lastly, do you know why it replaces "I" with an "a"?

@mammothb
Owner

mammothb commented Mar 1, 2019

I don't see "I" being replaced in the example you gave me.

[screenshot of the output]

@fahadshery
Author

It's replacing a capital "I" with an "a".

@fahadshery
Author

Additionally, if I spell check the string "it took 6 mnths to solv", it converts it to "it took 6mnths to sold". Ideally it should produce "it took 6 months to solve". Any idea why it doesn't correct mnths to months, and why it deletes the space between 6 and mnths?

@mammothb
Owner

mammothb commented Mar 1, 2019

6 mnths is converted to 6mnths because 6 matches \d\w*\b and is ignored. Then, when two words are joined together to look for a possibly correct word (part of the word_segmentation algorithm), 6mnths matches \d\w*\b as well and is also ignored.

solv is corrected to sold instead of solve because the frequency of sold (42979181) is higher than that of solve (13452150) in the provided frequency dictionary.

I is corrected to a while i is preserved because i is an exact match of the dictionary entry i (and gets returned as a result immediately), whereas I is not an exact match of i (different case), so it is corrected by the rest of the lookup algorithm, and a (9081174698) has a higher frequency than i (3086225277).

You could try applying lower() to your input to prevent I from being corrected to a.
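
For example (a minimal sketch reusing the earlier setup; the transfer_casing flag that appears later in this thread could restore the original casing afterwards):

# Lowercasing first makes "I" an exact dictionary match ("i"), so it is
# returned immediately instead of being corrected to the more frequent "a".
line = "I took 6 mnths to solv it"
suggestions = sym_spell.lookup_compound(line.lower(), max_edit_distance=2)
print(suggestions[0].term)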

@mammothb mammothb closed this as completed Mar 6, 2019
@abhilashpandurangan

How do I ignore numbers and punctuation like 2.55 in lookup_compound()? I'm aware that lookup has ignore_token, but lookup_compound does not allow that.

@PierreSp

I had the same problem. A quick fix is to edit parse_words() in helpers.py:

def parse_words(phrase, preserve_case=False, split_by_space=False):

Replace the re.findall(r"([^\W_]+['’][^\W_])" pattern with one that strips only the elements you want removed (e.g. re.findall(r"([^$][^$])" if you want to remove only "$" symbols). If desired, I can also make a pull request to allow this regex to be modified when calling SymSpell.

@cahya-wirawan

In case someone is still looking for a solution to this issue (how to ignore numbers and punctuation), I created a new class as a workaround:


import pkg_resources
from symspellpy import SymSpell, Verbosity
import string
import re


class SpellChecker:
    def __init__(self):
        self.sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
        dictionary_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt"
        )
        bigram_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
        )
        self.sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
        self.sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

    def lookup(self, input_term, max_edit_distance=2):
        suggestions = self.sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=max_edit_distance,
                                            transfer_casing=True, include_unknown=True)
        return suggestions[0].term

    def lookup_compound(self, input_term, max_edit_distance=2, **kwargs):
        suggestions = self.sym_spell.lookup_compound(input_term, max_edit_distance=max_edit_distance,
                                                     transfer_casing=True, ignore_non_words=True, **kwargs)
        return suggestions[0].term if len(suggestions) > 0 else input_term

    def correct(self, text, **kwargs):
        # Split the text at every punctuation character, spell-check each
        # chunk with lookup_compound(), then stitch the chunks back together
        # with their punctuation and surrounding whitespace preserved.
        result = ""
        start = 0
        for match in re.finditer(f"[{re.escape(string.punctuation)}]", text):
            end = match.start(0)
            # re-attach leading whitespace that lookup_compound would strip
            spaces = re.search(r"^(\s+)", text[start:end])
            corrected_text = self.lookup_compound(text[start:end], **kwargs)
            corrected_text = spaces.group(0) + corrected_text if spaces is not None else corrected_text
            # re-attach trailing whitespace, then the punctuation itself
            spaces = re.search(r"(\s+)$", text[start:end])
            corrected_text = corrected_text + spaces.group(0) if spaces is not None else corrected_text
            corrected_text += match.group(0)
            result = "".join([result, corrected_text])
            start = match.end(0)
        # handle the remainder after the last punctuation character
        spaces = re.search(r"^(\s+)", text[start:])
        corrected_text = self.lookup_compound(text[start:], **kwargs)
        corrected_text = spaces.group(0) + corrected_text if spaces is not None else corrected_text
        result = "".join([result, corrected_text])
        return result

Then we can use it as follows:

text = """NewYork City traaces its oriigins to atrading posti foundid onthe southen tipof Mahattan Island by 
Dutchcolonists inapproximately 1624 (acording tothe news). Thesettlement wasnamed NewAmsterdam (Dutch: Nieuw 
Amsterdam) in 1626 and was chartered as a city in 1653!"""

spell_checker = SpellChecker()
# The standard lookup_compound():
spell_checker.lookup_compound(text)

# The new text correction:
spell_checker.correct(text)

The result of the standard lookup_compound() is:

'New York City traces its origins to a trading post founded on the southern tip of Manhattan Island 
by dUtch colonists in approximately 1624 according to the news tHe settlement was named neW 
amsterdam dutch New amsTerdam in 1626 and was chartered as a city in 1653'

And the new text correction:

'New York City traces its origins to a trading post founded on the southern tip of Manhattan Island 
by Dutch colonists in approximately 1624 (according to the news). The settlement was named New 
Amsterdam (Dutch: New Amsterdam) in 1626 and was chartered as a city in 1653!'
