
Stop auto-correcting Numbers and special characters #27

Closed
fahadshery opened this issue Feb 13, 2019 · 20 comments

@fahadshery

Hi again,

If I have the following string:
The support is 24/7, awesome!

It removes 24/7. Is there a way to avoid deleting numbers and other special characters such as "$" and "£"?

@mammothb
Owner

lookup() has an ignore_token argument which lets you avoid making changes to any word that matches the ignore_token pattern.

For example, \d{2}\w*\b ignores 24th in 24th December, and \d+\W+\d+\b should ignore the 24/7 in your example. So I think you could combine both patterns into \d{2}\w*\b|\d+\W+\d+\b to ignore both cases.
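
For illustration, a minimal sketch of lookup() with ignore_token (this assumes the frequency dictionary bundled with symspellpy, loaded as shown later in this thread; note the raw-string prefix, which turns out to matter further down):

import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# "24/7" matches \d+\W+\d+\b, so lookup returns it unchanged instead of
# trying to correct it.
suggestions = sym_spell.lookup("24/7", Verbosity.TOP, max_edit_distance=2,
                               ignore_token=r"\d{2}\w*\b|\d+\W+\d+\b")
print(suggestions[0].term)  # 24/7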

@fahadshery
Author

For me, word_segmentation is more relevant for this work. Is it possible to do this with that method?

@mammothb
Owner

word_segmentation has the ignore_token argument as well.
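
Continuing the sketch above, word_segmentation accepts the same kind of pattern (the exact output depends on the dictionary, but the ignored token should survive intact):

# Tokens matching ignore_token are passed through by word_segmentation
# unchanged; everything else is segmented and spell-corrected.
result = sym_spell.word_segmentation(
    phrase="the support is 24/7", max_edit_distance=2,
    ignore_token=r"\d+\W+\d+\b")
print(result.corrected_string)  # "24/7" is preserved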

@RajaParikshit

lookup_compound() has an ignore_non_words argument which works fine for numbers and abbreviations, but it still excludes special characters like : and /. How can this problem be overcome?

@mammothb
Owner

The original author has described a possible solution but has not implemented it in the original code. I am also not sure how to implement it in the current code.

@fahadshery
Author

fahadshery commented Feb 22, 2019

Also, the regex option is not working. I created this regex on regex101 for Python:

result_segmented = sym_spell.word_segmentation(line, ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")

Check out the regex here: https://regex101.com/r/2nJYy7/1

word_segmentation is still removing special characters and numbers.
lookup_compound is somewhat working, but it fails on the following string: "it took 24hours to fix the issue".

@mammothb
Owner

What is your expected output for each line?

@fahadshery
Author

fahadshery commented Feb 26, 2019

I have updated the page with the expected output for each line: https://regex101.com/r/2nJYy7/3

@fahadshery
Author

Please suggest a solution for word_segmentation.

@mammothb
Owner

I realized the way I implemented ignore_token is to skip lookup on a word that matches the pattern and return it as-is. So I don't think it can be used to ignore punctuation, since it looks for the pattern in brodband, as a whole, rather than in brodband and , separately. You could preserve the , by using something like \w+, as the pattern, but then brodband, will not get corrected to broadband,.
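
A quick sketch of that behavior, reusing the setup from the earlier example (brodband, is just an illustrative misspelled token):

# A token matching ignore_token is returned as-is: the comma survives,
# but so does the misspelling, because lookup is skipped entirely.
suggestions = sym_spell.lookup("brodband,", Verbosity.TOP, max_edit_distance=2,
                               ignore_token=r"\w+,")
print(suggestions[0].term)  # brodband, (not corrected)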

@fahadshery
Author

Understood about the commas.

But why is it still deleting the numbers? I definitely want to keep the numbers, and preferably the commas at least. Here is the code:

import csv
import os
from symspellpy.symspellpy import SymSpell, Verbosity  # import the module

edit_distance_max = 2  # maximum edit distance per dictionary precalculation
prefix_length = 7

sym_spell = SymSpell(max_dictionary_edit_distance=edit_distance_max, prefix_length=prefix_length)

# get the path to the dictionary which is saved in the same project folder
dictionary_path = os.path.join(os.getcwd(), "frequency_dictionary_en_82_765.txt")

# which columns in the dictionary hold the term/word and its frequency
term_index = 0
count_index = 1

# load the dictionary
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

corrected_verbatim = []
total_corrections = 0

# read in the input file
input_file_path = os.path.join(os.getcwd(), "input_words.txt")
print(input_file_path)

with open(input_file_path, "r") as infile:
    for line in infile:
        line = line.rstrip()

        result_segmented = sym_spell.word_segmentation(
            phrase=line, max_edit_distance=edit_distance_max,
            ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")

        if not result_segmented:
            correction = [line, line]
        else:
            correction = [line, result_segmented.corrected_string]
            total_corrections += result_segmented.distance_sum

        results_compound = sym_spell.lookup_compound(
            line, 2,
            ignore_non_words="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")

        if not results_compound:
            correction.append(line)
        else:
            correction.append(results_compound[0].term)

        results_lookup = sym_spell.lookup(
            line, Verbosity.TOP, max_edit_distance=2,
            ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")

        if not results_lookup:
            correction.append(line)
        else:
            correction.append(results_lookup[0].term)

        corrected_verbatim.append(correction)

print('Total number of spell corrections made = ' + str(total_corrections))

colnames = [["original_text", "spellchecked_text_word_segmented",
             "spellchecked_text_word_compound", "spellchecked_text_word_lookup"]]
colnames.extend(corrected_verbatim)

print('Creating an output csv called "spell_corrected.csv" in the current directory\n')

with open(os.path.join(os.getcwd(), "spell_corrected.csv"), "w", newline="") as outfile:
    wrtr = csv.writer(outfile)
    wrtr.writerows(colnames)

print('Spell check completed and output generated.\n')

I have also attached input_words.txt for reference. I think the regex has no effect on word_segmentation.

input_words.txt

@mammothb
Owner

I am able to preserve the numbers by using Python's raw string notation for the regular expression pattern, i.e.,

ignore_token=r"\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]"

instead of

ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]"

I think your backslashes (\) were treated as escape characters since you did not use raw string notation.
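
A quick way to see the difference (plain Python, no symspellpy involved): in a non-raw string literal, \b collapses to the backspace control character instead of the two characters the regex engine needs for a word boundary.

# In a plain string, "\b" becomes a single backspace character; the raw
# string keeps backslash + "b", which regex reads as a word boundary.
print(len("\b"), len(r"\b"))          # 1 2
print("\d{2}\w*\b" == r"\d{2}\w*\b")  # False, because of \b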

@fahadshery
Author

Gosh, this was super helpful. I was able to keep the results as intended.
Thank you so much for your help.
Lastly, do you know why it replaces "I" with an "a"?

@mammothb
Owner

mammothb commented Mar 1, 2019

I don't see "I" being replaced in the example you gave me.

[screenshot of the output]

@fahadshery
Author

It's replacing a capital "I" with an "a".

@fahadshery
Author

Additionally, if I spell check the string "it took 6 mnths to solv", it converts it to "it took 6mnths to sold". Ideally it should produce "it took 6 months to solve". Any idea why it doesn't correct mnths to months, and why it deletes the space between 6 and mnths?

@mammothb
Owner

mammothb commented Mar 1, 2019

6 mnths is converted to 6mnths because 6 matches \d\w*\b and is ignored. Then, when two words are joined together to look for a possibly correct word (part of the word_segmentation algorithm), 6mnths matches \d\w*\b as well and is also ignored.

solv is corrected to sold instead of solve because the frequency of sold (42979181) is higher than that of solve (13452150) in the provided frequency dictionary.

I is corrected to a while i is preserved because i is an exact match of the dictionary entry i (and gets returned as a result immediately), whereas I is not an exact match of i (different case), so it is corrected by the rest of the lookup algorithm, and a (9081174698) has a higher frequency than i (3086225277).

You could try applying lower() to your input to prevent I from being corrected to a.
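
For example (a minimal sketch reusing the earlier setup; the transfer_casing flag that appears later in this thread could restore the original casing afterwards):

# Lowercasing first makes "I" an exact dictionary match ("i"), so it is
# returned immediately instead of being corrected to the more frequent "a".
line = "I took 6 mnths to solv it"
suggestions = sym_spell.lookup_compound(line.lower(), max_edit_distance=2)
print(suggestions[0].term)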

@mammothb mammothb closed this as completed Mar 6, 2019
@abhilashpandurangan

How do I ignore numbers and punctuation like 2.55 in lookup_compound()? I'm aware that lookup has ignore_token, but lookup_compound does not allow that.

@PierreSp

I had the same problem. A quick fix is to edit parse_words() in helpers.py:

def parse_words(phrase, preserve_case=False, split_by_space=False):

Replace the re.findall(r"([^\W_]+['’][^\W_])" pattern with one that strips only the elements you want removed (e.g. re.findall(r"([^$][^$])" if you want to remove only "$" symbols). If desired, I can also make a pull request to allow this regex to be modified when calling SymSpell.

@cahya-wirawan

In case someone is still looking for a solution to this issue (how to ignore numbers and punctuation), I created a new class as a workaround:


import pkg_resources
from symspellpy import SymSpell, Verbosity
import string
import re


class SpellChecker:
    def __init__(self):
        self.sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
        dictionary_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt"
        )
        bigram_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
        )
        self.sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
        self.sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

    def lookup(self, input_term, max_edit_distance=2):
        suggestions = self.sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=max_edit_distance,
                                            transfer_casing=True, include_unknown=True)
        return suggestions[0].term

    def lookup_compound(self, input_term, max_edit_distance=2, **kwargs):
        suggestions = self.sym_spell.lookup_compound(input_term, max_edit_distance=max_edit_distance,
                                                     transfer_casing=True, ignore_non_words=True, **kwargs)
        return suggestions[0].term if len(suggestions) > 0 else input_term

    def correct(self, text, **kwargs):
        # Split the text at every punctuation character, spell-check each
        # chunk with lookup_compound(), then stitch the chunks back together
        # with their punctuation and surrounding whitespace preserved.
        result = ""
        start = 0
        for match in re.finditer(f"[{re.escape(string.punctuation)}]", text):
            end = match.start(0)
            # re-attach leading whitespace that lookup_compound would strip
            spaces = re.search(r"^(\s+)", text[start:end])
            corrected_text = self.lookup_compound(text[start:end], **kwargs)
            corrected_text = spaces.group(0) + corrected_text if spaces is not None else corrected_text
            # re-attach trailing whitespace, then the punctuation itself
            spaces = re.search(r"(\s+)$", text[start:end])
            corrected_text = corrected_text + spaces.group(0) if spaces is not None else corrected_text
            corrected_text += match.group(0)
            result = "".join([result, corrected_text])
            start = match.end(0)
        # handle the remainder after the last punctuation character
        spaces = re.search(r"^(\s+)", text[start:])
        corrected_text = self.lookup_compound(text[start:], **kwargs)
        corrected_text = spaces.group(0) + corrected_text if spaces is not None else corrected_text
        result = "".join([result, corrected_text])
        return result

Then we can use it as follows:

text = """NewYork City traaces its oriigins to atrading posti foundid onthe southen tipof Mahattan Island by 
Dutchcolonists inapproximately 1624 (acording tothe news). Thesettlement wasnamed NewAmsterdam (Dutch: Nieuw 
Amsterdam) in 1626 and was chartered as a city in 1653!"""

spell_checker = SpellChecker()
# The standard lookup_compound():
spell_checker.lookup_compound(text)

# The new text correction:
spell_checker.correct(text)

The result of the standard lookup_compound() is:

'New York City traces its origins to a trading post founded on the southern tip of Manhattan Island 
by dUtch colonists in approximately 1624 according to the news tHe settlement was named neW 
amsterdam dutch New amsTerdam in 1626 and was chartered as a city in 1653'

And the new text correction:

'New York City traces its origins to a trading post founded on the southern tip of Manhattan Island 
by Dutch colonists in approximately 1624 (according to the news). The settlement was named New 
Amsterdam (Dutch: New Amsterdam) in 1626 and was chartered as a city in 1653!'
