Stop auto-correcting Numbers and special characters #27
Comments
For example, |
for me |
|
|
The original author has described a possible solution but has not implemented it in the original code. I am also not sure how to implement it in the current code. |
Also, the regex option is not working. I created this regex on regex101 for Python; check it out here: https://regex101.com/r/2nJYy7/1
|
What is your expected output for each line? |
I have updated the page with the expected output for each line. |
please suggest for |
I realized the way I implemented |
Understood about the commas, but why is it still deleting the
import sys
import os
from symspellpy.symspellpy import SymSpell, Verbosity  # import the module

initial_capacity = 83000
# maximum edit distance per dictionary precalculation
edit_distance_max = 2
prefix_length = 7
sym_spell = SymSpell(max_dictionary_edit_distance=edit_distance_max, prefix_length=prefix_length)

# get the path to the dictionary which is saved in the same project folder
dictionary_path = os.path.join(os.getcwd(), "frequency_dictionary_en_82_765.txt")
# which columns in the dictionary are the term/word and the frequency of the word/term
term_index = 0
count_index = 1
# load the dictionary
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

corrected_verbatim = []
total_corrections = 0
# read in the input file
input_file_path = os.path.join(os.getcwd(), "input_words.txt")
print(input_file_path)
with open(input_file_path, "r") as infile:
    for line in infile:
        line = line.rstrip()
        result_segmented = sym_spell.word_segmentation(phrase=line, max_edit_distance=edit_distance_max,
                                                       ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")
        correction = []
        if not result_segmented:
            correction = [line, line]
        else:
            correction = [line, result_segmented.corrected_string]
            total_corrections += result_segmented.distance_sum
        results_compound = sym_spell.lookup_compound(line, 2,
                                                     ignore_non_words="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")
        if not results_compound:
            correction.extend([line])
        else:
            correction.extend([results_compound[0].term])
        results_lookup = sym_spell.lookup(line, Verbosity.TOP, max_edit_distance=2,
                                          ignore_token="\d{2}\w*\b|\d+\W+\d+\b|\d\w*\b|[!@£#$%^&*();,.?:{}/|<>]")
        if not results_lookup:
            correction.extend([line])
        else:
            correction.extend([results_lookup[0].term])
        corrected_verbatim.append(correction)
        # if not results:
        #     corrected_words.append((word, word))
        # else:
        #     corrected_words.append((word, results[0].term))

# for line in word_segmented_corrected_words:
#     print(line)
# for line in word_compound_corrected_words:
#     print(line)
print('Total number of spell corrections made = ' + str(total_corrections))
colnames = [["original_text", "spellchecked_text_word_segmented", "spellchecked_text_word_compound", "spellchecked_text_word_lookup"]]
colnames.extend(corrected_verbatim)
# corrected_words1 = ["\""+x+"\"" "," +y + "," +z for (x,y,z) in colnames]
# for word in corrected_words1:
#     print(word)
print('Creating a output csv. It is located in the current directory called "spell_corrected.csv" \n')
import csv
# ug_list = zip(list1,list2,list3)
with open(os.path.join(os.getcwd(), "spell_corrected.csv"), "w", newline="") as outfile:
    wrtr = csv.writer(outfile)
    wrtr.writerows(colnames)
print('Spell check completed and output generated. \n')
# result = sym_spell.lookup("land lin", Verbosity.ALL)
# for r in result:
#     print(r)
I have also attached the |
I am able to preserve the number by using Python's raw string notation for regular expression patterns, i.e.,
instead of
I think your backslashes |
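To illustrate the raw-string point, here is a minimal sketch using only Python's standard `re` module. The pattern is a simplified stand-in chosen for this example, not the one posted earlier in the thread, and the variable names are made up for illustration. It shows that `"\b"` in a plain string literal is a backspace character, so a pattern ending in it cannot match ordinary text, while `r"\b"` reaches `re` as a word boundary.

```python
import re

# "\b" in a plain literal is one character, the ASCII backspace (0x08).
# r"\b" is two characters, a backslash and "b", which `re` reads as a word boundary.
plain_boundary = "\b"
raw_boundary = r"\b"

text = "The support is 24/7, awesome!"

# Stand-in pattern (an assumption for this sketch): digits, separator(s), digits.
body = "[0-9]+[^0-9A-Za-z]+[0-9]+"

# Same pattern body, once followed by a backspace and once by a word boundary.
plain_match = re.search(body + plain_boundary, text)
raw_match = re.search(body + raw_boundary, text)

print(len(plain_boundary), len(raw_boundary))     # 1 2
print(plain_match)                                # None
print(raw_match.group() if raw_match else None)   # 24/7
```

Sequences such as `\d` and `\W` happen to survive in a plain literal because Python does not recognise them as string escapes, which is why only part of a non-raw pattern may misbehave.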
Gosh, this was super helpful. I was able to keep the results as intended. |
It's replacing capital I with an a |
Additionally, if I do spell check of the string: |
You could try to apply |
How can I ignore numbers and punctuation like 2.55 in lookup_compound()? I'm aware that lookup has ignore_token, but lookup_compound does not allow that. |
I had the same problem. A quick fix is to edit helpers.py (symspellpy/symspellpy/helpers.py, line 113 in e7a91a8).
Replace re.findall(r"([^\W_]+['’][^\W_])" with the elements you want to remove (e.g. re.findall(r"([^$][^$])" if you want to remove only "$" symbols). If desired, I can also make a pull request to allow this regex to be customised when calling SymSpell. |
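A small sketch of why a tokenizer of this kind drops punctuation, using only the standard `re` module. The pattern below is an assumption: the copy quoted above looks like it lost its `*` quantifiers when the page was rendered, so this uses the form `[^\W_]+['’]*[^\W_]*`, and `parse_words_sketch` is a hypothetical name, not the library function.

```python
import re

# Assumed pattern (see note above): a run of letters/digits, optional apostrophes,
# then an optional further run of letters/digits.
WORD_PATTERN = r"([^\W_]+['’]*[^\W_]*)"


def parse_words_sketch(phrase: str) -> list[str]:
    """Return the lowercase tokens the assumed pattern extracts from a phrase."""
    return re.findall(WORD_PATTERN, phrase.lower())


tokens = parse_words_sketch("The support is 24/7, awesome!")
print(tokens)  # ['the', 'support', 'is', '24', '7', 'awesome']
```

Under this assumption the digits survive tokenization as separate tokens, but the "/", "," and "!" between them are never captured, so anything rebuilt from the token list cannot contain them.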
In case someone is still looking for a solution to this issue (how to ignore numbers and punctuation), I created a new class as a workaround:
Then we can use it as follows:
The result of the standard lookup_compound() is:
And the new text correction:
|
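The class from the comment above is not reproduced on this page, so the following is a separate, minimal sketch of one way to get a similar effect, not that commenter's implementation. It corrects only letter-only words and leaves numbers, punctuation, and symbols exactly where they are. `correct_words_only`, `WORD_RE`, and the toy corrector are hypothetical names for this example; in practice `correct_word` could be any per-word callable, for instance one wrapping a SymSpell lookup.

```python
import re
from typing import Callable

# Letter-only words (inner apostrophes allowed) that do not touch a digit,
# underscore, letter, or apostrophe on either side.
WORD_RE = re.compile(r"(?<![\w'’])[^\W\d_]+(?:['’][^\W\d_]+)*(?![\w'’])")


def correct_words_only(text: str, correct_word: Callable[[str], str]) -> str:
    """Apply correct_word to each matched word; leave every other character as is."""
    return WORD_RE.sub(lambda match: correct_word(match.group()), text)


# Toy corrector standing in for a real spell checker (assumption for the demo).
toy_fixes = {"suport": "support", "awsome": "awesome"}


def toy_corrector(word: str) -> str:
    return toy_fixes.get(word.lower(), word)


result = correct_words_only("The suport is 24/7, awsome! $£5", toy_corrector)
print(result)  # The support is 24/7, awesome! $£5
```

The trade-off is that words are corrected one at a time, so this sketch cannot split or merge words the way lookup_compound() does.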
Hi again,
If I have the following string:
The support is 24/7, awesome!
It removes 24/7. Is there a way to avoid deleting numbers and other special characters such as "$£" etc.?