SymSpell LookupCompound excluding Numbers and Special characters #34
Comments
Punctuation: Numbers: I will add this feature later this year. |
Hi @wolfgarbe, is this function implemented? |
Not yet. |
Hi @wolfgarbe is this function implemented? |
I think it is vitally important not to remove numbers. |
Hi @wolfgarbe Are these functionalities implemented? |
Is this feature available yet? |
Try this
from absl import app
from absl import flags
from symspellpy import SymSpell, Verbosity
import pkg_resources
import re

flags.DEFINE_string(
    "test_sentence",
    "If the extracted string less less than 50 characters long, and is not "
    "sentence-terminated, then we assume that it is a header.",
    "sample test sentence")
flags.DEFINE_string(
    "test_sentence1",
    "itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness",
    "sample test sentence")
flags.DEFINE_string("filename", 'sample1.txt', "filename")
FLAGS = flags.FLAGS


class SpellChecker(object):
    def __init__(self, edit_distance_max=2, prefix_length=7):
        self.dictionary_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt")
        self.sym_spell = SymSpell(edit_distance_max, prefix_length)
        # Column 0 holds the term, column 1 holds its frequency count.
        self.sym_spell.load_dictionary(self.dictionary_path, 0, 1)
        self.edit_distance_max = edit_distance_max

    def do_symspell(self, sentence):
        # Strip a trailing period and put it back after correction.
        endswith_dot = False
        if sentence.endswith('.'):
            sentence = sentence[:-1]
            endswith_dot = True
        # Register every token that contains a non-letter as a known word,
        # so that lookup_compound leaves it alone. Note that _words is a
        # private attribute of SymSpell and may change between versions.
        for word in sentence.split():
            if re.search("[^a-zA-Z]", word):
                if word not in self.sym_spell._words:
                    self.sym_spell._words[word] = 1
                else:
                    self.sym_spell._words[word] += 1
        results = self.sym_spell.lookup_compound(
            sentence,
            max_edit_distance=self.edit_distance_max, transfer_casing=True,
            ignore_non_words=True, split_phrase_by_space=True,
            ignore_term_with_digits=True)
        sentence = sentence if not results else results[0].term
        return sentence + "." if endswith_dot else sentence

    def do_word_segmentation(self, sentence):
        results = self.sym_spell.word_segmentation(sentence)
        return results.corrected_string


def main(_):
    spell_checker_obj = SpellChecker()
    # with open(FLAGS.filename) as f:
    #     sentences = f.read().splitlines()
    # for sentence in sentences:
    #     print("Prev: %s" % sentence)
    #     print("After: %s" % spell_checker_obj.do_symspell(sentence))
    print(spell_checker_obj.do_symspell(
        "“I see it, I deduce it. How do I know that you have been getting "
        "yourself very wet lately, and that you have a most clumsy and careless "
        "servent girl?”"
    ))
    # print(spell_checker_obj.do_word_segmentation(FLAGS.test_sentence1))


if __name__ == '__main__':
    app.run(main)
…On Tue, May 19, 2020 at 7:06 AM Soumya ***@***.***> wrote:
Hello. Is there an option in symspell where you can skip a list of
keywords, rather than defining a regex for the purpose?
Also, is there a way to detect language using symspell?
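On the keyword list: a minimal sketch, assuming the list still has to end up as a regex for the library. The helper name `build_ignore_pattern` is hypothetical and not part of SymSpell. It turns a plain list of keywords into an anchored pattern, so nobody has to hand-write the regex. As far as I know, recent symspellpy versions accept such a pattern through the `ignore_token` argument of `lookup`, but please check this against the version you have installed.

```python
import re


def build_ignore_pattern(keywords, ignore_case=True):
    """Compile a regex that matches any of the given keywords exactly.

    Hypothetical helper: each keyword is escaped, so characters such as
    '#' or '.' are matched literally rather than as regex syntax.
    """
    escaped = [re.escape(keyword) for keyword in keywords]
    flags = re.IGNORECASE if ignore_case else 0
    # ^...$ anchors ensure that "SymSpells" does not match "SymSpell".
    return re.compile(r"^(?:" + "|".join(escaped) + r")$", flags)


skip = build_ignore_pattern(["SymSpell", "C#", "24/7"])

for token in ["symspell", "C#", "24/7", "SymSpells"]:
    print(token, "-> skip" if skip.match(token) else "-> correct")
```

The pattern is matched per token, so the text has to be split into words before it is applied.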
|
I'm trying to use SymSpell for OCR post-processing spell correction.
I have noticed that SymSpell LookupCompound excludes numbers and special characters from its output. In my context, numbers and special characters are really important for further analysis.
Is it possible to keep LookupCompound from eliminating numbers and special characters?
Version: SymSpell 6.3 C# project
Steps to reproduce:
1. Build the SymSpell C# code.
2. Go to \SymSpell\SymSpell.CompoundDemo
3. Run dotnet run .
4. Enter the input below:
"To find out more about how we use information, visit or contact-any of our offices 24/7"
It gives the output below:
to find out more about how we use information visit or contact any of our offices of 5 30,646,750
Problem:
The output does not contain the ',' after "information" or the "24/7" at the end.
Expected Behavior
to find out more about how we use information, visit or contact any of our offices 24/7
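One possible workaround is sketched here in Python rather than C#. It is not SymSpell's own behavior. It corrects only the purely alphabetic core of each whitespace-separated token and copies everything else through unchanged. The names `correct_preserving` and `fix` are hypothetical, and the toy dictionary stands in for a real SymSpell lookup. Tokens that mix letters with other characters in the middle, such as "contact-any", are deliberately left untouched in this sketch.

```python
import re

TOKEN_RE = re.compile(r"\S+")
# Optional non-letter prefix, a run of ASCII letters, optional non-letter suffix.
WORD_RE = re.compile(r"^([^A-Za-z]*)([A-Za-z]+)([^A-Za-z]*)$")


def correct_preserving(text, correct_word):
    """Apply correct_word to alphabetic words only; keep the rest verbatim."""

    def fix_token(match):
        token = match.group(0)
        parts = WORD_RE.match(token)
        if parts is None:
            # No letters at all ("24/7") or letters split by symbols.
            return token
        lead, word, trail = parts.groups()
        if any(ch.isdigit() for ch in lead + trail):
            # Letters glued to digits ("7pm") are kept as they are.
            return token
        return lead + correct_word(word) + trail

    # Substituting on non-whitespace runs leaves the original spacing intact.
    return TOKEN_RE.sub(fix_token, text)


# Toy corrector standing in for a SymSpell lookup (assumption).
fixes = {"informaton": "information", "ofices": "offices"}


def fix(word):
    return fixes.get(word.lower(), word)


print(correct_preserving("use informaton, visit our ofices 24/7", fix))
```

Because punctuation is reattached after correction, the comma and "24/7" pass through unchanged. The trade-off is that each word is corrected in isolation, so the split and merge corrections that LookupCompound performs across word boundaries are lost.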