SymSpell LookupCompound excluding Numbers and Special characters #34

Open
geomygeorge opened this issue Jun 26, 2018 · 8 comments


@geomygeorge

I'm trying to use SymSpell for spelling correction in OCR post-processing.
I have noticed that SymSpell's LookupCompound drops numbers and special characters from its output. In my context, numbers and special characters are important for further analysis.
Is it possible to keep numbers and special characters in the output?

Version: SymSpell 6.3 C# project

Steps to reproduce:

  1. Build the SymSpell C# code

  2. Go to \SymSpell\SymSpell.CompoundDemo

  3. Run dotnet run .

  4. Enter the input below:
    "To find out more about how we use information, visit or contact-any of our offices 24/7"

  5. It produces the output below:
    to find out more about how we use information visit or contact any of our offices of 5 30,646,750

Problem:
The output does not contain ',' or 24/7.

Expected Behavior
to find out more about how we use information, visit or contact any of our offices 24/7

@wolfgarbe
Owner

Punctuation:
When the input string is parsed into separate terms in line 773 (string[] termList1 = ParseWords(input);), the punctuation characters (like ',') between words are currently discarded. They could instead be preserved and stored in a separate array, possibly together with the upper/lower-case information of the words. After the correction, the result is assembled from the separate suggestionParts in line 894. At that point the suggestionParts could be recombined with the preserved punctuation characters and case information.
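
The idea above could be sketched roughly as follows. This is only an illustration in Python, not SymSpell code: `split_preserving`, `restore_case`, `correct_preserving` and the `correct` callback are hypothetical names, and the sketch assumes a corrector that maps one term to exactly one term (LookupCompound may also merge or split terms, which would need extra alignment logic).

```python
import re
from typing import Callable, List, Tuple

# A term is a run of letters/digits, optionally with inner apostrophes.
# Everything between terms (spaces, ',', '-', '/') counts as a separator.
TERM = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def split_preserving(text: str) -> Tuple[List[str], List[str]]:
    """Split text into terms and the separators around them.

    separators[i] precedes terms[i]; the last separator trails the last term.
    """
    terms: List[str] = []
    separators: List[str] = []
    pos = 0
    for match in TERM.finditer(text):
        separators.append(text[pos:match.start()])
        terms.append(match.group())
        pos = match.end()
    separators.append(text[pos:])
    return terms, separators


def restore_case(original: str, corrected: str) -> str:
    """Re-apply the casing pattern of the original term to the correction."""
    if not corrected:
        return corrected
    if len(original) > 1 and original.isupper():
        return corrected.upper()
    if original[0].isupper():
        return corrected[0].upper() + corrected[1:]
    return corrected


def correct_preserving(text: str, correct: Callable[[str], str]) -> str:
    """Correct each lower-cased term, then recombine with separators and case."""
    terms, separators = split_preserving(text)
    out = [separators[0]]
    for term, sep in zip(terms, separators[1:]):
        out.append(restore_case(term, correct(term.lower())))
        out.append(sep)
    return "".join(out)
```

With a toy corrector such as `lambda w: {"ofices": "offices"}.get(w, w)`, the separators `, ` and `-` and the leading capital survive the round trip, because only the terms themselves are passed to the corrector.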

Numbers:
Currently 24/7 is treated as two separate terms, 24 and 7 (/ is treated as punctuation and discarded).
As there are no numbers in the included dictionary, the two terms "24" and "7" are "corrected" into "of" and "a".
Either you add numbers to the dictionary, or you remove and preserve all numbers during the parsing in line 773 and re-combine them later.
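
A rough sketch of the second option (take numbers out before correction and put them back afterwards), again in Python and not taken from SymSpell: `correct_around_numbers` and the `correct_segment` callback are hypothetical names, and the pattern for what counts as a "number" is an assumption that would need tuning for real OCR output.

```python
import re
from typing import Callable, List

# Assumed definition of a number: digits, optionally joined by '/', '.', ','
# or ':' (for example 24/7, 3.14, 30,646,750, 12:30).
NUMBER = re.compile(r"(\d+(?:[/.,:]\d+)*)")


def correct_around_numbers(text: str, correct_segment: Callable[[str], str]) -> str:
    """Correct only the text between numbers and keep the numbers verbatim."""
    # Splitting on a pattern with one capture group puts the captured
    # numbers at the odd indices of the resulting list.
    parts = NUMBER.split(text)
    out: List[str] = []
    for i, part in enumerate(parts):
        if i % 2 == 1 or not part.strip():
            out.append(part)  # a number, or empty/whitespace only: keep as-is
            continue
        # Keep the surrounding whitespace so the numbers stay separated
        # from the corrected text when everything is joined again.
        lead = part[: len(part) - len(part.lstrip())]
        trail = part[len(part.rstrip()):]
        out.append(lead + correct_segment(part.strip()) + trail)
    return "".join(out)
```

Because whole segments are handed to `correct_segment`, a compound corrector that merges or splits words could be plugged in here without disturbing the numbers.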

I will add this feature later this year.

@trungkiendang

trungkiendang commented Oct 25, 2018

Hi @wolfgarbe, is this function implemented? You wrote earlier:

"I will add this feature later this year."

@wolfgarbe
Owner

Not yet.

@Prashant118

Hi @wolfgarbe is this function implemented?

@fahadshery

I think it is vitally important not to remove numbers.

@hardiksanchawat

Hi @wolfgarbe

Are these functionalities implemented?

@islama-lh

Is this feature available yet?

@islama-lh

islama-lh commented May 19, 2020 via email
