-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use "regex" library from PyPI to build GLOBAL_SENTENCE_TERMINATORS regex #4
base: master
Are you sure you want to change the base?
Conversation
This is analogous to sentencex-js/pull/3 — both remove the |
Hi @divec, All the tests are passing with this pull request, but I noticed a performance regression. Running accuracy.py in the benchmarks folder of this repo gives the following measures
So that is more than two times the master version. Since the original issue is only with sentence terminator regex and not with other regex usage, I tried to apply a minimal patch: diff --git a/sentencex/base.py b/sentencex/base.py
index a2218e6..e2c0bde 100644
--- a/sentencex/base.py
+++ b/sentencex/base.py
@@ -1,4 +1,5 @@
import re
+import regex
from typing import Dict, Iterator, Tuple
from .terminators import GLOBAL_SENTENCE_TERMINATORS
@@ -44,7 +45,7 @@ class Language(object, metaclass=Languages):
language = "base"
abbreviations: set = set()
- GLOBAL_SENTENCE_BOUNDARY_REGEX = re.compile(r"[%s]+" % "".join(GLOBAL_SENTENCE_TERMINATORS))
+ GLOBAL_SENTENCE_BOUNDARY_REGEX = regex.compile(r"\p{Sentence_Terminal}+")
quotes_regx_str = r"|".join([f"{left}(\n|.)*?{right}" for left, right in quote_pairs.items()])
quotes_regex = re.compile(r"%s+" % quotes_regx_str)
parens_regex = re.compile(r"([\((<{\[])(?:\\\1|.)*?[\)\]})]")
Then I got 1.37 seconds- which is better, but still a regression in performance. Maintaining an in-library collection of these characters helped us to achieve a javascript counterpart. So the zero dependency for this library was an intentional decision. What if we fix the amaharic issue and the comment issue you pointed out and keep the zero dependency pattern as such? |
The terminators collection was originally sourced from wikinlp tools repo. I re-ran unichars program and I don't see these Amharic/Ethiopic chars in it. That must be a descrepancy in the list. https://gist.github.com/santhoshtr/af560fde46b7ee56ed14fd3f6614de01 However, sometimes we need additional characters. For example, horizontal ellipsis is not in the list. |
Thanks @santhoshtr. If we maintain the list of terminators within sentencex, then we probably also need to maintain the list of close punctuation and spaces in order to expose boundary information:
The master copy of these lists is available in https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/SentenceBreakProperty.txt , which we pull into UnicodeJS using an extraction script that might be useful to borrow from, if you decide that maintaining these lists within sentencex is worthwhile for the performance boost. |
Ah ok, understood. I wonder if the wikinlp version got partially corrupted due to encoding errors? The new version in the commit you added looks a lot less strange! I note the new code in the commit adds U+3002 IDEOGRAPHIC FULL STOP manually; a related character that seems to be missing is U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP. I wonder why these are missing from the perl script output? Maybe they omit ideographic characters to save memory or something? U+2026 HORIZONTAL ELLIPSIS is not actually listed as STerm in the Unicode Character Database. Did we make a heuristic decision to treat it as a sentence terminator nevertheless? If we really want to include it, then the manual entry should actually use |
Fixes #3