🐛 Description
sif4sci may return None. Similarly, GensimWordTokenizer may return None as well.
Error Message
(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=100
before running your script.)
To Reproduce
(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)
Steps to reproduce
(Paste the commands you ran that produced the error.)
import json
from EduNLP.SIF import sif4sci, is_sif, to_sif
# Missing in the original snippet; the import path is assumed.
from EduNLP.Pretrain import GensimWordTokenizer


def load_items2():
    """Load one JSON object per line from OpenLUNA.json."""
    items = []
    with open("OpenLUNA.json", encoding="utf-8") as f:
        for line in f:
            items.append(json.loads(line))
    return items


items = load_items2()
# ----------------------------------------- #
tokenization_params1 = {
    "formula_params": {
        "method": "linear",
        "symbolize_figure_formula": True
    }
}
tokenizer = GensimWordTokenizer(symbol="fgm")
# ----------------------------------------- #
wrong_num = 0
for item in items:
    res = sif4sci(item["stem"], symbol="gm", tokenization_params=tokenization_params1, errors="ignore")
    # res = tokenizer(item["stem"])  # the tokenizer shows the same behaviour
    if res is None:
        wrong_num += 1
print(f"There are {wrong_num} / {len(items)} wrong cases!")
# There are 156 / 792 wrong cases!
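Counting None results hides why each item failed. A minimal sketch of a helper that groups failures by exception type is shown here; `collect_failures` and its `parse` argument are hypothetical names introduced for illustration and are not part of EduNLP.

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, List, Tuple


def collect_failures(
    texts: Iterable[str], parse: Callable[[str], Any]
) -> Tuple[Counter, Dict[str, List[str]]]:
    """Run `parse` on every text and group the failures by kind.

    A failure is either an exception raised by `parse` or a None result.
    """
    counts: Counter = Counter()
    examples: Dict[str, List[str]] = {}
    for text in texts:
        try:
            result = parse(text)
        except Exception as exc:  # RecursionError is a subclass of Exception
            kind = type(exc).__name__
        else:
            if result is not None:
                continue  # parsed successfully, nothing to record
            kind = "returned None"
        counts[kind] += 1
        examples.setdefault(kind, []).append(text)
    return counts, examples
```

In the script above, `parse` could be something like `lambda s: sif4sci(s, symbol="gm", tokenization_params=tokenization_params1, errors="raise")`, assuming `errors` accepts a value that re-raises instead of ignoring; that value is an assumption and should be checked against the library. Printing `counts.most_common()` would then show which error types dominate the 156 failing items.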
What have you tried to solve it?
I found that the None results come from the way raised errors are handled: GensimWordTokenizer uses errors="ignore", so a raised error is suppressed and None is returned instead.
Looking at the specific errors, one main type is related to the SIF Parser, so I wonder whether we need to handle this problem there.
For example, the Parser cannot identify "n=" and "p=":
(1)
s1 = "执行右面的程序框图,则输出的n=$\\FigureID{3bf20b93-8af1-11eb-b205-b46bfc50aa29}$$\\FigureID{59b88b3f-8af1-11eb-9450-b46bfc50aa29}$$\\FigureID{63116570-8b75-11eb-b694-b46bfc50aa29}$$\\FigureID{6a006177-8b76-11eb-9ac0-b46bfc50aa29}$$\\FigureID{088f15e9-8b7c-11eb-959f-b46bfc50aa29}$"
is_sif(s1)
RecursionError Traceback (most recent call last)
<ipython-input-3-a8de420882df> in <module>
11
12 # ----------------------------------------- #
---> 13 is_sif(s1)
14
15 # ----------------------------------------- #
e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\sif.py in is_sif(item, check_formula, return_parser)
50 """
51 item_parser = Parser(item, check_formula)
---> 52 item_parser.description_list()
53 if item_parser.fomula_illegal_flag:
54 raise ValueError(item_parser.fomula_illegal_message)
e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in description_list(self)
344 """
345 # print('call description_list')
--> 346 self.description()
347 if self.error_flag:
348 # print("Error")
e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in description(self)
304 # if self.error_flag:
305 # return
--> 306 self.txt_list()
307 if self.error_flag:
308 return
e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in txt_list(self)
298 return
299 if self.lookahead != self.empty:
--> 300 self.txt_list()
301
302 def description(self):
... last 1 frames repeated, from the frame below ...
e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in txt_list(self)
298 return
299 if self.lookahead != self.empty:
--> 300 self.txt_list()
301
302 def description(self):
RecursionError: maximum recursion depth exceeded in comparison
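The traceback shows txt_list calling itself until the interpreter's recursion limit is hit. One possible shape for a fix is sketched below, assuming the parser stalls on a character it has no rule for (the root cause is not confirmed here). `ToyParser` is a hypothetical stand-in, not EduNLP's Parser: it deliberately has no rule for "=", and it contrasts a self-recursive rule with a loop that checks for progress and raises a descriptive error.

```python
class ToyParser:
    """Hypothetical stand-in for a recursive-descent parser (not EduNLP's)."""

    EMPTY = None

    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0
        self.tokens = []

    @property
    def lookahead(self):
        """Next unread character, or EMPTY at end of input."""
        return self.text[self.pos] if self.pos < len(self.text) else self.EMPTY

    def txt(self) -> None:
        """Consume one character; '=' is left unhandled to mimic a grammar gap."""
        if self.lookahead is self.EMPTY or self.lookahead == "=":
            return
        self.tokens.append(self.lookahead)
        self.pos += 1

    def txt_list_recursive(self) -> None:
        """Same shape as the traceback: recurse while input remains."""
        self.txt()
        if self.lookahead != self.EMPTY:
            self.txt_list_recursive()

    def txt_list_iterative(self) -> None:
        """Loop instead of recursing, and fail fast when nothing is consumed."""
        while self.lookahead != self.EMPTY:
            before = self.pos
            self.txt()
            if self.pos == before:
                raise ValueError(
                    f"cannot parse {self.lookahead!r} at position {self.pos}"
                )
```

With this toy grammar, `ToyParser("n=1").txt_list_recursive()` ends in a RecursionError, while `txt_list_iterative()` raises a ValueError that names the offending character and its position. The loop also avoids deep call stacks on long but valid input.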
Environment
Operating System: Windows
Python Version: Python 3.6