Potential issues in substring dedup #121
Can you share the code you used to decode, and expand a bit on what exactly you did to compile these excerpts?
I just added some debug code to the function to output the resulting document when there are duplicates:

```python
text = doc.text
if self.debug:
    # keep the matched duplicate spans and the original text for inspection
    doc.metadata['duplicates'] = duplicates
    doc.metadata['raw_text'] = text
# TODO improve
for d in duplicates:
    # strip every decoded duplicate span from the document text
    text = text.replace(d, "")
doc.text = text
```
Actually, these examples are quite common, roughly 2 in 10?
Any idea on this, @guipenedo? Could there be a wrong offset in the byte-level operations?
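For reference, a minimal pure-Python sketch (not the datatrove code) of how an off-by-one byte offset can produce garbled text: if the boundary of a removed range lands inside a multi-byte UTF-8 character, the leftover byte can no longer be decoded cleanly.

```python
# Hypothetical illustration of an off-by-one byte offset: 'ï' occupies two
# bytes in UTF-8, so shifting a cut boundary by a single byte leaves a stray
# lead byte behind and the decode produces a replacement character.
text = "naïve approach"
data = text.encode("utf-8")

correct = data[:2] + data[4:]     # remove the full 2-byte 'ï'
off_by_one = data[:3] + data[4:]  # same cut, shifted by one byte

print(correct.decode("utf-8", errors="replace"))     # nave approach
print(off_by_one.decode("utf-8", errors="replace"))  # na�ve approach
```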
Also, I find that some duplicates are decoded into strings with a strange ending.
So it's been a while since I took a look at this, and the person who wrote the exactsubstr code is no longer involved with the project, but to me both issues sound like typical byte-level problems where there is an off-by-one offset.
Yeah, I think the strange char is related to a BPE problem: it is a subword token that could not be decoded into a full word. In the original implementation by Google, they don't even decode the token ids, assuming the output tokens are fed directly into LM training. I found some bugs in the byte-range normalization code that could produce this type of nonsense text. I will submit a PR to fix this soon.
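To illustrate what byte-range normalization refers to here, a hedged sketch (assuming a Google-style setup where every token id is stored as two bytes; the values and helpers are illustrative, not the datatrove code): a duplicate range reported in byte offsets has to be aligned to the 2-byte token boundary before it maps back to token ids, otherwise the recovered ids, and the text decoded from them, are nonsense.

```python
# Illustrative only: token ids packed as little-endian uint16, 2 bytes each,
# as assumed for the exact-substring byte stream.
import struct

token_ids = [1000, 2000, 3000, 4000, 5000]
data = b"".join(struct.pack("<H", t) for t in token_ids)

# A duplicate range reported in byte offsets that is NOT token-aligned:
start, end = 3, 7
print(struct.unpack("<HH", data[start:end]))     # ids straddling two real tokens: nonsense values

# Rounding inward to even offsets recovers a real token id:
start_a, end_a = (start + 1) // 2 * 2, end // 2 * 2
print(struct.unpack("<H", data[start_a:end_a]))  # (3000,)
```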
Could this check be too strict? Some texts are not exactly the same after being encoded and decoded; they only differ by a small margin. (datatrove/src/datatrove/pipeline/dedup/exact_substrings.py, lines 334 to 336 at a98aafd)
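As a hedged sketch of why a strict equality might fail (assuming the referenced lines compare the re-decoded text against the original; the tokenizer name below matches the one mentioned later in the thread but is otherwise an assumption): tokenizer encode/decode round-trips are not always the identity.

```python
# Hedged sketch: decode(encode(text)) can differ from text by a few bytes
# even when both strings render identically, so a strict string equality
# check on the round-tripped text can reject near-identical outputs.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B")  # tokenizer mentioned by the reporter

text = "a final sentence containing a rare token"  # substitute a text that triggers the mismatch
ids = tok.encode(text, add_special_tokens=False)
roundtrip = tok.decode(ids)

print(roundtrip == text)                                # can be False for some inputs
print(text.encode("utf-8"), roundtrip.encode("utf-8"))  # comparing raw bytes shows where they differ
```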
For example, for this text,
Using the Qwen 1.5 tokenizer to encode and decode, I find that the final sentence, which contains a rare token, is incorrectly decoded.
They look exactly the same, but differ in the underlying bytes or chars.
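A hedged illustration of that last point (the strings below are made up, not taken from the affected document): two strings can render identically while differing at the byte level, for instance a precomposed character versus its decomposed form.

```python
# Illustrative only: 'café' with a precomposed 'é' vs. the NFD-decomposed
# form ('e' + combining acute accent) look the same on screen but have
# different code points and different UTF-8 bytes.
import unicodedata

a = "café"
b = unicodedata.normalize("NFD", a)

print(a == b)             # False
print(a.encode("utf-8"))  # b'caf\xc3\xa9'
print(b.encode("utf-8"))  # b'cafe\xcc\x81'
```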
Also, see this one
Hi @guipenedo, I used your substring dedup script to perform deduplication on a Common Crawl dump and did some manual inspection, and I find some of the resulting duplicates a bit strange.
For example,
Many duplicates make no sense after being decoded from bytes into text. Is this normal? Some of the examples do look good.