Flexibility in minhash dedup by index #110
We could add the option you request, something like:
For the second part of your issue, you can simply save different indexes in different folders. If you have a use case/example where this doesn't work, I could make it so you can provide a list of glob expressions to match indexes.
Actually, we already have a block to create an index without deduplicating.
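For reference, a minimal sketch of wiring such an index-building block into a pipeline. The block and argument names below (`MinhashBuildIndex`, `input_folder`, `index_name`, ...) are recalled from the datatrove minhash module and may not match the current API exactly; the paths and task counts are made up:

```python
# Sketch only: block and argument names are assumptions, check the datatrove
# source for the actual signatures.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.dedup import MinhashDedupSignature, MinhashBuildIndex

# Stage 1: compute minhash signatures for one dump.
stage1 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/dump_0"),  # hypothetical input path
        MinhashDedupSignature(output_folder="signatures/dump_0"),
    ],
    tasks=10,  # parallel tasks for this dump
)

# Stage 2: turn those signatures into an index, without filtering any documents.
stage2 = LocalPipelineExecutor(
    pipeline=[
        MinhashBuildIndex(
            input_folder="signatures/dump_0",
            output_folder="index",
            index_name="dump_0",
        ),
    ],
    tasks=14,  # roughly one task per minhash bucket (14 is the default, if memory serves)
)

if __name__ == "__main__":
    stage1.run()
    stage2.run()
```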
A list of glob expressions would be great if we want to do index dedup selectively.
Hi @guipenedo, do you have time to add this? My use case is: I want to deduplicate across all CC dumps. My plan is to first perform deduplication inside each dump, then run a separate phase of cross-dump deduplication sequentially. With the current code, I store the index of each dump in a different folder. Now that I want to run dedup across dumps, I need to load the indexes of all previous dumps, but they are stored in different folders. It seems that the current interface cannot support this...
Hi @jordane95,
Basically, for each dump you would both deduplicate within the dump and against the index of previously processed dumps. A word of advice though: for FineWeb we originally tried this approach and the performance is considerably worse than just independently deduplicating each dump. Especially for the last dumps to be processed, you will be removing almost all of the data and what is left is usually of bad quality. This may depend on how many dumps you plan to process, though.
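To make the sequential approach (and why the last dumps fare worst) concrete, here is a small self-contained sketch of the idea in plain Python rather than the datatrove API: each dump is deduplicated against the accumulated index of every previously processed dump, and whatever survives is added to that index before the next dump is handled.

```python
import hashlib
import json
from pathlib import Path

def fingerprint(text: str) -> str:
    """Stand-in for a real minhash signature: a single content hash."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def dedup_dumps_sequentially(dump_files: list[Path], output_dir: Path) -> None:
    index: set[str] = set()          # grows with every dump already processed
    output_dir.mkdir(parents=True, exist_ok=True)
    for dump_file in dump_files:     # order matters: earlier dumps keep their documents
        kept = []
        for line in dump_file.read_text().splitlines():
            doc = json.loads(line)
            fp = fingerprint(doc["text"])
            if fp in index:          # duplicate within this dump or of an earlier one
                continue
            index.add(fp)            # extend the cross-dump index
            kept.append(doc)
        out = output_dir / dump_file.name
        out.write_text("".join(json.dumps(d) + "\n" for d in kept))
```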
Thanks for your reply. I might need to make some changes myself. It seems that your approach cannot be run in parallel, since the next dump depends on the results of previous dumps. It also cannot reuse the intra-dump dedup results for ablation studies. Regarding the second point, I have seen your awesome FineWeb dataset, but I think there must still be many duplicates in it (for example, URL dedup is not done, according to your info). May I ask how you performed the experiments on intra-dump vs. cross-dump dedup? Did you merge all dumps to train the model, or train on each dump individually for comparison? If all dumps are used to train the model, the bad-quality data left in the last dumps will only be seen once and in a small proportion, so there won't be much difference vs. intra-dump dedup, unless we think the data duplicated across dumps is of high quality...
We plan to have a blogpost out later this week that will hopefully answer these questions.
I find that the glob and subdirectory have been switched here (lines 118 to 158 in 0f2c69f).
However, my indexes are stored in a structure like this:
I want to load the indexes from all previous parts, so I suppose a simple list of glob + subdirectory won't work here, since...
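For illustration, a minimal sketch of what "a list of glob expressions to match indexes" could look like from the user's side; the nested layout below is hypothetical:

```python
from glob import glob
from pathlib import Path

# Hypothetical layout: one index folder per dump, itself split into parts, e.g.
#   index/dump_00/part_0.index, index/dump_00/part_1.index, ...
# A list of glob expressions would let you pick exactly which indexes to load.
index_globs = [
    "index/dump_00/*.index",
    "index/dump_01/*.index",
    # "index/dump_02/*.index",   # e.g. deliberately left out
]

index_files = sorted({Path(p) for pattern in index_globs for p in glob(pattern)})
print(f"would deduplicate against {len(index_files)} index files")
```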
Could we add a new argument to specify whether we want to dedup by index? In some cases, we only want to dedup a dataset against itself and construct the index (say we want to run 10 tasks in parallel), then run the dedup-by-index in later tasks.
It seems that the hash indexes of all datasets must be stored in one folder, so any subsequent dataset being processed must be deduped against all the indexes in that folder. Also, we cannot specify against which indexes we want to dedup the current dataset.
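A rough sketch of the requested two-phase workflow, again in plain Python rather than the library's actual API: phase 1 builds each dataset's index in parallel without any cross-dataset dedup; phase 2 later filters a dataset against an explicit list of indexes (paths and layout below are hypothetical).

```python
import hashlib
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def fingerprint(text: str) -> str:
    """Stand-in for a real minhash signature."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def build_index(shard: Path, index_dir: Path) -> Path:
    """Phase 1: index one shard on its own (self-dedup is implicit in the set)."""
    seen = {fingerprint(json.loads(line)["text"])
            for line in shard.read_text().splitlines() if line.strip()}
    index_file = index_dir / f"{shard.stem}.index"
    index_file.write_text("\n".join(sorted(seen)))
    return index_file

def dedup_by_index(shard: Path, index_files: list[Path], out_file: Path) -> None:
    """Phase 2: keep only documents absent from the explicitly chosen indexes."""
    index: set[str] = set()
    for f in index_files:
        index.update(f.read_text().split())
    with out_file.open("w") as fout:
        for line in shard.read_text().splitlines():
            if line.strip() and fingerprint(json.loads(line)["text"]) not in index:
                fout.write(line + "\n")

if __name__ == "__main__":
    shards = sorted(Path("data").glob("shard_*.jsonl"))   # hypothetical layout
    index_dir, out_dir = Path("index"), Path("deduped")
    index_dir.mkdir(exist_ok=True)
    out_dir.mkdir(exist_ok=True)

    # Phase 1: fully parallel, e.g. 10 shards indexed at once.
    with ProcessPoolExecutor(max_workers=10) as pool:
        index_files = list(pool.map(build_index, shards, [index_dir] * len(shards)))

    # Phase 2: dedup each shard against only the indexes of the shards before it.
    for i, shard in enumerate(shards):
        dedup_by_index(shard, index_files[:i], out_dir / shard.name)
```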