Flexibility in minhash dedup by index #110

Open
jordane95 opened this issue Feb 28, 2024 · 8 comments

Comments

@jordane95
Contributor

Could we add a new argument to specify whether we want to dedup by index? In some cases, we only want to dedup a dataset against itself and construct the index (say we want to run 10 tasks in parallel), then run the dedup by index in later tasks.

It seems that the hash indexes of all datasets must be stored in one folder, so every subsequent dataset being processed must be deduped against all the indexes in that folder. Also, we cannot specify from which index we want to dedup the current dataset.

@guipenedo
Collaborator

We could add the option you request, something like:

  • dedup_in_index
  • dedup_in_self

and create_index_name controls whether you create an index or not.

For the second part of your issue, you can simply save different indexes in different folders. If you have a use case/example where this doesn't work, I could make it so you can provide a list of glob expressions to match indexes.

@guipenedo
Collaborator

Actually, we already have a block to create an index without deduplicating: MinhashBuildIndex.
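
Roughly, building an index for one part without filtering anything could look like this (just a sketch; the executor setup and argument names such as input_folder/output_folder/index_name are indicative and may differ in your version):

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashBuildIndex, MinhashDedupSignature
from datatrove.pipeline.readers import JsonlReader

# step 1: compute minhash signatures for this part (can run as several parallel tasks)
stage1 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/part-00093/"),
        MinhashDedupSignature(output_folder="signatures/part-00093/"),
    ],
    tasks=10,
)

# step 2: turn the signatures into an index, without removing any documents
stage2 = LocalPipelineExecutor(
    pipeline=[
        MinhashBuildIndex(
            input_folder="signatures/part-00093/",
            output_folder="index/part-00093/",
            index_name="part-00093",
        ),
    ],
    tasks=1,  # typically one task per minhash bucket
    depends=stage1,
)

stage2.run()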

@jordane95
Contributor Author


A list of glob expressions would be great if we want to do index dedup selectively.

@jordane95
Contributor Author

Hi @guipenedo, do you have time to add this? My use case is: I want to deduplicate across all CC dumps. My plan is to first perform the deduplication inside each dump, then run a separate phase of cross-dump deduplication sequentially. With the current code, I store the index of each dump in a different folder. Now that I want to run dedup across dumps, I need to load the indexes of all previous dumps, but they are stored in different folders. It seems that the current interface cannot support this...

@guipenedo
Collaborator

Hi @jordane95,
We are a bit swamped right now, but we did something similar with the existing code:

  • we deduplicated the dumps sequentially (most recent to oldest)
  • we used the same index_folder in step2 for each dump
  • we set create_index_name=DUMP in step 2
  • we set only_dedup_in_index=False in step 2

Basically, for each dump you will both deduplicate within the dump (only_dedup_in_index=False) and against all the previous dumps. A new index file will then be created containing the new hashes from this dump that were not found in the previous ones, which you may use for the following dumps.
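
Schematically, step 2 for one dump would then look something like this (a sketch; index_folder, only_dedup_in_index and create_index_name are the options mentioned above, while the remaining arguments and folder names are indicative):

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupBuckets

DUMP = "CC-MAIN-2023-50"  # hypothetical dump identifier

stage2 = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupBuckets(
            input_folder=f"signatures/{DUMP}/",
            output_folder=f"buckets/{DUMP}/",
            index_folder="minhash_index/",  # the same folder for every dump
            only_dedup_in_index=False,      # also deduplicate within the current dump
            create_index_name=DUMP,         # write this dump's new hashes as a new index
        ),
    ],
    tasks=1,  # typically one task per minhash bucket
)
stage2.run()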

A word of advice though: for FineWeb we originally tried this approach and the performance is considerably worse than just independently deduplicating each dump. Especially for the last dumps to be processed, you will be removing almost all of the data and what is left is usually of bad quality. This may depend on how many dumps you plan to process, though.

@jordane95
Contributor Author

Thanks for your reply. I might need to add some changes myself. It seems that your approach cannot be run in parallel, since each dump depends on the results of the previous dumps. It also cannot reuse the intra-dump dedup results for ablation studies.

Regarding the second point, I have seen your awesome FineWeb dataset. But I think there must be many duplicates in it (for example, URL dedup is not done, according to your info). May I know how you performed the experiments on intra-dump dedup vs. cross-dump dedup? Did you merge all dumps to train the model, or train on each dump individually for comparison? If we use all dumps to train our model, the bad-quality data left in the last dumps will only be seen once and in a small proportion, so there won't be much difference vs. intra-dedup. Unless we think that the data duplicated across dumps is of high quality...

@guipenedo
Collaborator

We plan to have a blog post out later this week that will hopefully answer these questions.

@jordane95
Contributor Author

I find that the glob pattern and the subdirectory are joined in a fixed order here (subdirectory first, then glob):

def list_files(
    self,
    subdirectory: str = "",
    recursive: bool = True,
    glob_pattern: str | None = None,
    include_directories: bool = False,
) -> list[str]:
    """
    Get a list of files on this directory. If `subdirectory` is given, will search in `path/subdirectory`. If
    `glob_pattern` is given, it will only return files that match the pattern, which can be used to match a given
    extension, for example `*.myext`. Be careful with subdirectories when using glob (use ** if you want to match
    any subpath).
    Args:
        subdirectory: str: (Default value = "")
        recursive: bool: (Default value = True)
        glob_pattern: str | None: (Default value = None)
    Returns: a list of file paths, relative to `self.path`
    """
    if glob_pattern and not has_magic(glob_pattern):
        # makes it slightly easier for file extensions
        glob_pattern = f"*{glob_pattern}"
    extra_options = {}
    if isinstance(self.fs, HfFileSystem):
        extra_options["expand_info"] = False  # speed up
    if include_directories:
        extra_options["withdirs"] = True
    return sorted(
        [
            f
            for f, info in (
                self.find(subdirectory, maxdepth=1 if not recursive else None, detail=True, **extra_options)
                if not glob_pattern
                else self.glob(
                    self.fs.sep.join([subdirectory, glob_pattern]) if subdirectory else glob_pattern,
                    maxdepth=1 if not recursive else None,
                    detail=True,
                    **extra_options,
                )
            ).items()
            if include_directories or info["type"] != "directory"
        ]
    )
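
For example (hypothetical folder), a call like the one below only globs under path/bucket_000/, i.e. the subdirectory is always prepended to the pattern:

df = DataFolder("/data/part-00093/index")  # hypothetical DataFolder construction
df.list_files(subdirectory="bucket_000", glob_pattern="*.minhash.index")
# resolves to globbing "bucket_000/*.minhash.index" under /data/part-00093/index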

However, my indexes are stored in a structure like this:

...
/data/part-00093/index/bucket_000/*.minhash.index
/data/part-00094/index/bucket_000/*.minhash.index
...

I want to load the indexes from all previous parts, so I suppose a simple list of glob expressions + subdirectory won't work here, since subdirectory='bucket_000' but the glob pattern would have to be placed before it. I'm thinking about just adding a list of folders for index file finding, or should we extend the functionality of glob?
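
As a rough sketch of the interface I have in mind (a hypothetical helper, not existing datatrove code):

import fsspec

def resolve_index_files(glob_patterns: list[str]) -> list[str]:
    """Resolve a list of glob expressions into a sorted list of index files."""
    fs = fsspec.filesystem("file")  # or s3/hf, depending on where the indexes live
    files = []
    for pattern in glob_patterns:
        files.extend(fs.glob(pattern))
    return sorted(files)

# collect the bucket_000 indexes from all previously processed parts
index_files = resolve_index_files([
    "/data/part-*/index/bucket_000/*.minhash.index",
])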
