Flexibility in minhash dedup by index #110

Open
jordane95 opened this issue Feb 28, 2024 · 8 comments

Comments

@jordane95
Contributor

Could we add a new argument to specify whether we want to dedup by index? In some cases, we only want to dedup a dataset against itself and construct the index (say we want to run 10 tasks in parallel), then run the dedup by index in later tasks.

It seems that the hash indexes of all datasets must be stored in one folder, so every subsequent dataset being processed must be deduped against all the indexes in that folder. Also, we cannot specify from which index we want to dedup the current dataset.

@guipenedo
Collaborator

We could add the option you request, something like:

  • dedup_in_index
  • dedup_in_self

and create_index_name controls whether you create an index or not.

For the second part of your issue, you can simply save different indexes in different folders. If you have a use case/example where this doesn't work, I could make it so you can provide a list of glob expressions to match indexes.

@guipenedo
Collaborator

Actually, we already have a block to create an index without deduplicating: MinhashBuildIndex.
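
Roughly, building an index for one part without filtering anything could look like this (just a sketch; the executor setup and argument names such as input_folder/output_folder/index_name are indicative and may differ in your version):

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashBuildIndex, MinhashDedupSignature
from datatrove.pipeline.readers import JsonlReader

# step 1: compute minhash signatures for this part (can run as several parallel tasks)
stage1 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/part-00093/"),
        MinhashDedupSignature(output_folder="signatures/part-00093/"),
    ],
    tasks=10,
)

# step 2: turn the signatures into an index, without removing any documents
stage2 = LocalPipelineExecutor(
    pipeline=[
        MinhashBuildIndex(
            input_folder="signatures/part-00093/",
            output_folder="index/part-00093/",
            index_name="part-00093",
        ),
    ],
    tasks=1,  # typically one task per minhash bucket
    depends=stage1,
)

stage2.run()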

@jordane95
Contributor Author


A list of glob expressions would be great if we want to do index dedup selectively.

@jordane95
Contributor Author

Hi @guipenedo, do you have time to add this? My use case is: I want to deduplicate across all CC dumps. My plan is to first perform the deduplication inside each dump, then run a separate phase of cross-dump deduplication sequentially. With the current code, I store the index of each dump in a different folder. Now that I want to run dedup across dumps, I need to load the indexes of all previous dumps, but they are stored in different folders. It seems that the current interface cannot support this...

@guipenedo
Collaborator

Hi @jordane95,
We are a bit swamped right now, but we did something similar with the existing code:

  • we deduplicated the dumps sequentially (most recent to oldest)
  • we used the same index_folder in step2 for each dump
  • we set create_index_name=DUMP in step 2
  • we set only_dedup_in_index=False in step 2

Basically, for each dump you will both deduplicate within the dump (only_dedup_in_index=False) and against all the previous dumps. A new index file will then be created containing the new hashes from this dump that were not found in the previous ones, which you may use for the following dumps.
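
Schematically, step 2 for one dump would then look something like this (a sketch; index_folder, only_dedup_in_index and create_index_name are the options mentioned above, while the remaining arguments and folder names are indicative):

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupBuckets

DUMP = "CC-MAIN-2023-50"  # hypothetical dump identifier

stage2 = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupBuckets(
            input_folder=f"signatures/{DUMP}/",
            output_folder=f"buckets/{DUMP}/",
            index_folder="minhash_index/",  # the same folder for every dump
            only_dedup_in_index=False,      # also deduplicate within the current dump
            create_index_name=DUMP,         # write this dump's new hashes as a new index
        ),
    ],
    tasks=1,  # typically one task per minhash bucket
)
stage2.run()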

A word of advice though: for FineWeb we originally tried this approach and the performance is considerably worse than just independently deduplicating each dump. Especially for the last dumps to be processed, you will be removing almost all of the data and what is left is usually of bad quality. This may depend on how many dumps you plan to process, though.

@jordane95
Contributor Author

Thanks for your reply. I might need to add some changes myself. It seems that your approach cannot be run in parallel, since each dump depends on the results of the previous dumps. It also cannot reuse the intra-dump dedup results for ablation studies.

Regarding the second point, I have seen your awesome FineWeb dataset. But I think there must be many duplicates in it (for example, URL dedup is not done, according to your info). May I know how you performed the experiments on intra-dump dedup vs. cross-dump dedup? Did you merge all dumps to train the model, or train on each dump individually for comparison? If we use all dumps to train our model, the bad-quality data left in the last dumps will only be seen once and in a small proportion, so there won't be much difference vs. intra-dedup. Unless we think that the data duplicated across dumps is of high quality...

@guipenedo
Collaborator

We plan to have a blog post out later this week that will hopefully answer these questions.

@jordane95
Contributor Author

I find that the glob pattern and the subdirectory are joined in a fixed order here (subdirectory first, then glob):

def list_files(
    self,
    subdirectory: str = "",
    recursive: bool = True,
    glob_pattern: str | None = None,
    include_directories: bool = False,
) -> list[str]:
    """
    Get a list of files on this directory. If `subdirectory` is given, will search in `path/subdirectory`. If
    `glob_pattern` is given, it will only return files that match the pattern, which can be used to match a given
    extension, for example `*.myext`. Be careful with subdirectories when using glob (use ** if you want to match
    any subpath).
    Args:
        subdirectory: str: (Default value = "")
        recursive: bool: (Default value = True)
        glob_pattern: str | None: (Default value = None)
    Returns: a list of file paths, relative to `self.path`
    """
    if glob_pattern and not has_magic(glob_pattern):
        # makes it slightly easier for file extensions
        glob_pattern = f"*{glob_pattern}"
    extra_options = {}
    if isinstance(self.fs, HfFileSystem):
        extra_options["expand_info"] = False  # speed up
    if include_directories:
        extra_options["withdirs"] = True
    return sorted(
        [
            f
            for f, info in (
                self.find(subdirectory, maxdepth=1 if not recursive else None, detail=True, **extra_options)
                if not glob_pattern
                else self.glob(
                    self.fs.sep.join([subdirectory, glob_pattern]) if subdirectory else glob_pattern,
                    maxdepth=1 if not recursive else None,
                    detail=True,
                    **extra_options,
                )
            ).items()
            if include_directories or info["type"] != "directory"
        ]
    )
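
For example (hypothetical folder), a call like the one below only globs under path/bucket_000/, i.e. the subdirectory is always prepended to the pattern:

df = DataFolder("/data/part-00093/index")  # hypothetical DataFolder construction
df.list_files(subdirectory="bucket_000", glob_pattern="*.minhash.index")
# resolves to globbing "bucket_000/*.minhash.index" under /data/part-00093/index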

However, my indexes are stored in a structure like this:

...
/data/part-00093/index/bucket_000/*.minhash.index
/data/part-00094/index/bucket_000/*.minhash.index
...

I want to load the indexes from all previous parts, so I suppose a simple list of glob expressions + subdirectory won't work here, since subdirectory='bucket_000' but the glob pattern would have to be placed before it. I'm thinking about just adding a list of folders for index file finding, or should we extend the functionality of glob?
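
As a rough sketch of the interface I have in mind (a hypothetical helper, not existing datatrove code):

import fsspec

def resolve_index_files(glob_patterns: list[str]) -> list[str]:
    """Resolve a list of glob expressions into a sorted list of index files."""
    fs = fsspec.filesystem("file")  # or s3/hf, depending on where the indexes live
    files = []
    for pattern in glob_patterns:
        files.extend(fs.glob(pattern))
    return sorted(files)

# collect the bucket_000 indexes from all previously processed parts
index_files = resolve_index_files([
    "/data/part-*/index/bucket_000/*.minhash.index",
])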
