Commit

Update Finetuning README.md (facebookresearch#244)
elbayadm authored Dec 4, 2023
1 parent 379aa56 commit 82f9432
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions src/seamless_communication/cli/m4t/finetune/README.md
@@ -8,7 +8,7 @@ The trainer and dataloader were designed mainly for demonstration purposes. Thei

M4T training dataset is a multimodal parallel corpus. Each training sample has four parts: audio and text representation of the sample in the source language, and its corresponding audio and text representation in the target language.

- That kind of dataset can be prepared using `dataset.py` script that downloads FLEURS dataset from [HuggingFace datastes hub](https://huggingface.co/datasets/google/fleurs), (optionally) extracts units from the target audio samples, and prepares a manifest consumable by `finetune.py`. Manifest is a text file where each line represents information about a single dataset sample, serialized in JSON format.
+ That kind of dataset can be prepared using `dataset.py` script that downloads FLEURS dataset from [HuggingFace datasets hub](https://huggingface.co/datasets/google/fleurs), (optionally) extracts units from the target audio samples, and prepares a manifest consumable by `finetune.py`. Manifest is a text file where each line represents information about a single dataset sample, serialized in JSON format.

List of input arguments for `dataset.py`:

@@ -18,7 +18,7 @@ List of input arguments for `dataset.py`:
--target_lang TARGET_LANG
M4T langcode of the dataset TARGET language
--split SPLIT Dataset split/shard to download (`train`, `test`)
- --save_dir SAVE_DIR Directory where the datastets will be stored with HuggingFace datasets cache files
+ --save_dir SAVE_DIR Directory where the datasets will be stored with HuggingFace datasets cache files
```

Language codes should follow the notation adopted by M4T models.
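
The README text in this diff describes the manifest as a text file with one JSON-serialized sample per line. A minimal sketch of reading such a file is shown below. The helper name `iter_manifest` is hypothetical and not part of the repository, and because the manifest's field names are not shown in this diff, the sketch treats every record as an opaque dictionary.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_manifest(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of a JSON-lines manifest.

    Hypothetical helper for illustration; the record schema is not assumed.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so a broken manifest is easy to fix.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({err})") from err
```

Reading lazily, one line at a time, keeps memory use flat for large manifests; wrap the call in `list(...)` when the whole file is needed at once.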