- Submission deadline for ASR and Speech-to-text translation tasks: October 14, 2022
- Submission deadline for machine translation task: October 25, 2022
- Results announcement: October 29, 2022
The official submission leaderboards can be found at the following links:
Code | Language | Translation Pair |
---|---|---|
bzd | Bribri | Spanish |
gn | Guaraní | Spanish |
gvc | Kotiria | Portuguese |
tav | Wa'ikhana | Portuguese |
quy | Quechua | Spanish |
Test files for the ASR task are available here.
The data for the competition can be found here. Alternatively, you can use the provided download script to automatically download the data for all languages. The script takes a single argument: the folder to download the data into.
./download_data.sh destination_folder
Each language folder contains two subfolders, each corresponding to a different training split. Each subfolder contains multiple audio files and a single TSV file with all transcriptions and translations. Each audio file contains a single sentence or utterance. The TSV file is structured as follows:
Header | Content |
---|---|
wav | The corresponding audio filename. |
source_processed | A processed version of the audio transcription. |
source_raw | The original raw transcript. We ask that you use this column for training and evaluation, and ignore the source_processed column. |
target_raw | The translation of the transcription into either Spanish or Portuguese. |
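As an illustration of the layout described above, the following is a minimal sketch of loading one split folder in Python. It is not part of the official tooling. It assumes the TSV file has a `.tsv` extension, is UTF-8 encoded, starts with a header row using the column names from the table, and that the `wav` column holds a filename relative to the split folder. The function name `load_split` is hypothetical.

```python
import csv
from pathlib import Path


def load_split(split_dir):
    """Return one record per utterance found in a single split folder."""
    split_dir = Path(split_dir)

    # Assumption: each split folder holds exactly one .tsv file.
    tsv_files = sorted(split_dir.glob("*.tsv"))
    if len(tsv_files) != 1:
        raise ValueError(
            f"expected exactly one .tsv file in {split_dir}, found {len(tsv_files)}"
        )

    records = []
    with tsv_files[0].open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters in the transcripts as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            records.append(
                {
                    # Assumption: audio filenames are relative to the split folder.
                    "audio": split_dir / row["wav"],
                    # source_raw is used; source_processed is ignored on purpose.
                    "transcript": row["source_raw"],
                    "translation": row["target_raw"],
                }
            )
    return records
```

Calling `load_split` on a split folder inside `destination_folder` would return a list of dictionaries with the audio path, the raw transcript, and its translation. The exact folder names depend on the downloaded data.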
The baseline model for the ASR task is implemented in ESPnet. The scripts to run the model can be found in the following directory of the ESPnet repository.