StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection (ACL 2024)
Code for the paper: "StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection" published at the ACL 2024 main conference.
To run the agent, please make sure that this repository and SimulEval v1.1.0 are installed.
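If SimulEval v1.1.0 is not installed yet, one possible way to set it up is the following (the repository URL is that of the official SimulEval project; the v1.1.0 tag name is an assumption, please check the available releases):
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
git checkout v1.1.0
pip install -e .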
Create a text file (e.g., src_audiopath_list.txt) containing the list of paths to the audio files, one path per line. Unlike in SimulST, these files are not split into segments but contain the entire speeches.
Specifically, in the case of the MuST-C dataset used in the paper, the file contains the paths to
the entire TED talk files, similar to the following:
${AUDIO_DIR}/ted_1096.wav
${AUDIO_DIR}/ted_1102.wav
${AUDIO_DIR}/ted_1104.wav
${AUDIO_DIR}/ted_1114.wav
${AUDIO_DIR}/ted_1115.wav
...
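For instance, assuming all the talk waveforms are stored in ${AUDIO_DIR}, a simple way to generate this list could be:
ls ${AUDIO_DIR}/*.wav > src_audiopath_list.txt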
As the target file translations.txt, you can use either a dummy file or the concatenation of the reference sentences, one line per talk.
However, for the evaluation of already segmented test sets, such as MuST-C, these references are not needed: the evaluation is performed directly against the segmented translations provided with the dataset, as described in Evaluation with StreamLAAL.
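A dummy target file with one line per talk can be created from the audio list, for example as follows (a minimal sketch, assuming the src_audiopath_list.txt created above):
awk '{print "dummy"}' src_audiopath_list.txt > translations.txt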
For the streaming inference, set --config and --model-path to, respectively, the config file and the model checkpoint downloaded in the Pre-trained Offline models step.
As --source and --target, please use the files src_audiopath_list.txt and translations.txt created in the Requirements step. The output will be saved in the path specified by --output.
For the Hypothesis Selection (based on AlignAtt), please set --frame-num to the value of f used for the inference (f=[2, 4, 6, 8] in the paper).
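The commands below rely on a few shell variables; a minimal, hypothetical setup (all values are placeholders to adapt to your environment) could be:
# Placeholder values -- adapt to your setup
SRC_LIST_OF_AUDIO=src_audiopath_list.txt    # list of audio paths created above
TGT_FILE=translations.txt                   # dummy or concatenated references
DATA_ROOT=/path/to/data_root                # dataset root passed to --data-bin
FRAME=4                                     # f value for AlignAtt (2, 4, 6, or 8 in the paper)
OUT_DIR=streamatt_output                    # directory where SimulEval writes instances.log and the scores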
Depending on the Textual History Selection method (Fixed Words or Punctuation), run one of the following commands. For the Fixed Words selection:
simuleval \
--agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
--simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
--history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedWordsHistorySelection \
--source ${SRC_LIST_OF_AUDIO} \
--target ${TGT_FILE} \
--data-bin ${DATA_ROOT} \
--config config.yaml \
--model-path checkpoint.pt \
--source-segment-size 1000 \
--extract-attn-from-layer 3 \
--frame-num ${FRAME} \
--history-words 20 \
--quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
--device cuda:0
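For the Punctuation-based selection (note that --history-words is not needed in this case):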
simuleval \
--agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
--simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
--history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.PunctuationHistorySelection \
--source ${SRC_LIST_OF_AUDIO} \
--target ${TGT_FILE} \
--data-bin ${DATA_ROOT} \
--config config.yaml \
--model-path checkpoint.pt \
--source-segment-size 1000 \
--extract-attn-from-layer 3 \
--frame-num ${FRAME} \
--quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
--device cuda:0
To run the baseline, execute the following command:
simuleval \
--agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
--simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
--history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedAudioHistorySelection \
--source ${SRC_LIST_OF_AUDIO} \
--target ${TGT_FILE} \
--data-bin ${DATA_ROOT} \
--config config.yaml \
--model-path checkpoint.pt \
--source-segment-size 1000 \
--extract-attn-from-layer 3 \
--frame-num ${FRAME} \
--history-words 20 \
--quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
--device cuda:0
For the simultaneous inference with AlignAtt (the upper bound presented in the paper), please refer to the AlignAtt README.
To evaluate the streaming outputs, download and extract the mwerSegmenter into the ${MWERSEGMENTER_DIR} folder.
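If mwerSegmenter still needs to be obtained, one possible way is the following (the download URL is an assumption taken from the commonly used mwerSegmenter distribution; please refer to the official source):
wget https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz   # assumed URL
mkdir -p ${MWERSEGMENTER_DIR}
tar -xzf mwerSegmenter.tar.gz -C ${MWERSEGMENTER_DIR}   # adjust if the archive extracts into a subfolder
Then, run the following command: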
export MWERSEGMENTER_ROOT=${MWERSEGMENTER_DIR}
streamLAAL --simuleval-instances ${SIMULEVAL_INSTANCES} \
--reference ${REFERENCE_TEXTS} \
--audio-yaml ${AUDIO_YAML} \
--sacrebleu-tokenizer ${SACREBLEU_TOKENIZER} \
--latency-unit ${LATENCY_UNIT}
where ${SIMULEVAL_INSTANCES} is the instances.log output produced by the agent in the previous step, ${REFERENCE_TEXTS} are the textual references in the target language (one line per segment), ${AUDIO_YAML} is the YAML file containing the original audio segmentation, ${SACREBLEU_TOKENIZER} is the sacreBLEU tokenizer used for the quality evaluation (defaults to 13a), and ${LATENCY_UNIT} is the unit used for the latency computation (either word or char, defaults to word, the unit used in the paper).
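As a concrete example, for the MuST-C en-de tst-COMMON set used in the paper, the call could look like the following (the ${MUSTC_ROOT} paths are assumptions about where and how the corpus is stored):
export MWERSEGMENTER_ROOT=${MWERSEGMENTER_DIR}
streamLAAL --simuleval-instances ${OUT_DIR}/instances.log \
--reference ${MUSTC_ROOT}/en-de/data/tst-COMMON/txt/tst-COMMON.de \
--audio-yaml ${MUSTC_ROOT}/en-de/data/tst-COMMON/txt/tst-COMMON.yaml \
--sacrebleu-tokenizer 13a \
--latency-unit word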
If invoking streamLAAL does not work, please include the FBK-fairseq directory (${FBK_FAIRSEQ_DIR}) in the PYTHONPATH (export PYTHONPATH=${FBK_FAIRSEQ_DIR}:$PYTHONPATH) or call it explicitly by running python ${FBK_FAIRSEQ_DIR}/examples/speech_to_text/simultaneous_translation/scripts/stream_laal.py.
@inproceedings{papi-et-al-2024-streamatt,
  title = {{StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection}},
  author = {Papi, Sara and Gaido, Marco and Negri, Matteo and Bentivogli, Luisa},
  booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year = {2024},
  address = {Bangkok, Thailand},
}