StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection (ACL 2024)
Code for the paper: "StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection" published at the ACL 2024 main conference.
To run the agent, please make sure that this repository and SimulEval v1.1.0 are installed.
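If SimulEval v1.1.0 is not installed yet, one possible way to set it up is the following (the repository URL is that of the official SimulEval project; the v1.1.0 tag name is an assumption, please check the available releases):
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
git checkout v1.1.0
pip install -e .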
Create a text file (e.g., src_audiopath_list.txt) containing the list of paths to the audio files, one path per line. Unlike in SimulST, these files are not split into segments but contain the entire speeches.
Specifically, in the case of the MuST-C dataset used in the paper, the file contains the paths to
the entire TED talk files, similar to the following:
${AUDIO_DIR}/ted_1096.wav
${AUDIO_DIR}/ted_1102.wav
${AUDIO_DIR}/ted_1104.wav
${AUDIO_DIR}/ted_1114.wav
${AUDIO_DIR}/ted_1115.wav
...
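For instance, assuming all the talk waveforms are stored in ${AUDIO_DIR}, a simple way to generate this list could be:
ls ${AUDIO_DIR}/*.wav > src_audiopath_list.txt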
As the target file translations.txt, you can use either a dummy file or the concatenation of the reference sentences, one line per talk.
However, for the evaluation of already segmented test sets, such as MuST-C, these references are not needed: the evaluation is performed directly against the segmented translations provided with the dataset, as described in Evaluation with StreamLAAL.
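A dummy target file with one line per talk can be created from the audio list, for example as follows (a minimal sketch, assuming the src_audiopath_list.txt created above):
awk '{print "dummy"}' src_audiopath_list.txt > translations.txt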
For the streaming inference, set --config and --model-path to, respectively, the config file and the model checkpoint downloaded in the Pre-trained Offline models step.
As --source and --target, please use the files src_audiopath_list.txt and translations.txt created in the Requirements step. The output will be saved in the path specified by --output.
For the Hypothesis Selection (based on AlignAtt), please set --frame-num to the value of f used for the inference (f=[2, 4, 6, 8] in the paper).
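The commands below rely on a few shell variables; a minimal, hypothetical setup (all values are placeholders to adapt to your environment) could be:
# Placeholder values -- adapt to your setup
SRC_LIST_OF_AUDIO=src_audiopath_list.txt    # list of audio paths created above
TGT_FILE=translations.txt                   # dummy or concatenated references
DATA_ROOT=/path/to/data_root                # dataset root passed to --data-bin
FRAME=4                                     # f value for AlignAtt (2, 4, 6, or 8 in the paper)
OUT_DIR=streamatt_output                    # directory where SimulEval writes instances.log and the scores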
Depending on the Textual History Selection method (Fixed Words or Punctuation), run one of the following commands. For the Fixed Words selection:
simuleval \
--agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
--simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
--history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedWordsHistorySelection \
--source ${SRC_LIST_OF_AUDIO} \
--target ${TGT_FILE} \
--data-bin ${DATA_ROOT} \
--config config.yaml \
--model-path checkpoint.pt \
--source-segment-size 1000 \
--extract-attn-from-layer 3 \
--frame-num ${FRAME} \
--history-words 20 \
--quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
--device cuda:0
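For the Punctuation-based selection (note that --history-words is not needed in this case):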
simuleval \
--agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
--simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
--history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.PunctuationHistorySelection \
--source ${SRC_LIST_OF_AUDIO} \
--target ${TGT_FILE} \
--data-bin ${DATA_ROOT} \
--config config.yaml \
--model-path checkpoint.pt \
--source-segment-size 1000 \
--extract-attn-from-layer 3 \
--frame-num ${FRAME} \
--quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
--device cuda:0
To run the baseline, execute the following command:
simuleval \
--agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
--simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
--history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedAudioHistorySelection \
--source ${SRC_LIST_OF_AUDIO} \
--target ${TGT_FILE} \
--data-bin ${DATA_ROOT} \
--config config.yaml \
--model-path checkpoint.pt \
--source-segment-size 1000 \
--extract-attn-from-layer 3 \
--frame-num ${FRAME} \
--history-words 20 \
--quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
--device cuda:0
For the simultaneous inference with AlignAtt (the upper bound presented in the paper), please refer to the AlignAtt README.
To evaluate the streaming outputs, download and extract the mwerSegmenter into the ${MWERSEGMENTER_DIR} folder.
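If mwerSegmenter still needs to be obtained, one possible way is the following (the download URL is an assumption taken from the commonly used mwerSegmenter distribution; please refer to the official source):
wget https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz   # assumed URL
mkdir -p ${MWERSEGMENTER_DIR}
tar -xzf mwerSegmenter.tar.gz -C ${MWERSEGMENTER_DIR}   # adjust if the archive extracts into a subfolder
Then, run the following command: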
export MWERSEGMENTER_ROOT=${MWERSEGMENTER_DIR}
streamLAAL --simuleval-instances ${SIMULEVAL_INSTANCES} \
--reference ${REFERENCE_TEXTS} \
--audio-yaml ${AUDIO_YAML} \
--sacrebleu-tokenizer ${SACREBLEU_TOKENIZER} \
--latency-unit ${LATENCY_UNIT}
where ${SIMULEVAL_INSTANCES} is the instances.log output produced by the agent in the previous step, ${REFERENCE_TEXTS} are the textual references in the target language (one line per segment), ${AUDIO_YAML} is the YAML file containing the original audio segmentation, ${SACREBLEU_TOKENIZER} is the sacreBLEU tokenizer used for the quality evaluation (defaults to 13a), and ${LATENCY_UNIT} is the unit used for the latency computation (either word or char, defaults to word, the unit used in the paper).
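As a concrete example, for the MuST-C en-de tst-COMMON set used in the paper, the call could look like the following (the ${MUSTC_ROOT} paths are assumptions about where and how the corpus is stored):
export MWERSEGMENTER_ROOT=${MWERSEGMENTER_DIR}
streamLAAL --simuleval-instances ${OUT_DIR}/instances.log \
--reference ${MUSTC_ROOT}/en-de/data/tst-COMMON/txt/tst-COMMON.de \
--audio-yaml ${MUSTC_ROOT}/en-de/data/tst-COMMON/txt/tst-COMMON.yaml \
--sacrebleu-tokenizer 13a \
--latency-unit word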
If invoking streamLAAL does not work, please include the FBK-fairseq directory (${FBK_FAIRSEQ_DIR}) in the PYTHONPATH (export PYTHONPATH=${FBK_FAIRSEQ_DIR}:$PYTHONPATH) or call it explicitly by running python ${FBK_FAIRSEQ_DIR}/examples/speech_to_text/simultaneous_translation/scripts/stream_laal.py.
@inproceedings{papi-et-al-2024-streamatt,
  title = {{StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection}},
  author = {Papi, Sara and Gaido, Marco and Negri, Matteo and Bentivogli, Luisa},
  booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year = {2024},
  address = {Bangkok, Thailand},
}