This is the Docker image for WhisperX: Automatic Speech Recognition with Word-Level Timestamps (and Speaker Diarization)
Get the Dockerfile at GitHub, or pull the image from ghcr.io.
Warning
Due to the excessively large file sizes (40GB+), continuous integration cannot be set up for these images. As a result, they will not update automatically.
Please build them manually if they are outdated.
The image tags are formatted as WHISPER_MODEL-LANG, for example, tiny-en, base-de, or large-v2-zh.
Please note that I have not uploaded every combination.
You can find all available tags at ghcr.io.
In addition, there is a no_model tag, also available as latest, that does not include any pre-downloaded models.
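The tag scheme can be sketched as a small shell helper. This is only an illustration: the helper name `whisperx_tag` is hypothetical, and the image path `ghcr.io/jim60105/whisperx` is an assumption inferred from the repository name, so check the ghcr.io page for the actual path and the tags that exist.

```shell
#!/bin/sh
# Hypothetical helper: compose an image tag from a model name and a language code.
# Called with anything other than two arguments, it falls back to the model-free tag.
whisperx_tag() {
    if [ "$#" -eq 2 ]; then
        echo "$1-$2"
    else
        echo "no_model"
    fi
}

# Assumed image path (not stated in this README); verify it on ghcr.io before use.
IMAGE="ghcr.io/jim60105/whisperx"

# Print the pull command instead of running it, so the sketch is safe to try.
echo "docker pull $IMAGE:$(whisperx_tag large-v2 zh)"
```

Printing the command first makes it easy to confirm that the tag is listed on ghcr.io before starting a large download.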
Important
Clone the Git repository recursively to include submodules:
git clone --recursive https://github.com/jim60105/docker-whisperX.git
The Dockerfile builds an image that contains the models. It accepts two build arguments: LANG and WHISPER_MODEL.
- LANG: The language to transcribe. The default is en. See here for supported languages.
- WHISPER_MODEL: The model name. The default is base. See fast-whisper for supported models.
For example, to build the image with the ja language and the large-v2 model:
docker build --build-arg LANG=ja --build-arg WHISPER_MODEL=large-v2 -t whisperx:large-v2-ja .
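If several combinations are needed, the same command can be generated in a loop. The following is a minimal sketch, not part of the repository: `build_cmd` is a hypothetical helper, and the model and language lists are placeholders to adjust. It only prints the commands; pipe the output to `sh` to run them.

```shell
#!/bin/sh
# Hypothetical helper: print the `docker build` command for one model/language pair.
# The flags and the tag layout mirror the single-image example above.
build_cmd() {
    model="$1"
    lang="$2"
    echo "docker build --build-arg LANG=$lang --build-arg WHISPER_MODEL=$model -t whisperx:$model-$lang ."
}

# Placeholder lists; replace them with the combinations you actually need.
for model in base large-v2; do
    for lang in en ja; do
        build_cmd "$model" "$lang"
    done
done
```

Because the images are very large, it is worth reviewing the printed list and checking free disk space before building everything.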
Mount the current directory as /app and run WhisperX with additional input arguments:
docker run --gpus all -it -v ".:/app" whisperx:large-v2-ja -- --output_format srt audio.mp3
Note
Remember to prepend -- before the arguments.
The --model and --language arguments are already defined in the Dockerfile, so there is no need to specify them.
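To transcribe every audio file in a directory, the run command can be wrapped in a loop. This is a sketch under assumptions: `run_cmd` and the `WHISPERX_IMAGE` variable are hypothetical names, the default image is the locally built tag from the example above, and file names are assumed to contain no spaces. The commands are printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: print the `docker run` command for one audio file.
# WHISPERX_IMAGE is an assumed override; it defaults to the locally built tag.
run_cmd() {
    image="${WHISPERX_IMAGE:-whisperx:large-v2-ja}"
    echo "docker run --gpus all -it -v \".:/app\" $image -- --output_format srt $1"
}

# Loop over the MP3 files in the current directory, which is mounted as /app.
for f in *.mp3; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    run_cmd "$f"
done
```

Note that the WhisperX arguments still follow the `--` separator, and that `--model` and `--language` are left out because the image already defines them.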
The main program, WhisperX, is distributed under the BSD-4 license.
Please refer to the git submodules for their respective source code licenses.
The Dockerfile from this repository is licensed under MIT.