A fork of so-vits-svc with realtime support and a greatly improved interface. Based on branch 4.0 (v1); the models are compatible.
- Realtime voice conversion (enhanced in v1.1.0)
- More accurate pitch estimation using CREPE
- GUI available
- Unified command-line interface (no need to run Python scripts)
- Ready to use just by installing with `pip`
- Automatically downloads the pretrained base model and HuBERT model
- Code completely formatted with black, isort, autoflake, etc.
- Other minor differences
Install this via pip (or your favourite package manager that uses pip):

```shell
python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -U so-vits-svc-fork
```
- If no GPU is available, simply omit `pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu117`.
- If you are using an AMD GPU on Linux, replace `--index-url https://download.pytorch.org/whl/cu117` with `--index-url https://download.pytorch.org/whl/rocm5.4.2` (see the example after this list). AMD GPUs are not supported on Windows (#120).
- If `fairseq` raises an error:
  - If it prompts that `Microsoft C++ Build Tools` is not installed, please install it.
  - If it prompts that some DLL is missing, reinstalling `Microsoft Visual C++ 2022` and `Windows SDK` may help.
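For example, a complete installation on Linux with an AMD GPU uses the same commands as above, with only the index URL swapped:

```shell
python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
pip install -U so-vits-svc-fork
```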
Please update this package regularly to get the latest features and bug fixes.

```shell
pip install -U so-vits-svc-fork
```
The GUI launches with the following command:

```shell
svcg
```
- Realtime (from microphone)

```shell
svc vc --model-path <model-path>
```

- File

```shell
svc infer --model-path <model-path> source.wav
```
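As the CLI help below notes, if you keep the recommended folder structure (`configs/44k/config.json`, `logs/44k/G_XXXX.pth`), the model and config paths can be omitted and the latest checkpoint is loaded automatically:

```shell
# The latest model under logs/44k/ is picked up automatically
svc infer source.wav
```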
Pretrained models are available on HuggingFace.
- If using WSL, please note that WSL requires additional setup to handle audio, and the GUI will not work without finding an audio device.
- In real-time inference, if there is noise on the inputs, the HuBERT model will react to it as well. Consider using a realtime noise-reduction application such as RTX Voice in this case.
- If your dataset has BGM, please remove the BGM using software such as Ultimate Vocal Remover. `3_HP-Vocal-UVR.pth` or `UVR-MDX-NET Main` is recommended.
- If your dataset is a long audio file with multiple speakers, use `svc pre-sd` to split it into multiple files (using `pyannote.audio`; see the sketch after this list). Further manual classification may be necessary due to accuracy issues. If speakers speak with a variety of speech styles, set `--min-speakers` larger than the actual number of speakers. Due to unresolved dependencies, please install `pyannote.audio` manually: `pip install pyannote-audio`.
- If your dataset is a long audio file with a single speaker, use `svc pre-split` to split it into multiple files (using `librosa`).
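A minimal sketch of the diarization step. Only `--min-speakers` is documented in the note above; the `-i`/`-o` directory options here are assumptions modeled on the other preprocessing commands, so check `svc pre-sd -h` for the exact flags:

```shell
pip install pyannote-audio

# Assumed flags: -i (input dir) and -o (output dir) are hypothetical;
# --min-speakers is set above the true speaker count, as advised above
svc pre-sd -i raw_recordings -o dataset_raw --min-speakers 4
```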
Place your dataset like `dataset_raw/{speaker_id}/**/{wav_file}.{any_format}` (subfolders and non-ASCII filenames are acceptable; an example layout is shown after the commands) and run:

```shell
svc pre-resample
svc pre-config
svc pre-hubert
svc train -t
```
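For illustration, a valid layout for a single hypothetical speaker named `alice` (all names here are placeholders) could be prepared like this:

```shell
# Subfolders and any audio format matching the glob above are fine
mkdir -p dataset_raw/alice/session1
cp ~/recordings/alice_take01.wav dataset_raw/alice/session1/
cp ~/recordings/alice_take02.flac dataset_raw/alice/
```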
- Dataset audio duration per file should be roughly 10 seconds or less, or VRAM will run out.
- To change the f0 inference method to CREPE, replace `svc pre-hubert` with `svc pre-hubert -fm crepe` (see the example after this list). You may need to reduce `--n-jobs` due to performance issues.
- It is recommended to change the `batch_size` in `config.json` before the `train` command to match your VRAM capacity. The default value is optimized for a Tesla T4 (16 GB VRAM), but training is possible with less VRAM.
- Silence removal and volume normalization are performed automatically (as in the upstream repo) and are not required.
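Putting the notes above together, a preprocessing and training run that uses CREPE for f0 estimation could look like this (the `--n-jobs` value is just an example; tune it to your machine):

```shell
svc pre-resample
svc pre-config
# -fm crepe switches the f0 method; lower --n-jobs if preprocessing struggles
svc pre-hubert -fm crepe --n-jobs 2
svc train -t
```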
For more details, run `svc -h` or `svc <subcommand> -h`.
```shell
> svc -h
Usage: svc [OPTIONS] COMMAND [ARGS]...

  so-vits-svc allows any folder structure for training data.
  However, the following folder structure is recommended.
      When training: dataset_raw/{speaker_name}/**/{wav_name}.{any_format}
      When inference: configs/44k/config.json, logs/44k/G_XXXX.pth
  If the folder structure is followed, you DO NOT NEED TO SPECIFY model path, config path, etc.
  (The latest model will be automatically loaded.)
  To train a model, run pre-resample, pre-config, pre-hubert, train.
  To infer a model, run infer.

Options:
  -h, --help  Show this message and exit.

Commands:
  clean          Clean up files, only useful if you are using the default file structure
  infer          Inference
  onnx           Export model to onnx
  pre-config     Preprocessing part 2: config
  pre-hubert     Preprocessing part 3: hubert  If the HuBERT model is not found, it will be...
  pre-resample   Preprocessing part 1: resample
  pre-sd         Speech diarization using pyannote.audio
  pre-split      Split audio files into multiple files
  train          Train model  If D_0.pth or G_0.pth not found, automatically download from hub.
  train-cluster  Train k-means clustering
  vc             Realtime inference from microphone
```
Thanks goes to these wonderful people (emoji key):
This project follows the all-contributors specification. Contributions of any kind welcome!