This project is named Grad-SVC, or GVC for short. Its core is diffusion, but it differs from other diffusion-based SVC models. The code is adapted from Grad-TTS and so-vits-svc-5.0, so this project inherits the features of so-vits-svc-5.0. Incidentally, Diff-VC (Diffusion-Based Any-to-Any Voice Conversion) is a follow-up of Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech.
The framework of Grad-SVC:

Demo sample: Elysia_Grad_SVC.mp4
Features:

- Beautiful, easy-to-read code from Grad-TTS
- Multi-speaker conversion based on a speaker encoder
- No speaker leakage, based on GRL (gradient reversal layer)
- No electronic sound
- Low GPU memory required for training: with `batch_size: 8`, training occupies about 3.1 GB of GPU memory in the early epochs and 5.8 GB in the later epochs
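The GRL item above refers to a gradient reversal layer: it passes features through unchanged in the forward pass and flips (and scales) the gradient in the backward pass, so the content encoder is trained to discard speaker identity. A framework-free sketch of the idea (this class is illustrative, not the repo's implementation):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: identity forward, -lambda * grad backward."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # features pass through unchanged

    def backward(self, grad_out: np.ndarray) -> np.ndarray:
        # the reversed, scaled gradient is what reaches the encoder,
        # pushing it to *remove* whatever the speaker classifier uses
        return -self.lam * grad_out
```

Placed between the encoder and an auxiliary speaker classifier, this makes the classifier's success actively penalize speaker information in the encoder output.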
Setup:

- Install project dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Download the timbre encoder (Speaker-Encoder by @mueller91) and put `best_model.pth.tar` into `speaker_pretrain/`.
- Download the hubert_soft model and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
- Download the pretrained `nsf_bigvgan_pretrain_32K.pth` and put it into `bigvgan_pretrain/`.
- Download the pretrained model `gvc.pretrain.pth`, put it into `grad_pretrain/`, and test it:

  ```shell
  python gvc_inference.py --config configs/base.yaml --model ./grad_pretrain/gvc.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
  python gvc_inference_wave.py --mel gvc_out.mel.pt --pit gvc_tmp.pit.csv
  ```

  For this pretrained model, set `temperature = 1.015` in `gvc_inference.py` to get a good result.
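In Grad-TTS-style samplers, `temperature` usually divides the noise that initializes the reverse diffusion, so values slightly above 1 make the output a little steadier. A minimal NumPy sketch of that scaling, assuming the Grad-TTS convention `z = mu + eps / temperature` (where exactly `gvc_inference.py` applies it is not shown here):

```python
import numpy as np

def init_latent(mu, temperature=1.015, seed=0):
    """Initial latent for reverse diffusion: prior mean plus scaled noise.

    Grad-TTS-style samplers draw z = mu + eps / temperature, so a
    temperature slightly above 1 shrinks the injected noise a little.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(mu.shape)
    return mu + eps / temperature

mu = np.zeros((80, 100))                 # e.g. an 80-bin mel prior
z = init_latent(mu, temperature=1.015)   # slightly "cooler" than pure noise
```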
Put the dataset into the `data_raw` directory following the structure below.

```
data_raw
├── speaker0
│   ├── 000001.wav
│   ├── ...
│   └── 000xxx.wav
└── speaker1
    ├── 000001.wav
    ├── ...
    └── 000xxx.wav
```
After preprocessing, you will get output with the following structure.

```
data_gvc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── mel
│   ├── speaker0
│   │   ├── 000001.mel.pt
│   │   └── 000xxx.mel.pt
│   └── speaker1
│       ├── 000001.mel.pt
│       └── 000xxx.mel.pt
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
```
Preprocessing:

- Re-sampling

  - Generate audio with a sampling rate of 16000 Hz in `./data_gvc/waves-16k`:

    ```shell
    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000
    ```

  - Generate audio with a sampling rate of 32000 Hz in `./data_gvc/waves-32k`:

    ```shell
    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000
    ```

- Use 16k audio to extract pitch:

  ```shell
  python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch
  ```
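`preprocess_f0.py` writes one F0 contour per wave file. The repo's extractor is not shown here, but the core idea of frame-wise pitch detection can be sketched with a simple autocorrelation estimator (a toy illustration, not the script's actual method):

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int,
                fmin: float = 50.0, fmax: float = 500.0) -> float:
    """Estimate the F0 of one frame from its strongest autocorrelation lag."""
    frame = frame - frame.mean()
    # one-sided autocorrelation: ac[k] = sum_n frame[n] * frame[n + k]
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search plausible lags only
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

For a 220 Hz sine at a 16 kHz sampling rate, this returns an estimate within a few Hz of 220.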
- Use 32k audio to extract mel:

  ```shell
  python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel
  ```

- Use 16k audio to extract hubert:

  ```shell
  python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert
  ```

- Use 16k audio to extract the timbre code:

  ```shell
  python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker
  ```

- Extract the average of the timbre codes for inference:

  ```shell
  python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
  ```
- Use 32k audio to generate the training index:

  ```shell
  python prepare/preprocess_train.py
  ```

- Training file debugging:

  ```shell
  python prepare/preprocess_zzz.py
  ```
Training:

- Start training:

  ```shell
  python gvc_trainer.py
  ```

- Resume training:

  ```shell
  python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
  ```

- Log visualization:

  ```shell
  tensorboard --logdir logs/
  ```

- Export the inference model:

  ```shell
  python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pt
  ```
Inference:

- Inference

  - Convert wave to mel:

    ```shell
    python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --shift 0
    ```

  - Convert mel to wave:

    ```shell
    python gvc_inference_wave.py --mel gvc_out.mel.pt --pit gvc_tmp.pit.csv
    ```

- Inference step by step

  - Extract the hubert content vector:

    ```shell
    python hubert/inference.py -w test.wav -v test.vec.npy
    ```

  - Extract pitch to the csv text format:

    ```shell
    python pitch/inference.py -w test.wav -p test.csv
    ```

  - Convert hubert & pitch to mel:

    ```shell
    python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
    ```

  - Convert mel to wave:

    ```shell
    python gvc_inference_wave.py --mel gvc_out.mel.pt --pit test.csv
    ```
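The `--shift` argument used above transposes the pitch in semitones; a shift of `n` semitones multiplies F0 by 2^(n/12). A minimal sketch of that relation (the helper is illustrative, not taken from the repo):

```python
import numpy as np

def shift_f0(f0: np.ndarray, semitones: float) -> np.ndarray:
    """Shift an F0 contour by a number of semitones (2**(n/12) per semitone).

    Unvoiced frames, conventionally marked with f0 == 0, are left untouched.
    """
    factor = 2.0 ** (semitones / 12.0)
    return np.where(f0 > 0, f0 * factor, f0)

f0 = np.array([0.0, 220.0, 440.0])   # Hz; 0.0 marks an unvoiced frame
up_octave = shift_f0(f0, 12)          # +12 semitones doubles each voiced frame
```

`--shift 0` therefore leaves the source pitch unchanged; use a nonzero value to match the target singer's range.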
Code sources and references:

https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS
https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC
https://github.com/facebookresearch/speech-resynthesis
https://github.com/shivammehta25/Diff-TTSG
https://github.com/gmltmd789/UnitSpeech
https://github.com/zhenye234/CoMoSpeech
https://github.com/seahore/PPG-GradVC
https://github.com/thuhcsi/LightGrad
https://github.com/lmnt-com/wavegrad
https://github.com/naver-ai/facetts
https://github.com/jaywalnut310/vits
https://github.com/NVIDIA/BigVGAN
https://github.com/bshall/soft-vc
https://github.com/mozilla/TTS