This project is named Grad-SVC, or GVC for short. Its core is diffusion, but it differs from other diffusion-based SVC models. The code is adapted from Grad-TTS and so-vits-svc-5.0, so this project inherits the features of so-vits-svc-5.0. Incidentally, Diff-VC (Diffusion-Based Any-to-Any Voice Conversion) is a follow-up of Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech.
The framework of Grad-SVC:

Demo sample: Elysia_Grad_SVC.mp4
Features:

- Beautiful, easy-to-read code from Grad-TTS
- Multi-speaker conversion based on a speaker encoder
- No speaker leakage, based on GRL (gradient reversal layer)
- No electronic sound
- Low GPU memory required for training: with `batch_size: 8`, training occupies about 3.1 GB of GPU memory in the early epochs and 5.8 GB in the later epochs
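The GRL item above refers to a gradient reversal layer: it passes features through unchanged in the forward pass and flips (and scales) the gradient in the backward pass, so the content encoder is trained to discard speaker identity. A framework-free sketch of the idea (this class is illustrative, not the repo's implementation):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: identity forward, -lambda * grad backward."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # features pass through unchanged

    def backward(self, grad_out: np.ndarray) -> np.ndarray:
        # the reversed, scaled gradient is what reaches the encoder,
        # pushing it to *remove* whatever the speaker classifier uses
        return -self.lam * grad_out
```

Placed between the encoder and an auxiliary speaker classifier, this makes the classifier's success actively penalize speaker information in the encoder output.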
Setup:

- Install project dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Download the timbre encoder (Speaker-Encoder by @mueller91) and put `best_model.pth.tar` into `speaker_pretrain/`.
- Download the hubert_soft model and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
- Download the pretrained `nsf_bigvgan_pretrain_32K.pth` and put it into `bigvgan_pretrain/`.
- Download the pretrained model `gvc.pretrain.pth`, put it into `grad_pretrain/`, and test it:

  ```shell
  python gvc_inference.py --config configs/base.yaml --model ./grad_pretrain/gvc.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
  python gvc_inference_wave.py --mel gvc_out.mel.pt --pit gvc_tmp.pit.csv
  ```

  For this pretrained model, set `temperature = 1.015` in `gvc_inference.py` to get a good result.
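In Grad-TTS-style samplers, `temperature` usually divides the noise that initializes the reverse diffusion, so values slightly above 1 make the output a little steadier. A minimal NumPy sketch of that scaling, assuming the Grad-TTS convention `z = mu + eps / temperature` (where exactly `gvc_inference.py` applies it is not shown here):

```python
import numpy as np

def init_latent(mu, temperature=1.015, seed=0):
    """Initial latent for reverse diffusion: prior mean plus scaled noise.

    Grad-TTS-style samplers draw z = mu + eps / temperature, so a
    temperature slightly above 1 shrinks the injected noise a little.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(mu.shape)
    return mu + eps / temperature

mu = np.zeros((80, 100))                 # e.g. an 80-bin mel prior
z = init_latent(mu, temperature=1.015)   # slightly "cooler" than pure noise
```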
Put the dataset into the `data_raw` directory following the structure below.

```
data_raw
├── speaker0
│   ├── 000001.wav
│   ├── ...
│   └── 000xxx.wav
└── speaker1
    ├── 000001.wav
    ├── ...
    └── 000xxx.wav
```
After preprocessing, you will get output with the following structure.

```
data_gvc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── mel
│   ├── speaker0
│   │   ├── 000001.mel.pt
│   │   └── 000xxx.mel.pt
│   └── speaker1
│       ├── 000001.mel.pt
│       └── 000xxx.mel.pt
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
```
Preprocessing:

- Re-sampling

  - Generate audio with a sampling rate of 16000 Hz in `./data_gvc/waves-16k`:

    ```shell
    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000
    ```

  - Generate audio with a sampling rate of 32000 Hz in `./data_gvc/waves-32k`:

    ```shell
    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000
    ```

- Use 16k audio to extract pitch:

  ```shell
  python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch
  ```
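`preprocess_f0.py` writes one F0 contour per wave file. The repo's extractor is not shown here, but the core idea of frame-wise pitch detection can be sketched with a simple autocorrelation estimator (a toy illustration, not the script's actual method):

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int,
                fmin: float = 50.0, fmax: float = 500.0) -> float:
    """Estimate the F0 of one frame from its strongest autocorrelation lag."""
    frame = frame - frame.mean()
    # one-sided autocorrelation: ac[k] = sum_n frame[n] * frame[n + k]
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search plausible lags only
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

For a 220 Hz sine at a 16 kHz sampling rate, this returns an estimate within a few Hz of 220.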
- Use 32k audio to extract mel:

  ```shell
  python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel
  ```

- Use 16k audio to extract hubert:

  ```shell
  python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert
  ```

- Use 16k audio to extract the timbre code:

  ```shell
  python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker
  ```

- Extract the average of the timbre codes for inference:

  ```shell
  python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
  ```
- Use 32k audio to generate the training index:

  ```shell
  python prepare/preprocess_train.py
  ```

- Training file debugging:

  ```shell
  python prepare/preprocess_zzz.py
  ```
Training:

- Start training:

  ```shell
  python gvc_trainer.py
  ```

- Resume training:

  ```shell
  python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
  ```

- Log visualization:

  ```shell
  tensorboard --logdir logs/
  ```

- Export the inference model:

  ```shell
  python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pt
  ```
Inference:

- Inference

  - Convert wave to mel:

    ```shell
    python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --shift 0
    ```

  - Convert mel to wave:

    ```shell
    python gvc_inference_wave.py --mel gvc_out.mel.pt --pit gvc_tmp.pit.csv
    ```

- Inference step by step

  - Extract the hubert content vector:

    ```shell
    python hubert/inference.py -w test.wav -v test.vec.npy
    ```

  - Extract pitch to the csv text format:

    ```shell
    python pitch/inference.py -w test.wav -p test.csv
    ```

  - Convert hubert & pitch to mel:

    ```shell
    python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
    ```

  - Convert mel to wave:

    ```shell
    python gvc_inference_wave.py --mel gvc_out.mel.pt --pit test.csv
    ```
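The `--shift` argument used above transposes the pitch in semitones; a shift of `n` semitones multiplies F0 by 2^(n/12). A minimal sketch of that relation (the helper is illustrative, not taken from the repo):

```python
import numpy as np

def shift_f0(f0: np.ndarray, semitones: float) -> np.ndarray:
    """Shift an F0 contour by a number of semitones (2**(n/12) per semitone).

    Unvoiced frames, conventionally marked with f0 == 0, are left untouched.
    """
    factor = 2.0 ** (semitones / 12.0)
    return np.where(f0 > 0, f0 * factor, f0)

f0 = np.array([0.0, 220.0, 440.0])   # Hz; 0.0 marks an unvoiced frame
up_octave = shift_f0(f0, 12)          # +12 semitones doubles each voiced frame
```

`--shift 0` therefore leaves the source pitch unchanged; use a nonzero value to match the target singer's range.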
Code sources and references:

https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS
https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC
https://github.com/facebookresearch/speech-resynthesis
https://github.com/shivammehta25/Diff-TTSG
https://github.com/gmltmd789/UnitSpeech
https://github.com/zhenye234/CoMoSpeech
https://github.com/seahore/PPG-GradVC
https://github.com/thuhcsi/LightGrad
https://github.com/lmnt-com/wavegrad
https://github.com/naver-ai/facetts
https://github.com/jaywalnut310/vits
https://github.com/NVIDIA/BigVGAN
https://github.com/bshall/soft-vc
https://github.com/mozilla/TTS