Clone of the SawSing DDSP vocoder's official implementation.
```bash
# pip install "torch==1.11.0" -q      # Based on your environment (validated with vX.YZ)
# pip install "torchaudio==0.11.0" -q # Based on your environment
# pip install git+https://github.com/tarepan/SawSing-official
pip install -r requirements.txt
```
Place 24 kHz/16-bit .wav files in the directory structure below:
```
data
├─ solo            # speaker name
│  ├─ test         # scenario-common test
│  ├─ val          # scenario-common validation
│  ├─ train-full   # training scenario No.1
│  │  ├─ audio
│  │  │  ├─ xxx.wav  # place .wav files here
│  │  ├─ mel
│  │  │  ├─ xxx.npy  # auto-generated by preprocessing
```
Then, run preprocessing:
```bash
python preprocess.py
```
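After preprocessing, each clip should have a matching mel file. A quick sanity check (the path reuses the placeholder `xxx` from the tree above; the exact array layout is whatever preprocess.py writes):

```python
import numpy as np

# Load one auto-generated mel and inspect it; check preprocess.py for
# the authoritative frame/bin axis order and dtype.
mel = np.load("data/solo/train-full/mel/xxx.npy")
print(mel.shape, mel.dtype)
```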
Train vocoders from scratch.
- Modify the configuration file `./configs/<model_name>.yaml`
- Run the following command:
```bash
python main.py --config ./configs/sawsinsub.yaml \
               --stage training \
               --model SawSinSub
```
You can specify the model with the `--model` argument.
Currently this repository supports 5 harmonics-plus-noise vocoders [4] (3 from the paper, 2 additional):
Model Name (in the paper) | Harmonics Synthesizer | Note |
---|---|---|
SawSub | Subtracted sawtooth (exact) | modified from the SawSing paper |
SawSinSub (SawSing) | Subtracted sawtooth (additive approx.) | from the SawSing paper |
Sins (DDSP-Add) | Added sinusoids | from the DDSP paper |
Full | Subtracted added sinusoids | modified from the DDSP paper |
DWS (DWTS) | Wavetable | [3] |
SawSinSub differs from SawSub in that it approximates the sawtooth with band-limited additive sinusoids, which works as anti-aliasing.
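As an illustration, here is a minimal NumPy sketch of a band-limited additive sawtooth (the truncation point, scaling, and phase convention are illustrative assumptions, not the repository's exact synthesizer):

```python
import numpy as np

def bandlimited_saw(f0, sr=24000, dur=1.0):
    """Approximate a sawtooth by summing its harmonics up to Nyquist.

    A sawtooth's k-th harmonic has amplitude ~ 1/k; truncating the
    Fourier series at sr/2 is what prevents the aliasing of a naive
    (exact) sawtooth.
    """
    t = np.arange(int(sr * dur)) / sr
    k_max = int((sr / 2) // f0)  # highest harmonic below Nyquist
    saw = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, k_max + 1))
    return (2 / np.pi) * saw  # scale roughly into [-1, 1]
```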
For more details on the synthesizers, refer to synthesizer_demo.
[3] (ICASSP'22) Differentiable Wavetable Synthesis
[4] (ICASSP'93) HNS: Speech modification based on a harmonic+noise model
For validation (compute validation loss and real-time factor):
- Modify the configuration file `./configs/<model_name>.yaml`
- Run the following command:
```bash
# SawSing as an example
python main.py --config ./configs/sawsinsub.yaml \
               --stage validation \
               --model SawSinSub \
               --model_ckpt ./exp/f1-full/sawsinsub-256/ckpts/vocoder_27740_70.0_params.pt \
               --output_dir ./test_gen
```
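For reference, real-time factor is wall-clock synthesis time divided by the duration of the generated audio (RTF < 1 means faster than real time). A minimal sketch, with a hypothetical `vocoder` callable rather than the repo's actual entry point:

```python
import time

def real_time_factor(vocoder, mel, sr=24000):
    """RTF = synthesis wall-clock time / duration of the generated audio.

    `vocoder` is any mel -> waveform callable (hypothetical stand-in).
    """
    start = time.perf_counter()
    wav = vocoder(mel)
    elapsed = time.perf_counter() - start
    return elapsed / (len(wav) / sr)
```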
Both CLI and Python usage are supported. For details, jump to ☞ and check it.
Run mel-to-wave inference. The code and specification for extracting mel-spectrograms can be found in preprocess.py; a hedged extraction sketch follows the command below.
```bash
# SawSing as an example
python main.py --config ./configs/sawsinsub.yaml \
               --stage inference \
               --model SawSinSub \
               --model_ckpt ./exp/f1-full/sawsinsub-256/ckpts/vocoder_27740_70.0_params.pt \
               --input_dir ./path/to/mel \
               --output_dir ./test_gen
```
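Input mels must match the specification in preprocess.py. As a hedged sketch only, extraction with the pinned torchaudio typically looks like the following (the `n_fft`/`hop_length`/`n_mels` values and the log scaling are illustrative assumptions; the authoritative values live in preprocess.py):

```python
import torch
import torchaudio

# Illustrative parameters only -- the authoritative spec is in preprocess.py.
wav, sr = torchaudio.load("data/solo/train-full/audio/xxx.wav")  # 24 kHz input
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80)
mel = torch.log(to_mel(wav) + 1e-6)  # log-mel is a common convention
```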
To mitigate SawSing's buzzing artifacts, run post-processing. For more details, please refer to here.
- training
  - x.x [iter/sec] @ NVIDIA X0 on Google Colaboratory (AMP+)
  - takes about y days for the whole training
  - the original authors used a single NVIDIA RTX 3090 Ti GPU
- inference
  - z.z [sec/sample] @ xx
The authors provide checkpoints and experiment records. Great!
- Checkpoints
  - Sins (DDSP-Add): `./exp/f1-full/sins/ckpts/`
  - SawSinSub (SawSing): `./exp/f1-full/sawsinsub-256/ckpts/`
- The full experimental records, reports, and checkpoints can be found under the `exp` folder.
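To peek at a released checkpoint before wiring it into main.py, a hedged sketch (whether the file stores a bare state dict or a wrapped dict is defined by the repo, so the code only inspects what it finds):

```python
import torch

# Inspect the checkpoint contents without assuming its exact structure.
ckpt = torch.load(
    "./exp/f1-full/sawsinsub-256/ckpts/vocoder_27740_70.0_params.pt",
    map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt)[:5])  # first few keys
else:
    print(type(ckpt))
```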
- glitch artifacts: see also [5]
- buzzing artifacts
  - occur only in the subtractive synthesizers (SawSub, SawSinSub, Full); see also [6]
  - possible solutions
    - replace the LTV-FIR filter with a better one
    - apply a UV mask (see the sketch after this list)
- E2E training: data-efficient, interpretable, and lightweight -> joint training with acoustic models
- Feature: mel-spectrograms -> controllable features, e.g. f0, UV mask
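As an illustration of the UV-mask idea above, a minimal sketch (the frame-level f0 convention with 0 Hz marking unvoiced frames and the hop size are assumptions; this is not the repository's implementation):

```python
import numpy as np

def apply_uv_mask(harmonic, noise, f0, hop=256):
    """Silence the harmonic branch on unvoiced frames; keep the noise branch.

    `f0` is a per-frame contour with 0 Hz marking unvoiced frames
    (an assumed convention for this sketch).
    """
    uv = (f0 > 0).astype(np.float32)            # 1 = voiced, 0 = unvoiced
    mask = np.repeat(uv, hop)[: len(harmonic)]  # frame mask -> sample mask
    return harmonic * mask + noise
```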
[5] (ICASSP'22) Improving adversarial waveform generation based singing voice conversion with harmonic signals
[6] (INTERSPEECH'22) Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation
```bibtex
@article{sawsing,
  title   = {DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation},
  author  = {Da-Yi Wu and Wen-Yi Hsiao and Fu-Rong Yang and Oscar Friedman and Warren Jackson and Scott Bruzenak and Yi-Wen Liu and Yi-Hsuan Yang},
  journal = {Proc. International Society for Music Information Retrieval},
  year    = {2022},
}
```
- Any preceding works