Speech Synthesis Paper

A list of speech synthesis papers. Recommendations of awesome papers are welcome 😀

TTS Frontend

Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis (Interspeech 2019)
A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis (ICASSP 2020)
A hybrid text normalization system using multi-head self-attention for Mandarin (ICASSP 2020)

Acoustic Model

Data Efficiency

Vocoder

Autoregressive Model

WaveNet: WaveNet: A Generative Model for Raw Audio (2016)
WaveRNN: Efficient Neural Audio Synthesis (ICML 2018)
LPCNet: LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (ICASSP 2019)
MultiBand-WaveRNN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)

Non-autoregressive Model

Parallel-WaveNet: Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017)
WaveGAN: Adversarial Audio Synthesis (2018)
WaveGlow: WaveGlow: A Flow-based Generative Network for Speech Synthesis (2018)
GAN-TTS: High Fidelity Speech Synthesis with Adversarial Networks (2019)
Parallel-WaveGAN: Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram (2019)
MelGAN: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (NeurIPS 2019)
MultiBand-MelGAN: Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (2020)
VocGAN: VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network (Interspeech 2020)

TTS towards Stylization

Expressive TTS

ReferenceEncoder-Tacotron: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)
GST-Tacotron: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (ICML 2018)
GMVAE-Tacotron2: Hierarchical Generative Modeling for Controllable Speech Synthesis (ICLR 2019)
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis (2018)
(Multi-style Decouple): Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis (Interspeech 2019)
Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (ICASSP 2020)
Controllable Neural Prosody Synthesis (Interspeech 2020)

MultiSpeaker TTS

Sample Efficient Adaptive Text-to-Speech (ICLR 2019)
SV-Tacotron: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (NeurIPS 2018)
Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings (ICASSP 2020)
MultiSpeech: MultiSpeech: Multi-Speaker Text to Speech with Transformer (2020)
SC-WaveRNN: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions (Interspeech 2020)

Voice Conversion

ASR Based

(introduces PPG into voice conversion): Phonetic posteriorgrams for many-to-one voice conversion without parallel data training (2016)
A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (2019)
TTS-Skins: TTS Skins: Speaker Conversion via ASR (2019)

GAN/VAE Based

AutoVC: AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (2019)
CycleGAN-VC V1: Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks (2017)
CycleGAN-VC V2: CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion (2019)
StarGAN-VC: StarGAN-VC: Non-parallel many-to-many Voice Conversion with Star Generative Adversarial Networks (2018)
VAE-VC (VAE based): Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder (2016)

Other

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion (2019)
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization (2019)
Cotatron (combines text information with a voice conversion system): Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data (2020)

Singing Synthesis

XiaoIce Band: XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music (KDD 2018)
PitchNet: PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network (ICASSP 2020)
Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
ByteSing: ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders (2020)
JukeBox: Jukebox: A Generative Model for Music (2020)
XiaoIce Sing: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System (2020)
DurIAN-SC: DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System (Interspeech 2020)
Speech-to-Singing Conversion based on Boundary Equilibrium GAN (Interspeech 2020)

Speech Pretrained Model

Audio-Word2Vec: Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder (2016)
SpeechBERT: SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering (2019)
Improving Transformer-based Speech Recognition Using Unsupervised Pre-training (2019)

TaoTaoFu/speech-synthesis-paper