EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector
The official implementation of EmoSphere++
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Department of Artificial Intelligence, Korea University, Seoul, Korea.
Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and the limitations of available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that ensures effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance emotion-transfer performance in zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.
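For intuition, here is a minimal sketch of how an emotion-adaptive spherical vector can be formed: re-center each utterance's valence-arousal-dominance (VAD) point on a neutral centroid and convert it to spherical coordinates, so the radius serves as an intensity proxy and the angles encode emotional style. This is an illustration under assumed inputs (per-utterance VAD coordinates and a precomputed neutral centroid), not the repository's actual implementation.
```python
import numpy as np

def emotion_adaptive_spherical(vad, neutral_centroid):
    """Convert a VAD point into a spherical vector relative to a neutral centroid.

    Illustrative only: radius ~ emotion intensity (distance from neutral),
    azimuth/elevation ~ emotion style. `vad` and `neutral_centroid` are
    assumed 3-vectors of (valence, arousal, dominance).
    """
    v, a, d = np.asarray(vad, dtype=float) - np.asarray(neutral_centroid, dtype=float)
    r = np.sqrt(v**2 + a**2 + d**2)                   # intensity proxy
    azimuth = np.arctan2(a, v)                        # style angle in the V-A plane
    elevation = np.arccos(d / r) if r > 0 else 0.0    # style angle toward the D axis
    return np.array([r, azimuth, elevation])

# Example: a high-arousal, positive-valence point relative to a neutral centroid
print(emotion_adaptive_spherical([0.8, 0.9, 0.5], [0.5, 0.5, 0.5]))
```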
```bash
pip install -r requirements.txt
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh  # install forced alignment tools
```
The BigVGAN 16k checkpoint will be released at a later date. In the meantime, please train using the official BigVGAN implementation or use the official HiFi-GAN checkpoint.
- Modify the config file to fit your environment.
- We use the ESD database, an emotional speech corpus that can be downloaded here: https://hltsingapore.github.io/ESD/.
- Steps for emotion-specific centroid extraction with VAD analysis (a minimal sketch of the idea follows the command):
```bash
sh Analysis.sh
```
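For intuition only, a hedged sketch of what centroid extraction might look like; `vad_by_utterance` and the toy values are assumptions, not the repository's actual data format, and the real logic lives in `Analysis.sh` and the scripts it calls:
```python
import numpy as np
from collections import defaultdict

def emotion_centroids(vad_by_utterance):
    """Average per-utterance VAD vectors within each emotion label.

    `vad_by_utterance` is a hypothetical list of (emotion_label, vad_vector)
    pairs, e.g. produced by a pretrained VAD predictor over the ESD corpus.
    """
    buckets = defaultdict(list)
    for emotion, vad in vad_by_utterance:
        buckets[emotion].append(np.asarray(vad, dtype=float))
    return {emotion: np.mean(vectors, axis=0) for emotion, vectors in buckets.items()}

# Toy example with made-up VAD predictions
toy = [("Angry", [0.2, 0.9, 0.7]), ("Angry", [0.3, 0.8, 0.6]),
       ("Neutral", [0.5, 0.5, 0.5])]
print(emotion_centroids(toy))  # {'Angry': array([0.25, 0.85, 0.65]), 'Neutral': ...}
```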
- Steps for embedding extraction and binary dataset creation (sketched below):
```bash
sh preprocessing.sh
```
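The binarization format is repository-specific; as a rough, assumed illustration of the general idea (packing per-utterance feature dicts into one file with an offset index for random access during training), not the actual implementation:
```python
import pickle

def binarize(items, out_path):
    """Pack preprocessed per-utterance feature dicts into one binary file.

    A simplified stand-in for the repository's binarizer: `items` is a
    hypothetical iterable of dicts holding e.g. mel-spectrograms, phoneme
    IDs, and extracted speaker/emotion embeddings.
    """
    offsets = [0]
    with open(out_path, "wb") as f:
        for item in items:
            blob = pickle.dumps(item)
            f.write(blob)
            offsets.append(offsets[-1] + len(blob))
    return offsets  # save alongside out_path to enable random access
```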
```bash
sh train_run.sh
```
- TTS module trained on 11M [Download]
Our code is based on the following repositories: