- Awesome Codec, TTS & Speech LM
- Music Generation
- Some Interesting Models
- Speech Dataset
- Some Interesting Knowledge
- Reference
- Acoustic Tokens: Acoustic tokens focus on speech compression and reconstruction, relying on encoder-decoder architectures with residual vector quantization (RVQ). Specifically, these models quantize speech features (downsampled from raw waveforms by an encoder) into a series of discrete tokens, then use a decoder to upsample those tokens back into speech, computing a reconstruction loss against the original signal. This approach yields discrete acoustic tokens with impressive compression rates and high-fidelity acoustic information, making them well suited to tasks such as speech synthesis and emotion analysis. (Requires maintaining reconstruction quality at a low bitrate; a minimal RVQ sketch follows this list.)
- Semantic Tokens: Semantic tokens are obtained by applying clustering algorithms such as K-means to the features of self-supervised learning models, using the cluster indices as discrete representations (see the K-means sketch after this list). The underlying models are prediction-based: they learn representations either by predicting future frames autoregressively or by using surrounding frames to predict masked frames. This approach prioritizes the linguistic information within speech, making it particularly useful for recognition and understanding tasks.
- Speech Large Language Models: These models are trained on top of semantic and acoustic tokens in a language-modeling approach. They demonstrate proficiency in speech understanding and speech generation tasks. (From speech-trident)
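A minimal sketch of the RVQ step described above, assuming PyTorch; `rvq_encode`, the 2-D latent shape, and the random codebooks are illustrative stand-ins, not any particular codec's API. Each stage quantizes the residual left by the previous stage, so the stacked tokens refine the reconstruction from coarse to fine:

```python
import torch

def rvq_encode(z, codebooks):
    """Illustrative residual vector quantization.
    z: (frames, dim) latents from the encoder.
    codebooks: list of (K, dim) tensors, one per quantizer stage.
    Returns per-stage token indices and the quantized latents."""
    residual, quantized, indices = z, torch.zeros_like(z), []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)  # (frames, K) pairwise distances
        idx = dists.argmin(dim=-1)         # nearest codeword per frame
        q = cb[idx]
        quantized = quantized + q          # running sum over stages
        residual = residual - q            # what the next stage must explain
        indices.append(idx)
    return torch.stack(indices, dim=-1), quantized

# Toy usage: 100 frames of 64-dim latents, 4 stages of 1024 codewords each.
z = torch.randn(100, 64)
codebooks = [torch.randn(1024, 64) for _ in range(4)]
tokens, z_q = rvq_encode(z, codebooks)     # tokens: (100, 4) discrete ids
```

In a real codec the codebooks are learned jointly with the encoder and decoder (straight-through gradients, commitment losses); the sketch only shows the token-assignment arithmetic.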
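Likewise, a sketch of the semantic-token recipe from the second bullet, assuming scikit-learn; the random array stands in for real frame-level HuBERT/WavLM embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for frame-level SSL features (e.g. HuBERT hidden states),
# shape (num_frames, feature_dim).
ssl_features = np.random.randn(10_000, 768).astype(np.float32)

# Fit K-means on a feature corpus, then use cluster indices as tokens.
kmeans = KMeans(n_clusters=500, n_init=10, random_state=0).fit(ssl_features)
semantic_tokens = kmeans.predict(ssl_features)  # (num_frames,) integer ids
```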
- [2024/12] FreeCodec: A disentangled neural speech codec with fewer tokens [paper][code][demo] Code Coming Soon | speaker encoder, content encoder and prosody encoder
- [2024/11] TS3-Codec: Transformer-Based Simple Streaming Single Codec [paper] convolution-free
- [2024/11] Scaling Transformer for Low-bitrate High-Quality Speech Coding [paper][code][demo] Code Coming Soon | transformer-based, scaled into the 1B-parameter range
- [2024/11] PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain [paper][demo] Code Coming Soon | music tokenizer, similar to MsCodec
- [2024/11] Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation [paper][code][demo] aliasing-free ✔️
- [2024/11] VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication [paper][demo] integrates the Voice Changer model directly into the speech Codec
- [2024/11] Towards Codec-LM Co-design for Neural Codec Language Models [paper] Code Coming Soon | proposes several codec-LM co-design strategies
- [2024/11] Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [paper] UniCodec | several information-disentangled discrete tokens, similar to ns3_codec
- [2024/11] hertz-dev [code] WaveCodec ✔️
- [2024/11] SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer [paper][code] codebook collapse ✔️
- [2024/11] MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios [paper][demo] modified discrete cosine transform (MDCT) as input
- [2024/10] Pushing the frontiers of audio generation [blog] google deepmind
- [2024/11] DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [paper] Double-Codebook Speaker-invariant Clustering
- [2024/10] A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [paper][demo] Is predicting the remaining RVQ codes necessary?
- [2024/10] APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm [paper][demo] two-stage joint-individual training paradigm
- [2024/10] Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding [paper][demo] MsCodec, Multi-Scale Encoding
- [2024/10] LSCodec: Low-Bandwidth and Speaker-Decoupled Discrete Speech Codec [paper][demo] speaker timbre decouple
- [2024/10] DM-Codec: Distilling Multimodal Representations for Speech Tokenization [paper][code] acoustic properties, semantic meaning, and contextual clues ✔️
- [2024/10] ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs [paper][demo] address codebook collapse based on intra- and inter-codebook optimization
- [2024/10] Code Drift: Towards Idempotent Neural Audio Codecs [paper][demo] Idempotence – the stability of a codec’s decoded output under multiple rounds of encoding and decoding
- [2021/10] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [paper][code] semantic information & content generation ✔️
- [2021/08] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [paper]
- [2021/06] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [paper][code] semantic information & content generation ✔️
- [2020/06] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [paper][code] ✔️
- [2024/10] Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer [paper][code][demo] finetuned-version of DAC ✔️
- [2024/09] BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec [paper][code][demo] low-bitrate neural speech codec ✔️
- [2024/10] Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models [paper][demo] Inconsistency
- [2024/09] Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice [code] S3Tokenizer ✔️
- [2024/09] FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [paper] Flow Matching
- [2024/09] ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech [paper][code] Comprehensive Platform ✔️
- [2024/09] MuCodec: Ultra Low-Bitrate Music Codec [paper][code][demo] Music Codec ✔️
- [2024/09] Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis [paper][code][demo] Watermarking ✔️
- [2024/09] NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization [paper][code] Code Coming Soon
- [2024/09] Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation [paper][demo] CoFi-Speech
- [2024/09] SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis [paper][code][demo] ✔️
- [2024/08] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [paper][code][demo] X-Codec ✔️
- [2024/08] WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [paper][code][demo] ✔️
- [2024/08] Music2Latent: Consistency Autoencoders for Latent Audio Compression [paper][code][demo] continuous latent space ✔️
- [2024/08] SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [paper][demo]
- [2024/06] SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models [paper][code][demo] SQ-Codec | Code Coming Soon
- [2024/02] Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models [paper][code][demo] ✔️
- [2024/04] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers [paper][code] ✔️
- [2024/07] SuperCodec: A Neural Speech Codec with Selective Back-Projection Network [paper][code][demo] ✔️
- [2024/07] dMel: Speech Tokenization made Simple [paper] Code Coming Soon
- [2024/02] APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding [paper][code][demo] ✔️
- [2024/06] Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation [paper][demo]
- [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper][code][demo] ✔️
- [2023/06] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [paper][code][demo] ✔️
- [2024/04] SNAC: Multi-Scale Neural Audio Codec [paper][code][demo] ✔️
- [2024/06] UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner [paper][code] LLM-Codec ✔️
- [2024/01] Finite Scalar Quantization: VQ-VAE Made Simple [paper][code] FSQ, no codebook collapse (a minimal sketch follows this list) ✔️
- [2024/06] Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis [paper][code][demo] ✔️
- [2023/09] Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [paper]
- [2024/06] BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation [paper][demo]
- [2024/04] The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [paper]
- [2023/06] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding [paper][code][demo] acoustic model CTX-txt2vec and vocoder CTX-vec2wav | speech continuation and editing | similar to Encoder-Decoder ✔️
- [2024/06] Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder [paper]
- [2024/06] Coding Speech through Vocal Tract Kinematics [paper][code] ✔️
- [2024/05] HILCodec: High Fidelity and Lightweight Neural Audio Codec [paper][code][demo] ✔️
- [2024/04] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [paper][code][demo] ✔️
- [2024/01] Residual Quantization with Implicit Neural Codebooks [paper][code] Qinco ✔️
- [2024/01] SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models [paper][code][demo] ✔️
- [2023/10] Acoustic BPE for Speech Generation with Discrete Tokens [paper][code] ✔️
- [2023/09] BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech [paper][code][demo] ✔️
- [2023/09] Fewer-token Neural Speech Codec with Time-invariant Codes [paper][code][demo] Ti-Codec ✔️
- [2023/09] FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [paper][code][demo] ✔️
- [2023/09] High Fidelity Neural Audio Compression [paper][code][code-Unofficial] [demo] Encodec ✔️
- [2023/09] Soundstorm: Efficient parallel audio generation [paper][demo]
- [2023/09] High-Fidelity Audio Compression with Improved RVQGAN [paper][code][demo] DAC ✔️
- [2023/09] SpatialCodec: Neural Spatial Speech Coding [paper][code][demo] ✔️
- [2023/05] HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec [paper][code] AcademiCodec & Group-RVQ ✔️
- [2023/05] AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec [paper][code][demo] ✔️
- [2023/01] InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt [paper][code][demo] ✔️
- [2022/09] AudioLM: a Language Modeling Approach to Audio Generation [paper][demo]
- [2021/07] SoundStream: An End-to-End Neural Audio Codec [paper][code][demo] ✔️
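Since FSQ (referenced above) is simple enough to fit in a few lines, here is a minimal sketch, assuming PyTorch; the level counts follow the paper's examples, but the function itself is an illustrative reimplementation. Rounding with a straight-through gradient replaces codebook lookup entirely, which is why codebook collapse cannot occur:

```python
import torch

def fsq(z, levels=(8, 5, 5, 5)):
    """Illustrative finite scalar quantization.
    z: (..., len(levels)) latents, one scalar per quantized channel.
    Each channel is bounded by tanh, scaled to its level count, and
    rounded; the implicit codebook is the product of the levels."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half   # channel i lies in [-half_i, half_i]
    rounded = torch.round(bounded)
    # Straight-through estimator: round in the forward pass only.
    return bounded + (rounded - bounded).detach()

z = torch.randn(100, 4)   # 100 frames, 4 FSQ channels
z_q = fsq(z)              # 8 * 5 * 5 * 5 = 1000 implicit codewords
```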
- [2024/12] Autoregressive Speech Synthesis with Next-Distribution Prediction [paper][demo] KALL-E
- [2024/12] ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis [paper][code][demo] Code Coming Soon
- [2024/12] CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [paper][code][demo] ✔️
- [2024/12] TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch [paper]
- [2024/11] Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis [paper] Code Coming Soon | Text & Video to Speech
- [2024/11] Debatts: Zero-Shot Debating Text-to-Speech Synthesis [paper][demo] Debating TTS & Dataset
- [2024/11] OuteTTS-0.1-350M [blog][code] ✔️
- [2024/12] The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024 [paper] ISCSLP 2024
- [2024/10] The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge: Tasks, Results and Findings [paper] zero-shot spontaneous style voice cloning | ISCSLP 2024
- [2024/07] ICAGC 2024: Inspirational and Convincing Audio Generation Challenge 2024 [paper] emotional & background audio generation | ISCSLP 2024
- [2024/11] Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis [paper][code] ✔️
- [2024/10] STTATTS: Unified Speech-To-Text And Text-To-Speech Model [paper][code]
- [2024/10] SPIRIT LM: Interleaved Spoken and Written Language Model [paper][code][demo] ✔️
- [2023/05] Better speech synthesis through scaling [paper][code][blog] Tortoise TTS ✔️
- [2024/10] F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching [paper][code][demo] ✔️
- [2024/09] Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models [paper][demo]
- [2024/09] FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications [paper][code][demo] voice cloning for dubbing and human-like speech generation for chatbots ✔️
- [2024/09] MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer [paper][code][demo] Masked Generative Model | Similar to Seed-TTS ✔️
- [2024/08] VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling [paper][code][demo] ✔️
- [2024/08] Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation [paper][demo]
- [2024/04] FlashSpeech: Efficient Zero-Shot Speech Synthesis [paper][code][demo] ✔️
- [2024/07] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [paper] [code][demo] ✔️
- [2024/07] Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization [paper][demo] Human Feedback
- [2024/06] E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS [paper][demo] similar to Seed-TTS
- [2023/11] HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis [paper][code][demo] ✔️
- [2024/06] TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers [paper][code][demo] ✔️
- [2024/01] CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech [paper][demo]
- [2024/06] DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [paper][demo]
- [2024/06] VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [paper][demo]
- [2024/06] Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [paper][demo]
- [2024/06] VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers [paper][demo]
- [2024/06] XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model [paper][code][demo] ✔️
- [2024/06] ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec [paper][code][demo] ✔️
- [2024/08] SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS [paper][demo] SSL
- [2024/08] VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech [paper][demo] LoRA
- [2024/08] StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech [paper][demo] LoRA
- [2024/08] EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech [paper] LoRA
- [2024/07] Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [paper][demo] Spontaneous
- [2024/01] EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine [code] ✔️
- [2024/06] Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [paper][demo] Monotonic Alignment
- [2024/01] Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [paper][demo] Transducer/End-to-End
- [2024/01] VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech [paper][code][demo] Code Coming Soon | Transducer
- [2024/06] High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model [paper][demo] Transducer/End-to-End
- [2023/02] Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision [paper][code][demo] SpearTTS | WhisperSpeech ✔️
- [2024/02] Natural language guidance of high-fidelity text-to-speech with synthetic annotations [paper][code][demo] Prompt Control | Parler-TTS ✔️
- [2024/06] WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark [paper][demo]
- [2024/06] Seed-TTS: A Family of High-Quality Versatile Speech Generation Models [paper][demo]
- [2024/06] Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback [paper] Human Feedback
- [2024/04] SpeechAlign: Aligning Speech Generation to Human Preferences [paper][code][demo] Human Feedback ✔️
- [2024/04] StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations [paper][code][demo] Lian Liru(连丽如) dataset ✔️
- [2024/04] TextrolSpeech: A Text Style Control Speech Corpus with Codec Language Text-to-Speech Models [paper][code][demo] Code Coming Soon
- [2024/03] HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling [paper][demo]
- [2024/01] Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis [paper][demo]
- [2024/03] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [paper][demo]
- [2024/01] NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [paper][demo]
- [2024/03] VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild [paper][code][demo] ✔️
- [2023/01] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [paper][code][demo] VALL-E ✔️
- [2023/09] PromptTTS 2: Describing and Generating Voices with Text Prompt [paper][code][demo] ✔️
- [2023/09] Matcha-tts: A fast tts architecture with conditional flow matching [paper][code][demo] ✔️
- [2023/09] Voicebox: Text-guided multilingual universal speech generation at scale [paper][demo]
- [2023/09] Voiceflow: Efficient text-to-speech with rectified flow matching [paper][code][demo] ✔️
- WavChat classifies spoken dialogue models by whether the core language model can directly understand and generate speech representations, dividing them into cascaded and end-to-end categories.
- [2024/12] Long-Form Speech Generation with Spoken Language Models [paper][demo] SpeechSSM, Long-Form Generation
- [2024/12] SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training [paper][demo]
- [2024/12] Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners [paper][demo] Flow-Omni, continuous speech tokens
- [2024/02] Paralinguistics-Aware Speech-Empowered LLMs for Natural Conversation [paper][code][demo] learning cross-modal distributional semantics ✔️
- [2024/12] GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot [paper][code] speech interaction model & emotion, intonation, speech rate, and dialect & low latency ✔️
- [2024/11] MooER: Moore-threads Open Omni model for speech-to-speech intERaction [code] Paper Coming Soon
- [2024/11] SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation [paper] Code Coming Soon | codec-free
- [2024/11] Building a Taiwanese Mandarin Spoken Language Model: A First Attempt [paper][code] Code Coming Soon
- [2024/11] hertz-dev [code] ✔️
- [2024/11] Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM [paper][demo][code] frozen llm in training ✔️
- [2024/10] Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities [paper][code] ✔️
- [2024/10] IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [paper][demo] reducing the length difference between speech and text
- [2024/10] OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [paper][demo] Code Coming Soon
- [2024/09] Westlake-Omni: Open-Source Chinese Emotional Speech Interaction Large Language Model with Unified Discrete Sequence Modeling [code] ✔️
- [2024/09] Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control [paper][demo]
- [2024/09] Moshi: a speech-text foundation model for real time dialogue [paper][code][demo] low delay | only english ✔️
- [2024/09] LLaMA-Omni: Seamless Speech Interaction with Large Language Models [paper][code][demo] only english ✔️
- [2024/09] EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [paper][demo]
- [2024/08] Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming [paper][code] End-to-End | speech interaction model ✔️
- [2024/08] Speech To Speech: an effort for an open-sourced and modular GPT4-o [code] End-to-End | speech interaction model ✔️
- [2024/08] Language Model Can Listen While Speaking [paper][demo] Full Duplex Modeling | speech interaction model
- [????/??] SpeechGPT2: End-to-End Human-Like Spoken Chatbot [paper][code][demo] Paper & Code Coming Soon | speech interaction model
- [2024/01] SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [paper][demo] Code Coming Soon
- [2023/05] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities [paper][code][demo] ✔️
- [2024/07] Generative Expressive Conversational Speech Synthesis [paper][code] GPT-Talker | GPT for response and Style, VITS for audio ✔️
- [2024/06] GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities [paper][code][demo] ✔️
- [2024/02] Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities [paper][code][demo] ✔️
- [2024/02] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [paper][code][demo] ✔️
- [2024/03] WavLLM: Towards Robust and Adaptive Speech Large Language Model [paper][code] ✔️
- [2024/08] DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance [paper][demo]
- [2024/08] Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [paper][code][demo] Code Coming Soon
- [2024/04] CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations [paper][code][demo] multi-round dialogue speech generation ✔️
- [2024/11] A fast multimodal LLM for real-time voice [blog][code][demo] Ultravox ✔️
- [2024/10] Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant [paper][code] ✔️
- [2024/10] Ocean-omni: To Understand the World with Omni-modality [paper][code] Baichuan-Omni ✔️
- [2024/08] VITA: Towards Open-Source Interactive Omni Multimodal LLM [paper][code][demo] ✔️
- [2024/07] Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation [paper][code][demo] Code Coming Soon | speech interaction model
- [2024/07] Qwen2-Audio Technical Report [paper][code] ✔️
- [2024/05] A Full-duplex Speech Dialogue Scheme Based On Large Language Model [paper] neural finite state machine
- [2023/11] Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [paper][code] ✔️
- [2023/10] SALMONN: Towards Generic Hearing Abilities for Large Language Models [paper][code] ✔️
- [2023/09] Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [paper]
- [2024/02] Codec-SUPERB: An In-Depth Analysis of Sound Codec Models [paper][code]
- [2024/07] EMO-Codec: A Depth Look at Emotion Preservation Capacity of Legacy and Neural Codec Models With Subjective and Objective Evaluations [paper]
- [2024/06] DASB - Discrete Audio and Speech Benchmark [paper][code] a benchmark for evaluating discrete audio representations
- [2024/12] Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models [paper]
- [2024/02] Towards audio language modeling -- an overview [paper]
- [2024/10] A Survey on Speech Large Language Models [paper]
- [2024/10] Recent Advances in Speech Language Models: A Survey [paper]
- [2024/11] WavChat: A Survey of Spoken Dialogue Models [paper][code]
- [2024/12] Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey [paper][code]
- [2024/12] Flow Matching Guide and Code [paper][code]
- [2024/12] OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows [paper][code] ✔️
- [2024/09] SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis [paper][code][demo] ✔️
- [2024/11] FLowHigh: Towards efficient and high-quality audio super-resolution with single-step flow matching [code][demo] ✔️
- [2024/11] O1 Replication Journey: A Strategic Progress Report -- Part 1 [paper][code] ✔️
- [2024/11] LLaMA-O1: Open Large Reasoning Model Frameworks For Training, Inference and Evaluation With PyTorch and HuggingFace [code] ✔️
- [2024/07] Stable Audio Open [paper] [code] ✔️
- [2024/05] EmoLLM (mental health LLM) [code][demo] ✔️
- [2023/02] Improving and generalizing flow-based generative models with minibatch optimal transport [paper][code] TorchCFM | Tutorials ✔️
- [2022/10] Flow Matching for Generative Modeling [paper] Conditional Flow Matching (a minimal training-step sketch follows this list)
- [2022/09] Rectified Flow: A Marginal Preserving Approach to Optimal Transport [paper][code] ✔️
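For the flow-matching entries above, a minimal sketch of one conditional flow matching training step, assuming PyTorch; `v_theta` is a hypothetical velocity network. With the linear (optimal-transport) path, the point at time t is an interpolation between noise and data, and the regression target is simply their difference:

```python
import torch

def cfm_loss(v_theta, x1):
    """One conditional flow matching training step (illustrative).
    v_theta: network mapping (x_t, t) -> predicted velocity.
    x1: (batch, dim) data samples."""
    x0 = torch.randn_like(x1)        # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)   # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # linear probability path
    v_target = x1 - x0               # constant velocity along that path
    return ((v_theta(x_t, t) - v_target) ** 2).mean()

# Toy usage with an MLP velocity field over 16-dim data.
dim = 16
net = torch.nn.Sequential(torch.nn.Linear(dim + 1, 128),
                          torch.nn.SiLU(),
                          torch.nn.Linear(128, dim))
v_theta = lambda x, t: net(torch.cat([x, t], dim=-1))
loss = cfm_loss(v_theta, torch.randn(32, dim))
loss.backward()
```

Sampling then integrates dx/dt = v_theta(x, t) from t = 0 to 1 with an ODE solver (e.g. torchdiffeq, listed under tools below).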
- [2024/12] SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor [paper][demo]
- [2024/12] MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models [paper][code] ✔️
- [2024/10] MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [paper] Code Coming Soon | similar to MaskGCT
- [2024/09] FLUX that Plays Music [paper][code][melodio] KunLun ✔️
- [2024/09] Seed-Music: A Unified Framework for High Quality and Controlled Music Generation [paper][demo] tech-report
- [2024/05] QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation [paper][code][demo] ✔️
- [2024/05] Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning [paper][code][demo] Instruction Tuning ✔️
- [2023/06] Simple and Controllable Music Generation [paper][code] Prompt Control | AudioCraft ✔️
- [2024/07] Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation [paper][code][demo][dataset] ✔️
- [2024/06] WenetSpeech4TTS: A 12,800-hour Mandarin TTS corpus for large speech generation model benchmark [paper][demo][dataset] ✔️
- [2020/10] DiDiSpeech: A Large Scale Mandarin Speech Corpus [paper][code][demo][dataset]
- Anthropic courses [github]
- LLM101n: Let's build a Storyteller [github]
- Build a Large Language Model (From Scratch) [github]
- build nanoGPT from Karpathy [github]
GitHub
- ChatTTS: https://github.com/2noise/ChatTTS/tree/main
- OpenVoice: https://github.com/myshell-ai/OpenVoice
- GPT-SoVITS: https://github.com/RVC-Boss/GPT-SoVITS
- Bert-vits2-NoBug: https://github.com/ywh-my/Bert-vits2-NoBug
- VoiceCraft: https://github.com/jasonppy/VoiceCraft
- YourTTS: https://github.com/Edresson/YourTTS
- Coqui: https://github.com/coqui-ai/TTS
- ebook2audiobookXTTS: https://github.com/DrewThomasson/ebook2audiobookXTTS
- MARS5-TTS: https://github.com/Camb-ai/MARS5-TTS
- edge-tts: https://github.com/rany2/edge-tts
- metavoice-src: https://github.com/metavoiceio/metavoice-src
- StyleTTS2: https://github.com/yl4579/StyleTTS2
- open-tts-tracker: https://github.com/Vaibhavs10/open-tts-tracker
- Amphion: https://github.com/open-mmlab/Amphion
- CTranslate2: https://github.com/OpenNMT/CTranslate2
- CFM: https://github.com/atong01/conditional-flow-matching
- speech-trident: https://github.com/ga642381/speech-trident
- bark: https://github.com/suno-ai/bark
- LangGPT: https://github.com/langgptai/LangGPT (prompt engineering)
- composio: https://github.com/ComposioHQ/composio
- torchdiffeq: https://github.com/rtqichen/torchdiffeq
- podlm: https://github.com/lihuithe/podlm-public (an open alternative to NotebookLM)
- NotebookLlama: https://github.com/meta-llama/llama-recipes/tree/main/recipes/quickstart/NotebookLlama (similar to NotebookLM)
- playnote: https://play.ai/playnote (similar to NotebookLM)
- podcastfy: https://github.com/souzatharsis/podcastfy (similar to NotebookLM)
- dify: https://github.com/langgenius/dify (open-source LLM application development platform)
- Awesome-Dify-Workflow: https://github.com/svcvit/Awesome-Dify-Workflow
- LiblibAI: https://www.liblib.art (AI creation platform)
Nice Tools
- pytorch-OpCounter: https://github.com/Lyken17/pytorch-OpCounter
- rich: https://github.com/Textualize/rich
- argbind: https://github.com/pseeth/argbind/
- audiotools: https://github.com/descriptinc/audiotools
- hydra: https://github.com/facebookresearch/hydra
- joblib: https://github.com/joblib/joblib
- einops: https://github.com/arogozhnikov/einops
- safetensors: https://github.com/huggingface/safetensors
- OpenDiloco: https://github.com/PrimeIntellect-ai/OpenDiloco
- WeTextProcessing: https://github.com/wenet-e2e/WeTextProcessing
- zed: https://github.com/zed-industries/zed
- weekly: https://github.com/ljinkai/weekly
- tinygrad: https://github.com/tinygrad/tinygrad
- ffmpeg-normalize: https://github.com/slhck/ffmpeg-normalize
- kohya_ss: https://github.com/bmaltais/kohya_ss
- Lora-Training-in-Comfy: https://github.com/LarryJane491/Lora-Training-in-Comfy
- ComfyUI-Manager: https://github.com/ltdrdata/ComfyUI-Manager
- ComfyUI: https://github.com/comfyanonymous/ComfyUI
- comfyui-workspace-manager: https://github.com/11cafe/comfyui-workspace-manager
- CosyVoice+ComfyUI: https://github.com/AIFSH/CosyVoice-ComfyUI
- ComfyUI-wiki: https://github.com/602387193c/ComfyUI-wiki
- ZHO: https://github.com/ZHO-ZHO-ZHO
- tmux: https://github.com/tmux/tmux
- LoRAlib: https://github.com/microsoft/LoRA
- codespaces: https://github.com/codespaces
- Foliate (PDF): https://johnfactotum.github.io/foliate/
- Okular (PDF): https://okular.kde.org/zh-cn/
- audioFlux: https://github.com/libAudioFlux/audioFlux
- PyWavelets: https://github.com/PyWavelets/pywt
- Agent/workflow development platforms: https://ai-bot.cn/ai-agent-development-platform/
- open-webui: https://github.com/open-webui/open-webui
- qwen-2.5-code-interpreter: https://github.com/cfahlgren1/qwen-2.5-code-interpreter
- ollama: https://github.com/ollama/ollama; https://ollama.com/
- vllm: https://github.com/vllm-project/vllm
- anythingLLM: https://github.com/Mintplex-Labs/anything-llm
- Windsurf: https://codeium.com/windsurf
- cursor: https://www.cursor.com/
- docling: https://github.com/DS4SD/docling
- TEN-Agent: https://github.com/TEN-framework/TEN-Agent
- [ZhiHu] Don't panic! One article to help you understand the speech technology behind GPT-4o
- [ZhiHu] Audio codecs in full bloom: a powerful tool for speech synthesis
- [InterSpeech2024] Speech Processing Using Discrete Speech Units Challenge: https://www.wavlab.org/activities/2024/Interspeech2024-Discrete-Speech-Unit-Challenge/ | https://huggingface.co/discrete-speech | arXiv 2024 [paper]
- [Slides] Challenges in Developing Spoken Language Models [slides]
- [GitHub] speech-trident: Awesome speech/audio LLMs, representation learning, and codec models
- [GitHub] Awesome-Speech-Language-Model: Papers, code and resources for speech language models and end-to-end speech dialogue systems