Stars
Models and code for RepCodec: A Speech Representation Codec for Speech Tokenization
This is a replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data
VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling
Papers, code, and resources for speech language models and end-to-end speech dialogue systems.
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation.
🔊 Text-Prompted Generative Audio Model (Bark; a minimal usage sketch follows this list)
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
[ACM MM 2024] FlashSpeech: Efficient Zero-Shot Speech Synthesis
[ACM MM 2023] CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model
This is an evolving repo for the paper "Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey".
FreeU: Free Lunch in Diffusion U-Net (CVPR 2024 Oral)
SEED-Story: Multimodal Long Story Generation with Large Language Model
Integration for the OpenAI API in Unreal Engine
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
A curated collection of open-source Chinese large language models, focusing on smaller-scale models that can be privately deployed at low training cost, covering base models, vertical-domain fine-tuning and applications, datasets, and tutorials.
Awesome-LLM: a curated list of Large Language Models
Multilingual large-scale voice generation model, providing full-stack inference, training, and deployment capabilities.
One minute of voice data can be enough to train a good TTS model! (few-shot voice cloning)
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing, etc. (FunASR; a minimal usage sketch follows this list)
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Official repo for CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)
CLIP (Contrastive Language-Image Pretraining): predicts the most relevant text snippet for a given image (a zero-shot sketch follows this list)
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Official pytorch implementation of the paper: "Catch-A-Waveform: Learning to Generate Audio from a Single Short Example" (NeurIPS 2021)
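
The text-prompted generative audio model starred above is suno-ai/bark. A minimal text-to-audio sketch following the API documented in its README; the prompt text and output filename are placeholders:

```python
# Minimal Bark sketch: synthesize audio from a text prompt.
# API per the suno-ai/bark README; prompt and output path are placeholders.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # download and cache the model weights
audio = generate_audio("Hello, this is a text-prompted audio test.")
write_wav("bark_out.wav", SAMPLE_RATE, audio)  # Bark outputs 24 kHz mono audio
```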
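
The end-to-end speech recognition toolkit starred above is FunASR. A minimal transcription sketch using its AutoModel interface as shown in the project README; the model names are released FunASR models, and "audio.wav" is a placeholder:

```python
# Minimal FunASR sketch: transcribe an audio file with Paraformer.
# AutoModel interface per the FunASR README; "audio.wav" is a placeholder.
from funasr import AutoModel

model = AutoModel(model="paraformer-zh",  # Chinese ASR model
                  vad_model="fsmn-vad",   # voice activity detection
                  punc_model="ct-punc")   # punctuation restoration
res = model.generate(input="audio.wav")
print(res)  # list of result dicts containing the recognized text
```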
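
For the CLIP entry, a minimal zero-shot prediction sketch using OpenAI's published clip package; the image path and candidate captions are placeholders:

```python
# Minimal CLIP sketch: score candidate captions against an image.
# Uses the openai/CLIP package; image path and captions are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # probability that each caption matches the image
```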