Stars
Align Anything: Training All-modality Model with Feedback
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
Sample Repository for the AlibabaCloud Bailian Speech SDK
Awesome Neural Codec Models, Text-to-Speech Synthesizers & Speech Language Models
so-vits-svc fork with realtime support, improved interface and more features.
Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.
VoiceBench: Benchmarking LLM-Based Voice Assistants
Fine-tune the Whisper speech recognition model to support training without timestamp data, training with timestamp data, and training without speech data. Accelerate inference and support Web deplo…
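[Three Years of Interviews, Five Years of Mock Exams] An interview handbook for AI algorithm engineers. Covers interview and written-test experience and practical knowledge from across the AI industry, including AIGC, traditional deep learning, autonomous driving, machine learning, computer vision, natural language processing, reinforcement learning, embodied intelligence, the metaverse, and AGI.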
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
SALMONN: Speech Audio Language Music Open Neural Network
Sample code for the Microsoft Cognitive Services Speech SDK
The official GitHub page for the survey paper "A Survey of Large Language Models".
serp-ai / bark-with-voice-clone
Forked from suno-ai/bark
🔊 Text-prompted Generative Audio Model - With the ability to clone voices
shuaijiang / ke-data-juicer
Forked from modelscope/data-juicer
A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation
Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
Manipulate audio with a simple and easy high level interface