Stars
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of performing real-time speech generation.
A high-throughput and memory-efficient inference and serving engine for LLMs
SGLang is a fast serving framework for large language models and vision language models.
Fully open reproduction of DeepSeek-R1
Ready-to-use SRT / WebRTC / RTSP / RTMP / LL-HLS media server and media proxy that lets you read, publish, proxy, record and play back video and audio streams.
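Bailing (百聆) is a GPT-4o-like voice conversation bot built from ASR + LLM + TTS. It integrates strong large models such as DeepSeek R1, reaches latency as low as 800 ms, runs on low-spec machines such as a Mac, and supports interruption.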
Android Voice Activity Detection (VAD) library. Supports WebRTC VAD GMM, Silero VAD DNN, Yamnet VAD DNN models.
StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
Code for the paper Hybrid Spectrogram and Waveform Source Separation
Noise suppression using deep filtering
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaki…
SALMONN: Speech Audio Language Music Open Neural Network
Speech, Language, Audio, Music Processing with Large Language Model
An evolving, large-scale and multi-domain ASR corpus for low-resource languages with automated crawling, transcription and refinement
CLIP (Contrastive Language-Image Pretraining): predicts the most relevant text snippet given an image
Ongoing research training transformer models at scale
Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"
Deezer source separation library including pretrained models.
An extremely simple tool for separating vocals and background music, run entirely locally through a web interface with no internet connection required, using 2stems/4stems/5stems models.
Label Studio is a multi-type data labeling and annotation tool with standardized output format
Open-source, accurate and easy-to-use video speech recognition and clipping tool, with LLM-based AI clipping integrated.
Instant voice cloning by MIT and MyShell. Audio foundation model.
Fast and memory-efficient exact attention