  • The Hong Kong University of Science and Technology (Guangzhou)
  • Guangzhou


Models and code for RepCodec: A Speech Representation Codec for Speech Tokenization

Python 167 11 Updated Jul 12, 2024

A replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data

Python 2,023 153 Updated Jan 28, 2025

VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Python 61 4 Updated Nov 9, 2024

Papers, code, and resources for speech language models and end-to-end speech dialogue systems.

152 12 Updated Nov 10, 2024

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audi…

Python 8,420 644 Updated Feb 3, 2025

🔊 Text-Prompted Generative Audio Model

Jupyter Notebook 36,843 4,340 Updated Aug 19, 2024

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Python 6,770 601 Updated May 31, 2024

ACM MM 2024 FlashSpeech: Efficient Zero-Shot Speech Synthesis

Python 122 8 Updated Sep 20, 2024

ACM MM 2023 CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Python 199 21 Updated Apr 26, 2024

This is an evolving repo for the paper "Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey".

118 4 Updated Jan 14, 2025

FreeU: Free Lunch in Diffusion U-Net (CVPR2024 Oral)

1,807 76 Updated Dec 24, 2024

SEED-Story: Multimodal Long Story Generation with Large Language Model

Python 788 60 Updated Oct 11, 2024

Integration for the OpenAI API in Unreal Engine

C++ 707 156 Updated Aug 20, 2024

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

Python 39,469 4,840 Updated Feb 6, 2025

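A collection of open-source Chinese large language models, focusing on smaller models that can be deployed privately and trained at low cost, including base models, domain-specific fine-tuning and applications, datasets, and tutorials.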

17,995 1,726 Updated Sep 19, 2024

Awesome-LLM: a curated list of Large Language Models

21,225 1,738 Updated Feb 2, 2025

Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.

Python 10,209 985 Updated Feb 6, 2025

One minute of voice data is enough to train a good TTS model! (few-shot voice cloning)

Python 39,862 4,472 Updated Jan 18, 2025

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"

Python 9,379 1,251 Updated Feb 5, 2025

Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation

Python 3,462 502 Updated Jan 24, 2025

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing, etc.

Python 7,941 825 Updated Feb 5, 2025

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

Python 2,786 188 Updated Nov 14, 2024

Official repo for CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Python 46 4 Updated Jan 16, 2025

[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Python 413 23 Updated Jun 5, 2024

Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

7,569 929 Updated Aug 21, 2024

Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).

1,172 57 Updated Jun 28, 2024

Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

Python 790 96 Updated Sep 30, 2021

CLIP (Contrastive Language-Image Pretraining): predicts the most relevant text snippet for a given image

Jupyter Notebook 27,207 3,426 Updated Jul 23, 2024

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Python 2,025 149 Updated Jan 21, 2025

Official PyTorch implementation of the paper: "Catch-A-Waveform: Learning to Generate Audio from a Single Short Example" (NeurIPS 2021)

Python 188 35 Updated Apr 2, 2024