Stars
ESC-50: Dataset for Environmental Sound Classification
CLIP-based aesthetics predictor inspired by the interface of 🤗 huggingface transformers.
Fifth-place solution for the Modelscope-Sora Challenge
Code and Pretrained Models for Interspeech 2023 Paper "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong Audio Event Taggers"
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
Convert PDF to markdown + JSON quickly with high accuracy
A Python-based CLI tool for captioning images with WD series, Joy-caption-pre-alpha, Meta Llama 3.2 Vision Instruct, and Qwen2 VL Instruct models.
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
Intelligent multilingual AI video dubbing/translation tool - Linly-Dubbing: "Empowered by AI, borderless across languages"
State-of-the-Art Text Embeddings
Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".
[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
Tiny RDM (Tiny Redis Desktop Manager) - A modern, colorful, super lightweight Redis GUI client for Mac, Windows, and Linux.
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
Interpreting and Analyzing CLIP's Zero-Shot Image Classification via Mutual Knowledge, NeurIPS 2024
Milvus is a high-performance, cloud-native vector database built for scalable vector ANN search
A Gradio web UI for Large Language Models with support for multiple inference backends.
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
A free, open source, multi-platform SQLite database manager.
Batch download tool for WeChat Official Account articles, with support for downloading images and comments and saving as html/mhtml/md/pdf/docx files
The Places365-CNNs for Scene Classification
AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection - CVPR NAS 2023
A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization