2023.12 Update
Users can upload any images for the conversation
2024.01 Update
Exciting news! Both the powerful Gemini-Pro and Qwen large models are now incorporated into the conversation pipeline. Users can also upload images during the conversation, adding a whole new dimension to the interactions.
Linly-Talker is an intelligent AI system that combines large language models (LLMs) with visual models to create a novel human-AI interaction method. It integrates several technologies: Whisper, Linly, Microsoft Speech Services, and the SadTalker talking-head generation system. The system is deployed with Gradio, allowing users to converse with an AI assistant by providing images as prompts. Users can have free-form conversations or generate content according to their preferences.
conda create -n linly python=3.8
conda activate linly
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install -r requirements_app.txt
Leverages OpenAI's Whisper, see https://github.com/openai/whisper for usage.
Uses Microsoft Speech Services, see https://github.com/rany2/edge-tts for usage.
Talking head generation uses SadTalker from CVPR 2023, see https://sadtalker.github.io
Download SadTalker models:
bash scripts/download_models.sh
Linly-AI from CVI , Shenzhen University, see https://github.com/CVI-SZU/Linly
Download Linly models: https://huggingface.co/Linly-AI/Chinese-LLaMA-2-7B-hf
git lfs install
git clone https://huggingface.co/Linly-AI/Chinese-LLaMA-2-7B-hf
Or use the API:
# CLI
curl -X POST -H "Content-Type: application/json" -d '{"question": "What are fun places in Beijing?"}' http://url:port
# Python
import requests

url = "http://url:port"
headers = {"Content-Type": "application/json"}
data = {"question": "What are fun places in Beijing?"}

response = requests.post(url, headers=headers, json=data)
# response_text = response.content.decode("utf-8")
answer, tag = response.json()
if tag == 'success':
    response_text = answer[0]
    print(response_text)
else:
    print("fail")
Qwen from Alibaba Cloud, see https://github.com/QwenLM/Qwen
Download Qwen models, e.g. https://huggingface.co/Qwen/Qwen-1_8B-Chat (used below) or the quantized https://huggingface.co/Qwen/Qwen-7B-Chat-Int4
git lfs install
git clone https://huggingface.co/Qwen/Qwen-1_8B-Chat
Gemini-Pro from Google, see https://deepmind.google/technologies/gemini/
Request API-keys: https://makersuite.google.com/
In app.py, select the model you want to use:
# Uncomment and set up the model of your choice:
# llm = Gemini(model_path='gemini-pro', api_key=None, proxy_url=None) # Don't forget to include your Google API key
# llm = Qwen(mode='offline', model_path="Qwen/Qwen-1_8B-Chat")
# Automatic download
# llm = Linly(mode='offline', model_path="Linly-AI/Chinese-LLaMA-2-7B-hf")
# Manual download with a specific path
llm = Linly(mode='offline', model_path="./Chinese-LLaMA-2-7B-hf")
Some optimizations:
- Use fixed input face images, extract features beforehand to avoid reading each time
- Remove unnecessary libraries to reduce total time
- Only save final video output, don't save intermediate results to improve performance
- Use OpenCV to generate final video instead of mimwrite for faster runtime
Gradio is a Python library that provides an easy way to deploy machine learning models as interactive web apps.
For Linly-Talker, Gradio serves two main purposes:
- Visualization & Demo: Gradio provides a simple web GUI for the model, allowing users to see the results intuitively by uploading an image and entering text. This is an effective way to showcase the capabilities of the system.
- User Interaction: The Gradio GUI can serve as a frontend that allows end users to interact with Linly-Talker. Users can upload their own images and ask arbitrary questions or have conversations to get real-time responses. This provides a more natural speech interaction method.
Specifically, we create a Gradio Interface in app.py that takes image and text inputs, calls our function to generate the response video, and displays it in the GUI. This enables browser interaction without needing to build a complex frontend.
In summary, Gradio provides visualization and user interaction interfaces for Linly-Talker, serving as effective means for showcasing system capabilities and enabling end users.
The folder structure is as follows:
Linly-Talker/
├── app.py
├── app_img.py
├── utils.py
├── Linly-api.py
├── Linly-example.ipynb
├── README.md
├── README_zh.md
├── request-Linly-api.py
├── requirements_app.txt
├── scripts
│   └── download_models.sh
├── src
│   └── .....
├── inputs
│   ├── example.png
│   └── first_frame_dir
│       ├── example_landmarks.txt
│       ├── example.mat
│       └── example.png
├── examples
│   ├── driven_audio
│   │   ├── bus_chinese.wav
│   │   ├── ......
│   │   └── RD_Radio40_000.wav
│   ├── ref_video
│   │   ├── WDA_AlexandriaOcasioCortez_000.mp4
│   │   └── WDA_KatieHill_000.mp4
│   └── source_image
│       ├── art_0.png
│       ├── ......
│       └── sad.png
├── checkpoints // SadTalker model weights path
│   ├── mapping_00109-model.pth.tar
│   ├── mapping_00229-model.pth.tar
│   ├── SadTalker_V0.0.2_256.safetensors
│   └── SadTalker_V0.0.2_512.safetensors
├── gfpgan // GFPGAN model weights path
│   └── weights
│       ├── alignment_WFLW_4HG.pth
│       └── detection_Resnet50_Final.pth
└── Chinese-LLaMA-2-7B-hf // Linly model weights path
    ├── config.json
    ├── generation_config.json
    ├── pytorch_model-00001-of-00002.bin
    ├── pytorch_model-00002-of-00002.bin
    ├── pytorch_model.bin.index.json
    ├── README.md
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── tokenizer.model
Next, launch the app:
python app.py
To let users upload images for the conversation, launch the image variant instead:
python app_img.py