Change the README
bachvudinh committed Aug 23, 2024
1 parent 3ff5130 commit 98f585c
Showing 2 changed files with 28 additions and 21 deletions.
30 changes: 19 additions & 11 deletions README.md
@@ -1,8 +1,8 @@
<div align="center">

# Llama3-S: When llama learns to listen
<a href='https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>
<a href='https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>
<a href='https://huggingface.co/homebrewltd'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>
<a href='https://huggingface.co/homebrewltd'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>

<img src="images/llama-listen.jpg" width="180"/>
<p><small>Image source: <a href="https://www.amazon.co.uk/When-Llama-Learns-Listen-Feelings/dp/1839237988">"When Llama Learns to Listen"</a></small></p>
@@ -16,15 +16,18 @@ The project provides a full codebase and replication instructions for synthetic
⚠️ Work in Progress
Llama3-s is currently under active development. Please note the following limitations:

- The model currently responds only to female voices
- The model is sensitive to heavy compression in the incoming audio
- The model cannot process audio longer than 10 seconds and becomes confused (one workaround is sketched below)
- ~~The model currently responds only to female voices~~ --> Our latest model responds to all voices
- It processes single-turn sound instruction data

We are continuously working to expand these capabilities.
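Until longer inputs are supported, trimming clips before inference is one workaround for the 10-second limit. Below is a minimal preprocessing sketch with torchaudio; the function name, the 16 kHz target rate, and the defaults are illustrative assumptions, not part of the official pipeline:

```
# Hypothetical preprocessing sketch: downmix, resample, and trim audio
# to <= 10 s, since the model gets confused on longer clips (see above).
import torchaudio

def trim_clip(path: str, max_seconds: float = 10.0, target_sr: int = 16_000):
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform[:, : int(max_seconds * target_sr)], target_sr
```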

## News
- [2024/07/19] We released [llama3-s-2024-07-19](https://huggingface.co/homebrewltd/llama3-s-2024-07-19), trained on 1.35B tokens. This model achieves a loss of 1.0.
- [2024/07/01] We released [llama3-s-2024-07-08](https://huggingface.co/homebrewltd/llama3-s-2024-07-08), trained on 700M tokens. This model achieves a loss of 1.7.
- [2024/06/23] We released [llama3-s-init](https://huggingface.co/homebrewltd/llama3-s-init), our initialized model with expanded vocabulary.
- [2024/08/20] We're excited to share llama3-s v0.2, our latest multimodal checkpoint with improved speech understanding. We released [llama3.1-s-instruct-v0.2](https://huggingface.co/homebrewltd/llama3.1-s-instruct-v0.2), trained on 440M tokens for 5 epochs, and [llama3.1-s-base-v0.2](https://huggingface.co/homebrewltd/llama3.1-s-base-v0.2), pretrained on 900M semantic sound tokens.
- [2024/07/19] We released [llama3-s-2024-07-19-v0.1](https://huggingface.co/homebrewltd/llama3-s-2024-07-19), trained on 1.35B tokens. This model achieves a loss of 1.0.
- [2024/07/01] We released [llama3-s-2024-07-08-v0.1](https://huggingface.co/homebrewltd/llama3-s-2024-07-08), trained on 700M tokens. This model achieves a loss of 1.7.
- [2024/06/23] We released [llama3-s-init](https://huggingface.co/homebrewltd/llama3-s-init), our initialized model with an expanded vocabulary, using EnCodec as the audio tokenizer.

## Contents
- [Models](#models)
@@ -46,15 +49,21 @@ Get started quickly using our Google Colab notebook:
We provide our fully fine-tuned models on Phase 1 and Phase 2 data, as well as the initialized model with an expanded vocabulary.
| Date | Checkpoint | Tokens | Step | Batch Size | Loss | Status |
|------|------------|--------|------|------------|------|--------|
| 📅 2024-08-20 | 🔗 [llama3.1-s-instruct-v0.2](https://huggingface.co/homebrewltd/llama3.1-s-instruct-v0.2) | 🔢 440M | 🔄 36305 | 💼 128 | 📉 0.7| 🚧 In progress |
| 📅 2024-08-20 | 🔗 [llama3.1-s-base-v0.2](https://huggingface.co/homebrewltd/llama3.1-s-base-v0.2) | 🔢 900M | 🔄 5042 | 💼 480 | 📉 2.0| 🚧 In progress |
| 📅 2024-08-20 | 🔗 [llama3.1-s-whispervq-init](https://huggingface.co/homebrewltd/llama3.1-s-whispervq-init) | 🔢 0M | 🔄 N/A | 💼 N/A | 📉 N/A | N/A |
| 📅 2024-07-19 | 🔗 [llama3-s-2024-07-19](https://huggingface.co/homebrewltd/llama3-s-2024-07-19) | 🔢 1.35B | 🔄 6520 | 💼 128 | 📉 1.0| 🚧 In progress |
| 📅 2024-07-01 | 🔗 [llama3-s-2024-07-08](https://huggingface.co/homebrewltd/llama3-s-2024-07-08) | 🔢 700M | 🔄 4320 | 💼 128 | 📉 1.7-1.8 | 🚧 In progress |
| 📅 2024-06-23 | 🔗 [llama3-s-init](https://huggingface.co/homebrewltd/llama3-s-init) | 🔢 0M | 🔄 N/A | 💼 N/A | 📉 N/A | N/A |
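The checkpoints above are hosted as standard Hugging Face repos. Below is a minimal loading sketch, assuming the instruct checkpoint is compatible with the transformers AutoModelForCausalLM API; the dtype and device settings are illustrative, not prescribed by the project:

```
# Minimal sketch: load a checkpoint from the table above with transformers.
# Assumes AutoModelForCausalLM compatibility; adjust dtype/device as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "homebrewltd/llama3.1-s-instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
```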

## Dataset

We provide 3 different versions of the processed data for model training, converted to the Llama3 format and ready for fine-tuning:
We provide different versions of the processed data for model training, converted to the Llama3 format and ready for fine-tuning.
⚠️ Note: The most recent implementation utilizes WhisperVQ as the audio tokenizer, whereas previous versions employed EnCodec.
| Date | HF Checkpoint | Tokens |
|------------|-------------------------------------------------|--------|
| 📅 2024-08-20 | 🔗 [Instruction-speech-whispervq-v2](https://huggingface.co/datasets/homebrewltd/instruction-speech-whispervq-v2) | 🔢 440M |
| 📅 2024-08-20 | 🔗 [Raw-speech-whispervq-v1](https://huggingface.co/datasets/homebrewltd/raw-speech-whispervq-v1) | 🔢 900M |
| 📅 2024-07-19 | 🔗 [Instruction-Speech-Full](https://huggingface.co/homebrew-research) | 🔢 1.35B |
| 📅 2024-07-18 | 🔗 [Instruction-Speech-Phase-2](https://huggingface.co/datasets/homebrew-research/instruction-speech-v1.5) | 🔢 800M |
| 📅 2024-06-30 | 🔗 [Instruction-Speech-Phase-1](https://huggingface.co/datasets/homebrew-research/instruction-speech-v1) | 🔢 450M |
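The datasets above can be pulled straight from the Hub with the datasets library. A minimal sketch follows; the "train" split name and the streaming flag are assumptions, so check the dataset card before relying on them:

```
# Minimal sketch: stream one of the datasets above from the Hub.
# The "train" split name is an assumption; verify on the dataset card.
from datasets import load_dataset

ds = load_dataset(
    "homebrewltd/instruction-speech-whispervq-v2",
    split="train",
    streaming=True,  # avoids downloading the full dataset up front
)
print(next(iter(ds)))
```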
@@ -108,14 +117,13 @@ accelerate launch --config_file ./accelerate_config.yaml train.py
1. Install Package
```
python -m venv torchtune
pip install --pre torch==2.5.0.dev20240617 --index-url https://download.pytorch.org/whl/nightly/cu121 #or cu118
pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly
pip install torch torchvision tensorboard
cd ./torchtune
pip install -e .
```
You can also download the model using tune:
```
tune download meta-llama/Meta-Llama-3-70b --hf-token <token> --output-dir ../model_zoo/Meta-Llama-3-70b --ignore-patterns "original/consolidated*"
tune download homebrewltd/llama3.1-s-whispervq-init --hf-token <token> --output-dir ../model_zoo/llama3.1-s-whispervq-init --ignore-patterns "original/consolidated*"
```
Set up the dataset from the HF path by changing the dataset path and the model name in the following YAML file:
```
nano torchtune/recipes/configs/jan-llama3-s/8B_full.yaml
```
@@ -124,7 +132,7 @@

2. Training on multiple GPUs (1-8 GPUs supported)
```
tune run --nproc_per_node 4 full_finetune_distributed --config janhq-llama3-s/8B_full
tune run --nproc_per_node 4 full_finetune_fsdp2 --config recipes/configs/jan-llama3-1-s/8B_full.yaml
```
## Reference
```bibtex
19 changes: 9 additions & 10 deletions demo/app.py
@@ -100,20 +100,20 @@ def text_to_audio_file(text):
tts.convert_text_to_audio_file(text, temp_file)
print(f"Saved audio to {temp_file}")
return temp_file
def process_input(input_type, text_input=None, audio_file=None):
def process_input(audio_file=None):

for partial_message in process_audio(audio_file):
yield partial_message

def process_transcribe_input(input_type, text_input=None, audio_file=None):
def process_transcribe_input(audio_file=None):

for partial_message in process_audio(audio_file, transcript=True):
yield partial_message

class StopOnTokens(StoppingCriteria):
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
# encode </s> token
stop_ids = [tokenizer.eos_token_id] # Adjust this based on your model's tokenizer
stop_ids = [tokenizer.eos_token_id, 128009]  # adjust for your model's tokenizer; 128009 is Llama 3's <|eot_id|>
for stop_id in stop_ids:
if input_ids[0][-1] == stop_id:
return True
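For context, a StoppingCriteria subclass like StopOnTokens is typically wired into generation via a StoppingCriteriaList. The sketch below is illustrative only, not the app's actual invocation; `model` and `input_ids` stand in for the objects defined elsewhere in app.py:

```
# Illustrative sketch: passing StopOnTokens to transformers' generate().
# `model` and `input_ids` are assumed to be defined elsewhere in app.py.
from transformers import StoppingCriteriaList

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    stopping_criteria=stopping_criteria,
)
```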
@@ -182,15 +182,14 @@ def process_audio(audio_file, transcript=False):
transcrip_button = gr.Button("Please Transcribe the audio for me")

text_output = gr.Textbox(label="Generated Text")

def reset_textbox():
return gr.update(value="")
def update_visibility(input_type):
return (gr.update(visible=input_type == "text"),
gr.update(visible=input_type == "text"))
def convert_and_display(text):
audio_file = text_to_audio_file(text)
return audio_file
def process_example(file_path):
return update_visibility("audio")
return audio_file

input_type.change(
update_visibility,
@@ -206,16 +205,16 @@ def process_example(file_path):

submit_button.click(
process_input,
inputs=[input_type, text_input, audio_input],
inputs=[audio_input],
outputs=[text_output]
)
transcrip_button.click(
process_transcribe_input,
inputs=[input_type, text_input, audio_input],
inputs=[audio_input],
outputs=[text_output]
)

gr.Examples(examples, inputs=[audio_input], outputs=[audio_input], fn=process_example)
gr.Examples(examples, inputs=[audio_input])
iface.queue(max_size=10)
# iface.launch(server_name="127.0.0.1", server_port=8080)
# launch locally
