Warning
llama3-s is an ongoing open research experiment in its early training runs.
- Join us in the `#research` channel on Homebrew's Discord
- We livestream training runs in `#research-livestream`
Note
2 Aug 2024 update:
- llama3-s can understand female Australian accents, i.e. our synthetic voice data generator 😂
- It can only process single-sound instruction data
- Current demo: https://dollars-scholar-wins-antique.trycloudflare.com/
llama3-s is an open, ongoing research experiment to extend a text-based LLM to have native "listening" ability.
We are training an early fusion model using techniques inspired by Meta's Chameleon paper. Our approach focuses on token transitivity: it extends the LLM's vocabulary to include sound tokens, and has the potential to be extended to other input modalities in the future.
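To make the vocabulary-expansion idea concrete, here is a minimal sketch (not the project's actual code), assuming a hypothetical `<|sound_XXXX|>` token format and a 1024-entry audio codebook:

```python
# Minimal sketch of expanding an LLM's vocabulary with sound tokens.
# The token format and codebook size are illustrative assumptions,
# not llama3-s's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# One token per entry of a hypothetical 1024-entry audio codebook,
# plus markers that delimit a span of sound tokens.
sound_tokens = [f"<|sound_{i:04d}|>" for i in range(1024)]
tokenizer.add_tokens(sound_tokens + ["<|sound_start|>", "<|sound_end|>"],
                     special_tokens=True)

# Grow the embedding matrix so the new token ids have trainable rows.
model.resize_token_embeddings(len(tokenizer))
```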
llama3-s is being done as an open science experiment with an open-source codebase and dataset. We build and train in public:
- `#research`: for discussions, updates, and questions
- `#research-livestream`: watch our training runs live
- 2 Aug: Re-trained phase 1 (not yet published) using llama3.1 with much better hyperparameters and techniques, leading to a significant improvement with only minor degradation (MMLU: 0.66 -> 0.61).
- 1 Aug: Discovered that the 1 July training run introduced significant degradation of the base model (MMLU dropped from 0.6 to 0.2) due to a typo in the original training recipe.
- 30 July: Presented llama3-s's initial progress at AI Training: From PyTorch to GPU Clusters.
- 19 July: llama3-s-2024-07-19 can understand a synthetically generated voice.
- 1 July: llama3-s-2024-07-08, an initial exploratory training run to see whether the model could converge; the loss settled around 1.7 with limited data.
We provide our fully finetuned models on Phase 1 and Phase 2 data, as well as the initialized model with an expanded vocabulary; an illustrative loading example follows the table below.
Date | Model Checkpoint | Dataset | Tokens | Step | Batch Size | Loss | Training Cost
---|---|---|---|---|---|---|---
19 July 24 | llama3-s-2024-07-19 | Instruction-Speech-Full | 1.35B | 1195k | 128 | 1.0 | ~$300
1 July 24 | llama3-s-2024-07-08 | Instruction-Speech-Phase-2 | 700M | 1431k | 128 | 1.7-1.8 | ~$300
23 July 24 | llama3-s-init | Instruction-Speech-Phase-1 | 0M | N/A | N/A | N/A | N/A
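The checkpoints should load like any standard Hugging Face causal LM. The snippet below is illustrative only: the Hub repository ID is a placeholder, and the prompt reuses the hypothetical sound-token format from the sketch above.

```python
# Illustrative only: load a released checkpoint as a standard causal LM.
# The repo id below is a placeholder, not the real checkpoint location.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "<org>/llama3-s-2024-07-19"  # placeholder Hub id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)

# A spoken instruction, rendered as discrete sound tokens (illustrative format).
prompt = "<|sound_start|><|sound_0042|><|sound_0007|><|sound_end|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```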
llama3-s is an open research project. We're looking for collaborators, and will likely move towards crowdsourcing speech datasets in the future.
Get started quickly using our Google Colab notebook:
For detailed information on synthetic generation, please refer to the Synthetic Generation Guide.
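For a rough picture of what such a pipeline can look like, here is a hedged sketch that synthesizes speech for a text instruction with WhisperSpeech and quantizes it into discrete codes with Encodec; the model choices, bandwidth, and token format are illustrative assumptions, not the project's actual recipe:

```python
# Hedged sketch of one synthetic-data step: text -> speech -> discrete sound tokens.
# Not the project's actual pipeline; see the Synthetic Generation Guide for that.
import torch
import torchaudio
from whisperspeech.pipeline import Pipeline
from encodec import EncodecModel
from encodec.utils import convert_audio

# 1) Synthesize audio for a text instruction with WhisperSpeech.
tts = Pipeline(s2a_ref="collabora/whisperspeech:s2a-q4-tiny-en+pl.model")
tts.generate_to_file("sample.wav", "What is the capital of France?")

# 2) Quantize the waveform into discrete codebook indices with Encodec.
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(1.5)  # fewer codebooks -> shorter token sequences
wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)
with torch.no_grad():
    frames = codec.encode(wav.unsqueeze(0))    # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)  # shape: (1, n_codebooks, T)

# 3) Render the first codebook as text "sound tokens" (illustrative format).
sound_text = "".join(f"<|sound_{i:04d}|>" for i in codes[0, 0].tolist())
```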
- First, clone the repo from GitHub:

```bash
git clone --single-branch --branch training_script https://github.com/janhq/llama3-s.git
```
- Organize the folder structure as follows before training:
```
llama3-s
├── HF_Trainer
├── synthetic_data
├── scripts
├── torchtune
├── model_zoo
│   ├── LLM
│   │   ├── Meta-Llama-3-8B-Instruct
│   │   ├── Meta-Llama-3-70B-Instruct
```
- Install dependencies:

```bash
python -m venv hf_trainer
chmod +x scripts/install.sh
./scripts/install.sh
# Restart your shell now
chmod +x scripts/setup.sh
./scripts/setup.sh
source hf_trainer/bin/activate  # activate the venv created above
```
- Log in to Hugging Face:

```bash
huggingface-cli login --token=<token>
```
- Training:

```bash
export CUTLASS_PATH="cutlass"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
accelerate launch --config_file ./accelerate_config.yaml train.py
```
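For orientation, a script launched this way typically follows the pattern below. This is a generic, hedged sketch of an accelerate training loop, not the repo's actual train.py; all names are illustrative:

```python
# Generic sketch of an accelerate-driven training loop, not the repo's train.py.
from accelerate import Accelerator

def train(model, loader, optimizer):
    # `accelerate launch` sets up the distributed environment that
    # Accelerator() picks up (devices, mixed precision, etc.).
    accelerator = Accelerator()
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
    model.train()
    for batch in loader:
        outputs = model(**batch)            # batch includes labels -> outputs.loss
        accelerator.backward(outputs.loss)  # handles grad scaling / DDP sync
        optimizer.step()
        optimizer.zero_grad()
```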
- Install packages for training with torchtune:

```bash
python -m venv torchtune
source torchtune/bin/activate  # activate the venv so the installs below go into it
pip install --pre torch==2.5.0.dev20240617 --index-url https://download.pytorch.org/whl/nightly/cu121 # or cu118
pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly
cd ./torchtune
pip install -e .
```
You can also download the model using tune:
```bash
tune download meta-llama/Meta-Llama-3-70b --hf-token <token> --output-dir ../model_zoo/Meta-Llama-3-70b --ignore-patterns "original/consolidated*"
```
Set up the dataset from the HF path by changing the dataset path and the model name in the following YAML file:

```bash
nano torchtune/recipes/configs/jan-llama3-s/8B_full.yaml
```
- Multi-GPU training (1-8 GPUs supported):

```bash
tune run --nproc_per_node 4 full_finetune_distributed --config janhq-llama3-s/8B_full
```
```bibtex
@misc{chameleonteam2024chameleonmixedmodalearlyfusionfoundation,
  title={Chameleon: Mixed-Modal Early-Fusion Foundation Models},
  author={Chameleon Team},
  year={2024},
  eprint={2405.09818},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  journal={arXiv preprint}
}

@misc{zhang2024adamminiusefewerlearning,
  title={Adam-mini: Use Fewer Learning Rates To Gain More},
  author={Yushun Zhang and Congliang Chen and Ziniu Li and Tian Ding and Chenwei Wu and Yinyu Ye and Zhi-Quan Luo and Ruoyu Sun},
  year={2024},
  eprint={2406.16793},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  journal={arXiv preprint}
}

@misc{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
  year={2022},
  eprint={2210.13438},
  archivePrefix={arXiv},
  journal={arXiv preprint}
}

@misc{WhisperSpeech,
  title={WhisperSpeech: An Open Source Text-to-Speech System Built by Inverting Whisper},
  author={Collabora and LAION},
  year={2024},
  url={https://github.com/collabora/WhisperSpeech},
  note={GitHub repository}
}
```
- Torchtune: The codebase we built upon
- Accelerate: library that makes distributed training easy to use
- WhisperSpeech: Text-to-speech model for synthetic audio generation
- Encodec: High-fidelity neural audio codec for efficient audio compression
- Llama3: the family of models we built on, with amazing language capabilities!