This repo is supplemental material for my presentation at BSides Philly. If you have any questions or just want to talk, get in touch with me:
- All Links
- Email me at [email protected]
Training and fine-tuning LLMs is an incredibly complex process, but thanks to various libraries and frameworks we can easily gather our own data and fine-tune open-source models like Llama 3.1.
This talk is the story of how I scraped Telegram channels operated by threat actors and used that data to fine-tune Llama 3.1. It gives attendees an easy way to fine-tune their own models and demonstrates which steps to take and which pitfalls to avoid.
I am by no means an expert in machine learning or AI. I first started my machine learning journey in 2018, when I took various courses on the topic. Back then it was time-consuming and somewhat difficult: when you wanted to load a dataset you had to write all the code to load the file, clean it, parse it, and split it into training and test data. Nowadays you can do that just by using Hugging Face's load_dataset method.
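As a minimal sketch, this is roughly what that looks like with the `datasets` library. The Hugging Face repo id below is a placeholder, not the exact path for this project's dataset:

```python
from datasets import load_dataset

# Load a dataset straight from the Hugging Face Hub.
# <HF_USERNAME>/threatactor-telegram-18k is a placeholder repo id --
# substitute the real path of the dataset you want to use.
dataset = load_dataset("<HF_USERNAME>/threatactor-telegram-18k")

# If the dataset only ships a "train" split, carve out a test set yourself.
split = dataset["train"].train_test_split(test_size=0.1, seed=42)

print(split["train"][0])                       # inspect one training example
print(len(split["train"]), len(split["test"]))  # sizes of the two splits
```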
The advancements in AI are due to the support of the community and the tooling and techniques that have been created since 2018. This project was a way for me to get up to speed on the topic and start doing some fun stuff. My hope is that you can learn the basics of fine-tuning a model and avoid some of the time-consuming mistakes I made along the way.
The dataset generated during this project is available on Hugging Face: threatactor-telegram-18k
If you followed along, you should have a model saved in merged 16-bit format. To convert it into something Ollama can run:
Convert the merged 16-bit model to GGUF with llama.cpp:
python <LLAMA_CPP_FOLDER>/convert_hf_to_gguf.py <MODEL_FOLDER_NAME>/ --outfile <YOUR_FILE_NAME>.gguf --outtype f16
Create a Modelfile for Ollama:
nano <MODELNAME-Modelfile>
Update the Modelfile to point at your GGUF file:
FROM <LOCATION_OF_YOUR_GGUF>
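If you want more than the bare minimum, here is a sketch of a slightly fuller Modelfile. The file path, parameter values, and system prompt are placeholders, not values from this project; PARAMETER and SYSTEM are standard Ollama Modelfile directives:

```
# Path to the GGUF produced by convert_hf_to_gguf.py (placeholder)
FROM ./<YOUR_FILE_NAME>.gguf

# Optional sampling parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Optional system prompt baked into the model
SYSTEM "You answer questions about the fine-tuned Telegram dataset."
```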
Create the model from the Modelfile with Ollama:
ollama create <YOUR_MODEL_NAME> -f <LOCATION_OF_MODELFILE>
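Once the model is created you can confirm it is registered and chat with it interactively; the model name is whatever you passed to `ollama create`:

```
# confirm the model shows up locally
ollama list

# start an interactive session with the new model
ollama run <YOUR_MODEL_NAME>
```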