Merge branch 'main' into add-pi-readme
research4pan authored Aug 5, 2023
2 parents 083a69d + a96150a commit 24e4737
Showing 2 changed files with 29 additions and 2 deletions.
9 changes: 7 additions & 2 deletions README.md
@@ -21,7 +21,7 @@
[![Doc](https://img.shields.io/badge/Website-Doc-ff69b4.svg)](https://optimalscale.github.io/LMFlow/)
[![Embark](https://img.shields.io/badge/Discord-LMFlow-%237289da.svg?logo=discord)](https://discord.gg/u9VJNpzhvA)
[![slack badge](https://img.shields.io/badge/Slack-Join-blueviolet?logo=slack&amp)](https://join.slack.com/t/lmflow/shared_invite/zt-1wju9nicy-woXbNtS~5MavHSAtiMxmxQ)
-[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://i.imgloc.com/2023/07/13/VgJyaZ.jpeg)
+[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://s1.ax1x.com/2023/08/06/pPAQTPI.jpg)

An extensible, convenient, and efficient toolbox for finetuning large machine learning models, designed to be user-friendly, speedy and reliable, and accessible to the entire community.

@@ -336,9 +336,14 @@ You can config the deepspeed under configs. Details can be referred at [DeepSpee

Thanks to the great efforts of [llama.cpp](https://github.com/ggerganov/llama.cpp), everyone can run their LLaMA models on a CPU via 4-bit quantization. We provide a script that converts LLaMA LoRA weights to `.pt` files; after that, you only need `convert-pth-to-ggml.py` from llama.cpp to perform the quantization.
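
For concreteness, here is a minimal sketch of the llama.cpp flow; the model paths and directory layout are assumptions, so consult the llama.cpp documentation for the authoritative steps:

```
# Sketch only: assumes the merged full-precision weights live in ./models/7B/.
# Convert the PyTorch checkpoint to ggml f16:
python convert-pth-to-ggml.py models/7B/ 1

# Quantize the f16 model down to 4 bits:
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# Run inference on CPU:
./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello, my name is" -n 128
```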

### 4.4 Vocabulary List Extension

Now you can train your own SentencePiece tokenizer and merge it with the model's original HuggingFace tokenizer. Check [vocab_extension](https://github.com/OptimalScale/LMFlow/blob/main/scripts/vocab_extension) for more details.

### 4.5 Position Interpolation for LLaMA Models
Now LMFlow supports the latest Linear & NTK (Neural Tangent Kernel) scaling techniques for LLaMA models. Check [position_interpolation](https://github.com/OptimalScale/LMFlow/blob/main/readme/Position_Interpolation.md) for more details.
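
As intuition for the two techniques: linear scaling stretches the RoPE position indices by a constant factor, while NTK-aware scaling rescales the rotary base instead. The sketch below shows the equivalent knob in plain HuggingFace `transformers` (>= 4.31); it illustrates the idea only, not LMFlow's own interface, and the checkpoint path is a placeholder:

```
# Illustration only: RoPE scaling via plain transformers, not LMFlow's CLI.
python -c "
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'path/to/llama-checkpoint',  # placeholder; use your own LLaMA weights
    rope_scaling={'type': 'linear', 'factor': 2.0},  # 2x context; 'dynamic' for NTK
)
print(model.config.rope_scaling)
"
```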

## 5. Model Release

### 5.1 Medical Model Checkpoints
@@ -397,7 +402,7 @@ Whether you are a beginner or an expert, we believe that you can benefit from th

[![Embark](https://img.shields.io/badge/discord-LMFlow-%237289da.svg?logo=discord)](https://discord.gg/u9VJNpzhvA)
[![slack badge](https://img.shields.io/badge/Slack-join-blueviolet?logo=slack&amp)](https://join.slack.com/t/lmflow/shared_invite/zt-1wju9nicy-woXbNtS~5MavHSAtiMxmxQ)
-[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://i.imgloc.com/2023/07/13/VgJyaZ.jpeg)
+[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://s1.ax1x.com/2023/08/06/pPAQTPI.jpg)



22 changes: 22 additions & 0 deletions scripts/vocab_extension/README.md
@@ -0,0 +1,22 @@
# Train & Merge Tokenizer
To automatically convert data, train a SentencePiece tokenizer, and merge the tokenizer, you can run the following script:
```
bash scripts/vocab_extension/train_merge_tokenizer.sh
```
Alternatively, you can run each of the three steps separately:

# Convert JSON Data to TXT
To convert JSON data to TXT for SentencePiece tokenizer training, run:
```
bash scripts/vocab_extension/convert_json_to_txt.sh
```
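A minimal example of the expected input, assuming the script consumes LMFlow's standard `text_only` dataset layout (the file path here is illustrative):
```
# Illustrative input file; adjust the path to wherever the script reads from.
cat > data/example_dataset.json <<'EOF'
{
  "type": "text_only",
  "instances": [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "LMFlow supports vocabulary extension for LLaMA models."}
  ]
}
EOF
```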
# Train SentencePiece Tokenizer
To train a SentencePiece tokenizer, run:
```
bash scripts/vocab_extension/train_tokenizer.sh
```
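Under the hood this wraps SentencePiece training. A rough stand-alone equivalent is sketched below; the vocabulary size and model type are illustrative assumptions, not the script's actual defaults:
```
# Illustrative spm_train call; the flags are standard SentencePiece options,
# but the values are assumptions rather than train_tokenizer.sh's defaults.
spm_train --input=data/corpus.txt \
          --model_prefix=new_tokenizer \
          --vocab_size=20000 \
          --model_type=bpe
```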
# Merge the New Tokenizer with the Original One
To merge a new tokenizer with the original one, run:
```
bash scripts/vocab_extension/merge_tokenizer.sh
```
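After merging, a quick sanity check is to load the merged tokenizer and confirm the vocabulary grew. The output directory below is an assumption; point it at whatever path `merge_tokenizer.sh` actually reports:
```
# Assumed output location; substitute the directory merge_tokenizer.sh writes to.
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('./output_models/merged_tokenizer')
print('merged vocab size:', len(tok))
"
```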
