Merge branch 'main' into add-pi-readme
research4pan authored Aug 5, 2023
2 parents 083a69d + a96150a commit 24e4737
Showing 2 changed files with 29 additions and 2 deletions.
9 changes: 7 additions & 2 deletions README.md
@@ -21,7 +21,7 @@
[![Doc](https://img.shields.io/badge/Website-Doc-ff69b4.svg)](https://optimalscale.github.io/LMFlow/)
[![Embark](https://img.shields.io/badge/Discord-LMFlow-%237289da.svg?logo=discord)](https://discord.gg/u9VJNpzhvA)
[![slack badge](https://img.shields.io/badge/Slack-Join-blueviolet?logo=slack&amp)](https://join.slack.com/t/lmflow/shared_invite/zt-1wju9nicy-woXbNtS~5MavHSAtiMxmxQ)
-[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://i.imgloc.com/2023/07/13/VgJyaZ.jpeg)
+[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://s1.ax1x.com/2023/08/06/pPAQTPI.jpg)

An extensible, convenient, and efficient toolbox for finetuning large machine learning models, designed to be user-friendly, speedy and reliable, and accessible to the entire community.

@@ -336,9 +336,14 @@ You can config the deepspeed under configs. Details can be referred at [DeepSpee

Thanks to the great efforts of [llama.cpp](https://github.com/ggerganov/llama.cpp), everyone can run their LLaMA models on a CPU via 4-bit quantization. We provide a script that converts LLaMA LoRA weights to `.pt` files; after that, you only need `convert-pth-to-ggml.py` from llama.cpp to perform the quantization.
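
For concreteness, here is a minimal sketch of the llama.cpp flow; the model paths and directory layout are assumptions, so consult the llama.cpp documentation for the authoritative steps:

```
# Sketch only: assumes the merged full-precision weights live in ./models/7B/.
# Convert the PyTorch checkpoint to ggml f16:
python convert-pth-to-ggml.py models/7B/ 1

# Quantize the f16 model down to 4 bits:
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# Run inference on CPU:
./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello, my name is" -n 128
```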

### 4.4 Vocabulary List Extension

Now you can train your own SentencePiece tokenizer and merge it with the model's original HuggingFace tokenizer. Check [vocab_extension](https://github.com/OptimalScale/LMFlow/blob/main/scripts/vocab_extension) for more details.

### 4.5 Position Interpolation for LLaMA Models
Now LMFlow supports the latest Linear & NTK (Neural Tangent Kernel) scaling techniques for LLaMA models. Check [position_interpolation](https://github.com/OptimalScale/LMFlow/blob/main/readme/Position_Interpolation.md) for more details.
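
As intuition for the two techniques: linear scaling stretches the RoPE position indices by a constant factor, while NTK-aware scaling rescales the rotary base instead. The sketch below shows the equivalent knob in plain HuggingFace `transformers` (>= 4.31); it illustrates the idea only, not LMFlow's own interface, and the checkpoint path is a placeholder:

```
# Illustration only: RoPE scaling via plain transformers, not LMFlow's CLI.
python -c "
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'path/to/llama-checkpoint',  # placeholder; use your own LLaMA weights
    rope_scaling={'type': 'linear', 'factor': 2.0},  # 2x context; 'dynamic' for NTK
)
print(model.config.rope_scaling)
"
```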

## 5. Model Release

### 5.1 Medical Model Checkpoints
@@ -397,7 +402,7 @@ Whether you are a beginner or an expert, we believe that you can benefit from th

[![Embark](https://img.shields.io/badge/discord-LMFlow-%237289da.svg?logo=discord)](https://discord.gg/u9VJNpzhvA)
[![slack badge](https://img.shields.io/badge/Slack-join-blueviolet?logo=slack&amp)](https://join.slack.com/t/lmflow/shared_invite/zt-1wju9nicy-woXbNtS~5MavHSAtiMxmxQ)
-[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://i.imgloc.com/2023/07/13/VgJyaZ.jpeg)
+[![WeChat badge](https://img.shields.io/badge/WeChat-Join-brightgreen?logo=wechat&amp)](https://s1.ax1x.com/2023/08/06/pPAQTPI.jpg)



22 changes: 22 additions & 0 deletions scripts/vocab_extension/README.md
@@ -0,0 +1,22 @@
# Train & Merge Tokenizer
To automatically convert data, train a SentencePiece tokenizer, and merge the tokenizer, you can run the following script:
```
bash scripts/vocab_extension/train_merge_tokenizer.sh
```
Alternatively, you can run each of the three steps separately:

# Convert JSON Data to TXT
To convert JSON data to TXT for SentencePiece tokenizer training, run:
```
bash scripts/vocab_extension/convert_json_to_txt.sh
```
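A minimal example of the expected input, assuming the script consumes LMFlow's standard `text_only` dataset layout (the file path here is illustrative):
```
# Illustrative input file; adjust the path to wherever the script reads from.
cat > data/example_dataset.json <<'EOF'
{
  "type": "text_only",
  "instances": [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "LMFlow supports vocabulary extension for LLaMA models."}
  ]
}
EOF
```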
# Train SentencePiece Tokenizer
To train a SentencePiece tokenizer, run:
```
bash scripts/vocab_extension/train_tokenizer.sh
```
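Under the hood this wraps SentencePiece training. A rough stand-alone equivalent is sketched below; the vocabulary size and model type are illustrative assumptions, not the script's actual defaults:
```
# Illustrative spm_train call; the flags are standard SentencePiece options,
# but the values are assumptions rather than train_tokenizer.sh's defaults.
spm_train --input=data/corpus.txt \
          --model_prefix=new_tokenizer \
          --vocab_size=20000 \
          --model_type=bpe
```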
# Merge the New Tokenizer with the Original One
To merge a new tokenizer with the original one, run:
```
bash scripts/vocab_extension/merge_tokenizer.sh
```
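After merging, a quick sanity check is to load the merged tokenizer and confirm the vocabulary grew. The output directory below is an assumption; point it at whatever path `merge_tokenizer.sh` actually reports:
```
# Assumed output location; substitute the directory merge_tokenizer.sh writes to.
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('./output_models/merged_tokenizer')
print('merged vocab size:', len(tok))
"
```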
