
Palu: Compressing KV-Cache with Low-Rank Decomposition

Official implementation of Palu

Installation

  1. Clone the repository (make sure you have Git and Conda installed on your system)
git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
cd Palu
  2. Prepare the environment
conda create -n Palu python=3.10
conda activate Palu
pip install -r requirements.txt
  3. Install third-party libraries
pip install -e 3rdparty/lm-evaluation-harness
pip install -e 3rdparty/fast-hadamard-transform

Usage

Rank Search and Compression

We provide a script, compress.py, that performs the rank search and low-rank decomposition to generate the low-rank projection matrices for compressing the KV-Cache. As an example, the command below performs the decomposition with the proposed G-LRD method and a group size of 4.

python compress.py \
--model_id="meta-llama/Llama-2-7b-hf" \
--calib_dataset wikitext2 \
--param_ratio_target 0.5 \
--search_method fisher_uniform \
--head_group_size 4 \
--dump_huggingface_model \
--use_cache 

After executing the above command, a compressed model with decomposed low-rank projection matrices will be dumped into the Llama-2-7b-hf_ratio-0.5_gs-4-fisher_uniform directory. The dumped model is stored in the Hugging Face transformers format, so it can be loaded like any other checkpoint, as sketched below.
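A minimal loading sketch, using the directory name produced by the example above; the trust_remote_code flag is an assumption on our part, in case the checkpoint ships custom modeling code for its low-rank projection layers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Directory produced by the compress.py example above
model_path = "Llama-2-7b-hf_ratio-0.5_gs-4-fisher_uniform"

# trust_remote_code is an assumption: the compressed checkpoint may rely on
# custom modeling code for the decomposed low-rank projection layers.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)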

Evaluation

With the compressed model dumped, we can evaluate its performance on various tasks.

Perplexity

To evaluate the perplexity of the compressed model on the wikitext2 dataset with a sequence length of 2048, use the run_ppl_eval.py script.

python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 2048

To reproduce the evaluation of c4 perplexity, simply change the --datasets argument to c4.
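For reference, this style of perplexity evaluation typically slices the test corpus into fixed-length windows and averages the token-level negative log-likelihood. A minimal sketch of that standard procedure (not necessarily run_ppl_eval.py's exact implementation):

import torch

@torch.no_grad()
def perplexity(model, input_ids, seqlen=2048):
    # input_ids: the whole tokenized test set as one long 1-D tensor.
    nlls = []
    for i in range(0, input_ids.numel() - seqlen, seqlen):
        chunk = input_ids[i : i + seqlen].unsqueeze(0).to(model.device)
        # Passing labels=chunk makes the model return the mean cross-entropy.
        nlls.append(model(chunk, labels=chunk).loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))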

To evaluate Palu with the low-rank latents quantized, pass the --lt_bits argument to enable quantization. For example, to evaluate Palu with 3-bit low-rank-aware quantization, execute:

python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 4096 \
--lt_bits 3 \
--lt_hadamard 
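Conceptually, --lt_hadamard rotates the low-rank latents with a Hadamard transform before uniform quantization, which smooths out activation outliers and reduces quantization error (this is also why the fast-hadamard-transform package is installed). A rough, self-contained sketch of the idea using simulated (fake) quantization, not Palu's actual kernels:

import torch

def hadamard(n):
    # Normalized Hadamard matrix; n must be a power of two.
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

def fake_quant(x, bits=3):
    # Symmetric per-row uniform quantization, followed by dequantization.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

latent = torch.randn(16, 256)                  # toy low-rank latents
H = hadamard(latent.shape[-1])
rotated = latent @ H                           # rotation flattens outliers
recovered = fake_quant(rotated, bits=3) @ H.T  # quantize, then rotate back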

Zero-shot Evaluation

To run zero-shot evaluations of models with a compressed KV-Cache, use the run_lm_eval.py script, which implements a wrapper around the lm-evaluation-harness library.

Before starting, please make sure you have lm-eval==0.4.2 installed. We also include the lm-eval library in the 3rdparty directory; users may run the following command for easy installation:

pip install -e 3rdparty/lm-evaluation-harness

To reproduce the results in our paper, simply execute:

CUDA_VISIBLE_DEVICES=1 python run_lm_eval.py --model_name_or_path /Path/To/Palu/Model \
--tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
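run_lm_eval.py handles the Palu-specific model loading; for reference, a bare-bones equivalent using the harness's Python API directly (lm-eval==0.4.2, without the Palu wrapper) might look like:

import lm_eval

# Evaluate a Hugging Face checkpoint by path. This bypasses run_lm_eval.py's
# Palu-specific loading and is shown only to illustrate what the wrapper does.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/Path/To/Palu/Model",
    tasks=["openbookqa", "hellaswag", "piqa",
           "arc_easy", "arc_challenge", "winogrande"],
)
print(results["results"])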

LongBench

We also provide a script to run the LongBench evaluation on the compressed model.

CUDA_VISIBLE_DEVICES=2 python run_long_bench.py \
--model_id /Path/To/Palu/Model

The script will evaluate on the "triviaqa", "qasper", "trec", "samsum", "lcc", "repobench-p", "qmsum", and "multi_news" datasets by default. Users may also pass the --datasets argument to specify which tasks to evaluate; for example, add --datasets "triviaqa,qasper" to evaluate on the "triviaqa" and "qasper" datasets only.
