LLM Inference

Compress

Run the following to quantize the pretrained model and write it to the output directory:

python compression/run_compression.py \
    --pretrained-model facebook/opt-125m \
    --quantized-model-dir quantized_opt125m \
    --n-samples 128
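
Once compression finishes, quantized_opt125m is an ordinary model directory. The sketch below shows the general shape of post-training quantization using stock PyTorch dynamic int8 quantization; it is illustrative only, not what compression/run_compression.py actually does (the script's own method, which presumably uses the --n-samples calibration examples, may differ):

# Illustrative sketch only: dynamic int8 quantization of OPT-125M with stock
# PyTorch. The repo's run_compression.py may use a different, calibration-based
# method; this just shows the general shape of post-training quantization.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Replace all Linear layers with int8-weight versions; activations stay fp32.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save the quantized weights and tokenizer for later evaluation.
os.makedirs("quantized_opt125m", exist_ok=True)
torch.save(quantized.state_dict(), "quantized_opt125m/pytorch_model.bin")
tokenizer.save_pretrained("quantized_opt125m")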

Run accuracy benchmark

Run the following, passing the quantized model directory, the original pretrained model, the MMLU data directory, and an output directory:

cd eval/mmlu
./eval_on_mmlu.sh ../../quantized_opt125m facebook/opt-125m /net/nfs.cirrascale/allennlp/akshitab/data/mmlu eval_results

Example output:

Average accuracy 0.202 - math
Average accuracy 0.232 - health
Average accuracy 0.219 - physics
Average accuracy 0.270 - business
Average accuracy 0.198 - biology
Average accuracy 0.172 - chemistry
Average accuracy 0.267 - computer science
Average accuracy 0.204 - economics
Average accuracy 0.234 - engineering
Average accuracy 0.238 - philosophy
Average accuracy 0.236 - other
Average accuracy 0.233 - history
Average accuracy 0.177 - geography
Average accuracy 0.204 - politics
Average accuracy 0.225 - psychology
Average accuracy 0.250 - culture
Average accuracy 0.250 - law
Average accuracy 0.212 - STEM
Average accuracy 0.241 - humanities
Average accuracy 0.215 - social sciences
Average accuracy 0.238 - other (business, health, misc.)
Average accuracy: 0.229
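
If you need these numbers programmatically, the printed lines are easy to parse. A small sketch, assuming the stdout above was captured to eval_results/log.txt (a hypothetical filename; eval_results is the output directory passed to the script):

# Parse "Average accuracy <acc> - <category>" lines from the benchmark output.
# The log filename is hypothetical; point it at wherever you captured stdout.
import re

accuracies = {}
with open("eval_results/log.txt") as f:
    for line in f:
        m = re.match(r"Average accuracy ([\d.]+) - (.+)", line.strip())
        if m:
            accuracies[m.group(2)] = float(m.group(1))

print(accuracies["STEM"])  # 0.212 for the run above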

Run efficiency benchmark

Run the following with the pretrained model name and the quantized model directory as arguments:

cd efficiency
./run_efficiency_benchmark.sh facebook/opt-125m quantized_opt125m

Example output:

Time Elapsed: 500.91 s
Max GPU memory usage:  2.09 GiB.
Average GPU power:  9.00e+01 W.
Average power:  2.04e+02 W.
Total energy:  7.49e-02 kWh.
CO2 emission:  6.35e-03 kg.
Throughput:  0.20 instances / s.
Throughput:  47.30 words / s.
Latency:  5009.10 ms / batch.
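
The throughput and latency figures are mutually consistent; a quick back-of-the-envelope check (the one-instance-per-batch reading is inferred from the numbers, not stated by the benchmark):

# Sanity-check the relationships between the reported efficiency numbers.
elapsed_s = 500.91
latency_ms_per_batch = 5009.10
instances_per_s = 0.20
words_per_s = 47.30

num_batches = elapsed_s / (latency_ms_per_batch / 1000)  # ~100 batches
num_instances = instances_per_s * elapsed_s              # ~100 instances -> batch size ~1
total_words = words_per_s * elapsed_s                    # ~23,700 generated words

print(f"{num_batches:.0f} batches, {num_instances:.0f} instances, {total_words:.0f} words")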