FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets
The datasets we used and the multi-task financial LLMs are available at https://huggingface.co/FinGPT.
Before you start, make sure you have the correct versions of the key packages installed.
transformers==4.32.0
peft==0.5.0
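If needed, the pinned versions can be installed directly with pip (a minimal example, assuming a standard Python environment):
pip install transformers==4.32.0 peft==0.5.0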
Weights & Biases is a good tool for tracking model training and inference. To use it, you need to register, get a free API key, and create a new project.
wandb produces some nice charts tracking training and evaluation metrics.
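A minimal setup sketch (wandb login and the WANDB_PROJECT environment variable are standard wandb conventions, not specific to this repo; the project name is a placeholder):
wandb login
export WANDB_PROJECT=fingpt-benchmark  # placeholder project name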
For users who want ready-to-use financial multi-task language models, please refer to demo.ipynb.
Following this notebook, you can test Llama2-7B, ChatGLM2-6B, MPT-7B, BLOOM-7B, Falcon-7B, or Qwen-7B on any of the following tasks:
- Financial Sentiment Analysis
- Headline Classification
- Named Entity Recognition
- Financial Relation Extraction
We suggest users follow the instruction template and task prompts that we used during training; demos are shown in demo.ipynb. Due to the limited diversity of the financial tasks and datasets we used, the models might not respond correctly to out-of-scope instructions. We will study the generalization ability further in future work.
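For illustration, here is a hypothetical instruction-style prompt for sentiment analysis; the exact templates we used are shown in demo.ipynb:
# a hypothetical prompt in the instruction/input/answer style; not the
# verbatim training template, which is given in demo.ipynb
prompt = (
    "Instruction: What is the sentiment of this news? "
    "Please choose an answer from {negative/neutral/positive}.\n"
    "Input: Stocks rallied after the strong earnings report.\n"
    "Answer: "
)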
For the base models we used, we recommend pre-downloading them and saving them to base_models/.
Refer to the parse_model_name() function in utils.py for the Hugging Face model we used for each LLM. (We use base models rather than any instruction-tuned or chat version, except for ChatGLM2.)
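A minimal sketch of pre-downloading one base model with the huggingface_hub API; the repo id and local directory layout here are examples, so check parse_model_name() for the exact repo ids:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tiiuae/falcon-7b",         # example repo id; see parse_model_name()
    local_dir="base_models/falcon-7b",  # assumed layout under base_models/
)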
For the datasets we used, download our processed instruction-tuning data from Hugging Face. Take the FinRED dataset as an example:
import datasets
dataset = datasets.load_dataset('FinGPT/fingpt-finred')
# save to local disk space (recommended)
dataset.save_to_disk('data/fingpt-finred')
Then finred becomes an available task option for training.
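A quick sanity check that the saved copy loads from local disk (load_from_disk is the standard datasets API counterpart to save_to_disk):
import datasets

# load the locally saved copy and inspect its splits and example counts
dataset = datasets.load_from_disk('data/fingpt-finred')
print(dataset)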
We use different datasets at different phases of our instruction tuning paradigm.
- Task-specific Instruction Tuning:
sentiment-train / finred-re / ner / headline
- Multi-task Instruction Tuning:
sentiment-train & finred & ner & headline
- Zero-shot Aimed Instruction Tuning:
finred-cls & ner-cls & headline-cls -> sentiment-cls (test)
You may download the datasets according to your needs. We also provide processed datasets for ConvFinQA and FinEval, but they are not used in our final work.
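A minimal sketch for fetching several of the processed datasets in one go; the names below follow the FinGPT/fingpt-* naming pattern shown above, so verify each entry on the hub:
import datasets

for name in ['fingpt-sentiment-train', 'fingpt-finred', 'fingpt-ner', 'fingpt-headline']:
    # download from the FinGPT hub and cache a local copy under data/
    ds = datasets.load_dataset(f'FinGPT/{name}')
    ds.save_to_disk(f'data/{name}')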
To prepare training data from raw data, you should follow data/prepate_data.ipynb.
We don't include any source data from other open-source financial datasets in our repository, so if you want to prepare the data from scratch, you need to find the corresponding source data and place it in data/ before you start.
train.sh contains examples of instruction tuning with this repo. If you don't have the training data and base models on your local disk, also pass --from_remote true.
#chatglm2
deepspeed train_lora.py \
--run_name headline-chatglm2-linear \
--base_model chatglm2 \
--dataset headline \
--max_length 512 \
--batch_size 4 \
--learning_rate 1e-4 \
--num_epochs 8
Please be aware that "localhost:2" in the -i flag pins the run to a particular GPU device (here, GPU 2 on the local machine).
#llama2-13b
deepspeed -i "localhost:2" train_lora.py \
--run_name sentiment-llama2-13b-8epoch-16batch \
--base_model llama2-13b-nr \
--dataset sentiment-train \
--max_length 512 \
--batch_size 16 \
--learning_rate 1e-5 \
--num_epochs 8 \
--from_remote True \
>train.log 2>&1 &
Use tail -f train.log to follow the training log. For multi-task instruction tuning, pass multiple datasets separated by commas; a *N suffix (e.g., finred*3) repeats that dataset N times in the training mixture:
deepspeed train_lora.py \
--run_name MT-falcon-linear \
--base_model falcon \
--dataset sentiment-train,headline,finred*3,ner*15 \
--max_length 512 \
--batch_size 4 \
--learning_rate 1e-4 \
--num_epochs 4
deepspeed train_lora.py \
--run_name GRCLS-sentiment-falcon-linear-small \
--base_model falcon \
--test_dataset sentiment-cls-instruct \
--dataset headline-cls-instruct,finred-cls-instruct*2,ner-cls-instruct*7 \
--max_length 512 \
--batch_size 4 \
--learning_rate 1e-4 \
--num_epochs 1 \
--log_interval 10 \
--warmup_ratio 0 \
--scheduler linear \
--evaluation_strategy steps \
--eval_steps 100 \
--ds_config config_hf.json
Refer to Benchmarks/evaluate.sh for the evaluation scripts covering all financial tasks.
You can evaluate your trained model on multiple tasks together. For example:
python benchmarks.py \
--dataset fpb,fiqa,tfns,nwgi,headline,ner,re \
--base_model llama2 \
--peft_model ../finetuned_models/MT-llama2-linear_202309241345 \
--batch_size 8 \
--max_length 512
#llama2-13b sentiment analysis
CUDA_VISIBLE_DEVICES=1 python benchmarks.py \
--dataset fpb,fiqa,tfns,nwgi \
--base_model llama2-13b-nr \
--peft_model ../finetuned_models/sentiment-llama2-13b-8epoch-16batch_202310271908 \
--batch_size 8 \
--max_length 512 \
--from_remote True
For zero-shot evaluation on sentiment analysis, we use multiple prompts and evaluate each of them. The task indicators are fiqa_mlt and fpb_mlt.
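A hypothetical invocation following the same CLI pattern as the examples above; the PEFT model path is a placeholder for your own checkpoint:
python benchmarks.py \
--dataset fiqa_mlt,fpb_mlt \
--base_model llama2 \
--peft_model ../finetuned_models/<your_finetuned_model> \
--batch_size 8 \
--max_length 512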