This folder contains the code for the entire InstrAug pipeline, which includes generation, post-processing (filtering), and dataset reconstruction. During the generation stage, we "ask" LLaMA2-Chat-13B to generate augmented instructions from the original ones.
- Create the environment to run LLaMA using the provided configuration file:

```bash
conda env create -f env.yaml
conda activate llama
```
- Download the LLaMA-2 checkpoint from this link (requires applying for access on the Meta website) and put the checkpoint files in the checkpoint folder:

```bash
mkdir -p ./llama/ckpt/
cd llama/ckpt
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-13b-chat
```
- Download and preprocess the MINS dataset following `README_MINS.md`.
Guiding instructions are already included in `llama/gen_instr.txt`. You can also use customized guiding instructions by replacing the original content with them.
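For reference, a guiding instruction is the meta-prompt that tells the model how to rewrite each original instruction. The lines below are illustrative placeholders only, not the shipped contents of `gen_instr.txt`:

```text
Rewrite the following instruction in different words while keeping its meaning unchanged.
Paraphrase the instruction below so that it requests exactly the same task in a new style.
```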
Simply run the generation script (we use 2 NVIDIA RTX A6000 48GB GPUs in this step):

```bash
CUDA_VISIBLE_DEVICES='0,1' bash gen_new_inst.sh
```
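For orientation, the core of the generation step roughly follows Meta's `llama` reference API, as in the sketch below. The prompt construction, file paths, and batching here are simplified assumptions; the actual logic lives in `instruction_gen.py`.

```python
# Minimal sketch of the generation step, assuming Meta's llama-2 reference
# API; prompt format and file paths are illustrative, not the repo's own.
from llama import Llama

RAW_FILE = "llama/raw_instructions.txt"      # placeholder path
GEN_TRG_FILE = "llama/gen_instructions.txt"  # placeholder path

generator = Llama.build(
    ckpt_dir="llama/ckpt/Llama-2-13b-chat",
    tokenizer_path="llama/ckpt/Llama-2-13b-chat/tokenizer.model",
    max_seq_len=2048,
    max_batch_size=4,
)

guide = open("llama/gen_instr.txt").read()                    # guiding instructions
originals = [l.strip() for l in open(RAW_FILE) if l.strip()]  # raw instructions

dialogs = [[{"role": "user", "content": f"{guide}\n\n{inst}"}]
           for inst in originals]

results = []
for i in range(0, len(dialogs), 4):  # chunk to respect max_batch_size
    results += generator.chat_completion(dialogs[i:i + 4],
                                         temperature=0.6, top_p=0.9)

with open(GEN_TRG_FILE, "w") as f:
    for r in results:
        f.write(r["generation"]["content"].strip().replace("\n", " ") + "\n")
```

Note that the 13B chat checkpoint is sharded into two model-parallel ranks, which is presumably why two GPUs are needed; the reference API expects a `torchrun --nproc_per_node 2` launch, which the shell script likely wraps.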
The script first saves raw instructions into `RAW_FILE`, then generates augmented instructions into `GEN_TRG_FILE`. The process next filters the instructions in `SRC_FILE` according to predefined rules and writes the result to `TRG_FILE`.
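The repository's predefined rules are not reproduced here; as a purely illustrative sketch, rule-based filtering of this kind typically looks like the following (assumed rules: length bounds and deduplication):

```python
# Illustrative rule-based filter; the assumed rules (length bounds and
# deduplication) stand in for the repo's actual predefined rules.
def filter_instructions(src_path: str, trg_path: str) -> None:
    seen = set()
    with open(src_path) as src, open(trg_path, "w") as trg:
        for line in src:
            inst = line.strip()
            if not 3 <= len(inst.split()) <= 64:  # drop too-short/too-long lines
                continue
            if inst.lower() in seen:              # drop case-insensitive duplicates
                continue
            seen.add(inst.lower())
            trg.write(inst + "\n")

# SRC_FILE and TRG_FILE as configured in instruction_gen.py (example paths)
filter_instructions("llama/gen_instructions.txt", "llama/filtered_instructions.txt")
```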
You must specify these filenames in `instruction_gen.py` before generation.
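In practice this means editing the file-path constants in `instruction_gen.py`; the exact variable layout below is an assumption, with example paths:

```python
# In instruction_gen.py -- set these before running generation.
# (Variable layout and paths are assumptions for illustration.)
RAW_FILE = "llama/raw_instructions.txt"       # original instructions
GEN_TRG_FILE = "llama/gen_instructions.txt"   # LLM-generated augmentations
SRC_FILE = GEN_TRG_FILE                       # input to the filtering step
TRG_FILE = "llama/filtered_instructions.txt"  # filtered output
```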
Run the following command to rebuild the dataset with the augmented instructions. You should specify the number of instances per task (`SIZE`) and the filtered instruction version:

```bash
python build_new_dataset.py
```
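Since the command takes no flags, both knobs presumably live inside the script; a hypothetical configuration block, assuming module-level constants in `build_new_dataset.py`:

```python
# In build_new_dataset.py -- assumed configuration (names are illustrative).
SIZE = 500                                         # instances kept per task
INSTR_VERSION = "llama/filtered_instructions.txt"  # which filtered file to use
```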