This folder contains the code for the entire InstrAug pipeline, which includes generation, post-processing (filtering), and dataset reconstruction. During the generation stage, we "ask" LLaMA2-Chat-13B to generate augmented instructions from the original ones.
- Create the environment to run LLaMA using the provided configuration file:

```bash
conda env create -f env.yaml
conda activate llama
```
- Download the LLaMA-2 checkpoint from this link (requires applying for access on the Meta website) and put the checkpoint files in the checkpoint folder:

```bash
mkdir -p ./llama/ckpt/
cd llama/ckpt
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-13b-chat
```
- Download and preprocess the MINS dataset following `README_MINS.md`.
Guiding instructions are already included in `llama/gen_instr.txt`. You can also use customized guiding instructions by replacing the original content with them.
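For reference, a guiding instruction is the meta-prompt that tells the model how to rewrite each original instruction. The lines below are illustrative placeholders only, not the shipped contents of `gen_instr.txt`:

```text
Rewrite the following instruction in different words while keeping its meaning unchanged.
Paraphrase the instruction below so that it requests exactly the same task in a new style.
```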
Simply run the generation script (we use 2 NVIDIA RTX A6000 48GB GPUs in this step):

```bash
CUDA_VISIBLE_DEVICES='0,1' bash gen_new_inst.sh
```
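For orientation, the core of the generation step roughly follows Meta's `llama` reference API, as in the sketch below. The prompt construction, file paths, and batching here are simplified assumptions; the actual logic lives in `instruction_gen.py`.

```python
# Minimal sketch of the generation step, assuming Meta's llama-2 reference
# API; prompt format and file paths are illustrative, not the repo's own.
from llama import Llama

RAW_FILE = "llama/raw_instructions.txt"      # placeholder path
GEN_TRG_FILE = "llama/gen_instructions.txt"  # placeholder path

generator = Llama.build(
    ckpt_dir="llama/ckpt/Llama-2-13b-chat",
    tokenizer_path="llama/ckpt/Llama-2-13b-chat/tokenizer.model",
    max_seq_len=2048,
    max_batch_size=4,
)

guide = open("llama/gen_instr.txt").read()                    # guiding instructions
originals = [l.strip() for l in open(RAW_FILE) if l.strip()]  # raw instructions

dialogs = [[{"role": "user", "content": f"{guide}\n\n{inst}"}]
           for inst in originals]

results = []
for i in range(0, len(dialogs), 4):  # chunk to respect max_batch_size
    results += generator.chat_completion(dialogs[i:i + 4],
                                         temperature=0.6, top_p=0.9)

with open(GEN_TRG_FILE, "w") as f:
    for r in results:
        f.write(r["generation"]["content"].strip().replace("\n", " ") + "\n")
```

Note that the 13B chat checkpoint is sharded into two model-parallel ranks, which is presumably why two GPUs are needed; the reference API expects a `torchrun --nproc_per_node 2` launch, which the shell script likely wraps.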
The script first saves raw instructions into `RAW_FILE`, then generates augmented instructions into `GEN_TRG_FILE`. The process next filters the instructions in `SRC_FILE` according to predefined rules and writes the result to `TRG_FILE`.
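The repository's predefined rules are not reproduced here; as a purely illustrative sketch, rule-based filtering of this kind typically looks like the following (assumed rules: length bounds and deduplication):

```python
# Illustrative rule-based filter; the assumed rules (length bounds and
# deduplication) stand in for the repo's actual predefined rules.
def filter_instructions(src_path: str, trg_path: str) -> None:
    seen = set()
    with open(src_path) as src, open(trg_path, "w") as trg:
        for line in src:
            inst = line.strip()
            if not 3 <= len(inst.split()) <= 64:  # drop too-short/too-long lines
                continue
            if inst.lower() in seen:              # drop case-insensitive duplicates
                continue
            seen.add(inst.lower())
            trg.write(inst + "\n")

# SRC_FILE and TRG_FILE as configured in instruction_gen.py (example paths)
filter_instructions("llama/gen_instructions.txt", "llama/filtered_instructions.txt")
```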
You must specify these filenames in `instruction_gen.py` before generation.
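In practice this means editing the file-path constants in `instruction_gen.py`; the exact variable layout below is an assumption, with example paths:

```python
# In instruction_gen.py -- set these before running generation.
# (Variable layout and paths are assumptions for illustration.)
RAW_FILE = "llama/raw_instructions.txt"       # original instructions
GEN_TRG_FILE = "llama/gen_instructions.txt"   # LLM-generated augmentations
SRC_FILE = GEN_TRG_FILE                       # input to the filtering step
TRG_FILE = "llama/filtered_instructions.txt"  # filtered output
```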
Run the following command to rebuild the dataset with the augmented instructions. You should specify the number of instances per task (`SIZE`) and the filtered instruction version:

```bash
python build_new_dataset.py
```
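Since the command takes no flags, both knobs presumably live inside the script; a hypothetical configuration block, assuming module-level constants in `build_new_dataset.py`:

```python
# In build_new_dataset.py -- assumed configuration (names are illustrative).
SIZE = 500                                         # instances kept per task
INSTR_VERSION = "llama/filtered_instructions.txt"  # which filtered file to use
```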