This repo implements dynamic activation offloading (DAO) on top of DyNet.
## Preparation

```bash
# Load CUDA 11.1
cd ~
source cuda.sh 11.1
```
### Install Eigen

```bash
mkdir eigen
cd eigen
wget https://github.com/clab/dynet/releases/download/2.1/eigen-b2e267dc99d4.zip
unzip eigen-b2e267dc99d4.zip
```
### Build

```bash
git clone https://github.com/clab/dynet.git
cd dynet
mkdir build
cd build
# Run CMake
# -DENABLE_BOOST=ON in combination with -DENABLE_CPP_EXAMPLES=ON also
# compiles the multiprocessing C++ examples
cmake .. -DEIGEN3_INCLUDE_DIR=/path/to/eigen -DENABLE_BOOST=ON -DENABLE_CPP_EXAMPLES=ON -DBACKEND=cuda -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_DAO=ON
# Compile using 2 parallel jobs
make -j 2
# Test with an example
./examples/xor
```
## Run MNIST

```bash
# Prepare datasets
ln -s /ssd1/siyuanch/workspace/dynet-dao/datasets datasets
cd build/examples
./mnist -t ../../datasets/mnist/train-images.idx3-ubyte -d ../../datasets/mnist/t10k-images.idx3-ubyte -tl ../../datasets/mnist/train-labels.idx1-ubyte -dl ../../datasets/mnist/t10k-labels.idx1-ubyte --batch_size 128 -N 20
```
## Transformer Example

```bash
# cd <repo dir>
./build/examples/transformer-train -c models/iwslt-envi/config.txt --parameters models/iwslt-envi/en-vi.transformer.h2_l2_u128_do010101010001_att1_ls00_pe1_ml150_ffrelu &>models/iwslt-envi/log.en-vi.transformer.h2_l2_u128_do010101010001_att1_ls00_pe1_ml150_ffrelu
```
## Example: fine-tune GPT-2 with LoRA and skip rate 0.2

```bash
skip_r=0.2
# cd <repo dir>
mkdir -p models/gpt2-124M
cp /home/siyuanch/ssd/workspace_zelongg/dynet-dao/models/gpt2-124M/hparams.ini models/gpt2-124M
# TODO: modify hparams.ini for epochs, batch size, and log frequency
mkdir -p models/gpt2-124M-$skip_r # prepare initial checkpoint
echo "768 12 12 4 0 0.1 $skip_r 0 0 0.1 1 1024 1 1 0 models/gpt2-124M-$skip_r/model.params" > models/gpt2-124M-$skip_r/model.config
cp /ssd1/siyuanch/workspace_zelongg/DAO/models/124M/dynet-model.params models/gpt2-124M-$skip_r/model.params
# Add --train-percent 10 to the command below for a faster run
./build/examples/transformer-lm -c models/gpt2-124M/hparams.ini --model-path models/gpt2-124M-$skip_r --attn-lora-r 2 --attention-dropout-p $skip_r --ff-dropout-p 0 --reset-if-stuck --use-smaller-minibatch 2>&1 | tee models/gpt2-124M-$skip_r/train.log
# Run the transformer with DAO offloading enabled
./build/examples/transformer-lm --train-percent 3 --use_offload --dao-gpu-mem 16384 --dao-verbose 0 -c models/gpt2-124M/hparams.ini --attn-lora-r 2 --attention-dropout-p 0.2 --ff-dropout-p 0.2 --reset-if-stuck --use-smaller-minibatch --dynet-seed 1 2>&1 | tee models/gpt2-124M-0.2/train.log
```
The DAO-related flags are:

- `--use_offload`: enable DAO's offloading backend; otherwise, fall back to DyNet's backend.
- `--dao-gpu-mem [int]`: the GPU memory size for DAO's backend, in MB.
- `--dao-cpu-mem [int]`: the CPU memory size for DAO's backend, in MB.
- `--dao-verbose [int=0]`: the verbosity level of DAO; default 0.
- `--dao-debug`: enable many extra assertions inside DAO.
- `--dynet-seed [int=0]`: the random seed; default 0, which means a random seed is chosen.
- `--dao-profile 1`: enable tracing of kernels. In your C++ application code, `#include <DAO/DAO.h>` and call the `DAO::profiler.dump(std::string name)` method to dump the traces into a `name.traces` file (see the sketch below).
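A minimal sketch of the profiler hook, assuming only the `DAO/DAO.h` header and `DAO::profiler.dump` call described above; the training loop itself is elided:

```cpp
#include <DAO/DAO.h>

int main(int argc, char** argv) {
  // ... initialize DyNet/DAO, build the model, and train with --dao-profile 1 ...

  // Write the recorded kernel traces to "gpt2-124M.traces".
  DAO::profiler.dump("gpt2-124M");
  return 0;
}
```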
We also provide a script that generates run scripts for GPT-2; use `python examples/gpt2/gen_script.py --help` to see the usage. For example:

```bash
python examples/gpt2/gen_script.py --name gpt2-124M -c models/gpt2-124M/hparams.ini --gpu-mem 3 --attn-lora-r 4 --attention-dropout-p 0.0 0.4 0.8 --ff-dropout-p 0.0 0.4 0.8 --update-freq 8 --bs 2048 --script-name run_linear
```
We use `Engine` to train the model by delaying the forward/backward/update calls. The API can be seen in the header, and an example usage can be seen here.
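The following is a hypothetical sketch of the delayed-execution pattern described above, not the actual API: the `DAO::Engine` method names (`forward`, `backward`, `update`, `run`) are assumptions for illustration, so consult the `Engine` header for the real signatures.

```cpp
#include <DAO/DAO.h>
#include <dynet/dynet.h>
#include <dynet/expr.h>
#include <dynet/training.h>

// Hypothetical sketch: record the three phases of a training step and let the
// engine execute them later, instead of running each one eagerly.
void train_step(DAO::Engine& engine, dynet::ComputationGraph& cg,
                dynet::Expression loss, dynet::Trainer& trainer) {
  engine.forward(cg, loss);   // record the forward pass (assumed method name)
  engine.backward(cg, loss);  // record the backward pass (assumed method name)
  engine.update(trainer);     // record the parameter update (assumed method name)
  engine.run();               // execute the delayed forward/backward/update
}
```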
We also add a feature to DyNet for marking whether a parameter is trainable. `dynet::ParameterCollection::set_default_updated(bool trainable)` sets whether parameters added to the collection are trainable by default. To specify whether an individual parameter is trainable, use the following API to add it to the collection:
```cpp
/**
 * \brief Add parameters with custom initializer
 *
 * \param d Shape of the parameter
 * \param init Custom initializer
 * \param name Name of the parameter
 * \param device Device placement for the parameter
 * \param trainable Whether the parameter is trainable or not
 *
 * \return Parameter object to be used in the computation graph
 */
Parameter ParameterCollection::add_parameters(const Dim& d, const ParameterInit& init,
                                              const std::string& name, Device* device, bool trainable);
```
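For example, here is a short sketch of a collection where newly added parameters are frozen by default and a single parameter is explicitly marked trainable. The shapes, names, and `ParameterInitGlorot` initializer are illustrative, and the three-argument call assumes the fork keeps DyNet's usual default arguments for `device` and `trainable`; `dynet::default_device` is DyNet's standard global device pointer.

```cpp
#include <dynet/dynet.h>
#include <dynet/globals.h>
#include <dynet/model.h>
#include <dynet/param-init.h>

using namespace dynet;

void add_lora_params(ParameterCollection& pc) {
  // Parameters added from here on are frozen by default (not updated during training).
  pc.set_default_updated(false);
  Parameter W_base = pc.add_parameters({768, 768}, ParameterInitGlorot(), "W_base");

  // Override the default: this parameter is explicitly marked trainable.
  Parameter W_lora = pc.add_parameters({768, 8}, ParameterInitGlorot(), "W_lora",
                                       default_device, /*trainable=*/true);
}
```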