Our code for estimating COAR attributions is split into three stages: initialization, dataset construction, and regression. We describe each stage in more detail below by applying COAR to a ResNet trained on CIFAR-10 (`coar/estimate/cifar_resnet`).
The first stage initializes a directory of data stores (memory-mapped numpy arrays) from a JSON specification file. We use the following spec file (`coar/estimate/cifar_resnet/spec_test.json`) to initialize data stores for CIFAR ResNet attributions:
```json
{
    "num_models": 100,
    "schema": {
        "masks": {
            "dtype": "bool_",
            "shape": [2304]
        },
        "test_margins": {
            "dtype": "float16",
            "shape": [10000]
        }
    }
}
```
This file specifies the dataset size and two data stores:

- `num_models` indicates that each data store comprises 100 rows, each corresponding to an ablated model.
- `masks` is a `num_models x 2304` boolean array, where `2304` is the number of components in the ResNet model. Each mask (one per row) corresponds to a random subset of components ablated from the model.
- `test_margins` is a `num_models x 10000` float16 array. Each row records the outputs of an ablated model on all 10k examples from the CIFAR-10 test set.
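Conceptually, initialization just allocates one memory-mapped array per schema entry. Here is a minimal sketch, assuming each store is written as a `.npy` file named after its schema key (the actual `initialize_store` script may organize files differently):

```python
# Illustrative sketch only: allocate one memory-mapped .npy array per
# schema entry, with num_models rows each. File naming here is assumed.
import json
import os
import numpy as np

def init_stores(spec_path, store_dir):
    with open(spec_path) as f:
        spec = json.load(f)
    os.makedirs(store_dir, exist_ok=True)
    for name, meta in spec["schema"].items():
        shape = (spec["num_models"], *meta["shape"])
        np.lib.format.open_memmap(
            os.path.join(store_dir, f"{name}.npy"),
            mode="w+",
            dtype=getattr(np, meta["dtype"]),  # e.g., np.bool_, np.float16
            shape=shape,
        )

init_stores("coar/estimate/cifar_resnet/spec_test.json", "/tmp/coar_test")
```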
To initialize these stores, you can run something like:
```bash
REPO_DIR=/mnt/xfs/home/harshay/repos_xfs/modelcomponents
STORE_DIR=/mnt/xfs/home/harshay/out/tmp/coar_test
EST_DIR=$REPO_DIR/coar/estimate

set -e
echo "Initializing store"
cd $EST_DIR
python -m initialize_store --logging.logdir $STORE_DIR --logging.spec $EST_DIR/cifar_resnet/spec_test.json
```
You will also need to download the model checkpoint and the FFCV-formatted dataset (`.beton` file):
```bash
MODEL_PATH=$REPO_DIR/models/cifar_resnet.pt
BETON_PATH=$REPO_DIR/betons/cifar_resnet.beton

wget -O $MODEL_PATH "https://www.dropbox.com/scl/fi/ar7fput9rzyxebep0cgqf/cifar.pt?rlkey=y4hmrj94o4vxe4so55z1ebefw&dl=0"
wget -O $BETON_PATH "https://www.dropbox.com/scl/fi/4zj04xkgnb5mpw4aosvrt/cifar10.beton?rlkey=wspv74qs0h7l5cbxmzntmsywe&dl=0"
```
In the second stage, we build a "component dataset" for each example in the test set using the initialized data stores from above. This dataset comprises tuples that pair a binary component mask (a row of `masks`) with the ablated model's output margin on the example (the corresponding entry of `test_margins`).
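For intuition, here is a rough sketch of how one row of these stores could be filled in. `apply_mask` is a hypothetical helper that ablates (e.g., zeroes out) the masked-out components, and the margin definition is our assumption; the actual `make_dataset.py` additionally handles batching, checkpointing, and writing to the memory-mapped stores.

```python
# Hypothetical sketch of filling one row of the data stores: sample a
# random component mask, ablate the model accordingly, and record its
# margins on every test example. Assumes a fresh model copy per row.
import torch

def correct_class_margin(logits, labels):
    # margin = correct-class logit minus the largest other logit
    correct = logits.gather(1, labels[:, None]).squeeze(1)
    rest = logits.clone()
    rest.scatter_(1, labels[:, None], float("-inf"))
    return correct - rest.max(dim=1).values

@torch.no_grad()
def fill_row(model, apply_mask, loader, masks, margins, i, subsample_prob=0.95):
    # keep each component independently with probability subsample_prob
    mask = torch.rand(masks.shape[1]) < subsample_prob
    masks[i] = mask.numpy()
    apply_mask(model, mask)  # assumed helper: ablate components where mask is False
    outs = [correct_class_margin(model(x), y) for x, y in loader]
    margins[i] = torch.cat(outs).half().numpy()
```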
We implement this functionality for the CIFAR ResNet setup using a Python script, `make_dataset.py`, in `coar/estimate/cifar_resnet`:
```bash
cd $REPO_DIR/coar
python -m estimate.cifar_resnet.make_dataset \
    --expt.base_dir $STORE_DIR \
    --expt.subsample_prob 0.95 \
    --expt.batch_size 500 \
    --expt.start_index 0 \
    --expt.end_index 100 \
    --expt.model_path $MODEL_PATH \
    --expt.beton_path $BETON_PATH
```
In the third stage, we use the datasets described above to estimate component attributions (one per example) using `fast_l1`, a SAGA-based GPU solver for linear regression.
We use the FFCV library for faster dataloading, so we first convert the component datasets from stage two into FFCV-compatible `.beton` format as follows:
```bash
cd $EST_DIR
python -m write_dataset \
    --cfg.data_dir $STORE_DIR \
    --cfg.out_path $STORE_DIR/component_datasets.beton \
    --cfg.x_name masks \
    --cfg.y_name test_margins \
    --cfg.ignore_completed
```
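For reference, a `.beton` conversion with FFCV looks roughly like the sketch below; the field names and the mask/margin pairing mirror the `--cfg.x_name`/`--cfg.y_name` flags above, but the repo's `write_dataset` script is the authoritative version.

```python
# Rough sketch of a .beton conversion with FFCV (illustrative only; the
# file names and field names here are assumptions).
import numpy as np
from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField

masks = np.load("masks.npy")             # (num_models, 2304), bool
margins = np.load("test_margins.npy")    # (num_models, 10000), float16

class ComponentDataset:
    def __len__(self):
        return len(masks)
    def __getitem__(self, i):
        return masks[i], margins[i]

writer = DatasetWriter("component_datasets.beton", {
    "masks": NDArrayField(np.dtype("bool"), (masks.shape[1],)),
    "margins": NDArrayField(np.dtype("float16"), (margins.shape[1],)),
})
writer.from_indexed_dataset(ComponentDataset())
```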
Then, we use `fast_l1` to estimate component attributions for the CIFAR ResNet setup as follows:

```bash
python -m run_regression \
    --config $EST_DIR/cifar_resnet/regression_config_test.yaml \
    --data.data_path $STORE_DIR/component_datasets.beton \
    --cfg.out_dir $STORE_DIR/coar_attributions.pt
```
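For intuition about what this regression computes, here is a hedged CPU stand-in using scikit-learn's `Lasso` in place of `fast_l1`: for each example, fit a sparse linear model that predicts the example's margins from the binary masks; the coefficient vector is that example's component attribution.

```python
# CPU stand-in for the regression stage (illustrative only; the real
# pipeline uses the fast_l1 GPU solver and reads the .beton dataset).
import numpy as np
from sklearn.linear_model import Lasso

masks = np.load("masks.npy").astype(np.float32)            # (num_models, 2304)
margins = np.load("test_margins.npy").astype(np.float32)   # (num_models, 10000)

attributions = np.empty((margins.shape[1], masks.shape[1]), dtype=np.float32)
for j in range(margins.shape[1]):
    reg = Lasso(alpha=1e-3)  # alpha is a placeholder, not a tuned value
    reg.fit(masks, margins[:, j])
    attributions[j] = reg.coef_  # one attribution vector per example
```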
That's it! The file `coar_attributions.pt` contains COAR-estimated attributions of the ResNet model on all 10k CIFAR-10 test examples.
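As a quick sanity check, you can load the output with `torch.load`; we assume here that it deserializes to an attribution matrix (or a dict containing one), so adjust to the actual format:

```python
# Quick sanity check (assumes the file was saved with torch.save and
# holds an attribution matrix; adapt to the actual output format).
import torch

attrs = torch.load("coar_attributions.pt", map_location="cpu")
print(type(attrs))
# If it is a single tensor, expect shape (10000, 2304):
# print(attrs.shape)
```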
To run all of this in one go, check out `run.sh`. You can speed things up by running `make_dataset` (i.e., component dataset construction) in parallel over a cluster of GPU machines; we provide an example SLURM script for this here, though it will require some modification depending on your setup. If you just want pre-computed component attributions, check out the README.
We provide example run scripts for ImageNet ResNet50, ImageNet ViT-B/16, and GPT-2 evaluated on TinyStories.