Here are the documentation for the audit conducted on the LSTM-based model which is part of the solution presented in the original repo. For the original repo README, see below.
Before running the code, the input data need to be downloaded from Kaggle and put into input/jigsaw-unintended-bias-in-toxicity-classification
.
Results from these sections can be reproduced on a regular computer. The prediction results for the test set are included in the repo(code/models/lstm_5_targets.csv
), from which the results are derived.
jigsaw_eda.ipynb
. Code for the "input and output" section. No additional dependency needed.accuracy_fairness.ipynb
. Code for the "accuracy and fairness" section.fairlearn
needs to be installed.
To run the code for the "interpretability" and "robustness" part, a computer with Nvidia GPU is needed.
First, install the following dependencies. The use of mamba
is recommended as the dependency resolving process can be slow due to the legacy packages used in the original solution. The correct version of cuda should also be installed.
channels:
- conda-forge
dependencies:
- numpy
- pandas
- scikit-learn
- pytorch==1.8.2
- tensorboardx
- pytorch-pretrained-bert
- cudatoolkit=11.1
- ca-certificates
- certifi
- openssl
- torchvision
- torchaudio
- keras
- nltk
- emoji==1.7.0
- tensorflow
- fastai==1.0.61
- fasttext
- gensim
- ipykernel
- shap
The embeddings need to be downloaded from Kaggle and put into input/gensim-embeddings-dataset
. Then run the train_lstms.py
, and verify that the model and fitted vectorizer are saved in models/Notebook_100_5.bin
and models/char_vectorizer.joblib
. The training process should take around 1h.
Run the notebook inference.ipynb
.
Run the notebook shap.ipynb
.
Run the notebook textflint.ipynb
first, and then textflint_pred.ipynb
.
4th place solution to the Jigsaw Unintended Bias in Toxicity Classification
You should be able to replicate the solution and retrain all the models from our inference kernel just by running all train_*.py
scripts. One would need to put the input data and the embeddings dataset to the input
folder.
code/toxic
contains various utils that are used in train_*
files.
We ended up using a simple average ensemble of 33 models:
- 12 LSTM-based models
- 17 BERT models (base only models)
- 2 GPT2-based models
Not-standart things:
- We were predicting 18 targets with evey model
- We combined some of the target to get the final score. It was hurting AUC but was improving the target metric we care about.
- LSTMs were using char-level embeddings
- LSTMs were trained on different set of embeddings
- We mixed three different types of BERTs: cased, uncased and fine-tuned ucnased (we used the standart fine-tuning procedure)
- We tried to unbias our models with some PL. The idea was taken from this paper, "Bias Mitigation" section
- GPT2 models were using a CNN-based classifier head isntead of the linear classifier head
We trained all our models on a quite straightforward BCE loss. the only thing is that we had 18 targets.
- the main target, with special weights that highlites data points that mention identities
- the main taregt, again, but without any extra weights
- 5 toxicity types
- 9 main identites
- max value for any out of 9 identities columns
- binary column that indecates whether at least one identity was mentioned
All targets, except the last one, were used as soft targets - flaot values from 0 to 1, not hard binary 0/1 targets.
We used a somewhat common weights for the first loss. The toxicity subtypes were trained without any special weights. And the identites loss (last 12 targets) were trained with 0 weight for the NA identities. Even thought the NAs were treated like zeros we decided not to trust that information during the training. Can't say whether it worked though, didn't have time to properly run the ablation study.
In order to improve diversity of out ensemble we sometimes sligtly increased or decreased weight of the main part of the loss (config.main_loss_weight
).
The loss function we used:
def custom_loss(data, targets):
bce_loss_1 = nn.BCEWithLogitsLoss(targets[:, 1:2])(data[:, :1], targets[:, :1])
bce_loss_2 = nn.BCEWithLogitsLoss()(data[:, 1:7], targets[:, 2:8])
bce_loss_3 = nn.BCEWithLogitsLoss(targets[:, 19:20])(data[:, 7:18], targets[:, 8:19])
return config.main_loss_weight * bce_loss_1 + bce_loss_2 + bce_loss_3 / 4
The weights we used for the main part of the loss:
iden = train[IDENTITY_COLUMNS].fillna(0).values
weights = np.ones(len(train))
weights += (iden >= 0.5).any(1)
weights += (train["target"].values >= 0.5) & (iden < 0.5).any(1)
weights += (train["target"].values < 0.5) & (iden >= 0.5).any(1)
weights /= weights.mean() # Don't need to downscale the loss
And finally the identites targets and weights:
subgroup_target = np.hstack([
(iden >= 0.5).any(axis=1, keepdims=True).astype(np.int),
iden,
iden.max(axis=1, keepdims=True),
])
sub_target_weigths = (
~train[IDENTITY_COLUMNS].isna().values.any(axis=1, keepdims=True)
).astype(np.int)
We ended up using a very simple architecture with 2 bidirectional LSTM layers and a dense layer on top of that. You defenitely saw this architecture being copy-pased from one public kernel to another :) The only modificatrion we had is taking pooling not only from the top (second) LSTM layer, but from the first layer as well. Average pooling from both layers and max pooling from the last layer weere concatinaed before the dense classifiyng layers.
The diversity of oru 12 LSTM models comes from using different embeddings and different initial random states. We used the following embeddings (most of them you know and lvoe):
- glove.840B.300d (global vectors)
- crawl-300d-2M (fasttext)
- paragram_300_sl999 (paragram)
- GoogleNews-vectors-negative300 (word2vec)
- Char represnation. We computed top-100 most frequent chars in the data and then represented every token with counts on those chars. That's a poor man's char level modeling :)
Every LSTM model was taking 4 out of 5 concatinated embeddings. Thus the diversity.
We found the trick where you lookup various word forms in the mebdding dictionary being very helpful. That what we ended up using:
PORTER_STEMMER = PorterStemmer()
LANCASTER_STEMMER = LancasterStemmer()
SNOWBALL_STEMMER = SnowballStemmer("english")
def word_forms(word):
yield word
yield word.lower()
yield word.upper()
yield word.capitalize()
yield PORTER_STEMMER.stem(word)
yield LANCASTER_STEMMER.stem(word)
yield SNOWBALL_STEMMER.stem(word)
def maybe_get_embedding(word, model):
for form in word_forms(word):
if form in model:
return model[form]
word = word.strip("-'")
for form in word_forms(word):
if form in model:
return model[form]
return None
def gensim_to_embedding_matrix(word2index, path):
model = KeyedVectors.load(path, mmap="r")
embedding_matrix = np.zeros(
(max(word2index.values()) + 1, model.vector_size),
dtype=np.float32,
)
for word, i in word2index.items():
maybe_embedding = maybe_get_embedding(word, model)
if maybe_embedding is not None:
embedding_matrix[i] = maybe_embedding
return embedding_matrix
We did some preprocessing for the LSTM models and found it helpful:
- Replacements for some weird unicode chars like
\ufeff
- Replacements for the "starred" words like
bit*h
- Removing all uncide chars of the
Mn
category - Replacing all Hebrew chars with the same Hebrew letter
א
. This trick grately reduces the number of distinct characters in the vocabulary. We didn't expect our model to learn toxicity in different languages anyway. But we don't want to lose information about that some Hebrew word was here - The same logic applied to Arabic, Chineese and Japaneese chars
- Replacing emojies with their aliases and a special
EMJ
. token so the model could understand that an emoji was here.
We found that we can greatly increase the inference time for BERT models with two tricks. Both of them are possibly because of the attention_mask
we ahve in berts. What it means for us is that we can zero-pad our sentences as much as we want, it doesn't matter for BERT. Wem managed to reduces inference time from 17 minutes as is to 3.5 minutes. Almost 5x times!
The first optimization is obvious - pad the whole dataset not to the some constant MAX_LEN
value (we used 400 btw), but pad it to the longest comment in batch. Or rather clip the batch to remove some dead weight of zeros on the right. We did it like that:
def clip_to_max_len(batch):
X, lengths = map(torch.stack, zip(*batch))
max_len = torch.max(lengths).item()
return X[:, :max_len]
# sequences is zero-padded numpy array
lengths = np.argmax(sequences == 0, axis=1)
lengths[lengths == 0] = sequences.shape[1]
sequences = torch.from_numpy(sequences)
lengths = torch.from_numpy(lengths)
test_dataset = data.TensorDataset(sequences, lengths)
test_loader = data.DataLoader(
test_dataset,
batch_size=128,
collate_fn=clip_to_max_len,
)
The second is much more important optimization is to sort the dataset by length. This allowed us to greatly reduce the amount of the dead weight of zeros. And just a tiny change to the code above:
ids = lengths.argsort(kind="stable")
inverse_ids = test.id.values[ids].argsort(kind="stable")
test_loader = data.DataLoader(
data.Subset(test_dataset, ids),
batch_size=128,
collate_fn=clip_to_max_len,
)
# preds = here_you_get_predictions_from_the_model
preds = preds[inverse_ids]
We took and idea of adding extra non-toxic data that mentions identites to the dadset to kill the bias of the model. The flow is straightforward:
- Take an amazing datasdet of (8M senteces from wikipedia)[https://www.kaggle.com/mikeortman/wikipedia-sentences]
- Run predictions of all 18 targets on them with the best single model
- Filter out senteces that don't mention any identites
- Filter out senteces with high toxicity predictions (some movies and music albums ahve "toxic" titles like
Hang 'em high!
) - Set the targets for toxicity and subtypes of toxicity to 0 since you know that sentecces from wikipedia should not be toxic
- Add those senteces to the training data to kill the bias towards some identites
According to the article mentioned above the trick works very well. We saw string improvemnts on the CV, which didn't translate to as strong improvent on the LB. We still believe that this is a very neat trick, but one needs to be careful with sampling - to math the distribution of lengths at the very elast, or better, other tokens.