This repo contains simple functions for converting IOB2-format NER annotation data into tensor formats for Transformer-based NER tasks. Open-source examples of this format include the news-headlines dataset referenced by Prodigy and the BioMed-NER dataset.
Note: If you use Prodigy to annotate data for an NER task, the IOB2 format is what will be output.
Note: The below functions convert only one text example at a time, so a batch job will require an additional loop over examples.
Note: The conversion process relies heavily on the HuggingFace Tokenizer class, which provides utilities for mapping between character indices in the input text and token indices in the encoded input ids.
Below is an example of an NER/IOB2 format annotation:
# example annotation and labels
annotation = {
    "text": "Did Dame Judy Dench star in a British film about Queen Elizabeth?",
    "spans": [
        {"label": "actor", "start": 4, "end": 19},
        {"label": "plot", "start": 30, "end": 37},
        {"label": "character", "start": 49, "end": 64}
    ]
}
Example pulled from the MITMovie dataset.
In order to train an NER model (i.e., a token classification task), we can represent the target output of the above example as follows:
[0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0]
In which the target labels correspond to the following classes:
0 -> outside (i.e., not part of any entity)
1,2 -> actor (beginning and inside)
3,4 -> character (beginning and inside)
5 -> plot (beginning only; the entity is a single token)
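To make the mapping concrete, the sketch below reproduces a label sequence like the one above from the character spans alone, using naive whitespace tokenization. Note that `spans_to_word_labels` is a hypothetical helper for illustration, not part of this repo, and whitespace words differ from the subword tokens the library actually labels.

```python
def spans_to_word_labels(text: str, spans: list[dict], label_map: dict) -> list[int]:
    """Assign an IOB2 label id to each whitespace-separated word."""
    labels, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        pos = end
        label = label_map["O"]  # default: outside
        for span in spans:
            # a word belongs to a span if their character ranges overlap
            if start < span["end"] and end > span["start"]:
                prefix = "B-" if start <= span["start"] else "I-"
                label = label_map[prefix + span["label"].upper()]
                break
        labels.append(label)
    return labels

label_map = {"O": 0, "B-ACTOR": 1, "I-ACTOR": 2,
             "B-CHARACTER": 3, "I-CHARACTER": 4, "B-PLOT": 5, "I-PLOT": 6}
text = "Did Dame Judy Dench star in a British film about Queen Elizabeth?"
spans = [{"label": "actor", "start": 4, "end": 19},
         {"label": "plot", "start": 30, "end": 37},
         {"label": "character", "start": 49, "end": 64}]
spans_to_word_labels(text, spans, label_map)
# -> [0, 1, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4]
```

Because "Dench" is a single whitespace word here but splits into two WordPiece tokens under BERT, this version produces 12 labels rather than the 14 shown above.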
The following contains instructions for producing this conversion.
One of the first challenges in preprocessing data annotated for an NER task is managing the complexity of nested annotations, different field names, and the various ways to label an annotated entity. Since Transformers are coupled to a Tokenizer, an NER schema based around attaching labels to tokens (e.g., words) introduces complexity because the labeled tokens have to be converted for every different tokenizer. It also allows nuances of the tokenizer used during annotation to leak into the data.
For these reasons, assigning entity labels to string indices is more generic, decoupled from any specific tokenizer, and more easily checkable for errors in the data or any subsequent processing.
Due to the complex structure of NER spans and the associated text field, we perform a preprocessing and validation step to ensure everything is in good order. This validation is handled by Pydantic as an intermediate step, but the outputs are plain typed dictionaries to keep things simple for the user.
from iob2tensor import preprocess
text = "Did Dame Judy Dench star in a British film about Queen Elizabeth?"
spans = [
    {"label": "actor", "start": 4, "end": 19},
    {"label": "plot", "start": 30, "end": 37},
    {"label": "character", "start": 49, "end": 64}
]
# validate input annotations
annotation = preprocess(text, spans)
The default or expected fields for input annotations are as follows:
from typing import TypedDict
class Span(TypedDict):
    start: int
    end: int
    label: str

class Annotation(TypedDict):
    text: str
    spans: list[Span]
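For intuition, the kinds of checks such a preprocessing step performs can be sketched in plain Python. Here `validate_annotation` is illustrative only; the library's actual `preprocess` uses Pydantic models internally.

```python
def validate_annotation(text: str, spans: list[dict]) -> dict:
    """Illustrative validation: bounds-check spans and reject overlaps."""
    for span in spans:
        if not (0 <= span["start"] < span["end"] <= len(text)):
            raise ValueError(f"span out of bounds: {span}")
    ordered = sorted(spans, key=lambda s: s["start"])
    for prev, cur in zip(ordered, ordered[1:]):
        # overlapping entities cannot be expressed as flat IOB2 labels
        if cur["start"] < prev["end"]:
            raise ValueError(f"overlapping spans: {prev}, {cur}")
    return {"text": text, "spans": ordered}
```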
If your annotated data uses different fields, specify those fields as function arguments. For instance, the BioMed-NER dataset follows the standard NER spans schema but uses different field names.
annotation = {
    "text": "Weed seed inactivation in soil mesocosms via biosolarization...",
    "entities": [
        {"start": 0, "end": 4, "class": "ORGANISM"},
        {"start": 5, "end": 9, "class": "ORGANISM"},
        {"start": 26, "end": 30, "class": "CHEMICALS"},
        ...
    ]
}
annotation = preprocess(
    **annotation,
    spans_field="entities",
    label_field="class"
)
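Conceptually, these arguments just remap the custom field names onto the default schema before validation, roughly like this (`remap_fields` is a hypothetical sketch, not the repo's code):

```python
def remap_fields(annotation: dict, spans_field: str = "spans",
                 label_field: str = "label") -> dict:
    """Rename custom span/label fields to the default Annotation schema."""
    return {
        "text": annotation["text"],
        "spans": [
            {"label": s[label_field], "start": s["start"], "end": s["end"]}
            for s in annotation[spans_field]
        ],
    }
```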
Next, create the IOB label map from your dataset's entity labels. The default label in the IOB2 format represents all tokens which are not part of any entity and is thus referred to as the outside class; the convention is to assign all tokens of this class label=0. Additionally, the IOB2 format distinguishes between the beginning and the inside of an entity, so each entity class generates 2 distinct labels, following this format:
- B-LABEL
- I-LABEL
This means the label set and mapping will always have a size of (_n_ * 2) + 1, where _n_ equals the number of distinct entity classes (e.g., "location", "organization", "person", etc.) and the +1 comes from the outside (non-entity) class.
Use the following function to create the initial label map for your dataset's labels.
from iob2tensor import create_label_map
labels = ["actor", "character", "plot"]
label_map = create_label_map(labels)
label_map
>>> {
    'O': 0,
    'B-ACTOR': 1, 'I-ACTOR': 2,
    'B-CHARACTER': 3, 'I-CHARACTER': 4,
    'B-PLOT': 5, 'I-PLOT': 6
}
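The mapping above can be produced by logic along these lines (a sketch of the idea, not necessarily the exact create_label_map implementation; the helper name here is illustrative):

```python
def make_label_map(labels: list[str]) -> dict[str, int]:
    """Build the (n * 2) + 1 IOB2 label map: 'O' plus B-/I- per class."""
    label_map = {"O": 0}
    for label in sorted(labels):
        label_map[f"B-{label.upper()}"] = len(label_map)
        label_map[f"I-{label.upper()}"] = len(label_map)
    return label_map
```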
Now we select and initialize a tokenizer (which must be involved in the IOB label conversion, since tokenization determines the label positions) and convert our NER annotation into a label array.
There is a built-in conversion check (on by default) which ensures the conversion is correct. It is guaranteed to work for the supported tokenizers, and can be turned off to reduce computation.
from transformers import AutoTokenizer
from iob2tensor import to_iob_tensor
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
iob_labels = to_iob_tensor(annotation, label_map, tokenizer)
iob_labels
>>> [-100, 0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0, -100]
Note that special tokens (e.g., BERT's [CLS] and [SEP]) are assigned -100, the default ignore_index for PyTorch's cross-entropy loss, so they are excluded from training.
Now just one step away from a tensor!
import torch
x = torch.tensor(iob_labels)
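Under the hood, a conversion like this is typically driven by the character offsets a fast HuggingFace tokenizer returns with return_offsets_mapping=True. The sketch below shows that core logic using hand-written offsets so it runs without loading a tokenizer; `offsets_to_iob` is illustrative, not the repo's implementation.

```python
def offsets_to_iob(offsets, spans, label_map, ignore_index=-100):
    """Map each token's (start, end) character offsets to an IOB2 label id.

    Special tokens like [CLS]/[SEP] carry (0, 0) offsets and receive
    ignore_index (-100), which loss functions skip by convention.
    """
    labels = []
    for start, end in offsets:
        if start == end:  # special token or padding
            labels.append(ignore_index)
            continue
        label = label_map["O"]
        for span in spans:
            # token is inside a span if their character ranges overlap
            if start < span["end"] and end > span["start"]:
                prefix = "B-" if start <= span["start"] else "I-"
                label = label_map[prefix + span["label"].upper()]
                break
        labels.append(label)
    return labels

# pretend tokenization of "Judy Dench stars": [CLS] judy den ##ch stars [SEP]
offsets = [(0, 0), (0, 4), (5, 8), (8, 10), (11, 16), (0, 0)]
spans = [{"label": "actor", "start": 0, "end": 10}]
label_map = {"O": 0, "B-ACTOR": 1, "I-ACTOR": 2}
offsets_to_iob(offsets, spans, label_map)
# -> [-100, 1, 2, 2, 0, -100]
```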
There is a built-in check (which can optionally be turned off) within the main to_iob_tensor() function, which attempts to confirm the IOB2 conversion is correct. Additionally, there is a series of unit and end-to-end tests in the tests directory. Finally, the tokenizers.py file contains the specific tokenizer checkpoints which I have tested.