In this directory, you can find several notebooks that illustrate how to use LayoutLMv2 both for fine-tuning on custom data and for inference. I've split up the notebooks according to the different downstream datasets:
- CORD (form understanding)
- DocVQA (visual question answering on documents)
- FUNSD (form understanding)
- RVL-CDIP (document image classification)
I've implemented LayoutLMv2 (and LayoutXLM) in the same way as other models in the Transformers library. You have the following head models (a minimal inference sketch follows the list):
- `LayoutLMv2ForSequenceClassification`, which you can use to classify document images (an example dataset is RVL-CDIP). This model adds a sequence classification head on top of the base `LayoutLMv2Model`, and returns `logits` of shape `(batch_size, num_labels)` (similar to `BertForSequenceClassification`).
- `LayoutLMv2ForTokenClassification`, which you can use to annotate words appearing in a document image (example datasets here are CORD, FUNSD, SROIE, Kleister-NDA). This model adds a token classification head on top of the base `LayoutLMv2Model`, and treats form understanding as a sequence labeling/named-entity recognition (NER) problem. It returns `logits` of shape `(batch_size, sequence_length, num_labels)` (similar to `BertForTokenClassification`).
- `LayoutLMv2ForQuestionAnswering`, which you can use to perform extractive visual question answering on document images (an example dataset here is DocVQA). This model adds a question answering head on top of the base `LayoutLMv2Model`, and returns `start_logits` and `end_logits` (similar to `BertForQuestionAnswering`).
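
To make this concrete, here's a minimal inference sketch for the sequence classification head. It is not taken from the notebooks: it assumes detectron2 and pytesseract are installed (LayoutLMv2's visual backbone and the processor's built-in OCR need them), and `document.png` is a hypothetical local file:

```python
from PIL import Image
import torch
from transformers import LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

# The processor wraps the feature extractor (which runs Tesseract OCR by default)
# and the tokenizer, turning an image into input_ids, bbox and image tensors.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
# num_labels=16 matches the 16 document classes of RVL-CDIP
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=16
)

image = Image.open("document.png").convert("RGB")  # hypothetical document image
encoding = processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

print(outputs.logits.shape)  # torch.Size([1, 16]) -> (batch_size, num_labels)
```

The token classification and question answering heads are loaded the same way via `LayoutLMv2ForTokenClassification` and `LayoutLMv2ForQuestionAnswering`; for visual question answering you additionally pass the question text to the processor, e.g. `processor(image, question, return_tensors="pt")`.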
The full documentation (which also includes tips on how to use `LayoutLMv2Processor`) can be found here.
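
As a quick sketch of one such tip (again not taken from the notebooks): if you already have OCR results, you can instantiate the feature extractor with `apply_ocr=False` and provide your own words and boxes (normalized to a 0-1000 scale). The file name, words and boxes below are made up:

```python
from PIL import Image
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2TokenizerFast,
    LayoutLMv2Processor,
)

# Disable the built-in Tesseract OCR and provide words + boxes yourself
feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

image = Image.open("document.png").convert("RGB")  # hypothetical document image
words = ["Invoice", "Total:", "$1,000"]            # hypothetical OCR output
boxes = [[82, 40, 210, 68], [90, 500, 160, 520], [170, 500, 240, 520]]  # 0-1000 scale

# Returns input_ids, attention_mask, token_type_ids, bbox and the resized image tensor
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
```

During training you can also pass `word_labels=...` to the processor to get labels aligned with the tokens for the token classification head.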
The models on the hub can be found here.
Note that there's also a Gradio demo available for LayoutLMv2, hosted as a HuggingFace Space here.