Commit 3c30155

Merge pull request RasaHQ#5187 from RasaHQ/transformers_lm

Language Models from Transformers Lib

2 parents 32c2ead + 86ee337

20 files changed: +1454 −58 lines

changelog/5187.feature.rst

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+Integrate language models from HuggingFace's Transformers library.
+
+Add a new NLP component ``HFTransformersNLP`` which tokenizes and featurizes incoming messages using a specified
+pre-trained model with the Transformers library as the backend.
+Add ``LanguageModelTokenizer`` and ``LanguageModelFeaturizer`` which use the information from ``HFTransformersNLP``
+and set it correctly on the message object.
+Language models currently supported: BERT, OpenAIGPT, GPT-2, XLNet, DistilBERT, RoBERTa.

docs/nlu/components.rst

Lines changed: 105 additions & 0 deletions
@@ -82,6 +82,54 @@ SpacyNLP
         # between these two words, therefore setting this to `true`.
         case_sensitive: false

+
+.. _HFTransformersNLP:
+
+HFTransformersNLP
+~~~~~~~~~~~~~~~~~
+
+:Short: HuggingFace's Transformers based pre-trained language model initializer
+:Outputs: nothing
+:Requires: nothing
+:Description:
+    Initializes a specified pre-trained language model from HuggingFace's `Transformers library
+    <https://huggingface.co/transformers/>`__. The component applies language model specific tokenization and
+    featurization to compute sequence and sentence level representations for each example in the training data.
+    Include :ref:`LanguageModelTokenizer` and :ref:`LanguageModelFeaturizer` to utilize the output of this
+    component for downstream NLU models.
+:Configuration:
+
+    .. code-block:: yaml
+
+        pipeline:
+          - name: HFTransformersNLP
+
+            # Name of the language model to use
+            model_name: "bert"
+
+            # Shortcut name to specify the architecture variation of the above model. The full list of supported
+            # architectures can be found at https://huggingface.co/transformers/pretrained_models.html .
+            # If left empty, the default model architecture of the original Transformers library is loaded.
+            model_weights: "bert-base-uncased"
+
+            # +----------------+--------------+-------------------------+
+            # | Language Model | Parameter    | Default value for       |
+            # |                | "model_name" | "model_weights"         |
+            # +----------------+--------------+-------------------------+
+            # | BERT           | bert         | bert-base-uncased       |
+            # +----------------+--------------+-------------------------+
+            # | GPT            | gpt          | openai-gpt              |
+            # +----------------+--------------+-------------------------+
+            # | GPT-2          | gpt2         | gpt2                    |
+            # +----------------+--------------+-------------------------+
+            # | XLNet          | xlnet        | xlnet-base-cased        |
+            # +----------------+--------------+-------------------------+
+            # | DistilBERT     | distilbert   | distilbert-base-uncased |
+            # +----------------+--------------+-------------------------+
+            # | RoBERTa        | roberta      | roberta-base            |
+            # +----------------+--------------+-------------------------+
+
+
 Text Featurizers
 ----------------
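To make the table concrete: below is a minimal sketch of how ``model_name`` and ``model_weights`` could map onto the Transformers API. The dictionaries and the ``load_model`` helper are illustrative, not the component's actual code; the imported classes and ``from_pretrained`` calls are part of the Transformers library.

    # Illustrative mapping from "model_name" to Transformers classes; the dicts
    # and load_model() are hypothetical, the library calls are real.
    from transformers import BertTokenizer, TFBertModel, GPT2Tokenizer, TFGPT2Model

    model_class_dict = {"bert": TFBertModel, "gpt2": TFGPT2Model}
    tokenizer_class_dict = {"bert": BertTokenizer, "gpt2": GPT2Tokenizer}
    model_weights_defaults = {"bert": "bert-base-uncased", "gpt2": "gpt2"}

    def load_model(model_name: str, model_weights: str = ""):
        # Fall back to the default weights from the table when none are configured.
        weights = model_weights or model_weights_defaults[model_name]
        tokenizer = tokenizer_class_dict[model_name].from_pretrained(weights)
        model = model_class_dict[model_name].from_pretrained(weights)
        return tokenizer, model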

@@ -182,6 +230,40 @@ ConveRTFeaturizer
             - name: "ConveRTFeaturizer"


+.. _LanguageModelFeaturizer:
+
+LanguageModelFeaturizer
+~~~~~~~~~~~~~~~~~~~~~~~
+
+:Short:
+    Creates a vector representation of user message and response (if specified) using a pre-trained language model.
+:Outputs:
+    nothing, used as an input to intent classifiers and response selectors that need intent features and response
+    features respectively (e.g. ``DIETClassifier`` and ``ResponseSelector``)
+:Requires: :ref:`HFTransformersNLP`
+:Type: Dense featurizer
+:Description:
+    Creates features for intent classification and response selection.
+    Uses the pre-trained language model specified in the upstream :ref:`HFTransformersNLP` component to compute
+    vector representations of the input text.
+
+    .. warning::
+        Please make sure that you use a language model which is pre-trained on the same language corpus as that of
+        your training data.
+
+:Configuration:
+
+    Include the ``HFTransformersNLP`` component before this component. Also, use :ref:`LanguageModelTokenizer` to
+    ensure tokens are correctly set for all components throughout the pipeline.
+
+    .. code-block:: yaml
+
+        pipeline:
+          - name: "HFTransformersNLP"
+            model_name: # Name of language model to use
+          - name: "LanguageModelFeaturizer"
+
+
 RegexFeaturizer
 ~~~~~~~~~~~~~~~
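The ``:Outputs:`` entry above says the features are consumed downstream rather than returned. A minimal sketch of how a downstream component would read them off the message; the array contents and the 768 dimension are assumed:

    import numpy as np
    from rasa.nlu.constants import DENSE_FEATURE_NAMES, TEXT
    from rasa.nlu.training_data import Message

    message = Message("book me a flight")
    # Stand-in for what LanguageModelFeaturizer would set (values and dim assumed).
    message.set(DENSE_FEATURE_NAMES[TEXT], np.zeros((5, 768)))

    dense_features = message.get(DENSE_FEATURE_NAMES[TEXT])
    print(dense_features.shape)  # (5, 768): one row per token plus a pooled sentence row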

@@ -784,6 +866,29 @@ ConveRTTokenizer
     Creates tokens using the ConveRT tokenizer. Must be used whenever the ``ConveRTFeaturizer`` is used.


+.. _LanguageModelTokenizer:
+
+LanguageModelTokenizer
+~~~~~~~~~~~~~~~~~~~~~~
+
+:Short: Tokenizer from pre-trained language models
+:Outputs: nothing
+:Requires: :ref:`HFTransformersNLP`
+:Description:
+    Creates tokens using the pre-trained language model specified in the upstream :ref:`HFTransformersNLP` component.
+    Must be used whenever the ``LanguageModelFeaturizer`` is used.
+:Configuration:
+
+    Include the ``HFTransformersNLP`` component upstream.
+
+    .. code-block:: yaml
+
+        pipeline:
+          - name: "HFTransformersNLP"
+            model_name: # name of language model to use
+          - name: "LanguageModelTokenizer"
+
+

 Entity Extractors
 -----------------

rasa/nlu/constants.py

Lines changed: 13 additions & 6 deletions
@@ -17,11 +17,7 @@

 MESSAGE_ATTRIBUTES = [TEXT, INTENT, RESPONSE]

-TOKENS_NAMES = {
-    TEXT: "tokens",
-    INTENT: "intent_tokens",
-    RESPONSE: "response_tokens",
-}
+TOKENS_NAMES = {TEXT: "tokens", INTENT: "intent_tokens", RESPONSE: "response_tokens"}

 SPARSE_FEATURE_NAMES = {
     TEXT: "text_sparse_features",
@@ -35,7 +31,18 @@
     RESPONSE: "response_dense_features",
 }

-SPACY_DOCS = {TEXT: "spacy_doc", RESPONSE: "response_spacy_doc"}
+LANGUAGE_MODEL_DOCS = {
+    TEXT: "text_language_model_doc",
+    RESPONSE: "response_language_model_doc",
+}
+
+TOKEN_IDS = "token_ids"
+TOKENS = "tokens"
+SEQUENCE_FEATURES = "sequence_features"
+SENTENCE_FEATURES = "sentence_features"
+
+SPACY_DOCS = {TEXT: "text_spacy_doc", RESPONSE: "response_spacy_doc"}
+

 DENSE_FEATURIZABLE_ATTRIBUTES = [TEXT, RESPONSE]
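The new constants name the entries of the "language model doc" that ``HFTransformersNLP`` attaches to each message. A hypothetical example of such a doc, with values and shapes assumed for a two-token message and a 768-dimensional model (the real component stores ``Token`` objects, not plain strings):

    import numpy as np

    example_language_model_doc = {
        "token_ids": [101, 7592, 2088, 102],      # TOKEN_IDS: vocabulary ids, incl. special tokens
        "tokens": ["hello", "world"],             # TOKENS: the aligned tokens
        "sequence_features": np.zeros((2, 768)),  # SEQUENCE_FEATURES: one vector per token
        "sentence_features": np.zeros((1, 768)),  # SENTENCE_FEATURES: pooled sentence vector
    }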

rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py

Lines changed: 62 additions & 0 deletions

@@ -0,0 +1,62 @@
+import numpy as np
+from typing import Any, Optional, Text
+
+from rasa.nlu.config import RasaNLUModelConfig
+from rasa.nlu.featurizers.featurizer import Featurizer
+from rasa.nlu.training_data import Message, TrainingData
+
+from rasa.nlu.constants import (
+    TEXT,
+    LANGUAGE_MODEL_DOCS,
+    DENSE_FEATURE_NAMES,
+    DENSE_FEATURIZABLE_ATTRIBUTES,
+    TOKENS_NAMES,
+    SEQUENCE_FEATURES,
+    SENTENCE_FEATURES,
+)
+
+
+class LanguageModelFeaturizer(Featurizer):
+
+    provides = [
+        DENSE_FEATURE_NAMES[attribute] for attribute in DENSE_FEATURIZABLE_ATTRIBUTES
+    ]
+
+    requires = [
+        LANGUAGE_MODEL_DOCS[attribute] for attribute in DENSE_FEATURIZABLE_ATTRIBUTES
+    ] + [TOKENS_NAMES[attribute] for attribute in DENSE_FEATURIZABLE_ATTRIBUTES]
+
+    def train(
+        self,
+        training_data: TrainingData,
+        config: Optional[RasaNLUModelConfig] = None,
+        **kwargs: Any,
+    ) -> None:
+
+        for example in training_data.training_examples:
+            for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
+                self._set_lm_features(example, attribute)
+
+    def get_doc(self, message: Message, attribute: Text) -> Any:
+
+        return message.get(LANGUAGE_MODEL_DOCS[attribute])
+
+    def process(self, message: Message, **kwargs: Any) -> None:
+
+        self._set_lm_features(message)
+
+    def _set_lm_features(self, message: Message, attribute: Text = TEXT):
+        """Adds the precomputed word vectors to the messages features."""
+
+        doc = self.get_doc(message, attribute)
+
+        if doc is not None:
+            sequence_features = doc[SEQUENCE_FEATURES]
+            sentence_features = doc[SENTENCE_FEATURES]
+
+            features = np.concatenate([sequence_features, sentence_features])
+
+            features = self._combine_with_existing_dense_features(
+                message, features, DENSE_FEATURE_NAMES[attribute]
+            )
+            message.set(DENSE_FEATURE_NAMES[attribute], features)
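The ``np.concatenate`` call in ``_set_lm_features`` stacks the per-token vectors and the pooled sentence vector along the first axis, so the sentence vector ends up as one extra row. A quick sketch with an assumed 768 dimension:

    import numpy as np

    sequence_features = np.random.rand(4, 768)  # one row per token (dim assumed)
    sentence_features = np.random.rand(1, 768)  # single pooled row

    features = np.concatenate([sequence_features, sentence_features])
    print(features.shape)  # (5, 768): the sentence vector is appended as a final row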

rasa/nlu/registry.py

Lines changed: 6 additions & 0 deletions
@@ -29,6 +29,7 @@
 from rasa.nlu.featurizers.sparse_featurizer.count_vectors_featurizer import (
     CountVectorsFeaturizer,
 )
+from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
 from rasa.nlu.featurizers.sparse_featurizer.regex_featurizer import RegexFeaturizer
 from rasa.nlu.model import Metadata
 from rasa.nlu.selectors.response_selector import ResponseSelector
@@ -37,8 +38,10 @@
 from rasa.nlu.tokenizers.mitie_tokenizer import MitieTokenizer
 from rasa.nlu.tokenizers.spacy_tokenizer import SpacyTokenizer
 from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
+from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer
 from rasa.nlu.utils.mitie_utils import MitieNLP
 from rasa.nlu.utils.spacy_utils import SpacyNLP
+from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP
 from rasa.utils.common import class_from_module_path, raise_warning
 from rasa.utils.tensorflow.constants import (
     INTENT_CLASSIFICATION,
@@ -59,12 +62,14 @@
     # utils
     SpacyNLP,
     MitieNLP,
+    HFTransformersNLP,
     # tokenizers
     MitieTokenizer,
     SpacyTokenizer,
     WhitespaceTokenizer,
     ConveRTTokenizer,
     JiebaTokenizer,
+    LanguageModelTokenizer,
     # extractors
     SpacyEntityExtractor,
     MitieEntityExtractor,
@@ -78,6 +83,7 @@
     LexicalSyntacticFeaturizer,
     CountVectorsFeaturizer,
     ConveRTFeaturizer,
+    LanguageModelFeaturizer,
     # classifiers
     SklearnIntentClassifier,
     MitieIntentClassifier,
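Adding the three classes to ``component_classes`` is what lets a pipeline config refer to them by name: a component's default ``name`` is its class name, and the registry builds a name-to-class lookup from this list. A sketch of that mechanism (the lookup line mirrors what the registry is expected to do; treat it as illustrative):

    from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP
    from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer
    from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer

    component_classes = [HFTransformersNLP, LanguageModelTokenizer, LanguageModelFeaturizer]
    registered_components = {c.name: c for c in component_classes}

    # so `- name: "HFTransformersNLP"` in config.yml resolves to the class:
    assert registered_components["HFTransformersNLP"] is HFTransformersNLP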

rasa/nlu/tokenizers/convert_tokenizer.py

Lines changed: 2 additions & 37 deletions
@@ -4,6 +4,7 @@
 from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
 from rasa.nlu.training_data import Message
 from rasa.nlu.constants import MESSAGE_ATTRIBUTES, TOKENS_NAMES
+import rasa.utils.train_utils as train_utils
 import tensorflow as tf


@@ -69,10 +70,9 @@ def tokenize(self, message: Message, attribute: Text) -> List[Token]:
         # clean tokens (remove special chars and empty tokens)
         split_token_strings = self._clean_tokens(split_token_strings)

-        _aligned_tokens = self._align_tokens(
+        tokens_out += train_utils.align_tokens(
             split_token_strings, token_end, token_start
         )
-        tokens_out += _aligned_tokens

         return tokens_out

@@ -81,38 +81,3 @@ def _clean_tokens(self, tokens: List[bytes]):

         tokens = [string.decode("utf-8").replace("﹏", "") for string in tokens]
         return [string for string in tokens if string]
-
-    def _align_tokens(self, tokens_in: List[Text], token_end: int, token_start: int):
-        """Align sub-tokens of ConveRT with tokens return by the WhitespaceTokenizer.
-
-        As ConveRT might split a single word into multiple tokens, we need to make
-        sure that the start and end value of first and last sub-token matches the
-        start and end value of the token return by the WhitespaceTokenizer as the
-        entities are using those start and end values.
-        """
-
-        tokens_out = []
-
-        current_token_offset = token_start
-
-        for index, string in enumerate(tokens_in):
-            if index == 0:
-                if index == len(tokens_in) - 1:
-                    s_token_end = token_end
-                else:
-                    s_token_end = current_token_offset + len(string)
-                tokens_out.append(Token(string, token_start, end=s_token_end))
-            elif index == len(tokens_in) - 1:
-                tokens_out.append(Token(string, current_token_offset, end=token_end))
-            else:
-                tokens_out.append(
-                    Token(
-                        string,
-                        current_token_offset,
-                        end=current_token_offset + len(string),
-                    )
-                )
-
-            current_token_offset += len(string)
-
-        return tokens_out
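The deleted ``_align_tokens`` logic now lives in ``rasa.utils.train_utils.align_tokens`` (signature inferred from the new call site). A self-contained sketch of the same alignment, with a minimal stand-in ``Token``:

    from typing import List, Optional, Text

    class Token:
        # Minimal stand-in for rasa.nlu.tokenizers.tokenizer.Token.
        def __init__(self, text: Text, start: int, end: Optional[int] = None):
            self.text, self.start, self.end = text, start, end

    def align_tokens(tokens_in: List[Text], token_end: int, token_start: int) -> List[Token]:
        # Same logic as the removed _align_tokens method above.
        tokens_out = []
        current_token_offset = token_start
        for index, string in enumerate(tokens_in):
            if index == 0:
                # First sub-token keeps the original start; if it is also the
                # last one, it keeps the original end too.
                end = token_end if index == len(tokens_in) - 1 else current_token_offset + len(string)
                tokens_out.append(Token(string, token_start, end=end))
            elif index == len(tokens_in) - 1:
                # Last sub-token keeps the original end offset.
                tokens_out.append(Token(string, current_token_offset, end=token_end))
            else:
                tokens_out.append(
                    Token(string, current_token_offset, end=current_token_offset + len(string))
                )
            current_token_offset += len(string)
        return tokens_out

    # "handbag" at character offsets 4..11, split into two sub-tokens:
    for t in align_tokens(["hand", "bag"], token_end=11, token_start=4):
        print(t.text, t.start, t.end)  # hand 4 8 / bag 8 11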
rasa/nlu/tokenizers/lm_tokenizer.py

Lines changed: 36 additions & 0 deletions

@@ -0,0 +1,36 @@
+from typing import Text, List, Any, Dict
+
+from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
+from rasa.nlu.training_data import Message
+
+from rasa.nlu.constants import (
+    TOKENS_NAMES,
+    LANGUAGE_MODEL_DOCS,
+    DENSE_FEATURIZABLE_ATTRIBUTES,
+    MESSAGE_ATTRIBUTES,
+    TOKENS,
+)
+
+
+class LanguageModelTokenizer(Tokenizer):
+
+    provides = [TOKENS_NAMES[attribute] for attribute in MESSAGE_ATTRIBUTES]
+
+    requires = [
+        LANGUAGE_MODEL_DOCS[attribute] for attribute in DENSE_FEATURIZABLE_ATTRIBUTES
+    ]
+
+    defaults = {
+        # Flag to check whether to split intents
+        "intent_tokenization_flag": False,
+        # Symbol on which intent should be split
+        "intent_split_symbol": "_",
+    }
+
+    def get_doc(self, message: Message, attribute: Text) -> Dict[Text, Any]:
+        return message.get(LANGUAGE_MODEL_DOCS[attribute])
+
+    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
+        doc = self.get_doc(message, attribute)
+
+        return doc[TOKENS]
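A hypothetical flow showing where ``tokenize`` gets its tokens: ``HFTransformersNLP`` stores the doc first, and this tokenizer simply returns its precomputed ``TOKENS`` entry (the message text and doc contents below are placeholders; real docs hold ``Token`` objects, not strings):

    from rasa.nlu.tokenizers.lm_tokenizer import LanguageModelTokenizer
    from rasa.nlu.training_data import Message

    message = Message("hello world")
    # Stand-in for what HFTransformersNLP would set; the key is
    # LANGUAGE_MODEL_DOCS[TEXT] from rasa.nlu.constants.
    message.set("text_language_model_doc", {"tokens": ["hello", "world"]})

    tokenizer = LanguageModelTokenizer()
    print(tokenizer.tokenize(message, attribute="text"))  # ['hello', 'world']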

rasa/nlu/utils/hugging_face/__init__.py

Whitespace-only changes.
