@@ -82,6 +82,54 @@ SpacyNLP
# between these two words, therefore setting this to `true`.
case_sensitive: false

+
+ .. _HFTransformersNLP:
+
+ HFTransformersNLP
+ ~~~~~~~~~~~~~~~~~
+
+ :Short: HuggingFace's Transformers based pre-trained language model initializer
+ :Outputs: nothing
+ :Requires: nothing
+ :Description:
+     Initializes the specified pre-trained language model from HuggingFace's `Transformers library
+     <https://huggingface.co/transformers/>`__. The component applies language-model-specific tokenization and featurization
+     to compute sequence- and sentence-level representations for each example in the training data.
+     Include :ref:`LanguageModelTokenizer` and :ref:`LanguageModelFeaturizer` to utilize the output of this
+     component for downstream NLU models.
+ :Configuration:
+
+     .. code-block:: yaml
+
+         pipeline:
+         - name: HFTransformersNLP
+
+           # Name of the language model to use
+           model_name: "bert"
+
+           # Shortcut name to specify the architecture variation of the above model. A full list of supported architectures
+           # can be found at https://huggingface.co/transformers/pretrained_models.html. If left empty, it uses the
+           # default model weights that the original Transformers library loads for the chosen architecture.
+           model_weights: "bert-base-uncased"
+
+           # +----------------+--------------+-------------------------+
+           # | Language Model | Parameter    | Default value for       |
+           # |                | "model_name" | "model_weights"         |
+           # +----------------+--------------+-------------------------+
+           # | BERT           | bert         | bert-base-uncased       |
+           # +----------------+--------------+-------------------------+
+           # | GPT            | gpt          | openai-gpt              |
+           # +----------------+--------------+-------------------------+
+           # | GPT-2          | gpt2         | gpt2                    |
+           # +----------------+--------------+-------------------------+
+           # | XLNet          | xlnet        | xlnet-base-cased        |
+           # +----------------+--------------+-------------------------+
+           # | DistilBERT     | distilbert   | distilbert-base-uncased |
+           # +----------------+--------------+-------------------------+
+           # | RoBERTa        | roberta      | roberta-base            |
+           # +----------------+--------------+-------------------------+
+
+
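+     To use one of the other models from the table above, set ``model_name`` and ``model_weights`` together.
+     The following is a minimal sketch, assuming the GPT-2 row of the table; any other shortcut name listed at
+     https://huggingface.co/transformers/pretrained_models.html should work the same way:
+
+     .. code-block:: yaml
+
+         pipeline:
+         - name: HFTransformersNLP
+           model_name: "gpt2"
+           model_weights: "gpt2"
+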
Text Featurizers
----------------

@@ -182,6 +230,40 @@ ConveRTFeaturizer
- name: "ConveRTFeaturizer"


+ .. _LanguageModelFeaturizer:
+
+ LanguageModelFeaturizer
+ ~~~~~~~~~~~~~~~~~~~~~~~
+
+ :Short:
+     Creates a vector representation of the user message and response (if specified) using a pre-trained language model.
+ :Outputs:
+     nothing, used as an input to intent classifiers and response selectors that need intent features and response
+     features respectively (e.g. ``DIETClassifier`` and ``ResponseSelector``)
+ :Requires: :ref:`HFTransformersNLP`
+ :Type: Dense featurizer
+ :Description:
+     Creates features for intent classification and response selection.
+     Uses the pre-trained language model specified in the upstream :ref:`HFTransformersNLP` component to compute vector
+     representations of the input text.
+
+     .. warning::
+         Please make sure that you use a language model which is pre-trained on the same language corpus as that of your
+         training data.
+
+ :Configuration:
+
+     Include the ``HFTransformersNLP`` component before this component. Also, use :ref:`LanguageModelTokenizer` to ensure tokens
+     are correctly set for all components throughout the pipeline.
+
+     .. code-block:: yaml
+
+         pipeline:
+         - name: "HFTransformersNLP"
+           model_name: # Name of the language model to use
+         - name: "LanguageModelFeaturizer"
+
+
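+     As a sketch of how the pieces line up (a real pipeline will typically contain additional components), the
+     features produced here can be consumed by, for example, ``DIETClassifier``:
+
+     .. code-block:: yaml
+
+         pipeline:
+         - name: "HFTransformersNLP"
+           model_name: "bert"
+         - name: "LanguageModelTokenizer"
+         - name: "LanguageModelFeaturizer"
+         - name: "DIETClassifier"
+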

RegexFeaturizer
~~~~~~~~~~~~~~~
@@ -784,6 +866,29 @@ ConveRTTokenizer
Creates tokens using the ConveRT tokenizer. Must be used whenever the ``ConveRTFeaturizer`` is used.


+ .. _LanguageModelTokenizer:
+
+ LanguageModelTokenizer
+ ~~~~~~~~~~~~~~~~~~~~~~
+
+ :Short: Tokenizer from pre-trained language models
+ :Outputs: nothing
+ :Requires: :ref:`HFTransformersNLP`
+ :Description:
+     Creates tokens using the pre-trained language model specified in the upstream :ref:`HFTransformersNLP` component.
+     Must be used whenever the ``LanguageModelFeaturizer`` is used.
+ :Configuration:
+
+     Include the ``HFTransformersNLP`` component upstream.
+
+     .. code-block:: yaml
+
+         pipeline:
+         - name: "HFTransformersNLP"
+           model_name: # Name of the language model to use
+         - name: "LanguageModelTokenizer"
+
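+     Because the tokenizer and featurizer only consume whatever the upstream ``HFTransformersNLP`` component
+     produces, switching to a lighter model from the table above (here DistilBERT, as an illustration) only
+     requires changing that component's parameters; the tokenizer and featurizer entries stay unchanged:
+
+     .. code-block:: yaml
+
+         pipeline:
+         - name: "HFTransformersNLP"
+           model_name: "distilbert"
+           model_weights: "distilbert-base-uncased"
+         - name: "LanguageModelTokenizer"
+         - name: "LanguageModelFeaturizer"
+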
+

Entity Extractors
-----------------