
Commit aae0710

Merge pull request RasaHQ#4800 from RasaHQ/convert_feat
Add new featurizer based on ConveRT model
2 parents 38eb552 + b7af512 commit aae0710

12 files changed: 240 additions, 14 deletions

CHANGELOG.rst

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,8 @@ Added
   tracker store
 - Add command line argument ``rasa x --config CONFIG``, to specify path to the policy and
   NLU pipeline configuration of your bot (default: ``config.yml``)
+- Added a new NLU featurizer - ``ConveRTFeaturizer``, based on the `ConveRT <https://github.com/PolyAI-LDN/polyai-models>`_ model released by PolyAI.
+- Added a new preconfigured pipeline - ``pretrained_embeddings_convert``.
 
 Changed
 -------

alt_requirements/requirements_full.txt

Lines changed: 3 additions & 0 deletions
@@ -7,4 +7,7 @@
 # MITIE Requirements
 -r requirements_pretrained_embeddings_mitie.txt
 
+# ConveRT Requirements
+-r requirements_pretrained_embeddings_convert.txt
+
 jieba==0.39
alt_requirements/requirements_pretrained_embeddings_convert.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Minimum Install Requirements
+-r ../requirements.txt
+
+tensorflow_text==1.15.1
+tensorflow_hub==0.6.0

docs/nlu/choosing-a-pipeline.rst

Lines changed: 29 additions & 11 deletions
@@ -19,14 +19,12 @@ it on your dataset.
 The Short Answer
 ----------------
 
-If you have less than 1000 total training examples, and there is a spaCy model for your
-language, use the ``pretrained_embeddings_spacy`` pipeline:
+If your training data is in English, a good starting point is the ``pretrained_embeddings_convert`` pipeline.
 
-.. literalinclude:: ../../sample_configs/config_pretrained_embeddings_spacy.yml
+.. literalinclude:: ../../sample_configs/config_pretrained_embeddings_convert.yml
    :language: yaml
 
-
-If you have 1000 or more labelled utterances,
+If your training data is multilingual and rich in domain-specific vocabulary,
 use the ``supervised_embeddings`` pipeline:
 
 .. literalinclude:: ../../sample_configs/config_supervised_embeddings.yml
@@ -36,19 +34,39 @@ use the ``supervised_embeddings`` pipeline:
 A Longer Answer
 ---------------
 
-The two most important pipelines are ``supervised_embeddings`` and ``pretrained_embeddings_spacy``.
-The biggest difference between them is that the ``pretrained_embeddings_spacy`` pipeline uses pre-trained
-word vectors from either GloVe or fastText. The ``supervised_embeddings`` pipeline, on the other hand,
-doesn't use any pre-trained word vectors, but instead fits these specifically for your dataset.
+The three most important pipelines are ``supervised_embeddings``, ``pretrained_embeddings_convert`` and ``pretrained_embeddings_spacy``.
+The ``pretrained_embeddings_spacy`` pipeline uses pre-trained word vectors from either GloVe or fastText,
+whereas ``pretrained_embeddings_convert`` uses the pretrained sentence encoding model `ConveRT <https://github.com/PolyAI-LDN/polyai-models>`_ to
+extract a vector representation of the complete user utterance. The ``supervised_embeddings`` pipeline, on the other hand,
+doesn't use any pre-trained word or sentence vectors, but instead fits them specifically to your dataset.
 
+.. note::
+    These recommendations depend heavily on your dataset and are therefore only approximate. We suggest experimenting with different pipelines to train the best model.
 
 pretrained_embeddings_spacy
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The advantage of the ``pretrained_embeddings_spacy`` pipeline is that if you have a training example like:
 "I want to buy apples", and Rasa is asked to predict the intent for "get pears", your model
 already knows that the words "apples" and "pears" are very similar. This is especially useful
-if you don't have very much training data.
+if you don't have much training data.
+
+
+pretrained_embeddings_convert
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+    Since the ``ConveRT`` model is trained only on an **English** corpus of conversations, this pipeline should only be used if your training data is in English.
+
+This pipeline uses the `ConveRT <https://github.com/PolyAI-LDN/polyai-models>`_ model to extract a vector representation of a sentence and feeds it to the ``EmbeddingIntentClassifier`` for intent classification.
+The advantage of the ``pretrained_embeddings_convert`` pipeline is that it doesn't treat each word of the user message independently,
+but creates a contextual vector representation for the complete sentence. For example, if you have a training example like:
+"can I book a car?", and Rasa is asked to predict the intent for "I need a ride from my place", the contextual vector representations of the two
+utterances are already very similar, so the intent classified for both is highly likely to be the same. This is also useful if you don't have
+much training data.
+
+.. note::
+    To use the ``pretrained_embeddings_convert`` pipeline, you should install ``tensorflow_text==1.15.1`` and ``tensorflow_hub==0.6.0``. Alternatively, you can install Rasa with ``pip install rasa[convert]``.
 
 supervised_embeddings
 ~~~~~~~~~~~~~~~~~~~~~
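
To make the comparison in the new docs concrete, here is a minimal sketch (not part of the commit) that encodes the two example utterances with the same TF-Hub module the new featurizer loads and compares them with cosine similarity. It assumes tensorflow==1.15, tensorflow_text==1.15.1, and tensorflow_hub==0.6.0 are installed:

import numpy as np
import tensorflow as tf
import tensorflow_hub as tfhub
import tensorflow_text  # noqa: F401, registers the custom ops the module needs

# Load ConveRT from the same URL the new featurizer uses.
graph = tf.Graph()
with graph.as_default():
    session = tf.Session()
    module = tfhub.Module("http://models.poly-ai.com/convert/v1/model.tar.gz")
    text = tf.placeholder(dtype=tf.string, shape=[None])
    encodings = module(text)
    session.run(tf.tables_initializer())
    session.run(tf.global_variables_initializer())

vectors = session.run(
    encodings,
    feed_dict={text: ["can I book a car?", "I need a ride from my place"]},
)

# Cosine similarity between the two sentence encodings; a value close to 1
# is what lets the downstream intent classifier treat the utterances alike.
similarity = np.dot(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])
)
print(similarity)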

docs/nlu/components.rst

Lines changed: 24 additions & 2 deletions
@@ -272,10 +272,32 @@ CountVectorsFeaturizer
         OOV_token: None  # string or None
         OOV_words: []  # list of strings
 
-Intent Classifiers
-------------------
 
+ConveRTFeaturizer
+~~~~~~~~~~~~~~~~~
+
+:Short: Creates a vector representation of the user message and response (if specified) using the `ConveRT <https://github.com/PolyAI-LDN/polyai-models>`_ model.
+:Outputs: nothing, used as input to intent classifiers and response selectors that need intent features and response features respectively (e.g. ``EmbeddingIntentClassifier`` and ``ResponseSelector``)
+:Requires: nothing
+:Description:
+    Creates features for intent classification and response selection.
+    Uses the `default signature <https://github.com/PolyAI-LDN/polyai-models#tfhub-signatures>`_ to compute vector representations of the input text.
+
+    .. warning::
+        Since the ``ConveRT`` model is trained only on an English corpus of conversations, this featurizer should only be used if your training data is in English.
+
+    .. note::
+        To use ``ConveRTFeaturizer`` you should install ``tensorflow_text==1.15.1`` and ``tensorflow_hub==0.6.0``. Alternatively, you can install Rasa with ``pip install rasa[convert]``.
+
+:Configuration:
+
+    .. code-block:: yaml
+
+        pipeline:
+        - name: "ConveRTFeaturizer"
 
+Intent Classifiers
+------------------
 
 
 MitieIntentClassifier
rasa/nlu/featurizers/convert_featurizer.py

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
import logging
from typing import Any, Dict, List, Optional, Text

import numpy as np
import tensorflow as tf

from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.constants import (
    MESSAGE_TEXT_ATTRIBUTE,
    MESSAGE_VECTOR_FEATURE_NAMES,
    SPACY_FEATURIZABLE_ATTRIBUTES,
)
from rasa.nlu.featurizers import Featurizer
from rasa.nlu.training_data import Message, TrainingData

logger = logging.getLogger(__name__)


class ConveRTFeaturizer(Featurizer):
    """Featurizer using the ConveRT model to encode complete user utterances."""

    provides = [
        MESSAGE_VECTOR_FEATURE_NAMES[attribute]
        for attribute in SPACY_FEATURIZABLE_ATTRIBUTES
    ]

    def _load_model(self) -> None:

        # tensorflow_text registers the custom ops the ConveRT graph needs
        import tensorflow_text  # noqa: F401
        import tensorflow_hub as tfhub

        self.graph = tf.Graph()
        model_url = "http://models.poly-ai.com/convert/v1/model.tar.gz"

        with self.graph.as_default():
            self.session = tf.Session()
            self.module = tfhub.Module(model_url)

            self.text_placeholder = tf.placeholder(dtype=tf.string, shape=[None])
            self.encoding_tensor = self.module(self.text_placeholder)
            self.session.run(tf.tables_initializer())
            self.session.run(tf.global_variables_initializer())

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:

        super(ConveRTFeaturizer, self).__init__(component_config)

        self._load_model()

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["tensorflow_text", "tensorflow_hub"]

    def _compute_features(
        self, batch_examples: List[Message], attribute: Text = MESSAGE_TEXT_ATTRIBUTE
    ) -> np.ndarray:

        # Get text for attribute of each example
        batch_attribute_text = [ex.get(attribute) for ex in batch_examples]

        batch_features = self._run_model_on_text(batch_attribute_text)

        return batch_features

    def _run_model_on_text(self, batch: List[Text]) -> np.ndarray:

        return self.session.run(
            self.encoding_tensor, feed_dict={self.text_placeholder: batch}
        )

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig],
        **kwargs: Any,
    ) -> None:

        # Featurize each attribute in fixed-size batches to bound memory usage.
        batch_size = 64

        for attribute in SPACY_FEATURIZABLE_ATTRIBUTES:

            non_empty_examples = list(
                filter(lambda x: x.get(attribute), training_data.training_examples)
            )

            batch_start_index = 0

            while batch_start_index < len(non_empty_examples):

                batch_end_index = min(
                    batch_start_index + batch_size, len(non_empty_examples)
                )

                # Collect batch examples
                batch_examples = non_empty_examples[batch_start_index:batch_end_index]

                batch_features = self._compute_features(batch_examples, attribute)

                for index, ex in enumerate(batch_examples):

                    ex.set(
                        MESSAGE_VECTOR_FEATURE_NAMES[attribute],
                        self._combine_with_existing_features(
                            ex,
                            batch_features[index],
                            MESSAGE_VECTOR_FEATURE_NAMES[attribute],
                        ),
                    )

                batch_start_index += batch_size

    def process(self, message: Message, **kwargs: Any) -> None:

        # At prediction time, featurize the text of the incoming message only.
        feats = self._compute_features([message])[0]
        message.set(
            MESSAGE_VECTOR_FEATURE_NAMES[MESSAGE_TEXT_ATTRIBUTE],
            self._combine_with_existing_features(
                message, feats, MESSAGE_VECTOR_FEATURE_NAMES[MESSAGE_TEXT_ATTRIBUTE]
            ),
        )
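
For orientation, a minimal stand-alone sketch (not part of the commit) of how this component could be exercised directly; it assumes the Rasa 1.x ``Message`` API used in the imports above:

from rasa.nlu.constants import MESSAGE_TEXT_ATTRIBUTE, MESSAGE_VECTOR_FEATURE_NAMES
from rasa.nlu.featurizers.convert_featurizer import ConveRTFeaturizer
from rasa.nlu.training_data import Message

# Instantiating the component downloads and loads the ConveRT TF-Hub module.
featurizer = ConveRTFeaturizer()
message = Message("can I book a car?")
featurizer.process(message)

# The dense sentence encoding is now stored under the text feature name.
print(message.get(MESSAGE_VECTOR_FEATURE_NAMES[MESSAGE_TEXT_ATTRIBUTE]).shape)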

rasa/nlu/registry.py

Lines changed: 7 additions & 0 deletions
@@ -24,6 +24,7 @@
 from rasa.nlu.featurizers.ngram_featurizer import NGramFeaturizer
 from rasa.nlu.featurizers.regex_featurizer import RegexFeaturizer
 from rasa.nlu.featurizers.spacy_featurizer import SpacyFeaturizer
+from rasa.nlu.featurizers.convert_featurizer import ConveRTFeaturizer
 from rasa.nlu.model import Metadata
 from rasa.nlu.tokenizers.jieba_tokenizer import JiebaTokenizer
 from rasa.nlu.tokenizers.mitie_tokenizer import MitieTokenizer
@@ -64,6 +65,7 @@
     NGramFeaturizer,
     RegexFeaturizer,
     CountVectorsFeaturizer,
+    ConveRTFeaturizer,
     # classifiers
     SklearnIntentClassifier,
     MitieIntentClassifier,
@@ -128,6 +130,11 @@
         },
         {"name": "EmbeddingIntentClassifier"},
     ],
+    "pretrained_embeddings_convert": [
+        {"name": "WhitespaceTokenizer"},
+        {"name": "ConveRTFeaturizer"},
+        {"name": "EmbeddingIntentClassifier"},
+    ],
 }
 
 
requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,6 @@ jsonschema==3.0.2
 packaging==19.0
 gevent==1.4.0
 pytz==2019.1
-python-dateutil==2.8.0
 rasa-sdk~=1.4.0
 colorclass==2.2.0
 terminaltables==3.1.0
@@ -57,3 +56,4 @@ PyJWT==1.7.1
 # remove when tensorflow@2.0 or a pre-release patch is released
 # https://github.com/tensorflow/tensorflow/issues/32319
 gast==0.2.2
+python-dateutil==2.8.0
sample_configs/config_pretrained_embeddings_convert.yml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+language: "en"
+
+pipeline: "pretrained_embeddings_convert"
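
As a rough end-to-end illustration (not part of the commit), the new template can be used through the Rasa 1.x Python API; the sketch below assumes this sample config is saved as ``config.yml`` and that NLU training data exists at ``data/nlu.md``:

from rasa.nlu import config
from rasa.nlu.model import Trainer
from rasa.nlu.training_data import load_data

# Build the pretrained_embeddings_convert pipeline from the sample config
# and train it on local NLU data.
trainer = Trainer(config.load("config.yml"))
interpreter = trainer.train(load_data("data/nlu.md"))

# ConveRT encodes the full utterance, so a paraphrase of a training
# example should map to the same intent.
print(interpreter.parse("I need a ride from my place")["intent"])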

setup.py

Lines changed: 1 addition & 0 deletions
@@ -92,6 +92,7 @@
 extras_requires = {
     "test": tests_requires,
     "spacy": ["spacy>=2.1,<2.2"],
+    "convert": ["tensorflow_text~=1.15.1", "tensorflow_hub~=0.6.0"],
     "mitie": ["mitie"],
     "sql": ["psycopg2~=2.8.2", "SQLAlchemy~=1.3"],
 }
