
Commit aae0710

Merge pull request RasaHQ#4800 from RasaHQ/convert_feat
Add new featurizer based on ConveRT model
2 parents 38eb552 + b7af512 commit aae0710

12 files changed: 240 additions, 14 deletions

CHANGELOG.rst

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,8 @@ Added
   tracker store
 - Add command line argument ``rasa x --config CONFIG``, to specify path to the policy and
   NLU pipeline configuration of your bot (default: ``config.yml``)
+- Added a new NLU featurizer - ``ConveRTFeaturizer``, based on the `ConveRT <https://github.com/PolyAI-LDN/polyai-models>`_ model released by PolyAI.
+- Added a new preconfigured pipeline - ``pretrained_embeddings_convert``.
 
 Changed
 -------

alt_requirements/requirements_full.txt

Lines changed: 3 additions & 0 deletions
@@ -7,4 +7,7 @@
 # MITIE Requirements
 -r requirements_pretrained_embeddings_mitie.txt
 
+# ConveRT Requirements
+-r requirements_pretrained_embeddings_convert.txt
+
 jieba==0.39
alt_requirements/requirements_pretrained_embeddings_convert.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Minimum Install Requirements
+-r ../requirements.txt
+
+tensorflow_text==1.15.1
+tensorflow_hub==0.6.0

docs/nlu/choosing-a-pipeline.rst

Lines changed: 29 additions & 11 deletions
@@ -19,14 +19,12 @@ it on your dataset.
 The Short Answer
 ----------------
 
-If you have less than 1000 total training examples, and there is a spaCy model for your
-language, use the ``pretrained_embeddings_spacy`` pipeline:
+If your training data is in English, a good starting point is the ``pretrained_embeddings_convert`` pipeline.
 
-.. literalinclude:: ../../sample_configs/config_pretrained_embeddings_spacy.yml
+.. literalinclude:: ../../sample_configs/config_pretrained_embeddings_convert.yml
    :language: yaml
 
-
-If you have 1000 or more labelled utterances,
+If your training data is multilingual and rich in domain-specific vocabulary,
 use the ``supervised_embeddings`` pipeline:
 
 .. literalinclude:: ../../sample_configs/config_supervised_embeddings.yml
@@ -36,19 +34,39 @@ use the ``supervised_embeddings`` pipeline:
 A Longer Answer
 ---------------
 
-The two most important pipelines are ``supervised_embeddings`` and ``pretrained_embeddings_spacy``.
-The biggest difference between them is that the ``pretrained_embeddings_spacy`` pipeline uses pre-trained
-word vectors from either GloVe or fastText. The ``supervised_embeddings`` pipeline, on the other hand,
-doesn't use any pre-trained word vectors, but instead fits these specifically for your dataset.
+The three most important pipelines are ``supervised_embeddings``, ``pretrained_embeddings_convert`` and ``pretrained_embeddings_spacy``.
+The ``pretrained_embeddings_spacy`` pipeline uses pre-trained word vectors from either GloVe or fastText,
+whereas ``pretrained_embeddings_convert`` uses the pretrained sentence encoding model `ConveRT <https://github.com/PolyAI-LDN/polyai-models>`_ to
+extract a vector representation of the complete user utterance. The ``supervised_embeddings`` pipeline, on the other hand,
+doesn't use any pre-trained word or sentence vectors, but instead fits them specifically to your dataset.
 
+.. note::
+    These recommendations depend heavily on your dataset and are therefore only approximate. We suggest experimenting with different pipelines to train the best model.
 
 pretrained_embeddings_spacy
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The advantage of the ``pretrained_embeddings_spacy`` pipeline is that if you have a training example like:
 "I want to buy apples", and Rasa is asked to predict the intent for "get pears", your model
 already knows that the words "apples" and "pears" are very similar. This is especially useful
-if you don't have very much training data.
+if you don't have much training data.
+
+
+pretrained_embeddings_convert
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+    Since the ``ConveRT`` model is trained only on an **English** corpus of conversations, this pipeline should only be used if your training data is in English.
+
+This pipeline uses the `ConveRT <https://github.com/PolyAI-LDN/polyai-models>`_ model to extract a vector representation of a sentence and feeds it to the ``EmbeddingIntentClassifier`` for intent classification.
+The advantage of the ``pretrained_embeddings_convert`` pipeline is that it doesn't treat each word of the user message independently,
+but creates a contextual vector representation for the complete sentence. For example, if you have a training example like:
+"can I book a car?", and Rasa is asked to predict the intent for "I need a ride from my place", the contextual vector representations of the two
+utterances are already very similar, so the intent classified for both is highly likely to be the same. This is also useful if you don't have
+much training data.
+
+.. note::
+    To use the ``pretrained_embeddings_convert`` pipeline, you should install ``tensorflow_text==1.15.1`` and ``tensorflow_hub==0.6.0``. Alternatively, you can install Rasa with ``pip install rasa[convert]``.
 
 supervised_embeddings
 ~~~~~~~~~~~~~~~~~~~~~
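
To make the comparison in the new docs concrete, here is a minimal sketch (not part of the commit) that encodes the two example utterances with the same TF-Hub module the new featurizer loads and compares them with cosine similarity. It assumes tensorflow==1.15, tensorflow_text==1.15.1, and tensorflow_hub==0.6.0 are installed:

import numpy as np
import tensorflow as tf
import tensorflow_hub as tfhub
import tensorflow_text  # noqa: F401, registers the custom ops the module needs

# Load ConveRT from the same URL the new featurizer uses.
graph = tf.Graph()
with graph.as_default():
    session = tf.Session()
    module = tfhub.Module("http://models.poly-ai.com/convert/v1/model.tar.gz")
    text = tf.placeholder(dtype=tf.string, shape=[None])
    encodings = module(text)
    session.run(tf.tables_initializer())
    session.run(tf.global_variables_initializer())

vectors = session.run(
    encodings,
    feed_dict={text: ["can I book a car?", "I need a ride from my place"]},
)

# Cosine similarity between the two sentence encodings; a value close to 1
# is what lets the downstream intent classifier treat the utterances alike.
similarity = np.dot(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])
)
print(similarity)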

docs/nlu/components.rst

Lines changed: 24 additions & 2 deletions
@@ -272,10 +272,32 @@ CountVectorsFeaturizer
         OOV_token: None  # string or None
         OOV_words: []  # list of strings
 
-Intent Classifiers
-------------------
 
+ConveRTFeaturizer
+~~~~~~~~~~~~~~~~~
+
+:Short: Creates a vector representation of the user message and response (if specified) using the `ConveRT <https://github.com/PolyAI-LDN/polyai-models>`_ model.
+:Outputs: nothing, used as input to intent classifiers and response selectors that need intent features and response features respectively (e.g. ``EmbeddingIntentClassifier`` and ``ResponseSelector``)
+:Requires: nothing
+:Description:
+    Creates features for intent classification and response selection.
+    Uses the `default signature <https://github.com/PolyAI-LDN/polyai-models#tfhub-signatures>`_ to compute vector representations of the input text.
+
+    .. warning::
+        Since the ``ConveRT`` model is trained only on an English corpus of conversations, this featurizer should only be used if your training data is in English.
+
+    .. note::
+        To use ``ConveRTFeaturizer`` you should install ``tensorflow_text==1.15.1`` and ``tensorflow_hub==0.6.0``. Alternatively, you can install Rasa with ``pip install rasa[convert]``.
+
+:Configuration:
+
+    .. code-block:: yaml
+
+        pipeline:
+        - name: "ConveRTFeaturizer"
 
+Intent Classifiers
+------------------
 
 
 MitieIntentClassifier
rasa/nlu/featurizers/convert_featurizer.py

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
import logging
from typing import Any, Dict, List, Optional, Text

import numpy as np
import tensorflow as tf

from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.constants import (
    MESSAGE_TEXT_ATTRIBUTE,
    MESSAGE_VECTOR_FEATURE_NAMES,
    SPACY_FEATURIZABLE_ATTRIBUTES,
)
from rasa.nlu.featurizers import Featurizer
from rasa.nlu.training_data import Message, TrainingData

logger = logging.getLogger(__name__)


class ConveRTFeaturizer(Featurizer):
    """Featurizer using the ConveRT model to encode complete user utterances."""

    provides = [
        MESSAGE_VECTOR_FEATURE_NAMES[attribute]
        for attribute in SPACY_FEATURIZABLE_ATTRIBUTES
    ]

    def _load_model(self) -> None:

        # tensorflow_text registers the custom ops the ConveRT graph needs
        import tensorflow_text  # noqa: F401
        import tensorflow_hub as tfhub

        self.graph = tf.Graph()
        model_url = "http://models.poly-ai.com/convert/v1/model.tar.gz"

        with self.graph.as_default():
            self.session = tf.Session()
            self.module = tfhub.Module(model_url)

            self.text_placeholder = tf.placeholder(dtype=tf.string, shape=[None])
            self.encoding_tensor = self.module(self.text_placeholder)
            self.session.run(tf.tables_initializer())
            self.session.run(tf.global_variables_initializer())

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:

        super(ConveRTFeaturizer, self).__init__(component_config)

        self._load_model()

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["tensorflow_text", "tensorflow_hub"]

    def _compute_features(
        self, batch_examples: List[Message], attribute: Text = MESSAGE_TEXT_ATTRIBUTE
    ) -> np.ndarray:

        # Get text for attribute of each example
        batch_attribute_text = [ex.get(attribute) for ex in batch_examples]

        batch_features = self._run_model_on_text(batch_attribute_text)

        return batch_features

    def _run_model_on_text(self, batch: List[Text]) -> np.ndarray:

        return self.session.run(
            self.encoding_tensor, feed_dict={self.text_placeholder: batch}
        )

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig],
        **kwargs: Any,
    ) -> None:

        # Featurize each attribute in fixed-size batches to bound memory usage.
        batch_size = 64

        for attribute in SPACY_FEATURIZABLE_ATTRIBUTES:

            non_empty_examples = list(
                filter(lambda x: x.get(attribute), training_data.training_examples)
            )

            batch_start_index = 0

            while batch_start_index < len(non_empty_examples):

                batch_end_index = min(
                    batch_start_index + batch_size, len(non_empty_examples)
                )

                # Collect batch examples
                batch_examples = non_empty_examples[batch_start_index:batch_end_index]

                batch_features = self._compute_features(batch_examples, attribute)

                for index, ex in enumerate(batch_examples):

                    ex.set(
                        MESSAGE_VECTOR_FEATURE_NAMES[attribute],
                        self._combine_with_existing_features(
                            ex,
                            batch_features[index],
                            MESSAGE_VECTOR_FEATURE_NAMES[attribute],
                        ),
                    )

                batch_start_index += batch_size

    def process(self, message: Message, **kwargs: Any) -> None:

        # At prediction time, featurize the text of the incoming message only.
        feats = self._compute_features([message])[0]
        message.set(
            MESSAGE_VECTOR_FEATURE_NAMES[MESSAGE_TEXT_ATTRIBUTE],
            self._combine_with_existing_features(
                message, feats, MESSAGE_VECTOR_FEATURE_NAMES[MESSAGE_TEXT_ATTRIBUTE]
            ),
        )
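
For orientation, a minimal stand-alone sketch (not part of the commit) of how this component could be exercised directly; it assumes the Rasa 1.x ``Message`` API used in the imports above:

from rasa.nlu.constants import MESSAGE_TEXT_ATTRIBUTE, MESSAGE_VECTOR_FEATURE_NAMES
from rasa.nlu.featurizers.convert_featurizer import ConveRTFeaturizer
from rasa.nlu.training_data import Message

# Instantiating the component downloads and loads the ConveRT TF-Hub module.
featurizer = ConveRTFeaturizer()
message = Message("can I book a car?")
featurizer.process(message)

# The dense sentence encoding is now stored under the text feature name.
print(message.get(MESSAGE_VECTOR_FEATURE_NAMES[MESSAGE_TEXT_ATTRIBUTE]).shape)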

rasa/nlu/registry.py

Lines changed: 7 additions & 0 deletions
@@ -24,6 +24,7 @@
 from rasa.nlu.featurizers.ngram_featurizer import NGramFeaturizer
 from rasa.nlu.featurizers.regex_featurizer import RegexFeaturizer
 from rasa.nlu.featurizers.spacy_featurizer import SpacyFeaturizer
+from rasa.nlu.featurizers.convert_featurizer import ConveRTFeaturizer
 from rasa.nlu.model import Metadata
 from rasa.nlu.tokenizers.jieba_tokenizer import JiebaTokenizer
 from rasa.nlu.tokenizers.mitie_tokenizer import MitieTokenizer
@@ -64,6 +65,7 @@
     NGramFeaturizer,
     RegexFeaturizer,
     CountVectorsFeaturizer,
+    ConveRTFeaturizer,
     # classifiers
     SklearnIntentClassifier,
     MitieIntentClassifier,
@@ -128,6 +130,11 @@
         },
         {"name": "EmbeddingIntentClassifier"},
     ],
+    "pretrained_embeddings_convert": [
+        {"name": "WhitespaceTokenizer"},
+        {"name": "ConveRTFeaturizer"},
+        {"name": "EmbeddingIntentClassifier"},
+    ],
 }
 
 
requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,6 @@ jsonschema==3.0.2
 packaging==19.0
 gevent==1.4.0
 pytz==2019.1
-python-dateutil==2.8.0
 rasa-sdk~=1.4.0
 colorclass==2.2.0
 terminaltables==3.1.0
@@ -57,3 +56,4 @@ PyJWT==1.7.1
 # remove when tensorflow@2.0 or a pre-release patch is released
 # https://github.com/tensorflow/tensorflow/issues/32319
 gast==0.2.2
+python-dateutil==2.8.0
sample_configs/config_pretrained_embeddings_convert.yml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+language: "en"
+
+pipeline: "pretrained_embeddings_convert"
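
As a rough end-to-end illustration (not part of the commit), the new template can be used through the Rasa 1.x Python API; the sketch below assumes this sample config is saved as ``config.yml`` and that NLU training data exists at ``data/nlu.md``:

from rasa.nlu import config
from rasa.nlu.model import Trainer
from rasa.nlu.training_data import load_data

# Build the pretrained_embeddings_convert pipeline from the sample config
# and train it on local NLU data.
trainer = Trainer(config.load("config.yml"))
interpreter = trainer.train(load_data("data/nlu.md"))

# ConveRT encodes the full utterance, so a paraphrase of a training
# example should map to the same intent.
print(interpreter.parse("I need a ride from my place")["intent"])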

setup.py

Lines changed: 1 addition & 0 deletions
@@ -92,6 +92,7 @@
 extras_requires = {
     "test": tests_requires,
     "spacy": ["spacy>=2.1,<2.2"],
+    "convert": ["tensorflow_text~=1.15.1", "tensorflow_hub~=0.6.0"],
     "mitie": ["mitie"],
     "sql": ["psycopg2~=2.8.2", "SQLAlchemy~=1.3"],
 }
