Bugfixes countvectorizer (RasaHQ#5038)

RolandJAAI · dakshvar22 · web-flow · commit 35388b7b2740 · 2020-06-03T20:42:30.000+02:00
* bugfixes countvectorizer:
- try to load vocabulary if it is not loaded yet
- apply token_pattern to _process_tokens to identify OOV_tokens correctly

* Added changelog file.

* remove blanks

* pre-compile regex in init for faster processing

* moved vocab validation to load method
added tests

* removed token_pattern processing because it might alter sequence length and probably will have to be removed from the featurizer

* removed second token_pattern test

* replaced call to private class member of original CV class by the respective code

* Update changelog/5038.bugfix.rst

Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>

* Update rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py

Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>

* calling a private member is ok in this case

* Apply suggestions from code review

Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>
Co-authored-by: Daksh Varshneya <d.varshneya@rasa.com>
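
The commit message says that applying token_pattern while processing tokens "might alter sequence length". A minimal sketch of that effect, using only Python's re module and the default pattern from the component config. The sample sentence and the whitespace split are assumptions for illustration and are not taken from the Rasa code:

```python
import re

# Default token_pattern from the component config in this diff.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical sample input; a plain whitespace split stands in for a tokenizer.
text = "I am a bot"
whitespace_tokens = text.split()

# The pattern only matches runs of two or more word characters,
# so single-character tokens such as "I" and "a" are dropped.
pattern_tokens = TOKEN_PATTERN.findall(text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(pattern_tokens), pattern_tokens)
```

Under these assumptions the token sequence shrinks from four entries to two, so any per-token features aligned with the original tokens would no longer line up.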
diff --git a/changelog/5038.bugfix.rst b/changelog/5038.bugfix.rst
@@ -0,0 +1 @@
+Fixed a bug in the ``CountVectorsFeaturizer`` that caused the very first message after loading a model to be processed incorrectly because the vocabulary had not been loaded yet.
diff --git a/rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py b/rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py
@@ -53,6 +53,8 @@ def required_components(cls) -> List[Type[Component]]:
         "analyzer": "word",  # use 'char' or 'char_wb' for character
         # regular expression for tokens
         # only used if analyzer == 'word'
+        # WARNING this pattern is used during training
+        # but not currently used during inference!
         "token_pattern": r"(?u)\b\w\w+\b",
         # remove accents during the preprocessing step
         "strip_accents": None,  # {'ascii', 'unicode', None}
@@ -662,4 +664,10 @@ def load(
                 meta, vocabulary=vocabulary
             )
 
-        return cls(meta, vectorizers)
+        ftr = cls(meta, vectorizers)
+
+        # make sure the vocabulary has been loaded correctly
+        for attribute in vectorizers:
+            ftr.vectorizers[attribute]._validate_vocabulary()
+
+        return ftr
diff --git a/tests/nlu/featurizers/test_count_vectors_featurizer.py b/tests/nlu/featurizers/test_count_vectors_featurizer.py
@@ -345,6 +345,9 @@ def test_count_vector_featurizer_persist_load(tmp_path):
 
     assert train_vect_params == test_vect_params
 
+    # check if vocabulary was loaded correctly
+    assert hasattr(test_ftr.vectorizers[TEXT], "vocabulary_")
+
     test_message1 = Message(sentence1)
     test_ftr.process(test_message1)
     test_message2 = Message(sentence2)
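
The failure mode addressed by the change to load() can be sketched without Rasa or scikit-learn. Everything below is a hypothetical stand-in: LazyVectorizer, replace_oov, load_vectorizer and the "__oov__" token are illustrative names, and the assumption is that some processing step checks for the vocabulary_ attribute before the vectorizer's first transform call, while the vectorizer only sets that attribute lazily.

```python
from typing import Dict, List, Optional

OOV_TOKEN = "__oov__"  # hypothetical placeholder for out-of-vocabulary tokens


class LazyVectorizer:
    """Hypothetical stand-in: keeps the constructor vocabulary in `vocabulary`
    and only exposes it as `vocabulary_` once it has been validated."""

    def __init__(self, vocabulary: Optional[Dict[str, int]] = None) -> None:
        self.vocabulary = vocabulary

    def _validate_vocabulary(self) -> None:
        if self.vocabulary is not None:
            self.vocabulary_ = dict(self.vocabulary)

    def transform(self, tokens: List[str]) -> List[int]:
        # Validation happens lazily on the first transform call.
        if not hasattr(self, "vocabulary_"):
            self._validate_vocabulary()
        counts = [0] * len(self.vocabulary_)
        for token in tokens:
            if token in self.vocabulary_:
                counts[self.vocabulary_[token]] += 1
        return counts


def replace_oov(vectorizer: LazyVectorizer, tokens: List[str]) -> List[str]:
    """Replace unknown tokens, but only if a vocabulary is already visible."""
    if not hasattr(vectorizer, "vocabulary_"):
        return tokens  # silently skipped: the first-message problem
    return [t if t in vectorizer.vocabulary_ else OOV_TOKEN for t in tokens]


def load_vectorizer(vocabulary: Dict[str, int], validate: bool) -> LazyVectorizer:
    """Build a vectorizer; `validate=True` mirrors the eager check added to load()."""
    vectorizer = LazyVectorizer(vocabulary)
    if validate:
        vectorizer._validate_vocabulary()
    return vectorizer


vocab = {"hello": 0, OOV_TOKEN: 1}

lazy = load_vectorizer(vocab, validate=False)
first = replace_oov(lazy, ["hello", "xyz"])   # vocabulary_ not set yet
lazy.transform(first)                         # triggers lazy validation
second = replace_oov(lazy, ["hello", "xyz"])  # now behaves as intended

eager = load_vectorizer(vocab, validate=True)
eager_first = replace_oov(eager, ["hello", "xyz"])
```

In this sketch only the first message differs between the two loading strategies, which matches the symptom described in the changelog entry. Validating once at load time keeps the check out of the per-message path.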
