Commit 35388b7

Bugfixes countvectorizer (RasaHQ#5038)
* bugfixes countvectorizer:
  - try to load vocabulary if it is not loaded yet
  - apply token_pattern to _process_tokens to identify OOV_tokens correctly
* Added changelog file.
* remove blanks
* pre-compile regex in init for faster processing
* moved vocab validation to load method added tests
* removed token_pattern processing because it might alter sequence length and probably will have to be removed from the featurizer
* removed second token_pattern test
* replaced call to private class member of original CV class by the respective code
* Update changelog/5038.bugfix.rst
  Co-authored-by: Daksh Varshneya <[email protected]>
* Update rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py
  Co-authored-by: Daksh Varshneya <[email protected]>
* calling a private member is ok in this case
* Apply suggestions from code review
  Co-authored-by: Daksh Varshneya <[email protected]>

Co-authored-by: Daksh Varshneya <[email protected]>
1 parent e80e3f9 commit 35388b7

3 files changed, 13 additions and 1 deletion

changelog/5038.bugfix.rst

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Fixed a bug in the ``CountVectorsFeaturizer`` that caused the very first message after loading a model to be processed incorrectly because the vocabulary had not been loaded yet.
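
The failure mode described in this changelog entry can be sketched with a small stand-in. Everything below is hypothetical and only for illustration: `LazyVectorizer`, `replace_oov`, and `process` are not Rasa or scikit-learn code, and the assumption that the affected step is out-of-vocabulary replacement is inferred from the commit message's mention of OOV tokens. The point is only that a vocabulary resolved lazily on the first transform is not yet visible to code that runs before that transform.

```python
class LazyVectorizer:
    """Hypothetical stand-in for a vectorizer that resolves its vocabulary lazily."""

    def __init__(self, vocabulary):
        # Only the constructor argument is stored; `vocabulary_` does not exist yet.
        self.vocabulary = vocabulary

    def validate_vocabulary(self):
        # Build the fitted attribute from the constructor argument.
        self.vocabulary_ = {token: i for i, token in enumerate(self.vocabulary)}

    def transform(self, tokens):
        # The first call resolves the vocabulary as a side effect.
        if not hasattr(self, "vocabulary_"):
            self.validate_vocabulary()
        return [tokens.count(token) for token in self.vocabulary_]


def replace_oov(vectorizer, tokens, oov_token="__oov__"):
    """Replace unknown tokens, but only if the fitted vocabulary is available."""
    if not hasattr(vectorizer, "vocabulary_"):
        return tokens  # replacement is silently skipped: the failure mode
    return [t if t in vectorizer.vocabulary_ else oov_token for t in tokens]


def process(vectorizer, tokens):
    """Clean the tokens first, then featurize them (featurizing loads the vocabulary)."""
    cleaned = replace_oov(vectorizer, tokens)
    vectorizer.transform(cleaned)
    return cleaned


vectorizer = LazyVectorizer(["hello", "world", "__oov__"])
first = process(vectorizer, ["hello", "stranger"])
second = process(vectorizer, ["hello", "stranger"])
print(first)   # the unknown token is left untouched on the first message
print(second)  # the unknown token is replaced from the second message on
```

In this sketch the same input yields different cleaned tokens on the first and second call, which is the kind of first-message inconsistency the entry describes.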

rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py

Lines changed: 9 additions & 1 deletion
@@ -53,6 +53,8 @@ def required_components(cls) -> List[Type[Component]]:
         "analyzer": "word",  # use 'char' or 'char_wb' for character
         # regular expression for tokens
         # only used if analyzer == 'word'
+        # WARNING this pattern is used during training
+        # but not currently used during inference!
         "token_pattern": r"(?u)\b\w\w+\b",
         # remove accents during the preprocessing step
         "strip_accents": None,  # {'ascii', 'unicode', None}
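
The commit message notes that applying `token_pattern` during processing "might alter sequence length". A minimal sketch of why, using only the standard `re` module and the default pattern shown in the hunk above; the sample sentence and the whitespace split are assumptions chosen for illustration, not the tokenizer the featurizer actually uses.

```python
import re

# Default pattern from the featurizer config: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "I am a bot"

# A plain whitespace split keeps every token, including single characters.
whitespace_tokens = text.split()

# The pattern drops single-character tokens, so the sequence gets shorter.
pattern_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)  # ['I', 'am', 'a', 'bot']
print(pattern_tokens)     # ['am', 'bot']
```

Because the two token lists differ in length, features computed per token after applying the pattern would no longer line up one-to-one with the original tokens.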
@@ -662,4 +664,10 @@ def load(
                     meta, vocabulary=vocabulary
                 )
 
-            return cls(meta, vectorizers)
+            ftr = cls(meta, vectorizers)
+
+            # make sure the vocabulary has been loaded correctly
+            for attribute in vectorizers:
+                ftr.vectorizers[attribute]._validate_vocabulary()
+
+            return ftr
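
The pattern in this hunk, validating each vectorizer's vocabulary eagerly at load time instead of waiting for the first transform, can be sketched in isolation. The classes below are hypothetical stand-ins written for this example (they are not the Rasa featurizer or the scikit-learn vectorizer), and the public `validate_vocabulary` method stands in for the private `_validate_vocabulary` call in the diff.

```python
from typing import Dict, List


class LazyVectorizer:
    """Hypothetical stand-in; not the scikit-learn class."""

    def __init__(self, vocabulary: List[str]) -> None:
        # The fitted attribute `vocabulary_` is intentionally not set here.
        self.vocabulary = vocabulary

    def validate_vocabulary(self) -> None:
        # Materialize the fitted vocabulary from the constructor argument.
        self.vocabulary_ = {token: i for i, token in enumerate(self.vocabulary)}


class Featurizer:
    """Hypothetical featurizer holding one vectorizer per attribute."""

    def __init__(self, vectorizers: Dict[str, LazyVectorizer]) -> None:
        self.vectorizers = vectorizers

    @classmethod
    def load(cls, vocabularies: Dict[str, List[str]]) -> "Featurizer":
        vectorizers = {
            attribute: LazyVectorizer(vocabulary)
            for attribute, vocabulary in vocabularies.items()
        }
        featurizer = cls(vectorizers)
        # Resolve every vocabulary now, so the first processed message sees it.
        for attribute in featurizer.vectorizers:
            featurizer.vectorizers[attribute].validate_vocabulary()
        return featurizer


loaded = Featurizer.load({"text": ["hello", "world"], "response": ["hi"]})
vocabulary_ready = all(
    hasattr(vectorizer, "vocabulary_") for vectorizer in loaded.vectorizers.values()
)
print(vocabulary_ready)
```

Doing the validation inside `load` keeps the check in one place, so code that runs before the first transform does not need its own fallback.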

tests/nlu/featurizers/test_count_vectors_featurizer.py

Lines changed: 3 additions & 0 deletions
@@ -345,6 +345,9 @@ def test_count_vector_featurizer_persist_load(tmp_path):

     assert train_vect_params == test_vect_params

+    # check if vocabulary was loaded correctly
+    assert hasattr(test_ftr.vectorizers[TEXT], "vocabulary_")
+
     test_message1 = Message(sentence1)
     test_ftr.process(test_message1)
     test_message2 = Message(sentence2)
