
Commit 29ea89b

Merge branch 'master' of github.com:RasaHQ/rasa_nlu
2 parents: f010367 + 13a26e8

File tree

6 files changed: 149 additions (+), 46 deletions (−)


.travis.yml

Lines changed: 14 additions & 1 deletion
@@ -11,7 +11,11 @@ python:
 env:
   # needed to fix issues with boto during testing:
   # https://github.com/travis-ci/travis-ci/issues/7940
-  global: BOTO_CONFIG=/dev/null
+  global:
+    - BOTO_CONFIG=/dev/null
+    # secret for FOSSA
+    - secure: "g0c6z+CKVPiuGE3G3OGjzcKZJsdsA/j+zsTQ/xr1ie9gKNpKRA0KQAvsU/mow4NeMmt5YKnwKxFqWy0b3Oufm5WLTAWIVepT5FHA7YMVCMcpPIMsbtqe64FqJxhgL+sBJZku5i94PzCruXwbjk3Q5Uad95ZsIJLQLEkzIla6Fdcu5hdkfYVIGRGe0W2YxNVx+3fZimnOOvJmXD/nnZEBesq5fR09uA4v2PsBGuHOTkwuG60rU5bBPZ2PNiEZXt552kJ5yfdSszNd6uzgQNp10L0qXzt1fXeyf9uGAleDS1HrbcVcdWX0jwCwZF/FD40wUcyKGqOWWDs81mehsjhXymKBhy62QinCSFBmcJb/uRJpMWvvzOLaH94TgxFJCxogNKdqNN/jU3V8CuJ2ELp+RCiniO1u9Dd134j6dYWSWYO+R4WXYoIQYVlGZZqIZz+5b5eKsY+r6I1zPBfD+MowJx2HiLpSdgXh3tlhblOZwPKIuUQ++MeBY4JttJA4Sx5K+YIUzDvcx+7jyuYIjrV4D23n6i4dZZHpAqqtlf1iSbnEzO/rnAxZxy1UuEVmXNJSDOoKEzA7Mdd/Nh+M8wcJAb/KDS/TEvT27+TLRlrgOjgjU2q+4yXNomivi3rZJ1+EFUTgs7lgOGCM/LBpJgnyyiSYZoep0YDZHtA3DCnHKQI="
+
 install:
   - pip install git+https://github.com/tmbo/MITIE.git
   - pip install -r alt_requirements/requirements_dev.txt
@@ -35,6 +39,15 @@ after_success:
   - coveralls
 jobs:
   include:
+    - stage: test
+      name: Check Dependency Licenses
+      before_script:
+        - "curl -H 'Cache-Control: no-cache' https://raw.githubusercontent.com/fossas/fossa-cli/master/install.sh | sudo bash"
+      script:
+        - pip freeze > requirements.txt
+        - fossa init
+        - fossa analyze
+        - fossa test
     - stage: docs
       if: fork = false AND branch = "master" # forked repository will skip building docs, only master & PRs to it
       install:
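
The new stage installs the FOSSA CLI and runs a dependency license check over the frozen requirements. For readers who want to reproduce those script steps outside Travis, here is a minimal Python sketch; it assumes the fossa CLI from fossas/fossa-cli is on PATH and that a FOSSA API key is configured in the environment. It is illustrative only, not part of the commit.

# Local mirror of the "Check Dependency Licenses" stage (illustrative sketch).
import subprocess
import sys

def check_dependency_licenses():
    # Equivalent of "pip freeze > requirements.txt"
    with open("requirements.txt", "w") as req:
        subprocess.run([sys.executable, "-m", "pip", "freeze"],
                       stdout=req, check=True)
    # Equivalent of the fossa init / analyze / test script steps
    for cmd in (["fossa", "init"], ["fossa", "analyze"], ["fossa", "test"]):
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    check_dependency_licenses()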

CHANGELOG.rst

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@ This project adheres to `Semantic Versioning`_ starting with version 0.7.0.

 Added
 -----
+- Ability to save successful predictions and classification results to a JSON
+  file from ``rasa_nlu.evaluate``
 - environment variables specified with ``${env_variable}`` in a yaml
   configuration file are now replaced with the value of the environment
   variable

README.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 [![Coverage Status](https://coveralls.io/repos/github/RasaHQ/rasa_nlu/badge.svg?branch=master)](https://coveralls.io/github/RasaHQ/rasa_nlu?branch=master)
 [![PyPI version](https://badge.fury.io/py/rasa_nlu.svg)](https://badge.fury.io/py/rasa_nlu)
 [![Documentation Status](https://img.shields.io/badge/docs-stable-brightgreen.svg)](https://nlu.rasa.com/)
+[![FOSSA Status](https://app.fossa.io/api/projects/git%2Bgithub.com%2FRasaHQ%2Frasa_nlu.svg?type=shield)](https://app.fossa.io/projects/git%2Bgithub.com%2FRasaHQ%2Frasa_nlu?ref=badge_shield)

 Rasa NLU (Natural Language Understanding) is a tool for understanding what is being said in short pieces of text.
 For example, taking a short message like:

docs/evaluation.rst

Lines changed: 19 additions & 11 deletions
@@ -77,17 +77,25 @@ for every cross-validation fold.

 Intent Classification
 ---------------------
-The evaluation script will log precision, recall, and f1 measure for
-each intent and once summarized for all.
-Furthermore, it creates a confusion matrix for you to see which
-intents are mistaken for which others.
-Samples which have not been predicted correctly are logged and saved to a file
-called ``errors.json`` for easier debugging.
-Finally, the evaluation script creates a histogram of the confidence distribution for all predictions,
-separating the confidence of wrong and correct predictions in different bars of the histogram.
-Improving the quality of your training data will move the blue-ish histogram bars
-(confidence of the correct predictions) to the right and the wine-ish histogram bars
-(confidence of wrong predictions) to the left.
+The evaluation script will produce a report, confusion matrix,
+and confidence histogram for your model.
+
+The report logs precision, recall, and f1 measure for
+each intent, as well as providing an overall average. You can save this
+report as a JSON file using the ``--report`` flag.
+
+The confusion matrix shows you which
+intents are mistaken for others; any samples which have been
+incorrectly predicted are logged and saved to a file
+called ``errors.json`` for easier debugging.
+
+The histogram that the script produces allows you to visualise the
+confidence distribution for all predictions,
+with the volume of correct and incorrect predictions displayed as
+blue and red bars respectively.
+Improving the quality of your training data will move the blue
+histogram bars to the right and the red histogram bars
+to the left of the plot.


 .. note::
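
A small sketch of consuming the saved artifacts, assuming the evaluation was run with --report report.json and the default errors.json output; the field names follow the structures written in rasa_nlu/evaluate.py below.

# Illustrative only: inspect the JSON files written by the evaluation script.
import json

with open("report.json") as f:
    report = json.load(f)  # sklearn classification_report as a dict
for intent, scores in report.items():
    if isinstance(scores, dict):  # newer sklearn also adds scalar summary keys
        print(intent, scores["precision"], scores["recall"], scores["f1-score"])

with open("errors.json") as f:
    errors = json.load(f)  # misclassified samples, one dict per message
for e in errors:
    print("{!r}: labelled {}, predicted {} ({:.2f})".format(
        e["text"], e["intent"], e["intent_prediction"]["name"],
        e["intent_prediction"]["confidence"]))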

rasa_nlu/evaluate.py

Lines changed: 74 additions & 33 deletions
@@ -63,8 +63,16 @@ def create_argument_parser():
     parser.add_argument('-f', '--folds', required=False, default=10,
                         help="number of CV folds (crossvalidation only)")

+    parser.add_argument('--report', required=False, nargs='?',
+                        const="report.json", default=False,
+                        help="output path to save the metrics report")
+
+    parser.add_argument('--successes', required=False, nargs='?',
+                        const="successes.json", default=False,
+                        help="output path to save successful predictions")
+
     parser.add_argument('--errors', required=False, default="errors.json",
-                        help="output path for the json with wrong predictions")
+                        help="output path to save model errors")

     parser.add_argument('--histogram', required=False, default="hist.png",
                         help="output path for the confidence histogram")
@@ -163,14 +171,15 @@ def log_evaluation_table(report,  # type: Text
     logger.info("Classification report: \n{}".format(report))


-def get_evaluation_metrics(targets, predictions):  # pragma: no cover
+def get_evaluation_metrics(targets, predictions, output_dict=False):  # pragma: no cover
     """Compute the f1, precision, accuracy and summary report from sklearn."""
     from sklearn import metrics

     targets = clean_intent_labels(targets)
     predictions = clean_intent_labels(predictions)

-    report = metrics.classification_report(targets, predictions)
+    report = metrics.classification_report(targets, predictions,
+                                           output_dict=output_dict)
     precision = metrics.precision_score(targets, predictions,
                                         average='weighted')
     f1 = metrics.f1_score(targets, predictions, average='weighted')
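
The output_dict parameter of scikit-learn's classification_report (available since scikit-learn 0.20) returns a nested dict of per-label metrics plus averages instead of a formatted string, which is what makes the report JSON-serialisable. A quick illustration with made-up labels:

# Comparing the two classification_report output modes (illustrative).
from sklearn.metrics import classification_report

targets = ["greet", "greet", "restaurant_search"]
predictions = ["greet", "restaurant_search", "restaurant_search"]

print(classification_report(targets, predictions))  # human-readable table

report = classification_report(targets, predictions, output_dict=True)
print(report["greet"])  # {'precision': 1.0, 'recall': 0.5, 'f1-score': 0.66..., 'support': 2}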
@@ -213,37 +222,50 @@ def drop_intents_below_freq(td, cutoff=5):
     return TrainingData(keep_examples, td.entity_synonyms, td.regex_features)


-def save_nlu_errors(errors, filename):
-    """Write out nlu classification errors to a file."""
+def save_json(data, filename):
+    """Write out nlu classification results to a file."""

     utils.write_to_file(filename,
-                        json.dumps(errors, indent=4, ensure_ascii=False))
-    logger.info("Model prediction errors saved to {}.".format(filename))
+                        json.dumps(data, indent=4, ensure_ascii=False))
+
+
+def collect_nlu_successes(intent_results, successes_filename):
+    """Log messages which result in successful predictions
+    and save them to file"""
+
+    successes = [{"text": r.message,
+                  "intent": r.target,
+                  "intent_prediction": {"name": r.prediction,
+                                        "confidence": r.confidence}}
+                 for r in intent_results if r.target == r.prediction]
+
+    if successes:
+        save_json(successes, successes_filename)
+        logger.info("Model prediction successes saved to {}."
+                    .format(successes_filename))
+        logger.debug("\n\nSuccessfully predicted the following "
+                     "intents: \n{}".format(successes))
+    else:
+        logger.info("Your model made no successful predictions")


-def collect_nlu_errors(intent_results):  # pragma: no cover
+def collect_nlu_errors(intent_results, errors_filename):
     """Log messages which result in wrong predictions and save them to file"""

-    # it could be interesting to include entity-errors later
-    # therefore we start with a "intent_errors" key
-    intent_errors = [{"text": r.message,
-                      "intent": r.target,
-                      "intent_prediction": {
-                          "name": r.prediction,
-                          "confidence": r.confidence
-                      }}
-                     for r in intent_results if r.target != r.prediction]
-
-    if intent_errors:
-        logger.info("There were some nlu intent classification errors. "
-                    "Use `--verbose` to show them in the log.")
-        logger.debug("\n\nThese intent examples could not be classified "
-                     "correctly \n{}".format(intent_errors))
+    errors = [{"text": r.message,
+               "intent": r.target,
+               "intent_prediction": {"name": r.prediction,
+                                     "confidence": r.confidence}}
+              for r in intent_results if r.target != r.prediction]

-        return {'intent_errors': intent_errors}
+    if errors:
+        save_json(errors, errors_filename)
+        logger.info("Model prediction errors saved to {}."
+                    .format(errors_filename))
+        logger.debug("\n\nThese intent examples could not be classified "
+                     "correctly: \n{}".format(errors))
     else:
-        logger.info("No prediction errors were found. You are AWESOME!")
-        return None
+        logger.info("Your model made no errors")


 def plot_intent_confidences(intent_results, intent_hist_filename):
@@ -262,6 +284,8 @@ def plot_intent_confidences(intent_results, intent_hist_filename):


 def evaluate_intents(intent_results,
+                     report_filename,
+                     successes_filename,
                      errors_filename,
                      confmat_filename,
                      intent_hist_filename):  # pragma: no cover
@@ -284,16 +308,27 @@ def evaluate_intents(intent_results,

     targets, predictions = _targets_predictions_from(intent_results)

-    report, precision, f1, accuracy = get_evaluation_metrics(targets,
-                                                             predictions)
+    if report_filename:
+        report, precision, f1, accuracy = get_evaluation_metrics(targets,
+                                                                 predictions,
+                                                                 output_dict=True)

-    log_evaluation_table(report, precision, f1, accuracy)
+        save_json(report, report_filename)
+        logger.info("Classification report saved to {}."
+                    .format(report_filename))
+
+    else:
+        report, precision, f1, accuracy = get_evaluation_metrics(targets,
+                                                                 predictions)
+        log_evaluation_table(report, precision, f1, accuracy)

-    # log and save misclassified samples to file for debugging
-    errors = collect_nlu_errors(intent_results)
+    if successes_filename:
+        # save correctly classified samples to file for debugging
+        collect_nlu_successes(intent_results, successes_filename)

-    if errors and errors_filename:
-        save_nlu_errors(errors, errors_filename)
+    if errors_filename:
+        # log and save misclassified samples to file for debugging
+        collect_nlu_errors(intent_results, errors_filename)

     if confmat_filename:
         from sklearn.metrics import confusion_matrix
@@ -673,6 +708,8 @@ def remove_duckling_entities(entity_predictions):


 def run_evaluation(data_path, model,
+                   report_filename=None,
+                   successes_filename=None,
                    errors_filename='errors.json',
                    confmat_filename=None,
                    intent_hist_filename=None,
@@ -706,6 +743,8 @@ def run_evaluation(data_path, model,

     logger.info("Intent evaluation results:")
     result['intent_evaluation'] = evaluate_intents(intent_results,
+                                                   report_filename,
+                                                   successes_filename,
                                                    errors_filename,
                                                    confmat_filename,
                                                    intent_hist_filename)
@@ -919,6 +958,8 @@ def main():
     elif cmdline_args.mode == "evaluation":
         run_evaluation(cmdline_args.data,
                        cmdline_args.model,
+                       cmdline_args.report,
+                       cmdline_args.successes,
                        cmdline_args.errors,
                        cmdline_args.confmat,
                        cmdline_args.histogram)
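
Putting the new keyword arguments together, a hedged usage sketch; the data and model paths are placeholders, and the returned structure follows evaluate_intents above.

# Illustrative call enabling the new report/successes outputs.
from rasa_nlu.evaluate import run_evaluation

result = run_evaluation("data/examples/rasa/demo-rasa.json",  # placeholder path
                        "models/nlu/default/current",         # placeholder model dir
                        report_filename="report.json",
                        successes_filename="successes.json",
                        errors_filename="errors.json")
print(result["intent_evaluation"]["predictions"][:3])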

tests/base/test_evaluation.py

Lines changed: 39 additions & 1 deletion
@@ -14,13 +14,16 @@
     remove_empty_intent_examples, get_entity_extractors,
     get_duckling_dimensions, known_duckling_dimensions,
     find_component, remove_duckling_extractors, drop_intents_below_freq,
-    run_cv_evaluation, substitute_labels, IntentEvaluationResult)
+    run_cv_evaluation, substitute_labels, IntentEvaluationResult,
+    evaluate_intents)
 from rasa_nlu.evaluate import does_token_cross_borders
 from rasa_nlu.evaluate import align_entity_predictions
 from rasa_nlu.evaluate import determine_intersection
 from rasa_nlu.evaluate import determine_token_labels
 from rasa_nlu.config import RasaNLUModelConfig
 from rasa_nlu.tokenizers import Token
+from rasa_nlu import utils
+import json
 from rasa_nlu import training_data, config
 from tests import utilities

@@ -258,6 +261,41 @@ def test_run_cv_evaluation():
     assert len(entity_results.test['ner_crf']["F1-score"]) == n_folds


+def test_evaluation_report(tmpdir_factory):
+
+    path = tmpdir_factory.mktemp("evaluation").strpath
+    report_filename = path + "report.json"
+
+    intent_results = [
+        IntentEvaluationResult("", "restaurant_search",
+                               "I am hungry", 0.12345),
+        IntentEvaluationResult("greet", "greet",
+                               "hello", 0.98765)]
+
+    result = evaluate_intents(intent_results,
+                              report_filename,
+                              successes_filename=None,
+                              errors_filename=None,
+                              confmat_filename=None,
+                              intent_hist_filename=None)
+
+    report = json.loads(utils.read_file(report_filename))
+
+    greet_results = {"precision": 1.0,
+                     "recall": 1.0,
+                     "f1-score": 1.0,
+                     "support": 1}
+
+    prediction = {'text': 'hello',
+                  'intent': 'greet',
+                  'predicted': 'greet',
+                  'confidence': 0.98765}
+
+    assert len(report.keys()) == 4
+    assert report["greet"] == greet_results
+    assert result["predictions"][0] == prediction
+
+
 def test_empty_intent_removal():
     intent_results = [
         IntentEvaluationResult("", "restaurant_search",
