
Commit 0eff9b1

Eval markdown
1 parent 1929ba1 commit 0eff9b1

File tree

1 file changed: +9 -2 lines changed

docs/evaluation.md

Lines changed: 9 additions & 2 deletions
@@ -2,6 +2,13 @@
 
 Follow these steps to evaluate the quality of the answers generated by the RAG flow.
 
+* [Deploy a GPT-4 model](#deploy-a-gpt-4-model)
+* [Setup the evaluation environment](#setup-the-evaluation-environment)
+* [Generate ground truth data](#generate-ground-truth-data)
+* [Run bulk evaluation](#run-bulk-evaluation)
+* [Review the evaluation results](#review-the-evaluation-results)
+* [Run bulk evaluation on a PR](#run-bulk-evaluation-on-a-pr)
+
 ## Deploy a GPT-4 model
 
 
@@ -45,7 +52,7 @@ python evals/generate_ground_truth_data.py
 
 Review the generated data after running that script, removing any question/answer pairs that don't seem like realistic user input.
 
-## Evaluate the RAG answer quality
+## Run bulk evaluation
 
 Review the configuration in `evals/eval_config.json` to ensure that everything is correctly setup. You may want to adjust the metrics used. See [the ai-rag-chat-evaluator README](https://github.com/Azure-Samples/ai-rag-chat-evaluator) for more information on the available metrics.
 
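A side note on the `evals/eval_config.json` review mentioned just above: the config schema and the available metrics are owned by the ai-rag-chat-evaluator tool, so the snippet below is only a sketch of how one might inspect the config and trim the metric list before a bulk run. The `requested_metrics` key and the metric names shown are assumptions to verify against that tool's README and the actual file contents.

```python
# Sketch only: inspect and adjust the evaluation config before a bulk run.
# The "requested_metrics" key and the metric names are assumptions; verify
# them against the ai-rag-chat-evaluator README and evals/eval_config.json.
import json
from pathlib import Path

config_path = Path("evals/eval_config.json")
config = json.loads(config_path.read_text(encoding="utf-8"))

print("Current config keys:", sorted(config))

# Hypothetical: request a smaller set of metrics for a quicker run.
config["requested_metrics"] = ["gpt_groundedness", "gpt_relevance", "answer_length"]

config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
```
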
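Earlier in the same hunk, the docs ask you to hand-review the generated ground truth data and drop unrealistic question/answer pairs. As a loose illustration only, the sketch below steps through a JSONL file of pairs and keeps the ones you confirm; the `evals/ground_truth.jsonl` path and the `question`/`truth` field names are assumptions rather than something documented here, so match them to whatever `generate_ground_truth_data.py` actually writes.

```python
# Hypothetical helper for pruning generated ground truth data.
# Assumes a JSONL file with "question" and "truth" fields per line; check
# what evals/generate_ground_truth_data.py actually writes before using it.
import json
from pathlib import Path

source = Path("evals/ground_truth.jsonl")            # assumed location
reviewed = Path("evals/ground_truth_reviewed.jsonl")

kept = []
for line in source.read_text(encoding="utf-8").splitlines():
    pair = json.loads(line)
    print("\nQ:", pair.get("question"))
    print("A:", pair.get("truth"))
    if input("Keep this pair? [y/N] ").strip().lower() == "y":
        kept.append(line)

reviewed.write_text("\n".join(kept) + "\n", encoding="utf-8")
print(f"Kept {len(kept)} of the generated pairs -> {reviewed}")
```
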
@@ -72,6 +79,6 @@ Compare answers across runs by running the following command:
 python -m evaltools diff evals/results/baseline/
 ```
 
-## Run the evaluation on a PR
+## Run bulk evaluation on a PR
 
 To run the evaluation on the changes in a PR, you can add a `/evaluate` comment to the PR. This will trigger the evaluation workflow to run the evaluation on the PR changes and will post the results to the PR.
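For the `python -m evaltools diff evals/results/baseline/` command shown in the hunk above, the tool itself handles the cross-run comparison; the snippet below is just a rough picture of the idea, under the hypothetical assumption (not documented here) that each run directory contains a `summary.json` of per-metric aggregates. Prefer the `evaltools diff` command in practice, since it knows the real results layout.

```python
# Rough illustration only: compare per-metric aggregates across result runs.
# The results/<run>/summary.json layout is an assumption, not documented here;
# use `python -m evaltools diff evals/results/baseline/` for the real comparison.
import json
from pathlib import Path

results_dir = Path("evals/results")
baseline = json.loads((results_dir / "baseline" / "summary.json").read_text(encoding="utf-8"))

for run_dir in sorted(p for p in results_dir.iterdir() if p.is_dir()):
    if run_dir.name == "baseline":
        continue
    summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))
    print(f"\n== {run_dir.name} vs baseline ==")
    for metric, value in summary.items():
        base = baseline.get(metric)
        if isinstance(value, (int, float)) and isinstance(base, (int, float)):
            print(f"{metric}: {value:.3f} (baseline {base:.3f}, delta {value - base:+.3f})")
```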
