Retrieval-Augmented Generation (RAG) evaluators
This article refers to the Microsoft Foundry (new) portal.
The Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview. The underlying APIs are generally available for model and dataset evaluation, while agent evaluation remains in public preview. Evaluators marked (preview) in this article are currently in public preview across all of these surfaces.
| Evaluator | Best practice | Use when | Purpose | Output |
|---|---|---|---|---|
| Document Retrieval | Process evaluation | Retrieval quality is a bottleneck for your RAG, and you have query relevance labels (ground truth) for precise search quality metrics for debugging and parameter optimization | Measures search quality metrics (Fidelity, NDCG, XDCG, Max Relevance, Holes) by comparing retrieved documents against ground truth labels | Composite: Fidelity, NDCG, XDCG, Max Relevance, Holes (with Pass/Fail) |
| Retrieval | Process evaluation | You want to assess textual quality of retrieved context, but you don’t have ground truths | Measures how relevant the retrieved context chunks are to addressing a query using an LLM judge | Binary: Pass/Fail based on threshold (1-5 scale) |
| Groundedness | System evaluation | You want a well-rounded groundedness definition that works with agent inputs, and bring your own GPT models as the LLM-judge | Measures how well the generated response aligns with the given context without fabricating content (precision aspect) | Binary: Pass/Fail based on threshold (1-5 scale) |
| Groundedness Pro (preview) | System evaluation | You want a strict groundedness definition powered by Azure AI Content Safety and use our service model | Detects if the response is strictly consistent with the context using the Azure AI Content Safety service | Binary: True/False |
| Relevance | System evaluation | You want to assess how well the RAG response addresses the query but don’t have ground truths | Measures the accuracy, completeness, and direct relevance of the response to the query | Binary: Pass/Fail based on threshold (1-5 scale) |
| Response Completeness | System evaluation | You want to ensure the RAG response doesn’t miss critical information (recall aspect) from your ground truth | Measures how completely the response covers the expected information compared to ground truth | Binary: Pass/Fail based on threshold (1-5 scale) |
- Groundedness focuses on the precision aspect of the response: it shouldn't contain content outside of the grounding context.
- Response completeness focuses on the recall aspect of the response: it shouldn't miss critical information compared to the expected response (ground truth).
System evaluation
System evaluation examines the quality of the final response in your RAG workflow. These evaluators ensure that the AI-generated content is accurate, relevant, and complete based on the provided context and user query (see the usage sketch after this list):
- Groundedness - Is the response grounded in the provided context without fabrication?
- Groundedness Pro - Does the response strictly adhere to the context (Azure AI Content Safety)?
- Relevance - Does the response accurately address the user’s query?
- Response Completeness - Does the response cover all critical information from ground truth?
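As a minimal illustration, the sketch below scores a single response for groundedness. It assumes the `azure-ai-evaluation` Python package as the evaluation surface and uses placeholder endpoint, key, and deployment values; the Foundry portal and newer evaluation APIs expose the same evaluator through their own configuration.

```python
# Minimal sketch: score one response for groundedness with azure-ai-evaluation.
# Endpoint, key, and deployment values are placeholders.
from azure.ai.evaluation import GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-gpt-deployment>",  # the LLM judge (deployment_name)
}

groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What is the refund window for online orders?",
    response="You can request a refund within 30 days of delivery.",
    context="Our policy allows refunds within 30 days of delivery for online orders.",
)
print(result)  # typically a 1-5 score, a reason, and a pass/fail result
```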
Process evaluation
Process evaluation assesses the quality of the document retrieval step in RAG systems. The retrieval step is crucial for providing relevant context to the language model (see the usage sketch after this list):
- Retrieval - How relevant are the retrieved context chunks to the query?
- Document Retrieval - How well does retrieval match ground truth labels (requires qrels)?
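A similar sketch for the process-level Retrieval evaluator, again assuming the `azure-ai-evaluation` package; the query and context values are illustrative.

```python
# Minimal sketch: judge how relevant retrieved context chunks are to a query.
from azure.ai.evaluation import RetrievalEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-gpt-deployment>",
}

retrieval = RetrievalEvaluator(model_config)

result = retrieval(
    query="What is the refund window for online orders?",
    context=(
        "Refunds are available within 30 days of delivery for online orders.\n"
        "Store credit can be issued for items returned without a receipt."
    ),
)
print(result)  # typically a 1-5 retrieval score plus pass/fail against the threshold
```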
Using RAG evaluators
RAG evaluators assess how well AI systems retrieve and use context to generate grounded responses. Each evaluator requires specific data mappings and parameters:
| Evaluator | Required inputs | Required parameters |
|---|---|---|
| Groundedness | (response, context) OR (query, response) | deployment_name |
| Groundedness Pro | query, response, context | (none) |
| Relevance | query, response | deployment_name |
| Response Completeness | ground_truth, response | deployment_name |
| Retrieval | query, context | deployment_name |
| Document Retrieval | retrieval_ground_truth, retrieved_documents | (none) |
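To make the table concrete, the following sketch maps the required inputs onto an evaluator call, using Response Completeness as the example and assuming the `azure-ai-evaluation` package; all values are placeholders.

```python
# Minimal sketch: the Response Completeness evaluator takes exactly the inputs
# listed in the table (ground_truth, response) plus a judge deployment.
from azure.ai.evaluation import ResponseCompletenessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-gpt-deployment>",  # deployment_name in the table
}

completeness = ResponseCompletenessEvaluator(model_config)

result = completeness(
    response="Our store accepts returns within 30 days.",
    ground_truth="Returns are accepted within 30 days of delivery with a receipt.",
)
print(result)  # typically a 1-5 completeness score plus pass/fail
```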
Example input
Your test dataset should contain the fields referenced in your data mappings:
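For example, a single test-dataset row for the Groundedness, Relevance, and Response Completeness evaluators might look like the following (shown as a Python dict for readability; the field names are illustrative and only need to match your data mapping):

```python
# One test-dataset row, shown as a Python dict for readability (in practice this
# is often a JSONL line). Field names are illustrative; they only need to match
# the fields referenced by your data mapping, for example {{item.query}}.
example_row = {
    "query": "What is the refund window for online orders?",
    "context": "Our policy allows refunds within 30 days of delivery for online orders.",
    "response": "You can request a refund within 30 days of delivery.",
    "ground_truth": "Refunds are available within 30 days of delivery.",
}
```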
Configuration example
Data mapping syntax: `{{item.field_name}}` references fields from your test dataset (for example, `{{item.query}}`). `{{sample.output_items}}` references agent responses generated or retrieved during evaluation; use this when evaluating with an agent target or agent response data source. For agent evaluation, `context` is optional if the response contains tool calls, because the evaluator can extract context from tool call results.
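As an illustration of this syntax, a data mapping for the Groundedness evaluator could look like the following sketch; the surrounding configuration object depends on how you create the evaluation run, and the mapping keys follow the required-inputs table above.

```python
# Illustrative data mappings only; the object that wraps them depends on how you
# create the evaluation run. Dataset fields use {{item.*}}; generated agent
# responses use {{sample.*}}.
groundedness_data_mapping = {
    "query": "{{item.query}}",
    "context": "{{item.context}}",  # optional for agent runs whose responses include tool calls
    "response": "{{item.response}}",
}

# When evaluating against an agent target, the response typically comes from the
# generated output rather than the dataset:
agent_data_mapping = {
    "query": "{{item.query}}",
    "response": "{{sample.output_items}}",
}
```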
Example output
These evaluators return scores on a 1-5 Likert scale (1 = very poor, 5 = excellent). The default pass threshold is 3; scores at or above the threshold are considered passing. Key output fields:
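For illustration, a single Groundedness result might have roughly this shape; the exact field names can vary by SDK version.

```python
# Representative output shape for a single Groundedness evaluation (illustrative).
example_result = {
    "groundedness": 4.0,            # 1-5 Likert score from the LLM judge
    "groundedness_result": "pass",  # pass/fail against the threshold
    "groundedness_threshold": 3,    # default pass threshold
    "groundedness_reason": "The response is fully supported by the provided context.",
}
```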
Document retrieval
Because of its upstream role in RAG, retrieval quality is important. If retrieval quality is poor and the response requires corpus-specific knowledge, there's less chance your language model gives you a satisfactory answer. The most precise measurement is to use the `document_retrieval` evaluator to evaluate retrieval quality and optimize your search parameters for RAG.
- The Document Retrieval evaluator measures how well RAG retrieves the correct documents from the document store. As a composite evaluator for RAG scenarios with ground truth, it computes a list of search quality metrics that are useful for debugging your RAG pipeline:

  | Metric | Category | Description |
  |---|---|---|
  | Fidelity | Search Fidelity | How well the top n retrieved chunks reflect the content for a given query: the number of good documents returned out of the total number of known good documents in the dataset |
  | NDCG | Search NDCG | How close the ranking is to an ideal order in which all relevant items appear at the top of the list |
  | XDCG | Search XDCG | How good the results are within the top-k documents, regardless of how other indexed documents are scored |
  | Max Relevance N | Search Max Relevance | Maximum relevance in the top-k chunks |
  | Holes | Search Label Sanity | Number of documents with missing query relevance judgments (ground truth) |

- To optimize your RAG in a parameter sweep scenario, use these metrics to calibrate search parameters for optimal RAG results: generate retrieval results for the search parameters you want to test, such as search algorithms (vector, semantic), top_k, and chunk sizes, and then use `document_retrieval` to find the parameters that yield the highest retrieval quality, as sketched below.
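The following sketch shows one way to run such a parameter sweep, assuming the `azure-ai-evaluation` package; `run_search` is a hypothetical stand-in for your own search system.

```python
# Parameter-sweep sketch: evaluate several search configurations against the same
# labeled query and keep the one with the best retrieval metrics.
from azure.ai.evaluation import DocumentRetrievalEvaluator

search_configs = [
    {"algorithm": "vector", "top_k": 5, "chunk_size": 512},
    {"algorithm": "semantic", "top_k": 10, "chunk_size": 1024},
]

# Human-labeled relevance judgments (qrels) for one query.
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
    {"document_id": "3", "query_relevance_label": 0},
]


def run_search(query, algorithm, top_k, chunk_size):
    """Hypothetical stand-in for your search system; replace with a real call."""
    return [
        {"document_id": "2", "relevance_score": 45.1},
        {"document_id": "1", "relevance_score": 35.8},
    ]


evaluator = DocumentRetrievalEvaluator()

for config in search_configs:
    retrieved_documents = run_search(query="What is the refund window?", **config)
    metrics = evaluator(
        retrieval_ground_truth=retrieval_ground_truth,
        retrieved_documents=retrieved_documents,
    )
    print(config, metrics)
```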
Document retrieval example
`retrieval_ground_truth` contains human-labeled relevance scores per document:
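For example (shown as Python data; the `document_id` and `query_relevance_label` field names follow the `azure-ai-evaluation` input format and might differ in other surfaces):

```python
# Query relevance labels (qrels): human-judged relevance per document for a query.
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
    {"document_id": "3", "query_relevance_label": 1},
    {"document_id": "4", "query_relevance_label": 0},
]
```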
`retrieved_documents` contains scores from your search system:
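For example (again illustrative, with the retriever's own `relevance_score` per document):

```python
# Documents returned by your search system, with the retriever's own scores.
retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},
    {"document_id": "6", "relevance_score": 35.8},
    {"document_id": "1", "relevance_score": 28.4},
]
```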
Document retrieval output
The `document_retrieval` evaluator returns multiple metrics for retrieval quality:
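An illustrative result shape is shown below; the exact metric keys and cutoffs (for example, the @3 suffix) depend on the SDK version and configuration.

```python
# Illustrative metrics for one query's retrieval results; key names may differ.
example_metrics = {
    "fidelity": 0.5,         # fraction of known-good documents that were retrieved
    "ndcg@3": 0.64,          # ranking quality relative to the ideal order
    "xdcg@3": 37.5,          # quality of the top-k results
    "max_relevance@3": 4,    # highest relevance label among the top-k documents
    "holes": 1,              # retrieved documents with no ground-truth label
    "holes_ratio": 0.33,     # fraction of retrieved documents missing labels
}
```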