
Retrieval-Augmented Generation (RAG) evaluators

This article refers to the Microsoft Foundry (new) portal.
The Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview, but the APIs are generally available for model and dataset evaluation (agent evaluation remains in public preview). Evaluators marked (preview) in this article are currently in public preview everywhere.
A Retrieval-Augmented Generation (RAG) system aims to generate the most relevant answer consistent with grounding documents in response to a user’s query. The query triggers a search over the corpus of grounding documents, and the retrieved results provide grounding context for the AI model to generate a response.
| Evaluator | Best practice | Use when | Purpose | Output |
| --- | --- | --- | --- | --- |
| Document Retrieval | Process evaluation | Retrieval quality is a bottleneck for your RAG system, and you have query relevance labels (ground truth) to compute precise search quality metrics for debugging and parameter optimization | Measures search quality metrics (Fidelity, NDCG, XDCG, Max Relevance, Holes) by comparing retrieved documents against ground truth labels | Composite: Fidelity, NDCG, XDCG, Max Relevance, Holes (with Pass/Fail) |
| Retrieval | Process evaluation | You want to assess the textual quality of retrieved context but don’t have ground truth | Measures how relevant the retrieved context chunks are to addressing a query, using an LLM judge | Binary: Pass/Fail based on threshold (1-5 scale) |
| Groundedness | System evaluation | You want a well-rounded groundedness definition that works with agent inputs, and you bring your own GPT models as the LLM judge | Measures how well the generated response aligns with the given context without fabricating content (precision aspect) | Binary: Pass/Fail based on threshold (1-5 scale) |
| Groundedness Pro (preview) | System evaluation | You want a strict groundedness definition powered by Azure AI Content Safety and use our service model | Detects whether the response is strictly consistent with the context, using the Azure AI Content Safety service | Binary: True/False |
| Relevance | System evaluation | You want to assess how well the RAG response addresses the query but don’t have ground truth | Measures the accuracy, completeness, and direct relevance of the response to the query | Binary: Pass/Fail based on threshold (1-5 scale) |
| Response Completeness | System evaluation | You want to ensure the RAG response doesn’t miss critical information (recall aspect) from your ground truth | Measures how completely the response covers the expected information compared to ground truth | Binary: Pass/Fail based on threshold (1-5 scale) |
Think of groundedness and response completeness this way:
  • Groundedness focuses on the precision aspect of the response: it shouldn’t contain content outside of the grounding context.
  • Response completeness focuses on the recall aspect of the response: it shouldn’t miss critical information compared to the expected response or ground truth.

System evaluation

System evaluation examines the quality of the final response in your RAG workflow. These evaluators ensure that the AI-generated content is accurate, relevant, and complete based on the provided context and user query:
  • Groundedness - Is the response grounded in the provided context without fabrication?
  • Groundedness Pro - Does the response strictly adhere to the context (Azure AI Content Safety)?
  • Relevance - Does the response accurately address the user’s query?
  • Response Completeness - Does the response cover all critical information from ground truth?

Process evaluation

Process evaluation assesses the quality of the document retrieval step in RAG systems. The retrieval step is crucial for providing relevant context to the language model:
  • Retrieval - How relevant are the retrieved context chunks to the query?
  • Document Retrieval - How well does retrieval match ground truth labels (requires qrels)?
For more examples, see all quality evaluator samples.

Using RAG evaluators

RAG evaluators assess how well AI systems retrieve and use context to generate grounded responses. Each evaluator requires specific data mappings and parameters:
| Evaluator | Required inputs | Required parameters |
| --- | --- | --- |
| Groundedness | (response, context) OR (query, response) | deployment_name |
| Groundedness Pro | query, response, context | (none) |
| Relevance | query, response | deployment_name |
| Response Completeness | ground_truth, response | deployment_name |
| Retrieval | query, context | deployment_name |
| Document Retrieval | retrieval_ground_truth, retrieved_documents | (none) |

Example input

Your test dataset should contain the fields referenced in your data mappings:
{"query": "What are the store hours?", "context": "Our store is open Monday-Friday 9am-6pm and Saturday 10am-4pm.", "response": "The store is open weekdays from 9am to 6pm and Saturdays from 10am to 4pm."}
{"query": "What is the return policy?", "context": "Items can be returned within 30 days with original receipt for full refund.", "response": "You can return items within 30 days if you have your receipt."}

Configuration example

Data mapping syntax:
  • {{item.field_name}} references fields from your test dataset (for example, {{item.query}}).
  • {{sample.output_items}} references agent responses generated or retrieved during evaluation. Use this when evaluating with an agent target or agent response data source. For agent evaluation, context is optional if the response contains tool calls—the evaluator can extract context from tool call results.
# Built-in RAG evaluators; model_deployment is the name of your LLM-judge model deployment.
testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "groundedness",
        "evaluator_name": "builtin.groundedness",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {
            "context": "{{item.context}}",
            "response": "{{item.response}}",
        },
    },
    {
        "type": "azure_ai_evaluator",
        "name": "relevance",
        "evaluator_name": "builtin.relevance",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {"query": "{{item.query}}", "response": "{{item.response}}"},
    },
    {
        "type": "azure_ai_evaluator",
        "name": "retrieval",
        "evaluator_name": "builtin.retrieval",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {"query": "{{item.query}}", "context": "{{item.context}}"},
    },
]
See Run evaluations in the cloud for details on running evaluations and configuring data sources.
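As a rough, non-authoritative sketch of what running these criteria in the cloud can look like, assuming the OpenAI Python SDK's evals API against your Foundry project (the client setup, environment variable names, data_source_config schema, and file ID are placeholders; follow Run evaluations in the cloud for the exact steps):
import os
from openai import OpenAI

# Placeholder client setup; use the endpoint, authentication, and API version
# described in "Run evaluations in the cloud" for your Foundry project.
client = OpenAI(
    base_url=os.environ["FOUNDRY_PROJECT_ENDPOINT"],  # assumed environment variable
    api_key=os.environ["FOUNDRY_API_KEY"],            # assumed environment variable
)

# Create an evaluation that reuses the testing_criteria defined above.
evaluation = client.evals.create(
    name="rag-quality-eval",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "context": {"type": "string"},
                "response": {"type": "string"},
            },
            "required": ["query", "context", "response"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=testing_criteria,
)

# Start a run against the uploaded JSONL test dataset.
run = client.evals.runs.create(
    evaluation.id,
    data_source={
        "type": "jsonl",
        "source": {"type": "file_id", "id": "<uploaded-file-id>"},  # placeholder file ID
    },
)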

Example output

These evaluators return scores on a 1-5 Likert scale (1 = very poor, 5 = excellent). The default pass threshold is 3. Scores at or above the threshold are considered passing. Key output fields:
{
    "type": "azure_ai_evaluator",
    "name": "Groundedness",
    "metric": "groundedness",
    "score": 4,
    "label": "pass",
    "reason": "The response is well-grounded in the provided context without fabricating content.",
    "threshold": 3,
    "passed": true
}

Document retrieval

Because of its upstream role in RAG, retrieval quality is important. If retrieval quality is poor and the response requires corpus-specific knowledge, your language model is less likely to give a satisfactory answer. The most precise measurement is to use the document_retrieval evaluator to evaluate retrieval quality and optimize your search parameters for RAG.
  • Document retrieval evaluator measures how well the RAG system retrieves the correct documents from the document store. As a composite evaluator useful for RAG scenarios with ground truth, it computes a list of search quality metrics for debugging your RAG pipelines:

    | Metric | Category | Description |
    | --- | --- | --- |
    | Fidelity | Search Fidelity | How well the top n retrieved chunks reflect the content for a given query: the number of good documents returned out of the total number of known good documents in the dataset |
    | NDCG | Search NDCG | How close the ranking is to an ideal order in which all relevant items appear at the top of the list |
    | XDCG | Search XDCG | How good the results are within the top-k documents, regardless of how other documents in the index are scored |
    | Max Relevance N | Search Max Relevance | The maximum relevance in the top-k chunks |
    | Holes | Search Label Sanity | The number of documents with missing query relevance judgments (ground truth) |
  • To optimize your RAG system via a parameter sweep, you can use these metrics to calibrate the search parameters for the best RAG results. Generate retrieval results for each combination of search parameters you want to test, such as the search algorithm (vector, semantic), top_k, and chunk size. Then use document_retrieval to find the search parameters that yield the highest retrieval quality, as sketched below.
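A minimal sketch of that sweep, assuming a hypothetical run_search(query, algorithm, top_k, chunk_size) helper that wraps your own search system and a hypothetical load_queries_with_qrels() helper that returns queries with their relevance labels; each configuration produces its own JSONL dataset that you then score with document_retrieval:
import itertools
import json

# Hypothetical search parameters to sweep over.
algorithms = ["vector", "semantic"]
top_k_values = [3, 5, 10]
chunk_sizes = [256, 512]

# Hypothetical helper: returns [(query, retrieval_ground_truth), ...]
queries = load_queries_with_qrels()

for algorithm, top_k, chunk_size in itertools.product(algorithms, top_k_values, chunk_sizes):
    file_name = f"retrieval_{algorithm}_k{top_k}_c{chunk_size}.jsonl"
    with open(file_name, "w", encoding="utf-8") as f:
        for query, qrels in queries:
            # run_search is a placeholder for your own retrieval call.
            retrieved = run_search(query, algorithm=algorithm, top_k=top_k, chunk_size=chunk_size)
            f.write(json.dumps({
                "retrieval_ground_truth": qrels,
                "retrieved_documents": retrieved,
            }) + "\n")
    # Evaluate each file with the document_retrieval criteria below, then compare
    # metrics such as NDCG and fidelity across configurations.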

Document retrieval example

testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "document_retrieval",
        "evaluator_name": "builtin.document_retrieval",
        "initialization_parameters": {
            "ground_truth_label_min": 1,
            "ground_truth_label_max": 5,
        },
        "data_mapping": {
            "retrieval_ground_truth": "{{item.retrieval_ground_truth}}",
            "retrieval_documents": "{{item.retrieved_documents}}",
        },
    },
]
The retrieval_ground_truth field contains human-labeled relevance scores per document:
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
]
The retrieved_documents field contains relevance scores from your search system:
retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},
    {"document_id": "6", "relevance_score": 35.8},
]
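To build intuition for the NDCG metric in the output below, here is a small illustrative sketch of NDCG@k computed from these two structures. It is not the evaluator's implementation, and the gain and discount choices are assumptions:
import math

def ndcg_at_k(retrieval_ground_truth, retrieved_documents, k=3):
    # Map each labeled document to its human relevance label; unlabeled documents get zero gain.
    labels = {d["document_id"]: d["query_relevance_label"] for d in retrieval_ground_truth}
    # Rank documents as the search system would: highest relevance_score first, keep top k.
    ranked = sorted(retrieved_documents, key=lambda d: d["relevance_score"], reverse=True)[:k]
    dcg = sum(
        (2 ** labels.get(d["document_id"], 0) - 1) / math.log2(rank + 2)
        for rank, d in enumerate(ranked)
    )
    # Ideal DCG: the best possible ordering of the labeled documents.
    ideal = sorted(labels.values(), reverse=True)[:k]
    idcg = sum((2 ** label - 1) / math.log2(rank + 2) for rank, label in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0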

Document retrieval output

The document_retrieval evaluator returns multiple metrics for retrieval quality:
[
    {
        "type": "azure_ai_evaluator",
        "name": "Document Retrieval",
        "metric": "ndcg@3",
        "score": 0.646,
        "label": "pass",
        "passed": true
    },
    {
        "type": "azure_ai_evaluator",
        "name": "Document Retrieval",
        "metric": "fidelity",
        "score": 0.019,
        "label": "fail",
        "passed": false
    },
    # more metrics...
]