
How to run an evaluation in Azure DevOps (preview)

This article refers to the Microsoft Foundry (new) portal.
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
This Azure DevOps extension enables offline evaluation of Microsoft Foundry Agents within your CI/CD pipelines. It streamlines the offline evaluation process, so you can identify potential problems and make improvements before releasing an update to production. To use this extension, provide a data set with test queries and a list of evaluators. This task invokes your agents with the queries, evaluates them, and generates a summary report.

Features

  • Agent Evaluation: Automate pre-production assessment of Microsoft Foundry agents in your CI/CD workflow.
  • Evaluators: Use any evaluators from the Foundry evaluator catalog.
  • Statistical Analysis: Evaluation results include confidence intervals and statistical significance tests to determine whether differences between agents are meaningful rather than due to random variation.

Evaluator categories

Prerequisites

Inputs

Parameters

| Name | Required? | Description |
| - | - | - |
| azure-ai-project-endpoint | Yes | Endpoint of your Microsoft Foundry project. |
| deployment-name | Yes | The name of the Azure AI model deployment to use for evaluation. |
| data-path | Yes | Path to the data file that contains the evaluators and input queries for evaluations. |
| agent-ids | Yes | ID of one or more agents to evaluate, in the format agent-name:version (for example, my-agent:1 or my-agent:1,my-agent:2). Multiple agents are comma-separated and compared with statistical test results. |
| baseline-agent-id | No | ID of the baseline agent to compare against when evaluating multiple agents. If not provided, the first agent is used. |

Data file

The input data file should be a JSON file with the following structure:

| Field | Type | Required? | Description |
| - | - | - | - |
| name | string | Yes | Name of the evaluation dataset. |
| evaluators | string[] | Yes | List of evaluator names to use. Check the list of available evaluators in your project’s evaluator catalog in the Foundry portal: Build > Evaluations > Evaluator catalog. |
| data | object[] | Yes | Array of input objects with query and optional evaluator fields like ground_truth and context. Fields are automapped to evaluators; use data_mapping to override. |
| openai_graders | object | No | Configuration for OpenAI-based evaluators (label_model, score_model, string_check, and so on). |
| evaluator_parameters | object | No | Evaluator-specific initialization parameters (for example, thresholds, custom settings). |
| data_mapping | object | No | Custom data field mappings (autogenerated from data if not provided). |

Basic sample data file


{
  "name": "test-data",
  "evaluators": [
    "builtin.fluency",
    "builtin.task_adherence",
    "builtin.violence",
  ],
  "data": [
    {
      "query": "Tell me about Tokyo disneyland"
    },
    {
      "query": "How do I install Python?"
    }
  ]
}
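
The optional fields from the preceding table can be combined in the same file. The following is a minimal sketch rather than an exhaustive reference: the evaluator names and the threshold key inside evaluator_parameters are assumptions for illustration, so check your project’s evaluator catalog and the sample data files listed in the next section (for example, dataset-data-mapping.json for data_mapping overrides) for the exact names and shapes.

{
  "name": "test-data-with-ground-truth",
  "evaluators": [
    "builtin.fluency",
    "builtin.groundedness"
  ],
  "data": [
    {
      "query": "What is the capital of Japan?",
      "context": "Tokyo is the capital and largest city of Japan.",
      "ground_truth": "Tokyo"
    },
    {
      "query": "How do I install Python?",
      "context": "Python installers are available from python.org for Windows, macOS, and Linux.",
      "ground_truth": "Download and run the installer from python.org."
    }
  ],
  "evaluator_parameters": {
    "builtin.groundedness": {
      "threshold": 3
    }
  }
}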

Additional sample data files

| Filename | Description |
| - | - |
| dataset-tiny.json | Dataset with a small number of test queries and evaluators. |
| dataset.json | Dataset with all supported evaluator types and enough queries for confidence interval calculation and statistical tests. |
| dataset-builtin-evaluators.json | Built-in Foundry evaluators example (for example, coherence, fluency, relevance, groundedness, metrics). |
| dataset-openai-graders.json | OpenAI-based graders example (label models, score models, text similarity, string checks). |
| dataset-custom-evaluators.json | Custom evaluators example with evaluator parameters. |
| dataset-data-mapping.json | Data mapping example showing how to override automatic field mappings with custom data column names. |

Sample pipeline

To use this Azure DevOps extension, add the task to your Azure Pipelines YAML and configure authentication to access your Microsoft Foundry project.

steps:
  - task: AIAgentEvaluation@2
    displayName: "Evaluate AI Agents"
    inputs:
      azure-ai-project-endpoint: "$(AzureAIProjectEndpoint)"
      deployment-name: "$(DeploymentName)"
      data-path: "$(System.DefaultWorkingDirectory)/path/to/your/dataset.json"
      agent-ids: "$(AgentIds)"
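
To compare agent versions, pass a comma-separated list to agent-ids and, optionally, pin the baseline with baseline-agent-id, as described in the Parameters table. The following is a minimal sketch; the agent IDs are placeholders for your own agent name and versions, and the remaining inputs mirror the sample above.

steps:
  - task: AIAgentEvaluation@2
    displayName: "Compare two agent versions"
    inputs:
      azure-ai-project-endpoint: "$(AzureAIProjectEndpoint)"
      deployment-name: "$(DeploymentName)"
      data-path: "$(System.DefaultWorkingDirectory)/path/to/your/dataset.json"
      # Placeholder IDs in agent-name:version format; comma-separated for comparison.
      agent-ids: "my-agent:1,my-agent:2"
      # Optional: results for my-agent:2 are compared against this baseline.
      baseline-agent-id: "my-agent:1"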

Evaluation results and outputs

Evaluation results output to the summary section of each AI Agent Evaluation task run in your Azure DevOps pipeline, with detailed metrics and, when multiple agents are evaluated, comparisons between them. The following screenshot is a sample report comparing two agents.
Screenshot of agent evaluation result.