
How to run an evaluation in GitHub Action (preview)

This article refers to the Microsoft Foundry (new) portal.
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
This GitHub Action enables offline evaluation of Microsoft Foundry Agents within your CI/CD pipelines. It’s designed to streamline the offline evaluation process, so you can identify potential problems and make improvements before releasing an update to production. To use this action, provide a data set with test queries and a list of evaluators. This action invokes your agents with the queries, runs the evaluations, and generates a summary report.

Features

  • Agent Evaluation: Automate pre-production assessment of Microsoft Foundry agents in your CI/CD workflow.
  • Evaluators: Use any evaluators from the Foundry evaluator catalog.
  • Statistical Analysis: Evaluation results include confidence intervals and tests for statistical significance to determine whether changes are meaningful rather than due to random variation.

Evaluator categories

Prerequisites

The recommended way to authenticate is by using Microsoft Entra ID, which allows you to securely connect to your Azure resources. You can automate the authentication process by using the Azure Login GitHub action. To learn more, see Azure Login action with OpenID Connect.
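For reference, a minimal login step with federated credentials might look like the following sketch; the variable names are placeholders for values you configure in your own repository, and the workflow must also grant the id-token: write permission for OpenID Connect to work. The full workflow later in this article shows both.

- name: Azure login using Federated Credentials
  uses: azure/login@v2
  with:
    # Placeholder repository variables; configure these for your own Azure setup.
    client-id: ${{ vars.AZURE_CLIENT_ID }}
    tenant-id: ${{ vars.AZURE_TENANT_ID }}
    subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}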

How to set up AI agent evaluations

AI agent evaluations input

Parameters

| Name | Required? | Description |
| - | - | - |
| azure-ai-project-endpoint | Yes | Endpoint of your Microsoft Foundry project. |
| deployment-name | Yes | The name of the Azure AI model deployment to use for evaluation. |
| data-path | Yes | Path to the data file that contains the evaluators and input queries for evaluations. |
| agent-ids | Yes | ID of one or more agents to evaluate, in the format agent-name:version (for example, my-agent:1 or my-agent:1,my-agent:2). Multiple agents are comma-separated and compared with statistical test results. |
| baseline-agent-id | No | ID of the baseline agent to compare against when evaluating multiple agents. If not provided, the first agent is used as the baseline. |
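For example, to evaluate two versions of the same agent and treat version 1 as the baseline, the action inputs might look like the following sketch. The agent name, versions, and data file path are placeholders for your own values.

# Inputs for the evaluation step: compare two versions of a hypothetical agent,
# using version 1 as the baseline for the statistical comparison.
with:
  azure-ai-project-endpoint: "<your-ai-project-endpoint>"
  deployment-name: "<your-deployment-name>"
  data-path: ${{ github.workspace }}/evals/dataset.json
  agent-ids: "my-agent:1,my-agent:2"
  baseline-agent-id: "my-agent:1"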

Data file

The input data file should be a JSON file with the following structure:

| Field | Type | Required? | Description |
| - | - | - | - |
| name | string | Yes | Name of the evaluation dataset. |
| evaluators | string[] | Yes | List of evaluator names to use. Check out the list of available evaluators in your project’s evaluator catalog in the Foundry portal: Build > Evaluations > Evaluator catalog. |
| data | object[] | Yes | Array of input objects with query and optional evaluator fields like ground_truth and context. Automapped to evaluators; use data_mapping to override. |
| openai_graders | object | No | Configuration for OpenAI-based evaluators (label_model, score_model, string_check, etc.). |
| evaluator_parameters | object | No | Evaluator-specific initialization parameters (for example, thresholds, custom settings). |
| data_mapping | object | No | Custom data field mappings (autogenerated from data if not provided). |

Basic sample data file

{
  "name": "test-data",
  "evaluators": [
    "builtin.fluency",
    "builtin.task_adherence",
    "builtin.violence",
  ],
  "data": [
    {
      "query": "Tell me about Tokyo disneyland"
    },
    {
      "query": "How do I install Python?"
    }
  ]
}
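Items in data can also carry the optional evaluator fields from the structure table, such as ground_truth and context, which are automatically mapped to evaluators. The following is a minimal sketch under that assumption; the builtin.* evaluator names are illustrative, so confirm the exact names in your project's evaluator catalog.

{
  "name": "test-data-with-ground-truth",
  "evaluators": [
    "builtin.relevance",
    "builtin.groundedness"
  ],
  "data": [
    {
      "query": "What is the capital of France?",
      "context": "France is a country in Western Europe. Its capital and largest city is Paris.",
      "ground_truth": "Paris"
    }
  ]
}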

Additional sample data files

| Filename | Description |
| - | - |
| dataset-tiny.json | Dataset with a small number of test queries and evaluators. |
| dataset.json | Dataset with all supported evaluator types and enough queries for confidence interval calculation and statistical tests. |
| dataset-builtin-evaluators.json | Built-in Foundry evaluators example (for example, coherence, fluency, relevance, groundedness, metrics). |
| dataset-openai-graders.json | OpenAI-based graders example (label models, score models, text similarity, string checks). |
| dataset-custom-evaluators.json | Custom evaluators example with evaluator parameters. |
| dataset-data-mapping.json | Data mapping example showing how to override automatic field mappings with custom data column names. |

AI agent evaluations workflow

To use this action, add it to your CI/CD workflows and specify the trigger criteria, such as a push to a specific branch, and the file paths that should trigger the workflow.
To minimize costs, don't run the evaluation on every commit.
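For example, you can add path filters to the trigger so the evaluation runs only when agent code or evaluation data changes. The paths shown here are placeholders for your own repository layout.

on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      # Placeholder paths; adjust to where your agent code and evaluation data live.
      - "agents/**"
      - "evals/**"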
The following example shows how to run AI agent evaluation to compare different agents by using their agent IDs.
name: "AI Agent Evaluation"

on:
  workflow_dispatch:
  push:
    branches:
      - main

permissions:
  id-token: write
  contents: read

jobs:
  run-action:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Azure login using Federated Credentials
        uses: azure/login@v2
        with:
          client-id: ${{ vars.AZURE_CLIENT_ID }}
          tenant-id: ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}

      - name: Run Evaluation
        uses: microsoft/ai-agent-evals@v3-beta
        with:
          # Replace placeholders with values for your Foundry Project
          azure-ai-project-endpoint: "<your-ai-project-endpoint>"
          deployment-name: "<your-deployment-name>"
          agent-ids: "<your-ai-agent-ids>"
          data-path: ${{ github.workspace }}/path/to/your/data-file

AI agent evaluations output

Evaluation results are written to the summary section of each AI agent evaluation run, under the Actions tab in GitHub. The following is a sample report that compares two agents.
Screenshot of agent evaluation result in GitHub Action.