How to run an evaluation in GitHub Actions (preview)
This article refers to the Microsoft Foundry (new) portal.
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Features
- Agent Evaluation: Automate pre-production assessment of Microsoft Foundry agents in your CI/CD workflow.
- Evaluators: Use any evaluators from the Foundry evaluator catalog.
- Statistical Analysis: Evaluation results include confidence intervals and statistical significance tests to determine whether changes are meaningful rather than due to random variation.
Evaluator categories
- Agent evaluators: Process and system-level evaluators for agent workflows.
- RAG evaluators: Evaluate end-to-end and retrieval processes in RAG systems.
- Risk and safety evaluators: Assess risks and safety concerns in responses.
- General purpose evaluators: Quality evaluation such as coherence and fluency.
- OpenAI-based graders: Use OpenAI graders, including string check, text similarity, and score/label model graders.
- Custom evaluators: Define your own custom evaluators using Python code or LLM-as-a-judge patterns.
Prerequisites
- A project. To learn more, see Create a project.
- A Foundry agent.
How to set up AI agent evaluations
AI agent evaluations input
Parameters
| Name | Required? | Description |
|---|---|---|
| azure-ai-project-endpoint | Yes | Endpoint of your Microsoft Foundry Project. |
| deployment-name | Yes | The name of the Azure AI model deployment to use for evaluation. |
| data-path | Yes | Path to the data file that contains the evaluators and input queries for evaluations. |
| agent-IDs | Yes | IDs of one or more agents to evaluate, in the format agent-name:version (for example, my-agent:1 or my-agent:1,my-agent:2). Multiple agents are comma-separated and compared by using statistical tests. |
| baseline-agent-id | No | ID of the baseline agent to compare against when evaluating multiple agents. If not provided, the first agent is used. |
Data file
The input data file should be a JSON file with the following structure:

| Field | Type | Required? | Description |
|---|---|---|---|
| name | string | Yes | Name of the evaluation dataset. |
| evaluators | string[] | Yes | List of evaluator names to use. Check out the list of available evaluators in your project’s evaluator catalog in the Foundry portal: Build > Evaluations > Evaluator catalog. |
| data | object[] | Yes | Array of input objects with query and optional evaluator fields like ground_truth and context. Fields are automapped to evaluators; use data_mapping to override. |
| openai_graders | object | No | Configuration for OpenAI-based evaluators (label_model, score_model, string_check, and so on). |
| evaluator_parameters | object | No | Evaluator-specific initialization parameters (for example, thresholds, custom settings). |
| data_mapping | object | No | Custom data field mappings (autogenerated from data if not provided). |
Basic sample data file
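The following is a minimal sketch of a data file that matches the structure described above. The dataset name, evaluator names, queries, and ground truth values are illustrative placeholders; use evaluators from your project’s evaluator catalog and queries that match your agent’s scenario.

```json
{
  "name": "sample-eval-dataset",
  "evaluators": ["coherence", "fluency", "relevance"],
  "data": [
    {
      "query": "What is the return policy for online orders?",
      "ground_truth": "Online orders can be returned within 30 days of delivery."
    },
    {
      "query": "How do I reset my account password?",
      "ground_truth": "Use the Forgot password link on the sign-in page to receive a reset email."
    }
  ]
}
```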
Additional sample data files
| Filename | Description |
|---|---|
| dataset-tiny.json | Dataset with a small number of test queries and evaluators. |
| dataset.json | Dataset with all supported evaluator types and enough queries for confidence interval calculation and statistical testing. |
| dataset-builtin-evaluators.json | Built-in Foundry evaluators example (for example, coherence, fluency, relevance, groundedness, metrics). |
| dataset-openai-graders.json | OpenAI-based graders example (label models, score models, text similarity, string checks). |
| dataset-custom-evaluators.json | Custom evaluators example with evaluator parameters. |
| dataset-data-mapping.json | Data mapping example showing how to override automatic field mappings with custom data column names. |
AI agent evaluations workflow
To use the GitHub Action, add it to your CI/CD workflows. Specify the trigger criteria, such as on commit, and the file paths that trigger your automated workflows. This example shows how you can run AI agent evaluation to compare different agents by using their agent IDs.
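The following workflow is a minimal sketch, not a definitive configuration. The action reference and version (microsoft/ai-agent-evals@v2), the azure/login sign-in step, the trigger paths, and the secret and variable names are assumptions to adapt to your repository; verify them against the action's documentation. The inputs under `with` correspond to the parameters described earlier (the agent-IDs parameter is written as agent-ids here).

```yaml
name: AI agent evaluation

on:
  push:
    branches: [main]
    paths:
      - 'evaluations/**'   # placeholder: run when evaluation data changes

permissions:
  id-token: write   # required for OIDC-based Azure sign-in
  contents: read

jobs:
  evaluate-agents:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the data file referenced by data-path is available.
      - uses: actions/checkout@v4

      # Sign in to Azure. Assumes a federated (OIDC) credential is configured and the
      # client, tenant, and subscription IDs are stored as repository secrets.
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      # Run the AI agent evaluation action. Confirm the action reference, version,
      # and input names before use; values below are illustrative.
      - name: Run AI agent evaluation
        uses: microsoft/ai-agent-evals@v2
        with:
          azure-ai-project-endpoint: ${{ vars.AZURE_AI_PROJECT_ENDPOINT }}
          deployment-name: ${{ vars.MODEL_DEPLOYMENT_NAME }}
          data-path: ${{ github.workspace }}/evaluations/dataset.json
          agent-ids: 'my-agent:1,my-agent:2'
          baseline-agent-id: 'my-agent:1'
```

In this sketch, the two comma-separated agent IDs are evaluated and compared, with my-agent:1 treated as the baseline.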
AI agent evaluations output
Evaluation results are output to the summary section for each AI Evaluation GitHub Action run under Actions in GitHub. The following is a sample report for comparing two agents.