Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Use the Azure Developer (azd) CLI evaluation experience to add a measured quality loop to an agent created with Microsoft Foundry. This article focuses on the hosted-agent lifecycle in azd, where you create, provision, deploy, initialize evaluation assets, run a first evaluation, inspect the run, and reuse the evaluation recipe for later runs.
Prompt-based agents can also be evaluated when they are available as agent targets in the Foundry project. The hosted-agent deployment steps apply only to hosted agents.
This article covers how to run the first agent evaluation with azd ai agent eval init and azd ai agent eval run.
Prerequisites
- An Azure subscription with access to Microsoft Foundry.
- The Azure Developer CLI (azd). For installation instructions, see Install the Azure Developer CLI.
- The azd ai agent extension installed (azd extension install azure.ai.agents). If you don’t have the extension installed, it is installed automatically when you initialize the starter template or run azd ai agent. To learn more about the azd AI agent extension, see Microsoft Foundry agent extension.
- An authenticated azd session. To check your authentication status, run azd auth status. If you’re not signed in, run azd auth login.
- The Foundry User role on the Foundry resource (previously named Azure AI User). For more information, see Role-based access control for Microsoft Foundry.
- For hosted agents: No preexisting Foundry project is required. azd ai agent init and azd provision create the necessary resources.
- For prompt-based agents: An existing Foundry project with the agent already deployed and available as an evaluation target.
- A model deployment that supports chat completions in the same Foundry project.
- Optional: a JSONL evaluation dataset with representative examples, if you do not want eval init to generate a smoke dataset.
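If you supply your own dataset, a quick local check can catch malformed lines before eval init reads the file. The following is a minimal sketch and not part of azd: it only verifies that each non-empty line parses as a JSON object. The function name check_jsonl is hypothetical, and the sketch makes no assumption about which fields the evaluation service expects.

```python
import json


def check_jsonl(path):
    """Return a list of (line_number, problem) tuples for a JSONL file.

    Hypothetical helper: checks only that each non-empty line is a JSON
    object. It does not validate any evaluation-specific fields.
    """
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # Ignore blank lines.
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((number, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "expected a JSON object"))
    return problems
```

An empty result means every non-empty line is a JSON object; it does not mean the records are suitable for a particular evaluator.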
How azd agent evaluations work
The primary azd CLI evaluation experience is designed for the hosted-agent lifecycle:

| Item | Description |
|---|---|
| eval init | Creates or repairs local evaluation assets for an agent target. |
| eval.yaml | Local runnable evaluation recipe. It records the agent target, dataset reference, evaluator references, and generation options. |
| Generated local artifacts | Editable local copies of generated datasets and evaluator rubrics. The artifacts are stored under datasets/ and evaluators/ in the agent folder (for example, src/<agent-name>/datasets/ and src/<agent-name>/evaluators/). |
| Registered service artifacts | The Foundry dataset and evaluator versions used by evaluation runs. These are the source of truth for generated assets. |
| eval run | Runs the evaluation recipe against the selected agent target. |
| eval update | Registers new service versions from local dataset or evaluator edits and updates eval.yaml after confirmation. |
| eval list and eval show | Inspect evaluation runs and results from the CLI. |
| optimize --config eval.yaml | Optionally starts optimization from an evaluation recipe after the agent and recipe meet optimization prerequisites. |
azd provision does not create evaluation datasets, evaluators, suites, or optimization jobs. Evaluation setup can involve generation work that takes minutes, so it stays explicit and retryable.
For hosted agents, the first evaluation requires a deployed and invokable agent target. For prompt-based agents, the deployment step does not apply; the agent must already exist in the Foundry project and be available as an evaluation target.
Create and deploy a hosted agent
If you do not already have a hosted-agent project, initialize one with azd:
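The create, provision, and deploy steps could be scripted as in the following sketch. It is illustrative only: azd ai agent init and azd provision are named in this article, while using azd deploy for the deployment step and passing no extra arguments are assumptions. The helper name run_commands is hypothetical. By default the sketch prints the commands instead of running them.

```python
import shlex
import subprocess

# Assumed command sequence for a new hosted-agent project.
# `azd deploy` and the lack of extra arguments are assumptions.
HOSTED_AGENT_SETUP = [
    ["azd", "ai", "agent", "init"],
    ["azd", "provision"],
    ["azd", "deploy"],
]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    printed = []
    for argv in commands:
        line = shlex.join(argv)
        printed.append(line)
        print(line)
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, check=True)
    return printed


if __name__ == "__main__":
    run_commands(HOSTED_AGENT_SETUP)
```

Keeping the dry run as the default makes it easy to review the sequence before anything is provisioned.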
Initialize evaluation assets
Run eval init from the azd workspace or agent project folder:
--output is optional and defaults to eval.yaml in the agent project root. Use --output <path> to write the config to a different location.
To use an existing dataset and selected evaluators:
Replace ./tests/support-golden.jsonl with the path to your own evaluation dataset.
The --dataset value can point to a local file or a registered dataset name. Repeat --evaluator to include multiple built-in or registered custom evaluators. Evaluator references use the format <source>.<name>:
- builtin.<name>: references a built-in evaluator provided by Foundry.
- <name>: references a custom evaluator registered in the Foundry project. Use the evaluator’s registered name without the version suffix.
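To make the reference format concrete, the following sketch classifies evaluator references and assembles an eval init argument list with a repeated --evaluator flag. The function names and the custom evaluator name support-tone are hypothetical; the flags are the ones named in this section.

```python
BUILTIN_PREFIX = "builtin."


def classify_evaluator(reference):
    """Return "builtin" or "custom" for an evaluator reference string."""
    if reference in ("", BUILTIN_PREFIX):
        raise ValueError(f"empty evaluator name in {reference!r}")
    return "builtin" if reference.startswith(BUILTIN_PREFIX) else "custom"


def eval_init_args(dataset=None, evaluators=(), output=None):
    """Build the argument list for `azd ai agent eval init`."""
    argv = ["azd", "ai", "agent", "eval", "init"]
    if dataset:
        argv += ["--dataset", dataset]
    for reference in evaluators:
        classify_evaluator(reference)  # Reject empty names early.
        argv += ["--evaluator", reference]  # One flag per evaluator.
    if output:
        argv += ["--output", output]
    return argv


args = eval_init_args(
    dataset="./tests/support-golden.jsonl",
    evaluators=["builtin.task_adherence", "support-tone"],  # second is hypothetical
)
print(" ".join(args))
```

Building the arguments as a list rather than one string avoids quoting problems if a dataset path contains spaces.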
Defer generation with --no-wait
If dataset or evaluator generation takes too long, use --no-wait to submit generation jobs and exit immediately:
Pending generation operations are recorded in eval.yaml. When you later run azd ai agent eval run, it automatically resumes those operations before starting the evaluation run.
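The deferred flow could look like the following sketch, which only builds and prints the two commands in order. The helper name is hypothetical, and no options beyond --no-wait are assumed.

```python
def deferred_generation_flow():
    """Return the two commands of the deferred flow, in order."""
    return [
        # Submit generation jobs and exit without waiting for them.
        ["azd", "ai", "agent", "eval", "init", "--no-wait"],
        # Later: resume the pending operations, then start the run.
        ["azd", "ai", "agent", "eval", "run"],
    ]


for argv in deferred_generation_flow():
    print(" ".join(argv))
```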
Use a prompt-based agent target
If you initialized evaluation assets for a prompt-based agent, you can use the same evaluation recipe flow. The hosted-agent deployment step is not required for prompt-based agents. Before you run an evaluation, confirm that:
- The prompt-based agent exists in the Foundry project.
- The agent is available as an evaluation target.
- You have access to the project endpoint and the agent target.
- eval.yaml selects the intended prompt-based agent.
Review eval.yaml
After eval init succeeds, open eval.yaml in the agent project root. For example:
Run eval run from this directory, or pass the path explicitly with --config src/reservation-agent/eval.yaml.
The file identifies the agent target, dataset reference, evaluator references, and generation options. A simplified shape is:
- eval.yaml lives at the agent project root, for example src/<agent-name>/eval.yaml.
- Generated datasets live under datasets/ and generated evaluator rubrics live under evaluators/ in the agent folder.
- local_uri paths in eval.yaml are relative to the agent project directory.
- Local files referenced by local_uri are editable. Run azd ai agent eval update to register local changes as a new version in the service and bump the version in eval.yaml.
- eval run uses the registered version pinned in eval.yaml. To apply local edits, run eval update before eval run.
- Evaluators can be built-in references (for example, builtin.task_adherence) or generated custom evaluators with name, version, and local_uri.
- Treat version fields as strings, even if they look numeric, so the recipe remains stable across YAML parsers.
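Two of these points can be illustrated with a short sketch: resolving a local_uri against the agent project directory, and why a version that looks numeric is safer as a string. The file name smoke.jsonl, the helper name, and the version value 1.10 are hypothetical examples; the sketch does not read a real eval.yaml.

```python
from pathlib import Path


def resolve_local_uri(agent_dir, local_uri):
    """Resolve a local_uri relative to the agent project directory."""
    return Path(agent_dir) / local_uri


dataset_path = resolve_local_uri("src/reservation-agent", "datasets/smoke.jsonl")
print(dataset_path.as_posix())

# A version kept as a string survives unchanged.
version_as_string = "1.10"
# The same characters read as a number lose the trailing zero,
# so "1.10" and "1.1" would become indistinguishable.
version_as_number = float("1.10")
print(version_as_string, str(version_as_number))
```

The second half shows the kind of drift that string-typed versions avoid when a parser would otherwise interpret the value as a number.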
Run the evaluation
From the agent project folder, run:
eval run resolves eval.yaml in the agent project root. You can also pass the config path explicitly:
If eval init --no-wait created pending generation operations, eval run resumes those operations before it starts the evaluation run. It does not start new dataset or evaluator generation jobs from scratch.
Inspect evaluation runs
List recent evaluation runs:
eval show defaults to the most recently completed evaluation run.
Show a specific run by its run ID. Copy the ID from the azd ai agent eval list output:
In the run details, check:
- Which agent version was evaluated.
- Which dataset and evaluator versions were resolved.
- Whether the run completed, failed, or completed partially.
- Which metrics or evaluator scores were produced.
- Whether token usage or evaluator logs need investigation.
Re-run after changing the agent
After you update and redeploy a hosted agent, run the same evaluation recipe again:
Reusing eval.yaml helps keep dataset, evaluator, and threshold references stable across agent changes.
Update, reset, or repair evaluation assets
The agent evaluation flow uses eval.yaml as the local evaluation recipe. Use azd ai agent eval update when you edit local dataset files or evaluator rubrics and want to register those edits as new service versions.
To update what an evaluation run uses, choose the path that matches the type of change:
| Change | How to update |
|---|---|
| Change thresholds, evaluator references, output settings, or other recipe fields | Edit eval.yaml, then run azd ai agent eval run --config eval.yaml. |
| Use a different local or registered dataset | Edit the dataset reference in eval.yaml, or rerun azd ai agent eval init --dataset <path-or-name> --output eval.yaml. |
| Add or change evaluator references | Edit eval.yaml, or rerun azd ai agent eval init with repeatable --evaluator values. |
| Register local edits to a generated dataset or evaluator rubric | Run azd ai agent eval update, review the detected changes, and confirm the version-reference update in eval.yaml. |
| Start over from the default generated setup | Run azd ai agent eval init --reset-defaults. |
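The table can also be read as a lookup from the kind of change to the command that applies it. The following sketch encodes that lookup; the change labels and function names are hypothetical, and the commands are the ones listed in the table. Manual edits to eval.yaml are not represented because they are not commands.

```python
EVAL = ["azd", "ai", "agent", "eval"]

# Hypothetical change labels mapped to the commands from the table.
COMMANDS_BY_CHANGE = {
    "recipe_fields": [EVAL + ["run", "--config", "eval.yaml"]],
    "local_asset_edits": [EVAL + ["update"]],
    "reset_defaults": [EVAL + ["init", "--reset-defaults"]],
}


def commands_for(change):
    """Return a copy of the command list for a change label."""
    return [list(argv) for argv in COMMANDS_BY_CHANGE[change]]


def switch_dataset(path_or_name):
    """Rerun init against a different local or registered dataset."""
    return EVAL + ["init", "--dataset", path_or_name, "--output", "eval.yaml"]


print(" ".join(switch_dataset("./tests/support-golden.jsonl")))
```

An unknown label raises KeyError, which is deliberate: a change that the table does not cover should not silently map to a command.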
After you edit generated files under datasets/ or evaluators/ in the agent folder, run:
If eval.yaml already exists, eval init detects it and prints the existing config:
--reset-defaults overwrites the local eval.yaml and regenerates the default evaluation assets. Existing service-registered dataset and evaluator versions are not deleted; only the local recipe is replaced.
Do not rely on remote latest versions changing the local recipe silently. The local eval.yaml records the dataset, evaluator, or suite versions used by the recipe for reproducibility.
Optional: start optimization from evaluation signal
After at least one evaluation run succeeds, you can use eval.yaml as input to agent optimization if the agent and recipe meet the optimization prerequisites.
Before starting optimization, confirm that:
- The agent target is ready for optimization. For hosted agents, the agent is deployed and invokable.
- eval.yaml references the intended agent, dataset, evaluator versions, and thresholds.
- At least one evaluation run completed successfully.
- The agent preparation required by the optimizer is complete. For optimizer prerequisites and agent preparation requirements, see Optimize agent prompts with Prompt Optimizer.
azd ai agent optimize --config eval.yaml starts optimization from the evaluation recipe in eval.yaml. It submits an optimization job, but it does not silently apply source changes or redeploy the candidate agent. Review any optimizer output before applying changes.
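The checklist in this section can be treated as a gate in front of the optimize command. The following is a minimal sketch under that assumption: the class, its field names, and the function are hypothetical, the answers come from the operator rather than from azd, and only the command line itself is taken from this article.

```python
from dataclasses import dataclass


@dataclass
class OptimizationReadiness:
    """Checklist items from this section, answered by the operator."""

    agent_target_ready: bool
    recipe_reviewed: bool
    evaluation_run_succeeded: bool
    agent_prepared_for_optimizer: bool

    def missing(self):
        """Return the names of checklist items that are not yet met."""
        return [name for name, met in vars(self).items() if not met]


def optimize_args(readiness, config="eval.yaml"):
    """Return the optimize command only when every item is met."""
    unmet = readiness.missing()
    if unmet:
        raise RuntimeError("not ready for optimization: " + ", ".join(unmet))
    return ["azd", "ai", "agent", "optimize", "--config", config]
```

Raising an error instead of returning a partial command keeps a script from submitting an optimization job before a baseline evaluation exists.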
Best practices
- Run azd ai agent eval init only after the agent is available as an evaluation target. For hosted agents, the agent must be deployed and invokable.
- Start with a small generated dataset or a small subset of your golden dataset.
- Check generated dataset and evaluator review artifacts before trusting scores.
- After editing generated dataset or evaluator files, run azd ai agent eval update to register the edited assets before running the evaluation again.
- Source-control eval.yaml if your team wants a reviewable, reproducible evaluation recipe.
- Consider source-controlling generated datasets and evaluator rubrics under datasets/ and evaluators/ in the agent folder if your team reviews and edits them as part of the evaluation recipe.
- Re-run the same eval.yaml after agent changes so comparisons use the same test recipe.
- Use azd ai agent optimize --config eval.yaml only after you have a useful baseline evaluation result and the agent is prepared for optimization.
Limitations
- The primary command flow is optimized for hosted agents and the post-deploy evaluation loop.
- azd provision does not create evaluation assets.
- eval run does not generate new datasets or evaluators, except for resuming pending operations from eval init --no-wait.
- Full suite lifecycle, scheduled evaluation, continuous evaluation, alerts, and comparison workflows are not required for the first evaluation path.
Related content
- Evaluate your AI agents
- Human evaluation for Microsoft Foundry agents
- Evaluation cluster analysis
- Optimize agent prompts with Prompt Optimizer
- Set up tracing for AI agents in Microsoft Foundry
- Monitor agents with the Agent Monitoring Dashboard
- Hosted agents in Foundry Agent Service
- Agent development lifecycle