Agent Optimizer is currently in limited preview and only available through a sign-up process. To access the service, complete the intake form. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Prerequisites
- A Foundry project with a deployed hosted agent
- The azure.ai.agents CLI extension installed (see Quickstart: Optimize a hosted agent)
Generate a dataset (recommended)
The fastest way to create an evaluation dataset is with azd ai agent eval init. This command generates a dataset and adaptive evaluators tuned to your agent’s domain:
The command uses azure.yaml and prompts for a generation instruction that describes what your agent does and which scenarios to test.
Example output:
Non-interactive mode
For scripted workflows, pass the inputs directly:
Use your own data with generated evaluators
If you already have a golden dataset but want auto-generated evaluators:
Run optimization with the generated config
After eval init completes, azd ai agent optimize auto-detects the generated eval.yaml:
Create a custom dataset manually (advanced)
For full control over evaluation tasks and criteria, create a JSONL dataset by hand. This approach is useful when you need precise control over test scenarios or want to use production data directly. By default, azd ai agent optimize uses a built-in dataset with 3 general coding tasks and 25 criteria. To optimize your specific agent meaningfully, create a custom dataset that reflects its real-world use cases.
Dataset format
Datasets use JSONL (JSON Lines) format. Each line is one JSON object that represents a single evaluation task. A task is an individual scenario in the dataset. It contains a prompt and evaluation criteria.
Field reference
| Field | Required | Description |
|---|---|---|
| name | Yes | Unique task identifier (for example, "greeting", "math_test") |
| prompt | Yes | The message sent to the agent |
| criteria | Yes | Array of evaluation criteria: rules that define what “good” looks like for the task |
| criteria[].name | Yes | Short name for the criterion (for example, "is_polite") |
| criteria[].instruction | Yes | What the evaluator checks. Be specific and testable. The built-in evaluator (builtin.task_adherence) scores each criterion independently as a binary value (0 or 1). |
| groundTruth | No | Expected answer (used by some evaluators for reference) |
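As a minimal sketch of the format described above, the following Python builds two tasks and serializes them to JSONL. Only the field names (name, prompt, criteria, groundTruth) come from the field reference. The task content, the to_jsonl and write_dataset helpers, and any file name you pass in are hypothetical and exist only for illustration.

```python
import json
from pathlib import Path

# Hypothetical tasks for illustration only; replace them with your agent's scenarios.
TASKS = [
    {
        "name": "return_window",
        "prompt": "I bought a jacket 10 days ago. Can I still return it?",
        "criteria": [
            {
                "name": "states_return_window",
                "instruction": "The response states how many days the customer has to return an item.",
            },
            {
                "name": "is_polite",
                "instruction": "The response contains no dismissive or rude language.",
            },
        ],
        # Optional reference answer (made up for this sketch).
        "groundTruth": "Items can be returned within 30 days of purchase.",
    },
    {
        "name": "out_of_scope",
        "prompt": "Can you give me legal advice about my lease?",
        "criteria": [
            {
                "name": "declines_request",
                "instruction": "The response declines to give legal advice.",
            },
        ],
    },
]


def to_jsonl(tasks):
    """Serialize tasks as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(task, ensure_ascii=False) + "\n" for task in tasks)


def write_dataset(path, tasks):
    """Write the tasks to a UTF-8 encoded JSONL file."""
    Path(path).write_text(to_jsonl(tasks), encoding="utf-8")
```

Calling write_dataset with a path of your choice and TASKS produces a two-line file, one task per line. Generating the file from code rather than typing it by hand avoids the malformed-line problems that hand-edited JSONL tends to pick up.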
Example: Customer support agent
Example: Coding assistant
Use a custom dataset
Reference your dataset in a YAML config file:
Tips for writing good datasets
Be specific in criteria
Bad:
Include edge cases
Test beyond the happy path. Include:
- Out-of-scope requests: Inputs your agent should decline or redirect
- Ambiguous queries: Tasks where the agent should ask for clarification
- Adversarial inputs: Attempts to trick the agent into bad behavior
- Multi-step tasks: Complex requests that require structured reasoning
Size guidelines
| Dataset size | Trade-off |
|---|---|
| 3–5 tasks | Quick iteration, limited signal |
| 5–10 tasks | Good balance of speed and coverage |
| 10–20 tasks | Comprehensive evaluation, longer runs |
| 20+ tasks | Thorough but slow — consider for final validation |
Write prompts like real users
Use actual messages from your users if possible. Real prompts capture the vocabulary and context that your agent faces in production.
Criteria are scored independently
Each criterion gets a binary score (0 or 1). The task score is the average of its criteria scores. The overall score is the average across all tasks. This means:
- A task with 4 criteria where 3 pass scores 0.75
- An agent that passes all criteria on 2 of 3 tasks and none on the third scores 0.67
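The scoring arithmetic above can be sketched in a few lines of Python. This is an illustration of the averaging rule only, not the evaluator's implementation; the function names and the sample task names are hypothetical.

```python
def task_score(criteria_scores):
    """Average the binary (0 or 1) criterion scores for one task."""
    if not criteria_scores:
        raise ValueError("a task needs at least one criterion score")
    if any(score not in (0, 1) for score in criteria_scores):
        raise ValueError("criterion scores must be 0 or 1")
    return sum(criteria_scores) / len(criteria_scores)


def overall_score(scores_by_task):
    """Average the per-task scores across all tasks."""
    per_task = [task_score(scores) for scores in scores_by_task.values()]
    return sum(per_task) / len(per_task)


# A task with 4 criteria where 3 pass.
print(task_score([1, 1, 1, 0]))  # 0.75

# Two tasks pass every criterion; the third passes none.
results = {"task_a": [1, 1], "task_b": [1, 1, 1], "task_c": [0, 0]}
print(round(overall_score(results), 2))  # 0.67
```

Because every task carries equal weight in the overall score, a task with one criterion influences the result as much as a task with ten.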
Ground truth is optional
The groundTruth field provides a reference answer for evaluators that support it. This field isn’t required. The builtin.task_adherence evaluator works entirely from criteria instructions.
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| dataset_file not found | Wrong path in eval.yaml | Use a path relative to the config file location |
| invalid JSON on line N | Malformed JSONL | Validate that each line is valid JSON. Check for trailing commas. |
| Scores are inconsistent between runs | Vague criteria | Make criteria specific and binary-testable |
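To catch malformed lines before a run, a small pre-flight check can help. The following Python is a sketch under stated assumptions: it checks only the required fields listed in the field reference, it skips blank lines (how the CLI treats them is not stated here), and the validate_jsonl function is a hypothetical helper, not part of the azd tooling.

```python
import json

REQUIRED_TASK_FIELDS = ("name", "prompt", "criteria")
REQUIRED_CRITERION_FIELDS = ("name", "instruction")


def validate_jsonl(text):
    """Return a list of (line_number, message) problems found in JSONL text."""
    problems = []
    seen_names = set()
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # Assumption: blank lines are ignored rather than reported.
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            # Trailing commas and unquoted keys end up here.
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(task, dict):
            problems.append((number, "line is not a JSON object"))
            continue
        for field in REQUIRED_TASK_FIELDS:
            if field not in task:
                problems.append((number, f"missing required field '{field}'"))
        name = task.get("name")
        if isinstance(name, str):
            if name in seen_names:
                problems.append((number, f"duplicate task name '{name}'"))
            seen_names.add(name)
        if "criteria" in task:
            criteria = task["criteria"]
            if not isinstance(criteria, list):
                problems.append((number, "'criteria' must be an array"))
                continue
            for index, criterion in enumerate(criteria):
                if not isinstance(criterion, dict):
                    problems.append((number, f"criteria[{index}] is not an object"))
                    continue
                for field in REQUIRED_CRITERION_FIELDS:
                    if field not in criterion:
                        problems.append(
                            (number, f"criteria[{index}] is missing '{field}'")
                        )
    return problems
```

To check a file, read it as UTF-8 text, pass the contents to validate_jsonl, and print each returned line number and message. An empty list means no problems were found by these checks.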