How cloud evaluation works
A cloud evaluation has three steps:- Define what to evaluate. Describe your data shape (the
data_source_config) and the evaluators (testing criteria) that score it. - Create the evaluation. Submit the definition by using
openai_client.evals.create(). - Run it and read the results. Start a run by using
openai_client.evals.runs.create(), poll until it completes, and read the scored results. See Get results for the result schema.
Choose your starting point
Existing dataset
Use this path when you already have queries and responses collected in a file (or queries plus ground truth) and you just want Foundry to score them. JSONL supports both turn-level rows and conversation-level inputs; CSV is turn-level only.| Scenario | When to use | Data source type |
|---|---|---|
| Turn-level dataset evaluation | Each row is one query/response pair, optionally with context or ground_truth. | jsonl or csv |
| Conversation-level dataset evaluation (preview) | Each row is a conversation expressed as a messages array. | jsonl |
Data in Foundry or Application Insights
Use this path when your agent is already running and you want to evaluate what actually happened. Instead of moving data out, you point Foundry at the data where it already lives - by Foundry response ID or by Application Insights trace or conversation ID.| Scenario | When to use | Data source type |
|---|---|---|
| Agent response evaluation | Your agent runs in Foundry and you have response IDs to score. | azure_ai_responses |
| Turn-level trace evaluation (preview) | Your agent emits OpenTelemetry traces to Application Insights - including non-Foundry frameworks like LangChain or custom OpenTelemetry-instrumented agents. Each trace is scored independently. | azure_ai_trace_data_source_preview |
| Conversation-level trace evaluation (preview) | Same trace sources, but score full conversations - by conversation ID or by agent filter with sampling. | azure_ai_trace_data_source_preview |
Inputs without responses
Use this path when you have the inputs but no responses yet. Foundry generates responses against a model or agent target at evaluation time, then scores them. Pick a row based on whether your input is queries (sent as individual turns) or scenario descriptions (used to drive a conversation-level interaction).| Scenario | When to use | Data source / target |
|---|---|---|
| Model Target completions | You have queries and want to evaluate responses from a model deployment. | azure_ai_target_completions → azure_ai_model |
| Agent Target completions | You have queries and want to evaluate responses from a Foundry agent. | azure_ai_target_completions → azure_ai_agent |
| Conversation simulation (preview) | You have scenario descriptions (no queries); Foundry simulates a user driving a conversation-level interaction with the agent. | azure_ai_target_completions → azure_ai_agent |
No data yet
Use this path when you’re building a new model or agent and haven’t collected any inputs. Foundry generates the test data from scratch - choose synthetic queries for broad quality coverage or adversarial prompts for safety testing.| Scenario | When to use | Data source / target |
|---|---|---|
| Synthetic data evaluation (preview) | You want quality coverage beyond what you’d write by hand. Foundry generates test queries, sends them to the target, and scores responses. | azure_ai_synthetic_data_gen_preview → azure_ai_model or azure_ai_agent |
| Red team evaluation | You want automated adversarial testing - Foundry generates jailbreaks and harmful-content prompts and scores how the target responds. | azure_ai_red_team → azure_ai_model or azure_ai_agent |
Choose evaluators
Each scenario binds evaluators to fields in your data through column mappings. The available fields depend on the data source. Dataset scenarios expose your custom item fields, while target-generated scenarios also expose the model or agent response via a sample schema. The per-scenario subsections later in this article show the column mappings for each case. For an overview of available evaluators and how to pick them, see built-in evaluators and custom evaluators.Prerequisites
- A Foundry project.
- An Azure OpenAI deployment with a GPT model that supports chat completion (for example,
gpt-5-mini). - Foundry User role on the Foundry project.
The Foundry RBAC roles were recently renamed. Foundry User, Foundry Owner, Foundry Account Owner, and Foundry Project Manager were previously named Azure AI User, Azure AI Owner, Azure AI Account Owner, and Azure AI Project Manager. You might still see the previous names in some places while the rename rolls out. The role IDs and core permissions are unchanged by the rename.
- Optionally, you can use your own storage account to run evaluations.
Some evaluation features have regional restrictions. See supported regions for details.
Get started
Install the SDK and set up your client:Prepare input data
Most evaluation scenarios require input data. You can provide data in two ways:Upload a dataset (recommended)
Upload a JSONL or CSV file to create a versioned dataset in your Foundry project. Datasets support versioning and reuse across multiple evaluation runs. Use this approach for production testing and CI/CD workflows. Prepare a JSONL file with one JSON object per line containing the fields your evaluators need:Provide data inline
For quick experimentation with small test sets, provide data directly in the evaluation request usingfile_content.
source as the "source" field in your data source configuration when creating a run. The scenario sections that follow use file_id by default.
Dataset evaluation
Evaluate pre-computed responses in a JSONL file using thejsonl data source type. This scenario is useful when you already have model outputs and want to assess their quality.
Define the data schema and evaluators
Specify the schema that matches your JSONL fields, and select the evaluators (testing criteria) to run. Use thedata_mapping parameter to connect fields from your input data to evaluator parameters with {{item.field}} syntax. Always include data_mapping with the required input fields for each evaluator. Your field names must match those in your JSONL file — for example, if your data has "question" instead of "query", use "{{item.question}}" in the mapping. For the required parameters per evaluator, see built-in evaluators.
Create evaluation and run
Create the evaluation, then start a run against your uploaded dataset. The run executes each evaluator on every row in the dataset.CSV dataset evaluation
Evaluate precomputed responses in a CSV file by using thecsv data source type. This scenario works the same way as dataset evaluation but accepts CSV files instead of JSONL. Use CSV when your data is already in spreadsheet or tabular format.
Prepare a CSV file
Create a CSV file with column headers that match the fields your evaluators need. Each row represents one test case.Upload and run
Upload the CSV file as a dataset. Then, create an evaluation by using thecsv data source type. The schema definition and evaluator configuration are the same as for JSONL evaluations. The only difference is the "type": "csv" in the data source.
Model target evaluation
Send queries to a deployed model at runtime. Evaluate the responses by using theazure_ai_target_completions data source type with an azure_ai_model target. Your input data contains queries. The model generates responses, which you then evaluate.
Define the message template and target
Theinput_messages template controls how queries are sent to the model. Use {{item.query}} to reference fields from your input data. Specify the model to evaluate and optional sampling parameters:
Set up evaluators and data mappings
When the model generates responses at runtime, use{{sample.output_text}} in data_mapping to reference the model’s output. Use {{item.field}} to reference fields from your input data.
Create evaluation and run
Agent target evaluation
Send queries to a Foundry agent at runtime and evaluate the responses by using theazure_ai_target_completions data source type with an azure_ai_agent target. This scenario works for both prompt agents and hosted agents.
Define the message template and target
Theinput_messages template controls how queries are sent to the agent. Use {{item.query}} to reference fields from your input data. Specify the agent to evaluate by name:
Set up evaluators and data mappings
When the agent generates responses at runtime, use{{sample.*}} variables in data_mapping to reference the agent’s output:
| Variable | Description | Use for |
|---|---|---|
{{sample.output_text}} | The agent’s plain text response. | Evaluators that expect a string response (for example, coherence, violence). |
{{sample.output_items}} | The agent’s structured JSON output, including tool calls. | Evaluators that need full interaction context (for example, task_adherence). |
{{item.field}} | A field from your input data. | Input fields like query or ground_truth. |
Create evaluation and run
Hosted agent invocations protocol
Hosted agents that use the invocations protocol support the sameazure_ai_agent target type but use a freeform input_messages format. Instead of the structured template format, provide a JSON object that maps directly to the agent’s /invocations request body. Use {{item.*}} placeholders to substitute fields from your input data.
If a hosted agent supports both the responses and invocations protocols, the service defaults to using the invocations protocol.
Define the message format and target
Create evaluation and run
{{sample.output_text}} for the agent’s text response and {{sample.output_items}} for the full structured output including tool calls.
Agent response evaluation
Retrieve and evaluate Foundry agent responses by response IDs using theazure_ai_responses data source type. Use this scenario to evaluate specific agent interactions after they occur.
A response ID is a unique identifier returned each time a Foundry agent generates a response. You can collect response IDs from agent interactions by using the Responses API or from your application’s trace logs. Provide the IDs inline as file content, or upload them as a dataset (see Prepare input data).
Collect response IDs
Each call to the Responses API returns a response object with a uniqueid field. Collect these IDs from your application’s interactions, or generate them directly:
Create evaluation and run
Trace evaluation (preview)
Evaluate agent interactions that Application Insights already captured. Use theazure_ai_traces data source type. This scenario is useful for post-deployment evaluation of real production traffic. You select traces from your monitoring pipeline and run evaluators against them without replaying any requests.
Trace evaluation is the recommended approach for evaluating agents not built with the Microsoft Foundry Agent Service - including LangChain and custom frameworks. As long as your agent emits OpenTelemetry spans following the GenAI semantic conventions to Application Insights, trace evaluation can assess its interactions by using the same evaluators available for Foundry agents.
- By trace IDs - Evaluate specific agent interactions by providing their
operation_Idvalues from Application Insights. - By agent filter - Automatically discover and evaluate recent traces for a given agent, without manually collecting trace IDs.
Intelligent sampling
Trace evaluation supports intelligent sampling, which selects a representative subset of traces for evaluation instead of evaluating every captured trace. Enable this feature by turning on the Intelligent sampling toggle in the Foundry portal when you configure a trace evaluation run. Intelligent sampling reduces evaluation cost while preserving trace diversity - ensuring that edge cases, error paths, and varied conversation patterns are included in the evaluated set.How intelligent sampling works
The sampling algorithm uses a MinHash farthest-first diversity approach that runs in multiple stages:- Exact deduplication - Removes duplicate traces from the pool.
- Hard filters - Removes broken sessions, truncated traces, and malformed tool calls that aren’t suitable for evaluation.
- Aggregation - Combines trace-level signals into a unified representation.
- MinHash farthest-first selection - Computes locality-sensitive hashes (MinHash signatures) of user text to estimate similarity between traces, then iteratively selects the most dissimilar trace from the remaining pool. Each successive pick maximizes distance from all previously selected traces.
- Evaluation and benchmarks - Maximizes coverage of the input distribution so evaluation scores reflect real-world diversity.
- Rubric generation - Produces more focused and actionable rubrics by exposing diverse conversation patterns.
- Finetuning dataset curation - Selects traces that help models learn more efficiently.
Intelligent sampling example
Trace data requirements
Trace evaluation requires your agent to emit spans that follow the OpenTelemetry semantic conventions for generative AI. Specifically, the evaluation service readsinvoke_agent spans from Application Insights and extracts conversation data from their attributes.
The following span attributes are used:
| Attribute | Required | Description |
|---|---|---|
gen_ai.operation.name | Yes | Must equal "invoke_agent". The service ignores all other spans. |
gen_ai.agent.id | For agent filter mode | Unique agent identifier (format: agent-name:version). |
gen_ai.agent.name | For agent filter mode | Human-readable agent name. |
gen_ai.input.messages | For evaluators query inputs | JSON array of input messages following the GenAI semantic conventions message format. Messages with role user or system map to query; messages with role assistant or tool map to response. |
gen_ai.output.messages | For evaluators query inputs | JSON array of model-generated output messages. All output messages map to response. If output also contains type: tool_call or type: tool_result, it maps to tool_calls. |
gen_ai.tool.definitions | Optional | JSON array of tool schemas available to the agent. If absent, the service attempts to infer tool definitions from tool call messages, but inferred schemas might be incomplete. |
gen_ai.conversation.id | Optional | Conversation identifier, passed through to evaluation results for correlation. |
If
gen_ai.input.messages and gen_ai.output.messages are empty or missing, quality evaluators (coherence, fluency, relevance, intent resolution) return score=None. Safety evaluators (violence, self-harm, sexual, hate/unfairness) can still produce scores with partial data but they might not produce meaningful results.[tracing] extra to enable automatic span emission:
Prerequisites for trace evaluation
In addition to the general prerequisites, trace evaluation requires:- An Application Insights resource connected to your Foundry project. See Set up tracing in Microsoft Foundry.
- The project’s managed identity must have the Log Analytics Reader role on both the Application Insights resource and its linked Log Analytics workspace.
- The
azure-monitor-queryPython package (only needed if you collect trace IDs manually).
APPINSIGHTS_RESOURCE_ID— The Application Insights resource ID (for example,/subscriptions/<subscription_id>/resourceGroups/<rg_name>/providers/Microsoft.Insights/components/<resource_name>).AGENT_ID— The agent identifier emitted by the tracing integration (gen_ai.agent.idattribute), used to filter traces. Format:agent-name:version.TRACE_LOOKBACK_HOURS— (Optional) Number of hours to look back when querying traces. Defaults to1.
Option A: Evaluate by agent filter
The simplest approach is to let the service automatically discover and evaluate recent traces for a specific agent. No manual trace ID collection needed.invoke_agent spans by the gen_ai.agent.id attribute, samples up to max_traces unique trace IDs, and evaluates all spans from those traces.
Option B: Evaluate by trace IDs
For more control, collect specific trace IDs from Application Insights and evaluate them. This method is useful when you want to evaluate a curated set of interactions, such as traces flagged by alerts or sampled for quality review.Collect trace IDs from Application Insights
Query Application Insights foroperation_Id values from your agent’s traces. Each operation_Id represents a complete agent interaction:
Create evaluation and run with trace IDs
Set up evaluators and data mappings
When you evaluate traces, the service automatically extracts conversation data from the OpenTelemetry span attributes. Use these field names directly indata_mapping (without the item. or sample. prefixes used in other scenarios):
| Variable | Source attribute | Description |
|---|---|---|
{{item.query}} | gen_ai.input.messages (user/system roles) | The user query extracted from the trace. |
{{item.response}} | gen_ai.input.messages (assistant/tool roles) + gen_ai.output.messages | The agent’s response extracted from the trace. |
{{item.tool_definitions}} | gen_ai.tool.definitions | Tool schemas available to the agent. Only required for tool-related evaluators. |
{{item.tool_calls}} | Extracted from assistant messages in gen_ai.input.messages / gen_ai.output.messages | Tool calls made by the agent during the interaction. Used by tool evaluators. Only required for tool-related evaluators. |
Synthetic data evaluation (preview)
Use theazure_ai_synthetic_data_gen_preview data source type to generate synthetic test queries, send them to a deployed model or Foundry agent, and evaluate the responses. Use this scenario when you don’t have a test dataset. The service generates queries based on a prompt you provide (and/or from the agent’s instructions), runs them against your target, and evaluates the responses.
How synthetic data evaluation works
- The service generates synthetic queries based on your
promptand optional seed data files. - Each query is sent to the specified target (model or agent) to generate a response.
- Evaluators score each response using the generated query and response.
- The generated queries are stored as a dataset in your project for reuse.
Parameters
| Parameter | Required | Description |
|---|---|---|
samples_count | Yes | Maximum number of synthetic test queries to generate. |
model_deployment_name | Yes | Model deployment to use for generating synthetic queries. Only models with Responses API capability are supported. For availability, see Responses API region availability. |
prompt | No | Instructions describing the type of queries to generate. Optional when the agent target has instructions configured. |
output_dataset_name | No | Name for the output dataset where generated queries are stored. If you don’t provide a name, the service generates one automatically. |
sources | No | Seed data files (by file ID) to improve relevance of generated queries. Currently only one file is supported. |
Set up evaluators and data mappings
The synthetic data generator produces queries in the{{item.query}} field. The target generates responses available in {{sample.output_text}}. Map these fields to your evaluators:
Create evaluation and run
- Python
- cURL
Model target
Generate synthetic queries and evaluate a model:input_messages with synthetic data generation, include only system role messages - the service provides the generated queries as user messages automatically.Agent target
Generate synthetic queries and evaluate a Foundry agent:output_dataset_id property that contains the ID of the generated dataset, which you can use to retrieve or reuse the synthetic data.
Conversation-level evaluation (preview)
Evaluate complete conversations to assess agent quality across entire user interactions - not just individual responses. Use conversation-level evaluation to identify quality problems like incomplete task resolution, user frustration, and tool-call regressions that turn-level evaluation misses. For example, consider a support agent where the user grows frustrated over multiple turns:Turn 1 — User: “I need to reset my password.” Agent: “I found your account. I’ll send a reset link.” Turn 2 — User: “I didn’t get the email.” Agent: “I’ve resent the link. Please check spam.” Turn 3 — User: “Still nothing. Can you just reset it directly?” Agent: “I’ve sent another reset link.”A turn-level evaluator scores only the last response - which is polite and takes action - so it scores well. A conversation-level evaluator grading customer satisfaction across the conversation flags that the agent repeated the same failing action three times without trying an alternative, leaving the user’s problem unresolved. Conversation-level evaluation differs from turn-level evaluation in several ways:
| Aspect | Turn-level | Conversation-level |
|---|---|---|
| Scope | Individual query-response pairs | Complete conversations with multiple exchanges |
| Metrics | Per-response quality and safety | Conversation-level outcomes and user satisfaction |
| Data format | JSONL with query and response fields | JSONL with messages array containing the full conversation |
| Use case | Testing individual model responses | Testing end-to-end agent experiences |
| Option | When to use | Data source type |
|---|---|---|
| From dataset or inline | You have local conversation traces or test data | jsonl with file_id or file_content |
| By conversation ID | You want to evaluate specific conversations from App Insights | azure_ai_trace_data_source_preview with trace_source |
| By agent filter with sampling | You want to assess overall agent quality across sampled production traffic | azure_ai_trace_data_source_preview with trace_source |
| Simulated conversations | You want to generate synthetic test conversations | azure_ai_target_completions with conversation_gen_preview |
Choose an evaluation level
Theevaluation_level parameter on the run determines whether evaluators score individual turns or complete conversations:
| Value | Behavior |
|---|---|
"turn" | Evaluators score each turn independently. |
"conversation" | Evaluators score the entire conversation as a whole. |
| (omitted) | Defaults to "turn". |
Evaluator compatibility: Each evaluator supports specific evaluation levels. Check the evaluator’s
supported_evaluation_levels field in the evaluator catalog.- Turn-only evaluators (for example,
fluency,relevance) can’t be used withevaluation_level="conversation". - Currently, all conversation-level evaluators support both
"turn"and"conversation"levels.
Common errors
| Error | Cause | Solution |
|---|---|---|
| Incompatible evaluation level | Using evaluation_level="conversation" with a turn-only evaluator | Remove the turn-only evaluator or change to evaluation_level="turn" |
Prepare conversation data
Create a JSONL file where each line contains a complete conversation in themessages field. Each message should include a role (user, assistant, or system) and content. For a complete example, see the conversation evaluation samples in the SDK:
Define the data schema and evaluators
Specify the schema for your conversation data, “messages”, and select evaluators designed for conversation-level evaluation. Conversation-level evaluators assess the entire interaction rather than individual turns.Create evaluation and run
- Python
- cURL
Prep: download sample_data_multiturn_conversations.jsonl
Evaluate conversations by ID from traces
Evaluate specific conversations from Application Insights by providing their conversation IDs. Use this option to root-cause problems or verify fixes on specific interactions. For example, you can investigate a conversation flagged by an alert or verify a fix for a known issue.Where to find conversation IDs
Find conversation IDs in:- Application Insights trace logs UI — Browse to interesting traces and locate the
conversation_idfield in the trace details. - Your application’s logging output — If you set
conversation_idexplicitly when creating agent responses, retrieve it from your logs. - OpenTelemetry trace context — The
conversation_idmight also be derived from the traceparent header if your agent uses standard trace context propagation.
Tool definitions are automatically retrieved from the traces or queried from the agent registry. You don’t need to provide them in the request.
Parameters for conversation ID lookup
| Parameter | Required | Description |
|---|---|---|
conversation_ids | Yes | Array of conversation IDs to evaluate. |
lookback_hours | No | Hours to search back from end_time. Defaults to seven days (168 hours). |
end_time | No | End of the search window (ISO 8601 format). Defaults to the current time. |
- Application Insights data ingestion can cause a delay between when traces are generated and when they’re available for evaluation. If the query doesn’t find traces, wait a few minutes and retry.
- The maximum lookback is 7 days (168 hours). To access older traces, use
start_timeandend_timewithin your App Insights retention limits.
Evaluate sampled conversations by agent filter
Evaluate a sampled set of conversations from Application Insights by filtering on agent name. Use this option to assess overall agent quality across production traffic. For example, run regular quality assessments or monitor for quality degradation in production. The agent you specify for filtering can be part of a multi-agent conversation. The filter matches any conversation where that agent participated.Tool definitions are automatically retrieved from the traces or queried from the agent registry. You don’t need to provide them in the request.
Agent identity fields
Specify the agent to filter by using one of these formats:| Format | Example | Description |
|---|---|---|
agent_name + agent_version | "agent_name": "my-agent", "agent_version": "1" | Two separate fields. If agent_version is omitted, use the latest version. |
agent_id | "agent_id": "my-agent:1" | Single string in "name:version" format. |
Filter strategies
| Strategy | Description |
|---|---|
random_sampling | (Default) Uniformly random sample up to max_traces conversations. |
smart_filtering | Service-managed heuristic that biases toward “interesting” traces - conversations with potential problems, edge cases, or anomalies. |
Parameters
| Parameter | Required | Description |
|---|---|---|
agent_name | Yes | The agent name to filter traces by. |
agent_version | No | The agent version. If omitted, uses the latest version. |
agent_id | No | Alternative to agent_name + agent_version. Single string in format "name:version". |
start_time | Yes | Start of the time window (Unix epoch seconds, UTC). |
end_time | Yes | End of the time window (Unix epoch seconds, UTC). Pad by +600 seconds to avoid ingestion delay. |
max_traces | No | Maximum conversations to sample. Defaults to 1,000. |
filter_strategy | No | "random_sampling" (default) or "smart_filtering" (service-managed heuristic that biases toward interesting traces). |
The time window (
end_time - start_time) must be at least 15 minutes (900 seconds). This requirement exists because conversation-level queries apply a 5-minute inactivity buffer on each edge to avoid partial conversations.The App Insights query timespan is currently limited to a maximum of 7 days (168 hours). You can’t access traces older than 7 days without explicitly providing
start_time and end_time within App Insights retention limits.Conversation simulation
Generate simulated conversations from scenario descriptions and evaluate them at the conversation level. Use this scenario to test your agent’s behavior in controlled situations before deployment. The service generates realistic conversations based on your scenario descriptions and then evaluates them. This approach is useful for:- Pre-deployment testing: Validate agent behavior across diverse scenarios without real user traffic.
- Edge case coverage: Test scenarios that rarely occur naturally but are important to handle well.
- Regression testing: Ensure agent updates don’t degrade performance on known scenarios.
- Scale testing: Generate many conversations quickly to stress-test agent capabilities.
How conversation simulation works
- You provide a dataset of scenario descriptions—each row describes a situation the simulated user tries to accomplish.
- The service uses a simulator model to play the role of the user, interacting with your agent based on the scenario.
- Each scenario generates one or more complete conversations.
- Conversation-level evaluators assess the generated conversations.
- Your project stores both the conversations and evaluation results.
Prepare scenario data
Create a JSONL file where each line describes a scenario for the simulated user. The schema requiresid, test_case_description, and desired_num_turns. Include details about the user’s goal, context, and any constraints. For a complete example, see the conversation evaluation samples in the SDK.
Parameters
| Parameter | Required | Description |
|---|---|---|
num_conversations | No | Number of conversations to generate per scenario. Defaults to 5, server-side cap of 5. |
max_turns | No | Maximum number of turns (exchanges) per conversation. Defaults to 10, server-side cap of 20. |
model | Yes | Model deployment to use for simulating the user. For example, gpt-4.1. |
sampling_params | No | Sampling parameters for the simulator model, including temperature, top_p, and max_completion_tokens. |
data_mapping | No | Maps fields from your scenario JSONL to simulation parameters. Common mappings: test_case_description, id, desired_num_turns. |
Define evaluators
Select evaluators designed for conversation-level assessment. The simulated conversations automatically map to the evaluators.Create evaluation and run
- Python
- cURL
Prep: download sample_data_simulation_scenarios.jsonl.
Get results
After an evaluation run completes, retrieve the scored results and review them in the portal or programmatically.Poll for results
Evaluation runs are asynchronous. Poll the run status until it completes, then retrieve the results:Interpret results
For a single data example, all evaluators output the following schema:- Label: a binary “pass” or “fail” label, similar to a unit test’s output. Use this result to facilitate comparisons across evaluators.
- Score: a score from the natural scale of each evaluator. Some evaluators use a fine-grained rubric, scoring on a 5-point scale (quality evaluators) or a 7-point scale (content safety evaluators). Others, like textual similarity evaluators, use F1 scores, which are floats between 0 and 1. Any nonbinary “score” is binarized to “pass” or “fail” in the “label” field based on the “threshold”.
- Threshold: any nonbinary scores are binarized to “pass” or “fail” based on a default threshold, which the user can override in the SDK experience.
- Reason: To improve intelligibility, all LLM-judge evaluators also output a reasoning field to explain why a certain score is given.
- Details: (optional) For some evaluators, such as tool_call_accuracy, there might be a “details” field or flags that contain additional information to help users debug their applications.
Example output (single item)
Example output (aggregate)
For aggregate results over multiple data examples (a dataset), the average rate of the examples with a “pass” forms the passing rate for that dataset.Troubleshooting
Job running for a long time
Your evaluation job might remain in the Running state for an extended period. This condition typically occurs when the Azure OpenAI model deployment doesn’t have enough capacity, causing the service to retry requests. Resolution:- Cancel the current evaluation job by using
openai_client.evals.runs.cancel(run_id, eval_id=eval_id). - Increase the model capacity in the Azure portal.
- Run the evaluation again.
Authentication errors
If you receive a401 Unauthorized or 403 Forbidden error, verify that:
- Your
DefaultAzureCredentialis configured correctly. If you’re using Azure CLI, runaz login. - Your account has the Foundry User role on the Foundry project.
- The project endpoint URL is correct and includes both the account and project names.
Data format errors
If the evaluation fails with a schema or data mapping error:- Verify your JSONL file has one valid JSON object per line.
- Confirm that field names in
data_mappingmatch the field names in your JSONL file exactly (case-sensitive). - Check that
item_schemaproperties match the fields in your dataset.
Rate limit errors
Tenant, subscription, and project levels rate-limit evaluation run creations. If you receive a429 Too Many Requests response:
- Check the
retry-afterheader in the response for the recommended wait time. - Review the response body for rate limit details.
- Use exponential backoff when retrying failed requests.
429 error during execution:
- Reduce the size of your evaluation dataset or split it into smaller batches.
- Increase the tokens-per-minute (TPM) quota for your model deployment in the Azure portal.
Agent evaluator tool errors
If an agent evaluator returns an error for unsupported tools:- Check the supported tools for agent evaluators.
- As a workaround, wrap unsupported tools as user-defined function tools so the evaluator can assess them.