Run evaluations in the cloud by using the Microsoft Foundry SDK
This article refers to the Microsoft Foundry (new) portal.
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
How cloud evaluation works
To run a cloud evaluation, you create an evaluation definition with your data schema and testing criteria (evaluators), then create an evaluation run. The run executes each evaluator against your data and returns scored results that you can poll for completion. Cloud evaluation supports the following scenarios:

| Scenario | When to use | Data source type | Target |
|---|---|---|---|
| Dataset evaluation | Evaluate pre-computed responses in a JSONL file. | jsonl | — |
| Model target evaluation | Provide queries and generate responses from a model at runtime for evaluation. | azure_ai_target_completions | azure_ai_model |
| Agent target evaluation | Provide queries and generate responses from a Foundry agent at runtime for evaluation. | azure_ai_target_completions | azure_ai_agent |
| Agent response evaluation | Retrieve and evaluate Foundry agent responses by response IDs. | azure_ai_responses | — |
| Red team evaluation | Run automated adversarial testing against a model or agent. | azure_ai_red_team | azure_ai_model or azure_ai_agent |
Each run references its data through a source. Two source types are available:

| Source type | Description |
|---|---|
| file_id | Reference an uploaded dataset by ID. |
| file_content | Provide data inline in the request. |
Each scenario also specifies a `data_source_config` that tells the service what fields to expect in your data:
- `custom`: You define an `item_schema` with your field names and types. Set `include_sample_schema` to `true` when using a target so evaluators can reference generated responses.
- `azure_ai_source`: The schema is inferred by the service. Set `"scenario"` to `"responses"` for agent response evaluation or `"red_team"` for red teaming.
Prerequisites
- A Foundry project.
- An Azure OpenAI deployment with a GPT model that supports chat completion (for example, `gpt-5-mini`).
- Azure AI User role on the Foundry project.
Some evaluation features have regional restrictions. See supported regions for details.
Get started
Install the SDK and set up your client:
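A minimal setup sketch, assuming the `azure-ai-projects`, `azure-identity`, and `openai` packages and an OpenAI-compatible evals client exposed by the project client; the exact accessor name and package versions can differ by SDK release.

```python
# pip install azure-ai-projects azure-identity openai  (assumed package set)
import os

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Your Foundry project endpoint, copied from the project's overview page.
endpoint = os.environ["PROJECT_ENDPOINT"]

project_client = AIProjectClient(endpoint=endpoint, credential=DefaultAzureCredential())

# Assumption: the evals API is reached through an OpenAI-compatible client
# exposed by the project client; the accessor name may differ in your SDK version.
client = project_client.get_openai_client()
```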
Prepare input data
Most evaluation scenarios require input data. You can provide data in two ways:
Upload a dataset (recommended)
Upload a JSONL file to create a dataset in your Foundry project. Datasets are versioned and can be reused across multiple evaluation runs, which makes this approach a good fit for production testing and CI/CD workflows. Prepare a JSONL file with one JSON object per line containing the fields your evaluators need:
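A sketch of preparing and uploading a dataset, continuing from the client setup above. The field names (`query`, `response`, `ground_truth`) are examples, and the `datasets.upload_file` call and its parameters are assumptions about the project client's dataset surface.

```python
import json

# Example rows; include whichever fields your evaluators need.
rows = [
    {"query": "What is the capital of France?", "response": "Paris.", "ground_truth": "Paris"},
    {"query": "Who wrote Hamlet?", "response": "William Shakespeare.", "ground_truth": "Shakespeare"},
]

with open("eval_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Assumption: the project client exposes a dataset upload helper like this;
# check your SDK version for the exact method name and parameters.
dataset = project_client.datasets.upload_file(
    name="eval-dataset",
    version="1",
    file_path="eval_data.jsonl",
)
file_id = dataset.id  # used later as the file_id source
```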
Provide data inline
For quick experimentation with small test sets, provide data directly in the evaluation request using `file_content`.
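A sketch of an inline source, assuming the `file_content` payload wraps each row in an `item` object as described in the source type table above.

```python
# Inline data source for quick experiments; no dataset upload required.
# Assumed payload shape.
inline_source = {
    "type": "file_content",
    "content": [
        {"item": {"query": "What is the capital of France?", "response": "Paris.", "ground_truth": "Paris"}},
    ],
}
```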
Either approach produces a source that you pass as the "source" field in your data source configuration when you create a run. The scenario sections that follow use `file_id` by default.
Dataset evaluation
Evaluate pre-computed responses in a JSONL file using the `jsonl` data source type. This scenario is useful when you already have model outputs and want to assess their quality.
Define the data schema and evaluators
Specify the schema that matches your JSONL fields, and select the evaluators (testing criteria) to run. Use the `data_mapping` parameter to connect fields from your input data to evaluator parameters with `{{item.field}}` syntax. Always include `data_mapping` with the required input fields for each evaluator. Your field names must match those in your JSONL file. For example, if your data has "question" instead of "query", use `"{{item.question}}"` in the mapping. For the required parameters per evaluator, see built-in evaluators.
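A sketch of a custom schema and one testing criterion for this scenario. The grader `type` value, the `builtin.relevance` evaluator name, and its required inputs are assumptions; check the built-in evaluators reference for the exact names.

```python
# Custom schema matching the JSONL fields; no target, so include_sample_schema stays False.
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "response": {"type": "string"},
            "ground_truth": {"type": "string"},
        },
        "required": ["query", "response"],
    },
    "include_sample_schema": False,
}

# One evaluator with its data mapping; names and shape are assumptions.
testing_criteria = [
    {
        "type": "azure_ai_evaluator",           # assumed grader type
        "name": "relevance",
        "evaluator_name": "builtin.relevance",  # assumed built-in evaluator ID
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{item.response}}",
        },
    },
]
```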
Create evaluation and run
Create the evaluation, then start a run against your uploaded dataset. The run executes each evaluator on every row in the dataset.
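A sketch of creating the evaluation and starting a run against the uploaded dataset, assuming the `client.evals` surface referenced in the troubleshooting section and the `jsonl`/`file_id` source shape from the tables above.

```python
evaluation = client.evals.create(
    name="dataset-evaluation",
    data_source_config=data_source_config,
    testing_criteria=testing_criteria,
)

run = client.evals.runs.create(
    evaluation.id,
    name="dataset-run",
    data_source={
        "type": "jsonl",
        "source": {"type": "file_id", "id": file_id},  # or the inline file_content source
    },
)
print(run.id, run.status)
```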
Model target evaluation
Send queries to a deployed model at runtime and evaluate the responses using the `azure_ai_target_completions` data source type with an `azure_ai_model` target. Your input data contains queries; the model generates responses, which are then evaluated.
Define the message template and target
The `input_messages` template controls how queries are sent to the model. Use `{{item.query}}` to reference fields from your input data. Specify the model to evaluate and optional sampling parameters:
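A sketch of the run data source for a model target. The template item shape, the `target` property names, and the sampling parameter names are assumptions; substitute your own deployment name.

```python
data_source = {
    "type": "azure_ai_target_completions",
    "source": {"type": "file_id", "id": file_id},
    "input_messages": {
        "type": "template",
        "template": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "{{item.query}}"},
        ],
    },
    "target": {
        "type": "azure_ai_model",
        "model": "gpt-5-mini",                    # your model deployment name
        "sampling_params": {"temperature": 1.0},  # optional; assumed parameter name
    },
}
```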
Set up evaluators and data mappings
When the model generates responses at runtime, use `{{sample.output_text}}` in `data_mapping` to reference the model’s output. Use `{{item.field}}` to reference fields from your input data.
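A sketch of criteria that mix generated output and input fields; the evaluator name and grader shape are assumptions.

```python
testing_criteria = [
    {
        "type": "azure_ai_evaluator",           # assumed grader type
        "name": "coherence",
        "evaluator_name": "builtin.coherence",  # assumed built-in evaluator ID
        "data_mapping": {
            "query": "{{item.query}}",             # from your input data
            "response": "{{sample.output_text}}",  # generated by the model at runtime
        },
    },
]
```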
Create evaluation and run
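A sketch that ties the pieces together for the model target, assuming the same `client.evals` surface as before; `include_sample_schema` is set to `true` so evaluators can reference `{{sample.*}}`.

```python
evaluation = client.evals.create(
    name="model-target-evaluation",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        "include_sample_schema": True,  # lets evaluators reference {{sample.*}}
    },
    testing_criteria=testing_criteria,
)

run = client.evals.runs.create(evaluation.id, name="model-target-run", data_source=data_source)
print(run.id, run.status)
```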
Agent target evaluation
Send queries to a Foundry agent at runtime and evaluate the responses using the `azure_ai_target_completions` data source type with an `azure_ai_agent` target.
Define the message template and target
The `input_messages` template controls how queries are sent to the agent. Use `{{item.query}}` to reference fields from your input data. Specify the agent to evaluate by name:
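A sketch of the agent-target data source; the `agent_name` property is an assumption about how the target references your agent.

```python
data_source = {
    "type": "azure_ai_target_completions",
    "source": {"type": "file_id", "id": file_id},
    "input_messages": {
        "type": "template",
        "template": [{"role": "user", "content": "{{item.query}}"}],
    },
    "target": {
        "type": "azure_ai_agent",
        "agent_name": "my-agent",  # assumed property name for the agent reference
    },
}
```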
Set up evaluators and data mappings
When the agent generates responses at runtime, use `{{sample.*}}` variables in `data_mapping` to reference the agent’s output, as shown in the sketch after the following table:
| Variable | Description | Use for |
|---|---|---|
| {{sample.output_text}} | The agent’s plain text response. | Evaluators that expect a string response (for example, coherence, violence). |
| {{sample.output_items}} | The agent’s structured JSON output, including tool calls. | Evaluators that need full interaction context (for example, task_adherence). |
| {{item.field}} | A field from your input data. | Input fields like query or ground_truth. |
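A sketch that pairs a string-based evaluator with `{{sample.output_text}}` and task_adherence with `{{sample.output_items}}`; the evaluator names and grader shape are assumptions.

```python
testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "coherence",
        "evaluator_name": "builtin.coherence",       # assumed name
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{sample.output_text}}",    # plain text output
        },
    },
    {
        "type": "azure_ai_evaluator",
        "name": "task_adherence",
        "evaluator_name": "builtin.task_adherence",  # assumed name
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{sample.output_items}}",   # structured output with tool calls
        },
    },
]
```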
Create evaluation and run
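The create-and-run step follows the same pattern as the model target; this sketch reuses the agent data source and criteria defined above, with the same assumed `client.evals` surface.

```python
evaluation = client.evals.create(
    name="agent-target-evaluation",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=testing_criteria,
)

run = client.evals.runs.create(evaluation.id, name="agent-target-run", data_source=data_source)
print(run.id, run.status)
```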
Agent response evaluation
Retrieve and evaluate Foundry agent responses by response IDs using the `azure_ai_responses` data source type. Use this scenario to evaluate specific agent interactions after they occur.
A response ID is a unique identifier returned each time a Foundry agent generates a response. You can collect response IDs from agent interactions by using the Responses API or from your application’s trace logs. Provide the IDs inline as file content, or upload them as a dataset (see Prepare input data).
Collect response IDs
Each call to the Responses API returns a response object with a unique `id` field. Collect these IDs from your application’s interactions, or generate them directly:
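A sketch of collecting IDs from the Responses API. How the call is routed to your Foundry agent (for example, through an agent reference on the request) varies by SDK version, so the plain model call here is only a stand-in.

```python
# Each response object carries a unique id; collect them for evaluation.
response = client.responses.create(
    model="gpt-5-mini",  # stand-in; route the call to your agent per your SDK version
    input="What is the capital of France?",
)
response_ids = [response.id]
```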
Create evaluation and run
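A sketch of the response-ID scenario, assuming the `azure_ai_source` config with `scenario` set to `"responses"` (as described earlier) and an assumed payload shape for passing the collected IDs.

```python
evaluation = client.evals.create(
    name="agent-response-evaluation",
    data_source_config={"type": "azure_ai_source", "scenario": "responses"},
    testing_criteria=testing_criteria,
)

run = client.evals.runs.create(
    evaluation.id,
    name="agent-response-run",
    data_source={
        "type": "azure_ai_responses",
        "source": {
            "type": "file_content",
            # Assumed item shape for response IDs.
            "content": [{"item": {"response_id": rid}} for rid in response_ids],
        },
    },
)
```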
Get results
After an evaluation run completes, retrieve the scored results and review them in the portal or programmatically.
Poll for results
Evaluation runs are asynchronous. Poll the run status until it completes, then retrieve the results:
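A polling sketch that reuses `evaluation.id` and `run.id` from the earlier steps; the status strings and the `output_items.list` call are assumptions about the evals surface.

```python
import time

run = client.evals.runs.retrieve(run.id, eval_id=evaluation.id)
while run.status in ("queued", "in_progress", "running"):  # assumed status values
    time.sleep(30)
    run = client.evals.runs.retrieve(run.id, eval_id=evaluation.id)

print("Final status:", run.status)

# Assumption: per-item results are available through an output_items listing.
for item in client.evals.runs.output_items.list(run_id=run.id, eval_id=evaluation.id):
    print(item)
```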
Interpret results
For a single data example, all evaluators output the following schema:
- Label: a binary “pass” or “fail” label, similar to a unit test’s output. Use this result to facilitate comparisons across evaluators.
- Score: a score from the natural scale of each evaluator. Some evaluators use a fine-grained rubric, scoring on a 5-point scale (quality evaluators) or a 7-point scale (content safety evaluators). Others, like textual similarity evaluators, use F1 scores, which are floats between 0 and 1. Any non-binary “score” is binarized to “pass” or “fail” in the “label” field based on the “threshold”.
- Threshold: any non-binary scores are binarized to “pass” or “fail” based on a default threshold, which the user can override in the SDK experience.
- Reason: To improve intelligibility, all LLM-judge evaluators also output a reasoning field to explain why a certain score is given.
- Details: (optional) For some evaluators, such as tool_call_accuracy, there might be a “details” field or flags that contain additional information to help users debug their applications.
Example output (single item)
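The following is only an illustrative sketch, limited to the fields described above (label, score, threshold, reason); the exact field names and structure of the service response may differ.

```python
# Illustrative only; not the literal service response format.
example_result = {
    "name": "coherence",
    "label": "pass",
    "score": 4,
    "threshold": 3,
    "reason": "The response stays on topic and answers the question directly.",
}
```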
Example output (aggregate)
For aggregate results over multiple data examples (a dataset), the fraction of examples labeled “pass” becomes the passing rate for that dataset.
Troubleshooting
Job running for a long time
Your evaluation job might remain in the Running state for an extended period. This typically occurs when the Azure OpenAI model deployment doesn’t have enough capacity, causing the service to retry requests.
Resolution:
- Cancel the current evaluation job using `client.evals.runs.cancel(run_id, eval_id=eval_id)`.
- Increase the model capacity in the Azure portal.
- Run the evaluation again.
Authentication errors
If you receive a 401 Unauthorized or 403 Forbidden error, verify that:
- Your `DefaultAzureCredential` is configured correctly (run `az login` if you’re using the Azure CLI).
- Your account has the Azure AI User role on the Foundry project.
- The project endpoint URL is correct and includes both the account and project names.
Data format errors
If the evaluation fails with a schema or data mapping error:
- Verify that your JSONL file has one valid JSON object per line.
- Confirm that field names in `data_mapping` match the field names in your JSONL file exactly (case-sensitive).
- Check that `item_schema` properties match the fields in your dataset.
Rate limit errors
Evaluation run creation is rate-limited at the tenant, subscription, and project levels. If you receive a 429 Too Many Requests response:
- Check the `retry-after` header in the response for the recommended wait time.
- Review the response body for rate limit details.
- Use exponential backoff when retrying failed requests (see the sketch at the end of this section).
If a 429 error occurs during run execution rather than at run creation:
- Reduce the size of your evaluation dataset or split it into smaller batches.
- Increase the tokens-per-minute (TPM) quota for your model deployment in the Azure portal.
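A minimal exponential backoff sketch for retrying run creation after a 429, as mentioned in the list above. The generic exception handling and retry window are placeholders; prefer the `retry-after` header when it's present.

```python
import random
import time

def create_run_with_backoff(eval_id, data_source, max_attempts=5):
    """Retry run creation with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return client.evals.runs.create(eval_id, name="retry-run", data_source=data_source)
        except Exception:  # narrow this to the SDK's rate-limit error if available
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.random())
```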
Agent evaluator tool errors
If an agent evaluator returns an error for unsupported tools:
- Check the supported tools for agent evaluators.
- As a workaround, wrap unsupported tools as user-defined function tools so the evaluator can assess them.