Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Test your generative AI models and agents by running evaluations that measure performance, quality, and safety. Use evaluations before deployment to validate behavior, or after deployment to monitor production quality. Evaluations run your model or agent against test data and score the outputs using built-in or custom evaluators. This article shows you how to create and run evaluations in the Foundry portal.

Prerequisites

  • An Azure subscription. Create one for free.
  • A Microsoft Foundry project. Create a project if you don’t have one.
  • One of the following, depending on your evaluation target:
    • Agent evaluation: An agent in your project.
    • Model evaluation: A deployed model or access to instant models.
    • Dataset evaluation: A test dataset in CSV or JSONL format containing preexisting model or agent outputs.
  • An Azure OpenAI connection with a deployed GPT model (for example, gpt-4.1-mini). Required for AI-assisted quality evaluations.
  • Foundry User role on the Foundry project. For more information, see Role-based access control for Microsoft Foundry.
The Foundry RBAC roles were recently renamed. Foundry User, Foundry Owner, Foundry Account Owner, and Foundry Project Manager were previously named Azure AI User, Azure AI Owner, Azure AI Account Owner, and Azure AI Project Manager. You might still see the previous names in some places while the rename rolls out. The role IDs and core permissions are unchanged by the rename.

Choose an evaluation approach

Select an evaluation approach based on what you want to test:
Target | Scope | Data source | Best for
Agent | Full conversations | Simulated data | Testing end-to-end agent behavior with synthetic scenarios before deployment.
Agent | Full conversations | Existing conversations | Evaluating real user interactions to monitor production quality.
Agent | Individual turns | Existing dataset | Debugging specific agent responses, testing tool usage, fine-grained analysis.
Agent | Individual turns | Synthetic data | Testing single-turn Q&A or RAG scenarios with generated queries.
Agent | Individual turns | Existing traces | Evaluating historical agent traces from your project.
Model | Individual turns | Synthetic data | Testing model completions with generated prompts.
Model | Individual turns | Existing dataset | Benchmarking model performance against a curated test set.
Dataset | Individual turns | (Dataset is target) | Evaluating preexisting outputs without re-running the model or agent.
Start with Agent > Full conversations > Simulated data to test your agent’s behavior in controlled scenarios. Use Existing conversations once your agent is in production to monitor real-world performance.

Create an evaluation

You can start an evaluation from several places in the Foundry portal:
  • Evaluation page: From the left pane, select Evaluation > Create.
  • Models page: Go to your model, select the Evaluation tab, then select Create.
  • Agents page: Go to your agent, select the Evaluation tab, then select Create.
  • Agent playground: Go to your agent, select the Playground tab, then select Metrics > Run full evaluation.

Step 1: Select evaluation target

When you create an evaluation, first choose the evaluation target. The target determines what the evaluation runs against:
Target | Description
Agent | Evaluates the output generated by your selected agent and user-defined input. Works for both prompt agents and hosted agents.
Model | Evaluates the output generated by your selected model and user-defined prompt.
Dataset | Evaluates preexisting model or agent outputs from a test dataset.
Traces | Evaluates agent interactions already captured in Application Insights. Select the agent and time range, and the portal retrieves the matching traces for evaluation. For the SDK equivalent, see Trace evaluation.
Instant models: Instant models are deployment-less models that you can use immediately without creating a deployment. When creating an evaluation, you can select an instant model as either the evaluation target or the judge model directly from the model picker.

Step 2: Select evaluation scope

This step appears for Agent and Dataset targets only. Model evaluations always use individual turns.
Choose how you want to evaluate your agent’s performance:
Scope | Description | Best for
Full conversations (preview) | Evaluates complete multi-turn conversations from start to finish. Measures overall conversation quality, task completion, and user satisfaction. | Testing end-to-end agent experiences, customer satisfaction, and conversation flow.
Individual turns | Evaluates individual agent responses within conversations. Measures per-turn metrics like tool selection accuracy and response quality. | Debugging specific agent behaviors, testing tool usage, and fine-grained analysis.

Step 3: Select data source

The data source options depend on your evaluation target and scope.

For conversation evaluations (agent > full conversations) (preview)

Choose where your conversation data comes from:
Simulated data
Generate synthetic conversations by running your agent against scenario descriptions from a dataset. Use this option to test your agent’s behavior in controlled scenarios before deployment.
  1. Select Simulated data.
  2. Select Generate to open the simulation configuration dialog.
  3. Select your file: Choose a dataset containing scenario descriptions. Each row in your dataset describes a scenario that you use to generate a simulated conversation.
  4. Select model: Choose the model that simulates the user in the conversation:
    • gpt-4.1 (recommended for complex scenarios)
    • gpt-4o
    • gpt-4o-mini
    • gpt-4.1-mini
  5. Configure simulation settings:
    • Number of simulated conversations per scenario: How many conversations to generate for each row in your dataset (1-5). Multiple conversations per scenario help identify variance in agent behavior.
    • Number of turns per conversation: Maximum turns allowed per conversation (1-50). The conversation ends when the task is complete or this limit is reached.
  6. Select Confirm to save your simulation configuration.
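The two simulation settings bound how much work a run can generate. The following is a minimal sketch, not portal code: `SimulationSettings` and `max_total_turns` are hypothetical names, and only the ranges (1-5 and 1-50) come from the settings described above. It validates the ranges and computes the worst-case number of simulated turns for a scenario dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationSettings:
    """Hypothetical container for the two portal settings described above."""

    conversations_per_scenario: int  # portal range: 1-5
    max_turns_per_conversation: int  # portal range: 1-50

    def __post_init__(self) -> None:
        if not 1 <= self.conversations_per_scenario <= 5:
            raise ValueError("conversations_per_scenario must be between 1 and 5")
        if not 1 <= self.max_turns_per_conversation <= 50:
            raise ValueError("max_turns_per_conversation must be between 1 and 50")


def max_total_turns(scenario_rows: int, settings: SimulationSettings) -> int:
    """Upper bound on simulated turns: rows x conversations x turn limit."""
    if scenario_rows < 0:
        raise ValueError("scenario_rows must not be negative")
    return (
        scenario_rows
        * settings.conversations_per_scenario
        * settings.max_turns_per_conversation
    )


# Example: 20 scenario rows, 2 conversations each, at most 10 turns each.
worst_case = max_total_turns(20, SimulationSettings(2, 10))
```

Because a conversation can end early when the task is complete, the real turn count is usually lower than this bound.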
Existing conversations
Evaluate real conversations that your agent already had with users.
  1. Select Existing conversations.
  2. Configure filtering options:
    • Number of conversations: Maximum number of conversations to sample from the date range (1-100).
    • Time range: Filter conversations by time period. Use quick filters (Last Day, 7D, 1M, 3M) or select a custom date range.
  3. Browse and select specific conversations to include in the evaluation.
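To reason about which conversations a given filter would cover, the following sketch reproduces the filtering logic offline. It is an illustration under stated assumptions: the record shape (`id`, `started_at`), the function name, the 30-day and 90-day lengths for 1M and 3M, and the "most recent first" selection are all hypothetical. The portal's actual sampling method isn't described here.

```python
from datetime import datetime, timedelta, timezone

# Assumed durations for the quick filters; 1M and 3M are approximated in days.
QUICK_FILTERS = {
    "Last Day": timedelta(days=1),
    "7D": timedelta(days=7),
    "1M": timedelta(days=30),
    "3M": timedelta(days=90),
}


def select_conversations(conversations, now, quick_filter, limit):
    """Return up to `limit` conversations that started inside the time window."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    cutoff = now - QUICK_FILTERS[quick_filter]
    in_range = [c for c in conversations if cutoff <= c["started_at"] <= now]
    # Assumption: prefer the most recent conversations when capping the sample.
    in_range.sort(key=lambda c: c["started_at"], reverse=True)
    return in_range[:limit]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
history = [
    {"id": "a", "started_at": now - timedelta(hours=12)},
    {"id": "b", "started_at": now - timedelta(days=3)},
    {"id": "c", "started_at": now - timedelta(days=20)},
    {"id": "d", "started_at": now - timedelta(days=60)},
]
last_week = select_conversations(history, now, "7D", limit=10)
```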

For individual turn evaluations

Choose where your evaluation data comes from:
Synthetic data
Generate test queries by using AI. Select Synthetic and configure the number of rows and a prompt that describes the data to generate. You can also upload files to improve relevance.
Synthetic data generation requires a model with Responses API capability. For availability, see Responses API region availability.
Existing dataset
Use a prepared dataset in CSV or JSONL format. Select Existing dataset and choose a file from your project’s data assets. Only CSV and JSONL file formats are supported.
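If your test data is in CSV and you prefer JSONL, a conversion can be sketched with the Python standard library. The function name `csv_to_jsonl` and the sample columns are illustrative assumptions; each CSV row becomes one JSON object per line, keyed by the header row.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text with a header row into JSONL (one JSON object per line)."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    lines = [json.dumps(row, ensure_ascii=False) for row in reader]
    return "\n".join(lines) + ("\n" if lines else "")


# Illustrative sample data; all CSV values are read as strings.
sample_csv = "query,response\nWhat is 2+2?,4\nCapital of France?,Paris\n"
sample_jsonl = csv_to_jsonl(sample_csv)
```

Note that `csv.DictReader` reads every value as a string, so nested fields such as chat messages need separate handling.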
Existing traces (agent only)
Evaluate historical agent traces from your project. Select Existing traces and filter by date range to select traces.

Multimodal content (preview)

All evaluation targets support image and audio content. Each content type uses a specific JSONL schema.
Image content:
  • image_url: The image as a data URI (for example, data:image/png;base64,...) or a publicly accessible URL.
  • caption: A text description of the image content.
{"image_url": "data:image/png;base64,iVBOR...", "caption": "A red to blue color gradient"}
Audio content:
  • audio_data: The audio as a data URI with base64-encoded WAV data (for example, data:audio/wav;base64,...).
  • expected: A text description of the expected audio content.
Only WAV audio format is currently supported.
{"audio_data": "data:audio/wav;base64,UklGR...", "expected": "A short beep tone at 440 Hz"}
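The data URIs in these rows are base64-encoded file bytes with a MIME type prefix. The following sketch builds an audio row using only the Python standard library. The helper names `to_data_uri` and `make_silent_wav` and the sample description text are assumptions for illustration; the generated clip is silence, created in memory so the example needs no input file.

```python
import base64
import io
import json
import wave


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI, for example data:audio/wav;base64,..."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def make_silent_wav(duration_s: float = 0.1, sample_rate: int = 8000) -> bytes:
    """Create a short mono 16-bit WAV clip of silence in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(duration_s * sample_rate))
    return buffer.getvalue()


audio_row = {
    "audio_data": to_data_uri(make_silent_wav(), "audio/wav"),
    "expected": "A short silent clip",
}
jsonl_line = json.dumps(audio_row)  # one line of the JSONL dataset
```

The same `to_data_uri` helper applies to image rows: read the PNG bytes from disk and pass `"image/png"` as the MIME type.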
Datasets can also use the chat message conversation format, where audio and image data are embedded within a single chat message column as data URIs or publicly accessible URLs. The following example shows a conversation dataset column with embedded image and audio content:
[
  {
    "role": "system",
    "content": "..."
  },
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What are in these images?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "https://example.com/path/image.png"
        }
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,iVBORw0KGgo..."
        }
      }
    ]
  },
  {
    "role": "assistant",
    "content": "..."
  },
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Tell me the tones for the voices?"
      },
      {
        "type": "input_audio",
        "input_audio": {
          "data": "https://example.com/path/voice.wav",
          "format": "wav"
        }
      },
      {
        "type": "input_audio",
        "input_audio": {
          "data": "data:audio/wav;base64,UklGRigAAA...",
          "format": "wav"
        }
      }
    ]
  }
]
You can preview images and play audio clips directly in the evaluation creation flow and in the evaluation results view.
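A message in this format can be assembled programmatically. The sketch below mirrors the shape of the example above (`text`, `image_url`, and `input_audio` parts); the function name `build_user_message` is hypothetical, and the sketch assumes WAV audio because that's the only supported audio format.

```python
import json


def build_user_message(text, image_urls=(), audio_sources=()):
    """Build one user message with optional image and audio content parts."""
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        # Each image is a data URI or a publicly accessible URL.
        content.append({"type": "image_url", "image_url": {"url": url}})
    for data in audio_sources:
        content.append(
            {"type": "input_audio", "input_audio": {"data": data, "format": "wav"}}
        )
    return {"role": "user", "content": content}


message = build_user_message(
    "Describe the image and the audio clip.",
    image_urls=["https://example.com/path/image.png"],
    audio_sources=["https://example.com/path/voice.wav"],
)
conversation_column = json.dumps([message])  # value for the chat message column
```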

Step 4: Configure agents

This step appears for Agent evaluations only.
Customize how your agent behaves during the evaluation:
  1. Review the list of agents involved in your evaluation.
  2. For each agent, select Configure to customize its behavior:
    • System prompt: Modify the agent’s instructions for the evaluation.
    • User prompt: Specify how each dataset item is sent to your agent during evaluation.
  3. The evaluation run preserves agent configurations.

User prompt configuration

The user prompt defines how test inputs are passed to your agent. By default, the portal uses {{item.query}} to pass the dataset query directly to your agent. In most cases, you can use the default. Only change this value if your agent expects a different input format, for example, if your agent uses a hosted agent protocol or requires structured input with additional fields.
Common patterns:
Format | When to use
{{item.query}} | Default. Passes the query field from your dataset directly.
{{item.messages}} | For agents expecting conversation history as input.
Custom JSON | For hosted agents or APIs that require structured request bodies.
Use custom prompts to test edge cases or specific scenarios that might not occur naturally in your dataset.
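To preview what a user prompt template would send for a given dataset row, the substitution can be approximated locally. This is a sketch of the general idea, not the portal's implementation: `render_user_prompt` is a hypothetical helper, and the handling of non-string fields (serialized as JSON) is an assumption.

```python
import json
import re

# Matches placeholders such as {{item.query}} or {{ item.messages }}.
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_user_prompt(template: str, item: dict) -> str:
    """Replace each {{item.<field>}} placeholder with the row's value."""

    def substitute(match: re.Match) -> str:
        field = match.group(1)
        if field not in item:
            raise KeyError(f"Dataset row has no field '{field}'")
        value = item[field]
        # Assumption: non-string values (such as message lists) are sent as JSON.
        return value if isinstance(value, str) else json.dumps(value)

    return _PLACEHOLDER.sub(substitute, template)


row = {"query": "Reset my password", "messages": [{"role": "user", "content": "Hi"}]}
default_prompt = render_user_prompt("{{item.query}}", row)
custom_prompt = render_user_prompt('{"input": "{{item.query}}", "locale": "en-US"}', row)
```

Plain substitution doesn't escape quotes, so a query that contains `"` would produce invalid JSON in a custom JSON template. Check a few rendered rows before you run a large evaluation.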

Step 5: Configure field mapping

This step appears when you use existing data (existing conversations, existing dataset, or existing traces).
Map your data fields to the fields each evaluator expects. The required fields depend on your evaluation scope.

For conversation evaluations (multi-turn)

Field | Description | Required
messages | The conversation messages in chat format. | Yes
tool_definitions | Tool or function definitions available to the agent. | Yes

For individual turn evaluations (single-turn)

Field | Description | Required
query | The user query or prompt. | Yes
response | The model or agent response. | Yes
context | Retrieved context for RAG scenarios. | No
ground_truth | Expected correct answer for comparison. | No
tool_calls | Tool calls made by the agent. | No
tool_definitions | Available tool definitions. | No
The portal automatically attempts to map your dataset fields. If a field shows as Unassigned, select the dropdown to manually assign a column from your dataset.
Required fields are marked with an asterisk (*). Evaluators fail if required fields are left unassigned.
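You can check a dataset against these field lists before you upload it. The following sketch is an offline approximation: the scope keys, the helper name `map_fields`, and the case-insensitive name matching are assumptions, not the portal's actual mapping logic. The required and optional field names come from the tables above.

```python
REQUIRED_FIELDS = {
    "full_conversations": {"messages", "tool_definitions"},
    "individual_turns": {"query", "response"},
}
OPTIONAL_FIELDS = {
    "full_conversations": set(),
    "individual_turns": {"context", "ground_truth", "tool_calls", "tool_definitions"},
}


def map_fields(columns: list[str], scope: str) -> tuple[dict[str, str], list[str]]:
    """Map expected fields to dataset columns by name.

    Returns the mapping and the list of required fields left unassigned.
    """
    by_lower = {column.strip().lower(): column for column in columns}
    expected = REQUIRED_FIELDS[scope] | OPTIONAL_FIELDS[scope]
    mapping = {field: by_lower[field] for field in sorted(expected) if field in by_lower}
    unassigned = sorted(REQUIRED_FIELDS[scope] - set(mapping))
    return mapping, unassigned


mapping, unassigned = map_fields(
    ["Query", "response", "ground_truth", "notes"], "individual_turns"
)
```

An empty `unassigned` list means every required field has a matching column. Extra columns such as `notes` are ignored.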

Step 6: Select testing criteria

Select the evaluators to use for your evaluation. Microsoft Foundry provides three categories of built-in evaluators. The available evaluators depend on your evaluation scope.

Agent evaluators

Evaluate how effectively agents handle tasks, tools, and user intent. Available for Individual turns scope only.
Evaluator | Description
Intent Resolution | Measures whether the agent correctly identified and addressed the user’s intent.
Task Adherence | Measures how well the agent followed instructions and constraints.
Tool Call Success | Evaluates whether tool calls executed successfully.
Tool Selection | Measures whether the agent selected appropriate tools for the task.
Tool Output Utilization | Evaluates how effectively the agent used tool outputs in responses.
Tool Input Accuracy | Measures whether the agent provided correct inputs to tools.
Tool Call Accuracy | Overall accuracy of tool usage.

Quality evaluators

Measure the overall quality of generated responses. Most quality evaluators are available for all evaluation scopes. Evaluators marked with ★ support both conversation-level and turn-level analysis.
Evaluator | Description | Conversation support
Customer Satisfaction | Predicts user satisfaction with the agent interaction. |
Task Completion | Evaluates whether the agent successfully completed the requested task. |
Coherence | Measures logical flow and consistency of responses. |
Groundedness | Measures whether responses are grounded in provided context. |
Response Completeness | Evaluates whether responses fully address user queries. |
Fluency | Evaluates natural language quality. |
Relevance | Evaluates how relevant responses are to the query. |

Safety evaluators

Identify potential content and security risks. Available for Individual turns scope only.
Evaluator | Description
Violence | Detects violent content in responses.
Sexual | Detects sexual content.
Self-harm | Detects self-harm related content.
Hate/Unfairness | Detects hateful or biased content.
The portal preselects recommended evaluators based on your evaluation target and scope:
  • Full conversations: Customer Satisfaction, Task Completion, Coherence, Groundedness
  • Individual turns (existing data): All Agent evaluators plus Quality and Safety evaluators
  • Individual turns (synthetic/traces): Relevance, Groundedness, Fluency, Coherence
You can add or remove evaluators as needed. Select Custom evaluators to use evaluators you defined in your project.

Step 7: Review and submit

  1. Enter a name for your evaluation.
  2. Review your configuration:
    • Evaluation target and scope
    • Data source and dataset
    • Selected evaluators
    • Field mappings (if applicable)
  3. Select Submit to start the evaluation.
After you submit, the evaluation run starts. Evaluations typically complete within a few minutes, depending on dataset size and the number of conversations being simulated.
To verify your evaluation started successfully:
  1. In the left pane, select Evaluation.
  2. Find your evaluation in the list. The Status column shows the current state:
    • In Progress: The evaluation is running.
    • Completed: The evaluation finished successfully.
    • Partial: Some evaluators completed but others failed.
    • Failed: The evaluation encountered an error.
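If you record run statuses in your own tooling, the four states can be modeled as an enumeration. This is a hypothetical sketch: the class and helper names are assumptions, and only the status labels come from the list above.

```python
from enum import Enum


class EvaluationStatus(Enum):
    """The status labels shown in the Status column."""

    IN_PROGRESS = "In Progress"
    COMPLETED = "Completed"
    PARTIAL = "Partial"
    FAILED = "Failed"


def is_finished(label: str) -> bool:
    """True when the run is no longer in progress."""
    return EvaluationStatus(label) is not EvaluationStatus.IN_PROGRESS


def needs_attention(label: str) -> bool:
    """True when some or all evaluators failed."""
    return EvaluationStatus(label) in {EvaluationStatus.PARTIAL, EvaluationStatus.FAILED}
```

An unknown label raises `ValueError`, which surfaces a status that the sketch doesn't account for instead of silently ignoring it.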
To view detailed results, select the evaluation name or see View the evaluation results.
For programmatic evaluation workflows, use the Azure AI Evaluation SDK. See How to run batch evaluation with the SDK.

Troubleshooting

Evaluation times out or runs slowly

  • Reduce the number of conversations or dataset rows.
  • For simulations, decrease the maximum turns per conversation.
  • Check that your judge model has sufficient quota.

Field mapping errors

  • Verify your dataset contains the required columns for your evaluation scope.
  • For conversation evaluations, ensure the messages column contains properly formatted chat messages.
  • Check that column names in your dataset match the expected field names.
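A quick structural check of the messages column can catch formatting problems before you submit. The validator below is a sketch based only on the message shape shown in this article. The function name is hypothetical, and the allowed role set is an assumption: `system`, `user`, and `assistant` appear in the example, and `tool` is added on the assumption that agent conversations can include tool results.

```python
# Assumed role set; "tool" is not shown in the article's example.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_messages(messages) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks valid."""
    if not isinstance(messages, list) or not messages:
        return ["messages must be a non-empty list"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: expected an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index}: unexpected role {message.get('role')!r}")
        # Content is either plain text or a list of typed content parts.
        if not isinstance(message.get("content"), (str, list)):
            problems.append(f"message {index}: content must be a string or a list")
    return problems


good = [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]
bad = [{"role": "customer", "content": "Hi"}, {"role": "assistant"}]
```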

Model quota exceeded

  • The judge model used for AI-assisted evaluations counts against your Azure OpenAI quota.
  • Use a smaller dataset or wait for quota to refresh.
  • Consider using gpt-4.1-mini instead of gpt-4.1 for cost-effective evaluations.
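To see how dataset size affects quota, a rough call count helps. The estimate below rests on a stated assumption that isn't confirmed by this article: each AI-assisted evaluator makes a fixed number of judge model calls (one by default) per dataset row. The function name is hypothetical. Treat the result as an order-of-magnitude guide only.

```python
def estimate_judge_calls(
    rows: int, ai_assisted_evaluators: int, calls_per_score: int = 1
) -> int:
    """Rough judge model call count: rows x evaluators x calls per score."""
    if min(rows, ai_assisted_evaluators, calls_per_score) < 0:
        raise ValueError("all arguments must be non-negative")
    return rows * ai_assisted_evaluators * calls_per_score


full_run = estimate_judge_calls(rows=200, ai_assisted_evaluators=4)
half_run = estimate_judge_calls(rows=100, ai_assisted_evaluators=4)
```

Under this assumption, halving the dataset or removing evaluators reduces judge model usage proportionally.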

Best practices

For simulation-based evaluations

  • Start small: Begin with 1 conversation per scenario and 5-10 turns to validate your setup before scaling up.
  • Diverse scenarios: Include a variety of scenario descriptions to test different agent capabilities.
  • Iterate on prompts: If agents behave unexpectedly, use the Configure agents step to adjust prompts.

For existing conversation evaluations

  • Representative sample: Select conversations that represent typical user interactions.
  • Include edge cases: Don’t just evaluate successful conversations—include challenging scenarios.
  • Regular evaluation: Schedule recurring evaluations to track agent performance over time.

For model evaluations

  • Benchmark datasets: Use standardized datasets to compare model performance across versions.
  • Test both deployed and instant models: Compare your fine-tuned deployments against base models.

For dataset evaluations

  • Pre-compute outputs: Generate outputs offline and evaluate in bulk for cost efficiency.
  • Version your datasets: Track which dataset version produced which evaluation results.
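One lightweight way to version a dataset is to record a content hash next to each evaluation name. The sketch below uses SHA-256 from the Python standard library; the helper name `dataset_fingerprint` and the 12-character truncation are arbitrary choices for illustration.

```python
import hashlib
import io


def dataset_fingerprint(stream, chunk_size: int = 65536) -> str:
    """Return a short SHA-256 fingerprint of a binary stream's content."""
    digest = hashlib.sha256()
    # Read in chunks so large dataset files don't need to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()[:12]


# In-memory example; for a file, pass open("dataset.jsonl", "rb") instead.
v1 = dataset_fingerprint(io.BytesIO(b'{"query": "a", "response": "b"}\n'))
v2 = dataset_fingerprint(io.BytesIO(b'{"query": "a", "response": "c"}\n'))
```

Identical content always yields the same fingerprint, so a changed fingerprint signals that results were produced from a different dataset version.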

General tips

  • Compare evaluators: Run the same data through multiple evaluators to get a comprehensive view.
  • Track trends: Use evaluation history to identify performance improvements or regressions.
  • Act on results: Use evaluation insights to refine agent prompts, tool definitions, and configurations.
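Trend tracking can be sketched as a comparison of mean scores between two runs. Everything in this example is hypothetical: the score dictionaries, the evaluator keys, the 1-5 score scale, and the regression threshold. Export your own per-row scores and adapt the shape accordingly.

```python
from statistics import mean


def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Mean score change per evaluator present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {name: mean(candidate[name]) - mean(baseline[name]) for name in shared}


def regressions(deltas: dict, threshold: float = 0.25) -> list:
    """Evaluators whose mean score dropped by more than the threshold."""
    return sorted(name for name, delta in deltas.items() if delta < -threshold)


# Hypothetical per-row scores from two evaluation runs.
run_a = {"coherence": [4, 4, 5], "relevance": [5, 4, 4], "fluency": [4, 4, 4]}
run_b = {"coherence": [3, 4, 4], "relevance": [5, 4, 5]}

deltas = score_deltas(run_a, run_b)
flagged = regressions(deltas)
```

Evaluators that appear in only one run, such as `fluency` here, are skipped because there is nothing to compare.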