Agent evaluators (preview)

This article refers to the Microsoft Foundry (new) portal.
The Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview. The evaluation APIs are generally available for model and dataset evaluation; agent evaluation remains in public preview. Evaluators marked (preview) in this article are in public preview.
AI agents are powerful productivity assistants that can create workflows for business needs. However, their complex interaction patterns make observability a challenge. Agent evaluators provide systematic observability into agentic workflows by measuring quality, safety, and performance.

An agent workflow typically involves reasoning through user intents, calling relevant tools, and using tool results to complete tasks like updating a database or drafting a report. To build production-ready agentic applications, you need to evaluate not just the final output, but also the quality and efficiency of each step in the workflow.

Foundry provides built-in agent evaluators that function like unit tests for agentic systems: they take agent messages as input and output binary Pass/Fail scores (or scaled scores converted to binary scores based on thresholds). These evaluators support two best practices for agent evaluation:
  • System evaluation - to examine the end-to-end outcomes of the agentic system.
  • Process evaluation - to verify the step-by-step execution to achieve the outcomes.
Evaluator | Best practice | Use when | Purpose | Output
Task Completion (preview) | System evaluation | Assessing end-to-end task success in workflow automation, goal-oriented AI interactions, or any scenario where full task completion is critical | Measures whether the agent completed the requested task with a usable deliverable that meets all user requirements | Binary: Pass/Fail
Task Adherence (preview) | System evaluation | Ensuring agents follow system instructions, validating compliance in regulated environments | Measures whether the agent's actions adhere to its assigned tasks according to rules, procedures, and policy constraints, based on its system message and prior steps | Binary: Pass/Fail
Task Navigation Efficiency (preview) | System evaluation | Optimizing agent workflows, reducing unnecessary steps, validating against known optimal paths (requires ground truth) | Measures whether the agent made tool calls efficiently to complete a task by comparing them to expected tool sequences | Binary: Pass/Fail
Intent Resolution (preview) | System evaluation | Customer support scenarios, conversational AI, FAQ systems where understanding user intent is essential | Measures whether the agent correctly identifies the user's intent | Binary: Pass/Fail based on threshold (1-5 scale)
Tool Call Accuracy (preview) | Process evaluation | Overall tool call quality assessment in agent systems that use tool integration or API interactions to complete tasks | Measures whether the agent made the right tool calls with correct parameters to complete its task | Binary: Pass/Fail based on threshold (1-5 scale)
Tool Selection (preview) | Process evaluation | Validating tool choice quality in orchestration platforms, ensuring efficient tool usage without redundancy | Measures whether the agent selected the correct tools without selecting unnecessary ones | Binary: Pass/Fail
Tool Input Accuracy (preview) | Process evaluation | Strict validation of tool parameters in production environments, API integration tests, critical workflows requiring 100% parameter correctness | Measures whether all tool call parameters are correct across six strict criteria: groundedness, type compliance, format compliance, required parameters, no unexpected parameters, and value appropriateness | Binary: Pass/Fail
Tool Output Utilization (preview) | Process evaluation | Validating correct use of API responses, database query results, and search outputs in agent reasoning and responses | Measures whether the agent correctly understood and used tool call results contextually in its reasoning and final response | Binary: Pass/Fail
Tool Call Success (preview) | Process evaluation | Monitoring tool reliability, detecting API failures, timeouts, or technical errors in tool execution | Measures whether tool calls succeeded or resulted in technical errors or exceptions | Binary: Pass/Fail

System evaluation

System evaluation examines the quality of the final outcome of your agentic workflow. These evaluators are applicable to single agents and, in multi-agent systems, to the main orchestrator or the final agent responsible for task completion:
  • Task Completion - Did the agent fully complete the requested task?
  • Task Adherence - Did the agent follow the rules and constraints in its instructions?
  • Task Navigation Efficiency - Did the agent perform the expected steps efficiently?
  • Intent Resolution - Did the agent correctly identify and address user intentions?
For textual outputs from agents, you can also apply RAG quality evaluators such as Relevance and Groundedness, which accept agentic inputs, to assess the final response quality. Examples:
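A minimal sketch of system-evaluation testing criteria, using the configuration format shown later in this article. Only builtin.task_adherence appears verbatim in this article, so the other builtin.* identifiers and the judge deployment name below are assumptions:
# Minimal sketch of system-evaluation criteria.
# Assumptions: builtin.task_completion and builtin.intent_resolution follow the
# same naming pattern as builtin.task_adherence; the deployment name is illustrative.
model_deployment = "gpt-5-mini"  # name of your judge model deployment

system_testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "task_completion",
        "evaluator_name": "builtin.task_completion",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {"query": "{{item.query}}", "response": "{{item.response}}"},
    },
    {
        "type": "azure_ai_evaluator",
        "name": "intent_resolution",
        "evaluator_name": "builtin.intent_resolution",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {"query": "{{item.query}}", "response": "{{item.response}}"},
    },
]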

Process evaluation

Process evaluation examines the quality and efficiency of each step in your agentic workflow. These evaluators focus on the tool calls executed in a system to complete tasks:
  • Tool Call Accuracy - Did the agent make the right tool calls with correct parameters without redundancy?
  • Tool Selection - Did the agent select the correct and necessary tools?
  • Tool Input Accuracy - Did the agent provide correct parameters for tool calls?
  • Tool Output Utilization - Did the agent correctly use tool call results in its reasoning and final response?
  • Tool Call Success - Did the tool calls succeed without technical errors?
Examples:
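A minimal sketch of a process-evaluation criterion for Tool Call Accuracy, following the same configuration format. The builtin.tool_call_accuracy identifier is assumed by analogy with builtin.task_adherence, and the tool_definitions mapping matches the required inputs listed later in this article:
# Minimal sketch of a process-evaluation criterion.
# Assumption: builtin.tool_call_accuracy follows the builtin.task_adherence naming pattern.
model_deployment = "gpt-5-mini"  # name of your judge model deployment

process_testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "tool_call_accuracy",
        "evaluator_name": "builtin.tool_call_accuracy",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{item.response}}",
            "tool_definitions": "{{item.tool_definitions}}",
        },
    },
]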

Evaluator model and tool support for agent evaluators

For AI-assisted evaluators, you can use Azure OpenAI or OpenAI reasoning and non-reasoning models as the LLM judge. For complex evaluations that require refined reasoning, we recommend gpt-5-mini for its balance of performance, cost, and efficiency.

Supported tools

Agent evaluators support the following tools:
  • File Search
  • Function Tool (user-defined tools)
  • MCP
  • Knowledge-based MCP
The following tools currently have limited support. Avoid using the tool_call_accuracy, tool_input_accuracy, tool_output_utilization, tool_call_success, or groundedness evaluators if your agent conversation includes calls to these tools:
  • Azure AI Search
  • Bing Grounding
  • Bing Custom Search
  • SharePoint Grounding
  • Fabric Data Agent
  • Web Search

Using agent evaluators

Agent evaluators assess how well AI agents perform tasks, follow instructions, and use tools effectively. Each evaluator requires specific data mappings and parameters:
Evaluator | Required inputs | Required parameters
Task Completion | query, response | deployment_name
Task Adherence | query, response | deployment_name
Intent Resolution | query, response | deployment_name
Tool Call Accuracy | (query, response, tool_definitions) or (query, tool_calls, tool_definitions) | deployment_name
Tool Selection | (query, response, tool_definitions) or (query, tool_calls, tool_definitions) | deployment_name
Tool Input Accuracy | query, response, tool_definitions | deployment_name
Tool Output Utilization | query, response, tool_definitions | deployment_name
Tool Call Success | response | deployment_name
Task Navigation Efficiency | actions, expected_actions | (none)
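As an illustration of one of the lighter-weight rows in this table, a Tool Call Success criterion needs only the response field mapped plus the judge deployment name. The builtin.tool_call_success identifier below is assumed by analogy with builtin.task_adherence:
# Minimal sketch: Tool Call Success maps only the response field.
# Assumption: builtin.tool_call_success follows the builtin.task_adherence naming pattern.
model_deployment = "gpt-5-mini"  # name of your judge model deployment

tool_call_success_criterion = {
    "type": "azure_ai_evaluator",
    "name": "tool_call_success",
    "evaluator_name": "builtin.tool_call_success",
    "initialization_parameters": {"deployment_name": model_deployment},
    "data_mapping": {"response": "{{item.response}}"},
}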

Example input

Your test dataset should contain the fields referenced in your data mappings. The query and response fields accept either simple strings or conversation arrays:
{"query": "What's the weather in Seattle?", "response": "The weather in Seattle is rainy, 14°C."}
{"query": "Book a flight to Paris for next Monday", "response": "I've booked your flight to Paris departing next Monday at 9:00 AM."}
For more complex agent interactions with tool calls, use the conversation array format. This format follows the OpenAI message schema (see Agent message schema). The system message is optional but useful for evaluators that assess agent behavior against instructions, including task_adherence, task_completion, tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization, and groundedness:
{
    "query": [
        {"role": "system", "content": "You are a travel booking agent."},
        {"role": "user", "content": "Book a flight to Paris for next Monday"}
    ],
    "response": [
        {"role": "assistant", "content": [{"type": "tool_call", "name": "search_flights", "arguments": {"destination": "Paris", "date": "next Monday"}}]},
        {"role": "tool", "content": [{"type": "tool_result", "tool_result": {"flight": "AF123", "time": "9:00 AM"}}]},
        {"role": "assistant", "content": "I've booked flight AF123 to Paris departing next Monday at 9:00 AM."}
    ]
}

Configuration example

Data mapping syntax:
  • {{item.field_name}} references fields from your test dataset (for example, {{item.query}}).
  • {{sample.output_items}} references agent responses generated or retrieved during evaluation. Use this when evaluating with an agent target or agent response data source.
  • {{sample.tool_definitions}} references tool definitions. Use this when evaluating with an agent target or agent response data source. These are auto-populated for supported built-in tools or inferred for custom functions.
Here’s an example configuration for Task Adherence:
# model_deployment holds the name of the model deployment used as the LLM judge.
testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "task_adherence",
        "evaluator_name": "builtin.task_adherence",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{item.response}}",
        },
    },
]
See Run evaluations in the cloud for details on running evaluations and configuring data sources.
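The exact client setup and data source configuration are documented there; the following is only a rough sketch, assuming client is an OpenAI-compatible client that exposes the evals resource and that your dataset records contain query and response fields:
# Rough sketch only; see "Run evaluations in the cloud" for the supported setup.
# Assumption: `client` is an OpenAI-compatible client exposing the evals resource.
eval_object = client.evals.create(
    name="agent-task-adherence",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "response": {"type": "string"},
            },
            "required": ["query", "response"],
        },
    },
    testing_criteria=testing_criteria,
)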

Example output

Agent evaluators return Pass/Fail results with reasoning. Key output fields:
{
    "type": "azure_ai_evaluator",
    "name": "Task Adherence",
    "metric": "task_adherence",
    "label": "pass",
    "reason": "Agent followed system instructions correctly",
    "threshold": 3,
    "passed": true
}
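Once you collect results shaped like the example above into a list, a small generic sketch for computing a pass rate per metric (illustrative only; not part of the evaluation API):
from collections import defaultdict

# Illustrative only: summarize evaluator results shaped like the example above
# into a pass rate per metric.
def pass_rates(results: list[dict]) -> dict[str, float]:
    totals, passes = defaultdict(int), defaultdict(int)
    for result in results:
        metric = result.get("metric", result.get("name", "unknown"))
        totals[metric] += 1
        passes[metric] += 1 if result.get("passed") else 0
    return {metric: passes[metric] / totals[metric] for metric in totals}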

Task navigation efficiency

Task Navigation Efficiency measures whether the agent took an optimal sequence of actions by comparing against an expected sequence (ground truth). Use this evaluator for workflow optimization and regression testing.
{
    "type": "azure_ai_evaluator",
    "name": "task_navigation_efficiency",
    "evaluator_name": "builtin.task_navigation_efficiency",
    "initialization_parameters": {
        "matching_mode": "exact_match"  # Options: "exact_match", "in_order_match", "any_order_match"
    },
    "data_mapping": {
        "actions": "{{item.actions}}",
        "expected_actions": "{{item.expected_actions}}"
    },
}
Matching modes:
Mode | Description
exact_match | Agent's trajectory must match the ground truth exactly (order and content)
in_order_match | All ground truth steps must appear in the agent's trajectory in the correct order (extra steps allowed)
any_order_match | All ground truth steps must appear in the agent's trajectory; order doesn't matter (extra steps allowed)
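The three modes differ only in how the agent's action sequence is compared with the expected sequence. The following is an illustrative sketch of that comparison logic, not the evaluator's actual implementation:
# Illustrative comparison logic for the three matching modes; the built-in
# evaluator's actual implementation may differ.
def exact_match(actions: list[str], expected: list[str]) -> bool:
    # Trajectory must match the ground truth exactly, in order and content.
    return actions == expected

def in_order_match(actions: list[str], expected: list[str]) -> bool:
    # Every expected step appears in order; extra steps in between are allowed.
    remaining = iter(actions)
    return all(step in remaining for step in expected)

def any_order_match(actions: list[str], expected: list[str]) -> bool:
    # Every expected step appears somewhere; order and extra steps don't matter.
    return set(expected).issubset(actions)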
Expected actions format: The expected_actions field can be a simple list of expected steps:
expected_actions = ["identify_tools_to_call", "call_tool_A", "call_tool_B", "response_synthesis"]
Or a tuple with tool names and parameters for more detailed validation:
expected_actions = (
    ["func_name1", "func_name2"],
    {
        "func_name1": {"param_key": "param_value"},
        "func_name2": {"param_key": "param_value"},
    }
)
Output: Returns a binary pass/fail result plus precision, recall, and F1 scores:
{
    "type": "azure_ai_evaluator",
    "name": "task_navigation_efficiency",
    "passed": true,
    "details": {
        "precision_score": 0.85,
        "recall_score": 1.0,
        "f1_score": 0.92
    }
}
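Precision can be read as the fraction of the agent's actions that were expected, recall as the fraction of expected actions the agent performed, and F1 as their harmonic mean. An illustrative step-level sketch (the built-in scoring may match or weight steps differently):
# Illustrative step-level scores; the built-in evaluator's scoring may differ.
def trajectory_scores(actions: list[str], expected: list[str]) -> dict[str, float]:
    matched = len(set(actions) & set(expected))
    precision = matched / len(actions) if actions else 0.0
    recall = matched / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision_score": precision, "recall_score": recall, "f1_score": f1}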

Agent message schema

When using conversation array format, query and response follow the OpenAI message structure:
  • query: Contains the conversation history leading up to the user’s request. Include the system message to provide context for evaluators that assess agent behavior against instructions.
  • response: Contains the agent’s reply, including any tool calls and their results.
Message schema:
[
  {
    "role": "system" | "user" | "assistant" | "tool",
    "content": "string" | [                // string or array of content items
      {
        "type": "text" | "tool_call" | "tool_result",
        "text": "string",                  // if type == text
        "tool_call_id": "string",          // if type == tool_call
        "name": "string",                  // tool name if type == tool_call
        "arguments": { ... },              // tool args if type == tool_call
        "tool_result": { ... }             // result if type == tool_result
      }
    ]
  }
]
Role types:
Role | Description
system | Agent instructions (optional, placed at the start of query)
user | User messages and requests
assistant | Agent responses, including tool calls
tool | Tool execution results
Example:
{
  "query": [
    {"role": "system", "content": "You are a weather assistant."},
    {"role": "user", "content": [{"type": "text", "text": "What's the weather in Seattle?"}]}
  ],
  "response": [
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_123", "name": "get_weather", "arguments": {"city": "Seattle"}}]},
    {"role": "tool", "content": [{"type": "tool_result", "tool_result": {"temp": "62°F", "condition": "Cloudy"}}]},
    {"role": "assistant", "content": [{"type": "text", "text": "It's currently 62°F and cloudy in Seattle."}]}
  ]
}
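If you assemble these records programmatically from your own agent's run history, a small illustrative helper can keep them consistent with the schema above (the function and parameter names are hypothetical):
# Illustrative helper for assembling a record in the schema above.
def build_record(system_prompt: str, user_text: str, agent_messages: list[dict]) -> dict:
    return {
        "query": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [{"type": "text", "text": user_text}]},
        ],
        # assistant and tool messages, already in the schema above
        "response": agent_messages,
    }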