Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
- System evaluation - to examine the end-to-end outcomes of the agentic system.
- Process evaluation - to verify the step-by-step execution to achieve the outcomes.
| Evaluator | Best practice | Use when | Purpose | Output |
|---|---|---|---|---|
| Task Completion (preview) | System evaluation | Assessing end-to-end task success in workflow automation, goal-oriented AI interactions, or any scenario where full task completion is critical | Measures if the agent completed the requested task with a usable deliverable that meets all user requirements | Binary: Pass/Fail |
| Customer Satisfaction (preview) | System evaluation | Measuring overall user satisfaction across a conversation, detecting user frustration | Measures holistic user satisfaction across six dimensions: helpfulness, completeness, clarity, tone, resolution, and adaptability | 1-5 Likert scale |
| Task Adherence (preview) | System evaluation | Ensuring agents follow system instructions, validating compliance in regulated environments | Measures if the agent’s actions adhere to its assigned tasks according to rules, procedures, and policy constraints, based on its system message and prior steps | Binary: Pass/Fail |
| Task Navigation Efficiency | System evaluation | Optimizing agent workflows, reducing unnecessary steps, validating against known optimal paths (requires ground truth) | Measures whether the agent made tool calls efficiently to complete a task by comparing them to expected tool sequences | Binary: Pass/Fail |
| Intent Resolution (preview) | System evaluation | Customer support scenarios, conversational AI, FAQ systems where understanding user intent is essential | Measures whether the agent correctly identifies the user’s intent | Binary: Pass/Fail based on threshold (1-5 scale) |
| Tool Call Accuracy | Process evaluation | Overall tool call quality assessment in agent systems with tool integration, API interactions to complete its tasks | Measures whether the agent made the right tool calls with correct parameters to complete its task | Binary: Pass/Fail based on threshold (1-5 scale) |
| Tool Selection | Process evaluation | Validating tool choice quality in orchestration platforms, ensuring efficient tool usage without redundancy | Measures whether the agent selected the correct tools without selecting unnecessary ones | Binary: Pass/Fail |
| Tool Input Accuracy | Process evaluation | Strict validation of tool parameters in production environments, API integration tests, critical workflows requiring 100% parameter correctness | Measures if all tool call parameters are correct across six strict criteria: groundedness, type compliance, format compliance, required parameters, no unexpected parameters, and value appropriateness | Binary: Pass/Fail |
| Tool Output Utilization | Process evaluation | Validating correct use of API responses, database query results, search outputs in agent reasoning and responses | Measures if the agent correctly understood and used tool call results contextually in its reasoning and final response | Binary: Pass/Fail |
| Tool Call Success | Process evaluation | Monitoring tool reliability, detecting API failures, timeout issues, or technical errors in tool execution | Measures if tool calls succeeded or resulted in technical errors or exceptions | Binary: Pass/Fail |
| Quality Grader (preview) | Quality evaluation | Assesses overall response quality at the turn level, including relevance, abstention, answer completeness, and optionally groundedness and context coverage | Enables quality evaluation across multiple dimensions in a single evaluator instead of running individual evaluators separately | Binary: Pass/Fail |
System evaluation
System evaluation examines the quality of the final outcome of your agentic workflow. These evaluators are applicable to single agents and, in multi-agent systems, to the main orchestrator or the final agent responsible for task completion:- Task Completion - Did the agent fully complete the requested task?
- Customer Satisfaction - How satisfied would a user be with the agent’s performance?
- Task Adherence - Did the agent follow the rules and constraints in its instructions?
- Task Navigation Efficiency - Did the agent perform the expected steps efficiently?
- Intent Resolution - Did the agent correctly identify and address user intentions?
Relevance and Groundedness that take agentic inputs to assess the final response quality.
Examples:
- Task completion (preview) sample
- Task adherence sample
- Task navigation efficiency sample
- Intent resolution sample
Process evaluation
Process evaluation examines the quality and efficiency of each step in your agentic workflow. These evaluators focus on the tool calls executed in a system to complete tasks:- Tool Call Accuracy - Did the agent make the right tool calls with correct parameters without redundancy?
- Tool Selection - Did the agent select the correct and necessary tools?
- Tool Input Accuracy - Did the agent provide correct parameters for tool calls?
- Tool Output Utilization - Did the agent correctly use tool call results in its reasoning and final response?
- Tool Call Success - Did the tool calls succeed without technical errors?
- Tool call accuracy sample
- Tool selection sample
- Tool input accuracy sample
- Tool output utilization sample
- Tool call success sample
Quality evaluation (preview)
Quality evaluation assesses the overall quality of an AI assistant’s response at the turn level. The Quality Grader evaluator is the same quality evaluator used in Microsoft Copilot Studio agent evaluation. It examines multiple dimensions of response quality:- Relevance - Is the response relevant to the user’s query?
- Abstention - Does the agent appropriately abstain when it cannot or should not answer?
- Answer completeness - Does the response fully address the user’s question?
- Groundedness - Is the response grounded in the provided context?
- Context coverage - Does the response make use of the relevant information in the context?
Model and tool support
For AI-assisted evaluators, you can use Azure OpenAI or OpenAI reasoning models and non-reasoning models for the LLM judge. For complex evaluation that requires refined reasoning, we recommendgpt-5-mini for its balance of performance, cost, and efficiency.
Supported tools
Agent evaluators support the following tools:- File Search
- Function Tool (user-defined tools)
- MCP
- Knowledge-based MCP
tool_call_accuracy, tool input accuracy, tool_output_utilization, tool_call_success, or groundedness evaluators if your agent conversation includes calls to these tools:
- Azure AI Search
- Bing Grounding
- Bing Custom Search
- SharePoint Grounding
- Code Interpreter
- Fabric Data Agent
- Web Search
Using agent evaluators
Agent evaluators assess how well AI agents perform tasks, follow instructions, and use tools effectively. Each evaluator requires specific data mappings and parameters:| Evaluator | Required inputs | Required parameters |
|---|---|---|
| Task Completion (preview) | query, response; optional: tool_definitions | deployment_name |
| Customer Satisfaction (preview) | messages | model |
| Task Adherence (preview) | query, response | deployment_name |
| Intent Resolution (preview) | query, response | deployment_name |
| Tool Call Accuracy | (query, response, tool_definitions) OR (query, tool_calls, tool_definitions) | deployment_name |
| Tool Selection | (query, response, tool_definitions) OR (query, tool_calls, tool_definitions) | deployment_name |
| Tool Input Accuracy | query, response, tool_definitions | deployment_name |
| Tool Output Utilization | query, response, tool_definitions | deployment_name |
| Tool Call Success | response | deployment_name |
| Task Navigation Efficiency | actions, expected_actions | (none) |
Example input
Your test dataset should contain the fields referenced in your data mappings. Both fields accept simple strings or conversation arrays:task_adherence, task_completion, tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization, and groundedness:
Tool definitions format
Thetool_definitions field describes the tools available to the agent. It follows the OpenAI function-calling schema — a list of tool objects, where each object contains a type (always "function") and a function descriptor:
tool_definitions field in your test dataset alongside query and response.
Configuration example
Data mapping syntax:{{item.field_name}}references fields from your test dataset (for example,{{item.query}}).{{sample.output_items}}references the agent’s structured output, including tool calls and results. Use this for evaluators that need full interaction context (task_adherence,tool_call_accuracy,tool_selection,tool_input_accuracy,tool_output_utilization).{{sample.output_text}}references the agent’s plain text response. Use this for evaluators that expect a string response (for example,coherence,violence).
Example output
Agent evaluators return Pass/Fail results with reasoning. Key output fields:intent_resolution and tool_call_accuracy), the output includes a numeric score field alongside the pass/fail result:
Task navigation efficiency
Task Navigation Efficiency measures whether the agent took an optimal sequence of actions by comparing against an expected sequence (ground truth). Use this evaluator for workflow optimization and regression testing.| Mode | Description |
|---|---|
exact_match | Agent’s trajectory must match the ground truth exactly (order and content) |
in_order_match | All ground truth steps must appear in the agent’s trajectory in correct order (extra steps allowed) |
any_order_match | All ground truth steps must appear in the agent’s trajectory, order doesn’t matter (extra steps allowed) |
actions field takes a list of message objects that follow the OpenAI message schema. Each message represents a step the agent took during the conversation:
The
actions and expected_actions fields use different formats. actions requires OpenAI message-schema dictionaries (representing the agent’s actual behavior), while expected_actions uses a simple list of tool names (representing the ground truth).expected_actions can be a simple list of expected steps:
Agent message schema
When using conversation array format,query and response follow the OpenAI message structure:
- query: Contains the conversation history leading up to the user’s request. Include the system message to provide context for evaluators that assess agent behavior against instructions.
- response: Contains the agent’s reply, including any tool calls and their results.
| Role | Description |
|---|---|
system | Agent instructions (optional, placed at start of query) |
user | User messages and requests |
assistant | Agent responses, including tool calls |
tool | Tool execution results |