Azure OpenAI graders
The Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview, but the APIs for model and dataset evaluation are generally available (agent evaluation remains in public preview). Evaluators marked (preview) in this article are in public preview wherever you use them.
Azure OpenAI graders are a new set of evaluation tools in the Microsoft Foundry SDK that evaluate the performance of AI models and their outputs. These graders include:
| Grader | What it measures | Required parameters | Output |
|---|---|---|---|
| label_model | Classifies text into predefined categories | model, input, labels, passing_labels | Pass/Fail based on label |
| score_model | Assigns a numeric score based on criteria | model, input, range, pass_threshold | 0-1 float |
| string_check | Exact or pattern string matching | input, reference, operation | Pass/Fail |
| text_similarity | Similarity between two text strings | input, reference, evaluation_metric, pass_threshold | 0-1 float |
You can run graders locally or remotely. Each grader assesses specific aspects of AI models and their outputs.
Using Azure OpenAI graders
Azure OpenAI graders provide flexible evaluation using LLM-based or deterministic approaches:
- Model-based graders (label_model, score_model) - Use an LLM to evaluate outputs
- Deterministic graders (string_check, text_similarity) - Use algorithmic comparison
Examples:
See Run evaluations in the cloud for details on running evaluations and configuring data sources.
Your test dataset should contain the fields that your grader configurations reference, for example:
{"query": "What is the weather like today?", "response": "It's sunny and warm with clear skies.", "ground_truth": "Today is sunny with temperatures around 75°F."}
{"query": "Summarize the meeting notes.", "response": "The team discussed Q3 goals and assigned action items.", "ground_truth": "Meeting covered quarterly objectives and task assignments."}
Label grader
The label grader (label_model) uses an LLM to classify text into predefined categories. Use it for sentiment analysis, content classification, or any multi-class labeling task.
{
    "type": "label_model",
    "name": "sentiment_check",
    "model": model_deployment,
    "input": [
        {"role": "developer", "content": "Classify the sentiment as 'positive', 'neutral', or 'negative'"},
        {"role": "user", "content": "Statement: {{item.query}}"},
    ],
    "labels": ["positive", "neutral", "negative"],
    "passing_labels": ["positive", "neutral"],
}
Output: Returns the assigned label from your defined set. The grader passes if the label is in passing_labels.
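For intuition, here's a rough Python sketch of what this configuration amounts to: the {{item.*}} placeholders are filled from each dataset row, the model picks one of the labels, and the result passes if that label appears in passing_labels. The render_template helper is hypothetical; the service handles substitution and classification for you.

```python
import re

def render_template(text: str, item: dict) -> str:
    # Hypothetical helper: replace {{item.field}} placeholders with dataset values.
    return re.sub(r"\{\{item\.(\w+)\}\}", lambda m: str(item[m.group(1)]), text)

row = {"query": "What is the weather like today?"}
prompt = render_template("Statement: {{item.query}}", row)
# prompt == "Statement: What is the weather like today?"

# Pass/fail: the grader passes when the model's label is one of passing_labels.
predicted_label = "positive"  # stand-in for the LLM's classification
passing_labels = ["positive", "neutral"]
passed = predicted_label in passing_labels  # True
```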
Score grader
The score grader (score_model) uses an LLM to assign a numeric score to model outputs, reflecting quality, correctness, or similarity to a reference. Use it for nuanced evaluation requiring reasoning.
{
    "type": "score_model",
    "name": "quality_score",
    "model": model_deployment,
    "input": [
        {"role": "system", "content": "Rate the response quality from 0 to 1. 1 = perfect, 0 = completely wrong."},
        {"role": "user", "content": "Response: {{item.response}}\nGround Truth: {{item.ground_truth}}"},
    ],
    "pass_threshold": 0.7,
    "range": [0, 1]
}
Output: Returns a float score (for example, 0.85). The grader passes if the score meets or exceeds pass_threshold.
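The pass/fail decision itself is simple thresholding. The sketch below assumes out-of-range scores are clamped into the configured range, which is an assumption about edge-case handling rather than documented behavior; the essential rule is the comparison against pass_threshold.

```python
def score_passes(score: float, pass_threshold: float, score_range=(0.0, 1.0)) -> bool:
    # Clamp the score into the configured range (assumed behavior), then
    # pass if it meets or exceeds the threshold.
    low, high = score_range
    clamped = min(max(score, low), high)
    return clamped >= pass_threshold

score_passes(0.85, pass_threshold=0.7)  # True
score_passes(0.60, pass_threshold=0.7)  # False
```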
String check grader
The string check grader (string_check) performs deterministic string comparisons. Use it for exact match validation where responses must match a reference exactly.
{
    "type": "string_check",
    "name": "exact_match",
    "input": "{{item.response}}",
    "reference": "{{item.ground_truth}}",
    "operation": "eq",
}
Operations:
| Operation | Description |
|---|---|
| eq | Exact match (case-sensitive) |
| ne | Not equal |
| like | Pattern match with wildcards |
| ilike | Case-insensitive pattern match |
Output: Returns a score of 1 for match, 0 for no match.
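As a rough Python analogue of these operations (the service's exact wildcard syntax may differ; fnmatch-style patterns are used here only to illustrate the idea):

```python
from fnmatch import fnmatchcase

def string_check(input_text: str, reference: str, operation: str) -> bool:
    # Illustrative re-implementation of the four string_check operations.
    if operation == "eq":
        return input_text == reference
    if operation == "ne":
        return input_text != reference
    if operation == "like":
        return fnmatchcase(input_text, reference)  # wildcard pattern match
    if operation == "ilike":
        return fnmatchcase(input_text.lower(), reference.lower())
    raise ValueError(f"Unknown operation: {operation}")

string_check("It's sunny and warm with clear skies.", "*sunny*", "like")   # True
string_check("IT'S SUNNY AND WARM WITH CLEAR SKIES.", "*sunny*", "ilike")  # True
```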
Text similarity grader
The text similarity grader (text_similarity) compares two text strings using similarity metrics. Use it for open-ended or paraphrase matching where exact match is too strict.
{
    "type": "text_similarity",
    "name": "similarity_check",
    "input": "{{item.response}}",
    "reference": "{{item.ground_truth}}",
    "evaluation_metric": "bleu",
    "pass_threshold": 0.8,
}
Metrics:
| Metric | Description |
|---|---|
| fuzzy_match | Approximate string matching using edit distance |
| bleu | N-gram overlap score, commonly used for translation |
| gleu | Google’s variant of BLEU with sentence-level scoring |
| meteor | Alignment-based metric considering synonyms and paraphrases |
| cosine | Cosine similarity on vectorized text |
| rouge_* | N-gram overlap variants (rouge_1, rouge_2, …, rouge_l) |
Output: Returns a similarity score as a float (higher means more similar). The grader passes if the score meets or exceeds pass_threshold.
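The metrics above are computed by the service. For intuition only, this sketch approximates a fuzzy_match-style score with Python's difflib and applies the same thresholding rule; BLEU, GLEU, METEOR, ROUGE, and cosine are calculated quite differently.

```python
from difflib import SequenceMatcher

def fuzzy_similarity(input_text: str, reference: str) -> float:
    # Rough edit-distance-based similarity in [0, 1]; illustration only.
    return SequenceMatcher(None, input_text, reference).ratio()

score = fuzzy_similarity(
    "It's sunny and warm with clear skies.",
    "Today is sunny with temperatures around 75°F.",
)
passed = score >= 0.8  # same rule as pass_threshold in the grader configuration
```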
Example output
Graders return results with pass/fail status. Key output fields:
{
    "type": "score_model",
    "name": "quality_score",
    "score": 0.85,
    "passed": true
}
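Once you have per-row results in this shape, you can summarize them however you like. The rows below are hypothetical and only show one way to compute a pass rate per grader:

```python
from collections import defaultdict

# Hypothetical per-row grader results in the shape shown above.
results = [
    {"type": "score_model", "name": "quality_score", "score": 0.85, "passed": True},
    {"type": "score_model", "name": "quality_score", "score": 0.55, "passed": False},
    {"type": "string_check", "name": "exact_match", "score": 1.0, "passed": True},
]

totals, passes = defaultdict(int), defaultdict(int)
for result in results:
    totals[result["name"]] += 1
    passes[result["name"]] += int(result["passed"])

for name, total in totals.items():
    print(f"{name}: {passes[name]}/{total} passed")
```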
Related content