
Azure OpenAI graders

This article refers to the Microsoft Foundry (new) portal.
The Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview, but the APIs for model and dataset evaluation are generally available (agent evaluation remains in public preview). Evaluators marked (preview) in this article are currently in public preview everywhere.
Azure OpenAI graders are a new set of evaluation tools in the Microsoft Foundry SDK for assessing the performance of AI models and their outputs. These graders include:
Grader | What it measures | Required parameters | Output
label_model | Classifies text into predefined categories | model, input, labels, passing_labels | Pass/Fail based on label
score_model | Assigns a numeric score based on criteria | model, input, range, pass_threshold | 0-1 float
string_check | Exact or pattern string matching | input, reference, operation | Pass/Fail
text_similarity | Similarity between two text strings | input, reference, evaluation_metric, pass_threshold | 0-1 float
You can run graders locally or remotely. Each grader assesses specific aspects of AI models and their outputs.

Using Azure OpenAI graders

Azure OpenAI graders provide flexible evaluation using LLM-based or deterministic approaches:
  • Model-based graders (label_model, score_model) - Use an LLM to evaluate outputs
  • Deterministic graders (string_check, text_similarity) - Use algorithmic comparison
Examples: See Run evaluations in the cloud for details on running evaluations and configuring data sources.
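For orientation, the following sketch shows one way you might wire the label grader from this article into a local run with the azure-ai-evaluation package. The AzureOpenAILabelGrader class, its parameter names, and the model_config shape are assumptions based on the preview SDK and might differ in your version; the endpoint, key, deployment, and API version values are placeholders.

from azure.ai.evaluation import AzureOpenAILabelGrader, evaluate  # preview SDK (assumed API)

# Azure OpenAI connection used by the grader's LLM (placeholder values).
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-model-deployment>",
    "api_version": "<api-version>",
}

# Mirrors the label_model example shown later in this article.
sentiment_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    input=[
        {"role": "developer", "content": "Classify the sentiment as 'positive', 'neutral', or 'negative'"},
        {"role": "user", "content": "Statement: {{item.query}}"},
    ],
    labels=["positive", "neutral", "negative"],
    passing_labels=["positive", "neutral"],
    name="sentiment_check",
)

# Run the grader over a local JSONL dataset (see Example input below).
results = evaluate(
    data="test_data.jsonl",
    evaluators={"sentiment_check": sentiment_grader},
)
print(results)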

Example input

Your test dataset should contain the fields referenced in your grader configurations.
{"query": "What is the weather like today?", "response": "It's sunny and warm with clear skies.", "ground_truth": "Today is sunny with temperatures around 75°F."}
{"query": "Summarize the meeting notes.", "response": "The team discussed Q3 goals and assigned action items.", "ground_truth": "Meeting covered quarterly objectives and task assignments."}

Label grader

The label grader (label_model) uses an LLM to classify text into predefined categories. Use it for sentiment analysis, content classification, or any multi-class labeling task.
{
    "type": "label_model",
    "name": "sentiment_check",
    "model": model_deployment,
    "input": [
        {"role": "developer", "content": "Classify the sentiment as 'positive', 'neutral', or 'negative'"},
        {"role": "user", "content": "Statement: {{item.query}}"},
    ],
    "labels": ["positive", "neutral", "negative"],
    "passing_labels": ["positive", "neutral"],
}
Output: Returns the assigned label from your defined set. The grader passes if the label is in passing_labels.
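Conceptually, the grader renders its input messages against a dataset row, asks the model for one of the allowed labels, and passes when the returned label appears in passing_labels. The rough local approximation below calls the Azure OpenAI chat completions API directly; it isn't how the hosted grader is implemented, and the client arguments are placeholders.

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",                                   # placeholder
    api_version="<api-version>",                                # placeholder
)

def label_grade(row: dict, model_deployment: str) -> dict:
    """Approximate the sentiment_check grader for a single dataset row."""
    passing_labels = ["positive", "neutral"]
    messages = [
        {"role": "system", "content": "Classify the sentiment as 'positive', 'neutral', or 'negative'. Answer with the label only."},
        {"role": "user", "content": f"Statement: {row['query']}"},
    ]
    completion = client.chat.completions.create(model=model_deployment, messages=messages)
    label = completion.choices[0].message.content.strip().lower()
    return {"label": label, "passed": label in passing_labels}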

Score grader

The score grader (score_model) uses an LLM to assign a numeric score to model outputs, reflecting quality, correctness, or similarity to a reference. Use it for nuanced evaluation requiring reasoning.
{
    "type": "score_model",
    "name": "quality_score",
    "model": model_deployment,
    "input": [
        {"role": "system", "content": "Rate the response quality from 0 to 1. 1 = perfect, 0 = completely wrong."},
        {"role": "user", "content": "Response: {{item.response}}\nGround Truth: {{item.ground_truth}}"},
    ],
    "pass_threshold": 0.7,
    "range": [0, 1]
}
Output: Returns a float score (for example, 0.85). The grader passes if the score meets or exceeds pass_threshold.

String check grader

The string check grader (string_check) performs deterministic string comparisons. Use it for exact match validation where responses must match a reference exactly.
{
    "type": "string_check",
    "name": "exact_match",
    "input": "{{item.response}}",
    "reference": "{{item.ground_truth}}",
    "operation": "eq",
}
Operations:
Operation | Description
eq | Exact match (case-sensitive)
ne | Not equal
like | Pattern match with wildcards
ilike | Case-insensitive pattern match
Output: Returns a score of 1 for match, 0 for no match.
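The local sketch below illustrates the operation semantics. It treats like and ilike as shell-style wildcard patterns (*, ?), which is an assumption for illustration only; rely on the table above, not this sketch, for the service's exact pattern syntax.

from fnmatch import fnmatchcase

def string_check(input_text: str, reference: str, operation: str) -> bool:
    """Illustrative only; the 'like'/'ilike' wildcard handling is an assumption."""
    if operation == "eq":
        return input_text == reference
    if operation == "ne":
        return input_text != reference
    if operation == "like":
        return fnmatchcase(input_text, reference)
    if operation == "ilike":
        return fnmatchcase(input_text.lower(), reference.lower())
    raise ValueError(f"Unknown operation: {operation}")

print(string_check("It's sunny and warm with clear skies.", "*sunny*", "like"))  # True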

Text similarity grader

The text similarity grader (text_similarity) compares two text strings using similarity metrics. Use it for open-ended or paraphrase matching where exact match is too strict.
{
    "type": "text_similarity",
    "name": "similarity_check",
    "input": "{{item.response}}",
    "reference": "{{item.ground_truth}}",
    "evaluation_metric": "bleu",
    "pass_threshold": 0.8,
}
Metrics:
Metric | Description
fuzzy_match | Approximate string matching using edit distance
bleu | N-gram overlap score, commonly used for translation
gleu | Google’s variant of BLEU with sentence-level scoring
meteor | Alignment-based metric considering synonyms and paraphrases
cosine | Cosine similarity on vectorized text
rouge_* | N-gram overlap variants (rouge_1, rouge_2, …, rouge_l)
Output: Returns a similarity score as a float (higher means more similar). The grader passes if the score meets or exceeds pass_threshold.
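To build intuition for choosing pass_threshold, you can approximate an edit-distance-style similarity locally with difflib. This is only a rough stand-in for fuzzy_match; the service's metric implementations can score differently.

from difflib import SequenceMatcher

def rough_similarity(a: str, b: str) -> float:
    """Edit-distance-style similarity in [0, 1]; an approximation, not fuzzy_match itself."""
    return SequenceMatcher(None, a, b).ratio()

response = "The team discussed Q3 goals and assigned action items."
ground_truth = "Meeting covered quarterly objectives and task assignments."
score = rough_similarity(response, ground_truth)
print(f"{score:.2f}", "pass" if score >= 0.8 else "fail")  # compare against pass_threshold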

Example output

Graders return results with pass/fail status. Key output fields:
{
    "type": "score_model",
    "name": "quality_score",
    "score": 0.85,
    "passed": true
}
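When you collect results like this for every dataset row (for example, from a cloud run or a local loop), a simple aggregation can summarize them. The list below is hypothetical sample data shaped like the example output above.

# Hypothetical per-row grader results shaped like the example output above.
results = [
    {"type": "score_model", "name": "quality_score", "score": 0.85, "passed": True},
    {"type": "score_model", "name": "quality_score", "score": 0.55, "passed": False},
]

pass_rate = sum(r["passed"] for r in results) / len(results)
mean_score = sum(r["score"] for r in results) / len(results)
print(f"pass rate: {pass_rate:.0%}, mean score: {mean_score:.2f}")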