Textual similarity evaluators
This article refers to the Microsoft Foundry (new) portal.
The Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview. The underlying APIs are generally available for model and dataset evaluation; agent evaluation remains in public preview. Evaluators marked (preview) in this article are currently in public preview across all of these surfaces.
Similarity
Similarity measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truth, this metric focuses on the semantics of a response instead of simple overlap in tokens or n-grams, and it also considers the broader context of a query.
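The built-in evaluator's prompt and scoring rubric aren't reproduced here. The following is only a minimal sketch of the LLM-as-judge pattern that this kind of evaluator relies on, assuming the `openai` Python package, an API key in the environment, and a placeholder model name.

```python
# Minimal LLM-as-judge sketch for semantic similarity (illustration only,
# not the built-in evaluator's prompt or rubric).
# Assumes the `openai` package and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Rate the semantic similarity of the response to the ground truth, "
    "given the query, on an integer scale of 1 (unrelated) to 5 (equivalent). "
    "Reply with the number only."
)

def similarity_score(query: str, response: str, ground_truth: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use your own model or deployment name
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Query: {query}\nResponse: {response}\nGround truth: {ground_truth}"
            )},
        ],
    )
    return int(completion.choices[0].message.content.strip())

print(similarity_score(
    "What is the capital of France?",
    "Paris is the capital of France.",
    "The capital of France is Paris.",
))  # typically 5 for semantically equivalent answers
```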
F1 score
F1 score measures similarity by the tokens shared between the generated text and the ground truth, accounting for both precision and recall. The number of words shared between the generated response and the ground truth answer is the basis of the score:
- Precision is the ratio of the number of shared words to the total number of words in the generation.
- Recall is the ratio of the number of shared words to the total number of words in the ground truth.
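As a rough illustration of the precision and recall calculation, here is a minimal token-overlap F1 in plain Python. The built-in evaluator's tokenization and text normalization may differ.

```python
# Minimal token-overlap F1 sketch; the built-in evaluator's tokenization
# and normalization may differ from this naive whitespace split.
from collections import Counter

def f1_score(response: str, ground_truth: str) -> float:
    gen_tokens = response.lower().split()
    ref_tokens = ground_truth.lower().split()
    # Shared words, counting duplicates only as often as they appear in both texts.
    shared = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(gen_tokens)  # shared words / words in the generation
    recall = shared / len(ref_tokens)     # shared words / words in the ground truth
    return 2 * precision * recall / (precision + recall)

print(f1_score("The capital of France is Paris.", "Paris is the capital of France."))
# ~0.67 here, because punctuation attaches to tokens with this naive split
```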
BLEU score
BLEU score computes the Bilingual Evaluation Understudy (BLEU) score commonly used in natural language processing and machine translation. It measures how closely the generated text matches the reference text.
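For reference, a sentence-level BLEU score can be computed with NLTK as shown below; this is an illustration of the metric, not the built-in evaluator's exact tokenization or smoothing.

```python
# BLEU sketch using NLTK (pip install nltk); illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The capital of France is Paris .".split()   # ground truth tokens
hypothesis = "Paris is the capital of France .".split()  # generated tokens

score = sentence_bleu(
    [reference],  # BLEU accepts a list of references
    hypothesis,
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short texts
)
print(round(score, 3))  # 0-1 float; higher means closer n-gram overlap with the reference
```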
GLEU score
GLEU score computes the Google-BLEU (GLEU) score. It measures similarity by shared n-grams between the generated text and the ground truth. Similar to the BLEU score, it focuses on both precision and recall, but it addresses the drawbacks of the BLEU score by using a per-sentence reward objective.
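NLTK also ships a sentence-level GLEU implementation; the sketch below is an illustration of the metric with the same caveats about tokenization as above.

```python
# GLEU sketch using NLTK (pip install nltk); illustration only.
from nltk.translate.gleu_score import sentence_gleu

reference = "The capital of France is Paris .".split()
hypothesis = "Paris is the capital of France .".split()

# sentence_gleu takes a list of tokenized references and one tokenized hypothesis.
score = sentence_gleu([reference], hypothesis)
print(round(score, 3))  # 0-1 float
```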
ROUGE score
ROUGE score computes the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between the generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score.
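The sketch below uses Google's `rouge-score` package to show what the precision, recall, and F-measure components look like for two common ROUGE variants. It illustrates the metric itself; the built-in evaluator's rouge_type parameter selects which variant is reported.

```python
# ROUGE sketch using the rouge-score package (pip install rouge-score); illustration only.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
# score(target, prediction): target is the ground truth, prediction is the generation.
result = scorer.score(
    "The capital of France is Paris.",
    "Paris is the capital of France.",
)
for name, s in result.items():
    print(name, round(s.precision, 3), round(s.recall, 3), round(s.fmeasure, 3))
```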
METEOR score
METEOR score measures similarity by shared n-grams between the generated text and the ground truth. Similar to the BLEU score, it focuses on precision and recall, but it addresses limitations of other metrics, such as BLEU, by considering synonyms, stemming, and paraphrasing for content alignment.
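NLTK provides a METEOR implementation that credits synonym and stem matches via WordNet; the sketch below illustrates the metric and is not the built-in evaluator's exact configuration.

```python
# METEOR sketch using NLTK (pip install nltk); illustration only.
# Requires WordNet data:  python -c "import nltk; nltk.download('wordnet')"
# Recent NLTK versions expect pre-tokenized input.
from nltk.translate.meteor_score import meteor_score

reference = "The capital of France is Paris .".split()
hypothesis = "Paris is the capital of France .".split()

score = meteor_score([reference], hypothesis)  # list of references, one hypothesis
print(round(score, 3))  # 0-1 float; credits synonyms and stems, not just exact matches
```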
Using textual similarity evaluators
Textual similarity evaluators compare generated responses against ground truth text using different approaches:
- Similarity - LLM-based semantic similarity evaluation
- F1, BLEU, GLEU, ROUGE, METEOR - Algorithmic token/n-gram overlap metrics
| Evaluator | What it measures | Required inputs | Required parameters | Output | Default threshold |
|---|---|---|---|---|---|
| builtin.similarity | Semantic similarity to ground truth | query, response, ground_truth | deployment_name | 1-5 integer | 3 |
| builtin.f1_score | Token overlap using precision and recall | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.bleu_score | N-gram overlap (machine translation metric) | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.gleu_score | Per-sentence reward variant of BLEU | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.rouge_score | Recall-oriented n-gram overlap | ground_truth, response | rouge_type | 0-1 float | 0.5 |
| builtin.meteor_score | Weighted alignment with synonyms | ground_truth, response | (none) | 0-1 float | 0.5 |
Example input
Your test dataset should contain the fields referenced in your data mappings.
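For example, a JSONL test dataset for the ground-truth-based evaluators could look like the following. The field names here (query, response, ground_truth) are illustrative; they only need to match the fields your data mappings reference.

```jsonl
{"query": "What is the capital of France?", "response": "Paris is the capital of France.", "ground_truth": "The capital of France is Paris."}
{"query": "Who wrote Hamlet?", "response": "Hamlet was written by William Shakespeare.", "ground_truth": "William Shakespeare wrote Hamlet."}
```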
Configuration example
Data mapping syntax: `{{item.field_name}}` references fields from your test dataset (for example, `{{item.response}}`).
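As a hypothetical illustration only, a mapping for builtin.f1_score might be expressed like the following. The key names (evaluator, data_mapping) are assumptions for this sketch, not the documented Foundry schema; the placeholder values show how the `{{item.field_name}}` syntax points at dataset fields.

```python
# Hypothetical shape only: the key names below are assumptions for illustration,
# not the documented Foundry configuration schema.
f1_config = {
    "evaluator": "builtin.f1_score",             # evaluator ID from the table above
    "data_mapping": {
        "response": "{{item.response}}",          # maps to the dataset's response field
        "ground_truth": "{{item.ground_truth}}",  # maps to the dataset's ground_truth field
    },
}
```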
Example output
LLM-based evaluators like similarity use a 1-5 Likert scale. Algorithmic evaluators output 0-1 floats. All evaluators report pass or fail based on their thresholds, so the key output fields are the numeric score and the threshold-based pass/fail result.
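As an illustration only, a single result row for the F1 evaluator might carry values like the following. The field names follow a common score/threshold/result pattern and are assumptions for this sketch, not the exact Foundry output schema.

```python
# Illustrative only: field names are assumptions, not the exact output schema.
example_row = {
    "f1_score": 0.67,      # 0-1 float for algorithmic evaluators (1-5 for LLM-based)
    "f1_threshold": 0.5,   # default threshold from the table above
    "f1_result": "pass",   # pass/fail determined by comparing the score to the threshold
}
```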