
Evaluate generative AI models and applications by using Microsoft Foundry

This article refers to the Microsoft Foundry (new) portal.
To thoroughly assess the performance of your generative AI models and applications against a substantial dataset, start an evaluation run. During the evaluation, the model or application is tested with the given dataset, and its performance is measured with both mathematical metrics and AI-assisted metrics. The evaluation run provides comprehensive insights into the application's capabilities and limitations.

Use the evaluation functionality in the Microsoft Foundry portal, a platform that offers tools and features for assessing the performance and safety of generative AI models. In the Foundry portal, you can log, view, and analyze detailed evaluation metrics.

This article explains how to create an evaluation run against a model, an agent, or a test dataset by using built-in evaluation metrics in the Foundry UI. For greater flexibility, you can set up a custom evaluation flow and use the custom evaluation feature. You can also use the custom evaluation feature to conduct a batch run without any evaluation.

Prerequisites

  • A model, an agent, or a test dataset in one of these formats: CSV or JSON Lines (JSONL). (See the example dataset after this list.)
  • An Azure OpenAI connection with a deployment of one of these models: a GPT-3.5 model, a GPT-4 model, or a Davinci model. Required only when you run AI-assisted quality evaluations.
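
For reference, a JSONL test dataset has one JSON object per line, with one field per value that the evaluators read during data mapping. The field names below (query, response, context, ground_truth) are only an example; use whatever column names your dataset already has and confirm them in the data mapping step.

```jsonl
{"query": "What is the refund window?", "response": "You can request a refund within 30 days of purchase.", "context": "Refund policy: purchases can be refunded within 30 days.", "ground_truth": "Refunds are available for 30 days after purchase."}
{"query": "How do I reset my password?", "response": "Use the 'Forgot password' link on the sign-in page.", "context": "Password resets are self-service via the sign-in page.", "ground_truth": "Select 'Forgot password' on the sign-in page."}
```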

Create an evaluation with built-in evaluation metrics

From the evaluate page

From the left pane, select Evaluation > Create.

From the model or agent playground page

From the model playground or the agent playground page, select Evaluation > Create, or select Metrics > Run full evaluation.

Evaluation target

When you start an evaluation from the Evaluate page, you first need to choose the evaluation target. Specifying the appropriate evaluation target tailors the evaluation to the specific nature of your application, which helps ensure accurate and relevant metrics. Three types of evaluation targets are supported:
  • Model: This choice evaluates the output generated by your selected model and user-defined prompt.
  • Agent: This choice evaluates the output generated by your selected agent and user-defined prompt.
  • Dataset: Your model or agent-generated outputs are already in a test dataset.

Select or create a dataset

If you choose to evaluate a model or agent, you need a dataset to serve as input to that target so that evaluators can assess the responses. In the dataset step, you can select or upload a dataset of your own, or you can generate a dataset synthetically.
  • Add new dataset: Upload files from your local storage. Only CSV and JSONL file formats are supported. A preview of your test data displays on the right pane.
  • Synthetic dataset generation: Synthetic datasets are useful when you lack data, or lack access to data, for testing the model or agent you built. With synthetic data generation, you choose the resource that generates the data and the number of rows to generate, and you enter a prompt that describes the type of data you want. You can also upload files to improve the relevance of the dataset to the intended task of your agent or model. (A sketch of this generation step, done outside the portal, follows the region note below.)
Synthetic data generation isn't available in all regions. It's available in regions that support the Responses API. For an up-to-date list of supported regions, see Azure OpenAI Responses API region availability.
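
The portal handles synthetic generation for you, but the following is a minimal sketch of the same idea using the openai Python package against an Azure OpenAI deployment: ask a model to produce sample queries and write them to a JSONL file that you could then upload as a dataset. The endpoint, key, deployment name, and prompt are placeholders, and the portal's own generation pipeline may differ.

```python
# Minimal sketch of synthetic test-data generation (placeholders noted below).
# The Foundry portal does this for you; this only illustrates the idea.
import json
import os

from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder: your endpoint
    api_key=os.environ["AZURE_OPENAI_API_KEY"],          # placeholder: your key
    api_version="2024-06-01",
)

prompt = (
    "Generate 5 customer-support questions about password resets. "
    "Return them as a JSON array of strings, with no extra text."
)

completion = client.chat.completions.create(
    model="gpt-4o",  # placeholder: the name of your model deployment
    messages=[{"role": "user", "content": prompt}],
)

queries = json.loads(completion.choices[0].message.content)

# Write one JSON object per line so the file can be uploaded as a JSONL dataset.
with open("synthetic_test_data.jsonl", "w", encoding="utf-8") as f:
    for q in queries:
        f.write(json.dumps({"query": q}) + "\n")
```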

Configure testing criteria

We support three types of metrics curated by Microsoft to facilitate a comprehensive evaluation of your application:
  • AI quality (AI assisted): These metrics evaluate the overall quality and coherence of the generated content. You need a model deployment as judge to run these metrics.
  • AI quality (NLP): These natural language processing (NLP) metrics are mathematically based, and they also evaluate the overall quality of the generated content. They often require ground truth data, but they don't require a model deployment as judge.
  • Risk and safety metrics: These metrics focus on identifying potential content risks and ensuring the safety of the generated content.
You can also create custom metrics and select them as evaluators during the testing criteria step. As you add testing criteria, different metrics are used as part of the evaluation. Refer to the following table for the complete list of supported metrics in each scenario. For more in-depth information on metric definitions and how they're calculated, see Built-in evaluators.
| AI quality (AI assisted) | AI quality (NLP) | Risk and safety metrics |
| --- | --- | --- |
| Groundedness, Relevance, Coherence, Fluency, GPT similarity | F1 score, ROUGE score, BLEU score, GLEU score, METEOR score | Self-harm-related content, Hateful and unfair content, Violent content, Sexual content, Protected material, Indirect attack |
When you run an AI-assisted quality evaluation, you must specify a GPT model deployment for the grading process.

AI quality (NLP) metrics are mathematically based measurements that assess your application's performance. They often require ground truth data for calculation. ROUGE is a family of metrics, and you can select which ROUGE type to use for the scores. The various ROUGE types offer different ways to evaluate the quality of generated text; for example, ROUGE-N measures the overlap of n-grams between the candidate and reference texts.

For risk and safety metrics, you don't need to provide a deployment. The Foundry portal provisions a GPT-4 model that generates content-risk severity scores and reasoning, so you can evaluate your application for content harms.
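
To make the ROUGE-N idea concrete, the following is a minimal sketch (not the portal's exact implementation, which may tokenize and aggregate differently): ROUGE-N recall is the fraction of reference n-grams that also appear in the candidate, and precision and F1 follow from the same overlap counts.

```python
# Minimal ROUGE-N sketch: n-gram overlap between a candidate and a reference.
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 2) -> dict[str, float]:
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: ROUGE-2 between a generated response and a ground-truth answer.
print(rouge_n(
    "the cat sat on the mat",
    "the cat is sitting on the mat",
    n=2,
))
```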

Data mapping

Different evaluation metrics demand distinct types of data inputs for accurate calculations. Based on the dataset you generated or uploaded, the portal automatically maps the dataset fields to the fields that the evaluators expect. However, always double-check the field mapping to make sure it's accurate. You can reassign fields if needed. (A code sketch of the same mapping idea follows the field definitions at the end of this section.)
Query and response metric requirements
For the specific data mapping requirements of each metric, refer to the following table:
| Metric | Query | Response | Context | Ground truth |
| --- | --- | --- | --- | --- |
| Groundedness | Required: Str | Required: Str | Required: Str | Doesn't apply |
| Coherence | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Fluency | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Relevance | Required: Str | Required: Str | Required: Str | Doesn't apply |
| GPT similarity | Required: Str | Required: Str | Doesn't apply | Required: Str |
| F1 score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| BLEU score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| GLEU score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| METEOR score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| ROUGE score | Doesn't apply | Required: Str | Doesn't apply | Required: Str |
| Self-harm-related content | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Hateful and unfair content | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Violent content | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Sexual content | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Protected material | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
| Indirect attack | Required: Str | Required: Str | Doesn't apply | Doesn't apply |
  • Query: A query seeking specific information.
  • Response: The response to a query generated by the model.
  • Context: The source that the response is based on. (Example: grounding documents.)
  • Ground truth: A query response generated by a human user that serves as the true answer.
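
Outside the portal, the same built-in evaluators are also exposed in the azure-ai-evaluation Python package, where this field mapping is expressed as a column mapping from your dataset columns to evaluator inputs. The following is a minimal sketch under that assumption; the endpoint, key, deployment name, file name, and column names are placeholders, and the portal UI performs the equivalent mapping for you.

```python
# Minimal sketch of field mapping with the azure-ai-evaluation Python package.
# The Foundry portal expresses the same mapping in the UI instead.
# Endpoint, key, deployment, file, and column names below are placeholders.
import os

from azure.ai.evaluation import CoherenceEvaluator, GroundednessEvaluator, evaluate

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # placeholder: your judge model deployment
}

result = evaluate(
    data="test_data.jsonl",  # one JSON object per line
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
    # Map dataset columns to the inputs each evaluator expects.
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
                "context": "${data.context}",
            }
        },
        "coherence": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
            }
        },
    },
)
print(result["metrics"])
```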

Review and submit

After you complete all the necessary configurations, provide a name for your evaluation. Then review the settings and select Submit to submit the evaluation run.