
See evaluation results in the Microsoft Foundry portal

This article refers to the Microsoft Foundry (new) portal.
In this article, you learn to:
  • Locate and open evaluation runs.
  • View aggregate and sample-level metrics.
  • Compare results across runs.
  • Interpret metric categories and calculations.
  • Troubleshoot missing or partial metrics.


See your evaluation results

After submitting an evaluation, you can track its progress on the Evaluation details page. When the evaluation completes, the page displays key information such as:
  • The evaluation creator
  • Evaluation token usage
  • Scores for each evaluator, broken down by run
Screenshot of the evaluation details page showing evaluation runs.
Select a specific run to drill into row‑level results. Select Learn more about metrics for definitions and formulas.

Evaluation run details

To view the row-level data for an individual run, select the run's name. This opens a view that shows evaluation results for each query against every evaluator used in the run. Here, you can see details such as the query, response, ground truth, and each evaluator's score and explanation.
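As a rough reference, the following minimal sketch models one row-level result as a Python data structure. The class and field names are illustrative assumptions that mirror the columns shown in the run details view; they are not the portal's export schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluatorResult:
    """Score and reasoning from a single evaluator for one row (illustrative)."""
    evaluator: str        # for example, "Groundedness" or "Relevance"
    score: float          # typically a 1-5 rating for quality evaluators
    explanation: str      # the evaluator's reasoning for the score


@dataclass
class EvaluationRow:
    """One row-level result: the inputs plus per-evaluator outputs (illustrative)."""
    query: str
    response: str
    ground_truth: str | None = None
    results: list[EvaluatorResult] = field(default_factory=list)


# Example row, mirroring the query, response, ground truth, score, and explanation columns.
row = EvaluationRow(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
    ground_truth="Paris",
    results=[EvaluatorResult("Groundedness", 5.0, "The response is fully supported.")],
)
print(row.results[0].evaluator, row.results[0].score)
```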

Compare the evaluation results

To compare two or more runs side by side:
  1. Select two or more runs on the evaluation details page.
  2. Select Compare.
This generates a side-by-side comparison view for all selected runs. The comparison is computed with a statistical t-test, which gives you more sensitive and reliable results to base decisions on. The feature offers the following capabilities:
  • Baseline comparison: By setting a baseline run, you establish a reference point against which the other runs are compared, so you can see how each run deviates from your chosen standard.
  • Statistical significance assessment: Each cell is color-coded to show the statistical-significance result. Hover over a cell to see the sample size and p-value.
| Legend | Definition |
| --- | --- |
| Improved (strong) | Highly statistically significant (p <= 0.001) and moved in the desired direction |
| Improved (weak) | Statistically significant (0.001 < p <= 0.05) and moved in the desired direction |
| Degraded (strong) | Highly statistically significant (p <= 0.001) and moved in the wrong direction |
| Degraded (weak) | Statistically significant (0.001 < p <= 0.05) and moved in the wrong direction |
| Changed (strong) | Highly statistically significant (p <= 0.001) and the desired direction is neutral |
| Changed (weak) | Statistically significant (0.001 < p <= 0.05) and the desired direction is neutral |
| Inconclusive | Too few examples, or p >= 0.05 |
The comparison view won’t be saved. If you leave the page, you can reselect the runs and select Compare to regenerate the view.
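As a rough illustration of how these categories can be derived, the sketch below runs a two-sample t-test over per-row scores from two runs and maps the p-value and direction of change onto the legend above. It assumes the scipy package and a plain two-sample t-test; the portal's exact statistical procedure and sample-size threshold may differ.

```python
from scipy import stats


def classify_change(baseline_scores, candidate_scores, higher_is_better=True, min_samples=10):
    """Map a two-sample t-test result onto the comparison legend (illustrative).

    Thresholds mirror the legend: p <= 0.001 is "strong", 0.001 < p <= 0.05 is
    "weak", and anything else (or too few samples) is "Inconclusive".
    """
    # min_samples is an arbitrary placeholder for "too few examples".
    if min(len(baseline_scores), len(candidate_scores)) < min_samples:
        return "Inconclusive"

    _, p_value = stats.ttest_ind(candidate_scores, baseline_scores)
    if p_value > 0.05:
        return "Inconclusive"

    strength = "strong" if p_value <= 0.001 else "weak"
    delta = (sum(candidate_scores) / len(candidate_scores)
             - sum(baseline_scores) / len(baseline_scores))

    if higher_is_better is None:          # no desired direction for this metric
        return f"Changed ({strength})"
    improved = (delta > 0) == higher_is_better
    return f"{'Improved' if improved else 'Degraded'} ({strength})"


# Example: per-row groundedness scores (1-5) from a baseline run and a candidate run.
baseline = [3, 4, 3, 4, 3, 4, 3, 3, 4, 3, 4, 3]
candidate = [4, 5, 4, 5, 4, 5, 4, 4, 5, 4, 5, 4]
print(classify_change(baseline, candidate))  # expected: an "Improved" category
```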

Understand the built-in evaluation metrics

Understanding the built-in metrics is essential for assessing the performance and effectiveness of your AI application. Knowing what each metric measures helps you interpret results, make informed decisions, and fine-tune your application for optimal outcomes. To learn more, see Built-in evaluators.
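For orientation, here's a minimal local sketch that runs one built-in quality evaluator with the azure-ai-evaluation Python package. The package, class, and parameter names are assumptions that may vary by SDK version, and the connection values are placeholders; check the Built-in evaluators documentation for the exact API.

```python
# Minimal sketch, assuming the azure-ai-evaluation Python package.
from azure.ai.evaluation import GroundednessEvaluator

# Placeholder connection details for the judge model (assumed configuration keys).
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment-name>",
}

groundedness = GroundednessEvaluator(model_config)

# Score a single response against the context it should be grounded in.
result = groundedness(
    response="Paris is the capital of France.",
    context="France's capital city is Paris.",
)
print(result)  # typically includes a 1-5 score and a short explanation
```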

Troubleshooting

| Symptom | Possible cause | Action |
| --- | --- | --- |
| Run stays pending | High service load or queued jobs | Refresh, verify quota, and resubmit if prolonged |
| Metrics missing | Not selected at creation | Rerun and select required metrics |
| All safety metrics zero | Category disabled or unsupported model | Confirm model and metric support matrix |
| Groundedness unexpectedly low | Retrieval/context incomplete | Verify context construction / retrieval latency |
Learn how to evaluate your generative AI applications.