
Compare models using the model leaderboard (preview)

This article refers to the Microsoft Foundry (new) portal.
This article shows you how to streamline model selection in the Foundry model catalog by using the model leaderboards (preview) and side-by-side comparison features in the Foundry portal. By comparing benchmark scores across quality, safety, estimated cost, and throughput, viewing trade-off charts, and evaluating models side by side, you can identify the best model for your scenario. Analyze and compare models using:
  • Model leaderboards to quickly identify the top-performing models for quality, safety, estimated cost, and throughput
  • Trade-off charts to visually compare model performance across two metrics, such as quality versus cost
  • Leaderboards by scenario to find the most relevant benchmark leaderboard for your specific scenario
  • Compare models to evaluate features, performance, and estimated cost in a side-by-side view
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Prerequisites

  • An Azure subscription. If you don’t have one, create an account before you begin.
  • A Foundry project. The steps in this article assume that you’re working in a project.

Access model leaderboards

Use the model catalog to access the leaderboard and identify top-performing models.
  1. If you’re not already in your project, select it.
  2. Select Discover from the top navigation bar to browse the model catalog. The overview page shows a snapshot of the model leaderboard at the top.
  3. Sort on different metrics — quality, safety, estimated cost, and throughput — to view the top models for each criterion. Select any model to go to its detail page, or select Go to leaderboard to view the full list. The model detail page shows the model’s overview, benchmarks, and deployment options.
  4. On the Model leaderboard page, view the top leaders for quality, safety, throughput, and estimated cost. Expand a chart to see a bar graph of the top 10 models for that metric.
  5. Scroll past the charts to find the full model leaderboard. Sort on the metrics you care most about and select a model name to go to its detail page. The leaderboard table appears with sortable columns for quality, safety, throughput, and estimated cost.
  6. Select two or three models for detailed feature comparison in a side-by-side view. The comparison view appears, showing features, performance, and estimated cost for your selected models.

Trade-off charts

The trade-off chart lets you visually compare models based on the criteria that matter most to you. For example, the highest-quality model might not be the cheapest, so you need to make trade-offs among quality, safety, cost, and throughput. The trade-off chart shows at a glance how models perform along two metrics. Use it on the model leaderboard page to visualize how models balance competing criteria:
  1. Use the Compare quality against dropdown to plot quality against estimated cost, throughput, or safety.
  2. Add or remove models from the trade-off chart using the model selector on the right side of the chart.
  3. Hover over a data point to view the exact scores for the selected model. Models closer to the top-right corner of the chart perform well on both axes.
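If you record benchmark or evaluation numbers outside the portal, you can reproduce a similar trade-off view locally. The following is a minimal sketch that plots two metrics against each other with matplotlib; the model names and scores are hypothetical placeholders, not portal data. As in the portal chart, points toward the top right do well on both axes.

```python
# Minimal offline sketch of a quality-vs-throughput trade-off chart.
# Model names and scores are hypothetical placeholders, not real benchmark data.
import matplotlib.pyplot as plt

models = {
    "model-a": {"quality": 82, "throughput": 34},   # quality index (0-100), tokens/sec
    "model-b": {"quality": 74, "throughput": 95},
    "model-c": {"quality": 68, "throughput": 120},
}

fig, ax = plt.subplots()
for name, scores in models.items():
    ax.scatter(scores["throughput"], scores["quality"])
    ax.annotate(name, (scores["throughput"], scores["quality"]),
                textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Throughput (tokens/sec, higher is better)")
ax.set_ylabel("Quality index (higher is better)")
ax.set_title("Quality vs. throughput trade-off (hypothetical data)")
plt.show()
```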

View leaderboards by scenario

Suppose you have a scenario that requires certain model capabilities. For example, if you’re building a question-answering chatbot that needs strong reasoning capabilities and resistance to harmful content, compare models in capability-specific leaderboards.
  1. On the model leaderboard page, scroll to the Leaderboards by scenario section.
  2. Select a scenario category, such as Reasoning, Coding, or Question answering.
  3. Review the ranked list of models for your selected scenario. The leaderboard shows how models perform on benchmarks that are specific to that capability.
After exploring the leaderboards, decide on a model to use.

Compare models

The side-by-side comparison view lets you evaluate up to three models simultaneously across multiple dimensions.
  1. From the model leaderboard, select two or three models by checking the boxes next to their names.
  2. Select Compare to open the side-by-side comparison view.
  3. Review the comparison across the following tabs:
    • Performance benchmarks: Quality, safety, and throughput scores from public datasets
    • Model details: Context window, training data, and supported languages
    • Supported endpoints: Deployment options such as serverless API and managed compute
    • Feature support: Capabilities like function calling, structured output, and vision
  4. To learn more about a specific model, select View details. If you’re ready to start using a model, select Deploy.
Screenshot showing the compare model experience in Microsoft Foundry.
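If you track benchmark numbers yourself, the same side-by-side layout is easy to reproduce locally for note-taking or reporting. The sketch below uses pandas with hypothetical values; the portal’s Compare view remains the authoritative source for real scores.

```python
# Local side-by-side comparison sketch with hypothetical values;
# rows are metrics and columns are models, mirroring the Compare view.
import pandas as pd

comparison = pd.DataFrame(
    {
        "model-a": {"quality_index": 82, "safety_index": 90, "throughput_tps": 34, "est_cost_per_1m_tokens": 4.00},
        "model-b": {"quality_index": 74, "safety_index": 88, "throughput_tps": 95, "est_cost_per_1m_tokens": 1.50},
        "model-c": {"quality_index": 68, "safety_index": 92, "throughput_tps": 120, "est_cost_per_1m_tokens": 0.80},
    }
)

print(comparison)
```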

View model benchmarks

To view benchmarks for a specific model, select the model name from the leaderboard or search for it in the model catalog. On the model detail page, select the Benchmarks tab.
Benchmark data isn’t available for all models in the catalog. If a model doesn’t have a Benchmarks tab, benchmark results haven’t been published for that model yet. Benchmark results from evaluations conducted by Microsoft are available for select models, and benchmarking information reported by model providers is included when available.
When you’re in the Benchmarks tab for a specific model, you can gather extensive information to better understand and interpret the benchmark results:
  • Public data benchmark results: High-level aggregate scores for AI quality, safety, estimated cost, latency, and throughput that provide a quick overview of the model’s performance.
  • Comparative charts: These charts display the model’s relative position compared to related models.
  • Metric comparison table: This table presents detailed results for each metric.
Screenshot showing the metric comparison table on the benchmarks tab.
By default, Foundry displays an average index across various metrics and datasets to provide a high-level overview of model performance.
Benchmark scores are normalized indexes. A higher score indicates better performance for quality and safety metrics. For cost and throughput, lower estimated cost and higher throughput are generally preferred. Use the trade-off charts to balance these competing criteria for your scenario.
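Foundry doesn’t publish the exact formula behind these indexes, but the idea of averaging per-dataset scores and then weighing quality against estimated cost can be sketched with hypothetical numbers. The scores, weights, and inverse-cost term below are illustrative assumptions, not Foundry’s normalization.

```python
# Rough illustration only: hypothetical per-dataset scores and a simple
# weighting scheme, not the normalization Foundry actually applies.
quality_by_dataset = {"dataset-1": 0.78, "dataset-2": 0.84, "dataset-3": 0.71}

# Average index across datasets, similar in spirit to the high-level
# overview score shown by default on the leaderboard.
quality_index = sum(quality_by_dataset.values()) / len(quality_by_dataset)

# Balance quality against estimated cost with your own weights.
# Cost is inverted so that a higher combined score is always better.
estimated_cost_per_1m_tokens = 2.50   # hypothetical
quality_weight, cost_weight = 0.7, 0.3
combined = quality_weight * quality_index + cost_weight * (1 / (1 + estimated_cost_per_1m_tokens))

print(f"Average quality index: {quality_index:.2f}")
print(f"Weighted quality/cost score: {combined:.2f}")
```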
To access benchmark results for a specific metric and dataset:
  1. Select the expand button on the chart. The pop-up comparison chart reveals detailed information and offers greater flexibility for comparison.
  2. Select the metric of interest and choose different datasets, based on your specific scenario. For more detailed definitions of the metrics and descriptions of the public datasets used to calculate results, select Read more.
For side-by-side comparisons, use the Compare models view to evaluate features, performance metrics, and estimated costs across multiple models.
The Try with your own data option on the benchmarks tab is available only in Foundry (classic). To evaluate a model with your own data in the new portal, see Evaluate generative AI apps.

Troubleshooting

  • Model doesn’t appear in leaderboard: Not all models are benchmarked. Check the model catalog for availability.
  • No Benchmarks tab on model card: Benchmark results haven’t been published for this model yet.
  • Benchmark scores differ from my results: Public benchmarks use standardized datasets and might not reflect performance on your specific data. To evaluate a model with your own data, see Evaluate generative AI apps.
  • Trade-off chart shows no data points: Ensure you have models selected in the model selector. At least two models are required for trade-off comparisons.
  • Can’t compare more than three models: The side-by-side comparison view supports a maximum of three models. Deselect a model before adding another.
  • Benchmark scores seem outdated: Microsoft updates benchmark scores periodically. Check the model detail page for the benchmark evaluation date.