Model leaderboards in Microsoft Foundry portal (preview)
This article refers to the Microsoft Foundry (new) portal.
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
The model leaderboards (preview) in the Microsoft Foundry portal help you compare models across the following dimensions:
- Quality benchmarking of language models to understand how well models perform on core tasks including reasoning, knowledge, question answering, math, and coding.
- Safety benchmarking of language models to understand how resistant models are to generating harmful content.
- Performance benchmarking of language models to understand how models perform in terms of latency and throughput.
- Cost benchmarking of language models to understand the estimated cost of using models.
- Scenario leaderboard benchmarking of language models to help you find the best model for your specific use case or scenario.
- Quality benchmarking of embedding models to understand how well models perform on embedding-based tasks including search and retrieval.
Model benchmarking scope
The model leaderboards feature a curated selection of text-based language models from the Foundry model catalog. Models are included based on the following criteria:
- Azure Direct Models prioritized: Azure Direct Models are selected for relevance to common generative AI scenarios.
- Core benchmark applicability: Models must support general-purpose language tasks such as reasoning, knowledge, question answering, mathematical reasoning, and coding. Specialized models (for example, protein folding or domain-specific QA) and other modalities aren’t supported.
Interpret leaderboard results
The leaderboards help you compare models across multiple dimensions so you can choose the right model for your use case. Here are some guidelines for interpreting the results:
- Quality index: A higher quality index indicates stronger overall performance across reasoning, coding, math, and knowledge tasks. Compare the quality index across models to identify top performers for general-purpose language tasks.
- Safety scores: Lower attack success rates indicate more robust models. Consider safety scores alongside quality scores, especially for customer-facing applications where harmful output is a significant concern.
- Performance trade-offs: Use the latency and throughput metrics to understand the real-world responsiveness of a model. A model with high quality but high latency might not suit real-time applications.
- Cost considerations: The estimated cost metric uses a three-to-one input-to-output token ratio. Adjust your expectations based on your actual workload’s input-to-output ratio.
- Scenario leaderboards: If your use case maps to a specific scenario (for example, coding or math), start with the scenario leaderboard to find models optimized for that task rather than relying solely on the overall quality index.
Quality benchmarks of language models
Foundry assesses the quality of LLMs and SLMs using accuracy scores from standard benchmark datasets that measure reasoning, knowledge, question answering, math, and coding capabilities.

| Index | Description |
|---|---|
| Quality index | Calculated by averaging applicable accuracy scores (exact_match, pass@1, arena_hard) across benchmark datasets. |
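As a rough illustration of how an averaged quality index can be derived, the following sketch averages per-dataset accuracy scores. The score values are hypothetical, and the actual Foundry pipeline may aggregate scores differently.

```python
# Minimal sketch: averaging per-dataset accuracy scores into a single quality index.
# Dataset names mirror the table below; the score values are made up for illustration.
dataset_scores = {
    "arena_hard": 0.62,      # arena_hard score
    "bigbench_hard": 0.71,   # exact_match
    "gpqa": 0.48,            # exact_match
    "humanevalplus": 0.83,   # pass@1
    "ifeval": 0.77,          # exact_match
    "math": 0.69,            # exact_match
    "mbppplus": 0.80,        # pass@1
    "mmlu_pro": 0.66,        # exact_match
}

quality_index = sum(dataset_scores.values()) / len(dataset_scores)
print(f"Quality index: {quality_index:.3f}")
```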
The quality index is calculated from the following benchmark datasets:

| Dataset Name | Category |
|---|---|
| arena_hard | QA |
| bigbench_hard (downsampled to 1,000 examples) | Reasoning |
| gpqa | QA |
| humanevalplus | Coding |
| ifeval | Reasoning |
| math | Math |
| mbppplus | Coding |
| mmlu_pro (downsampled to 1,000 examples) | General knowledge |
Dataset-level and model-level scores are based on the following metric:

| Metric | Description |
|---|---|
| Accuracy | Accuracy scores are available at the dataset and model levels. At the dataset level, the score is the average value of an accuracy metric computed over all examples in the dataset. The accuracy metric is exact_match in all cases, except for the HumanEvalPlus and MBPPPlus datasets, which use pass@1. Exact match compares model-generated text with the correct answer according to the dataset, reporting one if the generated text matches the answer exactly and zero otherwise. The pass@1 metric measures the proportion of model solutions that pass a set of unit tests in a code generation task. At the model level, the accuracy score is the average of the dataset-level accuracies. |
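The two accuracy metrics can be summarized in a short sketch. The helper functions below are illustrative rather than the Foundry evaluation code; real pipelines normalize answers and execute unit tests in a sandbox.

```python
# Illustrative definitions of the two accuracy metrics described above.

def exact_match(generated: str, reference: str) -> int:
    """Return 1 if the generated text matches the reference answer exactly, else 0."""
    return int(generated.strip() == reference.strip())

def pass_at_1(unit_test_results: list[bool]) -> float:
    """Proportion of generated solutions whose unit tests pass (one solution per task)."""
    return sum(unit_test_results) / len(unit_test_results)

# Dataset-level accuracy is the mean of per-example scores.
examples = [("Paris", "Paris"), ("4", "four"), ("42", "42")]
dataset_accuracy = sum(exact_match(g, r) for g, r in examples) / len(examples)
print(f"exact_match accuracy: {dataset_accuracy:.2f}")
```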
Safety benchmarks of language models
Safety benchmarks are selected through a structured filtering and validation process designed to ensure both relevance and rigor. A benchmark qualifies for onboarding if it addresses high-priority risks. The safety leaderboards include benchmarks that are reliable enough to provide meaningful signals on safety-related topics of interest. The leaderboards use HarmBench as a proxy for model safety and organize scenario leaderboards as follows:

| Dataset Name | Leaderboard Scenario | Metric | Interpretation |
|---|---|---|---|
| HarmBench (standard) | Standard harmful behaviors | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit standard harmful content |
| HarmBench (contextual) | Contextually harmful behaviors | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit contextually harmful content |
| HarmBench (copyright violations) | Copyright violations | Attack Success Rate | Lower values indicate stronger robustness against copyright violations |
| WMDP | Knowledge in sensitive domains | Accuracy | Higher values indicate more knowledge of dangerous capabilities; lower values are better from a safety standpoint |
| Toxigen | Toxic content detection | F1 Score | Higher values indicate better detection performance |
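Attack success rate (ASR) is the fraction of adversarial prompts that elicit the targeted harmful behavior. The sketch below shows only the basic calculation; the judgment of whether an individual attack succeeded comes from HarmBench's own evaluation, which isn't reproduced here.

```python
# Minimal sketch of the attack success rate (ASR) calculation.
# attack_outcomes holds one boolean per adversarial prompt:
# True if the model produced the targeted harmful behavior, False otherwise.
attack_outcomes = [False, False, True, False, False, False, True, False]

attack_success_rate = sum(attack_outcomes) / len(attack_outcomes)
print(f"Attack success rate: {attack_success_rate:.1%}")  # lower is better
```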
Harmful behavior detection
The HarmBench benchmark measures harmful behaviors using prompts designed to elicit unsafe responses. It covers seven semantic categories:
- Cybercrime and unauthorized intrusion
- Chemical and biological weapons or drugs
- Copyright violations
- Misinformation and disinformation
- Harassment and bullying
- Illegal activities
- General harm
HarmBench also evaluates three functional categories of behavior, each of which corresponds to a safety scenario leaderboard:
- Standard harmful behaviors
- Contextually harmful behaviors
- Copyright violations
Toxic content detection
Toxigen is a large-scale dataset for detecting adversarial and implicit hate speech. It includes implicitly toxic and benign sentences referencing 13 minority groups. Foundry uses annotated Toxigen samples and calculates F1 scores to measure classification performance. Higher scores indicate better toxic content detection. Benchmarking is performed with Foundry Guardrails (previously content filters) turned off.
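As a refresher on the metric, the following sketch computes an F1 score for a toxic-versus-benign classification run. The labels are hypothetical; Foundry's scoring operates on the annotated Toxigen samples.

```python
# Minimal sketch: F1 score for toxic (1) vs. benign (0) classification.
# Labels here are made up for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 score: {f1:.2f}")  # higher indicates better toxic content detection
```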
Sensitive domain knowledge
The Weapons of Mass Destruction Proxy (WMDP) benchmark measures model knowledge in sensitive domains, including biosecurity, cybersecurity, and chemical security. The leaderboard uses the average accuracy score across these three domains. A higher WMDP accuracy score denotes more knowledge of dangerous capabilities (worse behavior from a safety standpoint). Model benchmarking is performed with the default Foundry Guardrails (previously content filters) turned on. These guardrails detect and block content harms in the violence, self-harm, sexual, and hate and unfairness categories, but don't target cybersecurity, biosecurity, or chemical security.
Limitations of safety benchmarks
Safety is a complex topic with several dimensions. No single open-source benchmark can test or represent the full safety of a system across all scenarios. Additionally, many benchmarks suffer from saturation or from misalignment between benchmark design and risk definition. Some benchmarks also lack clear documentation on how target risks are conceptualized and operationalized, making it difficult to assess whether results accurately capture the nuances of real-world risks. These limitations can lead to either overestimating or underestimating model performance in real-world safety scenarios.
Performance benchmarks of language models
Performance metrics are aggregated over 14 days, based on 24 trials per day (two requests per trial, with trials run at one-hour intervals). Unless otherwise noted, the following default parameters apply to both serverless API deployments and Azure OpenAI:

| Parameter | Value | Applicable for |
|---|---|---|
| Region | East US/East US2 | serverless API deployments and Azure OpenAI |
| Tokens per minute (TPM) rate limit | Azure OpenAI: 30k (180 RPM) for non-reasoning models and 100k for reasoning models. Serverless API deployments: N/A. | For Azure OpenAI models, rate limit ranges are selectable based on deployment type (serverless API, global, global standard, and so on). For serverless API deployments, this setting is abstracted. |
| Number of requests | Two requests in a trial for every hour (24 trials per day) | serverless API deployments, Azure OpenAI |
| Number of trials/runs | 14 days with 24 trials per day for 336 runs | serverless API deployments, Azure OpenAI |
| Prompt/Context length | Moderate length | serverless API deployments, Azure OpenAI |
| Number of tokens processed (moderate) | 80:20 ratio for input to output tokens, that is, 800 input tokens to 200 output tokens. | serverless API deployments, Azure OpenAI |
| Number of concurrent requests | One (requests are sent sequentially one after other) | serverless API deployments, Azure OpenAI |
| Data | Synthetic (input prompts prepared from static text) | serverless API deployments, Azure OpenAI |
| Deployment type | serverless API | Applicable only for Azure OpenAI |
| Streaming | True | Applies to serverless API deployments and Azure OpenAI. For models deployed via managed compute, or for endpoints that don't support streaming, TTFT is represented by the P50 of the latency metric. |
| SKU | Standard_NC24ads_A100_v4 (24 cores, 220 GB RAM, 64 GB storage) | Applicable only for managed compute (to estimate the cost and performance metrics) |
The performance of LLMs and SLMs is assessed across the following metrics:

| Metric | Description |
|---|---|
| Latency mean | Average time in seconds to process a request, computed over multiple requests. A request is sent to the endpoint every hour for two weeks, and the average is computed. |
| Latency P50 | Median (50th percentile) latency. 50% of requests complete within this time. |
| Latency P90 | 90th percentile latency. 90% of requests complete within this time. |
| Latency P95 | 95th percentile latency. 95% of requests complete within this time. |
| Latency P99 | 99th percentile latency. 99% of requests complete within this time. |
| Throughput GTPS | Generated tokens per second (GTPS) is the number of output tokens generated per second, measured from the time the request is sent to the endpoint. |
| Throughput TTPS | Total tokens per second (TTPS) is the number of tokens processed per second, including both input prompt tokens and generated output tokens. For models that don't support streaming, time to first token (TTFT) represents the P50 value of latency (the time taken to receive the full response). |
| Latency TTFT | Time to first token (TTFT) is the time taken for the first token in the response to be returned from the endpoint when streaming is enabled. |
| Time between tokens | Time elapsed between consecutive tokens received. |
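To make the latency and throughput definitions concrete, the following sketch aggregates a list of per-request measurements into the metrics above. The timing values are synthetic and the nearest-rank percentile method is an assumption; this isn't the Foundry measurement pipeline.

```python
import statistics

# Synthetic per-request measurements: (total_latency_s, ttft_s, output_tokens, total_tokens)
measurements = [
    (2.1, 0.42, 210, 1010),
    (1.8, 0.38, 195, 995),
    (2.6, 0.51, 205, 1005),
    (2.0, 0.40, 200, 1000),
]

latencies = sorted(m[0] for m in measurements)

def percentile(values, p):
    """Nearest-rank percentile over a sorted list (illustrative, not the production method)."""
    k = max(0, min(len(values) - 1, round(p / 100 * len(values)) - 1))
    return values[k]

latency_mean = statistics.mean(latencies)
latency_p50 = percentile(latencies, 50)
latency_p90 = percentile(latencies, 90)

# Throughput: generated / total tokens per second over the full request duration.
gtps = statistics.mean(m[2] / m[0] for m in measurements)
ttps = statistics.mean(m[3] / m[0] for m in measurements)
ttft_p50 = percentile(sorted(m[1] for m in measurements), 50)

print(f"mean={latency_mean:.2f}s P50={latency_p50:.2f}s P90={latency_p90:.2f}s "
      f"GTPS={gtps:.0f} TTPS={ttps:.0f} TTFT(P50)={ttft_p50:.2f}s")
```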
The leaderboards display the following simplified performance metrics:

| Metric | Description |
|---|---|
| Latency | Mean time to first token. Lower is better. |
| Throughput | Mean generated tokens per second. Higher is better. |
Cost benchmarks of language models
Cost calculations are estimates for using an LLM or SLM model endpoint hosted on the Foundry platform. Foundry supports displaying the cost of serverless API deployments and Azure OpenAI models. Because these costs are subject to change, cost calculations are refreshed periodically to reflect the latest pricing. The cost of LLMs and SLMs is assessed across the following metrics:

| Metric | Description |
|---|---|
| Cost per input tokens | Cost for serverless API deployment for 1 million input tokens |
| Cost per output tokens | Cost for serverless API deployment for 1 million output tokens |
| Estimated cost | Sum of the cost per input tokens and the cost per output tokens, weighted by a 3:1 ratio of input to output tokens. |
The leaderboards display the following cost metric:

| Metric | Description |
|---|---|
| Cost | Estimated US dollar cost per 1 million tokens. The estimated workload uses the three-to-one ratio between input and output tokens. Lower values are better. |
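One way to interpret the estimated cost is as a blended price per 1 million processed tokens at the 3:1 input-to-output ratio, as sketched below. The prices and the blending formula are illustrative assumptions, not actual Foundry or Azure pricing.

```python
# Minimal sketch: estimated cost per 1 million tokens at a 3:1 input-to-output ratio.
# Prices below are placeholders, not real Azure pricing.
cost_per_1m_input_tokens = 1.00    # USD per 1M input tokens
cost_per_1m_output_tokens = 3.00   # USD per 1M output tokens

# With a 3:1 ratio, 1M processed tokens split into 750k input and 250k output tokens.
input_share, output_share = 0.75, 0.25
estimated_cost = (input_share * cost_per_1m_input_tokens
                  + output_share * cost_per_1m_output_tokens)
print(f"Estimated cost per 1M tokens: ${estimated_cost:.2f}")  # $1.50 with these placeholder prices
```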
Scenario leaderboard benchmarking
Scenario leaderboards group benchmark datasets by common real-world evaluation goals so you can quickly identify a model's strengths and weaknesses by use case. Each scenario aggregates one or more public benchmark datasets. Use the following table to find your use case in the Scenario column, then review the associated benchmark datasets and what the results indicate.

| Scenario | Datasets | Description |
|---|---|---|
| Standard harmful behavior | HarmBench (standard) | Attack success rate on standard harmful prompts. Lower is better. See Harmful behavior detection. |
| Contextually harmful behavior | HarmBench (contextual) | Attack success rate on contextual harmful prompts. Lower is better. See Harmful behavior detection. |
| Copyright violations | HarmBench (copyright) | Attack success rate for copyright violation prompts. Lower is better. See Harmful behavior detection. |
| Knowledge in sensitive domains | WMDP (biosecurity, chemical security, cybersecurity) | Accuracy across three sensitive domain subsets. Higher accuracy indicates more knowledge of sensitive capabilities. See Sensitive domain knowledge. |
| Toxicity detection | ToxiGen (annotated) | F1 score for toxic content detection ability. Higher is better. See Toxic content detection. |
| Reasoning | BIG-Bench Hard (1,000-example subsample) | Reasoning capabilities assessment. Higher values are better. |
| Coding | BigCodeBench (instruct), HumanEvalPlus, LiveBench (coding), MBPPPlus | Measures accuracy on code-related tasks. Higher values are better. |
| General knowledge | MMLU-Pro (1K English subsample) | Measures general and professional knowledge using a 1,000‑example English-only subsample of MMLU-Pro. Higher values are better. |
| Question & answering | Arena-Hard, GPQA (diamond) | Adversarial human preference QA (Arena-Hard) and graduate‑level multi‑discipline QA (GPQA diamond). Higher values are better. |
| Math | MATH (500 subsample) | Measures mathematical reasoning capabilities of language models. Higher values are better. |
| Groundedness | TruthfulQA (MC1) | Multiple‑choice groundedness / truthfulness assessment of language models. Higher values are better. |
Quality benchmarks of embedding models
The quality index of embedding models is defined as the averaged accuracy scores of a comprehensive set of serverless API benchmark datasets targeting Information Retrieval, Document Clustering, and Summarization tasks. The quality of embedding models is assessed across the following metrics:

| Metric | Description |
|---|---|
| Accuracy | Accuracy is the proportion of correct predictions among the total number of predictions processed. |
| F1 Score | F1 Score is the weighted mean of the precision and recall, where the best value is one (perfect precision and recall), and the worst is zero. |
| Mean average precision (MAP) | MAP evaluates the quality of ranking and recommender systems. It measures both the relevance of suggested items and how good the system is at placing more relevant items at the top. Values can range from zero to one, and the higher the MAP, the better the system can place relevant items high in the list. |
| Normalized discounted cumulative gain (NDCG) | NDCG evaluates a ranking algorithm's ability to sort items by relevance. It compares the produced ranking against an ideal ordering in which all relevant items are at the top of the list. NDCG is computed at a cutoff k, the number of top-ranked items evaluated. In these benchmarks, k=10, reported as the metric ndcg_at_10, meaning that the top 10 items are evaluated. |
| Precision | Precision measures the model’s ability to identify instances of a particular class correctly. Precision shows how often a machine learning model is correct when predicting the target class. |
| Spearman correlation | Spearman correlation based on cosine similarity is calculated by first computing the cosine similarity between variables, then ranking these scores and using the ranks to compute the Spearman correlation. |
| V measure | V measure is a metric used to evaluate the quality of clustering. V measure is calculated as a harmonic mean of homogeneity and completeness, ensuring a balance between the two for a meaningful score. Possible scores lie between zero and one, with one being perfectly complete labeling. |
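To illustrate the ranking metrics, the following sketch computes NDCG@10 for a single query from graded relevance labels. It's a generic NDCG implementation with hypothetical relevance values, not the Foundry benchmark code.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal (descending) ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of retrieved documents in the order the model ranked them (hypothetical).
ranked_relevances = [3, 2, 0, 1, 0, 0, 2, 0, 0, 1]
print(f"ndcg_at_10: {ndcg_at_k(ranked_relevances):.3f}")
```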
Calculation of scores
Individual scores
Benchmark results originate from public datasets that are commonly used for language model evaluation. In most cases, the data is hosted in GitHub repositories maintained by the creators or curators of the data. Foundry evaluation pipelines download data from their original sources, extract prompts from each example row, generate model responses, and then compute relevant accuracy metrics. Prompt construction follows best practices for each dataset, as specified by the paper introducing the dataset and industry standards. In most cases, each prompt contains several shots, that is, several examples of complete questions and answers to prime the model for the task. The number of shots varies by dataset and follows the methodology specified in each dataset's original publication. The evaluation pipelines create shots by sampling questions and answers from a portion of the data held out from evaluation.
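The following sketch shows one way a few-shot prompt could be assembled from held-out examples, as described above. The example rows, shot count, and prompt template are hypothetical; each dataset's actual prompt format follows its original publication.

```python
import random

# Hypothetical held-out examples used as shots; real pipelines sample from the
# dataset's held-out split and follow the prompt template from its original paper.
held_out = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]

def build_few_shot_prompt(eval_question: str, num_shots: int = 2, seed: int = 0) -> str:
    """Prepend sampled question/answer shots to the evaluation question."""
    rng = random.Random(seed)
    shots = rng.sample(held_out, k=num_shots)
    shot_text = "\n\n".join(f"Q: {s['question']}\nA: {s['answer']}" for s in shots)
    return f"{shot_text}\n\nQ: {eval_question}\nA:"

print(build_few_shot_prompt("What is 3 * 5?"))
```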
Benchmark limitations
All benchmarks have inherent limitations that you should consider when interpreting results:
- Quality benchmarks: Benchmark datasets can become saturated over time as models are trained or tuned on similar data. Evaluation results might also vary depending on prompt construction and the number of few-shot examples used.
- Performance benchmarks: Metrics are collected using synthetic workloads with a fixed input-to-output token ratio and single-region deployments. Real-world performance might differ based on workload patterns, concurrency, region, and deployment configuration.
- Cost benchmarks: Cost estimates are based on a three-to-one input-to-output token ratio and current pricing at the time of measurement. Actual costs depend on your workload and are subject to pricing changes.