Before creating a provisioned deployment, estimate how many provisioned throughput units (PTUs) your workload needs. This article provides the per-model throughput parameters you need and shows how to calculate PTU requirements using sizing formulas or the Foundry capacity calculator. If you’re new to provisioned throughput, start with What is provisioned throughput for Foundry Models?. When you’re ready to create your deployment, see Quickstart: Create a provisioned throughput deployment.

Prerequisites

Estimate PTUs required

Two approaches are available for estimating the number of PTUs required for a workload: estimate manually with the sizing formulas, or use the capacity calculator in the Foundry portal. Both approaches use per-model values from the deployment parameters tables to generate estimates. For the most accurate results, benchmark a deployment against representative traffic rather than relying solely on estimated inputs.
For older models (before GPT-4o), the request/call shape distribution affects capacity consumption: a small number of large calls can consume significantly more capacity than many small calls with the same average token count. For GPT-4o and later models, TPM per PTU is set for input and output tokens separately, so this tiering effect doesn’t apply.

Estimate manually

You can estimate the PTUs your workload requires using the model-specific values from the deployment parameters tables and information about your expected traffic as follows:
| Input | Description |
| --- | --- |
| Model | The model you plan to deploy, for example, gpt-5.2. Determines which Input TPM per PTU and output-to-input ratio values to use from the deployment parameters tables. |
| Deployment type | The provisioned deployment type: Global Provisioned, Data Zone Provisioned, or Regional Provisioned. |
| Peak RPM | The expected peak number of calls per minute sent to the model. |
| Average prompt size | The average number of input tokens per request. |
| Average response size | The average number of output tokens per request. |
| Cache rate | The percentage of input tokens served from the prompt cache. Use 0 if caching isn’t used. Cached tokens are deducted 100% from the utilization calculation and don’t consume PTU capacity. |

Normalized TPM

The manual calculation of PTUs converts your expected token volume into a single number called the normalized TPM. The number of PTUs required is then determined by dividing the normalized TPM by the model’s Input TPM per PTU value. Formulas:
  • Input TPM = Peak RPM × average prompt size (tokens)
  • Output TPM = Peak RPM × average response size (tokens)
  • Normalized TPM = (input TPM × (1 − cache rate)) + (output-to-input ratio × output TPM)
  • PTUs required = normalized TPM ÷ Input TPM per PTU
Worked example: Suppose your application sends requests at a peak rate of 1,000 RPM, with an average prompt size of 200 tokens and an average response size of 20 tokens, using the gpt-5.2 model on a Data Zone Provisioned deployment. In the deployment parameters tables, gpt-5.2 has an Input TPM per PTU of 3,400 and an output-to-input ratio of 8.
  • Input TPM = 1,000 × 200 = 200,000
  • Output TPM = 1,000 × 20 = 20,000
  • Normalized TPM (no cache) = 200,000 + (8 × 20,000) = 360,000
  • PTUs required = 360,000 ÷ 3,400 = 105.88 (110 PTUs rounded up to the nearest 5 PTUs, matching the Data Zone Provisioned scale increment for gpt-5.2.)
If 50% of input tokens are served from the prompt cache:
  • Effective input TPM = 200,000 × (1 − 0.50) = 100,000
  • Normalized TPM = 100,000 + (8 × 20,000) = 260,000
  • PTUs required = 260,000 ÷ 3,400 = 76.47 (80 PTUs rounded up to the nearest 5 PTUs, matching the Data Zone Provisioned scale increment for gpt-5.2.)
In summary, the PTUs needed for this example call shape with and without caching are as follows:
| Peak calls per minute (RPM) | Prompt size (tokens) | Response size (tokens) | Cache rate | Input TPM | Output TPM | Normalized TPM | Estimated PTUs | PTUs (rounded up)¹ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 200 | 20 | 0% | 200,000 | 20,000 | 360,000 | 105.88 | 110 |
| 1,000 | 200 | 20 | 50% | 100,000 | 20,000 | 260,000 | 76.47 | 80 |

¹ Rounded up to the nearest 5 PTUs, matching the Data Zone Provisioned scale increment for gpt-5.2.
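
The formulas and worked example above can be sketched in Python as follows. This is a minimal illustration, not an official tool: the function name `estimate_raw_ptus` and its parameter names are hypothetical, and the cache rate is assumed to be passed as a fraction between 0 and 1. It returns the raw (unrounded) PTU estimate only.

```python
def estimate_raw_ptus(peak_rpm, avg_prompt_tokens, avg_response_tokens,
                      input_tpm_per_ptu, output_to_input_ratio, cache_rate=0.0):
    """Return the unrounded PTU estimate from the normalized-TPM formulas."""
    if not 0.0 <= cache_rate <= 1.0:
        raise ValueError("cache_rate must be a fraction between 0 and 1")

    input_tpm = peak_rpm * avg_prompt_tokens
    output_tpm = peak_rpm * avg_response_tokens

    # Cached input tokens are deducted; output tokens are weighted by the ratio.
    normalized_tpm = input_tpm * (1 - cache_rate) + output_to_input_ratio * output_tpm
    return normalized_tpm / input_tpm_per_ptu


# Worked example: gpt-5.2 values (Input TPM per PTU 3,400, ratio 8).
no_cache = estimate_raw_ptus(1_000, 200, 20, 3_400, 8)
half_cached = estimate_raw_ptus(1_000, 200, 20, 3_400, 8, cache_rate=0.50)
print(f"{no_cache:.2f} {half_cached:.2f}")  # 105.88 76.47
```

The result still has to be rounded up to a deployable size for the chosen deployment type.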

Use the capacity calculator

Use the capacity calculator in the Foundry portal to size specific workload shapes. Find the calculator on the Quota page and enter the following parameters based on your workload:
| Input | Description |
| --- | --- |
| Model | The model you plan to use. |
| Version | The version of the model you plan to use. |
| Peak calls per min | The number of calls per minute expected to be sent to the model. |
| Tokens in prompt call | The number of tokens in the prompt for each call to the model. Calls with larger prompts consume more PTU capacity. The calculator assumes a single prompt value. For workloads with wide variance in prompt size, benchmark a deployment against your actual traffic for a more accurate estimate. |
| Tokens in model response | The number of tokens generated per call, also called generation size. Calls with larger generation sizes consume more PTU capacity. As with prompt tokens, the calculator assumes a single value. |
| Cache rate | Percentage of input tokens served from the prompt cache. |
After you fill in the required details, select Calculate. The output shows:
  • The estimated PTU count required for the workload. This value is rounded up to the nearest PTU scale increment for the selected deployment type, or to the deployment type’s minimum PTU count, depending on which one is larger.
  • The raw (unrounded) estimated PTU count.
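
The rounding rule described above can be sketched as follows. This is a literal reading of that rule, not the calculator's actual implementation. The function name `round_to_deployable_ptus` is hypothetical, and the minimum deployment and scale increment are assumed to come from the deployment parameters tables.

```python
import math


def round_to_deployable_ptus(raw_ptus, minimum_deployment, scale_increment):
    """Round a raw PTU estimate up to the next scale increment,
    then apply the deployment type's minimum if it is larger."""
    rounded = math.ceil(raw_ptus / scale_increment) * scale_increment
    return max(rounded, minimum_deployment)


# gpt-5.2 on Data Zone Provisioned: minimum 15 PTUs, increment 5 PTUs.
print(round_to_deployable_ptus(105.88, 15, 5))  # 110
print(round_to_deployable_ptus(76.47, 15, 5))   # 80
print(round_to_deployable_ptus(3.2, 15, 5))     # 15 (the minimum applies)
```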

How input and output tokens affect throughput

The throughput (measured as tokens per minute, or TPM) that a deployment gets per PTU depends on the model and the mix of input and output tokens in a given minute. Generating output tokens requires more processing capacity than consuming input tokens. For GPT-4.1 models and later, the system determines an output-to-input ratio to match the global standard price ratio between input and output tokens, with exceptions for some models. For example,
  • For gpt-5, one output token counts as eight input tokens toward your utilization limit, matching the model’s global standard price ratio.
  • For gpt-4.1, one output token counts as four input tokens.
  • Older models use different ratios.
For all deployments, cached tokens are deducted 100% from the utilization calculation, meaning repeated prompt tokens don’t consume PTU capacity. See Prompt caching for more information.
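
To illustrate how the token mix changes the throughput one PTU delivers, the sketch below inverts the sizing formula for a fixed call shape. It assumes every call has the same prompt and response size, which real traffic rarely does. The function name `tpm_per_ptu_for_mix` is hypothetical.

```python
def tpm_per_ptu_for_mix(input_tpm_per_ptu, output_to_input_ratio,
                        prompt_tokens, response_tokens, cache_rate=0.0):
    """Return (input TPM, output TPM) that one PTU supports for a fixed call shape."""
    # Cost of one call in input-token equivalents: uncached input tokens
    # plus output tokens weighted by the output-to-input ratio.
    weighted_per_call = (prompt_tokens * (1 - cache_rate)
                         + output_to_input_ratio * response_tokens)
    calls_per_minute = input_tpm_per_ptu / weighted_per_call
    return calls_per_minute * prompt_tokens, calls_per_minute * response_tokens


# gpt-5.2 values: Input TPM per PTU 3,400, output-to-input ratio 8.
input_heavy = tpm_per_ptu_for_mix(3_400, 8, prompt_tokens=200, response_tokens=20)
output_heavy = tpm_per_ptu_for_mix(3_400, 8, prompt_tokens=200, response_tokens=200)
print([round(v) for v in input_heavy])   # [1889, 189]
print([round(v) for v in output_heavy])  # [378, 378]
```

With the same per-PTU rating, the output-heavy shape yields far fewer total tokens per minute, because each output token counts as eight input tokens.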

Models with a non-standard output-to-input ratio

Some models use an output-to-input ratio that differs from their global standard price ratio. For example, with Llama-3.3-70B-Instruct, one output token counts as four input tokens toward your utilization limit, which differs from that model’s standard price ratio. See pricing for Llama models for the full input and output pricing breakdown.

Deployment parameters and throughput values by model

The tables in this section list the throughput and deployment parameters for each supported model. To understand what the parameters in each row mean, see the Appendix.

Latest Azure OpenAI models

gpt-5.4, gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano don’t support long context (requests estimated at larger than 128k prompt tokens).
| Topic | gpt-5.5, 2026-04-24 | gpt-5.4, 2026-03-05 | gpt-5.4-mini, 2026-03-17 | gpt-5.3-codex, 2026-02-24 | gpt-5.2, 2025-12-11 | gpt-5.2-codex, 2026-01-14 | gpt-5.1, 2025-11-13 | gpt-5.1-codex, 2025-11-13 | gpt-5, 2025-08-07 | gpt-5-mini, 2025-08-07 | gpt-4.1, 2025-04-14 | gpt-4.1-mini, 2025-04-14 | gpt-4.1-nano, 2025-04-14 | o3, 2025-04-16 | o4-mini, 2025-04-16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Global & data zone provisioned minimum deployment | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
| Global & data zone provisioned scale increment | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Regional provisioned minimum deployment | 50 | 50 | 25 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 50 | 25 | 25 | 50 | 25 |
| Regional provisioned scale increment | 50 | 50 | 25 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 50 | 25 | 25 | 50 | 25 |
| Input TPM per PTU | 1,200 | 2,400 | 7,900 | 3,400 | 3,400 | 3,400 | 4,750 | 4,750 | 4,750 | 23,750 | 3,000 | 14,900 | 59,400 | 3,000 | 5,400 |
| Output-to-input ratio | 6 | 6 | 6 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | 4 | 4 | 4 | 4 |
| Latency target value¹ | 99% > 100 TPS | 99% > 50 TPS | 99% > 100 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 80 TPS | 99% > 80 TPS | 99% > 90 TPS | 99% > 100 TPS | 99% > 80 TPS | 99% > 90 TPS |

¹ Calculated as p50 request latency on a per 5-minute basis. TPS = tokens per second.

Previous Azure OpenAI models

| Topic | gpt-4o | gpt-4o-mini | o3-mini | o1 |
| --- | --- | --- | --- | --- |
| Global & data zone provisioned minimum deployment | 15 | 15 | 15 | 15 |
| Global & data zone provisioned scale increment | 5 | 5 | 5 | 5 |
| Regional provisioned minimum deployment | 50 | 25 | 25 | 25 |
| Regional provisioned scale increment | 50 | 25 | 25 | 50 |
| Input TPM per PTU | 2,500 | 37,000 | 2,500 | 230 |
| Output-to-input ratio | 4 | 4 | 4 | 4 |
| Latency target value¹ | 99% > 25 TPS | 99% > 33 TPS | 99% > 66 TPS | 99% > 25 TPS |

¹ Calculated as the average request latency on a per-minute basis across the month. TPS = tokens per second.

Foundry Models sold by Azure

This section lists other Foundry Models sold by Azure, not including the Azure OpenAI in Foundry Models listed in the previous tables.
| Topic | Llama-3.3-70B-Instruct | DeepSeek-R1 | DeepSeek-V3-0324 |
| --- | --- | --- | --- |
| Global & data zone provisioned minimum deployment | 100 | 100 | 100 |
| Global & data zone provisioned scale increment | 100 | 100 | 100 |
| Regional provisioned minimum deployment | NA | NA | NA |
| Regional provisioned scale increment | NA | NA | NA |
| Input TPM per PTU | 8,450 | 4,000 | 4,000 |
| Output-to-input ratio | 4¹ | 4 | 4 |
| Latency target value² | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS |

¹ For Llama-3.3-70B-Instruct, one output token counts as four input tokens toward your utilization limit. This ratio differs from the global standard price ratio between input and output tokens. See Models with a non-standard output-to-input ratio and Llama model pricing.
² Calculated as the average request latency on a per-minute basis across the month. TPS = tokens per second.

Fireworks on Microsoft Foundry models

The following Fireworks on Microsoft Foundry models support provisioned throughput.
| Topic | DeepSeek v3.1 | DeepSeek V4 Flash | DeepSeek V4 Pro | Gemma 4 26B A4B IT | Gemma 4 31B IT | GLM-4.7 | GLM-5.1 | Kimi K2 Instruct 0905 | Kimi K2 Thinking | Kimi K2.6 | Llama 3.1 8B Instruct | Ministral 3 3B Instruct 2512 | Qwen 3.5 9B | Qwen 3.5 35B A3B | Qwen 3.5 112B A10B | Qwen 3.5 397B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Global provisioned minimum deployment | 200 | 100 | 400 | 200 | 200 | 200 | 400 | 200 | 200 | 200 | 40 | 40 | 40 | 40 | 450 | 200 |
| Global provisioned scale increment | 100 | 50 | 200 | 100 | 100 | 100 | 200 | 100 | 100 | 100 | 20 | 20 | 20 | 20 | 225 | 100 |
| Input TPM per PTU | 2,100 | 2,800 | 200 | 5,400 | 2,200 | 6,000 | 900 | 2,500 | 1,400 | 4,000 | 57,800 | 25,400 | 10,700 | 17,800 | 37,253 | 4,032 |
| Latency target value¹ | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS |

¹ Calculated as the average request latency on a per-minute basis across the month. TPS = tokens per second.

Appendix

Each row in the tables corresponds to one of the following parameters:
| Parameter | Description |
| --- | --- |
| Global & data zone provisioned minimum deployment | The smallest number of PTUs you can deploy for Global Provisioned or Data Zone Provisioned deployment types. For example, gpt-5.2 requires a minimum deployment of 15 PTUs. |
| Global & data zone provisioned scale increment | The PTU increment in which you can increase or decrease a Global Provisioned or Data Zone Provisioned deployment. Continuing with the gpt-5.2 example, an increment of 5 means deployments can be sized at 15, 20, 25, and so on. |
| Regional provisioned minimum deployment | The smallest number of PTUs you can deploy for a Regional Provisioned deployment. For example, gpt-5.2 requires a minimum regional provisioned deployment of 50 PTUs. |
| Regional provisioned scale increment | The PTU increment for Regional Provisioned deployments. Continuing with the gpt-5.2 example, an increment of 50 means deployments can be sized at 50, 100, 150, and so on. |
| Input TPM per PTU | The maximum input tokens per minute (TPM) that one PTU supports. Use this value when estimating PTUs. |
| Output-to-input ratio | The weight applied to output tokens when estimating PTU requirements. This value reflects the model’s global standard price ratio between output and input tokens, with exceptions for some models. For example, a ratio of 8 means one output token counts as eight input tokens toward the model’s TPM limit. See Azure OpenAI pricing, Llama model pricing, and DeepSeek model pricing for current pricing. |
| Latency target value | The expected request latency at the stated PTU utilization level, expressed as a percentile threshold. For example, “99% > 50 TPS” means 99% of requests are processed at more than 50 tokens per second. |
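
As an illustration of how these parameters fit together, the sketch below stores one model's row values in a small record and sizes the same workload for two deployment types. The `ModelParams` class and `size_deployment` function are hypothetical names. The gpt-5.2 values are copied from the table in this article, and the rounding follows a literal reading of the minimum and scale increment descriptions above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParams:
    input_tpm_per_ptu: int
    output_to_input_ratio: int
    global_dz_minimum: int
    global_dz_increment: int
    regional_minimum: int
    regional_increment: int


# Values for gpt-5.2 as listed in the deployment parameters table.
GPT_5_2 = ModelParams(3_400, 8, 15, 5, 50, 50)


def size_deployment(params, normalized_tpm, regional=False):
    """Return a deployable PTU count for a normalized TPM figure."""
    minimum = params.regional_minimum if regional else params.global_dz_minimum
    increment = params.regional_increment if regional else params.global_dz_increment
    raw_ptus = normalized_tpm / params.input_tpm_per_ptu
    # Round up to the next increment, then enforce the minimum deployment.
    return max(minimum, math.ceil(raw_ptus / increment) * increment)


print(size_deployment(GPT_5_2, 360_000))                 # 110
print(size_deployment(GPT_5_2, 360_000, regional=True))  # 150
```

The same 360,000 normalized TPM workload lands on a larger deployment for Regional Provisioned because of its coarser scale increment.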