Enforce token limits for models

Microsoft Foundry Control Plane enforces tokens-per-minute (TPM) rate limits and total token quotas for model deployments at the project scope. This enforcement prevents runaway token consumption and keeps usage within organizational guardrails. Foundry Control Plane integrates with AI gateways to provide advanced policy enforcement for models. This article explains how to configure TPM rate limits and total token quotas.

Prerequisites

  • A Microsoft Foundry project with at least one model deployment.
  • An AI gateway (an Azure API Management instance) associated with your Foundry resource.

Understand AI gateways

When you use an AI gateway with Foundry Control Plane to provide advanced policy enforcement for models, the gateway sits between clients and model deployments: all requests flow through the Azure API Management instance that's associated with the gateway. Limits apply at the project level, so each project can have its own TPM and quota settings.
[Diagram: client requests pass through Azure API Management as an AI gateway before reaching model deployments within a project.]
Use an AI gateway for:
Token containment across teams (prevent one project from monopolizing capacity).
Cost control (cap aggregate usage).
Compliance boundaries for regulated workloads (enforce predictable usage ceilings).
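
In practice, client code targets the gateway's endpoint rather than the model deployment directly. The following is a minimal sketch using Python's requests library; the gateway URL, key, and auth header name are hypothetical placeholders, and the exact path and header depend on how your gateway is configured:

```python
import requests

# Hypothetical values; substitute your project's gateway URL and key.
GATEWAY_URL = "https://my-gateway.example.com/openai/deployments/my-deployment/chat/completions"
HEADERS = {"api-key": "<gateway-key>"}  # auth header name depends on your gateway configuration

resp = requests.post(
    GATEWAY_URL,
    headers=HEADERS,
    json={"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50},
    timeout=60,
)
print(resp.status_code, resp.json())
```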

Configure token limits

You can configure token limits for specific model deployments within your projects:
  1. In the AI Gateway list, select the gateway that you want to use.
  2. On the gateway details pane that appears, select Token management.
  3. Select + Add limit to create a new limit for a model deployment.
  4. Select the project and deployment that you want to restrict, and enter a value for Limit (Token-per-minute). For help choosing a value, see the sizing sketch after these steps.
  5. Select Create to save your changes.
[Screenshot: project settings pane with input boxes for tokens-per-minute and total token quota limits.]
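
When you choose the Limit (Token-per-minute) value, it can help to estimate how many requests per minute that limit supports. A quick back-of-the-envelope calculation, with an assumed per-request token count:

```python
# Rough sizing for the Limit (Token-per-minute) value.
# The per-request figure is an assumption; measure your own workload's average.
avg_tokens_per_request = 1_500  # estimated prompt + completion tokens
tpm_limit = 30_000              # candidate TPM limit

requests_per_minute = tpm_limit / avg_tokens_per_request
print(f"~{requests_per_minute:.0f} requests/min before 429 throttling")  # ~20
```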

Understand quota windows

Token limits have two complementary enforcement dimensions:
  • TPM rate limit: Limits token consumption to a configured maximum per minute. When requests exceed the TPM limit, the caller receives a 429 Too Many Requests response status code.
  • Total token quota: Limits token consumption to a configured maximum per quota period (for example, hourly, daily, weekly, monthly, or yearly). When requests exceed the quota, the caller receives a 403 Forbidden response status code.
If you send many requests concurrently, token consumption can temporarily exceed the configured limits until responses are processed. Adjusting a quota or TPM value affects subsequent enforcement decisions. For more information, see AI gateway in Azure API Management and Limit large language model API token usage.
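
Because the two status codes signal different conditions, client code can branch on them: back off and retry on 429 (honoring a Retry-After header if the gateway returns one), and stop on 403, because the quota doesn't reset until the period ends. A minimal sketch, again with a hypothetical gateway URL and key:

```python
import time

import requests

GATEWAY_URL = "https://my-gateway.example.com/openai/deployments/my-deployment/chat/completions"  # hypothetical
HEADERS = {"api-key": "<gateway-key>"}  # auth header name depends on your gateway configuration

def call_with_limit_handling(payload, max_retries=5):
    """Call the model through the gateway, backing off on TPM throttling (429)
    and failing fast on quota exhaustion (403)."""
    for attempt in range(max_retries):
        resp = requests.post(GATEWAY_URL, headers=HEADERS, json=payload, timeout=60)
        if resp.status_code == 429:
            # TPM rate limit hit: transient, so wait and retry.
            wait = float(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
            continue
        if resp.status_code == 403:
            # Total token quota exhausted: retrying won't help until the quota period resets.
            raise RuntimeError("Token quota exhausted for this quota period.")
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Still rate limited after retries.")
```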

Verify enforcement

  1. Send test requests to a model deployment endpoint by using the project’s gateway URL and key.
  2. Gradually increase request frequency until the TPM limit triggers. (The script after this list automates steps 2 and 3.)
  3. Track cumulative tokens until the quota triggers.
  4. Validate that:
    • 429 Too Many Requests (rate-limited response) is returned when requests exceed the TPM limit.
    • 403 Forbidden (quota error) is returned when requests exhaust the quota.
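
One way to script this procedure is to ramp up request frequency and tally response codes until throttling and quota errors appear. A rough sketch; the endpoint, key, and payload are placeholders for your own values:

```python
import time
from collections import Counter

import requests

GATEWAY_URL = "https://my-gateway.example.com/..."  # hypothetical; use your project's gateway URL
HEADERS = {"api-key": "<gateway-key>"}
PAYLOAD = {"messages": [{"role": "user", "content": "ping"}], "max_tokens": 50}

counts = Counter()
delay = 2.0  # seconds between requests; halves each round to raise the request rate
while delay > 0.05:
    for _ in range(10):
        status = requests.post(GATEWAY_URL, headers=HEADERS, json=PAYLOAD, timeout=60).status_code
        counts[status] += 1
        if status == 403:
            break  # quota exhausted; no point in continuing
        time.sleep(delay)
    print(f"delay={delay:.2f}s -> {dict(counts)}")
    if counts[403]:
        break
    delay /= 2
```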

Adjust limits

  1. Return to the project’s AI Gateway settings.
  2. Modify TPM or quota values.
  3. Save the changes. New limits apply immediately to subsequent requests.

Troubleshoot

| Problem | Possible cause | Action |
|---|---|---|
| API Management instance doesn't appear | Provisioning delay | Refresh after a few minutes. |
| Limits aren't enforced | Misconfiguration or project not linked | Reopen settings and confirm that the enforcement toggle is on. Confirm that the AI gateway is enabled for the project and that the correct limits are configured. |
| Latency is high after enablement | API Management cold start or region mismatch | Check the API Management region against the resource region. Call the model directly and compare the result with the call proxied through the AI gateway to identify whether performance problems are related to the gateway (see the timing sketch after this table). |
If the admin console is slow, retry after a brief interval.
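
To quantify the direct-versus-proxied comparison from the table above, you can time the same request against both endpoints. A rough sketch with placeholder URLs and keys:

```python
import time

import requests

PAYLOAD = {"messages": [{"role": "user", "content": "ping"}], "max_tokens": 10}

def time_call(url, headers, n=5):
    """Average wall-clock latency over n identical calls."""
    start = time.perf_counter()
    for _ in range(n):
        requests.post(url, headers=headers, json=PAYLOAD, timeout=60).raise_for_status()
    return (time.perf_counter() - start) / n

direct = time_call("https://my-model.example.com/...", {"api-key": "<model-key>"})       # hypothetical
proxied = time_call("https://my-gateway.example.com/...", {"api-key": "<gateway-key>"})  # hypothetical
print(f"direct: {direct:.2f}s  via gateway: {proxied:.2f}s  overhead: {proxied - direct:.2f}s")
```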