Prompt cache retention
Prompt caching can use either an in-memory or an extended retention policy. When available, extended prompt caching retains the cache for longer, so that subsequent requests are more likely to match it. To configure the prompt cache retention policy, set the prompt_cache_retention parameter on the Responses or Chat Completions API.
In-memory prompt cache retention
The system typically clears caches within 5 to 10 minutes of inactivity and always removes them within one hour of the cache’s last use. The system doesn’t share prompt caches between Azure subscriptions. All Azure OpenAI models GPT-4o or newer support in-memory prompt cache retention. It applies to models that have chat-completion, completion, responses, or real-time operations. For models that don’t have these operations, this feature isn’t available.
Extended prompt cache retention
Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended prompt caching works by offloading the key/value tensors to GPU-local storage when memory is full, which significantly increases the storage capacity available for caching. Extended prompt cache retention is available for the following models:
- gpt-5.4
- gpt-5.3-codex
- gpt-5.2
- gpt-5.1-codex-max
- gpt-5.1
- gpt-5.1-codex
- gpt-5.1-codex-mini
- gpt-5.1-chat
- gpt-5
- gpt-5-codex
- gpt-4.1
Configure per request
For gpt-5.4 and older models, if you don’t specify a retention policy, the default is in_memory. Allowed values are in_memory and 24h. For all newer models, the default is 24h and in_memory isn’t supported.
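The defaulting rules above can be sketched in a few lines of Python. This is only an illustration: the helper names (resolve_retention, build_request) and the newer_than_gpt_5_4 flag are hypothetical, the request body shape is an assumption, and nothing here calls the service.

```python
from typing import Any, Dict, List, Optional

# Values named in the text above.
ALLOWED_RETENTION = {"in_memory", "24h"}


def resolve_retention(requested: Optional[str], newer_than_gpt_5_4: bool) -> str:
    """Return the retention policy a request would end up with.

    `newer_than_gpt_5_4` is a hypothetical flag supplied by the caller;
    this sketch does not try to infer it from the model name.
    """
    if requested is not None and requested not in ALLOWED_RETENTION:
        raise ValueError(f"unknown retention policy: {requested!r}")
    if newer_than_gpt_5_4:
        # Newer models: default is 24h and in_memory isn't supported.
        if requested == "in_memory":
            raise ValueError("in_memory isn't supported for this model")
        return "24h"
    # gpt-5.4 and older: default is in_memory, both values are allowed.
    return requested or "in_memory"


def build_request(model: str, messages: List[Dict[str, Any]],
                  retention: Optional[str] = None) -> Dict[str, Any]:
    """Build a request body (assumed shape), setting the parameter only when asked."""
    body: Dict[str, Any] = {"model": model, "messages": messages}
    if retention is not None:
        body["prompt_cache_retention"] = retention
    return body
```

Omitting the parameter leaves the choice to the service default, which is why build_request adds the field only when a value is passed.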
Getting started
To take advantage of prompt caching, a request must meet both of these requirements:
- The prompt must be at least 1,024 tokens long.
- The first 1,024 tokens of the prompt must be identical to those of an earlier request.
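As a rough client-side illustration of these two requirements, the sketch below compares two prompts that are already tokenized. The token lists, the function names, and the choice of tokenizer are all assumptions, and the real matching happens inside the service.

```python
from typing import Sequence

MIN_CACHEABLE_TOKENS = 1024  # threshold from the requirements above


def shared_prefix_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def may_hit_cache(previous: Sequence[int], current: Sequence[int]) -> bool:
    """True if `current` meets both requirements relative to `previous`."""
    if len(current) < MIN_CACHEABLE_TOKENS:
        return False
    return shared_prefix_length(previous, current) >= MIN_CACHEABLE_TOKENS
```

A check like this can help spot prompts whose leading tokens change between requests, for example when a timestamp is placed at the top of a system message.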
Cache hits appear as cached_tokens under prompt_tokens_details in the chat completions response.
A cache miss is indicated by a cached_tokens value of 0. Prompt caching is enabled by default with no additional configuration needed for supported models.
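One way to inspect this value is to read it from the parsed response. The sketch below assumes the response has been parsed into a dictionary with prompt_tokens_details nested under a usage object; the helper name and the sample numbers are illustrative only.

```python
from typing import Any, Dict


def cached_token_count(response: Dict[str, Any]) -> int:
    """Return cached_tokens from a parsed response, or 0 if it is absent."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


# Illustrative response fragment; the numbers are made up for the example.
sample_response = {
    "usage": {
        "prompt_tokens": 1500,
        "prompt_tokens_details": {"cached_tokens": 1024},
    }
}

hit = cached_token_count(sample_response) > 0
```

Treating a missing field as 0 keeps the helper safe to call on responses from models that don't report cache details.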
If you provide the prompt_cache_key parameter, it’s combined with the prefix hash, so you can influence routing and improve cache hit rates. This is especially beneficial when many requests share long, common prefixes. If requests for the same prefix and prompt_cache_key combination exceed a certain rate (approximately 15 requests per minute), some requests overflow and get routed to extra machines, reducing cache effectiveness.
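To notice when a prefix and prompt_cache_key combination approaches that rate, a client could keep its own sliding-window count. The sketch below is a hypothetical helper, not part of any SDK. It uses the approximate figure of 15 requests per minute from the text, and the service's actual accounting may differ.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

APPROX_LIMIT_PER_MINUTE = 15  # approximate figure from the text above
WINDOW_SECONDS = 60.0


class KeyRateTracker:
    """Counts recent requests per (prefix, prompt_cache_key) pair."""

    def __init__(self) -> None:
        self._events: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def record(self, prefix_id: str, cache_key: str, now: float) -> int:
        """Record a request at time `now` (seconds); return the count in the window."""
        q = self._events[(prefix_id, cache_key)]
        q.append(now)
        # Drop timestamps that have aged out of the 60-second window.
        while q and now - q[0] >= WINDOW_SECONDS:
            q.popleft()
        return len(q)

    def likely_overflowing(self, prefix_id: str, cache_key: str, now: float) -> bool:
        """True if the pair exceeded the approximate limit in the last minute."""
        q = self._events[(prefix_id, cache_key)]
        recent = sum(1 for t in q if now - t < WINDOW_SECONDS)
        return recent > APPROX_LIMIT_PER_MINUTE
```

Passing the timestamp in explicitly keeps the tracker deterministic; in practice a caller would likely pass time.monotonic().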
Frequently asked questions
What is cached?
Feature support for o1-series models varies by model. For more information, see the dedicated reasoning models guide. Prompt caching supports:

| Caching supported | Description |
|---|---|
| Messages | The complete messages array: system, developer, user, and assistant content |
| Images | Images included in user messages, either as links or as base64-encoded data. The detail parameter must be set to the same value across requests. |
| Tool use | Both the messages array and tool definitions. |
| Structured outputs | Structured output schema is appended as a prefix to the system message. |