Predicted outputs (preview)
Predicted outputs can improve model response latency for chat completions calls where minimal changes are needed to a larger body of text. If you ask the model for a response where a large portion of the expected output is already known, predicted outputs can significantly reduce the latency of the request. This capability is particularly well suited for coding scenarios, including autocomplete, error detection, and real-time editing, where speed and responsiveness are critical for developers and end users. Rather than having the model regenerate all the text from scratch, you indicate that most of the response is already known by passing the known text to the prediction parameter.
Prerequisites
- An Azure OpenAI model deployed
- An upgraded OpenAI Python library (install commands shown below)
- If you use Microsoft Entra ID, the azure-identity library (install commands shown below)
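For example, assuming you use pip to manage packages, the following commands install or upgrade the required libraries:

```bash
pip install --upgrade openai
pip install azure-identity
```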
Model support
- gpt-4o-mini version: 2024-07-18
- gpt-4o version: 2024-08-06
- gpt-4o version: 2024-11-20
- gpt-4.1 version: 2025-04-14
- gpt-4.1-nano version: 2025-04-14
- gpt-4.1-mini version: 2025-04-14
API support
First introduced in 2025-01-01-preview. Supported in all subsequent releases.
Unsupported features
Predicted outputs is currently text-only. The following features can’t be used in conjunction with the prediction parameter and predicted outputs:
- Tools/Function calling
- Audio models/inputs and outputs
- n values higher than 1
- logprobs
- presence_penalty values greater than 0
- frequency_penalty values greater than 0
- max_completion_tokens
The predicted outputs feature is currently unavailable for models in the Southeast Asia region.
Getting started
To demonstrate the basics of predicted outputs, we’ll start by asking a model to refactor the code from the common FizzBuzz programming problem, replacing each instance of FizzBuzz with MSFTBuzz. We’ll pass our example code to the model in two places: first as part of a user message in the messages array/list, and a second time as the content of the new prediction parameter.
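The following is a minimal sketch of such a request. It assumes the v1-style endpoint (base_url pointing at your Azure OpenAI resource) and an API key in the AZURE_OPENAI_API_KEY environment variable; the resource name and the deployment name (gpt-4o-mini) are placeholders you’d replace with your own.

```python
import os
from openai import OpenAI

code = """
for number in range(1, 101):
    if number % 3 == 0 and number % 5 == 0:
        print("FizzBuzz")
    elif number % 3 == 0:
        print("Fizz")
    elif number % 5 == 0:
        print("Buzz")
    else:
        print(number)
"""

instructions = """
Replace the string FizzBuzz with MSFTBuzz. Respond only with code, and with no markdown formatting.
"""

client = OpenAI(
    # Placeholder resource name; replace with your Azure OpenAI resource.
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # replace with your deployment name
    messages=[
        # The example code appears once in the user message...
        {"role": "user", "content": instructions + code},
    ],
    # ...and a second time as the content of the prediction parameter.
    prediction={"type": "content", "content": code},
)

print(completion.choices[0].message.content)
print(completion.usage.model_dump_json(indent=2))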
Output
In the output, notice the new usage parameters accepted_prediction_tokens and rejected_prediction_tokens:
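The usage object for this kind of request has roughly the following shape; the token counts below are illustrative placeholders, not actual output:

```json
{
  "completion_tokens": 77,
  "prompt_tokens": 124,
  "total_tokens": 201,
  "completion_tokens_details": {
    "accepted_prediction_tokens": 73,
    "rejected_prediction_tokens": 4,
    "audio_tokens": 0,
    "reasoning_tokens": 0
  }
}
```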
accepted_prediction_tokens help reduce model response latency, but any rejected_prediction_tokens have the same cost implication as additional output tokens generated by the model. For this reason, while predicted outputs can improve model response times, they can result in greater costs. You’ll need to evaluate and balance the increased model performance against the potential increases in cost.
It’s also important to understand that using predicted outputs doesn’t guarantee a reduction in latency. A large request with a greater percentage of rejected prediction tokens than accepted prediction tokens could result in an increase in model response latency, rather than a decrease.
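One way to evaluate that tradeoff is to log the split between accepted and rejected prediction tokens for representative requests. This sketch assumes the completion object returned by the call above; the acceptance-ratio threshold is an arbitrary illustration, not an official guideline.

```python
# Inspect how much of the prediction was actually used.
details = completion.usage.completion_tokens_details
accepted = details.accepted_prediction_tokens or 0
rejected = details.rejected_prediction_tokens or 0

total_predicted = accepted + rejected
if total_predicted:
    acceptance_ratio = accepted / total_predicted
    print(f"accepted: {accepted}, rejected: {rejected}, ratio: {acceptance_ratio:.2f}")
    # Illustrative heuristic only: mostly rejected predictions add cost without
    # improving latency, so consider dropping the prediction parameter.
    if acceptance_ratio < 0.5:
        print("Prediction is mostly rejected; it may not be worth the extra cost.")
```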
Unlike prompt caching, which only works when a set minimum number of initial tokens at the beginning of a request are identical, predicted outputs isn’t constrained by token location. Even if your response text contains new output that’s returned before the predicted output, accepted_prediction_tokens can still occur.

Streaming

The predicted outputs performance boost is often most obvious if you’re returning your responses with streaming enabled.
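As a sketch, reusing the client, code, and instructions from the example above, streaming only requires setting stream=True on the same request:

```python
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # replace with your deployment name
    messages=[
        {"role": "user", "content": instructions + code},
    ],
    prediction={"type": "content", "content": code},
    stream=True,
)

# Print the refactored code as the tokens arrive.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```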
Troubleshooting

- 401/403: If you use Microsoft Entra ID, confirm your identity has access to the Azure OpenAI resource. If you use get_bearer_token_provider, request a token for https://cognitiveservices.azure.com/.default.
- 404: Confirm base_url uses your Azure OpenAI resource name, and model uses your deployment name.
- 400: Remove optional parameters and features listed in Unsupported features, and try again.
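For reference, one way to authenticate with Microsoft Entra ID is to build a token provider for the scope mentioned above. This sketch uses the AzureOpenAI client from the openai library together with azure-identity; the endpoint is a placeholder you’d replace with your own resource.

```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Request tokens for the scope noted in the 401/403 guidance above.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com/",  # placeholder
    azure_ad_token_provider=token_provider,
    api_version="2025-01-01-preview",
)
```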