Learn how to use the Azure OpenAI Responses API to create, retrieve, and delete stateful responses with Python or REST, including streaming and tools.
Use the Azure OpenAI Responses API to generate stateful, multi-turn responses. It brings together capabilities from chat completions and the Assistants API in one unified experience. The Responses API also supports the computer-use-preview model that powers Computer use.
Before you run the examples in this article, confirm that your resource region supports the Responses API. The v1 API is required to access the latest features — for details, see the API version lifecycle. The Responses API is currently available in the following regions:
Not every model is available in every supported region. Check the models page for model region availability. For the full set of request and response parameters, see the Responses API reference documentation.
Not currently supported:
Image generation using multi-turn editing and streaming.
Images can’t be uploaded as a file and then referenced as input.
There’s a known issue with the following:
PDF as an input file is now supported, but setting file upload purpose to user_data is not currently supported.
Performance issues when background mode is used with streaming. Microsoft is working to resolve this issue.
Alternatively, you can manually carry forward output items in the next request.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

inputs = [{"type": "message", "role": "user", "content": "Define and explain the concept of catastrophic forgetting?"}]

response = client.responses.create(
    model="MODEL_NAME",  # replace with your model deployment name
    input=inputs,
)

# Carry the model's output items forward, then add the next user turn.
inputs += response.output
inputs.append({"role": "user", "type": "message", "content": "Explain this at a level that could be understood by a college freshman"})

second_response = client.responses.create(
    model="MODEL_NAME",  # use the same deployment as the first request
    input=inputs,
)

print(second_response.model_dump_json(indent=2))
The .NET SDK doesn’t yet provide a strongly typed surface for Response compaction. See the REST tab for the call shape, or invoke the protocol method directly with BinaryContent JSON.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
  apiKey: process.env.AZURE_OPENAI_API_KEY,
});

const compacted = await client.responses.compact({
  model: "MODEL_NAME",
  input: [
    { role: "user", content: "Create a simple landing page for a dog cafe." },
    {
      id: "msg_001",
      type: "message",
      status: "completed",
      role: "assistant",
      content: [{ type: "output_text", text: "..." }],
    },
  ],
});

// Use the compacted output as the starting context for the next turn.
const followUp = await client.responses.create({
  model: "MODEL_NAME",
  input: [...compacted.output, { role: "user", content: "Add a booking form." }],
});

console.log(followUp.output_text);
You can compact any item returned from previous requests, such as reasoning, message, and function call items.
curl -X POST https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/responses/compact \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AZURE_OPENAI_AUTH_TOKEN" \
  -d '{
    "model": "MODEL_NAME",
    "input": [
      {
        "role": "user",
        "content": "Create a simple landing page for a dog petting café."
      },
      {
        "id": "msg_001",
        "type": "message",
        "status": "completed",
        "content": [
          {
            "type": "output_text",
            "annotations": [],
            "logprobs": [],
            "text": "Below is a single file, ready-to-use landing page for a dog petting café:..."
          }
        ],
        "role": "assistant"
      }
    ]
  }'
# Use the compacted output as input for the next turn.
next_response = client.responses.create(
    model="MODEL_NAME",
    input=[*compacted.output, {"role": "user", "content": "Add opening hours."}],
)
print(next_response.output_text)
You can also use server-side compaction directly in Responses (POST /responses or client.responses.create) by setting context_management with a compact_threshold.
When the output token count crosses the configured threshold, the Responses API automatically runs compaction.
In this mode, you do not need to call /responses/compact separately.
The response includes an encrypted compaction item.
Server-side compaction also works when you set store=false on your Responses create requests.
The compaction item carries forward the essential prior state and reasoning into the next turn using fewer tokens. It is opaque and not intended to be human-readable.
If you are using stateless input-array chaining, append output items as usual. If you are using previous_response_id, pass only the new user message on each turn. In both patterns, the compaction item carries the context needed for the next window.
After appending output items to the previous input items, you can drop items that came before the most recent compaction item to keep requests smaller and reduce long-tail latency. The latest compaction item carries the necessary context to continue the conversation. If you use previous_response_id chaining, do not manually prune.
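The pruning step described above can be sketched as a small helper. This is a minimal illustration, not the service's own logic: it assumes each item has been converted to a dict and that compaction items are identified by type == "compaction". The helper name prune_before_latest_compaction and the sample items are hypothetical.

```python
from typing import Any


def prune_before_latest_compaction(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the items starting at the most recent compaction item.

    Assumption: compaction items carry type == "compaction". If no
    compaction item is present, a copy of the full list is returned.
    """
    # Walk backward so the first match is the most recent compaction item.
    for index in range(len(items) - 1, -1, -1):
        if items[index].get("type") == "compaction":
            return items[index:]
    return list(items)


# Hypothetical conversation history for illustration only.
history = [
    {"type": "message", "role": "user", "content": "Turn 1"},
    {"type": "compaction", "encrypted_content": "opaque-1"},
    {"type": "message", "role": "user", "content": "Turn 2"},
    {"type": "compaction", "encrypted_content": "opaque-2"},
    {"type": "message", "role": "user", "content": "Turn 3"},
]
pruned = prune_before_latest_compaction(history)
print([item["type"] for item in pruned])
```

Apply a helper like this only with stateless input-array chaining. With previous_response_id chaining, the service manages the context and no manual pruning is needed.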
Call responses as usual. Add context_management with compact_threshold to enable server-side compaction.
If the output crosses the threshold, the service triggers compaction, emits a compaction item in the output stream, and prunes the context before continuing inference.
Continue the conversation using one of these patterns:
Stateless input-array chaining: append output items, including compaction items, to the next input array.
previous_response_id chaining: pass only the new user message on each turn and carry the latest response ID forward.
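The two continuation patterns differ only in what goes into the next request. The following sketch builds the keyword arguments for each pattern without calling the service. The function name build_next_request and the sample IDs are hypothetical, and the context_management setting is left out because you add it to the request yourself.

```python
from typing import Any, Optional


def build_next_request(
    model: str,
    new_user_text: str,
    *,
    prior_items: Optional[list[dict[str, Any]]] = None,
    previous_response_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build keyword arguments for the next responses.create call."""
    user_message = {"type": "message", "role": "user", "content": new_user_text}
    if previous_response_id is not None:
        # previous_response_id chaining: send only the new user message.
        return {
            "model": model,
            "previous_response_id": previous_response_id,
            "input": [user_message],
        }
    # Stateless chaining: resend prior items, compaction items included.
    return {"model": model, "input": [*(prior_items or []), user_message]}


stateless = build_next_request(
    "MODEL_NAME",
    "Add opening hours.",
    prior_items=[{"type": "compaction", "encrypted_content": "opaque"}],
)
chained = build_next_request(
    "MODEL_NAME",
    "Add opening hours.",
    previous_response_id="resp_123",  # hypothetical response ID
)
print(len(stateless["input"]), len(chained["input"]))
```

The resulting dictionary can be unpacked into the call, for example client.responses.create(**chained).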
Stream the response as it’s generated by setting stream=true. The service emits incremental events you can consume to render output token-by-token.
During streaming, the Responses API might return an error event (for example, a 500 or 429 error) if the service encounters a problem such as a token limit or a parsing failure. Applications should detect this event and gracefully stop or restart streaming. You aren’t charged for tokens generated during failed streaming responses.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

stream = client.responses.create(
    model="MODEL_NAME",
    input="Summarize Azure OpenAI Responses API in one sentence.",
    stream=True,
)

# Print each text delta as it arrives.
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="")
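The guidance about error events can be sketched as a consumer loop that stops at the first error. This is a minimal illustration under stated assumptions: it assumes error events have type == "error" with code and message attributes, and it uses stand-in objects instead of a live stream so it runs without a network call. The function name collect_text and the sample error code are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def collect_text(events: Iterable[Any]) -> tuple[str, Optional[Any]]:
    """Concatenate text deltas and stop at the first error event.

    Returns (text_so_far, error_event_or_None).
    """
    chunks: list[str] = []
    for event in events:
        if event.type == "response.output_text.delta":
            chunks.append(event.delta)
        elif event.type == "error":
            # Stop consuming; the caller decides whether to retry.
            return "".join(chunks), event
    return "".join(chunks), None


# Stand-in events for illustration; a real stream comes from
# client.responses.create(..., stream=True).
fake_stream = [
    SimpleNamespace(type="response.output_text.delta", delta="Hello"),
    SimpleNamespace(type="error", code="rate_limit_exceeded", message="Too many requests"),
    SimpleNamespace(type="response.output_text.delta", delta=" world"),
]
text, error = collect_text(fake_stream)
print(text)
print(error.code if error else "no error")
```

Returning the partial text together with the error lets the caller choose between showing what arrived, restarting the request, or backing off before a retry.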
The Code Interpreter tool enables models to write and execute Python code in a secure, sandboxed environment. It supports a range of advanced tasks, including:
Processing files with varied data formats and structures
Generating files that include data and visualizations (for example, graphs)
Iteratively writing and running code to solve problems—models can debug and retry code until successful
Enhancing visual reasoning in supported models (for example, o3, o4-mini) by enabling image transformations such as cropping, zooming, and rotation
This tool is especially useful for scenarios involving data analysis, mathematical computation, and code generation.
curl -X POST https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/responses \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" \
  -d '{
    "model": "MODEL_NAME",
    "tools": [
      { "type": "code_interpreter", "container": {"type": "auto"} }
    ],
    "instructions": "You are a personal math tutor. When asked a math question, write and run code using the python tool to answer the question.",
    "input": "I need to solve the equation 3x + 11 = 14. Can you help me?"
  }'
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.responses.create(
    model="MODEL_NAME",
    # "auto" lets the service create or reuse a container for this call.
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    instructions="You are a math tutor. Write and run Python code to solve math problems.",
    input="Solve 3x + 11 = 14.",
)

print(response.output_text)
Code Interpreter has additional charges beyond the token-based fees for Azure OpenAI usage. If your application calls Code Interpreter through the Responses API simultaneously in two different threads, two Code Interpreter sessions are created. Each session is active by default for 1 hour with an idle timeout of 20 minutes.
The Code Interpreter tool requires a container, which is a fully sandboxed virtual machine where the model can execute Python code. Containers can include uploaded files or files generated during execution.
To create a container, specify "container": { "type": "auto", "file_ids": ["file-1", "file-2"] } in the tool configuration when creating a new Response object. This automatically creates a new container or reuses an active one from a previous code_interpreter_call in the model’s context. The code_interpreter_call item in the API output contains the container_id that was generated. This container expires if it is not used for 20 minutes.
When running Code Interpreter, the model can create its own files. For example, if you ask it to construct a plot or create a CSV file, it creates these files directly in your container and cites them in the annotations of its next message.
Any files in the model input are automatically uploaded to the container. You don’t have to upload them explicitly.
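To find the files that the model created, you can scan the message annotations in the response output. The following sketch assumes the output has been converted to plain dictionaries (for example, with response.model_dump()["output"]) and that file citations use an annotation of type container_file_citation with container_id, file_id, and filename fields. Treat that annotation shape as an assumption to verify against the reference documentation. The function name find_container_files and the sample IDs are hypothetical.

```python
from typing import Any


def find_container_files(output_items: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Collect container file citations from message annotations."""
    found: list[dict[str, str]] = []
    for item in output_items:
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            for annotation in part.get("annotations", []):
                # Assumption: generated files are cited with this annotation type.
                if annotation.get("type") == "container_file_citation":
                    found.append(
                        {
                            "container_id": annotation["container_id"],
                            "file_id": annotation["file_id"],
                            "filename": annotation.get("filename", ""),
                        }
                    )
    return found


# Hypothetical output items for illustration only.
sample_output = [
    {"type": "code_interpreter_call", "container_id": "cntr_123"},
    {
        "type": "message",
        "content": [
            {
                "type": "output_text",
                "text": "Here is the plot.",
                "annotations": [
                    {
                        "type": "container_file_citation",
                        "container_id": "cntr_123",
                        "file_id": "cfile_456",
                        "filename": "plot.png",
                    }
                ],
            }
        ],
    },
]
files = find_container_files(sample_output)
print(files)
```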
Retrieve the input items that were sent to a response. This is useful for inspecting the full conversation context, including any items added by the model (for example, function calls or compaction items).
Vision-enabled models can interpret images alongside text. They can recognize objects, shapes, colors, and textures, and read text contained within an image, subject to the limitations listed later in this article.
You can provide an image as input to a request in any of the following ways:
Send an image inline by encoding its bytes as a base64 data URI. Use this pattern when the image isn’t hosted at a public URL or when you want to avoid an extra network fetch.
import base64
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

# Read the image and encode its bytes as base64 text.
with open("path_to_your_image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = client.responses.create(
    model="MODEL_NAME",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is in this image?"},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{base64_image}"},
            ],
        }
    ],
)

print(response.output_text)
Upload an image with the Files API by using purpose="vision", then reference the returned file ID in your request. This approach is useful when you want to reuse the same image across multiple requests without resending its bytes.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)


def create_file(file_path):
    """Upload an image with purpose="vision" and return its file ID."""
    with open(file_path, "rb") as file_content:
        result = client.files.create(
            file=file_content,
            purpose="vision",
        )
    return result.id


file_id = create_file("path_to_your_image.jpg")

response = client.responses.create(
    model="MODEL_NAME",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is in this image?"},
                {"type": "input_image", "file_id": file_id},
            ],
        }
    ],
)

print(response.output_text)
The following table lists the supported file types for image inputs.
| File type | MIME type |
| --- | --- |
| PNG | image/png |
| JPEG | image/jpeg |
| WebP | image/webp |
| Non-animated GIF | image/gif |
In a single request, you can include up to 100 images. Each individual image file must be under 50 MB, and the combined size of all images in the request must also be under 50 MB.
Images must meet these additional requirements:
The image must be relevant to the prompt; the model isn’t designed for unrelated visual content.
Images shouldn’t contain harmful or sensitive content that violates content policies.
Image files can’t be corrupted or unreadable. If the model can’t process an image, the request fails.
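The file type and size limits above can be checked on the client before sending a request. The following sketch is illustrative only: the function name check_images is hypothetical, it assumes 1 MB means 1024 × 1024 bytes, and it infers the file type from the file name extension, so it can’t detect animated GIFs or corrupted files.

```python
from pathlib import PurePath

MAX_IMAGES = 100
MAX_BYTES = 50 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes

# Extension-to-MIME mapping for the supported file types.
SUPPORTED_MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def check_images(images: list[tuple[str, int]]) -> list[str]:
    """Return a list of problems for (filename, size_in_bytes) pairs."""
    problems: list[str] = []
    if len(images) > MAX_IMAGES:
        problems.append(f"too many images: {len(images)} > {MAX_IMAGES}")
    for name, size in images:
        if PurePath(name).suffix.lower() not in SUPPORTED_MIME_TYPES:
            problems.append(f"{name}: unsupported file type")
        if size >= MAX_BYTES:
            problems.append(f"{name}: {size} bytes is not under 50 MB")
    total = sum(size for _, size in images)
    if total >= MAX_BYTES:
        problems.append(f"combined size {total} bytes is not under 50 MB")
    return problems


print(check_images([("chart.png", 2_000_000), ("scan.bmp", 1_000)]))
```

An empty list means the batch passes these local checks. The service still applies its own validation.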
Use the detail property on an input_image content part to control how the model processes the image. Lower detail uses fewer tokens and is faster, while higher detail uses more tokens but lets the model capture finer features.
low: The model uses a lower-resolution version of the image. This option uses the fewest tokens and produces the fastest response, but the model might miss fine details.
high: The model uses a higher-resolution version of the image. This option captures finer details but uses more tokens and takes longer to respond.
auto: The default. The model selects the appropriate detail level based on the image and the prompt.
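As a minimal sketch of setting the detail property, the following helper builds an input_image content part and rejects values other than low, high, and auto. The function name image_part and the sample URL are hypothetical.

```python
ALLOWED_DETAIL = ("low", "high", "auto")


def image_part(image_url: str, detail: str = "auto") -> dict[str, str]:
    """Build an input_image content part with a validated detail level."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {ALLOWED_DETAIL}, got {detail!r}")
    return {"type": "input_image", "image_url": image_url, "detail": detail}


# Hypothetical image URL for illustration only.
part = image_part("https://example.com/receipt.jpg", detail="high")
print(part)
```

Place the returned dictionary in the content list of a user message, next to an input_text part.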
Vision-enabled models have the following limitations:
Medical images: The model isn’t suitable for interpreting specialized medical images such as CT scans and shouldn’t be used for medical advice.
Non-English text: The model might not perform optimally when handling images that contain text in non-Latin alphabets, such as Japanese or Korean.
Small text: Enlarge text within an image to improve readability, but avoid cropping out important details.
Rotation: The model might misinterpret rotated or upside-down text and images.
Visual elements: The model might struggle with graphs or text where colors or styles—such as solid, dashed, or dotted lines—vary.
Spatial reasoning: The model has difficulty with tasks that require precise spatial localization, such as identifying chess positions.
Accuracy: The model might generate incorrect descriptions or captions in some cases.
Image shape: The model has difficulty with panoramic and fisheye images.
Metadata and resizing: The model doesn’t process original file names or metadata, and images are resized before analysis, which affects their original dimensions.
Counting: The model might give approximate counts for objects in images.
CAPTCHAs: For safety reasons, a system is in place to block the submission of CAPTCHAs.
Models with vision capabilities support PDF input. PDF files can be provided either as Base64-encoded data or as file IDs. To help models interpret PDF content, both the extracted text and an image of each page are included in the model’s context. This is useful when key information is conveyed through diagrams or non-textual content.
All extracted text and images are put into the model’s context. Make sure you understand the pricing and token usage implications of using PDFs as input.
In a single API request, you can include more than one file, but each file must be under 50 MB. The combined limit across all files in the request is 50 MB.
Only models that support both text and image inputs can accept PDF files as input.
Setting the file upload purpose to user_data isn’t currently supported. As a temporary workaround, set purpose to assistants.
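To illustrate the Base64 option, the following sketch builds a PDF content part from raw bytes. It assumes the content part uses type input_file with filename and file_data fields, where file_data is a data URI. Confirm the exact shape in the Responses API reference. The function name pdf_part and the sample bytes are hypothetical, and the size check assumes 1 MB means 1024 × 1024 bytes.

```python
import base64

MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def pdf_part(filename: str, pdf_bytes: bytes) -> dict[str, str]:
    """Build an input_file content part from raw PDF bytes."""
    if len(pdf_bytes) >= MAX_FILE_BYTES:
        raise ValueError(f"{filename} is not under 50 MB")
    encoded = base64.b64encode(pdf_bytes).decode("utf-8")
    return {
        "type": "input_file",
        "filename": filename,
        "file_data": f"data:application/pdf;base64,{encoded}",
    }


# Placeholder bytes for illustration; read a real file with open(path, "rb").
part = pdf_part("report.pdf", b"%PDF-1.7 sample")
print(part["type"], part["filename"])
```

Add the returned dictionary to the content list of a user message together with an input_text part that asks a question about the document.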
You can extend the capabilities of your model by connecting it to tools hosted on remote Model Context Protocol (MCP) servers. These servers are maintained by developers and organizations and expose tools that can be accessed by MCP-compatible clients, such as the Responses API.
Model Context Protocol (MCP) is an open standard that defines how applications provide tools and contextual data to large language models (LLMs). It enables consistent, scalable integration of external tools into model workflows.
The following example shows how to use a remote MCP server to query information about an Azure REST API repository. The model retrieves and reasons over repository content in real time.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.responses.create(
    model="MODEL_NAME",
    tools=[
        {
            "type": "mcp",
            "server_label": "github",
            "server_url": "https://contoso.com/Azure/azure-rest-api-specs",
            "require_approval": "never",  # skip the approval step for this server
        }
    ],
    input="What transport protocols are supported in the 2025-03-26 version of the MCP spec?",
)

print(response.output_text)
The MCP tool works only in the Responses API, and is available across all newer models (gpt-4o, gpt-4.1, and our reasoning models). When you’re using the MCP tool, you only pay for tokens used when importing tool definitions or making tool calls—there are no additional fees involved.
By default, the Responses API requires explicit approval before any data is shared with a remote MCP server. This approval step helps ensure transparency and gives you control over what information is sent externally.
We recommend reviewing all data being shared with remote MCP servers and optionally logging it for auditing purposes.
When an approval is required, the model returns an mcp_approval_request item in the response output. This object contains the details of the pending request and allows you to inspect or modify the data before proceeding.
To proceed with the remote MCP call, you must respond to the approval request by creating a new response object that includes an mcp_approval_response item. This object confirms your intent to allow the model to send the specified data to the remote MCP server.
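The approval round trip can be sketched as a helper that turns pending approval requests into approval response items. This is a minimal illustration under stated assumptions: output items are plain dictionaries, each mcp_approval_request has id and server_label fields, and each mcp_approval_response takes approval_request_id and approve fields. The function name build_approval_responses, the allow-list policy, and the sample IDs are hypothetical.

```python
from typing import Any


def build_approval_responses(
    output_items: list[dict[str, Any]], allowed_labels: set[str]
) -> list[dict[str, Any]]:
    """Create one mcp_approval_response per pending mcp_approval_request.

    Requests are approved only when their server_label is on the allow-list.
    """
    responses: list[dict[str, Any]] = []
    for item in output_items:
        if item.get("type") != "mcp_approval_request":
            continue
        responses.append(
            {
                "type": "mcp_approval_response",
                "approval_request_id": item["id"],
                "approve": item.get("server_label") in allowed_labels,
            }
        )
    return responses


# Hypothetical output items for illustration only.
sample_output = [
    {
        "type": "mcp_approval_request",
        "id": "mcpr_001",
        "server_label": "github",
        "name": "fetch_docs",
        "arguments": "{}",
    },
    {"type": "message", "content": []},
]
approvals = build_approval_responses(sample_output, allowed_labels={"github"})
print(approvals)
```

Send the returned items as the input of a new request, and set previous_response_id to the ID of the response that contained the approval request.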
Unlike the GitHub MCP server, most remote MCP servers require authentication. The MCP tool in the Responses API supports custom headers, allowing you to securely connect to these servers using the authentication scheme they require.
You can specify headers such as API keys, OAuth access tokens, or other credentials directly in your request. The most commonly used header is the Authorization header.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.responses.create(
    model="MODEL_NAME",
    input="What is this repo in 100 words?",
    tools=[
        {
            "type": "mcp",
            "server_label": "github",
            "server_url": "https://contoso.com/Azure/azure-rest-api-specs",
            # Replace the placeholder with the token your MCP server expects.
            "headers": {"Authorization": "Bearer $YOUR_MCP_TOKEN"},
        }
    ],
)

print(response.output_text)
Background mode lets you run long-running tasks asynchronously with reasoning models such as o3 and o1-pro. It’s useful for complex tasks that can take several minutes to complete (for example, Codex- or Deep Research-style agents). When a request is sent with "background": true, the task is processed asynchronously, and you poll for its status.
Set background=true on the request to queue the task. The service returns immediately with a response ID and a queued status — use that ID to poll, stream, or cancel the task.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.responses.create(
    model="MODEL_NAME",
    input="Write me a very long story.",
    background=True,
)

print(response.status)
Continue polling while the status is queued or in_progress. Once the response reaches a terminal state, it’s available for retrieval.
from time import sleep

# Poll until the response leaves the queued and in_progress states.
while response.status in {"queued", "in_progress"}:
    print(f"Current status: {response.status}")
    sleep(2)
    response = client.responses.retrieve(response.id)

print(f"Final status: {response.status}\nOutput:\n{response.output_text}")
To stream a background response, set both background and stream to true. This pattern lets you resume streaming if the connection drops. Track your position with the sequence_number from each event.
stream = client.responses.create(
    model="MODEL_NAME",
    input="Write me a very long story.",
    background=True,
    stream=True,
)

cursor = None
for event in stream:
    print(event)
    # Events are objects, so read the sequence number as an attribute.
    cursor = event.sequence_number
Background responses currently have a higher time-to-first-token latency than synchronous responses. Improvements are underway to reduce this gap.
If a streaming connection drops, you can resume from a known event by passing stream=true along with starting_after=<sequence_number> on a GET to the response. The service replays events emitted after that sequence number.
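As a sketch of the resume request, the following helper builds the GET URL from the last sequence number you recorded. It assumes the response is addressed at responses/{response_id} under the v1 base URL used elsewhere in this article. The function name resume_url and the sample response ID are hypothetical.

```python
from urllib.parse import quote, urlencode


def resume_url(base_url: str, response_id: str, sequence_number: int) -> str:
    """Build the GET URL that resumes a stream after a known event."""
    query = urlencode({"stream": "true", "starting_after": sequence_number})
    # Assumption: a response is addressed at <base_url>/responses/<id>.
    return f"{base_url.rstrip('/')}/responses/{quote(response_id, safe='')}?{query}"


url = resume_url(
    "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    "resp_123",  # hypothetical response ID
    42,
)
print(url)
```

Send the GET request with the same authentication header you use for other calls, and keep updating the stored sequence number as new events arrive.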
When you use the Responses API in stateless mode (store=false), you must still preserve reasoning context across conversation turns. To do this, include encrypted reasoning items in your requests.
To retain reasoning items across turns, add reasoning.encrypted_content to the include parameter. The response then contains an encrypted version of the reasoning trace, which you can pass to future requests.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.responses.create(
    model="MODEL_NAME",
    reasoning={"effort": "medium"},
    input="What is the weather like today?",
    tools=[
        # Replace with your function or tool definitions.
    ],
    # Ask for the encrypted reasoning trace so it can be sent back later.
    include=["reasoning.encrypted_content"],
    store=False,
)

print(response.output_text)
The Responses API enables image generation as part of conversations and multi-step workflows. It supports image inputs and outputs within context, and it includes built-in tools for generating and editing images.
Compared to the standalone Image API, the Responses API offers two advantages:
Streaming: Display partial image outputs during generation to improve perceived latency.
Flexible inputs: Accept image file IDs as inputs in addition to raw image bytes.
The image generation tool in the Responses API is supported by gpt-image-1-series models, and you can call it from a set of compatible chat and reasoning models. For the current list of supported orchestration models, see the Supported models section later in this article.
The image generation tool doesn’t currently support streaming mode. To stream partial images, call the image generation API directly outside of the Responses API.
Use the Responses API to build conversational image experiences with GPT Image models.
import base64
import os
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Authenticate with Microsoft Entra ID instead of an API key.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://ai.azure.com/.default"
)

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=token_provider,
    default_headers={
        "x-ms-oai-image-generation-deployment": os.getenv("IMAGE_MODEL_NAME"),
        "api_version": "preview",
    },
)

response = client.responses.create(
    model="MODEL_NAME",
    input="Generate an image of a gray tabby cat hugging an otter with an orange scarf.",
    tools=[{"type": "image_generation"}],
)

# Collect the base64 image payloads from the image generation calls.
image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]

if image_data:
    with open("otter.png", "wb") as f:
        f.write(base64.b64decode(image_data[0]))
401/403: If you use Microsoft Entra ID, verify your token is scoped for https://ai.azure.com/.default. If you use an API key, confirm you’re using the correct key for the resource.