The Responses API supports a WebSocket mode for long-running, tool-heavy workflows. In WebSocket mode, you keep a persistent connection to /v1/responses and continue each turn by sending only new input items together with a previous_response_id. This approach reduces per-turn overhead and improves end-to-end latency across long chains. WebSocket mode works with store=false.
Use WebSocket mode when a workflow involves many model-tool round trips, such as agentic coding or orchestration loops with repeated tool calls. Because the connection stays open and each turn sends only incremental input, continuation latency is lower than with repeated HTTP requests. For single-shot requests or short conversations, keep using the standard HTTP Responses API.
You open one WebSocket connection to /v1/responses and drive it with response.create events:
The first response.create starts a new turn. The payload mirrors the HTTP create body, except that transport-specific fields like stream and background don’t apply.
Follow-up response.create messages chain from the prior response using previous_response_id and include only new input items.
Server events and ordering match the existing Responses streaming event model.
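The event flow above can be sketched as two small payload builders. This is a minimal sketch, not an official client: the helper names (user_message, first_turn_event, follow_up_event), the deployment name, and the response ID resp_123 are illustrative assumptions. Only the response.create type and the model, store, input, tools, and previous_response_id fields come from the text.

```python
import json


def user_message(text):
    """Build one user message input item."""
    return {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": text}],
    }


def first_turn_event(model, text, store=False):
    """First response.create: full initial input, no previous_response_id.

    Transport-specific fields such as stream and background are left out.
    """
    return {
        "type": "response.create",
        "model": model,
        "store": store,
        "input": [user_message(text)],
        "tools": [],
    }


def follow_up_event(model, previous_response_id, new_items, store=False):
    """Follow-up response.create: chain from the prior response, send only new items."""
    return {
        "type": "response.create",
        "model": model,
        "store": store,
        "previous_response_id": previous_response_id,
        "input": list(new_items),
        "tools": [],
    }


# "resp_123" is a made-up ID standing in for the ID of the previous response.
first = first_turn_event("gpt-4.1", "Find fizz_buzz()")
follow = follow_up_event("gpt-4.1", "resp_123", [user_message("Now optimize it.")])
print(json.dumps(follow, indent=2))
```

Each dictionary would be serialized with json.dumps and sent on the open socket. Keeping the builders separate from the socket code makes the payloads easy to inspect and test.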
Send a response.create event on the open socket. The following example connects to the WebSocket endpoint and asks the model a question. WebSocket mode supports both API key and Microsoft Entra ID authentication; the example shown here uses an API key.
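import json

from websocket import create_connection  # provided by the websocket-client package

# Placeholders: replace with your resource name and API key.
YOUR_RESOURCE_NAME = "<your-resource-name>"
YOUR_AOAI_API_KEY = "<your-api-key>"

# Open one persistent WebSocket connection to the Responses endpoint.
ws = create_connection(
    f"wss://{YOUR_RESOURCE_NAME}.openai.azure.com/openai/v1/responses",
    header=[f"Authorization: Bearer {YOUR_AOAI_API_KEY}"],
)

# Start the first turn with a response.create event.
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4.1",  # Replace with your model deployment name
    "store": False,
    "input": [
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Find fizz_buzz()"}],
        }
    ],
    "tools": [],
}))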
You can optionally warm up request state by sending response.create with generate: false. Use this option when you already know the tools, instructions, or messages you plan to send with an upcoming turn. A warmup doesn’t return model output but prepares request state so the next generated turn can start faster. The warmup request returns a response ID that you can chain from by using previous_response_id.
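A warmup request might look like the sketch below. The generate field comes from the text; the helper name warmup_event, the search_code tool, and the exact set of fields sent alongside generate are assumptions made for illustration, so check them against the API reference before relying on them.

```python
import json


def warmup_event(model, instructions, tools, input_items=None):
    """Build a response.create event that prepares request state without generating output."""
    event = {
        "type": "response.create",
        "model": model,
        "store": False,
        "generate": False,  # warm up only; no model output is returned
        "instructions": instructions,
        "tools": list(tools),
    }
    if input_items:
        # Include messages only when they are already known.
        event["input"] = list(input_items)
    return event


# Hypothetical function tool, defined only for this example.
search_tool = {
    "type": "function",
    "name": "search_code",
    "description": "Search the repository for a symbol.",
    "parameters": {
        "type": "object",
        "properties": {"symbol": {"type": "string"}},
        "required": ["symbol"],
    },
}

warmup = warmup_event("gpt-4.1", "You are a coding assistant.", [search_tool])
payload = json.dumps(warmup)
print(payload)
# ws.send(payload)  # send on the open socket; chain the next turn from the returned response ID
```

The next generated turn would then pass the warmup's response ID as previous_response_id and include only the new input items.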
Read events from the WebSocket, print text as it streams in, and stop when the response is done.
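while True:
    event = json.loads(ws.recv())
    if event["type"] == "response.output_text.delta":
        # Print each text fragment as it arrives.
        print(event["delta"], end="", flush=True)
    elif event["type"] == "response.completed":
        # Keep the response ID so the next turn can chain from it.
        response_id = event.get("response", {}).get("id")
        print(f"\nResponse ID: {response_id}")
        break
    elif event["type"] == "error":
        # Stop on an error event instead of waiting for a completion.
        print(f"\nError: {event.get('error', {}).get('code')}")
        break

# Close the socket only when you are done with all turns.
# ws.close()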
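For multi-turn use, the read loop can be wrapped in a reusable function. The sketch below is one possible shape, not part of any SDK: collect_response and socket_messages are hypothetical helper names, and the fake event list stands in for a live connection so the logic can run without a socket.

```python
import json


def collect_response(messages):
    """Consume raw JSON event strings until the response completes.

    Returns (text, response_id). Raises RuntimeError on an error event.
    """
    parts = []
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.output_text.delta":
            parts.append(event["delta"])
        elif kind == "response.completed":
            return "".join(parts), event.get("response", {}).get("id")
        elif kind == "error":
            err = event.get("error", {})
            raise RuntimeError(f"{err.get('code')}: {err.get('message')}")
    raise ConnectionError("Stream ended before response.completed")


def socket_messages(ws):
    """Yield raw messages from an open websocket-client connection."""
    while True:
        yield ws.recv()


# Simulated events in place of a live socket (IDs and text are made up).
fake_events = [
    json.dumps({"type": "response.output_text.delta", "delta": "Hel"}),
    json.dumps({"type": "response.output_text.delta", "delta": "lo"}),
    json.dumps({"type": "response.completed", "response": {"id": "resp_1"}}),
]
text, response_id = collect_response(fake_events)
print(text, response_id)
```

With a live connection, the call would be collect_response(socket_messages(ws)), and the returned response_id feeds the previous_response_id of the next turn.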
WebSocket mode uses the same previous_response_id chaining as HTTP mode, but adds a lower-latency continuation path on the active socket.
On an active WebSocket connection, the service keeps one previous-response state in a connection-local in-memory cache (the most recent response). Continuing from that response is fast because the service reuses connection-local state. Because this state is retained only in memory and isn’t written to disk, WebSocket mode is compatible with store=false.
If a previous_response_id isn’t in the in-memory cache, behavior depends on whether you store responses:
With store=true, the service might hydrate older response IDs from persisted state. Continuation still works but usually loses the in-memory latency benefit.
With store=false, there’s no persisted fallback. If the ID is uncached, the request returns previous_response_not_found.
If a turn fails (4xx or 5xx), the service evicts the referenced previous_response_id from the connection-local cache. This prevents reusing stale cached state for that failed continuation.
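The lookup rules above can be illustrated with a toy model. This is only a mental model written for this article, not the service's implementation: the class ConnectionCacheModel and its method names are invented, and the "hydrated" outcome simplifies the text's "might hydrate" to "always hydrates".

```python
class ConnectionCacheModel:
    """Toy model of continuation lookup on one WebSocket connection."""

    def __init__(self):
        self.cached_id = None   # only the most recent response is kept in memory
        self.persisted = set()  # IDs of responses created with store=true

    def complete(self, response_id, store):
        """Record a finished response; it replaces the previous cache entry."""
        self.cached_id = response_id
        if store:
            self.persisted.add(response_id)

    def resolve(self, previous_response_id):
        """Return how a continuation from previous_response_id would be served."""
        if previous_response_id == self.cached_id:
            return "in_memory"
        if previous_response_id in self.persisted:
            return "hydrated"
        return "previous_response_not_found"

    def fail(self, previous_response_id):
        """A failed turn evicts the referenced ID from the connection-local cache."""
        if self.cached_id == previous_response_id:
            self.cached_id = None


# store=false: only the latest response can be continued.
unstored = ConnectionCacheModel()
unstored.complete("resp_1", store=False)
unstored.complete("resp_2", store=False)
print(unstored.resolve("resp_2"))  # in_memory
print(unstored.resolve("resp_1"))  # previous_response_not_found

# A failed continuation evicts the cached entry.
unstored.fail("resp_2")
print(unstored.resolve("resp_2"))  # previous_response_not_found

# store=true: older IDs fall back to persisted state.
stored = ConnectionCacheModel()
stored.complete("resp_a", store=True)
stored.complete("resp_b", store=True)
print(stored.resolve("resp_a"))  # hydrated
```

The practical takeaway is that with store=false a client should always chain from the most recent response ID and keep enough local context to restart the chain if that ID is lost.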
When you enable server-side compaction (context_management with compact_threshold), compaction happens during normal /responses generation. In WebSocket mode, you continue the same way you normally do: send the next response.create with the latest previous_response_id and only new input items.
The standalone /responses/compact endpoint returns a new compacted input window, not a response ID. After compaction, start a new response on your WebSocket connection by omitting previous_response_id (or setting it to null) and passing the compacted output as input, plus the next user or tool items. Pass the compacted output as-is; don’t prune the returned window.
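# Compact your current window (HTTP call).
# Assumes `client` is an already configured OpenAI client for your resource and
# `long_input_items_array` holds the input items accumulated so far.
compacted = client.responses.compact(
    model="gpt-4.1",
    input=long_input_items_array,
)

# Start a new response on the WebSocket using the compacted window.
# Note: json.dumps needs plain dicts. If compacted.output contains SDK model
# objects, convert each item first (for example, with item.model_dump()).
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4.1",
    "store": False,
    "input": [
        *compacted.output,
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Continue from here."}],
        },
    ],
    "tools": [],
}))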
When a connection closes or hits the 60-minute limit, open a new WebSocket connection and continue with one of these patterns:
If your prior response is persisted (store=true) and you have a valid response ID, continue with previous_response_id and new input items.
If you can’t continue the chain (for example, store=false or previous_response_not_found), start a new response by omitting previous_response_id (or setting it to null) and send the full input context for the next turn.
If you compacted context with /responses/compact, use the returned compacted window as the base input for the new response, then append the latest user or tool items.
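The three recovery patterns can be folded into one helper that builds the first event for a new connection. This is a sketch under stated assumptions: recovery_event is a hypothetical name, and preferring a compacted window over a stored response ID when both are available is a choice made for this example, not something the text prescribes.

```python
def recovery_event(model, new_items, *, store, last_response_id=None,
                   full_context=None, compacted_window=None):
    """Build the first response.create to send on a freshly opened connection."""
    event = {"type": "response.create", "model": model, "store": store, "tools": []}
    if compacted_window is not None:
        # Compacted window as the base input, passed as-is, plus the latest items.
        event["input"] = [*compacted_window, *new_items]
    elif store and last_response_id:
        # Persisted response: chain from it and send only new items.
        event["previous_response_id"] = last_response_id
        event["input"] = list(new_items)
    else:
        # No usable chain: start over and resend the full context.
        if full_context is None:
            raise ValueError("full_context is required when the chain cannot be continued")
        event["input"] = [*full_context, *new_items]
    return event


def user_item(text):
    """Build one user message input item."""
    return {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": text}],
    }


history = [user_item("Find fizz_buzz()")]
new_item = user_item("Continue.")

# "resp_abc" is a made-up ID standing in for the last response before the disconnect.
chained = recovery_event("gpt-4.1", [new_item], store=True, last_response_id="resp_abc")
restarted = recovery_event("gpt-4.1", [new_item], store=False,
                           last_response_id="resp_abc", full_context=history)
print(sorted(chained), len(restarted["input"]))
```

Because store=false leaves no persisted fallback after a reconnect, a client using it needs to keep the full conversation context (or a compacted window) locally.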
previous_response_not_found: The referenced response ID isn’t in the connection-local cache and there’s no persisted state to hydrate from. Start a new chain, or enable store=true if your scenario allows it.
{
  "type": "error",
  "status": 400,
  "error": {
    "code": "previous_response_not_found",
    "message": "Previous response with id 'resp_abc' not found.",
    "param": "previous_response_id"
  }
}
websocket_connection_limit_reached: The connection has reached the 60-minute limit. Open a new WebSocket connection and continue with one of the reconnect-and-recover patterns described earlier.
{
  "type": "error",
  "status": 400,
  "error": {
    "type": "invalid_request_error",
    "code": "websocket_connection_limit_reached",
    "message": "Responses websocket connection limit reached (60 minutes). Create a new websocket connection to continue."
  }
}
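A client can map these two error codes to recovery steps. The sketch below assumes the error event shape shown in the samples above; the function name classify_error and the action labels ("start_new_chain", "reconnect", "raise") are invented for illustration.

```python
import json

# Action labels are illustrative, chosen for this example.
RECOVERY_ACTIONS = {
    "previous_response_not_found": "start_new_chain",
    "websocket_connection_limit_reached": "reconnect",
}


def classify_error(raw):
    """Return a recovery action for an error event, or None for non-error events."""
    event = json.loads(raw)
    if event.get("type") != "error":
        return None
    code = event.get("error", {}).get("code")
    # Unknown codes are surfaced to the caller rather than retried blindly.
    return RECOVERY_ACTIONS.get(code, "raise")


not_found = json.dumps({
    "type": "error",
    "status": 400,
    "error": {
        "code": "previous_response_not_found",
        "message": "Previous response with id 'resp_abc' not found.",
        "param": "previous_response_id",
    },
})
limit_reached = json.dumps({
    "type": "error",
    "status": 400,
    "error": {
        "type": "invalid_request_error",
        "code": "websocket_connection_limit_reached",
        "message": "Responses websocket connection limit reached (60 minutes). "
                   "Create a new websocket connection to continue.",
    },
})

print(classify_error(not_found))
print(classify_error(limit_reached))
```

Keying on the code field rather than the message text keeps the handling stable if the wording of the messages changes.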