Chat Completions API

Generate conversational responses using large language models. The Chat Completions API enables you to build chatbots, virtual assistants, and other conversational AI applications.

Base URL

https://your-endpoint.inference.ml.azure.com/v1/chat/completions

Authentication

All requests must include authentication via one of these methods:

API Key (Header)

Authorization: Bearer YOUR_API_KEY

Microsoft Entra ID Token

Authorization: Bearer YOUR_ENTRA_TOKEN

Request Format

HTTP Method

POST /v1/chat/completions

Headers

| Header | Required | Description |
|---|---|---|
| Content-Type | Yes | Must be `application/json` |
| Authorization | Yes | Bearer token for authentication |
| User-Agent | No | Client identification string |

Request Body Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| messages | Array | Yes | - | Array of message objects representing the conversation |
| model | String | No | - | Model deployment name (if multiple models are available) |
| max_tokens | Integer | No | 4096 | Maximum tokens to generate in the response |
| temperature | Float | No | 1.0 | Controls randomness (0.0 to 2.0) |
| top_p | Float | No | 1.0 | Nucleus sampling parameter (0.0 to 1.0) |
| frequency_penalty | Float | No | 0.0 | Penalizes tokens in proportion to how often they have already appeared (-2.0 to 2.0) |
| presence_penalty | Float | No | 0.0 | Penalizes tokens that have appeared at all, encouraging new topics (-2.0 to 2.0) |
| stop | String/Array | No | null | Sequences that stop generation |
| stream | Boolean | No | false | Enable streaming responses |
| seed | Integer | No | null | Seed for best-effort deterministic sampling |
| tools | Array | No | null | Available function tools |
| tool_choice | String/Object | No | "auto" | Tool selection strategy |

Message Object Structure

{
  "role": "user|assistant|system|tool",
  "content": "Message content",
  "name": "function_name",
  "tool_calls": [],
  "tool_call_id": "call_id"
}

Message Roles

| Role | Description | Required Fields |
|---|---|---|
| system | System instructions and behavior | role, content |
| user | User messages and queries | role, content |
| assistant | Model responses | role, content or tool_calls |
| tool | Function call results | role, content, tool_call_id |

Request Examples

Basic Chat Request

curl -X POST "https://your-endpoint.inference.ml.azure.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user", 
        "content": "What is the capital of France?"
      }
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'

Multi-turn Conversation

{
  "messages": [
    {
      "role": "system",
      "content": "You are a knowledgeable travel advisor."
    },
    {
      "role": "user",
      "content": "I'm planning a trip to Japan. What should I know?"
    },
    {
      "role": "assistant",
      "content": "Japan is a fascinating destination! Key things to know: the best time to visit is spring (cherry blossoms) or fall (mild weather), you'll need cash as many places don't accept cards, and learning basic phrases like 'arigatou gozaimasu' (thank you) is appreciated."
    },
    {
      "role": "user",
      "content": "What about transportation within cities?"
    }
  ],
  "max_tokens": 200,
  "temperature": 0.8
}
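The API is stateless, so maintaining a multi-turn conversation like the one above is the client's job: each request must resend the full history, and the assistant's reply must be appended before the next turn. A minimal Python sketch, assuming a client object with the `chat.completions.create` shape used in the SDK examples later in this page (`ask` and its parameters are illustrative names):

```python
history = [{"role": "system", "content": "You are a knowledgeable travel advisor."}]

def ask(client, user_text, **params):
    """Append the user turn, call the API, and record the assistant
    reply in `history` so the next call carries the full conversation."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(messages=history, **params)
    reply = response.choices[0].message
    history.append({"role": "assistant", "content": reply.content})
    return reply.content
```

Each call grows `history` by two messages (user turn plus assistant turn), which also grows `prompt_tokens` on every request; long conversations eventually need truncation or summarization.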

Function Calling

{
  "messages": [
    {
      "role": "user",
      "content": "What's the weather like in Seattle?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "City name"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "Temperature unit"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Response Format

Standard Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 12,
    "total_tokens": 40
  },
  "system_fingerprint": "fp_abc123"
}

Response with Function Call

{
  "id": "chatcmpl-def456",
  "object": "chat.completion", 
  "created": 1677858242,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Seattle\", \"unit\": \"fahrenheit\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 15,
    "total_tokens": 60
  }
}
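When a response finishes with `tool_calls` like the one above, the client is expected to run the named function locally, then send the result back as a `tool` role message (carrying the matching `tool_call_id`) in a follow-up request. A minimal Python sketch of that hand-off, assuming the response has already been parsed to a dict; the dispatch table and local `get_weather` are illustrative:

```python
import json

def handle_tool_calls(response, available_tools):
    """Turn a tool_calls response into the messages to append before
    the follow-up request: the assistant message itself, then one
    'tool' message per call with the function's result."""
    message = response["choices"][0]["message"]
    follow_ups = []
    for call in message.get("tool_calls") or []:
        fn = available_tools[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        result = fn(**args)
        follow_ups.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return [message] + follow_ups
```

The returned messages are appended to the original `messages` array and the request is re-sent, at which point the model composes its final natural-language answer from the tool output.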

Streaming Response

When stream: true, responses are sent as Server-Sent Events:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677858242,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677858242,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677858242,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

data: [DONE]
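Each event line carries a JSON chunk whose `delta` holds the next fragment of the assistant's message, and the stream ends with the non-JSON sentinel `[DONE]`. A minimal Python sketch of reassembling the text from event lines like those above (real clients read these incrementally from the HTTP response):

```python
import json

def accumulate_stream(sse_lines):
    """Reassemble the assistant's text from Server-Sent Event lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # stream terminator; not JSON, so stop before parsing
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)
```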

Response Fields

Root Level Fields

| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for the completion |
| object | String | Object type: chat.completion |
| created | Integer | Unix timestamp of creation |
| model | String | Model used for completion |
| choices | Array | Array of completion choices |
| usage | Object | Token usage information |
| system_fingerprint | String | System configuration identifier |

Choice Object Fields

| Field | Type | Description |
|---|---|---|
| index | Integer | Choice index in the array |
| message | Object | The completion message |
| finish_reason | String | Reason the completion stopped |

Finish Reasons

| Reason | Description |
|---|---|
| stop | Natural stopping point or stop sequence reached |
| length | Maximum token limit reached |
| tool_calls | Model called a function |
| content_filter | Content filtered by safety systems |

Usage Object Fields

| Field | Type | Description |
|---|---|---|
| prompt_tokens | Integer | Tokens in the input prompt |
| completion_tokens | Integer | Tokens in the generated completion |
| total_tokens | Integer | Total tokens used (prompt + completion) |

Error Responses

Error Format

{
  "error": {
    "message": "Error description",
    "type": "error_type",
    "param": "parameter_name",
    "code": "error_code"
  }
}

Common Error Types

| Status Code | Error Type | Description |
|---|---|---|
| 400 | invalid_request_error | Malformed request |
| 401 | authentication_error | Invalid or missing API key |
| 403 | permission_error | Insufficient permissions |
| 404 | not_found_error | Endpoint or resource not found |
| 429 | rate_limit_error | Rate limit exceeded |
| 500 | api_error | Internal server error |
| 503 | service_unavailable | Service temporarily unavailable |
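Clients typically separate retryable failures (429, 500, 503) from ones that require fixing the request or credentials. A minimal sketch that combines the status code with the error body format shown above (the retryable set here is an assumption based on this table, not a documented guarantee):

```python
RETRYABLE_STATUSES = {429, 500, 503}

def classify_error(status_code, body):
    """Map an error response (HTTP status plus parsed JSON body) to a
    summary with a retry hint, per the error format above."""
    err = body.get("error", {})
    return {
        "type": err.get("type", "unknown"),
        "message": err.get("message", ""),
        "retryable": status_code in RETRYABLE_STATUSES,
    }
```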

Rate Limits

Rate limits vary by deployment and pricing tier:

| Metric | Limit |
|---|---|
| Requests per minute | Varies by tier |
| Tokens per minute | Varies by tier |
| Concurrent requests | 100 (typical) |

Rate limit headers are included in responses:
  • x-ratelimit-limit-requests
  • x-ratelimit-remaining-requests
  • x-ratelimit-reset-requests

Content Filtering

Azure AI Foundry includes built-in content filtering for safety:

Filter Categories

  • Hate: Discriminatory content
  • Violence: Violent or harmful content
  • Sexual: Sexual content
  • Self-harm: Content promoting self-harm

Filter Levels

  • Safe: Content passes all filters
  • Low: Low-risk content allowed
  • Medium: Moderate-risk content blocked
  • High: High-risk content blocked

Filter Response

When content is filtered, the response includes:
{
  "choices": [
    {
      "finish_reason": "content_filter",
      "content_filter_results": {
        "hate": {"filtered": false, "severity": "safe"},
        "violence": {"filtered": true, "severity": "medium"}
      }
    }
  ]
}
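Client code can inspect `content_filter_results` to report which categories triggered the block rather than failing silently. A minimal sketch over a parsed choice object with the shape shown above:

```python
def filtered_categories(choice):
    """Return the filter categories that blocked this choice,
    based on the content_filter_results schema above."""
    results = choice.get("content_filter_results", {})
    return sorted(cat for cat, r in results.items() if r.get("filtered"))
```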

Best Practices

Performance Optimization

  • Use appropriate max_tokens to avoid unnecessary generation
  • Implement caching for repeated queries
  • Use streaming for long responses to improve perceived latency
  • Batch multiple independent requests when possible

Cost Management

  • Monitor token usage with the usage field
  • Set reasonable max_tokens limits
  • Use shorter prompts when possible
  • Implement request deduplication

Security

  • Never expose API keys in client-side code
  • Use Microsoft Entra ID for production applications
  • Implement rate limiting on your application side
  • Validate and sanitize user inputs

Error Handling

  • Implement exponential backoff for rate limit errors
  • Handle content filter responses gracefully
  • Log errors for debugging and monitoring
  • Provide meaningful error messages to users
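The exponential-backoff recommendation above can be sketched as a small retry wrapper. `RateLimitError` is a stand-in for whatever exception your client raises on HTTP 429; the delay doubles per attempt, with jitter so concurrent clients do not retry in lockstep:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the client library's HTTP 429 exception."""

def with_backoff(send, max_retries=5, base_delay=1.0):
    """Call `send` (a zero-argument callable issuing the request),
    retrying rate-limit failures with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; surface the error
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
```

In practice the `x-ratelimit-reset-requests` header, when present, gives a better wait time than a blind backoff.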

SDK Examples

Python

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR_API_KEY",
    api_version="2024-02-01",
    azure_endpoint="https://your-endpoint.inference.ml.azure.com"
)

response = client.chat.completions.create(
    model="your-deployment-name",  # deployment name is required by the SDK
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=150
)

print(response.choices[0].message.content)

JavaScript

// Top-level await requires an ES module (use import, not require)
import { AzureOpenAI } from "openai";

const client = new AzureOpenAI({
    apiKey: "YOUR_API_KEY",
    apiVersion: "2024-02-01",
    endpoint: "https://your-endpoint.inference.ml.azure.com",
    deployment: "your-deployment-name"
});

const response = await client.chat.completions.create({
    model: "your-deployment-name",
    messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Hello!" }
    ],
    max_tokens: 150
});

console.log(response.choices[0].message.content);

C#

using Azure;
using Azure.AI.OpenAI;

// Azure.AI.OpenAI 1.x style: OpenAIClient with ChatCompletionsOptions
var client = new OpenAIClient(
    new Uri("https://your-endpoint.inference.ml.azure.com"),
    new AzureKeyCredential("YOUR_API_KEY")
);

var response = await client.GetChatCompletionsAsync(
    new ChatCompletionsOptions()
    {
        DeploymentName = "your-deployment-name",
        Messages =
        {
            new ChatRequestSystemMessage("You are a helpful assistant."),
            new ChatRequestUserMessage("Hello!")
        },
        MaxTokens = 150
    }
);

Console.WriteLine(response.Value.Choices[0].Message.Content);