Quickstart
Get started using images in your chats with Azure OpenAI in Microsoft Foundry Models.API details
The following commands show how to call the Chat Completion API with vision-enabled models. For more information, see the API reference.- REST
- Python
Send a POST request to
https://{RESOURCE_NAME}.openai.azure.com/openai/v1/chat/completions where- RESOURCE_NAME is the name of your Azure OpenAI resource
Content-Type: application/jsonapi-key: {API_KEY}
Remember to set a
"max_tokens" or max_completion_tokens value, or the return output will be cut off. For o-series reasoning models, use max_completion_tokens instead of max_tokens.When uploading images, there’s a limit of 10 images per chat request.
Supported image formats include JPEG, PNG, GIF (first frame only), and WEBP.
Configure image detail level
You can optionally define a"detail" parameter in the "image_url" field. Choose one of three values, low, high, or auto, to adjust the way the model interprets and processes images.
autosetting: The default setting. The model decides between low or high based on the size of the image input.lowsetting: the model doesn’t activate the “high res” mode, instead processes a lower resolution 512x512 version, resulting in quicker responses and reduced token consumption for scenarios where fine detail isn’t crucial.highsetting: the model activates “high res” mode. Here, the model initially views the low-resolution image and then generates detailed 512x512 segments from the input image. Each segment uses double the token budget, allowing for a more detailed interpretation of the image.
Output
When you send an image to a vision-enabled model, the API returns a chat completion response with the model’s analysis. The response includes content filter results specific to Azure OpenAI."finish_reason" field. It has the following possible values:
stop: API returned complete model output.length: Incomplete model output due to themax_tokensinput parameter or model’s token limit.content_filter: Omitted content due to a flag from our content filters.
Input limitations
This section describes the limitations of vision-enabled chat models.Image support
- Maximum input image size: The maximum size for input images is restricted to 20 MB.
- Low resolution accuracy: When images are analyzed using the “low resolution” setting, it allows for faster responses and uses fewer input tokens for certain use cases. However, this could impact the accuracy of object and text recognition within the image.
- Image chat restriction: When you upload images in Microsoft Foundry portal or the API, you’re limited to 10 images per chat call.
Special pricing information
The following content is an example only, and prices are subject to change in the future.
- Input: $0.01 per 1,000 tokens
- Output: $0.03 per 1,000 tokens
Example image price calculation
For a typical use case, take an image with both visible objects and text and a 100-token prompt input. When the service processes the prompt, it generates 100 tokens of output. In the image, both text and objects can be detected. The price of this transaction would be:| Item | Detail | Cost |
|---|---|---|
| Text prompt input | 100 text tokens | $0.001 |
| Example image input (see Image tokens) | 170 + 85 image tokens | $0.00255 |
| Enhanced add-on features for OCR | $1.50 / 1,000 transactions | $0.0015 |
| Enhanced add-on features for Object Grounding | $1.50 / 1,000 transactions | $0.0015 |
| Output Tokens | 100 tokens (assumed) | $0.003 |
| Total | $0.00955 |
Troubleshooting
| Issue | Resolution |
|---|---|
| Output truncated | Increase max_tokens or max_completion_tokens value |
| Image not processed | Verify URL is publicly accessible or base64 encoding is correct |
| Rate limit exceeded | Implement retry logic with exponential backoff |