How to use vision-enabled chat models - Microsoft Foundry Docs

Vision-enabled chat models are large multimodal models (LMMs) developed by OpenAI that can analyze images and provide textual responses to questions about them. They incorporate both natural language processing and visual understanding. The current vision-enabled models are the o-series reasoning models, the GPT-5 series, the GPT-4.1 series, GPT-4.5, and the GPT-4o series. These models can answer general questions about what’s present in the images you upload.

To use vision-enabled models, you call the Chat Completion API on a supported model that you have deployed. If you’re not familiar with the Chat Completion API, see the Vision-enabled chat how-to guide.

Quickstart

Get started using images in your chats with Azure OpenAI in Microsoft Foundry Models.

API details

The following commands show how to call the Chat Completion API with vision-enabled models. For more information, see the API reference.

REST

Send a POST request to https://{RESOURCE_NAME}.openai.azure.com/openai/v1/chat/completions, where RESOURCE_NAME is the name of your Azure OpenAI resource.

Required headers:

Content-Type: application/json
api-key: {API_KEY}

Body: The following is a sample request body. The format is the same as the chat completions API for GPT-4o, except that the message content can be an array containing text and images (either a valid publicly accessible HTTP or HTTPS URL to an image, or a base-64-encoded image).

Remember to set a max_tokens or max_completion_tokens value, or the returned output will be cut off. For o-series reasoning models, use max_completion_tokens instead of max_tokens.

When uploading images, there’s a limit of 10 images per chat request.

Supported image formats include JPEG, PNG, GIF (first frame only), and WEBP.

{
    "model": "MODEL-DEPLOYMENT-NAME",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this picture:"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<image URL>"
                    }
                }
            ]
        }
    ],
    "max_tokens": 100,
    "stream": false
}
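As a rough illustration, the REST call described above could be assembled in Python using only the standard library. The helper name build_vision_request and the placeholder values are assumptions made for this sketch; the endpoint, headers, and body shape follow the description above. The request is only built here, not sent.

```python
import json
import urllib.request


def build_vision_request(resource_name, api_key, deployment, prompt, image_url, max_tokens=100):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    url = f"https://{resource_name}.openai.azure.com/openai/v1/chat/completions"
    body = {
        "model": deployment,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                # Content is an array that mixes a text part and an image part.
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


# Placeholder values only; a real resource name and key are needed to send it.
request = build_vision_request(
    "YOUR-RESOURCE-NAME",
    "YOUR-API-KEY",
    "MODEL-DEPLOYMENT-NAME",
    "Describe this picture:",
    "https://example.com/picture.jpg",
)
print(request.full_url)
```

To actually send the request, the object could be passed to urllib.request.urlopen once real credentials are in place.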

Python

Define your Azure OpenAI base_url and API key.

Create a client object using those values.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

Then call the client’s create method. The following code shows a sample request body. The format is the same as the chat completions API for GPT-4o, except that the message content can be an array containing text and images (either a valid HTTP or HTTPS URL to an image, or a base-64-encoded image).

Remember to set a max_tokens or max_completion_tokens value, or the returned output will be cut off. For o-series reasoning models, use max_completion_tokens instead of max_tokens.

response = client.chat.completions.create(
    model="MODEL-DEPLOYMENT-NAME",
    messages=[
        { "role": "system", "content": "You are a helpful assistant." },
        { "role": "user", "content": [  
            { 
                "type": "text", 
                "text": "Describe this picture:" 
            },
            { 
                "type": "image_url",
                "image_url": {
                    "url": "<image URL>"
                }
            }
        ] } 
    ],
    max_tokens=2000 
)
print(response)

Use a local image

If you want to use a local image, you can use the following Python code to convert it to base64 so it can be passed to the API. Alternative file conversion tools are available online.

import base64
from mimetypes import guess_type

# Function to encode a local image into data URL 
def local_image_to_data_url(image_path):
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"

# Example usage
image_path = '<path_to_image>'
data_url = local_image_to_data_url(image_path)
print("Data URL:", data_url)

When your base64 image data is ready, you can pass it to the API in the request body like this:

...
"type": "image_url",
"image_url": {
   "url": "data:image/jpeg;base64,<your_image_data>"
}
...
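Putting the two pieces together, a complete user message with a local image might be built as in the following minimal sketch. The helper name build_local_image_message is hypothetical, and the example uses stand-in bytes rather than a real image file; the MIME type fallback mirrors the function above.

```python
import base64
from mimetypes import guess_type


def build_local_image_message(prompt, image_bytes, filename):
    """Return a user message whose content holds text plus a base64 data URL."""
    # Guess the MIME type from the file extension, as in the function above.
    mime_type, _ = guess_type(filename)
    if mime_type is None:
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Stand-in bytes (the first three bytes of a JPEG file), not a full image.
message = build_local_image_message("Describe this picture:", b"\xff\xd8\xff", "photo.jpg")
print(message["content"][1]["image_url"]["url"])
```

The resulting dictionary can be placed in the messages list in the same position as the URL-based user message.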

Configure image detail level

You can optionally define a "detail" parameter in the "image_url" field. Choose one of three values, low, high, or auto, to adjust the way the model interprets and processes images.

auto setting: The default setting. The model chooses between low and high based on the size of the image input.
low setting: The model doesn’t activate “high res” mode and instead processes a lower-resolution 512x512 version of the image, which gives quicker responses and lower token consumption for scenarios where fine detail isn’t crucial.
high setting: The model activates “high res” mode. It first views the low-resolution image and then generates detailed 512x512 segments from the input image. Each segment uses double the token budget, allowing for a more detailed interpretation of the image.

You set the value using the format shown in this example:

{ 
    "type": "image_url",
    "image_url": {
        "url": "<image URL>",
        "detail": "high"
    }
}
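One way to guard against typos in the detail value is to build the image part through a small helper. This is only a sketch: the function name image_part and the choice to reject unknown values locally are assumptions, not part of the API.

```python
ALLOWED_DETAIL_LEVELS = ("low", "high", "auto")


def image_part(url, detail="auto"):
    """Build an image_url content part, validating the detail level first."""
    if detail not in ALLOWED_DETAIL_LEVELS:
        raise ValueError(
            f"detail must be one of {ALLOWED_DETAIL_LEVELS}, got {detail!r}"
        )
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


# Example usage with a placeholder URL.
part = image_part("https://example.com/picture.jpg", detail="high")
print(part)
```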

Output

When you send an image to a vision-enabled model, the API returns a chat completion response with the model’s analysis. The response includes content filter results specific to Azure OpenAI.

{
    "id": "chatcmpl-8VAVx58veW9RCm5K1ttmxU6Cm4XDX",
    "object": "chat.completion",
    "created": 1702439277,
    "model": "gpt-4o",
    "prompt_filter_results": [
        {
            "prompt_index": 0,
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "choices": [
        {
            "finish_reason":"stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The picture shows an individual dressed in formal attire, which includes a black tuxedo with a black bow tie. There is an American flag on the left lapel of the individual's jacket. The background is predominantly blue with white text that reads \"THE KENNEDY PROFILE IN COURAGE AWARD\" and there are also visible elements of the flag of the United States placed behind the individual."
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "usage": {
        "prompt_tokens": 1156,
        "completion_tokens": 80,
        "total_tokens": 1236
    }
}

Every response includes a "finish_reason" field. It has the following possible values:

stop: API returned complete model output.
length: Incomplete model output due to the max_tokens input parameter or model’s token limit.
content_filter: Omitted content due to a flag from our content filters.
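A caller might branch on these values along the following lines. This sketch assumes the response has already been parsed into a Python dictionary shaped like the sample above; the function name summarize_choice and the note strings are hypothetical. With the Python client, the same value is available as response.choices[0].finish_reason.

```python
def summarize_choice(response):
    """Return (text, note) for the first choice of a chat completion response dict."""
    choice = response["choices"][0]
    reason = choice.get("finish_reason")
    text = choice.get("message", {}).get("content")
    if reason == "stop":
        note = "complete"
    elif reason == "length":
        note = "truncated: raise max_tokens or max_completion_tokens"
    elif reason == "content_filter":
        note = "content omitted by the content filter"
    else:
        note = f"unrecognized finish_reason: {reason!r}"
    return text, note


# Example usage with a trimmed-down response dictionary.
sample = {
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {"role": "assistant", "content": "The picture shows"},
        }
    ]
}
print(summarize_choice(sample))
```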

Input limitations

This section describes the limitations of vision-enabled chat models.

Image support

Maximum input image size: The maximum size for input images is restricted to 20 MB.
Low resolution accuracy: Analyzing images with the “low resolution” setting gives faster responses and uses fewer input tokens for certain use cases. However, it can reduce the accuracy of object and text recognition within the image.
Image chat restriction: When you upload images in Microsoft Foundry portal or the API, you’re limited to 10 images per chat call.
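These limits can be checked on the client before a request is sent. The following is a sketch under stated assumptions: the name check_images is hypothetical, 20 MB is interpreted as 20 × 1024 × 1024 bytes, and the format check relies only on file extensions for the formats listed earlier (JPEG, PNG, GIF, WEBP).

```python
import os

MAX_IMAGES_PER_REQUEST = 10
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumption: 20 MB read as 20 MiB
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def check_images(images):
    """Check (filename, size_in_bytes) pairs and return a list of problems found."""
    problems = []
    if len(images) > MAX_IMAGES_PER_REQUEST:
        problems.append(
            f"{len(images)} images exceeds the limit of {MAX_IMAGES_PER_REQUEST} per request"
        )
    for filename, size in images:
        extension = os.path.splitext(filename)[1].lower()
        if extension not in SUPPORTED_EXTENSIONS:
            problems.append(f"{filename}: unsupported format {extension!r}")
        if size > MAX_IMAGE_BYTES:
            problems.append(f"{filename}: {size} bytes exceeds the 20 MB limit")
    return problems


# Example usage with made-up file names and sizes.
print(check_images([("photo.jpg", 1_000_000), ("scan.tiff", 30 * 1024 * 1024)]))
```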

Special pricing information

The following content is an example only, and prices are subject to change in the future.

Vision-enabled models accrue charges like other Azure OpenAI chat models. You pay a per-token rate for prompts and completions, as detailed on the Pricing page. The base charges and additional features are outlined here. Base pricing for GPT-4 Turbo with Vision is:

Input: $0.01 per 1,000 tokens
Output: $0.03 per 1,000 tokens

See the Tokens section of the overview for information on how text and images translate to tokens.

Example image price calculation

For a typical use case, take an image with both visible objects and text and a 100-token prompt input. When the service processes the prompt, it generates 100 tokens of output. In the image, both text and objects can be detected. The price of this transaction would be:

Item	Detail	Cost
Text prompt input	100 text tokens	$0.001
Example image input (see Image tokens)	170 + 85 image tokens	$0.00255
Enhanced add-on features for OCR	$1.50 / 1,000 transactions	$0.0015
Enhanced add-on features for Object Grounding	$1.50 / 1,000 transactions	$0.0015
Output Tokens	100 tokens (assumed)	$0.003
Total		$0.00955
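The arithmetic behind the table can be reproduced with a short script. It uses only the example rates and token counts quoted above, which are subject to change; the variable names are illustrative, and Decimal is used to avoid floating-point rounding noise.

```python
from decimal import Decimal

# Example rates from this section (per token or per transaction).
INPUT_RATE = Decimal("0.01") / 1000    # $0.01 per 1,000 input tokens
OUTPUT_RATE = Decimal("0.03") / 1000   # $0.03 per 1,000 output tokens
ADDON_RATE = Decimal("1.50") / 1000    # $1.50 per 1,000 transactions

text_cost = 100 * INPUT_RATE           # 100 text prompt tokens
image_cost = (170 + 85) * INPUT_RATE   # image tokens from the table
ocr_cost = 1 * ADDON_RATE              # one OCR transaction
grounding_cost = 1 * ADDON_RATE        # one object grounding transaction
output_cost = 100 * OUTPUT_RATE        # 100 output tokens (assumed)

total = text_cost + image_cost + ocr_cost + grounding_cost + output_cost
print(f"Total: ${total}")
```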

Troubleshooting

Issue	Resolution
Output truncated	Increase `max_tokens` or `max_completion_tokens` value
Image not processed	Verify URL is publicly accessible or base64 encoding is correct
Rate limit exceeded	Implement retry logic with exponential backoff
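The retry advice in the last row could be implemented roughly as follows. This is a generic sketch, not the service's or the client library's own retry mechanism: the helper name call_with_backoff, its parameters, and the jitter strategy are all assumptions. With the OpenAI Python client, is_retryable could check for openai.RateLimitError.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      is_retryable=lambda exc: True, sleep=time.sleep):
    """Call func(), retrying failures with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on errors that should not be retried.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            # Delay doubles each attempt; random jitter spreads out retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Example usage with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


print(call_with_backoff(flaky_call, base_delay=0.01))
```

In real use, func would wrap the chat completion call, for example a lambda that invokes client.chat.completions.create with your request arguments.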
