Vision fine-tuning

Learn how to fine-tune Azure OpenAI models with image data to customize visual understanding for your use case. Vision fine-tuning lets you include image inputs in your training examples, following the same chat completions format used for text fine-tuning. Images can be provided either as publicly accessible URLs or data URIs containing base64 encoded images.
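
As a minimal sketch of the data URI option, the following Python helper base64-encodes raw image bytes and prepends the data URI header. The function name to_data_uri and the default MIME type are illustrative assumptions, not part of the service API.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a data URI that embeds the image as base64 text.

    The caller supplies the MIME type (for example image/jpeg, image/png,
    or image/webp); this helper does not inspect the bytes.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used wherever an image URL is expected in a training example, which avoids hosting the image at a public URL.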

Prerequisites

Model support

Vision fine-tuning is supported for the following models only:
Model | Version | Region availability
gpt-4o | 2024-08-06 | Supported regions.
gpt-4.1 | 2025-04-14 | Supported regions.

Image dataset requirements

Constraint | Limit
Max examples with images per training file | 50,000
Max images per example | 64
Max image file size | 10 MB

Format

Images must be:
  • JPEG
  • PNG
  • WEBP
Images must be in the RGB or RGBA image mode. Messages with the assistant role can't contain images. Your training file must contain at least 10 examples.
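
The limits above can be checked locally before upload. The following is a rough sketch, assuming the training file is JSON Lines with one example per line (as in text fine-tuning). The names check_examples and count_images are hypothetical helpers, and the checks cover only the example count, the images-per-example limit, and the rule against images in assistant messages.

```python
import json

MIN_EXAMPLES = 10
MAX_EXAMPLES_WITH_IMAGES = 50_000
MAX_IMAGES_PER_EXAMPLE = 64


def count_images(message: dict) -> int:
    """Count image_url parts in one chat message."""
    content = message.get("content")
    if not isinstance(content, list):
        return 0  # plain string content carries no images
    return sum(
        1 for part in content
        if isinstance(part, dict) and part.get("type") == "image_url"
    )


def check_examples(lines: list) -> list:
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    examples = [json.loads(line) for line in lines if line.strip()]
    if len(examples) < MIN_EXAMPLES:
        problems.append(
            f"file has {len(examples)} examples; at least {MIN_EXAMPLES} are required"
        )
    with_images = 0
    for number, example in enumerate(examples, start=1):
        messages = example.get("messages", [])
        total = sum(count_images(m) for m in messages)
        if total > 0:
            with_images += 1
        if total > MAX_IMAGES_PER_EXAMPLE:
            problems.append(
                f"example {number}: {total} images exceeds {MAX_IMAGES_PER_EXAMPLE}"
            )
        if any(m.get("role") == "assistant" and count_images(m) > 0 for m in messages):
            problems.append(f"example {number}: assistant message contains an image")
    if with_images > MAX_EXAMPLES_WITH_IMAGES:
        problems.append(
            f"{with_images} examples contain images; limit is {MAX_EXAMPLES_WITH_IMAGES}"
        )
    return problems
```

An empty result means none of these particular checks failed; it does not guarantee that the service accepts the file.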

Image detail control

Use the detail parameter in the image_url object to control the fidelity of image processing. The detail setting affects the cost of your training job: low costs less but might lose fine visual details.
  • low — Downscales images to 512×512 pixels. Uses fewer tokens and reduces training cost.
  • high — Processes images at full resolution. Provides more visual detail but increases token usage.
  • auto — Lets the model decide based on image size (default).
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/image.png",
    "detail": "low"
  }
}
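
When examples are generated programmatically, a small helper can build this object and reject unexpected detail values early. This is a sketch only; image_part is a hypothetical name, and the allowed values are the three listed above.

```python
ALLOWED_DETAIL = {"low", "high", "auto"}


def image_part(url: str, detail: str = "auto") -> dict:
    """Build an image_url content part with an explicit detail setting."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}, got {detail!r}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
```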

Best practices

  • Diverse examples: Include variety in image content, angles, and lighting conditions.
  • Consistent annotations: Ensure assistant responses are consistent in style and detail level across examples.

Example file format

{
  "messages": [
    { "role": "system", "content": "You are a helpful AI assistant." },
    { "role": "user", "content": "Describe the image?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://raw.githubusercontent.com/MicrosoftDocs/azure-ai-docs/main/articles/ai-services/openai/media/how-to/generated-seattle.png"
          }
        }
      ]
    },
    { "role": "assistant", "content": "The image appears to be a watercolor painting of a city skyline, featuring tall buildings and a recognizable structure often associated with Seattle, like the Space Needle. The artwork uses soft colors and brushstrokes to create a somewhat abstract and artistic representation of the cityscape." }
  ]
}
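
The following sketch shows one way to generate examples in this shape from Python. It assumes the training file is JSON Lines, with each example serialized on a single line as in text fine-tuning. The names make_example and to_jsonl are hypothetical, and the message layout mirrors the sample above (a text user message followed by a user message that carries the image).

```python
import json


def make_example(system_prompt: str, question: str, image_url: str, answer: str) -> dict:
    """Build one training example in the chat completions format."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {
                "role": "user",
                "content": [{"type": "image_url", "image_url": {"url": image_url}}],
            },
            # Assistant content is text only; images are not allowed here.
            {"role": "assistant", "content": answer},
        ]
    }


def to_jsonl(examples: list) -> str:
    """Serialize examples as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(example, ensure_ascii=False) + "\n" for example in examples)
```

Writing the result of to_jsonl to a file with UTF-8 encoding produces a file that can then be uploaded as training data.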

Create a vision fine-tuning job

After preparing your dataset with image examples, follow the standard fine-tuning workflow to submit your job:
  1. Upload your training file using the Files API or the Microsoft Foundry portal. Image validation might take longer than text-only uploads due to content moderation screening.
  2. Create a fine-tuning job specifying your uploaded file and a supported vision model.
  3. Monitor the job until completion.
For detailed steps, see Fine-tune an Azure OpenAI model.
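
The three steps above can be sketched with the openai Python package as follows. Treat this as an illustration under assumptions rather than the documented procedure: the environment variable names, the API version string, and the helper names are placeholders, and the sketch doesn't wait for file validation to finish before it creates the job, which might be necessary for image datasets.

```python
import os
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def make_client():
    """Create an Azure OpenAI client (assumed env var names and API version)."""
    from openai import AzureOpenAI  # requires the openai package, v1 or later

    return AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-10-21",  # assumption: use a version your resource supports
    )


def submit_vision_job(client, training_path: str, model: str) -> str:
    """Upload the training file, create a fine-tuning job, and return the job ID."""
    with open(training_path, "rb") as handle:
        uploaded = client.files.create(file=handle, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=uploaded.id, model=model)
    return job.id


def wait_for_job(client, job_id: str, poll_seconds: float = 60.0) -> str:
    """Poll the job until it reaches a terminal status, then return that status."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job.status
        time.sleep(poll_seconds)
```

A typical call would be submit_vision_job(make_client(), "train.jsonl", "gpt-4o-2024-08-06"), where both the file name and the model identifier string are assumptions to adapt to your deployment.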

Content moderation policy

We scan your images before training to ensure that they comply with our usage policy. For details, see the Transparency Note. This scan might add latency to file validation before fine-tuning begins. Images containing the following are excluded from your dataset and not used for training:
  • People
  • Faces
  • CAPTCHAs
Face screening process for vision fine-tuning:
  • Images are screened for faces and people, and images that contain them are skipped during training.
  • The screening uses face detection only, not face identification.
  • No facial templates are created, and no specific facial geometry is measured.
  • The technology can’t uniquely identify individuals.
For more information about data privacy, see Data and privacy for Face - Foundry Tools.

Troubleshooting

Images skipped during training

Images can be excluded from training for several reasons:
Reason | Resolution
Image URL not accessible | Ensure URLs are publicly accessible or use base64 data URIs
Image exceeds 10 MB | Resize or compress the image
Unsupported format | Convert to JPEG, PNG, or WEBP
Not RGB/RGBA mode | Convert image color mode
Content policy violation | Images with people, faces, children, or CAPTCHAs are automatically excluded
Too many images in example | Reduce to 64 or fewer images per example
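
Two of these causes, file size and unsupported format, can be caught locally before upload. The sketch below uses only the Python standard library and identifies the format from the file's leading bytes. It assumes 10 MB means 10 × 1024 × 1024 bytes, the names sniff_format and preflight are hypothetical, and it doesn't check the color mode, which would need an imaging library.

```python
from typing import Optional

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: 10 MB counted in binary units


def sniff_format(data: bytes) -> Optional[str]:
    """Identify JPEG, PNG, or WEBP from the leading signature bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "WEBP"
    return None


def preflight(data: bytes) -> list:
    """Return the reasons this image would likely be skipped (size and format only)."""
    problems = []
    if len(data) > MAX_IMAGE_BYTES:
        problems.append("image exceeds 10 MB")
    if sniff_format(data) is None:
        problems.append("unsupported format")
    return problems
```

Running preflight on each local image before building the training file gives an early signal, but the service's own validation remains authoritative.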

Next steps