Vision fine-tuning
Learn how to fine-tune Azure OpenAI models with image data to customize visual understanding for your use case. Vision fine-tuning lets you include image inputs in your training examples, following the same chat completions format used for text fine-tuning. Images can be provided either as publicly accessible URLs or as data URIs containing base64-encoded images.
Prerequisites
- An Azure subscription. Create one for free.
- A Microsoft Foundry resource. See Create an Azure AI Foundry resource.
- Familiarity with the fine-tuning workflow. Vision fine-tuning follows the same process with image-specific data formatting.
- Fine-tuning access for the supported models in a supported region.
Model support
Vision fine-tuning is supported for the following models only:
| Model | Version | Region availability |
|---|---|---|
| gpt-4o | 2024-08-06 | Supported regions. |
| gpt-4.1 | 2025-04-14 | Supported regions. |
Image dataset requirements
| Constraint | Limit |
|---|---|
| Max examples with images per training file | 50,000 |
| Max images per example | 64 |
| Max image file size | 10 MB |
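Before uploading, you can catch violations of the per-example image limit with a quick pre-flight check. The following is a minimal sketch, not part of any SDK; the file name `vision_training.jsonl` is a placeholder:

```python
import json
from pathlib import Path

MAX_IMAGES_PER_EXAMPLE = 64  # limit from the table above

def count_images(example: dict) -> int:
    """Count image_url content parts across every message in one example."""
    return sum(
        1
        for message in example.get("messages", [])
        if isinstance(message.get("content"), list)
        for part in message["content"]
        if isinstance(part, dict) and part.get("type") == "image_url"
    )

# "vision_training.jsonl" is a placeholder file name.
with Path("vision_training.jsonl").open() as f:
    for line_number, line in enumerate(f, start=1):
        n = count_images(json.loads(line))
        if n > MAX_IMAGES_PER_EXAMPLE:
            print(f"Line {line_number}: {n} images exceeds the {MAX_IMAGES_PER_EXAMPLE}-image limit")
```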
Format
Images must be:
- JPEG
- PNG
- WEBP
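If you prefer embedding images directly instead of hosting them at public URLs, a small helper can build the base64 data URI mentioned in the introduction. This helper is illustrative, not part of the SDK:

```python
import base64
from pathlib import Path

def to_data_uri(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local image file as a base64 data URI for an image_url field."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Example usage with a placeholder path:
# uri = to_data_uri("item-001.jpg")
```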
Image detail control
You can control the fidelity of image processing using the `detail` parameter in the `image_url` object. Note that the `detail` setting affects the cost of your training job; `low` costs less but might lose fine visual detail.
- `low` — Downscales images to 512×512 pixels. Uses fewer tokens and reduces training cost.
- `high` — Processes images at full resolution. Provides more visual detail but increases token usage.
- `auto` — Lets the model decide based on image size (default).
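For example, a user message part that requests low-detail processing might look like this (the URL is a placeholder):

```json
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/sample-product.jpg",
    "detail": "low"
  }
}
```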
Best practices
- Diverse examples: Include variety in image content, angles, and lighting conditions.
- Consistent annotations: Ensure assistant responses are consistent in style and detail level across examples.
Example file format
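Each training example is one JSON object per line in a `.jsonl` file, using the chat completions message format with image parts in the user content. The following sketch shows the expected shape; the system prompt, question, URL, and assistant answer are illustrative only:

```json
{"messages": [{"role": "system", "content": "You are an assistant that identifies defects in product photos."}, {"role": "user", "content": [{"type": "text", "text": "Is this item damaged?"}, {"type": "image_url", "image_url": {"url": "https://example.com/item-001.jpg", "detail": "high"}}]}, {"role": "assistant", "content": "Yes. There is a visible crack along the left edge of the casing."}]}
```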
Create a vision fine-tuning job
After preparing your dataset with image examples, follow the standard fine-tuning workflow to submit your job (a Python sketch follows this list):
- Upload your training file using the Files API or the Microsoft Foundry portal. Image validation might take longer than text-only uploads due to content moderation screening.
- Create a fine-tuning job specifying your uploaded file and a supported vision model.
- Monitor the job until completion.
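Here's a minimal sketch of that workflow using the OpenAI Python SDK against an Azure endpoint. The endpoint, API key, API version, and file name are placeholders you'd replace with your own values:

```python
import time
from openai import AzureOpenAI

# Placeholder endpoint, key, and API version -- substitute your own values.
client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2024-10-21",
)

# 1. Upload the JSONL training file (image validation can take a while).
training_file = client.files.create(
    file=open("vision_training.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Create the fine-tuning job against a supported vision model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)

# 3. Poll until the job reaches a terminal state.
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.status, job.fine_tuned_model)
```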
Content moderation policy
We scan your images before training to ensure that they comply with our usage policy. For details, see the Transparency Note. This screening might introduce latency in file validation before fine-tuning begins. Images containing the following are excluded from your dataset and not used for training:
- People
- Faces
- CAPTCHAs
Face screening process for vision fine-tuning:
- Images are screened for faces and people and, if detected, are skipped during training.
- The screening uses face detection only, not face identification.
- No facial templates are created, and no specific facial geometry is measured.
- The technology can’t uniquely identify individuals.
Troubleshooting
Images skipped during training
Images can be excluded from training for several reasons:
| Reason | Resolution |
|---|---|
| Image URL not accessible | Ensure URLs are publicly accessible or use base64 data URIs |
| Image exceeds 10 MB | Resize or compress the image |
| Unsupported format | Convert to JPEG, PNG, or WEBP |
| Not RGB/RGBA mode | Convert image color mode |
| Content policy violation | Images with people, faces, children, or CAPTCHAs are automatically excluded |
| Too many images in example | Reduce to 64 or fewer images per example |
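For the unsupported-format and color-mode rows, a quick fix with Pillow might look like the following sketch; the input and output paths are placeholders:

```python
from PIL import Image

# "skipped_image.tiff" and the output names are placeholder paths.
img = Image.open("skipped_image.tiff")
if img.mode not in ("RGB", "RGBA"):
    img = img.convert("RGB")  # e.g. CMYK or palette mode -> RGB
# JPEG has no alpha channel, so RGBA images are better saved as PNG.
if img.mode == "RGBA":
    img.save("fixed_image.png", format="PNG")
else:
    img.save("fixed_image.jpg", format="JPEG", quality=90)
```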
Next steps
- Fine-tune a model — Complete fine-tuning workflow including upload, training, and monitoring.
- Deploy a fine-tuned model — Deploy your customized model for inference.
- Fine-tuning model regional availability — Check which regions support vision fine-tuning.
- Model quotas and limits — Review rate limits and quotas for fine-tuned models.