Deploy Models to Azure AI Foundry
This guide shows you how to deploy AI models in Azure AI Foundry for production use. Whether you need serverless APIs or dedicated compute, it will help you choose the right deployment option and configure it correctly.
Choose your deployment type
Before deploying, determine which option fits your needs:
Use serverless API deployment when:
- You have variable or unpredictable traffic
- You want to minimize setup and management
- You need quick scaling without infrastructure planning
- Cost efficiency for sporadic usage is important
Use managed compute deployment when:
- You have consistent, high-volume traffic
- You need guaranteed capacity and performance
- You require custom model configurations
- You have specific compliance or isolation requirements
Deploy via serverless API
Prerequisites
- An Azure AI Foundry project
- Contributor access to the project
- Available quota for the target model
Step 1: Select and configure the model
- Navigate to your Azure AI Foundry project
- Go to Models → Model catalog
- Find your desired model (e.g., Llama-3.1-8B-Instruct)
- Click Deploy → Serverless API
Step 2: Configure deployment settings
- Deployment name: Use descriptive names like {model}-{purpose}-{env}
- Content filtering: Choose an appropriate safety level for your use case
- Rate limits: Set based on your expected traffic patterns
Step 3: Deploy and verify
- Click Deploy - this typically takes 2-3 minutes
- Monitor the deployment status in the Deployments tab
- Once status shows “Succeeded”, test the endpoint:
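For example, here is a minimal smoke test using the azure-ai-inference Python package; the endpoint URL and key are placeholders, so copy the real values from the deployment's details page:

```python
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholders: use the target URI and key shown on the deployment's details page.
client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.<region>.models.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

# One small request is enough to confirm the deployment is serving traffic.
response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Reply with a one-sentence greeting."),
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```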
Deploy via managed compute
Prerequisites
- An Azure AI Foundry hub with compute resources
- Sufficient compute quota for your chosen SKU
- Understanding of your performance requirements
Step 1: Provision compute resources
- Go to Management → Compute
- Click + New compute
- Configure based on your needs: pick a VM size (SKU) that fits the model, set minimum and maximum instance counts, and set an idle shutdown time
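The portal flow above needs no code, but for repeatable setups the same compute can be provisioned with the azure-ai-ml SDK. This is a sketch: the subscription, resource group, project name, cluster name, and SKU are placeholders, and idle_time_before_scale_down is in seconds.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

# Placeholders: point the client at your own subscription, resource group, and project.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<project-name>",
)

# A small GPU cluster that scales to zero when idle.
cluster = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC24ads_A100_v4",
    min_instances=0,
    max_instances=2,
    idle_time_before_scale_down=1800,  # seconds of inactivity before scale-down
)
ml_client.compute.begin_create_or_update(cluster).result()
```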
Step 2: Deploy model to managed compute
- In Models → Model catalog, select your model
- Click Deploy → Real-time endpoint
- Configure the deployment:
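As a rough scripted equivalent of this flow, the sketch below uses the azure-ai-ml SDK to create an endpoint and a deployment. The endpoint name, deployment name, model asset path, and instance SKU are placeholders to replace with values from the model catalog, and ml_client is the client created in the previous sketch.

```python
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# Create the endpoint that fronts one or more deployments (key-based auth here).
endpoint = ManagedOnlineEndpoint(name="llama-chat-prod", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy the catalog model onto managed compute behind that endpoint.
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="llama-chat-prod",
    model="azureml://registries/<registry>/models/Llama-3.1-8B-Instruct/versions/1",  # placeholder asset path
    instance_type="Standard_NC24ads_A100_v4",  # match the SKU to the model's requirements
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```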
Step 3: Configure traffic allocation
If you have multiple deployments behind one endpoint, assign each deployment a percentage of incoming traffic.
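For instance, reusing the ml_client and the "blue"/"green" deployment names assumed in the sketches above, traffic can be split like this:

```python
# Send 90% of requests to "blue" and 10% to a newer "green" deployment.
endpoint = ml_client.online_endpoints.get("llama-chat-prod")
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```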
Configure authentication and security
API key authentication (simpler setup)
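With key-based authentication, the client presents one of the endpoint's keys on every request; a sketch with the azure-ai-inference package, where the URL and key are placeholders:

```python
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Placeholders: copy the endpoint URL and a key from the deployment's details page.
client = ChatCompletionsClient(
    endpoint="https://<your-endpoint>.<region>.inference.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)
```

Keys are easy to start with, but anyone holding the key can call the endpoint, so store them in a secret store and rotate them regularly.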
Microsoft Entra ID authentication (recommended for production)
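With Microsoft Entra ID, callers present a short-lived token tied to an identity instead of a static key. The sketch below assumes DefaultAzureCredential and a token scope that you should verify for your endpoint type; the caller also needs an appropriate data-plane role assignment on the endpoint.

```python
from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

# DefaultAzureCredential resolves the signed-in identity (Azure CLI, managed identity, etc.).
client = ChatCompletionsClient(
    endpoint="https://<your-endpoint>.<region>.inference.ai.azure.com",
    credential=DefaultAzureCredential(),
    credential_scopes=["https://cognitiveservices.azure.com/.default"],  # assumed scope; confirm for your endpoint
)
```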
Monitor and scale your deployment
Set up monitoring
- Go to Monitoring → Metrics
- Configure alerts for key metrics:
- Request latency > 5 seconds
- Error rate > 5%
- Token usage approaching quota limits
Configure auto-scaling
For managed compute deployments, attach autoscale rules so the instance count grows and shrinks with load.
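Autoscale for a managed online deployment is driven by an Azure Monitor autoscale setting attached to the deployment resource. The sketch below uses the azure-mgmt-monitor package; the resource IDs, region, thresholds, and metric name are assumptions to verify against the metrics your deployment actually emits.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

monitor = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Placeholder resource ID of the online deployment to scale.
deployment_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.MachineLearningServices/workspaces/<project-name>"
    "/onlineEndpoints/llama-chat-prod/deployments/blue"
)

monitor.autoscale_settings.create_or_update(
    resource_group_name="<resource-group>",
    autoscale_setting_name="llama-chat-autoscale",
    parameters={
        "location": "<region>",
        "enabled": True,
        "target_resource_uri": deployment_id,
        "profiles": [
            {
                "name": "default",
                "capacity": {"minimum": "1", "maximum": "4", "default": "1"},
                "rules": [
                    {
                        # Scale out by one instance when average CPU stays above 70% for five minutes.
                        "metric_trigger": {
                            "metric_name": "CpuUtilizationPercentage",  # assumed metric name; check the Metrics view
                            "metric_resource_uri": deployment_id,
                            "time_grain": "PT1M",
                            "statistic": "Average",
                            "time_window": "PT5M",
                            "time_aggregation": "Average",
                            "operator": "GreaterThan",
                            "threshold": 70,
                        },
                        "scale_action": {
                            "direction": "Increase",
                            "type": "ChangeCount",
                            "value": "1",
                            "cooldown": "PT5M",
                        },
                    }
                ],
            }
        ],
    },
)
```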
Optimize for cost and performance
Cost optimization strategies
- Right-size your compute: Start small and scale based on actual usage
- Use auto-scaling: Reduce costs during low-traffic periods
- Monitor quota usage: Track token consumption to avoid overages
- Consider spot instances: For non-critical workloads, use spot pricing
Performance optimization
- Batch requests: Group multiple requests to improve throughput
- Implement caching: Cache responses for repeated queries (see the sketch after this list)
- Use appropriate instance types: Match compute to model requirements
- Optimize prompts: Shorter, well-crafted prompts reduce latency
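As a concrete example of the caching bullet above, here is a minimal in-process cache around a serverless endpoint call; the endpoint URL and key are placeholders, and for production traffic a shared cache such as Redis is usually the better fit.

```python
from functools import lru_cache

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholders: use your own endpoint URL and key.
client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.<region>.models.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

# Identical prompts are answered from memory instead of calling the endpoint again.
@lru_cache(maxsize=1024)
def cached_answer(prompt: str) -> str:
    response = client.complete(messages=[UserMessage(content=prompt)], max_tokens=256)
    return response.choices[0].message.content

print(cached_answer("What is Azure AI Foundry?"))  # first call hits the endpoint
print(cached_answer("What is Azure AI Foundry?"))  # second call is served from the cache
```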
Troubleshooting common issues
Deployment fails with quota errors
- Check the remaining quota for the target model or VM family in your project before retrying
- Request a quota increase through the Azure portal, or deploy to a region that still has capacity
- For managed compute, pick a smaller SKU or a lower instance count that fits within your current quota
High latency issues
- Check instance health: Verify all instances are healthy
- Review traffic patterns: Look for unusual spikes or patterns
- Optimize model settings: Adjust max_tokens and other parameters
- Consider geographic distribution: Deploy closer to your users
Authentication errors
- Confirm you are sending the credential type the endpoint expects; a key stops working as soon as it is regenerated
- For Microsoft Entra ID, verify the caller has the required role assignment on the endpoint and that the token was issued for the correct scope
- Double-check that the endpoint URL matches the deployment you intend to call
Next steps
Once your model is deployed successfully:
- Monitor Deployments - Set up comprehensive monitoring
- Scale Workloads - Handle increasing traffic efficiently
- Secure Your Endpoints - Implement production security measures

