Deploy Models to Azure AI Foundry

This guide shows you how to deploy AI models in Azure AI Foundry for production use. Whether you need serverless APIs or dedicated compute, this guide will help you choose the right deployment option and configure it correctly.

Choose your deployment type

Before deploying, determine which option fits your needs.

Use serverless API deployment when:
  • You have variable or unpredictable traffic
  • You want to minimize setup and management
  • You need quick scaling without infrastructure planning
  • You need cost efficiency for sporadic or low-volume usage

Use managed compute deployment when:
  • You have consistent, high-volume traffic
  • You need guaranteed capacity and performance
  • You require custom model configurations
  • You have specific compliance or isolation requirements

Deploy via serverless API

Prerequisites

  • An Azure AI Foundry project
  • Contributor access to the project
  • Available quota for the target model

Step 1: Select and configure the model

  1. Navigate to your Azure AI Foundry project
  2. Go to Models > Model catalog
  3. Find your desired model (e.g., Llama-3.1-8B-Instruct)
  4. Click Deploy > Serverless API

Step 2: Configure deployment settings

{
  "deployment_name": "llama-chat-prod",
  "model_name": "Llama-3.1-8B-Instruct",
  "sku": "Standard",
  "content_filter": "default_content_filter",
  "rate_limit": {
    "requests_per_minute": 1000,
    "tokens_per_minute": 100000
  }
}
Key configuration options:
  • Deployment name: Use descriptive names like {model}-{purpose}-{env}
  • Content filtering: Choose appropriate safety level for your use case
  • Rate limits: Set based on your expected traffic patterns
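
If you prefer to script the deployment instead of using the portal, recent versions of the azure-ai-ml SDK expose serverless endpoints. The sketch below is illustrative: the endpoint name mirrors the configuration above, and the registry path for the model varies by publisher, so check the model card for the exact ID.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ServerlessEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Create a serverless endpoint for a model from the catalog registry
endpoint = ServerlessEndpoint(
    name="llama-chat-prod",
    model_id="azureml://registries/azureml-meta/models/Llama-3.1-8B-Instruct",  # illustrative path
)
ml_client.serverless_endpoints.begin_create_or_update(endpoint).result()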

Step 3: Deploy and verify

  1. Click Deploy. Deployment typically takes 2-3 minutes
  2. Monitor the deployment status in the Deployments tab
  3. Once status shows “Succeeded”, test the endpoint:
curl -X POST "https://your-endpoint.inference.ml.azure.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, world!"}],
    "max_tokens": 100
  }'
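
The same check from Python, if you want to keep verification in a script. The endpoint URL is a placeholder as in the curl example, the key is read from an AZURE_AI_API_KEY environment variable (an assumed name, matching the authentication section later in this guide), and the response is parsed assuming the OpenAI-style chat completions shape shown above.

import os
import requests

# Call the chat completions route exposed by the serverless deployment
response = requests.post(
    "https://your-endpoint.inference.ml.azure.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['AZURE_AI_API_KEY']}"},
    json={
        "messages": [{"role": "user", "content": "Hello, world!"}],
        "max_tokens": 100,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])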

Deploy via managed compute

Prerequisites

  • An Azure AI Foundry hub with compute resources
  • Sufficient compute quota for your chosen SKU
  • Understanding of your performance requirements

Step 1: Provision compute resources

  1. Go to Management > Compute
  2. Click + New compute
  3. Configure based on your needs:
compute_configuration:
  name: "model-inference-cluster"
  type: "AmlCompute"
  size: "Standard_NC24ads_A100_v4"  # GPU optimized for AI inference
  min_nodes: 1
  max_nodes: 10
  idle_seconds_before_scaledown: 300
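
The same cluster can be provisioned from the azure-ai-ml SDK. This is a sketch mirroring the YAML above; the cluster name and sizing are illustrative.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# GPU cluster that scales between 1 and 10 nodes and idles down after 5 minutes
compute = AmlCompute(
    name="model-inference-cluster",
    size="Standard_NC24ads_A100_v4",
    min_instances=1,
    max_instances=10,
    idle_time_before_scale_down=300,
)
ml_client.compute.begin_create_or_update(compute).result()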

Step 2: Deploy model to managed compute

  1. In Models > Model catalog, select your model
  2. Click Deploy > Real-time endpoint
  3. Configure the deployment:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment, Model
from azure.identity import DefaultAzureCredential

# Initialize the client from the project's config.json (a credential is required)
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Configure deployment
deployment = ManagedOnlineDeployment(
    name="llama-managed-prod",
    endpoint_name="llama-endpoint",
    model=Model(path="azureml://registries/azureml/models/Llama-3.1-8B-Instruct"),
    instance_type="Standard_NC24ads_A100_v4",
    instance_count=2,
    environment_variables={
        "WORKER_COUNT": "2",
        "MAX_CONCURRENT_REQUESTS": "100"
    }
)

# Deploy
ml_client.online_deployments.begin_create_or_update(deployment)
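
The deployment above targets an endpoint named llama-endpoint; if that endpoint does not exist yet, create it first. A minimal sketch:

from azure.ai.ml.entities import ManagedOnlineEndpoint

# Create the endpoint that will host one or more deployments
endpoint = ManagedOnlineEndpoint(
    name="llama-endpoint",
    auth_mode="key",  # or "aml_token" for Microsoft Entra ID tokens
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()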

Step 3: Configure traffic allocation

If you have multiple deployments behind one endpoint:
# Fetch the endpoint and set traffic allocation across deployments
endpoint = ml_client.online_endpoints.get("llama-endpoint")
endpoint.traffic = {
    "llama-managed-prod": 80,
    "llama-managed-canary": 20
}
ml_client.online_endpoints.begin_create_or_update(endpoint)
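
After the update completes, you can read the endpoint back to confirm the split took effect:

# Print the live traffic allocation for the endpoint
endpoint = ml_client.online_endpoints.get("llama-endpoint")
print(endpoint.traffic)  # expected: {'llama-managed-prod': 80, 'llama-managed-canary': 20}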

Configure authentication and security

API key authentication (simpler setup)

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_AI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint="https://your-endpoint.inference.ml.azure.com"
)
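
With the client configured, a call looks like the following; the model value is the deployment name chosen earlier (llama-chat-prod here is illustrative).

# Minimal chat completion against the deployment
response = client.chat.completions.create(
    model="llama-chat-prod",
    messages=[{"role": "user", "content": "Hello, world!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)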

Microsoft Entra ID authentication (recommended for production)

from azure.identity import DefaultAzureCredential
from openai import AzureOpenAI

# Acquire a Microsoft Entra ID token for the Azure ML resource
# (access tokens expire, typically within about an hour, so refresh them in long-running services)
credential = DefaultAzureCredential()
token = credential.get_token("https://ml.azure.com/.default")

client = AzureOpenAI(
    azure_ad_token=token.token,  # sent as a Bearer token rather than an api-key header
    api_version="2024-02-01",
    azure_endpoint="https://your-endpoint.inference.ml.azure.com"
)

Monitor and scale your deployment

Set up monitoring

  1. Go to Monitoring > Metrics
  2. Configure alerts for key metrics:
    • Request latency > 5 seconds
    • Error rate > 5%
    • Token usage approaching quota limits
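
Portal alerts cover the service side; a small client-side probe can check the same latency and error thresholds from your own monitoring job. The endpoint URL and the AZURE_AI_API_KEY environment variable are placeholders, as elsewhere in this guide.

import os
import time
import requests

# Time a minimal request and flag slow or failed calls against the alert thresholds
start = time.monotonic()
response = requests.post(
    "https://your-endpoint.inference.ml.azure.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['AZURE_AI_API_KEY']}"},
    json={"messages": [{"role": "user", "content": "ping"}], "max_tokens": 5},
    timeout=30,
)
latency = time.monotonic() - start

if not response.ok:
    print(f"Request failed with status {response.status_code}")
elif latency > 5:
    print(f"Latency {latency:.1f}s exceeds the 5-second alert threshold")
else:
    print(f"OK in {latency:.2f}s")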

Configure auto-scaling

For managed compute deployments, tune how each instance handles requests; the scale-out rules themselves are defined through Azure Monitor autoscale on the endpoint:
from azure.ai.ml.entities import OnlineRequestSettings

# Configure per-instance request handling (timeout, concurrency, queueing)
deployment.request_settings = OnlineRequestSettings(
    request_timeout_ms=90000,
    max_concurrent_requests_per_instance=10,
    max_queue_wait_ms=5000
)

# Update deployment
ml_client.online_deployments.begin_create_or_update(deployment)

Optimize for cost and performance

Cost optimization strategies

  1. Right-size your compute: Start small and scale based on actual usage
  2. Use auto-scaling: Reduce costs during low-traffic periods
  3. Monitor quota usage: Track token consumption to avoid overages
  4. Consider spot instances: For non-critical workloads, use spot pricing

Performance optimization

  1. Batch requests: Group multiple requests to improve throughput
  2. Implement caching: Cache responses for repeated queries (see the sketch after this list)
  3. Use appropriate instance types: Match compute to model requirements
  4. Optimize prompts: Shorter, well-crafted prompts reduce latency
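
For the caching strategy above, even an in-process cache keyed on the prompt avoids repeated calls for identical queries. A minimal sketch, assuming the client variable configured in the authentication section and an illustrative deployment name:

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_completion(prompt: str, max_tokens: int = 100) -> str:
    """Return the model's reply, caching results for identical prompts."""
    response = client.chat.completions.create(
        model="llama-chat-prod",  # illustrative deployment name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

# The second identical call is served from the cache instead of the endpoint
print(cached_completion("Which regions support serverless deployments?"))
print(cached_completion("Which regions support serverless deployments?"))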

Troubleshooting common issues

Deployment fails with quota errors

# Check current quota usage
az cognitiveservices account list-usage \
  --name "your-resource-name" \
  --resource-group "your-resource-group"

# Request a quota increase if needed. The Quota view in the Azure AI Foundry
# management center (or the Quotas + Support experience in the Azure portal) is
# the most direct path; "az support tickets create" also works, but it requires
# the "support" CLI extension plus title, description, severity, problem
# classification, and contact details.

High latency issues

  1. Check instance health: Verify all instances are healthy
  2. Review traffic patterns: Look for unusual spikes or patterns
  3. Optimize model settings: Adjust max_tokens and other parameters
  4. Consider geographic distribution: Deploy closer to your users

Authentication errors

# Test endpoint connectivity
import requests

response = requests.get(
    "https://your-endpoint.inference.ml.azure.com/v1/models",
    headers={"Authorization": f"Bearer {api_key}"}
)

if response.status_code == 401:
    print("Authentication failed - check your API key")
elif response.status_code == 403:
    print("Access denied - check your permissions")

Next steps

Once your model is deployed successfully: