Deploy Models to Azure AI Foundry

This guide shows you how to deploy AI models in Azure AI Foundry for production use. Whether you need serverless APIs or dedicated compute, this guide will help you choose the right deployment option and configure it correctly.

Choose your deployment type

Before deploying, determine which option fits your needs.

Use serverless API deployment when:
  • You have variable or unpredictable traffic
  • You want to minimize setup and management
  • You need quick scaling without infrastructure planning
  • You need cost efficiency for sporadic or low-volume usage

Use managed compute deployment when:
  • You have consistent, high-volume traffic
  • You need guaranteed capacity and performance
  • You require custom model configurations
  • You have specific compliance or isolation requirements

Deploy via serverless API

Prerequisites

  • An Azure AI Foundry project
  • Contributor access to the project
  • Available quota for the target model

Step 1: Select and configure the model

  1. Navigate to your Azure AI Foundry project
  2. Go to Models > Model catalog
  3. Find your desired model (e.g., Llama-3.1-8B-Instruct)
  4. Click Deploy > Serverless API

Step 2: Configure deployment settings

{
  "deployment_name": "llama-chat-prod",
  "model_name": "Llama-3.1-8B-Instruct",
  "sku": "Standard",
  "content_filter": "default_content_filter",
  "rate_limit": {
    "requests_per_minute": 1000,
    "tokens_per_minute": 100000
  }
}
Key configuration options:
  • Deployment name: Use descriptive names like {model}-{purpose}-{env}
  • Content filtering: Choose appropriate safety level for your use case
  • Rate limits: Set based on your expected traffic patterns
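
If you prefer to script the deployment instead of using the portal, recent versions of the azure-ai-ml SDK expose serverless endpoints. The sketch below is illustrative: the endpoint name mirrors the configuration above, and the registry path for the model varies by publisher, so check the model card for the exact ID.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ServerlessEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Create a serverless endpoint for a model from the catalog registry
endpoint = ServerlessEndpoint(
    name="llama-chat-prod",
    model_id="azureml://registries/azureml-meta/models/Llama-3.1-8B-Instruct",  # illustrative path
)
ml_client.serverless_endpoints.begin_create_or_update(endpoint).result()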

Step 3: Deploy and verify

  1. Click Deploy. Deployment typically takes 2-3 minutes
  2. Monitor the deployment status in the Deployments tab
  3. Once status shows “Succeeded”, test the endpoint:
curl -X POST "https://your-endpoint.inference.ml.azure.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, world!"}],
    "max_tokens": 100
  }'
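
The same check from Python, if you want to keep verification in a script. The endpoint URL is a placeholder as in the curl example, the key is read from an AZURE_AI_API_KEY environment variable (an assumed name, matching the authentication section later in this guide), and the response is parsed assuming the OpenAI-style chat completions shape shown above.

import os
import requests

# Call the chat completions route exposed by the serverless deployment
response = requests.post(
    "https://your-endpoint.inference.ml.azure.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['AZURE_AI_API_KEY']}"},
    json={
        "messages": [{"role": "user", "content": "Hello, world!"}],
        "max_tokens": 100,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])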

Deploy via managed compute

Prerequisites

  • An Azure AI Foundry hub with compute resources
  • Sufficient compute quota for your chosen SKU
  • Understanding of your performance requirements

Step 1: Provision compute resources

  1. Go to Management > Compute
  2. Click + New compute
  3. Configure based on your needs:
compute_configuration:
  name: "model-inference-cluster"
  type: "AmlCompute"
  size: "Standard_NC24ads_A100_v4"  # GPU optimized for AI inference
  min_nodes: 1
  max_nodes: 10
  idle_seconds_before_scaledown: 300
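
The same cluster can be provisioned from the azure-ai-ml SDK. This is a sketch mirroring the YAML above; the cluster name and sizing are illustrative.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# GPU cluster that scales between 1 and 10 nodes and idles down after 5 minutes
compute = AmlCompute(
    name="model-inference-cluster",
    size="Standard_NC24ads_A100_v4",
    min_instances=1,
    max_instances=10,
    idle_time_before_scale_down=300,
)
ml_client.compute.begin_create_or_update(compute).result()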

Step 2: Deploy model to managed compute

  1. In Models > Model catalog, select your model
  2. Click Deploy > Real-time endpoint
  3. Configure the deployment:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment, Model
from azure.identity import DefaultAzureCredential

# Initialize the client from the project's config.json (a credential is required)
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Configure deployment
deployment = ManagedOnlineDeployment(
    name="llama-managed-prod",
    endpoint_name="llama-endpoint",
    model=Model(path="azureml://registries/azureml/models/Llama-3.1-8B-Instruct"),
    instance_type="Standard_NC24ads_A100_v4",
    instance_count=2,
    environment_variables={
        "WORKER_COUNT": "2",
        "MAX_CONCURRENT_REQUESTS": "100"
    }
)

# Deploy
ml_client.online_deployments.begin_create_or_update(deployment)
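
The deployment above targets an endpoint named llama-endpoint; if that endpoint does not exist yet, create it first. A minimal sketch:

from azure.ai.ml.entities import ManagedOnlineEndpoint

# Create the endpoint that will host one or more deployments
endpoint = ManagedOnlineEndpoint(
    name="llama-endpoint",
    auth_mode="key",  # or "aml_token" for Microsoft Entra ID tokens
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()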

Step 3: Configure traffic allocation

If you have multiple deployments behind one endpoint:
# Fetch the endpoint and set traffic allocation across deployments
endpoint = ml_client.online_endpoints.get("llama-endpoint")
endpoint.traffic = {
    "llama-managed-prod": 80,
    "llama-managed-canary": 20
}
ml_client.online_endpoints.begin_create_or_update(endpoint)
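
After the update completes, you can read the endpoint back to confirm the split took effect:

# Print the live traffic allocation for the endpoint
endpoint = ml_client.online_endpoints.get("llama-endpoint")
print(endpoint.traffic)  # expected: {'llama-managed-prod': 80, 'llama-managed-canary': 20}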

Configure authentication and security

API key authentication (simpler setup)

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_AI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint="https://your-endpoint.inference.ml.azure.com"
)
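
With the client configured, a call looks like the following; the model value is the deployment name chosen earlier (llama-chat-prod here is illustrative).

# Minimal chat completion against the deployment
response = client.chat.completions.create(
    model="llama-chat-prod",
    messages=[{"role": "user", "content": "Hello, world!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)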

Microsoft Entra ID authentication (recommended for production)

from azure.identity import DefaultAzureCredential
from openai import AzureOpenAI

# Acquire a Microsoft Entra ID token for the Azure ML resource
# (access tokens expire, typically within about an hour, so refresh them in long-running services)
credential = DefaultAzureCredential()
token = credential.get_token("https://ml.azure.com/.default")

client = AzureOpenAI(
    azure_ad_token=token.token,  # sent as a Bearer token rather than an api-key header
    api_version="2024-02-01",
    azure_endpoint="https://your-endpoint.inference.ml.azure.com"
)

Monitor and scale your deployment

Set up monitoring

  1. Go to Monitoring > Metrics
  2. Configure alerts for key metrics:
    • Request latency > 5 seconds
    • Error rate > 5%
    • Token usage approaching quota limits
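
Portal alerts cover the service side; a small client-side probe can check the same latency and error thresholds from your own monitoring job. The endpoint URL and the AZURE_AI_API_KEY environment variable are placeholders, as elsewhere in this guide.

import os
import time
import requests

# Time a minimal request and flag slow or failed calls against the alert thresholds
start = time.monotonic()
response = requests.post(
    "https://your-endpoint.inference.ml.azure.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['AZURE_AI_API_KEY']}"},
    json={"messages": [{"role": "user", "content": "ping"}], "max_tokens": 5},
    timeout=30,
)
latency = time.monotonic() - start

if not response.ok:
    print(f"Request failed with status {response.status_code}")
elif latency > 5:
    print(f"Latency {latency:.1f}s exceeds the 5-second alert threshold")
else:
    print(f"OK in {latency:.2f}s")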

Configure auto-scaling

For managed compute deployments, tune how each instance handles requests; the scale-out rules themselves are defined through Azure Monitor autoscale on the endpoint:
from azure.ai.ml.entities import OnlineRequestSettings

# Configure per-instance request handling (timeout, concurrency, queueing)
deployment.request_settings = OnlineRequestSettings(
    request_timeout_ms=90000,
    max_concurrent_requests_per_instance=10,
    max_queue_wait_ms=5000
)

# Update deployment
ml_client.online_deployments.begin_create_or_update(deployment)

Optimize for cost and performance

Cost optimization strategies

  1. Right-size your compute: Start small and scale based on actual usage
  2. Use auto-scaling: Reduce costs during low-traffic periods
  3. Monitor quota usage: Track token consumption to avoid overages
  4. Consider spot instances: For non-critical workloads, use spot pricing

Performance optimization

  1. Batch requests: Group multiple requests to improve throughput
  2. Implement caching: Cache responses for repeated queries (see the sketch after this list)
  3. Use appropriate instance types: Match compute to model requirements
  4. Optimize prompts: Shorter, well-crafted prompts reduce latency
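
For the caching strategy above, even an in-process cache keyed on the prompt avoids repeated calls for identical queries. A minimal sketch, assuming the client variable configured in the authentication section and an illustrative deployment name:

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_completion(prompt: str, max_tokens: int = 100) -> str:
    """Return the model's reply, caching results for identical prompts."""
    response = client.chat.completions.create(
        model="llama-chat-prod",  # illustrative deployment name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

# The second identical call is served from the cache instead of the endpoint
print(cached_completion("Which regions support serverless deployments?"))
print(cached_completion("Which regions support serverless deployments?"))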

Troubleshooting common issues

Deployment fails with quota errors

# Check current quota usage
az cognitiveservices account list-usage \
  --name "your-resource-name" \
  --resource-group "your-resource-group"

# Request a quota increase if needed. The Quota view in the Azure AI Foundry
# management center (or the Quotas + Support experience in the Azure portal) is
# the most direct path; "az support tickets create" also works, but it requires
# the "support" CLI extension plus title, description, severity, problem
# classification, and contact details.

High latency issues

  1. Check instance health: Verify all instances are healthy
  2. Review traffic patterns: Look for unusual spikes or patterns
  3. Optimize model settings: Adjust max_tokens and other parameters
  4. Consider geographic distribution: Deploy closer to your users

Authentication errors

# Test endpoint connectivity
import requests

response = requests.get(
    "https://your-endpoint.inference.ml.azure.com/v1/models",
    headers={"Authorization": f"Bearer {api_key}"}
)

if response.status_code == 401:
    print("Authentication failed - check your API key")
elif response.status_code == 403:
    print("Access denied - check your permissions")

Next steps

Once your model is deployed successfully: