Managed compute in Foundry is currently in public preview and registration is required to use it. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Managed compute deployment (preview) in Microsoft Foundry hosts open-source models on dedicated GPU capacity. Microsoft owns the GPU topology, runtime, container image, and security patching. You choose the model, deployment template, accelerator family, and scaling behavior that fit your workload.
This article walks through the end-to-end workflow for deploying an open-source model on managed compute. You learn how to:
  • Choose a model in the model catalog
  • Select a deployment template
  • Deploy the model using the Foundry portal or Python SDK
  • Perform inferencing using the OpenAI SDK
  • Scale and monitor the deployment
  • Request more quota
For an overview of managed compute deployment in Foundry, including model instances, deployment templates, runtimes, accelerator families, billing, and current limitations, see Managed compute in Microsoft Foundry (Preview).

Prerequisites

  • An active Azure subscription. To create one, see Create your Azure free account.
  • A resource group in the subscription where you have permission to create resources.
  • A Microsoft Foundry account (Cognitive Services account of kind AIServices) and a Foundry project. To create one, see Create a Foundry project.
  • The Azure role assignments listed in the Access control summary section of this article, assigned at the Foundry account scope.
  • Approved managed compute quota for the accelerator family you plan to deploy on (A100, H100, or MI300X) in the target region. Managed compute quota is separate from Azure VM quota. See Request more quota at the end of this article.
  • Local tools for the SDK and CLI examples:
    pip install "azure-mgmt-cognitiveservices==15.0.0b2" azure-identity openai requests
    az login
    
  • Azure CLI 2.60 or later.
Managed compute in Foundry is in public preview. APIs, SKU names, and supported regions might change before general availability. Built-in content filtering isn’t part of the managed compute data path in public preview. If you need request-level or response-level filtering, call the Azure AI Content Safety APIs directly from your application.

Choose a model in the catalog

Managed compute deploys models from the Hugging Face Collection in the Foundry model catalog, served from the azure-huggingface registry.
  1. Sign in to Microsoft Foundry. Make sure the New Foundry toggle is on. These steps refer to Foundry (new).
  2. Select your subscription and Foundry resource.
  3. Select Build in the upper-right navigation, then select Models in the left pane.
  4. Filter the catalog by Collections and choose Hugging Face. To narrow the list further, filter by model family (for example, Qwen), modality, or task, or search by model name.
  5. Select a model card (for example, nvidia-nemotron-3-nano-30b-a3b-fp8) to open its details.
The model card shows the upstream license, the modality, supported tasks, and the deployment templates published for the model. If you plan to deploy via the Python SDK or REST instead of using the portal wizard, you’ll need three values as input to the deployment call. You can find these values in the Foundry portal as follows:
  • Model ID: the fully qualified registry asset ID for the model. Available on the model card in the catalog (copy from the model details pane). Example:
    azureml://registries/azure-huggingface/models/nvidia--nvidia-nemotron-3-nano-30b-a3b-fp8/versions/2
    
  • Deployment template ID: identifies the runtime, accelerator family and count, and context length for the model. Available in the deployment wizard that opens when you select Deploy on the model card. Select a template and copy the Deployment template ID from the wizard. Example:
    azureml://registries/azure-huggingface/deploymenttemplates/nvidia--nvidia-nemotron-3-nano-30b-a3b-fp8--nvidia-h100/labels/latest
    
  • Accelerator type: for example, H100_80GB, A100_80GB, or MI_300_192GB. Shown next to each template in the deployment wizard.
A model ID and a deployment template ID must be compatible; every template lists the model versions it supports. The portal wizard only shows compatible templates for the model you selected. If you deploy using code, verify that both references resolve to valid registry assets in the azure-huggingface registry.
To learn more about deployment templates, see Deployment template in the Managed compute overview article.
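Before you pass the model ID and deployment template ID to the SDK or REST API, a quick local check can catch copy-and-paste mistakes. The following is a minimal sketch, not part of any SDK: the `check_asset_ids` helper is hypothetical, the patterns are inferred only from the two example IDs in this section, and the name-prefix comparison is a heuristic rather than the service's compatibility rule. It doesn't confirm that either asset exists in the registry.

```python
import re

# Shapes inferred from the example IDs in this article (assumption: the
# real registry grammar may allow other forms).
MODEL_ID = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)/models/(?P<name>[^/]+)"
    r"/versions/(?P<version>[^/]+)$"
)
TEMPLATE_ID = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)/deploymenttemplates/"
    r"(?P<name>[^/]+)/(?:labels|versions)/(?P<ref>[^/]+)$"
)


def check_asset_ids(model_id, template_id, registry="azure-huggingface"):
    """Return a list of problems; an empty list means both IDs look well formed."""
    problems = []
    model = MODEL_ID.match(model_id)
    template = TEMPLATE_ID.match(template_id)

    if model is None:
        problems.append("model ID doesn't match the expected shape")
    elif model["registry"] != registry:
        problems.append(f"model ID isn't in the {registry} registry")

    if template is None:
        problems.append("deployment template ID doesn't match the expected shape")
    elif template["registry"] != registry:
        problems.append(f"deployment template ID isn't in the {registry} registry")

    # Heuristic only: in the article's example, the template name begins
    # with the model name. A mismatch is a hint, not proof of incompatibility.
    if model and template and not template["name"].startswith(model["name"]):
        problems.append("template name doesn't start with the model name")

    return problems
```

An empty result means only that the strings are plausibly formed. The template's own list of supported model versions remains the authority on compatibility.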

Deploy the model

Access control summary

Action | Minimum role
Create, update, or delete a managed compute deployment | Cognitive Services Contributor (or Foundry Owner / Foundry Account Owner) on the Foundry account
Read a deployment or list deployments | Cognitive Services User, Foundry User, Foundry Project Manager, or any of the roles above
Call the deployment with Microsoft Entra ID | Foundry User on the Foundry account
Call the deployment with an API key | The account key (no Azure role required for the call itself; key retrieval requires read access)
For the full Azure resource provider operation list, the role-to-permission matrix, and the comparison with standard deployments, see Role-based access control for Microsoft Foundry — managed compute control-plane operations.

Troubleshooting

provisioningState: Failed

Confirm that the requested accelerator family has approved quota in the target region, and that the chosen deployment template lists that accelerator family. A mismatched model and deployment template, for example, a template that was published for a different model version, is a common cause. Verify both references resolve to valid registry assets in the azure-huggingface registry.

“Quota exceeded” on create

The Foundry account doesn’t have enough managed compute quota in the region for the requested accelerator family. Request more quota. Azure VM quota doesn’t apply to managed compute.

“Insufficient capacity” in the region

The region returned no capacity for the requested accelerator family. Try a different family (for example, deploy on MI300X instead of H100), pick a template with fewer accelerators per instance, or target a different region. Larger-memory families such as MI300X often have capacity for models that don’t fit on A100.
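If you deploy from code, the fallback order described above can be expressed as an ordered list of candidates. The following is a minimal sketch under stated assumptions: `InsufficientCapacityError` and `deploy_with_fallback` are hypothetical names, and `try_deploy` stands in for whatever function in your own code creates the deployment and raises when the region reports no capacity.

```python
class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for the capacity error your deployment code surfaces."""


def deploy_with_fallback(candidates, try_deploy):
    """Try each (region, accelerator_type) candidate in order.

    Returns the first candidate that deploys, together with the value
    try_deploy returned for it. Re-raises if every candidate lacks capacity.
    """
    for candidate in candidates:
        try:
            return candidate, try_deploy(candidate)
        except InsufficientCapacityError:
            # No capacity for this combination; move on to the next one.
            continue
    raise InsufficientCapacityError("no candidate had capacity")
```

Each candidate still needs approved managed compute quota and a deployment template published for that accelerator family, so list only combinations you've confirmed in the deployment wizard.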

404 from the /openai/v1/ route

If a chat-completion request to https://<account>.services.ai.azure.com/openai/v1/chat/completions returns 404, verify that:
  • The deployment name in the request body matches the deployment you created.
  • The deployment’s provisioningState is Succeeded.
  • The model’s runtime exposes chat completions. Some runtimes (for example, TEI for embeddings) don’t expose the chat completions route; use the route documented on the model card instead.
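The three checks above can be captured in a small helper that you run against values you already have. This sketch is illustrative only: `diagnose_404` is a hypothetical function, and it assumes the deployment name travels in the `model` field of the request body, as in OpenAI-style chat completion requests.

```python
def diagnose_404(request_body, deployment_name, provisioning_state,
                 exposes_chat_completions):
    """Return the likely causes of a 404, in the order the checklist gives them."""
    causes = []

    # Assumption: the request body names the deployment in its "model" field.
    if request_body.get("model") != deployment_name:
        causes.append("deployment name in the request body doesn't match")

    if provisioning_state != "Succeeded":
        causes.append(f"provisioningState is {provisioning_state}, not Succeeded")

    if not exposes_chat_completions:
        causes.append("runtime doesn't expose the chat completions route")

    return causes
```

An empty list means none of the three documented causes applies, so the next step is to recheck the account name in the endpoint URL.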

Deployment stuck in Creating for longer than 20 minutes

Some larger models take longer than the typical 10–15 minutes to come up. If provisioningState is still Creating after 20 minutes, check the deployment details page in the Foundry portal for an operation status message, and confirm that the underlying region hasn’t degraded. If the deployment stays in Creating past 30 minutes with no operation message, delete it and retry. Provisioning is idempotent on the deployment name.
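When you deploy from code, the 20-minute and 30-minute thresholds above map naturally onto a polling loop. The following is a minimal sketch, assuming you supply `get_state`, a function of your own that returns the deployment's current provisioningState string. The helper name, the polling interval, and the injectable `clock`, `sleep`, and `on_warn` parameters are illustrative choices, not part of any SDK.

```python
import time


def wait_for_deployment(get_state, poll_seconds=30,
                        warn_after=20 * 60, give_up_after=30 * 60,
                        clock=time.monotonic, sleep=time.sleep, on_warn=print):
    """Poll until provisioningState leaves Creating.

    Returns the final state (for example, Succeeded or Failed). Warns once
    after warn_after seconds and raises TimeoutError after give_up_after
    seconds, at which point the guidance is to delete and retry.
    """
    start = clock()
    warned = False
    while True:
        state = get_state()
        if state != "Creating":
            return state

        elapsed = clock() - start
        if elapsed >= give_up_after:
            raise TimeoutError("still Creating after the give-up threshold")
        if elapsed >= warn_after and not warned:
            on_warn("Still Creating; check the deployment details page "
                    "for an operation status message.")
            warned = True

        sleep(poll_seconds)
```

The helper only reports the timeout. Whether to delete and recreate the deployment stays with the caller, because the guidance applies only when the portal shows no operation message.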