Deploy open-source models with managed compute in Microsoft Foundry

Managed compute in Foundry is currently in public preview and registration is required to use it. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Managed compute deployment (preview) in Microsoft Foundry hosts open-source models on dedicated GPU capacity. Microsoft owns the GPU topology, runtime, container image, and security patching. You choose the model, deployment template, accelerator family, and scaling behavior that fit your workload. This article walks through the end-to-end workflow for deploying an open-source model onto managed compute in Microsoft Foundry. In this article, you learn how to:

Choose a model in the model catalog
Select a deployment template
Deploy the model using the Foundry portal or Python SDK
Perform inferencing using the OpenAI SDK
Scale and monitor the deployment
Request more quota

For an overview of managed compute deployment in Foundry, including model instances, deployment templates, runtimes, accelerator families, billing, and current limitations, see Managed compute in Microsoft Foundry (Preview).

Prerequisites

An active Azure subscription. To create one, see Create your Azure free account.
A resource group in the subscription where you have permission to create resources.
A Microsoft Foundry account (Cognitive Services account of kind AIServices) and a Foundry project. To create one, see Create a Foundry project.
The following Azure role assignments on the Foundry account scope:
- Cognitive Services Contributor (or Foundry Owner / Foundry Account Owner) — required to create, update, and delete managed compute deployments. See Role-based access control for Microsoft Foundry — managed compute control-plane operations.
- Foundry User — required to call the deployment with Microsoft Entra ID from the Playground, the SDK, or REST.
Approved managed compute quota for the accelerator family you plan to deploy on (A100, H100, or MI300X) in the target region. Managed compute quota is separate from Azure VM quota. See Request more quota at the end of this article.

Local tools for the SDK and CLI examples:

pip install "azure-mgmt-cognitiveservices==15.0.0b2" azure-identity openai requests
az login

Azure CLI 2.60 or later.

Managed compute in Foundry is in public preview. APIs, SKU names, and supported regions might change before general availability. Built-in content filtering isn’t part of the managed compute data path in public preview. If you need request-level or response-level filtering, call the Azure AI Content Safety APIs directly from your application.

Choose a model in the catalog

Managed compute deploys models from the Hugging Face Collection in the Foundry model catalog, served from the azure-huggingface registry.

Select your subscription and Foundry resource.
Select Build in the upper-right navigation, then select Models in the left pane.
Filter the catalog by Collections. Choose Hugging Face. You can also use any of the other filters to narrow down the model you want to deploy (for example, pick a model family like Qwen) or by modality or task. You can also search by model name.
Select a model card (for example, nvidia-nemotron-3-nano-30b-a3b-fp8) to open its details.

The model card shows the upstream license, the modality, supported tasks, and the deployment templates published for the model. If you plan to deploy via the Python SDK or REST instead of using the portal wizard, you’ll need three values as input to the deployment call. You can find these values in the Foundry portal as follows:

Model ID: the fully qualified registry asset ID for the model. Available on the model card in the catalog (copy from the model details pane). Example:
```
azureml://registries/azure-huggingface/models/nvidia--nvidia-nemotron-3-nano-30b-a3b-fp8/versions/2
```
Deployment template ID: identifies the runtime, accelerator family and count, and context length for the model. Available in the deployment wizard that opens when you select Deploy on the model card. Select a template and copy the Deployment template ID from the wizard. Example:
```
azureml://registries/azure-huggingface/deploymenttemplates/nvidia--nvidia-nemotron-3-nano-30b-a3b-fp8--nvidia-h100/labels/latest
```

A model ID and a deployment template ID must be compatible; every template lists the model versions it supports. The portal wizard only shows compatible templates for the model you selected. If you deploy using code, verify that both references resolve to valid registry assets in the azure-huggingface registry.

To learn more about deployment templates, see Deployment template in the Managed compute overview article.

Accelerator type: for example H100_80GB, A100_80GB, or MI_300_192GB. Shown next to each template in the deployment wizard.

Deploy the model

Access control summary

Action	Minimum role
Create, update, or delete a managed compute deployment	Cognitive Services Contributor (or Foundry Owner / Foundry Account Owner) on the Foundry account
Read a deployment or list deployments	Cognitive Services User, Foundry User, Foundry Project Manager, or any of the roles above
Call the deployment with Microsoft Entra ID	Foundry User on the Foundry account
Call the deployment with an API key	The account key (no Azure role required for the call itself; key retrieval requires read access)

For the full Azure resource provider operation list, the role-to-permission matrix, and the comparison with standard deployments, see Role-based access control for Microsoft Foundry — managed compute control-plane operations.

Troubleshooting

`provisioningState: Failed`

Confirm that the requested accelerator family has approved quota in the target region, and that the chosen deployment template lists that accelerator family. A mismatched model and deployment template, for example, a template that was published for a different model version, is a common cause. Verify both references resolve to valid registry assets in the azure-huggingface registry.

”Quota exceeded” on create

The Foundry account doesn’t have enough managed compute quota in the region for the requested accelerator family. Request more quota. Azure VM quota doesn’t apply to managed compute.

”Insufficient capacity” in the region

The region returned no capacity for the requested accelerator family. Try a different family (for example, deploy on MI300X instead of H100), pick a template with fewer accelerators per instance, or target a different region. Larger-memory families such as MI300X often have capacity for models that don’t fit on A100.

404 from the `/openai/v1/` route

If a chat-completion request to https://<account>.services.ai.azure.com/openai/v1/chat/completions returns 404, verify that:

The deployment name in the request body matches the deployment you created.
The deployment’s provisioningState is Succeeded.
The model’s runtime exposes chat completions. Some runtimes (for example, TEI for embeddings) don’t expose the chat completions route; use the route documented on the model card instead.

Deployment stuck in `Creating` for longer than 20 minutes

Some larger models take longer than the typical 10–15 minutes to come up. If provisioningState is still Creating after 20 minutes, check the deployment details page in the Foundry portal for an operation status message, and confirm that the underlying region hasn’t degraded. If the deployment stays in Creating past 30 minutes with no operation message, delete it and retry. Provisioning is idempotent on the deployment name.

​Prerequisites

​Choose a model in the catalog

​Deploy the model

​Access control summary

​Troubleshooting

​provisioningState: Failed

​”Quota exceeded” on create

​”Insufficient capacity” in the region

​404 from the /openai/v1/ route

​Deployment stuck in Creating for longer than 20 minutes

​Related content

Prerequisites

Choose a model in the catalog

Deploy the model

Access control summary

Troubleshooting

`provisioningState: Failed`

”Quota exceeded” on create

”Insufficient capacity” in the region

404 from the `/openai/v1/` route

Deployment stuck in `Creating` for longer than 20 minutes

Related content