
Guardrails and controls overview in Microsoft Foundry

Microsoft Foundry provides safety and security guardrails that you can apply to core models and agents. Agent guardrails are in preview. A guardrail is a named collection of controls. Each control defines a risk to detect, the intervention points to scan for that risk, and the response action to take in the model or agent when the risk is detected. Risks are flagged by classification models designed to detect harmful content. Variations in API configurations and application design might affect completions and thus filtering behavior. Four intervention points are supported:
  • User input — The prompt sent to a model or agent.
  • Tool call (Preview) — The action and data the agent proposes to send to a tool. Agents only.
  • Tool response (Preview) — The content returned from a tool to the agent. Agents only.
  • Output — The final completion returned to the user.
For more information about intervention points, see Intervention points and controls.
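The relationship between guardrails, controls, intervention points, and actions can be pictured as a small data model. The following is a minimal sketch for illustration only: the class names, field names, and the guardrail name `ExampleGuardrail` are assumptions, not the Foundry SDK or API schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class InterventionPoint(Enum):
    USER_INPUT = "user_input"        # prompt sent to a model or agent
    TOOL_CALL = "tool_call"          # Preview, agents only
    TOOL_RESPONSE = "tool_response"  # Preview, agents only
    OUTPUT = "output"                # final completion returned to the user


class Action(Enum):
    ANNOTATE = "annotate"
    ANNOTATE_AND_BLOCK = "annotate_and_block"


@dataclass(frozen=True)
class Control:
    """One control: a risk, where to scan for it, and what to do on detection."""
    risk: str
    intervention_points: FrozenSet[InterventionPoint]
    action: Action


@dataclass(frozen=True)
class Guardrail:
    """A named collection of controls (hypothetical shape, not an API type)."""
    name: str
    controls: Tuple[Control, ...]


# Hypothetical guardrail with a single Violence control on input and output.
example = Guardrail(
    name="ExampleGuardrail",
    controls=(
        Control(
            risk="Violence",
            intervention_points=frozenset(
                {InterventionPoint.USER_INPUT, InterventionPoint.OUTPUT}
            ),
            action=Action.ANNOTATE_AND_BLOCK,
        ),
    ),
)
```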
The guardrail system applies to all Models sold directly by Azure, except for prompts and completions processed by audio models such as Whisper. For more information, see Audio models. The guardrail system currently applies only to agents developed in the Foundry Agent Service, not to other agents registered in the Foundry Control Plane.

Prerequisites

Guardrails for agents vs models

An individual Foundry guardrail can be applied to one or many models and one or many agents in a project. Some controls within a guardrail may not be relevant to models because the risk, intervention point, or action is specific to agentic behavior or tool calls. Those controls aren’t run on models using that guardrail. Some risks in Preview aren’t yet supported for agents. When controls involving those risks are added to a guardrail and the guardrail is applied to an agent, those controls don’t take effect for that agent. They still apply to models that use the same guardrail.
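The applicability rules above can be sketched as a filter over a guardrail's controls. This is an illustrative sketch, not the service's implementation: the function name, the dictionary shape, and the `risks_unsupported_for_agents` parameter are assumptions. The set of risks not yet supported for agents is left to the caller because it isn't listed here.

```python
# Intervention points that exist only for agents (tool call and tool response).
AGENT_ONLY_POINTS = {"tool_call", "tool_response"}


def controls_in_effect(controls, target, risks_unsupported_for_agents=frozenset()):
    """Return (risk, points) pairs that take effect for a "model" or an "agent".

    controls: list of dicts like {"risk": "Violence", "points": {"user_input"}}.
    """
    if target not in ("model", "agent"):
        raise ValueError("target must be 'model' or 'agent'")

    result = []
    for control in controls:
        points = set(control["points"])
        if target == "model":
            # Agent-only intervention points are not run on models.
            points -= AGENT_ONLY_POINTS
        elif control["risk"] in risks_unsupported_for_agents:
            # The control has no effect for the agent but still applies to models.
            continue
        if points:
            result.append((control["risk"], points))
    return result


# Hypothetical guardrail shared by a model and an agent.
shared_controls = [
    {"risk": "Violence", "points": {"user_input", "tool_call"}},
    {"risk": "ExampleRisk", "points": {"output"}},
]
on_model = controls_in_effect(shared_controls, "model")
on_agent = controls_in_effect(shared_controls, "agent", {"ExampleRisk"})
```

In this sketch the same guardrail yields different effective controls depending on where it is applied, which mirrors the behavior described above.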

Risk applicability

The following table summarizes which risks are applicable to models and agents:
Risk | Applicable to Models | Applicable to Agents (Preview)
Hate
Sexual
Self-harm
Violence
User prompt attacks
Indirect attacks
Spotlighting (Preview)
Protected material for code
Protected material for text
Groundedness (Preview)
Personally identifiable information (Preview)

Severity levels

For content risks (Hate, Sexual, Self-harm, Violence), each control uses a severity level threshold that determines which content is flagged:
Severity level | Behavior
Off | Detection is disabled for this risk. Available only to approved customers; see content filters.
Low | Flags content at low severity and above. Most restrictive.
Medium | Flags content at medium severity and above.
High | Flags only the most severe content. Least restrictive.
For a detailed breakdown of what each severity level detects, see Content filtering categories.
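The threshold behavior in the table amounts to a comparison on an ordered severity scale. The sketch below is an assumption-laden illustration: the function name, the lowercase string labels, and the `safe` level for content below low severity are hypothetical and don't reflect the service's actual response format.

```python
# Ordered severity scale; "safe" (below low) is an assumed label for unflaggable content.
SEVERITY_ORDER = {"safe": 0, "low": 1, "medium": 2, "high": 3}


def is_flagged(content_severity: str, threshold: str) -> bool:
    """Return True if content at this severity is flagged under the threshold."""
    if threshold == "off":
        # Detection is disabled for this risk.
        return False
    # "safe" is a content label only, never a valid threshold.
    if threshold not in ("low", "medium", "high"):
        raise ValueError("threshold must be 'off', 'low', 'medium', or 'high'")
    return SEVERITY_ORDER[content_severity] >= SEVERITY_ORDER[threshold]
```

For example, under this sketch a Low threshold flags medium-severity content, while a High threshold doesn't.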

Intervention point applicability

The following table summarizes which intervention points are applicable to models and agents:
Intervention point | Applicable to Models | Applicable to Agents (Preview)
User input | Yes | Yes
Tool call (Preview) | No | Yes
Tool response (Preview) | No | Yes
Output | Yes | Yes

Action applicability

The following table summarizes which actions are applicable to models and agents:
Action | Applicable to Models | Applicable to Agents (Preview)
Annotate
Annotate and block

Guardrail inheritance and override

Risks are detected in an agent based on the guardrail it’s assigned, not the guardrail of its underlying model. The agentic guardrail fully overrides the model’s guardrail.
Example scenario:
  • A model deployment has a control with Violence detection set to High for user input and output
  • An agent using that model has a control with Violence detection set to Low for user input and output. The agent has no controls for Violence detection at all for tool calls and responses
Expected behavior for Violence detection in that agent:
  • User queries to the agent are scanned for Violence at a Low level
  • Tool calls that the agent's underlying model generates, including the content sent to the tool when the call executes, aren't scanned for Violence
  • The response returned from the tool isn't scanned for Violence
  • The final output returned to the user in response to their original query is scanned for Violence at a Low level
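The example scenario can be traced with a small lookup that consults only the agent's guardrail. This is a sketch under assumed names: the dictionaries keyed by (risk, intervention point) and the function `threshold_for_agent` are hypothetical, not part of any Foundry API.

```python
# Controls keyed by (risk, intervention point); values are severity thresholds.
model_guardrail = {
    ("Violence", "user_input"): "High",
    ("Violence", "output"): "High",
}
agent_guardrail = {
    ("Violence", "user_input"): "Low",
    ("Violence", "output"): "Low",
    # No Violence controls for tool_call or tool_response.
}


def threshold_for_agent(agent_controls, model_controls, risk, point):
    """Return the threshold used for the agent, or None if the point isn't scanned.

    model_controls is deliberately ignored: the agent's guardrail fully
    overrides the guardrail of its underlying model.
    """
    return agent_controls.get((risk, point))


scan_plan = {
    point: threshold_for_agent(agent_guardrail, model_guardrail, "Violence", point)
    for point in ("user_input", "tool_call", "tool_response", "output")
}
```

Here `scan_plan` maps user input and output to Low and leaves both tool-related points unscanned, matching the expected behavior listed above.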

Default guardrails

By default, models are assigned the Microsoft.DefaultV2 guardrail. For more information about what controls are included, see Content filtering. Default guardrail assignment for agents follows these rules:
  • If you assign a custom guardrail to an agent, that guardrail is used.
  • If no custom guardrail is assigned, the agent inherits the guardrail of its underlying model deployment.
  • An agent only uses the Microsoft.DefaultV2 guardrail if its model deployment uses that guardrail, or if you explicitly assign it.
For example, if no custom guardrails are specified for an agent and that agent uses a GPT-4o mini deployment with a guardrail named “MyCustomGuardrails,” the agent also uses “MyCustomGuardrails” until you assign a different guardrail.
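The assignment rules reduce to a simple fallback. The following sketch uses a hypothetical function name and plain strings for guardrail names; it illustrates the rules above rather than any actual service logic.

```python
from typing import Optional


def resolve_agent_guardrail(
    agent_guardrail: Optional[str], model_deployment_guardrail: str
) -> str:
    """Pick the guardrail an agent uses.

    A guardrail assigned to the agent wins; otherwise the agent inherits
    the guardrail of its underlying model deployment.
    """
    if agent_guardrail is not None:
        return agent_guardrail
    return model_deployment_guardrail


# No guardrail assigned to the agent: it inherits the deployment's guardrail.
inherited = resolve_agent_guardrail(None, "MyCustomGuardrails")
# Explicit assignment on the agent takes precedence.
explicit = resolve_agent_guardrail("Microsoft.DefaultV2", "MyCustomGuardrails")
```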

Next steps