
Harm categories and severity levels in Microsoft Foundry

This article refers to the Microsoft Foundry (new) portal.
Microsoft Foundry content filtering ensures that AI-generated outputs align with ethical guidelines and safety standards. Content filtering capabilities classify harmful content into four categories — hate, sexual, violence, and self-harm — each graded at four severity levels (safe, low, medium, and high) for both text and image content. Use these categories and levels to configure guardrail controls that detect and mitigate risks associated with harmful content in your model deployments and agents. Guardrails in Microsoft Foundry ensure that AI-generated outputs align with ethical guidelines and safety standards. Guardrails classify harmful content into four categories — hate, sexual, violence, and self-harm — each graded at four severity levels (safe, low, medium, and high) for both text and image content. Use these categories and levels to configure Guardrail controls that detect and mitigate risks associated with harmful content in your model deployments and agents. For an overview of how guardrails work, see Guardrails and controls overview. The content safety system uses neural multiclass classification models to detect and filter harmful content for both text and image. Content detected at the “safe” severity level is labeled in annotations but isn’t subject to filtering and isn’t configurable.
The text content safety models for the hate, sexual, violence, and self-harm categories are trained and tested on the following languages: English, German, Japanese, Spanish, French, Italian, Portuguese, and Chinese. The service can work in many other languages, but the quality might vary. In all cases, you should do your own testing to ensure that it works for your application.
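
The same four harm categories and severity scale are also exposed programmatically by the Azure AI Content Safety service, which you can call directly to see how a piece of text is classified. The following sketch is a minimal example, assuming the azure-ai-contentsafety Python package and placeholder CONTENT_SAFETY_ENDPOINT and CONTENT_SAFETY_KEY environment variables; the response shape can vary by SDK version.

```python
# Minimal sketch: classify a piece of text into the four harm categories.
# Assumes the azure-ai-contentsafety package and a Content Safety resource;
# CONTENT_SAFETY_ENDPOINT and CONTENT_SAFETY_KEY are placeholder settings.
import os

from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)

response = client.analyze_text(AnalyzeTextOptions(text="Example text to screen."))

# Each item reports one of the four categories (hate, sexual, violence, self-harm)
# and a numeric severity that maps onto the safe/low/medium/high scale.
for item in response.categories_analysis:
    print(item.category, item.severity)
```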

Harm category descriptions

The following descriptions summarize the harm categories supported by Foundry guardrails.

Hate and Fairness
Hate and fairness-related harms refer to any content that attacks or uses discriminatory language with reference to a person or identity group based on certain differentiating attributes of these groups.

This category includes, but isn’t limited to:
• Race, ethnicity, nationality
• Gender identity groups and expression
• Sexual orientation
• Religion
• Personal appearance and body size
• Disability status
• Harassment and bullying

Sexual
Sexual describes language related to anatomical organs and genitals, romantic relationships, and sexual acts, including acts portrayed in erotic or affectionate terms and those portrayed as an assault or a forced sexual violent act against one’s will.

This category includes, but isn’t limited to:
• Vulgar content
• Prostitution
• Nudity and pornography
• Abuse
• Child exploitation, child abuse, child grooming

Violence
Violence describes language related to physical actions intended to hurt, injure, damage, or kill someone or something; it also describes weapons, guns, and related entities.

This category includes, but isn’t limited to:
• Weapons
• Bullying and intimidation
• Terrorism and violent extremism
• Stalking

Self-Harm
Self-harm describes language related to physical actions intended to purposely hurt, injure, or damage one’s body, or to kill oneself.

This category includes, but isn’t limited to:
• Eating disorders
• Bullying and intimidation
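
If you call the Azure AI Content Safety SDK directly, these four categories correspond to the TextCategory values in the azure-ai-contentsafety package, and a request can be limited to a subset of them. A minimal sketch, reusing the client from the earlier example; treat the parameter details as an assumption to verify against your SDK version.

```python
from azure.ai.contentsafety.models import AnalyzeTextOptions, TextCategory

# Screen only for the Violence and Self-Harm categories; Hate and Sexual are skipped.
# `client` is the ContentSafetyClient created in the earlier sketch.
request = AnalyzeTextOptions(
    text="Example text to screen.",
    categories=[TextCategory.VIOLENCE, TextCategory.SELF_HARM],
)
response = client.analyze_text(request)
for item in response.categories_analysis:
    print(item.category, item.severity)
```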

Severity levels

The content safety system classifies harmful content at four severity levels:
Safe: No harmful material detected. Annotated but never filtered.
Low: Mild harmful material. Includes prejudiced views, mild depictions in fictional contexts, or personal experiences.
Medium: Moderate harmful material. Includes graphic depictions, bullying, or content that promotes harmful acts.
High: Severe harmful material. Includes extremist content, explicit depictions, or content that endorses serious harm.
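
Under the hood, the content safety models return numeric severity scores rather than labels; in the default four-level output these are commonly 0, 2, 4, and 6. The mapping below is a small sketch under that assumption; check the severity scale documented for the specific API you call.

```python
# Assumed mapping for the default four-level numeric output (0, 2, 4, 6).
# Some APIs can also return an eight-level 0-7 scale, which this sketch ignores.
SEVERITY_LABELS = {0: "safe", 2: "low", 4: "medium", 6: "high"}

def severity_label(score: int) -> str:
    """Translate a numeric severity score into its label, if it is a known level."""
    return SEVERITY_LABELS.get(score, f"unknown ({score})")
```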

How severity levels map to guardrail configuration

When you configure a guardrail control for a harm category, you set a severity threshold that determines which content is flagged:
Off: Detection is disabled for this category. No content is flagged or blocked.
Low: Flags content at low severity and higher. Least restrictive setting.
Medium: Flags content at medium severity and higher.
High: Flags only the most severe content. Most restrictive setting.
Content at the “safe” level is always annotated but never blocked, regardless of threshold setting. To configure these thresholds, see How to configure guardrails and controls.
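
In other words, a control flags content whose detected severity meets or exceeds the configured threshold, while “safe” content is only ever annotated. The helper below is a hypothetical illustration of that rule; the function name and ordering are invented for this sketch and aren’t part of any Foundry API.

```python
# Hypothetical illustration of the threshold rule; not a Foundry or Azure API.
# Severities are ordered safe < low < medium < high.
_ORDER = {"safe": 0, "low": 1, "medium": 2, "high": 3}

def is_flagged(detected: str, threshold: str | None) -> bool:
    """Return True when the detected severity meets or exceeds the threshold.

    A threshold of None models the "Off" setting: detection is disabled, so
    nothing is flagged. Content at the "safe" level is never flagged.
    """
    if threshold is None or detected == "safe":
        return False
    return _ORDER[detected] >= _ORDER[threshold]

# Examples: is_flagged("medium", "low") -> True; is_flagged("low", "medium") -> False
```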

Detailed severity definitions for text

The following tables provide detailed descriptions and examples for each severity level within each harm category for text content. Select the Severity definitions tab to view examples.

Text content

The Severity definitions tab in this document contains examples of harmful content that may be disturbing to some readers.

Detailed severity definitions for images

The following tables provide detailed descriptions and examples for each severity level within each harm category for image content. Select the Severity definitions tab to view examples.

Image content

The Severity definitions tab in this document contains examples of harmful content that may be disturbing to some readers.
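
Image content is screened against the same four categories. The following is a minimal sketch of image analysis with the azure-ai-contentsafety package, assuming the client from the earlier text example and a placeholder file path.

```python
from azure.ai.contentsafety.models import AnalyzeImageOptions, ImageData

# Read a local image and screen it against the four harm categories.
# "sample.png" is a placeholder path; `client` is the ContentSafetyClient
# created in the earlier text sketch.
with open("sample.png", "rb") as f:
    request = AnalyzeImageOptions(image=ImageData(content=f.read()))

response = client.analyze_image(request)
for item in response.categories_analysis:
    print(item.category, item.severity)
```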

Next steps