Run AI Red Teaming Agent locally (preview)
This article refers to the Microsoft Foundry (new) portal.
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don’t recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
In this article, you learn how to:
- Create an AI Red Teaming Agent locally with the Azure AI Evaluation SDK.
- Run automated scans locally.
The AI Red Teaming Agent (local) isn't compatible with the Foundry (new) portal and SDK.
Prerequisites
- Make sure the connected storage account has access to all projects.
- If you connected your storage account with Microsoft Entra ID, make sure to give the managed identity Storage Blob Data Owner permissions to both your account and the Foundry project resource in the Azure portal.
Getting started
Install the redteam package as an extra from the Azure AI Evaluation SDK. This package provides the PyRIT functionality:
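```bash
pip install "azure-ai-evaluation[redteam]"
```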
PyRIT works only with Python 3.10, 3.11, and 3.12; it doesn't support Python 3.9. If you're using Python 3.9, you must upgrade your Python version to use this feature.
Create and run an AI Red Teaming Agent
You can instantiate the AI Red Teaming Agent with your Foundry project and Azure credentials. Select which risk categories to cover with the risk_categories parameter, and define the number of prompts covering each risk category with the num_objectives parameter.
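A minimal sketch of instantiating the agent; the project endpoint is a placeholder, and the risk category choices and counts are illustrative:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.red_team import RedTeam, RiskCategory

# Placeholder Foundry project endpoint; replace with your own project details.
azure_ai_project = "https://<your-account>.services.ai.azure.com/api/projects/<your-project>"

red_team_agent = RedTeam(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
    risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
    num_objectives=5,  # number of attack prompts per risk category
)
```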
AI Red Teaming Agent only supports single-turn interactions in text-only scenarios.
Region support
Currently, AI Red Teaming Agent is available only in some regions. Ensure your Azure AI Project is located in one of the following supported regions:

- East US 2
- Sweden Central
- France Central
- Switzerland West
Supported targets
The RedTeam can run automated scans on various targets:
- Model configurations: If you're just scanning a base model during your model selection process, you can pass in your model configuration as a target to red_team_agent.scan() (see the sketch after this list).
- Simple callback: A simple callback that takes in a string prompt from red_team_agent and returns a string response from your application (see the sketch after this list).
- Complex callback: A more complex callback that is aligned to the OpenAI Chat Protocol (see the sketch after this list).
- PyRIT prompt target: For advanced users coming from PyRIT, RedTeam can also scan text-based PyRIT PromptChatTarget targets. See the full list of PyRIT prompt targets.
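The sketches below show each programmatic target shape. The endpoint, deployment name, environment variable, and canned replies are placeholder assumptions, and the scan calls assume an async context (for example, a notebook):

```python
import os
from typing import Any, Dict, List, Optional

# 1) Model configuration target: hypothetical Azure OpenAI deployment details.
azure_openai_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment-name>",
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
}
red_team_result = await red_team_agent.scan(target=azure_openai_config)

# 2) Simple callback: takes a string prompt, returns a string response.
def simple_callback(query: str) -> str:
    # Call your application here; a canned reply keeps the sketch self-contained.
    return "I'm an AI assistant that follows ethical guidelines. I can't assist with that."

red_team_result = await red_team_agent.scan(target=simple_callback)

# 3) Complex callback aligned to the OpenAI Chat Protocol.
async def advanced_callback(
    messages: Dict[str, List[Dict[str, str]]],
    stream: bool = False,
    session_state: Optional[Any] = None,
    context: Optional[Any] = None,
) -> Dict[str, List[Dict[str, str]]]:
    # Read the latest user message from the conversation history; your
    # application would consume this prompt and produce a real reply.
    latest_message = messages["messages"][-1]["content"]
    response = "I'm an AI assistant that follows ethical guidelines. I can't assist with that."
    return {"messages": [{"role": "assistant", "content": response}]}

red_team_result = await red_team_agent.scan(target=advanced_callback)
```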
Supported risk categories
The following risk categories are supported in the AI Red Teaming Agent's runs, along with the associated number of attack objectives available for each risk category.

| Risk Category | Maximum Number of Attack Objectives |
|---|---|
| Violence | 100 |
| HateUnfairness | 100 |
| Sexual | 100 |
| SelfHarm | 100 |
| ProtectedMaterial | 200 |
| CodeVulnerability | 389 |
| UngroundedAttributes | 200 |
Custom attack objectives
The AI Red Teaming Agent provides a Microsoft-curated set of adversarial attack objectives that cover each supported risk category. Because your organization's policy might be different, you might want to bring your own custom set to use for each risk category. You can run the AI Red Teaming Agent on your own dataset. The supported risk types for custom objectives are violence, sexual, hate_unfairness, and self_harm; use these supported types so that the Safety Evaluators can evaluate the attacks for success. The number of prompts that you specify becomes the num_objectives used in the scan.
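A minimal sketch, assuming the SDK's custom_attack_seed_prompts parameter for passing your dataset (the file name is a placeholder):

```python
red_team_agent = RedTeam(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
    # Path to your own attack objectives; each prompt's risk type must be one of
    # violence, sexual, hate_unfairness, or self_harm.
    custom_attack_seed_prompts="custom_attack_prompts.json",
)
```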
Supported natural languages
AI Red Teaming Agent supports simulations in the following languages:

| Language | ISO language code |
|---|---|
| Spanish | es |
| Italian | it |
| French | fr |
| Japanese | ja |
| Portuguese | pt |
| Simplified Chinese | zh-cn |
To run scans in one of these languages, import the SupportedLanguages class and instantiate your red team with the desired language.
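A minimal sketch, assuming the language parameter accepts a SupportedLanguages value:

```python
from azure.ai.evaluation.red_team import RedTeam, SupportedLanguages

red_team_agent = RedTeam(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
    language=SupportedLanguages.Spanish,  # run the scan in Spanish
)
```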
Supported attack strategies
If only the target is passed in when you run a scan and no attack strategies are specified, the red_team_agent sends only baseline direct adversarial queries to your target. This approach is the most naive method of attempting to elicit undesired behavior or generated content. We recommend that you try baseline direct adversarial querying first, before you apply any attack strategies.
Attack strategies are methods to take the baseline direct adversarial queries and convert them into another form to try bypassing your target's safeguards. Attack strategies are classified into three levels of complexity. Attack complexity reflects the effort an attacker must put into conducting the attack.
- Easy complexity attacks require less effort, such as translation of a prompt into some encoding.
- Moderate complexity attacks require having access to resources such as another generative AI model.
- Difficult complexity attacks include attacks that require access to significant resources and effort to run an attack, such as knowledge of search-based algorithms, in addition to a generative AI model.
Default grouped attack strategies
This approach offers a group of default attacks for easy complexity and moderate complexity that you can use in the attack_strategies parameter. A difficult complexity attack can be a composition of two strategies in one attack.
| Attack strategy complexity group | Includes |
|---|---|
| EASY | Base64, Flip, Morse |
| MODERATE | Tense |
| DIFFICULT | Composition of Tense and Base64 |
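A minimal sketch of a scan that requests all three default groups, assuming the simple_callback target defined earlier:

```python
from azure.ai.evaluation.red_team import AttackStrategy

red_team_result = await red_team_agent.scan(
    target=simple_callback,
    attack_strategies=[
        AttackStrategy.EASY,
        AttackStrategy.MODERATE,
        AttackStrategy.DIFFICULT,
    ],
)
```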
This scan runs Base64, Flip, Morse, Tense, and a composition of Tense and Base64, which first translates the baseline query into past tense and then encodes it into Base64.
Specific attack strategies
You can specify the desired attack strategies instead of using default groups. The following attack strategies are supported:

| Attack strategy | Description | Complexity |
|---|---|---|
| AnsiAttack | Uses ANSI escape codes. | Easy |
| AsciiArt | Creates ASCII art. | Easy |
| AsciiSmuggler | Smuggles data using ASCII. | Easy |
| Atbash | Atbash cipher. | Easy |
| Base64 | Encodes data in Base64. | Easy |
| Binary | Binary encoding. | Easy |
| Caesar | Caesar cipher. | Easy |
| CharacterSpace | Uses character spacing. | Easy |
| CharSwap | Swaps characters. | Easy |
| Diacritic | Uses diacritics. | Easy |
| Flip | Flips characters. | Easy |
| Leetspeak | Leetspeak encoding. | Easy |
| Morse | Morse code encoding. | Easy |
| ROT13 | ROT13 cipher. | Easy |
| SuffixAppend | Appends suffixes. | Easy |
| StringJoin | Joins strings. | Easy |
| UnicodeConfusable | Uses Unicode confusables. | Easy |
| UnicodeSubstitution | Substitutes Unicode characters. | Easy |
| Url | URL encoding. | Easy |
| Jailbreak | User Injected Prompt Attacks (UPIA) inject specially crafted prompts to bypass AI safeguards. | Easy |
| IndirectAttack | Indirect Prompt Injected Attacks (XPIA) inject attacks into context or tool outputs. | Easy |
| Tense | Changes the tense of text into past tense. | Moderate |
| Multiturn | Attacks over several turns to bypass safeguards. | Difficult |
| Crescendo | Gradually increases prompt risk or complexity. | Difficult |
To chain strategies, use the AttackStrategy.Compose() function, which takes in a list of two supported attack strategies and chains them together. For example, a composition can first encode the baseline adversarial query into Base64 and then apply the ROT13 cipher to the Base64-encoded query. Compositions support chaining only two attack strategies together.
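A minimal sketch of that composition, again assuming the simple_callback target defined earlier:

```python
red_team_result = await red_team_agent.scan(
    target=simple_callback,
    attack_strategies=[
        # First Base64-encode the query, then apply ROT13 to the encoded result.
        AttackStrategy.Compose([AttackStrategy.Base64, AttackStrategy.ROT13]),
    ],
)
```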
Results from your automated scans
The key metric for assessing your results is the Attack Success Rate (ASR), which measures the percentage of attacks that successfully elicit undesirable responses from your AI system. When the scan is finished, you can specify an output_path to capture a JSON file that represents a scorecard of your results, for use in your own reporting tool or compliance platform.
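A minimal sketch, assuming the simple_callback target and treating the scan and file names as placeholders:

```python
red_team_result = await red_team_agent.scan(
    target=simple_callback,
    scan_name="My-First-RedTeam-Scan",
    output_path="My-First-RedTeam-Scan.json",  # where the scorecard is written
)
```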
The resulting My-First-RedTeam-Scan.json file contains a scorecard that provides a breakdown across attack complexity and risk categories. It also includes a joint attack complexity and risk category report. Important metadata is tracked in the parameters section, which outlines which risk categories were used to generate the attack objectives and which attack strategies were specified in the scan.