Foundry Agent Service disaster recovery
This article refers to the Microsoft Foundry (new) portal.
This is the overview article in a three-part series.
- You’re here: Understand limitations, prevention controls, and required baseline configuration.
- For prolonged regional outages and platform incidents, see Agent Service platform outage recovery strategies.
- For human-caused or automation-caused deletions and localized data loss, see Agent Service resource and data loss recovery strategies.
Scope and definitions
This series focuses on DR for Foundry projects that use Agent Service in Standard deployment mode.
- Blast radius boundary: In most workloads, a single Foundry project is the recovery unit.
- State: Agent definitions, conversation threads (including user-uploaded files), and any file-based knowledge stored in the capability host dependencies.
- Data plane APIs: APIs used to create, update, and invoke agents and threads. For details, see AI Agents REST API operation groups.
DR readiness checklist
Complete these actions before you rely on Agent Service in production:
- Choose a recovery strategy per project (for example, warm standby and reconstruction) and document your recovery objectives.
- Configure required baseline protections and recovery features on your dependencies. For guidance, see High availability and resiliency for Foundry projects and agent services.
- Treat agent definitions as code. Store agent definitions, knowledge assets, and tool bindings in source control so you can redeploy them quickly.
- Automate redeployment of agents and any client updates needed for new agent IDs.
- Practice recovery. Run periodic drills so operators can execute the recovery steps under time pressure.
- Data plane APIs: Services responsible for creating, updating, and invoking agents
- Agent capability host: Per-project infrastructure that houses your agents
- Agent definitions: Prompts, knowledge connections, file-based context, and tool integrations
- Conversation threads: Text conversations and user-uploaded files
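The checklist items about treating agent definitions as code and automating redeployment can be sketched as follows. This is a minimal illustration, not the service's API: the `AgentDefinition` schema, the `redeploy_agents` helper, and the injected `create_agent` callable are all hypothetical names. `create_agent` stands in for whatever call your pipeline actually uses to create an agent in the target project. The sketch shows one way to keep a stable logical name per agent and record the new agent ID that each redeployment produces, so clients can be updated from a single mapping.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class AgentDefinition:
    """Source-controlled description of one agent (hypothetical schema)."""
    logical_name: str   # stable name that clients use; survives redeployment
    instructions: str   # the agent prompt
    model: str          # model deployment name (assumed field)
    tools: tuple = ()   # tool binding identifiers (assumed field)


def redeploy_agents(
    definitions: Iterable[AgentDefinition],
    create_agent: Callable[[AgentDefinition], str],
) -> Dict[str, str]:
    """Recreate every agent and return a map of logical name -> new agent ID.

    `create_agent` is a placeholder for the real creation call in your
    deployment pipeline; it must return the ID of the newly created agent.
    """
    id_map: Dict[str, str] = {}
    for definition in definitions:
        if definition.logical_name in id_map:
            raise ValueError(f"duplicate logical name: {definition.logical_name}")
        id_map[definition.logical_name] = create_agent(definition)
    return id_map


def write_id_map(id_map: Dict[str, str], path: str) -> None:
    """Persist the mapping so client configuration can pick up the new IDs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(id_map, handle, indent=2, sort_keys=True)
```

Clients that resolve agents by logical name through this mapping, rather than hard-coding agent IDs, only need the mapping file refreshed after a recovery.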
Built-in recovery capabilities
The Agent Service has important limitations that shape your workload’s disaster recovery (DR) design. Consider these factors when you set realistic recovery point objectives (RPOs) and recovery time objectives (RTOs).
Agent Service doesn’t provide built-in disaster recovery capabilities. It doesn’t replicate state, create backups, or support point-in-time restore. One project can’t use the data of another project. The service doesn’t have any supported method for active-active, multi-region replication. Microsoft Support can’t recover orphaned data, migrate data between projects, or combine state from multiple sources.
The recommendations in this guide are compensating controls. Recovery might not be possible. An incident can permanently remove an agent and its data, such as threads and knowledge.
Unrecoverable scenarios and expectations
Plan for scenarios where recovery isn’t possible or where recovery restores only functionality (not state):
- Thread deletion: There isn’t a supported way to restore a deleted conversation thread.
- Project reconstruction: If a project is deleted and you recreate it, you redeploy agents as new resources with new agent IDs. Thread history and user-uploaded files from the deleted project aren’t recoverable.
- Cross-region failover: In a regional outage, you typically restore service by recreating projects and redeploying agents in another region. Standby-region agents don’t have access to prior threads, and any standby-region state is lost during failback.
- State migration: There isn’t a supported way to merge or migrate agent state between projects or between regions.
General implications for your recovery design
- Treat each independent workload capability as an isolated blast radius. Design recovery decisions and procedures to support independent recovery. This boundary is usually a single Foundry project, but it can be multiple projects that share the same dependencies and recovery requirements.
- The recovery point for stateful content can be total loss. Plan for business and user acceptance of that loss.
- Recovery time mostly depends on how fast you can reapply infrastructure as code and redeploy agent definitions. Invest in automation accordingly.
- Warm standby environments start mostly empty. Recovery is reconstruction, not promotion of a hot replica.
- Avoid designs or user expectations that assume you can later consolidate a recovery environment’s data back into a production environment’s data.
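Because recovery time is driven mostly by how fast you can reapply infrastructure and redeploy agent definitions, one way to check an RTO is to add up step durations measured during a recovery drill. The sketch below assumes the steps run sequentially. The step names and durations in the usage note are illustrative placeholders, not measurements or guidance from this article.

```python
from datetime import timedelta
from typing import Dict


def estimate_recovery_time(step_durations: Dict[str, timedelta]) -> timedelta:
    """Sum drill-measured step durations, assuming the steps run in sequence."""
    return sum(step_durations.values(), timedelta())


def meets_rto(step_durations: Dict[str, timedelta], rto: timedelta) -> bool:
    """Return True when the estimated recovery time fits within the RTO."""
    return estimate_recovery_time(step_durations) <= rto
```

For example, with hypothetical drill timings of 20 minutes to reapply infrastructure as code, 10 minutes to redeploy agent definitions, and 5 minutes to update clients with new agent IDs, `estimate_recovery_time` returns 35 minutes. That fits a 60-minute RTO but not a 30-minute one.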