This is the overview article in a three-part series.
- You’re here: Understand limitations, prevention controls, and required baseline configuration.
- For prolonged regional outages and platform incidents, see Agent Service platform outage recovery strategies.
- For human-caused or automation-caused deletions and localized data loss, see Agent Service resource and data loss recovery strategies.
Scope and definitions
This series focuses on DR for Foundry projects that use Agent Service in Standard deployment mode.
- Blast radius boundary: In most workloads, a single Foundry project is the recovery unit.
- State: Agent definitions, conversation threads (including user-uploaded files), and any file-based knowledge stored in the capability host dependencies (Azure Cosmos DB, Azure AI Search, and Azure Storage).
- Data plane APIs: APIs used to create, update, and invoke agents and threads. For details, see Azure AI Foundry REST API reference.
This series doesn’t cover the Basic deployment mode. Basic mode uses Microsoft-managed infrastructure with different DR characteristics. For details, see Basic vs. Standard agent setup.
DR readiness checklist
Complete these actions before you rely on Agent Service in production:
- Choose a recovery strategy per project (for example, warm standby and reconstruction) and document your recovery objectives.
- Configure required baseline protections and recovery features on your dependencies. For guidance, see High availability and resiliency for Foundry projects and agent services.
- Treat agent definitions as code. Store agent definitions, knowledge assets, and tool bindings in source control so you can redeploy them quickly.
- Automate redeployment of agents and any client updates needed for new agent IDs. Script agent creation with the REST API (see Azure AI Foundry REST API reference) or the Azure AI Projects SDK. Store and version infrastructure as code (IaC) templates for your capability host dependencies.
- Practice recovery. Run periodic drills so operators can execute the recovery steps under time pressure.
- Set up monitoring and alerts. Configure Azure Monitor alerts for your capability host dependencies (Azure Cosmos DB, Azure AI Search, and Azure Storage) to detect availability degradation early.
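The "treat agent definitions as code" and "automate redeployment" items above can be sketched as follows. This is a minimal illustration, not the service's API: the `AgentDefinition` schema, the JSON layout, and the `create_agent` callable are all hypothetical placeholders. In a real script, `create_agent` would wrap whichever REST call or SDK method you use to create an agent and return the new agent ID.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AgentDefinition:
    """Source-controlled description of one agent (hypothetical schema)."""
    name: str
    instructions: str
    model: str
    previous_id: str  # ID the agent had in the project being replaced


def load_definitions(raw_json: str) -> List[AgentDefinition]:
    """Parse agent definitions from a JSON document kept in source control."""
    return [AgentDefinition(**item) for item in json.loads(raw_json)]


def redeploy(
    definitions: List[AgentDefinition],
    create_agent: Callable[[AgentDefinition], str],
) -> Dict[str, str]:
    """Recreate every agent and return a map of old agent ID -> new agent ID.

    `create_agent` is a placeholder (assumption) for your own REST or SDK
    wrapper. It must return the ID assigned to the newly created agent.
    """
    id_map: Dict[str, str] = {}
    for definition in definitions:
        id_map[definition.previous_id] = create_agent(definition)
    return id_map


if __name__ == "__main__":
    # Stand-in for a file checked in to source control.
    raw = json.dumps([
        {"name": "support", "instructions": "Answer support questions.",
         "model": "example-model", "previous_id": "agent-old-1"},
    ])

    # Fake creator used only for this demonstration.
    def fake_create_agent(definition: AgentDefinition) -> str:
        return f"new-{definition.name}"

    print(redeploy(load_definitions(raw), fake_create_agent))
```

The returned ID map is the piece that client updates depend on: because redeployed agents get new IDs, any client configuration that references the old IDs has to be rewritten from this map as part of the same automation.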
Incident types and affected components
Agent Service deployments can encounter incidents that affect availability and data integrity in these components:
- Data plane APIs: Services responsible for creating, updating, and invoking agents
- Agent capability host: Per-project infrastructure that houses your agents
- Agent definitions: Prompts, knowledge connections, file-based context, and tool integrations
- Conversation threads: Text conversations and user-uploaded files
Recovery capability limitations
The Agent Service has important limitations that shape your workload’s disaster recovery (DR) design. Consider these factors when you set realistic recovery point objectives (RPOs) and recovery time objectives (RTOs).
Agent Service doesn’t provide built-in disaster recovery capabilities. It doesn’t replicate state, create backups, or support point-in-time restore. One project can’t use the data of another project. The service doesn’t have any supported method for active-active, multi-region replication. Microsoft Support can’t recover orphaned data, migrate data between projects, or combine state from multiple sources.
The recommendations in this guide are compensating controls. Recovery might not be possible. An incident can permanently remove an agent and its data, such as threads and knowledge.
Unrecoverable scenarios and expectations
Plan for scenarios where recovery isn’t possible or where recovery restores only functionality (not state):

| Scenario | Impact |
|---|---|
| Thread deletion | There isn’t a supported way to restore a deleted conversation thread. |
| Project reconstruction | If you delete and recreate a project, you redeploy agents with new agent IDs. Thread history and user-uploaded files aren’t recoverable. |
| Cross-region failover | You restore service by recreating projects in another region. Standby-region agents don’t have prior threads, and standby state is lost during failback. |
| State migration | There isn’t a supported way to merge or migrate agent state between projects or regions. |
General implications for your recovery design
- Treat each independent workload capability as an isolated blast radius. Design recovery decisions and procedures to support independent recovery. This boundary is usually a single Foundry project, but it can be multiple projects that share the same dependencies and recovery requirements.
- The recovery point for stateful content can be total loss. Plan for business and user acceptance of that loss.
- Recovery time mostly depends on how fast you can reapply infrastructure as code and redeploy agent definitions. Invest in automation accordingly.
- Warm standby environments start mostly empty. Recovery is reconstruction, not promotion of a hot replica.
- Avoid designs or user expectations that assume you can later consolidate a recovery environment’s data back into a production environment’s data.
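Because recovery time depends mostly on how fast you can reapply IaC and redeploy agent definitions, it can help to record step timings from each recovery drill and compare the total against your RTO. The sketch below is one hypothetical way to do that. The step names, the `DrillStep` and `DrillResult` types, and the assumption that steps run sequentially are illustrative choices, not part of the service.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrillStep:
    """One timed step from a recovery drill (hypothetical record)."""
    name: str
    minutes: float


@dataclass(frozen=True)
class DrillResult:
    total_minutes: float
    meets_rto: bool
    slowest_step: str


def evaluate_drill(steps: List[DrillStep], rto_minutes: float) -> DrillResult:
    """Sum sequential step durations and compare the total with the RTO."""
    if not steps:
        raise ValueError("A drill needs at least one timed step.")
    total = sum(step.minutes for step in steps)
    slowest = max(steps, key=lambda step: step.minutes)
    return DrillResult(
        total_minutes=total,
        meets_rto=total <= rto_minutes,
        slowest_step=slowest.name,
    )


if __name__ == "__main__":
    # Example timings are invented for illustration only.
    drill = [
        DrillStep("Reapply IaC templates", 40),
        DrillStep("Redeploy agent definitions", 15),
        DrillStep("Update clients with new agent IDs", 10),
    ]
    print(evaluate_drill(drill, rto_minutes=60))
```

The slowest step is reported because it is usually the first candidate for further automation when a drill misses the objective.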