Reference architecture for an incident-response mesh built with Google ADK, A2A, and MCP.
This sample shows how to structure a multi-agent system with explicit boundaries:
- A root incident commander that normalizes an operator prompt.
- Remote specialist agents exposed over A2A.
- MCP servers that keep telemetry, runbooks, and postmortem memory separated.
- Deterministic fan-out and fan-in using workflow agents.
- A human approval boundary before risky mitigations.
- A final executive brief artifact that is actually readable.
A lot of demos still do this:
- one agent,
- one giant prompt,
- too many tools,
- fuzzy control flow,
- and no clean story for safety, failure, or reuse.
This repo does the opposite:
- A2A for specialist-to-specialist collaboration,
- MCP for tool and data access,
- ADK orchestration for deterministic control,
- and a clear upgrade path into ADK 2.0 workflow concepts.
The sample handles an incident-response scenario for checkout-api:
- live latency spike,
- error-rate spike,
- suspicious recent deploy,
- saturated database connections,
- and degraded cache efficiency.
That makes the architecture easy to explain because each specialist has a clean job:
- telemetry correlator = what is happening,
- runbook resolver = what is safe to do,
- incident memory = what worked before,
- remediation planner = what we should do next.
```mermaid
flowchart LR
    User[On-call engineer] --> Orchestrator[Incident Commander<br/>ADK orchestrator]
    Orchestrator -->|A2A| Telemetry[Telemetry Correlator]
    Orchestrator -->|A2A| Runbook[Runbook Resolver]
    Orchestrator -->|A2A| Memory[Incident Memory]
    Orchestrator -->|A2A| Planner[Remediation Planner]
    Telemetry -->|MCP| TelemetryMCP[(Telemetry MCP)]
    Runbook -->|MCP| RunbookMCP[(Runbook MCP)]
    Memory -->|MCP| MemoryMCP[(Incident Memory MCP)]
    TelemetryMCP --> Metrics[(metrics_snapshot.json)]
    TelemetryMCP --> Alerts[(active_incidents.json)]
    RunbookMCP --> Runbooks[(runbooks/*.md)]
    MemoryMCP --> Postmortems[(postmortems.json)]
    Telemetry -. findings .-> Orchestrator
    Runbook -. guidance .-> Orchestrator
    Memory -. history .-> Orchestrator
    Planner -. plan .-> Orchestrator
    Orchestrator --> Approval[Approval gate]
    Approval --> Artifact[Executive incident brief]
```
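Each MCP node in that diagram can stay a small, single-purpose server over its fixture files. A minimal sketch of the telemetry one, assuming the `FastMCP` helper from the MCP Python SDK and that the fixtures live under `data/telemetry/` (tool names and JSON shape are illustrative, not necessarily what `mcp_servers/` ships):

```python
# Sketch of a stdio MCP server for the telemetry domain, built on FastMCP
# from the MCP Python SDK. Paths, tool names, and JSON shape are illustrative.
import json
from pathlib import Path

from mcp.server.fastmcp import FastMCP

DATA_DIR = Path("data/telemetry")  # assumed fixture location
mcp = FastMCP("telemetry")

@mcp.tool()
def get_metrics_snapshot(service: str) -> dict:
    """Return the latest metrics snapshot for one service."""
    snapshot = json.loads((DATA_DIR / "metrics_snapshot.json").read_text())
    return snapshot.get(service, {})  # assumes the snapshot is keyed by service

@mcp.tool()
def list_active_incidents() -> list:
    """Return currently active incidents from the local fixture."""
    return json.loads((DATA_DIR / "active_incidents.json").read_text())

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```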
```mermaid
sequenceDiagram
    participant Ops as Operator
    participant Cmd as Incident Commander
    participant Tel as Telemetry Correlator
    participant Run as Runbook Resolver
    participant Mem as Incident Memory
    participant Plan as Remediation Planner
    participant Human as Human approval
    Ops->>Cmd: Investigate checkout-api degradation
    Cmd->>Cmd: Build incident_packet
    par Parallel evidence collection
        Cmd->>Tel: A2A task: correlate signals
        Tel->>Tel: MCP calls for alerts and metrics
        Tel-->>Cmd: fault domain + blast radius
    and
        Cmd->>Run: A2A task: fetch runbook guidance
        Run->>Run: MCP calls for runbook search
        Run-->>Cmd: safe actions + rollback notes
    and
        Cmd->>Mem: A2A task: retrieve analogs
        Mem->>Mem: MCP calls for postmortem search
        Mem-->>Cmd: what worked / what failed
    end
    Cmd->>Plan: A2A task: build mitigation plan
    Plan-->>Cmd: ordered steps + risk_level
    Cmd->>Human: approval packet if risk is high
    Human-->>Cmd: approve / deny
    Cmd-->>Ops: executive brief + audit stamp
```
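The `incident_packet` built in that first step is just normalized context the commander fans out to every specialist; a hypothetical shape (field names are illustrative, not the sample's actual schema) might look like:

```python
# Hypothetical incident_packet; field names are illustrative only.
incident_packet = {
    "service": "checkout-api",
    "observed": ["latency spike", "error-rate spike", "db connection saturation"],
    "suspect_change": "recent deploy",
    "requested_outputs": ["root_cause", "evidence", "mitigation", "approval_needed"],
}
```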
Repository layout:

```
adk2-a2a-mcp-incident-mesh/
├── .devcontainer/
├── common/
├── data/
│   ├── incidents/
│   ├── runbooks/
│   └── telemetry/
├── docs/
├── mcp_servers/
├── orchestrator/
├── remote_a2a/
│   ├── incident_memory/
│   ├── remediation_planner/
│   ├── runbook_resolver/
│   └── telemetry_correlator/
└── scripts/
```
The system keeps collaboration and tool access separate.
- A2A is the collaboration layer between peer agents.
- MCP is the access layer between each specialist and its domain tools.
That one distinction alone makes the whole architecture easier to explain.
Each specialist owns exactly one cognitive domain:
- telemetry,
- runbooks,
- incident memory,
- planning.
You can swap them, scale them, or replace them independently.
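Because the boundary between peers is an agent card rather than shared code, swapping a specialist is a wiring change on the commander's side. A minimal sketch, assuming the `RemoteA2aAgent` class from ADK's A2A support and the localhost port used in the quickstart below:

```python
# Sketch: the commander references a specialist purely through its A2A agent
# card. Assumes ADK's RemoteA2aAgent; the port matches the quickstart below.
from google.adk.agents.remote_a2a_agent import (
    AGENT_CARD_WELL_KNOWN_PATH,
    RemoteA2aAgent,
)

telemetry_correlator = RemoteA2aAgent(
    name="telemetry_correlator",
    description="Correlates alerts and metrics into a fault domain.",
    agent_card=f"http://localhost:8101{AGENT_CARD_WELL_KNOWN_PATH}",
)
```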
The main sample uses:
- `SequentialAgent` for strict ordering,
- `ParallelAgent` for evidence collection.
That means the process topology is explicit in code, not buried in prompt prose.
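A condensed sketch of that fan-out/fan-in topology using ADK's workflow agents; the inline specialist stubs and model name are placeholders, since the real sample delegates those roles to the remote A2A agents:

```python
from google.adk.agents import LlmAgent, ParallelAgent, SequentialAgent

MODEL = "gemini-2.0-flash"  # placeholder model name

# Placeholder specialists; the real sample delegates these roles over A2A.
telemetry = LlmAgent(name="telemetry_specialist", model=MODEL,
                     instruction="Correlate alerts and metrics.",
                     output_key="telemetry_findings")
runbook = LlmAgent(name="runbook_specialist", model=MODEL,
                   instruction="Fetch safe mitigation guidance.",
                   output_key="runbook_guidance")
memory = LlmAgent(name="memory_specialist", model=MODEL,
                  instruction="Retrieve analogous postmortems.",
                  output_key="incident_history")

# Fan-out: evidence collection runs concurrently.
parallel_evidence = ParallelAgent(name="parallel_evidence",
                                  sub_agents=[telemetry, runbook, memory])

# Fan-in: synthesis reads the three state keys written above.
synthesizer = LlmAgent(name="decision_synthesizer", model=MODEL,
                       instruction="Combine {telemetry_findings}, {runbook_guidance} "
                                   "and {incident_history} into a mitigation plan.")

# The topology is explicit in code: parallel evidence, then synthesis.
root_agent = SequentialAgent(name="incident_commander",
                             sub_agents=[parallel_evidence, synthesizer])
```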
If the recommended mitigation crosses a risk threshold, the orchestrator creates an approval packet before finalizing the brief.
The orchestrator writes a markdown brief and a machine-readable JSON summary to the local artifacts/ directory.
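A pure-Python sketch of that boundary; the filenames, packet fields, and risk policy here are illustrative rather than the repo's actual schema:

```python
# Sketch of the approval boundary and artifact write; filenames, field names,
# and the risk policy are illustrative, not the repo's schema.
import json
from pathlib import Path
from typing import Optional

def finalize_brief(plan: dict, approved_by: Optional[str], out_dir: str = "artifacts") -> None:
    """Persist the executive brief as markdown plus a machine-readable JSON summary."""
    if plan["risk_level"] == "high" and approved_by is None:
        raise RuntimeError("High-risk mitigation requires an approval packet first.")

    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / "incident_brief.md").write_text(
        "# Executive incident brief\n\n"
        f"Root cause: {plan['root_cause']}\n\n"
        f"Mitigation: {plan['mitigation']}\n\n"
        f"Approved by: {approved_by or 'auto (low risk)'}\n"
    )
    (out / "incident_brief.json").write_text(
        json.dumps({**plan, "approved_by": approved_by}, indent=2)
    )
```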
This repository is intentionally opinionated:
- The main path uses stable workflow agents today.
- `orchestrator/agent_adk2_preview.py` shows how to map the same design into an ADK 2.0-style workflow with a `RequestInput` approval gate.
That is the move a mature team makes: ship on the stable surface, design for the next surface.
```mermaid
flowchart TD
    Start([START]) --> Triage[incident_triage]
    Triage --> Fanout[parallel_evidence]
    Fanout --> Telemetry[telemetry_specialist]
    Fanout --> Runbook[runbook_specialist]
    Fanout --> Memory[memory_specialist]
    Telemetry --> Synthesize[decision_synthesizer]
    Runbook --> Synthesize
    Memory --> Synthesize
    Synthesize --> Approval[RequestInput approval gate]
    Approval --> Finalize[final brief artifact]
```
To run the demo locally:

```bash
cp .env.example .env
python3 -m pip install -r requirements.txt
bash scripts/start_demo.sh
```

That launches:

- the orchestrator on port `8000`,
- the telemetry specialist on `8101`,
- the runbook specialist on `8102`,
- the incident-memory specialist on `8103`,
- the remediation planner on `8104`.
Then run the smoke test and the local validator:

```bash
bash scripts/smoke_test.sh
python3 scripts/validate_local.py
```

This validates:
- Python and ADK imports
- deterministic MCP data paths
- A2A service startup and agent-card reachability
- optional online end-to-end execution when credentials are present
- deterministic artifact persistence for both markdown and JSON outputs
Set these env vars when you want the services to export telemetry through ADK:
```
ADK_ENABLE_CLOUD_TRACING=true
ADK_ENABLE_CLOUD_METRICS=true
ADK_ENABLE_CLOUD_LOGGING=true
```
For non-GCP collectors, standard OTLP env vars are also honored through ADK's OpenTelemetry bootstrap.
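For example, a local collector could be targeted with the standard variables (the values here are placeholders):

```
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_SERVICE_NAME=incident-mesh-orchestrator
```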
To browse the agents in the ADK dev UI, run:

```bash
PYTHONPATH=. adk web .
```

For local development, this sample keeps MCP simple with stdio-backed servers. For production, the cleaner move (sketched after the deployment diagram below) is:
- remote MCP over HTTP or SSE,
- separately deployed A2A services,
- explicit auth on both layers,
- and structured tracing injected into A2A requests.
```mermaid
flowchart LR
    subgraph Dev[Local or Codespaces]
        DevCmd[Orchestrator]
        DevA2A[Remote A2A agents]
        DevMCP[stdio MCP servers]
    end
    subgraph Prod[Cloud Run or GKE]
        ProdCmd[Orchestrator service]
        ProdA2A[Remote A2A services]
        ProdMCP[Remote MCP over HTTP or SSE]
    end
    Dev --> Prod
```
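On the specialist side, that promotion is mostly a toolset-wiring change. A sketch of the local stdio case, assuming ADK's `MCPToolset` plus `StdioServerParameters` from the MCP SDK (the exact connection-parameter classes vary across ADK releases, and the server path is illustrative):

```python
from google.adk.agents import LlmAgent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
from mcp import StdioServerParameters  # stdio transport params from the MCP SDK

# Local dev: spawn the telemetry MCP server as a stdio subprocess.
telemetry_tools = MCPToolset(
    connection_params=StdioServerParameters(
        command="python3",
        args=["mcp_servers/telemetry_server.py"],  # illustrative path
    )
)

telemetry_specialist = LlmAgent(
    name="telemetry_correlator",
    model="gemini-2.0-flash",  # placeholder model name
    instruction="Correlate alerts and metrics into a fault domain.",
    tools=[telemetry_tools],
)

# Production: keep the agent identical and swap connection_params for the
# HTTP/SSE connection class provided by your ADK version, pointing at the
# remotely deployed MCP service, with explicit auth on both layers.
```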
Use this prompt in your UI or smoke-test walkthrough:
```
Investigate the active checkout-api incident. I need:
1. the most likely root cause,
2. the strongest evidence,
3. the safest immediate mitigation,
4. whether rollback needs human approval,
5. and a concise executive brief.
```
Start with these files:

- `orchestrator/agent.py` for the core fan-out/fan-in design.
- `remote_a2a/*/agent.py` for specialist boundaries.
- `mcp_servers/*.py` for domain-specific tool surfaces.
- `orchestrator/agent_adk2_preview.py` for the ADK 2.0 angle.
This is not another mega-agent demo. It is a modular incident-response mesh that uses ADK for orchestration, A2A for specialist delegation, and MCP for domain access, with a clear path into ADK 2.0 workflow patterns.