
Incident Command Mesh

Reference architecture for an incident-response mesh built with Google ADK, A2A, and MCP.

This sample shows how to structure a multi-agent system with explicit boundaries:

  • A root incident commander that normalizes an operator prompt.
  • Remote specialist agents exposed over A2A.
  • MCP servers that keep telemetry, runbooks, and postmortem memory separated.
  • Deterministic fan-out and fan-in using workflow agents.
  • A human approval boundary before risky mitigations.
  • A final executive brief artifact that is actually readable.

Why this pattern works

A lot of demos still do this:

  • one agent,
  • one giant prompt,
  • too many tools,
  • fuzzy control flow,
  • and no clean story for safety, failure, or reuse.

This repo does the opposite:

  • A2A for specialist-to-specialist collaboration,
  • MCP for tool and data access,
  • ADK orchestration for deterministic control,
  • and a clear upgrade path into ADK 2.0 workflow concepts.

The use case

The sample handles an incident-response scenario for checkout-api:

  • live latency spike,
  • error-rate spike,
  • suspicious recent deploy,
  • saturated database connections,
  • and degraded cache efficiency.

That makes the architecture easy to explain because each specialist has a clean job:

  • telemetry correlator = what is happening,
  • runbook resolver = what is safe to do,
  • incident memory = what worked before,
  • remediation planner = what we should do next.

Architecture at a glance

flowchart LR
    User[On-call engineer] --> Orchestrator[Incident Commander<br/>ADK orchestrator]

    Orchestrator -->|A2A| Telemetry[Telemetry Correlator]
    Orchestrator -->|A2A| Runbook[Runbook Resolver]
    Orchestrator -->|A2A| Memory[Incident Memory]
    Orchestrator -->|A2A| Planner[Remediation Planner]

    Telemetry -->|MCP| TelemetryMCP[(Telemetry MCP)]
    Runbook -->|MCP| RunbookMCP[(Runbook MCP)]
    Memory -->|MCP| MemoryMCP[(Incident Memory MCP)]

    TelemetryMCP --> Metrics[(metrics_snapshot.json)]
    TelemetryMCP --> Alerts[(active_incidents.json)]
    RunbookMCP --> Runbooks[(runbooks/*.md)]
    MemoryMCP --> Postmortems[(postmortems.json)]

    Telemetry -. findings .-> Orchestrator
    Runbook -. guidance .-> Orchestrator
    Memory -. history .-> Orchestrator
    Planner -. plan .-> Orchestrator

    Orchestrator --> Approval[Approval gate]
    Approval --> Artifact[Executive incident brief]

Sequence flow

sequenceDiagram
    participant Ops as Operator
    participant Cmd as Incident Commander
    participant Tel as Telemetry Correlator
    participant Run as Runbook Resolver
    participant Mem as Incident Memory
    participant Plan as Remediation Planner
    participant Human as Human approval

    Ops->>Cmd: Investigate checkout-api degradation
    Cmd->>Cmd: Build incident_packet

    par Parallel evidence collection
        Cmd->>Tel: A2A task: correlate signals
        Tel->>Tel: MCP calls for alerts and metrics
        Tel-->>Cmd: fault domain + blast radius
    and
        Cmd->>Run: A2A task: fetch runbook guidance
        Run->>Run: MCP calls for runbook search
        Run-->>Cmd: safe actions + rollback notes
    and
        Cmd->>Mem: A2A task: retrieve analogs
        Mem->>Mem: MCP calls for postmortem search
        Mem-->>Cmd: what worked / what failed
    end

    Cmd->>Plan: A2A task: build mitigation plan
    Plan-->>Cmd: ordered steps + risk_level
    Cmd->>Human: approval packet if risk is high
    Human-->>Cmd: approve / deny
    Cmd-->>Ops: executive brief + audit stamp

Repository layout

adk2-a2a-mcp-incident-mesh/
├── .devcontainer/
├── common/
├── data/
│   ├── incidents/
│   ├── runbooks/
│   └── telemetry/
├── docs/
├── mcp_servers/
├── orchestrator/
├── remote_a2a/
│   ├── incident_memory/
│   ├── remediation_planner/
│   ├── runbook_resolver/
│   └── telemetry_correlator/
└── scripts/

Design choices

1) A2A and MCP are not treated as the same thing

The system keeps collaboration and tool access separate.

  • A2A is the collaboration layer between peer agents.
  • MCP is the access layer between each specialist and its domain tools.

That one distinction alone makes the whole architecture easier to explain.
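
To make the split concrete, here is a minimal sketch of both layers in Python, assuming a recent google-adk release. Import paths and parameter names have shifted between ADK versions, and the model, file, and port values below are illustrative rather than copied from this repo.

from google.adk.agents import LlmAgent
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters

# MCP: the specialist's private access layer to its domain tools.
telemetry_specialist = LlmAgent(
    name="telemetry_correlator",
    model="gemini-2.0-flash",  # illustrative model id
    instruction="Correlate alerts and metrics into a fault domain.",
    tools=[
        MCPToolset(
            connection_params=StdioServerParameters(
                command="python3",
                args=["mcp_servers/telemetry_server.py"],  # hypothetical server path
            ),
        )
    ],
)

# A2A: the commander never imports the specialist; it only resolves the
# remote peer through its published agent card.
telemetry_peer = RemoteA2aAgent(
    name="telemetry_correlator",
    agent_card="http://localhost:8101/.well-known/agent.json",
)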

2) Boundaries are real

Each specialist owns exactly one cognitive domain:

  • telemetry,
  • runbooks,
  • incident memory,
  • planning.

You can swap them, scale them, or replace them independently.

3) Fan-out and fan-in are deterministic

The main sample uses:

  • SequentialAgent for strict ordering,
  • ParallelAgent for evidence collection.

That means the process topology is explicit in code, not buried in prompt prose.
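
A sketch of that wiring with ADK's workflow agents follows; the sub-agent names and instructions are illustrative placeholders, and the real composition lives in orchestrator/agent.py.

from google.adk.agents import LlmAgent, ParallelAgent, SequentialAgent

# Placeholder specialists; in this repo they are remote A2A peers.
telemetry = LlmAgent(name="telemetry_specialist", model="gemini-2.0-flash",
                     instruction="Correlate alerts and metrics.")
runbook = LlmAgent(name="runbook_specialist", model="gemini-2.0-flash",
                   instruction="Fetch safe runbook actions.")
memory = LlmAgent(name="memory_specialist", model="gemini-2.0-flash",
                  instruction="Retrieve analogous postmortems.")
planner = LlmAgent(name="remediation_planner", model="gemini-2.0-flash",
                   instruction="Build an ordered mitigation plan.")

# Fan-out: the three evidence gatherers run concurrently.
parallel_evidence = ParallelAgent(
    name="parallel_evidence",
    sub_agents=[telemetry, runbook, memory],
)

# Fan-in: a fixed order, evidence first, then planning.
root_agent = SequentialAgent(
    name="incident_pipeline",
    sub_agents=[parallel_evidence, planner],
)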

4) Human approval lives at the action boundary

If the recommended mitigation crosses a risk threshold, the orchestrator creates an approval packet before finalizing the brief.
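
In plain Python terms, the gate amounts to a threshold check like the sketch below; the risk_level field and the packet shape are assumptions for illustration, not the repo's actual schema.

# Hedged sketch of the approval boundary; field names are assumed.
RISK_THRESHOLD = "high"

def maybe_request_approval(plan: dict) -> dict | None:
    """Return an approval packet when the plan crosses the risk threshold."""
    if plan.get("risk_level") == RISK_THRESHOLD:
        return {
            "proposed_action": plan["steps"][0],
            "risk_level": plan["risk_level"],
            "requires_human": True,
        }
    return None  # low-risk plans proceed straight to the brief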

5) The system produces an artifact

The orchestrator writes a markdown brief and a machine-readable JSON summary to the local artifacts/ directory.
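
A minimal sketch of that dual persistence, with illustrative file names:

import json
from pathlib import Path

def write_brief(incident_id: str, brief_md: str, summary: dict) -> None:
    # Persist both artifacts side by side under artifacts/.
    out = Path("artifacts")
    out.mkdir(exist_ok=True)
    (out / f"{incident_id}.md").write_text(brief_md)  # human-readable brief
    (out / f"{incident_id}.json").write_text(json.dumps(summary, indent=2))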

Stable path now, ADK 2.0 path next

This repository is intentionally opinionated:

  • The main path uses stable workflow agents today.
  • orchestrator/agent_adk2_preview.py shows how to map the same design into an ADK 2.0-style workflow with a RequestInput approval gate.

That is the move a mature team makes: ship on the stable surface, design for the next surface.

ADK 2.0 preview diagram

flowchart TD
    Start([START]) --> Triage[incident_triage]
    Triage --> Fanout[parallel_evidence]
    Fanout --> Telemetry[telemetry_specialist]
    Fanout --> Runbook[runbook_specialist]
    Fanout --> Memory[memory_specialist]
    Telemetry --> Synthesize[decision_synthesizer]
    Runbook --> Synthesize
    Memory --> Synthesize
    Synthesize --> Approval[RequestInput approval gate]
    Approval --> Finalize[final brief artifact]

Local development

1) Install dependencies

cp .env.example .env
python3 -m pip install -r requirements.txt

2) Start the A2A mesh

bash scripts/start_demo.sh

That launches:

  • the orchestrator on port 8000,
  • the telemetry specialist on 8101,
  • the runbook specialist on 8102,
  • the incident-memory specialist on 8103,
  • and the remediation planner on 8104.

3) Verify the agent cards

bash scripts/smoke_test.sh

4) Run the local validation gate

python3 scripts/validate_local.py

This validates:

  • Python and ADK imports
  • deterministic MCP data paths
  • A2A service startup and agent-card reachability (sketched below)
  • optional online end-to-end execution when credentials are present
  • deterministic artifact persistence for both markdown and JSON outputs
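
As a hedged sketch, the agent-card reachability check reduces to an HTTP probe per specialist, shown here with only the standard library; the well-known path follows the common A2A convention and may differ across A2A versions.

import json
import urllib.request

AGENT_PORTS = {"telemetry": 8101, "runbook": 8102, "memory": 8103, "planner": 8104}

def check_agent_cards() -> None:
    for name, port in AGENT_PORTS.items():
        url = f"http://localhost:{port}/.well-known/agent.json"
        with urllib.request.urlopen(url, timeout=5) as resp:
            card = json.load(resp)
        assert "name" in card, f"{name}: malformed agent card at {url}"
        print(f"ok: {name} agent card reachable on :{port}")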

5) Optional: run the local ADK UI

PYTHONPATH=. adk web .

Observability

Set these env vars when you want the services to export telemetry through ADK:

  • ADK_ENABLE_CLOUD_TRACING=true
  • ADK_ENABLE_CLOUD_METRICS=true
  • ADK_ENABLE_CLOUD_LOGGING=true

For non-GCP collectors, standard OTLP env vars are also honored through ADK's OpenTelemetry bootstrap.
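
If you would rather set the flags from a Python launcher than your shell, a trivial sketch (the OTLP endpoint value is an illustrative default, not something this repo requires):

import os

# Enable ADK's cloud exporters before any service starts.
os.environ.setdefault("ADK_ENABLE_CLOUD_TRACING", "true")
os.environ.setdefault("ADK_ENABLE_CLOUD_METRICS", "true")
os.environ.setdefault("ADK_ENABLE_CLOUD_LOGGING", "true")
# For a non-GCP collector, point the standard OTLP variable at it instead.
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")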

Production posture

For local development, this sample keeps MCP simple with stdio-backed servers. For production, the cleaner move is:

  • remote MCP over HTTP or SSE (sketched below),
  • separately deployed A2A services,
  • explicit auth on both layers,
  • and structured tracing injected into A2A requests.
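
A hedged sketch of the dev-to-prod swap on the MCP layer: the SSE parameter class in google-adk has been renamed across releases (SseServerParams in some versions, SseConnectionParams in others), and the URL and auth header below are illustrative.

from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, SseServerParams

# Remote MCP over SSE with explicit auth, replacing the local stdio server.
telemetry_tools = MCPToolset(
    connection_params=SseServerParams(
        url="https://telemetry-mcp.example.com/sse",  # illustrative endpoint
        headers={"Authorization": "Bearer <TELEMETRY_MCP_TOKEN>"},  # auth on the tool layer
    ),
)
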
flowchart LR
    subgraph Dev[Local or Codespaces]
        DevCmd[Orchestrator]
        DevA2A[Remote A2A agents]
        DevMCP[stdio MCP servers]
    end

    subgraph Prod[Cloud Run or GKE]
        ProdCmd[Orchestrator service]
        ProdA2A[Remote A2A services]
        ProdMCP[Remote MCP over HTTP or SSE]
    end

    Dev --> Prod

Demo prompt

Use this prompt in your UI or smoke-test walkthrough:

Investigate the active checkout-api incident. I need:
1. the most likely root cause,
2. the strongest evidence,
3. the safest immediate mitigation,
4. whether rollback needs human approval,
5. and a concise executive brief.

Files to point people at

  • orchestrator/agent.py for the core fan-out/fan-in design.
  • remote_a2a/*/agent.py for specialist boundaries.
  • mcp_servers/*.py for domain-specific tool surfaces.
  • orchestrator/agent_adk2_preview.py for the ADK 2.0 angle.

One-liner summary

This is not another mega-agent demo. It is a modular incident-response mesh that uses ADK for orchestration, A2A for specialist delegation, and MCP for domain access, with a clear path into ADK 2.0 workflow patterns.
