Tool failure and cascade prevention: Circuit Breaker and Bulkhead

2026/05/10

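The essence of the $437 nightly pipeline incident covered in ch4 was that a failure in an external dependency (an MCP server) cascaded into the agent's retry loop. This chapter covers the design that keeps an agent from running away when an external tool goes down: **cascade prevention**.

Why agents are fragile to external dependencies

An agent is probabilistic, long-running, and dependent on external services, all at once. The durable execution principles covered in Part 1, ch7 apply directly here.

When an external dependency goes down, the following scenario typically plays out: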

sequenceDiagram
    participant A as Agent
    participant T as Tool / MCP server
    participant L as LLM API
    
    A->>T: tool call
    T-->>A: timeout / 503
    A->>L: "It failed. Should I retry?"
    L->>A: "Please retry"
    A->>T: tool call (retry 1)
    T-->>A: timeout
    A->>L: "It failed again. What now?"
    L->>A: "Retry once more"
    Note over A,T: The retry loop continues for 8 hours
    Note over A: Token blowup

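Structure of the cascade:

Tool fails → agent decides whether to retry → LLM call (cost incurred) → tool called again → fails → loop
Every retry consumes tokens, and the LLM call that decides on the retry consumes more

Stopping this requires a design that **fails fast the moment a call fails, instead of escalating**.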

The Circuit Breaker pattern

Here we apply the classic Circuit Breaker to AI agents.

Three states

graph LR
    Closed[Closed<br/>normal operation]
    Open[Open<br/>repeated failures → fail immediately]
    Half[Half-Open<br/>trial with a few calls]

    Closed -->|failure threshold exceeded| Open
    Open -->|timeout elapsed| Half
    Half -->|success| Closed
    Half -->|failure| Open

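Example implementation:

import time


class CircuitOpenError(Exception):
    """Raised when the breaker is Open and the call is rejected without reaching the tool."""


class ToolCircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_seconds=60):
        self.state = "Closed"
        self.failure_count = 0  # consecutive failures
        self.failure_threshold = failure_threshold
        self.opened_at = None
        self.timeout_seconds = timeout_seconds

    def call(self, tool_func, *args, **kwargs):
        if self.state == "Open":
            # After the cool-down, let a trial call through (Half-Open).
            if time.monotonic() - self.opened_at > self.timeout_seconds:
                self.state = "Half-Open"
            else:
                raise CircuitOpenError("Tool unavailable, fail fast")

        try:
            result = tool_func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state == "Half-Open" or self.failure_count >= self.failure_threshold:
                self.state = "Open"
                self.opened_at = time.monotonic()
            raise

        # Any success closes the circuit and resets the consecutive-failure count.
        self.state = "Closed"
        self.failure_count = 0
        return result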

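Points to watch with AI agents:

Failing fast is the minimum requirement for preventing token blowup
Do not ask the LLM "should I retry?" every time (each such decision is itself an LLM call, and the cost balloons)
Once the circuit is Open, tell the agent "Tool is down" and have it either try another route or stop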
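The points above can be sketched as a fixed retry policy that never consults the LLM. This is a minimal sketch under stated assumptions, not a definitive implementation: `decide_retry`, `RetryDecision`, `TRANSIENT_STATUS`, and the default thresholds are hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical classification: statuses treated as transient. Adjust per tool.
TRANSIENT_STATUS = {429, 502, 503, 504}


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    delay_seconds: float
    reason: str


def decide_retry(status: Optional[int], attempt: int, circuit_open: bool,
                 max_attempts: int = 3, base_delay: float = 1.0) -> RetryDecision:
    """Fixed decision tree: no LLM call is made to decide whether to retry.

    `status` is the HTTP-like status of the failed call (None for a timeout);
    `attempt` is the 1-based number of the attempt that just failed.
    """
    if circuit_open:
        return RetryDecision(False, 0.0, "circuit open: tool is down")
    if attempt >= max_attempts:
        return RetryDecision(False, 0.0, "retry limit reached")
    if status is not None and status not in TRANSIENT_STATUS:
        return RetryDecision(False, 0.0, "non-transient error")
    # Exponential backoff: base_delay, then 2x, 4x, ...
    return RetryDecision(True, base_delay * 2 ** (attempt - 1), "transient error")
```

When `retry` is false, the `reason` string can be handed to the agent as the "Tool is down" notification, so the only LLM call that follows is the one that picks another route or stops.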

The Bulkhead pattern

Like the bulkheads of a ship, this design partitions resources into compartments.

graph TB
    A[Agent]
    P1[Pool 1<br/>Weather API]
    P2[Pool 2<br/>Flight API]
    P3[Pool 3<br/>DB]

    A --> P1
    A --> P2
    A --> P3

    style P1 fill:#fcc
    style P2 fill:#cfc
    style P3 fill:#cfc

Design:

Separate the thread pool / connection pool per tool type
Even if the Weather API is overloaded, connections for the Flight API stay available
When a pool is exhausted, only that tool fails fast
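As a rough illustration of the design above, the sketch below limits concurrency per tool type. It swaps in per-tool semaphores in place of real thread or connection pools, which is a simplification; `Bulkhead` and `BulkheadFullError` are hypothetical names, and the limits are arbitrary.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a tool's compartment has no free slot."""


class Bulkhead:
    def __init__(self, limits):
        # One semaphore per tool type; `limits` maps tool name -> max concurrent calls.
        self._slots = {name: threading.BoundedSemaphore(n) for name, n in limits.items()}

    def call(self, tool_name, tool_func, *args, **kwargs):
        slot = self._slots[tool_name]
        # Non-blocking acquire: an exhausted compartment fails fast instead of queueing.
        if not slot.acquire(blocking=False):
            raise BulkheadFullError(f"{tool_name} pool exhausted")
        try:
            return tool_func(*args, **kwargs)
        finally:
            slot.release()
```

Because each tool has its own semaphore, calls stuck on the weather tool can only use up the weather slots; a call such as `bulkhead.call("flight", search_flights)` still finds a free slot.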

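A real example: MCP weather API → 503 cascade

Retries filled the thread pool, so even the healthy flight tool could not obtain a connection and everything stopped → a single upstream turned into a platform-wide outage

This happened because there was no Bulkhead. With separate pools, the Weather API outage would have been confined to the Weather agent.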

Idempotency and durable execution

Always attach an idempotency key to an agent's tool calls.

# ✅ Tool call with an idempotency key
# (assumes a client wrapper whose call_tool accepts idempotency_key)
result = mcp_client.call_tool(
    name="send_email",
    arguments={"to": "[email protected]", "body": "..."},
    idempotency_key="task-123-step-5"  # ← important
)

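What it does:

Calls with the same idempotency key produce their side effect only once
Retries become safe (no duplicate sends, no double charges)
Combined with the journal-replay of durable execution (Part 1, ch7), crash recovery can replay a workflow without repeating side effects

Prefer postgres-backed state over context-window state: whatever the agent remembers in its context window is volatile. Keep the durable source of truth in an external DB.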
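One way to picture how a key makes a retry safe is a small result store keyed by the idempotency key. The sketch below uses the standard-library `sqlite3` module purely as a stand-in for a Postgres-backed store; `IdempotentExecutor` and the `tool_results` table are hypothetical names, and results are assumed to be JSON-serializable.

```python
import json
import sqlite3


class IdempotentExecutor:
    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tool_results ("
            "idempotency_key TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )

    def run(self, idempotency_key, tool_func, *args, **kwargs):
        row = self.conn.execute(
            "SELECT result FROM tool_results WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        if row is not None:
            # Key already seen: return the recorded result and skip the side effect.
            return json.loads(row[0])
        result = tool_func(*args, **kwargs)
        with self.conn:  # commits the insert on success
            self.conn.execute(
                "INSERT INTO tool_results (idempotency_key, result) VALUES (?, ?)",
                (idempotency_key, json.dumps(result)),
            )
        return result
```

This sketch does not cover a crash between the side effect and the insert, nor concurrent callers using the same key; closing those gaps generally requires the downstream tool itself to honor the key.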

Saga / Compensation pattern

Workflows that span multiple tool calls need compensation logic for partial failures. AWS Prescriptive Guidance's Agentic AI Patterns treats saga orchestration as a standard pattern.

graph TB
    Start[Workflow Start]
    S1[Step 1: Reserve flight]
    S2[Step 2: Reserve hotel]
    S3[Step 3: Charge card]
    Done[Done]

    Start --> S1 --> S2 --> S3 --> Done

    S2 -.failure.-> C1[Compensation 1:<br/>Cancel flight reservation]
    S3 -.failure.-> C2[Compensation 2:<br/>Cancel flight + hotel]

    C1 --> Failed[Failed Cleanly]
    C2 --> Failed

Design principles:

Define a compensating transaction for each step
If a step fails partway through, undo the completed steps in reverse order
Combine with durable execution so the agent does not forget what has to be undone
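The principles above can be sketched as a small in-memory orchestrator. It is a minimal illustration only: `run_saga` and `SagaFailed` are hypothetical names, and the list of completed steps lives in process memory, whereas the text calls for keeping it in durable storage.

```python
class SagaFailed(Exception):
    """Raised after compensations have run for a failed workflow."""


def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On failure, run the compensations of the already-completed steps in reverse order.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            # Undo the most recent completed step first.
            for _, undo in reversed(completed):
                undo()
            raise SagaFailed(f"step '{name}' failed; earlier steps were compensated") from exc
        completed.append((name, compensate))
```

For the flight / hotel / card workflow in the diagram, a failure while charging the card would cancel the hotel first and then the flight. The failed step's own compensation is not run, since that step never completed.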

Token budget enforcement

The hard cap covered in ch4 also helps cut off tool-failure cascades.

✅ Token budgets for cascade prevention
─────────
Per-tool budget cap: up to N tokens per tool call
Per-task budget cap: up to M tokens for a whole task
Per-agent budget cap: up to K cumulative tokens per agent

Combined, these keep the final cost bounded
even when a retry loop runs for a long time

Portal26 Agentic Token Controls (launched 2026-04): throttle when the cap is reached, kill when it is exceeded. Enforcement, not alerts.
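The three caps can be sketched as one enforcer that refuses a charge before it is recorded. This is an illustrative sketch and is not how Portal26 or any other product implements enforcement; `TokenBudget` and `BudgetExceededError` are hypothetical names.

```python
class BudgetExceededError(Exception):
    """Raised when a charge would push any scope past its hard cap."""


class TokenBudget:
    def __init__(self, per_tool_call, per_task, per_agent):
        self.caps = {"tool_call": per_tool_call, "task": per_task, "agent": per_agent}
        self.used = {"task": 0, "agent": 0}

    def start_task(self):
        # The per-task counter resets; the per-agent counter keeps accumulating.
        self.used["task"] = 0

    def charge(self, tokens):
        """Record `tokens` for one tool call, or raise without recording if a cap breaks."""
        if tokens > self.caps["tool_call"]:
            raise BudgetExceededError("per-tool cap exceeded")
        for scope in ("task", "agent"):
            if self.used[scope] + tokens > self.caps[scope]:
                raise BudgetExceededError(f"per-{scope} cap exceeded")
        for scope in ("task", "agent"):
            self.used[scope] += tokens
```

Calling `charge` before every tool call and every LLM call means a retry loop ends with an exception once any cap is reached, however long it has been running.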

Cheap-model-first / expensive-model-on-retry

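A strategy that switches to a stronger model on each retry.

Try 1: Haiku (cheap, fast)
  ↓ failure
Try 2: Sonnet
  ↓ failure
Try 3: Opus (expensive, powerful)
  ↓ failure
Try 4: escalate to HITL

This applies the idea that "a cheap model is usually enough, and the expensive one is only for hard cases" to retries. It helps when the LLM's own response quality is low, not when a tool fails.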
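The ladder above can be sketched as a loop over model tiers. In this sketch `call_with_escalation`, `call_model`, and `is_acceptable` are hypothetical: the caller supplies the function that actually calls a model and the check that decides whether a response is good enough.

```python
def call_with_escalation(prompt, models, call_model, is_acceptable):
    """Try each model in order, cheapest first.

    Returns (model, response) for the first acceptable response, or (None, None)
    when every tier fails so the caller can escalate to a human (HITL).
    """
    for model in models:
        response = call_model(model, prompt)
        if is_acceptable(response):
            return model, response
    return None, None
```

The list of tiers bounds the number of attempts, so the ladder cannot turn into an open-ended retry loop.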

Production examples in 2026: durable execution platforms

Durable execution, covered in Part 1, ch7, is also central to cutting off tool-failure cascades.

Platform	Cascade-prevention characteristics
Temporal (raised $300M on 2026-02-17 at a $5B valuation)	Of 9.1T lifetime actions, 1.86T are AI-native. GA integration with the OpenAI Agents SDK
Restate	Wraps side effects explicitly per `ctx.run()`; looser sandbox constraints
Vercel Workflow DevKit (GA 2025-10)	100M+ runs / 500M+ steps / 1500+ customers. DurableAgent splits 50 steps into 50 invocations
Cognition Devin V3	Hypervisor-level snapshots (memory, process tree, FS). Snapshot creation 30 min → 15 s, time-to-first-message 25 s → 10 s

All of these cut a cascade off at its entry point through the durable execution principle that, after a failure, completed steps are not re-executed.
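The principle that completed steps are not re-executed can be illustrated with a toy journal. This is a teaching sketch only and does not reflect how Temporal, Restate, or the other platforms are implemented; `Journal` is a hypothetical name, step results are assumed to be JSON-serializable, and the journal is a local JSON-lines file.

```python
import json
from pathlib import Path


class Journal:
    """Append-only record of completed steps, keyed by step id."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = {}
        if self.path.exists():
            # Reload the results of steps that completed before a restart.
            for line in self.path.read_text().splitlines():
                entry = json.loads(line)
                self.done[entry["step"]] = entry["result"]

    def run(self, step_id, func):
        if step_id in self.done:
            # Replay: return the recorded result without executing the step again.
            return self.done[step_id]
        result = func()
        with self.path.open("a") as f:
            f.write(json.dumps({"step": step_id, "result": result}) + "\n")
        self.done[step_id] = result
        return result
```

After a crash, constructing a new `Journal` on the same file and re-running the workflow from the top skips every step already recorded, so only the failed step and those after it actually call their tools.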

Running MCP servers in production: major security incidents of 2026

2026 saw a series of major incidents involving MCP servers.

MCPwn (CVE-2026-33032, CVSS 9.8): 2,600+ publicly exposed instances affected
MCPwnfluence (CVE-2026-27825/27826): an RCE chain of SSRF + arbitrary file write in Atlassian MCP
Endor Labs' analysis of 2,614 MCP implementations found 82% at risk of path traversal
A design flaw in Anthropic's core MCP spec (2026-04): spread to LettaAI, LangFlow, Windsurf, and others

Operational countermeasures:

Consolidate connections through an MCP gateway (Cloudflare / TrueFoundry / Stacklok / IBM ContextForge)
Subscribe to CVE notifications (NVD, vendor blogs)
Update MCP server versions regularly
Build automated path traversal / SSRF scanning into CI

An integrated design for cascade prevention

Putting everything together, the design of the agent's tool-call layer looks like this:

graph TB
    A[Agent decides tool call]
    CB[Circuit Breaker check]
    BH[Bulkhead pool]
    Idem[Idempotency key check]
    Tool[Tool / MCP server]
    Result[Result with citation]

    A --> CB
    CB -->|Open| FailFast[Fail fast<br/>do not call LLM for retry]
    CB -->|Closed/Half| BH
    BH --> Idem
    Idem --> Tool
    Tool -->|success| Result
    Tool -->|failure| Counter[Update failure counter]
    Counter --> CB

    Token[Token budget enforcer<br/>kill when cap is exceeded] -.covers all layers.-> A
    Token -.covers all layers.-> Tool

The five layers:

Token budget enforcer: a hard cap covering every layer
Circuit Breaker: fail fast on repeated failures
Bulkhead: separate pools per tool type
Idempotency key: side effects happen only once
Durable execution: crash recovery does not re-execute completed steps

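❌ Anti-pattern: "A retry policy is all we need"

Symptoms
─────────
- The policy says "retry up to 3 times", yet $437 was burned
- A Circuit Breaker was implemented, but token consumption did not stop
- When Tool A goes down, Tools B and C stop responding too

Root causes
─────────
- A design in which LLM calls happen during retries (and those tokens are never accounted for)
- No Bulkhead (every tool shares the same pool)
- No idempotency keys (duplicate side effects)
- Not combined with durable execution (crash recovery re-executes every step)

The way out
─────────
1. Fix the retry decision in a decision tree, not the LLM (no LLM call)
2. Separate pools per tool type (Bulkhead)
3. Attach an idempotency key to every tool call without exception
4. Adopt durable execution such as Temporal / Restate / Vercel Workflow
5. Put a token budget enforcer over every layer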

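Three points that matter for production rollout

Design cascade prevention in five layers: Token budget / Circuit Breaker / Bulkhead / Idempotency / Durable execution. One layer alone is not enough
Do not use the LLM to decide on retries: the retry decision itself inflates cost. Fix it in a decision tree
Consolidate MCP servers behind a gateway: centralize auditing, vulnerability response, and updates. Subscribe to CVE notifications

Lead-in to the next chapter

ch7 covers HITL SLAs, which break operations just as tool failures do: what happens when the approval queue overflows, how to design tier-based escalation, and how to fall back when an SLA is violated, all at implementation-level detail.

Chapter summary

Cut off cascading failure with five layers: Token budget / Circuit Breaker / Bulkhead / Idempotency / Durable execution
Do not use the LLM to decide on retries: fix the decision in a decision tree and prevent token blowup
Separate pools per tool type with a Bulkhead: keeps one tool's outage from becoming a platform-wide failure
Idempotency key + Saga pattern: side effects happen only once, and partial failures are compensated
Major MCP CVEs of 2026: MCPwn / MCPwnfluence, with 82% at risk of path traversal. Consolidate behind a gateway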