Kubernetes execution upgrade roadmap

Living document for evolving Agentic Orchestration from in-process CrewAI kickoff() to optional Kubernetes-backed step execution (pod-per-step / pod-per-agent), while keeping the planner, catalogs, sessions, learning, and web UI.

Status: K3 MVP + K4 + K5 implemented.

Companion plan: Dual execution framework — Python refactor for pluggable execution backends (CrewAI in-process default, subprocess, kubernetes). This page owns cluster delivery; the companion page owns the code seam. Framework F0–F4 is complete; K8s Phase 3 implements framework F5.

Relationship to the dual execution framework plan

Two separate roadmaps; one system.

	Dual execution framework	This page (K8s upgrade)
Delivers	`ExecutionBackend`, `StepCoordinator`, backend factory, CrewAI extract	Worker image, Jobs, PVCs, sidecars, kind tests
Required first	F0–F4 complete ✅	K0 sign-off + K2.3 worker image before K3
Value without cluster	✅ F4 = subprocess workers prove distributed contract	❌ until Phase 3
Shared artifacts	Python types	Step spec + result JSON schemas (canonical)

See the companion plan for module layout (orchestration/backends/), ExecutionBackend protocol, and framework Phases F0–F5.

Goals

Goal	Notes
Optional K8s execution	Local in-process mode remains the default for dev; K8s is opt-in.
Preserve orchestration brain	Planner, YAML catalogs, session/KB/learning, artifact pipeline stay central.
Shared YAML config	Agent, MCP, and workflow YAML unchanged across backends — Dual execution framework.
Isolate agents per step	Each planned step can run in its own pod with secrets, resources, and crash isolation.
Minimize CrewAI loss	Prefer mini-Crew per pod (1 agent, 1 task) so MCP tool loops and provider code reuse stay intact.
Incremental delivery	Each phase ships value; no big-bang rewrite.

Non-goals (initial phases)

Replacing the dynamic planner with a K8s-native workflow engine (Argo, Temporal, etc.).
In-run CrewAI delegation / hierarchical manager agents (not used today; defer to Agent societies roadmap K6).
Multi-tenant hardening beyond what Compose/K8s secrets already imply.
Forked or backend-specific agent/MCP/workflow YAML catalogs (runtime policy only; see Dual execution framework).

Terminology

Term	Meaning in this doc
Coordinator	Process that owns the step loop: spawn workers, hand off outputs, retries, sessions. Today this logic lives in `main.py` + `runner.py`.
Worker	Ephemeral pod/Job that executes one planned step and writes results to shared storage.
Crew run	One user goal → one plan → N sequential steps. Maps to a K8s Job graph or coordinator-managed Job chain, not a K8s Node.
Run store	Shared storage for step specs and results. In v1 this is a PVC mounted at a shared path, backed by the same `FileSystemRunStore` as the local and subprocess backends (S3/Redis deferred).

Current vs target architecture

Today (default + subprocess)

Web UI → spawn python main.py
         → execution_backend_from_env()
         → inprocess: build_workflow() → Crew.kickoff() in one process
         → subprocess (AGENTIC_SUBPROCESS_WORKERS=1): StepCoordinator → --execute-step workers
         → sessions / KB / artifacts

Key modules: orchestration/execution_dispatch.py, orchestration/runner.py, orchestration/backends/subprocess_runner.py, orchestration/execute_step.py, orchestration/dynamic_planner.py.

Target (K8s mode)

Web UI → Coordinator (Deployment)
         → Planner → WorkflowConfig / step specs
         → for each step: K8s Job (worker pod)
         → run store (handoff prior output)
         → sessions / KB / artifacts (unchanged)

flowchart TB
  subgraph coordinator [Orchestrator Deployment]
    Web[Web UI]
    Planner[Dynamic Planner]
    Coord[Step Coordinator]
    Store[(Run Store)]
  end

  subgraph workers [Per-step Jobs]
    W1[Worker: step_1]
    W2[Worker: step_2]
    Wn[Worker: step_n]
  end

  Web --> Planner
  Planner --> Coord
  Coord -->|spawn| W1
  W1 -->|output| Store
  Coord -->|spawn| W2
  W2 -->|output| Store
  Coord -->|spawn| Wn
  Wn -->|output| Store
  Coord --> Store
  Coord --> Web

What we keep vs change

Component	Keep?	Change
Dynamic planner	✅	May filter MCP catalog in K8s mode
Agent provider YAML + factory	✅	Workers call same `build_agent()`
MCP catalog resolution	✅	Prefer HTTP MCPs; stdio → sidecar or cluster service
`runner.py` sequential inject logic	✅	Ported to `step_context.prepare_step_description` + `StepCoordinator` ✅
`execution_fallback.py`	✅	Workflow-level HF fallback via `main._run_dynamic_workflow_with_hf_fallback`; per-step retry deferred (see Phase 3)
Session / learning / KB	✅	Add K8s run metadata to session JSON
`crew.kickoff()` whole crew	❌ (K8s mode)	Replaced by coordinator step loop
Web spawn single process	⚠️	Web UI still spawns or embeds the coordinator; workers run as separate pods

Recommended execution strategy

Option A — Mini-Crew per pod (recommended)

Each worker runs a new CLI mode, e.g. --execute-step, that:

Loads one step spec (JSON).
Builds 1 Agent, 1 Task, 1 Crew (single task).
Calls kickoff().
Writes result.json + artifacts to the run store.
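
The four worker steps above can be sketched roughly as follows. This is a minimal illustration, not the shipped `execute_step.py`: the `run_crew` hook is hypothetical and stands in for building one Agent, one Task, and one Crew and calling kickoff(), and the sketch assumes `paths.run_store` in the spec is the directory that holds per-step folders.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def execute_step(spec_path: Path, run_crew: Callable[[dict], str]) -> int:
    """Load one step spec, run it, write result.json, and return the exit code.

    run_crew is a hypothetical hook. A real worker would build one Agent,
    one Task, and one Crew from the spec and return the kickoff() output.
    """
    spec = json.loads(Path(spec_path).read_text(encoding="utf-8"))
    result = {
        "schema_version": "0.1",
        "run_id": spec["run_id"],
        "step_id": spec["step_id"],
        "exit_code": 0,
        "result_text": "",
        "error": None,
    }
    started_at = datetime.now(timezone.utc).isoformat()
    try:
        result["result_text"] = str(run_crew(spec))
    except Exception as exc:  # a failed step still reports through result.json
        result["exit_code"] = 1
        result["error"] = f"{type(exc).__name__}: {exc}"
    result["timing"] = {
        "started_at": started_at,
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }
    # Assumption: paths.run_store is the directory that holds per-step folders.
    out_dir = Path(spec["paths"]["run_store"]) / spec["step_id"]
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "result.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result["exit_code"]
```

Injecting the crew runner keeps the file contract testable without CrewAI installed; the coordinator only ever sees the exit code and result.json.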

Option B — Custom agent loop (defer)

Replace CrewAI in workers with LiteLLM + MCP SDK. Only consider if CrewAI coupling becomes painful.

Losses and mitigations

Use this table when prioritizing work. Each row maps to phase tasks below.

Loss	Mitigation	Primary phase
In-memory step handoffs	Coordinator + run store; port `_inject_previous_output_into_next_task`	1, 3
Stdio MCP subprocesses	HTTP MCPs first; sidecars for fetch/filesystem; catalog filter in K8s mode	3, 4
HF → Ollama execution fallback	Workflow-level retry in `main` + per-step retry in distributed backends (`step_recovery.py`)	3 ✅
Provider recovery retry	Workflow-level in-process + per-step via `recovery_hint` in `StepCoordinator`	3 ✅
CrewAI MCP tool loop	Mini-Crew per pod (Option A)	2
LLM provider abstraction	Same `AgentProvider` code inside worker image	2
CrewOutput / artifacts	Worker `result.json` contract; thin adapter in coordinator	1, 2
Delegation / hierarchical	Don’t replicate initially; planner already orchestrates	—
CrewAI ecosystem upgrades	Pin worker image; `ExecutionBackend` interface	2, 5
Step latency / cold start	Warm pool (optional); image pre-pull; slim worker image	4, 5
Cross-pod debuggability	`run_id`/`step_id` logging; session run record; Loki/ELK	3

Phase dependencies and parallelism

Phases are not fully independent. Some stand alone and ship value without later work; others are integration milestones that require earlier contracts. Use this section when scheduling work across sessions or contributors.

Dependency graph

flowchart LR
  P0[Phase 0 Design lock]
  P1[Phase 1 Step spec + store]
  P2[Phase 2 Worker entrypoint]
  P3[Phase 3 K8s backend]
  P4[Phase 4 MCP sidecars]
  P5[Phase 5 Ops polish]

  P0 --> P1
  P0 --> P2
  P1 --> P2
  P1 --> P3
  P2 --> P3
  P3 --> P4
  P3 --> P5
  P4 -.-> P5

Legend: solid arrows = hard dependency; dotted = soft (Phase 5 benefits from Phase 4 but can start without it).

Independence summary

Phase	Standalone?	Depends on	Ships value without later phases?
0 — Design lock	✅ Yes	Nothing	✅ Yes — unblocks all other work
1 — Step spec + store	⚠️ Mostly	0 (schemas agreed)	✅ Yes — cleaner in-process execution, same UX
2 — Worker entrypoint	⚠️ Mostly	0; 1 strongly recommended	✅ Yes — subprocess/container worker, no K8s
3 — K8s backend	❌ No	1 + 2	❌ No — needs spec, store, and worker
4 — MCP sidecars	⚠️ Partially	3 for end-to-end validation	⚠️ Partial — manifests/docs yes; E2E proof needs K8s
5 — Ops polish	⚠️ Partially	3 for most items	⚠️ Partial — image pinning/runbook can start early

What can run in parallel

After Phase 0 is locked:

Track A	Track B	Notes
Phase 1 (coordinator + run store, in-process)	Phase 2 (worker CLI + Dockerfile)	Parallel if step spec JSON is frozen in Phase 0. Prefer 1 as lead — it defines how specs are built; 2 consumes them.
Phase 4 design/docs (sidecar manifests, MCP matrix)	Phase 1 or 2	Draft sidecars before K8s exists; cannot prove until Phase 3.
Phase 5 CrewAI pin / worker image policy	Phase 2	Image tagging and upgrade runbook do not require a cluster.

Phase 3 should start once K0 is signed off, the subprocess demo works (✅ F4), and the worker image (K2.3) exists for Job pods.

Phases 4 and 5 are enhancements, not prerequisites for a minimal K8s demo (HTTP MCPs only).

Independent workstreams (tracks)

Treat these as separate milestones you can stop after:

Track	Phases	Outcome	Skip
Refactor only	0, 1	`ExecutionBackend`, step specs, in-process loop — no containers, no K8s	2–5
Worker isolation	0, 1, 2	`--execute-step`, `SubprocessExecutionBackend`, subprocess integration tests	3–5 ✅ shipped via F4
Kubernetes (HTTP MCPs)	0, 1, 2, 3	Full K8s sequential runs with streamable HTTP MCPs	4 (initially), 5
K8s MCP parity	0–4	Stdio MCPs via sidecars; planner catalog policy	5 until needed
Production hardening	5 (after 3)	Warm pool, centralized logging, load tests	—

Minimal paths by goal

Goal	Phases required	Can skip
Better code structure only	0, 1	2–5
Containerized agents, no K8s	0, 1, 2	3–5 (subprocess path ✅; container image (2.3) still open)
K8s with HTTP MCPs only	0, 1, 2, 3	4 (initially), 5
Full K8s parity with local MCPs	0–4	5 until needed

Practical rules

Phase 0 is the only true prerequisite for everything — without frozen step/result schemas, Phases 1 and 2 will diverge.
Phases 1 and 2 are loosely coupled — parallelizable after 0, but Phase 1 should own spec generation; Phase 2 only consumes specs.
Phase 3 is not independent — it glues Phase 1 (coordinator + store) and Phase 2 (worker) onto a cluster.
Phases 4 and 5 are optional depth — not required for a first K8s demo.
Each track can ship on its own — stopping after Phase 1 or Phase 1+2 is valid; you do not need K8s to get value from this roadmap.

Post-F4 plan adjustments (2026-06)

Framework F4 (subprocess backend) validated the distributed contract locally. No architecture replan — K3 remains “swap subprocess spawn for K8s Job.” Updates from implementation:

Topic	Adjustment
K1 / K2	Effectively done via F2 + F4; remaining K2 work is worker Dockerfile (2.3) and log prefixes (2.2).
Run store v1	PVC + `FileSystemRunStore` at a mounted path (e.g. `/run/store`) — reuse subprocess code; S3/Redis deferred.
K3 runner	Add `kubernetes_runner.py` mirroring `subprocess_runner.py` — same `StepCoordinator`, different spawn.
CLI dispatch	K3.0: extend `execution_dispatch.py` so `kubernetes` backend routes to `execute_config` (today only `AGENTIC_SUBPROCESS_WORKERS=1` enables distributed path).
Phase 1.5	Cancelled — in-process keeps whole-crew kickoff; not required for K8s.
HF fallback / recovery	K3 MVP: workflow-level only (already in `main`); per-step Job retry deferred post-MVP.
Phase 1.6	Done in framework F3 (`output_artifacts.py` adapters).

Minimal path to K3: ~~K0 sign-off~~ ✅ → K2.3 worker image → K3.0 dispatch → K3.1–3.3 Job + PVC → K3.8 kind test (HTTP MCPs only; K4 for stdio).

Phased roadmap

Track progress by checking boxes as work completes. See Phase dependencies and parallelism before scheduling work out of order.

Phase 0 — Design lock (no K8s required)

Standalone: ✅ Yes — gate for all other phases.

Companion plan: Pair with Dual execution framework Phase F0 (Python types must match schemas below).

Objective: Agree on contracts so Phases 1 and 2 can proceed in parallel and Phase 3 can integrate them without rework.

0.1 Step spec JSON schema v0.1 — implemented in StepSpec.to_dict() / materializer; formal wiki review optional.
0.2 Worker result.json contract v0.1 — implemented in StepResult / execute_step.py; formal wiki review optional.
0.3 Run store v1: PVC + FileSystemRunStore at mounted path (S3/MinIO/Redis deferred).
0.4 Sign off ExecutionBackend protocol — shipped in framework F0 (Dual execution framework).
0.5 Sign off env flag: AGENTIC_EXECUTION_BACKEND=inprocess|subprocess|kubernetes (default inprocess) — implemented in orchestration/backends/factory.py.
0.6 K8s MCP compatibility matrix signed off (MCP matrix) — policy in orchestration/k8s_mcp_compat.py; planner filter K4.3 ✅.

Exit criteria: Schema + contracts merged; Phase 0 complete ✅. K3 MVP uses HTTP-native MCPs only unless AGENTIC_K8S_ALLOW_STDIO_MCPS=1 after K4 sidecars.
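
To illustrate the env flag in 0.5, here is a minimal sketch of resolving the backend name. The function name `resolve_backend_name` is hypothetical and is not the factory in orchestration/backends/factory.py; only the variable name, the three values, and the `inprocess` default come from this page.

```python
import os

VALID_BACKENDS = ("inprocess", "subprocess", "kubernetes")


def resolve_backend_name(env=None):
    """Return the backend selected by AGENTIC_EXECUTION_BACKEND.

    An unset or blank value falls back to "inprocess", the documented default.
    """
    env = os.environ if env is None else env
    raw = (env.get("AGENTIC_EXECUTION_BACKEND") or "").strip().lower()
    if not raw:
        return "inprocess"
    if raw not in VALID_BACKENDS:
        # Fail loudly so a typo never silently selects the wrong backend.
        raise ValueError(
            f"AGENTIC_EXECUTION_BACKEND={raw!r} is not one of {VALID_BACKENDS}"
        )
    return raw
```

Rejecting unknown values, rather than falling back to the default, is a design choice of this sketch; it means a misspelled `kubernetes` cannot quietly run in-process.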

Phase 1 — Step spec + run store (local only)

Standalone: ✅ Yes — delivers run store and step contract for distributed backends.

Depends on: Phase 0.

Companion plan: Overlaps Dual execution framework Phases F2 (materializer, StepCoordinator) and F1 (CrewAI backend extract). Coordinate so run store and step specs are not implemented twice.

Parallel with: Phase 2 (after Phase 0; Phase 1 should lead spec generation).

Objective: Run store and step handoff contract ready for subprocess/K8s workers. ✅ Shipped via framework F2 + F4.

Code touchpoints:

orchestration/run_store.py — abstract + filesystem impl ✅
orchestration/workflow_materializer.py, step_coordinator.py — framework F2 ✅
orchestration/backends/crewai.py — framework F1 ✅

Tasks:

1.1 Implement StepSpec / StepResult dataclasses aligned with schema below (orchestration/backends/base.py).
1.2 Implement build_step_specs(config: WorkflowConfig) -> list[StepSpec] (workflow_materializer.py).
1.3 Port prior-output injection to prepare_step_description(step, prior_output) (step_context.py).
1.4 Implement filesystem run store: {run_id}/{step_id}/result.json (run_store.py).
1.5 ~~Implement InProcessExecutionBackend step loop~~ — Cancelled (in-process keeps whole-crew kickoff per framework F2.5; not required for K8s).
1.6 Adapter for output_artifacts.py to consume StepResult / result.json — shipped in framework F3.
1.7 Unit tests: inject logic, store round-trip, materializer from default workflow YAML.
1.8 AGENTIC_RUN_STORE_PATH + run_store_session() — PVC-friendly {base}/{run_id}/ layout; temp dir when unset (local dev).

Exit criteria: Distributed backends use StepCoordinator + run store; subprocess integration test passes. ✅
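
The store layout from tasks 1.4 and 1.8 can be pictured with a small sketch. `SketchRunStore` is a hypothetical stand-in, not the `FileSystemRunStore` in run_store.py; the write-then-rename step is an assumption added here so that a coordinator polling a shared volume does not read a half-written file.

```python
import json
from pathlib import Path
from typing import Optional


class SketchRunStore:
    """Hypothetical store using the {base}/{run_id}/{step_id}/result.json layout."""

    def __init__(self, base, run_id: str) -> None:
        self.run_dir = Path(base) / run_id

    def result_path(self, step_id: str) -> Path:
        return self.run_dir / step_id / "result.json"

    def write_result(self, step_id: str, result: dict) -> Path:
        path = self.result_path(step_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a sibling temp file, then rename into place.
        tmp = path.parent / (path.name + ".tmp")
        tmp.write_text(json.dumps(result, indent=2), encoding="utf-8")
        tmp.replace(path)
        return path

    def read_result(self, step_id: str) -> Optional[dict]:
        path = self.result_path(step_id)
        if not path.exists():
            return None  # the step has not finished (or never ran)
        return json.loads(path.read_text(encoding="utf-8"))
```

Because the layout is plain directories and JSON, the same code works against a local temp dir and a mounted PVC.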

Phase 2 — Worker entrypoint (mini-Crew per pod)

Standalone: ✅ Yes — isolated worker via subprocess/container; no K8s required.

Depends on: Phase 0 (required); Phase 1 (strongly recommended — spec builder + run store).

Parallel with: Phase 1 (after Phase 0).

Objective: Worker image can execute one step from a spec file and write results.

Code touchpoints:

main.py — add --execute-step PATH (and --run-id, --step-id)
agent_providers/* — unchanged
orchestration/crewai_mcp_hotfix.py — loaded in worker

Tasks:

2.1 CLI: --execute-step loads JSON, builds one agent/task/crew, kickoff, writes result.json (execute_step.py).
2.2 Worker writes stderr/stdout logs with run_id and step_id prefixes (orchestration/worker_logging.py).
2.3 Dockerfile: orchestrator-worker image (docker/Dockerfile.worker, docker/README.worker.md).
2.4 Local smoke test: coordinator calls worker via subprocess + shared temp dir — subprocess_runner.py + tests/test_backend_subprocess.py (framework F4.3).
2.5 Document required Secrets → env mapping (.env.example, docker/README.worker.md).

Exit criteria: Subprocess integration test ✅; worker image builds and passes scripts/docker-worker-smoke.ps1 (invalid spec → exit 2). ✅
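
Task 2.2 asks for run_id and step_id on every worker log line. One way to get that with the standard logging module is sketched below. `StepLogFormatter` and its exact output shape are assumptions, not the contents of orchestration/worker_logging.py; the `json` branch is included only to show how the same formatter could serve `AGENTIC_LOG_FORMAT=json`.

```python
import json
import logging


class StepLogFormatter(logging.Formatter):
    """Tag every log record with the run_id and step_id of the current step."""

    def __init__(self, run_id: str, step_id: str, log_format: str = "text") -> None:
        super().__init__()
        self.run_id = run_id
        self.step_id = step_id
        self.log_format = log_format

    def format(self, record: logging.LogRecord) -> str:
        message = record.getMessage()
        if self.log_format == "json":
            # One JSON object per line, suitable for a log aggregator.
            return json.dumps({
                "run_id": self.run_id,
                "step_id": self.step_id,
                "level": record.levelname,
                "message": message,
            })
        return f"[run_id={self.run_id} step_id={self.step_id}] {record.levelname} {message}"
```

A worker would attach it once at startup, for example `handler = logging.StreamHandler(); handler.setFormatter(StepLogFormatter(run_id, step_id))`, so that library code logging through the root logger is tagged as well.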

Phase 3 — Kubernetes coordinator backend

Standalone: ❌ No — integration milestone only.

Depends on: Phase 1 ✅ + Phase 2 subprocess proof ✅ + worker image (2.3).

Companion plan: Requires framework F0–F4 complete ✅. Implements framework F5 (KubernetesExecutionBackend).

Blocks: Phase 4 E2E validation; most of Phase 5.

Objective: Coordinator creates Jobs per step, waits for completion, reads run store — same loop as subprocess_runner.py.

Shared runner pattern:

subprocess_runner.py          kubernetes_runner.py
        │                              │
        └─ StepCoordinator.run_sequential
        └─ build_step_specs + FileSystemRunStore
        └─ spawn: subprocess.run       spawn: K8s Job (worker image, --execute-step)
        └─ read result.json            read result.json (same path on PVC)
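
The shared pattern in the diagram above comes down to one loop with a pluggable spawn function. The sketch below is illustrative only: `run_sequential` here is a free function with a simplified signature, not the real `StepCoordinator.run_sequential`, and retries are left out.

```python
from typing import Callable, Dict, List


def run_sequential(specs: List[Dict], spawn: Callable[[Dict], Dict]) -> List[Dict]:
    """Run steps in order, handing each step's result_text to the next one.

    spawn is the only backend-specific piece: a subprocess call in one
    runner, a Kubernetes Job in the other. It returns the parsed result.json.
    """
    results: List[Dict] = []
    prior_output = ""
    for spec in specs:
        step = dict(spec, prior_output=prior_output)  # copy; never mutate the plan
        result = spawn(step)
        results.append(result)
        if result.get("exit_code", 1) != 0:
            break  # stop the chain; retry policy is out of scope for this sketch
        prior_output = result.get("result_text", "")
    return results
```

With this shape, moving from subprocess to Kubernetes means supplying a different `spawn` callable while the handoff logic stays untouched.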

Code touchpoints:

orchestration/backends/kubernetes_runner.py — new (mirror subprocess_runner.py)
orchestration/backends/kubernetes.py — delegate execute_config to runner (framework F5)
orchestration/execution_dispatch.py — K3.0: route kubernetes backend to execute_config
main.py — already wired via execution_backend_from_env() + execute_workflow_from_config (framework F3/F4)
orchestration/orchestrator_session.py — store pod names, exit codes, timing
agentic-orchestration-web/server.mjs — progress lines from coordinator (unchanged format)

Tasks:

3.0 Extend execution_dispatch.py: use_distributed_execute_config() routes kubernetes → execute_config.
3.1 K8s client — kubernetes Python package + KubernetesJobRunner (kubernetes_jobs.py).
3.2 Job template: labels run_id, step_id, agent_provider_id; TTL; worker args …/{run_id}/{step_id}-spec.json.
3.3 PVC mount on worker Jobs (AGENTIC_K8S_RUN_STORE_PVC); coordinator uses AGENTIC_RUN_STORE_PATH + FileSystemRunStore.
3.4 Per-step HF execution fallback: failed Job → parse error → rebuild config → new Job (step_recovery.py, StepCoordinator retry).
3.5 Per-step provider recovery via recovery_hint in StepCoordinator (provider_recovery → recover_from_workflow_error).
3.6 Workflow result records k8s_jobs metadata per step Job (WorkflowExecutionResult.k8s_jobs); session wiring deferred.
3.7 Coordinator Deployment + RBAC (deploy/k8s/coordinator/); sample worker Job (worker-job.example.yaml).
3.8 Integration test: mocked Jobs (tests/test_backend_kubernetes.py) + live kind e2e in CI (tests/test_kind_kubernetes_e2e.py, stub worker).
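
Tasks 3.2 and 3.3 describe the Job template in prose; the sketch below shows what such a manifest could look like as a plain dict. `build_step_job` is a hypothetical helper, not the code in kubernetes_jobs.py. The Job naming scheme, the `run-store` volume name, `backoffLimit: 0`, and the assumption that the image entrypoint accepts `--execute-step` as an argument are all choices made for illustration; the spec path is passed in by the caller because its prefix is not fixed here.

```python
def build_step_job(run_id, step_id, agent_provider_id, image, pvc_name, spec_path,
                   namespace="agentic-orchestration", ttl_seconds=3600,
                   mount_path="/run/store"):
    """Return a batch/v1 Job manifest (as a dict) for one planned step."""
    labels = {"run_id": run_id, "step_id": step_id, "agent_provider_id": agent_provider_id}
    # Object names must be lowercase DNS-style labels, so "_" is not allowed.
    name = f"step-{run_id}-{step_id}".lower().replace("_", "-")[:63].rstrip("-")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "ttlSecondsAfterFinished": ttl_seconds,  # finished Jobs are garbage-collected
            "backoffLimit": 0,  # assumption: the coordinator, not K8s, decides on retries
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "args": ["--execute-step", spec_path],
                        "volumeMounts": [{"name": "run-store", "mountPath": mount_path}],
                    }],
                    "volumes": [{
                        "name": "run-store",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                },
            },
        },
    }
```

A dict like this can be handed to the kubernetes Python client's `BatchV1Api.create_namespaced_job` as the request body, or dumped to YAML for review.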

K3 MVP exit criteria: Code path complete ✅; kind cluster e2e in CI (stub worker, no LLM). Manual kind + real worker image for LLM validation optional.

Phase 4 — MCP sidecars and catalog policy

Standalone: ⚠️ Partial — manifests and catalog flags can land early; E2E proof needs Phase 3.

Depends on: Phase 3 for validation.

Parallel with: Phase 5 (after Phase 3).

Objective: Stdio MCPs work in K8s or are cleanly excluded from planner catalog.

Tasks:

4.1 Sidecar pattern doc + example manifest for fetch_url (deploy/k8s/mcp-sidecars/).
4.2 Optional cluster Deployments for fetch/filesystem MCP as HTTP gateways.
4.3 When AGENTIC_EXECUTION_BACKEND=kubernetes, filter planner MCP catalog (apply_kubernetes_mcp_catalog_policy).
4.4 Image pre-pull DaemonSet (deploy/k8s/worker-image-prep.yaml) + docs.
4.5 Resource requests/limits + GPU nodeSelector via env (K8sSettings, k8s_worker_pod.py).

Exit criteria: At least one stdio MCP works via sidecar (manual kind smoke with AGENTIC_K8S_POD_SIDECAR_MCPS=fetch_url or cluster gateway); planner never assigns broken MCP combos in K8s mode ✅ (unit tests + apply_kubernetes_mcp_catalog_policy).

Phase 5 — Operational polish (optional)

Standalone: ⚠️ Partial — image pinning and runbook (5.3) can start during Phase 2; warm pool and load tests need Phase 3.

Depends on: Phase 3 for most items; benefits from Phase 4.

Parallel with: Phase 4 (after Phase 3).

Objective: Production readiness for longer-running collaboration.

5.1 Warm pool: idle worker pods + coordinator dispatch (deploy/k8s/warm-pool.yaml, kubernetes_warm_pool.py).
5.2 Centralized logging contract (deploy/k8s/LOGGING.md, AGENTIC_LOG_FORMAT=json).
5.3 Pin CrewAI in worker image; document upgrade runbook (crewai==1.12.2, docker/CREWAI_UPGRADE.md).
5.4 Load test: scripts/k8s-load-test.ps1 / .sh (N concurrent runs, p50/p95).
5.5 Delegation RPC: worker k8s_delegate_task tool + delegation-broker Deployment spawns child Jobs (kubernetes_delegation.py).
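
For the p50/p95 figures mentioned in 5.4, a nearest-rank percentile is one simple definition. The helper below is a sketch under that assumption; the load-test scripts may compute percentiles differently.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it.

    p is an integer from 1 to 100.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]
```

Given run durations in seconds, `percentile(durations, 50)` and `percentile(durations, 95)` give the two numbers the load test reports. Nearest-rank always returns an observed value, which keeps small-sample results easy to trace back to a specific run.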

Next (optional): Agent societies roadmap — K6 phased plan for autonomous multi-agent societies (blackboard, protocol engine, society broker, web graph UI). Builds on K5.5 delegation and warm pool; revisits deferred in-run CrewAI delegation / hierarchical managers.

Step spec JSON schema (draft v0.1)

Written by coordinator, consumed by worker --execute-step.

{
  "schema_version": "0.1",
  "run_id": "uuid",
  "step_id": "step_2",
  "step_index": 1,
  "workflow_name": "dynamic-plan-2025-06-26",
  "topic": "User goal text",
  "task": {
    "description": "Full task description after prior-output injection",
    "expected_output": "What success looks like"
  },
  "agent_provider": {
    "id": "gpt_research",
    "type": "openai",
    "role": "Research Analyst",
    "goal": "...",
    "backstory": "...",
    "model": "gpt-4o-mini",
    "verbose": true,
    "allow_delegation": false,
    "openai_base_url": "",
    "ollama_host": ""
  },
  "mcp_providers": [
    { "id": "search_brave", "resolved": { "streamable_http": { "url": "..." } } }
  ],
  "prior_output": "Text from previous step, or empty string",
  "inputs": {
    "topic": "User goal text"
  },
  "paths": {
    "run_store": "/run/store",
    "artifacts_dir": "/run/store/artifacts"
  }
}

Rules:

prior_output is already merged into task.description by the coordinator (same as today’s inject marker); the worker may ignore the separate prior_output field.
agent_provider is the resolved catalog entry (post credential filter), not a live object — same dict shape as today’s YAML-derived provider payloads.
mcp_providers[].resolved matches output of resolve_workflow_mcp_refs().
This JSON is worker transport, not a replacement for config/agent_providers/ or config/mcp_providers/ YAML.
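
The first rule says the coordinator merges prior_output into task.description before the spec is written. A minimal sketch of such a merge is below. The function name `inject_prior_output`, the marker text, and the head-truncation policy are all hypothetical; the real logic lives in `step_context.prepare_step_description`.

```python
# Hypothetical marker text; the real inject marker may differ.
PRIOR_OUTPUT_MARKER = "--- Output from previous step ---"


def inject_prior_output(description: str, prior_output: str, max_chars: int = 4000) -> str:
    """Append the previous step's output to a task description.

    Idempotent: a description that already carries the marker is returned as-is.
    """
    prior = (prior_output or "").strip()
    if not prior or PRIOR_OUTPUT_MARKER in description:
        return description
    if len(prior) > max_chars:
        prior = prior[:max_chars]  # assumption: keep the head when truncating
    return f"{description}\n\n{PRIOR_OUTPUT_MARKER}\n{prior}"
```

Making the merge idempotent is what lets a worker safely ignore the separate prior_output field: running the injection twice cannot duplicate the context.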

Worker result contract (draft v0.1)

Written to {run_store}/{step_id}/result.json, where {run_store} is the per-run directory ({base}/{run_id}/, see tasks 1.4 and 1.8).

{
  "schema_version": "0.1",
  "run_id": "uuid",
  "step_id": "step_2",
  "exit_code": 0,
  "result_text": "Final agent output as string",
  "result_format": "plain",
  "error": null,
  "recoverable": false,
  "recovery_hint": null,
  "artifacts": [
    { "relative_path": "artifacts/report.md", "mime": "text/markdown" }
  ],
  "timing": {
    "started_at": "ISO-8601",
    "finished_at": "ISO-8601"
  },
  "k8s": {
    "pod_name": "step-2-abc123",
    "node_name": "worker-7"
  }
}

On failure:

{
  "exit_code": 1,
  "error": "LiteLLM HuggingFaceException: ...",
  "recoverable": true,
  "recovery_hint": "hf_litellm_fallback"
}

The coordinator maps recovery_hint to the existing Python recovery code (execution_fallback, recover_from_workflow_error).
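
One way to picture the recovery_hint mapping is a dispatch table keyed by hint. The sketch below is illustrative: `plan_retry` and `switch_to_ollama` are hypothetical names, and the handler only edits the spec dict, whereas the real recovery code rebuilds the workflow config.

```python
from typing import Callable, Dict, Optional

RecoveryHandler = Callable[[dict], dict]


def plan_retry(spec: dict, result: dict,
               handlers: Dict[str, RecoveryHandler]) -> Optional[dict]:
    """Return a rebuilt step spec to retry with, or None when no retry applies."""
    if result.get("exit_code", 1) == 0 or not result.get("recoverable"):
        return None
    handler = handlers.get(result.get("recovery_hint") or "")
    if handler is None:
        return None  # unknown hint: surface the original error instead of guessing
    return handler(spec)


def switch_to_ollama(spec: dict) -> dict:
    """Hypothetical handler: copy the spec and point its provider at Ollama."""
    provider = dict(spec.get("agent_provider", {}), type="ollama")
    return dict(spec, agent_provider=provider)
```

For example, `plan_retry(spec, failed_result, {"hf_litellm_fallback": switch_to_ollama})` yields a new spec for a fresh worker, leaving the original spec untouched for the session record.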

ExecutionBackend (defined in companion plan)

The protocol, factory, and three backend classes are specified in Dual execution framework — not duplicated here.

Backend	Env value	Defined in
`CrewAIExecutionBackend`	`inprocess` (default)	Framework F1
`SubprocessExecutionBackend`	`subprocess`	Framework F4 + K8s Phase 2 CLI
`KubernetesExecutionBackend`	`kubernetes`	Framework F5 + K8s Phase 3

MCP compatibility matrix (K8s mode)

Status: ✅ Signed off (K0.6, 2026-06-27) — verified against shipped catalog in config/mcp_providers/. Code: orchestration/k8s_mcp_compat.py. Planner filter: K4.3 ✅ (apply_kubernetes_mcp_catalog_policy).

Shipped catalog

MCP id	Transport today	K3 MVP	K8s v2+ (K4 sidecars)
`search_brave`	streamable_http	✅ native	✅
`search_tavily`	streamable_http	✅ native	✅
`home_assistant`	streamable_http	✅ native	✅
`search_exa`	stdio (`npx exa-mcp-server`)	❌ excluded	⚠️ sidecar
`fetch_url`	stdio (`python -m mcp_server_fetch`)	❌ excluded	✅ worker stdio (default) or cluster gateway / sidecar
`filesystem_local`	stdio (`npx` filesystem server)	❌ excluded	✅ worker stdio + PVC subdir (default) or sidecar/gateway
`memory_knowledge_graph`	stdio (`npx` memory server)	❌ excluded	⚠️ sidecar

K3 MVP allowlist: search_brave, search_tavily, home_assistant only.

Approved planner rule (implementation: K4.3)

When AGENTIC_EXECUTION_BACKEND=kubernetes:

Default: planner / materializer sees only K3 MVP MCP ids (K8S_NATIVE_MCP_IDS in code).
Opt-in stdio: set AGENTIC_K8S_ALLOW_STDIO_MCPS=1 only after a sidecar template exists for that MCP (K4).
Unknown ids (extra catalog paths): excluded in K8s mode until explicitly classified in k8s_mcp_compat.py or documented.

No new YAML schema — runtime policy only (Dual execution framework).
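
The planner rule above can be sketched as a pure filter over MCP ids. The id sets mirror the shipped-catalog table on this page; `filter_mcp_catalog` and `KNOWN_STDIO_MCP_IDS` are hypothetical names and do not reproduce `apply_kubernetes_mcp_catalog_policy`.

```python
K8S_NATIVE_MCP_IDS = frozenset({"search_brave", "search_tavily", "home_assistant"})
# Hypothetical name: stdio ids classified in the shipped-catalog table.
KNOWN_STDIO_MCP_IDS = frozenset(
    {"search_exa", "fetch_url", "filesystem_local", "memory_knowledge_graph"}
)


def filter_mcp_catalog(mcp_ids, backend, allow_stdio=False):
    """Return the MCP ids the planner may assign for the given backend."""
    if backend != "kubernetes":
        return list(mcp_ids)  # other backends see the full catalog
    allowed = set(K8S_NATIVE_MCP_IDS)
    if allow_stdio:
        allowed |= KNOWN_STDIO_MCP_IDS
    # Ids that are neither native nor classified stay excluded in K8s mode.
    return [mcp_id for mcp_id in mcp_ids if mcp_id in allowed]
```

Keeping the policy as a filter over ids means the YAML catalogs stay identical across backends, which is the point of the runtime-only rule.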

Environment variables (proposed)

Add to .env.example when implementing:

Variable	Default	Purpose
`AGENTIC_EXECUTION_BACKEND`	`inprocess`	`inprocess` | `kubernetes` | `subprocess`
`AGENTIC_RUN_STORE_PATH`	(unset)	Mounted run store root; per-run dir `{path}/{run_id}/`. Temp dir when unset.
`AGENTIC_K8S_NAMESPACE`	`agentic-orchestration`	Job namespace
`AGENTIC_K8S_WORKER_IMAGE`	—	Worker container image
`AGENTIC_K8S_RUN_STORE_PVC`	—	PVC name for run handoffs
`AGENTIC_K8S_JOB_TTL_SECONDS`	`3600`	Finished Job TTL
`AGENTIC_K8S_ALLOW_STDIO_MCPS`	`0`	When `1`, allow stdio MCP ids in K8s mode (requires K4 sidecar/gateway).
`AGENTIC_K8S_WORKER_STDIO_MCPS`	`fetch_url`	Stdio MCP ids spawned inside the worker container (`mcp-server-fetch` in worker image). Preferred for `fetch_url`.
`AGENTIC_K8S_MCP_FETCH_URL`	—	Cluster HTTP gateway for `fetch_url` (stdio → streamable_http rewrite)
`AGENTIC_K8S_MCP_FILESYSTEM_URL`	—	Cluster HTTP gateway for `filesystem_local`
`AGENTIC_K8S_POD_SIDECAR_MCPS`	—	Comma-separated MCP ids for in-pod supergateway sidecars (e.g. `filesystem_local`)
`AGENTIC_K8S_SUPERGATEWAY_IMAGE`	`supercorp/supergateway:uvx`	Sidecar image for stdio→HTTP bridge
`AGENTIC_K8S_SUPERGATEWAY_STATEFUL`	`0`	When `1`, pass `--stateful` to supergateway (bridge tuning for CrewAI HTTP client)
`AGENTIC_K8S_WORKER_RESOURCES`	—	Optional JSON requests/limits for worker container
`AGENTIC_K8S_GPU_NODE_SELECTOR`	—	Optional JSON nodeSelector when step provider is in `AGENTIC_K8S_GPU_PROVIDER_IDS`
`AGENTIC_K8S_GPU_PROVIDER_IDS`	—	Comma-separated provider ids that trigger GPU nodeSelector
`AGENTIC_K8S_ENV_SECRET`	—	Secret name for worker/coordinator env (`agentic-orchestrator-env`)
`AGENTIC_K8S_WARM_POOL_ENABLED`	`0`	When `1`, dispatch steps without sidecars via PVC queue (`deploy/k8s/warm-pool.yaml`)
`AGENTIC_K8S_DELEGATION_ENABLED`	`0`	When `1`, workers get `k8s_delegate_task` tool; requires `delegation-broker` Deployment
`AGENTIC_K8S_DELEGATION_TIMEOUT_SECONDS`	`3600`	Worker wait for delegation broker response
`AGENTIC_LOG_FORMAT`	`text`	`json` for structured logs (K5.2 — Loki/Datadog)

Existing variables (AGENTIC_STEP_CONTEXT_CHARS, AGENTIC_PROGRESS, provider API keys) unchanged.
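
Several of the variables above carry structured values (flags, comma-separated ids, JSON). The sketch below shows one way to parse a handful of them. `K8sEnvSketch` and `load_k8s_env` are hypothetical and cover only a subset of the table; they are not the `K8sSettings` class mentioned in task 4.5.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


def _csv(value) -> List[str]:
    """Split a comma-separated value, dropping blanks and surrounding spaces."""
    return [part.strip() for part in (value or "").split(",") if part.strip()]


@dataclass
class K8sEnvSketch:
    namespace: str
    job_ttl_seconds: int
    allow_stdio_mcps: bool
    pod_sidecar_mcps: List[str]
    worker_resources: Optional[dict]


def load_k8s_env(env) -> K8sEnvSketch:
    """Parse a subset of the AGENTIC_K8S_* variables from a mapping such as os.environ."""
    raw_resources = (env.get("AGENTIC_K8S_WORKER_RESOURCES") or "").strip()
    return K8sEnvSketch(
        namespace=env.get("AGENTIC_K8S_NAMESPACE") or "agentic-orchestration",
        job_ttl_seconds=int(env.get("AGENTIC_K8S_JOB_TTL_SECONDS") or "3600"),
        allow_stdio_mcps=(env.get("AGENTIC_K8S_ALLOW_STDIO_MCPS") or "0") == "1",
        pod_sidecar_mcps=_csv(env.get("AGENTIC_K8S_POD_SIDECAR_MCPS")),
        # JSON requests/limits; invalid JSON raises here, at startup.
        worker_resources=json.loads(raw_resources) if raw_resources else None,
    )
```

Parsing everything in one place at startup means a malformed `AGENTIC_K8S_WORKER_RESOURCES` fails before any Job is created rather than midway through a run.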

Testing strategy

Phase	Tests
1	Unit: inject, store, step spec builder — ✅ shipped
2	Subprocess 2-step mocked worker — ✅ `tests/test_backend_subprocess.py`; Docker smoke — ✅ CI `docker-worker-smoke` job
3	kind/minikube: 2-step plan; workflow-level HF fallback
4	Sidecar MCP smoke test (manual kind); unit: `test_k8s_mcp_compat.py`, `test_k8s_worker_pod.py` — ✅
5	Warm pool + JSON logging + load test (`test_k5_operational.py`, `scripts/k8s-load-test.*`) — ✅
5	Load + log correlation

Regression bar: In-process mode (AGENTIC_EXECUTION_BACKEND=inprocess) must pass existing behavior for --dynamic and static workflows after each phase.

Decision log

Record decisions here as the project proceeds.

Date	Decision	Rationale
2025-06-26	Adopt mini-Crew per pod (Option A)	Preserves MCP loop and provider code; matches current sequential usage
2025-06-26	Coordinator owns step loop, not whole-crew kickoff	Aligns with existing planner-as-brain architecture
2025-06-26	Phases 1 ‖ 2 parallel after Phase 0; Phase 3 requires both	Maximizes independent delivery; avoids K8s before local subprocess works
2025-06-26	Split plans: Dual execution framework (code seam) + this page (K8s)	Framework F0–F4 before K8s Phase 3
2026-06-26	Framework F4 complete — subprocess proves distributed contract	K3 is adapter swap, not replan
2026-06-26	Run store v1: PVC + `FileSystemRunStore`	Reuse local/subprocess code; mount at `/run/store`; S3/Redis deferred
2026-06-26	K3 MVP: workflow-level HF fallback only	Per-step Job retry (3.4–3.5) deferred; `main._run_dynamic_workflow_with_hf_fallback` already covers distributed runs
2026-06-26	Phase 1.5 (`InProcessExecutionBackend` step loop) cancelled	In-process keeps whole-crew kickoff; K8s uses `StepCoordinator` only
2026-06-27	K0.6 MCP matrix signed off	K3 MVP: `search_brave`, `search_tavily`, `home_assistant`; stdio excluded until K4; policy in `k8s_mcp_compat.py`

Open questions

One Job per step vs one Job with init containers per step? — Per-step Jobs simplify isolation and match sequential semantics; init-chain is faster but weaker isolation.
Coordinator in web pod vs separate Deployment? — Start embedded in existing orchestration container; split later if needed.
Iterative dynamic mode: replan between steps — coordinator must refresh step specs mid-run; confirm session/planner API.
GPU scheduling: derive from provider YAML min_vram_gb or separate scheduling profile id?
Multi-tenant: namespace per tenant vs label isolation?

Suggested work order (current — post-F4)

Next up (K3 track)

K0.6 — sign off MCP compatibility matrix ✅ Done.
K2.3 — orchestrator-worker Dockerfile (required for Job pods).
K3.0 — execution_dispatch routes kubernetes → execute_config.
K3.1–3.3 — kubernetes_runner.py + Job template + PVC mount.
K3.8 — kind/minikube 2-step integration test.
K3.6–3.7 — session metadata + coordinator Deployment manifest.
K4 — stdio MCP sidecars + planner filter (K4.3 uses k8s_mcp_compat.py).
K3.4–3.5, K5 — per-step retry, warm pool, load tests (post-MVP).

Completed (framework + K1/K2 overlap)

Done	Item
✅	F0–F4 + K1 (materializer, coordinator, run store)
✅	K2.1, K2.4, K2.5 (`--execute-step`, subprocess integration test)
✅	F3 post-run adapters (K1.6)

Linear default (historical — for reference)

Framework plan (Dual execution framework): F0 → F1 → F2 → F3 → F4 ✅

K8s plan (this page): K2.3 → K3 → K4 → K5

Parallel schedule (two contributors or split focus)

gantt
  title Parallel workstreams after Phase 0
  dateFormat YYYY-MM-DD
  section Gate
  Phase 0 Design lock           :p0, 2025-07-01, 7d
  section Track A
  Phase 1 Step spec + store     :p1, after p0, 21d
  Phase 3 K8s backend           :p3, after p1 p2, 21d
  section Track B
  Phase 2 Worker entrypoint     :p2, after p0, 21d
  Phase 4 MCP sidecars prep     :p4prep, after p0, 14d
  Phase 4 MCP sidecars E2E      :p4, after p3, 14d
  section Either
  Phase 5 Ops polish            :p5, after p3, 14d

Week / focus	Owner A	Owner B
1	Phase 0 (pair)	Phase 0 (pair)
2–4	Phase 1	Phase 2
3–4	—	Phase 4 prep (manifests, catalog flags)
5–7	Phase 3 (after 1+2 subprocess demo)	Phase 2 finish / worker image hardening
8+	Phase 5 or iterative mode in K8s	Phase 4 E2E sidecars

Valid stopping points

You do not need to complete all phases:

Stop after Phase 1 — architecture improvement only; zero deployment change.
Stop after Phase 2 — subprocess workers on bare metal ✅ (F4); container image optional.
Stop after Phase 3 — K8s execution with HTTP MCPs; defer stdio MCPs and warm pool.

Wiki maintenance

Cross-check phase checkboxes after F4 subprocess smoke (this page updated 2026-06).
Update Infrastructure with K8s Deployment/Job manifests and networking (K3.7).
Update Architecture execution diagram for distributed backends.
Update Configuration with K8s env vars when K3 lands.