Kubernetes execution upgrade roadmap
Living document for evolving Agentic Orchestration from in-process CrewAI kickoff() to optional Kubernetes-backed step execution (pod-per-step / pod-per-agent), while keeping the planner, catalogs, sessions, learning, and web UI.
Status: K3 MVP + K4 + K5 implemented.
Companion plan: Dual execution framework — Python refactor for pluggable execution backends (CrewAI in-process default, subprocess, kubernetes). This page owns cluster delivery; the companion page owns the code seam. Framework F0–F4 is complete; K8s Phase 3 implements framework F5.
Related: Architecture, Infrastructure, Dynamic planning, MCP providers, Configuration, Sessions learning and knowledge base, Dual execution framework
Relationship to the dual execution framework plan
Two separate roadmaps; one system.
| | Dual execution framework | This page (K8s upgrade) |
|---|---|---|
| Delivers | ExecutionBackend, StepCoordinator, backend factory, CrewAI extract | Worker image, Jobs, PVCs, sidecars, kind tests |
| Required first | F0–F4 complete ✅ | K0 sign-off + K2.3 worker image before K3 |
| Value without cluster | ✅ F4 = subprocess workers prove distributed contract | ❌ until Phase 3 |
| Shared artifacts | Python types | Step spec + result JSON schemas (canonical) |
See the companion plan for module layout (orchestration/backends/), ExecutionBackend protocol, and framework Phases F0–F5.
Goals
| Goal | Notes |
|---|---|
| Optional K8s execution | Local in-process mode remains the default for dev; K8s is opt-in. |
| Preserve orchestration brain | Planner, YAML catalogs, session/KB/learning, artifact pipeline stay central. |
| Shared YAML config | Agent, MCP, and workflow YAML unchanged across backends — Dual execution framework. |
| Isolate agents per step | Each planned step can run in its own pod with secrets, resources, and crash isolation. |
| Minimize CrewAI loss | Prefer mini-Crew per pod (1 agent, 1 task) so MCP tool loops and provider code reuse stay intact. |
| Incremental delivery | Each phase ships value; no big-bang rewrite. |
Non-goals (initial phases)
- Replacing the dynamic planner with a K8s-native workflow engine (Argo, Temporal, etc.).
- In-run CrewAI delegation / hierarchical manager agents (not used today; defer to Agent societies roadmap K6).
- Multi-tenant hardening beyond what Compose/K8s secrets already imply.
- Forked or backend-specific agent/MCP/workflow YAML catalogs (runtime policy only; see Dual execution framework).
Terminology
| Term | Meaning in this doc |
|---|---|
| Coordinator | Process that owns the step loop: spawn workers, hand off outputs, retries, sessions. Today this logic lives in main.py + runner.py. |
| Worker | Ephemeral pod/Job that executes one planned step and writes results to shared storage. |
| Crew run | One user goal → one plan → N sequential steps. Maps to a K8s Job graph or coordinator-managed Job chain, not a K8s Node. |
| Run store | PVC mounted at a shared path for v1; same FileSystemRunStore as local/subprocess (S3/Redis deferred) |
Current vs target architecture
Today (default + subprocess)
Web UI → spawn python main.py
→ execution_backend_from_env()
→ inprocess: build_workflow() → Crew.kickoff() in one process
→ subprocess (AGENTIC_SUBPROCESS_WORKERS=1): StepCoordinator → --execute-step workers
→ sessions / KB / artifacts
Key modules: orchestration/execution_dispatch.py, orchestration/runner.py, orchestration/backends/subprocess_runner.py, orchestration/execute_step.py, orchestration/dynamic_planner.py.
Target (K8s mode)
Web UI → Coordinator (Deployment)
→ Planner → WorkflowConfig / step specs
→ for each step: K8s Job (worker pod)
→ run store (handoff prior output)
→ sessions / KB / artifacts (unchanged)
```mermaid
flowchart TB
    subgraph coordinator [Orchestrator Deployment]
        Web[Web UI]
        Planner[Dynamic Planner]
        Coord[Step Coordinator]
        Store[(Run Store)]
    end
    subgraph workers [Per-step Jobs]
        W1[Worker: step_1]
        W2[Worker: step_2]
        Wn[Worker: step_n]
    end
    Web --> Planner
    Planner --> Coord
    Coord -->|spawn| W1
    W1 -->|output| Store
    Coord -->|spawn| W2
    W2 -->|output| Store
    Coord -->|spawn| Wn
    Wn -->|output| Store
    Coord --> Store
    Coord --> Web
```
What we keep vs change
| Component | Keep? | Change |
|---|---|---|
| Dynamic planner | ✅ | May filter MCP catalog in K8s mode |
| Agent provider YAML + factory | ✅ | Workers call same build_agent() |
| MCP catalog resolution | ✅ | Prefer HTTP MCPs; stdio → sidecar or cluster service |
| runner.py sequential inject logic | ✅ | Ported to step_context.prepare_step_description + StepCoordinator ✅ |
| execution_fallback.py | ✅ | Workflow-level HF fallback via main._run_dynamic_workflow_with_hf_fallback; per-step retry deferred (see Phase 3) |
| Session / learning / KB | ✅ | Add K8s run metadata to session JSON |
| crew.kickoff() whole crew | ❌ (K8s mode) | Replaced by coordinator step loop |
| Web spawn single process | ⚠️ | Coordinator still spawns or embeds tool; workers separate |
Recommended execution strategy
Option A — Mini-Crew per pod (recommended)
Each worker runs a new CLI mode, e.g. --execute-step, that:
- Loads one step spec (JSON).
- Builds 1 Agent, 1 Task, 1 Crew (single task).
- Calls kickoff().
- Writes result.json + artifacts to the run store.
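The worker steps above can be sketched with the standard library alone. This is a minimal illustration, not the shipped execute_step.py: the function name execute_step and the run_step callable are hypothetical, with run_step standing in for "build 1 Agent, 1 Task, 1 Crew and call kickoff()", and the output layout {run_store}/{step_id}/result.json is taken from the result contract later on this page.

```python
import json
from pathlib import Path
from typing import Callable


def execute_step(spec_path: Path, run_step: Callable[[dict], str]) -> int:
    """Run one step from a spec file, write result.json, return the exit code.

    run_step is a hypothetical stand-in for the mini-Crew build + kickoff().
    """
    spec = json.loads(spec_path.read_text(encoding="utf-8"))
    result = {
        "schema_version": "0.1",
        "run_id": spec["run_id"],
        "step_id": spec["step_id"],
        "exit_code": 0,
        "result_text": "",
        "error": None,
    }
    try:
        result["result_text"] = run_step(spec)
    except Exception as exc:
        # Record the failure instead of crashing, so the coordinator
        # always finds a result.json to read.
        result["exit_code"] = 1
        result["error"] = f"{type(exc).__name__}: {exc}"

    # Assumed layout: {run_store}/{step_id}/result.json
    out_dir = Path(spec["paths"]["run_store"]) / spec["step_id"]
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "result.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result["exit_code"]
```

Passing the step runner in as a callable keeps the file handling testable without CrewAI or an LLM, which is the same idea as the stub worker used in the kind e2e test.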
Option B — Custom agent loop (defer)
Replace CrewAI in workers with LiteLLM + MCP SDK. Only consider if CrewAI coupling becomes painful.
Losses and mitigations
Use this table when prioritizing work. Each row maps to phase tasks below.
| Loss | Mitigation | Primary phase |
|---|---|---|
| In-memory step handoffs | Coordinator + run store; port _inject_previous_output_into_next_task | 1, 3 |
| Stdio MCP subprocesses | HTTP MCPs first; sidecars for fetch/filesystem; catalog filter in K8s mode | 3, 4 |
| HF → Ollama execution fallback | Workflow-level retry in main + per-step retry in distributed backends (step_recovery.py) | 3 ✅ |
| Provider recovery retry | Workflow-level in-process + per-step via recovery_hint in StepCoordinator | 3 ✅ |
| CrewAI MCP tool loop | Mini-Crew per pod (Option A) | 2 |
| LLM provider abstraction | Same AgentProvider code inside worker image | 2 |
| CrewOutput / artifacts | Worker result.json contract; thin adapter in coordinator | 1, 2 |
| Delegation / hierarchical | Don’t replicate initially; planner already orchestrates | n/a |
| CrewAI ecosystem upgrades | Pin worker image; ExecutionBackend interface | 2, 5 |
| Step latency / cold start | Warm pool (optional); image pre-pull; slim worker image | 4, 5 |
| Cross-pod debuggability | run_id/step_id logging; session run record; Loki/ELK | 3 |
Phase dependencies and parallelism
Phases are not fully independent. Some stand alone and ship value without later work; others are integration milestones that require earlier contracts. Use this section when scheduling work across sessions or contributors.
Dependency graph
```mermaid
flowchart LR
    P0[Phase 0 Design lock]
    P1[Phase 1 Step spec + store]
    P2[Phase 2 Worker entrypoint]
    P3[Phase 3 K8s backend]
    P4[Phase 4 MCP sidecars]
    P5[Phase 5 Ops polish]
    P0 --> P1
    P0 --> P2
    P1 --> P2
    P1 --> P3
    P2 --> P3
    P3 --> P4
    P3 --> P5
    P4 -.-> P5
```
Legend: solid arrows = hard dependency; dotted = soft (Phase 5 benefits from Phase 4 but can start without it).
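As a quick check when scheduling, the hard dependencies (solid arrows) can be fed to Python's graphlib to obtain one valid phase order. This is only an illustrative sketch; the names HARD_DEPS and phase_order are made up for this example, and the soft Phase 4 → Phase 5 edge is deliberately left out.

```python
from graphlib import TopologicalSorter

# Hard dependencies only; each key maps to the phases it depends on.
HARD_DEPS = {
    "P0": set(),
    "P1": {"P0"},
    "P2": {"P0", "P1"},
    "P3": {"P1", "P2"},
    "P4": {"P3"},
    "P5": {"P3"},
}


def phase_order(deps: dict) -> list:
    """Return one ordering in which every phase follows its hard dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = phase_order(HARD_DEPS)
```

Phases 4 and 5 may appear in either order in the result, which matches the legend: neither is a hard prerequisite of the other.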
Independence summary
| Phase | Standalone? | Depends on | Ships value without later phases? |
|---|---|---|---|
| 0 — Design lock | ✅ Yes | Nothing | ✅ Yes — unblocks all other work |
| 1 — Step spec + store | ⚠️ Mostly | 0 (schemas agreed) | ✅ Yes — cleaner in-process execution, same UX |
| 2 — Worker entrypoint | ⚠️ Mostly | 0; 1 strongly recommended | ✅ Yes — subprocess/container worker, no K8s |
| 3 — K8s backend | ❌ No | 1 + 2 | ❌ No — needs spec, store, and worker |
| 4 — MCP sidecars | ⚠️ Partially | 3 for end-to-end validation | ⚠️ Partial — manifests/docs yes; E2E proof needs K8s |
| 5 — Ops polish | ⚠️ Partially | 3 for most items | ⚠️ Partial — image pinning/runbook can start early |
What can run in parallel
After Phase 0 is locked:
| Track A | Track B | Notes |
|---|---|---|
| Phase 1 (coordinator + run store, in-process) | Phase 2 (worker CLI + Dockerfile) | Parallel if step spec JSON is frozen in Phase 0. Prefer 1 as lead — it defines how specs are built; 2 consumes them. |
| Phase 4 design/docs (sidecar manifests, MCP matrix) | Phase 1 or 2 | Draft sidecars before K8s exists; cannot prove until Phase 3. |
| Phase 5 CrewAI pin / worker image policy | Phase 2 | Image tagging and upgrade runbook do not require a cluster. |
Phase 3 should start once K0 is signed off, the subprocess demo works (✅ F4), and the worker image (K2.3) exists for Job pods.
Phases 4 and 5 are enhancements, not prerequisites for a minimal K8s demo (HTTP MCPs only).
Independent workstreams (tracks)
Treat these as separate milestones you can stop after:
| Track | Phases | Outcome | Skip |
|---|---|---|---|
| Refactor only | 0, 1 | ExecutionBackend, step specs, in-process loop; no containers, no K8s | 2–5 |
| Worker isolation | 0, 1, 2 | --execute-step, SubprocessExecutionBackend, subprocess integration tests | 3–5 ✅ shipped via F4 |
| Kubernetes (HTTP MCPs) | 0, 1, 2, 3 | Full K8s sequential runs with streamable HTTP MCPs | 4 (initially), 5 |
| K8s MCP parity | 0–4 | Stdio MCPs via sidecars; planner catalog policy | 5 until needed |
| Production hardening | 5 (after 3) | Warm pool, centralized logging, load tests | — |
Minimal paths by goal
| Goal | Phases required | Can skip | Notes |
|---|---|---|---|
| Better code structure only | 0, 1 | 2–5 | |
| Containerized agents, no K8s | 0, 1, 2 | 3–5 | Subprocess path ✅; container image (2.3) still open |
| K8s with HTTP MCPs only | 0, 1, 2, 3 | 4 (initially), 5 | |
| Full K8s parity with local MCPs | 0–4 | 5 until needed | |
Practical rules
- Phase 0 is the only true prerequisite for everything — without frozen step/result schemas, Phases 1 and 2 will diverge.
- Phases 1 and 2 are loosely coupled — parallelizable after 0, but Phase 1 should own spec generation; Phase 2 only consumes specs.
- Phase 3 is not independent — it glues Phase 1 (coordinator + store) and Phase 2 (worker) onto a cluster.
- Phases 4 and 5 are optional depth — not required for a first K8s demo.
- Each track can ship on its own — stopping after Phase 1 or Phase 1+2 is valid; you do not need K8s to get value from this roadmap.
Post-F4 plan adjustments (2026-06)
Framework F4 (subprocess backend) validated the distributed contract locally. No architecture replan — K3 remains “swap subprocess spawn for K8s Job.” Updates from implementation:
| Topic | Adjustment |
|---|---|
| K1 / K2 | Effectively done via F2 + F4; remaining K2 work is worker Dockerfile (2.3) and log prefixes (2.2). |
| Run store v1 | PVC + FileSystemRunStore at a mounted path (e.g. /run/store) — reuse subprocess code; S3/Redis deferred. |
| K3 runner | Add kubernetes_runner.py mirroring subprocess_runner.py — same StepCoordinator, different spawn. |
| CLI dispatch | K3.0: extend execution_dispatch.py so kubernetes backend routes to execute_config (today only AGENTIC_SUBPROCESS_WORKERS=1 enables distributed path). |
| Phase 1.5 | Cancelled — in-process keeps whole-crew kickoff; not required for K8s. |
| HF fallback / recovery | K3 MVP: workflow-level only (already in main); per-step Job retry deferred post-MVP. |
| Phase 1.6 | Done in framework F3 (output_artifacts.py adapters). |
Minimal path to K3: K0 sign-off ✅ → K2.3 worker image → K3.0 dispatch → K3.1–3.3 Job + PVC → K3.8 kind test (HTTP MCPs only; K4 for stdio).
Phased roadmap
Track progress by checking boxes as work completes. See Phase dependencies and parallelism before scheduling work out of order.
Phase 0 — Design lock (no K8s required)
Standalone: ✅ Yes — gate for all other phases.
Companion plan: Pair with Dual execution framework Phase F0 (Python types must match schemas below).
Objective: Agree on contracts so phases 1–3 can proceed in parallel later.
- 0.1 Step spec JSON schema v0.1: implemented in StepSpec.to_dict() / materializer; formal wiki review optional.
- 0.2 Worker result.json contract v0.1: implemented in StepResult / execute_step.py; formal wiki review optional.
- 0.3 Run store v1: PVC + FileSystemRunStore at mounted path (S3/MinIO/Redis deferred).
- 0.4 Sign off ExecutionBackend protocol: shipped in framework F0 (Dual execution framework).
- 0.5 Sign off env flag: AGENTIC_EXECUTION_BACKEND=inprocess|subprocess|kubernetes (default inprocess), implemented in orchestration/backends/factory.py.
- 0.6 K8s MCP compatibility matrix signed off (MCP matrix): policy in orchestration/k8s_mcp_compat.py; planner filter K4.3 ✅.
Exit criteria: Schema + contracts merged; Phase 0 complete ✅. K3 MVP uses HTTP-native MCPs only unless AGENTIC_K8S_ALLOW_STDIO_MCPS=1 after K4 sidecars.
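The env flag from 0.5 could be resolved roughly as follows. This is a sketch under assumptions, not the contents of orchestration/backends/factory.py: the function name backend_name_from_env is hypothetical, and trimming whitespace, ignoring case, and rejecting unknown values are choices made for this example.

```python
import os
from typing import Mapping, Optional

VALID_BACKENDS = ("inprocess", "subprocess", "kubernetes")


def backend_name_from_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve AGENTIC_EXECUTION_BACKEND, defaulting to inprocess when unset."""
    env = os.environ if env is None else env
    raw = env.get("AGENTIC_EXECUTION_BACKEND", "").strip().lower()
    if not raw:
        return "inprocess"
    if raw not in VALID_BACKENDS:
        # Assumption: fail fast rather than silently fall back.
        raise ValueError(f"unknown execution backend: {raw!r}")
    return raw
```

Accepting a mapping instead of reading os.environ directly makes the default and the error path easy to unit test.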
Phase 1 — Step spec + run store (local only)
Standalone: ✅ Yes — delivers run store and step contract for distributed backends.
Depends on: Phase 0.
Companion plan: Overlaps Dual execution framework Phases F2 (materializer, StepCoordinator) and F1 (CrewAI backend extract). Coordinate so run store and step specs are not implemented twice.
Parallel with: Phase 2 (after Phase 0; Phase 1 should lead spec generation).
Objective: Run store and step handoff contract ready for subprocess/K8s workers. ✅ Shipped via framework F2 + F4.
Code touchpoints:
- orchestration/run_store.py: abstract + filesystem impl ✅
- orchestration/workflow_materializer.py, step_coordinator.py: framework F2 ✅
- orchestration/backends/crewai.py: framework F1 ✅
Tasks:
- 1.1 Implement StepSpec / StepResult dataclasses aligned with schema below (orchestration/backends/base.py).
- 1.2 Implement build_step_specs(config: WorkflowConfig) -> list[StepSpec] (workflow_materializer.py).
- 1.3 Port prior-output injection to prepare_step_description(step, prior_output) (step_context.py).
- 1.4 Implement filesystem run store: {run_id}/{step_id}/result.json (run_store.py).
- 1.5 Implement InProcessExecutionBackend step loop: cancelled (in-process keeps whole-crew kickoff per framework F2.5; not required for K8s).
- 1.6 Adapter for output_artifacts.py to consume StepResult / result.json: shipped in framework F3.
- 1.7 Unit tests: inject logic, store round-trip, materializer from default workflow YAML.
- 1.8 AGENTIC_RUN_STORE_PATH + run_store_session(): PVC-friendly {base}/{run_id}/ layout; temp dir when unset (local dev).
Exit criteria: Distributed backends use StepCoordinator + run store; subprocess integration test passes. ✅
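To make the {base}/{run_id}/{step_id}/result.json layout concrete, here is a small filesystem store sketch. It is not the real FileSystemRunStore; the class name SketchRunStore and its methods are hypothetical, and the write-then-rename step is an assumption added so that a coordinator reading the same path never sees a half-written file.

```python
import json
from pathlib import Path


class SketchRunStore:
    """Hypothetical stand-in for a filesystem run store rooted at {base}/{run_id}/."""

    def __init__(self, base: Path, run_id: str) -> None:
        self.run_dir = Path(base) / run_id

    def result_path(self, step_id: str) -> Path:
        return self.run_dir / step_id / "result.json"

    def write_result(self, step_id: str, result: dict) -> None:
        path = self.result_path(step_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file first, then rename into place (assumption:
        # both paths live on the same filesystem, as on a single PVC).
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(result), encoding="utf-8")
        tmp.replace(path)

    def read_result(self, step_id: str) -> dict:
        return json.loads(self.result_path(step_id).read_text(encoding="utf-8"))
```

Because the store only depends on a base path, the same code works against a temp dir in local dev and a mounted PVC in K8s mode.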
Phase 2 — Worker entrypoint (mini-Crew per pod)
Standalone: ✅ Yes — isolated worker via subprocess/container; no K8s required.
Depends on: Phase 0 (required); Phase 1 (strongly recommended — spec builder + run store).
Parallel with: Phase 1 (after Phase 0).
Objective: Worker image can execute one step from a spec file and write results.
Code touchpoints:
- main.py: add --execute-step PATH (and --run-id, --step-id)
- agent_providers/*: unchanged
- orchestration/crewai_mcp_hotfix.py: loaded in worker
Tasks:
- 2.1 CLI: --execute-step loads JSON, builds one agent/task/crew, calls kickoff(), writes result.json (execute_step.py).
- 2.2 Worker writes stderr/stdout logs with run_id and step_id prefixes (orchestration/worker_logging.py).
- 2.3 Dockerfile: orchestrator-worker image (docker/Dockerfile.worker, docker/README.worker.md).
- 2.4 Local smoke test: coordinator calls worker via subprocess + shared temp dir, using subprocess_runner.py + tests/test_backend_subprocess.py (framework F4.3).
- 2.5 Document required Secrets → env mapping (.env.example, docker/README.worker.md).
Exit criteria: Subprocess integration test ✅; worker image builds and passes scripts/docker-worker-smoke.ps1 (invalid spec → exit 2). ✅
Phase 3 — Kubernetes coordinator backend
Standalone: ❌ No — integration milestone only.
Depends on: Phase 1 ✅ + Phase 2 subprocess proof ✅ + worker image (2.3).
Companion plan: Requires framework F0–F4 complete ✅. Implements framework F5 (KubernetesExecutionBackend).
Blocks: Phase 4 E2E validation; most of Phase 5.
Objective: Coordinator creates Jobs per step, waits for completion, reads run store — same loop as subprocess_runner.py.
Shared runner pattern:
subprocess_runner.py kubernetes_runner.py
│ │
└─ StepCoordinator.run_sequential
└─ build_step_specs + FileSystemRunStore
└─ spawn: subprocess.run spawn: K8s Job (worker image, --execute-step)
└─ read result.json read result.json (same path on PVC)
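The shared pattern boils down to one sequential loop with a pluggable spawn function. The sketch below is an assumption-laden illustration rather than the real StepCoordinator.run_sequential: the free function run_sequential and the spawn callable are hypothetical, and spawn is expected to run a worker (subprocess or K8s Job) and return the parsed result.json.

```python
from typing import Callable


def run_sequential(specs: list[dict], spawn: Callable[[dict], dict]) -> list[dict]:
    """Run steps in order, handing each step the previous step's output.

    spawn is a hypothetical callable: subprocess.run in one backend,
    a K8s Job in the other. Both return the worker's result dict.
    """
    results: list[dict] = []
    prior_output = ""
    for spec in specs:
        step = dict(spec, prior_output=prior_output)
        result = spawn(step)
        results.append(result)
        if result.get("exit_code", 1) != 0:
            # Stop at the first failed step; retries are out of scope here.
            break
        prior_output = result.get("result_text", "")
    return results
```

Swapping backends then means swapping only the spawn argument, which is what "K3 is an adapter swap" refers to.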
Code touchpoints:
- orchestration/backends/kubernetes_runner.py: new (mirror subprocess_runner.py)
- orchestration/backends/kubernetes.py: delegate execute_config to runner (framework F5)
- orchestration/execution_dispatch.py: K3.0, route kubernetes backend to execute_config
- main.py: already wired via execution_backend_from_env() + execute_workflow_from_config (framework F3/F4)
- orchestration/orchestrator_session.py: store pod names, exit codes, timing
- agentic-orchestration-web/server.mjs: progress lines from coordinator (unchanged format)
Tasks:
- 3.0 Extend execution_dispatch.py: use_distributed_execute_config() routes kubernetes → execute_config.
- 3.1 K8s client: kubernetes Python package + KubernetesJobRunner (kubernetes_jobs.py).
- 3.2 Job template: labels run_id, step_id, agent_provider_id; TTL; worker args …/{run_id}/{step_id}-spec.json.
- 3.3 PVC mount on worker Jobs (AGENTIC_K8S_RUN_STORE_PVC); coordinator uses AGENTIC_RUN_STORE_PATH + FileSystemRunStore.
- 3.4 Per-step HF execution fallback: failed Job → parse error → rebuild config → new Job (step_recovery.py, StepCoordinator retry).
- 3.5 Per-step provider recovery via recovery_hint in StepCoordinator (provider_recovery → recover_from_workflow_error).
- 3.6 Workflow result records k8s_jobs metadata per step Job (WorkflowExecutionResult.k8s_jobs); session wiring deferred.
- 3.7 Coordinator Deployment + RBAC (deploy/k8s/coordinator/); sample worker Job (worker-job.example.yaml).
- 3.8 Integration test: mocked Jobs (tests/test_backend_kubernetes.py) + live kind e2e in CI (tests/test_kind_kubernetes_e2e.py, stub worker).
K3 MVP exit criteria: Code path complete ✅; kind cluster e2e in CI (stub worker, no LLM). Manual kind + real worker image for LLM validation optional.
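For orientation, a per-step Job body along the lines of tasks 3.2 and 3.3 might look like the dict below. It is a sketch, not the shipped Job template. Assumptions: the helper names job_name and build_worker_job are hypothetical; the spec path prefix is elided in the task list, so using the PVC mount path as the prefix is a guess; the image entrypoint is assumed to accept --execute-step directly; and backoffLimit 0 is chosen here so that retries stay with the coordinator.

```python
import re


def job_name(run_id: str, step_id: str) -> str:
    """Build a DNS-1123 style name: lowercase alphanumerics and '-', kept to 63 chars."""
    raw = f"step-{run_id}-{step_id}".lower()
    name = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    return name[:63].rstrip("-")


def build_worker_job(run_id: str, step_id: str, agent_provider_id: str,
                     image: str, pvc_name: str,
                     mount_path: str = "/run/store",
                     ttl_seconds: int = 3600) -> dict:
    """Return a batch/v1 Job manifest as a plain dict (no cluster access needed)."""
    labels = {"run_id": run_id, "step_id": step_id, "agent_provider_id": agent_provider_id}
    # Assumed location of the step spec on the shared volume.
    spec_path = f"{mount_path}/{run_id}/{step_id}-spec.json"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name(run_id, step_id), "labels": labels},
        "spec": {
            "ttlSecondsAfterFinished": ttl_seconds,
            "backoffLimit": 0,  # assumption: the coordinator owns retries
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "args": ["--execute-step", spec_path],
                        "volumeMounts": [{"name": "run-store", "mountPath": mount_path}],
                    }],
                    "volumes": [{
                        "name": "run-store",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                },
            },
        },
    }
```

Building the manifest as a plain dict keeps it testable with mocked Jobs; a client library can submit the same dict when a cluster is available.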
Phase 4 — MCP sidecars and catalog policy
Standalone: ⚠️ Partial — manifests and catalog flags can land early; E2E proof needs Phase 3.
Depends on: Phase 3 for validation.
Parallel with: Phase 5 (after Phase 3).
Objective: Stdio MCPs work in K8s or are cleanly excluded from planner catalog.
Tasks:
- 4.1 Sidecar pattern doc + example manifest for fetch_url (deploy/k8s/mcp-sidecars/).
- 4.2 Optional cluster Deployments for fetch/filesystem MCP as HTTP gateways.
- 4.3 When AGENTIC_EXECUTION_BACKEND=kubernetes, filter planner MCP catalog (apply_kubernetes_mcp_catalog_policy).
- 4.4 Image pre-pull DaemonSet (deploy/k8s/worker-image-prep.yaml) + docs.
- 4.5 Resource requests/limits + GPU nodeSelector via env (K8sSettings, k8s_worker_pod.py).
Exit criteria: At least one stdio MCP works via sidecar (manual kind smoke with AGENTIC_K8S_POD_SIDECAR_MCPS=fetch_url or cluster gateway); planner never assigns broken MCP combos in K8s mode ✅ (unit tests + apply_kubernetes_mcp_catalog_policy).
Phase 5 — Operational polish (optional)
Standalone: ⚠️ Partial — image pinning and runbook (5.3) can start during Phase 2; warm pool and load tests need Phase 3.
Depends on: Phase 3 for most items; benefits from Phase 4.
Parallel with: Phase 4 (after Phase 3).
Objective: Production readiness for longer-running collaboration.
- 5.1 Warm pool: idle worker pods + coordinator dispatch (deploy/k8s/warm-pool.yaml, kubernetes_warm_pool.py).
- 5.2 Centralized logging contract (deploy/k8s/LOGGING.md, AGENTIC_LOG_FORMAT=json).
- 5.3 Pin CrewAI in worker image; document upgrade runbook (crewai==1.12.2, docker/CREWAI_UPGRADE.md).
- 5.4 Load test: scripts/k8s-load-test.ps1/.sh (N concurrent runs, p50/p95).
- 5.5 Delegation RPC: worker k8s_delegate_task tool + delegation-broker Deployment spawns child Jobs (kubernetes_delegation.py).
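For 5.2, structured logs that carry run_id and step_id can be produced with a small logging.Formatter subclass. This is only a sketch of the idea; the class name StepJsonFormatter and the chosen field names are assumptions, and the actual contract is whatever deploy/k8s/LOGGING.md specifies.

```python
import json
import logging


class StepJsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log line, tagged with run and step ids."""

    def __init__(self, run_id: str, step_id: str) -> None:
        super().__init__()
        self.run_id = run_id
        self.step_id = step_id

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "run_id": self.run_id,
            "step_id": self.step_id,
        })
```

A worker would attach it to a handler, for example `handler.setFormatter(StepJsonFormatter(run_id, step_id))`, so that log aggregators can correlate lines across pods by run_id.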
Next (optional): Agent societies roadmap — K6 phased plan for autonomous multi-agent societies (blackboard, protocol engine, society broker, web graph UI). Builds on K5.5 delegation and warm pool; revisits deferred in-run CrewAI delegation / hierarchical managers.
Step spec JSON schema (draft v0.1)
Written by coordinator, consumed by worker --execute-step.
```json
{
  "schema_version": "0.1",
  "run_id": "uuid",
  "step_id": "step_2",
  "step_index": 1,
  "workflow_name": "dynamic-plan-2025-06-26",
  "topic": "User goal text",
  "task": {
    "description": "Full task description after prior-output injection",
    "expected_output": "What success looks like"
  },
  "agent_provider": {
    "id": "gpt_research",
    "type": "openai",
    "role": "Research Analyst",
    "goal": "...",
    "backstory": "...",
    "model": "gpt-4o-mini",
    "verbose": true,
    "allow_delegation": false,
    "openai_base_url": "",
    "ollama_host": ""
  },
  "mcp_providers": [
    { "id": "search_brave", "resolved": { "streamable_http": { "url": "..." } } }
  ],
  "prior_output": "Text from previous step, or empty string",
  "inputs": {
    "topic": "User goal text"
  },
  "paths": {
    "run_store": "/run/store",
    "artifacts_dir": "/run/store/artifacts"
  }
}
```
Rules:
- prior_output is already merged into task.description by coordinator (same as today’s inject marker); worker may ignore duplicate.
- agent_provider is the resolved catalog entry (post credential filter), not a live object; same dict shape as today’s YAML-derived provider payloads.
- mcp_providers[].resolved matches output of resolve_workflow_mcp_refs().
- This JSON is worker transport, not a replacement for config/agent_providers/ or config/mcp_providers/ YAML.
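The first rule, merging prior_output into task.description on the coordinator side, can be illustrated as follows. This is a hypothetical sketch and not the real prepare_step_description: the function name inject_prior_output, the marker text, and the max_chars truncation (keeping the head of the prior output) are all assumptions made for the example.

```python
# Hypothetical marker; the real inject marker is not shown on this page.
PRIOR_OUTPUT_MARKER = "Output from the previous step:"


def inject_prior_output(description: str, prior_output: str, max_chars: int = 4000) -> str:
    """Append the previous step's output to a task description.

    Returns the description unchanged when there is no prior output.
    """
    prior = prior_output.strip()
    if not prior:
        return description
    if len(prior) > max_chars:
        # Assumption: keep the first max_chars characters to bound prompt size.
        prior = prior[:max_chars]
    return f"{description}\n\n{PRIOR_OUTPUT_MARKER}\n{prior}"
```

Doing this once in the coordinator means every backend, subprocess or Job, receives an identical task description for the same step.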
Worker result contract (draft v0.1)
Written to {run_store}/{step_id}/result.json.
```json
{
  "schema_version": "0.1",
  "run_id": "uuid",
  "step_id": "step_2",
  "exit_code": 0,
  "result_text": "Final agent output as string",
  "result_format": "plain",
  "error": null,
  "recoverable": false,
  "recovery_hint": null,
  "artifacts": [
    { "relative_path": "artifacts/report.md", "mime": "text/markdown" }
  ],
  "timing": {
    "started_at": "ISO-8601",
    "finished_at": "ISO-8601"
  },
  "k8s": {
    "pod_name": "step-2-abc123",
    "node_name": "worker-7"
  }
}
```
On failure:
```json
{
  "exit_code": 1,
  "error": "LiteLLM HuggingFaceException: ...",
  "recoverable": true,
  "recovery_hint": "hf_litellm_fallback"
}
```
Coordinator maps recovery_hint to existing Python (execution_fallback, recover_from_workflow_error).
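One way to picture that mapping is a dispatch table keyed by recovery_hint. The sketch below is illustrative only: the function next_action, its return values, and the handlers dict are hypothetical, and the real handlers would be the existing execution_fallback and recover_from_workflow_error code paths.

```python
from typing import Callable, Dict


def next_action(result: dict, handlers: Dict[str, Callable[[dict], None]]) -> str:
    """Decide what the coordinator does after reading a step's result.json.

    Returns "continue" on success, "retry" when a registered handler
    accepted the recovery hint, and "fail" otherwise.
    """
    if result.get("exit_code") == 0:
        return "continue"
    hint = result.get("recovery_hint")
    if result.get("recoverable") and hint in handlers:
        handlers[hint](result)  # e.g. rebuild the step config before a retry
        return "retry"
    return "fail"
```

Unknown or missing hints fall through to "fail", so a new hint has no effect until a handler is registered for it.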
ExecutionBackend (defined in companion plan)
The protocol, factory, and three backend classes are specified in Dual execution framework — not duplicated here.
| Backend | Env value | Defined in |
|---|---|---|
| CrewAIExecutionBackend | inprocess (default) | Framework F1 |
| SubprocessExecutionBackend | subprocess | Framework F4 + K8s Phase 2 CLI |
| KubernetesExecutionBackend | kubernetes | Framework F5 + K8s Phase 3 |
MCP compatibility matrix (K8s mode)
Status: ✅ Signed off (K0.6, 2026-06-27) — verified against shipped catalog in config/mcp_providers/. Code: orchestration/k8s_mcp_compat.py. Planner filter: K4.3 ✅ (apply_kubernetes_mcp_catalog_policy).
Shipped catalog
| MCP id | Transport today | K3 MVP | K8s v2+ (K4 sidecars) |
|---|---|---|---|
| search_brave | streamable_http | ✅ native | ✅ |
| search_tavily | streamable_http | ✅ native | ✅ |
| home_assistant | streamable_http | ✅ native | ✅ |
| search_exa | stdio (npx exa-mcp-server) | ❌ excluded | ⚠️ sidecar |
| fetch_url | stdio (python -m mcp_server_fetch) | ❌ excluded | ✅ worker stdio (default) or cluster gateway / sidecar |
| filesystem_local | stdio (npx filesystem server) | ❌ excluded | ✅ worker stdio + PVC subdir (default) or sidecar/gateway |
| memory_knowledge_graph | stdio (npx memory server) | ❌ excluded | ⚠️ sidecar |
K3 MVP allowlist: search_brave, search_tavily, home_assistant only.
Approved planner rule (implementation: K4.3)
When AGENTIC_EXECUTION_BACKEND=kubernetes:
- Default: planner / materializer sees only K3 MVP MCP ids (K8S_NATIVE_MCP_IDS in code).
- Opt-in stdio: set AGENTIC_K8S_ALLOW_STDIO_MCPS=1 only after a sidecar template exists for that MCP (K4).
- Unknown ids (extra catalog paths): excluded in K8s mode until explicitly classified in k8s_mcp_compat.py or documented.
No new YAML schema — runtime policy only (Dual execution framework).
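The planner rule can be expressed as a small filter over catalog ids. This sketch is not the code in k8s_mcp_compat.py: K8S_NATIVE_MCP_IDS is the name used on this page, while KNOWN_STDIO_MCP_IDS and filter_mcp_ids are hypothetical names, and worker-stdio and gateway rewrites from the matrix are left out for brevity.

```python
K8S_NATIVE_MCP_IDS = frozenset({"search_brave", "search_tavily", "home_assistant"})

# Hypothetical constant: stdio ids classified in the shipped-catalog table.
KNOWN_STDIO_MCP_IDS = frozenset(
    {"search_exa", "fetch_url", "filesystem_local", "memory_knowledge_graph"}
)


def filter_mcp_ids(catalog_ids: list, backend: str, allow_stdio: bool = False) -> list:
    """Return the MCP ids the planner may use for the given backend.

    Outside K8s mode the catalog is untouched. In K8s mode only native ids
    pass, plus classified stdio ids when allow_stdio is set; unknown ids
    are always dropped.
    """
    if backend != "kubernetes":
        return list(catalog_ids)
    allowed = set(K8S_NATIVE_MCP_IDS)
    if allow_stdio:
        allowed |= KNOWN_STDIO_MCP_IDS
    return [mcp_id for mcp_id in catalog_ids if mcp_id in allowed]
```

Because the filter works on ids only, the YAML catalogs stay identical across backends, consistent with the runtime-policy-only rule.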
Environment variables (proposed)
Add to .env.example when implementing:
| Variable | Default | Purpose |
|---|---|---|
| AGENTIC_EXECUTION_BACKEND | inprocess | One of inprocess, kubernetes, subprocess |
| AGENTIC_RUN_STORE_PATH | (unset) | Mounted run store root; per-run dir {path}/{run_id}/. Temp dir when unset. |
| AGENTIC_K8S_NAMESPACE | agentic-orchestration | Job namespace |
| AGENTIC_K8S_WORKER_IMAGE | (none) | Worker container image |
| AGENTIC_K8S_RUN_STORE_PVC | (none) | PVC name for run handoffs |
| AGENTIC_K8S_JOB_TTL_SECONDS | 3600 | Finished Job TTL |
| AGENTIC_K8S_ALLOW_STDIO_MCPS | 0 | When 1, allow stdio MCP ids in K8s mode (requires K4 sidecar/gateway). |
| AGENTIC_K8S_WORKER_STDIO_MCPS | fetch_url | Stdio MCP ids spawned inside the worker container (mcp-server-fetch in worker image). Preferred for fetch_url. |
| AGENTIC_K8S_MCP_FETCH_URL | (none) | Cluster HTTP gateway for fetch_url (stdio → streamable_http rewrite) |
| AGENTIC_K8S_MCP_FILESYSTEM_URL | (none) | Cluster HTTP gateway for filesystem_local |
| AGENTIC_K8S_POD_SIDECAR_MCPS | (none) | Comma-separated MCP ids for in-pod supergateway sidecars (e.g. filesystem_local) |
| AGENTIC_K8S_SUPERGATEWAY_IMAGE | supercorp/supergateway:uvx | Sidecar image for stdio→HTTP bridge |
| AGENTIC_K8S_SUPERGATEWAY_STATEFUL | 0 | When 1, pass --stateful to supergateway (bridge tuning for CrewAI HTTP client) |
| AGENTIC_K8S_WORKER_RESOURCES | (none) | Optional JSON requests/limits for worker container |
| AGENTIC_K8S_GPU_NODE_SELECTOR | (none) | Optional JSON nodeSelector when step provider is in AGENTIC_K8S_GPU_PROVIDER_IDS |
| AGENTIC_K8S_GPU_PROVIDER_IDS | (none) | Comma-separated provider ids that trigger GPU nodeSelector |
| AGENTIC_K8S_ENV_SECRET | (none) | Secret name for worker/coordinator env (agentic-orchestrator-env) |
| AGENTIC_K8S_WARM_POOL_ENABLED | 0 | When 1, dispatch steps without sidecars via PVC queue (deploy/k8s/warm-pool.yaml) |
| AGENTIC_K8S_DELEGATION_ENABLED | 0 | When 1, workers get k8s_delegate_task tool; requires delegation-broker Deployment |
| AGENTIC_K8S_DELEGATION_TIMEOUT_SECONDS | 3600 | Worker wait for delegation broker response |
| AGENTIC_LOG_FORMAT | text | json for structured logs (K5.2, Loki/Datadog) |
Existing variables (AGENTIC_STEP_CONTEXT_CHARS, AGENTIC_PROGRESS, provider API keys) unchanged.
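A handful of these variables can be loaded into a typed settings object as sketched below. This is not the real K8sSettings: the class K8sSettingsSketch, the function load_k8s_settings, and the choice to represent "no default" as an empty string or empty dict are assumptions for illustration; only the variable names and defaults come from the table.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class K8sSettingsSketch:
    """Hypothetical subset of the K8s settings, parsed from env."""
    namespace: str
    worker_image: str
    run_store_pvc: str
    job_ttl_seconds: int
    allow_stdio_mcps: bool
    worker_resources: dict


def load_k8s_settings(env: Optional[Mapping[str, str]] = None) -> K8sSettingsSketch:
    env = os.environ if env is None else env
    raw_resources = env.get("AGENTIC_K8S_WORKER_RESOURCES", "").strip()
    return K8sSettingsSketch(
        namespace=env.get("AGENTIC_K8S_NAMESPACE", "agentic-orchestration"),
        worker_image=env.get("AGENTIC_K8S_WORKER_IMAGE", ""),
        run_store_pvc=env.get("AGENTIC_K8S_RUN_STORE_PVC", ""),
        job_ttl_seconds=int(env.get("AGENTIC_K8S_JOB_TTL_SECONDS", "3600")),
        allow_stdio_mcps=env.get("AGENTIC_K8S_ALLOW_STDIO_MCPS", "0") == "1",
        # JSON-valued variables are parsed once here rather than at each use.
        worker_resources=json.loads(raw_resources) if raw_resources else {},
    )
```

Parsing everything in one place surfaces malformed values (a non-integer TTL, invalid JSON) at coordinator startup instead of at the first Job.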
Testing strategy
| Phase | Tests |
|---|---|
| 1 | Unit: inject, store, step spec builder — ✅ shipped |
| 2 | Subprocess 2-step mocked worker — ✅ tests/test_backend_subprocess.py; Docker smoke — ✅ CI docker-worker-smoke job |
| 3 | kind/minikube: 2-step plan; workflow-level HF fallback |
| 4 | Sidecar MCP smoke test (manual kind); unit: test_k8s_mcp_compat.py, test_k8s_worker_pod.py — ✅ |
| 5 | Warm pool + JSON logging + load test (test_k5_operational.py, scripts/k8s-load-test.*) — ✅ |
| 5 | Load + log correlation |
Regression bar: In-process mode (AGENTIC_EXECUTION_BACKEND=inprocess) must preserve existing behavior for --dynamic and static workflows after each phase.
Decision log
Record decisions here as the project proceeds.
| Date | Decision | Rationale |
|---|---|---|
| 2025-06-26 | Adopt mini-Crew per pod (Option A) | Preserves MCP loop and provider code; matches current sequential usage |
| 2025-06-26 | Coordinator owns step loop, not whole-crew kickoff | Aligns with existing planner-as-brain architecture |
| 2025-06-26 | Phases 1 ‖ 2 parallel after Phase 0; Phase 3 requires both | Maximizes independent delivery; avoids K8s before local subprocess works |
| 2025-06-26 | Split plans: Dual execution framework (code seam) + this page (K8s) | Framework F0–F4 before K8s Phase 3 |
| 2026-06-26 | Framework F4 complete — subprocess proves distributed contract | K3 is adapter swap, not replan |
| 2026-06-26 | Run store v1: PVC + FileSystemRunStore | Reuse local/subprocess code; mount at /run/store; S3/Redis deferred |
| 2026-06-26 | K3 MVP: workflow-level HF fallback only | Per-step Job retry (3.4–3.5) deferred; main._run_dynamic_workflow_with_hf_fallback already covers distributed runs |
| 2026-06-26 | Phase 1.5 (InProcessExecutionBackend step loop) cancelled | In-process keeps whole-crew kickoff; K8s uses StepCoordinator only |
| 2026-06-27 | K0.6 MCP matrix signed off | K3 MVP: search_brave, search_tavily, home_assistant; stdio excluded until K4; policy in k8s_mcp_compat.py |
Open questions
- One Job per step vs one Job with init containers per step? — Per-step Jobs simplify isolation and match sequential semantics; init-chain is faster but weaker isolation.
- Coordinator in web pod vs separate Deployment? — Start embedded in existing orchestration container; split later if needed.
- Iterative dynamic mode: replan between steps — coordinator must refresh step specs mid-run; confirm session/planner API.
- GPU scheduling: derive from provider YAML min_vram_gb or separate scheduling profile id?
- Multi-tenant: namespace per tenant vs label isolation?
Suggested work order (current — post-F4)
Next up (K3 track)
- K0.6: sign off MCP compatibility matrix. ✅ Done.
- K2.3: orchestrator-worker Dockerfile (required for Job pods).
- K3.0: execution_dispatch routes kubernetes → execute_config.
- K3.1–3.3: kubernetes_runner.py + Job template + PVC mount.
- K3.8: kind/minikube 2-step integration test.
- K3.6–3.7: session metadata + coordinator Deployment manifest.
- K4: stdio MCP sidecars + planner filter (K4.3 uses k8s_mcp_compat.py).
- K3.4–3.5, K5: per-step retry, warm pool, load tests (post-MVP).
Completed (framework + K1/K2 overlap)
| Done | Item |
|---|---|
| ✅ | F0–F4 + K1 (materializer, coordinator, run store) |
| ✅ | K2.1, K2.4, K2.5 (--execute-step, subprocess integration test) |
| ✅ | F3 post-run adapters (K1.6) |
Linear default (historical — for reference)
Framework plan (Dual execution framework): F0 → F1 → F2 → F3 → F4 ✅
K8s plan (this page): K2.3 → K3 → K4 → K5
Parallel schedule (two contributors or split focus)
```mermaid
gantt
    title Parallel workstreams after Phase 0
    dateFormat YYYY-MM-DD
    section Gate
    Phase 0 Design lock :p0, 2025-07-01, 7d
    section Track A
    Phase 1 Step spec + store :p1, after p0, 21d
    Phase 3 K8s backend :p3, after p1 p2, 21d
    section Track B
    Phase 2 Worker entrypoint :p2, after p0, 21d
    Phase 4 MCP sidecars prep :p4prep, after p0, 14d
    Phase 4 MCP sidecars E2E :p4, after p3, 14d
    section Either
    Phase 5 Ops polish :p5, after p3, 14d
```
| Week / focus | Owner A | Owner B |
|---|---|---|
| 1 | Phase 0 (pair) | Phase 0 (pair) |
| 2–4 | Phase 1 | Phase 2 |
| 3–4 | — | Phase 4 prep (manifests, catalog flags) |
| 5–7 | Phase 3 (after 1+2 subprocess demo) | Phase 2 finish / worker image hardening |
| 8+ | Phase 5 or iterative mode in K8s | Phase 4 E2E sidecars |
Valid stopping points
You do not need to complete all phases:
- Stop after Phase 1 — architecture improvement only; zero deployment change.
- Stop after Phase 2 — subprocess workers on bare metal ✅ (F4); container image optional.
- Stop after Phase 3 — K8s execution with HTTP MCPs; defer stdio MCPs and warm pool.
Wiki maintenance
- Cross-check phase checkboxes after F4 subprocess smoke (this page updated 2026-06).
- Update Infrastructure with K8s Deployment/Job manifests and networking (K3.7).
- Update Architecture execution diagram for distributed backends.
- Update Configuration with K8s env vars when K3 lands.