Platform agent harness
Shipped in v1.4.0 — tiered probes that turn the Agent catalog from a static registry into a verified inventory for your environment.
Distinct from: User agent harnesses (domain scenario packs you maintain separately).
Why
| Without harness | With harness |
|---|---|
| Broken model names fail at run time | L0/L1 catch config and connectivity in CI |
| Manual smoke per agent | Shared profiles (research, coding, …) scale to 182 agents |
| Hard to debug “which agent is broken?” | python main.py --harness-agent ID --harness-tier smoke isolates one id |
Tiers
| Tier | CLI value | Checks | When to run |
|---|---|---|---|
| L0 | static | YAML valid, credentials present | Every PR (CI); locally before adding YAML |
| L1 | connectivity | validate_config → initialize → health_check | After env/credential changes |
| L2 | smoke | One-task kickoff + deterministic assertions | Before promoting model swaps |
| L3 | capability | L2 + LLM rubric (evaluate_run_quality) | Release gate / manual QA |
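The L1 sequence in the table can be pictured as a short fail-fast loop. The sketch below is illustrative only: `ProbeResult`, `run_connectivity_probe`, and `FakeAgent` are hypothetical names, and it assumes each of the three steps is a no-argument method that either raises or returns `False` on failure. The real harness may report failures differently.

```python
from dataclasses import dataclass
from typing import List, Optional

# Step names taken from the Tiers table; their signatures are assumed.
L1_STEPS = ("validate_config", "initialize", "health_check")


@dataclass
class ProbeResult:
    """Hypothetical result record for one connectivity probe."""
    passed: bool
    failed_step: Optional[str] = None
    error: Optional[str] = None


def run_connectivity_probe(agent) -> ProbeResult:
    """Run the L1 steps in order and stop at the first failure."""
    for step in L1_STEPS:
        try:
            ok = getattr(agent, step)()
        except Exception as exc:  # report the failing step instead of crashing
            return ProbeResult(False, step, str(exc))
        if ok is False:
            return ProbeResult(False, step, "step returned False")
    return ProbeResult(True)


class FakeAgent:
    """Stand-in agent used only to demonstrate the probe."""

    def __init__(self, reachable: bool) -> None:
        self.reachable = reachable
        self.calls: List[str] = []

    def validate_config(self) -> bool:
        self.calls.append("validate_config")
        return True

    def initialize(self) -> bool:
        self.calls.append("initialize")
        if not self.reachable:
            raise ConnectionError("endpoint unreachable")
        return True

    def health_check(self) -> bool:
        self.calls.append("health_check")
        return True
```

Stopping at the first failing step keeps the report specific: a bad credential surfaces as a failure in one named step rather than as a generic run-time error later.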
Quick commands
cd agentic-orchestration-tool
# Full catalog — no API keys
python main.py --harness-batch --harness-tier static
# Cloud subset
python main.py --harness-batch --harness-tier connectivity --harness-filter "gpt_*"
# Single agent smoke (needs credentials)
python main.py --harness-agent gpt_research --harness-tier smoke
# JSON report for automation
python main.py --harness-batch --harness-tier static --harness-json
# Helpers
powershell -File scripts/run-agent-harness.ps1 -Tier static -Filter "gpt_*"
python scripts/harness-report.py
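For automation it can help to assemble these command lines programmatically. The following is a minimal sketch that uses only the flags shown above. `build_harness_command` and `run_harness` are hypothetical helpers, not part of the tool, and the rule that `--harness-filter` applies only to batch runs is an assumption.

```python
import subprocess
import sys
from typing import List, Optional

# CLI values from the Tiers table.
VALID_TIERS = ("static", "connectivity", "smoke", "capability")


def build_harness_command(
    tier: str,
    agent_id: Optional[str] = None,
    name_filter: Optional[str] = None,
    json_report: bool = False,
) -> List[str]:
    """Build an argv list for main.py from the documented harness flags."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    if agent_id is not None and name_filter is not None:
        # Assumption: a filter only makes sense for batch runs.
        raise ValueError("use either agent_id or name_filter, not both")

    cmd = [sys.executable, "main.py"]
    if agent_id is not None:
        cmd += ["--harness-agent", agent_id]
    else:
        cmd.append("--harness-batch")
    cmd += ["--harness-tier", tier]
    if name_filter is not None:
        cmd += ["--harness-filter", name_filter]
    if json_report:
        cmd.append("--harness-json")
    return cmd


def run_harness(cmd: List[str], repo_dir: str = "agentic-orchestration-tool"):
    """Run the command from the repository directory and capture its output."""
    return subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems with patterns such as `gpt_*`, which a shell could otherwise expand against local file names.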
Profiles and per-agent YAML
Shared templates live in config/agent_harnesses/:
| Profile | Typical agents |
|---|---|
| general | Default, general_purpose: true |
| research | Research Analyst roles |
| write | Technical Writer roles |
| reason | Staff Engineer roles |
| coding | *_coder_* ids |
| vision | VLM / vision entries |
Optional fields on agent provider YAML:
harness_profile: research
harness:
  skip_live: true # skip L2/L3 in batch (e.g. huge local models)
  smoke_override:
    description: "..." # rare per-agent prompt override
The Harness column in the Agent catalog shows inferred or explicit profiles.
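One way to picture "inferred or explicit" is a resolver that lets an explicit `harness_profile` win and otherwise falls back to the patterns in the profile table. This is a sketch under stated assumptions, not the tool's actual inference logic: `resolve_harness_profile` is a hypothetical name, the `role` key is an assumed config field, and vision detection is left out because the source does not say how vision entries are identified.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict

# Role keywords mapped to profiles, following the profile table.
# Matching on a "role" string is an assumption made for illustration.
ROLE_PROFILES = {
    "research analyst": "research",
    "technical writer": "write",
    "staff engineer": "reason",
}


def resolve_harness_profile(agent_id: str, config: Dict[str, Any]) -> str:
    """Return the explicit profile if set, else infer one, else 'general'."""
    explicit = config.get("harness_profile")
    if explicit:
        return str(explicit)

    # Ids such as "local_coder_7b" match the *_coder_* pattern.
    if fnmatchcase(agent_id, "*_coder_*"):
        return "coding"

    role = str(config.get("role", "")).lower()
    for keyword, profile in ROLE_PROFILES.items():
        if keyword in role:
            return profile

    return "general"
```

For example, `resolve_harness_profile("local_coder_7b", {})` falls through to the id pattern, while any agent that sets `harness_profile` keeps that value regardless of its id.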
Execution and reports
- Uses the same build_workflow / execute_step paths as production (no second runner).
- L2/L3 support --harness-backend subprocess for worker-image regression.
- Reports written to harness_runs/ (gitignored); aggregate with scripts/harness-report.py.
- Pass/fail stats optionally recorded in __orchestrator_learning__/stats.json and fed to the planner when AGENTIC_HARNESS_FEED_PLANNER=1.
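The reporting flow above can be sketched as a small aggregation step plus an environment check. The report schema used here (a top-level `results` list whose entries carry a boolean `passed`) is purely an assumption for illustration, and `aggregate_reports` and `planner_feed_enabled` are hypothetical helpers. Neither is the shipped `scripts/harness-report.py`.

```python
import json
import os
from collections import Counter
from pathlib import Path
from typing import Dict, Mapping, Optional


def aggregate_reports(report_dir: str) -> Dict[str, int]:
    """Count passed and failed entries across all JSON reports in a directory."""
    totals: Counter = Counter()
    for path in sorted(Path(report_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Assumed schema: {"results": [{"agent_id": "...", "passed": true}, ...]}
        for entry in data.get("results", []):
            totals["passed" if entry.get("passed") else "failed"] += 1
    return {"passed": totals["passed"], "failed": totals["failed"]}


def planner_feed_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when AGENTIC_HARNESS_FEED_PLANNER is set to '1'."""
    env = os.environ if environ is None else environ
    return env.get("AGENTIC_HARNESS_FEED_PLANNER") == "1"
```

Treating the planner feed as opt-in, enabled only by the exact value `1`, mirrors the wording above and keeps harness results from influencing planning unless someone asks for it.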
CI
| Job | Tier |
|---|---|
| agent-harness-static | L0: full catalog every PR |
| agent-harness-connectivity | L1: gpt_* + unit tests |
| agent-harness-smoke-nightly | L2: weekly (optional secrets) |
Details: Testing and CI.
Related
- CLI reference: all --harness-* flags
- Configuration: AGENTIC_HARNESS_* env vars
- Features: product overview