User agent harnesses (per-catalog-agent eval packs)

Shipped in v1.5.0. User harnesses are domain scenario libraries that you maintain outside the core repo (or under vertical examples) to verify that catalog agents meet your bar. A pack contains prompts, fixtures, rubrics, and optional MCP pairings.

Distinct from the platform agent harness, which covers generic L0–L3 catalog health in CI.

Why

| Platform harness | User harness |
| --- | --- |
| Is `gpt_research` configured and reachable? | Does `gpt_research` produce an RPM brief without inventing trial IDs? |
| Shared profiles in `config/agent_harnesses/` | Your YAML scenarios + fixtures under `harnesses/<agent_id>/` |
| Runs on every PR (L0/L1) | Runs in your CI before deploy or model swaps |

Directory layout

my-deployment/harnesses/
  gpt_research/
    harness.yaml              # manifest — links to catalog id
    scenarios/
      rpm_council_brief.yaml
    fixtures/
      sample_context.txt
    rubrics/
      healthcare_claims.yaml  # optional LLM judge

The healthcare vertical ships a reference pack at examples/verticals/healthcare/harnesses/gpt_research/ (three scenarios from the healthcare README).
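To start a new pack, the layout above can be scaffolded with a few lines of Python. This is a minimal sketch, not a shipped tool: `scaffold_pack` and `PACK_FILES` are hypothetical names, and the files are created empty because the manifest and scenario schemas are not specified in this section.

```python
from pathlib import Path

# Relative paths copied from the directory layout above.
# Files are created empty: their schemas are not defined in this section.
PACK_FILES = [
    "harness.yaml",
    "scenarios/rpm_council_brief.yaml",
    "fixtures/sample_context.txt",
    "rubrics/healthcare_claims.yaml",  # optional LLM judge rubric
]


def scaffold_pack(root: Path, agent_id: str) -> Path:
    """Create an empty pack skeleton at <root>/<agent_id>/ and return its path."""
    pack_dir = root / agent_id
    for rel in PACK_FILES:
        target = pack_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch(exist_ok=True)  # leaves existing file content untouched
    return pack_dir
```

For example, `scaffold_pack(Path("my-deployment/harnesses"), "gpt_research")` would produce the tree shown above, ready to be filled in by hand.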

Quick commands

cd agentic-orchestration-tool

# Healthcare vertical (overlay adds harnesses/ to discovery)
python main.py --example healthcare --harness-agent gpt_research

# Your own pack directory
python main.py --harness-dir /path/to/harnesses --harness-agent gpt_research

# All packs under merged harness dirs
python main.py --harness-dir ./harnesses --user-harness-run-all

# JSON report + fail fast
python main.py --example healthcare --harness-agent gpt_research --harness-json --harness-fail-fast

# Helpers
powershell -File scripts/run-user-harness.ps1 -Example healthcare -Agent gpt_research
bash scripts/run-user-harness.sh --example healthcare --harness-agent gpt_research

Discovery

| Mechanism | Purpose |
| --- | --- |
| `AGENTIC_EXTRA_AGENT_HARNESS_DIRS` | `os.pathsep`-separated harness root directories |
| `--harness-dir PATH` | CLI override (repeatable) |
| `--example healthcare` | Prepends `examples/verticals/healthcare/harnesses/` when present |

One pack per agent_provider_id (the folder name under each root). If the same id appears under more than one merged directory, discovery fails with an error.
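The discovery rules can be illustrated with a short sketch. It is not the project's implementation: `harness_roots` and `discover_packs` are hypothetical names, and both the ordering (CLI roots before environment roots) and the silent skipping of missing roots are assumptions.

```python
import os
from pathlib import Path

ENV_VAR = "AGENTIC_EXTRA_AGENT_HARNESS_DIRS"


def harness_roots(cli_dirs: list[str], env: dict[str, str]) -> list[Path]:
    """Merge CLI roots with the os.pathsep-separated roots from the environment.

    Assumption: CLI roots come first; the real precedence may differ.
    """
    raw = env.get(ENV_VAR, "")
    env_dirs = [p for p in raw.split(os.pathsep) if p.strip()]
    return [Path(p) for p in [*cli_dirs, *env_dirs]]


def discover_packs(roots: list[Path]) -> dict[str, Path]:
    """Map each agent id (folder name under a root) to its pack directory.

    Raises ValueError when the same id appears under more than one root.
    """
    packs: dict[str, Path] = {}
    for root in roots:
        if not root.is_dir():
            continue  # assumption: missing roots are skipped
        for child in sorted(root.iterdir()):
            if not child.is_dir():
                continue
            if child.name in packs:
                raise ValueError(
                    f"duplicate harness pack {child.name!r}: "
                    f"{packs[child.name]} and {child}"
                )
            packs[child.name] = child
    return packs
```

Failing loudly on duplicates, rather than letting a later root shadow an earlier one, keeps it unambiguous which scenarios actually ran for a given agent.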

Scenario YAML (summary)

Each scenarios/*.yaml file runs a single-task workflow through the same build_workflow / kickoff path as production.

Deterministic assertions (Phase 1): min_chars, max_chars, bullet_count, contains_any, forbids_regex, json_parse.
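As an illustration of what these checks might look like, here is a minimal sketch that applies them to an agent's text output. The assertion names come from the list above, but their exact semantics are assumptions: `bullet_count` is treated as a minimum number of bullet lines, `contains_any` as a case-insensitive substring match, and `forbids_regex` as a list of patterns. `check_output` is a hypothetical helper, not the harness API.

```python
import json
import re


def check_output(text: str, spec: dict) -> list[str]:
    """Return failure messages for one output; an empty list means it passed."""
    failures = []
    if "min_chars" in spec and len(text) < spec["min_chars"]:
        failures.append(f"min_chars: {len(text)} < {spec['min_chars']}")
    if "max_chars" in spec and len(text) > spec["max_chars"]:
        failures.append(f"max_chars: {len(text)} > {spec['max_chars']}")
    if "bullet_count" in spec:
        # Assumption: a bullet is a line starting with -, * or •,
        # and the configured value is a minimum.
        bullets = len(re.findall(r"^[ \t]*[-*•][ \t]+", text, flags=re.MULTILINE))
        if bullets < spec["bullet_count"]:
            failures.append(f"bullet_count: {bullets} < {spec['bullet_count']}")
    if "contains_any" in spec:
        lowered = text.lower()
        if not any(term.lower() in lowered for term in spec["contains_any"]):
            failures.append(f"contains_any: none of {spec['contains_any']}")
    for pattern in spec.get("forbids_regex", []):
        if re.search(pattern, text):
            failures.append(f"forbids_regex: matched {pattern!r}")
    if spec.get("json_parse"):
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            failures.append(f"json_parse: {exc.msg}")
    return failures
```

A scenario guarding against invented trial IDs could, for instance, forbid the pattern `NCT\d{8}` (the ClinicalTrials.gov identifier format) whenever the fixture contains no such ID.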

Matrix runs (Phase 4): set inputs.matrix to a list of variant dicts (label, optional topic, description_append, …). Each variant runs as scenario_id[label].
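The expansion into scenario_id[label] runs can be sketched as follows. `expand_matrix` is a hypothetical helper, and the merge rule is an assumption: every variant key except `label` simply overrides the base inputs, so any special handling the real harness applies to keys such as description_append is not modeled here.

```python
from copy import deepcopy


def expand_matrix(scenario_id: str, inputs: dict) -> list[tuple[str, dict]]:
    """Return (run_id, inputs) pairs, one per matrix variant.

    Without a matrix, the scenario runs once under its own id.
    """
    variants = inputs.get("matrix")
    base = {k: v for k, v in inputs.items() if k != "matrix"}
    if not variants:
        return [(scenario_id, base)]
    runs = []
    for variant in variants:
        merged = deepcopy(base)
        # Assumption: variant keys other than "label" override base inputs.
        merged.update({k: v for k, v in variant.items() if k != "label"})
        runs.append((f"{scenario_id}[{variant['label']}]", merged))
    return runs
```

With two variants labelled `short` and `cardio`, the scenario `rpm_council_brief` would be reported as `rpm_council_brief[short]` and `rpm_council_brief[cardio]`.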

LLM rubric (optional): setting optional_eval with a rubric_file reuses evaluate_run_quality (disable with AGENTIC_HARNESS_EVAL=0).

Backends: pass --harness-backend subprocess|kubernetes, or set defaults.execution_backend in the manifest (same path as production execute_step).

Reports: harness_runs/user_batch_*.json (gitignored).

Environment

| Variable | Default | Purpose |
| --- | --- | --- |
| `AGENTIC_EXTRA_AGENT_HARNESS_DIRS` | — | Extra harness roots |
| `AGENTIC_USER_HARNESS_RECORD_STATS` | `1` | Rolling pass/fail in `user_harness_stats` |
| `AGENTIC_USER_HARNESS_FEED_PLANNER` | `1` | Inject scenario pass/fail hints into dynamic planner |
| `AGENTIC_HARNESS_EVAL` | `1` | LLM rubric per scenario when `optional_eval` set |
| `AGENTIC_EXECUTION_BACKEND` | `inprocess` | Override via manifest `defaults.execution_backend` |
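For scripts that wrap the harness, the variables and defaults in this table could be read into one settings object. This is a sketch under assumptions: `HarnessSettings` and `load_settings` are hypothetical names, and the set of strings treated as "disabled" is a guess, since only the value `0` is documented.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional

# Assumption: only "0" is documented as "off"; the other values are a guess.
_FALSEY = {"0", "false", "no", "off", ""}


def _flag(env: Mapping[str, str], name: str, default: str = "1") -> bool:
    """Interpret an environment variable as an on/off switch."""
    return env.get(name, default).strip().lower() not in _FALSEY


@dataclass(frozen=True)
class HarnessSettings:
    extra_dirs: tuple[str, ...]
    record_stats: bool
    feed_planner: bool
    llm_eval: bool
    execution_backend: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> HarnessSettings:
    """Read harness settings, falling back to the defaults in the table above."""
    env = os.environ if env is None else env
    raw_dirs = env.get("AGENTIC_EXTRA_AGENT_HARNESS_DIRS", "")
    return HarnessSettings(
        extra_dirs=tuple(p for p in raw_dirs.split(os.pathsep) if p),
        record_stats=_flag(env, "AGENTIC_USER_HARNESS_RECORD_STATS"),
        feed_planner=_flag(env, "AGENTIC_USER_HARNESS_FEED_PLANNER"),
        llm_eval=_flag(env, "AGENTIC_HARNESS_EVAL"),
        execution_backend=env.get("AGENTIC_EXECUTION_BACKEND", "inprocess"),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, for example `load_settings({"AGENTIC_HARNESS_EVAL": "0"}).llm_eval` evaluates to `False`.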

CI and tests

Unit tests: pytest -m user_harness (mocked kickoff; included in the harness CI job alongside @pytest.mark.agent_harness).

Adopters can copy the healthcare pack or author packs in their own repos and point AGENTIC_EXTRA_AGENT_HARNESS_DIRS at them.