User agent harnesses (per-catalog-agent eval packs)

Shipped in v1.5.0. User agent harnesses are domain scenario libraries that you maintain outside the core repo (or under vertical examples) to verify that catalog agents meet your bar. A pack bundles prompts, fixtures, rubrics, and optional MCP pairings.

Distinct from: Platform agent harness (generic L0–L3 catalog health in CI).

Why

| Platform harness | User harness |
| --- | --- |
| Is gpt_research configured and reachable? | Does gpt_research produce an RPM brief without inventing trial IDs? |
| Shared profiles in config/agent_harnesses/ | Your YAML scenarios + fixtures under harnesses/<agent_id>/ |
| Runs on every PR (L0/L1) | Runs in your CI before deploy or model swaps |

Directory layout

my-deployment/harnesses/
  gpt_research/
    harness.yaml              # manifest — links to catalog id
    scenarios/
      rpm_council_brief.yaml
    fixtures/
      sample_context.txt
    rubrics/
      healthcare_claims.yaml  # optional LLM judge

The healthcare vertical ships a reference pack at examples/verticals/healthcare/harnesses/gpt_research/ (three scenarios from the healthcare README).

Quick commands

cd agentic-orchestration-tool

# Healthcare vertical (overlay adds harnesses/ to discovery)
python main.py --example healthcare --harness-agent gpt_research

# Your own pack directory
python main.py --harness-dir /path/to/harnesses --harness-agent gpt_research

# All packs under merged harness dirs
python main.py --harness-dir ./harnesses --user-harness-run-all

# JSON report + fail fast
python main.py --example healthcare --harness-agent gpt_research --harness-json --harness-fail-fast

# Helpers
powershell -File scripts/run-user-harness.ps1 -Example healthcare -Agent gpt_research
bash scripts/run-user-harness.sh --example healthcare --harness-agent gpt_research

Discovery

| Mechanism | Purpose |
| --- | --- |
| AGENTIC_EXTRA_AGENT_HARNESS_DIRS | os.pathsep-separated harness root directories |
| --harness-dir PATH | CLI override (repeatable) |
| --example healthcare | Prepends examples/verticals/healthcare/harnesses/ when present |

One pack per agent_provider_id (the folder name under each root). Duplicate ids across merged directories are an error.
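The discovery rules above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function names `merged_harness_roots` and `discover_packs` are hypothetical, the merge order (CLI directories first, then the environment variable) is an assumption, and so is the choice to treat a folder as a pack only when it contains harness.yaml.

```python
import os
from pathlib import Path


def merged_harness_roots(cli_dirs, env=None):
    """Merge --harness-dir values with AGENTIC_EXTRA_AGENT_HARNESS_DIRS entries.

    Assumption: CLI directories come first; the real precedence may differ.
    """
    env = os.environ if env is None else env
    raw = env.get("AGENTIC_EXTRA_AGENT_HARNESS_DIRS", "")
    # The variable is os.pathsep-separated (":" on POSIX, ";" on Windows).
    env_dirs = [part for part in raw.split(os.pathsep) if part.strip()]
    roots = []
    for entry in list(cli_dirs) + env_dirs:
        path = Path(entry)
        if path not in roots:  # drop exact repeats of the same root
            roots.append(path)
    return roots


def discover_packs(roots):
    """Map each agent id (the folder name) to its pack directory.

    Raises ValueError when two roots provide a pack with the same id.
    """
    packs = {}
    for root in roots:
        if not root.is_dir():
            continue
        for child in sorted(root.iterdir()):
            # Assumption: a pack is any folder that holds a harness.yaml manifest.
            if not (child / "harness.yaml").is_file():
                continue
            if child.name in packs:
                raise ValueError(
                    f"duplicate harness pack id {child.name!r}: "
                    f"{packs[child.name]} and {child}"
                )
            packs[child.name] = child
    return packs
```

Calling `discover_packs(merged_harness_roots(["./harnesses"]))` would return a dict keyed by agent id, or raise if the same id appears under two roots.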

Scenario YAML (summary)

Each scenarios/*.yaml runs a single-task workflow via the same build_workflow / kickoff path as production.

Deterministic assertions (Phase 1): min_chars, max_chars, bullet_count, contains_any, forbids_regex, json_parse.
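To make the assertion names concrete, here is a hedged sketch of how such checks could be evaluated against an agent's text output. The semantics are assumptions inferred from the names alone: `bullet_count` is treated as a minimum, `contains_any` as a case-insensitive substring match, and `json_parse` as a boolean switch. The helper `check_output` and the NCT-style pattern in the usage note are hypothetical and not part of the tool.

```python
import json
import re

# Lines that start with "-", "*", or a number such as "1." or "1)".
BULLET_RE = re.compile(r"^[ \t]*(?:[-*]|\d+[.)])[ \t]+", re.MULTILINE)


def check_output(text, spec):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    if "min_chars" in spec and len(text) < spec["min_chars"]:
        failures.append(f"min_chars: {len(text)} < {spec['min_chars']}")
    if "max_chars" in spec and len(text) > spec["max_chars"]:
        failures.append(f"max_chars: {len(text)} > {spec['max_chars']}")
    if "bullet_count" in spec:
        found = len(BULLET_RE.findall(text))
        if found < spec["bullet_count"]:  # assumed to be a minimum
            failures.append(f"bullet_count: {found} < {spec['bullet_count']}")
    if "contains_any" in spec:
        lowered = text.lower()
        if not any(term.lower() in lowered for term in spec["contains_any"]):
            failures.append(f"contains_any: none of {spec['contains_any']}")
    if "forbids_regex" in spec and re.search(spec["forbids_regex"], text):
        failures.append(f"forbids_regex: matched {spec['forbids_regex']}")
    if spec.get("json_parse"):
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            failures.append(f"json_parse: {exc}")
    return failures
```

For example, a spec such as `{"bullet_count": 2, "forbids_regex": r"NCT\d{8}"}` would flag an output that has fewer than two bullets or that contains an identifier shaped like a trial registry number.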

Matrix runs (Phase 4): set inputs.matrix to a list of variant dicts (label, optional topic, description_append, …). Each variant runs as scenario_id[label].
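The variant expansion can be pictured with a short sketch. It is illustrative only: `expand_matrix` is a hypothetical helper, and the merge rules (variant keys override base inputs, `description_append` is appended to an assumed `description` input on a new line) are assumptions rather than documented behavior. Only the `scenario_id[label]` naming comes from the text above.

```python
import copy


def expand_matrix(scenario_id, inputs):
    """Return (run_id, variant_inputs) pairs, one per matrix variant.

    Without inputs["matrix"], the scenario runs once under its own id.
    """
    base = {key: value for key, value in inputs.items() if key != "matrix"}
    matrix = inputs.get("matrix")
    if not matrix:
        return [(scenario_id, base)]

    runs = []
    for variant in matrix:
        merged = copy.deepcopy(base)
        for key, value in variant.items():
            if key == "label":
                continue  # the label only names the run
            if key == "description_append":
                # Assumption: appended to a "description" input on a new line.
                merged["description"] = (
                    merged.get("description", "") + "\n" + value
                ).strip()
            else:
                merged[key] = value  # e.g. topic overrides the base value
        runs.append((f"{scenario_id}[{variant['label']}]", merged))
    return runs
```

With a scenario id of `rpm_council_brief` and variants labelled `short` and `cardio`, this yields runs named `rpm_council_brief[short]` and `rpm_council_brief[cardio]`.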

Setting optional_eval with a rubric_file reuses evaluate_run_quality for LLM rubric scoring (disable with AGENTIC_HARNESS_EVAL=0).

Backends: --harness-backend subprocess|kubernetes or manifest defaults.execution_backend (same path as production execute_step).

Reports: harness_runs/user_batch_*.json (gitignored).
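For post-run inspection, a small helper along these lines can pick up the newest report. It is a sketch under stated assumptions: `load_latest_report` is a hypothetical name, "newest" is judged by file modification time, and nothing is assumed about the report's JSON schema beyond it being valid JSON.

```python
import json
from pathlib import Path


def load_latest_report(report_dir="harness_runs"):
    """Return (path, parsed JSON) for the newest user_batch_*.json, or None."""
    candidates = sorted(
        Path(report_dir).glob("user_batch_*.json"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        return None
    latest = candidates[-1]
    with latest.open(encoding="utf-8") as handle:
        return latest, json.load(handle)
```

A CI step could call `load_latest_report()` after a `--harness-json` run and archive or summarize the returned data.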

Environment

| Variable | Default | Purpose |
| --- | --- | --- |
| AGENTIC_EXTRA_AGENT_HARNESS_DIRS | | Extra harness roots |
| AGENTIC_USER_HARNESS_RECORD_STATS | 1 | Rolling pass/fail in user_harness_stats |
| AGENTIC_USER_HARNESS_FEED_PLANNER | 1 | Inject scenario pass/fail hints into dynamic planner |
| AGENTIC_HARNESS_EVAL | 1 | LLM rubric per scenario when optional_eval set |
| AGENTIC_EXECUTION_BACKEND | inprocess | Override via manifest defaults.execution_backend |

CI and tests

Unit tests: pytest -m user_harness (mocked kickoff; included in the harness CI job alongside @pytest.mark.agent_harness).

Adopters can copy the healthcare pack or author packs in their own repos and point AGENTIC_EXTRA_AGENT_HARNESS_DIRS at them.