A network of agentic research specialists coordinated by a principal investigator agent. V1: web-search researcher MCP server + CLI shim.
Jeff Smith 13215d7ddb docs(stress-tests): M3.3 Phase A — calibration data collection
Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
  across 4 categories (factual, comparative, contradiction-prone,
  scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
  under ~/.marchwarden/traces/*.result.json and emits a markdown rating
  worksheet with one row per run. Recovers question text from each
  trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration
  + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
  for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
  runner, kept as provenance. Gitignore updated with an exception
  carving stress-test logs out of the global *.log ignore.
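
The collection step described above could be sketched roughly as follows. This is an illustration only, not the contents of scripts/calibration_collect.py. The `<trace_id>.jsonl` trace filename, the `action` and `question` event keys, the `<category>-<nn>.log` naming pattern, and the `categories` mapping from trace ID to category are all assumptions made for the sketch. Because the ResearchResult schema is not described here, the sketch uses each `*.result.json` file only to enumerate rateable runs and does not read its fields.

```python
import json
from pathlib import Path

RESULT_SUFFIX = ".result.json"


def category_from_log_name(log_name: str) -> str:
    # Assumed naming: "<category>-<nn>.log", e.g. "scope-edge-03.log".
    stem = Path(log_name).stem
    category, sep, _ = stem.rpartition("-")
    return category if sep else "unknown"


def question_from_trace(trace_path: Path) -> str:
    # Assumed layout: one JSON event per line, where the start event looks
    # like {"action": "start", "question": "..."}.
    if not trace_path.exists():
        return ""
    for line in trace_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("action") == "start":
            return str(event.get("question", ""))
    return ""


def build_worksheet(trace_dir: Path, categories: dict[str, str]) -> str:
    # One row per persisted result; actual_rating is left empty for the
    # human rating step.
    lines = [
        "| trace_id | category | question | actual_rating |",
        "| --- | --- | --- | --- |",
    ]
    for result_path in sorted(trace_dir.glob("*" + RESULT_SUFFIX)):
        trace_id = result_path.name[: -len(RESULT_SUFFIX)]
        question = question_from_trace(trace_dir / f"{trace_id}.jsonl")
        category = categories.get(trace_id, "unknown")
        # Escape pipes so the question text cannot break the table row.
        question = question.replace("|", "\\|")
        lines.append(f"| {trace_id} | {category} | {question} | |")
    return "\n".join(lines)
```

Under these assumptions, a run with no `*.result.json` sibling never appears in the worksheet, which matches the note below about pre-#54 runs being unrecoverable.
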

Note: M3.1's 4 runs predate #54 (full result persistence) and so are
unrecoverable to the worksheet — only post-#54 runs have a result.json
sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:21:47 -06:00
cli fix(observability): persist full ResearchResult and per-item trace events 2026-04-08 19:27:33 -06:00
docs/stress-tests docs(stress-tests): M3.3 Phase A — calibration data collection 2026-04-08 20:21:47 -06:00
obs M2.5.2: Cost ledger with price table (#25) 2026-04-08 15:52:25 -06:00
orchestrator Initial project structure and scaffolding 2026-04-08 11:57:15 -06:00
researchers fix(observability): persist full ResearchResult and per-item trace events 2026-04-08 19:27:33 -06:00
scripts docs(stress-tests): M3.3 Phase A — calibration data collection 2026-04-08 20:21:47 -06:00
tests fix(observability): persist full ResearchResult and per-item trace events 2026-04-08 19:27:33 -06:00
.dockerignore chore: add docker-based test environment (#13) 2026-04-08 15:06:12 -06:00
.gitignore docs(stress-tests): M3.3 Phase A — calibration data collection 2026-04-08 20:21:47 -06:00
CLAUDE.md chore: update CLAUDE.md for session 2 2026-04-08 17:30:59 -06:00
CONTRIBUTING.md Initial project structure and scaffolding 2026-04-08 11:57:15 -06:00
Dockerfile M2.5.3: marchwarden costs CLI command (#26) 2026-04-08 15:57:39 -06:00
Makefile chore: add Makefile with venv-based dev workflow 2026-04-08 16:31:00 -06:00
pyproject.toml M2.5.1: Structured application logger via structlog (#24) 2026-04-08 15:46:51 -06:00
README.md chore: add Makefile with venv-based dev workflow 2026-04-08 16:31:00 -06:00

Marchwarden

A network of agentic research specialists coordinated by a principal investigator agent.

Marchwarden researchers are stationed at the frontier of knowledge — they watch, search, synthesize, and report back what they find. Each specialist is self-contained, fault-tolerant, and exposed via MCP. The PI agent orchestrates them to answer complex, multi-domain questions.

V1: Single web-search researcher + CLI shim for development.
V2+: Multiple specialists (arxiv, database, internal docs, etc.) + PI orchestrator.

Quick start

# Clone
git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
cd marchwarden

# Install (Makefile shortcut — creates .venv and installs deps)
make install
# or manually:
python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[dev]"

# Ask a question
marchwarden ask "What are ideal crops for a garden in Utah?"

# Replay a research session
marchwarden replay <trace_id>

Docker test environment

A reproducible container is available for running the test suite and the CLI without depending on the host's Python install:

scripts/docker-test.sh build            # build the image
scripts/docker-test.sh test             # run pytest
scripts/docker-test.sh ask "question"   # run `marchwarden ask` end-to-end
                                        # (mounts ~/secrets ro and ~/.marchwarden rw)
scripts/docker-test.sh replay <id>      # replay a trace from ~/.marchwarden/traces
scripts/docker-test.sh shell            # interactive bash in the container

Documentation

Status

  • V1 scope: Issue #1
  • Branch: main (development)
  • Tests: pytest tests/

Stack

License

(TBD)