Marchwarden
A network of agentic research specialists coordinated by a principal investigator agent.
Marchwarden researchers are stationed at the frontier of knowledge — they watch, search, synthesize, and report back what they find. Each specialist is self-contained, fault-tolerant, and exposed via MCP. The PI agent orchestrates them to answer complex, multi-domain questions.
V1: Single web-search researcher + CLI shim for development.
V2+: Multiple specialists (arxiv, database, internal docs, etc.) + PI orchestrator.
Quick start
# Clone
git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
cd marchwarden
# Install (Makefile shortcut — creates .venv and installs deps)
make install
# or manually:
python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[dev]"
# Ask a question
marchwarden ask "What are ideal crops for a garden in Utah?"
# Replay a research session
marchwarden replay <trace_id>
Docker test environment
A reproducible container is available for running the test suite and the CLI without depending on the host's Python install:
scripts/docker-test.sh build # build the image
scripts/docker-test.sh test # run pytest
scripts/docker-test.sh ask "question" # run `marchwarden ask` end-to-end
# (mounts ~/secrets ro and ~/.marchwarden rw)
scripts/docker-test.sh replay <id> # replay a trace from ~/.marchwarden/traces
scripts/docker-test.sh shell # interactive bash in the container
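The wrapper maps each subcommand onto a docker invocation. As a rough sketch only (the image name, flags, and mounts here are illustrative assumptions, not the actual script, and commands are echoed rather than executed), the dispatch boils down to:

```shell
# Hypothetical sketch of scripts/docker-test.sh-style dispatch.
# "marchwarden-test" and the docker flags are assumptions for illustration.
docker_test() {
  IMAGE="marchwarden-test"  # assumed image tag
  case "$1" in
    build) echo docker build -t "$IMAGE" . ;;
    test)  echo docker run --rm "$IMAGE" pytest tests/ ;;
    shell) echo docker run --rm -it "$IMAGE" bash ;;
    *)     echo "usage: docker_test {build|test|shell}" >&2; return 1 ;;
  esac
}
```

Keeping the docker details behind one wrapper means contributors only learn the subcommands, not the mount and flag incantations.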
Documentation
- Architecture — system design, researcher contract, MCP flow
- Development Guide — setup, running tests, debugging
- Research Contract — the research() tool specification
- Contributing — branching, commits, PR workflow
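The research() tool is specified in the Research Contract doc; as a rough illustration only (the field names and stub below are assumptions, not the actual spec), a researcher reduces to a question-in, structured-result-out function:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchResult:
    """Illustrative result shape -- the real contract lives in the
    Research Contract doc, not here."""
    question: str
    answer: str
    sources: list[str] = field(default_factory=list)

def research(question: str) -> ResearchResult:
    # A real researcher would query a search backend (e.g. Tavily) and
    # synthesize an answer; this stub only shows the call shape.
    return ResearchResult(question=question, answer="(not implemented)")
```

Exposing that one function over MCP is what lets the PI agent treat every specialist identically.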
Status
- V1 scope: Issue #1
- Branch: main (development)
- Tests: pytest tests/
Stack
- Language: Python 3.10+
- Agent framework: Anthropic Claude Agent SDK
- MCP server: Model Context Protocol
- Web search: Tavily API
License
(TBD)