Update contract: raw_excerpt, categorized gaps, discovery events, confidence factors, content hashing

Incorporates architectural critique:
- raw_excerpt on citations prevents synthesis paradox (double-summarization)
- Gap categories (SOURCE_NOT_FOUND, ACCESS_DENIED, BUDGET_EXHAUSTED,
  CONTRADICTORY_SOURCES, SCOPE_EXCEEDED) enable PI to choose correct response
- discovery_events capture lateral findings for V2 orchestrator
- confidence_factors expose scoring inputs for future calibration
- content_hash (SHA-256) on trace fetches enables pseudo-CAS change detection
- Known Limitations table documents intentional V1 tradeoffs

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Jeff Smith 2026-04-08 12:28:10 -06:00
parent 36bfe1813a
commit 7ad91b7ca9
2 changed files with 434 additions and 85 deletions

@ -18,6 +18,8 @@ Marchwarden is a network of agentic researchers coordinated by a principal inves
│ - Fetch URLs         │  │                         │
│ - Internal loop      │  │                         │
│ - Return citations   │  │                         │
│ - Raw evidence       │  │                         │
│ - Discovery events   │  │                         │
└──────────────────────┘  └─────────────────────────┘
```
@ -28,9 +30,9 @@ Marchwarden is a network of agentic researchers coordinated by a principal inves
Each researcher is a **standalone MCP server** that:
- Exposes a single tool: `research(question, context, depth, constraints)`
- Runs an internal agentic loop (plan → search → fetch → iterate → synthesize)
- Returns structured data: `answer`, `citations` (with raw evidence), `gaps` (categorized), `discovery_events`, `confidence` + `confidence_factors`, `cost_metadata`, `trace_id`
- Enforces budgets: iteration cap and token limit
- Logs all internal steps to JSONL trace files with content hashes

**V1 researcher**: Web search + fetch
- Uses Tavily for searching
@ -48,6 +50,8 @@ Marchwarden uses the **Model Context Protocol (MCP)** as the boundary between re
- **Clean contract** — one tool signature, versioned independently
- **Parallel dispatch** — PI can await multiple researchers simultaneously
**MCP constraint:** The protocol is JSON-RPC (request-response). A researcher cannot emit streaming events or notifications mid-loop. All output — including discovery events — is returned in the final response. This is a known V1 limitation; see Known Limitations below.
### CLI Shim

For V1, the CLI is the test harness that stands in for the PI:
@ -79,10 +83,26 @@ Each line is a JSON object:
}
```
For fetch actions, traces include a `content_hash` (SHA-256):
```json
{
  "step": 2,
  "action": "fetch_url",
  "url": "https://extension.usu.edu/gardening/utah-crops",
  "content_hash": "sha256:a3f2b8c91d...",
  "content_length": 14523,
  "timestamp": "2026-04-08T12:00:05Z",
  "decision": "Relevant to question; extracting crop data"
}
```
Traces support:
- **Auditing** — see exactly what the researcher did and decided
- **Change detection** — `content_hash` reveals if web sources changed between runs
- **Debugging** — diagnose why a researcher produced a particular answer
- **Future replay** — with Content Addressable Storage (V2+), traces become reproducible
**V1 note:** Traces are audit logs, not deterministic replays. True replay requires storing the full fetched content (CAS), not just its hash. See Known Limitations.
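As a sketch of how change detection across two audit runs might work, using the `content_hash` field (the JSONL field names follow the trace format above; the helper names are illustrative):

```python
import json

def fetch_hashes(trace_lines):
    """Map each fetched URL to its content_hash from JSONL trace lines."""
    hashes = {}
    for line in trace_lines:
        entry = json.loads(line)
        if entry.get("action") == "fetch_url":
            hashes[entry["url"]] = entry["content_hash"]
    return hashes

def changed_sources(old_trace, new_trace):
    """URLs fetched in both runs whose content hash differs."""
    old, new = fetch_hashes(old_trace), fetch_hashes(new_trace)
    return [url for url in old if url in new and old[url] != new[url]]
```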
## Data Flow
@ -96,33 +116,92 @@ MCP: research(question="What are ideal crops for Utah?", ...)
Researcher agent loop:
1. Plan: "I need climate data for Utah + crop requirements"
2. Search: Tavily query for "Utah climate zones crops"
3. Fetch: Read top 3 URLs (hash each for pseudo-CAS)
4. Parse: Extract relevant info, preserve raw excerpts
5. Synthesize: "Based on X sources, ideal crops are Y"
6. Check gaps: "Couldn't find pest info" → categorize as SOURCE_NOT_FOUND
7. Check discoveries: "Found reference to USU soil study" → emit DiscoveryEvent
8. Compute confidence + factors
9. Return if confident, else iterate
Response:
{
  "answer": "...",
  "citations": [
    {
      "source": "web",
      "locator": "https://...",
      "snippet": "...",
      "raw_excerpt": "verbatim text from source...",
      "confidence": 0.95
    }
  ],
  "gaps": [
    {
      "topic": "pest resistance",
      "category": "source_not_found",
      "detail": "No pest data found in general gardening sources"
    }
  ],
  "discovery_events": [
    {
      "type": "related_research",
      "suggested_researcher": "database",
      "query": "Utah soil salinity crop impact",
      "reason": "Multiple sources reference USU study data not available on web"
    }
  ],
  "confidence": 0.82,
  "confidence_factors": {
    "num_corroborating_sources": 3,
    "source_authority": "high",
    "contradiction_detected": false,
    "query_specificity_match": 0.85,
    "budget_exhausted": false,
    "recency": "current"
  },
  "cost_metadata": {
    "tokens_used": 8452,
    "iterations_run": 3,
    "wall_time_sec": 42.5,
    "budget_exhausted": false
  },
  "trace_id": "uuid-1234"
}

CLI: Print answer + citations + gaps + discoveries, save trace
```
## Design Decisions
### Raw Evidence (The Synthesis Paradox)
When the PI synthesizes answers from multiple researchers, it risks "recursive compression loss" — each researcher has already summarized the raw data, and the PI summarizes those summaries. Subtle nuances and contradictions can be smoothed away.
**Solution:** Every citation includes a `raw_excerpt` field — verbatim text from the source. The PI can verify claims against raw evidence, detect when researchers interpret the same source differently, and flag high-entropy points for human review.
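A sketch of one such check, flagging a locator that two citations excerpt differently (the helper name and dict shape are illustrative, not part of the contract):

```python
from collections import defaultdict

def divergent_locators(citations):
    """Locators cited with more than one distinct raw_excerpt --
    candidates for 'researchers read the same source differently'."""
    excerpts = defaultdict(set)
    for c in citations:
        excerpts[c["locator"]].add(c["raw_excerpt"])
    return [loc for loc, seen in excerpts.items() if len(seen) > 1]
```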
### Categorized Gaps
Gaps are not just "things we didn't find." Different gap categories demand different responses from the PI:
| Category | PI Response |
|:---|:---|
| `SOURCE_NOT_FOUND` | Accept the gap or try a different researcher |
| `ACCESS_DENIED` | Specialized fetcher or human intervention |
| `BUDGET_EXHAUSTED` | Re-dispatch with larger budget |
| `CONTRADICTORY_SOURCES` | Examine raw_excerpts, flag for human review |
| `SCOPE_EXCEEDED` | Dispatch the appropriate specialist |
### Discovery Events (Lateral Metadata)
A researcher often encounters information relevant to other researchers' domains. Rather than ignoring these findings (hub-and-spoke limitation) or acting on them (scope creep), the researcher logs them as `DiscoveryEvent` objects.
In V1, discovery events are logged for analysis. In V2, the PI orchestrator processes them dynamically, enabling mid-investigation dispatch of additional researchers.
### Content Hashing (Pseudo-CAS)
Every fetched URL produces a SHA-256 hash in the trace. This provides change detection (did the source change between runs?) without the storage overhead of full content archiving. It's the foundation for V2's Content Addressable Storage.
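A minimal sketch of producing that hash at fetch time, matching the `sha256:` prefix used in the trace examples:

```python
import hashlib

def content_hash(content: bytes) -> str:
    """SHA-256 of fetched content, in the trace's 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(content).hexdigest()
```

Comparing this value for the same URL across two runs is what makes the pseudo-CAS change detection possible.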
## Contract Versioning

The `research()` tool signature is the stable contract. Changes to the contract require explicit versioning so that:
@ -130,7 +209,7 @@ The `research()` tool signature is the stable contract. Changes to the contract
- The PI knows what version it's calling
- Backwards compatibility (or breaking changes) is explicit

See [ResearchContract](/wiki/ResearchContract) for the full spec.
## Future: The PI Agent ## Future: The PI Agent
@ -138,37 +217,59 @@ V2 will introduce the orchestrator:
```python
class PIAgent:
    async def research_topic(self, question: str) -> Answer:
        # Dispatch multiple researchers in parallel
        web_results, arxiv_results = await asyncio.gather(
            self.web_researcher.research(question),
            self.arxiv_researcher.research(question),
        )
        all_results = [web_results, arxiv_results]

        # Process discovery events from both
        for event in web_results.discovery_events + arxiv_results.discovery_events:
            if self.should_dispatch(event):
                additional = await self.dispatch_researcher(event)
                all_results.append(additional)

        # Synthesize using raw_excerpts for ground-truth verification
        return self.synthesize(all_results)
```
The PI:
- Decides which researchers to dispatch (initially in parallel)
- Processes discovery events and dispatches follow-ups
- Compares raw_excerpts across researchers to detect contradictions
- Uses gap categories to decide whether to re-dispatch or accept
- Synthesizes into a final answer with full provenance

## Assumptions & Constraints

- **Citation grounding is structural, not assumed** — `raw_excerpt` provides verifiable evidence. Citation validation (programmatic URL ping) is V2 work. V1 relies on the researcher having actually fetched the source.
- **Tavily API is available** — for V1 web search. Degradation strategy: note in gaps with `ACCESS_DENIED` category.
- **Token budgets are enforced** — the MCP server enforces at the process level, not just the agent level.
- **Traces are audit logs** — stored locally, hashed for integrity, but not full content archives (V2).
- **No multi-user** — single-user CLI for V1.
- **Confidence is directional** — LLM-generated with exposed factors; formal calibration after V1 data collection.
## Known Limitations (V1)
| Limitation | Rationale | Future Resolution |
|:---|:---|:---|
| No citation validation | Adds latency; document as known risk | V2: Validator node pings URLs/DOIs |
| Traces are audit logs, not replays | True replay requires CAS for fetched content | V2: Content Addressable Storage |
| Discovery events are logged only | MCP is request-response; no mid-flight dispatch | V2: PI processes events dynamically |
| No streaming of progress | MCP tool responses are one-shot | V2+: Streaming MCP or polling pattern |
| Hub-and-spoke only | V1 simplicity; PI is only coordinator | V2: Dynamic priority queue in PI |
| Confidence not calibrated | Need empirical data first | V1.1: Rubric after 20-30 queries |
## Terminology

- **Researcher**: An agentic system specialized in a domain or source type
- **Marchwarden**: The researcher metaphor — stationed at the frontier, reporting back
- **Discovery Event**: A lateral finding relevant to another researcher's domain
- **Trace**: A JSONL audit log of all decisions made during one research call
- **Gap**: An unresolved aspect of the question, categorized by cause
- **Raw Excerpt**: Verbatim text from a source, bypassing researcher synthesis
- **Content Hash**: SHA-256 of fetched content, enabling change detection
---

@ -2,6 +2,8 @@
This document defines the `research()` tool that all Marchwarden researchers implement. It is the stable contract between a researcher MCP server and its caller (the PI or CLI).
**Contract version: v1**
## Tool Signature

```python
@ -52,27 +54,32 @@ class ResearchConstraints:
If not provided, defaults are:
- `max_iterations`: 5
- `token_budget`: 20000
- `max_sources`: 10

The MCP server **enforces** these constraints and will stop the researcher if they exceed them.
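The defaults and the enforcement check might be sketched as follows (the field names follow the defaults listed above; `must_stop` is an illustrative helper, not part of the contract):

```python
from dataclasses import dataclass

@dataclass
class ResearchConstraints:
    max_iterations: int = 5
    token_budget: int = 20000
    max_sources: int = 10

def must_stop(constraints, tokens_used, iterations_run):
    """True when the MCP server should cut the researcher off."""
    return (iterations_run >= constraints.max_iterations
            or tokens_used >= constraints.token_budget)
```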
---
## Output: ResearchResult
```python
@dataclass
class ResearchResult:
    answer: str                              # The synthesized answer
    citations: List[Citation]                # Sources used, with raw evidence
    gaps: List[Gap]                          # What couldn't be resolved (categorized)
    discovery_events: List[DiscoveryEvent]   # Lateral findings for other researchers
    confidence: float                        # 0.0–1.0 overall confidence
    confidence_factors: ConfidenceFactors    # What fed the confidence score
    cost_metadata: CostMetadata              # Resource usage
    trace_id: str                            # UUID linking to JSONL trace log
```
---
### `answer` (string)
The synthesized answer. Should be:
- **Grounded** — every claim traces back to a citation
- **Humble** — includes caveats and confidence levels
@ -98,7 +105,9 @@ location.
See sources below for varietal recommendations by specific county.
```

---
### `citations` (list of Citation objects)
```python
@dataclass
@ -106,18 +115,34 @@ class Citation:
    source: str              # "web", "file", "database", etc
    locator: str             # URL, file path, row ID, or unique identifier
    title: Optional[str]     # Human-readable title (for web)
    snippet: Optional[str]   # Researcher's summary of relevant content (50–200 chars)
    raw_excerpt: str         # Verbatim text from the source (up to 500 chars)
    confidence: float        # 0.0–1.0: researcher's confidence in this source's accuracy
```
#### The `raw_excerpt` field
The `raw_excerpt` is a **verbatim copy** of the relevant passage from the source, bypassing the researcher's synthesis. This prevents the "Synthesis Paradox" — when the PI synthesizes already-synthesized data, subtle nuances and contradictions get smoothed away.
The PI uses `raw_excerpt` to:
- Perform its own ground-truth verification
- Detect when two researchers interpret the same evidence differently
- Identify high-entropy points where human review is needed
**Rules:**
- Must be copied verbatim from the fetched source (no paraphrasing)
- Up to 500 characters; truncate with `[...]` if longer
- If the source cannot be excerpted (e.g., image, binary), set to `"[non-text source]"`
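The truncation rule could look like this (a sketch; the helper name is illustrative, and it assumes the `[...]` marker counts toward the 500-character cap):

```python
def make_raw_excerpt(passage: str, limit: int = 500) -> str:
    """Verbatim excerpt, truncated with the contract's '[...]' marker."""
    if len(passage) <= limit:
        return passage
    return passage[: limit - len("[...]")] + "[...]"
```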
Example:

```python
Citation(
    source="web",
    locator="https://extension.usu.edu/gardening/research/utah-crops",
    title="USU Extension: Utah Crop Guide",
    snippet="Cool-season crops thrive above 7,000 feet; irrigation essential.",
    raw_excerpt="In Utah's high-elevation gardens (above 7,000 ft), cool-season vegetables such as peas, lettuce, spinach, and potatoes consistently outperform warm-season crops. Average growing season at these elevations is 90-120 days. Supplemental irrigation is essential; natural precipitation averages 12-16 inches annually, well below the 20-30 inches most vegetable crops require.",
    confidence=0.92
)
```
@ -125,42 +150,159 @@ Citations must be:
- **Verifiable** — a human can follow the locator and confirm the claim
- **Not hallucinated** — the researcher actually read/fetched the source
- **Attributed** — each claim in `answer` should link to at least one citation
- **Evidence-bearing**`raw_excerpt` must contain the actual text that supports the claim
---
### `gaps` (list of Gap objects)
```python
class GapCategory(str, Enum):
    SOURCE_NOT_FOUND = "source_not_found"            # No relevant sources exist
    ACCESS_DENIED = "access_denied"                  # Paywall, robots.txt, auth required
    BUDGET_EXHAUSTED = "budget_exhausted"            # Ran out of iterations or tokens
    CONTRADICTORY_SOURCES = "contradictory_sources"  # Sources disagree, can't resolve
    SCOPE_EXCEEDED = "scope_exceeded"                # Question requires a different researcher type

@dataclass
class Gap:
    topic: str             # What aspect wasn't resolved
    category: GapCategory  # Structured reason category
    detail: str            # Human-readable explanation
```
#### Gap categories
| Category | Meaning | PI Action |
|:---|:---|:---|
| `SOURCE_NOT_FOUND` | The information doesn't appear to exist in this researcher's domain | Dispatch a different researcher, or accept the gap |
| `ACCESS_DENIED` | Source exists but is behind a paywall, login, or robots.txt | May need a specialized fetcher or human intervention |
| `BUDGET_EXHAUSTED` | Researcher hit iteration or token cap before resolving this | Re-dispatch with a larger budget, or accept partial answer |
| `CONTRADICTORY_SOURCES` | Multiple sources disagree and the researcher can't resolve the conflict | PI should examine raw_excerpts and flag for human review |
| `SCOPE_EXCEEDED` | The question requires capabilities this researcher doesn't have (e.g., web researcher finds a paper DOI but can't access arxiv) | Dispatch the appropriate specialist |
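In the PI, that table might reduce to a small dispatch routine (a sketch; the PI method names are assumptions, and `GapCategory` is repeated from above so the snippet is self-contained):

```python
from enum import Enum

class GapCategory(str, Enum):   # as defined in the contract above
    SOURCE_NOT_FOUND = "source_not_found"
    ACCESS_DENIED = "access_denied"
    BUDGET_EXHAUSTED = "budget_exhausted"
    CONTRADICTORY_SOURCES = "contradictory_sources"
    SCOPE_EXCEEDED = "scope_exceeded"

def handle_gap(pi, gap):
    """Route a categorized gap to the response suggested by the table above."""
    if gap.category == GapCategory.SCOPE_EXCEEDED:
        return pi.dispatch_specialist(gap)
    if gap.category == GapCategory.BUDGET_EXHAUSTED:
        return pi.redispatch_with_larger_budget(gap)
    if gap.category in (GapCategory.CONTRADICTORY_SOURCES, GapCategory.ACCESS_DENIED):
        return pi.flag_for_human_review(gap)
    return pi.accept_gap(gap)   # SOURCE_NOT_FOUND: the info may simply not exist
```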
Example:

```python
[
    Gap(
        topic="pest management by county",
        category=GapCategory.SOURCE_NOT_FOUND,
        detail="No county-specific pest data found in general web sources"
    ),
    Gap(
        topic="2026 varietal trial results",
        category=GapCategory.ACCESS_DENIED,
        detail="USU extension database requires institutional login"
    ),
    Gap(
        topic="soil pH requirements by crop",
        category=GapCategory.BUDGET_EXHAUSTED,
        detail="Identified relevant sources but hit iteration cap before extraction"
    ),
]
```
Gaps are **critical for the PI**. They tell the orchestrator:
- Whether to dispatch a different researcher (SCOPE_EXCEEDED)
- Whether to retry with more budget (BUDGET_EXHAUSTED)
- Whether to escalate to human review (CONTRADICTORY_SOURCES, ACCESS_DENIED)
- Whether to accept the answer as-is (SOURCE_NOT_FOUND — info may not exist)
A researcher that admits and categorizes gaps is more trustworthy than one that fabricates answers.
---

### `discovery_events` (list of DiscoveryEvent objects)
```python
@dataclass
class DiscoveryEvent:
    type: str                             # "related_research", "new_source", "contradiction"
    suggested_researcher: Optional[str]   # "arxiv", "database", "legal", etc.
    query: str                            # Suggested query for the target researcher
    reason: str                           # Why this is relevant
    source_locator: Optional[str]         # Where the discovery was found (URL, DOI, etc.)
```
Discovery events capture **lateral findings** — things the researcher noticed that fall outside its own scope but are relevant to the overall investigation. They are the "nervous system" for the future V2 orchestrator.
In V1, the CLI logs these events for analysis. In V2, the PI uses them to dynamically dispatch additional researchers mid-investigation.
Example:
```python
[
    DiscoveryEvent(
        type="related_research",
        suggested_researcher="arxiv",
        query="Utah agricultural extension soil salinity studies 2024-2026",
        reason="Multiple web sources reference USU soil salinity research but don't include the data",
        source_locator="https://extension.usu.edu/news/2025/soil-salinity-study"
    ),
    DiscoveryEvent(
        type="contradiction",
        suggested_researcher=None,
        query="tomato growing season length Utah valley vs mountain",
        reason="Two sources disagree on whether tomatoes are viable above 6,000 ft",
        source_locator=None
    ),
]
```
**Rules:**
- Discovery events are **informational only** — the researcher does not act on them
- The researcher should not generate discovery events speculatively; each must be grounded in something encountered during the research loop
- The PI decides whether to act on discovery events; the researcher does not second-guess
---
### `confidence` (float, 0.01.0)
Overall confidence in the answer. Accompanied by `confidence_factors` to prevent "vibe check" scoring.
General ranges:
- `0.91.0`: High. Multiple corroborating sources, strong authority, no contradictions.
- `0.7–0.9`: Moderate. Most claims grounded; some inference; minor contradictions resolved.
- `0.5–0.7`: Low. Few direct sources; significant synthesis; clear gaps.
- `< 0.5`: Very low. Mainly inference; major gaps; likely needs human review.

The PI uses this to decide whether to act on the answer or seek more sources.
**V1 note:** The confidence score is produced by the LLM researcher and is not yet calibrated against empirical data. Treat it as directional, not precise. Calibration rubric will be formalized after sufficient V1 testing data is collected.
---
### `confidence_factors` (object)
```python
@dataclass
class ConfidenceFactors:
    num_corroborating_sources: int   # How many sources agree
    source_authority: str            # "high" (.gov, .edu, peer-reviewed), "medium" (established orgs), "low" (blogs, forums)
    contradiction_detected: bool     # Were conflicting claims found?
    query_specificity_match: float   # 0.0–1.0: did results address the actual question?
    budget_exhausted: bool           # Hard penalty if true
    recency: Optional[str]           # "current" (< 1yr), "recent" (1-3yr), "dated" (> 3yr), None if unknown
```
The `confidence_factors` dict exposes *why* the researcher chose a particular confidence score. This serves two purposes:
1. **Auditability** — a human or PI can verify that the score is reasonable
2. **Calibration data** — after 20-30 real queries, these factors become the basis for a formal confidence rubric
Example:
```python
ConfidenceFactors(
    num_corroborating_sources=4,
    source_authority="high",
    contradiction_detected=False,
    query_specificity_match=0.85,
    budget_exhausted=False,
    recency="current"
)
```
---
### `cost_metadata` (object)
```python
@dataclass
@ -186,7 +328,9 @@ The PI uses this to:
- Detect runaway loops (budget_exhausted = True)
- Plan timeouts (wall_time_sec tells you if this is acceptable latency)
---
### `trace_id` (string, UUID)
A unique identifier linking to the JSONL trace log:
@ -194,7 +338,30 @@ A unique identifier linking to the JSONL trace log:
~/.marchwarden/traces/{trace_id}.jsonl
```

The trace contains every decision, search, fetch, parse step for debugging and audit.
#### Trace entries and content hashing
Each trace entry for a fetched source includes a `content_hash` (SHA-256 of the fetched content). This provides a **pseudo-CAS** (Content Addressable Storage) capability:
```json
{
  "step": 2,
  "action": "fetch_url",
  "url": "https://extension.usu.edu/gardening/research/utah-crops",
  "content_hash": "sha256:a3f2b8c91d...",
  "content_length": 14523,
  "timestamp": "2026-04-08T12:00:05Z",
  "decision": "Relevant to question; extracting crop data"
}
```
The `content_hash` enables:
- **Change detection** — comparing two audit runs to see if the underlying source changed
- **Integrity verification** — confirming the raw_excerpt came from the content that was actually fetched
- **Future CAS** — when full content storage is added (V2+), the hash becomes the content address
**V1 limitation:** We store the hash, not the full content. True replay requires Content Addressable Storage (V2+). V1 traces are "audit logs," not deterministic replays.
---
@ -203,16 +370,22 @@ The trace contains every decision, search, fetch, parse step for debugging and r
### The Researcher Must

1. **Never hallucinate citations.** If a claim isn't in a source, don't cite it.
2. **Provide raw evidence.** Every citation must include a `raw_excerpt` copied verbatim from the source.
3. **Admit and categorize gaps.** If you can't find something, say so with the appropriate `GapCategory`.
4. **Report lateral discoveries.** If you encounter something relevant to another researcher's domain, emit a `DiscoveryEvent`.
5. **Respect budgets.** Stop iterating if `max_iterations` or `token_budget` is reached. Reflect in `budget_exhausted`.
6. **Ground claims.** Every factual claim in `answer` must link to at least one citation.
7. **Explain confidence.** Populate `confidence_factors` honestly; do not inflate scores.
8. **Hash fetched content.** Every URL/source fetch in the trace must include a `content_hash`.
9. **Handle failures gracefully.** If Tavily is down or a URL is broken, note it in `gaps` with the appropriate category and continue with what you have.
### The Caller (PI/CLI) Must

1. **Accept partial answers.** A researcher that hits its budget but admits gaps is better than one that spins endlessly.
2. **Use confidence and gaps.** Don't treat a 0.6 confidence answer the same as a 0.95 confidence answer.
3. **Check raw_excerpts.** For important decisions, verify claims against `raw_excerpt` before acting.
4. **Process discovery_events.** Log them (V1) or dispatch additional researchers (V2+).
5. **Respect gap categories.** Use the category to decide the appropriate response (retry, re-dispatch, escalate, accept).
---
@ -238,11 +411,21 @@ Response:
"locator": "https://en.wikipedia.org/wiki/Paris",
"title": "Paris - Wikipedia",
"snippet": "Paris is the capital and largest city of France",
"raw_excerpt": "Paris is the capital and most populous city of France, with an estimated population of 2,102,650 residents as of 1 January 2023.",
"confidence": 0.99
}
],
"gaps": [],
"discovery_events": [],
"confidence": 0.99,
"confidence_factors": {
"num_corroborating_sources": 1,
"source_authority": "high",
"contradiction_detected": false,
"query_specificity_match": 1.0,
"budget_exhausted": false,
"recency": "current"
},
"cost_metadata": {
"tokens_used": 450,
"iterations_run": 1,
@ -253,7 +436,7 @@ Response:
}
```
### Example 2: Partial Answer with Gaps and Discoveries

Request:
```json
@ -270,23 +453,43 @@ Response:
"citations": [
{
"source": "web",
"locator": "https://www.crunchbase.com/hub/crispr-startups",
"title": "Crunchbase: CRISPR Startups",
"snippet": "Editas, Beam Therapeutics, and CRISPR Therapeutics lead the field.",
"raw_excerpt": "The CRISPR gene-editing space has attracted over $12B in venture funding since 2018. Key players include Editas Medicine (EDIT), Beam Therapeutics (BEAM), CRISPR Therapeutics (CRSP), and newer entrants Prime Medicine and Verve Therapeutics.",
"confidence": 0.8
}
],
"gaps": [
{
"topic": "funding rounds in 2026",
"category": "source_not_found",
"detail": "Web sources only go through Q1 2026; most recent rounds may not be indexed yet"
}, },
{ {
"topic": "clinical trial status", "topic": "clinical trial status",
"reason": "Requires access to clinical trials database (outside web search scope)" "category": "scope_exceeded",
"detail": "Requires access to ClinicalTrials.gov database; outside web search scope"
}
],
"discovery_events": [
{
"type": "related_research",
"suggested_researcher": "database",
"query": "CRISPR gene therapy clinical trials Phase I II III 2025-2026",
"reason": "Multiple sources reference ongoing trials but web results don't include trial data",
"source_locator": "https://www.crunchbase.com/hub/crispr-startups"
} }
], ],
"confidence": 0.72, "confidence": 0.72,
"confidence_factors": {
"num_corroborating_sources": 3,
"source_authority": "medium",
"contradiction_detected": false,
"query_specificity_match": 0.75,
"budget_exhausted": false,
"recency": "recent"
},
"cost_metadata": { "cost_metadata": {
"tokens_used": 19240, "tokens_used": 19240,
"iterations_run": 4, "iterations_run": 4,
@@ -297,7 +500,7 @@ Response:
}
```
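In V1 the top-level `confidence` is LLM-generated (see Known Limitations), but the exposed `confidence_factors` are exactly what a future calibrated scorer would consume. A rough sketch with placeholder weights, purely to show the intended direction; V1 does not compute confidence this way:

```python
# Hypothetical calibration sketch: combine the exposed confidence_factors
# into a score. Every weight and formula here is a placeholder assumption.

AUTHORITY = {"high": 1.0, "medium": 0.8, "low": 0.5}
RECENCY = {"current": 1.0, "recent": 0.9, "dated": 0.7}

def score_from_factors(f: dict) -> float:
    # Base score grows with corroboration, capped at 1.0.
    score = min(1.0, 0.5 + 0.1 * f["num_corroborating_sources"])
    score *= AUTHORITY[f["source_authority"]] * RECENCY[f["recency"]]
    score *= f["query_specificity_match"]
    if f["contradiction_detected"]:
        score *= 0.8  # penalize unresolved conflicts
    if f["budget_exhausted"]:
        score *= 0.9  # penalize truncated research
    return round(score, 2)
```

Once 20-30 real queries are logged (per the V1.1 plan in the limitations table), placeholder weights like these could be fitted against observed answer quality.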
### Example 3: Budget Exhausted with Contradictions

Request:

```json
@@ -316,19 +519,45 @@ Response:
{
  "answer": "The history of AI spans multiple eras... [partial answer, cut off mid-synthesis]",
  "citations": [
    {
      "source": "web",
      "locator": "https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
      "title": "History of AI - Wikipedia",
      "snippet": "AI research began at a 1956 Dartmouth workshop.",
      "raw_excerpt": "The field of AI research was founded at a workshop held on the campus of Dartmouth College, USA during the summer of 1956. Those who attended would become the leaders of AI research for decades.",
      "confidence": 0.95
    }
  ],
  "gaps": [
    {
      "topic": "detailed timeline 2020-2026",
      "category": "budget_exhausted",
      "detail": "Identified relevant sources but hit iteration cap before extraction"
    },
    {
      "topic": "AI winter causes and resolution",
      "category": "budget_exhausted",
      "detail": "Found conflicting narratives; needed more iterations to resolve"
    }
  ],
  "discovery_events": [
    {
      "type": "related_research",
      "suggested_researcher": "arxiv",
      "query": "survey papers history of deep learning 2020-2026",
      "reason": "Web sources are shallow on recent technical developments; academic surveys would be more authoritative",
      "source_locator": null
    }
  ],
  "confidence": 0.55,
  "confidence_factors": {
    "num_corroborating_sources": 2,
    "source_authority": "high",
    "contradiction_detected": true,
    "query_specificity_match": 0.5,
    "budget_exhausted": true,
    "recency": "dated"
  },
  "cost_metadata": {
    "tokens_used": 4998,
    "iterations_run": 3,
@@ -341,9 +570,28 @@ Response:
}
```

---
## Known Limitations (V1)
These are documented architectural decisions, not oversights:
| Limitation | Rationale | Future Resolution |
|:---|:---|:---|
| Confidence is LLM-generated, not calibrated | Need empirical data before formalizing rubric | V1.1: Calibrate after 20-30 real queries |
| No citation validation (URL/DOI ping) | Adds latency and complexity; document as known risk | V2: Validator node that programmatically verifies locators |
| Traces are audit logs, not true replays | True replay requires CAS for fetched content | V2: Content Addressable Storage for all fetched data |
| Discovery events are logged, not acted on | MCP is request-response; no mid-flight dispatch | V2: PI orchestrator processes events dynamically |
| No streaming of intermediate progress | MCP tool responses are one-shot | V2+: Evaluate streaming MCP or polling pattern |
| Hub-and-spoke only (no inter-researcher comms) | Keeps V1 simple; PI is the only coordinator | V2: Dynamic priority queue in PI |
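The pseudo-CAS stopgap behind the trace-replay row works by recording a SHA-256 `content_hash` for each fetched payload, so a later re-fetch can detect drift without storing full bodies. A minimal sketch; the helper names and trace-entry shape are assumptions:

```python
import hashlib

# Pseudo-CAS change detection: a trace entry stores only the SHA-256 of the
# fetched bytes, so a re-fetch can be compared without a full content store.
# This detects drift; it cannot reconstruct the original content (no true replay).

def content_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()

def content_changed(trace_entry: dict, refetched_body: bytes) -> bool:
    """True if the source drifted since the trace was recorded."""
    return trace_entry["content_hash"] != content_hash(refetched_body)
```

A V2 Content Addressable Storage layer would instead key the full fetched body by this same hash, upgrading drift detection into true replay.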
---
## Versioning
The contract is versioned as `v1`. If breaking changes are needed (e.g., removing a required field or changing its type), the next version becomes `v2` and both can coexist in the network for a transition period.
**Backward-compatible changes** (adding optional fields) do not require a version bump.
**Breaking changes** (removing fields, changing types, changing required/optional status) require a new version.
Current version: **v1**
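The compatibility rule above can be stated mechanically. A sketch, assuming a flat field-name view of the contract (type changes, which are also breaking, are omitted for brevity; the field list is an illustrative subset):

```python
# Sketch of the versioning rule: a revision stays "v1" only if every
# existing v1 field survives; adding new optional fields is compatible.

V1_FIELDS = {"answer", "citations", "gaps", "discovery_events",
             "confidence", "confidence_factors", "cost_metadata", "trace_id"}

def is_backward_compatible(new_fields: set) -> bool:
    """Adding fields is compatible; removing any v1 field forces a v2."""
    return V1_FIELDS <= new_fields
```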