Update contract: raw_excerpt, categorized gaps, discovery events, confidence factors, content hashing

Incorporates architectural critique:
- raw_excerpt on citations prevents synthesis paradox (double-summarization)
- Gap categories (SOURCE_NOT_FOUND, ACCESS_DENIED, BUDGET_EXHAUSTED, CONTRADICTORY_SOURCES, SCOPE_EXCEEDED) enable PI to choose correct response
- discovery_events capture lateral findings for V2 orchestrator
- confidence_factors expose scoring inputs for future calibration
- content_hash (SHA-256) on trace fetches enables pseudo-CAS change detection
- Known Limitations table documents intentional V1 tradeoffs

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

parent 36bfe1813a
commit 7ad91b7ca9
2 changed files with 434 additions and 85 deletions

Architecture.md (165 changed lines)
@@ -18,6 +18,8 @@ Marchwarden is a network of agentic researchers coordinated by a principal inves
 │ - Fetch URLs         │          │                         │
 │ - Internal loop      │          │                         │
 │ - Return citations   │          │                         │
+│ - Raw evidence       │          │                         │
+│ - Discovery events   │          │                         │
 └──────────────────────┘          └─────────────────────────┘
 ```

@@ -28,9 +30,9 @@ Marchwarden is a network of agentic researchers coordinated by a principal inves
 Each researcher is a **standalone MCP server** that:
 - Exposes a single tool: `research(question, context, depth, constraints)`
 - Runs an internal agentic loop (plan → search → fetch → iterate → synthesize)
-- Returns structured data: `answer`, `citations`, `gaps`, `cost_metadata`, `trace_id`
+- Returns structured data: `answer`, `citations` (with raw evidence), `gaps` (categorized), `discovery_events`, `confidence` + `confidence_factors`, `cost_metadata`, `trace_id`
 - Enforces budgets: iteration cap and token limit
-- Logs all internal steps to JSONL trace files
+- Logs all internal steps to JSONL trace files with content hashes

 **V1 researcher**: Web search + fetch
 - Uses Tavily for searching
@@ -48,6 +50,8 @@ Marchwarden uses the **Model Context Protocol (MCP)** as the boundary between re
 - **Clean contract** — one tool signature, versioned independently
 - **Parallel dispatch** — PI can await multiple researchers simultaneously

+**MCP constraint:** The protocol is JSON-RPC (request-response). A researcher cannot emit streaming events or notifications mid-loop. All output — including discovery events — is returned in the final response. This is a known V1 limitation; see Known Limitations below.
+
 ### CLI Shim

 For V1, the CLI is the test harness that stands in for the PI:
@@ -79,10 +83,26 @@ Each line is a JSON object:
 }
 ```

+For fetch actions, traces include a `content_hash` (SHA-256):
+```json
+{
+  "step": 2,
+  "action": "fetch_url",
+  "url": "https://extension.usu.edu/gardening/utah-crops",
+  "content_hash": "sha256:a3f2b8c91d...",
+  "content_length": 14523,
+  "timestamp": "2026-04-08T12:00:05Z",
+  "decision": "Relevant to question; extracting crop data"
+}
+```
+
 Traces support:
-- **Debugging** — see exactly what the researcher did
-- **Replay** — re-run a past session, same results
-- **Eval** — audit decision-making
+- **Auditing** — see exactly what the researcher did and decided
+- **Change detection** — `content_hash` reveals if web sources changed between runs
+- **Debugging** — diagnose why a researcher produced a particular answer
+- **Future replay** — with Content Addressable Storage (V2+), traces become reproducible
+
+**V1 note:** Traces are audit logs, not deterministic replays. True replay requires storing the full fetched content (CAS), not just its hash. See Known Limitations.

 ## Data Flow

@@ -96,33 +116,92 @@ MCP: research(question="What are ideal crops for Utah?", ...)
 Researcher agent loop:
   1. Plan: "I need climate data for Utah + crop requirements"
   2. Search: Tavily query for "Utah climate zones crops"
-  3. Fetch: Read top 3 URLs
-  4. Parse: Extract relevant info
+  3. Fetch: Read top 3 URLs (hash each for pseudo-CAS)
+  4. Parse: Extract relevant info, preserve raw excerpts
   5. Synthesize: "Based on X sources, ideal crops are Y"
-  6. Check gaps: "Couldn't find pest info"
-  7. Return if confident, else iterate
+  6. Check gaps: "Couldn't find pest info" → categorize as SOURCE_NOT_FOUND
+  7. Check discoveries: "Found reference to USU soil study" → emit DiscoveryEvent
+  8. Compute confidence + factors
+  9. Return if confident, else iterate
   ↓
 Response:
 {
   "answer": "...",
   "citations": [
-    {"source": "web", "locator": "https://...", "snippet": "...", "confidence": 0.95},
-    ...
+    {
+      "source": "web",
+      "locator": "https://...",
+      "snippet": "...",
+      "raw_excerpt": "verbatim text from source...",
+      "confidence": 0.95
+    }
   ],
   "gaps": [
-    {"topic": "pest resistance", "reason": "no sources found"},
+    {
+      "topic": "pest resistance",
+      "category": "source_not_found",
+      "detail": "No pest data found in general gardening sources"
+    }
   ],
+  "discovery_events": [
+    {
+      "type": "related_research",
+      "suggested_researcher": "database",
+      "query": "Utah soil salinity crop impact",
+      "reason": "Multiple sources reference USU study data not available on web"
+    }
+  ],
   "confidence": 0.82,
+  "confidence_factors": {
+    "num_corroborating_sources": 3,
+    "source_authority": "high",
+    "contradiction_detected": false,
+    "query_specificity_match": 0.85,
+    "budget_exhausted": false,
+    "recency": "current"
+  },
   "cost_metadata": {
     "tokens_used": 8452,
     "iterations_run": 3,
-    "wall_time_sec": 42.5
+    "wall_time_sec": 42.5,
+    "budget_exhausted": false
   },
   "trace_id": "uuid-1234"
 }
   ↓
-CLI: Print answer + citations, save trace
+CLI: Print answer + citations + gaps + discoveries, save trace
 ```

+## Design Decisions
+
+### Raw Evidence (The Synthesis Paradox)
+
+When the PI synthesizes answers from multiple researchers, it risks "recursive compression loss" — each researcher has already summarized the raw data, and the PI summarizes those summaries. Subtle nuances and contradictions can be smoothed away.
+
+**Solution:** Every citation includes a `raw_excerpt` field — verbatim text from the source. The PI can verify claims against raw evidence, detect when researchers interpret the same source differently, and flag high-entropy points for human review.
+
+### Categorized Gaps
+
+Gaps are not just "things we didn't find." Different gap categories demand different responses from the PI:
+
+| Category | PI Response |
+|:---|:---|
+| `SOURCE_NOT_FOUND` | Accept the gap or try a different researcher |
+| `ACCESS_DENIED` | Specialized fetcher or human intervention |
+| `BUDGET_EXHAUSTED` | Re-dispatch with larger budget |
+| `CONTRADICTORY_SOURCES` | Examine raw_excerpts, flag for human review |
+| `SCOPE_EXCEEDED` | Dispatch the appropriate specialist |

+### Discovery Events (Lateral Metadata)
+
+A researcher often encounters information relevant to other researchers' domains. Rather than ignoring these findings (hub-and-spoke limitation) or acting on them (scope creep), the researcher logs them as `DiscoveryEvent` objects.
+
+In V1, discovery events are logged for analysis. In V2, the PI orchestrator processes them dynamically, enabling mid-investigation dispatch of additional researchers.
+
+### Content Hashing (Pseudo-CAS)
+
+Every fetched URL produces a SHA-256 hash in the trace. This provides change detection (did the source change between runs?) without the storage overhead of full content archiving. It's the foundation for V2's Content Addressable Storage.
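A minimal sketch of the hashing described above, assuming the trace's `sha256:<hex>` form; the function names are illustrative:

```python
import hashlib

def content_hash(content: bytes) -> str:
    """SHA-256 of fetched content, in the trace's "sha256:<hex>" form."""
    return "sha256:" + hashlib.sha256(content).hexdigest()

def source_changed(old_hash: str, new_content: bytes) -> bool:
    """Change detection: compare a prior run's hash against a fresh fetch."""
    return content_hash(new_content) != old_hash
```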

 ## Contract Versioning

 The `research()` tool signature is the stable contract. Changes to the contract require explicit versioning so that:

@@ -130,7 +209,7 @@ The `research()` tool signature is the stable contract. Changes to the contract
 - The PI knows what version it's calling
 - Backwards compatibility (or breaking changes) is explicit

-See [ResearchContract.md](ResearchContract.md) for the full spec.
+See [ResearchContract](/wiki/ResearchContract) for the full spec.

 ## Future: The PI Agent

@@ -138,37 +217,59 @@ V2 will introduce the orchestrator:

 ```python
 class PIAgent:
-    def research_topic(self, question: str) -> Answer:
+    async def research_topic(self, question: str) -> Answer:
         # Dispatch multiple researchers in parallel
-        web_results = await self.web_researcher.research(question)
-        arxiv_results = await self.arxiv_researcher.research(question)
+        web_results, arxiv_results = await asyncio.gather(
+            self.web_researcher.research(question),
+            self.arxiv_researcher.research(question),
+        )
+        all_results = [web_results, arxiv_results]

-        # Synthesize
-        return self.synthesize([web_results, arxiv_results])
+        # Process discovery events from both
+        for event in web_results.discovery_events + arxiv_results.discovery_events:
+            if self.should_dispatch(event):
+                additional = await self.dispatch_researcher(event)
+                all_results.append(additional)
+
+        # Synthesize using raw_excerpts for ground-truth verification
+        return self.synthesize(all_results)
 ```

 The PI:
-- Decides which researchers to dispatch
-- Waits for all responses
-- Checks for conflicts, gaps, consensus
-- Synthesizes into a final answer
-- Can re-dispatch if gaps are critical
+- Decides which researchers to dispatch (initially in parallel)
+- Processes discovery events and dispatches follow-ups
+- Compares raw_excerpts across researchers to detect contradictions
+- Uses gap categories to decide whether to re-dispatch or accept
+- Synthesizes into a final answer with full provenance

 ## Assumptions & Constraints

-- **Researchers are honest** — they don't hallucinate citations. If they cite something, it exists in the source.
-- **Tavily API is available** — for V1 web search. Degradation strategy TBD.
-- **Token budgets are enforced** — the researcher respects its budget; the MCP server enforces it at the process level.
-- **Traces are ephemeral** — stored locally for debugging, not synced to a database yet.
+- **Citation grounding is structural, not assumed** — `raw_excerpt` provides verifiable evidence. Citation validation (programmatic URL ping) is V2 work. V1 relies on the researcher having actually fetched the source.
+- **Tavily API is available** — for V1 web search. Degradation strategy: note in gaps with `ACCESS_DENIED` category.
+- **Token budgets are enforced** — the MCP server enforces at the process level, not just the agent level.
+- **Traces are audit logs** — stored locally, hashed for integrity, but not full content archives (V2).
 - **No multi-user** — single-user CLI for V1.
+- **Confidence is directional** — LLM-generated with exposed factors; formal calibration after V1 data collection.
+
+## Known Limitations (V1)
+
+| Limitation | Rationale | Future Resolution |
+|:---|:---|:---|
+| No citation validation | Adds latency; document as known risk | V2: Validator node pings URLs/DOIs |
+| Traces are audit logs, not replays | True replay requires CAS for fetched content | V2: Content Addressable Storage |
+| Discovery events are logged only | MCP is request-response; no mid-flight dispatch | V2: PI processes events dynamically |
+| No streaming of progress | MCP tool responses are one-shot | V2+: Streaming MCP or polling pattern |
+| Hub-and-spoke only | V1 simplicity; PI is only coordinator | V2: Dynamic priority queue in PI |
+| Confidence not calibrated | Need empirical data first | V1.1: Rubric after 20-30 queries |
+
 ## Terminology

 - **Researcher**: An agentic system specialized in a domain or source type
 - **Marchwarden**: The researcher metaphor — stationed at the frontier, reporting back
 - **Rihla**: (V2+) A unit of research work dispatched by the PI; one researcher's journey to answer a question
-- **Trace**: A JSONL log of all decisions made during one research call
-- **Gap**: An unresolved aspect of the question; the researcher couldn't find an answer
+- **Discovery Event**: A lateral finding relevant to another researcher's domain
+- **Trace**: A JSONL audit log of all decisions made during one research call
+- **Gap**: An unresolved aspect of the question, categorized by cause
+- **Raw Excerpt**: Verbatim text from a source, bypassing researcher synthesis
+- **Content Hash**: SHA-256 of fetched content, enabling change detection

 ---

ResearchContract.md

@@ -2,6 +2,8 @@

 This document defines the `research()` tool that all Marchwarden researchers implement. It is the stable contract between a researcher MCP server and its caller (the PI or CLI).

+**Contract version: v1**
+
 ## Tool Signature

 ```python
@@ -52,27 +54,32 @@ class ResearchConstraints:

 If not provided, defaults are:
 - `max_iterations`: 5
-- `token_budget`: 20000 (Sonnet 3.5 equivalent)
+- `token_budget`: 20000
 - `max_sources`: 10

 The MCP server **enforces** these constraints and will stop the researcher if they exceed them.
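A minimal sketch of that server-side enforcement, assuming the contract's defaults; `within_budget` is a hypothetical helper, not part of the contract:

```python
from dataclasses import dataclass

@dataclass
class ResearchConstraints:
    max_iterations: int = 5
    token_budget: int = 20000
    max_sources: int = 10

def within_budget(constraints: ResearchConstraints,
                  iterations_run: int, tokens_used: int) -> bool:
    """Server-side check run before each loop iteration (illustrative)."""
    return (iterations_run < constraints.max_iterations
            and tokens_used < constraints.token_budget)
```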

 ---

-### Output: ResearchResult
+## Output: ResearchResult

 ```python
 @dataclass
 class ResearchResult:
     answer: str                             # The synthesized answer
-    citations: List[Citation]               # Sources used
-    gaps: List[Gap]                         # What couldn't be resolved
+    citations: List[Citation]               # Sources used, with raw evidence
+    gaps: List[Gap]                         # What couldn't be resolved (categorized)
+    discovery_events: List[DiscoveryEvent]  # Lateral findings for other researchers
     confidence: float                       # 0.0–1.0 overall confidence
+    confidence_factors: ConfidenceFactors   # What fed the confidence score
     cost_metadata: CostMetadata             # Resource usage
     trace_id: str                           # UUID linking to JSONL trace log
 ```

-#### `answer` (string)
+---
+
+### `answer` (string)

 The synthesized answer. Should be:
 - **Grounded** — every claim traces back to a citation
 - **Humble** — includes caveats and confidence levels

@@ -98,7 +105,9 @@ location.
 See sources below for varietal recommendations by specific county.
 ```

-#### `citations` (list of Citation objects)
+---
+
+### `citations` (list of Citation objects)

 ```python
 @dataclass

@@ -106,18 +115,34 @@ class Citation:
     source: str               # "web", "file", "database", etc
     locator: str              # URL, file path, row ID, or unique identifier
     title: Optional[str]      # Human-readable title (for web)
-    snippet: Optional[str]    # Relevant excerpt (50–200 chars)
+    snippet: Optional[str]    # Researcher's summary of relevant content (50–200 chars)
+    raw_excerpt: str          # Verbatim text from the source (up to 500 chars)
     confidence: float         # 0.0–1.0: researcher's confidence in this source's accuracy
 ```

+#### The `raw_excerpt` field
+
+The `raw_excerpt` is a **verbatim copy** of the relevant passage from the source, bypassing the researcher's synthesis. This prevents the "Synthesis Paradox" — when the PI synthesizes already-synthesized data, subtle nuances and contradictions get smoothed away.
+
+The PI uses `raw_excerpt` to:
+- Perform its own ground-truth verification
+- Detect when two researchers interpret the same evidence differently
+- Identify high-entropy points where human review is needed
+
+**Rules:**
+- Must be copied verbatim from the fetched source (no paraphrasing)
+- Up to 500 characters; truncate with `[...]` if longer
+- If the source cannot be excerpted (e.g., image, binary), set to `"[non-text source]"`
+
 Example:
 ```python
 Citation(
     source="web",
-    locator="https://extension.oregonstate.edu/ask-expert/featured/what-are-ideal-garden-crops-utah-zone",
-    title="Oregon State Extension: Ideal Crops for Utah Gardens",
-    snippet="Cool-season crops (peas, lettuce, potatoes) thrive above 7,000 feet. Irrigation essential.",
-    confidence=0.9
+    locator="https://extension.usu.edu/gardening/research/utah-crops",
+    title="USU Extension: Utah Crop Guide",
+    snippet="Cool-season crops thrive above 7,000 feet; irrigation essential.",
+    raw_excerpt="In Utah's high-elevation gardens (above 7,000 ft), cool-season vegetables such as peas, lettuce, spinach, and potatoes consistently outperform warm-season crops. Average growing season at these elevations is 90-120 days. Supplemental irrigation is essential; natural precipitation averages 12-16 inches annually, well below the 20-30 inches most vegetable crops require.",
+    confidence=0.92
 )
 ```
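The excerpt rules above (verbatim text, 500-character cap, `[...]` truncation marker, `"[non-text source]"` fallback) can be sketched as a helper; the function name is hypothetical:

```python
def make_raw_excerpt(source_text: str, limit: int = 500) -> str:
    """Build a raw_excerpt per the contract rules: verbatim, capped, marked when cut."""
    if not source_text:
        # Images, binaries, or empty fetches cannot be excerpted.
        return "[non-text source]"
    if len(source_text) <= limit:
        return source_text
    # Reserve room for the truncation marker so the result stays within the limit.
    return source_text[: limit - len("[...]")] + "[...]"
```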

@@ -125,42 +150,159 @@ Citations must be:
 - **Verifiable** — a human can follow the locator and confirm the claim
 - **Not hallucinated** — the researcher actually read/fetched the source
 - **Attributed** — each claim in `answer` should link to at least one citation
+- **Evidence-bearing** — `raw_excerpt` must contain the actual text that supports the claim

-#### `gaps` (list of Gap objects)
+---
+
+### `gaps` (list of Gap objects)

 ```python
+class GapCategory(str, Enum):
+    SOURCE_NOT_FOUND = "source_not_found"            # No relevant sources exist
+    ACCESS_DENIED = "access_denied"                  # Paywall, robots.txt, auth required
+    BUDGET_EXHAUSTED = "budget_exhausted"            # Ran out of iterations or tokens
+    CONTRADICTORY_SOURCES = "contradictory_sources"  # Sources disagree, can't resolve
+    SCOPE_EXCEEDED = "scope_exceeded"                # Question requires a different researcher type
+
 @dataclass
 class Gap:
     topic: str              # What aspect wasn't resolved
-    reason: str             # Why: "no sources found", "contradictory sources", "outside researcher scope"
+    category: GapCategory   # Structured reason category
+    detail: str             # Human-readable explanation
 ```

+#### Gap categories
+
+| Category | Meaning | PI Action |
+|:---|:---|:---|
+| `SOURCE_NOT_FOUND` | The information doesn't appear to exist in this researcher's domain | Dispatch a different researcher, or accept the gap |
+| `ACCESS_DENIED` | Source exists but is behind a paywall, login, or robots.txt | May need a specialized fetcher or human intervention |
+| `BUDGET_EXHAUSTED` | Researcher hit iteration or token cap before resolving this | Re-dispatch with a larger budget, or accept partial answer |
+| `CONTRADICTORY_SOURCES` | Multiple sources disagree and the researcher can't resolve the conflict | PI should examine raw_excerpts and flag for human review |
+| `SCOPE_EXCEEDED` | The question requires capabilities this researcher doesn't have (e.g., web researcher finds a paper DOI but can't access arxiv) | Dispatch the appropriate specialist |
 Example:
 ```python
 [
-    Gap(topic="pest management by county", reason="no county-specific sources found"),
-    Gap(topic="commercial varietals", reason="limited to hobby gardening sources"),
+    Gap(
+        topic="pest management by county",
+        category=GapCategory.SOURCE_NOT_FOUND,
+        detail="No county-specific pest data found in general web sources"
+    ),
+    Gap(
+        topic="2026 varietal trial results",
+        category=GapCategory.ACCESS_DENIED,
+        detail="USU extension database requires institutional login"
+    ),
+    Gap(
+        topic="soil pH requirements by crop",
+        category=GapCategory.BUDGET_EXHAUSTED,
+        detail="Identified relevant sources but hit iteration cap before extraction"
+    ),
 ]
 ```

 Gaps are **critical for the PI**. They tell the orchestrator:
-- Whether to dispatch a different researcher
-- Whether to accept partial answers
-- Which questions remain for human input
+- Whether to dispatch a different researcher (SCOPE_EXCEEDED)
+- Whether to retry with more budget (BUDGET_EXHAUSTED)
+- Whether to escalate to human review (CONTRADICTORY_SOURCES, ACCESS_DENIED)
+- Whether to accept the answer as-is (SOURCE_NOT_FOUND — info may not exist)

-A researcher that admits gaps is more trustworthy than one that fabricates answers.
+A researcher that admits and categorizes gaps is more trustworthy than one that fabricates answers.

-#### `confidence` (float, 0.0–1.0)
+---

-Overall confidence in the answer:
-- `0.9–1.0`: High. All claims grounded in multiple strong sources.
+### `discovery_events` (list of DiscoveryEvent objects)
+
+```python
+@dataclass
+class DiscoveryEvent:
+    type: str                            # "related_research", "new_source", "contradiction"
+    suggested_researcher: Optional[str]  # "arxiv", "database", "legal", etc.
+    query: str                           # Suggested query for the target researcher
+    reason: str                          # Why this is relevant
+    source_locator: Optional[str]        # Where the discovery was found (URL, DOI, etc.)
+```
+
+Discovery events capture **lateral findings** — things the researcher noticed that fall outside its own scope but are relevant to the overall investigation. They are the "nervous system" for the future V2 orchestrator.
+
+In V1, the CLI logs these events for analysis. In V2, the PI uses them to dynamically dispatch additional researchers mid-investigation.
+
+Example:
+```python
+[
+    DiscoveryEvent(
+        type="related_research",
+        suggested_researcher="arxiv",
+        query="Utah agricultural extension soil salinity studies 2024-2026",
+        reason="Multiple web sources reference USU soil salinity research but don't include the data",
+        source_locator="https://extension.usu.edu/news/2025/soil-salinity-study"
+    ),
+    DiscoveryEvent(
+        type="contradiction",
+        suggested_researcher=None,
+        query="tomato growing season length Utah valley vs mountain",
+        reason="Two sources disagree on whether tomatoes are viable above 6,000 ft",
+        source_locator=None
+    ),
+]
+```
+
+**Rules:**
+- Discovery events are **informational only** — the researcher does not act on them
+- The researcher should not generate discovery events speculatively; each must be grounded in something encountered during the research loop
+- The PI decides whether to act on discovery events; the researcher does not second-guess
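Since V1 only logs these events for later analysis, the CLI side can be sketched as a one-line-per-event JSONL append; the log path and helper name are assumptions:

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class DiscoveryEvent:
    type: str
    suggested_researcher: Optional[str]
    query: str
    reason: str
    source_locator: Optional[str]

def log_discovery_events(events: List[DiscoveryEvent], path: str) -> None:
    """Append each event as one JSONL line for later analysis (V1 behavior sketch)."""
    with open(path, "a") as f:
        for event in events:
            f.write(json.dumps(asdict(event)) + "\n")
```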

+---
+
+### `confidence` (float, 0.0–1.0)
+
+Overall confidence in the answer. Accompanied by `confidence_factors` to prevent "vibe check" scoring.
+
+General ranges:
+- `0.9–1.0`: High. Multiple corroborating sources, strong authority, no contradictions.
 - `0.7–0.9`: Moderate. Most claims grounded; some inference; minor contradictions resolved.
-- `0.5–0.7`: Low. Few direct sources; lots of synthesis; clear gaps.
+- `0.5–0.7`: Low. Few direct sources; significant synthesis; clear gaps.
 - `< 0.5`: Very low. Mainly inference; major gaps; likely needs human review.

 The PI uses this to decide whether to act on the answer or seek more sources.

-#### `cost_metadata` (object)
+**V1 note:** The confidence score is produced by the LLM researcher and is not yet calibrated against empirical data. Treat it as directional, not precise. Calibration rubric will be formalized after sufficient V1 testing data is collected.
+---
+
+### `confidence_factors` (object)
+
+```python
+@dataclass
+class ConfidenceFactors:
+    num_corroborating_sources: int   # How many sources agree
+    source_authority: str            # "high" (.gov, .edu, peer-reviewed), "medium" (established orgs), "low" (blogs, forums)
+    contradiction_detected: bool     # Were conflicting claims found?
+    query_specificity_match: float   # 0.0–1.0: did results address the actual question?
+    budget_exhausted: bool           # Hard penalty if true
+    recency: Optional[str]           # "current" (< 1yr), "recent" (1-3yr), "dated" (> 3yr), None if unknown
+```
+
+The `confidence_factors` dict exposes *why* the researcher chose a particular confidence score. This serves two purposes:
+1. **Auditability** — a human or PI can verify that the score is reasonable
+2. **Calibration data** — after 20-30 real queries, these factors become the basis for a formal confidence rubric
+
+Example:
+```python
+ConfidenceFactors(
+    num_corroborating_sources=4,
+    source_authority="high",
+    contradiction_detected=False,
+    query_specificity_match=0.85,
+    budget_exhausted=False,
+    recency="current"
+)
+```
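No calibrated rubric exists yet, so any mapping from factors to a score is illustrative only. A hand-tuned heuristic sketch, purely as an assumption of what calibration might later replace:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConfidenceFactors:
    num_corroborating_sources: int
    source_authority: str
    contradiction_detected: bool
    query_specificity_match: float
    budget_exhausted: bool
    recency: Optional[str]

def heuristic_confidence(f: ConfidenceFactors) -> float:
    """Illustrative only: a hand-tuned blend, NOT the contract's rubric."""
    # Base plus corroboration, saturating at 4 sources.
    score = 0.3 + 0.1 * min(f.num_corroborating_sources, 4)
    score += {"high": 0.15, "medium": 0.05, "low": -0.1}.get(f.source_authority, 0.0)
    # Scale by how well results matched the actual question.
    score *= f.query_specificity_match
    if f.contradiction_detected:
        score -= 0.2   # unresolved conflicts cut confidence sharply
    if f.budget_exhausted:
        score -= 0.15  # the contract calls this a "hard penalty"
    return max(0.0, min(1.0, score))
```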

+---
+
+### `cost_metadata` (object)

 ```python
 @dataclass

@@ -186,7 +328,9 @@ The PI uses this to:
 - Detect runaway loops (budget_exhausted = True)
 - Plan timeouts (wall_time_sec tells you if this is acceptable latency)

-#### `trace_id` (string, UUID)
+---
+
+### `trace_id` (string, UUID)

 A unique identifier linking to the JSONL trace log:

@@ -194,7 +338,30 @@ A unique identifier linking to the JSONL trace log:
 ~/.marchwarden/traces/{trace_id}.jsonl
 ```

-The trace contains every decision, search, fetch, parse step for debugging and replay.
+The trace contains every decision, search, fetch, parse step for debugging and audit.
+
+#### Trace entries and content hashing
+
+Each trace entry for a fetched source includes a `content_hash` (SHA-256 of the fetched content). This provides a **pseudo-CAS** (Content Addressable Storage) capability:
+
+```json
+{
+  "step": 2,
+  "action": "fetch_url",
+  "url": "https://extension.usu.edu/gardening/research/utah-crops",
+  "content_hash": "sha256:a3f2b8c91d...",
+  "content_length": 14523,
+  "timestamp": "2026-04-08T12:00:05Z",
+  "decision": "Relevant to question; extracting crop data"
+}
+```
+
+The `content_hash` enables:
+- **Change detection** — comparing two audit runs to see if the underlying source changed
+- **Integrity verification** — confirming the raw_excerpt came from the content that was actually fetched
+- **Future CAS** — when full content storage is added (V2+), the hash becomes the content address
+
+**V1 limitation:** We store the hash, not the full content. True replay requires Content Addressable Storage (V2+). V1 traces are "audit logs," not deterministic replays.

 ---

@@ -203,16 +370,22 @@ The trace contains every decision, search, fetch, parse step for debugging and r
 ### The Researcher Must

 1. **Never hallucinate citations.** If a claim isn't in a source, don't cite it.
-2. **Admit gaps.** If you can't find something, say so. Don't guess.
-3. **Respect budgets.** Stop iterating if `max_iterations` or `token_budget` is reached. Reflect in `budget_exhausted`.
-4. **Ground claims.** Every factual claim in `answer` must link to at least one citation.
-5. **Handle failures gracefully.** If Tavily is down or a URL is broken, note it in `gaps` and continue with what you have.
+2. **Provide raw evidence.** Every citation must include a `raw_excerpt` copied verbatim from the source.
+3. **Admit and categorize gaps.** If you can't find something, say so with the appropriate `GapCategory`.
+4. **Report lateral discoveries.** If you encounter something relevant to another researcher's domain, emit a `DiscoveryEvent`.
+5. **Respect budgets.** Stop iterating if `max_iterations` or `token_budget` is reached. Reflect in `budget_exhausted`.
+6. **Ground claims.** Every factual claim in `answer` must link to at least one citation.
+7. **Explain confidence.** Populate `confidence_factors` honestly; do not inflate scores.
+8. **Hash fetched content.** Every URL/source fetch in the trace must include a `content_hash`.
+9. **Handle failures gracefully.** If Tavily is down or a URL is broken, note it in `gaps` with the appropriate category and continue with what you have.

 ### The Caller (PI/CLI) Must

 1. **Accept partial answers.** A researcher that hits its budget but admits gaps is better than one that spins endlessly.
 2. **Use confidence and gaps.** Don't treat a 0.6 confidence answer the same as a 0.95 confidence answer.
-3. **Check locators.** For important decisions, verify citations by following the locators.
+3. **Check raw_excerpts.** For important decisions, verify claims against `raw_excerpt` before acting.
+4. **Process discovery_events.** Log them (V1) or dispatch additional researchers (V2+).
+5. **Respect gap categories.** Use the category to decide the appropriate response (retry, re-dispatch, escalate, accept).

 ---

@@ -238,11 +411,21 @@ Response:
       "locator": "https://en.wikipedia.org/wiki/Paris",
       "title": "Paris - Wikipedia",
       "snippet": "Paris is the capital and largest city of France",
+      "raw_excerpt": "Paris is the capital and most populous city of France, with an estimated population of 2,102,650 residents as of 1 January 2023.",
       "confidence": 0.99
     }
   ],
   "gaps": [],
+  "discovery_events": [],
   "confidence": 0.99,
+  "confidence_factors": {
+    "num_corroborating_sources": 1,
+    "source_authority": "high",
+    "contradiction_detected": false,
+    "query_specificity_match": 1.0,
+    "budget_exhausted": false,
+    "recency": "current"
+  },
   "cost_metadata": {
     "tokens_used": 450,
     "iterations_run": 1,
@@ -253,7 +436,7 @@ Response:
 }
 ```

-### Example 2: Partial Answer with Gaps
+### Example 2: Partial Answer with Gaps and Discoveries

 Request:
 ```json
@@ -270,23 +453,43 @@ Response:
   "citations": [
     {
       "source": "web",
-      "locator": "https://www.crunchbase.com/...",
+      "locator": "https://www.crunchbase.com/hub/crispr-startups",
       "title": "Crunchbase: CRISPR Startups",
-      "snippet": "Editas, Beam Therapeutics, and CRISPR Therapeutics...",
+      "snippet": "Editas, Beam Therapeutics, and CRISPR Therapeutics lead the field.",
+      "raw_excerpt": "The CRISPR gene-editing space has attracted over $12B in venture funding since 2018. Key players include Editas Medicine (EDIT), Beam Therapeutics (BEAM), CRISPR Therapeutics (CRSP), and newer entrants Prime Medicine and Verve Therapeutics.",
       "confidence": 0.8
     }
   ],
   "gaps": [
     {
       "topic": "funding rounds in 2026",
-      "reason": "Web sources only go through Q1 2026; may be stale"
+      "category": "source_not_found",
+      "detail": "Web sources only go through Q1 2026; most recent rounds may not be indexed yet"
     },
     {
       "topic": "clinical trial status",
-      "reason": "Requires access to clinical trials database (outside web search scope)"
+      "category": "scope_exceeded",
+      "detail": "Requires access to ClinicalTrials.gov database; outside web search scope"
     }
   ],
+  "discovery_events": [
+    {
+      "type": "related_research",
+      "suggested_researcher": "database",
+      "query": "CRISPR gene therapy clinical trials Phase I II III 2025-2026",
+      "reason": "Multiple sources reference ongoing trials but web results don't include trial data",
+      "source_locator": "https://www.crunchbase.com/hub/crispr-startups"
+    }
+  ],
   "confidence": 0.72,
+  "confidence_factors": {
+    "num_corroborating_sources": 3,
+    "source_authority": "medium",
+    "contradiction_detected": false,
+    "query_specificity_match": 0.75,
+    "budget_exhausted": false,
+    "recency": "recent"
+  },
   "cost_metadata": {
     "tokens_used": 19240,
     "iterations_run": 4,
@ -297,7 +500,7 @@ Response:
|
|||
}
|
||||
```
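In V1 the PI only logs discovery_events rather than acting on them mid-flight. A sketch of JSONL-style rendering, one line per event; `format_discovery_events` is an illustrative helper, not part of the spec:

```python
# V1 discovery_event handling: render each event as a JSONL log line
# tagged with the originating trace_id. A V2 orchestrator would dispatch
# additional researchers here instead of just logging.

import json

def format_discovery_events(trace_id: str, events: list[dict]) -> list[str]:
    """Return one JSON log line per discovery event."""
    return [json.dumps({"trace_id": trace_id, **event}, sort_keys=True)
            for event in events]
```

Keeping the `trace_id` on every line lets a later orchestrator correlate lateral findings with the research call that produced them.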

### Example 3: Budget Exhausted with Contradictions

Request:
```json
{ ... }
```

Response:
```json
{
  "answer": "The history of AI spans multiple eras... [partial answer, cut off mid-synthesis]",
  "citations": [
    {
      "source": "web",
      "locator": "https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
      "title": "History of AI - Wikipedia",
      "snippet": "AI research began at a 1956 Dartmouth workshop.",
      "raw_excerpt": "The field of AI research was founded at a workshop held on the campus of Dartmouth College, USA during the summer of 1956. Those who attended would become the leaders of AI research for decades.",
      "confidence": 0.95
    }
  ],
  "gaps": [
    {
      "topic": "detailed timeline 2020-2026",
      "category": "budget_exhausted",
      "detail": "Identified relevant sources but hit iteration cap before extraction"
    },
    {
      "topic": "AI winter causes and resolution",
      "category": "budget_exhausted",
      "detail": "Found conflicting narratives; needed more iterations to resolve"
    }
  ],
  "discovery_events": [
    {
      "type": "related_research",
      "suggested_researcher": "arxiv",
      "query": "survey papers history of deep learning 2020-2026",
      "reason": "Web sources are shallow on recent technical developments; academic surveys would be more authoritative",
      "source_locator": null
    }
  ],
  "confidence": 0.55,
  "confidence_factors": {
    "num_corroborating_sources": 2,
    "source_authority": "high",
    "contradiction_detected": true,
    "query_specificity_match": 0.5,
    "budget_exhausted": true,
    "recency": "dated"
  },
  "cost_metadata": {
    "tokens_used": 4998,
    "iterations_run": 3
  }
}
```

---
## Known Limitations (V1)

These are documented architectural decisions, not oversights:

| Limitation | Rationale | Future Resolution |
|:---|:---|:---|
| Confidence is LLM-generated, not calibrated | Need empirical data before formalizing rubric | V1.1: Calibrate after 20-30 real queries |
| No citation validation (URL/DOI ping) | Adds latency and complexity; document as known risk | V2: Validator node that programmatically verifies locators |
| Traces are audit logs, not true replays | True replay requires CAS for fetched content | V2: Content Addressable Storage for all fetched data |
| Discovery events are logged, not acted on | MCP is request-response; no mid-flight dispatch | V2: PI orchestrator processes events dynamically |
| No streaming of intermediate progress | MCP tool responses are one-shot | V2+: Evaluate streaming MCP or polling pattern |
| Hub-and-spoke only (no inter-researcher comms) | Keeps V1 simple; PI is the only coordinator | V2: Dynamic priority queue in PI |

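The trace-replay limitation leans on the content hashing described in the design: a SHA-256 hash of each fetched body enables pseudo-CAS change detection even without full content storage. A minimal sketch; the trace-entry shape is an assumption, not a defined schema:

```python
# Pseudo-CAS change detection: store a SHA-256 of fetched content in the
# trace, then compare against a later re-fetch of the same locator.
# The trace-entry dict shape here is illustrative.

import hashlib

def content_hash(body: bytes) -> str:
    """SHA-256 hex digest of fetched content."""
    return hashlib.sha256(body).hexdigest()

def has_changed(trace_entry: dict, new_body: bytes) -> bool:
    """True if a re-fetch of the same locator no longer matches the trace."""
    return trace_entry["content_hash"] != content_hash(new_body)
```

This gives change *detection* only; true replay still requires storing the bodies themselves, which is the V2 CAS work noted in the table.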
---

## Versioning

The contract is versioned as `v1`. If breaking changes are needed (e.g., removing a required field or changing its type), the next version becomes `v2` and both can coexist in the network for a transition period.

**Backward-compatible changes** (adding optional fields) do not require a version bump.

**Breaking changes** (removing fields, changing types, changing required/optional status) require a new version.

Current version: **v1**
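During a transition period the PI needs a gate that accepts both versions. A sketch of that check; note the `contract_version` field name is an assumption, since the contract does not define where the version is carried:

```python
# Hypothetical version gate on the PI side. The field name
# "contract_version" is an assumption; responses without one are
# treated as v1 for backward compatibility.

SUPPORTED_VERSIONS = {"v1"}  # during a transition this might be {"v1", "v2"}

def accept_response(resp: dict) -> bool:
    """True if the response speaks a contract version this PI understands."""
    return resp.get("contract_version", "v1") in SUPPORTED_VERSIONS
```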