Jeff Smith edited this page 2026-04-08 14:37:37 -06:00

Research Contract

This document defines the research() tool that all Marchwarden researchers implement. It is the stable contract between a researcher MCP server and its caller (the PI or CLI).

Contract version: v1

Tool Signature

async def research(
    question: str,
    context: Optional[str] = None,
    depth: Literal["shallow", "balanced", "deep"] = "balanced",
    constraints: Optional[ResearchConstraints] = None,
) -> ResearchResult

Input Parameters

question (required, string)

The question the researcher is asked to investigate. Examples:

  • "What are ideal crops for a garden in Utah?"
  • "Summarize recent developments in transformer architectures"
  • "What is the legal status of AI in France?"

Constraints: 1-500 characters, UTF-8 encoded.

context (optional, string)

What the PI or caller already knows. The researcher uses this to avoid duplicating effort or to refocus. Examples:

  • "I already know Utah is in USDA zones 3-8. Focus on water requirements."
  • "I've read the 2024 papers on LoRA. What's new in 2025?"

Constraints: 0-2000 characters.
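A caller can check these limits before invoking the tool. The sketch below assumes the limits read 1-500 characters for question and up to 2000 for context; the function name `validate_inputs` is illustrative, not part of the contract.

```python
from typing import Optional

def validate_inputs(question: str, context: Optional[str] = None) -> None:
    # Field limits per the contract (assumed: question 1-500, context <= 2000).
    if not 1 <= len(question) <= 500:
        raise ValueError("question must be 1-500 characters")
    if context is not None and len(context) > 2000:
        raise ValueError("context must be at most 2000 characters")
```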

depth (optional, enum)

How thoroughly to research:

  • "shallow" — quick scan, 1-2 iterations, ~5k tokens. For "does this exist?" questions.
  • "balanced" (default) — moderate depth, 2-4 iterations, ~15k tokens. For typical questions.
  • "deep" — thorough investigation, up to 5 iterations, ~25k tokens. For important decisions.

The researcher uses this as a hint, not a strict constraint. The actual depth depends on how much content is available and how confident the researcher becomes.
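One way to treat the hint is a lookup table of suggested budgets. This is a sketch only, assuming the tiers mean roughly 1-2, 2-4, and up to 5 iterations; `DEPTH_PROFILES` and `suggested_budget` are illustrative names, not part of the contract.

```python
# Illustrative mapping from the depth hint to suggested loop budgets.
# Values mirror the tiers described above; nothing here is a hard limit.
DEPTH_PROFILES = {
    "shallow":  {"max_iterations": 2, "approx_tokens": 5_000},
    "balanced": {"max_iterations": 4, "approx_tokens": 15_000},
    "deep":     {"max_iterations": 5, "approx_tokens": 25_000},
}

def suggested_budget(depth: str = "balanced") -> dict:
    # Unknown hints fall back to the default tier.
    return DEPTH_PROFILES.get(depth, DEPTH_PROFILES["balanced"])
```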

constraints (optional, object)

Fine-grained control over researcher behavior:

@dataclass
class ResearchConstraints:
    max_iterations: int = 5              # Stop after N iterations, regardless of progress
    token_budget: int = 20000            # Soft limit on tokens; the researcher respects it
    max_sources: int = 10                # Max number of sources to fetch
    source_filter: Optional[str] = None  # Only search specific domains (V2)

If not provided, defaults are:

  • max_iterations: 5
  • token_budget: 20000
  • max_sources: 10

The MCP server enforces these constraints and will stop the researcher if it exceeds them.
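A minimal sketch of that enforcement, assuming the loop can be modeled as repeated calls to a hypothetical `run_iteration` step that returns (tokens spent, done); none of these names are part of the contract.

```python
from dataclasses import dataclass

@dataclass
class ResearchConstraints:
    max_iterations: int = 5
    token_budget: int = 20000
    max_sources: int = 10

def enforce_loop(constraints: ResearchConstraints, run_iteration):
    # run_iteration() -> (tokens_spent, done) is a stand-in for one
    # research-loop step; the server stops at whichever cap is hit first.
    tokens_used = iterations = 0
    done = False
    while (not done
           and iterations < constraints.max_iterations
           and tokens_used < constraints.token_budget):
        step_tokens, done = run_iteration()
        tokens_used += step_tokens
        iterations += 1
    return {
        "tokens_used": tokens_used,
        "iterations_run": iterations,
        "budget_exhausted": not done,  # a cap, not completion, ended the loop
    }
```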


Output: ResearchResult

@dataclass
class ResearchResult:
    answer: str                              # The synthesized answer
    citations: List[Citation]                # Sources used, with raw evidence
    gaps: List[Gap]                          # What couldn't be resolved (categorized)
    discovery_events: List[DiscoveryEvent]   # Lateral findings for other researchers
    open_questions: List[OpenQuestion]       # Follow-up questions that emerged
    confidence: float                        # 0.0-1.0 overall confidence
    confidence_factors: ConfidenceFactors    # What fed the confidence score
    cost_metadata: CostMetadata              # Resource usage
    trace_id: str                            # UUID linking to JSONL trace log

answer (string)

The synthesized answer. Should be:

  • Grounded — every claim traces back to a citation
  • Humble — includes caveats and confidence levels
  • Actionable — structured so the caller can use it

Example:

In Utah (USDA zones 3-8), ideal crops depend on elevation and season:

High elevation (>7k ft): Short-season crops dominate. Cool-season vegetables
(peas, lettuce, potatoes) thrive. Fruit: apples, berries. Summer crops
(tomatoes, squash) work in south-facing microclimates.

Lower elevation: Full range possible. Long growing season supports tomatoes,
peppers, squash. Perennials (fruit trees, asparagus) are popular.

Water is critical: Utah averages 10-20" annual precipitation (dry for vegetable
gardening). Most gardeners supplement with irrigation.

Pests: Japanese beetles (south), aphids (statewide). Deer pressure varies by
location.

See sources below for varietal recommendations by specific county.

citations (list of Citation objects)

@dataclass
class Citation:
    source: str              # "web", "file", "database", etc
    locator: str             # URL, file path, row ID, or unique identifier
    title: Optional[str]     # Human-readable title (for web)
    snippet: Optional[str]   # Researcher's summary of relevant content (50-200 chars)
    raw_excerpt: str         # Verbatim text from the source (up to 500 chars)
    confidence: float        # 0.0-1.0: researcher's confidence in this source's accuracy

The raw_excerpt field

The raw_excerpt is a verbatim copy of the relevant passage from the source, bypassing the researcher's synthesis. This prevents the "Synthesis Paradox" — when the PI synthesizes already-synthesized data, subtle nuances and contradictions get smoothed away.

The PI uses raw_excerpt to:

  • Perform its own ground-truth verification
  • Detect when two researchers interpret the same evidence differently
  • Identify high-entropy points where human review is needed

Rules:

  • Must be copied verbatim from the fetched source (no paraphrasing)
  • Up to 500 characters; truncate with [...] if longer
  • If the source cannot be excerpted (e.g., image, binary), set to "[non-text source]"
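The rules above can be sketched as a small helper. Representing a non-text source as None, and keeping the truncated result within the limit, are choices of this sketch rather than requirements of the contract.

```python
def make_raw_excerpt(passage, limit: int = 500) -> str:
    # Non-text source (image, binary) is modeled here as passage=None.
    if passage is None:
        return "[non-text source]"
    # Verbatim copy; no paraphrasing happens in this function.
    if len(passage) <= limit:
        return passage
    marker = " [...]"
    return passage[: limit - len(marker)] + marker
```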

Example:

Citation(
    source="web",
    locator="https://extension.usu.edu/gardening/research/utah-crops",
    title="USU Extension: Utah Crop Guide",
    snippet="Cool-season crops thrive above 7,000 feet; irrigation essential.",
    raw_excerpt="In Utah's high-elevation gardens (above 7,000 ft), cool-season vegetables such as peas, lettuce, spinach, and potatoes consistently outperform warm-season crops. Average growing season at these elevations is 90-120 days. Supplemental irrigation is essential; natural precipitation averages 12-16 inches annually, well below the 20-30 inches most vegetable crops require.",
    confidence=0.92
)

Citations must be:

  • Verifiable — a human can follow the locator and confirm the claim
  • Not hallucinated — the researcher actually read/fetched the source
  • Attributed — each claim in answer should link to at least one citation
  • Evidence-bearing — raw_excerpt must contain the actual text that supports the claim

gaps (list of Gap objects)

class GapCategory(str, Enum):
    SOURCE_NOT_FOUND = "source_not_found"          # No relevant sources exist
    ACCESS_DENIED = "access_denied"                # Paywall, robots.txt, auth required
    BUDGET_EXHAUSTED = "budget_exhausted"           # Ran out of iterations or tokens
    CONTRADICTORY_SOURCES = "contradictory_sources" # Sources disagree, can't resolve
    SCOPE_EXCEEDED = "scope_exceeded"               # Question requires a different researcher type

@dataclass
class Gap:
    topic: str              # What aspect wasn't resolved
    category: GapCategory   # Structured reason category
    detail: str             # Human-readable explanation

Gap categories

  • SOURCE_NOT_FOUND — the information doesn't appear to exist in this researcher's domain. PI action: dispatch a different researcher, or accept the gap.
  • ACCESS_DENIED — the source exists but is behind a paywall, login, or robots.txt. PI action: may need a specialized fetcher or human intervention.
  • BUDGET_EXHAUSTED — the researcher hit its iteration or token cap before resolving this. PI action: re-dispatch with a larger budget, or accept the partial answer.
  • CONTRADICTORY_SOURCES — multiple sources disagree and the researcher can't resolve the conflict. PI action: examine the raw_excerpts and flag for human review.
  • SCOPE_EXCEEDED — the question requires capabilities this researcher doesn't have (e.g., a web researcher finds a paper DOI but can't access arXiv). PI action: dispatch the appropriate specialist.

Example:

[
    Gap(
        topic="pest management by county",
        category=GapCategory.SOURCE_NOT_FOUND,
        detail="No county-specific pest data found in general web sources"
    ),
    Gap(
        topic="2026 varietal trial results",
        category=GapCategory.ACCESS_DENIED,
        detail="USU extension database requires institutional login"
    ),
    Gap(
        topic="soil pH requirements by crop",
        category=GapCategory.BUDGET_EXHAUSTED,
        detail="Identified relevant sources but hit iteration cap before extraction"
    ),
]

Gaps are critical for the PI. They tell the orchestrator:

  • Whether to dispatch a different researcher (SCOPE_EXCEEDED)
  • Whether to retry with more budget (BUDGET_EXHAUSTED)
  • Whether to escalate to human review (CONTRADICTORY_SOURCES, ACCESS_DENIED)
  • Whether to accept the answer as-is (SOURCE_NOT_FOUND — info may not exist)

A researcher that admits and categorizes gaps is more trustworthy than one that fabricates answers.
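The category-to-action mapping can be sketched as a lookup on the PI side. The action strings here are invented for illustration; the contract defines only the categories, not the PI's API.

```python
from enum import Enum

class GapCategory(str, Enum):
    SOURCE_NOT_FOUND = "source_not_found"
    ACCESS_DENIED = "access_denied"
    BUDGET_EXHAUSTED = "budget_exhausted"
    CONTRADICTORY_SOURCES = "contradictory_sources"
    SCOPE_EXCEEDED = "scope_exceeded"

# Hypothetical PI-side routing table; action names are illustrative.
GAP_ACTIONS = {
    GapCategory.SOURCE_NOT_FOUND: "accept_or_redispatch",
    GapCategory.ACCESS_DENIED: "escalate_to_human",
    GapCategory.BUDGET_EXHAUSTED: "retry_with_larger_budget",
    GapCategory.CONTRADICTORY_SOURCES: "flag_for_human_review",
    GapCategory.SCOPE_EXCEEDED: "dispatch_specialist",
}

def plan_gap_response(category: GapCategory) -> str:
    return GAP_ACTIONS[category]
```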


discovery_events (list of DiscoveryEvent objects)

@dataclass
class DiscoveryEvent:
    type: str                           # "related_research", "new_source", "contradiction"
    suggested_researcher: Optional[str]  # "arxiv", "database", "legal", etc.
    query: str                          # Suggested query for the target researcher
    reason: str                         # Why this is relevant
    source_locator: Optional[str]       # Where the discovery was found (URL, DOI, etc.)

Discovery events capture lateral findings — things the researcher noticed that fall outside its own scope but are relevant to the overall investigation. They are the "nervous system" for the future V2 orchestrator.

In V1, the CLI logs these events for analysis. In V2, the PI uses them to dynamically dispatch additional researchers mid-investigation.

Example:

[
    DiscoveryEvent(
        type="related_research",
        suggested_researcher="arxiv",
        query="Utah agricultural extension soil salinity studies 2024-2026",
        reason="Multiple web sources reference USU soil salinity research but don't include the data",
        source_locator="https://extension.usu.edu/news/2025/soil-salinity-study"
    ),
    DiscoveryEvent(
        type="contradiction",
        suggested_researcher=None,
        query="tomato growing season length Utah valley vs mountain",
        reason="Two sources disagree on whether tomatoes are viable above 6,000 ft",
        source_locator=None
    ),
]

Rules:

  • Discovery events are informational only — the researcher does not act on them
  • The researcher should not generate discovery events speculatively; each must be grounded in something encountered during the research loop
  • The PI decides whether to act on discovery events; the researcher does not second-guess

open_questions (list of OpenQuestion objects)

@dataclass
class OpenQuestion:
    question: str              # The follow-up question that emerged
    context: str               # What evidence prompted this question
    priority: str              # "high", "medium", "low"
    source_locator: Optional[str]  # Where this question arose from

Open questions capture forward-looking follow-ups that emerged from the research itself. They are distinct from gaps (what failed) and discovery events (what's lateral):

  • gaps — backward: "I tried to find X but couldn't"
  • discovery_events — sideways: "Another researcher should look at this"
  • open_questions — forward: "Based on what I found, this needs deeper investigation"

Example:

[
    OpenQuestion(
        question="What is the optimal irrigation schedule for high-elevation potatoes?",
        context="Multiple sources mention irrigation is critical but none specify schedules.",
        priority="medium",
        source_locator="https://extension.usu.edu/gardening/utah-crops"
    ),
    OpenQuestion(
        question="How does Utah's soil salinity vary by county?",
        context="Two sources referenced salinity as a limiting factor but with conflicting data.",
        priority="high",
        source_locator=None
    ),
]

Priority levels:

  • "high" — critical to answer quality; the PI should strongly consider dispatching follow-up research
  • "medium" — would meaningfully improve the answer
  • "low" — nice to know; not essential

The PI uses open questions to feed a dynamic priority queue — deciding whether to go deeper on the current topic or move on.

Rules:

  • Each open question must be grounded in evidence encountered during research (not speculative)
  • Questions should be specific and actionable (not vague like "learn more about Utah")
  • The researcher should not attempt to answer its own open questions — that's the PI's job
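The priority-queue feeding described above reduces to a simple ordering. A sketch, with questions modeled as plain dicts for brevity:

```python
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}

def order_follow_ups(open_questions):
    # Sort high -> medium -> low for the PI's follow-up queue.
    return sorted(open_questions, key=lambda q: PRIORITY_RANK[q["priority"]])
```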

confidence (float, 0.0-1.0)

Overall confidence in the answer. Accompanied by confidence_factors to prevent "vibe check" scoring.

General ranges:

  • 0.9-1.0: High. Multiple corroborating sources, strong authority, no contradictions.
  • 0.7-0.9: Moderate. Most claims grounded; some inference; minor contradictions resolved.
  • 0.5-0.7: Low. Few direct sources; significant synthesis; clear gaps.
  • < 0.5: Very low. Mainly inference; major gaps; likely needs human review.

The PI uses this to decide whether to act on the answer or seek more sources.

V1 note: The confidence score is produced by the LLM researcher and is not yet calibrated against empirical data. Treat it as directional, not precise. A calibration rubric will be formalized once sufficient V1 testing data has been collected.


confidence_factors (object)

@dataclass
class ConfidenceFactors:
    num_corroborating_sources: int      # How many sources agree
    source_authority: str               # "high" (.gov, .edu, peer-reviewed), "medium" (established orgs), "low" (blogs, forums)
    contradiction_detected: bool        # Were conflicting claims found?
    query_specificity_match: float      # 0.0-1.0: did results address the actual question?
    budget_exhausted: bool              # Hard penalty if true
    recency: Optional[str]              # "current" (< 1yr), "recent" (1-3yr), "dated" (> 3yr), None if unknown

The confidence_factors object exposes why the researcher chose a particular confidence score. This serves two purposes:

  1. Auditability — a human or PI can verify that the score is reasonable
  2. Calibration data — after 20-30 real queries, these factors become the basis for a formal confidence rubric
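For auditability, a PI could recompute a rough score from the factors and flag large disagreements with the researcher's self-reported confidence. The weights below are invented for illustration; the contract defines no formula (V1 confidence is LLM-generated).

```python
def heuristic_confidence(f: dict) -> float:
    # Invented audit heuristic, NOT the contract's scoring method.
    score = 0.4 + 0.05 * min(f["num_corroborating_sources"], 4)
    score += {"high": 0.2, "medium": 0.1, "low": 0.0}[f["source_authority"]]
    score += 0.2 * f["query_specificity_match"]
    if f["contradiction_detected"]:
        score -= 0.15
    if f["budget_exhausted"]:
        score -= 0.15
    return max(0.0, min(1.0, round(score, 4)))
```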

Example:

ConfidenceFactors(
    num_corroborating_sources=4,
    source_authority="high",
    contradiction_detected=False,
    query_specificity_match=0.85,
    budget_exhausted=False,
    recency="current"
)

cost_metadata (object)

@dataclass
class CostMetadata:
    tokens_used: int          # Total tokens (Claude + Tavily calls)
    iterations_run: int       # Number of inner-loop iterations
    wall_time_sec: float      # Actual elapsed time
    budget_exhausted: bool    # True if researcher hit iteration or token cap
    model_id: str             # Model used for the research loop (e.g. "claude-sonnet-4-6")

The model_id field records which LLM powered the researcher's inner loop. This is critical for:

  • Cost analysis — comparing token spend across model tiers
  • Quality calibration — correlating confidence scores with model capability
  • Reproducibility — knowing exactly what produced a given result

Example:

CostMetadata(
    tokens_used=8452,
    iterations_run=3,
    wall_time_sec=42.5,
    budget_exhausted=False,
    model_id="claude-sonnet-4-6"
)

The PI uses this to:

  • Track costs (token budgets, actual spend, model tier)
  • Detect runaway loops (budget_exhausted = True)
  • Plan timeouts (wall_time_sec tells you if this is acceptable latency)
  • Compare fidelity-to-cost ratio across models

trace_id (string, UUID)

A unique identifier linking to the JSONL trace log:

~/.marchwarden/traces/{trace_id}.jsonl

The trace contains every decision, search, fetch, and parse step, for debugging and audit.

Trace entries and content hashing

Each trace entry for a fetched source includes a content_hash (SHA-256 of the fetched content). This provides a pseudo-CAS (Content Addressable Storage) capability:

{
  "step": 2,
  "action": "fetch_url",
  "url": "https://extension.usu.edu/gardening/research/utah-crops",
  "content_hash": "sha256:a3f2b8c91d...",
  "content_length": 14523,
  "timestamp": "2026-04-08T12:00:05Z",
  "decision": "Relevant to question; extracting crop data"
}

The content_hash enables:

  • Change detection — comparing two audit runs to see if the underlying source changed
  • Integrity verification — confirming the raw_excerpt came from the content that was actually fetched
  • Future CAS — when full content storage is added (V2+), the hash becomes the content address
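The hashing step itself is small; a sketch consistent with the "sha256:"-prefixed format shown in the trace entry above:

```python
import hashlib

def content_hash(content: bytes) -> str:
    # SHA-256 over the fetched bytes, hex-encoded with a "sha256:" prefix.
    return "sha256:" + hashlib.sha256(content).hexdigest()
```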

V1 limitation: We store the hash, not the full content. True replay requires Content Addressable Storage (V2+). V1 traces are "audit logs," not deterministic replays.


Contract Rules

The Researcher Must

  1. Never hallucinate citations. If a claim isn't in a source, don't cite it.
  2. Provide raw evidence. Every citation must include a raw_excerpt copied verbatim from the source.
  3. Admit and categorize gaps. If you can't find something, say so with the appropriate GapCategory.
  4. Report lateral discoveries. If you encounter something relevant to another researcher's domain, emit a DiscoveryEvent.
  5. Surface open questions. If the research raises follow-up questions that need deeper investigation, emit OpenQuestion objects with priority levels.
  6. Respect budgets. Stop iterating if max_iterations or token_budget is reached. Reflect in budget_exhausted.
  7. Ground claims. Every factual claim in answer must link to at least one citation.
  8. Explain confidence. Populate confidence_factors honestly; do not inflate scores.
  9. Hash fetched content. Every URL/source fetch in the trace must include a content_hash.
  10. Handle failures gracefully. If Tavily is down or a URL is broken, note it in gaps with the appropriate category and continue with what you have.

The Caller (PI/CLI) Must

  1. Accept partial answers. A researcher that hits its budget but admits gaps is better than one that spins endlessly.
  2. Use confidence and gaps. Don't treat a 0.6 confidence answer the same as a 0.95 confidence answer.
  3. Check raw_excerpts. For important decisions, verify claims against raw_excerpt before acting.
  4. Process discovery_events. Log them (V1) or dispatch additional researchers (V2+).
  5. Respect gap categories. Use the category to decide the appropriate response (retry, re-dispatch, escalate, accept).
  6. Review open_questions. Use priority levels to decide whether to dispatch deeper research or accept the current answer.

Examples

Example 1: High-Confidence Answer

Request:

{
  "question": "What is the capital of France?",
  "depth": "shallow"
}

Response:

{
  "answer": "Paris is the capital of France. It is the country's largest city and serves as the political, cultural, and economic center.",
  "citations": [
    {
      "source": "web",
      "locator": "https://en.wikipedia.org/wiki/Paris",
      "title": "Paris - Wikipedia",
      "snippet": "Paris is the capital and largest city of France",
      "raw_excerpt": "Paris is the capital and most populous city of France, with an estimated population of 2,102,650 residents as of 1 January 2023.",
      "confidence": 0.99
    }
  ],
  "gaps": [],
  "discovery_events": [],
  "open_questions": [],
  "confidence": 0.99,
  "confidence_factors": {
    "num_corroborating_sources": 1,
    "source_authority": "high",
    "contradiction_detected": false,
    "query_specificity_match": 1.0,
    "budget_exhausted": false,
    "recency": "current"
  },
  "cost_metadata": {
    "tokens_used": 450,
    "iterations_run": 1,
    "wall_time_sec": 3.2,
    "budget_exhausted": false,
    "model_id": "claude-sonnet-4-6"
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440001"
}

Example 2: Partial Answer with Gaps and Discoveries

Request:

{
  "question": "What emerging startups in biotech are working on CRISPR gene therapy?",
  "depth": "deep"
}

Response:

{
  "answer": "Several emerging startups are advancing CRISPR gene therapy... [detailed answer]",
  "citations": [
    {
      "source": "web",
      "locator": "https://www.crunchbase.com/hub/crispr-startups",
      "title": "Crunchbase: CRISPR Startups",
      "snippet": "Editas, Beam Therapeutics, and CRISPR Therapeutics lead the field.",
      "raw_excerpt": "The CRISPR gene-editing space has attracted over $12B in venture funding since 2018. Key players include Editas Medicine (EDIT), Beam Therapeutics (BEAM), CRISPR Therapeutics (CRSP), and newer entrants Prime Medicine and Verve Therapeutics.",
      "confidence": 0.8
    }
  ],
  "gaps": [
    {
      "topic": "funding rounds in 2026",
      "category": "source_not_found",
      "detail": "Web sources only go through Q1 2026; most recent rounds may not be indexed yet"
    },
    {
      "topic": "clinical trial status",
      "category": "scope_exceeded",
      "detail": "Requires access to ClinicalTrials.gov database; outside web search scope"
    }
  ],
  "discovery_events": [
    {
      "type": "related_research",
      "suggested_researcher": "database",
      "query": "CRISPR gene therapy clinical trials Phase I II III 2025-2026",
      "reason": "Multiple sources reference ongoing trials but web results don't include trial data",
      "source_locator": "https://www.crunchbase.com/hub/crispr-startups"
    }
  ],
  "open_questions": [
    {
      "question": "What are the current Phase III CRISPR trial success rates?",
      "context": "Multiple sources reference ongoing trials but none include outcome data.",
      "priority": "high",
      "source_locator": "https://www.crunchbase.com/hub/crispr-startups"
    }
  ],
  "confidence": 0.72,
  "confidence_factors": {
    "num_corroborating_sources": 3,
    "source_authority": "medium",
    "contradiction_detected": false,
    "query_specificity_match": 0.75,
    "budget_exhausted": false,
    "recency": "recent"
  },
  "cost_metadata": {
    "tokens_used": 19240,
    "iterations_run": 4,
    "wall_time_sec": 67.8,
    "budget_exhausted": false,
    "model_id": "claude-sonnet-4-6"
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440002"
}

Example 3: Budget Exhausted with Contradictions

Request:

{
  "question": "Comprehensive history of AI from 1950s to 2026",
  "depth": "deep",
  "constraints": {
    "max_iterations": 3,
    "token_budget": 5000
  }
}

Response:

{
  "answer": "The history of AI spans multiple eras... [partial answer, cut off mid-synthesis]",
  "citations": [
    {
      "source": "web",
      "locator": "https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
      "title": "History of AI - Wikipedia",
      "snippet": "AI research began at a 1956 Dartmouth workshop.",
      "raw_excerpt": "The field of AI research was founded at a workshop held on the campus of Dartmouth College, USA during the summer of 1956. Those who attended would become the leaders of AI research for decades.",
      "confidence": 0.95
    }
  ],
  "gaps": [
    {
      "topic": "detailed timeline 2020-2026",
      "category": "budget_exhausted",
      "detail": "Identified relevant sources but hit iteration cap before extraction"
    },
    {
      "topic": "AI winter causes and resolution",
      "category": "budget_exhausted",
      "detail": "Found conflicting narratives; needed more iterations to resolve"
    }
  ],
  "discovery_events": [
    {
      "type": "related_research",
      "suggested_researcher": "arxiv",
      "query": "survey papers history of deep learning 2020-2026",
      "reason": "Web sources are shallow on recent technical developments; academic surveys would be more authoritative",
      "source_locator": null
    }
  ],
  "open_questions": [
    {
      "question": "What caused the second AI winter to end?",
      "context": "Found conflicting narratives about 1990s revival; needed more iterations to resolve.",
      "priority": "medium",
      "source_locator": null
    }
  ],
  "confidence": 0.55,
  "confidence_factors": {
    "num_corroborating_sources": 2,
    "source_authority": "high",
    "contradiction_detected": true,
    "query_specificity_match": 0.5,
    "budget_exhausted": true,
    "recency": "dated"
  },
  "cost_metadata": {
    "tokens_used": 4998,
    "iterations_run": 3,
    "wall_time_sec": 31.2,
    "budget_exhausted": true,
    "model_id": "claude-haiku-4-5"
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440003"
}

Known Limitations (V1)

These are documented architectural decisions, not oversights:

  • Confidence is LLM-generated, not calibrated — we need empirical data before formalizing a rubric. Resolution: V1.1 calibrates after 20-30 real queries.
  • No citation validation (URL/DOI ping) — validation adds latency and complexity; documented as a known risk. Resolution: V2 adds a validator node that programmatically verifies locators.
  • Traces are audit logs, not true replays — true replay requires CAS for fetched content. Resolution: V2 adds Content Addressable Storage for all fetched data.
  • Discovery events are logged, not acted on — MCP is request-response; there is no mid-flight dispatch. Resolution: the V2 PI orchestrator processes events dynamically.
  • No streaming of intermediate progress — MCP tool responses are one-shot. Resolution: V2+ evaluates streaming MCP or a polling pattern.
  • Hub-and-spoke only (no inter-researcher comms) — keeps V1 simple; the PI is the only coordinator. Resolution: V2 adds a dynamic priority queue in the PI.

Versioning

The contract is versioned as v1. If breaking changes are needed (e.g., removing a required field or changing its type), the next version becomes v2 and both can coexist in the network for a transition period.

Backward-compatible changes (adding optional fields) do not require a version bump.

Breaking changes (removing fields, changing types, changing required/optional status) require a new version.

Current version: v1


See also: Architecture, DevelopmentGuide