Aly Sawft · Founder & Engineer, Sawftware LLC · June 19, 2026 · 10 min read

Merkle proofs for document citations: how to ground AI outputs cryptographically

The hallucination citation problem

AI language models cite sources the way confident people give directions: with an authoritative tone, plausible-sounding specifics, and the occasional complete fabrication. The model does not distinguish between "I retrieved this from a document" and "I generated this because it sounds like what would be in a document."

This matters more as AI moves into high-stakes decisions. A legal AI that cites clause 12.3 of a contract — does that clause actually say that? A financial AI that cites a quarterly filing — is that figure actually on page 14? A compliance AI that confirms a policy — does the policy document actually support that reading?

The conventional answer is "put the source text in the prompt and let the human check." This works at small scale. It fails when:

The document is 200 pages and the relevant passage is buried
The agent is running autonomously without a human checking each step
The citation needs to be auditable months later, not just right now
The downstream consumer is another AI agent, not a human

Merkle proofs are a cryptographic answer to this problem. They allow a prover (DocImprint) to prove that a specific text chunk is part of a specific document — and allow any verifier to check that proof without re-reading the full document, without trusting the prover, and without network access to the prover's servers.

What a Merkle tree is (briefly)

A Merkle tree is a binary tree where every leaf node contains the hash of a data block, and every parent node contains the hash of its two children. The root of the tree (the Merkle root) is a single hash that commits to all the data blocks.

The key property: to prove that a specific leaf (a specific data block) is part of the tree, you need only the leaf's hash and one sibling hash per level on the path from that leaf to the root, not all the other leaves. This proof is O(log n) in size, and verification requires only O(log n) hash operations.

Git stores commits, trees, and blobs as a hash-linked Merkle structure (see Pro Git, ch. 10, "Git Internals"). Bitcoin's whitepaper (Nakamoto, 2008) uses Merkle trees for transaction inclusion proofs. Certificate Transparency (RFC 6962) uses them for public audit logs. NIST FIPS 180-4 defines SHA-256, the hash function DocImprint uses for all leaf hashes.

DocImprint applies the same structure to document chunks: each chunk of extracted text becomes a leaf, the tree is built over all chunks, and the root is stored in the bundle's manifest. For a 1,000-chunk document, a Merkle proof requires only about 10 sibling hashes (about 320 bytes), and verifying it costs O(log n) hash operations.

How DocImprint builds the Merkle tree

When a document is extracted and indexed (either directly or when added to a collection), the extracted text is split into chunks. Each chunk is:

Identified by a sequential index and a content-derived ID (chunk_id)
Numbered within the document for citation reference (paragraph numbering for prose, row numbering for tables)
Hashed with SHA-256
Added as a leaf in the Merkle tree

The tree is built bottom-up: leaf hashes are paired and hashed together to form parent nodes, repeating until the root is reached. For odd-numbered leaf counts, the last leaf is duplicated (standard Merkle construction).
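
The bottom-up construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DocImprint's actual implementation: it assumes leaves are SHA-256 over the UTF-8 chunk text, parents are SHA-256 over the raw concatenation of the two child digests, and the last node is duplicated at every level with an odd node count. The function names are hypothetical.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(chunks: list[str]) -> list[list[bytes]]:
    """Return every level of the tree: leaf hashes first, root level last."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # Assumption: a leaf is the SHA-256 of the UTF-8 encoded chunk text.
    level = [sha256(chunk.encode("utf-8")) for chunk in chunks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: duplicate the last node whenever a level is odd.
            level = level + [level[-1]]
        # Assumption: parent = SHA-256(left digest || right digest).
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(chunks: list[str]) -> str:
    """Hex-encoded root that commits to every chunk, in order."""
    return build_levels(chunks)[-1][0].hex()
```

Under these assumptions, changing any single chunk (or reordering chunks) produces a different root, which is the property the manifest relies on.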

The merkle_root is stored in manifest.json, included in the manifest_sha256 calculation, and therefore covered by DocImprint's EIP-191 signature. This binds the root to the specific extraction: swapping a chunk changes the root, which changes the manifest hash, which invalidates the signature.

The tree version (merkle_version: 2) is also stored, allowing the verification algorithm to evolve while maintaining backward compatibility with existing proofs.

What a citation proof contains

When an AI agent answers a question using a specific chunk — "clause 12.3 on page 4 says..." — it includes the chunk_id in its response. The citation proof for that chunk contains:

chunk_id: the identifier of the chunk being proved
leaf_hash: SHA-256 of the chunk text
proof: an array of sibling hashes up the tree
root: the Merkle root (matches what is in the manifest)
position: the chunk's leaf position in the tree

To verify: hash the chunk text, then iteratively hash with each sibling in the proof array (left or right, as specified), until you reach the root. If the computed root matches the stored root, the chunk is a provable member of the bundle.

This verification can be done in about 20 lines of JavaScript, without any network calls, without trusting DocImprint, and without the full document. A proof contains ⌈log2(n)⌉ sibling hashes: for a 1,000-chunk document, that is 10 hashes (320 bytes).
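
The same leaf-to-root walk can be sketched in Python. Several details here are assumptions rather than documented behavior: that left/right order is derived from the leaf position (an even index means the current node is the left child), that parents are SHA-256 over the concatenated raw digests, and that hashes arrive as hex strings with an optional "sha256:" prefix. The function names are hypothetical.

```python
import hashlib


def _raw(value: str) -> bytes:
    """Decode a hex digest, tolerating an optional 'sha256:' prefix."""
    return bytes.fromhex(value.split(":", 1)[-1])


def verify_citation(chunk_text: str, proof: list[str], position: int, root: str) -> bool:
    """Recompute the root from the chunk text and compare it to the stored root."""
    node = hashlib.sha256(chunk_text.encode("utf-8")).digest()
    index = position
    for sibling_hex in proof:
        sibling = _raw(sibling_hex)
        if index % 2 == 0:
            # Assumption: even index means this node is the left child.
            node = hashlib.sha256(node + sibling).digest()
        else:
            node = hashlib.sha256(sibling + node).digest()
        index //= 2  # move up one level
    return node == _raw(root)
```

If the proof format states the side of each sibling explicitly instead, the `index % 2` test would be replaced by that flag; the rest of the walk stays the same.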

Request a citation proof and verify it (bash)

# Get a citation proof for a specific chunk
curl -X POST https://api.docimprint.com/v1/extract/ev_abc123/verify-citation \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{"chunk_id":"c3f2a1","quoted_text":"The termination notice period shall be 90 days"}'

# Response:
# {
#   "valid": true,
#   "chunk_id": "c3f2a1",
#   "leaf_hash": "sha256:a3b2c1...",
#   "proof": ["sha256:x1y2z3...", "sha256:m4n5o6..."],
#   "root": "sha256:r7s8t9...",
#   "position": 47
# }

Integrating citation proofs into agent responses

The practical integration pattern: when your agent answers a question using DocImprint's Q&A or ask-collection endpoint, the response already includes chunk_ids for cited passages. Your agent can:

Pass chunk_ids to the user or downstream system as citation references
Call verify-citation on-demand when a human or auditor wants to verify a specific citation
Pre-verify all citations before presenting an answer — reject any answer where the Merkle proof fails

The third option is the strongest: your agent only produces answers it can prove. If the model extrapolates beyond what is in the document (a hallucination), the citation it invents will either reference a chunk_id that does not exist or quote text that does not hash to that chunk's leaf, and verification will fail. You can treat citation verification failures as confidence signals or hard stops.
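
A minimal sketch of the hard-stop variant, assuming the caller supplies its own `verify` function (for example, a wrapper around verify-citation or a local proof walk). `Citation`, `UnverifiedCitationError`, and `gate_answer` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Citation:
    chunk_id: str
    quoted_text: str


class UnverifiedCitationError(Exception):
    """Raised when an answer cites something that cannot be proved."""


def gate_answer(answer: str, citations: list[Citation],
                verify: Callable[[Citation], bool]) -> str:
    """Return the answer only if every citation passes verification."""
    failed = [c.chunk_id for c in citations if not verify(c)]
    if failed:
        # Hard stop: refuse to present an answer with unprovable citations.
        raise UnverifiedCitationError(f"unverifiable citations: {failed}")
    return answer
```

To treat failures as confidence signals instead, the same function could return the list of failed chunk_ids alongside the answer rather than raising.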

Claim-check mode takes this further: rather than answering a question, it evaluates a list of claims against a document and returns supported / contradicted / not_found for each claim, with the supporting passage and its chunk_id. Every supported claim is verifiable via Merkle proof.

Claim-check with citation chunk IDs (bash)

curl -X POST https://api.docimprint.com/v1/check-claims \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "source": "https://example.com/contract.pdf",
    "claims": [
      "The termination notice period is 90 days",
      "Governing law is the State of Delaware",
      "The contract auto-renews annually"
    ]
  }'

# Response per claim:
# { "claim": "...", "verdict": "supported", "evidence": { "quote": "...", "chunk_id": "c3f2a1", "paragraphs": [47] } }
# { "claim": "...", "verdict": "supported", "evidence": { "quote": "...", "chunk_id": "d4e5f6" } }
# { "claim": "...", "verdict": "not_found", "evidence": null }
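
On the client side, the per-claim results can be sorted by verdict so that supported claims go on to proof verification and the rest are flagged. The sketch below assumes the results have already been parsed into a list of dictionaries with the fields shown in the sample response; the helper names are hypothetical.

```python
def split_by_verdict(results: list[dict]) -> dict[str, list[dict]]:
    """Group claim-check results under supported / contradicted / not_found."""
    buckets: dict[str, list[dict]] = {"supported": [], "contradicted": [], "not_found": []}
    for result in results:
        buckets.setdefault(result["verdict"], []).append(result)
    return buckets


def chunk_ids_to_verify(results: list[dict]) -> list[str]:
    """chunk_ids of supported claims, i.e. the ones worth a Merkle proof check."""
    return [
        r["evidence"]["chunk_id"]
        for r in results
        if r["verdict"] == "supported" and r.get("evidence")
    ]
```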

Offline verification: no network required

The most powerful property of Merkle proofs is that they are verifiable offline. A party that receives a bundle ZIP (downloaded from GET /v1/extract/:id/download) and a set of citation proofs can verify everything without contacting DocImprint:

Extract manifest.json from the ZIP
Verify the EIP-191 signature using the public key from /.well-known/docimprint-keys.json (fetched once and stored locally)
Recompute each artifact SHA-256 and compare to the manifest
For each citation proof: hash the chunk text, walk the proof array, verify the root matches the manifest's merkle_root

This is full end-to-end verification. The only trust assumption is that DocImprint's secp256k1 public key is authentic — the same trust assumption you make when trusting any TLS certificate. And if the bundle was notarized on Base, you can additionally verify the on-chain timestamp without trusting DocImprint at all.

For legal proceedings, this means the opposing party can independently verify the evidence without asking DocImprint to "confirm" anything. The bundle is self-contained proof.
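
The artifact-hash step of the offline check can be sketched with the Python standard library alone. The manifest layout used here, an "artifacts" object mapping file names to SHA-256 hex digests, is a hypothetical stand-in; the real manifest.json schema may differ. The EIP-191 signature check is left out because it needs a secp256k1 library outside the standard library.

```python
import hashlib
import io
import json
import zipfile


def check_artifacts(bundle: bytes) -> dict[str, bool]:
    """Map each artifact listed in the manifest to whether its hash matches."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        names = set(zf.namelist())
        results: dict[str, bool] = {}
        # Hypothetical layout: {"artifacts": {"<file name>": "<sha256 hex>"}}
        for name, expected in manifest["artifacts"].items():
            if name not in names:
                results[name] = False  # listed in the manifest but missing from the ZIP
                continue
            actual = hashlib.sha256(zf.read(name)).hexdigest()
            # Tolerate an optional "sha256:" prefix on the recorded digest.
            results[name] = actual == expected.split(":", 1)[-1]
        return results
```

A bundle passes this step only if every value in the returned mapping is true; the citation proofs are then checked against the manifest's merkle_root.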

Collection-level citation proofs

Individual bundle proofs verify that a chunk belongs to a specific document. Collection workflows add a cross-document layer.

When bundles are indexed into a collection, semantic search returns chunks from multiple bundle_ids. An ask_collection response cites passages with both bundle_id and chunk_id. Each citation is independently verifiable:

Call verify-citation on the source bundle with the chunk_id
Receive a Merkle proof bound to that bundle's merkle_root
Confirm the quoted text matches the leaf hash

Cross-document answers therefore decompose into a set of independently verifiable atomic claims. A compliance report citing ten contracts is ten Merkle proofs, not one unverifiable summary.

This composability is what makes verifiable document memory scale to corpus-level reasoning. RAG systems return chunks with embedding similarity scores — unverifiable ranking metadata. DocImprint returns chunks with cryptographic membership proofs — verifiable regardless of how they were retrieved.

For agents presenting multi-document findings, the recommended pattern is: generate answer → extract all chunk_ids → verify each proof before presenting → attach proof metadata to the final response. Downstream systems (humans, auditors, other agents) can spot-check any citation without re-running the full pipeline.
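
One way to sketch that pattern: group the citations by bundle, fetch a proof for each, stop on the first failure, and attach the proofs to the final response. `fetch_proof` is a hypothetical callable supplied by the caller (for instance, a wrapper around verify-citation), and the citation dictionaries are assumed to carry bundle_id and chunk_id keys.

```python
from collections import defaultdict
from typing import Callable


def verify_collection_answer(answer: str, citations: list[dict],
                             fetch_proof: Callable[[str, str], dict]) -> dict:
    """Verify every cited chunk, then return the answer with proofs attached."""
    by_bundle: dict[str, list[str]] = defaultdict(list)
    for citation in citations:
        by_bundle[citation["bundle_id"]].append(citation["chunk_id"])

    proofs = []
    for bundle_id, chunk_ids in by_bundle.items():
        for chunk_id in chunk_ids:
            proof = fetch_proof(bundle_id, chunk_id)
            # This trusts the returned "valid" flag; a stricter caller would
            # re-walk the proof locally against the bundle's merkle_root.
            if not proof.get("valid"):
                raise ValueError(f"citation {bundle_id}/{chunk_id} failed verification")
            proofs.append({"bundle_id": bundle_id, **proof})
    return {"answer": answer, "proofs": proofs}
```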

Performance and practical limits

Merkle tree construction adds minimal overhead to extraction. For a typical 10-page document producing ~100 chunks, tree construction takes under 10ms and adds ~3KB to the manifest.

Proof size grows logarithmically: a 100-chunk document needs 7 hashes per proof (224 bytes), and a 10,000-chunk document needs 14 hashes (448 bytes). For practical document sizes (up to a few thousand chunks), proofs are tiny.
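
These figures follow from the tree height, ⌈log2(n)⌉, multiplied by the 32-byte SHA-256 digest size. A small sketch of the arithmetic, assuming one sibling hash per level and counting only the sibling hashes (not the leaf hash, root, or any encoding overhead):

```python
HASH_BYTES = 32  # size of a SHA-256 digest


def proof_hashes(n_chunks: int) -> int:
    """Sibling hashes in one proof: the tree height, ceil(log2(n_chunks))."""
    if n_chunks < 1:
        raise ValueError("a document needs at least one chunk")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, with no float rounding.
    return (n_chunks - 1).bit_length()


def proof_bytes(n_chunks: int) -> int:
    """Raw size of the sibling hashes in one proof."""
    return proof_hashes(n_chunks) * HASH_BYTES


for n in (100, 1_000, 10_000):
    print(n, proof_hashes(n), proof_bytes(n))
# prints:
# 100 7 224
# 1000 10 320
# 10000 14 448
```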

Verification speed: verifying a single citation proof takes microseconds — it is just sequential SHA-256 operations. A browser, a Node.js script, a Python notebook — all can verify in real time.

The current implementation uses merkle_version: 2, a standard binary Merkle tree with SHA-256 leaves. Future versions may add a Sparse Merkle Tree variant for efficient non-membership proofs (useful for proving a claim is not supported, not just that it is).

Batch verification: when an agent cites twenty chunks from the same bundle, you verify twenty proofs against the same root. Cache the root from the manifest after first verification — subsequent proofs only need the leaf-to-root walk, not a full manifest re-fetch.
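
A root cache along these lines might look like the sketch below. `load_root` is a hypothetical callable that reads a bundle's manifest, verifies its signature, and returns the merkle_root; it runs once per bundle. `roots_match` only confirms that each proof claims the trusted root, so the leaf-to-root walk still has to be done for every proof.

```python
from typing import Callable


class RootCache:
    """Remember each bundle's Merkle root after the first verified load."""

    def __init__(self, load_root: Callable[[str], str]):
        self._load_root = load_root  # hypothetical: verifies the manifest, returns its root
        self._roots: dict[str, str] = {}

    def root_for(self, bundle_id: str) -> str:
        if bundle_id not in self._roots:
            self._roots[bundle_id] = self._load_root(bundle_id)
        return self._roots[bundle_id]


def roots_match(cache: RootCache, bundle_id: str, proofs: list[dict]) -> bool:
    """True if every proof claims the same root as the verified manifest."""
    expected = cache.root_for(bundle_id)
    return all(proof["root"] == expected for proof in proofs)
```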

Storage overhead: the merkle_root adds 32 bytes to the manifest regardless of document size. If you persist proofs alongside agent responses for audit logs, a 500-chunk document needs 9 sibling hashes per citation (288 bytes). This is negligible compared to storing the full document text.
