Aly Sawft · Founder & Engineer, Sawftware LLC · 10 min read
AI language models cite sources the way confident people give directions: with an authoritative tone, plausible-sounding specifics, and the occasional complete fabrication. The model does not distinguish between "I retrieved this from a document" and "I generated this because it sounds like what would be in a document."
This matters more as AI moves into high-stakes decisions. A legal AI that cites clause 12.3 of a contract — does that clause actually say that? A financial AI that cites a quarterly filing — is that figure actually on page 14? A compliance AI that confirms a policy — does the policy document actually support that reading?
The conventional answer is "put the source text in the prompt and let the human check." This works at small scale. It fails when:
Merkle proofs are a cryptographic answer to this problem. They allow a prover (DocImprint) to prove that a specific text chunk is part of a specific document — and allow any verifier to check that proof without re-reading the full document, without trusting the prover, and without network access to the prover's servers.
A Merkle tree is a binary tree where every leaf node contains the hash of a data block, and every parent node contains the hash of its two children. The root of the tree (the Merkle root) is a single hash that commits to all the data blocks.
The key property: to prove that a specific leaf (a specific data block) is part of the tree, you need only the leaf's hash and the hashes of its sibling nodes up to the root — not all the other leaves. This proof is O(log n) in size, and verification requires only O(log n) hash operations.
Git uses Merkle trees for commits (see Pro Git, ch. 10, "Git Internals"). Bitcoin's whitepaper (Nakamoto, 2008) uses Merkle trees for transaction inclusion proofs. Certificate Transparency (RFC 6962) uses them for public audit logs. NIST FIPS 180-4 defines SHA-256, the hash function used for all leaf hashes.
DocImprint applies the same structure to document chunks: each chunk of extracted text becomes a leaf, the tree is built over all chunks, and the root is stored in the bundle's manifest. For a 1,000-chunk document, a Merkle proof requires only ~10 sibling hashes (~320 bytes) — O(log n) verification cost.
When a document is extracted and indexed (either directly or when added to a collection), the extracted text is split into chunks. Each chunk is:
The tree is built bottom-up: leaf hashes are paired and hashed together to form parent nodes, and the process repeats until a single root remains. When the leaf count is odd, the last leaf is duplicated so that every node has a pair (a common convention, also used by Bitcoin).
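The bottom-up construction can be sketched in a few lines of Python. This is an illustration rather than DocImprint's actual implementation: it assumes a leaf is the SHA-256 of the chunk's UTF-8 bytes, that a parent is the SHA-256 of the two raw child digests concatenated, and that the duplication rule is applied at every level with an odd node count. The name `merkle_root` is hypothetical.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(chunks: list[str]) -> bytes:
    """Return the Merkle root over a non-empty list of text chunks."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # Assumption: a leaf is SHA-256 of the chunk's UTF-8 bytes.
    level = [sha256(c.encode("utf-8")) for c in chunks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Odd node count: duplicate the last node so it has a pair.
            level.append(level[-1])
        # Assumption: parent = SHA-256(left || right) over raw digests.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


root = merkle_root(["clause 12.1", "clause 12.2", "clause 12.3"])
tampered = merkle_root(["clause 12.1", "clause 12.2", "clause 12.4"])
print(root.hex())
print(root != tampered)  # True: editing any chunk changes the root
```

Because every parent depends on both children, a one-character change in any chunk propagates all the way to the root.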
The merkle_root is stored in manifest.json, included in the manifest_sha256 calculation, and therefore covered by DocImprint's EIP-191 signature. This binds the root to the specific extraction: swapping a chunk changes the root, which changes the manifest hash, which invalidates the signature.
The tree version (merkle_version: 2) is also stored, allowing the verification algorithm to evolve while maintaining backward compatibility with existing proofs.
When an AI agent answers a question using a specific chunk — "clause 12.3 on page 4 says..." — it includes the chunk_id in its response. The citation proof for that chunk contains:
To verify: hash the chunk text, then iteratively hash with each sibling in the proof array (left or right, as specified), until you reach the root. If the computed root matches the stored root, the chunk is a provable member of the bundle.
This verification can be done in about 20 lines of JavaScript, without any network calls, without trusting DocImprint, and without the full document. The proof consists of ceil(log2(n)) hashes: for a 1,000-chunk document, that is 10 hashes (320 bytes).
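The leaf-to-root walk might look like the following Python sketch. It is written under stated assumptions, not taken from DocImprint's code: leaves are SHA-256 of the chunk's UTF-8 bytes, parents are SHA-256 of the concatenated raw child digests, and the left/right side of each sibling is derived from the chunk's position (even index means the sibling is on the right). The helper `build_proof` exists only so the example has a proof to check; both function names are hypothetical.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_proof(leaves: list[bytes], index: int) -> tuple[list[bytes], bytes]:
    """Return (sibling hashes from the leaf level upward, root) for leaves[index]."""
    proof, level = [], list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd levels by duplicating the last node
        proof.append(level[index ^ 1])  # the other node in this pair
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof, level[0]


def verify_proof(chunk_text: str, position: int, proof: list[bytes], root: bytes) -> bool:
    """Recompute the root from one chunk and its sibling hashes."""
    node = sha256(chunk_text.encode("utf-8"))
    for sibling in proof:
        if position % 2 == 0:
            node = sha256(node + sibling)  # we are the left child
        else:
            node = sha256(sibling + node)  # we are the right child
        position //= 2
    return node == root


chunks = [f"chunk {i}" for i in range(5)]
leaves = [sha256(c.encode("utf-8")) for c in chunks]
proof, root = build_proof(leaves, 3)
print(verify_proof("chunk 3", 3, proof, root))           # True
print(verify_proof("chunk 3 (edited)", 3, proof, root))  # False
print(len(proof))                                        # 3
```

The verifier never sees the other four chunks, only three sibling hashes and the root.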
# Get a citation proof for a specific chunk
curl -X POST https://api.docimprint.com/v1/extract/ev_abc123/verify-citation \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{"chunk_id":"c3f2a1","quoted_text":"The termination notice period shall be 90 days"}'
# Response:
# {
#   "valid": true,
#   "chunk_id": "c3f2a1",
#   "leaf_hash": "sha256:a3b2c1...",
#   "proof": ["sha256:x1y2z3...", "sha256:m4n5o6..."],
#   "root": "sha256:r7s8t9...",
#   "position": 47
# }

The practical integration pattern: when your agent answers a question using DocImprint's Q&A or ask-collection endpoint, the response already includes chunk_ids for cited passages. Your agent can:
The third option is the strongest: your agent only produces answers it can prove. If the model extrapolates beyond what is in the document (a hallucination), the citation it invents will not have a valid chunk_id, and the verification will fail. You can treat citation verification failures as confidence signals or hard stops.
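Treating verification failures as either a hard stop or a confidence signal could be wired up roughly as below. This is a sketch only: `Citation`, `gate_answer`, and `UnverifiedCitation` are hypothetical names, and `is_valid` stands in for whatever check you use (a call to the verify-citation endpoint or a local Merkle proof check).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Citation:
    chunk_id: str
    quoted_text: str


class UnverifiedCitation(Exception):
    """Raised when an answer cites a chunk that cannot be verified."""


def gate_answer(
    answer: str,
    citations: list[Citation],
    is_valid: Callable[[Citation], bool],
    hard_stop: bool = True,
) -> dict:
    """Release an answer only if its citations verify, or flag it if not."""
    failed = [c.chunk_id for c in citations if not is_valid(c)]
    if failed and hard_stop:
        raise UnverifiedCitation(f"unverifiable chunk_ids: {failed}")
    # An answer with no citations at all is not treated as verified.
    return {
        "answer": answer,
        "verified": bool(citations) and not failed,
        "failed_chunk_ids": failed,
    }


# Stub verifier for illustration: only one chunk_id is known to be provable.
known = {"c3f2a1"}
check = lambda c: c.chunk_id in known

good = [Citation("c3f2a1", "The termination notice period shall be 90 days")]
bad = good + [Citation("zz9999", "The contract auto-renews annually")]

print(gate_answer("Notice period is 90 days.", good, check)["verified"])  # True
print(gate_answer("It auto-renews.", bad, check, hard_stop=False)["failed_chunk_ids"])  # ['zz9999']
```

With `hard_stop=True` the second call would raise instead, which suits pipelines where an unproven citation must never reach the user.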
Claim-check mode takes this further: rather than answering a question, it evaluates a list of claims against a document and returns supported / contradicted / not_found for each claim, with the supporting passage and its chunk_id. Every supported claim is verifiable via Merkle proof.
curl -X POST https://api.docimprint.com/v1/check-claims \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "source": "https://example.com/contract.pdf",
    "claims": [
      "The termination notice period is 90 days",
      "Governing law is the State of Delaware",
      "The contract auto-renews annually"
    ]
  }'
# Response per claim:
# { "claim": "...", "verdict": "supported", "evidence": { "quote": "...", "chunk_id": "c3f2a1", "paragraphs": [47] } }
# { "claim": "...", "verdict": "supported", "evidence": { "quote": "...", "chunk_id": "d4e5f6" } }
# { "claim": "...", "verdict": "not_found", "evidence": null }

The most powerful property of Merkle proofs is that they are verifiable offline. A party that receives a bundle ZIP (downloaded from GET /v1/extract/:id/download) and a set of citation proofs can verify everything without contacting DocImprint:
This is full end-to-end verification. The only trust assumption is that DocImprint's secp256k1 public key is authentic, which is comparable to the trust you place in a certificate authority's key when you accept a TLS certificate. If the bundle was also notarized on Base, you can additionally verify the on-chain timestamp without trusting DocImprint at all.
For legal proceedings, this means the opposing party can independently verify the evidence without asking DocImprint to "confirm" anything. The bundle is self-contained proof.
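As a rough illustration of the offline path, the Python sketch below reads the root from a bundle ZIP and checks one citation proof using only the standard library. Several details are assumptions rather than documented facts: that manifest.json sits at the top level of the ZIP, that merkle_root is stored as "sha256:<hex>" like the hashes in the API response, and that sibling order is derived from the chunk's position. Signature verification of the manifest is a separate step and is not shown. The function names are hypothetical.

```python
import hashlib
import io
import json
import zipfile


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def root_from_bundle(zip_bytes: bytes) -> bytes:
    """Read merkle_root out of manifest.json inside a bundle ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as bundle:
        manifest = json.loads(bundle.read("manifest.json"))
    # Assumption: the root is serialized as "sha256:<hex>".
    return bytes.fromhex(manifest["merkle_root"].removeprefix("sha256:"))


def verify_offline(zip_bytes: bytes, chunk_text: str, position: int, proof: list[str]) -> bool:
    """Check a citation proof against the root stored in the bundle itself."""
    # NOTE: checking the manifest signature is out of scope for this sketch.
    root = root_from_bundle(zip_bytes)
    node = sha256(chunk_text.encode("utf-8"))
    for entry in proof:
        sibling = bytes.fromhex(entry.removeprefix("sha256:"))
        node = sha256(node + sibling) if position % 2 == 0 else sha256(sibling + node)
        position //= 2
    return node == root


# Build a tiny two-chunk bundle in memory so the example is self-contained.
a, b = sha256(b"chunk A"), sha256(b"chunk B")
manifest = {"merkle_root": "sha256:" + sha256(a + b).hex(), "merkle_version": 2}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("manifest.json", json.dumps(manifest))
bundle = buf.getvalue()

print(verify_offline(bundle, "chunk B", 1, ["sha256:" + a.hex()]))  # True
```

Nothing in this path touches the network, so the check works the same on an air-gapped machine.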
Individual bundle proofs verify that a chunk belongs to a specific document. Collection workflows add a cross-document layer.
When bundles are indexed into a collection, semantic search returns chunks from multiple bundle_ids. An ask_collection response cites passages with both bundle_id and chunk_id. Each citation is independently verifiable:
Cross-document answers therefore decompose into a set of independently verifiable atomic claims. A compliance report citing ten contracts is ten Merkle proofs, not one unverifiable summary.
This composability is what makes verifiable document memory scale to corpus-level reasoning. RAG systems return chunks with embedding similarity scores — unverifiable ranking metadata. DocImprint returns chunks with cryptographic membership proofs — verifiable regardless of how they were retrieved.
For agents presenting multi-document findings, the recommended pattern is: generate answer → extract all chunk_ids → verify each proof before presenting → attach proof metadata to the final response. Downstream systems (humans, auditors, other agents) can spot-check any citation without re-running the full pipeline.
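The verify-then-attach step of that pattern might be sketched as follows. The citation shape used here (`bundle_id`, `chunk_id`, `text`, `position`, `proof`) and the function `verify_citations` are hypothetical, and the hashing conventions are the same assumptions as elsewhere: UTF-8 leaves, concatenated raw digests for parents, and sibling order derived from position.

```python
import hashlib
from collections import defaultdict


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def verify_proof(text: str, position: int, proof: list[bytes], root: bytes) -> bool:
    node = sha256(text.encode("utf-8"))
    for sibling in proof:
        node = sha256(node + sibling) if position % 2 == 0 else sha256(sibling + node)
        position //= 2
    return node == root


def verify_citations(citations: list[dict], roots: dict[str, bytes]) -> dict[str, list[dict]]:
    """Check each citation against its own bundle's root and group results by bundle."""
    report = defaultdict(list)
    for c in citations:
        root = roots.get(c["bundle_id"])
        ok = root is not None and verify_proof(c["text"], c["position"], c["proof"], root)
        report[c["bundle_id"]].append({"chunk_id": c["chunk_id"], "verified": ok})
    return dict(report)


# Two toy bundles: bundle_a has two chunks, bundle_b has a single chunk.
t1, t2, t3 = "Payment is due in 30 days", "Governing law is Delaware", "Term is 24 months"
h1, h2, h3 = (sha256(t.encode("utf-8")) for t in (t1, t2, t3))
roots = {"bundle_a": sha256(h1 + h2), "bundle_b": h3}

citations = [
    {"bundle_id": "bundle_a", "chunk_id": "c1", "text": t1, "position": 0, "proof": [h2]},
    {"bundle_id": "bundle_b", "chunk_id": "c7", "text": t3, "position": 0, "proof": []},
    # A misquoted passage: the proof no longer leads to bundle_a's root.
    {"bundle_id": "bundle_a", "chunk_id": "c2", "text": "Governing law is Texas",
     "position": 1, "proof": [h1]},
]
report = verify_citations(citations, roots)
print(report["bundle_a"])
print(report["bundle_b"])
```

Each entry in the report stands alone, so an auditor can re-check a single citation without the rest.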
Merkle tree construction adds minimal overhead to extraction. For a typical 10-page document producing ~100 chunks, tree construction takes under 10ms and adds ~3KB to the manifest.
Proof size grows logarithmically: 100 chunks = 7 hashes per proof = 224 bytes. 10,000 chunks = 14 hashes = 448 bytes. For practical document sizes (up to a few thousand chunks), proofs are tiny.
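The figures above follow from one sibling hash per tree level, with 32 bytes per SHA-256 digest. A small Python check, assuming a binary tree whose height is ceil(log2(n)) (the helper name `proof_size` is illustrative):

```python
HASH_BYTES = 32  # SHA-256 digest size


def proof_size(n_chunks: int) -> tuple[int, int]:
    """Return (sibling hashes, bytes) for one proof in a tree with n_chunks leaves."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    hashes = (n_chunks - 1).bit_length()
    return hashes, hashes * HASH_BYTES


for n in (100, 1_000, 10_000):
    print(n, proof_size(n))
# 100 (7, 224)
# 1000 (10, 320)
# 10000 (14, 448)
```

Going from 100 chunks to 10,000 multiplies the document size by 100 but only doubles the proof.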
Verification speed: verifying a single citation proof takes microseconds — it is just sequential SHA-256 operations. A browser, a Node.js script, a Python notebook — all can verify in real time.
The current implementation is merkle_version: 2, a standard binary Merkle tree with SHA-256 leaf hashes. Future versions may add a Sparse Merkle Tree variant for efficient non-membership proofs, which would be useful for proving that a claim is not supported rather than only that it is.
Batch verification: when an agent cites twenty chunks from the same bundle, you verify twenty proofs against the same root. Cache the root from the manifest after first verification — subsequent proofs only need the leaf-to-root walk, not a full manifest re-fetch.
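One way to express the caching idea is to memoize the root lookup, as in this Python sketch. Everything here is illustrative: `trusted_root` is a stand-in for loading a manifest and checking its signature, the tree helpers use the same assumed hashing conventions as the rest of this article's sketches, and the counter exists only to show that the manifest is read once.

```python
import hashlib
from functools import lru_cache


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every level of the tree, leaf level first, root level last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            cur.append(cur[-1])  # pad odd levels by duplicating the last node
        levels.append([sha256(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def proof_for(levels: list[list[bytes]], index: int) -> list[bytes]:
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof


def verify_proof(text: str, position: int, proof: list[bytes], root: bytes) -> bool:
    node = sha256(text.encode("utf-8"))
    for sibling in proof:
        node = sha256(node + sibling) if position % 2 == 0 else sha256(sibling + node)
        position //= 2
    return node == root


CHUNKS = [f"chunk {i}" for i in range(20)]
LEVELS = build_tree([sha256(c.encode("utf-8")) for c in CHUNKS])
READS = {"count": 0}  # demonstration counter for manifest loads


@lru_cache(maxsize=None)
def trusted_root(bundle_id: str) -> bytes:
    """Stand-in for: load the manifest, check its signature, return merkle_root."""
    READS["count"] += 1
    return LEVELS[-1][0]


results = [
    verify_proof(CHUNKS[i], i, proof_for(LEVELS, i), trusted_root("ev_abc123"))
    for i in range(20)
]
print(all(results), READS["count"])  # True 1
```

The expensive step (manifest load and signature check) runs once per bundle, while each of the twenty proofs costs only five hashes.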
Storage overhead: for a 500-chunk document, the merkle_root is 32 bytes in the manifest. Proof storage for audit logs (if you persist proofs alongside agent responses) is ~320 bytes per citation at most. This is negligible compared to storing the full document text.