Verifiable document memory is a persistent, cryptographically auditable layer where AI agents store document captures — not just embeddings or summaries, but tamper-evident evidence bundles with manifest SHA-256 hashes, artifact digests, optional Merkle citation proofs (~320-byte O(log n) proofs), and chain-of-custody metadata. Unlike RAG, every claim can be traced to an immutable source.
RAG retrieves similar text chunks but cannot prove they match an authoritative source at capture time. DocImprint evidence bundles bind extracted bytes to cryptographic hashes and optional Base L2 anchors, enabling auditable agent citations months later.
Lean extracts start at $0.010 USDC via x402 on Base. Full evidence bundles with screenshot, PDF, Markdown, and OCR are $0.075 per call. Verify, download, and signing key endpoints are always free — offline verification requires no API key.
Retrieval-augmented generation answers questions from chunks, but it cannot prove what was captured, when, or whether artifacts were altered. High-stakes workflows — legal review, compliance, finance, audit — require a paper trail: who captured the document, which passages support each claim, and whether the bundle still matches its signed manifest.
Verifiable document memory closes that gap. Each capture produces a bundle_id (ev_…) bound to an owner wallet or API key. Artifacts (Markdown, screenshot, PDF, OCR) are content-addressed. Agents can cite chunk_id values and verify citations with Merkle proofs. Legal hold and retention policies govern lifecycle; optional Base L2 notarization anchors existence at a point in time.
DocImprint's trust chain is designed for third-party verification without trusting DocImprint servers at read time:
Deep verify is free: GET /v1/extract/:id/verify returns 200 valid or 409 tamper detected.
A typical agent workflow maps to API routes:
Capture — POST /v1/extract with store=true queues a full evidence bundle (202 + job_id by default). Use ?sync=true for an immediate bundle_id on small pages. Poll GET /v1/jobs/:id until complete.
Analyze — Use claim-check, qa, summarize, or focused endpoints (/v1/summarize, /v1/qa) for lean responses without storage, or re-run extract modes against the same source.
Corpus — Add bundles to a collection; index asynchronously; search with GET /v1/collections/:id/search and ask with POST /v1/collections/:id/ask.
Retain — Set retention (90d or ISO date), legal hold (PUT /v1/extract/:id/hold), version chains (parent_bundle_id), and agent handoff logs.
Prove — Verify bundle integrity, verify citations, download ZIP artifacts, notarize on Base, or share bundle_id with auditors.
Use DocImprint when documents must be defensible: contracts, financial filings, research papers, invoices, regulatory forms, insurance policies, and deposition materials.
Do not use DocImprint as a bulk open-web crawler. URLs are an ingestion path for hosted PDFs and pages — the product category is high-trust document intelligence, not Firecrawl-style scraping at scale.
x402 customers pay per call with USDC on Base; resources bind to the payment payer address. Send X-Wallet-Address on owner-scoped GET/PUT/DELETE routes. Subscription customers use Authorization: Bearer dr_live_… — no wallet header required.
Missing wallet identity on owner-scoped routes returns 401 WALLET_REQUIRED.
A document enters DocImprint as a URL, raw file upload, or base64 payload. The pipeline has four layers:
Ingestion — Browser Rendering captures the page or file as it appears to a real browser. For PDFs, each page is converted to an image for visual OCR, or Textract is used for structured text extraction.
Extraction — The AI layer (Claude claude-sonnet-4-6 or Claude Haiku) runs the requested mode: extract fields, summarize, run Q&A, check claims, or translate. For large documents, a chunked pipeline splits the text, processes each chunk, and re-summarizes the pieces.
Evidence bundle — All artifacts (Markdown, screenshot, PDF, OCR text, structured fields) are stored in Cloudflare R2. A manifest.json is generated listing each artifact's SHA-256 hash, the capture URL, mode, and timestamps. DocImprint signs the manifest SHA-256 with its secp256k1 key.
Memory layer — The bundle is indexed into Cloudflare Vectorize (embeddings). Collections aggregate bundles into matter corpora for semantic search and cross-document Q&A with citation proofs.
DocImprint uses Cloudflare-native storage primitives:
R2 — Immutable artifact files: Markdown text, full-page screenshot PNG, raw PDF bytes, OCR output, structured JSON fields, and manifest.json. URLs are content-addressed by bundle_id.
D1 (SQLite) — Bundle metadata: bundle_id, owner identity (wallet address or API key hash), mode, source URL, captured_at, retention policy, legal hold flag, manifest SHA-256, EAS attestation UID if notarized, parent_bundle_id for version chains.
Vectorize — Chunk embeddings for semantic search within collections. Each chunk stores bundle_id, chunk_id, and Merkle leaf index to enable citation verification after retrieval.
KV — Rate limit counters, API key quota, nonce cache for x402 replay prevention.
Nothing is stored outside Cloudflare infrastructure. No third-party analytics or telemetry services receive document content.
Verify a bundle without trusting DocImprint servers:
bash# 1. Download the bundle ZIP curl -o bundle_ev_abc123.zip \ https://api.docimprint.com/v1/extract/ev_abc123/download # 2. Unzip and compute manifest hash locally unzip bundle_ev_abc123.zip -d bundle/ sha256sum bundle/manifest.json # Compare to manifest_sha256 in the JSON response # 3. Verify signature.json against GET /v1/keys (active or retired signer) # EIP-191 over manifest_sha256; complete bundles require platform signature for valid: true # 4. Check each artifact's hash matches the manifest jq '.artifacts[].sha256' bundle/manifest.json | while read hash; do # find the corresponding file and verify sha256sum matches echo "Checking $hash" done
The bundle is valid if the manifest SHA-256 matches, signature.json verifies against a key from GET /v1/keys, and each artifact hash matches file bytes. Handoffs in provenance are application-layer audit records, not signed custody transfers.
Standard RAG pipelines embed document chunks and retrieve them at query time. This answers "what does this document say?" but cannot answer "was this document captured unmodified?", "who created this chunk?", or "can I prove this passage to a third party?"
Verifiable document memory adds the provenance layer RAG omits:
| RAG | Verifiable document memory | |
|---|---|---|
| Retrieval | Yes | Yes |
| Citation source | Approximate (cosine match) | Exact (chunk_id + Merkle proof) |
| Tamper detection | No | Yes (manifest SHA-256 + signature) |
| Capture timestamp | No | Yes (captured_at + on-chain anchor) |
| Chain of custody | No | Yes (agent provenance logs) |
| Legal hold | No | Yes (PUT /v1/extract/:id/hold) |
For most question-answering use cases, RAG is sufficient. When answers must be defensible — to regulators, auditors, courts, or counterparties — use verifiable document memory.
curl -X POST https://api.docimprint.com/v1/extract \
-H "Content-Type: application/json" \
-H "X-Payment: <eip712-signed-usdc-transfer>" \
-d '{"source":"https://example.com/contract.pdf","include":["markdown","summary","screenshot"]}'