docimprint

Verifiable document memory for AI agents

What is verifiable document memory?

Verifiable document memory is a persistent, cryptographically auditable layer where AI agents store document captures: not just embeddings or summaries, but tamper-evident evidence bundles with manifest SHA-256 hashes, artifact digests, optional Merkle citation proofs (O(log n) in size, roughly 320 bytes each), and chain-of-custody metadata. Unlike with RAG alone, every claim can be traced back to an immutable source.

How is DocImprint different from RAG?

RAG retrieves similar text chunks but cannot prove they match an authoritative source at capture time. DocImprint evidence bundles bind extracted bytes to cryptographic hashes and optional Base L2 anchors, enabling auditable agent citations months later.

How much does verifiable document capture cost?

Lean extracts start at $0.010 USDC via x402 on Base. Full evidence bundles with screenshot, PDF, Markdown, and OCR are $0.075 per call. Verify, download, and signing key endpoints are always free — offline verification requires no API key.

Why agents need document memory — not just RAG

Retrieval-augmented generation answers questions from chunks, but it cannot prove what was captured, when, or whether artifacts were altered. High-stakes workflows — legal review, compliance, finance, audit — require a paper trail: who captured the document, which passages support each claim, and whether the bundle still matches its signed manifest.

Verifiable document memory closes that gap. Each capture produces a bundle_id (ev_…) bound to an owner wallet or API key. Artifacts (Markdown, screenshot, PDF, OCR) are content-addressed. Agents can cite chunk_id values and verify citations with Merkle proofs. Legal hold and retention policies govern lifecycle; optional Base L2 notarization anchors existence at a point in time.

Trust chain (offline-verifiable)

DocImprint's trust chain is designed for third-party verification without trusting DocImprint servers at read time:

  1. Manifest SHA-256 — canonical hash of manifest.json stored in R2
  2. EIP-191 signature — DocImprint signs manifest_sha256 with a published secp256k1 key
  3. Per-artifact SHA-256 — each file in the bundle is content-addressed
  4. Merkle root — binary Merkle tree over document chunks (merkle_version: 2)
  5. Citation proofs — POST /v1/extract/:id/verify-citation returns proof[] for offline checks
  6. On-chain anchor — Base calldata or EAS attestation when configured

Deep verify is free: GET /v1/extract/:id/verify returns 200 when the bundle is valid and 409 when tampering is detected.
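Steps 4 and 5 of the trust chain can be illustrated with a minimal Python sketch of a binary Merkle tree and an offline proof check. This is an illustration under stated assumptions, not DocImprint's actual merkle_version 2 encoding: the leaf hashing (plain SHA-256 of the chunk bytes), the duplicate-last-node rule for odd levels, and the proof item shape (`hash`, `side`) are all hypothetical. One reading of the roughly 320-byte figure is ten 32-byte sibling hashes, which is the proof length for a tree of 1,024 chunks.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(chunks):
    """Return every tree level, leaves first and root last."""
    level = [sha256(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: an odd level is padded by duplicating its last node.
            level = level + [level[-1]]
            levels[-1] = level
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Collect the sibling hash at each level on the path from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        side = "left" if sibling < index else "right"
        proof.append({"hash": level[sibling].hex(), "side": side})
        index //= 2
    return proof


def verify_proof(chunk: bytes, proof, root_hex: str) -> bool:
    """Recompute the root from a chunk and its proof, with no server involved."""
    node = sha256(chunk)
    for step in proof:
        sibling = bytes.fromhex(step["hash"])
        node = sha256(sibling + node) if step["side"] == "left" else sha256(node + sibling)
    return node.hex() == root_hex


if __name__ == "__main__":
    chunks = [f"chunk {i}".encode() for i in range(1024)]
    levels = build_levels(chunks)
    root = levels[-1][0].hex()
    proof = make_proof(levels, 700)
    print(len(proof), "siblings,", len(proof) * 32, "bytes")
    print(verify_proof(chunks[700], proof, root))
    print(verify_proof(b"tampered chunk", proof, root))
```

A verifier needs only the cited chunk, the proof, and the Merkle root from the signed manifest, which is why the check can run without contacting the service.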

Document memory lifecycle

A typical agent workflow maps to API routes:

Capture — POST /v1/extract with store=true queues a full evidence bundle (202 + job_id by default). Use ?sync=true for an immediate bundle_id on small pages. Poll GET /v1/jobs/:id until complete.

Analyze — Use claim-check, qa, summarize, or focused endpoints (/v1/summarize, /v1/qa) for lean responses without storage, or re-run extract modes against the same source.

Corpus — Add bundles to a collection; index asynchronously; search with GET /v1/collections/:id/search and ask with POST /v1/collections/:id/ask.

Retain — Set retention (90d or ISO date), legal hold (PUT /v1/extract/:id/hold), version chains (parent_bundle_id), and agent handoff logs.

Prove — Verify bundle integrity, verify citations, download ZIP artifacts, notarize on Base, or share bundle_id with auditors.
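The asynchronous Capture step above (202 with a job_id, then polling GET /v1/jobs/:id) might be wrapped as follows. This is a sketch only: the `status` values (`complete`, `failed`) and the `bundle_id` and `error` response fields are assumptions, and `fetch_job` is a hypothetical callable that performs the HTTP request and returns parsed JSON, so the loop can be exercised without a network.

```python
import time
from typing import Callable


def wait_for_bundle(fetch_job: Callable[[str], dict], job_id: str,
                    interval: float = 2.0, timeout: float = 120.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll a capture job until it finishes and return its bundle_id.

    fetch_job(job_id) is expected to call GET /v1/jobs/:id and return the
    decoded JSON body. Field names and status values here are assumed.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "complete":
            return job["bundle_id"]
        if status == "failed":
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)
```

Passing the fetch and sleep functions in keeps the polling logic independent of any particular HTTP client and makes the timeout behavior easy to test.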

When to use DocImprint vs generic scrapers

Use DocImprint when documents must be defensible: contracts, financial filings, research papers, invoices, regulatory forms, insurance policies, and deposition materials.

Do not use DocImprint as a bulk open-web crawler. URLs are an ingestion path for hosted PDFs and pages — the product category is high-trust document intelligence, not Firecrawl-style scraping at scale.

Authentication and ownership

x402 customers pay per call with USDC on Base; resources bind to the payment payer address. Send X-Wallet-Address on owner-scoped GET/PUT/DELETE routes. Subscription customers use Authorization: Bearer dr_live_… — no wallet header required.

Missing wallet identity on owner-scoped routes returns 401 WALLET_REQUIRED.
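The two identity paths described above could be selected client-side with a small helper like the one below. It is a sketch under assumptions: the function name and its parameters are hypothetical, the API key and wallet values are placeholders, and the X-Payment header used on paid calls is out of scope here.

```python
from typing import Optional


def auth_headers(owner_scoped: bool, api_key: Optional[str] = None,
                 wallet: Optional[str] = None) -> dict:
    """Pick identity headers for a request.

    Subscription customers send a Bearer key and need no wallet header.
    x402 customers send X-Wallet-Address on owner-scoped routes.
    """
    if api_key:
        return {"Authorization": f"Bearer {api_key}"}
    if owner_scoped:
        if not wallet:
            # Fail early instead of waiting for the server's 401 WALLET_REQUIRED.
            raise ValueError("owner-scoped route requires a wallet address or API key")
        return {"X-Wallet-Address": wallet}
    return {}
```

Raising locally when no identity is available mirrors the 401 WALLET_REQUIRED response and saves a round trip.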

Architecture: from bytes to verifiable memory

A document enters DocImprint as a URL, raw file upload, or base64 payload. The pipeline has four layers:

Ingestion — Browser Rendering captures the page or file as it appears to a real browser. For PDFs, each page is converted to an image for visual OCR, or Textract is used for structured text extraction.

Extraction — The AI layer (Claude claude-sonnet-4-6 or Claude Haiku) runs the requested mode: extract fields, summarize, run Q&A, check claims, or translate. For large documents, a chunked pipeline splits the text, processes each chunk, and re-summarizes the pieces.

Evidence bundle — All artifacts (Markdown, screenshot, PDF, OCR text, structured fields) are stored in Cloudflare R2. A manifest.json is generated listing each artifact's SHA-256 hash, the capture URL, mode, and timestamps. DocImprint signs the manifest SHA-256 with its secp256k1 key.

Memory layer — The bundle is indexed into Cloudflare Vectorize (embeddings). Collections aggregate bundles into matter corpora for semantic search and cross-document Q&A with citation proofs.
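The evidence bundle layer can be pictured with a short Python sketch that content-addresses each artifact and then hashes the manifest. The manifest field names (`source_url`, `mode`, `captured_at`, `artifacts`, `path`, `sha256`) and the canonical form (UTF-8 JSON with sorted keys and no extra whitespace) are assumptions for illustration; the actual manifest.json layout and canonicalization rule are not specified here.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(source_url: str, mode: str, captured_at: str, artifacts: dict) -> dict:
    """Build a manifest from a mapping of file name to file bytes."""
    return {
        "source_url": source_url,
        "mode": mode,
        "captured_at": captured_at,
        "artifacts": [
            {"path": name, "sha256": sha256_hex(data)}
            for name, data in sorted(artifacts.items())
        ],
    }


def manifest_sha256(manifest: dict) -> str:
    # Assumed canonical form: sorted keys, compact separators, UTF-8 bytes.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))
```

Because the manifest digest covers every artifact digest, changing a single byte in any artifact changes the value that the platform key signs.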

What's stored and where

DocImprint uses Cloudflare-native storage primitives:

R2 — Immutable artifact files: Markdown text, full-page screenshot PNG, raw PDF bytes, OCR output, structured JSON fields, and manifest.json. URLs are content-addressed by bundle_id.

D1 (SQLite) — Bundle metadata: bundle_id, owner identity (wallet address or API key hash), mode, source URL, captured_at, retention policy, legal hold flag, manifest SHA-256, EAS attestation UID if notarized, parent_bundle_id for version chains.

Vectorize — Chunk embeddings for semantic search within collections. Each chunk stores bundle_id, chunk_id, and Merkle leaf index to enable citation verification after retrieval.

KV — Rate limit counters, API key quota, nonce cache for x402 replay prevention.

Nothing is stored outside Cloudflare infrastructure. No third-party analytics or telemetry services receive document content.
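The nonce cache listed under KV exists to reject replayed x402 payments. The idea can be sketched in Python with an in-memory stand-in; the class name, the time-to-live value, and the accept/reject interface are hypothetical and say nothing about how the real KV keys are structured.

```python
import time


class NonceCache:
    """In-memory stand-in for a nonce store with expiry."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires = {}  # nonce -> expiry time

    def accept(self, nonce: str) -> bool:
        """Return True the first time a nonce is seen, False on a replay."""
        now = self._clock()
        # Drop entries whose time-to-live has passed.
        self._expires = {n: t for n, t in self._expires.items() if t > now}
        if nonce in self._expires:
            return False
        self._expires[nonce] = now + self._ttl
        return True
```

A payment whose nonce is rejected would be refused before any extraction work starts, so the same signed transfer cannot pay for two calls within the retention window.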

Offline verification workflow

Verify a bundle without trusting DocImprint servers:

```bash
# 1. Download the bundle ZIP
curl -fsS -o bundle_ev_abc123.zip \
  https://api.docimprint.com/v1/extract/ev_abc123/download

# 2. Unzip and compute the manifest hash locally
unzip bundle_ev_abc123.zip -d bundle/
sha256sum bundle/manifest.json
# Compare to manifest_sha256 in the JSON response

# 3. Verify signature.json against GET /v1/keys (active or retired signer)
# EIP-191 over manifest_sha256; complete bundles require platform signature for valid: true

# 4. Check that every artifact hash in the manifest matches a file in the bundle
find bundle -type f ! -name manifest.json -exec sha256sum {} + > actual.sha256
jq -r '.artifacts[].sha256' bundle/manifest.json | while read -r hash; do
  # Match by digest so no assumption is made about the manifest's file-name field
  if grep -q "^$hash " actual.sha256; then
    echo "OK       $hash"
  else
    echo "MISMATCH $hash"
  fi
done
```

The bundle is valid if the manifest SHA-256 matches, signature.json verifies against a key from GET /v1/keys, and each artifact hash matches file bytes. Handoffs in provenance are application-layer audit records, not signed custody transfers.
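The three validity conditions can also be folded into one programmatic verdict. The Python sketch below assumes the manifest has an `artifacts` list with a `sha256` field, as the jq filter in the walkthrough implies, and it matches artifacts by digest rather than by file name. Signature checking is left to the caller because EIP-191 recovery needs a secp256k1 library outside the standard library; the function name and result keys are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def check_bundle(bundle_dir: str, expected_manifest_sha256: str,
                 signature_verified: bool) -> dict:
    """Combine the manifest, signature, and artifact checks into one result.

    signature_verified is supplied by the caller after checking
    signature.json against a key from GET /v1/keys.
    """
    root = Path(bundle_dir)
    manifest_bytes = (root / "manifest.json").read_bytes()
    manifest_ok = hashlib.sha256(manifest_bytes).hexdigest() == expected_manifest_sha256

    # Digest every file in the bundle, then look up each manifest entry.
    on_disk = {
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }
    artifacts = json.loads(manifest_bytes)["artifacts"]  # assumed layout
    unmatched = [a["sha256"] for a in artifacts if a["sha256"] not in on_disk]

    return {
        "manifest_ok": manifest_ok,
        "signature_ok": signature_verified,
        "unmatched_artifacts": unmatched,
        "valid": manifest_ok and signature_verified and not unmatched,
    }
```

Returning the individual flags alongside the overall verdict lets an auditor see which link of the trust chain failed.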

Document memory vs RAG

Standard RAG pipelines embed document chunks and retrieve them at query time. This answers "what does this document say?" but cannot answer "was this document captured unmodified?", "who created this chunk?", or "can I prove this passage to a third party?"

Verifiable document memory adds the provenance layer RAG omits:

| Capability | RAG | Verifiable document memory |
| --- | --- | --- |
| Retrieval | Yes | Yes |
| Citation source | Approximate (cosine match) | Exact (chunk_id + Merkle proof) |
| Tamper detection | No | Yes (manifest SHA-256 + signature) |
| Capture timestamp | No | Yes (captured_at + on-chain anchor) |
| Chain of custody | No | Yes (agent provenance logs) |
| Legal hold | No | Yes (PUT /v1/extract/:id/hold) |

For most question-answering use cases, RAG is sufficient. When answers must be defensible — to regulators, auditors, courts, or counterparties — use verifiable document memory.

Example

```bash
curl -X POST https://api.docimprint.com/v1/extract \
  -H "Content-Type: application/json" \
  -H "X-Payment: <eip712-signed-usdc-transfer>" \
  -d '{"source":"https://example.com/contract.pdf","include":["markdown","summary","screenshot"]}'
```
