Aly Sawft · Founder & Engineer, Sawftware LLC · 11 min read
Retrieval-augmented generation (RAG) solves a real problem: language models have limited context windows and cannot hold entire document corpora in memory. RAG chunks documents, embeds them in a vector store, retrieves relevant chunks at query time, and injects them into the prompt. It works well for answering questions about large document sets.
But RAG makes no promises about the provenance or integrity of what it retrieves. When your agent answers "what does clause 12.3 say?", it is drawing from a vector database that:
For consumer applications — chatbots, knowledge bases, internal Q&A tools — these gaps are acceptable. For legal review, financial due diligence, compliance audits, and regulated AI applications, they are not. The answer must be defensible: traceable to a source, timestamped, and verifiable by a third party.
Verifiable document memory adds the provability layer that RAG lacks.
DocImprint's architecture has four layers, each building on the previous:
Layer 1 — Ingestion: document bytes arrive as a URL, a direct PDF upload, or an image. The source is fetched or received, rendered if necessary (PDFs to images for visual fidelity), and stored as an original artifact.
Layer 2 — Extraction: the document is processed through OCR (AWS Textract for tables, Cloudflare Vision as fallback), optionally passed to a language model for summarization, Q&A, claim-check, or structured extraction, and the outputs are stored as content-addressed artifacts (markdown, screenshot, OCR text).
Layer 3 — Evidence: a manifest.json is constructed from all artifact hashes plus capture metadata. It is signed with DocImprint's secp256k1 key. Optionally, the manifest_sha256 is written to Base L2 for timestamped on-chain evidence. The bundle is now tamper-evident.
Layer 4 — Memory: bundles are indexed into Vectorize with chunk embeddings, linked to collections, and made queryable via semantic search and cross-document ask. The agent can now retrieve, cite, and reason across a corpus — and every answer is traceable to a specific bundle, artifact, and text chunk with a Merkle proof.
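The Evidence layer's core step, deriving one digest from all artifact hashes plus capture metadata, can be sketched as follows. This is a minimal illustration and not DocImprint's actual manifest schema: the field names `artifacts` and `capture`, the file names, and the canonical JSON serialization are all assumptions. Signing the digest with a secp256k1 key is left out because it needs a third-party library.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict, metadata: dict) -> tuple:
    """Build a manifest from artifact hashes and return (manifest, digest).

    Field names are hypothetical, chosen only for this sketch.
    """
    manifest = {
        "artifacts": {name: sha256_hex(data) for name, data in sorted(artifacts.items())},
        "capture": metadata,
    }
    # Serialize deterministically (sorted keys, fixed separators) so the
    # same manifest content always produces the same digest.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return manifest, sha256_hex(canonical)


capture = {"captured_at": "2026-05-15T00:00:00Z"}
artifacts = {"extract.md": b"# Clause 12.3\n...", "ocr.txt": b"Clause 12.3 ..."}
manifest, manifest_sha256 = build_manifest(artifacts, capture)

# Changing a single artifact byte changes the manifest digest.
tampered = {**artifacts, "ocr.txt": b"Clause 12.3 (edited)"}
_, tampered_sha256 = build_manifest(tampered, capture)
print(manifest_sha256 != tampered_sha256)
```

Because the digest covers every artifact hash, a signature or on-chain record over that one value commits to the whole bundle.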
The storage topology maps to Cloudflare's managed primitives:
R2 (object storage): original artifacts (PDF, image), extracted artifacts (markdown, OCR, screenshot), manifest.json. Addressed by bundle_id. Artifacts are immutable once written — a new extraction creates a new bundle.
D1 (SQLite database): bundle metadata index (bundle_id, owner_wallet, captured_at, mode, status, retention, legal_hold, eas_uid, tx_hash), collection membership, job queue state, agent provenance logs, handoff records.
Vectorize (vector index): chunk embeddings for semantic search. Each chunk maps to a bundle_id and chunk_id, enabling citation traceback from a search result to a specific artifact position.
KV (key-value): API key hashes, rate limit counters, nonce deduplication for x402 payments.
Nothing sensitive is stored in plaintext: API keys are hashed (SHA-256) in D1. x402 nonces are deduplicated in KV with TTL to prevent replay attacks. Wallet addresses are stored as checksummed hex strings.
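The two protections named above, hashed API keys and TTL-based nonce deduplication, might look roughly like this. It is a sketch under assumptions: `NonceStore` is an in-memory stand-in for a KV namespace, and the class, method names, and TTL value are hypothetical rather than taken from DocImprint's code.

```python
import hashlib
import hmac
import time


def hash_api_key(api_key: str) -> str:
    """Store only this digest, never the key itself."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def key_matches(presented: str, stored_hash: str) -> bool:
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(hash_api_key(presented), stored_hash)


class NonceStore:
    """In-memory stand-in for a key-value store with per-key TTL."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expiry = {}

    def accept(self, nonce: str) -> bool:
        """Return True the first time a nonce is seen inside its TTL window."""
        now = self.clock()
        expires = self._expiry.get(nonce)
        if expires is not None and expires > now:
            return False  # replayed nonce
        self._expiry[nonce] = now + self.ttl
        return True


# Usage with a controllable clock so the TTL behaviour is visible.
fake_now = [0.0]
store = NonceStore(ttl_seconds=300, clock=lambda: fake_now[0])
first = store.accept("nonce-1")
replay = store.accept("nonce-1")
fake_now[0] = 301.0
after_expiry = store.accept("nonce-1")
```

A nonce is accepted again once its TTL lapses, so the TTL should be at least as long as the window in which the payment itself remains valid.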
Verifiable document memory requires answering the question: "who vouches for this?"
The trust chain has three nodes:
DocImprint signs the manifest. DocImprint's secp256k1 key (published at /.well-known/docimprint-keys.json) signs every manifest_sha256. This binds the capture to DocImprint's identity. Anyone can verify the signature without calling any DocImprint endpoint.
The blockchain timestamps the existence. When notarized, Base L2 records the manifest_sha256 as calldata or an EAS attestation. The block timestamp is immutable and publicly verifiable. This proves the bundle existed at a specific moment in calendar time — not just "DocImprint says so."
The client controls the key. The caller's wallet address (for x402) or API key identity is recorded in the manifest. The client owns the bundle — they can legal-hold it, version it, delete it, and add it to their own collections. DocImprint is a service provider, not a custodian with unilateral power over the evidence.
Together, these three nodes mean that a third party can verify a bundle without trusting DocImprint, without network access to DocImprint, and without DocImprint's cooperation. All it takes is a bundle ZIP, a copy of DocImprint's published public key, a secp256k1 library, and a Base RPC node.
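The hash-checking part of that offline verification can be sketched with the standard library alone. The ZIP layout and the manifest shape (`{"artifacts": {name: sha256}}`) are assumptions made for this example. Verifying the secp256k1 signature and looking up the on-chain record are deliberately omitted, since both need third-party libraries.

```python
import hashlib
import io
import json
import zipfile


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_bundle_hashes(zip_bytes: bytes, expected_manifest_sha256: str) -> list:
    """Return a list of problems; an empty list means every hash checked out."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as bundle:
        manifest_bytes = bundle.read("manifest.json")
        if sha256_hex(manifest_bytes) != expected_manifest_sha256:
            problems.append("manifest.json does not match the expected digest")
        # Hypothetical layout: {"artifacts": {"<file name>": "<sha256 hex>"}}
        manifest = json.loads(manifest_bytes)
        names = set(bundle.namelist())
        for name, expected in manifest["artifacts"].items():
            if name not in names:
                problems.append(f"{name}: missing from bundle")
            elif sha256_hex(bundle.read(name)) != expected:
                problems.append(f"{name}: hash mismatch")
    return problems


def make_demo_bundle(described: dict, stored: dict) -> tuple:
    """Build an in-memory ZIP whose manifest describes `described` but holds `stored`."""
    manifest_bytes = json.dumps(
        {"artifacts": {n: sha256_hex(d) for n, d in described.items()}}, sort_keys=True
    ).encode("utf-8")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        bundle.writestr("manifest.json", manifest_bytes)
        for name, data in stored.items():
            bundle.writestr(name, data)
    return buffer.getvalue(), sha256_hex(manifest_bytes)


original = {"extract.md": b"# Clause 12.3\n", "ocr.txt": b"Clause 12.3"}
good_zip, digest = make_demo_bundle(original, original)
bad_zip, _ = make_demo_bundle(original, {**original, "ocr.txt": b"Clause 12.3 (edited)"})

good_report = verify_bundle_hashes(good_zip, digest)
bad_report = verify_bundle_hashes(bad_zip, digest)
```

In a full check, `expected_manifest_sha256` would be the value covered by the signature and, when notarized, the value recorded on Base.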
Embedding documents in Pinecone, Weaviate, or pgvector gives you semantic search. It does not give you provenance.
The differences matter in practice:
When your agent cites a clause from a contract, a generic RAG system gives you: "found in chunk 47 of document_abc." Verifiable document memory gives you: "found in chunk c3f2... of bundle ev_abc, captured on 2026-05-15, original document hash a3b2..., Merkle proof available, notarized at Base block 12345678."
When an auditor asks "what was the source for this finding?", a generic RAG system gives you a chunk of text. Verifiable document memory gives you a signed manifest, a downloadable artifact ZIP, an on-chain timestamp, and a Merkle proof for the specific paragraph.
When a document is updated and re-extracted, a generic vector store has no version history. Verifiable document memory links the new bundle to the old via parent_bundle_id, preserves both, and lets you diff the extraction outputs.
These are not theoretical improvements. They are the difference between "our AI said this" and "our AI said this, here is the proof, here is the source, and here is the chain of custody."
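The versioning behaviour described above, following `parent_bundle_id` links and diffing extraction outputs, can be illustrated with a small in-memory example. The `bundles` dictionary and its `markdown` field are hypothetical stand-ins for stored bundles, not DocImprint's data model.

```python
import difflib

# Hypothetical stored bundles: two captures of the same contract.
bundles = {
    "ev_v1": {
        "parent_bundle_id": None,
        "markdown": "Notice period: 30 days.\nGoverning law: Delaware.\n",
    },
    "ev_v2": {
        "parent_bundle_id": "ev_v1",
        "markdown": "Notice period: 60 days.\nGoverning law: Delaware.\n",
    },
}


def version_chain(bundle_id, store):
    """Walk parent_bundle_id links from the newest bundle back to the oldest."""
    chain, seen = [], set()
    while bundle_id is not None and bundle_id not in seen:  # guard against cycles
        chain.append(bundle_id)
        seen.add(bundle_id)
        bundle_id = store[bundle_id]["parent_bundle_id"]
    return chain


def diff_versions(old_id, new_id, store):
    """Return a unified diff of the extracted markdown of two bundles."""
    return list(
        difflib.unified_diff(
            store[old_id]["markdown"].splitlines(),
            store[new_id]["markdown"].splitlines(),
            fromfile=old_id,
            tofile=new_id,
            lineterm="",
        )
    )


chain = version_chain("ev_v2", bundles)
changes = diff_versions("ev_v1", "ev_v2", bundles)
print("\n".join(changes))
```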
Individual bundles answer questions about individual documents. Collections answer questions across a document set.
A collection is a named group of bundles — a matter, a client folder, a regulatory filing set, a contract repository. When a bundle is added to a collection and indexed, its text chunks are embedded and stored in Vectorize. The collection can then be:
The ask endpoint runs a four-step pipeline: it retrieves relevant chunks via vector similarity, injects them into a prompt with citation instructions, parses the LLM response to extract citations, and verifies each citation against the bundle's Merkle tree before returning.
This means every cross-document answer has a cryptographic receipt: the cited passages are verified members of the indexed bundles. An agent cannot hallucinate a citation that passes Merkle verification.
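To make the citation check concrete, here is a minimal Merkle inclusion proof over text chunks. It is a sketch, not DocImprint's actual tree construction: the SHA-256 hash, the leaf and node prefixes, and the rule of duplicating the last node on odd levels are all assumptions.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root_and_proof(chunks: list, index: int) -> tuple:
    """Return (root, proof) for chunks[index].

    Each proof step is (sibling_hash, sibling_is_on_the_right).
    """
    if not chunks:
        raise ValueError("at least one chunk is required")
    # Distinct prefixes keep leaf hashes and inner-node hashes apart.
    level = [_h(b"\x00" + chunk) for chunk in chunks]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        level = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify_citation(chunk: bytes, proof: list, root: bytes) -> bool:
    """Recompute the path from a cited chunk up to the root."""
    node = _h(b"\x00" + chunk)
    for sibling, sibling_on_right in proof:
        if sibling_on_right:
            node = _h(b"\x01" + node + sibling)
        else:
            node = _h(b"\x01" + sibling + node)
    return node == root


chunks = [b"12.1 Term.", b"12.2 Renewal.", b"12.3 Termination requires 60 days notice."]
root, proof = merkle_root_and_proof(chunks, 2)

genuine = verify_citation(chunks[2], proof, root)
fabricated = verify_citation(b"12.3 Termination requires 5 days notice.", proof, root)
```

A passage that the model invented or altered hashes to a different leaf, so its recomputed path does not reach the stored root.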
# Create a collection
curl -X POST https://api.docimprint.com/v1/collections \
-H "Authorization: Bearer dr_live_..." \
-H "Content-Type: application/json" \
-d '{"name":"supplier-contracts","description":"Q2 2026 supplier agreements"}'
# Add a bundle to the collection
curl -X POST https://api.docimprint.com/v1/collections/col_abc/bundles \
-H "Authorization: Bearer dr_live_..." \
-H "Content-Type: application/json" \
-d '{"bundle_id":"ev_contract1"}'
# Ask across the corpus
curl -X POST https://api.docimprint.com/v1/collections/col_abc/ask \
-H "Authorization: Bearer dr_live_..." \
-H "Content-Type: application/json" \
-d '{"question":"What termination notice period do our supplier contracts require?"}'
Synchronous extraction works for most documents. For PDFs above 10 pages, batches of documents, or extraction modes that require heavy processing (multi-document compare, structured extraction with complex schemas), the job queue is the right path.
POST /v1/extract with store=true returns 202 by default; POST /v1/jobs creates batch async jobs. The queue consumer runs up to 50 concurrent workers (max_concurrency). Progress and completion are delivered via webhook or polled via GET /v1/jobs/:id.
The job response embeds the full result — for bundles, it includes the bundle_id and manifest_sha256; for lean extractions, it includes the complete output. There is no third request to fetch a separate bundle. The two-step pattern (dispatch → poll) covers everything.
Webhooks fire on status changes: queued → processing → complete or failed. The webhook payload matches the polling response format, making it easy to handle both paths with the same code.
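Since the webhook payload and the polling response share one format, a single handler can serve both paths. The sketch below assumes the `status` and `result.bundle_id` fields shown in this article. The `error` field, the function names, and the retry settings are hypothetical, and `fetch` stands in for an HTTP call to GET /v1/jobs/:id.

```python
import time
from typing import Callable, Optional


def handle_job_update(payload: dict) -> Optional[str]:
    """Shared handler for webhook deliveries and polling responses.

    Returns the bundle_id once the job is complete, otherwise None.
    """
    status = payload["status"]
    if status == "complete":
        return payload["result"]["bundle_id"]
    if status == "failed":
        # "error" is an assumed field name, not confirmed by the API docs.
        raise RuntimeError(f"job failed: {payload.get('error', 'unknown error')}")
    return None  # still queued or processing


def poll_job(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until the job reaches a terminal state or attempts run out."""
    for _ in range(max_attempts):
        bundle_id = handle_job_update(fetch())
        if bundle_id is not None:
            return bundle_id
        sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")


# Usage with canned responses in place of real HTTP calls.
responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "complete", "result": {"bundle_id": "ev_demo", "manifest_sha256": "..."}},
])
bundle_id = poll_job(lambda: next(responses), sleep=lambda seconds: None)
```

A webhook receiver would pass each delivered payload to `handle_job_update` directly, with no polling loop.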
# Dispatch async extract (add ?async=true or use POST /v1/jobs)
curl -X POST "https://api.docimprint.com/v1/extract?async=true" \
-H "Authorization: Bearer dr_live_..." \
-H "Content-Type: application/json" \
-d '{"source":"https://example.com/large-report.pdf","include":["markdown","summary"]}'
# Response: { "status": "queued", "job_id": "job_xyz", "bundle_id": "ev_..." }
# Poll for completion
curl https://api.docimprint.com/v1/jobs/job_xyz \
-H "Authorization: Bearer dr_live_..."
# { "status": "complete", "result": { "bundle_id": "ev_...", "manifest_sha256": "...", ... } }
Verifiable document memory is the right choice when trust, provenance, and auditability matter. It is not the right choice for every use case.
Do not use DocImprint for:
The per-call cost ($0.018–$0.075) reflects the work being done: OCR, LLM processing, artifact storage, cryptographic signing. For use cases that need this depth, the cost is justified. For use cases that just need "give me the text of this page," a simpler scraper is more economical.
The architecture described here is optimized for high-trust, high-value document workflows: legal, financial, compliance, research, and regulated AI applications where the cost of an unverifiable AI output exceeds the cost of verifiable infrastructure.
Verifiable document memory is not only about static PDFs. Regulatory filings update, contract terms change, and policy pages get revised. The URL Monitor product closes this gap.
POST /v1/monitor registers a URL for periodic re-capture. When the rendered content hash changes, DocImprint creates a new evidence bundle linked to the prior version via parent_bundle_id. Webhooks notify your system of the change event with both bundle IDs.
This extends the four-layer architecture into a temporal dimension: you do not just know what a document said at capture time — you know when it changed and have independently verifiable bundles for both the before and after states.
For compliance teams monitoring supplier terms, SEC filings, or policy documents, Monitor plus evidence bundles replaces manual "check the website weekly" workflows with cryptographic change detection. Combined with collections, you can ask "which monitored URLs changed this month?" and receive answers with bundle-level citations.
Monitor complements batch extraction: use extract for one-time capture, Monitor for ongoing surveillance of documents that matter over months or years.
curl -X POST https://api.docimprint.com/v1/monitor \
-H "Authorization: Bearer dr_live_..." \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com/terms-of-service","interval_hours":24,"webhook_url":"https://your-app.com/hooks/doc-change"}'
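The change-detection rule behind Monitor, creating a new bundle only when the rendered content hash differs and linking it to the previous one, can be sketched as below. The `MonitoredUrl` class, the event fields, and the bundle IDs are hypothetical, and the real service's rendering and hashing details are not shown in this article.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MonitoredUrl:
    url: str
    last_hash: Optional[str] = None
    latest_bundle_id: Optional[str] = None
    history: list = field(default_factory=list)


def check_for_change(monitor: MonitoredUrl, rendered: bytes, new_bundle_id: str) -> Optional[dict]:
    """Return a change event if the rendered content differs from the last capture."""
    content_hash = hashlib.sha256(rendered).hexdigest()
    if content_hash == monitor.last_hash:
        return None  # unchanged: no new bundle, no webhook
    event = {
        "url": monitor.url,
        "bundle_id": new_bundle_id,
        "parent_bundle_id": monitor.latest_bundle_id,  # None on first capture
        "content_sha256": content_hash,
    }
    monitor.last_hash = content_hash
    monitor.latest_bundle_id = new_bundle_id
    monitor.history.append(event)
    return event


terms = MonitoredUrl("https://example.com/terms-of-service")
first = check_for_change(terms, b"Notice period: 30 days.", "ev_1")
unchanged = check_for_change(terms, b"Notice period: 30 days.", "ev_2")
changed = check_for_change(terms, b"Notice period: 60 days.", "ev_3")
```

Each event carries both bundle IDs, which is what lets a webhook consumer fetch the before and after states for comparison.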