Aly Sawft · Founder & Engineer, Sawftware LLC · June 17, 2026 · 11 min read

Verifiable document memory for AI agents: architecture and tradeoffs

Why RAG alone is not enough

Retrieval-augmented generation (RAG) solves a real problem: language models have limited context windows and cannot hold entire document corpora in memory. RAG chunks documents, embeds them in a vector store, retrieves relevant chunks at query time, and injects them into the prompt. It works well for answering questions about large document sets.

But RAG makes no promises about the provenance or integrity of what it retrieves. When your agent answers "what does clause 12.3 say?", it is drawing from a vector database that:

May have chunked the document incorrectly (tables flatten to garbled text, page boundaries break mid-sentence)
May be out of date (the source document was updated, but the embedding was not re-indexed)
Cannot prove which version of the document was captured
Cannot prove the chunk was not modified after indexing
Cannot tell an auditor "this answer came from this specific page of this specific document on this specific date"

For low-stakes applications such as chatbots, knowledge bases, and internal Q&A tools, these gaps are acceptable. For legal review, financial due diligence, compliance audits, and regulated AI applications, they are not. The answer must be defensible: traceable to a source, timestamped, and verifiable by a third party.

Verifiable document memory adds the provability layer that RAG lacks.

The four-layer architecture

DocImprint's architecture has four layers, each building on the previous:

Layer 1 — Ingestion: document bytes arrive as a URL, a direct PDF upload, or an image. The source is fetched or received, rendered if necessary (PDFs to images for visual fidelity), and stored as an original artifact.

Layer 2 — Extraction: the document is processed through OCR (AWS Textract for tables, Cloudflare Vision as fallback), optionally passed to a language model for summarization, Q&A, claim-check, or structured extraction, and the outputs are stored as content-addressed artifacts (markdown, screenshot, OCR text).

Layer 3 — Evidence: a manifest.json is constructed from all artifact hashes plus capture metadata. It is signed with DocImprint's secp256k1 key. Optionally, the manifest_sha256 is written to Base L2 for timestamped on-chain evidence. The bundle is now tamper-evident.

Layer 4 — Memory: bundles are indexed into Vectorize with chunk embeddings, linked to collections, and made queryable via semantic search and cross-document ask. The agent can now retrieve, cite, and reason across a corpus — and every answer is traceable to a specific bundle, artifact, and text chunk with a Merkle proof.
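
The Layer 2 and Layer 3 steps (content-addressed artifacts feeding a hashed manifest) can be sketched in a few lines of Python. This is a minimal illustration, not DocImprint's implementation: the manifest field names, the canonical JSON encoding, and the artifact file names are assumptions, and the secp256k1 signature over manifest_sha256 is left out because it needs a third-party library.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Content address of an artifact: the hex SHA-256 of its bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(bundle_id: str, captured_at: str, artifacts: dict[str, bytes]) -> dict:
    # Field names here are hypothetical; the real manifest.json schema may differ.
    return {
        "bundle_id": bundle_id,
        "captured_at": captured_at,
        "artifacts": {name: sha256_hex(body) for name, body in sorted(artifacts.items())},
    }


def manifest_sha256(manifest: dict) -> str:
    # Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha256_hex(canonical)


artifacts = {"extract.md": b"# Clause 12.3\n", "ocr.txt": b"Clause 12.3"}
manifest = build_manifest("ev_abc", "2026-05-15T00:00:00Z", artifacts)
digest = manifest_sha256(manifest)
print(digest)
```

Because the manifest contains only hashes and metadata, changing a single byte in any artifact changes manifest_sha256, which is the value that gets signed and optionally notarized.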

What gets stored and where

The storage topology maps to Cloudflare's managed primitives:

R2 (object storage): original artifacts (PDF, image), extracted artifacts (markdown, OCR, screenshot), manifest.json. Addressed by bundle_id. Artifacts are immutable once written — a new extraction creates a new bundle.

D1 (SQLite database): bundle metadata index (bundle_id, owner_wallet, captured_at, mode, status, retention, legal_hold, eas_uid, tx_hash), collection membership, job queue state, agent provenance logs, handoff records.

Vectorize (vector index): chunk embeddings for semantic search. Each chunk maps to a bundle_id and chunk_id, enabling citation traceback from a search result to a specific artifact position.

KV (key-value): API key hashes, rate limit counters, nonce deduplication for x402 payments.
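
The citation traceback that the Vectorize mapping enables can be pictured as a small record attached to each embedding. The sketch below is hypothetical: only bundle_id and chunk_id come from the description above, while the artifact name and character offsets are assumed fields used to show how a search hit could resolve to a position in an artifact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    bundle_id: str
    chunk_id: str
    artifact: str  # assumed: which stored artifact the chunk was cut from
    start: int     # assumed: character offset where the chunk begins
    end: int       # assumed: character offset where the chunk ends


def cite(ref: ChunkRef, artifact_text: str) -> dict:
    """Resolve a search hit back to the exact text span it was embedded from."""
    return {
        "bundle_id": ref.bundle_id,
        "chunk_id": ref.chunk_id,
        "quote": artifact_text[ref.start:ref.end],
    }


text = "12.2 Fees are due in 30 days. 12.3 Either party may terminate on notice."
start = text.index("12.3")
ref = ChunkRef("ev_abc", "c3f2", "extract.md", start, len(text))
citation = cite(ref, text)
print(citation)
```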

Nothing sensitive is stored in plaintext: API keys are stored only as SHA-256 hashes, never as raw keys. x402 nonces are deduplicated in KV with a TTL to prevent replay attacks. Wallet addresses are stored as checksummed hex strings.
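
Both protections are simple to illustrate. The sketch below uses an in-memory dictionary as a stand-in for KV; the NonceStore class, its method names, and the TTL value are assumptions made for the example rather than DocImprint's code.

```python
import hashlib
import hmac
import time
from typing import Optional


def hash_api_key(api_key: str) -> str:
    """Only this digest is stored; the raw key is never persisted."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def key_matches(presented: str, stored_hash: str) -> bool:
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(hash_api_key(presented), stored_hash)


class NonceStore:
    """In-memory stand-in for a KV namespace with per-key expiry."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._expires_at: dict[str, float] = {}

    def accept(self, nonce: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Drop expired entries, mimicking what a KV TTL does automatically.
        self._expires_at = {n: t for n, t in self._expires_at.items() if t > now}
        if nonce in self._expires_at:
            return False  # replay: this nonce was already used within the TTL
        self._expires_at[nonce] = now + self.ttl
        return True


stored = hash_api_key("example-key")
nonces = NonceStore(ttl_seconds=300)
print(key_matches("example-key", stored), nonces.accept("n-1", now=0.0))
```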

The trust chain (who signs what)

Verifiable document memory requires answering the question: "who vouches for this?"

The trust chain has three nodes:

DocImprint signs the manifest. DocImprint's secp256k1 key (published at /.well-known/docimprint-keys.json) signs every manifest_sha256. This binds the capture to DocImprint's identity. Anyone can verify the signature without calling any DocImprint endpoint.
The blockchain timestamps the existence. When notarized, Base L2 records the manifest_sha256 as calldata or an EAS attestation. The block timestamp is immutable and publicly verifiable. This proves the bundle existed at a specific moment in calendar time — not just "DocImprint says so."
The client controls the key. The caller's wallet address (for x402) or API key identity is recorded in the manifest. The client owns the bundle — they can legal-hold it, version it, delete it, and add it to their own collections. DocImprint is a service provider, not a custodian with unilateral power over the evidence.

Together, these three nodes mean a third party can verify a bundle without relying on DocImprint's word, without network access to DocImprint at verification time, and without DocImprint's cooperation. All that is needed is the bundle ZIP, a copy of DocImprint's published public key, a secp256k1 library, and a Base RPC node.
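
The first step of that offline check, recomputing artifact hashes and comparing them to the manifest, needs nothing beyond the standard library. The sketch below assumes a bundle ZIP that contains manifest.json with an "artifacts" map of file name to SHA-256; that layout is hypothetical. Verifying the secp256k1 signature and looking up the on-chain record are the remaining steps and are not shown.

```python
import hashlib
import io
import json
import zipfile


def verify_bundle_zip(zip_bytes: bytes) -> list[str]:
    """Return the names of artifacts that are missing or do not match the manifest."""
    mismatched = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        for name, expected in manifest["artifacts"].items():
            try:
                actual = hashlib.sha256(zf.read(name)).hexdigest()
            except KeyError:  # artifact listed in the manifest but absent from the ZIP
                mismatched.append(name)
                continue
            if actual != expected:
                mismatched.append(name)
    return mismatched


def make_zip(files: dict[str, bytes]) -> bytes:
    """Helper for the demo: build a ZIP in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, body in files.items():
            zf.writestr(name, body)
    return buf.getvalue()


markdown = b"# Clause 12.3\n"
manifest_bytes = json.dumps(
    {"artifacts": {"extract.md": hashlib.sha256(markdown).hexdigest()}}
).encode("utf-8")
good = make_zip({"manifest.json": manifest_bytes, "extract.md": markdown})
bad = make_zip({"manifest.json": manifest_bytes, "extract.md": b"# Clause 12.4\n"})
print(verify_bundle_zip(good), verify_bundle_zip(bad))
```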

Document memory vs generic vector stores

Embedding documents in Pinecone, Weaviate, or pgvector gives you semantic search. It does not give you provenance.

The differences matter in practice:

When your agent cites a clause from a contract, a generic RAG system gives you: "found in chunk 47 of document_abc." Verifiable document memory gives you: "found in chunk c3f2... of bundle ev_abc, captured on 2026-05-15, original document hash a3b2..., Merkle proof available, notarized at Base block 12345678."

When an auditor asks "what was the source for this finding?", a generic RAG system gives you a chunk of text. Verifiable document memory gives you a signed manifest, a downloadable artifact ZIP, an on-chain timestamp, and a Merkle proof for the specific paragraph.

When a document is updated and re-extracted, a generic vector store has no version history. Verifiable document memory links the new bundle to the old via parent_bundle_id, preserves both, and lets you diff the extraction outputs.
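
The version link is easy to picture as a linked list of bundles. The sketch below is illustrative only: the in-memory bundles dictionary and the "markdown" field are assumptions, and parent_bundle_id is the only name taken from the description above. It walks the chain from the newest bundle and diffs two extraction outputs with the standard difflib module.

```python
import difflib


def version_chain(bundles: dict[str, dict], bundle_id: str) -> list[str]:
    """Walk parent_bundle_id links from the newest bundle back to the first capture."""
    chain, seen = [], set()
    current = bundle_id
    while current is not None and current not in seen:  # seen guards against cycles
        seen.add(current)
        chain.append(current)
        current = bundles[current].get("parent_bundle_id")
    return chain


def extraction_diff(bundles: dict[str, dict], old_id: str, new_id: str) -> list[str]:
    """Unified diff of the extracted markdown of two bundles."""
    return list(
        difflib.unified_diff(
            bundles[old_id]["markdown"].splitlines(),
            bundles[new_id]["markdown"].splitlines(),
            fromfile=old_id,
            tofile=new_id,
            lineterm="",
        )
    )


bundles = {
    "ev_v1": {"parent_bundle_id": None, "markdown": "Termination\nNotice period: 30 days"},
    "ev_v2": {"parent_bundle_id": "ev_v1", "markdown": "Termination\nNotice period: 60 days"},
}
chain = version_chain(bundles, "ev_v2")
diff = extraction_diff(bundles, "ev_v1", "ev_v2")
print(chain)
print("\n".join(diff))
```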

These are not theoretical improvements. They are the difference between "our AI said this" and "our AI said this, here is the proof, here is the source, and here is the chain of custody."

Collections: the corpus layer

Individual bundles answer questions about individual documents. Collections answer questions across a document set.

A collection is a named group of bundles — a matter, a client folder, a regulatory filing set, a contract repository. When a bundle is added to a collection and indexed, its text chunks are embedded and stored in Vectorize. The collection can then be:

Searched semantically: "find all contract clauses related to indemnification"
Asked across the corpus: "what is the aggregate exposure to force majeure clauses across all supplier contracts?"
Cited: every answer includes chunk_ids and bundle_ids, with Merkle proof available for each cited chunk

The ask endpoint runs a four-step pipeline: retrieve relevant chunks via vector similarity, inject them into a prompt with citation instructions, parse the LLM response to extract citations, and verify each citation against the bundle's Merkle tree before returning.

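This means every cross-document answer has a cryptographic receipt: the cited passages are verified members of the indexed bundles. A citation to text that does not exist in a bundle cannot pass Merkle verification. The check covers the existence of the quoted passage, not the correctness of the model's interpretation of it.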
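
The membership check behind a verified citation can be sketched as a standard Merkle inclusion proof. The tree construction below (SHA-256, 0x00 and 0x01 prefixes for leaves and interior nodes, last node duplicated on odd levels) is an assumption chosen for illustration; DocImprint's actual tree layout is not described here.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(chunk: str) -> bytes:
    return _h(b"\x00" + chunk.encode("utf-8"))


def _parent(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def _padded(level: list[bytes]) -> list[bytes]:
    # Odd-sized levels repeat their last node so every node has a sibling.
    return level + [level[-1]] if len(level) % 2 else level


def build_levels(chunks: list[str]) -> list[list[bytes]]:
    if not chunks:
        raise ValueError("cannot build a Merkle tree over zero chunks")
    level = [leaf_hash(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        padded = _padded(level)
        level = [_parent(padded[i], padded[i + 1]) for i in range(0, len(padded), 2)]
        levels.append(level)
    return levels


def make_proof(levels: list[list[bytes]], index: int) -> list[tuple[str, bytes]]:
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1  # the neighbor within the same pair
        side = "left" if sibling < index else "right"
        proof.append((side, _padded(level)[sibling]))
        index //= 2
    return proof


def verify(chunk: str, proof: list[tuple[str, bytes]], root: bytes) -> bool:
    node = leaf_hash(chunk)
    for side, sibling in proof:
        node = _parent(sibling, node) if side == "left" else _parent(node, sibling)
    return node == root


chunks = ["12.1 Term", "12.2 Fees", "12.3 Termination on 30 days notice"]
levels = build_levels(chunks)
root = levels[-1][0]
proof = make_proof(levels, 2)
print(verify(chunks[2], proof, root))
```

A passage the model invented hashes to a different leaf, so the same proof no longer reproduces the root.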

bash: Add a bundle to a collection and ask across the corpus

# Create a collection
curl -X POST https://api.docimprint.com/v1/collections \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{"name":"supplier-contracts","description":"Q2 2026 supplier agreements"}'

# Add a bundle to the collection
curl -X POST https://api.docimprint.com/v1/collections/col_abc/bundles \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{"bundle_id":"ev_contract1"}'

# Ask across the corpus
curl -X POST https://api.docimprint.com/v1/collections/col_abc/ask \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{"question":"What termination notice period do our supplier contracts require?"}'

Async jobs for large documents

Synchronous extraction works for most documents. For PDFs above 10 pages, batches of documents, or extraction modes that require heavy processing (multi-document compare, structured extraction with complex schemas), the job queue is the right path.

POST /v1/extract with store=true returns 202 by default; POST /v1/jobs creates batch async jobs. The queue consumer runs up to 50 concurrent workers (max_concurrency). Progress and completion are delivered via webhook or polled via GET /v1/jobs/:id.

The job response embeds the full result — for bundles, it includes the bundle_id and manifest_sha256; for lean extractions, it includes the complete output. There is no third request to fetch a separate bundle. The two-step pattern (dispatch → poll) covers everything.

Webhooks fire on status changes: queued → processing → complete or failed. The webhook payload matches the polling response format, making it easy to handle both paths with the same code.
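
Because the two payloads share a shape, one function can serve both the webhook receiver and the polling loop. The sketch below assumes the "status" and "result" fields shown in the polling response; the "error" field and the JobFailed exception are hypothetical names introduced for the example.

```python
from typing import Optional


class JobFailed(Exception):
    """Raised when a job reports the failed status."""


def handle_job_update(payload: dict) -> Optional[dict]:
    """Return the result once the job is complete, None while it is still running."""
    status = payload["status"]
    if status == "complete":
        return payload["result"]
    if status == "failed":
        # "error" is an assumed field name for the failure reason.
        raise JobFailed(payload.get("error", "job failed"))
    if status in ("queued", "processing"):
        return None
    raise ValueError(f"unexpected job status: {status!r}")


# The same call works for a webhook body and for a GET /v1/jobs/:id response.
update = {"status": "complete", "result": {"bundle_id": "ev_1", "manifest_sha256": "ab12"}}
print(handle_job_update(update))
```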

bash: Dispatch an async job and poll for completion

# Dispatch async extract (add ?async=true or use POST /v1/jobs)
curl -X POST "https://api.docimprint.com/v1/extract?async=true" \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{"source":"https://example.com/large-report.pdf","include":["markdown","summary"]}'
# Response: { "status": "queued", "job_id": "job_xyz", "bundle_id": "ev_..." }

# Poll for completion
curl https://api.docimprint.com/v1/jobs/job_xyz \
  -H "Authorization: Bearer dr_live_..."
# { "status": "complete", "result": { "bundle_id": "ev_...", "manifest_sha256": "...", ... } }
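
In application code, the poll step is usually wrapped in a loop with backoff. The sketch below is one possible shape, not an official client: fetch_status is a caller-supplied function that performs the GET /v1/jobs/:id request and returns the parsed JSON, and the delay and timeout values are arbitrary assumptions.

```python
import time
from typing import Callable


def poll_job(
    fetch_status: Callable[[], dict],
    max_wait_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job is complete or failed, doubling the delay up to a cap."""
    delay = initial_delay_s
    waited = 0.0
    while True:
        job = fetch_status()
        if job["status"] in ("complete", "failed"):
            return job
        if waited >= max_wait_s:
            raise TimeoutError(f"job still {job['status']} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)


# Demo with canned responses instead of real HTTP calls.
responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "complete", "result": {"bundle_id": "ev_1"}},
])
sleeps: list[float] = []
final = poll_job(lambda: next(responses), sleep=sleeps.append)
print(final["status"], sleeps)
```

Passing sleep as a parameter keeps the loop testable without real waiting.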

Tradeoffs and when not to use DocImprint

Verifiable document memory is the right choice when trust, provenance, and auditability matter. It is not the right choice for every use case.

Do not use DocImprint for:

Bulk open-web crawling at scale (this is Firecrawl's domain)
RAG pipelines where provenance does not matter and cost is the primary constraint
Documents that change frequently and do not require historical versions
Real-time monitoring of hundreds of URLs per second (the Monitor product re-captures specific URLs on a schedule; it is not a high-frequency web scraper)

The per-call cost ($0.018–$0.075) reflects the work being done: OCR, LLM processing, artifact storage, cryptographic signing. For use cases that need this depth, the cost is justified. For use cases that just need "give me the text of this page," a simpler scraper is more economical.
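
To make the price range concrete, the arithmetic for a given call volume is a single multiplication. The 10,000-call volume below is a made-up example, and the calculation ignores any volume discounts or notarization fees, which are not described here.

```python
from decimal import Decimal

# Per-call price range quoted above, in US dollars.
LOW = Decimal("0.018")
HIGH = Decimal("0.075")


def cost_range(calls: int) -> tuple[Decimal, Decimal]:
    """Lower and upper bound on spend for a given number of calls."""
    return (LOW * calls, HIGH * calls)


low, high = cost_range(10_000)  # hypothetical monthly volume
print(f"${low:.2f} to ${high:.2f}")
```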

The architecture described here is optimized for high-trust, high-value document workflows: legal, financial, compliance, research, and regulated AI applications where the cost of an unverifiable AI output exceeds the cost of verifiable infrastructure.

URL Monitor: change detection over time

Verifiable document memory is not only about static PDFs. Regulatory filings update, contract terms change, and policy pages get revised. The URL Monitor product closes this gap.

POST /v1/monitor registers a URL for periodic re-capture. When the rendered content hash changes, DocImprint creates a new evidence bundle linked to the prior version via parent_bundle_id. Webhooks notify your system of the change event with both bundle IDs.

This extends the four-layer architecture into a temporal dimension: you do not just know what a document said at capture time — you know when it changed and have independently verifiable bundles for both the before and after states.
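
The change-detection rule itself is small: hash the rendered content, compare it with the previous capture, and create a linked bundle only when the hash differs. The sketch below keeps the history in a list and invents sequential bundle IDs; both are assumptions for illustration, and only parent_bundle_id comes from the description above.

```python
import hashlib
from typing import Optional


def content_hash(rendered: str) -> str:
    return hashlib.sha256(rendered.encode("utf-8")).hexdigest()


def recapture(history: list[dict], rendered: str, captured_at: str) -> Optional[dict]:
    """Append a new bundle only if the rendered content hash changed."""
    digest = content_hash(rendered)
    latest = history[-1] if history else None
    if latest is not None and latest["content_sha256"] == digest:
        return None  # unchanged: no new bundle, no webhook
    bundle = {
        "bundle_id": f"ev_{len(history) + 1}",  # hypothetical ID scheme
        "content_sha256": digest,
        "captured_at": captured_at,
        "parent_bundle_id": latest["bundle_id"] if latest else None,
    }
    history.append(bundle)
    return bundle


history: list[dict] = []
first = recapture(history, "Terms v1", "2026-06-01T00:00:00Z")
same = recapture(history, "Terms v1", "2026-06-02T00:00:00Z")
changed = recapture(history, "Terms v2", "2026-06-03T00:00:00Z")
print(first["bundle_id"], same, changed["parent_bundle_id"])
```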

For compliance teams monitoring supplier terms, SEC filings, or policy documents, Monitor plus evidence bundles replaces manual "check the website weekly" workflows with cryptographic change detection. With monitored bundles added to collections, you can ask "which monitored URLs changed this month?" and receive answers with bundle-level citations.

Monitor complements batch extraction: use extract for one-time capture, Monitor for ongoing surveillance of documents that matter over months or years.

bash: Register a URL for change monitoring

curl -X POST https://api.docimprint.com/v1/monitor \
  -H "Authorization: Bearer dr_live_..." \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/terms-of-service","interval_hours":24,"webhook_url":"https://your-app.com/hooks/doc-change"}'
