Aly Sawft · Founder & Engineer, Sawftware LLC · 9 min read
When an AI agent extracts data from a PDF — a contract clause, an invoice total, a research finding — the output is only as trustworthy as the pipeline that produced it. The agent saw the document, ran it through a model, and returned a JSON object. But:
These questions do not have answers in most document AI pipelines. The extraction is a black box: input goes in, output comes out, and you are expected to trust both the vendor and the model.
Evidence bundles are the answer. They are tamper-evident by design: every artifact is content-addressed, and the manifest that describes them carries a cryptographic signature that lets you verify nothing has changed — without trusting the server.
Every DocImprint bundle has a stable identifier (a bundle_id starting with ev_) and consists of three layers. Artifact hashes use SHA-256 as defined in NIST FIPS 180-4. Manifest signatures follow the EIP-191 signed data standard.
Artifacts — the raw document outputs stored in R2:
Manifest — manifest.json, also in R2, which records:
Signature — appended to the manifest response:
# Create an evidence bundle
curl -X POST https://api.docimprint.com/v1/extract \
-H "Authorization: Bearer dr_live_..." \
-H "Content-Type: application/json" \
-d '{"source":"https://example.com/contract.pdf","include":["markdown","screenshot","summary"]}'
# Response includes bundle_id, manifest_sha256, artifacts, signature
# {
# "bundle_id": "ev_abc123",
# "manifest_sha256": "a3f2c1...",
# "signature": "0x1b2c3d...",
# "artifacts": { "markdown": "...", "summary": "..." },
# "captured_at": "2026-06-12T10:00:00Z"
# }
The word "tamper-evident" has a precise meaning here. It does not mean the system prevents tampering: an attacker with write access to R2 could modify artifacts. It means that any modification to artifacts or the manifest is detectable by anyone who has the manifest_sha256.
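The difference between preventing tampering and detecting it can be shown with a minimal Python sketch. It relies only on SHA-256 from the standard library; `artifact_digest` and the sample contract text are hypothetical and not part of the DocImprint API.

```python
import hashlib


def artifact_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an artifact's raw bytes."""
    return hashlib.sha256(data).hexdigest()


original = b"# Contract\n\nPayment is due within 30 days.\n"
tampered = b"# Contract\n\nPayment is due within 90 days.\n"

# The digest is recorded somewhere the storage layer cannot reach.
recorded = artifact_digest(original)

# Nothing stops the stored bytes from changing, but the change is detectable.
print(artifact_digest(original) == recorded)  # True
print(artifact_digest(tampered) == recorded)  # False
```

The check only has value if `recorded` lives outside the system that stores the artifact.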
The verification chain:
1. You store manifest_sha256 (and optionally the signature) outside of DocImprint: in your own database, in a legal hold system, or on-chain.
2. Later, you call GET /v1/extract/:id/verify. DocImprint re-fetches every artifact, recomputes each SHA-256, reconstructs the manifest, and compares its hash to the stored manifest_sha256.
3. If anything has changed, whether an artifact byte or a metadata field, the recomputed hash will not match and the endpoint returns 409 with error TAMPER_DETECTED.
4. You can verify the signature independently using the public key from /.well-known/docimprint-keys.json; once you hold the key, no further DocImprint endpoint is involved. The signature proves that DocImprint produced this specific manifest.
5. You can download the full artifact ZIP and verify locally: unzip, hash each file, reconstruct the manifest, check the hash. No network call to DocImprint is required.
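The local check of artifact hashes might look like the following Python sketch. It assumes that manifest.json inside the ZIP exposes an `artifacts` object mapping filenames to SHA-256 hex digests; that field name is an assumption, as are the helper names `verify_artifacts` and `build_demo_zip`. The demo builds a small ZIP in memory rather than using a real bundle.

```python
import hashlib
import io
import json
import zipfile


def verify_artifacts(zip_bytes: bytes) -> list[str]:
    """Return the names of artifacts whose SHA-256 differs from the manifest."""
    mismatched = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as bundle:
        manifest = json.loads(bundle.read("manifest.json"))
        # Assumed layout: {"artifacts": {"markdown.md": "<sha256 hex>", ...}}
        for name, expected in manifest["artifacts"].items():
            actual = hashlib.sha256(bundle.read(name)).hexdigest()
            if actual != expected:
                mismatched.append(name)
    return mismatched


def build_demo_zip(stored: bytes, recorded: bytes) -> bytes:
    """Build an in-memory ZIP whose manifest records the hash of `recorded`."""
    manifest = {"artifacts": {"markdown.md": hashlib.sha256(recorded).hexdigest()}}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        bundle.writestr("manifest.json", json.dumps(manifest))
        bundle.writestr("markdown.md", stored)
    return buffer.getvalue()


print(verify_artifacts(build_demo_zip(b"hello\n", b"hello\n")))    # []
print(verify_artifacts(build_demo_zip(b"hello\n\n", b"hello\n")))  # ['markdown.md']
```

This covers only the per-artifact step. Comparing the manifest's own hash against your stored manifest_sha256 is a separate check.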
# Deep verify (re-hashes all artifacts server-side)
curl https://api.docimprint.com/v1/extract/ev_abc123/verify
# 200: { "valid": true, "bundle_id": "ev_abc123", "manifest_sha256": "a3f2c1..." }
# 409: { "valid": false, "error": "TAMPER_DETECTED", "mismatched_artifacts": ["markdown.md"] }
# Quick verify (signature check only, no re-hash)
curl "https://api.docimprint.com/v1/extract/ev_abc123/verify?quick=true"
# Download artifact ZIP for local verification
curl -O -J https://api.docimprint.com/v1/extract/ev_abc123/download
# evidence-ev_abc123.zip contains: manifest.json, markdown.md, screenshot.png, ...
The signature in the manifest proves that DocImprint produced a specific bundle. It does not prove when. For legal and compliance use cases, timestamping matters: "this document showed this content on this date."
POST /v1/extract/:id/notarize submits the manifest_sha256 to Base L2 — either as calldata on a transaction or as an EAS (Ethereum Attestation Service) attestation. The on-chain record is permanent, public, and independently verifiable by anyone with a Base RPC node.
Notarization costs $0.05 per bundle. The on-chain transaction hash is stored in the bundle and returned by verify. Anyone can look up the transaction on Basescan and confirm that a given manifest_sha256 existed at a specific block timestamp — without trusting DocImprint.
This combination of artifact hashes, manifest signature, and on-chain timestamp produces a chain of evidence that is difficult to dispute in legal proceedings: the document existed, contained this content, and was captured no later than the block timestamp.
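For the calldata variant, an independent check could be as small as the Python sketch below. It assumes the transaction's input data carries the raw 32-byte manifest hash somewhere in its payload; the exact calldata layout is not documented here, and an EAS attestation would need different decoding. The input hex would typically come from the `input` field of an `eth_getTransactionByHash` JSON-RPC response. `calldata_contains_hash` is a hypothetical helper.

```python
import hashlib


def calldata_contains_hash(tx_input_hex: str, manifest_sha256: str) -> bool:
    """Check whether transaction input data contains the 32-byte manifest hash."""
    calldata = bytes.fromhex(tx_input_hex.removeprefix("0x"))
    digest = bytes.fromhex(manifest_sha256.removeprefix("0x"))
    if len(digest) != 32:
        raise ValueError("manifest_sha256 must be 32 bytes (64 hex characters)")
    return digest in calldata


# Demo values only: a made-up manifest hash and a transaction carrying it.
stored_hash = hashlib.sha256(b"demo manifest").hexdigest()
other_hash = hashlib.sha256(b"different manifest").hexdigest()
tx_input = "0x" + stored_hash

print(calldata_contains_hash(tx_input, stored_hash))  # True
print(calldata_contains_hash(tx_input, other_hash))   # False
```

The block timestamp of the matching transaction is then the latest time at which that manifest hash can have come into existence.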
# Notarize (writes manifest_sha256 to Base calldata or EAS)
curl -X POST https://api.docimprint.com/v1/extract/ev_abc123/notarize \
-H "Authorization: Bearer dr_live_..."
# Response: { "tx_hash": "0x...", "block": 12345678, "network": "base", "eas_uid": "0x..." }
# Verify shows notarization status
curl https://api.docimprint.com/v1/extract/ev_abc123/verify
# { "valid": true, "notarized": true, "tx_hash": "0x...", "eas_uid": "0x..." }
Evidence bundles have lifecycle controls built for compliance workflows:
Legal hold (PUT /v1/extract/:id/hold) marks a bundle as protected. Held bundles cannot be deleted, either by an explicit DELETE call or by retention policies. The hold must be released explicitly before deletion is possible. This matches the legal discovery obligation to preserve relevant documents once litigation is reasonably anticipated.
Retention periods set automatic expiry. The default is 90 days; you can instead set a specific ISO 8601 date. When retention expires, bundles are eligible for garbage collection. Legal hold overrides retention.
Notarized bundles require acknowledge_notarized: true on DELETE — a guardrail against accidental deletion of bundles with on-chain evidence.
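The precedence between these three controls can be modeled in a few lines of Python. This is a sketch of the rules as described above, not DocImprint's implementation: the `Bundle` dataclass, `can_delete`, and `eligible_for_gc` are hypothetical names, and whether garbage collection treats notarized bundles specially is not stated, so the sketch leaves that out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Bundle:
    bundle_id: str
    retention_until: datetime
    legal_hold: bool = False
    notarized: bool = False


def can_delete(bundle: Bundle, acknowledge_notarized: bool = False) -> tuple[bool, str]:
    """Decide whether an explicit DELETE should be allowed, and say why."""
    if bundle.legal_hold:
        return False, "legal hold must be released first"
    if bundle.notarized and not acknowledge_notarized:
        return False, "acknowledge_notarized: true is required"
    return True, "ok"


def eligible_for_gc(bundle: Bundle, now: datetime) -> bool:
    """Retention expiry makes a bundle collectable unless it is on hold."""
    return not bundle.legal_hold and now >= bundle.retention_until


captured = datetime(2026, 6, 12, 10, 0, tzinfo=timezone.utc)
bundle = Bundle(
    "ev_abc123",
    retention_until=captured + timedelta(days=90),  # the default retention
    legal_hold=True,
    notarized=True,
)
after_expiry = captured + timedelta(days=120)

print(eligible_for_gc(bundle, after_expiry))  # False: hold overrides retention
print(can_delete(bundle))                     # (False, 'legal hold must be released first')

bundle.legal_hold = False
print(can_delete(bundle))                              # (False, 'acknowledge_notarized: true is required')
print(can_delete(bundle, acknowledge_notarized=True))  # (True, 'ok')
```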
These controls are not cosmetic. A compliance audit or e-discovery request can be answered with: "we held this bundle under legal hold from date X, it has not been modified (hash matches), and it was notarized on Base at block Y."
DocImprint has two modes: store=true (evidence bundle) and store=false (lean response).
Use evidence bundles when:
Use lean endpoints (/v1/summarize, /v1/qa, etc.) when:
The API surfaces are separate but compatible. A lean summarize call and a full extract bundle can both be run against the same source; the results will differ in depth and cost but not in the underlying model behavior.
Understanding the hash calculation is essential for offline verification. The manifest_sha256 is not a hash of the ZIP file or a hash of individual artifacts in isolation. It is the SHA-256 of a deterministic serialization of manifest.json.
The manifest object includes: bundle_id, captured_at, source reference, mode parameters, artifact hash map (filename → SHA-256 hex), mode-specific result fields, owner identity, and merkle_root when indexed. Fields are serialized in a canonical JSON order before hashing.
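A minimal Python sketch of hashing a canonically serialized manifest follows. The exact canonical form is not specified in this article, so the sketch assumes sorted keys, no insignificant whitespace, and UTF-8 encoding; a real verifier must match whatever rule DocImprint actually applies. The field names in the demo manifest are illustrative.

```python
import hashlib
import json


def manifest_sha256(manifest: dict) -> str:
    """Hash a manifest via an assumed canonical form: sorted keys, compact, UTF-8."""
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


markdown_hash = hashlib.sha256(b"# Contract\n").hexdigest()

manifest_a = {
    "bundle_id": "ev_abc123",
    "captured_at": "2026-06-12T10:00:00Z",
    "artifacts": {"markdown.md": markdown_hash},
}
# Same content, different key insertion order.
manifest_b = {
    "artifacts": {"markdown.md": markdown_hash},
    "captured_at": "2026-06-12T10:00:00Z",
    "bundle_id": "ev_abc123",
}
# One metadata field changed by a single second.
manifest_c = dict(manifest_a, captured_at="2026-06-12T10:00:01Z")

print(manifest_sha256(manifest_a) == manifest_sha256(manifest_b))  # True
print(manifest_sha256(manifest_a) == manifest_sha256(manifest_c))  # False
```

Canonical ordering is what makes the hash reproducible by a third party: two serializers that disagree on key order or whitespace would produce different digests for the same manifest.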
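When you call GET /v1/extract/:id/verify, DocImprint re-fetches every artifact, recomputes each artifact's SHA-256, reconstructs the manifest, serializes it in canonical order, hashes it, and compares the result to the stored manifest_sha256.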
If you modify a single byte in markdown.md — even a trailing newline — the artifact hash changes, the manifest hash changes, and verification returns TAMPER_DETECTED with the mismatched artifact names.
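A client might turn the verify endpoint's two outcomes into a report along these lines. The Python sketch below uses only the response shapes shown in this article's examples (`valid`, `bundle_id`, `error`, `mismatched_artifacts`); `summarize_verify` is a hypothetical helper, and any other fields or status codes the API may return are not handled.

```python
import json


def summarize_verify(status: int, body: str) -> str:
    """Turn a verify response (HTTP status plus JSON body) into a one-line report."""
    payload = json.loads(body)
    if status == 200 and payload.get("valid") is True:
        return f"{payload['bundle_id']}: intact"
    if status == 409 and payload.get("error") == "TAMPER_DETECTED":
        names = ", ".join(payload.get("mismatched_artifacts", []))
        return f"tampering detected in: {names}"
    return f"unexpected response (HTTP {status})"


ok_body = '{"valid": true, "bundle_id": "ev_abc123", "manifest_sha256": "a3f2c1..."}'
bad_body = (
    '{"valid": false, "error": "TAMPER_DETECTED", '
    '"mismatched_artifacts": ["markdown.md"]}'
)

print(summarize_verify(200, ok_body))   # ev_abc123: intact
print(summarize_verify(409, bad_body))  # tampering detected in: markdown.md
```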
This is why storing manifest_sha256 in your own system of record matters: it is your independent anchor. You can verify against your stored hash even if DocImprint's API is unavailable, as long as you have the artifact ZIP and the public signing key.
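For the signature step, the part that can be shown with the standard library alone is the EIP-191 message framing. The Python sketch below builds the preimage for the `personal_sign` variant (version byte 0x45). It is an assumption that DocImprint uses this variant and signs the raw 32 hash bytes rather than the hex string; `eip191_personal_message` is a hypothetical helper. Hashing the preimage with Keccak-256 and recovering the secp256k1 signer require a third-party library and are left out. Note that `hashlib.sha3_256` is not Keccak-256.

```python
import hashlib


def eip191_personal_message(message: bytes) -> bytes:
    """Frame a message per EIP-191 version 0x45 (the personal_sign format)."""
    prefix = b"\x19Ethereum Signed Message:\n" + str(len(message)).encode("ascii")
    return prefix + message


# Demo value only: a made-up manifest hash, signed as 32 raw bytes (assumed).
manifest_hash = hashlib.sha256(b"demo manifest").digest()
preimage = eip191_personal_message(manifest_hash)

print(preimage[:28])  # b'\x19Ethereum Signed Message:\n32'
print(len(preimage))  # 60
```

The framing prevents a signed manifest hash from being replayed as a valid Ethereum transaction, which is the reason EIP-191 prefixes exist.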
# After downloading evidence-ev_abc123.zip:
# 1. sha256sum each artifact file
# 2. Build manifest object matching manifest.json structure
# 3. Canonical JSON serialize → SHA-256 → compare to stored manifest_sha256
# 4. Verify EIP-191 signature over manifest_sha256 using key from:
curl https://api.docimprint.com/v1/.well-known/docimprint-keys.json
A common question: "I already store PDFs in S3 with versioning. Why do I need evidence bundles?"
S3 versioning proves that an object existed at a point in time within AWS's trust boundary. It does not prove:
Evidence bundles bind extraction outputs to source artifacts cryptographically. The manifest records both the original PDF hash and the extracted markdown hash. The signature proves DocImprint produced this specific binding. Merkle roots enable paragraph-level citation proofs.
S3 and evidence bundles are complementary, not competing. A robust workflow stores the evidence ZIP in your own S3 bucket (step 4 in the chain-of-custody guide) while using DocImprint for capture, signing, and verification. Your S3 copy is the custody record; the bundle signature is the integrity proof.
Generic document AI APIs return JSON with no hash chain. You cannot prove tomorrow that today's output is unchanged. Evidence bundles make "prove it" a one-line API call instead of a forensic investigation.
Evidence bundles are the foundational primitive for higher-order document intelligence workflows:
Collections — group bundles by matter, client, or topic. Semantic search across the corpus with citations that point back to specific bundle artifacts and chunk IDs.
Version chains — when a document is updated, create a new bundle with parent_bundle_id pointing to the previous version. GET /v1/extract/:id/history returns the full chain. You can diff versions, compare claim-check results, and prove what changed and when.
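Conceptually, a version chain is a linked list walked through parent_bundle_id. The Python sketch below reconstructs a history from locally held bundle records; the record shape and the `version_history` helper are illustrative assumptions, not the response format of the history endpoint.

```python
def version_history(bundles: dict[str, dict], bundle_id: str) -> list[str]:
    """Follow parent_bundle_id links from a bundle back to the first version."""
    chain: list[str] = []
    seen: set[str] = set()
    current = bundle_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = bundles[current].get("parent_bundle_id")
    return chain


# Hypothetical records: three captures of the same contract.
bundles = {
    "ev_v1": {"parent_bundle_id": None},
    "ev_v2": {"parent_bundle_id": "ev_v1"},
    "ev_v3": {"parent_bundle_id": "ev_v2"},
}

print(version_history(bundles, "ev_v3"))  # ['ev_v3', 'ev_v2', 'ev_v1']
```

The cycle check guards against malformed records, since a chain that loops back on itself would otherwise never terminate.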
Agent handoffs — POST /v1/extract/:id/handoff records which agent passed the bundle to which agent, with a note. GET /v1/extract/:id/chain returns the delegation graph. This is chain-of-custody for multi-agent pipelines.
Citation proofs — POST /v1/extract/:id/verify-citation with a chunk_id and quoted text returns a Merkle proof. The proof is portable: any party can verify the chunk's membership in the bundle's Merkle tree offline.
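Offline verification of a Merkle membership proof can be sketched in Python as follows. The tree construction is an assumption (SHA-256 leaves, pairwise concatenation, last node duplicated on odd levels, proof steps tagged left or right); DocImprint's actual tree layout and proof encoding may differ, and all function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level: list[bytes]) -> list[bytes]:
    """Hash adjacent pairs; the caller guarantees an even number of nodes."""
    return [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(chunks: list[bytes]) -> bytes:
    """Compute the root over a non-empty list of chunks."""
    level = [_h(chunk) for chunk in chunks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = _next_level(level)
    return level[0]


def merkle_proof(chunks: list[bytes], index: int) -> list[tuple[str, bytes]]:
    """Collect (side, sibling_hash) pairs from the leaf at `index` up to the root."""
    level = [_h(chunk) for chunk in chunks]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append(("left" if sibling < index else "right", level[sibling]))
        level = _next_level(level)
        index //= 2
    return proof


def verify_proof(chunk: bytes, proof: list[tuple[str, bytes]], root: bytes) -> bool:
    """Recompute the path to the root using only the chunk and its proof."""
    node = _h(chunk)
    for side, sibling in proof:
        node = _h(sibling + node) if side == "left" else _h(node + sibling)
    return node == root


chunks = [b"Clause 1: scope.", b"Clause 2: payment.", b"Clause 3: term."]
root = merkle_root(chunks)
proof = merkle_proof(chunks, 2)

print(verify_proof(b"Clause 3: term.", proof, root))   # True
print(verify_proof(b"Clause 3: terms.", proof, root))  # False
```

The verifier needs only the quoted chunk, the proof, and a trusted root. It never sees the other chunks, which is what makes the proof portable.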
Each of these features is only possible because the bundle is a content-addressed artifact with a verifiable history, so any later change to it is detectable. That is what tamper-evidence unlocks.