Intelligent Document Processing · Open source

Document AI you can audit.

An MCP server that ingests documents from anywhere, classifies them, stacks them in your reviewer's exact order, and extracts fields with grounded, page-level provenance. Built on LandingAI ADE, for regulated finance and healthcare.

Apache-2.0 · Stub mode, no API key needed · OAuth 2.1 remote · Bring your own LandingAI key

Works with

Claude · Lyzr AI · LangGraph · CrewAI · Databricks · any MCP client

The problem

Regulated teams drown in paperwork they can't safely automate.

Loan files, claims, intake packets. A pile of PDFs that someone has to read, sort, key in, and sign off on. A wrong field isn't a typo; it's a compliance event. Traditional OCR gives you a wall of text. Fluent LLMs give you confident answers with no way to check them. Neither is something you can put in front of an examiner.

How it works

One grounded pipeline, five steps.

Documents in from any source, a review-ready package out. Every grounded value points back to its page and box, and anything ungrounded is flagged for a human.

1. Ingest

From an upload portal, SFTP/S3, email, or an export. Source-agnostic.

2. Classify

Detect each document's type to pick the right schema and stack slot.

3. Extract

LandingAI ADE returns typed fields with page, box, and confidence on grounded values.

4. Stack

Order the set exactly the way your reviewer expects. Configurable per use case.

5. Render

A combined PDF cover sheet tied to source pages, plus a JSON sidecar.
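
The five steps above can be sketched end to end in a few lines of Python. This is an illustration only, not the project's actual API: every name here (`Document`, `ExtractedField`, `run_pipeline`, the keyword rules, the stack order) is hypothetical, classification is faked with filename keywords, extraction is stubbed instead of calling LandingAI ADE, and only the JSON-sidecar half of the render step is shown.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExtractedField:
    name: str
    value: Optional[str]
    grounded: bool
    page: Optional[int] = None
    confidence: Optional[float] = None


@dataclass
class Document:
    path: str
    doc_type: str = "unknown"
    fields: List[ExtractedField] = field(default_factory=list)


# Hypothetical filename keywords standing in for a real classifier.
TYPE_KEYWORDS = {"paystub": "paystub", "w2": "w2", "statement": "bank_statement"}
# Hypothetical reviewer order for this example.
STACK_ORDER = ["paystub", "w2", "bank_statement"]


def ingest(paths):
    """Step 1: accept any list of file paths, however they arrived."""
    return [Document(path=p) for p in paths]


def classify(doc):
    """Step 2: pick a document type (here, by filename keyword)."""
    lowered = doc.path.lower()
    for keyword, doc_type in TYPE_KEYWORDS.items():
        if keyword in lowered:
            doc.doc_type = doc_type
            break
    return doc


def extract(doc):
    """Step 3: stubbed extraction; a live run would call the extraction service."""
    if doc.doc_type == "paystub":
        doc.fields.append(
            ExtractedField("borrower.income", "6,500.00", True, page=1, confidence=0.98)
        )
    else:
        # No grounding available in this stub, so the value is left empty.
        doc.fields.append(ExtractedField("account.holder", None, False))
    return doc


def stack(docs):
    """Step 4: sort into reviewer order; unknown types go last."""
    rank = {doc_type: i for i, doc_type in enumerate(STACK_ORDER)}
    return sorted(docs, key=lambda d: rank.get(d.doc_type, len(rank)))


def render(docs):
    """Step 5: build the JSON sidecar (PDF rendering is omitted here)."""
    return {
        "stack": [d.path for d in docs],
        "fields": [
            {**asdict(f), "source_doc": d.doc_type, "needs_review": not f.grounded}
            for d in docs
            for f in d.fields
        ],
    }


def run_pipeline(paths):
    docs = [extract(classify(d)) for d in ingest(paths)]
    return render(stack(docs))


package = run_pipeline(["inbox/bank_statement_jan.pdf", "inbox/paystub_march.pdf"])
```

The paystub is stacked ahead of the bank statement even though it arrived second, and the ungrounded `account.holder` field comes out marked `needs_review`.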

Why it's different

Grounding is the audit signal.

Built on the one idea that makes document AI usable where the stakes are real: a machine-read value should always point back to where it came from.

Grounded provenance

Grounded values carry page, bounding box, and confidence. Examinable by default.

Flag, don't guess

Ungrounded or low-confidence values are flagged for human review, not trusted silently.

Configurable stacking

Assemble any document set into the exact order a reviewer wants.

Source-agnostic

Works on any list of files, however they arrived. Connectors are thin adapters.

Portable by design

One MCP core, called from Claude, Lyzr, LangGraph, CrewAI, or run on Databricks.

No autonomous decisions

It outputs data and a review queue. It never approves, denies, scores, or ranks.
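
One way to picture "connectors are thin adapters" is a single-method interface that turns any source into a list of files. The sketch below is an assumption about the shape, not the project's real connector API: `Connector`, `LocalFolderConnector`, `ExplicitListConnector`, and `gather` are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Protocol


class Connector(Protocol):
    """Hypothetical adapter contract: any source that can yield file paths."""

    def fetch(self) -> List[Path]: ...


class LocalFolderConnector:
    """Reads matching files from a local folder, e.g. an upload drop."""

    def __init__(self, folder: Path, pattern: str = "*.pdf"):
        self.folder = folder
        self.pattern = pattern

    def fetch(self) -> List[Path]:
        return sorted(self.folder.glob(self.pattern))


class ExplicitListConnector:
    """Wraps paths some other system already exported."""

    def __init__(self, paths: Iterable[str]):
        self.paths = [Path(p) for p in paths]

    def fetch(self) -> List[Path]:
        return list(self.paths)


def gather(connectors: Iterable[Connector]) -> List[Path]:
    """Flatten every source into the one list the pipeline works on."""
    files: List[Path] = []
    for connector in connectors:
        files.extend(connector.fetch())
    return files


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "paystub.pdf").touch()
    (folder / "notes.txt").touch()  # ignored by the *.pdf pattern
    files = gather(
        [LocalFolderConnector(folder), ExplicitListConnector(["exports/w2.pdf"])]
    )
    names = [p.name for p in files]
```

Under this shape, an SFTP, S3, or email source would be one more class with a `fetch` method, and the rest of the pipeline would not change.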

Governance

The posture regulated work expects, in the architecture.

Not promised on a slide. The properties an auditor cares about are structural.

  • Provenance on grounded values gives an examinable audit trail.
  • Human-in-the-loop. A person decides; the system prepares the file.
  • OAuth 2.1 on the remote server. It refuses to start unauthenticated.
  • Run it in your own environment: your cloud or your Databricks workspace.

One honest caveat: live extraction calls LandingAI ADE, a third-party API, so documents are sent there for parsing. Stub mode is fully local, and LandingAI offers on-prem / VPC options where that step must stay inside your environment. This is design-aligned for regulated use, not a compliance certification.

# A grounded field comes back like this
{
  "name": "borrower.income",
  "value": "6,500.00",
  "grounded": true,
  "confidence": 0.98,
  "source_doc": "paystub",
  "page": 1,
  "needs_review": false
}

# An ungrounded value is surfaced, not trusted
{
  "name": "account.holder",
  "grounded": false,
  "needs_review": true
}
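
A consumer of the sidecar might route fields like the two above into "accepted" and "review queue" buckets. This is a minimal sketch under stated assumptions: the 0.90 threshold and the function names are hypothetical (the project does not publish a specific cut-off here), and the third field, `employer.name`, is made up to show the low-confidence case.

```python
import json

# Hypothetical cut-off; no specific threshold is documented.
REVIEW_THRESHOLD = 0.90

# The first two entries mirror the examples above; the third is invented
# to illustrate a grounded but low-confidence value.
SIDECAR_FIELDS = json.loads("""
[
  {"name": "borrower.income", "value": "6,500.00", "grounded": true,
   "confidence": 0.98, "source_doc": "paystub", "page": 1, "needs_review": false},
  {"name": "account.holder", "grounded": false, "needs_review": true},
  {"name": "employer.name", "value": "Acme Corp", "grounded": true,
   "confidence": 0.61, "source_doc": "paystub", "page": 1, "needs_review": false}
]
""")


def needs_review(item: dict) -> bool:
    """Flag anything already marked, ungrounded, or below the threshold."""
    if item.get("needs_review", False):
        return True
    if not item.get("grounded", False):
        return True
    confidence = item.get("confidence")
    return confidence is None or confidence < REVIEW_THRESHOLD


def split_fields(items):
    """Return (accepted, review_queue) without dropping any field."""
    accepted, review_queue = [], []
    for item in items:
        (review_queue if needs_review(item) else accepted).append(item)
    return accepted, review_queue


accepted, review_queue = split_fields(SIDECAR_FIELDS)
accepted_names = [item["name"] for item in accepted]
queued_names = [item["name"] for item in review_queue]
```

Nothing is discarded: every field lands in exactly one bucket, so a person still sees whatever the machine could not ground.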

Who it's for

Same shape underneath, across regulated work.

The pattern that orders a mortgage credit file orders a healthcare intake or a claims packet.

Lending & banking

Stack a credit file (1003, paystubs, W-2, bank statements, ID), extract income, identity, and collateral fields with provenance, and hand a reviewer a decision-ready package.

Healthcare ops

Order an intake, prior-auth, or claims packet, extract the fields a reviewer needs, and flag anything ungrounded.

Any regulated back office

Turn a folder of PDFs into stacked, grounded, audit-ready data your team can act on.
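
The "same shape underneath" idea can be sketched as one stacking function driven by per-use-case profiles. Treat this as an assumption about the design rather than the project's real configuration format: `STACK_PROFILES` and `stack_documents` are hypothetical names, the credit-file order follows the list above, and the healthcare order is made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical profiles: document types in the order a reviewer wants them.
STACK_PROFILES: Dict[str, List[str]] = {
    "credit_file": ["1003", "paystub", "w2", "bank_statement", "id"],
    # Invented order, for illustration only.
    "healthcare_intake": ["intake_form", "insurance_card", "referral"],
}


def stack_documents(
    docs: List[Tuple[str, str]], profile: str
) -> List[Tuple[str, str]]:
    """Sort (filename, doc_type) pairs into the profile's order.

    Types missing from the profile sort last. sorted() is stable, so
    documents of the same type keep their arrival order.
    """
    order = STACK_PROFILES[profile]
    rank = {doc_type: i for i, doc_type in enumerate(order)}
    return sorted(docs, key=lambda doc: rank.get(doc[1], len(order)))


arrived = [
    ("id.pdf", "id"),
    ("stub_feb.pdf", "paystub"),
    ("misc.pdf", "unknown"),
    ("1003.pdf", "1003"),
    ("stub_jan.pdf", "paystub"),
]
stacked = [name for name, _ in stack_documents(arrived, "credit_file")]
```

Switching use cases then means adding a profile entry, not writing new sorting logic.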

Get started

Try the whole pipeline in a minute, free.

No API key required. Stub mode runs the full pipeline on synthetic data so you can wire it up before spending a cent on ADE. Add your LandingAI key for live extraction.

# clone, install, run, no key needed
git clone https://github.com/rdmurugan/idpflow-core.git
cd idpflow-core
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e .

python examples/make_sample_docs.py
python examples/direct_library.py   # stub mode, free

Open source

Built in the open. PRs welcome.

Apache-2.0. Especially looking for new stacking profiles, extraction schemas, and connectors from people in lending, banking, and healthcare ops. Tell me what documents you're drowning in.

Contribute on GitHub