Personal Project • Agentic AI & Document Intelligence

Leah Rae
Genealogy

A team of purpose-built AI agents — backed by Azure Postgres/PostGIS and Blob storage — that reads ~1,400 pages of historical documents, photos, and handwritten records, transcribes them, and aligns every extracted fact to the right ancestor with full citations. Built for love of family history; it doubles as a working proof that an agentic pipeline can do trustworthy, audited work at scale.

Claude Agents • Document AI • Vision / OCR • Azure Postgres / PostGIS • Azure Blob • Provenance Modeling

~1,400: Pages read & aligned

834: People in the tree

551: Sourced citations

150 yr: Research wall broken

01

The Problem

Unstructured by nature

Family history is a data problem disguised as a hobby. The raw material is century-old handwritten civil registers, typewritten certificates, scanned photographs, newspaper clippings, census images, and online profiles of wildly varying reliability — heterogeneous, unstandardized, and mostly trapped on paper or behind archive viewers.

The hard parts

The challenge isn't storing names. It's reading documents that range from clean typescript to 19th-century cursive in a foreign bureaucratic hand; aligning each fact to the correct person when names repeat across generations and records disagree; and tracking provenance — which record supports which claim, at what confidence — so conflicting sources can be adjudicated instead of silently overwritten.

At scale

Do all of that across ~1,400 pages and 800+ people without it collapsing into an unmaintainable pile of notes — and prove the pipeline is reliable enough that its output can be trusted.

02

The Solution

A team of agents, not one prompt

I built a fleet of purpose-built Claude agents, each defined as a structured skill file that owns exactly one job, all sharing a single database. One agent reads and transcribes scanned documents and photos with vision. One walks the FamilySearch pedigree and reconciles it against the live tree. One performs deep, recursive research on a single ancestor. One audits that the research actually landed in the database. One runs data-quality checks and safe fixes. New capabilities are added as new skills, so the system grows without becoming a monolith.

Agents propose, code commits

The system's reliability comes from never letting a model write to the database directly. Every model-driven extraction lands in a review sidecar that a human approves. Only then does a deterministic Python step mint IDs, dedupe by content hash, enforce foreign keys, store the original file, and write the records in a single transaction, with a backup taken first, every time.
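
The approval gate and single-transaction write described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses the standard-library `sqlite3` module as a stand-in for Postgres so it runs without a server, and the table layout, the sidecar dictionary shape, and the names `open_demo_db`, `content_hash`, and `apply_sidecar` are all assumptions made for the example. The backup step and original-file storage are omitted.

```python
import hashlib
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `source` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE source ("
        " id INTEGER PRIMARY KEY,"        # ID is minted by the database
        " sha256 TEXT UNIQUE NOT NULL,"   # content hash is the dedupe key
        " title TEXT NOT NULL)"
    )
    return conn


def content_hash(data: bytes) -> str:
    """Hex SHA-256 of the original file bytes."""
    return hashlib.sha256(data).hexdigest()


def apply_sidecar(conn: sqlite3.Connection, sidecar: dict) -> int:
    """Write an approved sidecar in one transaction; return rows inserted."""
    if sidecar.get("status") != "approved":
        raise PermissionError("sidecar has not been approved by a reviewer")
    written = 0
    with conn:  # commits on success, rolls back if anything raises
        for src in sidecar["sources"]:
            cur = conn.execute(
                "INSERT OR IGNORE INTO source (sha256, title) VALUES (?, ?)",
                (content_hash(src["file_bytes"]), src["title"]),
            )
            written += cur.rowcount  # 0 when the hash already exists
    return written
```

In this sketch an unapproved sidecar is rejected before any SQL runs, and a second file with identical bytes is ignored by the unique constraint rather than creating a duplicate source.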

Built for trust

The data model treats provenance as first-class: a citation links one source to one specific claim (about a person, event, or place) with a confidence level and a conflict flag. That makes "two records disagree about a birth year" a query to resolve — not a lost edit. Everything is backed by Azure Postgres/PostGIS for the structured tree and Azure Blob for original scans and timestamped backups, and surfaces as an interactive map with a page for every ancestor.
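
To make the "disagreement is a query" idea concrete, here is a small sketch of a citation record and a conflict finder. The field names, the `Citation` class, and `find_conflicts` are hypothetical, and the real schema lives in Postgres rather than in Python objects. In this sketch the conflict is derived on the fly instead of being stored as a flag.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source_id: str    # the record that supports the claim
    subject_id: str   # the person, event, or place the claim is about
    claim: str        # e.g. "birth_year"
    value: str        # what this source says
    confidence: str   # e.g. "high", "medium", "low"


def find_conflicts(citations):
    """Return {(subject_id, claim): sorted values} where sources disagree."""
    values = defaultdict(set)
    for c in citations:
        values[(c.subject_id, c.claim)].add(c.value)
    return {key: sorted(v) for key, v in values.items() if len(v) > 1}


# Made-up example data: two sources disagree about one person's birth year.
example = [
    Citation("S1", "P1", "birth_year", "1850", "medium"),
    Citation("S2", "P1", "birth_year", "1852", "high"),
    Citation("S3", "P1", "birth_place", "Example Town", "high"),
]
conflicts = find_conflicts(example)
```

Because both values stay attached to their sources, neither is overwritten; the reviewer can weigh the confidence levels and decide which one the tree should display.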

03

Result & Impact

A validated pipeline

The system now holds 834 people, 199 sources, and 551 citations across 647 geocoded places, built from ~1,400 pages of documents and photos read, transcribed, and aligned to the right individuals — migrated to a normalized schema with enforced referential integrity and zero broken references.

A 150-year wall, broken

The maternal Italian line was stuck at a genuine dead end: an ancestor with no recorded parents and zero attached sources. The research agents surfaced the family with AI handwriting OCR, then read the original 19th-century Italian government civil registers through a gated archive viewer. Along the way they caught an error in the previously trusted record (its birth date was off by two years) and disambiguated a same-named cousin via a marginal marriage annotation. The result was three new ancestors committed with eight citations, each tied to a specific register act.

Why it matters professionally

This is the exact shape of a production document-AI system — multi-agent decomposition, human-in-the-loop review gates, deterministic apply-and-audit, and full provenance — applied to a domain I genuinely love. The same instincts (reading messy real-world documents, modeling trust and conflict, automating safely) transfer directly to enterprise document intelligence, data-quality, and agentic-workflow work.

The discovery that proves the pipeline

An AI agent read a 19th-century Italian civil register in an ornate bureaucratic hand, reconciled it against a conflicting online record, determined the trusted source was wrong by two years, and committed a fully cited correction (three new generations of the family) with every fact traceable to its original document image. That single episode is the whole thesis: agentic AI doing careful, sourced, auditable research a person can rely on.

Architecture

How the pipeline runs

01 / INGEST (Read): Vision agents transcribe scans, photos, and registers; web/pedigree agents pull online records.

02 / EXTRACT (Propose): Each fact written to a review sidecar with its source, never straight to the database.

03 / REVIEW (Approve): A human confirms or corrects the proposed extraction before anything commits.

04 / APPLY (Commit): Deterministic scripts mint IDs, dedupe by hash, enforce FKs, store originals, write citations.

05 / AUDIT (Verify): Audit + data-quality agents confirm patches landed and flag conflicts and gaps.
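
One way to picture the five stages above is as a strict ordering that a fact must walk through one step at a time. The sketch below is illustrative only: the `Stage` enum and `advance` function are hypothetical names, and it leaves out rejection and correction paths that a real review step would need.

```python
from enum import Enum


class Stage(Enum):
    INGESTED = 1   # 01 / INGEST
    PROPOSED = 2   # 02 / EXTRACT
    APPROVED = 3   # 03 / REVIEW
    COMMITTED = 4  # 04 / APPLY
    VERIFIED = 5   # 05 / AUDIT


def advance(current: Stage, target: Stage) -> Stage:
    """Allow a move only to the immediately following stage."""
    if target.value != current.value + 1:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Under this rule a proposed fact cannot jump straight to `COMMITTED`; it has to pass through `APPROVED` first, which is the property the review gate exists to guarantee.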

Stack

Technologies used

Claude Agents (Skill Files)

A team of single-purpose agents — vision transcription, pedigree reconciliation, recursive deep-dive research, audit, and data quality — each a reusable, runbook-backed skill.

Vision & Handwriting OCR

Reads scanned documents, photographs, and 19th-century handwritten registers — including foreign-language civil records — transcribing them into structured, sourced facts.

Azure PostgreSQL / PostGIS

The single source of truth — a normalized, spatially-enabled schema with enforced foreign keys and a first-class source→citation provenance model.

Azure Blob Storage

Stores original document and photo files (hashed and deduped) alongside timestamped database backups taken before every change.
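
A common way to get "hashed and deduped" storage is to name each blob after the hash of its contents, so identical files map to the same name. The sketch below shows only the naming; it makes no Azure calls, and the `originals/` and `backups/` prefixes, the `.dump` extension, and both function names are assumptions for illustration.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import PurePosixPath


def original_blob_name(data: bytes, filename: str) -> str:
    """Content-addressed name: same bytes give the same blob name."""
    digest = hashlib.sha256(data).hexdigest()
    suffix = PurePosixPath(filename).suffix.lower()  # keep the file type
    return f"originals/{digest[:2]}/{digest}{suffix}"


def backup_blob_name(taken_at: datetime) -> str:
    """Timestamped backup name; expects a timezone-aware datetime."""
    stamp = taken_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"backups/tree-{stamp}.dump"
```

With this scheme, uploading `Scan.JPG` and a renamed copy `copy.jpg` with identical bytes resolves to one blob, and UTC timestamps make backup names sort chronologically.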

Python (psycopg)

The deterministic apply-and-audit layer: dry-run by default, transactional writes, hash-based dedupe, and ID minting — the only code that touches the database.
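
"Dry-run by default" usually means the write path has to be requested explicitly. A minimal command-line sketch of that convention, using the standard-library `argparse`, might look like this; the `--apply` flag name and the `plan` helper are assumptions, not the project's actual interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Apply an approved sidecar.")
    parser.add_argument("sidecar", help="path to the approved sidecar file")
    parser.add_argument(
        "--apply",
        action="store_true",  # defaults to False, so a bare run is a dry run
        help="actually write to the database; otherwise only report",
    )
    return parser


def plan(argv: list[str]) -> str:
    """Return a one-line description of what this invocation would do."""
    args = build_parser().parse_args(argv)
    mode = "APPLY" if args.apply else "DRY-RUN"
    return f"{mode}: {args.sidecar}"
```

Here `plan(["patch.json"])` reports a dry run, and only `plan(["patch.json", "--apply"])` selects the write path, so forgetting a flag can never cause a write.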

Leaflet Map Viewer

PostGIS geometry exported to a published web map with a per-person "hub" page — timeline, family, narrative, citations — turning the database into a living family atlas.
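
Leaflet can render GeoJSON directly, so one plausible export path is to turn geocoded places into a GeoJSON FeatureCollection. The sketch below assumes each place has already been read from the database into a dictionary with `name`, `lon`, `lat`, and `person_ids` keys; those keys and the function name are hypothetical, and the coordinates are made up.

```python
import json


def to_feature_collection(places: list[dict]) -> str:
    """Serialize places as a GeoJSON FeatureCollection string."""
    features = [
        {
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [p["lon"], p["lat"]]},
            "properties": {"name": p["name"], "person_ids": p["person_ids"]},
        }
        for p in places
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})


# Made-up example place linked to two hypothetical person IDs.
geojson_text = to_feature_collection(
    [{"name": "Example Town", "lon": 12.5, "lat": 41.9, "person_ids": ["P1", "P2"]}]
)
```

Keeping person IDs in each feature's properties is what would let a map marker link through to the matching per-person page.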
