Skip to main content

Ingest Internals

This guide explains the internal architecture of the ingest pipeline — the seven execution stages, Cypher MERGE patterns, batching strategy, and error handling.

Architecture Overview

The ingest pipeline is a synchronous, seven-stage linear processor:

corpus JSON files
→ Pydantic validation
→ Stage-ordered FalkorDB writes
→ Run report

Key design decisions:

  • Synchronous — FalkorDB (Redis-based) is single-threaded; async adds overhead with no throughput benefit
  • ThreadPoolExecutor for file I/O — parallelizes JSON parsing and Pydantic validation (max_workers=4)
  • Sequential DB writes — UNWIND batches within each pipeline; sequential ordering ensures MERGE consistency
  • Fully idempotent — every node and edge uses MERGE (not CREATE), so re-runs are safe

Execution Stages

Stage 0 — Index creation (no data)
Stage 1 — Book registry load (no DB writes)
Stage 2 — Lexicon
Stage 3 — Scripture Text
Stage 4 — Reference nodes (TG, BD, Index — parallelized)
Stage 5 — Commentary
Stage 6 — Pending resolution
Report — Summary output

Stage 0: Index Creation

Creates all FalkorDB indices before any data is written. This ensures MERGE operations have index-backed lookups:

CREATE INDEX FOR (p:Passage) ON (p.id)
CREATE INDEX FOR (w:Word) ON (w.strongs)
CREATE INDEX FOR (t:TGEntry) ON (t.topicId)

Stage 1: Book Registry Load

Loads data/book_registry.json into memory. The registry maps book IDs to metadata (title, abbreviation, corpus type, chapter count). Used by all subsequent stages for ID validation.

Stage 2: Lexicon Pipeline

Reads lexicon/*.json files (Hebrew, Greek, Aramaic). Produces:

  • :Word nodes (one per Strong's number)
  • DERIVES_FROM edges (etymology chains via derivation.roots)
  • RELATED_TO edges (cross-referenced entries via related)

Must precede Stage 3 so ALIGNED_TO edges can link tokens to existing :Word nodes.

Stage 3: Scripture Text Pipeline

Reads corpus/*.json and all testament directories. Produces:

  • :Passage nodes (one per verse)
  • :Witness nodes (manuscript evidence)
  • :WordAlignment nodes (interlinear tokens)
  • HAS_ORIGINAL, HAS_WORD, ALIGNED_TO, CROSS_REF edges

Stage 4: Reference Nodes

Three independent pipelines run via ThreadPoolExecutor(max_workers=3):

  • Topical Guide:TGEntry nodes, CITES edges
  • Bible Dictionary:BDEntry nodes, SEE_ALSO_* edges
  • Scripture Index:IndexTopic nodes, ALSO_CITES edges

All three complete before Stage 5 begins.

Stage 5: Commentary

Reads verse-commentary and scholarly-commentary files. Produces:

  • :VerseNote, :Commentary, :Section nodes
  • ANNOTATES, CITES edges

Depends on Stage 3 because commentary edges reference :Passage nodes.

Stage 6: Pending Resolution

Promotes :PendingPassage stubs to :Passage where the target now exists. Reports unresolvable references in the run report.

Cypher MERGE Patterns

All writes use MERGE for idempotency — running the pipeline twice produces the same graph:

Node MERGE

UNWIND $rows AS row
MERGE (p:Passage {id: row.id})
SET p.bookId = row.bookId,
p.chapter = row.chapter,
p.verse = row.verse,
p.text = row.text

Edge MERGE

UNWIND $rows AS row
MATCH (p:Passage {id: row.passage_id})
MATCH (w:Word {strongs: row.strongs})
MERGE (p)-[:ALIGNED_TO]->(w)

PendingPassage pattern

When a cross-reference targets a passage that hasn't been ingested yet:

MERGE (p:PendingPassage {id: $target_id})

Stage 6 resolves these by matching against actual :Passage nodes.

Batching Strategy

Writes use UNWIND batches to minimize FalkorDB round-trips:

ParameterDefaultPurpose
--node-batch500Rows per UNWIND for node creation
--edge-batch200Rows per UNWIND for edge creation

The batch writer in ingest/db/batch.py:

  1. Splits the row list into chunks of batch_size
  2. Executes one UNWIND query per chunk
  3. Retries up to 3 times on transient failures (exponential backoff)
  4. Returns the total number of rows processed
def batch_write(graph, query, rows, batch_size=500, max_retries=3):
total = 0
for i in range(0, len(rows), batch_size):
chunk = rows[i:i + batch_size]
# Execute UNWIND with retry logic
graph.query(query, {"rows": chunk})
total += len(chunk)
return total
tip

UNWIND batching is much faster than individual MERGE statements. A batch of 500 nodes executes as a single FalkorDB command.

Error Handling

Validation errors

Every source file is validated with Pydantic before any DB write. Files that fail validation are:

  1. Logged at ERROR level with the file path and validation error details
  2. Skipped (the pipeline continues with remaining files)
  3. Recorded in the run report's validation_errors array

Database errors

  • Transient failures (connection resets, timeouts) trigger automatic retry with exponential backoff (up to 3 attempts)
  • Persistent failures abort the current pipeline stage and are recorded in the report
  • The pipeline continues to subsequent stages if possible

Unresolvable references

Cross-references to passages that don't exist in the corpus are:

  1. Created as :PendingPassage stubs during ingestion
  2. Attempted for resolution in Stage 6
  3. Reported as unresolvable in the run report if no matching :Passage exists

Node ID Derivation

Node IDs are deterministically derived from the source data:

Node TypeID FormatExample
Passage{bookId}.{chapter}.{verse}gen.1.1
WordStrong's numberH0430
TGEntrytg:{slug}tg:angels
BDEntrybd:{slug}bd:aaron
IndexTopicidx:{slug}idx:faith

Deterministic IDs enable MERGE idempotency — the same source data always produces the same node ID.

Project Layout

services/ingest/src/ingest/
├── main.py # CLI entry point (Click)
├── runner.py # Stage orchestrator
├── logger.py # structlog configuration
├── models/ # Pydantic v2 models (one file per schema family)
├── db/
│ ├── client.py # FalkorDB connection management
│ ├── schema.py # Index creation (Stage 0)
│ ├── batch.py # UNWIND batch helpers
│ └── cypher.py # All MERGE query constants
├── registry.py # Book ID registry
├── ids.py # Node ID derivation functions
├── pipelines/
│ ├── base.py # BasePipeline ABC
│ ├── lexicon.py
│ ├── scripture_text.py
│ ├── topical_guide.py
│ ├── bible_dictionary.py
│ ├── scripture_index.py
│ ├── verse_commentary.py
│ └── scholarly.py
├── pending.py # Stage 6: pending reference resolution
├── report.py # Run report accumulation
└── loader.py # File discovery and Pydantic dispatch