M01: Data Pipeline — v0.1.0-alpha

Version tag: ingest/v0.1.0-alpha
Phase: P0: Foundation
Target: Weeks 3–8
Sprints: S1, S2, S3


Phase Context

Goal: Corpus data flows from JSON files through the ingest pipeline into FalkorDB, and the content service serves passage data over HTTP.

Key constraint: Every downstream feature depends on this milestone. The reader, interlinear, and graph view cannot be built without a working data layer.


ZenHub Configuration

| Field | Value |
| --- | --- |
| Milestone | M01: Data Pipeline |
| Due Date | 2026-05-03 |
| Default Pipeline | Product Backlog |
| Primary Epic(s) | Ingest Core Pipeline, Ingest Test Suite, Corpus Validation |

Prerequisites

  • M00: Tech Prep — testing infrastructure, dev environment validation, corpus v1→v2 migration verified, structured logging configured

Epic: Ingest Core Pipeline

Implement the 7-stage ingest pipeline per GOSPELIB-INGEST-SPEC.md.

| Story Area | Scope | Spec Reference |
| --- | --- | --- |
| Pydantic models | All 7 schema families + shared types | GOSPELIB-SCHEMAS.md § Schema Families |
| Book registry | Load + validate data/book_registry.json | GOSPELIB-INGEST-SPEC.md § Stage 1 |
| ID generators | All 10 ids.py functions | GOSPELIB-INGEST-SPEC.md § ID Derivation |
| File loader | Schema-dispatched Pydantic validation | GOSPELIB-INGEST-SPEC.md § loader.py |
| FalkorDB client | Connection pool, batch write helpers, retry | GOSPELIB-INGEST-SPEC.md § Batch Strategy |
| Cypher constants | All MERGE queries for 12 node types + 16 edge types | GOSPELIB-INGEST-SPEC.md § Cypher Constants |
| Index creation | Stage 0 — primary + secondary indices | GOSPELIB-INGEST-SPEC.md § Index Schema |
| Lexicon pipeline | Stage 2 — :Word nodes, DERIVES_FROM, RELATED_TO | GOSPELIB-INGEST-SPEC.md § Stage 2 |
| Scripture text pipeline | Stage 3 — :Passage, :Witness, :WordAlignment + all edges | GOSPELIB-INGEST-SPEC.md § Stage 3 |
| TG/BD/Index pipelines | Stage 4 — concurrent file I/O, sequential writes | GOSPELIB-INGEST-SPEC.md § Stage 4 |
| Commentary pipelines | Stage 5 — verse + scholarly commentary | GOSPELIB-INGEST-SPEC.md § Stage 5 |
| Pending resolution | Stage 6 — promote :PendingPassage stubs | GOSPELIB-INGEST-SPEC.md § Stage 6 |
| Pipeline runner | IngestRunner orchestrating stages 0–6 + report | GOSPELIB-INGEST-SPEC.md § Runner |
| CLI interface | Click commands: run, --dry-run, --only, --reset | GOSPELIB-INGEST-SPEC.md § CLI |
| Run report | JSON report with timing, counts, errors | GOSPELIB-INGEST-SPEC.md § Report |
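The Stage 4 scope above pairs concurrent file I/O with sequential writes. A minimal sketch of that pattern follows, using only the standard library. The function names, the `id` field, and the in-memory `sink` list that stands in for FalkorDB are all assumptions made for illustration; the actual pipeline is defined in GOSPELIB-INGEST-SPEC.md § Stage 4.

```python
import asyncio
import json
import tempfile
from pathlib import Path


def read_json(path: Path) -> dict:
    """Blocking read of one corpus file."""
    return json.loads(path.read_text(encoding="utf-8"))


async def load_all(paths: list[Path]) -> list[dict]:
    """Read every file concurrently by moving blocking I/O onto worker threads.

    asyncio.gather returns results in the same order as its arguments.
    """
    return await asyncio.gather(*(asyncio.to_thread(read_json, p) for p in paths))


async def write_sequentially(docs: list[dict], sink: list[str]) -> None:
    """Write one document at a time; `sink` is a hypothetical stand-in for the graph."""
    for doc in docs:
        sink.append(doc["id"])  # a real pipeline would await a batch write here


async def run_stage(paths: list[Path]) -> list[str]:
    sink: list[str] = []
    docs = await load_all(paths)          # concurrent file I/O
    await write_sequentially(docs, sink)  # ordered, one-at-a-time writes
    return sink


def run_demo() -> list[str]:
    """Create three tiny JSON files and push them through the stage."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name in ("tg", "bd", "index"):
            path = Path(tmp) / f"{name}.json"
            path.write_text(json.dumps({"id": name}), encoding="utf-8")
            paths.append(path)
        return asyncio.run(run_stage(paths))
```

Keeping the writes in a single loop means the write order is deterministic even though the reads complete in arbitrary order.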

Issues

| ID | Title | Status | Notes |
| --- | --- | --- | --- |
| M01-001 | Implement Pydantic v2 Models for All 7 Schema Families | ✅ Done | Re-exports from gospelib-schemas is the intended design; models/ package covers all 7 schema families |
| M01-002 | Implement Book Registry Loader | ✅ Done | BookRegistry class with load/resolve/is_known methods |
| M01-003 | Implement ID Derivation Functions | ✅ Done | All 12 ID functions in ids.py |
| M01-004 | Implement Schema-Dispatched File Loader | ✅ Done | loader.py (150+ lines) with _SCHEMA_MAP dispatch dict, discover_files(), FileLoadError |
| M01-005 | Implement FalkorDB Client, Batch Writer, and Retry Logic | ✅ Done | db/client.py + db/batch.py with async connection pooling, cache-aside, and retry logic |
| M01-006 | Implement Cypher MERGE Constants for All Node and Edge Types | ✅ Done | db/cypher.py (517 lines) with MERGE templates for all 12 node types and 16+ edge types |
| M01-007 | Implement FalkorDB Index and Schema Creation (Stage 0) | ✅ Done | db/schema.py (200+ lines) with primary and secondary index creation |
| M01-008 | Implement Lexicon Pipeline (Stage 2) | ✅ Done | pipelines/lexicon.py with Word nodes, DERIVES_FROM/RELATED_TO edges |
| M01-009 | Implement Scripture Text Pipeline (Stage 3) | ✅ Done | pipelines/scripture_text.py with Passage/Witness/WordAlignment processing |
| M01-010 | Implement TG, BD, and Scripture Index Pipelines (Stage 4) | ✅ Done | pipelines/topical_guide.py, bible_dictionary.py, scripture_index.py |
| M01-011 | Implement Commentary Pipelines (Stage 5) | ✅ Done | pipelines/verse_commentary.py and scholarly.py |
| M01-012 | Implement Pending Reference Resolution (Stage 6) | ✅ Done | pipelines/pending_resolution.py + pending.py |
| M01-013 | Implement Pipeline Runner (Stage Orchestrator) | ✅ Done | runner.py (660 lines) with STAGES map and run_pipeline() orchestration |
| M01-014 | Implement CLI Interface with Click | ✅ Done | main.py with Click run command, --dry-run / --only / --reset flags |
| M01-015 | Implement Ingest Run Report | ✅ Done | report.py with IngestReport, timing, counts, errors, write() |
| M01-016 | Implement Unit Tests for Registry, IDs, Models, and Batch Helpers | ✅ Done | 12+ unit test files: test_registry.py, test_ids.py, test_models.py, test_loader.py, test_cypher.py, test_batch.py, test_client.py, etc. |
| M01-017 | Implement Integration Tests for All Pipelines | ✅ Done | 14 integration test files covering every pipeline stage plus idempotency and graph writes |
| M01-018 | Create Minimal Test Fixtures for All 7 Schema Families | ✅ Done | data/fixtures/ with bd/, commentary/, corpus/, cross-references/, index/, lexicon/, scholarly/, tg/ |
| M01-019 | Implement Smoke Test for Full Pipeline | ✅ Done | tests/smoke_test.py |
| M01-020 | Corpus Schema Compliance Audit | ✅ Done | Audit CLI command and schema compliance validation implemented |
| M01-021 | Data Quality Checks (Cross-Refs, Strong's, Encoding) | ✅ Done | Data quality checks for orphaned Strong's numbers, missing cross-refs, and encoding issues implemented |
| M01-022 | Generate Test Fixtures from Real Corpus Files | ✅ Done | Script to auto-extract minimal fixtures from live corpus files implemented |

Progress: 22 Done · 0 Partial · 0 To Do (100%)
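The notes for M01-004 mention a `_SCHEMA_MAP` dispatch dict, `discover_files()`, and `FileLoadError`. The sketch below illustrates how schema-dispatched loading of that kind could work. It is not the contents of loader.py: plain dataclasses are swapped in for the Pydantic models so the example runs on the standard library alone, and the `schema`/`data` envelope keys, the two entry types, and `load_file()` are hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable


class FileLoadError(Exception):
    """Raised when a file cannot be parsed or matched to a schema."""


@dataclass
class LexiconEntry:  # hypothetical stand-in for a Pydantic model
    strongs: str
    lemma: str


@dataclass
class TopicEntry:  # hypothetical stand-in for a Pydantic model
    topic: str
    refs: list


# Maps a schema tag found in the file to the constructor that validates it.
_SCHEMA_MAP: dict[str, Callable[..., Any]] = {
    "lexicon": LexiconEntry,
    "tg": TopicEntry,
}


def discover_files(root: Path) -> list[Path]:
    """Return every JSON file under `root` in a stable, sorted order."""
    return sorted(root.rglob("*.json"))


def load_file(path: Path) -> Any:
    """Parse one JSON file and build the model named by its schema tag."""
    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise FileLoadError(f"{path.name}: invalid JSON") from exc
    schema = raw.get("schema") if isinstance(raw, dict) else None
    model = _SCHEMA_MAP.get(schema) if isinstance(schema, str) else None
    if model is None:
        raise FileLoadError(f"{path.name}: unknown schema {schema!r}")
    try:
        return model(**raw["data"])
    except (KeyError, TypeError) as exc:
        # Missing "data", or fields that do not match the model's signature.
        raise FileLoadError(f"{path.name}: does not match schema {schema!r}") from exc
```

Wrapping every failure in a single exception type lets a caller collect per-file errors for a run report without aborting the whole stage.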


Epic: Ingest Test Suite

| Story Area | Scope | Spec Reference |
| --- | --- | --- |
| Unit tests | Registry, ID derivation, Pydantic models (valid + invalid), batch helpers | GOSPELIB-INGEST-SPEC.md § Testing |
| Integration tests | Each pipeline against real FalkorDB (testcontainers) | GOSPELIB-INGEST-SPEC.md § Testing |
| Test fixtures | Minimal corpus files in data/fixtures/ | GOSPELIB-SCHEMAS.md |
| Smoke test | Full pipeline on all fixtures | GOSPELIB-INGEST-SPEC.md § Testing |

Issues

(Issues for this epic are M01-016 through M01-019 in the main M01 issue list above.)


Epic: Corpus Validation

| Story Area | Scope | Spec Reference |
| --- | --- | --- |
| Schema compliance audit | Validate every corpus JSON file against Pydantic models | GOSPELIB-SCHEMAS.md |
| Data quality checks | Missing cross-refs, orphaned Strong's numbers, encoding issues | GOSPELIB-SCHEMAS.md § Cross-References |
| Fixture generation | Create minimal test fixtures from real corpus files | GOSPELIB-SCHEMAS.md |
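Two of the data quality checks named above reduce to simple set and string operations. The sketch below shows one possible shape for them. The record layout (a `strongs` key on each alignment) and both function names are assumptions for illustration, not the implemented checks.

```python
def find_orphaned_strongs(alignments: list[dict], lexicon_ids: set[str]) -> list[str]:
    """Return Strong's numbers that alignments reference but the lexicon lacks."""
    referenced = {a["strongs"] for a in alignments if a.get("strongs")}
    return sorted(referenced - lexicon_ids)


def has_replacement_char(text: str) -> bool:
    """Flag text containing U+FFFD, the marker left behind by a failed decode."""
    return "\ufffd" in text
```

For example, an alignment list that references `G26` and `H9999` against a lexicon that holds only `G26` would report `H9999` as orphaned.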

Issues

(Issues for this epic are M01-020 through M01-022 in the main M01 issue list above.)


Document References

| Doc | Contains | Use When Writing Stories For |
| --- | --- | --- |
| MVP.md | Feature scope, tier breakdown, success criteria, budget | Acceptance criteria, scope boundaries |
| TECH-SPEC.md | Architecture, service boundaries, data stores, API catalog | Technical implementation details |
| GOSPELIB-SCHEMAS.md | All 7 schema families, node/edge types, validation rules | Data models, Pydantic models, graph schema |
| GOSPELIB-INGEST-SPEC.md | 7-stage pipeline, Cypher templates, batch strategy, CLI | Ingest pipeline stories |
| REPO-MAP.md | Directory structure, naming conventions, dependency rules | All stories (coding standards) |
| Business | LEGAL.md, POLICY-TERMS.md, executive summary, market research, GTM | Launch readiness, legal/compliance stories |

Sprint Mapping

| Sprint | Weeks | Primary Focus |
| --- | --- | --- |
| S1 | 3–4 | Pydantic models, book registry, ID generators, file loader |
| S2 | 5–6 | FalkorDB client, Cypher constants, Stage 0–2 (indices + lexicon) |
| S3 | 7–8 | Stages 3–6 (scripture text → pending resolution), CLI, report |

Sprint Load Warnings

No explicit load warnings are recorded for S1–S3. S3 is nonetheless the densest sprint in M01: it covers Stages 3–6 plus the CLI and run report in two weeks, completing all remaining pipeline stages.


Release Info

| Release | Tag | Contains |
| --- | --- | --- |
| v0.1.0-alpha | ingest/v0.1.0-alpha | Full ingest pipeline operational — complete corpus ingested into local FalkorDB |

Relevant Risks

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Ingest pipeline data quality issues | Blocks all downstream features | Corpus validation epic in P0; dry-run + schema enforcement |
| FalkorDB performance at corpus scale | Slow content API, bad UX | Benchmark after M01; index tuning; caching layer in M02 |
| Missing specification documents | GOSPELIB-MIGRATION-SPEC.md not written | Document corpus migration process before M01 |

Cross-Cutting Concerns

Testing

| Layer | Framework | When | Spec Reference |
| --- | --- | --- | --- |
| Python unit/integration | pytest + testcontainers | Every PR | GOSPELIB-INGEST-SPEC.md § Testing |

Documentation

| Doc | Update Trigger |
| --- | --- |
| Getting Started | M01 complete — document local setup |
| Running Ingest | M01 complete — document pipeline operation |
| ADRs | Each major technical decision |

CI/CD

| Addition | Detail |
| --- | --- |
| Python test containers in CI | FalkorDB + PostgreSQL service containers for ingest integration tests |

Issue Dependency Graph

Foundation (S1, no blockers):
M01-001 (Models) M01-002 (Registry) M01-003 (IDs)
M01-005 (DB Client) M01-006 (Cypher) M01-015 (Report)

Infrastructure (S2):
M01-005 ──► M01-007 (Indices)
M01-001 ──► M01-004 (Loader)

Pipelines (S2–S3):
M01-001 ─┐
M01-005 ─┼──► M01-008 (Lexicon) ──► M01-009 (Scripture) ──► M01-011 (Commentary) ──┐
M01-006 ─┘                                                                         │
M01-001 ─┐                                                                         │
M01-003 ─┼──► M01-010 (TG/BD/Index) ───────────────────────────────────────────────┤
M01-005 ─┘                                                                         │

M01-002 ──────────────────────────────────────────────────────► M01-012 (Pending)

Orchestration (S3):
M01-004 ─┐
M01-007 ─┤
M01-008 ─┤
M01-009 ─┼──► M01-013 (Runner) ──► M01-014 (CLI)
M01-010 ─┤         ▲
M01-011 ─┤         │
M01-012 ─┘    M01-015 (Report)

Testing (S1–S3):
M01-001 ──► M01-018 (Fixtures) ──► M01-017 (Integration) ──► M01-019 (Smoke)
M01-022 (Fixture Gen) ──► M01-018 │ ▲
└─────────────────► M01-019
M01-016 (Unit Tests) — independent of pipeline issues
M01-013 (Runner) ──► M01-019

Validation (S2):
M01-001 ──► M01-020 (Schema Audit) ──► M01-021 (Quality Checks)
M01-004 ──► M01-020

Legend: A ──► B means A blocks B (B is blocked by A)
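Because every edge in the graph means "A blocks B", a valid working order is any topological ordering of the issues. The sketch below derives one with the standard library's `graphlib`. The `BLOCKED_BY` mapping transcribes only the Infrastructure, Pipelines, and Orchestration edges that are unambiguous in the graph above; the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is blocked by the issues in its set (A ──► B becomes B: {A}).
BLOCKED_BY = {
    "M01-004": {"M01-001"},
    "M01-007": {"M01-005"},
    "M01-008": {"M01-001", "M01-005", "M01-006"},
    "M01-009": {"M01-008"},
    "M01-010": {"M01-001", "M01-003", "M01-005"},
    "M01-011": {"M01-009"},
    "M01-012": {"M01-002"},
    "M01-013": {
        "M01-004", "M01-007", "M01-008", "M01-009",
        "M01-010", "M01-011", "M01-012", "M01-015",
    },
    "M01-014": {"M01-013"},
}


def build_order(blocked_by: dict[str, set[str]]) -> list[str]:
    """Return one ordering in which every issue follows all of its blockers.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    return list(TopologicalSorter(blocked_by).static_order())


def respects_blockers(order: list[str], blocked_by: dict[str, set[str]]) -> bool:
    """Check that each blocker appears before the issue it blocks."""
    position = {issue: i for i, issue in enumerate(order)}
    return all(
        position[blocker] < position[issue]
        for issue, blockers in blocked_by.items()
        for blocker in blockers
    )
```

Several orderings are usually valid, so the result is best checked with `respects_blockers()` rather than compared against a fixed list.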


Dependencies

Upstream (what M01 needs)

  • M00: Tech Prep — testing infrastructure, dev environment validation, corpus v1→v2 migration, structured logging

Downstream (what depends on M01)

  • M02: Content API — depends on the M01 FalkorDB client and the scripture text data ingested in Stage 3
  • M05: Search & Staging — Typesense sync extends the ingest pipeline built here
  • M06: Interlinear & Lexicon — depends on word alignment + lexicon data ingested here
  • M08: Knowledge Graph — depends on cross-ref + topical guide edges ingested here

Summary

| Metric | Count |
| --- | --- |
| Total Issues | 22 |
| Sub-Issues | 3 |
| Total Estimate (pts) | 102 |
| Sprints | S1–S3 |
| Dependencies (blocking) | 61 |
| Dependencies (blocked by) | 56 |