
M14: Corpus Harmonization — schemas/v0.1.0

Version tag: schemas/v0.1.0
Phase: P0: Foundation
Target: Weeks 9-14
Sprints: S3, S4, S5


Phase Context

Goal: Eliminate schema drift between the corpus downloader, ingest pipeline, and GOSPELIB-SCHEMAS.md by extracting shared Pydantic models into a single package, implementing the cross-reference architecture, and closing all schema/pipeline gaps.

Key constraint: Phase 1 (shared models package) is the linchpin: every subsequent phase depends on it. The existing ingest pipeline must continue working throughout.


ZenHub Configuration

Field | Value
Milestone | M14: Corpus Harmonization
Due Date | 2026-06-14
Default Pipeline | Product Backlog
Primary Epic(s) | Schema Foundation & Critical Fixes, Ingest Core Pipeline, Corpus Validation

Prerequisites

  • M01: Data Pipeline (ingest pipeline operational, all 7 schema family models exist, FalkorDB client working)

Epic: Schema Foundation & Critical Fixes

Extract shared Pydantic models into packages/schemas/, align downloader models, add missing schema families.

Issues

Issue | Title | Status | Notes
M14-001 | Create gospelib-schemas Package Scaffold | ✅ Done | packages/schemas with pyproject.toml, project.json, py.typed
M14-002 | Migrate Shared Types to gospelib-schemas | ✅ Done | 22+ exported types in gospelib_schemas/shared.py
M14-003 | Migrate Scripture Text Models to gospelib-schemas | ✅ Done | AnnotationBlock, ReferenceBlock, VerseNote, WordAlignment, Verse, Chapter
M14-004 | Migrate Lexicon Models to gospelib-schemas | ✅ Done | LexiconEntry, LexiconFile, LexiconRange
M14-005 | Migrate Remaining Schema Models to gospelib-schemas | ✅ Done | 8 domain model files implemented
M14-006 | Update Ingest and Downloader to Import from gospelib-schemas | ✅ Done | 12 ingest model files re-export from gospelib_schemas
M14-007 | Define Cross-References Pydantic Models | ✅ Done | CrossRefKind, CrossRefSource, CrossRefAnchor, CrossReference
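The four cross-reference types named in M14-007 could relate to each other roughly as sketched below. This is an illustration only, not the package's actual definitions: it uses standard-library dataclasses as a stand-in for the Pydantic models, and every field name, enum member, and the 0 to 1 relevance range is an assumption. Only the type names and the edge properties sourceId, kind, and relevance (listed under M14-014) come from this page.

```python
from dataclasses import dataclass
from enum import Enum


class CrossRefKind(str, Enum):
    # Hypothetical members; the real enum values are not listed on this page.
    QUOTATION = "quotation"
    ALLUSION = "allusion"
    PARALLEL = "parallel"


@dataclass(frozen=True)
class CrossRefSource:
    source_id: str  # assumed: identifier of the dataset a reference came from
    name: str       # assumed: human-readable label


@dataclass(frozen=True)
class CrossRefAnchor:
    # Assumed verse-level addressing; the real anchor may be richer.
    book: str
    chapter: int
    verse: int

    def key(self) -> str:
        return f"{self.book}.{self.chapter}.{self.verse}"


@dataclass(frozen=True)
class CrossReference:
    origin: CrossRefAnchor
    target: CrossRefAnchor
    kind: CrossRefKind
    source: CrossRefSource
    relevance: float = 1.0  # assumed range: 0.0 to 1.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.relevance <= 1.0:
            raise ValueError("relevance must be between 0 and 1")

    def edge_properties(self) -> dict:
        # Property names mirror the CROSS_REF edge fields listed under M14-014.
        return {
            "sourceId": self.source.source_id,
            "kind": self.kind.value,
            "relevance": self.relevance,
        }


ref = CrossReference(
    origin=CrossRefAnchor("john", 1, 1),
    target=CrossRefAnchor("genesis", 1, 1),
    kind=CrossRefKind.ALLUSION,
    source=CrossRefSource("example-source", "Example Source"),
    relevance=0.9,
)
print(ref.edge_properties())
```

Keeping the source on every reference is what would let an edge carry attribution without a second lookup.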

Epic: Ingest Core Pipeline

Align downloader models, implement cross-reference pipeline, add remaining pipelines.

Issues

Issue | Title | Status | Notes
M14-008 | Add Cross-References Schema to GOSPELIB-SCHEMAS.md | ✅ Done | GOSPELIB-SCHEMAS.md section 14 fully documented
M14-009 | Align Downloader VerseNote Structure | ✅ Done | PR #960
M14-010 | Align Downloader LexiconEntry Structure | ✅ Done | PR #960
M14-011 | Align Downloader PassageRef and WordAlignment | ✅ Done | PR #960
M14-012 | Align Downloader Book Metadata and SeeAlsoLink | ✅ Done | PR #960
M14-013 | Implement Cross-Reference Ingest Pipeline (Stage 3.5) | ✅ Done | pipelines/cross_references.py, registered as stage 3.5
M14-014 | Add Source Attribution to Existing CROSS_REF Edges | ✅ Done | sourceId, kind, relevance fields on edges
M14-015 | Register Cross-Reference Pipeline in Orchestrator | ✅ Done | runner.py STAGES map updated
M14-016 | Extract LDS Footnote Cross-References to Standalone Files | ✅ Done | tools/scripts/extract-cross-refs.py with Click CLI. PR #970
M14-017 | Implement Proper Names and Versification Models | ✅ Done | gospelib_schemas/proper_names.py, versification.py
M14-018 | Implement Morphology Codes and Theographic Models | ✅ Done | gospelib_schemas/morphology_codes.py, theographic.py
M14-019 | Add Missing Schema Families to GOSPELIB-SCHEMAS.md | ✅ Done | Sections 15-18 cover 4 new schema families
M14-020 | Implement Remaining Ingest Pipelines | ✅ Done | proper_names, versification, theographic pipeline stages
M14-021 | Church Content Adapter Decomposition | ✅ Done | 5 providers, 60 unit tests. PR #970
M14-022 | Update Documentation for Harmonization Changes | ✅ Done | GOSPELIB-SCHEMAS.md + apps/docs schema landscape
M14-023 | Witness-Source Ingest Pipeline | ✅ Done | Pydantic models, Dagster assets, 6 sources. PR #1326
M14-024 | Commentary-Source Ingest Pipeline | ✅ Done | CommentarySource model, 56 sources, count-parity tests. PR #1343

Progress: 24 Done · 0 Partial · 0 To Do (100%)
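M14-013 and M14-015 describe registering the cross-reference pipeline as stage 3.5 in the runner.py STAGES map. One way such a registry could work is sketched below, assuming the map is keyed by a numeric stage so that a fractional stage slots between existing ones without renumbering. The key type, the neighbouring stages, and the function names are all hypothetical; the real STAGES map may be shaped differently.

```python
from typing import Callable, Dict, List


# Hypothetical placeholder stages; the real stage 3 and stage 4 are not named here.
def run_stage_three() -> str:
    return "stage_3"


def run_cross_references() -> str:
    return "cross_references"


def run_stage_four() -> str:
    return "stage_4"


# Assumed shape: stage number -> callable. Insertion order is deliberately
# scrambled to show that execution order comes from the keys, not the dict.
STAGES: Dict[float, Callable[[], str]] = {
    4.0: run_stage_four,
    3.0: run_stage_three,
    3.5: run_cross_references,
}


def run_all(stages: Dict[float, Callable[[], str]]) -> List[str]:
    """Run every registered stage in ascending stage-number order."""
    return [stages[number]() for number in sorted(stages)]


print(run_all(STAGES))
```

Sorting the keys at run time means a new stage only needs one added entry, which matches the constraint that the existing ingest pipeline keeps working throughout.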


Release Info

Release | Tag | Contains
schemas/v0.1.0 | schemas/v0.1.0 | Shared models package, cross-reference schema, aligned downloader, updated docs

Relevant Risks

Risk | Impact | Mitigation
uv path resolution fails across workspace | Blocks shared package usage | Test with uv pip install -e before automating
Downloader drivers break on required fields | Blocks corpus download | Make fields optional during transition, tighten later
New pipelines depend on nonexistent corpus data | Pipeline stages fail | Gate behind file existence checks
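The last mitigation, gating new pipeline stages behind file existence checks, might look something like the following. It is a minimal sketch under stated assumptions: run_if_present and count_lines are hypothetical helpers, and the file names are examples, not the project's actual corpus layout.

```python
from pathlib import Path
from typing import Callable, Optional


def run_if_present(corpus_file: Path, stage: Callable[[Path], int]) -> Optional[int]:
    """Run `stage` only when its input file exists; return None when skipped."""
    if not corpus_file.is_file():
        # Skipping (rather than raising) keeps the rest of the run alive
        # when optional corpus data has not been downloaded yet.
        print(f"skipping stage: {corpus_file} not found")
        return None
    return stage(corpus_file)


def count_lines(path: Path) -> int:
    """Stand-in for a real pipeline stage: counts lines in the input file."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Example call with a hypothetical corpus path.
result = run_if_present(Path("corpus/proper_names.jsonl"), count_lines)
```

Returning None for a skipped stage lets the caller distinguish "no data yet" from "ran and processed zero records".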