
Adding Corpus Data

This guide walks through adding new source data to the GospeLib corpus, whether that is a new scripture text, a lexicon range, a commentary, or an entirely new schema family.

Before You Start

  1. Identify which schema family the data belongs to (see Corpus Data Model)
  2. Ensure you have the source material in a form you can convert to JSON
  3. Check data/book_registry.json to see if the book ID already exists

Adding a New Scripture Text

1. Choose the canonical book ID

Book IDs follow kebab-case slug format and must be unique across all corpus types:

Convention        Example
Short name        gen, exod, lev
Numbered prefix   1-ne, 2-ne, 3-ne
Hyphenated        w-of-m, 4-ezra
Pseudepigrapha    1-enoch, jub, apoc-ab
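A quick way to sanity-check a candidate ID before registering it is a small regular-expression test. The sketch below is only an illustration: the pattern (lowercase letters and digits in segments joined by single hyphens) is an assumed reading of "kebab-case slug", and `is_valid_book_id` is a hypothetical helper, not part of the ingest service.

```python
import re

# Assumed pattern: lowercase alphanumeric segments joined by single hyphens.
# This illustrates "kebab-case slug"; it is not the project's own regex.
BOOK_ID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_book_id(book_id: str) -> bool:
    """Return True if book_id looks like a kebab-case slug."""
    return BOOK_ID_PATTERN.fullmatch(book_id) is not None


for candidate in ["gen", "1-ne", "w-of-m", "apoc-ab", "1 Enoch", "Gen", "jub-"]:
    print(f"{candidate!r}: {is_valid_book_id(candidate)}")
```

The first four candidates pass; the last three fail because of a space, an uppercase letter, and a trailing hyphen respectively.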

2. Register the book

Add an entry to data/book_registry.json:

{
  "bookId": "your-book-id",
  "title": "Display Title",
  "abbreviation": "Abbr.",
  "corpus": "pseudepigrapha",
  "chapterCount": 10,
  "language": "en"
}
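Before adding the entry, it can help to confirm the ID is not already taken. The following is a minimal sketch that assumes the registry file is a JSON array of entries shaped like the one above; if the real file is keyed differently, the lookup would need adjusting. Both function names are hypothetical.

```python
import json
from pathlib import Path


def registered_book_ids(registry_path: Path) -> set[str]:
    """Collect bookId values, ASSUMING the registry is a JSON array of entries."""
    entries = json.loads(registry_path.read_text(encoding="utf-8"))
    return {entry["bookId"] for entry in entries}


def is_registered(book_id: str, registry_path: Path) -> bool:
    """Return True if book_id already appears in the registry file."""
    return book_id in registered_book_ids(registry_path)
```

For example, `is_registered("your-book-id", Path("data/book_registry.json"))` would report whether the ID is already in use.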

3. Create the JSON file

Create corpus/{bookId}.json following the scripture-text schema:

{
  "schema": "scripture-text",
  "version": "2.0.0",
  "bookId": "your-book-id",
  "title": "Display Title",
  "abbreviation": "Abbr.",
  "corpus": "pseudepigrapha",
  "language": "en",
  "chapters": [
    {
      "chapter": 1,
      "verses": [
        {
          "verse": 1,
          "text": "The text of the first verse."
        }
      ]
    }
  ]
}
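If the source material is already split into chapters and verses, the file can be generated rather than written by hand. The sketch below builds the structure shown above from a list of chapters, each a list of verse strings. The helper name `build_scripture_text` and its argument order are illustrative assumptions; only the field names come from the schema example.

```python
import json


def build_scripture_text(book_id, title, abbreviation, corpus, language, chapters):
    """Build a scripture-text document from a list of chapters.

    Each chapter is a list of verse strings in reading order; chapter and
    verse numbers are assigned sequentially starting at 1.
    """
    return {
        "schema": "scripture-text",
        "version": "2.0.0",
        "bookId": book_id,
        "title": title,
        "abbreviation": abbreviation,
        "corpus": corpus,
        "language": language,
        "chapters": [
            {
                "chapter": chapter_number,
                "verses": [
                    {"verse": verse_number, "text": text}
                    for verse_number, text in enumerate(verses, start=1)
                ],
            }
            for chapter_number, verses in enumerate(chapters, start=1)
        ],
    }


document = build_scripture_text(
    "your-book-id", "Display Title", "Abbr.", "pseudepigrapha", "en",
    [["The text of the first verse.", "The text of the second verse."]],
)
print(json.dumps(document, ensure_ascii=False, indent=2))
```

Numbering with `enumerate(..., start=1)` keeps chapter and verse numbers 1-based and sequential by construction.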

4. Add enrichments (optional)

Witnesses, interlinear words, and notes can be added to any verse:

{
  "verse": 1,
  "text": "English translation text",
  "witnesses": [
    {
      "language": "ethiopic",
      "script": "ethiopic",
      "text": "Source language text",
      "witness": "Manuscript sigla",
      "edition": "Critical edition reference"
    }
  ],
  "words": [
    {
      "order": 0,
      "gloss": "English gloss",
      "strongs": "H0001",
      "token": "Source token"
    }
  ]
}
Note: Strong's numbers must be normalized to a letter followed by a 4-digit zero-padded number, for example H0430 or G0056.
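Source data often carries Strong's numbers without padding (for example H430). A minimal normalization sketch follows; `normalize_strongs` is a hypothetical helper, and accepting lowercase prefixes and extra leading zeros on input is an assumption about what upstream data might look like.

```python
import re

# Accepts H or G (either case), optional extra leading zeros, then 1-4 digits.
_STRONGS_INPUT = re.compile(r"([HGhg])0*(\d{1,4})")


def normalize_strongs(raw: str) -> str:
    """Return the letter + 4-digit zero-padded form, e.g. 'H430' -> 'H0430'."""
    match = _STRONGS_INPUT.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not a recognizable Strong's number: {raw!r}")
    prefix, digits = match.groups()
    return f"{prefix.upper()}{int(digits):04d}"


print(normalize_strongs("H430"))   # H0430
print(normalize_strongs("g56"))    # G0056
print(normalize_strongs("H0001"))  # H0001
```

Inputs with more than four significant digits, or with a prefix other than H or G, raise `ValueError` instead of being silently truncated.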

5. Validate the file

cd services/ingest
uv run gospelib-ingest run --dry-run --log-level DEBUG

Fix any Pydantic validation errors before proceeding.

6. Ingest and verify

uv run gospelib-ingest run --only scripture-text

Adding Lexicon Entries

1. Create or extend a lexicon file

Files live at lexicon/{range}.json (e.g., lexicon/H0001-H1000.json):

{
  "schema": "lexicon",
  "version": "2.0.0",
  "language": "hebrew",
  "range": { "from": "H5001", "to": "H6000" },
  "entries": {
    "H5001": {
      "strongs": "H5001",
      "original": "...",
      "translit": "...",
      "pronunciation": "...",
      "pos": "noun.masculine",
      "posRaw": "Noun Masculine",
      "glosses": ["..."],
      "definition": { "short": "...", "senses": [] },
      "derivation": { "description": "...", "roots": [] }
    }
  }
}
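To find which file an entry belongs in, the range can be computed from the Strong's number. This sketch assumes every file covers a block of 1000 numbers starting at 1 (as in H0001-H1000 and H5001-H6000 above); the real corpus may split ranges differently, for instance with a shorter final file, so treat `lexicon_file_for` as a hypothetical illustration.

```python
def lexicon_file_for(strongs: str, block_size: int = 1000) -> str:
    """Return the assumed lexicon file path for a normalized Strong's number.

    ASSUMPTION: files cover fixed blocks of `block_size` numbers, starting at 1.
    """
    prefix, number = strongs[0], int(strongs[1:])
    start = ((number - 1) // block_size) * block_size + 1
    end = start + block_size - 1
    return f"lexicon/{prefix}{start:04d}-{prefix}{end:04d}.json"


print(lexicon_file_for("H0430"))  # lexicon/H0001-H1000.json
print(lexicon_file_for("H5001"))  # lexicon/H5001-H6000.json
```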

2. Validate and ingest

uv run gospelib-ingest run --only lexicon --dry-run
uv run gospelib-ingest run --only lexicon

Adding Topical Guide / Bible Dictionary Entries

These follow the alphabetical file pattern (tg/{letter}.json, bd/{letter}.json). Add entries to the appropriate letter file.
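As a rough illustration of the letter-file pattern, the helper below picks a file from an entry title. It assumes the letter is the lowercased first alphabetic character of the title; the guide does not state the exact rule, so `letter_file_for` is a hypothetical sketch.

```python
def letter_file_for(family_dir: str, entry_title: str) -> str:
    """Return '{family_dir}/{letter}.json' for an entry title.

    ASSUMPTION: the letter is the first alphabetic character, lowercased.
    """
    for char in entry_title:
        if char.isalpha():
            return f"{family_dir}/{char.lower()}.json"
    raise ValueError(f"no alphabetic character in title: {entry_title!r}")


print(letter_file_for("tg", "Faith"))  # tg/f.json
print(letter_file_for("bd", "Aaron"))  # bd/a.json
```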

Adding Commentary

Verse Commentary

Create files at commentary/{commentaryId}/{bookId}.json following the verse-commentary schema.

Scholarly Commentary

Create files at scholarly/{commentaryId}.json following the scholarly-commentary schema.
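The two path patterns above can be expressed as small helpers, which is handy when generating many files from a script. The function names and the sample commentary ID `example-commentary` are hypothetical; only the path templates come from this guide.

```python
from pathlib import PurePosixPath


def verse_commentary_path(commentary_id: str, book_id: str) -> PurePosixPath:
    """Path for a verse-commentary file: commentary/{commentaryId}/{bookId}.json."""
    return PurePosixPath("commentary") / commentary_id / f"{book_id}.json"


def scholarly_commentary_path(commentary_id: str) -> PurePosixPath:
    """Path for a scholarly-commentary file: scholarly/{commentaryId}.json."""
    return PurePosixPath("scholarly") / f"{commentary_id}.json"


print(verse_commentary_path("example-commentary", "gen"))
print(scholarly_commentary_path("example-commentary"))
```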

Validation Rules

All corpus data is validated against these rules:

  • Every file must have a schema field matching one of the seven family names
  • Every bookId must exist in data/book_registry.json
  • Verse numbers are 1-based and sequential within a chapter
  • Chapter numbers are 1-based and sequential within a book
  • PassageRef objects must reference valid book IDs
  • Strong's numbers must match [HG]\d{4} format
  • Pydantic models use extra="forbid" — unexpected fields cause validation failure
  • Optional fields must be absent (not null or empty string) when not populated
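Several of these rules can be checked locally before running the ingest dry run. The sketch below uses only the standard library and is not the service's Pydantic validation: it covers the numbering, Strong's format, and empty-field rules for a scripture-text document, and it treats any null or empty-string value on a verse as a problem, which is a simplifying assumption. `find_problems` is a hypothetical helper.

```python
import re

STRONGS_PATTERN = re.compile(r"[HG]\d{4}")


def is_sequential(numbers):
    """True if numbers are exactly 1, 2, 3, ... in order."""
    return list(numbers) == list(range(1, len(numbers) + 1))


def find_problems(document):
    """Return a list of human-readable rule violations (empty if none found)."""
    problems = []
    chapters = document.get("chapters", [])
    if not is_sequential([c["chapter"] for c in chapters]):
        problems.append("chapter numbers are not 1-based and sequential")
    for chapter in chapters:
        label = f"chapter {chapter['chapter']}"
        verses = chapter.get("verses", [])
        if not is_sequential([v["verse"] for v in verses]):
            problems.append(f"{label}: verse numbers are not 1-based and sequential")
        for verse in verses:
            where = f"{label}, verse {verse['verse']}"
            # Simplification: flag every null or empty-string value on the verse.
            for key, value in verse.items():
                if value is None or value == "":
                    problems.append(f"{where}: field {key!r} is null or empty")
            for word in verse.get("words") or []:
                strongs = word.get("strongs")
                if strongs is not None and not STRONGS_PATTERN.fullmatch(strongs):
                    problems.append(f"{where}: bad Strong's number {strongs!r}")
    return problems
```

Calling `find_problems` on a loaded JSON document returns an empty list when these particular checks pass. It does not replace the dry run, which also enforces the schema, registry, and PassageRef rules.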

Checklist

  • Book ID follows kebab-case slug convention
  • Book registered in data/book_registry.json (for new books)
  • JSON file follows the correct schema family structure
  • schema field is present and matches the family name
  • All PassageRef targets reference valid book IDs
  • Strong's numbers are zero-padded (H0430, not H430)
  • Dry run passes with no Pydantic validation errors
  • Full ingest succeeds and run report shows no errors
  • Content is queryable via the content service API