Sources Not Found / Remaining Gaps

This section lists data gaps that available open-source datasets do not yet fully resolve, with the current status of each and potential paths forward.

Dead Sea Scrolls

Status: Viable open dataset identified — ETCBC/dss provides word-level transcriptions under MIT license.

The Dead Sea Scrolls are now accessible through:

  • ETCBC/dss (GitHub) — MIT license, Text-Fabric format, word-level transcriptions with morphological and linguistic annotations. Based on Martin Abegg's transcription data. 34 releases, actively maintained by VU Amsterdam (CACCHT). This is the primary viable source for structured DSS data.
  • SQE Database (Scripta Qumranica Electronica) — MIT license, MariaDB format. Focused on fragment reconstruction rather than reading text. Useful as supplementary metadata but not a primary reading source.
  • Leon Levy Dead Sea Scrolls Digital Library (deadseascrolls.org.il) — high-resolution images, not structured text. Useful for visual reference only.

The scrolls' fragmentary nature means incomplete verse coverage is inherent to the source material. Best suited for Phase 3 advanced scholarly features.

Aramaic Lexicon

Status: No single comprehensive source, but a composite approach provides substantial coverage.

Four sources together cover the Aramaic vocabulary spectrum relevant to biblical studies:

  • SEDRA IV (Apache 2.0) — structured REST API with 3,465 roots, 35,890 lexemes, 65,000 words covering Syriac/Aramaic. Best structured open Aramaic data available.
  • Sefaria Jastrow Dictionary (CC-BY-NC) — ~15K+ Talmudic/Rabbinic Aramaic entries. Gold standard for Rabbinic Aramaic. Note: CC-BY-NC may conflict with GospeLib's subscription model, though the original 1903 Jastrow text is public domain.
  • Sefaria BDB Aramaic (CC-BY-NC) — ~269 Biblical Aramaic entries from Brown-Driver-Briggs.
  • Existing Strong's entries — GospeLib's Hebrew lexicon already includes Strong's H entries that cover some Aramaic vocabulary (Daniel, Ezra).

No single source covers all Aramaic dialects, but the composite approach (SEDRA IV + existing Strong's entries, with Sefaria as supplementary pending license review) fills the gap substantially.

Extended Commentary

Status: Structured verse-aligned commentary data available via SWORD modules — requires SWORD→OSIS→JSON conversion pipeline.

CrossWire SWORD modules provide roughly ten English public-domain commentaries in a verse-aligned format, most of them covering the whole Bible:

  • Matthew Henry (Complete), John Gill (Exposition), Albert Barnes (Notes), Jamieson-Fausset-Brown, Adam Clarke, John Wesley (Notes), Matthew Poole, Keil & Delitzsch (OT), Geneva Study Bible Notes
  • Exportable via mod2osis CLI tool → OSIS XML → JSON
  • Verse references in OSIS XML are standard and mappable to GospeLib passage IDs
  • Every major open-source Bible app (AndBible, BibleTime, Xiphos) uses SWORD modules — the tooling is mature
  • Engineering effort: ~2–4 hours for the conversion pipeline

Historical commentaries also exist in unstructured form on Project Gutenberg and CCEL, but the SWORD modules are the most efficient path to structured, verse-aligned commentary data.
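The OSIS XML → JSON step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: it assumes each commentary entry in the exported XML is an element carrying an `annotateRef` attribute with an OSIS reference such as `Gen.1.1`. The sample document, the `osis_commentary_to_records` helper, and the record fields are all hypothetical, and real mod2osis output should be inspected before relying on this layout.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; real exported OSIS may be structured differently.
SAMPLE_OSIS = """<osis>
  <osisText osisIDWork="Sample">
    <div type="section" annotateType="commentary" annotateRef="Gen.1.1">
      <p>In the beginning: a note on <hi type="italic">creation</hi>.</p>
    </div>
    <div type="section" annotateType="commentary" annotateRef="Gen.1.2">
      <p>A note on the second verse.</p>
    </div>
  </osisText>
</osis>"""


def osis_commentary_to_records(osis_xml: str, commentary_id: str) -> list:
    """Return one dict per element that carries a verse reference."""
    root = ET.fromstring(osis_xml)
    records = []
    for elem in root.iter():
        # Assumption: the verse reference lives in an annotateRef attribute.
        ref = elem.get("annotateRef")
        if ref is None:
            continue
        # Flatten nested markup to plain text and collapse whitespace.
        text = " ".join("".join(elem.itertext()).split())
        records.append({"commentary": commentary_id, "ref": ref, "text": text})
    return records


records = osis_commentary_to_records(SAMPLE_OSIS, "sample")
print(json.dumps(records, indent=2))
```

Mapping the `ref` values to GospeLib passage IDs would be a separate lookup step and is not shown here.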

Transliteration

Status: No external transliteration dataset found, but algorithmically solvable.

Transliteration (Hebrew/Greek → Latin characters) can be generated algorithmically from the source text and morphological data using standard transliteration schemes (SBL Hebrew, SBL Greek). This does not require an external data source, only a transliteration function in the content service or ingest pipeline. Custom Hebrew/Greek mapping tables are the usual basis for such a function. The unidecode Python library can serve as a rough ASCII fallback, but it does not implement the SBL schemes.
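As a rough illustration of the mapping-table approach, the sketch below transliterates Greek words in the general style of the SBL scheme. It is deliberately simplified and is an assumption about how such a function might look, not an existing GospeLib component. It handles accents, rough breathing, nasal gamma, and upsilon after α, ε, η, ο, but ignores other rules (for example, a rough breathing written on the second vowel of a diphthong, and the diphthong υι). Hebrew would need its own table plus vowel-point handling.

```python
import unicodedata

# Simplified letter table in the style of the SBL general-purpose scheme.
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "ē", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "ō",
}
ROUGH_BREATHING = "\u0314"          # combining reversed comma above
DIPHTHONG_FIRST = set("αεηο")       # upsilon after these is written "u"
NASAL_GAMMA_BEFORE = set("γκξχ")    # gamma before these is written "n"


def transliterate_greek(word: str) -> str:
    """Transliterate one Greek word; unknown characters pass through."""
    decomposed = unicodedata.normalize("NFD", word.lower())

    # Split into base letters, remembering which ones carry a rough breathing.
    letters = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            if ch == ROUGH_BREATHING and letters:
                letters[-1][1] = True
            continue  # drop accents, smooth breathing, iota subscript
        letters.append([ch, False])

    out = []
    for i, (base, rough) in enumerate(letters):
        prev_base = letters[i - 1][0] if i > 0 else ""
        next_base = letters[i + 1][0] if i + 1 < len(letters) else ""
        if base == "γ" and next_base in NASAL_GAMMA_BEFORE:
            latin = "n"
        elif base == "υ" and prev_base in DIPHTHONG_FIRST:
            latin = "u"
        else:
            latin = GREEK_TO_LATIN.get(base, base)
        if rough:
            latin = "rh" if base == "ρ" else "h" + latin
        out.append(latin)
    return "".join(out)


print(transliterate_greek("λόγος"))    # logos
print(transliterate_greek("ἁμαρτία"))  # hamartia
print(transliterate_greek("ἄγγελος"))  # angelos
```

Normalizing to NFD first keeps the letter table small, since every accented or breathed vowel reduces to a base letter plus combining marks.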

Synoptic Gospel Parallels

Status: No structured open dataset found.

No dedicated pericope-level parallel mapping exists as an open dataset. Candidate parallels are derivable from cross-reference data (scrollmapper TSK, ~340K entries), which should surface parallel passages across the Synoptic Gospels. Being verse-level links, however, cross-references do not by themselves define pericope boundaries. A dedicated parallel corpus would therefore need either manual scholarly curation or derivation from the ingested cross-reference data.
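A first pass at that derivation might look like the sketch below. It assumes the cross-references are already loaded as `(from_ref, to_ref)` pairs in an OSIS-like `Book.chapter.verse` form; the book codes, the `synoptic_candidates` helper, and the sample rows are illustrative assumptions, not the actual scrollmapper schema or data.

```python
from collections import defaultdict

SYNOPTIC_BOOKS = {"Mat", "Mrk", "Luk"}  # hypothetical book codes


def book_of(ref: str) -> str:
    """Return the book code of a reference such as 'Mat.3.13'."""
    return ref.split(".")[0]


def synoptic_candidates(cross_refs):
    """Map each Synoptic verse to cross-referenced verses in the other Synoptics."""
    parallels = defaultdict(set)
    for src, dst in cross_refs:
        a, b = book_of(src), book_of(dst)
        # Keep only links between two different Synoptic Gospels.
        if a in SYNOPTIC_BOOKS and b in SYNOPTIC_BOOKS and a != b:
            parallels[src].add(dst)
            parallels[dst].add(src)
    return {ref: sorted(targets) for ref, targets in parallels.items()}


# Illustrative rows only, not taken from the dataset.
sample_refs = [
    ("Mat.3.13", "Mrk.1.9"),
    ("Mat.3.13", "Luk.3.21"),
    ("Mat.3.13", "Isa.42.1"),   # dropped: target is outside the Synoptics
    ("Mat.3.13", "Mat.17.5"),   # dropped: same book
]
candidates = synoptic_candidates(sample_refs)
print(candidates["Mat.3.13"])  # ['Luk.3.21', 'Mrk.1.9']
```

The output is a set of verse-level candidates. Grouping them into pericopes would still require the curation step described above.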

OT Quotations in NT

Status: No dedicated dataset. Derivable from cross-references.

The mapping of Old Testament passages quoted in the New Testament is a valuable scholarly feature but no dedicated open dataset exists. The scrollmapper cross-reference data (~340K entries) should capture most OT→NT quotation relationships. A purpose-built quotation dataset would also need to distinguish between direct quotations, allusions, and thematic echoes — a classification that requires scholarly judgment beyond automated cross-referencing.
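One way to stage this is to extract every cross-reference that links the two testaments and leave the quotation/allusion/echo label empty for later scholarly review. The sketch below assumes `(from_ref, to_ref)` pairs in a `Book.chapter.verse` form; the book codes, the `ot_nt_candidates` helper, the record fields, and the sample rows are hypothetical.

```python
# Hypothetical codes for the 27 New Testament books; anything else counts as OT.
NT_BOOKS = {
    "Mat", "Mrk", "Luk", "Jhn", "Act", "Rom", "1Co", "2Co", "Gal",
    "Eph", "Php", "Col", "1Th", "2Th", "1Ti", "2Ti", "Tit", "Phm",
    "Heb", "Jas", "1Pe", "2Pe", "1Jn", "2Jn", "3Jn", "Jud", "Rev",
}


def is_nt(ref: str) -> bool:
    return ref.split(".")[0] in NT_BOOKS


def ot_nt_candidates(cross_refs):
    """Return de-duplicated NT/OT links, each marked as not yet classified."""
    seen = set()
    candidates = []
    for src, dst in cross_refs:
        if is_nt(src) == is_nt(dst):
            continue  # both references are in the same testament
        nt_ref, ot_ref = (src, dst) if is_nt(src) else (dst, src)
        if (nt_ref, ot_ref) in seen:
            continue
        seen.add((nt_ref, ot_ref))
        # The relation is a placeholder for a later manual classification.
        candidates.append({"nt": nt_ref, "ot": ot_ref, "relation": "unclassified"})
    return candidates


# Illustrative rows only, not taken from the dataset.
sample_refs = [
    ("Mat.1.23", "Isa.7.14"),
    ("Isa.7.14", "Mat.1.23"),   # same link in the other direction
    ("Mat.1.23", "Luk.1.31"),   # NT to NT, skipped
    ("Gen.1.1", "Psa.33.6"),    # OT to OT, skipped
]
quotation_candidates = ot_nt_candidates(sample_refs)
print(quotation_candidates)
```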

Textual Apparatus / Manuscript Variants

Status: No actionable open dataset found.

Text-critical data (variant readings across manuscripts) is primarily held by institutional projects (INTF, IGNTP) with restricted access. Early-stage GitHub repositories exist but are not yet mature enough for production use. STEPBible's TEHMC and TEGMC datasets provide edition-level comparisons for Hebrew and Greek manuscripts, which partially address this need. A full critical apparatus remains an institutional/academic effort beyond current open-source availability.

Audio Pronunciation

Status: No open dataset found.

No structured, machine-readable open dataset exists for word-level audio pronunciation of biblical Hebrew or Koine Greek, and no viable open-source path has been identified. Audio would likely need to be generated (text-to-speech) or recorded by domain experts.


Appendix A: BLB Verse Numeric ID Scheme

Source: BLB Infrastructure Investigation

BLB assigns a sequential numeric ID to every verse in the Bible, used in URL parameters (s_ and t_conc_). This scheme is documented here for reference in any future BLB data interaction or cross-referencing.

Encoding: {book_offset} + {chapter} * 1000 + {verse} encoded as a single integer.

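| Reference | Numeric ID | Notes |
|-----------|------------|-------|
| Gen 1:1 | 1001 | First verse |
| Gen 1:31 | 1031 | |
| Gen 2:1 | 2001 | Chapter boundary |
| Exo 1:1 | 51001 | Book boundary |
| Mat 1:1 | 930001 | NT begins |
| Rev 22:21 | 1189021 | Last verse |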

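The exact book-offset values would need to be enumerated from BLB's sitemap data. The examples above are consistent with a running chapter count across the whole Bible: (chapters in all preceding books + chapter) * 1000 + verse, so that book_offset is 1000 times the number of chapters before the book. Genesis has 50 chapters, so Exodus 1 is chapter 51 (51001); the Old Testament has 929 chapters, so Matthew 1 is chapter 930 (930001); and Revelation 22 is chapter 1,189 (1189021). The precise formula still requires further analysis of sitemap URLs to confirm. URL patterns using this scheme: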

  • Verse view: /{translation}/{book}/{chapter}/{verse}/s_{numericId}
  • Interlinear/Concordance: /{translation}/{book}/{chapter}/{verse}/t_conc_{numericId}