DSS Restoration Demo

Exploratory interface preview

These cards demonstrate a possible scholar-facing workflow using an earlier checkpoint. They are retained for interface research and error analysis, not as current paper evidence. Hover a prediction slot to inspect final-layer attention.

Unknown Gaps

Gaps flagged by strong uncertainty and loss markers in the Text-Fabric (TF) layer. No oracle is available for these cases.

Oracle-Known Triage

Use this when the gold word is known and you want to classify why the model failed. Hover the predictions to highlight which context words drove the model's guess.

Current evidence and paper protocol

The retained numbers answer different questions and are deliberately not collapsed into one restoration-accuracy headline.

Literature-agreement pilot

Agreement with attributed researcher restorations

Reconstruction-free MsBERT, evaluated on 74 genuine single-word lacunae from held-out, non-biblical scrolls. The decoder keeps visible manuscript letters and approximate lacuna-derived word length (±1), but never receives the restored letters.
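The physical constraints described above can be pictured as a filter over the language model's ranked candidates. The following is a minimal sketch, not the project's decoder: the function names are hypothetical, visible letters are assumed to be checked as an in-order subsequence of the candidate, and the Latin-letter strings stand in for real Hebrew word forms.

```python
def is_subsequence(letters: str, word: str) -> bool:
    """True if every visible letter occurs in `word` in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in letters)


def filter_candidates(ranked, visible, est_length, tolerance=1, k=10):
    """Keep candidates that fit the visible letters and the estimated length.

    ranked:     candidate words, best first (e.g. from a masked language model)
    visible:    letters still legible on the manuscript, in order (assumption)
    est_length: word length estimated from the lacuna size
    tolerance:  allowed deviation from est_length (the page states +/-1)
    """
    kept = [
        word
        for word in ranked
        if abs(len(word) - est_length) <= tolerance
        and is_subsequence(visible, word)
    ]
    return kept[:k]


# Placeholder candidates; the restored letters themselves are never an input.
top = filter_candidates(["shalom", "shem", "salem", "sm"], visible="sm", est_length=5)
```

Dropping the two conditions in the list comprehension corresponds to the "without physical constraints" diagnostic, which ranks the raw language-model output alone.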

63.5% · Target-level Top-10 · Any compatible attributed restoration · N=74

60.6% · Unique-reading Top-10 · 99 distinct target-reading pairs

9.5% · Without physical constraints · Same targets; diagnostic comparison

The 63.5% result measures the complete constrained decoder, not improvement in the language model and not recovery of manuscript truth. Target-level Top-10 95% CI: 51.4%–74.3%.
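To show how wide an interval a sample of 74 targets implies, here is a sketch of a Wilson score interval. Two assumptions apply: 47 hits out of 74 is inferred from the reported 63.5%, and the page does not say which interval method produced the reported bounds, so this sketch need not reproduce them exactly.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 47/74 is an assumed hit count consistent with the reported 63.5%.
low, high = wilson_interval(47, 74)
```

Whatever the method, an interval spanning roughly twenty percentage points is a reminder that the pilot is small.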

Train-only RAG ablation

Exact-context retrieval uses only preserved words from non-biblical training scrolls. Its weight (α=0.5) was selected on dev scrolls, never on held-out targets.
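The page gives the retrieval weight but not the formula that applies it. One common reading is a linear interpolation between the language-model distribution and a retrieval distribution, sketched below; the function name, the dictionary inputs, and the interpolation form itself are assumptions for illustration.

```python
def interpolate_scores(mlm_probs, rag_probs, alpha=0.5):
    """Mix two candidate distributions: (1 - alpha) * MLM + alpha * retrieval.

    mlm_probs / rag_probs map candidate words to probabilities; a word missing
    from one source is treated as probability 0 there (assumption).
    Returns (word, score) pairs, best first, with ties broken alphabetically.
    """
    words = set(mlm_probs) | set(rag_probs)
    mixed = {
        w: (1 - alpha) * mlm_probs.get(w, 0.0) + alpha * rag_probs.get(w, 0.0)
        for w in words
    }
    return sorted(mixed.items(), key=lambda item: (-item[1], item[0]))


# Toy distributions; real retrieval would draw only on training scrolls.
ranked = interpolate_scores({"a": 0.6, "b": 0.4}, {"b": 1.0}, alpha=0.5)
```

With alpha set to 0 the ranking reduces to the plain MLM column, which is what makes a paired ablation possible.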

No network calls

Held-out evaluation	Unit	MLM Top-10	MLM + RAG Top-10
Qumran Digital attributed readings	74 single-word targets	63.5%	63.5%
Text-Fabric editorial labels	25 single-word spans	60.0%	64.0%
Text-Fabric editorial labels	440 slots in 100 multiword spans	41.4%	41.8%
Text-Fabric exact sequence	100 multiword spans	7.0%	9.0%

The Text-Fabric reconstructions are anonymous editorial evaluation labels, not physical truth. The balanced sample contains 25 spans in each length bucket; whole-sequence recovery requires every word to match in order. The observed deltas are descriptive pilots, not evidence that retrieval improves restoration.
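The gap between the slot-level and exact-sequence rows follows from how the two are scored. A minimal sketch, with hypothetical function names and toy data:

```python
def slot_topk_hits(gold, slot_candidates, k=10):
    """Count slots whose gold word appears in that slot's top-k candidates."""
    return sum(word in cands[:k] for word, cands in zip(gold, slot_candidates))


def exact_sequence_topk(gold, sequence_candidates, k=10):
    """True only if one top-k candidate sequence matches every word in order."""
    target = tuple(gold)
    return any(tuple(seq) == target for seq in sequence_candidates[:k])


gold = ["a", "b", "c"]
# Two of three slots are recovered individually...
slot_hits = slot_topk_hits(gold, [["a", "x"], ["y", "z"], ["c"]])
# ...but a sequence with the right words in the wrong order still fails.
reordered = exact_sequence_topk(gold, [["b", "a", "c"]])
matched = exact_sequence_topk(gold, [["a", "y", "c"], ["a", "b", "c"]])
```

A span can therefore contribute several slot-level hits while still counting as a miss at the sequence level.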

Embible-style synthetic-damage baseline

Preserved held-out DSS text is hidden artificially; these are not real lacunae. Primary systems receive no gold length or boundaries.

30 frozen targets

System	Exact Top-1	Exact Top-10	Top-1 CER	Boundary F1
Preserved-only word model	3.3%	16.7%	0.890	0.300
Base TavBERT character model	6.7%	6.7%	0.837	0.333
Scaled Embible overlap ensemble	6.7%	6.7%	0.802	0.333
Dev-fitted rank ensemble	3.3%	10.0%	0.856	0.267

Neither ensemble improves on the word model's Exact Top-10 (16.7%), although both reach a lower Top-1 CER. Exact Top-10 is 0% for every two- and three-word target. As in Embible's masked-Tanakh evaluation, this known-answer experiment measures synthetic damage, not accuracy at naturally occurring lacunae.
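For readers unfamiliar with the CER column, the sketch below computes a character error rate as Levenshtein distance divided by the length of the gold string. The page does not define its normalization, so dividing by the gold length is an assumption, as are the function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def cer(prediction: str, gold: str) -> float:
    """Character error rate of a top-1 prediction against the hidden text."""
    if not gold:
        raise ValueError("gold string must be non-empty")
    return edit_distance(prediction, gold) / len(gold)
```

Lower is better. Under this definition a prediction much longer than the gold string can score above 1.0.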

Domain control: with this same DSS-trained word model and decoder on 120 held-out Biblical spans, Exact Top-10 rises to 80.0%, 42.5%, and 27.5% for one, two, and three hidden words. The corresponding DSS scores are 50.0%, 0.0%, and 0.0%, identifying domain transfer—not only decoding—as a major bottleneck.

Agreement by bibliographic source

Each source contributes at most one observation per manuscript target.

98 eligible sources

Publication source	Independent targets	Top-1	Top-10
Study Edition	24	20.8%	62.5%
Qimron 2013	23	30.4%	52.2%
PrCon I	10	40.0%	50.0%
Qimron 2020	9	44.4%	66.7%
Wacholder/Abegg 1995	9	33.3%	55.6%
DJD XXIX	8	37.5%	50.0%
Qimron 2014	8	50.0%	75.0%

“Source” means a bibliographic publication, sometimes with multiple authors, not an independent individual researcher. Rows with fewer than 10 targets are shaded in the interface and should not be interpreted as a ranking. When several publications propose the same reading at one target, that reading counts once in the headline metric.
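The counting rules above can be sketched as follows. The record layout (target, source, reading), the function names, and the use of exact string equality in place of the project's "compatible" matching are all assumptions for illustration.

```python
from collections import defaultdict


def target_level_topk(attributions, predictions, k=10):
    """Share of targets where any attributed reading appears in the top-k."""
    readings = defaultdict(set)
    for target, _source, reading in attributions:
        readings[target].add(reading)  # duplicate readings collapse here
    hits = sum(
        any(r in predictions.get(t, [])[:k] for r in rs)
        for t, rs in readings.items()
    )
    return hits / len(readings)


def unique_reading_topk(attributions, predictions, k=10):
    """Share of distinct (target, reading) pairs found in the top-k."""
    pairs = {(t, r) for t, _source, r in attributions}
    hits = sum(r in predictions.get(t, [])[:k] for t, r in pairs)
    return hits / len(pairs)


# Toy data: two publications agree on "x" at t1, so "x" counts once there.
attributions = [("t1", "A", "x"), ("t1", "B", "x"), ("t1", "C", "y"), ("t2", "A", "z")]
predictions = {"t1": ["x"], "t2": ["q"]}
by_target = target_level_topk(attributions, predictions)
by_reading = unique_reading_topk(attributions, predictions)
```

The two denominators differ, so the target-level and unique-reading figures are not expected to agree.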

Primary next experiment

Exact complete-span recovery under unknown length

The frozen paper benchmark will hide genuinely preserved text using the empirical DSS damage distribution. It will not supply character count or word-slot count. The primary metric is exact-sequence Top-10; CER, MRR, calibration, failure rate, and known-length regimes are secondary diagnostics.

Results will be reported both on the natural distribution and on equally weighted buckets of 1, 2, 3, 4–5, and 6+ words, with scroll- and composition-clustered uncertainty.
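The two reporting modes can differ sharply when short spans dominate the sample. A minimal sketch, assuming each result is a (word count, hit) pair and that empty buckets are simply skipped; the function names are hypothetical.

```python
from collections import defaultdict


def bucket_for(n_words: int) -> str:
    """Map a span length to the buckets named in the protocol."""
    if n_words < 1:
        raise ValueError("span must contain at least one word")
    if n_words <= 3:
        return str(n_words)
    return "4-5" if n_words <= 5 else "6+"


def natural_and_balanced(results):
    """Return (natural-distribution accuracy, equally weighted bucket mean)."""
    per_bucket = defaultdict(list)
    for n_words, hit in results:
        per_bucket[bucket_for(n_words)].append(hit)
    natural = sum(hit for _, hit in results) / len(results)
    bucket_means = [sum(hits) / len(hits) for hits in per_bucket.values()]
    balanced = sum(bucket_means) / len(bucket_means)
    return natural, balanced


# Three easy one-word hits and one two-word miss.
natural, balanced = natural_and_balanced([(1, True), (1, True), (1, True), (2, False)])
```

In this toy case the natural figure is flattered by the one-word spans, while the balanced figure gives the two-word bucket equal weight.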

Promotion gate for any headline

Frozen manifests and hashes; scroll- and composition-disjoint tests; near-duplicate audit; at least three training seeds; dev-only selection; paired RAG ablations; 95% confidence intervals; exact multiword scoring; and explicit separation of preserved recovery, literature agreement, and scholar utility.
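Two of the gate items, frozen manifests with hashes and disjoint test sets, lend themselves to mechanical checks. The sketch below shows one possible form; the manifest layout and function names are hypothetical, and the scroll identifiers are placeholders.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """SHA-256 of a canonical JSON encoding, stable under key reordering."""
    canonical = json.dumps(
        manifest, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def overlapping_ids(train_ids, test_ids):
    """Return identifiers present in both splits; empty means disjoint."""
    return sorted(set(train_ids) & set(test_ids))


# Placeholder manifest for illustration only.
manifest = {"train": ["scroll_a", "scroll_b"], "test": ["scroll_c"]}
digest = manifest_hash(manifest)
leaks = overlapping_ids(manifest["train"], manifest["test"])
```

Recording the digest alongside reported numbers lets a reader confirm that a later run used the same frozen targets. The same overlap check could be repeated at composition level.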


Protocol locked; final benchmark pending