Model

MsBERT + span-ft-refined

Real held-out lacuna spans for researcher-facing review, plus benchmark probes for diagnosis.

Researcher View

Real held-out lacuna spans with per-slot predictions under the span benchmark regime. Hover over a prediction slot to highlight the context words the model attended to when filling it. The highlight shows final-layer attention saliency, in the spirit of DeepMind's Ithaca.
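As a rough illustration of how such a hover highlight could be derived, the sketch below averages final-layer attention over heads and ranks the context tokens for one slot. The function name `slot_saliency`, the nested-list attention layout `[head][query][key]`, and the choice to average heads are all assumptions made for illustration. The page does not specify how the saliency is actually aggregated.

```python
from typing import List, Sequence, Tuple


def slot_saliency(
    attention: Sequence[Sequence[Sequence[float]]],
    tokens: Sequence[str],
    slot_index: int,
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rank context tokens by mean final-layer attention from one slot.

    `attention` is assumed to be indexed as [head][query][key] and to come
    from the final layer only. Averaging over heads is an assumption.
    """
    n_heads = len(attention)
    # Mean attention weight from the slot position to every key position.
    weights = [
        sum(head[slot_index][j] for head in attention) / n_heads
        for j in range(len(tokens))
    ]
    # Drop the slot itself, then sort the context tokens by weight.
    ranked = sorted(
        ((tokens[j], w) for j, w in enumerate(weights) if j != slot_index),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]
```

A front end could map the returned weights to highlight intensity for the corresponding context words.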

Unknown Gaps

Strong uncertainty and loss markers from the TF layer. No oracle is available for these cases.

Oracle-Known Triage

Use this when the gold word is known and you want to classify why the model failed. Hover over the predictions to highlight the context words that drove the model's guess.

Benchmark Snapshot & Model Achievements

Held-out Hebrew-only benchmark comparing decoding architectures and researcher-assist signals.

Key Breakthrough

Autoregressive Sequence-Level Restoration (Sequence Accuracy)

Traditional MLMs predict each slot independently (Parallel Decoding), which ignores syntactic constraints between slots and can produce duplicates such as אשר אשר. Our Autoregressive Beam Search decodes tokens left to right, conditioning each step on the previous predictions. This improves accuracy on full, grammatically coherent sequence restoration (Sequence Top-1) relative to parallel decoding.
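The contrast between the two decoding regimes can be sketched with a toy scorer. This is a minimal illustration, not the benchmark's implementation: `toy_score`, its unnormalised scores, and the transliterated tokens are invented for the example, and a real system would query the model for each step.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: log-score of `token` at slot `pos`,
# given the tokens already chosen for the earlier slots.
Scorer = Callable[[int, Tuple[str, ...], str], float]


def parallel_decode(vocab: Sequence[str], n_slots: int, score: Scorer) -> List[str]:
    """Fill each slot independently; no slot sees the other predictions."""
    return [max(vocab, key=lambda tok: score(pos, (), tok)) for pos in range(n_slots)]


def beam_search(
    vocab: Sequence[str], n_slots: int, score: Scorer, beam_width: int = 3
) -> List[Tuple[List[str], float]]:
    """Decode left to right, keeping the `beam_width` best partial sequences."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    for pos in range(n_slots):
        candidates = [
            (prefix + (tok,), logp + score(pos, prefix, tok))
            for prefix, logp in beams
            for tok in vocab
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(list(seq), logp) for seq, logp in beams]


def toy_score(pos: int, prefix: Tuple[str, ...], tok: str) -> float:
    """Invented, unnormalised toy scores ('asher' transliterates the relative particle)."""
    base = {"asher": 0.6, "dibber": 0.3, "moshe": 0.1}[tok]
    if prefix and prefix[-1] == tok:
        base *= 0.01  # immediate repetition is made very unlikely
    elif prefix and prefix[-1] == "asher" and tok == "dibber":
        base = 0.5  # a verb is made more likely after the particle
    return math.log(base)


VOCAB = ["asher", "dibber", "moshe"]
independent = parallel_decode(VOCAB, 2, toy_score)
best_sequence, best_logp = beam_search(VOCAB, 2, toy_score)[0]
```

With these toy scores, `independent` repeats the most frequent token in both slots, while `best_sequence` avoids the duplicate because the second step is conditioned on the first.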

Top-10 Slot-Level Accuracy by Gap Length

Independent slot recovery accuracy: the percentage of individual gap words for which the correct word, in its correct position, appears among the top 10 predictions.
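A minimal sketch of how a metric of this kind could be computed, assuming each gap record carries its gold words and a ranked candidate list per slot. The record layout (`"gold"`, `"ranked"`) and the function name are hypothetical and not taken from the benchmark code.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence


def topk_slot_accuracy_by_gap_length(
    gaps: Sequence[Mapping[str, list]], k: int = 10
) -> Dict[int, float]:
    """Percentage of slots whose gold word is in the top-k list, per gap length.

    Each gap is assumed to look like:
        {"gold": [word, ...], "ranked": [[best, second, ...], ...]}
    with one ranked candidate list per slot, aligned with "gold".
    """
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for gap in gaps:
        gold: List[str] = gap["gold"]
        ranked: List[List[str]] = gap["ranked"]
        length = len(gold)  # gap length measured in words
        for gold_word, candidates in zip(gold, ranked):
            totals[length] += 1
            if gold_word in candidates[:k]:
                hits[length] += 1  # correct word, correct position
    return {n: 100.0 * hits[n] / totals[n] for n in sorted(totals)}
```

Each slot is scored on its own, so a gap can contribute partial credit even when the full sequence is not restored.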

Biblical Contrast Set (Control Group)

Evaluated on 60 biblical scroll fragments. Because biblical texts closely resemble known canonical manuscripts, this set serves as an upper-bound sanity check on the models' Classical Hebrew proficiency.

Retrieval And Parallel Witnesses