"Greek, sir, is like lace; every man gets as much of it as he can." (Samuel Johnson)
This site catalogues the results of our 2012/13 campaign to produce high-quality OCR of polytonic, or 'ancient', Greek texts in an HPC environment. It comprises over 600 volumes from archive.org and from original scans. There are over 6 million pages of OCR output in total, including experimental and rejected results.
Results are presented in a hierarchical organization, beginning with the archive.org volume identifier. Each of these is associated with one or more 'runs', or attempts at OCRing that volume. A run has a date stamp and is associated with a classifier and an aggregate best b-score (roughly indicating the quality of the Greek output). Each run produces several kinds of output:
1. raw hOCR output: the data generated by our OCR process, usually with multiple copies of each page, rendered at a range of binarization thresholds.
2. selected hOCR output: a filtered version of the data in (1), with each page image represented by a single best output page.
3. blended hOCR output: the data in (2), with a word replaced by the corresponding word from the raw output in (1) when the word on the selected page is not a dictionary word and the word on one of the raw pages is.
4. selected hOCR output, spellchecked: the data in (3) processed through a weighted Levenshtein distance spellchecking algorithm intended to correct simple OCR errors.
5. combined hOCR output: where archive.org provides OCR output for Latin script (not Greek), this final step pieces together the data in (4) with archive.org's output, preferring archive.org's output where ours suggests that the text is Latin. If archive.org provides Greek output, this step is no different from (4).
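The blending step (3) can be sketched roughly as follows. This is a simplified illustration, not code from the Rigaudon repository: it assumes the selected page and each raw page have already been aligned word-for-word, and the function and parameter names are invented for the example.

```python
def blend_page(selected_words, raw_pages, dictionary):
    """Sketch of blend step (3): for each word on the selected page that is
    not a dictionary word, substitute the word at the same position from a
    raw page when that word *is* a dictionary word.

    selected_words -- list of words from the single best ('selected') page
    raw_pages      -- list of word lists, one per binarization threshold,
                      assumed aligned with selected_words (a simplification)
    dictionary     -- set of known-good words
    """
    blended = []
    for i, word in enumerate(selected_words):
        if word in dictionary:
            blended.append(word)
            continue
        # Look for a dictionary word at the same position in any raw page;
        # fall back to the selected word if none of the raw pages helps.
        replacement = next(
            (page[i] for page in raw_pages
             if i < len(page) and page[i] in dictionary),
            word,
        )
        blended.append(replacement)
    return blended
```

For example, if the selected page reads "ανθρωπος" (no breathing) but one raw rendering reads "ἄνθρωπος", and only the latter is in the dictionary, the raw word wins.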
All code and classifiers for Rigaudon are posted in a GitHub repository. This holds the modified Gamera source code, ancillary Python scripts such as the spellcheck engine, and the Bash scripts that coordinate the process in an HPC environment through Sun Grid Engine.
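The weighted Levenshtein idea behind the spellcheck engine can be sketched as below. The weighting scheme here is illustrative only, assuming that common OCR confusions (for instance, a dropped or mistaken breathing mark) should cost less than an arbitrary substitution; the actual weights in the Rigaudon scripts may differ.

```python
def weighted_levenshtein(a, b, sub_costs=None, default=1.0):
    """Edit distance where selected substitution pairs are cheaper than
    the default cost, so frequent OCR confusions rank as 'closer'.

    sub_costs -- maps an unordered pair of characters (a frozenset) to a
                 substitution cost; pairs not listed cost `default`.
                 These weights are illustrative assumptions.
    """
    sub_costs = sub_costs or {}
    # Standard dynamic-programming table, kept one row at a time.
    prev = [j * default for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * default]
        for j, cb in enumerate(b, 1):
            cost = 0.0 if ca == cb else sub_costs.get(
                frozenset((ca, cb)), default)
            curr.append(min(prev[j] + default,      # deletion
                            curr[j - 1] + default,  # insertion
                            prev[j - 1] + cost))    # (weighted) substitution
        prev = curr
    return prev[-1]
```

With a cheap cost assigned to a confusable pair, a candidate correction differing only by that pair outranks corrections at the same unweighted distance.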
Details of its operation are outlined in a white paper.
Our July 2013 presentation at the London Digital Classicist seminar series is available online from the Institute of Classical Studies.
This is a continuation of efforts begun through the Digging Into Data Round I project Toward Dynamic Variorum Editions, in which -- as the project white paper notes -- we discovered both the tantalizing potential of Greek OCR and the poor results that OCR engines at that time produced when operating at scale.
In order to bootstrap that process, we adapted the most extensible and successful framework then available, the Gamera Greek OCR engine by Dalitz and Brandt. Using the AceNET HPC environment, we analyzed a sample of the Google Greek and Latin corpus with twenty classifiers composed by Canadian undergraduate students. From this, we produced a quantitative report on the efficacy of our modified OCR code.
On the basis of this work, we received a 2012/2013 Humanities Computing Grant from Compute Canada, making this large-scale processing possible.
This work has benefited from the support of: