WSEAS Transactions on Computers


Print ISSN: 1109-2750
E-ISSN: 2224-2872

Volume 18, 2019




A Web-based Semantic Navigation System for Migne’s Patrologia Graeca based on OCR extracted Page and Volume Numbers from the Table of Contents of Dorotheos Scholarios

AUTHORS: Evagelos Varthis, Marios Poulos, Ilias Giarenis, Sozon Papavlasopoulos


ABSTRACT: This paper presents the prototype of a new tool for navigating a 19th-century collection of Greek authors published by Jacques Paul Migne and known today as the Patrologia Graeca (PG). The project aims to interconnect the roughly 120,000 scanned pages of PG with the scanned Table of Contents (TOC) published by D. Scholarios in 1879. Scholarios's work contains summaries of the chapters and sub-chapters of PG, each followed by the volume and page number of the corresponding location in PG. Using Optical Character Recognition (OCR) and pattern recognition techniques, we extract this volume and page information from Scholarios's work in order to create links to the specific pages of PG. Our aim is to provide a Web interface in which Scholarios's work serves as a semantic compass for the subjects PG covers. The complete system consists of three main parts: (1) a REST API backbone service for the scanned images of PG; (2) OCR and pattern recognition techniques for extracting the volume and page information from the scanned pages of Scholarios's work; and (3) a Web interface presenting Scholarios's TOC with the appropriate functionality. The originality of our system lies in interconnecting two different scanned texts for semantic enrichment and browsing convenience, particularly given that one runs to nearly 120,000 pages and the other to about 600.
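The extraction-and-linking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how volume and page numbers are printed in the TOC or how the REST API addresses pages, so the citation shape ("volume, page"), the names PageRef, extract_refs and page_url, and the example.org URL layout are all assumptions made for illustration only.

```python
import re
from typing import List, NamedTuple


class PageRef(NamedTuple):
    """A (volume, page) location in PG. Hypothetical helper type."""
    volume: int
    page: int


# ASSUMPTION: a reference is printed as "<volume>, <page>", e.g. "31, 245".
# The real layout of the TOC may differ.
REF_PATTERN = re.compile(r"\b(\d{1,3})\s*,\s*(\d{1,4})\b")

# PG is commonly cited as 161 volumes; used here only to reject OCR noise.
MAX_VOLUME = 161


def extract_refs(ocr_line: str) -> List[PageRef]:
    """Return plausible (volume, page) pairs found in one OCR'd text line."""
    refs = []
    for match in REF_PATTERN.finditer(ocr_line):
        volume, page = int(match.group(1)), int(match.group(2))
        # Discard pairs whose volume number cannot exist in PG.
        if 1 <= volume <= MAX_VOLUME and page >= 1:
            refs.append(PageRef(volume, page))
    return refs


def page_url(ref: PageRef, base: str = "https://example.org/pg") -> str:
    """Build a link to a scanned page. The URL scheme is hypothetical."""
    return f"{base}/volumes/{ref.volume}/pages/{ref.page}"


if __name__ == "__main__":
    for ref in extract_refs("Περὶ προσευχῆς 31, 245"):
        print(page_url(ref))
```

Filtering on the valid volume range is one simple way to suppress misrecognised digits; a production system would likely need additional checks against the known page counts of each volume.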

KEYWORDS: Migne’s Patrologia Graeca, Dorotheos Scholarios, REST API, Web Interface, Semantic Web


WSEAS Transactions on Computers, ISSN / E-ISSN: 1109-2750 / 2224-2872, Volume 18, 2019, Art. #26, pp. 204-209


Copyright © 2018 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0
