An Integrated System for Polytonic Greek OCR


An Integrated System for Polytonic Greek OCR I. Generating the Data Bruce Robertson, Dept. of Classics, Mount Allison University, New Brunswick, Canada Digital Classicist Seminar, Institute of Classical Studies, London, UK, July 19, 2013

I. Generating the Data A. Reason

Why Ancient Greek OCR?
1. Rapid digitization of Greek texts not yet in digital libraries
2. Study of textual variants and the apparatus criticus
3. Text reuse analysis
4. General-purpose OCR search, like Google Books

[Table: each use (Digitization, Textual Variants, Text Reuse, OCR Search) set against "Manual Editing?" and "Automatic Spellchecking?"; cell values not preserved]

I. Generating the Data B. Challenge

Example Text: John 1:1 Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος.

Acute, Grave and Circumflex Accents Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος.

Smooth and Rough Breathing Marks Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος.

Iota Subscript Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος.

Diversity of Greek Fonts in 19th C. archive.org Texts

Recognizing Lines

Accurate Binarization

Binarization Important to Results

I. Generating the Data C. Resources

Contextless 'Greekness' Index
Devised by Dr. Boschetti.
Based on a dictionary, likely sequences of letters, etc.
Named 'B-score' in these slides.

Archive.org Provides:
Thousands of volumes rendered as high-resolution (400+ ppi) colour images.
OCR results from ABBYY FineReader: excellent Latin-script recognition, poor Greek results, top-quality line segmentation.

Open-source OCR Engines
Gamera: the current focus of my team.
Tesseract: Nick White has worked extensively on this to generate good results.
OCRopus: Dr. Boschetti has recently been able to use Tesseract training sets for this engine.

Interchange Format: HOCR
<p>
  <span class="ocr_line" title="bbox 196 751 1059 808">
    <span class="ocr_word" title="bbox 196 769 269 794" xml:lang="grc">νυ ν</span>
    <span class="ocr_word" title="bbox 301 752 326 793" xml:lang="grc">δ'</span>
    <span class="ocr_word" title="bbox 378 755 598 804" xml:lang="grc">ω ρθωσας</span>
  </span>
</p>
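As a rough illustration of how the word boxes in such a fragment can be read back, here is a minimal sketch using only the Python standard library. The class name `HocrWords` and the sample string are assumptions made for this example (the sample reuses two bounding boxes from the slide but substitutes a word for readability); it is not taken from the Rigaudon code.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs for every span with class 'ocr_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocr_word":
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text), self._bbox))
            self._bbox = None


# Hypothetical sample modelled on the slide's fragment.
SAMPLE = """<p><span class="ocr_line" title="bbox 196 751 1059 808">
<span class="ocr_word" title="bbox 301 752 326 793" xml:lang="grc">δ'</span>
<span class="ocr_word" title="bbox 378 755 598 804" xml:lang="grc">Λόγος</span>
</span></p>"""

parser = HocrWords()
parser.feed(SAMPLE)
parser.close()
words = parser.words
```

Each entry of `words` pairs the recognised text with its `(x0, y0, x1, y1)` box, which is the information later steps need in order to compare words "in the same position".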

I. Generating the Data D. Method

Rigaudon Greek OCR Process (flowchart, read from input to output)
1. Inputs: images from Archive.org (JP2 input library) and the ABBYY OCR information file.
2. ABBYY to HOCR conversion; page segmentation through HOCR input.
3. Gamera 3.3.3 image recognition with Greek OCR for Gamera, using classifiers for Teubner Sans, Teubner Serif and Oxford (Loeb, etc.); parallel process on 35 cores.
4. HOCR results at a range of binarization thresholds.
5. Boschetti scoring, giving a score table for binarization thresholds.
6. HOCR "blending": select the highest-ranking binarization page; replace non-dictionary words with dictionary words from other binarization pages.
7. Automatic OCR spellcheck (14 cores): reduction to unique Greek strings, then a weighted Levenshtein spellchecker using the weighted edit table for the classifier and a 410,000-word dictionary from open Perseus Greek texts; this yields a per-volume spellcheck table, and spellchecked words are replaced.
8. HOCR Latin / Greek combining: if the ABBYY OCR file contains Latin-script output, replace Latin-script output words with the ones in the same position from Archive's ABBYY output.
9. HOCR output.

Raw HOCR Production Using Gamera
A plugin for Gamera OCR lets it import high-quality line-segmentation information, compensating for Gamera's poor results in this critical function.
A plugin outputs HOCR.
A wrapper function generates a range of output pages based on binarization threshold (typically 10-20 per page).

HOCR 'Blending'
This step aims to gather, word by word, the 'best' results from the range of result pages for each image.
It selects the highest-scoring result page overall.
Where a Greek word on this page is not in the dictionary and another page has a dictionary word in exactly the same physical location, the dictionary word replaces it.
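The blending rule can be sketched roughly as follows. This is an illustrative reading of the slide, not the Rigaudon implementation: the function name `blend`, the data layout (each page as a mapping from bounding box to word) and the choice to try higher-scoring pages first are all assumptions, and the check that a word is Greek is omitted for brevity.

```python
def blend(pages, scores, dictionary):
    """Blend OCR results produced at several binarization thresholds.

    pages:  {threshold: {bbox: word}}  (hypothetical layout)
    scores: {threshold: score}, e.g. a B-score per result page
    """
    best = max(scores, key=scores.get)
    blended = dict(pages[best])
    for bbox, word in pages[best].items():
        if word in dictionary:
            continue
        # Look for a dictionary word at exactly the same position on
        # another page, trying higher-scoring pages first (an assumption).
        for other in sorted(scores, key=scores.get, reverse=True):
            candidate = pages[other].get(bbox)
            if other != best and candidate in dictionary:
                blended[bbox] = candidate
                break
    return blended


# Invented sample data: two result pages for one image.
pages = {
    120: {(0, 0, 10, 10): "καὶ", (12, 0, 30, 10): "Λόγοc"},
    140: {(0, 0, 10, 10): "κα", (12, 0, 30, 10): "Λόγος"},
}
scores = {120: 0.9, 140: 0.7}
dictionary = {pages[120][(0, 0, 10, 10)], pages[140][(12, 0, 30, 10)]}
result = blend(pages, scores, dictionary)
```

Here the page at threshold 120 wins overall, but its non-dictionary second word is replaced by the dictionary word found in the same box on the page at threshold 140.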

Automatic Spellcheck
All pages in a volume are reduced to a set of unique, decomposed Greek strings.
These are compared to the dictionary using Levenshtein distances.
A 'weighting table', suited to a given font, indicates which edits are preferable or allowed.
The result is 'light' correction, especially of diacritics.
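The first step, reducing a volume to unique decomposed strings, might look like the sketch below. It assumes that "decomposed" refers to Unicode canonical decomposition (NFD), in which accents and breathings become separate combining characters; the function name and the sample volume are invented for illustration.

```python
import unicodedata


def unique_decomposed(pages):
    """Return the distinct words of a volume in NFD (decomposed) form."""
    return {
        unicodedata.normalize("NFD", word)
        for page in pages
        for word in page
    }


# Invented two-page 'volume'; the word 'ἦν' occurs on both pages.
volume = [["ἐν", "ἀρχῇ", "ἦν"], ["ἦν", "ὁ", "Λόγος"]]
strings = unique_decomposed(volume)
```

Working on the set rather than on every token means each distinct string is spellchecked once per volume, and decomposition lets a single edit touch a diacritic without touching its base letter.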

Automatic Spellcheck Weighting Table
['replace', ur'ϲϲ', ur'σςσ', 1],  # for lunate fonts
['replace', ur'c', ur'σς', 1],  # for lunate fonts
['replace', ur't', ur'ττ', 1],
['replace', ur'τr', ur'τ', 1],
['replace', ur'uu', ur'υ', 1],
['replace', ur'y', ur'υ', 1],
['replace', ur'e', ur'ε', 1],
['replace', ur'e', ur'ε', 2],
['replace', ur'z', ur'ζ', 1],
['replace', ur'k', ur'κκ', 1],
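To show how weights of this kind can steer an edit distance, here is a deliberately simplified sketch. It handles only single-character substitutions, whereas the table above also lists multi-character strings whose exact semantics the slides do not spell out; the function name, the `cheap` mapping and the default cost of 3 are assumptions for this example.

```python
def weighted_levenshtein(source, target, cheap, default=3):
    """Edit distance in which substitutions listed in `cheap` cost less.

    cheap maps (source_char, target_char) to a cost; every other
    substitution, insertion or deletion costs `default`.
    """
    previous = [j * default for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * default]
        for j, t in enumerate(target, start=1):
            if s == t:
                substitution = previous[j - 1]
            else:
                substitution = previous[j - 1] + cheap.get((s, t), default)
            current.append(min(
                substitution,
                previous[j] + default,     # delete s
                current[j - 1] + default,  # insert t
            ))
        previous = current
    return previous[-1]


# Latin look-alikes mapped to Greek letters, echoing rows of the table.
cheap = {("e", "ε"): 1, ("y", "υ"): 1, ("z", "ζ"): 1}

likely = weighted_levenshtein("λeγω", "λεγω", cheap)    # Latin 'e' for epsilon
unlikely = weighted_levenshtein("λxγω", "λεγω", cheap)  # no cheap rule applies
```

With these weights a confusion the classifier is known to make is much cheaper than an arbitrary substitution, so dictionary candidates reached through such confusions rank first.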

Optionally Injecting Greek into Original Latin HOCR
We don't want to try to get excellent results for both Greek and Latin, especially when ABBYY and others do a better job with Latin.
Where archive.org provides Latin OCR: if Rigaudon's output word is Greek, replace archive.org's ABBYY output word with Rigaudon's.
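A minimal sketch of this replacement rule follows. It assumes that words from the two sources are matched by identical bounding boxes and that "Greek" can be decided from Unicode character names; both are simplifications, and the function names are hypothetical.

```python
import unicodedata


def is_greek(word):
    """True if every letter in the word belongs to the Greek script."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("GREEK") for ch in letters
    )


def inject_greek(abbyy_words, rigaudon_words):
    """Both arguments map a bounding box to the word recognised there."""
    merged = dict(abbyy_words)
    for bbox, word in rigaudon_words.items():
        if bbox in merged and is_greek(word):
            merged[bbox] = word
    return merged


# Invented sample: ABBYY reads the Latin word well and the Greek word badly.
abbyy = {(0, 0, 5, 5): "the", (6, 0, 12, 5): "Aoyos"}
rigaudon = {(0, 0, 5, 5): "tbe", (6, 0, 12, 5): "Λόγος"}
merged = inject_greek(abbyy, rigaudon)
```

The Latin-script word keeps ABBYY's reading, while the Greek word takes Rigaudon's.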

Reporting

I. Generating the Data F. Results

Results Περι δε δικαιοσυ νης και α δικι ας... τε τυγχα νουσιν ου σαι πρα ξεις, και ποι α... κο τος α λλ' ου τη ς κομιζου σης φυ σεως κτη μα... τπ ου ν, ω ς ε ν ε ργον η μι ν ε πιβα λλει μο νον...

Results πιπλαττομε νω[ν η μη κ[αλ]ω ς α λλως προσ[τιθεμε - κρα τος τ' ι σο ψυχον ε κ γυναικω ν καρδιο δηκτον ε μοι κρατυ νεις.

I. Generating the Data G. Future

Multiple OCR Engines
[Flowchart: line segmentation, then OCR with OCRopus, Gamera and Tesseract; HOCR results at a range of binarization thresholds; Boschetti scoring and a score table for binarization thresholds; HOCR "blending", which selects the highest-ranking binarization page and replaces non-dictionary words with dictionary words from other binarization pages.]
Take ABBYY data out of the process: with 'cleaning', Tesseract's line segmentation is often as good.
Use Nick White's general-purpose polytonic classifier and ones specifically designed for a font.

Resources
Output: http://heml.mta.ca/rigaudon
Code: https://github.com/brobertson/rigaudon
Further Topics
HPC computing with Grid Engine
Python Flask web microframework
Making book images

An Integrated System for Generating and Correcting Polytonic Greek OCR Bruce Robertson and Federico Boschetti Part II The Proof-reading Process Federico Boschetti federico.boschetti@ilc.cnr.it ILC-CNR of Pisa Digital Classicist Seminars London, 19 July 2013 Federico Boschetti Generating and Correcting Polytonic Greek OCR 1/ 20

Introduction
Manual corrections on OCR output may be performed by:
Experts: classicists devoted to proof-reading for a long-term project
Data entry firms: professional proof-readers not skilled in the target language(s)
Crowdsourcing: students who are learning the target language(s)
Random volunteers: people with heterogeneous education and skills

For this reason, proof-reading tools focused on ancient languages should be:
centralized
easy to use
based on line-by-line image / text comparison
optimized to draw attention to possible errors, distinguished by category
efficient at providing the most probable correction

Overview
1. Information Aggregation: enriched hocr files; alignment with other editions; false negatives
2. Proof-reader Web Application
3. False positives

Enriched hocr files
OCR output is formatted in the hocr microformat.
The hocr output produced by Rigaudon is post-processed to add information managed by the Proof-reading Web Application.
Multiple sources: dictionaries with and without diacritics; multiple editions of the same work (if available); a syllabic repertory.

Dictionaries
In order to identify possible errors and provide good suggestions to correct them, the OCR output is spell-checked and the potential errors are processed step by step.
The spell-checker is based on dictionaries generated from the Perseus text collection.
An upper-case dictionary is used to evaluate whether a character sequence is a word with a wrong accent or breathing mark.
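One way to picture the role of the upper-case dictionary is the sketch below, which assumes that the upper-case entries carry no diacritics, so two words that differ only in accent or breathing collapse to the same key. The function names `skeleton` and `classify` and the tiny dictionary are invented for illustration.

```python
import unicodedata


def skeleton(word):
    """Upper-case form of a word with all combining diacritics removed."""
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.upper()


def classify(word, dictionary):
    """Label a word as known, known apart from its diacritics, or unknown."""
    word = unicodedata.normalize("NFC", word)
    if word in dictionary:
        return "ok"
    if skeleton(word) in {skeleton(entry) for entry in dictionary}:
        return "wrong accent or breathing"
    return "unknown"


# Hypothetical two-word dictionary, stored in NFC form.
dictionary = {unicodedata.normalize("NFC", w) for w in ["χορός", "λόγος"]}

known = classify("χορός", dictionary)
misaccented = classify("χόρος", dictionary)
garbage = classify("χxρος", dictionary)
```

A word in the middle category is a good candidate for a diacritic-only suggestion rather than a full spelling correction.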

Alignment with other editions
When another edition of the same work is available, the two editions are aligned word by word by applying the Needleman-Wunsch algorithm.
ὁ Γαδαρεὺς ἐν ταῖς Χάρισιν ἐπιγραφομέναις ἔφη τὸν Ομηρον Σύρον ὄντα τὸ
ὁ Γαδαρεὺς ἐν τ αχς Σάρισιν ἐπιγραφομέναις ἔφη τὸν Ομeρον Σύρον ὄντα τὸ
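A compact word-level Needleman-Wunsch alignment can be sketched as follows. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example; the slides do not state which scores the system uses, and a real aligner would probably score near-matches by character similarity rather than strict equality.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists; gaps are returned as None."""

    def pair_score(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner, preferring the diagonal.
    pairs, i, j = [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])
        ):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# The opening words of the slide's example: reference edition and OCR output.
edition = "ὁ Γαδαρεὺς ἐν ταῖς Χάρισιν".split()
ocr = "ὁ Γαδαρεὺς ἐν τ αχς Σάρισιν".split()
pairs = align(edition, ocr)
```

Pairs whose two members differ, or where one side is `None`, mark the places where the OCR reading and the other edition disagree.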

False negatives and the risk of digital contaminatio
An example: Rigaudon, run on the Anecdota Graeca edited by Cramer, recognizes the word χόρος, which is rejected by the current spell-checker.
The spell-checker suggests χορός as a correction.
The alignment with Koster's edition of the Prolegomena de comoedia also suggests χορός.
But the page image contains χόρος, a late form attested from Athenaeus to the Byzantine period.

Syllabication
In order to prevent false negatives due to (rare) variants ignored by the dictionaries, the system distinguishes between random character sequences and well-formed syllabic sequences.
Each potential error is divided into syllables, and each syllable is evaluated according to its position.
For example, χό-ρος is a well-formed syllabic sequence: χό- is a valid Greek initial syllable and -ρος is a valid Greek final syllable.

Overview
1. Information Aggregation
2. Proof-reader Web Application: the web interface; cues; self-corrections
3. False positives

Centralization
The proof-reader is a web application inspired by the Mozilla hocr Editor interface, but it employs the WikiSource collaborative philosophy.
Texts are stored in a central native XML database.

The Control Panel

Image / Text Pairs

Cues
Wrong accents and breathing marks: attention is focused on diacritics.
Self-corrections: special care is necessary to avoid the risk of contaminatio.
Errors: random errors.

Example

Self-corrections and suggestions generated by alignment
In a self-correction, the reading has been replaced by the aligned word of another edition.
A self-correction requires three conditions:
the character sequence is rejected by the spell-checker
the edit distance between the character sequence and the aligned edition is very small
the character sequence is not a well-formed syllabic sequence
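The three conditions combine into a single test, sketched below. The threshold `max_distance=2` is an assumption (the slides say only that the distance must be very small), and the spell-checker and syllabic check are passed in as stand-in callables rather than implemented.

```python
def levenshtein(a, b):
    """Plain edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def should_self_correct(reading, aligned, in_dictionary, is_well_formed,
                        max_distance=2):
    """Apply the three conditions listed on the slide."""
    return (
        not in_dictionary(reading)
        and levenshtein(reading, aligned) <= max_distance
        and not is_well_formed(reading)
    )


# Stand-ins for the real spell-checker and syllabic repertory.
late_form = "χόρος"
known_words = {"Ομηρον", "χορός"}


def in_dictionary(word):
    return word in known_words


def is_well_formed(word):
    return word == late_form


fix_ocr_error = should_self_correct("Ομeρον", "Ομηρον", in_dictionary, is_well_formed)
keep_variant = should_self_correct(late_form, "χορός", in_dictionary, is_well_formed)
```

The third condition is what protects a genuine but rare form such as χόρος: it fails the spell-check and sits close to the aligned word, yet it is syllabically well formed, so it is not overwritten.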

Example

Dynamic Dictionaries
The dictionaries used by the spell-checker are dynamically rebuilt when a milestone in proof-reading is reached.
As the dictionaries grow, rare variants are acquired and used to spell-check subsequent works.

Overview
1. Information Aggregation
2. Proof-reader Web Application
3. False positives

False positives are deceitful
By definition, false positives pass the spell-check.
Especially when they are graphically similar to the correct word, such as δ and ὁ in Greek or m and ni in Latin, they are difficult to spot, in particular for proof-readers not skilled in the target language(s).

Example

Semantic Distance
Semantic distance is calculated along the nodes of WordNet's hierarchy, i.e. along the chain of hyponyms / hypernyms, in order to reach co-hyponyms.
Different translations of the same concept (e.g. vis in Latin and efficacia in Italian or efficacy in English) have a semantic distance of zero.
Semantically unrelated words (e.g. vinum in Latin and efficacia in Italian) have a large semantic distance.
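The idea of counting links along the hierarchy can be illustrated with a small breadth-first search. The synset names and the hypernym chain below are invented toy data, not taken from any WordNet, and the function name is hypothetical.

```python
from collections import deque


def semantic_distance(word_a, word_b, synset_of, hypernym_of):
    """Number of hypernym/hyponym links between the synsets of two words."""
    start, goal = synset_of[word_a], synset_of[word_b]
    # Treat each hypernym link as an undirected edge.
    neighbours = {}
    for child, parent in hypernym_of.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None  # the two synsets are not connected


# Toy data: translations of one concept share a synset.
synset_of = {
    "vis": "power", "efficacia": "power", "efficacy": "power",
    "vinum": "wine",
}
hypernym_of = {
    "power": "quality", "quality": "attribute", "attribute": "entity",
    "wine": "beverage", "beverage": "substance", "substance": "entity",
}

same_concept = semantic_distance("vis", "efficacia", synset_of, hypernym_of)
unrelated = semantic_distance("vinum", "efficacia", synset_of, hypernym_of)
```

In this toy hierarchy the translations share a synset and so are at distance zero, while the unrelated pair is connected only through the root.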

AncientWordNet
Synsets of AncientGreekWordNet and LatinWordNet have been extracted from bilingual dictionaries.
They are aligned to modern languages such as English, Italian, etc.

Conclusion
The proof-reading Web Application brings together the main features of the individual and collaborative proof-reading tools currently available.
The entire work-flow is circular: Training OCR - Performing OCR - Spell-checking OCR - Correcting OCR - Enlarging dictionaries - Retraining OCR

Thank you for your attention