Match Graph Generation for Symbolic Indirect Correlation


Slide 1: Match Graph Generation for Symbolic Indirect Correlation
Daniel Lopresti (1), George Nagy (2), and Ashutosh Joshi (2)
(1) Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015
(2) Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180
Lopresti, Nagy, and Joshi, January 2006

Slide 2: Symbolic Indirect Correlation (SIC)
SIC is a new pattern recognition paradigm:
- Symbolic, because it exploits the ordering of matches in lexical (symbolic) strings.
- Indirect, because it is based on two levels of comparisons.
- Correlation, because it can be viewed as making use of sliding windows.
SIC is still a relatively new idea and largely untested.

Slide 3: Outline of SIC Approach
1. Lexical Matching: match polygrams in every lexicon word against the transcription of the reference signal (offline preprocessing).
2. Feature Matching: match feature strings derived from the query and reference signals.
3. Graph Matching: match the feature graph (Step 2) against the lexical graphs (Step 1) for each word in the lexicon.
4. Result: output the best-matching lexicon word from Step 3.

Slide 4: Lexical Match Graph*
[Figure: match graph linking a lexicon word to the reference string.]
Note that there is an edge for every match of bigram or better.
* All match graphs shown in this presentation and the paper were generated automatically by running the algorithms in question; none were drawn by hand.
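The "bigram or better" rule can be illustrated with a short, self-contained sketch (not the authors' implementation): every common substring of length two or more between a lexicon word and the reference string contributes one edge.

```python
def lexical_match_edges(word, reference, min_len=2):
    """Enumerate (word_span, reference_span) pairs for every common substring of
    length >= min_len ("bigram or better"); each pair is one lexical match graph edge.
    Simple O(len(word) * len(reference) * L) sketch; a real implementation would
    keep only maximal matches rather than every extension point."""
    edges = []
    for i in range(len(word)):
        for j in range(len(reference)):
            k = 0
            while (i + k < len(word) and j + k < len(reference)
                   and word[i + k] == reference[j + k]):
                k += 1
            if k >= min_len:
                edges.append(((i, i + k), (j, j + k)))
    return edges

# Toy example: substrings of "graph" that also occur in the reference text.
print(lexical_match_edges("graph", "a paragraph describing graph matching"))
```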

Slide 5: SIC Example
[Figure: paired match graphs for an unknown input.]
Lexical domain: an edge for every match of bigram or better.
Signal domain: an edge for every match of sufficient weight.

Slide 6: SIC Advantages
- Matches are based on signal subsequences of any length, although typically longer than single characters or phonemes.
- Common distortions in handwriting, camera- and tablet-based OCR (stretching, contraction), and speech (time-warping) can be accommodated.
- Independent of medium, feature set, and vocabulary.
- No training, only a reference set (as in Nearest Neighbor), thus allowing unsupervised adaptation.
- Extensible to phrase recognition.

Slide 7: Present Study
SIC performance is affected by errors at any stage. For this study, we bypass the final stages of SIC and compare the results of match graph generation directly.

Slide 8: Approximate String Matching
SIC uses the Smith-Waterman string matching algorithm.* Note that this differs from the more widely known Wagner-Fischer (Needleman-Wunsch) version in that it allows for multiple matches that can start and end anywhere.
* "Identification of common molecular sequences," T. F. Smith and M. S. Waterman, Journal of Molecular Biology, vol. 147, pp. 195-197, 1981.
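For reference, a compact sketch of the Smith-Waterman local-alignment recurrence cited above; the match/mismatch/gap scores used here are illustrative assumptions, not the parameters used in the paper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b.
    Unlike Needleman-Wunsch (global alignment), every cell is floored at 0,
    so an alignment may start and end anywhere in either sequence."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# Toy example: local match between a query word and a longer reference string.
print(smith_waterman_score("splashiness", "a splash of brightness"))
```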

Slide 9: Lexical Distance Matrix Example
We have developed a series of visualizations for reviewing the results of intermediate steps in the computation.*
[Figure: lexical distance matrix.]
* Again, note that this graph was generated automatically by running the algorithm in question.

Slide 10: Signal Features
To evaluate match graph generation, we performed a pilot study using synthesized images of text strings. Features are adapted from the set used by Manmatha and Rath for offline handwriting:*
- Black pixel density
- Upper text contour
- Lower text contour
- 0-1 transitions
* "Indexing Handwritten Historical Documents - Recent Progress," R. Manmatha and T. Rath, Proceedings of the Symposium on Document Image Understanding, pp. 195-197, 2003.
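These four profiles can be computed column by column from a binarized word image. The sketch below is one plausible reading (assuming ink = 1 and per-column normalization by image height); it is not the feature extractor used in the study.

```python
import numpy as np

def column_features(img):
    """Per-column features of a binarized word image (1 = ink), in the spirit of the
    Manmatha-Rath profiles listed above; a sketch, not the extractor from the paper.
    Returns an array of shape (width, 4): ink density, upper contour, lower contour,
    and number of 0-to-1 (background-to-ink) transitions per column."""
    img = np.asarray(img, dtype=bool)
    height = img.shape[0]
    rows = np.arange(height)[:, None]
    has_ink = img.any(axis=0)

    density = img.mean(axis=0)                                                            # black pixel density
    upper = np.where(has_ink, np.where(img, rows, height).min(axis=0), height) / height   # upper text contour
    lower = np.where(has_ink, np.where(img, rows, -1).max(axis=0), 0) / height            # lower text contour
    transitions = (np.diff(img.astype(int), axis=0) == 1).sum(axis=0)                     # 0-1 transitions
    return np.stack([density, upper, lower, transitions], axis=1)

# Toy example: a 5x8 image containing a single rectangular ink stroke.
toy = np.zeros((5, 8), dtype=int)
toy[1:4, 2:6] = 1
print(column_features(toy))
```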

Slide 11: Visualization of Distance Matrices
[Figures: result of the lexical comparison; result of the signal comparison.]

Slide 12: Resulting Match Graphs
[Figures: lexical-domain and signal-domain match graphs.]
Note that these match graphs correspond perfectly.

Slide 13: Match Graph Errors
The real world is rarely so cooperative, however.
[Figures: lexical domain with a missed edge; signal domain with added edges.]

Slide 14: SIC Evaluation
- Employ synthesized TIF bitmaps of known strings.
- Reference strings = 100 random proverbs.
- Query strings = 100 random words from YAWL.*
- Compare match graphs, counting missing/added edges.
- Recall = percentage of lexical match graph edges correctly represented in the signal match graph.
- Precision = percentage of signal match graph edges truly present in the lexical match graph.
- Total match graphs tested = 10,000 (= 100 × 100).
* Yet Another Word List, http://www.ibiblio.org/pub/linux/libs/.
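Under these definitions, recall and precision reduce to set operations on edge identifiers. A minimal sketch follows, assuming edges are represented as hashable identifiers such as span pairs (an assumption for illustration).

```python
def edge_recall_precision(lexical_edges, signal_edges):
    """Recall  = fraction of lexical match graph edges also found in the signal match graph.
    Precision = fraction of signal match graph edges that are true lexical edges.
    Edges are assumed to be hashable identifiers, e.g. (word_span, reference_span) pairs."""
    lexical, signal = set(lexical_edges), set(signal_edges)
    common = lexical & signal
    recall = len(common) / len(lexical) if lexical else 1.0
    precision = len(common) / len(signal) if signal else 1.0
    return recall, precision

# Toy example: one missed edge and one added edge -> recall and precision both 2/3.
print(edge_recall_precision({"e1", "e2", "e3"}, {"e1", "e2", "e4"}))
```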

Slide 15: SIC Results
[Plot: recall and precision vs. threshold (20-50), where the threshold is the point at which a potential match in the signal distance matrix gets classified as a match graph edge. Accuracy at ERR ~81%.]

Slide 16: Most Frequent Edge Effects
Tabulating the various effects we saw at the optimal threshold:
- Missed edges are due largely to thin characters (e.g., "i").
- Spurious edges are due to feature similarity, including character prefixes and suffixes (e.g., "h" vs. "n").

Slide 17: More Challenging Evaluation
SIC is proposed for handling hard-to-segment inputs. We repeat the same experiment, only this time using highly condensed text strings.

Slide 18: SIC Results (Condensed Text)
[Plot: recall and precision vs. threshold (10-40) for condensed text. Accuracy at ERR ~29%.]

Slide 19: Conclusions
- The Smith-Waterman approach appears to be the right model for building match graphs.
- Current problems lie with the feature representation. Some issues may be challenging to surmount (e.g., the suffix of "h" will always resemble the suffix of "n").
- On the other hand, the final stage of SIC has the ability to overcome a certain number of errors.
- Future work includes exploring the connection between match graph errors and the overall SIC error rate, as well as extending the evaluation to real handwriting and scanned text inputs (appropriately ground-truthed).

Slide 20: Visualizing Multiple Matching Results
Results of comparing the signal input "splashiness" to 10 different reference strings:
- Each reference string corresponds to a set of colored bars.
- Each colored bar records the starting and ending positions of one match along the signal input.

Slide 21: Visualizing Multiple Matching Results
Results of comparing the signal input "splashiness" to 10 different reference strings:
- Each match corresponds to a single data point.
- The x-coordinate records the starting position; the y-coordinate records the ending position.
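One possible rendering of this start/end scatter view with matplotlib is sketched below; the input format (a mapping from reference label to a list of (start, end) pairs) is an assumption for illustration, not the authors' tool.

```python
import matplotlib.pyplot as plt

def plot_match_points(matches_by_reference):
    """Scatter plot in which each match is one point: x = starting position and
    y = ending position along the query signal, one color per reference string."""
    for label, spans in matches_by_reference.items():
        if spans:
            starts, ends = zip(*spans)
            plt.scatter(starts, ends, s=20, label=label)
    plt.xlabel("match start position")
    plt.ylabel("match end position")
    plt.legend(fontsize="small")
    plt.show()

# Toy example with two reference strings.
plot_match_points({"ref 1": [(3, 9), (12, 15)], "ref 2": [(0, 4), (7, 14)]})
```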

Slide 22: Web Browser Interface to SIC Results
- Each table row corresponds to one query-reference comparison.
- Thumbnail images are clickable, linking to high-resolution versions.