Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Koshi Odagiri and Yoichi Muraoka
Graduate School of Fundamental/Computer Science and Engineering, Waseda University, Tokyo, Japan

Abstract

In this paper, we propose a vision-based approach to recognizing Japanese vowels. Previous research dealt with lip size, lip width, and lip height; our method instead deals with lip shape. We focus on temporal changes of lip shape and define a new feature value for recognizing vowels. Many conventional studies use datasets captured in controlled environments, for example in well-lighted rooms or with lipstick applied to the speaker. In contrast, we use Active shape models to extract the lip area and calculate feature values, so our technique is not influenced by the environment, and we show that the feature values are robust. In our experiments, the approach obtained about 80% average accuracy, which is the same rate as vowel recognition by Japanese people who use lip reading. We conclude that our method can support speech recognition.

Keywords: lip reading, vowel recognition, lip extraction

1. Introduction

Audio speech recognition is now well developed and is used in game hardware, car navigation systems, and cell phones, but such systems cannot be used in noisy environments. Hearing-impaired people mostly rely on sign language, but some use lip reading; this shows that visual information can improve the performance of audio speech recognition under bad conditions.

Recognizing the mouth area is very important for lip reading. We classify recognition methods into two types: color-based recognition, such as the snake algorithm [1], and model-based recognition, such as Active shape models [2]. Color-based recognition is influenced by the brightness of the environment. Model-based recognition is not influenced by lighting, but needs training datasets of faces.

Lip reading experiments can be classified into four types: letter recognition, word recognition, sentence recognition, and semantic recognition. Japanese is written in hiragana letters and its grammar is loosely constrained, so sentence recognition and semantic recognition are not robust and need large training datasets. Japanese pronunciations consist of hiragana syllables, and Japanese speakers produce clearly different mouth shapes for the vowels; almost all sounds are based on the 5 vowels /a/, /i/, /u/, /e/, and /o/. Single sound recognition of vowels is therefore important. There are two types of single sound recognition: recognition from static lip images, and tracking temporal changes of the lip. In this paper, we propose a letter recognition method for lip reading that focuses on temporal changes of lip shape using model-based lip extraction.

2. Related works

In this section, we discuss previous related work and show the direction of our method.

Uchimura's study [3] is letter recognition using static image recognition. They use histograms of gray-scale images to recognize the lip area, and their letter recognition method uses mouth size and mouth width. Because they use static images of the lip, segmenting the boundaries between letters is difficult, and the method is unsuitable for extension to word or sentence recognition.

Saitoh and Konishi's study [4] uses a color-based method, and their letter recognition uses temporal changes of lip size and lip aspect ratio.
Their method achieved 93.8% accuracy on average, but it is not robust because it is color-based.

Fig. 1: Lip area extraction by a color-based method

Figures 1 and 2 show results of lip area extraction using a color-based method: we extracted the lip from the RGB information of the image. Figure 1 shows that this method can capture almost all of the lip area, but it also includes non-lip regions. In figure 2, we changed the threshold of the color comparison.

Fig. 2: Lip area extraction with a different threshold

The figures show that this color-based algorithm is clearly influenced by the background and by the tuning of the thresholds.
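To illustrate the fragility the figures demonstrate, the following is a minimal sketch of the kind of RGB-threshold lip extraction we experimented with. The threshold constants and the function name are illustrative assumptions, not the exact values used in our experiment.

# Minimal sketch of color-based lip extraction by RGB thresholding.
# The thresholds below are illustrative assumptions; they must be
# re-tuned per lighting condition and background, which is exactly
# the weakness that figures 1 and 2 illustrate.
import cv2
import numpy as np

def extract_lip_mask_rgb(image_bgr, r_min=150, rg_margin=30):
    """Keep pixels whose red channel is strong and dominates green,
    a crude cue for reddish lip pixels."""
    img = image_bgr.astype(np.int16)
    g, r = img[..., 1], img[..., 2]          # OpenCV stores BGR
    mask = (r > r_min) & (r - g > rg_margin)
    return mask.astype(np.uint8) * 255

if __name__ == "__main__":
    frame = cv2.imread("face.png")           # any face image
    tight = extract_lip_mask_rgb(frame)
    # A small change of thresholds changes the mask drastically,
    # as in figure 2, because skin and background colors overlap.
    loose = extract_lip_mask_rgb(frame, r_min=120, rg_margin=15)
    cv2.imwrite("lip_mask_tight.png", tight)
    cv2.imwrite("lip_mask_loose.png", loose)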

Fig. 3: Lip area extraction by Active shape models

We instead propose a model-based lip extraction method. Figure 3 shows lip extraction by Active shape models on the same face image as above. Clearly, the model-based method extracts the lip area correctly and in detail, which lets our method handle lip shape much more precisely. As mentioned in the previous section, Japanese pronunciation is built from hiragana syllables, and the differences between consonants are very small. Uchimura's method, based on mouth size and width, and Saitoh's method, based on mouth size and aspect ratio, are therefore unsuitable for extension to consonant recognition. We propose a robust method that combines model-based lip extraction with tracking of temporal changes of feature points on the lip shape to recognize vowels, and this method solves the problems above.

3. Method

We use a model-based method for lip area extraction in these experiments. In this section, we propose a method for recognizing utterances from visual information using lip features.

3.1 Initialization

First, we train Active shape models on faces using 68 points, and use 19 of those points as lip features. Figure 3 shows the 68 points learned by Active shape models. In this experiment, we define an utterance section as one segment between a mouth closure and the next mouth closure. Experimentally, one section spans about 30 to 70 frames, so we normalize every section to 50 frames. To compensate for head movement, we also normalize mouth size and inclination using the width between the feature points at both corners of the closed mouth contour in the first frame.

3.2 Feature value

To track temporal changes, we compute a feature value from the feature points of the lip contours, including the inside of the mouth. Figure 4 shows our definition of the feature value in these experiments: the feature value is the distance between the center point of the contour and each feature point, so the feature values encode where the feature points lie.

Fig. 4: Features of the lip area and the feature value

The feature value is formulated as

    V = sqrt((α_x − C_x)² + (α_y − C_y)²)    (1)

where V is the feature value, α is a feature point, and C is the center feature point of the mouth.

3.3 Relation between feature values

In this paragraph, we examine the relation between feature values. Figure 5 compares the feature values of 5 different people, computed as in the previous paragraph for the top feature point of the mouth during /a/; the values change strongly at the vowel /a/. Figure 6 shows the temporal changes for the different vowels; the differences between vowels are visible in the figure.

Fig. 5: Feature values of /a/ by 5 people

Fig. 6: Feature values of the vowels

Considering the previous two graphs, we can recognize vowels from the feature values we propose, which are obtained by formula (1).

3.4 Learning values

We compute the average of the feature values from formula (1) for each vowel, and use those averages to recognize an input vowel. The learned data are obtained by

    D_tvp = (1/N) Σ_{n=1..N} V_ntp    (2)

where D_tvp is the learned feature value at time t of vowel v for feature point p, N is the number of training samples, and V_ntp is the value obtained for sample n by formula (1).

3.5 Matching method

For vowel recognition, we use the following formula to score how closely each vowel matches the input:

    S_v = Σ_{t=0..T} Σ_{n=1..19} |X_tvn − D_tvn|    (3)

where S_v is the evaluated value of vowel v, T is the number of frames, X_tvn is the feature value of the input, and D_tvn is the learned value from formula (2). We evaluate every vowel by formula (3), and the vowel with the smallest evaluated value is selected as the match for the input.
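To make formulas (1) to (3) concrete, the following is a minimal Python sketch under illustrative assumptions of our own: the 19 lip feature points per frame are assumed to be already extracted by Active shape models and every utterance already normalized to 50 frames (section 3.1), and the mouth center C is approximated here by the mean of the lip landmarks rather than by a specific center feature point.

# Minimal sketch of formulas (1)-(3); lip landmarks are assumed given.
import numpy as np

FRAMES = 50   # utterance sections are normalized to 50 frames (3.1)
POINTS = 19   # 19 lip feature points out of the 68 ASM points

def feature_values(landmarks):
    """Formula (1): distance of each lip point to the mouth center.
    landmarks has shape (FRAMES, POINTS, 2) holding (x, y) pairs.
    The center feature point is approximated by the landmark mean."""
    center = landmarks.mean(axis=1, keepdims=True)      # (FRAMES, 1, 2)
    return np.linalg.norm(landmarks - center, axis=2)   # (FRAMES, POINTS)

def learn_vowel(training_utterances):
    """Formula (2): average the feature values over N training samples."""
    feats = np.stack([feature_values(u) for u in training_utterances])
    return feats.mean(axis=0)                           # (FRAMES, POINTS)

def match_vowel(landmarks, templates):
    """Formula (3): choose the vowel whose learned template is closest
    (smallest summed absolute difference) to the input."""
    x = feature_values(landmarks)
    scores = {v: np.abs(x - d).sum() for v, d in templates.items()}
    return min(scores, key=scores.get)

# Usage sketch with random stand-in data:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = {v: [rng.random((FRAMES, POINTS, 2)) for _ in range(3)]
             for v in "aiueo"}
    templates = {v: learn_vowel(us) for v, us in train.items()}
    print(match_vowel(rng.random((FRAMES, POINTS, 2)), templates))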

4. Experiments

In this section, we implement our method, run experiments, and discuss the results of our system.

4.1 Setup

We implemented a system based on the proposed method, divided into the following two parts.

Fig. 7: Chart of the learning part of the system

Figure 7 is a chart of the learning part of our system: we input a vowel, calculate feature values by our method, and store those values in the database.

Fig. 8: Chart of the estimating part of the system

Figure 8 is a chart of the estimating part of our system. The estimating part shares the same feature calculation steps as the learning part, but then adds a comparing step: the input is compared against the learned database with the matching method of section 3.5. Finally, the system outputs the estimated answer.
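Continuing the previous sketch, the two parts shown in figures 7 and 8 map onto a learning phase that stores the templates to disk and an estimating phase that loads them back and compares. Persisting with np.savez is one illustrative choice; the system only requires that the learned values be stored somewhere.

# Sketch of the two system parts (figures 7 and 8), reusing
# feature_values/learn_vowel/match_vowel from the previous sketch.
import numpy as np

def learning_part(utterances_by_vowel, path="templates.npz"):
    """Fig. 7: extract features, learn the averages, store to disk."""
    templates = {v: learn_vowel(us)
                 for v, us in utterances_by_vowel.items()}
    np.savez(path, **templates)   # file format is our assumption

def estimating_part(landmarks, path="templates.npz"):
    """Fig. 8: extract features, compare with learned data, output."""
    with np.load(path) as data:
        templates = {v: data[v] for v in data.files}
    return match_vowel(landmarks, templates)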

Table 1: Environment of the experiments
    OS: Windows 7 Professional 64-bit edition
    CPU: Intel Core 2 Extreme X9650
    Memory: 4 GByte
    Camera: Logicool 2-MP Webcam C600h
    Camera resolution: 640 px x 480 px
    FPS during capture: 30 fps

Our system ran in the environment shown in table 1. We used a web camera, so the system ran with a camera poorer than that of an iPhone 4.

We captured 20 people speaking the 5 vowels in front of the camera, 3 times each, and used the data of 15 of those people as the valid dataset. A recording counts as valid when it is not blurred and its feature points can be recognized by Active shape models. The datasets were captured against various backgrounds, such as laboratories, houses, and meeting rooms.

In our experiment, we used leave-one-out cross-validation, and we evaluated the following 2 situations (a sketch of situation (2) appears at the end of this section):
1) training on all captured utterances except the input utterance;
2) training on all speakers except the input speaker.

4.2 Results

Table 2 shows the results of the above experiments.

Table 2: Results of our experiments (the accuracy columns correspond to evaluations (1) and (2) above)
    Vowel     Accuracy (1)   Accuracy (2)
    /a/       76%            75%
    /i/       92%            90%
    /u/       67%            69%
    /e/       84%            82%
    /o/       72%            76%
    Average   78.2%          78.4%

The average accuracy rates are close to 80%. In Sekiyama's research [5], the average accuracy of vowel recognition by Japanese people who use lip reading is about 80%, so our study reaches the same level of accuracy. The wrong estimations mostly confused /a/ with /o/ and /u/ with /o/; the same confusions are often reported in other papers.

4.3 Discussion

Fig. 9: Comparison between two trained datasets

Figure 9 compares the temporal changes of the feature point that differs most between the two training datasets defined in section 4.1. There is almost no difference between the two datasets, so our method produces robust feature values and can deal with vowels of unknown speakers.

Fig. 10: Feature values of vowels

Figure 10 shows one of the largest differences in feature values between /u/ and /o/; we examine the /u/-versus-/o/ case as an example of wrong estimation. The figure shows that the /o/ template is closer than the /u/ template to the input vowel /u/. We see two reasons for this. The first is the precision of Active shape models: when extracting lip feature points, the method occasionally tracks the wrong face model. This occurs because the face training dataset is not large enough, and it also blurs the feature points. The second is a speaker problem: in our experiments, some people tended not to open their mouths widely when speaking, which makes the differences between vowels too small. Blurred feature points therefore cause our system to output wrong recognitions.

We compare our results with the two studies [3][4] mentioned in section 2. Table 3 shows the results of those studies. Our average accuracy is lower than that of the related works, but for some vowels our method is superior.

Table 3: Results of the related works
    Vowel     Uchimura's study   Saitoh's study
    /a/       90%                95.8%
    /i/       70%                91.8%
    /u/       100%               96.9%
    /e/       100%               88.3%
    /o/       70%                96.2%
    Average   86%                93.8%
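As a sketch of evaluation situation (2), leave-one-speaker-out scoring can be organized as follows, again reusing learn_vowel and match_vowel from the earlier sketch; the nested data layout is an illustrative assumption.

# Sketch of leave-one-speaker-out evaluation (situation (2)).
# data[speaker][vowel] is assumed to be a list of landmark arrays of
# shape (FRAMES, POINTS, 2), one per captured repetition.
def leave_one_speaker_out(data):
    correct = total = 0
    for held_out in data:
        # Learn templates from every speaker except the held-out one.
        train = {v: [u for s in data if s != held_out
                     for u in data[s][v]]
                 for v in "aiueo"}
        templates = {v: learn_vowel(us) for v, us in train.items()}
        # Test on every utterance of the held-out speaker.
        for v in "aiueo":
            for utterance in data[held_out][v]:
                correct += match_vowel(utterance, templates) == v
                total += 1
    return correct / total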

5. Conclusion

We have described a vowel recognition method based on tracking temporal changes of lip feature points. The results show that our method produces robust feature values for Japanese vowel recognition, and we conclude that it is widely applicable to lip reading systems. As noted in the previous section, lip tracking by Active shape models is sometimes blurred, so there is room to improve that tracking. This method was evaluated only on vowels; we will extend it to consonants as the next step, and to word and sentence recognition in the future.

References

[1] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4), pp. 321-331, 1988.
[2] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1), pp. 38-59, 1995.
[3] Keiichi Uchimura, Junji Michida, Masami Tokou, and Teizo Aida. Discrimination of Japanese vowels by image analysis. The Transactions of the Institute of Electronics, Information and Communication Engineers, pp. 2700-2702, 1988.
[4] Takeshi Saitoh, Mitsugu Hisaki, and Ryosuke Konishi. Japanese phone classification based on mouth cavity region. IEICE Technical Report, pp. 161-166, 2007.
[5] Kaoru Sekiyama, Kazuki Joe, and Michio Umeda. Lipreading Japanese syllables. ITEJ Technical Report, 12(1), pp. 33-40, 1988.