Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Koshi Odagiri 1, and Yoichi Muraoka 1
1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University, Tokyo, Japan

Abstract
In this paper, we propose a vision-based approach to recognizing Japanese vowels. Previous studies have dealt with lip size, lip width, and lip height; our method instead deals with lip shape. We focus on temporal changes of lip shape and define a new feature value for recognizing vowels. The datasets of many conventional studies were captured under specific conditions, such as well-lighted rooms or with lipstick applied to the speaker. In contrast, we use Active Shape Models to extract the lip area and compute feature values, so our technique is not influenced by the environment, and we show that the feature values are robust. In our experiments, an average accuracy of about 80% was obtained, which matches the vowel recognition rate of Japanese people who use lip reading. We conclude that our method can support speech recognition.

Keywords: lip reading, vowel recognition, lip extraction

1. Introduction
Audio-based speech recognition systems are now widely deployed in game hardware, car navigation systems, and cell phones; however, they cannot be used in noisy environments. Speech communication for hearing-impaired people is mostly based on sign language, but some people use lip reading. Visual information can therefore improve the performance of audio speech recognition under adverse conditions. Recognizing the mouth area is very important for lip reading. We classify recognition methods into two types: color-based recognition, such as the snake algorithm [1], and model-based recognition, such as Active Shape Models [2]. Color-based recognition is influenced by the brightness of the environment. Model-based recognition, on the other hand, is not influenced by lighting, but it requires a training dataset of faces.
Lip-reading experiments can be classified into four types: letter recognition, word recognition, sentence recognition, and semantic recognition. Japanese is written with hiragana letters and has a relatively free grammar, so sentence recognition and semantic recognition are not robust and require large training datasets. Japanese pronunciation is built from hiragana syllables, and Japanese speakers show distinct mouth shapes for the vowels; almost all sounds are based on the five vowels /a/, /i/, /u/, /e/, and /o/. Recognition of single vowel sounds is therefore important. There are two types of single-sound recognition: recognition from static lip images, and tracking of temporal changes of the lip. In this paper, we propose a letter-recognition method for lip reading that focuses on temporal changes of lip shape using model-based lip extraction.

2. Related works
In this section, we discuss related previous work and indicate the direction of our method. Uchimura's study [3] performs letter recognition from static images. They use histograms of gray-scale images to locate the lip area, and their letter-recognition method uses mouth size and mouth width. Because static lip images are used, segmenting the boundaries between letters is difficult, and the method is unsuitable for extension to word or sentence recognition. Saitoh and Konishi's study [4] uses a color-based method, and their letter recognition uses temporal changes of lip size and lip aspect ratio. Their method achieved 93.8% accuracy on average, but it is not robust because it is color-based.

Fig. 1: Lip area extraction by color-based method

Figures 1 and 2 show results of lip-area extraction using a color-based method; we performed the extraction using the RGB information of the image. Figure 1 shows that this method captures almost all of the lip area, but also non-lip areas. In figure 2, we changed the threshold of the color comparison.
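The color-based extraction we tested can be sketched as a simple per-pixel ratio test. This is a minimal illustration, not the exact rule of our experiment: the red/green ratio rule and the threshold value of 1.4 are assumptions made for the sketch.

```python
import numpy as np

def lip_mask(image, threshold=1.4):
    """Naive color-based lip segmentation: mark pixels whose red channel
    dominates the green channel. The threshold is an illustrative guess;
    in practice it must be re-tuned per scene and lighting condition,
    which is exactly the fragility discussed above."""
    img = image.astype(np.float64)
    r, g = img[..., 0], img[..., 1]
    # Guard against division-by-zero-like behavior on black pixels.
    return r > threshold * np.maximum(g, 1.0)

# A 2x2 toy image: one reddish "lip" pixel, three background pixels.
toy = np.array([[[200, 90, 90], [120, 120, 120]],
                [[ 80, 80, 80], [ 60,  70,  60]]], dtype=np.uint8)
mask = lip_mask(toy)   # only the reddish pixel is marked
```

Changing `threshold` shifts pixels in or out of the mask wholesale, which is why the results in figures 1 and 2 differ so strongly.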
Fig. 2: Lip area extraction by different threshold

The figures show that this color-based algorithm is clearly influenced by the background and by the tuning of the thresholds.

Fig. 3: Lip area extraction by active shape models

In contrast, we propose a model-based lip-extraction method. Figure 3 shows lip extraction by Active Shape Models on the same face image as above. The model-based method clearly extracts the lip area correctly and in detail, and our method handles lip shapes still more precisely. As mentioned in the previous section, Japanese pronunciation is built from hiragana syllables, and the differences between consonants are very small. Therefore Uchimura's method, based on mouth size and width, and Saitoh's method, based on mouth size and aspect ratio, are unsuitable for extension to consonant recognition. We propose a robust method, based on model-based lip extraction and tracking of temporal changes of feature points on the lip shape, to recognize vowels; our method solves the above problems.

3. Method
We use a model-based method for lip-area extraction in these experiments. In this section, we propose a method for recognizing utterances from visual information using lip features.

3.1 Initialization
First, we train the Active Shape Models on faces using 68 points, of which 19 are used as features. Figure 3 shows the 68 points learned by the Active Shape Models. In this experiment, we define an utterance section as one segment from mouth closure to the next mouth closure. Experimentally, one section contains about 30 to 70 frames, so we normalize each section to 50 frames. To compensate for mouth movement, we also normalize mouth size and inclination using the width between the features at both corners of the closed mouth contour in the first frame.

3.2 Feature value
To track temporal changes, we compute a feature value from the features of the lip contours, including the inside of the mouth. Figure 4 shows our definition of the feature value in these experiments: the feature value is the distance between the center point of the contour and each feature point, so it encodes where each feature lies.

Fig. 4: Features of lip area and feature value

The feature values are formulated as

    V = √((α_x − C_x)² + (α_y − C_y)²)   (1)

where V is the feature value, α is a feature point, and C is the center feature of the mouth.

3.3 Relation between feature values
In this paragraph, we examine the relation between feature values. Figure 5 compares the feature values of five different people, calculated as in the previous paragraph, for the top feature of the mouth during /a/. These features change largely for the vowel /a/. In addition, figure 6 shows the temporal changes for the different vowels; the differences between vowels are visible in the figure.
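The two preprocessing steps of section 3 can be sketched as follows. The distance computation follows formula (1); the use of linear interpolation for stretching a section to 50 frames is our assumption, since the paper only states that sections are adjusted to 50 frames.

```python
import numpy as np

def feature_values(points, center):
    """Formula (1): Euclidean distance from the mouth-center feature C
    to each lip-contour feature alpha. `points` has shape (K, 2) for
    K contour features; `center` has shape (2,)."""
    diffs = points - center
    return np.sqrt((diffs ** 2).sum(axis=1))

def resample_section(values, target_frames=50):
    """Normalize a per-frame value sequence (30-70 frames in our data)
    to a fixed 50 frames. Linear interpolation is assumed here for
    illustration."""
    src = np.linspace(0.0, 1.0, num=len(values))
    dst = np.linspace(0.0, 1.0, num=target_frames)
    return np.interp(dst, src, values)

# Toy example: two "contour" points measured from the origin.
pts = np.array([[3.0, 4.0], [0.0, 5.0]])
c = np.array([0.0, 0.0])
print(feature_values(pts, c))   # [5. 5.]
```

In the real system this runs once per frame over the 19 lip features, giving a 50 × 19 matrix per utterance section.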
Fig. 5: Feature values of /a/ by 5 people

Fig. 6: Feature values of vowels

Considering the two graphs, we can recognize vowels from the feature values obtained by formula 1.

3.4 Learning values
We calculate the average of the feature values given by formula 1 for each vowel, and use those averages to recognize an input vowel. The learned data are therefore given by

    D_tvp = (1/N) Σ_{n=0}^{N} V_np   (2)

where D_tvp is the learned feature value at time t for vowel v, N is the number of datasets, p indexes the lip-area features, and V is the value obtained by formula 1.

3.5 Matching method
To recognize a vowel, we use the following formula to determine which vowel the input most resembles:

    S_v = Σ_{t=0}^{T} Σ_{n=1}^{19} |X_tvn − D_tvn|   (3)

where S_v is the evaluated value for vowel v, T is the number of frames, X_tvn is the input vowel, and D_tvn is calculated by formula 2. We evaluate each vowel with formula 3, and the vowel with the smallest evaluated value is the matching vowel for the input.

4. Experiments
In this section, we implement our method, run the experiments, and discuss the results of our system.

4.1 Setup
We implemented a system embodying the proposed method, divided into the following two parts.

Fig. 7: Chart of learning part of system

Figure 7 is a chart of the learning part of our system. First a vowel is input, feature values are calculated by our method, and those values are learned into a database.

Fig. 8: Chart of estimating part of system

Figure 8 is a chart of the estimating part of our system. The estimating part performs the same feature-value calculation as the learning part, followed by a comparison step: the input is compared with the learned database using the matching method of section 3. Finally, the system outputs the estimated answer.
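The learning and matching steps (formulas 2 and 3) can be sketched as a nearest-template classifier. This is a sketch under one assumption: the per-term comparison of X and D in formula 3 is taken here as an absolute difference, which is consistent with "smallest value wins" but is not spelled out in the garbled original.

```python
import numpy as np

def learn_templates(samples):
    """Formula (2): average the feature-value sequences of each vowel
    over its N training utterances. `samples` maps a vowel label to an
    array of shape (N, T, P): N utterances, T=50 frames, P=19 features."""
    return {vowel: arr.mean(axis=0) for vowel, arr in samples.items()}

def match_vowel(x, templates):
    """Formula (3): accumulate per-frame, per-feature differences between
    the input X (shape (T, P)) and each learned template D, then return
    the vowel with the smallest score S_v."""
    scores = {vowel: np.abs(x - d).sum() for vowel, d in templates.items()}
    return min(scores, key=scores.get)

# Toy example with T=50 frames and P=19 features.
t_a = np.zeros((50, 19))          # pretend template for /a/
t_o = np.ones((50, 19))           # pretend template for /o/
templates = {"a": t_a, "o": t_o}
x = np.full((50, 19), 0.2)        # an input slightly off the /a/ template
print(match_vowel(x, templates))  # a
```

Because the score is a plain sum over frames, the 50-frame normalization of section 3.1 is what makes inputs of different durations comparable at all.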
Table 1: Environment of experiments
OS: Windows 7 Professional 64bit edition
CPU: Intel Core 2 Extreme X9650
Memory: 4 GByte
Camera: Logicool 2-MP Webcam C600h
Resolution of camera: 640px x 480px
FPS during capturing: 30fps

Our system ran in the environment listed in table 1. We used a web camera whose quality is lower than that of the iPhone 4 camera. We captured 20 people speaking the 5 vowels in front of the camera, 3 times each, and used the data of 15 of those people as the valid dataset, defined as recordings that are not blurred and whose feature points can be recognized by the Active Shape Models. The datasets were captured against various backgrounds, such as laboratories, houses, and meeting rooms. For evaluation we used leave-one-out cross-validation in the following two settings:
1) training on all captured vowel samples other than the test sample
2) training on all captured speakers other than the test speaker

4.2 Results
Table 2 shows the results of the above experiments.

Table 2: Results of our experiments (the accuracy-rate columns correspond to the evaluations above)
Vowel    Accuracy rate of (1)   Accuracy rate of (2)
/a/      76%                    75%
/i/      92%                    90%
/u/      67%                    69%
/e/      84%                    82%
/o/      72%                    76%
Average  78.2%                  78.4%

The average accuracy rates are close to 80%. In Sekiyama's research [5], the average accuracy of vowel recognition by Japanese people who use lip reading is about 80%, so our study achieves comparable accuracy. The wrong estimations occurred mostly between /a/ and /o/ and between /u/ and /o/; these confusions are often reported in other papers.

4.3 Discussion
Figure 9 compares the largest differences in the temporal changes of feature points between two of the training datasets defined in section 4.1. There is clearly no difference between the two datasets; our method therefore yields robust feature values and can handle the vowels of unknown speakers. Figure 10 shows some of the biggest differences in feature values between /u/ and /o/.
We examine here the confusions between /u/ and /o/.

Fig. 9: Comparison between two trained datasets

Fig. 10: Feature values of vowels

The figure clearly shows that the /o/ template is closer than the /u/ template to the input vowel /u/. Two reasons can be considered. The first is the precision of the Active Shape Models: when extracting lip feature points, the method occasionally tracks the wrong face model. This occurs because the face training dataset is not large enough, and it also makes the feature points blurred. The second is a speaker problem: in our experiments, some people tended not to open their mouths widely when speaking, which makes the differences between vowels too small. Blurred feature points therefore cause our system to output wrong recognitions. We mentioned two studies [3][4] in section 2, and we now compare their results with ours. Table 3 shows the results of the two studies. Our average accuracy rate is inferior to the related work, but for some vowels our method is superior.

Table 3: The results of related works
Vowel    Uchimura's study   Saitoh's study
/a/      90%                95.8%
/i/      70%                91.8%
/u/      100%               96.9%
/e/      100%               88.3%
/o/      70%                96.2%
Average  86%                93.8%
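The leave-one-out evaluation used in section 4.1 can be sketched as follows; this is a generic leave-one-out splitter, shown here for the leave-one-speaker-out setting (evaluation 2), and the speaker labels are illustrative.

```python
def leave_one_out(items):
    """Leave-one-out cross-validation: for each held-out item, yield a
    (train, test) pair where `train` contains all remaining items. For
    evaluation (2) the items are speakers; for evaluation (1) they are
    individual vowel recordings."""
    for i, test in enumerate(items):
        train = items[:i] + items[i + 1:]
        yield train, test

speakers = ["s01", "s02", "s03"]           # illustrative speaker IDs
splits = list(leave_one_out(speakers))
# 3 splits; the first trains on ["s02", "s03"] and tests on "s01"
```

With 15 valid speakers and 3 utterances per vowel, this yields 15 speaker-held-out folds for evaluation (2).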
5. Conclusion
We have described a vowel recognition method based on tracking temporal changes of lip feature points. The results show that our method produces robust feature values for Japanese vowel recognition, and we conclude that it is widely applicable to lip-reading systems. As noted above, lip tracking by Active Shape Models is still somewhat blurred, so improving the tracking is one direction for future work. This method was evaluated on vowels; as a next step we are extending it to consonants, and in the future to word and sentence recognition.

References
[1] M. Kass, A. Witkin and D. Terzopoulos. Snakes: Active Contour Models. International Journal of Computer Vision, pp. 321-331, 1988.
[2] T.F. Cootes, D.H. Cooper, C.J. Taylor and J. Graham. Active Shape Models - their training and application. Computer Vision and Image Understanding, pp. 38-59, 1995.
[3] Keiichi Uchimura, Junji Michida, Masami Tokou, Teizo Aida. Discrimination of Japanese vowels by image analysis. The Transactions of the Institute of Electronics, Information and Communication Engineers, pp. 2700-2702, 1988.
[4] Takeshi Saitoh, Mitsugu Hisaki, Ryosuke Konishi. Japanese Phone Classification Based on Mouth Cavity Region. IEICE technical report, pp. 161-166, 2007.
[5] Kaoru Sekiyama, Kazuki Joe, Michio Umeda. Lipreading Japanese syllables. ITEJ Technical Report 12(1), pp. 33-40, 1988.