Speaker Independent Phoneme Recognition Based on Fisher Weight Map


Takashi Muroi, Tetsuya Takiguchi, Yasuo Ariki
Department of Computer and System Engineering, Kobe University, 1-1 Rokkodai, Nada, Kobe 657-8501, JAPAN
muroi@me.cs.scitec.kobe-u.ac.jp, {takigu, ariki}@kobe-u.ac.jp

Abstract

We have already proposed a new feature extraction method based on higher-order local auto-correlation and the Fisher weight map (FWM) at Interspeech 2006. This paper shows the effectiveness of the proposed FWM in speaker-dependent and speaker-independent phoneme recognition. Widely used MFCC features lack temporal dynamics. To solve this problem, local auto-correlation features are computed and accumulated by weighting high scores on the discriminative areas. This score map is called the Fisher weight map. In the speaker-dependent phoneme recognition, the proposed FWM showed a 79.5% recognition rate, 5.0 points higher than the result by MFCC. Furthermore, by combining FWM with MFCC and ΔMFCC, the recognition rate improved to 88.3%. In the speaker-independent phoneme recognition, FWM showed an 84.2% recognition rate, 11.0 points higher than the result by MFCC. By combining FWM with MFCC and ΔMFCC, the recognition rate improved to 89.0%.

1. Introduction

In speech recognition, MFCC (Mel-Frequency Cepstrum Coefficient) is widely used; it is a cepstrum conversion of a sub-band mel-frequency spectrum within a short time. Due to this short-time spectral characteristic, MFCC lacks temporal dynamic features, which degrades the recognition rate. To overcome this defect, the regression coefficients of MFCC (ΔMFCC, ΔΔMFCC) are usually utilized, but they are an indirect expression of temporal frequency changes such as formant transitions or high-frequency plosives. A more direct expression of the temporal frequency changes is a geometrical feature in a two-dimensional local area, for example within a 3-frame by 3-frequency-band area, in the time-frequency domain [1].
In order to locate such two-dimensional geometrical features, auto-correlation within a local area is effective because it can enhance the geometrical features. Originally, this type of feature extraction was proposed in the field of facial emotion recognition [2]. Otsu computed 35 types of local auto-correlation features within a two-dimensional local area at each pixel of an image and accumulated them within discriminative areas where the typical features among all emotions were well expressed. The map showing these discriminative areas was called the Fisher weight map, and Otsu employed discriminant analysis to find it. We have already proposed a method to find the geometrical discriminative features and discriminative areas of phonemes on the time-frequency domain of speech signals by using the Fisher weight maps, and showed its effectiveness by vowel recognition [3]. In this paper, the effectiveness of the proposed discriminative feature is verified through speaker-dependent and speaker-independent 25-phoneme recognition experiments. In section 2 of this paper, we describe the extraction flow of the geometrical discriminative features for phoneme recognition. In sections 3 and 4, the auto-correlation coefficients based on the local features and the Fisher weight maps are described. In section 5, speaker-dependent and speaker-independent phoneme recognition experiments are shown.

2. Extraction flow of geometrical discriminative features

Fig. 1 shows the extraction flow of the geometrical discriminative features and the phoneme recognition. At first, speech waveforms are converted into the time-frequency domain by short-time Fourier transformation. At this point, a time sequence of short-time spectra (frames) is obtained. Then a moving window of several consecutive frames is put on the time sequence of short-time spectra, forming a windowed time-frequency matrix.
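As a concrete sketch of this windowing step, the front-end could look as follows in NumPy; the function name, array layout, and the idea of stacking the windows into one array are illustrative assumptions, not details fixed by the paper:

```python
import numpy as np

def window_time_frequency(spectra, T, shift):
    """Slice a (n_frames, n_bands) sequence of short-time spectra into
    overlapping windowed time-frequency matrices of T frames each."""
    spectra = np.asarray(spectra)
    # One window per start index; assumes n_frames >= T.
    windows = [spectra[s:s + T] for s in range(0, len(spectra) - T + 1, shift)]
    return np.stack(windows)
```

Each returned T × n_bands matrix would then be passed to the local-feature stage described in the following sections.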
Local features of 35 types are computed at each position (time, frequency) within this window, forming a local feature matrix H of size (number of positions) × (35 types of local features). Finally, the Fisher weight map w is produced by applying linear discriminant analysis (LDA) to the local feature matrix H.

[Figure 1. Flow of the new feature extraction: speech → short-time Fourier transform → time-frequency matrix → windowing → local features → Fisher weight map w by LDA → weighted higher-order local auto-correlation features x = H^t w → phoneme recognition by GMM → recognition results.]

[Figure 2. Local features: a time-frequency matrix on the left; on the right, 3×3 local patterns for continuation in a time direction, continuation in a frequency direction, and transition.]

Geometrical discriminative features are obtained as weighted higher-order local auto-correlations by summing up the local features weighted by the Fisher weight map for each type of local feature, forming a 35-dimensional vector x for a window. By moving this window, a sequence of 35-dimensional vectors of geometrical discriminative features is obtained. In phoneme recognition, phoneme GMMs are trained first. Then the test speech data is converted into a sequence of 35-dimensional vectors of geometrical discriminative features, and the phoneme likelihood is computed using the trained phoneme GMMs.

3. Local features and weighted higher-order local auto-correlations

3.1 Local features

Two-dimensional geometrical local features are observed on the time-frequency matrix shown on the left in Fig. 2. On the right-hand side, 3×3 local patterns that capture the local features are shown. The upper pattern is for continuation in a time direction, the middle for continuation in a frequency direction, and the lower for transition. The flag 1 indicates that the spectrum at that position is multiplied. A local feature within the k-th local pattern at a position r is formalized as follows:

h_r^(k) = I(r) I(r + a_1^(k)) ··· I(r + a_N^(k))    (1)

[Figure 3. The 35 types of local patterns (No. 1 to No. 35), grouped by order N = 0, 1, 2.]
where I(r) is the power spectrum at the position r on the time-frequency matrix composed of time t and frequency f. The position r + a_i^(k) indicates another position, marked with the flag 1, within the k-th local pattern. By limiting the local patterns to a 3-frame by 3-band area around the reference position r, setting the order N to at most 2, and omitting translation-equivalent patterns, the number of displacement sets (a_1, ..., a_N) becomes 35. Namely, 35 types of local patterns are obtained at each position r on the time-frequency matrix, as shown in Fig. 3, following Otsu [2].

3.2 Weighted higher-order local auto-correlations

The higher-order local auto-correlation x_k for the k-th local pattern is obtained by summing the local features shown in Eq. (1) over the time-frequency matrix. It is formalized as follows:

x_k = Σ_r h_r^(k) = Σ_r I(r) I(r + a_1^(k)) ··· I(r + a_N^(k))    (2)

In order to express the higher-order local auto-correlation in vector form, all the local features shown in Eq. (1) for the k-th local pattern are collected over the time-frequency matrix and presented as the following vector:

h^(k) = [h^(k)_{1,1}, h^(k)_{1,2}, ..., h^(k)_{F,T}]^t    (3)

where the dimension of the vector is M = T (time) × F (frequency). The higher-order local auto-correlation x_k for the k-th local pattern is then expressed as follows using the M-dimensional vector h^(k) and the M-dimensional all-ones vector 1:

x_k = h^(k)t 1    (4)

A local feature matrix is obtained by placing the M-dimensional vectors h^(k) in the horizontal direction one by one for all the 35 local patterns:

H = [h^(1) ··· h^(K)]    (5)

The higher-order local auto-correlation vector x is obtained by packing the x_k and is expressed as follows:

x = [x_1 ··· x_K]^t = H^t 1    (6)

Fig. 4 shows an example of computing the local feature matrix H. Here, moving the 35 local patterns over a 9 × 6 windowed time-frequency matrix, the local features are computed. These local features are packed into the local feature matrix H (28 × 35).

[Figure 4. Example of computing the local feature matrix H on a 9 × 6 windowed time-frequency matrix: each local feature h_r^(k) is stored at the row for position r and the column for pattern type k of H (28 × 35).]

The higher-order local auto-correlation vector x represents the existence of the local patterns over the whole time-frequency matrix. Therefore, it is not yet a discriminative vector.
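The 35 local patterns and the computation of the local feature matrix H can be sketched in plain Python. The enumeration below, which canonicalizes each displacement multiset up to translation, is my reconstruction of Otsu's pattern set, and all names are illustrative:

```python
from itertools import combinations_with_replacement

NEIGH = [(dt, df) for dt in (-1, 0, 1) for df in (-1, 0, 1)]

def _canonical(points):
    # Translate the point multiset so its bounding box starts at the origin;
    # two masks are translation-equivalent iff their canonical forms match.
    mt = min(t for t, _ in points)
    mf = min(f for _, f in points)
    return tuple(sorted((t - mt, f - mf) for t, f in points))

def enumerate_patterns():
    # Displacement sets of order N = 0, 1, 2 inside the 3x3 neighbourhood,
    # with translation-equivalent duplicates removed: 1 + 5 + 29 = 35 types.
    candidates = [[(0, 0)]]
    candidates += [[(0, 0), a] for a in NEIGH]
    candidates += [[(0, 0), a, b] for a, b in combinations_with_replacement(NEIGH, 2)]
    seen, patterns = set(), []
    for pts in candidates:
        c = _canonical(pts)
        if c not in seen:
            seen.add(c)
            patterns.append(tuple(pts))
    return patterns

def local_feature_matrix(I, patterns):
    # Eq. (1) evaluated at every interior position r of the T x F window,
    # packed with one row per position and one column per pattern type.
    T, F = len(I), len(I[0])
    H = []
    for t in range(1, T - 1):
        for f in range(1, F - 1):
            row = []
            for mask in patterns:
                p = 1.0
                for dt, df in mask:
                    p *= I[t + dt][f + df]
                row.append(p)
            H.append(row)
    return H
```

With a constant spectrum, summing any column of H (i.e., x = H^t 1) simply counts the interior positions of the window.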
In order to give the higher-order local auto-correlation vector x discriminative ability, the local features of the same local pattern are summed over the windowed time-frequency matrix while putting high weights on the local features where the class difference appears clearly. This is done by replacing the vector consisting of M ones with a weighting vector w. The weighted higher-order local auto-correlation vector x is then obtained as follows:

x = H^t w    (7)

Here w is called the Fisher weight map because it is computed based on linear discriminant analysis.

4. Fisher weight map

In order to find the Fisher weight map, Fisher's discriminant criterion is utilized [2]. Let N be the number of training data. The local feature matrices for the training data are denoted as {H_i ∈ R^(M×K)} for i = 1, ..., N. The corresponding weighted higher-order local auto-correlation vectors, the within-class covariance and the between-class covariance are denoted as {x_i}, Σ_W and Σ_B respectively. The Fisher discriminant criterion J(w) is then expressed as follows:

J(w) = tr Σ_B / tr Σ_W = (w^t Σ_B w) / (w^t Σ_W w)

where Σ_W and Σ_B on the right-hand side are the within-class covariance and the between-class covariance of the local feature matrices (training data). The Fisher weight map is obtained as the eigenvectors w of the following generalized eigenvalue decomposition, derived by maximizing the Fisher discriminant criterion under the constraint

w^t Σ_W w = 1    (8)

Σ_B w = λ Σ_W w    (9)

Since the Fisher weight map is composed of several eigenvectors, the number of eigenvectors is optimized in the phoneme recognition process. However, if the number of eigenvectors is set to 25, the weighted higher-order local auto-correlation vector x shown in Eq. (7) becomes an 875-dimensional (35 × 25) vector. This dimension is so high that the GMM used in the phoneme recognition cannot be estimated accurately and stably. To solve this problem, PCA (Principal Component Analysis) is used to reduce the dimension effectively.
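A minimal NumPy sketch of solving the generalized eigenvalue problem of Eq. (9) follows. The way the scatter matrices are assembled from per-class means of the local feature matrices, the small regularizer, and all names are my assumptions (scipy.linalg.eigh(Sb, Sw) would solve the same generalized problem directly):

```python
import numpy as np

def fisher_weight_maps(Hs, labels, n_maps, eps=1e-8):
    # Hs: (N, M, K) stack of local feature matrices; labels: N class ids.
    # Solves Sb w = lambda Sw w and returns the leading n_maps eigenvectors.
    Hs = np.asarray(Hs, dtype=float)
    labels = np.asarray(labels)
    N, M, K = Hs.shape
    mean_all = Hs.mean(axis=0)
    Sw = np.zeros((M, M))
    Sb = np.zeros((M, M))
    for c in np.unique(labels):
        Hc = Hs[labels == c]
        mean_c = Hc.mean(axis=0)
        d = mean_c - mean_all                     # (M, K) between-class deviation
        Sb += (len(Hc) / N) * d @ d.T
        for Hi in Hc:
            e = Hi - mean_c                       # within-class deviation
            Sw += e @ e.T / N
    # Whiten by Sw, then solve an ordinary symmetric eigenproblem:
    # Sb w = lambda Sw w  <=>  (Sw^{-1/2} Sb Sw^{-1/2}) u = lambda u.
    vals, vecs = np.linalg.eigh(Sw + eps * np.eye(M))
    Sw_isqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    lam, U = np.linalg.eigh(Sw_isqrt @ Sb @ Sw_isqrt)
    order = np.argsort(lam)[::-1][:n_maps]
    return Sw_isqrt @ U[:, order]
```

The returned columns satisfy w^t Σ_W w ≈ 1, matching the constraint of Eq. (8), and projecting each H_i as x_i = H_i^t w yields the weighted higher-order local auto-correlation features.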

5. Phoneme recognition experiments

5.1 Experimental setup

We carried out speaker-dependent and speaker-independent Japanese 25-phoneme recognition. The speech material was continuous speech data spoken by six male speakers and four female speakers, manually segmented into phoneme sections. In the speaker-dependent phoneme recognition, 2,578 data (about 100 data for each phoneme) segmented by hand for all phonemes were collected from each individual speaker and used for phoneme training (Fisher weight map and phoneme GMMs). Another 2,578 phoneme data from each individual speaker were tested. The phoneme recognition rate was computed by averaging the results from the ten speakers. In the speaker-independent phoneme recognition, on the other hand, the training data from the ten speakers were collected together and used for Fisher weight map and phoneme GMM training. In the recognition phase, the test data from each individual speaker were tested in the same way as in the speaker-dependent case.

The speech waveform was transformed into a time-frequency matrix by short-time Fourier transformation with a 25 ms frame width and a 10 ms frame shift. The frequency axis was then converted to the mel scale by a mel-filter bank (64 dimensions). A window with a width of T frames and a shift of S frames was moved over the time-frequency matrix, generating the windowed time-frequency matrices. T and S were optimized experimentally to 5 and 1 respectively. The number of eigenvectors W included in the Fisher weight map and the number of Gaussian mixtures G in the phoneme GMMs were experimentally optimized in the phoneme recognition. The number of dimensions D of the weighted higher-order local auto-correlation vector x reduced by PCA was also experimentally optimized.

5.2 Speaker-dependent phoneme recognition using a single feature

Fig. 5 shows the results of speaker-dependent phoneme recognition using the proposed FWM feature, compared with the recognition results using MFCC.
The highest phoneme recognition rate, 79.5%, was obtained by the proposed FWM feature with the number of eigenvectors W = 25 (35 × 25 = 875 dimensions) in the Fisher weight map, the number of dimensions D = 150 of the weighted higher-order local auto-correlation vector x reduced by PCA, and the number of Gaussian mixtures G = 8 in the phoneme GMMs. Compared with MFCC and MFCC+ΔMFCC, the recognition rate was improved by 5.0 points and 3.7 points respectively, owing to the direct expression of temporal features by the proposed method. When PCA was not applied, since the dimension is as high as 875, the recognition rate was almost the same as that of MFCC.

[Figure 5. Results of speaker-dependent phoneme recognition using a single feature: MFCC 74.5%, MFCC+ΔMFCC 75.8%, FWM without PCA (875 dim) about 74%, FWM with PCA 79.5%.]

5.3 Speaker-dependent phoneme recognition by feature integration

Since FWM showed the highest phoneme recognition rate as a single feature, it was combined with MFCC and ΔMFCC in the phoneme recognition. The feature combination was based on a stream weighting method, which concatenates two or more feature vectors after weighting the respective features. The weight was experimentally optimized, changing the weight ratio from 0.0:1.0 to 1.0:0.0 in 0.1 steps. In this case, the dimension of FWM was decreased from 150 to 55 due to computation time. Fig. 6 shows the phoneme recognition results. FWM improved the recognition rate by 2.6 points and 6.0 points after being combined with MFCC and ΔMFCC respectively, compared with the original FWM (79.5% in Fig. 5). The combination of the two features MFCC and ΔMFCC still showed a higher score, 86.7%. When the three features FWM, MFCC and ΔMFCC were combined together, the recognition rate showed the highest score, 88.3%. This indicates that FWM carries information that improves on the recognition obtained by the MFCC and ΔMFCC combination.

[Figure 6. Results of speaker-dependent phoneme recognition by feature integration: FWM+MFCC 82.1%, FWM+ΔMFCC 85.5%, MFCC+ΔMFCC 86.7%, FWM+MFCC+ΔMFCC 88.3%.]
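The stream-weighting combination can be read as scaling each feature stream before concatenation; that reading, and everything in the sketch below, is an assumption rather than the authors' implementation:

```python
import numpy as np

def combine_streams(streams, weights):
    # Stream weighting: scale each feature stream by its weight and
    # concatenate the results into a single vector for GMM modeling.
    assert len(streams) == len(weights)
    return np.concatenate([w * np.asarray(s, dtype=float)
                           for s, w in zip(streams, weights)])

# Grid of weight ratios from 0.0:1.0 to 1.0:0.0 in 0.1 steps,
# as used for the two-stream optimization in the experiments.
ratios = [(round(0.1 * i, 1), round(1.0 - 0.1 * i, 1)) for i in range(11)]
```

Each candidate ratio would be evaluated by retraining or rescoring the phoneme GMMs on the combined vectors and keeping the best-performing ratio.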

5.4 Speaker-independent phoneme recognition using a single feature

Fig. 7 shows the results of speaker-independent phoneme recognition using the proposed FWM feature, compared with the recognition results using MFCC. The highest phoneme recognition rate, 84.2%, was obtained by the proposed FWM feature with the number of eigenvectors W = 35 (35 × 35 = 1,225 dimensions) in the Fisher weight map, the number of dimensions D = 250 (instead of D = 150) of the weighted higher-order local auto-correlation vector x reduced by PCA, and the number of Gaussian mixtures G = 8 in the phoneme GMMs. Compared with MFCC and MFCC+ΔMFCC, the recognition rate was improved by 11.0 points and 9.2 points respectively, owing to the accumulation of the direct expression of temporal features over ten persons by the proposed method. Compared with the speaker-dependent results shown in Fig. 5, the results of MFCC and MFCC+ΔMFCC decreased due to data variation. However, the result of FWM showed a 4.7-point improvement under speaker independence, thanks to the smaller data variation of the Fisher weight map produced from ten persons.

[Figure 7. Results of speaker-independent phoneme recognition using a single feature: MFCC 73.2%, MFCC+ΔMFCC 75.0%, FWM without PCA (875 dim) 80.7%, FWM with PCA 84.2%.]

5.5 Speaker-independent phoneme recognition by feature integration

FWM was combined with MFCC and ΔMFCC based on the stream weighting method. The results are shown in Fig. 8. FWM improved the recognition rate by 1.4 points and 2.9 points after being combined with MFCC and ΔMFCC respectively, compared with the original speaker-independent FWM (84.2% in Fig. 7). When the three features FWM, MFCC and ΔMFCC were combined together, the recognition rate showed the highest score, 89.0%, which was 1.9 points higher than the result of the MFCC and ΔMFCC combination. This indicates that FWM carries information that improves on the recognition rate obtained by the MFCC and ΔMFCC combination.

[Figure 8. Results of speaker-independent phoneme recognition by feature integration: FWM+MFCC 85.6%, FWM+ΔMFCC 87.1%, MFCC+ΔMFCC 87.1%, FWM+MFCC+ΔMFCC 89.0%.]

6. Conclusion

We described a new feature extraction method based on higher-order local auto-correlation and the Fisher weight map (FWM).
The effectiveness was verified through speaker-dependent and speaker-independent phoneme recognition. In the speaker-dependent phoneme recognition, the proposed FWM showed a 79.5% recognition rate, 5.0 points higher than the result by MFCC. Furthermore, by combining FWM with MFCC and ΔMFCC, the recognition rate improved to 88.3%. In the speaker-independent phoneme recognition, it showed an 84.2% recognition rate, 11.0 points higher than the result by MFCC. By combining FWM with MFCC and ΔMFCC, the recognition rate improved to 89.0%. As future work, we will investigate the noise robustness of the proposed method, because the higher-order local auto-correlation used in the method is thought to be robust for noisy speech recognition. Another plan is to extend the method to an HMM expression and to apply it to continuous phoneme recognition. A problem of the method will be the lack of normalization like CMN and the composition of the GMM or HMM with noise components. We will investigate these problems theoretically, as studied in [4].

References

[1] T. Nitta, "Feature Extraction for Speech Recognition Based on Orthogonal Acoustic-Feature Planes and LDA," Proceedings of IEEE ICASSP 1999, pp. 421-424, May 1999.
[2] N. Otsu, "Facial Expression Recognition Using Fisher Weight Maps," FGR 2004, pp. 499-504, 2004.
[3] Y. Ariki, S. Kato, T. Takiguchi, "Phoneme Recognition Based on Fisher Weight Map to Higher-Order Local Auto-Correlation," Interspeech 2006, pp. 377-380, Sept. 2006.
[4] M. P. Cooke, P. D. Green, L. B. Josifovski, and A. Vizinho, "Robust Automatic Speech Recognition with Missing and Uncertain Acoustic Data," Speech Communication, 34, pp. 267-285, 2001.