HMM-Based Emotional Speech Synthesis Using Average Emotion Model

Long Qin, Zhen-Hua Ling, Yi-Jian Wu, Bu-Fan Zhang, and Ren-Hua Wang
iFlytek Speech Lab, University of Science and Technology of China, Hefei
{qinlong, zhling, jasonwu,

Abstract. This paper presents a technique for synthesizing emotional speech based on an emotion-independent model called the average emotion model. The average emotion model is trained using a multi-emotion speech database. Applying an MLLR-based model adaptation method, we can transform the average emotion model to represent a target emotion that is not included in the training data. A multi-emotion speech database including four emotions, neutral, happiness, sadness, and anger, is used in our experiment. The results of subjective tests show that the average emotion model can effectively synthesize neutral speech and can be adapted to the target emotion model using very limited training data.

Keywords: average emotion model, model adaptation, affective space.

1 Introduction

With the development of speech synthesis techniques, the intelligibility and naturalness of synthetic speech have improved considerably over the last decades. However, it remains difficult for a TTS system to synthesize speech of various speakers and speaking styles from a limited database. It is known that HMM-based speech synthesis can model speech for different speakers and speaking styles, and that the voice characteristics of the synthetic speech can be converted from one speaker to another by applying a model adaptation algorithm, such as MLLR (Maximum Likelihood Linear Regression), with a small amount of speech uttered by the target speaker [1], [2], [3]. Furthermore, HMM-based emotional speech synthesis systems have been successfully constructed either by directly training the models with sufficient emotion data or by adapting a source model to the target emotion model when only limited training data are available [4], [5].

We have realized an HMM-based speech synthesis system in which LSP (Line Spectral Pair) coefficients and the STRAIGHT analysis-synthesis algorithm are employed [6], [7]. By implementing the MLLR-based model adaptation algorithm, we then gave our synthesis system the ability to synthesize the voices of various speakers in different styles [8]. As only a very limited amount of emotion training data is available, we use the model adaptation method to construct our emotional speech synthesis system. Commonly, the source model for emotion adaptation is trained using only neutral speech data. In this paper, however, we train an emotion-independent model using a multi-emotion speech database, which includes the neutral, happy, and sad speech data of a female speaker.
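As a rough illustration of the MLLR adaptation referred to above, the transform estimated from the adaptation data is an affine map applied to the Gaussian mean vectors of the source model. The sketch below only shows how such a transform is applied once it has been estimated; the function name, array shapes, and toy data are our own and are not taken from the paper or from any particular toolkit.

```python
import numpy as np

def apply_mllr_mean_transform(means, W):
    """Apply an estimated MLLR mean transform to a set of Gaussian mean vectors.

    means : (num_gaussians, dim) source-model means.
    W     : (dim, dim + 1) regression matrix [b | A], so that the adapted mean
            is W @ [1, mu]^T = A @ mu + b for every Gaussian in this class.
    """
    num_gaussians, _ = means.shape
    extended = np.hstack([np.ones((num_gaussians, 1)), means])  # xi = [1, mu]
    return extended @ W.T

# Toy usage: four Gaussians with 3-dimensional means and a near-identity transform.
rng = np.random.default_rng(0)
source_means = rng.normal(size=(4, 3))
W = np.hstack([rng.normal(size=(3, 1)), np.eye(3) + 0.1 * rng.normal(size=(3, 3))])
print(apply_mllr_mean_transform(source_means, W).shape)  # (4, 3)
```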

Compared with the neutral model, the average emotion model, which considers the distributions of all emotions in the training data, provides better coverage of the affective space. In fact, it takes the possible distribution of the target emotion into account, so it can achieve better adaptation performance than the neutral model. The average emotion model is obtained using a shared decision tree clustering method, which ensures that all nodes of the decision tree always have training data from all emotions [9]. We then adapt the average emotion model to the target emotion model using a small amount of target speech data and generate the target synthetic speech.

In the remainder of this paper, a description of our HMM-based emotional speech synthesis system is presented in Section 2. Section 3 presents the speech database, the training set design, and the results of the subjective experiments, while Section 4 provides a final conclusion.

2 System Description

The framework of our HMM-based emotional speech synthesis system, shown in Figure 1, is the same as that of the conventional HMM-based synthesis system except that an average emotion model is used as the source model and an MLLR-based model adaptation stage, using a context clustering decision tree and an appropriate regression matrix, is added between the training stage and the synthesis stage.

In the training stage, the LSP coefficients and the logarithm of the fundamental frequency are extracted by STRAIGHT analysis. Afterwards, their dynamic features, including delta and delta-delta coefficients, are calculated. MSD (multi-space probability distribution) HMMs are used to model the spectrum and pitch patterns because of the discontinuity of the pitch observations [10], and state durations are modeled by multi-dimensional Gaussian distributions [11]. To obtain the average emotion model, the context-dependent models without context clustering are first trained separately for each emotion. All these context-dependent emotion models are then clustered using a shared decision tree, and the Gaussian pdfs of the average emotion model are calculated by tying the Gaussian pdfs of all emotions at every node of the tree. Finally, the state duration distributions of the average emotion model are obtained under the same clustering procedure.

In the adaptation stage, the spectrum, pitch, and duration HMMs of the average emotion model are all adapted to those of the target emotion. To achieve suprasegmental feature adaptation, the context decision tree constructed in the training stage is used to tie the regression matrices, and because of the correlations between LSP coefficients of adjacent orders, an appropriate regression matrix format is adopted according to the amount of training data. First, the spectrum and pitch HMMs are adapted to the target emotion HMMs. Then, on the basis of the converted spectrum and pitch HMMs, the target emotional utterances are segmented to obtain the duration adaptation data, so that duration model adaptation can be achieved.
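The defining step of the average emotion model is the tying of the per-emotion Gaussian pdfs at every node of the shared decision tree. The excerpt does not give the tying formula, so the following is only a plausible sketch that pools diagonal Gaussians weighted by their state occupancy counts; the function name and the weighting scheme are assumptions on our part.

```python
import numpy as np

def tie_emotion_gaussians(means, variances, counts):
    """Pool per-emotion diagonal Gaussians at one tree node into a tied Gaussian.

    means, variances : (num_emotions, dim) per-emotion statistics at the node.
    counts           : (num_emotions,) occupancy counts used as pooling weights.
    The tied variance matches the second moment of the weighted mixture, so it
    also captures how far the per-emotion means are spread around the tied mean.
    """
    w = counts / counts.sum()
    tied_mean = (w[:, None] * means).sum(axis=0)
    second_moment = (w[:, None] * (variances + means ** 2)).sum(axis=0)
    return tied_mean, second_moment - tied_mean ** 2

# Toy usage: neutral, happy, and sad statistics for a 4-dimensional stream.
means = np.array([[0.0, 0.1, 0.0, 0.2],
                  [0.3, 0.0, 0.1, 0.1],
                  [-0.2, 0.2, 0.0, 0.0]])
variances = np.full((3, 4), 0.05)
print(tie_emotion_gaussians(means, variances, np.array([300.0, 280.0, 290.0])))
```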

In the synthesis stage, a sentence HMM is constructed for the given text by concatenating the converted phoneme HMMs. From the sentence HMM, the LSP and pitch parameter sequences are obtained using the speech parameter generation algorithm, where phoneme durations are determined based on the state duration distributions. Finally, the generated parameter sequences of the spectrum, converted from the LSP coefficients, and F0 are put into the STRAIGHT decoder to synthesize the target emotional speech.

Fig. 1. HMM-based emotional speech synthesis system

3 Experiment and Evaluation

3.1 Speech Database

We constructed a multi-emotion Chinese speech database of a female speaker including four emotions: neutral, happiness, sadness, and anger. There are 1200 phonetically balanced sentences for neutral speech and 400 sentences for each of the other emotions. The texts of all the emotion samples differ from one another.

First, we evaluated whether the recorded speech samples were uttered with the intended emotions. All the speech samples were randomly presented to ten listeners, who were asked to select an emotion from the four emotions. The listeners were asked to recognize the emotion of each speech sample not from its textual content but from its acoustic presentation. Table 1 shows the classification rates for each emotion of the recorded speech. We can find that most of the recorded speech successfully conveys the intended emotions.
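The classification rates reported in Tables 1, 3, and 4 are row-normalized counts of the listeners' forced-choice responses. A minimal sketch of that bookkeeping is given below; the emotion labels match the paper, but the helper name and the example responses are purely illustrative.

```python
from collections import Counter

EMOTIONS = ["neutral", "happy", "sad", "angry"]

def classification_rates(responses):
    """responses: iterable of (intended_emotion, perceived_emotion) pairs.
    Returns {intended: {perceived: rate_in_percent}}, each row summing to 100."""
    counts = Counter(responses)
    table = {}
    for intended in EMOTIONS:
        row_total = sum(counts[(intended, p)] for p in EMOTIONS)
        table[intended] = {
            p: (100.0 * counts[(intended, p)] / row_total) if row_total else 0.0
            for p in EMOTIONS
        }
    return table

# Toy usage with three hypothetical listener judgements.
print(classification_rates([("neutral", "neutral"),
                            ("neutral", "sad"),
                            ("angry", "angry")]))
```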

Table 1. Classification results of the recorded natural speech

Classification (%)   Neutral   Happy   Sad   Angry
Neutral
Happy
Sad
Angry

3.2 Training Set Design

In order to realize an average emotion model, good coverage of the affective space by the training data is desired. The affective space can be described with Russell's circumplex model [12], [13]. As illustrated in Figure 2, Russell developed a two-dimensional circumplex model of affect that makes it straightforward to classify an emotion as close to or distant from another one. He called the two dimensions valence and arousal; these terms correspond to a positive/negative dimension and an activity dimension, respectively.

Fig. 2. Circumplex model of affect as described by Russell (1980)

As the multi-emotion database can only contain a few kinds of emotions sampled from the affective space, it is important to choose the most representative emotions for training. In our experiment, the multi-emotion database has four emotions: neutral, happiness, sadness, and anger. We decided to use the speech data of neutral, happiness, and sadness as the training data for the average emotion model, because happiness, a very positive emotion with high arousal, and sadness, a very negative emotion with low arousal, are nearly opposite emotions and together form a rational representation of the affective space. Meanwhile, the angry speech data is left for model adaptation and evaluation.
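To make the training-set argument concrete, emotions can be placed as points in Russell's valence-arousal plane and compared by distance, with roughly opposite emotions lying far apart. The coordinates below are illustrative guesses only; they are not taken from Russell [12] or from the paper.

```python
import math

# Hypothetical (valence, arousal) coordinates on a [-1, 1] scale.
EMOTION_COORDS = {
    "neutral": (0.0, 0.0),
    "happy":   (0.8, 0.6),    # positive valence, high arousal
    "sad":     (-0.7, -0.5),  # negative valence, low arousal
    "angry":   (-0.6, 0.7),   # negative valence, high arousal
}

def affective_distance(a, b):
    """Euclidean distance between two emotions in the valence-arousal plane."""
    (va, aa), (vb, ab) = EMOTION_COORDS[a], EMOTION_COORDS[b]
    return math.hypot(va - vb, aa - ab)

# Happiness and sadness are nearly opposite, so together with neutral they span
# much of the plane, while anger is held out for adaptation and evaluation.
print(affective_distance("happy", "sad"))    # large
print(affective_distance("happy", "angry"))  # smaller: both are high-arousal emotions
```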

3.3 Experimental Conditions

The average emotion model is trained with 300 sentences of each emotion (neutral, happy, and sad) selected from the multi-emotion database. For comparison, a neutral model is trained with 1000 neutral sentences selected from the same database, and 100 angry sentences are used for model adaptation and evaluation.

The speech is sampled at 16 kHz. The spectrum and pitch are obtained by STRAIGHT analysis, converted to LSP coefficients and logarithmic F0, respectively, and their dynamic parameters are calculated. Finally, the feature vector of spectrum and pitch is composed of the 25th-order LSP coefficients including the zeroth coefficient, the logarithm of F0, and their delta and delta-delta coefficients. We use 5-state left-to-right no-skip HMMs in which the spectral part of each state is modeled by a single diagonal Gaussian output distribution. The duration feature vector is a 5-dimensional vector, corresponding to the 5-state HMMs, and the state durations are modeled by multi-dimensional Gaussian distributions.

3.4 Experiments on the Average Emotion Model and the Neutral Model

Table 2 shows the number of distributions of the average emotion model and the neutral model after decision tree context clustering. Here, the weight for adjusting the number of model parameters during shared decision tree context clustering is set to 0.6. From the table, it can be seen that the two models have comparable numbers of distributions.

Table 2. The number of distributions after context clustering

           Neutral Model   Average Emotion Model
Spectrum
F0
Duration

Sentences of the synthetic speech generated by each model were also presented to 10 listeners, who chose an emotion from the four emotions; the results are given in Table 3. It can be found that both models can effectively synthesize neutral speech. However, the result of the neutral model is slightly better than that of the average emotion model: some of the synthetic speech generated by the average emotion model was misrecognized as sad. That may be because sadness is expressed better than happiness in the training data, as shown in Table 1, so that the average emotion model has a slight bias towards sadness.

Table 3. Classification results of the synthetic speech generated by the neutral model and the average emotion model

Classification (%)        Neutral   Happy   Sad   Angry
Neutral Model
Average Emotion Model
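For concreteness, the delta and delta-delta features mentioned in Sections 2 and 3.3 are commonly computed with short regression windows over the static parameter trajectory. The excerpt does not state the window coefficients, so the sketch below uses simple 3-point windows as an assumption.

```python
import numpy as np

def add_dynamic_features(static):
    """Append delta and delta-delta features to a static parameter trajectory.

    static : (num_frames, dim) array, e.g. LSP coefficients and log F0 per frame.
    Returns (num_frames, 3 * dim) observations [static, delta, delta-delta],
    using 3-point regression windows with the edge frames repeated.
    """
    padded = np.vstack([static[:1], static, static[-1:]])    # repeat edge frames
    delta = 0.5 * (padded[2:] - padded[:-2])                  # (x[t+1] - x[t-1]) / 2
    delta2 = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]    # x[t+1] - 2 x[t] + x[t-1]
    return np.hstack([static, delta, delta2])

# Toy usage: 10 frames of a 26-dimensional static vector.
obs = add_dynamic_features(np.random.default_rng(2).normal(size=(10, 26)))
print(obs.shape)  # (10, 78)
```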

3.5 Experiments on the Emotion Adaptation

In the model adaptation stage, the neutral model or the average emotion model is adapted to the target emotion model with 50 angry sentences, which are not included in the training data of either source model. The 3-block regression matrix is adopted, and the regression matrices are grouped using a context decision tree clustering method.

First, 10 listeners were asked to identify the emotion of 50 synthetic speech samples generated by the two methods, again choosing from the four emotions. The classification results are presented in Table 4. It can be found that about 70% of the synthetic speech is successfully recognized by the listeners and that the average emotion model gives better adaptation performance.

Table 4. Classification results of the synthetic speech generated by the angry model adapted from the neutral model and the average emotion model

Classification (%)        Neutral   Happy   Sad   Angry
Neutral Model
Average Emotion Model

Compared to the speech synthesized by the adapted average emotion model, some speech samples generated by the adapted neutral model sound unnatural, especially in prosody. Figure 3 shows the F0 contours of the synthetic speech generated from the adapted neutral model and the adapted average emotion model, together with the F0 contour of the target speech. The dotted red line is the F0 contour generated from the adapted neutral model, the solid blue line is the result of the adapted average emotion model, and the solid black line is the F0 contour of the target speech. We can see that the F0 values generated from the adapted average emotion model are more similar to those of the target speech.

Fig. 3. Comparison of F0 contours generated by the angry model adapted from the neutral model and the average emotion model
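The 3-block regression matrix mentioned in Section 3.5 constrains the MLLR transform to a block-diagonal form, so that each of three sub-streams of the observation vector is transformed independently and the number of free parameters stays small when adaptation data is scarce. Whether the three blocks correspond to the static, delta, and delta-delta parts is our assumption for illustration; the sketch below only shows how such a transform is assembled and applied.

```python
import numpy as np

def block_diagonal(blocks):
    """Assemble square matrices into one block-diagonal matrix."""
    dim = sum(b.shape[0] for b in blocks)
    out = np.zeros((dim, dim))
    offset = 0
    for b in blocks:
        n = b.shape[0]
        out[offset:offset + n, offset:offset + n] = b
        offset += n
    return out

def three_block_mllr(blocks, bias):
    """Build W = [bias | blockdiag(A1, A2, A3)] for a 3-block MLLR transform."""
    return np.hstack([bias[:, None], block_diagonal(blocks)])

# Toy usage: a 2-dimensional static part, so a 6-dimensional mean vector overall.
rng = np.random.default_rng(1)
blocks = [np.eye(2) + 0.05 * rng.normal(size=(2, 2)) for _ in range(3)]
W = three_block_mllr(blocks, rng.normal(size=6))
mu = rng.normal(size=6)
mu_adapted = W @ np.concatenate([[1.0], mu])  # adapted mean, shape (6,)
print(mu_adapted)
```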

4 Conclusion

An HMM-based emotional speech synthesis system has been realized using a model adaptation method. First, an average emotion model is trained using a multi-emotion speech database. Then, the average emotion model is adapted to the target emotion model with a small amount of training data using an MLLR-based model adaptation technique in which a context decision tree is built to group the HMMs of the average emotion model. To assess the performance of the proposed method, a neutral model is also trained and adapted for comparison. The results of the subjective tests show that both methods can effectively synthesize the intended emotional speech, and that the adaptation performance of the average emotion model is slightly better than that of the neutral model. With more emotional speech data, the affective space would be covered better, so a more reasonable average emotion model could be trained. Our future work will focus on increasing the number of emotion categories in the multi-emotion database and improving the performance of the average emotion model. At the same time, various emotions will be selected as the target emotion to evaluate the effectiveness of the average emotion model.

Acknowledgement

This work was partially supported by the National Natural Science Foundation of China.

References

1. T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, Speech synthesis from HMMs using dynamic features, Proc. ICASSP-1996, 1996.
2. C.J. Leggetter and P.C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, vol. 9, no. 2, 1995.
3. T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, Speaker adaptation for HMM-based speech synthesis system using MLLR, The Third ESCA/COCOSDA Workshop on Speech Synthesis, Nov. 1998.
4. J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, Acoustic modeling of speaking styles and emotional expressions in HMM-based speech synthesis, IEICE Trans. Information and Systems, vol. E88-D, no. 3, March 2005.
5. J. Yamagishi, M. Tachibana, T. Masuko, and T. Kobayashi, Speaking style adaptation using context clustering decision tree for HMM-based speech synthesis, Proc. ICASSP-2004, vol. 1, pp. 5-8, May 2004.

6. H. Kawahara, Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Communication, vol. 27, 1999.
7. Y.J. Wu and R.H. Wang, HMM-based trainable speech synthesis for Chinese, to appear in Journal of Chinese Information Processing.
8. Long Qin, Yi-Jian Wu, Zhen-Hua Ling, and Ren-Hua Wang, Improving the performance of HMM-based voice conversion using context clustering decision tree and appropriate regression matrix, to appear in Proc. ICSLP 2006.
9. J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, A context clustering technique for average voice models, IEICE Trans. Information and Systems, vol. E86-D, no. 3, March 2003.
10. K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, Hidden Markov models based on multi-space probability distribution for pitch pattern modeling, Proc. ICASSP-1999, Mar. 1999.
11. T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, Duration modeling for HMM-based speech synthesis, Proc. ICSLP-1998, vol. 2, Nov. 1998.
12. J.A. Russell, A circumplex model of affect, Journal of Personality and Social Psychology, vol. 39, 1980.
13. R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor, Emotion recognition in human-computer interaction, IEEE Signal Processing Magazine, vol. 18, no. 1, Jan. 2001.
