Emotion Recognition and Synthesis in Speech

Dan Burrows, Electrical and Computer Engineering, dburrows@andrew.cmu.edu
Maxwell Jordan, Electrical and Computer Engineering, maxwelljordan@cmu.edu
Ajay Ghadiyaram, Electrical and Computer Engineering, aghadiya@andrew.cmu.edu
Amandianeze Nwana, Electrical and Computer Engineering, aon@andrew.cmu.edu
Amber Xu, Electrical and Computer Engineering, axu@andrew.cmu.edu

Abstract

In this paper we describe an emotion recognition system that uses Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs), and we describe how to synthesize emotional speech using state-of-the-art speech synthesis tools. Our feature set includes f0, mel cepstral coefficients (MCEPs), and power. Four-fold cross-validation is used to test the accuracy of the recognition system. We synthesize and recognize four fundamental emotions: happiness, hot anger, neutrality, and sadness. Our system classifies speech into one of the four emotion categories and responds with speech synthesized in that same emotion.

1 Introduction

Emotion synthesis and recognition has many applications, including automated call centers, lie detection systems, and tools for psychologists. To recognize emotions we use a decision tree that decides between emotions at each level. We built GMMs for three features and fed the probabilities they output into SVMs that make the decisions throughout the tree. After determining the spoken emotion, we synthesize one of four emotions in response: happy, hot anger, neutral, or sadness. The voices were created using the Festival Speech Synthesis System and the Festvox project. We also explored whether power conversion and duration models can be used to improve the synthesis of emotional speech. We discuss the contents of our database in Section 2, describe the emotion classifier in Section 3, and elaborate on converting a voice into different emotions in Section 4.

2 Database

The database used for this project was the Emotional Prosody Speech and Transcripts database provided by the Linguistic Data Consortium (LDC2002S28). It contains recorded speech from three male and four female professional actors. The utterances contain only numbers and dates, and each is approximately 2 seconds long. The utterances are expressed in 15 categories: neutral, disgust, panic, anxiety, hot anger, cold anger, despair, sadness, elation, happy, interest, boredom, shame, pride, and contempt. The sampling rate is 22.05 kHz and the speech is stored as dual-channel interleaved 16-bit PCM. The database includes a text transcript for each speaker that documents the words spoken during each utterance; each utterance is also labeled with a single emotion category in this transcript.

For the purposes of this paper we focus on classification between four emotions, on male subjects only: Happy, Hot Anger, Sadness, and Neutral. This decision was motivated by the low accuracy reported for classification between all 15 emotions and both sexes in [4]. We normalized the power in each utterance and downsampled the audio to 16 kHz so that our data was compatible with Festival.
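As an illustration of this preprocessing step, the sketch below normalizes power and downsamples one utterance in Python. It assumes the soundfile and scipy packages are available; the unit-RMS target and the 0.1 headroom factor are illustrative choices, since the paper does not state the exact normalization used.

```python
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

def prepare_utterance(in_path, out_path):
    """Normalize power and downsample a 22.05 kHz utterance to 16 kHz."""
    x, sr = sf.read(in_path)                     # LDC audio: dual-channel 16-bit PCM
    if x.ndim > 1:
        x = x.mean(axis=1)                       # mix down to a single channel
    x = x / (np.sqrt(np.mean(x ** 2)) + 1e-12)   # unit-RMS power normalization (illustrative)
    x = 0.1 * x                                  # leave headroom below full scale
    y = resample_poly(x, 320, 441)               # 22050 Hz * 320/441 = 16000 Hz
    sf.write(out_path, y, 16000)
```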

3 Recognition

Figure 1: Hierarchical classification tree. The first level separates Excited from Non-Excited utterances; the second level separates Happy from Hot Anger and Neutral from Sadness.

Our speaker-dependent emotion classifier is a decision tree in which each level makes a binary decision between two emotional categories. At the first level the decision is between the Excited and Non-Excited classes: the Excited class is the union of the Happy and Hot Anger classes, and the Non-Excited class is the union of the Neutral and Sadness classes. After determining which top-level class the utterance belongs to, we enter the second level of the tree, where a binary decision is made between Happy and Hot Anger or between Neutral and Sadness, depending on which first-level class was picked. Three GMMs were constructed for each class, trained on the mel cepstral coefficients, statistics of f0, and statistics of the power. This results in a total of 18 GMMs whose outputs are fed into the SVMs that make up the decision tree. In the following discussion we describe in detail the features we chose, how we constructed the GMMs, how we determined which SVM kernel to use, and our recognition accuracies.

3.1 Features

We used a set of three distinct features to train our models and classify new data. We calculated the mel cepstral coefficients using the Speech Signal Processing Toolkit distributed with Festival. The mel cepstrum is described as

    $\mathcal{F}^{-1}\left(\log\left(\text{Mel Scale}^{2}\right)\right)$    (1)

    $\text{Mel Scale} = 2595 \log_{10}\left(\frac{\mathcal{F}(x(t))}{700} + 1\right)$    (2)

We calculated 24 coefficients for each 10 ms frame of speech, which creates $\frac{\text{Length of Utterance (s)}}{0.01\ \text{(s)}}$ points in 24-dimensional space.

For the fundamental frequency we used a method similar to that of Medan et al. [1], through the implementation provided in Festival, which autocorrelates adjacent windows to determine the lag $\tau$ (and hence the frequency) that maximizes the correlation. We then threshold the resulting pitch values to the range of human speech to remove unvoiced segments and outliers caused by cracks in the speaker's voice, yielding a vector of voiced pitches per utterance. We calculated the mean, variance, minimum, and maximum of the original f0 and of its first and second derivatives to use as features for training the GMMs.

Power is also calculated within Festival over 10 ms frames, as for the MCEPs. As with f0, the mean, variance, minimum, and maximum of the power values and of their first and second derivatives make up the features for training the power GMMs.
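These per-utterance statistics are simple to compute once a per-frame pitch or power track is available. The following NumPy sketch assumes one value per 10 ms frame; the 60-400 Hz voicing range is a hypothetical choice, as the paper only says pitches were thresholded to the range of human speech.

```python
import numpy as np

def track_stats(track):
    """Mean, variance, min and max of a per-frame track and of its first and
    second differences: the 12 statistics used for both the f0 and power GMMs."""
    track = np.asarray(track, dtype=float)
    feats = []
    for t in (track, np.diff(track), np.diff(track, n=2)):
        feats.extend([t.mean(), t.var(), t.min(), t.max()])
    return np.array(feats)

def f0_features(f0_track, fmin=60.0, fmax=400.0):
    """Drop unvoiced frames and outliers before computing the f0 statistics.
    fmin/fmax are illustrative thresholds, not values taken from the paper."""
    f0 = np.asarray(f0_track, dtype=float)
    return track_stats(f0[(f0 >= fmin) & (f0 <= fmax)])
```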

3.2 Building the Gaussian Mixture Models

We constructed one GMM per class for each of the three features using an expectation-maximization algorithm, as described by Reynolds et al. [2]. We determined experimentally that a full-covariance GMM worked best for the MCEPs and that diagonal-covariance GMMs worked best for f0 and power. To construct the models we then had to choose how many component Gaussian densities and nodal variances to include in each model. We did this by fitting models of different orders K with MATLAB's gmdistribution.fit command and tracking the resulting accuracies.

Figure 2: Four-fold cross-validation accuracy of the first-level classification as a function of model order K (panels (a), (b), and (c)).

To determine the optimal K for each model we compared four-fold cross-validation accuracies (Figure 2), and we also considered the convergence and computation time of each model. For f0 and power, the range of K was limited by the amount of data available, which prevented convergence at high values of K; for the MCEPs, computation time became impractical beyond K = 20. From these tests we determined that K = 10 for power, K = 12 for f0, and K = 15 for MCEPs created the most accurate mixture models.

3.3 Building the Support Vector Machine

The probabilities from the GMMs for the three features over the six classes are fed into SVMs, which make the decisions throughout the decision tree. We used MATLAB's built-in svmtrain to train an SVM for each of the three decisions we make, and we also determined which kernel best supports our model. After trying the kernels shown in Figure 3, we found that a linear kernel produced the highest accuracy between individual emotions, at 85.4%. All kernels achieve very high accuracy for the first decision, between the Excited and Non-Excited classes, but are less reliable at the second level, where specific emotions are decided.

Figure 3: SVM kernel accuracy comparison for the linear, quadratic, polynomial (orders 3-5), RBF, and MLP kernels. First-level accuracies lie between roughly 94% and 99% for all kernels; second-level accuracies are lower, ranging from roughly 65% to 85%, with the linear kernel best at 85.4%.

3.4 Testing

After creating the GMMs and SVMs from the training set, we test the model. We calculate the features for each test utterance in the same manner as for the training set. At the first level we calculate, from the GMMs, the probability that the utterance came from the Excited or the Non-Excited class for the f0 and power features, and the normalized log likelihood for the MCEPs. We feed these values into the top-level linear-kernel SVM trained earlier to determine which class the utterance most likely came from. After this decision is made, we move down that branch of the tree and classify between either Happy and Hot Anger or Neutral and Sadness.

From the confusion matrix in Figure 4 we can see that Happy and Hot Anger, and Neutral and Sadness, are most often confused with each other, whereas Neutral and Happy, or Neutral and Hot Anger, are not. This is due to the high accuracy of the top level of the hierarchy, which limits confusion between emotions from different first-level classes; the hierarchy clearly limits our error rates. We can also see that the second level of classification skews decisions toward over-classifying Sadness and Happy relative to Neutral and Hot Anger. We speculate that this is due to having a slightly higher number of utterances for Sadness and Happy; however, because of our small dataset, we could not afford to remove utterances to equalize the number of utterances per emotion.

Figure 4: Confusion matrix of the overall system (classified emotion versus true emotion for Happy, Hot Anger, Neutral, and Sadness).
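To make the recognition pipeline concrete, here is a minimal sketch of training one binary node of the tree (for example, Excited versus Non-Excited), with scikit-learn's GaussianMixture and SVC standing in for the MATLAB gmdistribution.fit and svmtrain used in the paper. The dictionary keys, array shapes, and helper names are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_node(train_utts, labels, K=(15, 12, 10)):
    """Train one binary node of the decision tree (e.g. Excited vs. Non-Excited).

    train_utts: list of dicts with per-utterance features (illustrative shapes),
        'mcep' -> (n_frames, 24), 'f0' -> (12,), 'power' -> (12,)
    labels: 0/1 class label for each utterance.
    K: model orders for the MCEP, f0 and power GMMs (values from the paper).
    """
    gmms = {}
    for cls in (0, 1):
        utts = [u for u, y in zip(train_utts, labels) if y == cls]
        gmms[cls] = {
            # full covariance for MCEP frames, diagonal for the f0/power statistics
            "mcep": GaussianMixture(K[0], covariance_type="full").fit(
                np.vstack([u["mcep"] for u in utts])),
            "f0": GaussianMixture(K[1], covariance_type="diag").fit(
                np.vstack([u["f0"] for u in utts])),
            "power": GaussianMixture(K[2], covariance_type="diag").fit(
                np.vstack([u["power"] for u in utts])),
        }

    def node_scores(u):
        # Six inputs per utterance: per-class GMM scores for each feature type.
        # GaussianMixture.score is the average per-frame log-likelihood, which
        # plays the role of the normalized log likelihood used for the MCEPs.
        return np.array([
            gmms[c]["mcep"].score(u["mcep"]) if f == "mcep"
            else gmms[c][f].score(u[f].reshape(1, -1))
            for c in (0, 1) for f in ("mcep", "f0", "power")
        ])

    X = np.vstack([node_scores(u) for u in train_utts])
    svm = SVC(kernel="linear").fit(X, labels)   # linear kernel, as in Figure 3
    return gmms, svm, node_scores
```

The same construction, applied to the Happy/Hot Anger and Neutral/Sadness pairs, yields the two second-level nodes.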

4 Voice Conversion

We use the Festival Speech Synthesis System and the Festvox project to synthesize speech. Building a synthetic voice from scratch with these tools requires a large database of utterances; however, a full Festival voice for a target speaker can be built on top of an existing Festival voice using much less data. This technique is called voice conversion, and we use it because existing databases of emotional speech are very small. Voice conversion modifies the pitch, MCEPs, and power of the source speaker to generate speech that sounds like the target speaker. We also tried building a duration model for the target speaker.

We synthesized emotional speech from several source voices and found that the kal diphone voice had the best quality. We built transforms for the happy, angry, neutral, and sad emotions. There was a noticeable difference between the excited (happy and hot anger) and non-excited (neutral and sadness) speech, but there was little difference between happy and angry speech, and also little difference between neutral and sad speech. The techniques used to convert pitch, MCEPs, and power are discussed below.

4.1 MCEP Mapping

The technique used to convert the source MCEPs is described in detail by Toda et al. [3] and summarized here. We extract the MCEPs and their delta features from source and target frames (each frame is 10 ms of speech) and align the source and target speech in time using dynamic time warping. A GMM is built to model the joint probability density of the source and target MCEPs and their deltas. Using the mean and variance of each mixture component, another GMM, containing a penalty term for the reduction of global variance, is built to model the conditional PDF of each target frame given the source frame. Finally, the maximum-likelihood estimates of the target frames are computed from this conditional PDF. Power is converted using the same technique.
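For intuition, the sketch below implements a heavily simplified version of this mapping: a joint GMM over aligned source and target MCEP frames followed by a conditional-mean (minimum mean squared error) conversion. It omits the delta features, the maximum-likelihood trajectory estimation, and the global-variance term of Toda et al. [3], and it assumes the frames have already been aligned with dynamic time warping.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(src, tgt, n_components=32):
    """Fit a joint GMM on DTW-aligned [source | target] MCEP frame pairs."""
    return GaussianMixture(n_components, covariance_type="full").fit(
        np.hstack([src, tgt]))

def convert_mceps(gmm, src):
    """Convert source frames with the conditional mean E[target | source]."""
    D = src.shape[1]
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    S_xx = gmm.covariances_[:, :D, :D]
    S_yx = gmm.covariances_[:, D:, :D]

    # Responsibilities p(m | source frame), computed in log space for stability.
    log_lik = np.column_stack([
        np.log(gmm.weights_[m])
        + multivariate_normal.logpdf(src, mean=mu_x[m], cov=S_xx[m])
        for m in range(gmm.n_components)])
    post = np.exp(log_lik - logsumexp(log_lik, axis=1, keepdims=True))

    out = np.zeros(src.shape)
    for m in range(gmm.n_components):
        A = S_yx[m] @ np.linalg.inv(S_xx[m])            # per-component regression matrix
        out += post[:, [m]] * (mu_y[m] + (src - mu_x[m]) @ A.T)
    return out
```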

4.2 Pitch Mapping

The fundamental frequency is converted using

    $\hat{y}_t = \frac{\sigma_y}{\sigma_x}(x_t - \mu_x) + \mu_y$    (3)

where $x_t$ is the log-scaled f0 of the source speaker at frame t and $\hat{y}_t$ is the converted log-scaled f0 for the target speaker, $\mu_x$ and $\mu_y$ are the mean log-scaled f0 of the source and target speakers, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the log-scaled f0 of the source and target speakers. This formula is derived by taking the conditional expected value of the target pitch given the source pitch, assuming both source and target pitch are Gaussian distributed.

4.3 Duration

The phone durations for the target speech were modeled using a Classification and Regression Tree (CART), built with the Edinburgh Speech Tools Library distributed with Festival. The features used include the previous and following phones, whether the phone is stressed or unstressed, word position, and phrase position. We varied the minimum leaf node size in the tree from one to fifty; the results are shown in Figure 5. Because the correlation remains in the neighborhood of 0.1 for all settings, we decided not to use duration information in our synthesizer.

Figure 5: Plot of the average correlation value versus minimum leaf node size (1 to 50).
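The experiment behind Figure 5 can be approximated with any CART implementation. The sketch below uses scikit-learn's DecisionTreeRegressor rather than the Edinburgh Speech Tools CART tools used in the paper, and it assumes the feature matrices (numerically encoded phone context, stress, and position features) have already been prepared.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.tree import DecisionTreeRegressor

def duration_cart_sweep(X_train, d_train, X_test, d_test, leaf_sizes=range(1, 51)):
    """For each minimum-leaf-size setting, fit a regression tree on phone features
    and report the correlation between predicted and actual phone durations,
    mirroring the sweep behind Figure 5. Inputs are hypothetical precomputed arrays."""
    correlations = {}
    for leaf in leaf_sizes:
        tree = DecisionTreeRegressor(min_samples_leaf=leaf)
        tree.fit(X_train, d_train)
        r, _ = pearsonr(d_test, tree.predict(X_test))
        correlations[leaf] = r
    return correlations
```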

5 Conclusion

We created a system that can identify emotion in speech and reply with an emotional response. The primary shortcoming of our work was the lack of data: the utterances were very short, which made training classifiers and building voice models difficult. Attempting to merge related emotions, such as happy with elation or sadness with despair, produced mixed results and skewed classification toward the emotions with larger training sets; it also hurt synthesis, since the weaker models gave us less to build new voices with.

Despite these shortcomings we still produced accurate results. The recognition component achieved a classification accuracy of 85.4% under four-fold cross-validation, and the synthesis component could produce four distinct emotions based on the classification results. We found empirically that mixing emotions to create new, in-between emotions had little effect, which we attribute to the fact that the human ear recognizes emotional extremes more easily. For this reason we built four transformations, one for each of the four emotions used in training. Once an emotion has been classified, we synthesize a response in that same emotion.

6 Acknowledgments

We would like to thank Dr. Alan Black, Dr. Bhiksha Ramakrishnan, and the 18-797 Machine Learning course staff for their guidance and support during this project.

References

[1] Yoav Medan, Eyal Yair, and Dan Chazan. Super resolution pitch determination of speech signals. IEEE Transactions on Signal Processing, 39(1):40-48, 1991.

[2] Douglas A. Reynolds and Richard C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72-83, 1995.

[3] Tomoki Toda, Alan Black, and Keiichi Tokuda. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, 15(8):2222-2236, 2007.

[4] Sherif Yacoub, Steve Simske, Xiaofan Lin, and John Burns. Recognition of emotions in interactive voice response systems. Technical report, Hewlett-Packard Company, 2003.