Selection of Features for Emotion Recognition from Speech

Indian Journal of Science and Technology, Vol 9(39), DOI: 10.17485/ijst/2016/v9i39/95585, October 2016. ISSN (Print): 0974-6846, ISSN (Online): 0974-5645.

Puja Ramesh Chaudhari and John Sahaya Rani Alex*
School of Electronics Engineering, VIT University, Chennai - 632014, Tamil Nadu, India; chaudharipuja7@gmail.com, jsranialex@vit.ac.in
* Author for correspondence

Abstract

Background/Objective: Speech is one of the modes of Human-Computer Interaction (HCI). A speech signal carries the message to be conveyed as well as speaker characteristics such as the speaker's identity and emotional state. Researchers have recently taken greater interest in the emotional parameters of speech signals, which help to improve the functionality of HCI. This research focuses on selecting features that help to identify the emotion of the speaker. Methods/Statistical Analysis: Mel Frequency Cepstral Coefficient (MFCC), Linear Prediction Cepstral Coefficient (LPCC) and Perceptual Linear Predictive (PLP) methods are used to extract the features. Each emotion is modelled as one Hidden Markov Model (HMM) using the Hidden Markov Model Toolkit (HTK). The BeagleBone Black (BBB) board is chosen for the implementation because of its small form factor. Findings: The results indicate that MFCC features give 100% accuracy for the surprise emotion, PLP features give 100% accuracy for the anger emotion, and LPCC features give 100% accuracy for the fear emotion. Conclusion/Improvement: A hybrid feature extraction method should be devised to detect all emotions with 100% accuracy.

Keywords: BBB, Emotion Recognition, HCI, HMM, LPCC, MFCC

1. Introduction

Speech is a unique carrier of information. Each spoken word is created from a phonetic combination of vowel, semivowel and consonant sound units. Emotion recognition plays an important role in identifying and verifying the emotional state of a speaker from his or her speech signal. Emotion here covers six basic categories: happy, sad, fear, anger, surprise and neutral. Emotion recognition systems are used, for example, to analyse a driver's emotional state while driving in the city or to handle calls in automated customer care, which is valuable for safety and for gauging the user's state of mind. Speech technology has largely passed the milestone of intelligibility, and active research now addresses naturalness and expressiveness. Emotion is conveyed through parameters such as pitch, energy, voice intensity and word utterance. Recognising emotion is difficult for long and complicated human sentences; it has to be inferred from the speech content and from voice intonation. Emotion recognition finds applications in medicine, e-learning, monitoring, entertainment, law and marketing: in health centres for monitoring a patient's emotional state after treatment, and in applications that demand natural man-machine interaction, such as web movies and computer tutorials, where the response to the user depends on the detected emotion. Various works were referred to for this study. In one, the stress of a person was measured by asking for and collecting different parameters to obtain marked values, and a trained neural network model estimated the amount of stress in the signal 1. In 2, technologies related to emotion recognition are surveyed, which threw more light on the present work by giving a clearer picture of emotion recognition from speech.

It is also found that recognising the emotion in a speech signal helps to ensure naturalness in the performance of existing speech systems; recent works are reviewed there in terms of emotional databases, speech features and classification models. In 3, a classification scheme and an emotional speech dataset are developed for such systems. Emotional interaction of a thinking robot is studied in 4, focusing on speaker-independent emotion recognition from speech signals, where the speech emotion recognition system is combined with facial features and gestures for multimodal interaction. In 5, formant-frequency and pitch methods are discussed that help to recognise the happy, sad and neutral emotions. In 6, a speech recognition system is implemented on a Xilinx device using a hybrid Neural Network (NN) mechanism, which is beneficial for reducing area and power. In 7, a Self-Organising Feature Map (SOFM) is used to reduce the large dimension of the feature vector while retaining the same recognition accuracy. A low-power, small embedded board has not been used in the literature to implement an emotion recognition system.

The objective of this paper is to use the conventional feature extraction methods, Mel Frequency Cepstral Coefficient (MFCC) 8,9, Linear Prediction Coefficient (LPC) and Perceptual Linear Predictive (PLP), for emotion recognition. Each emotion is modelled with a Hidden Markov Model (HMM) 10,11 and then detected using a Viterbi decoder. The proposed emotion recognition system is implemented on an embedded board; the BeagleBone Black 12 is chosen as the embedded platform because of its low power consumption and small size. The rest of the paper is organised as follows: Section 2 discusses MFCC, LPC and PLP feature extraction and HMM modelling of emotions; Section 3 gives a detailed description of the implementation and the results obtained; Section 4 presents the conclusion.

2. Implementation

2.1 Database Collection

The emotion recognition system starts with collecting speech data: 10 samples were recorded for each emotion with a head-mounted microphone, at sampling rates of 16 kHz, 44.1 kHz and 48 kHz, and stored in .wav format.

2.2 Feature Extraction

Certain attributes of the speaker are extracted from the speech signal for each emotion; in essence, each emotion is represented by a small portion of data extracted from the voice. Features used for emotion recognition should have some specific characteristics: they should be easily determinable from the set of known voices, occur naturally and frequently in speech, and be consistent across the emotion models. Speech is a slowly time-varying signal: observed over a short time interval, the signal appears stationary, whereas over a long duration its characteristics clearly change, reflecting the different speech sounds in the spoken words. Short-interval spectral analysis is therefore the standard way to characterise the voice parameters. Many algorithms exist to extract such emotional parametric representations; the ones used here are Linear Predictive Coding (LPC), Perceptual Linear Predictive (PLP) analysis and MFCC.
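To make this short-time analysis concrete, the sketch below frames a recording into 25 ms windows with a 10 ms shift, the values used for MFCC in Section 2.2.1. It is illustrative only: the librosa calls, the Hamming window and the file name are assumptions and are not part of the paper's HTK workflow.

```python
# Illustrative short-time framing (not the paper's HTK pipeline).
# File name, librosa usage and window choice are assumptions.
import numpy as np
import librosa

y, sr = librosa.load("happy_01.wav", sr=16000)       # hypothetical 1 s recording

frame_len = int(0.025 * sr)    # 25 ms window -> 400 samples at 16 kHz
hop_len = int(0.010 * sr)      # 10 ms shift  -> 160 samples at 16 kHz

frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len)
windowed = frames * np.hamming(frame_len)[:, None]   # taper each frame

print(windowed.shape)          # (400, number_of_frames): one column per frame
```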
2.2.1 Mel Frequency Cepstral Coefficient (MFCC)

MFCC takes the human ear's sensitivity to different frequencies into account, which makes it well suited to emotion recognition. MFCC analysis is essentially Fourier analysis of the short-term power spectrum. The Mel-frequency scale is linear below 1000 Hz and logarithmic above 1000 Hz, so the filters are closely spaced at low frequencies and spread out at high frequencies, capturing the perceptually important features of the voice. The speech signal is framed and windowed with a 25 ms window and a 10 ms frame shift. The spectrum of each frame is passed through 24 triangular Mel-scale filter banks; the filter outputs are compressed logarithmically and the cepstral coefficients are then de-correlated by applying the Discrete Cosine Transform (DCT). The first 13 outputs of the DCT block are taken as static MFCCs, and from these the derivatives and double derivatives are calculated and used for emotion recognition. Figure 1 shows the block diagram of MFCC feature extraction.

Figure 1. MFCC feature extraction method.
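As a rough Python counterpart of this pipeline (the paper itself extracts features with HTK), the sketch below computes 13 static MFCCs from 24 Mel filters and appends the derivatives and double derivatives. The librosa library and the file name are assumptions; only the parameter values come from the text.

```python
# Illustrative MFCC extraction mirroring the settings above
# (25 ms / 10 ms framing, 24 Mel filters, 13 cepstra). librosa is an
# assumption; the paper uses HTK for this step.
import numpy as np
import librosa

y, sr = librosa.load("surprise_03.wav", sr=16000)      # hypothetical recording

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                   # 13 static coefficients from the DCT output
    n_fft=int(0.025 * sr),       # 25 ms analysis window
    hop_length=int(0.010 * sr),  # 10 ms frame shift
    n_mels=24,                   # 24 triangular Mel filter banks
)
delta = librosa.feature.delta(mfcc)             # first derivatives
delta2 = librosa.feature.delta(mfcc, order=2)   # double derivatives

features = np.vstack([mfcc, delta, delta2])     # 39 coefficients per frame
print(features.shape)                           # (39, number_of_frames)
```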

2.2.2 Linear Predictive Coding (LPC)

LPC is a method for analysing and compressing the speech signal and can be viewed as a source-filter model. Voice is also a biometric parameter: speech carries the energy, pitch and frequency of every spoken word and is quasi-periodic in time. In the LPC model, an excitation source drives a digital filter; a train of impulses models the voiced excitation, while unvoiced excitation is modelled by a white Gaussian noise source. Passing the speech through the LPC analysis filter removes the redundancy in the signal. The signal is first split into frames by a blocking window, and LPC then determines the set of predictor coefficients that minimises the mean squared prediction error over each short section of speech. Figure 2 shows the basic blocks of LPC feature extraction.

Figure 2. LPC feature extraction method.

2.2.3 Perceptual Linear Predictive (PLP)

PLP is a combination of Discrete Fourier Transform and linear prediction techniques and is likewise an all-pole model. The basic idea behind this method is to model the psychophysics of human hearing more faithfully in the feature extraction stage. The speech signal is passed through an FFT and the spectrum is warped onto a perceptual frequency scale; an equal-loudness pre-emphasis and amplitude compression are applied, and the result is converted back to the time domain using a DCT. This short-segment representation forms the PLP feature. Figure 3 shows the block diagram of PLP feature extraction.

Figure 3. PLP feature extraction method.

2.3 HTK Toolkit

In this project the HTK toolkit is used for emotion recognition: the HMMs constructed with HTK are used for the recognition task. HTK is a portable toolkit 8,9 consisting of library modules and tools written in C. The tools provide facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs with both continuous-density Gaussian mixtures and discrete distributions, and can be used to build composite HMM systems. HTK was originally developed at Cambridge University; Microsoft holds the copyright and licenses it free of charge for research use, including the Viterbi decoder. All the feature extraction, modelling and decoding in this work is done using HTK.

2.4 Hardware

In this work the BeagleBone Black (BBB) embedded board 12 is employed to implement emotion recognition. It is a low-cost board built around the Sitara XAM3359AZCZ100 ARM Cortex-A8 processor running at 1 GHz, shown in Figure 4. The BBB carries 512 MB of memory operating at a clock frequency of 303 MHz. The board is equipped with a single microSD connector that serves as a secondary boot source and can also be selected as the primary boot source. For hardware interfacing, the supplied mini-USB cable is connected to the USB port of a laptop to power the board.

Figure 4. BeagleBone Black board.
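Section 2.3 noted that one HMM is built per emotion with HTK's training tools. Purely as an illustration of that modelling step, and not the paper's actual HTK commands, the sketch below fits one Gaussian HMM per emotion with the hmmlearn library; the five-state topology, the diagonal covariances and hmmlearn itself are assumptions.

```python
# Illustrative only: one Gaussian HMM per emotion, trained on the feature
# matrices extracted earlier (frames as rows, i.e. the transpose of the
# librosa output above). The paper performs this step with HTK; hmmlearn,
# the 5-state topology and diagonal covariances are assumptions.
import numpy as np
from hmmlearn import hmm

EMOTIONS = ["happy", "sad", "fear", "anger", "surprise", "neutral"]

def train_emotion_model(utterances):
    """utterances: list of (n_frames, 39) arrays for one emotion."""
    X = np.concatenate(utterances)                 # stack all training frames
    lengths = [u.shape[0] for u in utterances]     # frame count per utterance
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)                          # Baum-Welch re-estimation
    return model

# models = {emo: train_emotion_model(train_feats[emo]) for emo in EMOTIONS}
```

In the paper itself this corresponds to initialising and re-estimating the per-emotion HMM definitions with HTK before deploying them on the BBB.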

3. Implementation and Results

In this research, twelve speakers aged between 20 and 25 were asked to read the same set of documents, with a duration of 1 s for each emotion, and the audio files were stored in .wav format. Using the HTK toolkit, MFCC, LPC and PLP features were extracted. For each feature type, 8 samples per emotion were used for HMM modelling. The emotion of a person is then recognised with the Viterbi decoder in HTK, which selects the emotion model giving the maximum matching probability. Speaker-dependent testing recognised the happy, sad, neutral, fear, surprise and anger emotions with 100% accuracy for all three feature extraction methods. Real-time testing is shown in Figures 5 to 8. For the speaker-independent data set (4 samples per emotion), MFCC gives 100% accuracy for the surprise emotion, PLP gives 100% accuracy for anger and LPC gives 100% accuracy for fear, as shown in Figures 9 to 11. A sketch of this recognition and evaluation step is given at the end of this section.

Figure 5. Emotion recognition output from the BBB.
Figure 6. Emotion recognition with MFCC method.
Figure 7. Emotion recognition with LPC method.
Figure 8. Emotion recognition with PLP method.
Figure 9. Speaker-independent emotion recognition with MFCC.
Figure 10. Speaker-independent emotion recognition with LPC.
Figure 11. Speaker-independent emotion recognition with PLP.

From the above results, for speech signals that were modelled and trained on (the same data set), all three methods, MFCC, LPC and PLP, give 100% accuracy. For speech signals that were not modelled and trained on, the MFCC method gave 100% accuracy for the surprise emotion, the LPC method gave 100% accuracy for the fear emotion, and the PLP method gave 100% accuracy for the anger emotion.
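As with the training sketch in Section 2, the following is only an illustration of the decision rule described above (score a test utterance against every emotion model and keep the most likely one), not the paper's HTK-based decoder; the helper names and the held-out test-set layout are hypothetical.

```python
# Illustrative recognition and per-emotion accuracy computation, mirroring
# the maximum-likelihood (Viterbi-style) decision HTK makes in the paper.
# `models` is the dict of per-emotion HMMs from the earlier sketch;
# `test_set` maps each emotion to its held-out (n_frames, 39) feature arrays.
def recognise(features, models):
    scores = {emo: m.score(features) for emo, m in models.items()}  # log-likelihoods
    return max(scores, key=scores.get)             # most likely emotion label

def per_emotion_accuracy(test_set, models):
    accuracy = {}
    for emo, utterances in test_set.items():
        hits = sum(recognise(feats, models) == emo for feats in utterances)
        accuracy[emo] = hits / len(utterances)
    return accuracy
```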

4. Conclusion

Emotion recognition plays a vital role in determining the state of the human mind. The performance of interactive voice-based automated applications improves when an emotion recognition system is added. It is advantageous to health centres for monitoring a patient's emotional state after treatment and useful for applications requiring natural man-machine interaction. Such a system could also be employed to determine the emotional state of a driver, which helps driver safety and vehicle control. In this paper, an experimental evaluation of emotion recognition on a low-cost, small-footprint device, the BBB, was carried out. The speaker-dependent Emotion Recognition (ER) system gave 100% accuracy for all emotions (sad, happy, surprise, anger, fear and neutral). For speaker-independent ER, MFCC features give 100% accuracy for the surprise emotion, PLP features give 100% accuracy for anger, and LPC features give 100% accuracy for fear. It is also observed that no single conventional feature extraction method gave 100% accuracy for all emotions; because of this, all the feature extraction methods were implemented on the low-power embedded system, which increases the complexity of the emotion recognition system. In future, to reduce the complexity of the ER system, a hybrid feature extraction method could be designed to detect all human emotions from speech.

5. References

1. Scherer S, Hofmann H, Lampmann M, Pfeil M, Rhinow S, Schwenker F, Palm G. Emotion recognition from speech: Stress experiment. Germany; 2008. p. 1-6.
2. Koolagudi SG, Rao KS. Emotion recognition from speech: A review. Int J Speech Technol. 2012; 3(2):1-5.
3. El Ayadi M, Kamel MS, Karray F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition. 2011; 44(3):572-87.
4. Kim EH, Hyun KH, Kim SH, Kwak YK. Improved emotion recognition with a novel speaker-independent feature. IEEE/ASME Transactions on Mechatronics. 2009; 14(3):317-25.
5. Sathe-Pathak BV, Panat AR. Extraction of pitch and formants and its analysis to identify 3 different emotional states of a person. IJCSI. 2012; 9(4):1-6.
6. Patel S, Alex JSR, Venkatesan N. Low-power multi-layer perceptron neural network architecture for speech recognition. Indian Journal of Science and Technology. 2015; 8(20):1-6.
7. Alex JSR, Mukhedkar AS, Venkatesan N. Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition. Indian Journal of Science and Technology. 2015; 8(19):1-8.
8. Malta L, Miyajima C, Kitaoka N, Takeda K. Analysis of real-world driver's frustration. IEEE Transactions on Intelligent Transportation Systems. 2011; 12(1):1-10.
9. Tiwari V. MFCC and its applications in speaker recognition. Int J Emerg Technol. 2010; 1(1):19-22.
10. The HTK Book 3.4.1. Available from: www.speech.ee.ntu.edu.tw/homework/dsp_hw2-1/htkbook.pdf
11. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989; 77(2):257-86.
12. BeagleBoard. Available from: http://beagleboard.org/