www.ardigitech.in, ISSN 2320-883X, Volume 5, Issue 4, 2017

An Intelligent Framework for Detection of Anger Using Speech Signal

Moiz A. Hussain
(Electrical Engineering Dept., Dr. V. B. Kolte C.O.E., Malkapur, Dist. Buldana)
mymoiz24@yahoo.co.in

Abstract
This paper reports the results of detecting emotions from speech signals, with particular focus on extracting emotion from short utterances and on distinguishing anger from neutral speech. Feature vectors are extracted from the utterances and classified with three methods: Support Vector Machines (SVM), Multilayer Perceptron (MLP), and Generalized Feed Forward (GFF) networks. The Berlin Database of Emotional Speech is used: ten sentences spoken by ten actors (5 female, 5 male) in 7 different emotions (neutral, anger, boredom, disgust, anxiety/fear, happiness, sadness) in German. Audio files are in WAV format: 16 kHz, 16 bit, mono.

Keywords: Emotion recognition; Support Vector Machine (SVM); Multilayer Perceptron (MLP); Generalized Feed Forward network (GFF); Berlin Emotional Speech Database; Machine Intelligence.

Introduction
At the beginning of the 21st century, with computers pervasive in modern industry, the focus on human-computer interaction is increasing. Human interaction takes place through speech, eye contact, gesture, and so on, with speech being the most common form of communication. More effective human-computer interaction, in particular the recognition of different emotions from speech, would make such communication far more efficient. Anger is recognised as the most important emotion to detect in human-computer interaction, since it may lead the user to break off the interaction altogether; Petrushin [10] identified anger as the most important emotion in business settings. Anger is a term for the emotional aspect of aggression, a basic aspect of the stress response in animals whereby a perceived aggravating stimulus "provokes" a counter-response which is likewise aggravating and threatening violence.

Emotion Recognition
Approaches to emotion recognition depend on which emotions should be recognised and for what purpose. Emotion recognition has applications in talking toys, video and computer games, and call centres. Automatic emotion recognition from speech can be viewed as a pattern recognition problem. The results produced by different experiments are characterised by: the features believed to be correlated with the speaker's emotional state; the set of emotions of interest; the database used for training and testing the classifier; and the type of classifier used in the experiments. To compare classification results, the same dataset must be used and there must be agreement on the set of emotions.

Database
The performance of an emotion classifier relies heavily on the quality of the database used for training and testing and on its similarity to real-world samples (generalisation). To conduct the experiment on recognising human emotion, an audio database that conveys the emotional state of the speaker was needed. To ensure diversity, we used audio samples from ten subjects, five (5) male and five (5) female, each speaking ten sentences, for seven classes: neutral, anger, boredom, disgust, fear, happiness, and sadness.
The Berlin Database of Emotional Speech was used for this purpose [5]. The speech signals were re-sampled, and the silence segments at the beginning and the end of each utterance were cut out. The whole database was then divided into two parts, one for training and one for testing.
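As a concrete illustration of this preparation step, the sketch below loads the anger and neutral utterances from a local copy of the database, normalises and trims them, and holds out a test split; the folder path, silence threshold, and 70/30 ratio are illustrative assumptions, not values from the paper. EmoDB filenames encode the emotion in their sixth character (W = anger, N = neutral).

```matlab
% Load Berlin EmoDB wavs, keep anger (W) and neutral (N) utterances,
% trim leading/trailing silence, and hold out a test split.
% Assumes the corpus sits in ./emodb/wav (hypothetical path).
files = dir(fullfile('emodb', 'wav', '*.wav'));
X = {}; y = {};
for k = 1:numel(files)
    code = files(k).name(6);                 % 6th filename character = emotion
    if code ~= 'W' && code ~= 'N', continue; end
    [s, ~] = audioread(fullfile(files(k).folder, files(k).name));
    s = s / max(abs(s));                     % common dynamic range
    active = find(abs(s) > 0.02);            % crude silence gate (assumed)
    s = s(active(1):active(end));            % cut silence at both ends
    X{end+1} = s; y{end+1} = code;           %#ok<SAGROW>
end
n = numel(X); idx = randperm(n);             % shuffle before splitting
nTrain = round(0.7 * n);                     % assumed 70/30 split
trainX = X(idx(1:nTrain));      trainY = y(idx(1:nTrain));
testX  = X(idx(nTrain+1:end));  testY  = y(idx(nTrain+1:end));
```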

Computer Simulation Experiment

Feature Extraction
The author developed a program in MATLAB to obtain statistical parameters of a sound, namely formant frequencies, entropy, variance, minima, median, and LPC coefficients. A dataset covering all the speech samples was thus prepared to feed to the neural networks for speech analysis.

Energy features: five types of energy/entropy contours are used: log-energy entropy, Shannon entropy, threshold entropy, sure entropy, and norm entropy.

Formant features: five formant frequency contours are used, Formant 0 through Formant 4.

Audible duration features: audible segments are determined by choosing a threshold below the maximum energy; a contour is then produced marking the audible and inaudible segments of the speech.

Figure 1: LPC spectra of a speech signal.

Calculation of Feature Vectors
The main areas of focus for the feature vectors are pitch (fundamental frequency), loudness (energy), and segments (audible duration).

Energy and Pitch Contours
From the two contour types (energy and pitch), the maximum, minimum, mean, standard deviation, shimmer/jitter, and first derivative are calculated.

Pre-Processing
Filtering to remove noise, and normalisation to place all the sounds from the database into the same dynamic range, followed by segmentation into 20 ms blocks.

Figure 2: Speech signal (amplitude versus time).

Clustering of Feature Vectors
These feature vectors are then used by a clustering algorithm to determine whether a speech signal contains feature vectors relating to anger speech or to neutral speech.

Speech Activity Detection
The speech activity detector is a self-normalising, energy-based detector that tracks the noise floor of the signal and can adapt to changing noise conditions; it is used to remove silence from the segments. A sketch of the feature extraction pipeline follows.
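The contour-and-statistics pipeline described above can be sketched as a small MATLAB function; the 20 ms frame length, 12th-order LPC, energy threshold, and the particular statistics retained are assumptions for illustration rather than the paper's exact settings. Pitch and shimmer/jitter contours would be summarised the same way and are omitted for brevity.

```matlab
% Per-utterance feature vector: frame the signal, build energy and
% formant contours, then summarise each contour with simple statistics.
function f = emotionFeatures(s, fs)
    frameLen = round(0.02 * fs);                       % ~20 ms frames
    frames = buffer(s, frameLen, round(frameLen/2), 'nodelay');  % 50% overlap
    nF = size(frames, 2);
    energy = sum(frames.^2, 1);                        % short-time energy contour
    p = energy / sum(energy);
    shannonEnt = -sum(p(p > 0) .* log(p(p > 0)));      % Shannon entropy of energy
    formants = zeros(4, nF);                           % first formants per frame
    w = hamming(frameLen);
    for k = 1:nF
        a = lpc(frames(:, k) .* w, 12);                % 12th-order LPC fit
        if any(isnan(a)), continue; end                % skip silent frames
        r = roots(a);
        r = r(imag(r) > 0);                            % one root per conjugate pair
        fr = sort(atan2(imag(r), real(r)) * fs / (2*pi));  % pole angles -> Hz
        m = min(4, numel(fr));
        formants(1:m, k) = fr(1:m);
    end
    audibleRatio = mean(energy > 0.1 * max(energy));   % audible-duration feature
    stats = @(c) [max(c), min(c), mean(c), std(c)];    % contour statistics
    f = [stats(energy), shannonEnt, stats(formants(1, :)), ...
         stats(formants(2, :)), audibleRatio];
end
```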

Generalized Feed Forward Neural Network (GFFNN)

Figure 3: Schematic diagram of the GFFNN.

Generalized feed forward networks are a generalisation of the MLP in which connections can jump over one or more layers. In theory, an MLP can solve any problem that a generalized feed forward network can solve; in practice, however, generalized feed forward networks often solve the same problem much more efficiently. A classic example is the two-spiral problem: a standard MLP requires hundreds of times more training epochs than a generalized feed forward network containing the same number of processing elements.

Neural Network for Emotion Recognition
The generalised procedure for emotion recognition from a speech signal using the different feature extraction techniques is shown in Figure 4. LPC, variance, formants, entropy, median, and minima are used for feature extraction, and SVM, MLP, and GFF networks for emotion recognition.

Figure 4: Structure of the emotion recognition system: emotional speech input, prosodic feature extraction, SVM/MLP/GFF classifier, recognised emotion output.

The structural components of this sex-independent emotion recognition system are depicted in Figure 4. It consists of four modules: speech input, feature extraction, a neural network for classification, and the recognised emotion output.

Results
The SVM, MLP, and GFF classifiers were used to test the proposed speech feature vector. The "Train N Times" method was used to train each classifier, and experimental results were obtained on held-out cross-validation (C.V.) data. The recognition results for the SVM, MLP, and GFF classifiers on both the training and C.V. datasets are shown in the tables below.

Table 1: SVM recognition results on the training dataset.
Table 2: SVM recognition results on the cross-validation dataset.
Table 3: SVM recognition results on the cross-validation dataset, based on performance.
Table 4: MLP recognition results on the cross-validation dataset.

Figure 5: Average mean square error, with standard deviation boundaries over 3 runs, versus epoch for the training and cross-validation datasets.
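The paper's classifiers were trained in a neural-network simulator; as a minimal stand-in for the SVM branch of Figure 4 (not the author's original setup), the feature vectors from the sketch above can be fed to MATLAB's fitcsvm and scored with k-fold cross-validation. The 16 kHz sampling rate and 10 folds are assumptions.

```matlab
% Train an SVM on the extracted feature vectors and estimate its error.
% trainX/trainY/testX/testY come from the loading sketch; emotionFeatures
% is the feature function sketched earlier.
F = cell2mat(cellfun(@(s) emotionFeatures(s, 16000), trainX', ...
                     'UniformOutput', false));      % one row per utterance
labels = categorical(trainY');                      % 'W' (anger) vs 'N' (neutral)
mdl = fitcsvm(F, labels, 'KernelFunction', 'rbf', 'Standardize', true);
cv = crossval(mdl, 'KFold', 10);                    % assumed 10-fold C.V.
fprintf('Cross-validation error: %.3f\n', kfoldLoss(cv));
Ftest = cell2mat(cellfun(@(s) emotionFeatures(s, 16000), testX', ...
                     'UniformOutput', false));
fprintf('Test accuracy: %.3f\n', ...
        mean(predict(mdl, Ftest) == categorical(testY')));
```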

Figure 6: Average mean square error, with standard deviation boundaries over 3 runs, versus epoch for the cross-validation dataset.

Table 5: GFF recognition results on the cross-validation dataset, based on performance.

Figure 7: Comparison of the emotions recognised by the SVM, MLP and GFF networks on the testing set.

Conclusion
In the work conducted for this project, different methods of distinguishing anger speech from neutral speech were explored. Several conclusions can be drawn from the results. First, decoding emotion in speech is a complex process influenced by the cultural, social, and intellectual characteristics of the subjects; people are not perfect at decoding even such dominant emotions as anger and happiness. Second, anger has variations (hot anger, cold anger, etc.) that have different acoustic features and will dramatically affect recognition accuracy; it is recommended that these variations be taken into account when labelling a speech database.

References
[1] Klaus R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Communication, Vol. 40, pp. 227-256.
[2] Lili Cai, Chunhui Jiang, Zhiping Wang, Li Zhao, Cairong Zou, "A Method Combining the Global and Time Series Structure Features for Emotion Recognition in Speech," IEEE Int. Conf. on Neural Networks & Signal Processing, Nanjing, China, December 2003.
[3] Yi-Lin Lin, Gang Wei, "Speech Emotion Recognition Based on HMM and SVM," Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, August 2005.

[4] Muhammad Waqas Bhatti, Yongjin Wang and Ling Guan, "A Neural Network Approach for Human Emotion Recognition in Speech," IEEE, 2004.
[5] Felix Burkhardt, Miriam Kienast, Astrid Paeschke and Benjamin Weiss, Berlin Database of Emotional Speech, available at http://pascal.kgw.tu-berlin.de/emodb/
[6] Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J. G., "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, Vol. 18, No. 1, pp. 32-80, Jan. 2001.
[7] J. Nicholson, K. Takahashi and R. Nakatsu, "Emotion Recognition in Speech Using Neural Networks," Neural Information Processing, 1999.
[8] Li Zhao, Xiungmin Qiun, Cuirong Zou, Zhenyung Wu, "A Study on Emotional Feature Analysis and Recognition in Speech Signal," Journal of China Institute of Communications, Vol. 21, No. 10, pp. 18-25, 2000.
[9] François Thibault, "Formant Trajectory Detection Using Hidden Markov Models," Special Project Course Report, MUMT 609, December 2003.
[10] Petrushin, V., "Emotion in Speech: Recognition and Application to Call Centers," in Proc. of Artificial Neural Networks in Engineering, pp. 7-10, Nov. 1999.