A New Kind of Dynamical Pattern Towards Distinction of Two Different Emotion States Through Speech Signals


Akalpita Das
Gauhati University, India
dasakalpita@gmail.com

Babul Nath, Purnendu Acharjee, Anilesh Dey
Kaziranga University, India
babul@kazirangauniversity.in, purnendu@kazirangauniversity.in, anilesh@kazirangauniversity.in

ABSTRACT: Speech emotion recognition is one of the most popular and widely discussed topics at present. Human beings display different types of emotion every day. In this paper we propose a new technique that can distinguish two emotion states by analyzing speech signals. The quantification is done by fitting an ellipsoid to the reconstructed attractor obtained from the speech signals in two different emotional conditions. Our experiments show satisfactory results in this context.

Keywords: Speech Signal, Phase Space Plot, Ellipsoid Fit

Received: 2 July 2017, Revised 10 August 2017, Accepted 17 August 2017

2017 DLINE. All Rights Reserved

1. Introduction

The study of speech emotion is closely related to the speech production system. Speech acoustics as a whole plays an important role in explaining the meaning of certain acoustic parameters. The flow of air through the vocal tract, driven by breathing, is the basis of all sounds and noises that the human vocal apparatus can produce [1]. An interesting fact is that the variety of sounds humans can produce depends on whether the airflow is set into vibration by continuous movement of the glottis, termed phonation, thereby generating quasi-regular sounds. For sounds that are unvoiced or non-periodic in nature, the air passes largely unimpeded through the lower part of the vocal tract and is turned into turbulent sound by the friction that occurs at the mouth opening. In addition, the acoustic filter properties of the vocal tract are responsible for the quality of the sound produced. Hence, the whole system responsible for producing sound is a complex one [1-2]. This complexity increases when emotions and feelings are added to the sound.

Existing studies focus on how meaningful content is conveyed in the acoustic signal a speaker generates and on how the listener reacts to that signal. Listeners can perceive and label this content in such a way that the emotion of the speaker is preserved. Linguists are generally interested in connections between the meaning of vocal sounds and the tone in which they are spoken.

Phoneticians like to examine how emotion in general changes the way vocal sounds are produced. Our interest, from a phonetics point of view, is in the expressive part of the acoustic waveform of speech, the articulations related to it and their control, not in the origin of the expressed emotion.

In the 1970s [1], HMMs were successfully applied to automatic speech recognition (ASR), and in recent times researchers have been attempting to design and implement them for more dramatic (emotional) speech synthesis. The arranged training data of the state-clustered HMMs is used for HMM state partitioning to describe the speech data for unit selection. A couple of decades later, in the 1990s, Tokuda [2] put forward a fully automatic, parametric HMM-based speech synthesizer that is now accepted worldwide. Although both speech synthesis and ASR use HMM technology, there are many dissimilarities between the two applications. HMM-based speech recognition and synthesis systems share the general form of their probabilistic models and use similar methods to learn the probability distributions. To be more specific, the HMMs are trained by optimizing the probability of the sequence of speech feature vectors given the sequence of sub-word units, e.g. phones. Emotional (dramatic) speech synthesis corresponds to the estimation of speech parameter sequences from input text with the help of HMMs. For better recognition accuracy, the statistical representation used for ASR aims to normalize away the variation in the speech parameters.

In this paper we discuss the applicability of a new technique: a global analysis of the signal carried out in the reconstructed phase space rather than on the signal itself. This is the motivation behind considering such a global analysis. In this article, the long-term dynamics of the speech signals of two healthy subjects (one male, one female) in two different emotions have been studied, and proper quantifications have been made to distinguish the two emotion states. The article is organized sequentially. Section 2 deals with the methodology, which includes the acquisition of the speech signals and their quantification using a suitable time delay. The core findings are highlighted in the conclusion section.

2. Methodology

2.1 Signal Acquisition

2.1.1 Recording Setup

The recording was done in a semi-anechoic, noise-proof recording studio set up in the Department of Electronics and Communication, Kaziranga University, Jorhat, Assam. The following components were used in the setup for recording the voices: a Behringer B-2 Pro dual-diaphragm condenser USB studio microphone with a frequency response of 40 Hz to 20 kHz, the Realtek High Definition Audio driver, and a Creative Sound Blaster Live 5.1 sound card. The distance between the speaker and the microphone was 8 inches. The block diagram of the recording setup is shown in Figure 1 below.

Figure 1. Block diagram of recording setup

The specifications considered for recording the sounds are as follows: the recording software used is Audacity 1.3.6 with a resolution of 16-bit PCM, the format is mono and the sampling frequency is 16000 Hz.

2.1.2 Software Tool for Recording - Audacity 1.3.6

Audacity is free, open-source, cross-platform audio software for multi-track recording and editing. It is available for Windows, Mac OS X, Linux and other operating systems. Audacity can record live audio through a microphone or mixer, or digitize recordings from other media. With some sound cards, and on any recent version of Windows, Audacity can also capture streaming audio. Audacity can also be used to mix and record entire albums. The primary features of Audacity include the following:

It can import and export WAV, AIFF and MP3 files.
It can mix multiple tracks.
It can record and play back sounds.

2.2 Phase Space Plot and Quantification [3-6]

Let us assume that the time series of the experimental data is given by x(1), x(2), ..., x(N), and that the embedding dimension and the delay time for reconstruction of the attractor are m and τ respectively. Thus we obtain the reconstructed phase space as

X(i) = (x(i), x(i + τ), ..., x(i + (m - 1)τ)),   i = 1, 2, ..., M,

where X(i) is a phase point in the m-dimensional phase space and M = N - (m - 1)τ is the number of phase points. The collection {X(1), X(2), ..., X(M)} describes the estimated trajectory of the system in the phase space.

A quantification technique [3-6], in which the points of the reconstructed phase space are gathered in three dimensions, is used for differentiating two different phase spaces. Let a continuous signal x(1), x(2), ..., x(N) be obtained from any system. By sub-dividing this signal into three groups X, Y and Z with the same delay (the size of each group depending on whether N is even or odd), the three-dimensional phase space of the signal is reconstructed with X, Y and Z as its coordinates. A three-dimensional rotation then modifies this coordinate system by the same angle with respect to the X, Y and Z axes, so that a new coordinate system (X', Y', Z') is formed. As a final step, an ellipsoid centred at the centre of the main cluster, with three semi-axes of length SD1, SD2 and SD3, is fitted to the already reconstructed phase space.
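To make the procedure above concrete, a minimal sketch in Python is given below. It assumes that the three delayed groups are taken as x(i), x(i + τ), x(i + 2τ), that the rotation is by 45° about each axis, and that SD1, SD2 and SD3 are the standard deviations of the main cluster along the rotated axes, in the spirit of the 3D Poincaré-plot analysis cited in [3-6]; none of these choices is stated explicitly in the text, so the sketch is illustrative rather than a reproduction of the authors' implementation.

import numpy as np

def embed_3d(x, lag):
    """Split a 1-D signal into three delayed coordinates x(i), x(i+lag), x(i+2*lag)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - 2 * lag
    if n <= 0:
        raise ValueError("signal too short for the chosen lag")
    return np.column_stack([x[:n], x[lag:lag + n], x[2 * lag:2 * lag + n]])

def rotate_xyz(points, angle=np.pi / 4):
    """Rotate the point cloud by the same angle about the X, Y and Z axes.
    The 45-degree default and the X -> Y -> Z order are assumptions."""
    c, s = np.cos(angle), np.sin(angle)
    rx = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    ry = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return points @ (rz @ ry @ rx).T

def ellipsoid_axes(points):
    """SD1, SD2, SD3 taken as the dispersion of the cluster along the rotated axes."""
    centred = points - points.mean(axis=0)
    return centred.std(axis=0)

def quantifier(x, lag):
    """Quantification parameter (SD1 + SD2 + SD3) / 3 of a speech signal."""
    sd = ellipsoid_axes(rotate_xyz(embed_3d(x, lag)))
    return float(sd.mean())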

3. Result and Discussion

We have taken two subjects, one male and one female, and recorded four statements in each of two emotions for each of them. The emotions considered are Normal and Anger.

Figure 3.1.1 to Figure 3.1.4 show the phase space plots for the male subject in the anger emotion for four different statements. Here AM stands for Anger Male.

Figure 3.1.1. AM1    Figure 3.1.2. AM2    Figure 3.1.3. AM3    Figure 3.1.4. AM4

Figure 3.2.1 to Figure 3.2.4 show the phase space plots for the male subject in the normal emotion for four different statements. Here NM stands for Normal Male.

Figure 3.2.1. NM1    Figure 3.2.2. NM2

Figure 3.2.3. NM3    Figure 3.2.4. NM4

Figure 3.3.1 to Figure 3.3.4 show the phase space plots for the female subject in the anger emotion for four different statements. Here AF stands for Anger Female.

Figure 3.3.1. AF1    Figure 3.3.2. AF2    Figure 3.3.3. AF3    Figure 3.3.4. AF4

Figure 3.4.1 to Figure 3.4.4 show the phase space plots for the female subject in the normal emotion for four different statements. Here NF stands for Normal Female.

Figure 3.4.1. NF1    Figure 3.4.2. NF2    Figure 3.4.3. NF3    Figure 3.4.4. NF4

It is visibly distinguishable from all of the phase space plots that the normal-emotion plots are more clustered than the anger-emotion plots. No canonical way has yet been established to eliminate the irregularities of the phase space plots, but these are not of much importance here. Rather, our focus is on the main cluster, because most of the principal and relevant information in this situation is hidden in the positioning of the cluster. We therefore quantify these phase plots by fitting an ellipsoid to their respective main clusters. Finally, the quantification parameter (SD1 + SD2 + SD3)/3 is found by averaging the axes of the ellipsoid. The results of the quantification are shown in Table 1.

Table 1. Quantification table of the 3D phase space plots of speech signals in the anger and normal states for the female and male subjects

It is observed from Table 1 that the quantifying parameter for both subjects, in all four samples, is larger in the anger emotion than in the normal emotion. It is also evident that for the anger emotion the quantifying parameter of the male subject is larger than that of the female subject, whereas for the normal emotion the quantifying parameter of the female subject is higher than that of the male. Thus the 3D phase space plot with a proper delay is a suitable tool for distinguishing the two different emotions in speech signals.
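A hedged usage sketch of how the quantifying parameter might be computed per recording and compared across the two emotions is given below. The module name, file names, lag value and the use of a paired t-test are placeholders and assumptions, not the authors' actual data or procedure; it only illustrates one way such a comparison could be run.

import numpy as np
from scipy.io import wavfile
from scipy.stats import ttest_rel

from emotion_quantifier import quantifier   # helper sketched in Section 2.2 (hypothetical module name)

anger_files  = ["AM1.wav", "AM2.wav", "AM3.wav", "AM4.wav"]   # four anger statements (placeholder names)
normal_files = ["NM1.wav", "NM2.wav", "NM3.wav", "NM4.wav"]   # four normal statements (placeholder names)
LAG = 10                                                      # assumed delay in samples at 16 kHz

def param(path):
    """Quantification parameter (SD1 + SD2 + SD3) / 3 of one 16-bit mono WAV file."""
    _, x = wavfile.read(path)                 # 16 kHz mono PCM, as exported from Audacity
    return quantifier(x.astype(float), LAG)

anger  = np.array([param(f) for f in anger_files])
normal = np.array([param(f) for f in normal_files])

print("mean quantifier (anger) :", anger.mean())
print("mean quantifier (normal):", normal.mean())

# With only four paired statements per subject, a paired t-test is one simple way
# to substantiate the observed difference, as the conclusion suggests.
t_stat, p_value = ttest_rel(anger, normal)
print("paired t-test: t = %.3f, p = %.3f" % (t_stat, p_value))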

4. Conclusion

It is to be noted that the average value of SD1, SD2 and SD3 of the fitted ellipsoid is smaller in the normal state than the corresponding value in the anger state, for both the male and the female subject. The same analysis shows that this quantifying parameter is smaller in the anger state for the female subject than for the male subject. Since it is well known that anger increases stress in human beings, the quantifying parameter can also stand as an indicator of stress reduction. As the sample size is small, the whole study is substantiated by statistical hypothesis testing.

References

[1] Allen, J. B., Rabiner, L. R. (1977). In: Proceedings of the IEEE, 65, 1558.
[2] Masuko, T., Tokuda, K., Kobayashi, T., Imai, S. (1996). Speech synthesis using HMMs with dynamic features. In: Proceedings of IEEE ICASSP, 389-392.
[3] Dey, A., Bhattacharya, D. K., Palit, S. K., Tibarewala, D. N. Study of the effect of music and meditation on heart rate variability. In: Encyclopedia of Information Science and Technology, IGI Global, Category: Music Technologies.
[4] Dey, A., Bhattacharyya, D. K., Tibarewala, D. N., Dey, N., Ashour, A. S., Le, D.-N., Gospodinova, E., Gospodinov, M. International Journal of Interactive Multimedia and Artificial Intelligence, 3 (7), 87-95.
[5] Das, M., Jana, T., Dutta, P., Banerjee, R., Dey, A., Bhattacharya, D. K., Kanjilal, M. R. (2015). Study the effect of music on HRV signal using 3D Poincare plot in spherical co-ordinates - a signal processing approach. In: IEEE International Conference on Communication and Signal Processing, April 2-4, India.
[6] Dey, A., Palit, S. K., Mukherjee, S., Bhattacharya, D. K., Tibarewala, D. N. (2011). A new technique for the classification of pre-meditative and meditative states. In: IEEE International Conference, ICCIA-2011.