A New Kind of Dynamical Pattern Towards Distinction of Two Different Emotion States Through Speech Signals

Akalpita Das
Gauhati University, India
dasakalpita@gmail.com

Babul Nath, Purnendu Acharjee, Anilesh Dey
Kaziranga University, India
babul@kazirangauniversity.in, purnendu@kazirangauniversity.in, anilesh@kazirangauniversity.in

ABSTRACT: Speech emotion recognition is one of the most popular and widely discussed topics in the present world. Every day, human beings show different types of emotions. In this paper we propose a new technique which can distinguish two emotion states by analyzing speech signals. The quantification is done by fitting an ellipsoid on the reconstructed attractor obtained from the speech signals in two different emotional conditions. Our experiments show satisfactory results in this context.

Keywords: Speech Signal, Phase Space Plot, Ellipsoid Fit

Received: 2 July 2017, Revised 10 August 2017, Accepted 17 August 2017

2017 DLINE. All Rights Reserved

1. Introduction
The study of speech emotion is closely related to the speech production structure. The whole of speech acoustics has an important role to play in explaining the meaning of certain acoustic parameters. The flow of air through the vocal tract, driven by breathing, is the basis of all sounds and noises possible by the human vocal apparatus [1]. An interesting fact to note is that the variety of sounds that humans can produce depends on whether the flow of air attains vibration through continuous movement of the glottis, also termed phonation, thus generating quasi-regular sounds. In the case of sounds which are unvoiced or non-periodic in nature, the air tends to pass without any influence through the lower part of the vocal tract and is modified into turbulent sounds because of friction occurring while opening the mouth. In addition, the acoustic filter properties of the vocal tract are responsible for the quality of the sound produced.
Hence, the whole system responsible for producing sound is a complex structure [1-2]. This complexity increases when emotions and feelings are added to the sound. Existing studies focus on how meaningful content is conveyed in the acoustic signal a speaker generates, and also on how the listener reacts to the signal. Listeners can perceive and label meaningful content in such a way that the emotion of the speaker is preserved. Linguists are generally interested in connections between the meaning of vocal sounds and the tone in which they are spoken.

142 Journal of Multimedia Processing and Technologies Volume 8 Number 4 December 2017

Phoneticians like to examine how emotion in general changes the way vocal sounds are produced. Our centre of interest, from a phonetics point of view, is on the expressive part of the acoustic waveform of speech and the articulations related to it and their control, not on the origin of the expressed emotion. In the 1970s [1] HMMs were successfully applied to automatic speech recognition (ASR), but in recent times researchers have attempted to design and implement them for more dramatic (emotional) speech synthesis. The aligned training data of the state-clustered HMMs are used for HMM state partitioning to describe the speech data for unit selection. About two decades later, in the 1990s, Tokuda [2] put forward a fully automatic and parametric speech synthesizer based on HMMs, which is accepted worldwide. Although both speech synthesis and ASR use HMM technology, many dissimilarities exist between the two applications. Speech recognition and synthesis systems based on HMMs exchange the type of characteristics of the probabilistic models and use similar methods to learn the probability distributions. To be more specific, the HMMs are trained by optimizing the probability distribution of the HMMs given the sequence of speech feature vectors and the sequence of sub-word units, e.g. phones. Emotional (dramatic) speech synthesis corresponds to the estimation of speech parameter sequences from input text with the help of HMMs. For better recognition accuracy, the statistical representation used for ASR aims to normalize away the variations in the speech parameters. In this paper we discuss the applicability of a new technique, which is a global analysis of the signal carried out in the reconstruction space rather than on the signal itself. This is the motivation behind considering such a global analysis.
In this article the long-term dynamics of the speech signals of two healthy subjects (one male, one female) in two different emotions have been studied, and proper quantifications have been made to distinguish the two emotion states. The article is presented sequentially: Section 2 deals with the methodology, which includes the acquisition of the speech signals and their quantification using a unique time delay. The core findings are highlighted in the conclusion section.

2. Methodology
2.1 Signal Acquisition
2.1.1 Recording Setup
The recording was done in a semi-anechoic and noise-proof recording studio set up in the Department of Electronics and Communication, Kaziranga University, Jorhat, Assam. The following components were used in the setup for recording the voices: a Behringer B-2 Pro dual-diaphragm condenser USB studio microphone with a frequency response of 40 Hz to 20 kHz; the Realtek High Definition Audio driver; and a Creative Sound Blaster Live! 5.1 sound card. The distance between the speaker and the microphone was 8 inches. The block diagram of the recording setup is shown in Figure 1 below.

Figure 1. Block diagram of recording setup
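The recordings used in this study are 16-bit PCM, mono, sampled at 16000 Hz. As a minimal sketch, the following Python snippet checks that a WAV file matches this format using only the standard-library wave module; the file name is illustrative, not a file from the study:

```python
import wave

def check_format(path):
    """Return (channels, sample width in bytes, sample rate) of a WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels(), w.getsampwidth(), w.getframerate())

# Write a tiny file in the study's format, then verify it.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)                  # mono
    w.setsampwidth(2)                  # 16-bit PCM
    w.setframerate(16000)              # 16 kHz sampling frequency
    w.writeframes(b"\x00\x00" * 160)   # 10 ms of silence

assert check_format("sample.wav") == (1, 2, 16000)
```

Any recording that fails this check would need resampling or channel mixing before analysis.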
The specifications considered for recording the sounds are as follows: the recording software used is Audacity 1.3.6, with a resolution of 16-bit PCM; the format is mono and the sampling frequency is 16000 Hz.

2.1.2 Software Tool for Recording - Audacity 1.3.6
Audacity is free, open-source, cross-platform audio software for multi-track recording and editing. It is available for Windows, Mac OS X, Linux and other operating systems. Audacity can record live audio through a microphone or mixer, or digitize recordings from other media. With some sound cards, and on any recent version of Windows, Audacity can also capture streaming audio. Audacity can also be used to mix and record entire albums. The primary features of Audacity include the following: it can import and export WAV, AIFF and MP3 files; it can mix multiple tracks; and it can record and play back sounds.

2.2 Phase Space Plot and Quantification [3-6]
Let us assume that the time series of the experimental data is given by {x_1, x_2, ..., x_N}, and that the embedding dimension and the delay time for reconstruction of the attractor are m and τ respectively. Thus we obtain the reconstructed phase space as the set of phase points X_i = (x_i, x_{i+τ}, ..., x_{i+(m-1)τ}), i = 1, 2, ..., N-(m-1)τ, where X_i is a phase point in the m-dimensional phase space and N-(m-1)τ is the number of phase points. The collection {X_i} describes the estimated trajectory of the system in the phase space.

A quantification technique [5] in which the points of the reconstructed phase space are gathered in three dimensions is used for differentiating two different phase spaces. Let a continuous signal {x_i, i = 1, 2, ..., N} be obtained from the system. Sub-dividing this signal into three groups X = (x_1, x_2, ...), Y = (x_{1+τ}, x_{2+τ}, ...) and Z = (x_{1+2τ}, x_{2+2τ}, ...) with the same delay τ, the three-dimensional phase space of the signal is reconstructed from the points (x_i, x_{i+τ}, x_{i+2τ}). A three-dimensional rotation by the same angle θ with respect to the X, Y and Z axes, given by R = R_z(θ) R_y(θ) R_x(θ), then modifies this coordinate system. Thus a new coordinate system is formed.
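The reconstruction and quantification described above can be sketched in Python with NumPy. Note that the interpretation of SD1, SD2 and SD3 as the per-axis standard deviations of the rotated point cloud, as well as the delay tau=10, the angle θ=π/4 and the synthetic test signal, are illustrative assumptions, since the paper does not fully specify the ellipsoid-fitting procedure:

```python
import numpy as np

def embed3(x, tau):
    """3-D delay embedding: phase points (x[i], x[i+tau], x[i+2*tau])."""
    n = len(x) - 2 * tau
    return np.column_stack((x[:n], x[tau:tau + n], x[2 * tau:2 * tau + n]))

def rotate_xyz(pts, theta):
    """Rotate the point cloud by the same angle theta about the X, Y and Z axes."""
    c, s = np.cos(theta), np.sin(theta)
    rx = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    ry = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return pts @ (rz @ ry @ rx).T

def ellipsoid_axes(pts):
    """SD1, SD2, SD3: standard deviations along the rotated axes, taken
    as the semi-axes of an ellipsoid centred at the mean of the cloud."""
    return np.std(pts - pts.mean(axis=0), axis=0)

# Illustration on a synthetic signal; a real run would load the
# 16 kHz mono speech samples instead.
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 40 * np.pi, 4000)) + 0.1 * rng.standard_normal(4000)
pts = rotate_xyz(embed3(x, tau=10), theta=np.pi / 4)
sd1, sd2, sd3 = ellipsoid_axes(pts)
q = (sd1 + sd2 + sd3) / 3   # the quantifying parameter used in Section 3
```

A more clustered attractor (as reported here for the normal emotion) yields smaller SD values, and hence a smaller quantifying parameter q.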
As a final step, an ellipsoid centred at the mean of the reconstructed phase space, with three semi-axes of length SD1, SD2 and SD3, is fitted to the rotated point cloud.

3. Result and Discussion
We have taken two subjects, one male and one female, and recorded four statements with two emotions for each of them. The emotions studied are Normal and Anger. Figure 3.1.1 to Figure 3.1.4 show the phase space plots for the male subject in the Anger emotion with four different statements. Here AM stands for Anger Male.

Figure 3.1.1. AM1
Figure 3.1.2. AM2
Figure 3.1.3. AM3
Figure 3.1.4. AM4

Figure 3.2.1 to Figure 3.2.4 show the phase space plots for the male subject in the Normal emotion with four different statements. Here NM stands for Normal Male.

Figure 3.2.1. NM1
Figure 3.2.2. NM2
Figure 3.2.3. NM3
Figure 3.2.4. NM4

Figure 3.3.1 to Figure 3.3.4 show the phase space plots for the female subject in the Anger emotion with four different statements. Here AF stands for Anger Female.

Figure 3.3.1. AF1
Figure 3.3.2. AF2
Figure 3.3.3. AF3
Figure 3.3.4. AF4
Figure 3.4.1 to Figure 3.4.4 show the phase space plots for the female subject in the Normal emotion with four different statements. Here NF stands for Normal Female.

Figure 3.4.1. NF1
Figure 3.4.2. NF2
Figure 3.4.3. NF3
Figure 3.4.4. NF4

It is visibly distinguishable from all the phase space plots that the normal emotion is more clustered than the anger emotion. No canonical way has yet been established to eliminate the irregularities of the phase space plots, but these irregularities are not of much importance here. Rather, our focus should be on the main cluster, because most of the principal, relevant and necessary information in this situation is hidden within the positioning of the cluster. Thus we quantify these phase plots by fitting an ellipsoid to their respective main clusters. Finally, the quantification parameter (SD1+SD2+SD3)/3 is found by averaging the axes of the ellipsoid. The results of the quantification are shown in Table 1. It is observed from Table 1 that the quantifying parameters for both subjects in all four samples are larger in the anger emotion than in the normal emotion. It is also evident that for the anger emotion, the quantifying parameter of the male subject is larger than that of the female subject. On the contrary, in the case of the normal emotion, the quantifying parameter of the female subject is higher than that of the male. Thus the 3D phase space plot with a proper delay is a suitable tool for distinguishing the two different emotions from speech signals.
Table 1. Quantification Table of 3D Phase Space Plot of speech signals in Anger and Normal states for Female and Male

4. Conclusion
It is to be noted that the average value of SD1, SD2 and SD3 of the fitted ellipsoid is reduced in the normal state as compared to the corresponding value in the anger condition for both male and female subjects. The same analysis shows that this quantifying parameter decreases in the anger state of the female subject as compared to the male subject. Since it is well known that anger increases the stress of a human being, the quantifying parameter can also stand as an indicator of stress reduction. As the sample size is small, the whole study should be substantiated by statistical hypothesis testing.

References
[1] Allen, J. B., Rabiner, L. R. (1977). Proceedings of the IEEE, 65, 1558.
[2] Masuko, T., Tokuda, K., Kobayashi, T., Imai, S. (1996). Speech synthesis using HMMs with dynamic features. In: Proceedings of IEEE ICASSP, 389-392.
[3] Dey, A., Bhattacharya, D. K., Palit, S. K., Tibarewala, D. N. Study of the effect of music and meditation on heart rate variability. In: Encyclopedia of Information Science and Technology, IGI Global, Category: Music Technologies.
[4] Dey, A., Bhattacharyya, D. K., Tibarewala, D. N., Dey, N., Ashour, A. S., Le, D.-N., Gospodinova, E., Gospodinov, M. International Journal of Interactive Multimedia and Artificial Intelligence, 3 (7), 87-95.
[5] Das, M., Jana, T., Dutta, P., Banerjee, R., Dey, A., Bhattacharya, D. K., Kanjilal, M. R. (2015). Study of the effect of music on HRV signal using 3D Poincare plot in spherical co-ordinates - a signal processing approach. In: IEEE International Conference on Communication and Signal Processing, April 2-4, India.
[6] Dey, A., Palit, S. K., Mukherjee, S., Bhattacharya, D. K., Tibarewala, D. N. (2011). A new technique for the classification of pre-meditative and meditative states. In: IEEE International Conference ICCIA-2011.