Affective computing. Emotion recognition from speech. Fall 2018

1 Affective computing Emotion recognition from speech Fall 2018 Henglin Shi,

2 Outline. Introduction to speech features: why speech in emotion analysis; speech features. Speech and speech production: types of speech sounds; fundamental and resonant frequencies; speech production model. Speech feature extraction: short-time analysis; acoustic feature extraction; prosodic feature extraction.

3 Why speech in emotion analysis: Speech is a main modality for people to express messages. Speech is more REAL than other modalities.

4 Speech Emotion Features: lexical features, acoustic features, prosodic features.

5 Lexical Features Explicit affective messages. Affective words, stress, etc.

6 Acoustic Features: Traditional acoustic features (MFCC, LPC, PLP, etc.), with many filter bank and prefiltering options. Simple signal measures (e.g. zero crossings, HNR). Other spectral measures (e.g. formants, long-term spectrum).

7 Prosodic Features. Pitch/F0: pitch tracker; F0 contour and derivative distributions. Rhythm/Duration: voiced/unvoiced/silence segmentation; distributions of segments and segment ratios; phoneme segmentation; speech rate. Loudness/Intensity: FFT and short-segment energy; energy contours and spectral parameters. Quality: inverse filtering; vocal source parameters.

8 Process of Speech Production

9 Process of Speech Production

10 The mechanism of Speech Production: The vocal tract begins at the opening of the vocal cords and ends at the lips. It is a non-uniform acoustic tube of varying diameter. The cross-sectional area of the vocal tract is determined by the positions of the tongue, lips, jaw and velum.

11 The mechanism of Speech Production: The nasal tract begins at the velum and ends at the nostrils (the diameter of the nasal tract also varies). When the velum is lowered, the nasal tract is acoustically coupled to the vocal tract to produce the nasal sounds of speech.

12 Classification of Speech Sounds. Voiced sounds: the vocal cords vibrate as air passes through, e.g. vowels like /a/, /e/, /i/. Unvoiced sounds: the vocal cords do not vibrate, e.g. /f/, /s/, /k/. Other sounds: nasal sounds, plosive sounds.

13 Voiced Sound: The vocal cords usually vibrate at a particular frequency, called the fundamental frequency (F0) of the sound. Different speakers have different fundamental frequencies: 50 to 200 Hz for male speakers, 150 to 300 Hz for female speakers, 200 to 400 Hz for child speakers. The inverse of the fundamental frequency is the pitch period.

15 Unvoiced Sound: Characterised by high-frequency components, much like random noise. For unvoiced sounds, the vocal cords are held open; air rushes from the lungs through the vocal tract, is shaped by it, and comes out at the lips.

16 Other Sound Classes. Nasal sounds: the vocal cords may vibrate; the vocal tract is coupled with the nasal cavity; sound is radiated from the nostrils and lips; e.g. /ing/. Plosive sounds: generated by pressure built up behind a closure and released suddenly; e.g. /k/, /t/.

17 Resonant Frequencies of Vocal Tract: The vocal tract is a non-uniform acoustic tube, i.e. its diameter varies. Vocal tract shapes with different diameters generate different resonant frequencies, which are called formants. Three to four formants are present below 4 kHz in speech.

18 Formants

22 Formants: vowels

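23 Speech production: $P(f) = U(f)\, T(f)\, R(f)$, where, in the usual source-filter reading, $U(f)$ is the glottal source spectrum, $T(f)$ the vocal tract transfer function, $R(f)$ the radiation characteristic at the lips, and $P(f)$ the spectrum of the radiated speech.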

24 Linear model of speech production. Prosodic parameters: pitch and intonation; quality (vocal source and tract); intensity; durations.

25 Speech feature extraction. Local and global features: a local feature describes a single frame; a global feature describes a whole utterance. Short-time analysis.
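The step from local to global features can be illustrated with a minimal sketch in plain Python. It assumes a frame-level contour (one value per frame, for example F0 or energy) is already available; the particular statistics chosen here (mean, standard deviation, minimum, maximum, range) and the name `global_features` are illustrative assumptions, not taken from the slides.

```python
import statistics

def global_features(contour):
    """Summarise a frame-level contour (one value per frame) for a whole utterance."""
    if not contour:
        raise ValueError("contour must contain at least one frame value")
    lowest, highest = min(contour), max(contour)
    return {
        "mean": statistics.mean(contour),
        "std": statistics.pstdev(contour),  # population standard deviation
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }
```

For example, `global_features([100.0, 120.0, 140.0])` turns three hypothetical per-frame F0 values into one fixed-length description of the utterance, which is the form most classifiers expect.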

26 Framing/Windowing: A window function is used to segment the speech signal into short frames, typically 10 to 20 ms each. Examples: rectangular window, Hamming window.
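A minimal sketch of framing with a Hamming window, in plain Python. The 20 ms frame length follows the range given on the slide; the 10 ms hop (50% overlap) and the function names are assumptions made for illustration.

```python
import math

def hamming(n_len):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def frame_signal(signal, sample_rate, frame_ms=20.0, hop_ms=10.0, window=hamming):
    """Split `signal` into overlapping frames and apply the window to each."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    w = window(frame_len)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append([signal[start + i] * w[i] for i in range(frame_len)])
    return frames
```

Passing `window=lambda n: [1.0] * n` gives the rectangular window instead. The Hamming window tapers the frame edges, which reduces spectral leakage at the cost of a wider main lobe.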

27 Acoustic Feature: Zero Crossing Count (ZCC). The ZCC reflects the frequency of the signal. For frame $i$ with $N$ samples it is calculated as $ZCC_i = 0.5 \sum_{k=1}^{N-1} \left| \operatorname{sign}(s(k)) - \operatorname{sign}(s(k-1)) \right|$. The DC offset should be removed first.
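The zero crossing count can be sketched in a few lines of plain Python. Each sign change between neighbouring samples contributes |(+1) - (-1)| = 2, which the factor 0.5 turns into one crossing. Two details are assumptions of this sketch: the DC offset is removed by subtracting the frame mean, and a sample equal to zero is treated as positive.

```python
def sign(x):
    """Sign with zero counted as positive (a convention assumed here)."""
    return 1 if x >= 0 else -1

def zero_crossing_count(frame):
    """ZCC = 0.5 * sum_k |sign(s[k]) - sign(s[k-1])| after removing the DC offset."""
    mean = sum(frame) / len(frame)
    centered = [x - mean for x in frame]  # DC offset removal
    return 0.5 * sum(abs(sign(centered[k]) - sign(centered[k - 1]))
                     for k in range(1, len(centered)))
```

Without the mean subtraction, a frame such as `[11, 9, 11, 9]` would never cross zero even though it oscillates on every sample, which is why the slide asks for the DC offset to be removed.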

28 Acoustic Feature: Mel-cepstrum. Mel-Frequency Cepstral Coefficients (MFCC) use a mel-scale spaced filter bank, which corresponds to the human auditory system (equal perceived pitch increments). Usually ~12-24 coefficients are used, with 50% overlapping windows. Pre-emphasis is often used for loudness equalization. Mean cepstral subtraction gives relative features; delta and delta-delta features are possible for sequences. Alternatively, critical band energy features can be used, i.e. logarithms of the band filter outputs, with no DCT.
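The idea of a mel-spaced filter bank can be made concrete with a small sketch. The slide does not give the mel mapping itself; the code below assumes the commonly used formula mel = 2595 * log10(1 + f / 700), and the function names are illustrative. It only computes filter centre frequencies, not the full MFCC pipeline.

```python
import math

def hz_to_mel(f_hz):
    """Assumed mapping: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Centre frequencies (Hz) of n_filters bands equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_filters)]
```

Calling `mel_center_frequencies(12, 0.0, 4000.0)` yields centres that are densely packed at low frequencies and spread out at high frequencies, which is the "equal perceived pitch increments" property mentioned above.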

29 Pre-emphasis Filtering: $\frac{S(z)}{E(z)} = A_v \cdot \frac{1}{(1 - z^{-1})^2} \cdot \frac{1}{1 + \sum_{k=1}^{P} a_k z^{-k}} \cdot (1 - z^{-1})$. The speech signal is not the original signal from the vocal tract. To focus on the vocal tract, we have to apply a high-pass filter that cancels the factors which do not belong to the vocal tract.
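A common way to realise such a high-pass filter is a first-order difference, y[n] = x[n] - alpha * x[n-1]. The sketch below is plain Python; the coefficient alpha = 0.97 is a widely used default assumed here, not a value given on the slide.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    The first sample is passed through unchanged because it has no predecessor.
    """
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]
```

A constant (zero-frequency) input is almost cancelled, while a rapidly alternating input is nearly doubled in amplitude, which shows the high-pass behaviour.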

30 Prosodic Feature: Short-Time Energy (STE). The sum of the squares of all samples within a frame: $E_m = \sum_{n} \left[ s(n)\, w(m - n) \right]^2$. Used to distinguish voiced and unvoiced sounds: a larger STE indicates voiced sound, a smaller STE indicates unvoiced sound.
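For a frame that has already been cut out of the signal, the short-time energy reduces to a windowed sum of squares. A minimal sketch in plain Python, where the rectangular default window is an assumption of this example:

```python
def short_time_energy(frame, window=None):
    """Sum of squared, windowed samples within one frame."""
    if window is None:
        window = [1.0] * len(frame)  # rectangular window by default
    if len(window) != len(frame):
        raise ValueError("window and frame must have the same length")
    return sum((s * w) ** 2 for s, w in zip(frame, window))
```

Because the value scales with frame length and recording level, thresholds on STE are usually set relative to the recording rather than as absolute numbers.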

31 Utilizing the Features. Voiced speech: STE high, ZCC low. Unvoiced speech: STE low, ZCC high.
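The rule above can be written as a simple two-threshold decision. This is a sketch under stated assumptions: both thresholds are hypothetical parameters that would have to be tuned per recording, and the "uncertain" label for frames where the two cues disagree is an addition of this example.

```python
def classify_frame(frame, ste_threshold, zcc_threshold):
    """Label a frame as voiced or unvoiced from its energy and zero crossings.

    ste_threshold and zcc_threshold are hypothetical, data-dependent values.
    """
    energy = sum(x * x for x in frame)
    crossings = sum(1 for k in range(1, len(frame))
                    if (frame[k] >= 0) != (frame[k - 1] >= 0))
    if energy >= ste_threshold and crossings <= zcc_threshold:
        return "voiced"      # high STE, low ZCC
    if energy < ste_threshold and crossings > zcc_threshold:
        return "unvoiced"    # low STE, high ZCC
    return "uncertain"       # the two cues disagree
```

Silence also has low energy, so a practical segmenter needs a separate silence check before this rule is applied.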

32 Prosodic Feature: Fundamental Frequency. Pitch period $= 1 / F_0$, where $F_0$ is the fundamental frequency of vocal cord vibration. Methods for estimating the pitch period: time-domain methods and frequency-domain methods.
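As a worked example of the relation above, take an assumed $F_0$ of 100 Hz and an assumed sampling rate $f_s$ of 16 kHz (neither value comes from the slides):

```latex
T_0 = \frac{1}{F_0} = \frac{1}{100\ \text{Hz}} = 10\ \text{ms},
\qquad
N_0 = T_0 \cdot f_s = 0.010\ \text{s} \times 16\,000\ \text{Hz} = 160\ \text{samples}
```

Under these assumptions a 20 ms analysis frame spans two pitch periods, which is about the minimum a time-domain estimator needs to see the periodicity.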

33 Extract Pitch Period in Time Domain. Short-time autocorrelation function (assumption: one signal can be considered a delayed version of another): $\phi(k) = \frac{1}{N} \sum_{n=0}^{N-1} s(n)\, s(n-k)$; find the $k$ that maximizes $\phi(k)$. Average magnitude difference function: $D(k) = \frac{1}{N} \sum_{n=0}^{N-1} \left| s(n) - s(n-k) \right|$; find the $k$ that minimizes $D(k)$.
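Both time-domain estimators can be sketched in plain Python. Several choices here are assumptions of the sketch rather than part of the slides: sums are restricted to samples inside the frame, the AMDF is averaged over the overlapping samples only, the lag search is limited to 50-400 Hz (the speaker ranges quoted earlier), and the AMDF picker takes the smallest lag within a tolerance `tol` of the deepest valley to avoid locking onto a multiple of the period.

```python
def autocorrelation(frame, lag):
    """Short-time autocorrelation phi(k), using only samples inside the frame."""
    return sum(frame[n] * frame[n - lag]
               for n in range(lag, len(frame))) / len(frame)

def amdf(frame, lag):
    """Average magnitude difference D(k), averaged over the overlapping samples."""
    n_terms = len(frame) - lag
    return sum(abs(frame[n] - frame[n - lag])
               for n in range(lag, len(frame))) / n_terms

def lag_range(frame, sample_rate, f0_min, f0_max):
    """Candidate lags (in samples) for F0 between f0_min and f0_max."""
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    return range(lag_min, lag_max + 1)

def pitch_by_autocorrelation(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """F0 estimate (Hz) from the lag that maximizes phi(k)."""
    lags = lag_range(frame, sample_rate, f0_min, f0_max)
    best = max(lags, key=lambda k: autocorrelation(frame, k))
    return sample_rate / best

def pitch_by_amdf(frame, sample_rate, f0_min=50.0, f0_max=400.0, tol=0.05):
    """F0 estimate (Hz) from the smallest lag whose D(k) is near the minimum."""
    lags = lag_range(frame, sample_rate, f0_min, f0_max)
    values = {k: amdf(frame, k) for k in lags}
    lowest, highest = min(values.values()), max(values.values())
    threshold = lowest + tol * (highest - lowest)
    best = min(k for k in lags if values[k] <= threshold)
    return sample_rate / best
```

Both functions assume the frame is voiced; on unvoiced or silent frames the returned value is not meaningful, so a voiced/unvoiced decision should come first.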

34 Features within an Utterance

35 Empirical Results on Emotional Effects in Speech

38 Emotional F0 contour examples (neutral, bored, angry, happy): same text and speakers with different emotions; same text with different speakers and different emotions.

39 Features: Prosodic features

40 Emotion recognition from speech. Traditional machine learning tools are used frequently. Feature selection and transformations: sequential forward floating search (SFFS), principal component analysis (PCA), nonlinear manifold modeling, etc. Classifiers: linear discriminant analysis (LDA), k-nearest neighbors (kNN), support vector machines (SVM), hidden Markov models (HMM), neural networks (NN). Validation and regularization: cross-validation, cost/penalty functions, Bayesian Information Criterion (BIC), structural risk minimization (SRM), etc.

41 State-of-the-art methods. Pitch tracker: autocorrelation is probably the best short-term method; a better estimate of glottal closures is needed, e.g. waveform matching (time domain). Classifier: SVM or neural network; any classifier accepting nonlinear data will do; deep neural architectures are a current trend. Feature training: genetic algorithms, floating search; PCA transformation of traditional features seems to help very little, nonlinear methods (e.g. Isomap) are better.

42 Deep Learning. Current paradigm in ASR: the state-of-the-art approach used in all major speech recognition solutions (Apple, Google, Facebook, Microsoft, ...). An alternative to feature engineering: can use e.g. raw spectrograms and/or (large) sets of traditional acoustic features as inputs; hidden layers are used to learn nonlinear features or filter banks; fusion of multimodal sources is straightforward. Computational costs and overlearning are problems but, if correctly applied, deep learning offers very promising performance.

43 State-of-the-art performance. Theoretical performance according to literature: % in an automatic speaker-independent limited emotion case (discrimination) with neutral, sad, happy, angry; 55-70% for human reference in a non-limited recognition of basic emotions in a multicultural context. In practice (neutral, sad, happy, angry, disgusted, surprised, fearful): % depending on the scenario constraints, sample size, quality, number of emotions, and available features.

44 Linear Predictive Coding (LPC) Very useful for estimating pitch, formants, spectra, and vocal tract parameters Assumption: a speech signal sample can be estimated as a linear combination of past samples.

45 LPC (Cont.) Inverse z-transform of the vocal tract model: $s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n)$. If we can find another set of coefficients $\hat{a}_k$ such that $\sum_{k=1}^{p} \hat{a}_k\, s(n-k) = \sum_{k=1}^{p} a_k\, s(n-k) = s(n) - G\, u(n)$, then we have $e(n) = s(n) - \sum_{k=1}^{p} \hat{a}_k\, s(n-k) = G\, u(n)$.

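46 Usage of LPC: Measure the pitch period more precisely using the prediction residual, computed with the estimated coefficients $\hat{a}_k$: $e(n) = s(n) - \sum_{k=1}^{p} \hat{a}_k\, s(n-k) = G\, u(n)$. Question: what happens if we apply this model to unvoiced speech? Note: these coefficients should be calculated on a short-time basis, within frames.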
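One standard way to estimate the predictor coefficients for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimation method is intended, so this choice is an assumption, as are the function names. The sketch is plain Python and returns coefficients in the convention s(n) ≈ sum_k a_k s(n-k).

```python
def autocorr(frame, max_lag):
    """Autocorrelation values r[0..max_lag] over samples inside the frame."""
    n_len = len(frame)
    return [sum(frame[i] * frame[i - k] for i in range(k, n_len))
            for k in range(max_lag + 1)]

def lpc(frame, order):
    """Levinson-Durbin recursion.

    Returns (a, err): a[k-1] is the coefficient a_k, err is the remaining
    prediction error energy.
    """
    r = autocorr(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:       # silent or degenerate frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err        # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def lpc_residual(frame, a):
    """e(n) = s(n) - sum_k a_k s(n-k), treating samples before the frame as zero."""
    p = len(a)
    return [frame[n] - sum(a[k] * frame[n - 1 - k] for k in range(min(p, n)))
            for n in range(len(frame))]
```

For voiced speech the residual is close to a pulse train at the pitch period, so a pitch estimator applied to `lpc_residual(frame, a)` sees the excitation with the formant structure largely removed.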
