Automatic Recognition of Speaker Age in an Inter-cultural Context


1 Automatic Recognition of Speaker Age in an Inter-cultural Context
Michael Feld, DFKI
in cooperation with Meraka Institute, Pretoria
FEAST


2 Speaker Classification
Purposes:
  Bootstrapping a user model based only on speech
  Retrieval and annotation of paralinguistic information from speech segments
Age: senior; Gender: male; Language: French; Cogn. load: high
child / adult; high arousal / low arousal; angry / sad / happy
Adaptation / Personalization
Semantic Interpretation

3 Application Scenarios
User adaptation on mobile devices
User adaptation at public terminals
Phone-based services

4 In-Car Services Scenario
"Where can I go shopping nearby?"
SB-DFKI 2009

5 Pattern Classification System
Sensing -> Segmentation -> Feature Extraction -> Classification -> Post-processing
Duda, Hart and Stork (2000)

6 GMM-SVM Supervector Approach (1)
Audio Data -> MFCC extraction (HTK) -> Full feature table -> Frame filter (silence removal, speaker/length balancing, dataset selection)
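The silence-removal part of the frame filter could look roughly like the sketch below. It is only an illustration: the slides name an intensity-based filter but give no details, so the log-energy measure, the 30 dB margin, and the function names are all assumptions.

```python
import math

def frame_log_energy(frame):
    """Log energy (dB) of one frame of samples; the small floor avoids log(0)."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)

def filter_silent_frames(frames, features, margin_db=30.0):
    """Keep feature vectors whose frame lies within margin_db of the loudest frame.

    margin_db is a hypothetical tuning value, not one taken from the slides.
    """
    energies = [frame_log_energy(f) for f in frames]
    threshold = max(energies) - margin_db
    return [feat for feat, e in zip(features, energies) if e >= threshold]

# Toy input: three frames of raw samples and their (made-up) feature vectors.
frames = [
    [0.5, -0.4, 0.6, -0.5],      # speech-like, loud
    [0.001, -0.001, 0.0, 0.001], # near silence
    [0.3, 0.2, -0.3, -0.2],      # speech-like, quieter
]
features = [[1.0, 2.0], [0.0, 0.1], [0.8, 1.9]]
kept = filter_silent_frames(frames, features)
```

A relative threshold is used here so the filter does not depend on the absolute recording level, which matters for telephone data of varying quality.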

7 GMM-SVM Supervector Approach (2)
UBM Training -> UBM (GMM)
UBM + utterances from Train / DevTest / Eval -> Utterance GMM Training (MAP Adaptation) -> Utterance GMMs
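The MAP adaptation step can be sketched as follows. This is a minimal, means-only version for diagonal-covariance GMMs in the usual relevance-factor formulation; the slides do not say which variant was used, and the function names and the relevance value of 16 are assumptions.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Adapt UBM means to one utterance; weights and variances stay fixed."""
    K, D = len(means), len(means[0])
    n = [0.0] * K                           # soft frame counts per mixture
    first = [[0.0] * D for _ in range(K)]   # posterior-weighted sums of frames
    for x in frames:
        logp = [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logp)
        post = [math.exp(lp - top) for lp in logp]
        total = sum(post)
        for k in range(K):
            g = post[k] / total
            n[k] += g
            for d in range(D):
                first[k][d] += g * x[d]
    # alpha * E[x] + (1 - alpha) * mu with alpha = n / (n + r)
    # simplifies to (sum + r * mu) / (n + r), which is safe when n is 0.
    return [[(first[k][d] + relevance * means[k][d]) / (n[k] + relevance)
             for d in range(D)] for k in range(K)]

# Toy UBM with two 1-D mixtures and 16 identical frames near the first one.
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0], [10.0]]
ubm_vars = [[1.0], [1.0]]
adapted = map_adapt_means([[1.0]] * 16, ubm_weights, ubm_means, ubm_vars)
```

With 16 frames and a relevance factor of 16, the first mean moves halfway from the UBM value toward the data, while the second mixture, which sees almost no frames, stays at its UBM value.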

8 GMM-SVM Supervector Approach (3)
Utterance GMM Export -> Data: Means (coefficients * mixtures) -> Normalization
Train: SVM Training -> Target SVM
DevTest: Classification with SVM, Threshold Tuning
Eval: Evaluation
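The export and normalization steps might be sketched like this. The supervector is simply the adapted means stacked into one vector of length coefficients * mixtures. The slides list the normalization method as a tunable parameter, so the min-max scaling shown here is an assumption, as are the function names; the SVM itself is left out.

```python
def supervector(means):
    """Concatenate the mixture means of one utterance GMM into a flat vector."""
    return [value for mean in means for value in mean]

def fit_minmax(train_vectors):
    """Per-dimension minimum and maximum, estimated on the training set only."""
    lows = [min(col) for col in zip(*train_vectors)]
    highs = [max(col) for col in zip(*train_vectors)]
    return lows, highs

def apply_minmax(vector, lows, highs):
    """Scale each dimension to [0, 1]; constant dimensions map to 0."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(vector, lows, highs)]

# Two toy training utterances, each a GMM with 2 mixtures and 2 coefficients.
train = [
    supervector([[0.0, 2.0], [4.0, 4.0]]),
    supervector([[2.0, 6.0], [8.0, 4.0]]),
]
lows, highs = fit_minmax(train)
scaled = apply_minmax(supervector([[1.0, 4.0], [6.0, 4.0]]), lows, highs)
```

Fitting the scaling on Train and reusing it unchanged on DevTest and Eval keeps the evaluation sets from leaking into the model.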

9 Classifier Tuning
To find the point of optimal classifier performance, threshold tuning can be applied to trained classifiers.
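As an illustration, threshold tuning for one binary target classifier could be a simple sweep over the scores observed on held-out data. The slide does not define "optimal performance", so accuracy is assumed as the criterion here, and the scores and labels are made up.

```python
def tune_threshold(scores, labels):
    """Return (threshold, accuracy) maximizing accuracy for score >= threshold."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(scores)):
        correct = sum((s >= candidate) == bool(y)
                      for s, y in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:  # ties keep the lowest threshold
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy

# Hypothetical SVM decision values and true labels (1 = target class).
devtest_scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
devtest_labels = [0, 0, 1, 0, 1, 1]
threshold, accuracy = tune_threshold(devtest_scores, devtest_labels)
```

The threshold would be chosen on a development set and then applied unchanged to the evaluation set.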

10 Parameters
MFCC extraction step width
MFCC extraction window size
MFCC coefficients
MFCC delta coefficients
Intensity-based frame filter
Number of Gaussians
MAP relevance factor
GMM initialization
Number of GMM training steps (EM algorithm)
Nuisance variability compensation
SVM input feature normalization method
SVM kernel function
SVM margin trade-off

11 Classification Task Description
German corpus from Deutsche Telekom
Telephone speech (8000 Hz), high quality
~700 speakers, 1-6 sessions, 18 turns
Short utterances (numbers, names, commands, ...)
70% training/test, 30% eval
Best-path experiment (thus eval = test2)

Age classes:
1  Children
2  Young female
3  Young male
4  Adult female
5  Adult male
6  Senior female
7  Senior male

12 Evaluation Results on DTAG
Confusion Matrix (Identification Task)
Rows: tested class. Columns: classified as.

%        1 (C)    2 (YF)   3 (YM)   4 (AF)   5 (AM)   6 (SF)   7 (SM)   Total
1        48.07%   18.21%    9.66%    9.13%    1.70%   11.07%    2.17%   13.41%
2        11.20%   43.53%    1.27%   30.70%    0.92%   11.86%    0.51%   15.42%
3         1.36%    1.36%   60.26%    1.62%   16.19%    2.09%   17.13%   15.04%
4         4.73%   18.54%    8.42%   31.99%    5.13%   28.25%    2.94%   15.76%
5         1.04%    0.88%   34.70%    2.24%   31.69%    2.35%   27.09%   14.35%
6         8.28%   11.55%    3.79%   30.16%    1.28%   42.88%    2.04%   13.46%
7         1.00%    1.19%   26.19%    2.31%   20.00%    2.56%   46.75%   12.56%
           ,51%     ,11%     ,66%     ,92%     ,84%     ,51%     ,44%

Correct: 5534  Incorrect: 7201  Accuracy: 43.46%
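The reported accuracy can be checked in two ways from the numbers on this slide: directly from the correct/incorrect counts, and as the per-class hit rates on the diagonal weighted by each class's share of the test samples. The second check assumes that the last value in each row is that share, which the row sums suggest.

```python
# Counts as given on the slide.
correct, incorrect = 5534, 7201
accuracy = correct / (correct + incorrect)

# Diagonal of the confusion matrix (per-class hit rate, in percent) and
# the last value of each row, read here as the class share in percent.
diagonal = [48.07, 43.53, 60.26, 31.99, 31.69, 42.88, 46.75]
share = [13.41, 15.42, 15.04, 15.76, 14.35, 13.46, 12.56]
weighted = sum(d * s for d, s in zip(diagonal, share)) / 100.0

print(f"accuracy from counts: {accuracy * 100:.2f}%")
print(f"share-weighted diagonal: {weighted:.2f}%")
```

Both values land at about 43.5%, well above the 14.3% expected from guessing uniformly among seven classes.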

13 Impact of Language / Culture
Is the approach in general independent of language/culture?
Can we apply models trained on only one language to another language?
Can we apply generic models to a particular language?
Does using language-specific models improve the classification?
Which features are affected?

14 The Lwazi Corpus
Created as part of a South African speech technology project
Balance of genders and landline/mobile
Telephone recordings of varying quality, some very low and with background noise
Estimation difficult even for humans
Very different cultural backgrounds and language differences (clicks)

15 Lwazi Corpus Sighting
Age distribution:
  Heavy focus on ages
  No* children
  Few senior speakers
Consequences:
  Use only classes 2-7 or 2-5
  Choose different boundaries
  Use regression approach
Table (Class, Age, %): Children, Young female, Young male, Adult female, Adult male, Senior female, Senior male

16 Long-term Features
Features extracted by Praat scripts, averaged over an utterance
Fundamental frequency F0: pitch_min, pitch_max, pitch_quant, pitch_mean, pitch_stdev, pitch_mas, pitch_swoj
Jitter (F0 micro-variations): jitt_l, jitt_la, jitt_ppq, jitt_rap, jitt_ddp
Intensity: intens_mean, intens_min, intens_max, intens_stdev
Shimmer (amplitude micro-variations): shim_l, shim_ldb, shim_apq3, shim_apq5, shim_apq11, shim_dda
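To make the jitter and shimmer features concrete, here is a small sketch of the "local" variants: the mean absolute difference between consecutive glottal periods (or peak amplitudes), divided by the mean period (or amplitude). That jitt_l and shim_l correspond to these local measures is an assumption based on the names; the period and amplitude values are made up.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical glottal period lengths (seconds) and peak amplitudes.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
amplitudes = [0.80, 0.82, 0.79, 0.81]

jitter_local = local_perturbation(periods)      # F0 micro-variation
shimmer_local = local_perturbation(amplitudes)  # amplitude micro-variation
```

Both measures are dimensionless ratios, which makes them comparable across speakers with different average pitch or loudness.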

17 Corpus Analysis (1) SA English

18 Corpus Analysis (2) Young Female

19 Corpus Analysis (3) Adult Male

20 Corpus Analysis (4) Adult Female

21 Evaluation Results on Lwazi
Accuracy with GMM-SVM supervector models considerably lower, more tests needed
Linear regressor based on long-term features: mean absolute errors between 7.7 and 12.8 years
Language-dependent behavior
[Figure: prediction error by training language and test language]
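The error measure quoted here is the mean absolute error in years. A minimal sketch of a one-feature linear regressor with that metric is given below; the actual system used several long-term features, and the pitch and age values are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted|, in the unit of the target (years)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up training data: mean pitch in Hz against age in years.
train_pitch = [220.0, 200.0, 180.0, 160.0]
train_age = [20.0, 30.0, 40.0, 50.0]
slope, intercept = fit_line(train_pitch, train_age)

# Made-up test speakers, e.g. from a different language than the training set.
test_pitch = [210.0, 170.0]
test_age = [30.0, 40.0]
predicted = [slope * x + intercept for x in test_pitch]
mae = mean_absolute_error(test_age, predicted)
```

Training on one language and computing the error on another, for every pair of languages, would give the kind of training-language by test-language comparison the slide refers to.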

22 Next Steps
Further work on multi-linguality
  Pre-processing of Lwazi data
  Training of models on Lwazi corpus
Further improvement of the classification
  Extend parameter space
  True regression approach
Application side
  Application of GMM-SVM supervector system for in-car acoustic event detection and further speaker properties
  Integration of automatic age/gender/... recognition as one knowledge source into a KM system

23 Thank you!
