Automatic Recognition of Speaker Age in an Inter-cultural Context

Michael Feld, DFKI, in cooperation with Meraka Institute, Pretoria
FEAST

Speaker Classification Purposes
- Bootstrapping a user model based only on speech
- Retrieval and annotation of paralinguistic information from speech segments
  Example annotations: age: senior; gender: male; language: French; cognitive load: high
  Further example labels: child, adult; high arousal, low arousal; angry, sad, happy
- Adaptation / personalization
- Semantic interpretation

Application Scenarios
- User adaptation on mobile devices
- User adaptation at public terminals
- Phone-based services

In-Car Services Scenario
Example query: "Where can I go shopping nearby?"
SB-DFKI 2009

Pattern Classification System
Sensing -> Segmentation -> Feature Extraction -> Classification -> Post-processing
Source: Duda, Hart and Stork (2000)

GMM-SVM Supervector Approach (1)
Audio data -> MFCC extraction (HTK) -> full feature table -> frame filter (silence removal, speaker/length balancing, dataset selection)
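The silence-removal part of the frame filter can be illustrated with a minimal sketch. The slides do not say how the filter is implemented; the energy measure, the function names and the `margin_db` parameter below are assumptions made for illustration only.

```python
import math


def frame_log_energy(frame):
    """Return the log energy (in dB) of one frame of audio samples."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)  # small floor avoids log(0)


def filter_silent_frames(frames, features, margin_db=30.0):
    """Keep feature vectors whose frame lies within margin_db of the loudest frame.

    frames   -- list of sample lists, one per analysis frame
    features -- list of feature vectors (e.g. MFCCs), aligned with frames
    margin_db is a hypothetical tuning parameter, not a value from the slides.
    """
    energies = [frame_log_energy(f) for f in frames]
    threshold = max(energies) - margin_db
    return [feat for feat, e in zip(features, energies) if e >= threshold]
```

A relative threshold is used here so that the filter does not depend on the absolute recording level, which varies across telephone channels.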

GMM-SVM Supervector Approach (2)
- UBM training produces the UBM (a GMM).
- Utterance GMM training: MAP adaptation of the UBM to each utterance of the Train, DevTest and Eval sets produces the utterance GMMs.
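The MAP adaptation step can be sketched as follows. This is a means-only relevance-MAP update for a diagonal-covariance GMM, written under the assumption that only the means are adapted; the slides do not state which parameters are adapted, and the default relevance factor of 16 is a commonly used value, not one taken from the slides.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Adapt the UBM means to the frames of one utterance."""
    n_mix, dim = len(means), len(means[0])
    counts = [0.0] * n_mix                       # soft frame counts per mixture
    sums = [[0.0] * dim for _ in range(n_mix)]   # posterior-weighted frame sums
    for x in frames:
        likes = [w * gaussian_pdf_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:
            continue  # frame is too far from every mixture to contribute
        for k in range(n_mix):
            post = likes[k] / total
            counts[k] += post
            for d in range(dim):
                sums[k][d] += post * x[d]
    adapted = []
    for k in range(n_mix):
        alpha = counts[k] / (counts[k] + relevance)  # data-dependent mixing weight
        if counts[k] > 0.0:
            data_mean = [s / counts[k] for s in sums[k]]
        else:
            data_mean = means[k]
        adapted.append([alpha * dm + (1.0 - alpha) * m
                        for dm, m in zip(data_mean, means[k])])
    return adapted
```

Mixtures that see little data stay close to the UBM, which is what makes the adapted means comparable across short utterances.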

GMM-SVM Supervector Approach (3)
- Utterance GMM export: means (coefficients * mixtures)
- Data normalization
- Train: SVM training, producing the target SVM
- DevTest: classification with SVM, threshold tuning
- Eval: evaluation
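The export and normalization steps can be sketched as below. Stacking the adapted means gives one vector of length coefficients * mixtures per utterance. The z-score normalization shown here is an assumption: the slides list the normalization method as a tunable parameter without naming it, and all function names are hypothetical.

```python
def supervector(adapted_means):
    """Concatenate the mixture means of one utterance GMM into a single vector."""
    return [value for mean in adapted_means for value in mean]


def fit_zscore(train_vectors):
    """Estimate per-dimension mean and standard deviation on the training set."""
    n, dim = len(train_vectors), len(train_vectors[0])
    mu = [sum(v[d] for v in train_vectors) / n for d in range(dim)]
    sd = []
    for d in range(dim):
        variance = sum((v[d] - mu[d]) ** 2 for v in train_vectors) / n
        # Constant dimensions get a scale of 1.0 to avoid division by zero.
        sd.append(variance ** 0.5 if variance > 0.0 else 1.0)
    return mu, sd


def apply_zscore(vector, mu, sd):
    """Normalize one supervector with statistics estimated on the training set."""
    return [(x - m) / s for x, m, s in zip(vector, mu, sd)]
```

The statistics are estimated on the training set only and then reused for DevTest and Eval, so that no information from the evaluation data leaks into the SVM input.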

Classifier Tuning
To find the point of optimal classifier performance, threshold tuning can be applied to trained classifiers.
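A minimal sketch of threshold tuning for one binary target classifier follows. It assumes that the classifier emits a real-valued score per utterance and that accuracy on held-out data is the criterion; the slides do not specify the criterion, so both are assumptions.

```python
def tune_threshold(scores, labels):
    """Return (threshold, accuracy) maximizing accuracy on held-out scores.

    scores -- classifier outputs, higher means "more likely target class"
    labels -- 1 for target class, 0 otherwise
    An utterance is accepted when its score is >= the threshold.
    """
    candidates = sorted(set(scores))
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # ties keep the lowest threshold
            best_t, best_acc = t, acc
    return best_t, best_acc
```

The threshold would be chosen on the DevTest set and then applied unchanged to the Eval set.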

Parameters
- MFCC extraction step width
- MFCC extraction window size
- MFCC coefficients
- MFCC delta coefficients
- Intensity-based frame filter
- Number of Gaussians
- MAP relevance factor
- GMM initialization
- Number of GMM training steps (EM algorithm)
- Nuisance variability compensation
- SVM input feature normalization method
- SVM kernel function
- SVM margin trade-off

Classification Task Description
- German corpus from Deutsche Telekom
- Telephone speech (8000 Hz), high quality
- ~700 speakers, 1-6 sessions, 18 turns
- Short utterances (numbers, names, commands, ...)
- 70% training/test, 30% eval
- Best-path experiment (thus eval = test2)

Class  Name           Age
1      Children       0-14
2      Young female   15-24
3      Young male     15-24
4      Adult female   25-54
5      Adult male     25-54
6      Senior female  55+
7      Senior male    55+
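The seven-class scheme can be written as a small mapping function. The function name and the gender strings are hypothetical; the age boundaries are the ones given on the slide.

```python
def age_gender_class(age, gender):
    """Map a speaker's age and gender to the class index 1-7 used on the slide.

    gender is "female" or "male"; it is ignored for children (class 1).
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 14:
        return 1  # children are not split by gender
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    offset = 0 if gender == "female" else 1
    if age <= 24:
        return 2 + offset  # young female / young male
    if age <= 54:
        return 4 + offset  # adult female / adult male
    return 6 + offset      # senior female / senior male
```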

Evaluation Results on DTAG
Confusion matrix (identification task), 12735 test utterances (100%).
Rows: tested class. Columns: classified as. Last column: share of all tested utterances.

Tested   1 (C)    2 (YF)   3 (YM)   4 (AF)   5 (AM)   6 (SF)   7 (SM)   Share
1        48.07%   18.21%    9.66%    9.13%    1.7%    11.07%    2.17%   13.41%
2        11.2%    43.53%    1.27%   30.7%     0.92%   11.86%    0.51%   15.42%
3         1.36%    1.36%   60.26%    1.62%   16.19%    2.09%   17.13%   15.04%
4         4.73%   18.54%    8.42%   31.99%    5.13%   28.25%    2.94%   15.76%
5         1.04%    0.88%   34.7%     2.24%   31.69%    2.35%   27.09%   14.35%
6         8.28%   11.55%    3.79%   30.16%    1.28%   42.88%    2.04%   13.46%
7         1%       1.19%   26.19%    2.31%   20%       2.56%   46.75%   12.56%
Count     1339     1797     2631     2027     1381     1848     1712
Share    10.51%   14.11%   20.66%   15.92%   10.84%   14.51%   13.44%

Correct: 5534. Incorrect: 7201. Accuracy: 43.46%.
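As a consistency check, the overall accuracy can be recomputed from the matrix: each diagonal entry is the recall of one class, and weighting it by that class's share of the tested utterances gives the overall accuracy. The sketch below uses the slide's numbers; the function names are hypothetical.

```python
# Row-normalized confusion matrix in percent (rows: tested, columns: classified as).
CONFUSION = [
    [48.07, 18.21, 9.66, 9.13, 1.7, 11.07, 2.17],
    [11.2, 43.53, 1.27, 30.7, 0.92, 11.86, 0.51],
    [1.36, 1.36, 60.26, 1.62, 16.19, 2.09, 17.13],
    [4.73, 18.54, 8.42, 31.99, 5.13, 28.25, 2.94],
    [1.04, 0.88, 34.7, 2.24, 31.69, 2.35, 27.09],
    [8.28, 11.55, 3.79, 30.16, 1.28, 42.88, 2.04],
    [1.0, 1.19, 26.19, 2.31, 20.0, 2.56, 46.75],
]
# Share of each tested class among all test utterances, in percent.
TESTED_SHARE = [13.41, 15.42, 15.04, 15.76, 14.35, 13.46, 12.56]


def per_class_recall(confusion):
    """Return the diagonal of a row-normalized confusion matrix."""
    return [row[i] for i, row in enumerate(confusion)]


def overall_accuracy(confusion, tested_share):
    """Weight each class's recall by its share of the test set (all in percent)."""
    recalls = per_class_recall(confusion)
    return sum(share / 100.0 * r for share, r in zip(tested_share, recalls))
```

Up to rounding of the published percentages, `overall_accuracy(CONFUSION, TESTED_SHARE)` agrees with the reported 43.46% (5534 of 12735).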

Impact of Language / Culture
- Is the approach in general independent of language/culture?
- Can we apply models trained on only one language to another language?
- Can we apply generic models to a particular language?
- Does using language-specific models improve the classification?
- Which features are affected?

The Lwazi Corpus
- Created as part of a South African speech technology project
- Balance of genders and of landline/mobile recordings
- Telephone recordings of varying quality, some very low and with background noise
- Age estimation is difficult even for humans
- Very different cultural backgrounds and language differences (e.g. clicks)

Lwazi Corpus Inspection
Age distribution:

Class  Name           Age    %
1      Children       0-14   0.3
2      Young female   15-24  7.9
3      Young male     15-24  7.6
4      Adult female   25-54  21.2
5      Adult male     25-54  19.8
6      Senior female  55+    1.6
7      Senior male    55+    2.3

Observations:
- Heavy focus on ages 20-45
- No* children
- Few senior speakers

Consequences:
- Use only classes 2-7 or 2-5
- Choose different boundaries
- Use regression approach

Long-term Features
Features extracted by Praat scripts, averaged over an utterance:
- Fundamental frequency F0: pitch_min, pitch_max, pitch_quant, pitch_mean, pitch_stdev, pitch_mas, pitch_swoj
- Jitter (F0 micro-variations): jitt_l, jitt_la, jitt_ppq, jitt_rap, jitt_ddp
- Intensity: intens_mean, intens_min, intens_max, intens_stdev
- Shimmer (amplitude micro-variations): shim_l, shim_ldb, shim_apq3, shim_apq5, shim_apq11, shim_dda
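To make the micro-variation features concrete, the sketch below computes a local perturbation measure: the mean absolute difference between consecutive values, divided by the mean value. Applied to glottal period lengths this corresponds to the usual definition of local jitter, and applied to peak amplitudes to local shimmer. That `jitt_l` and `shim_l` denote exactly these measures is an assumption, and the code is not the Praat script used in the study.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods):
    """Local jitter from a sequence of glottal period lengths (seconds)."""
    return local_perturbation(periods)


def local_shimmer(amplitudes):
    """Local shimmer from a sequence of per-period peak amplitudes."""
    return local_perturbation(amplitudes)
```

For example, `local_jitter([0.010, 0.011, 0.010, 0.011])` yields roughly 0.095, that is, consecutive periods differ by about 9.5% of the mean period.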

Corpus Analysis (1) SA English

Corpus Analysis (2) Young Female

Corpus Analysis (3) Adult Male

Corpus Analysis (4) Adult Female

Evaluation Results on Lwazi
- Accuracy with GMM-SVM supervector models is considerably lower; more tests are needed.
- Linear regressor based on long-term features: mean absolute errors between 7.7 and 12.8 years.
- Language-dependent behavior.
[Figure: prediction error by training language and test language]
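The error measure reported for the regressor can be sketched as follows. For brevity the example fits a least-squares line on a single feature, whereas the regressor in the slides uses several long-term features; the function names are hypothetical and no corpus data is reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(xs, slope, intercept):
    """Predict ages from feature values with a fitted line."""
    return [slope * x + intercept for x in xs]


def mean_absolute_error(predicted, actual):
    """Average of |predicted age - actual age| over all test speakers."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

To study language dependence, one would fit on speakers of one language and compute `mean_absolute_error` on speakers of another, giving one error value per training/test language pair.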

Next Steps
- Further work on multi-linguality
  - Pre-processing of Lwazi data
  - Training of models on the Lwazi corpus
- Further improvement of the classification
  - Extend parameter space
  - True regression approach
- Application side
  - Application of the GMM-SVM supervector system to in-car acoustic event detection and further speaker properties
  - Integration of automatic age/gender/... recognition as one knowledge source into a KM system

Thank you!