Automatic Recognition of Speaker Age in an Inter-cultural Context

Automatic Recognition of Speaker Age in an Inter-cultural Context
Michael Feld, DFKI, in cooperation with Meraka Institute, Pretoria
FEAST

Speaker Classification Purposes
- Bootstrapping a user model based only on speech
- Retrieval and annotation of paralinguistic information from speech segments
  (example labels: age: senior; gender: male; language: French; cognitive load: high; child, adult; high arousal, low arousal; angry, sad, happy)
- Adaptation / personalization
- Semantic interpretation

Application Scenarios
- User adaptation on mobile devices
- User adaptation at public terminals
- Phone-based services

In-Car Services Scenario
"Where can I go shopping nearby?"
SB-DFKI 2009

Pattern Classification System
Sensing -> Segmentation -> Feature Extraction -> Classification -> Post-processing
Source: Duda, Hart and Stork (2000)

GMM-SVM Supervector Approach (1)
Audio data -> MFCC extraction (HTK) -> full feature table -> frame filter (silence removal, speaker/length balancing, dataset selection)

GMM-SVM Supervector Approach (2)
UBM training -> UBM
UBM + utterances (Train, DevTest, Eval) -> utterance GMM training by MAP adaptation -> utterance GMMs

GMM-SVM Supervector Approach (3)
Utterance GMM export: means as data (coefficients * mixtures), followed by normalization
Train: SVM training -> target SVM
DevTest: classification with SVM -> threshold tuning
Eval: evaluation
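The step from UBM to utterance supervector can be sketched as follows. This is a minimal illustration, not the system from the talk: it assumes a diagonal-covariance UBM, mean-only MAP adaptation, and a relevance factor of 16, none of which are stated on the slides. All function names and the toy UBM are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(x, weights, means, variances):
    """Responsibility of each UBM mixture for one feature frame."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to the frames of one utterance.

    relevance=16.0 is an assumed value, not taken from the slides.
    """
    k, d = len(means), len(means[0])
    counts = [0.0] * k                      # soft frame counts per mixture
    sums = [[0.0] * d for _ in range(k)]    # posterior-weighted frame sums
    for x in frames:
        for c, g in enumerate(posteriors(x, weights, means, variances)):
            counts[c] += g
            for j in range(d):
                sums[c][j] += g * x[j]
    # Interpolate between the data mean and the UBM mean per mixture.
    return [
        [(sums[c][j] + relevance * means[c][j]) / (counts[c] + relevance)
         for j in range(d)]
        for c in range(k)
    ]

def supervector(adapted_means):
    """Stack all adapted means into one vector (coefficients * mixtures)."""
    return [value for mean in adapted_means for value in mean]

# Toy UBM with 2 mixtures over 2 coefficients (made-up numbers).
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [10.0, 10.0]]
ubm_vars = [[1.0, 1.0], [1.0, 1.0]]
utterance = [[1.0, 1.0], [1.0, 1.0]]

sv = supervector(map_adapt_means(utterance, ubm_weights, ubm_means, ubm_vars))
print(len(sv), sv)
```

Each utterance yields one fixed-length vector regardless of its duration, which is what makes it usable as SVM input. In the toy run only the first mixture moves towards the data, because the frames lie far from the second mixture.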

Classifier Tuning
To find the point of optimal classifier performance, threshold tuning can be applied to trained classifiers.
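Threshold tuning can be illustrated with a small sweep over held-out scores. The sketch below assumes a binary target-vs-rest SVM whose decision scores are available for a development set, and it assumes accuracy as the criterion; the slides do not say which criterion was optimized. The function name and the scores are hypothetical.

```python
def tune_threshold(scores, labels):
    """Return (threshold, accuracy) that maximizes accuracy on held-out data.

    Decision rule: predict the target class when score >= threshold.
    """
    candidates = sorted(set(scores))
    # One extra candidate above every score covers "never predict target".
    candidates.append(candidates[-1] + 1.0)
    best_threshold, best_accuracy = None, -1.0
    for t in candidates:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

# Made-up SVM decision scores and target labels for a development set.
dev_scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
dev_labels = [0, 0, 0, 1, 1, 1]

threshold, accuracy = tune_threshold(dev_scores, dev_labels)
print(threshold, accuracy)
```

The tuned threshold should then be frozen and applied unchanged to the evaluation set, so that the reported result is not optimized on the data it is measured on.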

Parameters
- MFCC extraction step width
- MFCC extraction window size
- MFCC coefficients
- MFCC delta coefficients
- Intensity-based frame filter
- Number of Gaussians
- MAP relevance factor
- GMM initialization
- Number of GMM training steps (EM algorithm)
- Nuisance variability compensation
- SVM input feature normalization method
- SVM kernel function
- SVM margin trade-off

Classification Task Description
- German corpus from Deutsche Telekom
- Telephone speech (8000 Hz), high quality
- ~700 speakers, 1-6 sessions, 18 turns
- Short utterances (numbers, names, commands, ...)
- 70% training/test, 30% eval
- Best-path experiment (thus eval = test2)

Class  Name           Age
1      Children       0-14
2      Young female   15-24
3      Young male     15-24
4      Adult female   25-54
5      Adult male     25-54
6      Senior female  55+
7      Senior male    55+
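The seven-class scheme can be written down as a small lookup. This is an illustrative sketch: the class numbers and age boundaries come from the slide, while the function name and the 'f'/'m' gender encoding are assumptions.

```python
def age_gender_class(age, gender):
    """Map an age in years and a gender ('f' or 'm') to class index 1-7."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 14:
        return 1  # children are not split by gender
    if gender not in ("f", "m"):
        raise ValueError("gender must be 'f' or 'm'")
    offset = 0 if gender == "f" else 1
    if age <= 24:
        return 2 + offset  # young female / young male
    if age <= 54:
        return 4 + offset  # adult female / adult male
    return 6 + offset      # senior female / senior male

print(age_gender_class(20, "f"), age_gender_class(70, "m"))
```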

Evaluation Results on DTAG
Confusion Matrix (Identification Task); rows: tested class, columns: classified as

Tested         1 (C)    2 (YF)   3 (YM)   4 (AF)   5 (AM)   6 (SF)   7 (SM)   Row total
1 (C)          48.07%   18.21%    9.66%    9.13%    1.70%   11.07%    2.17%   13.41%
2 (YF)         11.20%   43.53%    1.27%   30.70%    0.92%   11.86%    0.51%   15.42%
3 (YM)          1.36%    1.36%   60.26%    1.62%   16.19%    2.09%   17.13%   15.04%
4 (AF)          4.73%   18.54%    8.42%   31.99%    5.13%   28.25%    2.94%   15.76%
5 (AM)          1.04%    0.88%   34.70%    2.24%   31.69%    2.35%   27.09%   14.35%
6 (SF)          8.28%   11.55%    3.79%   30.16%    1.28%   42.88%    2.04%   13.46%
7 (SM)          1.00%    1.19%   26.19%    2.31%   20.00%    2.56%   46.75%   12.56%
Column total   1339     1797     2631     2027     1381     1848     1712     12735
Column share   10.51%   14.11%   20.66%   15.92%   10.84%   14.51%   13.44%   100%

Correct: 5534   Incorrect: 7201   Accuracy: 43.46%
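The reported accuracy can be cross-checked from the figures on the slide. The sketch below recomputes it in two ways: directly from the correct and total counts, and as the sum of the diagonal entries weighted by each tested class's share of the data. Reading the last percentage in each row as that class's share of all tested utterances is an interpretation of the flattened table, not something the slide states.

```python
# Diagonal of the confusion matrix: per-class rate of correct decisions (%).
recall_pct = [48.07, 43.53, 60.26, 31.99, 31.69, 42.88, 46.75]
# Last figure of each row, read here as the class share of test data (%).
share_pct = [13.41, 15.42, 15.04, 15.76, 14.35, 13.46, 12.56]

total, correct, incorrect = 12735, 5534, 7201

# Accuracy from the raw counts.
accuracy_from_counts = correct / total

# Accuracy as the share-weighted sum of the per-class rates.
accuracy_from_rows = sum(r * s for r, s in zip(recall_pct, share_pct)) / 10000.0

print(round(100 * accuracy_from_counts, 2), round(100 * accuracy_from_rows, 2))
```

The two values agree up to the rounding of the percentages printed on the slide, which supports the reading of the last column as row shares.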

Impact of Language / Culture
- Is the approach in general independent of language/culture?
- Can we apply models trained on only one language to another language?
- Can we apply generic models to a particular language?
- Does using language-specific models improve the classification?
- Which features are affected?

The Lwazi Corpus
- Created as part of a South African speech technology project
- Balance of genders and landline/mobile
- Telephone recordings of varying quality, some very low and with background noise
- Estimation difficult even for humans
- Very different cultural backgrounds and language differences (clicks)

Lwazi Corpus Sighting
Age distribution

Class  Name           Age     %
1      Children       0-14    0.3
2      Young female   15-24   7.9
3      Young male     15-24   7.6
4      Adult female   25-54   21.2
5      Adult male     25-54   19.8
6      Senior female  55+     1.6
7      Senior male    55+     2.3

- Heavy focus on ages 20-45
- No* children
- Few senior speakers
Consequences
- Use only classes 2-7 or 2-5
- Choose different boundaries
- Use regression approach

Long-term Features
Features extracted by Praat scripts, averaged over an utterance
- Fundamental frequency F0: pitch_min, pitch_max, pitch_quant, pitch_mean, pitch_stdev, pitch_mas, pitch_swoj
- Jitter (F0 micro-variations): jitt_l, jitt_la, jitt_ppq, jitt_rap, jitt_ddp
- Intensity: intens_mean, intens_min, intens_max, intens_stdev
- Shimmer (amplitude micro-variations): shim_l, shim_ldb, shim_apq3, shim_apq5, shim_apq11, shim_dda
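To make the micro-variation features concrete, here is a minimal sketch of the usual "local" jitter and shimmer measures: the mean absolute difference between consecutive glottal periods (or peak amplitudes), divided by the mean period (or amplitude). That jitt_l and shim_l correspond to these local variants is an assumption based on the names; the actual Praat scripts are not shown in the slides, and the input values below are made up.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value

def jitter_local(periods):
    """Local jitter from consecutive glottal period lengths (seconds)."""
    return local_perturbation(periods)

def shimmer_local(amplitudes):
    """Local shimmer from consecutive peak amplitudes."""
    return local_perturbation(amplitudes)

# Made-up period lengths (about 100 Hz) and peak amplitudes.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
amplitudes = [0.80, 0.78, 0.82, 0.79]

print(jitter_local(periods), shimmer_local(amplitudes))
```

Both measures are dimensionless ratios, so they can be averaged over an utterance and compared across speakers with different pitch or loudness.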

Corpus Analysis (1) SA English

Corpus Analysis (2) Young Female

Corpus Analysis (3) Adult Male

Corpus Analysis (4) Adult Female

Evaluation Results on Lwazi
- Accuracy with GMM-SVM supervector models is considerably lower; more tests needed
- Linear regressor based on long-term features: mean absolute errors between 7.7 and 12.8 years
- Language-dependent behavior
[Figure: prediction error by training language and test language]
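The regression evaluation can be illustrated with a deliberately simplified sketch: a least-squares line on a single long-term feature, trained on one language and scored by mean absolute error on another. The regressor in the talk used several long-term features; the single-feature fit, the function names, and all numbers below are assumptions for illustration only and are not corpus data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(true_values, predictions):
    """Average absolute deviation, here in years of age."""
    pairs = list(zip(true_values, predictions))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Synthetic (pitch_mean in Hz, age in years) pairs for two languages.
train_pitch = [220.0, 200.0, 180.0, 160.0, 140.0]
train_age = [20.0, 30.0, 35.0, 50.0, 60.0]
test_pitch = [210.0, 170.0, 150.0]
test_age = [28.0, 40.0, 47.0]

slope, intercept = fit_line(train_pitch, train_age)
predicted = [slope * x + intercept for x in test_pitch]
mae_cross = mean_absolute_error(test_age, predicted)
print(slope, intercept, mae_cross)
```

Repeating this for every pair of training and test language gives a matrix of errors, which is one way to read the language-dependent behavior mentioned on the slide.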

Next Steps
- Further work on multi-linguality
  - Pre-processing of Lwazi data
  - Training of models on the Lwazi corpus
- Further improvement of the classification
  - Extend parameter space
  - True regression approach
- Application side
  - Application of the GMM-SVM supervector system to in-car acoustic event detection and further speaker properties
  - Integration of automatic age/gender/... recognition as one knowledge source into a KM system

Thank you!