the question of disguised voice

P. Perrot and G. Chollet
Telecom ParisTech, 46 rue Barrault, 75013 Paris, France
perrot@tsi.enst.fr

Many applications, including banking, multimedia and biometrics, need to verify a speaker's identity. Current speaker recognition performance is sufficient in many fields, but forensic science demands caution because the systems lack robustness. Identification nevertheless remains essential in forensic work. In most criminal cases, offenders try to disguise their voice before making an anonymous or malicious call. It is therefore important to study the possibilities of voice disguise before trying to identify a speaker. This paper presents the application of statistical algorithms to detect and identify four specific disguises, chosen because they are the ones most commonly used by offenders.

1 Introduction

Owing to the wide range of commercial and law enforcement applications, major breakthroughs and initiatives in the past twenty years have propelled biometrics, and voice recognition technology in particular, into the spotlight. Voices have recently been used to verify the identity of persons, in security systems and in criminal identification. Forensic speaker recognition used to be performed by phoneticians, but interest is increasingly turning to automatic statistical techniques. However, there is a large gap between commercial and forensic applications. Cheap and versatile systems now make it possible to identify a speaker easily and quickly, but their level of performance and robustness is not estimated. Disguise is not a real problem in commercial applications because in most cases the user has no intention of spoofing the system. By contrast, voice applications in a forensic context, speaker recognition especially, suffer from disguise, and few systems take this problem into account.
The study proposed in this paper therefore presents the results of three statistical algorithms (k-nearest neighbours, Gaussian Mixture Model and Support Vector Machine) applied to different features. This work focuses on the classification of four disguises: hand over the mouth, low pitch, high pitch and pinched nostrils. After a brief state of the art on voice disguise, the algorithms are described and the classification results are given for two feature sets: MFCC (Mel Frequency Cepstral Coefficients) alone, and MFCC with their derivatives.

2 State of the art

Voice disguise is a deliberate action by a speaker who wants to falsify or conceal his or her identity. Voice alteration caused by channel distortion is not addressed in this work. A speaker has many ways to change his or her voice and to deceive a human ear or an automatic system: electronic scrambling or, more simply, exploiting intra-speaker variability by modifying pitch or the position of articulators such as the lips or tongue, which affects the formant frequencies. Voice disguise therefore covers voice transformation, voice conversion and alteration of the voice by mechanical means. In this study, we limit ourselves to non-electronic voice transformations, that is, modifications by simple means of the kind used in actual offences. Research on voice disguise started in the 1970s with phoneticians such as Künzel and Koester, and only over the past 10 years have researchers tried to develop automatic systems to detect disguise. Voice disguise in forensic science has received little attention in the literature, probably because of the difficulty of distinguishing a normal voice from a disguised voice in criminal applications.
Nevertheless, the growing use of voice in multimedia applications and the current performance of speaker recognition systems have renewed interest in voice disguise. Hollien reported that for several disguised voices (except whisper and foreign accent) the identification performance of a machine was only a little better than chance [8]. In [9], Masthoff reports on the ways speakers disguise their voice, based on an experiment with 20 German speakers. He notes that the preferred forms of disguise appeared to involve changes in phonation and either one or two techniques, and that the disguise depends on the form of the experiment. Künzel proposes in [2] a very complete study on the link between fundamental frequency and voice disguise, and shows that a speaker's undisguised F0 can be linked to his disguised F0. Torstensson et al. [11] provide information on the imitation of a foreign accent as a disguise, which seriously affects a listener's ability to recognize a speaker. An automatic approach to analysing and identifying voice disguise is described in detail in [5].

3 Speaker identification and disguised voice

The first step of our work is to establish the impact of disguised voice on the performance of automatic speaker identification [2][3][4]. Four specific disguises were chosen according to their use in criminal cases: hand over the mouth, pinched nostrils, high pitch and low pitch. The automatic identification approach has two parts. Beforehand, for each speaker, 12 MFCC (Mel Frequency Cepstral Coefficients) and their derivatives are extracted from 20 ms frames (with 10 ms overlap) in each speech segment after a silence removal step. The first part is a training session, whose aim is to build a model for each speaker by modelling that speaker's speech features with a GMM.
GMMs are widely used as statistical models in many pattern recognition applications. The principle is that any probability density function can be approximated with a sufficient number of components. The second part is the test, which evaluates a distance between the query voice and the different models. The distance chosen in our system is a likelihood ratio, and the model that maximizes it determines the identified speaker.
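The front end described in this section (20 ms frames with a 10 ms shift, then first derivatives appended to the static coefficients) can be sketched as follows. This is only an illustration: the function names are hypothetical, the sample rate is left as a parameter because the paper does not state it, and the derivative is approximated by a symmetric difference, which is an assumption rather than the authors' estimator.

```python
def frame_signal(samples, sample_rate, frame_ms=20, shift_ms=10):
    """Split a 1-D sample sequence into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


def append_deltas(static):
    """Append first derivatives to each static feature vector.

    The derivative at frame t is (c[t+1] - c[t-1]) / 2, with the first and
    last frames repeated at the edges (an assumed, simple estimator).
    """
    n = len(static)
    out = []
    for t in range(n):
        prev = static[max(t - 1, 0)]
        nxt = static[min(t + 1, n - 1)]
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(static[t]) + delta)
    return out
```

With 12 static coefficients per frame, `append_deltas` yields 24-dimensional vectors. The cepstral analysis itself (mel filter bank, logarithm, DCT) is not shown here.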

In order to measure the influence of disguise on our automatic system, different speech segments in disguised voices were used. The impact on performance is shown in Figure 1. We observe a very significant degradation in performance, consistent with Hollien's conclusion ("a little better than chance").

Fig. 1 Disguised voice degradation

This DET curve shows the need to decide whether a voice is disguised before attempting automatic speaker identification.

4 Experiment and Results

Three different experiments, based on three different sets of features and three different classification methods, were carried out in order to detect disguised voices.

4.1 Corpus description

Two kinds of speech text were chosen. The first one, A, is used for training. It comprises 30 speaker audio files, and three corpora are built from the phonetically balanced text "The North Wind and the Sun" (in French):
A1: 5 min of different speakers in four different disguises; A1 is a general model for disguise
A2: 5 min of normal voices from different speakers
A3: 5 min for each kind of disguise from different speakers

The second one, B, comprises 25 speakers and is used for testing. This corpus is based on 10 phonetically balanced French sentences and lasts between 15 and 20 seconds for each disguise (normal voice included). The recordings are direct, and the test dataset did not participate in the training process.

4.2 Feature extraction

Different sets of features were extracted from speech in order to evaluate their relevance: 12 MFCC, and 12 MFCC + 12 derivatives. A first approach is dedicated to 12 MFCC, the most common features in both speaker and speech recognition. These coefficient vectors are computed on a 20 ms window with a 10 ms shift. The features are derived from the outputs of a filter bank placed on a mel frequency scale. The filters are typically triangular and operate in the frequency domain. A second approach includes the derivatives of the MFCC in order to take into account the dynamics of speech, giving a 24-dimensional vector.

4.3 Applications on disguised voice detection

In order to evaluate the best way to detect disguised voices automatically, three classification methods were applied to the features described above:
- k-nearest neighbours
- GMM (Gaussian Mixture Model)
- VQ (vector quantization) and SVM (Support Vector Machine)

The k-nearest neighbours (k-NN) classification rule is a technique for non-parametric supervised pattern classification. Given N prototype patterns (vectors of dimension D) with their correct classification into several classes, it assigns an unclassified pattern to the class that is most heavily represented among its k nearest neighbours in the pattern space. The first comparative analysis focuses on 12 MFCC. After tests with different values of k, k = 20 was chosen.

Voice type    Normal    Disguised
Normal        62%       38%
Disguised     22%       78%

Table 1 K-nearest-neighbours disguised voice detection (rows: actual voice type; columns: classifier decision)

This method detects disguise efficiently, but the risk of confusing a normal voice with a disguised voice is too high. In addition, a significant drawback of this algorithm is its high computation time. Another well-known method in speech and speaker recognition is the GMM [6]. The principle is to build one GMM for disguised voices and another for normal voices. A GMM is a superposition of K Gaussian densities. Each density k is weighted by a mixture coefficient c_k.
$$p(x \mid m) = \sum_{k=1}^{K} c_k \, \mathcal{N}(x; \mu_k, \Sigma_k)$$

For each model m = 1, ..., M, the mixture coefficients obey the probabilistic constraint:

$$\sum_{k=1}^{K} c_k = 1$$

During the recognition phase, the scores log p(x | m) are accumulated over the sequence X = {x_1, x_2, ..., x_P}:

$$S(X \mid m) = \sum_{j=1}^{P} \log p(x_j \mid m)$$

and the model is chosen according to the highest likelihood ratio score:

$$\hat{m} = \arg\max_{m} S(X \mid m)$$

The results obtained are given in Table 2:

Type of voice    Normal    Disguised
Normal           15%       85%
Disguised        6%        94%

Table 2 GMM (1024) disguised voice detection

With GMM classification, the recognition rate for disguised voices is very high, but the risk of confusing a normal voice with a disguised voice is also very high. The last classification method is based on vector quantization followed by SVM (Support Vector Machine) discrimination. The aim is to build a fast and efficient SVM classifier. Vector Quantization (VQ) is used to simplify the training set: the vectors of each class are replaced by a small number of representatives. SVM is a binary classification method based on supervised training. Specific kernels are used to optimize the data discrimination. The idea is to find a classifier able to discriminate the data and to optimize the interclass distance. The results shown in Figure 2 are based on 128 and 512 centroids.

Fig. 2 VQ+SVM classification on MFCC + derivatives (percentage of normal and disguised voices, for 128 and 512 centroids)

These results are very positive, even though the test dataset will need to be enlarged to obtain more significant figures. Interestingly, these results are confirmed when 128 centroids are used instead of 512, because 96% of normal voices are recognized as normal and 97% of disguised voices are recognized as disguised. This feature set, composed of 12 MFCC and their first derivatives, together with this classification technique therefore appears well suited to disguised voice detection.
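The GMM decision rule of this section (accumulate log p(x_j | m) over the frames, then take the model with the highest score) can be illustrated with a small sketch. It assumes diagonal covariance matrices for simplicity, whereas the text writes a general Σ_k, and it takes already-trained models as input. The function names and the `(c_k, mean_k, var_k)` model layout are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def log_gmm(x, model):
    """log p(x | m) for a model given as a list of (c_k, mean_k, var_k)."""
    terms = [math.log(c) + log_gaussian_diag(x, mean, var)
             for c, mean, var in model]
    top = max(terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def score(frames, model):
    """S(X | m): log-likelihood accumulated over all frames of X."""
    return sum(log_gmm(x, model) for x in frames)


def classify(frames, models):
    """Return the name of the model with the highest accumulated score."""
    return max(models, key=lambda name: score(frames, models[name]))
```

For detection, `models` would hold two entries, one trained on normal voices and one on disguised voices. Training (for example by expectation-maximization) is outside this sketch.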
Figure 3 summarizes the classification results:

Fig. 3 Detection of disguised voices (bars for normal and disguised voices; classifiers: VQ(512)+SVM and VQ(128)+SVM with deltas, VQ(512)+SVM and VQ(128)+SVM on 12 MFCC, GMM on 12 MFCC, k-NN on 12 MFCC)

4.4 Applications on disguised voice identification

The aim of identification is to say which of the four studied disguises is used. Two supervised classifiers were analysed: GMM and VQ+SVM.

4.4.1 Identification based on a GMM classifier

The idea of this method is to build a specific GMM for each kind of disguise. Each test speech segment is compared to the different models and the decision is taken according to the maximum likelihood ratio, following the same principle as in 4.3. The advantage of this method is that it measures the level of identification (rank 1, rank 2 and so on) according to the number of disguises. In our experiment, four disguises were analysed on 12 MFCC. Figure 4 shows the level of identification for each kind of disguise on 25 tests.

Fig. 4 Cumulative matching score in % (CMS recognition as a function of disguise type; x-axis: rank; y-axis: recognition percentage; curves: hand, nose, low pitch, high pitch)

4.4.2 Identification based on a SVM classifier

The principle of this classification is to discriminate one disguise against all the others. For instance, to measure the identification level of high-pitch voice, a similarity distance is calculated between the features of a high-pitch test voice and a model of high-pitch voice versus a model of all the other disguised voices (high-pitch voice excluded).
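A one-against-all detector of this kind produces one score per test segment, and a decision threshold on that score trades false acceptances against false rejections. The sketch below shows how the two error rates could be computed over a range of thresholds, which gives the points a DET curve plots. The function names and the convention that a higher score means "target disguise" are assumptions for illustration only.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (false_acceptance_rate, false_rejection_rate) at a threshold.

    A segment is accepted as the target disguise when score >= threshold.
    """
    false_accepts = sum(1 for s in nontarget_scores if s >= threshold)
    false_rejects = sum(1 for s in target_scores if s < threshold)
    return (false_accepts / len(nontarget_scores),
            false_rejects / len(target_scores))


def det_points(target_scores, nontarget_scores):
    """Error-rate pairs for every observed score used as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    return [(t,) + error_rates(target_scores, nontarget_scores, t)
            for t in thresholds]
```

A DET plot usually draws these pairs on normal-deviate axes; that scaling is a plotting step and is not included here.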

The classifier is based on VQ and SVM [7]. The identification process was carried out with 512 centroids for the vector quantization step. Figure 5 shows a DET curve [10] for 25 clients and 100 impostors. A DET curve is a good way to measure the performance of a system in terms of false acceptance rate and false rejection rate.

Fig. 5 SVM classifier: DET curve in normal conditions

The same evaluation process was carried out after adding babble noise in order to be closer to forensic conditions. The following figure presents the results:

Fig. 6 SVM classifier: DET curve in babble conditions

Identification of some specific disguises (high-pitch and low-pitch voice, and hand over the mouth) is strongly degraded by the added babble noise.

5 Conclusion

In forensic science cases, experts are increasingly involved in speaker recognition questions. One of the main problems is to detect and identify a disguise in order to avoid false automatic speaker identification. Offenders try to falsify their identity by using disguise techniques to avoid being recognized. Disguise has an important impact on the performance of an automatic recognition system. We present a series of experiments and results based on different features and different classification algorithms. The idea is to find a way to detect disguise and, if possible, to identify what kind of disguise has been used. The experimental results show that MFCC with their derivatives and VQ+SVM classification provide interesting results for detection, that is, for deciding whether a normal voice is normal and whether a disguised voice is disguised. For identification, the results are unbalanced. The SVM classifier has a real problem with the hand-over-the-mouth disguise, unlike the GMM classifier, which reaches a correct level of identification. For the other disguises, the VQ+SVM classifier gives some good results. Future work will measure the robustness of this kind of detection and identification in other noise environments, and the influence of the number of centroids.

References

[1] L.J. Boë, "Ben Laden et le mythe de l'empreinte vocale", Revue du vivant, no. 1.
[2] H. Künzel, J. Gonzalez-Rodriguez, J. Ortega-Garcia, "Effect of voice disguise on the performance of a forensic automatic speaker recognition system", Proceedings Speaker Odyssey 2004.
[3] M. Blomberg, D. Elenius, E. Zetterholm, "Speaker verification scores and acoustic analysis of a professional impersonator", Proceedings FONETIK 2004.
[4] S. Kajarekar, H. Bratt, E. Shriberg, R. de Leon, "A Study of Intentional Voice Modifications for Evading Automatic Speaker Recognition", Proc. IEEE Odyssey 2006 Speaker and Language Recognition Workshop.
[5] P. Perrot, G. Aversano, G. Chollet, "Voice disguise and automatic detection: review and perspective", in Progress in Nonlinear Speech Processing, Y. Stylianou et al. (eds.), LNCS 4391, Springer, 2007.
[6] D. Reynolds, "Speaker Identification and Verification using Gaussian Mixture Models", Speech Communication, vol. 45, pp. 139-152, 2005.
[7] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[8] H. Hollien, Forensic Voice Identification, Academic Press, 2001.
[9] H. Masthoff, "A report on voice disguise experiment", Forensic Linguistics, pp. 160-167, 1996.
[10] A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki, "The DET curve in assessment of detection task performance", Proc. Eurospeech 97, Greece.
[11] N. Torstensson, E.J. Eriksson, K.P.H. Sullivan, "Mimicked accents - Do speakers have similar cognitive prototypes?", Proc. of Australian International Conference on Speech Science and Technology, 2004.