Call Centre Speaker Identification using Telephone and GSM Data

Lerato Lerato and Daniel Mashao
Dept. of Electrical Engineering, University of Cape Town
Rondebosch 7800, Cape Town, South Africa
llerato@crg.ee.uct.ac.za, daniel@eng.uct.ac.za

Abstract- Telecommunications network services and call centres are examples of avenues for speech technology. In these services speech technology largely exists in the form of speech recognition (voice to text). This paper explores the use of speaker identification (who has spoken?) in call centres. Call centres acquire information telephonically (cellular or landline). This work therefore reports results of speaker identification (SID) on simulated GSM, real GSM (Vodacom) and telephone speech data. Channel noise reduces the SID rate, and the enrolment of a large number of speakers also degrades the performance of the SID system. Finally, this paper proposes a hierarchical method of improving SID performance on a large population of enrolled speakers.

I. INTRODUCTION

A speaker identification (SID) system recognises a person from either known words (text-dependent) or any utterance (text-independent) within a closed group of speakers (e.g. bank customers). SID forms part of the speech technology for telecommunications [1] and is a good candidate for secure access to information through network services and call centres [2]. SID systems, as we have reported [3], perform less efficiently on speech that has passed through a communication channel. Speakers enrol into the SID system, forming individual templates or models, and these templates are compared to the test talker's speech features during the identification process.

There are many speech databases available for testing SID systems. The data used in this study originate from the TIMIT (clean speech) database [4]. The NTIMIT database [4] is telephone-transmitted TIMIT speech, created by Texas Instruments (T.I.) and the Massachusetts Institute of Technology (M.I.T.). In this work, clean speech was passed through the GSM 06.10 codec [5], forming the GSM database (figure 4). GSM 06.10 is the ETSI standard [6] for GSM full rate (FR). A second data set resulted from transmission of clean speech through the local GSM (Vodacom) [7] network. Section V gives a brief account of how these databases were formed.

Several studies have focused on telecommunications applications of speaker identification at different levels. Westall F. A. et al [1] have put the whole picture of speech technology into a telecommunications perspective. Murthy H. A. [8] compensated for communication channel noise and managed to improve the performance of SID on telephone speech. Kuitert and Boves [9] showed that the limited telephone band of 300-3400 Hz does not significantly affect the performance of speaker recognition, and Mashao D. J. [10] reported the same result using our SID system. Grassi S. et al [11] concluded that GSM coding degrades SID performance.

Although channel noise seems to be the main obstacle to better performance of SID systems, it was also noticed that an increase in speaker population lowers the identification rate [3]. This degradation of performance results from confusion between speakers with similar voices. The hierarchical configuration of speaker identification described in section IV attempts to solve the problem of misclassification resulting from this large enrolment of speakers. Section VI reports results on both simulated and real (Vodacom) GSM full rate data in order to compare the SID system's behaviour in a practical environment with the one simulated in the laboratory. The same section also reports observations when telephone speech is used with a large number of enrolled speakers. Section VII concludes the paper and gives future directions.

II. SID AND ITS CHALLENGES

This section is a brief description of speaker identification and some of the problems that affect it.
Speaker identification (SID) is in the same category as speaker verification (SV) in the field of speaker recognition [4]. SID recognises the speaker from his or her voice without prior knowledge of the talker; SV verifies a speaker's claimed identity through voice. Speaker-specific speech features are important in the SID process: the physiological structure of a speaker's vocal tract is unique, so it is possible to extract speaker-specific features from his or her voice.

The feature extraction process of an SID system is known as the front end, while the classifier is called the back end of the system. There are many front ends that can be used. Linear predictive coding (LPC), used in the GSM full rate codec [5], is an example of a feature extraction algorithm, and perceptual linear prediction (PLP) [12] is an example of auditory-based feature extraction. Our system uses parameterised feature sets (PFS) [13] for speech feature extraction. Classification algorithms are the decision-making tools: they produce probabilistic scores of how well the talker matches the voice models (templates) stored in the database. Hidden Markov models (HMM) and Gaussian mixture models are mostly used as back ends (classifiers) in speech and speaker recognition respectively.

The challenge facing robust SID system implementation [14] is correcting the errors that affect its performance. These errors include inaccurate modelling of the vocal tract, extreme emotional states of the speaker, time-varying speech acquisition equipment, head colds and ageing of the speaker. Overcoming these problems and limiting channel noise could enhance the implementation of SID systems in call centres.
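The closed-set identification described above, scoring a test utterance against every enrolled voice template and picking the maximum, can be sketched as follows. This is a minimal illustration with a single diagonal Gaussian per speaker, standing in for the paper's PFS front end and GMM back end; all speaker names and data here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical enrolment: one diagonal-Gaussian "voice template" per speaker,
# fit on that speaker's training feature vectors (stand-ins for PFS/PLP features).
def enrol(features):
    return features.mean(axis=0), features.var(axis=0) + 1e-6

def log_likelihood(features, template):
    mu, var = template
    # Sum of per-frame diagonal-Gaussian log densities.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (features - mu) ** 2 / var)

# Two enrolled speakers with different feature statistics.
speakers = {
    "spk_a": enrol(rng.normal(0.0, 1.0, size=(200, 12))),
    "spk_b": enrol(rng.normal(3.0, 1.0, size=(200, 12))),
}

# Closed-set identification: score the test utterance against every template
# and pick the maximum, as the back end does with probabilistic scores.
test = rng.normal(3.0, 1.0, size=(60, 12))  # utterance from speaker B
identified = max(speakers, key=lambda s: log_likelihood(test, speakers[s]))
```

A real system would replace the synthetic arrays with front-end features extracted from speech and the single Gaussian with a full mixture model.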

III. SID IN CALL CENTRES

A call centre is a telephone call management unit which many companies use for handling large numbers of calls. The key functions of a typical call centre include automatic queue management and the coordination of calls and customer data [2]. Automatic queue management directs incoming calls to suitable agents [2], who can also receive the caller's information on a computer, for new records or for consulting other agents. This function is also performed by an automatic call distributor (ACD). Interactive voice response (IVR) is a form of interaction between a customer and a telephone answering system: the customer presses keys on the handset while the system prompts him or her with voice. IVR is also common in network-subscriber interaction, such as in the PSTN, where the customer uses the telephone dial or voice [2] to access information. Most automated voice network services are hierarchical and take a long time before the required information is reached; an example is pressing key 1 for help, followed by key 5 for account information, and so on.

Several companies have deployed speech recognition in call centres and telephone services. Nortel [15] announced a speech-enabled call centre facility that will replace telephone dialling for enquiries. Loquendo's voice technologies [16] have proved successful in the implementation of speech recognition. Intelleca Voice & Mobile [17] has built a speech recognition system for ticket bookings for Ster Kinekor's national call centre in South Africa. These are a few of the many companies deploying this technology in call centres. Speaker identification (SID) has not seen much adoption in call centre operations because of less robust SID systems, yet SID could be used alongside speech recognition where security of information is necessary. Nuance [18] has implemented a similar system.
SpeechWorks [19] also advocates the importance of speaker recognition (verification) in telephone services, and according to Rachel Lebihan of ZDNet Australia [20], experts confirm the relevance of SID in call centres. Our proposed method in the following section contributes towards the optimisation of SID performance in call centres.

IV. THE PROPOSED HIERARCHICAL SID SYSTEM

The hierarchically configured speaker identification system tries to separate the voice templates of speakers who sound alike. Figure 1 illustrates the proposed system. Our initial approach [21] to this proposal grouped speakers according to gender, but no improvement was observed because the normal SID [3] already had gender separation capabilities. This means that the group detector should use speech features different from those of the front end (feature extraction).

[Figure 1: Hierarchical SID system — the signal passes through a group detector and feature extraction; training stores templates λ_1 … λ_Ns per group; testing performs matching, and the maximum score gives the identification]

During training, the group detector (figure 1) classifies a speaker and gives him or her a group tag so that the voice template, λ, can be stored in the correct database. Testing is done by first determining the group that the test speaker belongs to; the speech features are then matched against the correct database (the λ_i's). The feature extraction (front end) and classification (matching and scoring) are the same as in our previous paper [3]. Several studies [22, 23] have used the hierarchical approach to improve processing time; we seek to use it to improve the identification rate. Two groups, resulting in two databases, are used in this study because the proposal is at an initial stage. The perceptual linear prediction (PLP) [12] algorithm is used in our group detector.

A. Perceptual linear prediction

We implemented the PLP algorithm according to Hermansky [12], as shown in figure 2. The speech signal power spectrum is warped onto the Bark scale [24].
The result is convolved with the critical band curve, which approximates the frequency resolution of the human ear. During equal-loudness preemphasis [12, 24], a filter approximating the non-equal sensitivity of human hearing is multiplied with the convolved signal, and the result is passed through the IDFT (inverse discrete Fourier transform). A portion of the resulting spectrum, from sample 20 to 100, is used as the grouping segment: peaks from sample 20 to 55 are added and the sum is stored for group A, while group B is the sum of peaks from sample 55 to 90. The choice of these values was experimental and also based on knowledge of the spectrum symmetry. Grouping is done such that if sum A is greater than sum B then the speaker belongs to group A, and vice versa. The normal speaker identification process then continues.
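The band-sum grouping rule and the routing it implies can be sketched as below. This is a hedged illustration, not the paper's implementation: the sample indices follow the text, but the spectra, templates, and the `score` matcher are hypothetical stand-ins:

```python
import numpy as np

def group_of(plp_segment):
    # The paper's rule: compare summed peaks in samples 20-55 against 55-90.
    return "A" if plp_segment[20:55].sum() > plp_segment[55:90].sum() else "B"

def score(features, template):
    # Hypothetical matcher: negative distance to a stored mean vector.
    return -np.linalg.norm(features - template)

# Two per-group template databases (all values are made up).
databases = {
    "A": {"spk1": np.full(12, 0.0), "spk2": np.full(12, 1.0)},
    "B": {"spk3": np.full(12, 5.0), "spk4": np.full(12, 6.0)},
}

def identify(plp_segment, features):
    # Route to the detected group's database, then match only within it.
    group = group_of(plp_segment)
    db = databases[group]
    return group, max(db, key=lambda s: score(features, db[s]))

# Toy test speaker: energy concentrated low in the segment -> group A.
segment = np.zeros(101)
segment[25:45] = 1.0
group, speaker = identify(segment, np.full(12, 0.9))
```

Routing first means the matcher compares against a smaller database, which is the mechanism the paper relies on to reduce confusion between similar voices.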

[Figure 2: Hermansky [12] PLP analysis block diagram]

V. TELEPHONE AND GSM SPEECH EXPERIMENTS

The speech signal processed by the system illustrated in figure 1 exists in different forms. NTIMIT telephone speech was chosen first to test the capability of our new architecture because this database is commonly used in the literature. The GSM speech database used was created in the laboratory [3, 25]. TIMIT was also transmitted through the Vodacom network, as shown in figure 3, to obtain real GSM speech data.

[Figure 3: Generating real GSM coded speech — TIMIT speech played from a sound card through the local GSM network (Vodacom) and stored on a PC]

Given the laboratory constraints, only a small set of speakers (18 speakers) was tested, as indicated by the results in table 1. Figure 4 illustrates how GSM 06.10 [5, 6] was used to form the simulated GSM data.

[Figure 4: Creation of the GSM full rate database — TIMIT database (16 kHz) → down-sample to 8 kHz → GSM 06.10 codec → up-sample to 16 kHz → GSM database]

GSM full rate compresses the signal to 13 kbps for transmission, and the input speech must be sampled at 8 kHz [25]. TIMIT speech is sampled at 16 kHz, so down-sampling was required before the identification process. Finally, figure 5 shows how the identification process executes.

[Figure 5: Speaker identification process of GSM speech — TIMIT → GSM full rate → SID → identification]

VI. RESULTS AND DISCUSSIONS

There are two types of results in this section. The results in table 1 show speaker identification of 18 talkers from GSM-transmitted voices, while table 2 and figure 6 reflect the identification rate for 630 speakers. These results indicate how the SID system performs on telephone-transmitted voice.

A. GSM speech results

GSM speaker identification was done on a small population. The results in table 1 highlight the impact of channel effects on speaker identification when a real GSM network processes the speech: channel noise [8] and even the mobile device circuitry contribute to the degradation of the identification rate.
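The rate conversions around the codec in figure 4 (16 kHz down to 8 kHz before GSM 06.10, then back up to 16 kHz) can be sketched with a crude FFT-domain resampler. This is an illustrative assumption, not the tool used in the paper, and it omits the codec step itself:

```python
import numpy as np

def fft_resample(x, new_len):
    """Crude FFT-domain resampler: truncate (down) or zero-pad (up) the spectrum."""
    spec = np.fft.rfft(x)
    new_bins = new_len // 2 + 1
    if new_bins <= len(spec):
        new_spec = spec[:new_bins]  # drop bins above the new Nyquist frequency
    else:
        new_spec = np.pad(spec, (0, new_bins - len(spec)))  # no new high content
    # Rescale so amplitudes survive the change of transform length.
    return np.fft.irfft(new_spec, n=new_len) * (new_len / len(x))

fs_in, fs_out = 16000, 8000
t = np.arange(fs_in) / fs_in                      # 1 s of signal
x = np.sin(2 * np.pi * 440 * t)                   # stand-in for a TIMIT utterance
down = fft_resample(x, len(x) * fs_out // fs_in)  # 16 kHz -> 8 kHz for the codec
back = fft_resample(down, len(x))                 # 8 kHz -> 16 kHz after the codec
```

A production pipeline would use a polyphase resampler with a proper anti-aliasing filter; the FFT version above is only meant to make the 16 kHz ↔ 8 kHz bookkeeping concrete.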
Each speaker spoke for 15 seconds during training and 6 seconds during testing. The Gaussian mixture model classifier [3] used in our experiment performs better if talkers speak for longer during training, which means the real GSM network may perform better if we train speakers for a longer time; this is due for further investigation.

GSM data                Total number of speakers   No. of correctly identified speakers
Simulated (GSM 06.10)   18                         17
Vodacom                 18                         15
Table 1: SID on GSM speech

B. Telephone speech results

During the investigation of our proposed hierarchical SID performance, a population of 630 speakers from the whole telephone speech database (NTIMIT) was used. Normal SID in table 2 refers to a non-hierarchical SID [21], while ideal dialect hierarchical SID indicates dialect-based group detection. Hierarchical SID in table 2 used the PLP group detector, which generated only two groups of speakers. The speakers in the NTIMIT database are already grouped according to their dialect; taking advantage of this arrangement, the group detector (figure 1) produced 8 database groups. This grouping is ideal because it was prearranged rather than produced by the group detector algorithm. The results in table 2 are competitive with those reported in the literature [4, 11] using the same data.

Architecture of the SID          SID rate on 630 speakers (NTIMIT)
Normal SID                       69.5 %
Hierarchical SID                 75.7 %
Ideal dialect hierarchical SID   86.5 %
Table 2: SID rate on a large population using NTIMIT

Speakers used sentences 8 and 9 for testing and their other 8 sentences for training. Table 3 shows the population of speakers in each dialect region (dr) of the telephone speech database (NTIMIT). The same experiment was carried out for each dialect region in order to observe the consistency of the proposed architecture on telephone speech. Figure 6 graphs the identification rates for the different dialect groups of speakers. The results in figure 6, however, do not reflect text independence: speakers uttered the same sentence for the group detection process.

Region       dr1   dr2   dr3   dr4   dr5   dr6   dr7   dr8
Population   49    102   102   100   98    46    100   33
Table 3: Population of speakers in each dialect region

[Figure 6: Average SID rates (%) for dialect regions dr1-dr8, comparing normal and hierarchical SID]

VII. CONCLUSIONS

Prompting the client to say certain fixed words or sentences, such as an account number and physical address, could enhance the practical application of speaker identification in call centres. Our SID system shows reasonable performance on 630 people in a completely text-independent setup, as tabulated in table 2. We have seen in section III that companies have already deployed speech technology, in the form of speech recognition, in call centres.
Although only a few speakers were used in testing real GSM speech on the SID system, the proposed hierarchical architecture (figure 1) of speaker identification could ensure that the speakers' models (voice templates) are grouped into several smaller databases. This may maintain a high identification rate in real conditions. The projection is supported by the ideal dialect hierarchical SID rate in table 2, which suggests that the more model databases we have, the better the identification rate we achieve. Now that it is possible to test speech transmitted through a local telecommunications network, the second step will be to pass more speakers' voices through the local GSM network (Vodacom) for a better SID assessment. The transmission of the TIMIT speech database (clean speech) through a local telephone network (Telkom) will generate a telephone database that can replace NTIMIT for future assessment of our SID system. This local database can be tested on our system so that the SID application in call centres can be evaluated.

REFERENCES

[1] Westall F.A. (2001). Speech technology for telecommunications. BT Technol J, Vol. 14, No. 1, pp 9-27
[2] Ericsson Telecom AB and Telia AB (1998). Understanding Telecommunications 2, pp 61-62, 163-164
[3] Lerato L., Mashao D.J. (2002). Evaluation of a speaker identification system using GSM data. SATNAC 2002, Vol. 1, pp 75-78
[4] Le Floch J.L. et al. (1995). Speaker recognition experiments on the NTIMIT database. Proceedings of Eurospeech 95, Vol. 1, pp 379-382
[5] http://kbs.cs.tu-berlin.de/~jutta/toast.html
[6] http://www.etsi.org
[7] http://www.vodacom.co.za
[8] Murthy H.A. et al. (1999). Robust text-independent speaker identification over telephone channels. IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 5, pp 554-568
[9] Kuitert M., Boves L. (1997). Speaker verification with GSM coded telephone speech. Proceedings of Eurospeech 1997
[10] Mashao D.J. (2002). Towards an inverse solution for speaker recognition. Proc. of 13th Annual Symposium of the PRASA, pp 61-65
[11] Grassi S. et al. (2000). Influence of GSM coding on the performance of text-independent speaker recognition. Proceedings of ICASSP 2000
[12] Hermansky H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, Vol. 87, No. 4, pp 1738-1752
[13] Mashao D.J., Adcock J.E. (1997). Utterance dependent parametric warping for a talker-independent HMM-based recognizer. Proceedings of ICASSP 97, Munich, Germany, pp 1235-1238
[14] Campbell J.P. (1997). Speaker Recognition: A Tutorial. Proceedings of the IEEE, Vol. 85, No. 9, pp 1437-1462
[15] Phil Hochmut (2003). Nortel brings IVR to midsize call centres. http://www.nwfusion.com/news/2003/0616nortel.html
[16] Loquendo voice technologies integrated in Trenitalia call centres. http://www.hltcentral.org/usr_docs/case_studies/euromap/loquendo_fs_informa_web.pdf
[17] Intelleca Voice & Mobile: http://www.intelleca.co.za
[18] http://www.nuance.com/assets/pdf/verifier3faq.pdf
[19] http://www.speechworks/products/
[20] Rachel Lebihan (ZDNet Australia) (2002). Biometrics: the key to call centre fraud? http://news.zdnet.co.uk
[21] Lerato L., Mashao D.J. (2002). Hierarchical approach for improving speaker identification. Proc. of 13th Annual Symposium of the PRASA, pp 51-55

[22] Beigi H.S.M. et al. (1999). A hierarchical approach to large-scale speaker recognition. Proceedings of Eurospeech 99, Vol. 5, pp 2203-2206
[23] Pan Z. (2000). An on-line hierarchical method of speaker identification for large population. Proc. of IEEE Nordic Signal Processing Symposium, pp 33-36
[24] Gunawan W. et al. (2001). PLP coefficients can be quantized at 400 bps. Proceedings of ICASSP 01
[25] Besacier L. et al. (2000). GSM speech coding and speaker recognition. Proceedings of ICASSP 2000