Call Centre Speaker Identification using Telephone and GSM Data Lerato Lerato and Daniel Mashao Dept. of Electrical Engineering, University of Cape Town Rondebosch 7800, Cape Town, South Africa llerato@crg.ee.uct.ac.za, daniel@eng.uct.ac.za Abstract- Telecommunications network services and call centres are examples of avenues for speech technology. In these services, speech technology exists largely in the form of speech recognition (voice to text). This paper explores the use of speaker identification (who has spoken?) in call centres. Call centres acquire information telephonically (cellular or landline). This work therefore reports speaker identification (SID) results on simulated GSM, real GSM (Vodacom) and telephone speech data. Channel noise reduces the SID rate, and large-scale enrolment of speakers further degrades system performance. Finally, this paper proposes a hierarchical method for improving SID performance on a large population of enrolled speakers. I. INTRODUCTION A speaker identification (SID) system recognises a person either from known words (text-dependent) or from any utterance (text-independent) within a closed group of speakers (e.g. bank customers). SID features in speech technology for telecommunications [1] as a good candidate for secure access to information through network services and call centres [2]. SID systems, as we have reported [3], perform less efficiently on speech that has passed through a communication channel. Speakers enroll into the SID system, forming individual templates or models, and these templates are compared with the test talker's speech features during the identification process. There are many speech databases available for testing SID systems. The data used in this study originate from the TIMIT (clean speech) database [4]. The NTIMIT database [4] is telephone-transmitted TIMIT speech. This database was created by Texas Instruments (T.I.)
and Massachusetts Institute of Technology (M.I.T.). In this work, clean speech was passed through the GSM 06.10 codec [5], forming the GSM database (figure 4). GSM 06.10 is the ETSI standard [6] for full rate (FR) speech coding. The second data set resulted from transmission of clean speech through the local GSM (Vodacom) network [7]. Section V gives a brief account of how these databases were formed. Several studies have focused on telecommunications applications of speaker identification at different levels. Westall F. A. et al. [1] put the whole picture of speech technology into a telecommunications perspective. Murthy H. A. et al. [8] compensated for communication channel noise and improved the performance of SID on telephone speech. Kuitert and Boves [9] showed that the limited telephone band of 300-3400 Hz does not significantly affect the performance of speaker recognition; Mashao D. J. [10] reported the same result using our SID system. Grassi S. et al. [11] concluded that GSM coding degrades SID performance. Although channel noise appears to be the main obstacle to better SID performance, it has also been observed that an increase in speaker population lowers the identification rate [3]. This degradation results from confusion between speakers with similar voices. The hierarchical configuration of speaker identification described in section IV attempts to solve the misclassification that results from this large enrolment of speakers. Section VI reports results on both simulated and real GSM full rate (Vodacom) data in order to compare the SID system's behaviour in a practical environment with that simulated in the laboratory. The same section also presents observations when telephone speech is used with a large number of enrolled speakers. Section VII concludes the paper and gives future directions. II. SID AND ITS CHALLENGES This section briefly describes speaker identification and some of the problems that affect it.
Speaker identification (SID) falls in the same category as speaker verification (SV) within the field of speaker recognition [4]. SID recognises a speaker from his or her voice without prior knowledge of the talker's identity; SV verifies a speaker against his or her claimed identity through voice. Speaker-specific speech features are important in the SID process. The physiological structure of a speaker's vocal tract is unique, and it is possible to extract speaker-specific features from his or her voice. The feature extraction stage of an SID system is known as the front end, while the classifier is called the back end. There are many possible front ends. Linear predictive coding (LPC), used in the GSM full rate codec [5], is an example of a feature extraction algorithm; perceptual linear prediction (PLP) [12] is an example of auditory-based feature extraction. Our system uses parameterised feature sets (PFS) [13] for speech feature extraction. Classification algorithms are decision-making tools that produce probabilistic scores of how well the talker matches the voice models (templates) stored in the database. Hidden Markov models (HMM) and Gaussian mixture models (GMM) are the back ends (classifiers) most commonly used in speech and speaker recognition respectively. The challenge facing robust SID system implementation [14] is correcting the errors that affect its performance. Sources of such errors include inaccurate modelling of the vocal tract, extreme emotional states of the speaker, time-varying speech acquisition equipment, head colds and ageing of the speaker. Overcoming these problems and limiting channel noise could enhance the deployment of SID systems in call centres.
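The front-end/back-end split can be illustrated with a minimal sketch. This is not the authors' PFS-based system: it enrols three hypothetical speakers as Gaussian mixture models over stand-in random "feature" frames (a real front end such as PLP would produce these from speech) and identifies a test utterance by maximum average log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in feature matrices (frames x coefficients); a real front end
# such as PLP or PFS would extract these from each speaker's speech.
train = {spk: rng.normal(loc=spk, scale=1.0, size=(200, 12)) for spk in range(3)}

# Back end: one GMM template (voice model) per enrolled speaker.
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(X)
          for spk, X in train.items()}

def identify(features, models):
    """Closed-set SID: return the enrolled speaker whose model gives
    the highest average log-likelihood for the test feature frames."""
    return max(models, key=lambda spk: models[spk].score(features))

# Unseen frames drawn from "speaker 1" should match speaker 1's model.
test = rng.normal(loc=1, scale=1.0, size=(60, 12))
print(identify(test, models))
```

In a closed-group setting like this, the system always returns the best-matching enrolled speaker; rejecting impostors is the separate verification (SV) problem.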
III. SID IN CALL CENTRES A call centre is a telephone-call management unit which companies use to handle large numbers of calls. The key functions of a typical call centre include automatic queue management and the coordination of calls and customer data [2]. Automatic queue management directs incoming calls to suitable agents [2], who can also receive the caller's information on a computer for new records or for consulting other agents. This functionality is also performed by an automatic call distributor (ACD). Interactive voice response (IVR) is a form of interaction between a customer and a telephone answering machine: the customer presses keys on the handset while the system prompts him or her with voice. IVR is also common in network-subscriber interaction, such as in the PSTN, where the customer uses the telephone dial or voice [2] to access information. Most automated voice network services are hierarchical and consume a long time before the desired information is reached, for example pressing key 1 for help, followed by key 5 for account information, and so on. Several companies have deployed speech recognition in call centres and telephone services. Nortel [15] announced a speech-enabled call centre facility that will replace telephone dialling for enquiries. Loquendo's [16] voice technologies have proved successful in the implementation of speech recognition. Intelleca Voice & Mobile [17] has built a speech recognition system for ticket bookings for Ster Kinekor's national call centre in South Africa. These are a few of the many companies involved in implementing this technology in call centres. Speaker identification (SID) has not seen much uptake in call centre operations because of the lack of robust SID systems. SID could be used alongside speech recognition where security of information is necessary; Nuance [18] has implemented a similar system.
SpeechWorks [19] also advocates the importance of speaker recognition (verification) in telephone services. According to Rachel Lebihan of ZDNet Australia [20], experts confirm the relevance of SID in call centres. Our proposed method in the following section contributes towards optimising SID performance in call centres. IV. THE PROPOSED HIERARCHICAL SID SYSTEM The hierarchically configured speaker identification system tries to separate the voice templates of speakers who sound alike. Figure 1 illustrates the proposed system. Our initial approach [21] to this proposal grouped speakers according to gender, but no improvement was observed because the normal SID [3] already had gender separation capability. This means that the group detector should use speech features different from those of the front end (feature extraction). Figure 1: Hierarchical SID system (signal passes through a group detector and feature extraction; training stores templates λ1 ... λNs in per-group databases; testing performs matching and selects the maximum score for identification) During training, the group detector (figure 1) classifies a speaker and gives him or her a group tag so that the voice template, λ, can be stored in the correct database. Testing is done by first determining the group that the test speaker belongs to; the speech features are then matched against the correct database (the λi's). The feature extraction (front end) and classification (matching and scoring) remain the same as in our previous paper [3]. Several studies [22, 23] have used the hierarchical approach to improve processing time; we use it to improve the identification rate. Two groups, resulting in two databases, are used in this study because the proposal is at an initial stage. The perceptual linear prediction (PLP) [12] algorithm is used in our group detector. A. Perceptual linear prediction We implemented the PLP algorithm according to Hermansky [12], as shown in figure 2. The speech signal power spectrum is warped onto the Bark scale [24].
The result is convolved with the critical band curve, which approximates the frequency resolution of the human ear. During equal-loudness pre-emphasis [12, 24], a filter approximating the non-equal sensitivity of human hearing is multiplied with the convolved signal. The result is then passed through the IDFT (inverse discrete Fourier transform). A portion of the resulting spectrum, from sample 20 to 100, is used as the grouping segment. Peaks from sample 20 to 55 are summed and the sum is stored for group A; group B is the sum of peaks from sample 55 to 90. The choice of these values was experimental and also based on knowledge of the spectrum's symmetry. Grouping is done such that if sum A is greater than sum B then the speaker belongs to group A, and vice versa. The normal speaker identification process then continues.
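The grouping rule above can be sketched as follows. This is a simplified illustration, not the authors' implementation: a plain FFT power spectrum stands in for the PLP auditory spectrum, and every spectral sample in each band is treated as a "peak".

```python
import numpy as np

def group_tag(spectrum):
    """Assign a speaker to group 'A' or 'B' by comparing the summed
    energy of spectrum samples 20-55 against samples 55-90, following
    the paper's rule.  (Simplification: all samples in the band are
    summed, rather than detected peaks of a PLP auditory spectrum.)"""
    sum_a = spectrum[20:55].sum()
    sum_b = spectrum[55:90].sum()
    return 'A' if sum_a > sum_b else 'B'

# Toy signals with energy concentrated low vs. high in the band.
n = 256
t = np.arange(n)
low = np.sin(2 * np.pi * 30 / n * t)    # spectral energy near bin 30
high = np.sin(2 * np.pi * 70 / n * t)   # spectral energy near bin 70
spec_low = np.abs(np.fft.rfft(low)) ** 2
spec_high = np.abs(np.fft.rfft(high)) ** 2
print(group_tag(spec_low), group_tag(spec_high))
```

The comparison is a single threshold on where spectral energy concentrates, which is why only two groups (and hence two model databases) result from this detector.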
Figure 4: Creation of the GSM full rate database (TIMIT database → down-sample 16 kHz to 8 kHz → GSM 06.10 codec → up-sample 8 kHz to 16 kHz → GSM database) GSM full rate compresses the signal to 13 kbps for transmission, and the speech must be sampled at 8 kHz [25]. TIMIT speech is sampled at 16 kHz, and therefore down-sampling was required before the identification process. Finally, figure 5 shows how the identification process executes. Figure 2: Hermansky [12] PLP analysis block diagram V. TELEPHONE AND GSM SPEECH EXPERIMENTS The speech signal processed by the system illustrated in figure 1 exists in different forms. NTIMIT telephone speech was chosen as the first test of our new architecture because this database is commonly used in the literature. The GSM speech database used was created in the laboratory [3, 25]. TIMIT speech was also transmitted through the Vodacom network, as shown in figure 3, to obtain real GSM speech data. Figure 3: Generating real GSM coded speech (TIMIT speech played from a sound card → local GSM network (Vodacom) → GSM speech stored on a PC) Given the laboratory constraints, only a small set of speakers (18 speakers) was tested, as indicated by the results in table 1. Figure 4 illustrates how GSM 06.10 [5, 6] was used to form the simulated data. Figure 5: Speaker identification process for GSM speech (TIMIT → GSM full rate → SID → identification) VI. RESULTS AND DISCUSSIONS There are two types of results in this section. The results in table 1 show speaker identification of 18 talkers from GSM-transmitted voices. Table 2 and figure 6 reflect the identification rate for 630 speakers; these results indicate how the SID system performs on telephone-transmitted voice. A. GSM speech results GSM speaker identification was done on a small population. The results in table 1 highlight the impact of channel effects on speaker identification when a real GSM network processes the speech. Channel noise [8] and even the mobile device circuitry contribute to the degradation of the identification rate.
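The simulated data path of figure 4 can be sketched as below. This assumes polyphase resampling for the rate conversions; the GSM 06.10 codec step is left as an identity placeholder, since the actual encoder/decoder (e.g. the 'toast' tool [5]) runs outside Python.

```python
import numpy as np
from scipy.signal import resample_poly

def simulate_gsm_path(speech_16k):
    """Figure-4 pipeline sketch: 16 kHz TIMIT speech -> 8 kHz ->
    GSM 06.10 full-rate codec -> back to 16 kHz.  The codec call is
    a placeholder (identity) here; in practice a GSM 06.10 encoder
    and decoder would process the 8 kHz signal."""
    speech_8k = resample_poly(speech_16k, up=1, down=2)   # 16 kHz -> 8 kHz
    coded_8k = speech_8k                                  # placeholder for GSM 06.10
    return resample_poly(coded_8k, up=2, down=1)          # 8 kHz -> 16 kHz

x = np.sin(2 * np.pi * 440 / 16000 * np.arange(16000))  # 1 s of a 440 Hz tone
y = simulate_gsm_path(x)
print(len(y))
```

Because the down/up factors cancel, the output keeps the original length and sampling rate, so the existing 16 kHz front end can process the coded speech unchanged.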
Each speaker spoke for 15 seconds during training and 6 seconds during testing. The Gaussian mixture model classifier [3] used in our experiment performs better when talkers speak for longer during training; this means the real GSM network results may improve if we train speakers for a longer time. This is due for further investigation.

Table 1: SID on GSM speech data
Data | Total number of speakers | No. of correctly identified speakers
Simulated (GSM 06.10) | 18 | 17
Vodacom | 18 | 15

B. Telephone speech results During the investigation of our proposed hierarchical SID performance, a population of 630 speakers from the whole telephone speech database (NTIMIT) was used. Normal SID in table 2 refers to the non-hierarchical SID [21], while ideal dialect hierarchical SID indicates dialect-based group
detection. The hierarchical SID in table 2 used the PLP group detector, which generated only two groups of speakers. The speakers in the NTIMIT database are already grouped according to their dialect; taking advantage of this arrangement, the group detector (figure 1) produced 8 database groups. This grouping is ideal because it was not done by the group detector algorithm but was prearranged. The results in table 2 are competitive with those reported in the literature [4, 11] using the same data.

Table 2: SID rate on a large population using NTIMIT
Architecture of the SID | SID rate on 630 speakers (NTIMIT)
Normal SID | 69.5 %
Hierarchical SID | 75.7 %
Ideal dialect hierarchical SID | 86.5 %

Each speaker's sentences 8 and 9 were used for testing and the other 8 for training. Table 3 shows the population of speakers in each dialect region (dr) of the telephone speech database (NTIMIT). The same experiment was carried out for each dialect region in order to observe the consistency of the proposed architecture on telephone speech. Figure 6 is a graph of identification rates for the different dialect groups of speakers. The results in figure 6, however, do not reflect text independence: speakers uttered the same sentence for the group detection process.

Table 3: Population of speakers in each dialect region
Region | dr1 | dr2 | dr3 | dr4 | dr5 | dr6 | dr7 | dr8
Population | 49 | 102 | 102 | 100 | 98 | 46 | 100 | 33

Figure 6: Average SID rates (%) for all dialect regions, normal vs. hierarchical

VII. CONCLUSIONS Prompting the client to say certain fixed words or sentences, such as an account number and physical address, could enhance the practical application of speaker identification at call centres. Our SID system shows reasonable performance on 630 people in a completely text-independent setup, as tabulated in table 2. We saw in section III that companies have already deployed speech technology in the form of speech recognition in call centres.
Although only a few speakers were used to test real GSM speech on the SID system, the proposed hierarchical architecture (figure 1) of speaker identification could ensure that the speakers' models (voice templates) are grouped into several smaller databases. This may maintain a high identification rate in real-time conditions. This projection is supported by the ideal dialect hierarchical SID rate in table 2, which suggests that the more model databases we have, the better the identification rate we achieve. Now that it is possible to test speech transmitted through a local telecommunications network, the second step will be to pass more speakers' voices through the local GSM network (Vodacom) for a better SID assessment. Transmission of the TIMIT (clean speech) database through a local telephone network (Telkom) will generate a telephone database that can replace NTIMIT for future assessment of our SID system. This local database can then be tested on our system and the SID application in call centres evaluated. REFERENCES [1] Westall F. A. (2001). Speech technology for telecommunications. BT Technol J. Vol. 14, No. 1, pp 9-27. [2] Ericsson Telecom AB and Telia AB (1998). Understanding Telecommunications 2. pp 61-62, 163-164. [3] Lerato L., Mashao D. J. (2002). Evaluation of a speaker identification system using GSM data. SATNAC 2002. Vol. 1, pp 75-78. [4] Le Floch J. L. et al. (1995). Speaker recognition experiments on the NTIMIT database. Proceedings of Eurospeech 95, Vol. 1, pp 379-382. [5] http://kbs.cs.tu-berlin.de/~jutta/toast.html [6] http://www.etsi.org [7] http://www.vodacom.co.za [8] Murthy H. A. et al. (1999). Robust text-independent speaker identification over telephone channels. IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 5, pp 554-568. [9] Kuitert M., Boves L. (1997). Speaker verification with GSM coded telephone speech. Proceedings of Eurospeech 1997. [10] Mashao D. J. (2002). Towards an inverse solution for speaker recognition. Proc. of the 13th Annual Symposium of the PRASA, pp 61-65.
[11] Grassi S. et al. (2000). Influence of GSM coding on the performance of text-independent speaker recognition. Proceedings of ICASSP 2000. [12] Hermansky H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, Vol. 87, No. 4, pp 1738-1752. [13] Mashao D. J., Adcock J. E. (1997). Utterance-dependent parametric warping for a talker-independent HMM-based recognizer. Proceedings of ICASSP 97, Munich, Germany, pp 1235-1238. [14] Campbell J. P. (1997). Speaker recognition: a tutorial. Proceedings of the IEEE, Vol. 85, No. 9, pp 1437-1462. [15] Phil Hochmuth (2003). Nortel brings IVR to midsize call centres. http://www.nwfusion.com/news/2003/0616nortel.html [16] Loquendo voice technologies integrated in Trenitalia call centres. http://www.hltcentral.org/usr_docs/case_studies/euromap/loquendo_fs_informa_web.pdf [17] Intelleca Voice & Mobile: http://www.intelleca.co.za [18] http://www.nuance.com/assets/pdf/verifier3faq.pdf [19] http://www.speechworks/products/ [20] Rachel Lebihan (ZDNet Australia) (2002). Biometrics: the key to call centre fraud? http://news.zdnet.co.uk [21] Lerato L., Mashao D. J. (2002). Hierarchical approach for improving speaker identification. Proc. of the 13th Annual Symposium of the PRASA, pp 51-55.
[22] Beigi H. S. M. et al. (1999). A hierarchical approach to large-scale speaker recognition. Proceedings of Eurospeech 99, Vol. 5, pp 2203-2206. [23] Pan Z. (2000). An on-line hierarchical method of speaker identification for large population. Proc. of the IEEE Nordic Signal Processing Symposium, pp 33-36. [24] Gunawan W. et al. (2001). PLP coefficients can be quantized at 400 bps. Proceedings of ICASSP 01. [25] Besacier L. et al. (2000). GSM speech coding and speaker recognition. Proceedings of ICASSP 2000.