Text-Independent Speaker Verification Using Utterance Level Scoring and Covariance Modeling


IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 6, SEPTEMBER 2002

Ran D. Zilca, Member, IEEE

Abstract—This paper describes a computationally simple method to perform text independent speaker verification using second order statistics. The suggested method, called utterance level scoring (ULS), allows obtaining a normalized score using a single pass through the frames of the tested utterance. The utterance sample covariance is first calculated and then compared to the speaker covariance using a distortion measure. Subsequently, a distortion measure between the utterance covariance and the sample covariance of data taken from different speakers is used to normalize the score. Experimental results from the 2000 NIST speaker recognition evaluation are presented for ULS, used with different distortion measures, and for a Gaussian mixture model (GMM) system. The results indicate that ULS is a viable alternative to GMM whenever computational complexity and verification accuracy need to be traded.

Index Terms—Covariance modeling, speaker verification, text independent, utterance level scoring.

I. INTRODUCTION

MOST current text independent speaker verification systems are based on the following approach. When a new speaker enrolls in the system, an explicit expression of the probability density function (PDF) representing the speech features in the enrollment session is estimated and serves as a speaker model. Typically this estimation is based on a predefined PDF structure such as a Gaussian mixture model (GMM) [1], and the PDF is obtained by estimating the GMM parameters, namely means, variances, and weights. Modeling of the opposite hypothesis (i.e., that the speech originated from a different speaker than the claimant) is performed in one of the two following ways.
1) World modeling or universal background model (UBM): estimating the PDF of a large session that includes many different speakers [2].

2) Cohort modeling: training a few different speaker models and taking the average score against the cohort speakers as the world's score [1].

When a test utterance needs to be verified against a claimant speaker model, each of the speech frames is scored against the claimant speaker model and against the world model or cohort background models. The normalized scores of individual frames are then accumulated into a total score for the tested utterance, typically by averaging. We refer to this approach as frame level scoring (FLS).

The FLS approach is based on the notion that the tested utterance includes only a subset of the phonetic space modeled by the text independent model, and therefore cannot be compared to it in its entirety. In fact, each frame score is effectively influenced only by the closest Gaussian components. This multimodal nature of modeling and verification mandates extensive computation, since each frame is scored separately against multimodal models during verification. In addition, training multimodal speaker and background models usually requires iterative procedures. At the heart of the FLS approach lies the assumption that the exact position in the feature space of a vector extracted from a single speech frame carries the information regarding the speaker's identity, rather than the nature of the distribution of the entire set of utterance frames.

(Manuscript received January 22, 2001; revised January 24. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Philip C. Loizou. The author was with the Research and Development Division, Amdocs, Israel. He is now with the IBM T. J. Watson Research Center, Yorktown Heights, NY USA; e-mail: zilca@us.ibm.com.)
However, in practice it is clear that the exact position of the feature vector extracted from a single frame is significantly affected by the channel, handset type, and additive noise. Although averaging the individual frame scores reduces these effects, this observation motivates the pursuit of a speaker verification system that scores an utterance by comparing the utterance distribution in feature space to the speaker and world models, rather than averaging frame scores. Following this approach, once the utterance model is computed the utterance score is calculated in a single operation. We therefore term it utterance level scoring (ULS). ULS was suggested previously for multimodal models [3]. Yet, it was shown to be suitable mainly when long test utterances are available, due to the need to train a multimodal utterance model. We model each tested utterance solely by its sample covariance matrix. This simple utterance model does not include many parameters to be estimated and therefore allows shorter utterances.

In brief, we are looking for a speaker verification method that is computationally light and compact. This is obtained first by using unimodal speaker and world models, composed only of the second order statistics of the training data. In addition, the use of ULS allows further reduced computation, since it requires going through the frames of a speech utterance only once, to compute its sample covariance (the utterance model), and subsequently scoring against the speaker and world/background models with a single operation each. We also conjecture that the suggested approach may have desirable robustness properties, due to the use of ULS rather than FLS. In a previous study we presented preliminary experimental results of ULS with second order statistics [4]. This paper describes more thorough experimentation and provides an in-depth discussion of the robustness properties of the suggested method. The rest of the paper is organized as follows.
Section II provides an overview of covariance models for speaker verification, both with FLS and with ULS. The computational advantages of using ULS with covariance modeling are given in Section III. This is followed by a description of the experiments that were conducted in the 2000 NIST speaker recognition evaluation in Section IV, and a presentation of the experimental results in Section V. We then discuss the results and provide our conclusions.

II. COVARIANCE MODELS

The term covariance model (CM) [4] refers to a single Gaussian speaker model, ignoring the mean vector. Training a speaker CM therefore involves only computing the sample covariance matrix, Sigma_S, of the training data:

    Sigma_S = (1/N) sum_{i=1..N} (x_i - mu)(x_i - mu)^T        (1)

where N is the number of training feature vectors, x_i is the ith training vector, and mu is the sample mean of the training data. The purpose of ignoring the mean vector is to focus only on the shape of the distribution of features extracted from the speaker data, rather than their exact location in feature space. However, many speaker verification systems that use cepstral vectors perform cepstral mean subtraction [5] to compensate for channel effects, resulting in a zero mean regardless of the CM approach.

A. Using CM With Frame Level Scoring

CM may be used for FLS, in which case each speech frame is scored using an explicit Gaussian expression or a Mahalanobis distance. We refer to this CM-based system as a single Gaussian full covariance GMM, or SG. When used for speaker verification, the score with respect to the speaker CM, Sigma_S, should be normalized using the background or world CM, Sigma_W. The background CM is calculated similarly to (1), but using vectors collected from a variety of different speakers.
Since SG follows a likelihood ratio classification approach [6], the normalized score of a single frame, s(x), is obtained by subtracting the two scores:

    s(x) = log p_S(x) - log p_W(x)        (2)

where x is a d-dimensional feature vector extracted from a speech frame and p_S and p_W are the explicit Gaussian PDF expressions, e.g.,

    p_S(x) = (2 pi)^(-d/2) |Sigma_S|^(-1/2) exp( -(1/2) (x - mu_S)^T Sigma_S^{-1} (x - mu_S) ).        (3)

Up to an expression that depends only on the covariance determinants, the Gaussian expressions in (3) may be replaced with Mahalanobis distances when substituted in (2):

    s(x) = (x - mu_W)^T Sigma_W^{-1} (x - mu_W) - (x - mu_S)^T Sigma_S^{-1} (x - mu_S).        (4)

When using SG, the means mu_S and mu_W may either be used directly as described above, or set to zero in order to disregard the exact location in feature space. As mentioned earlier, when using cepstral coefficients with cepstral mean subtraction, the means equal zero regardless of the modeling approach.

B. Using CM With Utterance Level Scoring

An alternative approach to SG is to first calculate the sample covariance of an utterance and then score the utterance with a single operation with respect to each model (i.e., the speaker and background covariances). Unlike the SG/GMM system presented in the previous subsection, the ULS approach does not concern single frames. Instead, it involves measuring the similarity between the shape of the distribution of the entire tested utterance and the shapes of the speaker and world distributions. As for the speaker and world data, the covariance matrix of the tested utterance represents its shape. Therefore the first step in performing ULS verification is to compute the sample covariance matrix of the utterance, i.e.,

    C_X = (1/M) sum_{j=1..M} (y_j - m)(y_j - m)^T        (5)

where M is the number of frames within the utterance, y_j is the jth utterance feature vector, and m is the sample mean of the utterance data. Once C_X is computed, it should be compared with the world and claimed speaker covariances to find which it matches best. Since this scoring stage may be performed in a single operation, using distortion measures between covariance matrices, we term this approach ULS.
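To make CM training per (1) and SG frame level scoring per (2)-(3) concrete, here is a minimal NumPy sketch on synthetic two-dimensional "features"; the function names and the toy data are illustrative, not from the paper, and zero means are assumed (as after cepstral mean subtraction).

```python
import numpy as np

def covariance_model(frames):
    """Train a covariance model (CM): the sample covariance of eq. (1)."""
    mu = frames.mean(axis=0)
    centered = frames - mu
    return centered.T @ centered / len(frames)

def log_gaussian(x, mu, cov):
    """Log of the Gaussian PDF in eq. (3)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def sg_utterance_score(frames, cov_s, cov_w):
    """Frame level scoring (FLS): average the per-frame log likelihood
    ratio of eq. (2) over all frames (zero means, as after CMS)."""
    zero = np.zeros(frames.shape[1])
    return np.mean([log_gaussian(x, zero, cov_s) - log_gaussian(x, zero, cov_w)
                    for x in frames])

# Toy illustration: a "speaker" with unit variance vs. a broader "world"
rng = np.random.default_rng(0)
cov_s = covariance_model(rng.normal(0, 1.0, size=(2000, 2)))
cov_w = covariance_model(rng.normal(0, 3.0, size=(2000, 2)))
target_score = sg_utterance_score(rng.normal(0, 1.0, size=(300, 2)), cov_s, cov_w)
impostor_score = sg_utterance_score(rng.normal(0, 3.0, size=(300, 2)), cov_s, cov_w)
print(target_score > 0, impostor_score < 0)
```

Note that every test frame is scored against every model, which is exactly the repeated per-model pass that ULS avoids.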
The distortion measure between the utterance and the speaker covariances should be normalized by the similarity to the world covariance. Let d(·,·) denote a general distortion measure between two covariance matrices. We suggest using the following form of normalized utterance score for ULS:

    S(X) = d(C_X, Sigma_W) / d(C_X, Sigma_S).        (6)

As seen from this expression, a speech utterance having a small distortion measure with respect to the claimant speaker's CM, relative to its distortion measure from the world CM, will produce a high score, and vice versa. Several distortion measures between covariance matrices were previously used for speaker identification, with no world/background normalization (e.g., [7]-[9]). We chose to investigate the new normalization method suggested in (6) using two of these distortion measures: the divergence shape and the arithmetic-harmonic sphericity measure.

The main computational advantage of ULS over likelihood based methods is that although the utterance sample covariance is calculated frame by frame, each frame calculation is performed independently of the speaker model. Consequently, once the utterance covariance is obtained, scoring with respect to a model (either speaker or world) involves only a single operation. Speaker verification involves scoring with respect to a speaker model and at least one additional world or background model, in order to obtain a normalized score. Therefore, ULS obtains a normalized score while going through each frame only once (unlike likelihood based systems, where the process of going through the entire utterance frame by frame needs to be repeated for every model). This computational advantage is more significant when several models, rather than a single world model, are used for score normalization; for example, when using cohort based normalization [1], or when gender and handset-type dependent world models are used not only for normalization but also as gender

and handset-type detectors (by using the best scoring world model for normalization).

1) Divergence Shape Ratio: The divergence shape is an information-theoretic measure between two Gaussian classes, introduced by Campbell as a measure of dissimilarity between a reference speaker utterance and a tested utterance [8]. Given the PDFs of two classes, p_1(x) and p_2(x), the symmetric directed divergence (resulting from the sum of the discriminating information for class 1 versus class 2 and vice versa) equals

    J = integral (p_1(x) - p_2(x)) ln( p_1(x) / p_2(x) ) dx.        (7)

Assuming the two classes follow Gaussian distributions with means mu_1, mu_2 and covariances Sigma_1, Sigma_2, respectively, and substituting Gaussian PDF expressions of the form (3) in (7), we obtain an expression composed of a sum of two components, as follows:

    J = (1/2) tr[ (Sigma_1 - Sigma_2)(Sigma_2^{-1} - Sigma_1^{-1}) ] + (1/2) tr[ (Sigma_1^{-1} + Sigma_2^{-1}) delta delta^T ]        (8)

where tr[·] is the trace operator and delta = mu_1 - mu_2. The first component depends solely on the covariances, and may therefore be attributed to the dissimilarity between the shapes of the two classes. The second component is a distortion measure between the two means, and is related to the distance in feature space between the two distributions. Campbell suggested using only the first component, termed the divergence shape (DS), i.e.,

    DS(Sigma_1, Sigma_2) = (1/2) tr[ (Sigma_1 - Sigma_2)(Sigma_2^{-1} - Sigma_1^{-1}) ].        (9)

The divergence shape was introduced in the context of closed set speaker identification. We propose using this measure for speaker verification score normalization, as part of the approach stated in (6). Also, while the DS was previously tested only on clean speech using line spectrum pairs (LSP), the experiments presented herein are performed on conversational telephone-quality speech, using Mel frequency cepstral coefficients (MFCC).
We suggest a new normalized ULS method, called the divergence shape ratio (DSR), where the utterance score, S_DSR(X), is calculated following (6) with the divergence shape as the distortion measure, i.e.,

    S_DSR(X) = DS(C_X, Sigma_W) / DS(C_X, Sigma_S).        (10)

2) Sphericity Measure Ratio: The sphericity measure ratio (SMR) is computed by applying (6) to a distortion measure termed the arithmetic-harmonic sphericity measure, or the sphericity measure (SM). The expression for the SM is derived from the analysis in [9], and it was recently used in [10]. As with the DS, this measure was previously used only for closed set speaker identification. We used the SM for background normalization in speaker verification in accordance with (6), and tested it on the 2000 NIST speaker recognition evaluation. A detailed analysis of using second order statistical measures, and the derivation of the arithmetic-harmonic sphericity measure, may be found in [9]. The SM is computed as follows:

    SM(Sigma_1, Sigma_2) = log( tr[Sigma_1 Sigma_2^{-1}] tr[Sigma_2 Sigma_1^{-1}] / d^2 )        (11)

and the SMR is

    S_SMR(X) = SM(C_X, Sigma_W) / SM(C_X, Sigma_S).        (12)

III. COMPUTATIONAL ADVANTAGES OF UTTERANCE LEVEL SCORING

In order to demonstrate the computational efficiency of the CM/ULS approach, let us consider three criteria for the evaluation of computational complexity: the number of memory cells required for storing a speaker model; the CPU time for speaker model training; and the CPU time for verifying a test utterance. SG shares the modest storage requirements of DSR and SMR (resulting from using CM), but requires more verification CPU time, since verification is performed at the frame level, for each model.

A. Speaker Model Storage Requirements

A CM includes d(d+1)/2 parameters, where d is the dimension of the speech feature vector. A diagonal covariance GMM with K components requires K(2d+1) parameters, and a full covariance GMM requires K(d(d+1)/2 + d + 1). When no compression is used, each parameter requires a single memory cell. For example, given an 18th order cepstral feature vector, the CMs require only 171 memory cells.
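The two distortion measures and the ratio score of (6) can be sketched in a few lines of NumPy; the diagonal test matrices below are illustrative only.

```python
import numpy as np

def divergence_shape(c1, c2):
    """Divergence shape, eq. (9): the covariance-only part of the
    symmetric divergence between two Gaussian classes."""
    return 0.5 * np.trace((c1 - c2) @ (np.linalg.inv(c2) - np.linalg.inv(c1)))

def sphericity_measure(c1, c2):
    """Arithmetic-harmonic sphericity measure, eq. (11)."""
    d = c1.shape[0]
    return np.log(np.trace(c1 @ np.linalg.inv(c2))
                  * np.trace(c2 @ np.linalg.inv(c1)) / d ** 2)

def uls_score(c_utt, cov_s, cov_w, distortion):
    """Normalized ULS score of eq. (6): distortion to the world CM over
    distortion to the speaker CM; gives DSR (10) or SMR (12)."""
    return distortion(c_utt, cov_w) / distortion(c_utt, cov_s)

# Both measures vanish for identical covariances and grow with shape mismatch
a = np.diag([1.0, 1.0])          # stand-in speaker CM
b = np.diag([4.0, 0.25])         # stand-in world CM
print(divergence_shape(a, a), sphericity_measure(a, a))  # 0.0 0.0
# An utterance covariance close to the speaker CM scores well above 1
c_utt = np.diag([1.1, 0.9])
dsr = uls_score(c_utt, a, b, divergence_shape)
smr = uls_score(c_utt, a, b, sphericity_measure)
print(dsr > 1, smr > 1)
```

Once `c_utt` is computed in a single pass over the frames, each additional model costs one call to the distortion function, which is the computational point of Section III.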
A 16 component full covariance GMM requires 3040 cells, a 64 component diagonal covariance GMM requires 2368 cells, and a 512 component diagonal GMM adapted from a background model [2] requires 18 944 cells (!).

B. Enrollment and Verification CPU Times

CPU times cannot be evaluated as straightforwardly as the storage requirements, since computation time depends on implementation specifics (e.g., the number of training iterations, or trading elaborate functions for look-up tables). In addition, executing identical experiments on different computing environments would result in different measured times, depending on the compiler/interpreter and the hardware setup. However, the dramatic difference in computational requirements between the CM/ULS approach and multimodal FLS systems may be demonstrated using a single example. CPU times, in seconds, are shown for such an example in Table I. All times were measured for the utterance eaaa and speaker 4309, chosen randomly from the NIST-1999 evaluation. Training and verification were performed using a Matlab simulation on a P-III 450 MHz workstation. To allow a fair comparison, approximated enrollment and verification were used for the different GMM configurations as follows. Nonadapted GMMs were trained using DB-GMM (K-means with no EM iterations) [11]. Verification for these methods was computed using only the 20 nearest components of the world model for each frame. Adapted GMM verification was performed using a five-component approximation, as described in [12]. The computational advantage of the covariance modeling methods is evident. Training is four orders of magnitude faster than GMM. SG verification is 17 to 24 times faster than GMM,
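The storage figures above follow directly from the parameter counts in Section III-A; a quick arithmetic sketch:

```python
# Model storage in memory cells (one cell per parameter), following Sec. III-A.
def cm_cells(d):
    return d * (d + 1) // 2                 # symmetric covariance only

def gmm_diag_cells(d, k):
    return k * (2 * d + 1)                  # d means + d variances + 1 weight

def gmm_full_cells(d, k):
    return k * (d * (d + 1) // 2 + d + 1)   # full covariance + mean + weight

d = 18  # 18th-order cepstral feature vector
print(cm_cells(d))             # 171
print(gmm_full_cells(d, 16))   # 3040
print(gmm_diag_cells(d, 64))   # 2368
print(gmm_diag_cells(d, 512))  # 18944
```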

TABLE I: CPU TIME [SECONDS] FOR ENROLLMENT AND VERIFICATION

and DSR verification is more than three orders of magnitude faster than GMM verification, owing to utterance level scoring. Trials on other utterances and speakers exhibited similar results.

IV. EXPERIMENTAL SETTINGS

The experiments described in this section were performed as part of the participation in the 2000 NIST speaker recognition evaluation. The data includes conversational telephone-quality speech taken from the Switchboard 2 corpus. The evaluation includes scoring and verifying 6096 target trials and impostor trials involving 926 target speakers, male and female. All of the test segments are recorded from calls made from a telephone number that is different from the one used for enrollment. Therefore, all test utterances may be considered to be collected using a different handset than the one used for training the speaker model. Each speaker is trained using a single two-minute session, while test utterances range between a few seconds and a minute (with a primary focus on utterances with s duration). All of the speech utterances (both training and testing) are labeled by the type of handset (electret or carbon-button) using the MIT-LL handset classifier [13]. A detailed description of the evaluation settings and rules may be found in [14].

A. Signal Processing

An identical signal processing procedure was applied to all the tested modeling and verification systems. The speech signal was first segmented into 25 ms frames every 12.5 ms. Each frame was multiplied by a Hamming window and analyzed using a voicing detector, and only voiced frames were selected for further use. Very aggressive voicing detection was applied, filtering out about half of the speech data. FFT-based 18th order MFCCs were extracted from each speech frame using a triangular filter bank.
Cepstral mean subtraction was then performed to compensate for convolutional channel effects; i.e., the mean vector of each utterance was subtracted from every single MFCC vector within the utterance.

B. Tested Systems

Four systems were tested, as follows.
1) Frame level scoring:
   a) diagonal covariance GMM obtained using Bayesian adaptation [2];
   b) SG.
2) Utterance level scoring:
   a) DSR;
   b) SMR.
SG, DSR, and SMR are described in this paper. In GMM, the model is trained by adapting from a background model that includes 512 components, which is also used for background normalization during verification.

C. Modeling Settings

As with the signal processing portion of the experiments, the following modeling settings were kept identical for all the tested systems. Four different world models were trained for each system, depending on the handset type (carbon-button or electret) and gender. The world models were trained using the entire test data of the 1999 NIST evaluation, about two hours of data for each model. For the CM systems (SG, DSR, and SMR) this involved simply calculating the sample covariance matrix of this data. For the GMM, the world model was trained using the DB-GMM procedure [11]. That is, the data was clustered using the K-means algorithm, and the GMM parameters were calculated as follows: the weights were given by the relative number of vectors in each cluster, and the means and variances were the sample mean and variance of each cluster. No EM iterations were performed. The speaker GMMs were adapted from the respective world model. The choice of world model for normalization of the verification trials was according to the type of handset in the training session and the claimant speaker's gender (the evaluation does not include any cross-gender trials). For GMM, the verification procedure was approximated using five Gaussian components, as described in [12].

Fig. 1. DET curves for all trials. GMM bold, DSR solid, SG dashed, and SMR dash-dot.
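The front-end steps of Section IV-A (25 ms frames every 12.5 ms, Hamming windowing, and per-utterance cepstral mean subtraction) can be sketched as follows; the sampling rate and the random "waveform" are illustrative assumptions, and the voicing detector and MFCC extraction are omitted.

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, hop_ms=12.5):
    """Slice a waveform into overlapping Hamming-windowed frames
    (25 ms every 12.5 ms, as in Sec. IV-A)."""
    flen = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + (len(signal) - flen) // hop
    window = np.hamming(flen)
    return np.stack([signal[i * hop : i * hop + flen] * window
                     for i in range(n)])

def cepstral_mean_subtraction(cepstra):
    """Subtract the utterance mean from every cepstral vector (CMS)."""
    return cepstra - cepstra.mean(axis=0)

sr = 8000  # assumed telephone-band sampling rate
frames = frame_signal(np.random.default_rng(2).normal(size=sr), sr)
print(frames.shape)  # (79, 200): 200-sample frames every 100 samples
```

After CMS, each utterance has a zero mean vector, which is why the CM systems can ignore the means entirely.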

Fig. 2. DET curves for match/mismatch conditions. et/et bold, cb/et solid, et/cb dashed, and cb/cb dash-dot.

V. RESULTS

The results of the experiments are given as DET curves [15], showing the trade-off between false impostor acceptance and false utterance rejection that each system is capable of producing. Fig. 1 shows a performance comparison between the four tested systems, for the entire set of trials. It is clearly seen that GMM performs best, and also that SMR is the best CM system. For numerical comparison purposes, let us consider the equal error rate (EER), which is the point on the DET curve where the false acceptance and false rejection rates are equal. GMM obtains 17.3% EER, while SMR, SG, and DSR reach 23.8%, 26.4%, and 27.6%, respectively. In other words, the CM systems exhibit up to 60% relative degradation in EER with respect to GMM. These results may be expected considering the simplistic nature of CM. However, SMR performs only 6.5 percentage points below GMM (37% relative degradation).

In order to examine the robustness of the tested systems, let us consider the DET curves shown in Fig. 2. The NIST-2000 speaker recognition evaluation includes only handset-mismatched trials (different telephone number), so in terms of the handset, only robustness to the handset type may be evaluated. We use the following train/test notation for the different match/mismatch conditions:
- et/et: training and verification performed using an electret handset;
- cb/cb: training and verification performed using a carbon-button handset;
- et/cb: training performed using electret and verification with carbon button;
- cb/et: training performed using carbon button and verification with electret.
As expected, all the tested systems exhibit their best performance for the matched electret (et/et) condition and their worst performance for the mismatched conditions. This is related to the degradation caused by the low quality carbon-button microphone, as expressed in the speaker model, and to the mismatch in handset types.
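The EER used for comparison above can be estimated directly from raw target and impostor scores by sweeping a decision threshold; this is a generic sketch on synthetic Gaussian scores, not the evaluation's scoring tool.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep a threshold over all observed scores and return the
    operating point where false rejection ~= false acceptance."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = 1.0, 0.0
    for t in thresholds:
        frr = np.mean(target_scores < t)      # true speakers rejected
        far = np.mean(impostor_scores >= t)   # impostors accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer

rng = np.random.default_rng(3)
targets = rng.normal(1.0, 1.0, 1000)     # synthetic true-speaker scores
impostors = rng.normal(-1.0, 1.0, 1000)  # synthetic impostor scores
print(equal_error_rate(targets, impostors))  # roughly 0.16 for this overlap
```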
However, the worst degradation for mismatched conditions is observed for SG: 47.1% EER, which is very close to a 50% random guess.

TABLE II: ROBUSTNESS MEASURES

In order to compare the robustness of the systems, let us define the following robustness measures. The first measure, R_mis, represents the relative degradation in accuracy when the type of handset used to collect the verification utterance is different from the one used for training the claimed speaker's model. It is the average of two components:

    R_mis = (R_et + R_cb) / 2        (13)

where the first component, R_et = (EER_{et/cb} - EER_{et/et}) / EER_{et/et}, is the relative degradation in EER between the et/et and et/cb conditions, and the second, R_cb = (EER_{cb/et} - EER_{cb/cb}) / EER_{cb/cb}, is the degradation from cb/cb to cb/et. A second robustness measure, R_ht, represents system sensitivity to the type of handset in matched conditions. R_ht is calculated as the relative EER degradation between the matched et/et and cb/cb conditions, i.e.,

    R_ht = (EER_{cb/cb} - EER_{et/et}) / EER_{et/et}.        (14)

We measured the EER for the different conditions based on the performance shown in Fig. 2 and calculated the values of R_mis and R_ht, as shown in Table II. As seen, DSR exhibits high sensitivity to the handset type but is not very sensitive to handset type mismatch. SMR, on the other hand, is not sensitive to the handset type, but rather to mismatch. The extreme robustness of SMR to handset type is noteworthy: only 9% relative degradation. This property is also observable in the absolute EER values: in matched carbon-button (cb/cb) conditions SMR obtains 23% EER, which is only a relative degradation of 18% with respect to the corresponding GMM EER of 19.5%. Yet, it should be noted that from Fig. 2(d) it appears that SMR performs better for thresholds tuned around the EER than in other regions of the DET curve.
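The robustness measures (13)-(14) reduce to simple arithmetic over the per-condition EERs; in this sketch, the symbol names `r_mis`/`r_ht` stand in for the paper's (elided) notation and the EER values are made up for illustration.

```python
def relative_degradation(eer_base, eer_degraded):
    return (eer_degraded - eer_base) / eer_base

def robustness_measures(eer):
    """eer: dict keyed by 'train/test' handset condition.
    Returns (r_mis, r_ht) following eqs. (13)-(14)."""
    r_et = relative_degradation(eer["et/et"], eer["et/cb"])
    r_cb = relative_degradation(eer["cb/cb"], eer["cb/et"])
    r_mis = (r_et + r_cb) / 2          # mismatch sensitivity, eq. (13)
    r_ht = relative_degradation(eer["et/et"], eer["cb/cb"])  # eq. (14)
    return r_mis, r_ht

# Hypothetical per-condition EERs, for illustration only
example = {"et/et": 0.20, "et/cb": 0.30, "cb/cb": 0.22, "cb/et": 0.33}
r_mis, r_ht = robustness_measures(example)
print(round(r_mis, 3), round(r_ht, 3))  # 0.5 0.1
```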
In addition, this approach may be useful for other open set speaker recognition problems, such as the open-set speaker identification task. This task includes no knowledge of a claimant speaker, and the system is required to blindly recognize the speaker's identity among a predefined set of speaker models. In addition, it is required to decide whether the speaker exists at all in the speaker database, by thresholding the scores. Fast approximated open-set speaker identification may be performed by scoring against all the speaker models using ULS to determine the top scoring speakers, followed by a more precise GMM/FLS stage to determine the speaker's identity among the best scoring ones. This approach utilizes the tradeoff between the scoring accuracy and the number of best-detected speakers.

However, although the computational simplicity of CM/ULS may be useful in some applications, it is highly desirable to consider possible ways to improve its recognition accuracy. Since (6) is a form of score transformation, the first possibility that comes to mind for modifying CM/ULS is using a different form of score transformation than (6). However, in preliminary experiments we tried several different simple mathematical expressions involving the two distortion measures, and chose the form of (6) because of the resulting Gaussian-like score histograms. The fact that the score distributions are close to Gaussian is also verified in the DET curves, which are approximately linear. An example of score histograms is shown in Fig. 3. The figure shows the impostor score histogram superimposed on the true speaker histogram. The two histograms are displayed at different granularities in order to have similar occurrence values. Fig. 3(a) shows the score histograms for GMM, where score normalization is obtained using a frame level likelihood ratio, and Fig. 3(b) shows the histograms for SMR, where normalization is performed using (6).

Fig. 3. Score histograms, all trials: (a) GMM and (b) SMR.
The difference in verification accuracy is clear, as the GMM histograms are much more easily separable than the SMR histograms. However, the SMR histograms are seen to be Gaussian-like, the same as for GMM. This supports the use of a score transformation function of the form (6).
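The two-stage open-set identification idea sketched in the discussion above can be illustrated as a fast ULS pre-selection; the Frobenius distance below is a deliberately simple stand-in for the paper's distortion measures, and the enrolled CMs are made up.

```python
import numpy as np

def open_set_preselect(c_utt, speaker_covs, distortion, top_n=2):
    """First, fast stage: rank every enrolled speaker with a single ULS
    distortion each and keep the top_n candidates for a more precise
    (e.g., GMM/FLS) second stage."""
    scores = {name: distortion(c_utt, cov) for name, cov in speaker_covs.items()}
    return sorted(scores, key=scores.get)[:top_n]

# Illustration: Frobenius distance standing in for DS or SM
frobenius = lambda a, b: np.linalg.norm(a - b, "fro")
enrolled = {"spk1": np.eye(2) * 1.05,
            "spk2": np.eye(2) * 4.0,
            "spk3": np.eye(2) * 9.0}
shortlist = open_set_preselect(np.eye(2), enrolled, frobenius)
print(shortlist)  # ['spk1', 'spk2']
```

Only the short list is then rescored with the expensive frame level system, which is where the accuracy/complexity trade-off of the discussion lives.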

VII. CONCLUSIONS

In this paper, we have described a computationally simple method for performing text independent speaker verification. The method models the speaker and background using a sample covariance. Verification is performed by comparing the utterance sample covariance with the covariance models, using a distortion measure. Since this method allows scoring the utterance with respect to a covariance model in a single computational operation, rather than frame by frame, we termed the method utterance level scoring (ULS). The suggested approach provides substantial computational simplification with respect to likelihood ratio classifiers such as GMM. ULS using several distortion measures was tested on the 2000 NIST speaker recognition evaluation corpus and compared to GMM and a single Gaussian classifier. We also compared the tested systems in different conditions of mismatch between the types of handset used for enrollment and verification. The results indicate that ULS may be a viable technique when computational complexity needs to be gracefully traded with verification accuracy, in particular whenever it is required to score a tested utterance with respect to several models. Examples of such scenarios include open set speaker identification and using world models as gender and handset-type detectors (by using the best scoring world model for normalization).

Fig. 4. DET curves for SMR using UBM and cohort normalization, for all trials and for electret data.

Another modification of CM/ULS that might prove useful is using cohort-based normalization rather than a single world model. Since CM/ULS scoring involves a single operation once the utterance covariance has been calculated, the additional computational cost of using cohort normalization [1] instead of a single world model is relatively small.
When using cohort normalization, no world model is pre-trained, and for the verification procedure the denominator of (6) is replaced by an average of scores with respect to a few speaker models. Fig. 4 shows the performance of SMR with cohort normalization compared to normalization using a world model. We chose SMR due to its superior performance relative to DSR and SG. The cohort speakers were taken from the 1999 NIST speaker recognition evaluation. As with the world models, different normalization was used according to gender and handset type. There are 857 male-electret cohort speakers, 591 male-carbon-button, 1449 female-electret, and 523 female-carbon-button. Fig. 4 clearly indicates that cohort normalization performs worse than world normalization for SMR.

REFERENCES

[1] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Commun., vol. 17.
[2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, no. 1-3.
[3] H. S. M. Beigi, S. H. Maes, and J. S. Sorensen, "A distance measure between collections of distributions and its application to speaker recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998.
[4] R. D. Zilca, "Text independent speaker verification using covariance modeling," IEEE Signal Processing Lett., vol. 8, Apr. 2001.
[5] R. J. Mammone, X. Zhang, and R. P. Ramachandran, "Robust speaker recognition," IEEE Signal Processing Mag., vol. 13, Sept.
[6] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic.
[7] H. Gish, "Robust discrimination in automatic speaker identification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1990.
[8] J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, Sept. 1997.
[9] F. Bimbot, I. M. Magrin-Chagnolleau, and L. Mathan, "Second order statistical measures for text independent speaker identification," Speech Commun., vol. 17.
[10] C. Alonso-Martinez and M. Faundez-Zanuy, "Speaker identification in mismatch training and testing conditions," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2000.
[11] R. D. Zilca and Y. Bistritz, "Distance-based Gaussian mixture model for speaker recognition over the telephone," in Proc. ICSLP-2000, Beijing, China, 2000.
[12] D. A. Reynolds, "Comparison of background normalization methods for text-independent speaker verification," in Proc. Eurospeech-97, Rhodes, Greece, 1997.
[13] D. A. Reynolds, "HTIMIT and LLHDB: Speech corpora for the study of handset transducer effects," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 2, 1997.
[14] [Online].
[15] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proc. Eurospeech-97, Rhodes, Greece, 1997.

Ran D. Zilca (M'92) received the B.S. degree with honors in electrical engineering from Ben Gurion University, Israel, in 1993, and the M.S. degree in electrical engineering from Tel Aviv University, Israel. From 1993 to 1999, he was with the Israel Defense Forces (IDF), where he served as an R&D Engineer and Project Manager. From 1999 to 2001, he was a Research Scientist with the R&D division of Amdocs, Israel, where he conducted research in text independent speaker recognition for landline and cellular environments, focusing on computationally efficient large-population open-set speaker identification. During this time, he also served as the project manager for the development of a robust voice portal for cellular providers. In 2001, he joined the Human Language Technology Department at the IBM T. J. Watson Research Center, Yorktown Heights, NY, where he is conducting research in telephone speaker recognition. His current research interests include statistical pattern recognition, and robust speaker verification and identification.


have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Systematic reviews in theory and practice for library and information studies

Systematic reviews in theory and practice for library and information studies Systematic reviews in theory and practice for library and information studies Sue F. Phelps, Nicole Campbell Abstract This article is about the use of systematic reviews as a research methodology in library

More information

Memory-based grammatical error correction

Memory-based grammatical error correction Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

More information

Probability and Statistics Curriculum Pacing Guide

Probability and Statistics Curriculum Pacing Guide Unit 1 Terms PS.SPMJ.3 PS.SPMJ.5 Plan and conduct a survey to answer a statistical question. Recognize how the plan addresses sampling technique, randomization, measurement of experimental error and methods

More information

NCEO Technical Report 27

NCEO Technical Report 27 Home About Publications Special Topics Presentations State Policies Accommodations Bibliography Teleconferences Tools Related Sites Interpreting Trends in the Performance of Special Education Students

More information