Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project


Phonetic- and Speaker-Discriminant Features for Speaker Recognition

by Lara Stoll

Research Project

Submitted to the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, in partial satisfaction of the requirements for the degree of Master of Science, Plan II.

Approval for the Report and Comprehensive Examination:

Committee:
Professor Nelson Morgan, Research Advisor
Dr. Nikki Mirghafori, Second Reader


Contents

List of Figures
List of Tables
Acknowledgements

1 Introduction

2 Background
   2.1 The Speaker Recognition Problem
   2.2 Cepstral Features
   2.3 System Approaches
       2.3.1 Gaussian Mixture Model (GMM)
       2.3.2 Support Vector Machine (SVM)
   2.4 Normalizations
       2.4.1 Score-level
       2.4.2 Feature-level
       2.4.3 Model-level

3 Related Work
   3.1 Phonetically Discriminative Features
   3.2 Speaker Discriminative Features
       3.2.1 Using Neural Networks
       3.2.2 Using a Speaker Space
   3.3 Hybrid Phonetic and Speaker Approaches

4 Approach
   4.1 Motivation
   4.2 Method
   4.3 Tandem/HATS-MLP Features
   4.4 Speaker-MLP Features
       4.4.1 Training Speaker Selection
       4.4.2 Training of the Speaker-MLP Networks
       4.4.3 Using the Hidden Activations of the Speaker-MLP as Features
       4.4.4 Using an SVM Instead of a GMM System
   4.5 Baseline Cepstral Systems for System Comparison
   4.6 System Fusion

5 Experimental Results
   5.1 Databases
       5.1.1 Training Database
       5.1.2 Testing Database
   5.2 Performance Measures
   5.3 Results Using Tandem/HATS-MLP Features in a GMM System
       5.3.1 Tandem/HATS-GMM
       5.3.2 Tandem-GMM
       5.3.3 HATS-GMM
   5.4 Results Using Speaker-MLP Features
       5.4.1 Cross-validation Accuracies
       5.4.2 Hidden Activations in SVM System (Speaker-SVM)

6 Conclusion

References

List of Figures

2.1 Generation of MFCC features
2.2 Generation of PLP features
4.1 HATS-MLP System (taken from Chen's Learning Long-Term Temporal Features in LVCSR Using Neural Networks [1])
4.2 Tandem/HATS-MLP Features
4.3 Speaker-MLP Training Setup
4.4 Speaker-MLP Interpretation
5.1 Tandem/HATS-GMM System Setup
5.2 DET curves for Tandem/HATS-GMM combination with Basic-GMM
5.3 DET curves for Tandem/HATS-GMM combination with SRI-GMM
5.4 DET curves for the Tandem/HATS-GMM combination with the Basic-GMM under matched and mismatched conditions
5.5 DET curves for the Tandem/HATS-GMM combination with the SRI-GMM under matched and mismatched conditions
5.6 Tandem-GMM System Setup
5.7 DET curves for Tandem-GMM combination with Basic-GMM
5.8 DET curves for Tandem-GMM combination with SRI-GMM
5.9 DET curves for the Tandem-GMM combination with the Basic-GMM under matched and mismatched conditions
5.10 DET curves for the Tandem-GMM combination with the SRI-GMM under matched and mismatched conditions
5.11 HATS-GMM System Setup
5.12 DET curves for HATS-GMM combination with Basic-GMM
5.13 DET curves for HATS-GMM combination with SRI-GMM
5.14 DET curves for the HATS-GMM combination with the Basic-GMM under matched and mismatched conditions
5.15 DET curves for the HATS-GMM combination with the SRI-GMM under matched and mismatched conditions
5.16 Speaker-SVM System Setup
5.17 DET curves for the Speaker-SVM systems using different numbers of Speaker-MLP training speakers
5.18 DET curves for Speaker-SVM combination with Basic-GMM
5.19 DET curves for Speaker-SVM combination with SRI-GMM
5.20 DET curves for Basic-GMM, Speaker-SVM, and their score-level combination, under matched and mismatched conditions
5.21 DET curves for SRI-GMM, Speaker-SVM, and their score-level combination, under matched and mismatched conditions

List of Tables

5.1 Tandem/HATS-GMM system improves upon the Basic-GMM system, especially in combination, but there is no improvement for the SRI-GMM system
5.2 Fusion results with the Tandem/HATS-GMM system for both matched and mismatched conditions show improvements for the Basic-GMM, but no change for the SRI-GMM
5.3 Tandem-GMM system yields improvements over the Basic-GMM system, especially in combination, but none for the SRI-GMM system
5.4 Fusion results with the Tandem-GMM system show improvements for the Basic-GMM, especially for the mismatched case, but only improvements in the matched condition for the SRI-GMM
5.5 HATS-GMM system results show gains for combination with the Basic-GMM, but none for fusion with the SRI-GMM
5.6 Fusion results with the HATS-GMM system show improvements in the matched and mismatched cases for the Basic-GMM, but only improvement in the matched case for the SRI-GMM
5.7 CV accuracy improves as the number of hidden units increases
5.8 Speaker-SVM results improve as the number of hidden units increases
5.9 System combination with the 64 speaker, 1000 hidden unit Speaker-SVM improves Basic-GMM results, but not the SRI-GMM
5.10 Fusion results with the Speaker-SVM system for both matched and mismatched conditions show improvement for the Basic-GMM, but virtually no change for the SRI-GMM
6.1 Results for the MLP-based systems and their feature-level and score-level combinations with the Basic-GMM
6.2 Results for the MLP-based systems and their score-level combinations with the SRI-GMM
6.3 Breakdown of results for matched and mismatched conditions for the MLP-based systems and their score-level fusions with the Basic-GMM
6.4 Breakdown of results for matched and mismatched conditions for the MLP-based systems and their score-level fusions with the SRI-GMM

Acknowledgements

First and foremost, I would like to thank Joe Frankel and Nikki Mirghafori, who collaborated with me on this work. Joe's contributions and enthusiasm for the project and Nikki's suggestions and insight were both invaluable and much appreciated. I also want to extend my gratitude to my official research advisor, Professor Nelson Morgan, who allowed me to join the speech group at the International Computer Science Institute (ICSI), and who has provided me with a useful perspective on my project work. For their support and helpful comments, I thank the speaker recognition group at ICSI, including George Doddington, Andy Hatch, Howard Lei, Mary Knox, and Christian Mueller. Additional thanks go to Sachin Kajarekar and Andreas Stolcke of SRI, who provided technical expertise for my system implementation. This work was supported by a National Science Foundation Graduate Research Fellowship, and by a UC-Berkeley Graduate Research Opportunity Fellowship.

Chapter 1

Introduction

The speaker recognition task is that of deciding whether or not a new test utterance belongs to a given target speaker, for whom there is only a limited amount of training data available. The traditionally successful approach to speaker recognition uses low-level cepstral features extracted from speech in a Gaussian mixture model (GMM) system [2]. Alternative approaches attempt to make use of so-called high-level features, which involve, for instance, prosody and word usage [3]; however, systems using low-level features remain the best base systems, as the contribution of high-level approaches generally comes in combination with an approach using low-level features.

We, as humans, are able to recognize a speaker's identity when we hear him speak, provided that the speaker is known to us, that is, we have heard enough of his speech. While a human listener is able to extract the information necessary to identify a speaker under a wide range of conditions, it is a significant challenge for an automatic speaker recognition system to extract information from the speech in a meaningful way. Although cepstral features have proven to be the most successful choice of low-level features for speech processing, discriminatively trained features in a speaker space may be better suited to the speaker recognition problem. Accordingly, the focus of this report is the development of low-level features specifically designed for use in the task of speaker recognition.

The approach taken in this report utilizes multi-layer perceptrons (MLPs) as a means of performing a feature transformation of cepstral features. A comparison is made between using MLP networks trained to discriminate between phones and using MLPs trained to discriminate between speakers. In the former case, Tandem/HATS features, originally developed at ICSI for speech recognition, are tested in a GMM speaker recognition system. In the latter case, 3-layer MLPs of different sizes, that is, with varying numbers of hidden units, are trained to distinguish between a set of output speakers; several sizes of training speaker sets are considered as well. The hidden activations of these Speaker-MLPs are then tested as features in a support vector machine (SVM) system.

The basic motivation behind testing the Tandem/HATS features in a speaker recognition system is to determine whether or not the inclusion of longer-term (i.e., around 500 ms) phonetic information is able to aid in distinguishing between speakers. Although these phonetic features have been able to improve speech recognition systems, and other speaker recognition systems have been successful in utilizing phonetic content, to the best of my knowledge the Tandem/HATS features have never been tested in a speaker recognition task. By training the Speaker-MLPs to discriminate between speakers, the features they produce should, by design, be speaker discriminative, and thus well suited to speaker recognition. Although similar speaker-discriminative MLP work has been done, the Speaker-MLPs of this report experiment with larger network topologies, with the idea that increasing the number of training speakers and the size of the network may yield additional improvements in performance.

The report begins with background material relevant to speaker recognition, including a description of the task, cepstral features, GMM and SVM systems, and various normalizations, in Chapter 2. A discussion of research related to a discriminant feature approach follows in Chapter 3, including work done in developing phonetically discriminative and speaker discriminative features, some of which also involves the use of MLPs. Chapter 4 explains this report's approach and methodology for feature generation and usage, including the motivation, detailed descriptions of the Tandem/HATS-MLP features and Speaker-MLP features, the baseline systems, and how the systems are combined. Chapter 5 outlines the data used for training and testing, the performance measures presented, the experimental setup, and the results of the experiments. Finally, the report ends with conclusions in Chapter 6.

Chapter 2

Background

In order to provide a framework of relevant information, this chapter presents a brief overview of topics related to speaker recognition. The basic knowledge given here will be assumed for the remainder of the report. To begin, the characteristics of the speaker recognition problem are described in Section 2.1.

2.1 The Speaker Recognition Problem

As its name implies, automatic speaker recognition attempts to recognize, or identify, a given speaker by processing his/her speech automatically, that is to say, in a fully objective and reproducible manner, without the aid of human listening. In order to be able to recognize the speaker of a given test utterance, it is necessary to have training data first, so that the system can learn the speaker(s) of interest.

The term speaker recognition can be used to refer to a variety of tasks. One type of task is speaker identification, in which the system must produce the identity of the speaker, given a test utterance, from a set of speakers. With closed-set speaker identification, the number of speakers in the set is fixed, and the system must choose which among the given speakers is a match to the speaker of the test utterance. Open-set speaker identification adds a layer of complexity by allowing the test utterance to belong to a speaker not in the set of speakers for whom there is training data available. A second type of task is speaker verification, which involves a hypothesized target speaker match to the test speaker, and the system must determine whether or not the test speaker identity is as claimed.

Regardless of the type of task, the problem may be further characterized as being text-dependent or text-independent. In the text-dependent case, the train and test utterances are required to be a specific word or set of words; the system can then exploit the knowledge of what is spoken in order to better make a decision. For the text-independent case, there is no constraint on what is said in the speech utterances, allowing for generalization to a wider variety of situations.

This report focuses on the text-independent speaker verification task. For each target, or hypothesis, speaker and test utterance pair, the system must decide whether or not the speaker identities are the same. In this case, two types of errors arise: false acceptance, or false alarm, and false rejection, or missed detection. A false accept occurs when the system incorrectly verifies an impostor test speaker as the target speaker. A false reject occurs when the system incorrectly identifies the test speaker as an impostor speaker.

In order to perform a speaker recognition task, the system must first parameterize the speech in a meaningful way that will allow the system to distinguish and characterize speakers and their speech; this step is addressed in Section 2.2, which discusses two types of cepstral features commonly used in speech processing applications.

For speaker verification, which is essentially a binary decision task, a statistical framework is utilized in the following way. Models are trained both for the target speaker, using the training data, and for a background speaker model, using separate data from a large number of speakers, representative of some generic speaker. Then, the probability of the test utterance is determined for each model, and a score relating to the likelihood ratio of the probabilities of the test utterance given each model is produced. Two commonly used approaches to the statistical modeling, Gaussian mixture models and support vector machines, are discussed in Section 2.3.

Although there are many additional issues that arise in the area of speaker recognition, this report will only consider one more type of difficulty, that of channel mismatch. Since the speaker recognition system directly uses the speech signal, in parameterized form, variations in the signal due to channel type or noise can significantly impact performance of the system. If there is a channel mismatch between the training and test data, i.e., if they were recorded on different channels, then the system may falsely reject a true target speaker match because the data appears to belong to a different speaker as a result of the channel variation. Normalizations developed to address such a channel issue are discussed in Section 2.4. These include normalizations that are applied at the score level, the feature level, and the model level.

2.2 Cepstral Features

The process of parameterizing speech is referred to as feature extraction. Low-level features are those based directly on frames of the speech signal, where a frame refers to a window that is typically 25 ms long. High-level features, on the other hand, incorporate information from more than just one frame of speech, and include, for example, speaker idiosyncrasies, prosodic patterns, pronunciation patterns, and word usage. There are two types of low-level cepstral features relevant to this report, and they are explained here.

Mel-frequency cepstral coefficients, or MFCCs, are generated by the process shown in Figure 2.1. First, an optional preemphasis filter is applied, to enhance the higher spectral frequencies and compensate for the unequal perception of loudness at different frequencies. Next, the speech signal is windowed into 20-30 ms frames (with a 10 ms frame shift) and the absolute value of the fast Fourier transform (FFT) is calculated for each frame. A Mel-frequency triangular filterbank is then applied, where the Mel scale is an auditory scale based on pitch perception. The Mel frequency is related to the linear frequency scale by

f_{Mel} = 1127 \ln\!\left(1 + \frac{f_{linear}}{700}\right)    (2.1)

For this work, the number of filters is 24. After the spectrum has been smoothed, the log is taken. Finally, a discrete cosine transform (DCT) is applied to obtain the cepstral coefficients c_n:

c_n = \sum_{k=1}^{K} S_k \cos\!\left[ n \left(k - \frac{1}{2}\right) \frac{\pi}{K} \right], \qquad n = 1, 2, \ldots, L    (2.2)

where S_k are the log-spectral vectors from the previous step, K is the total number of log-spectral coefficients, and L is the number of coefficients to be kept (this is called the order of the MFCCs), with L \leq K.

Figure 2.1. Generation of MFCC features (speech signal -> preemphasis filter -> window -> FFT -> Mel-frequency filterbank -> log -> DCT -> cepstral coefficients)

Perceptual linear prediction coefficients, or PLPs, are based on linear predictive coding, and are quite similar to MFCCs. The process of generating PLPs is shown in Figure 2.2. First, the speech signal is windowed, as before, into 20-30 ms frames with a 10 ms frame shift, and the magnitude of the FFT is calculated. Then, a filterbank is applied, this time using trapezoidally shaped filters on a Bark scale, where

f_{Bark} = 13 \arctan(0.00076\, f_{linear}) + 3.5 \arctan\!\left(\left(\frac{f_{linear}}{7500}\right)^{2}\right)    (2.3)

Next is the pre-emphasis step, which is done by weighting the spectrum with an equal-loudness curve, which again emphasizes the higher frequencies. Instead of taking the log, as in the MFCC case, the cube root is taken, as an approximation to the power law of hearing that relates intensity to loudness. As with MFCCs, the inverse DFT is taken; in the PLP case, the results are not cepstral coefficients, but are instead similar to autocorrelation coefficients. Next, the compressed spectrum is smoothed using an autoregressive model. Finally, the autoregressive coefficients are converted to cepstral coefficients.

Figure 2.2. Generation of PLP features (speech signal -> window -> FFT -> Bark filterbank -> equal-loudness pre-emphasis -> cube root -> IDFT -> spectral smoothing -> cepstral transform -> PLPs)

In addition to using the MFCCs or PLPs, it is common to include estimates of their first, second, and possibly third derivatives as additional features. These are referred to as deltas, double-deltas, and triple-deltas.
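To make the MFCC pipeline of Figure 2.1 and equations (2.1)-(2.2) concrete, the following is a minimal numpy/scipy sketch. It is illustrative only, not the feature-extraction code used for the experiments in this report; the sampling rate, FFT size, and windowing details are assumptions.

# Minimal MFCC sketch following Figure 2.1 and equations (2.1)-(2.2); illustrative only.
import numpy as np
from scipy.fftpack import dct

def mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)           # equation (2.1)

def inv_mel(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mfcc(signal, sr=8000, frame_len=0.025, frame_shift=0.010,
         n_filters=24, order=12, preemph=0.97):
    # optional preemphasis filter
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    flen, fshift, nfft = int(frame_len * sr), int(frame_shift * sr), 512
    # triangular filterbank with centers equally spaced on the Mel scale
    edges = inv_mel(np.linspace(0.0, mel(sr / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = []
    for start in range(0, len(signal) - flen + 1, fshift):
        frame = signal[start:start + flen] * np.hamming(flen)
        spec = np.abs(np.fft.rfft(frame, nfft))        # |FFT| of the windowed frame
        logspec = np.log(fbank @ spec + 1e-10)         # Mel filterbank, then log
        # DCT of the log-spectrum, equation (2.2); keep coefficients 1..order
        feats.append(dct(logspec, type=2, norm='ortho')[1:order + 1])
    return np.array(feats)                             # shape (num_frames, order)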

The polynomial approximations of the first and second derivatives are as follows:

\Delta c_m = \frac{\sum_{k=-l}^{l} k\, c_{m+k}}{\sum_{k=-l}^{l} |k|}, \qquad \Delta^2 c_m = \frac{\sum_{k=-l}^{l} k^2\, c_{m+k}}{\sum_{k=-l}^{l} k^2}    (2.4)

Furthermore, an energy term and/or its derivative can also be included in the feature parameterization.

2.3 System Approaches

2.3.1 Gaussian Mixture Model (GMM)

The Gaussian mixture model is a powerful tool for modeling certain types of unknown distributions effectively. The GMM uses a mixture of multivariate Gaussians to model the probability density function of observed variables. That is, for a GMM with N Gaussians and an n-dimensional variable x, the probability density is given by

p(x \mid \lambda) = \sum_{i=0}^{N-1} \pi_i \, \mathcal{N}(x; \mu_i, \Sigma_i)    (2.5)

where \pi_i are the mixture weights, which sum to 1, and \mathcal{N}(x; \mu_i, \Sigma_i) are Gaussian distributions with mean vectors \mu_i and covariance matrices \Sigma_i, specifically,

\mathcal{N}(x; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)    (2.6)

The model parameters are denoted by \lambda = (\pi_i, \mu_i, \Sigma_i), for i = 0, \ldots, N-1. The expectation-maximization (EM) algorithm iteratively learns the model parameters from the data, which are the observations. The covariance matrix is typically chosen to be diagonal, for improved computational efficiency as well as better performance.

In the context of using features extracted from speech, each feature vector would correspond to x in equation (2.5). Based on the assumption that speech frames are independent, the individual frame probabilities can be multiplied to obtain the probability of a speech utterance. That is, the probability of a speech segment X, composed of feature vectors \{x_0, x_1, \ldots, x_{M-1}\}, is given by

p(X \mid \lambda) = \prod_{j=0}^{M-1} \sum_{i=0}^{N-1} \pi_i \, \mathcal{N}(x_j; \mu_i, \Sigma_i)    (2.7)

for a mixture of N Gaussians.

In a speaker recognition setting, there are several GMM approaches that can be taken.
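As an illustration of equations (2.5) and (2.7), the short sketch below fits a diagonal-covariance GMM with scikit-learn and scores an utterance against it. Working in the log domain, the product over frames in equation (2.7) becomes a sum of per-frame log-likelihoods. This is a toy example with made-up dimensions and random stand-in data, not the GMM implementation used for the experiments in this report.

# Toy illustration of equations (2.5) and (2.7) with scikit-learn; not the system used here.
import numpy as np
from sklearn.mixture import GaussianMixture

n_dim, n_gauss = 25, 256                     # assumed feature dimension / mixture size
train_feats = np.random.randn(20000, n_dim)  # stand-in for training feature vectors

gmm = GaussianMixture(n_components=n_gauss, covariance_type='diag', max_iter=50)
gmm.fit(train_feats)                         # EM training of lambda = (pi_i, mu_i, Sigma_i)

test_utt = np.random.randn(3000, n_dim)      # stand-in for a test utterance X
# score_samples returns log p(x_j | lambda) per frame (equation (2.5));
# summing over frames gives log p(X | lambda), the log of equation (2.7).
log_p_X = gmm.score_samples(test_utt).sum()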

Here, only the currently prevalent approach is described, for which two GMM models are needed: one for the target speaker and one for the background model [2]. Using training data from a large number of speakers, a speaker-independent universal background model, or UBM, is generated. So that every target speaker model is in the same space and can be compared to one another, the speaker-dependent models are adapted from the UBM using maximum a posteriori (MAP) adaptation. For a given test utterance X and a given target speaker, a log likelihood ratio (LLR) can then be calculated:

LLR(X) = \log p(X \mid \lambda_{target}) - \log p(X \mid \lambda_{UBM})    (2.8)

Comparing the LLR to a threshold, \Theta, determines the decision made about the test speaker's identity: if LLR(X) > \Theta, the test speaker is identified as a true speaker match; otherwise, the test speaker is determined to be an impostor.

2.3.2 Support Vector Machine (SVM)

Support vector machines, or SVMs, are a supervised learning method that can be used for pattern classification problems [4]. For binary classification, which is the task of interest here, the SVM is a linear classifier that finds a separating hyperplane between data points in each class. Furthermore, the SVM learns the maximum-margin hyperplane that will separate the data, making it a maximum-margin classifier. In this case, the model for each target speaker is the defining hyperplane, and instead of probabilities for data given a distribution, distances from the hyperplane are used. In mathematical terms, the SVM problem can be formulated as

\min_{w, b, \xi} \; \|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad 1 \leq i \leq n    (2.9)

where \xi_i are slack variables, x_i are the training data points, y_i are the corresponding class labels (+1 or -1), C is a constant, and w and b are the hyperplane parameters. Essentially, the goal is to find the hyperplane such that sign(w \cdot x_i - b) = y_i, up to some soft margin involving \xi_i.

The SVM is used in speaker recognition by taking one or more positive examples of the target speaker, as well as a set of negative examples of impostor speakers, and producing a hyperplane decision boundary. Since there are far more impostor speaker examples than target speaker examples, a weighting factor is used to make the target example(s) count as much as all of the impostor examples. Once the hyperplane for a given target speaker is known, the test speaker can be classified as belonging to either the target speaker or the impostor speaker class. Instead of a log likelihood ratio, a score can be produced by using the distance of the test data from the hyperplane boundary.
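A minimal sketch of this target-versus-impostors setup, using scikit-learn's linear SVM, is given below. It is illustrative only, with random vectors standing in for per-utterance feature vectors; the class_weight setting plays the role of the weighting factor described above, and the decision_function value serves as the distance-from-hyperplane score.

# Toy target-vs-impostor SVM following equation (2.9); not the SVM system of this report.
import numpy as np
from sklearn.svm import SVC

dim = 1000                                    # assumed per-utterance feature dimension
target = np.random.randn(1, dim)              # one positive example of the target speaker
impostors = np.random.randn(500, dim)         # many negative (impostor) examples

X = np.vstack([target, impostors])
y = np.array([1] + [-1] * len(impostors))

# class_weight makes the single target example count as much as all of the impostors
svm = SVC(kernel='linear', C=1.0, class_weight={1: len(impostors), -1: 1})
svm.fit(X, y)

test_utt = np.random.randn(1, dim)
score = svm.decision_function(test_utt)[0]    # signed distance from the hyperplane
decision = 'target' if score > 0 else 'impostor'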

2.4 Normalizations

One issue that arises in speaker recognition and other speech-related processing is that of a channel mismatch. Although cepstral features extracted from the speech would ideally be robust to channel variation, in practice this does not, in general, prove to be true. Here, various normalizations that address this problem are discussed. H-norm and T-norm are examples of compensations at the score level, which are discussed in Section 2.4.1. The feature-level approaches of cepstral mean subtraction (CMS) and feature mapping are described in Section 2.4.2. Finally, the model-based approach of speaker model synthesis (SMS) is covered in Section 2.4.3. Since these approaches are taken in different domains and in varying ways, their effects on equalizing a channel mismatch can potentially be additive. For instance, it is not uncommon to use CMS, feature mapping, and T-norm in the same system, with a gain coming from each. At the same time, it's possible that applying additional channel equalization techniques will not yield any improvement over using fewer such methods.

2.4.1 Score-level

Although it does not specifically address the channel variation problem, one type of score-level normalization is zero normalization, or Z-norm [5]. In Z-norm, an impostor score distribution is obtained by testing a speaker model against impostor speech utterances. Then, the statistics of this speaker-dependent impostor distribution, namely the mean and variance, are used to normalize the scores produced for that speaker. That is, for a test utterance X and a target speaker model T,

S_{ZN}(X) = \frac{S(X) - \mu_{impostor}(T)}{\sigma_{impostor}(T)}    (2.10)

where S_{ZN}(X) is the normalized score, S(X) is the original score, and \mu_{impostor}(T) and \sigma_{impostor}(T) are the mean and standard deviation of the distribution of impostor scores for target model T.

A variant of Z-norm is handset normalization, or H-norm, which aims to address the issue of having different handsets for the training and testing data [6]. H-norm tries to remove the handset-dependent biases present in the scores produced, and it requires a handset detector to label the handset of the speech segments. For each speaker, handset-dependent means and variances are determined for each type of handset (typically electret and carbon-button) by generating scores for a set of impostor test utterances from each handset type. Then, the score is normalized by the mean and standard deviation of the distribution corresponding to the handset of the test utterance, as determined by the handset detector. For test utterance X,

S_{HN}(X) = \frac{S(X) - \mu(HS(X))}{\sigma(HS(X))}    (2.11)

where S_{HN}(X) is the new score, S(X) is the original score, and HS(X) is the handset label of X.

The final normalization of interest is test normalization, or T-norm, which generates scores for a test utterance against impostor models (in addition to the target model), in order to estimate the impostor score distribution's statistics [7]. T-norm is a test-dependent normalization, since the same test utterance is used for testing and for generating normalization parameter estimates. In mathematical terms,

S_{TN}(X) = \frac{S(X) - \mu_{impostor}(X)}{\sigma_{impostor}(X)}    (2.12)

where S_{TN}(X) is the normalized score, S(X) is the original score, and \mu_{impostor}(X) and \sigma_{impostor}(X) are the mean and standard deviation of the distribution of scores for test utterance X against the set of impostor speaker models.

2.4.2 Feature-level

Cepstral mean subtraction (CMS) is a fairly simple technique that is applied at the feature level [8]. CMS subtracts the time average from the output cepstrum in order to produce a zero-mean log cepstrum. That is, for a temporal sequence of each cepstral coefficient c_m,

\hat{c}_m(t) = c_m(t) - \frac{1}{T} \sum_{\tau=1}^{T} c_m(\tau)    (2.13)

The purpose of CMS is to remove the effects of the transmission channel, yielding improved robustness. However, any non-linear channel effects will remain, as will any time-varying linear channel effects. Furthermore, CMS can remove some of the speaker characteristics, as the average cepstrum does contain speaker-specific information.

Another feature-level channel compensation method is feature mapping [9]. Feature mapping aims to map features from different channels into the same channel-independent feature space. A channel-independent root GMM is trained, and channel-dependent background GMMs are adapted from the root. Feature-mapping functions are obtained from the model parameter changes between the channel-independent and channel-dependent models. The most likely channel is detected for the speaker data, which is then mapped to the channel-independent space. Adaptation to target speaker models is done using mapped features, and during verification, the mapped features of the test utterance are used for scoring. The root GMM is used as the UBM for calculating the log likelihood ratios.

2.4.3 Model-level

Speaker model synthesis (SMS) is a model-based technique that utilizes channel-dependent models [10]. Rather than having one speaker-independent UBM, the SMS approach begins with a channel- and gender-independent root model, and then uses Bayesian adaptation to obtain channel- and gender-dependent background models. Channel-specific target speaker models are also adapted from the appropriate background model, after the gender and channel of the target speaker's training data have been detected. Furthermore, a transformation for each pair of channels is calculated using the channel-dependent background models; this transformation maps the weights, means, and variances of a channel a model to the corresponding parameters of a channel b model.

During testing, if the detected channel of the test utterance matches the type of channel of the target speaker model, then that speaker model and the appropriate channel-dependent background model are used to calculate the LLR for that test utterance. On the other hand, if the detected channel of the test utterance is not a match to the target speaker model, then a new speaker model is synthesized using the previously calculated transformation between the target and test channels. Then, the synthesized model and the corresponding channel-dependent background model are used to calculate the LLR for the test utterance.
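To close this chapter, the following sketch makes two of the simpler normalizations above concrete: CMS from equation (2.13) and T-norm from equation (2.12). It is a minimal numpy illustration, assuming features and scores are already available as arrays, rather than code from any of the systems described later.

# Minimal numpy illustrations of CMS (equation (2.13)) and T-norm (equation (2.12)).
import numpy as np

def cms(features):
    """Cepstral mean subtraction over a (num_frames, num_coeffs) feature array."""
    return features - features.mean(axis=0)

def t_norm(raw_score, impostor_scores):
    """Normalize a target-model score using scores of the same test utterance
    against a set of impostor models."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

# Example: a test utterance scored against the target model and 100 impostor models.
raw = 2.3
impostor_scores = np.random.randn(100)
normalized = t_norm(raw, impostor_scores)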

Chapter 3

Related Work

In speech processing applications, the development of discriminant features typically involves discrimination based on either phones or speakers, i.e., the transformation or projection of features into either a phonetic or a speaker space. There are also some approaches that cannot be identified as involving just one or the other, but are rather a mixture incorporating both phone and speaker spaces. This chapter outlines the prior work most related to the approach taken in this report.

3.1 Phonetically Discriminative Features

The use of features generated by an MLP (or possibly MLPs) trained to distinguish between phones has been shown to improve performance for automatic speech recognition (ASR). Researchers at ICSI developed Tandem/HATS-MLP features, which incorporate longer-term temporal information through the use of MLPs whose outputs are phone posteriors [1, 11-14]. The Tandem component of the features is generated using a 3-layer MLP, which takes as input 9 frames of perceptual linear prediction (PLP) features, with deltas and double-deltas (about 200 ms), trained to output 46 phone posteriors. The HATS, or Hidden Activation TempoRAl PatternS, component involves two stages of MLPs. The first stage is a set of MLPs, one for each critical frequency band, which each take as input 500 ms of critical band energies, and are each trained to output 46 phone posteriors; the hidden activation outputs of these MLPs are then joined using the second-stage MLP. Finally, the Tandem and HATS phone posteriors are merged using a weighted average, and further transformed by log, principal component analysis (PCA), and truncation to produce the final features.

For the Large Vocabulary Conversational Speech Recognition (LVCSR) of the NIST 2001 Hub-5 test set, augmenting PLPs with the Tandem/HATS MLP features yielded a 10% relative error reduction over a PLP baseline [12]. For the 2004 Rich Transcription evaluation, a system using MFCCs augmented with the Tandem/HATS MLP features yielded a 9.9% relative improvement over a baseline system that uses MFCC and PLP features [13].

3.2 Speaker Discriminative Features

3.2.1 Using Neural Networks

The work most directly related to my own also uses artificial neural networks (ANNs) trained to perform speaker discrimination. Heck and Konig, et al., focused on extracting speaker discriminative features from Mel-frequency cepstral coefficients (MFCCs) using an MLP [15, 16]. A 5-layer MLP was trained to discriminate between 31 speakers; features were then extracted by taking the 34 outputs from the third, bottleneck layer of the MLP for use in a GMM speaker recognition system. The input to the MLP consisted of 9 frames of 17th order MFCCs and an estimate of pitch. A comparison of frame-level cross-validation results, which were found to correlate strongly with the results of the complete speaker verification system, i.e., the results when using the features in a GMM system, showed that 9 frames of MFCCs performed better than just 1 frame, and that the inclusion of pitch as an input also improved performance. Furthermore, the MLP features, when combined at the score level with a cepstral-only system, yielded consistent improvement when the training data and testing data were collected from mismatched telephone handsets [15].

In a similar approach, Morris and Wu, et al., generated MLP-based features for speaker identification experiments [17-19]. However, in addition to the 5-layer MLP topology, they also tested 3-layer and 4-layer MLPs, finding that features taken from the net-input values to the second hidden layer of the 5-layer MLP performed the best [18]; for each network topology, the dimension of the bottleneck hidden layer (from which the features were obtained) was 19, the same as the number of inputs, while the other hidden layers used 100 units. The issue of selecting a basis of speakers to use for training the MLP was also addressed by Morris et al., who concluded that a Maximum Average Distance method of speaker basis selection, where the distance is an estimate of the Kullback-Leibler distance, worked best [18]. Speaker identification performance also improved as more speakers were used to train the MLP, up to a certain limit [17, 19].

3.2.2 Using a Speaker Space

For the tasks of speaker recognition and identification, several methods specifically consider a speaker space. Sturim, et al. utilized anchor models in order to create a characterization vector, where each component is a likelihood score from one of N pre-trained anchor models, for each speech utterance [20]. A vector distance between two characterization vectors, which are essentially projections of the speech utterances into a speaker space, could then be used to compare a target speaker's speech segment with an unknown speaker's speech segment. When testing the anchor system in a speaker detection task, they found that the baseline GMM system had significantly better performance than the anchor system. However, when considering a speaker indexing task, the computational efficiency of the anchor model system provided an advantage.

Mami and Charlet used a similar anchor model approach, with certain modifications. They found that using the angle between coordinate vectors of speakers yielded better results than using the distance between the vectors, and that orthogonalization of the vectors through linear discriminant analysis (LDA) post-processing also improved speaker identification performance [21]. When tested on a France Telecom R&D telephone speech database for a speaker identification task, they found that the percentage of correctly identified speakers increased from the GMM system's 66.0% to as much as 76.6% when using an anchor model speaker space. Furthermore, they explored the effect of changing the number of speakers in the space, finding that, over a range of 50 to 500 speakers, the optimal performance corresponded to having a space of 200 speakers.

Thyes, et al. adopted another speaker-space approach termed Eigenvoices [22]. By employing an eigenspace obtained from (GMM) models for a set of training speakers, a new target speaker model is constrained to the eigenspace through adaptation using Maximum Likelihood EigenDecomposition (MLED). Then, a test speaker can be projected into the eigenspace using MLED, and the distance between the test and target speaker points can be used to do eigendistance decoding. Alternatively, the points in the eigenspace can be used to generate speaker models from which the likelihood of the test data can be calculated, in order to perform eigenGMM decoding. The Eigenvoices technique performed significantly better than a conventional GMM approach in the case when there was very limited (10 seconds) training data for the target speakers; in situations where far more training data was available, the conventional GMM system outperformed Eigenvoices.

3.3 Hybrid Phonetic and Speaker Approaches

There are relevant hybrid phone- and speaker-based approaches as well. Genoud et al. used speaker-adapted connectionist models to perform both speech and speaker recognition [23]. In this case, a baseline world 3-layer MLP network was trained to discriminate between 53 phones plus non-speech, using 9 frames of 12th order MFCCs as input, with a hidden layer of 2000 units. Speaker-adapted Twin-Output MLPs (TO-MLPs) were produced by cloning the 53 phone-class output units of the world net, for a total of 107 output units, and then performing a second stage of training using equal amounts of both speaker-specific and world training data. Essentially, the MLP models target-speaker-specific phones as well as phones produced by non-target speakers. Speech recognition performance improved in the case when the test utterances matched the target speaker. The speaker recognition results seemed promising, but had no baseline for comparison.

Another hybrid approach is that of Stolcke et al., who used maximum-likelihood linear regression (MLLR) transforms as features for speaker recognition [24]. In the context of a speech recognition system, MLLR applies an affine transform to the Gaussian mean vectors in order to map speaker-independent means to speaker-dependent means. The coefficients from one or more of these MLLR adaptation transforms are then used in an SVM speaker recognition system. Thus, the MLLR transforms apply phonetic features (from the speech recognition) in a speaker space. The MLLR-SVM system yields a significant increase in speaker recognition performance.
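Of the speaker-space methods above, the anchor-model idea of Section 3.2.2 is simple enough to sketch: an utterance is represented by its vector of likelihood scores against N anchor models, and two utterances are compared by the distance (or, following Mami and Charlet, the angle) between their characterization vectors. The sketch below is a hypothetical illustration, not an implementation from any of the cited papers; the anchor models are assumed to behave like scikit-learn GaussianMixture objects, whose score method returns an average per-frame log-likelihood.

# Hypothetical sketch of anchor-model characterization vectors (Section 3.2.2).
import numpy as np

def characterization_vector(utterance_feats, anchor_gmms):
    """Score one utterance against each pre-trained anchor GMM, i.e., project the
    utterance into the speaker space spanned by the anchors."""
    return np.array([gmm.score(utterance_feats) for gmm in anchor_gmms])

def angle_between(v1, v2):
    """Angle-based comparison of two characterization vectors."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos, -1.0, 1.0))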

Chapter 4

Approach

4.1 Motivation

The discriminative powers of MLPs, which incorporate longer-term information, i.e., on the order of hundreds of milliseconds rather than the tens of milliseconds spanned by traditional frame-level features, are utilized in order to extract additional speaker-relevant information from low-level cepstral features. Phonetically discriminant Tandem/HATS-MLP features, originally developed for automatic speech recognition (ASR), are applied to a speaker recognition task. Inspired by the well-established infrastructure for neural network training at ICSI, speaker discriminant Speaker-MLP networks are implemented on a much larger scale than any previous work, with the hope that a larger 3-layer MLP topology may yield improved results over much smaller, 5-layer MLPs.

In automatic speech recognition, it was observed that including features capturing long-term information in addition to short-term features improved upon systems that used only short-term features. In the hope that this result will translate to the speaker recognition domain as well, MLP-based features seem to be a natural direction to take.

4.2 Method

My approach to the generation of discriminant features is twofold, considering both discrimination by phones and discrimination by speakers.

In the phonetic space, Tandem/HATS-MLP features developed at ICSI are applied to a speaker recognition task. Although these features have been shown to improve performance in ASR, as far as I am aware, they have never been tested in a speaker recognition application. The Tandem/HATS combined features, as well as each of the Tandem-MLP and HATS-MLP features alone, are tested in a GMM speaker recognition system. The purpose of using such features is to take advantage of the phonetic information of a speaker in order to distinguish that speaker from others. The generation of Tandem/HATS features is discussed in detail in Section 4.3.

In the speaker space, 3-layer Speaker-MLPs of varying sizes are trained to discriminate between a set of speakers; the Speaker-MLPs are then used to generate features for a speaker recognition system. The first step in developing these Speaker-MLP features is the selection of the speakers that will be used to train the Speaker-MLP. The features produced will not prove to be very speaker discriminative if the Speaker-MLP is trained only on speakers that are unusual outliers, or only on speakers who sound very much alike. Accordingly, the two methods utilized for MLP training speaker selection are discussed in Section 4.4.1. The first, brute-force, method simply takes all the speakers with some minimum amount of training data available, while the second method performs speaker clustering in order to select small subsets of training speakers. The next step in the Speaker-MLP feature development is training the Speaker-MLPs, the details of which are described in Section 4.4.2. Section 4.4.3 considers the differences between using the hidden activations of the Speaker-MLP as features and using the output activations of the Speaker-MLP as features, and the reason behind choosing the hidden activations as features. Once the features have been taken from the Speaker-MLP, they must then be utilized in a speaker recognition system; Section 4.4.4 discusses the reasons why an SVM system is better suited to the hidden activation features than a GMM system.

In order to rate the performance of both types of MLP features, there must be a basis for comparison: Section 4.5 describes the two baseline cepstral GMM systems used for this purpose. Additionally, Section 4.6 describes how systems may be combined at the score level in order to yield an improvement in performance over the individual systems.

4.3 Tandem/HATS-MLP Features

As mentioned in Section 3.1, there are two components to the Tandem/HATS-MLP features, namely the Tandem-MLP and the HATS-MLP.

The Tandem-MLP is a single 3-layer MLP, which takes as input 9 frames of PLPs (12th order plus energy) with deltas and double-deltas, contains 20,800 units in its hidden layer, and has 46 outputs, corresponding to phone posteriors. The hidden layer applies the sigmoid function, while the output uses softmax. The Tandem-MLP utilizes medium-term information (roughly 100 ms) from the speech in order to recognize phonetic patterns.

The HATS-MLP is actually a set of MLPs, based on the TRAPS (TempoRAl PatternS) MLP architecture [25, 26], which uses two stages of MLPs to perform phonetic classification with long-term (roughly 500 ms) information. The first-stage MLPs take as input 51 frames of log critical band energies (LCBE), with one MLP for each of the 15 critical bands; each MLP has 60 hidden units (with sigmoid applied), and the output layer has 46 units (with softmax) corresponding to phones. For the HATS (Hidden Activation TRAPS) features, the hidden layer outputs are taken from each first-stage critical band MLP, and then input to the second-stage merger MLP, which contains 750 hidden units and 46 output units. Figure 4.1 shows the HATS-MLP system in detail, illustrating the difference between HATS and TRAPS features.

Figure 4.1. HATS-MLP System (taken from Chen's Learning Long-Term Temporal Features in LVCSR Using Neural Networks [1])

The Tandem-MLP and HATS-MLP features can either be used individually, or combined using a weighted sum, where the weights are a normalized version of inverse entropy. In all three cases (Tandem, HATS, and Tandem/HATS), the log is applied to the output, and a Karhunen-Loeve Transform (KLT) dimensionality reduction is applied to reduce the output feature vector to an experimentally determined optimal length of 25. This process is shown in Figure 4.2 for the Tandem/HATS-MLP features.

Figure 4.2. Tandem/HATS-MLP Features (speech -> PLP analysis -> 9 frames of PLP -> Tandem net, and speech -> critical band energy analysis -> 51 frames of critical band energy -> HATS net; the single-frame posteriors of the two nets are combined, followed by log and KLT, to give the Tandem/HATS features)

The Tandem/HATS MLP system is trained on roughly 1800 hours of conversational speech from the Fisher [27] and Switchboard [28] corpora.

4.4 Speaker-MLP Features

4.4.1 Training Speaker Selection

One Speaker-MLP uses as many output speakers for training as have enough conversations available, taking advantage of the idea that including more training speakers will yield better results. In contrast, Speaker-MLPs trained using only subsets of specifically chosen speakers are also implemented. These speakers were chosen through clustering in the following way. First, a background GMM model was trained using 286 speakers from the Fisher corpus. Then, a GMM was adapted from the background model with the data from each MLP-training speaker.

These GMMs used 32 Gaussians, with input features of 12th order MFCCs plus energy and their first-order derivatives. The length-26 mean vectors of each Gaussian were concatenated to form a length-832 feature vector for each speaker. Principal component analysis was performed, keeping the top 16 dimensions of each feature vector, accounting for 68% of the total variance. In this reduced-dimensionality speaker space, k-means clustering was done, using the Euclidean distance between speakers, for k = 64 and k = 128. Finally, the sets of 64 and 128 speakers were chosen by selecting the speaker closest to each of the 64 or 128 cluster centroids; a sketch of this selection procedure is given at the end of this section.

4.4.2 Training of the Speaker-MLP Networks

A set of 64, 128, or 836 speakers was used to train each Speaker-MLP, with 6 conversation sides per speaker used for training, and 2 for cross-validation. Training labels were produced using a 1-of-N encoding of the speakers. The acoustic waveform was parameterized as 12 PLPs plus energy, with first and second order derivatives appended. A 21-frame context window was used, giving 819 input units, and there was 1 output unit corresponding to each speaker in the training set. A softmax activation function was used at the output layer, with sigmoid activations for the hidden layer. The training setup is shown in Figure 4.3.

Figure 4.3. Speaker-MLP Training Setup (21 frames of PLP features from an output speaker's data are fed through a sigmoid hidden layer to a softmax output layer with one unit per training speaker, S_1 through S_N)

ICSI's QuickNet MLP training tool [29] was used with a training scheme in which an initial learn rate (0.008) is set and held constant until the frame-level CV accuracy improves by less than 0.5% absolute, at which point the learn rate is halved at each epoch until convergence.

For the 64 speaker set, two MLPs were trained, with either 400 or 1000 hidden units. Two MLPs were also trained for the 128 speaker set, with 1000 and 2000 hidden units. In both cases, the larger number of hidden units corresponds to the number of free parameters being 15% of the number of training frames.
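To make the clustering-based selection of Section 4.4.1 concrete, the following is a rough numpy/scikit-learn sketch of the pipeline described above (adapted GMM mean supervectors, PCA to 16 dimensions, k-means, then the speaker nearest each centroid). It illustrates the procedure rather than reproducing the actual scripts used; the adapted per-speaker GMMs are assumed to be available as objects with a means_ attribute of shape (32, 26).

# Sketch of the speaker-selection clustering of Section 4.4.1 (illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def select_training_speakers(speaker_gmms, k=64):
    """speaker_gmms: dict mapping speaker id -> adapted GMM with means_ of shape (32, 26)."""
    ids = list(speaker_gmms)
    # concatenate the 32 length-26 mean vectors into a length-832 supervector per speaker
    supervectors = np.array([speaker_gmms[s].means_.reshape(-1) for s in ids])
    # reduce to a 16-dimensional speaker space with PCA
    reduced = PCA(n_components=16).fit_transform(supervectors)
    # k-means clustering with Euclidean distances between speakers
    km = KMeans(n_clusters=k, n_init=10).fit(reduced)
    # pick the speaker closest to each cluster centroid
    selected = []
    for centroid in km.cluster_centers_:
        nearest = np.argmin(np.linalg.norm(reduced - centroid, axis=1))
        selected.append(ids[nearest])
    return selected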

For the 836 speaker net, several MLPs were trained, with the number of hidden units ranging from 400 up to a largest size for which the total number of free parameters is 2.5% of the number of training frames.

4.4.3 Using the Hidden Activations of the Speaker-MLP as Features

In keeping with the idea of utilizing a speaker space, the output activations of the Speaker-MLP may be used as features. Since the Speaker-MLP outputs are trained to correspond to a set of speakers, i.e., for a given training speaker's data at the input, the MLP output corresponding to that speaker will be 1, with all the other outputs at 0, the outputs of the Speaker-MLP for a new speaker's data can be thought of as a representation of the new speaker in the speaker space defined by the MLP's training speakers. However, preliminary experiments using the output activations in both GMM and SVM systems indicated that these features perform very poorly. This is most likely due to the noisiness of the output activations.

Alternatively, the hidden activations of the Speaker-MLP can be used as features, with two motivations. First, for the HATS-MLP system, the hidden activations proved to be more useful than the output activations from the critical band MLPs. Second is the work of Heck and Konig, et al., who used the bottleneck layer outputs from their 5-layer MLP as features, with the interpretation that the input-to-bottleneck portion of the MLP represents a feature extraction into a smaller feature space, and the bottleneck-to-output portion represents a closed-set speaker classifier [15]. In a similar interpretation, one can think of the input-to-hidden layer of the Speaker-MLP as a transformation of the input features into a general set of speaker patterns, and the hidden-to-output layer as a speaker classifier for the MLP training speakers, as shown in Figure 4.4. Thus, when the hidden activations are used as features, the closed-set speaker classifier is simply being replaced with a more general speaker recognition system.

4.4.4 Using an SVM Instead of a GMM System

The GMM implementation available for my use is limited to using features with fewer than 100 dimensions. Since the dimensionality of the hidden activations of the Speaker-MLPs exceeds this limit, it is necessary to perform dimensionality reduction in order to test these features in a GMM system. Such a dimensionality reduction represents a significant loss of information, as the order of the number of features is changing from hundreds or thousands to just tens. In some applications, reducing the dimension of a feature vector can actually improve performance, but for others, the information loss can prove detrimental. Previous work in speech recognition (HATS) has shown that there is a great deal of information in the hidden structure of the MLP. Accordingly, it is desirable to be able to test the Speaker-MLP features in their full dimensionality.

The GMM system is well suited to modeling features with fewer than 100 dimensions. However, problems of data sparsity and singular covariance matrices soon arise in trying to estimate high dimensional Gaussians. Therefore, in order to take advantage of the speaker discriminative information in the hidden activations of the Speaker-MLPs, an SVM speaker recognition system is used; such a system is better suited to handle the high dimensional sparse features, is naturally


More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT

INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT Takuya Yoshioka,, Anton Ragni, Mark J. F. Gales Cambridge University Engineering Department, Cambridge, UK NTT Communication

More information

Artificial Neural Networks written examination

Artificial Neural Networks written examination 1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

More information

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING Gábor Gosztolya 1, Tamás Grósz 1, László Tóth 1, David Imseng 2 1 MTA-SZTE Research Group on Artificial

More information

Investigation on Mandarin Broadcast News Speech Recognition

Investigation on Mandarin Broadcast News Speech Recognition Investigation on Mandarin Broadcast News Speech Recognition Mei-Yuh Hwang 1, Xin Lei 1, Wen Wang 2, Takahiro Shinozaki 1 1 Univ. of Washington, Dept. of Electrical Engineering, Seattle, WA 98195 USA 2

More information

Digital Signal Processing: Speaker Recognition Final Report (Complete Version)

Digital Signal Processing: Speaker Recognition Final Report (Complete Version) Digital Signal Processing: Speaker Recognition Final Report (Complete Version) Xinyu Zhou, Yuxin Wu, and Tiezheng Li Tsinghua University Contents 1 Introduction 1 2 Algorithms 2 2.1 VAD..................................................

More information

Generative models and adversarial training

Generative models and adversarial training Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?

More information

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers October 31, 2003 Amit Juneja Department of Electrical and Computer Engineering University of Maryland, College Park,

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Edinburgh Research Explorer

Edinburgh Research Explorer Edinburgh Research Explorer Personalising speech-to-speech translation Citation for published version: Dines, J, Liang, H, Saheer, L, Gibson, M, Byrne, W, Oura, K, Tokuda, K, Yamagishi, J, King, S, Wester,

More information

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,

More information

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models Navdeep Jaitly 1, Vincent Vanhoucke 2, Geoffrey Hinton 1,2 1 University of Toronto 2 Google Inc. ndjaitly@cs.toronto.edu,

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,

More information

Speaker Recognition. Speaker Diarization and Identification

Speaker Recognition. Speaker Diarization and Identification Speaker Recognition Speaker Diarization and Identification A dissertation submitted to the University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

On the Formation of Phoneme Categories in DNN Acoustic Models

On the Formation of Phoneme Categories in DNN Acoustic Models On the Formation of Phoneme Categories in DNN Acoustic Models Tasha Nagamine Department of Electrical Engineering, Columbia University T. Nagamine Motivation Large performance gap between humans and state-

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

Automatic Pronunciation Checker

Automatic Pronunciation Checker Institut für Technische Informatik und Kommunikationsnetze Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich Ecole polytechnique fédérale de Zurich Politecnico federale

More information

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language Z.HACHKAR 1,3, A. FARCHI 2, B.MOUNIR 1, J. EL ABBADI 3 1 Ecole Supérieure de Technologie, Safi, Morocco. zhachkar2000@yahoo.fr.

More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS

AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS 1 CALIFORNIA CONTENT STANDARDS: Chapter 1 ALGEBRA AND WHOLE NUMBERS Algebra and Functions 1.4 Students use algebraic

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH Don McAllaster, Larry Gillick, Francesco Scattone, Mike Newman Dragon Systems, Inc. 320 Nevada Street Newton, MA 02160

More information

INPE São José dos Campos

INPE São José dos Campos INPE-5479 PRE/1778 MONLINEAR ASPECTS OF DATA INTEGRATION FOR LAND COVER CLASSIFICATION IN A NEDRAL NETWORK ENVIRONNENT Maria Suelena S. Barros Valter Rodrigues INPE São José dos Campos 1993 SECRETARIA

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 1 CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2 Peter A. Chew, Brett W. Bader, Ahmed Abdelali Proceedings of the 13 th SIGKDD, 2007 Tiago Luís Outline 2 Cross-Language IR (CLIR) Latent Semantic Analysis

More information

Statewide Framework Document for:

Statewide Framework Document for: Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance

More information

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration INTERSPEECH 2013 Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration Yan Huang, Dong Yu, Yifan Gong, and Chaojun Liu Microsoft Corporation, One

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Wilny Wilson.P M.Tech Computer Science Student Thejus Engineering College Thrissur, India. Sindhu.S Computer

More information

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence INTERSPEECH September,, San Francisco, USA Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence Bidisha Sharma and S. R. Mahadeva Prasanna Department of Electronics

More information

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 Ranniery Maia 1,2, Jinfu Ni 1,2, Shinsuke Sakai 1,2, Tomoki Toda 1,3, Keiichi Tokuda 1,4 Tohru Shimizu 1,2, Satoshi Nakamura 1,2 1 National

More information

Why Did My Detector Do That?!

Why Did My Detector Do That?! Why Did My Detector Do That?! Predicting Keystroke-Dynamics Error Rates Kevin Killourhy and Roy Maxion Dependable Systems Laboratory Computer Science Department Carnegie Mellon University 5000 Forbes Ave,

More information

Speech Recognition by Indexing and Sequencing

Speech Recognition by Indexing and Sequencing International Journal of Computer Information Systems and Industrial Management Applications. ISSN 215-7988 Volume 4 (212) pp. 358 365 c MIR Labs, www.mirlabs.net/ijcisim/index.html Speech Recognition

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

Malicious User Suppression for Cooperative Spectrum Sensing in Cognitive Radio Networks using Dixon s Outlier Detection Method

Malicious User Suppression for Cooperative Spectrum Sensing in Cognitive Radio Networks using Dixon s Outlier Detection Method Malicious User Suppression for Cooperative Spectrum Sensing in Cognitive Radio Networks using Dixon s Outlier Detection Method Sanket S. Kalamkar and Adrish Banerjee Department of Electrical Engineering

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

Comment-based Multi-View Clustering of Web 2.0 Items

Comment-based Multi-View Clustering of Web 2.0 Items Comment-based Multi-View Clustering of Web 2.0 Items Xiangnan He 1 Min-Yen Kan 1 Peichu Xie 2 Xiao Chen 3 1 School of Computing, National University of Singapore 2 Department of Mathematics, National University

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

DIRECT ADAPTATION OF HYBRID DNN/HMM MODEL FOR FAST SPEAKER ADAPTATION IN LVCSR BASED ON SPEAKER CODE

DIRECT ADAPTATION OF HYBRID DNN/HMM MODEL FOR FAST SPEAKER ADAPTATION IN LVCSR BASED ON SPEAKER CODE 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) DIRECT ADAPTATION OF HYBRID DNN/HMM MODEL FOR FAST SPEAKER ADAPTATION IN LVCSR BASED ON SPEAKER CODE Shaofei Xue 1

More information

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Yanzhang He, Eric Fosler-Lussier Department of Computer Science and Engineering The hio

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

Course Outline. Course Grading. Where to go for help. Academic Integrity. EE-589 Introduction to Neural Networks NN 1 EE

Course Outline. Course Grading. Where to go for help. Academic Integrity. EE-589 Introduction to Neural Networks NN 1 EE EE-589 Introduction to Neural Assistant Prof. Dr. Turgay IBRIKCI Room # 305 (322) 338 6868 / 139 Wensdays 9:00-12:00 Course Outline The course is divided in two parts: theory and practice. 1. Theory covers

More information

International Journal of Advanced Networking Applications (IJANA) ISSN No. :

International Journal of Advanced Networking Applications (IJANA) ISSN No. : International Journal of Advanced Networking Applications (IJANA) ISSN No. : 0975-0290 34 A Review on Dysarthric Speech Recognition Megha Rughani Department of Electronics and Communication, Marwadi Educational

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

Truth Inference in Crowdsourcing: Is the Problem Solved?

Truth Inference in Crowdsourcing: Is the Problem Solved? Truth Inference in Crowdsourcing: Is the Problem Solved? Yudian Zheng, Guoliang Li #, Yuanbing Li #, Caihua Shan, Reynold Cheng # Department of Computer Science, Tsinghua University Department of Computer

More information

Semi-Supervised Face Detection

Semi-Supervised Face Detection Semi-Supervised Face Detection Nicu Sebe, Ira Cohen 2, Thomas S. Huang 3, Theo Gevers Faculty of Science, University of Amsterdam, The Netherlands 2 HP Research Labs, USA 3 Beckman Institute, University

More information

LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS

LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS Pranay Dighe Afsaneh Asaei Hervé Bourlard Idiap Research Institute, Martigny, Switzerland École Polytechnique Fédérale de Lausanne (EPFL),

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Softprop: Softmax Neural Network Backpropagation Learning

Softprop: Softmax Neural Network Backpropagation Learning Softprop: Softmax Neural Networ Bacpropagation Learning Michael Rimer Computer Science Department Brigham Young University Provo, UT 84602, USA E-mail: mrimer@axon.cs.byu.edu Tony Martinez Computer Science

More information

Mandarin Lexical Tone Recognition: The Gating Paradigm

Mandarin Lexical Tone Recognition: The Gating Paradigm Kansas Working Papers in Linguistics, Vol. 0 (008), p. 8 Abstract Mandarin Lexical Tone Recognition: The Gating Paradigm Yuwen Lai and Jie Zhang University of Kansas Research on spoken word recognition

More information

Deep Neural Network Language Models

Deep Neural Network Language Models Deep Neural Network Language Models Ebru Arısoy, Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran IBM T.J. Watson Research Center Yorktown Heights, NY, 10598, USA {earisoy, tsainath, bedk, bhuvana}@us.ibm.com

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

Knowledge Transfer in Deep Convolutional Neural Nets

Knowledge Transfer in Deep Convolutional Neural Nets Knowledge Transfer in Deep Convolutional Neural Nets Steven Gutstein, Olac Fuentes and Eric Freudenthal Computer Science Department University of Texas at El Paso El Paso, Texas, 79968, U.S.A. Abstract

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

arxiv: v2 [cs.cv] 30 Mar 2017

arxiv: v2 [cs.cv] 30 Mar 2017 Domain Adaptation for Visual Applications: A Comprehensive Survey Gabriela Csurka arxiv:1702.05374v2 [cs.cv] 30 Mar 2017 Abstract The aim of this paper 1 is to give an overview of domain adaptation and

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Evidence for Reliability, Validity and Learning Effectiveness

Evidence for Reliability, Validity and Learning Effectiveness PEARSON EDUCATION Evidence for Reliability, Validity and Learning Effectiveness Introduction Pearson Knowledge Technologies has conducted a large number and wide variety of reliability and validity studies

More information

BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY

BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY Sergey Levine Principal Adviser: Vladlen Koltun Secondary Adviser:

More information

Lecture 9: Speech Recognition

Lecture 9: Speech Recognition EE E6820: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 1 Recognizing speech 2 Feature calculation Dan Ellis Michael Mandel 3 Sequence

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Speech Communication Session 2aSC: Linking Perception and Production

More information

1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature

1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature 1 st Grade Curriculum Map Common Core Standards Language Arts 2013 2014 1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature Key Ideas and Details

More information

Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition

Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Hua Zhang, Yun Tang, Wenju Liu and Bo Xu National Laboratory of Pattern Recognition Institute of Automation, Chinese

More information

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

More information

(Sub)Gradient Descent

(Sub)Gradient Descent (Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include

More information