Class-Level Spectral Features for Emotion Recognition


University of Pennsylvania ScholarlyCommons, Departmental Papers (CIS), Department of Computer & Information Science.

Class-Level Spectral Features for Emotion Recognition
Dmitri Bitouk, University of Pennsylvania; Ragini Verma, University of Pennsylvania; Ani Nenkova, University of Pennsylvania

Recommended Citation: Bitouk, D., Verma, R., & Nenkova, A., Class-Level Spectral Features for Emotion Recognition, Speech Communication, Volume 52, Issues 7-8, July-August 2010, doi: /j.specom


Speech Communication 52 (2010)

Class-level spectral features for emotion recognition

Dmitri Bitouk a,*, Ragini Verma a, Ani Nenkova b

a Department of Radiology, Section of Biomedical Image Analysis, University of Pennsylvania, 3600 Market Street, Suite 380, Philadelphia, PA 19104, United States
b Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104, United States

Received 24 November 2009; received in revised form 2 February 2010; accepted 9 February 2010

This paper is an expanded version of the work first presented in Interspeech 2009 (Bitouk et al., 2009). It includes additional material on feature selection and feature analysis, experiments linking utterance length and system performance, speaker-dependent recognition and an expanded discussion of the related work.
* Corresponding author. E-mail address: Dmitri.Bitouk@uphs.upenn.edu (D. Bitouk).

Abstract

The most common approaches to automatic emotion recognition rely on utterance-level prosodic features. Recent studies have shown that utterance-level statistics of segmental spectral features also contain rich information about expressivity and emotion. In our work we introduce a more fine-grained yet robust set of spectral features: statistics of Mel-Frequency Cepstral Coefficients computed over three phoneme type classes of interest (stressed vowels, unstressed vowels and consonants) in the utterance. We investigate the performance of our features in the task of speaker-independent emotion recognition using two publicly available datasets. Our experimental results clearly indicate that both the richer set of spectral features and the differentiation between phoneme type classes are beneficial for the task. Classification accuracies are consistently higher for our features compared to prosodic or utterance-level spectral features. Combination of our phoneme class features with prosodic features leads to even further improvement. Given the large number of class-level spectral features, we expected feature selection would improve results even further, but none of several selection methods led to clear gains. Further analyses reveal that spectral features computed from consonant regions of the utterance contain more information about emotion than either stressed or unstressed vowel features. We also explore how emotion recognition accuracy depends on utterance length. We show that, while there is no significant dependence for utterance-level prosodic features, accuracy of emotion recognition using class-level spectral features increases with the utterance length. © 2010 Elsevier B.V. All rights reserved.

Keywords: Emotions; Emotional speech classification; Spectral features

1. Introduction

Emotion content of spoken utterances is clearly encoded in the speech signal, but pinpointing the specific features that contribute to conveying emotion remains an open question. Descriptive studies in psychology and linguistics have mostly dealt with prosody, concerned with the question of how an utterance is produced. They have identified a number of acoustic correlates of prosody indicative of given emotions. For example, happy speech has been found to be correlated with increased mean fundamental frequency (F0), increased mean voice intensity and higher variability of F0, while boredom is usually linked to decreased mean F0 and increased mean of the first formant frequency (F1) (Banse and Scherer, 1996).
Following this tradition, most of the work on automatic recognition of emotion has made use of utterance-level statistics (mean, min, max, std) of prosodic features such as F0, formant frequencies and intensity (Dellaert et al., 1996; McGilloway et al., 2000). Others employed Hidden Markov Models (HMM) (Huang and Ma, 2006; Fernandez and Picard, 2003) to differentiate the type of emotion expressed in an utterance based on the prosodic features in a sequence of frames, thus avoiding the need to compute utterance-level statistics.
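To make the utterance-level statistics concrete, the following minimal Python sketch (not part of the original work) computes the mean, standard deviation, minimum and maximum of an F0 contour and of its frame-to-frame derivative. The zero-as-unvoiced convention and the function name are assumptions for illustration; any pitch tracker that outputs a per-frame F0 array could feed it.

```python
import numpy as np

def utterance_prosodic_stats(f0):
    """Mean/std/min/max of an F0 contour and its first derivative.

    f0: 1-D array of per-frame F0 values in Hz, with 0 marking unvoiced
        frames (an assumed convention for this sketch).
    """
    voiced = f0[f0 > 0]                       # keep voiced frames only
    if voiced.size == 0:                      # degenerate, fully unvoiced input
        return {k: 0.0 for k in ("f0_mean", "f0_std", "f0_min", "f0_max",
                                 "f0_delta_mean", "f0_delta_std",
                                 "f0_delta_min", "f0_delta_max")}
    d_f0 = np.diff(voiced) if voiced.size > 1 else np.zeros(1)
    stats = {}
    for name, x in (("f0", voiced), ("f0_delta", d_f0)):
        stats[f"{name}_mean"] = float(np.mean(x))
        stats[f"{name}_std"] = float(np.std(x))
        stats[f"{name}_min"] = float(np.min(x))
        stats[f"{name}_max"] = float(np.max(x))
    return stats
```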

4 614 D. Bitouk et al. / Speech Communication 52 (2010) On the other hand, spectral features, based on the shortterm power spectrum of sound, such as Linear Prediction Coefficients (LPC) and Mel-Frequency Cepstral Coefficients (MFCC), have received less attention in emotion recognition. While spectral features are harder to be intuitively correlated with affective state, they provide a more detailed description of speech signal and, thus, can potentially improve emotion recognition accuracy over prosodic features. However, spectral features, which are typically used in speech recognition, are segmental and convey information on both what is being said and how it is being said. Thus, the major challenge in using spectral information in emotion analysis is to define features in a way that does not depend on the specific phonetic content of an utterance, while preserving cues for emotion differentiation. Most of the previous methods that do use spectral features ignore this challenge by modelling how emotion is encoded in speech independent of its phonetic content. Phoneme-level classification of emotion has received relatively little attention, barring only a few exceptions. For example the work of Lee et al. (2004) takes into account phonetic content of speech by training phoneme-dependent HMM for speaker-dependent emotion classification. Sethu et al. (2008) used phoneme-specific Gaussian Mixture Models (GMM) and demonstrated that emotion can be better differentiated by some phonemes than others. However, such phoneme-specific approach cannot be directly applied to emotion classification due to sparsity of phoneme occurance. In this paper, we present novel spectral features for emotion recognition computed over phoneme type classes of interest: stressed vowels, unstressed vowels and consonants in the utterance. These larger classes are general enough and do not depend on specific phonetic composition of the utterance and thus abstract away from what is being said. Unlike previous approaches which used spectral features, our class-level spectral features are technically simple and exploit linguistic intuition rather than rely on sophisticated machine learning machinery. We use the forced alignment between audio and the manual transcript to obtain the phoneme-level segmentation of the utterance and compute statistics of MFCC from parts of the utterance corresponding to the three phoneme classes. Compared to previous approaches which use utterance-level statistics of spectral features, the advantage of our approach is two-fold. Firstly, the use of phoneme classes reduces dependence of the extracted spectral features on the phonetic content of the utterance. Secondly, it captures better the intuition that emotional affect can be expressed to a greater extent in some phoneme classes than others, and, thus, increases the discriminating power of spectral features. In our work we analyze performance of phoneme class spectral features in speaker-independent emotion classification of English and German speech using two publicly available datasets (Section 5). We demonstrate that classlevel spectral features outperform both the traditional prosodic features and utterance-level statistics of MFCC. We test several feature selection algorithms in order to further improve emotion recognition performance of class-level spectral features and evaluate contributions from each phoneme class to emotion recognition accuracy (Section 7). 
Our results indicate that spectral features, computed from the consonant regions of the utterance, outperform features from both stressed and unstressed vowel regions. Since consonant regions mostly correspond to unvoiced speech segments which are not accounted for by prosodic features derived from pitch and intensity profiles, this result implies that class-level spectral features can provide complimentary information to both utterance-level prosodic and spectral features. Finally, cross-corpus comparisons of emotion recognition motivated an analysis of the impact of utterance length on classification accuracy which, to the best of our knowledge, has not been addressed in the literature (Section 6). We investigate this dependence using synthetic emotional speech data constructed by concatenating short utterances from LDC dataset. We demonstrate that, while there is no significant dependence for utterance-level prosodic features, performance of class-level spectral features increases with utterance length, up to utterance length of 16 syllables. Further increases in utterance length do not seem to affect performance. It should be noted that these results are obtained using concatenated emotional speech and need to be cross-validated on naturally spoken emotional corpora when appropriate corpora become available. 2. Prior work Although the main body of previous work on emotion recognition in speech uses suprasegmental prosodic features, segmental spectral features which are typically employed in automatic speech recognition have also been studied for the task. The most commonly used spectral features for emotion recognition are Mel-Frequency Cepstral Coefficients (MFCC) (Tabatabaei et al., 2007; Lee et al., 2004; Kwon et al., 2003; Neiberg et al., 2006; Schuller et al., 2005; Luengo et al., 2005; Hasegawa-Johnson et al., 2004; Vlasenko et al., 2007; Grimm et al., 2006; Schuller and Rigoll, 2006; Meng et al., 2007; Sato and Obuchi, 2007; Kim et al., 2007; Hu et al., 2007; Shafran et al., 2003; Shamia and Kamel, 2005; Vlasenko et al., 2008; Vondra and Vich, 2009; Wang and Guan, 2005). As in automatic speech recognition, MFCC are extracted using a 25 ms Hamming window at intervals of 10 ms and cover frequency range from 300 Hz to the Nyquist frequency. In addition to MFCC, the log-energy as well as delta and acceleration coefficients (first and second derivatives) are also used as features. A low-frequency version of MFCC (Neiberg et al., 2006) which uses low-frequency filterbanks in Hz range has been found not to provide emotion recognition performance gains. Other spectral fea-

5 D. Bitouk et al. / Speech Communication 52 (2010) ture types used for emotion recognition are Linear Prediction Cepstral Coefficients (LPC) (Nicholson et al., 2000; Pao et al., 2005), Log Frequency Power Coefficients (Nwe et al., 2003; Song et al., 2004) and Perceptual Linear Prediction (PLP) coefficients (Scherer et al., 2007; Ye et al., 2008). Prior approaches which used spectral features for emotion recognition in speech are summarized in Table 1. The majority of spectral methods for emotion recognition make use of either frame-level or utterance-level features. Frame-level approaches model how emotion is encoded in speech using features sampled at small time intervals (typically ms) and classify utterances using either HMMs or by combining predictions from all of the frames. On the other hand, utterance-level methods rely on computing statistical functionals of spectral features over the entire utterance Frame-level methods HMMs have been applied with great success in automatic speech recognition to integrate frame-level information, and can be used similarly for emotion recognition as well. One group of HMM-based methods models all utterances using a fixed HMM topology independent of what is being said. In this case, each emotion is represented using its own HMM. Ergodic HMM topology is often used in order to accommodate varying utterance length. Unlike automatic speech recognition in which HMM states usually correspond to sub-phoneme units, there is no clear interpretation for the states of emotion-level HMM which are employed as a mean to integrate framelevel information into a likelihood score for each emotion. For example, Nwe et al. (2003) trained speaker-dependent, four-state ergodic HMMs for each emotion. To classify a novel utterance into an emotion category, likelihood scores of the utterance features given each emotion were evaluated using the trained HMMs. The utterance is classified as expressing the emotion which yields the highest likelihood score. This method achieved 71% accuracy in classifying the six basic emotions in speaker-dependent settings, but its speaker-independent performance was not investigated. Song et al. (2004) proposed a straightforward extension of this approach to estimate three discrete emotion intensity levels by effectively treating each emotion s intensity levels as a separate category. Another group of HMM-based approaches aims to integrate emotion labels into automated speech recognition systems. This is typically accomplished by building HMMs with emotiondependent states. Meng et al. (2007) proposed joint speech and emotion recognition by expanding the dictionary to include multiple versions of each word, one for each emotion. Emotion classification was then performed using majority voting between emotion labels in the hypothesis obtained using standard decoding algorithms. A similar emotion-dependent HMM approach was also used by Hasegawa-Johnson et al. (2004) to differentiate between confidence, puzzle and hesitation affective states in an intelligent tutoring application. Lee et al. (2004) train phonemedependent HMMs in order to take into account phonetic content of speech. During the training stage, emotiondependent HMMs were constructed for each of the five phoneme classes vowels, glides, nasals, stops and fricatives. In order to classify an utterance, its likelihood scores given each emotion were computed and the emotion with the maximum likelihood score was chosen as the decision. 
However, this approach was only tested for speakerdependent emotion recognition using a proprietary database which consisted of recording from a single speaker. Another popular approach to emotion recognition at frame-level is to ignore temporal information altogether and treat acoustic observations at each time frame as the values of independent, identically distributed random variables. Under this assumption, Gaussian Mixture Models (GMMs) are commonly used to model conditional distributions of acoustic features in the utterance given emotion categories. Neiberg et al. (2006) used GMMs trained on the extracted MFCC and pitch features to classify utterances into neutral, negative and positive emotion categories in call center and meeting datasets. Luengo et al. (2005) also employed GMMs to classify utterances from a single speaker database into the six basic emotions. A real-time systems for discriminating between angry and neutral speech was implemented in Kim et al. (2007) using GMMs for MFCC features in combination with a prosody-based classifier. Vondra and Vich (2009) applied GMMs to emotion recognition using a combined feature set obtained by concatenating MFCC and prosodic features. Hu et al. (2007) employed the GMM supervector approach in order to extract fixed-length feature vectors from utterances with variable durations. A GMM supervector consists of the estimated means of the mixtures in GMM. A mixture model was trained for each utterance and GMM supervectors were used as features for support vector machine classifiers. Frame-wise emotion classification based on vector quantization techniques was used by Sato and Obuchi (2007). In the training stage, a set of codewords was obtained for each emotion. In order to classify an input utterance, an emotion label was computed for each frame by finding the nearest emotion codeword. Finally, the whole utterance was classified using a majority voting scheme between frame-level emotion labels. It was demonstrated that such a simple frame-wise technique outperforms HMM-based methods. Vlasenko et al. (2007) integrated GMM log-likelihood score with commonlyused suprasegmental prosody-based emotion classifiers in order to investigate combination of features at different levels of granularity. Sethu et al. (2008) used phonemespecific GMMs and demonstrated that emotion can be better differentiated by some phonemes than others. However, such phoneme-specific approach cannot be directly applied to emotion classification due to sparsity of phoneme occurance.
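The frame-level GMM recipe described above (per-emotion mixture models scored frame by frame, with the utterance assigned to the highest-likelihood emotion) can be sketched as follows. This is an illustrative re-implementation with scikit-learn, not the code of any of the cited systems; the data layout is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumed toolkit, not used by the cited systems

def train_emotion_gmms(train_data, n_components=8):
    """train_data: dict emotion -> (n_frames, n_features) array of pooled MFCC frames."""
    gmms = {}
    for emotion, frames in train_data.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(frames)
        gmms[emotion] = gmm
    return gmms

def classify_utterance(gmms, mfcc):
    """Sum per-frame log-likelihoods under each emotion GMM and take the argmax."""
    scores = {e: g.score_samples(mfcc).sum() for e, g in gmms.items()}
    return max(scores, key=scores.get)
```

The per-emotion HMM systems discussed earlier follow the same maximum-likelihood decision rule, with an HMM replacing the GMM as the scoring model for each emotion.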

Table 1. Prior work on emotion recognition in speech using spectral features.

Reference | Language | Spectral features | Granularity | Use of prosody | Classifier | Speaker independence
Neiberg et al. (2006) | Swedish, English | MFCC | Frame | ✓ | GMM |
Luengo et al. (2005) | Basque | MFCC | Frame | ✓ | HMM |
Hasegawa-Johnson et al. (2004) | English | MFCC | Frame | ✓ | HMM |
Meng et al. (2007) | German | MFCC | Frame | | HMM |
Sato and Obuchi (2007) | English | MFCC | Frame | | Voting | ✓
Kim et al. (2007) | English | MFCC | Frame | ✓ | GMM |
Hu et al. (2007) | Mandarin | MFCC | Frame | | GMM, SVM |
Shafran et al. (2003) | English | MFCC | Frame | ✓ | HMM |
Vondra and Vich (2009) | German | MFCC | Frame | ✓ | GMM | ✓
Pao et al. (2005) | Mandarin | LPC, MFCC... | Frame | | HMM, knn | ✓
Nwe et al. (2003) | Burmese, Mandarin | LFPC | Frame | | HMM |
Song et al. (2004) | Mandarin | LFPC | Frame | | HMM |
Scherer et al. (2007) | German | PLP | Frame | | knn |
Vlasenko et al. (2007) | German, English | MFCC | Frame, Utterance | ✓ | GMM, SVM | ✓
Sethu et al. (2008) | English | MFCC | Frame | ✓ | GMM | ✓
Lee et al. (2004) | English | MFCC | Phoneme | | HMM |
Tabatabaei et al. (2007) | English | MFCC | Utterance | ✓ | SVM |
Kwon et al. (2003) | English, German | MFCC | Utterance | ✓ | SVM, LDA | ✓
Schuller et al. (2005) | German | MFCC | Utterance | ✓ | SVM, AdaBoost | ✓
Grimm et al. (2006) | English | MFCC | Utterance | ✓ | Fuzzy logic |
Wang and Guan (2005) | Multiple | MFCC | Utterance | ✓ | LDA |
Ye et al. (2008) | Mandarin | MFCC, PLP | Utterance | | SVM |
Schuller and Rigoll (2006) | German | MFCC | Segment | ✓ | Various |
Shamia and Kamel (2005) | English | MFCC | Segment | ✓ | knn, SVM |
Nicholson et al. (2000) | Japanese | LPC | Segment | ✓ | NN |
This paper | English, German | MFCC | Phoneme | ✓ | SVM | ✓

2.2. Utterance-level methods

In contrast to frame-level approaches, utterance-level methods rely on extracting fixed-length feature vectors. Such features are usually composed of various statistics of acoustic parameters computed over the entire utterance. Commonly-used statistics are mean, standard deviation, skewness and extrema values. In utterance-level emotion recognition, statistics of spectral features are often combined with statistics of prosodic measures, and classification is performed using both sources of information. For example, Kwon et al. (2003) used statistics such as mean, standard deviation, range and skewness of pitch, energy and MFCC to recognize emotions using Support Vector Machine (SVM) classifiers. Similarly, 276 statistical functionals of pitch, energy, MFCC and voice quality measures along with linguistic features were used in Schuller et al. (2005). Instead of computing independent statistics, Ye et al. (2008) used covariance matrices of prosodic and spectral measures evaluated over the entire utterance. Since positive-definite covariance matrices do not form a vector space, classification has to be performed using manifold learning methods. Schuller and Rigoll (2006) investigated levels of granularity finer than the entire utterance. In particular, they demonstrated that statistics of spectral and prosodic features computed over speech segments obtained by splitting utterances at fixed relative positions (such as halves and thirds) can improve recognition performance over the utterance-level features. Shamia and Kamel (2005) computed prosodic and spectral features for each voiced segment of the utterance and constructed a segment-level emotion classifier. In order to classify an utterance, the posterior class probability was evaluated by combining posterior probabilities of each voiced segment in the utterance.
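A small helper for the statistical functionals used by these utterance-level methods (mean, standard deviation, range and skewness of a per-frame trajectory such as pitch, energy or an MFCC coefficient). This is a generic sketch, not code from any of the cited papers; it assumes the trajectory is already available as a NumPy array.

```python
import numpy as np
from scipy.stats import skew  # SciPy assumed available

def functionals(trajectory):
    """Mean, std, range and skewness of a per-frame acoustic trajectory."""
    x = np.asarray(trajectory, dtype=float)
    return np.array([x.mean(), x.std(), x.max() - x.min(), skew(x)])

def utterance_vector(trajectories):
    """Concatenate the functionals of several trajectories (e.g. pitch, energy,
    each MFCC coefficient) into one fixed-length vector for an SVM classifier."""
    return np.concatenate([functionals(t) for t in trajectories])
```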
3. Databases

In our work we used two publicly available databases of emotional speech: an English emotional speech database from the Linguistic Data Consortium (LDC) (2002) and the Berlin database of German emotional speech (Burkhardt et al., 2005).

3.1. LDC emotional speech database

The LDC database contains recordings of native English actors expressing the following 15 emotional states: neutral, hot anger, cold anger, happy, sadness, disgust, panic, anxiety (fear), despair, elation, interest, shame, boredom, pride and contempt. In addition, utterances in the LDC database vary by the distance between the speaker and the listener: tête-à-tête, conversation and distant. In our experiments we consider only the utterances corresponding to conversation distance to the listener and the six basic emotions which include anger (hot anger), fear, disgust, happy, sadness

and neutral. There are 548 utterances from seven actors (three female/four male) corresponding to these six basic emotions from the LDC database. Almost all of the extracted utterances are short, four-syllable utterances containing dates and numbers (e.g. "Nineteen hundred"). Note that emotion labels of utterances in the LDC database were not validated by external subject assessment as done in the creation of other databases (such as the Berlin Emotional Speech Database). Instead, each utterance is simply labeled as the intended emotion given to the speaker at the time of recording. As a result, some of the utterances might not be good examples of the intended emotion and could even be perceived by listeners as expressing a different emotion altogether. Such characteristics of the corpus are likely to negatively affect emotion recognition accuracy.

3.2. Berlin emotional speech database

The Berlin dataset contains emotional utterances produced by 10 German actors (five female/five male) reading one of 10 pre-selected sentences typical of everyday communication ("She will hand it in on Wednesday", "I just will discard this and then go for a drink with Karl", etc.). The dataset provides examples of the following seven emotions: anger, boredom, fear, disgust, joy (happy), sadness and neutral. Utterances corresponding to boredom were removed from the analysis and we focus on the six basic emotions that were also present in the LDC data. In comparison to the LDC dataset, utterances in the Berlin dataset are notably longer. The underlying sentences were designed to maximize the number of vowels. In addition, each of the recorded utterances was rated by 20 human subjects with respect to perceived naturalness. Subjects were also asked to classify each utterance as expressing one of the possible emotions. Utterances for which intended emotion recognition was low or which had low perceived naturalness were removed from the dataset. Due to these differences in corpus preparation, we expected to achieve higher emotion recognition rates on the Berlin dataset than on the LDC dataset (and this indeed was the case).

4. Features

In our work we compared and combined two types of features: traditional prosodic features and spectral features for three distinct phoneme classes. Prosodic features used in this paper are derived from pitch, intensity and first formant frequency profiles as well as voice quality measures. Our spectral features are comprised of statistics of Mel-Frequency Cepstral Coefficients (MFCC). Given an input utterance, the first step in our feature extraction algorithm is to obtain its phoneme-level segmentation. For the LDC dataset, we used Viterbi forced alignment (Odell et al., 2002) between an utterance and its transcript to find the starting and ending time of each phoneme, as well as to detect the presence of lexical stress for each of the vowels in the utterance. Forced alignment was performed using generic monophone HMM models of English trained on non-emotional speech. We used a pronunciation dictionary which contained multiple transcriptions of each word based on various pronunciation variants and stress positions. For each utterance, its transcript was expanded into a multiple-pronunciation recognition network using the dictionary. We used Viterbi decoding in order to find the most likely path through the network, which yields the starting and ending times of each phoneme in the utterance as well as the actual lexical stress. It should be noted that the obtained vowel stress is not fixed to a single dictionary pronunciation but depends on the observed acoustic evidence. For the Berlin dataset, we did not have available German acoustic models, so we used the manual segmentations provided as a part of the dataset. Thus, the emotion recognition results on the Berlin dataset presented in the paper cannot be considered fully automatic.

We grouped phonemes into three phoneme type classes of interest: stressed vowels, unstressed vowels and consonants. Class-level features were created by computing statistics of prosodic and spectral measurements from parts of the utterance corresponding to these classes. Such partition of phonemes into classes reduces dependence of our features on specific utterance content and, at the same time, provides robustness and avoids sparsity given that a single utterance contains only a small number of phonemes.

In order to analyze the usefulness of class-level spectral features and compare their performance with existing approaches, we computed four different sets of features, varying the type of features (spectral or prosodic) and the region of the utterance over which they were computed, as shown in Fig. 1. The regions were either the entire utterance or local regions corresponding to phoneme classes. In the latter setting the features from each of the three phoneme classes were concatenated to form the feature vector descriptive of the entire utterance. While the primary goal of this paper is to investigate the performance of the class-level spectral features alone and in combination with prosodic features, we used the additional feature sets as baseline benchmarks in emotion classification experiments as well as to gain insights on how phoneme-level analysis can improve emotion differentiation in speech. Below, we describe each of the feature sets in detail.

Fig. 1. We computed four different types of features by varying the type of features (prosodic or spectral) and the region of utterances where they are computed (utterance-level and class-level).
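The grouping of aligned phonemes into the three classes described above can be sketched as follows. This is an illustrative fragment, not the authors' code: it assumes the forced aligner outputs ARPAbet-style phone labels in which vowels carry a trailing stress digit (1 or 2 for stressed, 0 for unstressed), a common HTK/CMUdict convention; a different phone set would need a different test.

```python
def phoneme_class(phone_label):
    """Map an ARPAbet-style phone label (e.g. 'AH0', 'EY1', 'K') to one of the
    three classes used for class-level features (assumed labeling scheme)."""
    label = phone_label.strip().upper()
    if label and label[-1].isdigit():            # vowels carry a stress digit
        return "stressed_vowel" if label[-1] in ("1", "2") else "unstressed_vowel"
    if label in ("SIL", "SP", "PAU"):            # skip silence/pause models
        return None
    return "consonant"

# Example: segments from a forced alignment as (phone, start_sec, end_sec) tuples
alignment = [("DH", 0.00, 0.05), ("AH0", 0.05, 0.11), ("K", 0.11, 0.18), ("AE1", 0.18, 0.30)]
classes = [(phoneme_class(p), s, e) for p, s, e in alignment]
```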

4.1. Utterance-level prosodic features

Previous approaches to emotion analysis in speech have used various statistics of the fundamental frequency (F0), formant frequencies and voice intensity profiles. Following prior work, we used the Praat software (Boersma and Weenink, 2001) to estimate F0 and F1 contours. For each utterance, we normalized intensity, F0 and F1 contours by computing speaker-specific z-scores. In addition to the features derived from formant frequencies and voice intensity, we also extracted micro-prosodic measures of voice quality such as jitter (the short-term period-to-period fluctuation in fundamental frequency) and shimmer (the random short-term changes in the glottal pulse amplitude), as well as the relative duration of voiced segments, which characterizes speech rhythm, and the relative spectral energy above 500 Hz (HF500). We computed statistics over the entire utterance such as mean value, standard deviation, minimum and maximum of F0 and its first derivative, voice intensity and its derivative, as well as of the first formant frequency (F1). In total, the set of utterance-level prosodic features contains 24 features:

- mean, std, min, max of F0 and F0 derivative
- mean, std, min, max of F1
- mean, std, min, max of voice intensity and its derivative
- jitter, shimmer, HF500
- relative duration of voiced segments

4.2. Class-level prosodic features

Instead of utterance-level statistics, class-level prosodic features use statistics of voice intensity and formants computed over utterance segments which correspond to the stressed and unstressed vowel classes. We did not use the consonant class since formant frequencies are not defined for voiceless phonemes. Jitter, shimmer and HF500 were computed over the voiced part of the utterance. The set of class-level prosodic features consists of 44 individual features:

- mean, std, min, max of F0 and F0 derivative over the stressed vowel region
- mean, std, min, max of F0 and F0 derivative over the unstressed vowel region
- mean, std, min, max of F1 over the stressed vowel region
- mean, std, min, max of F1 over the unstressed vowel region
- mean, std, min, max of voice intensity and its derivative over the stressed vowel region
- mean, std, min, max of voice intensity and its derivative over the unstressed vowel region
- jitter, shimmer, HF500
- relative duration of voiced segments

4.3. Utterance-level spectral features

Utterance-level spectral features are mean values and standard deviations of MFCC computed over the entire utterance. For each utterance, we computed 13 MFCC (including log-energy) using a 25 ms Hamming window at intervals of 10 ms. For each utterance, we normalized the MFCC trajectory by computing speaker-specific z-scores. In addition, we computed delta and acceleration coefficients as the first and second derivatives of MFCC using finite differences (26 features). The total number of utterance-level spectral features is 78, which includes means and standard deviations of MFCC as well as the delta and acceleration coefficients.
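For concreteness, the 78-dimensional utterance-level spectral vector described above can be computed roughly as follows. The paper does not state which toolkit it used for MFCC extraction; this sketch uses librosa as a stand-in, treats librosa's 0th cepstral coefficient as the log-energy term, uses librosa's delta filter in place of finite differences, and omits the speaker-specific z-score normalization.

```python
import numpy as np
import librosa  # assumed toolkit; not specified in the paper

def utterance_spectral_features(wav_path, sr=16000):
    """Mean and std of 13 MFCCs plus delta and acceleration coefficients,
    pooled over the whole utterance (39 coefficients x 2 statistics = 78 values)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame shift
        window="hamming",
    )
    delta = librosa.feature.delta(mfcc)            # first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # second derivative
    frames = np.vstack([mfcc, delta, delta2])      # shape (39, n_frames)
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
```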
4.4. Class-level spectral features

Class-level spectral features model how emotion is encoded in speech at the phoneme-level. Using the phoneme-level segmentation of the utterance, we formed the spectral feature vector by concatenating class-conditional means and standard deviations of MFCC for each of the stressed vowel, unstressed vowel and consonant classes. In addition, we computed the average duration of the above phoneme classes. In summary, the class-level spectral feature vector is 237-dimensional and consists of the following feature groups:

- mean and std of MFCC over the stressed vowel region
- mean and std of MFCC over the unstressed vowel region
- mean and std of MFCC over the consonant region
- mean duration of stressed vowels
- mean duration of unstressed vowels
- mean duration of consonants

4.5. Combined features

In order to investigate the performance of spectral features in combination with prosodic features, we created a combined feature set by concatenating the sets of class-level spectral and utterance-level prosodic features (since utterance-level features are the most common prosodic features, we do not report any other combinations to avoid clutter in presentation and interpretation of the results). In total, the combined set consists of 261 features.
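A minimal sketch of assembling the 237-dimensional class-level spectral vector from a per-frame MFCC(+delta) matrix and a class-labelled segmentation such as the one produced earlier. The 10 ms frame shift, the helper names and the zero-filling of empty classes are illustrative assumptions; concatenating the result with the 24 utterance-level prosodic features would give the 261-dimensional combined set.

```python
import numpy as np

CLASSES = ("stressed_vowel", "unstressed_vowel", "consonant")
FRAME_SHIFT = 0.010  # seconds; assumed 10 ms hop as in Section 4.3

def class_level_spectral_features(frames, segments):
    """frames: (39, n_frames) MFCC+delta+acceleration matrix for one utterance.
    segments: list of (class_name, start_sec, end_sec) from the forced alignment.
    Returns the 3 * (39 * 2) + 3 = 237-dimensional class-level feature vector."""
    pooled = {c: [] for c in CLASSES}
    durations = {c: [] for c in CLASSES}
    for cls, start, end in segments:
        if cls not in pooled:
            continue  # e.g. silence segments
        lo = int(start / FRAME_SHIFT)
        hi = max(int(end / FRAME_SHIFT), lo + 1)
        pooled[cls].append(frames[:, lo:hi])
        durations[cls].append(end - start)
    feats = []
    for cls in CLASSES:
        mats = [m for m in pooled[cls] if m.size]
        x = np.hstack(mats) if mats else np.zeros((frames.shape[0], 1))
        feats.extend([x.mean(axis=1), x.std(axis=1)])
    feats.append(np.array([np.mean(durations[c]) if durations[c] else 0.0 for c in CLASSES]))
    return np.concatenate(feats)
```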

5. Emotion classification

In our experiments on emotion recognition, we used SVM classifiers with radial basis kernels constructed using the LIBSVM library (Chang and Lin, 2001). Since the number of utterances per emotion class varied widely in both the LDC and Berlin datasets, we used the Balanced ACcuracy (BAC) as a performance metric for the emotion recognition experiments presented below. BAC is defined as the average over all emotion classes of the recognition accuracy for each class:

BAC = \frac{1}{K} \sum_{i=1}^{K} \frac{n_i}{N_i},    (1)

where K is the number of emotion classes, N_i is the total number of utterances belonging to class i and n_i is the number of utterances in this class which were classified correctly. Unlike the standard classification accuracy, defined as the total proportion of correctly classified instances, BAC is not sensitive to imbalance in distribution between emotion classes. For example, let us consider a binary classification of neutral emotion versus happiness in a dataset containing 90 neutral and 10 happy utterances. Predicting all utterances as the majority class (neutral) would correspond to the relative accuracy of classification of 90%, while BAC is equal to 50%.

In order to confirm stability and speaker independence of the obtained classifiers, testing was performed using the Leave-One-Subject-Out (LOSO) paradigm such that the test set did not contain utterances from the speakers used in the training set. Classification experiments were performed in a round-robin manner by consecutively assigning each of the speakers to the test set and using utterances from the rest of the speakers in the database as the training set (this in effect corresponds to 7-fold and 10-fold cross-validation for the LDC and Berlin datasets respectively). The optimal values of the SVM parameters for each fold were computed using a cross-validation procedure over the training set. We computed the overall BAC recognition accuracy by applying Eq. (1) to the set of predictions combined from all of the LOSO folds. In the experiments presented below, we investigated the performance of each of the four sets of features introduced in Section 4, plus that of the combination of utterance-level prosodic features and class-level spectral features (combined). It should be noted that, while a number of previous approaches described in Section 2 focused only on speaker-dependent emotion recognition, our experiments are on speaker-independent emotion recognition since our recognition experiments made use of utterances from speakers which were unseen during classifier training.

5.1. Multi-class emotion recognition

In our first experiment, we considered the task of multi-class classification of the six basic emotions. The accuracy of speaker-independent, multi-class classification on the LDC and Berlin datasets is shown in Table 2 for features of different types (prosodic and spectral) and granularity levels (utterance-level and class-level). The accuracies for the complete set of 15 emotions for LDC data are given to allow comparison with prior work.

Table 2. Speaker-independent, multi-class emotion classification rates for the six emotion task on LDC and Berlin datasets using prosodic and spectral features with different levels of granularity: utterance-level (UL) and class-level (CL). Classification rates for the complete set of 15 emotions for LDC data are given to allow comparison with prior work. Best performance is shown in bold. Columns: LDC dataset (six emotions), Berlin dataset (six emotions), LDC dataset (15 emotions); rows: UL prosody, CL prosody, UL spectral, CL spectral, Combined.
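Eq. (1) and the leave-one-subject-out protocol described above translate directly into code. The sketch below is an illustrative re-implementation with scikit-learn rather than the LIBSVM interface used in the paper; the parameter grid and the feature/label/speaker array layout are assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def balanced_accuracy(y_true, y_pred):
    """Eq. (1): average per-class recall over the K emotion classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])

def loso_evaluate(X, y, speakers):
    """X: (n_utterances, n_features); y: emotion labels; speakers: speaker id per utterance."""
    y = np.asarray(y)
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        # RBF-kernel SVM; C and gamma tuned by cross-validation on the training folds
        grid = {"svc__C": [1, 10, 100], "svc__gamma": ["scale", 0.01, 0.001]}  # assumed grid
        clf = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")), grid, cv=3)
        clf.fit(X[train_idx], y[train_idx])
        preds[test_idx] = clf.predict(X[test_idx])
    return balanced_accuracy(y, preds)
```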
Our results indicate that class-level spectral features perform better than other types of features for both LDC and Berlin datasets. Class-level spectral features also outperform the utterance-level prosodic features by absolute 21.4% in LDC and 7.8% in Berlin datasets. There are also noticeable improvements over commonly used utterancelevel spectral features. For Berlin dataset, the best results are obtained when combination of the class-level spectral and utterance-level prosodic features is used. However, the combined features perform slightly worse than classlevel spectral features in LDC dataset. While Table 2 presents speaker-independent emotion recognition accuracy, the majority of previous work did not use LOSO paradigm and focused on recognizing emotions in speaker-dependent settings. For the sake of comparison, we computed speaker-dependent emotion recognition accuracy using prosodic and spectral features with different granularity levels (utterance- and class-level). Each dataset was randomly split into the training set which contained 70% of the total number of utterances and the test set which included remaining 30% of the utterances. The accuracy of speaker-dependent emotion recognition for LDC and Berlin datasets is shown in Table 3. While similarly to the speaker-independent case, class-level spectral features outperform other feature types, the overall recognition performance is significantly higher than for the speakerindependent case. For example, speaker-dependent performance of utterance-level prosodic features in LDC dataset is almost twice the accuracy of the speaker-independent recognition. In order to compare performance of the class-level spectral features to the results of previous work on speakerindependent emotion classification (Yacoub et al., 2003; Huang and Ma, 2006), we conducted an experiment on classification of all 15 emotions in LDC dataset. The accuracy of 15-class classification is given in the last column of Table 2. Classification accuracy of 30.7% obtained using class-level spectral features is considerably higher than the prosody-based classification accuracy of 18% reported in Huang and Ma (2006) and 8.7% reported in Yacoub et al. (2003) on the same task. Note that the results might

10 620 D. Bitouk et al. / Speech Communication 52 (2010) Table 3 Speaker-dependent multi-class emotion classification rates for six emotion task on LDC and Berlin datasets using prosodic and spectral features with different levels of granularity: utterance-level (UL) and class-level (CL). LDC dataset six emotions UL prosody CL prosody UL spectral CL spectral Combined not be directly comparable because it is unclear how the earlier studies accounted for imbalance between emotion classes or how cross-validation folds were formed One-versus-all emotion recognition Berlin dataset six emotions In the second experiment, we performed recognition of each of the six basic emotions versus the other five emotions. For example, one of the tasks was to recognize if an utterance conveys sadness versus some other emotion among anger, fear, disgust, happy and neutral. The balanced accuracy of one-versus-all classification on LDC and Berlin datasets is shown in Tables 4 and 5 for sets of features with different types (prosodic and spectral) and granularity levels (utterance-level and class-level). Recognition accuracy changes with respect to granularity for both prosodic and spectral features. Our results indicate that the class-level prosodic features do not provide any consistent improvement over the utterance-level features. This is not surprising since prosodic features are suprasegmental. On the other hand, class-level spectral features provide a consistent performance improvement over the utterancelevel spectral features in most of the cases with exception of recognition of disgust and happiness in LDC and neutral in Berlin database. For example, the absolute performance gain is as high as 13.9% for recognition of disgust in Berlin dataset. Class-level spectral features also yield noticeably higher emotion recognition accuracy compared to utterance-level prosodic features for most of the emotions. For instance, the absolute improvements in recognition accuracy of neutral for LDC and disgust for Berlin datasets are 30.3% and 25.6% respectively. The only exceptions are recognition of fear and happiness in the Berlin dataset, where prosodic features lead to improvements over spectral features of 3.1% and 5.9% respectively. Moreover, the combination of the class-level spectral and the utterance-level prosodic features yields even further improvements in some cases. In other cases, the combined set of features yields classification accuracy which is lower than accuracy of either utterance-level prosodic or class-level spectral features. We believe that this is due to high dimensionality of the combined feature set. We test several feature selection algorithms in Section 7 in order to improve the performance of both class-level spectral and combined features. 6. Utterance-length dependence Multi-class emotion recognition results presented in Table 2 indicate that the overall accuracy of emotion recognition obtained on Berlin dataset is much higher than the one on LDC dataset. Besides differences in language and recording scenarios between the two datasets, better separation between emotions can be attributed to the fact that Berlin dataset contains longer utterances. To the best of our knowledge, effects of the utterance length on emotion recognition accuracy have not been explored in the literature. Table 4 Accuracy of one-versus-all classification for LDC dataset using prosodic and spectral features with different levels of granularity: utterance-level (UL) and class-level (CL). 
Best performance is shown in bold. Anger Fear Disgust Happy Sadness Neutral UL prosody CL prosody UL spectral CL spectral Combined Table 5 Accuracy of one-versus-all classification for Berlin dataset using prosodic and spectral features with different levels of granularity: utterance-level (UL) and class-level (CL). Best performance is shown in bold. Anger Fear Disgust Happy Sadness Neutral UL prosody CL prosody UL spectral CL spectral Combined
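The one-versus-all tasks reported in Tables 4 and 5 amount to a binary re-labelling of the data followed by the same LOSO/BAC evaluation. A minimal sketch, reusing the loso_evaluate helper from the earlier example; the string labels are assumptions about the data layout.

```python
import numpy as np

def one_versus_all_bac(X, y, speakers, target_emotion):
    """Binary task: `target_emotion` versus the other five emotions, evaluated
    with the loso_evaluate helper defined in the earlier sketch."""
    y = np.asarray(y)
    y_binary = np.where(y == target_emotion, target_emotion, "other")
    return loso_evaluate(X, y_binary, speakers)

# e.g. one_versus_all_bac(X, y, speakers, "sadness") for each of the six emotions
```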

In order to investigate how emotion recognition performance depends on the utterance length, we constructed longer speech segments by concatenating utterances from the LDC dataset. We built four additional synthetic datasets with progressively longer speech segments by concatenating together 2, 3, 4 and 5 randomly chosen utterances produced by the same actor. While utterances in the LDC dataset contain four-syllable phrases, the four synthetic datasets contained speech segments with 8, 12, 16 and 20 syllables respectively. For example, in order to build a synthetic dataset consisting of eight-syllable segments, we randomly split the set of utterances produced by the same speaker into pairs and concatenated them. Since each utterance from the original dataset was used only once, the synthetic datasets contained fewer utterances than the original LDC dataset.

We calculated speaker-independent emotion recognition accuracy of the six basic emotions for each of the synthetic datasets. Each of the datasets was split into the training and test sets using the LOSO procedure described in Section 5. Fig. 2 shows recognition accuracy as a function of the number of syllables in the utterances. While the performance of utterance-level prosodic features does not noticeably change with the utterance length, accuracy of the class-level spectral features increases as longer utterances are used. We would like to point out that, since our results are obtained on synthetic datasets, these predictions will not necessarily apply to naturally spoken emotional utterances. It would be interesting to further investigate how emotion recognition performance depends on utterance length using emotional speech corpora rich in utterances of different durations in order to cross-validate our findings.

Fig. 2. Dependence of emotion recognition accuracy on utterance length.

7. Feature selection

The high dimensionality of class-level spectral feature vectors as well as the presence of highly correlated features in the combined set of prosodic and spectral features can negatively affect the performance of machine learning algorithms such as, for example, the SVM classifiers used in this paper. Moreover, multi-class emotion classification results on the LDC dataset (Table 2) indicate that the combination of prosodic and spectral features performs worse than spectral features used alone, which might be due to the presence of irrelevant or highly correlated features. Emotion classification accuracy of the class-level spectral features and the combined features can be improved by using feature selection algorithms which aim to find a lower-dimensional subset of features that yields better classification performance. In this section, we apply feature selection algorithms in component-wise and group-wise settings. In the component-wise case, our goal is to select an optimal set of individual features by performing a greedy search over the set of all possible feature combinations. On the other hand, group-wise selection aims to find the best combination of feature subgroups such as prosodic and spectral features defined over each phoneme class.
While, in principle, component-based feature selection should achieve better classification accuracy, group-wise selection and ranking can help us to understand how different types of features contribute to emotion differentiation Component-wise feature selection Component-wise feature selection methods rely on searching through all possible subsets of features and fall into either wrapper or filter categories based on the type of criteria used to evaluate subsets of features. While wrapper approaches search for a subset of features by maximizing accuracy of a classifier on a hold-out subset of the training data, filter methods perform selection of features as a pre-processing step independent of any particular classification approach. In this paper, we used wrapper methods which maximize the accuracy of linear SVM classifiers. We also used filter methods such as subset evaluation (Hall and Smith, 1997) based on correlation measures and information gain ratio (Hall and Smith, 1998). Since exhaustive evaluation of all possible subsets of features is computationally prohibitive, we employed greedy search algorithms. Greedy stepwise search starts with the full set of features and iteratively removes individual features until the objective criterion can no longer be improved. Rank search uses a ranking of features based on the gain ratio metric in order to evaluate feature subset of increasing size which are constructed by iterative addition of the best ranked features. In the case of information gain ratio selection criterion, we did not use any search algorithm and simply selected features with positive information gain ratio values. For each LOSO fold, all utterances from one of the speakers in the training set were sequentially assigned to the hold-out set used for evaluation and the rest of utterances were used to train linear SVM classifiers for wrapper-based feature selection. This process was repeated for each speaker in the training set and selected features from each iteration were combined into a single set of selected

12 622 D. Bitouk et al. / Speech Communication 52 (2010) features. It should be noted that different sets of features were selected in different LOSO folds. In order to compare different feature selection algorithms, we calculated their balanced accuracy over LDC and Berlin datasets by combining classifier predictions from individual LOSO folds. We applied feature selection algorithms to utterance-level prosodic, class-level spectral and combined sets of features. Tables 6 and 7 compare emotion recognition of different wrapper and filter feature selection algorithms used in this paper. Wrapper selection utilizing greedy stepwise search improves the accuracy of utterance-level prosodic and class-level spectral features in LDC dataset by 1.0% and 2.5% respectively. However, none of the feature selection algorithms provide any noticeable improvement for the combination of prosodic and spectral features in LDC dataset. On the other hand, the best results in Berlin dataset are achieved by filter selection based on the information gain ratio yielding modest improvements of 1.8% and 0.3% for utterance-level prosodic and class-level spectral features, and 0.9% improvement for their combination. Despite the high dimensionality of class-level spectral features, none of the feature selection methods tested in this paper lead to clear performance gains for either Berlin or LDC datasets Group-wise feature selection and ranking While component-wise feature selection searches through all possible combinations of individual features, the list of selected features is difficult to analyze. Instead, one may be interested in investigating combinations and contributions from features defined over different phonetic groups rather than looking at individual feature components such as filterbank responses. In this section, we perform group-wise feature selection and ranking by focusing on subgroups of class-level spectral features. We split the set of class-level spectral features into consonant, stressed vowel and unstressed vowel subgroups. Then the combined features consist of four subgroups which include consonant, stressed vowel and unstressed vowel spectral subgroups as well as the group of prosodic features. Our goal is to find a combination of feature subgroups which maximizes emotion classification accuracy. For the small number of groups in this case, such a combination can be found by performing a brute force search among all possible subgroup combinations. Group-wise selection follows wrapper feature selection procedure described in Section 7.1. However, instead of using a greedy search to find an optimal combination of subgroups, we evaluate classifier performance on all possible subgroup combinations and choose the one which yields the best classification accuracy. Tables 8 and 9 show accuracy of recognizing the six basic emotions using group-wise feature selection for LDC and Berlin datasets. While group-wise selection improves classification accuracy of class-level spectral features by 1.6% and 1.2% in LDC and Berlin datasets respectively, feature subgroups selected from the combined set of utterance-level prosodic and class-level spectral features in Berlin dataset performs worse than the entire combined set. Table 10 shows how often each subgroup was selected in all of the LOSO folds in LDC and Berlin datasets. For example, prosodic subgroup was selected in one out of seven LOSO folds in LDC and 10 out of 10 folds in Berlin dataset. 
On the other hand, spectral features derived from the stressed vowel regions were always selected in LDC and only in five out of 10 folds in Berlin datasets. Spectral features derived from unstressed vowels were chosen in three out of seven folds in LDC and nine out of 10 folds in Berlin dataset. While selection frequency of prosodic and vowel spectral features varied between both datasets, spectral features derived from consonant regions were chosen for all of LOSO fold in both datasets. We believe that this is due to the fact that consonant spectral features are always complementary to prosodic and vowel spectral subgroups. Table 6 Multi-class emotion classification rates with feature selection for six emotion recognition in LDC datasets. W/o selection Rank search SVM wrapper Rank search subset eval. Greedy stepwise SVM wrapper UL prosody CL spectral Combined Info gain ratio Table 7 Multi-class emotion classification rates with feature selection for six emotion recognition in Berlin dataset. W/o selection Rank search SVM wrapper Rank search subset eval. Greedy stepwise SVM wrapper UL prosody CL spectral Combined Info gain ratio
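The group-wise search described above evaluates every combination of the four feature subgroups (consonant, stressed vowel and unstressed vowel spectral features, plus prosodic features) and keeps the best one. A minimal sketch, assuming each subgroup is available as a list of column indices into the combined feature matrix and reusing the loso_evaluate helper from the earlier example; the example index ranges are illustrative only.

```python
from itertools import combinations

import numpy as np

def best_subgroup_combination(X, y, speakers, subgroups):
    """subgroups: dict name -> list of column indices into X (assumed layout).
    Tries every non-empty combination of subgroups and returns the one with the
    highest balanced accuracy under the loso_evaluate helper defined earlier."""
    names = list(subgroups)
    best_combo, best_bac = None, -1.0
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            cols = np.concatenate([np.asarray(subgroups[n]) for n in combo])
            bac = loso_evaluate(X[:, cols], y, speakers)
            if bac > best_bac:
                best_combo, best_bac = combo, bac
    return best_combo, best_bac

# Example subgroup layout (illustrative indices only):
# subgroups = {"consonant": range(0, 80), "stressed_vowel": range(80, 160),
#              "unstressed_vowel": range(160, 237), "prosodic": range(237, 261)}
```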


ADVANCES IN DEEP NEURAL NETWORK APPROACHES TO SPEAKER RECOGNITION ADVANCES IN DEEP NEURAL NETWORK APPROACHES TO SPEAKER RECOGNITION Mitchell McLaren 1, Yun Lei 1, Luciana Ferrer 2 1 Speech Technology and Research Laboratory, SRI International, California, USA 2 Departamento

More information

Speaker recognition using universal background model on YOHO database

Speaker recognition using universal background model on YOHO database Aalborg University Master Thesis project Speaker recognition using universal background model on YOHO database Author: Alexandre Majetniak Supervisor: Zheng-Hua Tan May 31, 2011 The Faculties of Engineering,

More information

The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access

The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access Joyce McDonough 1, Heike Lenhert-LeHouiller 1, Neil Bardhan 2 1 Linguistics

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

Rachel E. Baker, Ann R. Bradlow. Northwestern University, Evanston, IL, USA

Rachel E. Baker, Ann R. Bradlow. Northwestern University, Evanston, IL, USA LANGUAGE AND SPEECH, 2009, 52 (4), 391 413 391 Variability in Word Duration as a Function of Probability, Speech Style, and Prosody Rachel E. Baker, Ann R. Bradlow Northwestern University, Evanston, IL,

More information

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation SLSP-2016 October 11-12 Natalia Tomashenko 1,2,3 natalia.tomashenko@univ-lemans.fr Yuri Khokhlov 3 khokhlov@speechpro.com Yannick

More information

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

More information

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language Z.HACHKAR 1,3, A. FARCHI 2, B.MOUNIR 1, J. EL ABBADI 3 1 Ecole Supérieure de Technologie, Safi, Morocco. zhachkar2000@yahoo.fr.

More information

Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm

Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm Prof. Ch.Srinivasa Kumar Prof. and Head of department. Electronics and communication Nalanda Institute

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

Automatic intonation assessment for computer aided language learning

Automatic intonation assessment for computer aided language learning Available online at www.sciencedirect.com Speech Communication 52 (2010) 254 267 www.elsevier.com/locate/specom Automatic intonation assessment for computer aided language learning Juan Pablo Arias a,

More information

Automatic Pronunciation Checker

Automatic Pronunciation Checker Institut für Technische Informatik und Kommunikationsnetze Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich Ecole polytechnique fédérale de Zurich Politecnico federale

More information

Expressive speech synthesis: a review

Expressive speech synthesis: a review Int J Speech Technol (2013) 16:237 260 DOI 10.1007/s10772-012-9180-2 Expressive speech synthesis: a review D. Govind S.R. Mahadeva Prasanna Received: 31 May 2012 / Accepted: 11 October 2012 / Published

More information

Speaker Recognition. Speaker Diarization and Identification

Speaker Recognition. Speaker Diarization and Identification Speaker Recognition Speaker Diarization and Identification A dissertation submitted to the University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

More information

Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology

Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology ISCA Archive SUBJECTIVE EVALUATION FOR HMM-BASED SPEECH-TO-LIP MOVEMENT SYNTHESIS Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano Graduate School of Information Science, Nara Institute of Science & Technology

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Speech Communication Session 2aSC: Linking Perception and Production

More information

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction INTERSPEECH 2015 Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction Akihiro Abe, Kazumasa Yamamoto, Seiichi Nakagawa Department of Computer

More information

A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK. Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren

A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK. Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren Speech Technology and Research Laboratory, SRI International,

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

Quarterly Progress and Status Report. Voiced-voiceless distinction in alaryngeal speech - acoustic and articula

Quarterly Progress and Status Report. Voiced-voiceless distinction in alaryngeal speech - acoustic and articula Dept. for Speech, Music and Hearing Quarterly Progress and Status Report Voiced-voiceless distinction in alaryngeal speech - acoustic and articula Nord, L. and Hammarberg, B. and Lundström, E. journal:

More information

Learning From the Past with Experiment Databases

Learning From the Past with Experiment Databases Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

More information

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH Don McAllaster, Larry Gillick, Francesco Scattone, Mike Newman Dragon Systems, Inc. 320 Nevada Street Newton, MA 02160

More information

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models Navdeep Jaitly 1, Vincent Vanhoucke 2, Geoffrey Hinton 1,2 1 University of Toronto 2 Google Inc. ndjaitly@cs.toronto.edu,

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Atypical Prosodic Structure as an Indicator of Reading Level and Text Difficulty

Atypical Prosodic Structure as an Indicator of Reading Level and Text Difficulty Atypical Prosodic Structure as an Indicator of Reading Level and Text Difficulty Julie Medero and Mari Ostendorf Electrical Engineering Department University of Washington Seattle, WA 98195 USA {jmedero,ostendor}@uw.edu

More information

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS Elliot Singer and Douglas Reynolds Massachusetts Institute of Technology Lincoln Laboratory {es,dar}@ll.mit.edu ABSTRACT

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Investigation on Mandarin Broadcast News Speech Recognition

Investigation on Mandarin Broadcast News Speech Recognition Investigation on Mandarin Broadcast News Speech Recognition Mei-Yuh Hwang 1, Xin Lei 1, Wen Wang 2, Takahiro Shinozaki 1 1 Univ. of Washington, Dept. of Electrical Engineering, Seattle, WA 98195 USA 2

More information

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING Gábor Gosztolya 1, Tamás Grósz 1, László Tóth 1, David Imseng 2 1 MTA-SZTE Research Group on Artificial

More information

Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition

Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition Seltzer, M.L.; Raj, B.; Stern, R.M. TR2004-088 December 2004 Abstract

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH 2009 423 Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition George

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Yanzhang He, Eric Fosler-Lussier Department of Computer Science and Engineering The hio

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

Support Vector Machines for Speaker and Language Recognition

Support Vector Machines for Speaker and Language Recognition Support Vector Machines for Speaker and Language Recognition W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, P. A. Torres-Carrasquillo MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

Revisiting the role of prosody in early language acquisition. Megha Sundara UCLA Phonetics Lab

Revisiting the role of prosody in early language acquisition. Megha Sundara UCLA Phonetics Lab Revisiting the role of prosody in early language acquisition Megha Sundara UCLA Phonetics Lab Outline Part I: Intonation has a role in language discrimination Part II: Do English-learning infants have

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence INTERSPEECH September,, San Francisco, USA Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence Bidisha Sharma and S. R. Mahadeva Prasanna Department of Electronics

More information

INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT

INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT Takuya Yoshioka,, Anton Ragni, Mark J. F. Gales Cambridge University Engineering Department, Cambridge, UK NTT Communication

More information

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,

More information

Automatic Speaker Recognition: Modelling, Feature Extraction and Effects of Clinical Environment

Automatic Speaker Recognition: Modelling, Feature Extraction and Effects of Clinical Environment Automatic Speaker Recognition: Modelling, Feature Extraction and Effects of Clinical Environment A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy Sheeraz Memon

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

An Online Handwriting Recognition System For Turkish

An Online Handwriting Recognition System For Turkish An Online Handwriting Recognition System For Turkish Esra Vural, Hakan Erdogan, Kemal Oflazer, Berrin Yanikoglu Sabanci University, Tuzla, Istanbul, Turkey 34956 ABSTRACT Despite recent developments in

More information

On the Formation of Phoneme Categories in DNN Acoustic Models

On the Formation of Phoneme Categories in DNN Acoustic Models On the Formation of Phoneme Categories in DNN Acoustic Models Tasha Nagamine Department of Electrical Engineering, Columbia University T. Nagamine Motivation Large performance gap between humans and state-

More information

Calibration of Confidence Measures in Speech Recognition

Calibration of Confidence Measures in Speech Recognition Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE

More information

The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

More information

Edinburgh Research Explorer

Edinburgh Research Explorer Edinburgh Research Explorer Personalising speech-to-speech translation Citation for published version: Dines, J, Liang, H, Saheer, L, Gibson, M, Byrne, W, Oura, K, Tokuda, K, Yamagishi, J, King, S, Wester,

More information

Voice conversion through vector quantization

Voice conversion through vector quantization J. Acoust. Soc. Jpn.(E)11, 2 (1990) Voice conversion through vector quantization Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara A TR Interpreting Telephony Research Laboratories,

More information

Rhythm-typology revisited.

Rhythm-typology revisited. DFG Project BA 737/1: "Cross-language and individual differences in the production and perception of syllabic prominence. Rhythm-typology revisited." Rhythm-typology revisited. B. Andreeva & W. Barry Jacques

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

1. REFLEXES: Ask questions about coughing, swallowing, of water as fast as possible (note! Not suitable for all

1. REFLEXES: Ask questions about coughing, swallowing, of water as fast as possible (note! Not suitable for all Human Communication Science Chandler House, 2 Wakefield Street London WC1N 1PF http://www.hcs.ucl.ac.uk/ ACOUSTICS OF SPEECH INTELLIGIBILITY IN DYSARTHRIA EUROPEAN MASTER S S IN CLINICAL LINGUISTICS UNIVERSITY

More information

Journal of Phonetics

Journal of Phonetics Journal of Phonetics 41 (2013) 297 306 Contents lists available at SciVerse ScienceDirect Journal of Phonetics journal homepage: www.elsevier.com/locate/phonetics The role of intonation in language and

More information

ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS

ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS Annamaria Mesaros 1, Toni Heittola 1, Antti Eronen 2, Tuomas Virtanen 1 1 Department of Signal Processing Tampere University of Technology Korkeakoulunkatu

More information

Running head: DELAY AND PROSPECTIVE MEMORY 1

Running head: DELAY AND PROSPECTIVE MEMORY 1 Running head: DELAY AND PROSPECTIVE MEMORY 1 In Press at Memory & Cognition Effects of Delay of Prospective Memory Cues in an Ongoing Task on Prospective Memory Task Performance Dawn M. McBride, Jaclyn

More information

Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation

Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Alistair Conkie AT&T abs - Research 180 Park Avenue, Florham Park,

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Perceptual scaling of voice identity: common dimensions for different vowels and speakers

Perceptual scaling of voice identity: common dimensions for different vowels and speakers DOI 10.1007/s00426-008-0185-z ORIGINAL ARTICLE Perceptual scaling of voice identity: common dimensions for different vowels and speakers Oliver Baumann Æ Pascal Belin Received: 15 February 2008 / Accepted:

More information

The Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh

The Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh The Effect of Discourse Markers on the Speaking Production of EFL Students Iman Moradimanesh Abstract The research aimed at investigating the relationship between discourse markers (DMs) and a special

More information

Quarterly Progress and Status Report. VCV-sequencies in a preliminary text-to-speech system for female speech

Quarterly Progress and Status Report. VCV-sequencies in a preliminary text-to-speech system for female speech Dept. for Speech, Music and Hearing Quarterly Progress and Status Report VCV-sequencies in a preliminary text-to-speech system for female speech Karlsson, I. and Neovius, L. journal: STL-QPSR volume: 35

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation Taufiq Hasan Gang Liu Seyed Omid Sadjadi Navid Shokouhi The CRSS SRE Team John H.L. Hansen Keith W. Godin Abhinav Misra Ali Ziaei Hynek Bořil

More information

SEGMENTAL FEATURES IN SPONTANEOUS AND READ-ALOUD FINNISH

SEGMENTAL FEATURES IN SPONTANEOUS AND READ-ALOUD FINNISH SEGMENTAL FEATURES IN SPONTANEOUS AND READ-ALOUD FINNISH Mietta Lennes Most of the phonetic knowledge that is currently available on spoken Finnish is based on clearly pronounced speech: either readaloud

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Using EEG to Improve Massive Open Online Courses Feedback Interaction

Using EEG to Improve Massive Open Online Courses Feedback Interaction Using EEG to Improve Massive Open Online Courses Feedback Interaction Haohan Wang, Yiwei Li, Xiaobo Hu, Yucong Yang, Zhu Meng, Kai-min Chang Language Technologies Institute School of Computer Science Carnegie

More information

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Wilny Wilson.P M.Tech Computer Science Student Thejus Engineering College Thrissur, India. Sindhu.S Computer

More information

Unsupervised Acoustic Model Training for Simultaneous Lecture Translation in Incremental and Batch Mode

Unsupervised Acoustic Model Training for Simultaneous Lecture Translation in Incremental and Batch Mode Unsupervised Acoustic Model Training for Simultaneous Lecture Translation in Incremental and Batch Mode Diploma Thesis of Michael Heck At the Department of Informatics Karlsruhe Institute of Technology

More information

Phonological and Phonetic Representations: The Case of Neutralization

Phonological and Phonetic Representations: The Case of Neutralization Phonological and Phonetic Representations: The Case of Neutralization Allard Jongman University of Kansas 1. Introduction The present paper focuses on the phenomenon of phonological neutralization to consider

More information

A comparison of spectral smoothing methods for segment concatenation based speech synthesis

A comparison of spectral smoothing methods for segment concatenation based speech synthesis D.T. Chappell, J.H.L. Hansen, "Spectral Smoothing for Speech Segment Concatenation, Speech Communication, Volume 36, Issues 3-4, March 2002, Pages 343-373. A comparison of spectral smoothing methods for

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and

More information

A new Dataset of Telephone-Based Human-Human Call-Center Interaction with Emotional Evaluation

A new Dataset of Telephone-Based Human-Human Call-Center Interaction with Emotional Evaluation A new Dataset of Telephone-Based Human-Human Call-Center Interaction with Emotional Evaluation Ingo Siegert 1, Kerstin Ohnemus 2 1 Cognitive Systems Group, Institute for Information Technology and Communications

More information