
Applications of Speech Technology: Biometrics

Doroteo Torre Toledano, Joaquín González-Rodríguez, Javier González Domínguez and Javier Ortega García
ATVS Biometric Recognition Group, Universidad Autónoma de Madrid, Spain

Abstract

The field of biometrics, or more precisely biometric authentication, refers to the discipline of identifying people by physical, chemical or behavioral characteristics, and has emerged in recent years as an important field due to its wide application. Although voice biometrics was among the first forms of biometrics considered due to its naturalness, voice biometric deployments are still a small proportion relative to other biometric applications, probably because most current biometric deployments are on-site, that is, the person is where the fingerprint, iris or signature is acquired. Voice, on the other hand, is by far the most adequate biometric modality for remote authentication because of its convenience (voice communication is available everywhere and worldwide through mobile, landline and VoIP phones) and the high reliability that current state-of-the-art applications show. This course on voice biometrics starts by presenting the different sources of identity information that can be found in the speech signal and the different technologies to extract and take advantage of them. The course then reviews current technology used both for text-dependent and text-independent speaker recognition, and potential fields of application for these technologies.

Index Terms: Biometrics, voice biometrics, speaker recognition.

I. INTRODUCTION

Recent data on mobile phone users all over the world, the number of telephone landlines in operation, and recent VoIP (Voice over IP) network deployments confirm that voice is the most accessible biometric trait, as no extra acquisition device or transmission system is needed. This fact gives voice an overwhelming advantage over other biometric traits, especially when remote users or systems are taken into account. However, the voice trait is related not only to personal characteristics, but also to many environmental and sociolinguistic variables, as voice generation is the result of an extremely complex process. Thus, the transmitted voice will embed a degraded version of speaker specificities and will be influenced by many contextual variables that are difficult to deal with. Fortunately, state-of-the-art technologies and applications are presently able to compensate for all those sources of variability, allowing for efficient and reliable value-added applications that perform remote authentication or voice detection based just on telephone-transmitted voice signals [56], [23].

A. Applications

Due to the pervasiveness of voice signals, the range of possible applications of voice biometrics is wider than for other usual biometric traits. We can distinguish three major types of applications which take advantage of the biometric information present in the speech signal:
- Voice authentication (access control, typically remote by phone) and background recognition (natural voice checking) [14].
- Speaker detection (e.g. blacklist detection in call centers, or wiretapping and surveillance), also known as speaker spotting.
- Forensic speaker recognition (use of the voice as evidence in courts of law or as intelligence in police investigations) [60].
These applications will be addressed in section VI.

B. Technology

The main source of information encoded in the voice signal is undoubtedly the linguistic content.
For that reason it is not surprising that, depending on how the linguistic content is used or controlled, we can distinguish two very different types of speaker recognition technologies with different potential applications. Firstly, text-dependent technologies, where the user is required to utter a specific key-phrase (e.g., "Open, Sesame") or a specific sequence, have been the major subject of biometric access control and voice authentication applications [55], [23]. The security level of password-based systems can then be enhanced by requiring knowledge of the password and also requiring the true owner of the password to utter it. In order to counter possible replay of recordings of true passwords, text-dependent systems can be enhanced to ask for random prompts, unexpected to the caller, which cannot be easily fabricated by an impostor. All the technological details related to text-dependent speaker recognition and its applications are addressed in section IV.

The second type of speaker recognition technologies are those known as text-independent. They are the driving factor of the remaining two types of applications, namely speaker detection and forensic speaker recognition. Since the linguistic content is the main source of information encoded in the speech, text-independence has been a major challenge and the main subject of research of the speaker recognition community in the last two decades. The NIST SRE (Speaker Recognition Evaluations), conducted yearly since 1996 [48], [56], have fostered excellence in research in this area, with extraordinary progress obtained year by year, based on blind evaluation with common databases and protocols, and especially on the sharing of information among participants in the follow-up workshop after each evaluation. Text-independent systems, including technological details and applications, will be addressed in detail in section V.

II. IDENTITY INFORMATION IN THE SPEECH SIGNAL

In this section we will deal with how speaker specificities are embedded into the speech signal. Speech production is an extremely complex process whose result depends on many variables at different levels, ranging from sociolinguistic factors (e.g. level of education, linguistic context and dialectal differences) to physiological issues (e.g. vocal tract length, shape and tissues, and the dynamic configuration of the articulatory organs). These multiple influences are simultaneously present in each speech act, and some or all of them will contain specificities of the speaker. For that reason, we need to clarify and clearly distinguish the different levels and sources of speaker information that we should be able to extract in order to model speaker individualities.

A. Language generation and speech production

The process by which humans construct a language-coded message has been the subject of study for years in the area of psycholinguistics. But once the message has been coded in the human brain, a complex physiological and articulatory process is still needed to finally produce a speech waveform (the voice) that contains the linguistic message (as well as many other sources of information, one of which is the speaker identity) encoded as a combination of temporal-spectral characteristics. This process is the subject of study of phoneticians and of other speech-analysis-related professionals (engineers, physicians, etc.). Details on language generation and speech production can be found in [68], [38], [59]. The speech production process is very complex and would deserve a whole book by itself, but we are here interested in those aspects related to the encoding of individual information in the final speech signal. In both stages of voice production (language generation and speech production), speaker specificities are introduced. In the field of voice biometrics, also known as speaker recognition, these two components correspond to what are usually known as high-level (linguistic) and low-level (acoustic) characteristics.

B. Identity information levels in the speech signal

Experiments with human listeners have shown, as our own experience tells us, that humans recognize speakers by a combination of different information levels and, what is especially important, with different weights for different speakers (e.g. one speaker can show very characteristic pitch contours, while another can have a strong nasalization which makes them sound different).
Automatic systems should try to take advantage of the different sources of information available, combining them in the best possible way for every speaker [22]. Idiolectal characteristics of a speaker [18] are at the highest level usually taken into account by the technology to date, and describe how a speaker uses a specific linguistic system. This use is determined by a multitude of factors, some of them quite stable in adults, such as level of education, sociological and family conditions, and town of origin. But there are also some high-level factors which are highly dependent on the environment: for example, a male doctor does not use language in the same way when talking with his colleagues at the hospital (sociolects), with his family at home, or with his friends playing cards. We will describe idiolectal recognition of speakers in more detail in section V-B, taking advantage of the frequency of use of different linguistic patterns, which are extracted as shown in section III-C. As a second major group of characteristics, going down towards lower information levels in the speech signal, we find phonotactics [13], which describe the use by each speaker of the available phone units and their possible realizations. Phonotactics are essential for the correct use of a language, and key in foreign language learning, but when we look into phonotactic speaker specificities we can find usage patterns that distinguish one speaker from another. The use of phonotactics for automatic speaker recognition is fully described in section V-C, while the extraction of features for these systems is described in section III-C. In a third group we find prosody, which is the combination of instantaneous energy, intonation, speech rate and unit durations that provides speech with naturalness, full sense, and emotional tone. Prosody determines prosodic objectives at the phrase and discourse level, and defines instantaneous actions to comply with those objectives. It helps to clarify the message ("nine hundred twenty seven" can be distinguished as a single number or as separate figures by means of prosody), the type of message (declarative, interrogative, imperative), or the state of mind of the speaker. But in the way each speaker uses the different prosodic elements, many speaker specificities are included, such as, for example, characteristic pitch contours at the start and end of a phrase or accent group.

The automatic extraction of pitch and energy information is described in section III-D, while the use of prosodic features to automatically recognize speakers is described in section V-D. Finally, at the lowest level, we find the short-term acoustic/spectral characteristics of the speech signal, directly related to the individual articulatory actions for each phone being produced, and also to the individual physiological configuration of the speech production apparatus. This spectral information has been the main source of individuality in speech used in actual applications, and the main focus of research for almost twenty years [61], [75], [11]. Spectral information intends to capture the peculiarities of speakers' vocal tracts and their respective articulation dynamics. Two types of low-level information have typically been used: static information related to each analysis frame, and dynamic information related to how this information evolves across adjacent frames, taking into account the strongly speaker-dependent phenomenon of co-articulation, the process by which an individual dynamically moves from one articulation position to the next. Details on short-term analysis and parameterization are given in sections III-A and III-B, while short-term spectral systems are described in section V-A.

III. FEATURE EXTRACTION AND TOKENIZATION

The first step in the construction of automatic speaker recognition systems is the reliable extraction of features that contain identifying information of interest. In this section, we briefly show the procedures used to extract both short-term feature vectors (spectral information, energy, pitch) and mid-term and long-term features such as phones, syllables and words.

A. Short-term analysis

In order to perform reliable spectral analysis, signals must show stationary properties, which are not easy to observe in constantly changing speech signals. However, if we restrict our analysis window to short lengths between 20 and 40 ms, our articulatory system is not able to change significantly in such a short time frame, and we obtain what are usually called pseudo-stationary signals per frame. This process is depicted in figure 1. Due to this pseudo-stationarity, each windowed signal can be assumed to come from a specific LTI (linear time-invariant) system for that frame, and then we can perform, usually after applying some kind of cosine-like windowing such as Hamming or Hanning, spectral analysis over this short-term window, obtaining spectral envelopes that change frame by frame [59], [38].

Fig. 1. Short-term analysis and parameterization of a speech signal.

B. Spectral feature extraction

These short-time Hamming/Hanning-windowed signals contain all of the desired temporal/spectral information, albeit at a high bit rate (e.g. telephone speech digitized at a sampling frequency of 8 kHz in a 32 ms window means 256 samples x 16 bits/sample = 4096 bits = 512 bytes per frame). Linear Predictive Coding (LPC) of speech has proved to be a valid way to compress the spectral envelope into an all-pole model (valid for all non-nasal sounds, and still a good approximation for nasal sounds) with just 10 to 16 coefficients, which means that the spectral information in a frame can be represented in about 50 bytes, around 10% of the original bit rate.
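The framing and windowing procedure just described is compact enough to sketch directly. The following is a minimal illustration in Python, not the parameterization of any specific system mentioned here; the 32 ms frame length matches the telephone-speech example above, while the 10 ms hop is an assumed typical value.

```python
import numpy as np

def short_term_spectra(signal, sample_rate=8000, frame_ms=32, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)        # 256 samples at 8 kHz
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)                        # cosine-like taper
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window  # pseudo-stationary frame
        magnitude = np.abs(np.fft.rfft(frame))            # short-term spectrum
        spectra.append(20 * np.log10(magnitude + 1e-10))  # log envelope in dB
    return np.array(spectra)                              # (n_frames, frame_len//2 + 1)

# Example on a synthetic signal standing in for one second of speech
spectra = short_term_spectra(np.random.default_rng(0).standard_normal(8000))
```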
Instead of LPC coefficients, which are highly correlated among themselves (covariance matrix far from diagonal), pseudo-orthogonal cepstral coefficients are usually used, either derived from the LPC coefficients as in LPCC (LPC-derived Cepstral vectors), or obtained directly from a perceptually-based Mel filterbank spectral analysis as in MFCC (Mel-Frequency Cepstral Coefficients). In recent years it has also become quite common to use a perceptually motivated variation of LPCC called PLP (Perceptual Linear Prediction) [36].
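As an illustration of the spectral features described above, the following sketch extracts MFCC vectors with the open-source librosa library and appends the delta and delta-delta coefficients discussed in the next paragraph; the synthetic waveform and the choice of 13 cepstral coefficients are assumptions of the example, and the mean subtraction shown anticipates the channel compensation described next.

```python
import numpy as np
import librosa

# Stand-in for a real speech waveform (2 s at 8 kHz telephone bandwidth)
sr = 8000
y = np.random.default_rng(0).standard_normal(2 * sr).astype(np.float32)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames) static cepstra

# CMS: an invariant channel adds a constant cepstral offset, removed by
# subtracting the per-utterance mean of each coefficient (see next paragraph)
mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)

# Velocity and acceleration coefficients estimated across adjacent frames
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])          # (39, n_frames)
```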

By far, one of the main factors of speech variability comes from the use of different transmission channels (e.g. testing telephone speech against microphone-recorded speaker models). Cepstral representation also has the advantage that an invariant channel adds a constant cepstral offset that can easily be subtracted (CMS, Cepstral Mean Subtraction), and non-speech cepstral components can also be eliminated, as done in RASTA filtering of instantaneous cepstral vectors [37]. In order to take coarticulation into account, delta (velocity) and delta-delta (acceleration) coefficients are obtained from the static window-based information, computing an estimate of how each frame coefficient varies across adjacent windows (typically over a span of ±3 frames, no more than ±5).

C. Phonetic and lexical feature extraction

Hidden Markov Models (HMMs) [58] are the most successful and widely used tool (with the exception of some ANN architectures [53]) for phonetic, syllable and word tokenization, that is, the translation of sampled speech into a time-aligned sequence of linguistic units. Left-to-right HMMs are state machines which statistically model pseudo-stationary pieces of speech (states) and the transitions between states (forced left-to-right, keeping a temporal sense), trying to imitate somehow the movements of our articulatory organs, which tend to rest (in all non-plosive sounds) in articulatory positions (assumed as pseudo-stationary states) and continuously move (transition) from one state to the following. Presently, most HMMs model the information in each state with continuous probability density functions, typically mixtures of Gaussians. This particular kind of model is usually known as CDHMM (Continuous Density HMM, as opposed to the former VQ-based Discrete Density HMMs). HMM training is usually done through Baum-Welch estimation, while decoding and time alignment are usually performed through Viterbi decoding. The performance of those spectral-only HMMs is improved by the use of language models, which impose linguistic or grammatical constraints on the infinite combinations of all possible units. For increased efficiency, pruning of the beam search is also a generalized mechanism to significantly accelerate the recognition process with little or no degradation in performance.

D. Prosodic feature extraction

Basic prosodic features such as pitch and energy are also obtained at the frame level. The window energy is very easily obtained in either temporal or spectral form, and the instantaneous pitch can be determined by, e.g., autocorrelation or cepstral-decomposition based methods, usually smoothed with some time filtering [59]. Other important prosodic features are those related to the duration of linguistic units, speech rate, and all those related to accent. In all those cases, precise segmentation is required, marking the syllable positions and the energy and pitch contours, in order to detect accent positions and phrase or speech-turn markers. Phonetic and syllabic segmentation of speech is a complex issue that is far from solved [73], and although it can be useful for speaker recognition [1], prosodic systems do not always require such a detailed segmentation [20].
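A minimal sketch of the frame-level prosodic features just described, assuming a pre-extracted frame as in section III-A; the autocorrelation peak search and the voicing threshold are simplified illustrative choices, not a production pitch tracker.

```python
import numpy as np

def frame_energy_and_pitch(frame, sample_rate=8000, f0_min=60.0, f0_max=400.0):
    frame = frame - frame.mean()
    energy = 10 * np.log10(np.sum(frame ** 2) + 1e-10)  # log frame energy
    # Autocorrelation; search for the strongest peak inside the plausible
    # pitch-period range (sample_rate/f0_max .. sample_rate/f0_min samples)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    pitch = sample_rate / lag if ac[lag] > 0.3 * ac[0] else 0.0  # 0 marks unvoiced
    return energy, pitch

# A synthetic 100 Hz tone yields a pitch estimate close to 100.0
energy, pitch = frame_energy_and_pitch(np.sin(2 * np.pi * 100 * np.arange(256) / 8000))
```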
IV. TEXT-DEPENDENT SPEAKER RECOGNITION

Automatic speaker recognition tries to recognize the speaker who produced a particular speech utterance. Depending on the constraints imposed on the linguistic content of the utterance, there are two types of speaker recognition: text-independent speaker recognition, in which the linguistic content of the speech recording is unknown to the system, and text-dependent speaker recognition, where the linguistic content of the speech is known. This distinction makes these two subtypes of speaker recognition systems very different in terms both of the techniques used and of potential applications. This section is devoted to text-dependent speaker recognition systems, which find their main application in interactive systems where collaboration from the users is required in order to authenticate their identities. The typical example of these applications is voice authentication over the telephone for interactive voice response systems that require some level of security, like banking applications or password reset. The use of a text-dependent speaker recognition system requires, similarly to other biometric modalities, an enrollment phase in which the user provides several templates to build a user model, and a recognition phase in which a new voice sample is matched against the user model. In recent years the National Institute of Standards and Technology (NIST) has promoted research in the context of text-independent speaker recognition with the organization of yearly international competitive evaluations [48], [56], which have fostered the definition of challenging tasks through a strong effort in the development of publicly available speech databases. Despite its potential applications in interactive voice response systems, the absence of similar competitive evaluations has kept text-dependent speaker recognition at a slower pace of development, and the number and extent of the databases for research in this field is more limited.

A. Classification of systems and techniques

We can classify text-dependent speaker recognition systems from an application point of view into two types: fixed-text and variable-text systems. In fixed-text systems, the lexical content of the enrollment and recognition samples is always the same. In variable-text systems, the lexical content of the recognition sample is different in every access trial from the lexical content of the enrollment samples. Variable-text systems are more flexible and more robust against attacks that use recordings from a user, or imitations produced after hearing the true speaker utter the correct password.

An interesting possibility is the generation of a random password prompt that is different each time the user is verified, thus making it almost impossible to use a recording. With respect to the techniques used for text-dependent speaker recognition, it has been demonstrated [21] that information present at different levels of the speech signal (glottal excitation, spectral and suprasegmental features) can be used effectively to verify the user's identity. However, the most widely used information is the spectral content of the speech signal, determined by the physical configuration and dynamics of the vocal tract. This information is typically summarized as a temporal sequence of MFCC vectors, each of which represents a short window (20 to 40 ms, as discussed in section III-A) of speech. In this way, the problem of text-dependent speaker recognition is reduced to a problem of comparing a sequence of MFCC vectors to a model of the user. For this comparison there are two methods that have been widely used: template-based methods and statistical methods. In template-based methods [29], [24] the model of the speaker consists of several sequences of vectors corresponding to the enrollment utterances, and recognition is performed by comparing the verification utterance against the enrollment utterances. This comparison is performed using Dynamic Time Warping (DTW) as an effective way to compensate for time misalignments between the different utterances (a sketch is given below). While these methods are still used, particularly for embedded systems with very limited resources, statistical methods, and in particular Hidden Markov Models (HMMs) [58], tend to be used more often than template-based models [4], [45], [25], [55], [9]. Most works on text-dependent speaker recognition using HMMs use a speaker-independent set of HMMs and retrain the parameters of these HMMs using Baum-Welch re-estimation to produce a speaker-dependent set of HMMs. After these models have been trained, an utterance is verified by performing speech recognition with the speaker-independent and the speaker-dependent HMMs and comparing the acoustic scores obtained. Recently, other works in the literature [69], [70] have started to modify this method by substituting Maximum Likelihood Linear Regression (MLLR) adaptation [42] of the speaker-independent HMMs for Baum-Welch retraining. This allows the use of more complex (and, if properly trained, more reliable) HMMs while keeping the speaker models small (since only the MLLR transformation matrices need to be stored). Other recent improvements in the field include the use of discriminative methods after the MLLR adaptation [69] and phoneme- or state-based T-Normalization [70].
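To make the template-matching approach concrete, the sketch referenced above implements a textbook DTW comparison between two MFCC sequences; it is a generic illustration under assumed array shapes, not the implementation of any of the cited works.

```python
import numpy as np

def dtw_distance(template, test):
    """Length-normalized DTW alignment cost between two (n_frames, n_coeffs) arrays."""
    n, m = len(template), len(test)
    # Pairwise Euclidean distances between frames
    cost = np.linalg.norm(template[:, None, :] - test[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m] / (n + m)

rng = np.random.default_rng(0)
enroll = rng.standard_normal((80, 13))   # stand-in enrollment MFCC template
test = rng.standard_normal((95, 13))     # stand-in verification utterance
score = dtw_distance(enroll, test)       # verification: min cost over templates
```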
B. Databases and benchmarks

The first databases used for text-dependent speaker verification were databases not specifically designed for this task, like the TI-DIGITS [43] and TIMIT [31] databases. One of the first databases specifically designed for text-dependent speaker recognition research is YOHO [8]. It consists of 96 utterances for enrollment, collected in 4 different sessions, and 40 utterances for test, collected over 10 sessions, for each of a total of 138 speakers. Each utterance consists of a set of three two-digit numbers. YOHO is probably the most widespread and well-known benchmark for comparison and is frequently used to assess text-dependent systems. However, it has several limitations. For instance, it only contains speech recorded with a single microphone in a quiet environment, and it was not designed to simulate informed forgeries (i.e. impostors uttering the password of a user). More recently, the MIT Mobile Device Speaker Verification Corpus [76] has been designed to allow research on text-dependent speaker verification under realistic noisy conditions. Due to the increasing interest in multimodal biometric recognition (of which text-dependent speaker recognition is just a particular modality), and given that one of the main difficulties in capturing a biometric database is recruiting donors, many of the newly developed biometric databases are multimodal and cover several biometric traits. Some of these databases include speech as a particular modality and can potentially be used for text-dependent speaker recognition research. Some of the longest-established and most widely used multimodal biometric databases are XM2VTS [47], containing microphone speech and face images of 295 people captured in 4 different sessions, and the MCYT database [52], including fingerprints and signatures of 330 subjects. More recent databases include BIOMET [30], BANCA [3], MYIDEA [19], MBioID [17], and M3 [46]. Other recent initiatives in multimodal database collection in which our group has been involved include:
- BioSec [28]. The BioSec database was acquired under the FP6 EU BioSec Integrated Project, and comprises fingerprint images acquired with three different sensors, frontal face images from a webcam, iris images, and voice utterances of 250 subjects. The speech part of the corpus (the most interesting part for this document) was recorded at 44 kHz stereo with 16 bits (PCM with no compression) using both a headset and a distant webcam microphone. Each subject utters 4 repetitions of a user-specific keyword consisting of 8 digits, both in English and Spanish. Speakers are mainly native Spanish speakers. In addition, every subject says 3 keywords corresponding to other users to simulate informed forgeries in which an impostor has access to the number of a client. The 8 digits were always pronounced digit by digit in a single continuous and fluent utterance.
- BiosecurID [27]. This database includes 7 unimodal biometric traits, namely: speech, iris, face, handwriting, fingerprints, hand and keystroke dynamics. The database comprises 400 subjects and was acquired in a realistic office-like scenario.
- BioSecure [51]. This database considers three acquisition scenarios, namely: unsupervised Internet acquisition, including voice and face; a supervised office-like scenario, including voice, fingerprints, face, iris, signature and hand; and acquisition on a mobile device, including signature, fingerprints, voice, and face. The database comprises over 1000 subjects for the Internet scenario, and about 700 users for the other two.
In addition to the increased number of subjects and a more balanced distribution of donors, the BioSec, BiosecurID and BioSecure databases have several advantages with respect to other well-known databases such as YOHO.

Fig. 2. Example results on YOHO of two text-dependent speaker recognition systems based on speaker-independent phonetic HMMs, with MLLR speaker adaptation and Baum-Welch re-estimation, for different amounts of enrollment speech (6, 24 or 96 utterances).

For instance, they allow the simulation of informed forgeries and even studies based on age and long-term (2-year) temporal variability, because they have some subjects in common. Despite the existence of these databases that can act as benchmarks, it is still difficult to compare text-dependent speaker recognition systems. One of the main difficulties is that these systems tend to be language dependent, and therefore many researchers present their results on their own custom databases, making direct comparisons impossible. The comparison of different commercial systems is even more difficult, although there are some attempts [23]. To conclude these comments about comparison and performance evaluation, we should note that, as with other biometric modalities, technical performance is not the only dimension to evaluate, and other measures related to the usability of the systems should be evaluated as well [71].

C. Case study: HMM-based text-dependent speaker recognition with MLLR adaptation and Baum-Welch re-estimation

As an example of text-dependent systems tested on the YOHO benchmark database, we present the results obtained with two text-dependent speaker recognition systems developed by the authors. The systems simulate a text-prompted system based on a set of speaker-independent and context-independent phonetic HMMs trained on TIMIT. Enrollment consists of using several sentences of a speaker to adapt the HMMs to the speaker. We compare two ways of performing this adaptation: with a single pass of Baum-Welch re-estimation, and with Maximum Likelihood Linear Regression (MLLR) [42]. The former is the most conventional approach but requires using very simple HMMs (just one or a few Gaussians per state). The latter is more novel and allows using more complex HMMs. Speaker verification consists of computing the acoustic score produced during the forced alignment of an utterance with its phonetic transcription using both the speaker-adapted HMMs and the speaker-independent HMMs. The final score in this experiment is simply the ratio between those scores (no score normalization is included in the results presented). An important issue in developing text-dependent speaker recognition systems is the amount of training material required for enrollment. YOHO contains 4 sessions with 24 utterances each. This is a very large amount of enrollment material that could rarely be obtained in a realistic application. For this reason, figure 2 shows results for the two systems trained with the four sessions (96 utterances), one session (24 utterances), or only 6 utterances from one session. As could be expected, performance is greatly improved with more training material, but practical systems need to find a compromise between performance and ease and convenience of use. Figure 2 also compares the system based on Baum-Welch re-estimation and the one based on MLLR adaptation, showing better performance for the MLLR-based system in all enrollment conditions.
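The verification scheme of this case study can be sketched with the open-source hmmlearn library: a speaker-independent model is re-estimated on enrollment data with a single Baum-Welch pass, and the verification score is a log-likelihood ratio. The model size and the random stand-in feature arrays are assumptions of the sketch; our actual systems use phonetic HMMs and forced alignment, which this toy example does not reproduce.

```python
import copy
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
background_frames = rng.standard_normal((2000, 13))  # stand-in for pooled MFCCs
enroll_frames = rng.standard_normal((600, 13))       # stand-in for one user's MFCCs
test_frames = rng.standard_normal((300, 13))

# Speaker-independent (SI) model trained on background data
si_hmm = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=10,
                         random_state=0).fit(background_frames)

# Enrollment: one Baum-Welch re-estimation pass starting from SI parameters
spk_hmm = copy.deepcopy(si_hmm)
spk_hmm.init_params = ""   # keep the SI parameters as the starting point
spk_hmm.n_iter = 1         # a single re-estimation pass
spk_hmm.fit(enroll_frames)

# Verification: frame-normalized acoustic log-likelihood ratio
score = (spk_hmm.score(test_frames) - si_hmm.score(test_frames)) / len(test_frames)
```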

Fig. 3. DET curves with Baum-Welch re-estimation, MLLR adaptation, and MLLR adaptation followed by MAP, with 6, 24 and 96 utterances for enrollment.

TABLE I
EERs (%) with Baum-Welch re-estimation, MLLR adaptation, and MLLR adaptation followed by MAP, with 6, 24 and 96 utterances for enrollment.

  Enrollment utterances (and sessions) | Baum-Welch | MLLR | MLLR + MAP
  6 (1 session)                        |    5.6     | 4.6  |   3.56
  24 (1 session)                       |    3.2     | 2.1  |    -
  96 (4 sessions)                      |    1.9     | 0.9  |    -

D. Case study: HMM-based text-dependent speaker recognition with MAP adaptation and sub-word level T-normalization

A further extension, also proposed by the authors, of the systems described in section IV-C consists of using Maximum A Posteriori (MAP) adaptation [32] after MLLR for better speaker modelling [74], and using T-normalization at a sub-word (phone or HMM state) level. Using MAP adaptation after the MLLR adaptation yields increased speaker recognition performance (figure 3 and table I). The EER decreased by 1.04% absolute (22.6% relative improvement). This improvement comes at increased computational and storage costs (we need to store a whole new set of phonetic HMMs for each speaker, not only the transformation matrices), but in some applications we can take advantage of it. We have only performed experiments with MLLR followed by MAP for the 6-utterance enrollment condition because this is the most interesting condition for common text-dependent applications. In text-independent speaker recognition it is very common to use T-normalization, comparing the score obtained with a test segment not only against the model of the speaker in the test segment, but also against the models of other speakers (i.e. against a cohort of impostors). The direct translation of this approach to text-dependent speaker recognition is what we call Utterance-Level T-Norm, to distinguish it from the novel T-Normalization schemes that we propose. In any T-normalization scheme, we need to define a cohort of M speakers and compute the unnormalized scores not only with the model of the speaker to verify but also with the models of the M speakers in the cohort. The score under study is then normalized with the statistics of the cohort scores, i.e. s' = (s - mu) / sigma, where mu and sigma are the mean and standard deviation of the impostor (cohort) scores, mapping them to a zero-mean, unit-variance Gaussian. With Utterance-Level T-Norm we normalize the final scores after averaging over the whole utterance. In this sense, we are combining scores computed on very different parts of the test utterance (i.e. on different phonemes or different parts of the phonemes), which may produce scores with very different distributions. For that reason it seems a good idea to normalize the scores of similar segments before averaging. We propose sub-word level T-Normalization schemes in which we perform T-Normalization on averages of the acoustic scores over segments corresponding to phonemes, or even to HMM states within the phoneme, before averaging the already T-Normalized scores over the whole utterance. We call these methods Phoneme-Level T-Normalization and State-Level T-Normalization. The idea behind these new T-Normalization schemes is relatively simple; the interested reader can find a detailed description of these methods in [70].
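T-normalization itself reduces to a few lines; the sketch below shows the utterance-level scheme and the per-segment averaging idea behind the sub-word variants, with all variable names being illustrative assumptions.

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Normalize a trial score by the mean/std of the same trial's cohort scores."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

def phoneme_level_t_norm(segment_scores, segment_cohort_scores):
    """Sub-word variant: normalize each phone (or state) segment against the
    cohort scores for that same segment, then average over the utterance."""
    return np.mean([t_norm(s, c)
                    for s, c in zip(segment_scores, segment_cohort_scores)])
```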

TABLE II
T-Norm results (EERs in %) obtained on YOHO (with only 6 utterances from a single session as enrollment material) using MLLR and MAP adaptation. The table compares results obtained without normalization and with Utterance-Level, Phoneme-Level and State-Level T-Norm for different cohort set-ups, broken down into Male, Female and All trials. [Most numeric entries were lost in transcription; the recoverable values, for the all-male gender-dependent cohort, are 2.55 (Utterance), 2.43 (Phoneme) and 2.52 (State).]

We have tested these three different schemes for T-Normalization with different set-ups of the cohort. Results from this extensive testing are summarized in terms of Equal Error Rate (EER) percentages in table II. The first line of table II presents results obtained with MLLR plus MAP adaptation without normalization, and serves as the baseline. These results correspond to figure 3, but further detailed according to the gender of the trials. The last column of the table presents global results obtained by considering all trials, including same-gender and cross-gender trials. The rest of the table is organized in blocks of three rows, which represent results obtained with Utterance-Level, Phoneme-Level and State-Level T-Norm for the following cohorts of impostors:
- G.I. 10m + 10f: A gender-independent cohort including 10 male speakers and 10 female speakers.
- G.D. 10m - 10f: Two gender-dependent cohorts obtained by dividing the previous cohort into its male and female halves.
- G.D. 30m - 30f: Two gender-dependent cohorts with 30 speakers for each gender.
- G.D. All male: A male cohort including all speakers in YOHO except those involved in the trial.
From the table we observe that Phoneme-Level and State-Level T-Norm clearly outperform Utterance-Level T-Norm for the smaller cohorts (10 male and 10 female speakers), irrespective of whether the cohorts are gender-dependent or gender-independent. In these cases, Utterance-Level T-Norm actually worsens the results obtained without normalization, while Phoneme- and State-Level T-Norm produce important improvements. In the case of two gender-dependent cohorts with 10 male and 10 female speakers, the relative improvement achieved by State-Level T-Norm over Utterance-Level T-Norm reaches 20.1% (0.73% absolute) in the all-gender condition. When we move to larger cohorts we observe that Phoneme- and State-Level T-Norm still tend to perform better than Utterance-Level T-Norm. However, increasing the cohort has a larger improvement effect on Utterance-Level T-Norm than on sub-word level T-Norm, which reduces the difference between the utterance and sub-word levels. It is reasonable to consider that different phonemes have different discrimination capabilities. In fact, this is the hypothesis of a work [69] in which the scores produced by different phonemes are combined with different weights using boosting for improved performance. In the context of T-Norm this means that the scores produced by different phonemes should be normalized in different ways. In fact, we have studied the impostor score distributions for different phonemes (not presented here due to space limitations) and have noticed important differences among them, which again suggests the convenience of sub-word level normalizations.
Our experiments, however, have found these advantages particularly for small cohorts, pointing out another important advantage of sub-word score normalization schemes: their robustness to small cohorts.

V. TEXT-INDEPENDENT SPEAKER RECOGNITION

Text-independent speaker recognition was largely dominated, from the 1970s to the end of the 20th century, by short-term spectral systems. Since 2000, higher-level systems started to be developed, with good enough results in the same highly challenging tasks (NIST SR evaluations), and for some time they were considered the most likely way to improve performance in the future. However, spectral systems have continued to outperform high-level systems (NIST 2010 SRE was the latest benchmark at the time of writing) and have clearly taken the lead with the improvements derived from advanced channel compensation mechanisms based on Factor Analysis (FA) and, more recently, with the development of the approach based on Total Variability, also called ivectors.

A. Short-term spectral systems

When short-time spectral analysis is used to model speaker specificities, we are modeling the different sounds a person can produce, especially as shaped by his/her own vocal tract and articulatory organs. As humans need multiple sounds (or acoustically different symbols) to speak any common language, we are clearly facing a multiclass space of characteristics. Vector Quantization (VQ) techniques are efficient in such multiclass problems, and have been used for speaker identification [7], typically obtaining a specific VQ model per speaker and computing the distance from any utterance to any model as the weighted sum of the per-frame minimum distances to the closest codevector of the codebook. The use of boundaries and centroids instead of probability densities yields poorer performance for VQ than for fully-connected Continuous Density HMMs, known as ergodic HMMs (E-HMMs) [44]. However, the critical performance factor in an E-HMM is the product of the number of states and the number of Gaussians per state, which strongly cancels the influence of transitions in those fully-connected models. Thus, a 5-state 4-Gaussian-per-state E-HMM system will perform similarly to a 4-state 5-Gaussian/state, a 2-state 10-Gaussian/state, or even, what is especially interesting, a 1-state 20-Gaussian/state system, which is generally known as a GMM or Gaussian Mixture Model. Those one-state E-HMMs, or GMMs, have the great advantage of avoiding both Baum-Welch estimation for training, as no alignment between speech and states is necessary (all speech is aligned with the same single state), and Viterbi decoding for testing (again, no need for time alignment), which accelerates computation with no degradation in performance. The GMM is a generative technique in which a mixture of multidimensional Gaussians tries to model the underlying unknown statistical distribution of the speaker data. GMMs became the state-of-the-art technique in the 1990s, whether trained with maximum likelihood (through Expectation-Maximization, EM) or discriminatively (Maximum Mutual Information, MMI). However, it was the use of MAP adaptation of the means from a Universal Background Model (UBM) that gave GMMs a major advantage over other techniques [61], especially when used with compensation techniques such as Z-norm (impostor score normalization), T-norm (utterance compensation), H-norm (handset-dependent Z-norm), HT-norm (H + T-norm) or Feature Mapping (channel identification and compensation) [62]. Discriminative techniques such as Artificial Neural Networks had been used for years [26], but their performance never approached that of GMMs. However, the availability in the late 1990s of Support Vector Machines (SVMs) [65] as an efficient discriminatively trained classifier gave the GMM its major competitor, as equivalent performance is obtained using SVMs in a much higher-dimensional space when appropriate kernels such as GLDS (Generalized Linear Discriminant Sequence kernel) [11] are used. Soon after SVMs started to be used instead of GMMs, both techniques were combined to give even better performance. The resulting system was called GMM-SVM [12]. This technique considers the means of the GMM for every utterance (both in training and testing) as points in a very high-dimensional space (its dimension equals the number of mixtures of the GMM times the dimension of the parameterized vectors) that are classified with one SVM per speaker as belonging or not to that speaker.
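The GMM-UBM recipe with MAP adaptation of the means can be sketched with scikit-learn as follows; the number of mixtures, the relevance factor and the random stand-in feature arrays are illustrative assumptions, following the general approach of [61] rather than any exact published configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.standard_normal((5000, 13))  # stand-in for pooled background MFCCs
enroll = rng.standard_normal((500, 13))       # stand-in for one speaker's MFCCs
test = rng.standard_normal((300, 13))

# Universal Background Model trained on pooled data from many speakers
ubm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0).fit(background)

def map_adapt_means(ubm, X, relevance=16.0):
    """MAP adaptation of the mixture means towards the enrollment data."""
    resp = ubm.predict_proba(X)                      # (n_frames, n_mix) responsibilities
    n_k = resp.sum(axis=0) + 1e-10                   # soft counts per mixture
    e_k = resp.T @ X / n_k[:, None]                  # per-mixture data means
    alpha = (n_k / (n_k + relevance))[:, None]       # adaptation coefficients
    return alpha * e_k + (1.0 - alpha) * ubm.means_  # interpolated means

# Speaker model: UBM weights/covariances with MAP-adapted means
spk = GaussianMixture(n_components=64, covariance_type="diag")
spk.weights_, spk.covariances_ = ubm.weights_, ubm.covariances_
spk.precisions_cholesky_ = ubm.precisions_cholesky_  # unchanged (same covariances)
spk.means_ = map_adapt_means(ubm, enroll)

# Verification: frame-averaged log-likelihood ratio against the UBM
llr = np.mean(spk.score_samples(test) - ubm.score_samples(test))
```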
The high-dimensional vector of means of the GMM has received the name of GMM SuperVector, or just SuperVector [40]. The concept of the SuperVector gave rise to a whole new set of channel compensation methods [41] based on detecting subspaces with maximum intra-speaker (i.e. inter-session) and inter-speaker variability. The former were modelled as channel factors and the latter as speaker factors. These new techniques were collectively known as Factor Analysis (channel and speaker factors), and several variations of them received particular names such as Channel Factors (CF), Nuisance Attribute Projection (NAP) or Within-Class Covariance Normalization (WCCN). More recently, after noticing that channel factors also include speaker-specific information, some authors started to model both channel and speaker variability using a single subspace that received the name of Total Variability subspace [15]. That subspace is a projection from the supervector space to a much lower-dimensional space (typically 400 dimensions instead of the typical 40K dimensions of the supervector space) containing most of the speaker and channel variability. The vectors in this Total Variability space are called ivectors [16] and represent whole utterances of the speaker, containing both speaker and inter-session variability. Once the problem has been translated into this reduced space, standard techniques such as Within-Class Covariance Normalization (WCCN), Linear Discriminant Analysis (LDA) or Nuisance Attribute Projection (NAP) are easily applied to compensate for and remove intra-speaker variability, giving rise to session-compensated vectors that can then be compared using a simple cosine distance. This approach represents the current state of the art in speaker recognition, providing excellent performance and extremely efficient systems in computational terms.
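Scoring in the ivector framework is remarkably light, which is the source of its computational efficiency. The following sketch assumes the ivectors have already been extracted, and uses a random stand-in matrix in place of a properly trained LDA/WCCN projection.

```python
import numpy as np

def cosine_score(w_enroll, w_test, projection=None):
    """Cosine-distance trial score between two session-compensated ivectors."""
    if projection is not None:          # e.g. LDA/WCCN estimated on background data
        w_enroll = projection @ w_enroll
        w_test = projection @ w_test
    w_enroll = w_enroll / np.linalg.norm(w_enroll)   # length normalization
    w_test = w_test / np.linalg.norm(w_test)
    return float(w_enroll @ w_test)

rng = np.random.default_rng(1)
w1, w2 = rng.standard_normal(400), rng.standard_normal(400)  # 400-dim ivectors
lda = rng.standard_normal((200, 400))                        # stand-in projection
score = cosine_score(w1, w2, lda)
```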

B. Idiolectal systems

Most text-independent speaker recognition systems were based on short-term spectral features until the work of Doddington [18] opened a new world of possibilities for improving text-independent speaker recognition systems. Doddington realized and proved that speech from different speakers differs not only in its acoustics, but also in other characteristics like word usage. In particular, in his work he modeled the word usage of each particular speaker using an n-gram that modeled word sequences and their probabilities, and he demonstrated that using those models could improve the performance of a baseline acoustic/spectral GMM system. More important than this particular result is the fact that this work boosted research into the use of higher levels of information (idiolectal, phonotactic, prosodic, etc.) for text-independent speaker recognition. After the publication of this work, a number of researchers met at the summer workshop SuperSID [22], where these ideas were further developed and tested on a common testbed. The next sections describe two of the most successful types of systems exploiting higher levels of information: phonotactic systems, which try to model pronunciation idiosyncrasies, and prosodic systems, which model speaker-specific prosodic patterns.

C. Phonotactic systems

A typical phonotactic speaker recognition system consists of two main building blocks: the phonetic decoders, which transform speech into a sequence of phonetic labels, and the n-gram statistical language modeling stage, which models the frequencies of phones and phone sequences for each particular speaker. The phonetic decoders, typically based on Hidden Markov Models (HMMs), can either be taken from a preexisting speech recognizer or trained ad hoc. For the purpose of speaker recognition, it is not very important to have very accurate phonetic decoders, and it is not even important to have a phonetic decoder in the language of the speakers to be recognized. This somewhat surprising fact has been analyzed in [72], showing that the phonetic errors made by the decoder seem to be speaker-specific, and therefore useful information for speaker recognition, as long as these errors are consistent for each particular speaker. Once a phonetic decoder is available, the phonetic decodings of many sentences from many different speakers can be used to train a Universal Background Phone Model (UBPM) representing all possible speakers. Speaker Phone Models (SPM_i) are trained using several phonetic decodings of each particular speaker. Since the speech available to train a speaker model is often limited, speaker models are interpolated with the UBPM to increase the robustness of the parameter estimation. Once the statistical language models are trained, the procedure to verify a test utterance against a speaker model SPM_i is represented in figure 4. The first step is to produce its phonetic decoding, X, in the same way as the decodings used to train SPM_i and the UBPM. Then, the phonetic decoding of the test utterance, X, and the statistical models (SPM_i, UBPM) are used to compute the likelihoods of X given the speaker model SPM_i and the background model UBPM. The recognition score is the log of the ratio of both likelihoods.

Fig. 4. Verification of an utterance against a speaker model in phonotactic speaker recognition.

This process, which is usually described as Phone Recognition followed by Language Modeling (PRLM), may be repeated for different phonetic decoders (e.g., different languages or complexities), and the different recognition scores simply added or fused for better performance, yielding a method known as Parallel PRLM or PPRLM. Recently, several improvements have been proposed over the baseline PPRLM systems. One of the most important in terms of performance is the use of the whole phone recognition lattice [35] instead of the one-best decoding hypothesis. The recognition lattice is a directed acyclic graph containing the most likely hypotheses along with their probabilities. This much richer information allows for a better estimation of the n-grams on limited speech material, and therefore for much better results. Another important improvement is the use of SVMs for classifying the whole n-grams, trained with either the one-best hypotheses or with lattices [10], [35], instead of using them in a statistical classification framework.
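A toy sketch of PRLM scoring may help fix the ideas: bigram models are estimated from phone decodings, the speaker model is interpolated with the UBPM, and the score is a length-normalized log-likelihood ratio. The toy decodings and the interpolation weight are assumptions; real systems use smoothed n-grams or lattice-based counts, as discussed above.

```python
from collections import Counter
import math

background_decodings = [list("aetaoe"), list("teatoa"), list("otatea")]  # toy UBPM data
speaker_decodings = [list("aetao"), list("aetoa")]                       # toy SPM data

def bigram_model(decodings):
    """Estimate conditional bigram probabilities p(b | a) from phone sequences."""
    counts, totals = Counter(), Counter()
    for phones in decodings:
        for a, b in zip(phones, phones[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return lambda a, b: counts[(a, b)] / totals[a] if totals[a] else 0.0

ubpm = bigram_model(background_decodings)   # Universal Background Phone Model
spm_raw = bigram_model(speaker_decodings)   # raw speaker model (sparse)

def spm(a, b, lam=0.5):                     # interpolation with the UBPM for robustness
    return lam * spm_raw(a, b) + (1.0 - lam) * ubpm(a, b)

def prlm_score(test_phones):
    llr = 0.0
    for a, b in zip(test_phones, test_phones[1:]):
        p_s, p_u = spm(a, b), ubpm(a, b)
        if p_s > 0 and p_u > 0:             # skip unseen events in this toy sketch
            llr += math.log(p_s / p_u)
    return llr / max(len(test_phones) - 1, 1)

print(prlm_score(list("aetaot")))           # log-likelihood ratio per bigram
```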
D. Prosodic systems

One of the pioneering and most successful prosodic systems in text-independent speaker recognition is the work of Adami [20]. The system consists of two main building blocks: the prosodic tokenizer, which analyzes the prosody and represents it as a sequence of prosodic labels or tokens, and the n-gram statistical language modeling stage, which models the frequencies of prosodic tokens and their sequences for each particular speaker. Some other possibilities for modeling the prosodic information that have also proved quite successful are the use of Non-uniform Extraction Region Features (NERFs) delimited by long-enough pauses [39], or NERFs defined by the syllabic structure of the sentence (SNERFs) [66]. The authors have implemented a prosodic system based on Adami's work in which the second block is exactly the same for phonotactic and prosodic speaker recognition, with only minor adjustments to improve performance. The tokenization process consists of two stages. Firstly, for each speech utterance, the temporal trajectories of the prosodic features (fundamental frequency, or pitch, and energy) are extracted. Secondly, both contours are segmented and labelled by means of a slope quantization process. Figure 5 shows a table containing 17 prosodic tokens. One token represents unvoiced segments, while 16 are used for representing voiced segments depending on the slopes (fast-rising, slow-rising, fast-falling, slow-falling) of the energy and pitch. Figure 5 also shows an example utterance segmented and labelled using these prosodic tokens.

Fig. 5. Prosodic token alphabet (top table) and sample tokenization of pitch and energy contours (bottom figure).
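The tokenizer can be sketched as a slope quantization of per-segment pitch and energy contours; the linear-fit slope estimation, the threshold separating fast from slow, and the token spelling are illustrative guesses at one possible realization, not Adami's exact quantizer.

```python
import numpy as np

def slope_class(values, fast_threshold=2.0):
    """Quantize a contour segment (>= 2 frames) into one of 4 slope classes."""
    slope = np.polyfit(np.arange(len(values)), values, 1)[0]  # linear-fit slope
    speed = "fast" if abs(slope) >= fast_threshold else "slow"
    direction = "rising" if slope >= 0 else "falling"
    return f"{speed}-{direction}"

def tokenize_segment(pitch_contour, energy_contour):
    """Map a segment to 1 of 17 tokens: unvoiced, or 4 x 4 joint slope classes."""
    if np.all(pitch_contour == 0):       # pitch = 0 marks unvoiced frames
        return "unvoiced"
    return f"p:{slope_class(pitch_contour)}|e:{slope_class(energy_contour)}"

pitch = np.array([110.0, 114.0, 119.0, 125.0])   # a rising voiced segment
energy = np.array([60.0, 60.5, 61.0, 61.2])
print(tokenize_segment(pitch, energy))            # "p:fast-rising|e:slow-rising"
```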

E. Databases and Benchmarks

In the early 1990s, text-independent speaker recognition was a major challenge, with a future difficult to foresee. At that time, modest research initiatives were developed with very limited databases, resulting in non-homogeneous publications with no way to compare and improve systems on similar tasks. Fortunately, in 1996 NIST started the yearly Speaker Recognition Evaluations, which have undoubtedly been the driving force of significant advances. Present state-of-the-art performance was totally unexpected just 10 years ago. This success has been driven by two factors. Firstly, the use of common databases and protocols in blind evaluations of systems has permitted fair comparison between systems on exactly the same task. Secondly, the post-evaluation workshops have allowed participants to share their experiences, improvements, failures, etc. in a highly cooperative environment. The role of the LDC (Linguistic Data Consortium) in providing new challenging speech material is also noticeable, as the needs have been continuously increasing (both in the amount of speech and in recording requirements). From the different phases of Switchboard to the latest Fisher-style databases, much progress has been made. Past evaluation sets (development, train and test audio, and keys, i.e. solutions) are available through the LDC for new researchers to evaluate their systems without competitive pressures. Even though official results have been restricted to participants, it is extremely easy to follow the progress of the technology, as participants often present their new developments in Speaker ID sessions at international conferences such as ICASSP or Interspeech, or at the series of ISCA/IEEE Odyssey workshops.
F. Case study: the ATVS NIST SRE 2006 text-independent multilevel system

The authors have participated in the yearly NIST SRE tests since 2001, and have developed different spectral (generative and discriminative) and higher-level systems. A detailed description of our multilevel approach can be found in [33]; here we present our results in NIST SRE06 on the 8c1c task (8 training conversations and 1 conversation for testing), in order to see the performance of the different subsystems on the same task.

Fig. 6. Performance of ATVS subsystems in the NIST 06 Speaker Recognition Evaluation, comparing spectral (GMM and SVM), phonotactic and prosodic systems.

The main differences of the 2006 ATVS systems with respect to the 2005 systems described in [33] are the use of Feature Mapping in both GMM and SVM, the use of 3rd-order polynomial expansion (instead of 2nd-order) in the GLDS kernel, and the use of one PRLM trained with SpeechDat (the best of the three PRLM systems shown). As shown in figure 6, the spectral systems (GMM and SVM) perform similarly, while our higher-level systems extract a significant amount of individualization information (around 20% EER) but remain far from the performance of the spectral systems. After the evaluation, SuperVector-GMM and NAP channel compensation were included in our system, providing significant enhancements over the best spectral systems, as shown in figure 7 for the NIST SRE06 1c1c-male subtask.

G. Case study: the ATVS NIST SRE 2010 text-independent ivectors system

Our last participation in the NIST SRE evaluations at the time of writing this document was in the 2010 edition. In contrast to the 2006 edition, where we focused on higher-level systems, in the 2010 edition we focused on building a single and very efficient system based on the new concepts of Total Variability and ivectors [15], [16]. NIST SRE 2010 [49] contained different conditions. We were only interested in the so-called core-core condition, in which the training and testing material was either a two-channel telephone conversational excerpt of approximately five minutes total duration (we call this type of data "tel" data), or a microphone-recorded conversational segment of three to fifteen minutes total duration involving the interviewee (target speaker) and an interviewer (we call this type of data "mic" data), in both cases with the target speaker channel designated. The type of data was known in advance by the systems. The evaluation established a maximum of 6000 speaker models, as well as maximum numbers of test segments and trials; the real evaluation was close to those figures. In our system, all audio except that used for tel-tel trials (tel data used for both training and testing) was first filtered with the QIO (Qualcomm-ICSI-OGI) Wiener filter in order to reduce noise [57]. Feature extraction is performed after noise reduction. It computes 38 coefficients per frame (19 Mel-Frequency Cepstral Coefficients, MFCC, and their deltas) using 20 ms Hamming windows overlapped by 10 ms, and 20 mel-spaced magnitude filters. Once these features are calculated, three channel compensation methods are applied in sequence: CMN, RASTA [37] and Feature Warping [54] with 3-second windows. Given that the data provided by NIST consisted of conversations, there were long periods in which the target speaker was silent. In order to avoid processing those segments and to achieve better performance, we used two different VAD (Voice Activity Detection) configurations depending on whether the data is mic or tel, but these details go beyond the scope of this case study.
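Of the compensation steps listed, Feature Warping is perhaps the least standard; a minimal sketch over a 3-second sliding window, assuming a 10 ms frame hop, could look as follows (a simplified reading of [54], not the exact ATVS implementation).

```python
import numpy as np
from scipy.stats import norm

def feature_warp(features, win=300):
    """Within each sliding window, replace the central frame's coefficient by
    the standard normal quantile of its rank, Gaussianizing the short-term
    feature distribution. features: (n_frames, n_coeffs); win = 3 s at 100 fps."""
    half = win // 2
    warped = features.copy()                        # edge frames left unwarped
    for t in range(half, len(features) - half):
        window = features[t - half:t + half + 1]    # 3 s context around frame t
        rank = (window < features[t]).sum(axis=0)   # per-coefficient rank of frame t
        warped[t] = norm.ppf((rank + 0.5) / (win + 1))
    return warped

rng = np.random.default_rng(0)
warped = feature_warp(rng.standard_normal((1000, 19)))  # 10 s of stand-in MFCCs
```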

Fig. 7. Post-evaluation performance improvements over the NIST 06 SRE ATVS system, based on NAP channel compensation and SuperVector-GMMs (1c-1c male sub-task).

Fig. 8. Development (training) and testing phases of the ATVS-UAM NIST SRE 2010 system.


More information

Non intrusive multi-biometrics on a mobile device: a comparison of fusion techniques

Non intrusive multi-biometrics on a mobile device: a comparison of fusion techniques Non intrusive multi-biometrics on a mobile device: a comparison of fusion techniques Lorene Allano 1*1, Andrew C. Morris 2, Harin Sellahewa 3, Sonia Garcia-Salicetti 1, Jacques Koreman 2, Sabah Jassim

More information

A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK. Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren

A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK. Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren Speech Technology and Research Laboratory, SRI International,

More information

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Amit Juneja and Carol Espy-Wilson Department of Electrical and Computer Engineering University of Maryland,

More information

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037 Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI * HMM, recognition, speech, disorders

More information

Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition

Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Hua Zhang, Yun Tang, Wenju Liu and Bo Xu National Laboratory of Pattern Recognition Institute of Automation, Chinese

More information

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735.Volume 10, Issue 2, Ver.1 (Mar - Apr.2015), PP 55-61 www.iosrjournals.org Analysis of Emotion

More information

Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm

Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm Prof. Ch.Srinivasa Kumar Prof. and Head of department. Electronics and communication Nalanda Institute

More information

Automatic Speaker Recognition: Modelling, Feature Extraction and Effects of Clinical Environment

Automatic Speaker Recognition: Modelling, Feature Extraction and Effects of Clinical Environment Automatic Speaker Recognition: Modelling, Feature Extraction and Effects of Clinical Environment A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy Sheeraz Memon

More information

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers October 31, 2003 Amit Juneja Department of Electrical and Computer Engineering University of Maryland, College Park,

More information

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation Taufiq Hasan Gang Liu Seyed Omid Sadjadi Navid Shokouhi The CRSS SRE Team John H.L. Hansen Keith W. Godin Abhinav Misra Ali Ziaei Hynek Bořil

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH 2009 423 Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition George

More information

Calibration of Confidence Measures in Speech Recognition

Calibration of Confidence Measures in Speech Recognition Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE

More information

Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition

Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition Seltzer, M.L.; Raj, B.; Stern, R.M. TR2004-088 December 2004 Abstract

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Mandarin Lexical Tone Recognition: The Gating Paradigm

Mandarin Lexical Tone Recognition: The Gating Paradigm Kansas Working Papers in Linguistics, Vol. 0 (008), p. 8 Abstract Mandarin Lexical Tone Recognition: The Gating Paradigm Yuwen Lai and Jie Zhang University of Kansas Research on spoken word recognition

More information

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration INTERSPEECH 2013 Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration Yan Huang, Dong Yu, Yifan Gong, and Chaojun Liu Microsoft Corporation, One

More information

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation SLSP-2016 October 11-12 Natalia Tomashenko 1,2,3 natalia.tomashenko@univ-lemans.fr Yuri Khokhlov 3 khokhlov@speechpro.com Yannick

More information

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language Z.HACHKAR 1,3, A. FARCHI 2, B.MOUNIR 1, J. EL ABBADI 3 1 Ecole Supérieure de Technologie, Safi, Morocco. zhachkar2000@yahoo.fr.

More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

Spoofing and countermeasures for automatic speaker verification

Spoofing and countermeasures for automatic speaker verification INTERSPEECH 2013 Spoofing and countermeasures for automatic speaker verification Nicholas Evans 1, Tomi Kinnunen 2 and Junichi Yamagishi 3,4 1 EURECOM, Sophia Antipolis, France 2 University of Eastern

More information

INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT

INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT Takuya Yoshioka,, Anton Ragni, Mark J. F. Gales Cambridge University Engineering Department, Cambridge, UK NTT Communication

More information

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 Ranniery Maia 1,2, Jinfu Ni 1,2, Shinsuke Sakai 1,2, Tomoki Toda 1,3, Keiichi Tokuda 1,4 Tohru Shimizu 1,2, Satoshi Nakamura 1,2 1 National

More information

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction INTERSPEECH 2015 Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction Akihiro Abe, Kazumasa Yamamoto, Seiichi Nakagawa Department of Computer

More information

Speaker Recognition. Speaker Diarization and Identification

Speaker Recognition. Speaker Diarization and Identification Speaker Recognition Speaker Diarization and Identification A dissertation submitted to the University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

More information

Speaker Identification by Comparison of Smart Methods. Abstract

Speaker Identification by Comparison of Smart Methods. Abstract Journal of mathematics and computer science 10 (2014), 61-71 Speaker Identification by Comparison of Smart Methods Ali Mahdavi Meimand Amin Asadi Majid Mohamadi Department of Electrical Department of Computer

More information

Segregation of Unvoiced Speech from Nonspeech Interference

Segregation of Unvoiced Speech from Nonspeech Interference Technical Report OSU-CISRC-8/7-TR63 Department of Computer Science and Engineering The Ohio State University Columbus, OH 4321-1277 FTP site: ftp.cse.ohio-state.edu Login: anonymous Directory: pub/tech-report/27

More information

Voice conversion through vector quantization

Voice conversion through vector quantization J. Acoust. Soc. Jpn.(E)11, 2 (1990) Voice conversion through vector quantization Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara A TR Interpreting Telephony Research Laboratories,

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology

Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology ISCA Archive SUBJECTIVE EVALUATION FOR HMM-BASED SPEECH-TO-LIP MOVEMENT SYNTHESIS Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano Graduate School of Information Science, Nara Institute of Science & Technology

More information

Rachel E. Baker, Ann R. Bradlow. Northwestern University, Evanston, IL, USA

Rachel E. Baker, Ann R. Bradlow. Northwestern University, Evanston, IL, USA LANGUAGE AND SPEECH, 2009, 52 (4), 391 413 391 Variability in Word Duration as a Function of Probability, Speech Style, and Prosody Rachel E. Baker, Ann R. Bradlow Northwestern University, Evanston, IL,

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence INTERSPEECH September,, San Francisco, USA Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence Bidisha Sharma and S. R. Mahadeva Prasanna Department of Electronics

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH Don McAllaster, Larry Gillick, Francesco Scattone, Mike Newman Dragon Systems, Inc. 320 Nevada Street Newton, MA 02160

More information

On the Formation of Phoneme Categories in DNN Acoustic Models

On the Formation of Phoneme Categories in DNN Acoustic Models On the Formation of Phoneme Categories in DNN Acoustic Models Tasha Nagamine Department of Electrical Engineering, Columbia University T. Nagamine Motivation Large performance gap between humans and state-

More information

Automatic Pronunciation Checker

Automatic Pronunciation Checker Institut für Technische Informatik und Kommunikationsnetze Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich Ecole polytechnique fédérale de Zurich Politecnico federale

More information

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models Navdeep Jaitly 1, Vincent Vanhoucke 2, Geoffrey Hinton 1,2 1 University of Toronto 2 Google Inc. ndjaitly@cs.toronto.edu,

More information

The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access

The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access Joyce McDonough 1, Heike Lenhert-LeHouiller 1, Neil Bardhan 2 1 Linguistics

More information

Automatic intonation assessment for computer aided language learning

Automatic intonation assessment for computer aided language learning Available online at www.sciencedirect.com Speech Communication 52 (2010) 254 267 www.elsevier.com/locate/specom Automatic intonation assessment for computer aided language learning Juan Pablo Arias a,

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

Perceived speech rate: the effects of. articulation rate and speaking style in spontaneous speech. Jacques Koreman. Saarland University

Perceived speech rate: the effects of. articulation rate and speaking style in spontaneous speech. Jacques Koreman. Saarland University 1 Perceived speech rate: the effects of articulation rate and speaking style in spontaneous speech Jacques Koreman Saarland University Institute of Phonetics P.O. Box 151150 D-66041 Saarbrücken Germany

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Speech Communication Session 2aSC: Linking Perception and Production

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

1. REFLEXES: Ask questions about coughing, swallowing, of water as fast as possible (note! Not suitable for all

1. REFLEXES: Ask questions about coughing, swallowing, of water as fast as possible (note! Not suitable for all Human Communication Science Chandler House, 2 Wakefield Street London WC1N 1PF http://www.hcs.ucl.ac.uk/ ACOUSTICS OF SPEECH INTELLIGIBILITY IN DYSARTHRIA EUROPEAN MASTER S S IN CLINICAL LINGUISTICS UNIVERSITY

More information

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and

More information

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Wilny Wilson.P M.Tech Computer Science Student Thejus Engineering College Thrissur, India. Sindhu.S Computer

More information

Revisiting the role of prosody in early language acquisition. Megha Sundara UCLA Phonetics Lab

Revisiting the role of prosody in early language acquisition. Megha Sundara UCLA Phonetics Lab Revisiting the role of prosody in early language acquisition Megha Sundara UCLA Phonetics Lab Outline Part I: Intonation has a role in language discrimination Part II: Do English-learning infants have

More information

An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District

An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District Report Submitted June 20, 2012, to Willis D. Hawley, Ph.D., Special

More information

Speech Recognition by Indexing and Sequencing

Speech Recognition by Indexing and Sequencing International Journal of Computer Information Systems and Industrial Management Applications. ISSN 215-7988 Volume 4 (212) pp. 358 365 c MIR Labs, www.mirlabs.net/ijcisim/index.html Speech Recognition

More information

CEFR Overall Illustrative English Proficiency Scales

CEFR Overall Illustrative English Proficiency Scales CEFR Overall Illustrative English Proficiency s CEFR CEFR OVERALL ORAL PRODUCTION Has a good command of idiomatic expressions and colloquialisms with awareness of connotative levels of meaning. Can convey

More information

Lecture 9: Speech Recognition

Lecture 9: Speech Recognition EE E6820: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 1 Recognizing speech 2 Feature calculation Dan Ellis Michael Mandel 3 Sequence

More information

Digital Signal Processing: Speaker Recognition Final Report (Complete Version)

Digital Signal Processing: Speaker Recognition Final Report (Complete Version) Digital Signal Processing: Speaker Recognition Final Report (Complete Version) Xinyu Zhou, Yuxin Wu, and Tiezheng Li Tsinghua University Contents 1 Introduction 1 2 Algorithms 2 2.1 VAD..................................................

More information

English Language and Applied Linguistics. Module Descriptions 2017/18

English Language and Applied Linguistics. Module Descriptions 2017/18 English Language and Applied Linguistics Module Descriptions 2017/18 Level I (i.e. 2 nd Yr.) Modules Please be aware that all modules are subject to availability. If you have any questions about the modules,

More information

How to Judge the Quality of an Objective Classroom Test

How to Judge the Quality of an Objective Classroom Test How to Judge the Quality of an Objective Classroom Test Technical Bulletin #6 Evaluation and Examination Service The University of Iowa (319) 335-0356 HOW TO JUDGE THE QUALITY OF AN OBJECTIVE CLASSROOM

More information

International Journal of Advanced Networking Applications (IJANA) ISSN No. :

International Journal of Advanced Networking Applications (IJANA) ISSN No. : International Journal of Advanced Networking Applications (IJANA) ISSN No. : 0975-0290 34 A Review on Dysarthric Speech Recognition Megha Rughani Department of Electronics and Communication, Marwadi Educational

More information

Why Did My Detector Do That?!

Why Did My Detector Do That?! Why Did My Detector Do That?! Predicting Keystroke-Dynamics Error Rates Kevin Killourhy and Roy Maxion Dependable Systems Laboratory Computer Science Department Carnegie Mellon University 5000 Forbes Ave,

More information

A Study of the Effectiveness of Using PER-Based Reforms in a Summer Setting

A Study of the Effectiveness of Using PER-Based Reforms in a Summer Setting A Study of the Effectiveness of Using PER-Based Reforms in a Summer Setting Turhan Carroll University of Colorado-Boulder REU Program Summer 2006 Introduction/Background Physics Education Research (PER)

More information

P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou, C. Skourlas, J. Varnas

P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou, C. Skourlas, J. Varnas Exploiting Distance Learning Methods and Multimediaenhanced instructional content to support IT Curricula in Greek Technological Educational Institutes P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou,

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

An Introduction to Simio for Beginners

An Introduction to Simio for Beginners An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

Dyslexia/dyslexic, 3, 9, 24, 97, 187, 189, 206, 217, , , 367, , , 397,

Dyslexia/dyslexic, 3, 9, 24, 97, 187, 189, 206, 217, , , 367, , , 397, Adoption studies, 274 275 Alliteration skill, 113, 115, 117 118, 122 123, 128, 136, 138 Alphabetic writing system, 5, 40, 127, 136, 410, 415 Alphabets (types of ) artificial transparent alphabet, 5 German

More information

BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY

BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY Sergey Levine Principal Adviser: Vladlen Koltun Secondary Adviser:

More information

Quarterly Progress and Status Report. VCV-sequencies in a preliminary text-to-speech system for female speech

Quarterly Progress and Status Report. VCV-sequencies in a preliminary text-to-speech system for female speech Dept. for Speech, Music and Hearing Quarterly Progress and Status Report VCV-sequencies in a preliminary text-to-speech system for female speech Karlsson, I. and Neovius, L. journal: STL-QPSR volume: 35

More information

Noise-Adaptive Perceptual Weighting in the AMR-WB Encoder for Increased Speech Loudness in Adverse Far-End Noise Conditions

Noise-Adaptive Perceptual Weighting in the AMR-WB Encoder for Increased Speech Loudness in Adverse Far-End Noise Conditions 26 24th European Signal Processing Conference (EUSIPCO) Noise-Adaptive Perceptual Weighting in the AMR-WB Encoder for Increased Speech Loudness in Adverse Far-End Noise Conditions Emma Jokinen Department

More information

Lecture Notes in Artificial Intelligence 4343

Lecture Notes in Artificial Intelligence 4343 Lecture Notes in Artificial Intelligence 4343 Edited by J. G. Carbonell and J. Siekmann Subseries of Lecture Notes in Computer Science Christian Müller (Ed.) Speaker Classification I Fundamentals, Features,

More information

Evidence for Reliability, Validity and Learning Effectiveness

Evidence for Reliability, Validity and Learning Effectiveness PEARSON EDUCATION Evidence for Reliability, Validity and Learning Effectiveness Introduction Pearson Knowledge Technologies has conducted a large number and wide variety of reliability and validity studies

More information

Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation

Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Alistair Conkie AT&T abs - Research 180 Park Avenue, Florham Park,

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

Improvements to the Pruning Behavior of DNN Acoustic Models

Improvements to the Pruning Behavior of DNN Acoustic Models Improvements to the Pruning Behavior of DNN Acoustic Models Matthias Paulik Apple Inc., Infinite Loop, Cupertino, CA 954 mpaulik@apple.com Abstract This paper examines two strategies that positively influence

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Modeling user preferences and norms in context-aware systems

Modeling user preferences and norms in context-aware systems Modeling user preferences and norms in context-aware systems Jonas Nilsson, Cecilia Lindmark Jonas Nilsson, Cecilia Lindmark VT 2016 Bachelor's thesis for Computer Science, 15 hp Supervisor: Juan Carlos

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Learning From the Past with Experiment Databases

Learning From the Past with Experiment Databases Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

More information

A Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many

A Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many Schmidt 1 Eric Schmidt Prof. Suzanne Flynn Linguistic Study of Bilingualism December 13, 2013 A Minimalist Approach to Code-Switching In the field of linguistics, the topic of bilingualism is a broad one.

More information

Using SAM Central With iread

Using SAM Central With iread Using SAM Central With iread January 1, 2016 For use with iread version 1.2 or later, SAM Central, and Student Achievement Manager version 2.4 or later PDF0868 (PDF) Houghton Mifflin Harcourt Publishing

More information

Corpus Linguistics (L615)

Corpus Linguistics (L615) (L615) Basics of Markus Dickinson Department of, Indiana University Spring 2013 1 / 23 : the extent to which a sample includes the full range of variability in a population distinguishes corpora from archives

More information

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING Gábor Gosztolya 1, Tamás Grósz 1, László Tóth 1, David Imseng 2 1 MTA-SZTE Research Group on Artificial

More information

AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC PP. VI, 282)

AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC PP. VI, 282) B. PALTRIDGE, DISCOURSE ANALYSIS: AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC. 2012. PP. VI, 282) Review by Glenda Shopen _ This book is a revised edition of the author s 2006 introductory

More information

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Innov High Educ (2009) 34:93 103 DOI 10.1007/s10755-009-9095-2 Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Phyllis Blumberg Published online: 3 February

More information