
IEICE TRANS. INF. & SYST., VOL.E95-D, NO.2 FEBRUARY 2012

PAPER

Bayesian Learning of a Language Model from Continuous Speech

Graham NEUBIG a), Masato MIMURA, Shinsuke MORI, Nonmembers, and Tatsuya KAWAHARA, Member

SUMMARY  We propose a novel scheme to learn a language model (LM) for automatic speech recognition (ASR) directly from continuous speech. In the proposed method, we first generate phoneme lattices using an acoustic model with no linguistic constraints, then perform training over these phoneme lattices, simultaneously learning both lexical units and an LM. As a statistical framework for this learning problem, we use non-parametric Bayesian statistics, which make it possible to balance the learned model's complexity (such as the size of the learned vocabulary) and expressive power, and provide a principled learning algorithm through the use of Gibbs sampling. Implementation is performed using weighted finite state transducers (WFSTs), which allow for the simple handling of lattice input. Experimental results on natural, adult-directed speech demonstrate that LMs built using only continuous speech are able to significantly reduce ASR phoneme error rates. The proposed technique of joint Bayesian learning of lexical units and an LM over lattices is shown to significantly contribute to this improvement.

key words: language modeling, automatic speech recognition, Bayesian learning, weighted finite state transducers

Manuscript received June 21; manuscript revised September 26.
The authors are with the Graduate School of Informatics, Kyoto University, Kyoto-shi, Japan.
a) neubig@ar.media.kyoto-u.ac.jp
DOI: /transinf.E95.D.614

1. Introduction

A language model (LM) is an essential part of automatic speech recognition (ASR) systems, providing linguistic constraints on the recognizer and helping to resolve the ambiguity inherent in the acoustic signal. Traditionally, these LMs are learned from digitized text, preferably text that is similar in style and content to the speech that is to be recognized. In this paper, we propose a new paradigm for LM learning, using not digitized text but audio data of continuous speech.

The proposition of learning an LM from continuous speech is motivated from a number of viewpoints. First, the properties of written and spoken language are very different [1], and LMs learned from continuous speech can be expected to naturally model spoken language, removing the need to manually transcribe speech or compensate for these differences when creating an LM for ASR [2]. Second, learning words and their context from speech can allow for out-of-vocabulary word detection and acquisition, which has been shown to be useful in creating more adaptable and robust ASR or dialog systems [3], [4]. Learning LMs from speech could also prove a powerful tool in efforts for technology-based language preservation [5], particularly for languages that have a rich oral, but not written, tradition. Finally, as human children learn language from speech, not text, computational models for learning from speech are of great interest in the field of cognitive science [6].

There has been a significant amount of work on learning lexical units from speech data. These include statistical models based on the minimum description length or maximum likelihood frameworks, which have been trained on one-best phoneme recognition results [7]-[9] or recognition lattices [10].
There have also been a number of works that use acoustic matching methods combined with heuristic cutoffs that may be adjusted to determine the granularity of the units that need to be acquired [11]-[13]. Finally, many works, inspired by the multi-modal learning of human children, use visual and audio information (or at least abstractions of such) to learn words without text [6], [14], [15]. This work is different from these other approaches in that it is the first model that is able to learn a full word-based n-gram model from raw audio.

In order to learn an LM from continuous speech, we first generate lattices of phonemes without any linguistic constraints using a standard ASR acoustic model. To learn an LM from this data, we build on recent work in unsupervised word segmentation of text [16], proposing a novel inference procedure that allows for models to be learned over lattice input. For LM learning, we use the hierarchical Pitman-Yor LM (HPYLM) [17], a variety of LM that is based on non-parametric Bayesian statistics. Non-parametric Bayesian statistics are well suited to this learning problem, as they allow for automatically balancing model complexity and expressiveness, and have a principled framework for learning through the use of Gibbs sampling. To perform sampling over phoneme lattices, we represent all of our models using weighted finite state transducers (WFSTs), which allow for simple and efficient combination of the phoneme lattices with the LM. Using this combined lattice, we use a variant of the forward-backward algorithm to efficiently sample a phoneme string and word segmentation according to the model probabilities. By performing this procedure on each of the utterances in the corpus for several iterations, it is possible to effectively discover phoneme strings and lexical units appropriate for LM learning, even in the face of acoustic uncertainty.

In order to evaluate the feasibility of the proposed method, we performed an experiment on learning an LM from only audio files of fluent adult-directed meeting speech with no accompanying text. We demonstrate that, despite the lack of any text data, the proposed model is able to both decrease the phoneme recognition error rate over a separate test set and acquire a lexicon with many intuitively reasonable lexical entries.

Copyright © 2012 The Institute of Electronics, Information and Communication Engineers

Moreover, we demonstrate that the proposed lattice processing approach is effective for overcoming acoustic ambiguity present during the training process.

In Sect. 2 we briefly overview ASR, including language modeling and representation of ASR models in the WFST framework. Section 3 describes previous research on LM-based unsupervised word segmentation, which learns LMs even when there are no clear boundaries between words. In Sect. 4 we propose a method for formulating LM-based unsupervised word segmentation using a combination of WFSTs and Gibbs sampling. We conclude the description in Sect. 4.3 by showing that the WFST-based formulation allows for LM learning directly from speech, even in the presence of acoustic uncertainty. Section 5 describes the results of an experimental evaluation demonstrating the effectiveness of the proposed method, and Sect. 6 concludes the paper and discusses future directions.

2. Speech Recognition and Language Modeling

This section provides an overview of ASR and language modeling and provides definitions that will be used in the rest of the paper.

2.1 Speech Recognition

ASR can be formalized as the task of finding a series of words W given acoustic features X of a speech signal containing these words. Most ASR systems use statistical methods, creating a model for the posterior probability of the words given the acoustic features, and searching for the word sequence that maximizes this probability

    \hat{W} = \arg\max_W P(W | X).                                             (1)

As this posterior probability is difficult to model directly, Bayes's law is used to decompose the probability

    \hat{W} = \arg\max_W \frac{P(X | W) P(W)}{P(X)}                            (2)
            = \arg\max_W P(X | W) P(W).                                        (3)

Here, P(X | W) is computed by the acoustic model (AM), which makes a probabilistic connection between words and their acoustic features. However, directly modeling the acoustic features of the thousands to millions of words in large-vocabulary ASR systems is not realistic due to data sparsity issues. Instead, AMs are trained to recognize sequences of phonemes Y, which are then mapped into the word sequence W. Phonemes are defined as the smallest perceptible linguistic unit of speech. Thus, the entire ASR process can be described as finding the optimal word sequence according to the following formula

    \hat{W} = \arg\max_W \sum_Y P(X | Y) P(Y | W) P(W).                        (4)

This is usually further approximated by choosing the single most likely phoneme sequence to allow for efficient search:

    \hat{W} = \arg\max_{W,Y} P(X | Y) P(Y | W) P(W).                           (5)

Here, P(X | Y) indicates the AM probability and P(Y | W) is a lexicon probability that maps between words and their pronunciations. P(W) is computed by the LM, which we will describe in more detail in the following section. It should be noted that in many cases a scaling factor α is used

    \hat{W} = \arg\max_{W,Y} P(X | Y) P(Y | W) P(W)^\alpha.                    (6)

This allows for the adjustment of the relative weight put on the LM probability.
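To make the search in Eqs. (5) and (6) concrete, the following minimal sketch enumerates a toy hypothesis space of word/pronunciation pairs and picks the highest-scoring one. It is not from the paper: every probability table, the vocabulary, and the constant ALPHA below are invented for illustration and merely stand in for a real acoustic model P(X|Y), lexicon P(Y|W), and language model P(W).

```python
# Toy illustration of Eq. (6): all values below are invented.
ACOUSTIC = {"e- e s a r": 0.020, "e- e s a": 0.015}   # P(X|Y) for candidate phoneme strings
LEXICON = {("ASR",): ("e- e s a r", 1.0),             # P(Y|W): word sequence -> (pronunciation, prob)
           ("e-", "esa"): ("e- e s a", 1.0)}
LM = {("ASR",): 0.10, ("e-", "esa"): 0.01}            # P(W)
ALPHA = 5.0                                           # LM scaling factor of Eq. (6)

def decode():
    best, best_score = None, -1.0
    for words, (phones, p_lex) in LEXICON.items():
        score = ACOUSTIC.get(phones, 0.0) * p_lex * LM[words] ** ALPHA
        if score > best_score:
            best, best_score = words, score
    return best, best_score

print(decode())   # picks ('ASR',) under these toy numbers
```

Raising ALPHA sharpens the influence of the LM term relative to the acoustic score, which is the role the scaling factor plays in Eq. (6).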
2.2 Language Modeling

The goal of the LM probability P(W) is to provide a preference towards good word sequences, assigning high probability to word sequences that the speaker is likely to say, and low probability to word sequences that the speaker is unlikely to say. By doing so, this allows the ASR system to select linguistically proper sequences when purely acoustic information is not enough to correctly recognize the input. The most popular form of LM is the n-gram, which is notable for its simplicity, computational efficiency, and surprising power [18].

n-gram LMs are based on the fact that it is possible to calculate the joint probability of W = w_1^I sequentially by conditioning on all previous words in the sequence using the chain rule

    P(W) = \prod_{i=1}^{I} P(w_i | w_1^{i-1}).                                 (7)

Conditioning on previous words in the sequence allows for the consideration of contextual information in the probabilistic model. However, as few sentences will contain exactly the same words as any other, conditioning on all previous words in the sentence quickly leads to data sparseness issues. n-gram models resolve this problem by only conditioning on the previous (n-1) words when choosing the next word in the sequence

    P(W) \approx \prod_{i=1}^{I} P(w_i | w_{i-n+1}^{i-1}).                     (8)

The conditional probabilities are generally trained from a large corpus of word sequences \mathcal{W}. From \mathcal{W} we calculate the counts of each subsequence of n words w_{i-n+1}^{i} (an "n-gram"). From these counts, it is possible to compute conditional probabilities using maximum likelihood estimation

    P_{ml}(w_i | w_{i-n+1}^{i-1}) = c(w_{i-n+1}^{i}) / c(w_{i-n+1}^{i-1}).     (9)

However, even if we set n to a relatively small value, we will never have a corpus large enough to exhaustively cover all possible n-grams.
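A minimal sketch of maximum likelihood n-gram estimation as in Eq. (9); the tiny corpus below is invented for illustration.

```python
from collections import Counter

def mle_ngrams(corpus, n=2, bos="<s>", eos="</s>"):
    """Count n-grams and their (n-1)-gram histories, returning c(w_{i-n+1}^i) / c(w_{i-n+1}^{i-1})."""
    ngram, hist = Counter(), Counter()
    for sent in corpus:
        words = [bos] * (n - 1) + sent + [eos]
        for i in range(n - 1, len(words)):
            ngram[tuple(words[i - n + 1:i + 1])] += 1
            hist[tuple(words[i - n + 1:i])] += 1
    return {g: c / hist[g[:-1]] for g, c in ngram.items()}

# Invented toy corpus.
corpus = [["the", "diet", "met"], ["the", "diet", "adjourned"]]
probs = mle_ngrams(corpus, n=2)
print(probs[("the", "diet")])   # -> 1.0
print(probs[("diet", "met")])   # -> 0.5
```

Any bigram not seen in the corpus simply receives zero probability here, which is exactly the sparsity problem that the smoothing and fallback techniques described next are designed to address.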

In order to deal with this data sparsity issue, it is common to use a framework that references higher order n-gram probabilities when they are available, and falls back to lower order n-gram probabilities according to a fallback probability P(FB | w_{i-n+1}^{i-1}):

    P(w_i | w_{i-n+1}^{i-1}) =
        P_s(w_i | w_{i-n+1}^{i-1})                            if c(w_{i-n+1}^{i}) > 0,
        P(FB | w_{i-n+1}^{i-1}) P(w_i | w_{i-n+2}^{i-1})      otherwise.       (10)

By combining more accurate but sparse higher-order n-grams with less accurate but more reliable lower-order n-grams, it is possible to create LMs that are both accurate and robust. To reserve some probability for P(FB | w_{i-n+1}^{i-1}), we replace P_{ml} with the smoothed probability distribution P_s. P_s can be defined according to a number of smoothing methods, which are described thoroughly in [19].

2.3 Bayesian Language Modeling

While traditional methods for LM smoothing are based on heuristics (often theoretically motivated), it is also possible to motivate language modeling from the perspective of Bayesian statistics [17], [20]. In order to perform smoothing in the Bayesian framework, we first define a variable g_{w_i | w_{i-m+1}^{i-1}} that specifies n-gram probabilities

    g_{w_i | w_{i-m+1}^{i-1}} = P(w_i | w_{i-m+1}^{i-1})                       (11)

where 0 <= m <= n-1 is the length of the context being considered. As we are not sure of the actual values of the n-gram probabilities due to data sparseness, the standard practice of Bayesian statistics suggests we treat all probabilities as random variables G that we can learn from the training data \mathcal{W}. Formally, this learning problem consists of estimating the posterior probability P(G | \mathcal{W}). This can be calculated in a Bayesian fashion by placing a prior probability P(G) over G and combining this with the likelihood P(\mathcal{W} | G) and the evidence P(\mathcal{W})

    P(G | \mathcal{W}) = \frac{P(\mathcal{W} | G) P(G)}{P(\mathcal{W})}        (12)
                       \propto P(\mathcal{W} | G) P(G).                        (13)

We can generally ignore the evidence probability, as the training data is fixed throughout the entire training process.

It should be noted that LMs are a collection of multinomial distributions G_{w_{i-m+1}^{i-1}} = {g_{w_i=1 | w_{i-m+1}^{i-1}}, ..., g_{w_i=N | w_{i-m+1}^{i-1}}}, where N is the number of words in the vocabulary. There is one multinomial for each history w_{i-m+1}^{i-1}, with the length of w_{i-m+1}^{i-1} being 0 through n-1. As the variables in G_{w_{i-m+1}^{i-1}} belong to a multinomial distribution, it is natural to use priors based on the Pitman-Yor process [21]. (The better-known Dirichlet process is a specific case of the Pitman-Yor process, where the discount parameter is set to zero.) The Pitman-Yor process is useful in that it is able to assign probabilities to the space of variables that form multinomial distributions. Formally, this means that if we define the prior over G_{w_{i-m+1}^{i-1}} using a Pitman-Yor process, we will be guaranteed that its elements will add to one

    \sum_{x=1}^{N} g_{w_i=x | w_{i-m+1}^{i-1}} = 1                             (14)

and be between zero and one

    0 <= g_{w_i=x | w_{i-m+1}^{i-1}} <= 1,   1 <= x <= N.                      (15)

The Pitman-Yor process has three parameters: the discount parameter d_m, the strength parameter \theta_m, and the base measure G_{w_{i-m+2}^{i-1}}

    G_{w_{i-m+1}^{i-1}} ~ PY(d_m, \theta_m, G_{w_{i-m+2}^{i-1}}).              (16)

The discount d_m is subtracted from observed counts, and when it is given a large value (close to one), the model will give more probability to frequent words. The strength \theta_m controls the overall sparseness of the distribution, and when it is given a small value the distribution will be sparse. (Following [17], we give the strength and discount parameters a prior and allow them to be chosen automatically.) The base measure G_{w_{i-m+2}^{i-1}} of the Pitman-Yor process indicates the expected value of the probability distributions it generates, and is essentially the default value used when there are no words in the training corpus for context w_{i-m+1}^{i-1}.
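As a rough illustration of how the discount, strength, and base measure interact, the following sketch computes a predictive word probability for a single context using the Chinese Restaurant Process form of the Pitman-Yor process (the same bookkeeping described in [17]). The counts, table assignments, and base measure below are invented, and the function name is hypothetical.

```python
def py_predictive(word, cust, tables, d, theta, base):
    """Predictive probability of `word` under a single Pitman-Yor 'restaurant'.

    cust:   dict word -> number of customers (observed counts) in this context
    tables: dict word -> number of CRP tables serving that word
    base:   function word -> base measure probability (the default distribution)
    """
    c_total = sum(cust.values())
    t_total = sum(tables.values())
    seated = max(cust.get(word, 0) - d * tables.get(word, 0), 0.0)   # discounted count mass
    new_table = (theta + d * t_total) * base(word)                   # mass reserved for the base measure
    return (seated + new_table) / (theta + c_total)

# Invented counts: a context that has seen "no" three times and "koto" once.
cust, tables = {"no": 3, "koto": 1}, {"no": 1, "koto": 1}
uniform_base = lambda w: 1.0 / 1000          # e.g., uniform over a 1000-word vocabulary
print(py_predictive("no", cust, tables, d=0.5, theta=1.0, base=uniform_base))
print(py_predictive("hanashi", cust, tables, d=0.5, theta=1.0, base=uniform_base))
```

In the hierarchical model described next, the base measure is itself the predictive probability of the next-shorter context, so this computation is applied recursively from the longest context down to the unigram model.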
It should be noted that here, we are setting the base measure of each G_{w_{i-m+1}^{i-1}} to that of its parent context, G_{w_{i-m+2}^{i-1}}. This forms a hierarchical structure that is referred to as the hierarchical Pitman-Yor LM (HPYLM, [17]) and shown in Fig. 1. This hierarchical structure implies that each set of m-gram (e.g., trigram) probabilities will be using its corresponding (m-1)-gram (e.g., bigram) probabilities as a starting point when no or little training data is available. As a result, we achieve a principled probabilistic interpolation of m-gram and (m-1)-gram smoothing similar to the heuristic methods described in Sect. 2.2. Finally, the base measure of the unigram model G_0 indicates the prior probability over words in the vocabulary. If we have a vocabulary of all the words that the HPYLM is expected to generate, we can simply set this so that a uniform probability is given to each word in the vocabulary.

Fig. 1: An example of the hierarchical structure of the HPYLM.

For the Pitman-Yor process, the actual probabilities of the LM can be calculated through Gibbs sampling and the Chinese Restaurant Process (CRP) formulation, the details of which are beyond the scope of this paper but described in [17].
The important thing to note is that for each n-gram probability, it is possible to calculate the expectation of the probability given a set of sufficient statistics S

    P(w_i | w_{i-n+1}^{i-1}, S) = \int_0^1 g_{w_i | w_{i-n+1}^{i-1}} P(g_{w_i | w_{i-n+1}^{i-1}} | S) \, dg_{w_i | w_{i-n+1}^{i-1}}.    (17)

The statistics S mainly consist of n-gram counts, but also some auxiliary variables that summarize the configuration of the CRP. These can be easily computed given a word-segmented corpus \mathcal{W}. The practical implication of this is that we do not need to directly estimate the parameters G, but only need to keep track of the sufficient statistics needed to calculate this expectation of P(w_i | w_{i-n+1}^{i-1}, S). This fact becomes useful when using this model in unsupervised learning, as described in later sections.

2.4 Weighted Finite State ASR

In recent years, the paradigm of weighted finite state transducers (WFSTs) has brought about great increases in the speed and flexibility of ASR systems [22]. Finite state transducers are finite automata with transitions labeled with input and output symbols. WFSTs also assign a weight to transitions, allowing for the definition of weighted relations between two strings. These weights can be used to represent the probabilities of each model for ASR, including the AM, the lexicon, and the LM, examples of which are shown in Fig. 2. In figures of the WFSTs, edges are labeled as a/b:c, where a indicates the input, b indicates the output, and c indicates the weight. b may be omitted when a and b are the same value, and c will be omitted when it is equal to 1.

Fig. 2: The WFSTs for ASR including (a) the acoustic model A, (b) the lexicon L, and (c) the language model G.

The standard AM for P(X | Y) in most ASR systems is based on a Hidden Markov Model (HMM); we will call its WFST representation A. A simplified example of this model is shown in Fig. 2 (a). As input, this takes acoustic features, and after several steps through the HMM outputs a single phoneme such as "e-" or "s". The transition and emission probabilities are identical to the standard HMM used in ASR acoustic models, but we have omitted them from the figure for simplicity.

The WFST formulation for the lexicon, which we will call L, shown in Fig. 2 (b), takes phonemes as input and outputs words along with their corresponding lexicon probability P(Y | W). Excluding the case of homographs (words with the same spelling but different pronunciations), the probability of transitions in the lexicon will be 1.

Finally, the LM probability P(W) can also be represented in the WFST format. Figure 2 (c) shows an example of a bigram LM with only two words w_1 and w_2 in the vocabulary. Each node represents a unique n-gram context w_{i-1}, and the outgoing edges from the node represent the probability of symbols given this context, P(w_i | w_{i-1}). In order to handle the fallback to lower-order contexts as described in Sect. 2.2, edges that fall back from w_{i-m+1}^{i-1} to w_{i-m+2}^{i-1} are added, weighted with the fallback probability (marked with FB in the figure). The label ε on these edges indicates the empty string, which means they can be followed at any time, regardless of the input symbol.

The main advantage of using WFSTs to describe the ASR problem is the existence of efficient algorithms for operations such as composition, intersection, determinization, and minimization. In particular, composition (written X ∘ Y) allows the combination of two WFSTs in sequence, so if we compose A ∘ L ∘ G together, we can create a single WFST that takes acoustic features as input and outputs weighted strings of words entailed by the acoustic features. We use this property of WFSTs later to facilitate the implementation of our learning of LMs from continuous speech.
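The following toy sketch gives a flavor of transducer composition on a tiny hand-built example. It is not the OpenFst-style machinery actually used for ASR: each transducer is just a tuple of edges, a start state, and final states; true epsilon handling is approximated by an explicit ("","") self-loop on G; and determinization and minimization are omitted. Both machines and their weights are invented for illustration.

```python
# Each toy WFST is (edges, start, finals); an edge is (src, dst, in_sym, out_sym, weight).
def compose(t1, t2):
    edges1, start1, finals1 = t1
    edges2, start2, finals2 = t2
    # Match the output symbol of t1 against the input symbol of t2, multiplying weights.
    edges = [((a1, a2), (b1, b2), i1, o2, w1 * w2)
             for (a1, b1, i1, o1, w1) in edges1
             for (a2, b2, i2, o2, w2) in edges2
             if o1 == i2]
    finals = {(x, y) for x in finals1 for y in finals2}
    return edges, (start1, start2), finals

# Toy lexicon L: reads the phonemes "e- e s a r" and emits the word "ASR" on the last edge.
L = ([(0, 1, "e-", "", 1.0), (1, 2, "e", "", 1.0), (2, 3, "s", "", 1.0),
      (3, 4, "a", "", 1.0), (4, 5, "r", "ASR", 1.0)], 0, {5})
# Toy unigram LM G over a one-word vocabulary; the 0.1 weight is invented.
G = ([(0, 0, "ASR", "ASR", 0.1), (0, 0, "", "", 1.0)], 0, {0})

edges, start, finals = compose(L, G)
print(len(edges), start, sorted(finals))   # 5 composed edges forming one weighted path
```

The composed machine reads phonemes and outputs LM-weighted words, which is exactly the property exploited later when L and GH are composed with phoneme input.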
3. Learning LMs from Unsegmented Text

While Sects. 2.2 and 2.3 described how to learn LMs when we are given a corpus of word sequences \mathcal{W}, there are some cases when the word sequence is not obvious. For example, when human babies learn words they do so from continuous speech, even though there often are not explicit boundaries between words in the phoneme stream. In addition, many languages such as Japanese and Chinese are written without boundaries between words, and thus the definition of words is not uniquely fixed. These two facts have led to significant research interest in unsupervised word segmentation (WS), the task of finding words and learning LMs from unsegmented phoneme or character strings with no manual intervention [7], [16], [23]-[26].

3.1 Unsupervised WS Modeling

In this work, we follow [16] in taking an LM-based approach to unsupervised WS, learning a word-based LM G from a corpus of unsegmented phoneme strings Y.

This problem can be specified as finding a model according to the posterior probability of the LM, P(G | Y), which we can decompose using Bayes's law

    P(G | Y) \propto P(Y | G) P(G).                                            (18)

However, as G is a word-based LM, we also assume that there are hidden word sequences \mathcal{W}, and model the probability given these sequences

    P(G | Y) \propto \sum_{\mathcal{W}} P(Y | \mathcal{W}) P(\mathcal{W} | G) P(G).    (19)

Here, P(Y | \mathcal{W}) indicates that the words in \mathcal{W} must correspond to the phonemes in Y, and will be 1 if and only if Y can be recovered by concatenating the words in \mathcal{W} together. P(\mathcal{W} | G) is the likelihood given the LM probabilities, and is identical to that described in Eq. (8). P(G) can be set using the previously described HPYLM, with one adjustment.

With the model we described in Sect. 2.3, it was necessary to know the full vocabulary in advance so that we could set the base measure G_0 to a uniform distribution over all the words in the vocabulary. However, when learning an LM from unsegmented text, \mathcal{W} is not known in advance, and thus it is impossible to define a closed vocabulary before training starts. As a result, it is necessary to find an alternative method of defining G_0 that allows the model to flexibly decide which words to include in the vocabulary as training progresses. In order to do so, [16] uses a spelling model H, which assigns prior probabilities over words by using an LM specified over phonemes. If we have a word w_i that consists of phonemes y_1, ..., y_J, we define the spelling model probability of w_i according to the n-gram probabilities of H:

    G_0(w_i) = P(w_i = y_1, ..., y_J | H) = \prod_{j=1}^{J} H_{y_j | y_{j-n+1}^{j-1}}.    (20)

We assume that H is also distributed according to the HPYLM, and that the set of phonemes is closed and thus we are able to define a uniform distribution over phonemes H_0. The probabilities of H can be calculated from the set of phoneme sequences of words generated from the spelling model, much like the probabilities of G can be calculated from the set of word sequences contained in the corpus. This gives us a full generative model for the corpus Y that first generates the LM probabilities

    H ~ HPYLM(d_H, \theta_H, H_0)                                              (21)
    G ~ HPYLM(d_G, \theta_G, P(w | H))                                         (22)

then generates each word sequence W \in \mathcal{W} and concatenates it into a phoneme sequence

    W ~ P(W | G)                                                               (23)
    Y = concat(W).                                                             (24)

This generative story is important in that it allows for the creation of LMs that are both highly expressive and compact (and thus have high generalization capacity). The HPYLM priors for H and G have a preference for simple models, and thus will tend to induce compact models, while the likelihoods for \mathcal{W} bias towards larger and more expressive models that describe the data well.

3.2 Inference for Unsupervised WS

The main difficulty in learning the LM G from the phoneme string Y is solving Eq. (19). Here, it is necessary to sum over all possible configurations of \mathcal{W}, which represent all possible segmentations of Y. However, for all but the smallest of corpora, the number of possible segmentations is astronomical, and thus it is impractical to explicitly enumerate all possible \mathcal{W}. Instead, we can turn to Gibbs sampling [27], [28], a method for calculating this sum approximately. Gibbs sampling approximates the integral or sum over multivariate distributions by stepping through each variable in the distribution and sampling it given all of the other variables to be estimated. As we are interested in calculating \mathcal{W}, for each step of the algorithm we take a single sentence W_k \in \mathcal{W} and sample it according to the distribution P(W_k | Y_k, S_{-W_k}). S indicates the sufficient statistics calculated from the current configuration of \mathcal{W} required to calculate language model probabilities (as described in Sect. 2.3). S_{-W_k} indicates the sufficient statistics after subtracting the n-gram counts and corresponding CRP configurations that were obtained from the sentence W_k. These sufficient statistics allow us to calculate the conditional probability of W_k given all other sentences, a requirement to properly perform Gibbs sampling. It should be noted that each W_k contains multiple variables (words), so this is a variant of blocked Gibbs sampling, which samples multiple variables simultaneously [29].

The full sampling procedure is shown in Fig. 3, and we further detail how a single sentence W_k can be sampled according to this distribution in the following section. By repeating Gibbs sampling for many iterations, the sampled values of each sentence W_k, and the LM sufficient statistics S calculated therefrom, will gradually approach the high-probability areas specified by the model.

Fig. 3: The algorithm for Gibbs sampling of the word sequence W and the sufficient statistics S necessary for calculating LM probabilities. On the first iteration, we start with an empty S, and gradually add the statistics for each sentence as they are sampled.
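To make the procedure of Fig. 3 concrete, here is a deliberately simplified, self-contained sketch of the same subtract/sample/add loop. It replaces the HPYLM with a unigram word model smoothed toward a crude, invented spelling-model prior, and specializes the sampling step of Sect. 4.2 to plain strings; the constants and the toy corpus are likewise invented.

```python
import random
from collections import Counter

N_PHONES, MAX_LEN, ALPHA = 5, 4, 1.0          # invented constants for the toy model

def word_prob(w, S):
    base = (1.0 / N_PHONES) ** len(w) / MAX_LEN          # crude stand-in for the spelling model
    return (S[w] + ALPHA * base) / (sum(S.values()) + ALPHA)

def sample_segmentation(y, S, rng):
    """Sample one segmentation of phoneme string y via forward filtering / backward sampling."""
    f = [0.0] * (len(y) + 1)
    f[0] = 1.0
    for j in range(1, len(y) + 1):                        # forward filtering over positions
        f[j] = sum(f[i] * word_prob(y[i:j], S) for i in range(max(0, j - MAX_LEN), j))
    words, j = [], len(y)
    while j > 0:                                          # backward sampling of word boundaries
        cands = [(i, f[i] * word_prob(y[i:j], S)) for i in range(max(0, j - MAX_LEN), j)]
        r = rng.random() * sum(p for _, p in cands)
        for i, p in cands:
            r -= p
            if r <= 0:
                break
        words.append(y[i:j])
        j = i
    return list(reversed(words))

def gibbs_train(corpus, n_iters=50, seed=0):
    rng, S, W = random.Random(seed), Counter(), {}
    for _ in range(n_iters):
        for k, y in enumerate(corpus):
            if k in W:
                S.subtract(W[k])                          # S_{-W_k}: remove this sentence's counts
            W[k] = sample_segmentation(y, S, rng)         # sample W_k ~ P(W_k | Y_k, S_{-W_k})
            S.update(W[k])                                # add the new counts back to S
    return W, S

corpus = ["abcab", "abde", "deab"]                        # invented toy "phoneme" strings
print(gibbs_train(corpus)[0])
```

Even this toy version tends to show the behavior described above: substrings shared across sentences are reused as words, because reusing a word raises its count and hence its probability on the next pass.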

As mentioned previously, the HPYLM-based formulation prefers highly expressive, compact models. Lexicons that contain many words are penalized by the HPYLM prior, preventing segmentations of \mathcal{W} that result in a large number of unique words. On the other hand, if the lexicon is too small, it will result in low descriptive power. Thus the sampled values are expected to be those with a consistent segmentation for words, and with common phoneme sequences grouped together as single words.

3.3 Calculating Predictive Probabilities

As the main objective of an LM is to assign a probability to an unseen phoneme string \tilde{Y}, we are interested in calculating the predictive distribution

    P(\tilde{Y} | Y) = \int_G \sum_{\tilde{W} \in \{\tilde{W} : concat(\tilde{W}) = \tilde{Y}\}} P(\tilde{W} | G) P(G | Y) \, dG.    (25)

However, computing this function directly is computationally difficult. To reduce this computational load we approximate the summation over \tilde{W} with the maximization, assuming that the probability of \tilde{Y} is equal to that of its most likely segmentation. In addition, assume we have I effective samples of the sufficient statistics obtained after iterations of the previous sampling process (some samples may be skipped during the early stages of sampling, a process called "burn-in", to help ensure that samples are likely according to the HPYLM). Using these samples, we can approximate the integral over G with the mean of the probabilities given the sufficient statistics {S_1, ..., S_I}

    P(\tilde{Y} | Y) \approx \frac{1}{I} \sum_{i=1}^{I} \max_{\tilde{W} \in \{\tilde{W} : concat(\tilde{W}) = \tilde{Y}\}} P(\tilde{W} | S_i).    (26)

While Eq. (26) approximates the probability using the average maximum-segmentation probability of each S_i, search for such a solution at decoding time is a non-trivial problem. As an approximation to this sum, we find the one-best solution mandated by each of the samples, and combine the separate solutions using ROVER [30].

4. WFST-based Sampling of Word Sequences

While the previous section described the general flow of the inference process, we still require an effective method to sample the word sequence W according to the probability P(W | Y, S_{-W}). One way to do so would be to explicitly enumerate all possible segmentations for Y, calculate their probabilities, and sample based on these probabilities. However, as the number of possible segmentations of Y grows exponentially in the length of the sentence, this is an unrealistic solution. Thus, the most difficult challenge of the algorithm in Fig. 3 is efficiently obtaining a word sequence W given a phoneme sequence Y according to the language model probabilities specified by S_{-W}. One solution is proposed by [16], who use a dynamic programming algorithm that allows for efficient sampling of a value for W according to the probability P(W | Y, S_{-W}). While this method is applicable to unsegmented text strings, it is not applicable to situations where uncertainty exists in the input, such as the case of learning from speech.

Here we propose an alternative formulation that uses the WFST framework. This is done by first creating a WFST-based formulation of the WS model (Sect. 4.1), then describing a dynamic programming method for sampling over WFSTs (Sect. 4.2). This formulation is critical for learning from continuous speech, as it allows for sampling a word string W from not only one-best phoneme strings, but also phoneme lattices that are able to encode the uncertainty inherent in acoustic matching results.

4.1 A WFST Formulation for Word Segmentation

Our formulation for sampling word sequences consists of first generating a lattice of all possible segmentation candidates using WFSTs, then performing sampling over this lattice. The three WFSTs used for WS (Fig. 4) are quite similar to the ASR WFSTs shown in Fig. 2. In place of the acoustic model WFST used in ASR, we simply use a linear chain representing the phonemes in Y, as shown in Fig. 4 (a).

Fig. 4: The WFSTs for word segmentation including (a) the input Y, (b) the lexicon L, and (c) the language model GH.

The lexicon WFST L in Fig. 4 (b) is identical to the lexicon WFST used in ASR, except that in addition to creating words from phonemes, it also allows all phonemes in the input to be passed through as-is. This allows words in the lexicon to be assigned word-based probabilities according to the language model G, and all words (in the lexicon or not) to be assigned probabilities according to the spelling model H. This is important in the unsupervised WS setting, where the lexicon is not defined in advance, and words outside of the lexicon are still assigned a small probability. The training process starts with an empty lexicon, and thus no paths emitting words are present. When a word that is not in the lexicon is sampled as a phoneme sequence, L is modified by adding a path that converts the new word's phonemes into its corresponding word token. Conversely, when the last sample containing a word in the lexicon is subtracted from the distribution and the word's count becomes zero, its corresponding path is removed from L. It should be noted that we assume that each word can be mapped onto a single spelling, so P(Y | W) will always be 1.

More major changes are made to the LM WFST, which is shown in Fig. 4 (c). Unlike the case in ASR, where we are generally only concerned with words that exist in the vocabulary, it is necessary to model unknown words that are not included in the vocabulary. The key to the representation is that the word-based LM G and the phoneme-based spelling model H are represented in a single WFST, which we will call GH. GH has weighted edges falling back from the base state of G to H, and edges accepting the terminal symbol for unknown words and transitioning from H to the base state of G. This allows the WFST to transition as necessary between the known word model and the spelling model. By composing together these three WFSTs as Y ∘ L ∘ GH, it is possible to create a WFST representing a lattice of segmentation candidates weighted with probabilities according to the LM.
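The role of the combined GH machine can be illustrated without any WFST machinery: a candidate word is scored by the word model G if it is in the current lexicon, and otherwise by the spelling model H reached through the weighted fallback edge. All of the probabilities in this sketch are invented, and each character stands in for one phoneme.

```python
# Toy illustration of the G/H fallback encoded in GH; every constant below is invented.
G_PROBS = {"no": 0.08, "koto": 0.03}   # word-LM probabilities for in-lexicon words
P_FALLBACK = 0.05                      # weight of the edge falling back from G to H
P_PHONE = 0.02                         # per-phoneme spelling-model probability in H
P_END = 0.10                           # probability of the word-terminal symbol in H

def gh_score(word):
    if word in G_PROBS:                # path staying inside G
        return G_PROBS[word]
    # path through H: fallback edge, one edge per phoneme, then the terminal symbol
    return P_FALLBACK * (P_PHONE ** len(word)) * P_END

for w in ["no", "hanashi"]:
    print(w, gh_score(w))
```

The fallback path gives every unknown word a small but nonzero score, which is what allows new lexical entries to be sampled into the lexicon at all.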
4.2 Sampling over WFSTs

Once we have a WFST lattice representing the model probabilities, we can sample a single path through the WFST according to the probabilities assigned to each edge. This is done using a technique called forward-filtering/backward-sampling, a concept similar to that of the forward-backward algorithm for hidden Markov models (HMMs). This algorithm can be used to acquire a sample from any probabilistically weighted, acyclic WFST defined by a set of states S and a set of edges E.

The first step of the algorithm consists of choosing an ordering for the states in S, which we will write s_1, ..., s_I. This ordering must be chosen so that all states included in paths that travel to state s_i are processed before s_i itself. Each edge in E is defined as e_k = <s_i, s_j, w_k>, traveling from s_i to s_j and weighted by w_k. Assuming the graph is acyclic, we can choose the ordering so that for all edges in E, i < j. Given this ordering, if all states are processed in ascending order, we can be ensured that all states will be processed after their predecessors.

Next, we perform the forward filtering step, identical to the forward pass of the forward-backward algorithm for HMMs, where probabilities are accumulated from the start state to following states. The initial state s_0 is given a forward probability f_0 = 1, and all following states are updated with the sum of the forward probabilities of each of the incoming states multiplied by the weights of the edges to the current state

    f_j = \sum_{e_k = <s_i, s_j, w_k> \in E} f_i w_k.                          (27)

This forward probability can be interpreted as the total probability of all paths that travel to s_j from the initial state.

We provide an example of this process using a weighted finite state acceptor (WFSA) for the unigram segmentation model of "e- e s a r" ("ASR") shown in Fig. 5. (In this work, we assume that all words are represented by their phonetic spelling, not considering the graphemic representation used in usual text; for example, the word "ASR" will be transcribed as "e- e s a r" in the learned model.) In this case, the forward step will push probabilities from the first state as follows:

    f_1 = P(e-) f_0                                                            (28)
    f_2 = P(e-e) f_0 + P(e) f_1                                                (29)
    ...

Fig. 5: A WFSA representing a unigram segmentation (words of length greater than three are not displayed).

The backward sampling step of the algorithm consists of sampling a path starting at the final state s_I of the WFST. For the current state s_j, we can calculate the probability of all incoming edges

    P(e_k = <s_i, s_j, w_k>) = f_i w_k / f_j,                                  (30)

and sample a single incoming edge according to this probability. Here w_k considers the likelihood of e_k itself, while f_i considers the likelihood of all paths traveling up to s_i, allowing for the correct sampling of an edge e_k according to the probability of all paths that travel through it to the current state s_j. In the example, the edge incoming to state s_5 is sampled according to

    P(s_4 -> s_5) = P(r) f_4 / f_5                                             (31)
    P(s_3 -> s_5) = P(ar) f_3 / f_5.                                           (32)

Through this process, a path representing the segmentation of the phoneme string can be sampled according to the probability of the models included in the lattice. Given this path, it is possible to recover Y and W by concatenating the phonemes and words represented by the input and output of the sampled path respectively.
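A compact sketch of the two passes on a small hand-built acyclic acceptor follows. The function name, the topology, and the weights are invented and only loosely mirror Fig. 5; the forward loop corresponds to Eq. (27) and the backward loop to Eq. (30).

```python
import random

def forward_filter_backward_sample(edges, n_states, final, rng=None):
    """Sample one path through an acyclic weighted acceptor.

    edges: list of (src, dst, label, weight) with src < dst for every edge
           (states are integers already in topological order).
    """
    rng = rng or random.Random(0)
    f = [0.0] * n_states
    f[0] = 1.0
    for src, dst, _, w in sorted(edges, key=lambda e: e[1]):   # forward filtering, Eq. (27)
        f[dst] += f[src] * w
    incoming = {}
    for e in edges:
        incoming.setdefault(e[1], []).append(e)
    path, state = [], final
    while state != 0:                                          # backward sampling, Eq. (30)
        cands = incoming[state]
        probs = [f[src] * w for src, _, _, w in cands]         # normalization by f_j is implicit
        r = rng.random() * sum(probs)
        for e, p in zip(cands, probs):
            r -= p
            if r <= 0:
                break
        path.append(e[2])
        state = e[0]
    return list(reversed(path))

# Invented acceptor over the phonemes of "e- e s a r".
edges = [(0, 1, "e-", 0.6), (0, 2, "e-e", 0.2), (1, 2, "e", 0.5), (2, 3, "s", 0.9),
         (3, 4, "a", 0.7), (3, 5, "ar", 0.2), (4, 5, "r", 0.8)]
print(forward_filter_backward_sample(edges, n_states=6, final=5))
```

Because each backward step chooses among incoming edges in proportion to f_i w_k, the resulting path is drawn from the full path distribution of the lattice rather than being the single best path.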

Fig. 6: A WFSA representing a phoneme lattice.

4.3 Extension to Continuous Speech Input

When learning from continuous speech, the input is not a set of phoneme strings Y, but a set of spoken utterances X. As a result, instead of sampling just the word sequences W, we now need to additionally sample the phoneme strings Y. If we can create a single lattice representing the probability of both W and Y for a particular X, it is possible to use the forward-filtering/backward-sampling algorithm to sample phoneme strings and their segmentations together.

With the WFST-based formulation described in the previous section, it is straightforward to create this lattice representing candidates for Y and W. In fact, all we must do is replace the string of phonemes Y that was used in the WS model in Fig. 4 (a) with the acoustic model HMM A used for ASR in Fig. 2. As a result, the composed lattice A ∘ L ∘ GH can take acoustic features as input, and includes both the acoustic and language model probabilities. Using this, we can sample appropriate new values of Y and W, and plug these into the learning algorithm of Fig. 3. However, as with traditional ASR, if we simply expand all hypotheses allowed by the acoustic model during the forward-filtering step, the hypothesis space will grow unmanageably large. As a solution to this, before starting training we first perform ASR using only the acoustic model and no linguistic information, generating trimmed phoneme lattices representing candidates for each Y such as those shown in Fig. 6.

It should be noted that this dependence on an acoustic model to estimate P(X | Y) indicates that this is not an entirely unsupervised method. However, some work has been done on language-independent acoustic model training [31], as well as the unsupervised discovery and clustering of acoustic units from raw speech [32]. The proposed LM acquisition method could be used in combination with these AM acquisition methods to achieve fully unsupervised speech recognition, a challenge that we leave to future work.

5. Experimental Evaluation

We evaluated the feasibility of the proposed method on continuous speech from meetings of the Japanese Diet (Parliament). This was chosen as an example of naturally spoken, interactive, adult-directed speech with a potentially large vocabulary, as opposed to the simplified grammars or infant-directed speech used in some previous work [6], [14].

5.1 Experimental Setup

We created phoneme lattices using a triphone acoustic model, performing decoding with a vocabulary of 385 syllables that represent the phoneme transitions allowed by the syllable model. (Syllable-based decoding was a practical consideration due to the limits of the decoding process, and is not a fundamental part of the proposed method; phoneme-based decoding will be examined in the future.) No additional linguistic information was used during the creation of the lattices, with all syllables in the vocabulary being given a uniform probability.

In order to assess the amount of data needed to effectively learn an LM, we performed experiments using five different corpora of varying sizes: 7.9, 16.1, 31.1, 58.7, and minutes. The speech was separated into utterances, with utterance boundaries being delimited by short pauses of 200 ms or longer.
According to this criterion, the training data consisted of 119, 238, 476, 952, and 1,904 utterances respectively. An additional 27.2 minutes (500 utterances) of speech were held out as a test set.

As a measure of the quality of the LM learned by the training process, we used the phoneme error rate (PER) when the LM was used to re-score the phoneme lattices of the test set. We chose PER as word-based accuracy may depend heavily on a particular segmentation standard. Given no linguistic information, the PER on the test set was 34.20%. The oracle PER of the phoneme lattice was 8.10%, indicating the lower bound possibly obtainable by LM learning.

Fifty samples of the word sequences W for each training utterance (and the resulting sufficient statistics S) were taken after 20 iterations of burn-in, the first 10 of which were annealed according to the technique presented by [25]. For the LM scaling factor of Eq. (6), α was set arbitrarily to 5, with values between 5 and 10 producing similar results in preliminary tests.
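For reference, PER as used above is conventionally computed from a Levenshtein alignment between the reference and hypothesis phoneme sequences, divided by the reference length. A minimal sketch (the example strings are invented, with each character standing in for one phoneme):

```python
def per(ref, hyp):
    """Phoneme error rate: edit distance between reference and hypothesis phoneme
    sequences, divided by the reference length, as a percentage."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[n][m] / n

# Invented example: one deletion against a 6-phoneme reference -> about 16.7%.
print(per("kangae", "kanga"))
```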

5.2 Effect of n-gram Context Dependency

In the first experiment, the effect of using context information in the learning process was examined. The n of the HPYLM language model was set to 1, 2, or 3, and the n of the HPYLM spelling model was set to 3 for all models. The results with regards to PER are shown in Fig. 7.

Fig. 7: Phoneme error rate by model order.

First, it can be seen that an LM learned directly from speech was able to improve the accuracy by 7% absolute PER or more compared to a baseline using no linguistic information. This is true even with only 7.9 minutes of training speech. In addition, the results show that the bigram model outperforms the unigram, and the trigram model outperforms the bigram, particularly as the size of the training data increases. We were also able to confirm the observation of [25] that the unigram model tends to undersegment, grouping together multi-word phrases instead of actual words. This is reflected in the vocabulary and n-gram sizes of the three models after the final iteration of the learning process, which are displayed in Table 1. It can also be seen that the vocabulary size increases when the LM is given a smaller n, with the lack of complexity in the word-based LM being transferred to the phoneme-based spelling model.

Table 1: The size of the vocabulary, and the number of n-grams in the word-based model G and the phoneme-based model H, when trained on the largest training set (rows: vocabulary size, G entries, H entries; columns: 1-gram, 2-gram, 3-gram models).

5.3 Effect of Joint and Bayesian Estimation

The proposed method has two major differences from previous methods such as [10], which estimates multigram models from speech lattices. The first is that we are performing joint learning of the lexicon and n-gram context, while multigram models do not consider context, similarly to the 1-gram model presented in this paper [23]. However, it is conceivable that a context-insensitive model could be used for learning lexical units, and its results used to build a traditional LM. In order to test the effect of context-sensitive learning, we experiment with not only the proposed 1-gram and 3-gram models from Sect. 5.2, but also use the 1-gram model to acquire samples of W and use these to train a standard 3-gram LM.

The second major difference is that we are performing learning using Bayesian methods. This allows us to consider the uncertainty of the acquired W through the sum in Eq. (26). Previous multigram approaches are based on maximum likelihood estimation, which only allows for a unique solution to be considered. To test the effect of this, we also take the one-best results acquired by the sampled LMs, but instead of combining them together to create a better result as explained in Sect. 3.3, we simply report the average PER of these one-best results.

Table 2 shows the results of the evaluation (performed on the largest training data set). It can be seen that the proposed method using Bayesian sample combination and incorporating LMs directly into training (3-gram/3-gram/combined) is effective in reducing the error rate compared to a model that does not use these proposed improvements (1-gram/3-gram/single).

Table 2: The effects on accuracy of the n-gram length used to acquire the lexicon and train the language model, as well as Bayesian sample combination. The proposed method significantly exceeds italicized results according to the two-proportions z-test (p < 0.05).

    Lexicon   LM        Single    Combined
    1-gram    1-gram    26.28%    26.08%
    1-gram    3-gram    26.06%    25.41%
    3-gram    3-gram    25.85%    25.28%

5.4 Effect of Lattice Processing

We also compare the proposed lattice processing method with four other LM construction methods. First, we trained a model using the proposed method, but instead of using word lattices, used one-best ASR results to provide a comparison with previous methods that have used one-best results [7], [9]. Second, to examine whether the estimation of word boundaries is necessary when acquiring an LM from speech, we trained a syllable trigram LM using these one-best results. Moreover, we show two other performance results for reference. One is an LM that was built using a human-created verbatim transcription of the utterances.
WS and pronunciation annotation were performed with the KyTea toolkit [33], and pronunciations of unknown words were annotated by hand. Trigram language and spelling models were created on the segmented word and phoneme strings using interpolated Kneser-Ney smoothing. For the second reference, we created an oracle model by training on the lattice path with the lowest possible PER for each utterance. This demonstrates an upper bound of the accuracy achievable by the proposed model if it picks all the best phoneme sequences in the training lattice.

The PER for the four methods is shown in Fig. 8. It can be seen that the proposed method significantly outperforms the model trained on one-best results, demonstrating that lattice processing is critical in reducing the noise inherent in acoustic matching results. It can also be seen that on one-best results, the model using acquired units achieves slightly but consistently better results than the syllable-based LM for all data sizes.

Fig. 8: Phoneme error rate for various training methods.

As might be expected, the proposed method does not perform as well as the model trained on gold-standard transcriptions. However, it appears to improve at approximately the same rate as the model trained on the gold-standard transcriptions as more data is added, which is not true for one-best transcriptions. Furthermore, it can be seen that the oracle results fall directly between those achieved by the proposed model and the results on the gold-standard transcriptions. This indicates that approximately one half of the difference between the model learned on continuous speech and that learned from transcripts can be attributed to the lattice error. By expanding the size of the lattice, or directly integrating the calculation of acoustic scores with sampling, it will likely be possible to further close this gap.

Another measure commonly used for evaluating the effectiveness of LMs is cross-entropy on a test set [18]. We show entropy per syllable for the LMs learned with each method in Fig. 9. It can be seen that the proposed method only slightly outperforms the model trained on one-best phoneme recognition results. This difference can be explained by systematic pronunciation variants that are not accounted for in the verbatim transcript. For example, kangaeteorimasu ("I am thinking") is often pronounced with a dropped e as kangaetorimasu in fluent conversation. As a whole word will fail to match the reference, this will have a large effect on entropy results, but less of an effect on PER, as only a single phoneme was dropped. In fact, for many applications such as speech analysis or data preparation for acoustic model training, the proposed method, which managed to properly learn pronunciation variants, is preferable to one that matches the transcript correctly.

Fig. 9: Entropy comparison for various LM learning methods.

5.5 Lexical Acquisition Results

Finally, we present a qualitative evaluation of the lexical acquisition results. Typical examples of the words that were acquired in the process of LM learning are shown in Table 3. These are split into four categories: function words, subwords, content words, and spoken language expressions.

Table 3: An example of words learned from continuous speech.

    Function Words:      no (genitive marker), ni (locative marker), to ("and")
    Subwords:            ka (kyoka "reinforcement", interrogative marker), sai (kokusai "international", seisai "sanction")
    Content Words:       koto ("thing"), hanashi ("speak"), kangae ("idea"), chi-ki ("region"), shiteki ("point out")
    Spoken Expressions:  yu- ("say", colloquial), e- (filler), desune (filler), mo-shiage ("say", polite)

In the resulting vocabulary, function words were the most common of the acquired words, which is reasonable as function words make up the majority of the actual spoken utterances. Subwords are the second most frequent category, and generally occur when less frequent content words share a common stem. An example of the content words discovered by the learning method shows a trend towards the content of discussions made in meetings of the Diet. In particular, chi-ki ("region") and shiteki ("point out") are good examples of words that are characteristic of Diet speech and acquired by the proposed model.
While this result is not surprising, it is significant in that it shows that the proposed method is able to acquire words that match the content of the utterances on which it was trained. In addition to learning the content of the utterances, the proposed model also learned a number of stylistic characteristics of the speech in the form of fillers and colloquial expressions. This is also significant in that these expressions are not included in the official verbatim records in the Diet archives, and thus would not be included in an LM that was simply trained on these texts.

6. Conclusions and Future Work

This paper presented a method for unsupervised learning of an LM given only speech and an acoustic model. Specifically, we adapted a Bayesian model for word segmentation and LM learning so that it could be applied to speech input. This was achieved by formulating all elements of LM learning as WFSTs, which allows for lattices to be used as input to the learning algorithm. We then formulated a Gibbs sampling algorithm that allows for learning over composed lattices that represent acoustic and LM probabilities. An experimental evaluation showed that LMs acquired from continuous speech with no accompanying transcriptions were able to significantly reduce the error rates of ASR compared to when no such models were used. We also showed that the proposed technique of joint Bayesian learning of lexical units and an LM over lattices significantly contributes to this improvement.

This work contributes a basic technology that opens up a number of possible directions for future research into practical applications. The first and most immediate application of the proposed method would be for use in semi-supervised learning. In the semi-supervised setting, we have some text already available, but want to discover words from untranscribed speech that may be in new domains, speaking styles, or dialects. This can be formulated in the proposed model by treating the phoneme sequences Y (and possibly word boundaries W) of existing text as observed variables and the Y and W of untranscribed speech as hidden variables.

In addition, if it is possible to create word dictionaries but not a training corpus, these dictionaries could be used as a complement or replacement to the spelling model, allowing the proposed method to favor words that occur in the dictionary.

The combination of the proposed model with information from modalities other than speech is another promising future direction. For example, while the model currently learns words as phoneme strings, it is important to learn the orthographic forms of words for practical use in ASR. One possibility is that speech could be grounded in text data such as television subtitles to learn these orthographic forms. In order to realize this in the proposed model, an additional FST layer that maps between phonetic transcriptions and their orthographic forms could be introduced to allow for a single phonetic word to be mapped into multiple orthographic words and vice-versa.

In addition, the proposed method could be used to discover a lexicon and LM for under-resourced languages with little or no written text. In order to do so, it will be necessary to train not only an LM, but also an acoustic model that is able to recognize the phonemes or tones in the target language. One promising approach is to combine the proposed method with cross-language acoustic model adaptation, an active area of research that allows for acoustic models trained in more resource-rich languages to be adapted to resource-poor languages [31], [34].

The proposed method is also of interest in the framework of computational modeling of lexical acquisition by children. In its current form, which performs multiple passes over the entirety of the data, the proposed model is less cognitively plausible than previous methods that have focused on incremental learning [35]-[37]. However, work by [35] has demonstrated that similar Bayesian methods (which were evaluated on raw text, not acoustic input) can be adapted to an incremental learning framework. This sort of incremental learning algorithm is compatible with the proposed method as well, and may be combined to form a more cognitively plausible model. (On the other hand, phonemic acquisition is generally considered to occur in the early stages of infancy, prior to lexical acquisition [6], [38], and thus our reliance on a pre-trained acoustic model is largely plausible.)

The final interesting challenge is how to scale the method to larger data sets. One possible way to improve the efficiency of sampling would be to use beam sampling techniques similar to those developed for non-parametric Markov models [39]. Another promising option is parallel sampling, which would allow sampling to be run on a number of different CPUs simultaneously [40].

References

[1] D. Tannen, Spoken and Written Language: Exploring Orality and Literacy, ABLEX.
[2] Y. Akita and T. Kawahara, Statistical transformation of language and pronunciation models for spontaneous speech recognition, IEEE Trans. Audio Speech Language Process., vol.18, no.6.
[3] I. Bazzi and J. Glass, Learning units for domain-independent out-of-vocabulary word modelling, Proc. 7th European Conference on Speech Communication and Technology (EuroSpeech), pp.61-64.
[4] T. Hirsimaki, M. Creutz, V. Siivola, M. Kurimo, S. Virpioja, and J. Pylkkonen, Unlimited vocabulary speech recognition with morph language models applied to Finnish, Comput. Speech Lang., vol.20, no.4.
[5] S. Abney and S. Bird, The human language project: Building a universal corpus of the world's languages, Proc. 48th Annual Meeting of the Association for Computational Linguistics, pp.88-97, Uppsala, Sweden, July.
[6] D. Roy and A. Pentland, Learning words from sights and sounds: a computational model, Cognitive Science, vol.26, no.1.
[7] C. de Marcken, The unsupervised acquisition of a lexicon from continuous speech, tech. rep., Massachusetts Institute of Technology, Cambridge, MA, USA.
[8] S. Deligne and F. Bimbot, Inference of variable-length linguistic and acoustic units by multigrams, Speech Commun., vol.23, no.3.
[9] A. Gorin, D. Petrovska-Delacretaz, G. Riccardi, and J. Wright, Learning spoken language without transcriptions, Proc. IEEE Automatic Speech Recognition and Understanding Workshop.
[10] J. Driesen and H.V. Hamme, Improving the multigram algorithm by using lattices as input, Proc. 9th Annual Conference of the International Speech Communication Association (InterSpeech).
[11] L. ten Bosch and B. Cranen, A computational model for unsupervised word discovery, Proc. 8th Annual Conference of the International Speech Communication Association (InterSpeech).
[12] A. Park and J. Glass, Unsupervised pattern discovery in speech, IEEE Trans. Audio Speech Language Process., vol.16, no.1.
[13] A. Jansen, K. Church, and H. Hermansky, Towards spoken term discovery at scale with zero resources, Proc. 11th Annual Conference of the International Speech Communication Association (InterSpeech).
[14] N. Iwahashi, Language acquisition through a human-robot interface by combining speech, visual, and behavioral information, Inf. Sci., vol.156, no.1-2.
[15] C. Yu and D.H. Ballard, A multimodal learning interface for grounding spoken language in sensory perceptions, ACM Trans. Applied Perception, vol.1, pp.57-80, July.
[16] D. Mochihashi, T. Yamada, and N. Ueda, Bayesian unsupervised word segmentation with nested Pitman-Yor modeling, Proc. 47th Annual Meeting of the Association for Computational Linguistics.
[17] Y.W. Teh, A Bayesian interpretation of interpolated Kneser-Ney, tech. rep., School of Computing, National Univ. of Singapore.
[18] J.T. Goodman, A bit of progress in language modeling, Comput. Speech Lang., vol.15, no.4.
[19] S.F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, Proc. 34th Annual Meeting of the Association for Computational Linguistics.
[20] D.J.C. Mackay and L.C.B. Petoy, A hierarchical Dirichlet language model, Natural Language Engineering, vol.1, pp.1-19.
[21] J. Pitman and M. Yor, The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator, The Annals of Probability, vol.25, no.2.

[22] M. Mohri, F. Pereira, and M. Riley, Speech recognition with weighted finite-state transducers, in Handbook on Speech Processing and Speech Communication, Part E: Speech Recognition.
[23] F. Bimbot, R. Pieraccini, E. Levin, and B. Atal, Variable-length sequence modeling: Multigrams, IEEE Signal Process. Lett., vol.2, no.6.
[24] M.R. Brent, An efficient, probabilistically sound algorithm for segmentation and word discovery, Mach. Learn., vol.34.
[25] S. Goldwater, T.L. Griffiths, and M. Johnson, A Bayesian framework for word segmentation: Exploring the effects of context, Cognition, vol.112, no.1, pp.21-54.
[26] H. Poon, C. Cherry, and K. Toutanova, Unsupervised morphological segmentation with log-linear models, Proc. North American Chapter of the Association for Computational Linguistics - Human Language Technology (NAACL HLT).
[27] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., vol.6, no.6.
[28] D.J. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press.
[29] C.S. Jensen, U. Kjærulff, and A. Kong, Blocking Gibbs sampling in very large probabilistic expert systems, Int. J. Human Comput. Studies, vol.42, no.6.
[30] J. Fiscus, A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), Proc. IEEE Automatic Speech Recognition and Understanding Workshop.
[31] L. Lamel, J. Gauvain, and G. Adda, Lightly supervised and unsupervised acoustic model training, Comput. Speech Lang., vol.16.
[32] J.R. Glass, Finding acoustic regularities in speech: Application to phonetic recognition, Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, USA.
[33] G. Neubig and S. Mori, Word-based partial annotation for efficient corpus construction, Proc. 7th International Conference on Language Resources and Evaluation.
[34] T. Schultz and A. Waibel, Language-independent and language-adaptive acoustic modeling for speech recognition, Speech Commun., vol.35, no.1, pp.31-52.
[35] L. Pearl, S. Goldwater, and M. Steyvers, How ideal are we? Incorporating human limitations into Bayesian models of word segmentation, Proc. 34th Annual Boston University Conference on Child Language Development, Somerville, MA.
[36] F.R. McInnes and S. Goldwater, Unsupervised extraction of recurring words from infant-directed speech, Proc. 33rd Annual Conference of the Cognitive Science Society.
[37] O. Räsänen, A computational model of word segmentation from continuous speech using transitional probabilities of atomic acoustic events, Cognition, vol.120, no.2.
[38] P.D. Eimas, E.R. Siqueland, P. Jusczyk, and J. Vigorito, Speech perception in infants, Science, vol.171, no.3968, p.303.
[39] J. Van Gael, Y. Saatci, Y. Teh, and Z. Ghahramani, Beam sampling for the infinite hidden Markov model, Proc. 25th International Conference on Machine Learning.
[40] A. Asuncion, P. Smyth, and M. Welling, Asynchronous distributed learning of topic models, Proc. 22nd Annual Conference on Neural Information Processing Systems, vol.21.

Graham Neubig received his B.E. from the University of Illinois, Urbana-Champaign, U.S.A., in 2005, and his M.E. in informatics from Kyoto University, Kyoto, Japan in 2010, where he is currently pursuing his Ph.D.
He is a recipient of the JSPS Research Fellowship for Young Scientists (DC1). His research interests include speech and natural language processing, with a focus on unsupervised learning for applications such as automatic speech recognition and machine translation. Masato Mimura received the B.E. and M.E. degrees from Kyoto University, Kyoto, Japan, in 1996 and 2000, respectively. Currently, he is a researcher in the Academic Center for Computing and Media Studies, Kyoto University. His research interests include spontaneous speech recognition and spoken language processing. Shinsuke Mori received B.S., M.S., and Ph.D. degrees in electrical engineering from Kyoto University, Kyoto, Japan in 1993, 1995, and 1998, respectively. After joining Tokyo Research Laboratory of International Business Machines (IBM) in 1998, he studied the language model and its application to speech recognition and language processing. He is currently an associate professor of Academic Center for Computing and Media Studies, Kyoto University. Tatsuya Kawahara received B.E. in 1987, M.E. in 1989, and Ph.D. in 1995, all in information science, from Kyoto University, Kyoto, Japan. In 1990, he became a Research Associate in the Department of Information Science, Kyoto University. From 1995 to 1996, he was a Visiting Researcher at Bell Laboratories, Murray Hill, NJ, USA. Currently, he is a Professor in the Academic Center for Computing and Media Studies and an Affiliated Professor in the School of Informatics, Kyoto University. He has also been an Invited Researcher at ATR and NICT. He has published more than 200 technical papers on speech recognition, spoken language processing, and spoken dialogue systems. He has been managing several speech-related projects in Japan including a free large vocabulary continuous speech recognition software project ( Dr. Kawahara received the 1997 Awaya Memorial Award from the Acoustical Society of Japan and the 2000 Sakai Memorial Award from the Information Processing Society of Japan. From 2003 to 2006, he was a member of IEEE SPS Speech Technical Committee. From 2011, he is a secretary of IEEE SPS Japan Chapter. He was a general chair of IEEE Automatic Speech Recognition & Understanding workshop (ASRU 2007). He also served as a tutorial chair of INTERSPEECH He is a senior member of IEEE.
