Lip Reader: Video-Based Speech Transcriber

Bora Erden, Max Wolff, Sam Wood

1. Introduction

We set out to build a lip-reader: a system that takes audio-free videos of people speaking and reconstructs their spoken sentences. We were drawn to this problem because of its complexity and its variety of applications. Most compelling is assisting people with hearing impairments: if the algorithm could be improved to run in real time, it could be a boon to the deaf, given that even the most accurate human lip-readers are far from perfect. The algorithm could also be used to add subtitles to videos with poor or nonexistent audio.

We implemented a phoneme classifier, which takes a short clip of a subject speaking one phoneme (phonemes are the fundamental units of sound in a spoken language) and produces an array of estimated probabilities, one for each phoneme the clip might depict. We also built an NLP model on top of the classifier, so that it could move from audio-free video to a full reconstruction of a spoken sentence.

2. Related Work

There is much existing work on lipreading, but until recently little of it employed deep learning. Much of the initial research used Hidden Markov Models (HMMs) to classify phonemes. Bear et al., for one, use HMMs to build a viseme classifier (visemes are the visual counterparts of phonemes), then feed the visemes to another HMM to classify phonemes [1]. From there, they build words and sentences using a bigram language model. Convolutional Neural Networks (CNNs) were at first used only for image pre-processing, but researchers now use deep learning techniques for the entire pipeline [2]. The most ambitious papers go straight from mouth movements to sentences, building massive, deep neural networks.

The advantages of an end-to-end network that skips phonemes and visemes are clear. Phoneme classification is difficult because some phonemes look nearly identical despite producing meaningfully different sounds, so word and sentence predictions built on lossy phoneme predictions inherit their errors. Bear et al. ultimately reported 18% accuracy for classifying phonemes, much too low for accurate sentence translation. On top of that, speakers can pronounce words and sentences so differently that they use different phonemes and visemes, which makes translating phonemes into words very difficult. Language models built on top of phoneme classifiers therefore lose accuracy because they do not consider the lower-level mouth features that an end-to-end model would.

Lipreading presents a complex challenge because it calls for both temporal and visual data, so architectures that combine these two are the most successful. Several authors have combined Long Short-Term Memory (LSTM) networks with CNNs to better predict time-series data [3], [4]. Researchers from Google DeepMind predicted full sentences with impressive accuracy by adding a sequence model, in the form of a recurrent neural network (RNN), to their CNN [5]. They made use of a connectionist temporal classification (CTC) loss, which allowed the RNN to segment the unlabeled data.

Because we were limited to a small, public dataset, we could not implement the full end-to-end model: without enough data, we would not have been able to learn language fluency. Instead, we implemented a phoneme classifier modeled on the CNN architecture used in Assael et al. [5]. To reconstruct sentences, we used the bigram language model employed by the HMM phoneme-classifier papers.

3. Dataset

We used the VidTIMIT audio-video dataset, published by research scientist Conrad Sanderson of the University of Queensland [6]. It contains high-contrast, close-up videos of 43 people, each of whom speaks 10 short, phonetically balanced sentences. In addition to the videos, the dataset contains timestamped transcriptions of each sentence into English words as well as into phonemes. The phonetic dictionary used to create these transcriptions contains 59 distinct phonemes.

To prepare the data for our neural network phoneme classifier, we used the timestamped phonetic transcriptions to slice the video of each sentence into individual video clips corresponding to one phoneme each. On average, each of the 430 sentences contained around 70 phonemes, giving us a dataset of 32,000 video clips of individual phonemes.

Our neural network was extremely computationally intensive to train. To reduce the dimensionality of the data and reduce computing time, we grayscaled the videos and used Haar cascades in OpenCV to crop around the face. Because the images were already low-resolution (200x250 pixels), we decided against compressing them further. To satisfy the constraints of the 3-D convolution function implemented in TensorFlow, we standardized the number of frames in each phoneme clip, interpolating frames into clips of shorter phonemes and removing frames from clips of longer phonemes. The clips originally contained between 2 and 10 frames each; after testing our network with clips standardized to different numbers of frames, we settled on 3 frames each as a satisfactory tradeoff between accuracy and efficiency.

We aimed to use 29,000 phoneme clips for our training set and 3,000 for our test set. However, due to the size of our data (over 1.5GB of images even after preprocessing), the computational expense of training our spatio-temporal CNN (described in Section 4.3), and the limited resources we had access to (we used Amazon Web Services, but could not afford the top tiers of computational power), we were never able to train our classifier on a set of more than 15,000 phoneme clips. Rather than holding out part of the training set, we used 3,000 distinct phoneme clips for validation.

Figure 1: Example frames from phoneme clips for w (left) and m (right).
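To make the preprocessing above concrete, the sketch below grayscales a frame, crops it to the face with an OpenCV Haar cascade, and resamples a clip to a fixed three frames. It is a minimal sketch, not our actual pipeline: the function names are ours, resizing crops to a common shape is omitted, and nearest-frame duplication stands in for the interpolation described above.

    import cv2
    import numpy as np

    # Stock frontal-face detector that ships with OpenCV.
    FACE_CASCADE = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def grayscale_and_crop(frame):
        """Grayscale a BGR frame and crop it to the first detected face."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
        if len(faces) == 0:
            return gray  # no detection: fall back to the full frame
        x, y, w, h = faces[0]
        return gray[y:y + h, x:x + w]

    def standardize_clip(frames, target=3):
        """Resample a 2-10 frame phoneme clip to exactly `target` frames
        by taking evenly spaced indices (short clips repeat frames,
        long clips drop them)."""
        idx = np.linspace(0, len(frames) - 1, num=target).round().astype(int)
        return [grayscale_and_crop(frames[i]) for i in idx]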

4. Methods

4.1 Baseline

For our baseline, we implemented a simple nearest-neighbor algorithm. For each of the 59 phonemes, we created a standard facial expression by averaging the frames from videos of multiple subjects speaking that phoneme. Given an input clip of a speaker saying one phoneme, we classified the clip by finding its nearest neighbor among the standard phoneme images. This algorithm successfully classifies phonemes at a rate of 2.2%, slightly higher than chance (1.7%).

4.2 Oracle

Phonemes correspond uniquely to fundamental units of sound, but multiple phonemes can be produced by very similar movements of the mouth. As a result, classifying phonemes from audio data is far easier than classifying them from video data. With access to audio data, support vector machines are able to classify phonemes at a rate of 71.4% [7].

4.3 Our model

Because lip-reading involves context-dependent, fine-grained recognition of specific mouth movements, we decided to classify the phoneme clips using a spatio-temporal convolutional neural network (STCNN). STCNNs are able to process both the relationships between neighboring pixels within an individual frame of a video and the changes in pixels from one frame to the next, allowing them to distinguish phonemes based on the shape of the mouth and its movement over time.

We implemented a seven-layer STCNN. Each of the first four layers performs a three-dimensional convolution followed by a rectified linear activation; in the second and fourth layers, the activation is followed by a spatial max pooling. The fifth and sixth layers are fully connected, and the seventh is a softmax regression.

Figure 2: STCNN architecture: 3 frames -> convolution (x4) -> dense (x2) -> softmax (adapted from Assael et al.).

A convolution creates an affine transformation by passing a filter over its input tensor, relating a pixel (i, j) to all the pixels captured by the filter centered at (i, j). After experimenting with different numbers and sizes, we had each convolution use four 3x5x5 (frames by pixels of width by pixels of height) filters, generating four separate affine transformations of each input tensor. The filter parameters were initialized from a uniform distribution.

The rectified linear activation function replaces each value x of its input tensor with max(0, x).

Max pooling reduces the size of an input tensor by replacing each block of the tensor with a single entry holding the maximum value in the block. After experimenting with different sizes, we used a pooling block of size 1x2x2, so that each of our max pooling layers halved the input tensor along the width and height dimensions but did not affect the dimension corresponding to the frames.

The output of a fully connected layer is computed by a matrix multiplication on the input tensor (after the tensor has been flattened to two dimensions, the first of which has length equal to the size of the training set). The values of the matrices for the fully connected layers were drawn from a uniform distribution.

The final layer of our neural network uses softmax regression to classify the examples. Due to the sizes of the matrices used in the fully connected layers, the input to the softmax layer is a matrix of shape (training set size) x (number of classification categories), in our case 59. For each row of this matrix, the softmax layer assigns the value at the j-th position of the row to be

\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},

where z_i is the value at the i-th position of the row in the input matrix and K = 59. The final output of the STCNN is thus a matrix of shape (training set size) x 59, where the values of each row sum to 1 and the value at position (i, j) is a number between 0 and 1 corresponding to the predicted likelihood that the i-th example is a video clip of the j-th phoneme.

We trained our classifier using stochastic gradient descent optimized against a categorical cross-entropy objective function.
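Before turning to the loss in detail, here is one way the architecture above could be expressed in Keras. This is a sketch under stated assumptions rather than our implementation (which used TensorFlow's 3-D convolution ops directly): the input height and width, the 256-unit dense width, same-padding, and the 0.9 momentum are illustrative, while the filter shapes, pooling, dropout probability, learning rate, and decay follow the text (dropout and the hyperparameter values are described in Section 5).

    from tensorflow import keras
    from tensorflow.keras import layers

    NUM_PHONEMES = 59

    def build_stcnn(frames=3, height=125, width=100):  # height/width illustrative
        model = keras.Sequential([
            layers.Input(shape=(frames, height, width, 1)),
            # Four 3-D convolutions with four 3x5x5 filters each; spatial
            # max pooling follows the second and fourth.
            layers.Conv3D(4, (3, 5, 5), padding="same", activation="relu"),
            layers.Conv3D(4, (3, 5, 5), padding="same", activation="relu"),
            layers.MaxPooling3D((1, 2, 2)),  # halve width/height, keep frames
            layers.Conv3D(4, (3, 5, 5), padding="same", activation="relu"),
            layers.Conv3D(4, (3, 5, 5), padding="same", activation="relu"),
            layers.MaxPooling3D((1, 2, 2)),
            layers.Flatten(),
            # Two fully connected layers (the text describes them as plain
            # matrix multiplications), each followed by dropout with p = 0.25.
            layers.Dense(256),
            layers.Dropout(0.25),
            layers.Dense(256),
            layers.Dropout(0.25),
            layers.Dense(NUM_PHONEMES, activation="softmax"),
        ])
        # lr = 0.015 and decay k = 0.005 follow Section 5.3; the 1/(1 + k*t)
        # factor is approximated by an inverse-time learning-rate schedule.
        schedule = keras.optimizers.schedules.InverseTimeDecay(
            initial_learning_rate=0.015, decay_steps=1, decay_rate=0.005)
        model.compile(
            optimizer=keras.optimizers.SGD(learning_rate=schedule,
                                           momentum=0.9, nesterov=True),
            loss="categorical_crossentropy", metrics=["accuracy"])
        return model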
Cross-entropy loss is computed by summing the cross-entropy losses for each example in the training set, where the loss for a given example is

\ell = -\sum_{i=1}^{59} y_i \log \hat{y}_i,

where \hat{y}_i is the predicted probability that the example depicts phoneme i and y_i is 1 if the example actually is phoneme i and 0 otherwise. The partial derivatives of the cross-entropy loss with respect to each of the parameters in our STCNN (when trained on 10,000 phonemes, the network has parameters) were computed using the backpropagation algorithm.
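In code, with hypothetical arrays y_true (one-hot labels) and y_pred (the softmax output), the loss above amounts to a one-liner:

    import numpy as np

    def cross_entropy_loss(y_true, y_pred, eps=1e-12):
        """Summed categorical cross-entropy: y_true is one-hot of shape
        (batch, 59); y_pred holds softmax probabilities of the same shape.
        eps guards against log(0)."""
        return -np.sum(y_true * np.log(y_pred + eps))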

Gradient descent was applied using a decaying Nesterov momentum update rule, which has been shown to have better convergence rates on deep neural networks than standard SGD [8]. If our parameters are stored in a vector x, and dnext is a vector holding the gradients calculated at next, then we update using:

    next = x + mu*v
    v = mu*v - lr*dnext
    x += v / (1 + k*t)

where v is initialized at 0. v can be thought of as the velocity of x: it grows in directions along which the gradient is consistent. mu (momentum) and lr (learning rate) are hyperparameters between 0 and 1; higher values of lr correspond to more dramatic gradient updates, and higher values of mu correspond to higher inertia, making it harder to change the velocity. k is the decay factor and t is the iteration number, so that 1/(1 + k*t) approaches 0 as t increases, at a rate determined by k.

5. Results

5.1 Training accuracy

First, we looked at the training error of our model at every step of gradient descent (where at each step the gradient is computed on one batch). Consider the following learning curves for our classifier when trained on a set of 12,000 phoneme clips:

Figure 3: Batch size affects the learning curves.

Our training accuracy quickly departs from chance (1.69% for 59 phonemes), but it never reached higher than 10% on training datasets containing multiple speakers. The big fluctuations in the training error are due to the batch size we were using. Because of the size of our dataset and the computational expense of training STCNNs, we were forced to compute gradients on small batches of the data, one at a time, rather than on the entire dataset. Even using Amazon Web Services with GPU instances, we were only able to compute gradients on batches of up to 300 phoneme clips, out of a training set of 12,000. This resulted in a very choppy minimization of the loss function and thus big fluctuations in the accuracy. There is a trend, however: the accuracy varies less and converges around a higher value as the batch size increases. This leads us to believe that, had we been able to use larger batches, we could have achieved higher accuracy.

5.2 Training set size

Next, we examined the effect of the size of our training set on our test-set accuracy:

Figure 4: Training set size affects test accuracy.

Naturally, our test accuracy increased with training set size. Had we been able to use our full training set of 29,000 phonemes, we might have seen our accuracy go even higher. Since our training error was so high, and our test error not much higher, we believe that we did not reach the tipping point of overfitting. While we primarily concentrated our efforts on decreasing bias rather than variance, we did add two dropout layers (which randomly ignore each node in the hidden layer with probability 0.25 when computing gradient updates on a given training batch), which marginally decreased the divergence between our training error and our test error.

5.3 Learning rate hyperparameters

Our goal was to implement k-fold cross-validation, but the computational limitations of the AWS machines only allowed us to use a single hold-out set of 3,000 phonemes. This, together with the fact that decaying Nesterov momentum updates involve three hyperparameters (decay factor, momentum, and learning rate), meant that we could only explore a small portion of the parameter space. We explored learning rates up to 0.03, momentums between 0.5 and 0.95, and decay factors up to 0.1. The learning rate and momentum ranges were based on the standard values provided in the Keras implementation (0.01 and 0.9, respectively). Because the computational expense of training meant that we never trained our classifier for more than a thousand epochs, we settled on a range for the decay factor that would drive the 1/(1 + k*t) multiplier below 0.1 within a few hundred epochs.

As expected, we found that higher learning rates made our classifier take far longer to converge, but lower learning rates occasionally made it converge to significantly lower accuracies. The momentum had the largest impact on the fluctuations of accuracy during training: higher momentums led to more stable learning curves but occasionally led our algorithm to converge to a subpar local optimum. Finally, decay rates at the high end of our explored range led our classifier to converge within a few dozen epochs, while decay rates at the low end meant that it never converged within a thousand epochs. We settled on a learning rate of 0.015 and a decay factor of 0.005.

5.4 Error Analysis

Consider the following confusion matrix, which shows misclassifications within and between two groups of similar phonemes (darker colors indicate higher confusion rates):

Figure 5: Example confusion matrix.

Our classifier frequently makes mistakes within groups of similar phonemes, often classifying one phoneme as another from the same group more often than as itself. This is unsurprising, since the mouth movements that produce similar phonemes can be identical. For example, the woman shown below looks much the same when pronouncing iy (top row) or ih (bottom row):

Figure 6: ih and iy.

The misclassification rates between dissimilar groups of phonemes are also quite high. This shouldn't be too surprising either, since a big part of speech production happens in unseen parts of the human vocal system. Moreover, when spoken by different people, or even by the same person at different times, the same phoneme might look quite different, and highly dissimilar phonemes may look quite alike. Consider the same woman pronouncing ih (top row) compared to two different women pronouncing iy (middle row) and el (bottom row): although we would expect ih and iy to look more similar, as spoken by these women, ih and el are far closer in appearance.

Figure 7: ih, iy, and el.

When the input data has such low variance, the classifier has little to grasp onto to separate different classes. This is the crux of the difficulty of lipreading.
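A matrix like the one in Figure 5 is straightforward to compute from the classifier's output; the sketch below assumes hypothetical arrays y_true (integer phoneme labels) and probs (the STCNN's softmax output):

    import numpy as np

    def confusion_matrix(y_true, probs, num_classes=59):
        """Entry (i, j) counts clips of true phoneme i that the
        classifier labeled as phoneme j."""
        y_pred = probs.argmax(axis=1)  # most probable phoneme per clip
        cm = np.zeros((num_classes, num_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        return cm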

5.5 One speaker, top-k guesses

We attempted to quantify how much having multiple speakers dilutes the information in an already limited set of phoneme examples. We asked whether fitting our model to one speaker at a time would yield higher accuracy, despite the fact that the set for any individual speaker contained only around 700 phonemes. Knowing that multiple phonemes look the same when spoken, we also looked at whether the correct phoneme was among the top few predicted phonemes.

Figure 8: One-speaker accuracy.

As expected, our accuracy is significantly greater for top-3 and top-5 predictions than for top-1. This result mirrors the findings from the confusion matrix: some phonemes look identical and are inherently hard to distinguish. Moreover, the accuracy of our algorithm when trained on only 700 phonemes from one speaker was significantly higher than when trained on 15,000 phonemes from many different speakers, suggesting that idiosyncratic speech patterns were at least as big a problem for our classifier as the limited size of our training set.
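Top-k accuracy here follows the usual definition: a clip counts as correct if the true phoneme appears among the k most probable predictions. With the same hypothetical y_true and probs arrays as above:

    import numpy as np

    def top_k_accuracy(y_true, probs, k=3):
        """Fraction of clips whose true phoneme is among the k phonemes
        the classifier ranked most probable."""
        top_k = np.argsort(probs, axis=1)[:, -k:]  # k best classes per clip
        hits = [t in row for t, row in zip(y_true, top_k)]
        return float(np.mean(hits))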

6. Future Directions

Ultimately, we were frustrated by our inability to raise the accuracy of our phoneme classifier. With access to greater computational resources, we could use larger batches to compute our gradient updates, and we could train our classifier on a larger training set. Computing gradients on larger batches would allow us to approach the global optimum more effectively; as Figure 3 suggests, our small batch sizes significantly hampered our ability to train the classifier. Moreover, with a training set of 15,000 examples, our STCNN only has access, on average, to around 250 examples of each of the 59 phonemes. With a larger training set, the STCNN could become more attuned to the particularities of each phoneme, and Figure 4 suggests that our accuracy would continue to improve.

Even with these changes, it is unlikely that the accuracy of our classifier would rise above 15 or 20%. We could perhaps improve further by increasing the depth of our STCNN, either by adding layers or by increasing the number of filters in the convolutional layers we have already implemented. Given the low separation between images of distinct phonemes, this problem may call for a much more sensitive classifier than we were able to implement within our computational resource limitations.

Finally, no matter how much we beef up our algorithm and our training process, some phonemes will remain nearly impossible to distinguish. While we may be able to significantly raise our top-5 or even top-3 classification accuracy, the similarities between phonemes such as ah and aa, ih and iy, or p and b may create an upper limit on the performance of any image-based phoneme classifier. To further improve phoneme classification, we would need to incorporate the context in which each phoneme occurs. In fact, we have implemented an algorithm that uses an n-gram English language model together with a phonetic dictionary to reconstruct full sentences from a sequence of phoneme predictions. Our phoneme accuracy was too low for this algorithm to reconstruct sentences close to the originals, but it is possible that, starting from higher phoneme classification accuracy, it would improve our accuracy even further by incorporating context. For example, while our phoneme classifier alone may not be able to distinguish between p and b, our sentence reconstruction algorithm might identify the correct phoneme by choosing the one that makes the most sense within a particular word and sentence.

We chose to implement our phoneme classifier and our sentence reconstruction algorithm separately because we believed that our training set was not large enough for any algorithm to learn the rules of English-language fluency from it. With a larger training set, however, we could implement an end-to-end trainable lip-reader, which would eliminate the lossy step of going from phoneme predictions to sentence reconstructions. We could, for example, follow Assael et al. in incorporating our STCNN phoneme classifier into an LSTM recurrent neural network optimized against the CTC loss function [5].

7. References

[1] H. L. Bear and R. Harvey. Decoding Visemes: Improving Machine Lip-Reading. ICASSP.

[2] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Lipreading using Convolutional Neural Network. Applied Intelligence.

[3] J. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lipreading in the Wild. University of Oxford and Google DeepMind, as yet unpublished.

[4] M. Wand, J. Koutnik, and J. Schmidhuber. Lipreading with Long Short-Term Memory. ICASSP.

[5] Y. Assael, B. Shillingford, S. Whiteson, and N. de Freitas. LipNet: End-to-End Sentence-Level Lipreading. University of Oxford and Google DeepMind, as yet unpublished.

[6] C. Sanderson and B. C. Lovell. Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference. Lecture Notes in Computer Science (LNCS), Vol. 5558. The VidTIMIT dataset is publicly available.

[7] J. Salomon. Support Vector Machines for Phoneme Classification. University of Edinburgh, Division of Informatics.

[8] I. Sutskever et al. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning.
