Your name:
Your student number:

Final exam for CSC 321. April 11, 2013, 7:00pm-9:00pm. No aids are allowed.

This exam has two sections, each of which is worth a total of 10 points. Answer all 10 questions in Section A, and 5 of the 10 questions in Section B.

Section A. Answer all 10 of these questions. Each is worth one mark. These are short questions. When we say "briefly explain", don't start writing a whole page of text. The main idea, in one or two sentences, is enough.

A1. 1 mark.
a) (0.5 marks) Is a Sigmoid Belief Network for supervised learning or for unsupervised learning? Explain briefly.
b) (0.5 marks) Is a mixture of experts for supervised learning or for unsupervised learning? Explain briefly.

A2. 1 mark.
a) (0.5 marks) Is a mixture of Gaussians for supervised learning or for unsupervised learning? Explain briefly.
b) (0.5 marks) Is an autoencoder for supervised learning or for unsupervised learning? Explain briefly.

A3. 1 mark. Briefly explain the trigram method of language modeling.

A4. 1 mark. What is the procedure of 5-fold cross-validation, and what is its advantage over the traditional approach of simply splitting one's available data into a training set and a validation set?

A5. 1 mark. We've seen that averaging the outputs from multiple models typically gives better results than using just one model. Let's say that we're going to average the outputs from 10 models. Of course, we want 10 good models, i.e. models that also perform well individually. What additional property of a collection of 10 models makes that collection a good candidate for output averaging?

A6. 1 mark. What does it mean that a Markov Chain has been run for so long that it has reached thermal equilibrium?

A7. 1 mark. When we're changing the states of the units in a Hopfield network, in search of a low energy configuration, we change the states of the units one at a time. What could go wrong if instead we change the state of multiple units at the same time? Illustrate by drawing a concrete Hopfield network (clearly indicate what the connection weights are) and explaining what goes wrong in your network if we change the state of multiple units at the same time.

A8. 1 mark. How is training a 1-hidden-layer autoencoder very similar to training a Restricted Boltzmann Machine with CD-1 (Contrastive Divergence 1)? Don't answer with mathematical formulas; just answer in simple English.

A9. 1 mark. In assignment 1, we trained a language model that produced a feature vector for each word in its dictionary. Afterwards, we took all those feature vectors and mapped them to a two-dimensional space using t-SNE, so that they could be displayed on paper for us to see patterns of similarity in the learned feature vectors. We don't need t-SNE for that: an autoencoder can do it, too. Explain how an autoencoder can be used for the task that t-SNE performed for us in assignment 1.

A10. 1 mark. For the Hopfield network below, write down the energy of all 8 configurations in the following table:

State of A   State of B   State of C   Energy
    0            0            0
    0            0            1
    0            1            0
    0            1            1
    1            0            0
    1            0            1
    1            1            0
    1            1            1

[Diagram: a Hopfield network with three units A, B, and C, with connection weights +1, +2, and -3.]

Section B. Answer 5 of these 10 questions. Each of these is worth 2 marks. If you answer more than 5 of these 10 questions, your worst 5 answers will be used, so just don't do that. If you wrote something for a question and you later decide not to answer that question after all, cross out what you wrote and clearly write "don't mark this". Again, explaining the main idea briefly is better than writing something lengthy.

B1. 2 marks. When you're training a neural network with some form of stochastic gradient descent, there's always the learning rate to choose.
a) (0.5 marks) What problem will result from using a learning rate that's too large, and how can one detect that problem?
b) (0.5 marks) What's the problem if the learning rate is too small, and how can one detect that problem?
c) (1 mark) We can charge the computer with adapting the learning rate during training, if we don't want to do it ourselves. If we do that, every parameter that's being learned can have its own learning rate. This approach is called adaptive learning rates. Explain a simple but reasonable strategy for automatically adapting those learning rates during training. You don't have to explain in full pseudo-code; just explain the main idea.
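For reference, here is a minimal sketch, assuming NumPy, of one strategy of the kind B1(c) describes: each parameter keeps its own gain that grows when successive gradients agree in sign and shrinks when they disagree. The function name, constants, and update rule are illustrative assumptions, one possible strategy rather than the intended answer.

import numpy as np

def sgd_with_adaptive_gains(w, grad_fn, base_lr=0.01, n_steps=100):
    """Sketch: stochastic gradient descent with per-parameter adaptive gains."""
    gains = np.ones_like(w)          # one learning-rate multiplier per parameter
    prev_grad = np.zeros_like(w)
    for _ in range(n_steps):
        grad = grad_fn(w)
        agree = np.sign(grad) == np.sign(prev_grad)
        # Grow a gain additively when the gradient sign is consistent,
        # shrink it multiplicatively when the sign flips (oscillation).
        gains = np.where(agree, gains + 0.05, gains * 0.95)
        w = w - base_lr * gains * grad
        prev_grad = grad
    return w

# Example use: minimize f(w) = sum(w**2), whose gradient is 2*w.
w_final = sgd_with_adaptive_gains(np.array([3.0, -2.0]), lambda w: 2 * w)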

B2. 2 marks.
a) (1 mark) If we have a recurrent neural network (RNN), we can view it as a different type of network by "unrolling it through time". Briefly explain what that entails.
b) (1 mark) Briefly explain how unrolling through time is related to weight sharing in convolutional networks.
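As a point of reference for B2, the following is a minimal sketch (with made-up layer sizes, assuming NumPy) of an RNN forward pass written as an explicit loop over time steps; the same weight matrices are applied at every step, which is the weight sharing that unrolling makes visible.

import numpy as np

# Hypothetical sizes: 3-dimensional inputs, 4-dimensional hidden state.
W_in = np.random.randn(4, 3)
W_rec = np.random.randn(4, 4)

def unrolled_forward(inputs, h0):
    """Each loop iteration acts like one layer of a deep feedforward net,
    and every such 'layer' reuses the same W_in and W_rec."""
    h = h0
    for x in inputs:
        h = np.tanh(W_in @ x + W_rec @ h)
    return h

# Example use: a sequence of 5 random input vectors, zero initial state.
inputs = [np.random.randn(3) for _ in range(5)]
h_final = unrolled_forward(inputs, np.zeros(4))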

B3. 2 marks.
a) (1 mark) For training a Boltzmann Machine, what is the objective function? Write it down as a mathematical formula, and explain the meaning of the symbols that you use.
b) (1 mark) And what is the gradient of that objective function, for the weight on the connection between unit i and unit j? Write it down as a mathematical formula, and explain the meaning of the symbols that you use.

B4. 2 marks. A Hopfield network can be made stochastic by introducing a temperature and stochastic state changes for the units. This is most commonly done with temperature T=1, but other temperatures can also be used.
a) (1 mark) Write the mathematical formula for the probability of turning on unit i. Your formula will involve the temperature, T. Clearly define any nontrivial symbols that you use.
P(s_i = 1) =
b) (0.5 marks) What is the effect of using T=infinity?
c) (0.5 marks) What is the effect of using T=0?

B5. 2 marks. In a deep neural network, or a recurrent neural network, we can get vanishing or exploding gradients because the backward pass of backpropagation is linear, even for a network where all hidden units are logistic.
a) (1 mark) Explain in what sense the backward pass is linear.
b) (1 mark) Why does an Echo State Network not suffer from this problem?

B6. 2 marks. If the hidden units of a network are independent of each other, then it's easy to get a sample from the correct distribution, which is a very important advantage.
a) (0.5 marks) For a Sigmoid Belief Network where the only connections are from hidden units to visible units (i.e. no hidden-to-hidden or visible-to-visible connections), when we condition on the state of the visible units, are the hidden units conditionally independent of each other? Explain very briefly.
b) (0.5 marks) For a Sigmoid Belief Network where the only connections are from hidden units to visible units (i.e. no hidden-to-hidden or visible-to-visible connections), when we don't condition on anything, are the hidden units independent of each other? Explain very briefly.
c) (0.5 marks) For a Restricted Boltzmann Machine, when we don't condition on anything, are the hidden units independent of each other? Explain very briefly.
d) (0.5 marks) For a Restricted Boltzmann Machine, when we condition on the state of the visible units, are the hidden units conditionally independent of each other? Explain very briefly.

B7. 2 marks. In Bayesian learning, we consider not just one, but many different weight vectors. Each of those is assigned a probability by which it is weighted in producing the final output.
a) (1 mark) Write down Bayes' rule as it applies to supervised neural network learning. Clearly define the symbols that you are using.
b) (0.5 marks) Clearly indicate which part of the formula is the "prior distribution", which is the "likelihood term", and which is the "posterior distribution".
c) (0.5 marks) In this context, how is Maximum A Posteriori (MAP) learning different from Maximum Likelihood (ML) learning?

B8. 2 marks. We've seen a variety of generative models. Some were causal generative models, and others were energy-based generative models.
a) (1 mark) Explain the difference between those two types of generative models.
b) (0.5 marks) Give an example of a causal generative model that we studied in class.
c) (0.5 marks) Give an example of an energy-based generative model that we studied in class.

B9. 2 marks. Here you see a very small neural network: it has one input unit, one hidden unit (logistic), and one output unit (linear). Let's consider one training case. For that training case, the input value is 1 (as shown in the diagram), and the target output value is 1. We're using the standard squared error loss function: E = (t - y)^2 / 2. The numbers in this question have been constructed in such a way that you don't need a calculator.

[Diagram: the input unit (value 1) connects to the logistic hidden unit with weight w1 = -2; the hidden unit has bias +2 and connects to the linear output unit with weight w2 = +4; the output unit has bias 0.]

a) (0.5 marks) What is the output of the hidden unit and the output unit, for this training case?
b) (0.5 marks) What is the loss, for this training case?
c) (0.5 marks) What is the derivative of the loss w.r.t. w2, for this training case?
d) (0.5 marks) What is the derivative of the loss w.r.t. w1, for this training case?
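As an aside, here is a minimal NumPy sketch of the forward and backward pass for a 1-1-1 network wired as described above; the weights and biases are taken from the question, and the variable names are illustrative only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = 1.0, 1.0              # input value and target from the question
w1, b_hidden = -2.0, 2.0     # input -> logistic hidden unit
w2, b_out = 4.0, 0.0         # hidden -> linear output unit

# Forward pass.
h = sigmoid(w1 * x + b_hidden)   # hidden unit output
y = w2 * h + b_out               # linear output unit
E = 0.5 * (t - y) ** 2           # squared error loss

# Backward pass (chain rule).
dE_dy = y - t
dE_dw2 = dE_dy * h
dE_dh = dE_dy * w2
dE_dw1 = dE_dh * h * (1.0 - h) * x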

B10. 2 marks. (The text of this question is long, but it's not complicated.)

Suppose that we have a vocabulary of 3 words, "a", "b", and "c", and we want to predict the next word in a sentence given the previous two words. For this network, we don't want to use feature vectors for words: we simply use the local encoding, i.e. a 3-component vector with one entry being 1 and the other two entries being 0.

In the language models that we have seen so far, each of the context words has its own dedicated section of the network, so we would encode this problem with two 3-dimensional inputs. That makes for a total of 6 dimensions. For example, if the two preceding words (the "context" words) are "c" and "b", then the input would be (0, 0, 1, 0, 1, 0). Clearly, the more context words we want to include, the more input units our network must have. More inputs means more parameters, and thus increases the risk of overfitting.

Here is a proposal to reduce the number of parameters in the model: Consider a single neuron that is connected to this input, and call the weights that connect the input to this neuron w1, w2, w3, w4, w5, and w6. w1 connects the neuron to the first input unit, w2 connects it to the second input unit, etc. Notice how for every neuron, we need as many weights as there are input dimensions (6 in our case), which will be the number of words times the length of the context. A way to reduce the number of parameters is to tie certain weights together, so that they share a parameter. One possibility is to tie the weights coming from input units that correspond to the same word but at different context positions. In our example that would mean that w1=w4, w2=w5, and w3=w6 (see the "after" diagram).

Explain the main weakness that that change creates.
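For concreteness, here is a minimal NumPy sketch of the local encoding and of the weight tying described in B10; the weight values are made up purely for illustration.

import numpy as np

vocab = ["a", "b", "c"]

def encode_context(word_two_back, word_one_back):
    """Concatenate the local (one-hot) codes of the two context words."""
    x = np.zeros(6)
    x[vocab.index(word_two_back)] = 1.0       # first three components
    x[3 + vocab.index(word_one_back)] = 1.0   # last three components
    return x

x = encode_context("c", "b")   # gives (0, 0, 1, 0, 1, 0), as in the question

# Untied neuron: six free weights, one per input dimension.
w_untied = np.array([0.1, -0.3, 0.7, 0.2, 0.5, -0.4])   # made-up values
z_untied = w_untied @ x

# Tied proposal: w1=w4, w2=w5, w3=w6, so only three free weights remain,
# applied identically to both context positions.
w_tied = np.array([0.1, -0.3, 0.7])                      # made-up values
z_tied = w_tied @ x[:3] + w_tied @ x[3:]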