Lecture 11: Summary. Kai-Wei Chang, University of Virginia


Lecture 11: Summary
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Some slides are adapted from Vivek Srikumar's course on Structured Prediction.
(Advanced ML: Inference)

This lecture
- What is a structure?
- A survey of the terrain we have covered
- ML for inter-dependent variables

Recall: What is structure?
A structure is a concept that can be applied to any complex thing, whether it be a bicycle, a commercial company, or a carbon molecule. By complex, we mean:
1. It is divisible into parts,
2. There are different kinds of parts,
3. The parts are arranged in a specifiable way, and
4. Each part has a specifiable function in the structure of the thing as a whole.
From the book Analysing Sentences: An Introduction to English Syntax by Noel Burton-Roberts, 1986.

An example task: Semantic Parsing
Find the largest state in the US
[Figure: a palette of SQL query templates (SELECT expression FROM table WHERE condition; MAX(numeric list); ORDERBY predicate; DELETE FROM table WHERE condition; SELECT expression FROM table; Expression 1 = Expression 2) next to two table schemas: US_CITIES(name, population, state) and US_STATES(name, population, size, capital)]

A plausible strategy to build the query
[Figure, built up across several slides: the query is assembled step by step from the templates. Start from SELECT expression FROM table WHERE condition; fill in US_STATES as the table and name as the expression; expand the WHERE condition into Expression 1 = Expression 2; fill the right-hand side with MAX(numeric list) over US_STATES, choosing the column size. Or perhaps population?]
At each step there are many, many decisions to make:
- Some decisions are simply not allowed: a query has to be well formed!
- Even so, there are many possible options: Why does "Find" map to SELECT? Largest by size, population, or population of the capital?

Standard classification tools can't predict structures
X: Find the largest state in the US.
Y: SELECT name FROM us_states WHERE size = (SELECT MAX(size) FROM us_states)
Classification is about making one decision:
- Spam or not spam, predict one label, etc.

Standard classification tools can't predict structures
X: Find the largest state in the US.
Y: SELECT name FROM us_states WHERE size = (SELECT MAX(size) FROM us_states)
We need to make multiple decisions:
- Each part needs a label: e.g., should "US" be mapped to us_states or us_cities?
- The decisions interact with each other.
- If the outer FROM clause talks about the table us_states, then the inner FROM clause should not talk about utah_counties.

How did we get here?
- Binary classification: learning algorithms; prediction is easy (apply a threshold); features (???)
- Multiclass classification: different strategies (one-vs-all, all-vs-all, global learning algorithms); one feature vector per outcome; each outcome scored; prediction = highest-scoring outcome
- Structured classification: global models or local models; each outcome scored; prediction = highest-scoring outcome. Inference is no longer easy! That makes all the difference.

Structured output is...
- Representation: a graph, possibly labeled and/or directed, possibly from a restricted family such as chains, trees, etc.; a discrete representation of the input, e.g., a table, the SRL frame output, a sequence of labels
- Procedural: a collection of inter-dependent decisions, e.g., the sequence of decisions used to construct the output
- Formally: the result of a combinatorial optimization problem: argmax_{y ∈ Y} score(x, y)

Challenges with structured output
Two challenges:
- We cannot train a separate weight vector for each possible inference outcome (for multiclass, we could train one weight vector for each label)
- We cannot enumerate all possible structures for inference (inference for binary/multiclass is easy)

Challenges with structured output
Solution:
- Decompose the output into parts that are labeled
- Define:
  - how the parts interact with each other
  - how labels are scored for each part
  - an inference algorithm to assign labels to all the parts

Multiclass as a structured output
- A structure is a graph (in general, a hypergraph), possibly labeled and/or directed. Multiclass: a graph with one node and no edges; the node label is the output.
- A structure is a collection of interdependent decisions. Multiclass: can be composed via multiple decisions.
- A structure is the output of a combinatorial optimization problem: argmax_{y ∈ outputs} score(x, y). Multiclass: winner-take-all, argmax_i wᵀφ(x, i).
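To make the winner-take-all rule concrete, here is a minimal Python sketch, not from the lecture: `predict_multiclass`, the block feature map `phi`, and the weights are made-up illustrations of argmax_i wᵀφ(x, i).

```python
import numpy as np

def predict_multiclass(w, phi, x, labels):
    """Winner-take-all prediction: argmax_i w . phi(x, i)."""
    scores = [w @ phi(x, i) for i in labels]
    return labels[int(np.argmax(scores))]

# Toy joint feature map: copy x into the block of the weight vector
# that belongs to label i, so one weight vector scores every label.
def phi(x, i, n_labels=3):
    v = np.zeros(n_labels * len(x))
    v[i * len(x):(i + 1) * len(x)] = x
    return v

w = np.array([1.0, 0.0, 0.0, 1.0, -1.0, 0.5])  # made-up weights
print(predict_multiclass(w, phi, np.array([0.2, 0.9]), labels=[0, 1, 2]))
```

The same prediction rule carries over to structures once φ is defined over (input, structure) pairs; only the argmax gets harder.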

Multiclass is a structure: Implications
1. A lot of the ideas from multiclass may be generalized to structures (not always trivial, but useful to keep in mind)
2. Broad statements about structured learning must apply to multiclass classification (useful as a sanity check, and for understanding)
3. Binary classification is the most trivial form of structured classification (multiclass with two classes)

The machine learning of interdependent variables

Computational issues
- Model definition: What are the parts of the output? What are the inter-dependencies?
- How to train the model?
- How to do inference?
- Data annotation difficulty: semi-supervised / indirectly supervised?
- Background knowledge about the domain

What does it mean to define the model?
Say we want to predict four output variables from some input x.
[Figure: four output nodes y1, y2, y3, y4]

What does it mean to define the model?
Recall: each factor is a local expert about all the random variables connected to it, i.e., a factor can assign a score to assignments of the variables connected to it.
Option 1: Score each decision separately
[Figure: one unary factor per node y1, y2, y3, y4]
- Pro: Prediction is easy; each y is independent
- Con: No consideration of interactions

What does it mean to define the model?
Option 2: Add pairwise factors
[Figure: pairwise factors connecting the y nodes]
- Pro: Accounts for pairwise dependencies
- Cons: Makes prediction harder; ignores third- and higher-order dependencies
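To see the tradeoff concretely, here is a minimal sketch contrasting Option 1 (unary factors only) with Option 2 (unary plus pairwise factors on a chain); the score tables are made-up numbers, and brute-force enumeration stands in for a real inference algorithm.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_labels = 4, 2
unary = rng.normal(size=(n_vars, n_labels))                   # score of y_i = k
pairwise = rng.normal(size=(n_vars - 1, n_labels, n_labels))  # score of (y_i, y_{i+1})

def score(y, use_pairwise):
    s = sum(unary[i, y[i]] for i in range(n_vars))
    if use_pairwise:
        s += sum(pairwise[i, y[i], y[i + 1]] for i in range(n_vars - 1))
    return s

for use_pairwise in (False, True):
    y_best = max(itertools.product(range(n_labels), repeat=n_vars),
                 key=lambda y: score(y, use_pairwise))
    print("pairwise" if use_pairwise else "unary only", "->", y_best)

# Under Option 1 the argmax decomposes: each y_i is just
# argmax_k unary[i, k]. Under Option 2 the neighbors are coupled, so
# prediction needs a search (brute force here; Viterbi on a chain).
```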

What does it mean to define the model?
Option 3: Use only order-3 factors
[Figure: factors over triples of the y nodes]
- Pro: Accounts for order-3 dependencies
- Cons: Prediction is even harder; inference must now consider all triples of labels

What does it mean to define the model?
Option 4: Use order-4 factors
[Figure: one factor over all of y1, y2, y3, y4]
- Pro: Accounts for order-4 dependencies
- Cons: Basically no decomposition over the labels!

What does it mean to define the model?
How do we decide what to do?

Some aspects to consider
- Availability of supervision: supervised algorithms are well studied; supervision is hard (or expensive) to obtain
- Complexity of the model: more complex models encode more complex dependencies between parts; complex models make learning and inference harder
- Features: most of the time we will assume that we have a good feature set to model our problem. But do we?
- Domain knowledge: incorporating background knowledge into learning and inference in a mathematically sound way

Computational issues, revisited: how to train the model?

Training structured models
- Empirical risk minimization principle: minimize loss over the training data; regularize the parameters to prevent overfitting
- We have seen different training strategies falling under this umbrella:
  - Conditional Random Fields
  - Structural Support Vector Machines
  - Structured Perceptron (doesn't have regularization)
- Different algorithms exist; we saw stochastic gradient descent in some detail
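As a concrete instance of this recipe, below is a hedged sketch of the structured perceptron training loop; `phi`, `argmax`, and `data` are hypothetical, caller-supplied pieces rather than a fixed API.

```python
import numpy as np

def structured_perceptron(data, phi, argmax, dim, epochs=10):
    """Sketch of the structured perceptron (note: no regularization).

    `phi(x, y)` is a joint feature map and `argmax(w, x)` solves the
    inference problem argmax_y w . phi(x, y); both are assumed to be
    supplied by the caller, matching the abstract recipe in the slides.
    Gold and predicted structures are assumed comparable with `!=`.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = argmax(w, x)  # inference inside the training loop
            if y_hat != y_gold:
                # Move toward the gold structure, away from the prediction.
                w += phi(x, y_gold) - phi(x, y_hat)
    return w
```

Note how inference sits inside the training loop: this is why inference complexity drives the cost of global training.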

Training considerations: train globally vs. train locally
Global: train according to your final model
[Figure: the full factor graph over x and y1, y2, y3, y4]
- Pro: Learning uses all the available information
- Con: Computationally expensive

Training considerations: train globally vs. train locally
Local: decompose your model into smaller ones and train each one separately; the full model is still used at prediction time
[Figure: the factor graph split into smaller pieces, e.g., (y1, y2), (y2, y3), (y3, y4), (y1, y3), ...]
- Pro: Easier to train
- Con: May not capture global dependencies

Training considerations: local vs. global
- Local learning: learn parameters for individual components independently; the learning algorithm is not aware of the full structure
- Global learning: learn parameters for the full structure; the learning algorithm knows about the full structure
How do we choose? It depends on inference complexity, and on the size of the available data too.

Computational issues, revisited: how to do inference?

Inference
- What is inference? The prediction step.
- More broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum
- Different flavors: MAP, marginal, loss-augmented
  - Maximizer (MAP): find argmax_y score(x, y)
  - Marginals: find p(y_i | x) for each part of the output
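A brute-force sketch can make the two flavors concrete; the score table below is made up, and plain enumeration only works because the toy output space is tiny.

```python
import itertools
import numpy as np

# Three binary output variables; `scores` stands in for score(x, y).
rng = np.random.default_rng(1)
outputs = list(itertools.product([0, 1], repeat=3))
scores = {y: rng.normal() for y in outputs}

# Maximizer (MAP): y* = argmax_y score(x, y)
y_map = max(outputs, key=scores.get)

# Marginals: P(y_i = 1 | x) under p(y | x) proportional to exp(score(x, y))
Z = sum(np.exp(s) for s in scores.values())
marginals = [sum(np.exp(scores[y]) for y in outputs if y[i] == 1) / Z
             for i in range(3)]
print(y_map, [round(m, 3) for m in marginals])
```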

Inference
- Many algorithms and solution strategies: combinatorial optimization, one size doesn't fit all
- Graph algorithms, belief propagation, integer linear programming, (beam) search, Monte Carlo methods...
- Some tradeoffs. How do we choose?
  - Programming effort
  - Exact vs. inexact: Is the problem solvable with a known algorithm? Do we care about the exact answer?
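As one example of an inexact strategy, here is a minimal beam-search sketch, assuming the output is built left to right and that a hypothetical, user-supplied `score_step(x, prefix, label)` scores each extension.

```python
def beam_search(x, n_steps, labels, score_step, beam_size=3):
    """(Beam) search as inexact inference: keep only the `beam_size`
    highest-scoring partial structures at each step."""
    beam = [((), 0.0)]  # (partial structure, running score)
    for _ in range(n_steps):
        # Extend every survivor with every label, then prune.
        candidates = [(prefix + (label,), s + score_step(x, prefix, label))
                      for prefix, s in beam
                      for label in labels]
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0]  # best (structure, score) found; not guaranteed optimal
```

With beam_size=1 this is greedy decoding; letting the beam grow without bound recovers exhaustive (exact but expensive) search, which is exactly the tradeoff the slide names.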

Computational issues, revisited: how does background knowledge about the domain come in?

How does background knowledge affect your choices?
Background knowledge biases your predictor in several ways:
- What is the model? (Maybe third-order factors are not needed, etc.)
- Your choices of learning and inference algorithms
- Feature functions
- Constraints that prohibit certain inference outcomes

Computational issues, revisited: data annotation difficulty.

Data and how it influences your model
- Annotated data is a precious resource: it takes specialized expertise to generate, or very clever tricks (like online games that produce data as a side effect)
- Important directions: learning with latent representations, indirect supervision, partial supervision
- In all these cases: learning is rarely a convex problem, and modeling choices become very important! A bad model will hurt.

Looking ahead
Big questions (a very limited and biased set):
- Representations: Can we learn the factorization? Can we learn the feature functions?
- Dealing with the data problem for new applications: clever tricks to get data; taming latent-variable learning
- Applications: How does structured prediction help you? Gathering importance as computer programs have to deal with uncertain, noisy inputs and make complex decisions.