NLP Technologies for Cognitive Computing Geilo Winter School 2017


NLP Technologies for Cognitive Computing, Geilo Winter School 2017. Devdatt Dubhashi, LAB (Machine Learning, Algorithms, Computational Biology), Computer Science and Engineering, Chalmers

Horizon (100 years): Superintelligence

Horizon (20 years): Automation

"we really have to think through the economic implications. Because most people aren't spending a lot of time right now worrying about singularity, they are worrying about 'Well, is my job going to be replaced by a machine?'" (WIRED, Nov. 2016). D. Dubhashi and S. Lappin, AI Dangers: Real and Imagined, Comm. ACM (to appear).

A Spectre is Haunting the World. The greatest problem of 21st-century economics is what to do with surplus humans. Yuval Noah Harari, Homo Deus: A Brief History of Tomorrow (2016).

A Tale of Two Stanford Labs: Artificial Intelligence (AI, John McCarthy) vs. Intelligence Augmentation (IA, Douglas Engelbart).

Why do we need Cognitive Assistants? "The reason I was interested in interactive computing, even before we knew what that might mean, arose from this conviction that we would be able to solve really difficult problems only through using computers to extend the capability of people to collect information, create knowledge, manipulate and share it, and then to put that knowledge to work. Computers most radically and usefully extend our capabilities when they extend our ability to collaborate to solve problems beyond the compass of any single human mind." (Improving Our Ability to Improve: A Call for Investment in a New Future, Douglas C. Engelbart, September 2003.)

What is a Cognitive Assistant? A software agent ("cog") that augments human intelligence (Engelbart's definition in 1962). Performs tasks and offers services (assists the human in making decisions and taking actions). Complements the human by offering capabilities beyond the ordinary power and reach of human intelligence (intelligence amplification). Sources: Augmenting Human Intellect: A Conceptual Framework, Douglas C. Engelbart, October 1962; Cognitive Assistance at Work: Cognitive Assistant for Employees and Citizens, Hamid R. Motahari-Nezhad, AAAI 2015 Fall Symposium.

Today

The Vision: all-pervasive cognitive computing agents.

AI: Roadmaps to the Future. B. Lake, J. Tenenbaum et al., "Building Machines That Learn and Think Like People", in press at Behavioral and Brain Sciences, 2016. T. Mikolov, A. Joulin and M. Baroni, "A Roadmap towards Machine Intelligence", arXiv 2015. J. Schmidhuber, "On Learning to Think", arXiv 2015.

How to Dance with the Robots: natural language processing (NLP) and understanding; interaction, feedback, communication, learning from the environment; causal reasoning; intuitive physics; behavioural psychology.

Why language is difficult: words are polysemous and synonymous. [Diagram: a concept layer linked to a lexical layer.] "He sat on the river bank and counted his dough." "She went to the bank and took out some money."

Word senses and Machine Translation

Google Neural Machine Translation

Google's neural system reduces translation errors across its Google Translate service by between 55 percent and 85 percent.

Goals and Contents of Lectures. Core machine learning: supervised learning (large-scale logistic regression, neural networks); unsupervised learning (clustering); optimization (first-order methods, submodular functions). NLP applications: distributional semantics; summarization; word sense induction and disambiguation.

WORD EMBEDDINGS

Word Embeddings: "crown jewel of NLP", J. Howard (KD

Word Embeddings capture meaning

Voxel-wise modelling. A. G. Huth et al., Nature 532, 453–458 (2016), doi:10.1038/nature17637.

Distributional Hypothesis. "Know a man by the company he keeps." (Euripides) Distributional Hypothesis (Harris 1954, Firth 1957): if two words are similar in meaning, they will have similar distributions in texts, that is, they will tend to occur in similar linguistic contexts.

Distributional Models: LSA

Predictive Distributional Models: CBOW vs SkipGram

Logistic Regression: Recap. Optimize $w$ to maximize the log-likelihood of the training data.
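For concreteness, a minimal sketch of that objective in standard notation (the slide's own formula is not reproduced here): given labelled examples $(x_i, y_i)$ with $y_i \in \{0, 1\}$,

$$\ell(w) = \sum_{i=1}^{n} \left[\, y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\big(1 - \sigma(w^\top x_i)\big) \right], \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$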

Skip-gram Model. [Slide example: a dataset and a context window; positive examples are (word, context) pairs observed in the window; negative examples, e.g. (sheep, quick), are generated at random.]

Context and Target Vectors. Assign to each word $w$ a target vector $u_w$ and a context vector $v_w$ in $\mathbb{R}^d$. Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$.

Log-likelihood Function. Negative sampling: use randomly generated pairs $(w, w')$ in place of the observed pairs $D$.
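Written out, the objective takes the standard skip-gram-with-negative-sampling form (following Levy and Goldberg 2014 rather than the slide verbatim), with $D$ the set of observed pairs and $D'$ the randomly generated negative pairs:

$$\ell(u, v) = \sum_{(w, w') \in D} \log \sigma\big(u_w^\top v_{w'}\big) + \sum_{(w, w') \in D'} \log \sigma\big(-u_w^\top v_{w'}\big).$$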

Quiz: How do we train parameters for this likelihood function?

Gradient Descent

(Stochastic) Gradient Descent. Gradient descent: each iteration is expensive, as it needs to run through all data points; steady linear convergence; number of iterations $O(\log \frac{1}{\varepsilon})$; total cost $O(n \log \frac{1}{\varepsilon})$. Stochastic gradient descent: each iteration is cheap, as it looks at only one data point; initial fast descent but slow at the end; number of iterations $O(\frac{1}{\varepsilon})$; escapes saddle points; better suited for big data.
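The contrast is easiest to see in code. A minimal Python sketch for the logistic regression objective above (function names and learning-rate values are illustrative assumptions, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_full(w, X, y):
        # Full-batch gradient of the mean negative log-likelihood: touches ALL n points.
        return X.T @ (sigmoid(X @ w) - y) / len(y)

    def gradient_descent(X, y, lr=0.1, iters=1000):
        w = np.zeros(X.shape[1])
        for _ in range(iters):                  # each iteration costs O(n)
            w -= lr * grad_full(w, X, y)
        return w

    def sgd(X, y, lr=0.1, epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):   # each update looks at one point
                g = (sigmoid(X[i] @ w) - y[i]) * X[i]
                w -= lr * g
        return w

The full-batch version pays O(n) per update for an exact descent direction; SGD pays O(1) points per update for a noisy one, which is exactly the trade-off listed above.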

[Plot: error of SGD over iterations.] Initial fast decrease in error; slows down closer to the optimum. It is sufficient to get close to the optimum, or to switch to a deterministic variant.

Gradient Descent and Relatives: momentum; Nesterov acceleration; mirror descent; conjugate gradient descent; proximal gradient descent. See L. Bottou et al., Optimization Methods for Large-Scale Machine Learning, 2016.
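As a rough sketch of how the first two relatives modify plain gradient descent (illustrative Python; grad is any function returning a gradient estimate at a point, and the hyperparameter values are assumptions):

    import numpy as np

    def sgd_momentum(w, grad, lr=0.01, beta=0.9, iters=1000):
        # Momentum: accumulate an exponentially decaying average of past gradients.
        m = np.zeros_like(w)
        for _ in range(iters):
            m = beta * m + grad(w)
            w = w - lr * m
        return w

    def sgd_nesterov(w, grad, lr=0.01, beta=0.9, iters=1000):
        # Nesterov acceleration: evaluate the gradient at the "look-ahead" point.
        m = np.zeros_like(w)
        for _ in range(iters):
            m = beta * m + grad(w - lr * beta * m)
            w = w - lr * m
        return w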

Convex vs. Non-Convex. Convex: unique global optimum; local opt = global opt; well understood, with gradient descent methods guaranteed to converge to the optimum at known rates of convergence. Non-convex: complex landscape of optima; local opt ≠ global opt; gradient descent methods are guaranteed to converge only to a local optimum. However, in practice gradient-descent-type methods converge to good optima.

Quiz: How are neural networks trained? What about our objective? Is it convex?

Gradient Descent for Non-convex Objectives. Recent rigorous results show that noisy/stochastic gradient descent can escape saddle points for certain classes of non-convex functions. R. Ge et al., "Matrix Completion has No Spurious Local Minimum", NIPS 2016 (best theoretical paper). NIPS 2016 workshop on non-convex optimization: https://sites.google.com/site/nonconvexnips2016

Why does it work well in practice? Word2vec tutorial on TensorFlow: https://www.tensorflow.org/tutorials/word2vec/
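To make the training loop concrete, here is a minimal numpy sketch of a single negative-sampling SGD update on the objective above (names and the learning rate are illustrative; real implementations such as the TensorFlow tutorial add mini-batching, subsampling, and a unigram noise distribution):

    import numpy as np

    def sgns_step(U, V, target, context, negatives, lr=0.025):
        """One SGD ascent step on the skip-gram negative-sampling objective.
        U, V: target/context embedding matrices of shape (vocab, d).
        target, context: indices of an observed (w, w') pair.
        negatives: indices of randomly drawn negative context words."""
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        u = U[target]
        # Positive pair: push u and V[context] together.
        g = 1.0 - sigmoid(u @ V[context])
        du = g * V[context]
        V[context] += lr * g * u
        # Negative pairs: push u away from each randomly drawn context.
        for n in negatives:
            g = -sigmoid(u @ V[n])
            du += g * V[n]
            V[n] += lr * g * u
        U[target] += lr * du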

Why does word2vec work? Why are similar words assigned similar vectors? Why is $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$?

word2vec as Matrix Factorization. Levy and Goldberg (2014): word2vec can be viewed as implicit factorization of the pointwise mutual information matrix $\mathrm{PMI}(w, w') = \log \frac{\#(w, w') \cdot |D|}{\#(w) \cdot \#(w')}$.
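A minimal sketch of this view in Python (the co-occurrence matrix C, the positive-PMI clamp, and the SVD recipe are illustrative assumptions in the spirit of Levy and Goldberg, not the slide's own recipe):

    import numpy as np

    def pmi_embeddings(C, d=100, eps=1e-12):
        """C: (vocab, vocab) word-context co-occurrence counts.
        Returns (vocab, d) embeddings via explicit PMI factorization."""
        total = C.sum()                          # |D|, total number of pairs
        pw = C.sum(axis=1, keepdims=True)        # #(w)
        pc = C.sum(axis=0, keepdims=True)        # #(w')
        pmi = np.log((C * total + eps) / (pw @ pc + eps))
        ppmi = np.maximum(pmi, 0)                # positive PMI, common in practice
        U, S, _ = np.linalg.svd(ppmi)            # truncate the SVD to rank d
        return U[:, :d] * np.sqrt(S[:d])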

Relations = Lines. Arora et al. (2016): posit a generative model such that for every relation $R$ there is a direction $\mu_R$ such that if $(a, b) \in R$ then $v_a - v_b = \alpha_{a,b}\, \mu_R + \eta$, where $\eta$ is a noise vector.
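This generative picture is what licenses analogy arithmetic on the vectors. A minimal illustrative check (assuming vecs is a dict from words to numpy vectors, e.g. loaded from a trained word2vec model):

    import numpy as np

    def analogy(vecs, a, b, c):
        """Return the word whose vector is most cosine-similar to vec(b) - vec(a) + vec(c)."""
        q = vecs[b] - vecs[a] + vecs[c]
        best, best_sim = None, -np.inf
        for w, v in vecs.items():
            if w in (a, b, c):
                continue
            sim = v @ q / (np.linalg.norm(v) * np.linalg.norm(q))
            if sim > best_sim:
                best, best_sim = w, sim
        return best

    # e.g. analogy(vecs, "man", "king", "woman") should return "queen"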

References. Y. Goldberg and O. Levy, "word2vec Explained", arXiv 2014. O. Levy and Y. Goldberg, "Neural Word Embedding as Implicit Matrix Factorization", NIPS 2014. S. Ruder, "An Overview of Gradient Descent Optimization Algorithms", arXiv 2016. L. Bottou, F. Curtis and J. Nocedal, "Optimization Methods for Large-Scale Machine Learning", arXiv 2016. S. Arora et al., "A Latent Variable Model Approach to PMI-Based Word Embeddings", TACL 2016.