Natural Language Processing: Interpretation, Reasoning and Machine Learning


Natural Language Processing: Interpretation, Reasoning and Machine Learning
Roberto Basili (Università di Roma, Tor Vergata)
dblp: http://dblp.uni-trier.de/pers/hd/b/basili:roberto.html
Google Scholar: https://scholar.google.com/citations?user=u1a22fyaaaaj&hl=it&oi=sra
Politecnico di Milano, 4-5 May 2017

Overview
- Artificial Intelligence, Natural Language & Speech: information, representation, (re)current challenges, success (and failure) stories
- Natural Language Processing: linguistic background
(break)
- Natural Language Processing: Tasks, Models and Methods
- The Role of Machine Learning Technologies
- Lexicon Acquisition: automatic development of dictionaries, semantic lexicons and ontologies
- Statistical Language Processing: Semantic Role Labeling
(break)
- Natural Language Processing: Results & Applications
  - Semantic Document Management
  - Web-based Opinion Mining Systems, Market Watch & Brand Reputation Management
  - Human-Robot Voice Interaction

ML in NLP, a prologue: the syntax-semantic mapping. [Parse tree of "Paul gives a lecture in Rome", with semantic roles: Paul = Arg0, gives = Predicate, a lecture = Arg1, in Rome = ArgM.] Different semantic theories apply (e.g. PropBank vs. FrameNet).

Linking syntax to semantics: "Police arrested the man for shoplifting". [Parse tree annotated with frame roles: Police = Authority, arrested = the Arrest predicate, the man = Suspect, for shoplifting = Offense.]

A tabular vision

Word         Predicate  Semantic Role
Police       -          Authority
arrested     Arrest     Target
the          -          Suspect
man          -          Suspect
for          -          Offense
shoplifting  -          Offense

Using FrameNet/PropBank

NLP: linguistic levels

Language as a system of rules. "Here begins my despair as a writer. Every language is an alphabet of symbols whose use presupposes a past shared by its interlocutors; how, then, can I transmit to others the infinite Aleph, which my fearful memory can scarcely embrace?" (J.L. Borges, El Aleph, 1949)

A different perspective: meaning is acquired and recognized within the daily practices related to its usage. "The meaning of a word is to be defined by the rules for its use, not by the feeling that attaches to the words" (L. Wittgenstein's Lectures, Cambridge 1932-1935). Recognizing a meaning consists in the ability to map a linguistic expression to an experience (praxis), through mechanisms such as analogy, approximate equivalence functions, or the minimization of the risk of being wrong, inappropriate or obscure. The interpretation process can therefore be obtained by inducing one (or more) decision function(s) from experience.

The inductive process. [Diagram: annotated phenomena and examples (Observation 1, Observation 2, Observation 3, ..., Observation n) feed a Learning Machine, which produces a Model.]

The inductive process, instantiated for text. [Diagram: texts with annotations (citations, word-level annotations, known facts), phrases and parse trees feed SVM learning, which produces a model.]

The inductive process, with kernels. [Diagram: the same pipeline where words, phrases, parse trees and known facts are compared through dedicated kernel functions (word kernel, phrase kernel, tree kernel); SVM learning over the combined kernels produces the recognition model.]


IBM Watson: between Intelligence and Data. IBM's Watson: http://www-03.ibm.com/innovation/us/watson/science-behind_watson.shtml

Jeopardy!

Watson


Semantic Inference in Watson QA

Intelligence in Watson

Watson: a DeepQA architecture

Ready for Jeopardy!

About the Watson intelligence.
Strongly positive aspects:
- Adaptivity of the overall workflow
- Significant exploitation of available data
- Huge volumes of knowledge involved
Criticalities:
- The encyclopedic knowledge needed for Jeopardy! is quite different in nature from the domain expertise required in many applications
- Watson is built around factoid questions, strongly rooted in objective facts that are explicit and non-subjective
- Formalizing the input knowledge a priori, as is done for Watson, is very difficult to achieve in a cost-effective manner: sometimes such knowledge is simply absent in an enterprise
- For many natural languages the necessary amount of information and resources is not available, so a purely data-driven approach is not applicable

Machine Learning: the weapons
- Rule and pattern learning from data: frequent pattern mining (basket analysis)
- Probabilistic extensions of grammars: probabilistic CFGs, stochastic grammars
- Discriminative learning in neural networks: SVMs, perceptrons
- Kernel functions in implicit semantic spaces
- Bayesian models & graphical models

Weighted grammars, between syntax & statistics:
- POS tagging (Church, 1989)
- Probabilistic Context-Free Grammars (Pereira & Schabes, 1991)
- Data Oriented Parsing (Scha, 1990)
- Stochastic Grammars (Abney, 1993)
- Lexicalized Models (C. Manning, 1995)
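
To make the flavor of such weighted grammars concrete, here is a toy probabilistic CFG sketched with NLTK (a sketch, assuming the nltk package; the grammar and its probabilities are invented for this illustration, not taken from the lecture):

```python
# A toy PCFG: each rule carries a probability, and rules with the same
# left-hand side must sum to 1.
from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
    S  -> NP VP   [1.0]
    NP -> DT N    [0.6]
    NP -> N       [0.4]
    VP -> V NP    [1.0]
    DT -> 'a'     [1.0]
    N  -> 'man'   [0.5]
    N  -> 'talk'  [0.5]
    V  -> 'gives' [1.0]
""")

parser = ViterbiParser(grammar)            # returns the most probable parse
for tree in parser.parse("man gives a talk".split()):
    print(tree)                            # parse tree with its probability
```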

Bayesian & Graphical Models

Hidden Markov Models (HMM):
- States = categories/classes
- Observations, emissions, transitions
Applications: speech recognition, sequence labeling (e.g. POS tagging)

The forward algorithm: estimation
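
The slide's derivation is diagrammatic; as a rough companion, here is a minimal NumPy sketch of forward estimation (the tagset, the probabilities and the example sentence are invented for illustration):

```python
# Forward algorithm: computes P(observations) by summing over all
# hidden state sequences via dynamic programming.
import numpy as np

states = ["DT", "NN"]                      # hidden categories (toy tagset)
pi = np.array([0.7, 0.3])                  # initial state probabilities
A = np.array([[0.1, 0.9],                  # A[i, j] = P(state j | state i)
              [0.4, 0.6]])
B = {"the": np.array([0.8, 0.05]),         # B[word][i] = P(word | state i)
     "dog": np.array([0.05, 0.5])}

def forward(obs):
    """Return P(obs) under the HMM."""
    alpha = pi * B[obs[0]]                 # alpha[i] = P(o_1, q_1 = i)
    for word in obs[1:]:
        alpha = (alpha @ A) * B[word]      # sum over previous states
    return alpha.sum()

print(forward(["the", "dog"]))             # likelihood of the toy sentence
```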

Viterbi decoding

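A matching sketch of Viterbi decoding, reusing the toy HMM from the previous snippet: the forward sum is replaced by a max plus back-pointers, recovering the most probable tag sequence rather than the sentence likelihood.

```python
# Viterbi decoding: most likely hidden state sequence for obs.
import numpy as np

def viterbi(obs, states, pi, A, B):
    delta = pi * B[obs[0]]                     # best score ending in each state
    backptr = []
    for word in obs[1:]:
        scores = delta[:, None] * A            # scores[i, j]: from state i to j
        backptr.append(scores.argmax(axis=0))  # best predecessor of each state
        delta = scores.max(axis=0) * B[word]
    # follow back-pointers from the best final state
    path = [int(delta.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return [states[i] for i in reversed(path)]

# e.g. viterbi(["the", "dog"], states, pi, A, B) -> ["DT", "NN"]
```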

NLP and HMM decoding. The HMM sequence labeling approach can be applied to a variety of linguistic subtasks:
- Tokenization
- MWE recognition
- POS tagging
- Named Entity Recognition
- Predicate Argument Structure recognition
- SRL: shallow semantic parsing
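
As a concrete illustration of how such subtasks become sequence labeling, entities can be encoded with per-token BIO tags (a standard encoding; the sentence below is invented):

```python
# NER cast as per-token sequence labeling with BIO tags:
# B- opens an entity, I- continues it, O marks tokens outside any entity.
tokens = ["Police", "arrested", "John",  "Smith", "in", "Rome",  "."]
tags   = ["O",      "O",        "B-PER", "I-PER", "O",  "B-LOC", "O"]
# An HMM decoder (e.g. Viterbi, as above) predicts the tag sequence.
```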

POS tagging

NLP & structured prediction. HMM decoding is an example of a large class of structured prediction tasks. Key elements:
- Transform an NLP task into a sequence of classification problems.
- Transform it into a sequence labeling problem and use a variant of the Viterbi algorithm.
- Design a representation (e.g. features and metrics), a prediction algorithm, and a learning algorithm for your particular problem.

Discriminative learning:
- Has characterized neural networks since early cybernetics (Minsky & Papert, 1956)
- Is strongly rooted in the notion of inner product, which in turn characterizes the norms and thus the distances in the space
- Uses a vector space in R^n as the input representation space
(Not so) recent achievements: Statistical Learning Theory and Support Vector Machines (Vapnik, 1987); Deep Learning (Bengio et al., 2001)

Linear Classification (1). The hyperplane equation is f(x) = w · x + b, with x, w ∈ R^n and b ∈ R, where x is the vector describing the targeted input example and w is the gradient (normal) of the hyperplane. Classification inference: h(x) = sign(f(x)).

Support Vector Machines. Support Vector Machines (SVMs) are based on Statistical Learning Theory [Vapnik, 1995]. An SVM does not require the storage of all training data, but only of a subset of discriminating instances (the support vectors, SVs). The classifier is a linear combination of the SVs, i.e. it depends ONLY on inner products with their vectors: h(x) = sgn(w · x + b) = sgn(Σ_j α_j y_j x_j · x + b), where the sum ranges over the support vectors. [Figure: two classes in a 2D space (Var 1, Var 2), the separating hyperplane with its margin, and the support vectors lying on the margin.]
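
A minimal scikit-learn sketch of this property (assuming scikit-learn is available; the four points are toy data): after training, the decision function can be recomputed from the support vectors alone.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)        # the subset of discriminating instances

# h(x) = sgn( sum_j alpha_j y_j <x_j, x> + b ), using only the SVs:
x = np.array([0.9, 0.2])
score = clf.dual_coef_ @ clf.support_vectors_ @ x + clf.intercept_
print(np.sign(score), clf.predict([x]))    # the two predictions agree
```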

Separability in higher-dimensional spaces. In R^2, 3 points can always be separated (or shattered) by a linear classifier, but 4 points cannot (the VC dimension is 3) [Vapnik and Chervonenkis (1971)]. Solution 1 (neural networks): complexify the classification function. This requires a more complex architecture, usually based on ensembles of neurons (e.g. multi-layer), and risks over-fitting the training data, which is dangerous for performance on test data.

Separability and high-dimensional spaces (2). Solution 2: project instances into a higher-dimensional space, i.e. a new feature space, by using a projection function Φ. Basic idea from SLT: more complex feature spaces are preferable to more complex decision functions, since they minimize the risk of overfitting.

Representation & kernels. If a specific function called a kernel is available such that k(x_i, x_j) = Φ(x_i) · Φ(x_j), there is no need to project the individual examples through the projection function Φ (Cristianini et al., 2002). A structured paradigm is applied such that:
- the machine is trained against more complex structures;
- the machine-learning focus moves onto the representation Φ(x_j);
- k(.,.) expresses a similarity (a metric) that can account for linguistic aspects and depend on the lexicon, syntax and/or semantics.
The classifier becomes h(x) = sgn(w · Φ(x) + b) = sgn(Σ_j α_j y_j Φ(x_j) · Φ(x) + b) = sgn(Σ_j α_j y_j k(x_j, x) + b), with the sum over the support vectors.
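
A sketch of the same idea in code (again assuming scikit-learn; the polynomial kernel and the XOR-style toy data are choices of this illustration, not the lecture's): the learner only ever sees k(x_i, x_j), never the projection Φ.

```python
import numpy as np
from sklearn.svm import SVC

def k(X, Y):
    """Polynomial kernel k(x, y) = (x.y + 1)^2: implicitly a richer
    feature space than the raw input vectors."""
    return (X @ Y.T + 1.0) ** 2

X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([0, 0, 1, 1])                 # XOR-like, not linearly separable

clf = SVC(kernel=k, C=10.0).fit(X, y)      # learning sees only k(xi, xj)
print(clf.predict(X))                      # separable in the implicit space
```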

Examples of kernels sensitive to syntactic structures. A tree can be seen as the occurrence of a joint event. [Figure: the parse tree of "delivers a talk": VP → V NP, V → delivers, NP → D N, D → a, N → talk.]

Kernels & syntactic structures: a collective view of the joint event. The tree can be seen as the joint occurrence of all of its subtrees. [Figure: the VP tree of "delivers a talk" decomposed into its subtree fragments, from single productions such as VP → V NP and NP → D N up to the fully lexicalized tree.]

Tree kernels: the implicit metric space. The function Φ in a tree kernel defines a vector Φ(T) representing ALL the subtrees of the input tree T, e.g. Φ(T) = (0, ..., 1, ..., 0, ..., 1, ..., 0, ..., 1, ..., 1, ..., 0, ..., 1). It naturally (i.e. without feature engineering) emphasizes: lexical information (e.g. the word "magazine"); coarse-grained grammatical information (POS tags such as VBZ); and syntactic information (complex fragments). [Figure: the parse tree of "man reads magazine" (S → NP VP, NP → NN man, VP → VBZ NP, VBZ → reads, NP → NN magazine) with some of its fragments.] The inner product in the space of all subtrees is proportional to the number of subtrees shared between two sentences. The learning algorithm (e.g. an SVM) then selects the discriminating examples in this (possibly infinite-dimensional) space.
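
A compact sketch of the subtree-counting recursion behind such kernels, in the style of Collins & Duffy (2001); the nested-tuple tree encoding and the decay factor lam are choices of this illustration, not the UTV implementation.

```python
# Trees as nested tuples, e.g.
# ("VP", ("V", "delivers"), ("NP", ("D", "a"), ("N", "talk"))).
def production(node):
    """Label of a node plus the labels of its children."""
    return (node[0], tuple(c[0] if isinstance(c, tuple) else c for c in node[1:]))

def common(n1, n2, lam=0.4):
    """Common subtree fragments rooted at n1 and n2 (decayed by lam)."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):                  # recurse on non-terminals
            score *= 1.0 + common(c1, c2, lam)
    return score

def nodes(t):
    yield t
    for c in t[1:]:
        if isinstance(c, tuple):
            yield from nodes(c)

def tree_kernel(t1, t2, lam=0.4):
    """K(t1, t2): inner product in the implicit space of all subtrees,
    i.e. a (decayed) count of the fragments the two trees share."""
    return sum(common(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

t = ("VP", ("V", "delivers"), ("NP", ("D", "a"), ("N", "talk")))
print(tree_kernel(t, t))    # similarity of a tree with itself
```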

Application of distributional lexicons for Semantic Role Labeling @ UTV. An important application of tree-kernel based SVMs is Semantic Role Labeling with respect to FrameNet. In the UTV system, a cascade of classification steps is applied:
- Predicate detection
- Boundary recognition
- Argument categorization (local models)
- Reranking (joint models)
Input: a sentence and its parse trees. Adopted kernel: a combination of a lexical kernel (e.g. bag-of-words) and a tree kernel (which is still a kernel).

Linking syntax to semantics: "Police arrested the man for shoplifting". [Parse tree annotated with frame roles: Police = Authority, arrested = the Arrest predicate, the man = Suspect, for shoplifting = Offense.]

Using FrameNet/PropBank

Semantic Role Labeling via SVM learning. Two steps:
- Boundary detection: one binary classifier applied to the parse tree nodes.
- Argument type classification: a multi-classification problem, where n binary classifiers are applied, one for each argument class (i.e. frame element). They are combined in a ONE-vs-ALL scheme: the argument type whose SVM scores highest is selected.
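
A sketch of the ONE-vs-ALL scheme just described (assuming scikit-learn; the spans, roles and bag-of-words features are placeholders for illustration, not the actual FrameNet setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

spans = ["police", "the man", "for shoplifting"]   # candidate arguments
roles = ["Authority", "Suspect", "Offense"]        # frame elements

clf = make_pipeline(
    CountVectorizer(),                             # bag-of-words features
    OneVsRestClassifier(LinearSVC()),              # one binary SVM per role
)
clf.fit(spans, roles)
print(clf.predict(["the woman"]))   # the role with the maximum SVM score wins
```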

SRL in FrameNet: Results

FrameNet SRL: best results. Best system [Erk & Pado, 2006]: 0.855 Precision, 0.669 Recall, 0.751 F1. Trento (+RTV) system (Coppola, PhD 2009).

Argument classification (Croce et al., 2013). UTV experimented with FrameNet SRL argument classification (with gold-standard boundaries), using FrameNet version 1.3: 648 frames are considered; training set: 271,560 arguments (90%); test set: 30,173 arguments (10%). Example: [Bootleggers] CREATOR then copy [the film] ORIGINAL [onto hundreds of VHS tapes] GOAL.

Kernel             Accuracy
GRCT               87.60%
GRCT_LSA           88.61%
LCT                87.61%
LCT_LSA            88.74%
GRCT+LCT           87.99%
GRCT_LSA+LCT_LSA   88.91%

Semantics, natural language & learning: from Learning to Read to knowledge distillation as an (integrated pool of) semantic interpretation task(s):
- Information Extraction: entity recognition and classification, relation extraction
- Semantic Role Labeling (shallow semantic parsing)
- Estimation of text similarity; structured text similarity / textual entailment recognition
- Sense disambiguation
- Semantic search, question classification and answer ranking
- Knowledge acquisition, e.g. ontology learning
- Social network analysis, opinion mining

References
AI & Robotics:
- "Robot Futures", Illah Reza Nourbakhsh, MIT Press, 2013.
NLP & ML:
- "Statistical Methods for Speech Recognition", F. Jelinek, MIT Press, 1998.
- "Speech and Language Processing", D. Jurafsky and J. H. Martin, Prentice-Hall, 2009.
- "Foundations of Statistical Natural Language Processing", Manning & Schütze, MIT Press, 2001.
URLs:
- SAG, Univ. Roma Tor Vergata: http://sag.art.uniroma2.it/
- Reveal s.r.l.: http://www.revealsrl.it/