Wrap-up: IE, QA, and Dialog (Mausam)
Grading
- 50%/40% project
- 20%/15% final exam
- 20%/15% regular reviews
- 10% midterm survey
- 10% presentation
- Extra credit: participation
Plan (1st half of the course)
- Classical papers/problems in IE: bootstrapping, NELL, Open IE
- Important techniques for IE: CRFs, tree kernels, distant supervision, joint inference, deep learning, reinforcement learning
- IE++: coreference, paraphrases, inference
Plan (2nd half of the course)
- QA
- Conversational agents
Plan (1st half++ of the course)
- Classical papers/problems in IE: bootstrapping, NELL, Open IE
- Important techniques for IE: semi-CRFs, tree kernels, distant supervision, joint inference, topic models, deep learning (CNNs), reinforcement learning
- IE++: coreference, paraphrases
- Inference: random walks, neural models
Plan (2nd half of the course)
- QA: open QA, semantic parsing; LSTM, attention, more attention, Recursive NN, deep feature fusion network
- Conversational agents: generative hierarchical nets, GANs, MemNets
NLP (or any application course)
Techniques/models:
- Bootstrapping (coupled)
- Semi-supervised learning (SSL)
- PGMs: semi-CRF, MultiR, LDA
- Tree kernels
- Multi-instance learning
- Random walks over graphs
- Reinforcement learning
- CNN, LSTM, Bi-LSTM, Recursive NN
- Attention, MemNets, GANs
Problems:
- NER
- Entity/relation/event extraction
- Open relation/event extraction
- Multi-task learning
- KB inference
- Open QA
- Machine comprehension
- Task-oriented dialog w/ KB
- General dialog
How much data?
- Large supervised dataset: supervised learning
  - Trick: construct a large supervised dataset w/o noise (Semi-CRF, Twit-NER/POS, QuizBowl, SQuAD QA, CNN QA, Movies, Ubuntu, OQA, random walks; negative data can be artificial)
- Small supervised dataset: semi-supervised learning (bootstrapping, co-training, graph-based SSL)
- No supervised dataset: unsupervised learning/rules (TwitIE, ReVerb)
  - Trick: construct a large supervised dataset with noise: distant supervision (MultiR, PCNNs)
Non-deep learning ideas: semi-supervised
- Bootstrapping (in a loop): automatic generation of training data by matching known facts
- Multi-view / multi-task co-training: constraints between tasks; agreement between multiple classifiers for the same concept
- Graph-based SSL: agreement between nodes of the graph
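A minimal sketch of the bootstrapping loop described above, alternating between inducing patterns from known facts and extracting new facts with those patterns. The corpus, seed tuple, and string-matching heuristics are toy illustrations, not from any real system:

```python
# Bootstrapping: alternate between (1) inducing extraction patterns from
# sentences that match known facts and (2) extracting new facts with the
# induced patterns. Toy corpus and seeds.

corpus = [
    "Paris is the capital of France",
    "Tokyo is the capital of Japan",
    "Tokyo , capital of Japan",
    "Berlin , capital of Germany",
]
seeds = {("Paris", "France")}
patterns = set()

for _ in range(3):  # a few bootstrapping iterations
    # Step 1: induce patterns (the text between the two arguments)
    # from sentences matching known facts.
    for sent in corpus:
        for x, y in seeds:
            if x in sent and y in sent:
                middle = sent.split(x)[1].split(y)[0]
                patterns.add(middle)
    # Step 2: extract new facts with the induced patterns.
    for sent in corpus:
        for pat in patterns:
            if pat.strip() and pat in sent:
                left, _, rest = sent.partition(pat)
                x, y = left.strip(), rest.strip()
                if x and y:
                    seeds.add((x, y))
```

In practice each pattern and fact would carry a confidence score (as in the [Gagan, Barun] point below, slide 15), precisely to limit the semantic drift this naive loop suffers from.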
Non-deep learning ideas: distant supervision
- KB of facts: known. Extraction supervision: unknown
- Bootstrap a training dataset by matching sentences with facts
- Hypothesis 1: all such sentences are positive training for a fact: NOISY
- Hypothesis 2: all such sentences form a bag; each bag must have a unique relation: BETTER
- Hypothesis 3: each bag can have multiple labels: EVEN BETTER
- Multi-instance learning: noisy-OR in PGMs; maximize the max probability in the bag
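The two bag-level aggregations named above (the max probability in the bag, and noisy-OR) can be written down directly. The sentence-level probabilities here are toy numbers, assuming numpy:

```python
import numpy as np

# Sentence-level probabilities that each sentence in a bag expresses the
# relation (toy numbers). Multi-instance learning trains on a bag-level
# score instead of trusting every matched sentence as a positive example.
p = np.array([0.1, 0.7, 0.3])

# "At-least-one" aggregation: maximize the max probability in the bag.
bag_max = p.max()

# Noisy-OR aggregation: the bag expresses the relation unless every
# sentence independently fails to express it.
bag_noisy_or = 1.0 - np.prod(1.0 - p)
```

Both aggregations let one confident sentence dominate the bag, which is exactly what makes Hypothesis 2/3 more robust to noisy matches than Hypothesis 1.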
Non-deep learning ideas: no intermediate supervision
- QA tasks: (question, answer) pairs known; inference chain unknown
- Distant supervision: KB fact known; which sentence to extract from unknown
- OQA (which proof is better is not known); random-walk inference (which path is better is not known); MultiR (which sentence in the corpus is not known)
- Approach: create a model for scoring each path/proof using weights on properties of each constituent; train using the known supervision (perceptron-style updates)
- Differences: OQA scores each edge separately; PRA scores the whole path; MultiR uses multi-instance learning
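The approach above can be sketched as a latent-variable structured perceptron: score each candidate path/proof with a weighted feature vector, predict with the argmax, and update the weights using only the known final answer. The candidate paths and feature names below are illustrative, not from any of the systems discussed:

```python
# Perceptron-style training when only the final answer is supervised,
# never the intermediate path/proof. Candidates and features are toys.

def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

# Two candidate inference paths for the same query; only the answer
# each path reaches is known to be right or wrong.
candidates = [
    {"answer": "wrong",   "feats": {"edge:born_in": 1.0, "len2": 1.0}},
    {"answer": "correct", "feats": {"edge:capital_of": 1.0, "len1": 1.0}},
]

w = {}
for _ in range(5):
    best = max(candidates, key=lambda c: score(w, c["feats"]))
    if best["answer"] != "correct":
        gold = next(c for c in candidates if c["answer"] == "correct")
        for f, v in gold["feats"].items():   # promote a path that reaches the answer
            w[f] = w.get(f, 0.0) + v
        for f, v in best["feats"].items():   # demote the wrongly predicted path
            w[f] = w.get(f, 0.0) - v

prediction = max(candidates, key=lambda c: score(w, c["feats"]))
```

The same skeleton covers the variants listed: OQA decomposes the score over edges of a derivation, PRA puts the weight on whole-path features, and MultiR replaces the argmax over paths with an argmax over sentences in a bag.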
Non-deep learning ideas: sparsity
- Tree kernels: two features (paths) are similar if one shares many constituent elements with the other; similarity is down-weighted by a penalty for the non-matching elements
- Paraphrase dataset for QA
- Open relations as supplements in KB inference
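A toy version of the path-similarity idea: compare two dependency paths element by element, multiplying in a penalty for every non-matching element. This is an illustrative simplification of the kernel idea, not the exact kernel of any particular paper:

```python
# Simplified dependency-path similarity: paths that share many
# constituent elements score high; each mismatch multiplies in a
# penalty. Many path kernels only compare paths of equal length.

def path_similarity(p1, p2, penalty=0.5):
    if len(p1) != len(p2):
        return 0.0
    sim = 1.0
    for a, b in zip(p1, p2):
        sim *= 1.0 if a == b else penalty
    return sim

p1 = ["nsubj", "acquired", "dobj"]
p2 = ["nsubj", "bought", "dobj"]
s = path_similarity(p1, p2)   # one mismatch -> one penalty factor
```

This soft matching is what combats sparsity: exact-match features would treat `p1` and `p2` as unrelated, while the kernel gives them substantial shared weight.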
Deep Learning Models
- Convolutional NNs: handle fixed-length contexts
- Recurrent NNs: handle short variable-length histories
- LSTMs/GRUs: handle longer variable-length histories
- Bi-LSTMs: handle longer variable-length histories and futures
- Recursive NNs: handle variable-length, partially ordered histories
Deep Learning Models (contd.)
- Hierarchical recurrent NNs: RNNs over RNNs
- Attention models: attach non-uniform importance to histories based on evidence (e.g., the question)
- Co-attention models: attach non-uniform importances to histories in two different NNs
- MemNets: add an external storage with explicit reads, writes, and updates
- Generative adversarial nets: a better training procedure using an actor-critic architecture
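The attention bullet above is just a softmax-weighted sum. A minimal numpy sketch with toy dimensions (4 history states of size 8, a random "question" vector as the evidence):

```python
import numpy as np

# Attention: attach non-uniform importance to history states h_1..h_T
# based on evidence (here a question vector q), then summarize the
# history as the weighted sum. Toy dimensions, random values.

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # T=4 history states, dimension d=8
q = rng.normal(size=8)        # evidence / question vector

scores = H @ q                            # one relevance score per state
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax: non-uniform importance
context = weights @ H                     # weighted summary of the history
```

Co-attention repeats this in both directions (question attends to document, document attends to question); MemNets re-run the read several times ("hops") against an external memory.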
Hierarchical Models
- Semi-CRFs: joint segmentation and labeling. A sentence is a sequence of segments, each a sequence of words; allows segment-level features to be added
- HRED: LSTM over LSTM. A document is a sequence of sentences, each a sequence of words; a conversation is a sequence of utterances, each a sequence of words
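The semi-CRF idea is clearest in its Viterbi recurrence: V[i] is the best score of any segmentation-plus-labeling of the first i words, maximizing over the length and label of the last segment. The toy scoring function below stands in for learned segment-level features; words, labels, and scores are illustrative:

```python
# Semi-CRF Viterbi sketch: jointly choose a segmentation and labeling
# by scoring whole segments, so segment-level features can be used.
# V[i] = best score covering words 0..i-1. Toy scoring function.

words = ["New", "York", "is", "big"]
labels = ["LOC", "O"]
MAX_LEN = 3  # maximum segment length

def seg_score(segment, label):
    # Toy segment-level feature: reward the known multi-word location.
    if label == "LOC" and segment == ("New", "York"):
        return 5.0
    if label == "O" and len(segment) == 1 and segment[0].islower():
        return 1.0
    return -1.0

n = len(words)
V = [float("-inf")] * (n + 1)
V[0] = 0.0
back = [None] * (n + 1)
for i in range(1, n + 1):
    for l in range(1, min(MAX_LEN, i) + 1):   # length of the last segment
        seg = tuple(words[i - l:i])
        for y in labels:
            s = V[i - l] + seg_score(seg, y)
            if s > V[i]:
                V[i], back[i] = s, (i - l, seg, y)

# Recover the best labeled segmentation from the backpointers.
segments, i = [], n
while i > 0:
    j, seg, y = back[i]
    segments.append((seg, y))
    i = j
segments.reverse()
```

Note how the segment-level feature on ("New", "York") is exactly what an order-1 word-level CRF could not express directly, which is the point of the slide.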
RL for Text: two uses
- Use 1: search the Web to find easy documents for IE
- Use 2: policy-gradient algorithm for updating the generator's weights in GANs
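Use 2 rests on the REINFORCE policy-gradient trick: the discriminator's reward reaches the generator through a non-differentiable sampling step via the update reward × ∇ log π(action). A toy sketch with a 3-action policy and a hard-coded reward standing in for the discriminator:

```python
import numpy as np

# REINFORCE sketch: update logits by reward * grad(log pi(action)),
# which lets a critic's scalar reward train a sampler even though
# sampling itself is non-differentiable. Toy 3-action policy; the
# "discriminator" reward vector is hard-coded.

rng = np.random.default_rng(0)
logits = np.zeros(3)
rewards = np.array([0.0, 1.0, 0.0])  # the critic only rewards action 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)        # sample an action (non-differentiable)
    grad_logp = -probs                # gradient of log pi(a) w.r.t. logits
    grad_logp[a] += 1.0
    logits += 0.1 * rewards[a] * grad_logp

probs = softmax(logits)
```

In a text GAN the "action" is an emitted word and the reward is the discriminator's score for the finished sequence, but the update has this same shape.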
Bootstrapping
- [Akshay] fuzzy matching between seed tuples and text
- [Shantanu] named-entity tags in patterns
- [Gagan, Barun] confidence level for each pattern and fact
- Semantic drift
NELL
- Never-ending/lifelong learning
- Human supervision to guide the learning
- [many] multi-view multi-task co-training
- [many] coupling constraints for high precision
- [Dinesh] ontology to define the constraints
Open IE
- [many] ontology-free, scalability
- [Surag] data-driven research through extensive error analysis
- [Dinesh] reusing datasets from one task to another
- [Partha] open relations as supplementary knowledge to reduce sparsity
Tree Kernels
- [Shantanu] major information about the relation lies in the shortest path of the dependency parse
Semi-CRFs
- [many] segment-level features in a CRF
- [Dinesh] joint segmentation and labeling? Order-L CRFs vs. semi-CRFs
MultiR
- [Rishab] use of a KB to create a training set
- [Surag] multi-instance learning in PGMs
- [Akshay] relationship between sentence-level and aggregate extractions
- [Gagan] Viterbi approximation (replace expectation with max)
PCNNs
- [Haroun] max pooling to make layers independent of sentence size
- [Akshay] piecewise max pooling to capture arg1, rel, arg2
- [Akshay] multi-instance learning in neural nets
- Positional embeddings
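The piecewise max pooling bullet can be shown on a toy convolution output: instead of one max over the whole sentence, split the positions at the two arguments and max-pool each of the three pieces, keeping coarse before/between/after structure. The feature map, argument positions, and exact split boundaries here are illustrative:

```python
import numpy as np

# Piecewise max pooling (PCNN idea): split the convolutional feature
# map at the two argument positions and max-pool each piece separately,
# so the pooled vector preserves arg1 / rel / arg2 structure.
# Toy feature map: 8 token positions x 3 filters.

conv = np.arange(24, dtype=float).reshape(8, 3)
arg1, arg2 = 2, 5   # toy positions of the two entity arguments

pieces = [conv[:arg1 + 1], conv[arg1 + 1:arg2 + 1], conv[arg2 + 1:]]
pooled = np.concatenate([p.max(axis=0) for p in pieces])
```

Plain max pooling would give a length-3 vector here; the piecewise version gives 3 x 3 = 9 values, one max per filter per piece, which is what makes the output sentence-length independent yet structure aware.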
TwitIE
- [Haroun] tweets are challenging, but redundancy is good
- [Dinesh] G² test for ranking entities for a given date
- [Shantanu] event-type discovery using topic models
RL for IE
- [many] active querying for gathering external evidence
PRA for KB inference
- [Haroun, Akshay] low-variance sampling
- [Arindam] learning non-functional relations
- [Nupur] paths as features in a learning model
Joint MF-TF
- [Akshay, Shantanu] OOV handling
- [Nupur] loss function in joint modeling
Open QA
- [Surag] structured perceptron in a pipeline model
- [Akshay] paraphrase corpus for question rewriting
- [Shantanu] mining paraphrase operators from a corpus
- [Arindam] decomposition of scoring over derivation steps
LSTMs
- [Haroun] attention > depth
- [Akshay] cool way to construct the dataset
- [Dinesh] two types of readers
Co-attention
- [many] iterative refinement of answer-span selection
HRED
- [Akshay] pretraining the dialog model with a QA dataset
- [Arindam] passing intermediate context improves coherence?
- [Barun] split of local dialog generator and global state tracker
MSQU
- [many] partially annotated data
- [many] natural language -> SQL
GANs
- [many] teacher forcing
- [Akshay] interesting heuristics
- [Arindam] discriminator feedback can be backpropagated to the generator even though the sampling step is non-differentiable
MemNets
- [Surag] typed OOVs
- [Haroun] hops
- [Shantanu, Gagan] subtask-styled evaluation
Open/Next Issues
- IE: mature? Event extraction, temporal extraction, rapid retargetability
- KB inference: a long way to go; combining DL and path-based models
Open/Next Issues
- QA systems: dataset-driven research ([MC] SQuAD: tremendous progress); answering in the wild: not clear (large answer spaces?); deep learning for large-scale QA
- Conversational agents: [task-driven] how to get a DL model to issue a variety of queries; [general] how to get the system to say something interesting?
- DL: what are the systems really capturing!?
Conclusions
- Learn key historical developments in IE
- Learn (some of) the state of the art in IE, inference, QA, and dialog
- Learn how to critique strengths and weaknesses of a paper
- Learn how to brainstorm next steps and future directions
- Learn how to summarize an advanced area of research
- Learn to do research at the cutting edge
Exam
- Bring a laptop: Internet-enabled, PDFLaTeX-enabled
- Bring a mobile: for taking a picture
- Extension cords
- It is OK even if you have not deeply understood every paper
Project Presentations
- Motivation & problem definition
- 1 slide of contributions
- Background
- Technical approach
- Experiments
- Analysis
- Conclusions
- Future work