SI425: NLP. Missing Topics and the Future


Who cares about NLP? NLP has expanded quickly. Most top-tier universities now have NLP faculty (Stanford, Cornell, Berkeley, MIT, UPenn, CMU, Hopkins, etc.). Commercial NLP hiring: Google, Microsoft, IBM, Amazon, LinkedIn, Yahoo. Web startups in Silicon Valley are eating up NLP students. Navy, DoD, NSA, NIH: all are funding NLP research. 2

What NLP topics did we miss? Speech Recognition 3

What NLP topics did we miss? Machine Translation 5

What NLP topics did we miss? Machine Translation. Start at ~6 min in: http://www.youtube.com/watch?feature=player_embedded&v=nu-nlqqfckg 6

What NLP topics did we miss? Machine Translation IBM Models (1 through 5) Neural Network Translation 7

Machine Translation. How do we model translations? Words: P(casa | house). Spurious words: P(a | NULL). Fertility: Pn(1 | house): the English word translates to one Spanish word. Distortion: Pd(5 | 2): the 2nd English word maps to the 5th Spanish word.
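As a toy illustration of how these factors combine, the sketch below multiplies word-translation, fertility, and distortion probabilities for one candidate alignment. All probability tables here are invented for the example, not values from a trained IBM model.

```python
# Invented toy probability tables (not learned from data).
t_prob = {("casa", "house"): 0.8, ("a", None): 0.1}    # word translation P(f | e)
fert_prob = {(1, "house"): 0.9}                        # fertility Pn(n | e)
dist_prob = {(5, 2): 0.05, (4, 4): 0.3, (5, 5): 0.3}   # distortion Pd(j | i)

def score_alignment(word_pairs, fertilities, jumps):
    """Multiply word-translation, fertility, and distortion probabilities
    for one candidate alignment; unseen events get a tiny floor value."""
    p = 1.0
    for pair in word_pairs:
        p *= t_prob.get(pair, 1e-6)
    for fert in fertilities:
        p *= fert_prob.get(fert, 1e-6)
    for jump in jumps:
        p *= dist_prob.get(jump, 1e-6)
    return p

# Score a toy alignment: casa<->house, fertility 1, two near-diagonal jumps.
print(score_alignment([("casa", "house")], [(1, "house")], [(4, 4), (5, 5)]))
```

Diagonal-following jumps like (4, 4) and (5, 5) get higher distortion probability than a jump like (5, 2), so monotone alignments score better, as the next slide suggests.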

Distortion. Encourage translations to follow the diagonal: Pd(4 | 4) * Pd(5 | 5) * ...

Learning Translations. Use a huge corpus of aligned sentences, e.g. the Europarl Corpus of European Parliament proceedings. The EU is mandated to translate into all 21 official languages, (semi-)aligned to each other. P(casa | house): count all casa/house pairs! Pd(5 | 2): count all sentences where the 2nd word went to the 5th word.
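The counting described above can be sketched as a simple maximum-likelihood estimate over word-aligned pairs; the tiny "corpus" of alignments below is invented for illustration:

```python
from collections import Counter

# Invented word-aligned pairs (Spanish word, English word).
aligned_pairs = [
    ("casa", "house"),
    ("casa", "house"),
    ("hogar", "house"),
    ("perro", "dog"),
]

pair_counts = Counter(aligned_pairs)            # count(f, e)
eng_counts = Counter(e for _, e in aligned_pairs)  # count(e)

def t_prob(f, e):
    """P(f | e) = count(f, e) / count(e)."""
    return pair_counts[(f, e)] / eng_counts[e]

print(t_prob("casa", "house"))  # "house" aligned to "casa" in 2 of 3 cases
```

The same counting pattern estimates Pd: tally how often position i maps to position j, divided by how often position i appears.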

Machine Translation Technology. Hand-held devices for the military: speak English -> recognition -> translation -> generate Urdu. Translate web documents. Education technology? It doesn't yet receive much focus.

What NLP topics did we miss? Dialogue Systems. Do you think Anakin likes me? I don't care. 12

What NLP topics did we miss? Dialogue Systems. Why? Heavy interest in human-robot communication. UAVs require teams of 5+ people for each operating machine; the goal is to reduce the number of people. Give the computer high-level dialogue commands, rather than low-level system commands. 13

What NLP topics did we miss? Dialogue Systems. Dialogue is a fascinating topic. Not only do we need to understand language, but now also discourse cues: questions require replies; imperatives/commands; acknowledgments ("ok"); back-channels ("uh huh", "mm hmm"). Belief-Desire-Intention (BDI) Model. Beliefs: you maintain a set of facts about the world. Desires: things you want to become true in the world. Intentions: desires that you are taking action on. 14
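A minimal sketch of the BDI bookkeeping described above, with illustrative class and method names (not from any standard BDI library):

```python
class BDIAgent:
    """Toy container for the three BDI components described on the slide."""

    def __init__(self):
        self.beliefs = set()      # facts the agent holds about the world
        self.desires = set()      # states the agent wants to become true
        self.intentions = set()   # desires the agent is actively pursuing

    def adopt_intention(self, desire):
        """Promote a desire to an intention once the agent commits to it."""
        if desire in self.desires:
            self.desires.discard(desire)
            self.intentions.add(desire)

agent = BDIAgent()
agent.beliefs.add("door is closed")
agent.desires.add("door is open")
agent.adopt_intention("door is open")
print(agent.intentions)
```

A real dialogue manager would update beliefs from understood utterances and generate utterances (commands, acknowledgments) in service of its intentions.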

Neural Networks. The field started shifting around 2012, with real movement in just the past two years. Neural networks are now becoming the default type of classifier. 15

Word Embeddings. Our features in this class were largely binary on/off: [he=1, said=1, pizza=0, denied=1, ate=0, ...]. Now we represent a unigram as a vector of real numbers. Just like distributional learning! But: the vector is learned instead of derived from context word counts. 16

Distributional to Embeddings. Distributional word vector vs. word embedding for "cell": [1.47, 0.42, 1.44, 0.53, 0.69, 1.02, 0.42, -0.43]

Word Embeddings as Features. The features are no longer n-grams, but embeddings of the n-grams. Before: [he=1, said=1, pizza=0, denied=1, ate=0, ...]. Now: [1.47, 0.42, 1.44, 0.53, 0.69, 1.02, 0.42, -0.43]. Sum up the word embeddings for each unigram, OR just concatenate them into an even longer vector! 18
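Both feature-building options can be sketched in a few lines; the embedding values below are made up for illustration:

```python
import numpy as np

# Invented 2-dimensional embeddings (real systems use 100-1000 dimensions).
emb = {
    "he": np.array([0.2, -0.1]),
    "said": np.array([0.5, 0.3]),
    "denied": np.array([-0.4, 0.7]),
}

words = ["he", "said", "denied"]

# Option 1: sum the embeddings -> fixed length no matter how many words.
summed = sum(emb[w] for w in words)

# Option 2: concatenate them -> longer vector that preserves word order.
concat = np.concatenate([emb[w] for w in words])

print(summed, concat.shape)
```

Summing gives one 2-dimensional feature vector; concatenating three 2-dimensional embeddings gives a 6-dimensional one.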

Basic Neural Network. We learned multi-class logistic regression (MaxEnt). This is a one-layer neural network! (Diagram: input words "thou hast opened the box quicketh"; output classes Dickens, Austen, Shakespeare.) 19

Basic Neural Network. Do several regressions at once; decide later. New hidden layer!!! (Diagram: input words "thou hast opened the box quicketh"; hidden layer; output classes Dickens, Austen, Shakespeare.) 20

Basic Neural Network. Do several regressions at once; decide later. (Diagram: word embeddings as input; hidden layer; output classes Dickens, Austen, Shakespeare.) 21
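The contrast between the one-layer (MaxEnt) network and the version with a hidden layer can be sketched as a forward pass; the weights are random, so this only shows the shapes involved, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # input features, e.g. summed word embeddings

def softmax(z):
    """Turn raw scores into a probability distribution over classes."""
    e = np.exp(z - z.max())
    return e / e.sum()

# One-layer network (MaxEnt): inputs map directly to 3 author classes.
W = rng.normal(size=(3, 4))
p_maxent = softmax(W @ x)

# Add a hidden layer: several "regressions" first, then decide on top of them.
W_hidden = rng.normal(size=(5, 4))   # 5 hidden regressions over the inputs
W_out = rng.normal(size=(3, 5))      # final decision over the hidden outputs
hidden = np.tanh(W_hidden @ x)
p_deep = softmax(W_out @ hidden)

print(p_maxent.sum(), p_deep.sum())  # each sums to 1: a distribution
```

Each hidden unit is its own regression over the inputs; the output layer then "decides later" by combining the hidden activations.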

El Fin. Secret 1: I intentionally made some of our labs ambiguous: under-defined tasks with unclear expected results. Secret 2: I tried to teach you skills that have nothing to do with NLP: experimentation and error analysis. Secret 3: I appreciate the hard work you put into the class. 29

What NLP topics did we miss? Unsupervised Learning 31

What NLP topics did we miss? Unsupervised Learning. Most of this semester used data that had human labels. Bootstrapping was our main counterexample: it is mostly unsupervised. Many, many algorithms are being researched to learn language and knowledge without humans, using only text. 32
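A toy bootstrapping loop, in the spirit of the mostly-unsupervised approach mentioned above, might alternate between harvesting contextual patterns from known names and harvesting new names from those patterns; the corpus and seed below are invented:

```python
# Grow a lexicon of city names from one seed, using only raw text.
corpus = [
    "flights to Paris are cheap",
    "flights to Tokyo are booked",
    "trains to Tokyo run daily",
    "trains to Lyon run daily",
]

seeds = {"Paris"}
patterns = set()

for _ in range(3):  # alternate: names -> patterns -> new names
    # Step 1: harvest patterns -- the two words preceding each known name.
    for sent in corpus:
        words = sent.split()
        for i, w in enumerate(words):
            if w in seeds and i >= 2:
                patterns.add((words[i - 2], words[i - 1]))
    # Step 2: harvest new names appearing after any known pattern.
    for sent in corpus:
        words = sent.split()
        for i in range(2, len(words)):
            if (words[i - 2], words[i - 1]) in patterns:
                seeds.add(words[i])

print(sorted(seeds))
```

From the single seed "Paris", the pattern "flights to ___" yields "Tokyo", and "Tokyo" in turn yields the pattern "trains to ___" and the new name "Lyon": knowledge learned from text alone, with no labeled data.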