Introduction to Computational Linguistics
Olga Zamaraeva (2018)
Based on Guestrin (2013)
University of Washington
April 10, 2018
- This and last lecture: bird's-eye view
- Next lecture: understand precision & recall in detail
- Coming next week: N-grams, then CFG; we will also look at these in more detail
- Some more bird's-eye topics later in the course
What is ML?
- Study of algorithms that improve their performance at some task with experience
- Data → ML → Understanding
- Note: understanding is more general than e.g. linguistics or speech. (This is where my distinction between computational linguistics and NLP comes in, and that's why NLP is more closely associated with ML.)
ML tasks: Classification
- From data to discrete labels
  - Spam filtering
  - Text classification
  - Object detection
  - Weather prediction (e.g. rain, snow...)
  - Sentiment analysis
  - etc.
ML tasks: Regression
- Predict a numeric value
  - Stock market
  - Weather prediction (temperature)
  - Predict final scores given comments in the code :)
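As a rough illustration of "predict a numeric value", here is a minimal sketch of fitting a straight line by ordinary least squares. The temperature numbers and the function name `fit_line` are made up for this example and do not come from the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: yesterday's temperature -> today's temperature.
yesterday = [10.0, 12.0, 14.0, 16.0]
today = [11.0, 13.0, 15.0, 17.0]

slope, intercept = fit_line(yesterday, today)
# Predict today's temperature for a new value of yesterday's temperature.
prediction = slope * 18.0 + intercept
```

Unlike classification, the output here is a number on a continuous scale rather than one of a fixed set of labels.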
ML tasks: Similarity
- Finding data
  - Given image, find similar ones
  - Similar products, songs...
  - Similar texts
  - Similar words...
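One common way to make "similar texts" concrete is cosine similarity over word counts. The sketch below assumes whitespace tokenization and a bag-of-words representation, which are simplifications chosen for this example and are not something the slide prescribes.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words of text_a (missing words count as 0).
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same = cosine_similarity("the cat sat", "the cat sat")
partial = cosine_similarity("the cat sat", "the cat ran")
unrelated = cosine_similarity("cat", "dog")
```

Identical texts score 1, texts with no shared words score 0, and partial overlap falls in between.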
Clustering (Unsupervised learning)
- Group similar things together
Embedding
- Representing data (e.g. images)
Embedding
- Representing data (e.g. words)
Reinforcement Learning
- Training by feedback
- Have an agent:
  - make sensor observations
  - select action
  - receive rewards
  - compute a strategy to maximize expected rewards
  - balance immediate reward and exploration
pic from: http://www.todayifoundout.com/index.php/2013/08/the-history-of-pac-man/
Neural Nets and Deep Learning
picture from: https://mapr.com/blog/demystifying-ai-ml-dl/
Data in ML
- ML is all about finding patterns in the data
  - typically, big data is required
- Training data
  - find patterns in the data
  - train a function to minimize mistakes (learn)
- Development data
  - Tune the parameters
  - Perform error analysis
- Test data
  - Never ever learn on the test data (why?)
  - But even if you never look at your test data but keep using the same test data...
  - The case of Wall Street Journal, section 23
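The three-way split can be sketched as below. The 80/10/10 proportions, the fixed random seed, and the function name `split_data` are assumptions made for this illustration; the lecture does not specify them.

```python
import random


def split_data(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle once, then cut into training, development and test portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]                # learn parameters here
    dev = shuffled[n_train:n_train + n_dev]   # tune and do error analysis here
    test = shuffled[n_train + n_dev:]         # touch only for final evaluation
    return train, dev, test


# Hypothetical dataset of 100 example ids.
train, dev, test = split_data(range(100))
```

The three portions do not overlap, so nothing the model learns from comes from the test portion.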
Decision function: separates data points
- Train the function on the training data
- Given a new ("test") point, know which side of the DF it falls on
- Parameters, e.g. feature weights, are optimized:
  - which θ = P(heads) makes the data HHHTT most probable
  - or: find such a vector w that the decision function φ(w1 f1, w2 f2, w3 f3) makes the fewest mistakes
  - (f1 is the value for feature 1, e.g. yesterday's temperature)
pic from: Koprowski et al. (2012)
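The coin question on this slide can be checked numerically. The sketch below assumes independent flips and searches a grid of candidate θ values in steps of 0.01, which is a choice made for this example. It then compares the result with the closed-form estimate, the fraction of heads.

```python
def likelihood(theta, flips):
    """Probability of the observed flip sequence if P(heads) = theta."""
    p = 1.0
    for flip in flips:
        p *= theta if flip == "H" else (1.0 - theta)
    return p


flips = "HHHTT"

# Try theta = 0.00, 0.01, ..., 1.00 and keep the most probable one.
candidates = [i / 100 for i in range(101)]
best_theta = max(candidates, key=lambda t: likelihood(t, flips))

# Closed-form maximum likelihood estimate: heads / total flips.
closed_form = flips.count("H") / len(flips)
```

Both routes give θ = 0.6, i.e. three heads out of five flips.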
Features
- Extract informative features (e.g. whisker length, ear size...)
- Turn them into numbers somehow
- How important is each feature? (weights)
- Given the below training dataset, how important is whisker length? What about color? What about tail position?
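To make "feature weights" concrete, here is a minimal sketch that learns them with a perceptron. The perceptron is one simple learning rule picked for illustration, not necessarily the method the lecture has in mind. The two features (whisker length, ear size), their values, and the labels (+1 cat, -1 dog) are invented.

```python
def predict(weights, bias, features):
    """Linear decision function: which side of the boundary is the point on?"""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else -1


def train_perceptron(examples, epochs=10):
    """Nudge the weights toward the correct side after every mistake."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            if predict(weights, bias, features) != label:
                weights = [w + label * f for w, f in zip(weights, features)]
                bias += label
    return weights, bias


# Hypothetical training set: (whisker_length, ear_size) -> +1 cat, -1 dog.
training_data = [
    ((6.0, 3.0), 1),
    ((7.0, 2.5), 1),
    ((2.0, 6.0), -1),
    ((3.0, 7.0), -1),
]
weights, bias = train_perceptron(training_data)
```

On this toy data the learned weight for whisker length is positive and the weight for ear size is negative, so long whiskers push the decision toward "cat" and large ears push it toward "dog".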
Loss function
- On the training data, observe the true label (value) and penalize the mistakes
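Two standard ways to "penalize the mistakes" are sketched below: zero-one loss for labels and mean squared error for numeric values. These are common textbook choices used as examples here; the slide does not say which loss the lecture uses.

```python
def zero_one_loss(true_labels, predicted_labels):
    """Fraction of examples whose predicted label is wrong."""
    mistakes = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return mistakes / len(true_labels)


def mean_squared_error(true_values, predicted_values):
    """Average squared difference between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared) / len(squared)


# Hypothetical predictions for a spam filter (one mistake out of four).
label_loss = zero_one_loss(["spam", "ham", "spam", "ham"],
                           ["spam", "spam", "spam", "ham"])

# Hypothetical numeric predictions (one prediction is off by 2).
value_loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Training then amounts to choosing the parameters that make such a loss small on the training data.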
Bias-Variance Tradeoff
Fundamental questions in ML (according to Mitchell, 2017)
- How can computers improve performance through experience?
- Which theoretical laws govern learning systems?
- Think again about what NLP's fundamental questions are
- What about Linguistics' fundamental questions?
  - Acquisition (the Holy Grail, for some linguists?)
ML perspectives
- ML as optimization
  - E.g. optimize a loss function to get better predictions
- ML as probabilistic inference
  - E.g. derive a function that makes the data most probable (recall MLE, maximum likelihood estimation)
- ML as parametric programming
  - E.g. Deep Learning networks instantiate a specific program out of a set of possible programs
- ML as evolutionary search :)
  - Is evolution an ML phenomenon?
- Think again about the research question of NLP... we want to understand something about the world through language...
ML: Key results
- No free lunch... no system has any basis to reliably classify new examples that go beyond those it has already seen...
- Three sources of error: bias, variance, and unavoidable error
  - Unavoidable error: some probability of us being wrong
- Overfitting: when True error > Train error
- What is the relationship between True error and Test error?
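An extreme case of overfitting is a model that only memorizes its training examples. The sketch below uses a hypothetical `Memorizer` class and invented sentiment data. It gets zero error on the training set and a much higher error on held-out data, where the held-out error serves as an estimate of the true error.

```python
from collections import Counter


class Memorizer:
    """Looks up training examples verbatim; guesses the majority label otherwise."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        self.default = Counter(ys).most_common(1)[0][0]

    def predict(self, x):
        return self.table.get(x, self.default)


def error_rate(model, xs, ys):
    """Fraction of examples the model labels incorrectly."""
    return sum(model.predict(x) != y for x, y in zip(xs, ys)) / len(xs)


# Hypothetical one-word sentiment data.
train_x = ["great", "awful", "good", "bad", "nice"]
train_y = ["pos", "neg", "pos", "neg", "pos"]
test_x = ["great", "terrible", "horrible", "fine"]
test_y = ["pos", "neg", "neg", "pos"]

model = Memorizer()
model.fit(train_x, train_y)
train_error = error_rate(model, train_x, train_y)
test_error = error_rate(model, test_x, test_y)
```

The gap between the two numbers is what the overfitting bullet describes: performance on seen data says little about unseen data.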
Overfitting
Bayesian Networks and Graphical Models
- Discover some structure in and analyze complex data distributions
https://stats.stackexchange.com/questions/249392/how-to-calculate-causal-inference-in-bayesian-networks
Discriminative and Generative models
- Generative:
  - Learn joint distribution P(x,y) (from which the conditional can be inferred)
  - Need to make more assumptions
  - Can generate data (x,y)
  - Based on my generation assumptions, which category is most likely to generate this observation?
  - Example: Naive Bayes classifier, HMM
- Discriminative:
  - Learn conditional probability P(y|x) directly
  - Need fewer constraints/assumptions
  - Which class to predict given observation?
  - Does not care about how the data was generated
  - Example: Logistic Regression (Maximum Entropy classifier)
- While generative models sound more generally useful, discriminative models often perform better
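The phrase "from which the conditional can be inferred" can be written out with standard probability identities. The hat notation for the predicted class is introduced here for illustration only.

```latex
\begin{align*}
P(y \mid x) &= \frac{P(x, y)}{P(x)} = \frac{P(x, y)}{\sum_{y'} P(x, y')} \\
\hat{y}_{\text{generative}} &= \arg\max_{y} P(x, y) = \arg\max_{y} P(x \mid y)\, P(y) \\
\hat{y}_{\text{discriminative}} &= \arg\max_{y} P(y \mid x)
\end{align*}
```

For a fixed observation x, P(x) does not depend on y, so both rules pick the same class if the distributions are estimated exactly. The difference lies in what gets modeled and learned.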
Discriminative and Generative models
- x=1: cat goes outside; x=0: cat stays indoors
- y=1: cat catches mouse; y=0: cat does not catch mouse
- Observe the cat for 10 days and get the following data (x,y):
  (0,1), (0,0), (0,0), (1,0), (1,0), (1,0), (1,1), (1,1), (1,1), (1,1)
- Joint probability of both events happening, P(x,y):
         y=0    y=1
  x=0    0.2    0.1
  x=1    0.3    0.4
- Conditional P(y|x), where the choice of x value is fixed:
         y=0    y=1
  x=0    0.67   0.33
  x=1    0.43   0.57
- Now suppose you want to artificially create more observations (e.g. for a computer game about a cat). If you generate N more observations using P(x,y), will you end up with the same probabilities of events?
- Can you use P(y|x)? But how to determine how many times x was equal to 0 and to 1?
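The two tables can be recomputed directly from the ten observations. The sketch below uses the data from the slide; the variable names are chosen for this example. It also computes the marginal P(x), which is the piece that P(y|x) alone does not provide.

```python
from collections import Counter

# (x, y) observations from the slide: x = cat goes outside, y = catches mouse.
observations = [(0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0),
                (1, 1), (1, 1), (1, 1), (1, 1)]

pair_counts = Counter(observations)
x_counts = Counter(x for x, _ in observations)
n = len(observations)

# Joint P(x, y): divide each pair count by the total number of days.
joint = {pair: count / n for pair, count in pair_counts.items()}

# Conditional P(y | x): divide each pair count by the count of its x value.
conditional = {(x, y): count / x_counts[x]
               for (x, y), count in pair_counts.items()}

# Marginal P(x): how often the cat went outside or stayed in.
marginal_x = {x: count / n for x, count in x_counts.items()}
```

Multiplying `marginal_x[x]` by `conditional[(x, y)]` recovers `joint[(x, y)]`, so to generate new (x, y) pairs from the conditional you additionally need to know P(x).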
Deep Neural Networks
- A family of ML algorithms where simple units are combined to perform a larger computation
  - Simultaneously train millions of parameters (for all simple units)
- Development in types of units used
  - LSTM (Long Short-Term Memory) units
  - Structural questions are asked about the data
  - But how well can we tell what is going on in the end?
- Specific architectures for specific problems
  - Good performance... in domain
  - How to generalize knowledge from here?
- Representation learning
  - Learn new representation of data in hidden layers
  - E.g. progress in relating text to images
Other issues
- PAC learning theory (upper error bounds)
- Ensemble learning
- Semi-supervised learning and Active learning
- Kernel methods (changing dimensionality of data)
- Reinforcement learning
Where is ML headed next?
- Will ML change the way we think about human learning?
- Human-machine (learning) interaction
- ML by reading
- Note that both directions involve natural language understanding
ML & NLP
- Natural Language Understanding (NLU) is in demand in the industry
- At the same time, NLP is behind e.g. machine vision (in using deep learning)
- Researchers are after performance improvement on classic tasks, defining new interesting tasks, and understanding how learning works through language input
- ML is the dominant paradigm in today's NLP
ML & Linguistics
- ML is often employed for automatically tagging data, e.g. to get access to larger annotated corpora (bootstrap from a smaller sample)
- Is it likely to discover something linguistically valid about language via statistics? Definitely! But can we learn everything we want this way?
- The question of how learning happens is equally interesting to most linguists... however, most related questions go beyond well-defined linguistic theories
- Note that e.g. NLU as defined in NLP is not, strictly speaking, a linguistic task
  - E.g. semantic theory is not trying to model world knowledge
- ML is not the dominant paradigm in Linguistics
Python libraries for ML
- http://scikit-learn.org/stable/install.html
  - comes with good documentation and (usually small) examples!
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
  - sample datasets and sample code are available
What you need to know
- ML fundamental questions
- Training, Dev, Test data
- Decision vs. Loss function (what are they for)
- Overfitting, Sources of error
- No free lunch
- Regression vs. Classification
- Optimization (why is it an important perspective)
- Features and feature weights (what is their role)
- Role of conditional probability (why is it used)
- Difference between conditional and joint probability in terms of generating data