Machine Learning for NLP


1 Natural Language Processing SoSe 2014 Machine Learning for NLP Dr. Mariana Neves April 30th, 2014 (based on the slides of Dr. Saeedeh Momtazi)

2 Introduction
"Field of study that gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959)
Learning methods:
- Supervised learning
- Active learning
- Unsupervised learning
- Semi-supervised learning
- Reinforcement learning

3 Outline
- Supervised learning
- Semi-supervised learning
- Unsupervised learning


5 Supervised Learning
Example: mortgage credit decision
[Figure: items plotted by age and income]

6 Supervised Learning
[Figure: items plotted by age and income, with a new item marked "?"]

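7 Classification
Training: items T1, T2, ..., Tn with categories C1, C2, ..., Cn and features F1, F2, ..., Fn are used to build Model(F, C)
Testing: a new item Tn+1 with features Fn+1 is given to the model, which predicts its unknown category Cn+1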

8 Applications
Problems                     Items      Categories
POS tagging                  Word       POS
Named entity recognition     Word       Named entity
Word sense disambiguation    Word       Word's sense
Spam mail detection          Document   Spam/Not Spam
Language identification      Document   Language
Text categorization          Document   Topic
Information retrieval        Document   Relevant/Not relevant

9 Part-of-speech tagging
[Figure; source: http://weaver.nlplab.]

10 Named entity recognition
[Figure; source: http://corpora.informatik.hu-berlin.de/index.]

11 Word sense disambiguation

12 Spam mail detection

13 Language identification

14 Text categorization


16 Classification algorithms
- K Nearest Neighbor
- Support Vector Machines
- Naïve Bayes
- Maximum Entropy
- Linear Regression
- Logistic Regression
- Neural Networks
- Decision Trees
- Boosting


18 K Nearest Neighbor
[Figure: labeled items and a new item marked "?"]

20 K Nearest Neighbor
1-nearest neighbor: the new item takes the category of its single closest labeled item

21 K Nearest Neighbor
3-nearest neighbors: the new item takes the majority category among its three closest labeled items
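
The nearest-neighbor decision can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: it assumes Euclidean distance and a simple majority vote, and the (age, income) points and their "approve"/"reject" labels are made up.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority category among the k training items closest to query."""
    # Rank the labeled items by Euclidean distance to the query (an assumed metric).
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical (age, income in thousands) items with made-up credit decisions.
train = [
    ((30, 40), "reject"),
    ((32, 50), "approve"),
    ((34, 52), "approve"),
    ((60, 20), "reject"),
    ((55, 90), "approve"),
]
query = (30, 44)

one_nn = knn_classify(train, query, k=1)    # decided by the single closest item
three_nn = knn_classify(train, query, k=3)  # decided by a vote among three items
print(one_nn, three_nn)
```

With these points the closest item is a "reject", but two of the three closest items are "approve", so the answer changes with k.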


24 Support vector machines

25 Support vector machines
Find a hyperplane in the vector space that separates the items of the two categories

26 Support vector machines
There might be more than one possible separating hyperplane

27 Support vector machines
Find the hyperplane with maximum margin
Vectors at the margins are called support vectors
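
To make "maximum margin" concrete, the sketch below compares two candidate separating hyperplanes and keeps the one whose closest item is farthest away. It does not solve the SVM optimization problem; the points and the two candidate hyperplanes (named A and B here) are invented for illustration.

```python
import math

def margin(w, b, points):
    """Distance from the hyperplane w.x + b = 0 to its closest item.

    Returns None if the hyperplane puts any item on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    distances = []
    for x, y in points:  # y is +1 or -1
        signed = y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        if signed <= 0:
            return None
        distances.append(signed)
    return min(distances)

# Made-up two-category data.
points = [((1, 1), -1), ((2, 1), -1), ((4, 4), +1), ((5, 4), +1)]

# Two hypothetical separating hyperplanes, each given as (w, b).
candidates = {
    "A": ((1.0, 0.0), -3.0),  # the vertical line x1 = 3
    "B": ((1.0, 1.0), -5.0),  # the diagonal line x1 + x2 = 5
}

margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both candidates separate the data, but B leaves more room around the closest item, so it is the better of the two under the maximum-margin criterion.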


29 Naïve Bayes
Selecting the class with the highest probability
Minimizing the number of items with wrong labels
$$\hat{c} = \arg\max_{c_i} P(c_i)$$
The probability should depend on the data to be classified ($d$): $P(c_i \mid d)$

30 Naïve Bayes
$$\hat{c} = \arg\max_{c_i} P(c_i)$$
$$\hat{c} = \arg\max_{c_i} P(c_i \mid d)$$
$$\hat{c} = \arg\max_{c_i} \frac{P(d \mid c_i)\, P(c_i)}{P(d)}$$
$$\hat{c} = \arg\max_{c_i} P(d \mid c_i)\, P(c_i)$$
The third line applies Bayes' rule; $P(d)$ is the same for every class, so it can be dropped from the argmax.

31 Naïve Bayes
$$\hat{c} = \arg\max_{c_i} P(d \mid c_i)\, P(c_i)$$
$P(c_i)$: prior probability
$P(d \mid c_i)$: likelihood probability
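
A small sketch of this decision rule for text follows. Two things in it are assumptions that the slides do not state: the likelihood P(d | c_i) is factored into a product of per-word probabilities (the usual "naive" independence assumption), and add-one smoothing is used so unseen words do not zero out a class. The toy documents are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, category). Collects the counts used below."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in docs:
        class_counts[c] += 1
        word_counts[c].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Return argmax over classes of log P(c) + sum of log P(w | c)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / n_docs)   # log prior
        total = sum(word_counts[c].values())
        for w in words:
            if w not in vocab:
                continue                              # ignore unknown words
            # Add-one smoothed likelihood (an assumption, see lead-in).
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Made-up training documents.
docs = [
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting on monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_nb(docs)
print(predict_nb(model, "money offer now".split()))
print(predict_nb(model, "monday meeting".split()))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.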


33 Spam mail detection
Features:
- words
- sender's
- contains links
- contains attachments
- contains money amounts

34 Feature selection
Bag-of-words:
- Each document can be represented by the set of words that appear in the document
- Result is a high dimensional feature space
- The process is computationally expensive
Solution:
- Using a feature selection method to select informative words
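
As a rough illustration of the bag-of-words representation, the sketch below turns each document into a binary vector over the vocabulary. Lowercasing and whitespace tokenization are simplifying assumptions, and the three example documents are invented.

```python
def bag_of_words(documents):
    """Return (vocabulary, vectors), one 0/1 presence vector per document."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*tokenized))
    vectors = [[1 if w in words else 0 for w in vocabulary] for words in tokenized]
    return vocabulary, vectors

# Made-up documents.
documents = ["Win money now", "Meeting on Monday", "Money for the meeting"]
vocabulary, vectors = bag_of_words(documents)

print(vocabulary)
for vec in vectors:
    print(vec)
```

Every distinct word adds one dimension, which is why the feature space grows quickly on real collections.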

35 Feature selection methods
- Information gain
- Mutual information
- χ-square

36 Information gain
Measuring the number of bits required for category prediction w.r.t. the presence or absence of a term in the document
Removing words whose information gain is less than a predefined threshold
$$IG(w) = -\sum_{i=1}^{K} P(c_i) \log P(c_i) + P(w) \sum_{i=1}^{K} P(c_i \mid w) \log P(c_i \mid w) + P(\bar{w}) \sum_{i=1}^{K} P(c_i \mid \bar{w}) \log P(c_i \mid \bar{w})$$

37 Information gain
$N$ = # docs
$N_i$ = # docs in category $c_i$
$N_w$ = # docs containing $w$
$N_{\bar{w}}$ = # docs not containing $w$
$N_{iw}$ = # docs in category $c_i$ containing $w$
$N_{i\bar{w}}$ = # docs in category $c_i$ not containing $w$
$$P(c_i) = \frac{N_i}{N} \qquad P(w) = \frac{N_w}{N} \qquad P(\bar{w}) = \frac{N_{\bar{w}}}{N}$$
$$P(c_i \mid w) = \frac{N_{iw}}{N_w} \qquad P(c_i \mid \bar{w}) = \frac{N_{i\bar{w}}}{N_{\bar{w}}}$$
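
The information gain of a word can be computed directly from document counts, as sketched below. The probabilities are estimated as relative frequencies, the logarithm is taken base 2 so the result is in bits, and 0 log 0 is treated as 0; the four toy documents are made up.

```python
import math

def information_gain(docs, word):
    """docs: list of (set_of_words, category). Returns IG(word) in bits."""
    n = len(docs)
    categories = {c for _, c in docs}
    all_labels = [c for _, c in docs]
    with_w = [c for words, c in docs if word in words]
    without_w = [c for words, c in docs if word not in words]

    def plogp_sum(labels):
        # sum_i P(c_i) log2 P(c_i) over the given subset, with 0 log 0 = 0
        total = 0.0
        for c in categories:
            p = labels.count(c) / len(labels) if labels else 0.0
            if p > 0:
                total += p * math.log2(p)
        return total

    ig = -plogp_sum(all_labels)                          # category entropy
    ig += len(with_w) / n * plogp_sum(with_w)            # P(w) term
    ig += len(without_w) / n * plogp_sum(without_w)      # P(not w) term
    return ig

# Made-up documents, each reduced to its set of words.
docs = [
    ({"the", "money", "now"}, "spam"),
    ({"the", "money", "offer"}, "spam"),
    ({"the", "meeting", "monday"}, "ham"),
    ({"the", "meeting", "notes"}, "ham"),
]
for w in ["money", "now", "the"]:
    print(w, round(information_gain(docs, w), 3))
```

A word that appears in every document ("the") carries no information about the category, while a word that appears in exactly one category ("money") removes all uncertainty.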

38 Mutual information
Measuring the effect of each word in predicting the category
How much does its presence or absence in a document contribute to category prediction?
$$MI(w, c_i) = \log \frac{P(w, c_i)}{P(w)\, P(c_i)}$$
Removing words whose mutual information is less than a predefined threshold, using either the maximum or the weighted average over categories:
$$MI(w) = \max_i MI(w, c_i)$$
$$MI(w) = \sum_i P(c_i)\, MI(w, c_i)$$
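
Both ways of combining the per-category scores can be sketched as follows. Probabilities are estimated as relative document frequencies and the logarithm is base 2, both of which are assumptions; a word that never occurs in a category gets a score of minus infinity for it. The documents are made up.

```python
import math

def mutual_information(docs, word, category):
    """docs: list of (set_of_words, category). Returns MI(word, category)."""
    n = len(docs)
    n_w = sum(1 for words, _ in docs if word in words)
    n_c = sum(1 for _, c in docs if c == category)
    n_wc = sum(1 for words, c in docs if word in words and c == category)
    if n_wc == 0:
        return -math.inf  # log of zero joint probability
    return math.log2((n_wc / n) / ((n_w / n) * (n_c / n)))

def mi_max(docs, word):
    categories = {c for _, c in docs}
    return max(mutual_information(docs, word, c) for c in categories)

def mi_avg(docs, word):
    n = len(docs)
    categories = {c for _, c in docs}
    return sum(
        sum(1 for _, c2 in docs if c2 == c) / n * mutual_information(docs, word, c)
        for c in categories
    )

# Made-up documents, each reduced to its set of words.
docs = [
    ({"the", "money", "now"}, "spam"),
    ({"the", "money", "offer"}, "spam"),
    ({"the", "meeting", "monday"}, "ham"),
    ({"the", "meeting", "notes"}, "ham"),
]
print(mi_max(docs, "money"), mi_max(docs, "the"), mi_avg(docs, "the"))
```

The maximum rewards a word that is strongly tied to a single category, whereas the weighted average is dragged down by categories in which the word never occurs.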

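39 χ-square
Measuring the dependencies between words and categories
$$\chi^2(w, c_i) = \frac{N \left( N_{iw} N_{\bar{i}\bar{w}} - N_{i\bar{w}} N_{\bar{i}w} \right)^2}{(N_{iw} + N_{i\bar{w}})\,(N_{\bar{i}w} + N_{\bar{i}\bar{w}})\,(N_{iw} + N_{\bar{i}w})\,(N_{i\bar{w}} + N_{\bar{i}\bar{w}})}$$
where $N_{\bar{i}w}$ and $N_{\bar{i}\bar{w}}$ are the numbers of docs outside category $c_i$ that do and do not contain $w$
Ranking words based on their χ-square measure
$$\chi^2(w) = \sum_{i=1}^{K} P(c_i)\, \chi^2(w, c_i)$$
Selecting the top words as features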
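
The χ-square score follows from the four cells of the word/category contingency table. The sketch below uses the standard four-cell form of the statistic and returns 0 when a row or column of the table is empty, which is a convention chosen here rather than something the slides specify; the documents are made up.

```python
def chi_square(docs, word, category):
    """docs: list of (set_of_words, category). Returns chi-square(word, category)."""
    n = len(docs)
    a = sum(1 for ws, c in docs if word in ws and c == category)        # in c, has w
    b = sum(1 for ws, c in docs if word in ws and c != category)        # not in c, has w
    c_ = sum(1 for ws, c in docs if word not in ws and c == category)   # in c, no w
    d = sum(1 for ws, c in docs if word not in ws and c != category)    # not in c, no w
    denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
    if denom == 0:
        return 0.0  # empty row or column: treated as no evidence of dependence
    return n * (a * d - c_ * b) ** 2 / denom

def chi_square_avg(docs, word):
    """Weighted average over categories, with P(c_i) as the weights."""
    n = len(docs)
    categories = {c for _, c in docs}
    return sum(
        sum(1 for _, c2 in docs if c2 == c) / n * chi_square(docs, word, c)
        for c in categories
    )

# Made-up documents, each reduced to its set of words.
docs = [
    ({"the", "money", "now"}, "spam"),
    ({"the", "money", "offer"}, "spam"),
    ({"the", "meeting", "monday"}, "ham"),
    ({"the", "meeting", "notes"}, "ham"),
]
vocabulary = sorted(set().union(*(ws for ws, _ in docs)))
ranked = sorted(vocabulary, key=lambda w: chi_square_avg(docs, w), reverse=True)
print(ranked)
```

Keeping only the first few entries of `ranked` corresponds to selecting the top words as features.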

40 Feature selection
These models perform well for document-level classification:
- Spam mail detection
- Language identification
- Text categorization
Word-level classification might need other types of features:
- Part-of-speech tagging
- Named entity recognition

41 Supervised learning
Shortcoming
- Relies heavily on annotated data
- Annotation is a time-consuming and expensive task
Solution: active learning
- Using a minimum amount of annotated data
- Having humans annotate further data only if it is very informative

42 Active learning

43 Active learning
- Annotating a small amount of data

44 Active learning
- Calculating the confidence score of the classifier on unlabeled data
[Figure: unlabeled items marked H, L, M, L]

45 Active learning
- Finding the informative unlabeled data (data with lowest confidence)
- Manually annotating the informative data
[Figure: unlabeled items marked H, L, M, L]
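
The selection step can be sketched as picking the unlabeled items on which the classifier is least confident. The confidence scores below are made up, and the `budget` parameter (how many items a human annotates per round) is a hypothetical name introduced for this example.

```python
def select_for_annotation(confidences, budget):
    """Return the indices of the `budget` unlabeled items with the lowest confidence."""
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:budget]

# Made-up classifier confidences for four unlabeled items.
confidences = [0.95, 0.55, 0.70, 0.52]
to_annotate = select_for_annotation(confidences, budget=2)
print(to_annotate)
```

The selected items would then be labeled by a human, added to the training set, and the classifier retrained before the next round.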


47 Semi-supervised learning
Annotating data is a time-consuming and expensive task
Solution
- Using a minimum amount of annotated data
- Annotating further data automatically

48 Semi-supervised learning
- A small amount of labeled data

49 Semi-supervised learning
- A large amount of unlabeled data

50 Semi-supervised learning
- Finding the similarity between the labeled and unlabeled data
- Predicting the labels of the unlabeled data

51 Semi-supervised learning
- Training the classifier using labeled data and predicted labels of unlabeled data

52 Semi-supervised learning
- Risk: introducing a lot of noisy data into the system
- Remedy: adding unlabeled data to the training set only if the predicted label has a high confidence
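
One common way to realize this loop is self-training, sketched below under several assumptions: items are single numbers, there are exactly two classes, the classifier is a nearest-centroid rule, and the confidence score and the 0.8 threshold are invented for the example.

```python
def centroids(labeled):
    """Mean value per label from (value, label) pairs."""
    sums = {}
    for x, y in labeled:
        s, n = sums.get(y, (0.0, 0))
        sums[y] = (s + x, n + 1)
    return {y: s / n for y, (s, n) in sums.items()}

def predict(cents, x):
    """Return (label, confidence) for a two-class problem; confidence is in [0.5, 1]."""
    (d1, y1), (d2, _) = sorted((abs(x - c), y) for y, c in cents.items())
    conf = d2 / (d1 + d2) if d1 + d2 > 0 else 0.5
    return y1, conf

def self_train(labeled, unlabeled, threshold=0.8, rounds=5):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            y, conf = predict(cents, x)
            # Only high-confidence predictions join the training set.
            (confident if conf >= threshold else remaining).append((x, y))
        if not confident:
            break
        labeled.extend(confident)
        unlabeled = [x for x, _ in remaining]
    return labeled, unlabeled

# A small labeled set and a larger unlabeled set (all values made up).
labeled = [(1.0, "a"), (9.0, "b")]
unlabeled = [2.0, 8.0, 5.2, 3.5]
new_labeled, still_unlabeled = self_train(labeled, unlabeled)
print(new_labeled, still_unlabeled)
```

Items near the boundary between the two classes stay unlabeled, which limits the amount of noise that enters the training set.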


54 Supervised Learning
[Figure: items plotted by age and income, with a new item marked "?"]

55 Unsupervised Learning
[Figure: items plotted by age and income]

56 Unsupervised Learning
[Figure: items plotted by age and income; source: http://nationalmortgageprofessional.]

57 Clustering
- Calculating similarities between the data items
- Assigning similar data items to the same cluster

58 Applications
Word clustering
- Speech recognition
- Machine translation
- Named entity recognition
- Information retrieval
- ...
Document clustering
- Text classification
- Information retrieval

59 Speech recognition
Computers can recognize a speech.
Computers can wreck a nice peach.
Related words in the figure:
- recognition: speech, named-entity, hand-writing
- wreck: ball, ship

60 Machine translation
The cat eats...
Die Katze frisst...
Die Katze isst...
Word groups in the figure:
- Katze, fressen, Hund, laufen
- essen, Jung, Mann

61 Language modelling
I have a meeting on Monday evening.
You should work on Wednesday afternoon.
The next session is on Thursday morning.
The talk is on Monday morning.
The talk is on Monday molding.
Word groups in the figure:
- Monday, Thursday, Friday, Sunday, Saturday, Tuesday
- morning, afternoon, evening, night

62 Clustering algorithms
Flat
- K-means
Hierarchical
- Top-Down (Divisive)
- Bottom-Up (Agglomerative)
  - Single-link
  - Complete-link
  - Average-link

63 K-means
The best known clustering algorithm
Works well for many cases
Used as default/baseline for clustering documents
Defining each cluster center as the mean or centroid of the items in the cluster:
$$\mu = \frac{1}{|c|} \sum_{x \in c} x$$
Minimizing the average squared Euclidean distance of the items from their cluster centers

64 K-means
Initialization: randomly choose k items as initial centroids
while stopping criterion has not been met do
    for each item do
        Find the nearest centroid
        Assign the item to the cluster associated with the nearest centroid
    end for
    for each cluster do
        Update the centroid of the cluster based on the average of all items in the cluster
    end for
end while
Iterating two steps:
- Re-assignment: assigning each vector to its closest centroid
- Re-computation: computing each centroid as the average of the vectors that were assigned to it in re-assignment
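
The pseudocode above translates almost line by line into Python. In this sketch the stopping criterion is "no assignment changed" (with an iteration cap as a safeguard), an empty cluster keeps its previous centroid, and the seed and the six 2-D points are made up; all three choices are assumptions, not part of the slides.

```python
import math
import random

def kmeans(items, k, max_iters=100, seed=0):
    """Return (centroids, assignment) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(items, k)  # initialization: k random items
    assignment = None
    for _ in range(max_iters):
        # Re-assignment: each item goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(x, centroids[j])) for x in items
        ]
        if new_assignment == assignment:  # stopping criterion: nothing changed
            break
        assignment = new_assignment
        # Re-computation: each centroid becomes the mean of its items.
        for j in range(k):
            members = [x for x, a in zip(items, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Two well-separated groups of made-up points.
items = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(items, k=2)
print(centroids)
print(assignment)
```

Because the initial centroids are random, the cluster indices may swap between runs with different seeds, and on less clearly separated data the result can differ as well.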

65 K-means

66 Hierarchical Agglomerative Clustering (HAC)
Creating a hierarchy in the form of a binary tree

68 Hierarchical Agglomerative Clustering (HAC)
Initial mapping: put a single item in each cluster
while the predefined number of clusters has not been reached do
    for each pair of clusters do
        Measure the similarity of the two clusters
    end for
    Merge the two clusters that are most similar
end while
Measuring the similarity in three ways:
- Single-link
- Complete-link
- Average-link

69 Hierarchical Agglomerative Clustering (HAC)
Single-link / single-linkage clustering
Based on the similarity of the most similar members

70 Hierarchical Agglomerative Clustering (HAC)
Complete-link / complete-linkage clustering
Based on the similarity of the most dissimilar members

71 Hierarchical Agglomerative Clustering (HAC)
Average-link / average-linkage clustering
Based on the average of all similarities between the members
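
The three linkage criteria differ only in how the distance between two clusters is computed, as the sketch below shows. It uses Euclidean distance, so "most similar" means "smallest distance"; that choice, the `linkage` parameter name, and the sample points are assumptions made for illustration.

```python
import math

def hac(items, n_clusters, linkage="single"):
    """Merge clusters bottom-up until only n_clusters remain."""
    if linkage not in ("single", "complete", "average"):
        raise ValueError("unknown linkage: " + linkage)
    clusters = [[x] for x in items]  # initial mapping: one item per cluster

    def cluster_distance(a, b):
        d = [math.dist(x, y) for x in a for y in b]
        if linkage == "single":
            return min(d)            # closest (most similar) members
        if linkage == "complete":
            return max(d)            # farthest (most dissimilar) members
        return sum(d) / len(d)       # average over all member pairs

    while len(clusters) > n_clusters:
        pairs = [
            (i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))
        ]
        # Pick the pair of clusters with the smallest distance and merge it.
        i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up points forming two obvious groups.
items = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
for linkage in ("single", "complete", "average"):
    print(linkage, hac(items, n_clusters=2, linkage=linkage))
```

Recording the sequence of merges instead of stopping at a fixed number of clusters would give the binary tree of the full hierarchy.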

72 Hierarchical Agglomerative Clustering (HAC)

73 This is not clustering, just word frequencies
[Figure; source: http://www.wordle.]

