CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES
Text Data: Topic Models
Instructor: Yizhou Sun
yzsun@ccs.neu.edu
February 17, 2016

Methods to Learn
Data types: Matrix Data; Text Data; Set Data; Sequence Data; Time Series; Graph & Network; Images
- Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN; HMM; Label Propagation; Neural Network
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means*; PLSA; SCAN; Spectral Clustering
- Frequent Pattern Mining: Apriori; FP-growth; GSP; PrefixSpan
- Prediction: Linear Regression; Autoregression; Collaborative Filtering
- Similarity Search: DTW; P-PageRank
- Ranking: PageRank

Text Data: Topic Models
- Text Data and Topic Models
- Probabilistic Latent Semantic Analysis
- Summary

Text Data
- Word/term
- Document: a bag of words
- Corpus: a collection of documents

Represent a Document
- Most common way: bag-of-words
  - Ignore the order of words
  - Keep the count
[Figure: example word-count table with labels c1-c5 and m1-m4]

More Details
- Represent the document as a vector in which each entry corresponds to a different word, and the number at that entry is how many times that word occurs in the document (or some function of that count).
- The number of words is huge.
  - Select and use a smaller set of words that are of interest.
  - E.g., uninteresting words: and, the, at, is, etc. These are called stop-words.
  - Stemming: remove endings. E.g., learn, learning, learnable, learned can all be replaced by the single stem learn.
  - Other simplifications can also be invented and used.
- The set of distinct remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can refer to them by index.
- Can be extended to bigrams, trigrams, and so on.
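The representation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the stop-word list, the example sentences, and the function names `tokenize`, `build_vocabulary`, and `bag_of_words` are all assumptions made for the example, and stemming is left out to keep it short.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"and", "the", "at", "is", "a", "of", "to", "in"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop-words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_vocabulary(docs):
    """Fix an ordering of the remaining terms so each term has an index."""
    terms = sorted({t for doc in docs for t in tokenize(doc)})
    return {term: i for i, term in enumerate(terms)}

def bag_of_words(text, vocab):
    """Return a count vector with one entry per vocabulary term."""
    counts = Counter(tokenize(text))
    vec = [0] * len(vocab)
    for term, c in counts.items():
        if term in vocab:  # terms outside the vocabulary are ignored
            vec[vocab[term]] = c
    return vec

# Made-up two-document corpus.
docs = ["The cat sat at the mat", "The dog and the cat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

Word order is discarded: only the per-term counts survive, indexed by the fixed vocabulary ordering.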

Topics
- Topic
  - A topic is represented by a word distribution
  - Relates to an issue

Topic Models
- Topic modeling
  - Get topics automatically from a corpus
  - Assign documents to topics automatically
- Most frequently used topic models
  - pLSA
  - LDA

Notations
- Word, document, topic: $w$, $d$, $z$
- Word count in document: $c(w, d)$
- Word distribution for each topic ($\beta_z$)
  - $\beta_{zw}$: $p(w \mid z)$
- Topic distribution for each document ($\theta_d$)
  - $\theta_{dz}$: $p(z \mid d)$ (yes, fuzzy clustering)

Review of Multinomial Distribution
- Select $n$ data points from $K$ categories, each with probability $p_k$
  - $n$ independent trials of a categorical distribution
  - E.g., roll a die to get 1-6, each with probability 1/6
- When $K = 2$: binomial distribution
  - $n$ independent trials of a Bernoulli distribution
  - E.g., flip a coin to get heads or tails
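As a quick illustration of the review above, the following sketch draws multinomial counts with NumPy. The number of trials, the seed, and the variable names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

n = 600                       # number of independent trials
p_die = np.full(6, 1 / 6)     # K = 6 categories, each with probability 1/6

# Counts of each face after n rolls of a fair die (multinomial).
die_counts = rng.multinomial(n, p_die)

# K = 2 reduces to the binomial case: counts of heads and tails.
coin_counts = rng.multinomial(n, [0.5, 0.5])

# A single trial (n = 1) is a draw from the categorical distribution:
# the result is a one-hot vector marking the selected category.
one_roll = rng.multinomial(1, p_die)
```

The counts always sum to the number of trials; with a single trial the result is a one-hot indicator, which is the case the pLSA generative model relies on.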

Generative Model for pLSA
- Describe how a document is generated probabilistically
- For each position in $d$, $n = 1, \dots, N_d$:
  - Generate the topic for the position as $z_n \sim \mathrm{Mult}(\theta_d)$, i.e., $p(z_n = k) = \theta_{dk}$ (note: 1-trial multinomial, i.e., categorical distribution)
  - Generate the word for the position as $w_n \sim \mathrm{Mult}(\beta_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
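The generative story above can be simulated directly. The sketch below assumes a hypothetical model with 2 topics and a 4-word vocabulary; the values of `theta_d` and `beta` are made up for illustration, and `generate_document` is a name introduced here.

```python
import numpy as np

def generate_document(theta_d, beta, n_words, rng):
    """Sample one document under the pLSA generative model.

    theta_d: shape (K,), topic distribution of the document, p(z | d).
    beta:    shape (K, V), word distribution of each topic, p(w | z).
    Returns the sampled topic id and word id for every position.
    """
    n_topics, vocab_size = beta.shape
    topics = np.empty(n_words, dtype=int)
    words = np.empty(n_words, dtype=int)
    for n in range(n_words):
        z = rng.choice(n_topics, p=theta_d)    # z_n ~ Mult(theta_d), one trial
        w = rng.choice(vocab_size, p=beta[z])  # w_n ~ Mult(beta_{z_n}), one trial
        topics[n], words[n] = z, w
    return topics, words

rng = np.random.default_rng(0)  # arbitrary seed
theta_d = np.array([0.7, 0.3])              # hypothetical p(z | d)
beta = np.array([[0.5, 0.4, 0.1, 0.0],      # hypothetical p(w | z = 0)
                 [0.0, 0.1, 0.4, 0.5]])     # hypothetical p(w | z = 1)
topics, words = generate_document(theta_d, beta, n_words=20, rng=rng)
```

In a real corpus only `words` is observed; the topic assignments are latent, which is why the likelihood sums over them.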

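The Likelihood Function for a Corpus
- Probability of a word:
  $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw} \theta_{dk}$
- Likelihood of a corpus:
  $L = \prod_d \pi_d \prod_{n=1}^{N_d} p(w_n \mid d) = \prod_d \pi_d \prod_{n=1}^{N_d} \sum_k \theta_{dk} \beta_{k w_n}$
  - $\pi_d = p(d)$ is the probability of document $d$; it is usually considered uniform and can be dropped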
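Because $p(w \mid d) = \sum_k \theta_{dk} \beta_{kw}$, the word probabilities for all documents can be computed with one matrix product. The sketch below uses made-up parameter values for 2 documents, 2 topics, and 4 words.

```python
import numpy as np

# Hypothetical parameters: 2 documents, 2 topics, 4 vocabulary words.
theta = np.array([[0.7, 0.3],             # p(z | d) for document 0
                  [0.2, 0.8]])            # p(z | d) for document 1
beta = np.array([[0.5, 0.4, 0.1, 0.0],    # p(w | z = 0)
                 [0.0, 0.1, 0.4, 0.5]])   # p(w | z = 1)

# p_w_given_d[d, w] = sum_k theta[d, k] * beta[k, w]
p_w_given_d = theta @ beta
```

Each row of `p_w_given_d` is itself a distribution over the vocabulary, since it is a mixture of the topic word distributions weighted by the document's topic proportions.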

Re-arrange the Likelihood Function
- Group the same word from different positions together:
  $\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$
  s.t. $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$
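The grouped objective can be evaluated from a document-by-word count matrix. The following is a minimal sketch with made-up counts and parameters; `log_likelihood` is a name introduced here, and terms with zero count are skipped so that they contribute nothing to the sum.

```python
import numpy as np

def log_likelihood(counts, theta, beta):
    """Compute sum over (d, w) of c(w, d) * log(sum_z theta[d, z] * beta[z, w]).

    counts: shape (D, V), counts[d, w] = c(w, d).
    theta:  shape (D, K), rows sum to 1.
    beta:   shape (K, V), rows sum to 1.
    """
    p = theta @ beta       # p[d, w] = p(w | d)
    mask = counts > 0      # words absent from a document add nothing
    return float(np.sum(counts[mask] * np.log(p[mask])))

# Hypothetical counts and parameters for 2 documents and 4 words.
counts = np.array([[3, 2, 1, 0],
                   [0, 1, 2, 3]])
theta = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
beta = np.array([[0.5, 0.4, 0.1, 0.0],
                 [0.0, 0.1, 0.4, 0.5]])
ll = log_likelihood(counts, theta, beta)
```

Grouping by word means the cost depends on the number of distinct (document, word) pairs rather than on the total number of word positions.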

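Optimization: EM Algorithm
- Repeat until convergence
  - E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
    $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw} \theta_{dz}$
    (i.e., $p(z \mid w, d) = \dfrac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}$)
  - M-step: given the conditional distribution, find the parameters that maximize the expected likelihood:
    $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$
    (i.e., $\beta_{zw} = \dfrac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w', d} p(z \mid w', d)\, c(w', d)}$)
    $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$
    (i.e., $\theta_{dz} = \dfrac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$)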
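The two steps above translate almost line for line into NumPy. The sketch below is one possible implementation under stated assumptions: random initialization, a fixed number of iterations instead of a convergence test, dense arrays, and no empty documents. The function name `plsa_em`, the toy count matrix, and the recorded `history` of log-likelihood values are all introduced for illustration.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Fit pLSA by EM on a (D, V) count matrix; assumes every document is non-empty.

    Returns theta (D, K), beta (K, V), and the log-likelihood evaluated at the
    start of each iteration.
    """
    rng = np.random.default_rng(seed)
    n_docs, vocab_size = counts.shape
    # Random initialization, normalized so that each row is a distribution.
    theta = rng.random((n_docs, n_topics))
    theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((n_topics, vocab_size))
    beta /= beta.sum(axis=1, keepdims=True)
    doc_lengths = counts.sum(axis=1, keepdims=True)  # N_d for each document
    mask = counts > 0
    history = []
    for _ in range(n_iters):
        # E-step: post[d, z, w] = p(z | w, d), proportional to beta[z, w] * theta[d, z].
        joint = theta[:, :, None] * beta[None, :, :]   # shape (D, K, V)
        p_w_d = joint.sum(axis=1, keepdims=True)       # p(w | d), shape (D, 1, V)
        post = joint / np.maximum(p_w_d, 1e-300)       # guard against division by zero
        history.append(float(np.sum(counts[mask] * np.log(p_w_d[:, 0, :][mask]))))
        # M-step: re-estimate parameters from expected counts c(w, d) * p(z | w, d).
        weighted = post * counts[:, None, :]           # shape (D, K, V)
        beta = weighted.sum(axis=0)                    # sum over documents
        beta /= beta.sum(axis=1, keepdims=True)        # normalize over words
        theta = weighted.sum(axis=2) / doc_lengths     # sum over words, divide by N_d
    return theta, beta, history

# Toy corpus (made up): documents 0-1 use mostly words 0-1, documents 2-3 words 2-3.
counts = np.array([[5, 4, 0, 0],
                   [4, 5, 1, 0],
                   [0, 1, 5, 4],
                   [0, 0, 4, 5]])
theta, beta, history = plsa_em(counts, n_topics=2)
```

EM never decreases the log-likelihood, so `history` should be non-decreasing up to floating-point error. The result is a local optimum that depends on the random initialization, so different seeds can give different (and differently ordered) topics.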

Summary
- Basic concepts
  - Word/term, document, corpus, topic
  - How to represent a document
- pLSA
  - Generative model
  - Likelihood function
  - EM algorithm