INTRODUCTION TO TEXT MINING


INTRODUCTION TO TEXT MINING
Jelena Jovanovic
Email: jeljov@gmail.com
Web: http://jelenajovanovic.net

2 OVERVIEW
- What is Text Mining (TM)?
- Why is TM relevant? Why do we study it?
- Application domains
- The complexity of unstructured text (the origin of TM challenges)
- Bag-of-words representation of text
- Vector Space Model
- Methods/techniques for text pre-processing
- Assessing the relevance of individual words/phrases
- Measuring document similarity: Cosine similarity

3 WHAT IS TEXT MINING (TM)?
- The use of computational methods and techniques to extract high-quality information from text
- A computational approach to discovering new, previously unknown information and/or knowledge through the automated extraction of information from often large amounts of unstructured text

4 WHY IS TM RELEVANT / USEFUL?
- Unstructured text is present in various forms, and in huge and ever-increasing quantities: books, financial and other business reports, various kinds of business and administrative documents, news articles, blog posts, wiki pages, messages/posts on social networking and social media sites, ...
- It is estimated that ~80% of all available data is unstructured

5 WHY IS TM RELEVANT / USEFUL?
To enable effective and efficient use of such huge quantities of textual content, we need computational methods for:
- automated extraction of information from unstructured text
- analysis and summarization of the extracted information
TM research and practice are focused on the development, continual improvement, and application of such methods.

6 TM APPLICATION DOMAINS
- Document classification*
- Clustering / organizing documents
- Document summarization
- Visualization of the document space (often aimed at facilitating document search)
- Making predictions (e.g., predicting stock market prices based on the analysis of news articles and financial reports)
- Content-based recommender systems (for news articles, movies, books, ...)
*The term "document" refers to any kind of unstructured piece of text: a blog post, news article, tweet, status update, business document, ...

7 THE COMPLEXITY OF UNSTRUCTURED TEXT
In general, interpretation / comprehension of unstructured content (text, images, videos) is often easy for people, but very complex for computer programs.
In particular, automated text comprehension is difficult because human / natural language:
- is full of ambiguous terms and phrases
- often strongly relies on context and background knowledge for defining and conveying meaning
- is full of fuzzy and probabilistic terms and phrases
- is strongly based on commonsense knowledge and reasoning
- is influenced by, and influences, people's mutual interactions

8 ADDITIONAL CHALLENGES FACED BY TM
- The use of supervised machine learning (ML) methods for TM is often very expensive
  - This is caused by the need to prepare a large number of annotated documents to be used as the training dataset
  - Such a training set is essential for, e.g., document classification or the extraction of entities, relations, and events from text
- High dimensionality of the attribute space: documents are often described by numerous attributes, which further impedes the application of ML methods
  - Most often, the attributes are either all terms, or a selection of terms and/or phrases, from the collection of documents to be analyzed

9 BAG OF WORDS REPRESENTATION OF TEXT
- Considers text a simple set/bag of words
- Based on the following (unrealistic) assumptions:
  - words are mutually independent
  - word order in the text is irrelevant
- Despite its simplicity and unrealistic assumptions, this approach to text modeling has proven highly effective and is often used in TM

10 BAG OF WORDS REPRESENTATION OF TEXT
Unique words from the corpus are used to create the corpus "dictionary"; then, each document from the corpus is represented as a vector of (dictionary) word frequencies.
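
To make this concrete, here is a minimal, self-contained Java sketch of the idea (all class and variable names are illustrative, not taken from any of the libraries mentioned later): it builds the corpus dictionary and prints each document as a word-frequency vector.

    import java.util.*;

    public class BagOfWords {
        public static void main(String[] args) {
            List<String> corpus = Arrays.asList(
                "Text mining is to identify useful information",
                "Useful information is mined from text");

            // Build the corpus dictionary: the set of unique words (sorted for a stable dimension order).
            Set<String> dictionary = new TreeSet<>();
            for (String doc : corpus)
                dictionary.addAll(Arrays.asList(doc.toLowerCase().split("\\s+")));

            // Represent each document as a vector of dictionary-word frequencies.
            for (String doc : corpus) {
                Map<String, Integer> counts = new HashMap<>();
                for (String w : doc.toLowerCase().split("\\s+"))
                    counts.merge(w, 1, Integer::sum);
                List<Integer> vector = new ArrayList<>();
                for (String w : dictionary)
                    vector.add(counts.getOrDefault(w, 0));
                System.out.println(vector);
            }
        }
    }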

11 VECTOR SPACE MODEL
- A generalization of the Bag of Words model
- Each document from the corpus* is represented as a multidimensional vector
- Each unique term from the corpus represents one dimension of the vector space
  - A term can be a single word or a sequence of words (a phrase)
- The number of unique terms in the corpus determines the dimensionality of the vector space
*"corpus" refers to the collection of documents to be processed / analyzed

12 VECTOR SPACE MODEL
Vector elements are weights associated with individual terms; these weights reflect the relevance of the corresponding terms in the given corpus.
If a corpus contains n terms (t_i, i = 1..n), a document d from that corpus is represented by the vector:
d = (w_1, w_2, ..., w_n), where w_i is the weight associated with term t_i

13 VECTOR SPACE MODEL
- Distances between vectors in this multidimensional space represent relationships between the corresponding documents
- It is assumed that documents that are close to one another in this space are also close (similar) in meaning

14 VSM: TERM DOCUMENT MATRIX
In the VSM, a corpus is represented in the form of a Term Document Matrix (TDM), i.e., an m x n matrix with the following features:
- Rows (i = 1..m) represent terms from the corpus
- Columns (j = 1..n) represent documents from the corpus
- Cell (i, j) stores the weight of term i in the context of document j
Image source: http://mlg.postech.ac.kr/research/nmf

15 VSM: TEXT PREPROCESSING
Before creating the TDM, documents from the corpus need to be preprocessed.
Rationale / objective: to reduce the set of words to those that are expected to be the most relevant for the given corpus.
Preprocessing (often) includes:
- Normalizing the text
- Removing terms with very low / very high frequency in the given corpus
- Removing the so-called stop-words
- Reducing words to their root form through stemming or lemmatization

16 NORMALIZATION OF TEXT
Objective: transform various forms of the same term into a common, normalized form.
E.g.: Apple, apple, APPLE -> apple; Intelligent Systems, Intelligent systems, Intelligent-systems -> intelligent systems
How it is done:
- Using simple rules:
  - Remove all punctuation marks (dots, dashes, commas, ...)
  - Transform all words to lower case
- Using a dictionary, such as WordNet, to replace synonyms with a common, often more general, concept
  - E.g., automobile, car -> vehicle
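
A small Java sketch of the rule-based part; the dictionary-based synonym replacement is only mimicked here with a hard-coded map, since a real system would consult a resource such as WordNet (requires Java 9+ for Map.of).

    import java.util.Map;

    public class Normalizer {
        // Illustrative stand-in for a real synonym dictionary such as WordNet.
        private static final Map<String, String> SYNONYMS =
            Map.of("automobile", "vehicle", "car", "vehicle");

        public static String normalize(String text) {
            String result = text.toLowerCase()                  // transform all words to lower case
                                .replaceAll("\\p{Punct}", " ")  // remove all punctuation marks
                                .replaceAll("\\s+", " ")        // collapse the resulting whitespace
                                .trim();
            // Replace known synonyms with their common, more general concept.
            for (Map.Entry<String, String> e : SYNONYMS.entrySet())
                result = result.replaceAll("\\b" + e.getKey() + "\\b", e.getValue());
            return result;
        }

        public static void main(String[] args) {
            System.out.println(normalize("Apple, apple, APPLE!"));  // -> apple apple apple
            System.out.println(normalize("A car or an automobile")); // -> a vehicle or an vehicle
        }
    }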

17 REMOVING HIGH AND LOW FREQUENCY TERMS
Empirical observations (in numerous corpora):
- Many low-frequency words
- Only a few words with high frequency
These observations are formalized in Zipf's law: the frequency of a word in a given corpus is inversely proportional to its rank in the frequency table (for that corpus).

18 ILLUSTRATION OF ZIPF'S LAW
Word frequency in the Brown Corpus of American English text
Source: http://nlp.stanford.edu/fsnlp/intro/fsnlp-slides-ch1.pdf

19 IMPLICATIONS OF ZIPF'S LAW
- Words in the upper part of the frequency table comprise a significant proportion of all the words in the corpus, but are semantically almost useless
  - Examples: the, a, an, we, do, to
- Words towards the bottom of the frequency table are semantically rich, but have very low frequency
  - Example: dextrosinistral
- The remaining words are those that represent the corpus best, and thus should be included in the VSM model

20 IMPLICATIONS OF ZIPF'S LAW
- Remove words that do not bear meaning
- Remove highly infrequent words
Image source: http://www.dcs.gla.ac.uk/keith/chapter.2/ch.2.html

21 STOP-WORDS
- An alternative or complementary way to eliminate words that are (most probably) irrelevant for corpus analysis
- Stop-words are words that (on their own) do not bear any information / meaning
  - It is estimated that they represent 20-30% of the words in any corpus
- There is no unique stop-words list; frequently used lists are available at: http://www.ranks.nl/stopwords
- Potential problem with stop-word removal: the loss of the original meaning and structure of the text
  - Examples: "this is not a good option" -> "option"; "to be or not to be" -> null
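
A minimal Java sketch of stop-word removal, using a tiny illustrative subset of a stop-word list (real lists, such as those at ranks.nl, are far longer):

    import java.util.*;
    import java.util.stream.Collectors;

    public class StopWordFilter {
        // Tiny illustrative subset of a stop-word list.
        private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "to", "we", "do", "not"));

        public static List<String> removeStopWords(List<String> tokens) {
            return tokens.stream()
                         .filter(t -> !STOP_WORDS.contains(t))
                         .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<String> tokens = Arrays.asList("this", "is", "not", "a", "good", "option");
            // Note how the negation is lost, as warned above:
            System.out.println(removeStopWords(tokens));  // -> [this, good, option]
        }
    }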

22 LEMMATIZATION AND STEMMING
Two approaches to decreasing the variability of words by reducing their different forms to a basic / root form:
- Stemming is a crude heuristic process that chops off the ends of words without considering the linguistic features of the words
  - E.g., argue, argued, argues, arguing -> argu
- Lemmatization refers to the use of a vocabulary and morphological analysis of words, aiming to return the base or dictionary form of a word, known as the lemma
  - E.g., argue, argued, argues, arguing -> argue
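
The following toy Java stemmer illustrates the "chop off the ends of words" idea in its crudest form; it is deliberately simplistic, and stands in for real algorithms such as the Porter stemmer (available in the frameworks listed at the end):

    public class ToyStemmer {
        // Deliberately crude: chop a few common English suffixes, ignoring linguistics.
        public static String stem(String word) {
            String[] suffixes = {"ing", "ed", "es", "s", "e"};
            for (String suffix : suffixes)
                if (word.endsWith(suffix) && word.length() > suffix.length() + 2)
                    return word.substring(0, word.length() - suffix.length());
            return word;
        }

        public static void main(String[] args) {
            for (String w : new String[]{"argue", "argued", "argues", "arguing"})
                System.out.println(w + " -> " + stem(w));  // all four reduce to "argu"
        }
    }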

23 VSM: COMPUTING TERM WEIGHTS
There are various approaches to determining term weights. Simple and frequently used approaches include:
- Binary weights
- Term Frequency (TF)
- Inverse Document Frequency (IDF)
- TF-IDF

24 VSM: BINARY WEIGHTS
Weights take the value of 0 or 1, to reflect the presence (1) or absence (0) of the term in a particular document.
Example:
Doc1: "Text mining is to identify useful information."
Doc2: "Useful information is mined from text."
Doc3: "Apple is delicious."

          text  information  identify  mining  mined  is  useful  to  from  apple  delicious
    Doc1    1        1           1        1      0     1     1     1    0     0        0
    Doc2    1        1           0        0      1     1     1     0    1     0        0
    Doc3    0        0           0        0      0     1     0     0    0     1        1

25 VSM: TERM FREQUENCY
Term Frequency (TF) represents the frequency of a term in a specific document.
The underlying assumption: the higher a term's frequency in a document, the more important the term is for that document.
TF(t,d) = c(t,d)
where c(t,d) is the number of occurrences of the term t in the document d

26 VSM: INVERSE DOCUMENT FREQUENCY
The underlying idea: assign higher weights to unusual terms, i.e., to terms that are not so common in the corpus.
IDF is computed at the corpus level, and thus describes the corpus as a whole, not individual documents.
It is computed in the following way:
IDF(t) = 1 + log(N/df(t))
where N is the number of documents in the corpus, and df(t) is the number of documents containing the term t

27 VSM: TF-IDF
The underlying idea: value those terms that are not so common in the corpus (relatively high IDF), but still have some reasonable level of frequency (relatively high TF).
TF-IDF is the most frequently used metric for computing term weights in a VSM.
General formula for computing TF-IDF:
TF-IDF(t) = TF(t) x IDF(t)
One popular instantiation of this formula:
TF-IDF(t) = tf(t) * log(N/df(t))
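
A compact Java sketch computing the tf(t) * log(N/df(t)) weights for the first document of a toy, pre-tokenized corpus (the corpus and all names are illustrative):

    import java.util.*;

    public class TfIdf {
        public static void main(String[] args) {
            List<List<String>> docs = Arrays.asList(
                Arrays.asList("text", "mining", "useful", "information"),
                Arrays.asList("useful", "information", "mined", "text"),
                Arrays.asList("apple", "delicious"));
            int n = docs.size();  // N: number of documents in the corpus

            // df(t): number of documents containing term t.
            Map<String, Integer> df = new HashMap<>();
            for (List<String> doc : docs)
                for (String t : new HashSet<>(doc))
                    df.merge(t, 1, Integer::sum);

            // tf(t): number of occurrences of t in the first document.
            List<String> doc = docs.get(0);
            Map<String, Integer> tf = new HashMap<>();
            for (String t : doc) tf.merge(t, 1, Integer::sum);

            // TF-IDF(t) = tf(t) * log(N / df(t))
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double w = e.getValue() * Math.log((double) n / df.get(e.getKey()));
                System.out.printf("%s: %.3f%n", e.getKey(), w);
            }
        }
    }

Note how "mining", which occurs in only one document, gets a higher weight (log 3 ~ 1.099) than corpus-wide terms such as "text" (log(3/2) ~ 0.405).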

28 VSM: ESTIMATING THE SIMILARITY OF DOCUMENTS
Key question: which metric should be used to estimate the similarity of documents (i.e., of the vectors that represent documents)?
The most well-known and widely used metric is Cosine similarity.
Image source: http://www.ascilite.org.au/ajet/ajet26/ghauth.html

29 COSINE SIMILARITY
cos(d_i, d_j) = (V_i . V_j) / (|V_i| |V_j|)
where V_i and V_j are the vectors representing documents d_i and d_j
Image source: http://blog.christianperone.com/?p=2497
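
The formula translates directly into code. A minimal Java sketch, applied here to the binary vectors of Doc1 and Doc2 from slide 24:

    public class CosineSimilarity {
        // cos(d_i, d_j) = (V_i . V_j) / (|V_i| |V_j|)
        public static double cosine(double[] vi, double[] vj) {
            double dot = 0, normI = 0, normJ = 0;
            for (int k = 0; k < vi.length; k++) {
                dot   += vi[k] * vj[k];   // dot product V_i . V_j
                normI += vi[k] * vi[k];
                normJ += vj[k] * vj[k];
            }
            return dot / (Math.sqrt(normI) * Math.sqrt(normJ));
        }

        public static void main(String[] args) {
            double[] d1 = {1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0};  // Doc1
            double[] d2 = {1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0};  // Doc2
            System.out.println(cosine(d1, d2));  // 4 / (sqrt(7) * sqrt(6)) ~ 0.617
        }
    }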

30 VSM: PROS AND CONS
Advantages:
- Intuitive
- Easy to implement
- Empirically proven to be highly effective
Drawbacks:
- Based on the unrealistic assumption of the mutual independence of words
- Tuning the model's parameters is often challenging and time-consuming; this includes selecting the method for:
  - determining the term weights
  - computing document (vector) similarity

31 TEXT PROCESSING IN JAVA
Well-known and widely used Java frameworks for text processing and analysis:
- Stanford CoreNLP: http://nlp.stanford.edu/software/corenlp.shtml
- Apache OpenNLP: http://opennlp.apache.org/
- LingPipe: http://alias-i.com/lingpipe/
- GATE: http://gate.ac.uk/

32 ACKNOWLEDGEMENTS
These slides are partially based on:
- The lecture on the Vector Space Model from the Text Mining course at the University of Virginia (link)
- The presentation "Introduction to Text Mining" downloaded from SlideShare.net (link)