N-Gram-Based Text Categorization

N-Gram-Based Text Categorization
William B. Cavnar and John M. Trenkle
Proceedings of the Third Symposium on Document Analysis and Information Retrieval (1994)
Presented by Marco Lui

Automated text categorization (TC) is a supervised learning task, defined as assigning pre-defined category labels to new documents based on the likelihood suggested by a set of labelled documents. (Yang & Liu, 1999)

Examples of Text Categorization
Topic-based:
- Routing news articles from a newswire
- Sorting through digitized paper archives
Style-based:
- Authorship attribution
Syntactic (?):
- Language identification

Characteristics (as declared by C&T)
- The categorization must work reliably in spite of textual errors.
- The categorization must be efficient, consuming as little storage and processing time as possible, because of the sheer volume of documents to be handled.
- The categorization must be able to recognize when a given document does not match any category, or when it falls between two categories. This is because category boundaries are almost never clear-cut.

Document Representation
- Normalization: keep only letters, apostrophes and whitespace; discard digits and punctuation; pad tokens with whitespace
- Tokenization: contiguous byte N-grams, a mixture of N-gram orders (1 ≤ N ≤ 5)
- Features: feature vector of N-gram frequency counts (a sketch of the pipeline follows below)
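To make the representation concrete, here is a minimal Python sketch of this pipeline as I read it from the paper. The function names are my own, and the exact handling of discarded characters (treating them as token boundaries) is an assumption.

```python
import re
from collections import Counter

def normalize(text):
    # Keep only letters, apostrophes and whitespace; everything else
    # (digits, punctuation) becomes whitespace. Assumption: discarded
    # characters act as token boundaries.
    text = re.sub(r"[^A-Za-z' ]+", " ", text)
    return re.sub(r" +", " ", text).strip().lower()

def ngram_counts(text, n_min=1, n_max=5):
    # Count contiguous byte N-grams of orders 1..5 over whitespace-padded tokens.
    counts = Counter()
    for token in normalize(text).split():
        padded = " " + token + " "      # pad with whitespace, as on the slide
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts

print(ngram_counts("Text categorization!").most_common(5))
```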

Document Representation: Example

Zipfian Distribution in N-Grams

Document and Category Profiles
- N-gram 'profile': byte N-grams in decreasing order of frequency
- Category profile: N-gram counts summed over all documents in that category (see the sketch below)
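Building on the `ngram_counts` sketch above, profile construction is then just frequency ranking. The top-K truncation (K around 300, per the observations below) and arbitrary tie-breaking are assumptions on my part.

```python
from collections import Counter

def profile(counts, k=300):
    # A profile is just the N-grams in decreasing order of frequency;
    # only the ranks matter from here on, not the raw counts.
    return [g for g, _ in counts.most_common(k)]

def category_profile(documents, k=300):
    # Category counts are summed over all documents in the category
    # (uses ngram_counts from the earlier sketch), then ranked.
    total = Counter()
    for doc in documents:
        total.update(ngram_counts(doc))
    return profile(total, k)
```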

C&T's observations on Profiles
- The top 300 N-grams are highly correlated with language.
- The very highest-ranked N-grams are mostly 1-grams, followed by function words and frequent prefixes and suffixes.
- From around rank 300 onwards, N-grams become more specific to the subject of the document.
- There is nothing special about rank 300; the value was chosen by inspection.

Classification: Profile distance

Understanding C&T
- Feature Selection
- Nearest-Prototype Classification

Understanding C&T: Feature Selection
- Local Dimensionality Reduction (Sebastiani 2002): a set of terms is selected for each category
- Terms are selected by term frequency (M per category)
- We can 'weight' features by their minimum rank across all categories; the feature set for a given M is the set of features with 'weight' ≤ M (see the sketch below)
- The relationship between M and the number of features selected varies with the dataset
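A short sketch of that weighting scheme (my reconstruction of the presenter's analysis, not code from the paper; ranks are 0-indexed here, so 'weight ≤ M' becomes rank < M):

```python
def feature_set(cat_profiles, M):
    # cat_profiles: {category: [N-grams in decreasing frequency order]}
    # A feature's weight is its best (minimum) rank across all categories;
    # keeping weights below M yields the union of each category's top-M N-grams.
    weight = {}
    for prof in cat_profiles.values():
        for rank, gram in enumerate(prof):
            weight[gram] = min(weight.get(gram, rank), rank)
    return {g for g, w in weight.items() if w < M}
```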

Understanding C&T: Nearest-Prototype Classification
- Sometimes referred to as the Rocchio Method
- Training phase: instance vectors are summarized into a 'prototype' for each class
- Testing phase: a distance metric is used to compare the test instance to the prototypes; the nearest prototype (minimum distance) is selected

Understanding C&T: Nearest-Prototype Classification
- In C&T (1994), the prototype is the sum of the document vectors for a given category
- The distance metric is 'out-of-place', a rank-order statistic: it measures differences between features in rank order, taking the distance in the ordering into account (sketched below)
- Most closely related to Spearman's rho
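A minimal sketch of the out-of-place measure and nearest-profile classification. One assumption the slides leave open: an N-gram absent from the category profile incurs a fixed maximum penalty (here, the profile length).

```python
def out_of_place(doc_profile, cat_profile):
    # Sum, over the document's N-grams, of how far each one's rank is from
    # its rank in the category profile. Assumption: an N-gram missing from
    # the category profile contributes a fixed maximum penalty.
    cat_rank = {g: r for r, g in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(abs(r - cat_rank[g]) if g in cat_rank else max_penalty
               for r, g in enumerate(doc_profile))

def classify(doc_profile, cat_profiles):
    # Nearest prototype: pick the category whose profile is closest.
    return min(cat_profiles, key=lambda c: out_of_place(doc_profile, cat_profiles[c]))
```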

Evaluation
- Language Classification (LangID)
- Subject Classification

LangID: Dataset
- 3478 samples in 8 languages from the soc.culture newsgroup hierarchy
- Semi-automatically labelled for language; multilingual articles manually rejected

  Language     Samples
  English         1208
  Spanish          697
  German           481
  Italian          316
  French           273
  Dutch            235
  Portuguese       151
  Polish           117

LangID: Results

LangID: Observations
- Works better for longer articles, but not as much as expected
- Works better with longer profiles, with some anomalies; part of the problem was due to multilingual articles that passed manual filtering
- With M=400, overall accuracy is 99.8%

Subject Classification
- 778 article bodies from 5 Usenet newsgroups
- Category profiles were built from 7 FAQ articles rather than aggregated articles

Subject Classification: Results

Advantages of the N-Gram Frequency Technique
- Suited to text from noisy sources such as email or OCR systems (or social media?)
- More robust than word counts: for noisy data, a single misrecognized character throws off the statistics for the whole word; for short data, word statistics are under-sampled
- N-grams give word stemming for free
- No need for language-dependent tools

Conclusions and Future Directions
- Omit statistics for N-grams that are extremely common, as they are features of the language
- Experiment with document sets that have higher overall coherence and quality
- Normalize raw match scores to measure match quality by thresholding
- Unicode codepoint N-grams

My Thoughts (I)
- The significance to LangID: this was the first work to model documents with character N-grams, and it achieved high accuracy under their parameters
- The approach is weak in terms of ML technique; was this really state of the art in 1994?
- Was not particularly influential in Text Categorization: Sebastiani's 2002 survey only mentions LangID as an application

My Thoughts (II)
They don't meet their stated objectives:
- "Work reliably in spite of textual errors": this is not measured
- "Efficient, consuming as little storage and processing time as possible": no theoretical support and no empirical comparison
- "Recognize when a given document does not match any category, or when it falls between two categories": this is only speculatively addressed, as future work

My Thoughts (III)
- Not clear why they used FAQs for subject classification
- The paper is poorly referenced: 6 references, 3 to the authors' own work, 1 only minimally relevant
- Missing relationships to relevant prior work: the Rocchio method dates to 1971, and Lewis has text categorization work from 1991 and is thanked in the acknowledgements!
- I would not model a new paper on this paper

Thanks!