N-Gram-Based Text Categorization William B. Cavnar and John M. Trenkle Proceedings of the Third Symposium on Document Analysis and Information Retrieval (1994) presented by Marco Lui
Automated text categorization (TC) is a supervised learning task, defined as assigning category labels (pre-defined) to new documents based on the likelihood suggested by a set of labeled documents. Yang & Liu (1999)
Examples of Text Categorization
Topic-based:
- Routing news articles from a newswire
- Sorting through digitized paper archives
Style-based:
- Authorship attribution
Syntactic (?):
- Language identification
Characteristics (as declared by C&T)
- The categorization must work reliably in spite of textual errors.
- The categorization must be efficient, consuming as little storage and processing time as possible, because of the sheer volume of documents to be handled.
- The categorization must be able to recognize when a given document does not match any category, or when it falls between two categories. This is because category boundaries are almost never clear-cut.
Document Representation
Normalization:
- keep only letters, apostrophes, and whitespace
- discard digits and punctuation
- pad with whitespace
Tokenization:
- contiguous byte N-grams
- mixture of N-gram orders (1 ≤ N ≤ 5)
Features:
- feature vector: frequency counts of N-grams
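The normalization and tokenization steps above can be sketched as follows. This is an illustrative reading of C&T's scheme, not their code: the exact padding (one space on each side of a token) and the lowercasing are assumptions on my part.

```python
def ngrams(text, n_min=1, n_max=5):
    """Extract character N-grams of orders 1..5 from normalized text.

    Normalization (per C&T): keep letters and apostrophes, map digits and
    punctuation to whitespace; each token is padded with whitespace.
    Lowercasing and single-space padding are assumptions.
    """
    cleaned = "".join(c if c.isalpha() or c == "'" else " " for c in text.lower())
    grams = []
    for token in cleaned.split():
        padded = " " + token + " "  # whitespace padding around each token
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams
```

A frequency count over this list (e.g. with `collections.Counter`) then gives the feature vector.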
Document Representation: Example
Zipfian Distribution in N-Grams
Document and Category Profiles
- N-gram 'profile': byte N-grams in decreasing order of frequency
- Category profile: summed over all documents in that category
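A profile is just the ranked list of the most frequent N-grams. A minimal sketch, assuming the input is a list of extracted N-grams and that ties may be broken arbitrarily (the paper does not specify tie-breaking):

```python
from collections import Counter

def profile(grams, size=300):
    """Rank N-grams by decreasing frequency and keep the top `size`.

    `size=300` follows C&T's observation that the top ~300 N-grams are
    highly language-correlated; ties are broken arbitrarily (assumption).
    """
    counts = Counter(grams)
    return [g for g, _ in counts.most_common(size)]
```

A category profile is obtained the same way, after concatenating the N-gram lists of all training documents in that category.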
C&T's observations on Profiles
- Top 300 N-grams are highly correlated with language.
- The very highest-ranked N-grams are mostly 1-grams, followed by function words and frequent prefixes and suffixes.
- Around rank 300 or so, N-grams become more specific to the subject of the document.
- There is nothing special about rank 300; the value was chosen by inspection.
Classification: Profile distance
Understanding C&T
- Feature Selection
- Nearest-Prototype Classification
Understanding C&T: Feature Selection
- Local Dimensionality Reduction (Sebastiani 2002): a set of terms is selected for each category
- Terms are selected by Term Frequency (M per category)
- We can 'weight' features by their minimum rank across all categories
- The feature set for a given M is the set of features with 'weight' ≤ M
- The relationship between M and the number of features selected varies with the dataset
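This per-category selection can be sketched as below. The cutoff convention (0-based ranks, keeping features whose best rank is below M, i.e. in some category's top M) is my assumption about how the weighting is applied.

```python
def select_features(category_profiles, M):
    """Keep N-grams whose minimum (best) rank across all categories is < M.

    category_profiles: dict mapping category name -> ranked N-gram list.
    Ranks are 0-based, so w < M means "in the top M of some category"
    (convention assumed, not specified in the slides).
    """
    weight = {}
    for prof in category_profiles.values():
        for rank, g in enumerate(prof):
            weight[g] = min(weight.get(g, rank), rank)
    return {g for g, w in weight.items() if w < M}
```

Because an N-gram frequent in one category may rank low in another, the size of the selected set for a given M depends on how much the category profiles overlap, which is why it varies with the dataset.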
Understanding C&T: Nearest-Prototype Classification
- Sometimes referred to as the Rocchio Method
- Training phase: instance vectors are summarized into a 'prototype' for each class
- Testing phase: a distance metric compares the test instance to each prototype; the nearest prototype (minimum distance) is selected
Understanding C&T: Nearest-Prototype Classification
In C&T (1994):
- The prototype is the sum of the document vectors for a given category
- The distance metric is 'out-of-place', a rank-order statistic
- It measures differences between features in rank order, taking the distance in the ordering into account
- Most closely related to Spearman's rho
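The out-of-place measure and the nearest-prototype decision can be sketched together. This is an illustrative reading of C&T's description; using the category profile's length as the maximum penalty for missing N-grams is an assumption (the paper only specifies "some maximum value").

```python
def out_of_place(doc_profile, cat_profile):
    """Out-of-place distance between two ranked N-gram lists.

    For each N-gram in the document profile, add the absolute difference
    between its rank there and its rank in the category profile; an N-gram
    absent from the category profile takes a maximum penalty (assumed here
    to be the category profile's length).
    """
    cat_rank = {g: r for r, g in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(abs(r - cat_rank[g]) if g in cat_rank else max_penalty
               for r, g in enumerate(doc_profile))

def classify(doc_profile, cat_profiles):
    """Nearest prototype: the category whose profile is at minimum distance."""
    return min(cat_profiles, key=lambda c: out_of_place(doc_profile, cat_profiles[c]))
```

Identical profiles have distance 0, and each pairwise swap of adjacent ranks adds to the score, which is why the measure behaves like a rank-correlation statistic such as Spearman's rho.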
Evaluation
- Language Classification (LangID)
- Subject Classification
LangID: Dataset
- 3478 samples in 8 languages from the soc.culture newsgroup hierarchy
- Semi-automatically labelled for language; multilingual articles manually rejected

  English      1208
  Spanish       697
  German        481
  Italian       316
  French        273
  Dutch         235
  Portuguese    151
  Polish        117
LangID: Results
LangID: Observations
- Works better for longer articles, but not by as much as expected
- Works better with longer profiles, with some anomalies
- Part of the problem was due to multilingual articles that passed manual filtering
- With M = 400, overall accuracy is 99.8%
Subject Classification
- 778 article bodies from 5 Usenet newsgroups
- Category profiles were built from 7 FAQ articles rather than from aggregated articles
Subject Classification: Results
Advantages of the N-Gram Frequency Technique
- Suited to text coming from noisy sources such as email or OCR systems (or social media?)
- More robust than word counts:
  - for noisy data, a single misrecognized character throws off the statistics for the whole word
  - for short data, word statistics are under-sampled
- N-grams give word stemming for free
- No need for language-dependent tools
Conclusions and Future Directions
- Omit statistics for N-grams that are extremely common, as they are features of the language
- Experiment with document sets that have higher overall coherence and quality
- Normalize raw match scores to measure match quality by thresholding
- Unicode codepoint N-grams
My Thoughts (I)
- The significance to LangID is that this was the first work to model documents with character N-grams, and it achieved high accuracy under its chosen parameters
- The approach is weak in terms of ML technique; was this really state of the art in 1994?
- It was not particularly influential in Text Categorization: Sebastiani's 2002 survey mentions it only as an application to LangID
My Thoughts (II)
They don't meet their stated objectives:
- "Work reliably in spite of textual errors": this is not measured
- "Efficient, consuming as little storage and processing time as possible": no theoretical support and no empirical comparison
- "Recognize when a given document does not match any category, or when it falls between two categories": only speculatively addressed in future work
My Thoughts (III)
- Not clear why they used FAQs for subject classification
- The paper is poorly referenced: 6 references, 3 to the authors' own work, 1 minimally relevant
- Missing relationships to relevant prior work:
  - the Rocchio method dates to 1971
  - Lewis has text categorization work from 1991 and is thanked in the acknowledgements!
- I would not model a new paper on this paper
Thanks!