LibShortText: A Library for Short-text Classification and Analysis


Hsiang-Fu Yu (rofuyu@cs.utexas.edu)
Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA

Chia-Hua Ho (b95082@csie.ntu.edu.tw)
Yu-Chin Juan (r01922136@csie.ntu.edu.tw)
Chih-Jen Lin (cjlin@csie.ntu.edu.tw)
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan

Abstract

LibShortText is an open source library for short-text classification and analysis. Properties of short texts are considered in its design and implementation. The package supports effective text pre-processing and fast training/prediction procedures. We provide an interactive tool to perform error analysis. Because of the short length, details of each short text can be investigated easily. In addition, the package is designed so that users can conveniently make extensions. The ease of use, efficiency, and extensibility of LibShortText make it very useful for practitioners working on short-text classification and analysis. The package is available at http://www.csie.ntu.edu.tw/~cjlin/libshorttext

Keywords: short-text classification, interactive error analysis, open source, linear classification, machine learning

1. Introduction

Recently, there has been a growing interest in short-text classification and analysis. Application domains include titles (e.g., Shen et al., 2012), questions (e.g., Zhang and Lee, 2003; Qu et al., 2012), sentences (e.g., Khoo et al., 2006), and short messages. Existing text-classification packages may not be suitable for analyzing short texts because they do not consider the special properties of such texts. For example, it is difficult to analyze a text with many words, but we can easily investigate details in training/predicting short texts. Unlike general texts such as web pages and emails, short texts in a corpus may have similar lengths, and the words in a short text are often distinct. In addition, practitioners wonder whether the standard procedures for text classification must be applied when experimenting with short texts.

We develop LibShortText as an open source tool (under the BSD license; see footnote 1) for short-text classification and analysis. It has the following features.

  1. For large-scale short-text classification, it is more efficient than general text-mining tools.
  2. Based on our study (Yu et al., 2012), we carefully select the default options for LibShortText. Users need not try many options to get the best performance for their applications.
  3. We provide an interactive tool to perform error analysis. In particular, because of short lengths, details of each text can be investigated.
  4. The package is mainly written in Python for simplicity and extensibility, while time-consuming operations are implemented in C/C++ for efficiency.
  5. We provide full documentation including class references and examples.

Footnote 1: The New BSD License, approved by the Open Source Initiative.

2. The Software Package

The workflow of LibShortText is shown in Figure 1. Each step corresponds to a library in the core Python module libshorttext.

Figure 1: Workflow of LibShortText and the corresponding libraries/command-line tools. [Figure not reproduced. Workflow stages: short texts, pre-processing, feature generation, training, testing, analysis. Libraries: libshorttext.converter, libshorttext.classifier, libshorttext.analyzer. Command-line tools: text2svm.py, text-train.py, text-predict.py.]

  1. libshorttext.converter: For given short texts, LibShortText follows the bag-of-words model to generate features. Users apply procedures in this library to pre-process short texts by tokenization, stemming (optional), and stop-word removal (optional). The library also allows users to choose between unigram and bigram features.
  2. libshorttext.classifier: Following the user's choice of feature representation (binary, word count, TF, or TF-IDF), this library generates sparse feature vectors, conducts instance-wise normalization (optional), and calls the popular linear-classification package LIBLINEAR (Fan et al., 2008) for training/testing. All these operations can be easily built upon LIBLINEAR's Python interface, but for computational and memory efficiency, part of the implementation is in C/C++. We carefully design this library so that no modification of LIBLINEAR is required. For multi-class classification, we follow LIBLINEAR to support the one-versus-rest approach (e.g., Bottou et al., 1994) and the method by Crammer and Singer (2001). In addition, a parameter selection tool is provided for choosing the penalty parameter of the SVM/logistic regression formulations in LIBLINEAR.
  3. libshorttext.analyzer: The library provides a simple interactive tool to conduct error analysis at both the macro and micro levels. By macro level we mean the overall performance (e.g., accuracy), while the micro level involves the analysis of each feature of a short text.

Based on the study (Yu et al., 2012) for product title classification, we set the following default options.

    Step                Default options
    pre-processing      no stemming, no stop-word removal, bigram features
    feature generation  instance-wise normalization, binary feature representation
    training            Crammer & Singer multi-class SVM
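The default feature pipeline can be illustrated with a minimal sketch in plain Python. It does not use the LibShortText API; the function names are hypothetical, and two assumptions are made: that the "bigram features" option keeps unigrams alongside bigrams, and that instance-wise normalization scales each vector to unit Euclidean length.

```python
import math


def tokenize(text):
    # Lowercase and split on whitespace (space-separated tokens).
    return text.lower().split()


def unigram_bigram_features(tokens):
    # Assumption: the bigram option keeps the unigrams and adds adjacent pairs.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def binary_normalized_vector(features):
    # Binary representation: every distinct feature gets the value 1.
    # Assumption: instance-wise normalization means unit Euclidean (L2) length.
    distinct = set(features)
    if not distinct:
        return {}
    norm = math.sqrt(len(distinct))
    return {feature: 1.0 / norm for feature in distinct}


tokens = tokenize("MICKEY MOUSE POT STAKE")
vec = binary_normalized_vector(unigram_bigram_features(tokens))
for feature in sorted(vec):
    print(f"{feature}: {vec[feature]:.4f}")
```

Because words in a short text are generally distinct, word counts are mostly 1, so the binary and count representations of such an instance largely coincide.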

For general texts, an option suitable for some texts may not be suitable for others. In contrast, it is easier to select options based on properties of short texts. For example, because words in a short text are generally distinct, Yu et al. (2012) show that the SVM models obtained by binary and TF-IDF feature representations give similar performance.

2.1 Command-line and Interactive Use

We provide two main command-line tools: text-train.py pre-processes a text file and trains a model, while text-predict.py predicts another text file to obtain classification accuracy and generates an output file of predicted results.

    $ python text-train.py train_file
    [output skipped]
    $ python text-predict.py test_file train_file.model predict_result
    Accuracy = 90.7873% (9662757/10643283)

Because users may train with the same data several times after the pre-processing phase, we provide text2svm.py to convert short texts to sparse feature vectors of term frequency. Then text-train.py can directly train a model on these sparse vectors.

    $ python text2svm.py train_file
    [train_file.text_converter and train_file.svm are generated.]
    $ python text-train.py -P train_file.text_converter train_file.svm

LibShortText supports interactive error analysis. First, we enter Python, import the module, load the prediction results, and create an Analyzer object by reading a model.

    $ python
    >>> from libshorttext.analyzer import *
    >>> predict_result = InstanceSet('predict_result')
    >>> analyzer = Analyzer('train_file.model')

We randomly select 1000 instances with the two labels Books and Music and generate a confusion table.

    >>> insts = predict_result.select(with_labels(['Books', 'Music']), subset(1000,
    ... 'random'))
    >>> analyzer.gen_confusion_table(insts)
             Books  Music
    Books      425      3
    Music        9    563

Next, to see how the short text "MICKEY MOUSE POT STAKE" is predicted by our model, the analyze_single operation gives the feature weights of the top three classes.
For the multi-class SVM and logistic regression used in LibShortText, each class has a corresponding weight vector in the trained model. The decision value of a class is the inner product between its weight vector and the test instance. For this example, the class "Home & Garden" is correctly predicted because it has the highest decision value. From the weights, we identify "stake" as an important word for the "Home & Garden" class.
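The inner-product computation can be checked by hand. The sketch below is plain Python, not the LibShortText API; it copies the weights reported by analyze_single for "MICKEY MOUSE POT STAKE" and assumes the default binary representation with the instance scaled to unit Euclidean length, so each of the seven features has the value 1/sqrt(7). Under that assumption the results agree with the reported **decval** row to the printed precision.

```python
import math

# Feature order used for every weight list below.
FEATURES = ["pot", "mouse", "stake", "mickey mouse",
            "mouse pot", "pot stake", "mickey"]

# Per-class weights copied from the analyze_single output in the paper.
WEIGHTS = {
    "Home & Garden": [9.786e-01, 2.683e-01, 1.392e+00, 3.564e-02,
                      5.168e-01, 5.489e-01, -8.684e-02],
    "Specialty Services": [4.070e-01, 3.613e-01, 2.801e-01, 9.631e-02,
                           -7.780e-02, -9.084e-03, -4.901e-02],
    "Toys & Hobbies": [2.023e-01, 1.649e-01, 3.924e-01, 6.446e-01,
                       -1.771e-01, -3.282e-01, 1.075e-01],
}


def decision_values(weights, instance):
    # Decision value of a class = inner product of its weights and the instance.
    return {label: sum(w * x for w, x in zip(ws, instance))
            for label, ws in weights.items()}


# Assumption: binary features, normalized to unit L2 length.
instance = [1.0 / math.sqrt(len(FEATURES))] * len(FEATURES)

decvals = decision_values(WEIGHTS, instance)
predicted = max(decvals, key=decvals.get)

for label, value in decvals.items():
    print(f"{label}: {value:.3e}")
print("predicted:", predicted)
```

Since every feature carries the same value here, the ranking of classes is determined by the sum of their weights, which is why a single large weight such as the one for "stake" can decide the prediction.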

    >>> analyzer.analyze_single('MICKEY MOUSE POT STAKE', 3)
                  Home & Garden  Specialty Services  Toys & Hobbies
    pot               9.786e-01           4.070e-01       2.023e-01
    mouse             2.683e-01           3.613e-01       1.649e-01
    stake             1.392e+00           2.801e-01       3.924e-01
    mickey mouse      3.564e-02           9.631e-02       6.446e-01
    mouse pot         5.168e-01          -7.780e-02      -1.771e-01
    pot stake         5.489e-01          -9.084e-03      -3.282e-01
    mickey           -8.684e-02          -4.901e-02       1.075e-01
    **decval**        1.381e+00           3.813e-01       3.804e-01

We provide more operations for error analysis. Details are in the documentation.

2.2 Making Extensions

Because of the modularized framework, extending LibShortText is easy. We show an example that replaces the tokenizer. The default tokenizer assumes that tokens are separated by space characters. If other characters are used, users may write their own tokenizer.

    from libshorttext.converter import *

    # Custom tokenizer: lowercase the text and split it on commas.
    def comma_tokenizer(text):
        return text.lower().split(',')

    # Create a converter and plug in the custom tokenizer.
    text_converter = text2svm_converter()
    text_converter.text_prep.tokenizer = comma_tokenizer

We define comma_tokenizer to separate tokens by commas. We then create a converter instance, text_converter, and assign the new tokenizer to it in the last two lines.

3. Experiments

To demonstrate the efficiency and effectiveness of LibShortText, we crawled 20 million eBay product titles in 34 classes. The first 10 million titles are for training, while the next 10 million are for testing (see footnote 2). On a machine with an Intel Xeon 2.4GHz CPU (E5620), 12MB cache, and 64GB RAM, running text-train.py with default options takes about 37 minutes, of which 21 minutes are for generating features from texts and 16 minutes are for training. For prediction, text-predict.py takes 28 minutes and achieves 90.7873% accuracy. In contrast, a general text-mining tool such as RapidMiner (Mierswa et al., 2006) fails even to pre-process the data because of memory issues. On a subset of prod-title with 100K instances, LibShortText takes only 17 seconds and 240 MB of memory, while RapidMiner requires 10 minutes and 3 GB of memory.

Footnote 2: We cannot redistribute the data; see http://developer.ebay.com/join/licenses/individual.

4. Conclusions

LibShortText is an easy-to-use, efficient, and extensible tool for short-text classification and analysis. Its design and implementation incorporate properties of short texts. The package continues to be improved in response to new research results and users' needs.

References

L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L.D. Jackel, Y. LeCun, U.A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of classifier methods: a case study in handwriting digit recognition. In International Conference on Pattern Recognition, pages 77-87. IEEE Computer Society Press, 1994.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.

Anthony Khoo, Yuval Marom, and David Albrecht. Experiments with sentence classification. In Australasian Language Technology Workshop, 2006.

Ingo Mierswa, Michael Wurst, Ralf Klinkenberg, Martin Scholz, and Timm Euler. YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 935-940, 2006.

Bo Qu, Gao Cong, Cuiping Li, Aixin Sun, and Hong Chen. An evaluation of classification models for question topic categorization. Journal of the American Society for Information Science and Technology, 63:889-903, 2012.

Dan Shen, Jean-David Ruvini, and Badrul Sarwar. Large-scale item categorization for e-commerce. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM), pages 595-604, 2012.

Hsiang-Fu Yu, Chia-Hua Ho, Prakash Arunachalam, Manas Somaiya, and Chih-Jen Lin. Product title classification versus text classification. Technical report, 2012. URL http://www.csie.ntu.edu.tw/~cjlin/papers/title.pdf.

Dell Zhang and Wee Sun Lee. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 26-32, 2003.