Talk: Selecting a Feature Set to Summarize Texts in Brazilian Portuguese. Daniel Saraiva Leite, Undergraduate Student; Lucia Helena Machado Rino, PhD, Advisor.


Transcription:

Talk: Selecting a Feature Set to Summarize Texts in Brazilian Portuguese. Daniel Saraiva Leite, Undergraduate Student; Lucia Helena Machado Rino, PhD, Advisor. NILC - Núcleo Interinstitucional de Lingüística Computacional, UFSCar - Universidade Federal de São Carlos.

Overview
- Introduction: The Summarization Task
- Extractive AS based on Machine Learning
- Scenario: The SuPor System - employed methods; how methods are mapped into features; the feature selection problem
- Taking advantage of WEKA - improving the model; machine learning techniques
- Assessments
- Final Remarks

The Summarization Task: taking one or more texts and producing a shorter one. The summary should convey the main content of the original text. There are two main approaches for Automatic Summarization: building abstracts (rewriting the text) and building extracts (copying-and-pasting full sentences).

Extractive AS based on Machine Learning. Extractive Automatic Summarization: how to choose the sentences to include in the summary? Based on the relevance of each sentence: take the most relevant ones and stop when the desired length is achieved. Machine Learning for Extractive AS - Kupiec et al. (1995): relevance ~ likelihood of inclusion in the extract; Naïve-Bayes is suggested, using shallow features of the text (e.g., location, frequency of the words, etc.) dating as far back as Luhn (1958) and Edmundson (1969), with a binary feature representation.

Extractive AS based on Machine Learning. Using Naïve-Bayes - training phase. A corpus is needed: Source Texts (ST) and Ideal Extracts (IE). For each sentence S of an ST, process its features and verify whether it also appears in the corresponding IE: if S ∈ IE, the class is Yes; if S ∉ IE, the class is No. (The slide shows an example table with feature columns F1..F5 and the class column "S ∈ E?", one row per sentence.) We get a dataset in which each instance is the representation of one sentence of the ST.
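A minimal sketch of this training-set construction (not SuPor's actual code; the corpus layout and feature-extractor functions below are hypothetical placeholders):

```python
def build_training_dataset(source_sentences, ideal_extract_sentences, feature_extractors):
    """source_sentences: list of sentences of one source text (ST).
    ideal_extract_sentences: set of sentences of the corresponding ideal extract (IE).
    feature_extractors: list of functions, each mapping a sentence to one feature value."""
    dataset = []
    for sentence in source_sentences:
        features = [extract(sentence) for extract in feature_extractors]
        # class label: does the sentence also appear in the ideal extract?
        label = "Yes" if sentence in ideal_extract_sentences else "No"
        dataset.append((features, label))
    return dataset
```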

Extractive AS based on Machine Learning. Using Naïve-Bayes - sentence classifying phase: compute each sentence's features (the F_j's) and, using the Naïve-Bayes formula and the training dataset, calculate its probability for the class S ∈ E = Yes:

P(s ∈ E | F_1, F_2, ..., F_k) = [ P(s ∈ E) · ∏_{j=1..k} P(F_j | s ∈ E) ] / ∏_{j=1..k} P(F_j)

Is it really a classification task? We are always interested in the probabilities for just one class.
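A minimal sketch of this scoring step, assuming the class-conditional and marginal feature probabilities have already been estimated from the training dataset (the probability tables below are hypothetical; this is not SuPor's implementation):

```python
from math import prod

def naive_bayes_score(feature_values, p_feature_given_yes, p_feature, p_yes):
    """feature_values: observed values F_1..F_k of one sentence.
    p_feature_given_yes[j][v]: estimated P(F_j = v | s in E).
    p_feature[j][v]: estimated P(F_j = v).
    p_yes: prior P(s in E)."""
    numerator = p_yes * prod(p_feature_given_yes[j][v] for j, v in enumerate(feature_values))
    denominator = prod(p_feature[j][v] for j, v in enumerate(feature_values))
    return numerator / denominator

# Sentences are ranked by this probability and the top ones are copied into the
# extract until the desired compression rate is reached.
```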

Our scenario: SuPor (Módolo, 2003). Main aspects: based on Kupiec et al.'s (1995) model; an AS environment. Novelties: the user can choose the features he/she wants - customization of the AS system - and many different AS methods are available. Besides shallow, basic features, SuPor embeds Lexical Chains (Barzilay & Elhadad, 1999), Importance of Topics (Larocca Neto et al., 2000) and the Relationship Map (Salton et al., 1997). Methods are mapped into binary features.

SuPor Features (actually 11 features, by varying preprocessing):

F1 - Lexical Chains: S must be recommended by at least one of the three heuristics of the method
F2 - Location: S must appear in special positions of the text (beginning or ending)
F3 - Words Frequency: the sum of S's word frequencies must be higher than a threshold
F4 - Relationship Map: S must be recommended by at least one of the three heuristics of the method
F5 - Importance of Topics: S must appear in an important topic and must be very similar to that topic
F6 - Proper Nouns: S must contain a number of proper nouns higher than a threshold
F7 - Sentence Length: S's number of words must be higher than a threshold

SuPor Drawbacks - the Feature Selection Problem. How can the user select the right feature set? It is a difficult task: he/she must be an expert in AS and, even so, may not be able to accomplish it properly. The quality of the extracts depends heavily on the feature set (100% in some cases). This is the motivation for our work.

SuPor Drawbacks - motivation for our work: explore means to reduce this customization effort - Automatic Feature Selection! We combine SuPor with WEKA, a free and very comprehensive machine learning tool (classification, rules, clustering, data visualization and preprocessing), available at www.cs.waikato.ac.nz/ml/weka/.

Taking Advantage of WEKA - Two Approaches. 1) Automatic Feature Selection: allows judging the relevance of feature subsets and choosing the best one! Methods based on an entropy measure (Shannon's Information Theory), employed as a filter before classification. 2) Changing the feature representation. Hypothesis: by improving the representation, Feature Selection might not be necessary - provide more information to the machine learning algorithm. We also try other classifiers, e.g., C4.5 (suggested by Módolo, 2003).

Taking Advantage of WEKA - Approach 1: CFS (Correlation-based Feature Selection) (Hall, 2000). A measure to evaluate the importance of a subset of features, trading off relevance against redundancy:

IG(feature_i, class) [relevance] − IG(feature_i, feature_j) [redundancy]

The idea of low redundancy seems good for Naïve-Bayes (independence assumption). The measure is employed together with a search heuristic - in WEKA, by default, Hill-Climbing.
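A rough sketch of this relevance-versus-redundancy idea, using mutual information as the "IG" measure over already-discretized feature columns (the data layout is an assumption; Hall's actual CFS combines the two terms into a normalized merit and searches the space of subsets):

```python
from itertools import combinations
from sklearn.metrics import mutual_info_score

def subset_merit(feature_columns, class_column, subset):
    """feature_columns: dict feature name -> list of discrete values, one per instance.
    class_column: list of class labels ('Yes'/'No'), one per instance.
    subset: iterable of feature names under evaluation."""
    relevance = sum(mutual_info_score(class_column, feature_columns[f]) for f in subset)
    redundancy = sum(mutual_info_score(feature_columns[f], feature_columns[g])
                     for f, g in combinations(subset, 2))
    # higher merit = features informative about the class and not redundant with each other
    return relevance - redundancy
```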

Taking Advantage of WEKA - Approach 2: Improving the Feature Representation. Principles: non-binary features; explore numeric and multivalued features. Sentence Length: the number of words in the sentence. Proper Nouns: the number of proper nouns in the sentence. Words Frequency: the sum of the frequency of each word of the sentence.
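A small sketch of these three numeric features (the tokenization, the proper-noun test and the frequency table are simplifying assumptions; SuPor's actual preprocessing, e.g. stemming and stoplist removal, is not reproduced here):

```python
def sentence_length(sentence_tokens):
    """Number of words in the sentence."""
    return len(sentence_tokens)

def proper_noun_count(sentence_tokens):
    """Crude proxy for proper nouns: capitalized tokens (a POS tagger would be used in practice)."""
    return sum(1 for tok in sentence_tokens if tok[:1].isupper())

def words_frequency(sentence_tokens, text_frequency):
    """Sum of the source-text frequency of each word in the sentence."""
    return sum(text_frequency.get(tok.lower(), 0) for tok in sentence_tokens)
```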

Taking Advantage of WEKA - Approach 2: Improving the Feature Representation. Location: according to 9 labels combining the position of the paragraph in the text with the position of the sentence within the paragraph:

Label  Position of paragraph  Position of sentence within the paragraph
II     Initial                Initial
IM     Initial                Medial
IF     Initial                Final
MI     Medial                 Initial
MM     Medial                 Medial
MF     Medial                 Final
FI     Final                  Initial
FM     Final                  Medial
FF     Final                  Final
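A sketch of how such a 9-valued Location label could be derived; splitting positions into thirds is an assumption made only for illustration, not necessarily SuPor's criterion:

```python
def position_band(index, total):
    """Map a 0-based position into Initial ('I'), Medial ('M') or Final ('F') thirds."""
    if index < total / 3:
        return "I"
    if index < 2 * total / 3:
        return "M"
    return "F"

def location_label(paragraph_index, n_paragraphs, sentence_index, n_sentences_in_paragraph):
    """E.g. 'IM' = paragraph at the beginning of the text, sentence in the middle of it."""
    return (position_band(paragraph_index, n_paragraphs)
            + position_band(sentence_index, n_sentences_in_paragraph))
```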

Taking Advantage of WEKA - Approach 2: Improving the Feature Representation. Importance of Topics: the harmonic mean between the topic's importance and the sentence's similarity to the topic. Relationship Map and Lexical Chains: labeled according to the heuristics that have recommended the sentence:

Label      Meaning
None       No heuristic recommends the sentence
H1         Only the first heuristic recommends the sentence
H2         Only the second heuristic recommends the sentence
H3         Only the third heuristic recommends the sentence
H1+H2      Both the first and second heuristics recommend the sentence
H1+H3      Both the first and third heuristics recommend the sentence
H2+H3      Both the second and third heuristics recommend the sentence
H1+H2+H3   All heuristics recommend the sentence
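A sketch of these two multivalued representations; the harmonic-mean combination follows the slide, while the computation of topic importance and sentence-topic similarity is left abstract:

```python
def topics_importance_feature(topic_importance, sentence_topic_similarity):
    """Harmonic mean of the topic's importance and the sentence's similarity to that topic."""
    if topic_importance + sentence_topic_similarity == 0:
        return 0.0
    return (2 * topic_importance * sentence_topic_similarity
            / (topic_importance + sentence_topic_similarity))

def heuristics_label(h1, h2, h3):
    """Label such as 'H1+H3' recording which heuristics recommended the sentence."""
    fired = [name for name, recommended in (("H1", h1), ("H2", h2), ("H3", h3)) if recommended]
    return "+".join(fired) if fired else "None"
```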

Taking Advantage of WEKA - How to handle numeric features? Naïve-Bayes case: (a) assume a Normal (Gaussian) distribution - not always true; (b) discretize, using the Fayyad & Irani (1993) method: discretization with low loss of information; (c) estimate the probability distribution (Kernel Density Estimation, John & Langley, 1995) - results at least as good as assuming a normal distribution. C4.5 case: the only choice is discretization!
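A sketch contrasting the two density-based options for a numeric feature in Naïve-Bayes: assuming a Gaussian versus a kernel density estimate over the training values of a class (the bandwidth value is illustrative only):

```python
from math import exp, pi, sqrt

def gaussian_density(x, mean, std):
    """Class-conditional density under the Normal assumption."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

def kde_density(x, class_training_values, bandwidth=1.0):
    """Kernel density estimate: average of Gaussian kernels centered on each training value."""
    return (sum(gaussian_density(x, v, bandwidth) for v in class_training_values)
            / len(class_training_values))
```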

Assessment - Characteristics. Corpus: TeMário (Pardo & Rino, 2003), 100 news texts. Same methodology as a former experiment (Rino et al., SBIA'04). Compression rate = 30% (extract length / source-text length). 10-fold cross-validation. Automatic extracts (AE) are compared with their corresponding ideal extracts (IE). Measures:

Precision: P = |AE ∩ IE| / |AE|
Recall:    R = |AE ∩ IE| / |IE|
F-measure: F = 2 P R / (P + R)
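A minimal sketch of these measures, treating the automatic extract (AE) and the ideal extract (IE) as sets of sentences:

```python
def extract_scores(automatic_extract, ideal_extract):
    """Precision, recall and F-measure of an automatic extract against an ideal extract."""
    overlap = len(automatic_extract & ideal_extract)
    precision = overlap / len(automatic_extract)
    recall = overlap / len(ideal_extract)
    f_measure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f_measure
```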

Assessment - Results:

Model  Classifier    Numeric Handling  Feature Selection  Recall (%)  Precision (%)  F-measure (%)
M1     Naïve-Bayes   KDE               No                 43.9        47.4           45.6
M2     Naïve-Bayes   KDE               CFS                42.8        46.6           44.6
M3     Naïve-Bayes   Discretization    No                 42.2        45.8           43.8
M4     Naïve-Bayes   Discretization    CFS                42.0        45.9           43.9
M5     C4.5          Discretization    No                 37.7        40.6           39.1
M6     C4.5          Discretization    CFS                40.2        43.8           41.9

Best model = M1, which becomes SuPor-2!

Assessment - Comparing with former results (Rino et al., SBIA'04):

System            Precision (%)  Recall (%)  F-measure (%)  % above Random
SuPor-2           47.4           43.9        45.6           47
SuPor             44.9           40.8        42.8           38
ClassSumm         45.6           39.7        42.4           37
From-Top (B)      42.9           32.6        37.0           19
TF-ISF-Summ       39.6           34.3        36.8           19
GistSumm          49.9           25.6        33.8           9
NeuralSumm        36.0           29.5        32.4           5
Random order (B)  34.0           28.5        31.0           0

(B = baseline)

Final Remarks - Some issues. Why did Naïve-Bayes outperform C4.5? It is related to the way C4.5 calculates probabilities; NB performs well for ranking (Zhang & Su, 2004). Why didn't CFS bring better results overall? The features became more informative, so Feature Selection was no longer needed.

Final Remarks - Overall results. SuPor-2 brings significant improvements over SuPor; an expert user may no longer be necessary; using all features yields good results. Future work: explore new features; try new classifiers, especially probabilistic ones (e.g., Bayesian Networks); further improve the informativeness of the features.

Thank you! Questions? daniel_leite@dc.ufscar.br

References
Barzilay, R.; Elhadad, M. (1997). Using Lexical Chains for Text Summarization. In Proc. of the Intelligent Scalable Text Summarization Workshop, Madrid, Spain. Also in I. Mani and M.T. Maybury (eds.), Advances in Automatic Text Summarization, MIT Press, pp. 111-121, 1999.
Fayyad, U.; Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of IJCAI'93.
Hall, M. (2000). Correlation-based feature selection of discrete and numeric class machine learning. In Proceedings of the International Conference on Machine Learning, pp. 359-366, San Francisco, CA. Morgan Kaufmann.
Hearst, M. (1997). TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23(1), pp. 33-64.
John, G.; Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345.
Kupiec, J.; Pedersen, J.; Chen, F. (1995). A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 68-73.
Larocca Neto, J.; Santos, A.D.; Kaestner, C.A.A.; Freitas, A.A. (2000). Generating Text Summaries through the Relative Importance of Topics. In M.C. Monard and J.S. Sichman (eds.), IBERAMIA-SBIA 2000, pp. 300-309. Springer-Verlag, Berlin, Heidelberg.
Leite, D.S.; Rino, L.H.M. (2006a). A migração do SuPor para o ambiente WEKA: potencial e abordagens [The migration of SuPor to the WEKA environment: potential and approaches]. Série de Relatórios do NILC, NILC-TR-06-03. São Carlos-SP, January, 35p.

Leite, D.S.; Rino, L.H.M. (2006b). SuPor: extensões e acoplamento a um ambiente para mineração de dados [SuPor: extensions and coupling to a data-mining environment]. Série de Relatórios do NILC, NILC-TR-06-07. São Carlos-SP, August, 22p.
Módolo, M. (2003). SuPor: an Environment for Exploration of Extractive Methods for Automatic Text Summarization for Portuguese [in Portuguese]. MSc Dissertation, Departamento de Computação, UFSCar.
Pardo, T.A.S.; Rino, L.H.M. (2004). Descrição do GEI - Gerador de Extratos Ideais para o Português do Brasil [Description of GEI - the Ideal Extract Generator for Brazilian Portuguese]. Série de Relatórios do NILC, NILC-TR-04-07. São Carlos-SP, August, 10p.
Pardo, T.A.S.; Rino, L.H.M. (2003). TeMário: Um Corpus para a Sumarização Automática de Textos [TeMário: a corpus for automatic text summarization]. NILC Tech. Report NILC-TR-03-09. São Carlos, October, 12p.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, Morgan Kaufmann.
Rino, L.H.M.; Pardo, T.A.S.; Silla Jr., C.N.; Kaestner, C.A.; Pombo, M. (2004). A Comparison of Automatic Summarization Systems for Brazilian Portuguese Texts. In Proceedings of the XVII Brazilian Symposium on Artificial Intelligence - SBIA 2004. São Luís, Maranhão, Brazil.
Salton, G.; Singhal, A.; Mitra, M.; Buckley, C. (1997). Automatic Text Structuring and Summarization. Information Processing & Management, 33(2), pp. 193-207.
Witten, I.H.; Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco.
Zhang, H.; Su, J. (2004). Naive Bayesian classifiers for ranking. In Proceedings of the 15th European Conference on Machine Learning (ECML 2004), Springer.

SuPor-2 Architecture: Training Phase (diagram). Components shown: Source Texts, Ideal Extracts, Lexicon, StopList, Preprocessing, Features Computing, Comparison to Ideal Extracts, Training Dataset Generation, Training Dataset, WEKA classifier algorithm, Learning Model.

SuPor-2 Architecture: Sentence Selection Phase (diagram). Components shown: Source Text, Lexicon, StopList, Preprocessing, Features Computing, Classification with WEKA using the Learning Model, Compression Rate, Sentence Selection, Extract.

χ² Analysis (bar chart of χ² statistics, scale 0 to 140,000, comparing the new features against the former features): Lexical Chains (TextTiling), Lexical Chains (Paragraphs), Sentence Length, Proper Nouns, Location, Words Frequency (Stemming), Words Frequency (4-grams), Relationship Map (Stemming), Relationship Map (4-grams), Topics Importance (Stemming), Topics Importance (4-grams).