Evaluation of statistical categorization methods for creating specialized vocabulary lists to be used as learning aid

Christian Lindgren, Lund University, Lund, Sweden (ada7cli@student.lu.se)
David Larsson, Lund University, Lund, Sweden (lak3dbo@student.lu.se)
Lars Gustafsson, Lund University, Lund, Sweden (ada1lgu@student.lu.se)

Abstract

This paper examines the possibility of creating a shortcut in category-targeted learning of a new language by filtering category word lists with two gold-standard statistical methods: Student's t-test and the chi-squared test. The word lists are compared to each other using only the most frequent words in a large training corpus, with coverage of the test corpus as the main measurement. The results are rather disappointing: the coverage of the filtered lists does not differ significantly from that of a list sorted by frequency alone. Previous studies argue, however, that the statistical methods used produce a rather large number of false positives, and future work should therefore examine other methods proposed in the linguistic literature.

1 Introduction

The purpose of this paper is to examine whether there is a shortcut to learning a new language, under the premise that one is interested in no more than one or a few disciplines (e.g. physics, sports, plants) in addition to learning enough words to read a general article in that language. The idea is that, using categories of words, one can filter out a list of significant words to be learned in order to read articles and converse with people about that category. On top of that, one must learn a set of general words in order to understand texts and dialogues in the language. Apart from the required grammar, this paper postulates that a higher coverage of the words in a given category corpus leads to a better understanding of the language. The thesis of this paper is that a combination of the general words and the category-specific words is sufficient to understand the language, as long as one does not wander outside the chosen category. The number of words to be learned using this method would then be lower than the number of words one might need to learn with a conventional method of learning a language.

1.1 Previous work

Frequency analysis for comparing frequencies of words in corpora has been thoroughly examined before [1] [2] [3] [4]. Rayson and Garside [4] present a variation of the chi-squared method for filtering out significant words and sorting them according to significance. Their promising results are the main reason for choosing their method as one of the filters in this study.

2 Methodology

The method used is frequency analysis on training corpora to predict the most frequent words in a test corpus. Given two training corpora, one representing the language in general and one that is category specific, the goal is to cover as large a proportion of the text in the test corpus as possible. The words are tagged with part-of-speech (POS) tags to prevent ambiguity: a corpus on the category golf may, for instance, contain a high frequency of the noun green, while a corpus on the category colours may contain a high frequency of the adjective green, and POS tagging is needed to keep such words apart. The task is narrowed down to category-specific prediction, meaning that a corpus for a specific category is required to make predictions on that very category.
The first step is to obtain a large corpus with as general content as possible, which will be called corpus A in this paper. Furthermore, a category-specific corpus is needed, here called corpus B. The second step is to POS tag the corpora and then simply sort the words by their frequencies.
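As a concrete illustration of this step, the sketch below counts (word, POS tag) pairs in a tagged corpus and ranks them by frequency. It is a minimal sketch only: the tab-separated input format and the function names are assumptions made for the example, since the paper does not describe its internal file formats.

```python
from collections import Counter

def word_frequencies(path):
    """Count (word, tag) pairs in a POS-tagged corpus file.

    Assumes one token per line in the form "word<TAB>tag"; this format is
    an assumption for the example, not taken from the paper.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue
            word, tag = parts
            # The word is counted together with its POS tag, so that e.g.
            # the noun "green" and the adjective "green" stay separate.
            counts[(word.lower(), tag)] += 1
    return counts

def sorted_by_frequency(counts):
    # Most frequent (word, tag) pairs first, as used by Method 1 below.
    return [item for item, _ in counts.most_common()]
```

Counting the word and its tag together is what keeps the noun green and the adjective green apart, as discussed above.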

Four different methods for frequency analysis are examined:

1. Simply using words from corpus A, sorted by frequency, as the list, starting with the most frequent word.
2. A simple category-specific method: the same as method 1, but with a window between word X_1 and word X_2 where corpus B is used instead.
3. Student's t-test: a t-test is run between corpora A and B to extract words that are significant at the 95% level into a new corpus B_1, after which the procedure of method 2 is followed with B replaced by B_1.
4. Chi-squared: the same as method 3, but the t-test is replaced by chi-squared statistics, with the words sorted by significance according to the chi-squared value.

All parts of speech except nouns, verbs, adjectives and adverbs are filtered out in order to examine only words carrying actual information. It is the belief of the authors that words like and as well as on do not hold any significance for a category-specific vocabulary. (Even though the study is done on Swedish, English words are used in the examples for the reader's convenience.) The initial use of corpus A is to get rid of common words like is, contains, exists etc., since these words are indeed very common in the test corpus but are of less relevance for the category-specific list of words. In this paper the content of the Swedish Wikipedia as of 2013-11-1 is used as corpus A, and articles linked under a specific category on Wikipedia as corpus B. The test corpus consists of several news articles connected to the same category as corpus B.

The system is constructed from four separate subsystems: Data to categories, tagging, evaluation and presentation. The base file used by the system is an XML dump from Wikipedia containing all articles in a given language; Data to categories produces one file for each requested category and one for the entire text, stripped down to raw text without any formatting. After that the files are used by the tagging software, which calculates the frequency of each word in every file and returns a list of all words and the number of occurrences of each. The program then evaluates the lists and compares them to produce the final lists, filtered according to the methods explained above. Lastly the data is presented both as a list and as a wordcloud (a graphical representation of a text in which the font size of a word corresponds to the frequency of that word in the text). The result is presented as a percentage of the maximum possible coverage for a given number of words (X_N), a window (X_1 to X_2) and a category, as well as the overall coverage for the same parameters.

2.1 Method 1

Given a list of all the words in corpus A, sorted by frequency, pick the first word in the list and iterate X_N times. For graphs see Appendix A.

Method 1     Overall coverage   Share of optimal coverage
Kampsport    52.6 %             77.0 %
Algoritmer   58.8 %             79.5 %
Table 1: Values for method 1 given X_N = 7.

2.2 Method 2

Given the same list as in Method 1, perform the same iteration but for X_1 times, then switch over to corpus B and continue for X_2 - X_1 times, finally switching back to corpus A and continuing for X_N - X_2 times (always skipping already picked words). For graphs see Appendix B.

Method 2     Overall coverage   Share of optimal coverage
Kampsport    59.8 %             87.7 %
Algoritmer   62.3 %             84.3 %
Table 2: Values for method 2 given X_1 = 2, X_2 = 5 and X_N = 7.
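A minimal sketch of the window used in Methods 2-4 and of the coverage measurement is given below. Only the window logic and the coverage definition are taken from the description above; the function names and the ranked-list representation are assumptions made for the example.

```python
def build_word_list(general_ranked, category_ranked, x1, x2, xn):
    """Method 2 as described above: take the X_1 most frequent items from
    corpus A, then items from corpus B until X_2 items have been picked,
    then fall back to corpus A until X_N items have been picked, always
    skipping items that are already in the list."""
    picked, seen = [], set()

    def take(source, target_size):
        for item in source:
            if len(picked) >= target_size:
                break
            if item not in seen:
                seen.add(item)
                picked.append(item)

    take(general_ranked, x1)    # corpus A, positions 1..X_1
    take(category_ranked, x2)   # corpus B, positions X_1+1..X_2
    take(general_ranked, xn)    # corpus A again, positions X_2+1..X_N
    return picked

def coverage(word_list, test_counts):
    """Share of the tokens in the test corpus covered by the word list."""
    total = sum(test_counts.values())
    covered = sum(test_counts[w] for w in word_list if w in test_counts)
    return covered / total if total else 0.0
```

For Method 1 the same function can be called with x1 = x2 = xn, so that only corpus A is used; Methods 3 and 4 only differ in how the category-specific ranking is produced.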

2.3 Method 3

Given the corpora A and B_1, where B_1 is corpus B filtered by a t-test according to the method described in [5], with the standard deviation approximated by the frequency itself, and then sorted by frequency. The t-value is calculated as proposed there, using

t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}    (1)

where the variance is approximated by

s^2 = p(1 - p)    (2)

After that the procedure is identical to Method 2. For graphs see Appendix C.

Method 3     Overall coverage   Share of optimal coverage
Kampsport    59.8 %             87.8 %
Algoritmer   62.3 %             84.3 %
Table 3: Values for method 3 given X_1 = 2, X_2 = 5 and X_N = 7.

2.4 Method 4

Given the corpora A and B, a word frequency list B_2 is created as proposed by Rayson and Garside [4]. The method is based on log-likelihood and chi-squared statistics and creates the list using the formulas

E_i = \frac{N_i \sum_i O_i}{\sum_i N_i}    (3)

and

-2 \ln \lambda = 2 \sum_i O_i \ln \frac{O_i}{E_i}    (4)

Then the same window concept as in method 2 is used. For graphs see Appendix D.

Method 4     Overall coverage   Share of optimal coverage
Kampsport    60.0 %             88.6 %
Algoritmer   61.5 %             83.1 %
Table 4: Values for method 4 given X_1 = 2, X_2 = 5 and X_N = 7.
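The two scoring functions of Methods 3 and 4 can be sketched as follows, directly from Eqs. (1)-(2) and (3)-(4). This is an illustration only: parameter names and the handling of zero counts are assumptions, and the exact bookkeeping used in the study is not described in the paper.

```python
import math

def t_score(p_category, p_general, n_category):
    """t-value as in Eqs. (1)-(2): relative frequency of a word in the
    category corpus compared with its relative frequency in the general
    corpus, with the variance approximated by p(1 - p)."""
    s2 = p_category * (1.0 - p_category)
    if s2 == 0.0:
        return 0.0
    return (p_category - p_general) / math.sqrt(s2 / n_category)

def log_likelihood(o1, n1, o2, n2):
    """Log-likelihood keyword statistic as in Eqs. (3)-(4), following
    Rayson and Garside [4]: o1 and o2 are the observed counts of a word in
    the two corpora, n1 and n2 the corpus sizes in tokens."""
    total_o = o1 + o2
    e1 = n1 * total_o / (n1 + n2)   # expected count in corpus 1
    e2 = n2 * total_o / (n1 + n2)   # expected count in corpus 2
    ll = 0.0
    for o, e in ((o1, e1), (o2, e2)):
        if o > 0:                   # a zero count contributes nothing
            ll += o * math.log(o / e)
    return 2.0 * ll
```

Words in corpus B would then be ranked by the chosen score, and the top of that ranking plays the role of B_1 or B_2 in the window procedure above.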

3 Possible applications

The software is mainly constructed as a tool for learning languages, as described in the Introduction, but during the course of the study a few other applications were considered.

3.1 Text categorization

Some experiments were done, not related to the original purpose of the study, to see whether one could automatically categorize a given text into one of a set of predetermined categories. Using Student's t-test or chi-squared, a new article or text was given a percentage of similarity to the texts in the different categories.

3.2 Profiling texts

The program can be used to profile a text or a set of texts and present a wordcloud, making it easy to get a general idea of what the text is about.

4 External software

Almost all software was developed specifically for this study by the authors, but three external tools were also used. Because the text extracted from Wikipedia uses a specific markup language, a parser was needed to extract the raw text. A streaming parser written by the authors was constructed with a focus on speed but lacked accuracy, so another tool was used instead: the wiki-markup filter made by Peter Exner, a Ph.D. student at the Department of Computer Science, Lund University. With this parser an almost 100% success rate was accomplished. To achieve some separation of homographs (although words sharing the same part of speech are still treated as the same word), a part-of-speech tagger called Stagger was applied to the raw text. Stagger [6] is made at Stockholm University, is based on the averaged perceptron of Collins (2002), and is one of the most accurate Swedish POS taggers at about 96.6 percent. Lastly, JDOM was used as an XML parser.

5 Discussion

A few different problems were discovered when manually checking the results. If the text contained the same base word with different inflections, the inflected forms were counted as entirely separate words. One way to solve this is to reduce all words to their base form before calculating the frequencies, which could dramatically change the significance of some words. Just as inflections may create several instances of the same word, some languages have homographs that share the same part of speech. This might give a falsely high frequency for some words.

Another issue that was never really discussed when forming the main thesis is that some words may be useful to know for their characteristics even though the words themselves are not useful. Since the study only evaluates the list of words based on frequency, with no knowledge of the structure of the language, this is completely overlooked. Furthermore, grammar is something this method takes no notice of; the authors see this tool as an aid when learning a new language, and they understand that learning a new language is as much about grammar as about learning words. The system only provides a list of words related to the language; the pupil has to use a dictionary to figure out the meaning and pronunciation of the given words. One could also argue that it is possible to figure out which meaning of a word is relevant. For example, the English word bow has several meanings; when learning the category Archery it is fairly easy to understand that the correct Swedish translation is båge rather than rosett, which is a tied ribbon.

Lastly, the t-test and chi-squared filters might not be the preferred methods for choosing which words are significant [1] [2]. Kilgarriff argues that since language is never random, the standard null-hypothesis methods are less useful, as they produce too many false positives. This may be mirrored in the results of this study, where some very common words made the lists of significant words in the different categories. Moreover, Lijffijt et al. [3] present two alternatives to the classical statistical methods, inter-arrival times and bootstrapping, which they show produce far fewer false positives than the gold-standard methods. Future work for this study might be to test the program with these methods instead, to produce better results containing fewer common words.

6 Conclusions

The lists filtered out by the methods used are intuitively good, but an objective measurement of how much significance the words actually hold with respect to the structure of the language, as discussed above, is lacking. One can, however, conclude that, good or bad, the t-test and chi-squared methods produce largely the same result as simply using the category training data directly, for both categories tested. Furthermore, this paper assumes that coverage is more important than the individual words with regard to which words are needed to understand a text. This might not be the case at all; complementary research is needed to conclude whether a human understands a text better with lower coverage and a larger number of key words, or whether there is a balancing point in between.

References

[1] Adam Kilgarriff. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory, 1(2):263-276, 2005.
[2] Stefan Th. Gries. Null-hypothesis significance testing of word frequencies: a follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory, 1(2):277-294, 2005.
[3] Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. In Machine Learning and Knowledge Discovery in Databases, pages 341-357. Springer, 2011.
[4] Paul Rayson and Roger Garside. Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora, pages 1-6. Association for Computational Linguistics, 2000.
[5] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[6] Robert Östling. Stagger: an open-source part of speech tagger for Swedish. Northern European Journal of Language Technology, 3:1-18, 2013.

Appendix A: Graphs for method 1.

Figure 1: Blue is category Kampsport with Method 1, green is the optimum from the test data.
Figure 2: Blue is category Algoritmer with Method 1, green is the optimum from the test data.

Appendix B: Graphs for method 2.

Figure 3: Blue is category Kampsport with Method 2, green is the optimum from the test data.
Figure 4: Blue is category Algoritmer with Method 2, green is the optimum from the test data.

Appendix C: Graphs for method 3.

Figure 5: Blue is category Kampsport with Method 3, green is the optimum from the test data.
Figure 6: Blue is category Algoritmer with Method 3, green is the optimum from the test data.

Appendix D: Graphs for method 4.

Figure 7: Blue is category Kampsport with Method 4, green is the optimum from the test data.
Figure 8: Blue is category Algoritmer with Method 4, green is the optimum from the test data.