Assignment 3: Clustering


Machine Learning for Language Technology
Individual Home Assignment: Clustering

Published online: 7 Nov 2016 (CHANGELOG 12 DECEMBER 2016)
SUBMISSION DEADLINE: SUNDAY 15 JAN 2017, 23:59

Assignment deadlines
18 Dec 2016: Ass1 and Ass2
15 Jan 2017: Ass 3
24 Feb 2017: Final submission date for all assignments.

Learning objectives

In this assignment you are going to use k-Means and hierarchical clustering, as implemented in Weka, to perform unsupervised classification and exploration of the text categories included in the Swedish national corpus.

NB: In this assignment, you are required to select the machine learning methods and the options to be used in the tasks by yourselves, without step-by-step instructions. By now, you are familiar with the algorithms we studied in the course and you should be able to find your way around Weka.

Data

The SUC datasets

Download the datasets to your computer: <http://stp.lingfil.uu.se/~santinim/ml/2016/datasets/suc_datasets/>

The Stockholm-Umeå Corpus (or SUC) is the Swedish national corpus. The SUC is a collection of Swedish texts from the 1990s, consisting of one million words. The original Weka SUC dataset was recently created by Johan Falkenjack using readability features [1]. This original dataset was then divided into several subsets to carry out a number of text classification experiments [2].

The SUC contains 500 samples of texts with a length of about 2,000 words each. Technically speaking, the SUC is divided into 1040 bibliographically distinct text chunks, each assigned to a category and a subcategory. The SUC contains nine top categories and 48 subcategories.

Dataset names are self-explanatory. Each dataset contains the same number of readability features (i.e., 118 features), but a different number of classes and texts. See the breakdown of the datasets in Table 1. Capital letters indicate SUC text categories: see Table 2 in the Appendix for the full list of domains, genres and subgenres.
[1] See Falkenjack et al. (2013). Features indicating readability in Swedish text. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NoDaLiDa-2013), Oslo, Norway <http://stp.lingfil.uu.se/~santinim/ml/2016/assignments/_swedishreadabilityfeatures2013_ecp1385008.pdf>.
[2] See Falkenjack et al. (2016). An Exploratory Study on Genre Classification using Readability Features. The Sixth Swedish Language Technology Conference (SLTC), Umeå University, 17-18 November, 2016 <http://stp.lingfil.uu.se/~santinim/ml/2016/assignments/_sltc_2016_paper_19.pdf>.

Table 1. Breakdown of the SUC datasets.

Dataset                                                                Classes
SUCdataset_SubGenres_48_1040texts_118readabilityCues.arff              48 subgenres
SUCdataset_TopGenres_9_1040rows_118_ReadabilityCues.arff               9 genres (A, B, C, E, F, G, H, J, K)
SUCdataset_TopGenresWithoutMisc_8_895texts_118readabilityCues.arff     8 genres (without H)
SUCdataset_SelectedGenres_6_709texts_118readabilityCues.arff           6 selected genres (A, B, C, G, J, K)
SUCdataset_SelectedDomainsWithMisc_3_331texts_118readabilityCues.arff  2 domains + Miscellaneous (E, F, H)
SUCdataset_SelectedDomains_2_186texts_118readabilityCues.arff          2 domains (E, F)

Goal of the Assignment

The goal of this assignment is to explore to what extent k-Means and hierarchical clustering, in combination with readability features, make sense of SUC's text categories. Since clustering does not rely on labelled examples, it needs robust features capable of revealing sensible patterns in data. The underlying assumption is that domain and genre are two different notions that are not represented by the same type of features.

The following theoretical distinction is provided to tell the notions of genre and domain apart:

Domain is a subject field. Domain refers to the shared general topic of a group of texts. For instance, Fashion, Leisure, Business, Sport, Medicine or Education are examples of broad domains. In text classification, domains are normally represented by topical features, such as content words.

Genre is a more abstract concept. It characterizes text varieties on the basis of conventionalized textual patterns. For instance, an academic paper obeys textual conventions that differ from the textual conventions of a tweet; or a letter complies with conventions that are different from the conventions of an interview. Academic papers, tweets, letters and interviews are examples of genres. Genre conventions usually affect the organization of the document (its rhetorical structure and composition), the length of the text, the syntax and the morphology (e.g. passive forms vs. active forms), vocabulary richness, etc. In text classification, genres are often represented by features such as POS tags, character n-grams, or POS n-grams.

How do readability features work on the whole SUC (9 text categories), on SUC subcategories (48 classes), on the six genres, on the 2 domains, etc.? Run k-Means and hierarchical clustering on all the datasets listed in Table 1 to explore the efficiency of readability features using unsupervised machine learning algorithms.
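As a rough illustration of what k-Means does with numeric feature vectors such as the 118 readability features, here is a minimal sketch in plain Python. It is not Weka's implementation: the function names, the two-dimensional toy data and the hand-picked initial centroids are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points and updating centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean_vector(members)
    return assignments, centroids


def within_cluster_sse(points, assignments, centroids):
    """Sum of squared distances from each point to its own centroid."""
    return sum(squared_distance(p, centroids[a])
               for p, a in zip(points, assignments))


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
toy_assignments, toy_centroids = k_means(
    toy_points, [toy_points[0], toy_points[3]])
toy_sse = within_cluster_sse(toy_points, toy_assignments, toy_centroids)
print(toy_assignments, toy_centroids, toy_sse)
```

In the sketch the number of clusters is simply the number of initial centroids supplied, which mirrors the need to set the number of clusters per dataset in Weka.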

G tasks

Theoretical question

Q1: Describe and comment on the main differences between k-Means and hierarchical clustering. Discuss the advantages and disadvantages of both.

Part 1

Start Weka and choose the Explorer interface. Work with the SUC datasets. Cluster every SUC dataset using k-Means and HierarchicalClusterer. For both clustering algorithms and for all the datasets, choose "Classes to clusters evaluation" in the Cluster mode pane. Remember to change the number of clusters according to the number of categories of the dataset at hand. Create a table to organize your results.

Q2: What is the best performance? How successful have the clustering algorithms been all in all? Looking at each class individually, can you spot particular classes that are consistently well identified by the clustering algorithms? Classes that are poorly identified? Which classes are mostly confused with each other? Provide an interpretation of the clustering results based on the evidence you obtained.
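To help read the numbers reported by the classes-to-clusters evaluation, the idea can be sketched in a few lines of Python. This is a simplified illustration only: it labels each cluster with its majority class by a plain vote, which is an assumption and may differ in detail from how Weka maps clusters to classes. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and measure the error."""
    # counts[cluster][label] = number of texts of that class in the cluster
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        counts[cluster][label] += 1
    # Majority class per cluster (simplifying assumption, see lead-in).
    mapping = {cluster: counter.most_common(1)[0][0]
               for cluster, counter in counts.items()}
    # A text counts as an error if its class differs from its cluster's label.
    errors = sum(1 for cluster, label in zip(cluster_ids, class_labels)
                 if mapping[cluster] != label)
    return mapping, counts, errors / len(class_labels)


# Hypothetical toy example with two clusters and the two domains E and F.
toy_clusters = [0, 0, 0, 1, 1, 1, 1]
toy_classes = ["E", "E", "F", "F", "F", "F", "E"]
toy_mapping, toy_counts, toy_error = classes_to_clusters(
    toy_clusters, toy_classes)
print(toy_mapping, dict(toy_counts), toy_error)
```

The per-cluster counts play the role of a confusion matrix: they show which classes end up together in the same cluster, which is the kind of evidence Q2 asks you to interpret.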

VG tasks

Theoretical question

Q3: Describe the k-Means optimization objective in simple words.

Choose the best clustering result you get with k-Means. To get a concise description of the best clustering produced, we are going to give it to a tree classifier. In the Visualize cluster assignments window, select Save to output the cluster assignments to a data file. In the data file, replace Cluster with class in @attribute Cluster {cluster1, cluster2, cluster3}. Load this file and apply J48 (disable pruning and keep the parameter M at its default value of 2). Evaluate on the training set and with 10-fold cross-validation.

Q4: Do you get a good description of the clusters? Visualize the trees. Is it what you expected? Explain and interpret what you see.

To be submitted

A written report (at least 2 pages) containing the reasoned answers to the tasks and questions above and a short section where you summarize your reflections and experience. If you just cut and paste the Weka results page into the report without commenting or explaining the whys and wherefores, you might fail the assignment.

Submit the report in PDF format to santinim@stp.lingfil.uu.se no later than 15 January 2017, 23:59. Please write this phrase in the subject line of your email: ML4LT 2016 Ass3: your name. Attach any additional material that you think is important to fully understand your report. There is no need to paste the Weka results page into your report unless your discussion requires it.

Naming conventions

Please name your PDF report in this way (it will be easier for me to organize and archive them): surname_name_ass3report.pdf (ex: santini_marina_ass3report.pdf).
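The attribute rename described in the VG tasks can be done by hand in a text editor. If you prefer to script it, the following is a minimal sketch in Python, assuming the saved file declares the cluster attribute as a nominal attribute named Cluster, as in the example above. The function name and the sample ARFF text are hypothetical.

```python
import re


def rename_cluster_attribute(arff_text, old_name="Cluster", new_name="class"):
    """Rename one nominal attribute declaration, leaving data rows untouched."""
    pattern = re.compile(
        r"^(\s*@attribute\s+)" + re.escape(old_name) + r"(\s+\{)",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    # Only the first matching declaration is rewritten.
    renamed, replacements = pattern.subn(
        lambda m: m.group(1) + new_name + m.group(2), arff_text, count=1)
    if replacements == 0:
        raise ValueError(f"no nominal attribute named {old_name!r} found")
    return renamed


# Hypothetical miniature ARFF file standing in for the saved cluster output.
sample_arff = (
    "@relation toy\n"
    "@attribute f1 numeric\n"
    "@attribute Cluster {cluster1,cluster2}\n"
    "@data\n"
    "0.5,cluster1\n"
)
converted_arff = rename_cluster_attribute(sample_arff)
print(converted_arff)
```

To apply it to a real file, read the saved ARFF into a string, pass it to the function, and write the result to a new file before loading it in Weka. The values in the data rows (cluster1, cluster2, ...) are left as they are, because only the header line starting with @attribute is matched.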

Appendix

Table 2. SUC's text categories, divided into genre, domain and mixed. Each main category is followed by its type (genre, domain or mixed) and its subcategories.

A  Press, Reportage (genre)
   AA. Political; AB. Community; AC. Financial; AD. Cultural; AE. Sports; AF. Spot News

B  Press, Editorials (genre)
   BA. Institutional; BB. Debate articles

C  Press, Reviews (genre)
   CA. Books; CB. Films; CC. Art; CD. Theatre; CE. Music; CF. Artists, shows; CG. Radio, TV

E  Skills, trades and hobbies (domain)
   EA. Hobbies, amusements; EB. Society press; EC. Occupational and trade union press; ED. Religion

F  Popular lore (domain)
   FA. Humanities; FB. Behavioural sciences; FC. Social sciences; FD. Religion; FE. Complementary life styles; FF. History; FG. Health and medicine; FH. Natural science, technology; FJ. Politics; FK. Culture

G  Biographies, essays (genre)
   GA. Biographies, memoirs; GB. Essays

H  Miscellaneous (mixed)
   HA. Federal publications; HB. Municipal publications; HC. Financial reports, business; HD. Financial reports, non-profit organisations; HE. Internal publications, companies; HF. University publications

J  Learned and scientific writing (genre)
   JA. Humanities; JB. Behavioural sciences; JC. Social sciences; JD. Religion; JE. Technology; JF. Mathematics; JG. Medicine; JH. Natural science

K  Imaginative prose (genre)
   KK. General fiction; KL. Mysteries and science fiction; KN. Light reading; KR. Humour

--the-end--