The taming of the data: Using text mining in building a corpus for diachronic analysis
Stefania Degaetano-Ortlieb, Hannah Kermes, Ashraf Khamis, Jörg Knappen, Noam Ordan and Elke Teich

Background
Big data (e.g. Google n-grams)
- Typically poorly contextualized
  - some meta-data (e.g. time, book title)
  - no structural data
  - no linguistic data
- Considered of limited value for linguistic analysis
Small, hand-crafted corpora (e.g. Brown corpora)
- Annotated with
  - meta-data (e.g. variety, register and time)
  - structural data (e.g. page, section)
  - linguistic data (e.g. PoS, lemma)
- Readily usable for analysis

Google n-grams example
Insights:
- Development over time
- More variation of the phrase "taming of the NOUN" over time

Google n-grams vs. Brown corpora
Google n-grams
- Diachronic development
- No linguistic distinction possible (that)
- No context for inspection
Brown corpora
- Limited diachronic perspective
- Linguistic distinction possible (that as relativizer)
- Contextualized search (KWIC, register, etc.)

Rationale
Assumption
- Scientific language becomes more informationally dense over time
- Due to specialization: greater encoding density over time, i.e. shorter linguistic forms used to maximize efficiency in communication
Approach
- Detection of linguistic features of densification
- Comparison across historical stages

Example
More dense linguistic encoding:
  The use of this control method leads to a safer and faster train operation in the most adverse weather conditions.
Less dense linguistic encoding:
  You can control the trains this way and if you do that you can be quite sure that they'll be able to run more safely and more quickly than they would otherwise, no matter how bad the weather gets.

Building new corpora
- Sources for new corpora: do they provide all relevant meta-, structural and linguistic data?
- Old Bailey sources vs. richly annotated corpus (Huber, 2007)

Motivation
- Create a corpus from uncharted material of the Philosophical Transactions and Proceedings of the Royal Society of London (RSC Corpus)
  - JSTOR material in XML
  - Containing some meta-data (e.g. time, title), but no structural data
- Enrich corpus with relevant meta-, structural, and linguistic data for diachronic linguistic analysis
- The RSC corpus sits between big data and small hand-crafted corpora

Royal Society Corpus (RSC)
Number of texts by journal, period and text type:

Journal                                                            Period     Book reviews  Articles  Miscellaneous  Obituaries  Total
Philosophical Transactions                                         1665-1678  124           641       154            -           919
Philosophical Transactions                                         1683-1775  154           3,903     338            -           4,395
Philosophical Transactions of the Royal Society of London (PTRSL)  1776-1869  -             2,531     283            -           2,814
Abstracts of Papers Printed in PTRSL                               1843-1861  -             1,316     15             -           1,331
Abstracts of Papers Communicated to RSL                            1862-1869  -             429       5              -           434
Proceedings of RSL                                                 1862-1869  -             1,476     38             14          1,528
Total                                                                         278           10,296    833            14          11,421

Size: approx. 35 million tokens
Source: XML (JSTOR)

Methods
From uncharted to enriched data: meta-data, structural data and linguistic data
Techniques
- Pattern-based techniques
- Standard computational linguistic techniques
- Data/text mining
Pipeline: uncharted JSTOR data -> enriched RSC corpus
- Hidden mark-up
- Headers/footers
- Normalization
- Tagging
- Disciplines
- Time stages
- Data quality
- Annotation

Pattern-based techniques
Structural data
- Uncover and clean hidden markup
- Identify article beginnings and endings and order scrambled pages
- Detect headers/footers, table of contents, errata
Data quality
- Detect and remove duplicates
- Eliminate OCR errors by adaptation of patterns from Underwood and Auvil (n.d.) (1,282 correction patterns)
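The two data-quality steps above can be illustrated with a minimal sketch. It assumes word-level OCR correction rules and duplicate detection by hashing normalized text. The three rules in `OCR_RULES` are invented for illustration and are not taken from the 1,282 patterns mentioned on the slide. The function names are hypothetical as well.

```python
import hashlib
import re

# Hypothetical word-level correction rules (illustration only,
# not the actual Underwood and Auvil patterns).
OCR_RULES = {"tbe": "the", "wbich": "which", "fuch": "such"}


def correct_ocr(text, rules=OCR_RULES):
    """Replace every alphabetic word that has an entry in the rule table."""
    def fix(match):
        word = match.group(0)
        return rules.get(word, word)
    return re.sub(r"[A-Za-z]+", fix, text)


def fingerprint(text):
    """Hash a case- and punctuation-insensitive version of the text."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(docs):
    """Keep the first occurrence of each document, judged by fingerprint."""
    seen, unique = set(), []
    for doc in docs:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


cleaned = correct_ocr("fuch is tbe case")
unique_docs = drop_duplicates(["An  Account.", "an account", "Another"])
```

Exact hashing only catches duplicates that differ in case, spacing or punctuation. Near-duplicates with differing OCR errors would need a fuzzier comparison.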

Standard comp. linguistic techniques
Normalization
- Spelling variation with VARD (Baron and Rayson 2008)
- Manual normalization of an extract of the RSC used to train VARD
Tokenization, segmentation, PoS tagging and lemmatization
- TreeTagger (Schmid 1994) + Perl scripts
Data quality
- Additions to abbreviation list of TreeTagger to improve segmentation
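The sketch below suggests why an extended abbreviation list helps segmentation. It is a simplified stand-in and not TreeTagger's own tokenizer. The abbreviation set and the function name `split_sentences` are assumptions made for illustration.

```python
# Hypothetical abbreviation list; a real list would be much longer.
ABBREVIATIONS = {"fig.", "dr.", "mr.", "viz.", "e.g.", "i.e."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "?", "!"))
        if ends_sentence and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "See Fig. 3 for details. Dr. Halley observed it."
with_list = split_sentences(text)
without_list = split_sentences(text, abbreviations=set())
```

With the list, the text is split into two sentences. Without it, every abbreviation period triggers a spurious boundary.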

Standard comp. linguistic techniques
Feature extraction
- Semi-automatic extraction of features relevant for diachronic analysis (Harris 1991) with CQP (CWB 2010)
- Use of word/PoS sequences in manually designed macros

Feature                  Extraction pattern                           Example
Reduction by prefix      [lemma="anti-.*"]                            anti-rheumatic remedies
Reduction by suffix      [pos="vv.*" & lemma="\w{1,}ify"]             surfaces solidify simultaneously
Omission of relativizer  [pos="dt"][pos="n.*"][pos="p.*"][pos="v.*"]  the Bodies ^ we are acquainted
Nominalization           [pos="nn.*" & lemma="\w{1,}ness"]            there is a Lake of that bigness
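To make the token-sequence patterns concrete, here is a minimal Python emulation of how such a query could be matched against annotated tokens. It does not use CQP itself. The token representation (dicts with `word`, `pos`, `lemma`), the function names and the lowercase tags in the sample are assumptions for illustration.

```python
import re


def token_matches(token, constraint):
    """A token matches if every attribute fully matches its regular expression."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraint.items())


def find_matches(tokens, pattern):
    """Slide a window over the tokens and collect the word sequences that match."""
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        if all(token_matches(t, c) for t, c in zip(window, pattern)):
            hits.append(" ".join(t["word"] for t in window))
    return hits


# The two patterns from the table, one constraint dict per token position.
OMITTED_RELATIVIZER = [{"pos": r"dt"}, {"pos": r"n.*"}, {"pos": r"p.*"}, {"pos": r"v.*"}]
NOMINALIZATION = [{"pos": r"nn.*", "lemma": r"\w{1,}ness"}]

# Hand-annotated sample tokens (tags are illustrative, not tagger output).
phrase = [
    {"word": "the", "pos": "dt", "lemma": "the"},
    {"word": "Bodies", "pos": "nns", "lemma": "body"},
    {"word": "we", "pos": "pp", "lemma": "we"},
    {"word": "are", "pos": "vbp", "lemma": "be"},
    {"word": "acquainted", "pos": "vvn", "lemma": "acquaint"},
    {"word": "bigness", "pos": "nn", "lemma": "bigness"},
]

relativizer_hits = find_matches(phrase, OMITTED_RELATIVIZER)
nominalization_hits = find_matches(phrase, NOMINALIZATION)
```

The extraction is semi-automatic because patterns like these over-generate. Hits still need manual inspection before they count as instances of the feature.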

Data mining
Discipline detection
- Topic modeling with MALLET (McCallum 2002)
- Limit of 24 topics
[Figure: example topics with their top words. Topic labels shown: chemistry, physics, languages, abbreviations, archaic words. Top words shown: light rays glass eye colours spectrum; blood heart muscles nerves stomach; acid water solution gas oxygen; force electricity current wire power; cells animal fluid eggs; leaves plant tree seed flowers; quae quam sit vero hoc; hath tis tho; la les dans en; ii iii mr fig dr]
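One way topic-model output could feed discipline detection is sketched below. It assumes per-document topic proportions (as a topic model such as MALLET produces) and a manual mapping from topic ids to labels. The ids, the threshold and the function name are hypothetical, and the slides do not state that the authors assigned disciplines this way.

```python
# Hypothetical mapping from topic id to a manually chosen label.
TOPIC_LABELS = {0: "chemistry", 1: "physics", 2: "languages"}


def assign_discipline(topic_proportions, labels=TOPIC_LABELS, threshold=0.3):
    """Label a document by its dominant topic, if that topic is strong enough."""
    topic, weight = max(topic_proportions.items(), key=lambda item: item[1])
    if weight < threshold:
        return "unassigned"
    return labels.get(topic, "unlabelled")


# Toy proportions for two documents (invented values).
doc_a = {0: 0.62, 1: 0.25, 2: 0.13}
doc_b = {0: 0.28, 1: 0.27, 2: 0.25, 3: 0.20}

label_a = assign_discipline(doc_a)
label_b = assign_discipline(doc_b)
```

Topics such as abbreviations or archaic words describe surface form, not subject matter, so a mapping like this would probably exclude them from discipline labels.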

Data mining
Detection of time periods: distance measures
- Identification of linguistic changes in corpora (Fankhauser et al. 2014a & 2014b)
- Based on Information Theory
  - Kullback-Leibler Divergence (relative entropy)
  - Unigram model + smoothing
- Assessing how typical an n-gram is to a corpus/subcorpus
Example: Bioinformatics abstracts across time
- Function words typical for 70/80s
- Nominal (denser) style in 2000s
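The distance measure can be sketched as follows. The code builds smoothed unigram models for two subcorpora, computes the Kullback-Leibler divergence D(P||Q) = sum over w of P(w) * log2(P(w) / Q(w)), and ranks words by their individual contribution to show what is typical of the first subcorpus. Additive smoothing is an assumption here, since the slide does not say which smoothing was used. The two token lists are invented toy data.

```python
import math
from collections import Counter


def unigram_probs(tokens, vocab, alpha=1.0):
    """Unigram probabilities over a shared vocabulary with additive smoothing."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kld_contributions(tokens_p, tokens_q, alpha=1.0):
    """Per-word contribution P(w) * log2(P(w) / Q(w)) to D(P || Q)."""
    vocab = set(tokens_p) | set(tokens_q)
    p = unigram_probs(tokens_p, vocab, alpha)
    q = unigram_probs(tokens_q, vocab, alpha)
    return {w: p[w] * math.log2(p[w] / q[w]) for w in vocab}


# Toy subcorpora (invented): a verbal early style and a nominal later style.
early = "it is shown that the cells are able to grow in the medium".split()
late = "cell growth medium analysis shows protein expression regulation".split()

contrib = kld_contributions(early, late)
divergence = sum(contrib.values())
most_typical_early = max(contrib, key=contrib.get)
```

In this toy case the word contributing most to the divergence is a function word, which mirrors the slide's observation about the 70/80s abstracts.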

Data mining
Detection of time periods: clustering
- Variability-based neighbor clustering algorithm (Gries & Hilpert 2008)
  - Detection of stages in diachronic data
  - Tailored to specific linguistic phenomena
- Piotrowski law
  - Language changes as a result of interaction between old forms and new forms
  - Complete change
  - Partial change
  - Reversible change
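The core idea of variability-based neighbor clustering is that only temporally adjacent periods may be merged, so the resulting clusters are contiguous stages. The sketch below is a simplified version of that idea: it uses the population standard deviation of the merged values as the merge criterion, which is an assumption, and the frequencies are invented toy data.

```python
from statistics import pstdev


def neighbor_clustering(periods, values, n_stages):
    """Merge adjacent periods bottom-up until n_stages contiguous stages remain."""
    clusters = [([p], [v]) for p, v in zip(periods, values)]
    while len(clusters) > n_stages:
        # Variability of each candidate merge of two temporal neighbours.
        costs = [
            pstdev(clusters[i][1] + clusters[i + 1][1])
            for i in range(len(clusters) - 1)
        ]
        i = costs.index(min(costs))
        merged = (
            clusters[i][0] + clusters[i + 1][0],
            clusters[i][1] + clusters[i + 1][1],
        )
        clusters[i:i + 2] = [merged]
    return [stage_periods for stage_periods, _ in clusters]


# Toy relative frequencies of one feature per period (invented values).
periods = [1700, 1720, 1740, 1760, 1780, 1800]
freqs = [2.0, 2.2, 2.1, 5.0, 5.3, 5.1]

stages = neighbor_clustering(periods, freqs, n_stages=2)
```

Because merges are restricted to neighbours, two periods with similar values but far apart in time are never grouped. That restriction is what separates stage detection from ordinary clustering.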

Data mining
Feature detection
Classification/Ranking
- Classify time periods by linguistic features relevant for dense/less dense encodings
- Use feature weights to detect relevant features
Pattern mining
- Squeeze (Vreeken 2010)
  - Looks for interesting patterns
- Desq (Gemulla forthcoming)
  - Looks for patterns of a desired form
  - Makes use of a hierarchy (e.g. WordNet)
  - Example: anti-smth; PersPron opposes SMTH; is against

Conclusions
Royal Society Corpus: meta-data, structural data, linguistic data
- High-quality corpus from relatively big data with affordable automatic and manual effort
  - Continuously improve data quality
- Tailored to linguistic research
  - Comparison of historical stages / disciplines over time
  - Inspection of linguistic features of densification

Thank you for your attention!
Thanks to the team: Sarah Thiry, Ashraf Khamis, Peter Fankhauser, Elke Teich, Jörg Knappen