Web as Corpus: Using Web Data for Linguistic Purposes. Ines Rehbein. NCLT, Dublin City University.

Outline
1 Corpora as linguistic tools
2 Limitations of web data
  Strategies to enhance web data
3

Corpora as linguistic tools
"Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list." (Chomsky 1959, 159)
"What do you think of corpus linguistics?" "It doesn't exist." (Chomsky answering a question by Bas Aarts, reported in a talk at the Corpus Linguistics conference, Freiburg 2001)

"Corpora crashed into computational linguistics at the 1989 ACL meeting in Vancouver: but they were large, messy, ugly objects clearly lacking in theoretical integrity in all sorts of ways..." (Kilgarriff, 2003)
The special issue of Computational Linguistics on Using Large Corpora (Church and Mercer, 1993) changed the role of corpora in computational linguistics.

Web as corpus
First publications at ACL 1999
Since then the web has been used as a data source for:
  Word Sense Disambiguation (Rigau et al., 2002)
  Machine Translation (Way and Gough, 2003)
  Overcoming data sparseness in Language Modeling (Volk, 2001; Lapata and Keller, 2003)
  Answers for Question-Answering applications (Dumais et al., 2002; Zheng, 2002)
  New instances for Ontologies (Agirre et al., 2000)
  Sublanguage corpora for Translation (Varantola, 2000)
  Language Teaching (Fletcher, 2002)

What is a corpus?
McEnery and Wilson (1996):
  Sampling and representativeness
  Finite (and fixed) size
  Machine-readable
  Standard reference
Manning and Schütze (1999):
  Certain amount of data from a certain domain of interest
Kilgarriff (2003):
  A collection of texts
Is the Web a Corpus?

Requirements for corpus design
Standardisation
  Comparison/exchange with respect to other corpora
Flexibility
  Adding new layers of annotation, multimodality
Detailed linguistic annotation with good search facilities
  Consistency in annotation
Import/Export
  Add new data, create subcorpora, export search results

Issues in corpus creation
Where to get the data?
  Accessibility, data sparseness
How to digitalise the data?
  Time-consuming, costly
How to annotate the data?
  Time-consuming, linguistic decisions, inter-annotator agreement
How to guarantee representativity and reliability?
  The philologist's dilemma
  God's truth fallacy
  Mystery of vanishing reliability (Rissanen, 1989)
How to get enough data?
  "There's no data like more data"

Web as Solution for Sparse Data Problems?
Advantages
  Lots of data
  Freely available
  Already digitalised
Disadvantages
  No (reliable) meta-information
  No annotation, no control of search tool
  No control of precision and recall of search results (essential for quantitative studies)
  No control of contents
  No stability: results cannot be replicated

No control of the search tool
Problem: No control of indexing and search strategies
Found on Jean Véronis's blog in Feb 2005:
"If you type 'Chirac OR Sarkozy', you get half the number of results of 'Chirac' alone, which may have a political explanation... but is a weird approach to boolean logic. If you search 'the' in the English pages, you get 1% of the number you get for all languages together. Does this mean that 'the' is 99 times more frequent in languages other than English?"
(http://aixtal.blogspot.com/2005/02/web-googles-missing-pagesmystery.html)

No control of the search tool
Indexing and search strategies of a commercial search engine may be modified at any time without notice
Google: index update with in-depth correction of extrapolation routines and boolean logic (Mar 2005)
(http://aixtal.blogspot.com/2005/03/google-snapshot-ofupdate.html)

No control of the search tool
Hit counts by query (Google in November 2006):
Query        Google IE       Google ALL
cat          1 190 000      389 000 000
cat OR cat   1 190 000      465 000 000
dog            854 000      275 000 000
dog OR dog     850 000      353 000 000
cat OR dog   1 400 000      448 000 000
dog OR cat   1 360 000      454 000 000
the         15 000 000    5 380 000 000
the OR the  15 000 000    9 190 000 000
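The inconsistencies in these counts can be made explicit with a small check. The sketch below is only an illustration: the function name check_boolean_consistency and the three rules it tests (X OR X should equal X, X OR Y should lie between the larger single count and the sum of both, and X OR Y should equal Y OR X) are assumptions about what "sane" boolean counts would look like, not anything the search engine documents.

```python
# Hit counts copied from the November 2006 table.
HITS = {
    "IE": {
        "cat": 1_190_000, "cat OR cat": 1_190_000,
        "dog": 854_000, "dog OR dog": 850_000,
        "cat OR dog": 1_400_000, "dog OR cat": 1_360_000,
        "the": 15_000_000, "the OR the": 15_000_000,
    },
    "ALL": {
        "cat": 389_000_000, "cat OR cat": 465_000_000,
        "dog": 275_000_000, "dog OR dog": 353_000_000,
        "cat OR dog": 448_000_000, "dog OR cat": 454_000_000,
        "the": 5_380_000_000, "the OR the": 9_190_000_000,
    },
}


def check_boolean_consistency(counts):
    """Return descriptions of counts that violate basic set logic."""
    problems = []
    for query, n in counts.items():
        if " OR " not in query:
            continue
        left, right = query.split(" OR ")
        if left == right:
            # Idempotence: "X OR X" should return exactly the hits for "X".
            if n != counts[left]:
                problems.append(f"{query}: {n} != {counts[left]}")
            continue
        # Bounds: a union is at least the larger set, at most the sum.
        low = max(counts[left], counts[right])
        high = counts[left] + counts[right]
        if not low <= n <= high:
            problems.append(f"{query}: {n} outside [{low}, {high}]")
        # Commutativity: report each unordered pair only once.
        mirrored = f"{right} OR {left}"
        if left < right and mirrored in counts and counts[mirrored] != n:
            problems.append(f"{query} != {mirrored}: {n} vs {counts[mirrored]}")
    return problems


if __name__ == "__main__":
    for index_name, counts in HITS.items():
        print(index_name)
        for problem in check_boolean_consistency(counts):
            print("  ", problem)
```

Counts that fail such a check cannot be treated as frequencies in a quantitative study.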

Lots of problems with web data...
Can we use it at all for linguistic purposes?
What type of research questions can be answered by using web data?

Example: Productivity of non-medical -itis (Lüdeling and Evert, 2004)
medical -itis:
  Combines with neoclassical stems denoting body parts
  Semantics: "inflammation of X" (arthritis, appendicitis)
non-medical -itis:
  Derived from medical -itis
  Semantics: "hysteria" or "excessively doing something"
"Possibly they are apt to become too ambitious - they rarely succumb to the disease of fontitis but are only too apt to have bad attacks of linkitis and activitis." (BNC, CG9:500)

Example: Productivity of non-medical -itis
Quantitative: Is word formation with non-medical -itis productive?
Qualitative: With which bases does non-medical -itis combine?
Distributional: In which contexts are the resulting complex words used?
Comparative: What are the differences between the English and the German affix? Is one of them more productive than the other?
Diachronic: When did non-medical -itis start to appear and what is its development?
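The quantitative question can be illustrated with a small counting sketch. The slides do not say which productivity measure is used, so the choice here is an assumption: a hapax-based index in the style of Baayen, P = V1 / N, where V1 is the number of -itis types that occur exactly once and N is the number of -itis tokens. The function name productivity, the exclusion list for medical terms, and the toy token list are all hypothetical.

```python
from collections import Counter


def productivity(tokens, suffix="itis", exclude=frozenset()):
    """Count types, tokens and hapaxes for words ending in `suffix`.

    `exclude` holds forms to leave out (here: established medical terms).
    P = hapaxes / tokens is an assumed, Baayen-style productivity index.
    """
    counts = Counter(
        t.lower()
        for t in tokens
        if t.lower().endswith(suffix) and t.lower() not in exclude
    )
    n_tokens = sum(counts.values())
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": len(counts),
        "hapaxes": hapaxes,
        "P": hapaxes / n_tokens if n_tokens else 0.0,
    }


if __name__ == "__main__":
    # Toy data built from the words in the BNC example; not real counts.
    sample = ["fontitis", "linkitis", "activitis", "linkitis",
              "arthritis", "appendicitis"]
    medical = frozenset({"arthritis", "appendicitis"})
    print(productivity(sample, exclude=medical))
```

A measure like this needs true token counts from a fixed corpus, which search-engine page counts do not provide.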

Example: Productivity of non-medical -itis
Type of study                                        BNC   DWDS   Google
quantitative (find new types)                        yes   yes    no
qualitative (find new tokens)                        yes   yes    yes
distributional (look at context)                     yes   yes    yes
comparative (meta-data, number of tokens/category)   yes   no     no
diachronic (date of origin)                          no    yes    no
BNC: not diachronic, too old
DWDS: not (yet) stable enough, only accessible through web interface
Web: no meta-data, no annotation, not stable

How to overcome the limitations of web data?
Two strategies:
1 Edit data from the search engine
  WebCorp (Kehoe and Renouf, 2002)
  KWiCFinder (Fletcher, 2001)
  The Linguist's Search Engine (Elkiss and Resnik, 2004)
2 Create your own corpus from the web
  BootCaT (Baroni and Bernardini, 2004)
  Do it yourself: crawling, post-processing, annotating and indexing web data
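For the do-it-yourself route, the post-processing step might look roughly like the sketch below. It is a minimal illustration using only the Python standard library, under simplifying assumptions: pages are already downloaded as HTML strings, boilerplate removal is reduced to dropping script and style content, and duplicates are detected only when the normalised text is identical. The names TextExtractor, html_to_text and deduplicate are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the content of script/style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip markup and collapse all whitespace to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def deduplicate(pages):
    """Keep the first copy of each page whose lower-cased text is identical."""
    seen, unique = set(), []
    for html in pages:
        text = html_to_text(html)
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    pages = ["<p>a b</p>", "<div>A  b</div>", "<p>c</p>"]
    print(deduplicate(pages))
```

A real pipeline would also need near-duplicate detection, language identification and annotation, which this sketch leaves out.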

WebCorp (Kehoe and Renouf, 2002)
Web-based interface to commercial search engines
More powerful query syntax (wildcards)
Output:
  keyword in context
  word frequency lists
  collocation statistics
  source document
Limitations
  Same as the original search engine (normalisations, stability, lack of control, no meta-information, no linguistic annotation)
  High precision, but low recall (for "I like *ing": fewer hits (10) than in the BNC (295))
  No random subset of results, but dependent on search engine ranking (popularity, ...)
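The kind of output listed above can be sketched on local text. This is not WebCorp's implementation or query syntax; it is a minimal keyword-in-context routine under the assumption that * stands for any run of word characters within a single word. The function names wildcard_to_regex and kwic are hypothetical.

```python
import re


def wildcard_to_regex(query):
    """Compile a query in which * matches any run of word characters."""
    parts = [re.escape(part) for part in query.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)


def kwic(text, query, width=30):
    """Return (left context, match, right context) for every hit."""
    pattern = wildcard_to_regex(query)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


if __name__ == "__main__":
    sample = ("I like swimming in summer. She said I like tea, "
              "but I like reading more.")
    for left, keyword, right in kwic(sample, "I like *ing"):
        print(f"{left:>30} [{keyword}] {right}")
```

Running the same query over a fixed local corpus returns every match, which is what makes recall controllable in a way that a ranked search-engine result list is not.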

BootCaT (Baroni and Bernardini, 2004)
Create specialised language corpora for terminographical work
Build general corpora of the size of the BNC (Sharoff, submitted; http://corpus.leeds.ac.uk/internet.html)
Procedure:
  Select initial seeds
  Run Google queries
  Retrieve corpus
  Extract seeds (unigram terms)
  Extract multi-word terms
No meta-information
Linguistic annotation, control of search results
Stability, replicability
Limited in size
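Two steps of this loop can be sketched without calling any search engine. The code below is not BootCaT itself: build_queries combines seed words into small random tuples to use as queries, and extract_seeds picks new unigram seeds by comparing relative frequencies in the retrieved corpus against a reference corpus. The frequency-ratio score with add-one smoothing is a stand-in chosen for brevity, and both function names and all parameters are hypothetical.

```python
import itertools
import random
from collections import Counter


def build_queries(seeds, tuple_size=3, n_queries=10, rng=None):
    """Combine seed words into distinct random tuples, one query per tuple."""
    rng = rng or random.Random(0)  # fixed seed so runs can be replicated
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


def extract_seeds(corpus_tokens, reference_tokens, top_n=5):
    """Rank corpus words by relative frequency against a reference corpus."""
    corpus = Counter(corpus_tokens)
    reference = Counter(reference_tokens)
    n_corpus = sum(corpus.values())
    n_reference = sum(reference.values())

    def ratio(word):
        # Add-one smoothing keeps unseen reference words from dividing by zero.
        return (corpus[word] / n_corpus) / (
            (reference[word] + 1) / (n_reference + 1)
        )

    return sorted(corpus, key=lambda w: (-ratio(w), w))[:top_n]


if __name__ == "__main__":
    print(build_queries(["corpus", "web", "query", "crawl"], tuple_size=2,
                        n_queries=3))
    print(extract_seeds(["corpus", "corpus", "the", "web", "the"],
                        ["the"] * 5 + ["a"], top_n=2))
```

The extracted seeds would feed the next round of queries, so the corpus grows iteratively from a handful of initial terms.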

WaCky: kool ynitiative
Informal initiative to rapidly build 1-billion-token proof-of-concept Web corpora in 3 languages, and a toolkit to collect, process and exploit such large corpora

Corpora are a useful tool for linguistics, but they have to follow certain design criteria
Linguistic studies based on web corpora are highly problematic
But: simple algorithms using web data often outperform more sophisticated methods based on smaller but controlled data sets
Use the web where it makes sense, but keep the pitfalls in mind!

Thank You! Questions?

References
Baroni, Marco and Silvia Bernardini (2004). BootCaT: Bootstrapping corpora and terms from the Web. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC-2004), Lisbon.
BNC: http://www.natcorp.ox.ac.uk/
Chomsky, Noam (1957). Syntactic structures. The Hague, 159.
Church, Kenneth W. and Robert L. Mercer (1993). Introduction to the special issue on Computational Linguistics using large corpora. Computational Linguistics, 19(1), 1-24.
DWDS: http://www.dwds.de
Elkiss, Aaron and Philip Resnik (2004). The Linguist's Search Engine User's Guide. Available at: http://lse.umiacs.umd.edu:8080/lseuser (March 29, 2005).
Fletcher, William H. (2001). Concordancing the Web with KWiCFinder. In: Proceedings of the 3rd North American Symposium on Corpus Linguistics and Language Teaching, Boston. Draft version: http://kwicfinder.com/fletchercllt2001.pdf (March 22, 2005).
Google: http://www.google.com
Kehoe, Andrew and Antoinette Renouf (2002). WebCorp: Applying the Web to linguistics and linguistics to the Web. In: Proceedings of the WWW 2002 Conference, Honolulu.
Kilgarriff, Adam and Gregory Grefenstette (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3).
Lüdeling, Evert, and Baroni (to appear). Using Web Data for Linguistic Purposes.
Lüdeling, Anke and Stefan Evert (2004). The emergence of productive non-medical -itis: corpus evidence and qualitative analysis. In: Proceedings of the First International Conference on Linguistic Evidence, Tübingen, Germany.
Manning and Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press.
McEnery, Tony and Andrew Wilson (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.
Rissanen, M. (1989). Three problems connected with the use of diachronic corpora. ICAME Journal 13: 16-19.
Sharoff, Serge (submitted). Open-source Corpora: using the net to fish for linguistic data.
WaCky: http://wacky.sslmit.unibo.it/doku.php
Way, A. and N. Gough (2003). Developing and Validating an Example-Based Machine Translation System using the World Wide Web. Computational Linguistics: special issue on the Web as Corpus.