Corpus Linguistics. Anca Dinu February, 2017


Where did corpus linguistics come from?
Understanding that language in use is worthy of study;
Understanding that large quantities of authentic language are needed for meaningful study;
Understanding that context is important;
General shift in social sciences to empiricism;
Rise of technology;
Recognition that a data-based approach opens up research.

Corpus Linguistics
Corpus linguistics is a method of carrying out linguistic analyses: the analysis of naturally occurring language on the basis of computerized corpora. The analysis is usually performed with the help of a computer, i.e. with specialized software, and takes into account the frequency of the phenomena investigated.

Corpus linguistics has become one of the most widespread methods of linguistic investigation. It can be used to investigate many kinds of linguistic questions, and it has the potential to yield highly interesting, fundamental, and often surprising new insights about language.

Linguistic Data
What data do linguists use to investigate linguistic phenomena? Roughly, four types of data can be distinguished:
1) data gained by intuition
   a) the researcher's own intuition ("introspection")
   b) other people's intuition (accessed, for example, by elicitation tests)
2) naturally occurring language
   a) randomly collected texts or occurrences ("anecdotal evidence")
   b) systematic collections of texts: CORPORA

Corpora
A corpus is a systematic collection of naturally occurring texts (of both written and spoken language). "Systematic" means that the structure and contents of the corpus follow certain extralinguistic principles or criteria.

For example, the texts or transcriptions in a corpus are often restricted to a certain time span, domain, genre, style, dialect, or language. If several of these subcategories are present in a corpus, they are often represented by the same amount of text and kept separate within the corpus. Different types of corpora are used for different kinds of analysis.

Use of corpora
In linguistics, corpora are typically used for:
the validation or invalidation of linguistic hypotheses and the statistical analysis of linguistic data (Corpus Pattern Analysis, frequency lists, word co-occurrences, concordances, idioms, structures);
the (semi-)automated extraction of data (such as argument structure or thematic roles) for the creation of electronic lexicons.
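As an illustration of the frequency lists mentioned above, here is a minimal Python sketch. The tokenizer (lowercased alphabetic runs), the function name `frequency_list`, and the two-sentence sample are assumptions made for this example; real corpus tools apply more careful tokenization.

```python
import re
from collections import Counter

def frequency_list(text, top_n=5):
    """Return the top_n most frequent word forms with their counts."""
    # Assumption: a token is a lowercased run of letters,
    # optionally with one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common(top_n)

# Toy sample text (hypothetical, not taken from any real corpus).
sample = "The cat sat on the mat. The dog sat on the log."
for word, count in frequency_list(sample, top_n=3):
    print(f"{word}\t{count}")
```

On a real corpus the same counting would be run over all texts, and raw counts are usually normalized (for example per million words) before corpora of different sizes are compared.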

What corpora are there?
Depending on the type of text or transcript, corpora can be:
general/reference corpora (vs. specialized corpora), e.g. the BNC (British National Corpus) or the Bank of English, which aim at representing a language or variety as a whole (they contain both spoken and written language, different text types, etc.).
historical corpora (vs. corpora of present-day language), e.g. the Helsinki Corpus or ARCHER, which aim at representing an earlier stage or earlier stages of a language.

regional corpora (vs. corpora containing more than one variety), e.g. the WCNZE (Wellington Corpus of Written New Zealand English), which aim at representing one regional variety of a language.
learner corpora (vs. native speaker corpora), e.g. the ICLE (International Corpus of Learner English), which aim at representing the language as produced by learners of this language.
multilingual corpora (vs. one-language corpora), which aim at representing at least two different languages, often with the same text types (for contrastive analyses).
spoken corpora (vs. written or mixed corpora), e.g. the LLC (London-Lund Corpus of Spoken English), which aim at representing spoken language.

Annotation
Annotation of a corpus means that some kind of linguistic analysis, such as sentence analysis or part-of-speech tagging, has already been performed and marked on the texts. Depending on the type of annotation made on the text or transcript, a corpus can be unannotated (orthographic, raw, with only meta-annotation), or phonetically, morphologically, syntactically, semantically, or pragmatically annotated.

Annotation schemata should focus on a single coherent theme: different linguistic phenomena should be annotated separately over the same corpus. Annotations must be consistent with each other: unification and merging of multiple annotations is necessary.

Example of semantic annotation
Predicators and their named arguments: [The man]_agent painted [the wall]_patient.
Anaphors and their antecedents: [The protein] inhibits growth in yeast. [It] blocks production...
Acronyms and their long forms: [Platelet-derived growth factor] (known as [pdgf]) impacts...
Semantic typing of entities: [The man]_human fired [the gun]_firearm.
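One possible way to store bracketed annotations like these is as stand-off spans: character offsets into the unchanged text, each tagged with a layer and a label. The sketch below is an assumption for illustration only. The `Span` class, the `annotate` helper, and the layer names "role" and "type" are hypothetical and do not come from any particular annotation tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    layer: str   # hypothetical layer name, e.g. "role" or "type"
    label: str   # the annotation itself, e.g. "agent"

def annotate(text, phrase, layer, label):
    """Create a Span covering the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), layer, label)

sentence = "The man painted the wall."
annotations = [
    annotate(sentence, "The man", "role", "agent"),
    annotate(sentence, "the wall", "role", "patient"),
    annotate(sentence, "The man", "type", "human"),
]

# Each layer can be read separately, yet all refer to the same text.
for a in annotations:
    print(a.layer, a.label, repr(sentence[a.start:a.end]))
```

Keeping the layers as separate spans over one text lets different phenomena be annotated independently and still be merged later, because they share the same offsets.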

Corpus annotation is usually made in a standardized manner with XML (Extensible Markup Language), which is designed to be both human- and machine-readable via intuitive tags, or following the TEI (Text Encoding Initiative), a text-centric community of practice that has defined text encoding guidelines in XML format.
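To show what XML annotation can look like in practice, the following Python sketch reads a tiny part-of-speech-tagged sentence with the standard library's `xml.etree.ElementTree`. The element names `s` and `w`, the `pos` attribute, and the tag values are illustrative assumptions, not the schema of any specific corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one sentence <s> made of word elements <w>,
# each carrying an assumed part-of-speech attribute "pos".
xml_text = """
<s id="s1">
  <w pos="DET">The</w>
  <w pos="NOUN">man</w>
  <w pos="VERB">painted</w>
  <w pos="DET">the</w>
  <w pos="NOUN">wall</w>
</s>
""".strip()

sentence = ET.fromstring(xml_text)

# Collect (word form, tag) pairs in document order.
tagged = [(w.text, w.get("pos")) for w in sentence.iter("w")]
print(tagged)

# The annotation makes queries by category possible, e.g. all nouns.
nouns = [form for form, pos in tagged if pos == "NOUN"]
print(nouns)
```

Because the tags are machine-readable, a search can target a category (all nouns) instead of a fixed string, which a raw, unannotated text does not allow.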

Corpus Software
In principle, two types of software for corpus analysis can be distinguished:
software that is tailored to one specific corpus (such as SARA and BNCWeb for the BNC, or ICE-CUP for ICE-GB);
software that can be used with almost any kind of corpus (such as AntConc, MonoConc Pro, and WordSmith Tools, which is probably the most widely used corpus software).
We will use AntConc.

What can the software do?
While the software packages designed for corpus analysis differ in many ways, practically all of them can perform certain basic functions. For most kinds of linguistic analysis, the most important of these is searching the corpus for the occurrence or co-occurrence of certain strings (words or phrases).

As output, the software usually reports:
the number of occurrences of these strings in the corpus;
the texts in which they were found;
the so-called concordance lines, which show the string in its context (with the search term(s) highlighted).
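The output described above can be approximated in a few lines of Python. This is a minimal sketch, not how AntConc or any other tool is implemented: the function name `concordance`, the fixed context width, and the square brackets used as "highlighting" are assumptions, and it searches a single text, not a collection of files.

```python
import re

def concordance(text, term, width=20):
    """Return (count, lines): the number of hits for term and one
    keyword-in-context line per hit."""
    # Match the term as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the hits line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return len(lines), lines

# Toy sample text (hypothetical).
sample = ("A corpus is a systematic collection of texts. "
          "Each corpus follows certain criteria.")
count, lines = concordance(sample, "corpus")
print(count)
for line in lines:
    print(line)
```

Aligning the search term in one column is what makes recurring patterns to its left and right easy to spot when many lines are read together.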
