Corpus Linguistics and Multivariate Statistics

Corpus Linguistics and Multivariate Statistics Seminar 1 Dylan Glynn www.dsglynn.univ-paris8.fr dglynn@univ-paris8.fr

What is a corpus?

"You shall know a word by the company it keeps" (J. R. Firth 1957)

Three basic, but inter-related, approaches to corpora
1. Comparison corpora - Concordances
2. Formal patterns with a corpus - Collocations
3. Meaning patterns within a corpus - Correlations
These methods vary massively in complexity and application.
They are typically used to answer questions in:
- Discourse Analysis
- Critical Discourse Analysis
- Sociolinguistics
- Semantics and Pragmatics
- Phonology
- Morpho-syntax
- Stylistics...

Comparison corpora - Concordances
The most basic type of concordance is the list of words and their frequencies in a body of writing. Concordances have been used for hundreds of years, especially in Theology and Literature. They are still important in stylistics and in many other areas of research. They are very quick and easy to compile and often represent the first step of more advanced studies.
For example, take the parliamentary speeches of the Tories and Labour, compile a word list for each and compare the frequency of certain key words. Do the same for men and women in the parliament. What would the differences tell us?
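The word-list comparison described above can be sketched in a few lines of Python. This is only a minimal illustration: the two "speeches" are invented toy strings, the tokeniser is a crude regular expression, and the names word_frequencies and per_thousand are hypothetical helpers, not part of any corpus tool.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter of lowercased word tokens (very crude tokeniser)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def per_thousand(counts, word):
    """Frequency of `word` per 1,000 tokens, so texts of different sizes compare."""
    total = sum(counts.values())
    return 1000 * counts[word] / total if total else 0.0

# Invented toy data standing in for two sets of parliamentary speeches.
speeches_a = "We will cut tax and cut spending. Tax is too high."
speeches_b = "We will invest in health and schools. Health matters."

freq_a = word_frequencies(speeches_a)
freq_b = word_frequencies(speeches_b)

# Compare the normalised frequency of a few key words across the two lists.
for word in ("tax", "health"):
    print(word, round(per_thousand(freq_a, word), 1), round(per_thousand(freq_b, word), 1))
```

Normalising by the total number of tokens matters because the two bodies of text will rarely be the same size.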

2. Formal patterns with a corpus - Collocations
This is the mainstay of contemporary corpus linguistics. There are various types:
Frequency Analysis
Concordance Analysis
- KWIC (keyword in context)
- Collocations (concordance of word co-occurrences)
- Collostructions (concordance of word and syntactic pattern)
Vector analysis / Word space analysis
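As a rough illustration of two of these techniques, the sketch below builds KWIC lines and counts collocates within a fixed window around a keyword. The sentence is an invented toy example, the window size of 2 is an arbitrary assumption, and kwic and collocates are hypothetical helper names rather than functions of any existing concordancer.

```python
from collections import Counter

def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

def collocates(tokens, keyword, window=2):
    """Count the words occurring within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

# Invented toy sentence, already tokenised on whitespace.
tokens = "the dog will run fast and the company will run well".split()

for left, key, right in kwic(tokens, "run"):
    print(f"{left:>15} [{key}] {right}")

print(collocates(tokens, "run").most_common(3))
```

Raw co-occurrence counts like these are only a starting point; real collocation studies usually weigh them with an association measure.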

Frequency Analysis 2008 US Presidential Election

Concordance Analysis

2. Collocation Analysis

Collostructional Analysis
Table 2. Most significant V cause - V result co-varying collexemes in the into-causative in the 1990-2000 volumes of The Guardian (cf. Gries and Stefanowitsch 2004: 230)

V cause      V result        N    -log(p Fisher-Yates)
bounce       accepting       29   14.074
torture      confessing      8    13.155
draw         commenting      6    10.581
shock        understanding   7    10.483
stimulate    producing       6    9.330
dupe         carrying        8    7.244
con          paying          16   7.019
hoodwink     leaving         8    6.982
mislead      buying          14   6.980
delude       supposing       3    6.792
terrorise    fleeing         4    6.762
talk         letting         12   6.743
dupe         leaving         13   6.609
force        making          51   6.546
pressure     having          14   6.505
bounce       announcing      6    6.100
shame        cleaning        4    5.953
dragoon      voting          7    5.899
swing        planning        2    5.518
fool         queuing         3    5.435
lock         using           5    5.406
guide        lending         2    5.372
rush         making          11   5.305
educate      understanding   3    5.296
fool         seeing          6    5.180
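The last column of the table is a negative log of a Fisher-Yates exact p-value. The sketch below shows how such a score could be computed from a 2x2 co-occurrence table. It rests on several assumptions: the cell counts are hypothetical (the table reports only the co-occurrence frequency N, not the margins), the test is taken to be one-tailed, and the logarithm is taken to be base 10. The function names are illustrative only.

```python
from math import comb, log10

def fisher_exact_greater(a, b, c, d):
    """One-tailed Fisher exact p for the 2x2 table [[a, b], [c, d]].

    a = verb and result word together, b = verb with other result words,
    c = other verbs with the result word, d = everything else.
    Returns the probability of a or more co-occurrences given fixed margins.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(row1, k) * comb(n - row1, col1 - k)
                     for k in range(a, upper + 1))
    return favourable / comb(n, col1)

def association_strength(a, b, c, d):
    """Negative base-10 log of the p-value (base 10 is an assumption)."""
    return -log10(fisher_exact_greater(a, b, c, d))

# Hypothetical counts, NOT taken from the Guardian data in the table.
print(association_strength(3, 0, 0, 3))
```

The larger the score, the less likely it is that the verb pair co-occurs this often by chance alone.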

Vector Analysis: synonymy of run
[Figure: word-space plot of synonyms and near-synonyms of run, e.g. race, flow, operate, escape, campaign; panel label: NOUNS]

3. Meaning patterns within a corpus - Correlations
There is a continuum from counting occurrences of some meaning or use through to large-scale multivariate modelling of the behaviour of those uses.
Relative Frequencies
Counting uses in this way is a simple and useful corpus-driven approach. It is the mainstay of Discourse Analysis and Functional Linguistics.
Multidimensional correlations
This line of research is the newest; it only began in the 1990s, in Belgium and Germany. It is an extremely popular technique in Cognitive Linguistics.

Relative Frequencies Memoranda vs. email in office communication

Multidimensional correlations Conceptualisation of HOME

Form-based vs Meaning-based analysis
Problems with Meaning-based analysis
i. Low degree of representativity due to small sample size
ii. High degree of subjectivity due to manual analysis
Response to Problem of Representativity
i. Restrict studies to carefully controlled datasets
ii. Predictive statistical modelling is essential
Response to Problem of Subjectivity
i. Clearly operationalised usage-features
ii. Multiple annotators and Kappa scores for reliability
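The Kappa score mentioned above can be illustrated with Cohen's kappa for two annotators, which corrects raw agreement for the agreement expected by chance. This is a minimal sketch: the labels are invented, and cohen_kappa is a hypothetical helper written for this example, not a library function.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / n ** 2
    if expected == 1:  # both annotators used one and the same label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented annotations of four occurrences of "run".
annotator_1 = ["motion", "motion", "manage", "manage"]
annotator_2 = ["motion", "manage", "manage", "manage"]

print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four items, but half of that agreement would be expected by chance, which is why kappa is lower than the raw agreement rate.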

Corpus Evidence in Language Science
In the late 1960s, the Functionalists were questioning the assumptions of the European and American Structuralists.
France: Martinet, Benveniste, Culioli
Britain: Firth, Halliday, Sinclair
Russia: Bondarko, Apresjan, Mel'čuk
America: Givón, Hopper, Fillmore, Lakoff
The debate remains one of the two key debates of linguistics today. It is central to the theories of arguably the three greatest linguists in history.
What structures language?

Corpus Evidence in Language Science
Where is grammar? Does langue come from parole or does parole come from langue?
Humboldt - ergon (product) vs. energeia (activity)
de Saussure - langue vs. parole
Chomsky - performance vs. competence
Theoretically, there are strong arguments for both.
Empirically, there are strong arguments for both.
Corpus linguistics necessarily assumes that the product is a result of the activity, that langue comes from parole, that competence is based on performance...
Although probably less than half, a very large group of linguists today think this is RUBBISH!!!

Corpus Evidence in Language Science
The main argument against using performance as an index of structure:
"I live in New York" versus "I live in Dayton, Ohio"
Chomsky's (1964) argument: frequency of performance tells us about the world language is used to describe, not the language structure in the mind.
Q. 1. Why does one assume that the language in the mind is different from the world it describes? At some level, "I come from New York" is more important in language than "I come from Dayton".
Q. 2. Why would one look at raw frequency to describe language? Frequency is always relative. We could only compare the frequency of these two utterances if the same number of people lived in Dayton and New York.

Questions for Corpus Linguistics
In every corpus-based study, it is crucial that you are aware of the practical limitations and theoretical assumptions of the method!!! (this includes your mémoires)
1. Practical Questions
a. Representativity - Text type
b. Representativity - Hapax legomena
2. Theoretical Questions
a. Frequency - Linguistic structure
b. Frequency - Thematic bias
3. Analytical Questions
a. Negative Evidence
b. Objective Accuracy

Representativity - Practical Questions for Corpus-Based Research
1. Text type and Topic of Discourse
The type of text and what it is talking about can have a profound effect on your results. The most common meaning of run will be fast pedestrian motion in a corpus of children's books, but it will be management in a corpus of economics news.
2. Hapax legomena and rare events
The largest corpus in the world is but a fraction of language. Something that is very rare in a corpus can, in fact, be quite common out there in the real world. We are relatively restricted to quite common events. Things like idioms etc. are relatively rare.
a. What are the implications of each of these questions for your own project?
b. If a particular question has implications for your project, what measures have you taken to respond to the question?
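To get a feel for how much of a word list consists of hapax legomena (words that occur only once), the following sketch lists them and reports their share of all word types. The sentence is an invented toy example and hapax_legomena is a hypothetical helper name.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the sorted list of word types that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

# Invented toy text, tokenised on whitespace.
tokens = "the cat sat on the mat and the dog sat too".split()

hapaxes = hapax_legomena(tokens)
n_types = len(set(tokens))

# Share of the vocabulary (types) that is attested only once.
hapax_share = len(hapaxes) / n_types
print(hapaxes, hapax_share)
```

In real corpora a large share of the vocabulary typically occurs only once, so any claim about such words rests on a single attestation.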

Frequency - Theoretical Questions for Corpus-Based Research
1. Linguistic structure
This is the langue / parole debate.
2. Thematic bias
This is the same as the issue of text type and is the basis of Chomsky's criticism.
1. What are the implications of each of these questions for your own project?
2. If a particular question has implications for your project, what measures have you taken to respond to the question?

Object - Analytical Questions for Corpus-Based Research
1. Negative Evidence
We only have what people say, not what they don't say. How can we disprove hypotheses?
2. Objective Accuracy
To increase representativity and objectivity, we necessarily increase inaccuracy. If we increase accuracy, we necessarily decrease representativity and objectivity.
1. What are the implications of each of these questions for your own project?
2. If a particular question has implications for your project, what measures have you taken to respond to the question?

For next week
Have a look at each of the articles online. Choose one and have a go at reading it. Remember, your mémoire should look something like one of these articles... seriously.