Toward a methodology of chunking: applications and extensions of Linear Unit Grammar Ray Carey University of Helsinki


Toward a methodology of chunking: applications and extensions of Linear Unit Grammar
Ray Carey, University of Helsinki
1.5.2014

What & why
- ELFA corpus (2008, www.helsinki.fi/elfa)
  - English as a Lingua Franca in Academic Settings
  - 1 million words of naturally occurring academic ELF
- corpus-based fluency research (cf. Götz 2013)
  - chunking as descriptive methodology
  - chunk-based patterns of ordinary dysfluency
- description of (dys)fluency features among L2 users
  - implications for language testing?

What kind of chunking?
- chunking as storage / production: Ellis 2002, Bybee 2010
- chunking as processing / parsing: Hunston & Francis 2000, Mason 2007
- Linear Unit Grammar (LUG): Sinclair & Mauranen 2006
  - oriented toward processing, with implications for storage
  - robust parsing of spoken and written text

LUG: core concepts
- chunking is intuitive & pre-theoretical:
  - prospection (suspension)
  - completion (extension)
  - Brazil 1995, A Grammar of Speech
  - different analysts will have different intuitions
  - much of LUG chunking is routinized & easily learned
- labeling chunks is systematic:
  - organising (O) or incrementing message (M)
  - prospection (M-) / completion (+M)
- linear analysis: what precedes & what likely follows?
  - neither dependent on nor incompatible with traditional grammatical categories

Studies incorporating LUG
- Describing the LUG model: Sinclair & Mauranen 2006; Mauranen 2013
- Organizing chunks in ELF: Mauranen 2009, Carey 2013
- LUG and metadiscourse: Smart 2014
- LUG & discourse markers: Huang 2013
- LUG-based model of turn-taking & interruption: Carey 2011
- Applying LUG to literary analysis: Stone 2011
- Tone unit boundaries & LUG chunking boundaries: Cheng, Greaves, Warren 2008

Linear, real-time processing
whatidefendisthater(.)wecanergivethemtheermtheknowledgethisspecificknowledgetheytheywant

Words are chunks too
what i defend is that er (.) we can er give them the erm the knowledge this specific knowledge they they want

Provisional Unit Boundaries
what i defend is that er (.) we can er give them the erm the knowledge this specific knowledge they they want

M: message-oriented
O: organisation-oriented
what i defend is that er (.) we can er give them the erm the knowledge this specific knowledge they they want

M: message-oriented
O: organisation-oriented
M    what i defend is
O    that
O    er
O    (.)
M    we can
O    er
M    give them the
O    erm
M    the knowledge
M    this specific knowledge
M    they
M    they want

OT: text organising
OI: interaction organising
M    what i defend is
OT   that
OI   er
OI   (.)
M    we can
OI   er
M    give them the
OI   erm
M    the knowledge
M    this specific knowledge
M    they
M    they want

5 types of M (message) chunks
M-    what i defend is
OT    that
OI    er
OI    (.)
+M-   we can
OI    er
+MA   give them the
OI    erm
+M    the knowledge
MR    this specific knowledge
MF    they
MS    they want
MA = Message Adjustment / MF = Message Fragment
MR = Message Revision / MS = Message Supplement

Toward LUG-annotated corpora
- Smart 2014: first LUG-annotated written corpus
  - 40,000 words of IMDb message board discussions
- Two major challenges:
  - consistency in judgments (systematization)
  - manual annotation takes forever
- Partial automation
  - no substitute for human intuition
  - advantage of provisional annotations

Step 1: find the O chunks
- Following Sinclair & Mauranen 2006:
  - O chunks are fixed & recurring, easy to find
  - good place to start placing boundaries
- Data-driven lists of O chunks
  - ELFA corpus n-grams: 2-6 words
  - manually examining concordance lines
  - single units: er, and, but, yeah, erm, mhm
  - bigrams: high-frequency chunk boundaries
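The n-gram step can be sketched roughly as follows. This is only an illustration of counting 2-6 word n-grams to surface recurring candidates for manual inspection; the function name `ngram_counts`, the toy token list, and the frequency cut-off of 2 are assumptions, not the procedure actually run on the ELFA corpus.

```python
from collections import Counter

def ngram_counts(tokens, n_min=2, n_max=6):
    """Count every n-gram of length n_min..n_max in a token list."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy data invented for illustration (not taken from the ELFA corpus).
tokens = "i mean we can er i mean you know we can you know".split()
counts = ngram_counts(tokens)

# Recurring n-grams are only candidates; the slides stress that the lists
# were finalised by manually examining concordance lines.
candidates = [(" ".join(gram), c) for gram, c in counts.most_common() if c >= 2]
for gram, c in candidates:
    print(c, gram)
```

On real data the threshold would have to be tuned, and candidates such as "we can" show why the manual check matters: frequency alone does not separate organising chunks from message chunks.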

Step 2: guess the M chunks
- Message Fragments (MF)
  - preliminary boundaries at repeats, false starts (wha-)
- Shallow noun phrase chunking:
  - remaining data is part-of-speech tagged by TreeTagger
  - Natural Language Toolkit (NLTK): regular expression chunker (Bird, Loper & Klein 2009)
  - data-driven regex patterns (based on 24 min. of data)
- Shallow NP combined with surrounding words, rough guesses at M chunk labels
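The idea behind regular-expression NP chunking over POS tags can be sketched with the standard library alone. This is a stand-in for the NLTK chunker named above, not the patterns actually used: the single pattern, the helper `np_chunks`, and the hand-assigned Penn-style tags are all assumptions for illustration. In NLTK itself, a grammar along the lines of `NP: {<DT>?<JJ>*<NN.*>+}` passed to `nltk.RegexpParser` expresses a similar rule.

```python
import re

# Hypothetical pattern: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+")

def np_chunks(tagged):
    """Return shallow NP chunks from a list of (word, tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Map the character offset of each tag back to its token index.
    index_of, pos = {}, 0
    for i, (_, tag) in enumerate(tagged):
        index_of[pos] = i
        pos += len(tag) + 2  # the tag plus its angle brackets
    index_of[pos] = len(tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        first, last = index_of[match.start()], index_of[match.end()]
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks

# Tags assigned by hand for illustration; real input would come from a tagger.
tagged = [("we", "PP"), ("can", "MD"), ("give", "VV"), ("them", "PP"),
          ("the", "DT"), ("knowledge", "NN"),
          ("this", "DT"), ("specific", "JJ"), ("knowledge", "NN")]
print(np_chunks(tagged))
```

The words outside the NP matches ("we can give them") are what the slide describes as being combined with the shallow NPs to form rough guesses at M chunks.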

Human analysis in XML
- Preliminary chunker outputs data in XML format
  - data is overchunked by design: easier to merge than create new annotations
- XML is formatted using cascading stylesheet (CSS)
  - analyzed & revised in Oxygen XML editor
  - chunk boundaries are merged, altered
  - chunk labels are assigned through drop-down menus
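A minimal sketch of why overchunked XML is convenient to revise: merging two adjacent chunks is a small, local edit. The element name `chunk`, the attribute `type`, and both helper functions are hypothetical; the slides do not show the actual schema, and in the workflow described the merging is done by hand in an XML editor rather than in code.

```python
import xml.etree.ElementTree as ET

def to_xml(chunks):
    """Wrap (label, text) pairs in a hypothetical <utterance>/<chunk> layout."""
    utterance = ET.Element("utterance")
    for label, text in chunks:
        element = ET.SubElement(utterance, "chunk", {"type": label})
        element.text = text
    return utterance

def merge_with_next(utterance, index, label):
    """Merge chunk `index` with the chunk that follows it and relabel it."""
    children = list(utterance)
    first, second = children[index], children[index + 1]
    first.text = f"{first.text} {second.text}"
    first.set("type", label)
    utterance.remove(second)

# Deliberately overchunked provisional output (toy example).
provisional = [("M", "give"), ("M", "them"), ("M", "the"),
               ("OI", "erm"), ("M", "the knowledge")]
utterance = to_xml(provisional)

# The analyst's revision: three provisional chunks become one +MA chunk.
merge_with_next(utterance, 0, "M")
merge_with_next(utterance, 0, "+MA")
print(ET.tostring(utterance, encoding="unicode"))
```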

Working with XML through CSS

Measuring chunker accuracy
- Each token treated as an observation
  - ends with chunk boundary or no boundary
  - chunker output compared to gold standard
  - true/false positives, true/false negatives calculated
- After 32,000 words (3.5 hours of data)
  - accuracy: 87% (% of observations in agreement)
  - recall: 92% (% of gold boundaries predicted)
  - precision: 80% (% of predicted boundaries correct)
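The three measures follow directly from the token-level definitions above. The sketch below assumes each token is coded 1 if a chunk boundary follows it and 0 otherwise; the function name and the toy sequences are illustrative and do not reproduce the reported figures.

```python
def boundary_scores(predicted, gold):
    """Token-level accuracy, recall and precision for boundary decisions.

    Both arguments are equal-length sequences of 0/1, where 1 means
    "a chunk boundary follows this token".
    """
    if len(predicted) != len(gold):
        raise ValueError("sequences must cover the same tokens")
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)          # boundary in both
    fp = sum(1 for p, g in pairs if p and not g)      # predicted only
    fn = sum(1 for p, g in pairs if not p and g)      # gold only
    tn = sum(1 for p, g in pairs if not p and not g)  # boundary in neither
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Toy example over ten tokens: what i defend is | that | er | (.) | we can | er |
gold      = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]
predicted = [0, 1, 0, 1, 0, 1, 1, 1, 1, 1]  # invented chunker output
scores = boundary_scores(predicted, gold)
print(scores)
```

An overchunking design, as described for the preliminary chunker, would be expected to favour recall over precision, which matches the direction of the reported 92% versus 80%.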

Is LUG reproducible among different analysts?
- inter-rater reliability test
  - trained research assistant using 1400 words of text
  - developed set of chunking guidelines (<2 pages)
  - independently chunked 5210 words of spoken text
  - 2k words from ELFA corpus, 3k words from Hong Kong Corpus of Spoken English (prosodic)
- Cohen's Kappa = 0.885
  - >.81 "almost perfect" (Landis & Koch 1977)
  - Smart 2014: test with 288 words, Kappa = 0.92
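For reference, Cohen's kappa corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch for two raters' token-level boundary decisions follows; the function name and the toy ratings are assumptions, and the slides do not state how the reported value was computed in practice.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty ratings of the same items")
    n = len(rater_a)
    # p_o: proportion of items on which the raters agree.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # p_e: agreement expected if each rater labelled at their own base rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / n ** 2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy ratings: 1 = boundary after the token, 0 = no boundary.
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 tokens (p_o = 0.8) while chance agreement is 0.52, so kappa is well below the raw agreement rate; this is why kappa rather than simple percentage agreement is reported.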

Relation between chunks and tone units?
- Hong Kong Corpus of Spoken English (prosodic), HKCSE: Cheng, Greaves, Warren 2008
  - 900,000 words of spoken English (L1/L2), annotated for tone unit boundaries
  - 3135 words for inter-rater chunking (1108 tone unit boundaries, TUBs)
- How well does LUG predict tone unit boundaries?
  - accuracy: 80% (% of observations in agreement)
  - recall: 70% (% of tone unit boundaries predicted)
  - precision: 73% (% of predicted boundaries correct)

Conclusions
- LUG is a robust descriptive model of language
  - can handle transcriptions of challenging spoken data
- judgments can be regularised and systematised
  - good potential for inter-rater reproducibility
  - supported by intonation patterns
  - automatisation is within reach
- Ideal for corpus-based studies of (dys)fluency
  - Message Fragments (MF), Message Adjustments (MA)
  - inter-speaker distribution of (dys)fluency features

www.helsinki.fi/elfa
elfaproject.wordpress.com

Bird, S., Loper, E. & Klein, E. (2009) Natural Language Processing with Python. O'Reilly Media Inc. http://www.nltk.org/book/.
Brazil, D. (1995) A Grammar of Speech. Describing English Language. Oxford: OUP.
Bybee, J. (2010) Language, Usage and Cognition. Cambridge: CUP.
Carey, R. (2011) Interruption and Uncooperativeness in Academic ELF Group Work: An Application of Linear Unit Grammar. MA thesis, University of Helsinki, Finland.
Carey, R. (2013) On the other side: formulaic organizing chunks in spoken and written academic ELF. Journal of English as a Lingua Franca, 2(2), 207-228.
Cheng, W., Greaves, C. & Warren, M. (2008) A corpus-driven study of discourse intonation: the Hong Kong corpus of spoken English (prosodic). Studies in Corpus Linguistics, vol. 32. Amsterdam: John Benjamins.
ELFA (2008) The Corpus of English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. http://www.helsinki.fi/elfa/elfacorpus.
Ellis, N.C. (2002) Frequency effects in language processing. Studies in Second Language Acquisition, 24(2), 143-188.
Götz, S. (2013) Fluency in native and non-native English speech. Studies in Corpus Linguistics, vol. 53. Amsterdam: John Benjamins.

Huang, Lan-fen (2013) The use of Linear Unit Grammar (LUG) in the investigation of discourse markers in spoken English. International Journal of Language Studies, 7(3), 119-136.
Hunston, S. & Francis, G. (2000) Pattern Grammar: a corpus-driven approach to the lexical grammar of English. Studies in Corpus Linguistics, vol. 4. Amsterdam: John Benjamins.
Landis, J.R. & Koch, G.G. (1977) The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
Mason, O. (2007) From lexis to syntax: the use of multi-word units in grammatical description. Proceedings of the 26th Intl. Conference on Lexis and Grammar, Bonifacio, 2-6 October 2007.
Mauranen, A. (2009) Chunking in ELF: Expressions for managing interaction. Intercultural Pragmatics, 6(2), 217-233.
Mauranen, A. (2013) Linear Unit Grammar. In The Encyclopedia of Applied Linguistics, Chapelle, C.A. (ed.). Oxford: Wiley-Blackwell.
Sinclair, J.McH. & Mauranen, A. (2006) Linear Unit Grammar: Integrating Speech and Writing. Studies in Corpus Linguistics, vol. 25. Amsterdam: John Benjamins.
Smart, C. (2014) The role of discourse reflexivity in a linear description of grammar and discourse: the case of IMDb message boards. Unpublished doctoral dissertation, University of Birmingham.
Stone, L. (2011) Grammatical patterning in literary texts and Linear Unit Grammar in the dialogue of Hills like White Elephants by Ernest Hemingway. Innervate, 4 (2011-12), 150-167.