Toward a methodology of chunking: applications and extensions of Linear Unit Grammar
Ray Carey, University of Helsinki
1.5.2014

What & why
ELFA corpus (2008, www.helsinki.fi/elfa)
  English as a Lingua Franca in Academic Settings
  1 million words of naturally occurring academic ELF
corpus-based fluency research (cf. Götz 2013)
  chunking as descriptive methodology
  chunk-based patterns of ordinary dysfluency
description of (dys)fluency features among L2 users
  implications for language testing?

What kind of chunking?
chunking as storage / production
  Ellis 2002, Bybee 2010
chunking as processing / parsing
  Hunston & Francis 2000, Mason 2007
Linear Unit Grammar (LUG)
  Sinclair & Mauranen 2006
  oriented toward processing, with implications for storage
  robust parsing of spoken and written text

LUG: core concepts
chunking is intuitive & pre-theoretical:
  prospection (suspension)
  completion (extension)
  Brazil 1995, A Grammar of Speech
  different analysts will have different intuitions
  much of LUG chunking is routinized & easily learned
labeling chunks is systematic:
  organising (O) or incrementing message (M)
  prospection (M-) / completion (+M)
linear analysis: what precedes & what likely follows?
  neither dependent on nor incompatible with traditional grammatical categories

Studies incorporating LUG
Describing the LUG model
  Sinclair & Mauranen 2006; Mauranen 2013
Organizing chunks in ELF
  Mauranen 2009, Carey 2013
LUG and metadiscourse
  Smart 2014
LUG & discourse markers
  Huang 2013
LUG-based model of turn-taking & interruption
  Carey 2011
Applying LUG to literary analysis
  Stone 2011
Tone unit boundaries & LUG chunking boundaries
  Cheng, Greaves, Warren 2008

Linear, real-time processing
whatidefendisthater(.)wecanergivethemtheermtheknowledgethisspecificknowledgetheytheywant

Words are chunks too
what i defend is that er (.) we can er give them the erm the knowledge this specific knowledge they they want

Provisional Unit Boundaries
what i defend is | that | er | (.) | we can | er | give them the | erm | the knowledge | this specific knowledge | they | they want

M: message-oriented
O: organisation-oriented
what i defend is that er (.) we can er give them the erm the knowledge this specific knowledge they they want

M: message-oriented
O: organisation-oriented
  M   what i defend is
  O   that
  O   er
  O   (.)
  M   we can
  O   er
  M   give them the
  O   erm
  M   the knowledge
  M   this specific knowledge
  M   they
  M   they want

OT: text organising
OI: interaction organising
  M   what i defend is
  OT  that
  OI  er
  OI  (.)
  M   we can
  OI  er
  M   give them the
  OI  erm
  M   the knowledge
  M   this specific knowledge
  M   they
  M   they want

5 types of M (message) chunks
  M-   what i defend is
  OT   that
  OI   er
  OI   (.)
  +M-  we can
  OI   er
  +MA  give them the
  OI   erm
  +M   the knowledge
  MR   this specific knowledge
  MF   they
  MS   they want
MA = Message Adjustment / MF = Message Fragment
MR = Message Revision / MS = Message Supplement
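
One way to hold such an analysis in code is a flat list of (label, text) pairs, which keeps the linear order and makes label tallies trivial. The sketch below is only an illustration: the function names are hypothetical, and the data is the example utterance from the slide.

```python
from collections import Counter

# The example utterance from the slide as (label, text) pairs, in linear order.
CHUNKS = [
    ("M-", "what i defend is"),
    ("OT", "that"),
    ("OI", "er"),
    ("OI", "(.)"),
    ("+M-", "we can"),
    ("OI", "er"),
    ("+MA", "give them the"),
    ("OI", "erm"),
    ("+M", "the knowledge"),
    ("MR", "this specific knowledge"),
    ("MF", "they"),
    ("MS", "they want"),
]


def is_organising(label):
    """O chunks carry one of the two organising labels, OT or OI."""
    return label in ("OT", "OI")


def label_counts(chunks):
    """Tally how often each chunk label occurs."""
    return Counter(label for label, _ in chunks)


def linear_text(chunks):
    """Rebuild the running text by joining the chunks in order."""
    return " ".join(text for _, text in chunks)


counts = label_counts(CHUNKS)
n_o = sum(n for label, n in counts.items() if is_organising(label))
n_m = len(CHUNKS) - n_o
print(counts)
print("O chunks:", n_o, "M chunks:", n_m)
```

Tallies of this kind per speaker are what a distributional study of MF or MA chunks would start from.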

Toward LUG-annotated corpora
Smart 2014: first LUG-annotated written corpus
  40,000 words of IMDb message board discussions
Two major challenges:
  consistency in judgments (systematization)
  manual annotation is very time-consuming
Partial automation
  no substitute for human intuition
  advantage of provisional annotations

Step 1: find the O chunks
Following Sinclair & Mauranen 2006:
  O chunks are fixed & recurring, easy to find
  good place to start placing boundaries
Data-driven lists of O chunks
  ELFA corpus n-grams: 2-6 words
  manually examining concordance lines
  single units: er, and, but, yeah, erm, mhm
  bigrams: high-frequency chunk boundaries
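
A minimal sketch of this step, assuming whitespace-tokenised transcript text: count 2-6 word n-grams as candidates for manual review, then place boundaries around every match from a curated O-chunk list. The function names are hypothetical, and so is the list itself: it uses the single units named on the slide plus the pause marker (.), which the example analysis labels OI, and one invented multi-word entry ("i mean") to show longest-match lookup. It is not the list derived from the ELFA n-grams.

```python
from collections import Counter


def ngram_counts(tokens, n_min=2, n_max=6):
    """Count every n-gram of length n_min..n_max in a token list."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical O-chunk list (see the note above).
O_CHUNKS = {
    ("er",), ("erm",), ("and",), ("but",), ("yeah",), ("mhm",), ("(.)",),
    ("i", "mean"),  # invented multi-word entry, for illustration only
}


def split_on_o_chunks(tokens, o_chunks=O_CHUNKS):
    """Segment tokens into ('O', words) and (None, words) stretches.

    At each position the longest matching O chunk wins; everything
    between O chunks is left unlabelled for the next step.
    """
    max_len = max(len(chunk) for chunk in o_chunks)
    segments, rest, i = [], [], 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in o_chunks:
                match = candidate
                break
        if match:
            if rest:
                segments.append((None, rest))
                rest = []
            segments.append(("O", list(match)))
            i += len(match)
        else:
            rest.append(tokens[i])
            i += 1
    if rest:
        segments.append((None, rest))
    return segments


tokens = "what i defend is that er (.) we can er give them".split()
for label, words in split_on_o_chunks(tokens):
    print(label, " ".join(words))
```

Note that a context-dependent item such as "that" (OT in the example analysis) is deliberately left out of the list here; a fixed list cannot decide such cases, which is one reason the output stays provisional.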

Step 2: guess the M chunks
Message Fragments (MF)
  preliminary boundaries at repeats, false starts (wha-)
Shallow noun phrase chunking:
  remaining data is part-of-speech tagged by TreeTagger
  Natural Language Toolkit (NLTK): regular expression chunker (Bird, Loper & Klein 2009)
  data-driven regex patterns (based on 24 min. of data)
Shallow NP combined with surrounding words, rough guesses at M chunk labels
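
The idea behind a regular-expression chunker can be sketched with the standard library alone: write the POS tags out as one string and match a tag pattern against it. This is a stand-in for NLTK's chunker, not the chunker used in the study. The pattern (optional determiner, any adjectives, one or more nouns), the function name, and the tags in the example are all illustrative assumptions; the data-driven patterns themselves are not given on the slide.

```python
import re

# Hypothetical NP pattern over POS tags: optional determiner,
# any number of adjectives, then one or more common nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+")


def np_chunks(tagged):
    """Return (start, end) token spans of shallow noun phrases.

    `tagged` is a list of (word, tag) pairs. Tags are written out as
    '<DT><JJ><NN>...' so that an ordinary regex can match tag sequences;
    character offsets are then mapped back to token indices.
    """
    offsets, pos = {}, 0
    for idx, (_, tag) in enumerate(tagged):
        offsets[pos] = idx
        pos += len(tag) + 2  # tag plus the two angle brackets
    offsets[pos] = len(tagged)
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    return [(offsets[m.start()], offsets[m.end()])
            for m in NP_PATTERN.finditer(tag_string)]


# Example tags are assumptions for illustration, not real tagger output.
tagged = [("we", "PP"), ("can", "MD"), ("give", "VV"), ("them", "PP"),
          ("the", "DT"), ("specific", "JJ"), ("knowledge", "NN")]

for start, end in np_chunks(tagged):
    print(" ".join(word for word, _ in tagged[start:end]))
```

Each span found this way is only a provisional boundary: it still has to be combined with the surrounding words and given an M label by the analyst.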

Human analysis in XML
Preliminary chunker outputs data in XML format
  data is overchunked by design: easier to merge than create new annotations
XML is formatted using cascading stylesheet (CSS)
  analyzed & revised in Oxygen XML editor
  chunk boundaries are merged, altered
  chunk labels are assigned through drop-down menus
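
A small sketch of why overchunked output is convenient: with one element per provisional chunk, revising a boundary reduces to merging neighbours. The element and attribute names (utterance, chunk, type) and both function names are hypothetical; the actual schema used in the project is not shown on the slides.

```python
import xml.etree.ElementTree as ET


def chunks_to_xml(chunks):
    """Wrap (label, text) pairs in a flat <utterance> of <chunk> elements."""
    root = ET.Element("utterance")
    for label, text in chunks:
        element = ET.SubElement(root, "chunk", {"type": label})
        element.text = text
    return root


def merge_with_next(root, index):
    """Merge chunk `index` with its right-hand neighbour, keeping the
    label of the first chunk."""
    first, second = root[index], root[index + 1]
    first.text = first.text + " " + second.text
    root.remove(second)


# Overchunked provisional output: 'the' and 'knowledge' belong together.
root = chunks_to_xml([("M", "the"), ("M", "knowledge"), ("O", "er")])
merge_with_next(root, 0)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

In the workflow on the slide this merging is done by hand in the XML editor; the code only illustrates the operation on the data.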

Working with XML through CSS

Measuring chunker accuracy
Each token treated as an observation
  ends with chunk boundary or no boundary
  chunker output compared to gold standard
  true/false positives, true/false negatives calculated
After 32,000 words (3.5 hours of data)
  accuracy: 87% (% of observations in agreement)
  recall: 92% (% of gold boundaries predicted)
  precision: 80% (% of predicted boundaries correct)
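
The three measures follow directly from the per-token comparison. A minimal sketch, assuming each token is coded True if a chunk boundary follows it and False otherwise; the function name and the toy data are hypothetical and unrelated to the reported figures.

```python
def boundary_scores(predicted, gold):
    """Compare per-token boundary decisions against a gold standard.

    Both arguments are equal-length sequences of booleans, one per token:
    True if a chunk boundary follows the token.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    tn = sum(1 for p, g in zip(predicted, gold) if not p and not g)
    total = tp + fp + fn + tn
    return {
        # share of observations on which chunker and gold standard agree
        "accuracy": (tp + tn) / total if total else 0.0,
        # share of gold boundaries that the chunker predicted
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # share of predicted boundaries that are in the gold standard
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Toy data: eight tokens, one missed boundary and one spurious boundary.
gold = [False, True, False, True, True, False, False, True]
predicted = [False, True, True, True, False, False, False, True]
scores = boundary_scores(predicted, gold)
print(scores)
```

For an overchunking tool, higher recall than precision is the expected profile: spurious boundaries are cheap to merge away, while missed ones have to be added by hand.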

Is LUG reproducible among different analysts?
inter-rater reliability test
  trained research assistant using 1400 words of text
  developed set of chunking guidelines (<2 pages)
  independently chunked 5210 words of spoken text
  2k words from ELFA corpus, 3k words from Hong Kong Corpus of Spoken English (prosodic)
Cohen's kappa = 0.885
  >.81 almost perfect (Landis & Koch 1977)
Smart 2014: test with 288 words, kappa = 0.92
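
Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the proportion expected from each rater's label frequencies. A minimal sketch, assuming each rater's chunking is coded as one boundary/no-boundary decision per token; the function name and the toy data are hypothetical and do not reproduce the reported 0.885.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # observed agreement
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy data: 1 = boundary after the token, 0 = no boundary.
rater_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

In the toy data the raters agree on 8 of 10 tokens (p_o = 0.8), but because most tokens carry no boundary, chance agreement is already 0.52, so kappa comes out well below the raw agreement.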

Relation between chunks and tone units?
Hong Kong Corpus of Spoken English (prosodic)
  HKCSE: Cheng, Greaves, Warren 2008
  900,000 words of spoken English (L1/L2), annotated for tone unit boundaries
  3135 words for inter-rater chunking (1108 TUBs)
How well does LUG predict tone unit boundaries?
  accuracy: 80% (% of observations in agreement)
  recall: 70% (% of tone unit boundaries predicted)
  precision: 73% (% of predicted boundaries correct)

Conclusions
LUG is a robust descriptive model of language
  can handle transcriptions of challenging spoken data
  judgments can be regularised and systematised
  good potential for inter-rater reproducibility
  supported by intonation patterns
  automatisation is within reach
Ideal for corpus-based studies of (dys)fluency
  Message Fragments (MF), Message Adjustments (MA)
  inter-speaker distribution of (dys)fluency features

www.helsinki.fi/elfa
elfaproject.wordpress.com

Bird, S., Loper, E. & Klein, E. (2009) Natural Language Processing with Python. O'Reilly Media Inc. http://www.nltk.org/book/.
Brazil, D. (1995) A Grammar of Speech. Describing English Language. Oxford: OUP.
Bybee, J. (2010) Language, Usage and Cognition. Cambridge: CUP.
Carey, R. (2011) Interruption and Uncooperativeness in Academic ELF Group Work: An Application of Linear Unit Grammar. MA thesis, University of Helsinki, Finland.
Carey, R. (2013) On the other side: formulaic organizing chunks in spoken and written academic ELF. Journal of English as a Lingua Franca, 2(2), 207-228.
Cheng, W., Greaves, C. & Warren, M. (2008) A corpus-driven study of discourse intonation: the Hong Kong corpus of spoken English (prosodic). Studies in Corpus Linguistics, vol. 32. Amsterdam: John Benjamins.
ELFA (2008) The Corpus of English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. http://www.helsinki.fi/elfa/elfacorpus.
Ellis, N.C. (2002) Frequency effects in language processing. Studies in Second Language Acquisition, 24(2), 143-188.
Götz, S. (2013) Fluency in native and non-native English speech. Studies in Corpus Linguistics, vol. 53. Amsterdam: John Benjamins.

Huang, Lan-fen (2013) The use of Linear Unit Grammar (LUG) in the investigation of discourse markers in spoken English. International Journal of Language Studies, 7(3), 119-136.
Hunston, S. & Francis, G. (2000) Pattern Grammar: a corpus-driven approach to the lexical grammar of English. Studies in Corpus Linguistics, vol. 4. Amsterdam: John Benjamins.
Landis, J.R. & Koch, G.G. (1977) The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
Mason, O. (2007) From lexis to syntax: the use of multi-word units in grammatical description. Proceedings of the 26th Intl. Conference on Lexis and Grammar, Bonifacio, 2-6 October 2007.
Mauranen, A. (2009) Chunking in ELF: Expressions for managing interaction. Intercultural Pragmatics, 6(2), 217-233.
Mauranen, A. (2013) Linear Unit Grammar. In The Encyclopedia of Applied Linguistics, Chapelle, C.A. (ed.). Oxford: Wiley-Blackwell.
Sinclair, J.McH. & Mauranen, A. (2006) Linear Unit Grammar: Integrating Speech and Writing. Studies in Corpus Linguistics, vol. 25. Amsterdam: John Benjamins.
Smart, C. (2014) The role of discourse reflexivity in a linear description of grammar and discourse: the case of IMDb message boards. Unpublished doctoral dissertation, University of Birmingham.
Stone, L. (2011) Grammatical patterning in literary texts and Linear Unit Grammar in the dialogue of Hills like White Elephants by Ernest Hemingway. Innervate, 4 (2011-12), 150-167.