Corpus Linguistics. Applied Corpus Search Corpus of Contemporary American English (COCA) Niko Schenk


Applied Corpus Search: Corpus of Contemporary American English (COCA)
Institut für England- und Amerikastudien, Goethe-Universität Frankfurt am Main
Winter Term 2015/2016, November 30th, 2016

1 COCA Corpus

A List of Available Corpora [2]

Corpus                                   Language          Words           Time period  Type
Google's N-Gram Corpus                   English           1.024 trillion  -            web data
Google Books Corpus                      AE/BE             155/34 billion  1500s-2000s  historical, contemporary books
Global Web-Based English (GloWbE)        20 countries      1.9 billion     2012-2013    web pages
Corpus of Contemporary AE (COCA) [1]     AE                450 million     1990-2012    spoken, fiction, magazines, news, academic texts
British National Corpus (BYU-BNC)        BE                100 million     1980s-1993   representative sample of written/spoken BE
Corpus of American Soap Operas           AE                100 million     2001-2012    film dialogues
Strathy Corpus                           Canadian English  50 million      1970s-2000   spoken, fiction, magazines, newspapers, academic texts
My S-21 Facebook Corpus                  German            50 million      2010-2013    UGC, web data
Corpus do Português                      Portuguese        45 million      1300s-1900s  newspaper, academic texts
Canadian Hansard Corpus                  English, French   26 million      1986-1987    parallel corpus, parliament debates
International Corpus of Learner English  English           3.7 million     2002         essays written by learners of English from 16 native languages

[1] http://corpus.byu.edu/coca/
[2] Not an exhaustive list; sorted by size; references: 1, 2

COCA Corpus Getting started with the COCA corpus... http://corpus.byu.edu/coca

Tagset and Instructions on How to Use the Corpus
1. Tagset: http://ucrel.lancs.ac.uk/claws7tags.html
2. Instructions on how to search the data: Click on the LIST button and explore all links in the section "More information": basic syntax, part of speech, lemmas (forms of words), synonyms, customized word lists, and combining words.

Use the COCA corpus for your analysis and work through the following exercises. For each exercise, provide (a) the query that you formulated and (b) a brief, concise explanation of the trend that you see, based on the frequencies that you obtain. Note that for some exercises you may want to switch between the display options List, Chart, KWIC and Compare.

COCA Corpus Video Lectures
If you have trouble with the search or need more information on how to work with the corpus, you can consult these video lectures:
About the COCA corpus: http://www.youtube.com/watch?v=sclgrtlxg0y
Parts-of-Speech (POS): http://www.youtube.com/watch?v=kp-7thiunlm
List of POS tags: http://ucrel.lancs.ac.uk/claws7tags.html
Collocations: http://www.youtube.com/watch?v=t_sxpfipo_o

Word Meaning
1. Search for the word corpus, inspect the results, and use the different contexts to identify its different meanings.

Word Frequencies
2. What are the five most frequent words in the corpus? What is special about the second and third most frequent words? Why are they included? Think of a potential application or linguistic scenario in which you might want to use them within your search query.
3. What is the most frequent noun in the corpus? Compute the relative frequency of this word compared to all words in the corpus (simple division). Look up the same word in the Google NGram viewer https://books.google.com/ngrams/ and check whether the word's relative frequency in the books corpus is different. Report and compare the two numbers.
4. What are the two most frequent words preceding the word body? What are the two most frequent affixes preceding the word body? Inspect the results for the seventh most frequent word, which looks a bit strange. Can you explain what it is?
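
The "simple division" in exercise 3 can be sketched as follows. This is a minimal illustration only: the corpus size is the 450 million words listed for COCA in the table of corpora, while the noun count is a made-up placeholder that must be replaced by the frequency actually returned by the query. The helper names are hypothetical. Converting to a percentage is included because the NGram viewer plots frequencies as percentages.

```python
def relative_frequency(count: int, corpus_size: int) -> float:
    """Share of all corpus tokens accounted for by one word."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count / corpus_size


def as_percent(fraction: float) -> float:
    """Express a fraction as a percentage."""
    return fraction * 100


COCA_SIZE = 450_000_000      # corpus size as listed in the table of corpora
PLACEHOLDER_COUNT = 900_000  # hypothetical count, NOT a real COCA result

rel = relative_frequency(PLACEHOLDER_COUNT, COCA_SIZE)
print(f"relative frequency: {rel:.6f} ({as_percent(rel):.4f} %)")
```

The percentage obtained this way can then be set side by side with the value read off the NGram viewer for a comparable year range.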

Synonyms, POS-Tags, Affixes, Lemmata
5. Find five synonyms of the verb (to) love. The synonyms should only be verbs.
6. Click on the keyword-in-context view (KWIC). Search for all noun uses of the word form play. Inspect the results and find a sentence which has been tagged incorrectly (e.g., a sentence in which the word is actually a verb).
7. What are the three most frequent adjectives starting with the prefix in? Restrict your search to the fiction domain / academic writing genre and report the adjectives.
8. Search for the lemma forms nice and tall with the List display option. Do the same for good. What is a potential problem here?
9. -licious is a suffix which is used to form new words. Find some instances and come up with a definition for them.
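
To make the idea behind exercise 6 concrete, here is a small sketch of a keyword-in-context display over POS-tagged text. It does not reproduce the COCA interface; the function name `kwic`, the sample sentence, and its tagging are invented for illustration. The tags follow the CLAWS7 tagset linked above, in which noun tags begin with NN and VV0 marks the base form of a lexical verb.

```python
def kwic(tagged_tokens, word, tag_prefix, window=3):
    """Return one context line per token matching both word and tag prefix."""
    lines = []
    for i, (token, tag) in enumerate(tagged_tokens):
        if token.lower() == word and tag.startswith(tag_prefix):
            # Collect up to `window` tokens on each side of the hit.
            left = " ".join(t for t, _ in tagged_tokens[max(0, i - window):i])
            right = " ".join(t for t, _ in tagged_tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}")
    return lines


# Invented sample sentence with hand-assigned CLAWS7-style tags.
sample = [
    ("The", "AT"), ("play", "NN1"), ("was", "VBDZ"), ("long", "JJ"),
    ("but", "CCB"), ("children", "NN2"), ("play", "VV0"), ("outside", "RL"),
]

for line in kwic(sample, "play", "NN"):
    print(line)
```

Only the first occurrence of play is returned, because the second one carries a verb tag. A tagging error of the kind the exercise asks for would show up as a line whose context makes clear that the highlighted word is not a noun.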

Comparing Genres
10. Are auxiliary verbs used more often in spoken language or in written text?
11. Search for all nouns, verbs, adjectives and adverbs and compare the results across all genres in the corpus. Try to come up with a simple explanation for the trend you see.
12. Formulate a query for the passive voice. Show that the passive is used more often in academic writing than in fiction texts. What could be a possible explanation?
13. Compare the use of negation (not, etc.) and verbs (base forms) across genres. (Note that there is a tag for negation.) Explain the trend you see.
14. In fiction texts, you would expect a lot of proper names. How does this hypothesis relate to other genres? Can you think of a linguistic construction (word, part-of-speech tag, n-gram, affix) which is more prominent in fiction writing than in the other genres?

Collocations
15. Search for all adjectives preceding the token President. Inspect only the first eleven results. Try to classify the resulting adjectives into two linguistic categories.
16. Which type of nouns does cause collocate with?
17. Which type of adjectives does rather collocate with? How about fairly? Compare the two types of adjectives and inspect many of them carefully (use the Compare option). Do these two types of adjectives fall into two classes with different properties?
18. Search for hard followed by any word. Inspect the results. Then, from the SORTING AND LIMITS panel, choose SORT BY RELEVANCE and rerun the query. Why are the results different? Which list is easier to interpret?
19. Which type of nouns follows handsome? Which words go with pretty? Try to categorize them.

20. A user in a language forum [3] claims that "little carries an emotional factor [...] small usually does not". Prove this informally.
21. The words quick, rapid and fast all have very similar meanings. Formulate a query which extracts their collocates and explain the differences.
22. The word them can (very informally) be used as a synonym for those [4]. Find instances of this type in the corpus.

[3] http://www.english-test.net/forum/ftopic14714.html
[4] http://de.urbandictionary.com/define.php?term=them
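
Exercise 18 asks why sorting by relevance changes the ranking. One common way to score the association between a node word and a collocate is pointwise mutual information, which compares the observed co-occurrence count with the count expected if the two words were independent. Whether the corpus interface uses exactly this formula is an assumption here; the sketch below only illustrates the general effect. All counts are invented placeholders, and `collocate_a` / `collocate_b` are hypothetical labels.

```python
import math


def pmi(pair_count, node_count, collocate_count, corpus_size):
    """Pointwise mutual information (log base 2) of a node/collocate pair."""
    # Co-occurrences expected by chance if the two words were independent.
    expected = node_count * collocate_count / corpus_size
    return math.log2(pair_count / expected)


CORPUS_SIZE = 450_000_000  # COCA size as listed in the table of corpora
NODE_COUNT = 100_000       # hypothetical frequency of the node word

# Hypothetical collocates: (co-occurrence count, overall frequency).
candidates = {
    "collocate_a": (50_000, 10_000_000),  # very frequent word overall
    "collocate_b": (1_000, 5_000),        # rare word, mostly seen with the node
}

by_frequency = sorted(candidates, key=lambda w: candidates[w][0], reverse=True)
by_pmi = sorted(
    candidates,
    key=lambda w: pmi(candidates[w][0], NODE_COUNT, candidates[w][1], CORPUS_SIZE),
    reverse=True,
)
print("sorted by frequency:", by_frequency)
print("sorted by PMI:      ", by_pmi)
```

With these placeholder numbers the two rankings are reversed: raw frequency favours words that are common everywhere, whereas an association score favours words that occur with the node more often than chance would predict.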

COCA vs. BNC Lexicography & Syntax
23. Previous research on quotative like [5] has claimed that the phenomenon is much more common in AE than in BE. Test the hypothesis formally using the corpora COCA and BNC.
24. Formulate a query to check which adjectives are used to describe men. The query should have the pattern masculine pronoun + form of (to) be and collocate with adjectives to the right (max 4 tokens). Sort by RELEVANCE. Interpret the result. Which of the two lists are you more familiar with?

[5] http://en.wikipedia.org/wiki/like#as_a_colloquial_quotative

25. Compare constructions of the sort -need NEG VERB-, as in need not worry, in AE and BE.
26. Search for constructions of the sort -Beginning of sentence One DO NEGATION-, as in One doesn't, and compare AE to BE. Can you come up with a hypothesis for the trend you see (in general / for academic texts)?
27. Compare -all of the NOUN- with -all the NOUN- (e.g., all of the cases vs. all the cases) in the BNC and COCA.
28. Search for all noun collocates of the noun web (4 tokens to the left and right). Compare AE to BE and sort by RELEVANCE. Explain the differences.
29. Repeat the previous exercise with dumb.
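
Because COCA and the BNC differ in size, raw counts from the AE/BE comparisons cannot be compared directly. The sketch below shows one possible way to make such a comparison more formal, as exercise 23 requests: it normalizes counts to occurrences per million words and computes the log-likelihood statistic (G2) that is widely used in corpus linguistics for comparing a frequency across two corpora. The choice of this statistic is an assumption, not something the exercise prescribes. The corpus sizes are taken from the table of corpora; the hit counts are invented placeholders.

```python
import math


def per_million(count, corpus_size):
    """Normalized frequency: occurrences per one million tokens."""
    return count / corpus_size * 1_000_000


def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 statistic for one item observed in two corpora of different size."""
    total = count_a + count_b
    # Counts expected in each corpus if the item were equally common in both.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


COCA_SIZE = 450_000_000  # from the table of corpora
BNC_SIZE = 100_000_000   # from the table of corpora
coca_hits, bnc_hits = 900, 100  # hypothetical counts, NOT real query results

print("COCA per million:", per_million(coca_hits, COCA_SIZE))
print("BNC per million: ", per_million(bnc_hits, BNC_SIZE))
print("G2:", round(log_likelihood(coca_hits, COCA_SIZE, bnc_hits, BNC_SIZE), 2))
```

A G2 value above 3.84 corresponds to p < 0.05 at one degree of freedom. The direction of the difference has to be read from the per-million figures, since G2 only measures how strongly the two corpora differ.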