Introduction
Beáta B. Megyesi
Uppsala University, Department of Linguistics and Philology
beata.megyesi@lingfil.uu.se

Course content
- Credits: 7.5 ECTS
- Subject: Computational linguistics
- Level: Advanced
- Content: The course provides basic knowledge about how to use language resources and tools for language studies.
- Goals: Students learn basic technologies for creating and processing text collections and speech databases. The focus is on the practical use of tools.

Course content
- Requirements: Bachelor's degree with language studies as the main field
- Regulations: The course may not be included in a degree if equivalent parts have been studied within another course included in the degree.
- Instruction: The teaching consists of lectures, exercises, supervision, and examination.

Course content
- What is a corpus and what is in it?
- Various corpora: text collections, speech databases, treebanks, parallel corpora, parallel treebanks, etc.
- Corpus usage in language studies: concordances, collocations, frequency lists, keywords, corpus search
- Methods for corpus creation: collection, scanning, encoding, formats, annotation, alignment
- Existing tools for corpus creation and analysis on various linguistic levels (e.g., tagging and parsing)
- International corpus distribution
- Presentation of research

Learning outcomes
- account for and independently apply basic concepts in corpus linguistics
- account for and discuss different corpora and speech databases
- account for the different uses of corpora and speech databases in language studies
- master basic techniques for corpus development
- use programs for text processing and quantitative analysis based on corpora
- process and analyze corpora in relation to a research question by using available tools
- carry out project work on a research question and present the study both orally and in writing, as a scientific paper

Examination
Assignments:
- Corpus distributors and description of two existing corpora of your choice: individual written report, deadline November 13, to Bengt Dahlqvist
- Corpus usage for language studies (WordSmith Tools): written report in groups of two, deadline November 18, to Bengt Dahlqvist
- Corpus annotation and alignment: written report in groups of two, deadline November 23, to Bengt Dahlqvist
- Speech data: written report, deadline December 4, to Petur Helgason

Examination
Project work:
- carry out a project on a delimited research question
- present the study orally and as a written report in the form of a scientific paper
- peer review another project during a seminar on December 21
Grades: Pass or Fail, based on
- the 4 assignments
- the oral and written presentation of the project work
- the peer review

Literature and course page
- Hunston, Susan (2002) Corpora in Applied Linguistics. 6th printing. Cambridge University Press.
- Recommended reading: McEnery, Tony, Richard Xiao and Yukio Tono (2006) Corpus-Based Language Studies - an advanced resource book. Routledge Applied Linguistics.
- Course page: http://stp.lingfil.uu.se/~bea/uv/uv09/dokverktyg/

Outline
- Language studies
- Corpora: definition and content
- Corpus linguistics
- Corpus archives and distribution
- Assignment

Language studies
Intuition-based approach
- the traditional approach
- researchers can invent examples instantly for analysis
- intuition is always available
- invented examples are free from language-external influences
- invented examples may not represent typical language use
- what counts as acceptable varies between individuals
- intuition should be applied with caution; it may be influenced by the researcher's own dialect or sociolect
- results based on introspection are not observable and are difficult to verify

Language studies
Corpus-based approach
- investigating language by using authentic examples derived from corpora
- what we see in a corpus is largely grammatical and/or acceptable
- a corpus provides evidence of what speakers believe to be acceptable utterances in their language
- a corpus draws upon authentic or real text
- a corpus can reveal differences that intuition alone cannot perceive
- a corpus can yield reliable quantitative data
- not all linguists accept the use of corpora
- not all research questions can be addressed by the corpus-based approach

Language studies
"Neither the corpus linguist of the 1950s, who rejected intuition, nor the general linguist of the 1960s, who rejected corpus data, was able to achieve the interaction of data coverage and the insight that characterise the many successful corpus analyses of recent years." (Leech, 1991)

Language studies
- Corpus-based approach: examples are taken from a corpus to verify a hypothesis
- Corpus-driven approach: the whole corpus is used to find patterns

Corpus: Definition
- Usage: singular - corpus, plural - corpora
- Definition: a collection of sampled language data, in digital (machine-readable) form, consisting of written text or transcriptions of spoken language, which is (more or less) representative of a particular language and which may be annotated with various forms of linguistic information.
- Goal: to verify hypotheses about natural language, e.g. to investigate how a particular sound, word, or syntactic construction is used

What is a corpus?
"A corpus is defined as a body of naturally occurring language. It should be added that computer corpora are rarely haphazard collections of textual material: They are generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type." (Leech, 1992)
"A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language." (Sinclair, 1996)

Corpora
- a corpus is not just any collection of text
- digitized, machine-readable
- authentic texts, including transcripts of spoken data
- sampled to be a representative collection of a particular language or language variety
- often of limited size
- a standard reference for comparative studies

Corpus
Text = written or spoken language
Examples:
- SUC (Stockholm Umeå Corpus): 1 million words (written)
- PAROLE corpus from Språkbanken: 19 million words (written)
- BNC (British National Corpus): 100 million words (written/spoken)
- London-Lund corpus: 0.5 million words (spoken)

Some corpora
SUC (written)
- 500 texts of 2000 tokens each
- 9 genres, with subcategories, e.g.
  K - imaginative prose
    KK - general fiction
    KL - science fiction
    KN - light reading
    KR - humour
- annotated with lemma, PoS, and named entities

Some corpora
Monitor corpora (growing)
- Språkbanken: http://spraakbanken.gu.se/ (corpora and lexicon)
- Bank of English: written and spoken English (for the COBUILD series of English language books)

Some corpora
The British National Corpus (BNC): http://www.natcorp.ox.ac.uk/
- balanced, modern (British) English, over 100 million words (written and spoken)
- 4,124 texts, of which 863 are transcribed from dialogues and monologues
- PoS tagged (65 PoS tags)
- Roger Garside and Geoffrey Leech

Example: BNC
<p>
<s n=011> <w AT0>The <w AJ0>medical <w NN2>aspects <w VM0>can <w VBI>be <w NN1>cancer <c PUN>, <w NN1>pneumonia <c PUN>, <w AJ0>sudden <w NN1>blindness <c PUN>, <w NN1>dementia <c PUN>, <w AJ0>dramatic <w NN1>weight loss <w CJC>or <w DT0>any <w NN1>combination <w PRF>of <w DT0>these <c PUN>.
</p>
<p>
<s n=012> <w AV0>Often <w AJ0>infected <w NN0>people <w VBB>are <w VVN>rejected <w PRP>by <w NN0>family <w CJC>and <w NN2>friends<c PUN>, <w VVG>leaving <w PNP>them <w TO0>to <w VVI>face <w DT0>this <w AJ0>chronic <w NN1>condition <w AJ0-AV0>alone<c PUN>.
</p>
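The markup in the BNC example can be read mechanically: each `<w TAG>` or `<c TAG>` element is followed by the word or punctuation mark it labels. The following is a minimal sketch in Python, assuming only the markup conventions visible in the excerpt; the names `TOKEN_RE` and `parse_bnc` are invented for illustration and are not part of any BNC tool.

```python
import re

# Sentence copied from the BNC excerpt on the slide (old SGML-style markup).
SAMPLE = (
    "<s n=012> <w AV0>Often <w AJ0>infected <w NN0>people <w VBB>are "
    "<w VVN>rejected <w PRP>by <w NN0>family <w CJC>and <w NN2>friends<c PUN>, "
    "<w VVG>leaving <w PNP>them <w TO0>to <w VVI>face <w DT0>this "
    "<w AJ0>chronic <w NN1>condition <w AJ0-AV0>alone<c PUN>."
)

# <w TAG>word or <c TAG>punctuation; the token text runs up to the next "<".
# (Hypothetical pattern, based only on the excerpt above.)
TOKEN_RE = re.compile(r"<([wc]) ([A-Z0-9-]+)>([^<]*)")


def parse_bnc(line):
    """Return a list of (token, tag) pairs from one marked-up sentence."""
    return [(m.group(3).strip(), m.group(2)) for m in TOKEN_RE.finditer(line)]


if __name__ == "__main__":
    for token, tag in parse_bnc(SAMPLE):
        print(f"{token}\t{tag}")
```

Once the text is in (token, tag) form, it can be searched or counted by part of speech, for example to list all words tagged `NN1`.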

Some corpora
Brown University Corpus (Brown corpus): http://helmer.aksis.uib.no/icame/brown/bcm.html
- American English
- balanced
- over 1 million words
- 500 texts with 2000 words in each
- PoS tagged (82 tags)
- W. Nelson Francis and Henry Kučera

Example: Brown
THE AT A01001001E1
*FULTON NP-TL A01001002E1
*COUNTY NN-TL A01001003E1
*GRAND JJ-TL A01001004E1
*JURY NN-TL A01001005E1
SAID VBD A01001006E1
*FRIDAY NR A01001007E1
AN AT A01001008E1
INVESTIGATION NN A01001009E1
OF IN A01002001E1
*ATLANTA'S NP A01002002E1
RECENT JJ A01002003E1
PRIMARY NN A01002004E1
ELECTION NN A01002005E1
PRODUCED VBD A01002006E1
NO AT A01002007E1
EVIDENCE NN A01002008E1
THAT CS A01002009E1
ANY DTI A01003001E1
IRREGULARITIES NNS A01003002E1
TOOK VBD A01003003E1
PLACE NN A01003004E1
. . A01003005E1
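The Brown example uses a vertical layout with one token per row: word form, PoS tag, and a reference code. A minimal sketch of reading such rows in Python follows. The `Token` type and `parse_brown` function are hypothetical names, and the reading of the leading asterisk as a marker on the word form (presumably capitalization) is an assumption based on the excerpt, not something the slide states.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str      # word form with any leading "*" removed
    tag: str       # PoS tag, e.g. "NP-TL"
    ref: str       # reference code, kept unparsed
    starred: bool  # True if the word form carried a leading "*" (assumed marker)


# First rows of the Brown excerpt on the slide.
SAMPLE = """\
THE AT A01001001E1
*FULTON NP-TL A01001002E1
*COUNTY NN-TL A01001003E1
*GRAND JJ-TL A01001004E1
*JURY NN-TL A01001005E1
SAID VBD A01001006E1
"""


def parse_brown(text):
    """Parse 'WORD TAG REF' rows into Token tuples, skipping malformed rows."""
    tokens = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        word, tag, ref = fields
        tokens.append(Token(word.lstrip("*"), tag, ref, word.startswith("*")))
    return tokens


if __name__ == "__main__":
    for token in parse_brown(SAMPLE):
        print(token)
```

Keeping the reference code as an opaque string avoids guessing at its internal structure, which the slide does not document.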

Some corpora
LOB (Lancaster-Oslo-Bergen) Corpus
- modelled on the Brown corpus, but for British English
- 500 texts with 2000 words in each text
- PoS tagged, with tags taken from Brown with some modifications
- Eric Atwell, Roger Garside, Geoffrey Leech

Some corpora
Speech corpora
- Göteborg Spoken Language Corpus (GSLC)
  1.5 million words, from various social activities
  transcribed, PoS tagged
- London-Lund Corpus (LLC): http://khnt.hit.uib.no/icame/manuals/londlund
  spoken British English, 100 texts with 5000 tokens in each, 500 000 words in total
  transcribed, also phonetically and prosodically

Corpus-based language study
- an old idea, used in dialect studies, comparative linguistics, and descriptive grammar
- combined with modern technology (on a large scale since the 1980s)
- empirical linguistics through corpus linguistics

Corpus linguistics
- The term first appeared in the early 1980s (Leech, 1992)
- The method dates back to the pre-Chomskyan period, when it was used by structuralists, e.g. Sapir, Newman, Bloomfield
- Empirical, based on observed data
- People used smaller, paper-based collections of written or transcribed texts, which were not representative
- The method was widespread in the early twentieth century but was heavily criticized by Chomsky

Corpus linguistics?
- The corpora used were very small, kept on paper, and used primarily for the study of distinguishing features in phonetics; few were used to study grammar, a hard task since all analyses were made by hand
- Chomsky argued against the use of corpora: real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting.
- Merging corpora with modern computer technology

Why use computers to study language?

Why use computers to study language?
- processing speed
- easy manipulation of data (searching, selecting, sorting and formatting)
- machine-readable data can be processed accurately and consistently
- more reliable results than manual processing by humans
- further automatic processing is possible
- data can be enriched with various metadata and linguistic analyses
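As an illustration of searching, selecting, and sorting machine-readable text, here is a minimal keyword-in-context (KWIC) concordance sketch in Python. The function name `kwic`, the crude tokenizer, and the choice to sort by right-hand context are illustrative assumptions; dedicated concordance programs offer far more options.

```python
import re


def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) rows, sorted by right context.

    Tokenization is a crude lowercase split on word characters; this is an
    assumption for illustration, not a recommended tokenizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    # Sentences adapted from the corpus-based approach slide.
    text = (
        "A corpus draws upon authentic text. "
        "A corpus can yield reliable quantitative data. "
        "A corpus can find differences that intuition alone cannot perceive."
    )
    for left, key, right in kwic(text, "corpus"):
        print(f"{left:>12} [{key}] {right}")
```

Sorting by the words to the right of the keyword groups similar continuations together, which is what makes recurring patterns visible in a concordance.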

Why corpus linguistics?
"The immense scope of a modern corpus, and the range of computing resources that are available for exploiting it, make up a powerful force for deepening our awareness and understanding of language." (M.A.K. Halliday)

Why corpora?
- First modern corpus: the Brown corpus (Brown University Standard Corpus of Present-day American English), built in the 1960s
- Since the 1980s, the number and size of corpora and corpus-based studies have increased dramatically.
- Today's corpora give insights into the language used in real-world text.

Why corpora?
- Corpora allow reliable language analysis, in natural contexts and with minimal experimental interference.
- Corpora can be used for a number of practical applications, e.g. in lexicography, language teaching, and language studies.
- Corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific universe.
- Corpora have revolutionized nearly all branches of linguistics.

Purpose
Corpora are created with a special purpose:
- diachronic corpora: empirical studies of language change
- parallel corpora: learning translation parameters for automatic machine translation systems, but also empirical study of similarities and differences between languages

Corpus linguistics
- Corpus linguistics (CL) is the study of language as expressed in samples (corpora) of real-world text.
- two types of corpus linguistics: linguistics and language technology
- different backgrounds, aims, tools, networks, conferences, journals
- empirical language research
- semiautomatic extraction of linguistic knowledge for language technology

Language studies
- Background: empirical linguistics
- Aim: traditional language studies
- Tools: concordances, word lists, statistics programs
- Conferences: e.g. Conference of the Int. Computer Archive of Modern/Mediaeval English (ICAME), Teaching and Language Corpora Conference (TALC)
- Journals: International Journal of Corpus Linguistics, Corpora, Corpus Linguistics and Linguistic Theory, ICAME Journal
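Word lists of the kind mentioned under Tools are essentially frequency counts. The sketch below, in Python, builds a frequency list and counts adjacent word pairs as a very rough starting point for finding collocation candidates. The function names and the sample sentence are invented for illustration; real collocation tools typically rank pairs with statistical association measures rather than raw counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude, for illustration)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens, n=None):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(n)


def bigram_counts(tokens):
    """Count adjacent word pairs; frequent pairs are only collocation candidates."""
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    # Invented sample text, not taken from any corpus.
    text = "the corpus is a sample of the language and the corpus is annotated"
    tokens = tokenize(text)
    print(frequency_list(tokens, 3))
    print(bigram_counts(tokens).most_common(2))
```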

Language technology
- Background: computer science, mathematical methods
- Aim: machine learning
- Tools: tagger, parser, tools for alignment
- Conferences: Int. Conf. on Computational Linguistics (COLING), Meetings of the ACL (ACL, EACL, NAACL), Empirical Methods in NLP (EMNLP)
- Journals: Computational Linguistics, Journal of Natural Language Engineering

Language resources
Resources:
- written/spoken mono-/multilingual corpora
- mono- and multilingual dictionaries
- terminology collections
- grammars
- benchmarks for evaluation
Basic tools:
- modules (e.g. taggers, parsers, grapheme-to-phoneme converters)
- annotation standards and tools
- corpus exploration and exploitation tools

Corpus types
What types of corpora do you know of?

Corpus types
- modality: written, spoken, sign, multimodal
- language type, genre, etc.
- language: one, two, many
- relation between languages (comparable, parallel, ...)
- size: finite size, monitor corpora
- analyzed, disambiguated, type of annotation

Corpus types
A corpus contains text in:
- a single language (monolingual corpus), or
- several languages (multilingual corpus): a collection of text in different languages
  - translation corpus: original text and its translation in different languages
  - comparable corpus: comparable original texts in different languages; the texts in each language have been selected according to the same criteria (genre, content, publication date, etc.)
  - parallel corpus: bi-directional translation corpus specially formatted for side-by-side comparison; a combination of translation and comparable corpus (e.g. EuroParl)
- synchronic or diachronic, historical or modern
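The side-by-side format of a parallel corpus can be pictured as a list of aligned sentence pairs. The following Python sketch shows one possible in-memory representation and a simple search over it. The `AlignedPair` type, the `search` function, and the English-Swedish sentences are all invented for illustration and do not come from EuroParl or any other corpus.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    source: str  # sentence in the source language
    target: str  # aligned sentence in the target language


# Hypothetical sentence-aligned English-Swedish data.
PAIRS = [
    AlignedPair("The house is red.", "Huset är rött."),
    AlignedPair("I see the house.", "Jag ser huset."),
    AlignedPair("The car is red.", "Bilen är röd."),
]


def search(pairs, word):
    """Return the pairs whose source sentence contains the given word."""
    wanted = word.lower()
    return [p for p in pairs if wanted in re.findall(r"\w+", p.source.lower())]


if __name__ == "__main__":
    for pair in search(PAIRS, "house"):
        print(f"{pair.source}  |  {pair.target}")
```

Searching on the source side and displaying the aligned target side is the basic operation behind parallel concordancing: it shows how a word has actually been translated in context.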

Useful links: see the course page
- CORPORA: electronic mailing list for all people interested in corpora
- ACL SIGLEX: http://www.clres.com/siglex.html
  Special Interest Group on the Lexicon of the Association for Computational Linguistics
- The ACL NLP/CL Universe: http://www1.cs.columbia.edu/~radev/u/db/acl/
  Links to computational linguistics resources, including corpora.

Corpus archive, Corpus distributors
Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu
- Development and distribution of language resources: data, tools and standards
- Corpus distribution: text and speech for many different languages, lexicons, training data, benchmarks, etc.
- Projects: corpus collection, annotation, information extraction, etc.
- National Institute of Standards and Technology (NIST): defines benchmarks

Corpus archive, Corpus distributors
European Language Resources Association (ELRA): http://www.elra.info/ and Evaluations and Language Resources Distribution Agency (ELDA): http://www.elda.org
- distribute, produce, standardise, and evaluate language resources (e.g. lexicons, corpora: mono- and multilingual) to promote research in Human Language Technology (HLT)
- organize conferences: the Language Resources and Evaluation Conference (LREC)
- provide test data for evaluating various applications

Corpus archives, organisations
- Oxford Text Archive (OTA): http://ota.ahds.ac.uk/
  collects high-quality electronic texts for research and teaching and distributes more than 2000 resources in more than 20 languages.
- International Computer Archive of Modern English (ICAME): http://nora.hd.uib.no/whatis.html
  corpus distribution in Bergen, Norway; organises a conference and publishes the ICAME Journal
- TELRI: http://www.telri.de
  collects and distributes mono- and multilingual language resources with special focus on Central and Eastern European languages.

Corpus archives, organisations
Språkbanken: http://spraakbanken.gu.se
- Corpora and tools for corpus search
- Lexin, SUC, Swedish Academy Lexicon

Online databases
- Gutenberg: http://www.gutenberg.org
  30 000 free ebooks with expired copyrights, in many languages
- Runeberg: http://runeberg.org
  like Gutenberg, but for Nordic literature
- Gallica: http://gallica.bnf.fr
  French

Assignment
- Find out more about corpus archives
- Find corpora for a language of your choice, take two corpora and compare them
- Try to scan