Introduction. Beáta B. Megyesi. Uppsala University Department of Linguistics and Philology Introduction 1(48)


Introduction
Beáta B. Megyesi
Uppsala University, Department of Linguistics and Philology
beata.megyesi@lingfil.uu.se

Course content
- Credits: 7.5 ECTS
- Subject: Computational linguistics
- Level: Advanced
- Content: The course provides basic knowledge of how to use language resources and tools for language studies.
- Goals: Students learn the basic technologies for creating and processing text collections and speech databases; the focus is on practical use of tools.

Course content
- Requirements: Bachelor's degree with a language as the main field of study
- Regulations: The course may not be included in a degree if equivalent parts have been studied within another course included in the degree.
- Instruction: The teaching consists of lectures, exercises, supervision, and examination.

Course content
- What is a corpus and what is in it?
- Various corpora: text collections, speech databases, treebanks, parallel corpora, parallel treebanks, etc.
- Corpus usage in language studies: concordances, collocations, frequency lists, keywords, corpus search
- Methods for corpus creation: collection, scanning, encoding, formats, annotation, alignment
- Existing tools for corpus creation and analysis on various linguistic levels (e.g., tagging and parsing)
- International corpus distribution
- Presentation of research
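
To make "concordance" concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is only an illustration: the tokenizer is deliberately crude, and the function names, the sample sentence, and the `width` parameter are assumptions rather than part of any tool used in the course.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample text, not taken from any real corpus.
sample = "A corpus is a collection of texts. A corpus can be annotated."
for line in concordance(tokenize(sample), "corpus"):
    print(line)
```

Dedicated concordancers add sorting, regular-expression queries, and annotation-aware search on top of this basic idea.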

Learning outcomes
- account for and independently apply basic concepts in corpus linguistics
- account for and discuss different corpora and speech databases
- account for the different uses of corpora and speech databases in language studies
- master basic techniques for corpus development
- use programs for text processing and quantitative analysis based on corpora
- process and analyze corpora in relation to a research question by using available tools
- carry out project work on a research question and present the study both in writing, as a scientific paper, and orally

Examination
Assignments:
- Corpus distributors and description of two existing corpora of your choice: individual written report, deadline November 13, to Bengt Dahlqvist
- Corpus usage for language studies (WordSmith Tools): written report in groups of two, deadline November 18, to Bengt Dahlqvist
- Corpus annotation and alignment: written report in groups of two, deadline November 23, to Bengt Dahlqvist
- Speech data: written report, deadline December 4, to Petur Helgason

Examination
- Project work: carry out a project on a delimited research question, present the study orally and as a written report in the form of a scientific paper, and peer-review another project during a seminar on December 21
- Grades: Pass or Fail, based on the 4 assignments, the oral and written presentation of the project work, and the peer review

Literature and course page
- Hunston, Susan (2002). Corpora in Applied Linguistics. 6th printing. Cambridge University Press.
- Recommended reading: McEnery, Tony, Richard Xiao and Yukio Tono (2006). Corpus-Based Language Studies: An Advanced Resource Book. Routledge Applied Linguistics.
- Course page: http://stp.lingfil.uu.se/~bea/uv/uv09/dokverktyg/

Outline
- Language studies
- Corpora: definition and content
- Corpus linguistics
- Corpus archives and distribution
- Assignment

Language studies: the intuition-based approach
- the traditional approach
- researchers can invent examples instantly for analysis
- intuition is always available
- invented examples are free from language-external influences
- invented examples may not represent typical language use
- what is acceptable is individual
- intuition should be applied with caution: it may be influenced by one's own dialect or sociolect
- results based on introspection are not observable and are difficult to verify

Language studies: the corpus-based approach
- investigating language by using authentic examples derived from corpora
- what we see in a corpus is largely grammatical and/or acceptable
- a corpus provides evidence of what speakers believe to be acceptable utterances in their language
- a corpus draws upon authentic, real text
- a corpus can reveal differences that intuition alone cannot perceive
- a corpus can yield reliable quantitative data
- not all linguists accept the use of corpora
- not all research questions can be addressed by the corpus-based approach

Language studies
"Neither the corpus linguist of the 1950s, who rejected intuition, nor the general linguist of the 1960s, who rejected corpus data, was able to achieve the interaction of data coverage and the insight that characterise the many successful corpus analyses of recent years." (Leech, 1991)

Language studies
- Corpus-based approach: examples are taken from a corpus to verify a hypothesis
- Corpus-driven approach: the whole corpus is used to find patterns

Corpus: Definition
- Usage: singular corpus, plural corpora
- Definition: a collection of sampled language data, consisting of a digital (machine-readable) collection of written text or transcriptions of spoken language, which is (more or less) representative of the particular language and which may be annotated with various forms of linguistic information
- Goal: to verify hypotheses about natural language, e.g. to investigate how a particular sound, word, or syntactic construction is used

What is a corpus?
A corpus is defined as a body of naturally occurring language. It should be added that computer corpora are rarely haphazard collections of textual material: They are generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type. (Leech, 1992)
A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language. (Sinclair, 1996)

Corpora
- a corpus is not just a collection of some text
- digitized, machine-readable
- authentic texts, including transcripts of spoken data
- sampled to be a representative collection of a particular language or language variety
- often of limited size
- a standard reference for comparative studies

Corpus
Text = written or spoken language
Examples:
- SUC (Stockholm Umeå Corpus): 1 million words (written)
- PAROLE corpus from Språkbanken: 19 million words (written)
- BNC (British National Corpus): 100 million words (written/spoken)
- London-Lund Corpus: 0.5 million words (spoken)

Some corpora
SUC (written)
- 500 texts of 2000 tokens each
- 9 genres, with subcategories, e.g. K (imaginative prose): KK general fiction, KL science fiction, KN light reading, KR humour
- annotated with lemma, PoS, and named entities

Some corpora
Monitor corpora (growing)
- Språkbanken: http://spraakbanken.gu.se/ (corpora and lexicons)
- Bank of English: written and spoken English (for the COBUILD series of English language books)

Some corpora
The British National Corpus (BNC): http://www.natcorp.ox.ac.uk/
- balanced, modern (British) English, over 100 million words (written and spoken)
- 4,124 texts, of which 863 are transcribed from dialogues and monologues
- PoS tagged (65 PoS tags)
- Roger Garside and Geoffrey Leech

Example: BNC
<p> <s n=011> <w AT0>The <w AJ0>medical <w NN2>aspects <w VM0>can <w VBI>be <w NN1>cancer <c PUN>, <w NN1>pneumonia <c PUN>, <w AJ0>sudden <w NN1>blindness <c PUN>, <w NN1>dementia <c PUN>, <w AJ0>dramatic <w NN1>weight loss <w CJC>or <w DT0>any <w NN1>combination <w PRF>of <w DT0>these <c PUN>. </p>
<p> <s n=012> <w AV0>Often <w AJ0>infected <w NN0>people <w VBB>are <w VVN>rejected <w PRP>by <w NN0>family <w CJC>and <w NN2>friends<c PUN>, <w VVG>leaving <w PNP>them <w TO0>to <w VVI>face <w DT0>this <w AJ0>chronic <w NN1>condition <w AJ0-AV0>alone<c PUN>. </p>
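
The tagged format above can be read with a few lines of Python. This is a rough sketch under the assumption that every token is introduced by a `<w TAG>` or `<c TAG>` element with no end tag, as in the sample; the regular expression and the `parse_bnc` helper are illustrative, not an official BNC reader.

```python
import re

# Matches one <w TAG>word or <c TAG>punctuation element. End tags are omitted
# in this SGML-style sample, so the token text runs up to the next "<".
TOKEN = re.compile(r"<([wc]) ([A-Z0-9-]+)>([^<]*)")

def parse_bnc(fragment):
    """Return (text, pos_tag) pairs for every <w>/<c> element in the fragment."""
    return [(text.strip(), tag) for _kind, tag, text in TOKEN.findall(fragment)]

# Shortened version of the first sentence shown on the slide.
sample = ("<s n=011> <w AT0>The <w AJ0>medical <w NN2>aspects "
          "<w VM0>can <w VBI>be <w NN1>cancer <c PUN>.")
pairs = parse_bnc(sample)
print(pairs)

# Select all tokens whose tag starts with NN (the noun tags in this sample).
nouns = [text for text, tag in pairs if tag.startswith("NN")]
print(nouns)
```

Once the text is in (word, tag) form, searches such as "all nouns" or "all adjective + noun sequences" become simple list operations.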

Some corpora
Brown University Corpus (Brown corpus): http://helmer.aksis.uib.no/icame/brown/bcm.html
- American English
- balanced
- over 1 million words
- 500 texts of 2000 words each
- PoS tagged (82 tags)
- W. Nelson Francis and Henry Kucera

Example: Brown
THE AT A01001001E1
*FULTON NP-TL A01001002E1
*COUNTY NN-TL A01001003E1
*GRAND JJ-TL A01001004E1
*JURY NN-TL A01001005E1
SAID VBD A01001006E1
*FRIDAY NR A01001007E1
AN AT A01001008E1
INVESTIGATION NN A01001009E1
OF IN A01002001E1
*ATLANTA'S NP A01002002E1
RECENT JJ A01002003E1
PRIMARY NN A01002004E1
ELECTION NN A01002005E1
PRODUCED VBD A01002006E1
NO AT A01002007E1
EVIDENCE NN A01002008E1
THAT CS A01002009E1
ANY DTI A01003001E1
IRREGULARITIES NNS A01003002E1
TOOK VBD A01003003E1
PLACE NN A01003004E1
. . A01003005E1
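
A sketch of how such a "vertical" file could be read, assuming one token per row with three whitespace-separated fields (word, tag, reference code). Treating the leading `*` as a marker for an initial capital is an assumption made here for illustration, and `parse_brown_vertical` is a hypothetical helper, not part of any distributed Brown tooling.

```python
def parse_brown_vertical(lines):
    """Parse 'WORD TAG REFERENCE' rows into dicts; '*' is treated as a capital marker."""
    tokens = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip rows that do not fit the three-column layout
        word, tag, ref = fields
        capitalized = word.startswith("*")  # assumption: '*' marks an initial capital
        word = word.lstrip("*").lower()
        if capitalized:
            word = word.capitalize()
        tokens.append({"word": word, "tag": tag, "ref": ref})
    return tokens

# A few rows taken from the example above.
rows = [
    "THE AT A01001001E1",
    "*FULTON NP-TL A01001002E1",
    "*COUNTY NN-TL A01001003E1",
    "SAID VBD A01001006E1",
]
for token in parse_brown_vertical(rows):
    print(token["word"], token["tag"], token["ref"])
```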

Some corpora
LOB (Lancaster-Oslo-Bergen) Corpus
- modelled on Brown, for British English
- 500 texts of 2000 words each
- PoS tagged; tags taken from Brown with some modifications
- Eric Atwell, Roger Garside, Geoffrey Leech

Some corpora
Speech corpora
- Göteborg Spoken Language Corpus (GSLC): 1.5 million words from various social activities; transcribed, PoS tagged
- London-Lund Corpus (LLC): http://khnt.hit.uib.no/icame/manuals/londlund spoken British English; 100 texts of 5000 tokens each, 500,000 words in total; transcribed, also phonetically and prosodically

Corpus-based language study
- an old idea, used in dialect studies, comparative linguistics, and descriptive grammar
- combined with modern technology (on a large scale since the 1980s)
- empirical linguistics through corpus linguistics

Corpus linguistics
- The term first appeared in the early 1980s (Leech, 1992)
- The method dates back to the pre-Chomskyan period and was used by structuralists, e.g. Sapir, Newman, Bloomfield
- Empirical, based on observed data
- People used smaller, paper-based collections of written or transcribed texts, which were not representative
- The method was widespread in the early twentieth century but was heavily criticized by Chomsky

Corpus linguistics?
- The corpora used were very small and on paper. They were used primarily to study distinguishing features in phonetics, and only a few were used to study grammar, a hard task since all analyses were made by hand
- Chomsky argued against the use of corpora: real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting
- Merging corpora with modern computer technology

Why use computers to study language?

Why use computers to study language?
- processing speed
- easy manipulation of data (searching, selecting, sorting, and formatting)
- machine-readable data can be processed accurately and consistently
- more reliable results compared to humans
- further automatic processing is possible
- data can be enriched with various metadata and linguistic analyses
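
As a small illustration of "searching, selecting, sorting", the sketch below builds a word frequency list with Python's standard library. The sample sentence is invented and the tokenizer is deliberately simple; both are assumptions for the example only.

```python
from collections import Counter
import re

def frequency_list(text):
    """Count lowercase word tokens in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

# Invented sample sentence.
text = "The cat saw the dog and the dog saw the cat"
freq = frequency_list(text)

# Sort by descending frequency, then alphabetically, so the output is deterministic.
ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
for word, count in ranked:
    print(f"{count:3d} {word}")
```

The same few lines run unchanged on a million-word file, which is exactly the consistency and speed argument made above.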

Why corpus linguistics?
"The immense scope of a modern corpus, and the range of computing resources that are available for exploiting it, make up a powerful force for deepening our awareness and understanding of language." (M.A.K. Halliday)

Why corpora?
- First modern corpus: the Brown corpus (Brown University Standard Corpus of Present-day American English), built in the 1960s
- Since the 1980s, the number and size of corpora and corpus-based studies have increased dramatically.
- Today's corpora give insights into the language used in real-world text.

Why corpora?
- Corpora allow reliable language analysis, in natural contexts and with minimal experimental interference.
- Corpora can be used for a number of practical applications, e.g. in lexicography, language teaching, and language studies.
- Corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific universe.
- Corpora have revolutionized nearly all branches of linguistics.

Purpose
Corpora are created with a special purpose:
- diachronic corpora: empirical studies of language change
- parallel corpora: learning translation parameters for automatic machine translation systems, but also empirical study of similarities and differences between languages

Corpus linguistics
- CL is the study of language as expressed in samples (corpora) of real-world text.
- two types of corpus linguistics: linguistics and language technology
- different backgrounds, aims, tools, networks, conferences, journals
- empirical language research
- semiautomatic extraction of linguistic knowledge for language technology

Language studies
- Background: empirical linguistics
- Aim: traditional language studies
- Tools: concordances, word lists, statistics programs
- Conferences: e.g. Conference of the Int. Computer Archive of Modern/Mediaeval English (ICAME), Teaching and Language Corpora Conference (TALC)
- Journals: International Journal of Corpus Linguistics, Corpora, Corpus Linguistics and Linguistic Theory, ICAME Journal

Language technology
- Background: computer science, mathematical methods
- Aim: machine learning
- Tools: tagger, parser, tools for alignment
- Conferences: Int. Conf. on Computational Linguistics (COLING), Meetings of the ACL (ACL, EACL, NAACL), Empirical Methods in NLP (EMNLP)
- Journals: Computational Linguistics, Journal of Natural Language Engineering

Language resources
- Resources: written/spoken mono-/multilingual corpora, mono- and multilingual dictionaries, terminology collections, grammars
- Benchmarks for evaluation
- Basic tools: modules (e.g. taggers, parsers, grapheme-to-phoneme converters), annotation standards and tools, corpus exploration and exploitation tools

Corpus types
What types of corpora do you know of?

Corpus types
- modality: written, spoken, sign, multimodal
- language type, genre, etc.
- language: one, two, many
- relation between languages (comparable, parallel, ...)
- size: finite size, monitor corpora
- analyzed, disambiguated, type of annotation

Corpus types
A corpus contains text in:
- a single language (monolingual corpus), or
- several languages (multilingual corpus): a collection of text in different languages
  - translation corpus: an original text and its translations into different languages
  - comparable corpus: comparable original texts in different languages, where the texts in each language have been selected according to the same criteria (genre, content, publication date, etc.)
  - parallel corpus: a bi-directional translation corpus specially formatted for side-by-side comparison, a combination of translation and comparable corpus (e.g. EuroParl)
- synchronic or diachronic, historical or modern
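
To show what "formatted for side-by-side comparison" can mean in practice, here is a minimal sketch of a sentence-aligned parallel corpus as a list of (source, target) pairs. The English-Swedish sentence pairs and the `find_aligned` helper are invented for illustration; real parallel corpora such as EuroParl use their own file formats.

```python
# Hypothetical sentence-aligned English-Swedish pairs (invented for illustration).
parallel = [
    ("The house is red.", "Huset är rött."),
    ("I see the house.", "Jag ser huset."),
    ("The car is new.", "Bilen är ny."),
]

def find_aligned(pairs, source_word):
    """Return the pairs whose source sentence contains source_word (case-insensitive)."""
    needle = source_word.lower()
    return [
        (src, tgt)
        for src, tgt in pairs
        if needle in src.lower().replace(".", "").split()
    ]

# Look up a source-language word and show each hit next to its translation.
for src, tgt in find_aligned(parallel, "house"):
    print(f"{src}  |  {tgt}")
```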

Useful links: see the course page
- CORPORA: electronic mailing list for all people interested in corpora
- ACL SIGLEX: http://www.clres.com/siglex.html Special Interest Group on the Lexicon of the Association for Computational Linguistics
- The ACL NLP/CL Universe: http://www1.cs.columbia.edu/~radev/u/db/acl/ Links to computational linguistics resources, including corpora

Corpus archives, corpus distributors
Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu
- Development and distribution of language resources: data, tools, and standards
- Corpus distribution: text and speech for many different languages, lexicons, training data, benchmarks, etc.
- Projects: corpus collection, annotation, information extraction, etc.
National Institute of Standards and Technology (NIST): defines benchmarks

Corpus archives, corpus distributors
European Language Resources Association (ELRA): http://www.elra.info/ and Evaluations and Language Resources Distribution Agency (ELDA): http://www.elda.org
- distribute, produce, standardise, and evaluate language resources (e.g. lexicons and corpora, mono- and multilingual) to promote research in Human Language Technology (HLT)
- organise conferences: the Language Resources and Evaluation Conference (LREC)
- provide test data for evaluating various applications

Corpus archives, organisations
- Oxford Text Archive (OTA): http://ota.ahds.ac.uk/ collects high-quality electronic texts for research and teaching and distributes more than 2000 resources in more than 20 languages
- International Computer Archive of Modern English (ICAME): http://nora.hd.uib.no/whatis.html corpus distribution in Bergen, Norway; organises a conference and publishes the ICAME Journal
- TELRI: http://www.telri.de collects and distributes mono- and multilingual language resources with a special focus on Central and Eastern European languages

Corpus archives, organisations
Språkbanken: http://spraakbanken.gu.se
- Corpora and tools for corpus search
- Lexin, SUC, Swedish Academy Lexicon

Online databases
- Gutenberg: http://www.gutenberg.org 30,000 free ebooks with expired copyrights, in many languages
- Runeberg: http://runeberg.org like Gutenberg, but for Nordic literature
- Gallica: http://gallica.bnf.fr French

Assignment
- Find out more about corpus archives
- Find corpora for a language of your choice, take two corpora and compare them
- Try to scan