Introduction. Beáta B. Megyesi. Uppsala University Department of Linguistics and Philology Introduction 1(48)
2 Course content. Credits: 7.5 ECTS. Subject: Computational linguistics. Level: Advanced. Content: The course provides basic knowledge about how to use language resources and tools for language studies. Goals: Students learn the basic technologies for creating and processing text collections and speech databases; the focus is on the practical use of tools.
3 Course content. Requirements: Bachelor's degree with a language as the main field of study. Regulations: The course may not be included in a degree if equivalent parts have been studied within another course included in the degree. Instruction: The teaching consists of lectures, exercises, supervision, and examination.
4 Course content. What is a corpus and what is in it? Various corpora: text collections, speech databases, treebanks, parallel corpora, parallel treebanks, etc. Corpus usage in language studies: concordances, collocations, frequency lists, keywords, corpus search. Methods for corpus creation: collection, scanning, encoding, formats, annotation, alignment. Existing tools for corpus creation and analysis on various linguistic levels (e.g., tagging and parsing). International corpus distribution. Presentation of research.
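Two of the corpus uses listed above, frequency lists and concordances, can be sketched in a few lines of Python. This is a minimal illustration, not the course's tooling: the function names, the naive tokenizer, and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context triples: (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, used only to exercise the functions.
sample = "The corpus is a sample of the language. A corpus can yield reliable data."
tokens = tokenize(sample)

for word, count in frequency_list(tokens)[:3]:
    print(f"{count:3d}  {word}")

for left, kw, right in concordance(tokens, "corpus", width=2):
    print(f"{left:>15} [{kw}] {right}")
```

Dedicated tools add sorting, lemmatisation, and statistics on top of this, but the underlying operations are counting tokens and slicing context windows.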
5 Learning outcomes: account for and independently apply basic concepts in corpus linguistics; account for and discuss different corpora and speech databases; account for the different uses of corpora and speech databases in language studies; master basic techniques for corpus development; use programs for text processing and quantitative analysis based on corpora; process and analyze corpora in relation to a research question by using available tools; carry out a project on a research question and present the study orally and in writing as a scientific paper.
6 Examination. Assignments: (1) Corpus distributors and a description of two existing corpora of your choice: individual written report, deadline November 13, to Bengt Dahlqvist. (2) Corpus usage for language studies (WordSmith Tools): written report in groups of two, deadline November 18, to Bengt Dahlqvist. (3) Corpus annotation and alignment: written report in groups of two, deadline November 23, to Bengt Dahlqvist. (4) Speech data: written report, deadline December 4, to Petur Helgason.
7 Examination. Project work: carry out a project on a delimited research question, present the study orally and as a written report in the form of a scientific paper, and peer review another project during a seminar on December 21. Grades: Pass or Fail, based on the 4 assignments, the oral and written presentation of the project work, and the peer review.
8 Literature and course page. Hunston, Susan (2002) Corpora in Applied Linguistics. 6th printing. Cambridge University Press. Recommended reading: McEnery, Tony, Richard Xiao and Yukio Tono (2006) Corpus-Based Language Studies - an advanced resource book. Routledge Applied Linguistics. Course page:
9 Outline. Language studies. Corpora: definition and content. Corpus linguistics. Corpus archives and distribution. Assignment.
10 Language studies. Intuition-based approach: traditional; researchers can invent examples instantly for analysis; intuition is available; free from language-external influences; may not represent typical language use; what is acceptable is individual; intuition should be applied with caution, since it may be influenced by one's own dialect or sociolect; results based on introspection are not observable and are difficult to verify.
11 Language studies. Corpus-based approach: investigating language by using authentic examples derived from corpora; what we see in a corpus is largely grammatical and/or acceptable; a corpus provides evidence of what speakers believe to be acceptable utterances in their language; a corpus draws upon authentic or real text; a corpus can reveal differences that intuition alone cannot perceive; a corpus can yield reliable quantitative data; not all linguists accept the use of corpora; not all research questions can be addressed by the corpus-based approach.
12 Language studies. "Neither the corpus linguist of the 1950s, who rejected intuition, nor the general linguist of the 1960s, who rejected corpus data, was able to achieve the interaction of data coverage and the insight that characterise the many successful corpus analyses of recent years." (Leech, 1991)
13 Language studies. Corpus-based approach: examples are taken from a corpus to verify a hypothesis. Corpus-driven approach: the whole corpus is used to find patterns.
14 Corpus: Definition. Usage: singular corpus, plural corpora. Definition: a collection of sampled language data in digital (machine-readable) form, consisting of written text or transcriptions of spoken language, which is (more or less) representative of a particular language and which may be annotated with various forms of linguistic information. Goal: to verify hypotheses about natural language, e.g. to investigate how a particular sound, word, or syntactic construction is used.
15 What is a corpus? A corpus is defined as a body of naturally occurring language. It should be added that computer corpora are rarely haphazard collections of textual material: They are generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type. (Leech, 1992) A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language. (Sinclair, 1996)
16 Corpora. A corpus is not just a collection of some text: digitized, machine-readable; authentic texts, including transcripts of spoken data; sampled to be a representative collection of a particular language or language variety; often of limited size; a standard reference for comparative studies.
17 Corpus. Text = written or spoken language. Examples: SUC (Stockholm Umeå Corpus): 1 million words (written); PAROLE corpus from Språkbanken: 19 million words (written); BNC (British National Corpus): 100 million words (written/spoken); London-Lund Corpus: 0.5 million words (spoken).
18 Some corpora. SUC (written): 500 texts of 2000 tokens each; 9 genres with subcategories, e.g. K (imaginative prose): KK general fiction, KL science fiction, KN light reading, KR humour; annotated with lemma, PoS, and named entities.
19 Some corpora. Monitor corpora (growing): Språkbanken: corpora and lexicon; Bank of English: written and spoken English (for the COBUILD series of English language books).
20 Some corpora. The British National Corpus (BNC): balanced, modern (British) English; over 100 million words (written and spoken); 4,124 texts, of which 863 are transcribed from dialogues and monologues; PoS tagged (65 PoS tags); Roger Garside and Geoffrey Leech.
21 Example: BNC
<p> <s n=011> <w AT0>The <w AJ0>medical <w NN2>aspects <w VM0>can <w VBI>be <w NN1>cancer <c PUN>, <w NN1>pneumonia <c PUN>, <w AJ0>sudden <w NN1>blindness <c PUN>, <w NN1>dementia <c PUN>, <w AJ0>dramatic <w NN1>weight loss <w CJC>or <w DT0>any <w NN1>combination <w PRF>of <w DT0>these <c PUN>. </p>
<p> <s n=012> <w AV0>Often <w AJ0>infected <w NN0>people <w VBB>are <w VVN>rejected <w PRP>by <w NN0>family <w CJC>and <w NN2>friends<c PUN>, <w VVG>leaving <w PNP>them <w TO0>to <w VVI>face <w DT0>this <w AJ0>chronic <w NN1>condition <w AJ0-AV0>alone<c PUN>. </p>
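The BNC markup in this example can be turned into (word, tag) pairs with a regular expression. The sketch below assumes only what is visible in the sample: every word is introduced by `<w TAG>` and every punctuation mark by `<c TAG>`, with no closing tags. The name `parse_bnc` is hypothetical, and a full BNC reader would need to handle more markup than this.

```python
import re

# <w TAG>word or <c TAG>punctuation; the text runs up to the next "<".
TOKEN_RE = re.compile(r"<([wc]) ([A-Z0-9-]+)>([^<]*)")


def parse_bnc(line):
    """Extract (token, PoS tag) pairs from one line of BNC-style markup."""
    return [(text.strip(), tag) for _kind, tag, text in TOKEN_RE.findall(line)]


# Shortened excerpt of the sentence shown in the example above.
sample = (
    "<s n=011> <w AT0>The <w AJ0>medical <w NN2>aspects <w VM0>can "
    "<w VBI>be <w NN1>cancer <c PUN>, <w NN1>pneumonia <c PUN>."
)

for token, tag in parse_bnc(sample):
    print(f"{token}\t{tag}")
```

Elements such as `<s n=011>` are skipped because only `<w ...>` and `<c ...>` match the pattern; multi-word units like "weight loss" stay together under a single tag, as in the source.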
22 Some corpora. Brown University Corpus (Brown corpus): American English; balanced; over 1 million words; 500 texts with 2000 words in each; PoS tagged (82 tags); W. Nelson Francis and Henry Kucera.
23 Example: Brown
THE AT A E1
*FULTON NP-TL A E1
*COUNTY NN-TL A E1
*GRAND JJ-TL A E1
*JURY NN-TL A E1
SAID VBD A E1
*FRIDAY NR A E1
AN AT A E1
INVESTIGATION NN A E1
OF IN A E1
*ATLANTA'S NP A E1
RECENT JJ A E1
PRIMARY NN A E1
ELECTION NN A E1
PRODUCED VBD A E1
NO AT A E1
EVIDENCE NN A E1
THAT CS A E1
ANY DTI A E1
IRREGULARITIES NNS A E1
TOOK VBD A E1
PLACE NN A E1
. . A E1
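The Brown example can be read the same way. The sketch below assumes, from the sample alone, that each record has the shape WORD TAG followed by the reference field `A E1`, and that a leading asterisk marks a capitalised word; both are inferences from the slide rather than a documented format, and `parse_brown` is a hypothetical name.

```python
import re

# Each record ends with the reference field "A E1" (assumed from the sample).
RECORD_END = re.compile(r"\sA E1(?:\s|$)")


def parse_brown(text):
    """Split Brown-style records into (word, tag) pairs, dropping the '*' marker."""
    pairs = []
    for record in RECORD_END.split(text):
        fields = record.split()
        if len(fields) < 2:
            continue  # skip empty trailing pieces
        *word_parts, tag = fields
        pairs.append((" ".join(word_parts).lstrip("*"), tag))
    return pairs


# First records of the example sentence.
sample = "THE AT A E1 *FULTON NP-TL A E1 *COUNTY NN-TL A E1 SAID VBD A E1"

for word, tag in parse_brown(sample):
    print(f"{word}\t{tag}")
```

Because the pattern splits on any whitespace, the function accepts the records whether they are laid out on one line or one per line.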
24 Some corpora. LOB (Lancaster-Oslo-Bergen) Corpus: based on Brown, for British English; 500 texts with 2000 words in each; PoS tagged, with tags taken from Brown with some modifications; Eric Atwell, Roger Garside, Geoffrey Leech.
25 Some corpora. Speech corpora: Göteborg Spoken Language Corpus (GSLC): 1.5 million words from various social activities, transcribed and PoS tagged; London-Lund Corpus (LLC): spoken British English, 100 texts with 5000 tokens in each, 500,000 words in total, transcribed, also phonetically and prosodically.
26 Corpus-based language study. An old idea, used in dialect studies, comparative linguistics, and descriptive grammar; combined with modern techniques (on a large scale since the 1980s); empirical linguistics through corpus linguistics.
27 Corpus linguistics. The term first appeared in the early 1980s (Leech, 1992). The method dates back to the pre-Chomskyan period and was used by structuralists, e.g. Sapir, Newman, Bloomfield. Empirical, based on observed data. People used smaller, paper-based collections of written or transcribed texts, which were not representative. The method was widespread in the early twentieth century but was harshly criticized by Chomsky.
28 Corpus linguistics? The corpora used were very small, on paper, and were used primarily to study distinguishing features in phonetics; few were used to study grammar, a hard task when all analyses are made by hand. Chomsky argued against the use of corpora: real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting. Merging corpora with modern computer technology.
29 Why use computers to study language?
30 Why use computers to study language? Processing speed; easy manipulation of data (searching, selecting, sorting, and formatting); machine-readable data can be processed accurately and consistently; more reliable results compared to humans; further automatic processing is possible; data can be enriched with various metadata and linguistic analyses.
31 Why corpus linguistics? "The immense scope of a modern corpus, and the range of computing resources that are available for exploiting it, make up a powerful force for deepening our awareness and understanding of language." (M.A.K. Halliday)
32 Why corpora? First modern corpus: the Brown corpus (Brown University Standard Corpus of Present-Day American English), built in the 1960s. Since the 1980s, the number and size of corpora and corpus-based studies have increased dramatically. Today's corpora give insights into the language used in real-world text.
33 Why corpora? Corpora allow reliable language analysis in natural contexts and with minimal experimental interference. Corpora can be used for a number of practical applications, e.g. in lexicography, language teaching, and language studies. Corpora are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules on a specific universe. Corpora have revolutionized nearly all branches of linguistics.
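As an illustration of the statistical use mentioned above, the sketch below normalises raw counts to a per-million rate and computes a log-likelihood (G2) score for one word across two corpora. The slides do not name a particular statistic; log-likelihood is used here only as one common choice, and the function names and counts are invented for the example.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 score for one word observed in corpus A and corpus B."""
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented counts: a word seen 30 times in one 1000-word sample
# and 10 times in another sample of the same size.
print(per_million(30, 1000), per_million(10, 1000))
print(round(log_likelihood(30, 1000, 10, 1000), 2))
```

Scores above roughly 3.84 correspond to p < 0.05 at one degree of freedom, so a value in that range suggests the difference in frequency is unlikely to be due to chance alone.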
34 Purpose. Corpora are created with a special purpose. Diachronic corpora: empirical studies of language change. Parallel corpora: learning translation parameters for automatic machine translation systems, but also empirical study of similarities and differences between languages.
35 Corpus linguistics. CL is the study of language as expressed in samples (corpora) or real-world text. Two types of corpus linguistics: linguistics and language technology, with different backgrounds, aims, tools, networks, conferences, and journals: empirical language research; semi-automatic extraction of linguistic knowledge for language technology.
36 Language studies. Background: empirical linguistics. Aim: traditional language studies. Tools: concordances, word lists, statistical programs. Conferences: e.g. the conference of the International Computer Archive of Modern and Medieval English (ICAME); Teaching and Language Corpora Conference (TALC). Journals: International Journal of Corpus Linguistics, Corpora, Corpus Linguistics and Linguistic Theory, ICAME Journal.
37 Language technology. Background: computer science, mathematical methods. Aim: machine learning. Tools: taggers, parsers, tools for alignment. Conferences: International Conference on Computational Linguistics (COLING); meetings of the ACL (ACL, EACL, NAACL); Empirical Methods in NLP (EMNLP). Journals: Computational Linguistics, Journal of Natural Language Engineering.
38 Language resources. Resources: written/spoken mono-/multilingual corpora; mono- and multilingual dictionaries; terminology collections; grammars; benchmarks for evaluation. Basic tools: modules (e.g. taggers, parsers, grapheme-to-phoneme converters); annotation standards and tools; corpus exploration and exploitation tools.
39 Corpus types. What types of corpora do you know of?
40 Corpus types. Modality: written, spoken, sign, multimodal. Language type, genre, etc. Language: one, two, many; relation between languages (comparable, parallel, ...). Size: finite size or monitor corpora. Analyzed, disambiguated; type of annotation.
41 Corpus types. A corpus contains text in a single language (monolingual corpus) or in several languages (multilingual corpus). Multilingual corpus: a collection of text in different languages. Translation corpus: original text and its translation in different languages. Comparable corpus: comparable original texts in different languages, where the texts in each language have been selected according to the same criteria (genre, content, publication date, etc.). Parallel corpus: bi-directional translation corpus specially formatted for side-by-side comparison, a combination of translation and comparable corpus (e.g. EuroParl). Corpora may also be synchronic or diachronic, historical or modern.
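The side-by-side layout of a parallel corpus can be pictured as two lists of sentences that share the same positions. The sketch below assumes the texts are already sentence-aligned one to one, which real alignment tools have to establish first; the function name and the English/Swedish sentence pairs are invented for illustration.

```python
def side_by_side(source_sents, target_sents, width=40):
    """Pair aligned sentences and format each pair on one line."""
    if len(source_sents) != len(target_sents):
        raise ValueError("texts must contain the same number of aligned sentences")
    return [f"{src:<{width}} | {tgt}" for src, tgt in zip(source_sents, target_sents)]


# Invented sentence pairs (English source, Swedish translation).
english = ["The house is red.", "I am reading a book."]
swedish = ["Huset är rött.", "Jag läser en bok."]

for line in side_by_side(english, swedish, width=24):
    print(line)
```

Keeping the alignment as positions in two lists makes it easy to search one language and retrieve the corresponding sentence in the other.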
42 Useful links: see the course page. CORPORA: electronic mailing list for all people interested in corpora. ACL SIGLEX: Special Interest Group on the Lexicon of the Association for Computational Linguistics. The ACL NLP/CL Universe: radev/u/db/acl/ Links to computational linguistics resources, including corpora.
43 Corpus archive, Corpus distributors. Linguistic Data Consortium (LDC): development and distribution of language resources (data, tools, and standards); corpus distribution: text and speech for many different languages, lexicons, training data, benchmarks, etc.; projects: corpus collection, annotation, information extraction, etc. National Institute of Standards and Technology (NIST): defines benchmarks.
44 Corpus archive, Corpus distributors. European Language Resources Association (ELRA) and the Evaluations and Language Resources Distribution Agency (ELDA): distribute, produce, standardize, and evaluate language resources (e.g. lexicons, corpora: mono- and multilingual) to promote research in Human Language Technology (HLT); organize conferences: the Language Resources and Evaluation Conference (LREC); provide test data for evaluating various applications.
45 Corpus archives, organisations. Oxford Text Archive (OTA): collects high-quality electronic texts for research and teaching and distributes more than 2000 resources in more than 20 languages. International Computer Archive of Modern English (ICAME): corpus distribution in Bergen, Norway; organises a conference; ICAME Journal. TELRI: collects and distributes mono- and multilingual language resources, with a special focus on Central and Eastern European languages.
46 Corpus archives, organisations. Språkbanken: corpora and tools for corpus search; Lexin, SUC, Swedish Academy Lexicon.
47 Online databases. Gutenberg: free ebooks with expired copyrights, in many languages. Runeberg: like Gutenberg but for Nordic literature. Gallica: French.
48 Assignment. Find out more about corpus archives. Find corpora for a language of your choice; take two corpora and compare them. Try to scan
More informationLanguage Model and Grammar Extraction Variation in Machine Translation
Language Model and Grammar Extraction Variation in Machine Translation Vladimir Eidelman, Chris Dyer, and Philip Resnik UMIACS Laboratory for Computational Linguistics and Information Processing Department
More informationAccurate Unlexicalized Parsing for Modern Hebrew
Accurate Unlexicalized Parsing for Modern Hebrew Reut Tsarfaty and Khalil Sima an Institute for Logic, Language and Computation, University of Amsterdam Plantage Muidergracht 24, 1018TV Amsterdam, The
More informationA Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many
Schmidt 1 Eric Schmidt Prof. Suzanne Flynn Linguistic Study of Bilingualism December 13, 2013 A Minimalist Approach to Code-Switching In the field of linguistics, the topic of bilingualism is a broad one.
More information11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation
tatistical Parsing (Following slides are modified from Prof. Raymond Mooney s slides.) tatistical Parsing tatistical parsing uses a probabilistic model of syntax in order to assign probabilities to each
More informationThe Role of the Head in the Interpretation of English Deverbal Compounds
The Role of the Head in the Interpretation of English Deverbal Compounds Gianina Iordăchioaia i, Lonneke van der Plas ii, Glorianna Jagfeld i (Universität Stuttgart i, University of Malta ii ) Wen wurmt
More informationImproving software testing course experience with pair testing pattern. Iyad Alazzam* and Mohammed Akour
244 Int. J. Teaching and Case Studies, Vol. 6, No. 3, 2015 Improving software testing course experience with pair testing pattern Iyad lazzam* and Mohammed kour Department of Computer Information Systems,
More informationNCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches
NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science
More informationMemory-based grammatical error correction
Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,
More informationHeuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger
Page 1 of 35 Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger Kaihong Liu, MD, MS, Wendy Chapman, PhD, Rebecca Hwa, PhD, and Rebecca S. Crowley, MD, MS
More informationRole of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation
Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Alistair Conkie AT&T abs - Research 180 Park Avenue, Florham Park,
More informationLQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization
LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization Annemarie Friedrich, Marina Valeeva and Alexis Palmer COMPUTATIONAL LINGUISTICS & PHONETICS SAARLAND UNIVERSITY, GERMANY
More informationCorrespondence between the DRDP (2015) and the California Preschool Learning Foundations. Foundations (PLF) in Language and Literacy
1 Desired Results Developmental Profile (2015) [DRDP (2015)] Correspondence to California Foundations: Language and Development (LLD) and the Foundations (PLF) The Language and Development (LLD) domain
More informationA Corpus of Dutch Aphasic Speech: Sketching the Design and Performing a Pilot Study. E. N. Westerhout November 10, 2005
A Corpus of Dutch Aphasic Speech: Sketching the Design and Performing a Pilot Study E. N. Westerhout November 10, 2005 Abstract In this thesis, a pilot study for the development of a corpus of Dutch aphasic
More informationRoutledge Library Editions: The English Language: Pronouns And Word Order In Old English: With Particular Reference To The Indefinite Pronoun Man
Routledge Library Editions: The English Language: Pronouns And Word Order In Old English: With Particular Reference To The Indefinite Pronoun Man (Routledge Library Edition: The English Language) By Linda
More informationThe influence of written task descriptions in Wizard of Oz experiments
The influence of written task descriptions in Wizard of Oz experiments Heidi Brøseth Department of Language and Communication Studies Norwegian University of Science and Technology NO-7491 Trondheim broseth@hf.ntnu.no
More informationNoisy SMS Machine Translation in Low-Density Languages
Noisy SMS Machine Translation in Low-Density Languages Vladimir Eidelman, Kristy Hollingshead, and Philip Resnik UMIACS Laboratory for Computational Linguistics and Information Processing Department of
More informationChinese Language Parsing with Maximum-Entropy-Inspired Parser
Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art
More informationCELTA. Syllabus and Assessment Guidelines. Third Edition. University of Cambridge ESOL Examinations 1 Hills Road Cambridge CB1 2EU United Kingdom
CELTA Syllabus and Assessment Guidelines Third Edition CELTA (Certificate in Teaching English to Speakers of Other Languages) is accredited by Ofqual (the regulator of qualifications, examinations and
More informationarxiv:cmp-lg/ v1 7 Jun 1997 Abstract
Comparing a Linguistic and a Stochastic Tagger Christer Samuelsson Lucent Technologies Bell Laboratories 600 Mountain Ave, Room 2D-339 Murray Hill, NJ 07974, USA christer@research.bell-labs.com Atro Voutilainen
More informationSpeech Recognition at ICSI: Broadcast News and beyond
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationProduct Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments
Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &
More informationBYLINE [Heng Ji, Computer Science Department, New York University,
INFORMATION EXTRACTION BYLINE [Heng Ji, Computer Science Department, New York University, hengji@cs.nyu.edu] SYNONYMS NONE DEFINITION Information Extraction (IE) is a task of extracting pre-specified types
More informationA heuristic framework for pivot-based bilingual dictionary induction
2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,
More informationEnglish Language and Applied Linguistics. Module Descriptions 2017/18
English Language and Applied Linguistics Module Descriptions 2017/18 Level I (i.e. 2 nd Yr.) Modules Please be aware that all modules are subject to availability. If you have any questions about the modules,
More informationTowards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. Grzegorz Chrupa la
Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing Grzegorz Chrupa la A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.)
More informationExecutive summary (in English)
Executive summary (in English) Project description The project "Open Educational Resources in institutional repositories has been carried out in collaboration between Göteborg university, University of
More informationCS 598 Natural Language Processing
CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@
More informationCEFR Overall Illustrative English Proficiency Scales
CEFR Overall Illustrative English Proficiency s CEFR CEFR OVERALL ORAL PRODUCTION Has a good command of idiomatic expressions and colloquialisms with awareness of connotative levels of meaning. Can convey
More informationMYP Language A Course Outline Year 3
Course Description: The fundamental piece to learning, thinking, communicating, and reflecting is language. Language A seeks to further develop six key skill areas: listening, speaking, reading, writing,
More informationAnnotation Projection for Discourse Connectives
SFB 833 / Univ. Tübingen Penn Discourse Treebank Workshop Annotation projection Basic idea: Given a bitext E/F and annotation for F, how would the annotation look for E? Examples: Word Sense Disambiguation
More informationCorpus on Web: Introducing The First Tagged and Balanced Chinese Corpus + Chu-Ren Huang, *Keh-Jiann Chen and -Shin Lin
Corpus on Web: Introducing The First Tagged and Balanced Chinese Corpus + Chu-Ren Huang, *Keh-Jiann Chen and -Shin Lin + Institute of History & Philology, Academia Sinica *Institute of Information Science,
More informationEvaluation of a Simultaneous Interpretation System and Analysis of Speech Log for User Experience Assessment
Evaluation of a Simultaneous Interpretation System and Analysis of Speech Log for User Experience Assessment Akiko Sakamoto, Kazuhiko Abe, Kazuo Sumita and Satoshi Kamatani Knowledge Media Laboratory,
More informationThe Smart/Empire TIPSTER IR System
The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of
More informationProgressive Aspect in Nigerian English
ISLE 2011 17 June 2011 1 New Englishes Empirical Studies Aspect in Nigerian Languages 2 3 Nigerian English Other New Englishes Explanations Progressive Aspect in New Englishes New Englishes Empirical Studies
More informationHandling Sparsity for Verb Noun MWE Token Classification
Handling Sparsity for Verb Noun MWE Token Classification Mona T. Diab Center for Computational Learning Systems Columbia University mdiab@ccls.columbia.edu Madhav Krishna Computer Science Department Columbia
More informationCombining a Chinese Thesaurus with a Chinese Dictionary
Combining a Chinese Thesaurus with a Chinese Dictionary Ji Donghong Kent Ridge Digital Labs 21 Heng Mui Keng Terrace Singapore, 119613 dhji @krdl.org.sg Gong Junping Department of Computer Science Ohio
More informationDeveloping a TT-MCTAG for German with an RCG-based Parser
Developing a TT-MCTAG for German with an RCG-based Parser Laura Kallmeyer, Timm Lichte, Wolfgang Maier, Yannick Parmentier, Johannes Dellert University of Tübingen, Germany CNRS-LORIA, France LREC 2008,
More informationThe Acquisition of English Grammatical Morphemes: A Case of Iranian EFL Learners
105 By Fatemeh Behjat & Firooz Sadighi The Acquisition of English Grammatical Morphemes: A Case of Iranian EFL Learners Fatemeh Behjat fb_304@yahoo.com Islamic Azad University, Abadeh Branch, Iran Fatemeh
More informationAnna P. Kosterina Iowa State University. Retrospective Theses and Dissertations
Retrospective Theses and Dissertations 2007 The influence of the grammatical structure of L1 on learners' L2 development and transfer patterns in ESL academic writing: a comparative study (a case of Chinese
More informationTeaching ideas. AS and A-level English Language Spark their imaginations this year
Teaching ideas AS and A-level English Language Spark their imaginations this year We ve put together this handy set of teaching ideas so you can explore new ways to engage your AS and A-level English Language
More informationUniversity of Alberta. Large-Scale Semi-Supervised Learning for Natural Language Processing. Shane Bergsma
University of Alberta Large-Scale Semi-Supervised Learning for Natural Language Processing by Shane Bergsma A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of
More informationOntologies vs. classification systems
Ontologies vs. classification systems Bodil Nistrup Madsen Copenhagen Business School Copenhagen, Denmark bnm.isv@cbs.dk Hanne Erdman Thomsen Copenhagen Business School Copenhagen, Denmark het.isv@cbs.dk
More information