Experiments on the LIMSI Broadcast News Data
|
|
- Moses Davidson
- 6 years ago
- Views:
Transcription
1 Experiments on the LIMSI Broadcast News Data Interim Report for SNF Project : Rule-Based Language Model for Speech Recognition Tobias Kaufmann Institut für Technische Informatik und Kommunikationsnetze February 2007
2 1 Lattice Preprocessing The research reported here is based on an experiment performed with the LIMSI German broadcast news transcription system [1]. We selected the first 30 lattices of this experiment, which corresponds roughly to the first 10 minutes of a German 8 o clock news broadcast (Tagesschau der ARD) from the 14 th of April, We first reconstructed the exact scoring scheme, which was not known to us 1. Thus, we finally were able to reproduce the first best scores noted in the comment section of each lattice. We also corrected the reference transcription which contained several errors. For our experiments we assume perfect sentence segmentation, as the benefit of grammar information heavily deteriorates if the sentence boundaries are incorrect. Therefore, the 30 lattices were manually split at the sentence boundaries and merged where a sentence crossed a lattice boundary. As a result, we obtained 107 lattices, each spanning a single sentence. Due to this manual segmentation the word error rate dropped from 13.79% to 13.24%. 2 Linguistic resources We used the Head-driven Phrase Structure Grammar (HPSG, [2]) formalism to develop a precise large-coverage grammar for German. The main grammar consists of 17 general rules, 12 rules for modeling the German sentence structure and 13 construction-specific rules (relative clauses, genitive attributes, optional determiners, nominalized adjectives etc...). The various subgrammars (expressions of date and time, spoken numbers, compound nouns and acronyms) amount to a total of 43 rules. As split numbers and compounds (e.g. ein und zwanzig, kriegs pläne) are not counted as errors in the evaluation scheme, the grammar is able to analyze such expressions. The main grammar is largely based on existing linguistic work, e.g. [3], [4], [5] and [6]. We have added a couple of linguistic phenomena which we considered to be important, but which are often neglected in formal syntactic theories. Among them are prenominal and postnominal genitives, expressions of quantity, expressions of date and time, forms of address and written numbers. The coverage and precision of our grammar is denoted by the set of test sentences which was developed in parallel to the grammar. The test results are available at kaufmann/grammar/test.html. The lexicon has been created manually. A sorted word list was created from each of the 30 original lattices, and each word has been precisely annotated with its syntactic features such as agreement features and valencies. The context in which the word appeared in the reference transcription was not known to the lexicon developer. Consequently, every possible usage for each lexeme had to be entered. Multi-word lexemes posed a particular challenge, as they do not appear as units in the word lists. Important examples of German multi-word lexemes are certain adverbials (e.g. nach wie vor, zu Fuss) and verbs with separable prefixes (e.g. die Sonne geht auf ). 1 A [silence] node receives a silence penalty, but no word penalty. </s> and <s/> are ignored altogether and receive no penalty at all. {breath} and {fw} receive a word penalty. 1
3 3 Feature Extraction In the feature extraction step, each of the N best hypotheses of a lattice is parsed, i.e. the parser identifies every grammatical word sequence in a given hypothesis. In addition, it determines all possible syntactic structures of such a word sequence. To this end, the parse trees are transformed into grammar-independent dependency graph representations similar to those used in the German TIGER treebank [7]. Subsequently, the probability of each dependency graph is estimated by means of a statistical model. This statistical model was designed manually and trained on the TIGER treebank. Formally, our linguistic postprocessing results in a set of features for each recognizer hypothesis W k = w 1, w 2,..., w nk. A feature is a tuple (i, j, p), stating the fact that the word sequence w i, w i+1,..., w j is grammatical and that its most likely syntactic structure has the probability p. Note that for every word w i, there is a feature (i, i, 1). 4 N-Best Rescoring The k-th best recognizer hypothesis W k is assigned a new score which takes into account the linguistic knowledge, and the best hypothesis with respect to this new score is returned as the new recognition result W : ( W = arg max s rec (W k ) W k max P partition(w k ) (i,j,p) P α p) λ (1) In this expression, s rec is the score on the basis of which the original N-best list was computed. It includes an acoustic score, an N-gram language model score and additional correction terms. The bracketed expression is the score computed from our rule-based language model. The two scores are balanced by means of a weight λ. The function partition(w k ) returns all sequences of features spanning the hypothesis W k = w 1, w 2,..., w nk. Formally, (i 1, j 1, p 1 ),..., (i m, j m, p m ) partition(w k ) iff i 1 = 1 and j m = n k and j s + 1 = i s+1 for 1 s < m. The parameter α influences the number of features of the optimal partitioning. If α is very big, partitionings into single-word features are favoured. This means that syntactic information is ignored entirely. However, if α is very small, partitionings into a few features covering many words are favoured, even if they have a small probability. In this case, the binary information on grammaticality dominates the more fine-grained probability provided by the statistical model. 5 Experiment Our experiments were performed on the 107 single-sentence lattices manually created from the first 30 lattices of the LIMSI data. We computed the 20 best hypotheses of each lattice. The average hypothesis length is 13.4 words, with a maximum of 31 words. For given parameters α and λ, each hypothesis was parsed in order to extract the features. Subsequently, the hypotheses were rescored and a new set of first-best solutions was produced. The parameters α and λ were optimized by means of leave-one-out cross- 2
4 experiment word error rate baseline 13.24% grammar 12.31% (-7.0% relative) grammar+cheating 11.89% (-10.2% relative) oracle 7.97% (-39.8% relative) Tabelle 1: The impact of our rule-based language model on the word error rate. validation. Due to the small number of parameters and the small training set, we could apply a simple grid search. Table 1 compares the word error rate of the LIMSI broadcast new transcription system (baseline) to that of the system extended with a rule-based rescoring component (grammar). For comparison, the table also shows the result of the extended system with α and λ optimized on the test data (grammar+cheating), as well as the 20-best oracle word error rate (oracle). By applying the rule-based linguistic knowledge, the word error rate could be reduced by 7.0% relative. Unfortunately, this result is not statistically significant: the significance level for the Matched-Pair Sentence Segment Test [8] is 7.2%, whereas a level of 5% or lower is generally considered to be significant. However, we expect to achieve statistically significant results for a larger training set. The rule-based language model corrects 25 errors and produces 12 new errors. Surprisingly, only about half of the corrected errors are due to the case where the parser picks the correct sentence from the 20-best hypotheses. The remaining errors were corrected by preferring a better, but still incorrect sentence. This suggests that our approach also works in the presence of ungrammatical or out-of-grammar sentences. 6 Problems Our decision to develop a domain-independent lexicon (i.e. to include virtually all usages of a given word) leads to a large amount of ambiguity. The tendency that correct word sequences have many readings is apart form processing issues not problematic for our approach. However, the fact that many bad word sequences do have some readings as well suggests that the criterion of grammaticality alone is not sufficient to distinguish between good and bad word sequences. Instead, the probabilistic model should be extended and refined. For instance: There are many personal names and geographic names which are homophones of nouns, adjectives or verbs 2. Although most of these proper names are very rare and unlikely to appear in a broadcast news context, they were entered into the lexicon. Proper names contribute considerably to ambiguity, as they can appear without a determiner. To deal with this problem, we manually disabled those proper name entries which were not known to us. This is of course unsatisfying. In future, we intend to use corpus linguistics (i.e. named entity extraction) to compute the probability of each proper name for a given domain. 2 For instance, about 40% of all nouns in our lexicon have an inflected form that can also be used as a personal name. 3
5 As the grammar generally allows for split compound nouns, two consecutive nouns can often be analyzed as a compound noun, which leads to massive ambiguity. The performance of our rule-based language model can be expected to improve if the probability of a given compound noun (as estimated on a corpus) is taken into account. Of course, ambiguity also has a big impact on processing efficiency. In the reported experiment, we were able to derive all possible readings for 99.9% of the parsed recognizer hypotheses, even though some of the sentences were quite long. However, processing can take rather long for some highly ambiguous sentences. Therefore it seems to be desirable to use a stochastic HPSG (e.g. [9]), such that the most probable readings are derived first and parsing can be stopped after a number of processing steps. 7 Acknowledgements We wish to thank Jean-Luc Gauvain of LIMSI for providing us with word lattices produced by their German broadcast news transcription system. Literatur [1] Kevin McTait and Martine Adda-Decker, The 300k LIMSI German Broadcast News Transcription System, in ISCA Eurospeech, Geneva, September [2] C. J. Pollard and I. A. Sag, Head-Driven Phrase Structure Grammar, University of Chicago Press, Chicago, [3] Stefan Müller, Head-Driven Phrase Structure Grammar: Eine Einführung, to appear, Stauffenburg Verlag, [4] Stefan Müller, Deutsche Syntax deklarativ. Head-Driven Phrase Structure Grammar für das Deutsche, Number 394 in Linguistische Arbeiten. Max Niemeyer Verlag, Tübingen, [5] Berthold Crysmann, Relative clause extraposition in german: An efficient and portable implementation, Research on Language and Computation, vol. 3, no. 1, pp , [6] Berthold Crysmann, On the efficient implementation of german verb placement in hpsg, in Proceedings of RANLP 2003, [7] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith, The TIGER treebank, in Proceedings of the Workshop on Treebanks and Linguistic Theories, Sozopol, [8] L. Gillick and S. Cox, Some statistical issues in the comparison of speech recognition algorithms, in ICASSP, 1989, pp [9] Steven P. Abney, Stochastic attribute-value grammars, Comput. Linguist., vol. 23, no. 4, pp ,
An Interactive Intelligent Language Tutor Over The Internet
An Interactive Intelligent Language Tutor Over The Internet Trude Heift Linguistics Department and Language Learning Centre Simon Fraser University, B.C. Canada V5A1S6 E-mail: heift@sfu.ca Abstract: This
More informationThe Internet as a Normative Corpus: Grammar Checking with a Search Engine
The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a
More informationEnhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities
Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion
More informationUsing dialogue context to improve parsing performance in dialogue systems
Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,
More informationEAGLE: an Error-Annotated Corpus of Beginning Learner German
EAGLE: an Error-Annotated Corpus of Beginning Learner German Adriane Boyd Department of Linguistics The Ohio State University adriane@ling.osu.edu Abstract This paper describes the Error-Annotated German
More informationAQUA: An Ontology-Driven Question Answering System
AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.
More informationProject in the framework of the AIM-WEST project Annotation of MWEs for translation
Project in the framework of the AIM-WEST project Annotation of MWEs for translation 1 Agnès Tutin LIDILEM/LIG Université Grenoble Alpes 30 october 2014 Outline 2 Why annotate MWEs in corpora? A first experiment
More informationProof Theory for Syntacticians
Department of Linguistics Ohio State University Syntax 2 (Linguistics 602.02) January 5, 2012 Logics for Linguistics Many different kinds of logic are directly applicable to formalizing theories in syntax
More information11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation
tatistical Parsing (Following slides are modified from Prof. Raymond Mooney s slides.) tatistical Parsing tatistical parsing uses a probabilistic model of syntax in order to assign probabilities to each
More informationDeveloping a TT-MCTAG for German with an RCG-based Parser
Developing a TT-MCTAG for German with an RCG-based Parser Laura Kallmeyer, Timm Lichte, Wolfgang Maier, Yannick Parmentier, Johannes Dellert University of Tübingen, Germany CNRS-LORIA, France LREC 2008,
More informationLanguage Model and Grammar Extraction Variation in Machine Translation
Language Model and Grammar Extraction Variation in Machine Translation Vladimir Eidelman, Chris Dyer, and Philip Resnik UMIACS Laboratory for Computational Linguistics and Information Processing Department
More informationMinimalism is the name of the predominant approach in generative linguistics today. It was first
Minimalism Minimalism is the name of the predominant approach in generative linguistics today. It was first introduced by Chomsky in his work The Minimalist Program (1995) and has seen several developments
More informationEnsemble Technique Utilization for Indonesian Dependency Parser
Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id
More informationLearning Methods in Multilingual Speech Recognition
Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex
More informationPrediction of Maximal Projection for Semantic Role Labeling
Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba
More informationChinese Language Parsing with Maximum-Entropy-Inspired Parser
Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art
More informationLING 329 : MORPHOLOGY
LING 329 : MORPHOLOGY TTh 10:30 11:50 AM, Physics 121 Course Syllabus Spring 2013 Matt Pearson Office: Vollum 313 Email: pearsonm@reed.edu Phone: 7618 (off campus: 503-517-7618) Office hrs: Mon 1:30 2:30,
More informationSpecification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments
Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,
More informationSINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,
More informationCS 598 Natural Language Processing
CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@
More informationOn the Notion Determiner
On the Notion Determiner Frank Van Eynde University of Leuven Proceedings of the 10th International Conference on Head-Driven Phrase Structure Grammar Michigan State University Stefan Müller (Editor) 2003
More informationTarget Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data
Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se
More informationIntroduction to HPSG. Introduction. Historical Overview. The HPSG architecture. Signature. Linguistic Objects. Descriptions.
to as a linguistic theory to to a member of the family of linguistic frameworks that are called generative grammars a grammar which is formalized to a high degree and thus makes exact predictions about
More informationAnnotation Projection for Discourse Connectives
SFB 833 / Univ. Tübingen Penn Discourse Treebank Workshop Annotation projection Basic idea: Given a bitext E/F and annotation for F, how would the annotation look for E? Examples: Word Sense Disambiguation
More informationMethods for the Qualitative Evaluation of Lexical Association Measures
Methods for the Qualitative Evaluation of Lexical Association Measures Stefan Evert IMS, University of Stuttgart Azenbergstr. 12 D-70174 Stuttgart, Germany evert@ims.uni-stuttgart.de Brigitte Krenn Austrian
More informationInteractive Corpus Annotation of Anaphor Using NLP Algorithms
Interactive Corpus Annotation of Anaphor Using NLP Algorithms Catherine Smith 1 and Matthew Brook O Donnell 1 1. Introduction Pronouns occur with a relatively high frequency in all forms English discourse.
More informationCOMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR
COMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR ROLAND HAUSSER Institut für Deutsche Philologie Ludwig-Maximilians Universität München München, West Germany 1. CHOICE OF A PRIMITIVE OPERATION The
More informationParsing of part-of-speech tagged Assamese Texts
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal
More informationSyntax Parsing 1. Grammars and parsing 2. Top-down and bottom-up parsing 3. Chart parsers 4. Bottom-up chart parsing 5. The Earley Algorithm
Syntax Parsing 1. Grammars and parsing 2. Top-down and bottom-up parsing 3. Chart parsers 4. Bottom-up chart parsing 5. The Earley Algorithm syntax: from the Greek syntaxis, meaning setting out together
More informationSpoken Language Parsing Using Phrase-Level Grammars and Trainable Classifiers
Spoken Language Parsing Using Phrase-Level Grammars and Trainable Classifiers Chad Langley, Alon Lavie, Lori Levin, Dorcas Wallace, Donna Gates, and Kay Peterson Language Technologies Institute Carnegie
More informationCharacter Stream Parsing of Mixed-lingual Text
Character Stream Parsing of Mixed-lingual Text Harald Romsdorfer and Beat Pfister Speech Processing Group Computer Engineering and Networks Laboratory ETH Zurich {romsdorfer,pfister}@tik.ee.ethz.ch Abstract
More informationProcedia - Social and Behavioral Sciences 154 ( 2014 )
Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 154 ( 2014 ) 263 267 THE XXV ANNUAL INTERNATIONAL ACADEMIC CONFERENCE, LANGUAGE AND CULTURE, 20-22 October
More informationA Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many
Schmidt 1 Eric Schmidt Prof. Suzanne Flynn Linguistic Study of Bilingualism December 13, 2013 A Minimalist Approach to Code-Switching In the field of linguistics, the topic of bilingualism is a broad one.
More informationBasic Parsing with Context-Free Grammars. Some slides adapted from Julia Hirschberg and Dan Jurafsky 1
Basic Parsing with Context-Free Grammars Some slides adapted from Julia Hirschberg and Dan Jurafsky 1 Announcements HW 2 to go out today. Next Tuesday most important for background to assignment Sign up
More informationModeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures
Modeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures Ulrike Baldewein (ulrike@coli.uni-sb.de) Computational Psycholinguistics, Saarland University D-66041 Saarbrücken,
More informationThe stages of event extraction
The stages of event extraction David Ahn Intelligent Systems Lab Amsterdam University of Amsterdam ahn@science.uva.nl Abstract Event detection and recognition is a complex task consisting of multiple sub-tasks
More informationThe Karlsruhe Institute of Technology Translation Systems for the WMT 2011
The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu
More informationParticipate in expanded conversations and respond appropriately to a variety of conversational prompts
Students continue their study of German by further expanding their knowledge of key vocabulary topics and grammar concepts. Students not only begin to comprehend listening and reading passages more fully,
More informationThe College Board Redesigned SAT Grade 12
A Correlation of, 2017 To the Redesigned SAT Introduction This document demonstrates how myperspectives English Language Arts meets the Reading, Writing and Language and Essay Domains of Redesigned SAT.
More informationA Case Study: News Classification Based on Term Frequency
A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center
More informationConstraining X-Bar: Theta Theory
Constraining X-Bar: Theta Theory Carnie, 2013, chapter 8 Kofi K. Saah 1 Learning objectives Distinguish between thematic relation and theta role. Identify the thematic relations agent, theme, goal, source,
More informationWhat the National Curriculum requires in reading at Y5 and Y6
What the National Curriculum requires in reading at Y5 and Y6 Word reading apply their growing knowledge of root words, prefixes and suffixes (morphology and etymology), as listed in Appendix 1 of the
More informationTowards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. Grzegorz Chrupa la
Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing Grzegorz Chrupa la A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.)
More informationCAAP. Content Analysis Report. Sample College. Institution Code: 9011 Institution Type: 4-Year Subgroup: none Test Date: Spring 2011
CAAP Content Analysis Report Institution Code: 911 Institution Type: 4-Year Normative Group: 4-year Colleges Introduction This report provides information intended to help postsecondary institutions better
More informationThe Role of the Head in the Interpretation of English Deverbal Compounds
The Role of the Head in the Interpretation of English Deverbal Compounds Gianina Iordăchioaia i, Lonneke van der Plas ii, Glorianna Jagfeld i (Universität Stuttgart i, University of Malta ii ) Wen wurmt
More informationCase government vs Case agreement: modelling Modern Greek case attraction phenomena in LFG
Case government vs Case agreement: modelling Modern Greek case attraction phenomena in LFG Dr. Kakia Chatsiou, University of Essex achats at essex.ac.uk Explorations in Syntactic Government and Subcategorisation,
More informationSpeech Recognition at ICSI: Broadcast News and beyond
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationChunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence.
NLP Lab Session Week 8 October 15, 2014 Noun Phrase Chunking and WordNet in NLTK Getting Started In this lab session, we will work together through a series of small examples using the IDLE window and
More informationIntra-talker Variation: Audience Design Factors Affecting Lexical Selections
Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and
More informationGrammars & Parsing, Part 1:
Grammars & Parsing, Part 1: Rules, representations, and transformations- oh my! Sentence VP The teacher Verb gave the lecture 2015-02-12 CS 562/662: Natural Language Processing Game plan for today: Review
More informationLoughton School s curriculum evening. 28 th February 2017
Loughton School s curriculum evening 28 th February 2017 Aims of this session Share our approach to teaching writing, reading, SPaG and maths. Share resources, ideas and strategies to support children's
More informationSpecifying a shallow grammatical for parsing purposes
Specifying a shallow grammatical for parsing purposes representation Atro Voutilainen and Timo J~irvinen Research Unit for Multilingual Language Technology P.O. Box 4 FIN-0004 University of Helsinki Finland
More information1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature
1 st Grade Curriculum Map Common Core Standards Language Arts 2013 2014 1 st Quarter (September, October, November) August/September Strand Topic Standard Notes Reading for Literature Key Ideas and Details
More informationTheoretical Syntax Winter Answers to practice problems
Linguistics 325 Sturman Theoretical Syntax Winter 2017 Answers to practice problems 1. Draw trees for the following English sentences. a. I have not been running in the mornings. 1 b. Joel frequently sings
More informationSemi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.
Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link
More informationSome Principles of Automated Natural Language Information Extraction
Some Principles of Automated Natural Language Information Extraction Gregers Koch Department of Computer Science, Copenhagen University DIKU, Universitetsparken 1, DK-2100 Copenhagen, Denmark Abstract
More informationImproved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form
Orthographic Form 1 Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form The development and testing of word-retrieval treatments for aphasia has generally focused
More informationMultilingual Sentiment and Subjectivity Analysis
Multilingual Sentiment and Subjectivity Analysis Carmen Banea and Rada Mihalcea Department of Computer Science University of North Texas rada@cs.unt.edu, carmen.banea@gmail.com Janyce Wiebe Department
More informationBeyond the Pipeline: Discrete Optimization in NLP
Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We
More informationLanguage Independent Passage Retrieval for Question Answering
Language Independent Passage Retrieval for Question Answering José Manuel Gómez-Soriano 1, Manuel Montes-y-Gómez 2, Emilio Sanchis-Arnal 1, Luis Villaseñor-Pineda 2, Paolo Rosso 1 1 Polytechnic University
More informationPREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES
PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,
More informationAccurate Unlexicalized Parsing for Modern Hebrew
Accurate Unlexicalized Parsing for Modern Hebrew Reut Tsarfaty and Khalil Sima an Institute for Logic, Language and Computation, University of Amsterdam Plantage Muidergracht 24, 1018TV Amsterdam, The
More informationCS 1103 Computer Science I Honors. Fall Instructor Muller. Syllabus
CS 1103 Computer Science I Honors Fall 2016 Instructor Muller Syllabus Welcome to CS1103. This course is an introduction to the art and science of computer programming and to some of the fundamental concepts
More informationMETHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS
METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS Ruslan Mitkov (R.Mitkov@wlv.ac.uk) University of Wolverhampton ViktorPekar (v.pekar@wlv.ac.uk) University of Wolverhampton Dimitar
More informationarxiv: v1 [cs.cl] 2 Apr 2017
Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,
More informationBAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass
BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,
More informationA Computational Evaluation of Case-Assignment Algorithms
A Computational Evaluation of Case-Assignment Algorithms Miles Calabresi Advisors: Bob Frank and Jim Wood Submitted to the faculty of the Department of Linguistics in partial fulfillment of the requirements
More informationTwitter Sentiment Classification on Sanders Data using Hybrid Approach
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders
More informationUniversiteit Leiden ICT in Business
Universiteit Leiden ICT in Business Ranking of Multi-Word Terms Name: Ricardo R.M. Blikman Student-no: s1184164 Internal report number: 2012-11 Date: 07/03/2013 1st supervisor: Prof. Dr. J.N. Kok 2nd supervisor:
More informationAn Empirical and Computational Test of Linguistic Relativity
An Empirical and Computational Test of Linguistic Relativity Kathleen M. Eberhard* (eberhard.1@nd.edu) Matthias Scheutz** (mscheutz@cse.nd.edu) Michael Heilman** (mheilman@nd.edu) *Department of Psychology,
More informationLinking Task: Identifying authors and book titles in verbose queries
Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,
More informationCS Machine Learning
CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing
More informationLearning Computational Grammars
Learning Computational Grammars John Nerbonne, Anja Belz, Nicola Cancedda, Hervé Déjean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, Franck Thollard and Erik Tjong Kim Sang Abstract
More informationSegmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition
Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Yanzhang He, Eric Fosler-Lussier Department of Computer Science and Engineering The hio
More informationRole of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation
Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Alistair Conkie AT&T abs - Research 180 Park Avenue, Florham Park,
More informationUNIVERSITY OF OSLO Department of Informatics. Dialog Act Recognition using Dependency Features. Master s thesis. Sindre Wetjen
UNIVERSITY OF OSLO Department of Informatics Dialog Act Recognition using Dependency Features Master s thesis Sindre Wetjen November 15, 2013 Acknowledgments First I want to thank my supervisors Lilja
More informationCalibration of Confidence Measures in Speech Recognition
Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE
More informationContext Free Grammars. Many slides from Michael Collins
Context Free Grammars Many slides from Michael Collins Overview I An introduction to the parsing problem I Context free grammars I A brief(!) sketch of the syntax of English I Examples of ambiguous structures
More informationAssignment 1: Predicting Amazon Review Ratings
Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for
More informationThe Smart/Empire TIPSTER IR System
The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of
More informationLQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization
LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization Annemarie Friedrich, Marina Valeeva and Alexis Palmer COMPUTATIONAL LINGUISTICS & PHONETICS SAARLAND UNIVERSITY, GERMANY
More informationThe Interface between Phrasal and Functional Constraints
The Interface between Phrasal and Functional Constraints John T. Maxwell III* Xerox Palo Alto Research Center Ronald M. Kaplan t Xerox Palo Alto Research Center Many modern grammatical formalisms divide
More informationTHE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING
SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,
More informationSenior Stenographer / Senior Typist Series (including equivalent Secretary titles)
New York State Department of Civil Service Committed to Innovation, Quality, and Excellence A Guide to the Written Test for the Senior Stenographer / Senior Typist Series (including equivalent Secretary
More informationThe Strong Minimalist Thesis and Bounded Optimality
The Strong Minimalist Thesis and Bounded Optimality DRAFT-IN-PROGRESS; SEND COMMENTS TO RICKL@UMICH.EDU Richard L. Lewis Department of Psychology University of Michigan 27 March 2010 1 Purpose of this
More informationWord-based dialect identification with georeferenced rules
Word-based dialect identification with georeferenced rules Yves Scherrer LATL Université de Genève Genève, Switzerland yves.scherrer@unige.ch Owen Rambow CCLS Columbia University New York, USA rambow@ccls.columbia.edu
More informationCEFR Overall Illustrative English Proficiency Scales
CEFR Overall Illustrative English Proficiency s CEFR CEFR OVERALL ORAL PRODUCTION Has a good command of idiomatic expressions and colloquialisms with awareness of connotative levels of meaning. Can convey
More informationCross Language Information Retrieval
Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................
More informationDerivational and Inflectional Morphemes in Pak-Pak Language
Derivational and Inflectional Morphemes in Pak-Pak Language Agustina Situmorang and Tima Mariany Arifin ABSTRACT The objectives of this study are to find out the derivational and inflectional morphemes
More informationLanguage Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus
Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,
More informationLTAG-spinal and the Treebank
LTAG-spinal and the Treebank a new resource for incremental, dependency and semantic parsing Libin Shen (lshen@bbn.com) BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA Lucas Champollion (champoll@ling.upenn.edu)
More informationTowards a MWE-driven A* parsing with LTAGs [WG2,WG3]
Towards a MWE-driven A* parsing with LTAGs [WG2,WG3] Jakub Waszczuk, Agata Savary To cite this version: Jakub Waszczuk, Agata Savary. Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]. PARSEME 6th general
More informationConstruction Grammar. University of Jena.
Construction Grammar Holger Diessel University of Jena holger.diessel@uni-jena.de http://www.holger-diessel.de/ Words seem to have a prototype structure; but language does not only consist of words. What
More informationApplying Speaking Criteria. For use from November 2010 GERMAN BREAKTHROUGH PAGRB01
Applying Speaking Criteria For use from November 2010 GERMAN BREAKTHROUGH PAGRB01 Contents Introduction 2 1: Breakthrough Stage The Languages Ladder 3 Languages Ladder can do statements for Breakthrough
More informationMandarin Lexical Tone Recognition: The Gating Paradigm
Kansas Working Papers in Linguistics, Vol. 0 (008), p. 8 Abstract Mandarin Lexical Tone Recognition: The Gating Paradigm Yuwen Lai and Jie Zhang University of Kansas Research on spoken word recognition
More informationBANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS
Daffodil International University Institutional Repository DIU Journal of Science and Technology Volume 8, Issue 1, January 2013 2013-01 BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS Uddin, Sk.
More informationAge Effects on Syntactic Control in. Second Language Learning
Age Effects on Syntactic Control in Second Language Learning Miriam Tullgren Loyola University Chicago Abstract 1 This paper explores the effects of age on second language acquisition in adolescents, ages
More informationCorpus Linguistics (L615)
(L615) Basics of Markus Dickinson Department of, Indiana University Spring 2013 1 / 23 : the extent to which a sample includes the full range of variability in a population distinguishes corpora from archives
More informationEdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,
More information