Experiments on the LIMSI Broadcast News Data

Interim Report for SNF Project 105211-112133: Rule-Based Language Model for Speech Recognition

Tobias Kaufmann
Institut für Technische Informatik und Kommunikationsnetze

February 2007

1 Lattice Preprocessing

The research reported here is based on an experiment performed with the LIMSI German broadcast news transcription system [1]. We selected the first 30 lattices of this experiment, which correspond roughly to the first 10 minutes of a German 8 o'clock news broadcast (Tagesschau der ARD) from the 14th of April, 2002.

We first reconstructed the exact scoring scheme, which was not known to us (see footnote 1), and were thus able to reproduce the first-best scores noted in the comment section of each lattice. We also corrected the reference transcription, which contained several errors.

For our experiments we assume perfect sentence segmentation, as the benefit of grammar information deteriorates heavily if the sentence boundaries are incorrect. The 30 lattices were therefore manually split at sentence boundaries and merged where a sentence crossed a lattice boundary. As a result, we obtained 107 lattices, each spanning a single sentence. Due to this manual segmentation, the word error rate dropped from 13.79% to 13.24%.

Footnote 1: A [silence] node receives a silence penalty, but no word penalty. </s> and <s/> are ignored altogether and receive no penalty at all. {breath} and {fw} receive a word penalty.

2 Linguistic Resources

We used the Head-driven Phrase Structure Grammar (HPSG, [2]) formalism to develop a precise large-coverage grammar for German. The main grammar consists of 17 general rules, 12 rules modeling the German sentence structure and 13 construction-specific rules (relative clauses, genitive attributes, optional determiners, nominalized adjectives, etc.). The various subgrammars (expressions of date and time, spoken numbers, compound nouns and acronyms) amount to a total of 43 rules. As split numbers and compounds (e.g. "ein und zwanzig", "kriegs pläne") are not counted as errors in the evaluation scheme, the grammar is able to analyze such expressions.

The main grammar is largely based on existing linguistic work, e.g. [3], [4], [5] and [6]. We have added a number of linguistic phenomena which we consider important but which are often neglected in formal syntactic theories, among them prenominal and postnominal genitives, expressions of quantity, expressions of date and time, forms of address and written numbers. The coverage and precision of our grammar are documented by the set of test sentences which was developed in parallel to the grammar. The test results are available at http://www.tik.ee.ethz.ch/~kaufmann/grammar/test.html.

The lexicon has been created manually. A sorted word list was created from each of the 30 original lattices, and each word was annotated precisely with its syntactic features, such as agreement features and valencies. The context in which a word appeared in the reference transcription was not known to the lexicon developer. Consequently, every possible usage of each lexeme had to be entered. Multi-word lexemes posed a particular challenge, as they do not appear as units in the word lists. Important examples of German multi-word lexemes are certain adverbials (e.g. "nach wie vor", "zu Fuss") and verbs with separable prefixes (e.g. "die Sonne geht auf").
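To make the shape of such lexicon entries concrete, the following is a minimal sketch of how a lexeme with agreement features, valencies and multiple surface parts could be recorded. It is purely illustrative: the actual lexicon uses HPSG feature structures, and all field names and values below are assumptions, not the real format.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    # Illustrative sketch only: the real lexicon consists of HPSG feature
    # structures, not this simplified record format.
    @dataclass
    class LexEntry:
        lemma: str                                          # citation form
        category: str                                       # e.g. "verb", "noun", "adverbial"
        agreement: dict = field(default_factory=dict)       # e.g. person, number, case
        valency: List[str] = field(default_factory=list)    # selected arguments
        parts: Tuple[str, ...] = ()                         # surface parts of a multi-word lexeme

    # A verb with a separable prefix: stem and prefix occur as two separate
    # tokens in the word lists ("die Sonne geht auf").
    aufgehen = LexEntry(
        lemma="aufgehen",
        category="verb",
        agreement={"person": 3, "number": "sg"},
        valency=["subject:nominative"],
        parts=("geht", "auf"),
    )

    # A multi-word adverbial that never appears as a single unit in the lattices.
    nach_wie_vor = LexEntry(
        lemma="nach wie vor",
        category="adverbial",
        parts=("nach", "wie", "vor"),
    )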

3 Feature Extraction

In the feature extraction step, each of the N best hypotheses of a lattice is parsed, i.e. the parser identifies every grammatical word sequence in a given hypothesis. In addition, it determines all possible syntactic structures of such a word sequence. To this end, the parse trees are transformed into grammar-independent dependency graph representations similar to those used in the German TIGER treebank [7]. Subsequently, the probability of each dependency graph is estimated by means of a statistical model. This statistical model was designed manually and trained on the TIGER treebank.

Formally, our linguistic postprocessing results in a set of features for each recognizer hypothesis W_k = w_1, w_2, ..., w_{n_k}. A feature is a tuple (i, j, p), stating that the word sequence w_i, w_{i+1}, ..., w_j is grammatical and that its most likely syntactic structure has probability p. Note that for every word w_i there is a feature (i, i, 1).

4 N-Best Rescoring

The k-th best recognizer hypothesis W_k is assigned a new score which takes the linguistic knowledge into account, and the best hypothesis with respect to this new score is returned as the new recognition result W*:

    W^{*} = \arg\max_{W_k} \; s_{\mathrm{rec}}(W_k) \cdot \Bigl( \max_{P \in \mathrm{partition}(W_k)} \; \prod_{(i,j,p) \in P} \alpha\, p \Bigr)^{\lambda}    (1)

In this expression, s_rec is the score on the basis of which the original N-best list was computed. It includes an acoustic score, an N-gram language model score and additional correction terms. The bracketed expression is the score computed from our rule-based language model. The two scores are balanced by means of a weight λ.

The function partition(W_k) returns all sequences of features spanning the hypothesis W_k = w_1, w_2, ..., w_{n_k}. Formally, (i_1, j_1, p_1), ..., (i_m, j_m, p_m) is in partition(W_k) iff i_1 = 1, j_m = n_k and j_s + 1 = i_{s+1} for 1 <= s < m.

The parameter α influences the number of features in the optimal partitioning. If α is very large, partitionings into single-word features are favoured, which means that syntactic information is ignored entirely. If α is very small, partitionings into a few features covering many words are favoured, even if they have a small probability; in this case the binary information on grammaticality dominates the more fine-grained probability provided by the statistical model.
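Equation (1) does not require enumerating all partitions: since a feature (i, i, 1) exists for every word, a simple left-to-right dynamic program over word positions is sufficient. The sketch below illustrates this; the function and variable names are assumptions for illustration, not the actual implementation, and all scores are treated as plain probabilities as in Equation (1).

    def rule_based_lm_score(features, n, alpha):
        """Max over partitions of prod(alpha * p) for a hypothesis of n words,
        given features as (i, j, p) tuples with 1-based, inclusive indices
        (Section 3). Because every (i, i, 1) feature is present, a complete
        partition always exists."""
        best = [1.0] + [0.0] * n          # best[j] = best partition score of w_1 .. w_j
        by_end = {}                        # group features by their end position j
        for (i, j, p) in features:
            by_end.setdefault(j, []).append((i, p))
        for j in range(1, n + 1):
            best[j] = max((best[i - 1] * alpha * p for (i, p) in by_end.get(j, [])),
                          default=0.0)
        return best[n]

    def rescore(hypotheses, alpha, lam):
        """hypotheses: list of (s_rec, n_words, features) tuples.
        Returns the hypothesis maximizing Equation (1)."""
        return max(hypotheses,
                   key=lambda h: h[0] * rule_based_lm_score(h[2], h[1], alpha) ** lam)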

5 Experiment

Our experiments were performed on the 107 single-sentence lattices manually created from the first 30 lattices of the LIMSI data. We computed the 20 best hypotheses of each lattice. The average hypothesis length is 13.4 words, with a maximum of 31 words. For given parameters α and λ, each hypothesis was parsed in order to extract the features. Subsequently, the hypotheses were rescored and a new set of first-best solutions was produced. The parameters α and λ were optimized by means of leave-one-out cross-validation. Due to the small number of parameters and the small training set, we could apply a simple grid search.

Table 1 compares the word error rate of the LIMSI broadcast news transcription system (baseline) to that of the system extended with a rule-based rescoring component (grammar). For comparison, the table also shows the result of the extended system with α and λ optimized on the test data (grammar+cheating), as well as the 20-best oracle word error rate (oracle).

    experiment          word error rate
    baseline            13.24%
    grammar             12.31%  (-7.0% relative)
    grammar+cheating    11.89%  (-10.2% relative)
    oracle               7.97%  (-39.8% relative)

Table 1: The impact of our rule-based language model on the word error rate.

By applying the rule-based linguistic knowledge, the word error rate could be reduced by 7.0% relative. Unfortunately, this result is not statistically significant: the significance level for the Matched-Pair Sentence Segment Test [8] is 7.2%, whereas a level of 5% or lower is generally considered significant. However, we expect to achieve statistically significant results with a larger training set.

The rule-based language model corrects 25 errors and produces 12 new errors. Surprisingly, only about half of the corrected errors are due to the parser picking the correct sentence from the 20-best hypotheses. The remaining errors were corrected by preferring a better, but still incorrect, sentence. This suggests that our approach also works in the presence of ungrammatical or out-of-grammar sentences.
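The tuning of α and λ described above amounts to a grid search wrapped in leave-one-out cross-validation over the per-sentence lattices. The following is a minimal sketch of that procedure under stated assumptions: the parameter grids, the word_error_count function, the rescore callable and the per-lattice data structure (with a .reference transcription) are illustrative placeholders, not the actual code or values used.

    import itertools

    def tune_and_evaluate(lattices, alphas, lams, word_error_count, rescore):
        """Leave-one-out cross-validation: for each held-out lattice, choose the
        (alpha, lambda) pair that minimizes errors on the remaining lattices,
        then score the held-out lattice with that pair. Returns total errors."""
        def errors(subset, alpha, lam):
            return sum(word_error_count(rescore(lat, alpha, lam), lat.reference)
                       for lat in subset)

        total = 0
        for held_out in lattices:
            train = [lat for lat in lattices if lat is not held_out]
            best_alpha, best_lam = min(itertools.product(alphas, lams),
                                       key=lambda al: errors(train, *al))
            total += word_error_count(rescore(held_out, best_alpha, best_lam),
                                      held_out.reference)
        return total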

6 Problems

Our decision to develop a domain-independent lexicon (i.e. to include virtually all usages of a given word) leads to a large amount of ambiguity. The tendency of correct word sequences to have many readings is, apart from processing issues, not problematic for our approach. However, the fact that many bad word sequences also have some readings suggests that the criterion of grammaticality alone is not sufficient to distinguish between good and bad word sequences. Instead, the probabilistic model should be extended and refined. For instance:

There are many personal names and geographic names which are homophones of nouns, adjectives or verbs (see footnote 2). Although most of these proper names are very rare and unlikely to appear in a broadcast news context, they were entered into the lexicon. Proper names contribute considerably to ambiguity, as they can appear without a determiner. To deal with this problem, we manually disabled those proper name entries which were not known to us. This is of course unsatisfactory. In the future, we intend to use corpus linguistics (i.e. named entity extraction) to compute the probability of each proper name for a given domain.

As the grammar generally allows for split compound nouns, two consecutive nouns can often be analyzed as a compound noun, which leads to massive ambiguity. The performance of our rule-based language model can be expected to improve if the probability of a given compound noun (as estimated on a corpus) is taken into account.

Footnote 2: For instance, about 40% of all nouns in our lexicon have an inflected form that can also be used as a personal name.

Of course, ambiguity also has a big impact on processing efficiency. In the reported experiment, we were able to derive all possible readings for 99.9% of the parsed recognizer hypotheses, even though some of the sentences were quite long. However, processing can take rather long for some highly ambiguous sentences. It therefore seems desirable to use a stochastic HPSG (e.g. [9]), such that the most probable readings are derived first and parsing can be stopped after a certain number of processing steps.

7 Acknowledgements

We wish to thank Jean-Luc Gauvain of LIMSI for providing us with word lattices produced by their German broadcast news transcription system.

References

[1] Kevin McTait and Martine Adda-Decker, The 300k LIMSI German Broadcast News Transcription System, in Proceedings of ISCA Eurospeech, Geneva, September 2003.

[2] C. J. Pollard and I. A. Sag, Head-Driven Phrase Structure Grammar, University of Chicago Press, Chicago, 1994.

[3] Stefan Müller, Head-Driven Phrase Structure Grammar: Eine Einführung, to appear, Stauffenburg Verlag, 2007.

[4] Stefan Müller, Deutsche Syntax deklarativ. Head-Driven Phrase Structure Grammar für das Deutsche, Number 394 in Linguistische Arbeiten, Max Niemeyer Verlag, Tübingen, 1999.

[5] Berthold Crysmann, Relative clause extraposition in German: An efficient and portable implementation, Research on Language and Computation, vol. 3, no. 1, pp. 61-82, 2005.

[6] Berthold Crysmann, On the efficient implementation of German verb placement in HPSG, in Proceedings of RANLP 2003, 2003.

[7] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith, The TIGER treebank, in Proceedings of the Workshop on Treebanks and Linguistic Theories, Sozopol, 2002.

[8] L. Gillick and S. Cox, Some statistical issues in the comparison of speech recognition algorithms, in Proceedings of ICASSP, 1989, pp. 532-535.

[9] Steven P. Abney, Stochastic attribute-value grammars, Computational Linguistics, vol. 23, no. 4, pp. 597-618, 1997.