Automated Extraction of Lexico-Syntactic Information


Ondřej Bojar, obo@cuni.cz
June 17, 2004

Outline
Motivation: Why syntactic lexicons?
The two goals:
  Extending monolingual syntactic lexicons.
  Providing translation dictionaries with syntactic information.
Summary and further research.

Syntactic Analysis: A Tough Cookie
Task: String of words → tree.
There are O(n^n) configurations theoretically possible.
Ribarov (thesis in preparation): sentences of 20 words have 300 structures in PDT, not 15.6 billion structures.
Bojar [2004]: Allow observed local configurations and you'll get 9^n possible solutions (9^20 ≈ 10^19).
Statistical approaches simply cheat: they provide most common (frequent) analyses and therefore commonly seem to perform well.
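The orders of magnitude quoted on this slide can be checked with a short calculation. The sketch below is only an arithmetic illustration: it uses n^n as a stand-in for the O(n^n) bound, and the function names are hypothetical, not taken from the talk.

```python
import math


def unconstrained_bound(n: int) -> int:
    """n^n, used here as a stand-in for the O(n^n) bound on configurations."""
    return n ** n


def local_config_bound(n: int, base: int = 9) -> int:
    """base^n, the estimate when only observed local configurations are allowed."""
    return base ** n


n = 20
# Report both bounds as powers of ten for a 20-word sentence.
print(f"n^n for n={n}: about 10^{math.log10(unconstrained_bound(n)):.1f}")  # about 10^26.0
print(f"9^n for n={n}: about 10^{math.log10(local_config_bound(n)):.1f}")   # about 10^19.1
```

Restricting the parser to observed local configurations therefore removes roughly seven orders of magnitude for a 20-word sentence, yet the remaining space is still far too large to enumerate.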

What a Lexicon Might Tell
(1) Manažeři      hypermarketu        přišli  na  trh.
    The managers  of the hypermarket  came    to  the market.

(2) Manažeři      hypermarketu     přišli  na  chuť.
    The managers  the hypermarket  came    to  taste.
    "The managers started to like the hypermarket."

The Two Goals
Extending monolingual syntactic lexicons.
  Current lexicons are incomplete.
  Building lexicons is a demanding task.
  Eventually, human decisions are necessary.
  An automatic preprocessing of corpus examples will help.
Providing translation dictionaries with syntactic information.
  No formalized syntactic information in current dictionaries.
  Several semi-automatic steps proposed.

Extending Monolingual Syntactic Lexicons (I)
Treebank data are not sufficient. In PDT, after 20,000–75,000 training sentences have been observed:
  a new lemma (i.e. word) comes every 1.6–1.8 test sentences,
  a new full morphological tag comes every 110–290 test sentences,
  a new simplified tag (see note) comes every 280–870 test sentences.
PDT covers only 5,407 of the 22,276 Czech verbs observed in the Czech National Corpus.
PDT contains more than 50 occurrences per verb for only a few hundred verbs.
Note: the simplified tag comprises only POS, SUBPOS, CASE, NUMBER and GENDER information.
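A figure such as "a new lemma comes every k test sentences" could be measured along the following lines. This is a minimal sketch under stated assumptions, not the procedure used for the numbers above: sentences are assumed to be lists of lemmas (or tags), each unseen item is counted once at its first appearance in the test data, and the function name and toy data are hypothetical.

```python
from typing import Iterable, List, Set


def sentences_per_new_item(train: Iterable[List[str]], test: List[List[str]]) -> float:
    """Average number of test sentences per item unseen in training.

    Assumption: an unseen item is counted once, at its first test occurrence.
    """
    seen: Set[str] = {item for sentence in train for item in sentence}
    new_items = 0
    for sentence in test:
        for item in sentence:
            if item not in seen:
                new_items += 1
                seen.add(item)
    if new_items == 0:
        return float("inf")  # no unseen item in the test data
    return len(test) / new_items


# Toy data (hypothetical), sentences given as lists of lemmas.
train = [["manažer", "přijít", "na", "trh"], ["husa", "být", "divoký"]]
test = [["manažer", "přijít", "na", "chuť"], ["kniha", "být", "účetní"], ["husa", "přijít"]]
print(sentences_per_new_item(train, test))  # 3 test sentences / 3 new lemmas = 1.0

# Share of Czech National Corpus verbs that PDT covers, from the counts above.
pdt_verb_coverage = 5407 / 22276
print(f"{pdt_verb_coverage:.1%}")  # 24.3%
```

The same function applies to full or simplified tags if each sentence is given as a list of tags instead of lemmas.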

Extending Monolingual Syntactic Lexicons (II)
Simple scheme:
  (Get more texts, e.g. from the Czech National Corpus or the Internet.)
  Morphologically annotate and disambiguate the sentences.
  Employ one of the parsers available for Czech to get the dependency trees.
  Extract the desired lexico-syntactic information.
But current parsers are not accurate enough (55% verb frames observed correctly):
  Select nice examples only.
  Keep sentences containing the observed phenomenon.
  Filter out all the sentences where the phenomenon is hidden and/or interferes with other phenomena.
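The idea of keeping only "nice" sentences can be illustrated with a simple filter. The sketch below is an assumption-laden illustration and not the selection criteria from the talk: the `Token` type, the coarse tags ("V" for verb, "J" for conjunction), and the three criteria (target verb present, exactly one verb, no conjunction, bounded length) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str
    lemma: str
    pos: str  # coarse tag; "V" = verb, "J" = conjunction (hypothetical tagset)


def is_nice_example(sentence: List[Token], verb_lemma: str, max_len: int = 15) -> bool:
    """Keep a sentence only if the target verb is its single verb and nothing interferes."""
    verbs = [t for t in sentence if t.pos == "V"]
    contains_target = any(t.lemma == verb_lemma for t in verbs)
    single_verb = len(verbs) == 1                          # no other clause to confuse frames
    no_conjunction = all(t.pos != "J" for t in sentence)   # no coordination
    return contains_target and single_verb and no_conjunction and len(sentence) <= max_len


simple = [
    Token("Manažeři", "manažer", "N"),
    Token("hypermarketu", "hypermarket", "N"),
    Token("přišli", "přijít", "V"),
    Token("na", "na", "R"),
    Token("trh", "trh", "N"),
]
coordinated = [
    Token("Manažeři", "manažer", "N"),
    Token("přišli", "přijít", "V"),
    Token("a", "a", "J"),
    Token("odešli", "odejít", "V"),
]
print(is_nice_example(simple, "přijít"))       # True
print(is_nice_example(coordinated, "přijít"))  # False
```

Only the sentences that pass such a filter would be handed to the parser, trading corpus coverage for more reliable frame observations.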

Extending Monolingual Syntactic Lexicons (III)
Bojar [2002] designs a scripting language to select nice examples.
A sample script to select nice examples of verbal frames helps current parsers:
  5–10% improvement in correct dependencies (reached 88% dependencies correct).
  10–15% improvement in correctly observed verb frames (reached 65% frames correct).
Different scripts should be used when extracting different kinds of data.

Providing Translation Dictionaries with Syntactic Information
Accessible Czech-English dictionaries are for humans only.
Necessary steps:
  Add morphological information.
  Add syntactic structure and agreement constraints.
  Provide entries with examples.
  Estimate monolingual and parallel frequencies.
Benefits:
  Better machine translation, better monolingual syntactic lexicons.
  Possibility to align deep syntactic lexicons.

Steps Proposed
Manual disambiguation necessary. (We shall see.)
Automatic morphological analysis and grouping saves a lot of work.
Corpus data as an additional source:

  Data annotation level            Possible tasks to augment entries
  Entries         Corpora
  no annotation   any annotation   no useful task
  morphology      morphology       possibly annotate internal agreement
  morphology      trees            find syntactic structure
  trees           morphology       find examples, estimate frequency
  trees           trees            confirm structure

Morphological Disambiguation: Manually!

  Noun and Noun/Adjective   Correct Interpretation   English Translation
  husa divoká               Noun Adjective           grey goose
  kniha účetní              Noun Adjective           account book
  napětí dovolené           Noun Adjective           permissible stress
  chyba měření              Noun Noun                measurement error
  plán prací                Noun Noun                schedule of operation
  rozsah měření             Noun Noun                range of measurement

  Numeral/Verb and Noun     Correct Interpretation   English Translation
  tři prdele                Numeral Noun             shitloads
  pět švestek               Numeral Noun             one's duds
  pět chválu                Verb Noun                sing someone's praises

Notes:
  These expressions allow for another interpretation, too, mostly kind of funny.
  pět švestek: part of the idiom "pick up one's duds".

Adding Syntactic Information
Manual annotation plausible and preferable for important groups.

  Noun Adjective Noun            English Translation
  Komise Evropské unie           Commission of the European Community
  náhrada způsobené škody        dilapidation
  látkou potažené sedadlo        fabric-covered seat
  poruchy způsobené přijímačem   set noise
  nevolnost způsobená pohybem    kinesia

  (Syntactic Structure column: tree diagrams, not preserved.)

Once the syntactic information is present:
  Agreement constraints can be added automatically, per structure.
  Searching for examples becomes more precise.
  Searching can find more examples (modified multi-word entries).

Searching for Examples, Estimating Frequencies
In treebanks:
  Finding a subtree in a forest is easy.
  Only most prominent examples expected in PDT.
  Possibility to search in automatic trees of nice sentences.
In plain corpora:
  Use agreement constraints to reject random collocations.
  Use syntactic structure to allow valid reordering and extra modifications present at specific places.
  Relevant source to estimate frequencies.
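The use of agreement constraints in a morphologically tagged plain corpus can be sketched as follows. This is an illustration under assumptions, not the method from the talk: the `Tok` type, its feature names, the coarse tags "N" and "A", and the `max_gap` window are hypothetical, and real Czech positional tags carry more information.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tok:
    lemma: str
    pos: str      # "N" noun, "A" adjective (hypothetical coarse tags)
    case: int     # 1 = nominative, 4 = accusative, ...
    number: str   # "S" or "P"
    gender: str   # "F", "M", "N"


AGREEMENT_FEATURES = ("case", "number", "gender")


def agrees(a: Tok, b: Tok) -> bool:
    """True if both tokens share case, number and gender."""
    return all(getattr(a, f) == getattr(b, f) for f in AGREEMENT_FEATURES)


def find_noun_adj(sentence: List[Tok], noun_lemma: str, adj_lemma: str,
                  max_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (noun_index, adjective_index) pairs that agree, in either order,
    with at most max_gap tokens between them."""
    hits = []
    for i, noun in enumerate(sentence):
        if noun.pos != "N" or noun.lemma != noun_lemma:
            continue
        for j, adj in enumerate(sentence):
            close_enough = 0 < abs(i - j) <= max_gap + 1
            if adj.pos == "A" and adj.lemma == adj_lemma and close_enough and agrees(noun, adj):
                hits.append((i, j))
    return hits


# "husa divoká": both nominative singular feminine, so the pair is accepted.
good = [Tok("husa", "N", 1, "S", "F"), Tok("divoký", "A", 1, "S", "F")]
# Accusative feminine noun next to a nominative masculine adjective: rejected.
bad = [Tok("husa", "N", 4, "S", "F"), Tok("divoký", "A", 1, "S", "M")]
print(find_noun_adj(good, "husa", "divoký"))  # [(0, 1)]
print(find_noun_adj(bad, "husa", "divoký"))   # []
```

Counting the accepted hits over a large corpus would then give a frequency estimate for the entry, with chance co-occurrences of the two lemmas filtered out by the agreement check.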

Aligning Deep Syntactic Lexicons
Czech and English deep syntactic lexicons under development (VALLEX and FrameNet).
Annotation schemata not equivalent but comparable.
No common parallel corpus annotated with VALLEX on the Czech side and FrameNet on the English side.
Use surface syntactic translation dictionary as a bridging link.

Conclusion and Further Research
We need syntactic lexicons.
Automatic preprocessing saves effort (but does not solve the problem) for the two tasks:
  Extending monolingual syntactic lexicons.
  Providing translation dictionaries with syntactic information.
Further goals:
  Build a syntactic Czech-English dictionary.
  Evaluate the utility of syntactic dictionaries.

References
Ondřej Bojar. Automatická extrakce lexikálně-syntaktických údajů z korpusu (Automatic extraction of lexico-syntactic information from corpora). Master's thesis, ÚFAL, MFF UK, Prague, Czech Republic, 2002. In Czech.
Ondřej Bojar. Czech Syntactic Analysis Constraint-Based, XDG: One Possible Start. Prague Bulletin of Mathematical Linguistics, 81:43–54, 2004. ISSN 0032-6585.