SynTagRus (Russian National Corpus)

- Over 52,000 sentences as of 2012, drawn from texts of a variety of genres (contemporary fiction, popular science, newspapers, etc.) dating from 1960 to 2012
- A sub-corpus of the Russian National Corpus (RNC)
- Developed and maintained by the Laboratory of Computational Linguistics (LCL) in Moscow
- The main purpose of the corpus is to facilitate academic research on the lexicon and grammar of the language, as well as on the subtle but constant processes of language change within a relatively short period of time: from one to two centuries
- Annotation consists of morphological marking and syntactic tagging
Done by: Wong Si Ning

Syntactic and Morphological Annotation
Done semi-automatically:
- First processed by the ETAP-3 parser
- Then manually corrected by linguists
As Russian is a free word order language, the annotation:
1) Relies on the Meaning-Text theory of Igor Mel'čuk
2) Uses a dependency tree
- Nodes: the lemma, POS and morphological features (e.g. aspect, tense, person, gender)
- Arcs: syntactic relations
Uses a morphological dictionary with over 130,000 entries
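The node-and-arc description above can be pictured with a small data structure. This is only a sketch: the class name `Node`, the helper functions, the tag and relation labels, and the example sentence are all assumptions made for illustration, and none of them reflects the actual SynTagRus file format or tagset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One word of a sentence: lemma, POS and morphological features."""
    index: int                      # 1-based position in the sentence
    form: str                       # surface word form
    lemma: str
    pos: str                        # illustrative tag, not the SynTagRus tagset
    feats: Dict[str, str] = field(default_factory=dict)
    head: Optional[int] = None      # index of the governing node; None for the root
    relation: Optional[str] = None  # label of the arc from the head to this node


def dependents(tree: List[Node], head_index: int) -> List[Node]:
    """Return the nodes directly governed by the node at head_index."""
    return [n for n in tree if n.head == head_index]


def root(tree: List[Node]) -> Node:
    """Return the single node that has no head."""
    roots = [n for n in tree if n.head is None]
    if len(roots) != 1:
        raise ValueError("a dependency tree must have exactly one root")
    return roots[0]


# Hypothetical example sentence; relation labels are illustrative only.
sentence = [
    Node(1, "Мама", "мама", "NOUN",
         {"case": "nom", "gender": "fem", "number": "sg"},
         head=2, relation="subject"),
    Node(2, "мыла", "мыть", "VERB",
         {"aspect": "imperf", "tense": "past", "gender": "fem"}),
    Node(3, "раму", "рама", "NOUN",
         {"case": "acc", "gender": "fem", "number": "sg"},
         head=2, relation="object"),
]

print(root(sentence).lemma)
print([n.form for n in dependents(sentence, 2)])
```

Because each node stores only the index of its head, word order and syntactic structure are kept apart, which is one reason dependency trees suit a free word order language.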

Lexical Semantic and Lexical Functional Annotation
- SynTagRus currently contains only partial lexical functional annotation (e.g. collocations)
- SynTagRus currently shows only the lemmas of the words occurring in the texts: it does not disambiguate ambiguous words unless their readings have different lemmas or different POS tags
- There is an ongoing project to add lexical semantic and lexical functional annotation to the corpus

Usage
- As a benchmark in regression tests designed to ensure stable performance of the ETAP-3 Russian parser (Iomdin et al., 2012)
- To refine the ETAP-3 parser (Boguslavsky et al., 2011)
- To train other parsers (Shelmanov & Smirnov, 2014)
- As a source for the creation of statistical parsers for Russian (Nivre et al., 2008)
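To make the benchmark use concrete, the sketch below scores parser output against gold annotation with unlabeled and labeled attachment scores, a common way to evaluate dependency parsers. It is an assumption that a regression test could be built this way; the slides do not say how ETAP-3 is actually tested, and the function names, the baseline value and the relation labels are hypothetical.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, relation label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS): the share of tokens with the correct head,
    and the share with both the correct head and the correct label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), full_hits / len(gold)


def has_regressed(score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Flag a regression when the score drops below the stored baseline."""
    return score < baseline - tolerance


# Hypothetical three-token sentence; head index 0 marks the root.
gold = [(2, "subject"), (0, "root"), (2, "object")]
predicted = [(2, "subject"), (0, "root"), (2, "modifier")]

uas, las = attachment_scores(gold, predicted)
print(uas, las, has_regressed(las, baseline=0.9))
```

In a real regression run the scores would be aggregated over the whole corpus and compared with the figures recorded for the previous parser version.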

References
Boguslavsky, I., Iomdin, L., Timoshenko, S. P., & Frolova, T. I. (2009, April). Development of the Russian Tagged Corpus with Lexical and Functional Annotation. In Metalanguage and Encoding Scheme Design for Digital Lexicography. MONDILEX Third Open Workshop. Proceedings. Bratislava, Slovakia (pp. 83-90).
Development of a dependency treebank for Russian and its possible applications in NLP.
Boguslavsky, I., Iomdin, L., Sizov, V., Tsinman, L., & Petrochenkov, V. (2011). Rule-based dependency parser refined by empirical and corpus statistics. In Proceedings of the International Conference on Dependency Linguistics (pp. 318-327).
Iomdin, L. (2012). Automatic text processing and deeply annotated text corpora of Russian: interaction and mutual impact [PowerPoint slides]. Retrieved from http://korpus.sk/files/roadshow2012/iomdin-syntagrus.pdf
Iomdin, L., Petrochenkov, V., Sizov, V., & Tsinman, L. (2012). ETAP parser: state of the art. Retrieved from http://www.dialog-21.ru/digests/dialog2012/materials/pdf/iomdin.pdf
Nivre, J., Boguslavsky, I. M., & Iomdin, L. L. (2008, August). Parsing the SynTagRus treebank of Russian. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1 (pp. 641-648). Association for Computational Linguistics.
Shelmanov, A. O., & Smirnov, I. V. (2014). Methods for Semantic Role Labeling of Russian Texts. Retrieved from http://www.dialog-21.ru/digests/dialog2014/materials/pdf/shelmanovaosmirnoviv.pdf

ISLRN
Title: SynTagRus
Full Title: SynTagRus Corpus
Resource Type: Corpus
Source/URL: http://www.ruscorpora.ru/en/search-syntax.html
Format/MIME Type: text/xml
Size/Duration: Over 52,000 sentences
Access Medium: Online
Description: A sub-corpus of the Russian National Corpus, SynTagRus is a corpus of Russian texts annotated with dependency-type syntactic structures, with full morphological and syntactic markup
Version: 2?
Media Type: Text
Language: Russian, English
Resource Creator: Laboratory of Computational Linguistics of the Institute of Information Transmission Problems in Moscow
Distributor: Laboratory of Computational Linguistics of the Institute of Information Transmission Problems in Moscow
Rights Holder: Laboratory of Computational Linguistics of the Institute of Information Transmission Problems in Moscow