
Build a Chinese Treebank as the test suite for Chinese parser

Zhou Qiang, Sun Maosong
The State Key Laboratory of Intelligent Technology and Systems
Dept. of Computer Science and Technology, Tsinghua University, Beijing 100084, P. R. China
zhouq@s1000e.cs.tsinghua.edu.cn

Abstract

This paper introduces our current work to build a Chinese treebank that can be used as a test suite for Chinese parsers. The treebank will consist of 10,000 Chinese sentences extracted from a balanced Chinese corpus of about 2,000,000 Chinese characters. The corpus has already been annotated with correct word segmentation and part-of-speech (POS) information. The following issues are discussed in the paper: a survey of the balanced corpus, the strategies and methods for sampling the treebank sentences, and the processing schemes and tools for treebank construction.

1. Introduction

Syntactic parsing is an important stage in natural language understanding. In this respect, two important issues are how to develop new efficient and robust parsers and how to evaluate the performance of different parsers. A good test suite is therefore a prerequisite for parser development. As large-scale corpora and annotated materials grow daily, many statistics-based English parsers have been developed, such as Magerman's statistical decision-tree parser [Mag95], Collins' bigram dependency model parser [Col96], and Ratnaparkhi's maximum entropy model parser [Rat97]. An interesting characteristic of these parsers is that they were all trained on the same annotated corpus, the Wall Street Journal (WSJ) portion of the Penn Treebank [MSM93], and evaluated with the same performance measures, the PARSEVAL measures [Bla91]. Thus, the advantages and disadvantages of the different statistical models can easily be identified through performance comparison. In this paper, we introduce our work to build a similar test suite for Chinese parsers, i.e.
a Chinese treebank with about 10,000 representative Chinese sentences extracted from a large-scale balanced Chinese corpus. The sentences were first preprocessed by an automatic chunker and a statistics-based parser and then manually proofread, so as to obtain a corpus annotated with correct constituent boundary tags (in the chunk bank) and parse trees (in the treebank). In the following sections, section 2 gives an overview of the balanced corpus, section 3 discusses the sampling strategy and algorithm for the test suite, section 4 describes the treebank construction procedure, and section 5 introduces our current work on developing some useful processing tools for treebank construction.

2. The balanced corpus

The balanced corpus was built according to the following principles:
1) Select contemporary written Chinese texts; most were published in the 1990s and a few in the 1980s.
2) Text selection gives priority to style, then to domain category. The four main styles are literature, news, academy, and practical writing.
3) Select complete articles so as to preserve the content coherence of the texts.

Table 1 shows the basic statistics of the corpus. All the texts were preprocessed by an automatic Chinese segmentation and POS tagging tool and then manually corrected. To guarantee the consistency of manual proofreading, a detailed Chinese segmentation and POS tagging specification was developed. The POS tag set consists of 95 tags, which give meticulous descriptions of the different syntactic functions of Chinese words, especially the verbs. All of this work provides good support for further syntactic parsing. Table 2 shows some statistics of the corpus after segmentation and POS tagging.

Table 1. Basic statistics of the balanced corpus

Styles            | Number of articles | Chinese characters | Ratio
Literature        | 295                | 880,057            | 44%
News              | 376                | 600,490            | 30%
Academy           | 29                 | 402,623            | 20%
Practical writing | 258                | 119,488            | 6%
Total             | 958                | 2,002,658          | 100%

Table 2. Some statistics of the corpus after segmentation and POS tagging

Styles            | Punctuation | Words     | Ratio
Literature        | 148,453     | 760,337   | 48%
News              | 86,163      | 438,095   | 28%
Academy           | 52,823      | 278,728   | 18%
Practical writing | 28,727      | 91,929    | 6%
Total             | 316,116     | 1,569,089 | 100%

3. Sampling the treebank sentences

A good test suite should cover various language phenomena. Since our test suite is specially designed for Chinese parsers, it must contain Chinese sentences whose parsing complexities are distributed as evenly as possible. On the assumption that most Chinese grammatical phenomena appear in the sentences of the balanced corpus, our goal is to find some simple measures of parsing complexity so as to select representative sentences from the corpus. Because the only annotated information available at present is the word segmentation and the POS tags, we select the following two simple features as measures of sentence parsing complexity:

- The number of different kinds of common verbs in a sentence (VNum).
- The length of the sentence, i.e. the number of Chinese words (including punctuation) in the sentence (SLen).

A basic assumption is that parsing complexity is directly proportional to the VNum and SLen of a sentence. Based on this, a simple sentence sampling algorithm was developed as follows:
1) Extract every complete sentence (i.e.
the sentence ending with a period, question mark, or exclamation mark) from the source texts.
2) Build a common verb list by setting a frequency threshold to remove low-frequency verbs in the corpus.
3) Classify all the common verbs into 7 categories (Table 3), according to their POS tags.

Table 3. The common verb categories

Cat. | POS      | Content description
1    | vgn      | The verb takes a nominal object.
2    | vgv, vga | The verb takes a predicative object.
3    | vgs      | The verb takes a sentential object.
4    | vgd      | The verb takes two nominal objects, i.e. a direct and an indirect object.
5    | vgj      | The verb takes a nominal object and a complement.
6    | vgp      | The verb acts as the direct modifier of a noun.
7    |          | Other verbs.

4) Form the basic sentence sets by classifying all the complete sentences into 6 sets, according to the number of different types of verbs in each sentence (VNum = 0, 1, 2, 3, 4, >4).
5) Share out the total number (N) of required sentences among the sets (Ni), according to the distribution of sentence counts (Si) over the sets:

    Ni = N * Si / Σ_{j=1..6} Sj,   i ∈ [1, 6]
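The proportional allocation in step 5 (and the analogous length-based allocation in step 6) amounts to stratified sampling. The sketch below illustrates the idea with hypothetical stratum counts; the real Si and N_il values come from the corpus, not from this example.

```python
def allocate(total, strata_sizes):
    """Share `total` samples across strata in proportion to their sizes
    (the Ni = N * Si / sum(Sj) rule of step 5; step 6 applies the same
    idea to sentence-length bins inside one set)."""
    grand = sum(strata_sizes)
    return [round(total * s / grand) for s in strata_sizes]

# Step 5: share N = 10,000 sentences over the 6 VNum sets.
# These Si counts are hypothetical, not the paper's actual figures.
set_sizes = [3000, 2500, 2000, 1500, 800, 300]
per_set = allocate(10_000, set_sizes)        # e.g. first set gets 2970

# Step 6: within the first set, share its quota over length bins N_il.
length_bins = [400, 900, 1100, 600]          # hypothetical N_il counts
per_length = allocate(per_set[0], length_bins)
```

Because of rounding, the allocated totals can differ slightly from N, which is consistent with the sample set being "about" 10,000 sentences.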

6) Sample the sentences in every set according to the distribution of sentence lengths within it. If the total number of sentences of the same length (SLen = l) in the i-th set is N_il, then the sample number (SN_il) for sentences of that length is:

    SN_il = Ni * N_il / Σ_{k=1..Mi} N_ik,   i ∈ [1, 6]

where Mi is the maximum sentence length in the i-th set.

After the above six-stage processing, we obtained a sample set of about 10,000 sentences from the balanced corpus. Table 4 shows its basic statistics. The distribution of sentences of different lengths can be found in Figure 1. Due to the special sampling strategies, there are some sudden-change points in the figure.

Table 4. The basic statistics of the sample set

Sentence set | No. of char. | No. of words | No. of sent. | Ave. sent. length
SLen < 20    | 117,287      | 73,898       | 7,057        | 10.47
SLen >= 20   | 265,226      | 166,676      | 5,047        | 33.03
Total        | 382,513      | 240,574      | 12,104       | 19.88

[Figure 1. Sentence length vs. sentence number in the test suite]

4. Building the treebank

The treebank building procedure is a two-stage process. At the first stage, the sentences are automatically assigned chunk information, including word boundary (WS) and constituent group (CG) tags [ZSh99], and are then corrected by human annotators; a correct chunk bank is thus constructed. At the second stage, the chunked sentences are parsed by a statistics-based Chinese parser [ZQ97] and then manually proofread. After that, a correct treebank can be built. Figure 3 shows an overview of the two-stage approach and Figure 2 gives a detailed example.

(a) (my)/r (brother)/n (want)/v (buy)/v (two)/m (-classifier)/q (football)/n (period)/w [1]
    "My brother wants to buy two footballs."
(b) {PR [MD /r /n] [ /v [ /v [ /m /q ] /n } /w }
(c) [zj [dj [np /r /n] [vp /v [vp /v [np [mp /m /q ] /n ]]]] /w ]

Figure 2. An overview of the annotation representation: (a) the segmented and tagged sentence; (b) the chunked sentence; (c) the bracketed and labeled sentence.

[1] The POS and syntactic tags used in this sentence are briefly described as follows. [POS tags]: r -- pronoun, n -- noun, v -- verb, m -- numeral, q -- classifier, w -- punctuation (only the main category tags are used here). [Syn tags]: np -- noun phrase, mp -- numeral-classifier phrase, vp -- verb phrase, dj -- simple sentence pattern, zj -- complete sentence.

The advantage of this two-stage approach lies in greatly increasing overall parsing efficiency and greatly reducing the manual proofreading burden. On the one hand, the simple descriptive format of the chunk information makes it convenient both to develop a high-quality chunking tool and to manually correct the automatically chunked sentences. On the other hand, the correct constituent boundary information annotated in the chunk bank reduces the number of ambiguous structures generated during syntactic parsing. Therefore, the efficiency of the parser and the precision of the parsed results can be improved, and the proofreader can focus on examining only the difficult ambiguous structures.
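A bracketed representation like the one in Figure 2(c) can be loaded with a small recursive-descent reader. This is a hypothetical sketch, not the authors' tool; the input string reproduces the structure of Figure 2(c), with bare POS tags standing in for the word/tag pairs that did not survive in this copy of the paper.

```python
def parse_brackets(s):
    """Parse a bracketed tree such as '[np [mp m q] n]' into a
    (label, child, ...) tuple; tokens outside brackets become leaves."""
    tokens = s.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "[":
            pos += 1
            label = tokens[pos]          # first token after '[' is the label
            pos += 1
            children = []
            while tokens[pos] != "]":
                children.append(node())
            pos += 1                     # consume the closing ']'
            return (label, *children)
        leaf = tokens[pos]
        pos += 1
        return leaf

    return node()

# The structure of Figure 2(c), tags only:
tree = parse_brackets("[zj [dj [np r n] [vp v [vp v [np [mp m q] n]]]] w]")
```

Reading the annotation this way makes it easy to count constituents or extract the PCFG rules that section 5 describes learning from the chunk bank.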

[Figure 3. Build the Chinese treebank through two-stage processing: input sentences, segmentation and POS tagging, chunk parsing, chunk bank, manual proofreading, syntactic parsing, treebank]

5. Current work

The construction of a large-scale Chinese treebank is a systematic project. It needs the cooperation of computational linguists, general linguists, knowledge engineers, and computer programmers. Since 1993, we have made some tentative explorations in treebank construction and have developed several useful tools:

1) A chunk analyzing tool. It is based on the following processing strategies and schemes: combining a rule-based finite-state constituent identifier with a statistics-based word boundary predictor, and using finite-state transducers for constituent group identification. Our chunker obtained good performance in experiments on automatic chunk identification in real Chinese texts: for constituent group identification the precision is 91%, and for word boundary prediction the precision is 92% [ZQ99].

2) Two tools for knowledge acquisition. The chunk bank provides a good foundation for grammatical knowledge acquisition. We used it to learn the following two types of knowledge for syntactic disambiguation: probabilistic context-free grammar (PCFG) knowledge, which can be used for overall disambiguation during parsing [ZH98], and structure preference relation (SPR) knowledge, which can be used for local disambiguation during parsing [ZH99].

3) A statistics-based parser. The input of the parser is a chunked sentence. It works in two processing stages: first, all possible syntactic trees are generated by applying the bracket matching principle [ZH97] to the chunked sentence; then the parse trees are disambiguated according to the automatically learned PCFG and SPR information, yielding the best parse tree.
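The PCFG-based disambiguation step can be illustrated with a toy example: score each candidate tree by the product of its rule probabilities and keep the argmax (the SPR knowledge is omitted here). The rule probabilities below are invented for illustration, not the paper's learned grammar, and the trees reuse the tag inventory of Figure 2.

```python
import math

# Toy PCFG: probabilities are invented for illustration only.
RULE_P = {
    ("dj", ("np", "vp")): 0.8,
    ("np", ("r", "n")): 0.4,
    ("np", ("mp", "n")): 0.3,
    ("np", ("m", "q", "n")): 0.1,   # flat alternative analysis
    ("mp", ("m", "q")): 0.9,
    ("vp", ("v", "np")): 0.5,
    ("vp", ("v", "vp")): 0.2,
}

def log_score(tree):
    """Sum log P(rule) over all rules in a tree written as (label, child, ...);
    a bare string is a POS-tag leaf and contributes nothing."""
    if isinstance(tree, str):
        return 0.0
    label, *children = tree
    rule = (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    return math.log(RULE_P[rule]) + sum(log_score(c) for c in children)

def best_parse(candidates):
    """Pick the candidate tree with the highest PCFG probability."""
    return max(candidates, key=log_score)

# Two candidate analyses of the Figure 2 sentence (tag yield: r n v v m q n):
t1 = ("dj", ("np", "r", "n"),
      ("vp", "v", ("vp", "v", ("np", ("mp", "m", "q"), "n"))))
t2 = ("dj", ("np", "r", "n"),
      ("vp", "v", ("vp", "v", ("np", "m", "q", "n"))))
best = best_parse([t1, t2])   # t1: 0.8*0.4*0.2*0.5*0.3*0.9 = 0.00864 beats t2
```

In the paper's setting the candidate set would come from bracket matching over the chunked sentence rather than being listed by hand.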
A performance analysis experiment on a test treebank of 5,573 Chinese sentences shows the following results: the labeled precision is 82.99% and the labeled recall is 83.14% [ZQ97].

By using the above tools to process the sample set, we hope to complete the correct chunk bank by the end of this year and to finish the construction of the correct treebank by June 2000.

References

[Bla91] E. Black et al. (1991). A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars. In Proceedings of the February 1991 DARPA Speech and Natural Language Workshop, 306-311.

[Col96] Michael John Collins (1996). A New Statistical Parser Based on Bigram Lexical Dependencies. In Proc. of ACL-34, 184-191.

[Mag95] David M. Magerman (1995). Statistical Decision-Tree Models for Parsing. In Proc. of ACL-95, 276-303.

[MSM93] Mitchell P. Marcus, Mary Ann

Marcinkiewicz, and Beatrice Santorini (1993). "Building a Large Annotated Corpus of English: The Penn Treebank". Computational Linguistics, 19(2), 313-330.

[Rat97] Adwait Ratnaparkhi (1997). A Linear Observed Time Statistical Parser Based on Maximum Entropy Models. In Claire Cardie and Ralph Weischedel (eds.), Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2), Somerset, New Jersey, ACL.

[ZH97] Zhou Qiang, Huang Chang-ning (1997). A Chinese Syntactic Parser Based on the Bracket Matching Principle. Communication of COLIPS, 7(2), #97008.

[ZH98] Zhou Qiang, Huang Chang-ning (1998). An Inference Approach for Chinese Probabilistic Context-Free Grammar. Chinese Journal of Computers, 21(5), 385-392.