Building a Chinese Treebank as a Test Suite for Chinese Parsers

Zhou Qiang, Sun Maosong
The State Key Laboratory of Intelligent Technology and Systems
Dept. of Computer Science and Technology, Tsinghua University, Beijing 100084, P. R. China
zhouq@s1000e.cs.tsinghua.edu.cn

Abstract

This paper introduces our current work to build a Chinese treebank that can be used as a test suite for Chinese parsers. The treebank will consist of 10,000 Chinese sentences extracted from a balanced Chinese corpus of about 2,000,000 Chinese characters. The corpus has already been annotated with correct segmentation and part-of-speech (POS) information. The following issues are discussed in the paper: a survey of the balanced corpus, the strategies and methods for sampling the treebank sentences, and the processing schemes and tools for treebank construction.

1. Introduction

Syntactic parsing is an important stage in natural language understanding. Two important issues in this area are how to develop new efficient and robust parsers and how to evaluate the performance of different parsers. A good test suite is therefore a prerequisite for parser development. As large-scale corpora and annotated materials steadily grow, many statistics-based English parsers have been developed, such as Magerman's statistical decision-tree parser[Mag95], Collins' bigram dependency model parser[Col96], and Ratnaparkhi's maximum entropy model parser[Rat97]. An interesting characteristic of these parsers is that they all used the same annotated corpus, the Wall Street Journal (WSJ) portion of the Penn Treebank[MSM93], to train their models, and the same performance measures, the PARSEVAL measures[Bla91], to evaluate them. Thus, the advantages and disadvantages of the different statistical models can easily be found through performance comparison. In this paper, we introduce our work to build a similar test suite for Chinese parsers, i.e.
a Chinese treebank with about 10,000 representative Chinese sentences extracted from a large-scale balanced Chinese corpus. The sentences are first preprocessed by an automatic chunker and a statistics-based parser, and then manually proofread, so as to obtain annotations with correct constituent boundary tags (in the chunk bank) and parse trees (in the treebank). In the following sections, section 2 gives an overview of the balanced corpus, section 3 discusses the sampling strategy and algorithm for the test suite, section 4 describes the treebank construction procedure, and section 5 introduces our current work on developing useful processing tools for treebank construction.

2. The balanced corpus

The balanced corpus was built according to the following principles:
1) Select contemporary written Chinese texts; most were published in the 1990s, a few in the 1980s.
2) Text selection gives priority to style, then to domain category. The four main styles are literature, news, academic writing, and practical writing.
3) Select complete articles, so as to keep the content of the texts coherent.

Table 1 shows the basic statistics of the corpus. All the texts were preprocessed by an automatic Chinese segmentation and POS tagging tool, and then manually corrected. To guarantee the consistency of manual proofreading, a detailed Chinese segmentation and POS tagging specification was developed. The POS tag set consists of 95 tags, which give meticulous descriptions of the different syntactic functions of Chinese words, especially the verbs. All this work provides good support for further syntactic parsing. Table 2 shows some statistics of the corpus after segmentation and POS tagging.

Table 1. Basic statistics of the balanced corpus

Style               Articles   Chinese characters   Ratio
Literature          295        880,057              44%
News                376        600,490              30%
Academy             29         402,623              20%
Practical writing   258        119,488              6%
Total               958        2,002,658            100%

Table 2. Statistics of the corpus after segmentation and POS tagging

Style               Punctuation   Words       Ratio
Literature          148,453       760,337     48%
News                86,163        438,095     28%
Academy             52,823        278,728     18%
Practical writing   28,727        91,929      6%
Total               316,116       1,569,089   100%

3. Sampling the treebank sentences

A good test suite should cover various language phenomena. Since our test suite is specially designed for Chinese parsers, it must contain Chinese sentences whose parsing complexities are distributed as evenly as possible. On the assumption that most Chinese grammatical phenomena appear in the sentences of the balanced corpus, our goal is to find simple measures of parsing complexity so as to select representative sentences from the corpus. Because the only annotated information available at present is the word segmentation and the POS tags, we select the following two simple features as measures of sentence parsing complexity:
- The number of different kinds of common verbs in a sentence (VNum).
- The length of the sentence, i.e. the number of Chinese words (including punctuation) in the sentence (SLen).

A basic assumption is that parsing complexity is in direct proportion to the VNum and SLen of a sentence. Based on this, a simple sentence sampling algorithm was developed as follows:
1) Extract every complete sentence (i.e.
the sentence ending with a period, question mark, or exclamation mark) from the source texts.
2) Build a common verb list by setting a frequency threshold to delete verbs with low frequency in the corpus.
3) Classify all the common verbs into 7 categories (Table 3) according to their POS tags.

Table 3. The common verb categories

Cat.   POS        Description
1      vgn        The verb takes a nominal object.
2      vgv, vga   The verb takes a predicative object.
3      vgs        The verb takes a sentential object.
4      vgd        The verb takes two nominal objects, i.e. a direct and an indirect object.
5      vgj        The verb takes a nominal object and a complement.
6      vgp        The verb acts as the direct modifier of a noun.
7      (other)    Other verbs.

4) Build the basic sentence sets by classifying all the complete sentences into 6 sets according to the number of different types of verbs in the sentences (VNum = 0, 1, 2, 3, 4, >4).
5) Apportion the total number N of required sentences among the sets, giving the i-th set a quota Ni according to the distribution of the sentence counts Si:

   Ni = N * ( Si / Σ_{j=1..6} Sj ),  i ∈ [1,6]
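Steps 4) and 5) amount to bucketing sentences by their verb-category count and then apportioning the sample size proportionally. A minimal sketch in Python, where the set sizes are purely hypothetical numbers chosen for illustration:

```python
def verb_set_index(vnum):
    """Map a sentence's VNum (number of distinct common-verb
    categories it contains) to one of the 6 basic sets:
    VNum = 0, 1, 2, 3, 4, >4."""
    return min(vnum, 5)

def allocate_quota(set_sizes, total):
    """Step 5: apportion the total sample size N across the sets
    in proportion to their sizes, Ni = N * (Si / sum_j Sj)."""
    s = sum(set_sizes)
    return [round(total * si / s) for si in set_sizes]

# Hypothetical set sizes Si for VNum = 0, 1, 2, 3, 4, >4
sizes = [4000, 3000, 2000, 1500, 1000, 500]
print(allocate_quota(sizes, 10000))  # → [3333, 2500, 1667, 1250, 833, 417]
```

Note that rounding each quota independently may make the quotas sum to slightly more or less than N; a real implementation would distribute the remainder.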
6) Sample the sentences in every set according to the distribution of sentence lengths within the set. If the total number of sentences with the same length (SLen = l) in the i-th set is Nil, then the sample number SNil for sentences of this length is:

   SNil = Ni * ( Nil / Σ_{k=1..Mi} Nik ),  i ∈ [1,6]

where Mi is the maximum sentence length in the i-th set.

After the above six-stage process, we obtained a sample set of about 10,000 sentences from the balanced corpus. Table 4 shows its basic statistics. The distribution of sentences with different lengths is shown in Figure 1. Due to the special sampling strategy, there are some sudden-change points in the figure.

Table 4. Basic statistics of the sample set

Sentence set   # of characters   # of words   # of sentences   Avg. sentence length
SLen < 20      117,287           73,898       7,057            10.47
SLen >= 20     265,226           166,676      5,047            33.03
Total          382,513           240,574      12,104           19.88

[Figure 1. Sentence length vs. sentence number in the test suite]

4. Building the treebank

The treebank building procedure is a two-stage process. At the first stage, the sentences are automatically assigned chunk information, including word boundary (ws) and constituent group (cg) tags[ZSh99], and then corrected by human annotators; in this way a correct chunk bank is constructed. At the second stage, the chunked sentences are parsed by a statistics-based Chinese parser[ZQ97] and then manually proofread. After that, the correct treebank can be built. Figure 3 shows an overview of the two-stage approach, and Figure 2 gives a detailed example.

(a) (my)/r (brother)/n (want)/v (buy)/v (two)/m (-classifier)/q (football)/n (period)/w [1]
    My brother wants to buy two footballs.
(b) {PR {MD /r /n] [ /v [ /v [ /m /q ] /n } /w }
(c) [zj [dj [np /r /n] [vp /v [vp /v [np [mp /m /q ] /n ]]]] /w ]

Figure 2. The annotation representations: (a) the segmented and POS-tagged sentence; (b) the chunked sentence; (c) the bracketed and labeled sentence.

[1] The POS and syntactic tags used in this sentence are briefly described as follows. POS tags: r: pronoun; n: noun; v: verb; m: numeral; q: classifier; w: punctuation (only the main category tags are used here). Syntactic tags: np: noun phrase; mp: numeral-classifier phrase; vp: verb phrase; dj: simple sentence pattern; zj: complete sentence.

The advantage of this two-stage approach lies in the great increase in overall parsing efficiency and the great decrease in the manual proofreading burden. On the one hand, the simple description format of the chunk information makes it convenient to develop a high-quality chunking tool and to manually correct the automatically chunked sentences. On the other hand, the correct constituent boundary information annotated in the chunk bank reduces the number of ambiguous structures generated during syntactic parsing. Therefore, the efficiency of the parser and the precision of the parsed results can be improved, and the proofreader can focus on examining only the difficult ambiguous structures.
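The bracketed representation in (c) is straightforward to process mechanically. As a sketch in Python (with the Chinese word forms omitted, as in the example above), a small recursive reader turns such a string into nested (label, children) tuples:

```python
def parse_bracketed(s):
    """Parse a bracketed parse-tree string such as
    '[zj [dj [np r n] [vp v [np [mp m q] n]]] w]'
    into nested (label, children) tuples; bare tokens
    (POS tags here) become leaf strings."""
    tokens = s.replace('[', ' [ ').replace(']', ' ] ').split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == '[':
            pos += 1                       # consume '['
            label = tokens[pos]; pos += 1  # constituent label
            children = []
            while tokens[pos] != ']':
                children.append(node())
            pos += 1                       # consume ']'
            return (label, children)
        leaf = tokens[pos]; pos += 1
        return leaf

    return node()

tree = parse_bracketed('[zj [dj [np r n] [vp v [np [mp m q] n]]] w]')
print(tree[0])  # → 'zj' (the complete-sentence label at the root)
```

The same reader works for any of the syntactic tags defined in the footnote (np, mp, vp, dj, zj), since labels are not interpreted, only structure.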
[Figure 3. Building the Chinese treebank through two-stage processing: input sentences → segmentation and POS tagging → chunk parsing → manual proofreading (chunk bank) → syntactic parsing → manual proofreading (treebank)]

5. Current work

The construction of a large-scale Chinese treebank is a systematic project. It needs the cooperation of computational linguists, general linguists, knowledge engineers, and computer programmers. Since 1993, we have made some tentative explorations in treebank construction and have developed several useful tools:

1) A chunk analyzing tool. It is based on the following processing strategies: (a) combining a rule-based finite-state constituent identifier with a statistics-based word boundary predictor, and (b) using finite-state transducers for constituent group identification. Our chunker achieved good performance in experiments on automatic chunk identification in real Chinese texts: the precision of constituent group identification is 91%, and the precision of word boundary prediction is 92%[ZQ99].

2) Two tools for knowledge acquisition. The chunk bank provides a good foundation for grammatical knowledge acquisition. We used it to learn the following two types of knowledge for syntactic disambiguation:
- Probabilistic context-free grammar (PCFG) knowledge, which can be used for overall disambiguation during parsing[ZH98].
- Structure preference relation (SPR) knowledge, which can be used for local disambiguation during parsing[ZH99].

3) A statistics-based parser. The input of the parser is a chunked sentence. The parser works in two stages: it first generates all possible syntactic trees by applying the bracket matching principle[ZH97] to the chunked sentence, and then disambiguates the parse trees according to the automatically learned PCFG and SPR information, so that the best parse tree can be obtained.
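The PCFG part of the disambiguation step can be illustrated with a toy sketch: each candidate tree is scored by the product of the probabilities of the rules it uses, and the highest-scoring tree is kept. The rule probabilities and the two competing analyses below are purely hypothetical, chosen only to show the mechanism:

```python
import math

def tree_log_prob(tree, rule_probs):
    """Score a (label, children) tree as the sum of log rule
    probabilities -- the PCFG criterion for overall disambiguation.
    Leaves (POS tags) are plain strings and contribute nothing."""
    if isinstance(tree, str):
        return 0.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return math.log(rule_probs[(label, rhs)]) + sum(
        tree_log_prob(c, rule_probs) for c in children)

def best_parse(candidates, rule_probs):
    """Choose the candidate tree with the highest PCFG probability."""
    return max(candidates, key=lambda t: tree_log_prob(t, rule_probs))

# Hypothetical rule probabilities for two competing analyses of
# a numeral-classifier-noun phrase
rules = {('np', ('m', 'q', 'n')): 0.2,   # flat analysis
         ('mp', ('m', 'q')): 0.9,
         ('np', ('mp', 'n')): 0.4,       # nested mp analysis
         ('vp', ('v', 'np')): 0.5}
flat = ('vp', ['v', ('np', ['m', 'q', 'n'])])
nested = ('vp', ['v', ('np', [('mp', ['m', 'q']), 'n'])])
print(best_parse([flat, nested], rules))  # prefers the nested 'mp' analysis
```

The SPR knowledge used for local disambiguation is not modeled here; in this sketch it would enter as an additional score term alongside the rule log probabilities.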
The performance analysis experiment on a test treebank of 5,573 Chinese sentences shows the following results: the labeled precision is 82.99% and the labeled recall is 83.14%[ZQ97]. By using the above tools to process the sample set, we hope to complete the correct chunk bank by the end of this year, and to finish the construction of the correct treebank by June 2000.

References

[Bla91] E. Black et al. (1991). A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars. In Proceedings of the February 1991 DARPA Speech and Natural Language Workshop, 306-311.
[Col96] Michael John Collins (1996). A New Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of ACL-34, 184-191.
[Mag95] David M. Magerman (1995). Statistical Decision-Tree Models for Parsing. In Proceedings of ACL-95, 276-303.
[MSM93] Mitchell P. Marcus, Mary Ann
Marcinkiewicz, and Beatrice Santorini (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.
[Rat97] Adwait Ratnaparkhi (1997). A Linear Observed Time Statistical Parser Based on Maximum Entropy Models. In Claire Cardie and Ralph Weischedel (eds.), Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2), Somerset, New Jersey, ACL.
[ZH97] Zhou Qiang and Huang Chang-ning (1997). A Chinese Syntactic Parser Based on the Bracket Matching Principle. Communications of COLIPS, 7(2), #97008.
[ZH98] Zhou Qiang and Huang Chang-ning (1998). An Inference Approach for Chinese Probabilistic Context-Free Grammar. Chinese Journal of Computers, 21(5), 385-392.