Chinese Word Segmentation Accuracy and Its Effects on Information Retrieval

Foo, S., and Li, H. (2002). TEXT Technology.

Schubert Foo* and Hui Li
Division of Information Studies, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798
Email: assfoo@ntu.edu.sg (*)

Abstract

In Chinese information retrieval (IR), word segmentation is an essential prerequisite process that breaks documents down into smaller linguistic units, or word segments, so that they can be indexed for subsequent retrieval. Despite the host of Chinese information systems in existence today, little work has been done to study word segmentation accuracy and its effect on IR. This article describes a set of experiments conducted in the Division of Information Studies, Nanyang Technological University, Singapore, to explore this issue. Four types of automatic character-based segmentation approaches, as well as manual segmentation, were used to index a test corpus, resulting in five different indices for use in the IR experiments. The segmentation accuracy of each approach was obtained by comparing the automatic segmentation results with the manual segmentation results. A set of IR experiments provided a measure of IR effectiveness using the traditional measures of data recall and data precision. Statistical analysis was applied to explore the correlation between segmentation accuracy and IR effectiveness. The analysis suggests that the word segmentation approach does have an effect on IR results. In particular, an approach that recognises a higher number of correct words of two or more characters can produce better precision and recall. On the other hand, the existence of ambiguous words resulting from the word segmentation process adversely affects precision.

1.
Introduction

Research interest in Chinese IR systems has increased as a result of the growing need for knowledge transfer and global information exchange. Typically, an IR system determines the relevant documents according to the frequency of occurrence of the words of a query within the documents and corpus (Nie). For English and other western languages, the identification of distinct words in documents is trivial. However, this is much more difficult for the Chinese language, as well as for other Asian languages such as
Japanese or Korean. This is because Chinese text appears as a string of ideographic characters without any obvious boundary between words, except for punctuation at the end of each sentence and occasional commas within sentences. Thus, Chinese text information processing requires an essential word segmentation process to break the text up into smaller linguistic units or segments. These segments are subsequently indexed for retrieval operations. Numerous segmentation approaches have been proposed for Chinese IR, but they can basically be divided into character-based and word-based approaches. Within these two basic groups there are many alternatives, such as single-character or multiple-character segmentation, the use of dictionaries or statistics, or the introduction of linguistic knowledge. However, in previously reported Chinese IR studies, the only consistent finding has been that IR results using single-character indexing are significantly worse than those using other segmentation techniques (Tong). No firm conclusions on the relative performance of multi-character and word-based approaches have so far been reported. Some researchers obtained better results using bigram (two-character) methods, while others obtained better results using word-based approaches (Wilkinson). Consequently, some researchers believe that a better segmentation approach will yield superior IR results (Nie), while others have found no direct relationship between the segmentation approach and IR results in their experiments (Kwok). It is also evident that no systematic study has been carried out to investigate this relationship. This research aims to address this gap by investigating the relationship between word segmentation accuracy and IR effectiveness.
This paper outlines a set of IR experiments that index the document corpus using manual segmentation as well as four automatic segmentation approaches. The experiments use the same manually segmented queries for all of these approaches. The performance of segmentation is evaluated by accuracy measures, while IR effectiveness is evaluated by the standard measures of data recall and data precision. A statistical regression model is then used to analyse the correlation between the segmentation accuracy measures and the IR results. The following sections provide details of each of these phases of work.

2. Experiments and results of segmentation and IR

2.1. Test Corpus

The whole month of news from March 1998 in the economics section of the online Chinese People's Daily newspaper was selected as the test corpus. A total of 266 files, containing 182,292 Chinese characters and occupying 532 KB (kilobytes) of storage, were used for the study. Each file varies from 1 KB to 13 KB in size. The same corpus was used for both the word segmentation and the IR experiments.
2.2. Manual and Automatic Segmentation

As mentioned previously, the first step in Chinese IR is to carry out a word segmentation process to divide the texts into linguistic units. A study that estimated the number of words according to the total number of existing Chinese characters revealed approximately 5% one-character words, 75% two-character words and 14% three-character words, with the remaining 6% comprising words of four or more characters (Wu). Given these statistics, it is not surprising that many existing segmentation approaches used for Chinese IR are based on the bigram (two-character) approach or variations of it; likewise, it has been chosen as the basis of this research. In order to evaluate the accuracy of a segmentation approach, a manual segmentation of the text must first be carried out. Manual segmentation, as the name implies, is concerned with manually identifying and segmenting the linguistic units (i.e. words) in the text. It is treated as an ideal segmentation and used as the baseline against which the automatic approaches carried out by computer are compared. However, due to the subjective nature of manual segmentation, different results can arise when the same document is segmented by different people. This has a direct implication for the evaluation of segmentation and the computation of accuracy results. To keep the manual segmentation as correct and consistent as possible, and to ensure the comparability of results, a proposed set of rules and guidelines for modern Chinese segmentation for information processing (Liu) has been used as the basis for manual segmentation in these experiments. These rules have been submitted for consideration as a national Chinese standard for use in Chinese information processing.
In the automatic segmentation, both English words and other non-Chinese strings, including Arabic numbers, within the Chinese text are treated as distinct segment units so that they are not broken into bigrams. This treatment is consistent with the rules of segmentation (Liu) and is widely adopted by other researchers (Mateev). The four segmentation approaches used in this research are all based on the basic bigram method and include:

- Pure bigram method (abbreviated as pbi), which segments a typical sentence ABCDEF into AB, CD and EF;
- Overlapping bigram method (abbreviated as ovlap), which segments ABCDEF into AB, BC, CD, DE and EF;
- Pure bigram method combined with a 1-character word list (abbreviated as pstop), which segments ABCDEF into AB, C, DE and F if C is in the 1-character word list;
- Overlapping bigram method combined with a 1-character word list (abbreviated as ovstop), which segments ABCDEF into AB, C, DE and EF if C is in the 1-character word list.

The 1-character word list used in the segmentation experiments was constructed according to the statistics of 10 samples from the whole corpus. This is based on the observation that many 1-character words can serve as natural word boundaries in Chinese text. The aim of this list is to break a long Chinese sentence into shorter ones, thereby enabling a larger number of meaningful 2-character words to be identified from these new shorter sentences. A process to generate such a word list was proposed and verified earlier (Foo). In that paper, it was shown that the use of this 1-character word list led to superior segmentation results using the same test corpus as these IR experiments. At the same time, this 1-character word list is also used as a stoplist in the IR experiments. By using the above four approaches to automatically segment all 266 files and comparing the results with the manual segmentation, it becomes possible to obtain a series of segmentation counts and a set of different accuracy measures, as shown in Tables 1 and 2 respectively. The computations are based on individual sentences in conjunction with a word-by-word comparison of the automatic and manual results. In calculating the numbers of words and accuracy in these tables, the words in the 1-character word list that are used as stopwords are not included, since they are not indexed in the experiments. The manual results are also shown for comparison. In the automatic segmentation calculations, the measures CNwM, APCNwM and MPCNwM include only 2-character words.

Table 1.
Segment counts of manual and various automatic approaches

              CNw     CNw1    CNwM    AMB2    AN      MN
(1) pbi       38193   3931    34262   10238   96700   90490
(2) ovlap     56444   1382    55062   17369   162160  90490
(3) pstop     46270   8286    37984   9646    89486   90490
(4) ovstop    55301   4694    50607   15872   131874  90490
(5) manual    90490   27583   62907   0       90490   90490

Legend:
CNw: the total number of correct segments
CNw1: the total number of correct 1-character segments
CNwM: the total number of correct segments of 2 or more characters
AMB2: the number of ambiguous 2-character segments
AN: the total number of segments in the automatic segmentation results
MN: the total number of segments in the manual segmentation results (the same value, 90490, applies to every comparison)
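The four bigram variants described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation, and it uses ASCII letters as stand-ins for Chinese characters:

```python
def pure_bigram(sentence):
    """pbi: non-overlapping bigrams -- ABCDEF -> AB, CD, EF."""
    return [sentence[i:i + 2] for i in range(0, len(sentence), 2)]

def overlapping_bigram(sentence):
    """ovlap: overlapping bigrams -- ABCDEF -> AB, BC, CD, DE, EF."""
    if len(sentence) < 2:
        return [sentence] if sentence else []
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]

def with_stoplist(sentence, stoplist, bigram_fn):
    """pstop/ovstop: split the sentence at 1-character stop words,
    emit each stop word as its own segment, and apply the chosen
    bigram method to the runs of text in between."""
    segments, run = [], []
    for ch in sentence:
        if ch in stoplist:
            segments.extend(bigram_fn(''.join(run)))
            segments.append(ch)
            run = []
        else:
            run.append(ch)
    segments.extend(bigram_fn(''.join(run)))
    return segments
```

With C as a 1-character stop word, `with_stoplist("ABCDEF", {"C"}, pure_bigram)` gives AB, C, DE, F and `with_stoplist("ABCDEF", {"C"}, overlapping_bigram)` gives AB, C, DE, EF, reproducing the pstop and ovstop examples above. (In the IR experiments the stop words themselves are not indexed.)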
Table 2. Segmentation accuracy measures of various approaches

              WSA1      APCNw1    APCNwM    APAMB2    WSA2      MPCNw1    MPCNwM
(1) pbi       0.394964  0.040651  0.354312  0.105874  0.422069  0.142515  0.544645
(2) ovlap     0.348076  0.008522  0.339554  0.107110  0.623760  0.050103  0.875292
(3) pstop     0.517064  0.092595  0.424469  0.107793  0.511327  0.300402  0.603812
(4) ovstop    0.419347  0.035595  0.383753  0.120357  0.611128  0.170177  0.804473
(5) manual    1         0.304818  0.695182  0         1         1         1

Segmentation accuracy measures:
WSA1 = CNw/AN
APCNw1 = CNw1/AN
APCNwM = CNwM/AN
APAMB2 = AMB2/AN
WSA2 = CNw/MN
MPCNw1 = CNw1/CNw1(manual)
MPCNwM = CNwM/CNwM(manual)
* For the manual calculation, AN is the same as MN.

In these tables, an ambiguous segment is a word that is incorrect in the context of the sentence but is nonetheless a legitimate word found in a Chinese language dictionary. For example, xiang zhen qi ye (township enterprises) is a segment generated by manual segmentation. Segmenting it using the bigram approach yields xiang zhen (town) / qi ye (enterprises), both legitimate words in other sentences but, in this context, two ambiguous words. This is an example of the additional difficulty posed by Chinese information retrieval in contrast to its English counterpart. It is evident that there are many different ways to compute segmentation accuracy. These measures are primarily based on the total number of correct segments, the total number of correct 1-character segments, the total number of correct segments of 2 or more characters, and the total number of ambiguous segments, taken against the results of manual and automatic segmentation. They are identified by their abbreviations and corresponding formulae in Table 2. It should be noted that WSA1 is the sum of APCNw1 and APCNwM, while WSA2 is not the sum of MPCNw1 and MPCNwM.

2.3.
Information retrieval experiment

In order to formulate a set of queries for the IR experiment, four graduate students from Nanyang Technological University, Singapore, who were proficient in the Chinese language and familiar with the economics subject area, were asked to submit topics about economic news that interested them. The initial 41 submitted queries were reduced to 20 by removing all similar or redundant queries, as well as those with few matching documents. These 20 queries form the experimental query set that was given to six assessors. On average, each query contains eight to nine Chinese characters. Of the six assessors, four were the same people who provided the initial queries. Each assessor chose between
two and five queries from the query set and made relevance decisions on the chosen queries. As the size of the document collection is relatively small, it was possible to obtain relevance judgments for all documents with respect to each query. Each assessor was asked to read each document in turn. After each reading, the assessor went through each query on the list. If the document was deemed relevant to a query, the corresponding unique Document ID was noted alongside that query. This was repeated for the remaining queries, and each assessor went through this entire process for each of the documents in turn. The Chinese IR system used for the experiments (Lim) was modified from an existing English IR system known as the mg system (Lane). This system is currently housed in the Division of Information Studies Laboratory to support the Chinese language, multilingual and digital library research activities of the Division. The five different indices created from the four automatic segmentation approaches and the manual segmentation were stored in separate directories. An optional parameter in the mg command line is used to activate the desired index file in the query and retrieval process. In order to explore the relationship between segmentation and its effect on IR, the five different segmentation approaches were tested while keeping the same query segments for all approaches. In these experiments, the manual segmentation results were used as the baseline, representing the ideal segmentation and index. The queries were first segmented manually, and stopwords were excluded. Each query was submitted to the IR system to retrieve a set of ranked results. The ranking is derived by computing a similarity coefficient for each document-query pair. Data recall and precision were used to measure and evaluate the effectiveness of the IR experiments using the different segmentation approaches for the index.
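The paper does not state the exact similarity coefficient computed by the mg system, but the ranked-retrieval step can be illustrated with a standard tf-idf cosine similarity over word segments. The weighting scheme below is an assumption for illustration, not mg's actual formula:

```python
import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms, doc_freq, n_docs):
    """Illustrative tf-idf cosine similarity between a segmented query
    and a segmented document (both given as lists of word segments).
    doc_freq maps a term to the number of documents containing it."""
    def weights(terms):
        tf = Counter(terms)
        # tf * idf; doc_freq.get(t, 1) guards against unseen terms
        return {t: tf[t] * math.log(1 + n_docs / doc_freq.get(t, 1))
                for t in tf}
    q, d = weights(query_terms), weights(doc_terms)
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    norm = (math.sqrt(sum(w * w for w in q.values())) *
            math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0
```

Documents are then sorted by this coefficient in descending order to produce the ranked result list for a query.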
[Figure 1. Recall-precision graph comparing the various indexing approaches: recall (x-axis, 0 to 1) against precision (y-axis, 0 to 1) for pbi, pstop, ovlap, ovstop and manual.]
Figure 1 shows the data recall-precision graph for the different segmentation approaches. The performance of manual segmentation appears superior to that of the automatic segmentation approaches. Ovlap and ovstop do not perform as well as manual segmentation, but perform better than pbi and pstop. It is also apparent that the differences within the ovstop and ovlap pair, and within the pstop and pbi pair, are marginal. The average recall and precision results of the various approaches are shown in Table 3. The precision value quoted in the table is the average precision (non-interpolated) over all relevant documents: the precision is calculated after each relevant document is retrieved, all such precision values are averaged to give a single number for each query, and these per-query values are then averaged over all queries (Harman). The recall is obtained by taking the arithmetic mean over the total number of queries (Salton). The calculation of precision and recall is consistent with the methods used in the TREC experiments (Harman).

Table 3. Average precision and recall values from the IR experiments

Segmentation approach   Precision   Recall
(1) pbi                 0.4451      0.8084
(2) ovlap               0.5200      0.8832
(3) pstop               0.4698      0.8318
(4) ovstop              0.5115      0.8832
(5) manual              0.5784      0.9299

3. Analysis of the correlation between segmentation accuracy and IR effectiveness

The analysis is concerned with identifying factors (if any) that correlate the segmentation accuracy measures with IR effectiveness. This is accomplished through the use of statistical regression to examine the relationship of data precision and recall with the various segmentation accuracy measures. In the analysis, data precision and recall are taken as the dependent variables Y1 and Y2 respectively. The segmentation accuracy measures WSA1, WSA2, APCNw1, APCNwM, MPCNw1, MPCNwM and APAMB2 are taken as the independent variables.
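The averaging procedure described above can be made concrete with a short sketch. Following the TREC convention, the per-query average precision here divides by the total number of relevant documents for the query; this is an assumption where the text leaves the denominator implicit:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision: record the precision at the
    rank of each retrieved relevant document, then average over the
    total number of relevant documents for the query."""
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_over_queries(per_query_values):
    """Average per-query scores (average precision or recall)
    over all queries, as done for Table 3."""
    return sum(per_query_values) / len(per_query_values)
```

For example, if the documents relevant to a query are retrieved at ranks 1 and 3, the average precision is (1/1 + 2/3) / 2 ≈ 0.8333.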
We are interested in knowing whether these variables have an influence on data precision and recall. There are basically two models of regression analysis, namely bivariate regression and multiple regression. Since the definitions and calculations of the segmentation accuracy measures are related to one or more of the other independent variables, simple bivariate analysis is the more appropriate way to explore which variables of the segmentation results affect precision and recall (Berry). The general bivariate regression model is a linear equation of the form Y = a + bX + e, where Y is the dependent variable (precision or recall), X is the independent variable
(the various segmentation accuracy measures), a is the intercept, b is the slope that indicates the average change in Y associated with a unit change in X, and e is the error measure of the regression fit. Using Microsoft Excel 97 as the statistical tool, a bivariate regression of the precision (Y1) and recall (Y2) on each segmentation measure (columns 1 to 7 of Table 2) yielded a set of 14 equations. The equations recognised as statistically significant in affecting the precision and recall values are summarised in Table 4.

Table 4. Statistically significant accuracy measures that affect data precision and recall

Dependent        Independent    Intercept   t-value   Slope     t-value   Coefficient of        Standard error
variable (Y)     variable (X)   a           for a     b         for b     determination, R2     of estimate, Se
Precision (Y1)   WSA2           0.3630      15.6      0.2240    6.37      0.93                  0.015
                 MPCNwM         0.3020      12.63     0.2651    8.69      0.96                  0.012
                 APAMB2         0.5741      15.03     -0.7833   -2.03     0.58                  0.038
Recall (Y2)      WSA2           0.7395      24.12     0.2016    4.36      0.86                  0.020
                 MPCNwM         0.6763      37.74     0.2495    10.92     0.98                  0.009

In Table 4, R2 is the coefficient of determination, a summary statistic that measures how well the regression equation fits the data. R2 varies between 0 and 1; as R2 approaches 1, it indicates an increasing ability of the independent variable to explain Y. Se is the standard error of estimate for Y, which gives an indication of the error of the regression fit (so that if R2 = 1, then Se = 0). Also, if the absolute value of the t-ratio (or t-statistic) exceeds 2, we can conclude that the results are statistically significant at the 0.05 level. The regression results show that MPCNwM and WSA2 are the two segmentation accuracy measures that affect both precision and recall performance. Of the two, MPCNwM is the more critical, since it exhibits higher R2 values and t-ratios for b. Additionally, the regression results show that ambiguous words have a significantly adverse influence on precision but less so on recall (for which t = -1.6; not shown in Table 4).
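As a consistency check, the first row of Table 4 can be reproduced from the published data with an ordinary least-squares fit of the precision values (Table 3) on the WSA2 values (Table 2):

```python
def bivariate_regression(x, y):
    """Ordinary least-squares fit of Y = a + bX; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) /
         sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# WSA2 (Table 2) and average precision (Table 3) for the five approaches,
# in the order pbi, ovlap, pstop, ovstop, manual
wsa2 = [0.422069, 0.623760, 0.511327, 0.611128, 1.0]
precision = [0.4451, 0.5200, 0.4698, 0.5115, 0.5784]
a, b = bivariate_regression(wsa2, precision)
# a ≈ 0.3630 and b ≈ 0.2240, matching the WSA2 row for precision in Table 4
```

The same function applied to the recall values of Table 3 recovers the corresponding recall equations.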
These regression results imply that a segmentation approach that correctly recognises a higher number of the multi-character words found in the manual segmentation results can yield not only better precision but better recall as well. However, a segmentation approach that exhibits a high percentage of multi-character words in the automatic segmentation result (APCNwM) but a low absolute number of correct multi-character words (and hence a lower MPCNwM) will not necessarily yield better precision and recall. It is possible to explain why the same factor affects both precision and recall from the process of query matching. With matched queries, documents with higher similarity will be ranked before those with lower similarity; as a result, precision may be improved. Without matched queries, no related documents will be retrieved, so that recall is adversely affected. Since the queries are manually segmented in this research, the
similarity of query and document becomes more dependent on how correctly the document has been segmented. More accurate segmentation not only increases the similarity between query and document, thereby improving precision, but also increases a relevant document's chance of being retrieved, thereby improving recall. Table 5 uses Query 7 of the experiment, which has a total of 17 relevant documents, to show how precision and recall change under different indexing approaches. For each approach, the first column is the Doc_ID of each retrieved relevant document, the second column is the total number of documents retrieved at the point that relevant document appears, and the third and fourth columns give the corresponding recall and precision values.

Table 5. Comparison of IR results of manual and automatic approaches (Query 7)

Manual:
Doc_ID  #Doc retrieved  Recall  Precision
46      1               0.0588  1.0000
234     2               0.1176  1.0000
119     3               0.1765  1.0000
104     4               0.2353  1.0000
118     5               0.2941  1.0000
9       6               0.3529  1.0000
182     7               0.4118  1.0000
122     8               0.4706  1.0000
7       9               0.5294  1.0000
120     10              0.5882  1.0000
44      11              0.6471  1.0000
185     14              0.7059  0.8571
204     15              0.7647  0.8667
47      16              0.8235  0.8750
157     17              0.8824  0.8824
180     21              0.9412  0.7619
85      22              1.0000  0.7727

Pstop:
Doc_ID  #Doc retrieved  Recall  Precision
46      1               0.0588  1.0000
234     3               0.1176  0.6667
118     4               0.1765  0.7500
119     5               0.2353  0.8000
182     7               0.2941  0.7143
9       8               0.3529  0.7500
204     11              0.4118  0.6364
104     15              0.4706  0.5333
120     17              0.5294  0.5294
44      23              0.5882  0.4348
85      31              0.6471  0.3548
185     33              0.7059  0.3636

Ovstop:
Doc_ID  #Doc retrieved  Recall  Precision
46      1               0.0588  1.0000
234     3               0.1176  0.6667
119     4               0.1765  0.7500
104     5               0.2353  0.8000
182     6               0.2941  0.8333
118     7               0.3529  0.8571
9       9               0.4118  0.7778
204     17              0.4706  0.4706
85      20              0.5294  0.4500
120     21              0.5882  0.4762
185     23              0.6471  0.4783
44      25              0.7059  0.4800
7       27              0.7647  0.4815

As multi-character words are more meaningful in expressing linguistic concepts in Chinese documents, their correct recognition would have more significant
effects on the results. Although the automatic approaches reported here are based only on 2-character segmentation, they can recognise the majority of the multi-character words found in the manual segmentation, since 2-character words are the most frequently used words in Chinese documents. This proportion ranges from 54.5% to 87.5% in these experiments. Thus, it is reasonable to conclude that a segmentation approach that recognises a higher number of correct 2-character words and longer words of more than 2 characters will yield better IR results, while the correct recognition of 1-character words has less effect on IR results.

4. Conclusion

A set of IR experiments based on five different approaches, comprising the manual approach and four automatic bigram segmentation approaches, has been conducted to study the relationship of the different segmentation accuracy measures
to IR effectiveness via the measures of data precision and recall. A regression analysis of the results reveals two statistically significant relationships: (1) Precision and recall are largely dependent upon the recognition of correct 2-character and longer words in the manual segmentation results; the approach that recognises the higher number of such words will yield higher precision and recall. (2) The existence of ambiguous words adversely affects precision. As part of future work, we intend to conduct these experiments using larger corpora and in other subject domains. Other segmentation approaches, such as dictionary-based approaches, can also be incorporated into the model to enable a more elaborate study to be carried out. With the projected increase in the number of Chinese IR systems in future, such studies will prove useful in identifying the most appropriate choice of segmentation approaches for IR applications.

References

Berry, W.D., and Feldman, S. (1985). Multiple regression in practice. Newbury Park: SAGE Publications.

Foo, S., and Li, H. (1998). An integrated bigram approach with single-character word list for Chinese word segmentation. TEXT Technology, 8(4), 17-28.

Harman, D. (1993). Overview of the Second Text REtrieval Conference. http://trec.nist.gov/pubs/trec6/t6_proceedings.html, Maryland. (Last visited: 3/03/2001)

Kwok, K.L. (1997). Comparing representations in Chinese information retrieval. http://ir.cs.qc.edu/#publi_. (Last visited: 3/03/2001)

Lane, D. (1997). MG pages. http://www.mds.rmit.edu.au/mg. (Last visited: 3/03/2001)

Lim, H.K. (1999). Chinese text retrieval system. Master of Applied Science (M.A.Sc.) dissertation: School of Applied Science, Nanyang Technological University, Singapore.

Liu, Y. (1994). The rules of modern Chinese segmentation for the purpose of information processing and approaches of automatic Chinese segmentation (in Chinese). Beijing: Tsinghua University Press, 36-63.

Mateev, B. et al. (1998).
ETH TREC-6: routing, Chinese, cross-language and spoken document retrieval. http://trec.nist.gov/pubs/trec6/t6_proceedings.html, Maryland. (Last visited: 3/03/2001)

Nie, J.Y., Brisebois, M., and Ren, X.B. (1996). On Chinese text retrieval. Proceedings of SIGIR '96, Zurich, Switzerland, 225-233.

Salton, G., and McGill, M.J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill Book Company.

Tong, X., Zhai, C.X., Milic-Frayling, N., and Evans, D.A. (1996). Experiments on Chinese text indexing: CLARIT TREC-5 Chinese track report. http://trec.nist.gov/pubs/trec5/t5_proceedings.html, Maryland. (Last visited: 3/03/2001)
Wilkinson, R. (1998). Chinese document retrieval at TREC-6. http://trec.nist.gov/pubs/trec6/t6_proceedings.html, Maryland. (Last visited: 3/03/2001)

Wu, Z.M., and Tseng, G. (1993). Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science, 44(9), 532-542.