Chinese Word Segmentation Accuracy and Its Effects on Information Retrieval


Schubert Foo* and Hui Li
Division of Information Studies, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798
Email: assfoo@ntu.edu.sg (*)

Abstract

In Chinese information retrieval (IR), word segmentation is an essential prerequisite process that breaks documents down into smaller linguistic units or word segments so that they can be indexed for subsequent retrieval. Despite the host of Chinese information systems in existence today, little work has been done to study word segmentation accuracy and its effect on IR. This article describes a set of experiments conducted in the Division of Information Studies, Nanyang Technological University, Singapore, to explore this issue. Four types of automatic character-based segmentation approaches, as well as a manual segmentation, were used to index a test corpus, resulting in five different indices for use in the IR experiments. The segmentation accuracy of each approach was obtained by comparing the automatic segmentation results with the manual segmentation results. A set of IR experiments provided a measure of IR effectiveness using the traditional measures of data recall and data precision. Statistical analysis was applied to explore the correlation between segmentation accuracy and IR effectiveness. The analysis suggests that the word segmentation approach does have an effect on IR results. In particular, the approach that recognises a higher number of correct words of two or more characters produces better precision and recall. On the other hand, the presence of ambiguous words resulting from the word segmentation process adversely affects precision.

1. Introduction

Research interest in Chinese IR systems has increased as a result of the growing need for knowledge transfer and global information exchange. Typically, an IR system determines the relevant documents according to the frequency of occurrence of the words of a query within the documents and corpus (Nie). For English and other Western languages, the identification of distinct words in the documents is trivial. However, this is much more difficult for the Chinese language, as well as for other Asian languages such as Japanese and Korean.

This is because Chinese text appears as a string of ideographic characters without any obvious boundary between words, except for punctuation at the end of each sentence and occasional commas within sentences. Thus, Chinese text information processing requires an essential word segmentation process to break the text up into smaller linguistic units or segments. These segments are subsequently indexed for retrieval operations.

Numerous segmentation approaches have been proposed for Chinese IR. These approaches can basically be divided into character-based and word-based approaches. Within these two basic groups there are many alternatives, such as single-character or multiple-character segmentation, the use of a dictionary or of statistics, or the introduction of linguistic knowledge for segmentation. However, in previously reported Chinese IR studies, the only consistent finding is that IR results using single-character indexing are significantly worse than those using other segmentation techniques (Tong). No firm conclusions on the performance of multi-character approaches and word-based approaches have so far been reported. Some researchers obtained better results using bigram (two-character) methods, while others obtained better results using word-based approaches (Wilkinson). Consequently, some researchers believe that a better segmentation approach will yield superior IR results (Nie), while others have found no direct relationship between the segmentation approach and IR results in their experiments (Kwok). It is also evident that no systematic study has been carried out to investigate this relationship.

This research aims to address this gap by investigating the relationship between word segmentation accuracy and IR effectiveness. This paper outlines a set of IR experiments that index the document corpus using manual segmentation as well as four automatic segmentation approaches. The experiments use the same manually segmented queries for all of these approaches. The performance of segmentation is evaluated by accuracy measures, while IR effectiveness is evaluated by the standard measures of data recall and data precision. A statistical regression model was then used to analyse the correlation between the segmentation accuracy measures and the IR results. The following sections provide details of each of these phases of work.

2. Experiments and results of segmentation and IR

2.1. Test Corpus

The economics section of the online Chinese People's Daily newspaper for the whole month of March 1998 was selected as the test corpus. A total of 266 files containing 182,292 Chinese characters and occupying 532 KB (kilobytes) of storage were used for the study. Each file varies between 1 KB and 13 KB in size. The same corpus was used for both the word segmentation and IR experiments.

2.2. Manual and Automatic Segmentation

As mentioned previously, the first step in Chinese IR is to carry out a word segmentation process to divide the texts into linguistic units. A study that estimated the distribution of word lengths relative to the total number of existing Chinese characters revealed approximately 5% one-character words, 75% two-character words and 14% three-character words, with the remaining 6% comprising words of four or more characters (Wu). Given this statistic, it is not surprising that many existing segmentation approaches used for Chinese IR are based on the bigram (two-character) approach, or on variations of it. This has likewise been chosen as the basis of this research.

In order to evaluate the accuracy of a segmentation approach, a manual segmentation of the text must first be carried out. Manual segmentation, as the name implies, is concerned with manually identifying and segmenting the linguistic units (i.e. words) in the text. Manual segmentation is treated as the ideal segmentation and used as the baseline against which the automatic approaches carried out by computer are compared. However, due to the subjective nature of manual segmentation, different results can arise when the same document is segmented by different people. This obviously has a direct implication for the evaluation of segmentation and the computation of accuracy results. To keep the manual segmentation as correct and consistent as possible, and to ensure the comparability of results, a proposed set of rules and guidelines for modern Chinese segmentation for information processing (Liu) has been used as the basis for manual segmentation in these experiments. These rules have been submitted for consideration as a national Chinese standard for use in Chinese information processing.

In the automatic segmentation, both English words and other non-Chinese strings, including Arabic numerals, within the Chinese text are treated as distinct segment units so that they are not broken into bigrams. This treatment is consistent with the rules of segmentation (Liu) and is widely adopted by other researchers (Mateev). The four segmentation approaches used in this research are all based on the basic bigram method (a sketch of all four is given after this list) and include the following:

- Pure bigram method (abbreviated as pbi), which segments a typical sentence ABCDEF into AB, CD and EF;
- Overlapping bigram method (abbreviated as ovlap), which segments a typical sentence ABCDEF into AB, BC, CD, DE and EF;
- Pure bigram method combined with a 1-character word list (abbreviated as pstop), which segments a typical sentence ABCDEF into AB, C, DE and F if C is in the 1-character word list;
- Overlapping bigram method combined with a 1-character word list (abbreviated as ovstop), which segments a typical sentence ABCDEF into AB, C, DE and EF if C is in the 1-character word list.
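To make the four bigram variants concrete, the following is a minimal Python sketch of how they could be implemented. The function names, the helper for splitting on the word list, and the toy one-character word list are illustrative assumptions; the paper does not describe the authors' actual implementation.

```python
# Illustrative sketch of the four bigram segmentation variants
# (pbi, ovlap, pstop, ovstop). Names and the sample word list are hypothetical.

def pbi(chars):
    """Pure bigram: ABCDEF -> AB, CD, EF (a trailing single character is kept)."""
    return [chars[i:i + 2] for i in range(0, len(chars), 2)]

def ovlap(chars):
    """Overlapping bigram: ABCDEF -> AB, BC, CD, DE, EF."""
    if len(chars) < 2:
        return [chars] if chars else []
    return [chars[i:i + 2] for i in range(len(chars) - 1)]

def _split_on_word_list(chars, word_list):
    """Break the string into runs separated by 1-character word-list entries."""
    runs, current = [], ""
    for ch in chars:
        if ch in word_list:
            if current:
                runs.append(current)
                current = ""
            runs.append(ch)          # the 1-character word itself is a segment
        else:
            current += ch
    if current:
        runs.append(current)
    return runs

def pstop(chars, word_list):
    """Pure bigram applied to each run between 1-character word-list entries."""
    segments = []
    for run in _split_on_word_list(chars, word_list):
        segments.extend([run] if run in word_list else pbi(run))
    return segments

def ovstop(chars, word_list):
    """Overlapping bigram applied to each run between 1-character word-list entries."""
    segments = []
    for run in _split_on_word_list(chars, word_list):
        segments.extend([run] if run in word_list else ovlap(run))
    return segments

if __name__ == "__main__":
    sentence = "ABCDEF"
    word_list = {"C"}                   # hypothetical 1-character word list
    print(pbi(sentence))                # ['AB', 'CD', 'EF']
    print(ovlap(sentence))              # ['AB', 'BC', 'CD', 'DE', 'EF']
    print(pstop(sentence, word_list))   # ['AB', 'C', 'DE', 'F']
    print(ovstop(sentence, word_list))  # ['AB', 'C', 'DE', 'EF']
```

Run on the example sentence ABCDEF with C in the word list, the sketch reproduces the four segmentations listed above.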

The 1-character word list used in the segmentation experiments is constructed according to the statistics of 10 samples drawn from the whole corpus. This is based on the observation that many 1-character words can serve as natural word boundaries in Chinese text. The aim of this list is to break a long Chinese sentence into shorter ones, thereby enabling a larger number of meaningful 2-character words to be identified from these new shorter sentences. A process to generate such a word list was proposed and verified earlier (Foo). In the same paper, it was shown that the use of this 1-character word list led to superior segmentation results on the same test corpus used in these IR experiments. At the same time, this 1-character word list is also used as a stoplist in the IR experiments.

By using the above four approaches to automatically segment all 266 files and comparing the results with the manual segmentation, it becomes possible to obtain a series of segmentation counts and a set of accuracy measures, shown in Tables 1 and 2 respectively. The computations are based on individual sentences, in conjunction with a word-by-word comparison of the automatic and manual results. In calculating the number of words and the accuracy in these tables, the words in the 1-character word list that are used as stopwords are not included, since they are not indexed in the experiments. The manual results are also shown for comparison. In the automatic segmentation calculations, the measures CNwM, APCNwM and MPCNwM include only 2-character words.

Table 1. Segment counts of manual and various automatic approaches

             CNw      CNw1     CNwM     AMB2     AN
(1) pbi      38193    3931     34262    10238    96700
(2) ovlap    56444    1382     55062    17369    162160
(3) pstop    46270    8286     37984    9646     89486
(4) ovstop   55301    4694     50607    15872    131874
(5) manual   90490    27583    62907    0        90490

Legend:
CNw: the total number of correct segments
CNw1: the total number of correct 1-character segments
CNwM: the total number of correct segments of 2 or more characters
AMB2: the number of ambiguous 2-character segments
AN: the total number of segments in the automatic segmentation results
MN: the total number of segments in the manual segmentation results (for the manual approach, the value listed under AN is MN)

Table 2. Segmentation accuracy measures of various approaches

             WSA1       APCNw1     APCNwM     APAMB2     WSA2       MPCNw1     MPCNwM
(1) pbi      0.394964   0.040651   0.354312   0.105874   0.422069   0.142515   0.544645
(2) ovlap    0.348076   0.008522   0.339554   0.107110   0.623760   0.050103   0.875292
(3) pstop    0.517064   0.092595   0.424469   0.107793   0.511327   0.300402   0.603812
(4) ovstop   0.419347   0.035595   0.383753   0.120357   0.611128   0.170177   0.804473
(5) manual   1          0.304818   0.695182   0          1          1          1

Segmentation accuracy measures:
WSA1 = CNw/AN
APCNw1 = CNw1/AN
APCNwM = CNwM/AN
APAMB2 = AMB2/AN
WSA2 = CNw/MN
MPCNw1 = CNw1/CNw1(manual)
MPCNwM = CNwM/CNwM(manual)
* For the manual calculation, AN is the same as MN.

In these tables, an ambiguous segment is a word that is incorrect in the context of the sentence but is nevertheless a legitimate word found in a Chinese language dictionary. For example, xiang zhen qi ye (township enterprises) is a segment generated from manual segmentation. Segmenting this word with the bigram approach yields xiang zhen (town) / qi ye (enterprises), both of which are legitimate words in other sentences but in this context become two ambiguous words. This is an example of the additional difficulty posed by Chinese information retrieval in contrast to its English counterpart.

It is evident that there are many different ways to compute segmentation accuracy. These measures are primarily based on the total number of correct segments, the total number of correct 1-character segments, the total number of correct 2-character segments and the total number of ambiguous segments, computed against the results of manual and automatic segmentation. They are identified by their abbreviations and corresponding formulae in Table 2. It should be noted that WSA1 is the sum of APCNw1 and APCNwM, while WSA2 is not the sum of MPCNw1 and MPCNwM.
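As a worked illustration of the formulae in Table 2, the short sketch below computes the accuracy measures from the raw counts in Table 1. The dictionary layout and function name are hypothetical; MN, the manual segment total, is taken from the manual row.

```python
# Illustrative computation of the Table 2 accuracy measures from the Table 1 counts.
counts = {
    # approach: (CNw, CNw1, CNwM, AMB2, AN)
    "pbi":    (38193,  3931, 34262, 10238,  96700),
    "ovlap":  (56444,  1382, 55062, 17369, 162160),
    "pstop":  (46270,  8286, 37984,  9646,  89486),
    "ovstop": (55301,  4694, 50607, 15872, 131874),
    "manual": (90490, 27583, 62907,     0,  90490),
}

MN = counts["manual"][4]              # for the manual result, AN equals MN
CNW1_MANUAL = counts["manual"][1]
CNWM_MANUAL = counts["manual"][2]

def accuracy_measures(cnw, cnw1, cnwm, amb2, an):
    """Return the seven measures defined in Table 2."""
    return {
        "WSA1":   cnw / an,
        "APCNw1": cnw1 / an,
        "APCNwM": cnwm / an,
        "APAMB2": amb2 / an,
        "WSA2":   cnw / MN,
        "MPCNw1": cnw1 / CNW1_MANUAL,
        "MPCNwM": cnwm / CNWM_MANUAL,
    }

for name, row in counts.items():
    measures = accuracy_measures(*row)
    print(name, {k: round(v, 6) for k, v in measures.items()})
```

For example, for pbi this yields WSA1 = 38193/96700 ≈ 0.394964 and MPCNwM = 34262/62907 ≈ 0.544645, matching the values in Table 2.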

2.3. Information retrieval experiment

In order to formulate a set of queries for the IR experiment, four graduate students from Nanyang Technological University, Singapore, who were proficient in the Chinese language and familiar with the economics subject area, were asked to submit topics about economic news that interested them. The initial 41 submitted queries were subsequently reduced to 20 by removing all similar or redundant queries, as well as those with few matching documents. These 20 queries form the experimental query set that was given to six assessors. On average, each query contains eight to nine Chinese characters. Of the six assessors, four were the same people who had provided the initial queries. Each assessor chose between two and five queries from the query set and made user relevance decisions on the chosen queries.

As the size of the document collection is relatively small, it was possible to obtain relevance judgments for all documents with respect to each query. Each assessor was asked to read each document in turn. After each reading, the assessor went through each query on the list; if the document was deemed relevant to the query, the corresponding unique Document ID was noted alongside the query. This was repeated for the remaining queries, and each assessor went through this entire process for each of the documents in turn.

The Chinese IR system used for the experiments (Lim) was modified from an existing English IR system known as the mg system (Lane). The system is currently housed in the Division of Information Studies laboratory to support Chinese language, multilingual and digital library research activities in the Division. The five different indices created from the four automatic segmentation approaches and the manual segmentation were stored in separate directories. An optional parameter in the mg command line is used to activate the desired index file in the query and retrieval process.

In order to explore the relationship between segmentation and its effect on IR, the five different segmentation approaches were tested while keeping the same query segments for all approaches. In these experiments, the manual segmentation results were used as the baseline representing the ideal segmentation and index. The queries were first segmented manually, and stopwords were excluded. Each query was submitted to the IR system to retrieve a set of ranked results, where the ranking is derived from computing a similarity coefficient for each document-query pair. Data recall and precision were used to measure and evaluate the effectiveness of the IR experiments using the different segmentation approaches for the index.
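The paper does not detail the mg system's similarity coefficient, but the idea of ranking documents by a query-document similarity over the indexed segments can be sketched as follows. Cosine similarity over raw term-frequency vectors is used here purely for illustration and is an assumption, not the weighting actually used by the mg system; the document identifiers and segments in the example are hypothetical.

```python
# Minimal sketch: rank already-segmented documents against a segmented query
# by a similarity coefficient (cosine over term-frequency vectors assumed).
import math
from collections import Counter

def cosine_similarity(query_segments, doc_segments):
    q, d = Counter(query_segments), Counter(doc_segments)
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def rank(query_segments, docs):
    """Return (doc_id, similarity) pairs sorted by decreasing similarity."""
    scored = [(doc_id, cosine_similarity(query_segments, segs))
              for doc_id, segs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical segmented documents and a manually segmented query.
docs = {
    46:  ["乡镇", "企业", "发展"],
    234: ["企业", "改革"],
    119: ["金融", "市场"],
}
print(rank(["乡镇", "企业"], docs))
```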

[Figure 1. Recall-precision graph comparing the various indexing approaches: pbi, pstop, ovlap, ovstop and manual.]

Figure 1 shows the recall-precision graph for the different segmentation approaches. The performance of manual segmentation appears superior to that of the automatic segmentation approaches. The ovlap and ovstop segmentations do not perform as well as manual segmentation, but perform better than pbi and pstop. It is also apparent that the differences in results within the ovstop and ovlap pair, and within the pstop and pbi pair, are marginal.

The average recall and precision results of the various approaches are shown in Table 3. The precision value quoted in the table is the average precision (non-interpolated) over all relevant documents: the precision is calculated after each relevant document is retrieved, all precision values are then averaged to obtain a single number for each individual query, and these values are in turn averaged over all queries (Harman). The recall is obtained by taking the arithmetic mean over the total number of queries (Salton). The calculation of precision and recall is consistent with the methods used in the TREC experiments (Harman).

Table 3. Average precision and recall values from the IR experiments

Segmentation approach   Precision   Recall
(1) pbi                 0.4451      0.8084
(2) ovlap               0.5200      0.8832
(3) pstop               0.4698      0.8318
(4) ovstop              0.5115      0.8832
(5) manual              0.5784      0.9299
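Precision and recall figures of the kind reported in Table 3 (and traced per query in Table 5 below) can be reproduced from a ranked result list and a set of relevance judgments. The following is a minimal sketch for a single query under the TREC-style non-interpolated average precision definition quoted above; the ranked list and relevant set are hypothetical.

```python
# Sketch of non-interpolated average precision and recall for one query:
# precision is taken at the rank of each retrieved relevant document and
# averaged over the full relevant set (unretrieved relevant documents
# contribute zero); recall is the fraction of relevant documents retrieved.
def evaluate(ranked_doc_ids, relevant_doc_ids):
    relevant = set(relevant_doc_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant document
    avg_precision = sum(precisions) / len(relevant) if relevant else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return avg_precision, recall

# Hypothetical ranked output for one query with three relevant documents.
ranking = [46, 12, 234, 7, 119]
relevant = [46, 234, 119]
print(evaluate(ranking, relevant))   # precision points: 1/1, 2/3, 3/5
```

Per-query values computed this way would then be averaged over all 20 queries to obtain figures comparable to Table 3.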

3. Analysis of the correlation between segmentation accuracy and IR effectiveness

The analysis is concerned with identifying factors (if any) that correlate the segmentation accuracy measures with IR effectiveness. This is accomplished through the use of statistical regression to examine the relationship between data precision and recall and the various segmentation accuracy measures.

In the analysis, data precision and recall are taken as the dependent variables Y1 and Y2 respectively. The segmentation accuracy measures WSA1, WSA2, APCNw1, APCNwM, MPCNw1, MPCNwM and APAMB2 are taken as the independent variables. We are interested in whether these variables have an influence on data precision and recall. There are basically two models of regression analysis, namely bivariate regression and multiple regression. Since the definitions and calculation of the segmentation accuracy measures are related to one or more of the other independent variables, simple bivariate analysis is more appropriate for exploring which variables of the segmentation results affect precision and recall respectively (Berry).

The general bivariate regression model is a linear equation of the form Y = a + bX + e, where Y is the dependent variable (precision or recall), X is the independent variable (one of the segmentation accuracy measures), a is the intercept, b is the slope indicating the average change in Y associated with a unit change in X, and e is the error of the regression fit.

Using Microsoft Excel 97 as the statistical tool, a bivariate regression of precision (Y1) and recall (Y2) on every segmentation measure (columns 1 to 7 of Table 2) yielded a set of 14 equations. The equations recognised as statistically significant in affecting the precision and recall values are summarised in Table 4.

Table 4. Statistically significant accuracy measures affecting data precision and recall

Dependent        Independent    Intercept   t-value   Slope     t-value   R^2    Se
variable (Y)     variable (X)   a           for a     b         for b
Precision (Y1)   WSA2           0.3630      15.6      0.2240    6.37      0.93   0.015
Precision (Y1)   MPCNwM         0.3020      12.63     0.2651    8.69      0.96   0.012
Precision (Y1)   APAMB2         0.5741      15.03     -0.7833   -2.03     0.58   0.038
Recall (Y2)      WSA2           0.7395      24.12     0.2016    4.36      0.86   0.020
Recall (Y2)      MPCNwM         0.6763      37.74     0.2495    10.92     0.98   0.009

In Table 4, R^2 is the coefficient of determination, a summary statistic that measures how well the regression equation fits the data. R^2 varies between 0 and 1; as R^2 approaches 1, the independent variable increasingly explains Y. Se is the standard error of estimate for Y, which gives an indication of the error of the regression fit (so that if R^2 = 1, then Se = 0). Also, if the magnitude of the t-ratio (or t-statistic) exceeds 2, we can conclude that the results are statistically significant at the 0.05 level.
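The same bivariate fits can be reproduced outside Excel. Below is a minimal sketch that regresses precision on MPCNwM for the five approaches, reporting the slope, intercept, R^2, standard error of estimate and the t-statistic for the slope; it uses only the five (X, Y) pairs from Tables 2 and 3 and standard least-squares formulas, and is illustrative rather than the authors' original Excel workflow.

```python
# Bivariate least-squares regression Y = a + bX + e, with R^2, Se and the
# t-statistic for the slope. Data: MPCNwM (Table 2) vs. precision (Table 3).
import math

def bivariate_regression(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                          # slope
    a = mean_y - b * mean_x                # intercept
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(r * r for r in residuals)
    sst = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - sse / sst                     # coefficient of determination
    se = math.sqrt(sse / (n - 2))          # standard error of estimate
    t_b = b / (se / math.sqrt(sxx))        # t-statistic for the slope
    return a, b, r2, se, t_b

mpcnwm    = [0.544645, 0.875292, 0.603812, 0.804473, 1.0]   # pbi, ovlap, pstop, ovstop, manual
precision = [0.4451, 0.5200, 0.4698, 0.5115, 0.5784]
print(bivariate_regression(mpcnwm, precision))
```

Run on these five points, the sketch should give approximately a = 0.302, b = 0.265, R^2 = 0.96 and t = 8.7 for the slope, in line with the MPCNwM row of Table 4.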

The regression results show that MPCNwM and WSA2 are the two segmentation accuracy measures that affect both precision and recall performance. Of the two, MPCNwM is the more critical, since it exhibits higher R^2 values and t-ratios for b. Additionally, the regression results show that ambiguous words have a significantly adverse influence on precision, but less so on recall (for which t = -1.6, not shown in Table 4).

These regression results imply that a segmentation approach that correctly recognises a higher number of the multi-character words found in the manual segmentation results can yield not only better precision but also better recall. However, a segmentation approach that exhibits a high percentage of multi-character words in the automatic segmentation result (APCNwM) but a low absolute number of correct multi-character words (and hence a lower MPCNwM) will not necessarily yield better precision and recall.

It is possible to explain why the same factor affects both precision and recall by considering the process of query matching. With matched queries, documents with higher similarity will be ranked before those with lower similarity, so precision may be improved. Without matched queries, no related documents will be retrieved, so recall is adversely affected. Since the queries are manually segmented in this research, the similarity of query and document becomes more dependent on how correctly the document has been segmented. More accurate segmentation not only increases the similarity between query and document, thereby improving precision, but also increases the chance of relevant documents being retrieved, thereby improving recall.

Table 5 uses the example of Query 7 in the experiment, which has a total of 17 relevant documents, to show how precision and recall change under different indexing approaches. For each approach, the first column is the Doc_ID of each retrieved relevant document, the second column is the number of documents retrieved at the point that relevant document is retrieved, and the third and fourth columns give the corresponding recall and precision values respectively.

Table 5. Comparison of IR results of manual and automatic approaches (Query 7, 17 relevant documents)

Manual
Doc_ID   #Docs retrieved   Recall   Precision
46       1                 0.0588   1.0000
234      2                 0.1176   1.0000
119      3                 0.1765   1.0000
104      4                 0.2353   1.0000
118      5                 0.2941   1.0000
9        6                 0.3529   1.0000
182      7                 0.4118   1.0000
122      8                 0.4706   1.0000
7        9                 0.5294   1.0000
120      10                0.5882   1.0000
44       11                0.6471   1.0000
185      14                0.7059   0.8571
204      15                0.7647   0.8667
47       16                0.8235   0.8750
157      17                0.8824   0.8824
180      21                0.9412   0.7619
85       22                1.0000   0.7727

Pstop
Doc_ID   #Docs retrieved   Recall   Precision
46       1                 0.0588   1.0000
234      3                 0.1176   0.6667
118      4                 0.1765   0.7500
119      5                 0.2353   0.8000
182      7                 0.2941   0.7143
9        8                 0.3529   0.7500
204      11                0.4118   0.6364
104      15                0.4706   0.5333
120      17                0.5294   0.5294
44       23                0.5882   0.4348
85       31                0.6471   0.3548
185      33                0.7059   0.3636

Ovstop
Doc_ID   #Docs retrieved   Recall   Precision
46       1                 0.0588   1.0000
234      3                 0.1176   0.6667
119      4                 0.1765   0.7500
104      5                 0.2353   0.8000
182      6                 0.2941   0.8333
118      7                 0.3529   0.8571
9        9                 0.4118   0.7778
204      17                0.4706   0.4706
85       20                0.5294   0.4500
120      21                0.5882   0.4762
185      23                0.6471   0.4783
44       25                0.7059   0.4800
7        27                0.7647   0.4815

As multi-character words are more meaningful in expressing linguistic concepts in Chinese documents, their correct recognition has a more significant effect on the results. Although the automatic approaches reported here are based only on 2-character segmentation, they can recognise the majority of the multi-character words found by manual segmentation, since 2-character words are the most frequently used words in Chinese documents; the proportion recognised ranges from 54.5% to 87.5% in these experiments (the MPCNwM values in Table 2). Thus, it is reasonable to conclude that a segmentation approach that recognises a higher number of correct 2-character words and longer words of more than 2 characters will yield better IR results, while the correct recognition of 1-character words has less effect on IR results.

4. Conclusion

A set of IR experiments based on five different approaches, namely the manual approach and four automatic bigram segmentation approaches, has been conducted to study the relationship between different segmentation accuracy measures and IR effectiveness, via the measures of data precision and recall. A regression analysis of the results reveals two statistically significant relationships:

(1) Precision and recall are largely dependent on the recognition of the correct 2-character and longer words found in the manual segmentation results. The approach that recognises the higher number of such words will yield higher precision and recall.

(2) The existence of ambiguous words adversely affects precision.

As part of future work, we intend to conduct these experiments using larger corpora and other subject domains. Other segmentation approaches, such as dictionary-based approaches, can also be incorporated into the model to enable a more elaborate study to be carried out. With the projected increase in the number of Chinese IR systems in the future, such studies will prove useful in identifying the most appropriate choice of segmentation approach for IR applications.

References

Berry, W.D., and Feldman, S. (1985). Multiple regression in practice. Newbury Park: SAGE Publications.

Foo, S., and Li, H. (1998). An integrated bigram approach with single-character word list for Chinese word segmentation. TEXT Technology, 8(4), 17-28.

Harman, D. (1993). Overview of the Second Text REtrieval Conference. http://trec.nist.gov/pubs/trec6/t6_proceedings.html, Maryland. (Last visited: 3/03/2001)

Kwok, K.L. (1997). Comparing representations in Chinese information retrieval. http://ir.cs.qc.edu/#publi_. (Last visited: 3/03/2001)

Lane, D. (1997). MG pages. http://www.mds.rmit.edu.au/mg. (Last visited: 3/03/2001)

Lim, H.K. (1999). Chinese text retrieval system. Master of Applied Science (M.A.Sc.) dissertation, School of Applied Science, Nanyang Technological University, Singapore.

Liu, Y. (1994). The rules of modern Chinese segmentation for the purpose of information processing and approaches of automatic Chinese segmentation (in Chinese). Beijing: Tsinghua University Press, 36-63.

Mateev, B. et al. (1998). ETH TREC-6: routing, Chinese, cross-language and spoken document retrieval. http://trec.nist.gov/pubs/trec6/t6_proceedings.html, Maryland. (Last visited: 3/03/2001)

Nie, J.Y., Brisebois, M., and Ren, X.B. (1996). On Chinese text retrieval. Proceedings of SIGIR '96, Zurich, Switzerland, 225-233.

Salton, G., and McGill, M.J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill Book Company.

Tong, X., Zhai, C.X., Milic-Frayling, N., and Evans, D.A. (1996). Experiments on Chinese text indexing: CLARIT TREC-5 Chinese track report. http://trec.nist.gov/pubs/trec5/t5_proceedings.html, Maryland. (Last visited: 3/03/2001)

Wilkinson, R. (1998). Chinese document retrieval at TREC-6. http://trec.nist.gov/pubs/trec6/t6_proceedings.html, Maryland. (Last visited: 3/03/2001)

Wu, Z.M., and Tseng, G. (1993). Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science, 44(9), 532-542.