Chinese Word Segmentation Accuracy and Its Effects on Information Retrieval
Foo, S., Li, H. (2002). Chinese word segmentation accuracy and its effects on information retrieval. TEXT Technology.

Schubert Foo* and Hui Li
Division of Information Studies, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore

Abstract

In Chinese information retrieval (IR), word segmentation is an essential prerequisite process that breaks documents down into smaller linguistic units or word segments so that they can be indexed for subsequent retrieval. Despite the host of Chinese information systems in existence today, little work has been done to study word segmentation accuracy and its effect on IR. This article describes a set of experiments conducted in the Division of Information Studies, Nanyang Technological University, Singapore, to explore this issue. Four types of automatic character-based segmentation approaches, as well as a manual segmentation, were used to index a test corpus, resulting in five different indices for use in the IR experiments. The segmentation accuracy of each approach was obtained by comparing the automatic segmentation results with the manual segmentation results. A set of IR experiments provided a measure of IR effectiveness using the traditional measures of data recall and data precision. Statistical analysis was applied to explore the correlation between segmentation accuracy and IR effectiveness. The analysis suggests that the word segmentation approach does have an effect on IR results. In particular, the approach that recognizes the higher number of correct words of 2 or more characters can produce better precision and recall. On the other hand, the existence of ambiguous words resulting from the word segmentation process adversely affects precision.

1. Introduction

Research interest in Chinese IR systems has increased as a result of the growing need for knowledge transfer and global information exchange. Typically, an IR system determines the relevant documents according to the frequency of occurrence of the words of a query within the documents and corpus (Nie). For English and other western languages, the identification of distinct words in the documents is trivial. However, this is much more difficult for the Chinese language, as well as other Asian languages such as
Japanese or Korean. This is because Chinese text appears as a string of ideographic characters without any obvious boundary between words, except for punctuation marks at the end of each sentence and occasional commas within sentences. Thus, Chinese text processing requires an essential word segmentation step to break up the text into smaller linguistic units or segments. These segments are subsequently indexed for retrieval operations.

Numerous segmentation approaches have been proposed for Chinese IR. These can be broadly divided into character-based and word-based approaches. Within these two basic groups there are many alternatives, such as single-character or multiple-character segmentation, the use of a dictionary or statistics, or the introduction of linguistic knowledge for segmentation. However, previously reported Chinese IR studies have consistently shown only that IR results using single-character indexing are significantly worse than those using other segmentation techniques (Tong). No firm conclusions on the relative performance of multi-character and word-based approaches have so far been reported. Some researchers obtained better results using bigram (two-character) methods while others obtained better results using word-based approaches (Wilkinson). Accordingly, some researchers believe that a better segmentation approach will yield superior IR results (Nie), while others have not found any direct relationship between the segmentation approach and IR results in their experiments (Kwok). It is also evident that no systematic study has been carried out to investigate this relationship. This research aims to address this gap by investigating the relationship between word segmentation accuracy and IR effectiveness.
This paper outlines a set of IR experiments that index the document corpus using manual segmentation as well as four automatic segmentation approaches. The experiments use the same manually segmented queries for all of these approaches. The performance of segmentation is evaluated by accuracy measures, while IR effectiveness is evaluated by the standard measures of data recall and data precision. A statistical regression model is then used to analyse the correlation between the segmentation accuracy measures and the IR results. The following sections provide details of each of these phases of work.

2. Experiments and results of segmentation and IR

2.1. Test Corpus

One full month of news (March 1998) from the economics section of the online Chinese People's Daily newspaper was selected as the test corpus. A total of 266 files, containing 182,292 Chinese characters and occupying 532 KB (kilobytes) of storage, were used for the study. The files vary between 1 KB and 13 KB in size. The same corpus was used for both the word segmentation and IR experiments.
2.2. Manual and Automatic Segmentation

As mentioned previously, the first step in Chinese IR is to carry out a word segmentation process to divide the texts into linguistic units. A study that estimated the number of words according to the total number of existing Chinese characters revealed that there are approximately 5% one-character words, 75% two-character words and 14% three-character words, with the remaining 6% comprising words of four or more characters (Wu). Given this statistic, it is not surprising that many existing segmentation approaches used for Chinese IR are based on the bigram (two-character) approach or variations of it. Likewise, the bigram approach has been chosen as the basis of this research.

In order to evaluate the accuracy of a segmentation approach, a manual segmentation of the text must first be carried out. Manual segmentation, as the name implies, is concerned with manually identifying and segmenting the linguistic units (i.e. words) in the text. Manual segmentation is treated as an ideal segmentation technique and is used as the baseline against which the automatic approaches carried out by computer are compared. However, due to the subjective nature of manual segmentation, different results can arise when the same document is segmented by different people. This has a direct implication on the evaluation of segmentation and the computation of accuracy results. To keep the manual segmentation as correct and consistent as possible, and to ensure the comparability of results, a proposed set of rules and guidelines for modern Chinese segmentation for information processing (Liu) has been used as the basis for manual segmentation in these experiments. These rules have been submitted for consideration as a national Chinese standard for use in Chinese information processing.
In the automatic segmentation, both English words and other non-Chinese strings, including Arabic numbers, within the Chinese text are treated as distinct segment units so that they are not broken into bigrams. This treatment is consistent with the rules of segmentation (Liu) and is widely adopted by other researchers (Mateev). The four segmentation approaches used in this research are all based on the basic bigram method:

(1) Pure bigram method (abbreviated as pbi), which segments a typical sentence ABCDEF into AB, CD and EF;
(2) Overlapping bigram method (abbreviated as ovlap), which segments a typical sentence ABCDEF into AB, BC, CD, DE and EF;
(3) Pure bigram method combined with a 1-character word list (abbreviated as pstop), which segments a typical sentence ABCDEF into AB, C, DE and F if C is in the 1-character word list;
(4) Overlapping bigram method combined with a 1-character word list (abbreviated as ovstop), which segments a typical sentence ABCDEF into AB, C, DE and EF if C is in the 1-character word list.

The 1-character word list used in the segmentation experiments is constructed according to the statistics of 10 samples from the whole corpus. This is based on the
observation that many 1-character words can be used as natural word boundaries in Chinese text. The aim of this list is to break a long Chinese sentence into shorter ones, thereby enabling a larger number of meaningful 2-character words to be identified from these new shorter sentences. A process to generate such a word list was proposed and verified earlier (Foo). In that paper, it was shown that the use of this 1-character word list led to superior segmentation results on the same test corpus used in these IR experiments. This 1-character word list is also used as a stoplist in the IR experiments.

By using the above four approaches to automatically segment all 266 files and comparing the results with the manual segmentation, it becomes possible to obtain a series of segmentation counts and a set of different accuracy measures, as shown in Tables 1 and 2 respectively. The computations are based on individual sentences, with a word-by-word comparison of the automatic and manual results. In calculating the number of words and the accuracy in these tables, the words in the 1-character word list that are used as stopwords are not included, since they are not indexed in the experiments. The manual results are also shown for comparison. In the automatic segmentation calculations, the measures CNwM, APCNwM and MPCNwM include only 2-character words.

Table 1. Segment counts of manual and various automatic approaches
Columns: CNw, CNw1, CNwM, AMB2, AN, MN
Rows: (1) pbi, (2) ovlap, (3) pstop, (4) ovstop, (5) manual

Legend:
CNw: the total number of correct segments
CNw1: the total number of correct 1-character segments
CNwM: the total number of correct segments with 2 and more characters
AMB2: the number of ambiguous 2-character segments
AN: the total number of segments in automatic segmentation results
MN: the total number of segments in manual segmentation results
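The four bigram-based methods and the overall counts of Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`segment`, `counts`, `overall_accuracy`) are hypothetical, the input is assumed to contain only Chinese characters (English words and numbers are not handled), and a segment is assumed to be "correct" when it covers exactly the same character span as a manual segment. The two ratios returned at the end are WSA1 = CNw/AN and WSA2 = CNw/MN as defined for Table 2.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def bigram_spans(text: str, offset: int, overlap: bool) -> List[Span]:
    """Cut one run of characters into pure or overlapping bigrams."""
    if not text:
        return []
    if overlap:
        starts = range(0, max(len(text) - 1, 1))  # AB, BC, CD, ...
    else:
        starts = range(0, len(text), 2)           # AB, CD, EF, ...
    return [(offset + i, offset + min(i + 2, len(text))) for i in starts]


def segment(sentence: str, overlap: bool, stops: Iterable[str] = ()) -> List[Span]:
    """pbi / ovlap when `stops` is empty, pstop / ovstop otherwise.

    Characters in `stops` (the 1-character word list) become segments of
    their own and act as boundaries between the runs that are bigrammed.
    """
    stop_set = set(stops)
    spans: List[Span] = []
    run_start = 0
    for i, ch in enumerate(sentence):
        if ch in stop_set:
            spans += bigram_spans(sentence[run_start:i], run_start, overlap)
            spans.append((i, i + 1))
            run_start = i + 1
    spans += bigram_spans(sentence[run_start:], run_start, overlap)
    return spans


def words(sentence: str, spans: List[Span]) -> List[str]:
    """Turn spans back into the segment strings."""
    return [sentence[a:b] for a, b in spans]


def counts(auto: List[Span], manual: List[Span]) -> Tuple[int, int, int]:
    """Return (CNw, AN, MN); correctness = identical span (an assumption)."""
    cnw = len(set(auto) & set(manual))
    return cnw, len(auto), len(manual)


def overall_accuracy(cnw: int, an: int, mn: int) -> Tuple[float, float]:
    """Return (WSA1, WSA2) = (CNw/AN, CNw/MN)."""
    return cnw / an, cnw / mn


if __name__ == "__main__":
    s = "ABCDEF"
    print(words(s, segment(s, overlap=False)))              # pbi
    print(words(s, segment(s, overlap=True)))               # ovlap
    print(words(s, segment(s, overlap=False, stops="C")))   # pstop
    print(words(s, segment(s, overlap=True, stops="C")))    # ovstop
```

If the manual segmentation of ABCDEF were AB / CD / EF, the overlapping method would recover all three manual words (WSA2 = 1.0) but only three of its five segments would be correct (WSA1 = 0.6), which shows why the two overall measures can diverge.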
Table 2. Segmentation accuracy measures of various approaches
Columns: WSA1, APCNw1, APCNwM, APAMB2, WSA2, MPCNw1, MPCNwM
Rows: (1) pbi, (2) ovlap, (3) pstop, (4) ovstop, (5) manual

Segmentation accuracy measures:
WSA1 = CNw/AN
APCNw1 = CNw1/AN
APCNwM = CNwM/AN
APAMB2 = AMB2/AN
WSA2 = CNw/MN
MPCNw1 = CNw1/CNw1(manual)
MPCNwM = CNwM/CNwM(manual)
* For manual calculation, AN is the same as MN.

In these tables, an ambiguous segment is a segment that is a legitimate word found in a Chinese language dictionary but is an incorrect word in the context of the sentence. For example, xiang zhen qi ye (township enterprises) is a segment generated by manual segmentation. Segmenting this word using the bigram approach yields xiang zhen (town) / qi ye (enterprises), which are both legitimate words in other sentences but, in this situation, become two ambiguous words. This is an example of the additional difficulty posed by Chinese information retrieval in contrast to its English counterpart.

It is evident that there are many different ways to compute segmentation accuracy. The measures used here are primarily based on the total number of correct segments, the total number of correct 1-character segments, the total number of correct segments with 2 or more characters and the total number of ambiguous segments, set against the results of manual and automatic segmentation. They are identified by their abbreviations and corresponding formulae in Table 2. It should be noted that WSA1 is the sum of APCNw1 and APCNwM, while WSA2 is not the sum of MPCNw1 and MPCNwM, because the latter two use different denominators.

2.3. Information retrieval experiment

In order to formulate a set of queries for the IR experiment, four graduate students from Nanyang Technological University, Singapore, who were proficient in the Chinese language and familiar with the economics subject area, were asked to submit some topics about economic news that interested them.
The initial 41 submitted queries were subsequently reduced to 20 by removing all similar or redundant queries, and those that had few matching documents. These 20 queries form the experimental query set that was given to six assessors. On average, each query contains eight to nine Chinese characters. Of the six assessors, four were the same people who provided the initial queries. Each assessor chose between
two and five queries from the query set and made relevance judgments on the chosen queries. As the document collection is relatively small, it was possible to obtain relevance judgments for all documents with respect to each query. Each assessor was asked to read each document in turn. After each reading, the assessor went through each query on the list. If the document was deemed relevant to a query, the corresponding unique Document ID was noted alongside that query. This was repeated for the remaining queries. Each assessor went through this entire process for each of the documents in turn.

The Chinese IR system used for the experiments (Lim) was modified from an existing English IR system known as the mg system (Lane). This system is currently housed in the Division of Information Studies Laboratory to support different Chinese language, multilingual and digital libraries research activities in the Division. The five different indices, created from the four automatic segmentation approaches and the manual segmentation, were stored in separate directories. An optional parameter on the mg command line is used to activate the desired index file in the query and retrieval process.

In order to explore the relationship between segmentation and its effect on IR, the five different segmentation approaches were tested while keeping the same query segments for all approaches. In these experiments, the manual segmentation results were used as the baseline to represent the ideal segmentation and index. The queries were first segmented manually, and stopwords were excluded. Each query was then submitted to the IR system to retrieve a set of ranked results. The ranking is derived by computing a similarity coefficient for each document-query pair. Data recall and precision were used to measure and evaluate the effectiveness of the IR experiments using the different segmentation approaches for the index.
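As a rough illustration of these measures, the Python sketch below computes recall and precision at the rank of each relevant document in one query's ranked list, and averages those precision values into a non-interpolated average precision. The function names and document IDs are hypothetical. Dividing by the total number of relevant documents (so that relevant documents never retrieved contribute zero) follows the usual TREC convention and is an assumption about the exact procedure used here.

```python
from typing import List, Sequence, Set, Tuple

# One row per relevant document retrieved:
# (doc_id, number of documents retrieved so far, recall, precision)
Row = Tuple[str, int, float, float]


def recall_precision_rows(ranked: Sequence[str], relevant: Set[str]) -> List[Row]:
    """Recall and precision measured each time a relevant document appears."""
    rows: List[Row] = []
    hits = 0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            rows.append((doc_id, rank, hits / len(relevant), hits / rank))
    return rows


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for one query."""
    if not relevant:
        return 0.0
    rows = recall_precision_rows(ranked, relevant)
    return sum(precision for _, _, _, precision in rows) / len(relevant)


def mean_over_queries(values: Sequence[float]) -> float:
    """Arithmetic mean of per-query values (precision or recall)."""
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Hypothetical ranked list and relevance judgments for a single query.
    ranked = ["d3", "d7", "d1", "d9", "d4"]
    relevant = {"d3", "d1", "d8"}
    for row in recall_precision_rows(ranked, relevant):
        print(row)
    print(average_precision(ranked, relevant))
```

Each row produced by `recall_precision_rows` has the same four fields (document ID, documents retrieved so far, recall, precision) that a per-query breakdown of retrieval results would list.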
Figure 1. Recall-precision graph comparing various indexing approaches (chart title: Comparison of Various Approaches; series: pbi, pstop, ovlap, ovstop, manual)
Figure 1 shows the data recall-precision graph for the different segmentation approaches. The performance of manual segmentation appears superior to that of the automatic segmentation approaches. Ovlap and ovstop segmentation do not perform as well as manual segmentation but perform better than pbi and pstop. It is also apparent that the differences between the results of the ovstop and ovlap pair, and of the pstop and pbi pair, are marginal.

The average recall and precision results of the various approaches are shown in Table 3. The precision value quoted in the table is the average precision (non-interpolated) over all relevant documents. The precision is calculated after each relevant document is retrieved. All precision values are then averaged together to get a single number for each individual query. These values are then averaged over all queries (Harman). The recall is obtained by taking the arithmetic mean over the total number of queries (Salton). The calculation of precision and recall is consistent with the methods used in the TREC experiments (Harman).

Table 3. Average precision and recall values from IR experiments
Columns: Segmentation approach, Precision, Recall
Rows: (1) pbi, (2) ovlap, (3) pstop, (4) ovstop, (5) manual

3. Analysis on the correlation between segmentation accuracy and IR effectiveness

The analysis is concerned with identifying factors (if any) that correlate the segmentation accuracy measures with IR effectiveness. This is accomplished by using statistical regression to examine the relationship between data precision and recall and the various segmentation accuracy measures. In the analysis, data precision and recall are taken as the dependent variables Y1 and Y2 respectively. The segmentation accuracy measures WSA1, WSA2, APCNw1, APCNwM, MPCNw1, MPCNwM and APAMB2 are taken as the independent variables. We are interested to know whether these variables have an influence on data precision and recall.
There are basically two models of regression analysis, namely bivariate regression and multiple regression. Since the definition and calculation of each segmentation accuracy measure is related to one or more of the other independent variables, simple bivariate analysis is more appropriate for exploring which segmentation variables affect precision and recall respectively (Berry). The general bivariate regression model is a linear equation of the form Y = a + bX + e, where Y is the dependent variable (precision or recall), X is the independent variable
(the various segmentation accuracy measures), a is the intercept, b is the slope that indicates the average change in Y associated with a unit change in X, and e is the error term of the regression fit. Using Microsoft Excel 97 as the statistical tool, a bivariate regression of precision (Y1) and recall (Y2) on every segmentation measure (Columns 1 to 7 of Table 2) yielded a set of 14 equations. The equations that are recognised as statistically significant in affecting the precision and recall values are summarised in Table 4.

Table 4. Statistically significant accuracy measures that affect data precision and recall
Columns: Dependent variable (Y); Independent variable (X); Intercept a; t-value for a; Slope b; t-value for b; Coefficient of determination, R²; Standard error of estimate, Se
Rows for Precision (Y1): WSA2, MPCNwM, APAMB2
Rows for Recall (Y2): WSA2, MPCNwM

In Table 4, R² is the coefficient of determination, a summary statistic that measures how well the regression equation fits the data. R² varies between 0 and 1; the closer R² is to 1, the better the independent variable explains Y. Se is the standard error of estimate for Y, which gives an indication of the error of the regression fit (so that if R² = 1, then Se = 0). Also, if the value of the t-ratio (or t-statistic) exceeds 2, we can conclude that the results are statistically significant at the 0.05 level.

The regression results show that MPCNwM and WSA2 are the two segmentation accuracy measures that affect both precision and recall performance. Of the two, MPCNwM is more critical since it exhibits higher R² values and t-ratios for b. Additionally, the regression results show that ambiguous words have a significantly adverse influence on precision but less so on recall (since t = -1.6 (not shown in Table 4)).
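The quantities reported in Table 4 (intercept a, slope b, their t-values, R² and Se) follow from ordinary least squares on the model Y = a + bX + e. The Python sketch below shows one way to compute them with textbook formulas. It is an illustration only, not the Excel procedure used in the study; the function name is hypothetical and the data in the usage example are made-up numbers, not values from Table 2 or Table 3.

```python
import math
from typing import Dict, Sequence


def bivariate_regression(x: Sequence[float], y: Sequence[float]) -> Dict[str, float]:
    """Least-squares fit of y = a + b*x with R^2, Se and t-values."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares
    if sxx == 0 or sst == 0:
        raise ValueError("x and y must each vary")

    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
    r2 = 1.0 - sse / sst        # coefficient of determination
    se = math.sqrt(sse / (n - 2))  # standard error of estimate

    # Standard errors of the coefficients and the resulting t-ratios.
    se_b = se / math.sqrt(sxx)
    se_a = se * math.sqrt(1.0 / n + mean_x ** 2 / sxx)
    t_b = b / se_b if se_b > 0 else math.inf
    t_a = a / se_a if se_a > 0 else math.inf
    return {"a": a, "b": b, "t_a": t_a, "t_b": t_b, "r2": r2, "se": se}


if __name__ == "__main__":
    # Hypothetical accuracy measure (x) and precision (y) for five approaches.
    x = [0.55, 0.60, 0.70, 0.75, 1.00]
    y = [0.40, 0.45, 0.47, 0.55, 0.62]
    for name, value in bivariate_regression(x, y).items():
        print(f"{name}: {value:.4f}")
```

With five approaches there are only five observations per regression, so each fit has n - 2 = 3 degrees of freedom.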
These regression results imply that the segmentation approach that correctly recognises a higher number of the multi-character words in the manual segmentation results can yield not only better precision but better recall as well. However, a segmentation approach that exhibits a high percentage of multi-character words in the automatic segmentation result (APCNwM) but a low absolute number of correct multi-character words (hence a lower MPCNwM) will not necessarily yield better precision and recall.

The process of query matching explains why the same factor affects both precision and recall. With matched queries, documents with higher similarity will be ranked before those with lower similarity. As a result, precision may be improved. Without matched queries, no related documents will be retrieved, so that recall is adversely affected. Since the queries are manually segmented in this research, the
similarity of query and document becomes more dependent on how correctly the document has been segmented. More accurate segmentation not only increases the similarity between query and document, thereby improving precision, but also increases the chance of a document being retrieved, thereby improving recall.

Table 5 uses the example of Query 7 in the experiment, which has a total of 17 relevant documents, to show how precision and recall change when using different indexing approaches. For each approach, the first column is the Doc_ID of the documents that are both retrieved and relevant, the second column indicates the number of documents retrieved at the point when each relevant document is retrieved, and the third and fourth columns indicate the corresponding recall and precision values respectively.

Table 5. Comparison of IR results of manual and automatic approaches (Query 7)
Approaches shown: Manual, Pstop, Ovstop
Columns for each approach: Doc_ID, #Doc Retrieved, Recall, Precision

As multi-character words are more meaningful in expressing linguistic concepts in Chinese documents, their correct recognition has a more significant effect on the results. Although the automatic approaches reported here are based only on 2-character segmentation, they can recognise the majority of the multi-character words found by manual segmentation, since 2-character words are the most frequently used words in Chinese documents. This proportion ranges from 54.5% to 87.5% in these experiments. Thus, it is reasonable to conclude that the segmentation approach that recognises the higher number of correct 2-character words and longer words with more than 2 characters will yield better IR results, while the correct recognition of 1-character words has less effect on IR results.

4. Conclusion

A set of IR experiments based on five different approaches, comprising the manual approach and four automatic bigram segmentation approaches, has been conducted to study the relationship between the different segmentation accuracy measures in relation
to IR effectiveness via the measures of data precision and recall. By carrying out a regression analysis on the results, two statistically significant relationships are revealed:

(1) Precision and recall are largely dependent upon the recognition of the correct 2-character and longer words in the manual segmentation results. The approach that recognises the higher number of such words will yield higher precision and recall.
(2) The existence of ambiguous words adversely affects precision.

As part of future work, we intend to conduct these experiments using a larger corpus and in other subject domains. Other segmentation approaches, such as dictionary-based approaches, can also be incorporated into the model to enable a more elaborate study to be carried out. With the projected increase in the number of Chinese IR systems in future, such studies will prove useful in identifying the most appropriate choice of segmentation approach for IR applications.

References

Berry, W.D., and Feldman, S. (1985). Multiple regression in practice. Newbury Park: SAGE Publications.
Foo, S., and Li, H. (1998). An integrated bigram approach with single-character word list for Chinese word segmentation. TEXT Technology, 8(4).
Harman, D. (1993). Overview of the Second Text REtrieval Conference. Maryland. (Last visited: 3/03/2001)
Kwok, K.L. (1997). Comparing representations in Chinese information retrieval. (Last visited: 3/03/2001)
Lane, D. (1997). MG pages. (Last visited: 3/03/2001)
Lim, H.K. (1999). Chinese text retrieval system. Master of Applied Science (M.A.Sc.) Dissertation: School of Applied Science, Nanyang Technological University, Singapore.
Liu, Y. (1994). The rules of modern Chinese segmentation for the purpose of information processing and approaches of automatic Chinese segmentation (in Chinese). Beijing: Tsinghua University Press.
Mateev, B. et al. (1998). ETH TREC-6: routing, Chinese, cross-language and spoken document retrieval. Maryland. (Last visited: 3/03/2001)
Nie, J.Y., Brisebois, M., and Ren, X.B. (1996). On Chinese text retrieval. Proceedings of SIGIR 96, Zurich, Switzerland.
Salton, G., and McGill, M.J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill Book Company.
Tong, X., Zhai, C.X., Milic-Frayling, N., and Evans, D.A. (1996). Experiments on Chinese text indexing CLARIT TREC 5 Chinese track report. Maryland. (Last visited: 3/03/2001)
Wilkinson, R. (1998). Chinese Document Retrieval at TREC-6. Maryland. (Last visited: 3/03/2001)
Wu, Z.M., and Tseng, G. (1993). Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science, 44(9).
More informationWE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT
WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working
More informationMulti-Lingual Text Leveling
Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency
More informationStacks Teacher notes. Activity description. Suitability. Time. AMP resources. Equipment. Key mathematical language. Key processes
Stacks Teacher notes Activity description (Interactive not shown on this sheet.) Pupils start by exploring the patterns generated by moving counters between two stacks according to a fixed rule, doubling
More informationThe Ohio State University. Colleges of the Arts and Sciences. Bachelor of Science Degree Requirements. The Aim of the Arts and Sciences
The Ohio State University Colleges of the Arts and Sciences Bachelor of Science Degree Requirements Spring Quarter 2004 (May 4, 2004) The Aim of the Arts and Sciences Five colleges comprise the Colleges
More informationCONSTRUCTION OF AN ACHIEVEMENT TEST Introduction One of the important duties of a teacher is to observe the student in the classroom, laboratory and
CONSTRUCTION OF AN ACHIEVEMENT TEST Introduction One of the important duties of a teacher is to observe the student in the classroom, laboratory and in other settings. He may also make use of tests in
More informationOCR for Arabic using SIFT Descriptors With Online Failure Prediction
OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,
More informationAQUA: An Ontology-Driven Question Answering System