Research and Implementation of Unlisted Word Discovery System


2017 2nd International Conference on Mechanical Control and Automation (ICMCA 2017)
ISBN: 978-1-60595-460-8

Research and Implementation of Unlisted Word Discovery System

Shi-wei JIA 1,a,* and Yu-meng ZHANG 2

1 Department of Library and Information Archives, Shanghai University, Shanghai, China
2 Business School of Ningbo University, Ningbo, China
a sw_jia@foxmail.com
*Corresponding author

Keywords: Unlisted Word, Apriori Algorithm, Transaction Compression.

Abstract. Unlisted words are a persistent problem in Chinese word segmentation. This paper proposes an improved Apriori algorithm that can identify unlisted words quickly and accurately. The improved algorithm applies a database-compression approach to reduce the number of transactions. Compared with the traditional n-gram algorithm and the NApriori algorithm, it is faster and more effective.

Introduction

Automatic identification of unlisted words is an important problem in Chinese information processing, with a wide range of applications in information retrieval, information filtering and other areas [1]. As society develops, large numbers of unlisted words have appeared, greatly increasing the difficulty of Chinese information processing and causing frequent errors in automatic Chinese word segmentation. A large body of literature shows that the errors caused by unlisted words far exceed those caused by ambiguity [2-3]. To address this problem, researchers at home and abroad have put forward a variety of approaches, which fall roughly into three categories: dictionary-based methods, comprehension-based methods, and statistical methods. However, unlisted words are by definition words that do not exist in the dictionary, so dictionary-based approaches cannot find them, and comprehension-based methods are still at an early stage of study. The statistics-based approach is therefore the current mainstream method [4-6].
For example, Nie used statistical measures and the general likelihood ratio to calculate inter-character correlation, so that unlisted words could be obtained automatically from a corpus to segment text or to construct a dictionary automatically [6]. As is well known, there are no explicit delimiters between Chinese words, so mature theory in this area remains limited, and the main research results have come from domestic scholars. For example, Professor Chen proposed a package of solutions for the unlisted-word problem [7], Professor Liang designed the CDWS model [8], and Professor He put forward the concept of an expert word segmentation system [9]. In recent years, recognition of proper nouns has achieved good results, and relatively mature word recognition systems have taken shape, represented by the Chinese Academy of Sciences' Chinese word segmentation system (ICTCLAS), the PanGu segmentation system and others. For personal-name recognition, systems can now identify

"Lao Zhang", "Xiao Li" and other colloquial name forms, with high accuracy. But for new words from the network, there is still no good way to identify them. Building on previous research, this paper proposes an improved Apriori algorithm based on ideas from data mining. The algorithm uses the anti-monotonicity property to compress the number of transactions and reduce the number of candidate strings, improving both the efficiency and the quality of unlisted word identification.

Design of the Unlisted Word Recognition Algorithm

Corpus Construction

A corpus, also known as a language database, is the basic resource for unlisted word recognition. With the development of society and the spread of the Internet, a large number of new network words have emerged, causing great difficulty for Chinese information processing. The traditional static corpus can no longer meet the demands of word segmentation, so it is imperative to construct corpora from Internet resources. In China, portal sites are the main places where new network words appear. To construct a comprehensive, high-quality corpus, the average daily page views, update frequency and other properties of the major domestic portal websites were analyzed, and Sina was finally adopted as the information source. Crawler technology is used to fetch a large number of web pages, whose content is then parsed to build a clean corpus.

Text Preprocessing

To improve recognition speed, the text should be cut into strings that are as short as possible. Because the corpus comes from the Internet, its format is not standardized, so text preprocessing is a critical step. Although there are no explicit delimiters between Chinese words, two types of separators can be used:

(1) Non-Chinese characters, including punctuation, numbers, letters, etc.
(2) Noise words: words or characters with weak word-forming ability, such as "the" and "ah". Such words are extremely common but rarely express document-related information.

Using these separators, the text can be cut into a set of short fragments, and the unlisted words lie within these fragments.

An Improved Apriori Recognition Algorithm

In the unlisted word recognition process, an effective statistical identification method can greatly shorten the computation time and improve operational efficiency. The traditional n-gram method produces a large number of candidate strings, making the statistics time-consuming, and the performance of the Apriori algorithm degrades as the number of transactions grows; neither meets the requirements of massive data processing. To improve efficiency, this paper proposes an improved Apriori unlisted word recognition algorithm that can identify unlisted words quickly and accurately. The improved algorithm compresses the number of transactions in the transaction database, performs two pruning processes at the pruning step, and marks useless transactions so that they are skipped on the next scan. This approach greatly reduces the number of transactions and lowers I/O overhead. It also corrects the valid frequency with which a string appears in a text, improving the accuracy of unlisted word recognition. The algorithm steps are as follows:

(1) Transaction pruning step

Scan the transaction database DB to obtain the k-itemsets and their support counts. According to the minimum support count, obtain the frequent k-itemsets Lk; the infrequent itemsets are tagged with tag = 0, meaning they are skipped on the next scan.

(2) Candidate string forming step

To compute Lk+1, according to the Apriori property, select from Lk all joinable pairs of frequent k-itemsets to form the set of candidate (k+1)-itemsets, denoted Ck+1. Assuming the items in each itemset are sorted in word order, two frequent k-itemsets are joinable if they differ only in their last item.

(3) Itemset pruning step

According to the Apriori property, any (k-1)-item subset of a frequent k-itemset must itself be frequent. Thus, if any (k-1)-item subset of a candidate k-itemset in Ck is not in Lk-1, that candidate cannot be frequent and can be removed from Ck.

(4) Valid support count modification step

After all frequent k-itemsets have been found, correct the valid support counts to identify the real unlisted words.

Valid Support Count Correction

After the statistical phase, the algorithm often yields meaningless high-frequency strings that are split from their parent strings. For example, "zu sai" (group match) is a sub-string of "xiao zu sai" (group stage). Such raw counts do not reflect the strings' independent real frequencies, so the support counts must be corrected to exclude the interference of these high-frequency strings [10]. The valid support count of a string equals its occurrence frequency minus the frequency of its most frequent super-string:

Valid(x_i) = Fre(x_i) - Max{Fre(sup(x_i))}

where Fre(x_i) denotes the frequency of the candidate string x_i, and sup(x_i) denotes a frequent super-string of x_i.

Algorithm Example

For example, the string "ningbodaxuezainingboshijiangbeiqu, woshiningbodaxuede xuesheng." (Ningbo University is in Ningbo's Jiangbei District; I am a student of Ningbo University) is divided into the set S = {"ningbodaxue" (Ningbo University), "ningboshijiangbeiqu" (Ningbo City Jiangbei District), "ningbodaxue" (Ningbo University), "xuesheng" (student)}.

(1) Set the minimum support count to 2. Scan the fragment set S and count occurrences; after the first scan iteration, obtain the candidate 1-string set C1. According to the minimum support count, obtain the frequent 1-strings, and set tag = 0 for all infrequent itemsets so that they are never scanned again.

(2) Join the frequent 1-string set to generate a new candidate 2-string set, calculate the support count of each candidate string, and finally determine the frequent 2-string set.

(3) Repeat step (2) until no frequent k-string set can be found. After the level-wise scan, the frequent 3-strings and frequent 4-strings are obtained, as shown in Table 1.
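As an illustration, the level-wise scan of steps (1) to (3) can be sketched in Python. This is a minimal sketch, not the authors' implementation: it operates on the romanized fragments rather than Chinese characters, and a simple containment test stands in for the tag-based transaction compression and the Lk join step.

```python
from collections import Counter

def frequent_substrings(fragments, min_support=2):
    """Level-wise scan: count k-character substrings, keep the frequent
    ones, and drop fragments that no longer contain any frequent string
    (mimicking the tag = 0 transaction compression)."""
    active = list(fragments)
    k, frequent = 1, {}
    while active:
        counts = Counter()
        for frag in active:
            for i in range(len(frag) - k + 1):
                counts[frag[i:i + k]] += 1
        level = {s: c for s, c in counts.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
        # Transaction compression: skip fragments with no frequent k-string.
        active = [f for f in active if any(s in f for s in level)]
        k += 1
    return frequent

S = ["ningbodaxue", "ningboshijiangbeiqu", "ningbodaxue", "xuesheng"]
freq = frequent_substrings(S, min_support=2)
print(freq["ningbo"], freq["ningbodaxue"])  # 3 2
```

On the example set S this reproduces the raw counts of Table 1, e.g. support 3 for "ningbo" and 2 for "ningbodaxue"; "xuesheng" occurs only once and is pruned.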

Table 1. Support count of frequent strings.

String                           Support count
ningbo (Ningbo)                  3
boda (big wave)                  2
daxue (university)               2
ningboda (Ningbo big)            2
bodaxue (wave university)        2
ningbodaxue (Ningbo University)  2

(4) The support counts are corrected to show the true frequency of each candidate string, as shown in Table 2.

Table 2. Valid support count of frequent strings.

String                           Valid support count
ningbo (Ningbo)                  1
boda (big wave)                  0
daxue (university)               0
ningboda (Ningbo big)            0
bodaxue (wave university)        0
ningbodaxue (Ningbo University)  2

Experimental Results and Analysis

The experiments were run on an A6-4400M platform with 4 GB of memory under Windows 7. Texts of different lengths were selected randomly for testing. The algorithm proposed in this paper, the algorithm from reference [11], and the traditional n-gram algorithm were applied to unlisted word extraction. The experimental results cover two aspects: (1) frequent string extraction efficiency; (2) frequent string extraction accuracy.

The Efficiency of Extraction

(k+1)-candidate strings are generated from the frequent k-strings, looping until no more candidate strings are generated. The experimental results are shown in Table 3.

Table 3. Time consumption for candidate strings of different lengths (average time in seconds).

Number of test strings  Number of tests  Traditional algorithm (s)  Algorithm in [11] (s)  This algorithm (s)
100                     10               8.22                       7.53                   3.33
200                     10               23.36                      20.35                  7.81
500                     10               97.88                      75.49                  24.20
1000                    10               249.24                     231.58                 55.85
2000                    1                939.48                     908.85                 174.87
5000                    1                2126.04                    2074.94                426.979

For the same corpus, the algorithm proposed in this paper takes the least time of the three, and its running time grows the most slowly, meeting the time requirements for processing massive data; its extraction efficiency is the highest.
This is due to the improvement of the Apriori algorithm, which greatly compresses the number of scanned transactions, reducing the number of candidate strings generated and hence the running time of the algorithm. The total number of candidate strings is shown in Figure 1, where the abscissa is the length of the test text and the ordinate is the number of candidate strings.
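Before turning to extraction quality, the valid support count correction from step (4), Valid(x_i) = Fre(x_i) - Max{Fre(sup(x_i))}, can be sketched as follows. This is a minimal illustration in Python, not the authors' code, applied to the raw counts of the running example (Table 1); function and variable names are illustrative.

```python
def valid_support(freq):
    """Valid(x) = Fre(x) - Max{Fre(sup(x))}: subtract the count of the
    most frequent frequent super-string, if one exists."""
    valid = {}
    for s, f in freq.items():
        supers = [f2 for s2, f2 in freq.items() if s2 != s and s in s2]
        valid[s] = f - max(supers) if supers else f
    return valid

# Raw supports from the running example (Table 1).
table1 = {"ningbo": 3, "boda": 2, "daxue": 2,
          "ningboda": 2, "bodaxue": 2, "ningbodaxue": 2}
print(valid_support(table1)["ningbodaxue"])  # 2: a real word survives
print(valid_support(table1)["boda"])         # 0: a split fragment is excluded
```

This reproduces Table 2: only "ningbo" (1) and "ningbodaxue" (2) retain nonzero valid counts, and with a minimum support count of 2 only "ningbodaxue" is reported as an unlisted word.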

Figure 1. Quantity comparison of candidate strings.

The Quantity and Quality of Extraction

The samples above are taken as the test objects. The algorithm corrects the valid counts of the final strings and selects the strings that satisfy the minimum support count condition. The experimental results are shown in Table 4.

Table 4. The effect of frequent word extraction.

Doc id  Length of text  Number of words  Number of correct words  Correct rate
1       456             17               16                       94.1%
2       1035            19               18                       94.7%
3       2780            28               27                       96.4%
4       5170            98               95                       96.9%
5       11574           307              292                      95.1%
6       22535           650              618                      95.1%

To test the validity of this algorithm, it was compared with the algorithm in reference [11] and the traditional n-gram method. The novel The Stewardess was taken as the test data, with the minimum support count set to 4. The experimental results are shown in Table 5.

Table 5. The comparison of different word extraction methods.

Method                  Number of frequent words  Number of correct words  Correct rate
Traditional approach    246                       237                      96.34%
Approach in [11]        211                       203                      96.21%
Approach in this paper  225                       217                      96.44%

Acknowledgement

Fund Project: Zhejiang Provincial Department of Education Research Project (Y200907096).

References

[1] M. Sun, J. Zou. A critical appraisal of the research on Chinese word segmentation [J]. Contemporary Linguistics, 2001. (In Chinese)

[2] Dexin Zhang. "A clear stream is avoided by fish": my view on the word standard [J]. Journal of Peking University (Philosophy and Social Sciences), 2000(5): 105-118. (In Chinese)

[3] Changning Huang, Hai Zhao. Ten years of Chinese word segmentation [J]. Chinese Journal of Information, 2007, 21(3): 8-19. (In Chinese)

[4] Aiyuan He. Research on Chinese word segmentation algorithms based on dictionaries and probability statistics [D]. Liaoning University, 2011. (In Chinese)

[5] Ling G. C., Asahara M., Matsumoto Y. Chinese unknown word identification using character-based tagging and chunking [C]// Proceedings of the Meeting of the Association for Computational Linguistics, 2003: 197-200.

[6] Jian-Yun Nie. Unknown word detection and segmentation of Chinese using statistical and heuristic knowledge [J]. Communications of COLIPS, 5(1&2): 47-57.

[7] Chen X. A package scheme for identifying unlisted words in Chinese segmentation [J]. Applied Linguistics, 1999. (In Chinese)

[8] Nanyuan Liang. A written Chinese automatic word segmentation system: CDWS [J]. Chinese Journal of Information, 1987, 1(2): 46-54. (In Chinese)

[9] KeKang He, Hui Xu, Bo Sun. Design principles of an expert system for automatic word segmentation of written Chinese [J]. Chinese Information Technology, 1991, 5(2): 1-14. (In Chinese)

[10] Zhang Y., Liu C. An improved fast algorithm of frequent string extracting with no thesaurus [C]// MICAI 2007: Advances in Artificial Intelligence. Springer Berlin Heidelberg, 2007: 894-903.

[11] Guo J. M., Song S. L., Li S. S. Improved algorithm based on Apriori algorithm [J]. Computer Engineering & Design, 2008.