

Segmenting Korean Compound Nouns using Statistical Information and a Preference Rule

Bo-Hyun Yun, Min-Jeung Cho, Hae-Chang Rim
Department of Computer Science, Korea University
1, 5-ka, Anam-dong, SEOUL, 136-701, KOREA
ybh@nlp.korea.ac.kr, cmj@nlp.korea.ac.kr, rim@nlp.korea.ac.kr

Abstract

This paper presents a method of segmenting Korean compound nouns by using statistical information and a preference rule. The statistical information is represented by CFP (Compound noun Formation Probability), which is built from the frequencies of affixes and the frequencies of two-syllabled and three-syllabled nouns. The preference rule is MNPR (Minimal Noun Preference Rule), which prefers the structure pattern of a compound noun with the minimal number of unit nouns. Moreover, we apply three kinds of heuristics in order to segment compound nouns that include unknown unit nouns. Experimental results show that the precision of the proposed method is approximately 96% on average. Furthermore, the experiments show that the proposed method can segment compound nouns including unknown nouns and maintains a constant precision rate when segmenting compound nouns extracted from various domains.

1 Introduction

Segmenting a compound noun (CN) in a raw corpus is one of the crucial issues for natural language processing systems such as machine translation systems, information retrieval systems, and spelling checkers. It is necessary to segment Korean compound nouns correctly in order to select the right target lexemes in machine translation, to increase the recall rate in information retrieval, and to correct spacing errors of compound nouns in spelling checking. However, segmentation is a difficult problem because a Korean compound noun consists of more than one unit noun written without blanks and because a compound noun may have many ambiguous segmentations.
In segmenting Korean compound nouns in a raw corpus, we have to consider the following problems:

1) A raw corpus contains various eojeols (footnote 1), such as verbals and adjectivals, that have to be eliminated.
2) An eojeol that includes a compound noun has several suffixes that have to be removed.
3) There are many ambiguous segmentations to be resolved in Korean compound nouns.
4) Because not all unit nouns (UNs) can be registered in a lexicon, there are many compound nouns that include unknown unit nouns.

In this research, we solve the first and the second problems by using a morphological analyzer[8] and a POS (Part-Of-Speech) tagger[6, 7, 9] (footnote 2), and we propose solutions only for the third and the fourth problems.

To analyze compound nouns in Japanese, Yosiyuki et al.[11] use collocation information and a thesaurus. The accuracy of this method is about 80%. For Chinese, Nie et al.[10] first segment a text by using a rule- and dictionary-based method. Then a hybrid approach is applied to locate candidates for the unknown words contained therein, and the segmentation process is run again. This method shows an accuracy of 96.51%.

For Korean compound nouns, several segmentation methods[3, 4, 12] have been proposed. Choi[4] applies structure patterns of compound nouns in order and then segments compound nouns, but this method cannot resolve ambiguous segmentations. Yun et al.[12] apply several structure patterns and resolve ambiguous segmentations by using the frequencies of head words and statistical preference rules. However, neither method can segment compound nouns that include unknown unit nouns. Chang et al.[3] construct a trie to store corpus information, insert dummy nodes to mark the end of a noun in the learning phase, and analyze the compound noun by using the constructed trie in the application phase. But the performance of this method depends on specific domains.

To solve these problems, we propose a method of segmenting compound nouns based on statistical information, CFP, and a preference rule, MNPR.

Footnote 1: Eojeol is the spacing unit in Korean, like a word in English. An eojeol consists of one or more morphemes. It sometimes corresponds to a word or a phrase in English.
Footnote 2: The morphological analyzer and POS tagger have been developed at the NLP Lab. of Korea University.

2 Statistical Information and Preference Rule

2.1 Statistical Information

To acquire statistical information, we assume that the structure of Korean compound nouns can be expressed as a binary tree. The binary tree consists of a specifier and a head. That is, the structure of Korean compound nouns corresponds to the Binary Branch Structure (BBS) based on X' theory in linguistics[5]. Figure 1 shows that the specifier and the head have the recursive property, as indicated by the symbol '+' in the structure of Korean compound nouns. The specifier and the head can also have a subspecifier and a subhead, respectively. In this research, to simplify the acquisition of statistical information, we define the unit noun between a specifier and a head as an intermediate.

Figure 1: The Structure of Compound Noun

Based on the above structure, the frequencies of two-syllabled and three-syllabled nouns are obtained from 81,276 compound nouns registered in the dictionary of Kumsung Publishing Company as follows:

- The first unit noun $N_1$ is counted as the specifier.
- The middle unit nouns $N_2, \ldots, N_{n-1}$ are counted as the intermediate.
- The last unit noun $N_n$ is counted as the head.

As those compound nouns carry the mark '-', which stands for a correct segmentation, it is easy to distinguish the specifier, the intermediate, and the head.

The frequencies of one-syllabled affixes are acquired from 4,486 three-syllabled compound nouns with an $N_1$-$N_2$ form as follows:

- If $N_1$ is one syllable, $N_1$ is counted as a prefix.
- If $N_2$ is one syllable, $N_2$ is counted as a suffix.

By using the frequency data, we define CFP as in equation (1):

$$CFP(N_1, \ldots, N_n) = \log\Big( P(N_1 \mid S_{Pre}) \prod_{i=2}^{n-1} P(N_i \mid S_{In}) \; P(N_n \mid S_{Post}) \Big) \tag{1}$$

$$P(N_1 \mid S_{Pre}) = \frac{C(N_1; S_{Pre})}{\sum_{i=1}^{n} C(N_i; S_{Pre})} \tag{2}$$

$$P(N_i \mid S_{In}) = \frac{C(N_i; S_{In})}{\sum_{i=1}^{n} C(N_i; S_{In})} \tag{3}$$

$$P(N_n \mid S_{Post}) = \frac{C(N_n; S_{Post})}{\sum_{i=1}^{n} C(N_i; S_{Post})} \tag{4}$$

where $S_{Pre}$, $S_{In}$, and $S_{Post}$ are the states of a specifier, an intermediate, and a head, respectively. $C(N_1; S_{Pre})$, $C(N_i; S_{In})$, and $C(N_n; S_{Post})$ are the frequencies with which a unit noun is used as a specifier, an intermediate, and a head, respectively. Equation (1) is calculated by multiplying the probability that $N_1$ is used as a specifier, the probabilities that $N_2, \ldots, N_{n-1}$ are used as intermediates, and the probability that $N_n$ is used as a head[2]. In other words, CFP represents the capacity of unit nouns to form a compound noun. By taking the logarithm, we force the value to range from 0 to $-\infty$. In equation (2), $P(N_1 \mid S_{Pre})$ expresses the probability that $N_1$ is used as a specifier. Likewise, the probabilities in equations (3) and (4) have meanings similar to that of the probability in equation (2).

2.2 Preference Rule

The preference rule, MNPR, is a rule acquired from an empirical study. Its basic principle is based on the MAP (Minimal Attachment Principle), which is applied in syntactic analysis[1]. The MAP is the principle that a parse tree with the fewest nodes is preferred in resolving structural ambiguity. Similarly, we define MNPR based on MAP as follows:

MNPR (Minimal Noun Preference Rule): If the number of unit nouns differs among ambiguous segmentations, we prefer the structure pattern with the minimal number of unit nouns.

3 Segmentation Algorithm

The algorithm of segmenting compound nouns is shown in Figure 2. First, we apply structure patterns of compound nouns by consulting a general noun dictionary with 50,518 entries. If exactly one result is generated, we regard it as the correct segmentation. If the given compound noun can be segmented ambiguously, we resolve the ambiguity by using CFP and MNPR. The method of resolving an ambiguous segmentation is explained in detail in Section 3.1. If a compound noun cannot be segmented, we regard it as a compound noun including unknown unit nouns and segment it by the method suggested in Section 3.2.

    Segment_CN(CN)
    {
        Apply structure patterns of compound nouns
        if (one segmentation result)
            Print the segmentation result
        else if (several segmentation results)
            Resolve_Ambiguity()
        else if (no segmentation result)
            Segment_CN_including_Unknown_Word(CN)
    }

Figure 2: A Segmentation Algorithm

3.1 Resolving Ambiguous Segmentations

The algorithm of resolving ambiguous segmentations proceeds differently according to the number of unit nouns. If the number of segmented unit nouns is the same among ambiguous segmentations, we apply the statistical information, CFP; otherwise, we apply the preference rule, MNPR.

First, if the number of unit nouns is the same among ambiguous segmentations, we apply CFP to segment the compound noun. Table 1 shows the totals of the frequency data used as a specifier, an intermediate, and a head for the calculation of CFP.
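The decision rule of this section can be sketched in Python. The sketch is illustrative only: the names `cfp` and `resolve_ambiguity` are hypothetical and do not come from the paper. It uses the totals reported in Table 1 and the unit-noun frequencies quoted in the 'bujeonghapgukja' example of this section. It also assumes base-10 logarithms and that each probability is normalized by the total for the noun's syllable length; under these assumptions the sketch reproduces the CFP values -7.9866 and -6.6507 reported for that example to about three decimal places. Using CFP to break a tie that remains after MNPR is a further assumption, since the paper specifies only the two pure cases.

```python
import math

# Totals from Table 1, keyed by (state, syllable count of the unit noun).
TOTALS = {
    ("pre", 2): 71154, ("pre", 3): 18556,
    ("in", 2): 11160, ("in", 3): 2022,
    ("post", 2): 67901, ("post", 3): 21883,
}

# Syllable counts and frequencies quoted in the Section 3.1 example.
SYLLABLES = {"bujeonghap": 3, "gukja": 2, "bujeong": 2, "hapgukja": 3}
COUNTS = {
    ("bujeonghap", "pre"): 1, ("gukja", "post"): 13,
    ("bujeong", "pre"): 87, ("hapgukja", "post"): 4,
}


def cfp(units):
    """Log10 CFP (Eq. 1) of a segmentation with at least two unit nouns."""
    # First noun is the specifier, last is the head, the rest are intermediates.
    states = ["pre"] + ["in"] * (len(units) - 2) + ["post"]
    total = 0.0
    for noun, state in zip(units, states):
        # Assumption: normalize by the Table 1 total for this syllable length.
        # Unseen nouns (zero or missing counts) are not handled in this sketch.
        prob = COUNTS[(noun, state)] / TOTALS[(state, SYLLABLES[noun])]
        total += math.log10(prob)
    return total


def resolve_ambiguity(candidates):
    """Pick one segmentation: MNPR first, then CFP among equal-length ones."""
    fewest = min(len(c) for c in candidates)
    shortest = [c for c in candidates if len(c) == fewest]
    if len(shortest) == 1:
        return shortest[0]          # MNPR alone decides
    return max(shortest, key=cfp)   # same number of unit nouns: highest CFP


same_length = [["bujeonghap", "gukja"], ["bujeong", "hapgukja"]]
print(resolve_ambiguity(same_length))

different_length = [["golf", "jangsa", "upja"], ["golfjang", "saupja"]]
print(resolve_ambiguity(different_length))
```

The second call never evaluates CFP, because the candidates differ in their number of unit nouns and MNPR alone selects the two-noun segmentation.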
Table 1: Summations of each Specifier, Intermediate, and Head

Type                                  2-Syllable   3-Syllable
$\sum_{i=1}^{n} C(N_i; S_{Pre})$      71154        18556
$\sum_{i=1}^{n} C(N_i; S_{In})$       11160        2022
$\sum_{i=1}^{n} C(N_i; S_{Post})$     67901        21883

For instance, a compound noun 'bujeonghapgukja' (an illegally successful candidate) can be segmented into both 'bujeonghap/gukja' (a disharmonious lattice) and 'bujeong/hapgukja' (an illegally successful candidate). The frequencies of the unit nouns are as follows:

$C(bujeonghap; S_{Pre}) = 1$
$C(gukja; S_{Post}) = 13$
$C(bujeong; S_{Pre}) = 87$
$C(hapgukja; S_{Post}) = 4$

By using the above frequencies, we calculate the CFPs of the two candidates as follows:

$$CFP(bujeonghap/gukja) = \log\big(P(bujeonghap \mid S_{Pre}) \, P(gukja \mid S_{Post})\big) = -7.9866$$

$$CFP(bujeong/hapgukja) = \log\big(P(bujeong \mid S_{Pre}) \, P(hapgukja \mid S_{Post})\big) = -6.6507$$

These values are consistent, up to rounding, with dividing each frequency by the Table 1 total for the same state and syllable length and taking base-10 logarithms, for example $P(bujeong \mid S_{Pre}) = 87/71154$. Because CFP(bujeong/hapgukja) is larger than CFP(bujeonghap/gukja), 'bujeonghapgukja' is segmented into 'bujeong/hapgukja'.

Second, if the number of unit nouns is different, we resolve an ambiguous segmentation by MNPR. For example, a compound noun 'golfjangsaupja' (a golf course businessman) can be segmented into both 'golf/jangsa/upja' (a golf trade businessman) and 'golfjang/saupja' (a golf course businessman). The number of unit nouns in 'golf/jangsa/upja' is 3 and the number of unit nouns in 'golfjang/saupja' is 2. By MNPR, we choose 'golfjang/saupja' as the correct segmentation because it has the smaller number of unit nouns.

3.2 Segmenting Compound Nouns including Unknown Nouns

In general, because not all unit nouns can be registered in a lexicon, many compound nouns include unknown unit nouns. Most of the unknown unit nouns are three-syllabled nouns, foreign nouns, and nouns of a specific area. In this research, we segment these compound nouns through three phases.

First, if a noun of three or more syllables at a specific position is a known noun, we apply the structure pattern itself. The unit nouns of a specific position are underlined as follows:

6 syllable : 3/3, 4/2, 2/4
7 syllable : 2/3/2, 3/4, 4/3, 5/2, 2/5
8 syllable : 2/3/3, 3/3/2, 2/4/2, 3/5, 5/3, 6/2, 2/6
9 syllable : 3/3/3, 2/3/4, 2/4/3, 3/4/2, 4/3/2, 2/5/2, 3/6, 6/3, 2/7, 7/2
10 syllable : 2/4/4, 4/4/2, 2/4/3, 4/3/3, 3/4/3, 3/3/4, 3/5/2, 2/5/3

For example, a compound noun 'orengekaunti' (Orange County) has a known noun 'orenge' and an unknown noun 'kaunti'. By the structure pattern '3/3', the compound noun 'orengekaunti' is correctly segmented into 'orenge/kaunti'.

Second, if a two-syllabled noun is registered but a three-syllabled noun is not, we apply the frequencies of an affix. For instance, a compound noun 'gunchuksahuphoy' (an architect society) is first segmented into 'gunchuk/sa/huphoy' because 'gunchuk' and 'huphoy' are registered but 'gunchuksa' is not. Then, in order to decide whether the affix 'sa' is a prefix or a suffix, we use the frequencies with which it occurs as a prefix and as a suffix. The affix 'sa' was used 29 times as a prefix and 111 times as a suffix. Therefore, the compound noun 'gunchuksahuphoy' can be correctly segmented into 'gunchuksa/huphoy'.

Third, we assume the following default patterns as the patterns that are frequently segmented, and we apply them for a segmentation.

4 syllable : 2/2
5 syllable : 2/3
6 syllable : 2/2/2
7 syllable : 2/2/3
8 syllable : 2/2/2/2
9 syllable : 2/2/2/3
10 syllable : 2/2/2/2/2

Figure 3: System Configuration

4 Experimental Results

The system configuration that implements the proposed algorithm is shown in Figure 3. A raw text is analyzed by a morphological analyzer and is tagged by a POS tagger. Then, we extract N, N+N, N+N+N, and N+N+N+N forms from the POS-tagged corpus.
An N form, however, may be a unit noun or a compound noun owing to the recognition process of unknown nouns. Accordingly, we assume that unit nouns are registered in a unit noun dictionary and filter the unit nouns out of the N forms. As a result, the segmentation system receives only compound nouns as input and produces a single segmentation result for each.

We use three kinds of data to estimate the precision rate of the proposed algorithm. The first test data is 345 compound nouns that include a large number of unknown unit nouns. The second test data is 1,200 compound nouns extracted from about 1,000 documents of KTSET 2.0, which is used as a test set for information retrieval. The KTSET 2.0 test collection consists of 44,400 documents and 50 queries. It includes the relevance judgment of each document with respect to each query. The third test data is 1,644 compound nouns extracted in a balanced way and at random from corpora; they are drawn from 19,613 compound nouns that the Korean morphological analyzer cannot analyze.

We define the criteria for evaluating the segmentation algorithm as follows:

The inclusion rate of unknown nouns : D/B × 100
The rate of ambiguous segmentations : E/B × 100
The precision rate : F/B × 100

where B, D, E, and F are shown in Table 2.

Table 2: Experimental Results

Type                                             data 1   data 2   data 3
# of CNs in the Input (A)                        345      1200     1644
# of CNs Segmented by the System (B)             345      1200     1644
# of CNs including only Known UNs (C)            246      998      1208
# of CNs including at least one Unknown UN (D)   99       122      396
# of Ambiguously Segmented (E)                   121      259      395
# of CNs Correctly Segmented (F)                 330      1162     1575
Inclusion Rate of UN                             28.6%    12%      24.1%
Rate of Ambiguous Segmentations                  35%      21.5%    24%
Precision Rate                                   95.6%    96.8%    95.8%

From the results on the first and the third test data, we can say that the proposed algorithm can correctly segment compound nouns including unknown nouns. From the result on the second test data, we find that the proposed algorithm maintains a constant precision rate when segmenting compound nouns extracted from various domains.

In Table 3, we show a data analysis of CFP, MNPR, and the heuristics used in resolving ambiguous segmentations, where B and F are as defined in Table 2. The table shows that CFP and MNPR are useful information for resolving ambiguous segmentations. But the heuristics for segmenting compound nouns including at least one unknown unit noun show a precision of 78%. This means that there is still plenty of room for improvement.

Table 3: Data Analysis of CFP, MNPR, Heuristics

Method       B     F     Precision
CFP          137   122   99%
MNPR         307   294   95.5%
Heuristics   99    77    78%

Our proposed method is compared with other research in Table 4. In this table, 'Segmentation' means the segmentation of CNs including unknown nouns and 'Resolution' means the resolution of ambiguous segmentations. The table shows that the proposed method can segment compound nouns including unknown nouns and resolve ambiguous segmentations at a better precision rate.

In Table 5, we compare our method with that of Chang[3] separately from the other existing studies.
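The three evaluation criteria defined in this section are plain ratios over the counts in Table 2. As a minimal sketch (the function name `evaluation_rates` and its parameter names are illustrative and not from the paper), the rates for the first test data can be computed as follows. The last digit may differ slightly from the table depending on how the published values were rounded.

```python
def evaluation_rates(b, d, e, f):
    """Return (inclusion rate, ambiguity rate, precision rate) in percent.

    b: CNs segmented by the system, d: CNs with at least one unknown UN,
    e: ambiguously segmented CNs, f: correctly segmented CNs (Table 2).
    """
    return (100.0 * d / b, 100.0 * e / b, 100.0 * f / b)


# Counts for test data 1, taken from Table 2.
inclusion, ambiguity, precision = evaluation_rates(b=345, d=99, e=121, f=330)
print(f"inclusion {inclusion:.2f}%  ambiguity {ambiguity:.2f}%  precision {precision:.2f}%")
```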
Table 4: Results of Comparison 1

Factor         Yun95   Choi96   Proposed
Segmentation   No      No       Yes
Resolution     Yes     No       Yes
Precision      82%     83%      95.6%

The comparison with Chang[3] is kept separate because Yun[12] and Choi[4] use dictionary-based methods, whereas Chang[3] uses a corpus-based method. In Table 5, 'Trained' means the data used to construct the trie and to acquire statistical information, and 'Untrained' means data, other than the trained data, used to evaluate the precision rate. This result shows that the proposed method can maintain a constant precision rate regardless of the specific area.

Table 5: Results of Comparison 2

Data        Chang96   Proposed
Trained     97.66%    98.0%
Untrained   87.75%    95.6%
KTSET 2.0   85.43%    96.8%

5 Conclusion

In this paper, we have presented four requirements necessary for segmenting Korean compound nouns in a raw corpus and suggested a method of segmenting Korean compound nouns into unit nouns. We applied structure patterns of compound nouns and resolved ambiguous segmentations by using statistical information, CFP, and a preference rule, MNPR. The experimental results have shown that the precision rate is about 96%. The experiments have shown that the proposed method can segment compound nouns including unknown nouns and maintain a constant precision rate when segmenting compound nouns extracted from various domains.

In future work, we will try to improve the accuracy of segmenting compound nouns that include unknown unit nouns. In addition, we will apply the segmentation method to compound noun indexing in order to improve the performance of an information retrieval system.

References

[1] J. Allen, Natural Language Understanding, The Benjamin/Cummings Publishing Company Inc., 1995.
[2] E. Charniak, C. Hendrickson, N. Jacobson, and M. Perkowitz, "Equations for Part-of-speech Tagging," Proc. of the Eleventh National Conference on Artificial Intelligence, pp. 784-789, 1993.
[3] D.H. Chang, S.H. Myaeng, "A Korean Compound Noun Analysis method for Effective indexing," Hangul and Korean Information Processing Conference, pp. 32-35, 1996. (in Korean)
[4] J.H. Choi, "A Division Method of Korean Compound Noun by number of syllable," Hangul and Korean Information Processing Conference, pp. 262-267, 1996. (in Korean)
[5] W.S. Jung, Word Formation Theory of Korean language, 1st Ed., p. 267, Hansin-Culture Publishing Company, 1994. (in Korean)
[6] J.D. Kim, A Korean Part-of-Speech Tagging Model Based on Morpheme-unit with Eojeol Context, M.S. Dissertation, Korea University, 1996. (in Korean)
[7] S.Z. Lee, Two-level Korean Part-of-Speech Tagging using HMM, M.S. Dissertation, Korea University, 1994. (in Korean)
[8] H.S. Lim, Korean Morphological Analyzer based on Classification of Ambiguity pattern, M.S. Dissertation, Korea University, 1993. (in Korean)
[9] H.S. Lim, J.D. Kim, H.C. Rim, "Improvement of Transformation Rule-Based Korean Part-Of-Speech Tagger," Hangul and Korean Information Processing Conference, pp. 216-221, 1996. (in Korean)
[10] J.Y. Nie, M.L. Hannan, W. Jin, "Combining Dictionary, Rules and Statistical Information in Segmentation of Chinese," Computer Processing of Chinese and Oriental Languages, Vol. 9, No. 2, pp. 125-143, 1995.
[11] K. Yosiyuki, T. Takenobu, T. Hozumi, "Analysis of Japanese Compound Nouns using Collocation information," Proc. of the 14th Conference on Computational Linguistics (COLING-94), pp. 865-869, 1994.
[12] B.H. Yun, H.S. Lim, H.C. Rim, "Analysis of Korean Compound Nouns using Statistical Information," Proc. of the 22nd Korea Information Science Society Spring Conference, pp. 925-928, April 1994.