Segmenting Korean Compound Nouns using Statistical Information and a Preference Rule

Bo-Hyun Yun, Min-Jeung Cho, Hae-Chang Rim
Department of Computer Science, Korea University
1, 5-ka, Anam-dong, SEOUL, , KOREA
ybh@nlp.korea.ac.kr, cmj@nlp.korea.ac.kr, rim@nlp.korea.ac.kr

Abstract

This paper presents a method of segmenting Korean compound nouns by using statistical information and a preference rule. The statistical information is represented by CFP (Compound noun Formation Probability), which is built from both the frequencies of affixes and the frequencies of two-syllable and three-syllable nouns. The preference rule is MNPR (Minimal Noun Preference Rule), which prefers the structure pattern of a compound noun with the minimal number of unit nouns. Moreover, we apply three kinds of heuristics in order to segment compound nouns that include unknown unit nouns. Experimental results show that the precision of the proposed method is approximately 96% on average. Furthermore, the experiments show that the proposed method can segment compound nouns including unknown nouns and maintain a constant precision rate in segmenting compound nouns extracted from various domains.

1 Introduction

Segmenting a compound noun (CN) in a raw corpus is one of the crucial issues for natural language processing systems such as machine translation systems, information retrieval systems, and spelling checkers. Korean compound nouns must be segmented correctly in order to select the right target lexemes in machine translation, to increase the recall rate in information retrieval, and to correct spacing errors of compound nouns in spelling checking. However, segmentation is a difficult problem because a Korean compound noun consists of more than one unit noun written without blanks and because a compound noun may have many ambiguous segmentations.
In segmenting Korean compound nouns in a raw corpus, we have to consider the following problems:

1) A raw corpus contains various eojeols¹, such as verbals and adjectivals, that must be eliminated.
2) An eojeol including a compound noun has several suffixes that must be removed.
3) There exist many ambiguous segmentations to be resolved in Korean compound nouns.
4) Because not all unit nouns (UNs) can be registered in a lexicon, there are many compound nouns including unknown unit nouns.

¹ An eojeol is the spacing unit in Korean, like a word in English. An eojeol consists of one or more morphemes. It sometimes corresponds to a word or a phrase in English.

In this research, we have solved the first and the second problems by using a morphological analyzer [8] and a POS (Part-Of-Speech) tagger² [6, 7, 9], and we suggest solutions only for the third and the fourth problems.

² The morphological analyzer and POS tagger have been developed at the NLP Lab. of Korea University.

To analyze compound nouns in Japanese, Yosiyuki et al. [11] use collocation information and a thesaurus. The accuracy of this method is about 80%. For Chinese, Nie et al. [10] first segment a text by using a rule- and dictionary-based method. Then a hybrid approach is applied to locate candidates for the unknown words contained therein, and the segmentation process is run again. This method shows an accuracy of 96.51%.

For Korean compound nouns, several segmentation methods [3, 4, 12] have been proposed. Choi [4] applies structure patterns of compound nouns in order and then segments compound nouns, but this method cannot resolve ambiguous segmentations. Yun et al. [12] apply several structure patterns and resolve ambiguous segmentations by using the frequencies of head words and statistical preference rules. However, neither method can segment compound nouns including unknown unit nouns. Chang et al. [3] construct a trie to store corpus information, insert dummy nodes to mark the end of a noun in the learning phase, and analyze the compound noun by using the constructed trie in the application phase. But the performance of this method depends on the specific domain. To solve these problems, we propose a method of segmenting compound nouns based on statistical information, CFP, and a preference rule, MNPR.

2 Statistical Information and Preference Rule

2.1 Statistical Information

To acquire statistical information, we assume that the structure of Korean compound nouns can be expressed as a binary tree consisting of a specifier and a head. That is, the structure of Korean compound nouns corresponds to the Binary Branch Structure (BBS) based on X' theory in linguistics [5]. Figure 1 shows that the specifier and the head have a recursive property, indicated by the symbol '+', in the structure of Korean compound nouns. The specifier and the head can also have a subspecifier and a subhead, respectively. In this research, to simplify the acquisition of statistical information, we define a unit noun between a specifier and a head as an intermediate.

Figure 1: The Structure of Compound Noun

Based on the above structure, the frequencies of two-syllable and three-syllable nouns are obtained from 81,276 compound nouns registered in the dictionary of Kumsung Publishing Company as follows:

The first unit noun $N_1$ is counted as the specifier.
The middle unit nouns $N_2, \ldots, N_{n-1}$ are counted as intermediates.
The last unit noun $N_n$ is counted as the head.

As those compound nouns carry the mark '-', which stands for a correct segmentation, it is easy to distinguish the specifier, the intermediate, and the head.

The frequencies of one-syllable affixes are acquired from 4,486 three-syllable compound nouns of the form $N_1$-$N_2$ as follows:

If $N_1$ is one syllable, $N_1$ is counted as a prefix.
If $N_2$ is one syllable, $N_2$ is counted as a suffix.

By using the frequency data, we can define CFP as in equation (1):

$$\mathrm{CFP}(N_1, \ldots, N_n) = \log\Bigl( P(N_1 \mid S_{Pre}) \times \prod_{i=2}^{n-1} P(N_i \mid S_{In}) \times P(N_n \mid S_{Post}) \Bigr) \qquad (1)$$

$$P(N_1 \mid S_{Pre}) = \frac{C(N_1, S_{Pre})}{\sum_{i=1}^{n} C(N_i, S_{Pre})} \qquad (2)$$

$$P(N_i \mid S_{In}) = \frac{C(N_i, S_{In})}{\sum_{i=1}^{n} C(N_i, S_{In})} \qquad (3)$$

$$P(N_n \mid S_{Post}) = \frac{C(N_n, S_{Post})}{\sum_{i=1}^{n} C(N_i, S_{Post})} \qquad (4)$$

where $S_{Pre}$, $S_{In}$, and $S_{Post}$ are the states of a specifier, an intermediate, and a head, respectively. $C(N_1, S_{Pre})$, $C(N_i, S_{In})$, and $C(N_n, S_{Post})$ are the frequencies with which a noun is used as a specifier, an intermediate, and a head, respectively. Equation (1) multiplies the probability that $N_1$ is used as a specifier, the probabilities that $N_2, \ldots, N_{n-1}$ are used as intermediates, and the probability that $N_n$ is used as a head [2]. In other words, CFP represents the capacity of unit nouns to form a compound noun. By taking the logarithm, we force the value to range from 0 to $-\infty$. In equation (2), $P(N_1 \mid S_{Pre})$ expresses the probability that $N_1$ is used as a specifier. Likewise, the probabilities in equations (3) and (4) have meanings similar to that of the probability in equation (2).

2.2 Preference Rule

The preference rule, MNPR, is a rule acquired by an empirical study. Its basic principle is based on the MAP (Minimal Attachment Principle) that is applied in syntactic analysis [1]. The MAP is the principle that a parse tree with the fewest nodes is preferred in resolving structural ambiguity. Similarly, we define MNPR based on MAP as follows:

MNPR (Minimal Noun Preference Rule): If the number of unit nouns differs among ambiguous segmentations, we prefer the structure pattern with the minimal number of unit nouns.

3 Segmentation Algorithm

The algorithm for segmenting compound nouns is shown in Figure 2. At first, we apply structure patterns of compound nouns by consulting a general noun dictionary with 50,518 entries. If one result is generated, we regard that segmentation as the correct one. If the given compound noun can be segmented ambiguously, we resolve the ambiguity by using CFP and MNPR. The method of resolving an ambiguous segmentation is explained in detail in Section 3.1. If a compound noun cannot be segmented, we regard it as a compound noun including unknown unit nouns and segment it by the method suggested in Section 3.2.

    Segment_CN(CN)
    {
        Apply structure patterns of compound nouns
        if (one segmentation result)
            Print the segmentation result
        else if (several segmentation results)
            Resolve_Ambiguity()
        else if (no segmentation result)
            Segment_CN_including_Unknown_Word(CN)
    }

Figure 2: A Segmentation Algorithm

3.1 Resolving Ambiguous Segmentations

The algorithm for resolving ambiguous segmentations proceeds differently according to the number of unit nouns. If the number of segmented unit nouns is the same among the ambiguous segmentations, we apply the statistical information, CFP; otherwise, we apply the preference rule, MNPR.

First, consider the case in which the number of unit nouns is the same among the ambiguous segmentations, so that CFP is applied. Table 1 shows the total summations of the frequency data used as a specifier, an intermediate, and a head for the calculation of CFP.
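As a rough illustration of Figure 2 and the selection rule of Section 3.1, the following Python sketch enumerates dictionary-based segmentations and then chooses one by MNPR or CFP. It is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`candidates`, `cfp`, `segment`), the character-level splitting of romanized strings, and the toy lexicon are hypothetical. Each probability is normalized over the small count table passed in rather than over the totals of Table 1, so the scores will not match the values reported in the paper. Breaking ties among equally short candidates by CFP is also an added assumption.

```python
import math
from typing import Dict, List, Optional, Set

Counts = Dict[str, int]  # unit noun -> frequency in one role (toy data)


def candidates(cn: str, lexicon: Set[str]) -> List[List[str]]:
    """Enumerate every split of `cn` into two or more lexicon entries."""
    found: List[List[str]] = []

    def walk(start: int, units: List[str]) -> None:
        if start == len(cn):
            if len(units) >= 2:
                found.append(units)
            return
        for end in range(start + 1, len(cn) + 1):
            piece = cn[start:end]
            if piece in lexicon:
                walk(end, units + [piece])

    walk(0, [])
    return found


def role_prob(noun: str, counts: Counts) -> float:
    """Relative frequency of `noun` within one role table, as in eqs. (2)-(4)."""
    total = sum(counts.values())
    return counts.get(noun, 0) / total if total else 0.0


def cfp(units: List[str], pre: Counts, mid: Counts, post: Counts) -> float:
    """Log of the product in eq. (1); -inf when any factor is zero."""
    probs = [role_prob(units[0], pre)]
    probs += [role_prob(u, mid) for u in units[1:-1]]
    probs.append(role_prob(units[-1], post))
    if any(p == 0.0 for p in probs):
        return float("-inf")
    return sum(math.log(p) for p in probs)


def segment(cn: str, lexicon: Set[str],
            pre: Counts, mid: Counts, post: Counts) -> Optional[List[str]]:
    """Dispatch of Figure 2, without the unknown-noun branch."""
    cands = candidates(cn, lexicon)
    if not cands:
        return None  # would go to the unknown-noun heuristics of Section 3.2
    if len(cands) == 1:
        return cands[0]
    # MNPR: keep only the candidates with the fewest unit nouns.
    fewest = min(len(c) for c in cands)
    shortest = [c for c in cands if len(c) == fewest]
    # CFP: among candidates of equal size, take the highest score.
    return max(shortest, key=lambda c: cfp(c, pre, mid, post))


# Counts taken from the worked example in Section 3.1; the denominators here
# are sums over these toy tables only, not the totals of Table 1.
PRE = {"bujeonghap": 1, "bujeong": 87}
POST = {"gukja": 13, "hapgukja": 4}

print(segment("bujeonghapgukja",
              {"bujeonghap", "gukja", "bujeong", "hapgukja"}, PRE, {}, POST))
# -> ['bujeong', 'hapgukja']  (same unit count, decided by CFP)

print(segment("golfjangsaupja",
              {"golf", "jangsa", "upja", "golfjang", "saupja"}, {}, {}, {}))
# -> ['golfjang', 'saupja']   (different unit counts, decided by MNPR)
```

Returning `None` for the no-result case keeps the sketch focused on ambiguity resolution; a fuller version would hand such inputs to the unknown-noun phases.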
Table 1: Summations of each Specifier, Intermediate, and Head

Type                               2-Syllable   3-Syllable
$\sum_{i=1}^{n} C(N_i, S_{Pre})$
$\sum_{i=1}^{n} C(N_i, S_{In})$
$\sum_{i=1}^{n} C(N_i, S_{Post})$

For instance, the compound noun 'bujeonghapgukja' (부정합격자, an illegally successful candidate) can be segmented into both 'bujeonghap/gukja' (부정합/격자, a disharmonious lattice) and 'bujeong/hapgukja' (부정/합격자, an illegally successful candidate). The frequencies of the unit nouns are as follows:

C(bujeonghap, S_Pre) = 1
C(gukja, S_Post) = 13
C(bujeong, S_Pre) = 87
C(hapgukja, S_Post) = 4

By using the above frequencies, we calculate the CFPs of the two candidates as follows:

CFP(bujeonghap/gukja) = log(P(bujeonghap | S_Pre) P(gukja | S_Post)) = −7.9866
CFP(bujeong/hapgukja) = log(P(bujeong | S_Pre) P(hapgukja | S_Post)) = −6.6507

Because CFP(bujeong/hapgukja) is larger than CFP(bujeonghap/gukja), 'bujeonghapgukja' (부정합격자) is segmented into 'bujeong/hapgukja' (부정/합격자).

Second, if the number of unit nouns differs, we resolve the ambiguous segmentation by MNPR. For example, the compound noun 'golfjangsaupja' (골프장사업자, a golf course businessman) can be segmented into both 'golf/jangsa/upja' (골프/장사/업자, a golf trade businessman) and 'golfjang/saupja' (골프장/사업자, a golf course businessman). The number of unit nouns in 'golf/jangsa/upja' is 3 and the number of unit nouns in 'golfjang/saupja' is 2. By MNPR, we choose 'golfjang/saupja' (골프장/사업자) as the correct segmentation because it has the smaller number of unit nouns.

3.2 Segmenting Compound Nouns including Unknown Nouns

In general, because not all unit nouns can be registered in a lexicon, many compound nouns include unknown unit nouns. Most of the unknown unit nouns are three-syllable nouns, foreign nouns, and nouns of a specific area. In this research, we segment these compound nouns in three phases.

First, if a noun of three or more syllables in a specific position is a known noun, we apply the structure pattern itself. The unit nouns of a specific position are underlined as follows:

6 syllable : 3/3, 4/2, 2/4
7 syllable : 2/3/2, 3/4, 4/3, 5/2, 2/5
8 syllable : 2/3/3, 3/3/2, 2/4/2, 3/5, 5/3, 6/2, 2/6
9 syllable : 3/3/3, 2/3/4, 2/4/3, 3/4/2, 4/3/2, 2/5/2, 3/6, 6/3, 2/7, 7/2
10 syllable : 2/4/4, 4/4/2, 2/4/3, 4/3/3, 3/4/3, 3/3/4, 3/5/2, 2/5/3

For example, the compound noun 'orengekaunti' (오렌지카운티, Orange County) has a known noun 'orenge' (오렌지) and an unknown noun 'kaunti' (카운티). By the structure pattern '3/3', 'orengekaunti' is correctly segmented into 'orenge/kaunti' (오렌지/카운티).

Second, if a two-syllable noun is registered but the three-syllable noun is not, we apply the frequencies of an affix. For instance, the compound noun 'gunchuksahuphoy' (건축사협회, an architect society) is at first segmented into 'gunchuk/sa/huphoy' (건축/사/협회) because 'gunchuk' (건축) and 'huphoy' (협회) are registered but 'gunchuksa' (건축사) is not. Then, in order to decide whether the affix 'sa' (사) is a prefix or a suffix, we use the frequencies of the prefix and the suffix. The affix 'sa' (사) was used 29 times as a prefix and 111 times as a suffix. Therefore, it is treated as a suffix, and 'gunchuksahuphoy' is correctly segmented into 'gunchuksa/huphoy' (건축사/협회).

Third, we assume the following default patterns as the patterns that are most frequently segmented, and we apply them for a segmentation:

4 syllable : 2/2
5 syllable : 2/3
6 syllable : 2/2/2
7 syllable : 2/2/3
8 syllable : 2/2/2/2
9 syllable : 2/2/2/3
10 syllable : 2/2/2/2/2

Figure 3: System Configuration

4 Experimental Results

The system configuration that implements the proposed algorithm is shown in Figure 3. A raw text is analyzed by a morphological analyzer and is tagged by a POS tagger. Then, we extract N, N+N, N+N+N, and N+N+N+N forms from the POS-tagged corpus.
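The extraction step can be pictured with a small Python sketch. It is only an illustration under assumptions: the tag names `"N"` and `"J"`, the representation of an eojeol as a list of (morpheme, tag) pairs, the romanized particles, and the function names are hypothetical, and the actual analyzer and tagger output format is not described in the paper. The sketch collects runs of one to four consecutive nouns and drops single nouns that are registered in a unit noun dictionary, the filter that this section describes for N forms.

```python
from typing import Iterable, List, Set, Tuple

Tagged = List[Tuple[str, str]]  # one eojeol as (morpheme, POS tag) pairs


def noun_runs(eojeol: Tagged, noun_tag: str = "N", max_len: int = 4) -> List[List[str]]:
    """Collect maximal runs of consecutive noun morphemes of length <= max_len."""
    runs: List[List[str]] = []
    current: List[str] = []
    for morph, tag in eojeol:
        if tag == noun_tag:
            current.append(morph)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    return [run for run in runs if len(run) <= max_len]


def compound_candidates(corpus: Iterable[Tagged], unit_nouns: Set[str]) -> List[str]:
    """Return joined noun runs, skipping single nouns found in the dictionary."""
    out: List[str] = []
    for eojeol in corpus:
        for run in noun_runs(eojeol):
            if len(run) == 1 and run[0] in unit_nouns:
                continue  # a registered unit noun is not a compound noun
            out.append("".join(run))
    return out


# Toy tagged corpus; "J" stands for a hypothetical particle tag.
corpus = [
    [("golfjangsaupja", "N"), ("ga", "J")],
    [("golf", "N"), ("eul", "J")],
    [("gunchuk", "N"), ("huphoy", "N")],
]
print(compound_candidates(corpus, {"golf", "gunchuk", "huphoy"}))
# -> ['golfjangsaupja', 'gunchukhuphoy']
```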
However, an N form may be either a unit noun or a compound noun, owing to the recognition process for unknown nouns. Accordingly, we assume that unit nouns are registered in a unit noun dictionary and filter out the N forms that are unit nouns. As a result, the segmentation system receives only compound nouns as input and produces a single segmentation result.

We use three kinds of data to estimate the precision rate of the proposed algorithm. The first test data consists of 345 compound nouns that include a large proportion of unknown unit nouns. The second test data consists of 1,200 compound nouns extracted from about 1,000 documents of KTSET 2.0, which is used as a test set for information retrieval. The KTSET 2.0 test collection consists of 44,400 documents and 50 queries, and it includes the relevance judgment of each document with respect to each query. The third test data consists of 1,644 compound nouns extracted in a balanced manner and at random from corpora; they are drawn from 19,613 compound nouns that the Korean morphological analyzer cannot analyze.

We define the criteria for evaluating the segmentation algorithm as follows:

The inclusion rate of unknown nouns : D/B × 100
The rate of ambiguous segmentations : E/B × 100
The precision rate : F/B × 100

where B, D, E, and F are shown in Table 2.

Table 2: Experimental Results

Type                                              data 1   data 2   data 3
# of CNs in the Input (A)
# of CNs Segmented by the System (B)
# of CNs including only Known UNs (C)
# of CNs including at least one Unknown UN (D)
# of Ambiguously Segmented (E)
# of CNs Correctly Segmented (F)
Inclusion Rate of UN                              28.6%    12%      24.1%
Rate of Ambiguous Segmentations                   35%      21.5%    24%
Precision Rate                                    95.6%    96.8%    95.8%

From the results on the first and the third test data, we can say that the proposed algorithm can correctly segment compound nouns including unknown nouns. From the result on the second test data, we find that the proposed algorithm maintains a constant precision rate in segmenting compound nouns extracted from various domains.

Table 3 shows a data analysis of CFP, MNPR, and the heuristics, where B and F are as defined in Table 2. The table shows that CFP and MNPR are useful information for resolving ambiguous segmentations. But the heuristics for segmenting compound nouns including at least one unknown unit noun show a precision of only 78%. This means that there is still plenty of room for improvement.

Table 3: Data Analysis of CFP, MNPR, Heuristics

Method       B    F    Precision
CFP                    %
MNPR                   %
Heuristics             %

Our proposed method is compared with other studies in Table 4. In this table, 'Segmentation' means the segmentation of CNs including unknown nouns and 'Resolution' means the resolution of ambiguous segmentations. The table shows that the proposed method can segment compound nouns including unknown nouns and resolve ambiguous segmentations at a better precision rate. In Table 5, we compare our method with that of Chang separately from the other existing studies.
The reason is that Yun [12] and Choi [4] use dictionary-based methods, whereas Chang [3] uses a corpus-based method. In this table, 'Trained' means the trained data used to construct a trie and acquire statistical information, and 'Untrained' means the untrained data used to evaluate the precision rate apart from the trained data. This result shows that the proposed method can maintain a constant precision rate regardless of the specific area.

Table 4: Results of Comparison 1

Factor         Yun95   Choi96   Proposed
Segmentation   No      No       Yes
Resolution     Yes     No       Yes
Precision      82%     83%      95.6%

Table 5: Results of Comparison 2

Data        Chang96   Proposed
Trained     97.66%    98.0%
Untrained   87.75%    95.6%
KTSET       %         96.8%

5 Conclusion

In this paper, we have presented four requirements necessary for segmenting Korean compound nouns in a raw corpus and suggested a method of segmenting Korean compound nouns into unit nouns. We applied structure patterns of compound nouns and resolved ambiguous segmentations by using statistical information, CFP, and a preference rule, MNPR. The experimental results have shown that the precision rate is about 96%. The experiments have also shown that the proposed method can segment compound nouns including unknown nouns and maintain a constant precision rate in segmenting compound nouns extracted from various domains.

In future work, we will try to improve the accuracy of segmenting compound nouns including unknown unit nouns. In addition, we will apply the segmentation method to compound noun indexing in order to improve the performance of an information retrieval system.

References

[1] J. Allen, Natural Language Understanding, The Benjamin/Cummings Publishing Company Inc.,
[2] E. Charniak, C. Hendrickson, N. Jacobson, and M. Perkowitz, "Equations for Part-of-speech Tagging," Proc. of the Eleventh National Conference on Artificial Intelligence, pp. ,
[3] D.H. Chang, S.H. Myaeng, "A Korean Compound Noun Analysis method for Effective indexing," Hangul and Korean Information Processing Conference, pp. , (in Korean)
[4] J.H. Choi, "A Division Method of Korean Compound Noun by number of syllable," Hangul and Korean Information Processing Conference, pp. , (in Korean)
[5] W.S. Jung, Word Formation Theory of Korean language, 1st Ed., p.267, Hansin-Culture Publishing Company (in Korean)
[6] J.D. Kim, A Korean Part-of-Speech Tagging Model Based on Morpheme-unit with Eojeol Context, M.S. Dissertation, Korea University, (in Korean)
[7] S.Z. Lee, Two-level Korean Part-of-Speech Tagging using HMM, M.S. Dissertation, Korea University, (in Korean)
[8] H.S. Lim, Korean Morphological Analyzer based on Classification of Ambiguity pattern, M.S. Dissertation, Korea University, (in Korean)
[9] H.S. Lim, J.D. Kim, H.C. Rim, "Improvement of Transformation Rule-Based Korean Part-Of-Speech Tagger," Hangul and Korean Information Processing Conference, pp. , (in Korean)
[10] J.Y. Nie, M.L. Hannan, W. Jin, "Combining Dictionary, Rules and Statistical Information in Segmentation of Chinese," Computer Processing of Chinese and Oriental Languages, Vol. 9, No. 2, pp. , 1995.
[11] K. Yosiyuki, T. Takenobu, T. Hozumi, "Analysis of Japanese Compound Nouns using Collocation information," Proc. of the 14th Conference on Computational Linguistics (COLING-94), pp. ,
[12] B.H. Yun, H.S. Lim, H.C. Rim, "Analysis of Korean Compound Nouns using Statistical Information," Proc. of the 22nd Korea Information Science Society Spring Conference, pp. , April
