Resolving Dependency Ambiguity of Subordinate Clauses using Support Vector Machines

Sang-Soo Kim, Seong-Bae Park, and Sang-Jo Lee

Abstract: In this paper, we propose a method for resolving the dependency ambiguities of Korean subordinate clauses based on Support Vector Machines (SVMs). Dependency analysis of clauses is well known to be one of the most difficult tasks in parsing sentences, especially in Korean. To address this problem, we assume that the dependency relation between Korean subordinate clauses is the dependency relation among the verb phrases, that is, the verbs and endings, in the clauses. As a result, the problem is represented as a binary classification task. To apply SVMs to this problem, we select two kinds of features: static and dynamic features. The experimental results on the STEP2000 corpus show that our system achieves an accuracy of 73.5%.

Keywords: Dependency analysis, subordinate clauses, binary classification, support vector machines.

I. INTRODUCTION

In Korean, the dependency analysis of clauses is known as one of the most difficult tasks in parsing sentences because of the characteristics of the language. These characteristics are that (i) it has a partially free word order, (ii) the omission of components is common, (iii) it is a head-final language, and (iv) the spacing unit is a composite of one or more words. Of these, the third factor is what makes clause dependency analysis especially difficult. The endings (Eomi) can be freely combined with a verb, and they carry the semantic relationship with other verbs.

The steps of parsing Korean sentences are as follows. First, the input sentence is analyzed into morphemes, and then the part-of-speech (POS) of each morpheme is determined. Finally, the syntactic relation is analyzed using the results of the previous steps. Due to the characteristics of Korean, dependency grammar rather than phrase-structure grammar is generally used in parsing Korean [1]. This process is not much different from that of other languages.
However, since each word used in a sentence becomes a processing unit, the complexity of parsing grows too large, especially for long sentences, which results in severe ambiguities in parsing Korean. Recently, to solve this problem, enlarging the processing unit has attracted much interest from researchers in Korean language processing. Many research results have been reported on Korean text chunking [2,3], and they give relatively stable results. In addition, some researchers have studied how to find the boundaries of a clause [4]. When the clause boundaries are known, intra-clause parsing is a simple task compared to inter-clause parsing, because Korean is head-final. However, the relation between clauses is not determined by the information given by text chunking and clause boundaries. Due to the freedom of word ordering in Korean sentences, it is extremely difficult to determine the relation between clauses by looking only at the neighboring words. That is, it has been believed that the surface form of a clause is not sufficient for analyzing the relation among clauses. As a result, many previous works on parsing Korean have focused on how to use the semantic information of verb phrases in determining the clause relation [1]. However, building semantic knowledge for the task is very expensive and time-consuming.

In this paper, we propose a novel method of analyzing the dependency relation of Korean subordinate clauses without external knowledge. For this task, we observed that the most important component in determining the dependencies is the base verb phrase, composed of a verb and a few endings, rather than the complement and supplement components within a clause. Therefore, in order to solve the problem, we assume that the dependency relation of Korean subordinate clauses is the dependency relation of base verb phrases. In addition, we formulate the dependency analysis of Korean clauses as a binary classification task. As a classifier for this task, we adopt a support vector machine (SVM), which is known as one of the best-performing classifiers for many kinds of real-world classification problems.

The rest of this paper is organized as follows. Section 2 surveys the previous work on clause recognition and the analysis of inter-clause relations, and Section 3 introduces how a support vector machine, which is adopted as the base learner for the task, works. Section 4 describes the proposed method for clausal dependency analysis using support vector machines. Section 5 explains the corpus used in the experiments and presents the experimental results. Finally, Section 6 draws conclusions.

This research was supported in part by MIC & IITA through IT Leading R&D Support Project, and by grant No. R0-2006-000-96-0 from the Basic Research Program of the Korea Science & Engineering Foundation.
The authors are with the Department of Computer Engineering, Kyungpook National University, Daegu 702-701, Korea (corresponding author e-mail: sskim@sejong.knu.ac.kr).

II. RELATED WORK

A number of studies are related to analyzing the dependency relation of subordinate clauses, both in clause identification and in dependency structure analysis. Clause identification is the task of finding the starting and ending points of clauses and recognizing how the clauses are embedded. In 2001, there was a competition for this task at the Conference on Computational Natural Language Learning (CoNLL). The best two methods were boosting trees [5] and a hidden Markov model [6]. However, unlike in Western languages, in Korean the dependency relation is not easily determined even if the clause boundaries are identified.

Most previous work on the dependency analysis of sentences has focused on words rather than clauses. That is, instead of finding the dependency relation among clauses, the relation among verb phrases within a clause has been the core of the research. Uchimoto et al. used a maximum entropy model and various kinds of features to identify the dependency structure of sentences [7]. They reported experimental results on the relationship between feature types and dependency analysis. Kudo and Matsumoto formulated the analysis of dependency structure as a binary classification task and adopted support vector machines as the classifier [8]. The features used in training the support vector machines are grammatical features such as lexical items and part-of-speech tags, and some functional features such as function words and inflection information. Gao and Suzuki addressed the problem of analyzing dependency relations by training a language model through unsupervised learning [9]. Utsuro et al. classified text chunks into several types according to the function words of the final word in a sentence. With the classified types, they determined the dependency relation among chunks [10].

In Korean language processing, most research on syntactic analysis has focused on the Josa and Eomi and their dependency relations. As a result, most work is based on hand-crafted rules [1].
In particular, research on subordinate clauses has addressed the recognition of simple sentences and the restoration of omitted components in a simple sentence. The first effort to use a machine learning algorithm in handling clauses was made for clause boundary detection. Lee et al. extracted n-gram information from a sentence and then recognized the boundary of a clause using that information [11]. However, their work was limited to the detection of clauses and did not suggest any method for analyzing their dependency. The main reason why machine learning algorithms are rarely used in handling Korean clauses is that there has been no standard large-scale dataset for the task. Recently, substantial funding from the Korean government for building a large-scale tree-tagged corpus has made it possible to transform the corpus into data for clause detection and dependency analysis.

III. SUPPORT VECTOR MACHINES

The Support Vector Machine (SVM), proposed by Vapnik, is a machine learning algorithm that is well known as one of the most successful binary classifiers and has been applied to many classification tasks. In the field of natural language processing, it has been successfully applied to text categorization, spam-mail filtering, and chunk identification, and it is reported to achieve high performance without over-fitting even with a large number of features [12, 13].

Assume that the training data, each with either a positive or a negative class label, are given as follows:

\[ (x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l), \quad x_i \in \mathbb{R}^n, \; y_i \in \{+1, -1\} \]

where $x_i$ is the feature vector of the i-th training datum in an n-dimensional space, and $y_i$ is its class label. In the basic SVM framework, the hyperplane is defined as follows:

\[ (w \cdot x) + b = 0, \quad w \in \mathbb{R}^n, \; b \in \mathbb{R}. \]

According to this definition, there can be an infinite number of hyperplanes that separate the training data into two classes correctly. Among such hyperplanes, we define the optimal hyperplane as the one with the largest margin between the two classes. Fig. 1 illustrates the notion of the margin.

Fig. 1 The margin of a hyperplane
The solid line, the hyperplane, correctly divides the training data into two classes without misclassification. The two dashed lines, which are parallel to the hyperplane, represent the distance between the hyperplane and the closest instances on each side. The distance between the two dashed lines, d, is called the margin. Thus, assuming that the nearest instances satisfy $|(w \cdot x) + b| = 1$, the margin can be rewritten as:

\[ d = \frac{\left| ((w \cdot x) + b + 1) - ((w \cdot x) + b - 1) \right|}{\|w\|} = \frac{2}{\|w\|} \]

Therefore, the SVM generates the hyperplane that maximizes the margin by minimizing $\|w\|$ under the constraints:

\[ y_i \left[ (w \cdot x_i) + b \right] \ge 1, \quad i = 1, \ldots, l. \]
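The margin and the separation constraints can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the four two-dimensional points, the weight vector `w`, the bias `b`, and the helper names `margin` and `satisfies_constraints` are all hypothetical and do not come from the paper. It evaluates d = 2/||w|| and tests whether y_i[(w · x_i) + b] >= 1 holds for every training point.

```python
import math

def margin(w):
    """Margin d = 2 / ||w|| of a canonical separating hyperplane."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, b, data):
    """Check y_i * ((w . x_i) + b) >= 1 for every (x_i, y_i) in data."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in data
    )

# Hypothetical toy data: two positive and two negative points in 2-D.
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Hypothetical candidate hyperplane; the closest points (2, 2) and (0, 0)
# lie exactly on the two dashed lines (w . x) + b = +1 and -1.
w, b = (0.5, 0.5), -1.0

print(satisfies_constraints(w, b, data))
print(margin(w))

# A second feasible hyperplane with a larger ||w|| has a smaller margin.
print(satisfies_constraints((1.0, 1.0), -2.0, data), margin((1.0, 1.0)))
```

Both candidate hyperplanes separate the toy data, but the one with the smaller norm of `w` has the wider margin, which is why the optimization minimizes ||w|| subject to the constraints.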

Fig. 2 An example of a dependency relation between clauses

SVMs have advantages over conventional machine learning algorithms such as neural networks or decision trees. SVMs show high generalization performance independent of the dimension of the feature vectors. Conventional machine learning algorithms usually require careful feature selection, which is often optimized heuristically to avoid over-fitting. SVMs can also carry out their learning with all combinations of the given features without increasing the computational complexity, by introducing a kernel function.

IV. ANALYZING DEPENDENCY RELATION OF SUBORDINATE CLAUSES

A. The Probability Model and Generating Training Data

Let a sequence of clauses $\{c_1, c_2, \ldots, c_m\}$ be denoted by C, and the sequence of dependency patterns $\{\mathrm{Dep}(1), \mathrm{Dep}(2), \ldots, \mathrm{Dep}(m-1)\}$ be denoted by D, where $\mathrm{Dep}(i) = j$ means that clause $c_i$ modifies clause $c_j$. In Korean, under this framework, the dependency relation has to satisfy some constraints. Every clause except the rightmost one has exactly one dependency relation; that is, a clause modifies only one clause. Dependency analysis is then defined as the problem of searching for the dependency pattern D that maximizes the conditional probability $P(D \mid C)$. That is,

\[ D_{best} = \arg\max_{D} P(D \mid C) \]

If we assume that the dependency probabilities are independent of one another, $P(D \mid C)$ can be rewritten as:

\[ P(D \mid C) = \prod_{i=1}^{m-1} P(\mathrm{Dep}(i) = j \mid f_{ij}), \quad f_{ij} = \{f_1, \ldots, f_n\} \in \mathbb{R}^n \]

where $f_{ij}$ is an n-dimensional feature vector that represents the relation between clauses $c_i$ and $c_j$. In order to use SVMs in analyzing the clausal dependency, we generate positive and negative examples. We adopt a simple and effective method for this purpose.
\[ \bigcup_{1 \le i \le m-1,\; i+1 \le j \le m} (f_{ij}, y_{ij}) = \{ (f_{12}, y_{12}), (f_{23}, y_{23}), \ldots, (f_{m-1\,m}, y_{m-1\,m}) \} \]
\[ f_{ij} = \{f_1, \ldots, f_n\} \in \mathbb{R}^n, \quad y_{ij} \in \{ \mathrm{Dep}(+1), \mathrm{Not\text{-}Dep}(-1) \} \]

According to the above equation, we generate pairs of clauses from the training data, and then take a pair of clauses that are in a dependency relation as a positive example, and a pair of clauses that appear in the same sentence but are not in a dependency relation as a negative example. Fig. 2 shows an example of extracting dependency relations between clauses. In this example, clauses 1, 2 and 3 mean "the shower room was moved to the gymnasium", "the office room was closed", and "a rest room was made in that place" (the original place of the shower room and the office room). In this case, we can generate one positive example (Case 2 in Fig. 2) and one negative example (Case 1 in Fig. 2).

TABLE I
FEATURES USED FOR ANALYZING DEPENDENCY RELATION

Feature type       Information             Features
Static features    Lexicon information     Left clause: word of verb, POS tag of verb, word of endings, POS tag of endings
                                           Right clause: word of verb, POS tag of verb, word of endings, POS tag of endings
                   Position information    Distance between left and right clauses
                                           Position index of left and right clauses
Dynamic features                           A syntactic relation between clauses

B. Feature Selection for Analyzing Dependency Relation

In Korean, clauses are divided into three types: one that modifies another clause (called a conjunctive clause), one that modifies a noun phrase (called a prenominal clause), and one that marks the end of a sentence (called a final-ending clause). Among these clause types, prenominal and conjunctive clauses form dependency relations. We focus on the dependency relations in which a conjunctive clause depends on another clause, because a prenominal clause forms a simple dependency relation by modifying the next noun phrase that appears. The conjunctive clause makes the dependency relation very complex, and thus the relation is difficult to recognize. The relation cannot be determined from simple syntactic information such as the verb type and the position in the sentence, but rather from the context of the sentence and the inflection of the endings.

In the previous section, we assumed that the dependency relation of Korean subordinate clauses is the dependency relation of the verb phrases, that is, the verbs and endings, in the clauses. According to this assumption, we select two kinds of features: static and dynamic features. The feature set is shown in Table I.

We define the lexicon and position information appearing in a sentence as the static information. The lexicon information consists of the words and POS tags of the verb and endings in the left and right clauses of a pair. The position information consists of the distance between the clauses and the position index, which is the location of the clauses in the sentence. We expect these static features to weakly represent the semantic information between clauses. Fig. 3 shows the static features for the example in Fig. 2.

No.   Lexicon information: left clause   Lexicon information: right clause   Position information (distance, position index)
1     옮기/pvg 고/ecc                    하/px ㄴ/etm                        1, 1
2     옮기/pvg 고/ecc                    하/xsv 었/ep 다/ef                  2, 2

Fig. 3 An example of static features

The dynamic features are the syntactic information in a sentence. Therefore, we build a simple CKY chart parser so that it captures the syntactic information in the sentence. Table II shows the rule set for the chart parser. CC denotes a conjunctive clause and PC denotes a prenominal clause.

TABLE II
THE RULES USED FOR CHART PARSING

Rule 1: CC → CC CC
Rule 2: CC → PC CC
Rule 3: PC → PC PC
Rule 4: PC → CC PC

With dynamic features, we can use the syntactic relation of clauses in training the support vector machines. The syntactic relation states whether a clause is composed of just one simple clause or of more than one clause. Fig. 4 shows an example of analyzing the dependency relation using the dynamic features. The static features of Case 1 and Case 2 are the same, but the dynamic features are different. The dynamic feature of Case 1 is PC CC, determined by Rule 2, whereas that of Case 2 is CC.

Fig. 4 An example of determining dependency relation using dynamic features

Table III shows the whole features for Fig. 3.
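The generation of positive and negative training pairs described in Section IV-A can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `generate_pairs` and its arguments are hypothetical, clauses are numbered from 1, only the clauses listed in `heads` (here assumed to be the conjunctive clauses) act as the left clause, and candidate governors are restricted to clauses on the right because Korean is head-final. The distance is assumed to be the difference of the clause indices.

```python
from typing import Dict, List, Tuple

def generate_pairs(num_clauses: int,
                   heads: Dict[int, int]) -> List[Tuple[int, int, int, int]]:
    """Build (left, right, distance, label) examples for one sentence.

    heads maps the index of a left clause to the index of the clause it
    modifies. Every clause to the right of a left clause is a candidate:
    the true governor gives a positive example (+1, "Dep") and every
    other candidate gives a negative example (-1, "Not-Dep").
    """
    examples = []
    for left, head in sorted(heads.items()):
        for right in range(left + 1, num_clauses + 1):
            label = 1 if right == head else -1
            examples.append((left, right, right - left, label))
    return examples

# The three-clause sentence of Fig. 2: clause 1 depends on clause 3.
pairs = generate_pairs(3, {1: 3})
print(pairs)  # [(1, 2, 1, -1), (1, 3, 2, 1)]
```

For the sentence in Fig. 2 this yields one negative pair (clauses 1 and 2) and one positive pair (clauses 1 and 3), matching the one positive and one negative example mentioned in the text. In a full system each pair would additionally carry the lexicon and dynamic features of Table I.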
TABLE III
THE WHOLE FEATURES

No.   Static features: lexicon information    Static features: position information   Dynamic features
1     옮기/pvg 고/ecc, 하/px ㄴ/etm           1, 1                                     PC CC
2     옮기/pvg 고/ecc, 하/px ㄴ/etm           1, 1                                     CC
3     옮기/pvg 고/ecc, 하/xsv 었/ep 다/ef     2, 2                                     PC

V. EXPERIMENTS

For the evaluation of the proposed method, a data set for the dependency analysis of clauses in Korean was prepared. This data set is derived from the parsed corpus that is a product of the STEP2000 project supported by the Korean government. The corpus consists of 6,934 sentences with 26,876 clauses. The corpus is divided into two parts: a training set (90%) and a test set (10%). Table IV shows simple statistics on the corpus.

TABLE IV
COUNTS ON THE DATASET

Information                                    Training set   Test set
No. of all sentences                           6,240          694
No. of all clauses                             24,226         2,650
No. of prenominal and final-ending clauses     15,457         1,666
No. of conjunctive clauses                     8,769          984

Fig. 5 shows an example of a dependency relation in the subordinate clause data set. For the format of this data set, we follow that of the CoNLL-2001 shared task and additionally add the dependency relation of clauses to it. Each instance in the training and test data consists of six columns. The first column contains the lexical item, and the second gives its part-of-speech tag. The third column contains the chunk tag. The verb phrases in these columns are used as static features. The fourth and fifth columns mark the beginning, S, and the ending, E, of clauses. The sixth column gives the relation index of clauses.

We use SVMlight [14] as the support vector machine implementation, and experiment on three cases. The first case uses only the words and POS tags of clauses, and the second case uses all of the static features. The last case uses both static and dynamic features. The evaluation measure is defined as:

\[ \mathrm{Accuracy} = \frac{\text{number of correctly recognized dependency relations of clauses}}{\text{total number of dependency relations}} \times 100 \]

When a clause forms candidate dependency pairs with more than one clause, we select the pair that has the largest margin. Table V shows the experimental results. The baseline is the model that takes the nearest clause as the governor of a clause.

TABLE V
THE EXPERIMENTAL RESULTS

Model      Features                                Accuracy (%)
Baseline   Nearest clause as governor              57.50
Case 1     Only words and POS tags of clauses      64.40
Case 2     All static features                     68.59
Case 3     All static and dynamic features         73.50

In Case 1, when only the words and POS tags of clauses are used, the accuracy is just 64.40%. That is, the proposed model improves on the baseline by 6.90 percentage points. This implies that the verb and endings carry only weak information about the dependency relation. The second case, which uses all of the static features, shows an accuracy of 68.59%. This means that the position information in the static features affects the dependency relation. In the last case, the result with both static and dynamic features (73.50%) is 4.91 percentage points higher than the result without dynamic features. That is, the model with dynamic features outperforms the one with static features only.
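The decoding rule, the nearest-clause baseline, and the accuracy measure used in this section can be sketched together. The code below is an illustration under assumptions rather than the evaluation script of the paper: the function names are hypothetical, the "margin" of a candidate pair is taken to be the signed decision value returned by the trained SVM for that pair, and the scores and gold governors are made-up values, not results from the corpus.

```python
from typing import Dict, Iterable, Tuple

def select_heads(scores: Dict[Tuple[int, int], float]) -> Dict[int, int]:
    """For each left clause, keep the candidate with the largest decision value."""
    best: Dict[int, Tuple[int, float]] = {}
    for (left, right), score in scores.items():
        if left not in best or score > best[left][1]:
            best[left] = (right, score)
    return {left: right for left, (right, _) in best.items()}

def nearest_baseline(lefts: Iterable[int]) -> Dict[int, int]:
    """Baseline: the governor of a clause is the nearest clause to its right."""
    return {left: left + 1 for left in lefts}

def accuracy(predicted: Dict[int, int], gold: Dict[int, int]) -> float:
    """Percentage of gold dependency relations that are predicted correctly."""
    correct = sum(1 for left, head in gold.items() if predicted.get(left) == head)
    return 100.0 * correct / len(gold)

# Made-up decision values for the candidate pairs of one three-clause sentence.
scores = {(1, 2): -0.4, (1, 3): 0.9, (2, 3): 0.2}
gold = {1: 3, 2: 3}

print(accuracy(select_heads(scores), gold))          # 100.0
print(accuracy(nearest_baseline(gold.keys()), gold)) # 50.0
```

Picking the maximum decision value guarantees that each clause receives exactly one governor, which is the constraint stated in Section IV-A, even when the classifier labels several candidates as positive or none at all.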
The performance of our approach is slightly lower than that reported by other studies that analyze dependency relations in Japanese and European languages. This seems to be because our approach considers only the relations between clauses, without the relations between words and phrases. It is easier to analyze the relations between words and phrases than to analyze the relations between clauses.

VI. CONCLUSION

We have proposed a method for analyzing the dependency relation of Korean subordinate clauses based on Support Vector Machines (SVMs). In order to solve this problem, we assumed that the dependency relation of Korean subordinate clauses is the dependency relation of the verb phrases, that is, the verbs and endings, in the clauses. We formulated the problem as a binary classification task. We selected two kinds of features, static and dynamic features, for applying SVMs to this problem. The static features are the word, the POS tag, and the position information, while the dynamic features include the syntactic information of the clauses. For extracting the dynamic information, we built a simple CKY chart parser with simple rules. The experimental results on the STEP2000 corpus show that our system achieves an accuracy of 73.5%.

No.   Lexicon   POS    Chunk   S   E   Relation   Gloss
1     샤워실    ncn    B-NP    S   X   0          shower room
2     을        jco    I-NP    X   X   0          POST
3     체육관    ncn    B-NP    X   X   0          gymnasium
4     으로      jca    I-NP    X   X   0          POST
5     옮기      pvg    B-VP    X   X   0          move
6     고        ecc    I-NP    X   E   1          ENDING
7     사무실    ncn    B-NP    S   X   0          office room
8     을        jco    I-NP    X   X   0          POST
9     폐쇄      ncpa   B-VP    X   X   0          closed
10    하        xsv    I-VP    X   X   0          ENDING
11    ㄴ        etm    I-VP    X   E   2          ENDING
12    그곳      npd    B-NP    X   X   0          that place
13    에        jca    I-NP    X   X   0          ENDING
14    휴게실    ncn    B-NP    X   X   0          rest room
15    을        jco    I-NP    X   X   0          POST
16    만들      pvg    B-VP    X   X   0          make
17    었        ep     I-VP    X   X   0          ENDING
18    다        ef     I-VP    X   X   0          ENDING
19    .         sf     O       X   E   -

Fig. 5 An example of a dependency relation in the subordinate clause data set

REFERENCES

[1] K.-J. Seo, A Korean language parser using syntactic dependency relations between word-phrases, M.S. Thesis, KAIST, 1993.
[2] S.-B. Park and B.-T. Zhang, Text Chunking by Combining Hand-Crafted Rules and Memory-Based Learning, In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 497-504, 2003.
[3] H.-P. Shin, Maximally Efficient Syntactic Parsing with Minimal Resources, In Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 242-244, 1999. (In Korean)
[4] H.-J. Lee, S.-B. Park, S.-J. Lee, and S.-Y. Park, Clause Boundary Recognition Using Support Vector Machines, In Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence, pp. 505-514, 2006.

[5] X. Carreras and L. Marquez, Boosting Trees for Clause Splitting, In Proceedings of the 5th Conference on Computational Natural Language Learning, pp. -3, 2001.
[6] A. Molina and F. Pla, Clause Detection using HMM, In Proceedings of the 5th Conference on Computational Natural Language Learning, pp. 70-72, 2001.
[7] K. Uchimoto, S. Sekine, and H. Isahara, Japanese Dependency Structure Analysis Based on Maximum Entropy Models, In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, pp. 196-203, 1999.
[8] T. Kudo and Y. Matsumoto, Japanese Dependency Structure Analysis Based on Support Vector Machines, In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 18-25, 2000.
[9] J. Gao and H. Suzuki, Unsupervised Learning of Dependency Structure for Language Modeling, In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 521-528, 2003.
[10] T. Utsuro, S. Nishiokayama, M. Fujio, and Y. Matsumoto, Analyzing Dependencies of Japanese Subordinate Clauses based on Statistics of Scope Embedding Preference, In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics, pp. 110-117, 2000.
[11] H.-J. Lee, S.-B. Park, S.-J. Lee, and S.-Y. Park, Clause Boundary Recognition Using Support Vector Machines, In Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence, pp. 505-514, 2006.
[12] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[13] T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, In Proceedings of the European Conference on Machine Learning, pp. 137-142, 1998.
[14] T. Joachims, Making Large-Scale SVM Learning Practical, LS8, Universitaet Dortmund, 1998.