QUERY TRANSLATION FOR CROSS-LANGUAGE INFORMATION RETRIEVAL BY PARSING CONSTRAINT SYNCHRONOUS GRAMMAR

FRANCISCO OLIVEIRA 1, FAI WONG 1, KA-SENG LEONG 1, CHIO-KIN TONG 1, MING-CHUI DONG 1
1 Faculty of Science and Technology, University of Macau, Macao
E-MAIL: {olfran, derekfw, ma56538, ma66535, mcdong}@umac.mo

Abstract: With the availability of large amounts of multilingual documents, Cross-Language Information Retrieval (CLIR) has become an active research area in recent years. However, researchers often face the problem of the inherent ambiguities of natural languages. The task is even more challenging for Chinese, because word boundaries are not marked in the sentence. This paper presents Chinese-Portuguese query translation for CLIR based on a Machine Translation (MT) system that parses Constraint Synchronous Grammar (CSG). Unlike traditional transfer-based MT architectures, this model requires only a set of CSG rules, which model the syntactic structures of the two languages simultaneously, to perform the translation. Moreover, CSG can be used to perform different levels of disambiguation as the parsing proceeds, in order to improve the quality of the translation.

Keywords: Cross-Language Information Retrieval, Machine Translation, Constraint Synchronous Grammar

1. Introduction

The main objective of Cross-Language Information Retrieval (CLIR) is to retrieve information written in a language different from the language of the user's input query. This is especially helpful in Macau, where many documents are written in Chinese and Portuguese, since both are official languages of the territory. CLIR systems permit users to retrieve documents written not only in their native language but also in the other one. However, it is not easy to obtain high-quality query translations in any domain, because of the inherent ambiguities and linguistic phenomena of natural languages and the enormous amount of knowledge needed for disambiguation. Several approaches have been proposed in the literature.
In bilingual dictionary based approaches [1], the query translation is generated by looking up the entries of the lexicon. Although this is efficient in terms of translation, a large vocabulary is needed to cover all the words. Such coverage is not easy to achieve, and the problem is even worse for Chinese, which does not have word boundaries. Moreover, words in the dictionary usually have more than one translation, and it is difficult to select the best one by considering the bilingual dictionary alone.

MT based approaches seem to be the ideal solution for CLIR, mainly because MT systems translate the sentence as a whole, and the translation ambiguity problem is resolved during the analysis of the source sentence. Rule-based MT [2] translates by applying a set of linguistic rules. Since these rules are universal, they are domain independent. However, this approach often requires a large human cost in formulating the rules, and it is hard to maintain consistency as the number of rules increases. In statistical approaches [3, 4], the translation is determined by estimating, from a parallel corpus, the probabilities of word translations and of the ordering of words in the sentence. However, these approaches depend heavily on the parallel corpus. Example-based MT [5] searches the parallel corpus for stored example fragments from which to generate the translation, but the result depends on the quality of the examples and on the similarity function defined.

In recent years, several versions of synchronous grammars [6] have been proposed for solving the non-isomorphic tree-based transduction problem and for providing solutions to Machine Translation. For example, Synchronous Tree Adjoining Grammar [7] was initially applied to semantics but was later considered for translation [8]. Multiple Context-Free Grammar [9] defines a set of functions for the non-terminal symbols in the productions in order to interpret the symbols in the target generation.
However, it is hard to describe discontinuous constituents in linguistic expressions with this formalism [10]. Melamed [6] modeled the problem of MT as synchronous parsing based on Generalized Multitext Grammars, which maintain two sets of productions as components, one for each language, for modeling parallel texts. Although this formalism can describe detailed semantic information associated with a non-terminal, its lack of flexibility makes it difficult to use for developing a practical MT system.

In this paper, we apply Constraint Synchronous Grammar (CSG) [10], a variation of synchronous grammar, as the kernel of the MT system that performs query translation for Chinese-Portuguese CLIR. CSG can express detailed feature structures such as gender, number, and agreement in each non-terminal constituent, so that the necessary disambiguation can be performed at each level. It can also express non-standard linguistic phenomena, including crossing dependencies and discontinuous constituents, in the inference rules. Moreover, CSG allows the parser to discard ambiguous parse trees as the parsing progresses by making use of the linguistic features defined.

This paper is organized as follows: the design model of Chinese-Portuguese query translation for CLIR by parsing CSG is given in Section 2. Chinese word segmentation is presented in Section 3. The parsing of Constraint-based Synchronous Grammar and the generation of the translation are detailed in Sections 4 and 5. Finally, the conclusion is given in Section 6.

[Figure 1. Kernel design of the proposed MT system. The MT kernel consists of four stages: Words Rough Segmentation, Probabilistic Tagging, CSG Parsing, and Generation with Morphological Analysis. Its resources are a training corpus, POS rules, a lexicon, and a CSG rules KB, and its output is passed to a monolingual IR system. Example: the input 我有兩個聰明的姐妹 is tagged as 我 /pro 有 /v 兩 /num 個 /q 聰明的 /a 姐妹 /n and translated as Eu tenho duas irmãs inteligentes.]

2. Design Model of Chinese-Portuguese CLIR System

The kernel of the proposed MT system, as applied to CLIR, is shown in Figure 1. The translation process begins with word segmentation of the given Chinese query, which is then tagged by a probabilistic tagger. Based on the segmented and tagged result, the source sentence is further analyzed by a modified generalized LR parser [11], guided by CSG rules, which infers the syntactic structure of the input in order to determine the target language sentential pattern.
This pattern is then morphologically analyzed in order to generate the translation of the source sentence. The translated sentence is then passed to a monolingual IR system for retrieving documents written in the target language.

3. Chinese Segmentation and Tagging Module

Whether Chinese words are segmented effectively and correctly is vital to obtaining a good translation result in MT systems that involve Chinese. This is mainly because Chinese sentences, unlike those of Western languages such as Portuguese, have no delimiters between words. Moreover, there are many ambiguity problems in correctly segmenting a Chinese sentence. In our design model, we apply the N-Shortest-Paths method [12] to generate a set of rough segmentation results for Chinese sentences.

3.1. Words Rough Segmentation Model

For a given Chinese sentence, a directed graph is constructed with each of its atomic characters as the vertices $(V_1, V_2, \ldots, V_n)$. Edges between the vertices are determined by the probabilities of the atomic characters or of the word combinations obtained from the Chinese corpus. Let $W$ be one of the possible segmentation results for the Chinese sentence $C$; then the probability of $W$ given $C$ is defined as:

$$P(W \mid C) = \frac{P(W)\,P(C \mid W)}{P(C)} \tag{1}$$

Since the probability of the Chinese sentence $P(C)$ is a constant, and the probability of $C$ given $W$ must be 1, the objective is to determine the $N$ different segmentations that have the $N$ largest probabilities $P(W)$. Suppose that a possible segmentation sequence $W$ consists of the words $w_1, w_2, \ldots, w_m$; then the probability $P(w_i)$ can be approximated as:

$$P(w_i) = \frac{k_i + 1}{\sum_{j=0}^{m} k_j + V} \tag{2}$$

where $k_i$ is the number of occurrences of $w_i$ and $V$ is the number of word types in the training corpus. Smoothing is applied by adding a constant to the numerator, because $w_i$ may not appear in the training corpus. Assuming, for simplicity, that the context within the sentence is not considered, the best word sequence $W^{*}$ can be computed as:

$$W^{*} = \arg\max_{W} P(W) = \arg\max_{W} \prod_{i=1}^{m} P(w_i) = \arg\max_{W} \prod_{i=1}^{m} \frac{k_i + 1}{\sum_{j=0}^{m} k_j + V} \tag{3}$$

The candidate segmentations are then tagged by a probabilistic tagger [13] based on the Hidden Markov Model [14], which determines the final segmented sentence and the best POS tag for each word.

4. Parsing Constraint Synchronous Grammar

Constraint Synchronous Grammar [10] extends the formalism of Context Free Grammar (CFG) to the synchronous case. A CSG consists of a set of production rules that describe the sentential patterns of the source text together with the target translation patterns.

4.1. Definition of CSG

In CSG, every production rule is of the form

S → NP1 PP NP2 VP* NP3 {
    [NP1 VP a NP3 NP2] ; VPcat = vb1 & PP = 把 & VPs:sem = NP1sem & VPo:sem = NP2sem & VPio:sem = NP3sem
    [NP1 VP NP2 em NP3] ;
}

In this production rule, two generative rules are associated with the source sentential pattern NP1 PP NP2 VP NP3. The suitable generative rule is determined by the control conditions defined by the rule: the one whose conditions are all satisfied determines the relationship between the source and target sentential patterns. For example, if the category of VP is vb1, the preposition given is 把, and the senses of the subject, direct object, and indirect object governed by the verb VP correspond to those of the first, second, and third noun phrases (NP), then the source pattern NP1 PP NP2 VP NP3 is associated with the target pattern NP1 VP a NP3 NP2. The asterisk * marks the head element; it propagates all the related features and linguistic information of the head symbol to the reduced non-terminal symbol on the left hand side.
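The selection of a generative rule by its control conditions can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `GenerativeRule` and `Production`, the feature keys (`cat`, `lex`, `sem`, `s:sem`, `o:sem`, `io:sem`), and the sense values are all assumptions made for illustration. In particular, `io:sem` is assumed to name the indirect-object sense of the verb. The rules are tried in order, and a rule without conditions acts as the default.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical representation: constituent label -> its feature/value pairs.
Features = Dict[str, Dict[str, str]]
Condition = Callable[[Features], bool]


@dataclass
class GenerativeRule:
    target: List[str]                                   # target sentential pattern
    conditions: List[Condition] = field(default_factory=list)


@dataclass
class Production:
    lhs: str
    source: List[str]                                   # source sentential pattern
    rules: List[GenerativeRule]                         # tried in the order given

    def select(self, feats: Features) -> Optional[List[str]]:
        """Return the target pattern of the first rule whose conditions all hold."""
        for rule in self.rules:
            if all(cond(feats) for cond in rule.conditions):
                return rule.target
        return None


# The production rule of Section 4.1, encoded under the assumptions above.
ba_rule = Production(
    lhs="S",
    source=["NP1", "PP", "NP2", "VP", "NP3"],
    rules=[
        GenerativeRule(
            target=["NP1", "VP", "a", "NP3", "NP2"],
            conditions=[
                lambda f: f["VP"]["cat"] == "vb1",
                lambda f: f["PP"]["lex"] == "把",
                lambda f: f["VP"]["s:sem"] == f["NP1"]["sem"],
                lambda f: f["VP"]["o:sem"] == f["NP2"]["sem"],
                lambda f: f["VP"]["io:sem"] == f["NP3"]["sem"],
            ],
        ),
        # No conditions: chosen when the first rule does not apply.
        GenerativeRule(target=["NP1", "VP", "NP2", "em", "NP3"]),
    ],
)

# Hypothetical feature values for a sentence such as 她把兩支鋼筆借給了佩德羅.
feats: Features = {
    "NP1": {"sem": "human"},
    "PP": {"lex": "把"},
    "NP2": {"sem": "object"},
    "VP": {"cat": "vb1", "s:sem": "human", "o:sem": "object", "io:sem": "human"},
    "NP3": {"sem": "human"},
}

print(ba_rule.select(feats))  # ['NP1', 'VP', 'a', 'NP3', 'NP2']
```

Keeping the conditions as plain predicates over the feature table means that a parser could evaluate them at reduction time. The sketch omits head propagation through `*`.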
The use of * gives the CSG formalism the property of feature inheritance. The relationship between the source and target constituents is established by the given subscripts, and their sequence follows the target sentential pattern. For example, in the first generative rule, NP1 VP a NP3 NP2, although the first NP in the source pattern corresponds to the first NP in the target one, the positions of the verb and of the second and third noun phrases change in the target sentential pattern.

Understanding the ordering of constituents in the target sentential pattern is very important, because it affects not only the grammatical correctness of the sentence but also its meaning. For example, suppose that the phrase 貧窮的人 (a poor man) is to be translated. If word-by-word translation is applied, the phrase will be translated as pobre homem. Although this translation is grammatically correct, it is not correct in meaning, because the position of an adjective relative to the noun in Portuguese may produce different meanings. In this case, pobre homem means a pitiful man and not a poor man. This problem can be easily solved by defining a CSG production rule that has different generative rules associated with the same source sentential pattern, each controlled by different conditions. As a result, the source phrase 貧窮的人 will be translated as homem pobre (a poor man) instead of pobre homem.

4.2. Feature Descriptors in Attribute Value Matrix

In this model, semantic information is represented by feature descriptors (FD), which give additional flexibility in defining CSG rules for establishing agreement in syntactic and sub-categorization dependencies. Feature descriptors related to a single lexical word or a constituent are encoded in attribute value matrices (AVM). Each FD is a set of pairs of the form a = v, where a is an attribute and v is a value, either an atomic symbol or, recursively, an FD. Moreover, feature unification is performed during the parsing stage. If the FDs of the lexical words or constituents are compatible with each other, i.e. there are no conflicts in the values of the attributes defined, unification succeeds and a new FD is constructed. As an example, consider that a new noun phrase is to be reduced from the words 探測 (probe) and 石油 (petroleum), whose AVMs are shown in Figure 2.

    c-lexical = 探測, category = NP
        p-lexical = tenteamento, sense = medicine
        p-lexical = sondagem, sense = nature (object)
    c-lexical = 石油, category = NP
        p-lexical = petróleo, sense = nature (object)

Figure 2. AVMs of the words 探測 and 石油

If the control condition defined by the rule requires that the senses of the noun phrases be equal to each other, then the unification will select the meaning sondagem (probe), since this sense can be unified with that of petróleo (petroleum).

In traditional unification based approaches [15], if the FDs of lexical words or constituents are not compatible with each other during the unification process, nothing is returned. Consequently, if even one FD unification fails, all the related candidate words are rejected, without any flexibility to choose the next preferable or probable candidate. Thus, in our design, each feature is associated with an initial weight, and ranking is performed during the parsing stage to choose the most suitable candidate word. Suppose that the AVMs of the words 死屍 (corpse) and 漂浮 (to float) are as shown in Figure 3.

    c-lexical = 死屍, category = NP
        p-lexical = desenterrado, sense = human
    c-lexical = 漂浮, category = VP
        p-lexical = flutuar, sense = living creature
        p-lexical = pairar, sense = transportation

Figure 3. AVMs of the words 死屍 and 漂浮

During the parsing stage, if the control condition requires that the sense of NP be equal to the sense of the subject governed by VP, weights are assigned during the validation process, and the candidate with the highest weight is selected for unification.
The assignment of weights is based on the following policies: if unification can be performed between the senses of the lexical words or constituents, the weight is increased by 1; if unification fails, but the sense of one word is an inherited hypernym of the other or vice versa, the weight is increased by 0.5. The FDs with the highest weight are chosen as the most preferable content. In this example, the traditional unification approach would simply return failure. Although there is no exact match between the senses of 死屍 (corpse) and 漂浮 (to float), since the sense human is a hyponym of the sense living creature, the FD of the Portuguese word flutuar (to float) will still be unified with the FD of desenterrado (corpse) and selected as the most suitable candidate.

4.3. Expressiveness of CSG

As mentioned previously, CSG can describe non-standard linguistic phenomena. For example, consider the bilingual sentence pair:

她 /NP1 把 /PP 兩支鋼筆 /NP2 借給了 /VP 佩德羅 /NP3
Ela emprestou ao Pedro duas canetas (She lent two pens to Peter)

It is often the case that linguistic expressions in one language do not appear in the translation in the other language. For instance, the preposition PP does not appear in any of the target rules. Moreover, the Chinese preposition 把 and the verb 借給了 should be paired with the Portuguese verb emprestar (to lend). These observations show that CSG can express not only structural deviations between two different languages, but also discontinuous constituent relationships in the Chinese component.

4.4. CSG Parser

The CSG formalism is parsed by a modified version of the generalized LR algorithm [11] that takes the feature constraints and the inference of the target structure into consideration. This algorithm was chosen mainly for its considerable efficiency over Earley's parsing algorithm [16], which requires a set of computations of LR items at each stage of parsing [11]. Furthermore, the parsing table is extended by adding the feature constraints and the target rules to the actions table.
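Returning to the weighting policy of Section 4.2, the ranking of candidates can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the sense hierarchy `HYPERNYM`, the function names, and the dictionary layout of the candidates are hypothetical. Only the two weights (1 for an exact match of senses, 0.5 for a hypernym relation in either direction) come from the text.

```python
from typing import Dict, List, Optional

# Hypothetical sense hierarchy: specific sense -> its direct hypernym.
HYPERNYM: Dict[str, str] = {
    "human": "living creature",
    "animal": "living creature",
}


def is_hypernym(general: str, specific: str) -> bool:
    """True if `general` is reached by walking up the hierarchy from `specific`."""
    parent = HYPERNYM.get(specific)
    while parent is not None:
        if parent == general:
            return True
        parent = HYPERNYM.get(parent)
    return False


def sense_weight(a: str, b: str) -> float:
    """Weight policy from the text: 1 for equal senses, 0.5 for a hypernym relation."""
    if a == b:
        return 1.0
    if is_hypernym(a, b) or is_hypernym(b, a):
        return 0.5
    return 0.0


def best_candidate(required_sense: str, candidates: List[dict]) -> Optional[dict]:
    """Return the candidate with the highest weight, or None if none scores above 0."""
    if not candidates:
        return None
    scored = [(sense_weight(required_sense, c["sense"]), c) for c in candidates]
    weight, best = max(scored, key=lambda pair: pair[0])  # first wins on ties
    return best if weight > 0 else None


# The subject 死屍 has the sense "human"; the verb 漂浮 has the two
# candidate translations listed in Figure 3.
verb_candidates = [
    {"p-lexical": "flutuar", "sense": "living creature"},
    {"p-lexical": "pairar", "sense": "transportation"},
]

chosen = best_candidate("human", verb_candidates)
print(chosen["p-lexical"])  # flutuar
```

A strict unifier would reject both candidates here, because neither sense equals human. The weighted version still prefers flutuar, with weight 0.5.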

5. Generation of the translation

Once the parse tree is constructed, the translation of the input sentence is generated by referring to the set of generative target sentential patterns that were selected previously. Each node of the parse tree has an associated target sentential pattern, which is used to generate the corresponding translation. Moreover, to ensure that the generated Portuguese translation is grammatically correct, we employ unification of feature descriptors (FD) as a validation operation for each node. Since the AVMs were constructed for each constituent node in the parsing stage, they are reused during the generation phase. Because most of the Portuguese words defined in the FDs are in their base word form, they need to be inflected according to a set of grammatical agreement rules. Thus, extra FDs are added to the AVM, depending on its part-of-speech, for checking the dependencies between Portuguese words in order to generate the target translation correctly. These extra attributes include number, gender, tense, and person. As an example, consider the parse tree of the sentence 她把兩支鋼筆借給了佩德羅 shown in Figure 4.

    S {NP1 VP a NP3 NP2}
        NP1
            pro 她
        PP
            p 把
        NP2 {num NP4}
            num 兩
            NP4 {q NP5}
                q 支
                NP5
                    n 鋼筆
        VP
            v 借給了
        NP3
            npr 佩德羅

Figure 4. Example of a parse tree

Suppose that the translation of the noun phrase 兩支鋼筆 /NP2 (two pens), with the target pattern num NP4, is to be generated. The translations of the words 兩 (two) and 鋼筆 (pen) obtained from the bilingual dictionary are dois and caneta respectively. The FDs of 兩 and 支鋼筆 and their related information are shown in Figure 5.

    c-lexical = 兩, category = num, p-lexical = dois
        FD1 = [gender = male, number = plural]
    c-lexical = 支鋼筆, category = NP4, p-lexical = caneta
        FD2 = [gender = female, number = singular]

Figure 5. AVMs of the words 兩 and 支鋼筆

Unification of FD1 and FD2 will fail because the gender and the number are different. In such a case, the necessary conversions are performed so that FD1 and FD2 become compatible with each other. Therefore, the generated result for 兩支鋼筆 is duas canetas (two pens).
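The conversion step for 兩支鋼筆 can be sketched in Python as follows. The paper does not spell out how the conflicting values are reconciled, so the policy used here is an assumption: the noun keeps its inherent gender, and the numeral imposes its number. The inflection table `FORMS` and the function name `agree` are likewise hypothetical stand-ins for the morphological component.

```python
from typing import Dict, Tuple

# Hypothetical inflection table: (base form, gender, number) -> surface form.
FORMS: Dict[Tuple[str, str, str], str] = {
    ("dois", "male", "plural"): "dois",
    ("dois", "female", "plural"): "duas",
    ("caneta", "female", "singular"): "caneta",
    ("caneta", "female", "plural"): "canetas",
}


def agree(numeral: dict, noun: dict) -> str:
    """Generate 'numeral noun' after making the two FDs compatible.

    Assumed policy: gender is taken from the noun, number from the numeral.
    """
    gender = noun["fd"]["gender"]
    number = numeral["fd"]["number"]
    numeral_form = FORMS[(numeral["p-lexical"], gender, number)]
    noun_form = FORMS[(noun["p-lexical"], gender, number)]
    return f"{numeral_form} {noun_form}"


# FD1 and FD2 as given in Figure 5.
two = {"p-lexical": "dois", "fd": {"gender": "male", "number": "plural"}}
pen = {"p-lexical": "caneta", "fd": {"gender": "female", "number": "singular"}}

print(agree(two, pen))  # duas canetas
```

A table lookup keeps the sketch short. A real generator would need morphological rules or a full-form lexicon covering every word in the dictionary.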
Similarly, since the verb phrase 借給了 /VP (lent) must agree with NP1 and must have the correct tense, the Portuguese word emprestar (to lend) is converted to emprestou (third person singular, past tense). Besides unification, articles may need to be restored for each noun phrase where necessary. For example, the noun phrase 佩德羅 /NP3 (Peter) requires the article o before the Portuguese word Pedro. After all the unifications and article restorations, the sentence becomes Ela emprestou a o Pedro duas canetas. However, the generated sentence is still not entirely correct, because some words are contracted in Portuguese grammar. In this case, the preposition a and the article o should be contracted into the single word ao. Thus, an extra module that checks whether contractions are needed is called last, and the output of the generation module is Ela emprestou ao Pedro duas canetas (She lent two pens to Peter).

6. Conclusion

In this paper, we proposed Chinese-Portuguese query translation for CLIR based on an MT system that parses Constraint Synchronous Grammar. In this architecture, a set of rough segmentation results is generated from the given Chinese sentence and, after all of these candidate sentences are tagged, the one with the highest score is selected. The sentence is then parsed to infer its syntactic structure based on Constraint-based Synchronous Grammar. Unlike transfer-based MT architectures, where the translation process is carried out in sequence by different analytical phases, parsing CSG rules allows the corresponding target sentential pattern to be inferred immediately, so that our approach can reduce information loss during the transfer process. After the parse tree is constructed, it is used to generate the translation, with the assistance of unification between the feature descriptors defined, in order to ensure grammatical correctness and translation quality.

The proposed MT model can remove different types of ambiguity at different stages to enhance the quality of the translation: the creation of word boundaries in the segmentation module removes ambiguity between Chinese words; part-of-speech ambiguity is removed by the probabilistic tagger; structural ambiguities in parse trees can be removed by parsing CSG; and lexical ambiguities, where a word may have more than one meaning (usually referred to as the problem of word sense disambiguation), can be resolved during CSG parsing through the analysis of the neighbors surrounding the ambiguous word.

Acknowledgements

The research work reported in this paper was supported by Fundo para o Desenvolvimento das Ciências e da Tecnologia (Science and Technology Development Fund) under grant 041/2005/A, and by the Research Committee of University of Macau under grant CATIVO:2372.

References

[1] Ballesteros, L., Croft, W. B., Dictionary-based Methods for Cross-Lingual Information Retrieval, Proceedings of the 7th International DEXA Conference on Database and Expert Systems Applications, pp. 781-801.
[2] Bennett, W. and Slocum, J., The LRC Machine Translation System, Computational Linguistics, Vol. 11, No. 2-3, pp. 111-121, 1985.
[3] Peter F. Brown, John Cocke, Stephen Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin, A Statistical Approach to Machine Translation, Computational Linguistics, Vol. 16, No. 2, pp. 79-85, 1990.
[4] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert Mercer, The mathematics of statistical machine translation: parameter estimation, Computational Linguistics, Vol. 19, No. 2, pp. 263-311, 1993.
[5] Ralf D. Brown, Example-Based machine translation in the pangloss system, Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, Denmark, pp. 169-174, 1996.
[6] Melamed I. D., Multitext Grammars and Synchronous Parsers, Proceedings of NAACL/HLT 2003, Edmonton, pp. 79-86, 2003.
[7] S. M. Shieber and Y. Schabes, Synchronous Tree-Adjoining Grammars, Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), Helsinki, Finland, pp. 253-258, 1990.
[8] A. Abeillé, Y. Schabes, A. Joshi, Using lexicalized TAGs for machine translation, Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), Helsinki, Finland, Vol. 3, pp. 1-7, 1990.
[9] Seki, H., Matsumura, T., Fujii, M., Kasami, T., On multiple context-free grammars, Theoretical Computer Science, Vol. 88, No. 2, pp. 191-229, 1991.
[10] Wong F., Hu D. C., Mao Y. H., Dong M. C., and Li Y. P., Machine Translation Based on Constraint-Based Synchronous Grammar, Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP-05), Vol. 3651, Jeju Island, Republic of Korea, pp. 612-623, 2005.
[11] Tomita, M., An efficient augmented-context-free parsing algorithm, Computational Linguistics, Vol. 13, No. 1-2, pp. 31-46, 1987.
[12] Zhang HP, Liu Q., Model of Chinese words rough segmentation based on N-shortest-paths method, Journal of Chinese Information Processing, Vol. 16, No. 5, pp. 1-7, 2002.
[13] Leong K. S., Wong F., Tang C. W., and Dong M. C., CSAT: A Chinese Segmentation and Tagging Module Based on the Interpolated Probabilistic Model, Proceedings in Computational Methods in Engineering and Science (EPMESC-X), Sanya, Hainan, China, pp. 1092-1098, 2006.
[14] Rabiner L., A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
[15] K. Ronald, The Formal Architecture of Lexical-Functional Grammar, Journal of Information Science and Engineering, Vol. 5, pp. 305-322, 1989.
[16] Earley J., An Efficient Context-Free Parsing Algorithm, Communications of the Association for Computing Machinery, Vol. 13, No. 2, pp. 94-102, 1970.