Natural Languages Analysis in Machine Translation (MT) based on the STCG (STRING-TREE CORRESPONDENCE GRAMMAR)

Tang Enya Kong, Zaharin Yusoff
Unit Terjemahan Melalui Komputer, Pusat Pengajian Sains Komputer, Universiti Sains Malaysia, 11800 Minden, Pulau Pinang, Malaysia.
[e-mail: enyakong@cs.usm.my and zarin@cs.usm.my]

0. Abstract

The String-Tree Correspondence Grammar (STCG) [1] is a grammar formalism for defining a set of strings (a language), a set of trees (valid representation/interpretation structures), and the mapping between the two (to be interpreted for analysis and generation). The formalism is argued to be a totally declarative grammar formalism that can associate, to strings in a language, arbitrary tree structures as desired by the grammar writer to be the linguistic representation structures of the strings. More important is the facility to specify the correspondence between the string and the associated tree in a very natural manner. These features are very much desired in grammar writing, in particular for the treatment of certain linguistic phenomena which are 'non-standard', namely featurisation, lexicalisation and crossed dependencies [2,3]. Furthermore, a grammar written in this way naturally inherits the desired property of bi-directionality (in fact non-directionality [4]), such that the same grammar can be interpreted for both analysis and generation. In this paper, we investigate the properties of the STCG for interpretation towards analysis (as is understood within the context of Machine Translation (MT)). Other than using STCG grammars as specifications for the automatic generation of analysis programs in the Specialised Languages for Linguistic Programming (SLLPs) of MT systems (a study reported in [5,6]), the work also centres around the specification of a general analyser/parser for the STCG. The proposed STCG analyser is capable of mimicking some very useful features in various context-free parsing techniques.
One such feature is the use of charts in tabular parsing algorithms, as exemplified in Earley's Algorithm [7], which is very helpful in avoiding redundancies that may otherwise result in a combinatorial explosion. Another is the compact way of representing possible parse trees for ambiguous sentences, such as the one seen in [8]. Though not reported in this paper, we note that the proposed analyser also provides a natural way of handling the kind of awkward phenomena mentioned above (namely lexicalisation, featurisation, and worst of all, crossed dependencies) while at the same time retaining much of the efficiency of standard context-free parsing algorithms (a study reported in [2,3]).

1. The STCG Formalism

The String-Tree Correspondence Grammar is a declarative grammar formalism that can be used to describe the correspondence between strings of terms and trees. In particular, linguistic rules are written with utterances as the string of terms (henceforth STRING) and the corresponding representative linguistic structures as the tree (henceforth TREE). Figure 1 gives an indication of a full STCG rule. The structure of the TREE is totally specified by the linguist and is not constrained by any application of rules (as is the case for the parse tree in the classical context-free grammar). In a rule, the main correspondence is first declared: in the example, the STRING #1.v.#2.part (with #1 and #2 being string variables, i.e. variables which are instantiable to strings of terms) is set to correspond to the TREE with root node S (which contains forest variables, i.e. variables that can be instantiated to lists of subtrees). The main-corr(espondence) is followed by a declaration of subcorrespondences (on the right hand
side) between substrings of the STRING and subtrees of the TREE, each of which possibly has a list of references (rule names). For example, the sub-corr(espondence) between the substring #1 and the subtree rooted at the node 1 refers to the rules R..., the latter being other rules in the grammar. This reference is a mechanism by means of which the string and forest variables mentioned earlier are fully instantiated via an operation called identification [9,10], resulting in a correspondence between explicit strings of terms and trees, both without variables. In actual fact, the main-corr as well as the sub-corrs specified in the rule are formally recorded in terms of a Structured String Tree Correspondence (SSTC) transparent to the linguist [11], as illustrated in figure 2, where a given correspondence may be non-projective (e.g. with discontinuous constituents), as is the case for the node v(part) in the example. Note also that the particle is chosen (by the linguist) to be represented as a collection of features in the node v, a case of featurisation.

[Figure 1: an STCG rule. Main-corr: the STRING #1.v.#2.part corresponds to a TREE containing the node v(part). Sub-corrs: #1 and #2, each with reference rules R...; v(part), with v = pick, etc. and part = up, etc.]

[Figure 2: the SSTC recorded for the rule of figure 1. The symbols of the STRING #1.v.#2.part are indexed by the intervals a_b, b_c, c_d and d_e, and each node of the TREE is annotated with a pair of intervals; the node v(part) carries the discontinuous interval b_c+d_e.]

In very simple terms, a string-to-tree correspondence in the STCG can be viewed as analogous to the mathematical definition of a relation between integer numbers, as in the example of a function f(x) defined piecewise. Here, a relation (in this case a function) f is defined in terms of finer subrelations according to the subdomains.

[Example: a function f(x) defined by different expressions over separate subdomains of x.]

A set of STCG rules forms a grammar, some of which are axiom rules (i.e. start rules or rules containing axiom trees, as with the axiom or the start symbol S in the classical context-free grammar). With the semantics of the rules being as indicated above, a grammar thus defines a language of strings, a language of representation trees, and the correspondence between elements of the two languages/sets.
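To make the role of string variables concrete, the following minimal sketch enumerates the ways a STRING pattern such as #1.v.#2.part can be instantiated over a tokenised sentence. It is an illustration only, not the analyser described in the paper: the function name `match_pattern`, the half-open word intervals, the toy lexicon, and the assumption that a string variable covers at least one word are all introduced here for the example.

```python
from typing import Dict, Iterator, List, Set, Tuple

Interval = Tuple[int, int]  # half-open word positions (start, end)


def match_pattern(
    tokens: List[str],
    pattern: List[str],
    lexicon: Dict[str, Set[str]],
    start: int = 0,
) -> Iterator[Dict[str, Interval]]:
    """Yield every assignment of intervals to the symbols of `pattern`.

    Symbols starting with '#' are string variables (assumed here to cover
    one or more words); any other symbol must match one word that the
    hypothetical `lexicon` lists for it.
    """
    if not pattern:
        # The whole input must be consumed for a match to count.
        if start == len(tokens):
            yield {}
        return
    head, rest = pattern[0], pattern[1:]
    if head.startswith("#"):
        # Try every non-empty run of words for the string variable.
        for end in range(start + 1, len(tokens) + 1):
            for tail in match_pattern(tokens, rest, lexicon, end):
                yield {head: (start, end), **tail}
    elif start < len(tokens) and tokens[start].lower() in lexicon.get(head, set()):
        for tail in match_pattern(tokens, rest, lexicon, start + 1):
            yield {head: (start, start + 1), **tail}


# Toy data (assumed for illustration): the lexicon is not the paper's.
TOKENS = ["He", "picks", "the", "ball", "up"]
PATTERN = ["#1", "v", "#2", "part"]
LEXICON = {"v": {"pick", "picks"}, "part": {"up"}}
MATCHES = list(match_pattern(TOKENS, PATTERN, LEXICON))
```

With this toy lexicon the pattern has a single instantiation, in which #1 covers "He" and #2 covers "the ball"; in a full grammar each such interval would then be handed to the reference rules of the corresponding sub-corr.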
It is this set of string-tree correspondences that can be interpreted for both analysis and generation.

2. Natural Languages Analysis in MT Based on the STCG

Initially, the STCG was designed to serve as a specification language for writing grammars in MT, such that the specifications written in the STCG grammar formalism can then be coded (manually) into the linguistic programs for analysis and generation written in the SLLPs of integrated MT systems. Some substantial work has also been carried out to automate this process, namely towards the automatic generation of analysis programs in the MT systems ARIANE [12] and JEMAH [13] from grammars written in the STCG formalism (see for example [5,6]). However, due to certain limitations in the existing SLLPs for the realisation of a proper implementation of an STCG analyser (as discussed in [2]), we have decided instead to look into the design of an analyser which can directly interpret the STCG grammar.

2.1. The Fundamental Design of the STCG Analyser

As we have seen above, an STCG grammar actually defines a set of SSTCs in a way quite similar to the definition of a mathematical function. In evaluating a mathematical function, if the function is defined in terms of other sub-functions, then it can only be completely evaluated after all its sub-functions have been evaluated and have returned the appropriate values. We can view the STCG analysis process in the same manner where, by taking the input string/sentence as their STRING, the set of explicit SSTCs defined by the axiom rules of a grammar is constructed based on the resultant sub-SSTCs defined by the reference rules of these axiom rules. Since the
reference rules of the axiom rules may in turn refer to other rules, they may also return the completed SSTCs only after their respective reference rules have been completed. This reference process will terminate when all remaining sub-SSTCs evaluated are defined by subcorrespondences which do not refer to any other rule, namely the 'lexical-SSTCs', which must match with the input words (the non-lexical SSTCs are called 'phrasal-SSTCs'). We illustrate this in the following analysis of the input string "He picks the ball up" with respect to a grammar consisting of rule R1 given in figure 1 and rules R1, R3 given in figure 3. The rule R1 is given as an axiom rule. The analysis process begins with the evaluation of the general SSTC defined by the axiom rule R1, which in turn leads to the evaluation of two other sub-SSTCs defined by the reference rules R1, R3, as illustrated in figure 4.

[Figure 3: rules R1 and R3, each given as a main-corr with lexical sub-corrs; R1 lists John, ball, he, ..., etc., and R3 lists the, etc. followed by ball, etc.]

[Figure 4: analysis of "He picks the ball up", with the words indexed by the intervals He 0_1, picks 1_2, the 2_3, ball 3_4, up 4_5. On the left, the SSTC of the axiom rule is expanded by applying the reference rules R1 and R3 into phrasal-SSTCs and then lexical-SSTCs; the node v(part) is annotated (1_2+4_5/1_2+4_5), covering both "picks" and "up". On the right, the symbols of the STRING carry index variables (a, b, c, d) that are bound to the word positions.]

In the diagram above (on the left), the analysis process expands the SSTC defined by the axiom rule into a string of sub-SSTCs, which is further expanded into another string of sub-SSTCs until it cannot be expanded any further, which is when the string of sub-SSTCs consists only of lexical-SSTCs. The string of lexical-SSTCs is then matched with the words in the input string.
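The interval annotations of figure 4 can be sketched as a small data structure. The sketch below is illustrative only: it assumes that each tree node records the intervals of the words it corresponds to directly (its SNODE), that the intervals of a subtree (its STREE) are the union of the SNODE intervals beneath it, and that the tree has the shape S(NP(he), VP(v(part), NP(the, ball))). The class `Node`, the helper names and the labels `det` and `n` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[int, int]  # word positions, e.g. (1, 2) for "picks"


def merge(spans: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse those that touch or overlap."""
    merged: List[Interval] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


@dataclass
class Node:
    label: str
    snode: List[Interval] = field(default_factory=list)  # words for the node itself
    children: List["Node"] = field(default_factory=list)

    def stree(self) -> List[Interval]:
        """Intervals covered by the whole subtree rooted at this node."""
        spans = list(self.snode)
        for child in self.children:
            spans.extend(child.stree())
        return merge(spans)


def is_projective(node: Node) -> bool:
    """A subtree is projective here if it covers one contiguous interval."""
    return len(node.stree()) <= 1


# "He picks the ball up": He 0_1, picks 1_2, the 2_3, ball 3_4, up 4_5.
V_PART = Node("v(part)", snode=[(1, 2), (4, 5)])  # "picks" ... "up"
OBJECT_NP = Node("NP", children=[Node("det", [(2, 3)]), Node("n", [(3, 4)])])
VP = Node("VP", children=[V_PART, OBJECT_NP])
ROOT = Node("S", children=[Node("NP", children=[Node("he", [(0, 1)])]), VP])
```

Under these assumptions the node v(part) covers the two separate intervals 1_2 and 4_5 and is therefore non-projective, while VP as a whole covers the contiguous interval 1_5, in line with the annotations legible in figure 4.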
Note that the matching need not be in a projective manner, as can be seen in this particular example, where the lexical-SSTCs are matched to the words in the input string in a crossed serial manner, a case of crossed dependencies. In order to keep track of such non-projective correspondences, we introduce the use of index variables to record the interval corresponding to each symbol appearing in the STRING (as illustrated on the right of figure 4). In [2], we proposed a design of the STCG analysis algorithm which is capable of mimicking some very useful features in various context-free parsing techniques. One such feature is the use of charts in tabular parsing algorithms, as exemplified in Earley's Algorithm [7], which is very helpful in avoiding redundancies that may otherwise result in a combinatorial explosion. Another is the representation of a shared forest in terms of STCG grammar rules, which in fact follows the approach adopted in [8], as illustrated in the next section.

2.2. Multiple Results of Analysis for an Ambiguous Input Sentence

The example sentence given above is unambiguous, and thus corresponds to only a single representation tree. However, natural language grammars are known to be in the class of highly ambiguous grammars, and as such, there may be numerous representation trees generated for a single sentence in the language described. Instead of storing each representation tree separately in the set of SSTCs defining the correspondences between the given sentence and all its possible representation trees, we should try to represent all these in a space-efficient manner. In figure 6, we present a compact way of representing a set of SSTCs corresponding to an ambiguous sentence by means of an AND-OR graph of rules, similar to the technique used by [8]. For example, the two SSTCs

[Figure 5: Two linguistic representations of the sentence "John saw Mary in the boat", each an SSTC whose nodes are annotated with intervals over the six words.]
can be factorised into an AND-OR graph of rules R2, R3, R5, RPP (given below) and rules R1, R3 (given in figure 3) in the following manner:

[Figure 6: An AND-OR graph of STCG grammar rules, whose leaves are the words (John) (saw) (Mary) (in) (the) (boat).]

[Definitions of rules R2, R3, R5 and RPP, each given as a main-corr between a STRING pattern (e.g. #A.v.#B, #A.#B) and a TREE, with sub-corrs referring to rules among R1, R2, R3, R5 and RPP.]
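The space saving of an AND-OR graph can be illustrated with a generic packed forest for the same sentence. The sketch below is not the rule graph of figure 6: the category labels, the spans and the two attachment sites for the prepositional phrase are assumptions made for the example. Each key is an OR-node, each tuple under it is an AND-node listing its children, and any name that is not a key is treated as a word.

```python
from typing import Dict, List, Tuple

# OR-node -> list of alternatives (AND-nodes); spans are word positions.
FOREST: Dict[str, List[Tuple[str, ...]]] = {
    "S[0,6]": [("NP[0,1]", "VP[1,6]"),   # reading 1: PP inside the object NP
               ("S[0,3]", "PP[3,6]")],   # reading 2: PP attached to the clause
    "S[0,3]": [("NP[0,1]", "VP[1,3]")],
    "VP[1,6]": [("V[1,2]", "NP[2,6]")],
    "VP[1,3]": [("V[1,2]", "NP[2,3]")],
    "NP[2,6]": [("NP[2,3]", "PP[3,6]")],
    "PP[3,6]": [("P[3,4]", "NP[4,6]")],  # stored once, used by both readings
    "NP[4,6]": [("Det[4,5]", "N[5,6]")],
    "NP[0,1]": [("John",)],
    "V[1,2]": [("saw",)],
    "NP[2,3]": [("Mary",)],
    "P[3,4]": [("in",)],
    "Det[4,5]": [("the",)],
    "N[5,6]": [("boat",)],
}


def count_trees(node: str) -> int:
    """Number of distinct trees packed under `node`."""
    if node not in FOREST:  # a word
        return 1
    total = 0
    for alternative in FOREST[node]:  # OR: sum over alternatives
        product = 1
        for child in alternative:     # AND: multiply over children
            product *= count_trees(child)
        total += product
    return total


def unpack(node: str) -> List[str]:
    """Enumerate the packed trees as bracketed strings."""
    if node not in FOREST:
        return [node]
    label = node.split("[")[0]
    trees: List[str] = []
    for alternative in FOREST[node]:
        partial = [""]
        for child in alternative:
            partial = [f"{p} {c}" if p else c
                       for p in partial for c in unpack(child)]
        trees.extend(f"({label} {body})" for body in partial)
    return trees


READINGS = unpack("S[0,6]")
```

Here the subtree for "in the boat" is stored once and shared by both readings, which is the kind of factorisation the AND-OR graph of rules is meant to achieve; `unpack` recovers the individual trees only when they are actually needed.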
3. Concluding Remarks

Recently, efficient context-free parsing methods such as the LR parser and Earley's Algorithm have been referred to extensively in implementing parsers for most of the formalisms used in the field of NLP. In an effort to retain the efficiency of standard context-free parsing algorithms, most recent declarative formalisms are typically restricted by the constraint of string concatenation in context-free grammars, which allows a sentence to be systematically decomposed so that the parsing process can be indexed by the subparts of that decomposition (the substrings). However, it has also been widely recognised that the concatenation restriction of CFG can be problematic in handling phenomena such as lexicalisation, featurisation, and especially crossed dependencies. As an alternative, we propose the STCG formalism, which allows for a more 'natural' way of specifying the strings of the language being described, their corresponding linguistically motivated representation trees, and the correspondence between the two, where the correspondence need not be projective and is hence appropriate for the said phenomena. The standard CF parsing methods cannot be adopted directly in the analysis of an input sentence with respect to an STCG grammar, because the STRING patterns of the STCG need not submit to the concatenation restriction of CFG. Nevertheless, in this paper we present the general layout of an analyser for the STCG which is capable of mimicking some very useful features in various context-free parsing techniques (due to the space constraint, only the layout is given; interested readers may find more details in [2]). One such feature is the use of charts in tabular parsing algorithms, as exemplified in Earley's Algorithm [7], which is very helpful in avoiding redundancies that may otherwise result in a combinatorial explosion. Another is the compact way of representing possible parse trees for ambiguous sentences, such as the one seen in [8].
Furthermore, we have also provided a natural way of handling awkward phenomena such as lexicalisation, featurisation, and worst of all, crossed dependencies, while at the same time retaining much of the efficiency of standard context-free parsing algorithms [2,3].

REFERENCES

[1] Zaharin Y., String-Tree Correspondence Grammar: a declarative grammar formalism for defining the correspondence between strings of terms and tree structures, Proceedings of the 3rd Conference of the European Chapter of the ACL, Copenhagen, April 1987.
[2] Tang Enya Kong, Natural languages Analysis in machine translation (MT) based on the STCG, PhD thesis, Universiti Sains Malaysia, Penang, March 1994.
[3] Tang Enya Kong, Zaharin Y., Handling Crossed Dependencies with the STCG, Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'95), Sofitel Ambassador Hotel, Seoul, Korea, Dec. 4-6, 1995.
[4] Yves Lepage, Parsing and Generating Context-Sensitive Languages with Correspondence Identification Grammars, Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'91), Singapore, 25-26 Nov 1991.
[5] Zaharin Yusoff, Tang Enya Kong, Generation of analysis programs in ROBRA (ARIANE) from String-Tree Correspondence Grammars (or a Strategy for Analysis in machine translation), Proceedings of the 3rd Machine Translation Summit, Washington, D.C., July 1991.
[6] Zaharin Y., Tang Enya Kong, String-Tree Correspondence Grammars as a base for the automatic generation of analysis programs in machine translation, Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, June 1991.
[7] J. Earley, An efficient context-free parsing algorithm, Communications of the ACM, Vol. 13, Num. 2, Feb 1970, pp. 94-102.
[8] Lang, B., Towards a Uniform Formal Framework for Parsing, In: Current Issues in Parsing Technology, M. Tomita (ed.), Kluwer Academic Publishers, 1991, pp. 153-171.
[9] Zaharin Y., Strategies and heuristics in the analysis of natural languages in machine translation, PhD thesis, Universiti Sains Malaysia, Penang, March 1986.
[10] Y. Lepage, Un système de grammaires correspondancielles d'identification, thèse de Docteur, IMAG, Université Joseph Fourier, Grenoble, June 1989.
[11] Zaharin Yusoff, Christian Boitet, Representation trees and string-tree correspondences, Proceedings of the 12th International Conference on Computational Linguistics, COLING-88, Budapest, August 1988, pp. 59-64.
[12] Ch. Boitet, P. Guillaume, M. Quezel-Ambrunaz, Le point sur ARIANE-78, début 1982 (DSE-1), vol. 1, part. 1: le logiciel, GETA, avril 1982.
[13] Tong Loong Cheong, The JEMAH System: Reference Manual, UTMK document, USM, Penang, 1988.