The Acquisition and Use of Context-Dependent Grammars for English

Robert F. Simmons* (University of Texas)
Yeong-Ho Yu† (University of Texas)

This paper introduces a paradigm of context-dependent grammar (CDG) and an acquisition system that, through interactive teaching sessions, accumulates the CDG rules. The resulting context-sensitive rules are used by a stack-based, shift/reduce parser to compute unambiguous syntactic structures of sentences. The acquisition system and parser have been applied to the phrase structure and case analyses of 345 sentences, mainly from newswire stories, with 99% accuracy. Extrapolation from our current grammar predicts that about 25 thousand CDG rule examples will be sufficient to train the system in phrase structure analysis of most news stories. Overall, this research concludes that CDG is a computationally and conceptually tractable approach for the construction of sentence grammar for large subsets of natural language text.

1. Introduction

An enduring goal for natural language processing (NLP) researchers has been to construct computer programs that can read narrative, descriptive texts such as newspaper stories and translate them into knowledge structures that can answer questions, classify the content, and provide summaries or other useful abstractions of the text. An essential aspect of any such NLP system is parsing: translating the indefinitely long, recursively embedded strings of words into definite ordered structures of constituent elements. Despite decades of research, parsing remains a difficult computation that often results in incomplete, ambiguous structures, and computational grammars for natural languages remain notably incomplete. In this paper we suggest that a solution to these problems may be found in the use of context-sensitive rules applied by a deterministic shift/reduce parser. A system is described for rapid acquisition of a context-sensitive grammar based on ordinary news text.
The resulting grammar is accessed by deterministic, bottom-up parsers to compute phrase structure or case analyses of texts that the grammars cover. The acquisition system allows a linguist to teach a CDG grammar by showing examples of parsing successive constituents of sentences. At this writing, 16,275 example constituents have been shown to the system and used to parse 345 sentences ranging from 10 to 60 words in length, achieving 99% accuracy. These examples compress to a grammar of 3,843 rules that are equally effective in parsing. Extrapolation from our data suggests that acquiring an almost complete phrase structure grammar for AP Wire text will require about 25,000 example rules. The procedure is further demonstrated to apply directly to computing superficial case analyses from English sentences.

* Department of Computer Sciences, AI Lab, University of Texas, Austin TX 78712. E-mail @cs.texas.edu
† Boeing Helicopter Computer Svces, Philadelphia, PA
© 1992 Association for Computational Linguistics

Computational Linguistics Volume 18, Number 4

One of the first lessons in natural or formal language analysis is the Chomsky (1957) hierarchy of formal grammars, which classifies grammar forms from unrestricted rewrite rules, through context-sensitive, context-free, and the most restricted, regular grammars. It is usually conceded that pure, context-free grammars are not powerful enough to account for the syntactic analysis of natural languages (NL) such as English, Japanese, or Dutch, and most NL research in computational linguistics has used either augmented context-free or ad hoc grammars. The conventional wisdom is that context-sensitive grammars probably would be too large and conceptually and computationally intractable. There is also an unspoken supposition that the use of a context-sensitive grammar implies using the kind of complex parser required for parsing a fully context-sensitive language.

However, NL research based on simulated neural networks took a context-based approach. One of the first hints came from the striking finding from Sejnowski and Rosenberg's NETtalk (1988), that seven-character contexts were largely sufficient to map each character of a printed word into its corresponding phoneme, where each character actually maps in various contexts into several different phonemes. For accomplishing linguistic case analyses McClelland and Kawamoto (1986) and Miikkulainen and Dyer (1989) used the entire context of phrases and sentences to map string contexts into case structures. Robert Allen (1987) mapped nine-word sentences of English into Spanish translations, and Yu and Simmons (1990) accomplished comparable context-sensitive translations between English and German simple sentences. It was apparent that the contexts in which a word occurred provided information to a neural network that was sufficient to select correct word sense and syntactic structure for otherwise ambiguous usages of language.
In order to solve a problem of accepting indefinitely long, complex sentences in a fixed-size neural network, Simmons and Yu (1990) showed a method for training a network to act as a context-sensitive grammar. A sequential program accessed that grammar with a deterministic, single-path parser and accurately parsed descriptive texts. Continuing that research, 2,000 rules were accumulated and a network was trained using a back-propagation method. The training of this network required ten days of continuous computation on a Symbolics Lisp Machine. We observed that the training cost increased by more than the square of the number of training examples and calculated that 10,000-20,000 rules might well tax a supercomputer. So we decided that storing the grammar in a hash table would form a far less expensive option, provided we could define a selection algorithm comparable to that provided by the trained neural network.

In this paper we describe such a selection formula to select rules for context-sensitive parsing, a system for acquiring context-sensitive rules, and experiments in analysis and application of the grammar to ordinary newspaper text. We show that the application of context-sensitive rules by a deterministic shift/reduce parser is a conceptually and computationally tractable approach to NLP that may allow us to accumulate practical grammars for large subsets of English texts.

2. Context-Dependent Parsing

In NL research most interest has centered on context-free grammars (CFG), augmented with feature tests and transformations, used to describe the phrase structure of sentences. There is a broad literature on Generalized Phrase Structure Grammar (Gazdar et al. 1985), Unification Grammars of various types (Shieber 1986), and Augmented Transition Networks (J. Allen 1987). Gazdar (1988) calls attention to a subcategory of context-sensitive grammars called indexed languages and illustrates some applicability to natural languages, and Joshi illustrates an application of "mild context-sensitivity" (Joshi 1987), but in general, NL computation with context-sensitive grammars is a largely unexplored area.

While a few advanced NLP laboratories have developed grammars and parsing capabilities for significantly large subsets of natural language,[1] it cannot be denied that massive effort was required and that the results are plagued by ambiguous interpretations. These grammars are typically a context-free form, augmented by complex feature tests, transformations, and occasionally, arbitrary programs. The combination of even an efficient parser with such intricate grammars may greatly increase computational complexity of the parsing system (Tomita 1985). It is extremely difficult to write and maintain such grammars, and they must frequently be revised and retested to ensure internal consistency as new rules are added. We argue here that an acquisition system for accumulating context-sensitive rules and their application by a deterministic shift/reduce parser will greatly simplify the process of constructing and maintaining natural language parsing systems.

Although we use context-sensitive rules of the form uxv → uyv, they are interpreted by a shift/reduce parser with the result that they can be applied successfully to the LR(k) subset of context-free languages. Unless the parser is augmented to include shifts in both directions, the system cannot parse context-sensitive languages. It is an open question as to whether English is or is not context-sensitive, but it definitely includes discontinuous constituents that may be separated by indefinitely many symbols. For this reason, future developments of the system may require operations beyond shift and reduce in the parser.
To avoid the easy misinterpretation that our present system applies to context-sensitive languages, we call it Context-Dependent Grammar (CDG).

We begin with the simple notion of a shift/reduce parser. Given a stack and an input string of symbols, the shift/reduce parser may only shift a symbol to the stack (Figure 1a) or reduce symbols on the stack by rewriting them as a single symbol (Figure 1b). We further constrain the parser to reduce no more than two symbols on the stack to a single symbol. The parsing terminates when the stack contains only a single root element and the input string is empty. Usually this class of parser applies a CFG to a sentence, but it is equally applicable to CDG.

[1] Notable examples include the large augmented CFGs at IBM Yorktown Hts, the Univ. of Pennsylvania, and the Linguistic Research Ctr. at the Univ. of Texas.

[Figure 1: Shift/reduce parser. (a) Shift operation. (b) Reduce operation. Each panel shows the input sentence and the stack before and after the operation; the t symbols are terminals and the NT symbols are non-terminals.]

2.1 CDG Rule Forms

The theoretical viewpoint is that the parse of a sentence is a sequence of states, each composed of a condition of the stack and the input string. The sequence ends successfully when the stack contains only the root element (e.g. SNT), and the input string is empty. Each state can be seen as the left half of a context-sensitive rule whose right half is the succeeding state.

    $\mathit{stack}_s \; \mathit{input}_s \rightarrow \mathit{stack}_{s+1} \; \mathit{input}_{s+1}$

However, sentences may be of any length and are often more than forty words, so the resulting strings and stacks would form very cumbersome rules of variable lengths. To avoid this difficulty, the stack and input parts of a rule are limited to five symbols each.

In the following example the stack and input parts are separated by the symbol "*" as the idea is applied to the sentence "The old man from Spain ate fish." The symbol _ stands for blank, art for article, adj for adjective, p for preposition, n for noun, and v for verb. The syntactic classes are assigned by dictionary lookup in a context-sensitive dictionary.[2]

    The  old  man  from  Spain  ate  fish
    art  adj  n    p     n      v    n

    _ _ _ _ _ * art adj n p n
    _ _ _ _ art * adj n p n v
    _ _ _ art adj * n p n v n
    _ _ art adj n * p n v n _
    _ _ _ art np * p n v n _
    _ _ _ _ np * p n v n _
    _ _ _ np p * n v n _ _
    _ _ np p n * v n _ _ _
    _ _ _ np pp * v n _ _ _
    _ _ _ _ np * v n _ _ _
    _ _ _ np v * n _ _ _ _
    _ _ np v n * _ _ _ _ _
    _ _ _ np vp * _ _ _ _ _
    _ _ _ _ snt * _ _ _ _ _

The analysis terminates with an empty input string and the single symbol "snt" on the stack, successfully completing the parse. Note that the first four states correspond to successive shifts, followed by the two reductions adj n → np and art np → np. Subsequently the p and n were shifted onto the stack and then reduced to a pp; then the np and pp on the stack were reduced to an np, followed by the shifting of v and n, their reduction to vp, and a final reduction of np vp → snt. Illustrations similar to this are often used to introduce the concept of parsing in AI texts on natural language (e.g. J. Allen 1987).

We could perfectly well record the grammar in pairs of successive states as follows:

    _ _ _ np p * n v n _ _  →  _ _ np p n * v n _ _ _
    _ _ np p n * v n _ _ _  →  _ _ _ np pp * v n _ _ _

but some economy can be achieved by recording the operation and possible label as the right half of a rule. So for the example immediately above, we record:

    _ _ _ np p * n v n _ _  →  (S)
    _ _ np p n * v n _ _ _  →  (R pp)

where S shifts and (R pp) replaces the top two elements of the stack with pp to form the next state of the parse. Thus a windowed context of ten symbols is created as the left half of a rule and an operation as the right half. Note that if the stack were limited to the top two elements, and the input to a single element, the rule system would reduce to a binary rule CFG.

The example in Figure 2 shows how a sentence "Treatment is a complete rest and a special diet" is parsed by a context-sensitive shift/reduce parser. Terminal symbols are lowercase, while nonterminals are uppercase. The shaded areas represent the parts of the context invisible to the system. The next operation is solely decided by the windowed context. It can be observed that the last state in the analysis is the single symbol SNT, the designated root symbol, on the stack along with an empty input string, successfully completing the parse. And this is the CDG form of rule used in the phrase structure analysis.

[2] Described in Section 7.3.

[Figure 2: An example of windowed context. For the sentence "Treatment is a complete rest and a special diet." (n v det adj n cnj det adj n), the figure lists the stack, the remaining input, and the operation at each step: five shifts, two reductions to NP, four shifts, two reductions to NP, then reductions to CNP, NP, VP, and S.]

2.2 Algorithm for the Shift/Reduce Parser

The parser accepts a string of syntactic word classes as its input and forms a ten-symbol vector, five symbols each from the stack and the input string. It looks up this vector as the left half of a production in the grammar and interprets the right half of the production as an instruction to modify the stack and input sequences to construct the next state of the parse. To accomplish these tasks, it maintains two stacks, one for the input string and one for the syntactic constituents. These stacks may be arbitrarily large. An algorithm for the parser is described in Figure 3.

The most important part of this algorithm is to find an applicable CDG rule from the grammar. Finding such a rule is based on the current windowed context. If there is a rule whose left side exactly matches the current windowed context, that rule will be applied. However, realistically, it is often the case that there is no exact match with any rule.
Therefore, it is necessary to find a rule that best matches the current context.
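The parsing cycle described in this section (build a ten-symbol window, look it up, then shift or reduce) can be sketched in Python as follows. This is a minimal illustration, not the authors' Lisp implementation: the names `window`, `apply_op`, `record_rules`, and `parse` are hypothetical, rules are recorded from a taught operation sequence for the example sentence of Section 2.1, and lookup here is exact-match only, so the best-match selection discussed later is deliberately left out.

```python
BLANK = "_"
SHIFT = ("S",)

def window(stack, inp, k=5):
    """Ten-symbol windowed context: top k stack symbols plus first k input symbols."""
    top, head = stack[-k:], inp[:k]
    return tuple([BLANK] * (k - len(top)) + top + head + [BLANK] * (k - len(head)))

def apply_op(op, stack, inp):
    """Apply ('S',) or ('R', label) and return the next (stack, input) state."""
    if op[0] == "S":
        return stack + [inp[0]], inp[1:]
    # Reduce: replace the top two stack symbols with the given label.
    return stack[:-2] + [op[1]], inp

def record_rules(tags, ops):
    """Replay a taught operation sequence, recording context -> operation rules."""
    rules, stack, inp = {}, [], list(tags)
    for op in ops:
        rules[window(stack, inp)] = op
        stack, inp = apply_op(op, stack, inp)
    return rules, stack, inp

def parse(tags, rules, root="snt"):
    """Deterministic shift/reduce parse using exact-match lookup only (assumption)."""
    stack, inp = [], list(tags)
    while not (stack == [root] and not inp):
        op = rules[window(stack, inp)]  # raises KeyError when no rule matches exactly
        stack, inp = apply_op(op, stack, inp)
    return stack

# "The old man from Spain ate fish" with the operations described in Section 2.1.
tags = ["art", "adj", "n", "p", "n", "v", "n"]
ops = [SHIFT, SHIFT, SHIFT, ("R", "np"), ("R", "np"), SHIFT, SHIFT,
       ("R", "pp"), ("R", "np"), SHIFT, SHIFT, ("R", "vp"), ("R", "snt")]
rules, final_stack, remaining = record_rules(tags, ops)
```

Replaying the thirteen taught operations yields thirteen distinct rules, and `parse(tags, rules)` then reproduces the same sequence of states from the rules alone.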

    CD-SR-Parser(Input, Cdg)
        Input is a string of syntactic classes for the given sentence.
        Cdg is the given CDG grammar rules.

        Stack := empty
        do until (Input = empty and Stack = (SNT))
            Windowed-context := Append(Top_five(Stack), First_five(Input))
            Operation := Consult_CDG(Windowed-context, Cdg)
            if First(Operation) = SHIFT
                then Stack := Push(First(Input), Stack)
                     Input := Rest(Input)
                else Stack := Push(Second(Operation), Pop(Pop(Stack)))
        end do

The functions Top_five and First_five return the lists of the top (or first) five elements of the Stack and the Input, respectively. If there are not enough elements, these procedures pad with blanks. The function Append concatenates two lists into one. Consult_CDG consults the given CDG rules to find the next operation to take. The details of this function are the subject of the next section. Push and Pop add or delete one element to/from a stack, while First and Second return the first or second elements of a list, respectively. Rest returns the given list minus the first element.

Figure 3: Context-sensitive shift/reduce parser.

2.3 Consulting the CDG Rules

There are two related issues in consulting the CDG rules. One is the computational representation of CDG rules, and the other is the method for selecting an applicable rule. In the traditional CFG paradigms, a CFG rule is applicable if the left-hand side of the rule exactly matches the top elements of the stack. However, in our CDG paradigm, a perfect match between the left side of a CDG rule and the current state cannot be assured, and in most cases, a partial match must suffice for the rule to be applied. Since many rules may partially match the current context, the best matching rule should be selected.

One way to do this is to use a neural network. Through the back-propagation algorithm (Rumelhart, Hinton, and Williams 1986), a feed-forward network can be trained to memorize the CDG rules. After successful training, the network can be used to retrieve the best matching rule.
However, this approach based on a neural network usually takes considerable training time. For instance, in our previous experiment (Simmons and Yu 1990), training a network for about 2,000 CDG rules took several days of computation. Therefore, this approach has an intrinsic problem for scaling up, at least on the present generation of neural net simulation software.

Another method is based on a hash table in which every CDG rule is stored according to its top two elements of the stack, that is, the fourth and fifth elements of the left half of the rule. Given the current windowed context, the top two elements of the stack are used to retrieve all the relevant rules from the hash table.
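The hash-table organization just described can be sketched with a Python dictionary keyed on the top two stack symbols. This is only an illustrative sketch: `build_index` and `candidates` are hypothetical names, and the three sample rules are taken from the example parse of Section 2.1 rather than from the actual grammar.

```python
from collections import defaultdict

def build_index(rules):
    """Group (context, operation) rules by the top two stack symbols.

    The context is a ten-symbol tuple; positions 4 and 5 (indices 3 and 4)
    hold the top two elements of the stack.
    """
    index = defaultdict(list)
    for context, operation in rules:
        index[(context[3], context[4])].append((context, operation))
    return index

def candidates(index, context):
    """Return every stored rule that shares the context's top two stack symbols."""
    return index.get((context[3], context[4]), [])

rules = [
    (("_", "_", "_", "np", "p", "n", "v", "n", "_", "_"), ("S",)),
    (("_", "_", "np", "p", "n", "v", "n", "_", "_", "_"), ("R", "pp")),
    (("_", "_", "art", "adj", "n", "p", "n", "v", "n", "_"), ("R", "np")),
]
index = build_index(rules)

# A new context whose stack top is (p, n) retrieves only the (R pp) rule.
query = ("_", "_", "vp", "p", "n", "_", "_", "_", "_", "_")
matches = candidates(index, query)
```

Only the retrieved subgroup then needs to be compared against the current context, which keeps the cost of selecting a rule independent of the total grammar size.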

We use no more than 64 word and phrase class symbols, so there can be no more than 4,096 possible pairs. The effect is to divide the large number of rules into no more than 4,096 subgroups, each of which will have a manageable subset. In fact, with 16,275 rules we discovered that we have only 823 pairs and the average number of rules per subgroup is 19.8; however, for frequently occurring pairs the number of rules in the subgroups can be much larger.

The problem is to determine what scoring formula should be used to find the rule that best matches a parsing context. Sejnowski and Rosenberg (1988) analyzed the weight matrix that resulted from training NETtalk and discovered a triangular function with the apex centered at the character in the window and the weights falling off in proportion to distance from that character. We decided that the best matching rule in our system would follow a similar pattern, with maximum weights for the top two elements on the stack and weights decreasing in both directions with distance from those positions. The scoring function we use is developed as follows:

Let $\mathcal{R}$ be the set of vectors $\{R_1, R_2, \ldots, R_n\}$, where $R_i$ is the vector $[r_1, r_2, \ldots, r_{10}]$.
Let $C$ be the vector $[c_1, c_2, \ldots, c_{10}]$.
Let $\mu(c_i, r_i)$ be a matching function whose value is 1 if $c_i = r_i$, and 0 otherwise.

$\mathcal{R}$ is the entire set of rules, $R_i$ is (the left half of) a particular rule, and $C$ is the parse context. Then $\mathcal{R}'$ is the subset of $\mathcal{R}$ where if $R_i \in \mathcal{R}'$ then $\mu(r_{i4}, c_4) \cdot \mu(r_{i5}, c_5) = 1$. Access of the hash table with the top two elements of the stack, $c_4$ and $c_5$, produces the set $\mathcal{R}'$. We can now define the scoring function for each $R_i \in \mathcal{R}'$.

$$\mathrm{Score} = \sum_{i=1}^{3} \mu(c_i, r_i) \cdot i + \sum_{i=6}^{10} \mu(c_i, r_i)\,(11 - i)$$

The first summation scores the matches between the stack elements of the rule and the current context, and the second summation scores the matches between the elements in the input string. If two items of the rule and context match, the total score is increased by the weight assigned to that position.
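The scoring function can be sketched in Python as below. The sketch assumes that the first summation runs over stack positions 1 to 3 only, because positions 4 and 5 are already guaranteed to match by the hash-table lookup; this reading gives a perfect-match score of 21. The names `score` and `best_rule` are hypothetical, and ties are broken here by taking the first candidate, which is an assumption rather than something the text specifies.

```python
def score(context, rule_lhs):
    """Weighted match between a ten-symbol context and a rule's left half.

    Stack positions 1..3 weigh 1, 2, 3 (deeper symbols count less);
    input positions 6..10 weigh 5, 4, 3, 2, 1 (nearer symbols count more).
    Positions 4 and 5 are assumed to match already via the hash lookup.
    """
    stack_part = sum(i for i in range(1, 4) if context[i - 1] == rule_lhs[i - 1])
    input_part = sum(11 - i for i in range(6, 11) if context[i - 1] == rule_lhs[i - 1])
    return stack_part + input_part

def best_rule(context, bucket):
    """Pick the highest-scoring (lhs, operation) pair; first one wins on ties."""
    return max(bucket, key=lambda rule: score(context, rule[0]))

context = ("_", "_", "_", "art", "np", "v", "n", "_", "_", "_")
bucket = [
    (("_", "_", "_", "art", "np", "p", "n", "v", "n", "_"), ("R", "np")),
    (("_", "n", "v", "art", "np", "v", "n", "_", "_", "_"), ("R", "np")),
]
chosen = best_rule(context, bucket)
```

In this small example the first candidate matches all three deeper stack positions but only two input positions (score 11), while the second matches only the deepest stack position but the entire input window (score 16), so the second is chosen.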
The maximum score for a perfect match is 21 according to the above formula. From several experiments, varying the length of vector and the weights, particularly those assigned to blanks, it has been determined that this formula gave the best performance among those tested. More importantly, it has worked well in the current phrase structure and case analysis experiments.

It was a surprise to us[3] that using context-sensitive productions, an elementary, deterministic parsing algorithm proved adequate to provide 99% correct, unambiguous analyses for the entire text studied.

[3] But perhaps not to Marcus (1980) and Berwick (1985), who promote the study of deterministic parsing.

3. Grammar Acquisition for CDG

Constructing an augmented phrase structure grammar of whatever type (unification, GPSG, or ATN) is a painful process usually involving a well-trained linguistic team of several people. These types of grammar require that a CFG recognition rule such as np vp → snt be supported by such additional information as the fact that the np and vp agree in number, that the np is characterized by particular features such as count, animate, etc., and that the vp can or cannot accept certain types of complements. The additional features make the rules exceedingly complex and difficult to prepare and debug. College students can be taught easily to make a phrase structure tree to represent a sentence, but it requires considerable linguistic training to deal successfully with a feature grammar.

We have seen in the preceding section that a CDG is derived from recording the successive states of the parses of sentences. Thus it was natural for us to develop an interactive acquisition system that would assist a linguist (or a student) in constructing such parses to produce easily large sets of example CDG rules.[4] The system continued to evolve as a consequence of our use until we had included capabilities to:

- read in text and data files
- compile dictionary and grammar tables from completed text files
- select a sentence to continue processing or revise
- look up words in a dictionary to suggest the syntactic class for the word in context when assigning syntactic classes to the words in a sentence
- compare each state of the parse with rules in the current grammar to predict the shift/reduce operation. A carriage return signals that the user accepts the prompt, or the typing in of the desired operation overrides it.
- compute and display the parse tree from the local grammar after completion of each sentence, or from the global total grammar at any time
- provide backing up and editing capability to correct errors
- print help messages and guide the user
- compile dictionary and grammar entries at the completion of each sentence, ensuring no duplicate entries
- save completed or partially completed grammar files.

The resulting tool, GRAMAQ, enables a linguist to construct a context-sensitive grammar for a text corpus at the rate of several sentences per hour.
Thousands of rules are accumulated with only weeks of effort, in contrast to the years required for a comparable system of augmented CFG rules. About ten weeks of effort were required to produce the 16,275 rules on which this study is based. Since GRAMAQ's prompts become more accurate as the dictionary and grammar grow in size, there is a positive acceleration in the speed of grammar accumulation and the linguist's task gradually converges to one of alert supervision of the system's prompts.

A slightly different version of GRAMAQ is Caseaq, which uses operations that create case constituents to accumulate a context-sensitive grammar that transforms sentences directly to case structures with no intermediate stage of phrase structure trees. It has the same functionality as GRAMAQ but allows the linguist user to specify a case argument and value as the transformation of syntactic elements on the stack, and to rename the head of such a constituent by a syntactic label. Figure 9 in Section 7.3 illustrates the acquisition of case grammar.

[4] Starting with an Emacs editor, it was fairly easy to read in a file of sentences and to assign each word its syntactic class according to its context. Then the asterisk was inserted at the beginning of the syntactic string, the string was copied to the next line, the asterisk moved if a shift operation was indicated, or the top two symbols on the stack were rewritten if a reduce was required, just as we constructed the example in the preceding section. Naturally enough, we soon made Emacs macros to help us, and then escalated to a Lisp program that would print the stack-*-string and interpret our shift/reduce commands to produce a new state of the parse.

    Text             States  Sentences  Wds/Snt (range)  Mean Wds/Snt
    Hepatitis           236         12             4-19          10.3
    Measles             316         10             4-25          16.3
    News Story          470         10             9-51          23.5
    APWire-Robots      1005         21            11-53          26.0
    APWire-Rocket      1437         25             8-47          29.2
    APWire-Shuttle      598         14            12-32          21.9
    Total              4062         92             4-53          22.8

Table 1: Characteristics of a sample of the text corpus.

4. Experiments with CDG

There are a number of critical questions that need to be answered if the claim that CDG grammars are useful is to be supported.

- Can they be used to obtain accurate parses for real texts?
- Do they reduce ambiguity in the parsing process?
- How well do the rules generalize to new texts?
- How large must a CDG be to encompass the syntactic structures for most newspaper text?

4.1 Parsing and Ambiguity with CDG

Over the course of this study we accumulated 345 sentences, mainly from newswire texts. The first two articles were brief disease descriptions from a youth encyclopedia; the remaining fifteen were newspaper articles from February 1989 using the terms "star wars," "SDI," or "Strategic Defense Initiative." Table 1 characterizes typical articles by the number of CDG rules or states, number of sentences, the range of sentence lengths, and the average number of words per sentence.

We developed our approach to acquiring and parsing context-sensitive grammars on the first two simple texts, and then used GRAMAQ to redo those texts and to construct productions for the news stories. The total text numbered 345 sentences, which accumulated 16,275 context-sensitive rules, an average of 47 per sentence.
The parser embodying the algorithm illustrated earlier in Figure 1 was augmented to compare the constituents it constructed with those prescribed during grammar acquisition by the linguist. In parsing the 345 sentences, 335 parses exactly matched the linguist's original judgement. In nine cases in which differences occurred, the parses were judged correct, but slightly different sequences of parse states occurred. The tenth case clearly made an attachment error, of an introductory adverbial phrase in the sentence "Hours later, Baghdad announced... " This was mistakenly attached to "Baghdad." This evaluation shows that the grammar was in precise agreement with the linguist 97% of the time and completed correct parses in 99.7% of the 345 sentences from which it was derived.

Since our primary interest was in evaluating the effectiveness of the CDG, all these evaluations were based on using correct syntactic classes for the words in the sentences. The context-sensitive dictionary lookup procedure described in Section 7.3 is 99.5% accurate, but it assigns 40 word classes incorrectly. As a consequence, using this procedure would result in a reduction of about 10% accuracy in parsing.

An output of a sentence from the parser is displayed as a tree in Figure 4. Since the whole mechanism is coded in Lisp, the actual output of the system is a nested list that is then printed as a tree. Notice in this figure that the PP at the bottom modifies the NP composed of "the first firing of a trident two intercontinental range missile," not just the word "firing." Since the parsing is bottom-up, left-to-right, the constituents are formed in the natural order of words encountered in the sentence, and the terminals of the tree can be read top-to-bottom to give their ordering in the sentence.

[Figure 4: Sentence parse. Parse tree for the sentence "Another mission soon scheduled that also would have priority over the shuttle is the first firing of a trident two intercontinental range missile from a submerged submarine."]

Although 345 sentences totaling 8594 words is a small selection from the infinite set of possible English sentences, it is large enough to assure us that the CDG is a reasonable form of grammar.
Since the deterministic parsing algorithm selects a single interpretation, which we have seen almost perfectly agrees with the linguist's parsings, it is apparent that, at least for this size text sample, there is little difficulty with ambiguous interpretations.

5. Generalization of CDG

The purpose of accumulating sample rules from texts is to achieve a grammar general enough to analyze new texts it has never seen. To be useful, the grammar must generalize. There are at least three aspects of generalization to be considered.

- How well does the grammar generalize at the sentence level? That is, how well does the grammar parse new sentences that it has not previously experienced?
- How well does the grammar generalize at the operation level? That is, how well does the grammar predict the correct shift/reduce operation during acquisition of new sentences?
- How much does the rule retention strategy affect generalization? For instance, when the grammar predicts the same output as a new rule does, and the new rule is not saved, how well does the resulting grammar parse?

5.1 Generalization at the Sentence Level

The complete parse of a sentence is a sequence of states recognized by the grammar (whether it be CDG or any other). If all the constituents of the new sentence can be recognized, the new sentence can be parsed correctly. It will be seen in a later paragraph that with 16,275 rules, the grammar predicts the output of new rules correctly about 85% of the time. For the average sentence with 47 states, only 85% or about 40 states can be expected to be predicted correctly; consequently the deterministic parse will frequently fail. In fact, 5 of 14 new sentences parsed correctly in a brief experiment that used a grammar based on 320 sentences to attempt to parse the new, 20-sentence text. Considering that only a single path was followed by the deterministic parser, we predicted that a multiple-path parser would perform somewhat better for this aspect of generalization. In fact, our initial experiments with a beam search parser resulted in successful parses of 15 of the 20 new sentences using the same grammar based on the 320 sentences.

5.2 Generalization at the Operation Level

This level of generalization is of central significance to the grammar acquisition system.
When GRAMAQ looks up a state in the grammar it finds the best matching state with the same top two elements on the stack, and offers the right half of this rule as its suggestion to the linguist. How often is this prediction correct?

To answer this question we compiled the grammar of 16,275 rules in cumulative increments of 1,017 rules using a procedure, union-grammar, that would only add a rule to the grammar if the grammar did not already predict its operation. We call the result a "minimal-grammar," and it contains 3,843 rules. The black line of Figure 5 shows that with the first 1,000 rules 40% were new; with an accumulation of 5,000, 18% were new rules. By the time 16,000 rules have been accumulated, the curve has flattened to an average of 16% new rules added. This means that the acquisition system will make correct prompts about 84% of the time and the linguist will only need to correct the system's suggestions about 3 or 4 times in 20 context presentations.
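The retention idea behind union-grammar (add an example only when the current grammar fails to predict its operation) can be sketched in Python as follows. This is a hedged reconstruction, not the authors' procedure: the names `score`, `predicts`, `union_grammar`, and `minimal_grammar` are hypothetical, the scoring weights assume stack positions 1 to 3 and input positions 6 to 10, a tie between best-scoring rules with different operations is treated as a failed prediction, exact duplicates are skipped so that repeated passes terminate, and the bucket is found by a linear scan instead of a hash table to keep the sketch short.

```python
def score(context, lhs):
    """Weighted match: stack positions 1..3 weigh 1..3, input positions 6..10 weigh 5..1."""
    stack_part = sum(i for i in range(1, 4) if context[i - 1] == lhs[i - 1])
    input_part = sum(11 - i for i in range(6, 11) if context[i - 1] == lhs[i - 1])
    return stack_part + input_part

def predicts(grammar, context, op):
    """True if every best-scoring rule with the same top two stack symbols yields op."""
    bucket = [(lhs, o) for lhs, o in grammar if lhs[3:5] == context[3:5]]
    if not bucket:
        return False
    best = max(score(context, lhs) for lhs, _ in bucket)
    return all(o == op for lhs, o in bucket if score(context, lhs) == best)

def union_grammar(examples, grammar=None):
    """One pass: keep an example only when the grammar so far mispredicts it."""
    grammar = list(grammar or [])
    added = 0
    for context, op in examples:
        if (context, op) in grammar:  # already stored; skipping guarantees termination
            continue
        if not predicts(grammar, context, op):
            grammar.append((context, op))
            added += 1
    return grammar, added

def minimal_grammar(examples):
    """Repeat passes until a pass adds nothing new."""
    grammar, added = union_grammar(examples)
    while added:
        grammar, added = union_grammar(examples, grammar)
    return grammar

examples = [
    (("_", "_", "_", "art", "np", "p", "n", "v", "n", "_"), ("R", "np")),
    (("_", "_", "_", "art", "np", "v", "n", "_", "_", "_"), ("R", "np")),
    (("_", "_", "_", "p", "n", "v", "n", "_", "_", "_"), ("R", "pp")),
]
grammar = minimal_grammar(examples)
```

Here the second example is dropped because the first rule, which shares its top two stack symbols, already predicts the same reduction; the three examples therefore compress to two retained rules.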

Robert E Simmos ad Yeog-Ho Yu Cotext-Depedet Grammars for Eglish 50 40 r~ -] 30 20... ~"... t... t... t... t... t...!...!... +.'"'"t... t...!... t...!...!...! i..~. I I, I I I I I l : I I l : I I -', I ; l ; I ; ;! ' I " ; ", ' I I I I i I I I I I, l,,, I ' ' '... i... ;... ~... i... i... ~... 1... i... 4... 4... ~-... i... ;... i... i... i...... i... i... i... J... L... i... i... i... J...... i 10... *... G.,,--.~... i... J...!...!... i... i... 4... 4... 4...!...!...!... i... i i i i I i I -: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Accumulated Rules by Thousads Figure 5 Geeralizatio of CDG rules. 5.3 Rule Retetio ad Geeralizatio If two parsig grammars accout equally well for the same seteces, the oe with fewer rules is less redudat, more abstract, ad the oe to be preferred. We used the uio-grammar procedure to produce ad study the miimal grammar for the 16,275 rules (rule-examples) derived from the sample text. Uio-grammar records a ew rule for a rule-example: s 1. if best matchig rule has a operatio that does't match 2. if best matchig rule ties with aother rule whose operatio does ot match 3. if 2 is true, ad score = 21 we have a full cotradictio ad list the rule as a error. Six cotradictios occurred i the grammar; five were icosistet treatmets of "SNT" followed by oe or more puctuatio marks, while the sixth offered both a shift ad a "pp" for a prepositio-ou followed by a prepositio. The latter case is a attachmet ambiguity ot resolvable by sytax. I the first pass as show i Table 2, the text resulted i 3,194 rules compared with 16,275 possible rules. That is, 13,081 possible CDG rules were ot retaied because already existig rules would match ad predict the operatio. However, usig those rules to parse the same text gave very poor results: zero correct parses at the setece level. Therefore, the process of compilig a miimal grammar was repeated startig with those 3,194 rules. This time oly 619 ew rules were added. 
The purpose of this repetition is to get rid of the effect that the rules added later change the predictions made earlier. Finally, in a fourth repetition of the process no rules were new. The resulting grammar of 3,843 rules succeeds in parsing the text with only occasional minor errors in attaching constituents. It is to be emphasized that the unretained rules are similar but not identical to those in the minimal grammar.

Table 2
Four passes with minimal grammar.

Pass   Unretained   Retained   Total Rules
1      13081        3194       16275
2      15656        619        16275
3      16245        18         16275
4      16275        0          16275

We can observe that this technique of minimal retention by "unioning" new rules to the grammar results in a compression of the order 16,275/3,843 or 4.2 to 1, without increase in error. If this ratio holds for larger grammars, then if the linguist accumulates 40,000 training-example rules to account for the syntax of a given subset of language, that grammar can be compressed automatically to about 10,000 rules that will accomplish the same task.

⁵ These definite conditions are due to an analysis by Mark Ring.

6. Predicting the Size of CDGs

When any kind of acquisition system is used to accumulate knowledge, one very interesting question is, when will the knowledge be complete enough for the intended application? In our case, how many CDG rules will be sufficient to cover almost all newswire stories? To answer this question, an extrapolation can be used to find the point at which the solid line of Figure 5 reaches the x-axis, that is, the point at which no new rules are being added. However, the CDG curve is descending too slowly to make a reliable extrapolation. Therefore, another question was investigated instead: when will the CDG rules include a complete set of CFG rules? Note that a CDG rule is equivalent to a CFG rule if the context is limited to the top two elements of the stack. What the other elements in the context accomplish is to make one rule preferable to another that has the same top two elements of the stack, but a different context.

We allow 64 symbols in our phrase structure analysis. That means there are 64² (4,096) possible combinations for the top two elements of the stack.
For each combination, there are 65 possible operations:⁶ a shift or a reduction to another symbol. Among 16,275 CDG rules, we studied how many different CFG rules can be derived by eliminating the context. We found 844 different CFG rules that used 600 different left-side pairs of symbols. This shows that a given context free pair of symbols averages 1.4 different operations.⁷ Then, as we did with CDG rules, we measured how many new CFG rules were added in an accumulative fashion. The shaded line of Figure 5 shows the result.

⁶ Actually, there are fewer than 65 possible operations since the stack elements can be reduced only to nonterminal symbols.
⁷ We actually use only 48 different symbols, so only 48² or 2,304 combinations could have occurred. The fraction 600/2,304 yields .26, the proportion of the combinatoric space that is actually used, so far.
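Eliminating the context to obtain the CFG component can be sketched as follows. The rule layout (a stack tuple with the top first, an input tuple, and an operation) and the function names are assumptions made for illustration; the toy rules are invented and are not taken from the paper's grammar.

```python
from collections import defaultdict


def cfg_projection(cdg_rules):
    """Map each top-two stack pair to the set of operations observed for it."""
    table = defaultdict(set)
    for stack, _inp, op in cdg_rules:
        table[tuple(stack[:2])].add(op)   # drop everything but the top two symbols
    return table


def new_cfg_counts(cdg_rules, step):
    """Count previously unseen CFG rules in each increment of `step` examples."""
    seen, counts = set(), []
    for start in range(0, len(cdg_rules), step):
        fresh = 0
        for stack, _inp, op in cdg_rules[start:start + step]:
            key = (tuple(stack[:2]), op)
            if key not in seen:
                seen.add(key)
                fresh += 1
        counts.append(fresh)
    return counts


toy = [
    (("n", "art"), ("v",), "np"),
    (("n", "art"), ("prep",), "np"),
    (("n", "art"), ("n",), "shift"),
    (("v", "n"), ("n",), "shift"),
]
table = cfg_projection(toy)
pairs = len(table)                                   # distinct left-side pairs
cfg_rules = sum(len(ops) for ops in table.values())  # distinct CFG rules
print(pairs, cfg_rules, new_cfg_counts(toy, 2))      # 2 3 [1, 2]

# The figures reported in the text and footnotes, checked directly.
print(round(844 / 600, 1), 48 ** 2, round(600 / 2304, 2))  # 1.4 2304 0.26
```

Applied to the full set of rule-examples, `new_cfg_counts` would give the kind of per-increment counts that the shaded line of Figure 5 plots.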

Notice that the line has descended to about 1.5% errors at 16,000 rules. To make an extrapolation easier, a log-log graph shows the same data in Figure 6. From this graph, it can be predicted that, after about 25,000 CDG rules are accumulated, the grammar will encompass a CFG component that is 99% complete. Beyond this point, additional CDG rules will add almost no new CFG rules, but only fine-tune the grammar so that it can resolve ambiguities more effectively. Also, it is our belief that, after the CDG reaches that point, a multi-path, beam-search parser will be able to parse most newswire stories very reliably. This belief is based on our initial experiment that used a beam search parser to test generalization of the grammar to find parses for fifteen out of twenty new sentences.

Figure 6
Log-log plot of new CFG rules. (Plot: Nbr of Accumulated Rules, 100 to 100,000, on the x-axis; y-axis ticks from 1 to 100.) Extrapolation, the gray line, predicts that 99% of the context free pairs will be achieved with the accumulation of 25,000 context sensitive rules.

7. Acquiring Case Grammar

Explicating the phrase structure constituents of sentences is an essential aspect in computer recognition of meaning. Case analysis organizes the constituents into a hierarchical structure of labeled propositions. The propositions can be used directly to answer questions and are the basis of schemas, scripts, and frames that are used to add meaning to otherwise inexplicit texts. As a result of the experiments with acquiring CDG and exploring its properties for parsing phrase structures, we became fairly confident that we could generalize the system to acquisition and parsing based on a grammar that would compute syntactic case structures directly from syntactic strings. Direct translation from string to structure is supported by neural network experiments such as those by McClelland and Kawamoto (1986), Miikkulainen and Dyer (1989), Yu and Simmons (1990), and Leow and Simmons (1990).
We reasoned that if we could acquire case grammar with something approaching the simplicity of acquiring phrase structure rules, the result could be of great value for NL applications.

7.1 Case Structure

Cook (1989) reviewed twenty years of linguistic research on case analysis of natural language sentences. He synthesized the various theories into a system that depends on the subclassification of verbs into twelve categories, and it is apparent from his review that with a fine subcategorization of verbs and nominals, case analysis can be accomplished as a purely syntactic operation--subject to the limitations of attachment ambiguities that are not resolvable by syntax. This conclusion is somewhat at variance with those AI approaches that require a syntactic analysis to be followed by a semantic operation that filters and transforms syntactic constituents to compute case-labeled propositions (e.g. Rim 1990), but it is consistent with the neural network experience of directly mapping from sentence to case structure, and with the AI research that seeks to integrate syntactic and semantic processing while translating sentences to propositional structures.

Linguistic theories of case structure have been concerned only with single propositions headed by verb predications; they have been largely silent with regard to the structure of noun phrases and the relations among embedded and sequential propositions. Additional conventions for managing these complications have been developed in Simmons (1984) and Alterman (1985) and are used here.

The central notion of a case analysis is to translate sentence strings into a nested structure of case relations (or predicates) where each relation has a head term and an indefinite number of labeled arguments. An argument may itself be a case relation. Thus a sentence, as in the examples below, forms a tree of case relations.

The old man from Spain ate fish.
(eat Agt (man Mod old From spain) Obj fish)

Another mission scheduled soon is the first firing of a trident missile from a submerged submarine.
(is Obj1 (mission Mod another Obj* (scheduled Vmod soon))
    Obj2 (firing Mod first Det the Of (missile Nmod trident Det a) From (submarine Mod submerged Det a)))

Note that mission is in Obj* relation to scheduled.
This means the object of scheduled is mission, and the expression can be read as "another mission such that mission is scheduled soon." An asterisk as a suffix to a label always signals the reverse direction for the label.

There is a small set of case relations for verb arguments, such as verbmodifier, agent, object, beneficiary, experiencer, location, state, time, direction, etc. For nouns there are determiner, modifier, quantifier, amount, nounmodifier, preposition, and reverse verb relations, agt*, obj*, ben*, etc. Prepositions and conjunctions are usually used directly as argument labels while sentence conjunctions such as because, while, before, after, etc. are represented as heads of propositions that relate two other propositions with the labels preceding, post, antecedent, and consequent. For example, "Because she ate fish and chips earlier, Mary was not hungry."

(because Ante (ate Agt she Obj (fish And chips) Vmod earlier)
         Conse (was Vmod not Obj1 mary State hungry))

Verbs are subcategorized as vao, vabo, vo, va, vhav, vbe where a is agent, o is object, b is beneficiary and vhav is a form of have and vbe a form of be. So far, only the subcategory of time has been necessary in subcategorizing nouns to accomplish this form of case analysis, but in general, a lexical semantics is required to resolve syntactic attachment ambiguities. The complete set of case relations is presumed to be small, but no one has yet claimed a complete enumeration of them.

Other case systems such as those taught by Schank (1980) and Jackendoff (1983) classify predicate names into such primitives as Do, Event, Thing, Mtrans, Ptrans, Go, Action, etc., to approximate some form of "language of thought" but the present approach is less ambitious, proposing merely to represent in a fairly formal fashion the organization of the words in a sentence. Subsequent operations on this admittedly superficial class of case structures, when augmented with a system of shallow lexical semantics, have been shown to accomplish question answering, focus tracking of topics throughout a text, automatic outlining, and summarization of texts (Seo 1990; Rim 1990). One strong constraint on this type of analysis is that the resulting case structure must maintain all information present in the text so that the text may be exactly reconstituted from the analysis.

7.2 Syntactic Analysis of Case Structure

We've seen earlier that a shift/reduce-rename operation is sufficient to parse most sentences into phrase structures. Case structure, however, requires transformations in addition to these operations. To form a case structure it is frequently necessary to change the order of constituents and to insert case labels. Following Jackendoff's principle of grammatical constraint, which argues essentially that semantic interpretation is frequently reflected in the syntactic form, case transformations are accomplished as each syntactic constituent is discovered. Thus when a verb, say throw, and an NP, say coconuts, are on top of the stack, one must not only create a VP, but also decide the case, Obj, and form the constituent, (throw Obj coconuts).
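The nested case relations described in Section 7.1, and the constituent formed when a verb meets its object, can be modeled with a small tree type. This is a sketch only: the class name `Case` and its methods are hypothetical and are not taken from the authors' system.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """A case relation: a head term plus any number of labeled arguments."""
    head: str
    args: list = field(default_factory=list)  # (label, word or Case) pairs, in order

    def add(self, label, arg):
        """Attach one labeled argument; returns self so calls can be chained."""
        self.args.append((label, arg))
        return self

    def __str__(self):
        if not self.args:
            return self.head
        parts = [self.head]
        for label, arg in self.args:
            parts += [label, str(arg)]   # str() recurses into nested relations
        return "(" + " ".join(parts) + ")"


# A verb and an NP on top of the stack become one labeled constituent.
print(Case("throw").add("Obj", "coconuts"))
# (throw Obj coconuts)

# An argument may itself be a case relation, so a sentence forms a tree.
man = Case("man").add("Mod", "old").add("Det", "the").add("From", "spain")
print(Case("ate").add("Agt", man).add("Obj", "fish"))
# (ate Agt (man Mod old Det the From spain) Obj fish)
```

Keeping the arguments as an ordered list rather than a dictionary allows a label to occur more than once and preserves the order in which constituents were attached.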
Forming such a case-labeled constituent can be accomplished in customary approaches to parsing by using augmented context free recognition rules of the form:

VP → VP NP / 1 obj 2

where the numbers following the slash refer to the text dominated by the syntactic class in the referenced position (ordered left-to-right) in the right half of the rule. The resulting constituents can be accumulated to form the case analysis of a sentence (Simmons 1984). We develop augmented context-sensitive rules following the same principle. Let us look again at the example "The old man from Spain ate fish," this time to develop case relations.

* art adj n from n vao n ; shift
art * adj n from n vao n ; shift
art adj * n from n vao n ; shift
art adj n * from n vao n ; 1 mod 2    (man Mod old)
art n * from n vao n     ; 1 det 2    (man Mod old Det the)
n * from n vao n         ; shift
n from * n vao n         ; shift
n from n * vao n         ; 3 2 1      (man Mod old Det the From spain)
n * vao n                ; shift
n vao * n                ; 1 agt 2    (ate Agt (man Mod old ...))
vao * n                  ; shift
vao n *                  ; 1 obj 2    (ate Agt (man ...) Obj fish)

In this example the case transformation immediately follows the semicolon, and the result of the transformation is shown in parentheses further to the right. The result in the final constituent is: (ate Agt (man Mod old Det the From spain) Obj fish). Note that we did not rename the syntactic constituents as NP or VP in this example, because we were not interested in showing the phrase structure tree. Renaming in case analysis need only be done when it is necessary to pass on information accumulated from an earlier constituent. For example, in "fish were eaten by birds," the CS parse is as follows:

* n vbe ppart by n ; shift
n * vbe ppart by n ; shift
n vbe * ppart by n ; shift
n vbe ppart * by n ; 1 vbe 2, vpasv    (eaten Vbe were)
n vpasv * by n     ; 1 obj 2           (eaten Vbe were Obj fish)
vpasv * by n       ; shift
vpasv by * n       ; shift
vpasv by n *       ; 1 prep 2          (birds Prep by)
vpasv n *          ; 2 agt 1           (eaten Vbe were Obj fish Agt (birds Prep by))

Here, it was necessary to rename the combination of a past participle and its auxiliary as a passive verb, vpasv, so that the syntactic subject and object could be recognized as Obj and Agent, respectively. We also chose to use the argument name Prep to form (birds Prep by) so that we could then call that constituent Agent. We can see that the reduce operation has become a reduce-transform-rename operation where numbers refer to elements of the stack, the second term provides a case argument label, the ordering provides a transformation, and an optional fourth element may rename the constituent. A sample of typical case transformations is shown associated with the top elements of the stack in Table 3.

Table 3
Some typical case transformations for syntactic constituents.

Stack: V.. vpasv adj l vbe vabo prep prep by snt because and snt after 2 vao vo v vpasv because snt after snt
Case-Transform: mod adj 2 mod l 1 agt 2 1 obj 2 1 vbe 2 vpasv 2 ben 1 vao 1 obj 2 3 2 1 3 2 1 1 prep 2 1 conse 2 2 ante 1 1 2 3 1 pre 2 2 post 1
In this table, the first element of the stack is in the third position in the left side of the table, and the number 1 refers to that position, 2 to the second, and 3 to the first. As an aid to the reader the first two entries in the table refer literally by symbol rather than by reference to the stack. The symbols vao and vabo are subclasses of verbs that take, respectively, agent and object; and agent, beneficiary, and object. The symbol v.. refers to any verb. Forms of the verb be are referred to as vbe, and passivization is marked by relabeling a verb by adding the suffix -pasv.

CS-CASE-Parser(input,cdg)
  Input is a string of syntactic classes for the given sentence.
  Cdg is the given CDG grammar rules.

  stack := empty
  outputstack := empty
  do until(input = empty and 2nd(stack) = blank)
    window-context := append(top_five(stack),first_five(input))
    operation := consult_cdg(window-context,cdg)
    if first(operation) = SHIFT
    then stack := push(first(input),stack)
         input := rest(input)
    else stack := push(select(operation),pop(pop(stack)))
         outputstack := make_constituent(operation,outputstack)
  end do

Figure 7
Algorithm for case parse.

Parsing case structures

From the discussion above we may observe that the flow of control in accomplishing a case parse is identical to that of a phrase structure parse. The difference lies in the fact that when a constituent is recognized (see Figure 7):

- in phrase structure, a new name is substituted for its stack elements, and a constituent is formed by listing the name and its elements
- in case analysis, a case transformation is applied to designated elements on the stack to construct a constituent, and the head (i.e. the first element of the transformation) is substituted for its elements--unless a new name is provided for that substitution.

Consequently the algorithm used in phrase structure analysis is easily adapted to case analysis. The difference lies in interpreting and applying the operation to make a new constituent and a new stack. In the algorithm shown above, we revise the stack by attaching either the head of the new constituent, or its new name, to the stack resulting from the removal of all elements in the new constituent.
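A runnable approximation of the control loop in Figure 7 is sketched below. It simplifies in several ways that are assumptions rather than the authors' design: rules are found by exact match on the top two stack symbols and the next input class instead of by scoring a ten-symbol window, the class stack and the output stack are kept in parallel, and a state with no rule raises a `KeyError`. All function names are hypothetical.

```python
def attach(head, label, arg):
    """Append one labeled argument to a head word or an existing constituent."""
    base = [head] if isinstance(head, str) else list(head)
    return base + [label, arg]


def show(c):
    """Render a constituent as a parenthesized case structure."""
    return c if isinstance(c, str) else "(" + " ".join(show(x) for x in c) + ")"


def reduce_op(op, stack, out):
    """Apply a reduce-transform-rename operation such as '1 vbe 2, vpasv'."""
    tok = op.replace(",", " ").split()
    nums = [int(t) for t in tok[:3] if t.isdigit()]   # 1 = top of the stack
    head_pos, arg_pos, k = nums[0], nums[-1], max(nums)
    if len(nums) == 3:          # e.g. '3 2 1': the middle word supplies the label
        label = out[-nums[1]].capitalize()
    else:
        label = tok[1].capitalize()
    rename = tok[3] if len(tok) > 3 else None
    new = attach(out[-head_pos], label, out[-arg_pos])
    symbol = rename or stack[-head_pos]               # the 'select' step
    del stack[-k:], out[-k:]                          # remove the consumed elements
    stack.append(symbol)
    out.append(new)


def parse(words, classes, rules):
    """Deterministic shift/reduce case parse; returns the final constituent."""
    stack, out, i = [], [], 0
    while i < len(classes) or len(stack) > 1:
        nxt = classes[i] if i < len(classes) else None
        op = rules[(tuple(stack[-2:]), nxt)]
        if op == "shift":
            stack.append(classes[i])
            out.append(words[i])
            i += 1
        else:
            reduce_op(op, stack, out)
    return out[0]


# Toy rule table covering only the passive example "fish were eaten by birds".
RULES = {
    ((), "n"): "shift",
    (("n",), "vbe"): "shift",
    (("n", "vbe"), "ppart"): "shift",
    (("vbe", "ppart"), "by"): "1 vbe 2, vpasv",
    (("n", "vpasv"), "by"): "1 obj 2",
    (("vpasv",), "by"): "shift",
    (("vpasv", "by"), "n"): "shift",
    (("by", "n"), None): "1 prep 2",
    (("vpasv", "n"), None): "2 agt 1",
}
words = ["fish", "were", "eaten", "by", "birds"]
print(show(parse(words, ["n", "vbe", "ppart", "by", "n"], RULES)))
# (eaten Vbe were Obj fish Agt (birds Prep by))
```

Unlike the two-element pop in the pseudocode, `reduce_op` removes as many elements as the operation names, so a three-element transformation such as `3 2 1` consumes three.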
The function select chooses either a new name if present, or the first element, the head of the operation. Make_constituent applies the transformation rule to form a new constituent from the output stack and pushes the constituent onto the output stack, which is first reduced by removing the elements used in the constituent. Again, the algorithm is a deterministic, first (best) path parser with behavior essentially the same as the phrase structure parser. But this version accomplishes transformations to construct a case structure analysis.

7.3 Acquisition System for Case Grammar

The acquisition system, like the parser, required only minor revisions to accept case grammar. It must apply a shift or any transformation to construct the new stack-string for the linguist user, and it must record the shift or transformation as the right half of a context-sensitive rule--still composed of a ten-symbol left half and an operation as the right half. Consequently, the system will be illustrated in Figure 9 rather than described in detail.

Earlier we mentioned the context-sensitive dictionary. This is compiled by associating with each word the linguist's in-context assignments of each syntactic word class in which it is experienced. When the dictionary is built, the occurrence frequencies of each word class are accumulated for each word. A primitive grammar of four-tuples terminating with each word class is also formed and hashed in a table of syntactic paths. The procedure to determine a word class in context first obtains the candidates from the dictionary. For each candidate wc, it forms a four-tuple, vec, by adding it to the cdr of each immediately preceding vec, stored in IPC. Each such vec is tested against the table of syntactic paths; if it has been seen previously, it is added to the list of IPCs, otherwise it is eliminated. If the union of first elements of the IPC list is a single word class, that is the choice. If not, the word's most frequent word class among the union of surviving classes for the word is chosen.

The effect of this procedure is to examine a context of plus and minus three words to determine the word class in question. Although a larger context based on five-tuple paths is slightly more effective, there is a tradeoff between accuracy and storage requirements. The word class selection procedure was tested on the 8,310 words of the 345-sentence sample of text.
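One possible reading of the word class selection procedure described above is sketched here. Several details are assumptions rather than statements of the authors' method: the boundary symbol `PAD`, padding at both ends of a sentence so that every word eventually heads a four-tuple, keeping all candidate tuples when none has been seen before, and the function names. Every word is assumed to be in the dictionary.

```python
from collections import Counter, defaultdict

PAD = "#"  # hypothetical sentence-boundary symbol


def train(tagged_sentences):
    """Build the word-class frequency dictionary and the table of four-tuple paths."""
    freq, paths = defaultdict(Counter), set()
    for sent in tagged_sentences:
        for word, cls in sent:
            freq[word][cls] += 1
        classes = [PAD] * 3 + [cls for _, cls in sent] + [PAD] * 3
        for i in range(len(classes) - 3):
            paths.add(tuple(classes[i:i + 4]))
    return dict(freq), paths


def choose_classes(words, freq, paths):
    """Pick one class per word using surviving four-tuple paths."""
    ipc = {(PAD,) * 4}          # surviving tuples ending at the previous word
    chosen = []
    stream = [list(freq[w]) for w in words] + [[PAD]] * 3
    for pos, candidates in enumerate(stream):
        vecs = {prev[1:] + (wc,) for prev in ipc for wc in candidates}
        ipc = {v for v in vecs if v in paths} or vecs   # assumed fallback: keep all
        k = pos - 3             # the word whose class now heads each tuple
        if k >= 0:
            firsts = {v[0] for v in ipc}
            if len(firsts) == 1:
                chosen.append(next(iter(firsts)))
            else:               # fall back to the most frequent surviving class
                chosen.append(max(firsts, key=lambda c: freq[words[k]][c]))
    return chosen


corpus = [
    [("the", "art"), ("fish", "n"), ("swim", "v")],
    [("the", "art"), ("fish", "n"), ("sleep", "v")],
    [("they", "pron"), ("fish", "v"), ("daily", "adv")],
]
freq, paths = train(corpus)
print(choose_classes(["they", "fish", "daily"], freq, paths))
# ['pron', 'v', 'adv']
```

In the toy corpus the most frequent class for fish is n, yet the path table selects v after they, which is the kind of case where the path-based procedure and the most-frequent-category heuristic differ.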
A score of 99.52% correct was achieved, with 8,270 words correctly assigned. As a comparison, the most frequent category for a word resulted in 8,137 correct assignments for a score of 97.52%. Although there are only 3,298 word types with an average of 3.7 tokens per type, the occurrence of single word class usages for words in this sample is very high, thus accounting for the effectiveness of the simpler heuristic of assignment of the most frequent category. However, since the effect of misassignment of word class can often ruin the parse, the use of the more complex procedure is amply justified. Analysis of the 40 errors in word class assignment showed 7 confusions of nouns and verbs that will certainly cause errors in parsing; other confusions of adjective/noun, and adverb/preposition are less devastating, but still serious enough to require further improvements in the procedure.

The word class selection procedure is adequate to form the prompts in the lexical acquisition phase, but the statistics on parsing effectiveness given earlier depend on perfect word class assignments. Shown in Figure 8 is the system's presentation of a sentence and its requests for each word's syntactic class. The protocol in Figure 9 shows the acquisition of shift