Natural Languages Analysis in Machine Translation (MT) based on the STCG (STRING-TREE CORRESPONDENCE GRAMMAR)

Tang Enya Kong, Zaharin Yusoff
Unit Terjemahan Melalui Komputer
Pusat Pengajian Sains Komputer
Universiti Sains Malaysia
11800 Minden, Pulau Pinang, Malaysia.
[e-mail: enyakong@cs.usm.my and zarin@cs.usm.my]

0. Abstract

The String-Tree Correspondence Grammar (STCG) [1] is a grammar formalism for defining: a set of strings (a language), a set of trees (valid representation/interpretation structures), and the mapping between the two (to be interpreted for analysis and generation). The formalism is argued to be a totally declarative grammar formalism that can associate, to strings in a language, arbitrary tree structures as desired by the grammar writer to be the linguistic representation structures of the strings. More important is the facility to specify the correspondence between the string and the associated tree in a very natural manner. These features are very much desired in grammar writing, in particular for the treatment of certain linguistic phenomena which are 'non-standard', namely featurisation, lexicalisation and crossed dependencies [2,3]. Furthermore, a grammar written in this way naturally inherits the desired property of bi-directionality (in fact non-directionality [4]) such that the same grammar can be interpreted for both analysis and generation.

In this paper, we investigate the properties of the STCG for interpretation towards analysis (as is understood within the context of Machine Translation (MT)). Other than using STCG grammars as specifications for the automatic generation of analysis programs in the Specialised Languages for Linguistic Programming (SLLPs) of MT systems (a study reported in [5,6]), the work also centres around the specification of a general analyser/parser for the STCG. The proposed STCG analyser is capable of mimicking some very useful features in various context-free parsing techniques.
One such feature is the use of charts in tabular parsing algorithms, as exemplified in Earley's Algorithm [7], which is very helpful in avoiding redundancies that may otherwise result in a combinatorial explosion. Another is the compact way of representing possible parse trees for ambiguous sentences, such as the one seen in [8]. Though not reported in this paper, we note that the proposed analyser also provides a natural way of handling the kind of awkward phenomena mentioned above (namely lexicalisation, featurisation, and worst of all, crossed dependencies) while at the same time retaining much of the efficiency of standard context-free parsing algorithms (a study reported in [2,3]).

1. The STCG Formalism

The String-Tree Correspondence Grammar is a declarative grammar formalism that can be used to describe the correspondence between strings of terms and trees. In particular, linguistic rules are written with utterances as the string of terms (henceforth STRING) and the corresponding representative linguistic structures as the tree (henceforth TREE). Figure 1 gives an indication of a full STCG rule. The structure of the TREE is totally specified by the linguist and is not constrained by any application of rules (as is the case for the parse tree in the classical context-free grammar).

In a rule, the main correspondence is first declared: in the example, the STRING #1.v.#2.part (with #1 and #2 being string variables, i.e. variables which are instantiable to strings of terms) is set to correspond to the TREE with root node S (which contains forest variables, i.e. variables that can be instantiated to lists of subtrees). The main-corr(espondence) is followed by a declaration of sub-correspondences (on the right hand side) between substrings of the STRING and subtrees of the TREE, each of which possibly has a list of references (rule names). For example, the sub-corr(espondence) between the substring #1 and the subtree rooted at the node 1 refers to the rules R..., the latter being other rules in the grammar. This reference is a mechanism by means of which the string and forest variables mentioned earlier are fully instantiated via an operation called identification [9,10], resulting in a correspondence between explicit strings of terms and trees, both without variables.

In actual fact, the main-corr as well as the sub-corrs specified in the rule are formally recorded in terms of a Structured String-Tree Correspondence (SSTC), transparent to the linguist [11], as illustrated in figure 2, where a given correspondence may be non-projective (e.g. with discontinuous constituents), as is the case for the node v(part) in the example. Note also that the particle is chosen (by the linguist) to be represented as a collection of features in the node v, a case of featurisation.

Figure 1. [An STCG rule: the main correspondence between the STRING #1.v.#2.part and a TREE containing the node v(part), followed by the sub-correspondences for #1, for v.part (v = pick, etc.; part = up, etc.) and for #2, each with its list of reference rules.]

In very simple terms, a string to tree correspondence in the STCG can be viewed as analogous to the mathematical definition of a relation between integer numbers, as in the following example. Here, a relation (in this case a function) f is defined in terms of finer subrelations according to the subdomains.

[Example: a function f(x) defined piecewise over three subdomains of x.]

Figure 2. [The SSTC recorded for the STRING #1.v.#2.part over the positions a to e: each node of the TREE carries a pair of intervals, e.g. (b_c+d_e/b_c+d_e) for the node v(part).]

A set of STCG rules forms a grammar, some of which are axiom rules (i.e. start rules or rules containing axiom trees, analogous to the axiom or the start symbol S in the classical context-free grammar). With the semantics of the rules being as indicated above, a grammar thus defines a language of strings, a language of representation trees, and the correspondence between elements of the two languages/sets.
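As a rough illustration of how an SSTC can record a possibly discontinuous correspondence, the sketch below stores each tree node together with the intervals of the words it corresponds to. The sentence, the class name `Node`, the field `snode` and the helper functions are assumptions made for this illustration only; they are not taken from the STCG literature, and the tree is just one plausible reading of a rule with the STRING #1.v.#2.part.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) positions between words


def merge(spans: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse those that touch or overlap."""
    merged: List[Interval] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


@dataclass
class Node:
    label: str
    snode: List[Interval]  # words the node itself stands for
    children: List["Node"] = field(default_factory=list)

    def stree(self) -> List[Interval]:
        """Intervals covered by the whole subtree rooted at this node."""
        spans = list(self.snode)
        for child in self.children:
            spans.extend(child.stree())
        return merge(spans)


def discontinuous(node: Node) -> List[str]:
    """Labels of nodes whose subtree covers more than one interval."""
    found = [node.label] if len(node.stree()) > 1 else []
    for child in node.children:
        found.extend(discontinuous(child))
    return found


# Hypothetical encoding of "He picks the ball up" (positions 0..5).
tree = Node("S", [], [
    Node("NP", [], [Node("he", [(0, 1)])]),
    Node("v(part)", [(1, 2), (4, 5)]),  # verb and particle in one node
    Node("NP", [], [Node("the", [(2, 3)]), Node("ball", [(3, 4)])]),
])
```

In this encoding `tree.stree()` yields the single interval (0, 5), while the node `v(part)` keeps the two intervals (1, 2) and (4, 5), which is the kind of non-projective correspondence the SSTC is meant to record.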
It is this set of string-tree correspondences, defined by the grammar, that can be interpreted for both analysis and generation.

2. Natural Languages Analysis in MT Based on the STCG

Initially, the STCG was designed to serve as a specification language for writing grammars in MT, such that the specifications written in the STCG grammar formalism can then be coded (manually) into the linguistic programs for analysis and generation written in the SLLPs of integrated MT systems. Substantial work has also been carried out to automate this process, namely towards the automatic generation of analysis programs in the MT systems ARIANE [12] and JEMAH [13] from grammars written in the STCG formalism (see for example [5,6]). However, due to certain limitations in the existing SLLPs for the realisation of a proper implementation of an STCG analyser (as discussed in [2]), we have decided instead to look into the design of an analyser which can directly interpret the STCG grammar.

2.1. The Fundamental Design of the STCG Analyser

As we have seen above, an STCG grammar actually defines a set of SSTCs in a way quite similar to the definition of a mathematical function. In evaluating a mathematical function, if the function is defined in terms of other sub-functions then it can only be completely evaluated after all its sub-functions have been evaluated and have returned the appropriate values. We can view the STCG analysis process in the same manner where, by taking the input string/sentence as their STRING, the set of explicit SSTCs defined by the axiom rules of a grammar is constructed from the resultant sub-SSTCs defined by the reference rules of these axiom rules. Since the reference rules of the axiom rules may in turn refer to other rules, they may also return the completed SSTCs only after their respective reference rules have been completed. This reference process terminates when all remaining sub-SSTCs evaluated are defined by sub-correspondences which do not refer to any other rule, namely the 'lexical-SSTCs', which must match the input words (the non-lexical SSTCs are called 'phrasal-SSTCs').

We illustrate this in the following analysis of the input string "He picks the ball up" with respect to a grammar consisting of rule R1 given in figure 1 and rules R1, R3 given in figure 3. The rule R1 is given as an axiom rule. The analysis process begins with the evaluation of the general SSTC defined by the axiom rule R1, which in turn leads to the evaluation of two other sub-SSTCs defined by the reference rules R1, R3, as illustrated in figure 4.

Figure 3. [Two further rules, each given as a main-corr with its sub-corrs: one with the lexical sub-corr John, ball, he, etc., and rule R3 with the lexical sub-corrs the, etc. and ball, etc.]

Figure 4. [Analysis of "He picks the ball up" over the positions 0 to 5. Left: successive application of the rules, expanding the SSTC of the axiom rule into phrasal-SSTCs and finally into lexical-SSTCs. Right: the index variables a, b, c, d bound against the input words He (0_1), picks (1_2), the (2_3), ball (3_4), up (4_5); the node v(part) carries the interval 1_2+4_5.]

In the diagram above (on the left), the analysis process expands the SSTC defined by the axiom rule into a string of sub-SSTCs, which is further expanded into another string of sub-SSTCs until it cannot be expanded any further, which is when the string of sub-SSTCs consists only of lexical-SSTCs. The string of lexical-SSTCs is then matched with the words in the input string.
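The evaluation order described above, in which a rule completes only once the rules it refers to have completed, can be sketched as a small memoised recogniser. This is a deliberately simplified stand-in and not the STCG analyser of [2]: it assumes STRING patterns that are matched left to right, it only accepts or rejects a sentence without building SSTCs, and the rule names (`AXIOM`, `NP_WORD`, `NP_DET`), category names and lexical entries are all hypothetical.

```python
from functools import lru_cache

# Lexical sub-correspondences: category -> words (all entries hypothetical).
LEXICON = {
    "pron_or_noun": {"he", "john", "ball"},
    "det": {"the"},
    "noun": {"ball"},
    "v": {"picks"},
    "part": {"up"},
}

# One STRING pattern per rule. A tuple item lists the rules that a string
# variable refers to; a plain string item is a lexical category.
RULES = {
    "AXIOM": [("NP_WORD",), "v", ("NP_WORD", "NP_DET"), "part"],  # #1.v.#2.part
    "NP_WORD": ["pron_or_noun"],
    "NP_DET": ["det", "noun"],
}


def make_analyser(words):
    words = tuple(w.lower() for w in words)

    @lru_cache(maxsize=None)  # the cache plays the role of a chart
    def ends(rule, start):
        """All positions where `rule` can end when it starts at `start`."""
        positions = {start}
        for item in RULES[rule]:
            reached = set()
            for pos in positions:
                if isinstance(item, tuple):  # reference to other rules
                    for ref in item:
                        reached |= ends(ref, pos)
                elif pos < len(words) and words[pos] in LEXICON[item]:
                    reached.add(pos + 1)  # lexical match against the input
            positions = reached
        return frozenset(positions)

    return ends


def accepts(sentence):
    """True if the axiom rule covers the whole sentence."""
    words = sentence.split()
    return len(words) in make_analyser(words)("AXIOM", 0)
```

Calling `accepts("He picks the ball up")` succeeds because each referenced rule returns its end positions before the axiom rule continues, whereas `accepts("He picks up the ball")` fails under this toy grammar since the pattern expects the particle after the second string variable.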
Note that the matching need not be done in a projective manner, as can be seen in this particular example, where the lexical-SSTCs are matched to the words in the input string in a crossed serial manner, a case of crossed dependencies. In order to keep track of such non-projective correspondences, we introduce the use of index variables to record the interval corresponding to each symbol appearing in the STRING (as illustrated on the right of figure 4).

In [2], we proposed a design of the STCG analysis algorithm which is capable of mimicking some very useful features in various context-free parsing techniques. One such feature is the use of charts in tabular parsing algorithms, as exemplified in Earley's Algorithm [7], which is very helpful in avoiding redundancies that may otherwise result in a combinatorial explosion. Another is the representation of a shared forest in terms of STCG grammar rules, which in fact follows the approach adopted in [8], as illustrated in the next section.

2.2. Multiple Results of Analysis for an Ambiguous Input Sentence

The example sentence given above is unambiguous, and thus corresponds to only a single representation tree. However, natural language grammars are known to be in the class of highly ambiguous grammars, and as such, there may be numerous representation trees generated for a single sentence in the language described. Instead of storing each representation tree separately in the set of SSTCs defining the correspondences between the given sentence and all its possible representation trees, we should try to represent all these in a space-efficient manner. In the figure given below, we present a compact way of representing a set of SSTCs corresponding to an ambiguous sentence by means of an AND-OR graph of rules, similar to the technique used by [8]. For example, the two SSTCs:

Figure 5: Two linguistic representations of the sentence John saw Mary in the boat. [Two SSTC tree diagrams over the positions 0 to 6.]

can be factorised into an AND-OR graph of rules R2, R3, R5, RPP (given below) and rules R1, R3 (given in figure 3) in the following manner:

Figure 6: An AND-OR Graph of STCG grammar rules. [A graph of rules over the words John, saw, Mary, in, the, boat.]

[Rule diagrams for R2, R3, R5 and RPP: each gives a main correspondence with its sub-correspondences and their reference rules.]
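The space saving of an AND-OR graph can be sketched with a small table in which each OR-node lists its alternative analyses and each alternative (an AND-node) lists its children. The node labels, spans and attachment choices below are hypothetical and are not a reconstruction of the rules R2, R3, R5 and RPP; they only mirror the two readings of "John saw Mary in the boat", and the counting function is an assumption added for illustration.

```python
from functools import lru_cache
from math import prod

# OR-node -> list of alternatives; each alternative (an AND-node) is a tuple
# of child nodes. Nodes absent from the table are leaves (words).
# Labels and spans are hypothetical, for "John saw Mary in the boat".
FOREST = {
    ("S", 0, 6): [
        # Reading 1: the PP sits inside the object NP ("Mary in the boat").
        (("NP", 0, 1), ("V", 1, 2), ("NP", 2, 6)),
        # Reading 2: the PP attaches higher, next to the object NP.
        (("NP", 0, 1), ("V", 1, 2), ("NP", 2, 3), ("PP", 3, 6)),
    ],
    ("NP", 0, 1): [(("John", 0, 1),)],
    ("V", 1, 2): [(("saw", 1, 2),)],
    ("NP", 2, 3): [(("Mary", 2, 3),)],
    ("NP", 2, 6): [(("NP", 2, 3), ("PP", 3, 6))],
    ("PP", 3, 6): [(("in", 3, 4), ("NP", 4, 6))],
    ("NP", 4, 6): [(("the", 4, 5), ("boat", 5, 6))],
}


@lru_cache(maxsize=None)
def count_trees(node):
    """Number of distinct trees packed below `node`."""
    if node not in FOREST:  # a word: exactly one way to realise it
        return 1
    # OR: sum over alternatives; AND: product over the children.
    return sum(prod(count_trees(child) for child in alt)
               for alt in FOREST[node])
```

Both readings share the sub-analyses for "John", "saw", "Mary" and "in the boat", so each of these is stored once even though `count_trees(("S", 0, 6))` reports two trees.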

3. Concluding Remarks

Recently, efficient context-free parsing methods such as the LR parser and Earley's Algorithm have been referred to extensively in implementing parsers for most of the formalisms used in the field of NLP. In an effort to retain the efficiency of standard context-free parsing algorithms, most recent declarative formalisms are typically restricted by the constraint of string concatenation in context-free grammars, which allows a sentence to be systematically decomposed so that the parsing process can be indexed by the subparts of that decomposition (the substrings). However, it has also been widely recognised that the concatenation restriction of CFG can be problematic in handling phenomena such as lexicalisation, featurisation, and especially crossed dependencies. As an alternative, we propose the STCG formalism, which allows for a more 'natural' way of specifying the strings of the language being described, their corresponding linguistically motivated representation trees, and the correspondence between the two, where the correspondence need not be projective and is hence appropriate for the said phenomena.

The standard CF parsing methods cannot be adopted directly in the analysis of an input sentence with respect to an STCG grammar, because the STRING patterns of the STCG need not submit to the concatenation restriction of CFG. Nevertheless, in this paper we present the general layout of an analyser for the STCG which is capable of mimicking some very useful features in various context-free parsing techniques (due to space constraints only the layout is given; interested readers may find more details in [2]). One such feature is the use of charts in tabular parsing algorithms, as exemplified in Earley's Algorithm [7], which is very helpful in avoiding redundancies that may otherwise result in a combinatorial explosion. Another is the compact way of representing possible parse trees for ambiguous sentences, such as the one seen in [8].
Furthermore, we have also provided a natural way of handling awkward phenomena such as lexicalisation, featurisation, and worst of all, crossed dependencies, while at the same time retaining much of the efficiency of standard context-free parsing algorithms [2,3].

REFERENCES

[1] Zaharin Y., String-Tree Correspondence Grammar: a declarative grammar formalism for defining the correspondence between strings of terms and tree structures, Proceedings of the 3rd Conference of the European Chapter of the ACL, Copenhagen, April 1987.
[2] Tang Enya Kong, Natural languages Analysis in machine translation (MT) based on the STCG, PhD thesis, Universiti Sains Malaysia, Penang, March 1994.
[3] Tang Enya Kong, Zaharin Y., Handling Crossed Dependencies with the STCG, Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'95), Sofitel Ambassador Hotel, Seoul, Korea, Dec. 4-6, 1995.
[4] Yves Lepage, Parsing and Generating Context-Sensitive Languages with Correspondence Identification Grammars, Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'91), Singapore, 25-26 Nov 1991.
[5] Zaharin Yusoff, Tang Enya Kong, Generation of analysis programs in ROBRA (ARIANE) from String-Tree Correspondence Grammars (or a Strategy for Analysis in machine translation), Proceedings of the 3rd Machine Translation Summit, Washington, D.C., July 1991.
[6] Zaharin Y., Tang Enya Kong, String-Tree Correspondence Grammars as a base for the automatic generation of analysis programs in machine translation, Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, June 1991.
[7] J. Earley, An efficient context-free parsing algorithm, Communications of the ACM, Vol. 13, Num. 2, Feb 1970, pp. 94-102.
[8] Lang, B., Towards a Uniform Formal Framework for Parsing, in: Current Issues in Parsing Technology, M. Tomita (ed.), Kluwer Academic Publishers, 1991, pp. 153-171.
[9] Zaharin Y., Strategies and heuristics in the analysis of natural languages in machine translation, PhD thesis, Universiti Sains Malaysia, Penang, March 1986.
[10] Y. Lepage, Un système de grammaires correspondancielles d'identification, thèse de Docteur, IMAG, Université Joseph Fourier, Grenoble, June 1989.
[11] Zaharin Yusoff, Christian Boitet, Representation trees and string-tree correspondences, Proceedings of the 12th International Conference on Computational Linguistics, COLING-88, Budapest, August 1988, pp. 59-64.
[12] Ch. Boitet, P. Guillaume, M. Quezel-Ambrunaz, Le point sur ARIANE-78, début 1982 (DSE-1), vol. 1, part. 1: le logiciel, GETA, avril 1982.
[13] Tong Loong Cheong, The JEMAH System: Reference Manual, UTMK document, USM, Penang, 1988.
