Proceedings of the 9th WSEAS International Conference on SIMULATION, MODELLING AND OPTIMIZATION

Natural language processing implementation on Romanian ChatBot

RALF FABIAN, MARCU ALEXANDRU-NICOLAE
Department for Informatics
Lucian Blaga University of Sibiu
Ion Raţiu Street, no. 5-7, Sibiu
ROMANIA
ralf.fabian@ulbsibiu.ro, marcu.alex@yahoo.com

Abstract: - Whether in written or spoken form, language is an essential part of human behavior. It is used for knowledge representation and for transfer of knowledge from one generation to another. Without language we would not be capable of any kind of communication. Today, we are all confronted with an unprecedented volume of information, most of it in text form. Using computer systems to manage and access large volumes of information has become a necessary evil today, and will be even more so for upcoming generations. Systems capable of understanding human language would significantly improve human-computer interaction. This paper presents an application framework for natural language processing in the form of a ChatBot for the Romanian language.

Key-Words: - natural language processing, formal languages, parsing, natural language interface, computational model

1 Introduction
Since for many people a large and growing fraction of work and leisure time is spent navigating and accessing the universe of information, classical computer languages and information querying methods are not an attractive or realistic option. Thus, the study of language has become a primary area of interest for science. Besides handling the vast amount of information it would have to access, a system capable of understanding human language would first of all improve human-computer interaction. Some years ago (in 2003), it was estimated that the annual production of books reached about 8 Terabytes. It would take a human being at least five years to read the new scientific material that is produced every 24 hours. These estimates, however, were based on printed materials and do not include the increasing amount of information produced electronically on the Web.
None of the current expert systems can match the flexibility and accuracy of a human conversation. Considering the level of ambiguity in some languages, it is amazing how human psychology has adapted. Natural language processing (NLP) is defined [4] as a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. Computer systems that understand natural language deal with machine reading comprehension and represent a subtopic of NLP. The application of natural language understanding described in this paper addresses text-based processing. Research has shown that this kind of processing applies successfully to: querying documents on desired topics from databases, information extraction from documents or messages, text translation from one language to another, and question-answering systems [1][3].

The system needs to participate actively in order to maintain a natural dialogue. Furthermore, it must verify that things are understood and, if not, be able to generate clarification subdialogues. Parsing input is more complex than the reverse process of output construction in natural language generation, because of the potential occurrence of unknown and unexpected features in the input and the need to determine the appropriate syntactic and semantic schemes to apply to it.

The first popular program that used natural language communication was ELIZA, developed at MIT in the 1960s. Other common chatterbot or chatbot applications are Dr. Romulon, based on the ALICE artificial intelligence chat platform, and MathBot, for answering simple number problems. A milestone in this application field is the award-winning free natural language artificial intelligence chat robot A.L.I.C.E. (Artificial Linguistic Internet Computer Entity), which uses AIML (Artificial Intelligence Markup Language) [8]. A large number of chatterbots, chat bots, conversational agents and virtual agents from all over the world are listed in the Chatbots Directory [7].

ISSN: 1790-2769 440 ISBN: 978-960-474-113-7
There are two main motivations for developing computational models: the scientific motivation and the practical/technological motivation [1][2]. The former deals with obtaining a better understanding of how language works; it tries to transpose complex theories into computer programs and then test them by observing how well they perform. The latter deals with the assumption that natural language processing capabilities may change the way computers are used today. Computers capable of natural language understanding could access and manage stored information in text form and, in addition, provide a user interface accessible to everyone.

In this paper we present the implementation of a ChatBot based on common concepts from formal language theory and natural language understanding. The program is implemented in the Python programming language and may be obtained and accessed through the internet. At the end we outline a set of extensions to the formal model used. These are based on our previous research on formal models for modeling and simulating dynamic systems. Even if the model used does not faithfully match the way humans process language, what matters is that it produces the desired results.

2 Notions and terminology
The first step in making a computer capable of processing natural language is to define a set of rules that yield the exact communication needed by the computer, as opposed to the more ambiguous one accepted among humans. It is possible for a sentence to have any number of meanings, even within a particular context. This raises a very particular problem for algorithms meant to understand human language, because computer programs are traditionally used in a very precise and exact way.
From formal language theory we know that a Chomsky generative grammar (grammar for short) [2], [6], [7] is a quadruple $G = (V_N, V_T, S, P)$, where $V_N$ and $V_T$ are the alphabets of nonterminal and terminal symbols, respectively; $S \in V_N$ is the starting symbol or axiom; and $P$ is a finite set of pairs of words from $(V_N \cup V_T)^*$, $P = \{(u_i, v_i) \mid 1 \le i \le m\}$, such that every word $u_i$ contains at least one nonterminal symbol. The pairs $(u_i, v_i)$ are called derivation rules, production rules or simply productions and will be denoted by $u_i \to v_i$. If the left-hand side of every production rule consists of only a single nonterminal symbol, then we have a context-free grammar [2][1]. The set of all sentences built with respect to the considered rules is called the language generated by the grammar and is formally defined as:

$$L(G) = \{\, p \in V_T^* \mid S \Rightarrow^* p \,\}.$$

From the viewpoint of natural language processing, context-free grammars are interesting for two reasons:
- the model is powerful enough to represent structures of natural languages;
- the model is simple enough to build efficient parsers to analyze sentences.

Having a model of a given language, the next step is to create an algorithm which tests whether a given sentence is well formed. Such an algorithm is called a parser. Technically, parsing or, more formally, syntactic analysis is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given formal grammar. Every programming language has a parser and at least an interpreter to turn the code into a program. Whereas for programming languages this task is considerably easy, human languages are seemingly endless in complexity. A sentence in natural language may have several interpretations, and choosing between them is related to the beliefs and knowledge of the person who communicates.

The following production rules are a simplified excerpt from the grammar of the Romanian language used in our application.
  S → PS
  PS → PREDICAT SUBIECT
  PREDICAT → PRONUME_INTEROGATIV VERB_COPULATIV
  SUBIECT → ARTICOL_NEHOTARAT SUBSTANTIV
  PRONUME_INTEROGATIV → 'ce'
  VERB_COPULATIV → 'este'
  ARTICOL_NEHOTARAT → 'o'
  SUBSTANTIV → 'masina'

According to these rules we can build sentences in a tree structure called a derivation tree (e.g. Figure 1).

Figure 1. Derivation tree of a sentence.

Considering that a sentence can be of any size, there are two strategies for analyzing a sentence from the grammar point of view:
(a) top-down: starts
with the S symbol and attempts to populate the internal structure of the derivation tree to obtain a sequence of terminal symbols that matches the classes of the words in the input sentence.
(b) bottom-up: attempts to populate the internal structure of the derivation tree starting from the sentence and searching for matching production rules that can lead stepwise to the top symbol S.

Simulating a system mainly requires a model for creating experiments that best matches the evolution of the real system, together with a set of procedures for processing these experiments that may indicate the optimal decisions for further control of the system. The system simulation starts by initializing the system with data describing the initial state. The dynamics of the system consist in choosing the next activity, i.e. the next procedure to be executed.

3 Application description
The application represents an implementation of a Chatbot for the Romanian language and is basically guided by the work and results from [1][2][5][6]. It is also an attempt to build a framework for easing computer-human communication through natural language usage. The user interface is simple and can easily be integrated into any web page. Figure 2 depicts the elements of the interface, containing a text field for the user input, an optional button to send a sentence to the system and a list with the conversation history.

Figure 2. User interface (labels: Input, Discussion history).

The development is done entirely in Python [17] as programming language, with Django [18] for data models and web integration. Functionally, the application supports two operation modes:
(a) training mode: this mode is used to train the system by adding and classifying new knowledge;
(b) conversation mode: this is the normal operation mode, in which the system responds to the sentences passed by the user.
Training data as well as testing data are collected from the Romanian language, due to the underlying grammar. In this first version the two operation modes are independent.
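Since the application is written in Python, the grammar excerpt from Section 2 and the top-down strategy can be illustrated with a short sketch. The code below is not taken from the application: the names `GRAMMAR`, `derive` and `parse` are hypothetical, the grammar contains only the eight excerpt rules, and the terminal 'masina' is written without diacritics as in the excerpt. It assumes a grammar without left recursion, which holds for the excerpt.

```python
# Illustrative sketch only; GRAMMAR, derive and parse are hypothetical names.
# Each nonterminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S": [["PS"]],
    "PS": [["PREDICAT", "SUBIECT"]],
    "PREDICAT": [["PRONUME_INTEROGATIV", "VERB_COPULATIV"]],
    "SUBIECT": [["ARTICOL_NEHOTARAT", "SUBSTANTIV"]],
    "PRONUME_INTEROGATIV": [["ce"]],
    "VERB_COPULATIV": [["este"]],
    "ARTICOL_NEHOTARAT": [["o"]],
    "SUBSTANTIV": [["masina"]],
}


def derive(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` derives a prefix of tokens[pos:]."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must match the current input word.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Expand the right-hand side left to right, keeping every partial match.
        partial = [([], pos)]
        for part in rhs:
            partial = [
                (children + [subtree], nxt)
                for children, cur in partial
                for subtree, nxt in derive(part, tokens, cur)
            ]
        for children, end in partial:
            yield (symbol, children), end


def parse(sentence):
    """Return a derivation tree rooted in S, or None if the sentence is not well formed."""
    tokens = sentence.lower().split()
    for tree, end in derive("S", tokens, 0):
        if end == len(tokens):  # the whole input must be consumed
            return tree
    return None
```

With these assumptions, `parse("ce este o masina")` returns a nested tuple whose root is `"S"`, mirroring a derivation tree, while `parse("este ce o masina")` returns `None`. A bottom-up parser would instead start from the words and reduce them towards S.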
Figure 3. Response generation process with internal stages. Diagram labels: External (Web interface): Input, Output; Internal: Words, Parsing, Syntactic structure, Context interpretation, Discourse context, Grammar, Vocabulary, Application reasoning, Context generation, Syntactic structure, Realization, Words.

We plan to extend this type of operation
such that user input sentences may contribute to the knowledge of the system. A proposal for this is described in a later paragraph.

3.1 Internal structure and functionality
With respect to the concepts of grammars and parsers, we can now describe the composition and internal behavior of our system. For a given input, in the form of a sentence, we have to consider the following steps in order to extract an understanding from it. First of all, the sentence has to be checked to see if it is well formed. Having passed this check, it is then divided into three parts: the element that performs the action, the action itself, and the rest of the sentence, containing extra parameters used to understand how and in which circumstances the action has been performed. Next, a class is constructed which generates the action on the word performing it. A response according to the system's behavior is then provided.

Communication is the process of transferring information from one source to another. In learning mode, the information communicated to the system will initially need a few pre-coded actions, necessary for creating the initial links between parts of knowledge. All words are, in essence, labels for something in real life, and thus they are the most basic elements of human communication. Common to all elements are the representation of their specifics (particularities) and the actions that they can perform. A Word class contains two distinct elements: the parts and the action. With this representation it is now possible to create the sentence.

Figure 4. Internal data representation and relations.

What is important here is to understand the way in which a word's meaning is affected in a certain sentence, furthermore the way in which the meaning of the sentence is affected by the phrase, and in the end the general meaning as affected by the context and the general
topic of the discussion. The other basic element of this system is the action, which in this case is actual source code that will be invoked with the appropriate words contained in the sentence. The way in which this particular function is constructed represents the way in which the sentence will be understood. For better performance a distinction is made between existing verbs. A verb that expects an adverb will perform only a certain type of action, as opposed to a verb that expects an adjective. An even deeper separation can be made using a classification system. For example, a verb can perform certain operations on certain types of words.

3.2 Integration and extension
For a future release of this application we consider two alternatives as extensions of the formal model of the grammar treating system: one based on fuzzy reasoning and the other on a stochastic approach. In the following we present some basic concepts for the stochastic version. A fuzzy modeling system is treated in an earlier paper [16].

Definition 3.1. A stochastic generative grammar (stochastic grammar for short) is a pair $(G, f_p)$ where:
- $G = (V_N, V_T, S, P)$ is a Chomsky generative grammar and
- $f_p : P \to [0,1]$ is a probability function having the property $\sum_i f_p(\alpha \to \alpha_i) = 1$, where $\alpha \to \alpha_i$ are all $\alpha$-productions from $P$.

Extending the probability function from productions to derivations we obtain the following result:

$$f_p(S \Rightarrow^* q \Rightarrow^* r) = f_p(S \Rightarrow^* q)\, f_p(q \Rightarrow^* r),$$

where $f_p(q \Rightarrow^* r) = \prod_{p} f_p(p)$, with $p$ ranging over the productions applied.

Definition 3.2. The language generated by the stochastic grammar $(G, f_p)$ is

$$L(G, f_p) = \{\, p \mid p \in V_T^* \text{ and } S \Rightarrow^* p,\ f_p(S \Rightarrow^* p) > 0 \,\}.$$

Definition 3.3. A stochastic generative grammar of type $i = 0, 1, 2, 3$ is a pair $(G, f_p)$ where $G = (V_N, V_T, S, P)$ is a Chomsky generative grammar of type $i$ and $f_p : P \to [0,1]$ is a probability function with the property $\sum_i f_p(\alpha \to \alpha_i) = 1$, where $\alpha \to \alpha_i$ are all $\alpha$-productions from $P$.

Definition 3.4.
A stimulation function is a mapping $f : [0,1]^n \times \{1, 2, \ldots, n\} \to [0,1]^n$,

$$f(x_1, x_2, \ldots, x_n, i) = \big(f_1(x_1, x_2, \ldots, x_n, i),\ f_2(x_1, x_2, \ldots, x_n, i),\ \ldots,\ f_n(x_1, x_2, \ldots, x_n, i)\big),$$

verifying the properties:

$$\sum_{k=1}^{n} f_k(x_1, x_2, \ldots, x_n, i) = \sum_{k=1}^{n} x_k \qquad (1)$$

$$f_i(x_1, x_2, \ldots, x_n, i) > x_i \qquad (2)$$

$$f_l(x_1, x_2, \ldots, x_n, i) < x_l, \quad (\forall)\ l \neq i. \qquad (3)$$

We notice that if $\alpha \to \alpha_i$, $i = 1, \ldots, n$, are all the $\alpha$-productions from $P$, then we can associate to them a finite probability field

$$A(\alpha) = \begin{pmatrix} \alpha \to \alpha_1 & \alpha \to \alpha_2 & \ldots & \alpha \to \alpha_n \\ p_1 & p_2 & \ldots & p_n \end{pmatrix} \qquad (4)$$

On the set $\mathcal{A}$ of these probability fields we now define an operator called the stimulation operator. A stimulation operator is a mapping $E : \mathcal{A} \times \{1, 2, \ldots, n\} \to \mathcal{A}$ defined as

$$E(A(\alpha), i) = \begin{pmatrix} \alpha \to \alpha_1 & \alpha \to \alpha_2 & \ldots & \alpha \to \alpha_n \\ q_1 & q_2 & \ldots & q_n \end{pmatrix}, \qquad (5)$$

where $q_l = f_l(x_1, x_2, \ldots, x_n, i)$, evaluated at $x_k = p_k$, for $l \in \{1, 2, \ldots, n\}$. Properties (1)-(3) thus mean that stimulating production $i$ raises its probability, lowers the probabilities of the other $\alpha$-productions, and keeps their sum unchanged.

4 Conclusion and further work
At this early stage of development, our approach looks promising, especially given that the system is already working. The modular development of the application framework enables us to experiment with further extensions, as stated in Section 3.2. One major concern for every system that has to interact with a human communicator is error validation. Furthermore, since such issues arise constantly, it is necessary to extend the dynamic system behavior with a learning strategy. An input validation would then use the parser for a preliminary check, followed by a logical test. The latter can only be done by understanding the meaning of a sentence. A drawback that needs to be handled is the fact that the system is currently susceptible to assimilating bad information in the instruction mode. Another issue we would like to investigate is the integration into an e-commerce web portal, where a potential new client can ask questions about the products available and in this way bypass tedious navigation through confusing menus.

References:
[1] James Allen, Natural Language Understanding, 2nd Edition, Addison-Wesley, 1995, ISBN-10: 0805303340, ISBN-13: 9780805303346.
[2] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools, Addison Wesley, 2001.
[3] Lluís Màrquez, Machine Learning and Natural Language Processing, Technical Report LSI-00-45-R, Departament de Llenguatges i Sistemes Informàtics (LSI), Universitat Politècnica de Catalunya (UPC), Barcelona, Spain, 2000.
[4] Wikipedia, The free encyclopedia, http://en.wikipedia.org/wiki/natural_language_processing
[5] Bird, Steven; Ewan Klein; Edward Loper, Natural Language Processing with Python. O'Reilly Media, 2009.
[6] Bird, Steven; Ewan Klein; Edward Loper; Jason Baldridge, Multidisciplinary instruction with the Natural Language Toolkit. Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, ACL, 2008.
[7] Chatbots directory, http://www.chatbots.org
[8] A.L.I.C.E. Artificial Intelligence Foundation, http://www.alicebot.org
[9] Multi Level Recursive Specifications for Context Free Grammars - Vasile Crăciunean, Cristina Elena Aron, Ralf Fabian, Ioan-Daniel Hunyadi, Proceedings of the 11th WSEAS International Conference on COMPUTERS, Agios Nikolaos, Crete Island, Greece, 2007, pag. 275-280, ISSN 1790-5117, ISBN 978-960-8457-95-9.
[10] Translation for intermediate code, Ioan-Daniel Hunyadi, Emil M. Popa, Ralf Fabian, Ionela Mocan, Proceedings of the 8th WSEAS International Conference, Mathematical Methods and Computational Techniques in Electrical Engineering (MMACTEE), Bucharest, Romania, 2006, pag. 93-98, ISSN 1790-5117, ISBN 960-8457-54-8.
[11] Fix point internal hierarchy specification for context free grammars - Vasile Crăciunean, Ralf Fabian, Ioan-Daniel Hunyadi, Emil M. Popa, Proceedings of the 11th WSEAS International Conference on COMPUTERS, Agios Nikolaos, Crete Island, Greece, 2007, pag. 247-251, ISSN 1790-5117, ISBN 978-960-8457-95-9.
[12] Zadeh, L. A. (1992). Knowledge representation in fuzzy logic. In An introduction to fuzzy logic applications in intelligent systems. Kluwer Academic.
[13] William Siler, James J.
Buckley - Fuzzy Expert Systems and Fuzzy Reasoning, Published by John Wiley & Sons, Inc., Canada, 2005.
[14] Jose Galindo, Angelica Urrutia, Mario Piattini - Fuzzy Databases: Modeling, Design and Implementation, Idea Group Publishing, USA, 2006.
[15] Wang, Z., X. Shao, G. Zhang, H. Zhu. Integration of Variable Precision Rough Set and Fuzzy Clustering: An Application to Knowledge Acquisition for Manufacturing Process Planning. Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing. Lecture Notes in Computer Science, 585-593, Springer Berlin / Heidelberg, 2005.
[16] Fabian, R., V. Crăciunean, E. M. Popa. Intelligent system modelling with total fuzzy grammars. Proc. of the 8th WSEAS International Conference, Mathematical Methods and Computational Techniques in Electrical Engineering (MMACTEE), 82-87, Bucharest, 2006.
[17] Python Software Foundation, http://www.python.org
[18] The Django framework, http://www.djangoproject.com