Towards Building a WordNet for Vietnamese

Ho Ngoc Duc
Information Technology Institute, Vietnam National University
144 Xuan Thuy, Ha Noi
ducna@vnu.edu.vn

Nguyen Thi Thao
Communication Network Center, Hanoi University of Technology
1. Dai Co Viet Road, Ha Noi

Abstract: We report on our ongoing effort to develop VietWordNet, a WordNet for the Vietnamese language. We present the methodology we used, the lexical resources we employed, and the computing tools we designed to help acquire and filter lexical and semantic information from available machine-readable dictionaries and other resources.

Key Words: WordNet, Ontology, Language Engineering

1 Introduction

WordNet ([4, 8], http://www.cogsci.princeton.edu/~wn/) is a broad-coverage lexical-semantic net for the English language, developed at Princeton University since about 1985. Its design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. The synonym sets (or synsets) are linked via different relations, e.g., antonymy (rise and fall), hyponymy (is-a, e.g., bird is a hyponym of animal), and meronymy (has-a or part-whole relation, e.g., body and hand).

WordNet has proved very useful for many tasks in Natural Language Processing, e.g., parsing and machine translation, or the treatment of syntactic and semantic ambiguity. Moreover, attempts have been made to exploit the Princeton WordNet database in Information Retrieval, in particular by applying the synonymy relation represented by the synsets. Among other things, researchers have tried to enrich queries with semantically related terms and to compare queries and documents via conceptual distance measures.

The success of WordNet has inspired several projects that aim to construct WordNets for languages other than English [5, 7] or to develop multilingual WordNets.
Perhaps the most important project in this line is EuroWordNet [11], which aims at building WordNets for several European languages (Dutch, Italian, Spanish, etc.).

As more and more Vietnamese texts become available in electronic form and new Web pages in Vietnamese emerge every day, there is a pressing need to create large-scale online lexical-semantic nets for the Vietnamese language for use in NLP, Information Retrieval, and other areas. Our goal is to develop VietWordNet, a WordNet for the Vietnamese language along the lines of Princeton WordNet. As the manual construction of VietWordNet from scratch would be very costly and time-consuming, we have focused on techniques to construct the synsets and the relations between them semi-automatically from existing structured lexical resources. In this paper we present the methodology we used, the lexical resources we employed, and the computing tools we designed to help acquire and filter lexical knowledge and semantic information from available dictionaries and other resources.

The rest of the paper is organized as follows. In the next section we describe our approach to building the core of VietWordNet: its nominal part, consisting of an inheritance system of Vietnamese nouns. Then we give an overview of the available resources, the tools we can employ and the programs we have to develop to process these resources. We discuss the steps needed to construct the nominal part of VietWordNet and some methods to carry out these steps automatically. Finally we discuss our results and some ideas for future work.

2 Methodology

WordNet treats four major syntactic categories (Noun, Verb, Adjective, Adverb) separately. We choose the nouns as our starting point for building VietWordNet. The choice is justified by the crucial role that nouns play in the mental lexicon.
Nouns are the most common words, they are hierarchically organized as an inheritance system, and many relationships between synsets only apply to nouns [9]. In the first phase, we are creating a hierarchy of Vietnamese nouns. The necessary steps are as follows. We start with a list of Vietnamese nouns. The meanings of these words are then organized into synsets, i.e., we need to identify the meanings of every single word and then group the word meanings into sets of synonyms. The next step is to create relationships between the synsets. In the first phase we only consider the hypernymy/hyponymy relation. Hence, the result is a taxonomy of concepts. Finally, we attach to each synset a gloss explaining the meanings of the words in the set.
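The steps above (nouns, senses, synsets, hypernym links, glosses) suggest a simple data model. The following is a minimal Python sketch, not the authors' Java implementation. The class name `Synset`, the identifiers `n1` to `n3`, the single-parent simplification and the use of path length as a stand-in for conceptual distance are all assumptions made for illustration. The direct link from bò and trâu to động vật skips the intermediate concepts that a full taxonomy would contain.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Synset:
    """One lexical concept: a set of synonymous nouns, a gloss, and a parent."""
    synset_id: str                      # hypothetical identifier
    words: Set[str] = field(default_factory=set)
    gloss: str = ""                     # filled in during the final step
    hypernym: Optional["Synset"] = None  # assumption: at most one hypernym

    def hypernym_chain(self) -> List["Synset"]:
        """Return all ancestors, nearest first."""
        chain, node = [], self.hypernym
        while node is not None:
            chain.append(node)
            node = node.hypernym
        return chain


def path_distance(a: Synset, b: Synset) -> Optional[int]:
    """Count hypernym links between a and b via their nearest common ancestor.

    This is only one possible notion of conceptual distance.
    """
    up_a = [a] + a.hypernym_chain()
    up_b = [b] + b.hypernym_chain()
    for i, node in enumerate(up_a):
        for j, other in enumerate(up_b):
            if node is other:
                return i + j
    return None  # the two synsets share no ancestor


# Toy taxonomy using the Vietnamese nouns discussed in the paper.
animal = Synset("n1", {"động vật"})
cow = Synset("n2", {"bò"}, hypernym=animal)
buffalo = Synset("n3", {"trâu"}, hypernym=animal)
```

With this toy taxonomy, `cow.hypernym_chain()` contains only the động vật synset, and `path_distance(cow, buffalo)` counts the two links that connect bò and trâu through their shared hypernym.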

To identify the different meanings of a Vietnamese noun we primarily use a Vietnamese-English dictionary, sometimes consulting monolingual dictionaries. The steps of creating the synsets and the links among them rely heavily on the English WordNet. To create the nominal hierarchy for VietWordNet, we try to attach Vietnamese nouns to English WordNet synsets, using the Vietnamese-English and English-Vietnamese bilingual dictionaries. Finally, the glosses defining the synsets can be constructed from the monolingual Vietnamese dictionaries.

Our approach is based on the following hypothesis. Although Vietnamese and English are very different languages, the inheritance systems of nominal concepts in the two languages are similar, at least in certain domains. More precisely, we expect the hierarchies of nouns denoting concrete objects to be similar to those in WordNet. However, where abstract concepts are concerned, we expect gaps in the Vietnamese language, because it lacks many words expressing abstract concepts. This lack also results in other differences between the two languages. For example, the hypernym trees in VietWordNet can be expected to be shallower than the corresponding ones in English WordNet, and on average the synsets of VietWordNet also contain fewer words. To fill the gaps caused by the lack of Vietnamese words for certain concepts, we use collocations (multi-word translations; in the case of nouns, nominal phrases) where necessary.

Our approach is related to work on constructing WordNets for other languages semi-automatically. Based on the conceptual similarity between English and other European languages, the skeletons of the Spanish and Catalan WordNets were constructed in the same way [1, 3]. A similar effort to build a Korean WordNet is reported in [7]. The main differences between those works and ours lie in the ways the various heuristics are applied to map Vietnamese word senses to WordNet synsets.
The characteristics of the Vietnamese language and the lack of reliable lexical resources in electronic form cause problems that call for new solutions.

3 Resources

We use two different kinds of resources in building the core of VietWordNet for nouns: lexical resources (existing WordNets and machine-readable dictionaries, MRDs), and programs for processing these lexical resources.

3.1 Lexical resources

A variety of broad-coverage linguistic resources (dictionaries, thesauri, text corpora, etc.) are publicly available for many European languages (German, Spanish, French, etc.). They have proved to be very useful for the construction of WordNets for those languages. For Vietnamese, however, few large-scale linguistic resources exist, and very few of them are publicly available in computer-readable form. The MRDs that are available for Vietnamese were created with the human user in mind, i.e., they focus on presentation rather than structure, so they are not easily processed by computers. A human user can easily recognize the structure of a dictionary entry (e.g., headword, definitions, examples) based on typographical properties (text size, color, style: bold, italic, etc.), but for a computer a parser must be built to make the structure explicit. This task is often very difficult because the dictionaries are in different formats, and even within one dictionary the formats of the entries are not uniform. For instance, many entries contain part-of-speech (POS) information but others do not, so it is not always possible to find out the POS of a word from a dictionary.

As mentioned previously, the most important lexical resources we use are the English WordNet and the English-Vietnamese and Vietnamese-English bilingual dictionaries. Moreover, we also use some monolingual (Vietnamese) dictionaries. Version 1.7.1 of Princeton WordNet contains 134716 mappings between 109195 nouns and 75804 synsets.
Since WordNet was meant to be used on a computer from the beginning, it is truly computer-processable.

The bilingual English/Vietnamese dictionaries that we use for building VietWordNet were developed in the Free Online Vietnamese Dictionary Project, a project initiated in 1997 by one of the authors and some other Vietnamese researchers. Its principal goals are to help Vietnamese Internet users access foreign-language resources and to support teaching and learning the Vietnamese language over the Internet. In that project, several monolingual and bilingual dictionaries (English-Vietnamese, French-Vietnamese, etc.) have been compiled, digitized and integrated into a single extensible system. The English-Vietnamese dictionary contains about 58,000 entries, of which about 39,000 are nouns. The Vietnamese-English dictionary contains around 11,000 entries, including about 6,000 nouns.

Besides those resources we have at our disposal a larger Vietnamese-English dictionary and 4 monolingual Vietnamese dictionaries. Those dictionaries are not very well structured and not easily parsed, so their use is currently quite limited, and we decided to use them at a later stage. The larger Vietnamese-English dictionary contains about 22,000 entries, of which about 10,000 are positively identified as nouns. The monolingual dictionaries contain about 35,000 to 40,000 entries.

Also intended for use in a later phase, to enrich the synsets of VietWordNet, are the bilingual French/Vietnamese dictionaries from the Free Online Vietnamese Dictionary Project. These dictionaries are relatively well structured. They contain around 40,000 entries for each direction.

3.2 Computing tools

For accessing WordNet we could have used the program code available in the WordNet package, either in C or in Prolog. However, we decided to develop all tools for constructing and accessing VietWordNet in Java, for the following reasons. First, Java is platform-independent, so the programs we create are immediately available on all platforms without the need for porting. Second, with Java it is easy to turn the programs into a Web-based application that can be run on all operating systems, so that VietWordNet can be made available quickly on the Web for evaluation and validation by a large group of users. Third, Java offers built-in support for Unicode, which is essential for processing Vietnamese. Moreover, a WordNet browser can be built easily in Java using the Swing library. We therefore adopted JWNL (Java WordNet Library, [6]), a free third-party Java API for accessing WordNet, which integrates smoothly with our own programs.

The heterogeneity of the Vietnamese resources makes it necessary to develop a set of computing tools to process them. We have implemented programs to parse entries of the bilingual English/Vietnamese and French/Vietnamese dictionaries from the Free Online Vietnamese Dictionary Project. They transform each dictionary entry into a structure consisting of the headword, the pronunciation, the POS, the translations, usage codes, examples, etc. Preliminary versions of tools for mapping nouns from the dictionaries to English WordNet synsets have also been implemented. Parsers for the larger Vietnamese-English dictionary and for the monolingual Vietnamese dictionaries are currently under development. Moreover, a tool has been created to identify words in a Vietnamese text. We shall explain the use of some tools in the next section.
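The entry structure described above (headword, pronunciation, POS, translations, usage codes, examples) can be sketched as a small record type with a parser. The sketch below is in Python for brevity, whereas the authors' tools are written in Java. The one-line entry format `headword /pron/ [pos]: t1; t2 {code} | t3` is purely hypothetical and is not the format of the Free Online Vietnamese Dictionary Project. The names `Sense`, `Entry` and `parse_entry` are likewise illustrative.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    translations: List[str]
    usage_codes: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)  # not parsed in this sketch


@dataclass
class Entry:
    headword: str
    pronunciation: Optional[str]  # None when the entry gives none
    pos: Optional[str]            # None when the entry gives none
    senses: List[Sense]


# Hypothetical format: headword, optional /pronunciation/, optional [pos], colon, body.
ENTRY_RE = re.compile(
    r"^(?P<head>[^\[/]+?)\s*"
    r"(?:/(?P<pron>[^/]*)/\s*)?"
    r"(?:\[(?P<pos>[^\]]*)\]\s*)?"
    r":\s*(?P<body>.*)$"
)


def parse_entry(line: str) -> Entry:
    """Parse one entry line in the hypothetical format described above."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized entry: %r" % line)
    senses = []
    for chunk in match.group("body").split("|"):       # senses are separated by '|'
        codes = re.findall(r"\{([^}]*)\}", chunk)      # usage codes appear as {code}
        text = re.sub(r"\{[^}]*\}", "", chunk)
        translations = [t.strip() for t in text.split(";") if t.strip()]
        if translations:
            senses.append(Sense(translations, codes))
    return Entry(
        headword=match.group("head").strip(),
        pronunciation=match.group("pron"),
        pos=match.group("pos"),
        senses=senses,
    )


entry = parse_entry("đông [n]: east; orient | winter")
```

Here `entry.senses` holds two senses, one with the translations east and orient and one with winter. An entry without a POS tag, such as `"trâu: buffalo"`, yields `pos=None`, which is the case where the POS must be inferred by other means.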
4 Methods for constructing VietWordNet

Our approach is semi-automatic, i.e., some tasks are performed automatically and other tasks must be done manually. The automation relies on several heuristics, developed in [1, 2, 3, 5, 10] for constructing WordNets for other languages, which use the available bilingual and monolingual dictionaries as well as large corpora.

The first step of our procedure is to choose a list of Vietnamese nouns that will constitute the skeleton of VietWordNet and to identify all meanings of those nouns. This step is fully automated. We go through all entries of the Vietnamese-English dictionary and create for each entry the set of meanings of the headword if it is a noun. Because not all entries in the Vietnamese-English dictionary contain the part of speech, we rely on some heuristics to check whether a word is indeed a noun. If the POS is not known for an entry, we check whether its headword is contained in a list of Vietnamese nouns. For that task we need a list that comprises almost all Vietnamese nouns. We have extracted such a list from the available monolingual dictionaries. Another method to determine the POS of a Vietnamese word is to check whether its English translations are nouns by comparing them with a list that covers almost all English nouns. (We use the list of nouns contained in WordNet to meet this requirement.) If all single-word translations of a sense of a Vietnamese word are nouns, then this word is a noun.

The next step in building VietWordNet is to attach Vietnamese nouns to synsets of the English WordNet. Once the Vietnamese nouns are connected to WordNet synsets, we have achieved two goals. First, we have grouped meanings of Vietnamese words into synsets. Second, the most important semantic relations are transferred to VietWordNet, including the hypernymy/hyponymy relation. This approach can be illustrated with an example. Figure 1a depicts a tiny fraction of the inheritance system of English nouns.
This system helps us establish the hypernymy relation between cow and animal, or find the so-called conceptual distance between cow and buffalo. (Princeton WordNet contains much more information, for instance the synonyms of the nouns under consideration, but we shall ignore that information for now.) If we can attach to the synsets of Figure 1a the corresponding Vietnamese words, e.g., the words động vật, bò and trâu to the nodes representing the synsets of animal, cow and buffalo, as depicted in Figure 1b, then we have also established the corresponding relationships between the concepts động vật and bò or between bò and trâu.

Figure 1a: Animals; Land animals; Sea animals; Herbivorous; Carnivorous; Omnivorous; Buffalo; Cow

Figure 1b: Động vật; Động vật trên đất liền; Động vật trên biển; Động vật ăn cỏ; Động vật ăn thịt; Động vật ăn tạp; Trâu; Bò

We are experimenting with two different approaches to attaching Vietnamese nouns to WordNet synsets. In the first one we use the Vietnamese-English dictionary to translate all Vietnamese nouns. Each Vietnamese word will have one or more senses, and the English translations of each sense will belong to one or more synsets. Our task is to select those synsets to which we can attach the senses of the Vietnamese noun. The second approach is to start with the synsets in the inheritance system of the English WordNet. For each synset we use the English-Vietnamese dictionary to find Vietnamese translations of the English words in that synset. Selected Vietnamese translations are then attached to that synset.

In both approaches, if the translation is one-to-one (i.e., the bilingual dictionary in question gives only one translation for a certain sense), then we can assume with high confidence that the correspondence between word and synset has been established. This correspondence can be double-checked by translating the words back using the other bilingual dictionary. If a word has many translations, then we need to apply certain heuristics to exclude false hits and to assign the translations to the correct synsets.

Let us consider a simple example to illustrate the first approach. Consider the Vietnamese word đông. This word can be a noun (east, orient; winter), a verb (to coagulate; to congeal; to freeze), or an adjective (crowded; numerous). As a noun it has two senses. According to our Vietnamese-English dictionary, the first sense has two English translations, East and orient, and the second one has one translation, winter. (The different senses and translations can be extracted from the dictionary using the parsers we have developed.) Let us look at the first case. Take the two English nouns East and orient as input. According to WordNet, East has 3 senses and orient has 2. Thus, the first sense of the word đông may be attached to up to 5 synsets. We rank these candidates according to a confidence score based on several heuristics, which we discuss later. The second case is easier.
WordNet tells us that the English noun winter has only one sense and belongs to the synset {winter, wintertime}. Thus, we can attach the word đông to that synset.

Previous work has identified various heuristics to aid the automatic construction of WordNets for other languages. In [1, 2], the authors describe a method to extract semantic relationships from a monolingual dictionary and to use this information to construct a hierarchy of concepts. Unfortunately, the structures of the available monolingual dictionaries for Vietnamese are not very well suited to that method. To apply it we need a tool that makes the structure of a definition in the monolingual dictionary explicit, i.e., we need to analyze the word definition in order to identify the genus and the characteristics of the word to be defined. As a preliminary step towards using that heuristic we have developed a tool to identify Vietnamese words in a sentence.

Some other methods can be applied more easily. A simple one is to check whether all (or most) English translations of a Vietnamese noun constitute a WordNet synset. If so, the word can be attached to that synset. For example, one WordNet synset consists of the two words East and orient. The first sense of the Vietnamese noun đông also has the two translations East and orient. Thus, we may assign this word to that synset. Another heuristic is based on the assumption that the senses in the dictionaries are ordered according to their frequency, so that the first sense in the monolingual dictionary corresponds to the synset constructed from the first sense in the Vietnamese-English dictionary; we therefore attach the first definition to that synset. Another method is to use the available usage codes. For example, some entries contain information about the semantic domain (Science, History, Economics, etc.) or the usage style (technical, slang, vulgar, etc.) of the words in question.
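The noun check and the synset-matching heuristic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the authors' implementation. The inventory `TOY_SYNSETS` is a hand-made stand-in for WordNet with hypothetical identifiers and does not reproduce the real sense counts of WordNet 1.7.1. The Jaccard overlap used for ranking and the "unique best candidate" rule in `attach` are assumptions standing in for the paper's unspecified confidence score.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy stand-in for the English noun synsets (hypothetical ids and contents).
TOY_SYNSETS: Dict[str, Set[str]] = {
    "s1": {"east", "orient"},
    "s2": {"east"},
    "s3": {"orient"},
    "s4": {"winter", "wintertime"},
}
ENGLISH_NOUNS: Set[str] = set().union(*TOY_SYNSETS.values())


def looks_like_noun(translations: List[str], english_nouns: Set[str]) -> bool:
    """A sense counts as nominal if all its single-word translations are nouns."""
    single = [t.strip().lower() for t in translations if " " not in t.strip()]
    return bool(single) and all(t in english_nouns for t in single)


def candidate_synsets(translations: List[str],
                      inventory: Dict[str, Set[str]]) -> Set[str]:
    """Every synset that contains at least one of the translations."""
    wanted = {t.strip().lower() for t in translations}
    return {sid for sid, words in inventory.items() if wanted & words}


def rank_candidates(translations: List[str],
                    inventory: Dict[str, Set[str]]) -> List[Tuple[float, str]]:
    """Rank candidates by Jaccard overlap between translations and synset words."""
    wanted = {t.strip().lower() for t in translations}
    scored = []
    for sid in candidate_synsets(translations, inventory):
        words = inventory[sid]
        scored.append((len(wanted & words) / len(wanted | words), sid))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


def attach(translations: List[str],
           inventory: Dict[str, Set[str]]) -> Optional[str]:
    """Return a synset id if there is a single or clearly best candidate."""
    ranked = rank_candidates(translations, inventory)
    if not ranked:
        return None
    if len(ranked) == 1:          # unambiguous case, e.g. winter
        return ranked[0][1]
    best, runner_up = ranked[0], ranked[1]
    return best[1] if best[0] > runner_up[0] else None  # ties go to manual review


first_sense = attach(["East", "orient"], TOY_SYNSETS)   # translations form synset s1
second_sense = attach(["winter"], TOY_SYNSETS)          # only one candidate, s4
```

In this toy setting the first sense of đông is attached to the synset whose words are exactly its translations, and the second sense is attached to the only synset containing winter. The sketch leaves out the usage codes (semantic domain, usage style), which could serve as an additional filter on the candidates.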
Such usage hints can also help to test the compatibility between a Vietnamese word and a WordNet synset.

The last step in completing the core nominal VietWordNet is to add glosses to the synsets. Currently we simply add a definition from a monolingual dictionary to the synset. This method is most reliable if the synset contains a single word and that word has only one sense. If the words in the synset have several senses, we have to rely on various heuristics. As none of the heuristic methods discussed is fully reliable, we have to combine the results of several methods to reach a high confidence score.

5 Conclusions and future work

In this paper we have explored the semi-automatic construction of a WordNet for the Vietnamese language using pre-existing lexical resources such as Princeton WordNet, Vietnamese/English bilingual dictionaries, and monolingual Vietnamese dictionaries. We have analyzed the available resources, evaluated the applicability of several heuristics to them, and designed a set of tools to make these resources suitable for building a core VietWordNet.

In the future we need to improve the tools to make better use of the available resources, so that we can test and compare the various heuristics more adequately. In particular, we need to implement a parser for the larger Vietnamese-English dictionary so that we can construct a larger VietWordNet. We also need tools to process the monolingual dictionaries. Another issue is to make the programs more robust, so that they can cope with typographical errors in the resources.

The construction of VietWordNet cannot be fully automated. A step of testing and validation by human experts is necessary. To aid human users in this work, we are going to design and implement a set of visual tools that let users see and modify portions of VietWordNet, so that they can add words to or delete words from synsets, create or delete synsets, or modify relations between synsets.
We intend to integrate such tools into a Web-based application, so that online collaboration between various groups can be achieved more easily.

Acknowledgements

We thank Mr. Ho Hai Thuy for providing us with many essential lexical resources. We have also benefited very much from discussions with him about Vietnamese lexicography. We thank all the friends in the Free Online Vietnamese Dictionary Project who have contributed to the construction of the electronic English-Vietnamese and Vietnamese-English dictionaries.

References

[1] Atserias, J. et al., "Combining multiple methods for the automatic construction of multilingual WordNets". In: Proceedings of the International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, 1997.
[2] Copestake, A., "An approach to building the hierarchical element of a lexical knowledge base from a machine readable dictionary". In: Proceedings of the First International Workshop on Inheritance in Natural Language Processing, 1990.
[3] Farreres, X., G. Rigau and H. Rodriguez, "Using WordNet for building WordNets". In: Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, 1998.
[4] Fellbaum, C. (Ed.), WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, 1998.
[5] Hamp, B. and H. Feldweg, "GermaNet - a Lexical-Semantic Net for German". In: Proceedings of ACL Workshop Automatic Information Extraction and Building of Lexical Semantic Resources, Madrid, 1997.
[6] JWNL: http://sourceforge.net/projects/jwordnet
[7] Lee, C., G. Lee and J. Seo, "Automatic WordNet mapping using Word Sense Disambiguation". In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC 2000), Hong Kong, 2000.
[8] Miller, G. et al., "Five papers on WordNet". In: International Journal of Lexicography 3 (4), 1990.
[9] Miller, G., "Nouns in WordNet: a Lexical Inheritance System". In: International Journal of Lexicography 3 (4), 1990.
[10] Rigau, G., H. Rodriguez and E. Agirre, "Building Accurate Semantic Taxonomies from Monolingual MRDs". In: Proceedings of COLING-ACL '98, Montréal, 1998.
[11] Vossen, P., "EuroWordNet: building a multilingual database with wordnets for European languages". In: The ELRA Newsletter, Vol. 3(1), 1998. http://www.hum.uva.nl/~ewn