The TEXT-TO-ONTO Ontology Learning Environment

Alexander Maedche and Steffen Staab
Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany
{maedche,staab}@aifb.uni-karlsruhe.de
http://www.aifb.uni-karlsruhe.de/wbs

Abstract

Ontologies have become an important means for structuring information and information systems and, hence, important in knowledge engineering as well as in software engineering. However, there remains the problem of engineering large and adequate ontologies within short time frames in order to keep costs low. For this purpose, we present the TEXT-TO-ONTO Ontology Learning Environment, which is based on a general architecture for discovering conceptual structures and engineering ontologies from text. Our Ontology Learning Environment supports both the acquisition of conceptual structures and the mapping of linguistic resources to the acquired structures.

1 Introduction

Ontologies¹ have shown their usefulness in application areas such as intelligent information integration, information brokering and natural-language processing, to name but a few. However, their widespread usage is still hindered by ontology engineering being rather time-consuming and, hence, expensive. Our system TEXT-TO-ONTO tries to overcome this knowledge acquisition bottleneck by learning and discovering conceptual structures from texts. Natural language texts exhibit morphological, syntactic, semantic, pragmatic and conceptual constraints that interact in order to convey a particular meaning to the reader. Thus, the text transports information to the reader, and the reader embeds this information into his or her background knowledge. Through the understanding of the text, data is associated with conceptual structures, and new conceptual structures are learned from the interacting constraints given through language.
TEXT-TO-ONTO exploits the interacting constraints on the various language levels (from morphology to pragmatics and background knowledge) in order to discover new concepts and stipulate relationships between concepts. The system follows the balanced cooperative modeling approach described in [4], i.e. each modeling task can be done either by the user or by a learning tool of the system. This balanced interaction of system and user contributes to preparing the background knowledge, to enhancing the domain knowledge (the ontology), and to inspecting the learned knowledge.

¹ We restrict our attention in this paper to domain ontologies that describe a particular small model of the world as relevant to applications, in contrast to top-level ontologies and representational ontologies, which aim at describing generally applicable conceptual structures and meta-structures, respectively, and which are mostly based on philosophical and logical points of view rather than focused on applications.

2 TEXT-TO-ONTO Ontology Learning Environment

The process of semi-automatic ontology learning from text is embedded in an architecture that comprises several core features, described in the following as a kind of pipeline (cf. the overall schema in Figure 1). Nevertheless, the reader should bear in mind that the overall development of ontologies remains a cyclic process (cf. [1]). In fact, we provide a broad set of interactions such that the engineer may start with primitive methods first. These methods require very little or even no background knowledge, but they may also be restricted to returning only simple hints, such as term frequencies. While the knowledge model matures during the semi-automatic learning process, the engineer may turn towards more advanced and more knowledge-intensive algorithms, such as our mechanism for discovering generalized non-taxonomic relations described in [2].

Figure 1. Architecture of the Ontology Learning Environment (components shown: Text & Processing Management; Text Processing Server with stemming, POS tagging, chunk parsing, information extraction, ...; Lexical DB and domain lexicon; Learning & Discovering Algorithms; Evaluation; OntoEdit Ontology Modeling Environment; Ontology)

A comprehensive architecture lays the foundation for acquiring domain ontologies and linguistic resources ([3]). The main components of the architecture are (i) the Text & Processing Management component, (ii) the Text Processing Server, (iii) a Lexical Database and Domain Lexicon, (iv) a Learning Module, and (v) the Ontology Engineering Environment OntoEdit.

Text & Processing Management Component. The ontology engineer uses the Text & Processing Management Component to select domain texts exploited in the further discovery process.
She chooses among a set of text (pre-)processing methods available on the Text Processing Server and among a set of algorithms available in the Learning & Discovering component. The former module returns XML-annotated text, and this XML-tagged text is fed to the Learning & Discovering component.

Text Processing Server. The Text Processing Server may comprise a broad set of different methods. In our case, it contains a shallow text processor based on the core system SMES (Saarbrücken Message Extraction System) [5]. SMES is a system that performs syntactic analysis on natural language documents. In general, the Text Processing Server is organized in modules, such as a tokenizer, morphological and lexical processing, and chunk parsing, which use lexical resources to produce mixed syntactic/semantic information. The results of text processing are stored as annotations in XML-tagged text.

Figure 2. The TEXT-TO-ONTO Ontology Learning Environment

Lexical DB & Domain Lexicon. Syntactic processing relies on lexical knowledge. In our system, SMES accesses a lexical database with more than 120,000 stem entries and more than 12,000 subcategorization frames that are used for lexical analysis and chunk parsing. The domain-specific part of the lexicon (abbreviated domain lexicon; cf. the lower left part of Figure 2) associates word stems with concepts available in the concept taxonomy. Hence, it links syntactic information with semantic knowledge that may be further refined in the ontology.
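
The role of the domain lexicon can be illustrated with a minimal sketch. It assumes a toy lexicon and taxonomy held in plain Python dictionaries; the names DOMAIN_LEXICON, TAXONOMY, concept_for and ancestors are hypothetical and do not reflect the actual data structures of SMES or TEXT-TO-ONTO. The entries are borrowed from the tourism example used later in the paper.

```python
# Hypothetical toy taxonomy: child concept -> parent concept.
TAXONOMY = {
    "Hotel": "Accommodation",
    "GuestHouse": "Accommodation",
    "YouthHostel": "Accommodation",
}

# Hypothetical domain lexicon: word stem -> concept in the taxonomy.
DOMAIN_LEXICON = {
    "hotel": "Hotel",
    "guest house": "GuestHouse",
    "youth hostel": "YouthHostel",
    "cost": "Costs",
}


def concept_for(stem):
    """Return the concept linked to a word stem, or None if the stem is unknown."""
    return DOMAIN_LEXICON.get(stem.lower())


def ancestors(concept):
    """Return the taxonomy ancestors of a concept, nearest first."""
    result = []
    while concept in TAXONOMY:
        concept = TAXONOMY[concept]
        result.append(concept)
    return result
```

With these definitions, concept_for("youth hostel") yields the concept YouthHostel, and ancestors("YouthHostel") yields its more general concept Accommodation. This is the kind of link between syntactic and semantic information that the learning algorithms can exploit as background knowledge.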

Learning & Discovering component. The Learning & Discovering component applies various discovery methods to the annotated texts, e.g. term extraction methods for concept acquisition. Our scenario for discovering non-taxonomic relations uses the learning algorithm for discovering generalized association rules described in [2]. Conceptual structures that exist at learning time (e.g. a concept taxonomy) may be incorporated into the learning algorithms as background knowledge. The evaluation of the applied algorithms, such as described in [2], is performed in a submodule based on the results of the learning algorithm.

Ontology Engineering Environment. The Ontology Engineering Environment ONTOEDIT, which is a submodule of the Ontology Learning Environment TEXT-TO-ONTO, supports the ontology engineer in semi-automatically adding newly discovered conceptual structures to the ontology. A comprehensive description of the ontology engineering system ONTOEDIT and the underlying methodology is given in [8,9]. The screenshot depicted in Figure 2 shows on the left side the object-model backbone of an ontology. In addition to core capabilities for structuring the ontology, the engineering environment provides some additional features for the purpose of documentation, maintenance, and ontology exchange. OntoEdit internally stores modeled ontologies using an XML serialization.

3 Discovering Non-Taxonomic Conceptual Relations from Text using TEXT-TO-ONTO

In [2] we describe our approach for discovering non-taxonomic conceptual relations from text, which facilitates ontology engineering. Building on the user-modeled taxonomic part of the ontology, our approach analyzes domain-specific texts. It uses shallow text processing methods to identify linguistically related pairs of words, which are mapped to concepts using the domain lexicon. An algorithm for discovering generalized association rules [6] analyzes statistical information about the linguistic output.
Thereby, it uses the background knowledge from the taxonomy in order to propose relations at the appropriate level of abstraction. For instance, the linguistic processing may find that the word costs frequently co-occurs with each of the words hotel, guest house, and youth hostel in sentences such as (1).

(1) Costs at the youth hostel amount to $20 per night.

From this statistical linguistic data, our approach derives correlations at the conceptual level, viz. between the concept Costs and the concepts Hotel, Guest House, and Youth Hostel. The learning algorithm determines support and confidence measures for the relationships between these three pairs, as well as for relationships at higher levels of abstraction, such as between Accommodation and Costs. In a final step, the algorithm determines the level of abstraction most suited to describe the conceptual relationships by pruning seemingly less adequate ones. Here, the relation between Accommodation and Costs may be proposed for inclusion in the ontology. Results of the learning algorithm are visualized as a graph, such as the one depicted on the right side of Figure 2.
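
The interplay of support, confidence and taxonomy-based pruning can be made concrete with a small sketch. It is a simplification under stated assumptions, not the algorithm of [2] or [6]: each linguistically related concept pair is treated as one transaction, support is the fraction of transactions matching a pair, confidence is that count divided by the count of the left-hand concept, and a pair is pruned when a more general pair has support and confidence at least as high. This pruning criterion, the toy data and all names (PARENT, OBSERVED_PAIRS, generalizations, mine_relations) are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy taxonomy (child -> parent); not the paper's data.
PARENT = {
    "Hotel": "Accommodation",
    "GuestHouse": "Accommodation",
    "YouthHostel": "Accommodation",
}

# Hypothetical concept pairs, as if obtained by mapping linguistically
# related word pairs to concepts via the domain lexicon.
OBSERVED_PAIRS = [
    ("Costs", "Hotel"),
    ("Costs", "Hotel"),
    ("Costs", "GuestHouse"),
    ("Costs", "YouthHostel"),
    ("Costs", "City"),
]


def generalizations(concept):
    """The concept itself followed by all of its taxonomy ancestors."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def mine_relations(pairs, min_support=0.0, min_confidence=0.0):
    """Score concept pairs at every abstraction level, then prune."""
    n = len(pairs)
    pair_count, left_count = Counter(), Counter()
    for left, right in pairs:
        lefts, rights = generalizations(left), generalizations(right)
        left_count.update(lefts)
        # Count the pair itself and every generalization of it.
        pair_count.update(product(lefts, rights))

    rules = {}
    for (a, b), count in pair_count.items():
        support, confidence = count / n, count / left_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)

    def dominated(rule):
        # Assumed criterion: some more general pair is at least as strong.
        support, confidence = rules[rule]
        for general in product(generalizations(rule[0]), generalizations(rule[1])):
            if general != rule and general in rules:
                g_support, g_confidence = rules[general]
                if g_support >= support and g_confidence >= confidence:
                    return True
        return False

    return {rule: score for rule, score in rules.items() if not dominated(rule)}
```

On the toy data, mine_relations(OBSERVED_PAIRS, min_support=0.3) keeps only the pair (Costs, Accommodation), with support 0.8 and confidence 0.8. The pairs involving Hotel, Guest House and Youth Hostel are either below the threshold or subsumed by the more general pair, which mirrors the example in which the relation between Accommodation and Costs is the one proposed to the engineer.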

4 Conclusion

We have presented an approach and an implemented system for learning ontologies from text. The core idea of this approach is to support the knowledge engineer using a balanced cooperative modeling paradigm. We emphasize that we do not consider fully automatic ontology acquisition from text to be realistic, so we support the knowledge engineer as much as possible with graphical user interfaces and visualization of discovered conceptual structures. The system has been evaluated and applied for building domain ontologies in the tourism domain [7] and the insurance domain.

References

1. A. Maedche, H.-P. Schnurr, S. Staab, and R. Studer. Representation language-neutral modeling of ontologies. In U. Frank, editor, Proceedings of the German Workshop Modellierung-2000, Koblenz, Germany, April 5-7, 2000. Fölbach-Verlag, 2000.
2. A. Maedche and S. Staab. Discovering conceptual relations from text. In W. Horn, editor, ECAI 2000: Proceedings of the 14th European Conference on Artificial Intelligence. IOS Press, Amsterdam, 2000.
3. A. Maedche and S. Staab. Semi-automatic engineering of ontologies from text. In Proceedings of the 12th International Conference on Software and Knowledge Engineering, Chicago, USA, July 5-7, 2000. KSI, 2000.
4. K. Morik. Balanced cooperative modeling. Machine Learning, 11:217-235, 1993.
5. G. Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. An information extraction core system for real world German text processing. In ANLP'97: Proceedings of the Conference on Applied Natural Language Processing, pages 208-215, Washington, USA, 1997.
6. R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of VLDB'95, pages 407-419, 1995.
7. S. Staab, C. Braun, I. Bruder, A. Düsterhöft, A. Heuer, M. Klettke, G. Neumann, B. Prager, J. Pretzel, H.-P. Schnurr, R. Studer, H. Uszkoreit, and B. Wrenger. GETESS - searching the web exploiting German texts. In CIA'99: Proceedings of the 3rd Workshop on Cooperative Information Agents, LNAI 1652, pages 113-124, Berlin, 1999. Springer.
8. S. Staab and A. Maedche. Axioms are Objects, too - Ontology Engineering beyond the modeling of Concepts and Relations. Technical Report 400, Institute AIFB, Karlsruhe University, 2000.
9. S. Staab and A. Maedche. Ontology engineering beyond the modeling of concepts and relations. In A. Gomez-Perez, editor, Proceedings of the ECAI 2000 Workshop on Application of Ontologies and Problem-Solving Methods. IOS Press, Amsterdam, 2000.