The Role of Domain Ontology in Text Mining Applications: The ADDMiner Project

Ana Cristina B. Garcia, Inhaúma Neves Ferraz and Fernando Pinto
Universidade Federal Fluminense
bicharra@ic.uff.br; ferrazl@addlabs.uff.br; fernando@addlabs.uff.br

Abstract

Extracting insights from large text collections is an aspiration of any organization that wants to take advantage of its experience, which is generally recorded in textual documents. Textual documents, digital or not, have been the most common way to register an organization's transactions. Free text is a very easy way to input data, since it requires no special training from users. On the other hand, the text material that is so easily collected becomes the major challenge for building automatic interpretation tools. In this paper we present ADDMiner, a text-mining model for extracting causality relationships from a large text collection of accident reports. Our model uses a domain ontology together with corpus-based computational linguistics to guide the mining process. Examples from offshore oil platform accident reports illustrate the potential benefits of our approach.

1. Introduction

Offshore petroleum platform operation is a high-risk activity with an extremely high economic return. Accidents are frequent because of the intrinsic danger of handling large amounts of combustible material under high pressure. To minimize the risks, accident reports, which contain a description of the accident and the measures taken to resolve it, help a petroleum organization learn from its history. Generally, the industry records its accident history in online textual documents. This rich material becomes almost worthless if it is not properly compiled. As the report collection grows, making sense of the information inside becomes an impossible task for human readers and calls for an automatic solution. In this context, text mining is a promising approach.
Although text-mining techniques have not yet provided conclusive results for general-purpose mining, a domain-specific application may give different results. Our problem consists of extracting cause-effect relations from accident report documents. This paper discusses the use of a domain ontology to elicit cause-effect relations in a large collection of textual accident reports from offshore oil platforms.

2. Accident report domain

In the Brazilian petroleum industry, offshore drilling and production are the predominant activities, since most reservoirs lie in offshore areas. There are thirty-nine oil fields, mostly located in Rio de Janeiro state. These oil fields are explored by sixty-four oil platforms, operated by forty thousand workers. The considerable number of professionals involved, together with the nature of oil platform operation, makes this a high-risk operation in economic, environmental and human terms.

One of the requirements for a platform to operate is the existence of a method to register accidents (or even incidents), including information describing the people involved, the consequences to the unit, the way the problem was solved and future actions to prevent recurrence. Generally, textual documents are created to fulfil this requirement. Although these reports are available electronically, very little can be done to consolidate the information they contain. The information in the anomaly treatment report is not structured. There is no database with clear attributes that would allow the accident history and the analyses stored in the reports to be extracted. To perform statistical analysis, determine the real causes of accidents and correlate platform measurements with accidents, the company needs to hire experts who carefully read and make sense of each report and try to consolidate the information they find. If the number of reports were small, a human expert could take care of this job. However, since the number of reports is huge and growing, an automatic approach seems to be the only feasible one. In this context, using an ontology and text mining through the ADDMiner model is a promising approach to our problem.

3. ADDMiner Model

ADDMiner, as illustrated in Figure 1, is divided into four main blocks:

Natural Language Processing block: it represents the linguistic treatment that summarizes each textual report into a set of attribute-value pairs. The text in each report is treated as a set of sentences. In turn, each sentence becomes an ordered set of words, which are identified using a lexicon indexed by stems and classified syntactically and semantically. Finally, a parser syntactically analyses the sentences and builds a parse tree for each sentence that guides the semantic processing. As described, this is almost classic natural language processing, with some nuances. The lexical analysis uses a stemmer to preprocess the words and reduce the lexicon size. Furthermore, although prior syntactic processing is desirable to facilitate text understanding, the semantic extraction block can recover from parser failures.

Statistic Classification block[1]: each report document is classified according to the type of accident it reports. To build the classifier, we selected a set of reports and manually classified them into 15 different accident types according to our understanding from reading each report. The reports were statistically treated to remove worthless words and then to identify the words that represent each report.

Meaning extractor block: instead of taking a general approach, we consider that meaning is context dependent. We developed an ontology for the domain of offshore oil platform accident reports, as illustrated in Figure 2. The ontology works as a guide to search for content in a given accident report.

Data Mining block: it represents the data mining process, which uses the association rules technique[2].

[Figure 1: The ADDMiner Model. Components shown: Textual Documents, Stemmer, Domain Lexicon, Parser, Statistic-based Document Classifier, Meaning Extractor, Domain Ontology, Structured Report Record, Report Data Base, Data Miner, Association Rules + Report Indexing. Data passed between components: report, stems, syntactic tags, semantic tags, parsing tree, document type, domain description.]

The use of a domain ontology is the key to our approach. As shown in Figure 2, an accident report contains:

[Figure 2: A sample of ADDMiner's accident report ontology.]

- information on the task during which the accident happened, including the environmental conditions, the list of equipment involved, the main task and the associated tasks;
- information on the actions taken to mitigate the problem, including immediate actions, the definitive corrective action and related preventive actions to prevent recurrence;
- information on the problem itself, including a general description of when, where and how the accident happened[3], as well as information on the sources of the causes;
- information on the impacts, both financial and human-related; and finally
- information on the consequences brought[4] by the accident, both to the artifact (the oil platform) and to people (those who work on the oil platform).

4. An Example of using ADDMiner in the Petroleum Accident Report Domain

As an example, we present a typical case of text mining in our accident report data set. Both the algorithm and the software are still under development.

The input is a collection of flat text files, one for every Anomaly Report. Each report contains a set of sentences, and each sentence a set of words. To keep the lexicon concise, we use a stemmer to reduce each word to a stem (token) that can be found in the lexicon. For example, the word accidentado becomes accident + ado (token + suffix). The token accident is used as an index to retrieve the syntactic and semantic information stored in the lexicon. Syntactic information is used to help the tokenizer/stemmer process. All tokens and their semantic information are then processed by the statistical analyzer, which recognizes the anomaly type of each report. This statistical analyzer was previously trained with a sample of the domain. The process is illustrated in Figure 3; our sample report was classified as an Accident with Injury report.

[Figure 3: Anomaly type recognition process. Components shown: Text, Stemmer, Statistic, Anomaly type.]
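The stemming and lexicon lookup step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the suffix list, the lexicon entries and the function names `stem` and `lookup` are all assumptions made for the example, since the paper does not publish its lexicon or stemming rules.

```python
# Hypothetical suffix list and lexicon; neither is given in the paper.
SUFFIXES = ["ados", "adas", "ado", "ada", "es", "s"]

LEXICON = {
    # stem -> syntactic and semantic information (illustrative values only)
    "accident": {"syntactic": "noun", "semantic": "anomaly"},
    "injur": {"syntactic": "noun", "semantic": "injury"},
}


def stem(word, lexicon=LEXICON, suffixes=SUFFIXES):
    """Split a word into (token, suffix) such that the token is in the lexicon.

    Longer suffixes are tried first. Words that cannot be reduced to a
    known stem are returned unchanged with an empty suffix.
    """
    word = word.lower()
    if word in lexicon:
        return word, ""
    for suffix in sorted(suffixes, key=len, reverse=True):
        candidate = word[: -len(suffix)]
        if word.endswith(suffix) and candidate in lexicon:
            return candidate, suffix
    return word, ""


def lookup(word, lexicon=LEXICON):
    """Return (token, suffix, lexicon entry or None) for a word."""
    token, suffix = stem(word, lexicon)
    return token, suffix, lexicon.get(token)


token, suffix, info = lookup("accidentado")
print(token, suffix, info)
```

Indexing the lexicon by stem rather than by surface form is what keeps it small: every inflected form of a word shares one entry.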
As the anomaly (accident) type of each report is recognized, we switch to the corresponding ontology to perform the appropriate ontology-based information extraction; for our sample, the Accident with Injury ontology was chosen automatically. We have developed a domain ontology describing all 15 types of anomalies/accidents that may occur, such as operational fault, accident with injury and machine break. Once the statistical characterization of the anomaly type is done, the text is passed to the Information Extraction block, which applies information extraction techniques guided by the selected ontology. This block is the key to our process and is shown in Figure 5.
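The idea of switching to a type-specific ontology and using it to decide what to look for can be illustrated with a small sketch. Everything below is hypothetical: the slot names, the regular-expression cues, the sample report and the `extract` function are assumptions chosen for illustration, and simple pattern matching stands in for the grammatical and semantic treatment the paper refers to.

```python
import re

# Hypothetical ontologies: anomaly type -> {slot name: regex cue with one group}.
ONTOLOGIES = {
    "accident_with_injury": {
        "activity": r"was (\w+ing\b[^.,]*?) when",
        "injury": r"suffered an? ([^.]*?) on",
        "body_part": r" on (?:the|his|her) ([^.]*)\.",
        "immediate_action": r"sent to the ([^.]*)\.",
    },
    "machine_break": {
        "equipment": r"the ([^.]*?) (?:failed|broke)",
    },
}

# Invented English sample text, not taken from the paper's data set.
SAMPLE_REPORT = (
    "The employee was drilling a steel bulkhead when the bit broke. "
    "He suffered a cut on the left wrist. He was sent to the infirmary."
)


def extract(report_text, anomaly_type, ontologies=ONTOLOGIES):
    """Fill the slots of the ontology selected by the anomaly type.

    Slots whose cue is not found stay None, so a partially understood
    report still yields a usable structured record.
    """
    record = {"anomaly_type": anomaly_type}
    for slot, pattern in ontologies[anomaly_type].items():
        match = re.search(pattern, report_text, flags=re.IGNORECASE)
        record[slot] = match.group(1).strip() if match else None
    return record


print(extract(SAMPLE_REPORT, "accident_with_injury"))
```

The design point is that the ontology, not the text, determines the attributes of the output record: the same report passed with a different anomaly type would be searched for a different set of slots.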

[Figure 5: Information extraction. Inputs shown: Anomaly type, Text.]

The information extracted in this phase guides the filling of the database attributes. The database model was based on the ontologies considered. Information extraction uses grammatical and semantic treatment to find the concepts described in the ontology.

Next, as illustrated in Figure 6, the database is processed to find association rules, constrained only by the fact that our objective is to find the sources of anomalies. This is done by filtering for rules that contain possible reasons for faults and injuries according to the ontology. As an example of a rule extracted in this phase, we have stairs + steel + making hole -> injury, with support x and confidence y.

[Figure 6: Association Rules Generation. Components shown: Structured records, Data Mining (Association Rules), Association Rules.]

The final phase is to show the rules to the user as structured and graphical information. Again, this is based on the selected ontology. Figure 7 shows rule number 31 and some of its properties for a specific document. The bottom center part lists all documents that satisfy rule 31 (7 documents). The second one was selected and is displayed in the bottom right part. The top right part shows the rule with all the corresponding attributes, presented according to the current ontology. Bold colored tags indicate properties that were filled automatically by the program. The program is intended to color the corresponding text in the document with the same color as the tag; at present, that text is shown when the user double-clicks a property. In this case, the description of activity property was selected. In this way, the following properties (ontology components) were found during the processing block shown in Figure 5:
Patient (paciente) = Nilceu Mario Moro
Company (empresa) = ATM
Injury (lesão) = corte contuso / escoriação [blunt cut / abrasion]
Body part (parte do corpo) = punho da mão esquerda [left wrist]
Activity (atividade) = furava uma antepara de aço para fixação de um suporte de ferramentas no local [was drilling a steel bulkhead to fix a tool holder in place]
Injury reason (razão da lesão) = Ao vazar a parede, a broca quebrou, o empregado desequilibrou-se [on breaking through the wall, the drill bit broke and the employee lost his balance]
Activity type (classe de atividade) = Parte de atividade [part of activity]
Activity schedule (horário de atividade) = Durante o trabalho [during work]
Immediate action (ação imediata) = Enfermaria [infirmary]

The probable causes found for this accident were the victim's lack of attention and an imperfection in the environment.

5. Discussion

Most research on text mining focuses on developing broad, general-purpose technologies to improve the retrieval of web text documents. Since our objective is to answer a well-defined question, "What is causing accidents?", we could take a domain-dependent approach when developing a tool to process the domain data source. This paper presented an approach to reveal cause-effect information buried in textual accident report files. The text mining question can be understood as three sub-questions[5]:

(1) What is written in an accident report?
(2) Is there any structure in the storytelling style that can guide the understanding of a report? What information is expected to be provided when describing an accident?
(3) Is it possible to draw cause-effect inferences from the reported accidents? Is each case unique?

The first question was addressed by a natural language processor that combines a stemmer, which reduces the size of the domain lexicon, with a parser that deals with incomplete information. The second question was addressed by including a domain ontology[6] that describes what should be in an accident report (the domain-dependent part of the approach). The ontology guided the semantic processing by providing an expectation of what should be looked for in the text.
The third question was addressed by an association rule data miner with post-processing to prune the output. Rule visualization and the ability to retrieve sample accident reports that comply with a rule are the most effective post-processing techniques considered here. We developed a tool based on the ADDMiner model, implemented in C++, which shows the feasibility of our approach.
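To make the support and confidence of a rule such as stairs + steel + making hole -> injury concrete, the sketch below computes both measures by brute force over a handful of invented records. It assumes the standard definitions (support = fraction of records containing antecedent and consequent; confidence = that count divided by the number of records containing the antecedent). The records, thresholds and the function name `rules_for_target` are hypothetical, and fixing the consequent in advance is only one possible reading of the paper's filtering by sources of anomalies.

```python
from itertools import combinations

# Invented structured records; each is the set of attribute values of a report.
RECORDS = [
    {"stairs", "steel", "making hole", "injury"},
    {"stairs", "steel", "making hole", "injury"},
    {"stairs", "steel"},
    {"making hole", "injury"},
    {"valve", "leak"},
]


def rules_for_target(records, target, min_support=0.3,
                     min_confidence=0.7, max_len=3):
    """Enumerate rules 'antecedent -> target' that pass both thresholds.

    Brute-force enumeration of antecedents up to max_len items; adequate
    for a small illustration, not for a large report database.
    """
    n = len(records)
    items = sorted(set().union(*records) - {target})
    found = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(items, size):
            wanted = set(antecedent)
            n_antecedent = sum(1 for r in records if wanted <= r)
            if n_antecedent == 0:
                continue
            n_both = sum(1 for r in records if wanted <= r and target in r)
            support = n_both / n
            confidence = n_both / n_antecedent
            if support >= min_support and confidence >= min_confidence:
                found.append((antecedent, target, support, confidence))
    return found


for antecedent, target, support, confidence in rules_for_target(RECORDS, "injury"):
    print(" + ".join(antecedent), "->", target,
          f"support={support:.2f} confidence={confidence:.2f}")
```

Restricting the consequent to a concept of interest is a simple form of pruning: rules that say nothing about the target are never generated, which keeps the output small enough to inspect visually.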

[Figure 7: Sample of rule visualization. The rule shown relates stairs, steel and making hole to injury; after the Description property is selected, the corresponding text is shown colored in the document.]

6. References

[1] Usama Fayyad and Ramasamy Uthurusamy. Data mining and knowledge discovery in databases: Introduction to the special issue. Communications of the ACM, 39(11), November, 1999.
[2] Marti Hearst. Untangling text data mining. Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics. 1999.
[3] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Company. 1999.
[4] Ouzounis, C. A., TEXTQUEST: Document clustering of MEDLINE abstracts for concept discovery in molecular biology. PSB 2001, pp. 384-395, 2001.
[5] Glenisson, P., Antal, P., Mathys, J., Moreau, Y. & De Moor, B., Evaluation of the Vector Space Representation in Text-Based Gene Clustering. PSB 2003, pp. 391-402, 2003.
[6] H. D. White and K. W. McCain. Bibliometrics. Annual Review of Information Science and Technology, 24:119-186, 1989.