Closed Domain Question Answering for Cultural Heritage


Bernardo Cuteri
DEMACS, University of Calabria, Italy
cuteri@mat.unical.it

Abstract. In this paper I present my research goals and the results I have obtained so far in my first year of PhD. In particular, this paper describes a novel architecture for closed domain question answering and a possible application in the cultural heritage context. Unlike open domain question answering, which makes intensive use of Information Retrieval (IR) techniques, closed domain question answering systems can be built on top of a formal model, with the possibility of applying formal logic and reasoning. Natural language question answering poses some nontrivial problems. We investigate such problems and propose solutions based on AI techniques, picking the Cultural Heritage domain as a target application.

Keywords: Closed domain question answering, AI, NLP, ASP, cultural heritage

1 Introduction

The information need of a user often resolves to a simple question, where it would be useful to have a brief answer instead of whole documents to look through. IR techniques have proven very successful at locating documents relevant to a user query in large collections, but the effort of finding the specific desired information in those documents is left to the user. Question answering attempts to find direct answers to user questions. As intuition suggests, answering any kind of question, with no linguistic and no domain restrictions, is a very hard task. When no restriction is placed on the domain of the questions, we speak of open domain question answering; when questions are bound to a specific domain, we speak of closed (or restricted) domain question answering (CDQA). In open domain QA, most systems are based on a combination of Information Retrieval and NLP techniques [3].
Such techniques are applied to a large corpus of documents: first the system attempts to retrieve the best candidate documents, then it selects the paragraphs most likely to contain the desired answer, and finally it processes the extracted paragraphs by means of NLP. This approach also underlies many closed domain question answering systems, but in a closed domain we can additionally benefit from existing structured knowledge. Some of the very early question answering systems were designed for closed domains and were essentially conceived as natural language interfaces to databases [1, 2].

The idea of studying and applying closed domain question answering to the cultural heritage domain comes from the PIUCULTURA project, in which my university is a research partner. The project aims at implementing a mobile system for the fruition of cultural heritage, and my university is in charge of researching and developing techniques for the implementation of a question answering prototype for cultural heritage. As far as closed domains are concerned, cultural heritage can benefit from structured data sources: in this field, information has already started to be stored and shared with common standards. One of the most successful standards is the CIDOC Conceptual Reference Model (CIDOC-crm), which provides a common semantic framework for mapping cultural heritage information and can be adopted by museums, libraries and archives. Our idea is to design and implement a system capable of interpreting natural language questions about cultural heritage objects and facts, mapping the input questions into formal queries compliant with the CIDOC-crm model, and executing such queries to retrieve the desired information. In closed domains, question structures are more predictable than in open domains, and we propose to design a sophisticated template matching module based on a declarative formalism (Answer Set Programming [5]) for question classification and query extraction. A particular feature we want to introduce is the possibility of having dialogues instead of only atomic questions. This may help when the initial question is ambiguous or when the system needs clarification to provide an accurate answer. Moreover, the fact that the system is based on formal queries rather than statistical methods may lead to more robust answer creation, with the possibility of obtaining a step-by-step justification of the answer and an easier validation.
In the following sections we present an architectural model of the system and provide some more details about the tasks involved in the question answering process.

2 System Architecture and Working Principles

Figure 1 shows the architecture of the system (with some simplifications), highlighting the main modules and what the process looks like. The process is split into five tasks:

1. question processing
2. template matching
3. query expansion and contextualization
4. query execution
5. answer creation

In the following subsections we walk through the process and analyze it step by step.

Fig. 1. Simplified architecture with single question interaction
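Before examining the individual steps, the five-task pipeline can be sketched end-to-end. The following Python fragment is a toy illustration only: the function names, the whitespace tokenizer and the triple-store knowledge base are all invented for this sketch and are not part of the actual system.

```python
# Toy end-to-end sketch of the five tasks; every name here is illustrative.

def question_processing(text):
    # Stands in for tokenization, POS tagging and dependency parsing.
    return text.rstrip("?").split()

def template_matching(tokens):
    # One hard-coded template: "Who <verb> <object>?".
    if len(tokens) == 3 and tokens[0].lower() == "who":
        return {"type": "who-verb-object", "verb": tokens[1], "object": tokens[2]}
    return None

def expand_and_contextualize(query, history):
    # Pronoun resolution and synonym expansion would happen here.
    return query

def execute(query, kb):
    # The real system would query a CIDOC-crm-compliant knowledge base.
    return [s for (s, v, o) in kb
            if v == query["verb"] and o == query["object"]]

def answer_creation(query, results):
    # Inverse template matching: fill an answer template with the result.
    if not results:
        return "I don't know."
    return f"{results[0]} {query['verb']} {query['object']}"

kb = [("picasso", "painted", "guernica")]
query = template_matching(question_processing("Who painted guernica?"))
query = expand_and_contextualize(query, history=[])
print(answer_creation(query, execute(query, kb)))  # picasso painted guernica
```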

2.1 Question Processing

This is the main NLP step. Fortunately, tokenization, POS tagging and natural language parsing have decades of research behind them, and there are plenty of tools available that solve these problems efficiently. With respect to the Cultural Heritage domain, one thing we cannot overlook is the importance of entity recognition, as we may have proper nouns of artefacts or persons that must not be mistakenly treated by NLP tools (e.g. splitting the title of a painting into distinct grammatical parts). First the question is tokenized and tagged with part-of-speech (POS) tags. Then a natural language parser extracts grammatical relations (a.k.a. typed dependencies) from the text (e.g. which word is the subject of which verb, which is the object, and so on).

2.2 Template Matching

Questions are classified and transformed into formal queries by means of template matching. In this context, templates represent the structure of typical questions: if a certain template is matched, we can infer something about the question type. Every question template is accompanied by a formal query in which some slots are empty and are filled with terms extracted from the question that matches the template. For example, imagine we have a template for questions of the type Who-verb-object: the question Who painted Guernica? matches this pattern and a corresponding query can be created. guernica and painted can then be used as constants in the query, filling the empty slots mentioned before. We want to investigate the possibility of implementing template matching with Answer Set Programming (ASP) [5]. ASP evolved from deductive databases, logic programming and nonmonotonic reasoning. It is a flexible language for knowledge representation, reasoning and declarative problem solving, and efficient systems are available [6].
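To make the slot-filling idea concrete before turning to ASP, here is a small Python sketch of the same Who-verb-object template over a toy dependency representation. The words/deps encoding mirrors the textword and gr facts of the ASP encoding, but is otherwise invented for illustration; the actual system is proposed in ASP, not Python.

```python
def match_who_verb_object(words, deps):
    """Fill the (verb, object) slots if the Who-verb-object template fires.

    words maps positions to tokens; deps is a list of (head, dependent,
    relation) triples, as a dependency parser would produce.
    """
    if words.get(1) != "who":
        return None
    for head, dependent, relation in deps:
        if relation == "dobj":  # dependent is the direct object of head
            return {"verb": words[head], "object": words[dependent]}
    return None

# "Who painted Guernica?" -> positions 1..3, word 3 is the object of word 2.
words = {1: "who", 2: "painted", 3: "guernica"}
deps = [(2, 3, "dobj")]
print(match_who_verb_object(words, deps))  # {'verb': 'painted', 'object': 'guernica'}
```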
ASP is thus a concrete tool for developing complex applications by just specifying a set of logic rules of the form Head :- Body, where Body is a conjunction of possibly negated atoms and Head is a disjunction of atoms. The programmer does not need to provide an algorithm for solving a problem with ASP; rather, she specifies the properties of the desired solution by means of a collection of logic rules called a logic program. The stable models, or answer sets, of the program correspond to the solutions of the modelled problem. Besides disjunction in rule heads and nonmonotonic negation in rule bodies, the language of ASP also features special atoms for defining aggregates, strong constraints for selecting solutions, and weak constraints for solving optimization problems. The implementation details go beyond the scope of this paper, but we can say that ASP is a good candidate for a fast and declarative implementation of template matching. This step requires a small preprocessing step in which the input (i.e. the words with their associated grammatical relations and parts of speech) is transformed into ASP facts. A simple example of a possible template for matching questions of the type Who-verb-object is the following:

template(1, bt(W1,W2)) :- textword(1, who), gr(2,3,dobj), textword(2,W1), textword(3,W2).

where the textword predicate denotes the presence of a certain word at a certain position and the gr predicate denotes a grammatical relation between two words: gr(2,3,dobj) means that the word at position 3 is the object of the word at position 2. This template is a bit simplified and does not take POS tags into account, but it gives an idea of how to implement a template in ASP.

2.3 Query Expansion and Contextualization

The result of template matching is a formal query. Sometimes, to be effective, the query has to be expanded with context information and/or word semantic information. We can illustrate the importance of both with an example. Let's say that we asked When was the Mona Lisa created?. An admissible follow-up question could be And who did it?. The pronoun it clearly stands for the painting, but in order to understand it, the system has to keep context information, or at least store the question history. Another problem we can analyze with the same example is the following: suppose our knowledge base contains the information Leonardo da Vinci painted the Mona Lisa. We know that if someone painted something, we can also say that they did it. The question answering system has to deal with this and similar problems. A possible solution is to expand the query by using synonyms, hypernyms and other word semantic relations. Fortunately, there are encyclopedic dictionaries available that provide such relations.
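The expansion step can be sketched in a few lines of Python. The RELATIONS table below is a hand-made stand-in for the synonym/hypernym relations that a lexical resource would provide; its entries and the query layout are invented for this illustration.

```python
# Toy query expansion: RELATIONS maps a verb to broader or related verbs,
# standing in for the semantic relations of a real lexical resource.
RELATIONS = {
    "did": ["painted", "sculpted", "created"],  # "did it" may denote any action
    "painted": ["created"],                     # painting is a way of creating
}

def expand_query(query):
    """Add to the query the set of verbs it should be matched against."""
    verbs = {query["verb"], *RELATIONS.get(query["verb"], [])}
    return {**query, "verbs": sorted(verbs)}

print(expand_query({"verb": "did", "object": "monalisa"})["verbs"])
# ['created', 'did', 'painted', 'sculpted']
```

With this expansion, the follow-up question And who did it? can be matched against the stored fact that someone painted the object, even though the verbs differ.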
Among them is BabelNet [4], which also has the desirable property of being multilingual; this may help in case we want to extend the work to different languages.

2.4 Query Execution and Answer Creation

In our model, the query is executed against a structured knowledge base. The query results (if any) can then be used to build a natural language answer with a mechanism similar to template matching, but in the inverse direction. A possible approach is to pair each question template with an answer template. The answer template may have empty slots for answer terms, and it is used by the answer creation module to build the NL answer once the query has been executed successfully.
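A minimal sketch of such paired answer templates, using Python's string.Template; the template identifiers and slot names are invented for this illustration.

```python
from string import Template

# Each question template is paired with an answer template whose empty
# slots ($answer, $verb, $object) are filled from the query result.
ANSWER_TEMPLATES = {
    "who-verb-object": Template("$answer $verb $object."),
    "where-located":   Template("$object is located in $answer."),
}

def create_answer(template_id, slots):
    # Inverse of template matching: pour the result back into NL form.
    return ANSWER_TEMPLATES[template_id].substitute(slots)

print(create_answer("who-verb-object",
                    {"answer": "Picasso", "verb": "painted", "object": "Guernica"}))
# Picasso painted Guernica.
```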

3 Current State and Future Work

In this paper we presented an architecture and some implementation ideas for a closed domain question answering system and discussed the tasks involved in the process. The work is currently under development; studies have been conducted to investigate current research trends in question answering and the available solutions. At this moment we have developed a small QA prototype capable of answering simple questions. It uses the Stanford parser [7] for tokenization, POS tagging and parsing, integrates BabelNet [4] for query expansion, and supports some common question types. For example, it is possible to ask who performed a certain action on a certain object, or where a certain object is located. Templates cover different ways of expressing the same question, such as where is Guernica located? and in what museum is Guernica located?. We are investigating how to cope with question nuances, trying to design more general templates for questions that are not perfectly matched by a simpler template. This is where ASP (with disjunction, weak constraints and aggregates) may play a crucial role, as opposed to less expressive languages like Datalog. We have started to create a broad catalogue of possible questions in order to build more complex templates and extend the system to more difficult questions. We are now planning to push this approach to its limits, trying to manage a broad set of questions on cultural heritage. If this works well, we also plan to add multilingual support, checking whether the template system is easy to extend to different languages. Finally, we want to investigate how to implement non-trivial dialogues centered around questions, instead of only single atomic questions.

References

1. Woods, W. A. (1973, June). Progress in natural language understanding: an application to lunar geology. In Proceedings of the June 4-8, 1973, National Computer Conference and Exposition (pp. 441-450). ACM.
2. Green Jr., B. F., Wolf, A. K., Chomsky, C., Laughery, K. (1961, May). Baseball: an automatic question-answerer. In Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference (pp. 219-224). ACM.
3. Hirschman, L., Gaizauskas, R. (2001). Natural language question answering: the view from here. Natural Language Engineering, 7(04), 275-300.
4. Navigli, R., Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217-250.
5. Gelfond, M., Lifschitz, V. (1991). Classical negation in logic programs and disjunctive databases. New Generation Computing, 9(3-4), 365-385.
6. Calimeri, F., Ianni, G., Ricca, F., Alviano, M., Bria, A., Catalano, G., ... Manna, M. (2011, May). The third answer set programming competition: Preliminary report of the system competition track. In International Conference on Logic Programming and Nonmonotonic Reasoning (pp. 388-403). Springer Berlin Heidelberg.
7. Klein, D., Manning, C. D. (2003, July). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1 (pp. 423-430). Association for Computational Linguistics.