PRAXICON and its language-related modules


K. Pastra, P. Dimitrakis, E. Balta, and G. Karakatsiotis
Institute for Language and Speech Processing, ATHENA Research Centre, Artemidos 6 and Epidavrou, 15125, Athens, Greece
{kpastra,p dim,ebalta,gkarak}@ilsp.gr

The semantic gap between the low-level features of sensorimotor data and their meaning as expressed through language is one of the fundamental challenges in developing intelligent systems. In this demonstration, we will present the very first release of the PRAXICON [1], a grounded conceptual knowledge base tightly coupled with programs that perform a generative analysis of sensorimotor and language representations. This is the first grounding resource that is coupled with compositional and generative modules for visual, motoric and language representation analysis, and the first one that integrates the output of such modules within and across concepts, forming a rich semantic network of multi-representational concepts.

We will focus the demonstration on the PRAXICON database, its visualisation and four of its language-related tools:
(a) FreeText2PRAXICON: a first version of the language-based PRAXICON reasoner;
(b) WordNet2PRAXICON: a module that converts WordNet [2] into a reference-based resource, for enriching the PRAXICON;
(c) COSMOROE2PRAXICON: a module that extracts and infers conceptual information from the annotated COSMOROE corpus of TV travel series [3, 4]; and
(d) Cognitive2PRAXICON: a module that extracts and infers conceptual information from the POETICON cognitive corpus (cf. more details in http://www.poeticon.eu).

1 The conceptual knowledge base and its visualisation interfaces

The PRAXICON conceptual knowledge base has been realised in the form of a database (MySQL server) developed using the Java Persistence API (JPA). Thus, it is not bound to any operating system or database server.

A graphical user interface (GUI) has also been developed, which allows for text-based search of the concepts in the resource (cf. Figure 1) and subsequent exploration of all available concept information in a visualisation environment (cf. Figure 2). The PRAXICON's GUI has been developed in JavaFX, a new language that enables the PRAXICON to run on any operating system, either as an applet or as a stand-alone application. The interface serves for exploration of the PRAXICON conceptual knowledge base. One may type a word denoting a concept of interest, in Greek or English, and explore its multiple representations and network of relations. For example, Figure 1 shows the list of results of a query with the word "hammer" and the detailed results for the selection of the hammer#entity concept. Figure 2 shows the visualisation interface for exploring such a concept, where information on the language representation(s) (LR) of the concept, its visual representation(s) (VR) and its network of relations to other concepts is presented.

Fig. 1. PRAXICON GUI: text-based search interface

Fig. 2. PRAXICON GUI: visualisation environment
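To make the notion of a multi-representational concept more concrete, the following is a minimal Python sketch of how a concept with language representations, visual representations and relations might be stored and searched by word. It is not the PRAXICON schema: the class names, field names, the "used-in" relation label, the hammer#movement concept and the media identifiers are all assumptions made for illustration. Only the "lemma#type" identifier form (as in hammer#entity) and the Greek/English search are taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One multi-representational concept (hypothetical field names)."""
    name: str                  # lemma, e.g. "hammer"
    concept_type: str          # e.g. "entity", "movement", "feature", "abstract"
    language_reps: dict = field(default_factory=dict)  # language code -> list of words
    visual_reps: list = field(default_factory=list)    # ids of images or video segments
    relations: list = field(default_factory=list)      # (relation label, target concept id)

    @property
    def concept_id(self):
        # Identifier in the "lemma#type" form shown in the search results.
        return f"{self.name}#{self.concept_type}"


class KnowledgeBase:
    """In-memory stand-in for the database, for illustration only."""

    def __init__(self):
        self._concepts = {}

    def add(self, concept):
        self._concepts[concept.concept_id] = concept

    def get(self, concept_id):
        return self._concepts[concept_id]

    def search(self, word):
        """Return the ids of all concepts that have `word` as a language representation."""
        needle = word.lower()
        hits = []
        for cid, concept in self._concepts.items():
            words = [w.lower() for ws in concept.language_reps.values() for w in ws]
            if needle in words:
                hits.append(cid)
        return sorted(hits)


kb = KnowledgeBase()
kb.add(Concept("hammer", "entity",
               {"en": ["hammer"], "el": ["σφυρί"]},
               ["hammer_image_01"],                  # hypothetical media id
               [("used-in", "hammer#movement")]))    # hypothetical relation label
kb.add(Concept("hammer", "movement", {"en": ["hammer"]}, ["hammer_video_01"]))
kb.add(Concept("pizza", "entity", {"en": ["pizza"]}))

print(kb.search("hammer"))  # ['hammer#entity', 'hammer#movement']
print(kb.search("σφυρί"))   # ['hammer#entity']
```

A query word can thus return several concepts of different types, from which one is selected for detailed exploration, as in the search interface of Figure 1.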

2 FreeText2PRAXICON

We have integrated the PRAXICON conceptual knowledge base with the ILSP text processing pipeline for Greek and English and with a very first version of the PRAXICON's language-based reasoner. The ILSP text processor extracts information about a sentence by performing tokenization, lemmatization, stemming, part-of-speech tagging and syntactic parsing. This information is crucial for the development of an interface for the PRAXICON that uses free natural language. The reasoner, in its current, simple form, takes the processed text as input and matches it against the language representations of concepts in the PRAXICON. It then finds the optimal path that links the textually expressed concepts, as shown in Figure 3.

Fig. 3. This screenshot displays the optimal path that associates the concepts participating in the sentence "cut the pizza". The path starts from the Language Representation "cut", which is linked to the abstract concept "cut". The relations of the concept "cut" are expanded and the relation with the movement "cut with pizza cutter" is selected. The path continues to the inherent intersection of the relations between the movement "cut with pizza cutter", the entity "pizza cutter" and the abstract concept "cut". The relations of the entity "pizza cutter" are expanded and the chain of relations with the movement "cut with pizza cutter" and the entity "pizza" is selected. The path ends at the Language Representation of the entity "pizza" (Pizza). The reasoner favors the paths that invoke inherent relations between concepts. (Faded nodes and edges are some of the "roads not taken".)
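One way to picture a path search that favors inherent relations is a weighted shortest-path search in which inherent relations are cheaper to traverse than other relations. The Python sketch below uses Dijkstra's algorithm on a toy graph loosely modelled on the "cut the pizza" example. It is not the reasoner's actual algorithm, which is not specified here: the cost values, the knife#entity alternative, the node names and the choice of which edges count as inherent are all assumptions for illustration.

```python
import heapq

# Assumed costs: inherent relations are cheaper to follow than other relations.
INHERENT_COST = 1
OTHER_COST = 3

# Toy graph: (node, node, is_inherent). Hypothetical data.
EDGES = [
    ("LR:cut", "cut#abstract", True),
    ("cut#abstract", "cut with pizza cutter#movement", True),
    ("cut with pizza cutter#movement", "pizza cutter#entity", True),
    ("pizza cutter#entity", "pizza#entity", True),
    ("cut#abstract", "knife#entity", False),
    ("knife#entity", "pizza#entity", False),
    ("pizza#entity", "LR:pizza", True),
]


def build_graph(edges):
    """Build an undirected adjacency list with a cost per edge."""
    graph = {}
    for a, b, inherent in edges:
        cost = INHERENT_COST if inherent else OTHER_COST
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    return graph


def cheapest_path(graph, start, goal):
    """Dijkstra's algorithm: return (total cost, list of nodes), or None if unreachable."""
    queue = [(0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, step in graph.get(node, []):
            if neighbour not in settled:
                heapq.heappush(queue, (cost + step, neighbour, path + [neighbour]))
    return None


GRAPH = build_graph(EDGES)
cost, path = cheapest_path(GRAPH, "LR:cut", "LR:pizza")
print(cost)                # 5
print(" -> ".join(path))
```

In this toy graph the route through knife#entity has fewer edges (four instead of five) but a higher total cost (8 instead of 5), so the search selects the route through the pizza cutter, which uses only inherent relations.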

3 WordNet2PRAXICON

We have developed a module that takes the WordNet 3.0 dictionary and the WordNet 3.0 semantically annotated gloss files as input and produces XML files based on the PRAXICON Concept schema. The module, which is implemented in Java and makes use of the MIT Java WordNet Interface, traverses the WordNet entries and performs the following challenging, but essential, tasks: (a) it distinguishes between literal and figurative senses of synsets, and (b) it clusters synsets that have the same reference. Thus, we turn WordNet from a sense-based lexical resource into a reference-based one, for the needs of the PRAXICON.

For example, WordNet considers "knife" as a cutting instrument a different concept from "knife" as a weapon, and a different concept from "knife" as any long thin projection that is transient (the tongue of flame). However, in the first two cases, it is the same reference object that is denoted; the different senses reflect different uses. For such cases, we have developed a mechanism that merges the synsets into the same PRAXICON concept (cf. Figure 4). The third sense is a figurative one: the word is used metaphorically to denote a different concept (cf. Figure 5). Our module distinguishes such figurative senses from the literal ones.

Fig. 4. Literal sense of knife

Fig. 5. Figurative sense of knife

4 COSMOROE2PRAXICON

This is a rule-based Perl module that takes as input a COSMOROE XML annotation file; its XML output includes information per concept extracted from the corpus. Visual object representations, visual action representations (video segments) and textual representations, as well as relations with other concepts, are extracted. The main tasks of the module are (a) concept-type categorization: the module classifies the annotation elements into movements, entities, features and abstract concepts; and (b) clustering of information per unique concept: each concept has a unique id that consists of the lemma of the word or visual label that expresses its LR, its concept type and, whenever applicable, a further clarification of the lemma sense. Figure 6 provides an example of information extracted by the module for the PRAXICON.

Fig. 6. Automatic extraction of information from the annotated COSMOROE corpus

5 Cognitive2PRAXICON

Another source of information for the PRAXICON is the POETICON cognitive experiments. We have carried out experiments that employ the think-aloud protocol to elicit verbal descriptions of everyday objects and actions, using lithic tools as a stimulus for the description. The corresponding video recordings capture verbal reports of 120 participants and amount to approximately 110 hours. The verbal reports have been transcribed and annotated in terms of the semantic type denoted by specific words in the reports. Cognitive2PRAXICON is a rule-based Perl module that runs over the semantically annotated verbal reports and (a) extracts unique concepts with an automatically attributed concept type, (b) infers a wealth of implied concepts from what is literally expressed in the verbal reports, and (c) extracts and infers a wealth of conceptual relations that are expressed in the reports in the form of verbal justifications, conditionals, analogies and clarifications. Figure 7 provides an example of information inferred from the reports.
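Both rule-based modules reduce annotated material to unique concepts. The Python sketch below illustrates the clustering step described for COSMOROE2PRAXICON in Section 4: an identifier is built from the lemma, the concept type and an optional sense clarification, and all information sharing that identifier is merged. It is only an illustration under stated assumptions and not the Perl module itself: the annotation field names, the "#" separator for the clarification part, the relation tuples and the sample records are hypothetical.

```python
from collections import defaultdict


def concept_id(lemma, concept_type, clarification=None):
    """Compose an id from lemma, concept type and an optional sense clarification."""
    parts = [lemma.strip().lower(), concept_type]
    if clarification:
        parts.append(clarification)
    return "#".join(parts)  # the "#" separator for the clarification is an assumption


def cluster_by_concept(annotations):
    """Merge all annotation records that refer to the same unique concept."""
    clusters = defaultdict(lambda: {"language": set(), "visual": set(), "relations": set()})
    for ann in annotations:
        cid = concept_id(ann["lemma"], ann["type"], ann.get("clarification"))
        entry = clusters[cid]
        if ann.get("word"):
            entry["language"].add(ann["word"])       # textual representation
        if ann.get("segment"):
            entry["visual"].add(ann["segment"])      # image or video segment id
        if ann.get("relation"):
            entry["relations"].add(ann["relation"])  # (label, target concept id)
    return dict(clusters)


# Hypothetical annotation records, as they might look after parsing an XML file.
ANNOTATIONS = [
    {"lemma": "knife", "type": "entity", "word": "knife", "segment": "video_012"},
    {"lemma": "Knife", "type": "entity", "word": "knives", "segment": "video_047"},
    {"lemma": "cut", "type": "movement", "clarification": "with_knife",
     "segment": "video_012", "relation": ("instrument", "knife#entity")},
    {"lemma": "cut", "type": "abstract", "word": "cut"},
]

clusters = cluster_by_concept(ANNOTATIONS)
for cid in sorted(clusters):
    print(cid, sorted(clusters[cid]["language"]), sorted(clusters[cid]["visual"]))
```

The two "knife" records collapse into a single knife#entity concept that carries both word forms and both video segments, while the two "cut" records remain separate concepts because their concept types differ.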

Fig. 7. Extraction and inference of information from the POETICON Cognitive Experiment Verbal Reports

Acknowledgements

The research reported in the paper is funded by the European Commission in the frame of the POETICON project (Grant: FP7-ICT-215843). We thank all POETICON partners for their feedback on developing the PRAXICON. Special thanks also go to our colleague Argyro Vatakis, who designed and ran the POETICON cognitive experiments, and to the team of annotators of the COSMOROE and the Cognitive data corpora.

References

1. Pastra, K.: Praxicon: the development of a grounding resource. In: Proceedings of the 4th Bellagio International Workshop on Human-Computer Conversation. (2008)
2. Miller, G., Fellbaum, C.: Wordnet then and now. Language Resources and Evaluation 41 (2007) 209-214
3. Pastra, K.: Cosmoroe: A cross-media relations framework for multimedia dialectics. Multimedia Systems 14(5) (2008) 299-323
4. Pastra, K., Balta, E.: A text-based search interface for multimedia dialectics. In: Proceedings of the European Conference in Computational Linguistics. (2009)