Lecture Notes in Artificial Intelligence

Lecture Notes in Artificial Intelligence 1299
Subseries of Lecture Notes in Computer Science
Edited by J. G. Carbonell and J. Siekmann

Lecture Notes in Computer Science
Edited by G. Goos, J. Hartmanis and J. van Leeuwen

Maria Teresa Pazienza (Ed.)

Information Extraction
A Multidisciplinary Approach to an Emerging Information Technology

International Summer School, SCIE-97
Frascati, Italy, July 14-18, 1997

Springer

Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editor
Maria Teresa Pazienza
Università degli Studi di Roma, Tor Vergata
Dipartimento di Informatica Sistemi e Produzione
Via della Ricerca Scientifica, I-00133 Roma, Italy
E-mail: pazienza@info.utovrm.it

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Information extraction : a multidisciplinary approach to an emerging information technology ; international summer school / SCIE-97, Frascati, Italy, July 14-18, 1997. Maria Teresa Pazienza (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Santa Clara ; Singapore ; Tokyo : Springer, 1997
(Lecture notes in computer science ; Vol. 1299 : Lecture notes in artificial intelligence)
ISBN 3-540-63438-X

CR Subject Classification (1991): I.2, H.3

ISBN 3-540-63438-X Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg 1997
Printed in Germany

Typesetting: Camera ready by author
SPIN 10546325 06/3142-5 4 3 2 1 0
Printed on acid-free paper

Preface

This book contains all the presentations given at the International Summer School on Information Extraction, SCIE '97, Frascati (Rome), Italy, July 14-18, 1997. The topics covered span various scientific areas. In fact, although Information Extraction is mainly based on language processing ability, contributions derive from several disciplines. This motivated the wide range of topics discussed in this book.

As the ability to access different kinds of information via the Internet is increasingly involving end-users with different skills, the demand for suitable tools for Information Extraction (IE), organization, and integration is becoming more and more pressing to filter relevance and sort the large number of retrieved documents. IE is very often compared with the more mature methodology of Information Retrieval (IR), which is currently attempting to use natural language processing (NLP) based technologies to improve the recall and precision of retrieved documents in response to users' queries. But information extraction is not information retrieval: Information Retrieval aims to select relevant documents, whereas Information Extraction aims to extract facts from the documents. This activity is not easy, as there are many ways to express the same fact. Typically information is spread across several sentences in natural language.

The ability to process textual documents is essential to IE. Therefore it should be considered a core language technology. In an NLP system the lexicon conveys linguistic information relative to words. The lexicon, then, may be considered a comprehensive knowledge repository where both representation and content support many deductive processes. In many applications lexical information relates surface forms (words or word sequences) more directly to pattern matching-based procedures having some linguistic flavors but dependent on the underlying tasks: text mark-up, entity detection and classification, or template filling.
In such a framework the amount of linguistic information involved (proper noun rules or grammatical relations related to subcategorization information) is very high. The dynamic nature of language, as well as its relationship with the underlying knowledge domain, creates difficult problems for NLP systems.

In the IR context, terminology identification tends to be an important supporting technology. In the NLP context the problem of terminology identification is more general, as it is essential to find all terms which are characteristic of the document. Hence they are representative of its content and/or domain. The identification of terms is strongly tied to discourse processing, seeking to select those terms which are highly representative and capable of providing broad characterization of the document content.

From a "global information" perspective, most of the lexical knowledge employed in a given application is at first unusable for newer tasks in newer domains. The crucial characteristic of any information is what it is about, i.e., the entities it refers to. It is this referential meaning that needs to be made explicit and organized, in order to extract and reuse relevant information. It is easy to see how ontological aspects play a fundamental role here. As in IE, a template is to be 'filled' by information conveyed by a natural language statement. The problem is to make explicit the intended models of the world used to convey information. Thus, the ontological assumptions implicit in the terms adopted for concepts, relations, and attributes are to be made explicit.

Moreover, if a system exists and performs an extraction task in a language, it is not obvious how to perform the same task on texts in different languages. It is important to separate the task-specific conceptual knowledge the system uses, which may be assumed to be language independent, from the language-dependent lexical knowledge the system requires, which unavoidably must be extended for each new language.

It is also desirable to determine how well an IE system is performing a given task through the definition of objective measures. It is common opinion that even systems with a modest performance (which miss more events and include some errors) may be of interest. They may be useful to extract information for future manual verification. Incomplete extracted information is in fact better than nothing!

The international summer school on information extraction SCIE '97 was organized to stress all these different aspects of information extraction.

SCIE '97 was sponsored by:

AI*IA Italian Association for Artificial Intelligence
CNR National Research Council
EC European Community
ENEA National Institution for Alternative Forms of Energy
ESA European Space Agency
UTV University of Rome, Tor Vergata
FUB Ugo Bordoni Research Foundation

Thanks to all the people and institutions who contributed to the organization of this summer school. Special thanks to the staff of the ESRIN (Frascati) site of the European Space Agency for their valuable support in hosting the school.
I would like to express my personal gratitude to all colleagues of the NLP Group of the University of Rome, Tor Vergata, whose extraordinary effort over these past months made SCIE '97 possible. Thanks a lot.

Rome, July 1997
Maria Teresa Pazienza

SCIE-97 Organizing Committees

General Chairperson:
Maria Teresa Pazienza, University of Roma Tor Vergata (Italy)

Program Committee:
Maristella Agosti, University of Padua (Italy)
Paolo Atzeni, University of Rome III (Italy)
Luigia Carlucci Aiello, University of Rome La Sapienza (Italy)
Floriana Esposito, University of Bari (Italy)

Organizing Committee (University of Rome Tor Vergata - Italy):
Roberto Basili, Massimo Di Nanni, Giovanni Pedani, Michele Vindigni.

Table of Contents

Information Extraction as a Core Language Technology
Y. Wilks

Information Extraction: Techniques and Challenges
R. Grishman

Concepticons vs. Lexicons: An Architecture for Multilingual Information Extraction
R. Gaizauskas, K. Humphreys, S. Azzam, Y. Wilks

Lexical Acquisition for Information Extraction
R. Basili, M.T. Pazienza

Technical Terminology for Domain Specification and Content Characterisation
B. Boguraev, C. Kennedy

Short Query Linguistic Expansion Techniques: Palliating One-Word Queries by Providing Intermediate Structures to Text
G. Grefenstette

Information Retrieval: Still Butting Heads with Natural Language Processing?
A. F. Smeaton

Semantic Matching: Formal Ontological Distinctions for Information Organization, Extraction, and Integration
N. Guarino

Machine Learning for Information Extraction
F. Neri, L. Saitta

Modeling and Querying Semi-structured Data
S. Cluet