AN INVESTIGATION OF THE BENCHMARK. Khosrow Kaikhah. Department of Computer Science. Southwest Texas State University. San Marcos, Texas 78666

AN INVESTIGATION OF THE BENCHMARK EVALUATION TOOL

Khosrow Kaikhah
Department of Computer Science
Southwest Texas State University
San Marcos, Texas 78666

Final Report for:
Summer Research Program
Rome Laboratory

Sponsored by:
Air Force Office of Scientific Research
Bolling Air Force Base, Washington, D.C.

August 1992

ABSTRACT

Recently, natural language processing (NLP) has received tremendous support and popularity. As a consequence, the number of NLP systems has increased dramatically, and the need for a systematic procedure for evaluating such systems seems inevitable. Until recently, there has not been a universal procedure for evaluating all types of NLP systems. Evaluations of such systems are usually conducted during the implementation phase and, in most cases, do not involve a comprehensive plan or independent evaluators. Developers of NLP systems can benefit from an unbiased evaluation procedure which measures their efforts and the power of their systems. At the same time, consumers of NLP systems can benefit greatly from an evaluation tool which assists with the selection of the appropriate system for their needs.

The Calspan Corporation has proposed and implemented the Benchmark Evaluation Tool for evaluating all natural language processing systems, regardless of type or application. The study was sponsored by the Rome Laboratory and was concluded in May of 1992. The Benchmark Evaluation Tool is designed to be domain independent. It therefore concentrates on linguistic issues rather than on the application domain. This feature is unique in that the tool is sensitive to each individual linguistic capability and not to each individual application. The tool is composed of twelve independent sections which are designed to progressively test different linguistic features of NLP systems. It also includes definitions and explanations for each section, as well as a five-choice scoring strategy to measure the responses.

Our objective is to investigate the effectiveness of the Benchmark Evaluation Tool by applying it to a natural language processing system. This particular system is composed of two major parts: a domain-independent part which has general knowledge of syntactic rules, and a domain-specific part which provides the necessary semantic and pragmatic knowledge for a specified domain. The application domain accompanying the NLP system for testing purposes is an interface to a relational database of air travel planning information.

1. Introduction

Although natural language processing has been on the minds of researchers since the early days of digital computers, it has never enjoyed such tremendous popularity and support as it has received over the past two decades. As the number of natural language processing systems has increased, so has the need for a systematic evaluation procedure for testing NLP systems. Both producers and consumers of NLP systems can benefit from a well-defined evaluation procedure. It can help producers conduct an unbiased evaluation of their systems, and it can help consumers choose the appropriate system for their needs. The evaluation procedure should not be defined for a particular system, but rather as a blueprint for testing the linguistic features of NLP systems.

Until recently, evaluation procedures have been implemented and administered by the developers of NLP systems. As a result, evaluations tend to be biased and to follow known success patterns. These patterns may not be deliberate, but they are nevertheless the result of the developers being so closely involved with the development. Therefore, a number of NLP researchers and consumers have expressed the need for an unbiased and independent evaluation procedure. One should keep in mind that a universal evaluation procedure which can be applied to all NLP systems may be too ambitious. However, a foundation for evaluating systems can be laid out to guide producers, consumers, and independent evaluators through the evaluation.

The Benchmark Investigation/Identification Program, sponsored by the Rome Laboratory, developed an evaluation tool and application procedure for evaluating natural language processing systems. The project lasted eighteen months and was completed in May of 1992. It produced an evaluation procedure consisting of twelve sections. Each section is designed to test a different linguistic capability of NLP systems and provides brief explanations and definitions of the linguistic feature being tested, patterns that define the structure of the test sentence, example sentences, and criteria against which to evaluate the behavior of the NLP system. Each test sentence is then scored according to the system's level of comprehension. The score can range from Success (S) to Partially Correct (P) to No Output (N). For more details, see [1].
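To make the contents of a section concrete, the following sketch models a section as a record holding an explanation and a set of sentence patterns with their evaluation criteria, and fills one pattern with domain vocabulary to obtain a test sentence. The class names, the brace-based pattern notation, the item identifier, and the sample vocabulary are all assumptions made for this illustration; the tool described in [1] is a written procedure and does not prescribe any of them.

```python
from dataclasses import dataclass, field


@dataclass
class SentencePattern:
    """One sentence pattern within a section (hypothetical representation)."""
    item_id: str    # placeholder identifier, not taken from the report
    template: str   # brace notation is an assumption of this sketch
    criterion: str  # what the evaluator checks in the system's response


@dataclass
class Section:
    """One section of the evaluation procedure, reduced to its main parts."""
    number: str
    title: str
    explanation: str
    patterns: list = field(default_factory=list)


def tailor(pattern, vocabulary):
    """Fill a pattern with domain words chosen by the evaluator."""
    return pattern.template.format(**vocabulary)


interrogatives = Section(
    number="II",
    title="Interrogative Sentences",
    explanation="Questions that request information from the system.",
    patterns=[
        SentencePattern(
            item_id="II-x",
            template="Which {plural_noun} {verb} {place}?",
            criterion="The system identifies the questioned entity.",
        )
    ],
)

# Vocabulary for an air travel domain, chosen by the evaluator.
air_travel = {"plural_noun": "flights", "verb": "leave", "place": "Boston"}
sentence = tailor(interrogatives.patterns[0], air_travel)
print(sentence)  # Which flights leave Boston?
```

Keeping the pattern separate from the vocabulary mirrors the domain independence of the tool: the same pattern can be tailored to another application by swapping the vocabulary.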

Most applications of NLP systems involve interactive human-computer interfaces, which include: a) Data Base Management Systems, b) Command and Control Systems, c) Decision-Aiding Systems, d) Engineering Design Systems, and e) Diagnostic Systems. The natural language processing system used for this investigation is equipped with an interface to a relational database. The system can respond to questions about ground transportation, fares, and flights for the cities of Atlanta, Boston, Baltimore, Denver, Dallas, Fort Worth, Pittsburgh, Philadelphia, Oakland, San Francisco, and Washington D.C. The NLP system analyzes English sentences with three independent modules (syntactic, semantic, and pragmatic) in order to transform the sentences into application calls. Twenty-four different switches control the behavior of the NLP system. By setting the appropriate switches, the system can be prompted to learn unknown grammatical structures and words. This process, however, requires a knowledgeable linguistic trainer, since the NLP system expects meaningful linguistic feedback during training. The parse tree, as well as the semantic and pragmatic analyses of sentences, can also be examined by setting the appropriate switches.

The goal of this investigation has been to determine the feasibility and usefulness of a universal evaluation procedure, namely the Benchmark Evaluation Tool. The Benchmark Evaluation Tool is designed to be applicable to all types of NLP systems; it can therefore be considered a universal evaluation tool. We have applied the Benchmark Evaluation Tool to an NLP system, and the comprehensive results are included in section 4. The following sections briefly describe the Benchmark Evaluation Tool and the NLP system, respectively.

2. The Benchmark Evaluation Tool

In May of 1992, a Rome Laboratory sponsored project, the Benchmark Investigation/Identification Program, was completed by Calspan Advanced Technology Center and its subcontractor, Language Systems Incorporated. The goal of the project was to develop a standard evaluation tool which is domain-independent and which can be applied to all NLP systems, regardless of their types, without any need for modifying or porting the NLP system to a test domain. For more details, see [1].

There are several areas in which NLP systems can be evaluated. They include: a) linguistic competence, b) end user issues such as reliability and likeability, c) system development issues such as maintainability and portability, and d) intelligent behavior issues such as learning and cooperative dialogue. The Benchmark Evaluation Tool focuses on the linguistic competence of NLP systems, including lexical, syntactic, semantic, and discourse capabilities. It consists of twelve sections, each of which tests a different feature of NLP systems. They are:

I) Basic Sentences
II) Interrogative Sentences
III) Noun Phrases
IV) Adverbials
V) Verbs and Verb Phrases
VI) Quantifiers
VII) Comparatives
VIII) Connectives
IX) Embedded Sentences
X) Reference
XI) Ellipsis
XII) Semantics of Events

The Evaluation Tool is designed for evaluators without a background in linguistics; it therefore provides instructions and explanatory materials for each section. These materials are provided to assist the evaluators with the creation or tailoring of test sentences and do not include a set of predefined natural language test sentences. The testing is conducted in a progressive manner, from elementary sentence types to more complex sentence types. This strategy allows the evaluator to concentrate on a single linguistic feature in each test sentence. If the NLP system fails on a certain linguistic feature, the evaluator is advised not to include the feature in subsequent test sentences.

The scoring is done according to the following criteria [1]:

Success (S): The system successfully met the evaluation criteria stated for the particular test item.
Correct (C): The system did not successfully meet the evaluation criteria, but produced acceptable/correct output.
Partially Correct (P): The system did not successfully meet the evaluation criteria, and only produced partially acceptable/correct output.
Failure (F): The system did not successfully meet the evaluation criteria and produced no correct output.
No Output (N): The system produced no output.

In short, the Benchmark Evaluation Tool is a procedure that a) produces profiles of NLP systems which are descriptive, hierarchically organized, quantitative, and objective, b) is usable across domains and applications, c) is usable across the different types of NLP systems, and d) is unbiased with respect to linguistic theories and does not require an evaluator who is a trained linguist. In fact, the Evaluation Tool is unique in two features [1]:

- The profiling facility
- Its usability and applicability across domains and applications
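The five-way scale and the progressive testing rule can be sketched in a few lines. This is a minimal illustration, not the procedure of [1]: the function name, the feature labels, and the item identifiers are hypothetical, the sketch assumes that a score of F or N counts as failing the feature an item introduces (taken to be the last feature listed for it), and it simply skips later items that rely on a failed feature, whereas a human evaluator would normally rewrite them to avoid that feature.

```python
from collections import Counter
from enum import Enum


class Score(Enum):
    SUCCESS = "S"
    CORRECT = "C"
    PARTIALLY_CORRECT = "P"
    FAILURE = "F"
    NO_OUTPUT = "N"


def run_progressively(items, judge):
    """Score items in order, skipping any that rely on a failed feature.

    items: list of (item_id, section, features) tuples, ordered from
           elementary to complex sentence types.
    judge: callable returning the evaluator's Score for an item_id.
    """
    failed_features = set()
    profile = {}   # section -> Counter of score letters
    skipped = []
    for item_id, section, features in items:
        if failed_features & set(features):
            skipped.append(item_id)
            continue
        score = judge(item_id)
        profile.setdefault(section, Counter())[score.value] += 1
        if score in (Score.FAILURE, Score.NO_OUTPUT):
            # Assumption: the last listed feature is the one under test.
            failed_features.add(features[-1])
    return profile, skipped


# Illustrative items and verdicts; none of these come from the report.
items = [
    ("I-a", "I", ["declarative"]),
    ("II-a", "II", ["declarative", "wh-question"]),
    ("VI-a", "VI", ["declarative", "quantifier"]),
    ("VII-a", "VII", ["declarative", "quantifier", "comparative"]),
]
verdicts = {"I-a": Score.SUCCESS, "II-a": Score.CORRECT, "VI-a": Score.FAILURE}
profile, skipped = run_progressively(items, verdicts.__getitem__)
print(dict(profile["VI"]), skipped)  # {'F': 1} ['VII-a']
```

The per-section counters give a simple quantitative profile of the system, organized by section, while the skipped list records which items could not be tested in isolation.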

3. The NLP System

The NLP system used in our investigation consists of two major components: a domain-independent (core) component and a domain-specific (application) component. The domain-independent routines, which include the procedural components for syntactic, semantic, and pragmatic analysis as well as large portions of the grammar and lexicon, do not change during porting. However, the domain-specific routines, which include the specialized lexicon, semantic rules, knowledge base, and application-specific routines, must be re-implemented to accommodate a new application.

Three distinct modules (syntactic, semantic, and pragmatic) analyze and process the input sentences independently. The syntactic module requires knowledge of the lexicon and grammar rules; the semantic module requires semantic rules; and the pragmatic module requires domain knowledge. The NLP system contains twenty-four switches which control its behavior. The behavior of each module can be observed by setting the appropriate switches. In certain modes, however, the switches only control what appears on the screen and not the processing that goes on in the background.

Input to the system can come from external files or from the keyboard. The output can take several forms, depending on the configuration of the switches. The syntactic analysis of a sentence can produce two types of output: a detailed surface structure parse tree, and an operator-argument representation called the Intermediate Syntactic Representation (ISR). The ISR is a simplified version of the parse tree: it provides a single canonical form for a number of different surface structures and omits detail that the semantic analyzer does not require. The semantic and pragmatic modules use the ISR as input and produce the Integrated Discourse Representation (IDR).
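The flow from sentence to parse tree, ISR, and IDR, together with display-only switches, can be pictured with a toy stand-in. Everything in this sketch is hypothetical: the switch names, the function names, the three-word "subject verb object" grammar, and the dictionary used for the IDR are invented for illustration and say nothing about how the actual system is implemented.

```python
from dataclasses import dataclass


@dataclass
class Switches:
    """Display switches (hypothetical names; the real system has twenty-four)."""
    show_parse_tree: bool = False
    show_isr: bool = False
    show_idr: bool = False


def syntactic(sentence):
    """Toy parser for 'subject verb object' strings: returns (parse_tree, isr)."""
    subject, verb, obj = sentence.rstrip(".").split()
    parse_tree = ("S", ("NP", subject), ("VP", ("V", verb), ("NP", obj)))
    isr = (verb, subject, obj)  # operator first, then its arguments
    return parse_tree, isr


def semantic_pragmatic(isr, discourse):
    """Toy stand-in for the semantic and pragmatic modules: ISR -> IDR."""
    operator, *arguments = isr
    idr = {"situation": operator, "participants": arguments}
    discourse.append(idr)  # representations accumulate over the discourse
    return idr


def analyze(sentence, switches, discourse):
    """Run every stage; the switches affect only what is shown."""
    parse_tree, isr = syntactic(sentence)
    idr = semantic_pragmatic(isr, discourse)
    shown = {}
    if switches.show_parse_tree:
        shown["parse_tree"] = parse_tree
    if switches.show_isr:
        shown["isr"] = isr
    if switches.show_idr:
        shown["idr"] = idr
    return shown


discourse = []
shown = analyze("Delta serves Boston.", Switches(show_isr=True), discourse)
print(shown)  # {'isr': ('serves', 'Delta', 'Boston')}
```

Note that the discourse list is updated even though only the ISR is displayed, which is the point of the display-only mode: all stages run regardless of what the switches reveal.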
IDRs are application-neutral representations of the meaning of the sentences in the current discourse. They contain the situations described in the input sentences, the entities referred to, and the way the entities participate in the situations.

4. Applying the Benchmark Evaluation Tool to the NLP System

The application domain which accompanied the NLP system for this investigation is an interface to a relational database of air travel planning information. Test sentences were scored according to the syntactic, semantic, and pragmatic processing and comprehension of each sentence. Generally, the sentences which failed to produce an IDR (i.e. failed the analysis of the

5. Conclusions

Although the idea of testing the individual linguistic capabilities of NLP systems, rather than the sensitivity of the systems to individual applications, is extremely attractive, it has nevertheless proved to be an ambitious task. Most NLP systems are designed for well-defined domains and applications. Therefore, a general purpose evaluation tool may not be suitable for all types of NLP systems. This was evident from our investigation. The NLP system in our investigation has an extremely narrow application domain: it responds fairly well to sentences that satisfy its requirements, but sentences that do not are not analyzed completely.

Each type of NLP system possesses certain attributes that are unique. Each type has strengths and weaknesses which are directly associated with the goals and objectives of the system. Therefore, the evaluation procedure should be more sensitive to the type of NLP system being evaluated. For instance, if the NLP system is a Data Base Management System, the evaluation tool must place more emphasis on interrogative and basic sentences than on quantifiers and ellipsis, since the system is, after all, not designed to respond to ellipsis or quantifiers.

Some of the grammar patterns suggested by the Benchmark Evaluation Tool are not used in everyday conversation, although they are perfectly correct. For instance, 'List the flights which are more expensive than the Boston to Atlanta flight is expensive' (VII-1.1) is grammatically correct, but the final part of the sentence, 'is expensive', is normally omitted and is implied by the first part. This may cause confusion among some evaluators. In addition, scoring may also cause confusion, since the boundaries between the suggested scores are not well defined and are subjective. Hence, two independent evaluators may score a single NLP system completely differently.

Although not all suggested sentence patterns were applicable, they nevertheless helped to define the boundaries of the NLP system. In many instances, such as VII-3.1, VII-3.2, VII-9.1, and VII-9.2, the clash between the wide scope of the evaluation tool and the narrow application domain of the NLP system was clearly evident. In defense of the NLP system, it must be noted that no NLP system can successfully satisfy all the rigorous requirements of the Benchmark Evaluation Tool. The Benchmark Evaluation Tool proved to be extremely helpful in providing guidance and structure for evaluating the NLP system. It should therefore be used as a guide for selecting the appropriate testing procedures for individual types of systems, rather than as a general purpose evaluation procedure that can be applied to all NLP systems...
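One way to picture an evaluation that is more sensitive to the type of system is to weight the twelve sections differently for each type. The sketch below is only an illustration of that idea under loud assumptions: the Benchmark Evaluation Tool defines no numeric values for its scores, so the point values, the weights for a database interface, and the sample score letters are all invented here and are not results from this investigation.

```python
# Assumed numeric value for each score letter (not defined by the tool).
POINTS = {"S": 1.0, "C": 0.75, "P": 0.5, "F": 0.0, "N": 0.0}

# Hypothetical weights for a database interface: emphasize basic (I) and
# interrogative (II) sentences, de-emphasize quantifiers (VI) and ellipsis (XI).
DATABASE_INTERFACE_WEIGHTS = {"I": 3.0, "II": 3.0, "VI": 0.5, "XI": 0.5}


def weighted_score(section_scores, weights, default_weight=1.0):
    """Weighted mean of per-section mean scores, in the range 0.0 to 1.0."""
    total = 0.0
    weight_sum = 0.0
    for section, letters in section_scores.items():
        if not letters:
            continue  # a section with no scored items carries no weight
        mean = sum(POINTS[letter] for letter in letters) / len(letters)
        weight = weights.get(section, default_weight)
        total += weight * mean
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Made-up score letters for four sections, for illustration only.
results = {"I": ["S", "S"], "II": ["S", "C"], "VI": ["F", "N"], "XI": ["F"]}

uniform = weighted_score(results, {})
weighted = weighted_score(results, DATABASE_INTERFACE_WEIGHTS)
print(f"{uniform:.2f} {weighted:.2f}")  # 0.47 0.80
```

With uniform weights, the failures on quantifiers and ellipsis pull the overall figure down sharply; with weights that reflect what a database interface is meant to do, the same responses yield a figure closer to the system's performance on its intended sentence types.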

Strengths of the Benchmark Evaluation Tool:

- Comprehensive
- Contains detailed explanations
- Independent of NLP systems and their application domain...
- Defines the boundaries of NLP systems

Weaknesses of the Benchmark Evaluation Tool:

- Time consuming
- Scope of the evaluation is too wide
- Some suggested patterns seem unusual and are not used in everyday conversation...
- Scoring is not well defined

In conclusion, the evaluation process proved to be extremely time consuming. It is conceivable that in the near future the evaluation process could be fully automated. However, in order for an automated evaluator to be successful, the evaluation should be performed in a narrower space with well-defined boundaries. Therefore, there should be several different automated evaluators, each specialized for a different type of NLP system. Each automated evaluator would have syntactic, semantic, and pragmatic knowledge of only one type of NLP system and would generate appropriate test sentences. The set of automated evaluators would form a complete collection of tools for evaluating all types of NLP systems. The Benchmark Evaluation Tool will be extremely instrumental in developing the automated evaluators.

6. References

[1] Benchmark Investigation/Identification Program, Volume I: Final Report, Calspan Advanced Technology Center, P.O. Box 400, Buffalo, NY 14225, May 1992.