AN INVESTIGATION OF THE BENCHMARK EVALUATION TOOL Khosrow Kaikhah Department of Computer Science Southwest Texas State University San Marcos, Texas 78666 Final Report for: Summer Research Program Rome Laboratory Sponsored by: Air Force Office of Scientific Research Bolling Air Force Base, Washington, D.C. August 1992
ABSTRACT

Recently, natural language processing (NLP) has received tremendous support and popularity. As a consequence, the number of NLP systems has increased dramatically, and the need for a systematic procedure for evaluating such systems seems inevitable. Until recently, there has not been a universal procedure for evaluating all types of NLP systems. Evaluations of such systems are usually conducted during the implementation phase and, in most cases, do not involve a comprehensive plan or independent evaluators. Developers of NLP systems can benefit from an unbiased evaluation procedure which measures their efforts and the power of their systems. At the same time, consumers of NLP systems can benefit greatly from an evaluation tool which assists with the selection of the appropriate system for their needs.

The Calspan Corporation has proposed and implemented the Benchmark Evaluation Tool for evaluating all natural language processing systems, regardless of type or application. The study was sponsored by the Rome Laboratory and was concluded in May of 1992. The Benchmark Evaluation Tool is designed to be domain independent; it therefore concentrates on linguistic issues rather than on the application domain. This feature is unique in that the tool is sensitive to individual linguistic capabilities rather than to individual applications. The tool is composed of twelve independent sections which are designed to progressively test different linguistic features of NLP systems. It also includes definitions and explanations for each section, as well as a five-choice scoring strategy to measure the responses.

Our objective is to investigate the effectiveness of the Benchmark Evaluation Tool by applying the tool to a natural language processing system.
This particular system is composed of two major parts: a domain-independent part which has general knowledge of syntactic rules, and a domain-specific part which provides the necessary semantic and pragmatic knowledge for a specified domain. The application domain accompanying the NLP system for testing purposes is an interface to a relational database of air travel planning information.
1. Introduction

Although natural language processing has been on the minds of researchers since the earliest days of digital computers, it has never enjoyed such tremendous popularity and support as it has received over the past two decades. As the number of natural language processing systems has increased, so has the need for a systematic procedure for testing NLP systems. Both producers and consumers of NLP systems can benefit from a well-defined evaluation procedure. It can help producers conduct an unbiased evaluation of their systems, and it can help consumers choose the appropriate system for their needs. The evaluation procedure should not be defined for a particular system, but rather as a blueprint for testing the linguistic features of NLP systems.

Until recently, evaluation procedures have been implemented and administered by the developers of NLP systems. As a result, evaluations tend to be biased and follow known success patterns. These patterns may not be deliberate; nevertheless, they are the result of being so closely involved with the development. Therefore, a number of NLP researchers and consumers have expressed the need for an unbiased and independent evaluation procedure. One should keep in mind that a universal evaluation procedure which can be applied to all NLP systems may be too ambitious. However, a foundation for evaluating systems can be laid out to guide producers, consumers, and independent evaluators through the evaluation.

The Benchmark Investigation/Identification Program, sponsored by the Rome Laboratory, developed an evaluation tool and application procedure for evaluating natural language processing systems. The project lasted eighteen months and was completed in May of 1992. It produced an evaluation procedure consisting of twelve sections.
Each section is designed to test a different linguistic capability of NLP systems and provides brief explanations and definitions of the linguistic feature being tested, patterns that define the structure of the test sentence, example sentences, and criteria against which to evaluate the behavior of the NLP system. Each test sentence is then scored according to the system's level of comprehension. Scores range from Success (S) through Partially Correct (P) to No Output (N). For more details, see [1].
Most applications of NLP systems involve interactive human-computer interfaces, which include: a) Data Base Management Systems, b) Command and Control Systems, c) Decision-Aiding Systems, d) Engineering Design Systems, and e) Diagnostic Systems. The natural language processing system used for this investigation is equipped with an interface to a relational database. The system can respond to questions about ground transportation, fares, and flights for the cities of Atlanta, Boston, Baltimore, Denver, Dallas, Fort Worth, Pittsburgh, Philadelphia, Oakland, San Francisco, and Washington D.C.

The NLP system analyzes English sentences with three independent modules (syntactic, semantic, and pragmatic) in order to transform the sentences into application calls. Twenty-four switches control the behavior of the NLP system. By setting the appropriate switches, the system can be prompted to learn unknown grammatical structures and words. This process, however, requires a knowledgeable linguistic trainer, since the NLP system expects meaningful linguistic feedback during training. The parse tree, as well as the semantic and pragmatic analyses of sentences, can also be examined by setting the appropriate switches.

The goal of this investigation has been to determine the feasibility and usefulness of a universal evaluation procedure, namely the Benchmark Evaluation Tool. Because the Benchmark Evaluation Tool is designed to be applicable to all types of NLP systems, it can be considered a universal evaluation tool. We have applied the Benchmark Evaluation Tool to an NLP system, and the comprehensive results are included in Section 4. The following sections briefly describe the Benchmark Evaluation Tool and the NLP system, respectively.

2. The Benchmark Evaluation Tool

In May of 1992, the Benchmark Investigation/Identification Program, a project sponsored by the Rome Laboratory, was completed by Calspan Advanced Technology Center and its subcontractor, Language Systems Incorporated. The goal of the project was to develop a standard evaluation tool which is domain-independent and which can be applied to all NLP systems, regardless of their type, without any need for modifying the NLP system or porting it to a test domain. For more details, see [1].

There are several areas in which NLP systems can be evaluated. They include: a) linguistic competence, b) end user issues such as reliability and likeability, c) system development issues such as maintainability and portability, and d) intelligent behavior issues such as learning and cooperative dialogue. The Benchmark Evaluation Tool focuses on the linguistic competence of NLP systems, including lexical, syntactic, semantic, and discourse capabilities. It consists of twelve
sections, each of which tests a different feature of NLP systems. They are: I) Basic Sentences, II) Interrogative Sentences, III) Noun Phrases, IV) Adverbials, V) Verbs and Verb Phrases, VI) Quantifiers, VII) Comparatives, VIII) Connectives, IX) Embedded Sentences, X) Reference, XI) Ellipsis, and XII) Semantics of Events.

The Evaluation Tool is designed for people without a background in linguistics; it therefore provides instructions and explanatory materials for each section. These materials are provided to assist evaluators with creating or tailoring test sentences; they do not include a set of predefined natural language test sentences. The testing is conducted in a progressive manner, from elementary sentence types to more complex sentence types. This strategy allows the evaluator to concentrate on a single linguistic feature in each test sentence. If the NLP system fails on a certain linguistic feature, the evaluator is advised not to include that feature in subsequent test sentences.

The scoring is done according to the following criteria [1]:

Success (S): The system successfully met the evaluation criteria stated for the particular test item.
Correct (C): The system did not successfully meet the evaluation criteria, but produced acceptable/correct output.
Partially Correct (P): The system did not successfully meet the evaluation criteria, and only produced partially acceptable/correct output.
Failure (F): The system did not successfully meet the evaluation criteria and produced no correct output.
No Output (N): The system produced no output.

In short, the Benchmark Evaluation Tool is a procedure that a) produces profiles of NLP systems which are descriptive, hierarchically organized, quantitative, and objective, b) is usable across domains and applications, c) is usable across the different types of NLP systems, and d) is unbiased with respect to linguistic theories and does not require an evaluator who is a trained linguist.
In fact, the Evaluation Tool is unique in two features [1]: a) the profiling facility, and b) its usability and applicability across domains and applications.
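
The five-choice scoring scale and the idea of a per-section profile can be illustrated with a short sketch. This is only an illustration under stated assumptions: the names `Score`, `build_profile`, and `success_rate` are hypothetical, the example scores are invented, and the report [1] is not known to prescribe this particular way of aggregating scores.

```python
from collections import Counter
from enum import Enum


class Score(Enum):
    """The five score categories quoted from [1]."""
    SUCCESS = "S"
    CORRECT = "C"
    PARTIALLY_CORRECT = "P"
    FAILURE = "F"
    NO_OUTPUT = "N"


def build_profile(results):
    """Tally scores per section.

    `results` is an iterable of (section, score) pairs, where `section`
    is a label such as "I" or "VII" and `score` is a Score member.
    Returns a dict mapping each section to a Counter of scores.
    """
    profile = {}
    for section, score in results:
        profile.setdefault(section, Counter())[score] += 1
    return profile


def success_rate(counts):
    """Fraction of test sentences scored Success (an assumed summary metric)."""
    total = sum(counts.values())
    return counts[Score.SUCCESS] / total if total else 0.0


# Invented scores for illustration only; these are not results from this study.
example_results = [
    ("I", Score.SUCCESS),
    ("I", Score.SUCCESS),
    ("II", Score.PARTIALLY_CORRECT),
    ("II", Score.NO_OUTPUT),
]

profile = build_profile(example_results)
for section, counts in profile.items():
    summary = {score.value: n for score, n in counts.items()}
    print(section, summary, success_rate(counts))
```

Keeping the raw counts per section, rather than a single overall number, matches the tool's emphasis on descriptive, hierarchically organized profiles.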
3. The NLP System

The NLP system used in our investigation consists of two major components: a domain-independent (core) component and a domain-specific (application) component. The domain-independent routines, which include the procedural components for syntactic, semantic, and pragmatic analysis as well as large portions of the grammar and lexicon, do not change during porting. However, the domain-specific routines, which include the specialized lexicon, semantic rules, knowledge base, and application-specific routines, must be re-implemented to accommodate a new application.

Three distinct modules (syntactic, semantic, and pragmatic) analyze and process the input sentences independently. The syntactic module requires knowledge of the lexicon and grammar rules; the semantic module requires semantic rules; and the pragmatic module requires domain knowledge.

The NLP system contains twenty-four switches which control its behavior. The behavior of each module can be observed by setting the appropriate switches. In certain modes, however, the switches control only what appears on the screen and not the processing that takes place in the background. Input to the system can come from external files or from the keyboard. The output can take several forms, depending on the configuration of the switches.

The syntactic analysis of a sentence can produce two types of output: a detailed surface structure parse tree, and an operator-argument representation called the Intermediate Syntactic Representation (ISR). The ISR is a simplified version of the parse tree: it provides a single canonical form for a number of different surface structures and omits detail that the semantic analyzer does not require. The semantic and pragmatic modules use the ISR as input and produce the Integrated Discourse Representation (IDR).
IDRs are application-neutral representations of the meaning of the sentences in the current discourse. They contain the situations described in the input sentences, the entities referred to, and the way the entities participate in the situations.

4. Applying the Benchmark Evaluation Tool to the NLP System

The application domain which accompanied the NLP system for this investigation is an interface to a relational database of air travel planning information. Test sentences were scored according to the syntactic, semantic, and pragmatic processing and comprehension of each sentence. Generally, the sentences which failed to produce an IDR (i.e., failed the analysis of the
5. Conclusions

Although the idea of testing the individual linguistic capabilities of NLP systems, rather than the sensitivity of the systems to individual applications, is extremely attractive, it has nevertheless proved to be an ambitious task. Most NLP systems are designed for well-defined domains and applications. Therefore, a general purpose evaluation tool may not be suitable for all types of NLP systems. This was evident from our investigation. The NLP system in our investigation has an extremely narrow application domain; it responds fairly well to sentences that satisfy its requirements, but sentences that fall outside those requirements are not analyzed completely.

Each type of NLP system possesses certain attributes that are unique. Each type has strengths and weaknesses which are directly associated with the goals and objectives of the system. Therefore, the evaluation procedure should be more sensitive to the type of NLP system being evaluated. For instance, if the NLP system is a Data Base Management System, the evaluation tool should place more emphasis on interrogative and basic sentences than on quantifiers and ellipsis. After all, such a system is not designed to respond to ellipsis or quantifiers.

Some of the grammar patterns suggested by the Benchmark Evaluation Tool are perfectly correct but are not used in everyday conversation. For instance, 'List the flights which are more expensive than the Boston to Atlanta flight is expensive' (VII-1.1) is grammatically correct, but the final part of the sentence, 'is expensive', is normally omitted and is implied by the first part. This may cause confusion among some evaluators. Scoring may also cause confusion, since the boundaries between the suggested scores are not well defined and are subjective. Hence, two independent evaluators may score a single NLP system completely differently.
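
The concern about subjective scoring could be checked by having two evaluators score the same test sentences and measuring how often they agree. The following is a minimal sketch, assuming simple percent agreement as the measure; the function name `percent_agreement` and the example scores are hypothetical and do not come from this study or from [1].

```python
def percent_agreement(scores_a, scores_b):
    """Return the fraction of test sentences given the same score by both evaluators.

    Each argument is a list of score letters ("S", "C", "P", "F", "N"),
    one per test sentence, in the same sentence order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both evaluators must score the same test sentences")
    if not scores_a:
        raise ValueError("at least one test sentence is required")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Invented scores for five test sentences, for illustration only.
evaluator_1 = ["S", "C", "P", "F", "N"]
evaluator_2 = ["S", "P", "P", "N", "N"]

agreement = percent_agreement(evaluator_1, evaluator_2)
print(f"Agreement: {agreement:.0%}")
```

In this invented example the evaluators differ exactly where the categories are adjacent (C versus P, F versus N), which is the kind of boundary the text describes as poorly defined.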
Although not all of the suggested sentence patterns were applicable, they nevertheless helped to define the boundaries of the NLP system. In many instances, such as VII-3.1, VII-3.2, VII-9.1, and VII-9.2, the clash between the wide scope of the evaluation tool and the narrow application domain of the NLP system was clearly evident. In defense of the NLP system, it must be noted that no NLP system can successfully satisfy all the rigorous requirements of the Benchmark Evaluation Tool. The Benchmark Evaluation Tool proved to be extremely helpful in providing guidance and structure for evaluating the NLP system. It should therefore be used as a guide for selecting the appropriate testing procedures for individual types of systems, rather than as a general purpose evaluation procedure that can be applied to all NLP systems.
Strengths of the Benchmark Evaluation Tool:
  Comprehensive
  Contains detailed explanations
  Independent of NLP systems and their application domain
  Defines the boundaries of NLP systems

Weaknesses of the Benchmark Evaluation Tool:
  Time consuming
  Scope of the evaluation is too wide
  Some suggested patterns seem unusual and are not used in everyday conversation
  Scoring is not well defined

In conclusion, the evaluation process proved to be extremely time consuming. It is conceivable that in the near future the evaluation process could be fully automated. However, in order for an automated evaluator to be successful, the evaluation should be performed in a narrower space with well-defined boundaries. Therefore, there should be several different automated evaluators, each specialized for a different type of NLP system. Each automated evaluator would have syntactic, semantic, and pragmatic knowledge of only one type of NLP system and would generate appropriate test sentences. The set of automated evaluators would form a complete collection of tools for evaluating all types of NLP systems. The Benchmark Evaluation Tool will be extremely instrumental in developing the automated evaluators.

6. References

[1] Benchmark Investigation/Identification Program, Volume I: Final Report, Calspan Advanced Technology Center, P.O. Box 400, Buffalo, NY 14225, May 1992.