Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Cristina Vertan, Walther v. Hahn
University of Hamburg, Natural Language Systems Division, Hamburg, Germany
{cri,vhahn}@nats.informatik.uni-hamburg.de

Abstract

Implementation of machine translation toy systems is a good practical exercise, especially for computer science students. Our aim in a series of courses on MT in 2002 was to make students familiar both with typical problems of Machine Translation in particular and of natural language processing in general, and with software implementation. In order to simulate a software implementation process as realistically as possible, we introduced more than 20 evaluation criteria to be filled in by the students when they evaluated their own products. The criteria go far beyond such toy systems, but they should demonstrate to the students what a real software evaluation means and what the particularities of Machine Translation evaluation are.

1 Introduction

Machine Translation (MT) is an important subfield of both Computational Linguistics and Natural Language Processing. Therefore academic education in MT addresses students in linguistics and in computer science. Usually, according to the background of the students, courses are given separately to these two groups, with different methodologies: theoretical aspects and demonstrations of tools for the linguists (Somers, 2001) on the one hand, implementation of clearly defined algorithms for the computer science students on the other hand.

The alternative, to implement a realistic MT system in one course, is not feasible due to the lack of time and the missing background knowledge of the students: very often they are facing the field for the first time. A solution in between may be the implementation of one or several toy systems with rather limited language resources and limited functionality. In (v. Hahn and Vertan, 2002) the reader will find detailed examples of such toy systems, which have been developed mainly in courses for computer science students, but also including students from linguistics.

In one of these courses the students had to implement (in small groups) parallel small systems based on either pattern matching, word-to-word translation, syntactic translation, or semantic translation. These sub-systems each processed a corpus of approx. 100 sentences and were controlled by a common user interface. The aim was to get a realistic idea of the possible contribution of each module in a real MT system.

[Figure: parallel architecture - input and output pass through a common user interface to the four sub-systems Pattern Matching, Word-to-Word, Syntactic Translation and Semantic Translation.]

In another course the students had to implement a classical centralised integrated system with a word-to-word pre-processor, syntactic and semantic modules, domain knowledge with an ontology, and a user interface.

[Figure: integrated architecture - a backbone connects the word-to-word pre-processor, parser, semantics and domain knowledge modules with the input, the output and the user interface.]
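To give a first impression of the scale of such toy systems, a word-to-word module can be as small as the following sketch (Python; the lexicon entries, names and the German example sentence are our own illustration, not material from the courses):

```python
# Minimal word-to-word toy translation module (illustrative sketch).
# A real course system would add morphological processing and a larger lexicon.

# Toy bilingual lexicon, German -> English (hypothetical entries).
LEXICON = {
    "das": "the",
    "seminar": "seminar",
    "beginnt": "starts",
    "morgen": "tomorrow",
}

def translate_word_to_word(sentence: str) -> str:
    """Translate word by word; unknown words are marked explicitly, so the
    user can tell lexicon gaps apart from input errors (cf. criterion 3.b)."""
    return " ".join(
        LEXICON.get(token, f"<unknown:{token}>")
        for token in sentence.lower().split()
    )

print(translate_word_to_word("Das Seminar beginnt morgen"))
# -> the seminar starts tomorrow
```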

The aim of designing such systems is not only

- to offer the students an interesting programming exercise, but also
- to make them conscious of what the implementation of a real system means.

The remaining pages of this paper explain the details of this last topic of the courses, where we made the students reflect on questions like:

- How will the system react to huge amounts of data or maximal throughput?
- How well are changes of the domain or changes of the languages supported?
- What about maintenance by users?
- Or, more complicated: What type of architecture is optimal for the required functionality?
- What kind of grammar is more suitable?
- How much of the original plans did we realise?
- What is the intuition of a possible user?

In order to give the students the opportunity to reflect on all these types of questions, we asked them, as an integrated part of the laboratory assignment, to check 17 criteria for software quality evaluation (among them maintainability, system work-flow integration, and efficiency) and 7 criteria for linguistic functionality, like lexical coverage, syntactic coverage, or compatibility with (European) standards and formats. The aim was to familiarise students with the evaluation of software projects. For some criteria, like maintainability or domain coverage, there was admittedly no reasonable answer when working with a toy system, but the idea was to expose the students early enough to all aspects of software evaluation in general and machine translation evaluation in particular.

To make the whole implementation process more realistic, we also prepared criteria for the specification of the software, which had to be fulfilled before the implementation process. Both the evaluation and the specification criteria follow general software engineering theory (Sommerville, 1990).

In the following we explain in detail each of the specification and evaluation criteria (both software-specific and linguistic), and we report what the students learned from following such schemata.

2 Specification Criteria

a) Functional requirements: mainly the behaviour of input and output, both at system and at module level.

- On the system level the specification has to define the type and form of the input (e.g. speech or text; the format of the input, e.g. file formats and which other formats are supported; restrictions of text input from the keyboard; menu selection; audio formats; etc.) as well as the type and form of the output. To demonstrate the generality of the criteria we included cases where the output of a natural language processing system is something other than a natural language utterance: a record from a database, an action (in the case of a robot system), or simply specific terms or web links.
- The other part of the functional requirements concerns the module interfaces of the system. Specifying the formats for the input and output of each module right from the beginning makes it much easier for the teams of the course to work independently afterwards: each group can develop and test its modules with simulated input data without waiting for the completed work of the others (see the interface sketch at the end of this section).

b) Performance requirements

The students were asked to estimate which time behaviour their system has or requires, and which resources it will need.

c) Usage requirements

Among students, this criterion is usually the most neglected one.
They tend to assume that if input and output are in natural language, no special attention has to be paid to the user interface. A potential user, in their view, needs only a text field in which to type the input and another part of the screen for the presentation of the output. We tried to raise their awareness of more specific requirements, especially in MT systems: dialogue windows for unknown words and errors in the input, the proper selection of labels and controls, facilities for reading in files, for pre-processing, etc.

d) Embedding requirements

Under this heading the students are asked to specify which hardware is needed for their system and which operating systems will be supported. A realistic scenario for the application also has to be discussed.
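As an illustration of the module interface requirement under a), a minimal contract between the groups might look as follows (a sketch; the dataclass fields and the simulated test input are our own, hypothetical choices):

```python
# Sketch of a fixed module interface so that each group can develop and test
# its module with simulated data before the other modules exist.
from dataclasses import dataclass, field

@dataclass
class ModuleResult:
    """Common output format agreed on by all groups (hypothetical fields)."""
    tokens: list            # tokenised target-language output
    unknown: list = field(default_factory=list)   # words not found in the lexicon
    notes: list = field(default_factory=list)     # processing remarks

def syntactic_module_stub(tokens):
    """Simulated stand-in for the syntactic module: echoes its input so that
    downstream groups can start testing against the agreed format."""
    return ModuleResult(tokens=tokens, notes=["stub: no syntactic transfer yet"])

# Any group can test against the interface with simulated input data:
result = syntactic_module_stub(["the", "seminar", "starts", "tomorrow"])
assert result.unknown == [] and result.tokens[0] == "the"
```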

3 General Evaluation Criteria

The first group of criteria (a-h) evaluates the usage of the software, seen from the perspective of the user; under i) follow criteria for the software product itself, and under j) process quality criteria.

a) Adequacy

This point has to be assessed with reference to 2.c, i.e. how many of the specified user requirements are fulfilled, how user-friendly the system interface is, etc. The students have to give precise examples of situations where their system reacts adequately, and of cases where improvements seem necessary.

b) Transparency

This evaluation criterion includes reasonable user estimations of processing errors, the plausibility of the system's behaviour in general, and reasonable help facilities. Example: in a translation tool the user (ideally) has to be informed whether a non-translated term is a word missing from the dictionary, a proper name, or simply an input error by the user. Of course, at the level of toy systems we cannot expect the students (especially under time constraints) to tackle such problems, but they must be aware of their existence.

c) Work-flow integration

As mentioned under paragraph 2.d, a possible scenario has to be specified initially for the system. In the evaluation statements the students are requested to explain to what extent their system would fit into an assumed work flow, and with which additional time and costs their product could be adapted to other scenarios or work-flow environments; further, how open it is to functional extensions, because in a course such toy systems are designed exactly for the given or defined scenario. By including such a criterion we force the students to reflect on the difficulties of building a system which is general enough to cover different scenarios and different work-flow environments.

d) Specifications match

This requires a detailed comparison with all specification criteria: which of them are met by the implementation, what is still missing, and, moreover, what does not conform to the specification criteria at all and why. The students must provide reasons why, for example, input and output formats were changed or not supported in the given form.

e) Reliability

The deterministic behaviour of the product and its components has to be evaluated. As there is, e.g., no additional sensor input, translation systems must be deterministic.

f) Robustness

This is again an issue where students have to learn a lot about real software behaviour and make a very detailed evaluation. From our experience, they assume that their system works with all input data in the form that they require, and that the system always runs in a similar way as with their test sentences. Their specifications usually cope only with the positive functionality ("what to do"), not with functions to avoid certain behaviour: what happens if the user forgets to specify parameters, performs none of the necessary actions, or enters corrupted data? What to do about faulty input? (A minimal defensive check is sketched after criterion h below.)

g) Failure safety

This criterion is mentioned only to familiarise the students with large-scale evaluation procedures. For a prototype toy system it is not assumed that the implementation will include restart facilities or that there are backup copies, but such aspects are important for real systems.

h) Efficiency

The efficiency of the program has to be estimated in terms of hardware requirements and consumption of resources, as well as the time required to perform certain operations.
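As announced under f), even a toy system can reject faulty input explicitly instead of failing somewhere downstream. A minimal sketch (our own illustration; translate_word_to_word is the hypothetical word-to-word module sketched in the introduction):

```python
# Defensive input handling for a toy translator (illustrative sketch for
# criterion f; translate_word_to_word is the hypothetical module from the
# introduction).
def translate_safely(sentence):
    """Check the input before translating and report errors explicitly."""
    if not isinstance(sentence, str):
        return "ERROR: input must be a string"
    sentence = sentence.strip()
    if not sentence:
        return "ERROR: empty input"
    if any(ch.isdigit() for ch in sentence):
        # The toy lexicon contains no numerals; report this instead of crashing.
        return "ERROR: numerals are not covered by the toy lexicon"
    return translate_word_to_word(sentence)

print(translate_safely(""))    # -> ERROR: empty input
print(translate_safely(None))  # -> ERROR: input must be a string
```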
i) Product quality

Under this title the students are asked to briefly explain whether the program execution is correct, i.e. whether the expected behaviour is delivered. To refine the discussion we introduced the following sub-criteria:

- correctness (e.g. correct processing, complete correspondence to the specifications)
- comprehensibility (e.g. structure of the programs, choice of designators and names in the code)
- testability
- maintainability
- changeability
  - structural changes
  - functional changes
  - problem-type changes

The criteria mentioned so far are valid for any software product. In our case of toy translation tools, the problem of correctness is much more complicated due to translation-specific features. In contrast to classical software products, where a unique correct output must correspond to any input, translation theory states clearly that there is more than one correct translation of the same input sentence. Moreover, the assessment "correct" for a translation is relative. For example, in the case of the Verbmobil system evaluation (Tessiore and v. Hahn, 2000), many users were prepared to classify the output as correct already if they could understand the meaning and pragmatics of the translation.

The existing evaluation methodologies for machine translation usually require the existence of a reference translation; a set of metrics starting from the output and the reference translation is defined in the literature (Dabbadie et al., 2002). Our experience proved that the existence of a reference translation can even be misleading for the students. For at least two of the toy systems developed in our courses, we provided the students with a test corpus consisting of about 70-100 sentences and their reference translations. At least three problems were encountered:

1. The students had a strong fixation on our reference translation: either they tried to tune their system artificially to deliver exactly the given reference, or they classified all translations as incorrect which did not match the reference perfectly.
2. The (bilingual) lexicons were constructed strictly according to the reference translation: only the morphology, meanings, etc. encountered in the test corpus were included. As a consequence, the students faced the problem of disambiguation only in those cases where we had included it intentionally.
3. The system was developed strictly to cover the test corpus; any additional sentence would fail.

The scenario which we apply now, after this experience, is rather different: at the beginning the students get no test corpus. Their first task, together with the requirements in the specification task, is to estimate what kinds of sentences can occur in the given domain, and subsequently to design a lexicon which covers such situations. After the design phase and during evaluation we provide a test corpus, but only with sentences in the source language. This test corpus prevents the students from choosing only very simple cases, i.e. from excluding anaphora and ellipses, sub-clauses, or defective sentences.

j) Process quality (e.g. quality of the implementation process, certification, quality of the specification)

In contrast to the evaluation of the product, which addresses only the results delivered by the system and its overall behaviour, process evaluation means the evaluation of the conditions under which the software was produced. This covers the methodology for compiling the specifications, security measures, the design of tests, and the cooperation among the groups and with the customer. Here the students have the opportunity to reflect on the quality of their production process and on the results of, e.g., underestimating time resources. Obviously, under time constraints, the code is not always documented and not always explicit enough. The aim of this professional software evaluation is not to over-criticise the results of the students but to show them what requirements are expected at a commercial level, even for tasks which are, by nature, not completely and formally definable but vague, because this is the nature of language.
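The fixation problem under 1. above is easy to reproduce: reference-based metrics of the kind surveyed in (Dabbadie et al., 2002) reward literal agreement with the single reference. A naive word-overlap precision (our own illustrative sketch, not a metric from the cited paper) already shows the effect:

```python
# Naive word-overlap precision against a single reference translation
# (illustrative only; real MT metrics are more elaborate, but share the
# dependence on the reference wording).
def overlap_precision(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    if not cand:
        return 0.0
    return sum(1 for w in cand if w in ref) / len(cand)

reference = "the seminar takes place in the city centre"
print(overlap_precision("the seminar takes place in the city centre", reference))
# -> 1.0
print(overlap_precision("the course is given downtown", reference))
# -> 0.2, although a human judge might accept this translation as correct
```

Both candidates may be acceptable translations, yet only the one that happens to match the reference wording scores high.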
4 Criteria for Linguistic Quality Evaluation

In section 3 we presented evaluation criteria which are valid for all software products. In the following we concentrate on criteria specific to linguistic processing, in particular to translation tools.

a) Coverage

- Lexicon
- Syntax
- Semantics

As explained in section 3.i), in a toy system the students will implement a reduced lexicon and a grammar which covers only part of the language, and they will deal only with restricted semantic problems. In our opinion, however, it is important that the students can define exactly the amount of linguistic features that they cover. Therefore they are asked to indicate:

- how many entries the lexicon has, with examples of important words which may occur in the given domain but were not included,
- the annotations in the lexicon, the choice of lexicon type (stem lexicon versus full-form lexicon) and, correspondingly, their morphological processing (a stem-lexicon entry is sketched at the end of this section),
- types of sentences that can be processed, and types of realistic sentences which will fail,
- semantic phenomena which are tackled and solved.

b) Pragmatics

Here the students have to evaluate to what extent their software covers pragmatic aspects of the languages. Good examples are common directive speech acts like "The course is given in the city centre" ("Das Seminar wird in der Stadtmitte abgehalten"), meaning in the university main building, not in the CS building.

c) Compatibility

Translation tools make use of many resources (corpora, lexicons, grammars, etc.). Their development is time-consuming, and therefore standardisation efforts have been made for many years; the aim is to provide reusable resources. Therefore the students are asked to discuss:

- the format of their data, i.e. to what extent it meets existing standards and formalisms; if not, whether the lexicon is at least encoded in a reusable format (e.g. some XML version),
- whether the grammar follows a well-known formalism (HPSG, functional grammar, etc.), and on which basis the choice was made.

Concerning the languages, we usually define right from the beginning what the source and what the target language of the translation process is. The students, however, must discuss whether their program:

- can (easily) be reversed to translate backwards, from target to source,
- can be adapted to new language pairs, and with which amount of work. Here the general translation paradigm (transfer versus interlingua) can be addressed.

Especially the linguistic evaluation can be a starting point for a broader discussion in the seminar about rather difficult issues in NLP:

- how much a change of the lexicon design influences the design and the functionality of the whole system,
- whether the lexicon is part of the grammar (transition networks), in which case changes influence the whole grammar and the parser,
- how technical ad-hoc decisions (easy implementation, time constraints, programming languages) restrict the whole system design and inhibit reasonable linguistic solutions.

Similar discussions can be triggered concerning a change of the domain, which involves major re-implementations of at least the lexical resources and the pragmatic processes.
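To illustrate the lexicon-type choice under a): a full-form lexicon stores every inflected form as a separate entry, whereas a stem lexicon stores one entry per stem and analyses inflection at lookup time. A minimal sketch (entry structure, field names and the crude suffix stripping are our own, hypothetical choices):

```python
# Sketch contrasting the two lexicon types discussed under a) Coverage.
# Entry structure and field names are hypothetical.

# Full-form lexicon: every inflected form is a separate entry.
FULL_FORM = {
    "beginnt":  {"lemma": "beginnen", "pos": "V", "en": "starts"},
    "beginnen": {"lemma": "beginnen", "pos": "V", "en": "start"},
}

# Stem lexicon: one entry per stem; inflection is analysed at lookup time.
STEMS = {
    "beginn": {"pos": "V", "en_stem": "start", "paradigm": "regular"},
}

def stem_lookup(word):
    """Very crude suffix stripping (illustrative only; a course system
    would implement a real morphological analyser)."""
    for suffix in ("en", "t", "e", ""):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] if suffix else word
            if stem in STEMS:
                return STEMS[stem]
    return None

print(stem_lookup("beginnt"))  # found via stem "beginn"
```

The trade-off is exactly the one the students must document: the stem lexicon needs fewer entries but more processing, and the choice constrains the morphological component of the whole system.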
5 Conclusions

In this paper we presented criteria for the specification and evaluation of toy machine translation systems, in order to assess their quality. The criteria can be grouped into two classes: general software evaluation criteria and specific linguistic ones. Both are used by the students to evaluate their own programming. It is quite clear that many of these criteria are by far too complex for such toy systems; the main aim is to familiarise computer science and linguistics students with a real evaluation methodology. From our experience, the students had real difficulties in assessing each point of the criteria list. However, at the end of the evaluation they got some general ideas about why some of their methods, although locally successful, are not general enough, on which issues the success of an implementation depends, and, last but not least, why the implementation of a machine translation system is not a trivial task.

6 Bibliographical References

Marianne Dabbadie, Anthony Hartley, Margaret King, Keith J. Miller, Mustafa El Hadi, Andrei Popescu-Belis, Florence Reeder and Michelle Vanni. 2002. A Hands-On Study of the Reliability and Coherence of Evaluation Metrics. In Proceedings of the Workshop "Machine Translation Evaluation: Human Evaluators meet Automated Metrics", Third International Conference on Language Resources and Evaluation (LREC 2002), pp. 8-16.

Walther v. Hahn and Cristina Vertan. 2002. Architectures of toy systems for teaching Machine Translation. In Proceedings of the 6th EAMT Workshop on Teaching Machine Translation, Manchester, pp. 69-78.

Harold Somers. 2001. Three Perspectives on MT in the Classroom. In Proceedings of the Workshop on Teaching Machine Translation, MT Summit VIII, Santiago de Compostela.

Ian Sommerville. 1990. Software Engineering, third edition. Addison-Wesley Publishing Company, Massachusetts.

Lorenzo Tessiore and Walther v. Hahn. 2000. Functional Validation of a Machine Interpretation System: Verbmobil. In W. Wahlster (ed.), Verbmobil: Foundations of Speech-to-Speech Translation, Springer Verlag, Berlin, pp. 611-634.