Machine Translation in Practice. Convertus AB. Anna Sågvall Hein 2017-


Machine Translation in Practice Convertus AB http://www.convertus.se/home-en.html Anna Sågvall Hein 2017-

Convertus AB
A Swedish language technology company specialising in Machine Translation (MT) and automatic Language Checking (LC).
Founded in 2006 by a group of computational linguists at Uppsala University, headed by Professor Anna Sågvall Hein.
First LC reference installation: Scania Checker, 2000, at Scania CV AB.
First MT reference installation: the Syllabus Translator, 2007, at Uppsala University.

Convertus business model
Convertus develops and markets software and services in the Language Technology field.
The core product is a complete solution to Machine Translation provided as a web-based service.
A key product is BTS - a platform for running and managing machine translation using different

The complete MT solution Preprocessing Translation memory Machine translation Automatic post-editing Manual post-editing

Preprocessing Spell checking Grammar checking Reformulation Compliance to controlled language

Convertus MT engines Rule-based translation engines, RBMT+ Statistical translation engines, SMT Combinations of RBMT and SMT

RBMT The oldest MT type Based on linguistic models Employs language resources Analyses the source language at some level

Convertus RBMT+ engines
Multra sv->en
  Deployed since 2007
  Analysis: handcrafted grammar, chart parser, UCP
  Fall-back
    Dictionary: tagger
    Parsing: partial parsing
    Generation: language model
Maltra fi->en
  Research prototype

Primary modules in Multra and Maltra
ANALYSIS
  lexical and syntactic analysis
  parsing of source language segments into linguistic structures
TRANSFER
  transfer translation of source language

Problems to be handled in MT
Lexical ambiguity: a word/token may represent different parts of speech, inflectional forms and meanings
Variation/translation ambiguity
  different words for the same meaning (synonymy)
  different ways of formulating a statement (paraphrase)
Language differences: lexicon, morphology, syntax

An example
Sv. Under tiden stannade bilen.
En. Meanwhile, the car stopped.
Problems
  Lexical choice: under -> under, during, miracle
  Word order
  Punctuation

Multra solution Lexical choice / dictionary, transfer Word order / generation grammar Punctuation / generation grammar

Analysis structure

Transfer structure

Generation structure

Dictionary One-word-units and multiword units Hierarchical organisation of the dictionary Dictionary set-up specifies a hierarchy between parts of the dictionary (sub-dictionaries) All unique alternatives are presented to the parser in the preferred order
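The ordered lookup described on this slide can be sketched roughly as follows. This is a minimal illustration, not Convertus's implementation: the sub-dictionary names `DOMAIN` and `GENERAL`, their entries, and the function `lookup` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical sub-dictionaries, listed from most to least preferred.
# The entries are illustrative only, not taken from the actual resources.
DOMAIN: Dict[str, List[str]] = {"under tiden": ["meanwhile.ab"]}
GENERAL: Dict[str, List[str]] = {
    "under": ["under.pp", "during.pp", "miracle.nn"],
    "under tiden": ["meanwhile.ab", "in the meantime.ab"],
}


def lookup(unit: str, hierarchy: List[Dict[str, List[str]]]) -> List[str]:
    """Collect the unique alternatives for a unit, keeping hierarchy order."""
    seen = set()
    ordered: List[str] = []
    for sub_dictionary in hierarchy:
        for alternative in sub_dictionary.get(unit, []):
            if alternative not in seen:  # drop duplicates from lower levels
                seen.add(alternative)
                ordered.append(alternative)
    return ordered
```

Calling `lookup("under tiden", [DOMAIN, GENERAL])` returns the domain alternative first and the remaining general alternative after it, which is the order a parser would then try them in.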

A dictionary hierarchy

Tagging
Simple syntactic analysis based on n-gram processing (HunPos), filtering out the best alternatives provided by the dictionary.
Example: då om
  då.ab -> then.ab / då.sn -> because.sn
  om.sn -> if.sn / om.pp -> about.pp
Default alternatives are provided for words outside the dictionary.
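As a rough illustration of n-gram filtering over dictionary alternatives, the sketch below picks the best tag sequence with a bigram Viterbi search. It is a simplification and does not reproduce HunPos; the score table `BIGRAM`, the `DEFAULT` score and the function `best_tags` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical bigram scores between tags; unseen pairs get a low default.
BIGRAM: Dict[Tuple[str, str], float] = {
    ("<s>", "ab"): 0.4,
    ("<s>", "sn"): 0.3,
    ("ab", "sn"): 0.5,
    ("ab", "pp"): 0.2,
    ("sn", "sn"): 0.05,
    ("sn", "pp"): 0.1,
}
DEFAULT = 0.01


def best_tags(candidates: List[List[str]]) -> List[str]:
    """Viterbi search over dictionary-provided tag candidates.

    candidates[i] lists the tags the dictionary allows for token i
    (each list is assumed to be non-empty).
    """
    # paths maps a tag to (score, best tag sequence ending in that tag)
    paths: Dict[str, Tuple[float, List[str]]] = {"<s>": (1.0, [])}
    for options in candidates:
        new_paths: Dict[str, Tuple[float, List[str]]] = {}
        for tag in options:
            score, sequence = max(
                (
                    (prev_score * BIGRAM.get((prev_tag, tag), DEFAULT), prev_seq)
                    for prev_tag, (prev_score, prev_seq) in paths.items()
                ),
                key=lambda item: item[0],
            )
            new_paths[tag] = (score, sequence + [tag])
        paths = new_paths
    return max(paths.values(), key=lambda item: item[0])[1]
```

With the scores above, `best_tags([["ab", "sn"], ["sn", "pp"]])` selects `ab` for "då" and `sn` for "om", which corresponds to the reading "then if".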

Parsing Chart parsing using Uppsala Chart Processor, UCP. UCP is an in-house chart parser with a procedural grammar formalism. The grammar writer inserts active and passive edges in the chart thereby promoting the processing. Processing is non-deterministic, i.e. all combinations of active and passive edges are explored.
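The interplay of active and passive edges can be sketched with a generic bottom-up chart parser. Note the assumptions: UCP uses a procedural formalism in which the grammar writer inserts edges, whereas this sketch uses declarative context-free rules; the toy `GRAMMAR`, `LEXICON` and the `Edge` type are hypothetical.

```python
from collections import namedtuple

# An edge spans tokens start..end; it is passive when nothing is still needed.
Edge = namedtuple("Edge", "start end lhs found needed")

# Hypothetical toy grammar and lexicon.
GRAMMAR = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V",))]
LEXICON = {"the": "Det", "car": "N", "stopped": "V"}


def parse(tokens):
    """Explore all combinations of active and passive edges; return the chart."""
    chart = set()
    # Seed the agenda with one passive edge per token (unknown words raise KeyError).
    agenda = [Edge(i, i + 1, LEXICON[t], (t,), ()) for i, t in enumerate(tokens)]
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        chart.add(edge)
        if not edge.needed:  # passive edge
            # Start a new edge for every rule whose first symbol matches.
            for lhs, rhs in GRAMMAR:
                if rhs[0] == edge.lhs:
                    agenda.append(Edge(edge.start, edge.end, lhs, (edge.lhs,), rhs[1:]))
            # Let active edges that end where this one starts consume it.
            for other in list(chart):
                if other.needed and other.end == edge.start and other.needed[0] == edge.lhs:
                    agenda.append(Edge(other.start, edge.end, other.lhs,
                                       other.found + (edge.lhs,), other.needed[1:]))
        else:  # active edge
            # Consume any passive edge that starts where this one ends.
            for other in list(chart):
                if not other.needed and other.start == edge.end and other.lhs == edge.needed[0]:
                    agenda.append(Edge(edge.start, other.end, edge.lhs,
                                       edge.found + (other.lhs,), edge.needed[1:]))
    return chart
```

A segment has a complete analysis when the chart contains a passive `S` edge that spans all tokens; passive edges over shorter spans are what a fall-back to partial parses would draw on.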

Ranking rules express priorities between competing analyses, formulated in linguistic terms. Competing analyses are submitted to the transfer module in the order in which they were ranked.

Transfer
Transfer rules are applied to feature structures generated by the parser.
Rule application is implemented as unification of feature structures.
Transfer rules are expressed in a PATR-like formalism.
All applicable rules are applied.
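Unification of feature structures can be sketched as below, assuming structures are represented as nested Python dictionaries with atomic values at the leaves. The function `unify` is a hypothetical illustration; it omits variables and structure sharing, which a PATR-like formalism would normally support.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Return the unification of two feature structures, or None on a clash.

    Dictionaries are merged feature by feature; atomic values unify only
    when they are equal. The inputs are not modified.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:  # conflicting values: unification fails
                    return None
                result[feature] = merged
            else:
                result[feature] = value
        return result
    return a if a == b else None
```

For example, unifying `{"cat": "np", "numb": "sg"}` with `{"numb": "sg", "person": 3}` yields a structure carrying all three features, while `{"numb": "sg"}` and `{"numb": "pl"}` fail to unify.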

Transfer rule format
LABEL name
SOURCE path expression(s)
TARGET path expression(s)
TRANSFER ?x1 ?x2

Types of transfer rules Copy a feature Delete a feature Transfer a feature structure Define a target feature structure

Copy a feature
LABEL phr.cat
SOURCE <* phr.cat> = x
TARGET <* phr.cat> = x
TRANSFER

Delete a feature
LABEL gender
SOURCE <* gender> = ANY
TARGET <*> = <*>
TRANSFER

Transfer a feature structure
LABEL subj
SOURCE <* subj> = ?x1
TARGET <* subj> = ?x2
TRANSFER ?x1 ?x2

Define a target feature structure
LABEL ta.bort-remove
SOURCE <* lex sym> = ta.vb+bort.ab.1
       <* word.cat> = verb
TARGET <* lex> = remove.vb.1
       <* word.cat> = verb
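The four rule types can be mimicked in a few lines, assuming feature structures are nested dictionaries. This is only a sketch: the rule sets are hard-coded rather than read from PATR-like rules, the source path `lex sym` is simplified to a single `lex` feature, and the names `DELETED`, `LEXICON` and `transfer` as well as the `bil.nn.1` entry are hypothetical.

```python
from typing import Any, Dict

# Features dropped on the target side (cf. the "gender" rule).
DELETED = {"gender"}

# Hypothetical translation dictionary; only the first entry comes from the slide.
LEXICON = {"ta.vb+bort.ab.1": "remove.vb.1", "bil.nn.1": "car.nn.1"}


def transfer(source: Dict[str, Any]) -> Dict[str, Any]:
    """Build a target feature structure from a source feature structure."""
    target: Dict[str, Any] = {}
    for feature, value in source.items():
        if feature in DELETED:
            continue  # delete a feature
        if feature == "lex":
            # define a target value; unknown words are copied unchanged
            target[feature] = LEXICON.get(value, value)
        elif isinstance(value, dict):
            target[feature] = transfer(value)  # transfer a sub-structure
        else:
            target[feature] = value  # copy a feature
    return target
```

Applied to a clause whose verb carries `lex = ta.vb+bort.ab.1`, the sketch yields a target structure with `lex = remove.vb.1`, copies `word.cat`, and silently drops any `gender` feature.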

Generation The generation module operates on the preferred feature structure produced by the transfer component. The module is responsible for creating words from feature bundles and ordering them in the order specified by the English grammar. Generation rules are expressed in a PATR-like formalism. Rule application is based on unification and concatenation.

A generation rule
% "The process includes reconditioning"
LABEL subj-verb-obj.dir
x1 ---> x2 x3 x4 :
  <x1 subj> = <x2>
  <x1 verb> = <x3>
  <x1 obj.dir> = <x4>
  <x3 numb> = <x2 numb>
  <x3 person> = <x2 person>
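The effect of the subj-verb-obj.dir rule can be approximated as follows: the verb must agree with the subject in number and person, and the three constituents are concatenated in English order. The feature names `form` and `lemma`, the toy `inflect` function and `generate_clause` are assumptions made for this sketch.

```python
from typing import Any, Dict, Optional


def inflect(verb: Dict[str, Any]) -> str:
    """Toy English inflection: add -s in the third person singular."""
    if verb["numb"] == "sg" and verb["person"] == 3:
        return f'{verb["lemma"]}s'
    return str(verb["lemma"])


def generate_clause(fs: Dict[str, Any]) -> Optional[str]:
    """Order subject, verb and direct object; return None if agreement fails."""
    subj, verb, obj = fs["subj"], fs["verb"], fs["obj.dir"]
    # <x3 numb> = <x2 numb> and <x3 person> = <x2 person>
    for feature in ("numb", "person"):
        if feature in verb and verb[feature] != subj[feature]:
            return None
    agreed = {**verb, "numb": subj["numb"], "person": subj["person"]}
    # x1 ---> x2 x3 x4: concatenate in the order given by the rule
    return " ".join([subj["form"], inflect(agreed), obj["form"]])
```

A verb that is unspecified for number and person simply inherits both from the subject, which is how "include" surfaces as "includes" after a third person singular subject.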

Language resources Dictionaries Swedish dictionary English dictionary Translation dictionary Tagging rules and correction rules Grammars

Fall-back
Lexical: Out-Of-Vocabulary units, OOVs
  Tagger creates default source word description
  Source word copied into target
Parsing: no complete parse
  Use partial parses
Generation: no complete generation

Partial parsing Select the best configuration of partial parses and translate them one by one. The best configuration of partial parses is assumed to be the one with the smallest number of partial parses. The selection uses a greedy search algorithm.
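One plausible greedy strategy is longest match first: at each position, take the partial parse that reaches furthest and continue from its end. The slides do not say which greedy criterion is used, so this is an assumption, as are the span representation and the name `greedy_cover`. Tokens not covered by any partial parse are treated as single-token units.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token positions, end exclusive


def greedy_cover(n_tokens: int, spans: List[Span]) -> List[Span]:
    """Cover tokens 0..n_tokens left to right, always taking the longest span."""
    # For every start position, remember the furthest end of any partial parse.
    furthest: Dict[int, int] = {}
    for start, end in spans:
        if 0 <= start < end <= n_tokens:
            furthest[start] = max(furthest.get(start, start + 1), end)
    chosen: List[Span] = []
    position = 0
    while position < n_tokens:
        end = furthest.get(position, position + 1)  # single token as fall-back
        chosen.append((position, end))
        position = end
    return chosen
```

A greedy choice keeps the search cheap but is not guaranteed to find the configuration with the fewest partial parses; an exhaustive or dynamic-programming search would be needed for that guarantee.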

Aspects of RBMT Translations close to the source language Translation quality dependent on the quality and coverage of the language resources A labor-intensive task to build the language resources Fall-back mechanisms to account for gaps in the language resources

Aspects of SMT
Pros
  Can be built fast on previous translations
  Idiomatic translations
Cons
  Translation quality dependent on the quality and size of the training data
  No guaranteed translation relation to the source language
  Words may be lost
  Words may be inserted

Convertus applications
The Syllabus Translator: Multra, Maltra, SMT
Technical translation for industrial clients: Multra, SMT
Gisting for in-house purposes: SMT

How to run the MT service?
Users
  Integrated in the user's normal workflow
    the education databases of the universities
    the Syllabus Translator as a plug-in to Trados, the translators' usual way of working
  Independent service: the BTS platform
Developers
  Terminal for large-scale testing

Bologna Translation Service, BTS
A translation platform developed in the Bologna Project (www.bologna-translation.eu).
Supports dynamic learning of MT.
See further https://www.convertus.se/sv/oversattningsplatform

Manual post-editing
A platform for manual post-editing of translation segments is provided: edit, save, approve.
Approved translation segments are stored in the Translation Memory and re-used.

Translation memory, TM TM comprises approved translation segments. TM grows as the service is used. Search in TM is the first option in the translation process.
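The "TM first, MT second" order can be sketched as below. The class `TranslationMemory`, its exact-match lookup with whitespace and case normalisation, and the `translate` helper are assumptions for illustration; the slides do not describe how matching is done.

```python
from typing import Callable, Dict, Optional, Tuple


class TranslationMemory:
    """Stores approved segments; grows as translations are approved."""

    def __init__(self) -> None:
        self._segments: Dict[str, str] = {}

    @staticmethod
    def _normalize(segment: str) -> str:
        # Assumed normalisation: collapse whitespace and ignore case.
        return " ".join(segment.split()).lower()

    def approve(self, source: str, target: str) -> None:
        self._segments[self._normalize(source)] = target

    def lookup(self, source: str) -> Optional[str]:
        return self._segments.get(self._normalize(source))


def translate(segment: str, tm: TranslationMemory,
              engine: Callable[[str], str]) -> Tuple[str, str]:
    """Search the TM first; fall back to the MT engine on a miss."""
    hit = tm.lookup(segment)
    if hit is not None:
        return hit, "tm"
    return engine(segment), "mt"
```

Here `engine` stands for any MT engine callable; the second element of the result records whether the translation came from the memory or from the engine.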

Automatic post-editing
Translation memory is searched for manual post-edits at regular intervals.
Post-edits are reformulated as Automatic Post-Editing Rules, APEs.
APEs are appended to the translation engines, contributing to their quality.
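How post-edits might be turned into rules can be sketched with a token-level diff between the MT output and its post-edited version. The actual rule format and extraction method are not described in the slides; the functions `extract_rules` and `apply_rules`, the `min_count` threshold and the example sentences are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def extract_rules(post_edits: List[Tuple[str, str]], min_count: int = 2) -> Dict[str, str]:
    """Derive replacement rules from (MT output, post-edited text) pairs.

    A substitution becomes a rule once it has been seen min_count times.
    """
    counts: Counter = Counter()
    for mt_output, edited in post_edits:
        a, b = mt_output.split(), edited.split()
        matcher = SequenceMatcher(a=a, b=b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                counts[(" ".join(a[i1:i2]), " ".join(b[j1:j2]))] += 1
    return {src: tgt for (src, tgt), n in counts.items() if n >= min_count}


def apply_rules(text: str, rules: Dict[str, str]) -> str:
    """Apply each rule as a plain substring replacement (no word-boundary check)."""
    for src, tgt in rules.items():
        text = text.replace(src, tgt)
    return text
```

Requiring a substitution to recur before it becomes a rule is one simple way to avoid generalising from one-off edits; a production system would also need word-boundary handling and conflict resolution between rules.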

Project DigInclude
Facilitate access to digital information provided by Swedish authorities.
Convertus's role is to provide translation services for immigrant languages.
Coordinator: SICS EAST SWEDISH ICT AB