Machine Translation in Practice Convertus AB http://www.convertus.se/home-en.html Anna Sågvall Hein 2017-
Convertus AB
- A Swedish language technology company specialising in machine translation (MT) and automatic language checking (LC)
- Founded in 2006 by a group of computational linguists at Uppsala University, headed by Professor Anna Sågvall Hein
- First LC reference installation: Scania Checker, 2000, at Scania CV AB
- First MT reference installation: the Syllabus Translator, 2007, at Uppsala University
Convertus business model
- Convertus develops and markets software and services in the language technology field.
- The core product is a complete machine translation solution provided as a web-based service.
- A key product is BTS, a platform for running and managing machine translation using different
The complete MT solution
- Preprocessing
- Translation memory
- Machine translation
- Automatic post-editing
- Manual post-editing
Preprocessing
- Spell checking
- Grammar checking
- Reformulation
- Compliance with controlled language
Convertus MT engines
- Rule-based translation engines, RBMT+
- Statistical translation engines, SMT
- Combinations of RBMT and SMT
RBMT
- The oldest MT type
- Based on linguistic models
- Employs language resources
- Analyses the source language at some level
Convertus RBMT+ engines
- Multra, sv->en
  - Deployed since 2007
  - Analysis: handcrafted grammar, chart parser (UCP)
  - Fall-back
    - Dictionary: tagger
    - Parsing: partial parsing
    - Generation: language model
- Maltra, fi->en
  - Research prototype
Primary modules in Multra and Maltra
- ANALYSIS: lexical and syntactic analysis; parsing of source language segments into linguistic structures
- TRANSFER: transfer translation of source language
Problems to be handled in MT
- Lexical ambiguity: a word/token may represent different parts of speech, inflectional forms and meanings
- Variation / translation ambiguity
  - different words for the same meaning (synonymy)
  - different ways of formulating a statement (paraphrase)
- Language differences: lexicon, morphology, syntax
An example
- Sv. Under tiden stannade bilen.
- En. Meanwhile, the car stopped.
- Problems
  - Lexical choice: Swedish "under" can be rendered as under, during or miracle
  - Word order
  - Punctuation
Multra solution
- Lexical choice: dictionary, transfer
- Word order: generation grammar
- Punctuation: generation grammar
Analysis structure
Transfer structure
Generation structure
Dictionary
- One-word units and multiword units
- Hierarchical organisation of the dictionary
- The dictionary set-up specifies a hierarchy between parts of the dictionary (sub-dictionaries)
- All unique alternatives are presented to the parser in the preferred order
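A hierarchical lookup of this kind might be sketched as below. This is only an illustration: the sub-dictionary contents, the `.nn` tag suffix and the function name `lookup` are assumptions, not Convertus data or code.

```python
from typing import Dict, List

# Hypothetical sub-dictionaries, listed from most to least preferred.
# All entries are invented for illustration.
SUB_DICTIONARIES: List[Dict[str, List[str]]] = [
    {"under tiden": ["meanwhile.ab"]},          # multiword units
    {"under": ["under.pp", "during.pp"]},       # domain-specific units
    {"under": ["under.pp", "miracle.nn"]},      # general dictionary
]


def lookup(unit: str, hierarchy: List[Dict[str, List[str]]]) -> List[str]:
    """Collect all unique alternatives for a unit, in hierarchy order."""
    seen = set()
    alternatives = []
    for sub_dictionary in hierarchy:
        for alternative in sub_dictionary.get(unit, []):
            if alternative not in seen:  # duplicates are presented only once
                seen.add(alternative)
                alternatives.append(alternative)
    return alternatives
```

Calling `lookup("under", SUB_DICTIONARIES)` returns the alternatives of the preferred sub-dictionary first and drops the repeated `under.pp` from the general one.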
A dictionary hierarchy
Tagging
- Simple syntactic analysis based on n-gram processing (HunPos), selecting the best alternatives provided by the dictionary
- Example: då om
  - då.ab->then.ab / då.sn->because.sn
  - om.sn->if.sn / om.pp->about.pp
- Default alternatives are provided for words outside the dictionary.
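The filtering step could look roughly like the sketch below. It assumes the tag sequence has already been produced by a tagger (no HunPos call is shown), and the data layout and the function name `filter_alternatives` are hypothetical.

```python
from typing import Dict, List, Tuple

# Alternatives from the slide: (source tag, target entry) pairs per word.
DICTIONARY: Dict[str, List[Tuple[str, str]]] = {
    "då": [("ab", "then.ab"), ("sn", "because.sn")],
    "om": [("sn", "if.sn"), ("pp", "about.pp")],
}


def filter_alternatives(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Keep, per token, the dictionary alternatives that agree with its tag."""
    result = []
    for word, tag in tagged:
        entries = DICTIONARY.get(word)
        if entries is None:
            # Word outside the dictionary: default alternative, word copied.
            result.append([f"{word}.{tag}"])
            continue
        matching = [target for source_tag, target in entries if source_tag == tag]
        # Assumption: if the tag matches no entry, keep all alternatives.
        result.append(matching or [target for _, target in entries])
    return result
```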
Parsing
- Chart parsing using the Uppsala Chart Processor, UCP.
- UCP is an in-house chart parser with a procedural grammar formalism.
- The grammar writer inserts active and passive edges in the chart, thereby driving the processing forward.
- Processing is non-deterministic, i.e. all combinations of active and passive edges are explored.
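To illustrate how active and passive edges combine exhaustively, here is a minimal bottom-up chart parser. It is not UCP: it swaps the procedural grammar formalism for a toy context-free grammar, and the grammar, lexicon and names (`Edge`, `parse`) are assumptions made for the example.

```python
from collections import namedtuple

# An edge spans tokens [start, end); `needed` lists what is still missing.
# Active edges have a non-empty `needed`; passive edges have an empty one.
Edge = namedtuple("Edge", "start end label needed")

# Toy grammar and lexicon (illustrative assumptions).
GRAMMAR = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V",))]
LEXICON = {"the": "Det", "car": "N", "stopped": "V"}


def parse(words):
    """Explore all combinations of active and passive edges; return the chart."""
    chart = set()
    agenda = [Edge(i, i + 1, LEXICON[w], ()) for i, w in enumerate(words)]
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        chart.add(edge)
        new_edges = []
        if not edge.needed:
            # Passive edge: start every rule whose first symbol it satisfies.
            for lhs, rhs in GRAMMAR:
                if rhs[0] == edge.label:
                    new_edges.append(Edge(edge.start, edge.end, lhs, rhs[1:]))
            # Extend active edges that end where this passive edge starts.
            for other in chart:
                if other.needed and other.end == edge.start \
                        and other.needed[0] == edge.label:
                    new_edges.append(
                        Edge(other.start, edge.end, other.label, other.needed[1:]))
        else:
            # Active edge: extend it with passive edges starting where it ends.
            for other in chart:
                if not other.needed and other.start == edge.end \
                        and other.label == edge.needed[0]:
                    new_edges.append(
                        Edge(edge.start, other.end, edge.label, edge.needed[1:]))
        agenda.extend(new_edges)
    return chart
```

A complete analysis exists when the chart contains a passive `S` edge spanning the whole input, for example `Edge(0, 3, "S", ())` for "the car stopped".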
Ranking rules
- express priorities between competing analyses
- are formulated in linguistic terms
- competing analyses are submitted to the transfer module in the order in which they were ranked
Transfer
- Transfer rules are applied to feature structures generated by the parser.
- Rule application is implemented as unification of feature structures.
- Transfer rules are expressed in a PATR-like formalism.
- All applicable rules are applied.
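Unification of feature structures can be sketched as follows, with feature structures modelled as nested dictionaries with atomic leaves. This is a simplification made for illustration: it does not handle shared (reentrant) paths or variables, and the function name `unify` is an assumption.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None when the two structures clash.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for feature, value in b.items():
            if feature in merged:
                unified = unify(merged[feature], value)
                if unified is None:
                    return None  # clash somewhere below this feature
                merged[feature] = unified
            else:
                merged[feature] = value
        return merged
    # Atomic values unify only if they are identical.
    return a if a == b else None
```

For example, `unify({"numb": "sing"}, {"person": 3})` merges the two structures, while `unify({"numb": "sing"}, {"numb": "plur"})` fails and returns `None`.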
Transfer rule format
    LABEL name
    SOURCE path expression(s)
    TARGET path expression(s)
    TRANSFER ?x1 ?x2
Types of transfer rules
- Copy a feature
- Delete a feature
- Transfer a feature structure
- Define a target feature structure
Copy a feature
    LABEL phr.cat
    SOURCE <* phr.cat> = x
    TARGET <* phr.cat> = x
    TRANSFER
Delete a feature
    LABEL gender
    SOURCE <* gender> = ANY
    TARGET <*> = <*>
    TRANSFER
Transfer a feature structure
    LABEL subj
    SOURCE <* subj> = ?x1
    TARGET <* subj> = ?x2
    TRANSFER ?x1 ?x2
Define a target feature structure
    LABEL ta.bort-remove
    SOURCE <* lex sym> = ta.vb+bort.ab.1
           <* word.cat> = verb
    TARGET <* lex> = remove.vb.1
           <* word.cat> = verb
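The combined effect of the four rule types can be approximated procedurally, as in the sketch below. It is a loose illustration rather than the PATR-like rule interpreter itself: the tables `LEXICAL_RULES`, `COPIED` and `DELETED`, the function name `transfer`, and the sample feature values are all assumptions.

```python
# Lexical transfer table, taken from the ta.bort-remove rule.
LEXICAL_RULES = {"ta.vb+bort.ab.1": "remove.vb.1"}
COPIED = {"phr.cat", "word.cat"}   # features copied unchanged
DELETED = {"gender"}               # features with no target counterpart


def transfer(source: dict) -> dict:
    """Map a source feature structure to a target feature structure."""
    target = {}
    for feature, value in source.items():
        if feature in DELETED:
            continue                                   # delete a feature
        if feature == "lex":
            symbol = value["sym"]
            # Define a target structure; unknown symbols are copied as-is.
            target["lex"] = LEXICAL_RULES.get(symbol, symbol)
        elif isinstance(value, dict):
            target[feature] = transfer(value)          # transfer a substructure
        elif feature in COPIED:
            target[feature] = value                    # copy a feature
        # Any other atomic feature is dropped in this sketch.
    return target
```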
Generation
- The generation module operates on the preferred feature structure produced by the transfer component.
- The module is responsible for creating words from feature bundles and arranging them in the order specified by the English grammar.
- Generation rules are expressed in a PATR-like formalism.
- Rule application is based on unification and concatenation.
A generation rule
    % "The process includes reconditioning"
    LABEL subj-verb-obj.dir
    x1 ---> x2 x3 x4 :
        <x1 subj> = <x2>
        <x1 verb> = <x3>
        <x1 obj.dir> = <x4>
        <x3 numb> = <x2 numb>
        <x3 person> = <x2 person>
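What the subj-verb-obj.dir rule does can be sketched in a few lines: it concatenates the three constituents in English order and requires the verb to agree with the subject in number and person. The `form` feature holding the surface string and the function name are hypothetical, and the agreement check stands in for real unification.

```python
from typing import List, Optional


def generate_subj_verb_obj(clause: dict) -> Optional[List[str]]:
    """Order subject, verb and direct object; fail on an agreement clash."""
    subj, verb, obj = clause["subj"], clause["verb"], clause["obj.dir"]
    # <x3 numb> = <x2 numb> and <x3 person> = <x2 person>:
    # a feature missing on either side is treated as compatible.
    for feature in ("numb", "person"):
        if feature in verb and feature in subj and verb[feature] != subj[feature]:
            return None
    # x1 ---> x2 x3 x4: concatenate in subject, verb, object order.
    return [subj["form"], verb["form"], obj["form"]]
```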
Language resources
- Dictionaries
  - Swedish dictionary
  - English dictionary
  - Translation dictionary
- Tagging rules and correction rules
- Grammars
Fall-back
- Lexical
  - Out-of-vocabulary units (OOVs)
  - Tagger creates default source word description
  - Source word copied into target
- Parsing
  - No complete parse
  - Use partial parses
- Generation
  - No complete generation
Partial parsing
- Select the best configuration of partial parses and translate them one by one.
- The best configuration of partial parses is assumed to be the one with the smallest number of partial parses.
- The selection uses a greedy search algorithm.
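One possible greedy selection is sketched below. The slide does not specify the greedy criterion, so taking the longest partial parse at each position is an assumption, as are the span representation and the function name. A greedy choice of this kind keeps the number of partial parses low but is not guaranteed to find the minimum.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # token span [start, end) covered by a partial parse


def select_partial_parses(spans: List[Span], length: int) -> List[Span]:
    """Greedily cover tokens 0..length with few partial parses.

    At each position the longest span starting there is taken; a token
    without any span becomes a one-token unit of its own.
    """
    selected = []
    position = 0
    while position < length:
        candidates = [s for s in spans
                      if s[0] == position and position < s[1] <= length]
        best = max(candidates, key=lambda s: s[1],
                   default=(position, position + 1))
        selected.append(best)
        position = best[1]
    return selected
```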
Aspects of RBMT
- Translations close to the source language
- Translation quality dependent on the quality and coverage of the language resources
- A labor-intensive task to build the language resources
- Fall-back mechanisms to account for gaps in the language resources
Aspects of SMT
- Pros
  - Can be built fast on previous translations
  - Idiomatic translations
- Cons
  - Translation quality dependent on the quality and size of the training data
  - No guaranteed translation relation to the source language
    - Words may be lost
    - Words may be inserted
Convertus applications
- The syllabus translator: Multra, Maltra, SMT
- Technical translation for industrial clients: Multra, SMT
- Gisting for in-house purposes: SMT
How to run the MT service?
- Users
  - Integrated in the user's normal workflow
    - in the education databases of the universities: the Syllabus Translator
    - as a plug-in to Trados: the translators' usual way of working
  - Independent service: the BTS platform
- Developers
  - Terminal for large-scale testing
Bologna Translation Service, BTS
- A translation platform developed in the Bologna Project (www.bologna-translation.eu)
- Supports dynamic learning of MT
- See further https://www.convertus.se/sv/oversattningsplatform
Manual post-editing
- A platform for manual post-editing of translation segments is provided: edit, save, approve.
- Approved translation segments are stored in the translation memory and re-used.
Translation memory, TM
- TM comprises approved translation segments.
- TM grows as the service is used.
- Search in TM is the first option in the translation process.
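The TM-first strategy can be sketched as below. The exact-match dictionary lookup is an assumption made for brevity (the slides do not say how segments are matched), and the function names and the `engine` callback are hypothetical.

```python
from typing import Callable, Dict


def translate_segment(segment: str,
                      memory: Dict[str, str],
                      engine: Callable[[str], str]) -> str:
    """Return the approved translation if the TM has one, else call the engine."""
    hit = memory.get(segment)
    if hit is not None:
        return hit
    return engine(segment)


def approve(segment: str, translation: str, memory: Dict[str, str]) -> None:
    """Store an approved translation so that later requests re-use it."""
    memory[segment] = translation
```

Every call to `approve` makes the memory grow, so a segment that once went to the engine is served from the TM the next time it appears.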
Automatic post-editing
- The translation memory is searched for manual post-edits at regular intervals.
- Post-edits are reformulated as automatic post-editing rules, APEs.
- APEs are appended to the translation engines, contributing to their quality.
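How post-edits might be turned into rules is not described in the slides. The sketch below is one simple possibility, offered purely as an assumption: it aligns the MT output with its post-edit using Python's standard `difflib` module and records each replaced word sequence as a rewrite rule. The function names and the sample sentences are invented.

```python
import difflib
import re
from typing import Dict


def extract_rules(mt_output: str, post_edit: str) -> Dict[str, str]:
    """Derive phrase replacement rules from one (MT output, post-edit) pair."""
    mt_tokens, pe_tokens = mt_output.split(), post_edit.split()
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    rules = {}
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "replace":  # insertions and deletions are ignored here
            rules[" ".join(mt_tokens[a1:a2])] = " ".join(pe_tokens[b1:b2])
    return rules


def apply_rules(text: str, rules: Dict[str, str]) -> str:
    """Apply each rule to whole-word matches in new MT output."""
    for wrong, right in rules.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, lambda _match, r=right: r, text)
    return text
```

A production system would presumably also need to decide when a rule is general enough to apply, for instance by requiring that the same edit has been observed several times.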
Project DigInclude
- Facilitate access to digital information provided by Swedish authorities
- Convertus' role is to provide translation services for immigrant languages
- Coordinator: SICS EAST SWEDISH ICT AB