Computational Dictionaries & Terminology

February 1 and 6, 2006
Dr. Andreas Eisele
Computerlinguistik & DFKI
Language Technology I, WS 2005/2006

Computational Dictionaries & Terminology
- Motivation
- Definitions
- Relevant Standards
- Important Resources
- Automatic Acquisition
- Outlook

Language Technology I: Computational Dictionaries/Terminology (WS 05/06)

Motivation for computational dictionaries

Natural language processing needs knowledge about words:
- Morphological behavior (how words look and how they are modified)
- Syntactic behavior (part of speech, relation to other words)
- Semantic behavior (how words relate to meanings)

But:
- Construction of lexical resources is a major investment
- Vocabularies can be very large, e.g.
  - Duden's Deutsche Rechtschreibung: ~115,000 entries
  - the Oxford English Dictionary (OED): ~301,100 entries
  - Der Große Muret Sanders: ~560,000 entries
  - technical terminologies may contain millions of entries
- Words can have many meanings and can be used in multiple ways, e.g. the entry for "get" in Der Kleine Muret Sanders runs to ~340 lines
- Selection of vocabulary may depend on the application; subsets may be almost unlimited in size (person/place/company names) and may change quickly over time (product names, computer jargon)
- Theory-specific description of syntactic/semantic behavior makes re-use difficult

Motivation for terminology

"terminology as a structured set of concepts and their designations in a particular subject field can be considered the infrastructure of specialized knowledge. Technical writing and technical documentation are impossible without properly using terminological resources. High-quality multilingual terminologies have become scarce and much desired commodities for language and knowledge industries." [Galinski/Budin 96]

Research field: terminology science
- "the scientific study of the concepts and terms found in special languages" [ISO 1087]

Practical field of application: terminology management
- creation of subject-field specific terminologies
- recording in terminology databases, dictionaries, lexicons, specialized encyclopedias

Some definitions (from Wikipedia)

A dictionary is a list of words with their definitions, a list of characters with their glyphs, or a list of words with corresponding words in other languages. In some languages, words can appear in many different forms, but only the lemma form appears as the main word or headword in most dictionaries. Many dictionaries also provide pronunciation information; grammatical information; word derivations, histories, or etymologies; illustrations; usage guidance; and examples in phrases or sentences.

The word thesaurus means a listing of words with similar, related, or opposite meanings (this meaning of thesaurus dates back to Roget's Thesaurus). It can also mean, for example, a book of jargon for a specialized field or, more technically, a list of subject headings and cross-references used in the filing and retrieval of documents (or indeed papers, certificates, letters, cards, records, texts, files, articles, essays and perhaps even manuscripts), film, sound recordings, machine-readable media, etc. The first example of this genre, Roget's Thesaurus, was published in 1852, having been compiled earlier, in 1805, by Peter Roget. Entries in Roget's Thesaurus are not listed alphabetically but conceptually and are a great resource for writers.

A glossary is a list of terms with the definitions for those terms. Traditionally, a glossary appears at the end of a book and includes terms within that book which are either newly introduced or at least uncommon. In a more general sense, a glossary contains explanations of concepts relevant to a certain field of study or action. In this sense, the term is contemporaneously related to ontology.

More definitions

Terminology, in its general sense, simply refers to the usage and study of terms, i.e. words and compound words generally used in specific contexts. The term "terminology" may also refer to a more formal discipline which systematically studies the labelling or designating of concepts particular to one or more subject fields or domains of human activity, through research and analysis of terms in context, for the purpose of documenting and promoting correct usage. This study can be limited to one language or can cover more than one language at the same time (multilingual terminology, bilingual terminology, and so forth).

In Information Science, an ontology is the product of an attempt to formulate an exhaustive and rigorous conceptual schema about a domain. An ontology is typically a hierarchical data structure containing all the relevant entities and their relationships and rules within that domain (e.g., a domain ontology). However, a computational ontology does not have to be hierarchical at all. The computer science usage of the term ontology is derived from the much older usage of the term ontology in philosophy.

WordNet is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools can be downloaded and used freely.

Some Important Resources

- Celex DB for Dutch, English, German: http://www.ru.nl/celex/
- EAGLES guidelines including computational lexicons: http://www.ilc.cnr.it/eagles96/browse.html
- ELRA catalogue (http://www.elda.org/rubrique2.html):
  - 57 monolingual lexicons
  - 41 bi- and multilingual lexicons
  - 22 terminological resources
- Wordnet, EuroWordnet
- FrameNet, Propbank
- Eurodicautom (1973) is the European Commission's multilingual term bank, based on the phrasal automatic dictionary Dicautom (1964) and the translation dictionary Euroterm (1962-68). Original languages: Dutch, French, German and Italian; added later: Danish, English (1973), Greek (1981), Portuguese and Spanish (1986), Finnish and Swedish (1995). Latin is also available.
- IATE (in the process of replacing Eurodicautom since 2000) is the EU inter-institutional terminology database system. It will be used for the collection, dissemination and shared management of EU-specific terminology. This system will be multilingual and will be available to EU agencies and institutions, freelance translators and European citizens.
- Eurovoc is a multilingual thesaurus covering the fields in which the European Communities are active; it provides a means of indexing the documents in the documentation systems of the European institutions and of their users. Eurovoc 4.2 exists in 16 official languages, but the missing ones will be added.
- IPC (International Patent Classification): see separate slides

Wordnet: a semantic lexicon

- Wordnet groups together words according to lexical semantic relations like synonymy, hyponymy, meronymy, antonymy, etc.
- Covers 4 open part-of-speech classes: nouns, verbs, adjectives, adverbs
- Words are assigned to sets of synonyms (synsets); all other relations hold between synsets
- WN has been used in many experiments on semantic disambiguation, IR, ...
- WN has been ported to many other languages; attempts to build cross-lingual versions are under way
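The synset organisation described on this slide can be sketched as a small data structure. This is a toy illustration only: the class `Synset`, the identifiers such as `dog.n`, the glosses, and the three entries are invented for the example and are not taken from the WordNet database or its tools.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous lemmas plus links to other synsets (toy model)."""
    name: str                      # hypothetical identifier, e.g. "dog.n"
    lemmas: list                   # the words that share this meaning
    gloss: str                     # short definition
    hypernyms: list = field(default_factory=list)  # names of more general synsets


# Invented miniature lexicon; relations hold between synsets, not words.
SYNSETS = {
    "dog.n": Synset("dog.n", ["dog", "domestic dog"], "a domesticated canid", ["canine.n"]),
    "canine.n": Synset("canine.n", ["canine", "canid"], "a member of the dog family", ["animal.n"]),
    "animal.n": Synset("animal.n", ["animal", "beast"], "a living organism that can move"),
}


def synsets_for(word):
    """Return every synset that lists the word among its lemmas."""
    return [s for s in SYNSETS.values() if word in s.lemmas]


def hypernym_chain(name):
    """Follow the first hypernym link upwards until a root synset is reached."""
    chain = []
    current = SYNSETS[name]
    while current.hypernyms:
        parent = current.hypernyms[0]
        chain.append(parent)
        current = SYNSETS[parent]
    return chain


print([s.name for s in synsets_for("canid")])
print(hypernym_chain("dog.n"))
```

Because hypernymy is stored on the synset rather than on the individual word, "dog" and "domestic dog" share the same chain of more general concepts.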

Some Terminology Standards

Unfortunately, these standards refer to the data formats, not to the contents of terminology files.

- TBX: Termbase Exchange format. This standard allows for the interchange of terminology data including detailed lexical information. The framework for TBX is provided by two ISO standards, ISO 12620 and ISO 12200, and by ISO Committee Draft 16642, known as TMF or Terminological Markup Framework. ISO 12620 provides an inventory of well-defined data categories with standardized names that function as data element types or as predefined values.
- OLIF: Open Lexicon Interchange Format. OLIF is an open, XML-compliant standard for the exchange of terminological and lexical data. Although originally intended as a means for the exchange of lexical data between proprietary machine translation lexicons, it has evolved into a more general standard for terminology exchange.

Automatic Acquisition

- Technical terminology exists in large and quickly growing quantities
- A good way of keeping track is to mine existing documents for technical terms
- Statistical criteria can be used for monolingual term extraction, but based on frequencies alone it is hard to separate the wheat from the chaff
- Recent effort at DFKI (in collaboration with the European Patent Office): automatic terminology extraction from translated documents
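The point about frequencies can be made concrete with a small sketch. It ranks the words of a toy domain text once by raw frequency and once by the ratio of their relative frequency to that in a reference text. The ratio criterion, the add-one smoothing, the two sample texts, and all function names are assumptions chosen for illustration; they are not the method used in the DFKI work.

```python
from collections import Counter

PUNCT = ".,;:()"


def tokens(text):
    """Lower-case whitespace tokens with surrounding punctuation removed."""
    return [t.strip(PUNCT).lower() for t in text.split() if t.strip(PUNCT)]


def rank_by_frequency(domain_text):
    """Rank words by raw frequency in the domain text alone."""
    return [word for word, _ in Counter(tokens(domain_text)).most_common()]


def rank_by_ratio(domain_text, reference_text):
    """Rank words by relative frequency in the domain text divided by
    (add-one smoothed) relative frequency in a general reference text."""
    dom = Counter(tokens(domain_text))
    ref = Counter(tokens(reference_text))
    n_dom, n_ref = sum(dom.values()), sum(ref.values())

    def score(word):
        return (dom[word] / n_dom) / ((ref[word] + 1) / (n_ref + 1))

    return sorted(dom, key=score, reverse=True)


# Invented sample texts, for illustration only.
DOMAIN = ("the absorbing liquid removes the contaminant from the gas phase "
          "and the liquid is recycled")
REFERENCE = ("the cat sat on the mat and the dog slept in the sun "
             "while the rain fell from the sky")

print(rank_by_frequency(DOMAIN)[:3])
print(rank_by_ratio(DOMAIN, REFERENCE)[:3])
```

On this toy input the raw count puts the function word "the" first, while the ratio moves "liquid" to the top, which illustrates why a frequency list by itself is a weak term filter.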

Terminology Extraction from Patents

Cooperation between DFKI and the European Patent Office (EPO)

Goal: extract parallel terminologies for EN, DE, ES, FR from translated patent documents

Motivation:
- Technical documentation makes up a large share of the language industry's raw material; the vocabulary is commercially interesting
- Manual construction of unrestricted multilingual terminologies would be prohibitively expensive
- Translated documents exist in large volumes, as do techniques for sentence/word/phrase alignment
- IPC (a hierarchical system of about 70K classes) may help to relate extracted terms to ontologies
- Test-bed for scalability of tools and resources
  - How well do our tools cover technical texts?
  - Can we acquire new lexical information from data?
- First step towards MT for technical documents

Terminology Extraction from Patents

History and current status:
- Techniques were prototypically implemented in a feasibility study for WIPO ('03, via acrolinx GmbH)
- Call for Tender by EPO in August '05
- Bids and results on test data due in September
- EPO received 14 bids; DFKI delivered the best results for DE-EN and ES-EN and was among the best for FR-EN
- System for production under construction; it has started processing the first batch of data (2.9M docs, >90GB)

Terminology Extraction from Patents

The International Patent Classification (IPC)
- based on the Strasbourg Agreement (1971)
- used by >100 national authorities
- indispensable for finding prior art
- hierarchical structure, consisting of
  - eight sections (A..H)
  - 120 classes (A01..H05)
  - 628 subclasses (A01B..H05K)
  - 69,000 subdivisions (e.g. A01B 1/02 or H05K 10/00)
- regularly updated (currently in force: 7th edition)
- officially released in EN and FR by WIPO, but translations to many languages are available from national authorities

Sections:
A: human necessities
B: performing operations; transporting
C: chemistry; metallurgy
D: textiles; paper
E: fixed constructions
F: mechanical engineering; lighting; heating; weapons; blasting
G: physics
H: electricity

Example entries:
A 01         AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING
A 01 B       SOIL WORKING IN AGRICULTURE OR FORESTRY; PARTS, DETAILS, OR ACCESSORIES OF AGRICULTURAL MACHINES OR IMPLEMENTS, IN GENERAL
A 01 B 1/00  Hand tools
A 01 B 1/02  spades, shovels
H 05         ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
H 05 K       PRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
H 05 K 10/00 Arrangements for improving the operating reliability of electronic equipment, e.g. by providing a similar stand-by unit
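The hierarchy encoded in a symbol such as A01B 1/02 can be made explicit with a short parser. This is a minimal sketch: the name `parse_ipc`, the class `IpcSymbol`, and the digit-length limits in the regular expression are assumptions made for the example, not part of any official IPC tooling.

```python
import re
from dataclasses import dataclass

# Section letter, two-digit class, subclass letter, main group "/" subgroup.
# Optional spaces are allowed so that "A 01 B 1/02" and "A01B 1/02" both parse.
# The digit-length bounds are an assumption for this sketch.
IPC_PATTERN = re.compile(r"^([A-H])\s*(\d{2})\s*([A-Z])\s*(\d{1,4})/(\d{2,6})$")


@dataclass
class IpcSymbol:
    section: str      # e.g. "A"
    ipc_class: str    # e.g. "A01"
    subclass: str     # e.g. "A01B"
    main_group: str   # e.g. "1"
    subgroup: str     # e.g. "02"


def parse_ipc(symbol):
    """Split a full IPC symbol into its hierarchical levels."""
    match = IPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    section, digits, letter, main_group, subgroup = match.groups()
    return IpcSymbol(
        section=section,
        ipc_class=section + digits,
        subclass=section + digits + letter,
        main_group=main_group,
        subgroup=subgroup,
    )


print(parse_ipc("A01B 1/02"))
print(parse_ipc("H 05 K 10/00").subclass)
```

Each level is a prefix of the next, so documents can be grouped at any granularity simply by comparing the corresponding field.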

Terminology Extraction from Patents

Some research questions related to the IPC:
- Automatic classification
  - Can IPC classes be identified automatically? (So far, classification and search are done by ~6500 experts)
- Ontology construction
  - How does the IPC relate to the terminologies used in the various domains?
  - Can we (semi-)automatically construct/extend these terminologies given the documents?
- Word sense disambiguation
  - Can a given IPC class help to identify the meaning/translation of a given term?

Terminology Extraction from Patents

Setup:
- Use linguistic tools for corpus annotation
  - POS-tagging
  - Phrase recognition
  - Lemmatization
- Use statistical tools for alignment
  - GIZA++ from Franz Och
  - Own algorithms based on word similarities
- Integrate module outcomes, transform into OLIF entries
- Improvement in 2nd phase: feed-back of modifications to basic modules
- Manual inspection and error analysis will be used to improve algorithms as long as the project is ongoing
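To give a feel for what statistical word alignment works with, the following sketch scores word pairs in a sentence-aligned toy corpus by their co-occurrence, using the Dice coefficient. This is a deliberately simple stand-in: it is neither GIZA++ nor the project's own algorithms, and the four sentence pairs and all function names are invented for the example.

```python
from collections import Counter
from itertools import product


def dice_table(pairs):
    """Score every co-occurring (source, target) word pair with the Dice
    coefficient 2*c(s,t) / (c(s) + c(t)), counting sentence pairs."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }


def best_translation(word, table):
    """Return the target word with the highest score, or None if unseen."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Invented sentence-aligned toy corpus (German, English).
PAIRS = [
    ("das haus ist klein", "the house is small"),
    ("das haus ist alt", "the house is old"),
    ("das buch ist klein", "the book is small"),
    ("ein buch", "a book"),
]

table = dice_table(PAIRS)
for word in ("haus", "buch", "klein"):
    print(word, "->", best_translation(word, table))
```

With so little data some pairs cannot be separated: "das" and "ist" always occur together here, so "the" and "is" receive identical scores for both. Larger corpora and the linguistic annotation listed above are what reduce this kind of ambiguity.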

Terminology Extraction: Architecture

[Diagram. Components shown: Parallel Documents; Statistical Word Alignment; Word-Level Matches; Linguistic Processing; Augmented Documents (POS, chunks, lemmata); Integration; Phrase-Level Matches; Selection and OLIF transformation; OLIF DB]

Terminology Extraction: Architecture II

[Diagram with the same components as on the preceding architecture slide]

Examples for Patent Terminology

- Postbestimmungsortinformationsspeichereinrichtung = mail destination information memory means
- Informationsdurchforstungssteuerungseinrichtung = information browsing control means
- Hypervideonachrichtversendungsverarbeitungseinrichtung = hypervideo message posting processing means
- Gasphasenverunreinigungsabsorptionsflüssigkeit = gas phase contaminant absorbing liquid

Exercises

- Do a web search for a freely downloadable English-German bilingual lexical resource (try to find a large one). How many entries does it contain? How many translations for the words "aktuell" and "offender" can you find in it? Which of these terms can you find in WordNet? Compare these numbers with the ones given in a recent (superficial) comparison of web-based dictionaries at http://tomorrow.msn.de/internet/webguides/web-dolmetscher
- Look through a list of translations of "aktuell" (+inflection) extracted by a statistical approach from a parallel corpus. Are there good entries in the automatically found data that the lexicon was missing?
- Look in the ELRA catalog for English-German bilingual lexical/terminological resources. Can you find anything useful? How much would it cost? Is it possible to obtain information about the level of detail contained in these resources without buying them?