Core Linguistic Resources for the World's Languages

Christopher Cieri, Mike Maxwell, Stephanie Strassel
{ccieri,maxwell,strassel}@ldc.upenn.edu
University of Pennsylvania
Linguistic Data Consortium and Department of Linguistics
3615 Market Street, Philadelphia, PA 19104-2608 U.S.A.
www.ldc.upenn.edu
ELSNET, ENABLER, ICWLR 2003, Paris

Scoping the Problem
  6700 languages (according to Ethnologue)
  Assume international consortia create complete LRs for 50 languages/year at $700K/language
  Bottom line: $4.7B and 134 years
  More importantly, the process of building LRs changes with the size of the language, its history of literacy, etc.
    E.g.: raw text acquisition; only 1500 languages written
      Electronic harvest
      Scanning/keyboarding of written text
      Paying native speakers to create original works
      Designing an orthography, interviewing native speakers and transcribing
  The motivation for building LRs also changes with language
    Culture & folk medicine versus international markets
    Understanding remote points of view
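The bottom-line figures follow directly from the three numbers on the slide. A minimal sketch of that arithmetic is below; the function name and parameter names are illustrative and not from the original deck.

```python
def scope_estimate(languages, languages_per_year, cost_per_language_usd):
    """Return (years, total cost in USD) for building LRs at a fixed pace."""
    years = languages / languages_per_year
    total_cost = languages * cost_per_language_usd
    return years, total_cost


# Figures taken from the slide: 6700 languages, 50 languages/year, $700K each.
years, total_cost = scope_estimate(6700, 50, 700_000)
print(f"{years:.0f} years, ${total_cost / 1e9:.2f}B")
```

The product comes to $4.69B, which the slide rounds to $4.7B.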

Proposal Features
  Design Core Project - must be possible
    Require <= 5 years
    Budget should be conceivable given our previous collective experience
  Manageable set of core languages
    many speakers worldwide, local experts & native-speaker annotators
    raw resources available on web
  Manageable set of core resources
    text, parallel text, translation lexicon, entity tagging
    grammatical sketch, tokenizer, morph-analyzer
  Publish to encourage extension
    Language resources & metadata describing them
    Corpus specifications & tools
  Coordinate work on LRs to minimize duplication of effort
  Promote the plan to international coordinating bodies, national governments, commercial sponsors, researchers

Pre-History
  1983: Penn Language Analysis Center founded; builds textbases, bilingual dictionaries in 35 languages
  1992: LDC founded to distribute LRs for many languages
  1995: CALLHOME corpora for Large Vocabulary Continuous Speech Recognition
    200 telephone conversations of 20-30 minutes
    Complete transcripts
    Pronouncing lexicon
    English, Spanish, Mandarin, Egyptian Arabic, German, Japanese
  1996: CALLFRIEND corpora for Language Identification
    200 telephone conversations of 20-30 minutes
    American English (Southern & Non-), Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese (Mainland & Taiwan), Spanish (Caribbean & Non-), Tamil, Vietnamese

Recent History
  1999: TIDES Planning begins
    news understanding system for English speaking user
    multilingual capabilities with rapid porting to new languages
  1999: JHU Workshop on rapid development of statistical machine translation
  2000: LDC completes 50 language TIDES VOA collection
  2001: TIDES reorganized with 3 primary & 3 secondary languages
    English, Mandarin, Arabic
    Spanish, Japanese, Korean
  2002: TIDES Surprise Language experiments announced; LDC begins resource survey in preparation
  2002: ICWLR planning meeting
  2003: Surprise Language experiments
    Data collection dry run in Cebuano
    Data collection, technology development and evaluation in Hindi

LR Survey
  Preparation for TIDES Surprise Language Experiments
    Given that LDC would have no prior knowledge of the Surprise Language
    And that, with the wrong choice, the experiment could become mired
    LDC proposed the survey to inform the program manager's choice and to emphasize preparation over scramble
    Survey avoids gaming the experiment by permanently changing the landscape.
  Based upon Ethnologue
    Limited to languages with 1,000,000+ speakers
    Temporarily excluded well studied languages (Chinese, French)
    Excluded languages all of whose speakers also speak another language with a greater number of speakers (Cajun English, Sicilian)
    Excluded languages that are not written.
  Performed triage on remaining languages
    Developed decision tree where negative answers demote a language
    Questions researched roughly in triage order
    Now have triage results for 150/320 languages
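The filtering and triage steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LDC's actual procedure: the record fields, function names, and the rule of ranking by a simple count of negative answers are all hypothetical, and the sample records are made-up placeholders rather than survey data.

```python
from dataclasses import dataclass


@dataclass
class LanguageRecord:
    # Hypothetical fields mirroring the exclusion criteria on the slide.
    name: str
    speakers: int
    well_studied: bool
    all_speak_larger_language: bool
    written: bool


def passes_filters(lang, min_speakers=1_000_000):
    """Apply the four exclusion rules listed on the slide."""
    return (
        lang.speakers >= min_speakers
        and not lang.well_studied
        and not lang.all_speak_larger_language
        and lang.written
    )


def demotions(answers):
    """Count negative answers; each one demotes the language (assumed rule)."""
    return sum(1 for answer in answers if not answer)


def triage_order(candidates):
    """Order language names so that those with fewest negative answers come first."""
    return sorted(candidates, key=lambda name: demotions(candidates[name]))


# Placeholder records, invented purely for illustration.
lang_a = LanguageRecord("Language A", 5_000_000, False, False, True)
lang_b = LanguageRecord("Language B", 500_000, False, False, True)
print([lang.name for lang in (lang_a, lang_b) if passes_filters(lang)])
```

A real decision tree would weight early questions more heavily than later ones; the flat count here is only the simplest reading of "negative answers demote a language".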

Languages/Speakers
  [Figure: % of world's population who are native speakers (0% to 100%) plotted against languages ordered by number of native speakers (1 to 6,001)]

Survey Questions
  Demographics
    Language Name, SIL Code & Classification, Consider?
    Primary Country, Other Countries where spoken
    L1 Speakers Worldwide, % Who Speak Larger Language, Pivot
    Speakers with Internet Access, Predicted Growth, Net Hosts
    Is there a US Speaker Community? Literacy Rate? Students?
  Orthography
    Language Written, Simple Orthography, Separate Sentences/Words
  Linguistic Structure
    Simple Morphology? Dictionary? Special Considerations
  General Resources
    Newspaper, Radio/TV
    Descriptive Grammar in English, US Expert
    Bible, Book of Mormon, Other Translations
  Electronic Resources
    Standard Digital Encoding(s)
    100K word News Text
    100K word Parallel Text
    10K word Translation Dictionary, Morph Analyzer

Sample Summary
  Summary contains decisions. Full report contains underlying data.

SL Dry Run
  Planned duration: 1 week beginning March 5
  Multiple sites
    U. California at Berkeley, Carnegie-Mellon U., Johns Hopkins U., U. Maryland, MITRE, NYU, U. Pennsylvania/LDC, Sheffield U., USC/ISI
  Philippine language Cebuano selected.
    Survey had identified: Bible, small news text archive, several printed dictionaries and grammars
    8 hours into project, LDC had found 250,000 words of news texts, several other small monolingual and bilingual Cebuano texts, 4 computer-readable lexicons exceeding 24,000 entries in total
    Considerable overlap among what different sites discovered
  Disparity between survey and experiment results
    greater effort during the exercise
    survey search methodology
      searches for Cebuano + lexicon, dictionary, news
      missed resources labeled with alternative names (Bisayan and Visayan)
  Issues
    Overlap of effort inevitable
    No mode of electronic communication fast enough; LDC staff sat together
    Cebuano related closely to other Philippine languages, more distantly to other Malayo-Polynesian languages; difficult for non-speakers to distinguish Cebuano
      Identified unique Cebuano words without inflectional morphology
      Cebuano speakers checked the texts
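The search gap described above (queries built only on "Cebuano" missing resources labeled "Bisayan" or "Visayan") suggests expanding every query across all known names of the language. A minimal sketch, assuming plain keyword queries; the function name is illustrative and no particular search engine or API is implied.

```python
from itertools import product


def build_queries(language_names, resource_terms):
    """Pair every known name of the language with every resource keyword."""
    return [f"{name} {term}" for name, term in product(language_names, resource_terms)]


# Names and keywords taken from the slide.
names = ["Cebuano", "Bisayan", "Visayan"]
terms = ["lexicon", "dictionary", "news"]

for query in build_queries(names, terms):
    print(query)
```

Three names and three keywords give nine queries instead of the three the survey originally ran.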

SL Formal Evaluation
  Locate or build resources, develop & evaluate systems
  Language Hindi; results significantly different
    Orders of magnitude more text on web; problem shifted to processing
    Within a few hours basic resources located
    large resource conspiracy developed
  Encoding
    Hindi written in Devanagari
    Character encodings
      Standards such as Unicode & ISCII not commonly used.
      Every website had proprietary encodings; several sites had more than one
  Results
    All texts converted to Unicode (UTF-8) even though underspecified
    Team created finer encoding specification
    Texts also delivered in original form and ITRANS romanization
    Although character conversion took several weeks, integration of LRs and system development were accomplished in 1 month
    Hindi systems compared favorably in Topic Detection and Tracking, Cross Language IR, Content Extraction, Summarization and MT
  Recommendation from sites
    The surprise language experiment was a tremendous success! Let's NOT do it again.
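Converting text from a site-specific encoding to Unicode can be pictured as a table lookup per byte followed by UTF-8 serialization. The sketch below is only that simplest case: the mapping table is entirely hypothetical (it is not any real site's font encoding), and real conversions typically also need reordering and multi-byte rules that this ignores.

```python
def convert_to_unicode(raw, table):
    """Map each byte of a proprietary encoding to Unicode text via a lookup table.

    Raises ValueError on an unmapped byte so gaps in the table are not hidden.
    """
    chars = []
    for offset, byte in enumerate(raw):
        if byte not in table:
            raise ValueError(f"unmapped byte 0x{byte:02X} at offset {offset}")
        chars.append(table[byte])
    return "".join(chars)


# Hypothetical table: these byte values are invented for illustration only.
EXAMPLE_TABLE = {
    0x41: "\u0915",  # DEVANAGARI LETTER KA
    0x42: "\u093E",  # DEVANAGARI VOWEL SIGN AA
}

text = convert_to_unicode(b"AB", EXAMPLE_TABLE)
utf8_bytes = text.encode("utf-8")  # serialized form for delivery
print(text, utf8_bytes)
```

Failing loudly on unmapped bytes is a deliberate choice here: with one or more proprietary encodings per site, silent substitution would make it hard to tell which tables are still incomplete.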

Current & Forthcoming
  LDC has NSF funds to extend resource finding, building efforts to 6 languages
    working in collaboration with University of Maryland at Baltimore and Johns Hopkins University
    languages with >1,000,000 native speakers
    high probability of basic resources available electronically
    wide variety of morpho-syntactic features
    wide variety of geographical regions
    at least two closely related languages to support transfer experiments
    not likely to include European languages, Arabic, Chinese
    likely to include Dravidian, Indo-Aryan, Ingush, Malayo-Polynesian, Semitic, Turkic languages
  All data will be published
    metadata will be catalogued in OLAC as well as LDC Catalog
  TIDES community will fund continuation of the survey
    wants to extend the set of resources available for the 6 languages
    Specifically wants annotations to support information detection, extraction, summarization and translation

Proposal
  LDC obligated to current path for at least the next year.
  SuperConsortium (e.g. of ICWLR, COCOSDA, ELSNET, ENABLER Network, LDC, ELRA, Korterm/Kaist, GSK, LDCIL & Talkbank and other partners)
    promotes a minimum specification of core languages, core LRs, survey questions
    defines extended set of languages and resources on longer term
  LDC makes LR survey available to sites who submit complete survey answers for one new language
  SuperConsortium promotes the plan to EC, NSF, national funding agencies & commercial sponsors
  In many cases resources already exist but need to be identified and published.
  Resources collected & created are distributed through LDC, ELDA.
  Metadata for resources is published in OLAC and IMDI compliant forms and union catalogs
  Corpus specifications and annotation tools, including AGTK and tools created by Talkbank, are shared with other researchers, research groups to extend the LR catalog to new languages and for new data types.