Lexical loss as a shared linguistic innovation

FIN-CLARIN seminar on Fenno-Ugric Computational Linguistics
University of Helsinki, 2016-09-23
Juho Pystynen, juho.pystynen@helsinki.fi

Why lexical loss? Loss of inherited linguistic material is a simple and commonplace linguistic innovation. Lexical material is a numerically rich source of data: any given language variety can be characterized by the presence of thousands of lexemes. If a language's history is known in some detail (back to a recent proto-language stage), often several hundred lexemes can be analyzed as lost vs. not lost.

Modelling loss (0). Loss is not a mirror image of the innovation of new vocabulary.
Synonymy: words can persist in use even after the introduction of a new word with the same meaning. {a} > {a, b}
Multiple innovations: a word's "replacement" can itself also be lost later. {a} > {b} > {c}
Total loss: a lost word can end up replaced not by a new innovative word, but by an analytic expression or by a pre-existing synonym. {a} > {a, b} > {b}

Modelling loss (1). Given a set of vocabulary in a (possibly reconstructed) proto-language, we can, at a first approximation, model lexical loss as the presence vs. absence of a reflex of a given proto-form: *{a, b, c, d, e, f} > {a, c, d, f}. A simple metric of total losses in a given descendant variety will then be the total percentage of lexical material preserved vs. lost. Again at a first approximation, we can model the loss process as essentially random. Much finer-grained sociolinguistic and corpus analysis would be possible: recognition percentage within a speaker community, frequency of usage, median age of acquisition of a particular word, usage competition between synonyms, variation in what proto-states may be assumed for these factors, etc.
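
A minimal Python sketch of this coding, using hypothetical varieties and hypothetical presence/absence values, computes the retention percentage per descendant variety:

    # Hypothetical presence/absence codings: 1 = reflex attested, 0 = no reflex.
    reflexes = {
        "Variety1": [1, 0, 1, 1, 0, 1],   # *{a, b, c, d, e, f} > {a, c, d, f}
        "Variety2": [1, 1, 0, 1, 0, 0],
    }

    for variety, coded in reflexes.items():
        retained = sum(coded) / len(coded)
        print(f"{variety}: {retained:.0%} retained, {1 - retained:.0%} lost")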

Modelling loss (1). When a language variety has for some reason not been documented in detail (extinct, endangered, remote, etc.), some losses may be "virtual": a lexical item still remains in use, but has not been recorded by researchers. Working with a percentage measure, this can be modelled as simply an additional loss factor, "loss during documentation": a proto-lexeme is observed only if it both survived historically and was recorded, so p(observed retention) = p(historical retention) × p(documentation retention), and the observed loss rate correspondingly exceeds the historical one. At the low end of documentation, documentation loss is highly unlikely to be random, since early fieldwork surveys were often based on lists of basic vocabulary: 'five', 'head', 'woman' are unlikely to be lost in the documentation process; 'multitude', 'pancreas', 'midwife' are more likely.
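
A minimal Python sketch of how the two factors compose, with assumed (hypothetical) historical and documentation loss rates:

    # Hypothetical rates: a proto-lexeme shows up in the data only if it both
    # survived historically and was recorded during documentation.
    p_hist_loss = 0.40   # assumed historical loss rate
    p_doc_loss = 0.25    # assumed loss rate during documentation

    p_obs_retention = (1 - p_hist_loss) * (1 - p_doc_loss)
    p_obs_loss = 1 - p_obs_retention
    print(f"observed retention: {p_obs_retention:.0%}, observed loss: {p_obs_loss:.0%}")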

Modelling loss (1) Documentation loss is intrinsically not observable in (a given set of) data. Consequence 1: observed total losses are a measure of the comparative data, not directly of history. Consequence 2: comparison of lexical losses is unlikely to be immediately useful between languages documented to significantly differing degrees.

Modelling loss (2). Modelling losses as a binary variable at the lexeme level often runs into difficult edge cases. An etymological comparison is never a strictly proven fact. Solution: apply probabilistic modelling here as well. Exact figures are not possible to derive, but rough ballpark figures can be applied:
A highly regular etymology: 100% probability
A plausible etymology with irregularities: 50–90% probability
A speculative etymology: 1–10% probability
Lack of etymology: 0% probability
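
A minimal Python sketch of such probabilistic coding, with hypothetical point values chosen within these bands and hypothetical codings for one variety:

    # Hypothetical point values chosen within the rough bands above.
    etymology_probability = {
        "highly regular": 1.00,
        "plausible, with irregularities": 0.70,   # within the 50-90% band
        "speculative": 0.05,                      # within the 1-10% band
        "no etymology": 0.00,
    }

    # Hypothetical codings for one variety; the expected number of retained
    # lexemes is the sum of the per-lexeme probabilities.
    codings = ["highly regular", "highly regular", "plausible, with irregularities",
               "speculative", "no etymology"]
    expected_retained = sum(etymology_probability[c] for c in codings)
    print(f"expected retentions: {expected_retained:.2f} of {len(codings)}")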

An example dataset: Samoyedic. The Samoyedic languages: a relatively compact and homogeneously documented language group. Eight reasonably well documented languages: Nganasan, Tundra Enets, Forest Enets, Tundra Nenets, Forest Nenets, Selkup, Kamass, Mator. No overshadowing major literary languages. Language boundaries fairly clear. Substantial reconstruction work is available. Status as a part of the larger Uralic family allows improved grounding.

An example dataset: Samoyedic. A work-in-progress etymological database. Main lexical data source: the etymological dictionary of Janhunen (1977); addenda from later studies, e.g. Helimski (1986, 1993), Aikio (2002, 2006). Thus far in humble spreadsheet form: 790 lexemes (and growing) with rough probabilistic encoding, covering reconstruction, distribution of reflexes, and further etymology.


An example dataset: Samoyedic. Basic retention percentages:
Nganasan 61%
Selkup 80%
Enets 67%
Kamassian 57%
Yurats 17%
Koibal 36%
Tundra Nenets 87%
Mator 44%
Forest Nenets 78%

Modelling subgrouping. We need to allow for the possibility that different observed loss rates also reflect different historical loss rates, and not merely different documentation losses. Within a family tree model, we can assign loss rates not just to languages but, more generally, to branches. Could we, however, do the inverse: identify branches from losses?

Modelling subgrouping. Isolated retention percentages provide no subgrouping information: for any arbitrary tree, we can always assign branch retention rates whose product, from the top node down to each language, equals that language's observed rate.
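
A minimal Python sketch of this point, with hypothetical observed rates: a flat tree and a tree with a shared intermediate branch can both be given branch retention rates whose products reproduce the same per-language data.

    from math import prod

    # Hypothetical observed retention rates for two languages.
    observed = {"A": 0.60, "B": 0.80}

    # Flat tree: each language sits on its own branch from the proto-language.
    flat = {"A": [0.60], "B": [0.80]}

    # Nested tree: A and B share an intermediate branch with retention 0.9;
    # the terminal branches are chosen so the products still match the data.
    nested = {"A": [0.9, 0.60 / 0.9], "B": [0.9, 0.80 / 0.9]}

    for name, tree in [("flat", flat), ("nested", nested)]:
        for lang, branches in tree.items():
            assert abs(prod(branches) - observed[lang]) < 1e-9
        print(f"{name} tree reproduces the observed retention rates")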

Modelling subgrouping. We need to look at shared losses vs. retentions (between a given pair of varieties) on a word-by-word level to be able to locate common innovations. A shared loss (in the data) is, however, not automatically a common loss (in actual history). Indeed, for languages 1 and 2 with loss rates p1 and p2, we expect to see a shared loss rate of p1 × p2 purely by chance.

Modelling subgrouping. What we can do with ease is calculate the expected shared loss (or retention) rates and compare these with the attested rates. (With detailed statistical analysis, if we wish; for today's purposes a simple look at these metrics will suffice.) With probabilistic etymological coding, for a single lexeme we have, at a pinch:
p(shared retention) = p1 × p2 (1 if both are certain retentions, 0 if either is a certain loss)
p(shared loss) = (1 − p1) × (1 − p2) (0 if either is a certain retention, 1 if both are certain losses)
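
A minimal Python sketch of this comparison, with hypothetical per-lexeme retention probabilities for a pair of varieties: the "predicted" count assumes losses happen independently in the two varieties, while the "attested" count sums the per-lexeme joint probabilities from the coding.

    # Hypothetical per-lexeme retention probabilities in two varieties
    # (parallel lists; one entry per reconstructed proto-lexeme).
    p1 = [1.0, 0.9, 0.0, 0.7, 1.0, 0.0, 1.0, 0.05]
    p2 = [1.0, 0.0, 0.0, 0.9, 1.0, 0.0, 1.0, 0.9]

    N = len(p1)
    rate1 = sum(p1) / N                    # marginal retention rate, variety 1
    rate2 = sum(p2) / N                    # marginal retention rate, variety 2

    predicted_shared = N * rate1 * rate2   # chance expectation under independence
    attested_shared = sum(a * b for a, b in zip(p1, p2))

    print(f"shared retentions, predicted: {predicted_shared:.2f}")
    print(f"shared retentions, attested:  {attested_shared:.2f}")
    print(f"attested / predicted:         {attested_shared / predicted_shared:.0%}")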

Shared retentions
Nganasan – Selkup: predicted 361 shared items; attested 373 (103%)
Tundra Nenets – Forest Nenets: predicted 505 shared items; attested 565 (112%)
Yurats – Mator: predicted 58 shared items; attested 93 (160%)
Kamassian – Koibal: predicted 150 shared items; attested 260 (173%)
Main trend: generally elevated rates across the board.

Shared retentions. Phenomenon 1: reconstructed vocabulary is not known independently of the descendants. Lexemes surviving in only one language are usually not reconstructible (exception: words with a wider Uralic pedigree); lexemes surviving in no language are entirely unreconstructible. Observed retention rates are therefore slightly elevated, and loss rates slightly diminished:
p(L, accurate) = n_L / N (n_L = number of lexemes attested in variety L; N = total number of proto-lexemes)
p(L, observed) = n_L / (N − N0) (N0 = number of unreconstructible proto-lexemes)
If N0/N is small: p(L, observed) ≈ n_L/N × (1 + N0/N) = p(L, accurate) + p(L, 0), where p(L, 0) = p(L, accurate) × N0/N.
An approximately linear error factor for retention rates; in turn, a roughly constant error term for the predicted vs. observed ratio.
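
A minimal Python sketch of the resulting bias, with hypothetical values for N, N0 and n_L:

    # Hypothetical counts illustrating the bias.
    N = 1000     # assumed true number of proto-lexemes
    N0 = 50      # assumed number of unreconstructible proto-lexemes
    n_L = 600    # lexemes attested in variety L

    p_accurate = n_L / N
    p_observed = n_L / (N - N0)
    p_error = p_accurate * N0 / N   # the approximately linear error term

    print(f"accurate retention rate: {p_accurate:.3f}")
    print(f"observed retention rate: {p_observed:.3f}")
    print(f"linear approximation:    {p_accurate + p_error:.3f}")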

Shared retentions. Phenomenon 2: as covered before, documentation loss is likely to introduce a bias towards basic vocabulary, and this bias is constant with respect to languages. Substantially poorer-documented languages will therefore appear closer to all other languages than expected, and the effect accumulates, showing poorer-documented languages as especially close to each other. The position of poorer-documented languages is not resolvable without a detailed model of documentation practices.

Shared retentions. Naive approaches to quantitative lexical comparison often attempt to interpret a higher proportion of shared vocabulary as indicative of a closer relationship. Innovative shared vocabulary may indeed constitute historically common innovations. Historically common retentions, by contrast, are unindicative of common descent. In principle, statistically significant upticks in shared retentions could instead indicate unidentified family-internal loaning. Emerging biases among shared retention rates are, however, most likely to simply constitute methodological artifacts in the data.

Shared losses
Nganasan – Selkup: predicted 73 shared losses; attested 84 (115%)
Tundra Nenets – Forest Nenets: predicted 29 shared losses; attested 89 (307%)
Yurats – Mator: predicted 372 shared losses; attested 407 (109%)
Kamassian – Koibal: predicted 233 shared losses; attested 341 (146%)
Again, the main trend is generally elevated rates.

Shared losses. The Nenets subgroup now clearly stands out among the material. Poorly recorded varieties now become distant rather than close: losses concentrate among less basic vocabulary, which is also the vocabulary likely to be lost during documentation, and if non-basic vocabulary forms the numerical majority, losses within it will also be less likely to co-occur. Elevated overall rates, however, are likely to indicate the existence of large subgroups: subgroups may have historically undergone common losses, while their complements may have missed out on lexical innovations. In principle this is investigable by iterative subgrouping: pool Nenets and Km-Kb (Kamassian–Koibal) together as single varieties and repeat the count for new results (see the sketch below).
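
A minimal Python sketch of such pooling, with hypothetical per-lexeme retention probabilities (the variable names only echo the real varieties, not the actual database values):

    def pool(p_a, p_b):
        # A pooled branch retains a lexeme if either member retains it
        # (treating the members' codings as independent).
        return [1 - (1 - a) * (1 - b) for a, b in zip(p_a, p_b)]

    def shared_losses(p_x, p_y):
        # Attested (summed joint probabilities) vs. chance-predicted shared losses.
        N = len(p_x)
        attested = sum((1 - a) * (1 - b) for a, b in zip(p_x, p_y))
        predicted = N * (sum(1 - a for a in p_x) / N) * (sum(1 - b for b in p_y) / N)
        return attested, predicted

    # Hypothetical per-lexeme retention probabilities.
    tundra_nenets = [1.0, 0.0, 1.0, 0.9, 0.0, 1.0]
    forest_nenets = [1.0, 0.0, 0.7, 1.0, 0.0, 1.0]
    selkup        = [1.0, 1.0, 0.0, 1.0, 0.0, 0.9]

    nenets = pool(tundra_nenets, forest_nenets)
    attested, predicted = shared_losses(nenets, selkup)
    print(f"Nenets (pooled) vs. Selkup: attested {attested:.2f}, predicted {predicted:.2f}")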

Shared losses vs. retentions. Next, let's consider the comparative data between the two Nenets varieties a bit more closely. Four surface categories can be identified:
retained in both TN and FN: n_RR = 565
lost in both TN and FN: n_LL = 89
retained in TN, lost in FN: n_RL = 104
lost in TN, retained in FN: n_LR = 31
The surface retention and loss rates:
p(R, TN) = (n_RR + n_RL) / N; p(R, FN) = (n_RR + n_LR) / N
p(L, TN) = (n_LR + n_LL) / N; p(L, FN) = (n_RL + n_LL) / N
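
A minimal Python sketch computing these surface rates from the four counts above, taking N as their sum (789, an approximation to the ~790-lexeme database):

    # The four surface counts between Tundra Nenets (TN) and Forest Nenets (FN).
    n_RR, n_LL, n_RL, n_LR = 565, 89, 104, 31
    N = n_RR + n_LL + n_RL + n_LR   # assumed here to be the total

    p_R_TN = (n_RR + n_RL) / N
    p_R_FN = (n_RR + n_LR) / N
    p_L_TN = (n_LR + n_LL) / N
    p_L_FN = (n_RL + n_LL) / N

    print(f"TN: retained {p_R_TN:.0%}, lost {p_L_TN:.0%}")
    print(f"FN: retained {p_R_FN:.0%}, lost {p_L_FN:.0%}")
    # Chance expectation for shared losses under independence (~29, cf. the
    # predicted TN-FN figure above, against 89 attested):
    print(f"shared losses expected by chance: {N * p_L_TN * p_L_FN:.0f} (attested: {n_LL})")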

Shared losses vs. retentions. However, if a Nenets subgroup indeed exists, we can divide the loss events into two sets: early common losses in Proto-Nenets vs. late losses separately in TN and in FN (some of which may again occur in parallel in both!). Retentions during the Proto-Nenets period will likewise be shared, so a slightly elevated rate of common retentions is therefore indeed expected as well. But note the order of inference: shared losses → common subgroup → common retentions. Retentions themselves still do not suffice as evidence for common ancestry.