Lexical loss as a shared linguistic innovation
FIN-CLARIN seminar on Fenno-Ugric Computational Linguistics
University of Helsinki, 2016-09-23
Juho Pystynen, juho.pystynen@helsinki.fi
Why lexical loss?
- Loss of inherited linguistic material is a simple and commonplace linguistic innovation.
- Lexical material is a numerically rich source of data: any given language variety can be characterized by the presence of thousands of lexemes.
- If a language's history is known in some detail (back to a recent proto-language stage), often several hundred lexemes can be analyzed as lost vs. not lost.
Modelling loss (0)
Loss is not a mirror image of the innovation of new vocabulary:
- Synonymy: words can persist in use even after the introduction of a new word with the same meaning. {a} > {a, b}
- Multiple innovations: a word's "replacement" can itself also be lost later. {a} > {b} > {c}
- Total loss: a lost word can end up replaced not by a new innovative word, but by an analytic expression or by a pre-existing synonym. {a} > {a, b} > {b}
Modelling loss (1)
- Given a set of vocabulary in a (possibly reconstructed) proto-language, we can at first approximation model lexical loss as the presence vs. absence of a reflex of a given proto-form: *{a, b, c, d, e, f} > {a, c, d, f}
- A simple metric of total losses in a given descendant variety is then the percentage of lexical material preserved vs. lost (see the sketch below).
- Again at first approximation, we can model the loss process as essentially random.
- Much finer-grained sociolinguistic and corpus analysis would be possible: recognition percentage within a speaker community, frequency of usage, median age of acquisition of a particular word, usage competition between synonyms, variation in what proto-states may be assumed for these factors, etc.
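As a minimal sketch of this retention metric, the percentage can be computed directly from sets of proto-lexemes and their attested reflexes. The function and set names here are hypothetical illustrations, not part of the database:

```python
def retention_rate(proto_lexemes: set, attested_reflexes: set) -> float:
    """Share of proto-lexemes with a surviving reflex in a descendant variety."""
    retained = proto_lexemes & attested_reflexes
    return len(retained) / len(proto_lexemes)

# The slide's toy example: *{a, b, c, d, e, f} > {a, c, d, f}
print(retention_rate({"a", "b", "c", "d", "e", "f"},
                     {"a", "c", "d", "f"}))  # 0.666..., i.e. 4 of 6 retained
```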
Modelling loss (1)
- When a language variety has for some reason not been documented in detail (extinct, endangered, remote, etc.), some losses may be "virtual": a lexical item still remains in use, but has not been recorded by researchers.
- Working with a percentage measure, this is modellable as simply an additional loss factor, "loss during documentation": p(observed retention) = p(historical retention) · p(documentation retention)
- At the low end, documentation loss is highly unlikely to be random, since early fieldwork surveys were often based on lists of basic vocabulary: 'five', 'head', 'woman' are unlikely to be lost in the documentation process; 'multitude', 'pancreas', 'midwife' are more likely.
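A minimal simulation sketch of this two-stage model, assuming historical loss and documentation loss are independent per-lexeme events; all names and rates here are invented for illustration:

```python
import random

def simulate_observed_retention(n_lexemes: int, p_hist_loss: float,
                                p_doc_loss: float, seed: int = 1) -> float:
    """Fraction of proto-lexemes that appear retained in the recorded data:
    a word must both survive historically and get recorded by fieldworkers."""
    rng = random.Random(seed)
    observed = 0
    for _ in range(n_lexemes):
        survives = rng.random() >= p_hist_loss
        recorded = rng.random() >= p_doc_loss  # applies only to survivors
        if survives and recorded:
            observed += 1
    return observed / n_lexemes

# Expected value: (1 - 0.3) * (1 - 0.1) = 0.63 observed retention
print(simulate_observed_retention(10000, p_hist_loss=0.3, p_doc_loss=0.1))
```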
Modelling loss (1)
- Documentation loss is intrinsically not observable in (a given set of) data.
- Consequence 1: observed total losses are a measure of the comparative data, not directly of history.
- Consequence 2: comparison of lexical losses is unlikely to be immediately useful between languages documented to significantly differing degrees.
Modelling loss (2)
- Modelling losses as a binary variable at the lexeme level often runs into difficult edge cases: an etymological comparison is never a strictly proven fact.
- Solution: apply probabilistic modelling here as well. Exact figures are not possible to derive, but rough ballpark figures can be applied (see the sketch below):
  - A highly regular etymology: 100% probability
  - A plausible etymology with irregularities: 50-90% probability
  - A speculative etymology: 1-10% probability
  - Lack of etymology: 0% probability
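One hypothetical way to encode these ballpark figures; the category names and the point values picked within each range are this sketch's own assumptions, not the database's actual coding:

```python
# Hypothetical probabilistic coding for etymological quality.
# Point values within the ranges above are arbitrary choices for illustration.
ETYMOLOGY_PROBABILITY = {
    "regular": 1.0,       # highly regular etymology
    "irregular": 0.7,     # plausible, with irregularities (50-90% range)
    "speculative": 0.05,  # speculative etymology (1-10% range)
    "none": 0.0,          # no etymology proposed
}

def coded_retention(quality: str) -> float:
    """Probability that a proto-lexeme genuinely has a reflex, given the
    quality grade assigned to the etymological comparison."""
    return ETYMOLOGY_PROBABILITY[quality]
```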
An example dataset: Samoyedic
The Samoyedic languages: a relatively compact and homogeneously documented language group.
- Eight reasonably well documented languages: Nganasan, Tundra Enets, Forest Enets, Tundra Nenets, Forest Nenets, Selkup, Kamass, Mator
- No overshadowing major literary languages
- Language boundaries fairly clear
- Substantial reconstruction work is available
- Status as a part of the larger Uralic family allows improved grounding
An example dataset: Samoyedic
A work-in-progress etymological database:
- Main lexical data source: the etymological dictionary of Janhunen (1977)
- Addenda from later studies, e.g. Helimski (1986, 1993), Aikio (2002, 2006)
- Thus far in humble spreadsheet form
- 790 lexemes (and growing), with rough probabilistic encoding
- Reconstruction, distribution of reflexes, further etymology
An example dataset: Samoyedic
Basic retention percentages:
- Nganasan: 61%
- Enets: 67%
- Yurats: 17%
- Tundra Nenets: 87%
- Forest Nenets: 78%
- Selkup: 80%
- Kamassian: 57%
- Koibal: 36%
- Mator: 44%
Modelling subgrouping
- We need to allow for the possibility that different observed loss rates also reflect different historical loss rates, and not merely different documentation losses.
- Within a family tree model, we can assign loss rates not just to languages, but more generally to branches.
- Could we, however, do the inverse: identify branches from losses?
Modelling subgrouping Isolated retention percentages provide no subgrouping information: for any arbitrary tree, we can always assign branch loss rates that multiply to the observed top node loss rates.
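A quick illustration of why marginal rates alone cannot distinguish trees: any factorization of a leaf's retention rate along its root-to-leaf path fits the data equally well. The figures below reuse the Nganasan 61% from the earlier slide; the splits chosen are arbitrary:

```python
# Retention multiplies along branches, so 0.61 at the Nganasan leaf is
# equally consistent with no common-branch loss (1.0 * 0.61) or with
# heavy common-branch loss (0.7 * 0.61/0.7), for any tree shape.
factorizations = [(1.0, 0.61), (0.9, 0.61 / 0.9), (0.7, 0.61 / 0.7)]
for upper_branch, lower_branch in factorizations:
    assert abs(upper_branch * lower_branch - 0.61) < 1e-9  # same observed rate
```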
Modelling subgrouping
- We need to look at shared losses vs. retentions (between a given pair of varieties) on a word-by-word level to be able to locate common innovations.
- A shared loss (in the data) is, however, not automatically a common loss (in actual history).
- Indeed, for languages 1 and 2 with loss rates p1 and p2, we expect to see a shared loss rate of p1 · p2 already purely by chance.
Modelling subgrouping
- What we can do with ease is calculate the expected shared loss and retention rates, and compare these with the attested rates. (With detailed statistical analysis, if we wish; for today's purposes a simple look at these metrics will however suffice.)
- With probabilistic etymological coding, for a single lexeme we have, at a pinch:
  p(shared retention) = p1 · p2
  p(shared loss) = (1 - p1) · (1 - p2)
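Summing these per-lexeme probabilities gives the expected counts under the null hypothesis of independent loss; a minimal sketch, with a hypothetical function name:

```python
def expected_shared_counts(p1: list, p2: list) -> tuple:
    """Expected numbers of shared retentions and shared losses between two
    varieties, given per-lexeme retention probabilities p1[i] and p2[i],
    under the assumption that losses happen independently."""
    exp_retentions = sum(a * b for a, b in zip(p1, p2))
    exp_losses = sum((1 - a) * (1 - b) for a, b in zip(p1, p2))
    return exp_retentions, exp_losses
```

Comparing such expectations against the attested counts is what the following slides do.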
Shared retentions
- Nganasan–Selkup: predicted 361 shared items; attested 373 (103%)
- Tundra Nenets–Forest Nenets: predicted 505 shared items; attested 565 (112%)
- Yurats–Mator: predicted 58 shared items; attested 93 (160%)
- Kamassian–Koibal: predicted 150 shared items; attested 260 (173%)
Main trend: generally elevated rates across the board.
Shared retentions
Phenomenon 1: reconstructed vocabulary is not known independently of the descendants.
- Lexemes surviving in one language are usually not reconstructible (exception: words with a wider Uralic pedigree); lexemes surviving in zero languages are entirely unreconstructible.
- Observed retention rates are therefore slightly elevated, loss rates slightly diminished:
  p(L, accurate) = n_L / N (n_L = number of lexemes attested in variety L; N = total number of proto-lexemes)
  p(L, observed) = n_L / (N - N_0) (N_0 = total number of unreconstructible proto-lexemes)
- If N_0/N is small: p(L, observed) ≈ n_L/N + (n_L/N) · (N_0/N) = p(L, accurate) · (1 + N_0/N)
- An approximately linear error factor for retention rates; in turn, a constant error term for the predicted vs. observed ratio.
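A quick numeric check of this approximation, with invented figures (N, N_0 and n_L here are placeholders, not values from the Samoyedic database):

```python
# Invented example: 1000 proto-lexemes, 50 of them unreconstructible,
# 600 attested in variety L.
N, N0, nL = 1000, 50, 600
accurate = nL / N                 # 0.600: true retention rate
observed = nL / (N - N0)          # 0.6316...: rate over reconstructible items
approx = (nL / N) * (1 + N0 / N)  # 0.630: linear error factor (1 + N0/N)
print(accurate, observed, approx)
```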
Shared retentions
Phenomenon 2: as covered before, documentation loss is likely to introduce a bias towards basic vocabulary, and this bias is constant with respect to languages.
- Substantially poorer-documented languages will appear closer to all other languages than expected.
- The effect will accumulate, showing poorer-documented languages as especially close to each other.
- The position of poorer-documented languages is not resolvable without a detailed model of documentation practices.
Shared retentions
- Naive approaches to quantitative lexical comparison often attempt to interpret a higher proportion of shared vocabulary as indicative of a closer relationship.
- Innovative shared vocabulary may indeed constitute historically common innovations.
- Historically common retentions, by contrast, are unindicative of common descent.
- In principle, statistically significant upticks in shared retentions could instead indicate unidentified family-internal borrowing.
- Emerging biases among shared retention rates are, however, most likely to simply constitute methodological artifacts in the data.
Shared losses
- Nganasan–Selkup: predicted 73 shared losses; attested 84 (115%)
- Tundra Nenets–Forest Nenets: predicted 29 shared losses; attested 89 (307%)
- Yurats–Mator: predicted 372 shared losses; attested 407 (109%)
- Kamassian–Koibal: predicted 233 shared losses; attested 341 (146%)
Again, the main trend is generally elevated rates.
Shared losses
- The Nenets subgroup now clearly stands out from the material.
- Poorly recorded varieties now become distant rather than close: losses concentrate among less basic vocabulary, which is also what is likely to be lost during documentation; and if non-basic vocabulary is a numerical majority, losses within it are also less likely to co-occur.
- Elevated overall rates, however, are likely to indicate the existence of large subgroups: subgroups may have historically undergone common losses, while their complements may have missed out on lexical innovations.
- In principle investigable by iterative subgrouping: pool Nenets and Km-Kb together as single varieties, then repeat the count for new results (see the sketch below)?
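A sketch of the pooling step, assuming the probabilistic coding from before: a pooled variety retains a proto-lexeme if at least one member does. The function name is hypothetical:

```python
def pooled_retention(p_a: list, p_b: list) -> list:
    """Per-lexeme retention probabilities for a pooled variety (e.g. Tundra
    Nenets + Forest Nenets treated as a single 'Nenets' unit): a word counts
    as retained if either member retains it, assuming independence."""
    return [1 - (1 - a) * (1 - b) for a, b in zip(p_a, p_b)]

# The shared-loss counts can then be recomputed between pooled units,
# e.g. 'Nenets' vs. 'Km-Kb', to look for higher-level subgrouping signals.
```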
Shared losses vs. retentions
Next, let's consider the comparative data between the two Nenets varieties a bit closer. Four surface categories can be identified:
- retained in both TN and FN: n_RR = 565
- lost in both TN and FN: n_LL = 89
- retained in TN, lost in FN: n_RL = 104
- lost in TN, retained in FN: n_LR = 31
The surface retention and loss rates:
  p(R, TN) = (n_RR + n_RL) / N;  p(R, FN) = (n_RR + n_LR) / N
  p(L, TN) = (n_LR + n_LL) / N;  p(L, FN) = (n_RL + n_LL) / N
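Plugging the four counts into these formulas gives the surface rates directly; a minimal sketch (note that the four categories sum to N = 789, matching the roughly 790 lexemes in the database):

```python
# Counts from the TN/FN contingency table above.
n_RR, n_LL, n_RL, n_LR = 565, 89, 104, 31
N = n_RR + n_LL + n_RL + n_LR  # 789 lexemes in total

p_R_TN = (n_RR + n_RL) / N     # ~0.85 retained in Tundra Nenets
p_R_FN = (n_RR + n_LR) / N     # ~0.76 retained in Forest Nenets
p_L_TN = (n_LR + n_LL) / N     # ~0.15 lost in Tundra Nenets
p_L_FN = (n_RL + n_LL) / N     # ~0.24 lost in Forest Nenets
```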
Shared losses vs. retentions
- However, if a Nenets subgroup indeed exists, we can divide loss events into two sets: early common losses in Proto-Nenets vs. late losses separately in TN and FN (some of which may again occur in parallel in both!).
- Retentions during the Proto-Nenets period will likewise exist in common; a slightly elevated rate of common retentions is therefore indeed expected as well!
- But note the order of inference: shared losses → common subgroup → common retentions. Retentions themselves still do not suffice as evidence for common ancestry (see the sketch below).
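A sketch of why shared losses exceed the independence prediction once a common Proto-Nenets stage is assumed; all rate values here are invented placeholders:

```python
def shared_loss_rate(p_common: float, p_late_1: float, p_late_2: float) -> float:
    """Probability that both descendants lack a word: either it was lost
    already in the common (Proto-Nenets) period, or it survived that period
    and was then lost independently in both branches."""
    return p_common + (1 - p_common) * p_late_1 * p_late_2

# With no common stage (p_common = 0) the rate is just the chance product;
# any common loss pushes it above the independence prediction.
print(shared_loss_rate(0.0, 0.1, 0.2))   # 0.02  (pure chance)
print(shared_loss_rate(0.05, 0.1, 0.2))  # 0.069 (elevated by common losses)
```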