9/7/7

A new hybrid hypothesis for the origin and spread of the Indo-European languages
Russell Gray, Max Planck Institute for the Science of Human History, Jena

Theories of Indo-European Origin
The origin of Indo-European languages: "the most intensively studied, yet still most recalcitrant, problem of historical linguistics"
Diamond and Bellwood, Science, 2003

Talk structure
1. The challenge(s)
2. Bayesian phylolinguistics made easy
3. The Indo-European debate goes Bayesian
4. Archaeogenetics to the rescue?
5. A new hypothesis for the origin of PIE

"the quest for the origins of the Indo-Europeans has all the fascination of an electric light in the open air on a summer night: it tends to attract every species of scholar or would-be savant who can take pen to hand."
Mallory, 1989
So why do I care?
"The formation of different languages and of distinct species, and the proofs that both have been developed through a gradual process, are curiously parallel. We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation."
The Descent of Man, 1871

Tree of life: Darwin's notebook, 1837
Tree of languages: Schleicher, 1865

Retentions vs innovations (Karl Brugmann 1884)
Symplesiomorphies vs synapomorphies (Willi Hennig 1950/1966)

Glottoclock (Morris Swadesh 1952)
Molecular clock (Zuckerkandl & Pauling 1962)

t = log c / (2 log r)
c = % shared cognates
r ≈ 81% (200 word list)
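The glottoclock formula on this slide can be tried out with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes that c and r are expressed as fractions, that t comes out in millennia, and that r = 0.81 is the retention rate for the 200-word list.

```python
import math

def glottochronology_time(shared_fraction, retention_rate=0.81):
    """Divergence time t = log c / (2 log r).

    shared_fraction: c, the fraction of shared cognates (0 < c <= 1).
    retention_rate:  r, the fraction retained per millennium (assumed 0.81).
    Returns t in millennia under the constant-rate assumption.
    """
    if not 0.0 < shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0.0 < retention_rate < 1.0:
        raise ValueError("retention_rate must be in (0, 1)")
    return math.log(shared_fraction) / (2.0 * math.log(retention_rate))

if __name__ == "__main__":
    # Hypothetical cognate percentages, not taken from any dataset.
    for c in (0.9, 0.7, 0.5):
        print(f"c = {c:.2f} -> t = {glottochronology_time(c):.2f} millennia")
```

The factor of 2 appears because both descendant languages lose vocabulary independently after the split. The estimate depends entirely on the constant-rate assumption.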
Phylogenetic explosion in biology
[Figure: percentage of total publications matching the keyword "Phylogen*" in the Scopus database, by year, 1980 to 2010. Source: http://treetapper-dev.blogspot.com/]

Talk structure
1. The challenge(s)
2. Bayesian phylolinguistics made easy
3. The Indo-European debate goes Bayesian
4. Archaeogenetics to the rescue?
5. A new hypothesis for the origin of PIE

Which tree is more likely?

Why use computers?

Languages    # rooted trees
3            3
4            15
5            105
6            945
7            10,395
8            135,135
9            2,027,025
10           34,459,425
20           8.2 × 10^21
50           2.7 × 10^76
100          3.4 × 10^184

Number of rooted trees for n languages: (2n-3)!!
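The counts in the "Why use computers?" table match the double factorial (2n-3)!!, the number of rooted, strictly bifurcating trees on n labelled tips. The short sketch below shows how quickly that number grows; the function name is hypothetical.

```python
def num_rooted_trees(n_taxa):
    """Number of rooted binary trees on n labelled tips: (2n-3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n = 2).
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count

if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100):
        print(f"{n:>3} languages: {num_rooted_trees(n):.3g} rooted trees")
```

Exhaustive enumeration is already impractical for a few dozen languages, which is why tree space is sampled rather than searched completely.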
Modern Bayesian Phylogenetic Inference
1. Data
2. Model
3. Priors
4. Tree search

What is the ancestral state?
Depends on the tree
And the model: assumptions about the relative probabilities of character state changes matter

Likelihood calculation

Three models of lexical evolution
1. Equal probability of cognate gains and losses
2. Dollo (gains can only occur once, assumes stochastic clock)
3. Covarion
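To make "likelihood calculation" and "what is the ancestral state?" concrete, here is a minimal sketch of Felsenstein's pruning algorithm for a single cognate set under the first model in the list (equal gain and loss rates). The tree encoding, function names, rate and branch lengths are all illustrative assumptions; this is not the code used in the analyses described in the talk.

```python
import math

def transition_prob(i, j, rate, t):
    """Symmetric two-state model: gains and losses both occur at `rate`."""
    decay = math.exp(-2.0 * rate * t)
    return 0.5 + 0.5 * decay if i == j else 0.5 - 0.5 * decay

def partial_likelihoods(node, rate):
    """Return [L(0), L(1)]: probability of the tip data below `node`,
    given that `node` is in state 0 or 1.

    A tip is an int (0 = cognate absent, 1 = present).
    An internal node is a list of (child, branch_length) pairs.
    """
    if isinstance(node, int):
        return [1.0 if state == node else 0.0 for state in (0, 1)]
    result = [1.0, 1.0]
    for child, branch_length in node:
        child_l = partial_likelihoods(child, rate)
        for s in (0, 1):
            result[s] *= sum(
                transition_prob(s, c, rate, branch_length) * child_l[c]
                for c in (0, 1)
            )
    return result

def site_likelihood(tree, rate):
    """Likelihood of one cognate set, with equal root frequencies."""
    l0, l1 = partial_likelihoods(tree, rate)
    return 0.5 * l0 + 0.5 * l1

def root_state_posterior(tree, rate):
    """Posterior probability that the cognate was present at the root."""
    l0, l1 = partial_likelihoods(tree, rate)
    return 0.5 * l1 / (0.5 * l0 + 0.5 * l1)

if __name__ == "__main__":
    # Hypothetical tree ((A:0.5, B:0.5):0.5, C:1.0); cognate present in A and B.
    tree = [([(1, 0.5), (1, 0.5)], 0.5), (0, 1.0)]
    for rate in (0.1, 1.0):
        print(rate, site_likelihood(tree, rate), root_state_posterior(tree, rate))
```

Changing the rate or the branch lengths changes the inferred root state, which is the sense in which the answer depends on both the tree and the model.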
MCMC search (Markov chain Monte Carlo)
Convergence?

Bayesian MCMC inference
Posterior probability of the trees given the priors, the data and the model
No one tree to rule them all
Posterior sample of trees reveals uncertainty

Densitree visualisation
[Figure: three alternative trees for taxa A, B, C, D with posterior probabilities 20%, 50% and 30%. Adapted from Cui et al. 2013.]

What we are NOT doing
1. Counting cognates to get % shared cognates
2. Pairwise comparisons of languages
3. Assuming constant rates

What is to be gained?
Level playing field: no cherry picking, explicit optimality criterion to evaluate subgrouping hypotheses
Quantify uncertainty
Estimate dates
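A posterior sample of trees can be summarised by how often each subgroup (clade) appears in it. The sketch below assumes each sampled tree has already been reduced to its set of clades. The three topologies are invented for illustration and only mirror the 20% / 50% / 30% split on the slide.

```python
from collections import Counter

def clade_support(sampled_trees):
    """Fraction of sampled trees containing each clade.

    sampled_trees: list of trees, each given as an iterable of clades,
    where a clade is a frozenset of tip names.
    """
    counts = Counter()
    for clades in sampled_trees:
        counts.update(set(clades))  # count each clade once per tree
    n = len(sampled_trees)
    return {clade: k / n for clade, k in counts.items()}

def majority_rule_clades(support, threshold=0.5):
    """Clades kept in a majority-rule consensus tree."""
    return {clade for clade, p in support.items() if p > threshold}

# Hypothetical topologies on tips A, B, C, D (non-trivial clades only).
tree_1 = [frozenset("AB"), frozenset("CD")]    # ((A,B),(C,D))
tree_2 = [frozenset("AB"), frozenset("ABC")]   # (((A,B),C),D)
tree_3 = [frozenset("AC"), frozenset("BD")]    # ((A,C),(B,D))
posterior_sample = [tree_1] * 2 + [tree_2] * 5 + [tree_3] * 3

support = clade_support(posterior_sample)

if __name__ == "__main__":
    for clade, p in sorted(support.items(), key=lambda kv: -kv[1]):
        print("".join(sorted(clade)), p)
    print(majority_rule_clades(support))
```

In this toy sample no single topology exceeds 50%, yet the subgroup A+B is found in 70% of the trees. Support is therefore reported per clade rather than per tree.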
Talk structure
1. The challenge(s)
2. Bayesian phylolinguistics made easy
3. The Indo-European debate goes Bayesian

Data
Dyen et al. (1992): 84 languages, 2,449 cognate sets
Added three extinct languages
Swadesh list: 200 items of basic vocabulary (numerals, kinship terms, terms for body parts, and basic verbs)
Relatively resistant to borrowing
Recognized borrowing removed, e.g. English mountain, borrowed from French montagne, was not coded as cognate

Binary coding
(word followed by its cognate class)

English      And 1    Big 1         Fire 1     Meat 1       Rub 1        Water 1
German       Und 1    Gross 2       Feuer 1    Fleisch 2    Reiben 1     Wasser 1
Dutch        En 1     Groot 2       Vuur 1     Vleesch 2    Wrijven 1    Water 1
Swedish      Och 2    Stor 3        Eld 2      Kott 3       Gnida 2      Vatten 1
Icelandic    Og 2     Stor 3        Eldr 2     Hold 3       Nua 3        Vatn 1
Danish       Og 2     Stor 3        Ild 2      Kod 4        Gnide 2      Vand 1
Greek        Ke 3     Meghalos 4    Fotia 3    Kreas 5      Trivo 4      Nero 2

Bayesian MCMC tree-estimation
MrBayes (v.2 & v.3), Huelsenbeck & Ronquist
10 independent runs with 4 chains, 1,300,000 generations
First 300,000 discarded as burn-in
Sample every 10,000
Majority rule consensus tree of 10,000 MCMC trees

[Figure: majority rule consensus tree; clades shown: French/Iberian, Italic, Celtic, North Germanic, West Germanic, Germanic, Slavic, Balto-Slavic, Baltic, Indic, Indo-Iranian, Iranian, Albanian, Armenian, Greek, Tocharian, Hittite]

"linguists don't do dates"
April & Robert McMahon (2006)
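The binary coding step turns each cognate class within a meaning into its own presence/absence column. The sketch below uses two meanings from the slide's table, and assumes that any word shown without a class number belongs to class 1. The function name and the use of "?" for missing entries are illustrative choices.

```python
def binary_code(cognate_classes):
    """Convert multistate cognate classes to a binary matrix.

    cognate_classes: {meaning: {language: class_id}}
    Returns (columns, matrix) where columns is a list of (meaning, class_id)
    and matrix maps each language to a list of 1, 0 or "?" (no data).
    """
    languages = sorted({lang for by_lang in cognate_classes.values() for lang in by_lang})
    columns = [
        (meaning, cls)
        for meaning, by_lang in cognate_classes.items()
        for cls in sorted(set(by_lang.values()))
    ]
    matrix = {}
    for lang in languages:
        row = []
        for meaning, cls in columns:
            observed = cognate_classes[meaning].get(lang)
            row.append("?" if observed is None else int(observed == cls))
        matrix[lang] = row
    return columns, matrix

# Subset of the slide's table; class 1 where unnumbered is an assumption.
data = {
    "big":   {"English": 1, "German": 2, "Dutch": 2, "Swedish": 3, "Greek": 4},
    "water": {"English": 1, "German": 1, "Dutch": 1, "Swedish": 1, "Greek": 2},
}
columns, matrix = binary_code(data)

if __name__ == "__main__":
    print(columns)
    for lang, row in matrix.items():
        print(f"{lang:<8}", row)
```

Each column is then treated as a character that is gained or lost along the tree.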
3!!! calibration points (ranges)
[Figure: consensus tree with calibration date ranges (AD/BC) marked on internal nodes; clades shown: French/Iberian, Italic, Celtic, West Germanic, North Germanic, Germanic, Slavic, Baltic, Balto-Slavic, Indic, Iranian, Indo-Iranian, Albanian, Greek, Armenian, Tocharian, Hittite]
Date of proto Indo-European estimated using penalised maximum likelihood rate smoothing
Gray & Atkinson (2003) Nature, 426, 435-439.

Responses to Gray & Atkinson 2003
"It is a very good paper."
"X#@$?&!" (unrepeatable)

Criticisms of our method
1. Wrong answer: "linguistic paleontology rejects this time date" - Larry Trask, Sussex Univ.
2. Reliance on lexical data: "Lexical data is the least reliable type of data" - Don Ringe, Univ. of Pennsylvania
3. Model misspecification: "used techniques that are not appropriate for their data." - Tandy Warnow, Univ. of Texas
4. Cognate sets not independent

The argument from the wheel
English wheel
Hom. Greek kuklos
OHGerman wel
Sanskrit cakra
OIcelandic hjol
Proto-Indo-European word *kwekwlo- "wheel"
"How could a language that was last spoken around 10,000 years ago have words for things that were not invented until 4000 years later?"
Larry Trask
The argument from the wheel
Other explanations
Borrowing new technology
Semantic shift: *kwel- (to turn, rotate)
  → kola "wheel" (Old Russian)
  → kuklos "wheel" (Greek)
  → cakra "wheel" (Sanskrit)

Should we believe the lexical data?
Constraining the tree to Ringe et al. 2002 topology gives similar date estimates

Center of gravity/center of diversity

Austronesian expansion
Sequence, timing, pulses and pauses
Gray, Drummond & Greenhill (2009) Science, 323, 479-483.

Bayesian phylogeography
Lemey et al., MBE, 2010
The team
Alex Alekseyenko, Quentin Atkinson, Remco Bouckaert, Alexei Drummond, Michael Dunn, Russell Gray, Simon Greenhill, Philippe Lemey, Marc Suchard

Bayesian phylogeography
1. Data (basic vocab)
2. Location (language ranges)
3. Diffusion model
4. Calibration data to date language divergences
5. Bayesian MCMC inference of phylogeny in BEAST

New (improved) data
http://ielex.mpi.nl/
Michael Dunn, MPI Nijmegen
+ language location data
+ model of spatial diffusion
+ Bayesian inference of phylogeny in BEAST

Substitution models
Simple binary reversible model (0 ↔ 1)
Binary covarion model: slow (0 ↔ 1), fast (0 ↔ 1)
Stochastic Dollo model (0 → 1: one gain, many losses)

Clock
Uncorrelated lognormal relaxed clock (Drummond et al., 2006)
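The idea behind an uncorrelated lognormal relaxed clock is that every branch draws its own rate independently from one shared lognormal distribution, instead of all branches sharing a single rate. The sketch below only illustrates that sampling step with Python's standard library. It is not BEAST's implementation, and the function name, sigma value and seed are arbitrary assumptions.

```python
import math
import random

def sample_branch_rates(n_branches, sigma, mean_rate=1.0, seed=None):
    """Draw one independent lognormal rate per branch.

    sigma:     standard deviation of log(rate); larger means less clock-like.
    mean_rate: desired mean of the rate distribution.
    """
    rng = random.Random(seed)
    # A lognormal with parameters (mu, sigma) has mean exp(mu + sigma^2 / 2),
    # so shift mu to give the requested mean.
    mu = math.log(mean_rate) - 0.5 * sigma ** 2
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]

if __name__ == "__main__":
    rates = sample_branch_rates(n_branches=8, sigma=0.5, seed=42)
    print([round(r, 2) for r in rates])
```

With sigma close to zero all rates approach the mean and the model collapses to a strict clock. In a full analysis the rates, sigma and the tree are estimated jointly.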
= test origin hypotheses

[Figure: dated Indo-European tree with posterior support on nodes; clades shown: Hittite, Celtic, Italic, Germanic, Balto-Slavic, Albanian, Greek, Armenian, Indo-Iranian, Tocharian; time axis in calendar years. Bouckaert et al. (2012) Science.]

Posterior distribution on root location

Bayes factor
Phylogeographic analysis              Anatolian vs. steppe I    Anatolian vs. steppe II
RRW: All languages                    175.0                     159.3
RRW: Ancient languages only           1404.2                    1582.6
RRW: Contemporary languages only      12.0                      11.4
Landscape aware: Diffusion            298.2                     141.9

Bouckaert et al. (2013) Science. Correction.
[Figure: corrected tree with posterior support on nodes; clades shown: Celtic, Italic, Germanic, Balto-Slavic, Indo-Iranian, Anatolian, Tocharian, Armenian, Greek, Albanian; x-axis: Time (years ago), 8000 to 0. Bouckaert et al. (2012) Science.]
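A Bayes factor for one homeland against another is the posterior odds divided by the prior odds. As a rough sketch, the posterior and prior probabilities can be approximated by the share of sampled root locations that fall inside each candidate region. The function names and all numbers below are hypothetical and are not the values reported in the table.

```python
import math

def region_fraction(root_samples, in_region):
    """Share of sampled root locations for which in_region(point) is True."""
    hits = sum(1 for point in root_samples if in_region(point))
    return hits / len(root_samples)

def bayes_factor(post_a, post_b, prior_a, prior_b):
    """Posterior odds of A over B, divided by the prior odds of A over B."""
    if prior_a <= 0 or prior_b <= 0:
        raise ValueError("prior probabilities must be positive")
    if post_b == 0:
        return math.inf  # no posterior sample fell in region B
    return (post_a / post_b) / (prior_a / prior_b)

if __name__ == "__main__":
    # Hypothetical (longitude, latitude) root samples and rectangular regions.
    samples = [(33.0, 38.5), (34.2, 39.0), (31.5, 37.8), (45.0, 48.0)]
    in_a = lambda p: 26.0 <= p[0] <= 42.0 and 36.0 <= p[1] <= 42.0
    in_b = lambda p: 30.0 <= p[0] <= 55.0 and 45.0 <= p[1] <= 52.0
    post_a = region_fraction(samples, in_a)
    post_b = region_fraction(samples, in_b)
    print(bayes_factor(post_a, post_b, prior_a=0.2, prior_b=0.3))
```

Dividing by the prior odds matters because a larger candidate region captures more root samples by chance alone.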
RRW
Posterior distribution on root location

RRW: All languages            380.4    625.2
RRW: Constrained              74.0     45.4
RRW: Ancient only             828      +
RRW: Contemporary only*       73       +

Big problem: inconsistent data coding (Swadesh policy of most commonly used lexeme not consistently followed)

Butterflies in the Chang et al. data
[Figure: histogram; x-axis: Number of Records; y-axis: Number of Languages]

COBL Ancient Greek
Better data, multi-state, better model that allows but does not enforce direct ancestry, new results.
Chang et al. 2014

New Jena results
[Figure: tree with posterior support on nodes. Tips shown: Umbrian, Latin, Gaulish, Romanian, Vlach, Sardinian_Logudoro, Sardinian_Cagliari, Sardinian_Nuoro, Italian, Ladin, Romansh, Friulian, Catalan, Portuguese, Spanish, Provencal, Walloon, French, Gothic, Old_Norse, Old_Swedish, Old_English, Old_High_German, Icelandic, Faroese, Norwegian_Riksmal, Swedish, Danish, English, Frisian, Dutch, Flemish, German, Luxembourgish]
Talk structure
1. The challenge(s)
2. Bayesian phylolinguistics made easy
3. The Indo-European debate goes Bayesian
4. Archaeogenetics to the rescue?

Haak et al. 2015

Talk structure
1. The challenge(s)
2. Bayesian phylolinguistics made easy
3. The Indo-European debate goes Bayesian
4. Archaeogenetics to the rescue?
5. A new hypothesis for the origin of PIE
A hybrid model for the origin and spread of Indo-European
Haak, Heggarty, Krause & Gray

Thanks to

Colleagues
Cormac Anderson (Jena, MPG-SHH)
Quentin Atkinson (Auckland)
Remco Bouckaert (Auckland)
Alexei Drummond (Auckland)
Michael Dunn (Nijmegen)
Simon Greenhill (ANU, Jena)
Wolfgang Haak (Jena)
Paul Heggarty (Jena, MPG-SHH)
Johannes Krause (Jena)
Philippe Lemey (Leuven)
Marc Suchard (UCLA)