Corpus Linguistics and Multivariate Statistics Seminar 1 Dylan Glynn www.dsglynn.univ-paris8.fr dglynn@univ-paris8.fr
What is a corpus?
"You shall know a word by the company it keeps" (J. R. Firth, 1957)
Three basic, but inter-related, approaches to corpora
1. Comparison corpora - Concordances
2. Formal patterns with a corpus - Collocations
3. Meaning patterns within a corpus - Correlations
These methods vary massively in complexity and application.
They are typically used to answer questions in:
   Discourse Analysis
   Critical Discourse Analysis
   Sociolinguistics
   Semantics and Pragmatics
   Phonology
   Morpho-syntax
   Stylistics...
1. Comparison corpora - Concordances
The most basic type of concordance is the list of words and their frequencies in a body of writing.
Such lists have been used for hundreds of years, especially in Theology and Literature.
They are still important in stylistics and in many other areas of research.
They are very quick and easy to compile and often represent the first step of more advanced studies.
For example, take the parliamentary speeches of the Tories and Labour, compile a word list for each and compare the frequency of certain key words. Do the same for men and women in the parliament. What would the differences tell us?
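The word-list comparison described on this slide can be sketched in a few lines of Python. This is only a minimal illustration: the two text snippets are invented stand-ins for real parliamentary speeches, and the helper names `word_list` and `per_thousand` are hypothetical. Frequencies are reported per 1,000 tokens so that texts of different lengths can be compared.

```python
from collections import Counter
import re

def word_list(text):
    """Return a Counter of lower-cased word tokens in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def per_thousand(counts, word):
    """Relative frequency of a word per 1,000 tokens of the word list."""
    total = sum(counts.values())
    return 1000 * counts[word] / total if total else 0.0

# Invented toy snippets, NOT real speeches: replace with your own corpus files.
party_a = "We will cut tax and cut spending. Tax is too high."
party_b = "We will invest in health and schools. Health comes first."

list_a, list_b = word_list(party_a), word_list(party_b)
for key in ("tax", "health"):
    # Print the key word and its relative frequency in each sub-corpus.
    print(key, round(per_thousand(list_a, key), 1), round(per_thousand(list_b, key), 1))
```

With a real corpus, the same two functions would be applied to the concatenated speeches of each group (party, gender) before comparing the key words.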
2. Formal patterns with a corpus - Collocations
This is the mainstay of contemporary corpus linguistics. There are various types:
   Frequency Analysis
   Concordance Analysis
      - KWIC (keyword in context)
      - Collocations (concordance of word co-occurrences)
      - Collostructions (concordance of word and syntactic pattern)
   Vector analysis / Word space analysis
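As an illustration of the KWIC (keyword in context) display listed above, here is a minimal Python sketch. The function name `kwic`, the window size and the sample sentence are all assumptions made for the example; dedicated concordancers offer far more (sorting, regular expressions, metadata).

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, node, right context) for every hit of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])    # up to `width` words before
            right = " ".join(tokens[i + 1:i + 1 + width])   # up to `width` words after
            hits.append((left, tok, right))
    return hits

# Invented sample text, for illustration only.
sample = "They run a business. We run every morning. The tap was left to run."
for left, node, right in kwic(sample, "run"):
    # Right-align the left context so the node word lines up in one column.
    print(f"{left:>25}  {node}  {right}")
```

Counting the words that appear in the left and right contexts is the first step from a concordance towards a collocation analysis.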
Frequency Analysis 2008 US Presidential Election
Concordance Analysis
2. Collocation Analysis
Collostructional Analysis
Table 2. Most significant V-cause - V-result co-varying collexemes in the into-causative in the 1990-2000 volumes of The Guardian (cf. Gries and Stefanowitsch 2004: 230)

V cause      V result        N     -log(p Fisher-Yates)
bounce       accepting       29    14.074
torture      confessing       8    13.155
draw         commenting       6    10.581
shock        understanding    7    10.483
stimulate    producing        6     9.330
dupe         carrying         8     7.244
con          paying          16     7.019
hoodwink     leaving          8     6.982
mislead      buying          14     6.980
delude       supposing        3     6.792
terrorise    fleeing          4     6.762
talk         letting         12     6.743
dupe         leaving         13     6.609
force        making          51     6.546
pressure     having          14     6.505
bounce       announcing       6     6.100
shame        cleaning         4     5.953
dragoon      voting           7     5.899
swing        planning         2     5.518
fool         queuing          3     5.435
lock         using            5     5.406
guide        lending          2     5.372
rush         making          11     5.305
educate      understanding    3     5.296
fool         seeing           6     5.180
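The last column of the table is an association score derived from a Fisher-Yates exact test. As a rough sketch of how such a score could be computed for one verb pair, the Python below calculates a one-tailed exact p-value from a 2x2 co-occurrence table and takes its negative logarithm. Several things here are assumptions: the base-10 logarithm, the one-tailed test, the function name `fisher_right_tail`, and above all the four cell counts, which are invented and are not the counts behind the table.

```python
from math import comb, log10

def fisher_right_tail(a, b, c, d):
    """One-tailed Fisher-Yates exact p-value for the 2x2 table [[a, b], [c, d]].

    a = both verbs together, b = cause verb with other result verbs,
    c = result verb with other cause verbs, d = neither verb.
    Returns the probability of observing a or more co-occurrences
    when the row and column totals are held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    top = min(row1, col1)  # largest possible co-occurrence count
    favourable = sum(comb(row1, k) * comb(n - row1, col1 - k)
                     for k in range(a, top + 1))
    return favourable / comb(n, col1)

# Invented counts for a hypothetical verb pair (NOT taken from the table).
p = fisher_right_tail(8, 42, 12, 9938)
strength = -log10(p)  # larger value = stronger attraction between the two verbs
print(p, strength)
```

In practice the test is run for every cause-result verb pair in the construction, and the pairs are then ranked by the resulting score.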
Vector Analysis
[Figure: synonymy of run. Word-space plot of near-synonyms of run, e.g. race, flow, operate, escape, campaign, streak, unravel; panel label: NOUNS]
3. Meaning patterns within a corpus - Correlations
There is a continuum from counting occurrences of some meaning or use through to large-scale multivariate modelling of the behaviour of those uses.
Relative Frequencies
   Counting uses is a simple and useful corpus-driven approach.
   It is the mainstay of Discourse Analysis and Functional Linguistics.
Multidimensional correlations
   This line of research is the newest; it only began in the 1990s, in Belgium and Germany.
   It is an extremely popular technique in Cognitive Linguistics.
Relative Frequencies Memoranda vs. email in office communication
Multidimensional correlations Conceptualisation of HOME
Form-based vs. meaning-based analysis
Problems with meaning-based analysis
   i. Low degree of representativity due to small sample size
   ii. High degree of subjectivity due to manual analysis
Response to Problem of Representativity
   i. Restrict studies to carefully controlled datasets
   ii. Predictive statistical modelling is essential
Response to Problem of Subjectivity
   i. Clearly operationalised usage-features
   ii. Multiple annotators and Kappa scores for reliability
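The Kappa scores mentioned above measure how far two annotators agree beyond what chance alone would produce. A minimal sketch of Cohen's kappa for two annotators follows; the function name `cohen_kappa` and the eight annotations (senses of run) are invented for the example. Kappa is computed as (observed agreement - expected agreement) / (1 - expected agreement).

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Invented annotations of eight occurrences of "run" (illustration only).
annotator_1 = ["motion", "motion", "manage", "motion", "manage", "manage", "motion", "manage"]
annotator_2 = ["motion", "manage", "manage", "motion", "manage", "motion", "motion", "manage"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on 6 of 8 items (0.75) while chance agreement is 0.5, so raw agreement clearly overstates reliability.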
Corpus Evidence in Language Science
In the late 1960s, the Functionalists were questioning the assumptions of the European and American Structuralists.
   France: Martinet, Benveniste, Culioli
   Britain: Firth, Halliday, Sinclair
   Russia: Bondarko, Apresjan, Mel'čuk
   America: Givón, Hopper, Fillmore, Lakoff
The debate remains one of the two key debates of linguistics today.
This debate is central to the theories of arguably the three greatest linguists in history.
What structures language?
Corpus Evidence in Language Science
Where is grammar? Does langue come from parole, or does parole come from langue?
   Humboldt - ergon (product) vs. energeia (activity)
   de Saussure - langue vs. parole
   Chomsky - competence vs. performance
Theoretically, there are strong arguments for both.
Empirically, there are strong arguments for both.
Corpus linguistics necessarily assumes that the product is a result of the activity, that langue comes from parole, that competence is based on performance...
Although they are probably fewer than half, a very large group of linguists today think this is RUBBISH!!!
Corpus Evidence in Language Science
The main argument against using performance as an index of structure:
   "I live in New York" versus "I live in Dayton, Ohio"
Chomsky's (1964) argument: frequency of performance tells us about the world language is used to describe, not the language structure in the mind.
Q. 1. Why does one assume that the language in the mind is different from the world it describes?
   At some level, "I live in New York" is more important in language than "I live in Dayton".
Q. 2. Why would one look at raw frequency to describe language, when frequency is always relative?
   We could only compare the raw frequency of these two utterances if the same number of people lived in Dayton and New York.
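The point of Q. 2 can be made concrete with a little arithmetic: a raw count only becomes comparable once it is divided by a suitable base. The sketch below uses entirely invented figures (they are neither real utterance counts nor census data) and a hypothetical helper `relative_rate`.

```python
def relative_rate(hits, base):
    """Occurrences per 1,000 units of the base (e.g. speakers or tokens)."""
    if base <= 0:
        raise ValueError("base must be positive")
    return 1000 * hits / base

# Invented figures: (occurrences of "I live in X", number of inhabitants of X).
raw = {"large city": (800, 8_000_000), "small city": (15, 150_000)}
for place, (hits, population) in raw.items():
    # Raw count next to the rate per 1,000 inhabitants.
    print(place, hits, relative_rate(hits, population))
```

With these toy numbers the raw counts differ enormously, yet the rate per 1,000 inhabitants is the same for both places.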
Questions for Corpus Linguistics
In every corpus-based study, it is crucial that you are aware of the practical limitations and theoretical assumptions of the method!!! (This includes your mémoires.)
1. Practical Questions
   a. Representativity - Text type
   b. Representativity - Hapax legomena
2. Theoretical Questions
   a. Frequency - Linguistic structure
   b. Frequency - Thematic bias
3. Analytical Questions
   a. Negative Evidence
   b. Objective Accuracy
Representativity - Practical Questions for Corpus-Based Research
1. Text type and Topic of Discourse
   The type of text and what it is talking about can have a profound effect on your results.
   The most common meaning of run will be fast pedestrian motion in a corpus of children's books, but it will be management in a corpus of economic news press.
2. Hapax legomena and rare events
   The largest corpus in the world is but a fraction of language.
   Something that is very rare in a corpus can, in fact, be quite common out there in the real world.
   We are largely restricted to quite common events. Things like idioms etc. are relatively rare.
a. What are the implications of each of these questions for your own project?
b. If a particular question has implications for your project, what measures have you taken to respond to the question?
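To see how much of a corpus consists of words that occur only once, hapax legomena can be listed with a short Python sketch. The function name `hapax_legomena` and the sample sentence are invented for illustration; a real study would read in corpus files and use a proper tokeniser.

```python
from collections import Counter
import re

def hapax_legomena(text):
    """Return the alphabetically sorted word types that occur exactly once."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return sorted(word for word, freq in counts.items() if freq == 1)

# Invented sample sentence, for illustration only.
sample = "the cat sat on the mat and the dog sat on the rug"
hapaxes = hapax_legomena(sample)
print(hapaxes)
```

The proportion of hapaxes among all word types gives a quick impression of how much of the vocabulary is attested too rarely to support any frequency-based claim.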
Frequency - Theoretical Questions for Corpus-Based Research
1. Linguistic structure
   This is the langue - parole debate.
2. Thematic bias
   This is the same as the issue of text type and is the basis of Chomsky's criticism.
1. What are the implications of each of these questions for your own project?
2. If a particular question has implications for your project, what measures have you taken to respond to the question?
Object - Analytical Questions for Corpus-Based Research
1. Negative Evidence
   We only have what people say, not what they don't say. How can we disprove hypotheses?
2. Objective Accuracy
   To increase representativity and objectivity, we necessarily increase inaccuracy.
   If we increase accuracy, we necessarily decrease representativity and objectivity.
1. What are the implications of each of these questions for your own project?
2. If a particular question has implications for your project, what measures have you taken to respond to the question?
For next week
Have a look at each of the articles online. Choose one and have a go at reading it.
Remember, your mémoire should look something like one of these articles... seriously.