Statistiek II. John Nerbonne, Dept of Information Science. October 1, 2010. With thanks to Hartmut Fitz for a recent pass!

Dept of Information Science j.nerbonne@rug.nl With thanks to Hartmut Fitz for a recent pass! October 1, 2010

Last week
Factorial ANOVA:
- used when there are several independent variables (factors)
- allows us to study the interaction between factors
- assumptions as in one-way ANOVA: homogeneity of variance, normality, independence

Today: repeated measures ANOVA (also known as within-subjects design)
- one-way repeated measures ANOVA
- factorial repeated measures ANOVA
- mixed-factors repeated measures ANOVA

Last week
Last week's 2 × 2 ANOVA: repetition accuracy of object relatives
- two factors, two levels each
- factor A: animacy of head noun
- factor B: relative clause subject type
- the factors induced four disjoint groups of items (four tokens per type)
- 48 children; dependent measure: averaged repetition accuracy

We conducted a factorial ANOVA by item, testing whether repetition accuracy differed between the four groups of sentence types (ANP, INP, APro, IPro).

A different way to look at the same data
We could also have looked at repetition accuracy by participant:
- same two factors, head noun animacy and relative clause subject type
- average over tokens per type for each participant

         Sentence type
Child    ANP    INP    APro   IPro
1        0.00   0.00   0.00   0.00
2        0.00   0.00   0.75   0.38
3        0.00   0.50   0.88   0.75
...
48       0.25   0.50   1.00   0.88

Measure participants repeatedly in all conditions, then perform a 2 × 2 ANOVA by participant (expect similar main effects).
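The by-participant averaging described above can be sketched as follows. This is only an illustration: the trial records and the function name `mean_by_participant` are hypothetical, and each record is assumed to hold a child identifier, a sentence type, and a repetition score.

```python
from collections import defaultdict


def mean_by_participant(trials):
    """Average the scores over tokens for each (child, sentence type) pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for child, sentence_type, score in trials:
        key = (child, sentence_type)
        sums[key] += score
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical trial-level scores for one child (four tokens per type).
trials = [
    (2, "APro", 1.0), (2, "APro", 1.0), (2, "APro", 1.0), (2, "APro", 0.0),
    (2, "ANP", 0.0), (2, "ANP", 0.0), (2, "ANP", 0.0), (2, "ANP", 0.0),
]
cell_means = mean_by_participant(trials)
print(cell_means[(2, "APro")])  # mean over the four APro tokens
```

Each participant then contributes one value per sentence type, which is the layout a by-participant repeated measures analysis expects.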

One-way repeated measures ANOVA
Repeated measures ANOVA: like the related-samples t-test, but for 3 or more conditions A, B, C, etc.

Applications:
- the same group of subjects is measured under 3 or more conditions A, B, C, ...
- matched k-tuples of subjects, one member measured under A, one under B, one under C, ...
- in the latter case, each matched tuple is treated as one subject

Labels: repeated measures or within-subjects design, randomized blocks design

One-way repeated measures ANOVA
Characteristics:
- assumptions as in standard ANOVA, but the data points are not independent (repeated measures)
- economical in design because each subject is measured under all conditions
- often the research question requires repeated measures, e.g., longitudinal studies: each sample member is measured repeatedly at several ages
- example: children can discriminate many phonetic distinctions across languages without relevant experience; a longitudinal study shows that this ability declines (within the first year)
- key idea: eliminate the variation between sample members (this reduces within-groups variance)

Partitioning the variance
One-way independent samples ANOVA:
$$SST = SSG + SSE$$
Total Sum of Squares = Group Sum of Squares + Error Sum of Squares

One-way repeated measures ANOVA:
- same subjects in each group (i.e., condition)
- determine the aggregate variance among subjects (SSS):
$$SSS = I \sum_{j=1}^{N} (\bar{x}_j - \bar{x})^2$$
  where $I$ is the number of conditions, $\bar{x}_j$ the mean of subject $j$ (across conditions), and $\bar{x}$ the total mean
- remove this effect of individual differences from SSE
- determine MSE from $SSE^{*} = SSE - SSS$
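Putting the two partitions side by side may help. The following is a short worked derivation using only the quantities defined on this slide; the last line plugs in the sums of squares reported for the modelling example later in these slides (SSG = 2423.4, SSS = 86, SSE* = 276.6), so the total of 2786.0 is derived here and not quoted from the source.

```latex
\begin{align*}
SST &= SSG + SSE                  && \text{independent samples} \\
SSE &= SSS + SSE^{*}              && \text{split off between-subjects variability} \\
SST &= SSG + SSS + SSE^{*}        && \text{repeated measures} \\
    &= 2423.4 + 86 + 276.6 = 2786.0
\end{align*}
```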

One-way repeated measures example
Experiment: a computational model learns to produce complex sentences from meaning (Fitz, Neural Syntax, 2008).

Task:
- the model receives the semantic structure of a sentence as input
- it tries to produce a sentence which expresses this meaning
- production proceeds by word-to-word prediction

Example:
Input: Agent [DOG] Action [CHASE] Patient [CAT]
Sequential output:
  the dog
  the dog chases
  the dog chases the cat

One-way repeated measures example
But how should semantic relations be represented for multiple clauses? Three semantic conditions:
(a) give more prominence to the main clause (order-link)
    E.g., the dog that runs chases the cat
(b) mark the topic and focus of both clauses (topic-focus)
    E.g., the dog that [the dog] runs chases the cat
(c) features which bind topic and focus (binding)
    E.g., the dog that runs chases the cat, Agent-Agent

The model's learning behavior is tested in each of these conditions.
Question: is the model sensitive to the different semantic representations?

One-way repeated measures example
Subjects:
- the model is randomly initialized
- it is exposed to 10 different sets of randomly generated training items (10 experimental subjects)
- subject = model + fixed parameters + training environment
- each subject is tested in conditions (a) to (c) (repeated measures)

Dependent variable: mean sentence accuracy after the learning phase (on 1000 test items)

Scoring:
- the model produces the target sentence exactly: 1
- any kind of lexical or grammatical error: 0
- sentence accuracy: percentage of correct utterances
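The scoring rule can be written down in a few lines. This is a minimal sketch, assuming sentences are compared as plain strings; the function name `sentence_accuracy` and the two toy items are hypothetical and not taken from the original experiment.

```python
def sentence_accuracy(produced, targets):
    """Percentage of produced sentences that match their target exactly."""
    if len(produced) != len(targets) or not targets:
        raise ValueError("need two non-empty lists of equal length")
    # Exact match scores 1; any lexical or grammatical error scores 0.
    scores = [1 if p == t else 0 for p, t in zip(produced, targets)]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical toy test set: the second output contains an agreement error.
targets = ["the dog chases the cat", "the dog that runs chases the cat"]
produced = ["the dog chases the cat", "the dog that run chases the cat"]
print(sentence_accuracy(produced, targets))  # one of two sentences is correct
```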

One-way repeated measures example
Data on modelling the acquisition of relative clauses:

Model-     Condition                              Subject
subject    order-link   topic-focus   binding     mean
1          80           94            98          90.7
2          73           90            98          87
3          70           98            94          87.3
...
10         71           99            94          88
Mean       76.3         95.8          94.9        89

Note: the subject means (across conditions) are required to compute the subject sum of squares (SSS).

Check normality and standard deviations
[Figure: normal Q-Q plot for the binding condition, sample quantiles against theoretical quantiles]
SDs: order-link: 4.9, topic-focus: 2.66, binding: 3.03

Visualizing the data
[Figure: box plots of prediction accuracy by semantics: (1) order-link, (2) binding, (3) topic-focus]
Little skew, different medians, no overlap between (1) and (2) or (3): very likely significant.

Computing the error sum of squares

Model-     Condition                              Subject
subject    order-link   topic-focus   binding     mean
1          80           94            98          90.7
2          73           90            98          87
3          70           98            94          87.3
...
10         71           99            94          88
Mean       76.3         95.8          94.9        89

$$SSE = \sum_{i=1}^{I} \sum_{j=1}^{N_i} (x_{ij} - \bar{x}_i)^2 = (80 - 76.3)^2 + \dots + (94 - 94.9)^2 = 362.6$$

Key idea of repeated measures
Because subjects are measured in all conditions, remove the variability due to individual differences from SSE!

Independent samples:  SST → SSG, SSE        → MSG, MSE → F-value
Repeated measures:    SST → SSG, SSE − SSS  → MSG, MSE → F-value

Computing the subject sum of squares
Subject Sum of Squares: an aggregate measure of between-subjects variability
$$SSS = I \sum_{j=1}^{N} (\bar{x}_j - \bar{x})^2 = 3\,(90.7 - 89)^2 + 3\,(87 - 89)^2 + \dots + 3\,(88 - 89)^2 = 86$$
Adjust the error sum of squares:
$$SSE^{*} = SSE - SSS = 362.6 - 86 = 276.6$$
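The two sums of squares can be computed from a subjects-by-conditions table as sketched below. The function names are illustrative. Note that the slides print only four of the ten subjects (1, 2, 3 and 10), so running the functions on those rows reproduces the printed subject means but not the totals 362.6 and 86; a tiny made-up table is used to check the arithmetic instead.

```python
def condition_means(data):
    """Mean of each condition (column) across subjects."""
    n_subjects = len(data)
    return [sum(row[i] for row in data) / n_subjects for i in range(len(data[0]))]


def subject_means(data):
    """Mean of each subject (row) across conditions."""
    return [sum(row) / len(row) for row in data]


def grand_mean(data):
    values = [x for row in data for x in row]
    return sum(values) / len(values)


def error_ss(data):
    """SSE: squared deviations of each score from its condition mean."""
    means = condition_means(data)
    return sum((x - means[i]) ** 2 for row in data for i, x in enumerate(row))


def subject_ss(data):
    """SSS: number of conditions times squared deviations of subject means."""
    total_mean = grand_mean(data)
    n_conditions = len(data[0])
    return n_conditions * sum((m - total_mean) ** 2 for m in subject_means(data))


# Only the subjects printed in the slides (1, 2, 3, 10): a partial data set.
visible = [[80, 94, 98], [73, 90, 98], [70, 98, 94], [71, 99, 94]]
print([round(m, 1) for m in subject_means(visible)])  # -> [90.7, 87.0, 87.3, 88.0]

# Made-up toy table in which subjects differ only by a constant offset,
# so all of SSE is between-subjects variability and SSE* is zero.
toy = [[1.0, 2.0], [3.0, 4.0]]
sse_star_toy = error_ss(toy) - subject_ss(toy)
```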

Computing the mean squared error
SSE*: the usual SSE minus the between-subjects sum of squares (SSS)

Recall the different degrees of freedom:
- DFT = N − 1 = 30 − 1 = 29 (total)
- DFG = I − 1 = 3 − 1 = 2 (group)
- DFE = N − I = 30 − 3 = 27 (error)

Subject degrees of freedom (corresponding to SSS):
- DFS = number of subjects in each group − 1 = 10 − 1 = 9

Remove this component from DFE, and what remains is:
- DFE* = DFE − DFS = 27 − 9 = 18

R output
Manually:
$$MSE^{*} = \frac{SSE^{*}}{DFE^{*}} = \frac{276.6}{18} = 15.37$$
F-value:
$$F = \frac{MSG}{MSE^{*}} = \frac{1211.7}{15.37} = 78.83$$

R output:

Error: subject
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
Residuals   9   86.00     9.55

Error: subject:semantics
           Df   Sum Sq  Mean Sq  F value      Pr(>F)
semantics   2  2423.40  1211.70    78.85  1.2428e-09 ***
Residuals  18   276.60    15.37
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 .

Reject the null hypothesis $H_0$, i.e., conclude that the difference in semantic representations does affect the model's learning behavior.
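The manual calculation can be retraced step by step from the sums of squares and sample sizes given in the slides. This is a sketch of the arithmetic only; the variable names are illustrative, and no p-value is computed.

```python
# Sums of squares from the slides.
ssg = 2423.4   # group (semantics) sum of squares
sse = 362.6    # error sum of squares before adjustment
sss = 86.0     # subject sum of squares

n_total = 30       # observations: 10 subjects x 3 conditions
n_conditions = 3
n_subjects = 10

# Degrees of freedom.
dfg = n_conditions - 1        # 2
dfe = n_total - n_conditions  # 27
dfs = n_subjects - 1          # 9
dfe_star = dfe - dfs          # 18

# Remove between-subjects variability from the error term.
sse_star = sse - sss
msg = ssg / dfg
mse_star = sse_star / dfe_star
f_value = msg / mse_star

print(round(mse_star, 2), round(f_value, 2))  # -> 15.37 78.85
```

Dividing by the unrounded MSE* gives 78.85, as in the R output; the slide's 78.83 comes from dividing by the rounded value 15.37.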

Post-hoc tests*
Tukey's Honestly Significant Differences (HSD) test:
- suitable for multiple comparisons when the ANOVA is significant
- requires equal group sizes!
- based on the Studentized range statistic Q
- SPSS doesn't do HSD for repeated measures (use Bonferroni)

Compute HSD manually:
$$q^{*} = \frac{\mu_i - \mu_j}{\sqrt{MSE^{*}/N}}$$
where $N$ is the number of observations per condition (10 in the example).
- Null hypothesis $H_0$: $\mu_i = \mu_j$
- Alternative hypothesis $H_a$: $\mu_i \neq \mu_j$
- Reject $H_0$ if $q^{*} > q$ (check table)

Applying Tukey HSD*
Test the difference between the topic-focus and binding conditions in the example:
$$q^{*} = \frac{95.8 - 94.9}{\sqrt{15.37/10}} = \frac{0.9}{\sqrt{1.537}} = 0.73$$
- q has two degrees of freedom: group size (here 9), and DFE* (here 18)
- q(9, 18) = 6.08 (from the table for the Studentized range statistic)

Hence $q^{*} < q$: do not reject $H_0$ (at α = 0.01).
Conclude: the model learns complex sentences equally well in the topic-focus and binding conditions.

Applying Tukey HSD*
Test the difference between the binding and order-link conditions in the example:
$$q^{*} = \frac{94.9 - 76.3}{\sqrt{15.37/10}} = \frac{18.6}{\sqrt{1.537}} = 15.0$$
- q has two degrees of freedom: group size (here 9), and DFE* (here 18)
- q(9, 18) = 6.08 (from the table for the Studentized range statistic)

Hence $q^{*} > q$: reject $H_0$ (at α = 0.01).
Conclude: the model learns complex sentences more reliably in the binding than in the order-link condition.
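Both comparisons follow the same recipe, which can be sketched as below. The condition means, MSE*, the group size of 10 and the critical value 6.08 are all taken from the slides; the critical value is not recomputed here, and the helper name `tukey_q` is illustrative.

```python
import math


def tukey_q(mean_a, mean_b, mse_star, n_per_condition):
    """Studentized range statistic q* for one pair of condition means."""
    return abs(mean_a - mean_b) / math.sqrt(mse_star / n_per_condition)


MSE_STAR = 15.37        # adjusted mean squared error from the slides
N_PER_CONDITION = 10    # observations per condition
Q_CRITICAL = 6.08       # critical value quoted in the slides (alpha = 0.01)

means = {"order-link": 76.3, "topic-focus": 95.8, "binding": 94.9}

q_tf_binding = tukey_q(means["topic-focus"], means["binding"],
                       MSE_STAR, N_PER_CONDITION)
q_binding_ol = tukey_q(means["binding"], means["order-link"],
                       MSE_STAR, N_PER_CONDITION)

print(round(q_tf_binding, 2), q_tf_binding > Q_CRITICAL)  # -> 0.73 False
print(round(q_binding_ol, 1), q_binding_ol > Q_CRITICAL)  # -> 15.0 True
```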

Repeated measures in factorial design
Note: repeated measures (i.e., within-subjects factors) can also be used in factorial ANOVA.

Example:
- in the previous experiment, include time as another within-subjects factor
- test whether the model learns better (averaged over time) with any one semantics
- test whether the model learns faster with any one semantics

A positive answer is strongly suggested by the model's performance over time, the learning trajectories.

Repeated measures in factorial design
[Figure: utterances correctly produced (%) against number of sentences trained (20000 to 100000), one curve each for topic-focus, binding, and order-link]
Model performance over time (for the three semantics)

Check normality
[Figure: normal Q-Q plot for topicfocus80, sample quantiles against theoretical quantiles]
Check normality and standard deviations for all 2 × 5 subgroups!

Repeated measures in factorial design
We compare the binding with the topic-focus semantics.
Conduct a 2 × 5 repeated measures ANOVA with time and semantics as within-subjects factors:

                 Df      Sum Sq     Mean Sq    F value      Pr(>F)
epoch             4  120875.740   30218.935  646.14094    2.22e-16 ***
Residuals        36    1683.660      46.768

                 Df      Sum Sq     Mean Sq    F value      Pr(>F)
semantics         1   3856.4100   3856.4100   13.41262   0.0052167 **
Residuals         9   2587.6900    287.5211

                 Df      Sum Sq     Mean Sq    F value      Pr(>F)
epoch:semantics   4  1785.14000   446.28500    9.49397  2.3996e-05 ***
Residuals        36  1692.26000    47.00722

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 .
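In this output each effect is tested against its own residual term. The sketch below recomputes the three F-values from the sums of squares and degrees of freedom printed above; the dictionary layout is only an illustrative way of holding those numbers.

```python
# (SS effect, df effect, SS residual, df residual), copied from the R output.
strata = {
    "epoch": (120875.74, 4, 1683.66, 36),
    "semantics": (3856.41, 1, 2587.69, 9),
    "epoch:semantics": (1785.14, 4, 1692.26, 36),
}

f_values = {}
for effect, (ss_effect, df_effect, ss_resid, df_resid) in strata.items():
    ms_effect = ss_effect / df_effect   # mean square of the effect
    ms_resid = ss_resid / df_resid      # mean square of its own error term
    f_values[effect] = ms_effect / ms_resid

for effect, f in f_values.items():
    print(effect, round(f, 2))
# -> epoch 646.14
# -> semantics 13.41
# -> epoch:semantics 9.49
```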

Visualizing interaction
[Figure: interaction plot of mean sentence accuracy by semantics (binding, topic-focus), one line per epoch (20000 to 100000)]
Interaction: although the model reaches similar proficiency with both semantics, it learns significantly faster in the topic-focus condition.

Mixed factor ANOVA design
Often, subjects are divided into separate groups, e.g.,
- gender: male/female
- age: 3/4-year old children
- type of language impairment: Wernicke/Broca aphasia
- mother tongue: Dutch, English, German

but the subjects in each group are tested in several conditions.

Mixed-factors: n-way ANOVA with between-subjects and within-subjects factors
In fact, perhaps the most common ANOVA design (see next example)

Mixed factor ANOVA: example
Withaar & Stowe investigated the effects of syntax and phonology on the processing time of relative clauses.

Task: read sentences word-by-word on a computer screen, pressing a button to see the following word. The times between button presses are measured (reading times).

Syntax: difference between relative clause types where
- relative pronouns are understood as subjects: de bakker die de tuinmannen verjaagt
- relative pronouns are understood as objects: de bakker die de tuinmannen verjagen

Phonology: rhyming vs. non-rhyming words in the relative clause (Longoni, Richardson & Aiello showed that word lists with rhyming elements take longer to process)

Syntax, rhyme, reaction times
Design: four kinds of sentences shown, one group of participants each for rhymed and non-rhymed sentences, both syntactic structures shown to each group.

                                Syntax (within-subjects)
Phonology (between-subjects)    Object relative        Subject relative
non-rhyming                     non-rhym. obj.-rel.    non-rhym. subj.-rel.
rhyming                         rhym. obj.-rel.        rhym. subj.-rel.

Extras: W&S also controlled for subject's attention span, and for which sentences were shown (no similar sentences shown to the same subject)
Measurement: time needed for the last word in the relative clause

Data: means and SDs of four groups
Note: no SD is twice as large as another (but it's close...)
Factorial ANOVA question: are the means significantly different?

Normality assumption Look at data: are distributions normal? Rhymed and unrhymed object-relatives

Normality assumption Rhymed and unrhymed subject-relatives Remark: longest reaction time good candidate for elimination (worth checking on)

Multiple questions
Again, we ask two/three questions simultaneously:
1. Is rhyme affecting word processing time?
2. Do relative clause types affect processing time?
3. Do the effects interact, or are they independent?

Questions 1 & 2 might have been asked in separate one-way ANOVA designs (but these would have been more costly in number of subjects).
Question 3 can only be answered with factorial ANOVA.

Visualizing ANOVA questions Question 1: Is rhyme affecting processing time? Note: similar box plots for rhyme in subject-relatives

Visualizing ANOVA questions Question 2: Does relative clause type affect processing time? Little skew, different medians, large overlap: difficult to tell

Visualizing interaction If no interaction, lines should be parallel. In fact, rhyming speeds processing of object relatives. Multiple ANOVA will measure this exactly.

Mixed-factor ANOVA in SPSS
- Syntax: within-subjects factor (repeated measures)
- Phonology: between-subjects factor

                                Syntax (within-subjects)
Phonology (between-subjects)    Object relative        Subject relative
non-rhyming                     non-rhym. obj.-rel.    non-rhym. subj.-rel.
rhyming                         rhym. obj.-rel.        rhym. subj.-rel.

Invoke: repeated measures
- define distinct factors
- take care not to mix them up!

Mixed-factor ANOVA results Between-subjects (row) effects (rhyme/no rhyme): Hence, rhyme does not significantly affect processing speed

Mixed-factor ANOVA results Within-subjects (column) effects (object- vs subject-relatives): Hence, syntax has a profound effect on processing speed; no interaction (in spite of graph!)

Repeated measures ANOVA: summary
Repeated measures ANOVA:
- generalized related-samples t-test
- assumptions as in standard ANOVA, except for independence
- required whenever a group of subjects is measured under different conditions
- eliminates between-subjects variance from MSE
- typical applications:
  - linguistic ability of children measured over time
  - cognitive function in the same group of subjects tested under different conditions
  - computational learning models compared for different input environments
- advantage over independent samples: efficient in experimental design

Next week
Correlation and regression