The Developmental Lexicon (DeveL) Project Database

This file describes the electronic supplement to the article

Schröter, P., & Schroeder, S. (in press). The Developmental Lexicon Project: A Behavioral Database to Investigate Visual Word Recognition Across the Lifespan. Behavior Research Methods.

Please refer to the full manuscript for further information.

The DeveL database provides item estimates for 1152 German words in seven age groups (Grade 1, Grade 2, Grade 3, Grade 4, Grade 6, Young Adults, Old Adults) for two tasks (Lexical Decision, LD; Naming, NAM) and five outcome variables (Lexical Decision: Accuracy, Response Latency; Naming: Accuracy, Onset Time, Pronunciation Duration). In addition, accompanying linguistic characteristics for the 1152 words are provided.

Structure of the DeveL Database

Data are provided as an R data file (DeveL.RData). The file contains five data frames, each corresponding to one of the dependent variables (Lexical Decision accuracy: ld.acc, Lexical Decision RT: ld.rt, Naming accuracy: nam.acc, Naming Onset Time: nam.on, Naming Duration Time: nam.dur). In each data frame, the data for each of the seven age groups (Grade 1: g1, Grade 2: g2, Grade 3: g3, Grade 4: g4, Grade 6: g6, Young Adults: ya, Old Adults: oa) are represented by a set of three columns (Naming Duration Time is only available for Grades 1-4).

The first column (n) gives the number of data points on which the item parameter in a group is based.

The second column (m) provides the estimated item effect for each word in this age group. Item effects were estimated from the random effects (best linear unbiased predictors; see Bates et al., 2016) of a mixed-effects model that was fitted separately for each age group; each random effect was added to the overall intercept of that group. To ease interpretation, responses were back-transformed from the logit scale to proportion correct for the accuracy measures and from the log scale to milliseconds for all RT measures.

The last column (se) gives the standard error of the item effect for each age group. It combines the uncertainty about the random item effect and the uncertainty about the overall group intercept.

Finally, a sixth data frame (item) provides important linguistic characteristics for all words included in the database (described below). All string variables are encoded in UTF-8. The names and order of the variables in this data frame correspond to their description below. All data frames can be combined using word as a linking variable.

Linguistic Variables

Frequency characteristics

Normalized type frequency refers to the number of occurrences of a type, i.e., a distinct word form in a corpus, per million tokens. We included frequency norms from both the childlex corpus (version 0.16, December 2015; see Schroeder, Würzner, Heister, Geyken, & Kliegl, 2015) and the DWDS corpus (Digitales Wörterbuch der deutschen Sprache, version 0.4, January 2014; see Geyken, 2007). The childlex norms are derived from a set of ten million tokens drawn from 500 of the most popular German children's books. The DWDS corpus is based on 120 million tokens extracted from various books and newspapers for adults.

Lemma frequency is the total number of occurrences of a distinct word stem (lemma) per million words (e.g., NAME for NAMEN, NAMENS, etc.). Again, we included lemma frequency norms from both the childlex and the DWDS corpus.

Subjective frequency refers to the rated frequency of words in spoken and written German. Norms are derived from a rating study conducted with 100 German university students, who rated the use and occurrence of a word on a seven-point Likert scale ranging from 1 (never) to 7 (several times a day).
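The back-transformation of item effects described under "Structure of the DeveL Database" can be illustrated with a minimal sketch. It assumes that the accuracy models work on the logit scale and that latencies were modelled as natural logarithms of milliseconds; the base of the logarithm is an assumption, as are the function names and the numeric values, none of which are taken from the database.

```python
import math


def inv_logit(x: float) -> float:
    """Map a value on the logit scale to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def item_accuracy(group_intercept: float, item_effect: float) -> float:
    """Proportion correct for one item: intercept plus random item effect,
    back-transformed from the logit scale."""
    return inv_logit(group_intercept + item_effect)


def item_latency_ms(group_intercept: float, item_effect: float) -> float:
    """Latency in ms for one item: intercept plus random item effect,
    back-transformed from the (assumed natural) log scale."""
    return math.exp(group_intercept + item_effect)


# Hypothetical model estimates, for illustration only (not DeveL values).
print(round(item_accuracy(2.0, -0.5), 3))
print(round(item_latency_ms(6.5, 0.1), 1))
```

Because the transformation is applied after the item effect has been added to the group intercept, the resulting values are directly interpretable as proportions correct or milliseconds for that age group.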

Age of acquisition is the estimated mean age in years at which a word was acquired. Data were provided by 100 German university students, who were asked to write down the age at which they believed they had first heard or used a word.

Orthographic characteristics

Length is the (integer) number of letters in a word.

Unigram frequency is the summed unigram frequency of each letter in a word based on the childlex unigram type frequencies.

Bigram frequency refers to the summed bigram frequency based on type bigram frequencies in the childlex corpus. Here, a bigram is defined as a sequence of two letters within a word. The summed bigram frequency of a word (e.g., NAME) is the sum of the frequencies of its successive bigrams, where the beginning and end of a word are also treated as letters (e.g., $N, NA, AM, ME, and E$).

Trigram frequency, which is also based on childlex type frequencies, is the sum of the frequencies of all sequences of three letters within a word (again treating the beginning and end of a word as separate letters, e.g., $NA, NAM, AME, and ME$).

N refers to Coltheart's N, which is the number of words that can be obtained by changing one letter in a word while keeping the identity and positions of the other letters constant (Coltheart, Davelaar, Jonasson, & Besner, 1977). As NAME, for example, can be changed into DAME, NAHE, and NASE, the number of its orthographic neighbors is 3. Values based on both the childlex and the DWDS corpus are reported.

OLD20 is the mean Levenshtein Distance from a word to its 20 closest orthographic neighbors. The Levenshtein Distance measures the distance between two letter strings as the minimum number of changes, i.e., substitutions, additions, and deletions, that are required to generate one word from the other. For NAME, the Levenshtein Distance to NAHE would be 1 (for the substitution of M with H), whereas to NARBE it would be 2 (for the substitution of M with R and the addition of B). As OLD20 does not require all neighbors to have the same length, it captures a wider range of orthographic variability than Coltheart's N. OLD20 was computed according to the procedure introduced by Yarkoni, Balota, and Yap (2008), as implemented in the vwr package in R (Keuleers, 2015), using down-cased types as the reference lexicon. Again, we included values from both the childlex and the DWDS corpus.

Phonological characteristics

Phonological transcriptions for most of the words were taken from the CELEX corpus (Baayen et al., 1996). Ten words that were not included in the CELEX database were transcribed manually.

Phonetic transcription is the visual representation of speech sounds through a phonetic script. Here, the DISC format was used, a machine-readable phonetic alphabet based on the IPA (International Phonetic Alphabet).

Number of phonemes refers to the sum of all contrastive phonological units in a word. As NAME, for example, consists of the phonological units /n/, /a/, /m/, and /ə/, the number of its phonemes is 4.

Number of syllables refers to the sum of all uninterrupted units of speech sound in a word.

Syllable structure shows the composition of each syllable in a word by denoting the presence and sequence of its vowels (V) and consonants (C). The syllable structure of NAME, for example, is [CV][CV].

Syllable parse shows the decomposition of a word into its syllables separated by a hyphen.
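The orthographic measures defined above (boundary-padded n-grams, Coltheart's N, Levenshtein Distance, and OLD20) can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the procedure used for the database: the published values come from the vwr package in R and from the childlex and DWDS lexica, whereas the function names, the toy lexicon, and the small value of k used here are hypothetical.

```python
BOUNDARY = "$"  # word-boundary marker, as in the $N, NA, AM, ME, E$ example


def ngrams(word, n):
    """Successive letter n-grams, with one boundary marker on each side."""
    padded = BOUNDARY + word + BOUNDARY
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def summed_ngram_frequency(word, n, freq):
    """Sum the frequencies of a word's n-grams; unseen n-grams count as 0."""
    return sum(freq.get(g, 0) for g in ngrams(word, n))


def coltheart_n(word, lexicon):
    """Number of same-length lexicon words differing in exactly one letter."""
    return sum(
        1
        for other in lexicon
        if len(other) == len(word)
        and sum(a != b for a, b in zip(word, other)) == 1
    )


def levenshtein(a, b):
    """Minimum number of substitutions, additions, and deletions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # addition
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def old_k(word, lexicon, k=20):
    """Mean Levenshtein Distance to the k closest other words (OLD20 if k=20)."""
    distances = sorted(levenshtein(word, w) for w in lexicon if w != word)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy lexicon for illustration only; real values need a full reference lexicon.
toy_lexicon = ["NAME", "DAME", "NAHE", "NASE", "NARBE", "HAUS"]
print(ngrams("NAME", 2))
print(ngrams("NAME", 3))
print(coltheart_n("NAME", toy_lexicon))
print(levenshtein("NAME", "NARBE"))
print(old_k("NAME", toy_lexicon, k=4))
```

With a lexicon this small, k has to be reduced; with k = 20 and a realistic lexicon, `old_k` corresponds to the OLD20 definition given above.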

Morphological characteristics

Part of speech (PoS) specifies the syntactic function of the word. Here, a simplified version of the Stuttgart-Tübingen-Tagset (STTS) was used, distinguishing between nouns (N), verbs (V), and adjectives/adverbs (A), which are the only parts of speech that were used in the project.

Morpheme parse shows the decomposition of a word into its morphological constituents through distinct separators. We transcribed words manually and used # for a boundary between two stems, + for a boundary between a prefix and a stem, and ~ for a boundary between a suffix and a stem. Curly brackets {} indicate inflection.

Number of morphemes refers to the sum of all morphemes in a word (not including inflection). Whereas NAME consists of only one morpheme, VORNAME has two (VOR + NAME).

Morphological status refers to the composition of the word according to its meaning-carrying constituents. M denotes mono-morphemic status (e.g., NAME), C a compound (e.g., SPITZ NAME, Engl. nickname), and D a derivation (e.g., VOR NAME, Engl. forename).

Morphological segmentation refers to the composition of the word as a sequence of stems and affixes, where S denotes a stem and A an affix.

Semantic characteristics

Imageability refers to the mean degree of how easily a word elicits mental images. Values are derived from a rating study conducted with 100 German university students, who were asked to indicate how easily they could think of an image given a single word. They rated imageability on a seven-point Likert scale ranging from 1 (hard to imagine) to 7 (easy to imagine).

Valence refers to the mean degree of how much emotional valence a word carries, extending from attractiveness (positive valence) to aversiveness (negative valence). Data were provided by 100 German university students, who rated emotional valence using Self-Assessment Manikins (SAMs; Lang, 1980) on a seven-point Likert scale ranging from -3 (very negative) through 0 (neutral) to +3 (very positive).

Arousal refers to the mean degree of how much alertness a word provokes. Values are derived from a rating study in which SAMs were used to depict increasing degrees of arousal. 100 German university students rated arousal on a five-point Likert scale ranging from 1 (low arousal) to 5 (high arousal).
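The relation between the morpheme parse and the number of morphemes described under "Morphological characteristics" can be sketched as follows. The example strings and the function name are hypothetical: the exact layout of the parse strings in the item data frame is not documented here, so the sketch assumes only that constituents are delimited by #, +, or ~ and that inflectional material is wrapped in curly brackets.

```python
import re

# Assumption: inflectional material is enclosed in curly brackets.
INFLECTION = re.compile(r"\{[^{}]*\}")
# Separators: stem#stem, prefix+stem, stem~suffix.
SEPARATORS = re.compile(r"[#+~]")


def count_morphemes(parse: str) -> int:
    """Count the constituents of a morpheme parse, ignoring inflection."""
    without_inflection = INFLECTION.sub("", parse)
    parts = [p for p in SEPARATORS.split(without_inflection) if p]
    return len(parts)


# Hypothetical parse strings, for illustration only.
for parse in ["NAME", "VOR+NAME", "SPITZ#NAME"]:
    print(parse, count_morphemes(parse))
```

Dropping the bracketed material before splitting mirrors the statement that the number of morphemes does not include inflection.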