A Corpus of Dutch Aphasic Speech: Sketching the Design and Performing a Pilot Study. E. N. Westerhout November 10, 2005

Abstract

In this thesis, a pilot study for the development of a corpus of Dutch aphasic speech (CoDAS) is presented. Given the lack of resources of this kind, not only for Dutch but also for other languages, CoDAS will be able to set standards and will contribute to future research in this area.

A corpus of Dutch aphasic speech should fulfill at least three requirements. First, it should encode a plausible sample of contemporary Dutch as spoken by aphasic patients. That is, it should include speech representing different types of aphasia as well as various communication settings. Second, the speech fragments should be documented with the relevant metadata, which should include information about the speaker and the aphasia. Third, the corpus should be enriched with various kinds of linguistic information.

Given the special character of the speech contained in CoDAS, we cannot simply carry over the design and the annotation protocols of existing corpora, such as the SDC or CHILDES. However, they have been taken as a starting point. In our pilot study, we have established the basic requirements with respect to text types, metadata, and annotation levels that CoDAS should fulfill. In this respect, we have investigated whether and how the procedures and protocols for annotation and transcription used for the SDC should be adapted in order to annotate and transcribe aphasic speech properly. In particular, suggestions for improving the existing protocols are given for the orthographic transcription and the part-of-speech tagging. The phonetic transcription procedure used within the SDC, on the other hand, can be adopted without major modifications.

Acknowledgements

First of all, I would like to thank my supervisors, Dr. Paola Monachesi and Dr. Esther Janse, for their valuable guidance and encouragement during the writing process. I would also like to thank the six clients of the Afasiecentrum in Capelle aan den IJssel who participated in the pilot study, and Mia Verschaeve, Speech and Language Therapist at this center. My thanks also go to the Speech and Language Therapist Janneke Wolters, who administered the AATs to the patients. To the members of the ILK group in Tilburg, especially Dr. Erwin Marsi and Dr. Antal van den Bosch: thank you for performing the automatic transcriptions needed within the pilot study. Last but not least, I would like to thank Anne Marie van de Zande. We started to work on this project together. Unfortunately, our subjects proved to be too different, so we decided to work on our own projects. Nevertheless, we met several times in Utrecht to work together. Working together is definitely more fun than working alone.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Overview
    1.2.1 Part 1: Aphasia and Corpora
    1.2.2 Part 2: Corpus Design
    1.2.3 Part 3: The Pilot Study
    1.2.4 Conclusions and suggestions for future research
2 Aphasia
  2.1 Causes
  2.2 Varieties
  2.3 Patients
    2.3.1 The Akense Afasie Test (AAT) ("Aachen Aphasia Test")
    2.3.2 The six patients
    2.3.3 Data used for the pilot study
3 Corpora
  3.1 Characteristics
  3.2 Types
  3.3 General issues
  3.4 Levels of annotation and transcription
  3.5 Three important corpora
    3.5.1 Brown Corpus
    3.5.2 Lancaster-Oslo-Bergen Corpus
    3.5.3 British National Corpus
4 Relevant corpora for this project
  4.1 CHILDES
    4.1.1 Three components
    4.1.2 Design issues
    4.1.3 Metadata
    4.1.4 Levels of annotation and transcription
  4.2 The Spoken Dutch Corpus
    4.2.1 Design issues
    4.2.2 Text types
    4.2.3 Metadata
    4.2.4 Levels of annotation and transcription
5 Corpus design
  5.1 Purpose
  5.2 Permissions
  5.3 Text types
    5.3.1 Deviating from the Spoken Dutch Corpus
    5.3.2 CoDAS
  5.4 Metadata
  5.5 Levels of annotation and transcription
6 Orthographic Transcription
  6.1 Criteria for guidelines
  6.2 The EAGLES guidelines
    6.2.1 Spelling guidelines
    6.2.2 Unidentifiable material
  6.3 The CHILDES project
    6.3.1 Spelling guidelines corresponding to the EAGLES guidelines
    6.3.2 Complementary spelling guidelines
    6.3.3 Unidentifiable material
    6.3.4 Transcription of aphasic speech in CHILDES
  6.4 The Spoken Dutch Corpus
    6.4.1 Spelling guidelines corresponding to the EAGLES guidelines
    6.4.2 Complementary spelling guidelines
    6.4.3 Unidentifiable material
  6.5 Orthographic transcription of the non-fluent speech
    6.5.1 Problematic issues
    6.5.2 Transcription of the problematic issues
7 Phonetic Transcription
  7.1 The Spoken Dutch Corpus
    7.1.1 Phonetic transcription files
    7.1.2 The symbol set
    7.1.3 Automatic generation of phonetic transcriptions
    7.1.4 TreeTalk
    7.1.5 Verification and correction
  7.2 Phonetic transcription of the non-fluent speech
8 Lemmatization and part-of-speech tagging
  8.1 EAGLES guidelines
  8.2 CHILDES
  8.3 The Spoken Dutch Corpus
    8.3.1 Lemmatization
    8.3.2 Part-of-speech tagging
  8.4 Tagging the non-fluent speech
    8.4.1 Performance of the Memory-Based Tagger (MBT)
    8.4.2 Improving the performance of the Memory-Based Tagger
9 Conclusions
  9.1 Future research

Appendices
A Metadata of the SDC and CoDAS
  A.1 Metadata about the recordings
  A.2 Metadata about the participants
    A.2.1 Metadata of the SDC
    A.2.2 Complementary metadata for CoDAS
B Orthographic transcriptions
  B.1 Patient 1
  B.2 Patient 4
C The SDC symbol set
D EAGLES recommended subcategories and values
E Tagset of the Spoken Dutch Corpus
  E.1 Obligatory
  E.2 Recommended
F Part-of-Speech tagging
  F.1 Patient 1
  F.2 Patient 4

Chapter 1 Introduction

1.1 Motivation

In 2004, the Spoken Dutch Corpus (SDC) ("Corpus Gesproken Nederlands") project finished (Oostdijk et al., 2002). This project aimed at the construction of a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. However, the content of the corpus is restricted, because it only contains speech from adults with intact speech abilities. Speech from persons with aphasia or other speech and language disorders has not been included. The SDC project gave rise to the question whether it would be interesting to develop another corpus of spoken Dutch, namely a specialized corpus containing only Dutch aphasic speech.

In this thesis we argue that it would indeed be interesting and useful to have a Corpus of Dutch Aphasic Speech (CoDAS). Because of the special character of the speech contained in such a corpus, its design will differ from the design of the SDC. One of the purposes of this thesis is to sketch a design for a Corpus of Dutch Aphasic Speech.

Just as the design of the corpus differs from that of the SDC, the annotation of the aphasic speech will also have to be performed in a different way, because aphasic speech differs from normal speech in several respects. Therefore, a pilot study has been carried out to investigate which changes are needed to make it possible to annotate and tag the speech of aphasics. For the purpose of the pilot study, we have focused on the speech of non-fluent aphasics. The orthographic transcription and part-of-speech tagging of this kind of speech have been examined thoroughly. We have also taken the phonetic transcription into consideration. The second goal of the thesis is thus to investigate which problems with respect to annotation and transcription should be tackled when a Corpus of Dutch Aphasic Speech is developed.

In summary, the thesis is intended to serve as a preparatory study for the setup of a Corpus of Dutch Aphasic Speech and focuses on two aspects. First, corpus design issues are considered. Second, the thesis investigates how the annotation and transcription of such a corpus should be performed. To this end, a pilot study was carried out in which non-fluent aphasic speech was examined.

1.2 Overview

The thesis can be divided into three parts. The first part contains three introductory chapters providing background information about the language impairment aphasia, corpora in general, and two corpora that are relevant for this project. Part 2 is about the design requirements that should be met when a Corpus of Dutch Aphasic Speech is designed. The pilot study is the topic of the third part, which comprises Chapters 6, 7, and 8. In the pilot study, we focus on three annotation levels, namely the orthographic transcription, the phonetic transcription, and the part-of-speech tagging.

1.2.1 Part 1: Aphasia and Corpora

Chapter 2 focuses on the language impairment aphasia. The three main causes of aphasia are stroke, trauma, and tumors. Depending on the location and the size of the impairment, different aspects of speech can be disturbed. Therefore, different types of aphasia are distinguished. The last section of this chapter focuses on the patients involved in the pilot study. Within this section we discuss the Akense Afasie Test (AAT, "Aachen Aphasia Test"), the test we used to diagnose the aphasic patients involved in the pilot study, and the scores of the patients on this test. For the pilot study, we made use of the first component of the AAT (the spontaneous language sample).

Corpora are an essential tool for linguistic research. Chapter 3 is about corpora and discusses some of their general characteristics. It proceeds by introducing different distinctions made to typify corpora (e.g. written vs. spoken, synchronic vs. diachronic). Before a corpus can be developed, several design issues have to be dealt with. These issues are also covered in this chapter. Thereafter, the different annotation and transcription levels are discussed. The chapter ends with some examples of existing corpora.

In Chapter 4, two corpora that are relevant for this project are discussed. These two corpora are CHILDES (MacWhinney, 2000a,b) and the Spoken Dutch Corpus (Oostdijk et al., 2002). The CHILDES system is of particular interest for this project, because the kind of speech contained in this system also deviates from normal speech. The second corpus we used, the SDC, is relevant for this project because it contains only Dutch speech and is accompanied by extensive, detailed protocols for the transcription of Dutch speech.

1.2.2 Part 2: Corpus Design

Chapter 5 is about the design of a Corpus of Dutch Aphasic Speech. The design of a corpus depends heavily on its purpose. Therefore, the chapter starts by formulating the purpose a Corpus of Dutch Aphasic Speech would serve. The chapter then proceeds with several other relevant aspects of corpus design, such as obtaining permissions for using speech, the text types that should be included, which metadata about the patients are relevant, and at which levels the speech should be annotated and transcribed. Obtaining permissions for using aphasic speech and making it available to other researchers could be problematic. A committee has to grant these permissions. Even if the use of the speech transcripts is allowed, it may not be possible to use the speech recordings for privacy reasons.

1.2.3 Part 3: The Pilot Study

The first level of transcription that has been examined is the orthographic transcription (Chapter 6). An orthographic transcription is a verbatim record of what was actually said, using the standard spelling conventions. The chapter starts with a comparison of a general set of guidelines (EAGLES), the CHILDES guidelines, and the guidelines developed for the SDC. Thereafter, it discusses the orthographic transcription of six speech fragments according to the guidelines given in the SDC protocol. Some points require special attention, because they are typical of aphasic speech.

Chapter 7 focuses on the phonetic transcription of the aphasic speech. A phonetic transcription provides information on how words are pronounced. For the phonetic transcription of the aphasic speech, the same procedure has been followed as for the phonetic transcription of the SDC. The transcription process was performed automatically, using a grapheme-to-phoneme conversion program.

In Chapter 8, the part-of-speech tagging (POS tagging) of the aphasic speech is discussed. The task of part-of-speech tagging is to assign a part-of-speech label to each word of a text. Like the chapter on the orthographic transcription, this chapter starts with a comparison of the EAGLES guidelines, the method used in CHILDES, and the way the POS tagging has been carried out within the SDC project. The orthographic transcriptions of the aphasic speech were tagged automatically using one of the taggers that was used for the tagging of the SDC. The performance of this tagger on the aphasic speech is discussed in the final section of the chapter.

1.2.4 Conclusions and suggestions for future research

The thesis ends with conclusions and suggestions for future research. Based on the findings of the pilot study, different issues for future research are mentioned.
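As a purely illustrative sketch, and not the procedure used in the pilot study, the three annotation levels can be pictured as parallel layers over the same utterance. The example below uses the English Broca's aphasia fragment quoted in Section 2.2. The mini-lexicon, the tag names, and the phoneme strings are hypothetical stand-ins chosen for illustration; they are not the SDC tagset, the SDC symbol set, or the output of the grapheme-to-phoneme converter and tagger used in this thesis.

```python
# Toy illustration of three annotation layers over one utterance:
# orthographic form, part-of-speech tag, and phonetic transcription.
# LEXICON, its tags, and its phoneme strings are hypothetical.
LEXICON = {
    "boy": ("N", "b OI"),
    "cook": ("V", "k U k"),      # one reading chosen; really noun/verb ambiguous
    "cookie": ("N", "k U k i"),
    "took": ("V", "t U k"),
}


def annotate(orthographic):
    """Return (word, pos, phonemes) triples; unknown words get '?'."""
    layers = []
    for token in orthographic.split():
        word = token.strip(".").lower()   # drop the pause dots
        if not word:
            continue
        pos, phonemes = LEXICON.get(word, ("?", "?"))
        layers.append((word, pos, phonemes))
    return layers


if __name__ == "__main__":
    for word, pos, phonemes in annotate("Boy... cook... cookie... took... cookie."):
        print(f"{word}\t{pos}\t{phonemes}")
```

A non-word such as "partender" falls outside the lexicon and receives "?" on both derived layers, which hints at why aphasic speech may need adapted protocols at each level.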

Chapter 2 Aphasia

The abilities to understand and produce spoken and written language are located in multiple areas of the brain (in most cases in the left hemisphere). When one of these areas, or the connection between them, is damaged, language production and comprehension become impaired. This language impairment is called aphasia, a word derived from the Greek words a (not) and phasis (to speak). Aphasia is a language disorder; the intellect of aphasia patients is not damaged. In this chapter, the main causes of aphasia are discussed first (Section 2.1), followed by the most common varieties of aphasia (Section 2.2). Section 2.3 describes the characteristics of the patients involved in the pilot study.

2.1 Causes

In the Netherlands, about 30,000 people suffer from aphasia. In 85% of the cases, the cause of aphasia is a CVA (stroke). Other causes are traumatic brain injuries (12%) and brain tumors (3%) (Davidse and Mackenbach, 1984).

CVA
CVA is short for cerebrovascular accident, also referred to as a stroke. A stroke is caused by a lack of blood supply to the brain due to an occlusion (90% of the cases) or by hemorrhage (10% of the cases). Depending on the area of the brain that is damaged, a CVA can cause coma, paralysis (reversible or irreversible), speech problems (aphasia), visual disturbances, and dementia (Wikipedia, 2005b).

Traumatic brain injury
A traumatic brain injury (TBI) is an injury to the brain caused by a severe blow to the head or by being shaken violently. Half of all TBIs are due to transportation accidents involving automobiles, motorcycles, bicycles, and pedestrians. Disabilities resulting from a TBI depend upon the severity of the injury, the location of the injury, and the age and general health of the patient.
Some common disabilities include problems with cognition (thinking, memory, and reasoning), sensory processing (sight, hearing, touch, taste, and smell), communication (expression and understanding), and behavior or mental health (depression, anxiety, personality changes, aggression, acting out, and social inappropriateness). Language and communication problems are common disabilities in TBI patients. Some may experience aphasia; others may have difficulty with the more subtle aspects of communication, such as body language and emotional, non-verbal signals (Wikipedia, 2005c).

Brain tumor
A brain tumor is a mass of unnecessary cells growing in the brain. Brain tumors are classified as benign or malignant, depending on the degree of malignancy or aggressiveness of the tumor. A benign brain tumor consists of very slow-growing cells, usually has distinct borders, and rarely spreads. A malignant brain tumor is usually fast-growing, invasive, and life-threatening; such tumors are often called brain cancer. The time point of symptom onset in the course of the disease correlates in many cases with the nature of the tumor (benign or malignant). Depending on the tumor location and the damage it may have caused to surrounding brain structures, any type of focal neurologic symptom can occur, such as personality changes, cognitive and behavioral impairment, hemiparesis, and aphasia (American Brain Tumor Association, 2004; Wikipedia, 2005a).

2.2 Varieties

Language impairments differ depending on the location and size of the damage. The brain can be divided down the middle lengthwise into two halves called the cerebral hemispheres. One of these two is the dominant hemisphere for a certain task. The dominant hemisphere is more involved than the other hemisphere in governing certain body functions, such as controlling the arm and leg used preferentially in skilled movements. For most individuals, the left hemisphere is dominant for language.
Approximately 70 percent of all individuals with damage to the left hemisphere experience some type of aphasia, whereas only 1 percent of persons with right-hemisphere lesions do (Akmajian et al., 2001).

Each hemisphere is divided into four lobes, namely the frontal lobe, the parietal lobe, the temporal lobe, and the occipital lobe. Broca's area is located in the frontal lobe and Wernicke's area is situated in the temporal lobe of the dominant hemisphere, in the so-called perisylvian speech area. Besides Broca's area and Wernicke's area, this zone contains the supramarginal gyrus, the angular gyrus, and the arcuate fasciculus (Love and Webb, 1996). Figure 2.1 shows where Broca's area, Wernicke's area, and the arcuate fasciculus are situated in the brain. It also shows the primary motor cortex (which controls movements of, among others, the speech muscles), the primary auditory cortex (responsible for processing auditory information), and the primary visual cortex (responsible for processing visual information).

Figure 2.1: Language areas in the brain

For each of the aphasia varieties below, the main speech characteristics are mentioned (Love and Webb, 1996; Dharmaperwira-Prins and Maas, 2002; Blauw-van Mourik and Koning-Haanstra, 1990).

Broca's aphasia (expressive aphasia, motor aphasia)
Broca's aphasia is associated with damage to Broca's area in the brain (red area in Figure 2.1). It is characterized by non-fluent speech containing many pauses. The speech typically has a telegraphic nature, because of the deletion of function words and disturbances in word order. Only the main content words are present; vital connecting words are missing. Repetition of words and phrases is impaired. Patients with Broca's aphasia also have phonological problems: they reduce sound clusters in words. Another characteristic of Broca's aphasia is that the patients encounter word finding difficulties. For example, when a patient is asked what his wife's name is, he might not be able to come up with it. In addition to having impaired speech, people with Broca's aphasia also encounter writing difficulties. Writing is part of expressive language, so damage to Broca's area affects it. Writing can be additionally impaired because of weakness on the right side of the body. People with Broca's aphasia have relatively good comprehension; it is mainly their expressive language that is impaired. Most Broca's aphasics are painfully aware of their own mistakes. The two fragments below illustrate the difficulty Broca's aphasics encounter in speaking (Akmajian et al., 2001).

Examiner: Tell me, what did you do before you retired?
Aphasic: Uh, uh, uh, puh, par, partender, no.
Examiner: Carpenter?
Aphasic: (shaking head yes) Carpenter, tuh, tuh, tenty [20] year.

Examiner: Tell me about this picture.
Aphasic: Boy... cook... cookie... took... cookie.

Wernicke's aphasia (receptive aphasia, sensory aphasia)
Wernicke's aphasia is associated with damage to Wernicke's area in the brain (yellow area in Figure 2.1). It is a fluent aphasia characterized by difficulty in understanding language as well as difficulty in repeating language. The speech is fluent, but paraphasic: parts of words are omitted, words are used incorrectly, neologisms are used, and incorrect phonemes are substituted for correct phonemes.
The content of what these patients say ranges from mildly inappropriate to complete nonsense. Phrase length is normal and the syntactic structures of the sentences are mostly acceptable. Reading ability is generally disturbed, and although writing ability is often retained, what is written may be abnormal. Patients with Wernicke's aphasia may not always be aware of their language difficulties. Akmajian et al. (2001) illustrated Wernicke's aphasia with the examples below.

Examiner: Do you like it here in Kansas City?
Aphasic: Yes, I am.

Examiner: I'd like to have you tell more about your problem.
Aphasic: Yes, I ugh can't hill all of my way. I can't talk all of the things I do, and part of the part I can go allright, but I can't tell from the other people. I usually most of my things. I know what can I talk and know what they are but I can't always come back even though I know they should be in, and I know should something eely I should know what I'm doing...

Conduction aphasia (associative aphasia)
Conduction aphasia is often associated with damage to the connection between the areas of Broca and Wernicke, the arcuate fasciculus (purple in Figure 2.1), or in the left temporal lobe of the auditory association area. The areas themselves are still intact. Patients with conduction aphasia are unable to repeat words, sentences, and phrases. Speech is fluent and paraphasic, just as in Wernicke's aphasia. Auditory comprehension and reading comprehension are fairly good, just as in Broca's aphasia. Although patients with conduction aphasia are able to understand spoken language, they have word finding difficulties during the production of speech. The impact of this condition on reading and writing varies. In most cases, oral reading is paraphasic whereas silent reading is adequate. Spelling is poor, characterized by omissions, reversals, and substitutions of letters and words. Most patients with conduction aphasia are aware of their language problems.

Global aphasia (total aphasia)
Global aphasia is associated with damage to both Broca's and Wernicke's area. The symptoms of global aphasia are those of severe Broca's aphasia and Wernicke's aphasia combined: there is an almost total reduction of all aspects of spoken and written language, in both production and comprehension.
Improvement may occur in one or both areas (expressive and receptive) over time with rehabilitation.

Transcortical aphasia
The site of the lesion can also be situated outside the perisylvian speech area. In that case, the language areas become isolated and cannot be reached. Depending on which language area is isolated, three types of transcortical aphasia can be distinguished: transcortical motor aphasia, transcortical sensory aphasia, and mixed transcortical aphasia. Damage to the area around Broca's area is associated with transcortical motor aphasia. When a patient suffers from transcortical motor aphasia, the paths between Broca's area and the other language areas are cut off. This variety resembles Broca's aphasia, except for the ability to repeat: this ability remains intact in transcortical motor aphasia. Just as transcortical motor aphasia resembles Broca's aphasia, transcortical sensory aphasia resembles Wernicke's aphasia. Here the area around Wernicke's area is damaged. The difference with Wernicke's aphasia is that the ability to repeat remains intact in transcortical sensory aphasia.

Transcortical mixed aphasia, also called isolation of the speech area, involves simple repetition. The only ability that is intact is the ability to repeat: patients echo what is said but can neither produce speech spontaneously nor understand it. This variety resembles global aphasia.

Anomic aphasia (amnes(t)ic aphasia, nominal aphasia)
The main characteristic of anomic aphasia is that the patient has word finding difficulties. The speech is relatively fluent and grammatical and the comprehension is good. The only deficit is trouble finding appropriate words. Anomic aphasia can be the residue of another aphasia type from which the patient has largely recovered, but it can also exist as an aphasia type in its own right. Within anomic aphasia different types can be distinguished, depending on the place that is impaired (e.g. word production anomia, word selection anomia, semantic anomia). To illustrate what kind of problems anomic aphasics encounter, the following examples are given (Akmajian et al., 2001).

Examiner: Who is the president of the United States?
Aphasic: I can't say his name. I know the man, but I can't come out and say... I'm very sorry, I just can't come out and say. I just can't write it to me now.

Examiner: Can you tell me a girl's name?
Aphasic: Of a girl's name, by mean, by which weight, I mean how old or young?

Examiner: On what do we sleep?
Aphasic: Of the week, er, of the night, oh from about 10:00, about 11:00 o'clock at night until about uh 7:00 in the morning

2.3 Patients
Within the pilot study, speech material of six aphasic patients has been considered. These patients were classified by their speech pathologist as being Broca's aphasics. However, we performed an aphasia test on them, which showed that they were not all pure Broca's aphasics. The first part of this section is about the test used to diagnose the patients. Thereafter, we proceed with discussing the speech characteristics of the patients involved in the pilot study.
2.3.1 The Akense Afasie Test (AAT) (Aachen Aphasia Test)
The AAT consists of six subtests, each testing the performance on one particular component of language. The six subtests involve a spontaneous speech sample, a token test, a repetition test, a written language test, a naming test, and a language comprehension test. The test is used to diagnose aphasic patients and to determine the severity and type of aphasia (Graetz et al., 1992).

Spontaneous language sample
The spontaneous language sample consists of a conversation between a speech therapist and an aphasia patient. There are five standard topics that are discussed during the conversation (e.g. profession, family, hobbies). By means of the conversation, six elements are judged, namely (1) Communicative behaviour (COM), (2) Articulation and prosody (ART), (3) Formulaic language (AUT), (4) Semantics (SEM), (5) Phonology (FON), and (6) Syntax (SYN). For each of these elements the patient receives a score between 1 (very impaired) and 5 ((almost) intact).

Token test
The token test starts with a pretest, in which the examiner tests whether the condition of the patient is good enough to perform this test. Thereafter, the test itself starts; the wrong responses are counted to determine the score. The test consists of five parts, each of them containing ten questions. The difficulty level increases over the parts. If a patient has two or fewer correct answers in one part, the remaining parts are skipped and the patient gets 10 points for each of these parts.

Repetition
The repetition test consists of five parts, also with increasing difficulty. The five parts are: (1) Sounds, (2) One-syllable words, (3) Multisyllable words, (4) Morphologically complex words, and (5) Sentences. The judgements are based on the percentages of phonemes or words that are correct, the number of times the speech therapist has to repeat the stimulus, and the number of resumptions.

Written language
The written language component consists of three parts. In the first part, words and phrases have to be read aloud by the patient. When this part is finished, the patient has to compose words and phrases from blocks containing one or more syllables or words. In the last part of this subtest, the patient has to write down dictated words and phrases.

Naming
The naming task involves pictures that have to be named. In the first part, these pictures are objects such as table, cigar, and candle. In the second part, ten colours have to be named. The third part again contains pictures of objects, but now of objects named by compound nouns, such as vacuum cleaner, screw-driver, and sailing boat. In the last part, the pictures show situations, such as a boy playing with a dog, two men quarreling, and a man fishing a boot out of the water.
In this case, the patient has to say in one sentence what the picture is about.

Comprehension
The goal of the last component of the AAT is to test language comprehension. The comprehension is tested in four subtests of ten questions. In each question, the patient is shown four pictures. In the first and second part, the speech therapist reads words and sentences, respectively. The patient has to match the word or sentence heard to the picture that fits it best. In the third and fourth part, the patients have to read the words and sentences themselves and match them to the best fitting picture.

2.3.2 The six patients
For our pilot study, we used speech samples of six patients: three men and three women, with an average age of 54. The patients had already been visiting an aphasia center for some years; the time post onset was between three and four years. To obtain speech data, we made use of the Dutch version of the Aachen Aphasia Test (AAT) (Graetz et al., 1992). A qualified Speech and Language Pathologist conducted the test. Initially, we did not know exactly which parts of the test would be needed for the pilot study. However, because we also wanted to have an indication of the severity of the aphasia, we decided to conduct the whole test. The test data were automatically processed by the computer program AATP, which has been used to classify the patients. The results show that only one of the six patients was a pure Broca's aphasic (patient 4). Four of the patients could not be classified as any one of the types. However, because the pilot study is only intended for investigating possible problems and looking for requirements that must be fulfilled by a full corpus, it was not problematic that the severity and type of aphasia differed among patients.

The speech of all patients had a non-fluent character. The most important score for determining the fluency of a patient is the sixth score within the spontaneous language sample. This is the score that gives information on the syntactic structures of the sentences. The score can vary between 0 (very heavy syntactic disorders) and 5 (no syntactic disorders). For our patients, the score on syntactic structures was 1 or 2. A score of 2 indicates that the sentences are short and usually syntactically incomplete, and that many inflected forms and function words are missing. A score of 1 indicates that the patient uses almost no inflected forms or function words and produces sentences of 1 or 2 words. The test results of the six patients involved in the pilot study are shown in Table 2.1. The figures in this table are the raw scores on the test and the percentages for the various aphasia varieties.
Patient                           1       2       3       4       5          6

Test scores
Spontaneous speech sample
  (communicative behaviour)       3       3       3       2       4          3
  (articulation and prosody)      4       4       4       4       5          4
  (automated language)            4       4       5       4       5          5
  (semantic structure)            3       3       4       3       4          5
  (phonematic structure)          3       3       3       3       4          4
  (syntactic structure)           2       2       2       1       2          2
Token test                        30      37      25      29      10         13
Repetition                        134     103     127     101     125        143
Written language                  62      23      55      68      76         84
Naming                            45      79      97      90      100        97
Comprehension                     99      89      87      92      90         91

AATP scores
Percentage Aphasia                100     100     100     100     98.4       86.5
Percentage Broca                  14.3    47.4    69.1    99.9    8.1        47.0
Percentage Wernicke               26.2    52.6    30.8    0.1     21.4       1.2
Percentage Amnestic               59.5    0       0       0       70.5       51.8
Aphasia type                      ?       ?       ?       Broca   Amnestic   ?

Table 2.1: The scores on the AAT of the patients involved in the pilot study

For the pilot study we restricted ourselves to non-fluent speech. We decided to use a comparable sample, because this makes it possible to draw better conclusions. If all kinds of speech had been represented within the pilot study, it might have been more difficult to see whether problematic issues occur more than once or if they are typical for only this patient. A full Corpus of Dutch Aphasic Speech should contain speech samples from patients representing all different types of aphasia.
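The Token test rule described in Section 2.3.1, which produces the Token test row of Table 2.1, can be illustrated with a small sketch. It is not the AATP implementation. The function name and the input format (number of correct answers per administered part) are hypothetical, and the sketch assumes that the part in which the cutoff is reached is scored on its own errors before the skipped parts are charged 10 points each.

```python
QUESTIONS_PER_PART = 10   # each part contains ten questions
NUMBER_OF_PARTS = 5       # the test consists of five parts
CUTOFF_CORRECT = 2        # two or fewer correct answers stops the test


def token_test_score(correct_per_part):
    """Return the error score for the Token test.

    correct_per_part lists the number of correct answers for each part
    that was administered, in order. Parts after a cutoff may be omitted.
    """
    total_errors = 0
    for index in range(NUMBER_OF_PARTS):
        if index >= len(correct_per_part):
            raise ValueError("missing result for an administered part")
        correct = correct_per_part[index]
        if not 0 <= correct <= QUESTIONS_PER_PART:
            raise ValueError("correct answers must lie between 0 and 10")
        # Wrong responses are counted to determine the score.
        total_errors += QUESTIONS_PER_PART - correct
        if correct <= CUTOFF_CORRECT:
            # Assumption: every skipped part is charged the full 10 points.
            remaining_parts = NUMBER_OF_PARTS - index - 1
            total_errors += remaining_parts * QUESTIONS_PER_PART
            break
    return total_errors


if __name__ == "__main__":
    print(token_test_score([10, 9, 8, 7, 6]))  # all five parts administered
    print(token_test_score([8, 2]))            # test stopped after part two
```

Under these assumptions a higher score means a more severe impairment, which fits the way the wrong responses are counted in the description of the test.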

2.3.3 Data used for the pilot study
Only the first component of the AAT, the spontaneous speech sample, was used for the pilot study. The samples contain between 300 and 500 words spoken by the aphasic patient. Although the conversation is not completely spontaneous, because the topics are already determined, we nevertheless used these samples. It is rather difficult to obtain completely spontaneous speech from aphasic patients, because they do not speak as much as people without speech impairments do. The content of their utterances is usually very informative and many of the patients only speak when it is necessary. Therefore, the interview is a good alternative for collecting spontaneous speech samples. The other components of the AAT were not used for the pilot study, but were necessary for determining the severity and type of aphasia the patients have.

Chapter 3 Corpora

A lot of linguistic research is done by means of a corpus. The word corpus is derived from the Latin word corpus, meaning body, and refers to a collection of texts. A corpus can be divided into subcorpora. A subcorpus has all the properties of a corpus but is part of a larger corpus. Corpora and subcorpora are divided into components. A component is not necessarily an adequate sample of a language and in that way it is distinct from a corpus and a subcorpus. It is a collection of pieces of language that are selected and ordered according to a set of linguistic criteria that serve to characterize its linguistic homogeneity. Whereas a corpus may illustrate heterogeneity, and also a subcorpus to some extent, the component illustrates a particular type of language (Sinclair, 1996).

This chapter first discusses some of the characteristics of a corpus (Section 3.1). Thereafter, it proceeds with discussing some general distinctions that can be drawn to describe a corpus (Section 3.2). Section 3.3 covers the general issues that have to be considered before a corpus can be developed. Section 3.4 is about the different transcription and annotation levels that can be added to enrich a corpus. In Section 3.5, some important corpora are discussed, such as the British National Corpus and the Brown Corpus.

3.1 Characteristics
According to Sinclair (1996), a corpus is assumed to have certain standard properties. Unless stated otherwise, these characteristics are attributed to anything called a corpus. A corpus which has one or more non-default values for these characteristics is called a special corpus: its title should specify its deviations from the assumptions. The four characteristics given by Sinclair (1996) are:

Quantity = large
The default value of quantity is large. A corpus is assumed to contain a large number of words. The whole point of assembling a corpus is to gather data in quantity.
It has to be stressed here that any corpus, however big, is always a minuscule sample of all the speech and writing produced by all the users of a language. The minimum size is not exactly specified, but some examples show that the sizes of important existing corpora are very large. For instance, the British National Corpus consists of 100 million words collected from samples of written and spoken British English, and the Spoken Dutch Corpus comprises 10 million words of contemporary standard Dutch spoken by adults living in the Netherlands and Flanders.

Quality = authentic
The default value for quality is authentic. All the material is gathered from the genuine communications of people going about their normal business. Corpora of the language of children, geriatrics, non-native speakers, users of extreme dialects, and very specialized areas of communication should be designated special corpora because of the unrepresentative nature of the language involved.

Simplicity = plain text
The default value of simplicity is plain text. This means that the user can expect an unbroken string of ASCII characters, with any mark-up clearly identified, and separable from the text. Nowadays for most corpora the texts are stored in XML format. This markup language has been carefully designed and does not impose any additional linguistic information on the text. Largely, its role in relation to text representation is to preserve in linear coding some features which would otherwise be lost.

Documented = yes
The default value for documented is yes. This means that full details about the constituents of a component are kept separately from the component itself. Corpus users seem to prefer to keep the documentation of texts in a separate place from the texts themselves, and to include only a minimal header that contains a reference to the documentation. For the management of corpora this practice allows the effective separation of plain text from annotation with only a small amount of programming effort.

According to MacEnery and Wilson (1996), a corpus used in corpus linguistics has four characteristics. First, the corpus is a representative sample of a language variety. Second, the term corpus implies a body of text of finite size.
Although this is not always the case (there also exist so-called monitor corpora to which texts can be added later), the majority of the existing corpora are finite in size. The third characteristic is that the corpus should be machine-readable. Advantages of machine-readable corpora over written or spoken formats are that they can be searched and manipulated at speed and that it is easier to enrich the corpus with extra information, such as part-of-speech tags. The last characteristic is that a corpus should constitute a standard reference for the language variety that it represents. Therefore, the corpus should be available for other researchers.

3.2 Types
Corpora can be subdivided according to different criteria. Some general distinctions are discussed in this section, such as the distinctions between written and spoken corpora and between synchronic and diachronic corpora.

General corpora versus specialized corpora
The first distinction that can be drawn is the distinction between corpora that are compiled for general purpose research (general corpora) and corpora that are highly domain-specific (specialized corpora). Corpora compiled for general purpose research are generally used for a wide variety of different research objectives. Because the scope of a specialized corpus is more specific, the group of researchers interested in such a corpus usually is smaller. However, it can be used for instance to highlight particular differences between standard language and specific registers (Kennedy, 1998). An example of a general corpus is the Spoken Dutch Corpus (Oostdijk et al., 2002). The CHILDES project can be classified as a specialized corpus, because it contains only corpora on child language and impaired language (MacWhinney, 2000a,b).

Written corpora versus spoken corpora
Initially, all language corpora consisted of written material collected from already existing text sources that were often electronically available (e.g. novels, newspapers, manuals). Nowadays, spoken language corpora have also been developed; in such corpora recorded speech has been transcribed. However, the differences between text and speech data are very complex, as orthographically transcribed speech is not the same as written text. Gibbon et al. (1997) mention eight important differences between written texts and spoken language that have to be taken into account. One example is the durability of text: written text stays on the paper once it is written down, whereas speech is transient and therefore has to be recorded to make it accessible for future use. This is a rather trivial distinction, but a more practical difference is the time and money involved in the development of corpora: because speech has to be recorded and transcribed, developing spoken corpora is more time-consuming and more expensive.
A third difference concerns the editing behaviour of speakers: interruptions, hesitations, repetitions of words, and self-repairs are properties of spoken language usually not present in written texts. The Spoken Dutch Corpus is a prime example of a spoken corpus (Oostdijk et al., 2002), whereas the Brown Corpus contains only texts from written sources (Francis and Kucera, 1979).

Synchronic corpora versus diachronic corpora
Corpora can be designed and used for synchronic or diachronic studies. A synchronic corpus is an attempt to represent a language or a text type of one particular time span, whereas a diachronic corpus represents a language or text type over a period of time in order to make it possible to investigate language changes and differences over time (Kennedy, 1998). Most corpora are synchronic; examples are the British National Corpus (Aston and Burnard, 1998) and the Spoken Dutch Corpus (Oostdijk et al., 2002). The Helsinki Corpus of English Texts contains a diachronic part covering the period between 750 and 1700 (Kytö, 1996).

Monolingual corpora versus multilingual corpora
A corpus may contain texts in one language (monolingual corpus) or in multiple languages (multilingual corpus). Most corpora are monolingual, such as the British National Corpus (Aston and Burnard, 1998) and the Spoken Dutch Corpus (Oostdijk et al., 2002). Within the multilingual corpora a distinction is made between comparable corpora and parallel corpora. A parallel corpus is a collection of texts, each of which is translated into one or more languages other than the original. The simplest case is where only two languages are involved: one of the corpora is an exact translation of the other. Some parallel corpora, however, exist in several languages. Parallel corpora are considered to be a very interesting research topic at the moment, because of the opportunity to align the original text and the translation, and to gain insights into the nature of translation. The English-Norwegian Parallel Corpus is a parallel corpus of English and Norwegian texts (Oksefjell, 1999). A comparable corpus is one which selects similar texts in more than one language or variety. A comparable corpus makes it possible to compare different languages or varieties in similar circumstances of communication (MacEnery and Wilson, 1996). The ECI Corpus is an example of a comparable corpus; it contains texts from several European languages (Armstrong-Warwick et al., 1994).

Dynamic corpora versus static corpora
Most corpora are finite in size. For instance, the British National Corpus (Aston and Burnard, 1998) and the Spoken Dutch Corpus (Oostdijk et al., 2002) are both static corpora. Dynamic corpora, also referred to as monitor corpora, on the other hand consist of a growing, non-finite collection of texts. A monitor corpus can be used to investigate language change (MacEnery and Wilson, 1996). The Corpus di Italiano Scritto (CORIS) is a general reference corpus of present-day written Italian. It follows a dynamic corpus model and is updated every two years (Rossini Favretti et al., 2001).

3.3 General issues
According to Kennedy (1998), there are some points that have to be considered before a corpus can be developed. In this section these points are discussed.
Purpose
The compiler of the corpus has to formulate what its purpose will be. What kind of research questions will be addressed with it? Different goals require different types of corpora: a corpus used for lexical studies requires a different design from a corpus that is used for grammatical studies, and for sociolinguistics other issues are important than for psycholinguistics. Once the purpose of the corpus is known, it is possible to decide which of the characteristics mentioned in Section 3.1 the corpus has to fulfill.

Text types
Once the goal of the corpus is known, the developers have to decide what text types should be incorporated in the corpus. For a general corpus, as many text types as possible should be in the corpus, whereas for a specific corpus about the style of texts written by English authors in the 18th century, only English texts from the 18th century are needed. It is important that the corpus contains as many text types as possible of the language variety it represents. Because the corpus is a representation of that specific variety, it is important that it contains a balanced language sample of the variety.

Permission
Corpus compilers must observe copyright laws. This is not only the case for written texts, where permission must be obtained from authors and publishers, but also for spoken text. The key issue for the collection of spoken text is that there is no invasion of personal privacy.

Markup
Inconsistent methods of encoding text can cause confusion. Therefore, standards have been developed for the electronic encoding of text. Following a standard facilitates the portability of electronic texts, making it possible to re-use them in different contexts on different equipment. In 1988, the Text Encoding Initiative (TEI) started. The goal of this initiative was to formulate standards for text documentation, text representation, text analysis and interpretation, and metalanguage and syntax issues. This resulted in a first draft of the TEI guidelines in 1990 under the title Guidelines for the Encoding and Interchange of Machine-Readable Texts. Over the years the guidelines have changed; the current version, TEI P4, was published in 2002 (Sperberg-McQueen and Burnard, 2004). The TEI guidelines provide means of representing those features of a text which need to be identified explicitly in order to facilitate processing of the text by computer programs (Sperberg-McQueen and Burnard, 2004). It is an application of the markup language SGML. The guidelines specify a set of tags which may be inserted in the electronic representation of the text, in order to mark the text structure and other textual features of interest. Without such explicit markers, many important features remain difficult to locate by mechanical means such as computer programs, and thus difficult to process effectively.
The process of inserting such explicit markers for implicit textual features is often called markup, and the term markup language denotes the rules which govern the use of markup in a set of encodings (Sperberg-McQueen and Burnard, 2004; Kennedy, 1998).

Metadata
Metadata can be defined as data about data. When speaking about corpora, the term refers to the kind of data that is needed to describe a text in sufficient detail and with sufficient accuracy for some program to determine whether or not that text is relevant in a particular case. It also covers the kind of data needed to describe a speaker in sufficient detail and with sufficient accuracy for some program to determine whether or not that person is relevant in a particular case. The metadata play a key role in organizing the ways in which a corpus can be meaningfully processed. Multiple levels of metadata may be associated with a corpus: first, information relating to the corpus as a whole (e.g., its title, its purpose); second, information relating to the individual components of the corpus (e.g., the bibliographic description of an article); and third, information about the speakers. The TEI guidelines also specify standards for metadata.
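How explicit markup and the three levels of metadata let a program decide which material is relevant can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the element and attribute names are hypothetical and have not been validated against the TEI guidelines, the utterances and speaker identifiers are invented, and the function names are not part of any existing tool. The sketch uses XML and the Python standard library.

```python
import xml.etree.ElementTree as ET

# Invented sample: a header with corpus, component, and speaker metadata,
# followed by marked-up utterances. All names are hypothetical.
SAMPLE = """<sample>
  <header>
    <corpus title="Illustrative corpus" purpose="pilot study"/>
    <component id="c01" source="spontaneous speech sample"/>
    <speaker id="S1" role="patient"/>
    <speaker id="S2" role="therapist"/>
  </header>
  <text>
    <u who="S2">what was your profession</u>
    <u who="S1">eh <pause/> baker</u>
  </text>
</sample>"""


def plain_text(utterance):
    """Strip the markup from one utterance and normalise whitespace."""
    return " ".join("".join(utterance.itertext()).split())


def utterances_by(root, **criteria):
    """Return the plain text of utterances whose speaker metadata match."""
    speakers = {s.get("id"): s.attrib for s in root.iter("speaker")}
    wanted = {
        speaker_id
        for speaker_id, meta in speakers.items()
        if all(meta.get(key) == value for key, value in criteria.items())
    }
    return [plain_text(u) for u in root.iter("u") if u.get("who") in wanted]


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    # Select only the speech of speakers described as patients.
    print(utterances_by(root, role="patient"))
```

Because the markers are explicit and separable from the text, the same file yields both the plain transcription and the selection by speaker, without any change to the utterances themselves.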