The Structure of the ORD Speech Corpus of Russian Everyday Communication

Tatiana Sherstinova

St. Petersburg State University, St. Petersburg, Universitetskaya nab. 11, 199034, Russia
sherstinova@gmail.com

Abstract. The paper presents the structure of the ORD speech corpus of Russian everyday communication, which contains recordings of all spoken episodes recorded during twenty-four hours by a demographically balanced group of people in St. Petersburg. The paper describes the structure of the corpus, which consists of audio files, annotation files and an information system, and reviews the main communicative episodes presented in the corpus.

1 Introduction: What Is the ORD Corpus?

The abbreviation ORD stems from the Russian "Odin Rechevoj Den", literally "one day of speech". The main aim of creating the ORD corpus is to collect recordings of the actual speech that we use in everyday communication. Creating the ORD corpus is an interdisciplinary project involving specialists from many fields. Primarily, they are linguists, experts in different aspects of the Russian language: phoneticians, grammarians, lexicographers, dialectologists. Two psychologists and a sociologist also took part in the project, as did specialists in modern information technologies.

For the first series of recordings, a demographically balanced group of 30 subjects representing various social and age strata of the population of St. Petersburg was selected. The subjects spent one day with dictaphones hanging around their necks, recording all their communication.
The dictaphones were to be turned on in the morning, recording breakfast at home with family members, then preparations for going to work, the way to work itself, conversations by mobile phone, then official and informal conversations with colleagues at work (e.g., about problems with children, the world financial crisis, yesterday's football match, etc.), lunch time, shopping, recreation and so on, up to the moment when the subjects went to bed.

As a result, more than 240 hours of recordings were obtained, of which 170 hours contain speech data quite suitable for further linguistic analysis, and more than 50 hours are good enough for further phonetic analysis. The corpus was divided into 2202 communication episodes, of which 134 have already been transcribed in detail. At present, the orthographic transcription of the corpus numbers more than 50000 wordforms [1].

The corpus presents unique linguistic material that makes it possible to perform fundamental research in many areas, including the complex behaviour of people in the real world. At the same time, these utterly natural recordings may be used for practical purposes: for example, for verification of many scientific hypotheses, for adjustment and improvement of speech synthesis and recognition systems, etc.

V. Matoušek and P. Mautner (Eds.): TSD 2009, LNAI 5729, pp. 258-265, 2009. © Springer-Verlag Berlin Heidelberg 2009

2 The ORD Corpus Structure

The ORD speech corpus consists of three major components: 1) audio files, 2) corresponding annotation files, and 3) an information system.

2.1 Audio Files

The methodology used to collect speech for the ORD corpus, in which subjects were asked to keep dictaphones on for many hours, yielded unique material about our everyday speech behaviour (e.g., we have quite rare recordings of people talking just to themselves, or of informal communication between cadets in the barracks of a military school). At the same time, it inevitably produced a great number of unacceptable recordings: fragments without speech and fragments with speech in a very noisy environment (e.g., background remarks of metro passengers). It was necessary to separate fragments containing speech from those without speech, and to classify the speech fragments according to their quality.

The archive copy of all recordings is kept in its original form, which makes it possible to reconstitute the speech days as they were. Every audio file in the other (working) copy of the corpus was carefully listened to and segmented into fragments, which became the main units of the audio corpus. Each file is expected to be no longer than 30 minutes, to contain speech recordings of similar quality, and to refer to the same or adjacent communication episode(s). All fragments without speech longer than several minutes were cut from the ORD files, as were fragments containing only the background sounds of a working TV or radio. Information on this segmentation is available in an auxiliary database. The new files were given names referring to the subject's code and the ordinal number of the episode.
The phonetic quality of each file was then evaluated on a four-point scale:

1 - the best quality, suitable for precise phonetic analysis;
2 - rather good quality, partially suitable for phonetic analysis;
3 - noisy recordings of low quality that are only partially intelligible (not suitable for phonetic analysis, but suitable for other aspects of research);
4 - unintelligible conversations or remarks in extreme noise, which could not be understood without noise reduction.

It is planned to annotate the recordings of the best quality first, whereas the noisiest audio files are not to be annotated at all.

2.2 Annotation Files

The ORD corpus is being annotated by means of two professional annotation tools, ELAN [2] and Praat [3]. The main principles of multi-level tagging in the ORD corpus were described in [4]. The annotation formats of ELAN and Praat are fully convertible into each other. ELAN is used for primary and general annotation of the corpus, whereas Praat is used for real phonetic transcription and other phonetic annotations. The annotations are kept in files of two general types: *.eaf (ELAN format) and *.TextGrid (Praat format). After being verified by experts, the annotation data are exported into the general information system.
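The export step from annotation files into the information system can be illustrated with a minimal sketch. It assumes only the general layout of ELAN's XML-based *.eaf format (time slots plus time-aligned annotations grouped into tiers); the sample fragment, the tier name "Phrase" and the function name read_tier are hypothetical and are not taken from the ORD project itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written EAF fragment for illustration only;
# files saved by ELAN carry considerably more metadata.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <HEADER TIME_UNITS="milliseconds"/>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="2300"/>
  </TIME_ORDER>
  <TIER TIER_ID="Phrase" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first phrase</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second phrase</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tier(eaf_text, tier_id):
    """Return (start_ms, end_ms, text) tuples for one tier of an EAF document."""
    root = ET.fromstring(eaf_text)
    # Map each time-slot identifier to its position in milliseconds.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, text))
    return sorted(rows)


phrases = read_tier(SAMPLE_EAF, "Phrase")
for start_ms, end_ms, text in phrases:
    print(start_ms, end_ms, text)
```

Rows of this shape (start and end points in milliseconds plus a text value) are the kind of data that a database table of time-aligned phrases could hold; how the ORD project actually performs the export is not described here.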

2.3 Information System

The information system is a relational database built in MS Access 2003. The tables of the database fall into three general groups.

Group 1 holds factual information about speakers, sound files, and communication episodes.

Table 1.1 Informants (Speakers/Subjects): factual data about all base speakers, provided by the subjects themselves. For example, the speaker's code (S01, S02, etc.); his/her nickname; gender; age; place of birth; social group; education; qualification; current occupation; nationality; number and quality of recorded files; total and usable time of recording; comments; etc.

Table 1.2 Communicants (Interlocutors): some factual data about the main people who communicated with the subjects during their speech day: the communicant's code; his/her (nick)name; relation to the subject or his/her social role (e.g., mother, friend, shop assistant, etc.); gender; approximate age; and some other information that the subjects may have provided (the interlocutor's place of birth; social group; education; qualification; current occupation; nationality), as well as the intelligibility of the recorded speech; comments; etc.

Table 1.3 ARCSoundFiles: information about the original (archival) sound files, including total duration and the durations of intelligible speech and of unintelligible or noisy fragments.

Table 1.4 ORDSoundFiles: information about the reformatted sound files, including a reference to the corresponding original file, the exact position in the original file, total duration, phonetic quality of the recording, annotation priority, and the main communication episode, described in three fields: 1) where, 2) doing what, 3) who is (are) the main interlocutor(s).

Table 1.5 Episodes: a concise formal description of the main voiced communication episodes. Segmentation into episodes was made by expert linguists; its description includes information on interlocutors, time, duration, place, aim and subject of communication, as well as possible comments.
At first, segmentation into episodes was rather arbitrary. We are now trying to standardize both the segmentation and its description.

Group 2 presents some results of social and psycholinguistic data interpretation. Currently it contains just two tables:

Table 2.1 InformantsSocial (Speakers' Social Attributions) has the same structure as Table 1.1 (Informants), but contains a subjective evaluation of the speaker's social characteristics by the linguists who transcribed their speech. It should be noted that the linguists were not allowed to see the factual information in Table 1.1 before filling in Table 2.1.

Table 2.2 InformantsPsycho (Speakers' Psychological Portraits). This table is also filled in by the linguists who work with the recordings and contains the speaker's psychological rating on a ten-point scale for the following aspects: neuroticism, spontaneous aggression, depression, irritation, sociability, tranquility, responsive aggression, self-consciousness, openness, extroversion/introversion, emotional instability, masculinity/femininity. In addition, it contains a one-page essay about each base speaker, written by the researchers in a free style.
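The relational layout of Group 1 can be sketched in a few lines. The example below is an illustration only: it uses SQLite from the Python standard library as a stand-in for MS Access, keeps just a handful of the fields listed above, and all column names, file names and the second speaker's data are hypothetical. It shows how sound files might be linked to speakers and selected for annotation by phonetic quality (1 is the best, 4 the worst).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two simplified Group 1 tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Informants (
            code       TEXT PRIMARY KEY,   -- speaker code, e.g. S01, S02
            gender     TEXT,
            age        INTEGER,
            occupation TEXT
        );
        CREATE TABLE ORDSoundFiles (
            file_name    TEXT PRIMARY KEY,
            speaker_code TEXT NOT NULL REFERENCES Informants(code),
            duration_s   REAL,
            quality      INTEGER CHECK (quality BETWEEN 1 AND 4)
        );
    """)
    # Illustrative rows; only the S05 attributes echo the paper's own example,
    # everything else is invented for the sketch.
    conn.executemany(
        "INSERT INTO Informants VALUES (?, ?, ?, ?)",
        [("S01", "male", 40, "engineer"),
         ("S05", "female", 29, "teacher")],
    )
    conn.executemany(
        "INSERT INTO ORDSoundFiles VALUES (?, ?, ?, ?)",
        [("S05_01", "S05", 1200.0, 1),
         ("S05_02", "S05", 900.0, 3),
         ("S01_01", "S01", 1500.0, 2),
         ("S01_02", "S01", 600.0, 4)],
    )
    return conn


def files_to_annotate(conn, max_quality=2):
    """List files whose quality score is at most max_quality, best first."""
    query = """
        SELECT f.file_name, i.code, i.occupation, f.quality
        FROM ORDSoundFiles AS f
        JOIN Informants AS i ON i.code = f.speaker_code
        WHERE f.quality <= ?
        ORDER BY f.quality, f.file_name
    """
    return conn.execute(query, (max_quality,)).fetchall()


conn = build_demo_db()
for row in files_to_annotate(conn):
    print(row)
```

Keeping the speaker code as the key shared between tables mirrors the way the tables described above refer to one another; the real database contains many more fields and tables than this sketch.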

Group 3 contains tables for speech transcripts and multi-level annotations. Filling these tables of the database is still in progress.

Table 3.1 MiniEpisodes is used to refer to smaller real-life episodes within the larger episodes described in Table 1.5. For example, Speaker X in Episode N (in the evening at the summer house with mother) may have the following mini-episodes: 1) searching for matches (2 minutes), 2) trying to light the stove (3 minutes), 3) discussing plans for the rest of the evening (5 minutes), etc.

Table 3.2 Timeline: each (mini-)episode is sequentially subdivided into utterances (remarks/phrases) and pauses. The exact timing of each fragment is given.

Table 3.3 Frases contains the orthographic transcript of communication episodes, made by linguists using a special system of notation. The speaking person is identified in the Speaker field, using the same codes as in Tables 1.1 and 1.2. Auxiliary information on the starting and ending points of the phrase in milliseconds makes it possible to listen to it in the corresponding database form.

Table 3.4 Voice is linked with the previous table and contains information about possible changes of voice quality in some fragment of the speech (either physiological or functional, e.g., hoarsely, smiling, yawning, excitedly, imitating, etc.).

Table 3.5 Events describes non-language audio events (squeak of a door, phone ring, etc.).

Table 3.6 Notes: holds auxiliary information that may be attached to some period of time (e.g., "this fragment contains specific youth slang").

Table 3.7 Words: keeps information about each wordform of the corpus (e.g., POS, grammatical form, syntactic role, phonetic transcription, etc.).

Table 3.8 PhonWords: describes phonetic words. Segmentation into phonetic words is being carried out for the sub-corpus of the best quality.
The table currently contains the orthographic spelling of phonetic words and references to their starting and ending points in the corresponding audio files. Phonetic transcription is planned to be added later. Each segmented phonetic word can be listened to by means of database utilities.

Table 3.9 Morphemes: contains data on morphemes (e.g., general class, ideal transcription, phonetic transcription, etc.).

Table 3.10 Sounds: contains phonetic information about individual realizations of phonemes or indivisible sound groups (ideal and real phonetic transcription, position, etc.).

A number of additional levels of annotation are planned to be included in the database later (e.g., speaker's mental state, emotional connotations, communication strategies, rhetorical techniques, prosodic models, etc.). Therefore, new tables will appear in the database in the future.

Fig. 1 shows examples of orthographic transcripts of phrases in Table Frases for the mini-episode "a story about the speaker's driving lesson" (Speaker S05, female, 29 years old, teacher).

Fig. 1. An example of orthographic transcript of phrases in Table Frases

Based on the annotations obtained, various frequency lists may be built for words and phonetic words for any speaker, episode, group of episodes, or the corpus as a whole. All occurrences of each word may be found and listened to. A flexible search system is currently being created for the database of annotations. Further data processing is based on database queries and special applications, which are currently being developed. One such application is the E-Kar utility, which allows complex lexicographic and morphological data processing.

The speech material of the ORD corpus will be constantly expanded: new recordings are currently being made by new groups of subjects. New functional modules are also being created within the ORD speech corpus, for example, an audio dictionary of Russian morphemes.

3 Summarization of Communicative Episodes

In this section we briefly describe the content of the ORD corpus from the point of view of episode typology. The term "episode" in the ORD terminology means a continuous and preferably long-lasting fragment of the one-day-of-speech recordings with common conditions of communication (time, place, action, interlocutors). Episodes may be further subdivided into mini-episodes, which refer to shorter periods of time with a common topic of conversation, simultaneous action, etc.

All 2202 communication episodes detected in the days of speech were then divided into 22 general categories. The most frequent types are shown in Fig. 2. The percentages shown in this chart refer to the total duration of recordings made in the given conditions and do not reflect the number of utterances (or words) recorded in each situation.

It can be seen from Fig. 2 that the largest part of the speech recordings (42%, or nearly 93.6 hours) refers to communication related to the main occupation of the subjects (at work or at studies). This category is more than 4 times larger than the second-ranked group, family conversation at home in the evening (9.92%). The types that follow refer to parties in coffee bars or restaurants (9.80%) and to dialogues on the way to the workplace or elsewhere (8.77%). Fifth place is occupied by home conversations in the morning (5.41%). The duration of any other category of episodes does not reach 5% of the averaged speech day.

This classification of episodes is rather rough. Evidently, some categories should be reviewed further. For example, the huge category "at work" may be divided into business meetings, work with clients, business calls, individual work, and personal contacts, including private and non-business conversations. Moreover, some smaller episodes or situations (e.g., phone calls) may take place within practically any main category.

Fig. 2. Summarization of main episodes in the ORD according to the total time of original recordings

3.1 Male and Female Speech Days

A special investigation was made to compare the average male and female speech days. In the majority of the categories of episodes, the speech day of men does not differ significantly from that of women: the difference is less than 1-2%, as is the case for the dominant category, work/studies (see Table 1). There are, however, a few categories in which a rather substantial difference can be observed. In particular, men spent nearly 9% more time than women attending various (sport, cultural, etc.) events; consequently, the total time men spent on the way is nearly 5% longer. Women, in turn, spent this time on home conversations in the evening (7% more than men on average), at parties and dinners (both about 2% longer than men) and on morning conversations (3% longer). This result is, however, not surprising from either a psychological or a sociological point of view [5].

Table 1. Comparison of male and female communication episodes in the ORD corpus

No.  General types of episodes                             Difference (%)  Male/Female
1    sport and cultural events                             8.69            Male
2    on the way to anywhere                                4.85            Male
3    at home in the daytime                                2.51            Male
4    walk                                                  1.58            Male
5    corporate party                                       1.47
6    visiting service centers, public institutions, etc.   1.17
7    the main occupation (work or study)                   0.87
8    hobbies, leisure, sport                               0.10
9    lunch                                                 0.23
10   visiting a doctor                                     0.31
11   shopping                                              0.59
12   breakfast                                             0.79
13   on the way home                                       0.97
14   visiting other people                                 1.06
15   in the country                                        1.20
16   dinner                                                1.94
17   housework                                             2.05            Female
18   evening party (in cafe, restaurant)                   2.08            Female
19   family talk at home in the morning                    3.01            Female
20   family talk at home in the evening                    6.92            Female

When the corpus is fully annotated and transcribed, we will be able to measure more precisely the quantity of speech (utterances and words) for each communication episode and to study its dynamics within days of speech. The distinction between working days and holidays should also be taken into account. Then we can try to build an averaged model of people's twenty-four-hour speech behaviour. Further, having determined the main chains and structures of everyday communication, it will be possible to study time series of quantitative variables by means of standard statistical methods and to analyze frequency series (e.g., of lexical, grammatical or semantic units, acoustic phenomena, prosodic contours, etc.) depending on various conditions of communication.

Acknowledgements

The first recordings and the creation of the ORD corpus database were supported by the Russian Foundation for Humanities within the framework of the project "Speech Corpus of Russian Everyday Communication 'One Speaker's Day'" (project # 07-04-94515e/Ya). At present, the creation of the corpus is supported by the programme of the Russian Ministry of Education "Sound Form of Russian Grammar System in Communicative and Informational Approach" and by the grant of the Russian Foundation for Humanities "Development of an Information System for Monitoring of Russian Spoken Language" (project # 09-04-12115v).

References

1. Asinovsky, A.S., Bogdanova, N.V., Rusakova, M.V., Stepanova, S.B., Sherstinova, T.Y.: Zvukovoj korpus russkogo yazyka povsednevnogo obschenia "Odin rechevoj den": koncepcia i sosytojanie formirovania. In: Kompjuternaya lingvistika i intellektualnye tekhnologii. Vypusk. Po materialam mezhd. konferencii "Dialog", Moscow, vol. 7 (14), pp. 488-494 (2008)

2. ELAN - Linguistic Annotator. Version 3.6, http://www.mpi.nl/corpus/manuals/manual-elan.pdf

3. Praat: Doing Phonetics by Computer, http://www.praat.org

4. Ryko, A.I., Stepanova, S.B.: Mnogourovnevaya lingvisticheskaya razmetka zvukovogo korpusa russkogo yazyka. In: Kompjuternaya lingvistika i intellektualnye tekhnologii. Vypusk. Po materialam mezhd. konferencii "Dialog", Moscow, vol. 7 (14), pp. 460-465 (2008)

5. Sherstinova, T.Y.: "Odin rechevoj den" na vremennoj shkale: o perspektivakh issledovania dinamicheskikh processov na materiale zvukovogo korpusa. In: Vestnik Sankt-Peterburgskogo universiteta, Chast 2, St. Petersburg. Seria 9: Filologia. Vostokovedenie. Zhurnalistika, vol. 4, pp. 227-235 (2008)