Quarterly Progress and Status Report. Automatic classification of accent and dialect type: results from Southern Swedish

Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report

Automatic classification of accent and dialect type: results from Southern Swedish
Frid, J.

journal: Proceedings of Fonetik, TMH-QPSR
volume: 44
number: 1
year: 2002
pages: 089-092
http://www.speech.kth.se/qpsr

TMH-QPSR Vol. 44 Fonetik 2002

Automatic classification of accent and dialect type: results from southern Swedish

Johan Frid
Department of Linguistics and Phonetics, Lund University

Abstract

This paper is about automatic classification of dialect and accent type in Swedish. We used disyllabic words from the speech material of the Swedia 2000 dialect project in order to build a statistical prediction model. The model uses different parameters extracted from the F0 contours of the words. Classification of dialect is performed for the features 'province' and 'village', whereas classification of accent is performed for the feature 'type'. The best results for province, village and accent type are 30.3%, 15.7% and 85.4% correct predictions, respectively. This shows that the categories of province and village are too specific to use for a prediction model based on these F0 parameters, whereas it is possible to predict accent type with some accuracy even across dialects.

Introduction

It has long been known that intonation is a major source of variation in the dialects of Swedish. An important factor behind this variation can be seen in the intonation data collected by Meyer (1937 and 1954), where it is clear that the temporal properties of turning points in the F0 contour differ among dialects. In different experiments, Bruce & Gårding (1978) and Bruce (1983) showed that the perceived dialect type of an utterance could be varied by means of resynthesis with different F0 contours. Bruce (p.c.) has also demonstrated that dialect type may be recognized from the laryngograph signal of an utterance. This indicates that listeners are able to use F0 as a cue to dialect.

House (1990) and House & Bruce (1990) have dealt with automatic prosody recognition. Based on a human expert's analysis of F0 contours, they developed a set of rules for the classification of unknown F0 contours.
The experiments presented here are based on the same basic idea: intonation, as manifested in the F0 contour, plays a part in the prediction of certain features of dialectal and accentual variation, and it is possible to automate the recognition of some aspects of this relationship.

A possible application for dialect recognition is in voice response systems, which need to be able to cope with dialectal variation in order to maximize the number of potential users. Identification of accent type may restrict lexical possibilities and therefore facilitate lexical search and access in automatic speech recognition systems.

Experiment

In this section we describe the material, the analysis methods and the modeling.

Hypothesis

The question we would like to answer is: to what extent can acoustic properties, such as the timing and level of turning points in the F0 contour, be used to recognize the dialect or type of accent correctly? A related question is whether other information sources are advantageous in such a classification task. More specifically: does it help if we know which accent the speaker intended and/or if we know the segmental properties of the word, such as the temporal location of vowel or voice onset? Even though such information is not obtainable directly from an acoustic analysis, it may be available, e.g. in an alignment task. The issue has, however, not been dealt with in the framework of this paper.

Material

In this study, we used the material collected within the Swedia 2000 dialect project (Bruce et al 1998). In the Swedia material, there are recordings from more than 100 villages and towns, mainly located in Sweden but also a few in Finland, where Swedish is spoken in some areas. Gender and age are covered by recordings of both men and women, and of 'older' (aged around 55) and 'younger' (aged around 25) informants. There are at least three informants in each combination of gender and age (such as 'older men') from each village.

The material we use in this study consists of the words 'dollar' (Accent 1) and 'kronor' (Accent 2) spoken in phrases like 'tio dollar' or 'tjugo kronor', where either the numeral or the currency word is given phrase focus. We thereby get both focal and non-focal versions of both Accent 1 and Accent 2 words. Each phrase is repeated a number of times, giving several versions of each word for each informant. The phrases were not read from written versions, but elicited by the interviewer by showing the informant notes with symbols for each word. The recordings were made in the informants' homes using a portable DAT recorder, and care was taken to avoid background noise. In almost all cases the recordings are of very high quality.

Labeling

The material was first labeled on the word level, locating the start and end position of each word in the whole recording session of an informant. In this process, word accents and phrase focus were also indicated. For further segmental transcription we used a semi-automatic method consisting of automatic alignment followed by manual post-processing of the segment boundaries. In this way, the temporal location of segment boundaries, most notably the important information about vowel onset in the stressed syllable, is obtained for each word.

From this material we used the group 'older men'. For reasons of prioritization within the project, the labeling of this group was the first to be completed and ready for use. At the time of this study, only the material from southern Sweden was available to us.
This material contains recordings from all provinces from Skåne in the south up to Dalsland, Östergötland and Gotland in the north. All in all, we have more than 100 speakers from ten different provinces in southern Sweden.

Acoustic analysis and parameterization

Three different sets of parameters were extracted from the words in the material:

1) Time and F0 values of the first fall where the beginning of the fall comes before the end of the vowel in the stressed syllable. The temporal locations of the turning points were determined by a stylization method (see below). Following Bruce (1977) and Bruce & Gårding (1978), the method assumes that the perceptually relevant cue for accent is a fall somewhere near the vowel in the stressed syllable. Currently, the method is 'greedy', i.e. it tries to identify the longest possible fall in the whole contour. This means that the end points are sometimes not the ones a human analyst would choose, since the F0 contour may continue to fall after the most relevant part of the fall. This is something we will try to improve in the future.

2) F0 level at the onset of the vowel in the stressed syllable, together with the time and F0 level of the first two stylization points (see below) after the vowel onset. This method makes no explicit assumption about the direction of F0 after the vowel onset, but only tries to capture the acoustic features of the turning points directly following it. Frid (2000) used the method with some success in distinguishing between Accent 1 and Accent 2 in material from Skåne.

3) Tilt values. These values are based on the Tilt model by Taylor (2000). In this model, each intonation event is characterized by continuous parameters representing amplitude, duration and tilt (the shape of the event). The parameters are extracted using the tilt_analysis program distributed with the Edinburgh Speech Tools.
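One possible reading of parameter set 1 can be sketched in a few lines of Python. The sketch assumes that the stylized contour is already available as a time-ordered list of (time, F0) turning points and that the end of the stressed vowel is known from the segmentation. The function name `first_fall`, the units (seconds and Hz) and the example values are illustrative assumptions and are not taken from the implementation used in the study.

```python
def first_fall(points, vowel_end):
    """Return (start, end) of the first fall that begins before `vowel_end`.

    `points` is a time-ordered list of (time, f0) turning points.
    Returns None if no fall starts before the end of the stressed vowel.
    """
    for i in range(len(points) - 1):
        if points[i][0] >= vowel_end:
            break  # the fall must begin before the end of the stressed vowel
        if points[i + 1][1] < points[i][1]:
            j = i + 1
            # Greedy step: keep extending the fall while F0 keeps decreasing.
            while j + 1 < len(points) and points[j + 1][1] < points[j][1]:
                j += 1
            return points[i], points[j]
    return None


# Hypothetical stylized contour: (time in seconds, F0 in Hz).
contour = [(0.00, 110.0), (0.08, 140.0), (0.20, 100.0), (0.30, 90.0), (0.40, 105.0)]
fall = first_fall(contour, vowel_end=0.15)
print(fall)
```

Because the search keeps extending the fall for as long as F0 decreases, the end point in this example is the lowest point of the whole descent (0.30 s) rather than the end of its steepest part (0.20 s), which is the 'greedy' behaviour described for method 1.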
Pitch analysis

The pitch contour used for methods 1 and 2 was obtained with the pitch analysis algorithm of Boersma (1993), which is implemented in Boersma's PRAAT program and is integrated with the stylization method used in these methods. For method 3, the method of Bagshaw et al (1993), distributed with the Edinburgh Speech Tools, was used, as this implementation is integrated with the extraction of the parameter set in question (tilt). In order to avoid octave jumps etc., the raw pitch data were inspected and, where necessary, a reanalysis with adjusted low-pass and high-pass filter settings was performed.

Stylization

Methods 1 and 2 use stylized versions of the F0 contour. The stylization works by selecting tonal turning points in the contour. The points are selected so that, when they are reconnected with straight lines, the difference in pitch between the reconstructed contour and the original contour never exceeds a set value at any point along the contour, in this case one (1) semitone. This results in a series of time/frequency pairs that describes the contour of a pitch pattern accurately, but with fewer points than the full contour.

Modeling

In order to build a classification model, an automatic method was used to construct a classification and regression tree (CART, Breiman et al 1984) from the data. A CART is a statistical model that can deal with incomplete data and with multiple types of features, both in input features and in predicted features, and that produces human-readable rules. Again, we used an implementation from the Edinburgh Speech Tools, called wagon.

All in all, there were 1858 words. Of these we used 90% for training and saved 10% for testing. As input we used each parameter set individually as well as a combination of all parameters, to see if there were synergy effects. Three runs were made, in which we varied the 'stop' value (see footnote 1), setting it to 5, 10 or 20. The lower the value, the more closely the models are tuned to the training set, and there is a risk that the models become over-trained.

We trained models to predict three different features:

1) The village that the speaker of an utterance is from. There were 37 different villages in the material.

2) The province that the speaker of an utterance is from. There are 10 different provinces.

3) The accent type of an utterance. There are two different accents, Accent 1 and Accent 2.

We did not distinguish between focal and non-focal versions of the accents.

Results

The results for each feature prediction are shown in Tables 1-3.
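To make the role of the 'stop' value in the modeling concrete, the following Python sketch grows a small classification tree and refuses to split any node that holds fewer than `stop` examples, following the definition quoted from Black et al (1998) in footnote 1. It is not wagon and not the full CART procedure: the Gini impurity criterion, the tuple-based tree, all function names and the ten toy data points (a single made-up timing feature with two noisy labels) are assumptions chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, stop):
    """Grow a binary classification tree over numeric features.

    Assumption: a node holding fewer than `stop` examples is never split.
    A leaf is a class label; an inner node is (feature, threshold, left, right).
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(labels) < stop or len(set(labels)) == 1:
        return majority
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for thr in values[:-1]:  # question: "is feature f <= thr?"
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, thr)
    if best is None or best[0] >= gini(labels):
        return majority  # no question reduces the impurity
    _, f, thr = best
    go_left = [r[f] <= thr for r in rows]
    left = build_tree([r for r, g in zip(rows, go_left) if g],
                      [y for y, g in zip(labels, go_left) if g], stop)
    right = build_tree([r for r, g in zip(rows, go_left) if not g],
                       [y for y, g in zip(labels, go_left) if not g], stop)
    return (f, thr, left, right)


def predict(tree, row):
    """Follow the questions from the root down to a leaf label."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree


def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[2]) + count_leaves(tree[3])


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)


# Toy data (hypothetical): one timing feature in ms, two noisy labels.
rows = [[-60], [-50], [-40], [-30], [-20], [10], [20], [30], [40], [50]]
labels = ["A2", "A2", "A1", "A2", "A2", "A1", "A1", "A2", "A1", "A1"]

for stop in (2, 5, 10, 20):
    tree = build_tree(rows, labels, stop)
    print(stop, count_leaves(tree), accuracy(tree, rows, labels))
```

On this toy data, a low stop value lets the tree isolate the two noisy points and fit the training set perfectly, while higher values give progressively smaller trees, down to a single leaf at stop = 20. The printed accuracy is measured on the training data, so it illustrates only how closely the tree is tuned to that data, not how well it generalizes to a held-out test set.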
The individual results for each method and stop condition are shown in the cells of each table. The results are presented as percentages of correct classifications. The best result for each method is marked with an asterisk.

Footnote 1: According to Black et al (1998), this value "specifies the minimum number of examples necessary in the training set before a question is hypothesized to distinguish the group".

Table 1. Results for prediction of province (% correct).

Method    stop = 5    stop = 10    stop = 20
All       30.3*       23.2         28.7
1         24.3        27.0         28.7*
2         23.8*       23.2         20.6
3         19.5        23.8*        23.8*

Table 2. Results for prediction of village (% correct).

Method    stop = 5    stop = 10    stop = 20
All       13.5*       12.4          9.7
1         10.8        15.7*         9.7
2         10.3        10.8         11.9*
3         12.4*        8.6         11.4

Table 3. Results for prediction of accent type (% correct).

Method    stop = 5    stop = 10    stop = 20
All       85.4*       80.5         82.7
1         74.1        73.0         74.6*
2         67.0        70.3         71.4*
3         80.5*       78.9         73.0

Discussion

First, it should be noted that the best results for province and accent type are obtained when combining all the methods. We also note a tendency for method 1 to be slightly better than methods 2 and 3 when predicting dialectal origin, whereas method 3 outperforms methods 1 and 2 for prediction of accent type.

Province

The best result is 30.3%. Even though this is better than the estimated baseline result of 10% (since there are ten provinces, one tenth correct is roughly what we would get if we guessed the same province in all trials), this is still quite poor. The task of predicting which province the speaker of a given utterance is from is probably too specific to be performed reliably on the basis of the F0 of disyllabic words only.

Village

The best result is 15.7%. This task is clearly a very difficult one, and much higher results shouldn't be expected on the basis of any other parameterization. This task is even more specific than guessing the province and therefore the results are even worse.

Accent type

The best result is 85.4%, which we think is rather good, given that the geographical spread is very high and that the prediction is based only on F0 without geographical information.

Implications and plans for future studies

For dialect, we probably need to use a coarser classification. Predicting village or province is simply too specific to be done reliably. The accent typology of Gårding (1977) may be useful, but we have not tried it here.

We would also like to improve parameterization method 1, which we suspect is too 'greedy'. Restricting the search area for the end point of the fall, or selecting only the steepest part of a combined fall, are two possible improvements.

Geographically, we are going to extend the data set with material from the whole project in order to get a more complete coverage of all dialect areas. Furthermore, we want to include at least the speaker category 'older women' in order to get a more balanced gender coverage.

We did not distinguish between focused and non-focused versions of the words. Doing so may improve the results, as differences in the realization of focal and non-focal accent make classification harder.

Conclusion

We have performed an experiment on the ability to predict accent type and dialectal origin of disyllabic words using F0 and segmental data. We used utterances from more than 100 different speakers (all male) from 10 provinces in southern Sweden. The utterances were both Accent 1 and Accent 2 words. The best results for province, village and accent type are 30.3%, 15.7% and 85.4% correct, respectively. This shows that the categories of province and village are too specific to use for a prediction model based on F0, whereas accent type is possible to predict with some accuracy even across dialects.
Furthermore, the best results for province and accent type classification are obtained when combining different methods of parameterization.

Acknowledgement

I would like to thank all the past and present members of the Swedia 2000 project for their work with the speech database.

References

Bagshaw P C, Hiller S M & Jack M A (1993) Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching. Proceedings of Eurospeech 93, 1003-1006, Berlin.

Black A, Lenzo K & Pagel V (1998) Issues in Building General Letter to Sound Rules. Proceedings of the Third ESCA Workshop on Speech Synthesis, 77-80.

Boersma P (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proceedings of the Institute of Phonetic Sciences, University of Amsterdam 17: 97-110.

Breiman L, Friedman J, Olshen R & Stone C (1984) Classification and regression trees. USA: Wadsworth and Brooks.

Bruce G (1977) Swedish word accents in sentence perspective. Sweden: CWK Gleerup.

Bruce G (1983) Accentuation and timing in Swedish. Folia Linguistica XVII/1-2, 221-238.

Bruce G, Engstrand O & Eriksson A (1998) De svenska dialekternas fonetik och fonologi år 2000 (Swedia 2000) - en projektbeskrivning. Proceedings of 6:e Nordiska Dialektologkonferensen, 33-54.

Bruce G & Gårding E (1978) A Prosodic Typology for Swedish Dialects. In: Gårding E, Bruce G & Bannert R, eds, Nordic Prosody. Sweden: Dept. of Linguistics, Lund University, 219-228.

Frid J (2000) Compound accent patterns in some dialects of Southern Swedish. Proceedings of Fonetik 2000, 61-64.

Gårding E (1977) The Scandinavian word accents. Sweden: CWK Gleerup.

House D (1990) Tonal Perception in Speech. Sweden: Lund University Press.

House D & Bruce G (1990) Word and focal accents in Swedish from a recognition perspective. In: Wiik K & Raimo I, eds, Nordic Prosody V. Finland: Turku University, 156-173.

Meyer E A (1937) Die Intonation im Schwedischen, I: Die Sveamundarten. Studies Scand. Philol. 10, Univ. Stockholm.

Meyer E A (1954) Die Intonation im Schwedischen, II: Die norrländischen Mundarten. Studies Scand. Philol. 11, Univ. Stockholm.

Taylor P (2000) Analysis and synthesis of intonation using the Tilt model. Journal of the Acoustical Society of America 107(3), 1697-1714.