Same same but different: An acoustical comparison of the automatic segmentation of high quality and mobile telephone speech


INTERSPEECH 2013, 25-29 August 2013, Lyon, France

Christoph Draxler (1), Hanna S. Feiser (1, 2)
(1) Institute of Phonetics and Speech Processing, Ludwig-Maximilian University Munich, Germany
(2) Bavarian State Criminal Police Office, Munich, Germany
draxler, feiser@phonetik.uni-muenchen.de

Abstract

In this paper we present a comparison of the performance of the automatic phonetic segmentation and labeling system MAUS [1] for two different signal qualities. For a forensic study on the similarity of voices within a family [2], eight speakers from four families were recorded simultaneously in both high bandwidth and mobile phone quality. The recordings were then automatically segmented and labeled using MAUS. The results show marked effects of signal quality on segment counts and durations: for the mobile phone quality, the segment counts for fricatives were much lower than for the high quality recordings, whereas the segment counts for plosives and vowels increased. The segment durations of fricatives were much shorter for the mobile phone recordings, slightly shorter for the front vowels, but considerably longer for the back and low vowels.

Index Terms: forensic analysis, same-sex siblings, speech database, automatic segmentation, acoustical analysis

1. Introduction

The automatic processing of speech in the context of large speech databases has made significant progress in recent years. It is now possible, given an orthographic transcript of an utterance, to automatically align the text with the audio signal, or to generate a fine phonetic segmentation and labeling which even takes coarticulatory effects into account. In general, however, this works reliably only for high quality signals; with noisy or compressed audio signals, the results deteriorate.

In the study presented here, we systematically compare the performance of the MAUS system on read sentences in high bandwidth and mobile phone quality recordings. These recordings were made in the context of a forensic study on the similarity of voices in families: how similar are the voices of two brothers within a given family? With this setup, recording conditions were closely controlled; the only difference was the transmission channel.

In daily forensic practice, this setup occurs quite often: the questioned recording, made via a mobile phone, is compared to high bandwidth recordings of the suspects made e.g. during interrogations. When comparing these two conditions, the spectral features will quite likely show marked differences, for example in fricatives [3], [4]. In forensic casework, one often has the problem that only phone recordings exist, which means that the acoustic information above 4000 Hz is missing from the signal. However, fricatives carry important information above this frequency, so the empirical question is what information remains in telephone speech for investigating differences between speakers [5].

The MAUS system was used to automatically segment and label the recordings. On the one hand, this was necessary to process the large amount of speech data efficiently; on the other hand, it was interesting in its own right to compare the performance of MAUS, given that all other conditions were constant and each speaker produced the same material.
With this comparison we will be able to evaluate the quality of MAUS for high bandwidth recordings, estimate the influence of compressed signals on its performance, and indicate which phonemes are most likely to be affected by reduced signal quality.

2. Method

The speech database consists of eight male speakers aged 20-31 years from four families. All speakers grew up in and around Munich in Bavaria. Each speaker read 100 phonetically rich sentences from the Berlin Corpus and 20 minimal pairs in carrier sentences (repeated four times). Each pair of brothers also had a spontaneous information exchange dialog about a movie fragment they had each seen prior to the dialog recording.

Following the setup of the DyVis [6] and Pool [7] corpora, speakers were recorded in separate rooms using both high quality microphones and mobile phones. The high bandwidth recordings were made with a Neumann TLM 103 P48 condenser microphone at 44.1 kHz sample rate and 16 bit quantization, using the SpeechRecorder software [8]. At the same time, the speakers were recorded via mobile phone, using Nokia 1680 and 2220 handsets connected to an ISDN server. The signal quality of the mobile phone recordings is thus 8 kHz sample rate with 8 bit A-law quantization. For further processing, the mobile phone recordings were converted to 16 bit linear PCM. Figure 1 shows a sample segmentation of the word haben (to have) for both the high quality and the mobile phone recording of the same utterance.

Figure 1: Sample signal for the high quality (a) and mobile phone (b) recordings of the same utterance, with the phoneme segments of haben.

For the automatic segmentation and labeling, the web service version of the MAUS system was used [9]. The phoneme models were trained on high bandwidth speech with the German SAMPA phoneme inventory. A standard right shift of the boundaries to the next 10 ms is applied uniformly. For every recording, MAUS returned both the canonical form (i.e. the citation pronunciation) of the words in the utterance and a phonetic segmentation in Praat TextGrid file format. This segmentation takes coarticulatory effects into account, e.g. the /@/ elision in German syllables ending in -en, such as /z a: g @ n/ vs. /z a: g n/. The TextGrid files were then read into an SQL database system.
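As a concrete illustration of the two processing steps just described, the following minimal Python sketch expands an 8 bit A-law telephone recording to 16 bit linear PCM and submits it to the MAUS web service. This is not the authors' pipeline: the file names are hypothetical, and the runMAUSBasic endpoint and its parameter names are assumptions based on the BAS web services [9].

    # Minimal sketch (not the authors' pipeline): expand 8-bit A-law telephone
    # audio to 16-bit linear PCM, then request a MAUS segmentation from the
    # BAS web services. File names, endpoint and parameters are assumptions.
    import audioop  # standard library up to Python 3.12 (removed in 3.13)
    import wave

    import requests

    # Step 1: expand 8 kHz / 8-bit A-law samples to 16-bit linear PCM.
    with open("call.alaw", "rb") as f:          # hypothetical raw A-law dump
        pcm16 = audioop.alaw2lin(f.read(), 2)   # width 2 = 16-bit output samples

    with wave.open("call_pcm16.wav", "wb") as out:
        out.setnchannels(1)      # telephone speech is mono
        out.setsampwidth(2)      # 16-bit linear PCM
        out.setframerate(8000)   # the sample rate stays at 8 kHz
        out.writeframes(pcm16)

    # Step 2: submit signal and transcript to the (assumed) MAUS endpoint.
    MAUS_URL = ("https://clarin.phonetik.uni-muenchen.de/"
                "BASWebServices/services/runMAUSBasic")

    with open("call_pcm16.wav", "rb") as sig, open("transcript.txt", "rb") as txt:
        resp = requests.post(
            MAUS_URL,
            files={"SIGNAL": sig, "TEXT": txt},
            data={"LANGUAGE": "deu-DE", "OUTFORMAT": "TextGrid"},
        )

    # The service replies with XML that contains a download link for the
    # resulting Praat TextGrid file.
    print(resp.text)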

The database contains a total of 54,666 phoneme segments in 13,488 orthographic word and canonical form segments. Note that because MAUS assumes a hierarchical structure of elements in the different annotation tiers, i.e. a word has one canonic form and a canonic form may have many phonemes, this hierarchical structure is preserved in the segment table (technically, this is achieved by a foreign key reference within the segment table). For the statistical computations, the software R was used with the RDBMS interface library RPostgreSQL.

3. Analyses

3.1. Type and token counts

All speakers produced the same read utterances, and hence the database contains the same orthographic word forms and canonical forms for every speaker and both recording qualities (341 word form or canonical form types and 4148 tokens for the sentences, 23 types and 2560 tokens for the minimal pairs). For phoneme segments, however, there are differences: the inventory is the same, but the token counts differ. For both the sentences and the minimal pairs, there are slightly fewer phoneme segments in the mobile phone recordings than in the high bandwidth recordings (Table 1).

Table 1: Phoneme segment counts (types and tokens) by material and signal quality.

    material       quality        types   tokens
    sentences      mobile         44      16904
    sentences      high quality   44      17003
    minimal pairs  mobile         29      10351
    minimal pairs  high quality   29      10408

In some words, phonemes are replaced by other phonemes due to coarticulation, e.g. /k/ by /x/ in gesagt (/g @ z a: k t/, past participle of the verb to say). These replacements occur in both mobile phone and high bandwidth recordings, but their counts differ: in the word gesagt, /k/ is used 396 times and /x/ 244 times for mobile phone speech, but 448 and 192 times, respectively, for high quality speech.

Other words have different segment counts for mobile and high bandwidth signal quality, e.g. the auxiliary verb haben (to have), which has only three distinct phoneme labels for high bandwidth quality, but six distinct phoneme labels for mobile phone quality (see Table 2). Note that MAUS consistently applies the coarticulation rules to the high bandwidth speech, but only to a lesser degree to the mobile phone speech.

Table 2: Different automatic segmentations for the word haben.

    phoneme   mobile count   high quality count
    h         9              16
    a:        16             16
    b         9              -
    @         5              -
    n         5              -
    m         11             16

From the 341 word forms, 221 (64.81%) have the same count of distinct phonemes for high quality and mobile phone speech, 103 (30.2%) have one different phoneme, 13 (3.81%) have two, and 4 (1.17%) have three or more different phonemes. Grouped by phoneme classes, it becomes clear that mainly the counts for fricatives, plosives, and vowels differ (Table 3).

Table 3: Counts for the phoneme classes by signal quality in the read sentences.

    class        mobile count   high quality count
    approximant  665            661
    diphthong    693            683
    nasal        2316           2364
    fricative    3328           3684
    plosive      3738           3667
    vowel        5104           4888
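Before turning to segment durations, a short sketch may make the hierarchical segment store described above concrete. It uses the standard library sqlite3 module purely for illustration; the authors used PostgreSQL (queried from R via RPostgreSQL), and all table and column names below are assumptions, not the authors' schema.

    # Sketch of a hierarchical segment store as described in Section 2:
    # one canonic form per word, many phonemes per canonic form, with the
    # hierarchy expressed as foreign key references. Table and column
    # names are illustrative; the authors used PostgreSQL, not SQLite.
    import sqlite3

    conn = sqlite3.connect("segments.db")
    conn.executescript("""
    CREATE TABLE word (
        word_id   INTEGER PRIMARY KEY,
        quality   TEXT NOT NULL,   -- 'mobile' or 'high'
        speaker   TEXT NOT NULL,
        orth      TEXT NOT NULL,   -- orthographic word form
        t_start   REAL NOT NULL,   -- segment start time in seconds
        t_end     REAL NOT NULL
    );

    CREATE TABLE canonical (
        canon_id  INTEGER PRIMARY KEY,
        word_id   INTEGER NOT NULL REFERENCES word(word_id),  -- one per word
        sampa     TEXT NOT NULL    -- citation pronunciation (SAMPA)
    );

    CREATE TABLE phoneme (
        phon_id   INTEGER PRIMARY KEY,
        canon_id  INTEGER NOT NULL REFERENCES canonical(canon_id),  -- many per form
        label     TEXT NOT NULL,   -- SAMPA phoneme label
        t_start   REAL NOT NULL,
        t_end     REAL NOT NULL
    );
    """)

    # Example query: phoneme token counts per label and signal quality,
    # the kind of tally behind Tables 1-3.
    rows = conn.execute("""
        SELECT w.quality, p.label, COUNT(*) AS tokens
        FROM phoneme p
        JOIN canonical c ON p.canon_id = c.canon_id
        JOIN word w      ON c.word_id = w.word_id
        GROUP BY w.quality, p.label
    """).fetchall()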

3.2. Segment durations

In the remainder of the paper, only the read sentences will be considered, because they cover all German phonemes.

Due to the hierarchical annotations, the duration of the orthographic words and canonical forms is determined by the sum of the durations of the corresponding phoneme segments. The total duration of the sentence segments is 1200.81 s for the high quality recordings and 1239.87 s for the mobile phone recordings. The average word segment duration is 0.287 s for high quality recordings and 0.296 s for mobile recordings; the average phoneme segment duration is 0.071 s and 0.073 s, respectively. Table 4 shows the segment durations by phoneme class.

Table 4: Segment durations (in s) by phoneme class for the read sentences.

    class        mobile duration   high quality duration
    approximant  0.099             0.073
    diphthong    0.142             0.125
    nasal        0.058             0.068
    plosive      0.064             0.054
    fricative    0.039             0.068
    vowel        0.088             0.078

All phoneme classes are affected: there is a significant dependency between duration and signal quality (F = 7.8768, p = 0.005). The fricatives clearly show the strongest effect (see Figure 2); here, the average phoneme duration for the mobile phone recordings is only 57.3% of that of the high quality recordings.

Figure 2: Phoneme durations for fricative, nasal, plosive and vowel segments by signal quality for the read sentences.

4. Discussion

The data presented here was computed by automated processes. The only difference between the high quality and the mobile phone recordings is the transmission channel and, subsequently, the signal quality; any difference in the automatic labeling and segmentation of the signal must thus be due to this difference. The MAUS system can be tuned to different signal qualities by training the phoneme models and by adapting the weighting factors that govern the application of coarticulation rules. Hence, the results presented here are not a measure of the general performance of MAUS, but serve to illustrate the effects of different signal qualities.

4.1. Cutoff frequency

The different counts for fricative, plosive and vowel phonemes in the high quality and the mobile phone recordings may be attributed to the cutoff frequency of the mobile phone signal. Fricatives, and the burst phase of plosives, have a large part of their energy above 4000 Hz, and this frequency range is not transmitted via the mobile phone or the ISDN channel. For an extreme example, see Figure 3, where the /s/ in /f E n s t 6/ (window), clearly visible in the high quality signal, is totally missing from the mobile phone signal, yielding the segmentation /f E n t 6/.

Figure 3: Signal fragment corresponding to the word Fenster in the high quality (a) and the mobile phone (b) signal. Note that the cutoff frequency almost completely removes the phoneme /s/ from the mobile phone signal.

As a consequence, to the automatic segmentation algorithm of MAUS, and in particular to the phoneme models trained on high quality signals, these sounds are either missing from the signal, so that the segments are elided, or substituted by another phoneme, e.g. a voiceless fricative by a voiced one. A closer look at the segment counts reveals that the difference in vowel counts is almost entirely due to /@/: in high quality signals, /@/ is often elided through coarticulation, whereas in mobile phone quality signals MAUS with its standard settings applies this coarticulatory reduction far less often. An interesting detail is that in mobile phone speech the voiced plosives /b, d, g/ occur much more frequently than in high quality recordings (1600 vs. 1391 times), whereas the voiceless plosives are much more frequent in the high quality recordings (2024 vs. 1657). This may be due to the burst energy of voiceless plosives being lost in the mobile phone signal, leading to more plosives being classified as voiced in mobile phone speech. The fricatives /h, v, x, f, s, C, z/ occur more often in high quality recordings, /r/ occurs equally often in both recordings, and only /S/ is more frequent in mobile phone recordings. Here, the effect of the cutoff frequency is especially clear: only a few traces of the fricatives are left in the mobile phone signal, which leads to these segments being elided.

4.2. Durations

The differences in segment durations between high quality and mobile phone recordings mainly affect fricatives. The duration of /x, z, s, f, C, S/ in mobile phone recordings is between 40.9% and 62.3% of the duration of these phonemes in high quality recordings (and their counts differ between signal qualities). If voiced and voiceless fricatives are viewed separately, it becomes clear that most voiceless fricatives in mobile phone signals have impossibly short durations, and that the voiced fricatives are only slightly longer (see Figure 4). A possible explanation for these short fricative segments is that almost all traces of friction are filtered out of the mobile phone signal, but no matching coarticulation rule for the phoneme can be applied, so MAUS computes a minimally short fricative segment of approx. 20 ms length.
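To show how such a duration comparison can be computed, here is a brief sketch with pandas and scipy. The paper's statistics were done in R with RPostgreSQL; the CSV export and its column names below are assumptions for illustration.

    # Sketch of the duration analysis: mean segment duration per phoneme
    # class and signal quality (the layout of Table 4), plus a one-way
    # F-test of duration against quality for one class. The CSV export
    # and its column names are assumptions, not the authors' data format.
    import pandas as pd
    from scipy import stats

    # hypothetical export with columns: class, quality, t_start, t_end
    segs = pd.read_csv("phoneme_segments.csv")
    segs["duration"] = segs["t_end"] - segs["t_start"]

    # Mean duration (in seconds) per phoneme class and recording quality.
    print(segs.pivot_table(index="class", columns="quality",
                           values="duration", aggfunc="mean"))

    # Does duration depend on signal quality, e.g. for the fricatives?
    fric = segs[segs["class"] == "fricative"]
    f_stat, p_val = stats.f_oneway(
        fric.loc[fric["quality"] == "mobile", "duration"],
        fric.loc[fric["quality"] == "high", "duration"],
    )
    print(f"F = {f_stat:.4f}, p = {p_val:.4f}")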

Voiced fricatives yield longer segments for mobile phone recordings (but still significantly shorter than for high quality signals); here MAUS finds traces of the fricative in the lower part of the spectrum and thus computes longer segments.

Figure 4: Durations of voiced (VD) and voiceless (VL) fricative segments for high quality and mobile recordings.

The segment durations of the front vowels /Y, y:, i/ are also much shorter for mobile phone recordings than for their high quality counterparts (70.2%, 77.7% and 79.2%), although their counts do not differ very much. Other front vowels, e.g. /E, i:, e:/, are almost equal in duration for both recording qualities. In general, the further back and the lower a vowel, the longer its duration in the mobile phone recordings: for /o:, a, o/ the duration of the mobile phone segments is 125.9%, 133.2% and 155.4% of the length of their high quality counterparts.

5. Conclusion and outlook

This acoustical analysis of the differences in the automatic segmentation and labeling of mobile phone and high quality recordings has shown that both labeling and segmentation are affected. The effects are not uniform across all phonemes, not even within phoneme classes. Fricatives are the most affected phonemes; the most consistent effect is the shortening of their segment duration and their reduced segment count in mobile phone speech. Within plosives, voiced and voiceless plosives differ in their effects on segment counts and durations.

Consonants in general, and fricatives in particular, are very important as acoustic features in forensic phonetics, as they have high perceptual confusability between speakers. Our results show that in the mobile phone signals fricatives are almost totally missing; this confirms the findings of [10], who showed that the reduced signal quality of mobile telephone speech negatively affects speaker identification. Their analysis focused on spectral features of nasals; our analysis shows that features such as segment counts and durations for nasals are also significantly affected by the transmission channel.

The comparison of mobile phone and high quality recordings is a real world application in forensics: quite often an original recording, in general made via a fixed network or mobile phone, is available and must be compared with high quality recordings of subjects made during interrogation. In such an application, it is important to know what effects the signal quality may have on automated processes such as the MAUS system.

The present analysis is restricted in terms of speakers. Currently, further recordings are being performed at the Phonetics Institute within the same-sex sibling comparison project by the second author. A further limitation, which is quite common for large speech databases, is that a manual verification of the results is in general not feasible because of time and budget constraints. Novel approaches to the visualization of results, e.g. an interactive browser for large speech databases, may alleviate this problem in the future.

6. References

[1] F. Schiel, "MAUS goes iterative," in Proc. LREC, Lisbon, Portugal, 2004, pp. 1015-1018.
[2] H. Feiser, "Acoustic similarities and differences in the voices of same-sex siblings," in Proc. IAFPA, Cambridge, 2009.
[3] K. Stevens, "Sources of inter- and intra-speaker variability in the acoustic properties of speech sounds," in Proc. 7th Intl. Congress of Phonetic Sciences, Montreal, Canada, 1971, pp. 206-227.
[4] N. Fecher, "Spectral properties of fricatives: a forensic approach," in Proc. ISCA Tutorial and Workshop on Experimental Linguistics, Paris, 2011, pp. 71-74.
[5] M. Jessen, Phonetische und linguistische Prinzipien des forensischen Stimmenvergleichs. LINCOM Studies in Phonetics, 2012.
[6] F. Nolan, K. McDougall, G. de Jong, and T. Hudson, "The DyVis database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research," The International Journal of Speech, Language and the Law, vol. 16, pp. 31-57, 2009.
[7] M. Jessen, "Forensic reference data on articulation rate in German," Science and Justice, pp. 50-67, 2007.
[8] C. Draxler and K. Jänsch, "SpeechRecorder - a universal platform independent multi-channel audio recording software," in Proc. LREC, Lisbon, 2004, pp. 559-562.
[9] clarin.phonetik.uni-muenchen.de/BASWebServices/.
[10] E. Enzinger and P. Balazs, "Speaker verification using pole/zero estimates of nasals," Eftimie Murgu Resita, vol. Anul XVIII, 2011.