SPEECH ENHANCEMENT BY FORMANT SHARPENING IN THE CEPSTRAL DOMAIN


David Cole and Sridha Sridharan
Speech Research Laboratory, School of Electrical and Electronic Systems Engineering, Queensland University of Technology

ABSTRACT: This paper presents a method for enhancing speech signals in the root cepstral domain, typically in conjunction with the cepstral subtraction technique. The effect of the processing is similar to that of the spectral sharpening method performed in the time domain, but with a much lower computational requirement when combined with cepstral subtraction, and with better performance. The portions of the signal corresponding to formant peaks are further amplified, while formant valleys are attenuated, with the aims of further reducing noise in those regions and improving speech quality. The cepstral procedure maintains the spectral tilt of all speech segments, unlike the time domain approach, which only attempts to maintain the long-term average spectral tilt. The scheme was devised as part of a system designed for forensic speech enhancement, where speech intelligibility is an important consideration.

INTRODUCTION

Speech enhancement is a common requirement in speech processing systems, either as pre-processing for procedures such as speech or speaker recognition, or for improving the quality or intelligibility of speech for human audition. Speech enhancement thus has varying goals depending on the use of the enhanced speech output: the procedures used as pre-processing for speech or speaker recognition might be quite different from those used for enhancement intended for improved human reception of the speech. The enhancement technique described in this paper was designed as an addition to the noise reduction system described by Fisher and Sridharan (1994), whose purpose was forensic speech enhancement.
They used a combination of spectral subtraction and cepstral subtraction to produce enhancement results suitable for use in such an application. This requires that intelligibility not be impaired, and preferably improved. The procedure described here manipulates the cepstrum of the signal to enhance the formant structure of the speech signal, at negligible computational cost when the signal cepstrum is already available. It has similar aims to the time domain spectral sharpening technique, but improved spectral tilt performance over that method.

SPECTRAL SHARPENING

The use of spectral sharpening for speech enhancement and noise reduction was suggested by Schaub and Straub (1991). The technique was originally used for adaptive post-filtering in speech coding schemes using linear prediction (Ramamoorthy and Jayant 1984). The basis is the linear predictive filter defined by

A(z) = a_1 z^-1 + a_2 z^-2 + ... + a_m z^-m

These coefficients are used for post-filtering of the decoded signal using the transfer function

H(z) = (1 - A(z/β)) / (1 - A(z/γ))

with filter parameters 0 < β < γ < 1.

Accepted after full review page 244
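As a concrete illustration, the time domain scheme can be sketched in a few lines of numpy, assuming the standard short-term post-filter formulation H(z) = (1 - A(z/β)) / (1 - A(z/γ)) with LP coefficients obtained by the autocorrelation method. The function names, frame length, LP order and the β, γ values are illustrative choices for this sketch, not the authors' settings, and the filter is applied directly to a frame rather than inside a coder:

```python
import numpy as np

def lpc(x, order):
    """LP coefficients a_1..a_m (autocorrelation method, Levinson-Durbin),
    for the predictor x[n] ~ sum_k a_k x[n-k]."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[:i - 1], r[i - 1:0:-1])) / err
        if i > 1:
            a[:i - 1] = a[:i - 1] - k * a[i - 2::-1]
        a[i - 1] = k
        err *= 1.0 - k * k
    return a

def sharpen_time_domain(x, order=10, beta=0.5, gamma=0.8):
    """Post-filter H(z) = (1 - A(z/beta)) / (1 - A(z/gamma)), 0 < beta < gamma < 1,
    where A(z) = a_1 z^-1 + ... + a_m z^-m from LP analysis of the frame."""
    a = lpc(x, order)
    w = np.arange(1, order + 1)
    num = np.concatenate(([1.0], -a * beta ** w))    # 1 - A(z/beta): zeros
    den = np.concatenate(([1.0], -a * gamma ** w))   # 1 - A(z/gamma): poles
    y = np.zeros_like(x)
    for n in range(len(x)):                          # direct-form IIR filtering
        acc = sum(num[k] * x[n - k] for k in range(min(n, order) + 1))
        acc -= sum(den[k] * y[n - k] for k in range(1, min(n, order) + 1))
        y[n] = acc
    return y
```

Since γ > β, the poles of H(z) sit closer to the unit circle than its zeros, so the resonances of the LP model are emphasized; setting β = γ makes the filter an identity, which gives a simple sanity check.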

The result of this post-filtering is a sharpening of the formant structure of the speech signal, with amplified formant peaks and attenuated formant valleys, effectively due to the poles of the linear predictive resynthesis filter being shifted closer to the unit circle. As well as the enhancement application described by Schaub and Straub, this post-filtering technique has also been used to improve the quality of synthesized speech (Dines & Sridharan 2001).

In the enhancement scheme of Schaub and Straub, high-pass filtering is applied to the speech signal before the spectral sharpening filter in an attempt to compensate for the general spectral tilt of the long-term speech spectrum. This is to avoid overemphasis of the spectral tilt characteristic. As will be shown here, this is not ideal, as the tilt of the speech spectrum in the short term often differs markedly from the long-term characteristic. A block diagram of the scheme is shown in Figure 1.

Figure 1. Spectral sharpening in the time domain using linear prediction.

Figure 2 illustrates the effect of the time domain spectral sharpening scheme. It shows the smoothed (10th order LPC) frequency spectrum of the vowel /a/ before (solid line) and after (dashed line) processing. The high pass pre-emphasis filter parameters were adjusted to maintain approximately the spectral tilt of the original spectrum for this segment.

Figure 2. Smoothed spectrum of phoneme /a/ (a) before time domain processing (solid line) (b) after time domain processing (dashed line)

Figure 3 shows another segment of the same utterance, corresponding to the phoneme /s/. The original spectrum (solid line) shows the rising spectral tilt typical of this phoneme. After processing (with identical parameters to those used for Figure 2), the smoothed spectrum of the processed segment shows the problem inherent in this scheme: the pre-emphasis is compensating for an expected falling spectral tilt, and this, combined with the rising spectral tilt of the post-filter, produces an overemphasis of the higher frequencies.

Figure 3. Smoothed spectrum of phoneme /s/ (a) before time domain processing (solid line) (b) after time domain processing (dashed line)

ROOT CEPSTRAL PROCESSING OF SPEECH SIGNALS

The use of the root cepstrum in speech processing is generally associated with cepstral subtraction. The techniques of spectral subtraction and cepstral subtraction are both well known and widely used for noise reduction, where the speech signal s(n) is corrupted additively by noise z(n) to produce the observed noisy speech signal y(n):

y(n) = s(n) + z(n)

where n is the discrete time index. The discrete frequency domain equivalent is

Y(k) = S(k) + Z(k)

Spectral subtraction involves calculating an estimate |Z~(k)| of the noise spectral magnitude and subtracting this from the observed spectral magnitude |Y(k)|. The resultant estimate of the clean signal spectral magnitude |S~(k)| is recombined with the noisy signal phase and inverse Fourier transformed to yield the estimated clean speech signal s~(n):

s~(n) = F^-1{ (|Y(k)| - |Z~(k)|) e^(j arg Y(k)) }
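A minimal numpy sketch of this magnitude subtraction on a single windowed frame follows; the spectral floor used to avoid negative magnitudes is a common practical addition, not part of the equations above, and the function name and parameters are illustrative:

```python
import numpy as np

def spectral_subtract(y_frame, noise_mag, floor=0.01):
    """Magnitude spectral subtraction on one windowed frame.
    noise_mag is an estimate of |Z(k)| for the rfft bins, e.g. averaged
    over speech pauses. The result keeps the noisy phase."""
    Y = np.fft.rfft(y_frame)
    mag = np.abs(Y) - noise_mag                  # |Y(k)| - |Z~(k)|
    mag = np.maximum(mag, floor * np.abs(Y))     # rectify with a spectral floor
    S = mag * np.exp(1j * np.angle(Y))           # recombine with noisy phase
    return np.fft.irfft(S, n=len(y_frame))       # back to the time domain
```

With a zero noise estimate the frame passes through unchanged, and any nonzero estimate can only reduce the frame energy, which makes the behaviour easy to verify.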

Generally, it is attempted to calculate the noise estimate during speech pauses, although a common approach is to use an auto-regressive average of the noisy speech signal as the noise estimate.

The original cepstrum calculation used the inverse Fourier transform of the logarithm of the spectral magnitude. This is impractical in low noise conditions, where the undefined log(0) condition can arise. Thus, the root cepstrum is generally used for speech cepstral subtraction. The root cepstrum of the observed signal is

ŷ(n) = F^-1{ |Y(k)|^(1/p) }

with p typically 2 or 4. Cepstral subtraction follows the same form as spectral subtraction, using an accumulated cepstral noise estimate subtracted from the observed cepstrum. Despite its lack of a mathematical basis (since subtraction in the cepstral domain equates to deconvolution, not subtraction, in the time domain), cepstral subtraction is accepted as superior in performance to spectral subtraction for additive noise removal (Wu et al 1991), producing a less distorted output. Fisher and Sridharan (1994) concluded that spectral subtraction outperformed cepstral subtraction at low signal to noise ratios, and so devised a tandem system using both techniques which outperformed either used singly. The spectral sharpening procedure described below was developed to further improve the output quality of this system.

SPECTRAL SHARPENING IN THE ROOT CEPSTRAL DOMAIN

The use of the cepstrum in speech processing is not confined to speech enhancement. It has also been used widely in the speech and speaker recognition fields because of its ability to separate vocal tract and excitation information. For vocal tract characterisation, the cepstrum is similar to linear prediction coefficients, although the information contained in the two sets of parameters is not identical. Linear prediction provides formant location and bandwidth, defined by the roots of the LPC equation.
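Under the same per-frame assumptions as before, the root cepstrum and a basic cepstral subtraction step might look like the following sketch. Clamping negative values before undoing the p-th root, and reusing the noisy phase for resynthesis, are implementation choices for this sketch rather than details specified in the text:

```python
import numpy as np

def root_cepstrum(frame, p=2):
    """Root cepstrum: inverse FFT of the p-th root of the spectral magnitude."""
    Y = np.fft.rfft(frame)
    return np.fft.irfft(np.abs(Y) ** (1.0 / p), n=len(frame))

def cepstral_subtract(y_frame, noise_cep, p=2):
    """Subtract an accumulated noise estimate in the root cepstral domain,
    then resynthesize using the noisy phase."""
    Y = np.fft.rfft(y_frame)
    c = np.fft.irfft(np.abs(Y) ** (1.0 / p), n=len(y_frame))
    c_hat = c - noise_cep
    # Forward transform recovers the modified |S(k)|^(1/p); clamp any
    # negative values before undoing the p-th root.
    mag = np.maximum(np.fft.rfft(c_hat).real, 0.0) ** p
    return np.fft.irfft(mag * np.exp(1j * np.angle(Y)), n=len(y_frame))
```

Because the root cepstrum here is built from a zero-phase magnitude spectrum, a zero noise estimate round-trips the frame exactly (up to rounding), which is a convenient correctness check.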
In contrast, the cepstral coefficients can be used to build up a smoothed spectrum using the sinusoidal basis functions of the Fourier transform. Thus it is common for speech or speaker recognition systems to use either linear predictive coefficients or (low order) cepstral coefficients to parameterise vocal tract characteristics.

The use of the root cepstrum for spectral sharpening exploits the encoding of the smoothed speech spectrum in the low order cepstral coefficients. By increasing the magnitude of selected cepstral coefficients, it is possible to increase the amplitude of the corresponding smoothed spectral envelope of the reconstructed signal. By considering the nature of the Fourier transformation, it is clear that the main contribution to the spectral tilt of a speech segment will be contained in the lowest order cepstral coefficients: n = 1 and possibly also 2. Thus, to maintain the spectral tilt of a speech segment, these lowest order cepstral coefficients are left unmodified (as well as the 0th coefficient, which represents the energy level of the segment). Expressing this scheme mathematically, we operate on the input cepstrum ŷ(n) to produce the output cepstrum qˆ(n) as follows:

qˆ(n) = k_n ŷ(n), where k_n = b for n_l < n < n_h, and k_n = 1 otherwise

Typically the boost factor b, which is greater than 1, is applied (for a frame size of 256 samples) for n = 3 to 20 or thereabouts, which is adequate to represent the speech formant structure. Clearly, the procedure requires very little computational effort when the cepstrum has already been calculated for a cepstral subtraction procedure.

Accepted after full review page 247
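A sketch of this boost scheme follows, with hypothetical values b = 1.4, n_l = 2, n_h = 21 (so bins n = 3..20 are boosted, matching the range suggested for a 256-sample frame). Mirroring the boost onto the symmetric upper quefrency bins keeps the reconstructed magnitude spectrum real; that detail, like the parameter values, is an implementation choice for this sketch:

```python
import numpy as np

def sharpen_cepstral(frame, b=1.4, n_lo=2, n_hi=21, p=2):
    """Boost root-cepstral coefficients n_lo < n < n_hi by factor b, leaving
    n = 0 (energy) and n <= n_lo (spectral tilt) untouched. The parameter
    values are hypothetical, not the authors' settings."""
    N = len(frame)
    Y = np.fft.rfft(frame)
    c = np.fft.irfft(np.abs(Y) ** (1.0 / p), n=N)    # root cepstrum (zero phase)
    idx = np.arange(N)
    k = np.ones(N)
    k[(idx > n_lo) & (idx < n_hi)] = b               # boost formant-structure bins
    k[(idx > N - n_hi) & (idx < N - n_lo)] = b       # mirror: the cepstrum is even
    q = k * c                                        # q^(n) = k_n * y^(n)
    mag = np.maximum(np.fft.rfft(q).real, 0.0) ** p  # back to spectral magnitude
    return np.fft.irfft(mag * np.exp(1j * np.angle(Y)), n=N)  # keep noisy phase
```

With b = 1 the function reduces to an identity (up to rounding), so the transform round-trip can be checked independently of the boost itself.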

Figures 4 and 5 show the performance of the scheme for the same two speech segments used previously. The /a/ segment output is almost identical to that produced by the time domain scheme, and the /s/ segment output is quite clearly superior, maintaining the overall spectral tilt well while enhancing the formant structure.

Figure 4. Smoothed spectrum of phoneme /a/ (a) before cepstral processing (solid line) (b) after cepstral processing (dashed line)

Figure 5. Smoothed spectrum of phoneme /s/ (a) before cepstral processing (solid line) (b) after cepstral processing (dashed line)

OUTPUT QUALITY

The maintenance of spectral tilt for individual segments, as shown in the above figures, results in an audible improvement in the quality of the cepstrally enhanced speech as compared with the time domain enhanced version. The need to use fixed pre-emphasis in the time domain scheme means that, in general, the spectral tilt of individual segments will be altered, adversely affecting the quality of the speech output. Usually, the result is a general increase in high frequency content, producing an unpleasant, harsh output. In contrast, the cepstral enhancement method maintains segmental spectral tilt, producing more natural and higher quality speech. These quality comparisons have not yet been quantified, but audio demonstrations of the differences may be found at http://www.rcsavt.bee.qut.edu.au/pages/demonstrations.html

REFERENCES

Fisher, A. & Sridharan, S. (1994) Speech enhancement for forensic applications, Proceedings of the International Conference on Speech Science and Technology (SST-94), 40-45.

Dines, J. & Sridharan, S. (2001) Trainable speech synthesis with hidden Markov models, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-2001), 833-836.

Ramamoorthy, V. & Jayant, N. (1984) Enhancement of ADPCM speech by adaptive post-filtering, AT&T Technical Journal, 63(8), 1465-1475.

Schaub, A. & Straub, P. (1991) Spectral sharpening for speech enhancement / noise reduction, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-91), 993-996.

Wu, C.S., Nguyen, V.V., Sabrin, H., Kushner, W. & Damoulakis, J. (1991) Fast self-adapting broadband noise removal in the cepstral domain, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-91), 957-960.