Detecting Converted Speech and Natural Speech for anti-spoofing Attack in Speaker Recognition


Detecting Converted Speech and Natural Speech for anti-spoofing Attack in Speaker Recognition
Zhizheng Wu 1, Eng Siong Chng 1, Haizhou Li 1,2,3
1 School of Computer Engineering, Nanyang Technological University, Singapore
2 Human Language Technology Department, Institute for Infocomm Research, Singapore
3 School of EE & Telecom, University of New South Wales, Australia
12-Sep-2012

Outline
Motivation
Voice conversion overview
Phase feature extraction
Experiments
Conclusions

Motivation
We would like to detect converted speech (synthetic speech) to prevent spoofing attacks against speaker verification systems.
Phase artifacts in synthetic speech are an informative cue; we study ways of extracting phase features.
1. Tomi Kinnunen, Zhizheng Wu, Kong Aik Lee, Filip Sedlak, Eng Siong Chng, Haizhou Li, "Vulnerability of Speaker Verification Systems Against Voice Conversion Spoofing Attacks: the Case of Telephone Speech", ICASSP 2012.
2. Zhizheng Wu, Eng Siong Chng, Haizhou Li, "Speaker verification system against two different voice conversion techniques in spoofing attacks", Technical Report (http://www3.ntu.edu.sg/home/wuzz/), 2012.

Overview of Voice Conversion (1/3)
GMM-based voice conversion: source speech -> analysis -> transformation function -> synthesis -> target speech.
Phase artifacts are created between analysis and synthesis!

Overview of Voice Conversion (2/3)
Unit-selection based voice conversion: source speech -> analysis -> source frame sequence, matched against a target speech inventory to select a target frame sequence -> synthesis -> target speech.
Phase artifacts are created between analysis and synthesis!

Overview of Voice Conversion (3/3)
Analysis-synthesis pass-through without transformation: source speech -> analysis -> fundamental frequency and spectral parameters -> synthesis -> target speech.
Phase artifacts are created between analysis and synthesis!

Phase Artifacts
Voice conversion techniques focus on spectral conversion, since the magnitude spectrum contains more information.
Many vocoders use random phase, not the original phase, to reconstruct the speech.
K.K. Paliwal and L.D. Alsteris, "On the usefulness of STFT phase spectrum in human listening tests", Speech Communication, vol. 45, no. 2, pp. 153-170, 2005.

Phase feature extraction
Short-time Fourier transform of signal x(n): X(ω) = |X(ω)| e^{jφ(ω)}, where |X(ω)| is the magnitude spectrum (the MFCC path) and φ(ω) is the phase spectrum (this study).
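The split above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the framing function, frame length, hop size, and test signal are all assumptions made for the example.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Window the signal into overlapping frames and return their FFTs."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# 1 s of a 440 Hz tone at 16 kHz as a stand-in signal
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
X = stft_frames(x)
magnitude = np.abs(X)   # |X(w)|: the magnitude spectrum used by MFCC
phase = np.angle(X)     # phi(w): the phase spectrum studied here
```

MFCC-style features continue from `magnitude`; the phase features below continue from `phase`.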

Cosine Normalized Phase Feature (Cos-phase)
[Figure: cos-phase spectrograms (frequency vs. time, values in [-1, 1]) compared for natural speech and converted speech.]
Apply the discrete cosine transform (DCT) and keep 12 coefficients as the feature.

Modified Group Delay Phase (MGD-phase)
[Figure: MGD-phase spectrograms (frequency vs. time) compared for natural speech and converted speech.]
Apply the DCT and keep 12 coefficients as the feature.
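For reference, the standard modified group delay computation can be sketched per frame as below. This is a simplified sketch under stated assumptions: the cepstrally smoothed spectrum of the full MGD definition is replaced by a floored magnitude spectrum, and alpha and gamma are typical literature values, not taken from the slides.

```python
import numpy as np

def modified_group_delay(frame, alpha=0.4, gamma=0.9, eps=1e-8):
    """MGD sketch for one frame: with y(n) = n * x(n),
    tau(w) = (X_R*Y_R + X_I*Y_I) / |S(w)|^(2*gamma), compressed as
    sign(tau) * |tau|^alpha. S stands in for the cepstrally smoothed
    spectrum; here it is just |X| with a small floor (an assumption)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)
    S = np.maximum(np.abs(X), eps)
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2 * gamma)
    return np.sign(tau) * np.abs(tau) ** alpha
```

As with cos-phase, a DCT over the per-frame MGD spectra (keeping 12 coefficients) would give the final feature.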

Synthetic speech detector
GMM-based detector: score(C) = log p(C | λ_converted) - log p(C | λ_natural), where C is the feature vector sequence of a speech signal, λ_converted is the GMM for converted speech, and λ_natural is the GMM for natural speech.
We use 512 Gaussian components in this study.
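A log-likelihood-ratio detector of this kind can be sketched with scikit-learn's `GaussianMixture`. The feature data is synthetic and the component count is reduced from the 512 used in the study so the sketch trains quickly; this is an illustration of the scoring scheme, not the authors' system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for 12-dimensional phase features (hypothetical data)
rng = np.random.default_rng(0)
converted_feats = rng.normal(0.5, 1.0, size=(500, 12))
natural_feats = rng.normal(-0.5, 1.0, size=(500, 12))

# One GMM per class; the study uses 512 components, shrunk to 8 here
gmm_converted = GaussianMixture(n_components=8, random_state=0).fit(converted_feats)
gmm_natural = GaussianMixture(n_components=8, random_state=0).fit(natural_feats)

def detector_score(C):
    """Mean log-likelihood ratio over the feature sequence C:
    positive favours converted speech, negative favours natural speech."""
    return gmm_converted.score(C) - gmm_natural.score(C)
```

Thresholding `detector_score` at 0 gives a maximum-likelihood decision; sweeping the threshold trades off the two error types measured below.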

Experimental setups
Corpus: a subset of NIST SRE 2006.
Training set (number of sessions): natural model: 100; converted model: 100. The duration of each session is 5 minutes.
Three training situations for the converted model:
GMM-based converted speech for training
Unit-selection based converted speech for training
Pass-through speech for training
We conduct three experiments, one under each training situation.

Experimental setups
Testing set (number of sessions): natural: 1,500; GMM-based converted: 1,000; unit-selection converted: 1,000. In total 3,500 sessions.
Evaluation metric: equal error rate (EER), the point where the natural-to-converted and converted-to-natural error rates are equal.
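The EER metric above can be sketched as a threshold sweep over detector scores. The function name and the convention that higher scores mean "converted" are assumptions for the example.

```python
import numpy as np

def equal_error_rate(natural_scores, converted_scores):
    """Sweep a decision threshold over all observed scores and return
    the operating point where the natural-to-converted error rate
    (natural wrongly flagged as converted) and the converted-to-natural
    error rate (converted missed) are closest to equal."""
    thresholds = np.sort(np.concatenate([natural_scores, converted_scores]))
    far = np.array([np.mean(natural_scores >= t) for t in thresholds])
    frr = np.array([np.mean(converted_scores < t) for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

For perfectly separated score distributions the EER is 0; for fully overlapping ones it approaches 0.5 (chance).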

Experimental setups
Spoofing attack corpus construction with SPTK (http://sp-tk.sourceforge.net/): Mel-cepstral analysis for the analysis step, the MLSA filter for synthesis.

Results: 3 speech models vs. 3 features for synthetic speech detection
[Results table not reproduced in the transcription.]

Conclusions
Phase artifacts are useful in detecting synthetic speech.
When the transformation technique is unknown, we may use the analysis-synthesis pass-through method to simulate converted data.

Thank you!