DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface


Tamás Gábor Csapó (1,2), Tamás Grósz (3), Gábor Gosztolya (3,4), László Tóth (4), Alexandra Markó (2,5)

1 BME Department of Telecommunications and Media Informatics
2 MTA-ELTE Lendület Lingual Articulation Research Group
3 Institute of Informatics, University of Szeged
4 MTA-SZTE Research Group on Artificial Intelligence
5 ELTE Department of Phonetics
Hungary

Interspeech 2017, Stockholm, August 24, 2017

Introduction

Silent Speech Interface (SSI) I

Goal: convert silent articulation to audible speech while the speaker merely mouths the words, i.e. articulatory-to-acoustic mapping.

Imaging techniques:
- Ultrasound Tongue Imaging (UTI)
- Electromagnetic Articulography (EMA)
- Permanent Magnetic Articulography (PMA)
- lip video
- multimodal combinations

Silent Speech Interface (SSI) II

Ultrasound Tongue Imaging (UTI):
- used in speech research since the early 1980s
- an ultrasound transducer positioned below the chin images the tongue during speech
- tongue movement is recorded on video (up to 100 frames/sec)
- the tongue surface appears brighter than the surrounding tissue and air
- relatively good temporal and spatial resolution [Stone et al., 1983; Stone, 2005]

Silent Speech Interface (SSI) III

[Figure: schematic of the vocal tract and a sample ultrasound image of the tongue]

Silent Speech Interface (SSI) IV

SSI types:
- recognition followed by synthesis
- direct synthesis

Mapping techniques:
- Gaussian Mixture Models
- Mixture of Factor Analyzers
- Deep Neural Networks (DNN)

Previously, only one study combined UTI and DNNs [Jaumard-Hakoun et al., 2016]:
- singing voice synthesis
- estimation of vocoder spectral parameters from UTI and lip video
- AutoEncoder / Multi-Layer Perceptron

Goal of the current study

- initial experiments in articulatory-to-acoustic mapping
- "Micro" ultrasound equipment with access to the raw data
- direct speech synthesis from ultrasound recordings based on deep learning, using a feed-forward deep neural network

Methods

Recordings and data I

- parallel, synchronized ultrasound and speech recordings
- "Micro" system with stabilization headset (Articulate Instruments Ltd.)
- one female speaker
- 473 Hungarian sentences from the PPBA database [Olaszy, 2013]
- ultrasound frame rate: 82 fps
- speech sampling frequency: 22 050 Hz

Recordings and data II

[Figure: ultrasound images from the same speaker, with differing quality]

Processing the speech signal

A simple impulse-noise excited vocoder is used.

Analysis:
- speech resampled to 11 050 Hz
- excitation parameter: fundamental frequency (F0)
- spectral parameter: 12th-order Mel-Generalized Cepstrum in line spectral pair form (MGC-LSP)
- frame shift of 1 / (82 fps), in synchrony with the ultrasound frames (see the sketch below)

Synthesis:
- impulse-noise excitation using the original F0
- spectral filtering using the predicted MGC
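As a rough illustration of the analysis step, here is a minimal sketch assuming the pysptk library; the sampling rate, frame rate, and MGC order come from the slides, while the window length, the Hann window, and the alpha warping value are assumptions, and the MGC-to-LSP conversion is omitted.

```python
# Minimal sketch of ultrasound-synchronous MGC analysis (assumes pysptk).
import numpy as np
import pysptk

FS = 11050                    # sampling rate after resampling (from the slides)
FPS = 82                      # ultrasound frame rate (from the slides)
HOP = int(round(FS / FPS))    # ~135-sample frame shift, one per ultrasound frame
WIN = 512                     # analysis window length (assumption)
ORDER = 12                    # MGC order (from the slides)

def analyze_mgc(speech: np.ndarray) -> np.ndarray:
    """Return one MGC vector per ultrasound frame (LSP conversion omitted)."""
    frames = []
    for start in range(0, len(speech) - WIN, HOP):
        windowed = speech[start:start + WIN] * np.hanning(WIN)
        frames.append(pysptk.mgcep(windowed, order=ORDER, alpha=0.35, gamma=0.0))
    return np.stack(frames)
```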

Preprocessing the ultrasound data

- a raw ultrasound image is 64 × 946 pixels
- it is resized to 64 × 119 using bicubic interpolation (see the sketch below)
- DNN input: one ultrasound image in the simplest case; several consecutive images in the more advanced setups
- further reduction of the image size is necessary: the DNN input vector can be shrunk by discarding irrelevant pixels
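A minimal sketch of this preprocessing step, assuming Pillow for the bicubic resize; the function name and the flattening into a vector are illustrative.

```python
# Minimal sketch: resize a raw 64x946 ultrasound frame to 64x119 (bicubic).
import numpy as np
from PIL import Image

def preprocess_frame(raw_frame: np.ndarray) -> np.ndarray:
    assert raw_frame.shape == (64, 946)            # 64 scanlines x 946 samples
    img = Image.fromarray(raw_frame.astype(np.uint8))
    small = img.resize((119, 64), Image.BICUBIC)   # PIL takes (width, height)
    return np.asarray(small, dtype=np.float32).ravel()  # 7616-dim input vector
```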

Correlation-based feature selection

- importance score per pixel: the mean / maximum of the correlations between that pixel (DNN input) and the MGC coefficients (DNN output)
- only the 5, 10, ..., 25% of pixels with the largest importance scores are retained (sketched below)

[Figure: a raw ultrasound image and the mask (max., 20%)]
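A sketch of the selection, under the assumption that X holds the flattened ultrasound frames and Y the frame-aligned MGC targets; the names and the 20% default are illustrative.

```python
# Minimal sketch of correlation-based pixel selection.
import numpy as np

def select_pixels(X, Y, keep=0.20, reduce="max"):
    """X: (n_frames, n_pixels) images; Y: (n_frames, n_mgc) targets."""
    Xc = (X - X.mean(0)) / (X.std(0) + 1e-8)     # standardize pixels
    Yc = (Y - Y.mean(0)) / (Y.std(0) + 1e-8)     # standardize MGC coefficients
    corr = np.abs(Xc.T @ Yc) / len(X)            # |Pearson r|, (n_pixels, n_mgc)
    score = corr.max(1) if reduce == "max" else corr.mean(1)
    n_keep = int(keep * X.shape[1])
    return np.sort(np.argsort(score)[-n_keep:])  # indices of retained pixels
```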

Eigentongue feature extraction

- find a finite set of orthogonal images, called eigentongues [Hueber et al., 2007]
- apply PCA to the ultrasound images and keep 20% of the information (sketched below)

[Figure: the first two extracted Eigentongues]
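A sketch assuming scikit-learn; interpreting "20% of the information" as 20% of the explained variance is an assumption, and the names are illustrative.

```python
# Minimal sketch of Eigentongue extraction via PCA (assumes scikit-learn).
import numpy as np
from sklearn.decomposition import PCA

def eigentongues(X: np.ndarray):
    """X: (n_frames, 64*119) flattened, resized ultrasound images."""
    pca = PCA(n_components=0.20)     # keep components covering 20% of variance
    coeffs = pca.fit_transform(X)    # per-frame Eigentongue coefficients (DNN input)
    basis = pca.components_.reshape(-1, 64, 119)  # the Eigentongue images
    return coeffs, basis
```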

Deep learning I

- input: 1 image / 5 consecutive images, after feature selection or Eigentongue extraction
- output: spectral parameters (MGC)
- fully connected deep rectifier neural networks
- 5 hidden layers, 1000 neurons per layer
- linear output layer
- two training setups (sketched below):
  - joint model: one DNN for the full MGC vector
  - separate models: a separate DNN for each of the output features
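The slides do not name a framework, so the following Keras sketch is only one plausible rendering of the described topology; the input and output sizes are illustrative.

```python
# Minimal sketch of the joint model: 5 hidden ReLU layers of 1000 units each.
from tensorflow import keras

N_IN = 5 * 1523    # e.g. 5 consecutive frames x retained pixels (illustrative)
N_OUT = 13         # full MGC vector for the joint model (illustrative)

model = keras.Sequential()
model.add(keras.Input(shape=(N_IN,)))
for _ in range(5):                        # 5 hidden rectifier layers
    model.add(keras.layers.Dense(1000, activation="relu"))
model.add(keras.layers.Dense(N_OUT))      # linear output layer
model.compile(optimizer="adam", loss="mse")
# The "separate models" variant trains one such network per MGC coefficient
# (N_OUT = 1).
```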

Experimental results

Objective measurements

Normalized Mean Square Error (NMSE) and mean R² scores on the development set (lower NMSE and higher R² are better; computation sketched below):

Type                                   NMSE    Mean R²
DNN (separate models)                  0.409   0.597
DNN (joint model)                      0.384   0.619
DNN (feature selection (max.), 20%)    0.441   0.562
DNN (feature selection (avg.), 20%)    0.442   0.561
DNN (Eigentongue, 20%)                 0.432   0.577
DNN (feature sel. (max.), 5 images)    0.380   0.625
DNN (feature sel. (avg.), 5 images)    0.388   0.615
DNN (Eigentongue, 5 images)            0.402   0.608
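For reference, a sketch of how the two scores can be computed, assuming y_true and y_pred are (frames × coefficients) arrays; the slides do not spell out the exact normalization, so these are the standard definitions and may differ in detail from the authors' scripts.

```python
# Standard definitions of the two objective scores (assumed, not confirmed).
import numpy as np

def nmse(y_true, y_pred):
    """Mean squared error normalized by the target variance."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def mean_r2(y_true, y_pred):
    """R^2 per MGC coefficient, averaged over coefficients."""
    ss_res = ((y_true - y_pred) ** 2).sum(0)
    ss_tot = ((y_true - y_true.mean(0)) ** 2).sum(0)
    return np.mean(1.0 - ss_res / ss_tot)
```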

Subjective listening test I

- MUSHRA (MUlti-Stimulus test with Hidden Reference and Anchor)
- 6 stimulus types, 10 sentences, 1 speaker:
  - natural sentences
  - vocoded reference
  - anchor: constant MGC taken from a schwa vowel
  - the 3 proposed DNN approaches
- goal: evaluate overall naturalness, rated from 0 (highly unnatural) to 100 (highly natural)
- 23 Hungarian listeners (20 female, 3 male; 19-32 years old)

Subjective listening test II

Mean naturalness scores:

Natural                             94.82
Vocoded                             56.22
DNN (joint model)                   30.21
DNN (Eigentongue, 5 images)         31.10
DNN (feat. sel. (max.), 5 images)   32.18
Anchor                               2.65

Summary and conclusions

Summary and conclusions I

- goal of the study: synthesize speech from tongue ultrasound images
- DNN-based articulatory-to-acoustic mapping: tongue ultrasound → vocoder spectral parameters
- various approaches:
  - a joint model predicting all spectral features
  - separate models predicting the spectral features one by one
  - two variants of correlation-based feature selection
  - Eigentongue feature extraction to reduce the size of the ultrasound images
  - the feature selection methods combined with several consecutive ultrasound frames

Summary and conclusions II

The synthesized sentences (using the original F0) are mostly intelligible.

[Figure: spectrograms (0-5000 Hz, 0-4 s) of a) the vocoded reference and b) the DNN output (feature sel. (max.), 5 images), with audio samples, for the sentence "Gyengéden megcirógatta az orrát egy papírcsiptetővel." ("He gently stroked her nose with a paper clip.")]

Applications

Silent Speech Interfaces (long-term goals):
- useful for the speech impaired (e.g. after laryngectomy)
- producing speech in extremely noisy environments, just by mouthing
- private conversations in public places

Future plans

- mapping from articulatory data to F0
- investigating other neural network types (e.g. AutoEncoders and CNNs)
- using multimodal articulatory data (e.g. lip video; EMA)
- testing more advanced vocoders
- recording real silent speech (silent articulation)

Thank you for your attention!

References

Hueber, T., Aversano, G., Chollet, G., Denby, B., Dreyfus, G., Oussar, Y., Roussel, P., and Stone, M. (2007). Eigentongue feature extraction for an ultrasound-based silent speech interface. In Proc. ICASSP, pages 1245-1248, Honolulu, HI, USA.

Jaumard-Hakoun, A., Xu, K., Leboullenger, C., Roussel-Ragot, P., and Denby, B. (2016). An articulatory-based singing voice synthesis using tongue and lips imaging. In Proc. Interspeech, pages 1467-1471.

Olaszy, G. (2013). Precíziós, párhuzamos magyar beszédadatbázis fejlesztése és szolgáltatásai [Development and services of a precise, parallel Hungarian speech database]. Beszédkutatás 2013, pages 261-270.

Stone, M. (2005). A guide to analysing tongue motion from ultrasound images. Clinical Linguistics & Phonetics, 19(6-7):455-501.

Stone, M., Sonies, B., Shawker, T., Weiss, G., and Nadel, L. (1983). Analysis of real-time ultrasound images of tongue configuration using a grid-digitizing system. Journal of Phonetics, 11:207-218.