Efficient Methods to Train Multilingual Bottleneck Feature Extractors for Low Resource Keyword Search


Efficient Methods to Train Multilingual Bottleneck Feature Extractors for Low Resource Keyword Search
Chongjia Ni, Cheung Chi Leung, Lei Wang, Nancy Chen and Bin Ma
9 March 2017, ICASSP 2017, New Orleans

Outline
- Introduction
- Multilingual Data Selection for Low Resource Keyword Search
- Multilingual Deep Bottleneck Feature Extractors
- Experiments on 2015 NIST Open KWS
- Conclusions

Introduction: Background
- LVCSR-based keyword search (KWS) for low-resource languages
- Multilingual DNNs for rapid language adaptation
- Bottleneck feature extraction from a multilingual DNN
- Multilingual deep bottleneck features: an efficient way for cross-lingual knowledge transfer
- Not all multilingual data contribute equally to the ASR/KWS performance of a target language

Organization of the Paper
- Introduction
- Effective multilingual data selection
  - LSTM RNN for modeling languages
  - Select utterances in the multilingual training data that are acoustically close to the training data of the target language
- Multilingual deep bottleneck feature (BNF) extractor
- Comparison with previous work using submodular subset selection
- Analysis of rapidly updating an existing BNF extractor vs. training a new BNF extractor

Multilingual Data Selection
Multilingual data selection based on a submodular function
- GMM tokenization instead of phonetic-related features
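The submodular baseline referenced above can be sketched with a standard greedy facility-location maximiser. This is an illustrative sketch only: the similarity matrix `sim` (which in the paper's setting would be derived from GMM tokenization rather than phonetic features) and the selection budget are hypothetical stand-ins, not the authors' exact objective.

```python
import numpy as np

def greedy_facility_location(sim, budget):
    """Greedily maximise the facility-location function
    f(S) = sum_i max_{j in S} sim[i, j], a classic monotone submodular
    objective for subset selection. `sim[i, j]` is the similarity
    between utterance i and candidate utterance j (hypothetical input)."""
    n = sim.shape[0]
    selected = []
    best_cover = np.zeros(n)  # current max similarity of each utterance to S
    for _ in range(budget):
        # marginal gain of adding each candidate j to the selected set
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf  # never re-select an utterance
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected
```

The greedy rule enjoys the usual (1 - 1/e) approximation guarantee for monotone submodular maximisation, which is why it is a common choice for data-subset selection.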

Multilingual Data Selection
Multilingual data selection based on language identification
- LSTM RNN model for language identification
- [Architecture diagram omitted: input layer, stacked layers of LSTM memory cells, and an output layer with (L+1) targets]

Reference: J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, "Automatic Language Identification using Long Short-Term Memory Recurrent Neural Networks," Interspeech 2014.

Multilingual Data Selection
Multilingual data selection based on language identification
- Utterances in the multilingual training data that have high softmax outputs for the target language are selected.
- That is, select the utterances that are classified into the target language with high probability, i.e., utterances that are acoustically similar to the training data of the target language.
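The selection rule on this slide can be sketched as follows. The posterior matrix, the per-utterance durations, and the budget-filling loop are hypothetical illustrations of the idea (rank utterances by the LID softmax posterior of the target language and keep the most target-like ones), not the authors' code; the LSTM LID model itself is trained separately.

```python
import numpy as np

def select_by_lid(posteriors, target, hours, utt_hours):
    """Rank multilingual utterances by the LID posterior of the target
    language and keep the highest-scoring ones until the data budget
    (in hours) is filled.

    posteriors : (n_utts, L+1) softmax outputs of the LID model
    target     : column index of the target language (Swahili in the paper)
    hours      : data budget, e.g. 3.5 hours per source language
    utt_hours  : duration of each utterance in hours (hypothetical input)
    """
    order = np.argsort(-posteriors[:, target])  # most target-like first
    picked, total = [], 0.0
    for i in order:
        if total >= hours:
            break
        picked.append(int(i))
        total += utt_hours[i]
    return picked
```

A fixed posterior threshold would work equally well as a stopping rule; filling a duration budget simply matches the fixed-size subsets (e.g., 3.5 hours per language) used in the experiments.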

Multilingual Deep Bottleneck Feature Extractors
Shared-hidden-layer multilingual DNN for bottleneck features
- [Architecture diagram omitted: a common input layer and shared hidden layers, a bottleneck layer, and language-specific output layers producing the senones of L1, L2, ..., LN]

Reference: J. T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, "Cross-Language Knowledge Transfer using Multilingual Deep Neural Network with Shared Hidden Layers," ICASSP 2013.
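A minimal numpy forward pass can illustrate the shared-hidden-layer idea: the shared stack feeds a linear bottleneck, and each language gets its own softmax head over its senone set. The layer count, the `senones` sizes, and the random weights are toy stand-ins (only the 117-dim input, 1,500-unit hidden layers, and 42-unit bottleneck mirror the paper's figures); real training would backpropagate through the shared stack from all language heads.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy dimensions echoing the paper's setup; senone counts are hypothetical.
d_in, d_hid, d_bn = 117, 1500, 42
senones = {"L1": 1000, "L2": 800}

W_shared = [rng.standard_normal((d_in, d_hid)) * 0.01,
            rng.standard_normal((d_hid, d_hid)) * 0.01]
W_bn = rng.standard_normal((d_hid, d_bn)) * 0.01
W_out = {l: rng.standard_normal((d_bn, n)) * 0.01 for l, n in senones.items()}

def bottleneck_features(x):
    """Shared stack up to the bottleneck; the 42-dim linear activations
    are the multilingual BNFs reused for the target language."""
    h = x
    for W in W_shared:
        h = relu(h @ W)
    return h @ W_bn  # linear bottleneck, no nonlinearity

def senone_posteriors(x, lang):
    """Language-specific softmax head on top of the shared bottleneck."""
    return softmax(relu(bottleneck_features(x)) @ W_out[lang])

x = rng.standard_normal((4, d_in))        # 4 frames of input features
print(bottleneck_features(x).shape)       # 42-dim BNFs per frame
```

After multilingual training, the language-specific heads are discarded and `bottleneck_features` serves as the feature extractor for the target-language system.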

Experimental Setup
Keyword search task for low-resource languages: NIST Open Keyword Search 2015 Evaluation
- Swahili as the target language
- Language packs of 23 other languages released by the IARPA Babel Program
- VLLP (3-hour training set) + 10-hour development set Dev10h + 15-hour evaluation set Evalpart1

Feature extraction
- 117 features: 22 fbank + 3 pitch (with their deltas and delta-deltas) + 42 BNF
- Multilingual deep BNF extractor (6 hidden layers; 42 hidden units in the bottleneck layer, 1,500 hidden units in the other hidden layers)

Acoustic modeling
- Hybrid DNN (6 hidden layers, 1,024 hidden units per hidden layer, 2,207 senones)
- Discriminatively trained GMM-HMM for alignment
- Cross-entropy training + sMBR criterion for sequence training

Language modeling
- 3-gram Web-data LM interpolated with a 3-gram LM trained on the VLLP transcriptions
- Interpolation weights optimized by minimizing perplexity on the Dev10h transcriptions

Keyword search
- 4,454 keywords (260 OOV to the LM with Web data and 2,667 OOV to the LM with training transcriptions)
- ATWV (actual term-weighted value) and WER for measuring performance
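The ATWV metric listed above can be computed as sketched below, using the term-weighted-value definition commonly stated for the NIST OpenKWS evaluations (beta = 999.9, one false-alarm trial per second of speech). The `kw_stats` layout is a hypothetical convenience, not the official scorer.

```python
def atwv(kw_stats, speech_seconds, beta=999.9):
    """Actual Term-Weighted Value, as commonly defined for NIST OpenKWS:
        ATWV = 1 - mean_k [ P_miss(k) + beta * P_FA(k) ]
    with P_miss(k) = 1 - n_hit / n_true and
         P_FA(k)   = n_fa / (speech_seconds - n_true).

    kw_stats maps keyword -> (n_true, n_hit, n_fa); keywords with no
    reference occurrences are skipped, following the usual scoring rule."""
    losses = []
    for n_true, n_hit, n_fa in kw_stats.values():
        if n_true == 0:
            continue
        p_miss = 1.0 - n_hit / n_true
        p_fa = n_fa / (speech_seconds - n_true)
        losses.append(p_miss + beta * p_fa)
    return 1.0 - sum(losses) / len(losses)
```

A perfect system scores 1.0; outputting nothing scores 0.0; and because beta is large, even a few false alarms per keyword can push ATWV negative, which is why ATWV rewards well-calibrated detection thresholds.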

Experimental Setup
Selected multilingual training data for BNF extractors
- Baseline Multilingual 509h: Cantonese (175.2 hours), Pashto (111.1 hours), Turkish (107.4 hours), Tagalog (115.7 hours); these 4 languages were randomly selected from the 23 FLPs
- Baseline Multilingual 14h Submodular: 3.5 hours from each language, selected from Baseline Multilingual 509h by submodular subset selection
- Baseline Multilingual 14h LID: 3.5 hours from each language, selected from Baseline Multilingual 509h by the proposed multilingual data selection
- Submodular Multilingual 96h: Zulu (20.1 hours), Pashto (35.0 hours), Vietnamese (27.6 hours), Cantonese (13.3 hours), selected from the 23 FLPs by submodular subset selection
- Proposed Multilingual 96h: Haitian Creole (29.7 hours), Zulu (21.6 hours), Dholuo (23.9 hours), Vietnamese (20.7 hours), selected from the 23 FLPs by the proposed multilingual data selection
- Proposed Multilingual 14h: 3.5 hours from each language, selected from Proposed Multilingual 96h by the proposed multilingual data selection
- Creole 14h: Haitian Creole (14 hours), selected by the proposed multilingual data selection

Experiments
Table 1. Performance of baseline KWS systems on Evalpart1.

BNF extractor         | Data set for training BNF extractor | Web data LM (WER / ATWV) | Training transcription LM (WER / ATWV)
Baseline Monolingual  | VLLP TL                             | 67.4 / 0.308             | 69.3 / 0.194
Baseline Multilingual | Baseline Multilingual 509h          | 64.5 / 0.361             | 69.0 / 0.216

Observation: using a large amount of multilingual data gives better performance, even when the data are not carefully selected for the target language.

Experiments
Table 2. Performance of different KWS systems on Evalpart1 after rapidly updating the baseline multilingual BNF extractor using 14 hours of multilingual data.

ID | Data set for updating BNF extractor     | Web data LM (WER / ATWV) | Training transcription LM (WER / ATWV)
R1 | Baseline Multilingual 14h LID + VLLP TL | 62.1 / 0.396             | 66.7 / 0.239
R2 | Baseline Multilingual 14h Sub + VLLP TL | 62.3 / 0.390             | 67.1 / 0.238
R3 | Proposed Multilingual 14h + VLLP TL     | 61.4 / 0.397             | 66.0 / 0.242
R4 | Creole 14h + VLLP TL                    | 61.6 / 0.389             | 66.3 / 0.231

Experiments
Table 3. Performance of different KWS systems on Evalpart1 with multilingual BNF extractors trained from scratch.

ID | Data set for training BNF extractor    | Web data LM (WER / ATWV) | Training transcription LM (WER / ATWV)
S1 | Baseline Multilingual 509h + VLLP TL   | 61.2 / 0.413             | 65.7 / 0.243
S2 | Proposed Multilingual 96h              | 60.9 / 0.407             | 65.6 / 0.239
S3 | Proposed Multilingual 96h + VLLP TL    | 60.7 / 0.416             | 65.6 / 0.244
S4 | Submodular Multilingual 96h            | 61.3 / 0.399             | 65.8 / 0.237
S5 | Submodular Multilingual 96h + VLLP TL  | 61.1 / 0.402             | 65.7 / 0.237
S6 | Creole 14h + VLLP TL                   | 65.1 / 0.372             | 69.5 / 0.221

Observations:
- Combining speech data of the target language with multilingual data when building the BNF extractor gives a significant improvement.
- Training a new BNF extractor with the proposed data selection yields good performance.
- The amount of selected data also affects the performance of the BNF extractor.

Experimental Analysis
Fig. 1. Similarity measure between different source languages and the target language (Swahili). The vertical axis denotes the average misclassification posterior probability over all utterances of each language. [Bar chart omitted; the vertical axis ranges from 0 to 0.12.]
- The top two languages are among the four languages in Proposed Multilingual 96h (Haitian Creole, Zulu, Dholuo, Vietnamese).
- Not all utterances in a language have equal similarity to the target language.

Conclusions
- Studied effective methods to train multilingual bottleneck feature extractors for the keyword search task for low-resource languages.
- Not all multilingual data contribute equally to KWS performance; utterances that are acoustically similar to the target-language data set are more useful.
- LSTM RNN based language identification is effective and efficient for multilingual data selection.
- Combining speech data of the target language with multilingual data when building the BNF extractor improves KWS for the target language.

Thank you