M4 in Brno speech
Jan Černocký
http://www.fit.vutbr.cz/research/groups/speech
cernocky@fit.vutbr.cz
M4 meeting, Sheffield, January 28-29, 2003

VUT Brno main goals in M4-speech:
- robust feature extraction
- reliable phoneme recognition

This presentation:
- Phoneme recognition: HMM, TRAPs
- Image operators in TRAPs
- All-pole modeling of everything
- Merging of weak recognizers
- Plans

1. RELIABLE PHONEME DETECTION
Petr Schwarz & Pavel Matějka

Experiments carried out on TIMIT so far:
- trained on merged TIMIT (to avoid problems on boundaries; good for TRAPs).
- 202 files: HMM training, 260: band classifier training, 49: band classifier cross-validation, 119: test.
- 42 phonemes.
- HMM recognizer: HTK-based, 3 states per phoneme; phonemes can follow each other without any restriction, no language model.

[Diagram: TRAP feature extraction — spectrum → critical bands → 101-point TRAP vectors per band → band classifiers → merger → class probabilities.]

Phoneme recognition using TRAPs
- HMMs set the boundaries; only one temporal trajectory, centered in the hypothesized phoneme center, is considered (better time sync; no need to deal with all possible shifts of the TRAP...).
- 23 bands, 1-second trajectories around the centers.
- band classifiers: MLP (Quicknet), 101-300-42.
- merger: MLP (Quicknet), 23×42-300-42.
- softmax non-linearity in the output layer: the maximum posterior determines the recognized phoneme.
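A minimal numpy sketch of the data flow described above, assuming a log critical-band spectrogram of shape (time × 23). The weights here are random placeholders (the real band classifiers and merger were MLPs trained with Quicknet), so this only illustrates the shapes and the pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
T, BANDS, CTX, HID, PHN = 400, 23, 101, 300, 42   # 101-point TRAPs, 42 phonemes

def mlp(x, w1, b1, w2, b2):
    """One sigmoid hidden layer, softmax output (as on the slide)."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    z = h @ w2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()

cb = rng.standard_normal((T, BANDS))              # stand-in spectrogram

# Band classifiers, 101-300-42, one per critical band (random weights here).
band_nets = [(0.1 * rng.standard_normal((CTX, HID)), np.zeros(HID),
              0.1 * rng.standard_normal((HID, PHN)), np.zeros(PHN))
             for _ in range(BANDS)]
# Merger, (23*42)-300-42.
merger = (0.1 * rng.standard_normal((BANDS * PHN, HID)), np.zeros(HID),
          0.1 * rng.standard_normal((HID, PHN)), np.zeros(PHN))

def classify_center(t):
    """Phoneme posteriors for the hypothesized center frame t."""
    posts = [mlp(cb[t - CTX // 2 : t + CTX // 2 + 1, b], *band_nets[b])
             for b in range(BANDS)]                # one TRAP per band
    return mlp(np.concatenate(posts), *merger)     # merge 23 x 42 posteriors

phoneme = int(np.argmax(classify_center(T // 2)))  # max posterior wins
```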

Phoneme rec. accuracies, HMM test set (% of frames):

  hmm = traps = orig               47.52
  hmm = orig || traps = orig       70.67
  hmm = traps                      59.39
  hmm = traps && hmm != orig       11.87
  hmm = orig                       58.88
  traps = orig                     59.31
  hmm != orig && traps != orig     29.32

... and some results per phoneme (accuracy in %):

  phoneme   HMM    TRAPs   better by
  p         78.4   47.4    hmm
  pau       94.8   96.3    traps
  r         67.5   62.4    hmm
  s         81.5   86.3    traps
  sh        84.7   75.4    hmm
  t         61.4   65.6    traps

And a chart...

[Chart: test-set breakdown — "sure good" (both correct) 47.52%; "our merging space" (exactly one system correct) 23.39%, of which 11.37% from one of the two systems; both TRAPs and HMM wrong 29.31%, including "really bad" (both agree on the same wrong label) 11.87%; TRAPs/HMM accuracy 59.31%/58.04%.]
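For clarity, the agreement categories in the table and chart can be computed from frame-aligned label sequences; a toy sketch (the arrays below are hypothetical stand-ins, not the TIMIT outputs):

```python
import numpy as np

hmm  = np.array(["p", "s", "t",  "r"])   # HMM labels per frame
trap = np.array(["p", "s", "sh", "t"])   # TRAP labels per frame
orig = np.array(["p", "s", "t",  "t"])   # reference labels

sure_good     = np.mean((hmm == orig) & (trap == orig))   # both correct
really_bad    = np.mean((hmm == trap) & (hmm != orig))    # agree on a wrong label
both_wrong    = np.mean((hmm != orig) & (trap != orig))
merging_space = np.mean((hmm == orig) ^ (trap == orig))   # exactly one correct
```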

Lessons learned:
- TRAPs in this setup are not good enough for reclassification and cannot replace MFCC.
- some phones can be classified much better by TRAPs, some by HMMs.

What next?
- replace hard boundaries with a lattice: redefine probabilities on lattice arcs and rescore.
- a merger should be used for combining TRAP and HMM results.
- develop phoneme-specific measures for re-scoring...?
- adapt to the meeting data: phoneme labels / forced alignments of the ICSI data?

2. IMAGE PROCESSING OPERATORS IN TRAPS
Franta Grézl
- trying to incorporate processing known from image processing, such as edge detection, into feature extraction for ASR.
- a spectrogram is a time-frequency image; the edge detector then looks for increases or decreases of energy.
- it is possible to look in different directions, so we can obtain information about energy behavior from different sources.
- the edge detectors are orthogonal, so the sources can be seen as independent: a possibility for system combination.

Coefficients of the Sobel filters (G-operators):

  G1 =  -1  0  1        G2 =   1  2  1
        -2  0  2               0  0  0
        -1  0  1              -1 -2 -1
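A short sketch, assuming a (time × band) log filter-bank matrix, of applying G1 and G2 to a spectrogram with scipy; the input here is random noise, just to show the mechanics:

```python
import numpy as np
from scipy.signal import convolve2d

G1 = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])
G2 = np.array([[ 1,  2,  1],
               [ 0,  0,  0],
               [-1, -2, -1]])

spec = np.random.default_rng(0).standard_normal((250, 15))    # time x bands
mapped1 = convolve2d(spec, G1, mode="same", boundary="symm")  # changes across bands
mapped2 = convolve2d(spec, G2, mode="same", boundary="symm")  # changes across time
```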

[Figure: processing of a spectrogram — the original filter-bank spectrum and the spectrums mapped by the G1 and G2 operators.]

[Diagram: basic TRAPs — critical-band spectrum (frequency × time); per-band 101-point temporal vectors (TRAPs) feed band classifiers whose phoneme probabilities are combined by the merger.]

[Diagram: each operator having its own band classifiers — temporal vectors from the G1- and G2-mapped spectrums go to separate band classifiers, whose probabilities are combined by a single merger.]

[Diagram: one band classifier processes data from both operators — the temporal vectors from the G1- and G2-mapped spectrums are fed jointly to each band classifier, followed by the merger.]

Experiment: digits task, results
- subset of OGI NUMBERS, just digits; 4716 sentences, 2547 for training and 2169 for testing of the HMM recognizer (CI phoneme models).
- band probability estimators trained on the OGI STORIES database: 29 classes. Merger trained on part of the target data (OGI NUMBERS).
- 15 Bark filter bands, 99-frame TRAPs.

  #  System       band acc [%]           merger acc [%]         recognition acc [%]
  1  basic TRAP   TR 42.89 / CV 37.88    TR 84.35 / CV 81.25    93.21
  2  G1           TR 42.46 / CV 39.74    TR 82.64 / CV 80.10    93.08
  3  G2           TR 34.93 / CV 31.46    TR 85.63 / CV 77.95    92.50
  4  merged1      same as G1/G2 above    TR 89.20 / CV 82.79    95.34
  5  merged2      TR 49.68 / CV 45.52    TR 86.63 / CV 82.67    94.84

Current work and future
- other operators (combinations of time and frequency):

     0  1  2       -2 -1  0
    -1  0  1       -1  0  1
    -2 -1  0        0  1  2

- different ways to merge (concatenation, averaging, PCA de-correlation, ...); a PCA sketch follows below.
- designing better operators for specific phoneme classes (to verify whether a given phoneme is really there)...?
- meeting data.
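A hedged sketch of one merging option from the list above: concatenating the TRAP vectors from the two mapped spectrums and de-correlating them with PCA. Shapes and names are illustrative, not the actual M4 setup:

```python
import numpy as np

rng = np.random.default_rng(0)
trap_g1 = rng.standard_normal((5000, 99))   # 99-frame TRAPs from the G1 spectrum
trap_g2 = rng.standard_normal((5000, 99))   # ... and from the G2 spectrum

x = np.hstack([trap_g1, trap_g2])           # concatenation
x -= x.mean(axis=0)                         # center before PCA
_, _, vt = np.linalg.svd(x, full_matrices=False)
x_decorrelated = x @ vt.T                   # projections onto principal axes
```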

3. ALL-POLE MODELING
Petr Motlíček
- all-pole modeling is the basis of some popular feature extractions (LPCC, PLP). What else can we model with all-pole models, and will it help?
- modeling of spectral sub-bands (multiband on the feature level): reasonable though not extraordinary results (tested on Aurora).
- modeling of temporal trajectories (TRAPs): not good so far. Why? The all-pole model does not take phase into account, which is fine for an amplitude or power spectrum (the phase is gone anyway) but disastrous for temporal trajectories (the information about the position of the temporal pattern disappears).
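To make the all-pole point concrete, a self-contained sketch of fitting an LPC model to a 1-D trajectory with the autocorrelation method and the Levinson-Durbin recursion. Because the fit uses only the autocorrelation, any phase (position) information in the trajectory is discarded, which is exactly the problem noted above:

```python
import numpy as np

def lpc(x, order):
    """All-pole fit, x[n] ~ sum_k a[k] * x[n-1-k] (autocorrelation method)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a, err = np.zeros(order), r[0]
    for i in range(order):
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / err   # reflection coefficient
        a[:i] = a[:i] - k * a[:i][::-1]            # Levinson-Durbin update
        a[i] = k
        err *= 1.0 - k * k                         # residual energy
    return a, err

# Fit a 10-pole model to a 101-point temporal trajectory of one band.
n = np.arange(101)
traj = np.cos(0.2 * n) + 0.1 * np.random.default_rng(1).standard_normal(101)
a, err = lpc(traj, order=10)
# A shifted trajectory has (nearly) the same autocorrelation, so the model
# cannot say where in the 1-second window the temporal pattern occurred.
```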

4. MERGING OF WEAK RECOGNIZERS
Lukáš Burget
Assumptions:
- a sophisticated recognizer can easily be overfit to the training data.
- train smaller and possibly weaker recognizers, each poorer on its own but better in combination.
- investigate methods to merge their results: hard output-level merging (ROVER), state-level merging.
- tested so far on the TI-DIGITS portion of the AURORA DB.

Best results: the Baum-Welch algorithm using all recognizers — evaluating state occupation likelihoods L_j(t) and output probabilities b_j(t) (to tell whether it is really probable that we are in a given state, or whether the HMM just couldn't do anything else...), weighting L_j(t) by b_j(t), and running Viterbi on the result.
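A toy sketch of that combination step: state occupancies from a forward-backward pass over each weak recognizer are averaged, weighted by the output probabilities b_j(t), and Viterbi is run on the result. Discrete-emission HMMs stand in here for the real GMM-based recognizers:

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """State occupancies gamma (T x N) for one discrete-emission HMM."""
    T, N = len(obs), len(pi)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def viterbi(scores, A):
    """Best state path through per-frame scores with transition matrix A."""
    T, N = scores.shape
    logd, back = np.log(scores[0]), np.zeros((T, N), int)
    for t in range(1, T):
        cand = logd[:, None] + np.log(A)           # cand[i, j]: from i to j
        back[t] = cand.argmax(axis=0)
        logd = cand.max(axis=0) + np.log(scores[t])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.2, 0.8]])             # shared toy topology
pi = np.array([0.6, 0.4])
obs = rng.integers(0, 3, size=50)                  # toy observation symbols

# Two "weak" recognizers = two different emission tables over 3 symbols.
Bs = [np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]),
      np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])]

gamma = np.mean([forward_backward(A, B, pi, obs) for B in Bs], axis=0)
b = np.mean([B[:, obs].T for B in Bs], axis=0)     # b_j(o_t), T x N
best_path = viterbi(gamma * b + 1e-12, A)          # weight L_j(t) by b_j(t)
```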

Results
- Clean data: 1 Gaussian component: global 96%, weak merged 98%; 2 Gaussian components: 99%.
- Noisy data: 1 Gaussian component: global 70%, weak merged 82%; 2 Gaussian components: 83% (?).

[Chart: WER vs. number of Gaussian components, with milestones labeled "Lukas's PhD", "Lukas's Nobel prize", and "Dream...".]

Plans
- moving quickly to the meeting data.
- finding reliable features for phoneme detection (they do not need to be the same for all phonemes).
- determining where phoneme recognition can help the others: LVCSR (proper names, systematic mispronunciation of certain words by certain speakers), speaker characterization (lengths of vowels, etc.).
- using video features from the video group: should greatly help e.g. in stop detection, but needs good sync!
- participation in the recognition efforts (Martin Karafiát at USFD).