M4 in Brno: speech
Jan Černocký
http://www.fit.vutbr.cz/research/groups/speech
cernocky@fit.vutbr.cz
M4 meeting, Sheffield, January 28-29, 2003
VUT Brno main goals in M4-speech:
- robust feature extraction
- reliable phoneme recognition
This presentation:
- phoneme recognition: HMMs, TRAPs
- image operators in TRAPs
- all-pole modeling of everything
- merging of weak recognizers
- plans
1. RELIABLE PHONEME DETECTION (Petr Schwarz & Pavel Matějka)
Experiments carried out on TIMIT so far:
- trained on MERGED-TIMIT (to avoid problems on boundaries; good for TRAPs)
- 202 files: HMM training, 260: band-classifier training, 49: band-classifier cross-validation, 119: test
- 42 phonemes
HMM recognizer: HTK-based, 3 states per phoneme; phonemes can follow each other without any restriction, no language model.
[Figure: TRAP architecture. The critical-band spectrum over time is cut into 101-point TRAP vectors; each feeds a band classifier, and a MERGER combines their outputs into class probabilities.]
Phoneme recognition using TRAPs:
- HMMs set the boundaries; then just one temporal trajectory, centered in the hypothesized phoneme center, is considered (better time sync; the classifier does not have to deal with all possible shifts of the TRAP...)
- 23 bands, 1-second trajectories around the centers
- band classifiers: MLP (Quicknet), 101-300-42
- merger: MLP (Quicknet), 23x42-300-42
- softmax non-linearity in the output layer: the maximum posterior determines the recognized phoneme
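The TRAP-cutting step above can be sketched as follows, assuming the log critical-band spectrogram is stored as a bands-by-frames array (function and variable names are illustrative, not from the original system):

```python
import numpy as np

def extract_traps(spectrogram, center, length=101):
    """Cut one temporal trajectory (TRAP) per critical band, centered on a
    hypothesized phoneme center; frames outside the signal stay zero."""
    n_bands, n_frames = spectrogram.shape
    half = length // 2
    traps = np.zeros((n_bands, length))
    for offset in range(length):
        t = center - half + offset
        if 0 <= t < n_frames:
            traps[:, offset] = spectrogram[:, t]
    return traps  # shape (n_bands, length): one input vector per band classifier

# toy example: 23 bands, 500 frames, phoneme center hypothesized at frame 250
spec = np.random.randn(23, 500)
traps = extract_traps(spec, center=250)
```

Each row of `traps` would be fed to the corresponding band classifier; the merger then combines the 23 per-band posterior vectors.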
Phoneme recognition accuracies on the HMM test set [%]:

  hmm = traps = orig               47.52
  hmm = orig or traps = orig       70.67
  hmm = traps                      59.39
  hmm = traps and hmm != orig      11.87
  hmm = orig                       58.88
  traps = orig                     59.31
  hmm != orig and traps != orig    29.32

... and some results per phoneme [%]:

  phoneme   better by   HMM    TRAPs
  p         HMM         78.4   47.4
  pau       TRAPs       94.8   96.3
  r         HMM         67.5   62.4
  s         TRAPs       81.5   86.3
  sh        HMM         84.7   75.4
  t         TRAPs       61.4   65.6
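Agreement figures of this kind can be recomputed from per-segment label sequences; a minimal sketch (the toy labels below are hypothetical, not the TIMIT outputs):

```python
import numpy as np

def agreement_stats(hmm, traps, orig):
    """Fractions of segments on which the HMM output, the TRAP output
    and the reference labels agree or disagree."""
    hmm, traps, orig = map(np.asarray, (hmm, traps, orig))
    return {
        'hmm = traps = orig':          np.mean((hmm == traps) & (hmm == orig)),
        'hmm = orig':                  np.mean(hmm == orig),
        'traps = orig':                np.mean(traps == orig),
        'hmm = traps':                 np.mean(hmm == traps),
        'hmm = traps, hmm != orig':    np.mean((hmm == traps) & (hmm != orig)),
        'hmm != orig, traps != orig':  np.mean((hmm != orig) & (traps != orig)),
    }

# toy example with four segments
stats = agreement_stats(hmm=['p', 't', 's', 'r'],
                        traps=['p', 'k', 's', 't'],
                        orig=['p', 't', 's', 's'])
```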
And a chart...
[Chart: test-set breakdown. "Really bad": 11.87% (TRAPs is bad and HMM is bad, agreeing on a wrong label); both wrong: 29.31%; TRAPs/HMM accuracy 59.31%/58.04%; "our merging space": 23.39% (and 11.37%); "sure good": 47.52%.]
Lessons learned:
- TRAPs in this setup are not good enough for reclassification and cannot replace MFCCs
- some phonemes can be classified much better by TRAPs, some by HMMs
What next?
- replace hard boundaries with a lattice: redefine probabilities on lattice arcs and rescore
- a merger should be used for combining TRAP and HMM results
- develop phoneme-specific measures for re-scoring...?
- adapt to the meeting data: phoneme labels / forced alignments of the ICSI data?
2. IMAGE PROCESSING OPERATORS IN TRAPS (Franta Grézl)
- trying to incorporate processing known from image processing, such as edge detection, into feature extraction for ASR
- a spectrogram is a time-frequency image; the edge detector then looks for increases or decreases of energy
- it is possible to look in different directions, so we can obtain information about energy behavior from different sources
- the edge detectors are orthogonal, so the sources can be seen as independent: a possibility for system combination
Coefficients of the Sobel filters (G-operators):

  G1 = | -1  0  1 |        G2 = |  1  2  1 |
       | -2  0  2 |             |  0  0  0 |
       | -1  0  1 |             | -1 -2 -1 |
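Applying these operators to a spectrogram is plain 2-D filtering; a sketch in NumPy (function names and the toy ramp spectrogram are illustrative):

```python
import numpy as np

G1 = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])   # edges along time
G2 = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])   # edges along frequency

def filter2d(img, kernel):
    """'Valid' 2-D filtering: slide the 3x3 operator over the spectrogram
    and sum the element-wise products at each position."""
    h, w = kernel.shape
    H, W = img.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + h, j:j + w] * kernel)
    return out

# a spectrogram whose energy rises linearly along time: G1 responds with a
# constant positive value everywhere, G2 (frequency direction) with zero
spec = np.tile(np.arange(10.0), (8, 1))   # 8 bands x 10 frames
g1_map = filter2d(spec, G1)
g2_map = filter2d(spec, G2)
```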
Processing of a spectrogram
[Figure: the original FB (filter-bank) spectrum and the spectra mapped by the G1 and G2 operators.]
Basic TRAPs
[Figure: the critical-band spectrum is cut into 101-point temporal vectors (TRAPs), one per frequency band; each feeds a band classifier, and the merger outputs phoneme probabilities.]
Each operator having its own band classifiers
[Figure: temporal vectors from the G1- and G2-mapped spectra feed separate band classifiers; the merger combines all their outputs into phoneme probabilities.]
One band classifier processes data from both operators
[Figure: temporal vectors from the G1- and G2-mapped spectra are fed jointly to each band classifier before the merger.]
Experiment: digits task, results
- subset of OGI NUMBERS, just digits: 4716 sentences, 2547 for training and 2169 for testing of the HMM recognizer (CI phoneme models)
- band probability estimators trained on the OGI STORIES database: 29 classes
- merger trained on part of the target data (OGI NUMBERS)
- 15 Bark filter bands, 99-frame-long TRAPs

    System       band acc [%]           merger acc [%]         recognition acc [%]
    basic TRAP   TR: 42.89  CV: 37.88   TR: 84.35  CV: 81.25   93.21
  1 G1           TR: 42.46  CV: 39.74   TR: 82.64  CV: 80.10   93.08
  2 G2           TR: 34.93  CV: 31.46   TR: 85.63  CV: 77.95   92.50
  3 merged1      same as G-TRAPs 1, 2   TR: 89.20  CV: 82.79   95.34
  4 merged2      TR: 49.68  CV: 45.52   TR: 86.63  CV: 82.67   94.84
Current work and future
- other operators (combining time and frequency), e.g. the diagonal operators:

    |  0  1  2 |        | -2 -1  0 |
    | -1  0  1 |        | -1  0  1 |
    | -2 -1  0 |        |  0  1  2 |

- different ways to merge (concatenation, averaging, PCA de-correlation, ...)
- designing better operators for specific phoneme classes (to verify whether a given phoneme is really there)...?
- meeting data
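One of the merging options listed above, concatenation followed by PCA de-correlation, can be sketched as follows (all names and the toy data are illustrative, not the original setup):

```python
import numpy as np

def pca_decorrelate(features, keep=None):
    """Project mean-centered features onto the eigenvectors of their
    covariance matrix, optionally keeping only the leading components."""
    X = features - features.mean(axis=0)
    cov = X.T @ X / (len(X) - 1)
    eigval, eigvec = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]       # strongest components first
    return X @ eigvec[:, order[:keep]]

# merge two operator streams by concatenation, then de-correlate
rng = np.random.default_rng(0)
a = rng.standard_normal((200, 15))   # e.g. G1-mapped band features
b = rng.standard_normal((200, 15))   # e.g. G2-mapped band features
merged = pca_decorrelate(np.hstack([a, b]), keep=20)
```

After the projection the output dimensions are mutually uncorrelated, which suits classifiers (or diagonal-covariance models) that assume independent inputs.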
3. ALL-POLE MODELING (Petr Motlíček)
- all-pole modeling is the basis of some popular feature extractions (LPCC, PLP); what else can we model with all-pole models, and will it help?
- modeling of spectral sub-bands (multiband on the feature level): reasonable though not extraordinary results (tested on Aurora)
- modeling of temporal trajectories (TRAPs): not good so far. Why? The all-pole model does not take phase into account, which is fine for the amplitude or power spectrum (the phase is gone anyway) but disastrous for temporal trajectories (the information about the position of the temporal pattern disappears).
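The phase-blindness argument can be checked directly: the autocorrelation-based all-pole (LPC) model of a trajectory and of its time-reversed copy are identical, so the temporal position of a pattern cannot survive. A sketch using the standard textbook Levinson-Durbin recursion (not the original code):

```python
import numpy as np

def lpc(x, order):
    """All-pole model coefficients via autocorrelation + Levinson-Durbin."""
    n = len(x)
    # biased autocorrelation r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[1:i][::-1]   # update previous coefficients
        a[i] = k
        err *= 1.0 - k * k                   # prediction error shrinks
    return a

# a trajectory and its time-reversed copy have the same autocorrelation,
# hence exactly the same all-pole model: the "position" information is gone
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
a_fwd = lpc(x, 10)
a_rev = lpc(x[::-1], 10)
```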
4. MERGING OF WEAK RECOGNIZERS (Lukáš Burget)
Assumptions:
- a sophisticated recognizer can easily be overfit to the training data
- train smaller and possibly weaker recognizers that will each be poorer, but better in combination
- investigate methods to merge their results: hard output-level merging (ROVER), state-level merging
Tested so far on the TI-DIGITS portion of the AURORA DB. Best results: Baum-Welch algorithm using all recognizers, evaluating state occupation likelihoods L_j(t) and output probabilities b_j(t) (to tell whether it is really probable that we are in a given state, or whether the HMM just couldn't do anything else...), then weighting L_j(t) by b_j(t) and running Viterbi on the result.
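The last step, weighting L_j(t) by b_j(t) and running Viterbi over the result, can be sketched with a generic log-domain Viterbi pass (the toy scores and uniform transitions are illustrative assumptions):

```python
import numpy as np

def viterbi(log_scores, log_trans):
    """Best state sequence given per-frame state scores, here intended as
    log(L_j(t) * b_j(t)), and log transition scores log_trans[i, j]."""
    T, N = log_scores.shape
    delta = np.zeros((T, N))          # best partial-path score per state
    psi = np.zeros((T, N), dtype=int) # backpointers
    delta[0] = log_scores[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_trans   # (from-state, to-state)
        psi[t] = np.argmax(cand, axis=0)
        delta[t] = cand[psi[t], np.arange(N)] + log_scores[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# toy example: occupation likelihoods L and output probabilities b for
# 3 frames and 2 states, uniform transitions
L = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
b = np.array([[0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
path = viterbi(np.log(L * b), np.log(np.full((2, 2), 0.5)))
```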
Results
- clean data: 1 Gaussian component: global 96%, weak merged 98%; 2 Gaussian components: 99%
- noisy data: 1 Gaussian component: global 70%, weak merged 82%; 2 Gaussian components: 83% (?)
[Chart: WER vs. number of Gaussian components, with markers for "Lukáš's PhD", "Lukáš's Nobel prize" and "Dream..."]
Plans
- moving quickly to the meeting data
- finding reliable features for phoneme detection (they do not need to be the same for all phonemes)
- determining where phoneme recognition can help the others: LVCSR (proper names, systematic mispronunciation of certain words by certain speakers), speaker characterization (lengths of vowels, etc.)
- using video features from the video group: should greatly help e.g. in stop detection, but needs good sync!
- participation in the recognition efforts (Martin Karafiát at USFD)