Project #2: Survey of Weighted Finite State Transducers (WFST)

T-61.184: Speech Recognition and Language Modeling: From Theory to Practice
Project Groups / Descriptions
Fall 2004, Helsinki University of Technology

Project #1: Music Recognition
Jukka Parviainen (parvi@james.hut.fi), Ville Turunen (vt@james.hut.fi), Jaakko Väyrynen (jjvayryn@james.hut.fi)

The target of the project is to create software that recognizes songs. The training set is a collection of MP3 songs by several artists. Feature extraction is done first, and then a GMM is trained for each song. We will also build models for each artist and genre. The models are time-independent, which allows recognition from any part of the song. The system will be implemented in C.

Project #2: Survey of Weighted Finite State Transducers (WFST)
Teemu Hirsimäki (thirsima@james.hut.fi)

In recent years, Weighted Finite State Transducers (WFST) have become an attractive framework for large vocabulary speech recognition. The increase in computational power and the development of efficient algorithms for composing and minimizing transducers have made it possible to build all information about phoneme models, context dependency, the pronunciation lexicon, and the language model into transducers. With the WFST algorithms, these transducers can be composed, minimized, and pruned efficiently off-line, before recognition. The result is a compact representation of the whole search network, which can be decoded with a simple Viterbi decoder.

The aim of this literature survey is to review papers on the development of the WFST algorithms that made it possible to build real transducer-based recognition systems. In addition, the survey tries to answer the following two questions. (1) In which areas of speech and language processing have transducers been used besides speech recognition? (2) What are the current shortcomings of the WFST-based recognition framework, and what must be done with other methods?
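
The composition step mentioned in the Project #2 description can be illustrated with a minimal Python sketch. It assumes epsilon-free transducers and costs in the tropical semiring (costs add along a path, the cheapest path wins). The data layout, the function names `compose` and `best_path`, and the two toy machines are hypothetical and are not taken from any WFST toolkit; real systems also need epsilon handling, determinization, and minimization, which are omitted here.

```python
import heapq
from collections import defaultdict

# A transducer is (start_state, final_states, arcs), where arcs maps
# state -> list of (input_label, output_label, cost, next_state).

def compose(a, b):
    """Compose two epsilon-free transducers: an arc pair is kept when the
    output label of `a` equals the input label of `b`; costs are added."""
    a_start, a_finals, a_arcs = a
    b_start, b_finals, b_arcs = b
    start = (a_start, b_start)
    arcs = defaultdict(list)
    finals = set()
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        qa, qb = state
        if qa in a_finals and qb in b_finals:
            finals.add(state)
        for a_in, a_out, wa, na in a_arcs.get(qa, []):
            for b_in, b_out, wb, nb in b_arcs.get(qb, []):
                if a_out == b_in:
                    nxt = (na, nb)
                    arcs[state].append((a_in, b_out, wa + wb, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return start, finals, dict(arcs)

def best_path(fst):
    """Cheapest path to a final state (Dijkstra, assumes non-negative costs).
    Returns (cost, input labels, output labels), or None if no path exists."""
    start, finals, arcs = fst
    counter = 0  # tie-breaker so the heap never compares states
    heap = [(0.0, counter, start, (), ())]
    done = set()
    while heap:
        cost, _, state, ins, outs = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state in finals:
            return cost, list(ins), list(outs)
        for i, o, w, nxt in arcs.get(state, []):
            if nxt not in done:
                counter += 1
                heapq.heappush(heap, (cost + w, counter, nxt, ins + (i,), outs + (o,)))
    return None

# Two toy machines with made-up labels and costs.
first = (0, {2}, {0: [("a", "x", 1.0, 1), ("a", "y", 0.5, 1)],
                  1: [("b", "z", 0.25, 2)]})
second = (0, {1}, {0: [("x", "X", 0.25, 0), ("y", "Y", 1.0, 0),
                       ("z", "Z", 0.0, 1)]})

result = best_path(compose(first, second))
```

In this toy case the locally cheaper arc `a:y` in the first machine loses after composition, because the second machine makes `y` expensive. This is the kind of interaction between knowledge sources that composing the network off-line resolves before decoding.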

Project #3: Survey of Segment-Based Speech Recognition
Petri Korhonen (petri@acoustics.hut.fi)

Most modern speech recognition systems take as input a set of features computed at a fixed rate from short time windows of the speech signal. An alternative framework is to perform acoustic analysis prior to the recognition phase in order to obtain an explicit segmentation of the speech signal. The segments obtained this way are of variable length, and different features can be used for different segments. The segmental framework allows a richer set of acoustic-phonetic features than can be incorporated into conventional frame-based representations. Systems based on this framework include, for example, SUMMIT. In this project I will try to get an overview of this framework. I will also try to find out what building blocks and algorithms are used in systems based on it.

Project #4: Speech Recognition System for XForms
Mikko Honkala (honkkis@tml.hut.fi), Mikko Pohja (mpohja@tml.hut.fi)

XForms is a new abstract user interface definition language. In our opinion it could be used as the basis for multimodal services on the Web. In the project work, we will research and implement a speech recognition system for XForms. This system will allow filling in forms and operating user interfaces written in XForms using speech. We will use the Java-based Sphinx-4 speech recognition library (http://cmusphinx.sourceforge.net/sphinx4/). We will focus on how to implement the different form controls with the grammar-based recognition system. We will also implement navigation within a form. This navigation differs from GUI-based navigation, since in speech the form has to be serialized.

The implementation will be done in the XML browser X-Smiles (www.xsmiles.org). X-Smiles already has an XForms processor, and it is an open source project by the TML laboratory, so it is a nearly ideal environment for testing multimodality and speech input. We will implement at least navigation, selection lists, and some data type-bound input fields in speech. Free-form input fields are probably not feasible with a grammar-based speech recognition system such as Sphinx-4.

Project #5: Music recognition server based on GMMs
Pedro Díaz Jiménez (pdiaz@cc.hut.fi)

The aim of the project is to develop a music recognition server.
This server would function like existing CDDB servers, but would work with music files instead of audio CDs. Users submit to the server some kind of parameters describing the song (maybe observation vectors based on MFCCs?) and receive the title, author, etc. of the song. The server would be useful for setting the metadata (ID3 in MP3 files, for example) of the user's music collection. If time permits, I would also like to try adding author and genre identification features.

Project #6: An implementation of a token pass decoder
Janne Pylkkönen (jpylkkon@james.hut.fi)

The concept of token passing for speech recognition was introduced over 15 years ago. The most popular decoding technique for large vocabulary continuous speech recognition (LVCSR) nowadays is the one-pass time-synchronous beam search strategy, which is still based on that same principle. The key advantage of token passing is its conceptual simplicity, which makes it easy to extend the strategy to handle many advanced problems in speech recognition, such as cross-word contexts and early language model pruning.

This work involves implementing a token pass decoder for the existing CIS-HUT LVCSR framework. The framework already contains a stack decoder, from which some of the code can be reused. However, the new decoder should take care of several previously unaddressed problems, such as the use of tied HMM states in the lexical prefix tree and the possibility of using cross-word triphone contexts. When finished, the efficiency of the token pass decoder will be compared to that of the existing stack decoder.

As the implementation of a decoder may prove to be quite extensive, a priority list of goals is defined. At the core is the construction of the lexical prefix tree and the implementation of the actual token pass algorithm for beam search. Next, the problem of integrating early language model scores based on bigrams will be examined. If time allows, cross-word triphone contexts will also be implemented. The overall design goal is to make the system easily extendable for future needs.

Project #7: Speech Recognition Based on Artificial Neural Networks
Veera Ala-Keturi (valaketu@cc.hut.fi)

Artificial neural networks (ANNs) are systems consisting of interconnected computational nodes that imitate human neurons. Neural networks can be used, for example, to approximate functions or to classify data into similar clusters (in a supervised or unsupervised manner). I will first look at some basic theory of neural networks: perceptrons, multi-layer networks, feed-forward vs. recurrent networks, update criteria and algorithms, etc. I will then study hybrid (connectionist) models where HMMs and NNs are used together in speech recognition.
In a hybrid HMM/NN system the neural network estimates the posterior probabilities, which can greatly enhance the discrimination ability of the system. Finally, I will look at the rather new field of research that deals with using neural networks to extract features from the data. The purpose of this survey is to obtain an understanding of the state of the art in the use of neural networks for speech recognition, and to find the pros and cons of each technique.

Project #8: Noise-Robust Speech Recognition
Bernhard Leiner (bernhard@footbag.at)

In my survey project I will try to give an overview of the different techniques and algorithms used to recognize speech in a noisy environment. The focus is on noise compensation during the preprocessing stage (feature mapping) and on model adaptation to noise during recognition.
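
For the hybrid HMM/NN system described under Project #7, a common way to use the network's posterior estimates inside an HMM decoder is to divide each state posterior by the state prior, which by Bayes' rule gives a likelihood up to a factor that is the same for all states. The sketch below shows that conversion in the log domain. The function name and all numbers are hypothetical and serve only as illustration.

```python
import math

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Turn per-state log posteriors log P(q|x) into scaled log likelihoods
    log P(q|x) - log P(q), equal to log p(x|q) minus the constant log p(x)."""
    return [lp - pr for lp, pr in zip(log_posteriors, log_priors)]

# Hypothetical values for one frame and three HMM states.
posteriors = [0.7, 0.2, 0.1]  # network outputs (assumed)
priors = [0.8, 0.1, 0.1]      # relative state frequencies in training (assumed)

scores = scaled_log_likelihoods(
    [math.log(p) for p in posteriors],
    [math.log(p) for p in priors],
)
best_state = max(range(len(scores)), key=scores.__getitem__)
```

With these made-up numbers the state with the highest posterior (index 0) is not the state with the highest scaled likelihood (index 1), because state 0 is also very frequent a priori.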

Project #9: Language Modeling in Automatic Speech Recognition
Antti Puurula (Antti.Puurula@helsinki.fi)

Language models form an essential component of modern speech recognition systems. Part of the remaining error in ASR could be attributed to inadequacies of the language models. The purpose of this survey project is to examine modern n-gram models as well as some of the alternative approaches that have been tried.

Project #10: Experiments with Spoken Passphrase and Speaker Identification
Juha Raitio (juha.raitio@iki.fi)

The goal of this project work is to study the possibilities for spoken passphrase and speaker identification. Of interest is a system that would identify an utterance of a passphrase as a previously presented one by the same speaker, or discard it as unknown. The objective is to conduct background research on the challenges in order to select a plausible approach, and to conduct experiments by applying it. Optionally, an on-line toy system will be implemented. The effort should be documented in a scientific manner.

Outline of experiments. K speakers utter M passphrases N times each, and the utterances are labeled with the speaker name and a passphrase id. T% of utterances of U% of passphrases by U% of the speakers are used for training a (speaker x passphrase x utterance) model. Utterances not used in training are used for testing. Each test passphrase must be either identified or discarded. Correct identifications and errors, together with the error types, are recorded and reported. An error occurs when an utterance is discarded though it should have been identified (type I), or when an utterance is identified incorrectly (type II). A type IIA error occurs if the speaker is identified incorrectly, and a type IIB error if the passphrase is identified incorrectly.

Research questions. Is it possible to control and balance the probabilities of type I and type II errors by model selection? Optionally: what other factors (e.g. passphrase length, number of passphrases M, number of speakers K, number of repetitions N of the passphrase by a speaker) affect the success rate?

Project #11: Classifier Combination for Speech Recognition
Matti Aksela (matti.aksela@hut.fi)

For the actual course project, I will attempt to write a survey of combination methods used in speech recognition. I will attempt to evaluate them from the more general viewpoint of classifier combining, and also consider the usability, within the domain of speech recognition, of some adaptive combination methods that I have used in my previous research.
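
The error bookkeeping in the Project #10 outline can be sketched as a small scoring routine. The version below is only one possible reading: the names `classify_outcome` and `tally` are hypothetical, and two cases that the outline leaves open (an utterance that should have been discarded but was accepted, and an utterance with both speaker and passphrase wrong) are handled by assumptions marked in the comments.

```python
from collections import Counter

def classify_outcome(truth, prediction):
    """Label one test trial.

    truth:      (speaker, passphrase), or None if the utterance is not enrolled
    prediction: (speaker, passphrase), or None if the system discarded it
    """
    if prediction is None:
        return "correct" if truth is None else "I"  # type I: wrongly discarded
    if truth is None:
        return "II"  # assumption: accepted although unknown; not subdivided in the outline
    if prediction == truth:
        return "correct"
    speaker_wrong = prediction[0] != truth[0]
    phrase_wrong = prediction[1] != truth[1]
    if speaker_wrong and phrase_wrong:
        return "IIA+IIB"  # assumption: the outline does not say how to count this case
    return "IIA" if speaker_wrong else "IIB"

def tally(trials):
    """Count outcome labels over (truth, prediction) pairs."""
    return Counter(classify_outcome(t, p) for t, p in trials)

# Hypothetical trials with made-up speaker and passphrase ids.
trials = [
    (("anna", "p1"), ("anna", "p1")),  # correct identification
    (("anna", "p1"), None),            # type I
    (("anna", "p2"), ("ben", "p2")),   # type IIA
    (("ben", "p1"), ("ben", "p3")),    # type IIB
    (None, None),                      # correctly discarded
]
counts = tally(trials)
```

Dividing each count by the number of trials gives the empirical error rates that the research question about balancing type I and type II errors refers to.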

Project #12: Automatic Language Identification from Telephone Speech
Zhirong Yang (rozyang@cc.hut.fi)

The project work involves the implementation and comparison of three approaches to automatic language identification of a speech utterance: (1) Gaussian mixture model (GMM) classification; (2) single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); (3) parallel PRLM, which uses multiple single-language phone recognizers, each trained on a different language. The performance obtained by merging the phone recognizers of multiple languages will also be investigated.
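
The language-modeling half of the PRLM approach can be sketched as follows: phone strings, assumed to come from a phone recognizer that is not shown here, are scored by one interpolated bigram model per language, and the language with the highest log likelihood is chosen. The class name, the interpolation weights, and the toy phone strings are all hypothetical; a real system would estimate the weights on held-out data.

```python
import math
from collections import Counter

class InterpolatedBigram:
    """Bigram phone model interpolated with unigram and uniform estimates."""

    def __init__(self, sequences, inventory, weights=(0.6, 0.3, 0.1)):
        self.w_bi, self.w_uni, self.w_zero = weights  # assumed, not tuned
        self.inventory_size = len(inventory)
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.contexts = Counter()
        for seq in sequences:
            self.unigrams.update(seq)
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        self.total = sum(self.unigrams.values())

    def prob(self, prev, cur):
        bi = self.bigrams[(prev, cur)] / self.contexts[prev] if self.contexts[prev] else 0.0
        uni = self.unigrams[cur] / self.total if self.total else 0.0
        # The uniform term keeps the probability above zero for unseen phones.
        return self.w_bi * bi + self.w_uni * uni + self.w_zero / self.inventory_size

    def log_likelihood(self, seq):
        return sum(math.log(self.prob(p, c)) for p, c in zip(seq, seq[1:]))

def identify_language(phones, models):
    """Return the language whose model gives the phone string the highest score."""
    return max(models, key=lambda lang: models[lang].log_likelihood(phones))

# Toy data: made-up phone strings for two hypothetical languages.
inventory = sorted(set("katsi"))
models = {
    "lang_A": InterpolatedBigram([list("katakata"), list("takataka")], inventory),
    "lang_B": InterpolatedBigram([list("sissis"), list("issisi")], inventory),
}
guess_a = identify_language(list("kata"), models)
guess_b = identify_language(list("sisi"), models)
```

Parallel PRLM would repeat this scoring once per phone recognizer and combine the resulting scores; that combination step is left out of the sketch.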