Spoken Language Recognition

Spoken Language Recognition. Based on "Spoken Language Recognition: From Fundamentals to Practice" by Haizhou Li, Bin Ma, and Kong Aik Lee [1]. Stanisław Kacprzak, DSP Seminar, Kraków, 27.03.2014

Problem definition. Given a spoken utterance O and a set of N target languages {L_1, ..., L_N}, we have to decide: Language recognition / language identification (LID): which of the N languages does O belong to? Language verification: does O belong to the hypothesized language L_i, or to one of the other languages?
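A minimal formalization of the two tasks (standard Bayes decision framing, reconstructed here rather than quoted from the slides):

$$ \hat{L} = \arg\max_{1 \le i \le N} \; p(O \mid L_i) \qquad \text{(identification)} $$

$$ \Lambda_i(O) = \log p(O \mid L_i) - \log p(O \mid \bar{L}_i) \;\gtrless\; \theta \qquad \text{(verification)} $$

where $\bar{L}_i$ denotes the competing (non-target) languages and $\theta$ is an application-dependent decision threshold.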

Why do we need language recognition? Multilingual spoken dialog systems (e.g., information terminals); database, archive-search, and retrieval systems; human-human communication systems (call routing, automatic translation, emergency calls) [2]

Why do we need language recognition? A brief history of telephone interpretation:
1973: Australia introduces telephone interpretation as a fee-free service to respond to its growing immigrant communities.
1981: The first Over-the-Phone Interpretation (OPI) service is offered in the United States.
1981-1990: Telephone interpretation enters major U.S. industries, including financial services, telecommunications, healthcare, and public safety.
1990s: The demand for telephone interpretation grows significantly; contributing factors include decreased prices for long-distance calls, toll-free number access, and immigration trends.
1995: Language services company Kevmark, later known as CyraCom, patents a multiple-handset phone adapted for telephone interpreting.
1999: AT&T sells language services company Language Line Services.
2000s: Telephone interpretation becomes more sophisticated; quality of interpretation, faster connection speeds, and customer service become important to consumers.
2005: The U.S. telephone interpreting market is estimated at approximately $200 million.
2013: Language Line Services acquires Pacific Interpreters.

A real-life example: Language Line Services employs approximately 5,000 interpreters and support staff globally, who answer 40 million calls each year. [3]

How do people do it? It was concluded that human beings, with adequate training, are the most accurate language recognizers; this observation still held when re-confirmed 15 years later, provided that the listeners speak the languages in question. For languages they are not familiar with, human listeners can often make subjective judgments with reference to the languages they know, e.g., "it sounds like German." These judgments are less precise, but they show how people apply linguistic knowledge at different levels to distinguish between certain broad language groups. Given only a little previous exposure, human listeners can effectively identify a language without much lexical knowledge; in this case, they rely on prominent phonetic, phonotactic, and prosodic cues to characterize the languages.

Perceptual cues used for language recognition. The use of phonetic and phonotactic cues is based on the assumption that languages possess partially overlapping sets of phonemes. (Though there are over 6,000 languages in the world, the total number of phones required to represent all their sounds ranges only from 200 to 300.)

Phonotactic cues. We can study the phonotactic differences between languages by examining how well a phone n-gram model of one language predicts the phone sequences of other languages, in terms of perplexity. A lower perplexity indicates that the phone n-gram model better matches the phone sequence; in other words, the sequence is more predictable.
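For reference, perplexity here is the standard quantity (my notation, not quoted from the slides): for a phone sequence $w_1, \dots, w_M$ scored by an n-gram model $p$,

$$ \mathrm{PP} = \exp\!\left( -\frac{1}{M} \sum_{t=1}^{M} \log p(w_t \mid w_{t-n+1}, \dots, w_{t-1}) \right). $$

A model trained on language A yields low perplexity on phone sequences from language B only to the extent that the two languages share phonotactic constraints.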

General scheme of acoustic-phonetic and phonotactic approaches. Phonotactic approach example: PRLM (Phone Recognition followed by Language Modeling). Acoustic-phonetic approach example: SDC (Shifted Delta Cepstral) features.
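A minimal PRLM scoring sketch in Python (my own illustration with hypothetical helper names; a real system uses a trained phone recognizer and properly smoothed n-gram models):

from collections import Counter
import math

def train_bigram(phone_sequences, alpha=1.0):
    # Add-alpha smoothed phone bigram model estimated from training transcripts.
    bigrams, histories, vocab = Counter(), Counter(), set()
    for seq in phone_sequences:
        seq = ["<s>"] + list(seq)
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            histories[a] += 1
    V = len(vocab)

    def logprob(a, b):
        return math.log((bigrams[(a, b)] + alpha) / (histories[a] + alpha * V))

    return logprob

def prlm_score(phones, logprob):
    # Average log-likelihood of a decoded phone sequence under one language's model.
    seq = ["<s>"] + list(phones)
    return sum(logprob(a, b) for a, b in zip(seq, seq[1:])) / max(len(phones), 1)

def identify(phones, models):
    # LID decision: the language whose phonotactic model scores highest
    # (models: dict mapping language name -> logprob function).
    return max(models, key=lambda lang: prlm_score(phones, models[lang]))

Here the phone string `phones` would come from a single phone recognizer (not shown); parallel PRLM, on a later slide, runs several such recognizers and fuses the per-language scores.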

Shifted Delta Cepstral coefficients (SDC)
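A compact numpy sketch of SDC extraction under the usual N-d-P-k parameterization (my own illustration; the widely used configuration is 7-1-3-7, i.e., N=7 cepstral coefficients, delta spread d=1, shift P=3, and k=7 stacked blocks):

import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    # cepstra: (n_frames, N) array of cepstral coefficients.
    # For each frame t, stack the delta vectors
    #   delta(t + i*P) = c(t + i*P + d) - c(t + i*P - d),  i = 0 .. k-1,
    # yielding an N*k-dimensional feature per frame.
    n_frames, N = cepstra.shape
    pad = d + (k - 1) * P                      # replicate edges so every index is valid
    c = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        s = i * P
        delta = c[pad + s + d : pad + s + d + n_frames] \
              - c[pad + s - d : pad + s - d + n_frames]
        blocks.append(delta)
    return np.concatenate(blocks, axis=1)      # shape: (n_frames, N * k)

The stacking gives an acoustic front-end access to much longer temporal context than simple deltas, which is part of why SDC-GMM systems became a standard acoustic-phonetic baseline for LID.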

Parallel PRLM (PPRLM): several phone recognizers trained on different languages run in parallel; each decoded phone stream is scored against every language's n-gram model, and the resulting score vector is fused into a final decision.

Vector Space Modeling (VSM)

Vector space modeling in acoustic-phonetic approaches. A supervector m is created by stacking the mean vectors of all adapted mixture components of a GMM-UBM. Distances between supervectors are measured via an approximation of the Kullback-Leibler (KL) divergence, which induces the KL kernel; the Bhattacharyya kernel is an alternative.
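A hedged reconstruction of the standard formulas behind this slide (following the usual GMM-supervector literature rather than the original slide content): for two MAP-adapted GMMs a and b sharing the UBM weights $\lambda_c$ and covariances $\Sigma_c$, the KL divergence is approximated by

$$ D(a \,\|\, b) \approx \frac{1}{2} \sum_c \lambda_c \,(\mu_c^a - \mu_c^b)^\top \Sigma_c^{-1} (\mu_c^a - \mu_c^b), $$

which induces a linear kernel between normalized supervectors:

$$ K(a, b) = \sum_c \left( \sqrt{\lambda_c}\, \Sigma_c^{-1/2} \mu_c^a \right)^{\!\top} \left( \sqrt{\lambda_c}\, \Sigma_c^{-1/2} \mu_c^b \right). $$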

Intersession variability compensation. Feature compensations: Vocal Tract Length Normalization (VTLN), feature-level latent factor analysis (fLFA). Model-domain: latent factor analysis with U, the session variability matrix; the i-vector paradigm with T, the total variability matrix.
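For reference, the standard model-domain formulations behind these terms (reconstructed from the common literature, consistent with [1]): an utterance-dependent GMM mean supervector $M$ is decomposed as

$$ M = m + Ux \qquad \text{(latent factor analysis: } x \text{ is a low-dimensional session/channel factor)} $$

$$ M = m + Tw \qquad \text{(i-vector paradigm: the posterior mean of } w \text{ is the i-vector)} $$

where $m$ is the UBM supervector and the low-rank matrices $U$ (session variability) and $T$ (total variability) are trained with EM.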

Corpora. The availability of sufficiently large corpora has been the major driving factor in the development of speech technology in recent decades:
1990s: OGI telephone speech databases (OGI-11L, OGI-22L)
Conversational corpora: CallHome (6 languages), CallFriend (12 languages)
NIST LREs: 1996, 2003, 2005, 2007, 2009, and 2011

NIST LREs

Results: The MITLL NIST LRE 2011 Language Recognition System [4]

Future directions. We have not been able to effectively venture beyond acoustic-phonetic and phonotactic knowledge, despite strong evidence from human listening experiments that prosodic information, syllable structure, and morphology are useful knowledge sources.

References
1. Haizhou Li, Bin Ma, and Kong Aik Lee, "Spoken Language Recognition: From Fundamentals to Practice," Proceedings of the IEEE, vol. 101, no. 5, pp. 1136-1159, May 2013. doi: 10.1109/JPROC.2012.2237151
2. http://www.languageline.com/
3. Jiri Navratil, "Spoken language recognition: a step toward multilinguality in speech processing," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 678-685, 2001.
4. Elliot Singer et al., "The MITLL NIST LRE 2011 language recognition system," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.