International Journal of Scientific & Engineering Research, Volume 5, Issue 5, May ISSN

International Journal of Scientific & Engineering Research, Volume 5, Issue 5, May-2014, page 600

Extraction of Prosodic Features for Speaker Recognition Technology and Voice Spectrum Analysis

Authors: Nilu Singh 1, R. A. Khan 1
1 SIST-DIT, Babasaheb Bhimrao Ambedkar University (Central University), Lucknow, UP, India
E-mail: nilu.chouhan@hotmail.com

Abstract: This paper gives an overview of prosodic features and of the spectral analysis of a speech signal. A speaker recognition system uses a machine to recognize people from their spoken words. Most state-of-the-art automatic speaker recognition systems rely on short-term spectral information; this approach disregards long-term information that can convey suprasegmental cues such as prosody and speaking style. We discuss in detail what prosody is and how prosodic features are extracted, including their mechanism and function. The goal is to give an overview of prosodic feature extraction techniques that helps people working in the speaker recognition area. Human speech prosody has several characteristics, such as intonation, rhythm and stress, and speaker recognition can be performed using these characteristics.

Index Terms: Introduction of web 2.0, History of web 2.0, tools & technology, why web 2.0

1 INTRODUCTION

Speech is the most natural means of human communication. It is also a good medium for identifying people, because it conveys information about the speaker to the listener. A person's tone is as distinctive as their native place: a voice can be mimicked, but not its exact tone, unless the mimic comes from the same native place. Speaking styles differ between people because accent is tied to native place. For example, the English speaking style of Indian speakers differs from that of speakers from elsewhere, because a touch of the native dialect remains in the voice and accent.

From the speech production point of view, the speech signal conveys linguistic information (related to the language) and speaker information. From the speech perception point of view, it also conveys information about the environment in which the speech was produced and transmitted. Humans can naturally make sense of most of this information; this ability has encouraged researchers to study speech production and perception in order to develop systems that automatically extract and process the wealth of information in speech [1].

Figure 1: Structural design of a typical biometric recognition system

Automatic speaker recognition is the ability of a machine to identify an individual from spoken words or sentences. Today the technology is used in areas such as forensic laboratories, access control and transaction authentication. A natural question is why the speech signal should be used. The answer is that the speech signal conveys several kinds of information about the speaker, which makes it well suited to identifying a person.

As discussed in [1][8], automatic speaker recognition has a wide range of applications. It enables systems to use a person's voice to control access to restricted services, e.g. automatic banking services and telephone access to financial transactions. It also allows speaker detection for accent-based information retrieval within audio collections, recognition of the perpetrator in forensic analysis, and personalization of user devices.

2 VOICE SPECTRUM ANALYSIS

As discussed in [9], the basic prosodic features include speaking rate, pause rate, timing and pitch (f0), where pitch covers melody, rate of change and global regrets. Prosodic features are useful for assigning meaning, for detecting sentence and topic boundaries, and for speaker identification. Many studies report that the speech signal conveys the linguistic background of the speaker. The fundamental frequency (f0) of the speech signal also indicates the gender of the speaker: f0 is usually lower for male speakers, because men usually have longer vocal cords [10].

A sound spectrum describes the different frequencies present in a sound signal; it represents a segment of the signal in terms of the amount of vibration at each individual frequency. The spectrum of a speech signal is generally presented as a graph over frequency. A spectrum can be measured in several ways, for example using a computer with a microphone and an analog-to-digital converter that samples the signal as a function of time [11].
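
As a rough illustration of how a digitized segment of sound can be presented as a graph over frequency, the sketch below computes a magnitude spectrum with a direct discrete Fourier transform. The Hann window, the synthetic 200 Hz tone standing in for recorded speech, and the names `magnitude_spectrum` and `sample_rate` are assumptions made for this example; they are not taken from the paper.

```python
import cmath
import math


def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies_hz, magnitudes) for one frame of samples.

    Uses a Hann window and a direct O(n^2) DFT, which is fine for a
    short frame; a real system would normally use an FFT library.
    """
    n = len(frame)
    # Hann window reduces leakage caused by cutting the signal into blocks.
    windowed = [
        s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    freqs, mags = [], []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags


# Synthetic stand-in for a digitized voice frame: a 200 Hz tone at 8 kHz.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 200.0 * i / sample_rate) for i in range(400)]
freqs, mags = magnitude_spectrum(frame, sample_rate)
peak_hz = freqs[mags.index(max(mags))]
print(f"strongest component near {peak_hz:.0f} Hz")
```

Plotting `mags` against `freqs` gives the kind of spectrum graph described above; for voiced speech the strongest low-frequency peaks are related to f0 and its harmonics.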

Fig 2: Voice waveform and voice spectrum (created using MATLAB)

A spectrum is also defined as a relationship, typically represented by a plot of the magnitude of some quantity against frequency; each spectrum describes one segment of the speech signal. For speaker recognition, the sound spectrum is computed over the small blocks into which the speech signal is broken. Spectrum analysis is a preliminary measurement for reducing speech bandwidth and for further acoustic processing [2][10]. As discussed in [10], most of the parameters used in the recognition process can be obtained from the spectrum of the speech signal. The spectrum carries information about the vocal tract and about the excitation source in the glottis, by means of the formants and the fundamental frequency. The parameters obtained from spectral analysis for speaker recognition are typically the same as those used in speech recognition.

3 PROSODIC FEATURES FOR SPEAKER RECOGNITION

Prosody refers to the rhythmic aspect of language, or to the suprasegmental phonemes of pitch, stress, juncture, nasalization (the utterance of sounds modulated by the nasal resonators) and voicing. Prosodic features of a speech signal can be used to capture speaker-specific information about variation in intonation, timing and loudness. Since prosodic features are suprasegmental, long-term features, they can provide complementary information to systems based on phonetic or frame-level features [2]. One of the most studied features of speech is pitch, or fundamental frequency, which reflects the vibration rate of the vocal folds; this rate is affected by
different physical properties of the vocal folds. As discussed in [1], experimental results indicate that prosodic features are valuable and provide new information for speaker recognition; they have been used for speaker recognition for a long time.

There are two approaches to exploiting prosodic features. The first is the pitch and energy distribution: a feature vector consisting of per-frame log pitch, log energy and their first derivatives is used for speaker verification. The second is pitch and energy track dynamics: pitch and energy gestures are captured by modeling the joint slope dynamics of the pitch and energy contours. Pitch and energy slope states (rising and falling), segment duration and phoneme or word context are used to train an n-gram classifier.

The most common prosodic features are described as follows. The pitch and energy features are the average value and standard deviation over all frames and over the rising or falling frames, together with the number of frames in which pitch is rising. The duration features are the average and standard deviation of word and silence lengths, measured in frames. Syllable-based prosodic features are more effective for speaker recognition.

In a prosodic system, the term prosody stands for the patterns of stress and intonation in a language; in other words, it denotes a collection of characteristics such as intonation, stress and timing, mostly expressed through variation in pitch, energy and duration at various levels of speech. Prosody may reflect various features of the utterance as well as the speaking habits a person acquires from their environment, and hence it contributes to speaker recognition [4].

Prosodic feature extraction methods fall into two categories. The first uses an automatic speech recognizer (ASR): syllable boundaries are obtained with the help of the ASR, and points of variation together with the start and end of articulation are used to segment the speech signal. The segmented trajectories are then approximated and labeled with a small set of classes that describe the dynamics of the f0 and energy contours. The second is independent of an ASR: segment boundaries are estimated using discriminative information derived from the speech signal itself [5].

In [1][6], a collection of prosodic features derived from pitch, energy and duration is described, including duration- and pitch-related features such as the mean and variance of pause duration and of F0 values per word, extracted from each conversation side. The experimental results of that study show that prosodic features are valuable and provide new information for speaker recognition technology.

4 SPEAKER SPECIFIC FEATURES OF PROSODY

For linguistics, prosody reflects various features of the speaker and the utterance, and it carries information about the rhythm, stress and intonation of speech. The speaker's
communication manner in a dialogue is analyzed to see whether the communication style observed in conversation contains information useful for speaker recognition. The underlying idea is that speaker information may be found in both static and dynamic forms, and that differences in speech production may originate from anatomical, physiological or behavioral differences between individuals, so speaker characteristics vary in nature [5][7]. Physiological differences arise from the shape and size of the oral tract, nasal tract, vocal folds and trachea; they can also lead to differences in vocal tract dynamics and excitation characteristics. The value of the fundamental frequency f0 varies between speakers because of differences in the physical structure of their vocal folds. As discussed earlier, speaker identity is also influenced by speaking style, which is habitually determined by the speaker's living environment and native language.

According to [7], early work on prosodic features was based on modeling pitch (f0) statistics and augmented the feature vector with raw f0. More recent work on prosody models f0 separately and includes other factors such as pause and voice duration.

5 ROBUSTNESS OF PROSODIC BASED SYSTEMS

Many studies [5][8][9] show that prosody-based features can be used to improve the performance of automatic speaker recognition systems and to add robustness to these systems. As discussed in [7], most state-of-the-art automatic speaker recognition systems depend on spectral features derived from short-term spectral analysis (MFCC) of the speech signal. Because the magnitude of the short-time spectrum encodes information about vocal tract shape, spectral features are used extensively in speaker recognition technology. Prosodic features are derived from pitch, energy and duration, which are less affected by channel variation and noise than spectral features; taking all these aspects together, the conclusion is that prosody-based systems are more robust.

6 CONCLUSION

In this paper we have tried to explain prosody and prosodic features for speaker recognition and the speech signal. Prosodic features in the speech signal are shown to give significant information about the speaking style of speakers. This paper has presented an automatic speaker recognition system using prosodic features derived from pitch, energy and duration based parameters. Consistent pitch detection is especially important for the statistical modeling of speech prosody. Pitch estimation of speech is inherently error-prone because of acoustic noise and channel distortion, which lead to pitch halving and doubling errors.
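
The per-frame feature vector mentioned in Section 3 (log pitch, log energy and their first derivatives) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of the cited systems: pitch is estimated with a plain normalized autocorrelation, no voicing decision is made, the 70 to 400 Hz search range and the frame sizes are assumed values, and all function names are hypothetical.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len with step hop."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def log_energy(frame):
    """Log of the frame's sum of squares (small floor avoids log(0))."""
    return math.log(sum(x * x for x in frame) + 1e-12)


def autocorr_pitch(frame, sample_rate, fmin=70.0, fmax=400.0):
    """Estimate f0 in Hz as the lag with the highest normalized correlation.

    fmin and fmax are assumed search limits; the frame must be longer
    than sample_rate / fmin samples.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    n = len(frame) - max_lag  # same number of terms for every lag
    head = frame[:n]
    head_energy = sum(x * x for x in head)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        shifted = frame[lag:lag + n]
        num = sum(x * y for x, y in zip(head, shifted))
        den = math.sqrt(head_energy * sum(y * y for y in shifted)) + 1e-12
        if num / den > best_score:
            best_lag, best_score = lag, num / den
    return sample_rate / best_lag


def deltas(values):
    """First derivative as the difference to the previous frame (0 for the first)."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


def prosodic_vectors(samples, sample_rate, frame_len=320, hop=80):
    """Return one [log f0, log energy, d log f0, d log energy] list per frame."""
    frames = frame_signal(samples, frame_len, hop)
    log_f0 = [math.log(autocorr_pitch(f, sample_rate)) for f in frames]
    log_e = [log_energy(f) for f in frames]
    return [list(v) for v in zip(log_f0, log_e, deltas(log_f0), deltas(log_e))]


# Synthetic stand-in for voiced speech: a 125 Hz tone sampled at 8 kHz.
sample_rate = 8000
samples = [math.sin(2.0 * math.pi * 125.0 * i / sample_rate) for i in range(1600)]
vectors = prosodic_vectors(samples, sample_rate)
print(len(vectors), "frames; first f0 estimate:", round(math.exp(vectors[0][0]), 1), "Hz")
```

Limiting the lag search to one octave-spanning range is also a simple guard against the halving and doubling errors noted in the conclusion, since a lag of twice the true period correlates just as well as the period itself.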

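The summary statistics listed in Section 3 (average and standard deviation of pitch over all frames and over rising or falling frames, plus the number of rising frames) might be computed as in the sketch below. The definition of a rising frame as one whose f0 exceeds that of the previous frame, the function name `pitch_statistics` and the sample values are assumptions made for illustration.

```python
import statistics


def pitch_statistics(f0_track):
    """Summarize an f0 track (Hz per voiced frame) with simple statistics.

    A frame counts as rising (falling) when its f0 is higher (lower)
    than the previous frame's f0; the first frame has no direction.
    """
    rising = [b for a, b in zip(f0_track, f0_track[1:]) if b > a]
    falling = [b for a, b in zip(f0_track, f0_track[1:]) if b < a]

    def mean_std(values):
        # Population standard deviation; None when no frames qualify.
        if not values:
            return None
        return statistics.fmean(values), statistics.pstdev(values)

    return {
        "all": mean_std(f0_track),
        "rising": mean_std(rising),
        "falling": mean_std(falling),
        "num_rising": len(rising),
    }


# Hypothetical f0 values in Hz for six consecutive voiced frames.
track = [120.0, 124.0, 128.0, 126.0, 122.0, 122.0]
stats = pitch_statistics(track)
print(stats)
```

The same helper could be applied to a per-frame energy track, or to word and silence lengths for the duration statistics.
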
REFERENCES

1. Jin, Qin, and Thomas Fang Zheng. "Overview of Front-end Features for Robust Speaker Recognition." APSIPA ASC 2011 Xian. n. page. Print.
2. Shriberg, Elizabeth. "Higher-Level Features in Speaker Recognition." Springer-Verlag Berlin Heidelberg 2007. Speaker Classification I, LNAI 4343. (2007): 241-259. Print.
3. Mary, Leena. "Prosodic feature for speaker recognition." Forensic Speaker Recognition. Springer, 365-370. Print.
4. Adami, Andre G., Radu Mihaescu, Douglas A. Reynolds, and John J. Godfrey. "Modeling Prosodic Dynamics for Speaker Recognition." ICASSP 2003, IEEE. IV. (2003): 788-791. Print.
5. B. Peskin, J. Navratil, J. Abramson, D. Jones, D. Klusacek, D. Reynolds, B. Xiang. "Using Prosodic and Conversational Features for High-performance Speaker Recognition: Report from JHU WS'02." ICASSP 2003.
6. Mary, Leena, and B. Yegnanarayana. "Extraction and representation of prosodic features for language and speaker recognition." ELSEVIER. Speech Communication 50. (2008): 782-796. Print.
7. Dehak, Najim, Pierre Dumouchel, and Patrick Kenny. "Modeling Prosodic Features with Joint Factor Analysis for Speaker Verification." IEEE Transactions on Audio, Speech and Language Processing. (2007): n. page. Print.
8. Laskowski, Kornel, and Qin Jin. "Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring." Odyssey 2010, The Speaker and Language Recognition Workshop. (2010): n. page. Print.
9. Heck, Larry P. "Integrating High-Level Information for Robust Speaker Recognition." CLSP Workshop 2002. The Johns Hopkins University. Nuance Communications.
10. Farrús i Cabeceran, Mireia. "Fusing Prosodic and Acoustic Information for Speaker Recognition." TALP Research Center, Speech Processing Group, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya. Barcelona, 07 2008.
11. "What is a Sound Spectrum?" School of Physics, UNSW. Web. 18 Feb 2014.
12. Hermansky, Hynek. "Speech recognition from spectral dynamics." Sadhana, Indian Academy of Sciences. Vol. 36, Part 5 (October 2011): 729-744. Print.
13. "Spectral Analysis of Speech Signals." N.p. Web. 18 Feb 2014.
14. Weenink, David. Speech Signal Processing with Praat. ISBN-13. January 20, 2014. 1-328. Print.