Automated Rating of Recorded Classroom Presentations using Speech Analysis in Kazakh

Akzharkyn Izbassarova, Aidana Irmanova and Alex Pappachen James
School of Engineering, Nazarbayev University, Astana
www.biomicrosystems.info/alex
Email: apj@ieee.org

arXiv:1801.00453v1 [cs.CL] 1 Jan 2018

Abstract

Effective presentation skills can help one succeed in business, career and academia. This paper presents the design of a speech assessment system for oral presentations and an algorithm for speech evaluation based on criteria of optimal intonation. Since the pace of speech and its optimal intonation vary from language to language, automatic identification of the language used during the presentation is required. The proposed algorithm was tested with presentations delivered in the Kazakh language. For testing purposes, the features of Kazakh phonemes were extracted using the MFCC and PLP methods, and a Hidden Markov Model (HMM) [5] of Kazakh phonemes was created. Kazakh vowel formants were determined, and the correlation between the deviation rate in fundamental frequency and the liveliness of the speech was analyzed in order to evaluate the intonation of the presentation. It was established that the threshold value between monotone and dynamic speech is 0.16 and that the error of the intonation evaluation is 19%.

Index Terms: MFCC, PLP, presentations, speech, images, recognition

I. INTRODUCTION

Delivering an effective presentation in today's information world is becoming a critical factor in an individual's career, business or academic success. The Internet is full of sources on how to improve presentation skills and give a successful presentation. These sources emphasize the aspects of a presentation that grasp attention. Since there is no single template for an ideal oral presentation, opinions differ on how to prepare an oral presentation that makes a good impression on the audience. For example, [1] claims that passion about the topic is the number one characteristic of the exceptional presenter. The author suggests that passion can be expressed through posture, gestures and movement, voice, and the removal of hesitation and verbal graffiti. Whereas the criteria for the content of a presentation depend on the particular field, the standards for the visual aspect and non-verbal communication are almost the same for any presentation given in business, academia or politics. In illustrating examples of different postures and their interpretation, the author emphasizes aspects of voice usage such as volume, inflection and tempo. It is worth mentioning that the author, Timothy Koegel, has twenty years of experience as a presentation consultant to well-known business companies, politicians and business schools [1]. That is why the criteria for a successful presentation in terms of intonation given in this source can be used as a basis for speech evaluation as part of presentation assessment. However, it can be questioned how the assessment of speech is normally conducted based on these criteria. [2] examined the different criterion-referenced assessment models used to evaluate oral presentations in secondary schools and at the university level. These criterion-referenced assessment rubrics are designed to provide instructions for students as well as to increase objectivity during evaluation. It was suggested that intonation, volume and pitch are usually evaluated based on comments in criterion-referenced assessment rubrics such as "outstandingly appropriate use of voice" or "poor use of voice".
The comments used in the evaluation sheets can be subjective [2], which is why the average relation between how people perceive speech during a presentation and the level of change in intonation and tempo should be addressed. In this paper we present software for evaluating the presentation skills of a speaker in terms of intonation. We use pitch to identify the intonation of the speech. We also aim to implement automatic identification of the language of the speech, since the presentations used for testing the proposed algorithm were delivered in the Kazakh language. This task poses another problem, as Kazakh speech recognition is still not fully addressed in previous research. Recognition of Kazakh speech itself is not within the scope of this paper; adaptation to other languages such as Russian or English is considered a next step.

The paper is organized as follows: Section II presents the methodology used for presentation evaluation, Section III shows the results of testing the developed software, and Section IV provides an overall discussion of the main issues of the software design.

II. METHODOLOGY

Figure 1 illustrates the approach used to identify language and intonation. First, the features corresponding to the Kazakh phonemes are extracted. Then a model for language recognition is developed based on a Hidden Markov Model (HMM); MATLAB is used to create an HMM for the Kazakh phonemes. The block diagram in Figure 2 illustrates the algorithm used in the code. The program should be able to evaluate the intonation and tempo of the speech. It is assumed that there is a direct correlation between the deviation rate in fundamental frequency and the liveliness of the speech. Thus, we need to conduct a pitch analysis to identify whether this hypothesis is true.

Figure 1. Flow chart for speech evaluation
Figure 2. Block diagram for phone recognition

The pitch variation quotient is derived from the pitch contour of the audio files, where the pitch variation quotient is the ratio of the standard deviation of the pitch to its mean. In order to examine the variation of pitch during presentations, a database of presentations given in the Kazakh language was created. This database consists of five presentations of about ten minutes each, obtained by video-recording students' class presentations given during the Kazakh Music History and History of Kazakhstan courses at Nazarbayev University. For simplicity of analysis, the presentations are divided into one-minute audio files converted to WAV format. As a result, we obtain 32 audio files, seven of which contain male voices and the rest female voices. Using the WaveSurfer program, a pitch value is found for every 7.5 ms of speech. Two sampling frequencies, the 16 kHz and 44.1 kHz rates available in WaveSurfer, are tested to identify which gives better results, and pitch is measured at both rates. Then the mean and standard deviation of the pitch for each audio file are obtained, and the pitch variation quotient is calculated by dividing the standard deviation of the pitch by its mean.

Finally, the pitch variation quotient is compared to the results of a perception test. The same speech files used for pitch extraction are used in a test of how people perceive the speech with regard to intonation. The purpose of this test is to identify the correlation between how people evaluate a presentation and the value of the pitch variation quotient. Since the paper aims to evaluate presentation skills based on criteria such as the intonation and tempo of the speech and to give feedback to users, the program's assessments should be consistent with how professionals and a general audience would evaluate the presentation. Thus, we asked students and professors to participate in this test. They listened to speech from the presentations and categorized it as monotone (emotionless) or dynamic (lively). Since the intonation during a presentation is not constant, the speech was divided into small segments and the participants gave feedback for each segment. They gave marks for each presentation based on the intonation of the speaker, using the following marking scheme: 1 for monotone, 2 for intermediate and 3 for dynamic. All results were then analyzed and an average mark for each presentation was calculated. These average marks are compared with the results of the pitch variation quotient, as sketched below.
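A minimal sketch of this computation is given below, assuming the pitch contour has already been exported (e.g. from WaveSurfer, one value per 7.5 ms frame) as a NumPy array with unvoiced frames marked as zero or NaN. The function names and the synthetic example contour are ours, and the 0.16 decision threshold anticipates the average quotient reported in Section III.

```python
import numpy as np

def pitch_variation_quotient(pitch_hz):
    """PVQ = standard deviation of the pitch contour divided by its mean."""
    voiced = pitch_hz[np.isfinite(pitch_hz) & (pitch_hz > 0)]  # keep voiced frames only
    return float(np.std(voiced) / np.mean(voiced))

def rate_intonation(pitch_hz, threshold=0.16):
    """Label a fragment 'dynamic' or 'monotone' by comparing its PVQ to a threshold."""
    return "dynamic" if pitch_variation_quotient(pitch_hz) >= threshold else "monotone"

# Synthetic example: a 200 Hz contour with mild modulation reads as monotone.
f0 = 200.0 + 25.0 * np.sin(np.linspace(0.0, 20.0, 8000))
print(pitch_variation_quotient(f0), rate_intonation(f0))
```

On real recordings the quotient would be computed per one-minute fragment, matching the unit of evaluation used in the perception test.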
III. RESULTS

A. Formants

From the data analysis we determined the first, second and third formants of the Kazakh vowels. Table I and Table II show the results for vowels produced by male and female voices, respectively. These phonemes were obtained by manually extracting each phoneme from KLC audio files.

Table I. Average formant frequencies of Kazakh vowels produced by male speakers

Vowel   F1 (Hz)   F2 (Hz)   F3 (Hz)
–       734       1627      2769
–       517       1437      2500
–       540       1700      2705
–       513       1405      2505
–       811       1258      2640
–       577        808      2765
–       590       1307      2652
–       566        961      2605
–       443       2087      2900

Table II. Average formant frequencies of Kazakh vowels produced by female speakers

Vowel   F1 (Hz)   F2 (Hz)   F3 (Hz)
–       858       1929      3180
–       662       1424      2892
–       697       1844      2986
–       572       1529      2801
–       948       1397      3048
–       583        969      3220
–       743       1175      3072
–       696       1116      3155
–       554       2559      3150

The data given in Tables I and II are used to observe the position of the vowels according to their first and second formants. Figure 3 and Figure 4 illustrate the distribution of the vowels for male and female voices, respectively.

Figure 3. First and second formant frequencies of Kazakh vowels produced by male speakers
Figure 4. First and second formant frequencies of Kazakh vowels produced by female speakers
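The formant values above were measured manually (Section IV notes that the phonemes were segmented in Praat). Purely as an illustration of how F1, F2 and F3 can be estimated automatically from a vowel frame, the sketch below uses the standard LPC root-finding approach; all names here are ours, and such a rough automated estimate would differ from careful manual measurement.

```python
import numpy as np

def lpc_formants(frame, fs, order=10, n_formants=3):
    """Estimate formant frequencies of one vowel frame via LPC root-finding."""
    frame = frame * np.hamming(len(frame))                    # taper the analysis frame
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Autocorrelation method: solve the normal equations R a = r
    # for the linear-prediction coefficients a.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], -a)))             # poles of the LPC filter
    roots = roots[np.imag(roots) > 0]                         # one of each conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    return freqs[freqs > 90.0][:n_formants]                   # drop near-DC artifacts

# Crude synthetic "vowel" with resonances near 700 Hz and 1600 Hz.
fs = 16000
t = np.arange(0, 0.03, 1.0 / fs)
rng = np.random.default_rng(0)
vowel = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1600 * t)
vowel = vowel + 0.01 * rng.standard_normal(len(t))            # noise keeps R well-conditioned
print(lpc_formants(vowel, fs))
```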

B. Intonation evaluation

A test was conducted to identify how listeners perceive presentations in terms of intonation. In total, 32 fragments from different presentations given in the Kazakh language were tested. The participants ranked the fragments from 1 to 3, where 1 denotes a monotone presentation and 3 a dynamic one. In addition, the variation of pitch in each presentation was measured and the pitch variation quotient was found. The pitch was measured at the two sampling frequencies. Over the 32 presentation fragments, the average pitch variation quotient is 0.32 at f = 16 kHz and 0.16 at f = 44.1 kHz. Figure 5 and Figure 6 show the pitch variation quotient of each fragment against its average mark from the perception test.

Figure 5. Pitch variation quotient vs perception test results at 16 kHz sampling rate
Figure 6. Pitch variation quotient vs perception test results at 44.1 kHz sampling rate

Since the fragments were marked from 1 to 3, the average mark is 2. Thus, the boundary between a monotone and a dynamic presentation is taken to be 2 along the x-axis and the average pitch variation quotient along the y-axis. To estimate the error, we count the fragments whose pitch variation quotient is below the average but whose average mark is high and, conversely, the fragments with high pitch variation but low marks. At the 16 kHz sampling frequency the error is 34%, and at 44.1 kHz the estimated error is 19%. Finally, the same presentation was recorded twice with different intonations of the speech: the pitch variation quotient of the monotone version is 0.092, whereas the more dynamic version has a pitch variation quotient of 0.179.

C. Phone recognition

Since phone recognition does not involve recognizing full speech, there is no need for lexical decoding or syntactic and semantic analysis; phonemes are therefore used as the matching units. In this paper, training of the Kazakh phonemes for subsequent phone recognition [9] was conducted in MATLAB, and results are given for simulations of HMMs with 1-emission and 2-emission states. The models of context-independent phones represented by one or two emission states are shown in Figures 7 and 8, where a_ij is the transition probability from state i to state j, S1...S4 are the states, b_i(O_i) is the probability density function (emission probability) of state i, and O_i are the observations. In Figure 7, S1 is the initial state, S3 is the end state and S2 is the emission state; in the 2-emission state HMM of Figure 8, S2 and S3 are the emission states.

Figure 7. 1-emission state HMM
Figure 8. 2-emission state HMM

The phoneme recognition rate is calculated using the Viterbi algorithm. Different sets of simulations were run with varying training and test data, which contain phonemes recorded by female and male voices. Table III gives the recognition rates for the 1-emission and 2-emission state models.

Table III. Recognition rate (%) for 1-emission and 2-emission state HMMs

Train/Test       1-emission state HMM   2-emission state HMM
Female/Female    61.76                  64.71
Male/Male         5.88                   8.82
Male/Female      11.76                  14.71
Female/Male       5.88                   8.82
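To make the scoring step concrete, the following is a minimal sketch under our own assumptions: each phoneme is represented by a left-to-right HMM whose emission states carry a diagonal Gaussian built from that phoneme's mean and variance vectors (see Section IV), and a test segment is assigned to the phoneme whose model yields the highest best-path log-likelihood. The function names and array layout are ours; the actual experiments used MATLAB and the toolbox of [9].

```python
import numpy as np

def log_gauss(x, means, variances):
    """Per-state log density of a diagonal Gaussian for one observation frame."""
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + (x - means) ** 2 / variances,
                         axis=-1)

def viterbi_score(frames, means, variances, log_trans):
    """Best-path log-likelihood of `frames` under a left-to-right HMM.

    means, variances: (n_states, dim) emission Gaussians (the S2, S3 of Figure 8).
    log_trans: (n_states, n_states) matrix of log transition probabilities a_ij.
    """
    delta = log_gauss(frames[0], means, variances)
    delta[1:] = -np.inf                       # the path must start in the first state
    for x in frames[1:]:
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_gauss(x, means, variances)
    return delta[-1]                          # ... and end in the last state

def recognize(frames, models):
    """Assign `frames` to the phoneme whose HMM scores it best."""
    return max(models, key=lambda p: viterbi_score(frames, *models[p]))

# models maps a phoneme label to (means, variances, log_trans);
# frames is an (n_frames, 39) array of the feature vectors described in Section IV.
```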

IV. DISCUSSION

MFCC and PLP coefficients were extracted to develop phoneme-based automatic language identification [4]. As a result, 12 cepstral coefficients and one energy feature were obtained with each feature extraction technique [4], [8]. The first and second derivatives of these 13 features were then taken, giving a 39-dimensional feature vector per frame to represent each phoneme (see the sketch below). After that, the mean and covariance vectors for each phoneme were calculated; these values were used to create the training model for Kazakh phoneme recognition. MATLAB code was used to train the phonemes and create an HMM for them. As the results show, the 2-emission state HMM gives a higher recognition rate than the 1-emission state HMM.
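A minimal sketch of this feature layout is given below, assuming the 13 base coefficients (12 cepstra plus one energy term) have already been computed, e.g. with the rastamat tools of [5]. The helper names, the simple gradient-based delta estimate and the random stand-in data are ours.

```python
import numpy as np

def add_deltas(base):
    """Stack static features with their first and second time-derivatives."""
    delta = np.gradient(base, axis=0)         # simple frame-to-frame slope estimate
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([base, delta, delta2])   # (n_frames, 13) -> (n_frames, 39)

def phoneme_statistics(features):
    """Per-phoneme mean and (diagonal) variance used to build the HMM emissions."""
    return features.mean(axis=0), features.var(axis=0)

base = np.random.randn(120, 13)               # stand-in for 12 cepstra + 1 energy term
features = add_deltas(base)
mean, variance = phoneme_statistics(features)
print(features.shape, mean.shape)             # (120, 39) (39,)
```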
To train for Kazakh language identification, a Kazakh corpus labeled at the phoneme level should be used. However, only word-level labeling is currently available in the Kazakh Language Corpus [3]. This limits further analysis for phone recognition and language identification, and more time is required to create a corpus with phoneme labeling. In this paper, we analyzed the Kazakh phonemes by extracting them manually in the Praat program from a set of recordings made both in a soundproof studio and in real environment conditions. For Kazakh language identification based on the phonological features of the language itself, a bigger phoneme database is required.

V. CONCLUSION

To conclude, in this paper we presented a system that can be used to evaluate the presentation skills of a speaker based on the intonation of the voice. To test the proposed design we used data in the Kazakh language, which consequently led to the consideration of a language identification system. As language identification and speech recognition are relatively new fields for Kazakh language processing, we believe that the development of such a system could be useful for the further popularization of the Kazakh language and for the realization of different projects that build on top of Kazakh speech recognition systems. Future work covers the development of the Kazakh language corpus with analysis and labeling down to the phoneme level. After that, a language model for the Kazakh language can be developed. Finally, a larger database of presentations in the Kazakh language should be created in order to analyze presentation styles in the Kazakh language as well as to conduct tests and design an intonation evaluator.

REFERENCES

[1] T. Koegel, The Exceptional Presenter. Austin, TX: Greenleaf Book Group Press, 2007.
[2] I. Michelle and L. Michelle, "Orals ain't orals: How instruction and assessment practices affect delivery choices with prepared student oral presentations," in Australian and New Zealand Communication Association Conference, Brisbane, 2009.

[3] O. Makhambetov, A. Makazhanov, Zh. Yessenbayev, B. Matkarimov, I. Sabyrgaliyev and A. Sharafudinov, "Assembling the Kazakh Language Corpus," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, Oct. 2013, pp. 1022-1031. Association for Computational Linguistics.
[4] M. Zissman, "Automatic language identification using Gaussian mixture and hidden Markov models," in IEEE International Conference on Acoustics, Speech and Signal Processing, 1993.
[5] D. Ellis, "PLP and RASTA (and MFCC, and inversion) in Matlab," labrosa.ee.columbia.edu, 2015. [Online]. Available: http://labrosa.ee.columbia.edu/matlab/rastamat/. [Accessed: 19 Nov 2015].
[6] J. Hamar, "Using Sub-Phonemic Units for HMM Based Phone Recognition," Ph.D. thesis, Norwegian University of Science and Technology, 2013.
[7] A. Moore, "Hidden Markov Models," autonlab.org, 2016. [Online]. Available: http://www.autonlab.org/tutorials/hmm.html. [Accessed: 16 Apr 2016].
[8] D. Jurafsky, "Feature Extraction and Acoustic Modeling," 2007.
[9] R. Jang, "ASR (Automatic Speech Recognition) Toolbox," mirlab.org, 2016. [Online]. Available: http://mirlab.org/jang. [Accessed: 14 Apr 2016].