Machine Learning of Level and Progression in Spoken EAL


Machine Learning of Level and Progression in Spoken EAL
Kate Knill and Mark Gales
Speech Research Group, Machine Intelligence Lab, University of Cambridge
5 February 2016

Spoken Communication
Factors: speaker characteristics, environment/channel, pronunciation, prosody, message construction, message realisation, message reception.
Spoken communication is a very rich communication medium.

Spoken Communication Requirements
Message construction should consider:
- Has the speaker generated a coherent message to convey?
- Is the message appropriate in the context?
- Is the word sequence appropriate for the message?
Message realisation should consider:
- Is the pronunciation of the words correct/appropriate?
- Is the prosody appropriate for the message?
- Is the prosody appropriate for the environment?


Spoken Language Versus Written
ASR output:
  okay carl uh do you exercise yeah actually um i belong to a gym down here gold s gym and uh i try to exercise five days a week um and now and then i ll i ll get it interrupted by work or just full of crazy hours you know
Meta-Data Extraction (MDE) markup:
  Speaker1: / okay carl {F uh} do you exercise /
  Speaker2: / {DM yeah actually} {F um} i belong to a gym down here / / gold s gym / / and {F uh} i try to exercise five days a week {F um} / / and now and then [REP i ll + i ll] get it interrupted by work or just full of crazy hours {DM you know} /
Written text:
  Speaker1: Okay Carl, do you exercise?
  Speaker2: I belong to a gym down here, Gold's Gym, and I try to exercise five days a week, and now and then I'll get it interrupted by work or just full of crazy hours.
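As a rough illustration, MDE-style markup like the sample above can be stripped back to fluent text with a few regular expressions. The patterns below (fillers {F ...}, discourse markers {DM ...}, repetition repairs [REP a + b], and sentence-unit boundary slashes) are inferred from this one sample, not taken from the MDE annotation specification.

```python
import re

def strip_mde(marked: str) -> str:
    """Remove MDE-style markup from a transcript, keeping only fluent words."""
    text = marked
    # Fillers and discourse markers: drop the whole annotated span.
    text = re.sub(r"\{F [^}]*\}", "", text)
    text = re.sub(r"\{DM [^}]*\}", "", text)
    # Repetition repairs: keep only the corrected part after '+'.
    text = re.sub(r"\[REP [^+\]]*\+([^\]]*)\]", r"\1", text)
    # Sentence-unit boundary slashes.
    text = text.replace("/", " ")
    return " ".join(text.split())

print(strip_mde("/ {DM yeah actually} {F um} i belong to a gym down here /"))
# -> i belong to a gym down here
```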

Business Language Testing Service (BULATS) Spoken Tests
An example of a test of communication skills:
A. Introductory questions: where you are from
B. Read aloud: read specific sentences
C. Topic discussion: discuss a company that you admire
D. Interpret and discuss chart/slide: example above
E. Answer topic questions: 5 questions about organising a meeting

Automated Assessment of One Speaker
Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade

Outline
Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade

Speech Recognition Challenges
Non-native ASR is highly challenging:
- heavily accented speech, with pronunciation dependent on L1
- commercial systems perform poorly
State-of-the-art CUED systems:
  Training data                          Word error rate (BULATS speakers)
  Native & C-level non-native English    54%
  BULATS speakers                        30%

Automatic Speech Recognition Components
The recognition engine combines:
- a pronunciation lexicon
- an acoustic model, built from acoustic model training data
- a language model (e.g. predicting "The cat sat on ..."), built from language model training data
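These components combine in the standard noisy-channel formulation of ASR (a textbook reminder, not specific to the CUED system): the recogniser searches for

```latex
\hat{W} \;=\; \arg\max_{W}\; p(O \mid W)\, P(W)
```

where $O$ is the acoustic observation sequence, $p(O \mid W)$ is the acoustic model, $P(W)$ is the language model, and the pronunciation lexicon maps each word sequence $W$ to the phone sequences through which it is acoustically scored.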

Forms of Acoustic and Language Models
- L2 acoustic model + L2 language model, trained on L2 audio data and L2 (and L1) text data: used to recognise L2 speech
- Native acoustic model + native language model, trained on native (L1) audio and text data: useful for extracting features

Deep Learning for Speech Recognition
[Diagram: PLP and pitch features pass through a bottleneck network trained on AMI corpus data into a tandem HMM-GMM, giving log likelihoods; FBank and pitch features, via a speaker-dependent bottleneck layer, feed a stacked hybrid bottleneck system, giving log posteriors; the two scores are fused.]
Fusion of HMM deep neural network and Gaussian mixture models trained on BULATS data.

Recognition Error Rate Versus Learner Progression

Outline
Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade

Baseline Features
Mainly fluency based:
- Audio features: statistics of fundamental frequency (f0), speech energy and duration
- Aligned text features: statistics of silence durations, number of disfluencies (um, uh, etc.), speaking rate
- Text identity features: number of repeated words (per word), number of unique word identities (per word)
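The aligned-text statistics above can be computed directly from a time-aligned ASR hypothesis. The sketch below is illustrative only: the (word, start, end) input format, the filler list, and the exact statistics are assumptions, not the actual BULATS feature set.

```python
# Hedged sketch: compute simple fluency statistics of the kind listed above
# from a time-aligned ASR hypothesis.
FILLERS = {"um", "uh", "er", "erm"}

def fluency_features(aligned):
    """aligned: list of (word, start_sec, end_sec), sorted by start time."""
    words = [w for w, _, _ in aligned]
    total_time = aligned[-1][2] - aligned[0][1]
    # Silence durations: positive gaps between consecutive words.
    gaps = [s2 - e1 for (_, _, e1), (_, s2, _) in zip(aligned, aligned[1:]) if s2 > e1]
    n = len(words)
    return {
        "speaking_rate": n / total_time,                        # words per second
        "mean_silence": sum(gaps) / len(gaps) if gaps else 0.0,
        "disfluencies_per_word": sum(w in FILLERS for w in words) / n,
        "repeated_words_per_word": (n - len(set(words))) / n,
        "unique_words_per_word": len(set(words)) / n,
    }

feats = fluency_features([("i", 0.0, 0.2), ("um", 0.5, 0.7),
                          ("i", 0.8, 1.0), ("run", 1.0, 1.3)])
```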

Speaking Time Versus Learner Progression
[Chart: average speaking time in seconds (0 to 700) by CEFR grade (A1, A2, B1, B2, C), for spontaneous speech and read speech.]

Pronunciation Features
Hypothesis: poor speakers are weaker at making phonetic distinctions.
Statistical approach: learn phonetic distances from graded data.
[Figure: phonetic distance patterns for a grade A1 candidate and a grade C1 candidate.]
The pattern of distances differs between candidates of different levels.
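One way to realise "learn phonetic distances from graded data" is to fit a Gaussian to each phone's acoustic features for a candidate and measure distances between phone pairs. The Bhattacharyya distance below is a standard choice of distance between Gaussians, but the actual distance measure and features used in this work are not specified on the slide; this is a 1-D illustrative sketch.

```python
import math

def bhattacharyya_1d(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two 1-D Gaussians."""
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2))))

def phone_distance(frames_a, frames_b):
    """Distance between two phones, each given as a list of scalar acoustic
    features for one candidate: fit a Gaussian to each, then compare."""
    def fit(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, max(var, 1e-6)   # floor to avoid degenerate variance
    (m1, v1), (m2, v2) = fit(frames_a), fit(frames_b)
    return bhattacharyya_1d(m1, v1, m2, v2)
```

Per the slide's hypothesis, a weak speaker's phone distributions overlap more, so their pairwise distances should be systematically smaller than a strong speaker's.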

Outline
Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade

Uses of Automatic Assessment
Human graders:
- very powerful ability to assess spoken language
- vary in quality and are not always available
Automatic graders:
- more consistent and potentially always available
- validity of the grade varies; limited information about context
Use an automatic grader for grading practice tests and in the learning process, or in combination with human graders:
- combination: use both grades
- back-off process: detect challenging candidates

Gaussian Process Grader
We currently have 1000s of candidates to train the grader: this is limited data compared to ASR training (100,000s of frames), and it is useful to have confidence in the prediction. A Gaussian process is a natural choice for this configuration.
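A minimal exact Gaussian process regressor shows why it suits this setting: alongside a predicted grade it returns a predictive variance, which can drive the back-off decisions on later slides. The RBF kernel, lengthscale and noise level below are arbitrary illustrative choices, not the system's hyperparameters.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X_train, y_train, X_test, noise=0.1):
    """Exact GP regression: predictive mean and variance at X_test."""
    K = rbf(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    Ks = rbf(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(X_test, X_test)) - (v * v).sum(axis=0)
    return mean, var

# Toy use, features -> grades: variance grows far from the training data.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([3.0, 4.0, 5.0])
mean, var = gp_predict(X, y, np.array([[1.0], [10.0]]))
```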

Form of Output
  Grader           Pearson correlation
  Human experts    0.85
  Automatic (GP)   0.83–0.86

Combining Human and Automatic Graders
[Chart: correlation (0.85 to 1) of the Gaussian process grader against interpolation weight (0.2 to 0.8), compared with the original grades.]
Interpolating between human and automated grades produces a higher correlation, i.e. a more reliable grade, and content checking can be done by the human grader.
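The interpolation amounts to a weighted average of the two grades. The sketch below uses invented toy grades and a hypothetical reference set to pick a weight; it is not the BULATS data or procedure.

```python
import numpy as np

def combine_grades(human, auto, w):
    """Linear interpolation of human and automatic grades (w = human weight)."""
    return w * np.asarray(human) + (1.0 - w) * np.asarray(auto)

def pearson(a, b):
    """Pearson correlation between two grade vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical example: sweep the weight and keep the one that best matches
# a set of reference grades.
ref   = np.array([3.0, 4.0, 5.0, 6.0])
human = np.array([3.2, 3.9, 5.1, 5.8])
auto  = np.array([2.7, 4.4, 4.8, 6.3])
best_w = max((w / 10 for w in range(11)),
             key=lambda w: pearson(combine_grades(human, auto, w), ref))
```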

Detecting Outlier Grades
Standard (BULATS) graders handle standard speakers very well; non-standard (outlier) speakers are less well handled. The Gaussian process variance can be used to automatically detect outliers and back off to human experts.
[Chart: correlation against rejection rate (i.e. cost) for Gaussian process, ideal and random rejection strategies.]
Rejecting 10% of candidates raises performance from 0.83 to 0.88.
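The variance-based back-off can be sketched as follows, under the simplifying assumption that rejected candidates receive the reference (human) grade; the grades and variances below are invented for illustration.

```python
import numpy as np

def backoff_correlation(auto, human_ref, variance, reject_frac):
    """Reject the highest-variance fraction of candidates (sending them to
    human experts, modelled as giving the reference grade) and return the
    correlation of the resulting grades with the reference."""
    auto = np.asarray(auto, float)
    ref = np.asarray(human_ref, float)
    n_reject = int(round(reject_frac * len(auto)))
    combined = auto.copy()
    if n_reject:
        worst = np.argsort(variance)[-n_reject:]   # most uncertain predictions
        combined[worst] = ref[worst]               # back off to human grade
    return float(np.corrcoef(combined, ref)[0, 1])

# Toy illustration: the one badly-graded candidate also has high variance,
# so rejecting 20% of candidates improves the correlation.
ref = [3.0, 3.5, 4.0, 4.5, 5.0]
auto = [3.1, 3.4, 4.1, 4.4, 2.0]
variance = [0.1, 0.1, 0.1, 0.1, 0.8]
```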

Assessing Content
The grader correlates well with expert grades, but its features, which are primarily fluency based, do not assess content. We therefore train a recurrent neural network language model for each question and assess whether the response is consistent with example answers.

Spoken Language Assessment
Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade
Automatically assess:
- Message realisation (fluency, pronunciation): achieved, with room for improvement
- Message construction (construction and coherence of the response, relationship to topic): unsolved, active research areas

Spoken Language Assessment and Feedback
Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade, with an error detection and correction module providing feedback.
Automatically assess:
- Message realisation: fluency, pronunciation
- Message construction: construction and coherence of the response, relationship to topic
Provide feedback:
- to the user: realisation, construction
- to the system: adjust to the learner's level

Recognition Error Rate Versus Learner Progression

Time Alignment and Pronunciation Feedback
Lightly supervised: no pronunciation labelling is required; the system is trained just on grades.

Conclusions
Automated machine learning for spoken language assessment:
- important to keep costs down
- can be integrated into the learning process
Current systems assess level in terms of fluency; ongoing research addresses assessing communication skills (appropriateness and acceptability).
Error detection and feedback are challenging:
- high precision is required in detecting where errors have occurred
- feedback must be supplied in a form appropriate for the learner

Thank You
Acknowledgement: members of the CUED MIL ALTA team: Rogier van Dalen, Kostas Kyriakopoulos, Andrey Malinin, Yu Wang.