Machine Learning of Level and Progression in Second/Additional Language Spoken English


Machine Learning of Level and Progression in Second/Additional Language Spoken English. Kate Knill, Speech Research Group, Machine Intelligence Lab, Cambridge University Engineering Dept, 11 May 2016

Cambridge ALTA Institute: a virtual institute at the University of Cambridge (Computing, Linguistics, Engineering, Language Assessment), with sponsorship from Cambridge English Language Assessment. Work presented was done at CUED thanks to: Mark Gales, Rogier van Dalen, Kostas Kyriakopoulos, Andrey Malinin, Mohammad Rashid, Yu Wang

Spoken Communication Speaker Characteristics Environment/Channel Pronunciation Prosody Message Construction Message Realisation Message Reception Spoken communication is a very rich communication medium

Spoken Communication Requirements
Message Construction should consider:
- Has the speaker generated a coherent message to convey?
- Is the message appropriate in the context?
- Is the word sequence appropriate for the message?
Message Realisation should consider:
- Is the pronunciation of the words correct/appropriate?
- Is the prosody appropriate for the message?
- Is the prosody appropriate for the environment?

Spoken Language Versus Written
ASR Output:
okay carl uh do you exercise yeah actually um i belong to a gym down here gold's gym and uh i try to exercise five days a week um and now and then i'll i'll get it interrupted by work or just full of crazy hours you know
Meta-Data Extraction Markup:
Speaker1: / okay carl {F uh} do you exercise /
Speaker2: / {DM yeah actually} {F um} i belong to a gym down here / / gold's gym / / and {F uh} i try to exercise five days a week {F um} / / and now and then [REP i'll + i'll] get it interrupted by work or just full of crazy hours {DM you know} /
Written Text:
Speaker1: Okay Carl, do you exercise?
Speaker2: I belong to a gym down here, Gold's Gym, and I try to exercise five days a week, and now and then I'll get it interrupted by work or just full of crazy hours.

Business Language Testing Service (BULATS) Spoken Tests: an example of a test of communication skills.
A. Introductory Questions: where you are from
B. Read Aloud: read specific sentences
C. Topic Discussion: discuss a company that you admire
D. Interpret and Discuss Chart/Slide: example above
E. Answer Topic Questions: 5 questions about organising a meeting

Common European Framework of Reference (CEFR)
Level | Global Descriptor
C2 | Fully operational command of the spoken language
C1 | Good operational command of the spoken language
B2 | Generally effective command of the spoken language
B1 | Limited but effective command of the spoken language
A2 | Basic command of the spoken language
A1 | Minimal command of the spoken language

Automated assessment of one speaker: Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade

Outline: Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade

Speech Recognition Challenges
Non-native ASR is highly challenging: speech is heavily accented and pronunciation depends on the speaker's L1; commercial systems perform poorly.
State-of-the-art CUED systems:
Training data: Native & C-level non-native English → word error rate 54%
Training data: BULATS speakers → word error rate 30%

Automatic Speech Recognition Components: a recognition engine (hypothesis: "The cat sat on") combining a Pronunciation Lexicon, an Acoustic Model trained on acoustic model training data, and a Language Model trained on language model training data.

Forms of Acoustic and Language Models
L2 Acoustic Model + L2 Language Model (trained on L2 audio data, L2 text data and L1 text data): used to recognise L2 speech.
Native Acoustic Model + Native Language Model (trained on native (L1) audio and text data): useful to extract features.

Speech Recognition System: a Tandem system (PLP features plus a speaker-dependent bottleneck layer, GMM-HMM, giving log likelihoods) and a Stacked Hybrid system (FBank and bottleneck features, giving log posteriors), trained on AMI Corpus and BULATS data. Fusion score by joint decoding, i.e. frame-level combination: L(o_t|s_i) = λ_T L_T(o_t|s_i) + λ_H L_H(o_t|s_i)
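The frame-level combination can be sketched in a few lines. This is a minimal illustration, assuming the per-frame log-likelihoods from the tandem and hybrid systems have already been computed; the array layout and function name are assumptions, not from the slides:

```python
import numpy as np

def joint_log_likelihood(ll_tandem, ll_hybrid, lam_t=0.5, lam_h=0.5):
    """Frame-level score combination:
    L(o_t|s_i) = lam_T * L_T(o_t|s_i) + lam_H * L_H(o_t|s_i)

    ll_tandem, ll_hybrid: arrays of shape (frames, states) holding per-frame
    log-likelihoods from the tandem and stacked-hybrid systems.
    """
    ll_tandem = np.asarray(ll_tandem, dtype=float)
    ll_hybrid = np.asarray(ll_hybrid, dtype=float)
    return lam_t * ll_tandem + lam_h * ll_hybrid
```

In decoding, the combined scores would simply replace the single-system acoustic scores, with the weights λ_T and λ_H tuned on held-out data.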

Recognition Rate vs L1: acoustic models trained on English data; recognition of English from Gujarati-L1 speakers, scored against crowd-sourced references.

Recognition Error Rate vs Learner Progression. [Chart: %WER (0-50) by CEFR grade (A1, A2, B1, B2, C) for read, spontaneous and overall speech.]

Outline: Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade

Baseline Features (mainly fluency based)
Audio features: statistics about fundamental frequency (f0), speech energy and duration.
Aligned text features: statistics about silence durations, number of disfluencies (um, uh, etc.), speaking rate.
Text identity features: number of repeated words (per word), number of unique word identities (per word).
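As an illustration of the aligned-text and text-identity features, here is a minimal sketch computing speaking rate, silence statistics, a filled-pause count and word-identity ratios from a time-aligned transcript. The tuple format, feature names and filled-pause list are assumptions for illustration, not the BULATS system's definitions:

```python
import statistics

FILLED_PAUSES = {"um", "uh", "er", "erm"}  # illustrative disfluency markers

def fluency_features(aligned_words):
    """Baseline fluency-style features from a time-aligned transcript.

    aligned_words: list of (word, start_sec, end_sec) tuples in time order.
    Silences are the gaps between consecutive words.
    """
    words = [w for w, _, _ in aligned_words]
    total_time = aligned_words[-1][2] - aligned_words[0][1]
    gaps = [s2 - e1
            for (_, _, e1), (_, s2, _) in zip(aligned_words, aligned_words[1:])
            if s2 > e1]
    return {
        "speaking_rate": len(words) / total_time,  # words per second
        "mean_silence": statistics.mean(gaps) if gaps else 0.0,
        "num_disfluencies": sum(w in FILLED_PAUSES for w in words),
        "unique_word_ratio": len(set(words)) / len(words),
        "repeated_word_ratio": sum(w1 == w2 for w1, w2 in zip(words, words[1:])) / len(words),
    }
```

Each feature would be computed per response and concatenated into the feature vector passed to the grader.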

Speaking Time vs Learner Progression. [Chart: average speaking time in seconds (0-700) by CEFR grade (A1 to C) for spontaneous and read speech.]

Pronunciation Features
Hypothesis: poor speakers are weaker at making phonetic distinctions; less proficient speakers' phone realisation is closer to L2, more proficient speakers' is closer to L1.
Statistical approach: learn phonetic distances from graded data; a single multivariate Gaussian per phone, with a K-L divergence based distance per phoneme pair (1081 phoneme pairs):
JSD(p_1(x), p_2(x)) = (1/2) [KL(p_1(x) || p_2(x)) + KL(p_2(x) || p_1(x))]
KL(p_1(x) || p_2(x)) = (1/2) [tr(Σ_2^{-1} Σ_1 - I) + (μ_1 - μ_2)^T Σ_2^{-1} (μ_1 - μ_2) + log(|Σ_2| / |Σ_1|)]
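The per-phoneme-pair distance can be computed directly from the two Gaussians' means and covariances. A minimal sketch of the symmetrised K-L distance as defined on the slide (function names are illustrative):

```python
import numpy as np

def kl_gauss(mu1, cov1, mu2, cov2):
    """KL(p1 || p2) between two multivariate Gaussians, following the slide:
    0.5 * [tr(cov2^-1 cov1 - I) + (mu1-mu2)^T cov2^-1 (mu1-mu2) + log(|cov2|/|cov1|)]
    """
    d = len(mu1)
    inv2 = np.linalg.inv(cov2)
    diff = mu1 - mu2
    return 0.5 * (np.trace(inv2 @ cov1) - d
                  + diff @ inv2 @ diff
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

def symmetric_kl(mu1, cov1, mu2, cov2):
    """Symmetrised K-L distance used per phoneme pair."""
    return 0.5 * (kl_gauss(mu1, cov1, mu2, cov2) + kl_gauss(mu2, cov2, mu1, cov1))
```

Applied to the fitted Gaussian of every pair of phones, this yields one distance per phoneme pair, i.e. a 1081-dimensional pronunciation feature vector per candidate.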

Pronunciation Features vs Learner Progression (Candidate Grade A1 vs Candidate Grade C2): the pattern of distances differs between candidates of different levels. Correlation with score: mis-pronounced phones show a higher K-L distance, the opposite of the expectation that poor speakers have more overlap between phones.

Statistical Parser Features: parser features from the RASP system improve grade prediction for written tests. The problem for speech is recognition accuracy; smaller subtrees and leaves are fairly robust.

Outline: Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade

Uses of Automatic Assessment
Human graders: a very powerful ability to assess spoken language, but they vary in quality and are not always available.
Automatic graders: more consistent and potentially always available, but the validity of the grade varies and they have limited information about context.
Use the automatic grader for grading practice tests and in the learning process, or in combination with human graders: combination (use both grades) or a back-off process (detect challenging candidates).

Gaussian Process Grader: we currently have 1000s of candidates to train the grader, limited data compared with ASR training (100,000s of frames), so it is useful to have a confidence in the prediction. A Gaussian Process is a natural choice for this configuration.
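A Gaussian Process regressor returns both a predicted grade and a variance that can serve as a confidence. A from-scratch sketch with a squared-exponential kernel; the kernel, lengthscale and noise settings are illustrative, not the deck's configuration:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between row sets a (n,d) and b (m,d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X_train, y_train, X_test, noise=0.1):
    """GP regression: posterior mean (the predicted grade) and per-point
    variance (a confidence in that prediction)."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(X_test, X_train)
    Kinv = np.linalg.inv(K)
    mean = Ks @ Kinv @ y_train
    var = np.diag(rbf(X_test, X_test) - Ks @ Kinv @ Ks.T)
    return mean, var
```

Near the training data the predictive variance is small; for candidates unlike anything seen in training it reverts towards the prior, which is exactly the behaviour the outlier back-off exploits.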

Form of Output
Grader | Pearson Correlation
Human experts | 0.85
Automatic (GP) | 0.83 to 0.86

Effect of Grader Features
Grader | Pearson Correlation with Expert Graders
Standard examiners | 0.85
Automatic baseline | 0.83
+ Pronunciation | 0.84
+ RASP | 0.85
+ Confidence | 0.83
+ RASP + Confidence | 0.86
Pronunciation features only | 0.82

Combining Human and Automatic Graders. [Chart: correlation (0.85-1) against interpolation weight (0.2-0.8) for the original and Gaussian process grades.] Interpolating between human and automated grades gives higher correlation, i.e. a more reliable grade is produced. Content checking can be done by the human grader.
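The interpolation itself is a simple convex combination of the two grades; a minimal sketch (the function name and the convention that the weight is the automatic grader's share are assumptions):

```python
import numpy as np

def interpolated_grade(human, auto, weight=0.5):
    """Linear interpolation between human and automatic grades.

    weight is the share given to the automatic grade
    (0.0 = human only, 1.0 = automatic only).
    """
    return (1.0 - weight) * np.asarray(human, dtype=float) \
         + weight * np.asarray(auto, dtype=float)
```

Sweeping the weight over held-out candidates and measuring the correlation of the interpolated grades against expert grades would trace out a curve like the one on the slide.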

Detecting Outlier Grades
Standard (BULATS) graders handle standard speakers very well; non-standard (outlier) speakers are handled less well. Use the Gaussian Process variance to automatically detect outliers. [Chart: correlation (0.85-1) against rejection rate, i.e. cost (0-1), for ideal rejection, the Gaussian process and random rejection.] Back-off to human experts: rejecting 10% improves performance from 0.83 → 0.88.
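The variance-based back-off can be sketched as routing the least-confident fraction of candidates to human examiners; the function name and the top-fraction threshold rule are illustrative assumptions:

```python
import numpy as np

def backoff_to_humans(auto_grades, variances, human_grades, reject_rate=0.1):
    """Replace the least-confident automatic grades with human grades.

    Candidates whose predictive variance falls in the top `reject_rate`
    fraction are routed to human examiners; the rest keep the automatic grade.
    """
    n_reject = int(round(reject_rate * len(auto_grades)))
    final = np.array(auto_grades, dtype=float)
    if n_reject > 0:
        worst = np.argsort(variances)[-n_reject:]  # highest-variance candidates
        final[worst] = np.asarray(human_grades, dtype=float)[worst]
    return final
```

Increasing `reject_rate` trades human-examiner cost for reliability, which is the rejection-rate curve shown on the slide.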

Assessing Communication Level
Current assessment ignores high-level content and communication skills. [Charts: counts of unique words, bigrams, trigrams and fourgrams by CEFR grade (A1 to B2); distribution of the number of phones per word.] Language complexity is related to proficiency. Future work: look into e.g. McCarthy's use of chunks ("I would say", "and then") and Abdulmajeed and Hunston's correctness analysis.

Assessing Content: the grader correlates well with expert grades, but its features (primarily fluency features) do not assess content. Train a Recurrent Neural Network Language Model (RNNLM) for each question to assess whether the response is consistent with example answers.

Topic Classification
Experiment details: 280-D LSA topic space. Supervised (SUP): 490 speakers, 2x crowd-sourced transcriptions. Semi-supervised (Semi-SUP): +10005 speakers, ASR transcriptions.
System | HL-dim | Training Data | % Error
KNN | - | SUP | 20.8
RNNLM | 100 | SUP | 17.5
RNNLM | 200 | Semi-SUP | 9.3
Increasing the quantity of data helps even though the %WER is high. The RNNLM can handle large data sets, unlike K-Nearest Neighbour (KNN).

Off-Topic Response Detection: synthesised a pool of off-topic responses. Naïve: select an incorrect response from any section. Directed: select an incorrect response from the same section.

Spoken Language Assessment
Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade
Automatically assess:
Message realisation (fluency, pronunciation): achieved, with room for improvement.
Message construction (construction & coherence of the response, relationship to the topic): unsolved, active research areas.

Spoken Language Assessment and Feedback
Audio → Speech recogniser → Text → Feature extraction → Features → Grader → Grade, with Error Detection & Correction providing Feedback.
Automatically assess: message realisation (fluency, pronunciation); message construction (construction & coherence of the response, relationship to the topic).
Provide feedback: to the user (realisation, construction) and to the system (adjust to the learner's level).

Recognition Error Rate Versus Learner Progression

Time Alignment and Pronunciation Feedback

Conclusions
Automated machine learning for spoken language assessment is important to keep costs down and can be integrated into the learning process.
Current level assessment is based on fluency; ongoing research is addressing the assessment of communication skills: appropriateness and acceptability.
Error detection and feedback are challenging: high precision is required in detecting where errors have occurred, and feedback must be supplied in a form appropriate for the learner.

Questions?