Interactive Approaches to Video Lecture Assessment

August 13, 2012
Korbinian Riedhammer
Group Pattern Lab

Motivation
  key phrases
  of the phrase occurrences
  Search spoken text

Outline
  Data Acquisition
  Textual Summary

Data Acquisition

LMELectures: A Corpus of Academic Spoken English
  Two lecture series read in 2009
    Pattern Analysis (PA)
    Interventional Medical Image Processing (IMIP)
  18 recordings per series
  About 40 hours of audio/video data
    Audio: 48 kHz, 16 bit (AIFF), resampled to 16 kHz
    Video: HD, reduced resolutions available due to bandwidth
  Clip-on cordless speaker microphone, room microphones
  Constant recording setting
    RRZE E-Studio
    Single speaker
    Same recording equipment

Transcription
  Semi-automatic segmentation into speech turns
    Based on speech pauses and silences
    23,857 turns
    Average duration of 4.4 seconds
    Total of about 29 hours of speech
  Manual transcription
    New tool for the rapid transcription of speech
    Time effort: about 5 times real time
  Transcription results
    On average 14 words per speech turn
    300,500 words transcribed
    Vocabulary size: 5,383 (excluding foreign words and word fragments)

Annotations
  Individual lecture PA06
    Based on edited manual transcript
    5 human subjects
    20 phrases
    Salience: from 1 (very relevant) to 6 (useless)
  Further annotations
    Lecturer's key terms for series PA
    Presentation slides in PDF format

The Kaldi Toolkit
  State of the art, open source
  4-layer system modeled by weighted finite state transducers (WFST)
    Statistical n-gram language model
    Lexicon with pronunciation alternatives
    Context dependent phonemes
    Hidden Markov models
  Acoustic frontend
    Mel-frequency cepstral coefficients (MFCC), 1st and 2nd order derivatives
    Phoneme dependent linear transformations
  Acoustic modeling: subspace Gaussian mixture models
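The 1st and 2nd order derivatives mentioned for the acoustic frontend can be illustrated with the widely used regression formula for delta features. This is a minimal sketch, not the toolkit's own code: the function names, the window size of 2, and the edge handling (repeating the first and last frame) are assumptions made for illustration.

```python
def deltas(frames, window=2):
    """Regression-based delta features for a list of feature vectors.

    ASSUMPTION: window=2 and edge clamping are illustrative choices,
    not settings taken from the slides.
    """
    num_frames = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    result = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                # Clamp indices at the edges (repeat first/last frame).
                nxt = frames[min(t + k, num_frames - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        result.append(row)
    return result


def add_derivatives(frames, window=2):
    """Append 1st and 2nd order derivatives to each static feature vector."""
    first = deltas(frames, window)
    second = deltas(first, window)
    return [f + d1 + d2 for f, d1, d2 in zip(frames, first, second)]
```

With 13 static MFCCs per frame, `add_derivatives` would yield 39-dimensional vectors, since each derivative order adds one copy of the static dimensionality.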

The LMELectures System
  600 Gaussian components, 5,500 HMM states
  Vocabulary size: 5,383
  Language model
    5,370,040 bi- and tri-grams
    Trained on 500+ million words (including spontaneous lecture speech)

  Name         Duration  # Turns  # Words  % WER
  train        25h 31m   20,214   250,536  -
  development  2h 07m    1,802    21,909   9.78
  test         2h 12m    1,750    23,497   11.03

  WER: word error rate
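For readers unfamiliar with the measure, word error rate is commonly computed as the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The following is a minimal sketch under that standard definition; the function name and the whitespace tokenization are assumptions, and scoring tools typically apply additional text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, dropping one word from a six-word reference gives a WER of 1/6, or roughly 16.7%.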

Candidate Selection
  A verb alone may be vague: discuss what?
  An isolated noun may be ambiguous: question, difficult or easy?
  Information about the topic is often in the noun phrase
    "He asked a difficult question about the modified processing of words."
  Apply part-of-speech tagging
  Extract noun phrases based on regular expression

Example
  it   computes  the  principal  axes  that  s    the  one  d   axis
  PRP  VBZ       AR   ADJ        NN    IN    VBZ  AR   NUM  NN  NN

  that  shows  the  highest  spread  of  the  points
  IN    VBZ    AR   ADJ      NN      AR  AR   NN

  Pattern: adjective* (noun,number)+ (article+ adjective* (noun,number)+)*
  Tags: ADJ (adjective), NN (noun), NUM (number), AR (article)

Example
  it computes the principal axes that s the one d axis
  that shows the highest spread of the points

  axes
  principal axes
  one d axis
  one d
  d axis
  spread
  highest spread
  points
  spread of the points
  highest spread of the points
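The candidate selection in the two examples above can be sketched by encoding each tag as a single character and matching the noun-phrase pattern against every sub-span of the tag sequence. This is an illustrative sketch, not the original implementation: the tags are taken from the example as given rather than produced by a tagger, and the names `TAG_CODES`, `PATTERN`, and `candidate_phrases` are hypothetical. Because it enumerates every matching sub-span, it may return a few more candidates than the list shown on the slide.

```python
import re

# One character per tag of interest; every other tag maps to "O".
TAG_CODES = {"ADJ": "A", "NN": "N", "NUM": "M", "AR": "R"}

# adjective* (noun,number)+ (article+ adjective* (noun,number)+)*
PATTERN = re.compile(r"A*[NM]+(?:R+A*[NM]+)*")


def candidate_phrases(tokens, tags):
    """Return every token span whose tag sequence matches the pattern."""
    codes = "".join(TAG_CODES.get(tag, "O") for tag in tags)
    phrases = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            if PATTERN.fullmatch(codes[start:end]):
                phrases.append(" ".join(tokens[start:end]))
    return phrases


tokens = "that shows the highest spread of the points".split()
tags = "IN VBZ AR ADJ NN AR AR NN".split()
for phrase in candidate_phrases(tokens, tags):
    print(phrase)
```

Note that the example tags "of" as AR, which is what lets "spread of the points" match the article+ part of the pattern.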

Unsupervised Ranking
  Frequent phrases may be salient
  With a similar occurrence count, longer phrases may be more salient
  Motivated by
    Didactics: less confusion by literal repetition
    Psycholinguistics: lexical entrainment
  $\mathrm{weight}(\mathrm{phrase}) = \begin{cases} f, & n = 1 \\ f \cdot (n + 1), & n > 1 \end{cases}$
    (f: occurrence count of the phrase, n: number of words in the phrase)
  Data and domain independent: simple and reliable
  Other investigated strategies include prior world or domain knowledge
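The weighting rule can be sketched in a few lines. The sketch assumes that f is the occurrence count of a phrase and n its length in words, as the surrounding bullets suggest; the function names and the alphabetical tie-breaking are illustrative choices, not part of the slides.

```python
from collections import Counter


def phrase_weight(frequency, num_words):
    """weight = f for single words, f * (n + 1) for longer phrases."""
    if num_words == 1:
        return frequency
    return frequency * (num_words + 1)


def rank_phrases(occurrences):
    """Rank phrases by weight, highest first (ties broken alphabetically)."""
    counts = Counter(occurrences)
    weighted = {phrase: phrase_weight(freq, len(phrase.split()))
                for phrase, freq in counts.items()}
    return sorted(weighted.items(), key=lambda item: (-item[1], item[0]))


occurrences = (["spread"] * 4
               + ["highest spread"] * 2
               + ["principal axes"] * 3)
print(rank_phrases(occurrences))
```

In this toy input, "highest spread" occurs only twice but outranks the single word "spread" with four occurrences, because its weight is 2 * (2 + 1) = 6.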

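Comparison of Rankings
  Compare a target ranking against a reference (human) ranking
  Standard measure: Normalized Discounted Cumulative Gain (NDCG)
    Award credit for placing valuable phrases at high ranks
    Compare lists of a certain length, e.g., top 10 phrases
    Phrases annotated with salience from 1 (very useful) to 6 (useless)
  $\mathrm{gain}(\mathrm{phrase}) = 2^{(\ldots)/\ldots} - 1$ [exponent not recoverable from the source text]
  $\mathrm{NDCG}_N = C \sum_{i=1}^{N} \frac{\mathrm{gain}(\mathrm{phrase}_i)}{\operatorname{ld}(1 + i)}$
    (ld: base-2 logarithm; C: normalization constant so that a perfect ranking scores 1)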
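A minimal sketch of the measure follows. The exponent of the gain function is not recoverable from the slide text, so the sketch assumes a linear mapping of salience 1..6 onto an exponent between 1 and 0, giving gain 1 for salience 1 and gain 0 for salience 6; this is an assumption for illustration only. The function names and the treatment of unrated phrases as salience 6 are likewise hypothetical.

```python
import math


def gain(salience, best=1, worst=6):
    # ASSUMPTION: the slide's exponent is lost; this maps salience 1..6
    # linearly onto an exponent in [1, 0], so gain(1) = 1 and gain(6) = 0.
    return 2 ** ((worst - salience) / (worst - best)) - 1


def dcg(saliences):
    """Discounted cumulative gain; rank i is discounted by ld(1 + i)."""
    return sum(gain(s) / math.log2(1 + i)
               for i, s in enumerate(saliences, start=1))


def ndcg(target_ranking, reference_salience, n=10, unrated=6):
    """NDCG of the top-n target phrases against reference salience scores.

    Phrases missing from the reference are treated as useless (ASSUMPTION).
    """
    target = [reference_salience.get(p, unrated) for p in target_ranking[:n]]
    # Ideal ordering: most salient (lowest score) first.
    ideal = sorted(reference_salience.values())[:n]
    ideal_dcg = dcg(ideal)
    return dcg(target) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Dividing by the DCG of the ideal ordering plays the role of the constant C: a target ranking that reproduces the reference order scores exactly 1.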

Multiple Annotators
  Objective Results
  NDCG for pair-wise comparison only
  5 human annotators
  Human score: average NDCG value of all human-human pairings
    20 individual pairings
  Automatic score: average NDCG value for all human-machine pairings
    5 individual pairings
  Scores based on manual (TRL) and automatic (ASR) transcripts

Evaluation of Human and Automatic Rankings
  [Figure: NDCG (0.5 to 1.0) against the number of phrases considered (1 to 14) for human, automatic/TRL, and automatic/ASR rankings]
  Only small differences due to ASR errors
  Similar quality of human and automatic ranking
  Fairly high human average agreement

Motivation
  Key phrases give a topical overview of the lecture
  Phrase occurrences can serve as a visual index or navigation aid
  Simple example: clickable occurrence bar

StreamGraphs
  Popular in the visualization community
  Stacked splines
  Left to right: playback time (as with occurrence bar)
  Stream width: current phrase dominance
  Dominance: number of occurrences within a certain time frame
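Dominance as described here, the number of occurrences within a certain time frame, can be sketched as a sliding-window count over the occurrence times of a phrase. The window length, the sampling step, and the function name below are assumptions chosen for illustration; the slides do not state the values actually used.

```python
def dominance(occurrence_times, duration, window=60.0, step=10.0):
    """Count phrase occurrences in a window centred on each sample time.

    ASSUMPTION: window and step (in seconds) are illustrative defaults.
    Returns a list of (time, count) pairs from 0 up to the duration.
    """
    half = window / 2
    samples = []
    for k in range(int(duration // step) + 1):
        t = k * step
        count = sum(1 for x in occurrence_times if t - half <= x < t + half)
        samples.append((t, count))
    return samples


# Occurrence times (seconds) of one phrase in a 120-second clip.
print(dominance([5, 12, 14, 100], duration=120, window=20, step=10))
```

One such series per displayed phrase would provide the stream widths over playback time, which a plotting library can then stack and smooth.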

Advantages
  Comfortably display 3 to 6 phrases simultaneously
  Stream width can suggest topical relations of phrases
    Similar widths at the same time indicate co-occurrence: possibly related
    Different widths indicate rare or no co-occurrence: possibly unrelated

User Interactions
  Click into the stream: jump to closest occurrence
  Change the phrases on display: learn about topics and relations
  Interactions can be logged to collect data for customized rankings
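The "jump to closest occurrence" interaction amounts to a nearest-neighbour lookup in the sorted occurrence times of the clicked phrase. The following is a minimal sketch using binary search; the function name is hypothetical, and the choice to prefer the earlier occurrence on an exact tie is an assumption.

```python
import bisect


def closest_occurrence(sorted_times, clicked_time):
    """Return the occurrence time nearest to the clicked playback time.

    sorted_times must be in ascending order; on an exact tie the earlier
    occurrence is returned (ASSUMPTION).
    """
    if not sorted_times:
        raise ValueError("phrase has no occurrences")
    pos = bisect.bisect_left(sorted_times, clicked_time)
    # Only the neighbours just before and at the insertion point can be closest.
    candidates = sorted_times[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda t: abs(t - clicked_time))


print(closest_occurrence([30, 95, 400], 100))
```

The returned time would then be handed to the video player as the new playback position.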

Implementation Details

User Study

Task Based Evaluation
  Typical scenario: preparation for an exam
  Task should be independent of prior knowledge and comprehension
    Locate those segments of the video that cover certain topics
  Two groups of CS graduate students: test and control
    Familiar with topic, speaker and lecture
    5 subjects per group
  Each participant is provided with 3 lecture topics with short description
    Control group: video only
    Test group: the presented interface
  Post-use questionnaire for test group to gather feedback

Results
  Group    Accuracy  Average time (minutes)
  Control  68 %      30
  Test     69 %      21

  Video duration: 42 minutes
  Both groups have a similar accuracy
  The test group was on average about 29% faster
  Most users found the interface to be helpful and easy to use,
  and the key phrase visualization to give a good overview

Summary
  LMELectures, a new corpus of academic spoken English
  Speech recognition system for the LMELectures with a word error rate of 11%
  Unsupervised key phrase extraction and ranking that correlates highly with human rankings
  Novel video lecture browser that helps students to quickly assess the contents

Outlook
  More transcriptions for better acoustic and language models
  Integration of prior knowledge about speaker, room and topic
  Supervised methods for user-tailored rankings
  Larger user study on more lectures