LENA: Automated Analysis Algorithms and Segmentation Detail: How to interpret and not overinterpret the LENA labelings


D. Kimbrough Oller, The University of Memphis, Memphis, TN, USA, and The Konrad Lorenz Institute for Evolution and Cognition Research (KLI), Altenberg, Austria. Supported by NIDCD, NICHD, the KLI, and by the Plough Foundation.

Overview of goal
The LENA software yields a very useful labeling. The proof of its value is in outcomes: prediction of age, group classification, and correlation with other language measures. So at a global level the system has proven itself, and, more importantly, it has proven that automated analysis of massive samples is here to stay.

Interpretive subtlety as a key to the long-term value of the approach
We're going to focus on the labeling functions and how to interpret their outcomes appropriately. The methods are designed to yield a maximally accurate outcome at a global level, the level of the recording. The labeling at the local level is subordinated to this global accuracy goal: much of what one sees in a real labeled file is not correct.

The key conclusions
The many mistakes that the software makes at the local level require us to be intelligent about how we use the information. We need to think about ways the software might lead us astray. But at the same time we need to capitalize on the opportunities of the new method, and not be swayed by irrelevant traditional thinking that insists on some arbitrary metric of reliability.

Maintain optimism
A low kappa, for example, is not necessarily a reason to discard data on any particular automated coding. One needs always to look at the outcome comparisons and reason again about the significance of a low reliability factor. I envision a back and forth between modeling and various outcomes. In the following graph, based on data from the PNAS paper, the canonical syllable (CS) and squeal (SQ) parameters (see red arrows) had very low (but positive and highly statistically significant) kappas of agreement, yet they were strong predictors of age and of group differentiation.
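The kappa statistic discussed here can be computed from any two label sequences of equal length. A minimal pure-Python sketch of Cohen's kappa (not LENA's implementation; the function name is my own):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of positions where the labels match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from the marginal label frequencies
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)
```

Note that kappa corrects raw agreement for chance, which is why a rare category can show a low kappa even when the labeler is doing something systematically right, the situation described above for CS and SQ.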

[Figure: correlations of 12 acoustic parameters with age, all p < 0.004, shown for Typical, Delayed, and Autistic groups. Arrows point to outcomes where agreement between human observer and machine labeling had low kappa, only a bit over 0.2.]

Labeling flowchart
[Flowchart: the LENA DLP recorder feeds Feature Extraction, then Segmentation & Segment ID. Adult segments go to the Sphinx phone decoder for the Adult Word Count; Key Child segments go to Key Child Voc processing for the Key Child Voc Count; Other segments go to Other Segment processing for other output; Conversational Turns processing yields the turns estimate. Results are written to the ITS file and displayed in LENA reports and ADEX.]

Segmentation or labeling topics
There are eight basic categories of segment or label. The Near/Far distinction (based on a likelihood ratio test where SIL likelihood is the denominator) yields seven additional categories; thus 15 total categories. Within key child there are additional distinctions:
- Childvoc (or SCU, the term used in the PNAS paper)
- Cry
- Vegetative and fixed signals other than cry (including laugh): VegFix

Gaussian mixture models (GMMs) at the core of the labeling
Imagine eight acoustic representations (GMMs), all random noise at the beginning of training, each with the task of learning to resemble the acoustic characteristics of one of the eight basic categories. Imagine that a GMM is presented with segments that have been labeled by human transcribers as the category it is supposed to model, and that on each presentation the GMM makes an adjustment in its acoustic characteristics to bring it a little closer to the characteristics of the presented segment. All eight GMMs get this kind of training. After very large numbers of presentations of labeled segments, each GMM tends to stabilize as a model of the kind of segment it is supposed to model.
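The "nudge the model a little closer on each presentation" idea can be illustrated with a toy stand-in in which each category model is a single mean vector rather than a full GMM (a simplification made purely for brevity; the class and parameter names are my own):

```python
import random

class CategoryModel:
    """Toy stand-in for one per-category model: a single mean vector that
    drifts toward each labeled training segment it is shown."""
    def __init__(self, dim, lr=0.05, seed=0):
        rng = random.Random(seed)
        # Start from random noise, as the slide describes
        self.mean = [rng.uniform(-1, 1) for _ in range(dim)]
        self.lr = lr

    def present(self, segment_features):
        # Nudge the model a little toward the presented segment
        self.mean = [m + self.lr * (x - m)
                     for m, x in zip(self.mean, segment_features)]

# One model per basic category, trained only on its own category's segments
models = {cat: CategoryModel(dim=2, seed=i)
          for i, cat in enumerate(["FAN", "MAN", "CHN", "SIL"])}
```

After many presentations the mean converges toward the centroid of the training segments, which is the single-Gaussian analogue of a GMM stabilizing on its category.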

More on the Gaussian mixture models (GMMs)
After training, all the GMMs are non-random, each a composite model of its category (one has the acoustic properties of Female Adult utterances, one of Male Adult, and so on), based on the many different exemplars presented in training. To test the reliability of the GMMs, they are presented with new human-labeled segments that had not been involved in the model training, and the machine labeling is compared quantitatively with the human labeling.

Labeling constraints
Minimum duration constraints on labeled events:
- 1000 ms for MAN/FAN/TVN/OLN
- 800 ms for SIL, NON, CXN
- 600 ms for CHN
The special category of Overlap (OLN/OLF) must include a voice, but remember, it is based on its own GMM, whose training exemplars included one or more voices plus possible other sounds. The start and end times of labeled events are often not where a listener would place them (30-40 ms errors are common). Vocal activity blocks (VABs) and the related idea of Conversations vs. Pauses: the 5 sec rule is used for boundaries between VABs.
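The minimum-duration constraints above amount to a small lookup table. A sketch (function name is my own):

```python
# Minimum segment durations (ms) as listed on the slide
MIN_DURATION_MS = {
    "MAN": 1000, "FAN": 1000, "TVN": 1000, "OLN": 1000,
    "SIL": 800, "NON": 800, "CXN": 800,
    "CHN": 600,
}

def meets_min_duration(label, start_ms, end_ms):
    """True if a labeled segment satisfies its category's minimum duration."""
    return (end_ms - start_ms) >= MIN_DURATION_MS.get(label, 0)
```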

Other durational constraints
Child vocalizations (Childvocs) within CHN/CHF begin when the acoustic energy level first rises to 90% above baseline for at least 50 ms, and end when it falls to less than 10% above baseline for at least 300 ms. Thus 50 ms is the absolute minimum for a Childvoc, and 300 ms is the maximum break within a Childvoc. The easier way to think about this may be that Childvocs are never too short, and never broken up by long silences (never broken up by a silence as long as a typical syllable, i.e., 300 ms), but they can consist of long utterances with many syllables. When a silence longer than 300 ms occurs within a CHN/CHF, a new Childvoc begins.
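The onset/offset rule above is a hysteresis threshold over the energy envelope. A hypothetical sketch, assuming one energy value per 10 ms frame (function and parameter names are my own, not LENA's):

```python
FRAME_MS = 10  # one energy value per 10 ms frame

def child_vocs(energy, baseline, on_ms=50, off_ms=300,
               on_ratio=1.9, off_ratio=1.1):
    """Sketch of the rule: a vocalization starts once energy stays at least
    90% above baseline for on_ms, and ends once it stays less than 10%
    above baseline for off_ms. Returns (start_ms, end_ms) pairs."""
    on_frames, off_frames = on_ms // FRAME_MS, off_ms // FRAME_MS
    vocs, start, loud, quiet = [], None, 0, 0
    for i, e in enumerate(energy):
        if start is None:
            # Waiting for onset: count consecutive loud frames
            loud = loud + 1 if e >= on_ratio * baseline else 0
            if loud >= on_frames:
                start = (i - on_frames + 1) * FRAME_MS
        else:
            # Inside a vocalization: count consecutive quiet frames
            quiet = quiet + 1 if e < off_ratio * baseline else 0
            if quiet >= off_frames:
                vocs.append((start, (i - off_frames + 1) * FRAME_MS))
                start, loud, quiet = None, 0, 0
    if start is not None:
        vocs.append((start, len(energy) * FRAME_MS))
    return vocs
```

Because the offset requires 300 ms of sustained quiet, short within-utterance pauses do not split a Childvoc, matching the "never broken up by a silence as long as a typical syllable" description.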

CUC = child utterance cluster, i.e., CHN/CHF
[Diagram: a sequence of three child utterance clusters (CUC 1, CUC 2, CUC 3) separated by Silence and a Female Adult segment; one of the separating silences = 900 ms.]

[Diagram: a blowup of the first CUC from the prior slide, analyzed in five steps. Vocal island analysis is not a part of the standard LENA algorithms, but was used in the PNAS analysis. The CUC contains four child vocal islands (CVI 1 through CVI 4, each roughly a syllable) separated by silences of 200, 400, and 500 ms. The speech-related vocal islands (SVI 1, SVI 2) group into speech-related child utterances (SCUs); the vegetative and cry islands were not used in our analysis. CVI = child vocal island; SVI = speech-related vocal island (not cry or vegetative); SCU = speech-related child utterance; CU = child utterance.]

How was reliability of segmentations assessed?
Seventy hours of transcribed data, in six ten-minute chunks from each of 70 children balanced for gender and age, were used for testing. This was done with the segmentations from the automated system in front of the transcribers (in the open-source software Transcriber). Transcribers moved boundaries and relabeled with many more categories than the 15 (>70). The lead transcriber reviewed every segment in the entire 70 hours before submission to reliability tests. Transcribers were encouraged to be critical of the machine labeling. Often the transcriptions showed events violating the minimum duration constraints.

How reliable are the segmentations?
The comparison between machine and transcribers was done at the frame level (10 ms), with a collar guard at various settings (nominal 30 ms, the value used for the PNAS paper) to allow small errors without penalty at the start and end times of segments. These settings yield over 0.7 agreement in most cells of the reliability matrices (many have been computed).

a. When rows sum to one, the human listener is the gold standard:

                         Machine: Key child    Machine: Other
  Human: Key child             0.73                 0.27
  Human: Other                 0.05                 0.95

b. When columns sum to one, the machine is the gold standard:

                         Machine: Key child    Machine: Other
  Human: Key child             0.64                 0.03
  Human: Other                 0.36                 0.97
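The two panels differ only in which axis of the underlying agreement matrix is normalized to one. A sketch (function name is my own):

```python
def normalize(matrix, axis):
    """Normalize an agreement/count matrix so rows (axis='row') or
    columns (axis='col') sum to one."""
    if axis == "row":
        return [[v / sum(row) for v in row] for row in matrix]
    # Column-normalize: transpose, normalize rows, transpose back
    cols = list(zip(*matrix))
    return [list(r) for r in zip(*[[v / sum(c) for v in c] for c in cols])]
```

Row-normalizing answers "given the human's label, what did the machine say?", while column-normalizing answers "given the machine's label, what did the human say?", which is why the same comparison yields two different-looking tables.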

These data give a picture of the accuracy of the algorithms within the CHNs and CHFs, that is, the accuracy of differentiating speech-related material from cries and vegetative sounds.

a. Rows sum to one (human listener as gold standard):

                            Machine: SVI    Machine: Cry/Vegetative
  Human: SVI                    0.75                 0.25
  Human: Cry/Vegetative         0.16                 0.84

b. Columns sum to one (machine as gold standard):

                            Machine: SVI    Machine: Cry/Vegetative
  Human: SVI                    0.86                 0.28
  Human: Cry/Vegetative         0.14                 0.72

What is a Vocal Activity Block (or conversation)?
There is lots of room for containing a variety of event types. A VAB must contain at least one of the following four segment types, or any combination of them: MAN, FAN, CHN, or CXN. But a VAB can be broken up by (i.e., a new VAB starts at) any combination of more than 5 sec of the 11 segment types that cannot be part of a conversation, namely MAF, FAF, CHF, CXF, OLF, OLN, NOF, NON, TVF, TVN, or SIL. And of course a VAB can include within it any combination of less than 5 sec of the segments that cannot be part of a conversation.
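The 5-sec boundary rule can be sketched as a single pass over (label, duration) segments. This grouping is a simplified reading of the slide, not LENA's code; all names are my own:

```python
CONVERSATIONAL = {"MAN", "FAN", "CHN", "CXN"}  # the four near, clear types

def vocal_activity_blocks(segments):
    """Group (label, duration_sec) segments into VABs. Non-conversational
    material accumulates in `pending`; if it exceeds 5 s it becomes a pause
    closing the current block, otherwise it stays inside the block."""
    blocks, current, pending, gap = [], [], [], 0.0
    for label, dur in segments:
        if label in CONVERSATIONAL:
            current.extend(pending)  # gap stayed under 5 s: keep it inside
            current.append((label, dur))
            pending, gap = [], 0.0
        else:
            pending.append((label, dur))
            gap += dur
            if gap > 5.0:            # more than 5 s of pause: block boundary
                if current:
                    blocks.append(current)
                current, pending, gap = [], [], 0.0
    if current:
        blocks.append(current)
    return blocks
```

Note how segments that "cannot be part of a conversation" still end up inside a block when their running total stays under 5 sec, exactly the allowance the slide describes.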

What is a conversational turn?
There is lots of room for containing a variety of event types, but CXN is not included. A turn is MAN or FAN + CHN, in either order, within a vocal activity block (VAB). It must not include any combination of more than 5 sec of MAF, FAF, CHF, CXF, OLF, OLN, NOF, NON, TVF, TVN, or SIL. And if a CXN intervenes between a MAN or FAN + CHN, in either order, no conversational turn is counted. A final constraint: a conversational turn is invalidated by any FAN or MAN that was given a 0 word count by the word count module (AWC).
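The pairing, CXN-cancellation, and zero-word-count rules can be sketched as a small state machine over one VAB. This is my own simplified reading of the slide, not LENA's turn-counting code:

```python
ADULT = {"MAN", "FAN"}

def count_turns(block):
    """Count adult-child turn pairs in one VAB. block is a list of
    (label, word_count) tuples; word_count matters only for MAN/FAN.
    A CXN between the pair cancels it, as does a zero-word adult segment;
    other labels (SIL, TVN, ...) are simply skipped for pairing."""
    turns, prev = 0, None  # prev = last pair-relevant (label, word_count)
    for label, wc in block:
        if label == "CXN":
            prev = None                  # CXN intervening: no turn counted
        elif label in ADULT:
            if wc == 0:
                prev = None              # zero-word-count adult invalidates
            else:
                if prev and prev[0] == "CHN":
                    turns += 1           # CHN followed by adult
                prev = (label, wc)
        elif label == "CHN":
            if prev and prev[0] in ADULT:
                turns += 1               # adult followed by CHN
            prev = ("CHN", 0)
    return turns
```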

In summary, what is a pause between VABs?
There is lots of room for containing a variety of event types: any Far event, OLN, TVN, NON, or SIL. A pause must consist of these things in any combination of at least 5 sec.

Major things to look out for
Reliability of labeling is pretty good at the event (segment) level. At the level of a conversational turn, or any sequence of events, you reduce the reliability by amounts unknown, perhaps as much as the product of the reliabilities for the two segments (e.g., 0.7 × 0.7 = 0.49). And consider the complications of interpretation if there are overlap or far segments embedded in the turn (which they are allowed to be), or FAN/MAN with a 0 word count.

More technical topics
The Gaussian mixture models were trained on 230 hours of human-coded data. Labeling is based on a maximum likelihood model: for every segment in a recording, a likelihood is determined under each of the GMMs, and the label with the highest likelihood is chosen. The time frame of operation of the GMMs is 10 ms, but the label decisions are made based on the minimum length of events (i.e., twice the minimum length constraint, or 1200-2000 ms, is the search space). Iteration of procedures occurs in several instances; for example, TV detection is refined in subsequent passes of processing.
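The maximum-likelihood decision itself is just an argmax over per-category scores. A toy sketch in which each trained GMM is stood in for by a single isotropic Gaussian mean (an assumption for brevity; function names are my own):

```python
import math

def gaussian_loglik(x, mean, var=1.0):
    """Log-likelihood of feature vector x under an isotropic Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (xi - mi) ** 2 / (2 * var)
               for xi, mi in zip(x, mean))

def max_likelihood_label(features, models):
    """models: category -> mean vector of a toy single-Gaussian stand-in
    for that category's trained GMM. Returns the highest-likelihood label."""
    return max(models, key=lambda c: gaussian_loglik(features, models[c]))
```

With real GMMs the per-category score would be a log-sum over mixture components, but the decision rule, pick the category whose model scores the segment highest, is the same.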