Deep Learning in Music Informatics

Deep Learning in Music Informatics
Demystifying the Dark Art, Part III: Practicum
Eric J. Humphrey
04 November 2013

Outline
In this part of the talk, we'll touch on the following:
- Recap: What is deep learning
- How you are already doing it (kinda)
- Why should you consider it
- Some tips, tricks, insight, and advice
- How you might do it intentionally
- A few thoughts for the future

Deep learning is...
A cascade of multiple layers, each composed of a few simple operations:
- Linear algebra (matrix operations)
- Point-wise nonlinearities
- Pooling
[Figure: Input → Layer 1 → Layer 2 → ... → Layer N → Output, where each layer is a matrix operation, a point-wise non-linearity, and pooling]
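To make the cascade concrete, here is a minimal NumPy sketch (not from the original slides; the shapes, the rectifier, and max-pooling over pairs are all illustrative choices):

```python
import numpy as np

def layer(x, W, b):
    """One layer: matrix operation, point-wise non-linearity, then pooling."""
    z = np.dot(W, x) + b                 # linear algebra (affine transform)
    h = np.maximum(z, 0.0)               # point-wise non-linearity (rectification)
    return h.reshape(-1, 2).max(axis=1)  # pooling (max over adjacent pairs)

rng = np.random.RandomState(0)
x = rng.randn(32)
for n_out in (32, 16, 8):                # Layer 1 .. Layer N
    W = rng.randn(n_out, x.shape[0]) * 0.1
    x = layer(x, W, np.zeros(n_out))
print(x.shape)                           # (4,): 32 -> 16 -> 8 -> 4 via pooling
```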

Nonlinearities enable complexity
Cascaded non-linearities allow for complex systems composed of simple, linear parts:
- The composite of two linear systems is just another linear system: y = B(Ax) = (BA)x
- The composite of two non-linear systems is an entirely different system: y = h(B h(Ax)) ≠ h(BA h(x))
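A five-line check of this claim (a sketch, assuming tanh as the non-linearity and random matrices):

```python
import numpy as np

rng = np.random.RandomState(0)
A, B = rng.randn(4, 4), rng.randn(4, 4)
x = rng.randn(4)
h = np.tanh

# Two cascaded linear systems collapse into a single linear system:
print(np.allclose(B.dot(A.dot(x)), B.dot(A).dot(x)))              # True

# Interleaving a non-linearity prevents that collapse:
print(np.allclose(h(B.dot(h(A.dot(x)))), h(B.dot(A).dot(h(x)))))  # False (in general)
```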

Why is this relevant to music? Quite literally, music is composed! Hierarchies of pitch and loudness form chords and melodies, then phrases and sections, eventually building entire pieces. Deep structures are well suited to encode these relationships.
[Figure: score excerpt annotated with intervals, passing tones (PT), and neighbor tones (NT)]

Outline
In this part of the talk, we'll touch on the following:
- Recap: What is deep learning
- How you are already doing it (kinda)
- Why should you consider it
- Some tips, tricks, insight, and advice
- How you might do it intentionally
- A few thoughts for the future

The ever-versatile DFT
X[k] = \sum_{n=0}^{N-1} x[n] e^{-j 2\pi nk/N}
[Figure: the DFT written as a matrix operation on the input x, with real and imaginary sinusoidal bases]
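A quick sketch showing the DFT really is just a dot-product (pure NumPy, checked against np.fft):

```python
import numpy as np

N = 64
n = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # DFT matrix: rows are complex sinusoids

x = np.random.randn(N)
X = W.dot(x)                                   # the DFT as a plain dot-product
print(np.allclose(X, np.fft.fft(x)))           # True
```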

Short-Time Fourier Transform

Some common MIR operations
Linear algebra:
- The DFT is a general affine transformation (dot-product)...followed by an absolute value (full-wave rectification)...followed by a logarithm
- The DCT is a general linear affine transformation
- PCA is a learned, linear affine transformation
- NMF is a learned, linear affine transformation
Non-linearities: half/full-wave rectification, peak picking, logarithms
Pooling: histograms, standard deviation, min/max/median

The pieces of deep learning are everywhere in feature design.

Chroma
Audio Signal → Short-time Windowing (800 ms) → Affine Transformation (Constant-Q Filterbank) → Non-linearity (Modulus / Log-Scaling) → Affine Transformation (Octave Equivalence) → Chroma Features

MFCCs
Audio Signal → Short-time Windowing (50 ms) → Affine Transformation (Mel-scale Filterbank) → Non-linearity (Log-Scaling) → Affine Transformation (DCT) → MFCC Features
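As an illustration, a hedged sketch of that MFCC pipeline in Python. The mel filterbank matrix `mel_fb` is assumed to come from elsewhere (hand-built or from a library); the window and coefficient count are illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, mel_fb, n_coefs=13):
    """MFCCs as the pipeline above: affine -> non-linearity -> affine.

    `mel_fb` is an (n_mels x n_spectral_bins) triangular mel filterbank,
    with n_spectral_bins matching len(frame)//2 + 1.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    mel = mel_fb.dot(spectrum)            # affine transformation (mel filterbank)
    log_mel = np.log(mel + 1e-8)          # non-linearity (log-scaling)
    return dct(log_mel, type=2, norm='ortho')[:n_coefs]  # affine transformation (DCT)
```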

Feature design is based on a shared intuition: Build invariance into your representations.

Case in Point: Tempo Estimation
Audio → Subband Decomposition → Onset Detection → Periodicity Analysis → Argmax → BPM
[Figure: time-frequency representation, subband amplitude over time, novelty function, and tempo spectra]

Outline
In this part of the talk, we'll touch on the following:
- Recap: What is deep learning
- How you are already doing it (kinda)
- Why should you consider it
- Some tips, tricks, insight, and advice
- How you might do it intentionally
- A few thoughts for the future

The goal of feature extraction
Model the relationship between inputs (x) and observations (y). Restated: develop representations that encode some desired invariance:
- Robust when this invariance is captured
- Noisy when the variance is uninformative / misleading
Why is this difficult?
- You have to know what you want
- You have to know how to do it
[Diagram: audio → function → chroma]

The Simplest Function
y = mx + b
θ = [m, b]
y = f(x; θ)
[Figure: latent function, observations, and model]

Equations & Parameters
Building good functions consists of two distinct problems:
- Getting the right equation family (general)
- Getting the right parameterization (specific)
[Figure: a single instance within an equation family's infinite solution space]

Feature design proceeds by choosing an equation family and a specific parameterization.

How do we choose parameters?
Often, we manually adjust parameters to optimize some objective function.
A deep network layer, y = h(Wx + b), is the same equation family as:
- The DFT
- The DCT
- PCA <- learned!
- NMF <- learned!
...and plenty of others. The only difference lies in the parameterization.
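A sketch of that claim: one equation family, two parameterizations, one of them learned (PCA via SVD). The dataset here is random stand-in data, purely for illustration:

```python
import numpy as np

def layer(x, W, b, h=np.tanh):
    """The equation family y = h(Wx + b); the parameters decide what it computes."""
    return h(np.dot(W, x) + b)

rng = np.random.RandomState(0)
X = rng.randn(1000, 64)                  # stand-in dataset (e.g. spectral frames)
X -= X.mean(axis=0)

# PCA: weights *learned* from data (top 32 principal directions).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_pca = Vt[:32]

# A random parameterization of the very same equation family.
W_rand = rng.randn(32, 64) * 0.1

identity = lambda z: z                   # PCA is linear, so h is the identity
for W in (W_pca, W_rand):
    print(layer(X[0], W, np.zeros(32), h=identity).shape)   # (32,)
```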

Consider Chroma
- Pitch Class Profiles (Fujishima, 1999)
- De facto standard harmonic representation of audio
- Several off-the-shelf implementations
- Chroma extraction has been refined for over a decade
- Designed for chord recognition, now used for other tasks: structural analysis, cover song retrieval

Chroma is octave equivalence, right?
[Figure: chroma as the product of an octave-folding weight matrix T and the CQT]

What if we learn these weights?
Minimize the squared error ||Chroma − Chroma Templates||², where the chroma is produced by applying the weights (192 x 12) to constant-Q spectra.
[Figure: the squared-error objective relating constant-Q spectra, the learned weights, chroma, and chroma templates]
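A minimal sketch of this idea, assuming random stand-in data in place of labeled CQT frames, one-hot pitch-class templates, and plain gradient descent on the squared error:

```python
import numpy as np

rng = np.random.RandomState(0)
n_bins, n_classes, n_obs = 192, 12, 500      # CQT bins, pitch classes, frames

# Stand-in data; real training would pair CQT frames with labeled templates.
cqt = np.abs(rng.randn(n_obs, n_bins))       # constant-Q spectra
targets = np.eye(n_classes)[rng.randint(0, n_classes, n_obs)]  # chroma templates

W = rng.randn(n_bins, n_classes) * 0.01      # the (192 x 12) weights to learn
lr = 1e-3
for step in range(1000):
    chroma = cqt.dot(W)                      # predicted chroma
    grad = cqt.T.dot(chroma - targets) / n_obs   # gradient of the squared error
    W -= lr * grad                           # find, rather than choose, the weights
```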

Deep learning is just a way of finding good parameters for the equations we already use.

...and those parameters might not be what you expected.

Chroma weights - Start

Chroma weights - End

Chroma, Side-by-side: Octave Equivalence vs. Learned Transform

Learning == Searching!
Given an objective function, you can find, rather than choose, good parameters. The equation family acts as a constraint on the search space.
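For instance, searching for the parameters of the "simplest function" from earlier, y = mx + b, by gradient descent (a sketch; the constants and noise level are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 200)
y = 2.5 * x - 0.7 + 0.1 * rng.randn(200)   # noisy observations of a latent line

m, b = 0.0, 0.0                            # found, not chosen
lr = 0.5
for _ in range(500):
    err = (m * x + b) - y
    m -= lr * np.mean(2 * err * x)         # gradient of the squared error w.r.t. m
    b -= lr * np.mean(2 * err)             # ...and w.r.t. b
print(m, b)                                # approaches 2.5 and -0.7
```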

Feature Design vs. Function Design
Feature design is preferable when you have almost no data:
- A time- and effort-intensive process
- Manually optimizing parameters is slow
Function design is preferable when you have any amount of data:
- Same principles, slightly relaxed formulation
- Quickly explore new representations you might not know how to derive

Like a fretboard, perhaps?

Outline
In this part of the talk, we'll touch on the following:
- Recap: What is deep learning
- How you are already doing it (kinda)
- Why should you consider it
- Some tips, tricks, insight, and advice
- How you might do it intentionally
- A few thoughts for the future

Designing deep networks
We already kind of do this:
- How many principal components do you keep?
- What is the window size of your DFT?
- How many channels should your filterbank have?
Use the same intuition to design deep networks.
[Figure: Constant-Q TFR → Convolutional Layers → Affine Layer → 6D-Softmax Layer → RBF Layer (Templates) → MSE]

Designing your loss function
The optimization criterion is extremely important. Think of it like any greedy system (economies, children, etc.):
- Encourage and reward good behavior
- Penalize bad behavior
- Anticipate poor local minima
Take advantage of domain knowledge!
- How can we steer the system toward the right answer?
- How can we use musical understanding to restrict the search space?

Tricks and tips for training
- Leverage known data distortions: CQT rotations, perceptual codecs, additive noise
- Tuning hyperparameters: sample distributions rather than grid searches (see the sketch below)
- Be mindful of latent priors in your data
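A tiny sketch of sampled (rather than gridded) hyperparameters; the particular knobs and ranges are assumptions, not recommendations:

```python
import numpy as np

rng = np.random.RandomState(0)

def sample_config():
    """One random configuration; log-uniform draws for scale-sensitive knobs."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform on [1e-4, 1e-1]
        "weight_decay": 10 ** rng.uniform(-6, -2),
        "n_hidden": int(rng.choice([256, 512, 1024])),
    }

trials = [sample_config() for _ in range(25)]   # vs. a rigid 3x3x3 grid
```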

Controlling complexity
Regularization can help rein in overly complex models:
- Weight decay
- Sparsity, on weights or representations
- Parameter normalization / limiting
- Dropout, point-noise, data-driven initialization
- Model capacity can be reduced
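Two of these regularizers in sketch form (coefficients are illustrative, not tuned):

```python
import numpy as np

def regularized_loss(pred, target, W, decay=1e-4, sparsity=1e-5):
    """Squared error plus weight decay (L2 on W) and a sparsity penalty (L1 on W)."""
    mse = np.mean((pred - target) ** 2)
    return mse + decay * np.sum(W ** 2) + sparsity * np.sum(np.abs(W))

def dropout(x, p=0.5, rng=np.random):
    """Inverted dropout: zero units with probability p, rescale the survivors."""
    mask = (rng.uniform(size=x.shape) >= p) / (1.0 - p)
    return x * mask
```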

2-Layer Chroma Network - Start

2-Layer Chroma Network - End

Regularized Learning - Start

Regularized Learning - End

Learned Chroma, Side-by-side: Unconstrained vs. Regularized

Outline
In this part of the talk, we'll touch on the following:
- Recap: What is deep learning
- How you are already doing it (kinda)
- Why should you consider it
- Some tips, tricks, insight, and advice
- How you might do it intentionally
- A few thoughts for the future

Slinging code
Python + Theano make this very easy to try out:
- Symbolic differentiation
- Compiles to C code (wicked fast!)
- CUDA (GPU) integration
Stop by the Late-breaking & Demos on Friday for a hack session and install assistance!
- Chord-annotated DFT dataset
- Monophonic instrument dataset
- Sample code for you to use
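For flavor, a minimal Theano sketch of one layer trained with squared error; the shapes echo the chroma example (192 x 12), and everything else is illustrative:

```python
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')                        # batch of inputs
t = T.matrix('t')                        # batch of targets

W = theano.shared(0.01 * np.random.randn(192, 12).astype('float32'), 'W')
b = theano.shared(np.zeros(12, dtype='float32'), 'b')

y = T.tanh(T.dot(x, W) + b)              # one layer: y = h(Wx + b)
loss = T.sqr(y - t).mean()               # squared-error objective

grads = T.grad(loss, [W, b])             # symbolic differentiation
updates = [(p, p - 0.01 * g) for p, g in zip([W, b], grads)]
train = theano.function([x, t], loss, updates=updates)  # compiles to C / CUDA
```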

And iterate until convergence! (This has only gotten faster.)

Outline
In this part of the talk, we'll touch on the following:
- Recap: What is deep learning
- How you are already doing it (kinda)
- Why should you consider it
- Some tips, tricks, insight, and advice
- How you might do it intentionally
- A few thoughts for the future

Tangible Opportunities
- Contribute back to the bigger machine learning community
- Application to problems for which we have little feature insight: timbre, auto-mixing
- Time and sequences are the crux of music: non-linear motion, sequentiality, repetition (long-term structure), harmonic and temporal correlations
- But we know Digital Signal Processing: Convolutional Networks ↔ Normal Filterbanks (FIR); Recurrent Networks ↔ Recursive Filterbanks (IIR) (a small demo follows)
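That DSP correspondence in a few lines of SciPy (a linear toy, ignoring nonlinearities and learning):

```python
import numpy as np
from scipy.signal import lfilter

x = np.random.randn(100)

# One channel of a convolutional layer is an FIR filter:
w = np.array([0.25, 0.5, 0.25])
fir = np.convolve(x, w)[:len(x)]
print(np.allclose(fir, lfilter(w, [1.0], x)))     # True

# A (linear) recurrent unit, y[n] = x[n] + a*y[n-1], is an IIR filter:
a = 0.9
iir = lfilter([1.0], [1.0, -a], x)
```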

Challenges
- Domain knowledge is crucial to success: can we initialize a network with a state-of-the-art system, e.g. for tempo tracking?
- Unsupervised learning, i.e. making sense of unlabeled data, is still a good goal! (Our brains do it, after all.) Reconstruction / computational creativity?
- Music signal processing has advantages over other fields: time is fundamental, but ultimately an open question; strong potential to lay the foundation for better AI
- Analysis of learned functions, insights for music and MIR
- Leverage compositional tools to create music data for training
- Can we finally solve onset detection, tempo tracking, or multi-pitch estimation?

thanks / questions? mail: ejhumphrey@nyu.edu twitter: ejhumphrey