CS 510: Lecture 8. Deep Learning, Fairness, and Bias


Next Week
- All presentations, all the time
- Upload your presentation before class if using slides
- Sign up for a timeslot in the Google doc, if you haven't already done so

Artificial Neural Networks: History
- Belief that it was necessary to model the underlying brain architecture for AI
- In contrast to encoded symbolic knowledge (best represented by expert systems)
- Hebb: learning is altering the strength of synaptic connections

Neural Networks
- Attempt to build a computation system based on the parallel architecture of brains
- Characteristics: many simple processing elements, many connections, simple messages, adaptive interaction

Brains
- 10^11 neurons of > 20 types, 10^14 synapses, 1 ms to 10 ms cycle time
- Signals are noisy spike trains of electrical potential
[Figure: neuron diagram labelling axonal arborization, synapses, axon from another cell, dendrite, axon, nucleus, and cell body or soma. Source: Chapter 20, Section 5]

Benefits of NN
- User friendly (well, reasonably)
- Non-linear
- Noise tolerant
- Many applications: credit fraud/assignment, robotic control

Neurons
- Inputs (either from outside or from other neurons)
- Weighted connections that correspond to synaptic efficiency
- Threshold values to weight the inputs
- Passed through an activation function to determine output

Example Unit
- Binary input/output
- Rule: output 1 if w0*i0 + w1*i1 + wb > 0; output 0 if w0*i0 + w1*i1 + wb <= 0
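The rule for the example unit can be written out directly. This is a minimal sketch; the function name and the particular weights (chosen here so that the unit computes logical AND) are illustrative assumptions, not values from the lecture.

```python
def threshold_unit(i0, i1, w0, w1, wb):
    """Binary unit: output 1 when the weighted sum plus bias weight is positive."""
    total = w0 * i0 + w1 * i1 + wb
    return 1 if total > 0 else 0

# Hypothetical weights: with these values the unit behaves like logical AND.
AND_WEIGHTS = dict(w0=1.0, w1=1.0, wb=-1.5)

# Evaluate the unit on all four binary input pairs: (0,0), (0,1), (1,0), (1,1).
outputs = [threshold_unit(a, b, **AND_WEIGHTS) for a in (0, 1) for b in (0, 1)]
print(outputs)
```

Only the input pair (1, 1) pushes the weighted sum above zero, so only that pair produces an output of 1.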

Activation functions
[Figure: two plots of g(in_i) against in_i, each with maximum value +1, labelled (a) and (b)]
- (a) is a step function or threshold function
- (b) is a sigmoid function 1/(1 + e^(-x))
- Changing the bias weight W_{0,i} moves the threshold location
- Note similarity to logistic regression...

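[Figure: a single unit f(x) with inputs -0.06, -2.5 and 1.4 arriving over weights W1, W2 and W3]

[Figure: the same unit with weights 2.7, -8.6 and 0.002]
x = (-0.06 × 2.7) + (2.5 × 8.6) + (1.4 × 0.002) = 21.34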
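The arithmetic for this unit can be checked in a few lines. The pairing of inputs (-0.06, -2.5, 1.4) with weights (2.7, -8.6, 0.002) is a reading of the flattened figure, and the slide does not say which f(x) is applied, so the sigmoid used below is an assumption.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)); assumed here as the unit's f(x)."""
    return 1.0 / (1.0 + math.exp(-x))

inputs = [-0.06, -2.5, 1.4]      # values arriving at the unit (read from the figure)
weights = [2.7, -8.6, 0.002]     # one weight per input (read from the figure)

# Weighted sum: each input multiplied by its weight, then added up.
x = sum(i * w for i, w in zip(inputs, weights))
y = sigmoid(x)

print(round(x, 2))   # matches the 21.34 shown on the slide
print(y)             # a sigmoid saturates close to 1 for an input this large
```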

How to Adapt?
- Perceptron Learning Rule: change the weight by an amount proportional to the difference between the desired output and the actual output
- As an equation: ΔW_i = η * (D - Y) * I_i, where D is the desired output, Y is the actual output, and I_i is the i-th input
- Stop when it converges
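A minimal sketch of the perceptron learning rule, assuming a step-function unit, a learning rate η of 0.5, and logical AND as a linearly separable toy task (all three are illustrative choices, not part of the lecture). Training stops when a full pass over the examples changes no weight.

```python
def predict(weights, bias, inputs):
    """Step-function unit: 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * i for w, i in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(examples, eta=0.5, max_epochs=100):
    """Apply delta W_i = eta * (D - Y) * I_i until no example is misclassified."""
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(max_epochs):
        changed = False
        for inputs, desired in examples:
            error = desired - predict(weights, bias, inputs)   # D - Y
            if error != 0:
                changed = True
                weights = [w + eta * error * i for w, i in zip(weights, inputs)]
                bias += eta * error   # bias acts as a weight on an always-on input
        if not changed:               # converged: a full pass with no updates
            break
    return weights, bias

# Hypothetical toy task: logical AND, which is linearly separable.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
print(w, b, [predict(w, b, x) for x, _ in AND])
```

Replacing the targets with XOR would leave the loop running until `max_epochs`, since no single hyperplane separates those examples.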

Limits of Perceptrons
- Minsky and Papert, 1969
- Fails on linearly inseparable instances, such as XOR
- Linearly separable: the pattern space can be separated by a single hyperplane

Perceptrons vs Decision Trees

Multilayer Perceptrons (MLP)

Back Propagation
- Start with a set of known examples (supervised approach)
- Assign random initial weights
- Run examples through and calculate the mean-squared error
- Propagate the error by making small changes to the weights at each level
- Use the chain rule to calculate the gradient efficiently
- Lather, rinse, repeat
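The chain-rule step can be made concrete on a tiny network. The sketch below assumes a 2-2-1 network with sigmoid units and a squared-error loss on one example; the weights and the example are arbitrary illustrative values. It computes the gradient by backpropagation and compares one entry against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, w2, b2, x):
    """Hidden activations h and network output y for input x."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
    return h, y

def loss(W1, b1, w2, b2, x, t):
    _, y = forward(W1, b1, w2, b2, x)
    return 0.5 * (y - t) ** 2

def backprop(W1, b1, w2, b2, x, t):
    """Gradients of the loss with respect to W1 and w2 via the chain rule."""
    h, y = forward(W1, b1, w2, b2, x)
    delta_out = (y - t) * y * (1.0 - y)                 # dLoss/d(output pre-activation)
    grad_w2 = [delta_out * hj for hj in h]
    delta_hid = [delta_out * w * hj * (1.0 - hj)        # error propagated one level back
                 for w, hj in zip(w2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    return grad_W1, grad_w2

# Arbitrary illustrative weights and a single training example.
W1, b1 = [[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]
w2, b2 = [1.0, -1.5], 0.05
x, t = [0.7, -1.2], 1.0

grad_W1, grad_w2 = backprop(W1, b1, w2, b2, x, t)

# Finite-difference check on one weight, W1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss(W1_plus, b1, w2, b2, x, t) - loss(W1_minus, b1, w2, b2, x, t)) / (2 * eps)
print(grad_W1[0][1], numeric)
```

A weight update would then subtract a small multiple of each gradient entry from the corresponding weight.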

Gradient Descent Algorithm
- Have some function J(θ0, θ1)
- Want to minimize J(θ0, θ1)
- Outline: start with some θ0, θ1; keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum

The gradient of J (∇J) at a point can be thought of as a vector indicating which way is "uphill" on the surface J(θ0, θ1). If J is an error function, we want to move downhill, opposite to the gradient.

Gradient descent algorithm
- Have function J
- Want to produce vectors θ1, θ2, ... such that J(θ1) > J(θ2) > ...
- Start with θ0
- θ_{i+1} = θ_i - α_i ∇J(θ_i)
- α (alpha) is the learning rate
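The update rule can be tried on a function whose minimum is known. The quadratic J below, its minimum at (3, -1), the fixed learning rate of 0.1, and the step count are all illustrative assumptions.

```python
def J(theta):
    """Hypothetical error surface with its minimum at (3, -1)."""
    t0, t1 = theta
    return (t0 - 3.0) ** 2 + (t1 + 1.0) ** 2

def grad_J(theta):
    """Gradient of J: the direction of steepest ascent at theta."""
    t0, t1 = theta
    return (2.0 * (t0 - 3.0), 2.0 * (t1 + 1.0))

def gradient_descent(theta, alpha=0.1, steps=100):
    """Repeat theta <- theta - alpha * grad J(theta), recording J along the way."""
    history = [J(theta)]
    for _ in range(steps):
        g = grad_J(theta)
        theta = tuple(t - alpha * gi for t, gi in zip(theta, g))
        history.append(J(theta))
    return theta, history

theta, history = gradient_descent((0.0, 0.0))
print(theta, history[0], history[-1])
```

On this surface every step shrinks the distance to the minimum by a constant factor, so the recorded values of J fall steadily; with a learning rate that is too large the same loop would overshoot and diverge.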

Stochastic Gradient Descent
- Update the weights using the gradient of J on a single training example, every time you look at a training example
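A minimal sketch of the stochastic variant, assuming a one-parameter model y = w * x, a squared error per example, and three made-up data points that lie exactly on y = 2x. The function name, learning rate and seed are illustrative.

```python
import random

def sgd_fit_slope(examples, alpha=0.05, steps=200, seed=0):
    """Fit y = w * x by updating w after every single randomly drawn example."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(examples)        # look at one training example
        grad = 2.0 * (w * x - y) * x       # gradient of (w*x - y)^2 for that example
        w -= alpha * grad                  # update immediately, no full pass needed
    return w

# Hypothetical data generated by y = 2x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = sgd_fit_slope(examples)
print(w)
```

Because each update uses one example only, individual steps are noisy on real data; here the data are noise-free, so every step moves w closer to 2.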

Some non-linear activation functions

Most common activation function

A dataset

Fields           class
1.4  2.7  1.9    0
3.8  3.4  3.2    0
6.4  2.8  1.7    1
4.1  0.1  0.2    0
etc

Training the neural network on this training data
- Initialise with random weights
- Present a training pattern: 1.4 2.7 1.9
- Feed it through to get output: 0.8
- Compare with target output: target 0, error 0.8
- Adjust weights based on error
- Present a training pattern: 6.4 2.8 1.7
- Feed it through to get output: 0.9
- Compare with target output: target 1, error -0.1
- Adjust weights based on error
- And so on.

Repeat this thousands, maybe millions of times, each time taking a random training instance and making slight weight adjustments. Algorithms for weight adjustment are designed to make changes that will reduce the error.
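One present / feed through / compare / adjust cycle can be sketched as follows. To keep the example short it uses a single sigmoid unit rather than the multi-layer network pictured on the slides, and the learning rate, seed and function names are assumptions. The error is taken as output minus target, as in the slide's numbers.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(weights, bias, fields):
    """Feed one training pattern through a single sigmoid unit."""
    return sigmoid(sum(w * f for w, f in zip(weights, fields)) + bias)

def adjust(weights, bias, fields, target, eta=0.1):
    """One slight weight adjustment that reduces squared error on this pattern."""
    y = output(weights, bias, fields)
    delta = (target - y) * y * (1.0 - y)       # error scaled by the sigmoid slope
    new_weights = [w + eta * delta * f for w, f in zip(weights, fields)]
    return new_weights, bias + eta * delta

# Initialise with random weights (seeded so the run is repeatable).
rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(3)]
bias = rng.uniform(-1, 1)

# Present the first training pattern from the table, with its class as target.
fields, target = [1.4, 2.7, 1.9], 0

error_before = output(weights, bias, fields) - target   # compare with target output
weights, bias = adjust(weights, bias, fields, target)   # adjust weights based on error
error_after = output(weights, bias, fields) - target

print(error_before, error_after)
```

The adjustment only guarantees a smaller error on the pattern just presented; its effect on the other rows of the table is not controlled.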

The decision boundary perspective Initial random weights

The decision boundary perspective Present a training instance / adjust the weights

The decision boundary perspective Eventually.

The point I am trying to make
- Weight-learning algorithms for NNs are dumb
- They work by making thousands and thousands of tiny adjustments, each making the network do better at the most recent pattern, but perhaps a little worse on many others
- But, by dumb luck, eventually this tends to be good enough to learn effective classifiers for many real applications

Some other points
- Detail of a standard NN weight learning algorithm: later
- If f(x) is non-linear, a network with 1 hidden layer can, in theory, learn perfectly any classification problem. A set of weights exists that can produce the targets from the inputs. The problem is finding them.

Some other "by the way" points
- If f(x) is linear, the NN can only draw straight decision boundaries (even if there are many layers of units)
- NNs use nonlinear f(x) so they can draw complex boundaries, but keep the data unchanged
- SVMs only draw straight lines, but they transform the data first in a way that makes that OK
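The claim that a linear f(x) gives only straight boundaries however many layers are used follows from the fact that stacked linear layers collapse into one. A small check, with two made-up 2x2 weight matrices and no bias terms:

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights for two layers whose activation is the identity (linear f).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [3.0, 0.0]]
x = [2.0, -3.0]

two_layer = matvec(W2, matvec(W1, x))    # pass x through layer 1, then layer 2
collapsed = matvec(matmul(W2, W1), x)    # a single layer with weights W2 * W1

print(two_layer, collapsed)
```

Both computations give the same vector, so the two-layer linear network is exactly as expressive as one linear layer.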

Neural network vocabulary
- Neuron = logistic regression or similar function
- Input layer = input training/test vector
- Bias unit = intercept term/always-on feature
- Activation = response
- Activation function is a logistic (or similar sigmoid nonlinearity)
- Backpropagation = running stochastic gradient descent across a multilayer network
- Weight decay = regularization or Bayesian prior

Deep Learning
- Most current machine learning works well because of human-designed representations and input features
- Machine learning becomes just optimizing weights to best make a final prediction
- Representation learning attempts to automatically learn good features or representations
- Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction

Deep Architecture

Deep Learning Overview
- Train networks with many layers (vs. shallow nets with just a couple of layers)
- Multiple layers work to build an improved feature space
- First layer learns 1st order features (e.g. edges)
- 2nd layer learns higher order features (combinations of first layer features, combinations of edges, etc.)
- In current models, layers often learn in an unsupervised mode and discover general features of the input space, serving multiple tasks related to the unsupervised instances (image recognition, etc.)
- Then final layer features are fed into supervised layer(s)
- And the entire network is often subsequently tuned using supervised training of the entire net, using the initial weightings learned in the unsupervised phase
- Could also do fully supervised versions, etc. (early BP attempts)

Why Deep Learning?

Learning Representations
- Handcrafting features is time-consuming
- The features are often both over-specified and incomplete
- The work has to be done again for each task/domain/...
- We must move beyond handcrafted features and simple ML
- Humans develop representations for learning and reasoning
- Our computers should do the same

The Curse of Dimensionality

Unsupervised Feature and Weight Learning
- Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)
- But almost all data is unlabeled
- Most information must be acquired unsupervised
- Fortunately, a good model of observed data can really help you learn classification decisions

Learning Multiple Levels of Representation

Successive Layers Learn Deeper Representations
[Figure: layer stack, from bottom to top: pixels, edges, object parts (combination of edges), object models]

Impressive Results, Especially on Large Datasets
- Object Recognition: better than anything out there
- Speech Recognition (Google voice search)
- Many other perceptual tasks in vision and NLP

Why now?
- Bigger Data: deep learning works best
- Better Hardware: multicore CPUs and GPUs
- Better Algorithms: autoencoders, deep belief networks, etc.
- Let us train multiple inner layers well

Breakthrough: Unsupervised Pre-training

Difficulties with Supervised Networks
- Early layers of MLP do not get trained well
- Diffusion of gradient: error attenuates as it propagates to earlier layers
- Leads to very slow training
- Exacerbated since the top couple of layers can usually learn any task "pretty well", and thus the error to earlier layers drops quickly as the top layers "mostly" solve the task; lower layers never get the opportunity to use their capacity to improve results, they just do a random feature map
- Need a way for early layers to do effective work
- Often not enough labeled data available while there is lots of unlabeled data
- Can we use unsupervised/semi-supervised approaches to take advantage of the unlabeled data?
- Deep networks tend to have more local minima problems than shallow networks during supervised training

Semi-supervised Learning

Training Deep Networks
- Build a feature space
- Note that this is what we do with SVM kernels, or trained hidden layers in BP, etc., but now we will build the feature space using deep architectures
- Unsupervised training between layers can decompose the problem into distributed subproblems (with higher levels of abstraction) to be further decomposed at subsequent layers

Greedy Layer-wise Training
1. Train the first layer using your data without the labels (unsupervised). Since there are no targets at this level, labels don't help. Could also use the more abundant unlabeled data which is not part of the training set (i.e. self-taught learning).
2. Freeze the first layer parameters and start training the second layer, using the output of the first layer as the unsupervised input to the second layer.
3. Repeat this for as many layers as desired. This builds our set of robust features.
4. Use the outputs of the final layer as inputs to a supervised layer/model and train the last supervised layer(s) (leave early weights frozen).
5. Unfreeze all weights and fine tune the full network by training with a supervised approach, given the pre-processed weight settings.

Greedy Layer-wise Training
- Greedy layer-wise training avoids many of the problems of trying to train a deep net in a supervised fashion
- Each layer gets full learning focus in its turn since it is the only current "top" layer
- Can take advantage of the unlabeled data
- When you finally tune the entire network with supervised training, the network weights have already been adjusted so that you are in a good error basin and just need fine tuning
- This helps with problems of ineffective early layer learning and deep network local minima
- We will discuss the two most common approaches: Stacked Auto-Encoders and Deep Belief Networks

The new way to train multi-layer NNs
[Figure sequence: one layer of the network is highlighted at a time, from the input side upward]
- Train this layer first, then this layer, then this layer, then this layer, finally this layer

The new way to train multi-layer NNs
- EACH of the (non-output) layers is trained to be an auto-encoder
- Basically, it is forced to learn good features that describe what comes from the previous layer

An auto-encoder is trained, with an absolutely standard weight-adjustment algorithm, to reproduce the input. By making this happen with (many) fewer units than the inputs, this forces the hidden layer units to become good feature detectors.

One Auto-encoder

Stacked Auto-encoders
- Stack sparse auto-encoders on top of each other, drop the decode layer each time
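The stacking procedure can be sketched in plain Python. This is an illustration under several assumptions: the auto-encoders are plain undercomplete ones (no sparsity term, unlike the sparse auto-encoders named on the slide), training is per-example gradient descent on squared reconstruction error, and the one-hot data, layer sizes, learning rate and function names are all made up for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(layer, x):
    """Apply one encoder layer (weights W, biases b) to an input vector."""
    W, b = layer
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]

def train_autoencoder(data, n_hidden, rng, epochs=200, lr=0.5):
    """Train a one-hidden-layer auto-encoder to reproduce its input; return the encoder."""
    n_in = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]  # encoder
    b = [0.0] * n_hidden
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]  # decoder
    c = [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            h = encode((W, b), x)
            out = [sigmoid(sum(v * hj for v, hj in zip(row, h)) + ck)
                   for row, ck in zip(V, c)]
            # The target is the input itself (no labels are used).
            d_out = [(o - xk) * o * (1.0 - o) for o, xk in zip(out, x)]
            d_hid = [sum(d_out[k] * V[k][j] for k in range(n_in)) * h[j] * (1.0 - h[j])
                     for j in range(n_hidden)]
            for k in range(n_in):
                for j in range(n_hidden):
                    V[k][j] -= lr * d_out[k] * h[j]
                c[k] -= lr * d_out[k]
            for j in range(n_hidden):
                for i in range(n_in):
                    W[j][i] -= lr * d_hid[j] * x[i]
                b[j] -= lr * d_hid[j]
    return W, b   # the decode layer (V, c) is dropped here

def stack_autoencoders(data, hidden_sizes, seed=0):
    """Greedily train one auto-encoder per layer on the previous layer's codes."""
    rng = random.Random(seed)
    layers, current = [], data
    for n_hidden in hidden_sizes:
        layer = train_autoencoder(current, n_hidden, rng)
        layers.append(layer)                              # this layer is now frozen
        current = [encode(layer, x) for x in current]     # its output feeds the next layer
    return layers, current

# Hypothetical unlabeled data: the eight one-hot vectors of length 8.
data = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layers, features = stack_autoencoders(data, hidden_sizes=[3, 2])
print(len(layers), len(features), len(features[0]))
```

The returned `features` would be the inputs to a supervised final layer, after which the whole stack could be fine-tuned with supervised training.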

Stacked auto-encoders
- Do supervised training on the last layer
- Then do supervised training on the whole network to fine tune the weights

Manifold Learning Hypothesis

Caveats
- Prevent the layers from just learning the identity (learn features instead)
- Undercomplete: middle layer smaller than input
- Sparsity: penalize hidden unit activations; use regularization to keep most nodes at or near 0
- Denoising: stochastically corrupt the training instance, but train the encoder to decode the uncorrupted instance
- Contractive: force the encoder to have small derivatives (stay on manifold)
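Two of these caveats are easy to show in code. The sketch below assumes masking noise (setting inputs to zero with some probability) as the corruption process and an L1 penalty on hidden activations as the sparsity term; both are common choices rather than something the slides specify, and the function names and numbers are illustrative.

```python
import random

def corrupt(x, drop_prob, rng):
    """Denoising: zero out each input with probability drop_prob (masking noise)."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def sparsity_penalty(hidden, weight=0.1):
    """Sparsity: an L1 penalty that grows with the size of hidden activations."""
    return weight * sum(abs(h) for h in hidden)

rng = random.Random(0)
clean = [0.9, 0.1, 0.8, 0.7, 0.2]                # hypothetical training instance
noisy = corrupt(clean, drop_prob=0.3, rng=rng)

# The auto-encoder is fed the corrupted vector but asked to reproduce the clean one.
training_pair = (noisy, clean)

print(training_pair)
print(sparsity_penalty([0.0, 0.0, 0.9]))          # added to the reconstruction loss
```

Because the input and the target now differ, copying the input through unchanged no longer minimizes the loss, which is what discourages the identity mapping.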

Fairness and Learning
- Going to show video of Aylin
- Link on course website to her talk (may be easier for online students if there is feedback)