Chapman & Hall/CRC Machine Learning & Pattern Recognition Series

A First Course in Machine Learning

Simon Rogers
Mark Girolami

CRC Press
Taylor & Francis Group
Boca Raton  London  New York

CRC Press is an imprint of the Taylor & Francis Group, an Informa business

A CHAPMAN & HALL BOOK
List of Tables  xi
List of Figures  xiii
Preface  xix

1 Linear Modelling: A Least Squares Approach  1
  1.1 Linear modelling  1
    1.1.1 Defining the model  2
    1.1.2 Modelling assumptions  3
    1.1.3 Defining what a good model is  4
    1.1.4 The least squares solution – a worked example  6
    1.1.5 Worked example  9
    1.1.6 Least squares fit to the Olympics data  10
    1.1.7 Summary  11
  1.2 Making predictions  12
    1.2.1 A second Olympics dataset  12
    1.2.2 Summary  15
  1.3 Vector/matrix notation  15
    1.3.1 Example  22
    1.3.2 Numerical example  23
    1.3.3 Making predictions  24
    1.3.4 Summary  24
  1.4 Non-linear response from a linear model  25
  1.5 Generalisation and over-fitting  28
    1.5.1 Validation data  29
    1.5.2 Cross-validation  29
    1.5.3 Computational scaling of K-fold cross-validation  32
  1.6 Regularised least squares  33
  1.7 Exercises  35
  Further reading  37

2 Linear Modelling: A Maximum Likelihood Approach  39
  2.1 Errors as noise  39
    2.1.1 Thinking generatively  40
  2.2 Random variables and probability  41
    2.2.1 Random variables  41
    2.2.2 Probability and distributions  42
    2.2.3 Adding probabilities  44
    2.2.4 Conditional probabilities  44
    2.2.5 Joint probabilities  45
    2.2.6 Marginalisation  47
    2.2.7 Aside – Bayes' rule  49
    2.2.8 Expectations  50
  2.3 Popular discrete distributions  53
    2.3.1 Bernoulli distribution  53
    2.3.2 Binomial distribution  53
    2.3.3 Multinomial distribution  54
  2.4 Continuous random variables – density functions  55
  2.5 Popular continuous density functions  58
    2.5.1 The uniform density function  58
    2.5.2 The beta density function  60
    2.5.3 The Gaussian density function  61
    2.5.4 Multivariate Gaussian  62
    2.5.5 Summary  65
  2.6 Thinking generatively... continued  65
  2.7 Likelihood  67
    2.7.1 Dataset likelihood  68
    2.7.2 Maximum likelihood  69
    2.7.3 Characteristics of the maximum likelihood solution  71
    2.7.4 Maximum likelihood favours complex models  74
  2.8 The bias–variance trade-off  75
    2.8.1 Summary  76
  2.9 Effect of noise on parameter estimates  76
    2.9.1 Uncertainty in estimates  78
    2.9.2 Comparison with empirical values  81
    2.9.3 Variability in model parameters – Olympics data  82
  2.10 Variability in predictions  83
    2.10.1 Predictive variability – an example  85
    2.10.2 Expected values of the estimators  86
    2.10.3 Summary  90
  2.11 Exercises  90
  Further reading  93

3 The Bayesian Approach to Machine Learning  95
  3.1 A coin game  95
    3.1.1 Counting heads  97
    3.1.2 The Bayesian way  98
  3.2 The exact posterior  103
  3.3 The three scenarios  104
    3.3.1 No prior knowledge  104
    3.3.2 The fair coin scenario  111
    3.3.3 A biased coin  114
    3.3.4 The three scenarios – summary  116
    3.3.5 Adding more data  116
  3.4 Marginal likelihoods  117
    3.4.1 Model comparison with the marginal likelihood  118
  3.5 Hyperparameters  119
  3.6 Graphical models  120
    3.6.1 Summary  121
  3.7 A Bayesian treatment of the Olympics 100 m data  122
    3.7.1 The model  122
    3.7.2 The likelihood  124
    3.7.3 The prior  124
    3.7.4 The posterior  124
    3.7.5 A first-order polynomial  126
    3.7.6 Making predictions  129
  3.8 Marginal likelihood for polynomial model order selection  131
  3.9 Chapter summary  133
  3.10 Exercises  133
  Further reading  137

4 Bayesian Inference  139
  4.1 Non-conjugate models  139
  4.2 Binary responses  140
    4.2.1 A model for binary responses  140
  4.3 A point estimate – the MAP solution  143
  4.4 The Laplace approximation  149
    4.4.1 Laplace approximation example: Approximating a gamma density  150
    4.4.2 Laplace approximation for the binary response model  151
  4.5 Sampling techniques  154
    4.5.1 Playing darts  154
    4.5.2 The Metropolis–Hastings algorithm  156
    4.5.3 The art of sampling  164
  4.6 Summary  165
  4.7 Exercises  165
  Further reading  167

5 Classification  169
  5.1 The general problem  169
  5.2 Probabilistic classifiers  170
    5.2.1 The Bayes classifier  170
      5.2.1.1 Likelihood – class-conditional distributions  171
      5.2.1.2 Prior class distribution  171
      5.2.1.3 Example – Gaussian class-conditionals  172
      5.2.1.4 Making predictions  173
      5.2.1.5 The naive-Bayes assumption  175
      5.2.1.6 Example – classifying text  175
      5.2.1.7 Smoothing  177
    5.2.2 Logistic regression  179
      5.2.2.1 Motivation  180
      5.2.2.2 Non-linear decision functions  181
      5.2.2.3 Non-parametric models – the Gaussian process  182
  5.3 Non-probabilistic classifiers  183
    5.3.1 K-nearest neighbours  183
      5.3.1.1 Choosing K  184
    5.3.2 Support vector machines and other kernel methods  186
      5.3.2.1 The margin  186
      5.3.2.2 Maximising the margin  187
      5.3.2.3 Making predictions  190
      5.3.2.4 Support vectors  191
      5.3.2.5 Soft margins  192
      5.3.2.6 Kernels  193
    5.3.3 Summary  197
  5.4 Assessing classification performance  198
    5.4.1 Accuracy – 0/1 loss  198
    5.4.2 Sensitivity and specificity  198
    5.4.3 The area under the ROC curve  199
    5.4.4 Confusion matrices  201
  5.5 Discriminative and generative classifiers  203
  5.6 Summary  203
  5.7 Exercises  203
  Further reading  205

6 Clustering  207
  6.1 The general problem  207
  6.2 K-means clustering  208
    6.2.1 Choosing the number of clusters  210
    6.2.2 Where K-means fails  212
    6.2.3 Kernelised K-means  212
    6.2.4 Summary  214
  6.3 Mixture models  215
    6.3.1 A generative process  216
    6.3.2 Mixture model likelihood  217
    6.3.3 The EM algorithm  219
      6.3.3.1 Updating πk  220
      6.3.3.2 Updating μk  221
      6.3.3.3 Updating Σk  222
      6.3.3.4 Updating qnk  223
      6.3.3.5 Some intuition  224
    6.3.4 Example  225
    6.3.5 EM finds local optima  226
    6.3.6 Choosing the number of components  228
    6.3.7 Other forms of mixture components  230
    6.3.8 MAP estimates with EM  232
    6.3.9 Bayesian mixture models  233
  6.4 Summary  234
  6.5 Exercises  234
  Further reading  237

7 Principal Components Analysis and Latent Variable Models  239
  7.1 The general problem  239
    7.1.1 Variance as a proxy for interest  239
  7.2 Principal components analysis  242
    7.2.1 Choosing D  247
    7.2.2 Limitations of PCA  247
  7.3 Latent variable models  248
    7.3.1 Mixture models as latent variable models  248
    7.3.2 Summary  249
  7.4 Variational Bayes  249
    7.4.1 Choosing Q(θ)  251
    7.4.2 Optimising the bound  252
  7.5 A probabilistic model for PCA  252
    7.5.1 Qτ(τ)  254
    7.5.2 Qxn(xn)  256
    7.5.3 Qwm(wm)  257
    7.5.4 The required expectations  258
    7.5.5 The algorithm  258
    7.5.6 An example  260
  7.6 Missing values  260
    7.6.1 Missing values as latent variables  262
    7.6.2 Predicting missing values  264
  7.7 Non-real-valued data  264
    7.7.1 Probit PPCA  264
    7.7.2 Visualising parliamentary data  268
      7.7.2.1 Aside – relationship to classification  272
  7.8 Summary  273
  7.9 Exercises  273
  Further reading  275

Glossary  277

Index  283