Introduction to Machine Learning Applied to Genomic Selection
O. González-Recio — Dpto. Mejora Genética Animal, INIA, Madrid
UPV Valencia, 20-24 Sept. 2010
Outline
1. Introduction
2. Learning System Design: Description; Types of designs
3. Ensemble methods: Overview; Bagging; Boosting; Random Forest; Examples
4. Regularization: Bias-variance trade-off; Model complexity in ensembles
5. Remarks
MACHINE LEARNING
What is Learning?
"Making useful changes in our minds." — Marvin Minsky
"Learning denotes changes in the system that enable the system to do the same task more effectively the next time." — Herbert Simon
Machine Learning
A multidisciplinary field: bio-informatics, statistics, genomics, data mining, astronomy, the web, ...
It avoids rigid parametric models that may be far away from our observations.
Machine Learning in genomic selection
Massive amounts of information: we need to extract knowledge from large, noisy, redundant, incomplete and fuzzy data.
ML can extract hidden relationships that exist in these huge volumes of data and that do not follow a particular parametric design.
Supervised learning: we have a target output (the phenotypes).
Massive Genomic Information
"What does information consume in an information-rich world? It consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it." — Herbert Simon, Nobel Prize in Economics
Overview
Develop algorithms to extract knowledge from a set of data in an effective and efficient fashion, in order to predict yet-to-be-observed data following certain rules.
INTRO
What is Learning?
Given: a collection of examples (data) E (phenotypes and covariates).
Produce: an equation or description (T) that covers all or most examples, and predicts (P) the value, class or category of a yet-to-be-observed example.
The algorithm learns relationships and associations between already observed examples in order to predict phenotypes once their covariates are observed.
MOTIVATION
Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. — Tom Mitchell
INTRO
Machine Learning is one piece in the process of acquiring new knowledge.
[Figure: workflow in data mining tasks, from Inza et al. (2010).]
OUTLINE OF THE COURSE
In this course:
Basic concepts in Machine Learning.
Design of a learning system.
Regularization and the bias-variance trade-off.
Ensemble methods: Boosting, Random Forest.
Outline: 2. Learning System Design — Description
Learning System Design: Description
Why is it important? It is vital for implementing effective learning.
What should be considered:
What question do we want to answer?
What scenario is expected?
Design the learning and validation sets accordingly.
Learning systems in genomic selection
Genome-wide association studies
Goal: find genetic variants associated with a given trait.
What is the phenotype distribution in our population?
Prediction of genetic merit in future generations is less important.
Diseases: case-control, case-case-control designs.
Genomic selection
Goal: predict the genomic merit of individuals without phenotypes.
We expect DNA recombination in subsequent generations.
Re-phenotyping every x generations; overlapping or discrete generations.
Select training and testing sets according to the characteristics of our population.
Outline: 2. Learning System Design — Types of designs
Learning design: same learning and validation set. [figure]
Learning design: k-fold cross-validation. [figure]
Learning design: separate training and testing sets. [figure]
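The k-fold design above can be sketched in a few lines. This is an illustrative Python snippet (not course material): shuffle the record indices, cut them into k folds, and let each fold serve once as the validation set while the rest form the learning set.

```python
import random

def k_fold_splits(n_records, k, seed=0):
    """Partition record indices into k folds for cross-validation.

    Each fold serves once as the validation set while the
    remaining k-1 folds form the learning set.
    """
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        learning = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((learning, validation))
    return splits

# Example: 10 records, 5 folds -> each validation fold holds 2 records
for learning, validation in k_fold_splits(10, 5):
    assert len(validation) == 2 and len(learning) == 8
```

The simple train/test design is the special case of using a single split; in a genomic selection context the split would follow generations rather than random shuffling.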
Outline: 3. Ensemble methods — Overview
Ensemble methods: Introduction
A wide variety of competing methods: the Bayesian alphabet, Bayesian LASSO, ridge regression, logistic regression, neural networks, ...
Their comparative accuracy depends strongly on the trait, the problem addressed and the genetic architecture.
A priori, we do not know which method will be better for a new problem.
Ensembles
Ensembles are combinations of different methods (usually simple models).
They have very good predictive ability because they exploit the complementarity and additivity of the models' performances.
Ensembles have better predictive ability than each method separately.
They have known statistical properties (no "black boxes").
"In a multitude of counselors there is safety."
Ensembles
y = c_0 + c_1 f_1(y, x) + c_2 f_2(y, x) + ... + c_M f_M(y, x) + e
Building ensembles: two steps
1. Develop a population of varied models
Also called base learners.
They may be weak models: only slightly better than a random guess.
Same or different methods; feature subset selection (FSS).
May capture non-linearities and interactions; partition of the input space.
2. Combine them to form a composite predictor
Voting. Estimated weights. Averaging.
Examples of ensembles
Most common ensembles: model averaging (e.g. Bayesian model averaging), bagging, boosting, Random Forest.
Most ensembles use variations of a single kind of model, but more complex and heterogeneous ensembles may be imagined — and can be worse.
Boosting and Random Forest
High-dimensional heuristic search algorithms that detect signal covariates.
They do not model any particular gene action or genetic architecture.
They do not provide a simple estimate of effect size.
Outline: 3. Ensemble methods — Bagging
Bagging (bootstrap aggregating)
Bootstrap the data and average the results:
ŷ = (1/M) Σ_{m=1}^{M} f_m(Ψ_m),
where Ψ_m is a bootstrapped sample of the N records of (y, x), and f_m(·) is the model of choice applied to the bootstrapped data.
Bagging (bootstrap aggregating)
Assume e ~ N(0, σ²_e), i.i.d. Averaging the residuals,
ê_i = (1/M) Σ_{m=1}^{M} (y_i − ŷ_im),
we expect e to approach zero by a factor of M. Unfortunately, the e are not independent across the process, and a limit is usually reached.
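The formula above maps directly to code. This is a minimal sketch (illustrative Python; `mean_learner` is a toy base learner, not a model used in the slides): draw M bootstrap samples Ψ_m of the N records, fit the base learner on each, and average the M predictions.

```python
import random
import statistics

def bagging_predict(x_new, data, base_learner, n_models=50, seed=1):
    """Bootstrap aggregating: fit the base learner on M bootstrap
    resamples of the N records of (y, x), then average predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]  # N records, with replacement
        model = base_learner(boot)
        preds.append(model(x_new))
    return statistics.mean(preds)

# Toy base learner: ignore covariates and predict the bootstrap-sample mean
def mean_learner(sample):
    mu = statistics.mean(y for y, _ in sample)
    return lambda x: mu

data = [(1.0, 0), (2.0, 1), (3.0, 0), (4.0, 1)]  # (phenotype, covariate) records
print(bagging_predict(0, data, mean_learner))    # close to the overall mean 2.5
```

In practice the base learner would be a tree or a regression model; the averaging step is unchanged.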
Outline: 3. Ensemble methods — Boosting
Boosting: properties
Based on AdaBoost (Freund and Schapire, 1996).
May be applied to both continuous and categorical traits.
Bühlmann and Yu (2003) proposed a version for high-dimensional problems.
Covariate selection. Small-step gradient descent.
Boosting in genomic selection
Apply each base learner to the residuals of the previous one.
Implement feature selection at each step.
Apply a small weight to each learner and train a new learner on the residuals.
It does not require specification of the inheritance model (additivity, epistasis, dominance, ...).
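The residual-fitting loop can be sketched as componentwise L2-boosting, in the spirit of the Bühlmann and Yu (2003) version cited above. This is a hypothetical Python sketch, not the course implementation: the base learner selects the single best covariate (SNP) at each step, and its fit is added with a small shrinkage weight.

```python
def componentwise_ls(residuals, x):
    """Base learner with feature selection: regress the residuals on
    the single covariate (SNP) that minimizes the squared error."""
    best = None
    for s in range(len(x[0])):
        col = [row[s] for row in x]
        sxx = sum(c * c for c in col)
        if sxx == 0:
            continue
        b = sum(c * r for c, r in zip(col, residuals)) / sxx
        sse = sum((r - b * c) ** 2 for r, c in zip(residuals, col))
        if best is None or sse < best[0]:
            best = (sse, s, b)
    _, s, b = best
    return lambda row: b * row[s]

def l2_boost(y, x, fit_base, n_steps=200, shrinkage=0.1):
    """L2-boosting sketch: each base learner is fitted to the residuals
    of the current ensemble and added with a small weight."""
    pred = [0.0] * len(y)
    for _ in range(n_steps):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        f = fit_base(residuals, x)                    # feature selection here
        pred = [pi + shrinkage * f(xi) for pi, xi in zip(pred, x)]
    return pred
```

The shrinkage weight and the number of steps control the complexity of the final ensemble, which connects to the regularization section later in the course.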
Outline: 3. Ensemble methods — Random Forest
Random Forest: properties
Based on classification and regression trees (CART).
Analyzes discrete or continuous traits.
Implements feature selection.
Exploits randomization.
Massively non-parametric.
Random Forest: advantages in genomic selection
It does not require specification of the inheritance model (additivity, epistasis, dominance, ...).
It is able to capture complex interactions in the data.
It implements bagging (Breiman, 1996), reducing prediction error by a factor of the number of trees.
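A toy sketch of the two sources of randomization — bootstrap samples plus a random subset of covariates. This illustrative Python is a deliberate simplification: a real Random Forest grows full trees and samples mtry covariates at every node, whereas here each "tree" is a single split (a stump).

```python
import random

def fit_stump(sample, features):
    """One-split regression 'tree': among the candidate features,
    choose the 0 vs non-0 genotype split that minimizes squared error."""
    best = None
    for s in features:
        left = [y for y, x in sample if x[s] == 0]
        right = [y for y, x in sample if x[s] != 0]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    if best is None:  # no usable split: predict the sample mean
        mu = sum(y for y, _ in sample) / len(sample)
        return lambda x: mu
    _, s, ml, mr = best
    return lambda x: ml if x[s] == 0 else mr

def random_forest(data, n_trees=50, mtry=2, seed=3):
    """Random Forest sketch: bagging over stumps, each grown on a
    bootstrap sample with a random subset of mtry covariates."""
    rng = random.Random(seed)
    p = len(data[0][1])
    trees = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]
        feats = rng.sample(range(p), mtry)
        trees.append(fit_stump(boot, feats))
    return lambda x: sum(t(x) for t in trees) / len(trees)
```

Averaging over trees is exactly the bagging step of the previous slides; the extra covariate sampling decorrelates the trees, which is what makes the averaging effective.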
Outline: 3. Ensemble methods — Examples
Examples
L2-Boosting algorithm applied to high-dimensional problems in genomic selection (Genetics Research, 2010). González-Recio O., K.A. Weigel, D. Gianola, H. Naya and G.J.M. Rosa.
Prediction accuracy for productive lifetime in a testing set in dairy cattle (3,304 training / 1,398 testing; 32,611 SNPs):

Method           Pearson correlation   MSE    Bias
Boosting_OLS     0.65                  1.08   0.08
Bayes A          0.63                  2.81   1.26
Bayesian LASSO   0.66                  1.10   0.10
Examples
L2-Boosting algorithm applied to high-dimensional problems in genomic selection (Genetics Research, 2010). González-Recio O., K.A. Weigel, D. Gianola, H. Naya and G.J.M. Rosa.
Prediction accuracy for progeny average feed conversion rate in a testing set in broilers (333 training / 61 testing; 3,481 SNPs):

Method           Pearson correlation   MSE     Bias
Boosting_NPR     0.37                  0.006   -0.018
Boosting_OLS     0.33                  0.006   -0.011
Bayes A          0.27                  0.007   -0.016
Bayesian LASSO   0.26                  0.007   -0.010
Examples
Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (in review). González-Recio O. and S. Forni.
Prediction accuracy (cor(y, ŷ)) for scrotal hernia incidence from three lines of PIC:

Method   Line A (923 purebred)   Line B (919 purebred)   Line C (700 crossbred)
TBA      0.13                    0.34                    0.24
BTL      0.22                    0.32                    0.15
RanFor   0.26                    0.38                    0.23
L2B      0.17                    0.12                    0.24
LhB      0.09                    0.32                    0.15
Examples
Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (in review). González-Recio O. and S. Forni.
Area under the ROC curve for scrotal hernia incidence from three lines of PIC:

Method   Line A (923 purebred)   Line B (919 purebred)   Line C (700 crossbred)
TBA      0.64                    0.70                    0.62
BTL      0.65                    0.69                    0.62
RanFor   0.67                    0.73                    0.67
L2B      0.55                    0.60                    0.67
LhB      0.60                    0.72                    0.66
Examples
Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (in review).
Prediction accuracy for scrotal hernia incidence from a nucleus line of PIC. [figure]
Outline: 4. Regularization — Bias-variance trade-off
Regularization: background
Analysis of high-throughput genotyping data: the "large p, small n" problem.
Models without regularization or feature subset selection (FSS) are prone to overfitting, which decreases predictive ability.
Including all covariates increases the complexity of the model.
Follow Occam's razor: "entities must not be multiplied beyond necessity", or "when the accuracy of two hypotheses is similar, prefer the simpler one".
Generalization is hurt by complexity: every new assumption introduces possibilities for error, so keep it simple.
Bias-variance trade-off
Low complexity: high bias, low variance.
High complexity: low bias, high variance.
Optimum: an intermediate bias-variance trade-off.
[Figure: bias², variance and MSE as functions of model complexity.]
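The trade-off can be made concrete with a small simulation (a hypothetical Python sketch, not the code behind the figure): fit a piecewise-constant regression where the number of bins plays the role of model complexity, and estimate bias² and variance of the prediction at a fixed point over many simulated datasets.

```python
import random

def bin_regressor(data, n_bins):
    """Piecewise-constant fit on [0,1): more bins = more complex model."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in data:
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(y for _, y in data) / len(data)   # fallback for empty bins
    means = [sums[b] / counts[b] if counts[b] else overall
             for b in range(n_bins)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def bias_variance_at(x0, n_bins, true_f, reps=200, n=100, sigma=0.3, seed=4):
    """Estimate bias^2 and variance of the fitted value at x0 across
    repeated simulated datasets y = true_f(x) + noise."""
    rng = random.Random(seed)
    preds = []
    for _ in range(reps):
        data = [(x, true_f(x) + rng.gauss(0, sigma))
                for x in (rng.random() for _ in range(n))]
        preds.append(bin_regressor(data, n_bins)(x0))
    mean_pred = sum(preds) / reps
    bias2 = (mean_pred - true_f(x0)) ** 2
    var = sum((p - mean_pred) ** 2 for p in preds) / reps
    return bias2, var
```

Running this with few bins (low complexity) versus many bins (high complexity) reproduces the pattern on the slide: bias² falls and variance rises as complexity grows.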
Regularization in shrinkage models
Penalization terms or prior assumptions:
Ridge regression: penalizes Σ_{s=1}^{p} β_s².
Bayes B (C, D, ...): sets a SNP variance/coefficient to zero with probability π; the remaining SNP variances are assigned an inverse chi-squared prior distribution.
Bayes A: assumes an inverse chi-squared prior distribution for each SNP variance.
LASSO: penalizes λ Σ_{s=1}^{p} |β_s|.
Bayesian LASSO: a double-exponential prior distribution (controlled by λ) on the SNP coefficients.
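For a single covariate the two penalties above have closed forms, which makes their different behavior easy to see. This is an illustrative Python sketch assuming the objective Σ(y_i − βx_i)² plus the penalty (λβ² for ridge, λ|β| for LASSO): ridge shrinks the least-squares coefficient smoothly toward zero, while LASSO soft-thresholds it and can set it exactly to zero.

```python
def ridge_coef(x, y, lam):
    """Single-covariate ridge estimate: b = Sxy / (Sxx + lambda)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def lasso_coef(x, y, lam):
    """Single-covariate LASSO: soft-thresholding of the least-squares
    solution; small effects are set exactly to zero."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b_ols = sxy / sxx
    thr = lam / (2 * sxx)
    if b_ols > thr:
        return b_ols - thr
    if b_ols < -thr:
        return b_ols + thr
    return 0.0
```

The exact-zero behavior of LASSO is what makes it a feature-selection device, whereas ridge keeps every SNP with a shrunken effect.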
Outline: 4. Regularization — Model complexity in ensembles
Complexity of ensembles
Use simple models; use many of them.
Interpreting many models, even simple ones, may be much harder than interpreting a single model.
Ensembles are competitive in accuracy, though at a probable loss of interpretability.
Overly complex ensembles may lead to overfitting.
Are ensembles truly complex?
They appear so, but do they act so?
Controlling complexity in ensembles is not as simple as merely counting coefficients or assuming prior distributions.
Many ensembles do not show overfitting (bagging, Random Forest).
Control the complexity of an ensemble using cross-validation (more sophisticated ways exist): tune the number of base learners, or use more or less complex base learners.
In general, ensembles are rather robust to overfitting.
Mean squared error in the training set (two different base learners). [figure]
Mean squared error in the testing set (two different base learners). [figure]
Remarks: Machine Learning
New data and concepts are frequently generated in molecular biology and genomics, and ML can adapt efficiently to this fast-evolving field.
ML is able to deal with missing and noisy data from many scenarios.
ML is able to deal with the huge volumes of data generated by novel high-throughput devices, extracting hidden relationships not noticeable to experts.
ML can adjust its internal structure to the data, producing accurate estimates.
ML uses algorithms that learn from the data (combinations of artificial intelligence and statistics).
It needs careful data preprocessing and design of the learning system.
Remarks: Ensembles
Ensembles are combinations of several base learners, improving accuracy substantially.
Ensembles may seem complex, but they do not act so.
They perform extremely well in a variety of possibly complex domains.
They have desirable statistical properties and scale well computationally.
We will learn how to implement ensembles in a genomic selection context.
To take home
The inherent complexity of genetic/biological systems has unknown properties and rules that may not be parametrizable.
Learn from experiences; interpret from knowledge.
If worried about shrinkage, use boosting. If you still believe in a state of nature, use Random Forest.