Introduction to Machine Learning applied to genomic selection


O. González-Recio, Dpto. Mejora Genética Animal, INIA, Madrid
UPV Valencia, 20-24 Sept. 2010

Outline
1. Introduction
2. Learning System Design: description; types of designs
3. Ensemble methods: overview; bagging; boosting; Random Forest; examples
4. Regularization: bias-variance trade-off; model complexity in ensembles
5. Remarks


MACHINE LEARNING
What is Learning?
"Making useful changes in our minds." -Marvin Minsky-
"Denotes changes in the system that enable the system to perform the same task more effectively the next time." -Herbert Simon-
Machine Learning is a multidisciplinary field (bioinformatics, statistics, genomics, data mining, astronomy, the web, ...) that avoids rigid parametric models which may be far away from our observations.


MACHINE LEARNING
Machine Learning in genomic selection
- Massive amounts of information: we need to extract knowledge from large, noisy, redundant, incomplete and fuzzy data.
- ML is able to extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.
- Supervised learning: we have a target output (the phenotypes).

MACHINE LEARNING
Massive Genomic Information
"What does information consume in an information-rich world? It consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it." -Herbert Simon, Nobel Prize in Economics-
Overview: develop algorithms that extract knowledge from a set of data in an effective and efficient fashion, in order to predict yet-to-be-observed data following certain rules.


INTRO
What is Learning?
Given: a collection of examples (data) E (phenotypes and covariates).
Produce: an equation or description (T) that covers all or most examples, and predicts (P) the value, class or category of a yet-to-be-observed example.
The algorithm learns relationships and associations between already observed examples so as to predict phenotypes once their covariates are observed.

MOTIVATION
Definition: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

INTRO
Machine Learning is one piece in the process of acquiring new knowledge.
[Figure: workflow in data mining tasks, from Inza et al. (2010).]

OUTLINE OF THE COURSE
In this course:
- Basic concepts in Machine Learning.
- Design of a learning system.
- Regularization and the bias-variance trade-off.
- Ensemble methods: boosting, Random Forest.

LEARNING SYSTEM DESIGN: DESCRIPTION

Why is it important?
A good design is vital for implementing effective learning. What should be considered:
- What question do we want to answer?
- What scenario is expected?
- Design the learning and validation sets accordingly.

Learning system in genomic selection
Genome-wide association studies:
- Goal: find genetic variants associated with a given trait.
- What is the phenotype distribution in our population?
- Prediction of genetic merit in future generations is less important.
- Diseases: case-control or case-case-control designs.
Genomic selection:
- Goal: predict the genomic merit of individuals without phenotypes.
- We expect DNA recombinations in subsequent generations.
- Re-phenotyping every x generations; overlapping or discrete generations.
Select the training and testing sets according to the characteristics of our population.

LEARNING SYSTEM DESIGN: TYPES OF DESIGNS

Learning design: the same set used for both learning and validation.

Learning design: k-fold cross-validation (a minimal sketch follows below).
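To make the design concrete, here is a minimal k-fold cross-validation sketch in Python. Everything in it is illustrative and assumed, not part of the original slides: a simulated SNP genotype matrix X (animals x markers, coded 0/1/2), a phenotype vector y, and ridge regression as a stand-in for any genomic prediction model.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 1000)).astype(float)      # simulated SNP genotypes (0/1/2)
y = X[:, :20] @ rng.normal(size=20) + rng.normal(size=500)  # 20 causal SNPs plus noise

corrs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    model = Ridge(alpha=100.0).fit(X[train_idx], y[train_idx])
    y_hat = model.predict(X[test_idx])
    corrs.append(np.corrcoef(y[test_idx], y_hat)[0, 1])     # predictive correlation per fold

print("mean predictive correlation over folds:", np.mean(corrs))

Each animal is used for validation exactly once, so the averaged correlation estimates predictive ability without letting testing records contaminate training.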

Learning design: separate training and testing sets.

ENSEMBLE METHODS: OVERVIEW

Introduction
There is a wide variety of competing methods: the Bayes alphabet, Bayesian LASSO, ridge regression, logistic regression, neural networks, ...
Their comparative accuracy depends strongly on the trait, the problem addressed and the genetic architecture. A priori, we do not know which method will be better for a new problem.

Ensembles
- Ensembles are combinations of different methods (usually simple models).
- They have very good predictive ability because they exploit the complementarity and additivity of the models' performances.
- Ensembles have better predictive ability than the individual methods separately.
- They have known statistical properties (no "black boxes").
"In a multitude of counselors there is safety."

Ensembles
$y = c_0 + c_1 f_1(y, \mathbf{x}) + c_2 f_2(y, \mathbf{x}) + \dots + c_M f_M(y, \mathbf{x}) + e$

Building Ensembles: Two Steps
1. Develop a population of varied models:
- Also called base learners.
- May be weak models: only slightly better than a random guess.
- Same or different methods; feature subset selection (FSS).
- May capture non-linearities and interactions.
- Partition of the input space.
2. Combine them to form a composite predictor (see the sketch after this list):
- Voting.
- Estimated weights.
- Averaging.

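As referenced above, a toy sketch of the two steps, reusing the assumed X and y from the earlier cross-validation sketch: step 1 fits a small population of varied base learners, step 2 combines their predictions by averaging, with estimated weights shown as an alternative.

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor

# Step 1: a population of varied (possibly weak) base learners
base_learners = [Ridge(alpha=50.0), Lasso(alpha=0.1), DecisionTreeRegressor(max_depth=3)]
for bl in base_learners:
    bl.fit(X, y)
preds = np.column_stack([bl.predict(X) for bl in base_learners])

# Step 2a: combine by simple averaging
y_hat_avg = preds.mean(axis=1)

# Step 2b: combine with estimated weights c (ideally estimated on held-out data,
# otherwise the weights themselves can overfit)
c, *_ = np.linalg.lstsq(preds, y, rcond=None)
y_hat_weighted = preds @ c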

Examples
Most common ensembles: model averaging (e.g. Bayesian model averaging), bagging, boosting, Random Forest.
Most ensembles use variations of a single kind of model, but more complex and heterogeneous ensembles may be imagined, although they can also be worse.
Boosting and Random Forest are high-dimensional heuristic search algorithms that detect signal covariates. They do not model any particular gene action or genetic architecture, and they do not provide a simple estimate of effect sizes.

ENSEMBLE METHODS: BAGGING

Bagging
Bootstrap aggregating: bootstrap the data and average the results,
$\hat{y} = \frac{1}{M} \sum_{m=1}^{M} f_m(\Psi_m)$,
with $\Psi_m$ a bootstrapped sample of the N records of (y, x), and $f_m(\cdot)$ the model of choice applied to the bootstrapped data.

Bagging
Assume the residuals are i.i.d., $e \sim N(0, \sigma^2_e)$. Averaging the residuals, $\hat{e}_i = \frac{1}{M} \sum_{m=1}^{M} (y_i - \hat{y}_{im})$, we expect them to approach zero by a factor of M (a minimal sketch follows below). Unfortunately, the residuals are not independent during the process, so a limit is usually reached.
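A minimal bagging sketch under the same assumptions (X and y as above), with regression trees playing the role of f_m and each bootstrap sample playing the role of Psi_m:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

M, n = 100, X.shape[0]
rng = np.random.default_rng(2)
models = []
for m in range(M):
    idx = rng.integers(0, n, size=n)   # bootstrap sample Psi_m (sampling with replacement)
    models.append(DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx]))

# Aggregate: average the M bootstrapped predictions
y_hat = np.mean([f.predict(X) for f in models], axis=0)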

ENSEMBLE METHODS: BOOSTING

Boosting
Properties:
- Based on AdaBoost (Freund and Schapire, 1996).
- May be applied to both continuous and categorical traits.
- Bühlmann and Yu (2003) proposed a version for high-dimensional problems.
- Covariate selection.
- Small-step gradient descent.

Boosting in genomic selection
- Apply each base learner to the residuals of the previous one.
- Implement feature selection at each step.
- Apply a small weight to each learner and train a new learner on the residuals (a minimal sketch follows below).
- It does not require specifying an inheritance model (additivity, epistasis, dominance, ...).
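A minimal sketch of the residual-fitting idea, assuming X and y as before, regression stumps as the weak base learners, and nu as the small weight; this illustrates the generic L2-boosting recipe, not the exact algorithms of the papers cited above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

nu, n_steps = 0.1, 200                 # small weight and number of boosting steps
f0 = y.mean()                          # start from the phenotype mean
residual = y - f0
learners = []
for _ in range(n_steps):
    h = DecisionTreeRegressor(max_depth=1).fit(X, residual)  # a stump picks one SNP: feature selection
    learners.append(h)
    residual = residual - nu * h.predict(X)  # the next learner is trained on the shrunken residuals

def boosted_prediction(X_new):
    return f0 + nu * sum(h.predict(X_new) for h in learners)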

ENSEMBLE METHODS: RANDOM FOREST

Random Forest
Properties:
- Based on classification and regression trees (CART).
- Analyzes discrete or continuous traits.
- Implements feature selection.
- Exploits randomization.
- Massively non-parametric.

Random Forest
Advantages in genomic selection (a brief sketch follows below):
- It does not require specifying an inheritance model (additivity, epistasis, dominance, ...).
- It is able to capture complex interactions in the data.
- It implements bagging (Breiman, 1996), reducing the prediction error by a factor of the number of trees.
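A brief scikit-learn sketch (X and y as assumed before). Note how the ingredients listed above appear as parameters: bagged CART trees, plus a random subset of features tried at each split; the out-of-bag score comes for free from the bagging step.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500,     # number of trees in the forest
                           max_features="sqrt",  # random feature subset at each split
                           oob_score=True,       # out-of-bag accuracy, a by-product of bagging
                           random_state=3).fit(X, y)
print("out-of-bag R^2:", rf.oob_score_)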

ENSEMBLE METHODS: EXAMPLES

Examples
L2-Boosting algorithm applied to high-dimensional problems in genomic selection (Genetics Research, 2010). González-Recio O., K.A. Weigel, D. Gianola, H. Naya and G.J.M. Rosa.
Prediction accuracy for productive lifetime in a testing set in dairy cattle (3,304 training / 1,398 testing; 32,611 SNPs):

Method           Pearson correlation   MSE    Bias
Boosting_OLS     0.65                  1.08   0.08
Bayes A          0.63                  2.81   1.26
Bayesian LASSO   0.66                  1.10   0.10

Examples
Same study: prediction accuracy for progeny-average feed conversion rate in a testing set in broilers (333 training / 61 testing; 3,481 SNPs):

Method           Pearson correlation   MSE     Bias
Boosting_NPR     0.37                  0.006   -0.018
Boosting_OLS     0.33                  0.006   -0.011
Bayes A          0.27                  0.007   -0.016
Bayesian LASSO   0.26                  0.007   -0.010

Examples
Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (under review). González-Recio O. and S. Forni.
Prediction accuracy (cor(y, ŷ)) for scrotal hernia incidence in three PIC lines:

Method                 Line A (923 purebred)   Line B (919 purebred)   Line C (700 crossbred)
Bayesian regressions
  TBA                  0.13                    0.34                    0.24
  BTL                  0.22                    0.32                    0.15
Machine Learning
  RanFor               0.26                    0.38                    0.23
  L2B                  0.17                    0.12                    0.24
  LhB                  0.09                    0.32                    0.15

Examples
Same study: area under the ROC curve for scrotal hernia incidence in the three PIC lines:

Method                 Line A (923 purebred)   Line B (919 purebred)   Line C (700 crossbred)
Bayesian regressions
  TBA                  0.64                    0.70                    0.62
  BTL                  0.65                    0.69                    0.62
Machine Learning
  RanFor               0.67                    0.73                    0.67
  L2B                  0.55                    0.60                    0.67
  LhB                  0.60                    0.72                    0.66

Examples
[Figure: prediction accuracy for scrotal hernia incidence in a nucleus line of PIC.]

REGULARIZATION: BIAS-VARIANCE TRADE-OFF

Background
- Analysis of high-throughput genotyping data: the large p, small n problem.
- Models without regularization or feature subset selection (FSS) are prone to overfitting, which decreases predictive ability.
- Including all covariates increases the complexity of the model.
- Follow Occam's Razor: "entities must not be multiplied beyond necessity", or, when the accuracy of two hypotheses is similar, prefer the simpler one.
- Generalization is hurt by complexity: every new assumption introduces possibilities for error, so keep it simple.

Bias-variance trade-off
- Low complexity: high bias, low variance.
- High complexity: low bias, high variance.
- The optimum lies at an intermediate bias-variance trade-off (see the decomposition below).
[Figure: variance, squared bias and MSE as a function of model complexity.]

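For reference, the three curves in the figure correspond to the standard decomposition of the expected squared error at an input $x$, with irreducible noise variance $\sigma^2_e$:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \sigma^2_e$$

Complexity moves the first two terms in opposite directions, which is why the MSE curve is U-shaped.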

Regularization in shrinkage models
Penalization terms or prior assumptions:
- Ridge regression: penalize $\sum_{s=1}^{p} \beta_s^2$.
- Bayes B (C, D, ...): set the SNP variance/coefficient to zero with probability $\pi$; the remaining SNP variances are assigned an inverted chi-squared prior distribution.
- Bayes A: assume an inverted chi-squared prior distribution for each SNP variance.
- LASSO: penalize $\lambda \sum_{s=1}^{p} |\beta_s|$.
- Bayesian LASSO: a double-exponential prior distribution (controlled by $\lambda$) on the SNP coefficients.
A small sketch contrasting the two penalties follows.
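As mentioned above, a small sketch contrasting the L2 and L1 penalties on the assumed X and y from earlier; the alpha values are illustrative, not recommendations.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

ridge = Ridge(alpha=100.0).fit(X, y)              # L2 penalty: shrinks all SNP effects toward zero
lasso = Lasso(alpha=0.05, max_iter=10000).fit(X, y)  # L1 penalty: sets many SNP effects exactly to zero
print("nonzero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
print("nonzero LASSO coefficients:", int(np.sum(lasso.coef_ != 0)))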

REGULARIZATION: MODEL COMPLEXITY IN ENSEMBLES

Complexity of ensembles
- Use simple models; use many of them.
- Interpreting many models, even simple ones, may be much harder than interpreting a single model.
- Ensembles are competitive in accuracy, though at a probable loss of interpretability.
- Overly complex ensembles may lead to overfitting.

Are ensembles truly complex?
They appear so, but do they act so?
- Controlling complexity in ensembles is not as simple as merely counting coefficients or assuming prior distributions.
- Many ensembles do not show overfitting (bagging, Random Forest).
- Control the complexity of an ensemble using cross-validation (more sophisticated ways exist; a sketch follows below): tune the number of models in the ensemble, and use more or less complex base learners.
- In general, ensembles are rather robust to overfitting.
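A sketch of the cross-validation control mentioned above (assumed X and y as before), tuning the number of boosting iterations, i.e. the number of models in the ensemble; scikit-learn's GradientBoostingRegressor stands in for any boosted ensemble.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

for n_estimators in (50, 200, 800):
    gb = GradientBoostingRegressor(n_estimators=n_estimators,
                                   learning_rate=0.1, max_depth=1)
    score = cross_val_score(gb, X, y, cv=5, scoring="r2").mean()  # 5-fold CV accuracy
    print(n_estimators, "learners -> mean CV R^2 =", round(score, 3))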

[Figure: mean squared error in the training set for two different base learners.]

[Figure: mean squared error in the testing set for two different base learners.]


Remarks
Machine Learning:
- New data and concepts are frequently generated in molecular biology and genomics, and ML can efficiently adapt to this fast-evolving field.
- ML is able to deal with missing and noisy data from many scenarios.
- ML is able to deal with the huge volumes of data generated by novel high-throughput devices, extracting hidden relationships not noticeable to experts.
- ML can adjust its internal structure to the data, producing accurate estimates.
- ML uses algorithms that learn from the data (combinations of artificial intelligence and statistics).
- It needs careful data preprocessing and a careful design of the learning system.

Remarks
Ensembles:
- Ensembles are combinations of several base learners, improving accuracy substantially.
- Ensembles may seem complex, but they do not act so.
- They perform extremely well in a variety of possibly complex domains.
- They have desirable statistical properties and scale well computationally.
We will learn how to implement ensembles in a genomic selection context.

To take home
- The inherent complexity of genetic and biological systems involves unknown properties and rules that may not be parameterized.
- Learn from experience, interpret from knowledge.
- If worried about shrinkage, use boosting; if you prefer not to assume a particular state of nature, use Random Forest.