Machine Learning for Chemoinformatics
An introduction
Francesca Grisoni
University of Milano-Bicocca, Dept. of Earth and Environmental Sciences, Milan, Italy
ETH Zurich, Dept. of Chemistry and Applied Biosciences, Zurich, Switzerland
francesca.grisoni@unimib.it
F. Grisoni, BigChem online course, 17.05.2017

Presentation Outline
Introduction
  Definition
  Elements of Machine Learning
Additional Considerations
  The NFL theorem
  Validation
  Applicability
Machine Learning approaches: some examples
  Local methods
  Tree-like approaches
  Neural Networks

Introduction
Machine learning in chemoinformatics
P = f([molecular structure])
  Biological activity prediction
  Toxicity
  Physico-chemical properties
  Multi-objective optimization
  Rational drug design

Introduction
Machine Learning (ML)
"Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed." [1]
"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." [2]
Elements: (1) Data, (2) Task, (3) Performance
https://www.toptal.com/machine-learning/machinelearning-theory-an-introductory-primer
[1] Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210-229.
[2] Mitchell, T. M. (1997). Machine Learning. Burr Ridge, IL: McGraw Hill.

Machine Learning Elements
(1) The data and the G-I-G-O principle
X (n x p): n samples (structures) described by p variables
Y (n x 1): experimental responses for the same n samples (p' response columns, here one)
P = f(0.1, 1, 0, 3, 3.5, 2, ...)
Garbage In = Garbage Out

Machine Learning Elements
(2) Machine Learning Tasks
Unsupervised Learning: X (n x p)
  [Figure: samples plotted in the x1-x2 variable space]
Supervised Learning: X (n x p) + Y (n x 1)
  [Figure: samples plotted in the x1-x2 variable space, together with their responses]
  Classification
  Regression

Machine Learning Elements
(3) Performance: Classification
Confusion matrix (rows: real class; columns: predicted class)
       N    P
  N    TN   FP
  P    FN   TP
Single class
  Sensitivity or True Positive rate (TPR): $Sn = TP / (TP + FN)$
  Specificity or True Negative rate (TNR): $Sp = TN / (TN + FP)$
  Precision: $Pr = TP / (TP + FP)$

Machine Learning Elements
(3) Performance: Classification
Global Performance
  Non-Error Rate (NER) or Balanced Accuracy ∈ [0,1]: $NER = (Sn_P + Sn_N) / 2$, where $Sn_N$ is the sensitivity of class N (the specificity)
  Matthews Correlation Coefficient (MCC) ∈ [-1,1]: $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
  Accuracy ∈ [0,1]: $Acc = (TP + TN) / (TP + TN + FP + FN)$
Example with unbalanced classes: N = 990; P = 10; TN = 990 (100%); TP = 0 (0%)
  Sn_P = 0/10 = 0%
  Sn_N = 990/990 = 100%
  NER = 50%
  Acc = 99%
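
The slide example (N = 990, P = 10, no positive recognised) can be reproduced with a few lines of Python. This is a minimal sketch using the standard textbook definitions; the function name `classification_metrics` and the choice to return NaN for undefined ratios are assumptions made for illustration.

```python
import math

def classification_metrics(tn, fp, fn, tp):
    """Compute common classification metrics from confusion-matrix counts."""
    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan   # Sn of class P
    specificity = tn / (tn + fp) if (tn + fp) else nan   # Sn of class N
    precision = tp / (tp + fp) if (tp + fp) else nan
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    ner = (sensitivity + specificity) / 2                # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as NaN when its denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else nan
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "ner": ner,
        "mcc": mcc,
    }

# Slide example: all 990 negatives predicted correctly, all 10 positives missed.
m = classification_metrics(tn=990, fp=0, fn=10, tp=0)
print(m["accuracy"], m["ner"])  # accuracy 0.99, NER 0.5
```

Accuracy looks excellent while NER exposes that one class is never recognised. In this extreme case no sample is predicted positive, so precision and MCC are undefined.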

Machine Learning Elements
(3) Performance: Regression
Root Mean Squared Error in Prediction (RMSEP), computed from the real responses $y_i$ and the predicted responses $\hat{y}_i$:
$RMSEP = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$
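
As a small illustration of RMSEP, the sketch below computes it for a handful of made-up real and predicted values. The function name `rmsep` and the numbers are hypothetical, chosen only to show the arithmetic.

```python
import math

def rmsep(y_true, y_pred):
    """Root mean squared error between real and predicted responses."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return math.sqrt(squared_errors / len(y_true))

# Made-up responses for four samples (illustration only).
y_real = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 2.5, 4.0]
error = rmsep(y_real, y_hat)
print(round(error, 4))
```

RMSEP is expressed in the same units as the response, which makes it easier to compare with the experimental uncertainty.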

Considerations on ML
Additional Considerations: 1. Choice of the learner
No Free Lunch Theorem: For every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. [1]
[1] Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

Considerations on ML
Additional Considerations: 2. Bias-Variance Trade-off
  Bias → generalization (underfitting)
  Variance → descriptive ability (overfitting)
[Figure: Error as a function of model Complexity]

Considerations on ML
Additional Considerations: 3. Validation
  Initial dataset → Training set + Test set
  Training set → Training set + Validation set (Test set unchanged)
  Cross-validation: the dataset is divided into groups (group 1 to group 5); each group in turn is the Validation set, the remaining groups are the Training set
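
The group-wise scheme above can be sketched as follows. This is one possible way to generate the splits, assuming a random assignment of samples to five groups; the function name `k_fold_indices` and the fixed seed are illustrative choices.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Split sample indices into k groups; each group is the validation set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # random group membership
    folds = [indices[i::k] for i in range(k)]      # k groups of nearly equal size
    splits = []
    for i, validation in enumerate(folds):
        # Training set: all samples from the other k - 1 groups.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits

splits = k_fold_indices(n_samples=23, k=5)
for training, validation in splits:
    print(len(training), len(validation))
```

Every sample is used for validation exactly once, and never while it is also part of the training set of the same split.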

Considerations on ML
Additional Considerations: 4. Applicability
The No Free Dessert (either!) theorem
Machine learning models → Reductionist
  Types of chemical structures
  Physicochemical properties
  Mechanisms of action considered
Applicability Domain: chemical space where the property can be reliably predicted
  Variable ranges: $[\min(x_1), \max(x_1)]$, $[\min(x_2), \max(x_2)]$
  Leverage (hat matrix): $H = X (X^T X)^{-1} X^T$
  Manhattan distance: $D_{xy} = \sum_{j=1}^{p} |x_j - y_j|$
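
Two of the checks named on this slide, the variable-range (bounding box) check and the Manhattan distance, are simple enough to sketch directly. The helper names and the toy training matrix below are hypothetical. How far a query may lie from the training set before it counts as outside the domain is a modelling decision that this sketch does not make.

```python
def bounding_box(X_train):
    """Per-variable minimum and maximum of the training set."""
    columns = list(zip(*X_train))
    return [min(c) for c in columns], [max(c) for c in columns]

def inside_box(x, lower, upper):
    """True if every variable of x lies within the training range."""
    return all(lo <= v <= hi for v, lo, hi in zip(x, lower, upper))

def manhattan(x, y):
    """Manhattan distance: sum over the p variables of |x_j - y_j|."""
    return sum(abs(a - b) for a, b in zip(x, y))

def nearest_training_distance(x, X_train):
    """Distance from x to its closest training sample."""
    return min(manhattan(x, row) for row in X_train)

# Toy training set with p = 2 variables (illustration only).
X_train = [[0.1, 1.0], [0.4, 2.0], [0.9, 1.5], [0.5, 3.0]]
lower, upper = bounding_box(X_train)

query_in = [0.5, 2.0]
query_out = [2.0, 5.0]
print(inside_box(query_in, lower, upper), nearest_training_distance(query_in, X_train))
print(inside_box(query_out, lower, upper), nearest_training_distance(query_out, X_train))
```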

Considerations on ML
Standard Machine Learning workflow (in chemoinformatics)
[Figure: workflow diagram with the stages Information extraction → Applicability & predictivity → Application & new knowledge]

Machine Learning methods (overview)
1. Decision Tree-based learning
  Decision Trees
  Random Forest
2. Local Methods
  k-means algorithm
  k-nn algorithm
3. Artificial Neural Networks
  Feed-Forward NN
  Kohonen Maps

(1) Decision Tree Learning
[Figure: tree with Root node, Decision node(s) and Leaves]
1. Easy to interpret
2. No data pretreatment
3. Numerical/categorical variables
4. Classification and regression
5. Non-parametric
6. Automatic variable selection

(1) Decision Tree Learning
Random Forest
Bagging (Bootstrap Aggregating) = the power of the crowd
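
Bagging can be illustrated with a deliberately tiny base learner. The sketch below is not a full Random Forest, which typically also draws a random subset of variables at each split. It only shows the two ingredients named on the slide: bootstrap samples and aggregation by majority vote. The one-split "stump", the function names and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """One-split tree: choose the (variable, threshold) with fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= threshold]
            right = [label for row, label in zip(X, y) if row[j] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, x):
    j, threshold, left_label, right_label = stump
    return left_label if x[j] <= threshold else right_label

def fit_bagging(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (n draws with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([X[i] for i in sample], [y[i] for i in sample]))
    return ensemble

def predict_bagging(ensemble, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, x) for stump in ensemble)
    return votes.most_common(1)[0][0]

# Toy one-variable data set: class 0 at low values, class 1 at high values.
X = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = fit_bagging(X, y)
print(predict_bagging(ensemble, [0.05]), predict_bagging(ensemble, [0.95]))
```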

(2) Local approaches
k-means clustering
[Figure: samples in the x1-x2 variable space, with cluster membership shown at each step]
1. Select a k (here k = 3)
2. Random assignment
3. Centroid calculation
4. Closest centroid (steps 3 and 4 are repeated)
5. End
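
The five steps listed above translate almost line by line into code. The following is a minimal sketch, assuming squared Euclidean distance and a stop when the assignments no longer change. The handling of empty clusters (re-seeding with a random sample) is an assumption, since the slides do not cover that case.

```python
import random

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(points):
    """Mean of each variable over the given points."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(X, k=3, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: random assignment of every sample to one of the k clusters.
    labels = [rng.randrange(k) for _ in X]
    centroids = []
    for _ in range(max_iter):
        # Step 3: centroid calculation.
        centroids = []
        for c in range(k):
            members = [x for x, label in zip(X, labels) if label == c]
            # Assumption: an empty cluster is re-seeded with a random sample.
            centroids.append(centroid(members) if members else list(rng.choice(X)))
        # Step 4: assign each sample to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(x, centroids[c]))
            for x in X
        ]
        # Step 5: end when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Toy data in the x1-x2 space (illustration only).
X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
     [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
     [9.0, 1.0], [9.2, 1.1], [8.8, 0.9]]
labels, centroids = k_means(X, k=3)
print(labels)
```

Because the starting assignment is random, different seeds can lead to different final partitions. Running the algorithm several times is a common precaution.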

(2) Local approaches
k-nearest Neighbor (knn)
[Figure: a sample with unknown response ("?") among the training samples, shown for k = 1, 2, 3 and 4]
1. Calculate a distance (n_train times)
2. Select a number of neighbors (k) to predict the response

(2) Local approaches
k-nearest Neighbor (knn)
1. Good for large training sets with localized differences
2. Difficult to interpret
3. Which k?
4. Which distance measure?
5. Curse of dimensionality → Variable selection
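
A k-nearest neighbor classifier needs little more than a distance function and a vote. The sketch below assumes Euclidean distance and a simple majority vote. Both are choices, as the questions "Which k?" and "Which distance measure?" above indicate. The class labels and coordinates are made up.

```python
from collections import Counter

def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def knn_predict(x, X_train, y_train, k=3, distance=euclidean):
    """Predict the class of x from its k nearest training samples."""
    # Step 1: one distance per training sample (n_train distances in total).
    ranked = sorted(
        ((distance(x, row), label) for row, label in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )
    # Step 2: keep the k closest samples and take a majority vote.
    neighbors = [label for _, label in ranked[:k]]
    return Counter(neighbors).most_common(1)[0][0]

# Toy training set (illustration only).
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]]
y_train = ["inactive", "inactive", "inactive", "active", "active", "active"]

print(knn_predict([1.1, 1.0], X_train, y_train, k=3))
print(knn_predict([3.0, 2.9], X_train, y_train, k=1))
```

Passing another function as `distance` (for example a Manhattan distance) changes the neighborhood without touching the rest of the code.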

(3) Neural Networks
Artificial Neurons
[Figure: Inputs x1, x2, x3, x4, ..., xp → Activation Function f(x) → Output y]
Marini, F. (2009). Neural networks. In: Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Vol. 3.
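
A single artificial neuron can be written in a few lines. The sketch assumes the usual formulation, in which the inputs are combined as a weighted sum plus a bias before the activation function is applied, and it uses a sigmoid as activation. The slide only names "Inputs", "Activation Function" and "Output", so the weights, the bias and the sigmoid are assumptions.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, weights, bias=0.0, activation=sigmoid):
    """Output y of one neuron for the inputs x1 ... xp."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias  # weighted sum of inputs
    return activation(net)

# Hypothetical inputs and weights for p = 4 variables.
x = [0.1, 1.0, 0.0, 3.0]
weights = [0.5, -0.2, 0.8, 0.1]
y = neuron_output(x, weights)
print(round(y, 3))
```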

(3) Neural Networks
Feed-Forward NN
[Figure: Input Layer → Hidden Layer(s) → Output Layer]
1. Untrained Network
2. Compute the outcome
3. Compute the error
4. Back propagation learning
5. Repeat until stop criterion

(3) Neural Networks
Feed-Forward NN
[Figure: Error vs. Epochs, with one curve for the Training set and one for the Validation set]
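
One common way to use training and validation error curves of this kind is early stopping: training is halted once the validation error has stopped improving. The sketch below assumes that reading of the figure. The `patience` parameter and the error values are hypothetical.

```python
def best_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error seen before stopping.

    Scanning stops after `patience` consecutive epochs without improvement.
    """
    best, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best

# Made-up curves: training error keeps falling, validation error turns upward.
training_errors = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
validation_errors = [0.95, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.67]
stop_at = best_epoch(validation_errors)
print(stop_at)  # epoch index (counting from 0) with the lowest validation error: 3
```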

(3) Neural Networks
Kohonen Maps
Unsupervised non-linear mapping: p-dimensional → 2-dimensional
Topology-preserving map

(3) Neural Networks
Kohonen Maps
[Figure: Input → Weights → Neurons of the Kohonen Layer]
1. Competitive Learning
  Similarity to each neuron
  Winner takes all
2. Collaborative Learning
  Winning neuron update
  Update of close neurons
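
The two learning phases can be sketched for a single input sample. The example assumes a small 3 x 3 map, squared Euclidean distance as the (dis)similarity measure, and a Gaussian neighborhood on the grid. The learning rate, radius and all names are illustrative, not taken from the slides.

```python
import math
import random

def distance2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def winner(x, weights):
    """Competitive learning: grid position of the neuron most similar to x."""
    return min(weights, key=lambda pos: distance2(weights[pos], x))

def update(x, weights, learning_rate=0.5, radius=1.0):
    """Collaborative learning: move the winner and its close neurons towards x."""
    win = winner(x, weights)
    for pos, w in weights.items():
        grid_dist2 = (pos[0] - win[0]) ** 2 + (pos[1] - win[1]) ** 2
        # Assumed Gaussian neighborhood: 1 for the winner, smaller further away.
        influence = math.exp(-grid_dist2 / (2 * radius ** 2))
        weights[pos] = [wi + learning_rate * influence * (xi - wi)
                        for wi, xi in zip(w, x)]
    return win

# 3 x 3 Kohonen layer with p = 4 randomly initialised weights per neuron.
rng = random.Random(0)
size, p = 3, 4
weights = {(r, c): [rng.random() for _ in range(p)]
           for r in range(size) for c in range(size)}

x = [0.2, 0.9, 0.1, 0.5]
first_winner = winner(x, weights)
before = distance2(weights[first_winner], x)
update(x, weights)
after = distance2(weights[first_winner], x)
print(first_winner, before > after)
```

In a full training run this update is repeated over all samples for many epochs, usually while shrinking the learning rate and the radius.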

(3) Neural Networks
Kohonen Maps
Top map (compounds)
Weight maps (p)

Which ML algorithm?
  Purpose (clustering, regression, classification)
  Performance vs interpretability
  Covered chemical space (e.g., AD)
  Types of included variables
http://scikit-learn.org/stable/tutorial/machine_learning_map/

Summary
  Machines can learn from our data
  No ML algorithm always outperforms the others
  Validation and Applicability Domain assessment are crucial
  Pay attention to what the performance metric is telling you!

Supplementary reading
Theory and Algorithms
  Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  Marini, F. (2009). Neural networks. In: Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Vol. 3.
Online resources
  [Coursera] Ng, A. Machine Learning, Stanford University. https://www.coursera.org/learn/machine-learning
  [Online Book] Neural Networks and Deep Learning. http://neuralnetworksanddeeplearning.com/