Machine Learning for Chemoinformatics An introduction

1 Machine Learning for Chemoinformatics: An introduction
Francesca Grisoni
University of Milano-Bicocca, Dept. of Earth and Environmental Sciences, Milan, Italy
ETH Zurich, Dept. of Chemistry and Applied Biosciences, Zurich, Switzerland
F. Grisoni, BigChem online course

2 Presentation Outline
Introduction
  Definition
  Elements of Machine Learning
Additional Considerations
  The NFL theorem
  Validation
  Applicability
Machine Learning approaches: some examples
  Local methods
  Tree-like approaches
  Neural Networks

3 Introduction: Machine learning in chemoinformatics
P = f(structure)
  Biological activity prediction
  Toxicity
  Physico-chemical properties
  Multi-objective optimization
  Rational drug design

4 Introduction: Machine Learning (ML)
"Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed." [1]
"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." [2]
Elements: (1) Data, (2) Task, (3) Performance
[1] Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3).
[2] Mitchell, T. M. (1997). Machine Learning. Burr Ridge, IL: McGraw Hill, 45(37).

5 Machine Learning Elements (1) The data and the G-I-G-O principle
[Figure: data matrix X (n x p; n samples, p variables) holding the structures, and response vector Y (n x 1; n samples, column p') holding the experimental responses]
P = f(X), e.g. P = f(0.1, 1, 0, 3, 3.5, 2, ...)
Garbage In = Garbage Out

8 Machine Learning Elements (2) Machine Learning Tasks
Unsupervised Learning: X (n x p) only (n samples, p variables)
[Figure: samples plotted in the x1-x2 plane]

10 Machine Learning Elements (2) Machine Learning Tasks
Supervised Learning: X (n x p) + Y (n x 1)
[Figure: samples plotted in the x1-x2 plane, with their responses Y]

14 Machine Learning Elements (2) Machine Learning Tasks
Classification
Regression

15 Machine Learning Elements (3) Performance: Classification, Single class
Confusion matrix (rows: real class, columns: predicted class)

        N     P
  N     TN    FP
  P     FN    TP

Sensitivity or True Positive rate (TPR) = TP / (TP + FN)
Specificity or True Negative rate (TNR) = TN / (TN + FP)
Precision = TP / (TP + FP)

16 Machine Learning Elements (3) Performance: Classification, Global Performance
Non-Error Rate (NER) or Balanced Accuracy = (TPR + TNR) / 2, ∈ [0, 1]
Matthews Correlation Coefficient (MCC) = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)), ∈ [−1, 1]
Accuracy = (TP + TN) / (TP + TN + FP + FN), ∈ [0, 1]

17 Machine Learning Elements (3) Performance: Classification, Global Performance
Example with strongly imbalanced classes: N = 990; P = 10
TN = 990 (100%); TP = 0 (0%)
Sn_P = 0/10 = 0%; Sn_N = 990/990 = 100%
NER = 50%; Acc = 99%
Accuracy looks excellent although no positive sample is recognized; the Non-Error Rate exposes the problem.
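The arithmetic of the imbalanced example can be checked with a few lines of Python. This is a minimal sketch using the standard textbook definitions of the metrics; the function name `classification_metrics` and the convention of returning 0 when MCC is undefined are assumptions made here, not something stated on the slides.

```python
import math

def classification_metrics(tn, fp, fn, tp):
    """Compute common two-class metrics from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # TPR
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # TNR
    ner = (sensitivity + specificity) / 2                # balanced accuracy
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix sums to zero;
    # returning 0 in that case is a convention assumed for this sketch.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ner": ner, "accuracy": accuracy, "mcc": mcc}

# Slide example: 990 negatives all predicted negative, 10 positives all missed.
m = classification_metrics(tn=990, fp=0, fn=10, tp=0)
print(m)
```

For these counts the sketch reproduces NER = 0.5 and accuracy = 0.99, and the MCC falls into the undefined case because no sample is predicted positive.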

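21 Machine Learning Elements (3) Performance: Regression
Root Mean Squared Error in Prediction (RMSEP), computed from the real responses y and the predicted responses ŷ:
$\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
[Figure: predicted versus real responses]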
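As a small illustration, RMSEP can be computed directly from paired real and predicted values. The sketch below assumes the usual definition (square root of the mean squared residual over the predicted samples); the function name `rmsep` and the numbers are made up for the example.

```python
import math

def rmsep(y_real, y_pred):
    """Root mean squared error between real and predicted responses."""
    if len(y_real) != len(y_pred) or not y_real:
        raise ValueError("y_real and y_pred must be non-empty and of equal length")
    squared_residuals = [(y - p) ** 2 for y, p in zip(y_real, y_pred)]
    return math.sqrt(sum(squared_residuals) / len(squared_residuals))

# Illustrative values only: residuals are 3 and 4, so RMSEP = sqrt((9 + 16) / 2).
error = rmsep([0.0, 0.0], [3.0, 4.0])
print(error)
```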

22 Considerations on ML: Additional Considerations

23 Considerations on ML: Additional Considerations
1. Choice of the learner
No Free Lunch Theorem: "For every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner." [1]
[1] Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

24 Considerations on ML: Additional Considerations
2. Bias-Variance Trade-off
Bias → generalization (underfitting)
Variance → descriptive ability (overfitting)
[Figure: error as a function of model complexity]

26 Considerations on ML: Additional Considerations
3. Validation
Initial dataset → Training set + Test set
Training set → Training set + Validation set (the Test set is kept apart)

29 Considerations on ML: Additional Considerations
3. Validation
[Figure: the data are divided into five groups (group 1 to group 5); one group is used as the Validation set and the remaining groups as the Training set]
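The five-group figure corresponds to what is commonly called k-fold cross-validation. The sketch below is one possible way to generate such splits, assuming that each group takes a turn as the validation set; the function name `k_fold_splits`, the shuffling and the seed are choices made for this example.

```python
import random

def k_fold_splits(n_samples, n_groups=5, seed=0):
    """Yield (training_indices, validation_indices) for each of n_groups folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random group membership
    groups = [indices[g::n_groups] for g in range(n_groups)]
    for g in range(n_groups):
        validation = groups[g]                    # one group is held out
        training = [i for h, grp in enumerate(groups) if h != g for i in grp]
        yield training, validation

splits = list(k_fold_splits(n_samples=10, n_groups=5))
for training, validation in splits:
    print("train:", sorted(training), "validate:", sorted(validation))
```

Every sample is used for validation exactly once, and it is never part of the training set in the fold where it is validated.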

30 Considerations on ML: Additional Considerations
4. Applicability
The No Free Dessert (either!) theorem
Machine learning models → Reductionist
  Types of chemical structures
  Physicochemical properties
  Mechanisms of action considered
Applicability Domain: chemical space where the property can be reliably predicted

31 Considerations on ML: Additional Considerations
4. Applicability
Variable ranges: $[\min(x_1), \max(x_1)]$, $[\min(x_2), \max(x_2)]$
Leverage: $H = X (X^{T} X)^{-1} X^{T}$
Manhattan distance: $D^{Man}_{xy} = \sum_{j=1}^{p} |x_j - y_j|$
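Two of these applicability-domain ideas are easy to sketch in plain Python: a range check on each variable and the Manhattan distance between two samples. The function names and the toy training matrix are hypothetical, and the leverage approach is left out because it needs a matrix inverse.

```python
def variable_ranges(X_train):
    """Return (min, max) of each variable over the training samples."""
    return [(min(col), max(col)) for col in zip(*X_train)]

def inside_ranges(x, ranges):
    """True if every variable of x lies within the training range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(x, ranges))

def manhattan(x, y):
    """Manhattan distance: sum over variables of |x_j - y_j|."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Toy training set with p = 2 variables (illustrative values only).
X_train = [[0.1, 1.0], [0.5, 3.0], [0.9, 2.0]]
ranges = variable_ranges(X_train)

print(inside_ranges([0.4, 2.5], ranges))   # within both ranges
print(inside_ranges([1.5, 2.0], ranges))   # first variable out of range
print(manhattan([0.0, 0.0], [1.0, -2.0]))
```

A distance-based domain would typically compare the distance of a new sample to the training samples against some threshold; how that threshold is set is not specified on the slide.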

32 Considerations on ML: Standard Machine Learning workflow (in chemoinformatics)
Information extraction
Applicability & predictivity
Application & new knowledge

36 Machine Learning methods (overview)
1. Decision Tree-based learning
   Decision Trees
   Random Forest
2. Local Methods
   k-means algorithm
   k-nn algorithm
3. Artificial Neural Networks
   Feed-Forward NN
   Kohonen Maps

37 (1) Decision Tree Learning
Structure: Root node, Decision node(s), Leaves
1. Easy to interpret
2. No data pretreatment
3. Numerical/categorical variables
4. Classification and regression
5. Non-parametric
6. Automatic variable selection
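To make the root node / decision node / leaf vocabulary concrete, the sketch below walks a sample down a small hand-written tree. The tree itself is hypothetical: the variable names (`logP`, `MW`), thresholds and class labels are invented for illustration, and no tree-growing algorithm is shown.

```python
# Hypothetical tree: each decision node tests one variable against a threshold.
tree = {
    "variable": "logP", "threshold": 3.0,          # root node
    "left": {"leaf": "inactive"},                  # leaf
    "right": {                                     # decision node
        "variable": "MW", "threshold": 400.0,
        "left": {"leaf": "active"},
        "right": {"leaf": "inactive"},
    },
}

def predict(node, sample):
    """Follow the decisions from the root until a leaf is reached."""
    while "leaf" not in node:
        go_left = sample[node["variable"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

print(predict(tree, {"logP": 4.0, "MW": 350.0}))
```

Because each prediction is just a chain of threshold tests, the path taken for a sample can be read off directly, which is what makes trees easy to interpret.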

39 (1) Decision Tree Learning: Random Forest
Bagging (Bootstrap Aggregating) = the power of the crowd
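Bagging can be sketched with very simple base learners. The example below draws bootstrap samples (sampling with replacement), fits a one-threshold "stump" to each, and aggregates the predictions by majority vote. It illustrates only the bagging part; a Random Forest additionally grows full trees and samples variables at each split. All names and the toy data are assumptions made for the example.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold t that minimizes errors of the rule 'x > t means class 1'."""
    best_t, best_errors = None, None
    for t, _ in sample:
        errors = sum((x > t) != (label == 1) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def bagging(data, n_models=25, seed=0):
    """Fit one stump per bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]   # sampling with replacement
        models.append(fit_stump(bootstrap))
    return models

def predict(models, x):
    """Aggregate the individual votes by majority."""
    votes = [int(x > t) for t in models]
    return Counter(votes).most_common(1)[0][0]

# Toy one-variable data: class 1 when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]
models = bagging(data)
print(predict(models, 2), predict(models, 8))
```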

40 (2) Local approaches: k-means clustering
[Figure: samples plotted in the x1-x2 plane]
1. Select a k (here, k = 3)
2. Random assignment of the samples to the k clusters
3. Centroid calculation
4. Assignment of each sample to the closest centroid (steps 3 and 4 are repeated)
5. End
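The steps listed above can be sketched in plain Python. This is a minimal version under a few assumptions that the slides do not specify: squared Euclidean distance, stopping when the assignments no longer change, and re-seeding an empty cluster with a randomly chosen sample.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(X, k=3, seed=0, max_iter=100):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in X]              # 2. random assignment
    for _ in range(max_iter):
        centroids = []
        for c in range(k):                              # 3. centroid calculation
            members = [x for x, lab in zip(X, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random sample.
            centroids.append(centroid(members) if members else list(rng.choice(X)))
        new_labels = [                                  # 4. closest centroid
            min(range(k), key=lambda c: squared_distance(x, centroids[c]))
            for x in X
        ]
        if new_labels == labels:                        # 5. end: nothing changed
            break
        labels = new_labels
    return labels, centroids

# Two well-separated toy groups, clustered with k = 2.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = k_means(X, k=2)
print(labels)
```

The result depends on the random starting assignment, so in practice the procedure is often run several times with different seeds.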

49 (2) Local approaches: k-nearest Neighbor (kNN)
[Figure: a new sample ("?") among labelled training samples; neighborhoods shown for k = 1, 2, 3 and 4]
1. Calculate the distance between the new sample and each training sample (n_train times)
2. Select a number of neighbors (k) to predict the response

56 (2) Local approaches: k-nearest Neighbor (kNN)
1. Good for large training sets with localized differences
2. Difficult to interpret
3. Which k?
4. Which distance measure?
5. Curse of dimensionality → Variable selection
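A minimal kNN classifier follows the two steps directly: compute the distance to every training sample, then let the k closest ones vote. The sketch assumes Euclidean distance and a simple majority vote; both are choices made for the example, and the distance can be swapped through the `distance` parameter.

```python
from collections import Counter

def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def knn_predict(X_train, y_train, x_new, k=3, distance=euclidean):
    """Predict the class of x_new from its k nearest training samples."""
    # 1. one distance per training sample (n_train distances)
    distances = [(distance(x, x_new), y) for x, y in zip(X_train, y_train)]
    distances.sort(key=lambda pair: pair[0])
    # 2. the k nearest neighbors vote on the response
    neighbors = [y for _, y in distances[:k]]
    return Counter(neighbors).most_common(1)[0][0]

# Toy data with two classes (illustrative values only).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X_train, y_train, [0.5, 0.5], k=3))
```

For a regression task the vote would be replaced by, for example, the mean response of the neighbors.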

57 (3) Neural Networks: Artificial Neurons
Inputs: x1, x2, x3, x4, ..., xp
Activation Function: f(x)
Output: y
Marini, F. (2009). Neural networks. In: Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Vol. 3.
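A single artificial neuron can be written in a few lines. The sketch below assumes the common formulation in which the inputs are combined as a weighted sum plus a bias before the activation function is applied; the weights, the bias and the choice of a sigmoid activation are assumptions for the example and do not appear in the slide text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Output y = f(sum_j w_j * x_j + bias) for inputs x_1..x_p."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# With all-zero weights and bias the weighted sum is 0, and sigmoid(0) = 0.5.
y = neuron([0.1, 1.0, 0.0, 3.0], [0.0, 0.0, 0.0, 0.0])
print(y)
```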

59 (3) Neural Networks: Feed-Forward NN
Architecture: Input Layer, Hidden Layer(s), Output Layer
1. Untrained Network
2. Compute the outcome
3. Compute the error
4. Back propagation learning
5. Repeat until stop criterion
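The five training steps can be illustrated on the smallest possible case: a single sigmoid neuron with no hidden layer, where back-propagation reduces to a plain gradient step on the weights. This is a deliberately simplified sketch, not a full multi-layer network; the learning rate, the stop criterion (mean squared error below a tolerance, or a maximum number of epochs) and the toy OR data are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, x):
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def train_neuron(X, y, learning_rate=0.5, max_epochs=20000, tolerance=0.01):
    weights, bias = [0.0] * len(X[0]), 0.0                  # 1. untrained network
    for _ in range(max_epochs):
        outputs = [forward(weights, bias, x) for x in X]    # 2. compute the outcome
        errors = [o - t for o, t in zip(outputs, y)]        # 3. compute the error
        if sum(e * e for e in errors) / len(errors) < tolerance:
            break                                           # 5. stop criterion
        # 4. propagate the error back through the sigmoid and update the weights
        deltas = [e * o * (1.0 - o) for e, o in zip(errors, outputs)]
        for j in range(len(weights)):
            weights[j] -= learning_rate * sum(d * x[j] for d, x in zip(deltas, X))
        bias -= learning_rate * sum(deltas)
    return weights, bias

# Toy task: logical OR of two binary inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]
weights, bias = train_neuron(X, y)
predictions = [round(forward(weights, bias, x)) for x in X]
print(predictions)
```

In a network with hidden layers, step 4 would apply the same idea layer by layer, passing the error terms from the output layer back towards the input layer.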

60 (3) Neural Networks: Feed-Forward NN
[Figure: error as a function of the number of epochs, for the training set and for the validation set]

63 (3) Neural Networks: Kohonen Maps
Unsupervised non-linear mapping: p-dimensional → 2-dimensional
Topology-preserving map

64 (3) Neural Networks: Kohonen Maps
Architecture: Input Neurons, Kohonen Layer, Weights
1. Competitive Learning
   Similarity to each neuron
   Winner takes all
2. Collaborative Learning
   Winning neuron update
   Update of close neurons
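One training step of a Kohonen map can be sketched as follows. The competitive part finds the neuron whose weights are most similar to the input; the collaborative part moves the winner and its neighbors on the grid towards the input. The Gaussian neighborhood function, the learning rate and the radius are assumptions made for this example; the slides do not specify them.

```python
import math

def som_update(weights, x, learning_rate=0.5, radius=1.0):
    """One training step; weights[r][c] is the list of p weights of neuron (r, c)."""
    # 1. Competitive learning: the most similar (closest) neuron wins.
    winner, best = None, None
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wj - xj) ** 2 for wj, xj in zip(w, x))
            if best is None or d < best:
                winner, best = (r, c), d
    # 2. Collaborative learning: update the winner and, more weakly, close neurons.
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_d2 = (r - winner[0]) ** 2 + (c - winner[1]) ** 2
            influence = math.exp(-grid_d2 / (2 * radius ** 2))  # assumed Gaussian
            for j in range(len(w)):
                w[j] += learning_rate * influence * (x[j] - w[j])
    return winner

# 2 x 2 Kohonen layer with p = 2 weights per neuron (illustrative values only).
weights = [[[0.0, 0.0], [1.0, 1.0]],
           [[0.0, 1.0], [1.0, 0.0]]]
winner = som_update(weights, [0.9, 0.8])
print(winner, weights[winner[0]][winner[1]])
```

In a full training run this step is repeated over all samples for many epochs, usually while shrinking the learning rate and the neighborhood radius.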

65 (3) Neural Networks: Kohonen Maps
Top map (compounds)
Weight maps (p)

68 Which ML algorithm?
Purpose (clustering, regression, classification)
Performance vs interpretability
Covered chemical space (e.g., AD)
Types of included variables

69 Summary
Machines can learn from our data
No ML algorithm always outperforms the others
Validation and Applicability Domain assessment are crucial
Pay attention to what the performance metric is telling you!

70 Supplementary reading
Theory and Algorithms
  Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  Marini, F. (2009). Neural networks. In: Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Vol. 3.
Online resources
  [Coursera] Ng, A. Machine Learning, Stanford University.
  [Online Book] Neural Networks and Deep Learning.
