Comparison of network inference packages and methods for multiple networks inference

Nathalie Villa-Vialaneix
http://www.nathalievilla.org
nathalie.villa@univ-paris1.fr
1ères Rencontres R - BoRdeaux, 3 June 2012
Joint work with Nicolas Edwards, Laurence Liaubet, Nathalie Viguerie & Magali SanCristobal

R for multiple networks inference (RR 2012), Nathalie Villa-Vialaneix, BoRdeaux, 06/03/2012

Plan
1. From transcriptomic data to network
2. Network inference and multiple networks inference using R
3. Simulations

From transcriptomic data to network

Transcriptome
- DNA contains the genetic instructions used in the development and functioning of living organisms.
- Genes, the molecular units of DNA, are not all identically expressed in a given cell: the expression of a gene is assessed by the quantity of the corresponding mRNA.
- Gene expression can be measured by microarray, RT-PCR...: this gives transcriptomic data.

Modelling multiple interactions between genes with a network
Co-expression networks:
- nodes: genes
- edges: direct co-expression between two genes
Method: Correlations → Thresholding → Graph
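The correlation, thresholding and graph steps can be sketched as follows. This is only an illustration in plain Python: the function names, the toy expression values and the threshold of 0.8 are assumptions made for the example, not values from the talk.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(expr, threshold):
    """expr maps a gene name to its expression values (one per sample).
    Returns the gene pairs whose absolute correlation reaches the threshold."""
    genes = sorted(expr)
    edges = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expr[g], expr[h])) >= threshold:
                edges.add((g, h))
    return edges

# Toy data (made up): geneA and geneB vary together, geneC does not.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = coexpression_edges(toy, threshold=0.8)
```

Thresholding raw correlations keeps indirect associations as well, which is why the rest of the talk works with partial correlations.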

Multiple networks inference
Transcriptomic data coming from several different conditions. Examples:
- gene expression from pig muscle in Landrace and Large White breeds;
- gene expression from obese humans before and after a diet.
Assumption: a common functioning exists regardless of the condition.
Question: which genes are correlated independently of the condition, and which depending on it?

Network inference and multiple networks inference using R

Theoretical framework
Gaussian Graphical Models (GGM): $X \sim \mathcal{N}(0, \Sigma)$
Seminal work [Schäfer and Strimmer, 2005], GeneNet: estimation of the partial correlations
$\pi_{jj'} = \mathrm{Cor}(X_j, X_{j'} \mid X_k,\ k \neq j, j')$
(by using the inverse of $\Sigma + \lambda I$) and edge selection by a Bayesian test based on a mixture model.
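A minimal sketch of how partial correlations are read off the inverse of a regularized covariance matrix, as described above. It is not GeneNet's implementation: the helper names, the hand-written matrix inversion and the toy covariance (a chain X1 - X2 - X3) are assumptions chosen to keep the example self-contained.

```python
import math

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(cov, lam=0.0):
    """Partial correlations computed from S = (cov + lam * I)^{-1},
    using pi_jk = -S_jk / sqrt(S_jj * S_kk)."""
    n = len(cov)
    reg = [[cov[i][j] + (lam if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    s = invert(reg)
    return [[1.0 if i == j else -s[i][j] / math.sqrt(s[i][i] * s[j][j])
             for j in range(n)] for i in range(n)]

# Toy covariance of a chain X1 - X2 - X3: X1 and X3 are correlated (0.25)
# but only through X2, so their partial correlation should vanish.
cov = [[1.0, 0.5, 0.25],
       [0.5, 1.0, 0.5],
       [0.25, 0.5, 1.0]]
pc = partial_correlations(cov)
```

In this toy case the marginal correlation between X1 and X3 is 0.25 while their partial correlation is zero, which is the distinction between an indirect and a direct link.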

Edge selection by sparse penalty: graphical LASSO [Meinshausen and Bühlmann, 2006, Friedman et al., 2008], glasso:
$X_j = \sum_{k \neq j} \beta_{jk} X_k + \epsilon$,
where $(\beta_{jk})_{jk}$ are estimated by
$\max_{(\beta_{jk})_{k \neq j}} \left( \log ML_j - \lambda \sum_{k \neq j} |\beta_{jk}| \right)$.
$\beta_{jk}$ is related to $S = \Sigma^{-1}$ by $\beta_{jk} = -\frac{S_{jk}}{S_{jj}}$.
Other related packages (not used here): parcor (different regularization methods for GGM, CV selection), GGMselect (network selection among a family).
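As a worked step that is not on the slides, the regression coefficients can be tied back to the partial correlations of the GGM. The derivation assumes the standard identity $\pi_{jk} = -S_{jk} / \sqrt{S_{jj} S_{kk}}$ for the precision matrix $S = \Sigma^{-1}$, together with $\beta_{jk} = -S_{jk}/S_{jj}$:

```latex
\begin{align*}
\beta_{jk}\,\beta_{kj}
  &= \left(-\frac{S_{jk}}{S_{jj}}\right)\left(-\frac{S_{kj}}{S_{kk}}\right)
   = \frac{S_{jk}^{2}}{S_{jj}\,S_{kk}}
   = \pi_{jk}^{2}, \\
\pi_{jk}
  &= \operatorname{sign}(\beta_{jk})\,\sqrt{\beta_{jk}\,\beta_{kj}} .
\end{align*}
```

In particular $\beta_{jk} = 0$ exactly when $S_{jk} = 0$, that is, when there is no edge between genes $j$ and $k$, which is why a sparse penalty on the coefficients acts as an edge selection.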

Multiple networks
Independent estimations: if $c = 1, \ldots, C$ are different samples (or conditions, e.g., breeds or before/after diet...),
$\max_{(\beta^c_{jk})_{k \neq j,\ c = 1, \ldots, C}} \sum_c \left( \log ML^c_j - \lambda \sum_{k \neq j} |\beta^c_{jk}| \right)$.
Joint estimations, implemented in the package simone [Chiquet et al., 2011]:
- GroupLasso: consensual network between conditions (enforces identical edges by a group LASSO penalty);
- CoopLasso: sign-coherent network between conditions (prevents edges that correspond to partial correlations having different signs; thus allows one to obtain a few differences between the conditions);
- Intertwined: in GLasso, replace $\Sigma^c$ by $\frac{1}{2} \Sigma^c + \frac{1}{2} \overline{\Sigma}$, where $\overline{\Sigma} = \frac{1}{C} \sum_c \Sigma^c$.
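The Intertwined substitution is simple enough to sketch directly. The function name and the two toy 2 × 2 covariance matrices below are assumptions for illustration; simone performs this step internally and its code may differ.

```python
def intertwined_covariances(covs):
    """covs: list of C covariance matrices (lists of lists), one per condition.
    Returns, for each condition c, 1/2 * cov_c + 1/2 * mean of all cov_c."""
    n_cond = len(covs)
    p = len(covs[0])
    mean = [[sum(c[i][j] for c in covs) / n_cond for j in range(p)]
            for i in range(p)]
    return [[[0.5 * c[i][j] + 0.5 * mean[i][j] for j in range(p)]
             for i in range(p)]
            for c in covs]

# Two toy conditions that disagree on the strength of one association.
cov_a = [[1.0, 0.8], [0.8, 1.0]]
cov_b = [[1.0, 0.2], [0.2, 1.0]]
mixed_a, mixed_b = intertwined_covariances([cov_a, cov_b])
```

Each condition is pulled halfway towards the average, so the two estimated networks share information while still being allowed to differ.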

Additional tested approaches:
- Use the fact that individuals are paired (when applicable) to compute the partial correlations: $\tilde{X}^c_i = \frac{1}{2} X^c_i + \frac{1}{2} \overline{X}_i$ with $\overline{X}_i = \sum_c X^c_i$ (implemented with GeneNet and simone);
- Combine the partial correlations instead of the correlations as in Intertwined (implemented from independent estimations obtained using simone, called "therese").

Tested packages and features

Package  Indep.  Joint  Selection?                              Inputs    Outputs
GeneNet  [1]     No     confidence threshold                    $X$       $(\pi_{ij})_{ij}$
glasso   [2,3]   No     none (but LASSO path is available)      $\Sigma$  $(S_{ij})_{ij}$
simone   [2,3]   Yes    number of edges, AIC, BIC (LASSO path)  $X$       $(S_{ij})_{ij}$

with [1] [Schäfer and Strimmer, 2005], [2] [Meinshausen and Bühlmann, 2006], [3] [Friedman et al., 2008].
Not shown: CV selection is not included in glasso and simone, but it can be implemented (be careful with the internal scaling and with the outputs).

Simulations

Data
Datasets coming from:
- the ANR project DéLiSus ("caractérisations génétique et phénotypique fines de populations porcines françaises": genetic and phenotypic variability of French pigs);
- the pan-European project DiOGenes (Diet, Obesity and Genes: new insight on obesity problems and routes to prevention).

Datasets description
Real datasets:
- DiOGenes dataset: 39 variables (gene expressions and clinical variables); conditions: before/after a diet (paired individuals: 204 obese women).
- DeLiSus dataset: expression of 123 genes; conditions: two breeds (33 Landrace and 51 Large White).
Simulated dataset: to compare methods, a dataset was simulated from a GGM (with simone):
- underlying network: 39 variables with 5 groups of preferential attachment and a density of approximately 3-4%;
- children networks: two networks obtained by randomly permuting 10% of the edges;
- variables: 2 × 204 observations of a GGM coming from these networks (observations are not paired).
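One way to derive two children networks from a parent network can be sketched as below. The slides do not say how the permutation is carried out, so this sketch assumes that it means removing 10% of the edges and adding the same number of new edges at random; the function name, the seeds and the toy parent (a simple path on 39 nodes rather than a preferential-attachment graph) are likewise assumptions, and simone's own simulation code may proceed differently.

```python
import random
from itertools import combinations

def perturb_edges(edges, n_nodes, fraction=0.1, seed=None):
    """Return a copy of the edge set where a given fraction of the edges
    has been removed and replaced by as many previously absent edges."""
    rng = random.Random(seed)
    edges = set(edges)
    all_pairs = set(combinations(range(n_nodes), 2))
    n_move = round(fraction * len(edges))
    removed = set(rng.sample(sorted(edges), n_move))
    added = set(rng.sample(sorted(all_pairs - edges), n_move))
    return (edges - removed) | added

# Toy parent network: a path on 39 nodes (edges stored as (i, j) with i < j).
n_nodes = 39
parent = {(i, i + 1) for i in range(n_nodes - 1)}
child_a = perturb_edges(parent, n_nodes, fraction=0.1, seed=1)
child_b = perturb_edges(parent, n_nodes, fraction=0.1, seed=2)
```

Both children keep the number of edges of the parent, so the three networks have the same density and differ only in where a few edges sit.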

Simulation results and conclusions
[Figure: precision and recall, all methods]
Precision = tp / (tp + fp), Recall = tp / (tp + fn)
- glasso performs well (with very low variability) but has no real solution for tuning;
- simone performs well (especially joint methods), with an automatic tuning but large variability;
- therese has a low variability but no real solution for tuning;
- GeneNet has a low recall and a low variability.
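Precision and recall of an inferred network against the true one can be computed from the two edge sets, as in this sketch. The function name and the toy edge lists are assumptions; edges are treated as undirected.

```python
def precision_recall(true_edges, predicted_edges):
    """Precision and recall of predicted undirected edges.
    tp: predicted edges that are true, fp: predicted edges that are not,
    fn: true edges that were missed."""
    truth = {frozenset(e) for e in true_edges}
    pred = {frozenset(e) for e in predicted_edges}
    tp = len(truth & pred)
    fp = len(pred - truth)
    fn = len(truth - pred)
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall

# Toy example: 3 true edges, 4 predicted edges, 2 of them correct.
truth = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
pred = [("g2", "g1"), ("g3", "g4"), ("g1", "g4"), ("g2", "g4")]
precision, recall = precision_recall(truth, pred)
```

A method that predicts few edges, like GeneNet in the results above, can keep a high precision while its recall stays low.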

Numerical performances
Graph densities (true density: 3.57% on average):
- GeneNet (automatic): 4.38%
- glasso (manual): 8.14%
- simone (indep, BIC): 6.65%
- simone (joint, BIC): 5.87%
- therese (semi manual): 5.26%
Shared edges between conditions (truth: 20.28% on average):
- GeneNet (automatic): 15.95%
- glasso (manual): 32.74%
- simone (indep, BIC): 26.69%
- simone (joint, BIC): 31.15%
- therese (semi manual): 30.92%
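The two indicators can be sketched as follows. The density is the usual ratio of edges to possible pairs. The slides do not state how the share of common edges is computed, so the second function is only one possible definition (edges found in both conditions divided by edges found in at least one), labelled here as an assumption, as are the function names and toy edge lists.

```python
def density(edges, n_nodes):
    """Number of undirected edges divided by the number of possible pairs."""
    n_edges = len({frozenset(e) for e in edges})
    return n_edges / (n_nodes * (n_nodes - 1) / 2)

def shared_fraction(edges_a, edges_b):
    """ASSUMED definition: edges present in both networks divided by
    edges present in at least one of them."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# With 39 variables there are 39 * 38 / 2 = 741 possible edges,
# so a toy network of 26 edges has a density of about 3.5%.
toy_density = density([(i, i + 1) for i in range(26)], n_nodes=39)

toy_shared = shared_fraction([(0, 1), (1, 2), (2, 3)],
                             [(1, 0), (2, 3), (3, 4)])
```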

DiOGenes dataset (39 variables, 204 obese women, fixed density 5%)

Method                        Density  Transitivity  % shared
[1] GeneNet                   0.06     0.22          0.68
[2] GeneNet (paired)          0.09     0.24          0.84
[3] simone (indep., Fried.)   0.05     0.52          0.76
[4] simone, CoopLasso         0.06     0.30          1.00
[5] simone, GroupLasso        0.06     0.30          1.00
[6] simone, intertwined       0.05     0.37          0.97
[7] simone, paired            0.04     0.52          0.94
[8] therese                   0.05     0.46          0.82

Pairwise comparison between methods [1] to [8] (upper triangle):

     [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]
[1]  1.00  0.98  0.45  0.61  0.61  0.53  0.42  0.42
[2]        1.00  0.58  0.66  0.66  0.66  0.55  0.58
[3]              1.00  0.79  0.79  0.84  1.00  0.92
[4]                    1.00  1.00  0.95  0.76  0.76
[5]                          1.00  0.95  0.76  0.76
[6]                                1.00  0.82  0.79
[7]                                      1.00  0.97
[8]                                            1.00
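For readers unfamiliar with the transitivity column, here is a sketch of the usual global definition (three times the number of triangles divided by the number of connected triples). Whether the talk used exactly this variant is an assumption, and the function name and toy graph are illustrative.

```python
from itertools import combinations

def transitivity(edges):
    """Global transitivity of an undirected graph: the share of connected
    triples (two edges meeting at a node) whose end points are also linked."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    triples = 0
    closed = 0
    for nbrs in neighbours.values():
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1
            if b in neighbours[a]:
                closed += 1  # each triangle is counted once per corner
    return closed / triples if triples else 0.0

# Toy graph: a triangle a-b-c with one extra node d attached to c.
toy_transitivity = transitivity([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

A high transitivity means that the inferred network contains many tightly knit groups of genes, a low one that it looks more like a tree.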

DeLiSus dataset (restricted dataset with 84 genes (51 pigs))

Method                     Density  Transitivity  % shared
[1] GeneNet                0.00     0.71          0.46
[2] simone, MB-AND         0.05     0.08          0.17
[3] simone, Fried.         0.05     0.19          0.22
[4] simone, intertwined    0.05     0.09          0.52
[5] simone, CoopLasso      0.06     0.09          0.88
[6] simone, GroupLasso     0.04     0.07          0.99
[7] therese                0.05     0.17          0.66

Pairwise comparison between methods [1] to [7] (upper triangle):

     [1]   [2]   [3]   [4]   [5]   [6]   [7]
[1]  1.00  0.00  0.00  0.00  0.00  0.00  0.00
[2]        1.00  0.71  0.76  0.64  0.56  0.57
[3]              1.00  0.67  0.55  0.53  0.78
[4]                    1.00  0.80  0.67  0.58
[5]                          1.00  0.84  0.60
[6]                                1.00  0.74
[7]                                      1.00

Conclusion
Simulations:
- BIC is not always relevant; alternatives: target density, CV, GGMselect...?
- Joint methods produce more shared edges between conditions.
Real-life datasets:
- low dimension case: large consensus between methods; joint methods are too similar (except maybe paired GeneNet and "therese");
- larger dimension case: methods are less consensual; GroupLasso and CoopLasso still produce too many shared edges;
- very large dimension (not shown): 464 gene expressions for 51 + 33 pigs gave very bad performances. On the real dataset, some methods were unable to produce results (and BIC selected graphs with no edge); likewise, on simulated datasets with the same sample size and dimension, the recall was always very low.

Collaboration
Any questions?...
Co-authors:
- Nathalie Villa-Vialaneix (SAMM, U. Paris 1)
- Nicolas Edwards (LGC, INRA Tlse)
- Laurence Liaubet (LGC, INRA Tlse)
- Nathalie Viguerie (ORL, INSERM)
- Magali SanCristobal (LGC, INRA Tlse)

References
Chiquet, J., Grandvalet, Y., and Ambroise, C. (2011). Inferring multiple graphical structures. Statistics and Computing, 21(4):537-553.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441.
Meinshausen, N. and Bühlmann, P. (2006). High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436-1462.
Schäfer, J. and Strimmer, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6):754-764.