Ricopili: Postimputation Module. WCPG Education Day Stephan Ripke / Raymond Walters Toronto, October 2015


Ricopili Overview


Postimputation, 12 steps
1) Association analysis
2) Meta-analysis
3) Collection of results
4) Create a separate set for heterogeneity P
5) Count SNPs for various thresholds, create top lists
6) Clumping
7) Region plots
8) Forest plots
9) Manhattan plots
10) QQ plots
11) LD score
12) Lambda plots
Runtime: 1,000 individuals: 30 mins; 40,000 individuals: 4 hours
Step 1 takes the majority of the computing resources of this module; it is possible to start from step 2.

Primary Meta-Analysis, implemented in the pipeline
Basic association: logistic regression with covariates, highly parallelized (N_datasets x N_chunks jobs)
Meta-analysis: N_chunks jobs
Presentation of results: mostly N_datasets jobs
Interaction at different levels
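To make the job counts above concrete, here is a minimal sketch that enumerates one association job per (dataset, chunk) pair and one meta-analysis job per chunk. The dataset and chunk names are hypothetical, and the source does not show how Ricopili itself builds its job lists.

```python
from itertools import product

def association_jobs(datasets, chunks):
    """One association job per (dataset, chunk) pair: N_datasets x N_chunks."""
    return list(product(datasets, chunks))

def meta_jobs(chunks):
    """One meta-analysis job per genomic chunk: N_chunks."""
    return list(chunks)

# Hypothetical names, for illustration only.
datasets = ["da_dataset1", "da_dataset2"]
chunks = ["chunk_001", "chunk_002", "chunk_003"]

print(len(association_jobs(datasets, chunks)), "association jobs")
print(len(meta_jobs(chunks)), "meta-analysis jobs")
```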

Secondary Analyses
Basic changes to primary association analysis:
  Different phenotypes, quantitative phenotypes
  Include, exclude individuals
  Additional covariates, conditional analysis
Clumping options
Meta-analysis from summary stats (no genotype access)
Polygene scoring, leave one out
LD-score
(Replication lookup)
(Integration into ricopili website)

Postimputation Output Files
Sample composition Excel file (basic*.num.xls): lists all single datasets in the meta-analysis with cases / controls, N_SNPs, lambda-GC, and effective sample size (4*nca*nco/(nca+nco)).
Example: SCZ52
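The effective sample size formula quoted above, 4*nca*nco/(nca+nco), can be sketched as follows. The function name and the cohort sizes are illustrative assumptions, not taken from the Ricopili code.

```python
def effective_sample_size(n_cases, n_controls):
    """Effective sample size as given on the slide: 4*nca*nco/(nca+nco)."""
    total = n_cases + n_controls
    if total == 0:
        raise ValueError("need at least one case or control")
    return 4 * n_cases * n_controls / total

# Hypothetical cohorts: (cases, controls)
cohorts = {"cohort_a": (1000, 1000), "cohort_b": (500, 1500)}
for name, (nca, nco) in cohorts.items():
    print(name, effective_sample_size(nca, nco))
```

Both example cohorts have 2,000 individuals, but the unbalanced one has a smaller effective sample size, which is why the file reports this value alongside the raw case / control counts.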

Postimputation Output Files
LD-clumped result file with detailed information about each index SNP (header lists sample size): daner*1mhc.pdf
No additional post-association QC (MAF, INFO).
Sorted by p-value, only 1 index SNP for the MHC (chr6 25-35)
Columns:
  Basic: rs-name, chromosome, position
  Association: P, OR, SE, A1, A2, FRQ_ca, FRQ_co, INFO
  Meta-analytic: direction column (poor man's forest plot), ngt (N_genotyped)
  Genomic:
    Clump R2 = 0.6 && R2 = 0.1: LD-friends (incl. distance and LD), region (left, right, size)
    R2 = 0.6: genes (+50kb) with distance, gwas_catalog
Additional info: the number in brackets after a gene shows the distance (in kb) to the index SNP. The numbers in brackets after LD friends show LD and distance to the index SNP.
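As an illustration of the "sorted by p-value, only 1 index SNP for the MHC" rule, the sketch below sorts index SNPs by p-value and keeps only the best one inside chr6 25-35. It assumes the range is in megabases and uses hypothetical dictionary keys (`snp`, `chr`, `bp`, `p`); the actual Ricopili script is not shown in the source.

```python
# Assumption: "chr6 25-35" on the slide means 25-35 Mb on chromosome 6.
MHC_CHR, MHC_START, MHC_END = 6, 25_000_000, 35_000_000

def keep_one_mhc_index(index_snps):
    """Sort index SNPs by p-value and keep only the top one inside the MHC."""
    kept = []
    seen_mhc = False
    for row in sorted(index_snps, key=lambda r: r["p"]):
        in_mhc = row["chr"] == MHC_CHR and MHC_START <= row["bp"] <= MHC_END
        if in_mhc:
            if seen_mhc:
                continue  # a better MHC index SNP was already kept
            seen_mhc = True
        kept.append(row)
    return kept

# Hypothetical index SNPs, for illustration only.
rows = [
    {"snp": "b", "chr": 6, "bp": 30_000_000, "p": 1e-15},
    {"snp": "a", "chr": 6, "bp": 28_000_000, "p": 1e-20},
    {"snp": "c", "chr": 1, "bp": 1_000_000, "p": 1e-10},
]
print([r["snp"] for r in keep_one_mhc_index(rows)])
```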

Example from SCZ52 daner_pgc_scz52_0513a.gz.p4.clump.areator.sorted.1mhc.xls

Manhattan Plots
Three different types with different properties (manhattan*pdf):
*.v2.*: with gene-names and variable y-axis. Clumps in red and yellow, rest in grey
*nog2*: no gene-names, variable y-axis. Clumps in green, rest in brown and blue ("Nature format")
*nog*: no gene-names, fixed y-axis (at p = 10^-30) for comparing different result sets. Clumps in red, rest in grey

Examples of V2

V2. SCZ52

*nog2*

*nog2* SCZ52

*nog*


QQ Plots
qq*pdf:
  With ceiling at p = 10^-12
  Red confidence interval
  No LD pruning
  Including Lambda, Lambda1000, N_snps, sample size
  MAF > 1%, INFO > 0.6 (otherwise Lambda is artificially deflated due to overrepresentation of rare variants with low power)
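For readers unfamiliar with the Lambda and Lambda1000 values printed on the QQ plots, here is a minimal sketch using the commonly used definitions: lambda as the median 1-df chi-square statistic divided by its expected median, and Lambda1000 as lambda rescaled to 1,000 cases and 1,000 controls. The source does not show how Ricopili computes these values, so treat the function names and details as assumptions.

```python
from statistics import NormalDist, median

_ND = NormalDist()
# Median of a chi-square distribution with 1 df (about 0.4549).
MEDIAN_CHI2_1DF = _ND.inv_cdf(0.75) ** 2

def lambda_gc(p_values):
    """Genomic inflation factor from two-sided p-values (each in (0, 1])."""
    # Convert each p-value back to a 1-df chi-square statistic via z^2.
    chi2 = [_ND.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / MEDIAN_CHI2_1DF

def lambda_1000(lam, n_cases, n_controls):
    """Rescale lambda to a study of 1,000 cases and 1,000 controls."""
    scale = (1 / n_cases + 1 / n_controls) / (1 / 1000 + 1 / 1000)
    return 1 + (lam - 1) * scale

print(lambda_1000(1.10, 10_000, 10_000))
```

In practice the p-values would first be restricted to SNPs with MAF > 1% and INFO > 0.6, as stated on the slide.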

QQ Example (post imputation)

QQ Example without MAF filter

Lambda plot over Freq
Lambda plot (see preimputation), *frqulo_lama-page1.pdf:
  X-axis: MAF
  Grey: SNP bin size (right y-axis)
  Blue: SNPs with p < 10^-11 (right y-axis)
  Red: Lambda in each bin (red right y-axis)

Lambda plot over Freq ex.

Region Plots
All genome-wide significant regions (combining multiple index SNPs)
At least 10 top regions
Black SNP center distinguishes 1KG from HapMap
If 1 index SNP: color and size based on LD (see legend)
If multiple index SNPs: different color scheme for each index SNP; LD friends get the color of their index SNP, with shade and size based on LD info
Detailed SNP info in upper right corner (blue letters): snp-name, P, OR, MAF, INFO, directions (left right - missing)
GWAS_catalog in upper left corner (red numbers)
For examples just use the ricopili website: http://www.broadinstitute.org/mpg/ricopili/

Region plot with two index SNPs

Forest Plots
Index SNPs of all region plots, sorted by SNP name
With basic information in the header: alleles, position, direction column, heterogeneity test results
ngt denotes genotyped (1) or imputed (0)
Numbers in brackets in the frequency columns show sample sizes
Meta-analysis results in bold in the last row

Example of something going wrong (100s of these)

Corresponding forest plot: additional covariate (sex), and one of the cohorts is female only.

LD Score
Basic LD score analysis (daner*ldsc.tar.gz):
  All analyses unconstrained
  All analyses on observed and liability scale
  SNP heritability for meta-analysis
  Genetic correlation to published PGC SCZ52 and other publicly available datasets
  All single commands in cmd.ldsc.txt
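The difference between the observed and the liability scale can be illustrated with the standard conversion for case-control studies, which depends on the population prevalence K and the sample case proportion P. This is a sketch of that textbook formula; the source does not show the commands in cmd.ldsc.txt, and the parameter names here are assumptions.

```python
from statistics import NormalDist

def observed_to_liability(h2_obs, pop_prev, sample_prev):
    """Convert observed-scale SNP heritability to the liability scale.

    h2_liab = h2_obs * K^2 (1-K)^2 / (P (1-P) z^2), where z is the standard
    normal density at the liability threshold for prevalence K.
    """
    nd = NormalDist()
    threshold = nd.inv_cdf(1 - pop_prev)
    z = nd.pdf(threshold)
    k_term = (pop_prev * (1 - pop_prev)) ** 2
    p_term = sample_prev * (1 - sample_prev)
    return h2_obs * k_term / (p_term * z ** 2)

# Hypothetical values: 1% population prevalence, balanced case-control sample.
print(observed_to_liability(0.40, pop_prev=0.01, sample_prev=0.5))
```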

Focus on Heterogeneity Test
Manhattan plot (only v2): manhattan*het.pdf
QQ plot: qq*het.pdf
Excel result file: daner*het*1mhc.xls
Forest plots: areas.fo.*.pdf.gz
Used mostly as a QC value.
Interesting if differences between single association results are expected:
  BIP and SCZ
  Male and female
Caveat: lower power than directly testing genotypes (but usually with bigger sample size, since distinct datasets are usually on distinct platforms)
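To show what a heterogeneity test across datasets measures, here is a minimal sketch of an inverse-variance fixed-effect meta-analysis with Cochran's Q statistic. The source does not state which meta-analysis model or heterogeneity test Ricopili uses, so this is a generic illustration with hypothetical inputs, not the pipeline's method.

```python
import math

def fixed_effect_meta(odds_ratios, standard_errors):
    """Inverse-variance meta-analysis of log odds ratios.

    Returns (pooled OR, pooled SE of log OR, Cochran's Q, degrees of freedom).
    Under homogeneity, Q follows a chi-square distribution with k-1 df.
    """
    betas = [math.log(o) for o in odds_ratios]
    weights = [1 / se ** 2 for se in standard_errors]
    wsum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / wsum
    se = math.sqrt(1 / wsum)
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return math.exp(beta), se, q, len(betas) - 1

# Hypothetical per-dataset results with opposite effect directions.
pooled_or, pooled_se, q, df = fixed_effect_meta(
    [math.exp(0.2), math.exp(-0.2)], [0.1, 0.1]
)
print(pooled_or, pooled_se, q, df)
```

In this example the pooled odds ratio is 1 (non-significant in the combined analysis) while Q is large, which is the pattern a heterogeneity-focused analysis is designed to pick up.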

Rare example of positive heterogeneity finding
Genome-wide significant het-p
Non-significant in combined analysis
No details shown since confidential

Conditional Analysis
Must be done manually right now (see tutorials on website)
Most of the time, LD-independent SNPs are not independent in reality and will lose genome-wide significance (GWS) in conditional analysis.

Example of conditional Analysis, revealing independent index SNPs

Output structure - Overview

Ricopili postimp Directory Structure
.
-dameta_outname
-daner_outname
---da_dataset1
---da_dataset2
-danerjobdir
---errandout
-distribution
---OUTNAME
-----replic
-errandout
-report_outname
---errandout

Daner: association results for all single datasets, separate in genomic chunks
Dameta: meta-analytic results in separate genomic chunks
Danerjobdir: contains lists of all association / meta / score commands; subdir errandout contains job outputs
Distribution: contains subdirs to important summary files from the postimp pipeline run (from report subdir)
Report: working directory for all summarizing and presenting scripts


Output structure - Details

Detailed look at report* (*job_list)
Possible changes per job list:
areaplot.job_list: region plots. Replace genomic region, index SNPs (or directly via gene-name)
areator.job_list: clumping. Use different clumping thresholds (p/r2)
forestplot.job_list: forest plots. Replace snp-name (needs positional information)
ldsc.job_list: ld_score. All relevant files in tar ball in distribution directory, use different options
manhplot.job_list: manhattan plots. Use different parameters (gene-names, thresholds)
qqplot.job_list: QQ plots. Different parameters, e.g. ceiling effect

Detailed look at report* (*pdf.tar.gz) Contains all R-scripts for all plots in the directory (up to hundreds)

Postimputation options
See https://sites.google.com/a/broadinstitute.org/ricopili/cvas
--help lists a lot of deprecated options

Wrap-up
Questions?
Useful Resources:
Ricopili home page: https://sites.google.com/a/broadinstitute.org/ricopili
Ricopili user group: https://sites.google.com/a/broadinstitute.org/ricopili/users-section
Materials from this workshop: https://sites.google.com/a/broadinstitute.org/pgc-summer-school-2015