
Ricopili: Postimputation Module
WCPG Education Day
Stephan Ripke / Raymond Walters
Toronto, October 2015

Ricopili Overview

postimputation, 12 steps
1) Association analysis
2) Meta-analysis
3) Collection of results
4) Create a separate set for heterogeneity P
5) Count SNPs for various thresholds, create top lists
6) Clumping
7) Region plots
8) Forest plots
9) Manhattan plots
10) QQ plots
11) LD score
12) Lambda plots
1000 individuals: 30 mins
40,000 individuals: 4 hours
Step 1 takes the majority of the computing resources of this module; it is possible to start from step 2.

Primary Meta-Analysis, implemented in the pipeline
Basic association, logistic regression with covariates
  Highly parallelized (N_datasets x N_chunks jobs)
Meta-analysis
  N_chunks jobs
Presentation of results
  mostly N_datasets jobs
Interaction at different levels

Secondary Analyses
Basic changes to primary association analysis:
  Different phenotypes, quantitative phenotypes
  Include, exclude individuals
  Additional covariates, conditional analysis
  Clumping options
Meta-analysis from summary stats (no genotype access)
Polygene scoring, leave one out
LD-score
(Replication lookup)
(Integration into ricopili website)

Postimputation Output Files
Sample composition Excel file (basic*.num.xls):
  Lists all single datasets in the meta-analysis with cases / controls, N_SNPs, lambda-gc, effective sample size (4*nca*nco/(nca+nco))
SCZ52
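The effective sample size formula from the slide can be checked by hand. The following is a minimal sketch, not ricopili code; the cohort names and counts are made up for illustration.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective sample size as written on the slide: 4*nca*nco/(nca+nco)."""
    total = n_cases + n_controls
    if total == 0:
        raise ValueError("need at least one case or control")
    return 4.0 * n_cases * n_controls / total


# Hypothetical cohorts: (name, cases, controls). Not real data.
cohorts = [("cohort_a", 500, 500), ("cohort_b", 200, 800)]
neff = {name: effective_sample_size(nca, nco) for name, nca, nco in cohorts}

for name, value in neff.items():
    print(name, value)
```

A balanced cohort keeps its full size (500 + 500 gives 1000), while an unbalanced cohort of the same total size is worth less (200 + 800 gives 640).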

Postimputation Output Files
LD clumped result file with detailed information about each index SNP (header lists sample size): daner*1mhc.pdf
No additional post-association QC (MAF, INFO).
Sorted by p-value, only 1 index SNP in the MHC (chr6: 25-35 Mb)
Basic: rs-name, chromosome, position
Association: P, OR, SE, A1, A2, FRQ_ca, FRQ_co, INFO
Meta-analytic: Direction_column (poor man's forest plot), ngt (N_genotyped)
Genomic:
  Clump R^2 = 0.6 && R^2 = 0.1: LD friends (incl. distance and LD), region (left, right, size)
  R^2 = 0.6: genes (+50kb) with distance, gwas_catalog
Add. info: Numbers in brackets after genes show the distance (in kb) to the index SNP. Numbers in brackets after LD friends show LD and distance to the index SNP.
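The "sorted by p-value, only 1 index SNP in the MHC" rule can be sketched as a small filter. This illustrates the rule only and is not the pipeline's own script; the dictionary keys (SNP, CHR, BP, P), the SNP names and the exact base-pair boundaries are assumptions.

```python
# Assumed MHC boundaries: chromosome 6, 25-35 Mb (from the slide), in base pairs.
MHC_CHR, MHC_START, MHC_END = 6, 25_000_000, 35_000_000


def keep_one_mhc_index(index_snps):
    """Sort index SNPs by p-value and keep only the best one inside the MHC."""
    kept = []
    seen_mhc = False
    for snp in sorted(index_snps, key=lambda s: s["P"]):
        in_mhc = snp["CHR"] == MHC_CHR and MHC_START <= snp["BP"] <= MHC_END
        if in_mhc and seen_mhc:
            continue  # a stronger MHC index SNP was already kept
        seen_mhc = seen_mhc or in_mhc
        kept.append(snp)
    return kept


# Hypothetical index SNPs, not real results.
index_snps = [
    {"SNP": "snp_b", "CHR": 6, "BP": 31_000_000, "P": 1e-15},
    {"SNP": "snp_a", "CHR": 6, "BP": 28_000_000, "P": 1e-20},
    {"SNP": "snp_c", "CHR": 2, "BP": 1_000_000, "P": 1e-10},
    {"SNP": "snp_d", "CHR": 6, "BP": 40_000_000, "P": 1e-9},
]
top_list = keep_one_mhc_index(index_snps)
print([s["SNP"] for s in top_list])
```

Here snp_b is dropped because snp_a is the stronger MHC signal, while snp_d stays because it lies on chromosome 6 but outside the assumed MHC window.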

Example from SCZ52 daner_pgc_scz52_0513a.gz.p4.clump.areator.sorted.1mhc.xls

Manhattan Plots
Three different types with different properties (manhattan*pdf):
*.v2.*: with gene names and variable y-axis
  Clumps in red and yellow, rest in grey
*nog2*: no gene names, variable y-axis
  Clumps in green, rest in brown and blue ("Nature format")
*nog*: no gene names, fixed y-axis (at p = 10^-30) for comparing different result sets
  Clumps in red, rest in grey

Examples of V2

V2. SCZ52

*nog2*

*nog2* SCZ52

*nog*

QQ Plots
qq*pdf:
  With ceiling at p = 10^-12
  Red confidence interval
  No LD pruning
  Including Lambda, Lambda1000, N_snps, sample size
  MAF > 1%, INFO > 0.6 (otherwise Lambda is artificially deflated due to overrepresentation of rare variants with low power)
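The quantities printed on the QQ plot can be approximated with the standard library. This is a sketch under assumptions, not the pipeline's plotting script: Lambda is taken as the median 1-df chi-square divided by its null median, Lambda1000 uses the common rescaling to 1000 cases and 1000 controls, and the record keys FRQ and INFO are hypothetical names.

```python
from statistics import NormalDist, median

_ND = NormalDist()
# Median of a 1-df chi-square under the null (about 0.4549).
MEDIAN_CHI2_1DF = _ND.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = _ND.inv_cdf(p / 2.0)  # lower tail keeps precision for very small p
    return z * z


def lambda_gc(p_values):
    """Genomic inflation factor: median observed chi-square over the null median."""
    return median(p_to_chi2(p) for p in p_values) / MEDIAN_CHI2_1DF


def lambda_1000(lam, n_cases, n_controls):
    """Rescale lambda to a study of 1000 cases and 1000 controls."""
    return 1.0 + (lam - 1.0) * (1.0 / n_cases + 1.0 / n_controls) / (2.0 / 1000.0)


def qq_filter(snps, min_maf=0.01, min_info=0.6):
    """Keep SNPs with MAF > 1% and INFO > 0.6, the thresholds named on the slide."""
    return [
        s for s in snps
        if min(s["FRQ"], 1.0 - s["FRQ"]) > min_maf and s["INFO"] > min_info
    ]


# Hypothetical records, not real data.
snps = [
    {"FRQ": 0.005, "INFO": 0.9, "P": 0.5},
    {"FRQ": 0.300, "INFO": 0.5, "P": 0.5},
    {"FRQ": 0.995, "INFO": 0.9, "P": 0.5},
    {"FRQ": 0.300, "INFO": 0.9, "P": 0.5},
]
kept = qq_filter(snps)
print(len(kept), lambda_gc([s["P"] for s in kept]))
```

Filtering before computing Lambda matters for the reason given on the slide: rare, poorly imputed variants have little power and pull the median statistic down.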

QQ Example (post imputation)

QQ Example without MAF filter

Lambda plot over Freq
Lambda plot (see preimputation), *frqulo_lama-page1.pdf:
  X-axis: MAF
  Grey: SNP bin size (right y-axis)
  Blue: SNPs with p < 10^-11 (right y-axis)
  Red: Lambda in each bin (red right y-axis)
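The three series of the lambda-over-frequency plot can be sketched as a per-bin summary. This is an illustration only; the number of bins, the p-value threshold default and the (maf, p) input layout are assumptions, not the pipeline's settings.

```python
from collections import defaultdict
from statistics import NormalDist, median

_ND = NormalDist()
MEDIAN_CHI2_1DF = _ND.inv_cdf(0.75) ** 2  # null median of a 1-df chi-square


def lambda_by_maf_bin(snps, n_bins=10, p_threshold=1e-11):
    """Summarise (maf, p) pairs per MAF bin over the range 0 to 0.5.

    Returns {bin_index: {"n_snps", "n_below_threshold", "lambda"}}.
    """
    bins = defaultdict(list)
    for maf, p in snps:
        # Bin width is 0.5 / n_bins; MAF of exactly 0.5 goes into the last bin.
        idx = min(int(maf * 2 * n_bins), n_bins - 1)
        bins[idx].append(p)

    summary = {}
    for idx, p_values in sorted(bins.items()):
        chi2 = [_ND.inv_cdf(p / 2.0) ** 2 for p in p_values]
        summary[idx] = {
            "n_snps": len(p_values),                                    # grey series
            "n_below_threshold": sum(p < p_threshold for p in p_values),  # blue series
            "lambda": median(chi2) / MEDIAN_CHI2_1DF,                   # red series
        }
    return summary


# Hypothetical (maf, p) pairs, not real data.
example = [(0.02, 0.5), (0.03, 0.5), (0.22, 0.5), (0.48, 1e-12)]
per_bin = lambda_by_maf_bin(example)
for idx, row in per_bin.items():
    print(idx, row)
```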

Lambda plot over Freq ex.

Region Plots
All genome-wide significant regions (combining multiple index SNPs)
At least 10 top regions
Black SNP center distinguishes 1KG from HapMap
If 1 index SNP: color and size based on LD (see legend)
If multiple index SNPs: different color scheme for each index SNP
  LD friends get the color of their index SNP, with shade and size based on LD info
Detailed SNP info in the upper right corner (blue letters): snp-name, P, OR, MAF, INFO, directions (left right - missing)
GWAS_catalog in the upper left corner (red numbers)
For examples just use the ricopili website: http://www.broadinstitute.org/mpg/ricopili/

Region plot with two index SNPs

Forest Plots
Index SNPs of all region plots, sorted by SNP name
With basic information in the header:
  Alleles, position, direction column, heterogeneity test results
ngt denotes genotyped (1) or imputed (0).
Numbers in brackets in the frequency columns show sample sizes
Meta-analysis results in bold in the last row
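The last row of a forest plot and the direction column can be illustrated with a generic inverse-variance fixed-effect meta-analysis. The slides do not state which meta-analysis method the pipeline uses, so this is a swapped-in textbook technique, shown under the assumption that SE is the standard error of log(OR). The "+", "-", "?" symbols are likewise an assumed encoding.

```python
import math


def fixed_effect_meta(odds_ratios, standard_errors):
    """Inverse-variance weighted fixed-effect estimate on the log(OR) scale.

    Returns (pooled_or, pooled_se), where pooled_se is the SE of log(pooled_or).
    """
    betas = [math.log(o) for o in odds_ratios]
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    return math.exp(pooled_beta), math.sqrt(1.0 / total_weight)


def direction_string(odds_ratios):
    """One character per dataset: '+' for OR >= 1, '-' for OR < 1, '?' if missing."""
    symbols = []
    for o in odds_ratios:
        if o is None:
            symbols.append("?")
        else:
            symbols.append("+" if o >= 1.0 else "-")
    return "".join(symbols)


# Hypothetical per-dataset results for one index SNP, not real data.
pooled_or, pooled_se = fixed_effect_meta([1.2, 1.2], [0.1, 0.1])
directions = direction_string([1.1, 0.9, None, 1.3])
print(pooled_or, pooled_se, directions)
```

Two datasets that agree on OR = 1.2 pool to the same OR with a smaller standard error, and the direction string gives a compact view of per-dataset agreement without drawing the plot.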

Example of something going wrong (100s of these)

Corresponding forest plot: additional covariate (sex) and one of the cohorts is female only.

LD Score
Basic LD score analysis (daner*ldsc.tar.gz):
  All analyses unconstrained
  All analyses on observed and liability scale
  SNP heritability for meta-analysis
  Genetic correlation to published PGC SCZ52 and other publicly available datasets
  All single commands in cmd.ldsc.txt

Focus on Heterogeneity Test
Manhattan plot (only v2): manhattan*het.pdf
QQ plot: qq*het.pdf
Excel result file: daner*het*1mhc.xls
Forest plots: areas.fo.*.pdf.gz
Used mostly as a QC value.
Interesting if differences between single association results are expected:
  BIP and SCZ
  Male and Female
Caveat: lower power than with directly testing genotypes (but usually with bigger sample size since distinct datasets are usually on distinct platforms)
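For intuition about a heterogeneity p-value, here is a sketch of Cochran's Q, a standard heterogeneity statistic. The slides do not say which test the pipeline applies, so treat this as a generic example; the p-value helper covers only the two-dataset case (for instance male versus female), where Q has 1 degree of freedom.

```python
import math


def cochran_q(betas, standard_errors):
    """Cochran's Q for per-dataset effects (log OR) and their standard errors.

    Returns (q, degrees_of_freedom).
    """
    weights = [1.0 / (se * se) for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    return q, len(betas) - 1


def het_p_two_datasets(q):
    """Upper-tail p-value of a 1-df chi-square, valid for exactly two datasets."""
    return math.erfc(math.sqrt(q / 2.0))


# Hypothetical opposite effects in two datasets, not real data.
q, df = cochran_q([0.2, -0.2], [0.1, 0.1])
het_p = het_p_two_datasets(q)
print(q, df, het_p)
```

Opposite effects of equal size cancel in the combined analysis (pooled effect of zero) while the heterogeneity statistic is large, which is the pattern described for positive heterogeneity findings.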

Rare example of positive heterogeneity finding
Genome-wide significant het-p
Non-significant in combined analysis
No details shown since confidential

Conditional Analysis
Must be done manually right now (see tutorials on website)
Most of the time, LD-independent SNPs are not independent in reality and will lose genome-wide significance (GWS) in conditional analysis.

Example of conditional Analysis, revealing independent index SNPs

Output structure - Overview

Ricopili postimp Directory Structure
.
-dameta_outname
-daner_outname
---da_dataset1
---da_dataset2
-danerjobdir
---errandout
-distribution
---OUTNAME
-----replic
-errandout
-report_outname
---errandout
Daner: association results for all single datasets, separated into genomic chunks
Dameta: meta-analytic results in separate genomic chunks
Danerjobdir: contains lists of all association / meta / score commands; subdir errandout contains job outputs
Distribution: contains subdirs with important summary files from the postimp pipeline run (from report subdir)
Report: working directory for all summarizing and presenting scripts

Output structure - Details

Detailed look at report* (*job_list)
areaplot.job_list: region plots
areator.job_list: clumping
forestplot.job_list: forest plots
ldsc.job_list: ld_score
manhplot.job_list: Manhattan plots
qqplot.job_list: QQ plots
Replace genomic region, index SNPs (or directly via gene-name)

Detailed look at report* (*job_list)
areaplot.job_list: region plots
areator.job_list: clumping
forestplot.job_list: forest plots
ldsc.job_list: ld_score
manhplot.job_list: Manhattan plots
qqplot.job_list: QQ plots
All relevant files are in a tarball in the distribution directory; use different options

Detailed look at report* (*pdf.tar.gz)
Contains all R-scripts for all plots in the directory (up to hundreds)

postimputation options
See https://sites.google.com/a/broadinstitute.org/ricopili/cvas
--help lists a lot of deprecated options

Wrap-up
Questions?
Useful Resources:
Ricopili home page: https://sites.google.com/a/broadinstitute.org/ricopili
Ricopili user group: https://sites.google.com/a/broadinstitute.org/ricopili/users-section
Materials from this workshop: https://sites.google.com/a/broadinstitute.org/pgc-summer-school-2015
