Ricopili: Postimputation Module
WCPG Education Day
Stephan Ripke / Raymond Walters
Toronto, October 2015
Ricopili Overview
postimputation, 12 steps
  1) Association analysis
  2) Meta analysis
  3) Collection of results
  4) Create a separate set for heterogeneity P
  5) Count SNPs for various thresholds, create top lists
  6) Clumping
  7) Region plots
  8) Forest plots
  9) Manhattan plots
  10) QQ plots
  11) LD score
  12) Lambda plots
  1000 individuals: 30 mins; 40,000 individuals: 4 hours
  Step 1 takes the majority of the computing resources of this module; it is possible to start from step 2.
Primary Meta-Analysis, implemented in the pipeline
  Basic association, logistic regression with covariates
    Highly parallelized (N_datasets x N_chunks jobs)
  Meta-analysis
    N_chunks jobs
  Presentation of results
    mostly N_datasets jobs
  Interaction at different levels
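The meta-analysis step can be illustrated with a fixed-effect, inverse-variance-weighted combination of per-dataset log odds ratios for a single SNP. This is a generic sketch, not Ricopili's own code: the slides do not state which tool or weighting scheme the pipeline uses, and the function name and the input values below are made up for illustration.

```python
import math
from statistics import NormalDist

def ivw_meta(odds_ratios, standard_errors):
    """Fixed-effect inverse-variance meta-analysis of one SNP.

    odds_ratios: per-dataset OR; standard_errors: SE of log(OR).
    Returns (combined OR, combined SE of log(OR), two-sided P).
    """
    betas = [math.log(o) for o in odds_ratios]
    weights = [1.0 / se ** 2 for se in standard_errors]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = 2.0 * NormalDist().cdf(-abs(z))
    return math.exp(beta), se, p

# Hypothetical values for three datasets (illustration only)
or_meta, se_meta, p_meta = ivw_meta([1.10, 1.05, 1.20], [0.05, 0.04, 0.10])
print(f"OR={or_meta:.3f} SE={se_meta:.4f} P={p_meta:.2e}")
```

Datasets with smaller standard errors (usually the larger ones) get more weight, so the combined estimate is pulled towards them.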
Secondary Analyses
  Basic changes to primary association analysis:
  Different phenotypes, quantitative phenotypes
  Include, exclude individuals
  Additional covariates, conditional analysis
  Clumping options
  Meta-analysis from summary stats (no genotype access)
  Polygene scoring, leave one out
  LD-score
  (Replication lookup)
  (Integration into ricopili website)
Postimputation Output Files
  Sample composition Excel file (basic*.num.xls):
  Lists all single datasets in the meta-analysis with cases / controls, N_SNPs, lambda-gc, effective sample size (4*nca*nco/(nca+nco))
  Example: SCZ52
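The effective sample size formula on this slide can be written out as a small helper. The function name is hypothetical; only the formula itself comes from the slide.

```python
def effective_sample_size(n_cases, n_controls):
    """Effective sample size as given on the slide: 4*nca*nco/(nca+nco)."""
    return 4.0 * n_cases * n_controls / (n_cases + n_controls)

# Balanced design: equals the total sample size
print(effective_sample_size(1000, 1000))  # 2000.0
# Unbalanced design with the same total: smaller effective size
print(effective_sample_size(500, 1500))   # 1500.0
```

For a fixed total, the value is largest when cases and controls are balanced, which is why it is a more useful summary than the raw N when comparing datasets.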
Postimputation Output Files
  LD clumped result file with detailed information about each index SNP (header lists sample size): daner*1mhc.pdf
  No additional post association QC (MAF, INFO).
  Sorted by p-value, only 1 index SNP for the MHC (chr6: 25-35 Mb)
  Basic: rs-name, chromosome, position
  Association: P, OR, SE, A1, A2, FRQ_ca, FRQ_co, INFO
  Meta-Analytic: Direction column ("poor man's forest plot"), ngt (N_genotyped)
  Genomic:
    Clump R^2 = 0.6 and R^2 = 0.1: LD-friends (incl. distance and LD), region (left, right, size)
    R^2 = 0.6: genes (+50kb) with distance, gwas_catalog
  Add. Info: Numbers in brackets after genes show the distance (in kb) to the index SNP. Numbers in brackets after LD friends show LD and distance to the index SNP.
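The idea behind the clumped result file (index SNPs with their "LD friends") can be sketched as a greedy procedure: take the most significant unassigned SNP as index, attach every SNP in LD with it, and repeat. This is an illustration only. The input layout (a dict of pairwise r^2 values), the function name, and the `p_index` threshold are assumptions, there is no distance window, and the pipeline's real clumping code may differ.

```python
def clump(pvalues, r2, p_index=1e-4, r2_min=0.1):
    """Greedy LD clumping (simplified, no distance window).

    pvalues: dict SNP -> P
    r2: dict frozenset({snp_a, snp_b}) -> r^2 (missing pairs count as 0)
    Returns dict index SNP -> list of LD friends (r^2 >= r2_min).
    """
    clumps = {}
    assigned = set()
    for snp in sorted(pvalues, key=pvalues.get):
        if snp in assigned:
            continue
        if pvalues[snp] > p_index:
            break  # sorted by P, so no further index SNPs can follow
        friends = [other for other in pvalues
                   if other != snp and other not in assigned
                   and r2.get(frozenset((snp, other)), 0.0) >= r2_min]
        clumps[snp] = friends
        assigned.add(snp)
        assigned.update(friends)
    return clumps

# Hypothetical toy data: rsB is in LD with rsA, rsC is independent
pvals = {"rsA": 1e-10, "rsB": 1e-6, "rsC": 1e-8, "rsD": 0.2}
ld = {frozenset(("rsA", "rsB")): 0.8, frozenset(("rsA", "rsC")): 0.05}
print(clump(pvals, ld))  # {'rsA': ['rsB'], 'rsC': []}
```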
Example from SCZ52 daner_pgc_scz52_0513a.gz.p4.clump.areator.sorted.1mhc.xls
Manhattan Plots
  Three different types with different properties (manhattan*pdf):
  *.v2.*: with gene-names and variable y-axis
    Clumps in red and yellow, rest in grey
  *nog2*: no gene-names, variable y-axis
    Clumps in green, rest in brown and blue ("Nature format")
  *nog*: no gene-names, fixed y-axis (at p = 10^-30) for comparing different result sets
    Clumps in red, rest in grey
Examples of V2
V2. SCZ52
*nog2*
*nog2* SCZ52
*nog*
QQ Plots
  qq*pdf:
  With ceiling at p = 10^-12
  Red confidence interval
  No LD pruning
  Including Lambda, Lambda1000, N_snps, sample size
  MAF > 1%, Info > 0.6 (otherwise Lambda artificially deflated due to overrepresentation of rare variants with low power)
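The two inflation statistics printed on the QQ plots can be sketched as follows. Lambda is the usual genomic-control ratio of the median observed 1-df chi-square to its expected median, and Lambda1000 rescales it to a study of 1000 cases and 1000 controls. These are the standard textbook definitions; whether the pipeline computes them in exactly this way is an assumption, and the function names are hypothetical.

```python
from statistics import NormalDist, median

_NORM = NormalDist()
# Median of a 1-df chi-square distribution (about 0.4549)
CHI2_MEDIAN_1DF = _NORM.inv_cdf(0.75) ** 2

def lambda_gc(pvalues):
    """Genomic inflation factor from two-sided P values (0 < P <= 1)."""
    chi2 = [_NORM.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    return median(chi2) / CHI2_MEDIAN_1DF

def lambda_1000(lam, n_cases, n_controls):
    """Rescale lambda to 1000 cases and 1000 controls."""
    scale = (1.0 / n_cases + 1.0 / n_controls) / (2.0 / 1000.0)
    return 1.0 + (lam - 1.0) * scale

# Hypothetical example: lambda of 1.5 in 10,000 cases / 10,000 controls
print(round(lambda_1000(1.5, 10000, 10000), 3))  # 1.05
```

Because inflation from polygenic signal grows with sample size, Lambda1000 is the more comparable number across datasets of different size.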
QQ Example (post imputation)
QQ Example without MAF filter
Lambda plot over Freq
  Lambda plot (see preimputation), *frqulo_lama-page1.pdf:
  X-axis: MAF
  Grey: SNP bin size (right y-axis)
  Blue: SNPs with p < 10^-11 (right y-axis)
  Red: Lambda in each bin (red right y-axis)
Lambda plot over Freq ex.
Region Plots
  All genome-wide significant regions (combining multiple index SNPs)
  At least 10 top regions
  Black SNP center distinguishes 1KG from HapMap
  If 1 index SNP: color and size based on LD (see legend)
  If multiple index SNPs: different color scheme for each index SNP; LD friends get the color of their index SNP, with shade and size based on LD info
  Detailed SNP info in upper right corner (blue letters): snp-name, P, OR, MAF, INFO, directions (left right - missing)
  GWAS_catalog in upper left corner (red numbers)
  For examples, just use the ricopili website: http://www.broadinstitute.org/mpg/ricopili/
Region plot with two index SNPs
Forest Plots
  Index SNPs of all region plots, sorted by snp name
  With basic information in the header: alleles, position, direction column, heterogeneity test results
  ngt denotes genotyped (1) or imputed (0)
  Numbers in brackets in the frequency columns show sample sizes
  Meta-analysis results in bold in the last row
Example of something going wrong (100s of these)
Corresponding forest plot
  Additional covariate (sex), and one of the cohorts is female only.
LD Score
  Basic LD score analysis (daner*ldsc.tar.gz):
  All analyses unconstrained
  All analyses on observed and liability scale
  SNP heritability for meta-analysis
  Genetic correlation to published PGC SCZ52 and other publicly available datasets
  All single commands in cmd.ldsc.txt
Focus on Heterogeneity Test
  Manhattan plot (only v2): manhattan*het.pdf
  QQ plot: qq*het.pdf
  Excel result file: daner*het*1mhc.xls
  Forest plots: areas.fo.*.pdf.gz
  Used mostly as a QC value.
  Interesting if differences between single association results are expected:
    BIP and SCZ
    Male and Female
  Caveat: lower power than with directly testing genotypes (but usually with bigger sample size, since distinct datasets are usually on distinct platforms)
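A common way to test heterogeneity between per-dataset effects is Cochran's Q, sketched below together with I^2. The slides do not say which statistic the pipeline reports as heterogeneity P, so treat this as a generic illustration; the function names and the example values are hypothetical.

```python
import math

def cochran_q(betas, standard_errors):
    """Cochran's Q, its degrees of freedom, and I^2 for one SNP.

    betas: per-dataset effects, e.g. log(OR); standard_errors: their SEs.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    beta_fixed = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - beta_fixed) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2

def het_p_two_datasets(q):
    """P value of Q for exactly two datasets (chi-square with 1 df)."""
    return math.erfc(math.sqrt(q / 2.0))

# Hypothetical male / female effects with opposite direction
q, df, i2 = cochran_q([0.2, -0.2], [0.1, 0.1])
print(round(q, 3), df, round(i2, 3), round(het_p_two_datasets(q), 4))
```

In this toy case the combined effect is zero while Q is large, which is the pattern of a heterogeneity finding that is invisible in the combined analysis.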
Rare example of positive heterogeneity finding
  Genome-wide significant het-p
  Non-significant in combined analysis
  No details shown since confidential
Conditional Analysis
  Must be done manually right now (see tutorials on website)
  Most of the time, SNPs that look independent by LD are not truly independent and will lose genome-wide significance (GWS) in conditional analysis.
Example of conditional Analysis, revealing independent index SNPs
Output structure - Overview
Ricopili postimp Directory Structure
  .
  -dameta_outname
  -daner_outname
  ---da_dataset1
  ---da_dataset2
  -danerjobdir
  ---errandout
  -distribution
  ---OUTNAME
  -----replic
  -errandout
  -report_outname
  ---errandout
  Daner: association results for each single dataset, separated into genomic chunks
  Dameta: meta-analytic results in separate genomic chunks
  Danerjobdir: contains lists of all association / meta / score commands; subdir errandout contains job outputs
  Distribution: contains subdirs with the important summary files from the postimp pipeline run (from report subdir)
  Report: working directory for all summarizing and presenting scripts
Output structure - Details
Detailed look at report* (*job_list)
  areaplot.job_list: region plots
    Replace genomic region, index SNPs (or directly via gene-name)
  areator.job_list: clumping
    Use different clumping thresholds (p/r2)
  forestplot.job_list: forest plots
    Replace snp-name (needs positional information)
  ldsc.job_list: ld_score
    All relevant files in tar ball in distribution directory, use different options
  manhplot.job_list: manhattan plots
    Use different parameters (gene-names, thresholds)
  qqplot.job_list: QQ plots
    Different parameters, e.g. ceiling effect
Detailed look at report* (*pdf.tar.gz) Contains all R-scripts for all plots in the directory (up to hundreds)
postimputation options
  See https://sites.google.com/a/broadinstitute.org/ricopili/cvas --help lists a lot of deprecated options
Wrap-up
  Questions?
  Useful Resources:
  Ricopili home page: https://sites.google.com/a/broadinstitute.org/ricopili
  Ricopili user group: https://sites.google.com/a/broadinstitute.org/ricopili/users-section
  Materials from this workshop: https://sites.google.com/a/broadinstitute.org/pgc-summer-school-2015