This page provides the documentation and data for estimating association statistics between gene expression and phenotypes from summary-level GWAS data. We provide precomputed gene expression weights from three studies to facilitate analysis. Please refer to the following manuscripts for details:
Gusev et al. “Integrative approaches for large-scale transcriptome-wide association studies” 2016 Nature Genetics |
Pasaniuc et al. “Fast and accurate imputation of summary statistics enhances evidence of functional enrichment” 2014 Bioinformatics |
See also http://bogdan.bioinformatics.ucla.edu/software/impg/ for details on IMPG
For questions or comments, contact Alexander Gusev [agusev@hsph.harvard.edu] or Bogdan Pasaniuc [pasaniuc@ucla.edu].
Due to large size, all code and datasets are hosted at https://data.broadinstitute.org/alkesgroup/TWAS
The TWAS.*.tar.bz2
package contains the binary files and scripts for computing TWAS statistics from pre-computed expression weights, as well as scripts for computing expression weights from your own expression + SNP data.
bin/TWAS.sh
script executes the association pipeline using pre-computed weights.bin/TWAS_get_weights.sh
script executes the pipeline for computing expression weights from PLINK-format genotype dataThe scripts rely on having R installed (and callable by R
), as well as two pre-compiled binaries.
This analysis uses the precomputed expression weights and requires only GWAS summary data (Z-scores or effect sizes and standard errors). A separate package contains the precomputed expression weights for each cohort. The example below
shows how to run the TWAS with pre-computed weights from the CCDC101 gene and summary data (Z-scores) from the Locke et al. GWAS. To run the full TWAS, repeat this procedure for every gene present in the WEIGHTS_*/
directory.
Extract this file (and any other WEIGHTS*bz2 files):
tar -xjf WEIGHTS_YFS.tar.bz2
This will create a WEIGHTS_YFS/
directory with subdirectories corresponding to each significantly heritable gene in the YFS cohort. You can then execute a single gene association call as follows:
bash bin/TWAS.sh \
{pointer to pre-computed expression weights for this gene} \
{pointer to Z-score file for this locus} \
{pointer to output}
For example, to run this using the GWAS summary statistics from the Locke et al. BMI study on gene CCDC101:
bash bin/TWAS.sh \
WEIGHTS_YFS/CCDC101/CCDC101.wgt \
WEIGHTS_YFS/CCDC101/CCDC101.LOCKE_BMI.GWAS.zscore \
CCDC101.LOCKE_BMI.TWAS
This will generate the file CCDC101.LOCKE_BMI.TWAS.imp
, which contains the imputed Z-score and the precision of the Z-score as second to last and last entries. The final TWAS Z-score is computed as $5/sqrt($6)
from this file. The output should be identical to the file WEIGHTS_YFS/CCDC101/CCDC101.LOCKE_BMI.TWAS.imp
Each directory contains the following files for a given gene:
*.wgt.map | The map for SNPs at the cis locus, with header row indicating format. |
*.wgt.cor | The SNP-Expression correlation (i.e. marginal eQTL statistics), one line per SNP. |
*.wgt.ld | The SNP-SNP correlation/LD matrix, square. |
*.LOCKE_BMI.GWAS.zscore | The GWAS summary statistics (from Locke et al. GWAS of BMI) for the cis locus, with header row indicating format: [SNP] [POSITION] [ALLELE1] [ALLELE2] [Z-SCORE] |
*.LOCKE_BMI.TWAS.imp | The imputed TWAS Z-score for this locus (i.e. the value we’re interested in). |
The expression weight data includes summary statistics for each locus from the Locke et al. BMI study. Make sure the summary statistics you use match the strand allele codes as well as the physical positions for the SNPs at each locus. If in doubt, remove any strand ambiguous (A-T, G-C) SNPs. The summary statistics can be partially overlapping with the expression weight data, but they must have the header and be in sorted physical order.
We provide a script to facilitate matching between your GWAS summary statistics and the pre-computed reference expression data which can be downloaded here. The script takes a genome-wide Z-score file in the above format and orients it against alleles in the reference data (also provided). You can then subset this file for each gene you are running the TWAS on. Note: this will also remove ambiguous A/T and G/C SNPs.
Due to large size, weights can be downloaded from https://data.broadinstitute.org/alkesgroup/TWAS
The WEIGHTS_*tar.bz2
contain bzipped tarballs of pre-computed expression weights from three studies. Expression weights were computed from the BLUP/genetic value of expression.
The Netherlands Twin Registry (NTR) based on expression from [Wright et al. 2014 Nat Genet] |
1,103 significantly heritable genes, computed from 1,247 unrelated post-QC samples expression measured by microarray in blood |
METabolic Sindrome In Men (MET/METSIM) |
2,146 significantly heritable genes, computed from 563 unrelated post-QC samples with expression measured by RNA-seq in adipose tissue. Note: The exact weights used in the paper are available in the old_versions/ directory; weights in this directory include additional samples that were initially excluded due to label-swaps. |
Young Finns Study (YFS) |
3,832 significantly heritable genes, computed from 1,264 unrelated post-QC samples expression measured by microarray in blood. |
Weights from genes that were significantly heritable in any cohort are additionally available in the ETC/WEIGHTS_NONSIGNIFICANT
directory.
These weights were not used in the paper but are provided for further analyses.
Warning, the extracted files are large (>1GB). All data is in flat, human readable files.
The script for computing expression weights yourself works one gene at a time, taking as input a standard PLINK format file (ped/map, not binary) which contains only the desired SNPs in the cis-locus of the gene (or any other desired SNPs) and expression of the gene set as the phenotype (6th field of the ped file).
The script is executed as follows:
bash bin/TWAS_get_weights.sh {pointer to prefix for PLINK file} {pointer to output}
for example, assuming you have a CCDC101.ped and CCDC101.map file (not provided) the following command would re-generate the expression weights:
bash bin/TWAS_get_weights.sh CCDC101 WEIGHTS_YFS/CCDC101/CCDC101.wgt
For maximum power, we recommend computing the genetic value of expression using BLUP/BSLMM and setting that as the phenotype. WARNING: No additional filtering is done to the included SNPs or the expression.
We proposed an N degree of freedom test for effect across N tissues after adjusting for predictor correlation (as estimated in the YFS data).
To reproduce results from the BMI (Locke et al) TWAS, simply run:
R --slave < OMNIBUS.R
The file reads in prediction correlations from OMNIBUS.CORR.dat
and individual reference TWAS results from OMNIBUS.ZSCORES.dat
. Both files contain headers describing their contents.
The output looks like the following:
LOCKE.BMI | AATK | 2 | 0.2120281 |
LOCKE.BMI | ABCC3 | 2 | 0.8500244 |
LOCKE.BMI | ABR | 2 | 0.7403101 |
LOCKE.BMI | ACAP1 | 2 | 0.5919793 |
LOCKE.BMI | ACCS | 3 | 0.1391507 |
Where the columns correspond to <GWAS> <GENE> <# measures/degrees-of-freedom> <P-value>
To compute omnibus statistics using your own TWAS results, simply replace
OMNIBUS.ZSCORES.dat
with a concatenation of your TWAS Z-scores in the
same format, where each row is:
<name of the reference (YFS/NTR/MET)> <GWAS study> <Gene name> <TWAS Z-score>
The script automatically identifies the duplicate entries for eah GWAS:gene combination and performs the corresponding N-dof test.
To compute statistics using your own weights and TWAS results, replace
OMNIBUS.CORR.dat
with the pairwise correlation for your predictors in the
same format, where each row is:
<expression 1 name> <expression 2 name> <gene name> <signed Pearson correlation>
These can be computed by predicting from all reference panels into a single cohort and computing the pearson correlation for every pair of predictors for every gene. The single cohort should be selected to have linkage disequilibrium patterns similar to the TWAS data where the tests will be applied.
The script automatically identifies the number of reference panels and constructs the corresponding NxN correlation matrix.
We have not evaluated the performance of this test on genes that are only significantly heritable in a single cohort.
Because this is a MANOVA, it is much more susceptible than GWAS to highly significant associations when the inputs were computed incorrectly. For example, individually weak TWAS Z-scores in opposite directions can become highly significant for predictors that are expected to be correlated in the same direction. We recommend care when performing the test (i.e. getting many significant genes is not an indication that everything worked correctly) and to double-check that marginal TWAS Z-scores are at least weakly consistent with an association.
How can I validate the TWAS associations?
We recommend the same procedures for confirming TWAS associations as have been used in GWAS analyses: replication in an external study. This can be done at the level of individual associations, or in aggregate using a gene-based risk score (see manuscript for details). Additionally, if expression weights were computed using this program then the 6th field of the .imp
file corresponds to the expected accuracy of the prediction on 0-1 scale. Seeing many genes with values well below 0.9 may be an indication of problems with the summary statistics or mismatching LD.
We also proposed a permutation test in the manuscript the evaluates how significant the contribution from expression data is on top of the GWAS signal. Loci that pass this permutation test show evidence of heterogeneity that’s captured by the expression, and are less likely to be chance co-locoalization.
Lastly, the top eQTL in a gene is expected to explain a substantial fraction of the signal, so individual inspection of eQTLs and conditional analysis can help elucidate the underlying causal variant.
How should I interpret the effect direction?
The TWAS effect-size is an estimate of the genetic covariance between gene expression and the GWAS trait, so the direction of effect is informative of this relationship. If the phenotypes have not been treated unusually, then a positive effect means over-expression of the gene leads to an increase in the phenotype (and vice versa).
What if the GWAS statistics have highly variable sample sizes?
In principle having different sample sizes will violate the assumptions made in the IMPG/TWAS approaches and may lead to spurious results. Conservatively, we would recommend to restrict analyses to SNPs that have similar sample sizes, though we have not observed substantial differences in practice. If you do see differences, email us and we can work towards more specific recommendations.
What else is out there?
TWAS has analogs to co-localization and Mendelian randomization (with gene expression as one of the traits). COLOC from the Wallace lab performs Bayesian co-localization analyses. There are many approaches to Mendelian randomization, but this code from Stephen Burgess is a great primer; as well as the SMR/HEIDI test from the Yang lab. The MetaXcan and PrediXcan suite of tools from the Im lab performs gene-based association tests with and without summary data.
You may also be interested in HESS to estimate local heritability; PAINTOR to fine-map causal variants; LD-score Regression to estimate genome-wide heritability and genetic correlation.
2016/03/22 | Added omnibus test and weights in ETC/OMNIBUS |
2016/02/24 | Minor update to debug output (monomorphics were not being reported) |
2016/02/23 | Previous versions had the direction of effect flipped, this did not impact significance but old Z scores must be multipled by -1 to be consistent with individual-level imputation. This has been corrected. |
2016/02/09 | Updated conversion script to automatically remove monomorphic SNPs |
2015/12/10 | Updated conversion script to deal with char alleles |
2015/12/07 | First official release |
Logo by Ryan Beck from The Noun Project