Download

EUGENE is implemented in C++ (64 bit; unix only). A binary file can be downloaded by clicking the button below.

Please enter your email address before downloading. This will be used to monitor software usage and, very occasionally, to notify you of major updates. If you do not want to receive any emails about updates, you can enter any text below instead of an email address.

In the current stand-alone version of EUGENE, there are three analysis options (use '--help' for a list of options):

(1) Gene-based association analysis of expression quantitative trait loci (eQTLs)

Summary of procedure. For a given gene, EUGENE will first extract all n independent eQTLs listed in the eQTL information file (input file #2 below). Then, for each eQTL, it will identify which (if any) of the listed proxies are present in the results file and extract the P-value for the proxy in strongest LD with the eQTL. For example, if gene A has n independent eQTLs, EUGENE will extract m (with 0 < m <= n) P-values from the results file for this gene, corresponding to the most correlated available proxy for each of the independent eQTLs listed for gene A. The association P-value provided for each of the m independent eQTLs (or proxies) is then converted to a chi-square statistic, and a gene-based sum statistic (S_gene) is calculated by adding all m chi-square statistics. The statistical significance of S_gene is then obtained using one of two possible approaches:
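
A minimal Python sketch of the sum statistic described above (an illustration only, not EUGENE's actual C++ code). It assumes that each P-value is two-sided and maps to a chi-square statistic with 1 degree of freedom, that P-values lie in (0, 1], and that the proxies for each eQTL are listed by decreasing LD (with the eQTL itself possibly listed first). The function and variable names are hypothetical.

```python
from statistics import NormalDist


def p_to_chisq(p):
    """Convert a P-value in (0, 1] to a 1-df chi-square statistic (df is an assumption)."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; the sign vanishes once squared
    return z * z


def gene_sum_statistic(eqtl_proxies, gwas_pvalues):
    """Compute S_gene for one gene.

    eqtl_proxies: {eQTL: [proxy, ...]} with proxies sorted by decreasing LD.
    gwas_pvalues: {SNP: P-value} taken from the GWAS results file.
    Returns (S_gene, SNPs used); eQTLs without any available proxy are skipped.
    """
    s_gene, used = 0.0, []
    for eqtl, proxies in eqtl_proxies.items():
        # First listed proxy that is present in the GWAS results, if any.
        best = next((snp for snp in proxies if snp in gwas_pvalues), None)
        if best is None:
            continue
        s_gene += p_to_chisq(gwas_pvalues[best])
        used.append(best)
    return s_gene, used


# Made-up example: three independent eQTLs, two of which can be tested (m = 2, n = 3).
proxies = {"rs1": ["rs1", "rs1a"], "rs2": ["rs2", "rs2a"], "rs3": ["rs3"]}
pvalues = {"rs1a": 0.05, "rs2": 0.01}
s_gene, used = gene_sum_statistic(proxies, pvalues)
```

In this made-up example, rs1 is absent from the results, so its proxy rs1a is used instead, and rs3 does not contribute to S_gene.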

Simulations (flag '--assoc [n_gene_sim]'). In this case, a dummy phenotype is simulated for each individual provided in the genotyped dataset, and tested for association with each SNP. This produces GWAS results under the null hypothesis of no association with all SNPs provided. For each gene, we then compute the S_gene sum statistic as described above and check whether this exceeds that observed with the real GWAS. This procedure is repeated eg. 1 million times (defined by n_gene_sim parameter). We use an adaptive simulation procedure, so that the number of simulations performed per gene (up to n_gene_sim) depends on how significant the S_gene statistic is expected to be. Using --assoc can be slow when testing a large number of genes in a GWAS with large N. For example, on a single CPU, it could take a couple of days to test 14,000 genes in a GWAS with 8 million SNPs tested in 150K people, with up to 1 million simulations per gene to assess significance.
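
The adaptive idea can be illustrated with the Python sketch below. It is a simplified illustration, not the procedure coded in EUGENE: the stopping rule (stop at 100, 1000, 10000, ... simulations once at least `min_exceed` null statistics have reached the observed one) and all names are assumptions, and `draw_null_stat` stands in for the step that simulates a dummy phenotype and recomputes S_gene.

```python
def adaptive_empirical_p(s_observed, draw_null_stat, max_sims, min_exceed=10):
    """Estimate an empirical P-value for an observed S_gene.

    draw_null_stat: callable returning one S_gene value simulated under the null.
    max_sims: maximum number of simulations (n_gene_sim), assumed to be >= 1.
    Returns (P-value estimate, number of simulations performed).
    """
    exceed, n, checkpoint = 0, 0, 100
    while n < max_sims:
        n += 1
        if draw_null_stat() >= s_observed:
            exceed += 1
        if n == checkpoint:
            if exceed >= min_exceed:
                break  # clearly non-significant gene: no need for more simulations
            checkpoint *= 10
    # An estimate of 0 should be read as P < 1/n.
    return exceed / n, n
```

With this rule, a gene whose observed statistic is often exceeded under the null stops after 100 simulations, whereas a strongly associated gene uses the full simulation budget.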

To perform gene-based association analysis of eQTLs, with significance determined by simulations, the command line should look like:

./eugene_1.3b --assoc [n_gene_sim] --gwas-results [file1] --eqtl-proxies [file2] --map [file3] --bfile [file4] --out example

The four required input files are described below.

Satterthwaite's approximation (flag '--fastbat'). Recently, Bakshi et al. 2016 noted that the significance of a gene-based sum statistic (ie. S_gene) can be estimated reliably and much more efficiently using Satterthwaite's approximation, which requires estimating the correlation between the variants used to calculate S_gene. The authors implemented this approach in GCTA. Given how much faster this approach is (eg. the dataset described above would take 2-3 minutes to analyse, instead of a couple of days), I've implemented the GCTA-fastBAT code in EUGENE, and recommend this as the default approach to estimate significance. If you use this approach, please cite both the EUGENE and GCTA-fastBAT papers in your work.
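
For intuition, the textbook form of Satterthwaite's approximation matches the first two moments of S_gene to a scaled chi-square distribution. If R is the correlation (LD) matrix of the m SNPs, then under the null E[S_gene] = tr(R) = m and Var[S_gene] = 2 tr(R^2), so S_gene / a is approximately chi-square with d degrees of freedom, where a = tr(R^2) / tr(R) and d = tr(R)^2 / tr(R^2). The Python sketch below computes a and d; it illustrates the general technique under these assumptions and is not the GCTA-fastBAT code itself, and the function name is hypothetical.

```python
def satterthwaite_scale_df(ld_matrix):
    """Return (a, d) such that S_gene / a is approximately chi-square with d df.

    ld_matrix: symmetric m x m list of lists of correlations (r, not r2)
    between the SNPs that contribute to S_gene.
    """
    m = len(ld_matrix)
    trace_r = sum(ld_matrix[i][i] for i in range(m))  # E[S_gene] = tr(R)
    # tr(R^2) equals the sum of squared entries for a symmetric matrix.
    trace_r2 = sum(ld_matrix[i][j] ** 2 for i in range(m) for j in range(m))
    scale = trace_r2 / trace_r
    df = trace_r ** 2 / trace_r2
    return scale, df
```

The P-value is then the upper tail of a chi-square distribution with d degrees of freedom evaluated at S_gene / a (for example with scipy.stats.chi2.sf). With uncorrelated SNPs this reduces to a standard chi-square with m degrees of freedom; with two perfectly correlated SNPs it reduces to twice a 1-df chi-square.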

To perform gene-based association analysis of eQTLs, with significance determined using Satterthwaite's approximation, the command line should look like:

./eugene_1.3b --fastbat --gwas-results [file1] --eqtl-proxies [file2] --map [file3] --bfile [file4] --out example

Input files. You need four input files to run this analysis:

File 1: GWAS results file. This file (read by the command --gwas-results) lists SNP association results and is expected to contain two columns: SNP name and association P-value. You must provide this file. A header line is not required, but if you include one, please name the first column "SNP".

COL1: SNP name (rs#)
COL2: P-value
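
If you want to check your results file before running EUGENE, a short Python sketch such as the one below can be used. It assumes the file is whitespace-delimited (the delimiter is an assumption) and that an optional header starts with "SNP"; the function name is hypothetical.

```python
def read_gwas_results(path):
    """Return {SNP: P-value} from a two-column, whitespace-delimited results file."""
    pvalues = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank or incomplete lines and the optional header line.
            if len(fields) < 2 or fields[0] == "SNP":
                continue
            pvalues[fields[0]] = float(fields[1])
    return pvalues
```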

File 2: eQTL information file. The second file required is read by the command --eqtl-proxies. Each row in this file represents a proxy SNP (col 3) that is correlated with an independent eQTL (col 2) for a given gene (col 1). For a given gene, different eQTLs listed in col 2 have an r2<0.1 between each other. SNPs listed in col 3 are a proxy (r2>0.8) for the eQTL listed in col 2. This file must contain 3 columns:

COL1: Gene name 
COL2: Independent eQTL (or index eQTL)
COL3: Proxy SNP (sorted by decreasing LD with eQTL)
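
As an illustration of how this three-column layout maps onto genes, eQTLs and proxies, the Python sketch below groups the rows into a nested dictionary while keeping the file order of the proxies (decreasing LD). It assumes a whitespace-delimited file with no header; the function name is hypothetical.

```python
def read_eqtl_proxies(path):
    """Return {gene: {eQTL: [proxies in file order]}} from a three-column file."""
    genes = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or incomplete lines
            gene, eqtl, proxy = fields[:3]
            # Python dictionaries keep insertion order, so the proxy ranking is preserved.
            genes.setdefault(gene, {}).setdefault(eqtl, []).append(proxy)
    return genes
```

The number of keys under a gene corresponds to n, the number of independent eQTLs listed for that gene.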

The files available through the links below contain eQTLs identified in published GWAS of gene expression levels in different tissues. If your tissue of interest (or combination of tissues) is not available below, and there are published eQTL studies using that tissue, please let me know and I'll make input files available for that tissue.

To create the files provided below, I used the following procedure:

(1) Make a list of SNPs associated with gene-expression in cis (at a Bonferroni-corrected P<8.9x10^-10 [<1 Mb] in the most recent database; trans eQTLs have not been included, but can be if requested), considering all tissues/eQTL studies indicated;

(2) Reduce that list to a set of 'independent' expression-associated SNPs (r2<0.1), using the --clump procedure in PLINK;

(3) Identify SNPs in LD (r2>0.8) with each of the independent eQTLs, which can be used as proxies in the event the actual independent eQTL was not tested in the GWAS (these SNPs are listed in the *.proxies.list files).

File 3: Gene positions file. The third file (read by the command --map) lists the chromosomal positions of each gene. This file is also provided below (b37). Each row in this file represents a gene with two columns:

COL1: Gene name
COL2: chr:start-end (with positions from b37)
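
The position field can be split into its parts as in the Python sketch below. The function name and the example coordinates are made up for illustration.

```python
def parse_gene_position(position):
    """Split a 'chr:start-end' string into (chromosome, start, end)."""
    chrom, span = position.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


# Made-up coordinates, not a real gene.
chrom, start, end = parse_gene_position("1:1000-2000")
```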

File 4: SNP genotype dataset (plink binary format). You must provide EUGENE with actual genotype data (using the --bfile command) that will be used for simulations (in --assoc mode) or to estimate LD between SNPs (in --fastbat mode). This file should contain all SNPs included in the eQTL information file (file 2 above). A file with SNP data for 1000G European individuals (294 individuals, release 20130502_v5a) can be downloaded using the links below. You must use the fileset provided that matches the release date of the eQTL information file used (see above), to avoid problems running EUGENE. If you want to use your own genotype data (eg. with larger N), best to first create a single dataset including all chromosomes, but restricting to the set of SNPs included in the appropriate *.bim file available below.

Input files available for download:

Dataset release date | eQTL information (File 2) | Map file (File 3) | Genotype data (File 4) | Reference describing dataset (PMID)
2016-04-13           | Select tissue             | ENSEMBL b37       | bed bim fam            | 27554816
2017-03-17           | Select tissue             | GENCODE-v19 b37   | bed bim fam            | NA
2017-05-17           | Select tissue             | GENCODE-v19 b37   | bed bim fam            | 29679657
2018-06-27           | Select tissue             | GENCODE-v19 b37   | bed bim fam            | Under review

 

Output files. This analysis will produce four output files, in addition to the log file.

Gene-based association results (file *.eugene.out). For each gene present in both the eQTL information and map files, EUGENE will produce the following output:

Gene			: gene name
Position		: gene position (b37)
N_ind_eQTLs		: number of independent eQTLs listed in the eQTL information file
N_ind_eQTLs_tested	: number of independent eQTLs that were also present (or tagged by a proxy) in the results file
N_ind_eQTLs_sign	: number of independent eQTLs (or proxy) that had a P-value < 0.05 in the results file
Best_eQTL		: independent eQTL present (or tagged by a proxy) in the results file with the most significant P-value
Best_eQTL_proxy		: name of proxy that tagged the Best_eQTL
Best_eQTL_proxy_P	: P-value in the results file for the proxy that tagged the Best_eQTL
Gene_based_P		: Gene-based empirical P-value
N_simulations		: Number of simulations performed to calculate empirical P-value ('0' if in --fastbat mode)
eQTLs_tested		: Actual independent eQTLs (or proxies) included in the gene-based test

Lists of independent eQTLs included in gene-based test. For each gene present in both the eQTL information and map files, EUGENE will write out a list of the actual independent eQTLs (or proxies) included in the gene-based test. Three lists are produced (*.set1, *.set2 and *.proxies), that only differ in the file format (long [eg used by PLINK], wide [one line per gene] and same as the input eQTL proxies file [one line per eQTL per gene], respectively).

Example. For gene-based analysis of eQTLs identified in eQTL studies of whole blood, using up to 1 million simulations per gene to estimate significance, use:

./eugene_1.3b --assoc 1000000 --gwas-results example.txt --eqtl-proxies WHOLEBLOOD.20170317.eqtl.proxies.list --map gencode.v19.gene.list.b37 --bfile 1000G.20170317 --out example

Instead, if you want to use the much faster --fastbat option, then use:

./eugene_1.3b --fastbat --gwas-results example.txt --eqtl-proxies WHOLEBLOOD.20170317.eqtl.proxies.list --map gencode.v19.gene.list.b37 --bfile 1000G.20170317 --out example

(2) Estimation of empirical FDR thresholds to account for multiple testing (--fdr)

Summary of procedure. For a given P-value threshold t (eg. 5x10^-4), FDR is approximated by the expected number of genes with a P-value less than t when the null is true (E[Ft]), divided by the expected total number of genes with a P-value less than t (E[St]) [for details, see Storey and Tibshirani, PNAS 2003]. A simple estimate of E[St] is the observed St, that is, the number of genes with a P-value less than t in the analysis of the real-world GWAS. To estimate E[Ft], we simulate a GWAS under the null hypothesis of no association between a dummy phenotype and any SNP, apply EUGENE and count the number of genes with a P-value less than t. We repeat this process 100 times, and estimate the mean number of genes significant at the threshold t across all 100 simulated GWAS (this represents E[Ft]). E[Ft] / E[St] is an estimate of the FDR when t is used to call a gene significant. The P-value threshold that results in an FDR closest to eg. 5% can then be identified from the output file. At that threshold t, 5% of genes called significant in the real-world GWAS are expected to be false-positives given multiple testing.
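
The calculation can be sketched in a few lines of Python. This is an illustration of the ratio described above, not EUGENE's own code; the function name and the toy P-values are hypothetical.

```python
def empirical_fdr(observed_p, null_p_runs, thresholds):
    """Estimate the FDR at each P-value threshold t.

    observed_p: gene-based P-values from the real GWAS.
    null_p_runs: one list of gene-based P-values per GWAS simulated under the null.
    Returns a list of (t, F_t, S_t, FDR_t); FDR_t is None when S_t is 0.
    """
    rows = []
    for t in thresholds:
        s_t = sum(p < t for p in observed_p)  # genes significant in the real GWAS
        # Mean number of genes significant at t across the simulated null GWAS.
        f_t = sum(sum(p < t for p in run) for run in null_p_runs) / len(null_p_runs)
        fdr_t = f_t / s_t if s_t else None
        rows.append((t, f_t, s_t, fdr_t))
    return rows


# Toy input: 4 genes in the real GWAS and 2 simulated null GWAS.
rows = empirical_fdr([0.001, 0.002, 0.5, 0.9], [[0.003, 0.6], [0.7, 0.8]], [0.01])
```

Here two genes pass t = 0.01 in the real GWAS and, on average, 0.5 genes pass it under the null, giving an estimated FDR of 0.25 at that threshold.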

Usage. To estimate FDR thresholds, use:

./eugene_1.3b --fdr [n_gwas_sim] --eugene-results [file1] --bfile [file2] --out example

Where n_gwas_sim is the total number of GWAS (recommended: 100) that will be simulated to estimate E[Ft] (see above). The two required input files are described below.

Input files. You need two input files to run this analysis:

File 1: EUGENE gene-based results. This file (read by the command --eugene-results) is the main output file generated by the --assoc or --fastbat analyses described above. It is best not to change that output file, as EUGENE expects to find specific fields (eg. Gene name, Gene_based_P) in specific columns.

File 2: SNP dataset (plink binary format). As described above for analysis (1).

Output files. This analysis will create one output file, in addition to the log file.

Empirical FDR results (*.eugene.fdr). For a given P-value threshold t (column 1), EUGENE will produce the following output:

t	: P-value threshold
F(t)	: Mean number of genes significant at P-value threshold t under the null hypothesis of no association (average across number of GWAS simulations performed)
S(t)	: Total number of genes significant at P-value threshold t in the observed GWAS (according to results read in by --eugene-results)
FDR(t)	: Given by F(t) / S(t), that is, the proportion of genes significant at P-value threshold t in the observed GWAS that are likely to be false-positives given multiple testing.

Example. To estimate FDR thresholds based on 100 GWAS simulated under the null, use:

./eugene_1.3b --fdr 100 --eugene-results example.eugene.out --bfile 1000G.20170317 --out example1

(3) Annotation of EUGENE gene-based results (--annotate)

Summary of procedure. After running a gene-based analysis, one is typically interested in comparing the effect a particular SNP has on the trait of interest with the effect it has on gene expression. For a given eQTL tested, is the allele that increases gene expression associated with increased or decreased trait values / disease risk? And is that pattern observed for other eQTLs of the same gene? What about across different tissues? To streamline this analysis, use the --annotate option in EUGENE. To run the analysis, you need to provide a file containing a list of genes to annotate (argument to --annotate), in addition to four other input files (described below). For each gene, EUGENE will then identify each of the eQTLs tested and print out the direction of effect for a given allele on the trait/disease and on gene expression (when available).

Usage. To annotate results from a previous EUGENE run (--assoc or --fastbat), use:

./eugene_1.3b --annotate [file1] --eugene-results [file2] --eqtl-proxies [file3] --eqtl-database [file4] --gwas-results [file5] --snp-effect --out example

Input files. You need five input files to run this analysis:

File 1: List of genes to annotate. This file (read by the command --annotate) must contain a single column, with each row listing a gene name (eg. IL6R) for annotation. For example, you could select all genes with a gene-based association significant at FDR 5% in a previous EUGENE run.

File 2: EUGENE gene-based results. Output file (*.eugene.out) produced by --assoc or --fastbat analyses.

File 3: eQTL information file. Same format as described above for --assoc/--fastbat analyses.

File 4: eQTL database. This is a new file that was not required for --assoc/--fastbat or --fdr analyses. It includes direction of effect on gene expression for eQTLs listed in File 3, when available. Often eQTL studies do not list the effect allele for a given eQTL, and so for such studies this analysis is not informative. There is one eQTL database file for each eQTL information file listed above; these can be downloaded by following the "Select tissue" link in the download section above. For example, if you used "--eqtl-proxies WHOLEBLOOD.20170317.eqtl.proxies.list", now you should also use "--eqtl-database WHOLEBLOOD.20170317.eqtl.database".

File 5: GWAS results file. The format of this file is similar to that described for the --assoc analysis, but now you need to include three additional columns:

COL1: SNP name (rs#)
COL2: P-value
COL3: Effect allele
COL4: Non-effect allele
COL5: Beta (NOTE: values <0 and >0 are taken as decreasing and increasing the trait/disease risk, respectively. If the SNP effect is expressed as an Odds Ratio, you must first convert it to beta [log(OR)]).

So that EUGENE knows to expect 5 columns, instead of 2, you also need to include the '--snp-effect' option.
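
If your GWAS reports odds ratios, the conversion to beta is the natural logarithm of the odds ratio, as in this short Python sketch (the function name is hypothetical):

```python
import math


def odds_ratio_to_beta(odds_ratio):
    """Convert an odds ratio (> 0) to beta = ln(OR)."""
    return math.log(odds_ratio)
```

An odds ratio above 1 gives a positive beta (increased disease risk), an odds ratio below 1 gives a negative beta, and an odds ratio of exactly 1 gives a beta of 0.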

Output files. This analysis will create two output files, in addition to the log file.

Full annotation results (file *.eugene.annotation). For each gene included in file1, EUGENE will produce the following output:

Gene			: Gene name
Gene_based_P		: Gene-based association P-value (extracted from file2, previously calculated with --assoc or --fastbat)
Proxy_tested		: SNP included in the gene-based test, which was in LD (r2>0.8) with eQTL (listed in COL8)
Proxy_GWAS_pvalue	: Association P-value for that SNP in your input GWAS (extracted from file5)
Proxy_GWAS_effect_allele: Effect allele for that SNP in your input GWAS (extracted from file5)
Proxy_GWAS_other_allele : Other allele for that SNP in your input GWAS (extracted from file5)
Proxy_GWAS_effect	: SNP effect (Beta) on trait/disease in your input GWAS (extracted from file5)
eQTL			: Independent eQTL tagged by the SNP listed in COL3
eQTL_study		: Study that reported an association between the eQTL and gene expression
eQTL_type		: cis (<1 Mb from gene) or trans (>1 Mb or different chromosome) eQTL
eQTL_pvalue		: Association P-value between eQTL and gene expression reported in the eQTL study (listed in COL9)
eQTL_effect_allele	: If available, allele for which the eQTL effect on gene expression was reported in the eQTL study (listed in COL9)
eQTL_effect		: If available, eQTL effect on gene expression reported in the eQTL study (listed in COL9)

For a given gene, you can use this file to understand which eQTL proxies are contributing to a significant gene-based association. You can find these by selecting those with a 'Proxy_GWAS_pvalue' eg. <0.05. For these, you can then look at the 'eQTL_study' column to identify the tissue(s) where that eQTL effect was identified. In addition, when the SNP listed in COL3 (ie. the proxy tested in your GWAS) matches that listed in COL8 (ie. the SNP reported in the eQTL study) - that is, the proxy and the eQTL are the same SNP - then you can compare the direction of effect of that SNP on disease risk (COL7) and gene expression (COL13). This is only possible when the eQTL study reported the effect allele and beta, which is not always the case. On the other hand, when the proxy and the eQTL are not the same SNP (but, by definition, have an r2>0.8), then to compare direction of effect you need to know which alleles are in phase. Currently, this is not implemented in EUGENE; you could use eg. PLINK to do so (using --ld [snp1] [snp2], where snp1 would be the proxy and snp2 the eQTL).
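
For the simple case where the proxy and the eQTL are the same SNP, the comparison can be sketched in Python as below. This is only an illustration of the logic, not EUGENE's code: it assumes both studies report alleles on the same strand, and the function name and the 'pos' / 'neg' / 'na' labels (mirroring the categories in the summary file) are hypothetical.

```python
def compare_direction(gwas_effect_allele, gwas_other_allele, gwas_beta,
                      eqtl_effect_allele, eqtl_effect):
    """Compare the effect of one SNP on the trait and on gene expression.

    Returns 'pos' if the allele that increases expression also increases the
    trait/disease risk, 'neg' if it decreases it, and 'na' if the comparison
    cannot be made (missing eQTL allele or effect, or alleles that do not match).
    """
    if eqtl_effect_allele is None or eqtl_effect is None:
        return "na"
    if eqtl_effect_allele == gwas_effect_allele:
        aligned_beta = gwas_beta
    elif eqtl_effect_allele == gwas_other_allele:
        aligned_beta = -gwas_beta  # express the GWAS effect for the eQTL effect allele
    else:
        return "na"
    product = aligned_beta * eqtl_effect
    if product == 0:
        return "na"
    return "pos" if product > 0 else "neg"
```

For example, if allele A increases disease risk in the GWAS (beta > 0) and the eQTL study reports that allele G, the other allele in the GWAS, increases expression, the function flips the GWAS beta and returns 'neg'.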

Summary annotation results (file *.eugene.annotation-summary). For each gene included in file1, EUGENE will produce the following output:

Gene			: Gene name
eQTL_study		: Study that reported at least one independent eQTL for this gene
N_eQTL_tested		: Number of independent eQTLs identified in that study (COL2) that were included (itself or a proxy) in the gene-based test
N_eQTL_sign		: Number of independent eQTLs tested (COL3) that had a P<0.05 in your input GWAS (file5)
N_eQTL_pos		: Number of significant eQTLs (COL4) for which the allele associated with increased gene expression was associated with increased trait levels/disease risk
N_eQTL_neg		: Number of significant eQTLs (COL4) for which the allele associated with increased gene expression was associated with decreased trait levels/disease risk
N_eQTL_na		: Number of significant eQTLs (COL4) for which directional effect could not be compared between gene expression and trait levels/disease risk

For a given gene, this file provides an overview of the comparison between eQTL effect on gene expression and trait levels/disease risk. This summary is provided separately for each tissue reported to have at least one independent eQTL for that gene.

Example. To annotate results from a previous --assoc or --fastbat analysis, use:

./eugene_1.3b --annotate example.annotate --eugene-results example.eugene.out --eqtl-proxies WHOLEBLOOD.20170317.eqtl.proxies.list --eqtl-database WHOLEBLOOD.20170317.eqtl.database --gwas-results example.txt --snp-effect --out example2

Where 'example.annotate' contains a list of genes to annotate (file1 described above) and 'example.txt' contains your GWAS summary statistics (file5 described above).

History

28 March 2017: version 1.3b released, including:

-- Implemented GCTA's "fastBAT" approach to estimate significance of gene-based sum statistic using Satterthwaite's approximation.

-- Minor bugs fixed.

-- eQTLs from five new studies added to database (Kasela 2017, Yao 2017, Caliskan 2015, Nedelec 2016, Quach 2016).

12 Sept 2016: version 1.2b released, including two main changes:

-- Keep track of how many genes are still being tested after 100, 1000, 10000, etc simulations. If zero (ie none with respectively P<0.1, P<0.005, P<0.0005, etc), then stop simulations. This speeds up analyses, particularly when requesting --fdr.

-- Added --gc [correction_factor] as an optional flag to --assoc. This will adjust SNP P-values provided in --gwas-results [file1] for that correction_factor (eg. this could be lambda or LD-score intercept).

31 Aug 2016: Two major bugs fixed (below); if you used a previous version of EUGENE, you must re-run your analyses with this new version.

-- The wrong SNPs for some genes were being included in the analysis of simulated data.

-- Genotype data not being read in for the last person in the *.fam file.

29 Aug 2016: version 1.1b released for beta testing, including:

-- Gene-based association analysis now requested by flag --assoc instead of --emp.

-- New analysis option --annotate: streamlines comparison of eQTL effect on trait and on gene expression.

-- Updated eQTL database and 1000G SNP data to include results from recently published eQTL studies.

-- No longer print summary of FDR results to log file, only to output file.

-- Fixed a bug that resulted in the count of genes read by --eugene-results to be wrong.

-- Added option --write-nullgwas: writes out file with gene-based results for simulated GWAS (when running --fdr analysis).

-- Added option --n-gene-sim: specifies the maximum number of simulations per gene that will be used when estimating the significance of gene-based association of simulated GWAS (when running --fdr). The default number is 1 million.

22 July 2016: new implementation of Eugene posted (v1.0beta), including calculation of empirical P-values (--emp) and empirical FDR thresholds (--fdr).

19 February 2016: new manuscript submission. Accepted (JACI) in July 2016.

7 December 2015: first implementation of Eugene posted (v0.2), which included calculating asymptotic P-values only (slightly inflated when using r2<0.1 to define 'independent' eQTLs; no empirical correction for residual LD between eQTLs nor FDR estimation). Resubmission of paper delayed while functional experiments underway.

31 May 2015: first description of method in paper submitted for publication. Rejected in September 2015 (limited functional data supporting new risk genes for asthma).