David Duffy's QIMR Berghofer Homepage

This page contains links to genetics and statistics resources, and to computer programs for linkage and association analysis, including my own. It was last updated on 09-Aug-2020.

David L. Duffy, MBBS PhD.
QIMR Berghofer Medical Research Institute,
300 Herston Road,
Herston, Queensland 4006, Australia.
Email: David.Duffy@qimrberghofer.edu.au.

Some photos from our Tasmanian holiday 2000.
Some photos from our UK holiday 2010.
Some photos showing climbing 2001-18.
Paintings.

QIMR Berghofer and departmental links

Curriculum Vitae/Publication list

Via orcid.org
Local CV

Reviews of the genetic epidemiology of asthma and allergic disease

These are periodically updated reviews of the genetics of atopy and asthma. The articles include:

These chapters are based on my doctoral thesis, which is available in PDF format here.

Our publications on the genetics of allergic disease include:

Bouzigon E, Forabosco P, Koppelman GH, Cookson WO, Dizier MH, Duffy DL, Evans DM, Ferreira MA, Kere J, Laitinen T, Malerba G, Meyers DA, Moffatt M, Martin NG, Ng MY, Pignatti PF, Wjst M, Kauffmann F, Demenais F, Lewis CM (2010). Meta-analysis of 20 genome-wide linkage studies evidenced new regions linked to asthma and atopy. Eur J Hum Genet. 2010 Jan 13.
Ferreira MAR, O'Gorman L, Le Souëf P, Burton PR, Toelle BG, Robertson CF, Visscher PM, Martin NG, Duffy DL (2006). Variance components analyses of multiple asthma traits in a large sample of Australian families ascertained through a twin proband. Allergy 61:245-253.
Ferreira MAR, O'Gorman L, Le Souëf P, Burton PR, Toelle BG, Robertson CF, Visscher PM, Martin NG, Duffy DL (2005). Robust estimation of experimentwise P values applied to a genome scan of multiple asthma traits identifies a new region of significant linkage on chromosome 20q13. American Journal of Human Genetics 77:1075-1085.
Evans DM, Zhu G, Duffy DL, Montgomery GW, Frazer IH, Martin NG (2004). Major quantitative trait locus for eosinophil count is located on chromosome 2q. Journal of Allergy and Clinical Immunology 114:826-830. link
Duffy DL, Mitchell CA, Martin NG (1998). Genetic and environmental contributions to asthma. A cotwin-control study. American Journal of Respiratory and Critical Care Medicine 157: 840-845. link
Duffy DL, Healey SC, Chenevix-Trench G, Martin NG, Weger J, Lichter J (1995). Atopy in Australia [Letter]. Nature Genetics 10:260. link
Duffy DL, Battistutta D, Martin NG, Hopper JL, Mathews JD (1990). Genetics of asthma and hayfever in Australian twins. American Review of Respiratory Disease; 142: 1351-1358. link

An integrated genetic map

This table (last updated 20060618 18:40) contains interpolated genetic map positions for 128115 marker loci. The positions are in "Rutgers" cM (Kong X, Murphy K, Raj T, He C, White PS, Matise TC. A combined linkage-physical map of the human genome. Am J Hum Genet 2004; 75:1143-1148), estimated via locally weighted linear regression (lo(w)ess) from the Build 35.1 (and 34.3) physical map positions and published Rutgers genetic map positions (R code here), and linearly interpolated "Oxstats" cM positions (Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. 2005. A fine-scale map of recombination rates and hotspots across the human genome. Science 2005; 310: 321-324). A major difference between these two metrics is the model for recombination across the centromere.
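As a rough illustration of the linear interpolation step only (not the lo(w)ess fit, and not the R code mentioned above), the following Python sketch estimates a cM position for a marker from flanking reference markers whose physical and genetic positions are both known. The function name `interpolate_cm` and the toy positions are invented for the example, and clamping at the ends of the reference range is an assumption rather than a description of how the table was built.

```python
from bisect import bisect_left

def interpolate_cm(bp, ref_bp, ref_cm):
    """Linearly interpolate a genetic position (cM) at physical position bp.

    ref_bp and ref_cm are parallel lists for reference markers on one
    chromosome, sorted by increasing bp. Positions outside the reference
    range are clamped to the nearest end rather than extrapolated.
    """
    if bp <= ref_bp[0]:
        return ref_cm[0]
    if bp >= ref_bp[-1]:
        return ref_cm[-1]
    i = bisect_left(ref_bp, bp)          # ref_bp[i-1] < bp <= ref_bp[i]
    x0, x1 = ref_bp[i - 1], ref_bp[i]
    y0, y1 = ref_cm[i - 1], ref_cm[i]
    return y0 + (y1 - y0) * (bp - x0) / (x1 - x0)

# Toy reference map (hypothetical values, not taken from the real table)
ref_bp = [1_000_000, 2_000_000, 4_000_000]
ref_cm = [0.0, 1.5, 2.5]
print(interpolate_cm(3_000_000, ref_bp, ref_cm))
```

A marker lying exactly on a reference position simply receives that reference marker's cM value.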

For the pseudoautosomal region, I have interpolated a male map based on the sperm typing data of Lien et al [2000]. This is a separate file.

A research note describing this map is:
Duffy DL (2006). An integrated genetic map for linkage analysis. Behavior Genetics 36: 4-6.

Physical versus genetic position chrom 20

Duplicate markers have been removed. Beware of name modifications (eg Marshfield names often have letter code suffixes such as Z, which means "Primer moved from its initial position; the allele size changed").

1 Marker name
2 Alternative name
3 Alternative name 2
4 Locus type (STS or SNP)
5 Decode marker (D or .)
6 Marshfield marker (M or .)
7 Chromosome
8 Alternative chromosome (occasionally differs!)
9 Decode order number (1-5136)
10 Marshfield order number (1-8010)
11 Decode physical position on chromosome (bp)
12 Decode genetic map position (cM)
13 Marshfield genetic map position (cM)
14 Rutgers genetic map position (cM)
15 Rutgers male genetic map position (cM)
16 Rutgers female genetic map position (cM)
17 Build 34.3 physical map position (bp)
18 Build 35.1 physical map position (bp)
19 Interpolated physical map position (bp)
20 Interpolated Rutgers genetic map position (cM)
21 Interpolated Oxstats genetic map position (cM)

Master map (compressed archive)
Master map (uncompressed)
Pseudoautosomal map
A research note on the map
mastermap Unix shell script to read map positions

I have taken the chromosome band data used by NCBI Mapview to draw ideograms, and interpolated the band positions onto the above map (as opposed to the perhaps more logical approach of mapping linkage findings to a physical map!):

My Software

SIB-PAIR

I am making Fortran 95 source code and some binaries for my program SIB-PAIR available for downloading.

Sib-pair has had many bugs shaken out, though the addition of new commands tends to introduce new errors. Loading data could still be improved for large datasets, but analysis of data once in memory is fairly fast, so the program can be used for handling and analysing genome-wide association study (GWAS) data and smaller processed sequencing datasets. Since the May 2012 addition of multithreading (OpenMP), such analyses are much faster. Very large datasets can also be analysed: data exceeding the available memory are automatically handled on disk rather than in memory (with a drop in speed).

The most recent version of Sib-pair (1.00b) is dated 8th August 2020 (see the list of new features). With respect to urgency of upgrading, for SIB-PAIR it is always a good idea! For example, map liftover using a chain file was not complete if the markers were unsorted - this was fixed 20180731. And the 20200221 changes include a nasty bug affecting merging of plink bed files. Note that the Windows binaries here may be a bit behind.

Program SIB-PAIR performs a number of analyses of family data that tend to be "nonparametric" or "robust" in nature. The name is a misnomer in that Sib-pair is actually for the analysis of arbitrary pedigrees. It is modelled to some extent on the Genetic Analysis System [Young, 1995] in terms of the command language and types of analysis. Included are routines for:

Mendelian error checking and genotype imputation.
Allele and haplotype frequencies in codominant genetic systems.
Familial correlations, sibship variances and variance components (including QTL linkage and combined linkage and association) for quantitative and binary traits.
Various Haseman-Elston regression and score-test analyses for a quantitative (or binary) trait using full and half-sib data.
Various transmission-disequilibrium tests (eg FBAT).
Tests of allelic association with a binary or quantitative trait -- Monte Carlo simulation of null distribution.
Single locus Affected Pedigree Method identity-by-state and identity-by-descent linkage analysis. This includes Ward's [1993] extensions to include unaffected pedigree members.
Reading pedigree, map, and genotype data files in BEAGLE, FImpute, HapMap, MaCH, minimac, MERLIN, MLINK, PLINK, VCF, Wombat, and other formats.
Writing locus, kinship, map and pedigree files in the formats used by the programs APM, Arlequin, ASPEX, BEAGLE, blupf90, CRIMAP, FImpute, FISHER, GAS, GCTA (grm), GDA, Genehunter, LINKAGE, Loki, MERLIN, MENDEL, MORGAN, PAP, Plink, ROADTRIPS, SAGE, SOLAR, STRUCTURE, VCF, Wombat, and many more.
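To give a feel for what "Monte Carlo simulation of null distribution" means for an association test, here is a minimal Python sketch. It is not Sib-pair's algorithm: it assumes unrelated individuals and simulates the null by shuffling case/control labels, whereas pedigree data need a simulation scheme that respects family structure. The function name `mc_pvalue`, the choice of test statistic and the toy data are all illustrative assumptions.

```python
import random

def mc_pvalue(genotypes, is_case, n_sim=999, seed=1):
    """Monte Carlo P-value for allelic association in unrelated individuals.

    genotypes: count of a reference allele (0, 1 or 2) per person.
    is_case:   True for cases, False for controls, in the same order.
    The statistic is the absolute difference in mean allele count between
    cases and controls; the null distribution is simulated by shuffling
    the case/control labels.
    """
    def statistic(labels):
        cases = [g for g, c in zip(genotypes, labels) if c]
        controls = [g for g, c in zip(genotypes, labels) if not c]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    rng = random.Random(seed)
    observed = statistic(is_case)
    labels = list(is_case)
    n_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(labels)
        if statistic(labels) >= observed:
            n_extreme += 1
    # Count the observed data as one of the replicates, so P is never zero.
    return (n_extreme + 1) / (n_sim + 1)

# Toy data (hypothetical): six cases followed by six controls
genotypes = [2, 2, 1, 2, 1, 2, 0, 0, 1, 0, 0, 1]
is_case = [True] * 6 + [False] * 6
print(mc_pvalue(genotypes, is_case))
```

With this estimator the smallest attainable P-value is 1/(n_sim + 1), reached when no simulated statistic is as large as the observed one.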

More recent releases of SIB-PAIR add multithreading; flexible manipulation of pedigree data; MLE of allele frequencies; segregation analysis; variance components (linkage) analyses that now allow multiple fixed effects, including measured genotypes, under Gaussian and threshold (probit-normal) models via deterministic approaches, and a larger range of GLMMs via MCMC; the combined sibship/transmission disequilibrium score test for allelic association; an extension of the WQLS test to categorical traits; a quantitative trait TDT; the SKAT test; generalized linear (mixed) models; assorted classical twin analyses, and one for bivariate survival analysis; multilocus population genetic analyses; and estimation of empirical kinship coefficients.

The program executable is usually called sib-pair or sib-pair.exe. Precompiled executables are available for Linux and for Windows (see below), but there should be no problems compiling and running on platforms that have a Fortran 95 compiler. There are no hard coded constraints on number of loci, number of pedigree members or number of alleles at a marker (providing your computer has enough memory).

Using the japi library, a graphical file picker or directory browser is now working under Windows and Linux. An alternative uses the GTK2+ based pilib library. If these are not activated, there is a fallback simple text based file chooser.

The gfortran compiled code on Linux is currently as fast as, or faster than, the code produced by any of the other compilers I use. Since some routines use formatted stream access, the program will not compile with some Fortran 95 compilers.

Links to SIB-PAIR

[Documentation about Sib-pair]

An introduction to SIB-PAIR A tutorial in using Sib-pair.
Using SIB-PAIR to run other programs A tutorial in using Sib-pair to automate the running of other genetic analysis programs.
All the SIB-PAIR commands The list of Sib-pair language commands, with hypertext links to documentation for each command and enhanced examples of use.
Sib-pair manual The HTML version of the manual.
PDF version of Sib-pair manual prepared using htmldoc (or Lout and an awk script htm_lout.awk).
A brief technical report A summary of the program features.
Slides from a talk on Sib-pair A summary of recently introduced program features as of November 2008.
Slides from a talk on Sib-pair 2011 A summary of recently introduced program features as of August 2011.
About extrapolating small MC P-values The approach (Hill, 1975; Davis and Resnick, 1984) used to estimate Monte Carlo-P values when the observed statistic is larger than any simulated statistic.

[Download Sib-pair]

Linux

gsp64.linux: Linux 64 bit ELF binary of Fortran 95 version of Sib-pair. Compiled using gfortran -fopenmp.
64 bit gfortran serial version statically linked on Linux.
sib-pair32.linux: Linux 32 bit ELF binary of Fortran 95 version of Sib-pair. Compiled using gfortran.
The Solaris Studio compiled multithreaded version for Linux.
Pathscale compiled serial version for Linux.
Older Linux g95 compiled serial version.
sib-pair.tar.gz: Linux installation of Sib-pair, including binary, documentation, examples and source.
sib-pair.n900: For the discerning mobile telephone owner, Sib-pair for the Nokia n900 running Maemo 5.

MacOSX

sib-pair.macosx: The Sib-pair OSX (Mojave) binary. Compiled using gfortran 4.6.0.

Windows

sib-pair64.zip: The Sib-pair 64-bit Windows binary. Compiled using gfortran.
win-setup.zip: Windows Sib-pair installer. This installs the program and documentation, adds Sib-pair to the search path, and sets up a "help.start" command that starts up the local copy of the help pages in Internet Explorer. Packaged using Jordan Russell's Inno Setup program.
win-sp.zip: The above Windows setup as a simple zip file that expands to "bin" and "doc" folders and contents.
sib-pair.exe: Old Sib-pair Windows 32 bit binary. Compiled using gfortran.

[Compiling, altering and testing Sib-pair]

sib-pair.f95.gz: Fortran 95 source code for Sib-pair. On linux, the most recent versions compile using gfortran, Oracle sunf95, Intel ifort, Pathscale pathf95. My version of Open64 openf95 does not fully support streams. For the Windows 32 and 64 bit versions, I use mingw gfortran successfully.
A Makefile: Makefile to compile various versions of Sib-pair.
A Regression Testing suite for Sib-pair.
A guide to the Fortran code.
Superseded Fortran 77 versions of Sib-pair Old version of program, including executables for various other platforms.
Version of SIB-PAIR in R (sib-pair.R): Missing a lot of functionality.
SIB-PAIR examples: Sample scripts with inline data sets.
SIB-PAIR GLMM examples: Standard datasets (mainly nongenetic) for testing of MCMC GLMM algorithms.
SIB-PAIR numerical test examples: Some of the NIST Statistical Reference Datasets for testing of statistical algorithms, as Sib-pair scripts.
Other SIB-PAIR examples: Brief descriptions of example data sets and scripts below.
examples.zip: Other example pedigree and script files for SIB-PAIR. These are pkzipped. (Compressed 26507; uncompressed 335872)
examples.tar.gz: Other example pedigree and script files for SIB-PAIR. These are tarred then gzipped. (Compressed 20742)
Appendix example: Age at appendicectomy example.

[Sib-pair utilities and ancillary programs]

SIB-PAIR utilities: Brief descriptions of awk programs for manipulating GAS type pedigree files.
utils.tar.gz: Small awk programs for manipulating GAS type pedigree files. These are tarred then gzipped. (Compressed 6012; uncompressed 32768)
utils.zip: Small awk programs for manipulating GAS type pedigree files. These are pkzipped. (Compressed 8942)
pad: A SIB-PAIR utility for adding extra (missing) data columns to GAS type pedigree files (requires sh, awk).
bester: A utility for aligning columns of data eg pedigree files (requires sh, awk).
catped: A SIB-PAIR utility for concatenating GAS type pedigrees where each file may contain data for different, as well as common, loci.
mergeped: A SIB-PAIR utility for merging GAS type pedigrees where each file may contain data for different, as well as common individuals (needs sh, awk, join, pad, bester).
updateped: A SIB-PAIR utility for updating trait values where there is duplicate data in a merged file, then purging the duplicates. Designed for use straight after mergeped, which renames duplicate loci to loc_v2, loc_v3...
diffped: A SIB-PAIR utility for comparing two GAS type pedigrees.
ISP: Little Tcl/Tk based GUI for Sib-pair: batch.
ISP2: Little Tcl/Tk based GUI for Sib-pair: interactive.
Manual for BINNING: This is the beta-test software for binning alleles.
Win32 binary for BINNING (binning.exe).
binning.f.gz: Fortran 77 source code for BINNING. The g77/f2c code only (though should compile with most Fortran compilers).

Documentation for cwsdpmi.exe
DOS binary of cwsdpmi.exe (uncompressed 20217)

LOGLIN

Program LOGLIN is a Fortran 77 program for performing generalised log-linear modelling of complete or incomplete count data. There are two versions of the source code here: one requires the NAG subroutine library to be available, while the other contains equivalent public domain routines. In addition, there is an R library with much the same functionality (updated 2018-11-26). LOGLIN may be used for:

Models where imprecise measures have been calibrated using a "perfect" gold standard, and the true association between imperfectly measured variables is to be estimated.
Where data is missing for a subsample of the population.
Latent variable models - eg ML gene frequency estimation from counts of observed phenotypes, latent class analysis.
Specialised measurement models eg where observed counts are mixtures due to perfect measures and error prone measures.
Standard models which are difficult to fit in some packages, such as symmetry and quasi-symmetry models.

LOGLIN documentation
GZIPPED Fortran source code for LOGLIN.
GZIPPED Fortran source code for NAG dependent LOGLIN.
R package (gllm_0.37) for log-linear modelling.
The win32/NT executable loglinnt.exe.
assoc.f. Fortran source code to generate the allelic association jobs.

TWINSIM

This program generates nuclear families, a proportion of which contain monozygotic twins, in which multiple quantitative trait loci are segregating. One of these QTLs is linked to multiple markers. Families can be selected to contain high and/or low values at the quantitative or ordinal trait.

Mystat

This is a Basic program that performs a number of simple statistical analyses of contingency tables useful in epidemiology and genetics. One can estimate tetrachoric correlations and odds ratios for 2x2 tables (with exact confidence intervals), combine multiple 2x2 tables via Mantel-Haenszel and maximum likelihood procedures (jackknife standard error for pooled MLE odds ratio), test for symmetry and quasi-symmetry in square contingency tables, and obtain exact (Clopper-Pearson) 95% confidence intervals on a proportion. A calculator (double precision) with scientific functions including inht(), fact(), and ran() is also accessible via the same menu.
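For readers unfamiliar with the Mantel-Haenszel procedure for combining 2x2 tables, a minimal Python sketch of the pooled odds ratio follows. It is not a transcription of the Basic program: the function name `mantel_haenszel_or`, the (a, b, c, d) cell layout and the example counts are assumptions made for illustration, and no confidence interval or zero-cell handling is included.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio over a list of 2x2 tables.

    Each table is (a, b, c, d):
        a = exposed cases,   b = exposed controls,
        c = unexposed cases, d = unexposed controls.
    The estimate is sum(a*d/n) / sum(b*c/n), with n the stratum total.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Two hypothetical strata
strata = [(10, 20, 5, 40), (8, 12, 6, 30)]
print(mantel_haenszel_or(strata))
```

With a single stratum the estimate reduces to the ordinary cross-product odds ratio ad/bc.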

R Programs

gllm 0.32 Loglin for R.
lodplot 1.2 Genome scan plotter.
bivqtl.R Power of variance components QTL linkage analysis
boxcox.R Box-Cox (power) transformation for bivariate (exchangeable) outcomes.
filliben.R Filliben r as test for normality
fstat.R Population genetic F statistics (bugfix 20081103)
hwetest.R Testing Hardy-Weinberg Equilibrium
mqls.R Thornton and McPeek kinship adjusted association tests
merlinlme.R Reading Merlin IBD matrices
plotibs.R Testing familial relationships using multiple marker identity-by-state ("Abecasis" plot)
calcibs.f
polyr.R Polychoric r for RxC contingency table
snp.R A few useful routines for SNP data
hap.R Toy EM haplotyping algorithm
wang-landau.R The Generalized Wang-Landau algorithm from Liang et al 2005
dotstack.R Symmetrical stacks for vertical dot plots

Other programs

rcexact. A program that calculates Fisher exact P-values for RxC contingency tables. Written by Mehta in Fortran 77 (Algorithm 643 from the ACM). I have altered the driving program slightly.

rcexact.exe: DOS executable of rcexact.
winrcex.exe: Win32/NT executable of rcexact.

readhap: Shell and awk script to summarize GENEHUNTER haplotype output.
readinh: Shell and awk script to summarize ALLEGRO haplotype output.

drawhap.sh. Takes SIMWALK2 haplotyping output file and draws the pedigree as a marriage-node graph with haplotypes using Graphviz (needs sh, awk, dot). Not completely satisfactory in terms of placement of haplotypes on the drawing. Colouring is of alleles, rather than haplotypes.

drawhap.sh: Shell and awk script to summarize SIMWALK2 haplotype output as a pedigree drawing.

join_unsorted.sh. Just like (unix) join, but files do not have to be sorted. Returns a file following the order of the key in the first named file:

  Usage: join.unsorted [OPTION]... FILE1 FILE2
  For each pair of input lines with identical join fields, write a line to
  standard output.  The default join field is the first, delimited
  by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.

    -a FILENUM        print unpairable lines coming from file FILENUM, where
                        FILENUM is 1 or 2, corresponding to FILE1 or FILE2
    -e EMPTY          replace missing input fields with EMPTY
    -i, --ignore-case ignore differences in case when comparing fields
    -j FIELD          equivalent to -1 FIELD -2 FIELD
    -o FORMAT         obey FORMAT while constructing output line
    -t CHAR           use CHAR as input and output field separator
    -v FILENUM        like -a FILENUM, but suppress joined output lines
    -1 FIELD          join on this FIELD of file 1
    -2 FIELD          join on this FIELD of file 2
        --help        display this help and exit
        --version     output version information and exit

  Unless -t CHAR is given, leading blanks separate fields and are ignored,
  else fields are separated by CHAR.  Any FIELD is a field number counted
  from 1.  FORMAT is one or more comma or blank separated specifications,
  each being FILENUM.FIELD or 0.  Default FORMAT outputs the join field,
  the remaining fields from FILE1, the remaining fields from FILE2, all
  separated by CHAR.

join_unsorted.sh: Shell and awk script to join unsorted files.
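The core idea, joining on a key without sorting while keeping the key order of the first file, can be sketched in a few lines of Python. This is an illustration of the approach rather than a port of the shell and awk script: the function name `join_unsorted`, the 0-based field arguments and the example lines are assumptions, and none of the -a, -e, -o, -t or -v options are implemented.

```python
def join_unsorted(lines1, lines2, field1=0, field2=0):
    """Join two lists of whitespace-delimited lines on a key field.

    Neither input needs to be sorted. Output follows the order of keys in
    lines1; each output line is the key, then the remaining fields of the
    line from lines1, then the remaining fields of the line from lines2.
    Field numbers here are 0-based, unlike the 1-based -1/-2 options.
    """
    # Index the second input by key so each lookup is a dictionary access.
    index = {}
    for line in lines2:
        fields = line.split()
        if not fields:
            continue
        index.setdefault(fields[field2], []).append(fields)

    out = []
    for line in lines1:
        f1 = line.split()
        if not f1:
            continue
        key = f1[field1]
        rest1 = f1[:field1] + f1[field1 + 1:]
        for f2 in index.get(key, []):
            rest2 = f2[:field2] + f2[field2 + 1:]
            out.append(" ".join([key] + rest1 + rest2))
    return out

# Hypothetical inputs: a map file and an allele file in different orders
file1 = ["rs3 20 1.5", "rs1 20 0.2"]
file2 = ["rs1 A G", "rs2 C T", "rs3 T C"]
for row in join_unsorted(file1, file2):
    print(row)
```

Holding the second file in a dictionary trades memory for the sort step, which is usually acceptable for marker lists of moderate size.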

Fortran stuff

fscheme. Port of the tinyscheme (and minischeme) small Scheme interpreter to Fortran 95. Hopefully useful as an embedded interpreter (a stripped down version is present as a module in Sib-pair).

fscheme.f95 (updated on 2007-08-13).
init.scm Initialization file.
fscheme_mod.f95 As a module (extracted from Sib-pair code on 2019-11-27). This includes a number of extra statistical functions and bindings to the JAPI GUI library and EGGX graphics library (see here for the list of commands, and here for notes on embedding Scheme in a Fortran program).

grapheps. Port of Aubrey Jaffer's grapheps Postscript data plotting package to a Fortran 95 module. As used in Sib-pair.

grapheps.f95 (updated on 2010-07-01).

fortransockets. Minimal Fortran 95 sockets library for linux. Enough functionality for a simple server. Includes wrappers for socket(), setsockopt(), bind() and listen(); accept(); send(); recv(); close(); gethostbyname(). This code last updated 2007-12-19.

Makefile
fortransockets.f95
fortransockets_interfaces.f95
fsockets.c The C glue code.
testsocket1.f95 Test TCP server.
httpserver.f95 Test http server.
index.html.gz Test page for httpserver.

Sample code for multidimensional table. Enhanced version of Sib-pair code for sorting and tabulating multidimensional data. For my purposes, the inputs to set_table_cell would be character data, which would then be automatically cast to the correct type.

better_table.f90
better_table.csv Example data.

Interface for curses. Fortran 2003 modules that interface to the PDCurses and ncurses libraries. With PDCurses, this runs nicely using gfortran or g95 on linux or Windows. The ncurses and PDCurses libraries differ in a few places in their coding for attributes and some keys (eg backspace). This code last updated 2011-02-01.

A couple of awk scripts used to make the interfaces to C (usually will need some editing afterwards).

deftopar.awk Tries to change defined variables to parameter declarations.
mkinterface.awk Tries to write an appropriate Fortran interface

Interface to zlib. After Janus Weil's example on comp.lang.fortran (May 2009) and fgzlib (plus templates in fgsl). As others have noted, it is much faster to use gzread() and a buffer, rather than gzgets(). Extended slightly (2018-10-02), to interface the necessary routines to randomly access a BGZF (ie bgzipped) file.

f95zlib.f95 (updated on 2009-10-20).
zlib_stuff.f90 (updated on 2014-01-10).
bgzf1.f90 (updated on 2018-10-03).

Interface to readdir() etc. A module that allows listing directory contents using the Posix opendir, readdir, rewinddir, closedir. Works on Linux and Darwin.

dir_util.f90 (updated on 2020-06-26).

David Frank's C2F translator. This was available for many years at his website, but that site now seems defunct. He wrote about it on comp.lang.pl1 back in 2002:

I made my C2F tool freeware from its beginning several years back, BTW, I'm a old retired Fortran programmer, but if I were to go back into the job market I would certainly add this project to my resume,,,

C2F.ZIP (put up 2019-12-15).

JabRef style and layout files

JabRef reference formatting style files for assorted biomedical journals. The layout files are used to define a customized export format for a set of references. In this case, they produce a character delimited file which, after postprocessing to make it nicely tab separated, can be uploaded to the Australian NHMRC grant application system (RGMS).

AJHG_style_file.jstyle (American Journal of Human Genetics)
BiomedCentral.jstyle
JAMA.jstyle
KI.jstyle (Kidney International)
NatureGroup.jstyle
nhmrc.begin.layout
nhmrc.layout
