A Software note: the SIB-PAIR program

David L. Duffy

(2009-07-03)

David L. Duffy, MBBS PhD.
Queensland Institute of Medical Research,
300 Herston Road,
Herston, Queensland 4029, Australia.
Email: davidD@qimr.edu.au

INTRODUCTION

There are now several hundred software packages for genetic analysis listed on Wentian Li's website (http://www.nslij-genetics.org/soft/). Most of these implement particular statistical genetic procedures in the domains of linkage, association, variance components or segregation analysis, but a number of multifunctional packages such as MERLIN, MENDEL, PAP, SAGE, SOLAR are also in common use. Other packages are designed for the manipulation of pedigree data and conversion of data files from one format to another used by this rich choice of tools - the MADELINE package is one example.

Sib-pair is a computer program for both the manipulation and the statistical analysis of genetic datasets. It is somewhat misnamed, in that it was always capable of analysis of arbitrary pedigrees, but the first analyses implemented were multiallelic transmission-disequilibrium tests and sib pair Haseman-Elston linkage analysis [Duffy 1997]. Over the thirteen years of development, I have added a range of other analyses, ranging historically from Penrose sib-pair linkage analysis to the fitting of generalized linear mixed models, including genetic survival analysis. These are embedded in an interpreted language that offers a large number of pedigree and individual oriented data operations, and the ability to write out data files in the formats required by many other useful packages. After a rewrite of core routines in 2006, it can now deal with genome wide association sized datasets.

The program is written in Fortran 95, and is designed to be: a) a useful tool for day-to-day tasks encountered by a genetic analyst; b) an easily extensible platform for exploration of statistical methods, concentrating on simulation based approaches (Monte-Carlo and Markov Chain Monte-Carlo (MCMC)). It should be noted that, except in a few cases (multipoint identity-by-descent calculations, haplotype frequency estimation and association, the experimental homozygosity runs test), the analytic methods that Sib-pair implements are single locus.

METHODS

The full documentation for Sib-pair can be found online at http://www.qimr.edu.au/davidD, including results from analysis of standard datasets.

Language interpreter

The program implements an interpreted language with over 200 commands (see the online documentation for a complete list). Commands are entered at a command line prompt or read from a script file. The commands offered include calendrical, mathematical and genetic utilities, such as those for calculation of recurrence risks, genotype carrier probabilities and expected ibd sharing under a specified single major locus model (the ito, grr and sml commands), P-values for the chi-square and F distributions (pchisq); and simple tests on tables of counts entered from the keyboard such as the contingency chi-square (chisq), and asymptotic and exact tests of Hardy-Weinberg and linkage equilibrium (hwe, diseq).

In passing, users of the elegant GAS package [Young 1995] will find much that is familiar in the overall design and functionality of Sib-pair, and at one time, GAS type scripts could be read by Sib-pair. Automation of analysis tasks is augmented by a simple macro facility, as well as a small embedded Scheme (Lisp) interpreter.

Data Input and Output

Data can be read into Sib-pair in files of five different formats: Sib-pair, Linkage, MERLIN, HapMap, and PLINK. The Linkage and Merlin formats require the standard files describing the traits and genotypes, pedigree relationships and genetic maps for loci. Either text-based or binary (.bed) PLINK format files can be read.

The Sib-pair native pedigree file format is a modified "pre-Makeped" format (similar to the GAS [Young 1995] format) that allows letter codes for sex ("MmFfXx"), binary traits ("YyNnXx"), and marker alleles ("a-zA-Z1..999"), where "x" is reserved as the missing data placeholder. This can be read from a file, or inlined within a control script so making an analysis self-contained and self-documented. Each phenotype needs a one line declaration in the control script (or entered at the keyboard) taking the form:

set locus <name> <type> [<map position> [<comment>]]

The current data types are "affection" (binary), "quantitative", (including dates), "marker" (autosomal codominant), "xlinked" (codominant X-linked), and "haploid" (eg Y-chromosome, mitochondrial). The map position can be on a genetic or sequence map. The comment string is stored, and can be searched upon or accessed (eg the keep and drop commands), so allowing quick subsetting of the data in an analysis. If data from unrelated individuals is being read, then the data format can be simplified to just include an ID rather than the standard five fields required in the usual Linkage format. One quantitative trait can be declared to be a twin zygosity indicator, and one marker can be declared as "sex-informative" (eg Amelogenin).

Data can also be simulated by Sib-pair, including pedigree structures, traits and codominant genotypes, and this can be conditional on observed genotypes at a locus, or on linkage to another observed marker locus.

When the data are read in, a number of checks are automatically carried out (see Table 1 and Figure 1). Mendelian inconsistencies are first checked for at the nuclear family level, and then more broadly using the Lange-Goradia algorithm [Lange and Goradia 1988]. Additional tests can be requested, such as for consistency of dates of birth or age within a pedigree (test dob).

Table 1 Sib-pair error checking

Checks for duplicate records
Checks for impossible pedigree relationships (eg own father)
Checks sex of parents
Creates extra pedigree records as required
Sorts the pedigree by generation number
Checks for unconnected components within the pedigree
Tests and reports Mendelian errors
Tests that the reported sexes are consistent with sex-linked marker genotypes
Tests that monozygotic twins are concordant at all markers
Generates legal values for all missing genotypes within the pedigree -
used to start MCMC algorithms.
Impute missing genotypes if requested ("set imputation on")

Figure 1 Example Sib-pair data error reports


Pedigree     Individual  Sex  Post.Pr(M)   X-marker hets
------------ ----------  ---  ----------   --------------
[...]
32802        3280203      m    0.000000      14/  19
32802        3280204      f    1.000000       0/  19

Designated     Sex inferred via sex-linked markers
Sex          Likely Male    Uncertain   Likely Female
----------   -----------    ---------   -------------
 Male             721          1806             5
 Unknown            0           145             0
 Female             5          2136           811

NOTE:  inconsistency due child 34114-3411402 at locus GATA31F01P {145/153}

X-linked locus "GATA31F01P"
--------------
Sibship: 34114-3411403 x 34114-3411404

Inconsistency between sibling genotypes.

                      [3411403]           (3411404
                        x/-               145/153
                         |                   |
                         +=========+=========+
                                   |
               +---------+---------+---------+---------+
               |         |         |         |         |
            (3411401) (3411402) (3411451) [3411452] [3411453]
            145/149   145/153   145/153   153/-     153/-

NOTE:  Mendelian inconsistency in pedigree 02590 at X-linked locus "GATA31E08".

X-linked locus "GATA31E08"
--------------
Sibship: 02590-0259001 x 02590-0000005

Multigenerational inconsistency between genotypes.

                 [0259003] (0259004)
                   x/-     242/242
                    |         |
                    +====+====+
                         |
                      [0259001]           (0000005
                        x/-                 x/x  
                         |                   |
                         +=========+=========+
                                   |
                              +----+----+
                              |         |          
                           [0259008] (0259009) 
                           230/-     230/230    

ID         Count    Problem phenosets
---------- -------- -----------------

Paternal Gparents
0259003           3 230/- 242/- +/-
0259004       Typed 242/242

Paternal Uncles/Aunts
0259002           1 242/-

Father
0259001     Problem 242/-

Mother
0000005     Problem 230/230 230/242 230/+ 242/242 242/+ +/+

Children
0259008       Typed 230/-
0259009       Typed 230/230

Parent 02590-0259001 cannot carry the "230" allele found in child 02590-0259008.
Parent 02590-0259001 cannot carry the "230" allele found in child 02590-0259009.

Once read in, phenotype, locus and map data can be exported to a number of formats, currently including: Aspex, Beagle, Crimap, CSV, dot, eclipse, FBAT, Fisher, GDA, Genehunter, Haploview, Linkage (pre and post Makeped), Loki, Mendel, Merlin, MIM, MORGAN, PAP, RAM, Relpair, SAGE, Solar and Structure.

Data manipulation procedures

The simplest manipulations that Sib-pair provides are the subsetting of loci and of pedigrees. The keep and drop commands affect loci, and can be addressed by name (wild-card searchable or a range), index in list of loci, range of map positions, major allele frequency, HWE or association test P-value, number genotyped (phenotyped), strength of linkage disequilibrium with neighbours, or by a search of the associated comment string. The select statement selects a subset of pedigrees, either by a search on family name, or by a count of members meeting an eligibility criterion. These criteria can include family size, number of founders, number of markers genotyped at, the maximum number of markers genotyped in common with a relative, the value of a phenotype or genotype (including homozygous status).

The pedigree transformation commands allow regeneration of IDs, subdivision into component nuclear families, pruning of pedigrees based on a key phenotype or genotype, extraction of unrelated cases and controls based on a phenotype, concatenation of pedigrees based on shared individual IDs, identification of unrelated individuals and subpedigrees under a common pedigree ID and splitting off of these if requested.

There are all the usual algebraic and mathematical procedures aimed at individual level data available in standard statistical packages, and are written in the usual infix notation (eg "logBMI=log(weight/height^2)"). Other commands include ranking or inverse normal transformation of quantitative data (rank, blom), calculation of residuals from linear regression (residuals) or Kaplan-Meier survival analysis (kaplan).

Finally, the get command allows the writing of a summary statistic for the trait values in specified classes of relatives of each individual to a new variable. This can be a randomly sampled value from that set (allowing construction of permutation tests).

Simple Genetic Descriptive Analysis

For each type of locus, there are simple summaries that are a useful preliminary to formal analysis. For quantitative traite, as well as summaries of the distribution (mixtures of gaussian or other distributions can be fitted using the mix command), one obtains a tabulation of the familial correlations, along with a (delete-d) jackknife-derived standard error. For binary traits, recurrence risks and segregation ratios (including via the method of Davies [1979]). A fast classical twin analysis of binary or quantitative data is also implemented.

Both simple counting and maximum likelihood estimates of the allele frequencies of codominant markers can be produced (the latter is calculated by a Monte-Carlo Expectation Maximization (MCEM) algorithm). One also can test Hardy-Weinberg disequilibrium (exact P-value for diallelic markers). In intact pedigree data, the P-value for the HWE test is obtained by gene dropping (and so is a legitimate test combining population level Hardy-Weinberg disequilibrium in genotyped founders along with any evidence of segregation distortion in nonfounders). The diseq command tabulates all pairwise intermarker linkage disequilibria (for autosomal or X-linked codominant markers). It will also estimate haplotype frequencies for specified sets of markers, and if a stratum indicator variable is specified will give a likelihood ratio test statistic testing homogeneity of haplotype frequencies across the strata. An expectation-maximization algorithm is used, which for two-marker systems combines information from phased and unphased genotypes (phase being inferred for individuals with full parental information).

Genetic Association Analysis

There are a number of association procedures implemented in Sib-pair that cover a wide range of phenotypes and both within and between family analyses. In many cases, the tests rely on gene dropping simulation to give correct P-values when arbitrarily structured families are being analysed. Because this can be time consuming, I have followed Besag and Clifford [1991] in calculating sequential Monte-Carlo P-values, so that the number of simulations used to estimate a nonsignificant (ie large) P-value is small. For very large datasets, since most tests have an asymptotic counterpart, these can be used to screen for interesting results to be followed up using a MC approach. Sib-pair allows the gene-dropping to be either unconditional, or conditional on identity-by-descent (IBD) at a flanking marker.

The association command implements gene-dropping approach for binary and categorical traits, and quantitative traits where a ordinary least squares ANOVA is appropriate. If covariate effects need to be adjusted for, one must use other commands. For example, measured genotype analysis of a trait can be carried out by generalized linear model (binomial, gaussian, poisson, exponential) analysis with gene-dropping used to assess the strength of association allowing for residual family correlation (using the regress command along with the sim keyword). This approach is extended following Aitken and Clayton [1980] to Weibull distribution based parametric survival analysis. To incorporate imputed genotypes (for pedigrees where some marker genotypes are unobserved), multiple imputation via a hybrid Metropolis-Gibbs genotype sampler can be carried out (rep keyword).

Alternatively, a variance components based approach to residual familial correlation is available using the variance_components command, or in the case of survival analysis, the MCMC based fpm command, that allows fitting of generalized linear mixed models. The MCMC genotype sampler mentioned earlier can be used to generate expected allele dose for each unobserved genotype (gpe command).

For population genetic purposes, the same routines when called by the fst command also calculate the F-statistics for the categorical trait tests, and then combine these to give multilocus F statistics following Pons and Chaouche [1995].

Within-family association tests

The association command also implements the FBAT (Laird et al 2000) within-family test for binary data. The expectation and sampling variance of the FBAT test statistic is calculated by gene-dropping with rejection, conditioning on the observed genotypes necessary to impute the missing parental genotypes, and this approach obtains the same values as the exact evaluation used in the FBAT program. It is slower as a result, but is extendable to more complex situations eg transmission through three generations.

In addition, there are four other commands implementing different flavours of the transmission-disequilbrium test (TDT). The hrr command carries out the original Haplotype Relative Risk test, but again provides both asymptotic and gene-dropped P-values. The schaid command gives the Schaid and Sommer [1993] log-linear model for a diallelic genotypic TDT. The tdt command gives a quantitative trait TDT following Gauderman [2003], which is essentially identical to the default test provided by the QTDT program. For a binary trait, it gives four tests: the original test of Spielman et al [1993] for each allele in turn (unless it is a diallelic marker), and three multiallelic TDTs. Finally, the sdt command gives a conditional logistic regression analysis of a binary trait versus (allelic-coded) genotype stratifying on sibship (equivalent to the "sibship disequilibrium test" [Boehnke and Langefeld 1998; Horvath and Laird 1998]).

Homozygosity based tests

Sib-pair offers a variety of approaches to testing for marker homzygosity as a means to detecting trait loci or Hardy-Weinberg disequilibrium. The homoz command tests locus by locus, and as usual implements a gene-dropping approach to deal with pedigree data. The mulhom command tests for runs of homozygosity. Both the mean overall run length and best run length centred on each marker in turn is calculated for each eligible individual over the set of active markers. Simulation is used to approximate the expected distribution of these test statistics (this does not currently allow for intermarker linkage disequilibrium). Finally, the MCMC genotype sampler has an interface allowing the saving of a homozygosity-by-descent indicator for any single marker (which for completely linked markers could be averaged to approximate the fully multipoint estimate).

Linkage

All the Sib-pair linkage routines are "nonparametric" in nature (with the exception of the lod command, which performs two-point codominant marker linkage analysis). The penrose command implements the original sib pair linkage method of Penrose [1935, 1937]. The apm command carries out both identity by state (IBS) and identity by descent (IBD) sharing based analysis for single point linkage analysis of binary traits. The expectation and variances of the sharing statistics are generated by gene-dropping simulation, and are fast even in large pedigrees. The usual affecteds-only Spairs and Sall ("NPL") statistics of Whittemore and Halpern [1994] are calculated, along with the general pairs (GPM) statistic of Ward and Bonaiti-Pellie (1995). The latter combines results from affected-affected, affected-unaffected, and unaffected-unaffected pairs in a weighted fashion, and for Mendelian disorders at least, is often more powerful than the affected-only statistics.

For quantitative traits, Sib-pair offers several flavours of regression based linkage test [Haseman and Elston 1970; Sham and Purcell 2001; Visscher and Hopper 2001; Szatkiewicz and Feingold 2004] as well as variance components. The fastest options restrict analysis to full and half sibs, but complete pedigree variance components linkage analysis can also be performed, where IBD is estimated using the MCMC genotype sampler. To estimate multipoint IBDs, Sib-pair calculates variance weighted averages of "single-point" IBDs over sets of markers that are in complete linkage, the simplest application of the approach of Fulker and Cardon [1996] as extended by Almasy and Blangero [1998]. The variances are the empirical sampling variances arising from the MCMC simulation. This approach is not used by the apm routine.

Generalized linear mixed models (GLMMs)

The fpm (finite polygenic model) command uses MCMC algorithms to fit GLMMs, but can also include single or multiple major loci in the model, as the name suggests. These latter options are still unstable (and are not recommended for casual users), but polygenic GLMM fitting seems to be more robust. I have used Sib-pair to successfully replicate results for one and two level hierarchical models fitted to standard datasets from the general statistical literature on GLMMs (Bates 2005; http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml), Gaussian linear mixed models for large animal breeding datasets, and multivariate survival analysis. The fpm command also allows the fitting of mixed models to survival data, under a Weibull model with Gaussian random effects. It has not yet been extended to linkage analysis.

Markov Chain Monte Carlo Statistical Methods

The single locus MCMC genotype sampler uses a slightly ad hoc combination of Metropolis and Gibbs sampler steps. A Metropolis step is carried out in 5% of iterations, and alternates between three types of global proposal: a. "mutation" of up to four founder alleles, changes which are propagated down the pedigree; b. a gene-dropping of 2*nfounder "identity-by-descent indicator" alleles, where transmissions that cause an incompatibility between observed genotypes are rejected (this can be seen as a mechanism for generating consistent descent graphs for the observed genotypes); c. a swap of grandparental origin of alleles in a randomly chosen individual (after Lange and Matthysse [1989], Sobel and Lange [1996]), which are then propagated up the pedigree. Following this, the (unobserved) genotypes for all untyped by untyped matings are resimulated, and the usual Metropolis acceptance-rejection step is then carried out.

Because it is difficult/expensive to calculate the proposal probabilities for these complex global proposals (in large pedigrees at least), these are assumed to be symmetric. In the case of proposal mechanism b., this is not the case. This is why the Metropolis steps are kept relatively infrequent compared to the standard single-site genotypic Gibbs sampler that makes up the bulk of iterations (in order to minimize the effects of that approximation). The global proposal steps allow the program to obtain better mixing and thus better likelihoods for the "difficult" (for pure Gibbs samplers) example pedigrees presented by Jensen and Sheehan [1998] and Canning and Sheehan [2002]. However, the resulting estimates are only approximate.

The MCMC IBD estimation is carried out via a "perfectly informative" gene-drop conditional on the simulated and observed genotypes (by rejection sampling) at each genotypic sampler iteration, with the empirical kinship coefficient being the average sharing of indicator alleles over the replicates. To carry out gene-dropping at a second marker locus conditional on IBD at a first locus, IBD is captured by a first "perfectly informative" gene drop, and a second gene drop using the marker allele frequencies is then matched to this, again by rejection sampling.

MCEM estimation [Wei and Tanner 1990] of allele frequencies replaces the usual likelihood evaluation step with the average result from a number of MCMC replicates with the parameters of interest (founder allele frequencies) set to the currrent trial value. It is accelerated by simple doubling of E-step size (damped with increasing number of iterations).

The GLMM MCMC sampler combines Gibbs sampling for unobserved diallelic trait locus genotypes with sliced Metropolis sampling for other model parameters such as gaussian (polygenotype) breeding values, trait locus penetrances/means, and fixed effect regression coefficients. The model is a simple extension of Guo and Thompson [1995], except that the simplifying assumption of no inbreeding effects on segregation variances is lifted. Monte Carlo standard errors are estimated via the usual batch method (see for example, Ripley [1987]). To improve estimation of individual random effects, individuals are automatically replicated as many times as requested, so obtaining multiple estimates (by the simple expedient of automatically duplicating pedigrees in the data), and adjusting the standard error calculations. Sib-pair does not offer the common approach of running multiple uncorrelated MCMC chains, though obviously this can be done sequentially. Summary statistics for the parameters are tabulated, notably the mean and mode (the latter a nonparametric maximum likelhood estimate assuming a unimodal distribution [Meyer 2001]), standard deviation, and MC standard error. A postscript plot of the likelihoods over the run is automatically generated, but other parameters can be dumped and examined using other software packages, such as Coda [Plummer et al 2006; RCDT 2007].

CONCLUSIONS

The Sib-pair program has been used for a number of years in several centres, and was cited 58 times in published scientific papers to 2007 (Web of Science citation search, performed August 2007). Many of the individual data analytic options it offers are available using other packages, though perhaps the full range is unique. My own mode of usage is to carry out preliminary analyses in Sib-pair itself, and then use the data export facilities to carry out specialized analyses in other packages, a procedure automated using Sib-pair macros. In addition, the API is simple, so that one can rapidly add additional procedures for particular problems, taking advantage of the now extensive library of genetic and statistical routines.

REFERENCES

Appendix 1: All the Sib-pair Commands

Global Commands
!|#comment
echoprint text
$ <cmd> shell command
dir <args> file listing for current directory
pwd <dir>print or change current directory
file del|ren <fil> [<new>]delete or rename a file
clearreset
help [<key> | Al | Gl | Op | Da | An] help
infoprogram information
list | ls [<loc1> [[to <locN> [$ (a | q | m | h | x) [m | r]] list loci
which [<loc1> [[to <locN> [$ (a | q | m | h | x) [m | r]] list positions of loci
show locilocus information
show pedigreestabulate pedigrees
show idstabulate IDs
show mapshow linkage map
show macroslist macros
timetotal elapsed time
set timertime procedures
include <fil> read commands from file
output [<fil>] divert text output to a file
last [<num>] command history
quitexit program
set promptdisplay prompt
set gui on | offactivate GUI
set ndecimal_widthoutput decimal digits
set epoch [jul | iso | mjd | <epo>] set epoch for Julian dates
set out | ple -1 | 0 | 1 | 2 | ver | on | off output verbosity
set weight founders Weighted allele frequency estimator
set analysis [imp | obs] Include imputed alleles in regression
set burn-in <it> No. of MCMC burn-in iterations
set iteration <it> Max MC iterations
set emit <it> No. of EM iteration
set batch <num> No. of MCMC batches
set chain <num> No. of MCMC chains
set tune <tun> MCMC tuning parameter
set mincount <num> Min MC iterations
set seeds <s1> <s2> <s3> RNG seeds
set fbatimpute on | off FBAT imputation
set sml <pA> <penAA> <penAB> <penBB> default SML model
set tdt TDT family selection
set ibd [<cM> [<Nmark>]] Threshold for lumping markers for multipoint ibd
set hre zero | chi assume zero recomb for phased LD model
set map function kosambi | haldane linkage mapping function
Utilities
pchisq <x2> <df> [<df2>] Chi-square or F-dist
chisq <nr> <nc> Contingency Chi-square
proportion <num> <den> [<width>] CI for proportion
sml <pA> <penAA> <penAB> <penBB> recurrence risks
grr <prev> <pA> <GRR> [<add | dom | rec>] recurrence risks
ito <pA> [<penAA> <penAB> <penBB>] carrier rates in relatives
Basic Language Objects
0,1,0.23 Numbers
pi y n x Named constants
"1" "a" "is a string"Quoted strings
"<all>/<all>"Genotypes
Alleles are 1..999, A-Z, a-z
Operators
( ) ; : Grouping and precedence
:Separate sequence of expressions
* | / | + | - | ^Arithmetic
< | > | =< | => | ge | le | eq | == | ne | ^= | and | orComparison
if | thenConditional statement execution
Mathematical Functions
log | sqrt | exp | sin | cos | tan | asin | acos | atan | absMaths functions
int | round Truncating functions
rand | rnormUniform or Gaussian random numbers
greg | julian Date functions
Data Functions
istyp | untypTyped at marker?
ishomHomozygote?
alla | allbFirst or second allele
marcomMaximum markers common with a relative
numtyp | anytyp | alltyp number of genotypes
num | nfound family size/no. founders
famnum | indexFamily or person index number
Macros
macro <nam> [<body>] create or delete a macro
%<nam> a macro variable
%% %0 %N %+Nmacro function arguments
eval [<SEXPR>] evaluate a Scheme expression
Command Iteration
<cmd> { <val1> [<val2> ...] }Loop over set
Reading in Data
set datadirectory Set data directory
set workdirectory Set work directory
set impute on | off Marker imputation
set errordrop on | off remove nuclear family mendelian errors
set checking on | off toggle error checking
set locus <nam> mar | xma | hap | qua | aff [<map_position> [<comments>]] declare locus position,type
declare loci <num>(m | x | q | a) [<N>(m | x | q | a) ...] batch declare loci (autoname)
rename <loc> [to] <new> rename locus
read locus linkage <fil> read Linkage locus file
read locus merlin <fil> read Merlin pedigree file
read pedigree <fil> read Sib-pair pedigree file
read linkage <fil> read pre-Makeped pedigree file
read ppd <fil> read post-Makeped pedigree file
read cases <fil> read nonpedigree data file
set sex marker <mar> Sex informative marker
set twin <twin> MZ twin indicator
set liability <tra> <liab> <nlev> declare liability classes for trait
set skiplines <slines> Skip lines at head of pedigree file
order <loc1>.[to]..<locN> [$(m | x | h | q | a)[rm]] reorder loci
set map <pos1>...<posN> set marker map
set distances <dis12> <dis23>...<disN-1N> set marker map
read map <fil> read map, guessing format
set frequencies <mar> [<frq1>...<frqN>] set allele frequencies for a marker
run process pedigree data
Subsetting Data
keep | drop <loc1>.[to]..<locN> [$m | x | h | q | a] keep/drop useful/less loci
keep | drop where (mon|max|num|dis|eve|pos|hwe|<str> keep/drop useful/less loci
undrop <loc1>.[to]..<locN> [$m | x | h | q | a] return loci to analysis
undrop where <str> return loci to analysis
select [con | exa <npro> whe] <expr> select pedigrees on criterion
select pedigree | id [not [in]] <ped1>...<pedN> select on name
unselect [Nth] return pedigrees to analysis (from Nth last selection)
pack [loc | ped] delete dropped pedigrees and loci permanently
Transforming Data
edit <ped> <per> | all <loc> [to] <val1> [<val2>] edit data
delete <ped> <per> | all set all data missing for person or
delete [<loc1>...<locN>] whe <expr> set selected data missing
get <rel> <sum> <tra> [<new>]get/save relatives trait value summary
recode [<mar> | $(m | x)] [fre] recode alleles to 1..n by size/freq
recode [<mar> | $(m | x)] [let | num] recode nt alleles to/from numbers
recode <loc> <val1>...<valN> to <new> recode old values to new
combine <mar1>...<marN> [<thr>] recode rare alleles
flip <mar> recode SNP nucleotides to complementary strand
date (<yyyymmdd> jul) | (<num> gre) julian date conversion
date [<tra>] [jul | gre | yea] julian date conversion
standardize <loc> [fam] standardize trait value
adjustNote: superceded by residuals command. linear regress adjust
residuals <tra> on <loc1>...<locN> [com] linear regress resid
predict <tra> on <loc1>...<locN> [com] linear regress predicted
impute <tra> on <loc1>...<locN> [com] linear regression imputation
impute <tra> familial imputation (esp age)
kaplan-meier <tra> <cen> [res] survivor function estimate
rank <tra> <rank> rank of trait value
blom <tra> <blom_score> inverse normal transform
simulate <mar> [<linked_to>] [<Nall> | <frq1>.<frqN>] simulate a trait
simulate <tra> [<h2>] [<linked_to>] simulate a trait
simulate pedigrees [<nped> [<ngen> [<min_kids> [<max_kids> ]]]] simulate pedigrees
Transforming Pedigrees
nuclear [<maxsibs>] [gra] convert to (trimmed) nuclear families
subpedigrees divide into subpedigrees (if compound)
join <ped1> [...<pedN>] join up pedigrees by shared IDs
prune <tra> [c_op <thr>] prune unaffecteds
cases <tra> divide into unrelated cases
unique_id [seq] generate numerical IDs
Outputting Data
print ped <ped1>...<pedN> [id <id1>...<idN>] print data
write [gas] [<fil>] write Sib-pair pedigree file
head | tail [<nrec>] print head or tail of pedigree file
write pap write to PAP
write var write MENDEL var file
write locus pap write PAP locus file
write arlequin [par | all] <fil> write Arlequin data file
write asp | tcl [dum] <fil> write Aspex pedigree file
write beagle <fil> [fou] write Beagle type data file
write crimap <fil> write Cri-map pedigree file
write csv <fil> [phe] [nop] write CSV type file
write dot [<tra> [<gen>]] <fil> write Dot graph file drawing pedigree
write fisher <fil> write FISHER pedigree file
write gda <fil> write GDA pedigree file
write gh [dum] <fil> write Genehunter pedigree file
write haploview [dum] <fil> write Haploview pedigree file
write linkage | pre [dum] <fil> write Linkage pre-Makeped pedigree file
write mendel <fil> [tra]write MENDEL pedigree file
write merlin [dum] <fil> write Merlin pedigree file
write mim <fil> write MIM pedigree file
write morgan <fil> write MORGAN pedigree file
write pap write PAP trip.dat and phen.dat
write ped <fil> write Sib-pair/GAS pedigree file
write phe <fil> write FBAT/Sibs type phenotype file
write ppd [dum] <fil> write Linkage post-Makeped pedigree file
write ram <tra> write LDL_rams ped and dat files
write sage <fil> write SAGE pedigree file
write solar <fil> [phe] [nop] write Solar type pedigree file
write structure <fil> [fou] write Structure type data file
write locus asp | tcl <fil> write ASPEX locus file
write locus eclipse <fil> write Eclipse data file
write locus fisher <fil> write FISHER locus file
write locus gas <fil> write GAS locus file
write locus loki <fil> [<pedfil>] write Loki "prep" control file
write locus gh <fil> [dum] [xli] write Genehunter locus file
write locus haploview <fil> write Haploview info file
write locus linkage <fil> [dum] [xli] write Linkage locus file
write locus mendel <fil> [tra]write MENDEL locus file
write locus merlin <fil> write Merlin locus file
write locus morgan <fil> write MORGAN locus file
write locus pap write PAP header.dat and popln.dat
write locus relpair <fil> write RELPAIR locus file
write locus sage <fil> write SAGE locus file
write locus sibpair <fil> [<pedfil>] write Sib-pair script
write locus structure <pedfil> [<locfil>] write Structure mainparam
write locus superlink <fil> [dum] [xli] write Superlink locus file
write map mendel <fil> write MENDEL map file
write map loki <fil> write Loki parameter file
write map mendel <fil> write MENDEL map file
write map merlin <fil> write Merlin map file
write map solar <fil> write Solar map file
write var <fil> write MENDEL var file
Analysing pedigrees
generations [<qua_tra> [rev]] summarize pedigree(s) and (save) generations
relatives <ped> <id> show immediate relatives of index
ancestors <tra> [c_op <thr>] common ancestor of most probands
Checking data
test dob <tra> [gre] [<thr>] Check consistency of DOBs
test age <tra> [<thr>] Check consistency of ages
test sex Check sex using markers
test haploid Check Mendel errors
Analysing data
frequencies | describe [snp | <loc1>..<locN>] descriptive statistics
mcf <loc> | $mMCEM allele frequencies
count [whe] <expr> count where expression true
print [whe] <expr> print data where expression true
mean | correlation [<loc1>..<locN>] phenotypic means and correlations
tab <tr1> [<tr2>...<trN>] contingency table
llm <tr1> ... [<tr1> * <tr2>...] [-1] log-linear model of a contingency table
kruskal-wallis <qua tra> <loc> Kruskall-Wallis test
regress <qua tra> on <loc1>.[to]..<locN> linear regression
regress <bin tra> on <loc1>.[to]..<locN> [off <off>] [sim] logistic regression
regress <tra> on <loc1>.[to]..<locN> [off <off>] poisson [sim] poisson regression
regress <tra> ... [off <off>] [(exponential | weibull <cens>)] [shape <sha>] [sim] survival regresssion
mixture <qua tra> [[<num>] [nor | poo | exp | poi]] test admixture
lifetable <sta> <end> <cen> [<wid1> [<wid2>]] [time] [cov <cov>]life table
Analysing genetic data
gpe <mar> [<dose>]Estimate genotype probabilities or expected gene dose
segregation <mar> [unp]Tabulate marker segregation
haplotypes <mar1> <mar2> <new> [<thr>] haplotypes for SNPs in complete LD
triads show triad-phaseable haplotypes
hwe [fou] [<mar1> ..[to].. <mar1>] [$(m | x)] test HWE
mztwin [zyg <co> [<thr>]] [clean | delete] MZ pair genotype discordance | drop one member
summary [<N_tests> | plot [<fil>]] summarize last genetic test
Genetic equilibria
homoz <tra> [<c_op> <thr>] marker homozygosity
multihomoz <tra> [<c_op> <thr>] multipoint homozygosity
fstats <tra> [fou] Population genetic F-statistics
dis [[<marker locus 1>] <marker locus 2>] intragametic association
Kinship
kinship [pai | inb] [<tra> [c_op <thr>]] kinship/inbreeding coefs
ibd <loc> [pai] IBD matrix for marker
ibs <loc> [pai] IBS matrix for marker
hbd <loc> [<coe>] homozygosity-by-descent at marker
cksib sib pair ibs sharing at multiple markers
share [pai] rel pair ibs sharing at multiple markers
Variance components and segregation analysis
davie <tra> <pro> segregation ratios under ascertainment
twin <tra>[<c_op> <thr>] analyse twin correlations/concordances
varcomp <tra> [[a][c][d]e] [cov <var1>..+ <varN>] Variance Components trait analysis
lrt compare last 2 models fitted (mix/VC/GLM/GLMM)
blup <tra> <h2> BLUP for AE variance components model
fpm <tra> [<c_op> <thr>] [nqtl <nqtl>] [p] [a] [d] [g] [c] [s] MCMC mixed/SML/finite polygenic
fpm <tra> ... [(p | g | a | c | s)va | AA | AB | BB | mu | var <val>] MCMC fpm start values
fpm <tra> ... [fixed p | a | c | d | e | g | m | mu | s | var MCMC fpm fixed pars
fpm <tra> ... [lin logit | probit | ln | mft] [lik gau | bin | poi] [cov <var1> [+ <var2>...] MCMC fpm
Association
assoc <tra> [<c_op> <thr> | cat] [fou] [gen] allelic/genotypic association with a trait
tdt <tra> [<c_op> <thr>] [pat | mat] several TDTs
hrr <tra> [<c_op> <thr>] Haplotype Relative Risk
schaid <tra> <mar> [<all>] Schaid & Sommer HWE test
Linkage
asp <tra> [<c_op> <thr>] affected sib-pair IBS and IBD linkage analysis
penrose <loc1> <loc2> Penrose sib-pair linkage
apm <tra> [<c_op> <thr>] [ibd | ibs] IBS or IBD NPL analysis
sibpair <tra> [<wei>] [sim] [cor <r> [mea <m>] [sd | var <v>]] regression-based QTL linkage
twopair <tra> <loc1> <loc2> <theta> two-point Haseman-Elston
qtlpair <tra> [full [cqe] [cov <var1>..+<varN>]] sibs or pedigree VC linkage
varcomp <tra> [aqe <mar1>..+ <marN>] [cov <var1>..+<varN>] Variance Components linkage analysis
linkage [<loc1> [<loc2>]] intermarker sib pair linkage