David L. Duffy, MBBS PhD.
Queensland Institute of Medical Research,
300 Herston Road,
Herston, Queensland 4029, Australia.
Email: davidD@qimr.edu.au
There are now several hundred software packages for genetic analysis listed on Wentian Li's website (http://www.nslij-genetics.org/soft/). Most of these implement particular statistical genetic procedures in the domains of linkage, association, variance components or segregation analysis, but a number of multifunctional packages such as MERLIN, MENDEL, PAP, SAGE, SOLAR are also in common use. Other packages are designed for the manipulation of pedigree data and conversion of data files from one format to another used by this rich choice of tools - the MADELINE package is one example.
Sib-pair is a computer program for both the manipulation and the statistical analysis of genetic datasets. It is somewhat misnamed, in that it was always capable of analysis of arbitrary pedigrees, but the first analyses implemented were multiallelic transmission-disequilibrium tests and sib pair Haseman-Elston linkage analysis [Duffy 1997]. Over the thirteen years of development, I have added a range of other analyses, ranging historically from Penrose sib-pair linkage analysis to the fitting of generalized linear mixed models, including genetic survival analysis. These are embedded in an interpreted language that offers a large number of pedigree and individual oriented data operations, and the ability to write out data files in the formats required by many other useful packages. After a rewrite of core routines in 2006, it can now deal with genome wide association sized datasets.
The program is written in Fortran 95, and is designed to be: a) a useful tool for day-to-day tasks encountered by a genetic analyst; b) an easily extensible platform for exploration of statistical methods, concentrating on simulation based approaches (Monte-Carlo and Markov Chain Monte-Carlo (MCMC)). It should be noted that, except in a few cases (multipoint identity-by-descent calculations, haplotype frequency estimation and association, the experimental homozygosity runs test), the analytic methods that Sib-pair implements are single locus.
The program implements an interpreted language with over 200 commands (see the online documentation for a complete list). Commands are entered at a command line prompt or read from a script file. The commands offered include calendrical, mathematical and genetic utilities, such as those for calculation of recurrence risks, genotype carrier probabilities and expected ibd sharing under a specified single major locus model (the ito, grr and sml commands), P-values for the chi-square and F distributions (pchisq); and simple tests on tables of counts entered from the keyboard such as the contingency chi-square (chisq), and asymptotic and exact tests of Hardy-Weinberg and linkage equilibrium (hwe, diseq).
In passing, users of the elegant GAS package [Young 1995] will find much that is familiar in the overall design and functionality of Sib-pair, and at one time, GAS type scripts could be read by Sib-pair. Automation of analysis tasks is augmented by a simple macro facility, as well as a small embedded Scheme (Lisp) interpreter.
Data can be read into Sib-pair in files of five different formats: Sib-pair, Linkage, MERLIN, HapMap, and PLINK. The Linkage and Merlin formats require the standard files describing the traits and genotypes, pedigree relationships and genetic maps for loci. Either text-based or binary (.bed) PLINK format files can be read.
The Sib-pair native pedigree file format is a modified "pre-Makeped" format (similar to the GAS [Young 1995] format) that allows letter codes for sex ("MmFfXx"), binary traits ("YyNnXx"), and marker alleles ("a-zA-Z1..999"), where "x" is reserved as the missing data placeholder. This can be read from a file, or inlined within a control script so making an analysis self-contained and self-documented. Each phenotype needs a one line declaration in the control script (or entered at the keyboard) taking the form:
set locus <name> <type> [<map position> [<comment>]]
The current data types are "affection" (binary), "quantitative", (including dates), "marker" (autosomal codominant), "xlinked" (codominant X-linked), and "haploid" (eg Y-chromosome, mitochondrial). The map position can be on a genetic or sequence map. The comment string is stored, and can be searched upon or accessed (eg the keep and drop commands), so allowing quick subsetting of the data in an analysis. If data from unrelated individuals is being read, then the data format can be simplified to just include an ID rather than the standard five fields required in the usual Linkage format. One quantitative trait can be declared to be a twin zygosity indicator, and one marker can be declared as "sex-informative" (eg Amelogenin).
Data can also be simulated by Sib-pair, including pedigree structures, traits and codominant genotypes, and this can be conditional on observed genotypes at a locus, or on linkage to another observed marker locus.
When the data are read in, a number of checks are automatically carried out (see Table 1 and Figure 1). Mendelian inconsistencies are first checked for at the nuclear family level, and then more broadly using the Lange-Goradia algorithm [Lange and Goradia 1988]. Additional tests can be requested, such as for consistency of dates of birth or age within a pedigree (test dob).
Table 1 Sib-pair error checking
Checks for duplicate records |
Checks for impossible pedigree relationships (eg own father) |
Checks sex of parents |
Creates extra pedigree records as required |
Sorts the pedigree by generation number |
Checks for unconnected components within the pedigree |
Tests and reports Mendelian errors |
Tests that the reported sexes are consistent with sex-linked marker genotypes |
Tests that monozygotic twins are concordant at all markers |
Generates legal values for all missing genotypes within the pedigree - used to start MCMC algorithms. |
Impute missing genotypes if requested ("set imputation on") |
Figure 1 Example Sib-pair data error reports
Pedigree Individual Sex Post.Pr(M) X-marker hets ------------ ---------- --- ---------- -------------- [...] 32802 3280203 m 0.000000 14/ 19 32802 3280204 f 1.000000 0/ 19 Designated Sex inferred via sex-linked markers Sex Likely Male Uncertain Likely Female ---------- ----------- --------- ------------- Male 721 1806 5 Unknown 0 145 0 Female 5 2136 811 NOTE: inconsistency due child 34114-3411402 at locus GATA31F01P {145/153} X-linked locus "GATA31F01P" -------------- Sibship: 34114-3411403 x 34114-3411404 Inconsistency between sibling genotypes. [3411403] (3411404 x/- 145/153 | | +=========+=========+ | +---------+---------+---------+---------+ | | | | | (3411401) (3411402) (3411451) [3411452] [3411453] 145/149 145/153 145/153 153/- 153/- NOTE: Mendelian inconsistency in pedigree 02590 at X-linked locus "GATA31E08". X-linked locus "GATA31E08" -------------- Sibship: 02590-0259001 x 02590-0000005 Multigenerational inconsistency between genotypes. [0259003] (0259004) x/- 242/242 | | +====+====+ | [0259001] (0000005 x/- x/x | | +=========+=========+ | +----+----+ | | [0259008] (0259009) 230/- 230/230 ID Count Problem phenosets ---------- -------- ----------------- Paternal Gparents 0259003 3 230/- 242/- +/- 0259004 Typed 242/242 Paternal Uncles/Aunts 0259002 1 242/- Father 0259001 Problem 242/- Mother 0000005 Problem 230/230 230/242 230/+ 242/242 242/+ +/+ Children 0259008 Typed 230/- 0259009 Typed 230/230 Parent 02590-0259001 cannot carry the "230" allele found in child 02590-0259008. Parent 02590-0259001 cannot carry the "230" allele found in child 02590-0259009. |
Once read in, phenotype, locus and map data can be exported to a number of formats, currently including: Aspex, Beagle, Crimap, CSV, dot, eclipse, FBAT, Fisher, GDA, Genehunter, Haploview, Linkage (pre and post Makeped), Loki, Mendel, Merlin, MIM, MORGAN, PAP, RAM, Relpair, SAGE, Solar and Structure.
The simplest manipulations that Sib-pair provides are the subsetting of loci and of pedigrees. The keep and drop commands affect loci, and can be addressed by name (wild-card searchable or a range), index in list of loci, range of map positions, major allele frequency, HWE or association test P-value, number genotyped (phenotyped), strength of linkage disequilibrium with neighbours, or by a search of the associated comment string. The select statement selects a subset of pedigrees, either by a search on family name, or by a count of members meeting an eligibility criterion. These criteria can include family size, number of founders, number of markers genotyped at, the maximum number of markers genotyped in common with a relative, the value of a phenotype or genotype (including homozygous status).
The pedigree transformation commands allow regeneration of IDs, subdivision into component nuclear families, pruning of pedigrees based on a key phenotype or genotype, extraction of unrelated cases and controls based on a phenotype, concatenation of pedigrees based on shared individual IDs, identification of unrelated individuals and subpedigrees under a common pedigree ID and splitting off of these if requested.
There are all the usual algebraic and mathematical procedures aimed at individual level data available in standard statistical packages, and are written in the usual infix notation (eg "logBMI=log(weight/height^2)"). Other commands include ranking or inverse normal transformation of quantitative data (rank, blom), calculation of residuals from linear regression (residuals) or Kaplan-Meier survival analysis (kaplan).
Finally, the get command allows the writing of a summary statistic for the trait values in specified classes of relatives of each individual to a new variable. This can be a randomly sampled value from that set (allowing construction of permutation tests).
For each type of locus, there are simple summaries that are a useful preliminary to formal analysis. For quantitative traite, as well as summaries of the distribution (mixtures of gaussian or other distributions can be fitted using the mix command), one obtains a tabulation of the familial correlations, along with a (delete-d) jackknife-derived standard error. For binary traits, recurrence risks and segregation ratios (including via the method of Davies [1979]). A fast classical twin analysis of binary or quantitative data is also implemented.
Both simple counting and maximum likelihood estimates of the allele frequencies of codominant markers can be produced (the latter is calculated by a Monte-Carlo Expectation Maximization (MCEM) algorithm). One also can test Hardy-Weinberg disequilibrium (exact P-value for diallelic markers). In intact pedigree data, the P-value for the HWE test is obtained by gene dropping (and so is a legitimate test combining population level Hardy-Weinberg disequilibrium in genotyped founders along with any evidence of segregation distortion in nonfounders). The diseq command tabulates all pairwise intermarker linkage disequilibria (for autosomal or X-linked codominant markers). It will also estimate haplotype frequencies for specified sets of markers, and if a stratum indicator variable is specified will give a likelihood ratio test statistic testing homogeneity of haplotype frequencies across the strata. An expectation-maximization algorithm is used, which for two-marker systems combines information from phased and unphased genotypes (phase being inferred for individuals with full parental information).
There are a number of association procedures implemented in Sib-pair that cover a wide range of phenotypes and both within and between family analyses. In many cases, the tests rely on gene dropping simulation to give correct P-values when arbitrarily structured families are being analysed. Because this can be time consuming, I have followed Besag and Clifford [1991] in calculating sequential Monte-Carlo P-values, so that the number of simulations used to estimate a nonsignificant (ie large) P-value is small. For very large datasets, since most tests have an asymptotic counterpart, these can be used to screen for interesting results to be followed up using a MC approach. Sib-pair allows the gene-dropping to be either unconditional, or conditional on identity-by-descent (IBD) at a flanking marker.
The association command implements gene-dropping approach for binary and categorical traits, and quantitative traits where a ordinary least squares ANOVA is appropriate. If covariate effects need to be adjusted for, one must use other commands. For example, measured genotype analysis of a trait can be carried out by generalized linear model (binomial, gaussian, poisson, exponential) analysis with gene-dropping used to assess the strength of association allowing for residual family correlation (using the regress command along with the sim keyword). This approach is extended following Aitken and Clayton [1980] to Weibull distribution based parametric survival analysis. To incorporate imputed genotypes (for pedigrees where some marker genotypes are unobserved), multiple imputation via a hybrid Metropolis-Gibbs genotype sampler can be carried out (rep keyword).
Alternatively, a variance components based approach to residual familial correlation is available using the variance_components command, or in the case of survival analysis, the MCMC based fpm command, that allows fitting of generalized linear mixed models. The MCMC genotype sampler mentioned earlier can be used to generate expected allele dose for each unobserved genotype (gpe command).
For population genetic purposes, the same routines when called by the fst command also calculate the F-statistics for the categorical trait tests, and then combine these to give multilocus F statistics following Pons and Chaouche [1995].
The association command also implements the FBAT (Laird et al 2000) within-family test for binary data. The expectation and sampling variance of the FBAT test statistic is calculated by gene-dropping with rejection, conditioning on the observed genotypes necessary to impute the missing parental genotypes, and this approach obtains the same values as the exact evaluation used in the FBAT program. It is slower as a result, but is extendable to more complex situations eg transmission through three generations.
In addition, there are four other commands implementing different flavours of the transmission-disequilbrium test (TDT). The hrr command carries out the original Haplotype Relative Risk test, but again provides both asymptotic and gene-dropped P-values. The schaid command gives the Schaid and Sommer [1993] log-linear model for a diallelic genotypic TDT. The tdt command gives a quantitative trait TDT following Gauderman [2003], which is essentially identical to the default test provided by the QTDT program. For a binary trait, it gives four tests: the original test of Spielman et al [1993] for each allele in turn (unless it is a diallelic marker), and three multiallelic TDTs. Finally, the sdt command gives a conditional logistic regression analysis of a binary trait versus (allelic-coded) genotype stratifying on sibship (equivalent to the "sibship disequilibrium test" [Boehnke and Langefeld 1998; Horvath and Laird 1998]).
Sib-pair offers a variety of approaches to testing for marker homzygosity as a means to detecting trait loci or Hardy-Weinberg disequilibrium. The homoz command tests locus by locus, and as usual implements a gene-dropping approach to deal with pedigree data. The mulhom command tests for runs of homozygosity. Both the mean overall run length and best run length centred on each marker in turn is calculated for each eligible individual over the set of active markers. Simulation is used to approximate the expected distribution of these test statistics (this does not currently allow for intermarker linkage disequilibrium). Finally, the MCMC genotype sampler has an interface allowing the saving of a homozygosity-by-descent indicator for any single marker (which for completely linked markers could be averaged to approximate the fully multipoint estimate).
All the Sib-pair linkage routines are "nonparametric" in nature (with the exception of the lod command, which performs two-point codominant marker linkage analysis). The penrose command implements the original sib pair linkage method of Penrose [1935, 1937]. The apm command carries out both identity by state (IBS) and identity by descent (IBD) sharing based analysis for single point linkage analysis of binary traits. The expectation and variances of the sharing statistics are generated by gene-dropping simulation, and are fast even in large pedigrees. The usual affecteds-only Spairs and Sall ("NPL") statistics of Whittemore and Halpern [1994] are calculated, along with the general pairs (GPM) statistic of Ward and Bonaiti-Pellie (1995). The latter combines results from affected-affected, affected-unaffected, and unaffected-unaffected pairs in a weighted fashion, and for Mendelian disorders at least, is often more powerful than the affected-only statistics.
For quantitative traits, Sib-pair offers several flavours of regression based linkage test [Haseman and Elston 1970; Sham and Purcell 2001; Visscher and Hopper 2001; Szatkiewicz and Feingold 2004] as well as variance components. The fastest options restrict analysis to full and half sibs, but complete pedigree variance components linkage analysis can also be performed, where IBD is estimated using the MCMC genotype sampler. To estimate multipoint IBDs, Sib-pair calculates variance weighted averages of "single-point" IBDs over sets of markers that are in complete linkage, the simplest application of the approach of Fulker and Cardon [1996] as extended by Almasy and Blangero [1998]. The variances are the empirical sampling variances arising from the MCMC simulation. This approach is not used by the apm routine.
The fpm (finite polygenic model) command uses MCMC algorithms to fit GLMMs, but can also include single or multiple major loci in the model, as the name suggests. These latter options are still unstable (and are not recommended for casual users), but polygenic GLMM fitting seems to be more robust. I have used Sib-pair to successfully replicate results for one and two level hierarchical models fitted to standard datasets from the general statistical literature on GLMMs (Bates 2005; http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml), Gaussian linear mixed models for large animal breeding datasets, and multivariate survival analysis. The fpm command also allows the fitting of mixed models to survival data, under a Weibull model with Gaussian random effects. It has not yet been extended to linkage analysis.
The single locus MCMC genotype sampler uses a slightly ad hoc combination of Metropolis and Gibbs sampler steps. A Metropolis step is carried out in 5% of iterations, and alternates between three types of global proposal: a. "mutation" of up to four founder alleles, changes which are propagated down the pedigree; b. a gene-dropping of 2*nfounder "identity-by-descent indicator" alleles, where transmissions that cause an incompatibility between observed genotypes are rejected (this can be seen as a mechanism for generating consistent descent graphs for the observed genotypes); c. a swap of grandparental origin of alleles in a randomly chosen individual (after Lange and Matthysse [1989], Sobel and Lange [1996]), which are then propagated up the pedigree. Following this, the (unobserved) genotypes for all untyped by untyped matings are resimulated, and the usual Metropolis acceptance-rejection step is then carried out.
Because it is difficult/expensive to calculate the proposal probabilities for these complex global proposals (in large pedigrees at least), these are assumed to be symmetric. In the case of proposal mechanism b., this is not the case. This is why the Metropolis steps are kept relatively infrequent compared to the standard single-site genotypic Gibbs sampler that makes up the bulk of iterations (in order to minimize the effects of that approximation). The global proposal steps allow the program to obtain better mixing and thus better likelihoods for the "difficult" (for pure Gibbs samplers) example pedigrees presented by Jensen and Sheehan [1998] and Canning and Sheehan [2002]. However, the resulting estimates are only approximate.
The MCMC IBD estimation is carried out via a "perfectly informative" gene-drop conditional on the simulated and observed genotypes (by rejection sampling) at each genotypic sampler iteration, with the empirical kinship coefficient being the average sharing of indicator alleles over the replicates. To carry out gene-dropping at a second marker locus conditional on IBD at a first locus, IBD is captured by a first "perfectly informative" gene drop, and a second gene drop using the marker allele frequencies is then matched to this, again by rejection sampling.
MCEM estimation [Wei and Tanner 1990] of allele frequencies replaces the usual likelihood evaluation step with the average result from a number of MCMC replicates with the parameters of interest (founder allele frequencies) set to the currrent trial value. It is accelerated by simple doubling of E-step size (damped with increasing number of iterations).
The GLMM MCMC sampler combines Gibbs sampling for unobserved diallelic trait locus genotypes with sliced Metropolis sampling for other model parameters such as gaussian (polygenotype) breeding values, trait locus penetrances/means, and fixed effect regression coefficients. The model is a simple extension of Guo and Thompson [1995], except that the simplifying assumption of no inbreeding effects on segregation variances is lifted. Monte Carlo standard errors are estimated via the usual batch method (see for example, Ripley [1987]). To improve estimation of individual random effects, individuals are automatically replicated as many times as requested, so obtaining multiple estimates (by the simple expedient of automatically duplicating pedigrees in the data), and adjusting the standard error calculations. Sib-pair does not offer the common approach of running multiple uncorrelated MCMC chains, though obviously this can be done sequentially. Summary statistics for the parameters are tabulated, notably the mean and mode (the latter a nonparametric maximum likelhood estimate assuming a unimodal distribution [Meyer 2001]), standard deviation, and MC standard error. A postscript plot of the likelihoods over the run is automatically generated, but other parameters can be dumped and examined using other software packages, such as Coda [Plummer et al 2006; RCDT 2007].
The Sib-pair program has been used for a number of years in several centres, and was cited 58 times in published scientific papers to 2007 (Web of Science citation search, performed August 2007). Many of the individual data analytic options it offers are available using other packages, though perhaps the full range is unique. My own mode of usage is to carry out preliminary analyses in Sib-pair itself, and then use the data export facilities to carry out specialized analyses in other packages, a procedure automated using Sib-pair macros. In addition, the API is simple, so that one can rapidly add additional procedures for particular problems, taking advantage of the now extensive library of genetic and statistical routines.
Global Commands | |
---|---|
!|# | comment |
echo | print text |
$ <cmd> | shell command |
dir <args> | file listing for current directory |
pwd <dir> | print or change current directory |
file del|ren <fil> [<new>] | delete or rename a file |
clear | reset |
help [<key> | Al | Gl | Op | Da | An] | help |
info | program information |
list | ls [<loc1> [[to <locN> [$ (a | q | m | h | x) [m | r]] | list loci |
which [<loc1> [[to <locN> [$ (a | q | m | h | x) [m | r]] | list positions of loci |
show loci | locus information |
show pedigrees | tabulate pedigrees |
show ids | tabulate IDs |
show map | show linkage map |
show macros | list macros |
time | total elapsed time |
set timer | time procedures |
include <fil> | read commands from file |
output [<fil>] | divert text output to a file |
last [<num>] | command history |
quit | exit program |
set prompt | display prompt |
set gui on | off | activate GUI |
set ndecimal_width | output decimal digits |
set epoch [jul | iso | mjd | <epo>] | set epoch for Julian dates |
set out | ple -1 | 0 | 1 | 2 | ver | on | off | output verbosity |
set weight founders | Weighted allele frequency estimator |
set analysis [imp | obs] | Include imputed alleles in regression |
set burn-in <it> | No. of MCMC burn-in iterations |
set iteration <it> | Max MC iterations |
set emit <it> | No. of EM iteration |
set batch <num> | No. of MCMC batches |
set chain <num> | No. of MCMC chains |
set tune <tun> | MCMC tuning parameter |
set mincount <num> | Min MC iterations |
set seeds <s1> <s2> <s3> | RNG seeds |
set fbatimpute on | off | FBAT imputation |
set sml <pA> <penAA> <penAB> <penBB> | default SML model |
set tdt | TDT family selection |
set ibd [<cM> [<Nmark>]] | Threshold for lumping markers for multipoint ibd |
set hre zero | chi | assume zero recomb for phased LD model |
set map function kosambi | haldane | linkage mapping function |
Utilities | |
pchisq <x2> <df> [<df2>] | Chi-square or F-dist |
chisq <nr> <nc> | Contingency Chi-square |
proportion <num> <den> [<width>] | CI for proportion |
sml <pA> <penAA> <penAB> <penBB> | recurrence risks |
grr <prev> <pA> <GRR> [<add | dom | rec>] | recurrence risks |
ito <pA> [<penAA> <penAB> <penBB>] | carrier rates in relatives |
Basic Language Objects | |
0,1,0.23 | Numbers |
pi y n x | Named constants |
"1" "a" "is a string" | Quoted strings |
"<all>/<all>" | Genotypes |
Alleles are 1..999, A-Z, a-z | |
Operators | |
( ) ; : | Grouping and precedence |
: | Separate sequence of expressions |
* | / | + | - | ^ | Arithmetic |
< | > | =< | => | ge | le | eq | == | ne | ^= | and | or | Comparison |
if | then | Conditional statement execution |
Mathematical Functions | |
log | sqrt | exp | sin | cos | tan | asin | acos | atan | abs | Maths functions |
int | round | Truncating functions |
rand | rnorm | Uniform or Gaussian random numbers |
greg | julian | Date functions |
Data Functions | |
istyp | untyp | Typed at marker? |
ishom | Homozygote? |
alla | allb | First or second allele |
marcom | Maximum markers common with a relative |
numtyp | anytyp | alltyp | number of genotypes |
num | nfound | family size/no. founders |
famnum | index | Family or person index number |
Macros | |
macro <nam> [<body>] | create or delete a macro |
%<nam> | a macro variable |
%% %0 %N %+N | macro function arguments |
eval [<SEXPR>] | evaluate a Scheme expression |
Command Iteration | |
<cmd> { <val1> [<val2> ...] } | Loop over set |
Reading in Data | |
set datadirectory | Set data directory |
set workdirectory | Set work directory |
set impute on | off | Marker imputation |
set errordrop on | off | remove nuclear family mendelian errors |
set checking on | off | toggle error checking |
set locus <nam> mar | xma | hap | qua | aff [<map_position> [<comments>]] | declare locus position,type |
declare loci <num>(m | x | q | a) [<N>(m | x | q | a) ...] | batch declare loci (autoname) |
rename <loc> [to] <new> | rename locus |
read locus linkage <fil> | read Linkage locus file |
read locus merlin <fil> | read Merlin pedigree file |
read pedigree <fil> | read Sib-pair pedigree file |
read linkage <fil> | read pre-Makeped pedigree file |
read ppd <fil> | read post-Makeped pedigree file |
read cases <fil> | read nonpedigree data file |
set sex marker <mar> | Sex informative marker |
set twin <twin> | MZ twin indicator |
set liability <tra> <liab> <nlev> | declare liability classes for trait |
set skiplines <slines> | Skip lines at head of pedigree file |
order <loc1>.[to]..<locN> [$(m | x | h | q | a)[rm]] | reorder loci |
set map <pos1>...<posN> | set marker map |
set distances <dis12> <dis23>...<disN-1N> | set marker map |
read map <fil> | read map, guessing format |
set frequencies <mar> [<frq1>...<frqN>] | set allele frequencies for a marker |
run | process pedigree data |
Subsetting Data | |
keep | drop <loc1>.[to]..<locN> [$m | x | h | q | a] | keep/drop useful/less loci |
keep | drop where (mon|max|num|dis|eve|pos|hwe|<str> | keep/drop useful/less loci |
undrop <loc1>.[to]..<locN> [$m | x | h | q | a] | return loci to analysis |
undrop where <str> | return loci to analysis |
select [con | exa <npro> whe] <expr> | select pedigrees on criterion |
select pedigree | id [not [in]] <ped1>...<pedN> | select on name |
unselect [Nth] | return pedigrees to analysis (from Nth last selection) |
pack [loc | ped] | delete dropped pedigrees and loci permanently |
Transforming Data | |
edit <ped> <per> | all <loc> [to] <val1> [<val2>] | edit data |
delete <ped> <per> | all | set all data missing for person or |
delete [<loc1>...<locN>] whe <expr> | set selected data missing |
get <rel> <sum> <tra> [<new>] | get/save relatives trait value summary |
recode [<mar> | $(m | x)] [fre] | recode alleles to 1..n by size/freq |
recode [<mar> | $(m | x)] [let | num] | recode nt alleles to/from numbers |
recode <loc> <val1>...<valN> to <new> | recode old values to new |
combine <mar1>...<marN> [<thr>] | recode rare alleles |
flip <mar> | recode SNP nucleotides to complementary strand |
date (<yyyymmdd> jul) | (<num> gre) | julian date conversion |
date [<tra>] [jul | gre | yea] | julian date conversion |
standardize <loc> [fam] | standardize trait value |
adjust | Note: superceded by residuals command. linear regress adjust |
residuals <tra> on <loc1>...<locN> [com] | linear regress resid |
predict <tra> on <loc1>...<locN> [com] | linear regress predicted |
impute <tra> on <loc1>...<locN> [com] | linear regression imputation |
impute <tra> | familial imputation (esp age) |
kaplan-meier <tra> <cen> [res] | survivor function estimate |
rank <tra> <rank> | rank of trait value |
blom <tra> <blom_score> | inverse normal transform |
simulate <mar> [<linked_to>] [<Nall> | <frq1>.<frqN>] | simulate a trait |
simulate <tra> [<h2>] [<linked_to>] | simulate a trait |
simulate pedigrees [<nped> [<ngen> [<min_kids> [<max_kids> ]]]] | simulate pedigrees |
Transforming Pedigrees | |
nuclear [<maxsibs>] [gra] | convert to (trimmed) nuclear families |
subpedigrees | divide into subpedigrees (if compound) |
join <ped1> [...<pedN>] | join up pedigrees by shared IDs |
prune <tra> [c_op <thr>] | prune unaffecteds |
cases <tra> | divide into unrelated cases |
unique_id [seq] | generate numerical IDs |
Outputting Data | |
print ped <ped1>...<pedN> [id <id1>...<idN>] | print data |
write [gas] [<fil>] | write Sib-pair pedigree file |
head | tail [<nrec>] | print head or tail of pedigree file |
write pap | write to PAP |
write var | write MENDEL var file |
write locus pap | write PAP locus file |
write arlequin [par | all] <fil> | write Arlequin data file |
write asp | tcl [dum] <fil> | write Aspex pedigree file |
write beagle <fil> [fou] | write Beagle type data file |
write crimap <fil> | write Cri-map pedigree file |
write csv <fil> [phe] [nop] | write CSV type file |
write dot [<tra> [<gen>]] <fil> | write Dot graph file drawing pedigree |
write fisher <fil> | write FISHER pedigree file |
write gda <fil> | write GDA pedigree file |
write gh [dum] <fil> | write Genehunter pedigree file |
write haploview [dum] <fil> | write Haploview pedigree file |
write linkage | pre [dum] <fil> | write Linkage pre-Makeped pedigree file |
write mendel <fil> [tra] | write MENDEL pedigree file |
write merlin [dum] <fil> | write Merlin pedigree file |
write mim <fil> | write MIM pedigree file |
write morgan <fil> | write MORGAN pedigree file |
write pap | write PAP trip.dat and phen.dat |
write ped <fil> | write Sib-pair/GAS pedigree file |
write phe <fil> | write FBAT/Sibs type phenotype file |
write ppd [dum] <fil> | write Linkage post-Makeped pedigree file |
write ram <tra> | write LDL_rams ped and dat files |
write sage <fil> | write SAGE pedigree file |
write solar <fil> [phe] [nop] | write Solar type pedigree file |
write structure <fil> [fou] | write Structure type data file |
write locus asp | tcl <fil> | write ASPEX locus file |
write locus eclipse <fil> | write Eclipse data file |
write locus fisher <fil> | write FISHER locus file |
write locus gas <fil> | write GAS locus file |
write locus loki <fil> [<pedfil>] | write Loki "prep" control file |
write locus gh <fil> [dum] [xli] | write Genehunter locus file |
write locus haploview <fil> | write Haploview info file |
write locus linkage <fil> [dum] [xli] | write Linkage locus file |
write locus mendel <fil> [tra] | write MENDEL locus file |
write locus merlin <fil> | write Merlin locus file |
write locus morgan <fil> | write MORGAN locus file |
write locus pap | write PAP header.dat and popln.dat |
write locus relpair <fil> | write RELPAIR locus file |
write locus sage <fil> | write SAGE locus file |
write locus sibpair <fil> [<pedfil>] | write Sib-pair script |
write locus structure <pedfil> [<locfil>] | write Structure mainparam |
write locus superlink <fil> [dum] [xli] | write Superlink locus file |
write map mendel <fil> | write MENDEL map file |
write map loki <fil> | write Loki parameter file |
write map mendel <fil> | write MENDEL map file |
write map merlin <fil> | write Merlin map file |
write map solar <fil> | write Solar map file |
write var <fil> | write MENDEL var file |
Analysing pedigrees | |
generations [<qua_tra> [rev]] | summarize pedigree(s) and (save) generations |
relatives <ped> <id> | show immediate relatives of index |
ancestors <tra> [c_op <thr>] | common ancestor of most probands |
Checking data | |
test dob <tra> [gre] [<thr>] | Check consistency of DOBs |
test age <tra> [<thr>] | Check consistency of ages |
test sex | Check sex using markers |
test haploid | Check Mendel errors |
Analysing data | |
frequencies | describe [snp | <loc1>..<locN>] | descriptive statistics |
mcf <loc> | $m | MCEM allele frequencies |
count [whe] <expr> | count where expression true |
print [whe] <expr> | print data where expression true |
mean | correlation [<loc1>..<locN>] | phenotypic means and correlations |
tab <tr1> [<tr2>...<trN>] | contingency table |
llm <tr1> ... [<tr1> * <tr2>...] [-1] | log-linear model of a contingency table |
kruskal-wallis <qua tra> <loc> | Kruskall-Wallis test |
regress <qua tra> on <loc1>.[to]..<locN> | linear regression |
regress <bin tra> on <loc1>.[to]..<locN> [off <off>] [sim] | logistic regression |
regress <tra> on <loc1>.[to]..<locN> [off <off>] poisson [sim] | poisson regression |
regress <tra> ... [off <off>] [(exponential | weibull <cens>)] [shape <sha>] [sim] | survival regresssion |
mixture <qua tra> [[<num>] [nor | poo | exp | poi]] | test admixture |
lifetable <sta> <end> <cen> [<wid1> [<wid2>]] [time] [cov <cov>] | life table |
Analysing genetic data | |
gpe <mar> [<dose>] | Estimate genotype probabilities or expected gene dose |
segregation <mar> [unp] | Tabulate marker segregation |
haplotypes <mar1> <mar2> <new> [<thr>] | haplotypes for SNPs in complete LD |
triads | show triad-phaseable haplotypes |
hwe [fou] [<mar1> ..[to].. <mar1>] [$(m | x)] | test HWE |
mztwin [zyg <co> [<thr>]] [clean | delete] | MZ pair genotype discordance | drop one member |
summary [<N_tests> | plot [<fil>]] | summarize last genetic test |
Genetic equilibria | |
homoz <tra> [<c_op> <thr>] | marker homozygosity |
multihomoz <tra> [<c_op> <thr>] | multipoint homozygosity |
fstats <tra> [fou] | Population genetic F-statistics |
dis [[<marker locus 1>] <marker locus 2>] | intragametic association |
Kinship | |
kinship [pai | inb] [<tra> [c_op <thr>]] | kinship/inbreeding coefs |
ibd <loc> [pai] | IBD matrix for marker |
ibs <loc> [pai] | IBS matrix for marker |
hbd <loc> [<coe>] | homozygosity-by-descent at marker |
cksib | sib pair ibs sharing at multiple markers |
share [pai] | rel pair ibs sharing at multiple markers |
Variance components and segregation analysis | |
davie <tra> <pro> | segregation ratios under ascertainment |
twin <tra>[<c_op> <thr>] | analyse twin correlations/concordances |
varcomp <tra> [[a][c][d]e] [cov <var1>..+ <varN>] | Variance Components trait analysis |
lrt | compare last 2 models fitted (mix/VC/GLM/GLMM) |
blup <tra> <h2> | BLUP for AE variance components model |
fpm <tra> [<c_op> <thr>] [nqtl <nqtl>] [p] [a] [d] [g] [c] [s] | MCMC mixed/SML/finite polygenic |
fpm <tra> ... [(p | g | a | c | s)va | AA | AB | BB | mu | var <val>] | MCMC fpm start values |
fpm <tra> ... [fixed p | a | c | d | e | g | m | mu | s | var | MCMC fpm fixed pars |
fpm <tra> ... [lin logit | probit | ln | mft] [lik gau | bin | poi] [cov <var1> [+ <var2>...] | MCMC fpm |
Association | |
assoc <tra> [<c_op> <thr> | cat] [fou] [gen] | allelic/genotypic association with a trait |
tdt <tra> [<c_op> <thr>] [pat | mat] | several TDTs |
hrr <tra> [<c_op> <thr>] | Haplotype Relative Risk |
schaid <tra> <mar> [<all>] | Schaid & Sommer HWE test |
Linkage | |
asp <tra> [<c_op> <thr>] | affected sib-pair IBS and IBD linkage analysis |
penrose <loc1> <loc2> | Penrose sib-pair linkage |
apm <tra> [<c_op> <thr>] [ibd | ibs] | IBS or IBD NPL analysis |
sibpair <tra> [<wei>] [sim] [cor <r> [mea <m>] [sd | var <v>]] | regression-based QTL linkage |
twopair <tra> <loc1> <loc2> <theta> | two-point Haseman-Elston |
qtlpair <tra> [full [cqe] [cov <var1>..+<varN>]] | sibs or pedigree VC linkage |
varcomp <tra> [aqe <mar1>..+ <marN>] [cov <var1>..+<varN>] | Variance Components linkage analysis |
linkage [<loc1> [<loc2>]] | intermarker sib pair linkage |