The existence of multiple modes in the frequency distributions of continuous traits in the population may often represent the action of major genes controlling that phenotype. In population samples, maximum likelihood (ML) methods (via the EM algorithm) can be used to fit mixtures of distributions, the means of each being the genotypic means. The frequencies (mixing proportions) associated with each distribution can be tested for Hardy-Weinberg proportions. In family data, such models allow values to be scored as discrete types, so that segregation analysis can be performed.
The major disadvantages of this approach are (1) that many biological distributions are skewed (often lognormal), and will support the inappropriate fitting of mixtures of normals; (2) if genotypic means are too close, the resulting distribution cannot be easily differentiated from a single normal distribution; (3) the likelihood ratios comparing mixtures of different numbers of distributions are not asymptotically chi-square [McLachlan & Bashford 1988], though Thode et al [1988] have published empirical percentage points for mixtures of one versus two normal distributions. The comparison of the likelihoods of mixtures of normals versus those of alternative single distributions such as the lognormal is also not straightforward. Schork and Schork [1989] have used bootstrapping to discriminate among alternative hypotheses in this case.
Sir Francis Galton founded the discipline of biometrics in the late 19th century. In collaboration with Karl Pearson, he developed mathematical tools for the description of resemblance between relatives that gave rise to much of modern statistics. The correlation coefficient was derived as a measure of the similarity observed among relatives for continuous traits such as height. Examination of plots of height of sons versus the (sex-adjusted) midparent height led to the "law of filial regression", simply regression to the mean, while his hypothesised mode of genetic transmission by blending of parental characteristics was unworkable (save for selfing). The rediscovery of Mendel's laws for the transmission of discrete traits moved the interests of geneticists temporarily away from continuous traits, but by 1910, a number of examples of continuous traits under the control of multiple loci had been worked out [Wright 1967]. With Fisher's and Wright's work, beginning with Fisher's seminal 1918 paper, biometrical concepts slowly became central to animal and plant breeding, population genetics and evolutionary theory.
Resemblances between particular classes of relatives on continuous traits are usually expressed as covariances between the measured values of the trait, and by various extensions of this such as interclass and intraclass correlation coefficients. For most of the metrical traits studied in humans, these types of measures of linear interdependence are appropriate, though mathematical transformation of traits to approximate linearity are often needed. Another related application of transformations is to remove interactions such as gene environmental interaction, or epistasis (gene x gene interaction).
Intraclass and interclass correlations arise naturally from analysis of variance, and are very appropriate for genetic usage when there are no reasons to differentiate within a group of relatives eg sibship. These correlations can be defined [after Stuart & Ord 1991] for a population containing p classes containing k_{p} members in each class on which Y_{ij} is the trait value for the jth member of the ith class so that,
E(Y_{ij}) | = u | |
Var(Y_{ij}) | = s^{2} | |
Cov_{I}(Y_{ij},Y_{i'j'}) | = r_{I} s^{2} | i=i', j ne j' |
= 0 | i ne i' | |
Cov_{B}(Y_{ij},Y_{i'j'}) | = r_{ii'} s^{2} | i=i', j ne j' |
= 0 | i=i' |
where r_{I} is the intraclass correlation and r_{ii'} denotes the interclass correlation between ith and i'th group. For constant class size (as in seen in regular pedigrees eg parents and twin offspring),
The intraclass correlation then is the common correlation between members of the same class, and the interclass correlation represents the common correlation between members of two different classes of observations, where individuals are not differentiated amongst eg parent-offspring correlation when one or two parents and a variable number of offspring are present. A small literature has arisen regarding the estimation of interclass and intraclass correlations for classes of varying size. Approaches can be characterised as either fully maximum likelihood (usually assuming multivariate normality) in nature [Rosner 1982; Bener & Huda, 1987; Shoukri & Ward, 1990], in which case numerical methods are required, and various simpler analytic approximations [Eliasziw & Donner 1991; Velu & Rao 1990]. The ML approach of Lange in FISHER [Lange 1976] is fully general and can be applied to this problem.
Fisher [1918] derived the expected covariances between pairs of relatives for (continuous) traits. Since these results hinge on ordinary least squares decomposition of variance rather than particular likelihood functions, I think they will hold for continuous distributions other than the (multivariate) Gaussian most commonly assumed. Lange [1989], for example, has used the multivariate t distribution for "robust" ML estimation. I will refer the interested reader to Kempthorne [1969] for a derivation of this result for the case of a monogenic trait, where environmental and genetic effects interact additively. A gene A of g alleles is present in a population with genotypic array,
where p_{i} is the frequency of the ith allele A_{i}. The genotypic deviation for A_{i}A_{j} of the trait Y around the population mean is y_{ij}. The total genotypic variance is then,
and is decomposed into additive and dominant genetic components by solving for the allelic values a_{m} for each allele A_{m} by minimising,
which gives the solution,
and so,
V_{A} | = 2 SUM(i=1..g) p_{m} [SUM(j=1..g) p_{i}y_{im]2 } |
V_{D} | = V_{T} - V_{A} |
This decomposition partitions the trait value for an individual Y into,
where d_{ij} is the dominance deviation for the genotype A_{i}A_{j}, and e is the (random) environmental deviation. The covariance between trait values for two noninbred individuals of genotypes A_{i}A_{j} and A_{k}A_{l} respectively is,
Cov(Y_{1},Y_{2}) | = E[ (a_{i}+a_{j}+d_{ij}) (a_{k}+a_{l}+d_{kl})] |
= E[a_{i}a_{k}] + E[a_{i}a_{l}] + E[a_{j}a_{k}] + E[a_{j}a_{l}] + E[d_{ij}d_{kl}] | |
= R V_{A} + K V_{D} |
where R - the coefficient of relationship - is the probability that an allele present at locus A in both members of the pair is of the same ancestral origin (identical by descent), and K - the coefficient of identity - is the probability that both alleles present at locus A in both members of the pair are identical by descent (IBD). This is because,
as the covariance between the additive deviations of the two individuals at these alleles only arise in the R/2 proportion of cases where the genes are the same. In a similar fashion, covariance between dominance deviations only arises in the K proportion of cases where both alleles at the locus in each individual are identical by descent,
The term epistasis refers to nonadditive interactions between the effects of multiple different loci on a trait. In the polygenic case, in the absence of epistasis and linkage, the result just derived holds, because covariation between the effects of each locus is zero. If epistasis, but no linkage is present, Cockerham [1954] presented a further extension of the components of variance model. This involves the factorial terms for the interactions between additive and dominance deviations up to the order of the total number of loci controlling the trait. This gives an expression for the genotypic variance of,
V_{G} | = V_{A} + V_{D} + V_{AA} + V_{AD} + V_{DD}... |
= SUM(r=1..n) SUM(s, where r+s>0) V_{r*A s*D} |
and the covariance between pairs of relatives is,
Cov(Y_{1},Y_{2}) | = R V_{A} + K V_{D} + R^{2}V_{AA}+ RKV_{AD} + K^{2}V_{DD}... |
= SUM(r=1..n) SUM(s, where r+s>0) R^{r}K^{s}V_{r*A s*D} |
In the presence of linkage, the expressions become more complex. For linkage disequilibrium, Ewens [1979] states that those for covariance between relatives for two linked loci contain "upwards of a hundred terms". For two linked loci in linkage equilibrium, the full sib covariance was derived by Cockerham [Ewens 1979] as,
where c is the recombination fraction. Values of R and K are tabulated for various classes of relatives below.
Table 8.1 Coefficients of relationship (R) and identity (K) for relative pairs
commonly utilised in genetic studies.
Type of Relative Pair | R | K |
Monozygotic Twins | 1 | 1 |
Siblings/Dizygotic Twins | ||
Half-Siblings | 0 | |
Parent-Offspring | 0 | |
Unilineal kth degree relative pair | 1/2^{k} | 0 |
Uncle-Nephew | 0 | |
Cousins | 0 |
Path analysis is a method of evaluating systems of regression equations. These can include both latent (unobserved) and observed variables, which can appear (repeatedly in the same or different equations) both as independent and dependent variables. As a result, a number of techniques usually thought of as separate can be treated in a uniform framework including multivariate regression, analysis of (co)variance, factor analysis, and time series analysis. As originally formulated by Wright, path diagrams describing the causation and association among measured variables were used to formulate the algebraic expressions for the linear models. The observed correlations were fitted to by these models using ordinary least squares. Li [1975] gives one of the best introductions to the method as practiced by Wright. Subsequently, structural equation modelling was developed independently, where covariances and raw regression coefficients were modelled in preference, and numerically intensive maximum likelihood and weighted least squares methods of model fitting have come to prominence.
Path analysis is a very general approach, but Sewall Wright originally developed it for use in genetical problems at all levels, though his earliest applications included econometric time series modelling, factor analysis of bone measurements in rabbits, multiple regression analysis of litter size and genetic modelling of coat colour in guineas pigs.
In a path diagram, variables are represented as either being (partly or wholly) determined by other variables (where single headed arrows run from the independent variable to the dependent variable), which in turn may be determined by other (or even the same) variables, back to ultimate variables, between which correlations (induced by more distant but unobserved variables) are represented by double headed arrows. Usually, observed variables are enclosed in a rectangle, and unobserved variables by circles. The single headed arrows represent linear regressive relationships. Variables must be unitary in nature, a point Wright [1968] emphasises. Path coefficients are correlation coefficients or regression coefficients, one of which is associated with every double or single headed arrows respectively.
Using Wright's rules for evaluating a path diagram, one can write the regression equations involving all the variables depicted. The rules arise from the standard theory of general linear models and are (since the variables are standardised in this case):
(1) Expected correlations between any two variables are the products of chains of paths (coefficients) formed between the two variables that move "up" one or more arrows (in the opposite direction to the direction of the arrowhead), across no more than one "sling" (correlation) and then down along other arrows to the second variable.
(2) Each variable may only be traversed once in a given chain.
(3) Variances (self-correlations) may repeatedly traverse a directly adjacent arrow no more than once.
If identified, the resulting system of equations allows the path coefficients to be estimated from the observed correlation matrix for the observed variables.
A genetic path model is that fitted by Wright [1920, 1968] to probit transformed proportion of coat coloured white in the guinea pig (Figure 8.2). Similar models are currently used for traits in human families. Data was obtained from two samples, one panmictic, the other markedly inbred.
In this model, M and F are the parental phenotypes, G_{1}..G_{4} the gametes giving rise to the offspring O_{1} and O_{2}. In the randomly bred sample, a and b are 1/2. In the inbred sample (F1), a=0.5 and b=1. Environmental agents affecting this trait (C) that act on all members of a sibship (or a family) equally such as availability of food, or maternal (intrauterine) effects, also give rise to phenotypic correlations among sibs. Either by tracing a path diagram, or by the results of the last section, one obtains the expectations under the model,
Cov_{po} | = V_{A}/2 |
Cov_{sib} | = V_{A}/2 + V_{C} |
The use of experimental animals was very powerful, in that the breeding program had suggested nonadditivity (dominance and epistasis) was unimportant in this trait, and the randomly bred and inbred samples gave consistent parameter estimates, as inbreeding to fixation should remove all genetic variance (in the randomly bred sample r_{po}=0.21, r_{sib}=0.21; in inbred sample r_{po}=0.01, r_{sib}=0.07), but effects of shared and unshared environment should remain constant.
Similar models have been fitted to human nuclear family data for a number of continuous phenotypes, especially by groups led by Newton Morton and later DC Rao [Cloninger 1979; Cloninger 1980; Morton et al 1983]. The models have been extended to allow for marital assortment by social and phenotypic homogamy, environmental influences shared by parents and children, and by children only (generational effects). I will discuss this model in detail, because although sophisticated, it does have shortcomings that can be overcome in other study designs, such as the adoption or twin-family study.
This is a linear additive model that includes variables and covariances representing these various modes of familial transmission. In Figure 8.3 [Rao et al 1984], P_{M} and P_{F} are the maternal and paternal trait values, P_{1} and P_{2} the representative offspring (allowing a sib-sib intraclass correlation), C represents (summed) environmental influences on the trait, G the additive genotypic (breeding) value, H social homogamy, and I, the environmental index.
If environmental transmission of a trait can occur from parents to offspring ("cultural" transmission), then this is confounded with genetic transmission in this model, unless the environmental factors responsible have been measured. To simplify evaluation of the model, Morton [1974] suggested that such factors could be represented by a single (unreliable) environmental index that is defined as the predicted outcome of a multiple regression based on a number of measured putative risk factors. In the diagram, the index I can be correlated with the individual's genotype G, though in practice this is not identifiable in this design, and so is not estimated. The criticism has been made that the "environmental" risk factors regressed out in this way are in fact often genetically correlated with the trait being analysed. For example Rao et al [1984] regress out the effects of body-mass index on blood lipids, and so any genetic effects found do not include genes affecting both weight and lipids, such as those underlying "Syndrome X" [Reaven et al, 1984]. Multivariate genetic analyses get around this problem to some extent.
A second criticism is that often the environmental factors relevant to cultural transmission have not been measured, or measured imperfectly, so that residual environmental covariation inflates the estimates of heritability. Other types of family study are needed to confirm the results of this type of analysis. I will take up this point later.
Marital assortment on phenotype leads to covariance between parental genotypes and between genotype and shared environment. This under the standard path analytic formulation leads to nonlinear constraints on the covariances between ultimate variables. An extension of the path diagram by Cloninger [1980] is to include a copath between parental phenotypes - a "sling" that can be traced in either direction. This allows a simple representation of these constraints.
If relative pairs are mutually independent or dependent in a regular fashion, such as the nuclear family depicted in Figure 8.3 (fixed number of parents and offspring), then model fitting is relatively straightforward. For irregular pedigrees, more numerically intensive ML methods must be used. For a (univariate) trait under the control of polygenes (no epistasis), we can write for a given pedigree size n, under the assumption of multivariate normality [eg Lange et al 1976],
y | = u1 + a + d + e |
S | = V_{A}A + V_{D}D + V_{E}I |
LogL | = -0.5 [ ln |S| + (y-u1)' (S)^{-1} (y-u1)] |
where,
y is the n x 1 vector of trait values for the pedigree;
u, the population mean for the trait;
1, the n x 1 unit vector;
a and d the n1 vectors of additive and dominance deviations;
S, the covariance matrix for y;
A, the n x n numerator relationship or kinship matrix, the i,jth element of which is the coefficient of relationship for the ith and jth relative pair;
D, the n x n "delta_{7}" matrix, the i,jth element of which is the coefficient of fraternity for the ith and jth relative pair;
I is the n x n identity matrix;
LogL, the log likelihood under y~N(0,V_{T}).
These equations can be further extended to the addition of terms for fixed effects (ie individual specific m's given measured covariates). The likelihood has been maximised numerically using the EM algorithm [Ott 1979; Thompson & Shaw 1990], Fisher scoring [Lange et al 1976], Newton-Raphson, quasi-Newton, and Nelder-Mead simplex methods [Thomas 1991], as well as by recursive regressive adjustment of trait values using those of relatives "lower down" in the pedigree ("upward" peeling) - the Elston-Stewart algorithm [Elston & Stewart 1971; Canning et al 1978; Ott 1979; Elston et al 1992]. The EM and the Elston-Stewart approaches avoid inverting the sometimes sizeable matrix. Another approach is minimum variance quadratic unbiased estimation [Henderson 1985], where genetic deviation scores for each individual are first estimated, and the components of variance subsequently solved for.
The extension of all the methods just discussed to multiple phenotypes is relatively straightforward. Martin et al [1978] showed how genetic factor analyses could be performed, parameterised as the LISREL model, and multivariate tracing rules have been devised [Vogler 1985]. Four sources of variation per trait are possible in such models, if epistasis, marital assortment, gene-environment interaction and other complications are absent. In nuclear family or classical twin study data, usually only three of these (additive genetic, dominant genetic or shared environment, and unshared environment) will be estimable per trait [eg Ott 1979].
A simple multitrait model is the "latent phenotype" model, where the observed traits are intercorrelated through an underlying unobserved trait - a phenotypic correlation. The covariance between relatives for the latent trait and residual covariances between the observed traits can be decomposed as before. This is a natural model for the case of atopy and atopic diseases.
An alternative is the "biometrical factor" model where covariance between the traits is due to underlying common genetic and environmental factors. One useful form of this has one common additive, one common nonadditive, one common shared environmental and one unshared environmental factor explaining covariance between traits, and a "triple" decomposition of the same trait residual covariances between the relatives (Figure 8.5). This model with four common factors is identifiable, except in the case that the latent phenotype model will fit the data [Duffy unpublished].
The "saturated" factor model (with three sources of variation per trait) requires constraints to be estimable. One set of constraints in current use is a triangular decomposition of the factor loadings (an "ACE cholesky decomposition"). Neale and Cardon [1992] describe other more complex models covering multivariate longitudinal observations, marital assortment, assessment of the direction of causation between traits in cross-sectional data, and sample ascertainment.
The aim of segregation analysis is to detect and estimate the effects of (an) individual gene(s) on a trait within a sample of families. Specifically, one estimates the number of alleles of the gene that affect the phenotypic value, the frequencies of the alleles, and the relationship between genotype and phenotype, most usually the mean trait value associated with each genotype. In the methods discussed in the last section, information is pairwise, as the covariance between genotypes of all pairs of a given degree of relationship is treated as constant (its expectation). By contrast, in segregation analysis, the covariance between genotypes for a given pair is estimated using not only the degree of relationship, but the phenotypic values of other relatives in that kindred. In human genetics, if no relevant genotypes are measured (either genes directly affecting the trait, or genes linked to the locus of interest - the latter giving information on IBD distribution), usually only the presence and characteristics of a single loci of major effect (megaphenic - explaining more than 10% of the variance of the trait) can be detected and estimated. If polygenic effects are present, these are modelled in conventional families as additive genetic (co)variances residual to that of the "single major" locus - "mixed model" segregation analysis. The "mixture" is the mixture of discrete and continuous genotypes, rather than the mixture of random and fixed effects this term usually denotes.
After the last equation, if the major gene A has g alleles, the trait value for an individual is partitioned as,
where dominance and epistasis are assumed to be neglible, and g_{ij} is an unknown fixed effect, the trait mean due to genotype A_{i}A_{j}. Within a pedigree, the distribution of y is now a mixture of multivariate normals (one distribution per discrete genotypic mean). The likelihood is then,
where the summation is over all "legal" genotypic arrangements for that pedigree [Ott 1979], g is the n x 1 vector of genotypic means, f(y|g) are multivariate normal densities, and P(g) are the mixing proportions. The term f(y|g) is the equivalent of the penetrances for a discrete phenotype, and allows the polygenic background to be separated out from the major locus in this equation.
The possible mixing proportions P(g) for founders, that is individuals in the pedigree whose parents are not in the pedigree, are determined by the population gene frequencies. Genotypic probabilities for other pedigree members are conditional on the Mendelian laws and the genotypes of their parents. This former information is given by transition probabilities T(g_{off},g_{fa},g_{mo}), where g_{off},g_{fa},g_{mo} are the genotypes of the offspring, father and mother respectively. These are usually stated in terms of transmission probabilities, the probability that genotype g_{fa} or g_{mo} (undifferentiated in the case of autosomal inheritance) will transmit a given allele A_{i}. In the case of two alleles (three genotypes), the Mendelian transmission probabilities for A_{1} are t_{11}=1, t_{12}=0.5, and t_{22}=0.
To exactly evaluate the likelihood of a given set of genetic parameters - population gene frequency, genotype means, additive genetic and unshared environmental variance components - one can enumerate all legal genotypic vectors, assess the likelihood as per the above equation for each such permutation and sum across these likelihoods, weighted by the probability of the genotype set or configuration [Ott 1979; Hasstedt 1991]:
where G is the number of genotypes, nf the number of founders, and a and g the vectors of (additive) polygenotypic and major locus genotypic values respectively. As one might expect, evaluation of this expression is not practicable for G>3, and n over approximately 10.
One alternative procedure is to again use the Elston-Stewart algorithm - which still requires the use of approximate formulae to hold down computational load, as in the program PAP [Hasstedt 1982, 1991]. Another is to use Monte Carlo procedures [Ott 1979], and a third is due to Morton and McLean [1974], where quadrature is applied (POINTER). In the "regressive" segregation models of Bonney et al [1984, 1986], as implemented in the program REGC included in the package SAGE, the correlations residual to the effects of the major gene are not explicitly decomposed, the univariate normal density for the phenotypic value for a member of the pedigree is conditioned on the phenotypic and genotypic values of the parents and the phenotypic values of the older siblings, and is evaluated by quadrature, again using approximations [Demenais 1990; Sorant et al 1991]. Most recently, Monte Carlo Bayesian approaches, such as the Metropolis algorithm or Gibbs sampling have been applied.
Likelihood ratios comparing models are asymptotically chi-square in distribution, as opposed to the case for commingling analysis of unrelated individuals. The Mendelian models, where 's are fixed to Mendelian expectations, and the environmental transmission models, where are all equal, are compared to a base model where all 's are estimated. An amusing example of this is McGuffin et al [1991]. If the Mendelian models are accepted, then setting the polygenic component to zero will be tested. Testing equality of "adjacent" genotypic means - dominance or recessivity, or additivity of allelic values are further refinements. There are very few examples of models containing more than three genotypes, though the presence of multiple likelihood maxima can suggest that more genotypic means need to be included.
Multivariate (ie multitrait) formulations of the multivariate normal [Lange et al 1985; Thompson et al 1992] and of mixed model segregation analysis have been formulated and occasionally applied, but are computationally even more demanding. An example of bivariate segregation analysis was discussed earlier [Borecki et al 1985]. Multilocus segregation analysis, usually digenic, has also been applied.