SIB-PAIR LINKAGE ANALYSIS

Introduction

After looking at tests for allelic association, and then combinations of allelic association with linkage (the TDT), we shall return to linkage analysis in natural populations. Compared to the large numbers of offspring seen in experimental genetics, human families are relatively small. This means phase is harder to evaluate. Also, matings are relatively random, so only a proportion of families in the population are informative for linkage analysis.

Codominant marker loci and the direct method

One way to work out the phase of a mating is to genotype three generations of a family. Then, if there are enough doubly heterozygous parents, one can easily count up the recombination events.

Genotypes at D12S379 and D12S95 in an Amish family

D12S379         205 209   193 201        197 209   201 209
D12S95          146 152   146 158        146 158   156 158
                   1         2              3         4
                   |         |              |         |
                   +----+----+              +----+----+
                        |                        |
                     193 205                  197 201
                     158 146                  146 156
                        5                        6
                        |                        |
                        +-----------+------------+
                                    |
    +--------+--------+--------+----+----+--------+--------+--------+
    |        |        |        |         |        |        |        |
 193 201  197 205  193 201  193 197   193 197  197 205 201 205  201 205
 158 156  146 146  158 156  158 146   158 146  146 146 156 146  146 146
    7        8        9        10        11      12       13       14  
  NR  NR   NR  NR   NR  NR   NR  NR    NR  NR   NR  NR  NR  NR   NR   R

The grandparental data allows us to work out that the four gametes that gave rise to the parents 5 and 6 were {205,146} from individual 1, {193,158} from 2, {197,146} from 3, and {201,156} from 4. This allows us to score the children as to whether these haplotypes have been broken up by a recombination event or not. Our estimate of the recombination distance between these loci from this family is c= 1/16 = 0.0625. Because there are so few observations, the 95% confidence interval is wide, from 0.002 to 0.302. Actually, D12S379 and D12S95 are approximately 6 cM apart.
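The confidence interval quoted above can be checked with an exact (Clopper-Pearson) binomial interval. The following is a minimal sketch, assuming that this is the kind of interval intended; the function names `binom_cdf` and `exact_ci` are illustrative only.

```python
from math import comb


def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(k, n, alpha=0.05):
    """Exact two-sided interval for a proportion k/n, found by bisection."""

    def solve(f, target):
        # f is decreasing in p, so move the lower bound up while f > target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: Pr(X >= k) = alpha/2, i.e. Pr(X <= k-1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: Pr(X <= k) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# One recombinant among the 16 scored meioses of the Amish family.
lower, upper = exact_ci(1, 16)
print(round(lower, 3), round(upper, 3))
```

With one recombinant in 16 meioses, the limits agree with the interval given in the text to three decimal places.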

The smallest family useful for linkage

It is possible to estimate recombination distances using nuclear families only. This means the parental mating type is phase-unknown. There must be at least two offspring. For the backcross for two codominant markers AaBb x AABB,

Probabilities of different haplotypes in dihybrid testcross offspring for two unlinked codominant loci A and B

                        First Child
Second Child    AB      Ab      aB      ab      Tot
AB              1/16    1/16    1/16    1/16    1/4
Ab              1/16    1/16    1/16    1/16    1/4
aB              1/16    1/16    1/16    1/16    1/4
ab              1/16    1/16    1/16    1/16    1/4
Tot             1/4     1/4     1/4     1/4     1

To work out the form that deviations from this null hypothesis will take, we have to add up the probabilities over the two possible phases the heterozygote parent is in, either coupling AB/ab or repulsion Ab/aB. Assuming linkage equilibrium (D=0), these two genotypes are equally likely.

If the heterozygote parent genotype is in coupling, then a child with Ab or aB represents a recombinant, and AB or ab a nonrecombinant. In repulsion, the opposite holds. Therefore, the probability of a child being Ab or aB is,

Pr(Ab or aB)= Pr(Parent is AB/ab) . c + Pr(Parent is Ab/aB) . (1-c)
= 1/2 . c + 1/2 . (1-c) = 1/2.

The probability that both children are Ab or aB is,

Pr(Child 1=Ab or aB, Child 2=Ab or aB) = 1/2 . c^2 + 1/2 . (1-c)^2
= (c^2+(1-c)^2)/2.

The probability that the first child is Ab or aB, and the second child AB or ab is,

Pr(Child 1=Ab or aB, Child 2=AB or ab) = 1/2 . c(1-c) + 1/2 . c(1-c)
= c(1-c),
and so forth.

Probabilities in phase-unknown backcross mating for two codominant loci A and B separated by recombination distance c

                            First Child
Second Child    AB or ab                Ab or aB
AB or ab        (c^2+(1-c)^2)/2         c(1-c)
Ab or aB        c(1-c)                  (c^2+(1-c)^2)/2

We can estimate c using a number of such families,

Family type                                                  Number of families
Children same type (both in {AB,ab} or both in {Ab,aB})      O1
Children different types                                     O2

as,

c = 0.5 (1 - sqrt[(O1-O2)/(O1+O2)])

This follows because the probability that both children are of the same type is c^2+(1-c)^2, so that (O1-O2)/(O1+O2) estimates (1-2c)^2. If O2 exceeds O1, the quantity under the square root is negative, and we take 0.5 as our estimate. If we have phase-unknown backcross families with more than two offspring, or mixtures of different types of family, the formulae become increasingly complex. I will take up the likelihood based methods necessary for estimation of c in general pedigrees later.
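The estimator can be sketched in a few lines. This assumes O1 counts families whose two children are of the same type and O2 those of different types, so that Pr(same) = c^2+(1-c)^2 as in the table above; the function name `estimate_c` and the example counts are hypothetical.

```python
from math import sqrt


def estimate_c(o1, o2):
    """Estimate recombination fraction c from two-child phase-unknown families.

    o1: families with both children of the same type
    o2: families with children of different types
    """
    n = o1 + o2
    if n == 0:
        raise ValueError("need at least one family")
    excess = (o1 - o2) / n  # estimates (1 - 2c)^2
    if excess <= 0:
        return 0.5  # no evidence of linkage: truncate at c = 0.5
    return 0.5 * (1 - sqrt(excess))


print(estimate_c(80, 20))  # made-up counts, for illustration only
```

The truncation at 0.5 handles samples in which discordant families outnumber concordant ones, where the square root would otherwise be undefined.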

Non-parametric linkage analysis

If one of the loci of interest is not a simple Mendelian trait, then it becomes difficult to determine what the underlying genotypes are. One approach is to take penetrance and allele frequency information from other sources, and use those to estimate the probabilities of each genotype in each member of the pedigree. Another is to perform simple tests looking for effects of ascertainment on segregation of the codominant marker locus in the selected families.

The affected sib pair (ASP) method

This method is used where one trait A is dichotomous (affected or unaffected), with unknown penetrances and allele frequencies for the underlying trait locus. The other locus B is a codominant marker (ideally). In this case, we ascertain families with two affected children. For backcross matings, we obtain

Sibship type, Bb x BB mating      Frequency of each sibship type
Child 1     Child 2               Observed    Expected under null hypothesis
B           B                     O1          N/4
B           b                     O2          N/4
b           B                     O3          N/4
b           b                     O4          N/4
Total Number of Sibships          N           N

The null hypothesis is that there is no distortion of segregation due to linkage between the trait locus and the marker locus. As in the dihybrid testcross above, we can simplify this table to,

Sibship type                                Number of families
                                            Observed    Expected
Children same type (both B, or both b)      O1+O4       N/2
Children different types                    O2+O3       N/2
Total                                       N           N

Deviations in the expected counts from the null expectations occur when c<0.5. As we did for the TDT, we can work out theoretical expectations for particular values of c, penetrances (f2, f1, f0) and allele frequencies (trait PA and marker PB). Assuming both Hardy-Weinberg and linkage equilibrium,

Pr(Children same type) = 1/2 + (2c-1)^2 (4c(c-1)(VD-1)-1+2VA+3VD)/(16R+8VA+4VD)
Pr(Different) = 1/2 - (2c-1)^2 (-4c(c-1)(VD-1)+1+2VA+VD)/(16R+8VA+4VD)

where,

R = PA^2 f2 + 2PA(1-PA) f1 + (1-PA)^2 f0,
VA = 2PA(1-PA) (PA(f2-f1) + (1-PA)(f1-f0))^2,
VD = PA^2 (1-PA)^2 (f2-2f1+f0)^2.
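These three quantities are the prevalence and the additive and dominance variance components for a single biallelic trait locus under Hardy-Weinberg equilibrium. The sketch below assumes f2, f1 and f0 are the penetrances for genotypes carrying two, one and zero copies of allele A, and uses the standard decomposition in which VA + VD equals the total genotypic variance. The function name `trait_components` is illustrative.

```python
def trait_components(pa, f2, f1, f0):
    """Return (R, VA, VD) for trait allele frequency pa and penetrances f2, f1, f0."""
    q = 1.0 - pa
    # Population prevalence under Hardy-Weinberg proportions.
    r = pa**2 * f2 + 2 * pa * q * f1 + q**2 * f0
    # Additive variance: 2pq times the squared average effect of allele substitution.
    va = 2 * pa * q * (pa * (f2 - f1) + q * (f1 - f0)) ** 2
    # Dominance variance.
    vd = pa**2 * q**2 * (f2 - 2 * f1 + f0) ** 2
    return r, va, vd


# Fully penetrant recessive with allele frequency 0.1 (illustrative values).
r, va, vd = trait_components(0.1, 1.0, 0.0, 0.0)
print(r, va, vd)
```

For a fully penetrant trait the genotypic variance is R(1-R), which gives a quick check on the components.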

When c is 0.5, the second term disappears, giving the null expectations. If c is zero, then

Pr(Children same type)= 1/2 + (2VA+3VD-1)/(16R+8VA+4VD)
Pr(Different)=1/2 - (2VA+VD+1)/(16R+8VA+4VD).

In the case of a multiallelic marker, the test is exactly the same, the numbers for each heterozygous parent genotype still contributing to the sib pair being concordant or discordant at the marker.

Identity by descent and identity by state

In the above example, the heterozygous parent is informative for linkage analysis, in that we can determine whether each child received an allele from the same parental chromosome (or same grandparental gamete). The term identity by descent refers to this information. In the case of a homozygous parent, each child receives the same allele, but we do not know whether these came from the same grandparent. The term identical by state describes the situation where two relatives carry the same allele, regardless of whether it was inherited from a common ancestor or not.

In ambiguous cases, we will often calculate the probabilities of identity by descent. For example, if one parent is BB and the other bb, then the probability that both children carry the B allele is 100%. The probability that the B allele in one child is identical by descent with the B allele in the other child is 50%. Identity by descent, or ibd, is useful in that it extends to other types of relatives, and probabilities can be estimated when particular individuals (such as parents) are not typed at the marker.

Here are some examples. Returning to the Amish pedigree above, individuals 2 and 7 both carry a 193 and a 201 (repeat) allele at the D12S379 locus. Therefore they share two alleles identical by state (ibs). However, the 201 allele was not transmitted from grandparent 2 to grandchild 7, so they share only the 193 allele identical by descent (ibd). Grandchild 7 shares no alleles ibs or ibd with his/her grandparents 1 and 3 at D12S379.

For the first four siblings in the third generation, the ibd sharing is the same as the ibs sharing,

                 Individual 7    Individual 8    Individual 9    Individual 10
Individual 7     -               0%              100%            50%
Individual 8     0/2             -               0%              50%
Individual 9     2/2             0/2             -               50%
Individual 10    1/2             1/2             1/2             -
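The entries of this table can be reproduced by counting alleles shared identical by state, since in this family the four parental alleles at D12S379 are all different and ibs therefore equals ibd. A minimal sketch follows; the helper name `ibs_count` is hypothetical, and the genotypes are those shown in the pedigree above.

```python
from collections import Counter


def ibs_count(g1, g2):
    """Number of alleles (0, 1 or 2) two genotypes share identical by state."""
    # Multiset intersection, so that a/a versus a/b counts one shared allele.
    return sum((Counter(g1) & Counter(g2)).values())


# D12S379 genotypes of the first four siblings in the third generation.
sibs = {7: (193, 201), 8: (197, 205), 9: (193, 201), 10: (193, 197)}

for i in sibs:
    for j in sibs:
        if i < j:
            print(i, j, f"{ibs_count(sibs[i], sibs[j])}/2")
```

With less informative parents, the same count would overstate ibd sharing.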

This family is ideal, as both parents carry different heterozygous genotypes. More often, the parents will be less informative,

Identity by descent sharing for sib pairs, after Table II from Haseman and Elston (1972).

Mating Type   Sib pair              Population frequency*   ibd=0%   ibd=50%   ibd=100%   Mean ibd
aa x aa       aa, aa                a^4                     1/4      1/2       1/4        50%
aa x bb       ab, ab                2a^2b^2                 1/4      1/2       1/4        50%
aa x ab       aa, aa                a^3b                    0        1/2       1/2        75%
              aa, ab                2a^3b                   1/2      1/2       0          25%
              ab, ab                a^3b                    0        1/2       1/2        75%
aa x bc       ab, ab or ac, ac      a^2bc                   0        1/2       1/2        75%
              ab, ac                2a^2bc                  1/2      1/2       0          25%
ab x ab       aa, aa or bb, bb      a^2b^2/4                0        0         1          100%
              aa, bb                a^2b^2/2                1        0         0          0%
              aa, ab or bb, ab      a^2b^2                  0        1         0          50%
              ab, ab                a^2b^2                  1/2      0         1/2        50%
ab x ac       aa, aa                a^2bc/2                 0        0         1          100%
              aa, ab or aa, ac      a^2bc                   0        1         0          50%
              aa, bc                a^2bc                   1        0         0          0%
              ab, ab etc            a^2bc/2                 0        0         1          100%
              ab, ac                a^2bc                   1        0         0          0%
              ab, bc                a^2bc                   0        1         0          50%
              ac, bc                a^2bc                   0        1         0          50%
ab x cd       ac, ac etc            abcd/2                  0        0         1          100%
              ac, ad etc            abcd                    0        1         0          50%
              ac, bd or ad, bc      abcd                    1        0         0          0%

* Population frequency of that type of family in the population assuming random mating and HWE. Each letter represents the population frequency of that allele in the general population.

Faraway's improved (UMP) affected sib pair linkage test

We can therefore calculate ibd sharing for a sib-pair, or indeed any other kind of relative pair. If there is no inbreeding in the families sampled, the only kind of relative pair that can share more than one allele ibd (50% ibd sharing) is the sib pair (and MZ twins, but these contain no linkage information).

Using ibd sharing as the measure of similarity, there are actually three simple chi-square tests suggested for affected sib pair data in the following table.

                  Identity by descent allele sharing      Total
                  ibd=100%     ibd=50%     ibd=0%
Observed Count    O2           O1          O0             N
Expected Count    N/4          N/2         N/4            N

Note that there are "fractional" contributions to these observed counts from less informative families. For example, an ASP with genotypes a/a and a/b arising from the backcross a/a x a/b mating will contribute one-half of an observation to the ibd=0 cell, and one-half to the ibd=50% cell.

We have already seen the overall best simple test, which is usually called the "mean" test,

Mean test = 2/N (2O2+O1-N)^2

The other tests are superior only if the trait has a particular mode of inheritance, such as a simple Mendelian recessive. The two-degree-of-freedom "genotypic" test is,

X^2 (2 d.f.) = [O2 - N/4]^2 / [N/4] + [O1 - N/2]^2 / [N/2] + [O0 - N/4]^2 / [N/4]

and the "two-allele" test is simply,

X^2 (1 d.f.) = [O2 - N/4]^2 / [3N/16]

where 3N/16 is the binomial variance of O2 under the null hypothesis.
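The mean test and the two-degree-of-freedom genotypic test are simple enough to compute directly. The sketch below assumes the observed counts may be fractional, as described above for less informative families; the function names and the example counts are invented for illustration. The chi-square tail areas use the closed forms available for one and two degrees of freedom.

```python
from math import erfc, exp, sqrt


def mean_test(o2, o1, o0):
    """Mean test: (alleles shared - N)^2 / (N/2), chi-square on 1 d.f."""
    n = o2 + o1 + o0
    return 2.0 / n * (2 * o2 + o1 - n) ** 2


def genotypic_test(o2, o1, o0):
    """Goodness of fit to the 1/4 : 1/2 : 1/4 null proportions, 2 d.f."""
    n = o2 + o1 + o0
    observed = (o2, o1, o0)
    expected = (n / 4, n / 2, n / 4)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_1df(x):
    """Upper tail area of chi-square with 1 d.f. (two-sided)."""
    return erfc(sqrt(x / 2))


def chi2_sf_2df(x):
    """Upper tail area of chi-square with 2 d.f."""
    return exp(-x / 2)


# Made-up counts: 40 pairs ibd=100%, 50 pairs ibd=50%, 10 pairs ibd=0%.
print(mean_test(40, 50, 10), genotypic_test(40, 50, 10))
```

Because linkage can only increase sharing, the mean test is usually applied one-sided, which halves the two-sided tail area when sharing exceeds expectation.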

Faraway (1992) showed that a combination of these different tests is the theoretically best test against a genetic alternative hypothesis.

Observed identity by descent*        Value of composite statistic
2p2+p1 > 1, p1 > 1/2                 mean test
3p1/2 + p2 < 1, p2 > 1/4             two-allele test
2p2+p1 < 1, p2 < 1/4                 Not consistent with genetic cause
Otherwise                            2 d.f. chi-square

* Here p2, p1 and p0 are the observed proportions of pairs sharing two, one and zero alleles ibd. Unfortunately, since one has to choose a different test for each situation, a correct P-value can no longer be looked up in the conventional chi-square table. For example, if your sample has 150 ASPs, the critical chi-square value for a one-tailed P=0.05 is not 2.71, but 3.42. Most people continue to either use the mean test, or use computer programs such as ASPEX or GAS to calculate the P-value for the Faraway test (or an equivalent called the MLS test).

Affected sib pairs with untyped parents

If a disease occurs late in life, both parents of an ASP are likely to be dead. We can still work out the ibd probabilities for the sibs. If the marker is multiallelic, and the pair are a/b and c/d, for example, they must be ibd=0. If we know the marker allele frequencies, and assume panmixia, HWE etc., we can obtain the expected ibds by adding up the probabilities under each possible mating type that could give rise to that pair,

Expected ibd sharing when parental genotypes are unknown, after Table III of Haseman and Elston (1972).

Sib pair   Population frequency*   ibd=0%             ibd=50%             ibd=100%         Mean ibd
aa, aa     a^2(1+a)^2/4            a^2/(1+a)^2        2a/(1+a)^2          1/(1+a)^2        1/(1+a)
aa, bb     a^2b^2/2                1                  0                   0                0
aa, ab     a^2b(1+a)               a/(1+a)            1/(1+a)             0                1/(2+2a)
aa, bc     a^2bc                   1                  0                   0                0
ab, ab     ab(1+a+b+2ab)/2         2ab/(1+a+b+2ab)    (a+b)/(1+a+b+2ab)   1/(1+a+b+2ab)    (2+a+b)/(2+2a+2b+4ab)
ab, ac     abc(1+2a)               2a/(1+2a)          1/(1+2a)            0                1/(2+4a)
ab, cd     2abcd                   1                  0                   0                0

* Population frequency of that type of family in the population assuming random mating and HWE. Each letter represents the population frequency of that allele in the general population.

For example, if the a allele has a population frequency of 0.5, an ASP with genotypes a/a and a/b will contribute one-third of an observation to the ibd=0 cell, and two-thirds to the ibd=50% cell. The expected counts and the chi-square will be worked out in the usual way.
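The same posterior ibd probabilities can be obtained by applying Bayes' rule directly, rather than summing over mating types. The sketch below assumes random mating, Hardy-Weinberg proportions and the sib-pair prior (1/4, 1/2, 1/4); the function names are hypothetical and the allele labels arbitrary.

```python
from itertools import product

K_SIBS = (0.25, 0.5, 0.25)  # prior Pr(ibd = 0, 1, 2 alleles) for full sibs


def hwe(genotype, freq):
    """Hardy-Weinberg frequency of an unordered genotype."""
    x, y = genotype
    return freq[x] ** 2 if x == y else 2 * freq[x] * freq[y]


def pair_prob_given_ibd(g1, g2, freq):
    """Pr(sib genotypes | ibd = 0, 1, 2)."""
    g1, g2 = tuple(sorted(g1)), tuple(sorted(g2))
    # ibd = 0: the two genotypes are independent draws from the population.
    p0 = hwe(g1, freq) * hwe(g2, freq)
    # ibd = 1: one shared allele s, plus independent alleles x and y.
    p1 = sum(
        freq[s] * freq[x] * freq[y]
        for s, x, y in product(freq, repeat=3)
        if tuple(sorted((s, x))) == g1 and tuple(sorted((s, y))) == g2
    )
    # ibd = 2: the sibs must have identical genotypes.
    p2 = hwe(g1, freq) if g1 == g2 else 0.0
    return p0, p1, p2


def ibd_posterior(g1, g2, freq, prior=K_SIBS):
    """Posterior Pr(ibd = 0, 1, 2 | genotypes) when the parents are untyped."""
    joint = [k * p for k, p in zip(prior, pair_prob_given_ibd(g1, g2, freq))]
    total = sum(joint)
    return tuple(j / total for j in joint)


freq = {"a": 0.5, "b": 0.5}
print(ibd_posterior(("a", "a"), ("a", "b"), freq))
```

For the a/a and a/b pair with an a allele frequency of 0.5, this gives the one-third and two-thirds contributions described in the text.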

Other types of relative pair

We can easily construct similar tests for other types of relative pair. For example, if we have a set of families containing an affected individual and their affected grandparent, or two affected half-sibs, the expected ibd is 25% (or half an allele). The observed value will either be one or zero alleles shared ibd. For this case, we can use an approximate chi-square, or exact binomial test on the observed counts. Because there are more "intervening" relatives between the members of the grandparent-grandchild pair, there is more room for ambiguous cases to arise (the connecting parent needs to be heterozygous, and the grandparental contributions need to be identifiable, i.e. different grandparental genotypes).
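The exact binomial test mentioned here is easy to compute. The sketch assumes fully informative pairs, each sharing one allele ibd with probability 1/2 under the null hypothesis, and a one-sided alternative of excess sharing; the function name and the example counts are made up.

```python
from math import comb


def binom_upper_tail(shared, n, p=0.5):
    """Pr(X >= shared) for X ~ Binomial(n, p): one-sided exact P-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(shared, n + 1))


# Hypothetical data: 15 of 20 affected grandparent-grandchild pairs share an allele ibd.
print(binom_upper_tail(15, 20))
```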

Combining data from different relative pairs

Whittemore and Halpern [1994], Kruglyak et al [1996] and Kong and Cox [1997] present methods to carry out such a nonparametric allele-sharing analysis for arbitrary pedigrees containing two or more affected members.

In the notation of Kong and Cox [1997], each pedigree contributes an allele sharing score, which might be simply the summed count of alleles shared ibd by all the possible pairs of affecteds in that pedigree (the S_pairs statistic). This is compared to the value of S expected under the null hypothesis of random segregation, and a Z-score for each pedigree is:

Z = [S-E(S)]/SD

The expectation E(S) is the average of the score over all the possible transmission patterns (inheritance vectors) possible for a pedigree of that size and shape. The weighted sum of these Z scores gives an overall linkage score for the total number of families.

Kong and Cox [1997] suggested a likelihood based model for Z where ibd sharing varies according to a single parameter d, which combines trait locus allele frequencies and penetrances (see below for a natural two parameter model). They suggested linear and exponential model alternatives. The exponential model is equivalent to assuming a single recessive trait locus so that the single free parameter corresponds to the risk allele frequency (PD):

ASP                    Identity by descent allele sharing      Total
                       ibd=100%     ibd=50%      ibd=0%
Expected Proportion    x^2          2x(1-x)      (1-x)^2       1

where x=1/(1+PD), and d=-0.5*log(PD)2 [see Nicolae 1999, pp 29-30].

The linear model sets VD to zero (and meets the usual problem that this bounds the recurrence risks). These parameterizations allow a likelihood-based score test (or a LRTS) to be constructed.

A number of alternative sharing scores have been proposed in the literature, and these are often more powerful against specific alternatives other than the recessive model. The S_all statistic suggested by Whittemore and Halpern [1994], for example, upweights sharing where the same allele is present in more than two members of a family.

Identity by state ASP

Lange [1986] suggested an alternative approach for ASPs with untyped parents. This is to work out the expected ibs under the null hypothesis. One can see that this would not be difficult to calculate, based on the previous table. This approach can also be easily generalized to other types of relatives [Bishop et al 1990],

Pr(ibs=0%) = k0 T00
Pr(ibs=50%) = k0 T10 + k1 T11
Pr(ibs=100%) = k0 T20 + k1 T21 + k2

With,

T00 = Sum_{i ne j}[Pi Pj (1-Pi-Pj)^2] + Sum_i[Pi^2 (1-Pi)^2]
T10 = 4 Sum_{i ne j}[Pi Pj^2 (1-Pi-Pj)] + 4 Sum_i[Pi^3 (1-Pi)]
T11 = Sum_i[Pi (1-Pi)]
T20 = 2 Sum_{i ne j}[Pi^2 Pj^2] + Sum_i[Pi^4]
T21 = Sum_i[Pi^2]

where Sum_i[Pi] represents the usual summation notation, sums over i ne j run over all ordered pairs of distinct alleles, and

Kinship coefficients (expected ibd) for different classes of relatives

Type of relative pair      k0     k1     k2
Full-sibs                  1/4    1/2    1/4
Half-sibs                  1/2    1/2    0
Grandparent-grandchild     1/2    1/2    0
Cousins                    3/4    1/4    0

The k's are simply the mean or expected ibd sharing under the null hypothesis, and the T's, the probability of observing that ibs given the ibd. So for example, if a pair are ibd=100%, then it is certain they will be ibs=100%. Similarly, if a pair is ibd=50%, they will definitely have one allele identical by state (the allele shared ibd), and will share a second allele ibs Sum_i[Pi^2] of the time, the same probability a genotype would be homozygous under HWE. As a marker becomes more and more informative, the closer the ibs sharing approaches ibd sharing. For example, for a marker with n equifrequent alleles, the expected mean ibs sharing for full-sibs is:

0.5 + 0.25 (1/n) [(1/n)^2 - 2 (1/n) + 3]

and the relationship between the underlying ibd sharing and ibs sharing is:

mean ibd sharing           mean ibs sharing
0 (0 alleles shared)       (1/n) [(1/n)^2 - 2(1/n) + 2]
0.5 (1 allele shared)      0.5 + 0.5*(1/n)
1.0 (2 alleles shared)     1.0
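The T terms and the resulting ibs distribution can be computed for any set of allele frequencies. The sketch below assumes the sums over i ne j run over ordered pairs of distinct alleles, and takes the k's from the table of relative pairs above; the function names are illustrative only.

```python
def ibs_given_ibd(p):
    """Return the T terms, Pr(ibs | ibd), for allele frequencies p (summing to 1)."""
    idx = range(len(p))
    pairs = [(i, j) for i in idx for j in idx if i != j]  # ordered pairs, i != j
    return {
        "T00": sum(p[i] * p[j] * (1 - p[i] - p[j]) ** 2 for i, j in pairs)
        + sum(x**2 * (1 - x) ** 2 for x in p),
        "T10": 4 * sum(p[i] * p[j] ** 2 * (1 - p[i] - p[j]) for i, j in pairs)
        + 4 * sum(x**3 * (1 - x) for x in p),
        "T11": sum(x * (1 - x) for x in p),
        "T20": 2 * sum(p[i] ** 2 * p[j] ** 2 for i, j in pairs) + sum(x**4 for x in p),
        "T21": sum(x**2 for x in p),
    }


def ibs_distribution(p, k):
    """Pr(ibs = 0%, 50%, 100%) for a relative pair with ibd probabilities k = (k0, k1, k2)."""
    t = ibs_given_ibd(p)
    k0, k1, k2 = k
    return (
        k0 * t["T00"],
        k0 * t["T10"] + k1 * t["T11"],
        k0 * t["T20"] + k1 * t["T21"] + k2,
    )


# Full sibs typed at a marker with four equifrequent alleles.
ibs0, ibs1, ibs2 = ibs_distribution([0.25] * 4, (0.25, 0.5, 0.25))
print(ibs0, ibs1, ibs2, "mean ibs:", 0.5 * ibs1 + ibs2)
```

For equifrequent alleles the mean ibs sharing computed this way matches the closed-form expression given above.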

The chi-square test for a particular type of relative pair will be based on this table,

                  Identity by state allele sharing        Total
                  ibs=100%     ibs=50%     ibs=0%
Observed Count    O2           O1          O0             N
Expected Count    E2           E1          E0             N

Combining IBS data from different relative pairs: APM

Weeks and Lange [1988] present methods to carry out a nonparametric identity by state sharing analysis for arbitrary pedigrees containing two or more affected members. This is usually called the Affected Pedigree Member method or APM method.

This is not greatly different from the ibd based methods presented earlier (which indeed build on Weeks and Lange's concept). The ibs sharing statistic usually weights the ibs value by the frequency of the involved alleles, in order to better reflect ibd.

It has fallen out of favour because it is hard to develop a sensible multipoint version, tends to mix up association with linkage, and in earlier versions, tended to have a high false positive rate. It does have the advantage of being very quick.

Risch's parameterisation for ibd based ASP analysis

One will often encounter the results and notation derived in Risch [1990], a paper that summarizes much earlier work on ASP analysis. The expected values under specific genetic hypotheses were quite complicated using VA, VD and R. Risch introduced some simpler formulae for the expected values.

The recurrence risk is the probability a family member will be affected (for a dichotomous trait) given that a specified relative is affected. For example, for a rare fully penetrant recessive gene (f2=1, f1=0, f0=0), the recurrence risk to a sibling will be approximately 25%. James (1971) had shown that the recurrence risk was,

RecR = R + ((k1/2 + k2) VA + k2 VD)/R

where k1 and k2 are the probabilities of sharing one and two alleles ibd for that class of relative, as tabulated earlier. If we define the Population Relative Risk (PRR) as RecR/R, then the expected ibd under a specific genetic hypothesis for a specific type (type i) of relative pair is,

                  Identity by descent allele sharing
                  ibd=100%           ibd=50%            ibd=0%
Expected Prop     k2 PRRMZ/PRRi      k1 PRRPO/PRRi      k0/PRRi

PRRMZ is the PRR for a monozygotic or identical twin of an affected individual, and PRRPO is the PRR for the child of an affected parent. Therefore, if descriptive data about a trait is available, we can work out firstly how many families we will need in our study to get a significant chi-square (the power of the study), as well as detecting if a trait locus linked to our marker explains all the cases of disease in the population.
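As a worked illustration, the expected ibd proportions for affected sib pairs can be computed from R, VA and VD. This sketch assumes the trait locus is completely linked to the marker (c=0), and takes k1 and k2 as the probabilities of sharing one and two alleles ibd, so that the additive variance is weighted by k1/2 + k2 in the recurrence risk. The function names are hypothetical.

```python
def prr(k1, k2, r, va, vd):
    """Population relative risk RecR/R for a relative class with ibd probabilities k1, k2."""
    rec_r = r + ((k1 / 2 + k2) * va + k2 * vd) / r
    return rec_r / r


def asp_expected_ibd(r, va, vd):
    """Expected (ibd=100%, ibd=50%, ibd=0%) proportions for affected sib pairs at c=0."""
    k0, k1, k2 = 0.25, 0.5, 0.25  # full sibs
    prr_sib = prr(k1, k2, r, va, vd)
    prr_po = prr(1.0, 0.0, r, va, vd)  # parent-offspring: always one allele ibd
    prr_mz = prr(0.0, 1.0, r, va, vd)  # MZ twins: always two alleles ibd
    return (k2 * prr_mz / prr_sib, k1 * prr_po / prr_sib, k0 / prr_sib)


# Fully penetrant recessive, trait allele frequency 0.1: R=0.01, VA=0.0018, VD=0.0081.
z2, z1, z0 = asp_expected_ibd(0.01, 0.0018, 0.0081)
print(z2, z1, z0)
```

For this recessive example the three proportions sum to one and reduce to x^2, 2x(1-x) and (1-x)^2 with x=1/(1+PD), the one-parameter recessive form used for the exponential model.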

A third, related use is to perform exclusion mapping. If we specify R, PRRMZ and PRRPO we can test whether our observed ibd counts are significantly different from what they would be if the trait locus was close to our marker locus. If the chi-square is large enough, we can exclude the trait from being in that chromosomal region. This allows us to quantify how "non-significant" a small ASP chi-square value is, since a small chi-square can either arise from having a small study (not very powerful) or from the trait and marker locus being unlinked.


Exercises

(1) A collaborative study of schizophrenia collected a number of sib pairs where both members were affected. These and their parents were typed at the marker D22S278.
Group                            Alleles shared ibd    Alleles not shared ibd
IOP, Cardiff                     17                    6
Johns Hopkins/MIT                32                    17
MCV, Richmond                    22                    21
Natl USA/CRC Harrow              69                    60
Uni of Utah/ Uni of Colorado     11                    5
CNRS, Paris                      21                    14
Jerusalem/Mainz/Munich/Haar      28                    23
Uni College Hosp                 6                     8
Edinburgh                        17                    15
Kiel Uni Hosp                    11                    8
USA/Australia                    18                    11

Do you think schizophrenia is linked to D22S278?