Programs useful for detecting genotyping and pedigree errors

This document tries to give an overview of the practicalities of cleaning genotyping and pedigree errors from family data, usually in the context of a genetic linkage study.

A list of programs

"Single-point" Mendelian error checking

"Multipoint" Mendelian error checking

Pedigree error checking

Are errors important?

It is probably not possible to track down all genotyping errors in the kind of data set that we usually look at, that is, one where important family members have not been genotyped. Is it necessary that all, or even most, errors are cleaned? Are some methods of analysis robust to errors?

The answer is that a genotyping error rate regarded as fairly typical for most laboratories (3%) can have a severe effect if, by chance, the individual whose genotype is affected is a key member of a pedigree. It is not uncommon, in small datasets, to perform a sensitivity analysis by deleting or altering the genotypes (and phenotypes) of each pedigree member in turn.
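A minimal sketch of such a leave-one-out sensitivity analysis follows. The `score` function is a placeholder for whatever analysis is being run (for example, a wrapper that calls a linkage program and returns the maximum lod score); the function names and the data layout are assumptions made for illustration only.

```python
def leave_one_out(genotypes, score):
    """Score the data with each typed person's genotype deleted in turn.

    genotypes: dict mapping person ID -> genotype (None if untyped).
    score: user-supplied function taking such a dict and returning a number.
    Returns a dict mapping person ID -> change in score after deletion.
    """
    baseline = score(genotypes)
    changes = {}
    for person, g in genotypes.items():
        if g is None:
            continue  # nothing to delete for an untyped person
        reduced = dict(genotypes)
        reduced[person] = None
        changes[person] = score(reduced) - baseline
    return changes


# Stand-in score for illustration only: the number of typed people.
def count_typed(genotypes):
    return sum(g is not None for g in genotypes.values())


print(leave_one_out({"1": (1, 2), "2": None, "3": (2, 2)}, count_typed))
```

People whose deletion changes the score most are the "key members" whose genotypes deserve the closest checking.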

An example

Here are the results of a small simulation study in which error rates of 1%, 3%, 5%, 7% and 10% were applied to the data of Hall et al [1990], used originally to localize BRCA1. This was done by randomly altering each allele, with probability equal to the error rate, to a value chosen randomly from the set of all observed alleles.
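The error model just described can be sketched as below. This is an illustration under assumptions, not the code used for the study: the data layout (a dictionary of genotype tuples) and the function name are invented, and the toy genotypes are not taken from the Hall et al pedigrees.

```python
import random


def add_genotyping_errors(genotypes, error_rate, rng=None):
    """Return a copy of genotypes with alleles randomly replaced.

    genotypes: dict mapping person ID -> (allele1, allele2), None if untyped.
    Each typed allele is replaced, with probability error_rate, by a draw
    from the set of alleles observed anywhere in the data. A draw can
    return the original allele, so the realised error rate is a little
    lower than error_rate.
    """
    rng = rng or random.Random()
    observed = sorted({a for g in genotypes.values() if g is not None for a in g})
    noisy = {}
    for person, g in genotypes.items():
        if g is None:
            noisy[person] = None
            continue
        noisy[person] = tuple(
            rng.choice(observed) if rng.random() < error_rate else a
            for a in g
        )
    return noisy


# Toy data for illustration only.
data = {"1-8": (5, 8), "1-9": (2, 7), "1-16": (5, 7), "1-17": None}

# Twenty replicates at a 3% error rate, each with its own seed.
replicates = [add_genotyping_errors(data, 0.03, random.Random(seed))
              for seed in range(20)]
```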

Twenty replicates of the data set were simulated under each error rate. Each was first "cleaned" by deleting genotypes that gave rise to Mendelian errors, then reanalysed under the same model (single liability class intermediate dominant model) as the original data.

The resulting mean lod score curves are markedly lower than the original curve, and fall off rapidly at the 3% error rate.

Plot of mean lod curves

In the next graph, the lod curves for the 20 replicates under the 3% error regimen are shown. It can be seen that some replicates, due to the errors affecting key members of pedigrees, are more seriously affected than others.

Lod curves at 3% error rate

Further empirical evidence

Brzustowicz et al [1993] examined genotyping errors in the CEPH database (version 4.0), taking advantage of the fact that 21 individuals are "duplicated" in the set of pedigrees. There were 897 discordances out of 28391 genotypes at 1905 loci (3.2%). Most of these markers are diallelic.

The effect of these errors on multipoint linkage results for six markers on chromosome 5 was to double the total estimated map length.

Checking for Mendelian errors at a single locus

Simple errors

Most computer programs will point out obvious inconsistencies in pedigree data, though the ease of understanding the output varies.

Program UNKNOWN version 5.20
The following maximum values are in effect:
      30 loci
     325 single locus genotypes
      25 alleles at a single locus
    2000 individuals in one pedigree
       8 marriage(s) for one male
       3 quantitative factor(s) at a single locus
     120 liability classes
      25 binary codes at a single locus
       8 maximum number of loops
Opening DATAFILE.DAT
YOU ARE USING LINKAGE (V5.20) WITH  2-POINT
YOU ARE USING FASTLINK (V4.1P)
 AUTOSOMAL DATA
Opening PEDFILE.DAT
Ped.  1

 One incompatibility involves the family in which person 12 is a parent
 The person number refers to the second column in the pedigree file input to 
 UNKNOWN ERROR: Incompatibility detected in this family for locus            2
 *** Press <Enter> to continue

 One incompatibility involves the family in which person 8 is a parent
 The person number refers to the second column in the pedigree file input to 
 UNKNOWN
 *** Press <Enter> to continue

 One incompatibility involves the family in which person 11 is a parent
 The person number refers to the second column in the pedigree file input to 
 UNKNOWN
 *** Press <Enter> to continue

Ped.  2

 One incompatibility involves the family in which person 4 is a parent
 The person number refers to the second column in the pedigree file input to 
 UNKNOWN ERROR: Incompatibility detected in this family for locus            2
 *** Press <Enter> to continue...

The commonest detectable errors involve nuclear families. Several programs (such as CRI-MAP, GAS or Sib-pair) check for these errors first (it is very quick), and produce more informative output.

Pedigree: 1          No. members:   18 No. founders:    5 No. sibships:    4

NOTE:  Inconsistency due child 1-16 at locus D17S74     {  5/6  }
NOTE:  Inconsistency due child 1-17 at locus D17S74     {  6/8  }

Locus "D17S74"
------------------
Sibship: 1-9 x 1-8

Multiple inconsistencies between parent and child genotypes.

                         9                   8    
                        2/7                 5/8  
                         |                   |
                         +=========+=========+
                                   |
                              +----+----+
                              |         |          
                              16        17      
                             5/6       6/8      


NOTE:  Inconsistency due child 1-18 at locus D17S74     {  1/2  }

Locus "D17S74"
------------------
Sibship: 1-12 x 1-11

Inconsistency between parent and child genotypes.

                         12                  11   
                        8/12                2/5  
                         |                   |
                         +=========+=========+
                                   |
                                   |
                                   18   
                                  1/2  

Pedigree: 2          No. members:    9 No. founders:    3 No. sibships:    2

NOTE:  Inconsistency due child 2-7 at locus D17S74     {  1/2  }

Locus "D17S74"
------------------
Sibship: 2-3 x 2-4

Inconsistency between parent and child genotypes.

                         3                   4    
                        2/4                 5/12 
                         |                   |
                         +=========+=========+
                                   |
                         +---------+---------+
                         |         |         |          
                         7         8         9       
                        1/2       2/5       4/5      
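The nuclear-family check behind listings like the one above is simple enough to sketch. The following is an illustration only, not the code of any of the programs named: it tests each child separately against the parental genotypes, treats untyped people as compatible with anything, and so will miss errors that only show up when the whole sibship is considered together. The function names are invented.

```python
def child_is_compatible(parent1, parent2, child):
    """True if the child can take one allele from each parent.

    Genotypes are (allele, allele) tuples; None means untyped.
    The check is symmetric in the two parents.
    """
    if child is None:
        return True
    a, b = child
    p1 = set(parent1) if parent1 is not None else None
    p2 = set(parent2) if parent2 is not None else None

    def ok(from_p1, from_p2):
        return (p1 is None or from_p1 in p1) and (p2 is None or from_p2 in p2)

    return ok(a, b) or ok(b, a)


def check_sibship(parent1, parent2, children):
    """Return the IDs of children inconsistent with the parental genotypes."""
    return [cid for cid, g in children.items()
            if not child_is_compatible(parent1, parent2, g)]


# Sibship 2-3 x 2-4 at D17S74, from the listing above.
print(check_sibship((2, 4), (5, 12),
                    {"2-7": (1, 2), "2-8": (2, 5), "2-9": (4, 5)}))
```

Run on the sibship 2-3 x 2-4, this singles out child 2-7, in agreement with the listing.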

Complex Mendelian errors

If you have multiple-generation pedigrees with untyped connecting individuals (grandparents, great-uncles, etc.), errors can be difficult to understand. UNKNOWN is very fast at detecting these errors, but working out which person is most likely to be causing the trouble can take some time.

                             400       401   
                             x/x       x/x  
                              |         |
                              +====+====+
                                   |
                             +-----+----+--------+
                             |          |        |
        505                 501        500      504
        x/x                 x/x        9/9      7/10
         |                   |
         +=========+=========+
                   |
    +---------+----+----+---------+
    |         |         |         |          
   610       611       612       613      
   4/5       3/9       x/x       4/5      

Here is my attempt at automatic output that might help decide where the problem is:

ID       Count    Problem phenosets
-------- -------- -----------------

Maternal Gparents
400             2   7/  9   9/ 10
401             2   7/  9   9/ 10

Maternal Uncles/Aunts
500         Typed   9/  9
504             6   7/  7   7/  9   7/ 10   9/  9   9/ 10  10/ 10
506         Typed   7/ 10

Father
505       Problem   3/  3   3/  4   3/  5   3/  7   3/  9   3/ 10   4/  4

Mother
501       Problem   7/  7   7/  9   7/ 10   9/  9   9/ 10  10/ 10

Children
610         Typed   4/  5
611         Typed   3/  9
612       Problem   3/  3   3/  4   3/  5   3/  7   3/  9   3/ 10   4/  4
613         Typed   4/  5

Parent 1-501 cannot carry the   4 allele found in child 1-610.
Parent 1-501 cannot carry the   5 allele found in child 1-610.
Parent 1-501 cannot carry the   3 allele found in child 1-611.
Parent 1-501 cannot carry the   4 allele found in child 1-613.
Parent 1-501 cannot carry the   5 allele found in child 1-613.

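The mother, 501, has two typed siblings with genotypes 9/9 and 7/10, so their parents (400 and 401) must be 7/9 and 9/10. Person 501 can therefore carry only alleles 7, 9 and 10, and so cannot have a child with a 4/5 genotype. One possibility is that an allele has dropped out in person 500's genotype (so that 500 really carries a 4 or 5 allele as well as the 9), in which case 501 might be a 4/9 or 5/9 genotype.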
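The reasoning about 501 can be reproduced by brute-force enumeration within the grandparental nuclear family. This is a sketch for illustration, not the algorithm used by UNKNOWN or by the program that produced the listing above, so the sets it prints need not match that listing exactly. It assumes the two typed siblings are correct and considers only the alleles seen in the example.

```python
from itertools import combinations_with_replacement


def offspring(g1, g2):
    """All unordered genotypes that a g1 x g2 mating can produce."""
    return {tuple(sorted((a, b))) for a in g1 for b in g2}


def consistent_matings(alleles, typed_children):
    """Unordered parental genotype pairs able to produce every typed child."""
    genos = list(combinations_with_replacement(sorted(alleles), 2))
    typed = {tuple(sorted(c)) for c in typed_children}
    return [(g1, g2)
            for g1, g2 in combinations_with_replacement(genos, 2)
            if typed <= offspring(g1, g2)]


alleles = {3, 4, 5, 7, 9, 10}  # alleles seen in the example pedigree
matings = consistent_matings(alleles, [(9, 9), (7, 10)])

# Genotypes an untyped sibling (such as 501) could have under those matings.
untyped_sib = set().union(*(offspring(g1, g2) for g1, g2 in matings))

print(matings)
print(sorted(untyped_sib))
```

None of the genotypes left for the untyped sibling contains a 4 or a 5, which is why the typed children 610 and 613 cause an inconsistency.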

Checking for Mendelian errors via multipoint linkage analysis

One standard approach to the detection of errors is to look for tight double recombinants. The level of interference in human recombination means that such events are relatively rare, so if obligate double recombinants involving three closely spaced markers are observed, they may represent a genotyping error at the central marker.
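As a rough sketch of the idea, suppose phase has already been worked out, so that for one meiosis the grandparental origin of the transmitted allele is known at each informative marker. A marker whose origin differs from that of both informative neighbours, over a short map distance, is then a candidate for checking. The function name, the "P"/"M" coding and the 10 cM threshold are all assumptions made for illustration.

```python
def tight_double_recombinants(origins, positions_cm, max_span_cm=10.0):
    """Flag markers whose grandparental origin differs from both neighbours.

    origins: origin of the transmitted allele at each marker, in map order,
             for one meiosis (e.g. "P" or "M"); None if uninformative.
    positions_cm: map position of each marker in centimorgans.
    max_span_cm: flag only a switch-and-return spanning at most this distance
                 (an arbitrary illustrative threshold).
    Returns the indices of the flagged (central) markers.
    """
    informative = [i for i, o in enumerate(origins) if o is not None]
    flagged = []
    for left, mid, right in zip(informative, informative[1:], informative[2:]):
        switched = origins[left] == origins[right] != origins[mid]
        span = positions_cm[right] - positions_cm[left]
        if switched and span <= max_span_cm:
            flagged.append(mid)
    return flagged


# The third marker (index 2) switches origin and switches back within 5 cM.
print(tight_double_recombinants(["P", "P", "M", "P", "P"], [0, 3, 5, 8, 30]))
```

A flagged marker is only a starting point for checking: the apparent double recombinant may be produced by an error at a neighbouring marker or in a parent rather than at the central marker in the child.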

In the chromosome 5 linkage data of Brzustowicz et al [1993], there were three such tight double recombinants. These did represent errors, but rather than being simple substitution errors at the central marker in the child, they were errors in flanking markers or in parental genotyping. These represented 5 of the 43 errors involving the four markers selected for analysis.