This document tries to give an overview of the practicalities of cleaning genotyping and pedigree errors from family data, usually in the context of a genetic linkage study.
It is probably not possible to track down all genotyping errors in the usual kind of data set that we look at, that is, one where important family members have not been genotyped. Is it necessary that all or even most errors are cleaned? Are some methods of analysis robust to errors?
The answer is that even a genotyping error rate regarded as fairly typical for most laboratories (3%) can have a severe effect if, by chance, the individual whose genotype is affected is a key member of a pedigree. It is not uncommon, in small datasets, to perform a sensitivity analysis by deleting or altering genotypes (and phenotypes) from each pedigree member in turn.
Here are the results of a small simulation study where 1%, 3%, 5%, 7% and 10% error rates were applied to the data of Hall et al [1980], used originally to localize BRCA1. This was done by randomly altering each allele, on 1-10% of occasions according to the error rate, to a value chosen randomly from the set of all observed alleles.
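The error-injection step described above might be sketched as follows. This is a minimal illustration rather than the code used for the simulation: the function name inject_errors, the dictionary layout (person ID mapped to a pair of alleles, or None if untyped) and the toy genotypes are all assumptions made for the example.

```python
import random

def inject_errors(genotypes, error_rate, rng):
    """Return a copy of `genotypes` in which each allele has been replaced,
    with probability `error_rate`, by a random draw from the set of all
    observed alleles. Untyped individuals (None) are left untyped.
    Hypothetical helper; the data layout is an assumption."""
    observed = sorted({a for g in genotypes.values() if g is not None for a in g})
    noisy = {}
    for person, g in genotypes.items():
        if g is None:
            noisy[person] = None
            continue
        noisy[person] = tuple(
            rng.choice(observed) if rng.random() < error_rate else allele
            for allele in g
        )
    return noisy

# Toy data (invented for illustration): person ID -> (allele1, allele2)
toy = {"1-8": (5, 8), "1-9": (2, 7), "1-16": None, "1-17": (7, 8)}
replicate = inject_errors(toy, 0.03, random.Random(2024))
print(replicate)
```

Because the replacement allele is drawn from all observed alleles, it will sometimes equal the original allele, so the realized proportion of altered alleles is slightly below the nominal rate.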
Twenty replicates of the data set simulated under each error rate were first "cleaned", by deleting genotypes which gave rise to Mendelian errors, then reanalysed under the same model (single liability class intermediate dominant model) as the original data.
The mean resulting lod score curves are significantly lower than the original curve, and fall off rapidly at the 3% error rate.
In the next graph, the lod curves for the 20 replicates under the 3% error regimen are shown. It can be seen that some replicates, due to the errors affecting key members of pedigrees, are more seriously affected than others.
Brzustowicz et al [1993] examined genotyping errors in the CEPH database (version 4.0), taking advantage of the fact that 21 individuals are "duplicated" in the set of pedigrees. There were 897 discordances out of 28391 genotypes at 1905 loci (3.2%). Most of these markers are diallelic.
The effect of these errors on multipoint linkage results for six markers on chromosome 5 was to double the total estimated map length.
Most computer programs will point out obvious inconsistencies in pedigree data, though the ease of understanding the output varies.
Program UNKNOWN version 5.20

The following maximum values are in effect:
       30 loci
      325 single locus genotypes
       25 alleles at a single locus
     2000 individuals in one pedigree
        8 marriage(s) for one male
        3 quantitative factor(s) at a single locus
      120 liability classes
       25 binary codes at a single locus
        8 maximum number of loops

Opening DATAFILE.DAT
YOU ARE USING LINKAGE (V5.20) WITH 2-POINT
YOU ARE USING FASTLINK (V4.1P)
AUTOSOMAL DATA
Opening PEDFILE.DAT
Ped. 1
One incompatibility involves the family in which person 12 is a parent
The person number refers to the second column in the pedigree file input to UNKNOWN
ERROR: Incompatibility detected in this family for locus 2
*** Press <Enter> to continue
One incompatibility involves the family in which person 8 is a parent
The person number refers to the second column in the pedigree file input to UNKNOWN
*** Press <Enter> to continue
One incompatibility involves the family in which person 11 is a parent
The person number refers to the second column in the pedigree file input to UNKNOWN
*** Press <Enter> to continue
Ped. 2
One incompatibility involves the family in which person 4 is a parent
The person number refers to the second column in the pedigree file input to UNKNOWN
ERROR: Incompatibility detected in this family for locus 2
*** Press <Enter> to continue...
The commonest detectable errors involve nuclear families. Several programs (such as CRI-MAP, GAS or Sib-pair) check for these errors first, since the check is very quick, and produce more informative output.
Pedigree: 1  No. members: 18  No. founders: 5  No. sibships: 4

NOTE: Inconsistency due child 1-16 at locus D17S74 { 5/6 }
NOTE: Inconsistency due child 1-17 at locus D17S74 { 6/8 }

Locus "D17S74"
------------------
Sibship: 1-9 x 1-8
Multiple inconsistencies between parent and child genotypes.

     9                   8
    2/7                 5/8
     |                   |
     +=========+=========+
               |
          +----+----+
          |         |
          16        17
         5/6       6/8

NOTE: Inconsistency due child 1-18 at locus D17S74 { 1/2 }

Locus "D17S74"
------------------
Sibship: 1-12 x 1-11
Inconsistency between parent and child genotypes.

     12                  11
    8/12                2/5
     |                   |
     +=========+=========+
               |
               |
              18
              1/2

Pedigree: 2  No. members: 9  No. founders: 3  No. sibships: 2

NOTE: Inconsistency due child 2-7 at locus D17S74 { 1/2 }

Locus "D17S74"
------------------
Sibship: 2-3 x 2-4
Inconsistency between parent and child genotypes.

     3                   4
    2/4                 5/12
     |                   |
     +=========+=========+
               |
     +---------+---------+
     |         |         |
     7         8         9
    1/2       2/5       4/5
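The nuclear-family check behind this kind of output is simple to state: a child is consistent with its parents only if one of its alleles could have come from each parent. The following is a minimal sketch, not the algorithm of any of the programs named above; the function names are hypothetical, and it assumes both parents are typed and genotypes are codominant.

```python
def compatible(child, parent_a, parent_b):
    """True if the child's two alleles can be split so that one comes
    from each parent (codominant marker, both parents typed)."""
    x, y = child
    return (x in parent_a and y in parent_b) or (y in parent_a and x in parent_b)

def inconsistent_children(parent_a, parent_b, children):
    """Return the IDs of typed children whose genotype cannot be derived
    from the two parental genotypes. Untyped children (None) are skipped."""
    return [
        child_id
        for child_id, genotype in children.items()
        if genotype is not None and not compatible(genotype, parent_a, parent_b)
    ]

# Sibship 1-9 x 1-8 at D17S74, genotypes taken from the output above.
print(inconsistent_children((2, 7), (5, 8), {"1-16": (5, 6), "1-17": (6, 8)}))
# ['1-16', '1-17']

# Sibship 2-3 x 2-4 at D17S74.
print(inconsistent_children((2, 4), (5, 12), {"2-7": (1, 2), "2-8": (2, 5), "2-9": (4, 5)}))
# ['2-7']
```

Note that when every child in a sibship is flagged, as in the first example, the error may just as well lie in a parental genotype as in the children.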
If you have multiple-generation pedigrees with untyped connecting individuals (grandparents, great-uncles, etc.), errors can be difficult to understand. UNKNOWN is very fast at detecting these errors, but it takes some time to decide on the most likely person causing the trouble.
                            400       401
                            x/x       x/x
                             |         |
                             +====+====+
                                  |
                            +-----+----+--------+
                            |          |        |
       505                 501        500      504
       x/x                 x/x        9/9      7/10
        |                   |
        +=========+=========+
                  |
   +---------+----+----+---------+
   |         |         |         |
  610       611       612       613
  4/5       3/9       x/x       4/5
Here is my attempt at automatic output that might help decide where the problem is:
ID        Count     Problem phenosets
--------  --------  -----------------
Maternal Gparents
400       2         7/ 9  9/ 10
401       2         7/ 9  9/ 10
Maternal Uncles/Aunts
500       Typed     9/ 9
504       6         7/ 7  7/ 9  7/ 10  9/ 9  9/ 10  10/ 10
506       Typed     7/ 10
Father
505       Problem   3/ 3  3/ 4  3/ 5  3/ 7  3/ 9  3/ 10  4/ 4
Mother
501       Problem   7/ 7  7/ 9  7/ 10  9/ 9  9/ 10  10/ 10
Children
610       Typed     4/ 5
611       Typed     3/ 9
612       Problem   3/ 3  3/ 4  3/ 5  3/ 7  3/ 9  3/ 10  4/ 4
613       Typed     4/ 5

Parent 1-501 cannot carry the 4 allele found in child 1-610.
Parent 1-501 cannot carry the 5 allele found in child 1-610.
Parent 1-501 cannot carry the 3 allele found in child 1-611.
Parent 1-501 cannot carry the 4 allele found in child 1-613.
Parent 1-501 cannot carry the 5 allele found in child 1-613.
The mother 501 has two siblings with genotypes 9/9 and 7/10, and so cannot have a child with a 4/5 genotype. One possibility is that an allele has dropped out in person 500's genotype, so 501 might be a 4/9 or 5/9 genotype.
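The reasoning about the mother can be checked by brute force: enumerate every pair of grandparental genotypes that could produce the typed siblings, and collect the genotypes an untyped sibling could then have. This is a sketch under stated assumptions, not the method used to produce the listing above. The function names are hypothetical, the allele set is taken to be the alleles seen at this locus in the family, and the grandparents are assumed untyped. Because it enumerates joint parental configurations, it can give a narrower set than a per-person listing of candidate genotypes.

```python
from itertools import combinations_with_replacement, product

def offspring_genotypes(parent_a, parent_b):
    """All unordered genotypes a child of these two parents could have."""
    return {tuple(sorted((x, y))) for x, y in product(parent_a, parent_b)}

def untyped_sib_genotypes(alleles, typed_sibs):
    """Genotypes possible for an untyped sibling, given the genotypes of
    the typed siblings and untyped parents (hypothetical helper)."""
    candidates = list(combinations_with_replacement(sorted(alleles), 2))
    required = {tuple(sorted(g)) for g in typed_sibs}
    possible = set()
    for parent_a, parent_b in product(candidates, repeat=2):
        kids = offspring_genotypes(parent_a, parent_b)
        if required <= kids:          # this mating explains every typed sib
            possible |= kids
    return possible

# Siblings of the mother (501) are 9/9 and 7/10; alleles seen at this locus.
mother = untyped_sib_genotypes({3, 4, 5, 7, 9, 10}, [(9, 9), (7, 10)])
print(sorted(mother))
# [(7, 9), (7, 10), (9, 9), (9, 10)]

transmissible = {allele for genotype in mother for allele in genotype}
print(4 in transmissible, 5 in transmissible)
# False False
```

Under these assumptions the mother can only transmit a 7, 9 or 10 allele, which is why the 4/5 children are flagged, and why relaxing one of the sibling genotypes (for example, allowing for allele drop-out in the 9/9) changes the conclusion.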
One standard approach to the detection of errors is to look for tight double recombinants. The level of interference in human recombination means such events are relatively rare, so if obligate double recombinants involving three closely spaced markers are observed, these may represent a genotyping error for the central marker.
In the chromosome 5 linkage data of Brzustowicz et al [1993], there were three such tight double recombinants. These did represent errors, but rather than being simple substitution errors in the central marker in the child, they were errors in flanking markers or parental genotyping. These represented 5 of the 43 errors involving the four markers selected for analysis.
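As a rough illustration of the idea, suppose that for one meiosis the grandparental origin of the transmitted allele is known at each marker in map order (0 for grandpaternal, 1 for grandmaternal, None where the marker is uninformative). A tight double recombinant then appears as an informative marker whose origin differs from that of both informative neighbours. The function name and the encoding are assumptions for this sketch; it also assumes the markers are already known to be closely spaced and that phase is known, which real data often do not allow.

```python
def tight_double_recombinants(origins):
    """Return the indices of markers whose grandparental origin differs from
    that of both neighbouring informative markers (hypothetical helper).
    `origins` is in map order; None marks an uninformative marker."""
    informative = [(i, o) for i, o in enumerate(origins) if o is not None]
    flagged = []
    for (_, left), (j, middle), (_, right) in zip(
        informative, informative[1:], informative[2:]
    ):
        if left == right and middle != left:
            flagged.append(j)
    return flagged

print(tight_double_recombinants([0, 0, 1, 0, 0]))        # [2]
print(tight_double_recombinants([0, 0, 1, 1, 1]))        # []
print(tight_double_recombinants([0, None, 1, None, 0]))  # [2]
```

A flagged marker is only a candidate: the apparent double recombinant can equally arise from an error at a flanking marker or in a parental genotype.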