David L Duffy
QIMR Berghofer Medical Research Institute, Brisbane, Australia
Here I describe an Internet-accessible database containing interpolated genetic map positions for 12922 marker loci. These positions are estimated via locally weighted linear regression (loess) from the Build 35.1 physical map position and the linkage map of Kong and coworkers (2004). For the pseudoautosomal region, I have interpolated a male map based on the sperm typing data of Lien and coworkers (2000).
To my knowledge, LDB (the Genetic Location Database, Collins et al 1996) was the first attempt to integrate multiple types of genetic mapping data (genetic linkage, radiation hybrid and physical maps) to provide a superior estimate of the physical map position of genetic markers. This often gave improved ordering of markers, and this information could then be used to construct finer grained genetic maps for linkage analysis.
Integration of linkage and physical map data has been greatly enhanced by the completion of the sequencing of the human genome, and a number of groups have now released unified or comprehensive maps (Kong et al 2002; Nievergelt et al 2004), culminating in that of Kong and coworkers (2004, see http://genome.chop.edu for an interface to the June 2005 updated version of that map). In these applications, the only physical map information contributed to the linkage analysis of the markers is the ordering of the loci. The resolution obtained from linkage analysis of even the combined Marshfield (Broman et al 1998) and DeCode (Kong et al 2002) datasets, with up to 2026 informative meioses (Kong et al 2004), is relatively low: those authors report zero recombination events in 56% of intermarker intervals in that study.
The relationship between physical and recombination distance varies considerably by sex, by chromosome and chromosome length, and by position on the chromosome (eg Tapper et al 2005). Regression or smoothing methods can be applied to these data (Kong et al 2002; Bahlo et al 2004; Nievergelt et al 2004), and used to interpolate local recombination rates (cM/Mbp) and so recombination distances between closely spaced marker loci. These results can be tested by comparison to those obtained via coalescent modelling of linkage disequilibrium in the same regions (which has greater resolving power than conventional linkage analysis, eg McVean et al 2004). However, localization of trait loci via linkage analysis is known to be fairly robust to variation in specified intermarker distances, as long as the marker order is correct, so this is not critical for most work.
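To make the unit concrete, the following minimal Python sketch computes the raw recombination rate (cM/Mbp) for each interval between adjacent markers. The function name `local_rates` and the three marker records are hypothetical and purely illustrative; they are not taken from the database. Intervals with no observed recombination give a rate of zero, which is one reason a smoothing method is preferred over raw interval rates.

```python
def local_rates(markers):
    """Raw recombination rate for each pair of adjacent markers.

    markers: list of (name, position_mbp, position_cm), sorted by
    physical position. Returns a list of (left, right, cM_per_Mbp).
    """
    rates = []
    for (n1, p1, g1), (n2, p2, g2) in zip(markers, markers[1:]):
        span = p2 - p1
        # Guard against markers sharing the same physical position.
        rate = (g2 - g1) / span if span > 0 else float("nan")
        rates.append((n1, n2, rate))
    return rates


# Hypothetical markers (name, Mbp, cM); values are made up for illustration.
EXAMPLE = [("M1", 1.0, 0.0), ("M2", 2.0, 1.5), ("M3", 4.0, 1.5)]
RATES = local_rates(EXAMPLE)
for left, right, rate in RATES:
    print(left, right, rate)
```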
In this note, I describe a Web-based database that contains genetic map positions for 12922 marker loci.
I have chosen to use locally weighted regression (loess) to give smoothed local recombination rates. Other authors (eg Bahlo et al 2004) have used simpler approaches such as linear interpolation based on flanking markers. The smoothing constant (alpha) has been chosen by inspection of the resulting plots, based on how accurately telomeric and pericentromeric markers are placed. Analysis was carried out in the R statistical language (R Development Core Team 2005) using the locfit package (Loader 2004), and all code is available on the same website as the database.
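The idea of a locally weighted linear fit can be sketched in a few lines of Python. This is not the locfit code used for the database: it is a simplified illustration that assumes alpha is a nearest-neighbour fraction, uses tricube weights, and fits a straight line around each target position. The names `tricube` and `loess_point` and the test data are hypothetical. The fitted value is the smoothed genetic position, and the local slope is the smoothed recombination rate in cM/Mbp.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, alpha=0.5):
    """Locally weighted linear fit at x0.

    xs: physical positions (Mbp); ys: genetic positions (cM).
    alpha: fraction of points used in each local fit (an assumption
    about how the smoothing constant is interpreted).
    Returns (fitted cM at x0, local slope in cM/Mbp).
    """
    n = len(xs)
    k = min(n, max(2, int(round(alpha * n))))
    # Bandwidth: distance to the k-th nearest point, nudged up slightly
    # so that this point keeps a small nonzero weight.
    h = sorted(abs(x - x0) for x in xs)[k - 1]
    h = h * (1.0 + 1e-9) if h > 0 else 1e-9
    sw = su = sy = suu = suy = 0.0
    for x, y in zip(xs, ys):
        u = x - x0  # centre on x0 so the intercept is the fitted value
        w = tricube(u / h)
        sw += w
        su += w * u
        sy += w * y
        suu += w * u * u
        suy += w * u * y
    denom = sw * suu - su * su
    if abs(denom) < 1e-12:
        # Degenerate neighbourhood: fall back to the weighted mean.
        return sy / sw, float("nan")
    slope = (sw * suy - su * sy) / denom
    fit = (sy - slope * su) / sw
    return fit, slope


# Hypothetical data lying exactly on cM = 2 * Mbp + 1.
XS = [float(i) for i in range(10)]
YS = [2.0 * x + 1.0 for x in XS]
FIT, SLOPE = loess_point(4.5, XS, YS, alpha=0.5)
print(FIT, SLOPE)
```

Because the toy data are exactly linear, the local fit recovers the line itself. On real map data, smaller values of alpha follow local variation in recombination rate more closely, at the cost of noisier estimates.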
Originally, the database combined the publicly available Marshfield (Broman et al 1998) and DeCode (Kong et al 2002) datasets and used loess methods to integrate the two linkage maps. The map lengths of several chromosomes differed by up to 10% between these two sources. Where the sequence position of a marker was not known, it was interpolated from the same loess regression, though this is accurate only at the Mbp level. Since the Marshfield and DeCode datasets were subsequently merged and reanalysed de novo by Kong et al (2004), the positions of markers are now expressed in sex-averaged “Rutgers” cM. In the case of the male pseudoautosomal region, the results from Kong et al (2004) are too coarse, and I have instead based the analysis on the sperm typing map of Lien et al (2000), representing 1917 meioses.
The physical map positions are now based on Build 35.1 of the human genome sequence, as made available for the NCBI Mapviewer, and on BLAST searches for the published primer sequences where the marker position is not given by public databases such as Entrez and Ensembl. The markers chosen include all the published Marshfield markers, the DeCode markers, and other microsatellite and SNP markers of interest to our research group. The markers have all been curated, and numerous naming inconsistencies in the public databases have been repaired. This is especially the case for markers in the more recent Marshfield microsatellite mapping panels. This resolves several problems encountered by Nievergelt et al (2004) and Kong et al (2004) in the construction of their maps.
In the case of a few markers where the published linkage data are ambiguous, I have mapped these using our QIMR genome scan families (Zhu et al 2004). For example, Nievergelt et al (2004) placed D5S1454 (ATA4F06) on chromosome 4 based on a BLAST search of the Build 34 sequence. This marker is not placed on the most recent builds according to the public databases, and our linkage data confirmed it at its Marshfield position on chromosome 5 (between D5S433 and D5S2501). Documentation of these types of findings is on the website.
The database is in the form of a whitespace-delimited flat ASCII file that can be read into any spreadsheet or statistical program (Table 1). It is then easy to write software to automatically interpolate the position of any novel marker on the map, given that its position in the sequence is known. Indeed, one application in which I have used this database is to interpolate the genetic map positions of the chromosomal band boundaries. These can then be used to give chromosome ideograms that are correctly scaled for addition to plots of linkage genome scan results (implemented in an R package lodplot). As noted above, all this material is accessible at http://www.qimr.edu.au/davidD
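As an illustration of how little software this requires, the following Python sketch reads a whitespace-delimited map and places a novel marker by linear interpolation between its flanking markers. This is a deliberately simpler technique than the loess fit used to build the database. The column layout (marker name, chromosome, bp position, cM position), the function names `read_map` and `interpolate_cm`, and the sample records are all hypothetical; the real file layout should be taken from Table 1.

```python
import bisect

# Hypothetical layout and values, for illustration only.
SAMPLE = """\
# marker chrom bp cM
MKR_A 1 1000000 0.0
MKR_B 1 3000000 4.0
MKR_C 1 4000000 5.0
"""


def read_map(text):
    """Parse whitespace-delimited map text into {chrom: [(bp, cM, name)]}."""
    by_chrom = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        name, chrom = fields[0], fields[1]
        bp, cm = int(fields[2]), float(fields[3])
        by_chrom.setdefault(chrom, []).append((bp, cm, name))
    for rows in by_chrom.values():
        rows.sort()  # order by physical position
    return by_chrom


def interpolate_cm(by_chrom, chrom, bp):
    """Linearly interpolate a cM position from the two flanking markers."""
    rows = by_chrom[chrom]
    positions = [r[0] for r in rows]
    if bp < positions[0] or bp > positions[-1]:
        raise ValueError("position lies outside the mapped interval")
    i = bisect.bisect_left(positions, bp)
    if positions[i] == bp:
        return rows[i][1]  # coincides with a mapped marker
    (bp0, cm0, _), (bp1, cm1, _) = rows[i - 1], rows[i]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)


MAP = read_map(SAMPLE)
POS = interpolate_cm(MAP, "1", 2000000)
print(POS)
```

The same function could be applied to the sequence positions of chromosomal band boundaries to obtain their approximate genetic map positions. Positions beyond the outermost mapped markers are rejected here rather than extrapolated.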
The main advantage of the described database and approach is that all markers can be given unique, consistent genetic map positions. It can obviously be extended to give sex-specific genetic maps, though these still see little use in routine linkage analysis. Finally, the marker names and aliases included here are more accurate than those available on the main public databases such as UniSTS, from which sequence based data for later Marshfield markers are often absent.
Bahlo M, Xing L, Wilkinson CR (2004). HumanMSD and MouseMSD: generating genetic maps for human and murine microsatellite markers. Bioinformatics 20:3280-3283.
Broman KW, Murray JC, Sheffield VC, White RL, Weber JL (1998). Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 63:861-869.
Collins A., Frezal J., Teague J. & Morton NE. (1996). A metric map of humans: 23,500 loci in 850 bands. Proc. Natl. Acad. Sci. USA 93:14771-14775.
Kong, A., Gudbjartsson, D.F., Sainz, J., Jonsdottir, G.M., Gudjonsson, S.A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., et al. 2002. A high-resolution recombination map of the human genome. Nat. Genet. 31: 241-247.
Kong X, Murphy K, Raj T, He C, White PS, Matise TC (2004). A Combined Linkage-Physical Map of the Human Genome. Am. J. Hum. Genetics 75: 1143-1148.
Loader C (2004). locfit: Local Regression, Likelihood and Density Estimation. R package version 1.1-9. http://cm.bell-labs.com/stat/project/locfit/.
McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304:581-584.
R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Stassen HH and Scharfetter C (2000) Integration of genetic maps by polynomial transformations. Am. J. Med. Genetics 96: 108-113
Tapper W, Collins A, Gibson J, Maniatis N, Ennis S, Morton NE (2005). A map of the human genome in linkage disequilibrium units. Proc Natl Acad Sci USA 102:11835-11839.
Zhu G, Evans DM, Duffy DL, Montgomery GW, Medland SE, Gillespie NA, Ewen KR, Jewell M, Liew YW, Hayward NK, Sturm RA, Trent JM, and Martin NG (2004). A genome scan for eye colour in 502 twin families: most variation is due to a QTL on chromosome 15q. Twin Research 7:197-210.