Program BINNING was written to take the allele sizes for a repeat polymorphic marker read from a gel by software, and round these to an integer number of repeats ("binning"). It is not image analysis software. It uses pedigree information to optimize these bins in a simple manner, based on algorithms used in my SIB-PAIR program.
This is a beta-test version. It is possible that the program may disguise Mendelian inconsistencies by recoding alleles to resolve differences between parents and children that are larger than the laboratory error involved. Setting the cutpoint criterion (cutp) to a small value will minimize this possibility, but at the expense of throwing up more problem pedigrees for manual resolution.
The program requires that the repeat length for the polymorphism (rpt) be known. Bins are set up to be rpt wide.
The position of these divisions relative to the allele length data is shifted so that the sum of the squared distances from each data point to the midpoint of its nearest bin is minimized (a least-squares criterion). For example, if the data were:
101.1, 103.3, 101.4, 105.6
and the marker is a dinucleotide repeat, the program will try rounding these alleles to:
101,103,101,105: criterion=0.62
101.5,103.5,101.5,105.5: criterion=0.22
102,104,102,106: criterion=1.82
and so forth. The binning with the smallest criterion is then applied to the data. For example, if the second of the three models listed above were used, then with the default bin width all observed values between 101 and 102 would be rounded to 101.5. These new values are then rounded to the nearest integer repeat value.
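The criterion values in the example above can be reproduced with a short Python sketch. This is not the BINNING source code: the function names (snap, criterion, best_offset) are hypothetical, and the 0.5 grid of candidate bin positions is an assumption taken from the three models listed in the example, since the text does not say how candidate positions are enumerated.

```python
import math

def snap(value, offset, rpt):
    """Snap value to the nearest bin midpoint of the form offset + k*rpt."""
    k = math.floor((value - offset) / rpt + 0.5)
    return offset + k * rpt

def criterion(values, offset, rpt):
    """Sum of squared distances from each value to its nearest bin midpoint."""
    return sum((v - snap(v, offset, rpt)) ** 2 for v in values)

def best_offset(values, rpt, step=0.5):
    """Try offsets 0, step, 2*step, ... below rpt (assumed grid) and
    return the one with the smallest criterion."""
    n = int(round(rpt / step))
    candidates = [i * step for i in range(n)]
    return min(candidates, key=lambda off: criterion(values, off, rpt))

data = [101.1, 103.3, 101.4, 105.6]
rpt = 2  # dinucleotide repeat

for off in (101.0, 101.5, 102.0):
    print(off, round(criterion(data, off, rpt), 2))
# prints:
# 101.0 0.62
# 101.5 0.22
# 102.0 1.82

print(best_offset(data, rpt))  # 1.5, i.e. midpoints at ..., 101.5, 103.5, 105.5, ...
```

Because the bins repeat every rpt units, an offset of 1.5 describes the same set of midpoints as 101.5, so only offsets below rpt need to be searched.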
To reduce misclassification errors in binning, a plausible pattern of inheritance of alleles in each family is calculated using a gene dropping algorithm. This algorithm discards and resimulates ibd configurations that are inconsistent with the data, where inconsistency is defined as a length difference greater than cutp*rpt ("*" is multiplication) between an individual allele value and the mean value for the other family members carrying that allele ibd encountered so far. The mean value is binned, and the result is used for all those pedigree members inheriting that allele. Any pedigrees where an acceptable ibd configuration cannot be reached are flagged, and set to missing for that marker. This approach is especially appropriate where pedigrees are run on the same gel, so that errors are correlated within families. Finally, different ibd alleles are compared, and averaged if close enough. This avoids converting obvious homozygotes to heterozygotes when only one allele is chosen to be transmitted to a child, with a resulting change in its averaged value.
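The consistency test described above can be illustrated with a small Python sketch. It covers only the comparison of each allele value against the running mean, not the gene dropping itself. The function name consistent_with_ibd is hypothetical, and the value cutp = 0.5 is chosen purely for illustration; it is not claimed to be the program's default.

```python
def consistent_with_ibd(values, cutp, rpt):
    """Check the values observed for one putative ibd allele, in order.

    Returns (False, None) as soon as a value differs from the mean of the
    values seen before it by more than cutp * rpt; otherwise returns
    (True, mean of all values).
    """
    total = 0.0
    count = 0
    for v in values:
        if count > 0 and abs(v - total / count) > cutp * rpt:
            return False, None
        total += v
        count += 1
    return True, (total / count if count else None)

# Illustrative values only: cutp = 0.5 and rpt = 2 give a tolerance of 1.0.
print(consistent_with_ibd([126.1, 126.1, 126.7, 126.1], cutp=0.5, rpt=2))
print(consistent_with_ibd([126.1, 128.9], cutp=0.5, rpt=2))
```

In the first call every value lies within 1.0 of the running mean, so the set is accepted and its mean would then be binned. In the second call the 2.8 difference exceeds the tolerance, so a configuration assigning both values to the same ibd allele would be rejected.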
Operation of the program is similar to that of SIB-PAIR and GCONVERT. The program reads commands from standard input, and writes results to standard output. Therefore, the program can be run interactively, or in batch mode. If the input file were test.in, the command "binning < test.in > test.out" would perform the commands in test.in, and write results to test.out. BINNING is case-sensitive, so the keyword "READ" is not equivalent to "read".
A command is a single line of keywords, locus names and/or variable values. To allow compatibility with GAS, all statements can include optional brackets and semicolons. Commands are either global, which can be entered at any time; descriptive (set impute, set locus, read pedigree), which must precede the run statement; the run statement, which causes the dataset to be read and processed; or analytic, which act only after the run statement.
The following script bins a dataset containing two marker loci.
Test.in
set work c:\tmp\
set out verbose
set locus quant quantitative
set locus trait affection
set locus marker1 marker 2 (dinucleotide repeat)
set locus marker2 marker 1 (previously binned marker)
read pedigree test.ped
run
bin
write pedigree binned.ped
The pedigree file contains one record per person, of the form:
pedigree-id person-id father-id mother-id sex-of-person locus-value-1...locus-value-N
While a pedigree ID may be alphabetical, each person is designated by an integer ID code of up to 5 digits. Missing values are denoted x (and represented internally as a trait value of -9999). Locus values for a binary trait are y (expresses trait) or n (does not express trait). Sex takes the values m (male) or f (female), and may not be missing. Alleles at a marker locus are floating point values. A pedigree file may contain a comment at any point, prefaced by ! or #, and may contain a locus header of the form:
pedigree locus <locus-name-1>...<locus-name-N>.
Here is the data set analysed by the script test.in:
Test.ped
!
! test pedigree for binning
!
1000 1 x x m 10 y 126.1 132.4 1 1
1000 2 x x f 10 n 128.2 131.0 1 2
1000 3 x x f 25 n 127.5 132.8 2 2
1000 4 1 2 f 20 y 126.1 128.9 1 1
1000 5 1 2 m 30 y 131.1 132.2 1 1
1000 6 1 2 m 40 n 128.5 132.1 1 2
1000 7 1 2 f 50 n 126.7 129.1 1 2
1000 8 1 3 f 60 n 126.1 128.3 1 2
1000 9 1 3 m 40 y 131.8 133.0 1 2
! end-of-pedigrees
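As an illustration of the record layout, the following Python sketch parses one pedigree record using the locus order declared in test.in (quantitative, affection, then two markers). It is not part of BINNING. The names parse_record and LOCUS_TYPES are hypothetical, and the sketch assumes that comments occupy whole lines and that no locus header line is present.

```python
# Locus order as declared by the "set locus" commands in test.in.
LOCUS_TYPES = ["quantitative", "affection", "marker", "marker"]

def parse_record(line, locus_types=LOCUS_TYPES):
    """Parse one pedigree record; return None for blank or comment lines.

    Missing values ("x") are returned as None.
    """
    line = line.strip()
    if not line or line[0] in "!#":
        return None
    fields = line.split()
    ped, person, father, mother, sex = fields[:5]
    if sex not in ("m", "f"):
        raise ValueError("sex must be m or f: " + line)
    values = fields[5:]
    loci = []
    pos = 0
    for kind in locus_types:
        if kind == "marker":
            # A marker genotype takes two fields, one per allele.
            pair = values[pos:pos + 2]
            pos += 2
            loci.append(tuple(None if a == "x" else float(a) for a in pair))
        elif kind == "affection":
            v = values[pos]
            pos += 1
            loci.append(None if v == "x" else v == "y")
        else:  # quantitative trait
            v = values[pos]
            pos += 1
            loci.append(None if v == "x" else float(v))
    return {
        "pedigree": ped,
        "person": int(person),
        "father": None if father == "x" else int(father),
        "mother": None if mother == "x" else int(mother),
        "sex": sex,
        "loci": loci,
    }

rec = parse_record("1000 4 1 2 f 20 y 126.1 128.9 1 1")
print(rec["person"], rec["father"], rec["mother"], rec["loci"])
# prints: 4 1 2 [20.0, True, (126.1, 128.9), (1.0, 1.0)]
```

Founders such as person 1 have x for both parents, which the sketch returns as None.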