A program to call alleles based on approximate length data

David L. Duffy, MBBS PhD.
Queensland Institute of Medical Research,
300 Herston Road,
Herston, Queensland 4029, Australia.
Email: davidD@qimr.edu.au


Program BINNING was written to take the allele sizes for a repeat polymorphic marker read from a gel by software, and round these to an integer number of repeats ("binning"). It is not image analysis software. It uses pedigree information to optimize these bins in a simple manner, based on algorithms used in my SIB-PAIR program.

This is a beta-test version. It is possible that the program may disguise Mendelian inconsistencies by recoding alleles to resolve differences between parents and children that are larger than the laboratory error involved. Setting the cutpoint criterion (cutp) to a small value will minimize this possibility, but at the expense of throwing up more problem pedigrees for manual resolution.


The program requires that the repeat length for the polymorphism (rpt) be known. Bins are set up to be rpt wide.

The position of these divisions relative to the allele length data is shifted so that the sum of the squared distances from each data point to the midpoint of its nearest bin is minimized (a least-squares criterion). For example, if the data were:

101.1, 103.3, 101.4, 105.6

and the marker is a dinucleotide repeat, the program will try rounding these alleles to:

101, 103, 101, 105: criterion=0.62
101.5, 103.5, 101.5, 105.5: criterion=0.22
102, 104, 102, 106: criterion=1.82

and so forth. The binning with the smallest criterion is then applied to the data. For example, if the second of the three models listed above were used, then with the default bin width all observed values between 101 and 102 would be rounded to 101.5. These new values are then rounded to the nearest integer repeat value.
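
The offset search described above can be sketched in Python. This is a minimal illustration, not the program's actual code: the function names, the 0.5-unit step between candidate offsets (taken from the worked example) and the starting point of the search are all assumptions.

```python
import math


def nearest_midpoint(x, offset, rpt):
    """Return the bin midpoint closest to x, for midpoints at offset + k*rpt."""
    k = math.floor((x - offset) / rpt + 0.5)
    return offset + k * rpt


def criterion(data, offset, rpt):
    """Sum of squared distances from each value to its nearest bin midpoint."""
    return sum((x - nearest_midpoint(x, offset, rpt)) ** 2 for x in data)


def best_offset(data, rpt, step=0.5):
    """Try candidate offsets across one repeat length; keep the best one.

    The step size and the starting point are assumptions for illustration.
    """
    start = math.floor(min(data))
    n_steps = int(round(rpt / step))
    candidates = [start + i * step for i in range(n_steps)]
    return min(candidates, key=lambda off: criterion(data, off, rpt))


data = [101.1, 103.3, 101.4, 105.6]
rpt = 2.0  # dinucleotide repeat
for off in (101.0, 101.5, 102.0):
    print(off, round(criterion(data, off, rpt), 2))
print(best_offset(data, rpt))
```

For the four example values this prints criteria of 0.62, 0.22 and 1.82 for the three binnings listed above, and selects the bins centred on 101.5.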

To reduce misclassification errors in binning, a plausible pattern of inheritance of alleles in each family is calculated using a gene-dropping algorithm. This algorithm discards and resimulates identity-by-descent (ibd) configurations that are inconsistent with the data. Here, an inconsistency is a length difference of more than cutp*rpt ("*" is multiplication) between an individual allele value and the mean value for the other family members encountered so far who carry that allele ibd. The mean value is binned, and the result is used for all pedigree members inheriting that allele. Any pedigree for which an acceptable ibd configuration cannot be reached is flagged, and set to missing for that marker. This approach is especially appropriate where pedigrees are run on the same gel, so that errors are correlated within families. Finally, different ibd alleles are compared, and averaged if they are close enough. This avoids converting obvious homozygotes to heterozygotes when only one allele is chosen to be transmitted to a child, with a resulting change in its averaged value.
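
The inconsistency test used during gene dropping can be sketched as follows. This is only an illustration of the cutp*rpt rule stated above, not of the gene-dropping simulation itself; the function name and argument layout are hypothetical.

```python
def is_consistent(value, carriers_so_far, cutp=0.5, rpt=2.0):
    """True if `value` lies within cutp*rpt of the mean of the allele
    lengths already attributed to the same ibd allele."""
    if not carriers_so_far:
        return True  # first carrier: nothing to compare against yet
    mean = sum(carriers_so_far) / len(carriers_so_far)
    return abs(value - mean) <= cutp * rpt


# Lengths already attributed to one ibd allele (values chosen for illustration)
seen = [126.1, 126.1, 126.7]

print(is_consistent(126.1, seen))           # True: 0.2 from the mean of 126.3
print(is_consistent(128.2, seen))           # False: 1.9 exceeds 0.5 * 2
print(is_consistent(128.2, seen, rpt=4.0))  # True: 1.9 is within 0.5 * 4
```

A smaller cutp tightens the tolerance, so more configurations are rejected and more pedigrees are flagged for manual resolution.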


The command language is similar to that of SIB-PAIR and GCONVERT. The program reads commands from standard input, and writes results to standard output. The program can therefore be run interactively, or in batch mode. If the input file were test.in, the command "binning < test.in > test.out" would perform the commands in test.in, and write the results to test.out. BINNING is case-sensitive, so the keyword "READ" is not equivalent to "read".

A command is a single line of keywords, locus names and/or variable values. To allow compatibility with GAS, all statements can include optional brackets and semicolons. Commands are of four kinds: global, which can be entered at any time; descriptive (set impute, set locus, read pedigree), which must precede the run statement; the run statement, which causes the dataset to be read and processed; and analytic, which act only after the run statement.

Global commands

  1. !|#. The rest of the line is a comment.

  2. @|$. The rest of the line (up to position 80) is a command, and is passed to the shell for execution.

  3. set out|plevel <level>|verbose|on|off. Print level 1 prints the identities and genotypes of parents whose genotype was missing and has been imputed. Print level 2 (or verbose) writes out binning information for all pedigrees.

  4. set cutpoint <cutpt>. Sets the difference between two allele lengths, as a fraction of the repeat length, below which the alleles are regarded as being the same length. The default is 0.5, so that a difference of half a repeat length is regarded as within the range of measurement error (for example, two base pairs for a tetranucleotide repeat polymorphism).

Data Declaration commands

  5. set workdirectory <pathname>. Sets directory to which temporary files are written.

  6. set locus <locus name> <locus type> [<repeat length>]. Declares position (by order within list), name and type of locus within pedigree file, and the repeat length, if a marker locus. Locus type may be either marker -- a (fully) codominant autosomal marker, xmarker -- a codominant X-linked marker, quantitative, or affection.

  7. set sex on. Creates a quantitative dummy variable for sex (field 5 in pedigree file).

  8. set checking on|off. Controls whether the program tests for inconsistencies between parent and child genotypes as the data are read in. An inconsistency is a length difference of more than cutp*rpt for the transmitted allele derived from that parent.

  9. read pedigree <pedigree file name>. Reads a GAS type pedigree file.

  10. read linkage <pedigree file name>. Reads a LINKAGE type pedigree file.

  11. run|program. Reads in pedigree file and creates working pedigree file. Imputes genotypes if requested.

Output and data manipulation commands

  12. keep <loc1>...<locN>. Retain loci in subsequent analysis.

  13. drop <loc1>...<locN>. Exclude loci from analysis.

  14. undelete <loc1>...<locN>. Include previously deleted loci in analysis.

  15. recode <marker> <all1|value1>...<allN|valueN> to <new allele|new value>. Allows pooling of marker alleles prior to subsequent analysis.

  16. edit <pedigree> <person> <locus> [to] <value1> [<value2>]. Alters the genotype or trait value at the named locus for the given person in the given pedigree to the new value(s).

  17. write pedigree|gas <pedigree file name>. Use of the keywords pedigree or gas writes a GAS type pedigree file from the current dataset. Quantitative values are written as F8.4 (i.e. ddd.dddd).

  18. bin [<marker>]. Performs binning either on the named marker or, if none is specified, on all markers, using the repeat length given by the corresponding set locus declaration.

The following script bins a dataset containing two marker loci.


set work c:\tmp\
set out verbose
set locus quant quantitative
set locus trait affection
set locus marker1 marker 2 (dinucleotide repeat)
set locus marker2 marker 1 (previously binned marker)
read pedigree test.ped
run
bin
write pedigree binned.ped


The data set contains one record per individual. Records must be sorted into pedigrees. Records take the format used by GAS:

pedigree-id person-id father-id mother-id sex-of-person locus-value-1...locus-value-N

While a pedigree ID may be alphabetical, each person is designated by an integer ID code of up to five digits. Missing values are denoted x (and represented internally as a trait value of -9999). Locus values for a binary trait are y (expresses trait) or n (does not express trait). Sex takes the values m (male) and f (female), and may not be missing. Alleles at a marker locus are floating point values. A pedigree file may contain a comment at any point, prefaced by ! or #, and may contain a locus header of the form:

pedigree locus <locus-name-1>...<locus-name-N>.

Here is the data set analysed by the script test.in:


! test pedigree for binning
1000 1   x   x   m   10  y   126.1 132.4   1   1
1000 2   x   x   f   10  n   128.2 131.0   1   2
1000 3   x   x   f   25  n   127.5 132.8   2   2
1000 4   1   2   f   20  y   126.1 128.9   1   1
1000 5   1   2   m   30  y   131.1 132.2   1   1
1000 6   1   2   m   40  n   128.5 132.1   1   2
1000 7   1   2   f   50  n   126.7 129.1   1   2
1000 8   1   3   f   60  n   126.1 128.3   1   2
1000 9   1   3   m   40  y   131.8 133.0   1   2
! end-of-pedigrees
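
For readers who want to inspect such a file outside BINNING, the record layout described above can be read with a short Python sketch. The function names are hypothetical, and the sketch assumes that comments and the optional locus header each occupy a whole line; it makes no attempt to interpret locus types.

```python
def parse_token(tok):
    """Convert one field: 'x' -> None (missing), numbers -> float, else text."""
    if tok == "x":
        return None
    try:
        return float(tok)
    except ValueError:
        return tok  # e.g. 'y' or 'n' for an affection locus


def read_pedigree(lines):
    """Parse GAS type records into a list of dicts, skipping comments."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0][0] in "!#":
            continue  # blank line or whole-line comment
        if fields[:2] == ["pedigree", "locus"]:
            continue  # optional locus header
        ped, person, father, mother, sex = fields[:5]
        records.append({
            "pedigree": ped,
            "person": int(person),
            "father": None if father == "x" else int(father),
            "mother": None if mother == "x" else int(mother),
            "sex": sex,
            "values": [parse_token(t) for t in fields[5:]],
        })
    return records


sample = [
    "! test pedigree for binning",
    "1000 1   x   x   m   10  y   126.1 132.4   1   1",
    "1000 4   1   2   f   20  y   126.1 128.9   1   1",
    "! end-of-pedigrees",
]
records = read_pedigree(sample)
for rec in records:
    print(rec["pedigree"], rec["person"], rec["father"], rec["values"])
```

Each marker locus contributes two consecutive entries to the values list (its two allele lengths), so grouping them by locus requires the set locus declarations from the command file.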


To do: add in a gel-specific (or other covariate-specific) random effect.


10-Sep-2003 (0.94)

X-chromosome marker listing of binned allele sizes for each family no longer includes a superfluous second allele for male founders.

8-Aug-2003 (0.94)

Release of version supporting X-chromosome markers. Nuclear family Mendelian errors give a pedigree drawing with genotypes.

3-Mar-1999 (0.94)

Nicer message if parent-offspring incompatibility.

25-Nov-1997 (0.91)

Printed number of individuals typed now correct if multiple markers. Did not affect actual binning.

13-Nov-1997 (0.90)

First version made available to other users.