What is it?

mach2merlin is a tool for converting family-based genotypic imputation data from Mach or Minimac format to Merlin format for analysis in merlin-offline (information on analysis of imputed data using merlin-offline can be found here).

The code is written in Perl and is customised for the HapMap2, HapMap3 and 1KGP imputation references.

If you use this program, please cite the following:
Medland, SE Mach2merlin: Facilitating analysis of imputed genomic data in family studies (submitted)

For the 1KGP version... (

It is assumed that Minimac was used for the imputation.
Variant names are assumed to be in the chromosome:position format (eg 1:5670) rather than rs numbers.

Two arguments are required:
• the name and location of a fam file (-fam or -f)
      •  containing parental & zygosity information
      •  assumed to have the following 6 columns
      •  FID IID PID MID Sex Zygosity
      •  Zygosity coding follows Merlin format, MZ=1, DZ=2, Singleton=0
• the imputed file name prefix (-prefix or -p)
      •  assumed to be the prefix for the Minimac dose and info files
         eg if the imputed data was called chr1.dose.gz the prefix argument would be -prefix chr1
      •  the dose and info files are assumed to have come from Minimac without any reformatting
      •  the dose and info files are assumed to be gzipped
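
The input assumptions above can be checked before running the conversion. The following is a minimal Python sketch, not part of mach2merlin itself. The names parse_fam_line and dose_and_info_names are hypothetical, whitespace-delimited columns are an assumption, and the info file name is inferred by analogy with the chr1.dose.gz example.

```python
from typing import NamedTuple, Tuple

# Zygosity codes as listed above: Singleton=0, MZ=1, DZ=2
VALID_ZYGOSITY = {"0", "1", "2"}


class FamRow(NamedTuple):
    fid: str
    iid: str
    pid: str
    mid: str
    sex: str
    zygosity: str


def parse_fam_line(line: str) -> FamRow:
    """Split one fam line into its 6 columns and check the zygosity code.

    Assumes whitespace-delimited columns (an assumption, not stated above).
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}: {line!r}")
    row = FamRow(*fields)
    if row.zygosity not in VALID_ZYGOSITY:
        raise ValueError(f"unexpected zygosity code {row.zygosity!r} for {row.iid}")
    return row


def dose_and_info_names(prefix: str) -> Tuple[str, str]:
    """Build the gzipped dose and info file names for a -prefix value.

    The .info.gz name is assumed by analogy with the .dose.gz example.
    """
    return f"{prefix}.dose.gz", f"{prefix}.info.gz"


if __name__ == "__main__":
    print(parse_fam_line("FAM1 TWIN1 DAD1 MUM1 1 2"))
    print(dose_and_info_names("chr1"))
```

A check like this only confirms the column count and zygosity coding; it does not verify that the parents listed in PID and MID are present in the file.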

For the HapMap2 and HapMap3 versions... ( and )

It is assumed that Mach was used for the imputation. Variant names are assumed to be in rs number format.
In addition to the -fam and -prefix arguments you also need to provide the name of (and path to) the reference legend file:
• legend (-legendfile or -l)
       •  assumed to be a HapMapII (r22b36) legend file without any reformatting
       •  assumed to be gzipped

mach2merlin produces the following gzipped output files:

• map file (eg
       •  Contains CHR SNP ~cM
       •  ~cM position = BP/1000000
• dat file (eg infer_format_22.dat.gz)
       •  This is an infer format dat file in which dose takes the following format
       T COUNT(C,rs11089130)
       T COUNT(A,rs738829)
• skip file (eg infer_format_22.skip.gz)
       •  This is an infer format dat file in which snps with an rsq < .3 and/or maf < .005 are skipped
       This can be very useful if you want to restrict analysis to high quality variants or ignore rare variants that are not
       suitable for GWAS analysis
• freq file (eg infer_format_22.freq.gz)
       •  This is an infer format freq file in which freq for each snp takes the following format
       M rs2738388
       A C 0.7714
       A A 0.2286
• ped file (eg infer_format_22.ped.gz)
       •  This is an infer format ped file containing zygosity and dosage data
• idlist file (eg infer_format_22.idlist.gz)
       •  A list of FID and IID for those in the ped file
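
The line formats described above can be illustrated with a short Python sketch. This is only an illustration of the documented formats, not the Perl code mach2merlin uses. The function names are hypothetical, and the space separator in the map line is an assumption.

```python
from typing import Iterable, Tuple


def approx_cm(bp: int) -> float:
    """Approximate cM position as documented above: BP / 1000000."""
    return bp / 1_000_000


def map_line(chrom: str, snp: str, bp: int) -> str:
    """One map file row: CHR SNP ~cM (space-separated is an assumption)."""
    return f"{chrom} {snp} {approx_cm(bp)}"


def dat_line(allele: str, snp: str) -> str:
    """One dat file row declaring a dosage trait, eg T COUNT(C,rs11089130)."""
    return f"T COUNT({allele},{snp})"


def freq_block(snp: str, allele_freqs: Iterable[Tuple[str, float]]) -> str:
    """One freq file entry: an M line naming the snp, then one A line per allele."""
    lines = [f"M {snp}"]
    for allele, freq in allele_freqs:
        lines.append(f"A {allele} {freq:.4f}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(map_line("1", "1:5670", 5670))
    print(dat_line("C", "rs11089130"))
    print(freq_block("rs2738388", [("C", 0.7714), ("A", 0.2286)]))
```

Note that BP/1000000 treats 1 Mb as roughly 1 cM, so the map positions are approximations rather than genetic map positions.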

Optional arguments:

•  -batch use this option to cut the imputed file into smaller chunks to aid in parallelisation of analysis
     eg: if the file contained data for 50,000 variants adding -batch 10000 would yield 5 sets of dat, skip and ped files
     but only 1 map and freq file
•  -rsq provide an alternate rsq threshold for the skip file, the default is .3
•  -maf provide an alternate maf threshold for the skip file, the default is .005 (0.5%)
•  -out provide an alternate prefix for the output files, default is infer_format_{chromosome number}
     if the batch option is specified this becomes infer_format_{chromosome number}.{batch number}
•  -chr for the HapMap versions only: provide the chromosome number, default is taken from the name of the legend file
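
As a rough illustration of how the -rsq, -maf and -batch options interact, here is a minimal Python sketch. It is not taken from the mach2merlin source. The function names are hypothetical, and two details are assumptions: variants exactly at a threshold are kept, and a final partial batch gets its own set of files.

```python
import math
from typing import Optional


def is_skipped(rsq: float, maf: float,
               rsq_min: float = 0.3, maf_min: float = 0.005) -> bool:
    """True if a variant would be listed in the skip file.

    Follows the rule above: rsq below the -rsq threshold and/or
    maf below the -maf threshold (strict comparison is an assumption).
    """
    return rsq < rsq_min or maf < maf_min


def batch_count(n_variants: int, batch_size: int) -> int:
    """Number of dat/skip/ped sets produced for a given -batch size.

    Assumes a final partial batch is written as its own set.
    """
    return math.ceil(n_variants / batch_size)


def default_prefix(chrom: str, batch: Optional[int] = None) -> str:
    """Default output prefix, with the batch number appended when batching."""
    base = f"infer_format_{chrom}"
    return base if batch is None else f"{base}.{batch}"


if __name__ == "__main__":
    print(is_skipped(rsq=0.25, maf=0.10))
    print(batch_count(50_000, 10_000))
    print(default_prefix("22"), default_prefix("22", 3))
```

With the defaults, a well-imputed but very rare variant (for example rsq 0.9, maf 0.001) is still skipped, because either condition alone is enough.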

Example usage:

./ -fam mysample.fam -prefix chunk2-mysample.16.imputed -batch 20000

     This would read in a fam file and the Minimac imputed 1KGP data and produce chunk2-mysample.16.imputed.freq
     plus a series of ped, dat and skip files:
     chunk2-mysample.16.imputed.1.ped.gz chunk2-mysample.16.imputed.1.dat.gz
     chunk2-mysample.16.imputed.1.skip.gz ... chunk2-mysample.16.imputed.n.ped.gz
     chunk2-mysample.16.imputed.n.dat.gz chunk2-mysample.16.imputed.n.skip.gz

./ -fam mysample.fam -prefix mysample.16 -rsq .6 -maf .01 -legend genotypes_chr16_CEU_r22_nr.b36_fwd_legend.txt.gz

     This would read in a fam file, the Mach imputed data and the HapMap2 legend and produce mysample.16.freq, mysample.16.dat, and mysample.16.ped, plus a skip file in which all variants with an rsq of
     less than .6 and/or a maf of less than 1% were skipped