Class | Data Declaration command |
Name | read locus vcf |
Arguments | <VCF file name> [<start_position> [<end_position>]] [human]. |
Read locus names, types and map positions from a VCF file. Reading can be retricted to a subset of loci within an interval (specified in base pairs). The related file vcf command prints locus information without reading it into Sib-pair - it offers the additional ability to search on locus names. This is designed to allow the user to refine the interval to be read in. If a tabix generated index file is present, this is used to speed up operations.
A VCF style file contains one record per locus. These are preceded by meta-information about the data fields, identifying and documenting the key-value pairs within each field. After the meta-information, there is a header line describing the columns (fields) of data. The first eight fields are compulsory and are:
CHROM: chromosome number, X, Y |
POS: physical map position (bp) |
ID: marker name |
REF: reference allele/sequence |
ALT: other alleles/sequences |
QUAL: quality metric for genotypes |
FILTER: usually "PASS" or "FAIL" |
INFO: other locus information eg number genotyped, allele count |
A ninth field is almost always present, and is followed by the individual IDs for the genotypes.
FORMAT: genotype format type GT=phased or unphased
genotype DS=dosage score etc |
<ID1>: first genotype column ID header |
Since the VCF format is designed for genome-wide sequence data, files are commonly large, so the command (unlike other Sib-pair locus commands) offers the opportunity to just read a subset of loci, by specifying an interval start and ending position in base pairs. Locus IDs are read from the "ID" field, and if missing, is generated from the map position. The human modifier causes a chromosome number from 23-26 to be interpreted as X, Y, XY, and mitochondrial (MT) respectively.
Information in the INFO field is copied to the locus annotation, and such variables can then be read by the summary get command. Since INFO fields may be extremely long, and the Sib-pair annotation slot is (currently) 40 characters, selected variables can also be copied into the locus annotation by the read annotation command, and multiway tables generated by the show annotation command.
Example:
>> file head chr9.refpanel.EUR.vcf.gz ##fileformat=VCFv4.1 ##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting ##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probabilit ##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from ##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from ##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate fro ##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was ca #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 9 10023 . CCAA C 15 PASS AN=758;NS=379;AC=74 GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 >> read loc vcf chr9.refpanel.EUR.vcf.gz 1 100000 Locus Position (bp) --------------- ---------------- chr9:10023 10023 AN=758;NS=379;AC=74 PASS CCAA C chr9:10097 10097 AN=758;NS=379;AC=6 PASS CCA C rs185444096 10177 AN=758;NS=379;AC=47 PASS C T rs190296880 10192 AN=758;NS=379;AC=44 PASS T A chr9:10288 10288 AN=758;NS=379;AC=11 PASS A AC chr9:10299 10299 AN=758;NS=379;AC=56 PASS TA T rs56377469 10469 AN=758;NS=379;AC=410 PASS G C rs143946323 10690 AN=758;NS=379;AC=3 PASS G A rs7341907 10869 AN=758;NS=379;AC=354 PASS C G rs149305563 14665 AN=758;NS=379;AC=86 PASS G A rs149079262 14690 AN=758;NS=379;AC=165 PASS C G rs141156662 15883 AN=758;NS=379;AC=132 PASS A G chr9:17614 17614 AN=758;NS=379;AC=606 PASS CT C rs184525769 33204 AN=758;NS=379;AC=13 PASS C T chr9:33302 33302 AN=758;NS=379;AC=19 PASS G GAT chr9:33311 33311 AN=758;NS=379;AC=28 PASS TC T rs2492179 39037 AN=758;NS=379;AC=489 PASS A C rs9408135 39043 AN=758;NS=379;AC=532 PASS T C rs188840810 39457 AN=758;NS=379;AC=25 PASS G A ... VCF file = chr9.refpanel.EUR.vcf.gz Number of markers = 174 (out of total 583105) Number passing QC = 174 (1.000) Number indels = 24 Number SVs = 0 Number of subjects = 379 Total genotypes = 65946 Chromosome = 9 Map range (bp) = 10023 -- 99384 Declaring 174 loci.
>> read loc vcf chr22.dose.vcf.gz No. INFO vars = 3 (MAF R2 ER2) Locus Position (bp) --------------- ---------------- 22:16050408 22:16050408 T C MAF=0.05871;R2=0.00325 PASS 22:16050612 22:16050612 C G MAF=0.08850;R2=0.00573 PASS 22:16050678 22:16050678 C T MAF=0.05335;R2=0.00283 PASS 22:16050984 22:16050984 C G MAF=0.00179;R2=0.00014 PASS 22:16051107 22:16051107 C A MAF=0.05733;R2=0.00315 PASS 22:16051249 22:16051249 T C MAF=0.09613;R2=0.67196 PASS 22:16051347 22:16051347 G C MAF=0.32469;R2=0.15099 PASS 22:16051453 22:16051453 A C MAF=0.09661;R2=0.67031 PASS 22:16051480 22:16051480 T C MAF=0.05600;R2=0.00307 PASS 22:16051497 22:16051497 A G MAF=0.32429;R2=0.15198 PASS 22:16051722 22:16051722 TA T MAF=0.01340;R2=0.00103 PASS 22:16051882 22:16051882 C T MAF=0.05332;R2=0.00286 PASS 22:16052080 22:16052080 G A MAF=0.13894;R2=0.01217 PASS 22:16052239 22:16052239 A G MAF=0.45802;R2=0.05861 PASS 22:16052240 22:16052240 C G MAF=0.00134;R2=0.00009 PASS 22:16052250 22:16052250 A G MAF=0.11032;R2=0.00784 PASS 22:16052271 22:16052271 G A MAF=0.00759;R2=0.00059 PASS 22:16052513 22:16052513 G C MAF=0.36434;R2=0.03912 PASS 22:16052618 22:16052618 G A MAF=0.30505;R2=0.02938 PASS ... VCF file = chr22.dose.vcf.gz Number of subjects = 1370 (with genotype data) Number of markers = 365643 Number passing QC = 365531 (1.000) Number indels = 18541 Number SVs = 0 Total genotypes = 500930910 Chromosome = 22 Map range (bp) = 16050408 -- 51242613 Declaring 365643 loci. >> summary get R2 >> sum Test: "R2". Total number of tests = 365225 Locus Chr Position Statistic ------------------ --- ---------- ----------- 22:46637511 22 46.637511 1.000 MAF=0.11387;R2=1.00037;ER2=1.00000 PASS 22:46637272 22 46.637272 1.000 MAF=0.11387;R2=1.00037;ER2=1.00000 PASS 22:32109911 22 32.109911 1.000 MAF=0.00219;R2=1.00036;ER2=1.00000 PASS 22:46636930 22 46.636930 1.000 MAF=0.11387;R2=1.00036;ER2=0.99938 PASS 22:32871383 22 32.871383 1.000 MAF=0.20730;R2=1.00036;ER2=0.99876 PASS >> sum tab Test: "R2". Intvl Midpt Count Histogram --------------------------------------------------------- 0.0500 252665 + ************************************************** 0.1501 37201 | ********** 0.2501 21433 ***** 0.3501 14206 *** 0.4502 10285 ** 0.5502 7858 ** 0.6502 6571 * 0.7503 5289 * 0.8503 4365 * 0.9504 5352 * Median (IQR) = 0.0224 ( 0.0024 -- 0.1484) Symmetry test J(.02) = 2.7801 (P=0.000) >> keep where test >= 0.8 Selecting markers where test statistic >= 0.80000000000000004 Keeping 9736 active loci. >> pack Permanently deleted 355598 loci.
See also:
merge vcf | merge VCF file |
file vcf | summarises VCF file |
Class | Data Declaration command |
Name | read annotation |
Arguments | <VCF file name> [<INFO_variable> [<array_subindex>]] |
Read INFO variables from a VCF file. If a variable name is not specified, a list of all the INFO variables declared in the header is printed. If a variable name is specified, then the value of that variable for each active locus in the current Sib-pair dataset that matches the VCF file will be prepended to the Sib-pair locus annotation (locnotes). Note that Sib-pair matches loci first by locus name, and if not found, it then attempts to match on map position. For INFO variables that are arrays, the index of the entry can be specified.
The read stat vcf carries out the same procedure, but the resulting value is saved to locstat, so as to be available to the summary command. Advantage is that it skips recording that variable to the locus annotation (cf summary get).
>> read loc vcf MetaboChip.markers.ncbi37.20101203.vcf.gz ##INFO=<ID=TOP_STRAND,Number=1,Type=String,Description="Orientation of the TOP snp ##INFO=<ID=ALLELEA,Number=1,Type=String,Description="Allele A in reference orien ##INFO=<ID=ALLELEB,Number=1,Type=String,Description="Allele B in reference orien ##INFO=<ID=FLANKPOS,Number=3,Type=String,Description="Position determined using No. INFO vars = 4 (TOP_STRAND ALLELEA ALLELEB FLANKPOS) Locus Position (bp) --------------- ---------------- chr11:125740030 0:0 G C TOP_STRAND=+;ALLELEA=C;ALLELEB=G;FINEMAPPING-Lipids-LDL probe_multi,flank_mu [...] VCF file = MetaboChip.markers.ncbi37.20101203.vcf.gz Number of subjects = 0 (with genotype data) Number of markers = 196725 Number passing QC = 0 (0.000) Number indels = 14 Number SVs = 0 Total genotypes = 0 Chromosomes = 0 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 MT X Y Map range (bp) = 0 -- 249154567 Declaring 196725 loci. >> lis 1 -- 10 Locus Type Position --------------- ---- ---------------- chr11:125740030 ms 1 [G/C] probe_multi,flank_multi TOP_STRAND chr11:125743843 ms 2 [T/C] probe_multi,flank_multi TOP_STRAND chr11:125748859 ms 3 [C/T] probe_multi,flank_multi TOP_STRAND chr11:125748919 ms 4 [A/G] probe_multi,flank_multi TOP_STRAND chr11:125750655 ms 5 [G/C] probe_multi,flank_multi TOP_STRAND chr11:125761748 ms 6 [A/G] probe_multi,flank_multi TOP_STRAND chr11:125770364 ms 7 [C/T] probe_multi,flank_multi TOP_STRAND chr11:16875499 ms 8 [C/T] probe_multi,flank_multi TOP_STRAND chr11:16875541 ms 9 [A/G] probe_multi,flank_multi TOP_STRAND chr11:61320875 ms 10 [G/T] probe_multi,flank_multi TOP_STRAND Number of marker loci= 10 Bonferroni corr. 5% = 0.005116 Bonferroni corr. 1% = 0.001005 Bonferroni corr. 0.1%= 0.000100 >> read ann MetaboChip.markers.ncbi37.20101203.vcf.gz TOP_STRAND ALLELEA ALLELEB FLANKPOS No. INFO vars = 4 >> read ann MetaboChip.markers.ncbi37.20101203.vcf.gz ALLELEB >> read ann MetaboChip.markers.ncbi37.20101203.vcf.gz ALLELEA [...] Total number of loci scanned = 196725 Number of annotations updated = 196725 (100.0%) >> lis 1 -- 10 Locus Type Position --------------- ---- ---------------- chr11:125740030 ms 1 ALLELEA=C ALLELEB=G [G/C] probe_multi,fl chr11:125743843 ms 2 ALLELEA=T ALLELEB=C [T/C] probe_multi,fl chr11:125748859 ms 3 ALLELEA=T ALLELEB=C [C/T] probe_multi,fl [...] >> clear >> read loc vcf clinvar_20140702.vcf.gz >> read ann clinvar_20140702.vcf.gz GENEINFO >> read ann clinvar_20140702.vcf.gz NSM [...] *Found* "NSM": Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42 [...] >> lis whe NSM*CDKN2A Locus Type Position ------------ ---- ---------------- rs3731249 ms 37102 NSM GENEINFO=CDKN2A:1029 [C/T] RS=373124 rs116150891 ms 37103 NSM GENEINFO=CDKN2A:1029 [G/A] RS=116150 rs6413464 ms 37104 NSM GENEINFO=CDKN2A:1029 [C/A] RS=641346 [...] Number of marker loci= 19 >> drop Dropping 88177 active loci. >> undrop whe NSM*{CDKN2A CDK4 POT1 ATM} Reactivated 19 loci. Reactivated 2 loci. Reactivated 0 loci. Reactivated 17 loci.
See also:
show annotation | show or tabulate VCF INFO variable values |
read statistics | read in test statistics for loci |
file vcf | summarises VCF file |
<< (read locus plink) | Up to index | >> (read pedigree) |