R - Extract sample data from VCF files
I have a large variant call format (VCF) file (> 4 GB) that has data for several samples.
I have browsed Google and Stack Overflow and tried the VariantAnnotation package in R to somehow extract the data for only a particular sample, but I have not found any information on how to do this in R.
Did anyone try something like that, or maybe knows of another package that would enable this?
In VariantAnnotation, use ScanVcfParam to specify the data you'd like to extract. Using the sample VCF file included with the package:
    library(VariantAnnotation)
    vcffile = system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")
Discover information about the file:
    scanVcfHeader(vcffile)
    ## class: VCFHeader
    ## samples(5): HG00096 HG00097 HG00099 HG00100 HG00101
    ## meta(1): fileformat
    ## fixed(0):
    ## info(22): LDAF AVGPOST ... VT SNPSOURCE
    ## geno(3): GT DS GL
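If you want the sample and field names as plain character vectors (for example, to check that a sample ID exists before building a query), the header object can be queried directly. This is a minimal sketch that assumes a reasonably recent VariantAnnotation release; the variable names hdr, sample_ids, info_fields and geno_fields are only illustrative.

```r
library(VariantAnnotation)

# Example file shipped with the package (same file as above)
vcffile <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")

# Parse only the header; the variant records are not read
hdr <- scanVcfHeader(vcffile)

sample_ids  <- samples(hdr)          # character vector of sample names
info_fields <- rownames(info(hdr))   # names of the INFO fields
geno_fields <- rownames(geno(hdr))   # names of the genotype (FORMAT) fields

# Check that the samples we intend to request are really in the file
wanted <- c("HG00097", "HG00101")
missing_ids <- setdiff(wanted, sample_ids)
if (length(missing_ids) > 0) {
    stop("samples not in file: ", paste(missing_ids, collapse = ", "))
}
```

Because only the header is parsed, this check stays cheap even for a multi-gigabyte file.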
Formulate a request for the "LDAF" and "AVGPOST" info fields and the "GT" genotype field, for samples "HG00097" and "HG00101", for variants on chromosome 22 between coordinates 50300000 and 50400000:
    param = ScanVcfParam(
        info=c("LDAF", "AVGPOST"),
        geno="GT",
        samples=c("HG00097", "HG00101"),
        which=GRanges("22", IRanges(50300000, 50400000)))
Read the requested data:
    vcf = readVcf(vcffile, "hg19", param=param)
and extract the relevant data from the VCF object:
    head(geno(vcf)[["GT"]])
    ##             HG00097 HG00101
    ## rs7410291   "0|0"   "0|0"
    ## rs147922003 "0|0"   "0|0"
    ## rs114143073 "0|0"   "0|0"
    ## rs141778433 "0|0"   "0|0"
    ## rs182170314 "0|0"   "0|0"
    ## rs115145310 "0|0"   "0|0"
    head(info(vcf)[["LDAF"]])
    ## [1] 0.3431 0.0091 0.0098 0.0062 0.0041 0.0117
    ranges(vcf)
    ## IRanges of length 1169
    ##           start      end width             names
    ## [1]    50300078 50300078     1         rs7410291
    ## [2]    50300086 50300086     1       rs147922003
    ## [3]    50300101 50300101     1       rs114143073
    ## [4]    50300113 50300113     1       rs141778433
    ## [5]    50300166 50300166     1       rs182170314
    ## ...         ...      ...   ...               ...
    ## [1165] 50364310 50364312     3 22:50364310_GCA/G
    ## [1166] 50364311 50364313     3 22:50364311_CAT/C
    ## [1167] 50364464 50364464     1       rs150069372
    ## [1168] 50364465 50364465     1       rs146661152
    ## [1169] 50364609 50364609     1       rs184235324
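To get the data for one particular sample into an ordinary data frame, the pieces above can be combined. The following is a sketch, not the only way to do it: it assumes a recent VariantAnnotation release in which rowRanges() is the accessor for variant positions, and the names one_sample_param and one_sample are made up for illustration.

```r
library(VariantAnnotation)

vcffile <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")

# Request a single sample, one genotype field and one info field
one_sample_param <- ScanVcfParam(
    info = "LDAF",
    geno = "GT",
    samples = "HG00097",
    which = GRanges("22", IRanges(50300000, 50400000)))

vcf <- readVcf(vcffile, "hg19", param = one_sample_param)

# Genotype matrix: one row per variant, one column per requested sample
gt <- geno(vcf)[["GT"]]

# Flatten into a plain data frame, one row per variant
one_sample <- data.frame(
    variant  = rownames(gt),
    position = start(rowRanges(vcf)),
    genotype = unname(gt[, "HG00097"]),
    row.names = NULL,
    stringsAsFactors = FALSE)

head(one_sample)
```

Restricting samples and which in the ScanVcfParam means only the requested slice is parsed, which is what matters with a file of several gigabytes. Range queries via which need a bgzip-compressed, tabix-indexed file, as the packaged example is.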
Maybe you're only interested in a single genotype element (for example "DS" in this file) as a simple R matrix; then just specify the samples and / or ranges you're interested in and use readGeno (or readGT or readInfo for similar specialized queries).
There is extensive documentation in the VariantAnnotation vignettes and reference manual; see also ?ScanVcfParam and example(ScanVcfParam).
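As a rough illustration of these specialized readers, the sketch below pulls single fields straight from the file without building a full VCF object. It reuses the packaged example file; the choice of the "DS" and "LDAF" fields and the variable names are assumptions made for the example.

```r
library(VariantAnnotation)

vcffile <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")

# Only samples and ranges are specified; the reader functions
# decide which field to import
param <- ScanVcfParam(
    samples = c("HG00097", "HG00101"),
    which = GRanges("22", IRanges(50300000, 50400000)))

ds   <- readGeno(vcffile, "DS", param = param)    # one genotype field as a matrix
gt   <- readGT(vcffile, param = param)            # the GT field as a character matrix
ldaf <- readInfo(vcffile, "LDAF", param = param)  # one info field, one value per variant

head(ds)
head(gt)
head(ldaf)
```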