R - Extract sample data from VCF files


I have a large variant call format (VCF) file (> 4 GB) that has data for several samples.

I have browsed Google and Stack Overflow, and I have tried the VariantAnnotation package in R to somehow extract the data for one particular sample only, but I have not found any information on how to do this in R.

Has anyone tried something like that, or does anyone know of another package that would enable this?

In VariantAnnotation, use a ScanVcfParam to specify the data you'd like to extract. Using the sample VCF file included with the package

library(VariantAnnotation)
vcffile = system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")

discover information about the file

scanVcfHeader(vcffile)
## class: VCFHeader
## samples(5): HG00096 HG00097 HG00099 HG00100 HG00101
## meta(1): fileformat
## fixed(0):
## info(22): LDAF AVGPOST ... VT SNPSOURCE
## geno(3): GT DS GL

Formulate a request for the "LDAF" and "AVGPOST" info fields and the "GT" genotype field, for samples "HG00097" and "HG00101", for variants on chromosome 22 between coordinates 50300000 and 50400000

param = ScanVcfParam(
    info=c("LDAF", "AVGPOST"),
    geno="GT",
    samples=c("HG00097", "HG00101"),
    which=GRanges("22", IRanges(50300000, 50400000)))

read the requested data

vcf = readVcf(vcffile, "hg19", param=param)

and extract the relevant data from the VCF

head(geno(vcf)[["GT"]])
##             HG00097 HG00101
## rs7410291   "0|0"   "0|0"
## rs147922003 "0|0"   "0|0"
## rs114143073 "0|0"   "0|0"
## rs141778433 "0|0"   "0|0"
## rs182170314 "0|0"   "0|0"
## rs115145310 "0|0"   "0|0"
head(info(vcf)[["LDAF"]])
## [1] 0.3431 0.0091 0.0098 0.0062 0.0041 0.0117
ranges(vcf)
## IRanges of length 1169
##           start      end width             names
## [1]    50300078 50300078     1         rs7410291
## [2]    50300086 50300086     1       rs147922003
## [3]    50300101 50300101     1       rs114143073
## [4]    50300113 50300113     1       rs141778433
## [5]    50300166 50300166     1       rs182170314
## ...         ...      ...   ...               ...
## [1165] 50364310 50364312     3 22:50364310_GCA/G
## [1166] 50364311 50364313     3 22:50364311_CAT/C
## [1167] 50364464 50364464     1       rs150069372
## [1168] 50364465 50364465     1       rs146661152
## [1169] 50364609 50364609     1       rs184235324
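Since the question asks about one particular sample, here is a minimal sketch of the same steps restricted to a single sample. It assumes the packaged chr22.vcf.gz file and the sample name "HG00097" from the header; the variable name one_sample is only illustrative, and info=NA is used on the assumption that the INFO fields are not needed.

```r
library(VariantAnnotation)

vcffile = system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")

## Request only the GT field for one sample; info=NA skips the INFO fields
param = ScanVcfParam(
    info=NA,
    geno="GT",
    samples="HG00097",
    which=GRanges("22", IRanges(50300000, 50400000)))
vcf = readVcf(vcffile, "hg19", param=param)

## The genotype matrix has one column per requested sample;
## indexing by sample name gives a character vector named by variant
gt = geno(vcf)[["GT"]]
one_sample = gt[, "HG00097"]

## Tabulate the genotype calls for this sample
table(one_sample)
```

Restricting samples in the ScanVcfParam, rather than subsetting after reading everything, keeps the other samples' genotype data from being parsed, which is likely to matter for a file of several gigabytes.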

Maybe you're only interested in the genotype element "GS" as a simple R matrix; then specify just the samples and / or ranges you're interested in, and use readGeno (or readGT or readInfo for similar specialized queries).
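A minimal sketch of the readGeno route, assuming the packaged chr22.vcf.gz file. Because that file's header lists only the genotype fields GT, DS and GL, the sketch reads "DS"; substitute whichever field your own file provides.

```r
library(VariantAnnotation)

vcffile = system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")

## Only samples and ranges are given here; the field to read
## is passed to readGeno() directly
param = ScanVcfParam(
    samples=c("HG00097", "HG00101"),
    which=GRanges("22", IRanges(50300000, 50400000)))

## Returns a plain matrix (variants x samples) instead of a VCF object
ds = readGeno(vcffile, "DS", param=param)

dim(ds)
head(ds)
```

The result can be used like any other R matrix, for example ds[, "HG00097"] for a single sample.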

There is extensive documentation in the VariantAnnotation vignettes and reference manual; see also ?ScanVcfParam; example(ScanVcfParam).

