The Variant Call Format (VCF), developed by the 1000 Genomes Project, makes it easy to filter and manage large amounts of variant information for a set of subjects.
Stata offers an easy interface for sorting, filtering, and manipulating large datasets. I have developed a tool, vcf, that makes it easy to import .vcf files into Stata (no easy task!).
The program does two challenging things to prepare the file for Stata:
- It splits the INFO column (delimited by ;) into separate columns. This is necessary because Stata has a string limit of 244 characters and would otherwise truncate this column.
- It recodes genotypic data, showing the genotypes of each individual.
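To make the two steps concrete, here is a minimal Python sketch of the general idea. It is not the code of the vcf command, and the function names `split_info` and `decode_genotype` are hypothetical. It assumes standard VCF conventions: INFO entries are separated by ; and are either key=value pairs or bare flags, and GT is the first colon-separated subfield of each sample column. How the Stata tool actually names its columns or codes genotypes may differ.

```python
def split_info(info):
    """Split a VCF INFO string into a dict, one entry per INFO key."""
    fields = {}
    if info in ("", "."):
        return fields  # "." means the INFO column is empty
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Flag entries such as "DB" carry no value; record them as True.
        fields[key] = value if sep else True
    return fields


def decode_genotype(sample, ref, alt):
    """Translate a GT such as "0/1" into allele letters such as "A/G"."""
    # Index 0 is the reference allele; 1, 2, ... are the ALT alleles in order.
    alleles = [ref] + alt.split(",")
    gt = sample.split(":")[0]  # assumes GT is the first sample subfield
    sep = "|" if "|" in gt else "/"
    calls = []
    for idx in gt.split(sep):
        # "." marks a missing call and is passed through unchanged.
        calls.append("." if idx == "." else alleles[int(idx)])
    return sep.join(calls)
```

Because the allele index is converted with `int(idx)` rather than read one character at a time, this sketch also handles indices of 10 and above.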
ssc install vcf
I have only tested this with Stata 12/SE. I believe it will also work with Stata 11 and perhaps earlier versions.
vcf using "path/to/file.vcf"
- While it is possible to read in very large files, this program cannot handle enormous VCF files. I have successfully loaded files of a few gigabytes. Ideally, filter enormous VCF files before using this.
- If your VCF file has more than 9 alternative alleles at a site, this program will incorrectly assign alleles beyond the 9th alternative allele.
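Both limitations can be worked around by trimming the file before import. The following is a rough Python sketch of one way to do that, not part of the vcf tool. The function name `filter_vcf_lines` and its parameters are hypothetical, and it assumes an uncompressed, tab-delimited VCF. It keeps all header lines, optionally restricts records to one chromosome and position range, and drops records with more than 9 ALT alleles.

```python
def filter_vcf_lines(lines, chrom=None, start=None, end=None, max_alt=9):
    """Yield header lines plus records that pass the region and ALT-count filters."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip blank or malformed records
        pos = int(cols[1])
        if chrom is not None and cols[0] != chrom:
            continue
        if start is not None and pos < start:
            continue
        if end is not None and pos > end:
            continue
        # ALT alleles are comma-separated in the fifth column.
        if len(cols[4].split(",")) > max_alt:
            continue
        yield line


# Example use (file names are placeholders):
# with open("big.vcf") as src, open("small.vcf", "w") as dst:
#     dst.writelines(filter_vcf_lines(src, chrom="20", start=1, end=2000000))
```

The function is a generator, so it streams the file line by line and never holds the whole VCF in memory.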
This program is no longer supported.