vcf

December 19, 2012   

The Variant Call Format (VCF), developed by the 1000 Genomes Project, makes it easy to filter and manage large amounts of variant information for a set of subjects.

Stata offers an easy interface for sorting, filtering, and manipulating large datasets. I have developed a tool, vcf, that makes it easy to import .vcf files into Stata (no easy task!).

The program does two challenging things to prepare the file for Stata:

  1. It splits the INFO column (delimited by ;) into separate columns. This is necessary because Stata has a string limit of 244 characters and would otherwise truncate this column.
  2. It recodes genotypic data, showing the genotypes of each individual.
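
To make the two steps concrete, here is a minimal Python sketch of what splitting INFO and recoding genotypes can look like for a single record. It is not the code the vcf command runs: the function names `split_info` and `recode_genotype` are hypothetical, and writing each genotype as allele bases (for example A/G) is an assumption about one reasonable recoding, not a description of the command's actual output.

```python
def split_info(info):
    """Turn a VCF INFO string such as 'DP=14;AF=0.5;DB' into a dict."""
    fields = {}
    if info in ("", "."):            # '.' means the INFO column is empty
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flag entries have no '='; store them as "1" (an arbitrary choice).
        fields[key] = value if sep else "1"
    return fields


def recode_genotype(sample, ref, alt):
    """Replace the allele indices in a sample field with allele bases."""
    alleles = [ref] + alt.split(",")  # index 0 is REF, 1..n are the ALT alleles
    gt = sample.split(":")[0]         # GT is the first sub-field when present
    sep = "|" if "|" in gt else "/"   # keep phased vs. unphased notation
    called = []
    for idx in gt.split(sep):
        # '.' marks a missing call; int() handles indices of any width
        called.append("." if idx == "." else alleles[int(idx)])
    return sep.join(called)
```

Each key returned by `split_info` would become its own short column, which is what keeps every value under the string limit. Parsing the allele index with `int()` rather than reading one character at a time means indices of 10 or more are handled like any other.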

Installation

ssc install vcf

Requirements

I have only tested this with Stata 12/SE. I believe it will also work with Stata 11 and perhaps earlier versions.

Usage

vcf using "path/to/file.vcf"

Limits

  1. Although it can read fairly large files (I have successfully loaded files of a few gigabytes), this program cannot handle enormous VCF files. Ideally, filter enormous VCF files before using it.
  2. If your VCF file has more than 9 alternative alleles, this program will incorrectly assign alleles beyond the 9th alternative allele.
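
One way to work within both limits is to pre-filter the file before importing it. The Python sketch below streams a VCF line by line, keeps the header, and keeps only records on chosen chromosomes that have at most 9 alternative alleles. The function name `prefilter_vcf` and the choice of chromosome as the filter criterion are illustrative assumptions; any other criterion could be substituted.

```python
def prefilter_vcf(lines, chroms, max_alt=9):
    """Yield header lines plus records on `chroms` with <= max_alt ALT alleles."""
    for line in lines:
        if line.startswith("#"):
            yield line                # keep meta-information and the column header
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue                  # skip malformed records
        chrom, alt = cols[0], cols[4] # CHROM is column 1, ALT is column 5
        if chrom in chroms and len(alt.split(",")) <= max_alt:
            yield line
```

Because the function is a generator, it never holds the whole file in memory. A typical call would open the source and destination files and pass the result to `dst.writelines(prefilter_vcf(src, {"22"}))`, then run vcf on the smaller file.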

Important!

This program is no longer supported.
