Downloading and storing bioinformatic databases locally

January 20, 2014   

If you need to annotate biological data, there are plenty of resources online (UCSC Genome Browser, BioMart) and plenty of programmatic tools for interacting with these databases as well. But if you are going to annotate a large dataset (such as ChIP-Seq or RNA-Seq data), you will probably not want to rely on web-based services, because (a) it is inefficient and (b) you may get throttled or banned.

If you use Python, it's easy to download the data and store it in an SQLite database. You can then query the database with SQL and annotate large datasets quickly and efficiently.
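To make the idea concrete, here is a minimal sketch of loading a tab-delimited annotation file into SQLite and querying it with SQL. It uses only the standard-library sqlite3 module and is not the script from this post. The table name allele_freq, its columns, and the three sample rows are all made up for illustration, and an in-memory database stands in for a file on disk.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a downloaded annotation file.
# The column names and values are invented for this sketch.
SAMPLE_TSV = (
    "rsid\tchrom\tpos\tpopulation\tref_allele\tref_freq\n"
    "rs0000001\tchr1\t1000\tCEU\tA\t0.75\n"
    "rs0000001\tchr1\t1000\tYRI\tA\t0.40\n"
    "rs0000002\tchr2\t2000\tCEU\tG\t0.10\n"
)


def load_frequencies(conn, handle):
    """Create the table and bulk-insert rows from a tab-delimited file handle."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS allele_freq (
               rsid TEXT, chrom TEXT, pos INTEGER,
               population TEXT, ref_allele TEXT, ref_freq REAL)"""
    )
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [
        (r["rsid"], r["chrom"], int(r["pos"]),
         r["population"], r["ref_allele"], float(r["ref_freq"]))
        for r in reader
    ]
    # One transaction for the whole batch is much faster than row-by-row commits.
    with conn:
        conn.executemany("INSERT INTO allele_freq VALUES (?, ?, ?, ?, ?, ?)", rows)
    return len(rows)


# Use a path such as "hapmap.db" instead of ":memory:" to keep the data on disk.
conn = sqlite3.connect(":memory:")
n_loaded = load_frequencies(conn, io.StringIO(SAMPLE_TSV))

# Parameterized query: all reference allele frequencies for one population.
ceu_rows = conn.execute(
    "SELECT rsid, ref_freq FROM allele_freq WHERE population = ? ORDER BY rsid",
    ("CEU",),
).fetchall()
print(n_loaded, ceu_rows)
```

For a real download you would pass an open file handle instead of the io.StringIO sample, and for very large files insert in batches rather than building one list in memory.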

That is what I have done below for HapMap allele frequency data (2010-08_phaseII+III). It lets me retrieve allele frequency data from 26,278,275 rows across 11 populations instantly. The database itself is 3.22 GB. A zipped version (~1 GB) is available here.
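Lookups on a table with millions of rows generally stay fast only if the columns you filter on are indexed. The sketch below shows one way to do that with the standard-library sqlite3 module, on synthetic data. The schema, the index name idx_rsid_pop, and the lookup helper are assumptions for illustration, not the layout of the HapMap database described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE allele_freq (rsid TEXT, population TEXT, ref_freq REAL)"
)

# Synthetic rows: 1000 made-up SNP ids in two populations.
rows = [
    (f"rs{i:07d}", pop, (i % 100) / 100.0)
    for i in range(1000)
    for pop in ("CEU", "YRI")
]
with conn:
    conn.executemany("INSERT INTO allele_freq VALUES (?, ?, ?)", rows)
    # Composite index on the two columns used in the WHERE clause below.
    conn.execute("CREATE INDEX idx_rsid_pop ON allele_freq (rsid, population)")

QUERY = "SELECT ref_freq FROM allele_freq WHERE rsid = ? AND population = ?"


def lookup(conn, rsid, population):
    """Return the stored frequency for one SNP and population, or None."""
    row = conn.execute(QUERY, (rsid, population)).fetchone()
    return None if row is None else row[0]


# Ask SQLite how it plans to run the query; the last column of each
# plan row is a human-readable description that names any index used.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("rs0000042", "CEU")).fetchall()
uses_index = any("idx_rsid_pop" in str(step[-1]) for step in plan)

freq = lookup(conn, "rs0000042", "CEU")
print(uses_index, freq)
```

Building the index after the bulk insert, as done here, is usually quicker than maintaining it while rows are being loaded.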

You will need SQLAlchemy for this script to work. Install it using pip install sqlalchemy.

