From SRA Project to FastQ

October 25, 2014   

The Sequence Read Archive (SRA) stores sequence data from scientific studies in its own ‘sra’ format. Records are organized hierarchically:

Project ▸ Study ▸ Sample ▸ Experiment ▸ Run

Recently, I had to use the SRA to download all of the sequence data for a given project. This required querying the SRA database for all the runs in a sequencing project and converting them to FASTQs. Here’s how I did it:

First, you’ll need Entrez Direct and the SRA Toolkit. If you are on a Mac, you can install both using Homebrew.

brew install edirect # Entrez Direct
brew install sratoolkit

Once both are installed, the script below can be used to download all the sequence data associated with a given project. The script queries the project for all of its associated runs and converts them to zipped FASTQ files. Note that it also uses GNU Parallel (to speed things up) and FastQC for quality control. On a Mac, these can be installed using:

brew install parallel
brew install fastqc
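
The steps described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not necessarily the original script: the function names `extract_runs` and `download_project` are hypothetical, the run accessions are assumed to sit in the first column of the `runinfo` CSV, and the project accession is whatever you pass on the command line.

```shell
#!/usr/bin/env bash
# Sketch: download every run in an SRA project as gzipped FASTQ files.
# Function names are illustrative and not taken from the original post.

# Read a runinfo CSV on stdin and print the run accessions.
# Assumes the accession is the first column; the header row and blank
# lines are dropped because they do not look like SRR/ERR/DRR ids.
extract_runs() {
    cut -d ',' -f 1 | grep -E '^[SED]RR[0-9]+$'
}

# Query the SRA for all runs in the project given as $1, convert each
# run to gzipped FASTQ in parallel, then run FastQC on the results.
download_project() {
    project="$1"
    esearch -db sra -query "$project" \
        | efetch -format runinfo \
        | extract_runs \
        | parallel 'fastq-dump --split-files --gzip {}'
    fastqc ./*.fastq.gz
}

# Only touch the network when an accession is supplied.
if [ "$#" -gt 0 ]; then
    download_project "$1"
fi
```

Saved as, say, `sra_to_fastq.sh`, it would be run with a project accession as its only argument. Keeping the accession filter in its own function makes it easy to check against a saved `runinfo` file before downloading anything.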