From SRA Project to FastQ


The Sequence Read Archive (SRA) contains sequence data from scientific studies stored in a special ‘sra’ format. Data is stored in a hierarchical format:

Project ▸ Study ▸ Sample ▸ Experiment ▸ Run

Recently, I had to use the SRA to download all of the sequence data for a given project. This required querying the SRA database for all the runs in a sequencing project and converting them to FASTQs. Here’s how I did it:

First, you’ll need entrez direct, and the sra toolkit. If you are on a mac, you can install both using homebrew.

brew install edirect # Entrez Direct
brew install sratoolkit

Once installed, the script below can be used to download all the sequence data associated with a given project. The script queries the project for all the associated sequence data, and converts to zipped fastqs. Note that it also uses gnu parallel (to speed things up) and fastqc for quality control. These can be installed on mac using:

brew install parallel
brew install fastqc

Calculate Depth and Breadth of Coverage From a bam File


What is the difference between depth and coverage in sequencing experiments? Actually – they refer to the same thing, the average number of reads aligned to an individual base. Previously, I had thought coverage referred to the percentage of the genome with aligned reads to it; however the more appropriate term for this is breadth of coverage. This paper more precisely defines what breadth of coverage and depth of coverage mean.

Depth vs. Breadth of Coverage

If you need to calculate depth of coverage and breadth of coverage you can do so using the python script below. To use the script, feed the function coverage a bam file, and the function will return a dictionary of the depth of coverage, breadth of coverage, sum of depths (at every position), and number of bases mapped, for every contig/chromosome individually, and the entire genome as a whole.

Additionally, if you specify the optional second parameter specifying the mitochondrial chromosome, the script will calculate the parameters listed above for the nuclear genome and calculate the ratio of mitochondrial depth of coverage to nuclear depth of coverage. This can act as a proxy for mitochondrial count/content within a cell.

Requires BCFTools


Generate a bedfile of masked ranges a fasta file


If you are calling variants as part of a NGS experiment, you likely are considering filters such as depth, quality, and filtering low complexity regions from the variant dataset. Programs such as [repeatmasker][^1] are used to identify low complexity regions, replacing repetitive sequences with N’s. Repetitive regions have a tendency to be aligned with inappropriate reads and results in false positives.

If you’ve been provided with or have generated a masked fasta file for a given genome, you can use the following script convert a masked fasta (left) into a bed file (right) with the masked ranges.

>CHROMOSOME_I 1 15072423
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNGTTTGTTNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TATTAAAAACTGTTCNNNNNNNNNNNNNNNNNNNN
chrI    0   4831
chrI    4869    5146
chrI    5181    5305
chrI    5340    5677
chrI    5706    7409
chrI    7431    9549
chrI    9593    9651
chrI    9683    9979
chrI    10014   18897
chrI    18941   19432
chrI    19468   19747
chrI    19782   19877
chrI    19898   21314
chrI    21357   24849
chrI    24903   27411
chrI    27456   27535
chrI    27561   28015
chrI    28054   28505
chrI    28527   28918
chrI    28961   30659
chrI    30682   39364
chrI    39419   42234
chrI    42307   56428
chrI    56455   57860

Each range corresponds with a low complexity region within the fasta file. The resulting bed file can be used to filter variants out of a VCF file using a tool such as bcftools

Usage

python generate_masked_ranges.py <fasta_file> > output_ranges.txt

Script



Generate fasta sequence lengths


This one liner:

Takes a fasta file as input:

>EF491733
tcagattcaaacaccgacgacgatgacgtggcaaagtctcgacgtgtgcg
caattcgtgtatgtgtccagcaggacctcccggagaacgcggaccagtag
gaccaccaggtctacggggatcgccaggatggcct
>EF491734
tcacagggaatgaaggcactgttcgacttgatcgctttgagaccaagacc
cgtggcaattctcggagggcaatgcactgaagtgaacgagccaatagcga
tggcgctcaagtattggcaaatcgtgcaattatcctatgcggagacacat
gccaa
>EF491735
gtcttgcatgacccaaaaggctcctgctcttctgtttcttcttccaatac
atccttctaaccagttggaagggttgacgtatcaagacttcctgcatcaa
aacttcttgaatttgccttcatttgtcgcaattgtgcagc
>EF491736
taaatggaaggaatcacttggcgctgaagaatttgctctccgcacagctt
aatcagactggaactccaatggttaatccaatgatggctttacaacaaca
agcggccgcagtaaacctgattcccaacacaccaatttacccaccc
>EF491737
actctcgcaatcgtctctccccaaatgatgttaacatcactagaaatgac
aaccgaacatatagcccagtcactcctcgtatcacaacaagtgagcggac
agtaacaccggaacagcggtcgccgggtcgaaaagcgttcgaaaccattc
>EF491738
tccctcgttcattcacaacaaaggaaaagcaaactatgggccattcattg
ttgaaattatgaactatcatcagtattctgcaatgacaagtcatatggtc
aaagtaatgaaacggccccaccaggttccgccaatgaaggtcgaccctga
gg
>EF491739
tccttccaactgttgccaactttccaactacaagacacactgaaccagaa
actacgcggagacctctgtcgccttcaaaaatgacaccttctcttccttc
tcctaccaccaccactttgcctgttttctttttgtcacaaatcactgacg
gcgatgaatcagaagatgaa

Outputs sequence name and length:

EF491733    135
EF491734    155
EF491735    140
EF491736    146
EF491737    150
EF491738    152
EF491739    170

I made this today when I needed a way to generate sequence lengths required for some ChIP-Seq analysis.


Visualizing Pairwise Queries in R


August 2, 2014    Bioinformatics  gist

diseasexorg

If you are doing a lot biological research and are interested in identifying whether an association exists between all the pairwise combinations between two sets of terms (e.g. two gene lists), you can use pubmed search results as a proxy for relative association.

In the example below, I show the results from organisms x diseases to give a rough estimate of how much each disease is studied in a given organism. Of course, this should all be taken with a (big) grain of salt because these organisms and diseases have many synonyms or related terms (e.g. M. Musculus is often referred to as Mouse in the literature). Additionally, the result count is based off of whether or not the terms were found together within the title and abstract of the literature only – and not the body of the text in many cases.