An alfred workflow for working with sequence data


June 11, 2015    Alfred Genetics

I’ve put together a simple alfred workflow with a few utilities for working with sequence data.

Download the latest release of Seq-Utilities

Usage

Generate a random dna sequence 200 base pairs long.

dna1

Generate the complement, reverse complement, RNA, and protein of a DNA sequence

dna2

Open up blast and pre-populate the search field

blast ATGTCCTCGTTCGACCGTCGTATTGAAGCTGCATGTAAA

Split a GFF File into Individual Features


The General Feature Format is a widely used format for annotating genome sequences. If indexed with tabix, gff files can be viewed in IGV or elsewhere. While features are organized in a nested manner (e.g. genes > exons > variant), you can pull out the individual types and index them, or combine only a few for viewing in your genome browser.

I was working with wormbase annotation files, which combine all the different types of features together (genes, ncRNA, mRNA, binding site, operon, G Quartets, piRNAs, etc). This results in a very dense track in IGV which makes it difficult to disentangle what role individual features (or features of interest) might have.

As a result, I wrote this very short script for splitting the individual feature types apart, sorting them, and indexing them with tabix. This way they can be selectively viewed in IGV or elsewhere.


Aggregate FastQC Reports


FastQC is a phenomenal sequence quality assessment tool for evaluating both fastq and bam files. If you are working with a large number of sequence files (fastq), you may wish to compare results across all of them by comparing the plots that fastqc produces. I’m talking about the set of plots that look like this:

fastqc

FastQC can be invoked from the command line by typing fastqc <fastq/bam>, and it will produce an html report and associated zip file containing data, plots, and some ancillary files. The zip file contains an Images folder where the plots that become incorporated into the html report are stored. They are:

  • Adapter Content
  • Duplication Levels
  • Kmer Profiles
  • Per base N Content
  • Per Base Quality
  • Per Base Sequence Content
  • Per Sequence GC Content
  • Per Sequence Quality
  • Per Tile Quality
  • Sequence Length Distribution

The zipped folder also contains a file called fastqc_data.txt and summary.txt. fastqc_data.txt contains the raw data and statistics while summary.txt summarizes which tests have been passed.

To easily compare data across reports I wrote this short shell script (below) which will ‘aggregate’ images, statistics, and summaries by:

  1. Unzipping all the avaible fastqc zip files.
  2. Creating a fq_aggregated folder, and individual folders within for each plot type.
  3. Move images from each unzipped fastqc report into the folder to which it belongs, and renaming it as the filename of the report (e.g. sample name).
  4. Concatenating summary.txt files as fq_aggregated/summary.txt.
  5. Concatenating the basic statistics from each report into fq_aggregated/statistics.txt.

Images will be reorganized as shown below:

aggregate fastqc

summary.txt

fq_aggregated/summary.txt will produce a tab delimited file that looks like this:

PASS Basic Statistics SeqA.fq
PASS Per base sequence quality SeqA.fq
PASS Per tile sequence quality SeqA.fq
PASS Per sequence quality scores SeqA.fq
FAIL Per base sequence content SeqA.fq
PASS Per sequence GC content SeqA.fq
PASS Per base N content SeqA.fq
   
PASS Basic Statistics SeqB.fq
PASS Per base sequence quality SeqB.fq
PASS Per tile sequence quality SeqB.fq
PASS Per sequence quality scores SeqB.fq
PASS Per base sequence content SeqB.fq
FAIL Per sequence GC content SeqB.fq
FAIL Per base N content SeqB.fq

statistics.txt

fq_aggregated/statistics.txt will look like this:

PASS Basic Statistics SeqA.fq
PASS Per base sequence quality SeqA.fq
PASS Per tile sequence quality SeqA.fq
PASS Per sequence quality scores SeqA.fq
FAIL Per base sequence content SeqA.fq
PASS Per sequence GC content SeqA.fq
PASS Per base N content SeqA.fq
   
PASS Basic Statistics SeqB.fq
PASS Per base sequence quality SeqB.fq
PASS Per tile sequence quality SeqB.fq
PASS Per sequence quality scores SeqB.fq
PASS Per base sequence content SeqB.fq
FAIL Per sequence GC content SeqB.fq
FAIL Per base N content SeqB.fq
The Code

Downgrade a VCF for viewing in IGV (4.2 > 4.1)


If you are using the new version of bcftools, and you frequently use IGV to view variants you may have run into issues loading the file in IGV. IGV currently does not support VCF version 4.2. However, I’ve been able to tweak the headers of newer VCF files to allow these variants to be viewable in IGV again.

All you have to do is revert the version number in the first line and replace a few characters IGV does not like. Below is a bash function that will do this – saving any inputted VCF as {vcf_filename}.dg.vcf.gz. The script also indexes the file making it ready for use.


Rename Samples within a VCF/BCF


These two simple bash functions make it easy to rename samples within a bcf file by using the filename given (if it is a single sample file) or adding a prefix to all samples. This is useful if you want to merge bcf files where the sample names are identical in both (for comparison purposes).