Split a GFF File into Individual Features

The General Feature Format (GFF) is a widely used format for annotating genome sequences. Once sorted, compressed with bgzip, and indexed with tabix, GFF files can be viewed in IGV or other genome browsers. Although features are organized in a nested manner (e.g. genes > transcripts > exons), you can pull out the individual feature types and index them separately, or combine only a few for viewing in your genome browser.
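To make "pulling out the individual types" concrete: each GFF feature line has nine tab-separated columns, and the feature type sits in the third. Here is a minimal sketch of reading one line. The function name `parse_gff_line` and the example line are made up for illustration.

```python
# Column names as given in the GFF3 specification.
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff_line(line):
    """Split one GFF feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF_COLUMNS, fields))
    # Coordinates are 1-based integers; everything else stays a string.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


# Made-up example line, not taken from a real annotation file.
example = "I\tWormBase\tgene\t100\t200\t.\t+\t.\tID=gene1\n"
print(parse_gff_line(example)["type"])
```

Everything that follows only ever looks at the first, third, and fourth columns (sequence name, feature type, start position).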

I was working with WormBase annotation files, which combine all the different types of features together (genes, ncRNA, mRNA, binding sites, operons, G quartets, piRNAs, etc.). This results in a very dense track in IGV, which makes it difficult to disentangle what role individual features (or features of interest) might have.

As a result, I wrote this very short script (saved here as process_gff.py) for splitting the individual feature types apart, followed by a few shell commands for sorting them and indexing them with tabix. This way they can be selectively viewed in IGV or elsewhere.

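import sys

current_feature = None
out = None

# Input must already be sorted by feature type (column 3), with comment lines removed.
for line in sys.stdin:
    feature = line.split("\t")[2]
    if feature != current_feature:
        # Feature type changed: close the previous file and open the next one.
        if out is not None:
            out.close()
        out = open(feature + ".gff", "a")
        current_feature = feature
    out.write(line)

if out is not None:
    out.close()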

gunzip -kfc <GFF> | grep -v ^"#" | sort -k3,3 | python process_gff.py
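If sorting by feature type first is inconvenient, a variant that keeps one open file per feature type does not need the `sort -k3,3` step. This is only a sketch and not the script above: the function name `split_by_feature` and the `outdir` parameter are my own, it overwrites existing files rather than appending to them, and it assumes the number of distinct feature types stays below the system's open-file limit.

```python
import os
import tempfile


def split_by_feature(lines, outdir="."):
    """Write each feature line to <outdir>/<type>.gff; return the sorted type names."""
    handles = {}
    try:
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue  # skip comments, directives, and blank lines
            fields = line.split("\t")
            if len(fields) < 3:
                continue  # not a feature line
            feature = fields[2]
            if feature not in handles:
                path = os.path.join(outdir, feature + ".gff")
                handles[feature] = open(path, "w")
            handles[feature].write(line)
    finally:
        # Close every output file, even if a line caused an error.
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Made-up input lines, written to a throwaway directory for demonstration.
demo = [
    "##gff-version 3\n",
    "I\tWormBase\tgene\t100\t200\t.\t+\t.\tID=g1\n",
    "I\tWormBase\texon\t100\t150\t.\t+\t.\tParent=g1\n",
    "I\tWormBase\tgene\t300\t400\t.\t-\t.\tID=g2\n",
]
with tempfile.TemporaryDirectory() as tmp:
    print(split_by_feature(demo, tmp))
```

On a real file, passing `sys.stdin` as `lines` would let the function sit at the end of the same pipe, without the `grep` and `sort` steps.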

for f in *.gff; do
    i="${f%.gff}"  # strip the extension so the output names are built correctly
    (grep ^"#" "$f"; grep -v ^"#" "$f" | sort -k1,1 -k4,4n) | bgzip > "$i.sorted.gff.gz"
    tabix -p gff "$i.sorted.gff.gz"
    rm "$f"
done
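To combine only a few feature types into a single track, the per-feature files can be merged and sorted again by sequence name and start position, which is what `sort -k1,1 -k4,4n` does in the loop above. The sketch below is one way to do that in Python. The function names `read_features` and `sort_features` are hypothetical, and it relies on bgzip output being readable by Python's `gzip` module.

```python
import gzip


def read_features(path):
    """Return the non-comment lines of a gzip- or bgzip-compressed GFF file."""
    with gzip.open(path, "rt") as handle:
        return [line for line in handle if not line.startswith("#")]


def sort_features(lines):
    """Sort feature lines by sequence name (column 1), then numeric start (column 4)."""
    def key(line):
        fields = line.split("\t")
        return (fields[0], int(fields[3]))
    return sorted(lines, key=key)


# Made-up feature lines standing in for the contents of two split files.
demo = [
    "I\tWormBase\toperon\t500\t900\t.\t+\t.\tID=op1\n",
    "II\tWormBase\tgene\t50\t90\t.\t-\t.\tID=g2\n",
    "I\tWormBase\tgene\t100\t200\t.\t+\t.\tID=g1\n",
]
for line in sort_features(demo):
    print(line, end="")
```

In practice you would concatenate the results of `read_features` for the files you care about, pass them through `sort_features`, write the lines out, and then run bgzip and tabix on the result as before. Python may order sequence names differently from `sort` depending on your locale, but each sequence's lines still end up grouped together and ordered by start position.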