Double Checking FASTQs

May 24, 2014   

After a sequencing project, quality control is one of the first things you will need to do. Sample mix-ups and other issues can and do happen, and systematic biases can also arise by machine and lane.

This script extracts basic information from a set of FASTQs and writes it to a summary file (fastq_summary.txt). It works with demultiplexed FASTQs generated by Illumina machines whose read headers appear in the following format:

@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1


  • @HWI-EAS209_0006_FC706VJ – machine name (the leading @ marks the start of a header line)
  • 5 – lane
  • 58 – tile within flowcell lane
  • 5894 – x coordinate of cluster within tile
  • 21141 – y coordinate of cluster within tile
  • #ATCACG – index
  • /1 – member of pair (/1 or /2)
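
To make the field breakdown concrete, here is a minimal Python sketch that splits one header of this form into its parts. The names HEADER_RE and parse_header are illustrative only, and the pattern assumes the header follows exactly the layout shown above; newer Illumina header formats would need a different pattern.

```python
import re

# Pattern for the header layout described above (an assumption, not a general
# FASTQ parser): @<machine>:<lane>:<tile>:<x>:<y>#<index>/<pair>
HEADER_RE = re.compile(
    r"^@(?P<machine>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_header(line):
    """Return a dict of header fields, or None if the line does not match."""
    match = HEADER_RE.match(line.strip())
    return match.groupdict() if match else None


fields = parse_header("@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1")
print(fields["machine"], fields["lane"], fields["index"], fields["pair"])
# prints: HWI-EAS209_0006_FC706VJ 5 ATCACG 1
```

All values come back as strings, which is usually fine for grouping and counting; convert lane or tile with int() if you need to sort numerically.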

The script below will extract the machine name, lane, index, and pair.
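
As an illustration of this kind of extraction (a sketch, not the author's original script), the Python below counts reads per machine, lane, index, and pair combination and writes the totals to a tab-separated summary. The function names, the column layout of the output, and the .gz handling are all assumptions made for the example. It also assumes each record is exactly four lines long and that every header follows the format described above.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a FASTQ file as text, handling gzip-compressed files by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def summarize_fastq(path):
    """Count reads per (machine, lane, index, pair) combination in one file."""
    counts = Counter()
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            # Assumes four-line records, so headers sit on lines 0, 4, 8, ...
            if line_number % 4 != 0 or not line.startswith("@"):
                continue
            name, _, rest = line.strip().lstrip("@").partition("#")
            index, _, pair = rest.partition("/")
            parts = name.split(":")
            machine, lane = parts[0], parts[1]
            counts[(machine, lane, index, pair)] += 1
    return counts


def write_summary(paths, out_path="fastq_summary.txt"):
    """Write one tab-separated row per file and field combination."""
    with open(out_path, "w") as out:
        out.write("file\tmachine\tlane\tindex\tpair\treads\n")
        for path in paths:
            for key, n in sorted(summarize_fastq(path).items()):
                out.write("\t".join([path, *key, str(n)]) + "\n")
```

Calling write_summary with a list of FASTQ paths produces fastq_summary.txt in the current directory. A file that reports more than one machine, lane, or index is a quick hint that samples may have been mixed or merged.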
