Aggregate FastQC Reports - Daniel E. Cook

Update: MultiQC (2019-06-21)

After I originally published this script for aggregating FastQC reports, Phil Ewels published MultiQC. MultiQC aggregates quality-control and other associated data from sequencing tools into an interactive report. Instead of the script below, you can simply run:

# Run this command where your *_fastqc.zip files are
multiqc .

This will output a report that looks like this:

multiqc screenshot

Publication

MultiQC: Summarize analysis results for multiple tools and samples in a single report Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller Bioinformatics (2016) doi: 10.1093/bioinformatics/btw354 PMID: 27312411

Original Post (2014-12-28)

FastQC is a phenomenal sequence quality assessment tool for evaluating both fastq and bam files. If you are working with a large number of sequence files (fastq), you may wish to compare results across all of them by comparing the plots that fastqc produces. I’m talking about the set of plots that look like this:

fastqc

FastQC can be invoked from the command line by typing fastqc <fastq/bam>, and it will produce an HTML report and an associated zip file containing data, plots, and some ancillary files. The zip file contains an Images folder that stores the plots incorporated into the HTML report. They are:

Adapter Content
Duplication Levels
Kmer Profiles
Per base N Content
Per Base Quality
Per Base Sequence Content
Per Sequence GC Content
Per Sequence Quality
Per Tile Quality
Sequence Length Distribution

The zipped folder also contains two text files, fastqc_data.txt and summary.txt. fastqc_data.txt contains the raw data and statistics, while summary.txt lists which tests passed and which failed.
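To look at either file without unpacking a whole report, one option is unzip -p, which writes an archive member to standard output. The following is a minimal sketch, assuming the archive has the usual layout with a single top-level folder; peek_report_file and SeqA_fastqc.zip are hypothetical names used only for illustration.

```shell
# peek_report_file ZIP NAME: print one file from a FastQC zip without extracting it.
# The '*/' prefix matches the top-level folder inside the archive (assumed layout).
peek_report_file() {
    unzip -p "$1" "*/$2"
}

# Example with a hypothetical archive name: show which tests passed.
if [ -f SeqA_fastqc.zip ]; then
    peek_report_file SeqA_fastqc.zip summary.txt
fi
```

This is handy for a quick check of one sample; for comparing many samples the aggregation described below is more convenient.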

To easily compare data across reports I wrote this short shell script (below) which will ‘aggregate’ images, statistics, and summaries by:

Unzipping all the available fastqc zip files.
Creating a fq_aggregated folder, and individual folders within for each plot type.
Moving images from each unzipped fastqc report into the folder for its plot type, and renaming each image after the report it came from (e.g. the sample name).
Concatenating summary.txt files as fq_aggregated/summary.txt.
Concatenating the basic statistics from each report into fq_aggregated/statistics.txt.

Images will be reorganized as shown below:

aggregate fastqc

`summary.txt`

fq_aggregated/summary.txt is a tab-delimited file that looks like this:


PASS	Basic Statistics	SeqA.fq
PASS	Per base sequence quality	SeqA.fq
PASS	Per tile sequence quality	SeqA.fq
PASS	Per sequence quality scores	SeqA.fq
FAIL	Per base sequence content	SeqA.fq
PASS	Per sequence GC content	SeqA.fq
PASS	Per base N content	SeqA.fq
	…
PASS	Basic Statistics	SeqB.fq
PASS	Per base sequence quality	SeqB.fq
PASS	Per tile sequence quality	SeqB.fq
PASS	Per sequence quality scores	SeqB.fq
PASS	Per base sequence content	SeqB.fq
FAIL	Per sequence GC content	SeqB.fq
FAIL	Per base N content	SeqB.fq
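Because every row carries the sample name, the aggregated summary is easy to filter. Here is a small sketch, assuming the three tab-separated columns shown above (status, test, file name); failed_modules and fail_counts are hypothetical helper names, not part of the original script.

```shell
# failed_modules FILE: list every failed test as "sample<TAB>test", sorted by sample.
failed_modules() {
    awk -F'\t' '$1 == "FAIL" { print $3 "\t" $2 }' "$1" | sort
}

# fail_counts FILE: count failed tests per sample as "sample<TAB>count".
fail_counts() {
    awk -F'\t' '$1 == "FAIL" { n[$3]++ } END { for (s in n) print s "\t" n[s] }' "$1" | sort
}

# Only run when the aggregated summary is present.
if [ -f fq_aggregated/summary.txt ]; then
    failed_modules fq_aggregated/summary.txt
fi
```

Swapping "FAIL" for "WARN" in the awk condition should list warnings the same way.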

`statistics.txt`

fq_aggregated/statistics.txt will look like this:


PASS	Basic Statistics	SeqA.fq
PASS	Per base sequence quality	SeqA.fq
PASS	Per tile sequence quality	SeqA.fq
PASS	Per sequence quality scores	SeqA.fq
FAIL	Per base sequence content	SeqA.fq
PASS	Per sequence GC content	SeqA.fq
PASS	Per base N content	SeqA.fq
	…
PASS	Basic Statistics	SeqB.fq
PASS	Per base sequence quality	SeqB.fq
PASS	Per tile sequence quality	SeqB.fq
PASS	Per sequence quality scores	SeqB.fq
PASS	Per base sequence content	SeqB.fq
FAIL	Per sequence GC content	SeqB.fq
FAIL	Per base N content	SeqB.fq

The Code

# Run this script in a directory containing zip files from fastqc. It aggregates images of each
# type into individual folders, so looking across data is quick.
# Requires bash (uses arrays).

shopt -s nullglob # Patterns with no matches expand to nothing
zips=(*.zip)

# Unzip every report, overwriting any previous extraction.
for i in "${zips[@]}"; do
    unzip -o "$i" &>/dev/null
done

# Each unzipped folder is named after its zip file, minus the .zip extension.
fastq_folders=("${zips[@]%.zip}")

rm -rf fq_aggregated # Remove aggregate folder if present
mkdir fq_aggregated

# Move each image into a folder named after its plot type, renamed after the report.
for folder in "${fastq_folders[@]}"; do
    for img in "${folder}"/Images/*.png; do
        img_name=$(basename "$img" .png)
        mkdir -p "fq_aggregated/${img_name}"
        mv "$img" "fq_aggregated/${img_name}/${folder/_fastqc/}.png"
    done
done

# Concatenate Summaries
for folder in "${fastq_folders[@]}"; do
    cat "${folder}/summary.txt" >> fq_aggregated/summary.txt
done

# Concatenate Statistics (lines 4-10 of fastqc_data.txt), then remove the unzipped folder.
for folder in "${fastq_folders[@]}"; do
    head -n 10 "${folder}/fastqc_data.txt" | tail -n 7 | awk -v f="${folder/_fastqc/}" '{ print $0 "\t" f }' >> fq_aggregated/statistics.txt
    rm -rf "${folder}"
done
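
The statistics step relies on the basic statistics sitting on fixed line numbers of fastqc_data.txt (head -n 10 | tail -n 7). If that row count ever differs, selecting the block by its markers may be more robust. The sketch below assumes the block opens with a line starting ">>Basic Statistics" and closes with ">>END_MODULE", and that the unzipped *_fastqc folders are still on disk (the script above deletes them at the end). basic_stats is a hypothetical helper name.

```shell
# basic_stats FILE LABEL: print the Basic Statistics rows of FILE, each tagged with LABEL.
basic_stats() {
    awk -v f="$2" '
        /^>>Basic Statistics/ { in_block = 1; next }   # start of the block (assumed marker)
        /^>>END_MODULE/       { in_block = 0 }          # end of any block (assumed marker)
        in_block && !/^#/     { print $0 "\t" f }       # skip the "#Measure" header row
    ' "$1"
}

# Apply it to every unzipped report in the current directory, if any exist.
for data in ./*_fastqc/fastqc_data.txt; do
    [ -f "$data" ] || continue
    folder=$(basename "$(dirname "$data")")
    basic_stats "$data" "${folder%_fastqc}"
done
```

Redirecting the loop's output to fq_aggregated/statistics.txt would give a file in the same shape as the one the script produces.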