fq-meta¶
Summarize FASTQ files, outputting the following:
- A best guess as to the type of sequencer based on instrument and flowcell information
- A best guess for the quality score format
- The barcode/index used to multiplex the sample which can be useful for identifying samples
- The machine, run, and other metadata about the FASTQ
- Information derived from the filename
- The file location
fq-meta
is useful for taking an inventory of sequencing data, and is particularly helpful if you are inheriting a poorly organized sequencing project.
Data is output in a tab-delimited format so that this tool can be used to assemble a database of FASTQs and their associated information. You can do so like this:
sc fq-meta --header > my_fq_database.txt # Use this to output just the variable names
sc fq-meta *.fq.gz >> my_fq_database.txt
The resulting dataset can be combined with other metadata and filtered to select samples for processing in a pipeline.
You can also parallelize the operation with GNU-parallel.
sc fq-meta --header > my_fq_database.txt # Use this to output just the variable names
parallel -j 8 sc fq-meta ::: find . -name '*.fq.gz' >> my_fq_database.txt
Example output
machine | sequencer | prob_sequencer | flowcell | flowcell_description | run | lane | sequence_id | index1 | index2 | qual_format | qual_phred | qual_multiple | min_qual | max_qual | n_lines | basename | absolute_path |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
D00446 | HiSeq2000/2500 | high:machine+flowcell | C8HN4ANXX | High Output (8-lane) v4 flow cell | 1 | 8 | GCTCGGTA | Sanger;Illumina 1.8+ | Phred+33 | TRUE | 14 | 14 | 1 | illumina_2000_2500.fq | … | ||
K00100 | HiSeq3000/4000 | high:machine+flowcell | H300JBBXX | (8-lane) v1 flow cell | 33 | 6 | GCCAAT | Sanger;Illumina 1.8+ | Phred+33 | TRUE | 14 | 14 | 1 | illumina_3000_4000.fq | … | ||
D00209 | HiSeq2000/2500 | high:machine+flowcell | CACDKANXX | High Output (8-lane) v4 flow cell | 258 | 6 | CGCAGTT | Sanger;Illumina 1.8+ | Phred+33 | TRUE | 0 | 37 | 1 | illumina_6.fq | … | ||
D00209 | HiSeq2000/2500 | high:machine+flowcell | CACDKANXX | High Output (8-lane) v4 flow cell | 258 | 6 | GAGCAAG | Sanger;Illumina 1.8+ | Phred+33 | TRUE | 0 | 37 | 1 | illumina_7.fq | … |
__Input____
fq-meta
accepts both gzipped FASTQs (.fq.gz
, .fastq.gz
~ inferred from .gz
extension) and raw text FASTQs.