fq-dedup
The fq-dedup command de-duplicates a FASTQ by read ID (e.g. @D00446:1:140101_HWI-D00446_0001_C8HN4ANXX:8:2210:1238:2018). Ideally, you should never see duplicate read IDs, but I have observed them when a power outage occurred during a sequencing run.
The command uses a Bloom filter to identify duplicates. It has to read through the file twice, and then outputs the original FASTQ with the duplicates removed.
sc fq-dedup myfastq.fq.gz 2> dup.err | gzip > dedupped.fq.gz
fq-dedup can read both .fq.gz and .fq files. It sends the deduplicated FASTQ to stdout.
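The two-pass idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual fq-dedup implementation: the names BloomFilter and dedup_records, the filter size, and the number of hashes are all hypothetical, and an in-memory list stands in for reading the file twice. The first pass collects read IDs that the Bloom filter reports as already seen; the second pass counts only those candidates exactly, so a candidate that turns out to occur once is a false positive.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """A small Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions from the two halves of one digest.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        """Insert item; return True if it was possibly present already."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def dedup_records(records):
    """Deduplicate (read_id, sequence, quality) tuples by read ID.

    Returns (unique_records, stats). Hypothetical helper for illustration.
    """
    # Pass 1: collect IDs the Bloom filter reports as already seen.
    bloom = BloomFilter()
    candidates = set()
    for read_id, _, _ in records:
        if bloom.add(read_id):
            candidates.add(read_id)

    # Pass 2: count candidates exactly and keep the first copy of each read.
    counts = Counter()
    unique = []
    for rec in records:
        read_id = rec[0]
        if read_id in candidates:
            counts[read_id] += 1
            if counts[read_id] > 1:
                continue  # true duplicate: skip it
        unique.append(rec)

    stats = {
        "total_reads": len(records),
        "duplicates": len(records) - len(unique),
        # Candidates seen only once were flagged wrongly by the filter.
        "false_positives": sum(1 for c in counts.values() if c == 1),
    }
    return unique, stats


records = [
    ("@r1", "ACGT", "IIII"),
    ("@r2", "GGCC", "IIII"),
    ("@r1", "ACGT", "IIII"),
    ("@r3", "TTAA", "IIII"),
    ("@r2", "GGCC", "IIII"),
]
unique, stats = dedup_records(records)
print(stats)
```

Because a Bloom filter never gives false negatives, every repeated ID is guaranteed to land in the candidate set, so only the (usually small) candidate set needs exact bookkeeping in memory.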
Output
Once complete, the following is sent to stderr:
total_reads: 2500000
duplicates 1086043
false-positive: 0
false-positive-rate: 0.0
The false-positive values are for diagnostics only. They are based on reads that the Bloom filter initially labeled as duplicates but that were later found not to be true duplicates.
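If you want to work with these numbers downstream, a short Python sketch like the one below could read the summary back from the stderr capture (dup.err in the example command). The helper name parse_summary is hypothetical, and the parser assumes the summary looks exactly like the sample above; it accepts keys with or without a trailing colon and skips any line that is not a simple key/value pair.

```python
def parse_summary(text):
    """Parse 'key: value' or 'key value' lines into a dict of numbers."""
    summary = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) != 2:
            continue  # not a key/value pair
        key, value = parts
        try:
            summary[key] = float(value) if "." in value else int(value)
        except ValueError:
            continue  # value is not numeric
    return summary


# Sample text copied from the output shown above; in practice this
# would be the contents of the stderr capture file.
example = """total_reads: 2500000
duplicates 1086043
false-positive: 0
false-positive-rate: 0.0"""

summary = parse_summary(example)
dup_fraction = summary["duplicates"] / summary["total_reads"]
print(f"{dup_fraction:.1%} of reads were duplicates")  # prints 43.4% of reads were duplicates
```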
Benchmark
2.5M Reads; 1M+ duplicates; 2015 MacBook Pro
0m58.738s