Massachusetts Institute of Technology (MIT)
Computer Science and Artificial Intelligence Laboratory (CSAIL)
Theory of Computation Group (TOC)
Computation and Biology Group (CompBio)

Email queries to rqs@mit.edu.

RQS (Read Quality-score Sparsifier) is an efficient de novo quality score compression tool based on traversing the k-mer landscape of NGS read datasets.

Source code and a test suite for an implementation of RQS are available as a GitHub repository. Compilation requires the Boost Multi-index library and GCC 4.7 (or another C++11-compliant compiler). The code assumes the x86-64 machine architecture, 8-bit bytes, and 8-byte unsigned longs. Note that the implementation in this package does not scale to full-genome or other large datasets because of its high memory requirements. This package was written as a proof of principle for the RECOMB proceedings above and will NOT be actively maintained.
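
The architecture assumptions can be checked before compiling. The following is only a minimal sketch, assuming a POSIX shell in which uname and getconf are available. The function name rqs_platform_ok is hypothetical and not part of the RQS package. It checks the machine name and the width of a long, but not the 8-bit byte assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of the RQS package.
# Succeeds if the given machine name ($1) and long width in bits ($2)
# match the x86-64 / 8-byte unsigned long assumptions stated above.
rqs_platform_ok() {
    case "$1" in
        x86_64|amd64) ;;      # common spellings of x86-64 reported by uname -m
        *) return 1 ;;
    esac
    [ "$2" = "64" ]
}

arch=$(uname -m)
# Fall back to "unknown" if getconf is missing or does not know LONG_BIT.
long_bits=$(getconf LONG_BIT 2>/dev/null || echo unknown)

if rqs_platform_ok "$arch" "$long_bits"; then
    echo "platform looks compatible: $arch, ${long_bits}-bit long"
else
    echo "platform may be incompatible: $arch, ${long_bits}-bit long"
fi
```

Passing the values as arguments keeps the check itself separate from how the values are obtained, so it can be tried with other inputs, for example rqs_platform_ok i686 32.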

A pre-compiled binary package of RQS, including an example script and test files, is also available: RQS-bin-0.1.0.tar.bz2.
It was produced on a Debian GNU/Linux 6.0 x86-64 machine using GCC 4.7.2, and is linked against

The binary has been tested on the following setups:

Additionally, we have made available the following resources:

To sparsify the quality scores for the exome reads using the provided dictionary, place all files in the RQS directory and run:

bunzip2 dictionary-HG_chr21_r50.db.bz2
bunzip2 NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam.bz2
./sparsify dictionary-HG_chr21_r50.db NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam
./threshold 'I' NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam.filtered

Note that this does not compress the file directly; it only sparsifies the quality vectors. However, the increased redundancy makes the quality vectors far more compressible, as can be measured with any general-purpose text compressor. We recommend piping the original and sparsified quality scores through bzip2 and comparing the resulting sizes:

cut -f11 NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam | bzip2 > original_quals.bz2
cut -f11 NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam.filtered.reduced | bzip2 > sparsified_quals.bz2
ls -sh *.bz2
    52M original_quals.bz2     3.6M sparsified_quals.bz2

If everything has gone as it should, ls -sh should report sizes close to those displayed above.
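
To express the comparison as a single number, the ratio of the two file sizes can be computed from their exact byte counts. The following is a minimal sketch, not part of RQS. The function name size_ratio is hypothetical, and the final call assumes the two .bz2 files created by the commands above are in the current directory.

```shell
#!/bin/sh
# Hypothetical helper, not part of the RQS package.
# Prints how many times larger file $1 is than file $2, to one decimal place.
size_ratio() {
    if [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "size_ratio: both arguments must be existing files" >&2
        return 1
    fi
    # wc -c gives exact byte counts; tr strips padding some wc versions add.
    orig_bytes=$(wc -c < "$1" | tr -d ' ')
    new_bytes=$(wc -c < "$2" | tr -d ' ')
    if [ "$new_bytes" -eq 0 ]; then
        echo "size_ratio: $2 is empty" >&2
        return 1
    fi
    # awk performs the floating-point division that sh arithmetic lacks.
    awk -v o="$orig_bytes" -v n="$new_bytes" 'BEGIN { printf "%.1f\n", o / n }'
}

# Only run the comparison if the files from the commands above exist.
if [ -f original_quals.bz2 ] && [ -f sparsified_quals.bz2 ]; then
    size_ratio original_quals.bz2 sparsified_quals.bz2
fi
```

The sizes displayed above (52M and 3.6M) correspond to a ratio of roughly 14, although ls -sh rounds its figures, so the exact value printed may differ slightly.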