Massachusetts Institute of Technology (MIT)
Computer Science and Artificial Intelligence Laboratory (CSAIL)
Theory of Computation Group (TOC)
Computation and Biology Group (CompBio)
Email queries: rqs@mit.edu
RQS (Read Quality-score Sparsifier) is an efficient de novo quality score compression tool based on traversing the k-mer landscape of NGS read datasets.
Source code and a test suite for an implementation of RQS are available as a GitHub repo. Compilation requires Boost Multi-index and GCC 4.7 (or another C++11-compliant compiler). The code assumes the x86-64 machine architecture, 8-bit bytes, and 8-byte unsigned longs. Note that the implementation in this package does not scale to full-genome or other large datasets because of its high memory requirements. This package was written as a proof of principle for the RECOMB proceedings above and will NOT be actively maintained.
A pre-compiled binary package of RQS, including an example script and test files, is also available: RQS-bin-0.1.0.tar.bz2.
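Before compiling, it may be worth confirming that the build machine matches the type-width assumptions stated above. The following is a minimal sketch only: check_widths is a hypothetical helper and is not part of the RQS package, and the suggested call assumes a POSIX getconf that reports CHAR_BIT and LONG_BIT.

```shell
# Hypothetical helper (not part of the RQS package): succeed only when the
# reported type widths match what the RQS code is stated to assume.
check_widths() {
    char_bit=$1   # bits per byte, expected 8
    long_bit=$2   # bits per long, expected 64 (an 8-byte unsigned long)
    if [ "$char_bit" -eq 8 ] && [ "$long_bit" -eq 64 ]; then
        echo "type widths OK"
    else
        echo "unexpected widths: CHAR_BIT=$char_bit LONG_BIT=$long_bit" >&2
        return 1
    fi
}

# Typical call, assuming getconf is available on the build machine:
#   check_widths "$(getconf CHAR_BIT)" "$(getconf LONG_BIT)"
```

The widths are passed as arguments rather than queried inside the function, so the check can also be run against values obtained some other way.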
It was produced on a Debian GNU/Linux 6.0 x86-64 machine using GCC 4.7.2, and is linked against
Additionally, we have made available the following resources:
dictionary-HG_chr21_r50.db.bz2
352 MiB, MD5: b3cd7312ec059251acf1d2de2aa830b0
NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam.bz2
141 MiB, MD5: e84f1e99b384990cdd1a35660f835b1b
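The MD5 digests listed above can be checked after download and before decompression. The following is a minimal sketch, assuming GNU coreutils md5sum is installed; verify_md5 is a hypothetical helper and is not part of the RQS package.

```shell
# Hypothetical helper (not part of the RQS package): compare a file's MD5
# digest with an expected value. Assumes GNU coreutils md5sum.
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Intended use with the resources listed above (run on the .bz2 files as
# downloaded, before bunzip2):
#   verify_md5 dictionary-HG_chr21_r50.db.bz2 b3cd7312ec059251acf1d2de2aa830b0
#   verify_md5 NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam.bz2 e84f1e99b384990cdd1a35660f835b1b
```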
To sparsify the quality scores for the exome reads using the provided dictionary, place all files in the RQS directory and run
bunzip2 dictionary-HG_chr21_r50.db.bz2
bunzip2 NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam.bz2
./sparsify dictionary-HG_chr21_r50.db NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam
./threshold 'I' NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam.filtered
Note that this does not directly compress the file; it only sparsifies the quality vectors. However, the increased redundancy makes the quality vectors far more compressible, as can be measured with any general-purpose text compressor. We recommend piping the original and sparsified quality scores through bzip2 and comparing the resulting sizes:
cut -f11 NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam | bzip2 > original_quals.bz2
cut -f11 NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam.chr21.2x76bp.sam.filtered.reduced | bzip2 > sparsified_quals.bz2
ls -sh *.bz2
52M original_quals.bz2 3.6M sparsified_quals.bz2
If everything has gone as it should, the output of ls -sh should be as displayed above.
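Because ls -sh rounds its sizes, an exact ratio can be computed from byte counts instead. The following is a minimal sketch; size_ratio is a hypothetical helper and is not part of the RQS package, and it assumes a wc that prints a bare byte count when reading from standard input (as GNU wc does).

```shell
# Hypothetical helper (not part of the RQS package): print how many times
# larger the first file is than the second, to one decimal place.
size_ratio() {
    orig_bytes=$(wc -c < "$1") || return 1
    new_bytes=$(wc -c < "$2") || return 1
    if [ "$new_bytes" -eq 0 ]; then
        echo "second file is empty" >&2
        return 1
    fi
    # awk does the floating-point division that POSIX shell arithmetic lacks.
    awk -v a="$orig_bytes" -v b="$new_bytes" 'BEGIN { printf "%.1f\n", a / b }'
}

# Intended use with the two files produced above:
#   size_ratio original_quals.bz2 sparsified_quals.bz2
```

With the rounded sizes shown above (52M and 3.6M), the ratio would be roughly 14; the exact figure depends on the byte counts.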