
Quartz

Massachusetts Institute of Technology (MIT)
Computer Science and Artificial Intelligence Laboratory (CSAIL)
Theory of Computation Group (TOC)
Computation and Biology Group (CompBio)

Email queries: bab@mit.edu



Quartz (QUAlity score Reduction at Terabyte scale) is an efficient de novo quality score compression tool based on traversing the k-mer landscape of NGS read datasets. It takes FASTQ files as input and outputs FASTQ files with modified quality score strings. The modification resets nearly all quality scores to a (user-selectable) high-confidence value, which decreases the entropy and allows any off-the-shelf text compressor (e.g. Gzip, Bzip2) to achieve extremely high compression ratios. Furthermore, although the compression is lossy, we have found that using the k-mer landscape of the read sequence to decide which quality scores are reset actually improves downstream genotyping accuracy.

Y. William Yu, Deniz Yorukoglu, Jian Peng, Bonnie Berger. "Quality score compression improves genotyping accuracy". In press, Nature Biotechnology.

Up-to-date source code is available on GitHub: https://github.com/yunwilliamyu/quartz.

An earlier pre-compiled static binary package of Quartz, including an example script and test files, can also be downloaded: quartz-bin-0.1.1.tar.bz2.
It was produced on a Debian GNU/Linux 6.0 x86-64 machine using GCC 4.7.2, and has been tested on the following setups:

The source package published with the paper can be found here: quartz-v0.2.tar.bz2.

Additionally, a 507 MiB package containing the results of post-Quartz variant calling and an R script for regenerating all plots in the paper can be found here: quartz_figure_regeneration.tar.bz2. Example shell scripts with the options we used for mappers and variant callers can be found here: quartz_benchmarking_params.tar.bz2.


Dictionary generation can be very expensive, so the authors have already generated a high-quality human genome dictionary:

For the manuscript, the authors used recent FASTQ files from the 1000 Genomes Project. One set, for NA12878, is linked below:

To compress the quality scores, download and decompress the quartz-bin-0.1.1.tar.bz2 package, then enter the following commands in the quartz directory. Note that 64 GiB of RAM and 150 GiB of disk space are required for these commands to run.

wget http://giant.csail.mit.edu/quartz/dec200.bin.sorted.gz
wget http://giant.csail.mit.edu/quartz/dec200.bin.sorted.swapped.gz
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/sequence_read/SRR622461_1.filt.fastq.gz
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/sequence_read/SRR622461_2.filt.fastq.gz
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA12878/sequence_read/SRR622461.filt.fastq.gz
gunzip *.gz
./quartz dec200.bin.sorted 'S' 8 0 *.fastq

Note that this does not directly compress the file; it only sparsifies the quality vector. However, the increased redundancy makes the quality vectors far more compressible, as can be measured using any general-purpose text compressor. We recommend piping the original and sparsified quality scores through bzip2 and comparing the compressed sizes.
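One way to make that comparison is sketched below. The helper name quality_compressed_size is hypothetical and not part of Quartz; it assumes the common FASTQ layout of four lines per record with the quality string on every fourth line, and it takes the compressor as an optional second argument (bzip2 by default).

```shell
# Hypothetical helper (not part of Quartz): print the compressed size, in
# bytes, of the quality lines of a FASTQ file.
# Assumes four lines per record, with the quality string on line 4.
# $1 = FASTQ file
# $2 = compressor that reads stdin and writes stdout with -c (default: bzip2)
quality_compressed_size() {
    awk 'NR % 4 == 0' "$1" | "${2:-bzip2}" -c | wc -c | tr -d ' '
}
```

Calling the function once on an original FASTQ file and once on the corresponding Quartz output gives two byte counts whose ratio indicates how much more compressible the sparsified quality scores are.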

Also note that the command above runs in low-memory mode. If you have more than 70 GiB of RAM, use high-memory mode instead for a 2x speedup by changing the final numeric argument from 0 to 1: ./quartz dec200.bin.sorted 'S' 8 1 *.fastq
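The choice between the two modes can be scripted. The sketch below is an illustration only: the function name quartz_mode_for_kib is hypothetical, the 70 GiB threshold is taken from the note above, and reading total memory from /proc/meminfo is a Linux-specific assumption.

```shell
# Hypothetical helper (not part of Quartz): print the memory-mode argument
# for a machine with the given amount of RAM in KiB.
# Prints 1 (high-memory mode) above 70 GiB, otherwise 0 (low-memory mode).
quartz_mode_for_kib() {
    threshold_kib=$((70 * 1024 * 1024))   # 70 GiB expressed in KiB
    if [ "$1" -gt "$threshold_kib" ]; then
        echo 1
    else
        echo 0
    fi
}

# Linux-only assumption: /proc/meminfo reports MemTotal in KiB.
if [ -r /proc/meminfo ]; then
    total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "Suggested mode argument: $(quartz_mode_for_kib "$total_kib")"
fi
```

The printed value would take the place of the final 0 or 1 in the quartz command line. MemTotal is typically a little below the installed RAM, so machines close to the threshold may be assigned low-memory mode.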

As an additional note, we have both theoretical justification and experimental measurements demonstrating that a k-mer length of 32 and a Hamming neighbor distance of 1 are optimal for current machines; see the manuscript for details. For this reason, both parameters (unlike the default quality score replacement value) are hard-coded into the implementation.
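To make the term concrete, the sketch below enumerates the Hamming-distance-1 neighbors of a k-mer over the alphabet A, C, G, T, that is, every sequence that differs from it at exactly one position. It only illustrates the definition and is not how Quartz itself is implemented; the function name hamming1_neighbors is hypothetical.

```shell
# Hypothetical illustration (not Quartz code): print every sequence that
# differs from the given k-mer at exactly one position (Hamming distance 1).
# Assumes an upper-case input over the alphabet A, C, G, T.
hamming1_neighbors() {
    awk -v kmer="$1" 'BEGIN {
        n = split("A C G T", bases, " ")
        for (i = 1; i <= length(kmer); i++) {
            cur = substr(kmer, i, 1)
            for (b = 1; b <= n; b++) {
                # Substitute each of the three other bases at position i.
                if (bases[b] != cur)
                    print substr(kmer, 1, i - 1) bases[b] substr(kmer, i + 1)
            }
        }
    }'
}
```

Each position admits three substitutions, so a k-mer of length k has 3k such neighbors; for k = 32 that is 96.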