

Entropy-scaling Search of Massive Biological Data: In this paper, we introduce a data structure that allows similarity search with time and space complexity asymptotically linear in entropy. Using this data structure, we demonstrate at least an order of magnitude acceleration of algorithms in problems drawn from small molecule search, metagenomics, and protein structure search. Here we provide code and instructions for the three tools we've developed: Ammolite, MICA, and esFragBag.
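As a rough illustration of how a cluster-based similarity search can avoid scanning an entire database, the following Python sketch groups items around cluster centers and then searches in two stages: a coarse pass over the centers, followed by a fine pass inside the clusters that could contain a hit. This is a minimal sketch under stated assumptions, not the code of Ammolite, MICA, or esFragBag: the function names, the greedy clustering, and the use of Hamming distance on short strings are all chosen here for illustration only.

```python
from typing import Callable, Dict, List


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings (a true metric)."""
    return sum(x != y for x, y in zip(a, b))


def build_clusters(
    items: List[str],
    dist: Callable[[str, str], int],
    cluster_radius: int,
) -> Dict[str, List[str]]:
    """Greedy clustering: each item joins the first center within
    cluster_radius, or becomes a new center if none is close enough."""
    clusters: Dict[str, List[str]] = {}
    for item in items:
        for center in clusters:
            if dist(item, center) <= cluster_radius:
                clusters[center].append(item)
                break
        else:
            clusters[item] = [item]
    return clusters


def search(
    query: str,
    clusters: Dict[str, List[str]],
    dist: Callable[[str, str], int],
    radius: int,
    cluster_radius: int,
) -> List[str]:
    """Return every stored item within `radius` of `query`."""
    hits: List[str] = []
    for center, members in clusters.items():
        # Coarse step: by the triangle inequality, a member within `radius`
        # of the query implies its center is within radius + cluster_radius.
        if dist(query, center) <= radius + cluster_radius:
            # Fine step: check only the members of candidate clusters.
            hits.extend(m for m in members if dist(query, m) <= radius)
    return hits


database = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "CCCC"]
clusters = build_clusters(database, hamming, cluster_radius=1)
results = search("AAAT", clusters, hamming, radius=1, cluster_radius=1)
```

Because the coarse step relies only on the triangle inequality, the sketch returns the same hits as a brute-force scan while skipping clusters that cannot contain a match. The savings grow when the data is highly redundant, that is, when many items fall into few clusters.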

Compressive Mapping for Next-generation Sequencing: In this paper, we introduce a very fast and sensitive NGS read-mapping framework that leverages the redundancy within the reference genome(s) as well as the redundancy within and across raw read datasets. Our implementation of this framework, CORA, achieves substantial runtime improvement through compressive representation of the reads and a comprehensive homology map of the reference genome. CORA's acceleration improves as the total size of the read datasets being mapped increases.

Quality Score Compression Improves Genotyping Accuracy: In this paper, we present a highly efficient and scalable compression tool, Quartz (quality score reduction at terabyte scale), which compresses quality scores by capitalizing on sequence redundancy. Unlike other quality score compression methods, Quartz simultaneously improves genotyping accuracy while achieving high compression ratios, and it is able to do so in orders of magnitude less time.


Ammolite is production-quality software designed to perform Tanimoto-distance similarity search over chemical graphs for small molecules. On searches against the entire Protein Data Bank, we achieve ~150x speedup over comparable state-of-the-art tools.
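For readers unfamiliar with the Tanimoto measure, the sketch below computes it over plain Python sets. It is a simplified illustration only: Ammolite works on chemical graphs, whereas this example assumes each molecule has already been reduced to a set of features (the feature names here are made up), and the function names are hypothetical.

```python
from typing import Hashable, Set


def tanimoto_similarity(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Tanimoto (Jaccard) coefficient: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # convention: two empty feature sets are identical
    return len(a & b) / len(a | b)


def tanimoto_distance(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Tanimoto distance, defined as 1 minus the Tanimoto coefficient."""
    return 1.0 - tanimoto_similarity(a, b)


# Hypothetical feature sets standing in for two small molecules.
mol_a = {"C-C", "C=O", "C-N"}
mol_b = {"C=O", "C-N", "C-S"}
similarity = tanimoto_similarity(mol_a, mol_b)  # 2 shared / 4 total
```

A similarity search then amounts to returning every database entry whose Tanimoto distance to the query falls below a chosen threshold.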

Code is available here

The databases used in the paper are available: 1 Million (1 GB), 10 Million (8+ GB), and full PubChem as of 2013-10-01 (30+ GB; needs 300+ GB on disk when decompressed).


MICA (Metagenomic Inquiry Compressive Acceleration) is a full drop-in replacement for BLASTX and DIAMOND that supports all of their command-line options. It is 3.5x faster than DIAMOND (and over 3000x faster than BLASTX), with no loss in specificity and less than 5% loss in sensitivity.

Code is available on Github and locally.

For installation and usage instructions please refer to the file in the Github repository.

The NCBI NR database for June 20, 2015 is available here

Code for the caBLASTX prototype (for paper validation only) is available on Github and locally

AMD64 Linux binaries are also available

The NCBI NR database used in the paper, from 2014-09-17, is available (17GB)


esFragBag (entropy-scaling FragBag) is prototype software that applies entropy-scaling to accelerate only the all r-nearest neighbor search functionality of FragBag by a factor of ~10 with no loss in specificity and less than 0.2% loss in sensitivity. At the moment, this software supports only benchmarking and comparison functionality for measuring the effects of entropy-scaling on FragBag.

Code will be available on Github once the paper is published:

All the code, together with the raw data and figure generation scripts, is also provided here: (md5sum: c4ee46c180685db3ddf847eb94969aff). This package includes full source code, as well as a two-line command to regenerate all esFragBag benchmarks and figures found in the manuscript.
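After downloading the package, its integrity can be checked against the md5sum listed above. The following is a minimal sketch using Python's standard hashlib module; the helper names are introduced here for illustration, and the path passed to them is whatever local file name the download was saved under (it is not specified on this page).

```python
import hashlib

# Checksum quoted above for the esFragBag package.
EXPECTED_MD5 = "c4ee46c180685db3ddf847eb94969aff"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks so that
    large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


# Usage (the file name is a placeholder for the downloaded archive):
#   matches_expected("path/to/downloaded-package")
```

On most Linux systems the same check can be done from the shell with the `md5sum` utility.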


CORA (Compressive read-mapping accelerator) is a compressive-acceleration tool for NGS read mapping methods.

When plugged into existing mapping tools, CORA achieves substantial runtime improvement through the use of compressive representation of the reads and a comprehensive homology map of the reference genome.
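One intuition for why a compressive representation of reads saves time is that identical sequences only need to be mapped once. The sketch below illustrates just that idea and is not CORA's implementation: it collapses exact duplicate reads, calls a stand-in mapper once per distinct read, and copies the result back to every occurrence. The exact-match mapper, the function names, and the toy reference are all assumptions made for this example; CORA's homology map of the reference is not modeled here.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def make_exact_mapper(reference: str, calls: List[str]) -> Callable[[str], int]:
    """Toy stand-in for a real read mapper: returns the first exact-match
    position of a read in the reference (or -1), and records each call."""
    def map_one(read: str) -> int:
        calls.append(read)
        return reference.find(read)
    return map_one


def map_reads_compressively(
    reads: List[str], map_one: Callable[[str], int]
) -> List[int]:
    """Map each distinct read once, then expand results to all copies."""
    occurrences: Dict[str, List[int]] = defaultdict(list)
    for index, read in enumerate(reads):
        occurrences[read].append(index)

    positions = [-1] * len(reads)
    for read, indices in occurrences.items():
        hit = map_one(read)  # the expensive step, done once per distinct read
        for index in indices:
            positions[index] = hit
    return positions


mapper_calls: List[str] = []
mapper = make_exact_mapper("ACGTACGTTT", mapper_calls)
positions = map_reads_compressively(["CGTA", "TTT", "CGTA"], mapper)
```

In this toy run three reads are mapped with only two mapper calls; the more redundancy the read data contains, the larger the saving.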

Code is available on Github:

All of the software executables, manuals, and quick-start scripts are provided in addition to the raw data here:


Quartz (QUAlity score Reduction at Terabyte scale) is an efficient de novo quality score compression tool based on traversing the k-mer landscape of NGS read datasets.

Quartz takes FASTQ files as input and outputs FASTQ files with modified quality score strings. The modification resets nearly all quality scores to a (user-selectable) high-confidence value, decreasing the entropy and allowing any off-the-shelf text compressor (e.g. Gzip, Bzip2) to achieve extremely high compression ratios. Furthermore, although the compression is lossy, we have found that using the k-mer landscape of the read sequences to decide which quality scores are reset actually improves downstream genotyping accuracy.
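The effect of resetting quality scores can be illustrated with a much-simplified sketch. This is not Quartz's algorithm: it assumes a precomputed set of "trusted" k-mers (hypothetical here), resets the quality score of every base covered by a trusted k-mer to a single high value, and leaves all other scores unchanged. The function name, the value of k, and the choice of "I" as the high-confidence character are assumptions for illustration.

```python
from typing import Set


def smooth_qualities(
    seq: str,
    qual: str,
    trusted_kmers: Set[str],
    k: int,
    high_q: str = "I",
) -> str:
    """Return a quality string in which every position covered by a
    trusted k-mer is reset to `high_q`; other positions are kept."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must be the same length")

    covered = [False] * len(seq)
    for start in range(len(seq) - k + 1):
        if seq[start:start + k] in trusted_kmers:
            for pos in range(start, start + k):
                covered[pos] = True

    return "".join(high_q if c else q for c, q in zip(covered, qual))


# Toy read: only the first three bases are covered by a trusted 3-mer.
smoothed = smooth_qualities("ACGTAC", "!#%&()", {"ACG"}, k=3)
```

Long runs of one repeated character are what let a general-purpose compressor such as Gzip shrink the modified quality strings so effectively.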

Code is available on Github:

All of the software executables and manuals are provided in addition to the raw data and plot generation scripts here:

Y. William Yu*, Noah M. Daniels*, David C. Danko, Bonnie Berger. "Entropy-scaling search of massive biological data". Cell Systems 1(2), 130–140 (2015) (*These authors contributed equally to this work).
Deniz Yorukoglu, Y. William Yu, Jian Peng, Bonnie Berger. "Compressive Mapping for Next-generation Sequencing". Nature Biotechnology 34, 374–376 (2016).
Y. William Yu, Deniz Yorukoglu, Jian Peng, Bonnie Berger. "Quality score compression improves genotyping accuracy". Nature Biotechnology 33, 240–243 (2015).

Questions or comments? Please contact bab at Berger Lab | CSAIL | MIT