
About


Entropy-scaling Search of Massive Biological Data: In this paper, we introduce a data structure that allows similarity search in time- and space-complexity asymptotically linear in entropy. Using this data structure, we demonstrate at least an order of magnitude acceleration of algorithms in problems drawn from small molecule search, metagenomics, and protein structure search. Here we provide code and instructions for the three tools we've developed: Ammolite, MICA, and esFragBag.

Compressive Mapping for Next-generation Sequencing: In this paper, we introduce a very fast and sensitive NGS read-mapping framework that leverages the redundancy within the reference genome(s) as well as the redundancy within and across raw read datasets. Our implementation of this framework, CORA, achieves substantial runtime improvement through compressive representation of the reads and a comprehensive homology map of the reference genome. CORA's acceleration improves as the total size of the individual read datasets increases.

Quality Score Compression Improves Genotyping Accuracy: In this paper, we present a highly efficient and scalable compression tool, Quartz (quality score reduction at terabyte scale), which compresses quality scores by capitalizing on sequence redundancy. Unlike other quality score compression methods, Quartz simultaneously improves genotyping accuracy while achieving high compression ratios, and it is able to do so in orders of magnitude less time.

Metagenomic binning through low density hashing: In this paper, we introduce Opal, a metagenomic classifier for next-generation sequencing (NGS) read data. It introduces the use of even-coverage low-density locality sensitive hashing to create compressed fingerprints of k-mers within the read to use as training vectors for a downstream SVM classifier. The classifier should be trained on a multifasta file, coupled with a taxid file where each line specifies the label of the corresponding fasta line. Once trained, the classification model can be used to predict the species of origin of individual reads in a test set, also in multifasta format. We have found Opal to be particularly suited to regimes with high substitution error/mutation rates.

Ammolite


Ammolite is production-quality software designed to do a Tanimoto distance similarity search over chemical graphs for small molecules. On searches against the entire Protein Data Bank, we achieve ~150x speedup over comparable state-of-the-art tools.
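For readers unfamiliar with the metric, here is a minimal sketch of the Tanimoto coefficient in its generic feature-set form. It is an illustration only and is not taken from the Ammolite code, which compares chemical graphs; the feature names and the `tanimoto_similarity` and `tanimoto_distance` helpers are hypothetical.

```python
def tanimoto_similarity(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity: shared features / total distinct features."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def tanimoto_distance(a: set, b: set) -> float:
    """Tanimoto distance is one minus the similarity."""
    return 1.0 - tanimoto_similarity(a, b)


# Hypothetical feature sets for two molecules (illustrative only).
mol_a = {"C-C", "C-O", "C=O", "O-H"}
mol_b = {"C-C", "C-O", "C-N", "N-H"}
print(tanimoto_similarity(mol_a, mol_b))
print(tanimoto_distance(mol_a, mol_b))
```

A similarity search then amounts to returning every database entry whose distance to the query falls below a chosen threshold.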

Code is available here

The databases used in the paper are available: 1 Million (1GB), 10 Million (8+ GB), full PubChem as of 2013-10-01 (30+ GB, needs 300+ GB on disk when decompressed).

MICA


MICA (Metagenomic Inquiry Compressive Acceleration) is a full drop-in replacement for BLASTX and DIAMOND that supports all of their command-line options. It is 3.5x faster than DIAMOND (and over 3000x faster than BLASTX), with no loss in specificity and less than 5% loss in sensitivity.

Code is available on Github and locally.

For installation and usage instructions please refer to the README.md file in the Github repository.

The NCBI NR database for June 20, 2015 is available here


Code for the caBLASTX prototype (for paper validation only) is available on Github and locally

AMD64 Linux binaries are also available

The NCBI NR database used in the paper, from 2014-09-17, is available (17GB)

esFragBag


esFragBag (entropy-scaling FragBag) is prototype software that applies entropy-scaling to accelerate only the all r-nearest neighbor search functionality of FragBag by a factor of ~10 with no loss in specificity and less than 0.2% loss in sensitivity. At the moment, this software supports only benchmarking and comparison functionality for measuring the effects of entropy-scaling on FragBag.

Code will be available on Github once the paper is published: https://github.com/yunwilliamyu/fragbag

Additionally, all the code is provided, along with the raw data and figure generation scripts, here: http://giant.csail.mit.edu/gems/cellsystems_esFragBag_figures.tar.bz2 (md5sum: c4ee46c180685db3ddf847eb94969aff). This package includes full source code, as well as a two-line command to regenerate all esFragBag benchmarks and figures found in the manuscript.
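The checksum quoted above can be used to confirm that the archive downloaded intact. The following is a minimal sketch using Python's standard hashlib module; the helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the local file path is an assumption about where the archive was saved.

```python
import hashlib

# Checksum quoted above for cellsystems_esFragBag_figures.tar.bz2.
EXPECTED_MD5 = "c4ee46c180685db3ddf847eb94969aff"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's digest with the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved to the current directory, calling archive_is_intact("cellsystems_esFragBag_figures.tar.bz2") returns True only when the digest matches.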

CORA


CORA (Compressive read-mapping accelerator) is a compressive-acceleration tool for NGS read mapping methods.

When plugged into existing mapping tools, CORA achieves substantial runtime improvement through the use of compressive representation of the reads and a comprehensive homology map of the reference genome.

Code is available on Github: github.com/denizy/cora

All of the software executables, manuals, and quick-start scripts are provided in addition to the raw data here: cora.csail.mit.edu

Quartz


Quartz (QUAlity score Reduction at Terabyte scale) is an efficient de novo quality score compression tool based on traversing the k-mer landscape of NGS read datasets.

Quartz takes FASTQ files as input and outputs FASTQ files with modified quality score strings. The modification resets nearly all quality scores to a (user-selectable) high-confidence value, decreasing the entropy and allowing any off-the-shelf text compressor (e.g. Gzip, Bzip2) to achieve extremely high compression ratios. Furthermore, although the compression is lossy, we have found that using the k-mer landscape of the read sequences to inform which quality scores are reset even improves downstream genotyping accuracy.
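To make the idea of resetting quality scores concrete, here is a deliberately simplified sketch. It is not Quartz's actual algorithm, whose rules are documented in the paper and repository. It assumes a hypothetical set of "trusted" k-mers and resets the quality of every base covered by an exact match to one of them; the function name `smooth_qualities`, the parameter names, and the example read are all illustrative.

```python
def smooth_qualities(seq: str, qual: str, trusted_kmers: set, k: int,
                     high_char: str = "I") -> str:
    """Reset quality characters at positions covered by a trusted k-mer.

    Simplified illustration only: positions not covered by any trusted
    k-mer keep their original quality character.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    covered = [False] * len(seq)
    for start in range(len(seq) - k + 1):
        if seq[start:start + k] in trusted_kmers:
            for pos in range(start, start + k):
                covered[pos] = True
    return "".join(high_char if hit else q for hit, q in zip(covered, qual))


# Hypothetical read and trusted k-mer set (assumptions for illustration).
read_seq = "ACGTACGTTT"
read_qual = "#5#5#5#5#5"
print(smooth_qualities(read_seq, read_qual, {"ACGT"}, k=4))
```

Because long runs of the output share one character, a general-purpose compressor has far less entropy to encode. The default "I" corresponds to Phred 40 in the common Phred+33 encoding.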

Code is available on Github: github.com/yunwilliamyu/quartz

All of the software executables and manuals are provided in addition to the raw data and plot generation scripts here: quartz.csail.mit.edu

Opal


Opal is a metagenomic classifier for next-generation sequencing (NGS) read data. It introduces the use of even-coverage low-density locality sensitive hashing to create compressed fingerprints of k-mers within the read to use as training vectors for a downstream SVM classifier. The classifier should be trained on a multifasta file, coupled with a taxid file where each line specifies the label of the corresponding fasta line. Once trained, the classification model can be used to predict the species of origin of individual reads in a test set, also in multifasta format.
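The training input described above can be pictured as two parallel files. The sketch below pairs each multifasta record with its label, under the assumption that the taxid file holds one label per FASTA record in the same order. It is not part of Opal; the function names `read_fasta_records` and `pair_with_labels` and the example labels are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from multifasta lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pair_with_labels(fasta_lines: Iterable[str],
                     taxid_lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (label, header, sequence) triples; counts must match."""
    records = list(read_fasta_records(fasta_lines))
    labels = [line.strip() for line in taxid_lines if line.strip()]
    if len(records) != len(labels):
        raise ValueError(
            f"{len(records)} FASTA records but {len(labels)} labels")
    return [(label, header, seq)
            for (header, seq), label in zip(records, labels)]


# Illustrative in-memory input; real use would pass open file handles.
fasta = [">read1", "ACGT", "AC", ">read2", "GG"]
taxids = ["562", "1280"]
print(pair_with_labels(fasta, taxids))
```

Checking the counts up front catches a misaligned taxid file before any training starts.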

Code is available on Github: github.com/yunwilliamyu/opal

All of the software executables and manuals are provided in addition to the raw data and plot generation scripts here: opal.csail.mit.edu

Y. William Yu*, Noah M. Daniels*, David C. Danko, Bonnie Berger. "Entropy-scaling search of massive biological data". Cell Systems, Volume 1, Issue 2, p130–140, 26 August 2015 http://dx.doi.org/10.1016/j.cels.2015.08.004. (*These authors contributed equally to this work).
Deniz Yorukoglu, Y. William Yu, Jian Peng, Bonnie Berger. "Compressive Mapping for Next-generation Sequencing". Nature Biotechnology 34, 374-376 (2016).
Y. William Yu, Deniz Yorukoglu, Jian Peng, Bonnie Berger. "Quality score compression improves genotyping accuracy". Nature Biotechnology 33, 240-243 (2015).
Yunan Luo, Y. William Yu, Jianyang Zeng, Bonnie Berger, Jian Peng. Metagenomic binning through low density hashing. Submitted for publication.
Questions or comments? Please contact bab at csail.mit.edu
Berger Lab | CSAIL | MIT