About
Entropy-scaling Search of Massive Biological Data:
In this paper, we introduce a data structure that allows similarity
search in time- and space-complexity asymptotically linear in entropy.
Using this data structure, we demonstrate at least an order of magnitude
acceleration of algorithms for problems drawn from small molecule search,
metagenomics, and protein structure search. Here we provide code and
instructions for the three tools we've developed: Ammolite, MICA,
and esFragBag.
Compressive Mapping for Next-generation Sequencing:
In this paper, we introduce a very fast and sensitive NGS read-mapping
framework that leverages the redundancy within the reference genome(s) as
well as the redundancy within and across raw read datasets. Our implementation
of this framework, CORA, achieves substantial runtime improvement through
compressive representation of the reads and a comprehensive homology map
of the reference genome. CORA's acceleration improves as the total size of
the multiple individual read dataset increases.
Quality Score Compression Improves Genotyping Accuracy:
In this paper, we present a highly efficient and scalable compression tool,
Quartz (quality score reduction at terabyte scale), which compresses quality
scores by capitalizing on sequence redundancy. Unlike other quality score
compression methods, Quartz simultaneously improves genotyping accuracy
while achieving high compression ratios, and it is able to do so in
orders of magnitude less time.
Metagenomic binning through low-density hashing:
In this paper, we introduce Opal, a metagenomic classifier for next-generation sequencing (NGS) read data. It introduces even-coverage, low-density locality-sensitive hashing to create compressed fingerprints of the k-mers within a read, which serve as training vectors for a downstream SVM classifier. The classifier should be trained on a multifasta file, coupled with a taxid file in which each line specifies the label of the corresponding fasta line. Once trained, the classification model can be used to predict the species of origin of individual reads in a test set, also in multifasta format. We have found Opal to be particularly well suited to regimes with high substitution error/mutation rates.
Carnelian uncovers hidden functional patterns across diverse study populations from whole metagenome sequencing reads:
In this paper, we present Carnelian, a pipeline for alignment-free functional binning and abundance estimation that leverages low-density, even-coverage locality-sensitive hashing to represent metagenomic reads in a low-dimensional manifold. When coupled with one-against-all classifiers, our tool bins whole metagenome sequencing reads by the molecular function encoded in their gene content at significantly higher accuracy than existing methods, especially for novel proteins. Carnelian is robust against study bias and produces highly informative functional profiles of metagenomic samples from disease studies as well as environmental studies.
Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes:
In this paper, we introduce Aldy, a combinatorial optimization framework that successfully reconstructs the structure and sequence content of each copy of some of the most challenging (structurally altered and highly polymorphic) genes in the human genome. For a large diverse sequencing data set, Aldy identifies multiple rare and novel alleles for several important pharmacogenes, significantly improving upon the accuracy and utility of current genotyping assays.
Statistical binning for barcoded reads improves downstream analyses:
In this paper, we introduce EMerAld (EMA for short), an alignment tool for barcoded short-read sequencing data, such as those produced by 10x Genomics' Chromium platform. EMA is faster and more accurate than current aligners, and produces not only the final alignments but also interpretable per-alignment probabilities. EMA takes a set of barcoded FASTQs as input, preprocesses them into a series of barcode buckets that can be processed in parallel, and produces a standard SAM/BAM file as output.
Ammolite is production-quality software designed to perform Tanimoto-distance similarity search over the chemical graphs of small molecules. On searches against the entire PubChem database, we achieve ~150x speedup over comparable state-of-the-art tools.
Code is available here
The databases used in the paper are available: 1 Million (1 GB), 10 Million (8+ GB), and the full PubChem as of 2013-10-01 (30+ GB, needs 300+ GB on disk decompressed).
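For intuition only, the Tanimoto coefficient that Ammolite accelerates can be computed on simple set-valued fingerprints, as in the Python sketch below. The fingerprints are made up, and Ammolite itself operates on chemical graphs rather than precomputed feature sets; this is just a toy illustration of the similarity metric.

```python
# Toy sketch: Tanimoto (Jaccard) similarity on set-based molecular fingerprints.
# Ammolite itself works on chemical graphs; this only illustrates the metric.

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical fingerprints: each feature could be a hashed substructure ID.
mol1 = {1, 4, 7, 9, 15}
mol2 = {1, 4, 8, 15}

print(tanimoto(mol1, mol2))  # 3 shared / 6 total = 0.5
```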
MICA (Metagenomic Inquiry Compressive Acceleration) is a full drop-in replacement for BLASTX and DIAMOND that supports all of their command-line options and is 3.5x faster than DIAMOND (and over 3000x faster than BLASTX), with no loss in specificity and less than 5% loss in sensitivity.
Code is available on Github and locally.
For installation and usage instructions please refer to the README.md file in the Github repository.
The NCBI NR database for June 20, 2015 is available here
Code for the caBLASTX prototype (for paper validation only) is available on Github and locally
AMD64 Linux binaries are also available
The NCBI NR database used in the paper, from 2014-09-17, is available (17GB)
esFragBag
esFragBag (entropy-scaling FragBag) is prototype software that applies entropy-scaling to accelerate only the all-r-nearest-neighbor search functionality of FragBag, by a factor of ~10 with no loss in specificity and less than 0.2% loss in sensitivity. At the moment, this software supports only benchmarking and comparison functionality for measuring the effects of entropy-scaling on FragBag.
Code will be available on Github once the paper is published: https://github.com/yunwilliamyu/fragbag
All of the code is also provided, along with the raw data and figure-generation scripts, here: http://giant.csail.mit.edu/gems/cellsystems_esFragBag_figures.tar.bz2 (md5sum: c4ee46c180685db3ddf847eb94969aff). This package includes the full source code, as well as a two-line command to regenerate all esFragBag benchmarks and figures found in the manuscript.
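For readers who want the gist of the entropy-scaling search idea behind esFragBag, the Python sketch below shows the two-level scheme: cluster representatives are scanned first, and the triangle inequality prunes clusters that cannot contain any point within radius r of the query; only the surviving clusters are searched exhaustively. The metric, clusters, and points are toy assumptions, not the released implementation.

```python
# Toy sketch of entropy-scaling r-near-neighbor search in a metric space.
# The database is pre-grouped into clusters, each with a representative
# ("center") and a radius covering all of its members.
import math

def dist(p, q):
    """Euclidean distance; FragBag actually compares fragment-count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def entropy_scaling_search(query, clusters, r):
    """Return all points within distance r of query.

    clusters: list of (center, radius, members) tuples.
    By the triangle inequality, a cluster whose center is farther than
    r + radius from the query cannot contain any r-near neighbor.
    """
    hits = []
    for center, radius, members in clusters:
        d = dist(query, center)
        if d > r + radius:
            continue                      # prune: whole cluster is out of range
        if d + radius <= r:
            hits.extend(members)          # whole cluster is within range
        else:
            hits.extend(m for m in members if dist(query, m) <= r)  # fine search
    return hits

# Tiny made-up example.
clusters = [
    ((0.0, 0.0), 1.0, [(0.2, 0.1), (0.9, 0.0)]),
    ((10.0, 10.0), 1.5, [(9.5, 10.2), (11.0, 9.0)]),
]
print(entropy_scaling_search((0.5, 0.0), clusters, r=1.0))
```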
CORA (Compressive read-mapping accelerator) is a compressive-acceleration tool for NGS read mapping methods.
When plugged into existing mapping tools, CORA achieves substantial runtime improvement through the use of compressive representation of the reads and a comprehensive homology map of the reference genome.
Code is available on Github: github.com/denizy/cora
All of the software executables, manuals, and quick-start scripts are provided in addition to the raw data here: cora.csail.mit.edu
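A minimal sketch of the compressive idea, under the simplifying assumption that only exact-duplicate reads are collapsed (CORA also compresses highly similar reads and exploits a homology map of the reference, neither of which is shown): the expensive mapping step runs once per unique sequence, and results are expanded back to every original read. The stand-in mapper and names below are hypothetical.

```python
# Toy sketch of compressive read mapping via exact-duplicate collapsing.
from collections import defaultdict

def compressive_map(reads, map_one_read):
    """reads: list of (read_id, sequence); map_one_read: seq -> alignment."""
    groups = defaultdict(list)
    for read_id, seq in reads:
        groups[seq].append(read_id)          # collapse exact duplicates

    results = {}
    for seq, read_ids in groups.items():
        alignment = map_one_read(seq)        # expensive step, done once per group
        for read_id in read_ids:
            results[read_id] = alignment     # expand back to all duplicates
    return results

# Hypothetical stand-in for a real mapper such as BWA or Bowtie2.
fake_mapper = lambda seq: ("chr1", hash(seq) % 1000)

reads = [("r1", "ACGTACGT"), ("r2", "ACGTACGT"), ("r3", "TTTTGGGG")]
print(compressive_map(reads, fake_mapper))
```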
Quartz (QUAlity score Reduction at Terabyte scale) is an efficient de novo quality score compression tool based on traversing the k-mer landscape of NGS read datasets.
Quartz takes FASTQ files as input and outputs FASTQ files with modified quality score strings, in which nearly all quality scores are reset to a (user-selectable) high-confidence value, decreasing the entropy and allowing any off-the-shelf text compressor (e.g. Gzip, Bzip2) to achieve extremely high compression ratios. Furthermore, although the compression is lossy, we have found that using the k-mer landscape of the read sequences to inform which quality scores are reset can even improve downstream genotyping accuracy.
Code is available on Github: github.com/yunwilliamyu/quartz
All of the software executables and manuals are provided, along with the raw data and plot-generation scripts, here: quartz.csail.mit.edu
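A minimal sketch of the dictionary-informed smoothing idea, assuming a toy k-mer size and a made-up dictionary of common k-mers: positions covered by a dictionary k-mer have their quality reset to a fixed high value, while the remaining positions keep their original scores. Quartz's actual rule, k-mer length, and dictionary construction differ in their details.

```python
# Toy sketch of k-mer-dictionary-informed quality score smoothing.
# Quartz uses much longer k-mers and a more involved matching rule;
# the dictionary and parameters below are illustrative assumptions.

K = 4
COMMON_KMERS = {"ACGT", "CGTA", "GTAC"}   # hypothetical dictionary of common k-mers
HIGH_Q = "I"                              # Phred 40 in Sanger/Illumina 1.8+ encoding

def smooth_qualities(seq: str, qual: str) -> str:
    supported = [False] * len(seq)
    for i in range(len(seq) - K + 1):
        if seq[i:i + K] in COMMON_KMERS:
            for j in range(i, i + K):
                supported[j] = True       # position corroborated by a common k-mer
    return "".join(HIGH_Q if s else q for s, q in zip(supported, qual))

# Positions 0-5 are covered by dictionary k-mers and get reset; 6-7 keep scores.
print(smooth_qualities("ACGTACTT", "##FFFF::"))  # -> "IIIIII::"
```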
Opal is a metagenomic classifier for next-generation sequencing (NGS) read data. It introduces even-coverage, low-density locality-sensitive hashing to create compressed fingerprints of the k-mers within a read, which serve as training vectors for a downstream SVM classifier. The classifier should be trained on a multifasta file, coupled with a taxid file in which each line specifies the label of the corresponding fasta line. Once trained, the classification model can be used to predict the species of origin of individual reads in a test set, also in multifasta format.
Code is available on Github: github.com/yunwilliamyu/opal
All of the software executables and manuals are provided, along with the raw data and plot-generation scripts, here: opal.csail.mit.edu
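A toy sketch of the low-density hashing step, with made-up parameters and randomly chosen positions (Opal designs its position sets for even coverage, which is not reproduced here): each hash function reads only a few positions of every k-mer, and the subsampled strings are hashed into a fixed-length count vector suitable as input to an SVM.

```python
# Toy sketch of low-density LSH fingerprints of a read's k-mers.
import random

K = 16           # k-mer length (illustrative)
DENSITY = 4      # positions sampled per hash function ("low density")
NUM_HASHES = 8
DIM = 1024       # length of the feature vector

random.seed(0)
HASH_POSITIONS = [sorted(random.sample(range(K), DENSITY)) for _ in range(NUM_HASHES)]

def fingerprint(read: str) -> list:
    vec = [0] * DIM
    for i in range(len(read) - K + 1):
        kmer = read[i:i + K]
        for h, positions in enumerate(HASH_POSITIONS):
            # Each hash function sees only a sparse subsample of the k-mer.
            key = (h, "".join(kmer[p] for p in positions))
            # Python's hash() is salted per process; a real tool uses a stable hash.
            vec[hash(key) % DIM] += 1
    return vec

vec = fingerprint("ACGTACGTTTGACCGTAGGCTTACGATCGAT")
print(sum(vec), len(vec))   # 16 k-mers x 8 hashes = 128 increments into 1024 bins
```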
Carnelian is a functional profiler for whole-metagenome sequencing read data. It leverages Opal's even-coverage, low-density locality-sensitive hashing and MurmurHash to represent reads in a low-dimensional feature space for an ensemble of downstream one-against-all classifiers. The classifier ensemble should be trained on a protein multifasta file, coupled with a label file in which each line specifies the functional label of the corresponding fasta line. Once trained, the classifier model can be used to compute functional profiles of diverse study populations.
Code is available on Github: github.com/snz20/carnelian
Supplementary information and datasets are provided here: carnelian.csail.mit.edu
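A toy sketch of the downstream one-against-all step, with made-up labels, weights, and feature vectors: each read's feature vector is scored by per-label linear classifiers, the best-scoring label wins, and the per-label counts form the sample's functional profile. Carnelian's actual classifiers and features are not reproduced here.

```python
# Toy sketch: one-against-all linear scoring and functional profile counting.
from collections import Counter

def predict(features, weights_by_label):
    """Return the label whose linear score (dot product) is highest."""
    def score(w):
        return sum(wi * xi for wi, xi in zip(w, features))
    return max(weights_by_label, key=lambda label: score(weights_by_label[label]))

def functional_profile(read_features, weights_by_label):
    """Abundance estimate: how many reads were assigned to each function."""
    return Counter(predict(x, weights_by_label) for x in read_features)

# Hypothetical 3-dimensional features and two functional labels (EC numbers).
weights = {"EC:1.1.1.1": [0.9, -0.2, 0.1], "EC:2.7.7.7": [-0.3, 0.8, 0.4]}
reads = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.3], [0.8, 0.1, 0.0]]
print(functional_profile(reads, weights))  # Counter({'EC:1.1.1.1': 2, 'EC:2.7.7.7': 1})
```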
EMerAld
EMerAld (EMA for short) is an alignment tool for barcoded short-read sequencing data, such as those produced by 10x Genomics' Chromium platform. EMerAld is faster and more accurate than current aligners, and produces not only the final alignments but also interpretable per-alignment probabilities. EMA takes a set of barcoded FASTQs as input, preprocesses them into a series of barcode buckets that can be processed in parallel, and produces a standard SAM/BAM file as output.
More information is available at http://ema.csail.mit.edu.
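A toy sketch of the barcode-bucketing preprocessing step described above, under the simplifying assumption that the barcode is the first 16 bp of each read; how Chromium barcodes are actually parsed and error-corrected by EMA is not shown.

```python
# Toy sketch: group FASTQ records into barcode buckets for parallel alignment.
from collections import defaultdict

BARCODE_LEN = 16  # assumed barcode length at the start of the read

def bucket_reads(fastq_records):
    """fastq_records: iterable of (name, sequence, quality) tuples."""
    buckets = defaultdict(list)
    for name, seq, qual in fastq_records:
        barcode = seq[:BARCODE_LEN]
        # Store the read with the barcode trimmed; each bucket can be
        # aligned independently of the others.
        buckets[barcode].append((name, seq[BARCODE_LEN:], qual[BARCODE_LEN:]))
    return buckets

records = [
    ("r1", "AACCGGTTAACCGGTT" + "ACGTACGT", "I" * 24),
    ("r2", "AACCGGTTAACCGGTT" + "TTGGCCAA", "I" * 24),
    ("r3", "GGTTAACCGGTTAACC" + "ACGTTTTT", "I" * 24),
]
for barcode, reads in bucket_reads(records).items():
    print(barcode, len(reads))
```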