September 2014 CAST v1.2.2
By Andrew Haskell, Noah Daniels, and Bonnie Berger
A new release of CaBLAST and CaBLAT is now available. It is much easier to use: it accelerates the existing nucleotide BLAST and BLAT programs through a plugin architecture, and it accepts BLAST and BLAT arguments.
A version of CaBLAST that works with proteins can be found here.
CaBLASTX (caBLASTP2) for metagenomic search can be found here.
Illustration by Steven H. Lee. Thanks also to Leslie Gaffney, Broad Institute.
The past two decades have seen an exponential increase in sequencing capabilities, outstripping advances in computing power. Extracting new insights from the data sets currently being generated will require not only faster computers but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected; thus, the amount of novel sequence information is growing much more slowly than the total volume of sequence data.
We show that this redundancy can be exploited by compressing data in a way that allows direct computation on the compressed data. This approach reduces the computational task of operating on many highly similar genomes to only slightly more than that of operating on just one. We demonstrate this compressive architecture by implementing accelerated versions of both BLAST and BLAT, and emphasize how compressive genomics, more generally, will enable biologists to keep pace with current data.
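The idea of computing directly on compressed data can be illustrated with a toy sketch. This is not the CaBLAST or CaBLAT algorithm; it assumes equal-length sequences that differ only by substitutions, and it uses exact substring matching in place of BLAST-style alignment. The function names `compress` and `search` and the `max_diffs` parameter are hypothetical. Near-duplicate sequences are stored as a short list of differences against a sequence kept in full, a coarse search scans only the sequences kept in full, and a fine search rebuilds just the candidate windows of the linked sequences.

```python
def compress(seqs, max_diffs=3):
    """Keep some sequences in full ('unique'); store near-duplicates only as
    substitutions against one of them ('links'). Toy model: equal lengths only."""
    unique = []  # [(name, sequence)]
    links = []   # [(name, index into unique, {position: base})]
    for name, seq in seqs.items():
        for i, (_, ref) in enumerate(unique):
            if len(ref) != len(seq):
                continue
            diffs = {p: b for p, (a, b) in enumerate(zip(ref, seq)) if a != b}
            if len(diffs) <= max_diffs:
                links.append((name, i, diffs))
                break
        else:
            # No stored sequence was close enough, so keep this one in full.
            unique.append((name, seq))
    return unique, links


def search(unique, links, query, max_diffs=3):
    """Return sorted (name, start) pairs for exact occurrences of query.
    max_diffs must match the value used in compress()."""
    hits = []
    candidates = {}  # index into unique -> window starts worth a fine check
    n = len(query)
    # Coarse search: scan only the unique sequences. Up to max_diffs mismatches
    # are tolerated so that a hit present only in a linked variant is not lost.
    for i, (name, ref) in enumerate(unique):
        for start in range(len(ref) - n + 1):
            mismatches = sum(a != b for a, b in zip(ref[start:start + n], query))
            if mismatches <= max_diffs:
                candidates.setdefault(i, []).append(start)
                if mismatches == 0:
                    hits.append((name, start))
    # Fine search: rebuild only the candidate windows of each linked sequence.
    for name, i, diffs in links:
        ref = unique[i][1]
        for start in candidates.get(i, []):
            window = "".join(diffs.get(p, ref[p]) for p in range(start, start + n))
            if window == query:
                hits.append((name, start))
    return sorted(hits)


genomes = {"g1": "ACGTACGTAC", "g2": "ACGTTCGTAC", "g3": "TTTTGGGGCC"}
unique, links = compress(genomes)
print(len(unique), "sequences stored in full;", len(links), "stored as differences")
print(search(unique, links, "GTTCG"))
```

In this sketch the scan cost grows with the amount of unique sequence rather than with the number of stored genomes, while each linked genome adds only a few window reconstructions. The permissive coarse step followed by an exact fine step is what keeps the result identical to searching the uncompressed data.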
Source Code
We have implemented two prototype algorithms that demonstrate the compressive genomics paradigm: Compression-accelerated BLAST (CaBLAST) and Compression-accelerated BLAT (CaBLAT). These algorithms serve as a proof of concept that computationally aware compression not only reduces storage space but also accelerates analysis (in this case, sequence search).
Our source code can be downloaded here for academic and non-profit use:
- CAST-1.2.2.tar.gz is the v1.2.2 release of CAST (CaBLAST and CaBLAT). It does not rely on the NCBI toolkit and is also available on GitHub.
- cast_v0.9.tar.gz is the publication version of CaBLAST and CaBLAT. It relies on the NCBI toolkit.
For a detailed description of the algorithms and discussion of relevant implementation trade-offs, please see the Supplementary Methods of our article "Compressive genomics" in Nature Biotechnology, July 2012.
Contact
We welcome feedback, questions and suggestions. Contact information is available at the authors' websites: Po-Ru Loh, Michael Baym, Bonnie Berger.
Referencing CaBLAST/CaBLAT
If you use CaBLAST or CaBLAT, please reference the following:
- Loh P-R, Baym M, Berger B. Compressive genomics. Nature Biotechnology, Volume 30 Number 7, July 2012.