CORA

Massachusetts Institute of Technology (mit)
Computer Science and Artificial Intelligence Laboratory (csail)
Theory of Computation Group (toc)
Berger Lab for Computation and Biology (compbio)

Email queries to bab@mit.edu.



CORA is a compressive-acceleration tool for NGS read mapping methods. When plugged into an existing mapping tool, CORA substantially improves runtime by using a compressive representation of the reads and a homology map of the reference genome.

Deniz Yorukoglu, Jian Peng, Bonnie Berger. "Compressive Mapping for Next-generation Sequencing". Manuscript submitted for publication.

A pre-compiled binary package of CORA, including a manual, is available: CORA.v0.9.2.zip.
It was produced on a Debian GNU/Linux 6.0.9 x86-64 machine using GCC 4.7.0.
Software dependencies for compilation are tr1/unordered_map and GCC 4.7 (or another C++11-compliant compiler).



Sample dataset and configuration

Download the pre-compiled binary above and the sample package here, which contains a sample paired-end read dataset (together with the read file list needed by CORA) and a sample configuration file. Uncompress the archive within the CORA package folder; otherwise, the file paths in the configuration file must be modified to reflect the correct locations of the reference, the auxiliary files, and the CORA executables. You may need to change file permissions (using chmod) to be able to run the executables; the same applies to the other tool packages used in conjunction with CORA.
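The permission fix mentioned above can be sketched as a small shell helper. This is an illustrative sketch, not part of the CORA package: the function name ensure_executable is hypothetical, and the paths passed to it depend on where the packages were unpacked.

```shell
# Hypothetical helper (not shipped with CORA): give the current user
# execute permission on each listed file that does not already have it.
ensure_executable() {
    for exe in "$@"; do
        if [ ! -f "$exe" ]; then
            echo "missing: $exe" >&2
            return 1
        fi
        if [ ! -x "$exe" ]; then
            chmod u+x "$exe" || return 1
            echo "made executable: $exe"
        fi
    done
}
```

For example, "ensure_executable ./cora ./bwa" could be run from the CORA package folder, assuming the mapper executable was copied there.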

A sample reference dataset can be obtained from here (chr21_gl000210_random.fa.gz). Run "gunzip chr21_gl000210_random.fa.gz" to extract the compressed reference.

Download BWA or Bowtie2, and make sure the executables can be run on your system. Latest tested compatible versions: BWA v0.7.9 and Bowtie2 v2.1.0. If the mapper package does not include an executable, run "make" to compile one; you should be able to move it freely after compilation.

Index the sample reference file using the mapper that will be used within CORA. Manual pages for indexing references with these two mappers can be found here and here. For BWA v0.7.9, the default indexing command is "./bwa index chr21_gl000210_random.fa"; for Bowtie2 v2.1.0, it is "./bowtie2-build chr21_gl000210_random.fa chr21_gl000210_random.fa", with the reference written twice (the second argument is the Bowtie2 index base; CORA requires the reference and the index base to be the same).
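The two indexing commands above can be wrapped in a single helper that picks the command from the mapper name. This is a minimal sketch: the function name index_reference and its argument order are hypothetical, and it assumes the mapper executables sit in the directory given as the second argument.

```shell
# Hypothetical helper: index a FASTA reference for the chosen mapper.
# Arguments: mapper name (bwa or bowtie2), directory holding the mapper
# executables, path to the reference.
index_reference() {
    mapper="$1"
    mapper_dir="$2"
    ref="$3"
    case "$mapper" in
        bwa)
            "$mapper_dir/bwa" index "$ref"
            ;;
        bowtie2)
            # The reference is passed twice: the second argument is the
            # index base, which CORA expects to equal the reference name.
            "$mapper_dir/bowtie2-build" "$ref" "$ref"
            ;;
        *)
            echo "unknown mapper: $mapper (expected bwa or bowtie2)" >&2
            return 2
            ;;
    esac
}
```

For example, "index_reference bwa . reference.fa" runs "./bwa index reference.fa", where reference.fa stands for the extracted sample reference.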

Modify the configuration file according to the mapper used and the path of its executable. Refer to the manual for a detailed description of each line in the configuration file. The other file paths should be valid if the sample folder is placed as described above; if not, modify the paths in the configuration file accordingly.

CORA also uses the *.fai index of FASTA references generated by samtools faidx (samtools can be downloaded from here). You can either run "./samtools faidx chr21_gl000210_random.fa" to generate chr21_gl000210_random.fa.fai, or provide the samtools executable path in configuration line 06, allowing CORA to generate the fai file as needed.
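The first option can be made repeatable with a small guard that builds the index only when it is absent. This is a sketch under stated assumptions: the function name ensure_fai is hypothetical, and it treats any non-empty .fai file next to the reference as a valid index.

```shell
# Hypothetical helper: build REF.fai with samtools only when it is absent.
# Arguments: path to the samtools executable, path to the FASTA reference.
ensure_fai() {
    samtools_exe="$1"
    ref="$2"
    if [ -s "$ref.fai" ]; then
        echo "index present: $ref.fai"
    else
        "$samtools_exe" faidx "$ref"
    fi
}
```

For example: "ensure_fai ./samtools chr21_gl000210_random.fa".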

Run the CORA executable, giving the configuration file as the only command-line argument (e.g., "./cora sample_config_file").
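Since a misplaced sample folder or a missing permission bit is an easy mistake to make, the run command can be preceded by two quick checks. The wrapper below is only a sketch: the name run_cora is hypothetical, and it passes nothing to CORA beyond the configuration file.

```shell
# Hypothetical wrapper: check the inputs, then run CORA with the
# configuration file as its only command-line argument.
run_cora() {
    cora_exe="$1"
    config="$2"
    if [ ! -f "$cora_exe" ] || [ ! -x "$cora_exe" ]; then
        echo "not an executable file: $cora_exe" >&2
        return 1
    fi
    if [ ! -r "$config" ]; then
        echo "cannot read configuration file: $config" >&2
        return 1
    fi
    "$cora_exe" "$config"
}
```

For example: "run_cora ./cora sample_config_file".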

Read the manual for CORA usage and how different stages of the pipeline can be turned off or run separately (the sample configuration file includes homology table generation by default [config option 00], which only needs to be performed once for a reference).

After the run has completed, a mapping output file (SAM format) should contain the mapping results for the reads in the read file. Files whose names start with Step are intermediate files and can be deleted. A subsequent run can skip homology table generation if the corresponding configuration line is set to 0. In its current version, CORA requires the homology table to be generated for a specific read-mer size; future versions will make the homology table more flexible.
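Removing the intermediate files can be sketched as follows. The function name clean_intermediates is hypothetical, and the sketch assumes the Step* files sit directly in the directory passed as the argument (the current directory by default); check the manual for where your configuration writes them.

```shell
# Hypothetical helper: delete intermediate files whose names start with
# "Step" from one directory, leaving everything else (including the SAM
# output) untouched. Subdirectories are not searched.
clean_intermediates() {
    out_dir="${1:-.}"
    for f in "$out_dir"/Step*; do
        # Skip the unexpanded pattern (no matches) and non-regular files.
        [ -f "$f" ] || continue
        rm -- "$f" && echo "removed: $f"
    done
}
```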



Real datasets

The exact and inexact homology tables used with the hg19 human reference genome for 108bp reads can be downloaded from here and here (if the download limit is exceeded, alternative links are provided below).

The read dataset used for the FIN1 experiment (modified from the 1000 Genomes Project to extract uniform paired-end reads that are 108bp long) can be found here (if the download limit is exceeded, alternative links are provided below).

The configuration file and the read file list used for the FIN1 experiment with BWA can be downloaded from here and here. Whole-genome mapping of FIN1 with the current version of CORA requires ~60GB of memory. In future updates we will reduce the memory requirements of CORA with more compact representations of the homology table and the links table, and additionally provide a user option that reduces memory use at the cost of slightly increased runtime.
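Before starting a whole-genome run, it may help to confirm that the machine has enough memory. The sketch below is Linux-specific and makes several assumptions: it reads MemTotal from /proc/meminfo, reports whole GiB, and the function names mem_total_gib and check_memory are hypothetical. The ~60GB figure above is approximate, so treat the threshold as a rough guide.

```shell
# Print total physical memory in whole GiB, read from a Linux
# /proc/meminfo-style file (MemTotal is reported in kB).
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Return non-zero with a warning when total memory is below the
# threshold (first argument, in GiB). The optional second argument is
# an alternative meminfo file.
check_memory() {
    need_gib="$1"
    have_gib=$(mem_total_gib "${2:-/proc/meminfo}")
    if [ -z "$have_gib" ]; then
        echo "could not determine total memory" >&2
        return 2
    fi
    if [ "$have_gib" -lt "$need_gib" ]; then
        echo "warning: ${have_gib} GiB total, ${need_gib} GiB requested" >&2
        return 1
    fi
}
```

For example, "check_memory 60" returns a non-zero status on a machine with less than 60 GiB of total memory.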


Warning! We recently became aware that Dropbox limits the downloads of very large files. Alternative links are provided below. We sincerely apologize for the inconvenience.

Alternative links for the homology tables can be found here and here.

An alternative link for the read file can be found here.