CORA

Massachusetts Institute of Technology (mit)
Computer Science and Artificial Intelligence Laboratory (csail)
Theory of Computation Group (toc)
Berger Lab for Computation and Biology (compbio)

Corresponding author for study: bab@mit.edu
Queries regarding usage of CORA software: denizy@mit.edu


CORA is a compressive-acceleration tool for NGS read mapping methods.
When plugged into existing mapping tools, CORA achieves substantial runtime improvement through the use of compressive representation of the reads and a comprehensive homology map of the reference genome.

The compressive-mapping framework that CORA software implements is explained in the following study:

Deniz Yorukoglu, Y. William Yu, Jian Peng, Bonnie Berger.
"Compressive Mapping for Next-generation Sequencing". Nature Biotechnology 34, 374-376 (2016) doi:10.1038/nbt.3511.


Pre-compiled binary packages of CORA (including manuals) can be found below. Source code is available at https://github.com/denizy/cora. All versions were produced on an Ubuntu GNU/Linux 3.13.0 x86-64 machine using GCC 4.8.X.

Version 1.1.5b:      CORA.v1.1.5b.zip
Simple setup script:      setup
Updates since previous version:
- Fixed bugs from v1.1.3a regarding number of chromosomes in a reference file.
- Note: In v1.1.3a and v1.1.4b, a portion of the read mappings could be missing for reference chromosomes between 127 and 256, or larger than 32000. This problem is fixed in v1.1.5b.
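
To judge whether results produced with v1.1.3a or v1.1.4b might be affected, it can help to count the sequences in the reference file. The sketch below is only an illustration: the function names are hypothetical, and it assumes that the numbers in the note refer to the position of a chromosome within the reference, so that any reference with 127 or more sequences is potentially affected.

```shell
# Count the sequences (FASTA header lines starting with ">") in a reference file.
count_reference_sequences() {
    # grep -c prints 0 but returns non-zero when nothing matches; ignore that status.
    grep -c '^>' "$1" || true
}

# Assumption (not stated by the CORA authors): a reference is potentially
# affected once it holds 127 or more sequences.
may_be_affected() {
    [ "$1" -ge 127 ]
}
```

For example, `may_be_affected "$(count_reference_sequences hg19.fa)" && echo "re-run with v1.1.5b"` would print the hint for a reference with many unplaced contigs.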

Version 1.1.4b:      (deprecated, please use v1.1.5b)
Updates since previous version:
- @SQ header lines are printed by default.
- User option to add read group header lines and mapping tags (@RG and RG:Z:).

Version 1.1.2b:      CORA.v1.1.2b.zip
Simple setup script:      setup
Updates since previous version:
- Support for longer reads (up to 2x180bp paired-end) and higher distance all-mapping (6-9 Hamming/Levenshtein) with the new threeway k-mer compression mode.
- Added support for reads that cannot be divided evenly by k-mer length
- Major code cleanup and minor performance, stability and user interface improvements.
- New faiGenerate command that replaces "samtools faidx" for purposes of running CORA.
- Now, CORA prints original read names and quality scores by default and all-mapping results are more compact.

Version 1.1.1b:      CORA.v1.1.1b.zip
Simple setup script:      setup
Updates since previous version:
- Github source release: https://github.com/denizy/cora
- Unique and Stratum mapping modes are now consistent with results from Best and All mapping modes.
- Minor stability fixes (e.g. OoB chr positions)
- Users can now modify LPT?.dat files to optimize load-balancing.

Version 1.1.0b:      CORA.v1.1.0b.zip
Simple setup script:      setup
Updates since version 1.0.1b:
- Parallelization support for the coarse-mapping stage (--coarseP).
- Reporting of mapping quality for best-mapping modes.
- Temporary directory option for intermediate files (--TempDir). This allows running multiple CORA jobs within the same folder.
- Gzipped FASTQ input option (--ReadComp).
- Option for limiting maximum number of mappings in all-mapping mode (--MaxMapCount)
- Automatic estimation of --fs parameter.
- Revised parameters for coarse-mapping with Bowtie2.
- Stability fixes in Linux environments with different exit codes.
- Minor bug fixes for command-line option errors and for mapping recovery incompatibility issues in some mapping modes.
- Overall runtime and memory performance improvements
- *** Increased error distance threshold for best-mapping (up to 12) through flexible mapping recovery stage*** This functionality is currently being performance-tested before being fully integrated into the CORA interface. Here is an easy way to activate and use it in current version: High_Error_Distance_Mode_Guide

Version 1.0.2b:      CORA.v1.0.2b.zip      CORA.v1.0.2b_static.zip
Simple setup scripts:      setup      setup_static
Updates to version 1.0.1b:
- Expanded coarse-mapper compatibility to the most recent BWA versions (up until v0.7.12)
- Fixed bug that prevented '--cm' option from being specified

Version 1.0.1b:      CORA.v1.0.1b.zip      CORA.v1.0.1b_static.zip
Simple setup scripts:      setup      setup_static
Updates to version 1.0.0b:
- Fixed bug: Default coarse mapper wasn't correctly set as bwa.
- Fixed bug: In some system configurations, exit signals could cause crashes in stage0.

Version 1.0.0b:      CORA.v1.0.0b.zip      CORA.v1.0.0b_static.zip
Updates since version 0.9.2:
- Gapped alignment support.
- Option for mapping with larger Hamming distance (or Levenshtein/edit distance) thresholds.
- Modified best mapping mode that is substantially faster than before and uses much less memory.
- Simplified command-line interface as well as improved Manual and Readme files.
- Option to report mappings with original read names and quality scores (read-order is still preserved).
- Expanded coarse mapping tool compatibility (BWA, BWA mem, Bowtie, Bowtie2, MrsFAST, MrsFAST-Ultra).
- More flexible options for parallelization (for indexing) as well as user-specified memory/runtime performance optimizations.
- Further runtime improvements for compressively-accelerated mapping.

Older versions of CORA and manuals can be found here.

All source code for the CORA pipeline, as well as C++ developer libraries for homology table generation/traversal and k-mer based read compression, will be made available upon publication.

Up-to-date coarse-mapper compatibility list:
CORA supports all versions of Bowtie 2 from v2.0.1 through v2.2.6.
CORA supports all versions of BWA from v0.5.1 through v0.7.12, with the sole exception of v0.7.6a (which does not include the aln/samse algorithms).
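
The ranges above can be checked in a script before starting a long run. The following is a minimal sketch, not part of CORA: the function names are hypothetical, and it assumes GNU `sort -V` is available and that the bare version number (e.g. "0.7.12", without any "-r…" revision suffix) is passed in.

```shell
# Succeeds when MIN <= VERSION <= MAX under GNU "sort -V" version ordering.
version_in_range() {
    local version=$1 min=$2 max=$3
    [ "$(printf '%s\n' "$min" "$version" | sort -V | head -n 1)" = "$min" ] &&
    [ "$(printf '%s\n' "$version" "$max" | sort -V | tail -n 1)" = "$max" ]
}

# BWA: v0.5.1 through v0.7.12, except v0.7.6a.
bwa_version_supported() {
    [ "$1" != "0.7.6a" ] && version_in_range "$1" 0.5.1 0.7.12
}

# Bowtie 2: v2.0.1 through v2.2.6.
bowtie2_version_supported() {
    version_in_range "$1" 2.0.1 2.2.6
}
```

For example, `bwa_version_supported 0.7.10 || echo "please install a compatible BWA"` stays silent, while the same check with 0.7.6a prints the message.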


Running CORA on the sample dataset

Simple setup:

1) Download this setup script into a folder of your preference.

Since these scripts automatically download external tools from SourceForge, they may not work if SourceForge is down.

2) Run "bash -x sample_setup" in the terminal.

This automatically downloads all necessary executables, coarse mapping tools, and sample datasets, and sets up permissions, the proper folder structure, etc.

3) Once the setup is complete, run "bash -x sample_runs" in the same folder.

These files contain sample usage of CORA's different stages via command-line interface (using BWA and Bowtie2 for coarse mapping), which can also be copy-pasted into the terminal and run individually.

After the run has completed, there should be 8 mapping output files (SAM format) which include mapping results from BWA and Bowtie2, for best and all-mapping, with and without indels. (Files that start with Step* are intermediate files that you can ignore for basic use).

A subsequent run can be employed without generating the homology table, mapper index or read file list. You can copy the last line in the sample_run_manually file and paste it into the terminal to redo the mapping.

In the current version, CORA requires the homology table to be generated for a specific k-mer size, but in future versions the homology table will be made more flexible.

If the simple setup described above doesn't work in your system or you would like to set up CORA manually, please see the manual setup steps below.

Manual Setup:

Download the sample package here as well as the pre-compiled binary package above. Uncompress both archives. You may need to change user permissions to be able to run executables (using chmod); this is true for other tool packages used in conjunction with CORA.

A sample reference dataset can be obtained from here (chr21_gl000210_random.fa.gz). Run "gunzip chr21_gl000210_random.fa.gz" to extract the compressed reference.

CORA also uses *.fai index of multi-fasta references generated by samtools faidx (which can be downloaded from here). You can run "./samtools faidx chr21_gl000210_random.fa" in order to generate chr21_gl000210_random.fa.fai.
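
For readers unfamiliar with the *.fai format: each line describes one sequence with five tab-separated columns (name, length in bases, byte offset of the first base, bases per line, bytes per line including the newline). The awk sketch below illustrates how these columns are derived; it is not the samtools or CORA implementation, the function name is hypothetical, and it assumes Unix line endings and evenly wrapped sequence lines.

```shell
# Print a .fai-style index for the FASTA file given as $1.
fai_sketch() {
    LC_ALL=C awk '
        function emit() {
            if (name != "")
                printf "%s\t%d\t%d\t%d\t%d\n", name, seqlen, offset, linebases, linewidth
        }
        /^>/ {
            emit()                      # finish the previous sequence, if any
            name = substr($1, 2)        # header up to the first whitespace, without ">"
            pos += length($0) + 1       # bytes consumed so far, including the newline
            offset = pos                # the first base starts right after the header line
            seqlen = 0; linebases = 0; linewidth = 0
            next
        }
        {
            if (linebases == 0) { linebases = length($0); linewidth = length($0) + 1 }
            seqlen += length($0)
            pos += length($0) + 1
        }
        END { emit() }
    ' "$1"
}
```

Running `fai_sketch chr21_gl000210_random.fa` and comparing the output against chr21_gl000210_random.fa.fai is a quick way to confirm that an index matches its reference.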

Depending on which mapper you'd like to use for coarse mapping, download one of BWA, Bowtie, Bowtie2, MrsFAST, or MrsFAST-Ultra, and make sure their executables can be run on your system. Refer to the "Up-to-date coarse-mapper compatibility list" provided above for choosing a version compatible with CORA.

Download the following file sample_run_manual in the same folder as the sample dataset.

Replace all [CORA_PATH] with the path of the cora executable.
Replace all [MAPPER_TYPE] with BWA, BWA_MEM, BOWTIE, BOWTIE_2, MRSFAST or MRSFAST_ULTRA depending on the mapper.
Replace all [MAPPER_PATH] with corresponding mapper's executable path (if it's installed to the machine or environment variables are properly set, the executable name is also acceptable, e.g. "bwa").
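
The three replacements above can also be done with sed instead of a text editor. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the values must not contain "|", "&" or backslashes, since those are special to sed.

```shell
# Print SCRIPT with the three placeholders filled in.
# usage: fill_placeholders SCRIPT CORA_PATH MAPPER_TYPE MAPPER_PATH
fill_placeholders() {
    # "|" is used as the sed delimiter so that "/" in paths needs no escaping.
    sed -e "s|\[CORA_PATH]|$2|g" \
        -e "s|\[MAPPER_TYPE]|$3|g" \
        -e "s|\[MAPPER_PATH]|$4|g" "$1"
}
```

For example, `fill_placeholders sample_run_manual ./cora BWA bwa > sample_run_filled` writes a filled-in copy (the output file name is only an example) and leaves the downloaded script untouched.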

Then simply run "bash -x sample_run_manual"


Running CORA on 1000 Genomes FIN1 Dataset

Simple setup:

0) If you haven't done so, please run the simple setup script for the sample dataset above.

1) Download the following setup script for read data: real_setup.

2) Run "bash -x real_setup" in the terminal.

Depending on your bandwidth, it might take several minutes to a few hours to download all of the read datasets and the homology table.

3) Once setup is complete, run "bash -x real_run" which will run CORA-BWA's ungapped all-mapping algorithm on the FIN1 dataset.

You can also perform best-mapping, gapped mapping, or change the coarse mapper type if you'd like.


Manual Setup:

Download the human reference genome (hg19.2bit file) from UCSC download site: hg19.2bit.
You can convert the 2bit file to a multi-fasta file using twoBitToFa.
Index the reference genome using a mapper of your preference.
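
The conversion and indexing steps above might look as follows on the command line. This is a sketch only: the function name, the output file name hg19.fa and the Bowtie2 index basename "hg19" are assumptions, and only BWA and Bowtie2 are covered.

```shell
# usage: prepare_hg19 MAPPER   (MAPPER is "bwa" or "bowtie2" in this sketch)
prepare_hg19() {
    # Convert the UCSC 2bit file into a multi-fasta file.
    twoBitToFa hg19.2bit hg19.fa || return 1
    # Build the index with the chosen mapper.
    case "$1" in
        bwa)     bwa index hg19.fa ;;
        bowtie2) bowtie2-build hg19.fa hg19 ;;
        *)       echo "unsupported mapper: $1" >&2; return 1 ;;
    esac
}
```

Calling `prepare_hg19 bwa` in the folder that holds hg19.2bit runs both steps; the tools themselves must already be installed and on the PATH.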


The exact and inexact homology tables used for hg19 human reference genome for 108bp reads can be downloaded from here and here, respectively.

Read dataset used for FIN1 experiment (Modified from 1000 Genomes Project in order to extract uniform paired-end reads that are 108bp long) can be found here.

The read file list used for the FIN1 read dataset can be generated using cora, or alternatively downloaded from here.

Download the following script that contains execution command: real_manual_run

Replace [CORA_PATH] with the path of the cora executable.
Replace [MAPPER_TYPE] with BWA, BWA_MEM, BOWTIE, BOWTIE_2, MRSFAST or MRSFAST_ULTRA depending on the mapper.
Replace [MAPPER_PATH] with corresponding mapper's executable path (if it's installed to the machine or environment variables are properly set, the executable name is also acceptable, e.g. "bwa").
Replace [READ_FILE_LIST] with the read file list you generated/downloaded above.
Replace [REFERENCE_GENOME] with the path to the hg19.fa
Replace [EXACT_HOMOLOGY_TABLE] with the path to the Exact file in the homology table
Replace [INEXACT_HOMOLOGY_TABLE] with the path to the InExact file in the homology table

Whole-genome all-mapping of FIN1 with the current version of CORA requires ~48GB of memory without memoization, and slightly more with the memoization threshold set to 20.
The whole-genome best-mapping pipeline of CORA for FIN1 can be run with ~20GB of memory.
In future updates, we will reduce CORA's memory requirements with more compact representations of the homology table and links table.


Read datasets used for comparison benchmarks.
FIN1: 1 2
FIN2: 1 2
FIN4: 1 2



Additional Homology Tables for hg19
Below are homology tables for various k-mer lengths and Hamming distances, built for the hg19 human reference genome downloaded from UCSC: hg19.2bit. When using default parameters, please download the Exact and Hamming 2 tables. The mapping distance metric (edit vs. Hamming distance) is independent of the homology table, but the distance threshold needs to be compatible.

Table type / k-mer size 33bp 34bp 35bp 36bp 37bp 38bp 39bp 40bp 41bp 42bp 43bp 44bp 45bp 46bp 47bp 48bp 49bp 50bp 51bp 52bp 53bp 54bp 55bp 56bp 57bp 58bp 59bp 60bp
Exact X X X X X X X X X X X X X X X X X X X X X X X X X X X X
Hamming 1 X X X X X X X X X X X X X X X X X X X X X X X X X X X X
Hamming 2 X X X X X X X X X X X X X X X X X X X X X X X X X X X X
Hamming 3 X X X X X X X X X X X X X X X X X X X X X X X X X X X X