Carnelian

Carnelian: enhanced metabolic functional profiling of whole metagenome sequencing reads

Massachusetts Institute of Technology (mit)
Computer Science and Artificial Intelligence Laboratory (csail)
Theory of Computation Group (toc)
Berger Lab for Computation and Biology (compbio)

Corresponding author for study: bab@mit.edu
Queries regarding the usage of Carnelian pipeline: nazeen@mit.edu

About

We present Carnelian, a pipeline for alignment-free functional binning and abundance estimation, that leverages low-density even-coverage locality sensititve hashing to represent metagenomic reads in a low-dimensional manifold. When coupled with one-against-all classifiers, our tool bins whole metagenomic sequencing reads by molecular function encoded in their gene content at significantly higher accuracy than existing methods, especially for novel proteins. Carnelian is robust against study bias and produces highly informative functional profiles of metagenomic samples from disease studies as well as environmental studies. This work is presented in the following paper:

Sumaiya Nazeen, Yun William Yu, and Bonnie Berger*. Carnelian uncovers hidden functional patterns across diverse study populations from whole metagenome sequencing reads. Genome Biology 2020, 21(1): 47. [Link to paper]
*: correspondence

Availability

The in-house gold standard dataset is available for download here: EC-2010-DB.
Source code for Carnelian is available at Github.
A helper script for creating the gold standard EC database is available here.
All intermediate results and supplementary materials will be made available after publication.

Note on Availability of Data Sets from Alm Lab:

The Bostonian data set is now available through NCBI's SRA (Study Accession: SRP200548). This is a time-series metagenomics study; hence, there are multiple samples available from each of the 84 individuals in the study. In our study, we used data from a single time point per individual, usually the one with the most number of reads. The sample accession numbers are provided here.
The Baka data set is unpublished data on a sensitive population from the Eric Alm lab and is currently being prepared to be released through dbGaP by April, 2020. We will provide the accession numbers once the data becomes available through dbGaP.

Docker container

A docker container for Carnelian is available with all its dependencies. Please follow the following instructions.

Install docker following the instuctions here
Start the container for Carnelian:
$ docker run -ti snazeen/carnelian
Use Carnelian for functional profiling of your metagenomic data:
$ carnelian -h
Test Carnelian installation:
$ cd /usr/local/src/carnelian/tests
$ python basictest_carnelian.py
$ python advancedtest_carnelian.py

Note on Requirements:
Carnelian depends on Vowpal Wabbit 8.1.1, R 3.3.2, Python 2.7.13, and BioPython 1.70 in the system path, and assumes a Unix-like environment with standard POSIX tools, and that Python 2.7.13 is the default Python engine, and R 3.3.2 is the default R engine. For running details, please see the README and instructions in the Github. Future updates will be posted on Github.

Carnelian was tested with GCC 6.3.0 on Ubuntu 17.04, running under Bash 4.4.7(1) on a server with Intel Xeon E5-2695 v2 x86_64 2.40 GHz processor and 320 GB RAM. If you are planning to run it on a laptop/macbook, be sure to adjust the "--bits" parameter during training according to the available memory on your machine. Use a value as large as possible for the "--bits" parameter (default: 31; you might need to lower it if your RAM is smaller in size).

Workflow

Training Carnelian with gold standard: (Needs to be done only once.)
carnelian train -k 8 --num_hash number-of-hash-functions-be-used -l 30 -c 5 data/EC-2192-DB model_dir
Note: We recommend using 2 or 4 hash functions. Even with one hash function, the results are reasonably good.
Functional binning:
1. Preprocessing reads: (Needs to be done for every sample/individual. Other third-party tools can also be used for pre-processing.)
  - Paired-end:
    1. Quality filtering and adapter trimming:
      java -jar trimmomatic-0.36.jar PE input_forward_fastq input_reverse_fastq output_paired_forward_fastq output_unpaired_forward_fastq output_paired_reverse_fastq output_unpaired_reverse_fastq ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:30 MINLEN:90
    2. Deconvolving: (removing human sequence contaminations)
      perl deconseq.pl -f output_paired_forward_fasta_from_trimming -dbs hsref -i 90 -c 90 -out_dir forward_dir
      perl deconseq.pl -f output_paired_reverse_fasta_from_trimming -dbs hsref -i 90 -c 90 -out_dir reverse_dir
    3. Concatenate reads into a multi-fasta file:
      cat forward_dir/*_clean.fa reverse_dir/*_clean.fa > sample.fasta
  - Single-end:
    1. Quality filtering and adapter trimming:
      java -jar trimmomatic-0.35.jar SE input_fastq output_fastq ILLUMINACLIP:adapters/TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:30 MINLEN:90
    2. Deconvolving: (removing human sequence contaminations)
      perl deconseq.pl -f output_fasta_from_trimming -dbs hsref -i 90 -c 90 -out_dir output_dir
    3. Rename the clean fasta file:
      mv output_dir/*_clean.fa sample.fasta
2. Annotating samples:
  carnelian annotate -k 8 -n number_of_cpus path_to_directory_containing_input_fasta trained_model_directory path_to_sample_output_directory path_to_FragGeneScan
Note on paired-end inputs:
If you have paired-end fastq files, you can either treat the reads individually or merge the pairs. In the former case, you need to convert the reads to fasta format and put them in a single file before running Carnelian on it. If you want to use the paired-end relationships, please use the merge_pairs.py script under utils folder to generate a merged fasta file. The script will construct a longer read if the reads on forward and reverse strands (reverse complemented) have overlaps. If they don't overlap, the script will link them by filling the inner distance with 'N's. The inner distance can be specified by the user (default:50). Note that, the script assumes that both the fastq files will have the reads in the same order.
Functional profiling and abundance analysis: (Needs to be performed once all the samples have been labeled.)
1. Move the label files from individual sample directories to one single directory labels_dir. The files should be named after the same sample identifiers that appear in the sampleinfo_file file. See the readme.txt file for the format of sampleinfo_file.
2. Run the following command for creating the raw and effective counts matrices:
  carnelian abundance labels_dir abundance_matrix_dir sampleinfo_file data/EC-2010-DB/ec_lengths.tsv
Note: Carnelian provides both raw read counts and normalized read counts as output for the user. If the user wants to use a different normalization technique than the one presented in the paper, the raw counts matrix can be utilized.

Helper scripts for analyses performed in the paper:

Scripts for normalizing the output of mi-faser, HUMAnN2, and Kraken2 and pathway abundance estimation are available here. See the README.txt file for instructions.

Pre-trained EC and COG models:

All models are trained with vowpal-wabbit 8.1.1 and will not work with other versions of vowpal-wabbit. See full requirements on github.

Model trained on EC-2010-DB [download]
Model trained on COG categories from the publicly available eggNOG database [download]
Corresponding COG database contains 1,785,722 protein sequences from 3,285 COG categories. [download]

Performance comparison between default mode and precise mode of Carnelian:

Benchmarks were performed on subsets of EC-2010-DB data set. For the seen protein case, 100 bp reads were randomly drawn from the backtranslated protein sequences. To simulated mutated proteins, 5% mutation was randomly introduced in the reads. For hold-out experiment, proteins were randomly held out from multi-protein EC bins in our gold standard database.

We recommend using Carnelian in default mode as it has been thoroughly tested. The purpose of precise mode is only to provide probability scores for the EC labels and this mode is suitable for smaller databases.

Supplementary Information

Metadata associated with the real data sets analyzed in our study can be found here.

© All rights reserved to Sumaiya Nazeen. 2018. This work is published under MIT License. Permission is hereby granted to whoever obtains a copy of the files associated with our pipeline to use, copy, modify, merge, publish, and distribute free of charge. For details see the license file.