Opal

Massachusetts Institute of Technology (MIT)
Computer Science and Artificial Intelligence Laboratory (CSAIL)
Theory of Computation Group (TOC)
Computation and Biology Group (CompBio)

Email queries to bab@mit.edu



2018-May-14: 1st CAMI Challenge results (not yet submitted or in bioRxiv)

We have recently been running Opal on the 1st CAMI Challenge medium-complexity dataset (see Figure 2e of Sczyrba et al., 2017, Nature Methods).

To this end, we trained Opal on the source genomes provided by the benchmark, using mostly default parameters (adjusted for a fragment length of 150 and split into 150 small batches of coverage 0.1 each for memory reasons). Given this training, Opal's overall accuracy in predicting the originating sequence was 77%.

Note, however, that many of the originating sequences correspond to the same taxonomic IDs. When we separate the gold standard labels by taxonomic level (i.e. (sub)species, genus, family, etc.), we can lower-bound the accuracy of predicting a particular taxonomic level by assuming that anything Opal predicts at a higher taxonomic level is incorrect for the lower one. This allows a comparison with Figure 2e referenced above.

This is not a perfect comparison against the benchmarks in Figure 2 of the Nature Methods paper for two reasons, the first working against Opal and the second working in its favor:

  1. If Opal predicted a source sequence corresponding to a species-level label but the species prediction was incorrect, we count that as incorrect at the genus level too, even if the predicted species is in the same genus.
  2. Opal is trained on the source sequences for the CAMI challenge, so the labels it produces are of precisely that granularity and cannot come from outside the set of source sequences.

However, we note that under these caveats, we seem to outperform the other methods in Figure 2e, as none of them are above 50% at the species level, above 75% at the genus level, or above 85% at the family level.
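The lower-bound scoring described above can be sketched in Python as follows. This is one plausible reading of the procedure, not the code used to produce the numbers reported here: the function name `lower_bound_accuracy`, the `RANKS` list, and the `rank_of` mapping are assumptions made for illustration, and the taxids in the usage example are made up. A read is scored at a given rank only if its gold standard label is defined at that rank or a more specific one, and it counts as correct only on an exact taxid match, so a wrong species within the right genus still counts as wrong at the genus level.

```python
# Ranks ordered from most specific to least specific (illustrative subset).
RANKS = ["species", "genus", "family", "order", "class", "phylum"]


def lower_bound_accuracy(gold, predicted, rank_of):
    """Return {rank: lower bound on accuracy, or None if no read is eligible}.

    gold, predicted: lists of taxids, one entry per read.
    rank_of: dict mapping each gold taxid to the rank at which it is defined.
    """
    results = {}
    for i, rank in enumerate(RANKS):
        # Reads whose gold label is at this rank or a more specific one.
        eligible_ranks = set(RANKS[: i + 1])
        total = 0
        correct = 0
        for g, p in zip(gold, predicted):
            if rank_of[g] not in eligible_ranks:
                continue
            total += 1
            # Exact match only: no credit for a near miss in the same clade.
            if p == g:
                correct += 1
        results[rank] = correct / total if total else None
    return results


# Hypothetical example: four reads, made-up taxids.
gold = [100101, 100101, 200202, 300303]
predicted = [100101, 100102, 200202, 300303]
rank_of = {100101: "species", 200202: "genus", 300303: "family"}
accuracy = lower_bound_accuracy(gold, predicted, rank_of)
```

In this toy example the bound is 1/2 at the species level, 2/3 at the genus level, and 3/4 at the family level, because the one wrong species call is carried upward as an error at every higher rank.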

Our model has not yet been submitted to the CAMI challenge website, though that is in the works.

Opal is a metagenomic classifier for next-generation sequencing (NGS) read data. It introduces the use of even-coverage, low-density locality-sensitive hashing to create compressed fingerprints of the k-mers within a read, which serve as training vectors for a downstream SVM classifier. The classifier should be trained on a multifasta file, coupled with a taxid file in which each line specifies the label of the corresponding fasta line. Once trained, the classification model can be used to predict the species of origin of individual reads in a test set, also in multifasta format. We have found Opal to be particularly well suited to regimes with a high substitution error or mutation rate.
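To make the training input concrete, here is a minimal Python sketch that pairs the records of a multifasta file with the lines of a taxid file. It assumes that the i-th line of the taxid file labels the i-th fasta record; the helper names `read_fasta` and `pair_with_taxids` and the sample taxids are hypothetical and are not part of the Opal package. For the exact input conventions, the README in the repository remains the reference.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pair_with_taxids(fasta_lines, taxid_lines):
    """Return a list of (taxid, header, sequence), one per FASTA record.

    Assumption: labels appear in the same order as the records.
    """
    records = list(read_fasta(fasta_lines))
    taxids = [int(t) for t in (line.strip() for line in taxid_lines) if t]
    if len(records) != len(taxids):
        raise ValueError(
            "found %d fasta records but %d taxids" % (len(records), len(taxids))
        )
    return [(taxid, header, seq) for (header, seq), taxid in zip(records, taxids)]


# Hypothetical in-memory example; with real files, pass open file handles.
fasta = [">read1", "ACGT", "AC", ">read2", "GGTT"]
taxids = ["100001", "100002"]
pairs = pair_with_taxids(fasta, taxids)
```

Checking that the two files have the same number of entries before training catches the most common mismatch, a taxid file that has drifted out of step with the fasta file.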

Yunan Luo, Y. William Yu, Jianyang Zeng, Bonnie Berger, Jian Peng. "Metagenomic binning through low density hashing". Submitted for publication.

We make available here a set of scripts to reproduce the Opal pipeline described in the manuscript and do most of the major experiments: http://giant.csail.mit.edu/opal/opal-package.tar.bz2. This is the code actually used for Figure 3 in the manuscript.

The in-house dataset of 50-species referred to in Figure 2 and Supplementary Figures S2-3 is available here: http://giant.csail.mit.edu/opal/50_species.tar.gz.

The large dataset of 853-species referred to in Supplementary Table S8 is composed of fasta files referenced here: large_db.genom_id.txt.

We have also implemented a more user-friendly Python interface on GitHub: https://github.com/yunwilliamyu/opal. Note that this code depends on Vowpal Wabbit ≥ 8.3.0 being in the system path, and assumes a Unix-like environment with standard POSIX tools in which Python 2 is the default Python interpreter. For details on running it, please see the README and instructions in the GitHub repo. Future updates will be made to this separate source repository.

Note: Users are advised not to use the integers 1 ... K as proxies for taxids, since that does not align with the expectations of the Vowpal Wabbit oaa classifier. Use large natural numbers as taxids instead.


Additionally, for reference, the frequencies of spaced k-mer occurrences in the species used for our experiments are available. Each folder contains the k-mer frequencies for the species of one dataset. Files in the folders are named by NCBI taxonomy ID or NCBI genome ID. Each file is a tab-delimited table in which each line contains a k-mer string and the number of occurrences of that k-mer.

To count the k-mer frequencies, fragments of length 200 were first sampled from the complete genome of each species. The coverage C was set to 1, which means that, on average, each position is covered once by one of the fragments. Next, spaced (24, 12)-mers were generated using Opal's LDPC hash functions; four functions were used here. In the frequency table, the 0-based index of the hash function is appended to the k-mer string. For example, the k-mer string "ACGT0" means that the k-mer "ACGT" was generated by the first hash function.
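The counting procedure and table format described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used to build the published tables. In particular, `make_masks` picks 12 of the 24 window positions uniformly at random as a stand-in for Opal's LDPC hash functions, which are constructed differently, and all function names here are hypothetical.

```python
import random
from collections import Counter


def make_masks(k=24, t=12, n_hashes=4, seed=0):
    """Pick t of the k window positions for each hash function.

    Random subsets are a stand-in here, NOT Opal's LDPC construction.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(k), t)) for _ in range(n_hashes)]


def sample_fragments(genome, frag_len=200, coverage=1.0, seed=0):
    """Sample fragments so that each position is covered `coverage` times on average."""
    if len(genome) < frag_len:
        raise ValueError("genome is shorter than the fragment length")
    rng = random.Random(seed)
    n_fragments = max(1, round(coverage * len(genome) / frag_len))
    last_start = len(genome) - frag_len
    starts = [rng.randint(0, last_start) for _ in range(n_fragments)]
    return [genome[s:s + frag_len] for s in starts]


def count_spaced_kmers(fragments, masks, k=24):
    """Count spaced k-mers, appending the 0-based hash index to each key."""
    counts = Counter()
    for frag in fragments:
        for start in range(len(frag) - k + 1):
            window = frag[start:start + k]
            for h, mask in enumerate(masks):
                key = "".join(window[i] for i in mask) + str(h)
                counts[key] += 1
    return counts


def format_table(counts):
    """Render counts as tab-delimited lines: k-mer string, then occurrences."""
    return ["%s\t%d" % (key, n) for key, n in sorted(counts.items())]


def parse_line(line):
    """Split a table line into (kmer, hash_index, count).

    Assumes a single-digit hash index, which holds for four hash functions.
    """
    key, count = line.rstrip("\n").split("\t")
    return key[:-1], int(key[-1]), int(count)
```

With a genome of length 1000, a fragment length of 200, and C = 1, `sample_fragments` draws 5 fragments. Each fragment contains 177 windows of length 24, and each window contributes one key per hash function.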