Betascan: a program to predict beta-sheets from sequence alone using pairwise probabilistic modeling


VERSION HISTORY

 2008-10-27 final cleanup for web download release
 2008-10-21 add batch fallthrough: failure to find strands now does not crash program
 2008-03-14 update to documentation
 2007-10-18 add zip option; move correction into initial output
 2007-09-14 betascan web output 
 2007-08-04 integrate matlab files into perl code and implement image switch
 2006-09-14 reimplement pairwise detector with improved dynamic programming
 2006-08-24 add two-sequence capability
 2006-08-21 add additional weighting for tyrosine, glutamine, asparagine pairings
 2006-08-09 calculations with length disabled (may be useful for stitcher)
 2006-07-30 antiparallel calculations
 2006-07-30 even more switches and access points added; considering changing defaults on many switches
 2006-07-27 rearranged architecture; many more switches and access points added 
 2006-07-26 transferred calcProbs out of code4.cpp and into perl script
 2006-07-24 started work on corrector
 2006-07-17 extensive commenting and organization
 2006-07-14 graphing capability added for maxima scores down to -1
 2006-07-13 maxima calculators and graphers added
 2006-07-11 log-odds score calculations added; color settings adjusted
 2006-07-10 normalization by length
 2006-07-07 normalization by orientation 
 2006-07-05 bug found: was calculating beta strands that hung off the end of the chart.
 2006-07-03 -m and -M switches: minimum and maximum loops between strands now frobbable on command line
 2006-07-01 bsplotpair.m debugged and incorporated. now generating on-the-fly pairwise graphs
 2006-06-29 bsplotsingle.m and code4.cpp now use data from switches
 2006-06-28 switches implemented. input file name, beta strand length, and length of file to be graphed
 2006-06-27 separate perl files now incorporated as subroutines
 2006-06-26 matlab invocation debugged. bsplotsingle.m now working
 2005-05-20 incorporating separate drivers into a single master driver perl script
 2005-05-20 see archived individual perl scripts for previous version histories


DESCRIPTION

Betascan uses the pairwise beta-structure databases compiled for Betawrap (latest version: McDonnell et al., 2006) to calculate the probability of

* (a) a beta strand of given length, position, and orientation forming (the singleton probability) and

* (b) two beta strands of given length, positions, and orientation forming part of a beta-sheet (the pairwise probability).

We calculate these probabilities progressively over a specified length of sequence and normalize them by length (according to a Poisson model fitted to the data of Penel et al., 2003). We then generate log-odds scores against a null model in which all conditional probabilities are 0.05. Next, we identify all local maxima in the singleton and pairwise scores. The search space is the set of calculated scores, and the allowed transitions within it are the physically possible changes in beta-strand (or beta-sheet) folding. Beta-strand and beta-sheet scores may be graphed for all possibilities, for local maxima only, or both.
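
As a rough illustration of the log-odds step described above, the following Python sketch compares a set of conditional probabilities against a null model in which every conditional probability is 0.05. The function name, the use of natural logarithms, and the example probabilities are assumptions made for illustration only. The sketch does not reproduce the calculation in betascan.pl and omits the Poisson length normalization.

```python
import math

NULL_PROB = 0.05  # null-model conditional probability, as stated in the description


def log_odds(cond_probs, null_prob=NULL_PROB):
    """Sum of log(p / null_prob) over the conditional probabilities of one strand.

    Hypothetical helper: a positive score means the strand is more likely
    under the model than under the null. The log base is an assumption.
    """
    if any(p <= 0 for p in cond_probs):
        raise ValueError("probabilities must be positive")
    return sum(math.log(p / null_prob) for p in cond_probs)


# Made-up probabilities for a three-residue strand, for illustration only.
example = [0.10, 0.05, 0.20]
score = log_odds(example)
```

A probability equal to the null value contributes nothing to the score, so only residues that deviate from the null model move the total up or down.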

Dependencies

Betascan consists of the following files:

 betascan.pl
        Main program
 BSweights.txt
 BSprobs.txt
        Probability tables
        
Betascan requires the following installations:

 Perl 5
 Math::Matlab::Local
 Getopt::Std
 MATLAB (7.0.1 or later)

Input files

 BETASCAN requires one input file: a FASTA-format file containing one or more sequences.
 IMPORTANT: The first word of each protein name MUST be unique within the file!
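
The uniqueness requirement can be checked before a run. The Python sketch below is not part of Betascan. It assumes that the "first word" is the first whitespace-delimited token after the ">" on each FASTA header line, and the sample sequences are invented.

```python
def first_words(fasta_text):
    """Return the first word of every FASTA header line (lines starting with '>')."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            names.append(words[0] if words else "")
    return names


def duplicated_names(fasta_text):
    """Return the first words that occur more than once, in order of first repeat."""
    seen, dupes = set(), []
    for name in first_words(fasta_text):
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes


# Invented sample: two records share the first word "Sup35".
sample = ">Sup35 yeast prion\nMSDSNQGNNQ\n>Sup35 fragment\nQNYQQYSQNG\n>Ure2\nMMNNNGNQVS\n"
dupes = duplicated_names(sample)
```

An empty result from duplicated_names means every record has a distinct first word.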

Output files

BETASCAN generates the following data files:

 [Name of protein] prion single strand [in-out|out-in] [strand length] residue.txt
        singleton predictions for each orientation and strand length
 [Name of protein] prion single strand complete swap.txt
        arranged for singleton graphing
 [Name of protein] prion forward [in-out|out-in] [strand length] residue.txt
        pairwise predictions of sheets, with upper strand as prior
 [Name of protein] 3dranged prion forward [in-out|out-in] [strand length] residue.txt
        arranged for pairwise graphing
 [Name of protein] prion combined [in-out|out-in] [strand length] residue.txt
        pairwise predictions of sheets multiplied by singleton predictions of the upper strand
 [Name of protein] 3dranged prion combined [in-out|out-in] [strand length] residue.txt
        arranged for pairwise graphing
 [Name of protein] prion single strand complete maxima.txt
        singleton prediction local maxima, arranged for graphing
 [Name of protein] prion combined scores complete maxima.txt
        pairwise prediction local maxima, arranged for graphing

 singleInFreqs.txt: singleton interior frequencies for the twenty amino acids
 singleOutFreqs.txt: singleton exterior frequencies for the twenty amino acids
 
 bsplotsingle400(run ID).m
 bsplotpair400(run ID).m
 bsbetasingle400(run ID).m
 bsbetapair400(run ID).m
        Matlab files custom-tailored to generate desired plots

 Error files, one for each protein and one for each pairwise maxima calculation.

 BETASCAN also generates an image file for each arranged file, in pdf format.
 BETASCAN may also generate images in eps, tiff, or jpg format, as controlled by the -i switch, but compatibility is not guaranteed!
 Each invocation of MATLAB generates a diary for debugging.

Switches

The following switches work with BETASCAN:

Mandatory inputs:

 -n B<inputfilename:> takes the input FASTA file.
 -b B<betalength:> the maximum length of a beta strand in the model. 
            Must be an integer greater than 2.
 -l B<seqlength:> the length of the sequence. Only the first I<seqlength> residues will be graphed.
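
As a minimal sketch of how the three mandatory switches fit together, the Python helper below builds an argument list for a Betascan run. Several details are assumptions for illustration: that the script is started as "perl betascan.pl", the helper name, the file name proteins.fasta, the example values 13 and 400, and the check that -l is a positive integer. Only the switch letters and the "greater than 2" rule for -b come from the list above.

```python
def betascan_command(input_fasta, beta_length, seq_length, script="betascan.pl"):
    """Build an argument list for the three mandatory switches (-n, -b, -l)."""
    if not isinstance(beta_length, int) or beta_length <= 2:
        raise ValueError("-b must be an integer greater than 2")
    # Assumption: -l is taken to be a positive integer.
    if not isinstance(seq_length, int) or seq_length < 1:
        raise ValueError("-l must be a positive integer")
    return ["perl", script, "-n", input_fasta,
            "-b", str(beta_length), "-l", str(seq_length)]


# Hypothetical input file and values.
cmd = betascan_command("proteins.fasta", 13, 400)

# To launch the run (needs Perl, MATLAB, and the Betascan files in place):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.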

Optional inputs:

 -m B<minloop:> the minimum distance between two possibly paired beta strands. 
        Current default is 0.
 -M B<maxloop:> the maximum distance between two possibly paired beta strands. 
        Current default is 35. 
            Setting to 0 will force checking of all possible pairings (i.e., all interchain contacts).
 -z B<startPoint:> the number of the first residue in the sequence, if not 1.
 -c B<modelname:> the amino acid composition model to be used. 
        Must be I<none:>, I<null:>, or one of the models in BSweights.txt (or specified by -w). 
                Defaults to I<allproteins:>.
 -i B<format:> requested output format for images. 
        Currently available: pdf, tiff, jpg, all.
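
The effect of the -m and -M switches can be pictured as a filter on candidate strand pairs. The Python sketch below is an illustration, not the code in betascan.pl. It assumes that the distance is the number of loop residues between the two strands and that both bounds are inclusive. It treats a maximum of 0 as "no upper limit", following the documented behavior of -M 0.

```python
def loop_allowed(loop_length, min_loop=0, max_loop=35):
    """Decide whether two strands separated by loop_length residues may be paired.

    Defaults mirror the documented defaults for -m (0) and -M (35).
    Assumption: max_loop == 0 disables the upper bound, so every pairing
    that satisfies min_loop is checked.
    """
    if loop_length < min_loop:
        return False
    return max_loop == 0 or loop_length <= max_loop


# With the defaults, a 35-residue loop is accepted and a 36-residue loop is not.
within_default = loop_allowed(35)
beyond_default = loop_allowed(36)
# With max_loop set to 0, even a very long separation is considered.
unbounded = loop_allowed(500, max_loop=0)
```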

Location inputs:

 -j B<jobname:> the name of the job, to be used in assigning file and directory names.
 -G B<matlabLoc:> the location of MATLAB on the hard drive, if not in the default location.
                Default location is "/Applications/MATLAB701/bin/matlab".
 -C B<cacheAddr:> the cache address, to be used to place BETASCAN output in a temporary directory.

Alternate prior data files:

 -w B<weightfilename:> takes the file with amino acid composition models. 
        Defaults to BSweights.txt.
 -p B<probfilename:> takes the file with conditional probability tables. 
        Defaults to BSprobs.txt.

Calculation switches:

 -L: turn off normalization by length
 -e: turn on maxima filter to remove less likely maxima
 -y: use bonuses for tyrosine, asparagine, and glutamine side-chain stacking interactions.

Output switches:

 -1: only generate singleton graphs.
 -2: only generate pairwise graphs.
 -g: turn off singleton and pairwise image generation.
 -s: skip all pre-maxima calculations and graphs.
 -F: generate only most-used graphs.
 -f: turn off finding and graphing local maxima.
 -d: do not move data files and images to their own subdirectories at the end of calculations.
 -h: turn off HTML file with links to data files.
 -T: enlarge all graph text, suitable for generating images for publication.
 -Z: generate zipped archive of all outputs

Debug switches:

 -D: explicit debugging. I<Warning::> this will cause the debug file to become VERY large (>25 MB).
 -t: include timestamps in debugging messages.

Deprecated experimental switches:

 -a B<seqID:> the secondary sequence to run all sequences against.
                Must exist within the input FASTA file.
                Used only to do hetero pairwise runs (runs of one sequence forming beta-structure with another).
 -o: turn on normalization by orientation.
 -A: calculate antiparallel pairwise probabilities.
 -N B<normParam:> change parameter of length normalization.
 


AUTHOR

 Allen W. Bryan, Jr., lead author and implementer
 Bonnie Berger, Susan Lindquist, advisors
 Matt Menke, Andrew McDonnell, coders and implementers
 Lenore Cowen, massive amounts of good advice and paper-writing help
 Additional immense amounts of advice and test data from the Susan Lindquist lab and Thronson Luehrs lab
 Random additional support by the Bonnie Berger lab
 Help with Matlab debugging from Shannon Weiland and Sepideh Baghaii
 Parts of the amino acid propensity tables provided by Oliver King
 BETASCAN conditional probability tables updated by Nathan Palmer
 Website maintenance and debugging by Kenny Lu
 
 A production of:
 * Massachusetts Institute of Technology
        * Computer Science and Artificial Intelligence Laboratory
        * Department of Mathematics (Applied)
        * Department of Biology
 * Harvard/MIT Division of Health Sciences and Technology
 * Whitehead Institute for Biomedical Research
 * Tufts University
        * Computer Science Department
 
 With generous support by:
 * National Institutes of Health
        * Informatics for Integrating Biology & the Bedside, grant # U54-LM008748
        * grant # R01-GM25974
 * National Science Foundation
    * award # 0428715
 
 For relevant legal verbiage, talk to their lawyers...
 
 ALL INQUIRIES ARE DIRECTED TO: betascan@csail.mit.edu
 
# #include <disclaimer.h>