Supplementary Information for:
Sterner, B., Singh, R., and Berger, B.
Predicting and Annotating Catalytic Residues: An Information Theoretic Approach. Submitted to the International Conference on Intelligent Systems for Molecular Biology, 2006.
- Clustering with Other Distance Measures:
We tested five other distance measures for profile widths of 11 to 21: the correlation coefficient, the Spearman correlation, the cosine of the included angle between
data points, the Mahalanobis distance[1], and the Euclidean distance. In the same test, all five performed significantly worse than the Kullback-Leibler distance at reproducing the expected
hierarchical clustering of the positive training set of catalytic residues. The table below gives the error rate of each method, which we define as the number of profiles incorrectly clustered as proton donors or nucleophiles divided by the total number of profiles.
Distance Measure | Percent Error |
Correlation | 34 |
Spearman | 36 |
Cosine | 34 |
Euclidean | 34 |
Mahalanobis | 43 |
Kullback-Leibler | 22 |
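As a rough illustration of the comparison above, the sketch below computes a symmetrized Kullback-Leibler distance between two profiles and the percent error of a clustering. The symmetrization, the pseudocount, the column-wise sum, and all function names are assumptions made for illustration; this supplement does not give the exact form used in the paper. Percent error is taken here as the share of profiles assigned to the wrong group.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for two discrete distributions of equal length."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def smooth(column, pseudocount=1e-3):
    """Add a small pseudocount (assumed value) and renormalize to avoid zeros."""
    total = sum(column) + pseudocount * len(column)
    return [(x + pseudocount) / total for x in column]

def profile_distance(profile_a, profile_b):
    """Symmetrized KL divergence summed over aligned profile columns.

    Each profile is a list of columns; each column is a list of
    residue frequencies. Both profiles must have the same width.
    """
    dist = 0.0
    for col_a, col_b in zip(profile_a, profile_b):
        p, q = smooth(col_a), smooth(col_b)
        dist += kl_divergence(p, q) + kl_divergence(q, p)
    return dist

def percent_error(assigned, expected):
    """Percentage of profiles whose assigned cluster label is wrong."""
    wrong = sum(1 for a, e in zip(assigned, expected) if a != e)
    return 100.0 * wrong / len(expected)
```

With toy three-letter alphabets, `profile_distance([[0.8, 0.1, 0.1]], [[0.1, 0.8, 0.1]])` is positive, while the distance from a profile to itself is zero.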
- K-Nearest Neighbor Classification:
We tested our algorithm's performance for several values of k in both the annotation and prediction stages (see table below). For annotation, using the center column weighting and profile width optimized during clustering, we found that k=3 is optimal. For prediction, we kept the same profile width but set the center column weighting to 8 in order to get the best sensitivity without a high false positive rate. Varying k at that center weighting, we found that k=1 is best.
k | Annotation | Prediction |
1 | 75 | 82 |
3 | 80 | 78 |
5 | 75 | 76 |
7 | 74 | 76 |
9 | 70 | NA |
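The role of k can be sketched with a generic k-nearest-neighbor classifier that takes a majority vote among the k closest training profiles. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the toy one-dimensional distance, and the tie-breaking rule (the label seen first among the nearest neighbors wins) are all hypothetical.

```python
from collections import Counter

def knn_classify(query, training, k, distance):
    """Label `query` by majority vote among its k nearest training items.

    `training` is a list of (profile, label) pairs and `distance` is any
    function taking two profiles and returning a non-negative number.
    """
    neighbors = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tie, most_common keeps first-seen order, i.e. the nearer neighbor.
    return votes.most_common(1)[0][0]

def toy_distance(x, y):
    """Placeholder distance on plain numbers, used only for this example."""
    return abs(x - y)

toy_training = [(0.0, "nucleophile"), (0.1, "nucleophile"), (1.0, "proton donor")]
label_k3 = knn_classify(0.05, toy_training, 3, toy_distance)
label_k1 = knn_classify(0.9, toy_training, 1, toy_distance)
```

In practice the `distance` argument would be whichever profile distance is being evaluated, so the same classifier can be rerun for each value of k in the table.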
- Data Set:
We used 37 glycohydrolase families taken from the alignments generated by PFAM. Each PFAM alignment was reduced to 40% redundancy
using the CD-HIT sequence clustering algorithm[2,3]. The locations of catalytic residues for each family were then determined by aligning
a representative solved structure to the PFAM alignment using CLUSTALW. In a few cases, the PFAM alignment did not cover enough of the protein
sequences or misaligned the sequences in a local area around the catalytic residues (as judged by visual inspection). For these families, marked with * in the table below,
the alignment was done by hand or manually fixed. In the table, the position of each catalytic residue is given for the PDB structure, followed in parentheses by its position in the PFAM alignment.
Family | PDB ID | Nucleophile | Proton Donor |
1 | 1qox | E354(1560) | E165(693) |
2 | 1bgl | E537(597) | E461(291) |
3 | 1j8v | D285(465) | NA[3] |
4* | 1obb | E391(562) | E113(138) |
5 | 1ece | E282(949) | E162(470) |
6 | 1bvw | D315(182) | D136(182) |
7 | 1ovw | E197(267) | E202(272) |
8 | 1cem | D120(160) | E63(78) |
9 | 1js4 | D415(1181) | E424(1190) |
10 | 1exp | E233(493) | E127(268) |
11 | 1bk1 | E79(201) | E170(352) |
12 | 1nlr | E120(76) | E203(204) |
13 | 2aaa | D206(1100) | E230(1271) |
14 | 1fa2 | E382(467) | E187(213) |
15* | 1agm | D175(311) | E178(314) |
16* | 1gbg | E105(254) | E109(273) |
17 | 1ghs | E231(569) | E288(707) |
18 | 1ctn | NA[1] | E315(350) |
19 | 1dxj | E90(380) | E68(115) |
20 | 1nou | NA | E306(347) |
22 | 1gd6 | D49(65) | E32(46) |
25 | 1jfx | NA[2] | E100(175) |
26 | 1j9y | E282 | E174(230) |
28 | 1czf | NA | D203(665) |
30 | 1ogs | E340(527) | E235(380) |
31 | 1we5 | D419(502) | D482(816) |
32 | 1y4w | D41(11) | E241(687) |
35 | 1tg7 | E299(317) | E200(195) |
38 | 1o7d | D147(210) | NA |
39 | 1uhv | NA[2] | E160(203) |
45 | 1hd5 | D10(9) | NA[2] |
46 | 1qgi | D55(26) | E3(8) |
47* | 1kkt | NA | D375(1069) |
49 | 1ogm | D376(396) | D395(433) |
53 | 1fhl | E246(363) | E136(206) |
56 | 1fcu | NA[1] | E113(126) |
57 | 1k1w | E123(440) | D214(727) |
*: The alignments of families 4, 15, 16, and 47 were either done by hand or manually fixed.
NA: Location of catalytic residue is unknown.
NA[1]: Nucleophile is acetamido group of substrate.
NA[2]: Homology match from structure to alignment failed.
NA[3]: PFAM alignment does not include residue position.
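The relationship between a residue position and its alignment column in the table above can be illustrated by counting non-gap characters along an aligned sequence. This is only a sketch: the function name and gap characters are assumptions, and it counts from the first residue of the aligned sequence, so any offset between PDB numbering and the start of the aligned region would have to be applied first.

```python
def alignment_column(aligned_seq, residue_index, gap_chars="-."):
    """Return the 1-based alignment column of the residue_index-th residue.

    `residue_index` is 1-based and counts only non-gap characters of
    `aligned_seq`, starting at the first residue present in the alignment.
    """
    seen = 0
    for col, ch in enumerate(aligned_seq, start=1):
        if ch not in gap_chars:
            seen += 1
            if seen == residue_index:
                return col
    raise ValueError("residue index is beyond the end of the aligned sequence")

# Toy aligned sequence: the third residue (D) sits in column 6.
toy_column = alignment_column("--AC-DE", 3)
```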
References:
[1] S. Vinga and J. Almeida. Alignment-free sequence comparison - a review. Bioinformatics, 19(4):513-523, 2003.
[2] W. Li, L. Jaroszewski, and A. Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17:282-283, 2001.
[3] W. Li, L. Jaroszewski, and A. Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18:77-82, 2002.