Supplementary Information for Sterner et al. ISMB'06 submission

Supplementary Information for:

Sterner, B., Singh, R., and Berger, B. Predicting and Annotating Catalytic Residues: An Information Theoretic Approach. Submitted to the International Conference on Intelligent Systems for Molecular Biology, 2006.

Clustering with Other Distance Measures: We tested five other distance measures for profile widths of 11 to 21: the correlation coefficient, the Spearman correlation, the cosine of the included angle between data points, the Mahalanobis distance[1], and the Euclidean distance. In the same test, all five performed significantly worse than the Kullback-Leibler distance (34-43% error versus 22%) at generating the expected hierarchical clustering of the positive training set of catalytic residues. The table below gives the error rate of each method, which we define as the number of profiles incorrectly clustered as proton donors or nucleophiles divided by the total number of profiles.

Distance Measure	Percent Error
Correlation	34
Spearman	36
Cosine	34
Euclidean	34
Mahalanobis	43
Kullback-Leibler	22
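
As an illustration of the comparison above, the following Python sketch computes a Kullback-Leibler-based distance between two profiles, together with an error rate taken as the fraction of profiles assigned to the wrong cluster. It is not the implementation used in the paper: the symmetrized form of the divergence, the pseudocount, the summation over columns, and all function names (kl_divergence, smooth, profile_distance, error_rate) are assumptions made for this example, and center column weighting is left out.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits for two discrete distributions of equal length."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def smooth(column, pseudocount=1e-3):
    """Add a small pseudocount and renormalize so no probability is zero.

    The pseudocount value is an assumption of this sketch.
    """
    total = sum(column) + pseudocount * len(column)
    return [(x + pseudocount) / total for x in column]


def profile_distance(profile_a, profile_b):
    """Symmetrized KL divergence summed over aligned profile columns.

    Each profile is a list of columns; each column is a list of residue
    frequencies. Both profiles must have the same width.
    """
    dist = 0.0
    for col_a, col_b in zip(profile_a, profile_b):
        p, q = smooth(col_a), smooth(col_b)
        dist += 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return dist


def error_rate(predicted, expected):
    """Fraction of profiles whose cluster label differs from the expected one."""
    wrong = sum(1 for a, b in zip(predicted, expected) if a != b)
    return wrong / len(expected)
```

A distance of this kind could be passed to any hierarchical clustering routine that accepts a precomputed pairwise distance matrix. Note that the plain Kullback-Leibler divergence is not symmetric, which is why this sketch averages the two directions.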

K-Nearest Neighbor Classification: We tested our algorithm's performance for several values of k in both the annotation and prediction stages (see table below). For annotation, we used the center column weighting and profile width optimized in the clustering experiments and found k=3 to be optimal. For prediction, we kept the same profile width but set the center column weighting to 8 in order to get the best sensitivity without a high false positive rate; varying k at that weighting, we found k=1 to be best.

k	Annotation	Prediction
1	75	82
3	80	78
5	75	76
7	74	76
9	70	NA
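
The classification step can be sketched generically as follows. This is a minimal k-nearest-neighbor example under stated assumptions, not the paper's code: the function name knn_classify, the majority-vote rule, and the (item, label) layout of the training set are hypothetical, and the distance is supplied by the caller so that any profile distance can be plugged in.

```python
from collections import Counter


def knn_classify(query, training, distance, k=3):
    """Label `query` by majority vote among its k nearest training items.

    training: list of (item, label) pairs.
    distance: callable taking two items and returning a dissimilarity.
    """
    # Sort the training pairs by distance to the query and keep the k closest.
    neighbors = sorted(training, key=lambda pair: distance(query, pair[0]))[:k]
    # Count the labels among those neighbors and return the most frequent one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy usage with one-dimensional points and absolute difference as distance.
toy_training = [
    (0.0, "nucleophile"),
    (0.2, "nucleophile"),
    (0.9, "proton donor"),
    (1.0, "proton donor"),
    (1.1, "proton donor"),
]
toy_label = knn_classify(0.1, toy_training, lambda a, b: abs(a - b), k=3)
```

With two classes and an odd k, the vote cannot tie, which is one reason to restrict k to odd values as in the table above.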

Data Set: We used 37 glycohydrolase families taken from the alignments generated by PFAM. Each PFAM alignment was reduced to 40% redundancy using the CD-HIT sequence clustering algorithm[2,3]. The locations of the catalytic residues for each family were then determined by aligning a representative solved structure to the PFAM alignment using CLUSTALW. In a few cases, the PFAM alignment did not cover enough of the protein sequences or misaligned the sequences in a local area around the catalytic residues (i.e., visual inspection showed that the local part of the alignment was in error); for these families, marked * in the table, the alignment was done by hand or manually fixed. In the table below, the position of each catalytic residue is given in the PDB structure and, in parentheses, in the PFAM alignment.

Family	PDB ID	Nucleophile	Proton Donor
1	1qox	E354(1560)	E165(693)
2	1bgl	E537(597)	E461(291)
3	1j8v	D285(465)	NA[3]
4*	1obb	E391(562)	E113(138)
5	1ece	E282(949)	E162(470)
6	1bvw	D315(182)	D136(182)
7	1ovw	E197(267)	E202(272)
8	1cem	D120(160)	E63(78)
9	1js4	D415(1181)	E424(1190)
10	1exp	E233(493)	E127(268)
11	1bk1	E79(201)	E170(352)
12	1nlr	E120(76)	E203(204)
13	2aaa	D206(1100)	E230(1271)
14	1fa2	E382(467)	E187(213)
15*	1agm	D175(311)	E178(314)
16*	1gbg	E105(254)	E109(273)
17	1ghs	E231(569)	E288(707)
18	1ctn	NA[1]	E315(350)
19	1dxj	E90(380)	E68(115)
20	1nou	NA	E306(347)
22	1gd6	D49(65)	E32(46)
25	1jfx	NA[2]	E100(175)
26	1j9y	E282	E174(230)
28	1czf	NA	D203(665)
30	1ogs	E340(527)	E235(380)
31	1we5	D419(502)	D482(816)
32	1y4w	D41(11)	E241(687)
35	1tg7	E299(317)	E200(195)
38	1o7d	D147(210)	NA
39	1uhv	NA[2]	E160(203)
45	1hd5	D10(9)	NA[2]
46	1qgi	D55(26)	E3(8)
47*	1kkt	NA	D375(1069)
49	1ogm	D376(396)	D395(433)
53	1fhl	E246(363)	E136(206)
56	1fcu	NA[1]	E113(126)
57	1k1w	E123(440)	D214(727)

*: The alignments of families 4, 15, 16, and 47 were either done by hand or manually fixed.

NA: Location of catalytic residue is unknown.

NA[1]: Nucleophile is acetamido group of substrate.

NA[2]: Homology match from structure to alignment failed.

NA[3]: PFAM alignment does not include residue position.
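
For readers who want to process the table programmatically, the following sketch parses a single table entry such as E354(1560), E282, NA, or NA[2]. The function name parse_entry and the returned dictionary keys are hypothetical choices for this example; the only thing taken from the text is the notation itself (residue letter, PDB position, PFAM alignment position in parentheses, and the NA notes defined above).

```python
import re

# One-letter residue code, PDB position, optional PFAM alignment position.
RESIDUE = re.compile(r"^([A-Z])(\d+)(?:\((\d+)\))?$")
# "NA" with an optional bracketed note number, e.g. "NA[2]".
MISSING = re.compile(r"^NA(?:\[(\d+)\])?$")


def parse_entry(text):
    """Parse one catalytic-residue table entry into a dictionary.

    na_note is None when a residue is given, 0 for a plain "NA",
    and the note number (1, 2, or 3) for "NA[n]".
    """
    text = text.strip()
    missing = MISSING.match(text)
    if missing:
        note = int(missing.group(1)) if missing.group(1) else 0
        return {"residue": None, "pdb_pos": None, "pfam_col": None, "na_note": note}
    residue = RESIDUE.match(text)
    if residue is None:
        raise ValueError(f"unrecognized table entry: {text!r}")
    aa, pdb_pos, pfam_col = residue.groups()
    return {
        "residue": aa,
        "pdb_pos": int(pdb_pos),
        # Some entries list no alignment position; keep that gap explicit.
        "pfam_col": int(pfam_col) if pfam_col else None,
        "na_note": None,
    }
```

For example, parse_entry("E354(1560)") yields residue "E" at PDB position 354 and PFAM alignment position 1560, while an entry without a parenthesized position is returned with pfam_col set to None rather than a guessed value.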

References:

[1] S. Vinga and J. Almeida. Alignment-free sequence comparison - a review. Bioinformatics, 19(4):513-523, 2003.

[2] W. Li, L. Jaroszewski, and A. Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17:282-283, 2001.

[3] W. Li, L. Jaroszewski, and A. Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18:77-82, 2002.
