«Recent enhancements to the Blocks database servers Jorja G. Henikoff, Shmuel Pietrokovski and Steven Henikoff1,* Fred Hutchinson Cancer Research ...»
Recent enhancements to the Blocks database servers
Jorja G. Henikoff, Shmuel Pietrokovski and Steven Henikoff1,*
Fred Hutchinson Cancer Research Center, 1124 Columbia Street, Seattle, WA 98104,
USA and 1Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center,
Seattle, WA 98104, USA. *To whom correspondence should be addressed.
The Blocks Database contains multiple alignments of conserved regions in protein
families which can be searched by e-mail (email@example.com) and World Wide Web (http:/blocks.fhcrc.org/) servers to classify protein and nucleotide sequences. Recent enhancements to the servers include: Improved calculation of position-specific scoring matrices from blocks; availability of the Prints protein fingerprint database for searching in Blocks format; a representative sequence biassed towards the blocks of a protein family for sequence database searching;
a tree constructed from the blocks of a protein family to facilitate subfamily classification; links to World Wide Web sites dedicated to a family; and implementation of the Local Alignment of Multiple Alignments (LAMA) method to search a block against a database of blocks.
The Blocks Database was designed to classify biological sequences. Protein or nucleotide sequences can be searched against blocks, which are converted to positionspecific scoring matrices (PSSMs) for this purpose. Comparison takes into account both local similarity between the query and family, represented by a single block, and global similarity, represented by the entire set of blocks for families with more than one conserved region (8). Intervening non-conserved regions are ignored, so that the comparison concentrates on the regions characteristic of each documented family.
Searching the Blocks Database can be more sensitive and selective than searching the sequence databases (3, 8).
IMPROVED POSITION-SPECIFIC SCORING MATRICES FROM BLOCKSWhen a query sequence is compared with a database of blocks, it is slid along each block and every possible alignment with each block is scored. A PSSM, sometimes called a profile (9), is computed from each block for this purpose. A PSSM has 20 rows, one for each amino acid, and as many columns as there as positions in the block. Each column and row entry in a PSSM contains a numerical score for the alignment of that block column with that amino acid in the query sequence. The total alignment score is the summed over all block columns.
Computing PSSM scores is an area of active research (10-13), and those used by the Blocks Searcher were improved in early 1996 to take advantage of this work. The current scores are log-odds ratios which incorporate position-based pseudo-counts computed from the substitution probabilities underlying the BLOSUM series of scoring matrices (14). The addition of pseudo-counts to the observed counts of amino acids in blocks helps compensate for possible inadequate sampling of family members. We have shown these scores to be an improvement over the former scores, which were based on odds ratios, as well as over other methods in general use (12).
SEARCHING THE PRINTS DATABASE IN BLOCKS FORMATThe Blocks Database depends on documented groups of protein families. While we continue to depend on PROSITE (7) for this documentation, we have begun to supplement it. The Prints protein fingerprint database (15) is very similar in concept to the Blocks Database, although it is constructed differently (16). Groups of conserved motifs for a protein family are excised semi-manually from sequence alignments and used as fingerprints. Additional family members are sought by scanning the protein databases with these fingerprints.
A copy of the PRINTS Database in Blocks Database format can now be searched by the Blocks WWW server with links to the PRINTS WWW server. While many of the families represented in PRINTS are also in the Blocks Database, the PRINTS blocks may differ. PRINTS also contains families not represented in Blocks, so users are encouraged
to search both databases.
AN EMBEDDED CONSENSUS SEQUENCE FOR SEARCHINGMultiple alignment information represented by the blocks for a family provides a simple strategy for improving similarity searches of sequence databases. Consensus residues obtained from the blocks are embedded into a family member to bias it towards the consensus in the family's conserved regions (17). These COBBLER (COnsensus Biasing By Locally Embedding Residues) sequences are provided for each family in the Blocks Database as part of the "Get Block" request and can be used with standard searching programs such as BLAST (18) and FASTA (19) to improve detection of distant family members in the sequence databases. Figure 1B shows the COBBLER sequence for the lipolytic "G-D-X-G" enzyme family (BL01173).
A FAMILY TREE FROM BLOCKSThe "Get Block" request now returns a tree made from the block alignments for the family. The tree is made by the CLUSTAL W program (20) from the alignments in the blocks using the neighbor-joining method (21), and is useful for inferring similar function.
It is based on a matrix of distances between all pairs of block sequence segments. For requests from the Blocks e-mail server, the text "treefile" is returned. It can be displayed by programs such as "drawgram" from the PHYLIP (22) package, or by other programs that recognize the tree format used by PHYLIP. For requests from the Blocks WWW server, there are options to display the tree. The tree made from the lipolytic "G-D-X-G" enzyme family blocks is shown in Figure 1C.
PROTEIN FAMILY WWW LINKSOne of the advantages of basing the Blocks Database on PROSITE has been the link to PROSITE's documentation. This documentation can be immediately accessed when a hit is detected in the Blocks Database. With the expansion of the WWW, some researchers are building WWW pages that contain a more comprehensive set of documentation for their family of interest, including graphics, meeting notifications, etc. We have started an effort to promote the development of more WWW resources (23) of this type. The Blocks Searcher now includes links to related WWW pages with the "Get Block" request.
SEARCHING THE BLOCKS DATABASE WITH BLOCKSComparing sequences with blocks and other types of multiple alignments can be more sensitive than sequence-to-sequence comparisons (9, 17). Advancing one step further, we have developed a method for comparing multiple alignments with multiple alignments (24). Each multiple alignment is treated as a sequence of amino acid distributions.
Multiple alignments can then be aligned with each other by using an appropriate measure for scoring the similarity between amino acid distributions (analogous to the use of an amino acid substitution matrix in sequence-to-sequence alignment). We termed the method LAMA and tested it by searching blocks against the Blocks and other multiple alignment databases. LAMA identified genuine relations between protein families beyond the range of sequence-to-sequence and sequence-to-block search methods (24). LAMA can be used at our WWW server for searching the Blocks and PRINTS Databases and for comparing individual blocks. The WWW server also provides a tool for converting user-provided multiple alignments to the Blocks format required by LAMA, as well as computing the PSSM of the resulting block and a graphical logo representation of it (25, 5).
The following example illustrates the method. LAMA searches of the Blocks Database identified sequence similarity between serine active sites from carboxylesterase, lipase and endopeptidase families (Figure 2). All of these enzymes apparently cleave their substrates using a catalytic triad charge relay system. These relations were very difficult or impossible to detect with sequence-to-sequence searches, and sequence-to-blocks searches only detected the relation between the type-B carboxylesterases and lipolytic "G-D-X-G" enzymes (Table 1).
CURRENT VERSIONS AND USAGEThe Blocks Database and searching program (26, 27) were first made available by anonymous ftp in 1991. Version 9.1 of the Blocks Database contains 3,300 blocks representing 906 different protein families documented in version 12 of PROSITE, and is complemented by version 12.0 of the PRINTS Database (15) with 2,875 blocks from 550 families for a total of more than 1,000 unique families. With an median of 12 (mean
23) sequences per family, about 40% of SWISS-PROT (28) is represented in Blocks 9.1.
The BLIMPS searching program (v. 3.1) is currently used to query these databases. Both BLIMPS and the Blocks Database are freely available by anonymous ftp, however, most biologists prefer to use the Blocks Searcher e-mail service, which was initiated in the summer of 1992, or WWW service, initiated in 1994. The Blocks Searcher averages about 100 searches per day with the e-mail server handling slightly more than half of the requests.
ACCESSAnonymous FTP Siteftp://ncbi.nlm.nih.gov/repository/blocks
E-mail server firstname.lastname@example.org Send the word 'help' in the subject line or as the only word in the message body.
WWW server http://blocks.fhcrc.org/ Additional information or assistance may be obtained by sending an e-mail message to email@example.com.
ACKNOWLEDGEMENTSThis work is supported by a grant from the NIH (GM29009). SP is a Howard Hughes Medical Institute Fellow of the Life Sciences Research Foundation.
REFERENCESHenikoff, S. and Henikoff, J. G. (1991) Nucleic Acids Res., 19,6565-6572.
Henikoff, J. G. and Henikoff, S. (1996) Meth. Enzymol., 266,88-105.
Posfai, J., Bhagwat, A. S., Posfai, G. and Roberts, R. J. (1989) Nucleic Acids 3.
Smith, H. O., Annau, T. M. and Chandrasegaran, S. (1990) Proc. Natl. Acad. Sci.
Henikoff, S., Henikoff, J. G., Alford, W. J. and Pietrokovski, S. (1995) Gene, 5.
6. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. and Wootton, J. C. (1993) Science, 262,208-214.
Bairoch, A. (1992) Nucleic Acids Res., 20,2013-2018.
Henikoff, S. and Henikoff, J. G. (1994) Genomics, 19,97-107.
Gribskov, M., McLachlan, A. D. and Eisenberg, D. (1987) Proc. Natl. Acad. Sci.
10. Brown, M. P., Hughey, R., Krogh, A., Mian, I. S., Sjolander, K. and Haussler, D.
(1993) In Hunter, L., Searls, D. and Shavlik, J. (eds), Proc. First Int. Conf. on Intelligent Systems for Molecular Biology AAAI Press, Washington D. C., pp. 47Tatusov, R. L., Altschul, S. F. and Koonin, E. V. (1994) Proc. Natl. Acad. Sci. USA, 11.
Henikoff, J. G. and Henikoff, S. (1996) CABIOS, 12,135-143.
13. Bailey, T. L. and Gribskov, M. (1996) In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, pp. 15-24.
Henikoff, S. and Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA, 89,10915Attwood, T. K. and Beck, M. E. (1994) Protein Engineering, 7,841-848.
Parry-Smith, D. J. and Attwood, T. K. (1992) CABIOS, 8,451-459.
17. Henikoff, S. and Henikoff, J. G. (1996),,submitted.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) J. Mol.
Pearson, W. R. and Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA, 85,2444Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994) Nucleic Acids Res., 20.
Saitou, N. and Nei, M. (1987) Mol. Biol. Evol., 4,406-425.
Felsenstein, J. (1988) Ann. Rev. Genet., 22,521-565.
Henikoff, S., Endow, S. A. and Greene, E. A. (1996) Trends Biochem. Sci., 21,in 23.
Pietrokovski, S. (1996) Nucleic Acids Res., 24,in press.
Schneider, T. D. and Stephens, R. M. (1990) Nucleic Acids Res., 18,6097-6100.
Henikoff, S., Wallace, J. C. and Brown, J. P. (1990) Meth. Enzymol., 183,111-132.
Wallace, J. C. and Henikoff, S. (1992) CABIOS, 8,249-254.
Bairoch, A. and Boeckmann, B. (1992) Nucleic Acids Res., 20,2019-2022.
Henikoff, S. and Henikoff, J. G. (1994) J. Mol. Biol., 243,574-578.
FIGURE LEGENDSFigure 1.
Blocks from v. 9.1 of the Blocks Database representing the lipolytic "G-D-X-G" enzymes family.