Drew Bryant, Mapping the Structural Landscape of Protein Families with Geometric Feature Vectors

Structural variations caused by a wide range of physicochemical and biological sources directly influence the function of a protein. For enzymatic proteins, most, if not all, of the functional properties are associated with the binding site, which can be loosely defined as a substructure of the protein. Comparative analysis of drug-receptor substructures across and within species has been used for lead evaluation. Substructure-level similarity between functionally similar proteins has also been used to identify instances of convergent evolution among proteins. In functionally homologous protein families, shared chemistry and geometry at catalytic sites provide a common, local point of comparison among proteins that may differ significantly at the sequence, fold, or domain topology levels. Here we describe two methods that can be used separately or in combination for protein function analysis. The first, the Family-wise Analysis of SubStructural Templates (FASST) method, uses all-against-all substructure comparison to determine intra-family "ontologies," which characterize the substructural variation within a protein family.

Our results demonstrate examples of automatically determined intra-family ontologies that can be linked to phylogenetic distance between family members, segregation by ligation state, and organization by ancestry among convergent protein lineages. The Motif Ensemble Statistical Hypothesis (MESH) framework constructs a representative template for each of the protein sub-groups within the intra-family ontology determined by FASST. The resulting motif ensembles are shown, through a series of function prediction experiments, to improve the function prediction power of existing templates. FASST contributes a critical feedback and assessment step to existing substructure identification methods and can be used for the thorough investigation of structure-function relationships. MESH provides an automated, statistically rigorous procedure for incorporating structural variation data into protein function prediction pipelines.
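To make the two-stage idea concrete, the following is a minimal Python sketch, not the FASST or MESH implementation. It makes several assumptions that do not come from the abstract: each substructure is a short list of matched 3D points, similarity is measured with distance RMSD (dRMSD, which needs no superposition), sub-groups are formed by single-linkage grouping under a hypothetical cutoff, and each group's representative is its medoid. All function names and the toy coordinates are illustrative.

```python
from itertools import combinations
from math import dist, sqrt


def drmsd(a, b):
    """Distance RMSD between two equally sized, identically ordered point sets."""
    pairs = list(combinations(range(len(a)), 2))
    sq = [(dist(a[i], a[j]) - dist(b[i], b[j])) ** 2 for i, j in pairs]
    return sqrt(sum(sq) / len(sq))


def all_against_all(sites):
    """Pairwise dRMSD for every ordered pair of named substructures."""
    names = sorted(sites)
    return {(p, q): drmsd(sites[p], sites[q]) for p in names for q in names}


def group_by_cutoff(names, d, cutoff):
    """Single-linkage grouping: connected components of the graph d <= cutoff."""
    groups, unseen = [], set(names)
    while unseen:
        seed = min(unseen)
        unseen.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            cur = frontier.pop()
            near = sorted(n for n in unseen if d[(cur, n)] <= cutoff)
            unseen.difference_update(near)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return groups


def medoid(group, d):
    """Member with the smallest summed distance to the rest of its group."""
    return min(group, key=lambda p: sum(d[(p, q)] for q in group))


def triangle(side, origin=(0.0, 0.0, 0.0)):
    """Toy three-point 'substructure': a right triangle with the given leg length."""
    x, y, z = origin
    return [(x, y, z), (x + side, y, z), (x, y + side, z)]


# Hypothetical family: three small sites and two large ones (one translated).
sites = {
    "A": triangle(1.0),
    "B": triangle(1.1),
    "E": triangle(1.2),
    "C": triangle(3.0),
    "D": triangle(3.0, origin=(5.0, 5.0, 5.0)),
}

d = all_against_all(sites)
groups = group_by_cutoff(sorted(sites), d, cutoff=0.5)  # cutoff is an assumption
templates = [medoid(g, d) for g in groups]              # one representative per group

print(groups)
print(templates)
```

In this sketch the grouping plays the role of an intra-family ontology and the list of medoids plays the role of a motif ensemble. A real pipeline would need a residue-matching step, a chemically aware similarity measure, and a statistical model for scoring matches, none of which are shown here.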

Our work provides an unbiased, automated assessment of the structural variability of identified substructures among protein structure families and a technique for exploring the relation of substructural variation to protein function. As available proteomic data continues to expand, the techniques proposed will be indispensable for the large-scale analysis and interpretation of structural data.