Accurate prediction of protein function using statistics-informed graph networks

Accurate prediction of protein function using statistics-informed graph networks

PhiGnet for protein function annotations

In this study, we developed the PhiGnet method using statistics-informed graph networks to annotate protein functions and to identify functional sites across species based on their sequences (Fig. 1). To assimilate knowledge from the evolutionary couplings (EVCs, relationships between pairwise residues at two co-variant sites) and the residue communities (RCs, hierarchical interactions among residues)12, we devised the method with a dual-channel architecture, adopting stacked graph convolutional networks (GCNs) (Fig. 1a). This method specializes in assigning functional annotations, including Enzyme Commission (EC) numbers and Gene Ontology (GO) terms (biological process, BP, cellular component, CC, and molecular function, MF), to proteins. When provided with a protein sequence, we derive its embedding using the pre-trained ESM-1b model31. Subsequently, we input the embedding as graph nodes, accompanied by EVCs and RCs (graph edges), into the six graph convolutional layers of the dual stacked GCNs. These layers, working in conjunction with a block of two fully connected (FC) layers, meticulously process the information from the two GCNs, ultimately generating a tensor of probabilities for assessing the viability of assigning functional annotations to the protein. In addition, an activation score, derived using the gradient-weighted class activation maps (Grad-CAMs) approach32, is used to assess the significance of each individual residue in a specific function. The score allows PhiGnet to pinpoint functional sites at the level of individual residues (bottom, Fig. 1c, see Methods).

Fig. 1: PhiGnet annotates protein functions.
figure 1

a PhiGnet predicts protein function from sequence alone. Given a sequence, PhiGnet learns the pre-embedding, EVCs, and RCs using stacked GCNs to infer protein function annotations. b RCs of the Serine-aspartate repeat-containing protein D (SdrD, PDB ID: 4JDZ). The two communities (community I and community II) with coupling strengths in bars are highlighted in red and blue. Each bar in either community I or II illustrates the strength of coupling that a residue has with others, while the conservation scores of these residues are depicted in the bars on the right. On the tertiary structure of SdrD (right), the residues within the community I (red) bind to the calcium ions (sphere in yellow) are shown in sticks, while the residues within the community II (blue) adopt cartoon in blue. c Function annotations of the MgIA protein at the residue level. The activation score (bottom) computed by PhiGnet is to measure the importance of each residue, where the higher the score is, the more likely it is to adopt a functional role in biological activity. Compared to functional sites in BioLip (marked with Y in black), the score indicates that co-evolved residues may be more important than those at conserved positions (top). The scores are mapped to color the MgIA 3D structure (PDB ID: 6IZW) from lower (blue) to higher (red), GDP is shown with sphere in yellow, SO4 in stick in cyan, and Mg2+ ion in a sphere with orange. Source data are provided as a Source Data file.

As an example, we computed RCs for the Serine-aspartate repeat-containing protein D (SdrD) that promotes bacterial survival in human blood by inhibiting innate immune-mediated bacterial killing33,34. Two RCs are mapped on a fully β sheet fold that binds to three Ca2+ ions (1Ca2+ is enclosed in a loop, 2Ca2+ is more solvent exposed and closer to 3Ca2+, which is coordinated by an asparagine (N564) and an aspartic acid (D665), Fig. 1b). Within the community I, most residues (in red sticks) that are identified from EVCs bind to the three Ca2+ ions, contributing together to stabilize the SdrD fold. This suggests that EVCs contain the essential information for deducing the functional role of residues, even when they are sparsely distributed across RCs. Empowered by EVCs and RCs, we implemented the present PhiGnet to assess the functional significance of residues. We carried out PhiGnet to calculate the activation scores for the functional sites of the mutual gliding-motility (MgIA) protein (annotated with EC 3.6.5.2) (Fig. 1c). The resulting activation scores show that the residues with high scores ( 0.5) are in agreement with or close to that of semi-manually curated BioLip database35. Moreover, these residues are located at the most conserved positions (top left, Fig. 1c). Upon mapping these scores onto the 3D structure of MgIA, the activation scores highlight residues (red) that constitute a pocket that binds the guanosine di-nucleotide (GDP) and play a role in facilitating nucleotide exchange36. Together, this suggests that residues at functional sites are conserved through natural evolution, and that PhiGnet is capable of capturing such information, improving the method for predicting protein function at the residue level, even in the absence of structural data.

PhiGnet annotates protein functional sites

Many proteins perform their biological functions through essential residues that are sparsely distributed across different structural levels (e.g., primary, secondary, and tertiary) and are linked to functional sites (such as enzyme active sites, ligand-binding sites, or protein-protein interaction sites). Given the functional contributions of amino acids can significantly differ from one function to another, a key feature of PhiGnet is its ability to quantitatively estimate the importance of individual amino acids for a specific function, enabling us to identify residues that are pertinent to distinct biological activities.

Are the computational predictions as accurate as experimentally determined function annotations? To address this question, we carried out quantitative examinations of the contribution of each amino acid to a protein function using the activation score. We evaluated the predictive performance of PhiGnet and assessed the importance of residues (their contributions to protein function) in nine proteins: the c2-domain of cytosolic phospholipase A2α (cPLA2α), Tyrosine-protein kinase BTK (TpK-BTK), Ribokinase, alpha-lactalbumin (αLA), MCM1 transcriptional regular (MCM1-TR), the Fos-Jun heterodimer (FosJun), the thymidylate kinase (TmpK), Ecl18kI, and helicobacter pylori uridylate kinase (HPUK). These proteins vary in size from approximately 60 to 320 residues, harbor different folds, and perform diverse functions, including ligand binding, ion interaction, and DNA binding. We calculated the activation score for each residue in the nine proteins, comparing them to residues identified through either experimental or semi-manual annotations. Our method demonstrated promising accuracy (with an average 75%) in predicting significant sites at the residue level, in a good agreement with actual ligand-/ion-/DNA-binding sites (Fig. 2). The activation score per residue, mapped onto their 3D structures, exhibits significant enrichment for functional relevance at the binding interfaces. PhiGnet accurately identifies functionally significant residues with high activation scores for the proteins (Fig. 2, Supplementary Figs. S1 and S2).

Fig. 2: PhiGnet annotates protein function at the residue level.
figure 2

a The activation score of each residue is predicted using PhiGnet and compared to the biologically relevant ligand-protein binding sites from the BioLiP database. b The activation scores are mapped to the tertiary structures of nine proteins, including (left to right, top to bottom) the c2-domain of cytosolic phospholipase A2α (cPLA2α, PDB ID: 6IEJ)37, Tyrosine-protein kinase BTK (TpK-BTK, PDB ID: 6W8I), Ribokinase (PDB ID: 6XK2), alpha-lactalbumin (αLA, PDB ID: 1HFX)38, MCM1 transcriptional regular (MCM1-TR, PDB ID: 1MNM)60, the Fos-Jun heterodimer (FosJun, PDB ID: 1FOS)61, the thymidylate kinase (TmpK, PDB ID: 3TMK)62, Ecl18kI (PDB ID: 2GB7)39, and helicobacter pylori uridylate kinase (HPUK, PDB ID: 4A7W)63. Source data are provided as a Source Data file.

Across the proteins cPLA2α, Ribokinase, αLA, TmpK, and Ecl18kI, PhiGnet predicted near-perfect functional sites compared to the experimental identifications. For instance, for cPLA2α, our method accurately identified residues, Asp40, Asp43, Asp93, Ala94 and Asn95, that bind to 1Ca2+ and residues, Asp40, Asp43, Asn65 and Thr41, that bind to 4Ca2+, as well as a residue Asn65 supports 3Ca2+ for stabilizing fold37. Moreover, our method predicted a high score (0.6) for the residue Tyr96, which plays a crucial role in lipid headgroup recognition through cation-π interaction with the phosphatidylcholine trimethylammonium group37. We also applied PhiGnet to αLA, which contains a single, tightly bound calcium ion that is cradled in the EF-hand motif to stabilize the protein against denaturation38. In the αLA protein, the important motif is computationally characterized by a constellation of residues: Lys79, Asp82, Asp84, Asp87, and Asp88. In Ecl18kI, the major groove contacts the bases of the recognition sequence through the three consecutive residues Arg186, Glu187 and Arg188. Specifically, Arg186 and Arg188 form bidentate hydrogen bonds to the outer and inner guanines, respectively. The side chain oxygen atoms of Glu187 each accept one hydrogen bond from the two neighboring cytosines of the recognition sequence. Moreover, the sequence-specific minor groove contacts are exclusively mediated by Glu11439. To evaluate the importance of each residue in Ecl18kI, we computed the activation scores for each residue. These scores confirmed the agreement between the residues captured by PhiGnet and those identified through experimental data. For the proteins MCM1-TR and FosJun, our method captured residues with top activation scores that bind to DNAs, although not all of the residues at functional sites were characterized by high probabilities. Meanwhile, the activation scores failed to highlight function-relevant sites for a few residues. For instance, few residues with scores  >0.5 were not located at the functional sites in Ribokinase, αLA, and HPUK. This discrepancy could be attributed to the noise present in EVCs. Together, the activation scores can indicate essential ligand-/ion-contacting residues, suggesting that learning from diverse levels of evolutionary knowledge can identify binding interfaces at the residue level. Such capability would be valuable in discerning interfaces both inter- and intra-proteins, even in the absence of structural information. Moreover, the predictions suggest that learning from evolutionary knowledge enables us to understand residues arranged in highly ordered patterns, relevant to diverse binding activities. On the other hand, biases originating from the evolutionary data could obscure the activation scores for accessing the functional significance of residues. Collectively, the activation scores can underscore essential ligand-/ion-contacting residues, indicating that learning from diverse levels of evolutionary knowledge can effectively identify binding interfaces at the residue level. Conversely, noise originating from the evolutionary data could influence the activation scores, potentially leading to biases in the identification of functional sites.

PhiGnet outperforms other state-of-the-art methods

To assess the predictive performance of PhiGnet, we implemented the method to infer function annotations (EC numbers and GO terms) for proteins in the two benchmark test sets (see Methods). We proceeded to compare our method against state-of-the-art methods, including alignment-based methods (BLAST18, FunFams40, and Pannzer41), deep learning-based methods (DeepGO25, DeepFRI21, DeepGOWeb42, ProteInfer43, SPROF-GO44, ATGO+45, and CLEAN46). Two essential metrics, including the protein-centric Fmax-score and the area under the precision-recall curve (AUPR), were utilized for the comparisons. Our method demonstrated predictive capabilities for assigning function annotations to proteins across the two test sets. It achieved an average AUPR of 0.70 and 0.89, as well as Fmax scores of 0.80 and 0.88, for GO terms and EC numbers, respectively (Fig. 3). Moreover, it consistently maintained strong performance, with average AUPR scores of 0.64, 0.65, and 0.80, alongside corresponding Fmax values of 0.82, 0.75, and 0.81, for the three branches of GO terms – CC, BP, and MF (Fig. 3d). Overall, PhiGnet significantly outperformed all supervised and unsupervised approaches across the benchmark datasets. For example, in the benchmark of EC numbers, we compared the predictions of various methods, including BLAST, FunFams, DeepGO, DeepFRI, Pannzer, ProteInfer, and CLEAN, against experimentally determined function annotations across the test proteins. Our method yielded Fmax score of 0.88 and AUPR of 0.89, surpassing the performance of other approaches (Fig. 3a, b, Supplementary Fig. S3). All the compared methods exhibited various performances, as illustrated in the precision-recall curves. DeepFRI, Pannzer, and ProteInfer achieved a similar Fmax score, approximately 0.68, outperforming BLAST and DeepGO. In terms of AUPR, FunFams, DeepFRI, and CLEAN yielded similar performances, which were better than those of ProteInfer and Pannzer. PhiGnet achieved Fmax of 0.88 and AUPR of 0.89, respectively, outperforming the CNN-based DeepGO (Fmax of 0.37 and AUPR of 0.21), structure-based DeepFRI (Fmax of 0.69 and AUPR of 0.70), and the contrastive learning-based CLEAN (Fmax of 0.76 and AUPR of 0.70) (Fig. 3a, b, Supplementary Fig. S3). These results suggest that PhiGnet has the ability to achieve accurate assignment of EC numbers to proteins. In the benchmark of GO terms, we compared our method against nine state-of-the-art methods, utilizing the same metrics to evaluate their performance. Across predictions of CC, BP, MF ontologies, PhiGnet achieved Fmax of 0.82, 0.75, 0.81 and AUPR of 0.64, 0.65, 0.80, respectively, which are significantly better than those of the compared methods. Notably, although ensemble-networks-based ProteInfer outperformed the remaining approaches over MF and BP ontologies, and the alignment-free SPROF-GO and structure-based DeepFRI excelled over CC ontology, PhiGnet’s performance remained superior (Fig. 3d, e, Supplementary Figs. S4–S7, and Table S1). Comparing predictive performances on the GO terms, we found that PhiGnet achieved first place in both accuracy and robustness, significantly better than the eight methods above and another prediction from a web server, DeepGOWeb (Fig. 3d–f).

Fig. 3: Comparisons among different methods across GO terms in various ontologies and EC numbers.
figure 3

a Precision-recall curves illustrate the performance of different methods in predicting EC numbers for proteins. b Protein-centric Fmax scores and function-centric AUPR scores are computed across all test proteins to predict EC numbers, where the scores are presented as mean values with standard deviations of 10 bootstrap iterations. c Evaluation of robustness in predicting EC numbers as sequence identity increases, where the Fmax scores of each method at different sequence identities are depicted as boxplots of 50 bootstrap iterations, with the median values at the center and the interquartile range shown by the upper and lower edges of the boxes. d Precision-recall performance across GO terms in different ontologies. e Left, violin plots showing AUPR with the median values at the center of the distribution of 10 bootstrap iterations, and right, Fmax scores for the different methods in predicting CC, BP, and MF. f Computed Matthews correlation coefficient between predicted scores and ground-truth values for both EC numbers and GO terms using different methods. Source data are provided as a Source Data file.

Moreover, we demonstrated the robustness of PhiGnet for generalization to test proteins with varying thresholds of sequence identity compared to the proteins in the training set. At various maximum sequence identity levels (30%, 40%, 50%, 70%, and 95%), PhiGnet exhibited improved predictive performance as sequence identity increased (Fig. 3c, Supplementary Fig. S5). PhiGnet has been ranked among the top two robust methods for the test set of EC numbers, demonstrating consistently predictive performance with Fmax values of 0.61 and 0.72 at sequence identity levels of 30% and 40%, respectively. When compared to the domain-based method FunFams (Fmax of 0.67 and 0.74), PhiGnet slightly underperformed at sequence identity thresholds of 30% and 40%. However, PhiGnet achieved comparable or better performance when sequence identity exceeded 50%. Similarly, the performance of DeepFRI, FunFams, ProteInfer, and CLEAN also improved as sequence identity increased. Pannzer exhibited a similar trend when sequence identity was below 50%, but its performance remained nearly constant with a slight decrease in Fmax. In contrast, both BLAST and DeepGO showed slight improvements as the proteins in the test set increased sequence identity to those in the training set. The robustly predictive performance of PhiGnet has also been demonstrated by predicting the three branches of GO terms, maintaining high accuracy even at low sequence identity (Supplementary Fig. S5). In predictions of both EC numbers and GO terms, we also calculated the Matthew’s correlation coefficient (MCC) between the predicted scores and ground truth to quantitatively compare the performance of various methods. PhiGnet achieved an average MCC of 0.76, which is higher than the average MCCs of the other ten state-of-the-art methods (Fig. 3f).

PhiGnet driven by evolutionary signatures

The evolutionary data plays an important role in PhiGnet for predicting protein function annotations and identifying functional sites. First, we performed ablation experiments to test how EVCs/RCs contribute to PhiGnet. We trained PhiGnet using either EVCs or RCs alone and assessed its performance in terms of Fmax-score and AUPR over predictions of EC numbers/GO terms. To accomplish this, we chose a threshold (0.2) for both EVCs and RCs based on the similar performances in predicting EC numbers/GO terms (Supplementary Fig. S8), aiming to mitigate potential noise arising from coevolution or weak couplings between pairwise residues. We first test whether the information in EVCs, which preserve evolutionary couplings at sites of co-variation, is sufficient to infer functional annotations. The second experiment tests the necessity of information in RCs that independently capture high-order couplings. Similarly, we built a model using RCs alone to computationally assign functional labels to proteins, and this model produced slightly better predictions (Supplementary Figs. S9 and S10). The two experiments indicate that both models demonstrate the capability to accurately assign functional annotations to proteins. Moreover, PhiGnet, utilizing either EVCs or RCs, demonstrates a robust capacity to learn general sequence-function relationships, often better than or as good as other approaches, even test proteins exhibiting low sequence identity in presence of the training set (Fig. 3c, Supplementary Figs. S9c and S10c). Through precision and robustness comparisons, we have demonstrated that the evolutionary signatures (EVCs and RCs) constitute crucial attributes capable of enhancing deep learning-based methods for protein function annotations.

Secondly, we asked whether the residues, particularly within RCs that are often relevant to the specific function, can be quantified for functional sites. To address this, we further investigated the capability of PhiGnet to characterize meaningful features from the identified function-relevant residues within the residue communities. The activation scores were computed for the residues to underscore their contributions to the protein function. Notably, the predicted residues concurred with those at the functional sites identified through experimental determinations, better identifications than those in RCs (Fig. 4). In the human cytidine deaminase (hCDA) protein47, compared to residues within RCs that were identified as functionally relevant, PhiGnet quantitatively characterized their importance in the binding between hCDA and Zn2+/BRD through more accurate predictions of active sites: Cys65, Cys99, and Cys102, which coordinate with the zinc ion, as indicated by the activation scores (Fig. 4a). In the Peroxide operon regulator (PerR), we also observed that PhiGnet narrowed down the number of residues located within RCs48 and effectively distinguished non-Zn2+-binding residues from the binding ones, compared to RCs. Specifically, Cys96, Cys99 and Cys136, Cys139 exhibited much higher activation scores. These residues collectively coordinate the zinc ion, locking the three β-strands together to form the arrangement of the dimeric β-sheet, in contrast to the non-binding residues (Fig. 4b). In light of these results, we conclude that the evolutionary information, particularly that contained in RCs, is sufficient to specify a protein’s function and to quantitatively characterize the residues at the functional sites. Moreover, the results argue that RCs contain evolutionary knowledge at a higher-ordered level than the information in EVCs at a lower-ordered level. Meanwhile, information contained in RCs plays an important role in enhancing PhiGnet’s ability to identify functionally relevant sites at the residue level.

Fig. 4: PhiGnet learns evolutionary signatures for identification of protein functional sites.
figure 4

Mappings of RCs and activation scores of (a) the human cytidine deaminase protein (hCDA, PDB ID: 1MQ0-A, GO term 0008270), and (b) the Peroxide operon regulator (PDB ID: 2FE3-A, GO term 0046872). The residues within each RC are shown in the chord plotting with coupling strength and degree of conservation in bars. The activation scores (dotted lines) of each protein are compared to the BioLip identifications (marked with Y in black), and residues with high scores (in red) are also compared to those within RCs on their 3D structures. The 1-beta-ribofuranosyl-1,3-diazepinone (BRD) and Zn2+ ions are shown with spheres in yellow (orange for the Zn2+ ion in hCDA). Source data are provided as a Source Data file.

Test on CAFA3 targets

To assess whether the different performances of the methods under evaluation, and the superiority of PhiGnet were inherent to the algorithms or due to different training sets, we re-executed two alignment-based methods (BLAST and FunFams) and conducted retraining on four deep learning-based methods (DeepGO, ATGO+, SPROF-GO, and PhiGnet). Other methods were excluded primarily due to the unavailability of trainable source codes or because such method required unavailable structural information) against an identical dataset. We used the third Critical Assessment of Protein Function Annotation (CAFA3) dataset consisting of 66,841 proteins49. To address homology issues, proteins sharing over 30% sequence identity with the test proteins were excluded from the training dataset45. The remaining proteins were utilized to construct databases for BLAST and FunFams. 95% of them were randomly selected for training DeepGO, ATGO+, SPROF-GO, and PhiGnet, with the remaining 5% reserved for validation to fine-tune the methods’ parameters. Moreover, we conducted comparisons among the different methods using the CAFA3 test proteins either with less than 60% sequence identity to those in the training dataset or without redundancy removal (Supplementary Fig. S12).

A comparison among the six different methods implemented on the CAFA3 dataset reveals that PhiGnet exhibits the best performance across both \(\rmF_\max \) and AUPR metrics (Table 1, Supplementary Fig. S12). PhiGnet achieved the highest \(\rmF_\max \) scores across all three categories: BP (0.531), CC (0.584), and MF (0.606), indicating its superior capability in predicting functional annotations across diverse biological processes, cellular components, and molecular functions compared to methods such as BLAST, DeepGO, FunFams, and ATGO+. Furthermore, PhiGnet outperformed other methods with AUPR scores of 0.425 for BP, 0.590 for CC, and 0.571 for MF, demonstrating its effectiveness in accurately identifying true positive annotations while minimizing false positives across various functional categories. Although methods like BLAST, DeepGO, FunFams, and ATGO+ exhibited respectable performance in specific categories, none consistently achieved high scores across both \({{{{\rmF}}}}_\max \) and AUPR metrics as PhiGnet did. Overall, the comparison underscores PhiGnet as one of the state-of-the-art methods on the CAFA3 dataset, demonstrating that its increased performance is independent of the training dataset used.

Table 1 Comparison of different methods on the CAFA3 dataset

Predicting functions of holdout and unannotated proteins

Can PhiGnet annotate uncharacterized proteins? We carried out our predictions for the independent hold-out set of 6229 proteins (Supplementary Fig. S13). We followed the same procedures to collect EVCs, RCs, and sequence embeddings for all the proteins. They were utilized to feed into the fine-tuned PhiGnet in order to compute a probability tensor for assigning functional annotations to the proteins. Among the collected proteins, our method’s overall performance was superior to that of state-of-the-art methods. Given that these proteins were independently collected, our computational predictions can be valuable in assigning functional annotations to new proteins (Supplementary Figs. S14, S15, and Table S2). For example, across the T. forsythia NanH (PDB ID: 7QXO) and human Sar1b (PDB ID: 8E0A), the activation scores successfully indicate the functional sites that bind to Oseltamivir and guanosine tetraphosphate (Supplementary Fig. S16). Our analysis shows that PhiGnet’s high confidence prediction is in a good agreement with experimental annotations, suggesting that it would contribute to computational efforts for assigning function annotations to proteins with unknown labels. This applies even when dealing with experimental annotations of lower confidence scores, and can benefit experimental investigations of different biological activities. Moreover, by leveraging evolutionary information, PhiGnet provides function annotations as well as residue-level activation scores for over 2.5 million individual sequences within the UniProt database. The activation score assigned to each individual residue offers a quantitative measure of its significance in a specific activity, proving beneficial for screening experiments aimed at identifying functionally important sites.

link