Language models can identify enzymatic binding sites in protein sequences

Recent advances in language modeling have had a tremendous impact on how we handle sequential data in science. Language architectures have emerged as a hotbed of innovation in natural language processing over the last decade, and have since gained prominence in modeling proteins and chemical processes, elucidating structural relationships from textual/sequential data. Surprisingly, some of these relationships refer to three-dimensional structural features, raising important questions about the dimensionality of the information encoded in sequential data. Here, we demonstrate that applying a language model, without supervision, to a language representation of bio-catalyzed chemical reactions can capture the signal underlying substrate-binding site atomic interactions. This allows us to identify the three-dimensional binding site position in unknown protein sequences. The language representation comprises reaction SMILES (simplified molecular-input line-entry system) strings for substrates and products, together with amino acid sequence information for the enzyme. With no supervision, this approach recovers 52.13% of the binding site when considering co-crystallized substrate-enzyme structures as ground truth, vastly outperforming other attention-based models.
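To make this language representation concrete, the sketch below serializes one bio-catalyzed reaction into a single text sample. The '|' separator, the field order, and the toy molecules and enzyme sequence are illustrative assumptions, not necessarily the exact format used in this work.

```python
# Illustrative sketch of a biocatalyzed-reaction language representation:
# reaction SMILES for substrates/products combined with the enzyme amino
# acid sequence. Separator and field order are assumptions.

def build_reaction_text(substrate_smiles: list[str],
                        product_smiles: list[str],
                        enzyme_sequence: str) -> str:
    """Serialize one bio-catalyzed reaction as a single text sample."""
    substrates = ".".join(substrate_smiles)  # multiple species joined with '.'
    products = ".".join(product_smiles)
    # Embed the enzyme sequence alongside the reactants; '>>' is the
    # standard reaction-SMILES arrow separating reactants from products.
    return f"{substrates}|{enzyme_sequence}>>{products}"

example = build_reaction_text(
    ["CC(=O)Oc1ccccc1C(=O)O", "O"],       # toy substrates: aspirin + water
    ["CC(=O)O", "Oc1ccccc1C(=O)O"],       # toy products
    "MSTNPKPQRKTKRNTNRRPQDVKFPGG",        # truncated toy enzyme sequence
)
print(example)
```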


BPE datasets
From the remaining reactions, we extracted the corresponding sequences and noticed that many contain more than 512 amino acids. Without compressing these longer sequences with a Byte Pair Encoding (BPE) approach, we would have been left with only around 50% of the reactions for model training and evaluation. BPE allowed us to represent these longer sequences effectively, enabling the model to learn from a more comprehensive dataset and improving its ability to generalize to unseen data. The frequency plot highlights the alignment between the training and test set distributions, demonstrating the successful incorporation of longer sequences (exceeding 1000 amino acids) into our datasets through the use of BPE.
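As a minimal sketch of this compression step, the snippet below trains a BPE tokenizer on raw amino acid sequences with the Hugging Face tokenizers library; the file name and vocabulary size are assumptions for illustration.

```python
# Train a BPE tokenizer on amino acid sequences so that proteins longer
# than 512 residues compress into fewer tokens.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Sequences contain no whitespace, so each line stays one "word" whose
# characters BPE progressively merges into multi-residue tokens.
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=5000,
                     special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"])
# Assumed layout: one raw sequence (e.g. "MSTNPKPQRKT...") per line.
tokenizer.train(files=["train_sequences.txt"], trainer=trainer)

seq = "MSTNPKPQRKTKRNTNRRPQDVKFPGG" * 40   # toy sequence, >1000 residues
enc = tokenizer.encode(seq)
print(f"{len(seq)} residues -> {len(enc.tokens)} BPE tokens")
```

Because frequent residue n-grams collapse into single tokens, sequences well beyond 512 residues fit within the model's context window.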
Supervised approach vs. RXNAAMapper: We conducted a comparative study between a supervised methodology and RXNAAMapper. The supervised method leveraged token embeddings extracted from a ProtBert model and adopted the XGBoost algorithm to predict whether each token is part of the binding site.
To train the model, we used a portion of the PLIP data, obtained by filtering out sequences appearing in our evaluation set and sequences exceeding the ProtBert context length (512 tokens).
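A hedged sketch of this baseline follows, assuming per-residue binary labels derived from PLIP annotations; the public Rostlab/prot_bert checkpoint supplies the embeddings, and the toy data stands in for the actual training set.

```python
# Per-residue ProtBert embeddings fed to an XGBoost binary classifier
# ("binding site" vs. "not binding site").
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from xgboost import XGBClassifier

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert",
                                          do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert").eval()

def residue_embeddings(sequence: str) -> np.ndarray:
    """One embedding per residue (ProtBert expects space-separated AAs)."""
    inputs = tokenizer(" ".join(sequence), return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[1:-1].numpy()   # drop [CLS] and [SEP]

# Toy data: X is (n_residues, 1024); y is 1 where the residue lies in a
# PLIP-annotated binding site.
sequences = ["MSTNPKPQRKT"]
labels = [[0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]]
X = np.concatenate([residue_embeddings(s) for s in sequences])
y = np.concatenate(labels)

clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
clf.fit(X, y)
pred = clf.predict(residue_embeddings("MSTNPKPQRKT"))  # per-token labels
```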
This figure presents a comparative analysis of the computational efficiency of RXNAAMapper against other language models, such as ProtBERT and BERT-large. The x-axis depicts the number of parameters, while the y-axis represents the number of FLOPs required. RXNAAMapper exhibits significantly fewer parameters and FLOPs than models like ProtBERT, indicating its computational efficiency.

Figure 1: Distance between the predicted binding sites and the ground truth. The distance between the barycenters of the grid boxes centered on the predicted binding sites and those of the ground truth was used to compare our predictions to those of the homology-based approach. (A) was computed by grouping the points in our set by EC class, (B) by reaction class.
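A minimal sketch of this distance computation, assuming residue coordinates (e.g. C-alpha atoms) have already been parsed from the co-crystallized structure; the arrays below are toy values.

```python
# Distance between the barycenter (geometric center) of the grid box
# centered on the predicted binding-site residues and the barycenter of
# the ground-truth residues.
import numpy as np

def barycenter(coords: np.ndarray) -> np.ndarray:
    """Geometric center of a set of 3D residue coordinates, shape (n, 3)."""
    return coords.mean(axis=0)

predicted = np.array([[12.1, 4.3, -7.8],
                      [13.0, 5.1, -6.9],
                      [11.5, 3.8, -8.2]])
ground_truth = np.array([[12.8, 4.9, -7.1],
                         [13.4, 5.6, -6.5]])

distance = np.linalg.norm(barycenter(predicted) - barycenter(ground_truth))
print(f"barycenter distance: {distance:.2f} Å")
```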

Figure 4: Performance of the model evaluated by recall, plotted across different layers and attention heads. The heatmap shows the model's recall score for predicting binding sites, with the x-axis representing layers, the y-axis representing attention heads, and the color bar indicating the recall score. Results are shown for the top-k = 5 setting, with optimal performance observed at head 10 and layer 5.
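The scan behind such a heatmap can be sketched as follows, assuming per-head attention weights extracted from the trained model and known substrate/enzyme token positions; all inputs here are illustrative.

```python
# For every (layer, head) pair: score enzyme residues by the attention
# they receive from substrate tokens, take the top-k residues, and compute
# recall against the annotated binding sites.
import numpy as np

def head_recall(attentions: np.ndarray, substrate_idx: list,
                enzyme_idx: list, true_sites: set, k: int = 5) -> np.ndarray:
    """attentions: (layers, heads, seq_len, seq_len) attention weights."""
    n_layers, n_heads, _, _ = attentions.shape
    recall = np.zeros((n_layers, n_heads))
    for l in range(n_layers):
        for h in range(n_heads):
            # Attention flowing from substrate tokens onto enzyme residues.
            score = attentions[l, h][np.ix_(substrate_idx, enzyme_idx)].sum(axis=0)
            top_k = {enzyme_idx[i] for i in np.argsort(score)[-k:]}
            recall[l, h] = len(top_k & true_sites) / len(true_sites)
    return recall
```

Averaging this per-head recall over the evaluation set yields a layer-by-head heatmap like the one in Figure 4.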

Table 1: Tokenizer training statistics: median number of tokens per sequence.

Table 2: F1 score and Balanced Accuracy (BACC) on binding site prediction. Reported in the table are the F1 and BACC for binding site prediction using PLIP as ground truth. All the models show a very low F1 score, indicating that they struggle to balance precision and recall. This is driven by a high number of false positives, which stem from the inherent class imbalance of binding sites: they represent only a small portion of the total sequence.

Binding site distances from ground truth: this distribution plot illustrates the distances of the binding sites predicted by RXNAAMapper and by Token Classification, relative to the PLIP annotations. While both methods display comparable means, a notable difference emerges in their standard deviations: RXNAAMapper exhibits a tighter spread with µ = 9.99 and σ = 4.56, while Token Classification presents a broader distribution with µ = 10.07 and σ = 10.87.
Based on these observations, we infer that when prior knowledge of the amino acid (AA) sequence is available, training a supervised model on similar protein data and employing it for binding site prediction confers distinct advantages. Conversely, when the protein's origin remains elusive or the protein diverges significantly from well-characterized counterparts, RXNAAMapper yields deeper insight into the AAs likely involved in the binding site.