
A unified deep framework for peptide–major histocompatibility complex–T cell receptor binding prediction

Data processing

Base dataset processing: we collect a base dataset in which the P–M binding dataset is obtained from BigMHC35; the P–T binding dataset is obtained from PanPep, DLpTCR36 and NetTCR37; and the P–M–T binding dataset is obtained from ERGO38 and pMTnet. To further expand the number of P–T pairs, we download the TCR-full-v3 dataset from the IEDB database22 and amalgamate all peptides and TCRs involved in it. Following data collection, we uniformly preprocess all datasets to ensure consistent formatting of peptides, MHCs and TCRs. We remove duplicates and anomalies, such as garbled characters and incomplete MHC subtypes. We compile all datasets into three edge datasets, namely, P–M–T, P–M and P–T, saved as three separate files. On the basis of the compiled edge datasets, peptides, MHCs and TCRs are uniformly processed as follows:

  • Extract and deduplicate all peptides involved in P–M–T, P–M and P–T, encoding them as $(p_1, p_2, \ldots, p_{k_p})$.

  • Extract and deduplicate all MHC subtypes involved in P–M–T and P–M, encoding them as $(m_1, m_2, \ldots, m_{k_m})$.

  • Extract and deduplicate all TCR β sequences involved in P–M–T and P–T, encoding them as $(t_1, t_2, \ldots, t_{k_t})$.

The above process ensures the creation of a comprehensive and well-structured dataset for subsequent analysis. The created dataset contains 291,632 peptides, 208 MHCs, 144,053 TCRs, 593,109 P–M–T edges, 70,112 P–M edges and 155,479 P–T edges. The statistics of the generated dataset are listed in Supplementary Table 1.
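The deduplication and indexing step can be sketched as follows; the toy edge tables and column names are illustrative assumptions, not the authors' exact schema.

```python
import pandas as pd

# Toy edge tables standing in for the three compiled edge files (P-M-T, P-M, P-T);
# the column names are hypothetical.
pmt = pd.DataFrame({"peptide": ["GILGFVFTL"], "mhc": ["HLA-A*02:01"], "tcr": ["CASSIRSSYEQYF"]})
pm = pd.DataFrame({"peptide": ["NLVPMVATV"], "mhc": ["HLA-A*02:01"]})
pt = pd.DataFrame({"peptide": ["GILGFVFTL"], "tcr": ["CASSPVTGGIYGYTF"]})

def build_index(series_list):
    """Deduplicate entities across edge tables and assign consecutive integer codes."""
    unique = sorted(set(pd.concat(series_list)))
    return {value: code for code, value in enumerate(unique)}

# Peptides occur in all three edge sets; MHCs in P-M-T and P-M; TCRs in P-M-T and P-T.
peptide_idx = build_index([pmt["peptide"], pm["peptide"], pt["peptide"]])
mhc_idx = build_index([pmt["mhc"], pm["mhc"]])
tcr_idx = build_index([pmt["tcr"], pt["tcr"]])
```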

P–M–T dataset processing: we obtain the dataset from pMTnet, which contains 30,801 triplets in the P–M–T training set and 619 triplets in the P–M–T testing set. Following pMTnet, we keep those triplets with a class I MHC pseudo-sequence available, resulting in 29,342 positive triplets in the P–M–T training set and 551 positive triplets in the P–M–T testing set (186 out of 219 Ps are unseen in the training set). For each positive triplet in the P–M–T training set, ten times more negative triplets are generated by random shuffling (the P of a positive triplet is fixed, and we randomly select an M and a T among all Ms and Ts) to obtain the overall P–M–T training set. The P–M–T testing set is generated in the same manner. Our model requires not only the P–M–T triplets but also the corresponding P–M and P–T pairs. To obtain the corresponding P–M training set, we search for the P–M pairs in our collected base dataset whose Ps are within the P–M–T training set and omit those pairs whose Ms only exist in the P–M–T testing set (2,060 pairs). To obtain the P–T training set, we search for the P–T pairs in our collected base dataset whose Ps are within the P–M–T training set and omit those pairs whose Ts only exist in the P–M–T testing set (33,995 pairs, including 33,959 positive pairs and 36 negative pairs; a further 33,959 negative pairs are generated via random shuffling). Note that this negative sampling strategy, which fixes P while M and T are randomly selected, may generate negative samples that are relatively easy to predict. For instance, if a P–M pair is readily predicted as non-binding, the corresponding P–M–T triplet might be directly classified as negative. However, this does not affect the fairness of our evaluation, since both our model and the baselines encounter the same set of negative samples generated through this strategy. The statistics of the P–M–T dataset we use and the number of positive interactions in the P–M–T training set for Ts in the P–M–T testing set are shown in Supplementary Tables 2 and 4.
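A minimal sketch of this shuffling-based negative sampling (peptide fixed, MHC and TCR drawn at random, known positives excluded); the DataFrame schema and the toy triplets are hypothetical.

```python
import random
import pandas as pd

def shuffle_negatives(positives: pd.DataFrame, n_per_pos: int = 10, seed: int = 0) -> pd.DataFrame:
    """For each positive (p, m, t) triplet, fix the peptide and draw random MHC and
    TCR partners, skipping combinations that are already known positives."""
    rng = random.Random(seed)
    mhcs, tcrs = positives["mhc"].unique().tolist(), positives["tcr"].unique().tolist()
    known = set(zip(positives["peptide"], positives["mhc"], positives["tcr"]))
    rows = []
    for p in positives["peptide"]:
        made = 0
        while made < n_per_pos:
            m, t = rng.choice(mhcs), rng.choice(tcrs)
            if (p, m, t) not in known:
                rows.append({"peptide": p, "mhc": m, "tcr": t, "label": 0})
                made += 1
    return pd.DataFrame(rows)

# Toy positives; the real triplets come from the processed training set.
positives = pd.DataFrame({"peptide": ["GILGFVFTL", "NLVPMVATV", "ELAGIGILTV"],
                          "mhc": ["HLA-A*02:01", "HLA-B*07:02", "HLA-A*11:01"],
                          "tcr": ["CASSIRSSYEQYF", "CASSPVTGGIYGYTF", "CASSLAPGATNEKLFF"],
                          "label": 1})
negatives = shuffle_negatives(positives, n_per_pos=2)
```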

P–M–T neoantigen dataset processing: the P–M–T training set and its corresponding P–M and P–T training sets are the same as those of the processed P–M–T dataset above. We obtain the P–M–T neoantigen testing set (12 positive and 96 negative triplets generated via random shuffling) by restricting the Ts to neoantigen-specific Ts and not shuffling the P–M pairs in the negative set. Specifically, we identify 12 neoantigen triplets from the original P–M–T testing set, forming the positive sample set. From the neoantigen-specific Ts present in the P–M–T dataset, we identify 36 unique Ts with available embeddings. For each positive sample, we fix P and M while all possible Ts are shuffled to generate the negative samples, ensuring that the generated negative P–M–T triplets do not overlap with any positive triplets.

P–M IEDB dataset processing: as in CapsNet-MHC, we retrieve the sequence-level P–M binding data from IEDB (MHC class I binding prediction)22. The P–M dataset, named BD2013, is downloaded from /, and involves both immunogenicity prediction and antigen presentation data. From this dataset of 186,684 pairs, we retain the entries whose HLA alleles belong to HLA-I and keep those pairs with a class I MHC pseudo-sequence available. The resulting dataset contains 156,844 pairs, of which we randomly select 80% (125,475 pairs) for training and the remainder for testing (31,369 pairs, where 2,357 out of 14,608 Ps are unseen in the training set). Note that we retain only the P–M pairs with a peptide length of nine for training and testing DeepSeqPan and Anthem. Our model requires not only the P–M pairs but also the corresponding P–M–T and P–T pairs. To obtain the corresponding P–M–T training set, we search for the P–M–T triplets in our collected base dataset in which both the peptide and MHC are within the P–M training set and omit those triplets whose Ps or Ms only exist in the P–M testing set, resulting in 45,227 positive triplets. The same number of negative triplets is generated via random shuffling (the P of a positive triplet is fixed, and we randomly select an M among all Ms). To obtain the corresponding P–T training set, we search for the P–T pairs in our collected base dataset whose Ps are within the P–M training set and omit those pairs whose Ps only exist in the P–M testing set (117,699 pairs). The statistics of the P–M IEDB dataset and the seven alleles with more than 1,000 test pairs used for further analysis are summarized in Supplementary Tables 2 and 3.

P–T zero-shot dataset processing: we obtain the dataset from PanPep. The zero-shot dataset contains 699 unique peptides, 29,467 unique TCRs and 32,080 related P–T binding pairs, including 31,223 pairs in the P–T training set and 857 pairs in the P–T testing set (all 543 Ps are unseen in the training set and 410 out of 543 Ts are unseen in the training set). Unlike PanPep, which selects negative samples from a control set, for each positive pair in the P–T training set, we generate the same number of negative samples via random shuffling (the P of a positive pair is fixed, and we randomly select a T among all Ts) to obtain the overall training set. The testing set is generated in the same manner. Our model requires not only the P–T pairs but also the corresponding P–M–T and P–M pairs. To obtain the P–M–T training set, we search for the P–M–T triplets in our base dataset in which both the peptide and TCR are within the P–T training set (25,929 positive triplets and the same number of negative triplets generated via random shuffling). To obtain the P–M training set, we search for the P–M pairs in our base dataset in which the peptide is within the P–T training set (1,641 pairs). The statistics of the P–T zero-shot dataset we use and the number of positive interactions for Ts in the P–T zero-shot training set are shown in Supplementary Tables 2 and 5.

P–T TEIM dataset processing: we obtain the dataset from TEIM, which contains a total of 19,692 positive P–T pairs. For the zero-shot setting, we split the dataset into five folds with no overlap of Ps between different folds. Following TEIM, five times more negative samples are generated through random shuffling (the P of a positive pair is fixed, and we randomly select a T among all Ts) to obtain the overall training and testing datasets. For the general setting, we randomly split the dataset into five folds, and different folds may contain the same P. Our model requires not only the P–T training set but also the corresponding P–M–T and P–M training sets. To obtain the P–M–T set, we search for the P–M–T triplets in our base dataset in which both the peptide and TCR are within the P–T set (12,397 positive triplets and the same number of negative triplets generated via random shuffling). To obtain the P–M set, we search for the P–M pairs in our base dataset whose Ps exist in the P–T set (1,633 pairs). The P–M–T and P–M training sets are then adjusted according to the fivefold cross-validation to guarantee that only pairs whose Ps are within the corresponding training fold are kept. The statistics of the P–T TEIM dataset we use and the number of positive interactions for Ts in the P–T TEIM dataset are shown in Supplementary Tables 2 and 6.
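The two splitting schemes can be sketched as follows, using scikit-learn's GroupKFold and KFold as stand-ins (an assumption about tooling; the toy pairs are illustrative).

```python
import pandas as pd
from sklearn.model_selection import GroupKFold, KFold

# Toy P-T pairs; in practice these are the 19,692 positive TEIM pairs plus shuffled negatives.
pairs = pd.DataFrame({
    "peptide": ["GILGFVFTL", "GILGFVFTL", "NLVPMVATV", "ELAGIGILTV",
                "KLGGALQAK", "AVFDRKSDAK", "IVTDFSVIK", "RAKFKQLL"],
    "tcr": [f"CASS{i}YEQYF" for i in range(8)],
    "label": [1, 0, 1, 1, 0, 1, 0, 1],
})

# Zero-shot setting: peptides are the grouping key, so no peptide spans two folds.
zero_shot_folds = list(GroupKFold(n_splits=5).split(pairs, groups=pairs["peptide"]))

# General setting: plain random fivefold split; the same peptide may recur across folds.
general_folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(pairs))

for train_idx, test_idx in zero_shot_folds:
    train, test = pairs.iloc[train_idx], pairs.iloc[test_idx]
    assert set(train["peptide"]).isdisjoint(test["peptide"])
```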

Evaluation methods

P–T TEIM evaluation: as in TEIM, we adopt a cluster-based strategy for fivefold cross-validation splitting, in which the similarity between sequences in the training and validation datasets is guaranteed to be less than a threshold. The similarity of two TCR (or peptide) sequences $s_i$ and $s_j$ is computed as $\frac{\mathrm{SW}(s_i,s_j)}{\sqrt{\mathrm{SW}(s_i,s_i)\,\mathrm{SW}(s_j,s_j)}}$, where SW denotes the Smith–Waterman alignment score calculated with the SSW library39. We choose similarity thresholds of 0.5 for the sequence-level dataset and 0.8 for the residue-level dataset, the same as those used for TEIM.
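The normalized similarity can be computed as sketched below; the authors use the SSW library, whereas this illustration substitutes Biopython's local (Smith–Waterman) aligner with BLOSUM62 scoring, so absolute scores may differ.

```python
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.mode = "local"  # Smith-Waterman local alignment
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -1

def sw_similarity(s_i: str, s_j: str) -> float:
    """SW(s_i, s_j) / sqrt(SW(s_i, s_i) * SW(s_j, s_j))."""
    sw_ij = aligner.score(s_i, s_j)
    sw_ii = aligner.score(s_i, s_i)
    sw_jj = aligner.score(s_j, s_j)
    return sw_ij / (sw_ii * sw_jj) ** 0.5

# Sequence pairs above the chosen threshold (0.5 sequence level, 0.8 residue level)
# are clustered into the same fold.
print(sw_similarity("CASSLGQAYEQYF", "CASSLGGAYEQYF"))
```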

Structure-based model evaluation: we take the 3D complex structures of 2PYE and 1QRN as the two test models, in which the model of the peptide and HLA is taken as one molecule and the model of the TCR β chain as the other. On the basis of the consistent valence forcefield, the 3D structures of the peptide–HLA and TCR β are optimized via the steepest descent method (convergence criterion, 0.1 kcal mol–1; 8,000 steps) and the conjugate gradient method (convergence criterion, 0.05 kcal mol–1; 10,000 steps), respectively. The TCR β mutants are optimized using the same method. We calculate the binding energy of the predicted P–M–T triplets between their peptide–HLA and TCR β. When selecting the predicted P–M–T triplets for validation, we established the following principles. (1) From the random sampling perspective, we aimed to cover the ranges of 1–10, 10–100 and 100–1,000. (2) To ensure the accuracy of structural modelling, we selected P–M–T triplets that have strong template complex molecules for the peptide, MHC and TCR in the Protein Data Bank database. (3) To minimize the errors arising from the homology modelling process, we used a unified template for the selected triplets. On the basis of these principles, we selected 15 triplets from the predictions of UniPMT and used 7RTR (Protein Data Bank ID: 7RTR) as the homology modelling template (7RTR serves as the optimal structural template for 12 out of the 15 triplets). The frameworks of TCR β are determined by BLASTp. The 3D structure of the complex between TCR β and the peptide–HLA is modelled using SWISS-MODEL to calculate their binding energy. All computations were performed with InsightII 2000 on a Sun workstation, and PyMOL was used for visualization.

Model architecture

In this section, we introduce UniPMT, a multifaceted learning model that uses GNNs to predict the TCR binding specificity of pathogenic peptides presented by class I MHCs. Our model innovatively integrates three key biological relationships, namely, P–M–T, P–M and P–T, into a cohesive framework, capitalizing on the synergistic potential of these relationships. UniPMT follows a structured approach beginning with graph construction, in which biological entities are represented as nodes and their interactions as edges. This is followed by graph learning via GraphSAGE, which learns robust node embeddings. Finally, the model uses a DMF-based learning framework to unify the binding prediction tasks for P–M–T, P–M and P–T interactions, harnessing a comprehensive and integrated learning strategy. UniPMT outputs a continuous binding score between 0 and 1 that reflects the binding probability: higher scores indicate a higher probability of binding (and hence of clonal expansion) for each P–M–T, P–M and P–T pair.

Graph construction

We represent the complex interplay of P–T, P–M and P–M–T interactions in UniPMT as a heterogeneous graph $\mathcal{G}(\mathcal{V},\mathcal{E})$. The node set $\mathcal{V}$ is defined as

$$\mathcal{V}=\{p_i\}\cup \{m_j\}\cup \{t_k\},$$

(1)

where $p_i$, $m_j$ and $t_k$ represent peptides, MHCs and TCRs, respectively. The edge set $\mathcal{E}$ comprises

$$\mathcal{E}=\{(p_i,m_j)\mid \text{binding exists}\}\cup \{(p_i,t_k)\mid \text{binding exists}\}.$$

(2)

This structure effectively encapsulates the intricate molecular interactions essential for understanding the binding dynamics in immunology. We use a GraphSAGE-based GNN model20 for learning node embeddings, capturing the intricate relationships and properties of the p, m and t nodes:

$$\mathbf{h}_{n_i}^{(l+1)}=\text{ReLU}\left(\mathbf{W}^{(l)}\times \text{mean}\left(\left\{\mathbf{h}_{n_j}^{(l)},\forall n_j\in \mathcal{N}(n_i)\right\}\right)\right),$$

(3)

where $\mathbf{h}_{n_i}^{(l+1)}$ is the updated embedding of node $n_i$ at layer $l+1$, $\mathbf{W}^{(l)}$ is the learnable weight matrix of layer $l$ and $\mathcal{N}(n_i)$ denotes the neighbours of node $n_i$ in the graph.
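A minimal sketch of the heterogeneous graph and GraphSAGE encoder, assuming PyTorch Geometric; the node and edge type names, feature dimensions and the added reverse edges are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
from torch_geometric.data import HeteroData
from torch_geometric.nn import SAGEConv, to_hetero


def rand_edges(n_src, n_dst, n_edges):
    """Random bipartite edge_index of shape (2, n_edges) for the toy graph."""
    return torch.stack([torch.randint(0, n_src, (n_edges,)),
                        torch.randint(0, n_dst, (n_edges,))])


# Toy heterogeneous graph with the node and edge types of equations (1) and (2);
# reverse edges are added so that peptide nodes also receive messages.
data = HeteroData()
data["peptide"].x = torch.randn(100, 64)
data["mhc"].x = torch.randn(20, 64)
data["tcr"].x = torch.randn(200, 64)
data["peptide", "binds", "mhc"].edge_index = rand_edges(100, 20, 300)
data["mhc", "rev_binds", "peptide"].edge_index = rand_edges(20, 100, 300)
data["peptide", "binds", "tcr"].edge_index = rand_edges(100, 200, 500)
data["tcr", "rev_binds", "peptide"].edge_index = rand_edges(200, 100, 500)


class GNN(torch.nn.Module):
    """Two-layer GraphSAGE encoder; mean aggregation and ReLU follow equation (3)."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden)
        self.conv2 = SAGEConv((-1, -1), hidden)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)


# Lift the homogeneous encoder to the heterogeneous graph, one SAGEConv per edge type.
model = to_hetero(GNN(hidden=128), data.metadata(), aggr="mean")
embeddings = model(data.x_dict, data.edge_index_dict)  # dict: node type -> (num_nodes, 128)
```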

Multitask learning

Our UniPMT model is an end-to-end multitask learning framework that simultaneously addresses the P–M–T, P–M and P–T binding prediction tasks.

P–M binding prediction task learning: for this task, we first generate a vector representation $\mathbf{v}_{\rm pm}$ by inputting the embeddings of the P and M nodes into a neural network (for example, a multilayer perceptron), denoted as $f_{\rm pm}$:

$$\mathbf{v}_{\rm pm}=f_{\rm pm}(\mathbf{h}_{\rm p},\mathbf{h}_{\rm m}).$$

(4)

To get the P–M binding probability, we map $\mathbf{v}_{\rm pm}$ to a scalar value in the range [0, 1]. We use a linear mapping layer $\mathbf{w}$ to transform $\mathbf{v}_{\rm pm}$ into a scalar, followed by a sigmoid function:

$$P_{\rm pm}=\sigma (\mathbf{w}\times \mathbf{v}_{\rm pm}),$$

(5)

where σ denotes the sigmoid function. This probability is then compared with the actual binary label, and we optimize the model for the P–M binding prediction task through the cross-entropy loss:

$$\mathcal{L}_{\rm pm}=-\frac{1}{N_{\rm pm}}\sum\limits_{i=1}^{N_{\rm pm}}\left(y_{\rm pm}^{(i)}\log \left[P_{\rm pm}^{(i)}\right]+\left(1-y_{\rm pm}^{(i)}\right)\log \left[1-P_{\rm pm}^{(i)}\right]\right),$$

(6)

where $\mathcal{L}_{\rm pm}$ represents the cross-entropy loss for the P–M binding prediction task. The summation iterates over all $N_{\rm pm}$ samples in the dataset. For each sample $i$, $y_{\rm pm}^{(i)}$ is the true label (0 or 1 for a binary classification task) and $P_{\rm pm}^{(i)}$ is the predicted probability that the $i$th sample belongs to the positive class. The loss is computed as a sum of the negative log-likelihoods of the true labels, effectively penalizing predictions that diverge from the actual labels.
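A minimal PyTorch sketch of the P–M head in equations (4)–(6), assuming concatenation inside $f_{\rm pm}$ and illustrative layer sizes; the exact architecture may differ.

```python
import torch
import torch.nn as nn

class PMHead(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # f_pm: MLP producing the joint representation v_pm (equation (4)).
        self.f_pm = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.w = nn.Linear(dim, 1)  # linear map to a scalar (equation (5))

    def forward(self, h_p: torch.Tensor, h_m: torch.Tensor):
        v_pm = self.f_pm(torch.cat([h_p, h_m], dim=-1))
        p_pm = torch.sigmoid(self.w(v_pm)).squeeze(-1)  # binding probability in [0, 1]
        return v_pm, p_pm

head = PMHead()
h_p, h_m = torch.randn(32, 128), torch.randn(32, 128)
labels = torch.randint(0, 2, (32,)).float()
v_pm, p_pm = head(h_p, h_m)
loss_pm = nn.functional.binary_cross_entropy(p_pm, labels)  # equation (6)
```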

P–M–T binding prediction task learning: for the P–M–T binding prediction task, we first reuse the representation $\mathbf{v}_{\rm pm}$ learned in the P–M binding prediction task. To learn the M–T representation $\mathbf{v}_{\rm mt}$, we use a similar approach with a neural network $f_{\rm mt}$:

$$\mathbf{v}_{\rm mt}=f_{\rm mt}(\mathbf{h}_{\rm m},\mathbf{h}_{\rm t}).$$

(7)

This process ensures that both $\mathbf{v}_{\rm pm}$ and $\mathbf{v}_{\rm mt}$ are learned in a consistent and comparable manner. We assess the binding affinity between $\mathbf{v}_{\rm pm}$ and $\mathbf{v}_{\rm mt}$ using a DMF-based approach16. The DMF approach is particularly well suited for modelling P–M–T binding interactions: DMF can learn refined latent variables $\mathbf{v}_{\rm pm}$ and $\mathbf{v}_{\rm mt}$ from high-dimensional sparse data, capturing the intricate associations between P–M and M–T. Moreover, DMF offers flexibility in modelling the interactions between these latent variables, aligning with the biological mechanisms of P–M–T binding. Specifically, our approach models the P–M–T binding score as the interaction between $\mathbf{v}_{\rm pm}$ and $\mathbf{v}_{\rm mt}$. First, it performs an element-wise product between the two embeddings, known as a bilinear interaction40. Then, a neural network is used to model the nonlinear interactions between $\mathbf{v}_{\rm pm}$ and $\mathbf{v}_{\rm mt}$ and output the binding score. Finally, we map the score to the binding probability through a sigmoid function:

$$P_{\rm pmt}=\sigma \left(f_{\rm DMF}\left(\mathbf{v}_{\rm pm}\odot \mathbf{v}_{\rm mt}\right)\right),$$

(8)

where $\mathbf{v}_{\rm pm}$ is parameter-efficiently shared from the P–M binding prediction task, $\odot$ represents the element-wise product of two embeddings and $f_{\rm DMF}$ is a multilayer perceptron.
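A minimal PyTorch sketch of the interaction in equation (8), with illustrative layer sizes: the element-wise product feeds an MLP followed by a sigmoid. It is a simplified stand-in for the actual $f_{\rm DMF}$.

```python
import torch
import torch.nn as nn

class DMFHead(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # f_DMF: MLP over the bilinear (element-wise product) interaction.
        self.f_dmf = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, v_pm: torch.Tensor, v_mt: torch.Tensor) -> torch.Tensor:
        score = self.f_dmf(v_pm * v_mt)          # element-wise product, then nonlinearity
        return torch.sigmoid(score).squeeze(-1)  # P_pmt in [0, 1]

p_pmt = DMFHead()(torch.randn(32, 128), torch.randn(32, 128))
```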

Given the general absence of negative labels in the P–M–T data, we implement negative sampling to generate negative instances. Furthermore, we adopt the information noise contrastive estimation learning approach to optimize the learning process, enhancing the model’s ability to distinguish between positive and artificially generated negative samples.

To optimize the P–M–T binding prediction model, we use the information noise contrastive estimation learning loss17, which is designed to distinguish between positive and negative samples. The loss function is defined as follows:

$$\mathcal{L}_{\rm pmt}=-\frac{1}{N_{\rm pmt}}\sum\limits_{i=1}^{N_{\rm pmt}}\log \left[\frac{\exp \left(P_{\rm pmt}^{(i)}/\tau \right)}{\exp \left(P_{\rm pmt}^{(i)}/\tau \right)+\sum\nolimits_{j=1}^{K}\exp \left(P_{\rm pmt}^{(i,j)}/\tau \right)}\right],$$

(9)

where $\mathcal{L}_{\rm pmt}$ represents the information noise contrastive estimation learning loss for the P–M–T binding prediction task, $N_{\rm pmt}$ is the number of positive samples in the dataset, $P_{\rm pmt}^{(i)}$ denotes the binding probability of the $i$th positive sample obtained by applying the sigmoid function to the output of $f_{\rm DMF}$ with the embeddings $\mathbf{v}_{\rm pm}^{(i)}$ and $\mathbf{v}_{\rm mt}^{(i)}$, $P_{\rm pmt}^{(i,j)}$ represents the binding probability of the $i$th positive sample's $\mathbf{v}_{\rm pm}$ embedding combined with the $j$th negative sample's $\mathbf{v}_{\rm mt}$ embedding, $K$ is the number of negative samples for each positive sample and $\tau$ is a temperature hyperparameter that controls the distribution concentration.

The numerator of the fraction inside the log represents the exponential of the binding probability for the positive sample, whereas the denominator is the sum of the exponentials of the binding probabilities for the positive sample and K negative samples. By minimizing this loss, the model learns to assign higher probabilities to positive samples compared with negative samples, effectively distinguishing between them. The negative samples are generated using negative sampling techniques, where for each positive sample, K negative samples are randomly selected from the dataset. This process helps the model learn to differentiate between true P–M–T interactions and artificially generated negative instances.
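The objective of equation (9) can be written compactly as a cross-entropy over each positive and its K sampled negatives; the sketch below assumes PyTorch tensors of positive and negative binding probabilities and is an illustration rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(p_pos: torch.Tensor, p_neg: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """p_pos: (N,) binding probabilities of positive triplets.
    p_neg: (N, K) probabilities of the K sampled negatives per positive.
    Implements equation (9): softmax over [positive, negatives] scaled by tau."""
    logits = torch.cat([p_pos.unsqueeze(1), p_neg], dim=1) / tau  # (N, 1 + K)
    targets = torch.zeros(p_pos.size(0), dtype=torch.long)        # positive sits at index 0
    return F.cross_entropy(logits, targets)

loss_pmt = info_nce_loss(torch.rand(32), torch.rand(32, 5), tau=0.1)
```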

P–T binding prediction task learning: for the P–T binding prediction task, following the aggregation of results across all possible MHCs, we use cross-entropy loss to optimize the model. The loss function is defined as follows:

$$\mathcal{L}_{\rm pt}=-\frac{1}{N_{\rm pt}}\sum\limits_{i=1}^{N_{\rm pt}}\left(y_{\rm pt}^{(i)}\log \left[P_{\rm pt}^{(i)}\right]+\left(1-y_{\rm pt}^{(i)}\right)\log \left[1-P_{\rm pt}^{(i)}\right]\right),$$

(10)

where $\mathcal{L}_{\rm pt}$ represents the cross-entropy loss for the P–T binding prediction task, $N_{\rm pt}$ is the number of samples in the dataset, $y_{\rm pt}^{(i)}$ is the true label of the $i$th sample (0 or 1) and $P_{\rm pt}^{(i)}$ is the aggregated binding probability of the $i$th sample, calculated as

$$P_{\rm pt}^{(i)}=\frac{1}{M}\sum\limits_{j=1}^{M}P_{{\rm p}{\rm m}_j{\rm t}}^{(i)},$$

(11)

where $M$ is the number of MHCs considered and $P_{{\rm p}{\rm m}_j{\rm t}}^{(i)}$ is the binding probability of the $i$th sample with respect to the $j$th MHC.
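A short sketch of the aggregation in equation (11) followed by the cross-entropy of equation (10); the number of MHCs and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

def pt_probability(p_pmt_per_mhc: torch.Tensor) -> torch.Tensor:
    """p_pmt_per_mhc: (N, M) P-M-T probabilities of each P-T pair under M candidate MHCs.
    Returns the aggregated P-T probability of equation (11)."""
    return p_pmt_per_mhc.mean(dim=1)

p_pt = pt_probability(torch.rand(32, 208))                  # e.g. all 208 MHCs in the dataset
labels = torch.randint(0, 2, (32,)).float()
loss_pt = nn.functional.binary_cross_entropy(p_pt, labels)  # equation (10)
```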

Integration of three tasks: finally, we integrate the three losses to simultaneously optimize the three tasks. The losses from each task are combined to form a unified learning objective. Specifically, the total loss $\mathcal{L}$ is computed as a weighted sum of the individual task losses $\mathcal{L}_{\rm pm}$, $\mathcal{L}_{\rm pmt}$ and $\mathcal{L}_{\rm pt}$:

$$\mathcal{L}=\lambda_{\rm pm}\mathcal{L}_{\rm pm}+\lambda_{\rm pt}\mathcal{L}_{\rm pt}+\lambda_{\rm pmt}\mathcal{L}_{\rm pmt},$$

(12)

where $\lambda_{\rm pm}$, $\lambda_{\rm pt}$ and $\lambda_{\rm pmt}$ are the weighting factors that balance the contribution of each task to the overall learning process. This multitask framework ensures that each task benefits from the shared learning, leading to a more robust and generalizable model.
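For completeness, a one-step sketch of the weighted objective in equation (12), with placeholder losses standing in for the three task heads above.

```python
import torch

# Illustrative placeholder losses; in practice loss_pm, loss_pt and loss_pmt come from
# the three task heads described above.
loss_pm, loss_pt, loss_pmt = torch.rand(()), torch.rand(()), torch.rand(())
lam_pm, lam_pt, lam_pmt = 1.0, 1.0, 1.0  # weighting hyperparameters of equation (12)

total_loss = lam_pm * loss_pm + lam_pt * loss_pt + lam_pmt * loss_pmt
```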

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
