
Mask-prior-guided denoising diffusion improves inverse protein folding

Discrete denoising diffusion models

Denoising diffusion models are a class of deep generative models trained to create new samples by iteratively denoising sampled noise from a prior distribution. The training stage of a diffusion model consists of a forward diffusion process and a reverse denoising process. Given an original data distribution q(x0), the forward diffusion process gradually corrupts a data point x0 ∼ q(x0) into a series of increasingly noisy data points x1:T = x1, x2, …, xT over T time steps. This process follows a Markov chain, where \(q({{\bf{x}}}_{1:T}| {{\bf{x}}}_{0})=\mathop{\prod}\nolimits_{t=1}^{T}q({{\bf{x}}}_{t}| {{\bf{x}}}_{t-1})\). Conversely, the reverse denoising process, denoted by \({p}_{\theta }({{\bf{x}}}_{0:T})=p({{\bf{x}}}_{T})\mathop{\prod }\nolimits_{t=1}^{T}{p}_{\theta }({{\bf{x}}}_{t-1}| {{\bf{x}}}_{t})\), aims to progressively reduce noise towards the original data distribution q(x0) by predicting xt−1 from xt. The initial noise xT is sampled from a predefined prior distribution p(xT), and the denoising inference pθ can be parametrized by a learnable neural network. Although the diffusion and denoising processes are agnostic to the data modality, the choice of prior distributions and Markov transition operators varies between continuous and discrete spaces.

In this work, we followed the settings of the discrete denoising diffusion proposed by Austin et al.36 and Clement et al.30. In contrast with typical Gaussian diffusion models that operate in continuous state space, discrete denoising diffusion models introduce noise to categorical data using transition probability matrices in discrete state space. Let xt\(\in\) {1, …, K} denote the categorical data with K categories and its one-hot encoding represented by \({{\bf{x}}}_{t}\in {{\mathbb{R}}}^{K}\). At time step t, the forward transition probabilities can be denoted by a matrix \({\bf{Q}}_{t}\in {{\mathbb{R}}}^{K\times K}\), where \({[{\bf{Q}}_{t}]}_{ij}=q({x}_{t}=j| {x}_{t-1}=i)\) is the probability of transitioning from category i to category j. Therefore, the discrete transition kernel in the diffusion process is defined as

$$q({{\bf{x}}}_{t}| {{\bf{x}}}_{t-1})={\rm{Cat}}({{\bf{x}}}_{t};{\bf{p}}={{\bf{x}}}_{t-1}{\bf{Q}}_{t}),$$

(1)

$$q({{\bf{x}}}_{t}| {{\bf{x}}}_{0})={\rm{Cat}}({{\bf{x}}}_{t};{\bf{p}}={{\bf{x}}}_{0}{\overline{\bf{Q}}}_{t}),\,{\rm{with}}\,{\overline{\bf{Q}}}_{t}={\bf{Q}}_{1}{\bf{Q}}_{2}\cdots {\bf{Q}}_{t},$$

(2)

where Cat(x; p) represents a categorical distribution over xt with probabilities determined by \({\bf{p}}\in {{\mathbb{R}}}^{K}\). As the diffusion process is a Markov chain, the transition from x0 to xt can be written in the closed form of equation (2) with \({\overline{\bf{Q}}}_{t}={\bf{Q}}_{1}{\bf{Q}}_{2}\cdots {\bf{Q}}_{t}\). This property enables efficient sampling of xt at arbitrary time steps without recursively applying noise. Following Bayes' theorem, the posterior distribution (with the derivation in Supplementary Information Section 3) from time step t to t − 1 can be written as

$$q({{\bf{x}}}_{t-1}| {{\bf{x}}}_{t},{{\bf{x}}}_{0})\propto {{\bf{x}}}_{t}{\bf{Q}}_{t}^{T}\odot {{\bf{x}}}_{0}{\overline{\bf{Q}}}_{t-1},$$

(3)

where ⊙ denotes the Hadamard (element-wise) product. The posterior q(xt−1 ∣ xt, x0) is equivalent to q(xt−1 ∣ xt) owing to its Markov property. Thus, the clean data x0 is introduced for denoising estimation and can be used as the target of the denoising neural network. In MapDiff, we introduce two simple but effective choices for the transition matrix Qt: uniform transition36 and marginal transition30. The uniform transition is parametrized by \({\bf{Q}}_{t}=(1-{\beta }_{t}){\bf{I}}+{\beta }_{t}{{\bf{1}}}_{K}{{\bf{1}}}_{K}^{T}/K\), where K = 20 represents the number of native amino acid types and the noise schedule βt\(\in\) [0, 1]. Similarly, the marginal transition is parametrized by Qt = (1 − βt)I + βt1KpT, where \({\bf{p}}\in {{\mathbb{R}}}^{20}\) denotes the marginal probability distribution of AA types in the training data. All matrix values are strictly positive, and each row sums to one, ensuring the conservation of probability mass. Given these properties, along with the condition \({\lim }_{t\to T}{\beta }_{t}=1\), q(xt) can converge to a stationary uniform or marginal distribution, regardless of the initial x0.
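To make the transition construction concrete, the following is a minimal NumPy sketch (not the authors' released code) of the uniform and marginal transition matrices and the closed-form posterior of equation (3); the toy marginal frequencies, noise values and residue indices are illustrative assumptions.

```python
# Minimal sketch of uniform/marginal transitions and the posterior of eq. (3).
import numpy as np

K = 20  # number of native amino acid types

def uniform_Q(beta_t: float) -> np.ndarray:
    # Q_t = (1 - beta_t) I + beta_t 11^T / K
    return (1.0 - beta_t) * np.eye(K) + beta_t * np.ones((K, K)) / K

def marginal_Q(beta_t: float, marginal: np.ndarray) -> np.ndarray:
    # Q_t = (1 - beta_t) I + beta_t 1 p^T, with p the training-set AA frequencies
    return (1.0 - beta_t) * np.eye(K) + beta_t * np.outer(np.ones(K), marginal)

def posterior(x_t: np.ndarray, x_0: np.ndarray, Q_t: np.ndarray, Qbar_tm1: np.ndarray) -> np.ndarray:
    # q(x_{t-1} | x_t, x_0) ∝ (x_t Q_t^T) ⊙ (x_0 Qbar_{t-1})
    unnorm = (x_t @ Q_t.T) * (x_0 @ Qbar_tm1)
    return unnorm / unnorm.sum()

# toy usage: one residue, two noise steps
betas = [0.1, 0.2]
Qs = [uniform_Q(b) for b in betas]
Qbar_1, Qbar_2 = Qs[0], Qs[0] @ Qs[1]
x0 = np.eye(K)[3]                      # one-hot clean residue (type index 3)
p_x2 = x0 @ Qbar_2                     # q(x_2 | x_0), equation (2)
x2 = np.eye(K)[np.random.choice(K, p=p_x2)]
print(posterior(x2, x0, Qs[1], Qbar_1).round(3))
```

Because every Qt is row-stochastic, the cumulative products remain valid categorical transition matrices, which is what allows the closed-form sampling in equation (2).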

Residue graph construction

IPF prediction aims to generate a feasible AA sequence that can fold into a desired backbone structure. Given a target protein of N residues, we represent it as a proximity residue graph \({\mathcal{G}}=(\bf{X},\bf{A},\bf{E})\), where each node denotes an AA residue within the protein. The node features X = [Xaa, Xpos, Xprop] encode the AA residue types, 3D spatial coordinates and geometric properties. The adjacency matrix A\(\in\) {0, 1}N×N is constructed using the k-nearest-neighbour algorithm: each node is connected to at most k other nodes that lie within a cutoff distance of 30 Å. The edge feature matrix \({\bf{E}}\in {{\mathbb{R}}}^{{\rm{M}}\times 93}\) encodes the spatial and sequential relationships between connected nodes, where M is the number of edges. More details on the graph feature construction are provided in Supplementary Information Section 4. For sequence generation, we define a discrete denoising process on the types of noisy AA residues \({{\bf{X}}}_{t}^{\rm{aa}}\in {{\mathbb{R}}}^{N\times 20}\) at time t. Conditioned on the noisy graph \({{\mathcal{G}}}_{t}\), this process iteratively refines \({{\bf{X}}}_{t}^{\rm{aa}}\) towards a clean \({{\bf{X}}}_{0}^{\rm{aa}}={{\bf{X}}}^{\rm{aa}}\), which is predicted by our mask-prior-guided denoising network.
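As a concrete illustration of the graph construction above, the sketch below builds a k-nearest-neighbour edge index from C-alpha coordinates with a 30 Å cutoff; the node and 93-dimensional edge featurization is omitted, and the function name and toy sizes are our own assumptions rather than the released implementation.

```python
# Hedged sketch of k-NN residue graph construction from C-alpha coordinates.
import torch

def knn_residue_graph(ca_coords: torch.Tensor, k: int = 30, cutoff: float = 30.0):
    """ca_coords: (N, 3) C-alpha coordinates. Returns edge_index of shape (2, M)."""
    N = ca_coords.size(0)
    dist = torch.cdist(ca_coords, ca_coords)                      # (N, N) pairwise distances
    dist.fill_diagonal_(float("inf"))                             # no self-loops
    knn_dist, knn_idx = dist.topk(min(k, N - 1), largest=False)   # k nearest per node
    src = torch.arange(N).unsqueeze(1).expand_as(knn_idx)
    mask = knn_dist < cutoff                                      # keep neighbours within 30 Å
    return torch.stack([src[mask], knn_idx[mask]])                # (2, M)

coords = torch.randn(50, 3) * 10.0      # toy backbone C-alpha positions
edge_index = knn_residue_graph(coords)
print(edge_index.shape)
```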

IPF denoising diffusion process

Discrete diffusion process

In the diffusion process, we incrementally introduced discrete noise to the clean AA residues over a number of time steps t\(\in\) {1, …, T}, transforming the original data distribution into a simple uniform or marginal distribution. Given a clean AA sequence \({{\bf{X}}}_{0}^{\rm{aa}}=\{{{\bf{x}}}_{0}^{i}\in {{\mathbb{R}}}^{1\times 20}| 1\le i\le N\}\), we used a cumulative transition matrix \({\overline{\bf{Q}}}_{t}\) to independently add noise to each AA residue at an arbitrary step t

$$q({{\bf{x}}}_{t}^{i}| {{\bf{x}}}_{0}^{i})={\rm{Cat}}({{\bf{x}}}_{t}^{i};{\bf{p}}={{\bf{x}}}_{0}^{i}{\overline{\bf{Q}}}_{t}),{\rm{with}}\,{\overline{\bf{Q}}}_{t}={\bf{Q}}_{1}{\bf{Q}}_{2}\cdots {\bf{Q}}_{t},$$

(4)

$$q({{\bf{X}}}_{t}^{{\rm{aa}}}| {{\bf{X}}}_{0}^{{\rm{aa}}})=\prod _{1\le i\le N}q({{\bf{x}}}_{t}^{i}| {{\bf{x}}}_{0}^{i}),$$

(5)

where \({\bf{Q}}_{t}=(1-{\beta }_{t}){\bf{I}}+{\beta }_{t}{{\bf{1}}}_{K}{{\bf{1}}}_{K}^{T}/K\), and K denotes the number of native AA types (that is, K = 20). The noise weight βt\(\in\) [0, 1] was determined by a common cosine schedule37.
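To illustrate equations (4)-(5) and the noise schedule, here is a small PyTorch sketch that builds a cosine schedule (we assume the Nichol and Dhariwal cosine form for the "common cosine schedule") and samples noisy residues directly at a chosen time step; all sizes and the time step are toy values.

```python
# Hedged sketch: cosine schedule and direct forward sampling q(x_t | x_0).
import torch

def cosine_alpha_bar(T: int, s: float = 0.008) -> torch.Tensor:
    t = torch.linspace(0, T, T + 1)
    f = torch.cos(((t / T) + s) / (1 + s) * torch.pi / 2) ** 2
    return f / f[0]                                   # cumulative (1 - beta) product

def betas_from_alpha_bar(alpha_bar: torch.Tensor) -> torch.Tensor:
    return (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)

T, K = 500, 20
betas = betas_from_alpha_bar(cosine_alpha_bar(T))
# cumulative uniform-transition product Qbar_t = Q_1 ... Q_t up to t = 100
Qbar = torch.eye(K)
for beta in betas[:100]:
    Q_t = (1 - beta) * torch.eye(K) + beta * torch.ones(K, K) / K
    Qbar = Qbar @ Q_t
x0 = torch.nn.functional.one_hot(torch.randint(0, K, (8,)), K).float()
probs = x0 @ Qbar                                      # q(x_t | x_0), eq. (4)
x_t = torch.multinomial(probs, 1).squeeze(-1)          # sampled noisy residues
print(x_t)
```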

Training objective of denoising network

The denoising neural network, denoted by ϕθ, is an essential component for reversing the noise process in diffusion models. In our framework, the network takes a noisy residue graph \({{\mathcal{G}}}_{t}=({\bf{X}}_{t},\bf{A},\bf{E})\) as input and aims to predict the clean AA residues \({{\bf{X}}}_{0}^{\rm{aa}}\). Specifically, we designed a mask-prior-guided denoising network ϕθ to effectively capture inherent structural information and learn the underlying data distribution. To train the learnable network ϕθ, the objective is to minimize the cross-entropy loss between the predicted AA probabilities and the real AA types over all nodes.

Reverse denoising process

After the denoising network has been trained, it can be used to generate new AA sequences through an iterative denoising process. In this study, we first used the denoising network ϕθ to estimate the generative distribution \({\hat{p}}_{\theta }({\hat{{\bf{x}}}}_{0}^{i}| {{\bf{x}}}_{t}^{i})\) for each AA residue. Then the reverse denoising distribution \({p}_{\theta }({{\bf{x}}}_{t-1}^{i}| {{\bf{x}}}_{t}^{i})\) was parametrized by combining the posterior distribution with the marginalized network predictions as follows:

$${p}_{\theta }\left({{\bf{x}}}_{t-1}^{i}| {{\bf{x}}}_{t}^{i}\right)\propto \sum _{{\hat{{\bf{x}}}}_{0}^{i}}q\left({{\bf{x}}}_{t-1}^{i}| {{\bf{x}}}_{t}^{i},{\hat{{\bf{x}}}}_{0}^{i}\right){\hat{p}}_{\theta }\left({\hat{{\bf{x}}}}_{0}^{i}| {{\bf{x}}}_{t}^{i}\right),$$

(6)

$${p}_{\theta }\left({{\bf{X}}}_{t-1}^{\rm{aa}}| {{\bf{X}}}_{t}^{\rm{aa}}\right)=\prod _{1\le i\le N}{p}_{\theta }\left({{\bf{x}}}_{t-1}^{i}| {{\bf{x}}}_{t}^{i}\right),$$

(7)

where \({\hat{{\bf{x}}}}_{0}^{i}\) represents the predicted probability distribution for the ith residue \({{\bf{x}}}_{0}^{i}\). The posterior distribution is defined as

$$\begin{array}{rcl}q\left({{\bf{x}}}_{t-1}^{i}| {{\bf{x}}}_{t}^{i},{\hat{{\bf{x}}}}_{0}^{i}\right)&=&\frac{q\left({{\bf{x}}}_{t}^{i}| {{\bf{x}}}_{t-1}^{i},{\hat{{\bf{x}}}}_{0}^{i}\right)q\left({{\bf{x}}}_{t-1}^{i}| {\hat{{\bf{x}}}}_{0}^{i}\right)}{q\left({{\bf{x}}}_{t}^{i}| {\hat{{\bf{x}}}}_{0}^{i}\right)},\\ &=&{\rm{Cat}}\left({{\bf{x}}}_{t-1}^{i};{\bf{p}}=\frac{{{\bf{x}}}_{t}^{i}{{\bf{Q}}}_{t}^{T}\odot {\hat{{\bf{x}}}}_{0}^{i}{\overline{{\bf{Q}}}}_{t-1}}{{\hat{{\bf{x}}}}_{0}^{i}{\overline{{\bf{Q}}}}_{t}{({{\bf{x}}}_{t}^{i})}^{T}}\right).\end{array}$$

(8)

By applying the reverse denoising process, the generation of a less-noisy \({{\bf{X}}}_{t-1}^{\rm{aa}}\) from \({{\bf{X}}}_{t}^{\rm{aa}}\) is feasible (derivation in Supplementary Information Section 3). The denoised result is determined by the predicted residues from the denoising neural network, as well as the predefined transition matrices at steps t and t − 1. To generate a new AA sequence, the complete generative process begins with random noise sampled from the independent prior distribution p(xT). This initial noise is then iteratively denoised at each time step using the reverse denoising process, gradually converging to a desired sequence conditioned on the given graph \({\mathcal{G}}\).
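The reverse step can be written compactly in code. Below is a hedged PyTorch sketch of equations (6)-(8): the network's predicted distribution over clean residues is marginalized against the analytic posterior and the result is sampled to obtain xt−1. Here `x0_probs` stands in for the output of the mask-prior-guided denoising network, and the toy transition matrices are illustrative only.

```python
# Hedged sketch of one reverse denoising step (eqs 6-8).
import torch

def reverse_step(x_t, x0_probs, Q_t, Qbar_t, Qbar_tm1):
    """
    x_t:      (N, K) one-hot noisy residues at step t
    x0_probs: (N, K) predicted distribution over clean residues
    Returns   (N, K) one-hot sample of x_{t-1}.
    """
    # posterior q(x_{t-1} | x_t, x0=k) for every possible clean class k:
    # (x_t Q_t^T) ⊙ (e_k Qbar_{t-1}) / (e_k Qbar_t x_t^T)
    left = x_t @ Q_t.T                                   # (N, K_prev)
    numer = left.unsqueeze(1) * Qbar_tm1.unsqueeze(0)    # (N, K_clean, K_prev)
    denom = (Qbar_t @ x_t.T).T.unsqueeze(-1)             # (N, K_clean, 1)
    post = numer / denom.clamp(min=1e-12)
    # marginalize over the predicted x0 distribution (eq. 6) and renormalize
    probs = (x0_probs.unsqueeze(-1) * post).sum(dim=1)
    probs = probs / probs.sum(dim=-1, keepdim=True)
    idx = torch.multinomial(probs, 1).squeeze(-1)
    return torch.nn.functional.one_hot(idx, x_t.size(-1)).float()

# toy usage
K = 20
Q_t = 0.9 * torch.eye(K) + 0.1 * torch.ones(K, K) / K
Qbar_t, Qbar_tm1 = Q_t @ Q_t, Q_t                      # toy cumulative products
x_t = torch.nn.functional.one_hot(torch.randint(0, K, (5,)), K).float()
x0_probs = torch.softmax(torch.randn(5, K), dim=-1)    # stand-in for network output
x_tm1 = reverse_step(x_t, x0_probs, Q_t, Qbar_t, Qbar_tm1)
```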

DDIM with Monte-Carlo dropout

Although discrete diffusion models have demonstrated impressive generation ability in many fields, the generative process suffers from two limitations that hinder their success in IPF prediction. First, the generative process is inherently computationally inefficient due to the numerous denoising steps involved, which require a sequential Markovian forward pass for the iterative generation. Second, the categorical distribution used for denoising sampling lacks sufficient uncertainty estimation. Many studies indicate that the logits produced by deep neural networks do not accurately represent the true probabilities. Typically, the predictions tend to be overconfident, leading to a discrepancy between the predicted probabilities and the actual distribution. As the generative process iteratively draws samples from the estimated categorical distribution, insufficient uncertainty estimation will accumulate sampling errors and result in unsatisfactory performance.

To accelerate the generative process and improve uncertainty estimation, we propose a discrete sampling method that combines DDIM with Monte-Carlo dropout. DDIM21 is a widely used method that improves the generation efficiency of diffusion models in continuous space. It defines the generative process as the reverse of a deterministic and non-Markovian diffusion process, making it possible to skip certain denoising steps during generation. As discrete diffusion models possess analogous properties, Yi et al. (2023)38 extended DDIM into discrete space for IPF prediction. Similarly, we apply discrete DDIM sampling to the posterior distribution as

$$q\left({{\bf{x}}}_{t-k}^{i}| {{\bf{x}}}_{t}^{i},{\hat{{\bf{x}}}}_{0}^{i}\right)={\rm{Cat}}\left({{\bf{x}}}_{t-k}^{i};{\bf{p}}=\frac{{{\bf{x}}}_{t}^{i}{\bf{Q}}_{t}^{T}\cdots {\bf{Q}}_{t-k}^{T}\odot {\hat{{\bf{x}}}}_{0}^{i}{\overline{\bf{Q}}}_{t-k}}{{\hat{{\bf{x}}}}_{0}^{i}{\overline{\bf{Q}}}_{t}{({{\bf{x}}}_{t}^{i})}^{T}}\right),$$

(9)

where k is the number of skipped steps.

Then we introduce Monte-Carlo dropout within the generative process, a technique designed to improve uncertainty estimation in neural network predictions. Specifically, we use dropout not only to prevent overfitting during the training of our denoising network, but also to maintain its activation in the inference stage. By keeping dropout enabled and running multiple forward passes (Monte-Carlo samples) during inference, we generate a prediction distribution for each input, as opposed to a single-point estimation. To improve uncertainty estimation, we aggregate the predictions by taking a mean pooling over all output logits corresponding to the same input. This operation yields predicted logits with reduced estimation bias, whose normalized probabilities more accurately reflect the actual distribution. Therefore, we can leverage Monte-Carlo dropout to steer the generative process towards more reliable sampling.
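A hedged sketch of this inference mode is shown below: dropout layers stay stochastic while the rest of the network is in evaluation mode, and the logits from several forward passes are averaged. The stand-in network, input shapes and function names are illustrative assumptions, not the released code.

```python
# Hedged sketch of Monte-Carlo dropout at inference time.
import torch
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> None:
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()                      # keep dropout active at inference

@torch.no_grad()
def mc_dropout_logits(denoiser: nn.Module, inputs, n_samples: int = 50) -> torch.Tensor:
    enable_mc_dropout(denoiser)
    logits = torch.stack([denoiser(*inputs) for _ in range(n_samples)])
    return logits.mean(dim=0)                   # averaged logits, e.g. (N, 20)

# toy usage with a stand-in network
denoiser = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 20))
x = torch.randn(10, 16)
avg_logits = mc_dropout_logits(denoiser, (x,), n_samples=50)
probs = avg_logits.softmax(dim=-1)
```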

Mask-prior-guided denoising network

In diffusion model applications, the denoising network plays a crucial role in generation performance. We have developed a mask-prior-guided denoising network, integrating both structural information and residue interactions for enhanced protein sequence prediction. Our denoising network architecture comprises a structure-based sequence predictor, a pretrained masked sequence designer and a mask ratio adaptor.

Structure-based sequence predictor

We adopt an EGNN with a global-aware module as the structure-based sequence predictor, which generates a full AA sequence from the backbone structure. EGNN is a type of graph neural network that satisfies equivariance operations for the special Euclidean group SE(3). It preserves geometric and spatial relationships of 3D coordinates within the message-passing framework. Given a noisy residue graph, we use H = [h1, h2, …, hN] to denote the initial node embeddings, which are derived from the noisy AA types and geometric properties. The coordinates of each node are represented by \({{\bf{X}}}^{\rm{pos}}=[{{\bf{x}}}_{1}^{\rm{pos}},{{\bf{x}}}_{2}^{\rm{pos}},\cdots ,{{\bf{x}}}_{N}^{\rm{pos}}]\), whereas the edge features are denoted by E = [e1, e2, …, eM]. In this setting, EGNN consists of a stack of equivariant graph convolutional layers (EGCL) for node and edge information propagation, which are defined as

$${{\bf{e}}}_{ij}^{(l+1)}={\phi }_{e}\left({{\bf{h}}}_{i}^{(l)},{{\bf{h}}}_{j}^{(l)},\parallel {{\bf{x}}}_{i}^{(l)}-{{\bf{x}}}_{j}^{(l)}{\parallel }^{2},{{\bf{e}}}_{ij}^{(l)}\right),$$

(10)

$${\hat{{\bf{h}}}}_{i}^{(l+1)}={\phi }_{h}\left({{\bf{h}}}_{i}^{(l)},\sum _{j\in {\mathcal{N}}(i)}{w}_{ij}{{\bf{e}}}_{ij}^{(l+1)}\right),$$

(11)

$${{\bf{x}}}_{i}^{(l+1)}={{\bf{x}}}_{i}^{(l)}+\frac{1}{{N}_{i}}\sum _{j\in {\mathcal{N}}(i)}\left({{\bf{x}}}_{i}^{(l)}-{{\bf{x}}}_{j}^{(l)}\right){\phi }_{x}\left({{\bf{e}}}_{ij}^{(l+1)}\right),$$

(12)

where l denotes the lth EGCL layer, \({{\bf{x}}}_{i}^{(0)}={{\bf{x}}}_{i}^{\rm{pos}}\) and \({w}_{ij}={\rm{sigmoid}}({\phi }_{w}({{\bf{e}}}_{ij}^{(l+1)}))\) is a soft estimated weight assigned to the specific edge representation. All components (ϕe, ϕh, ϕx, ϕw) are learnable and parametrized by fully connected neural networks. During information propagation, EGNN is equivariant to translations and rotations of the node coordinates Xpos, and remains invariant to these group transformations on the node features H and edge features E.
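To ground equations (10)-(12), the following is a simplified PyTorch sketch of one EGCL update; it is a hedged approximation rather than the released implementation, using plain index-add aggregation and omitting normalization and residual details.

```python
# Simplified sketch of one EGCL layer (eqs 10-12).
import torch
import torch.nn as nn

class SimpleEGCL(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int, hidden: int = 128):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(2 * node_dim + 1 + edge_dim, hidden), nn.SiLU())
        self.phi_h = nn.Sequential(nn.Linear(node_dim + hidden, node_dim), nn.SiLU())
        self.phi_x = nn.Linear(hidden, 1)
        self.phi_w = nn.Linear(hidden, 1)

    def forward(self, h, x, e, edge_index):
        src, dst = edge_index
        sq_dist = ((x[src] - x[dst]) ** 2).sum(-1, keepdim=True)
        # eq. (10): edge message from node pair, squared distance and edge features
        m_ij = self.phi_e(torch.cat([h[src], h[dst], sq_dist, e], dim=-1))
        w_ij = torch.sigmoid(self.phi_w(m_ij))          # soft edge weights
        # eq. (11): node update from weighted neighbour messages
        agg = torch.zeros(h.size(0), m_ij.size(-1), device=h.device).index_add_(0, src, w_ij * m_ij)
        h_new = self.phi_h(torch.cat([h, agg], dim=-1))
        # eq. (12): equivariant coordinate update averaged over neighbours
        coord_msg = (x[src] - x[dst]) * self.phi_x(m_ij)
        deg = torch.zeros(x.size(0), 1, device=x.device).index_add_(0, src, torch.ones_like(coord_msg[:, :1]))
        x_new = x + torch.zeros_like(x).index_add_(0, src, coord_msg) / deg.clamp(min=1)
        return h_new, x_new, m_ij

# toy usage
h, x = torch.randn(30, 64), torch.randn(30, 3)
edge_index = torch.randint(0, 30, (2, 200))
e = torch.randn(200, 93)
h2, x2, e2 = SimpleEGCL(node_dim=64, edge_dim=93)(h, x, e, edge_index)
```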

However, the vanilla EGNN only considers local neighbour aggregation while neglecting the global context. Some recent studies13,39 have demonstrated the importance of global information in protein design. Therefore, we introduce a global-aware module in the EGCL layer, which incorporates the global pooling vector into the update of node representations: that is,

$${{\bf{m}}}^{(l+1)}={\rm{MeanPool}}\left({\left\{{\hat{{\bf{h}}}}_{i}^{(l+1)}\right\}}_{i\in {\mathcal{G}}}\right),$$

(13)

$${{\bf{h}}}_{i}^{(l+1)}={\hat{{\bf{h}}}}_{i}^{(l+1)}\odot {\rm{sigmoid}}\left({\phi }_{m}\left({{\bf{m}}}^{(l+1)},{\hat{{\bf{h}}}}_{i}^{(l+1)}\right)\right),$$

(14)

where MeanPool(·) is the mean pooling operation over all nodes within a residue graph. The global-aware module effectively integrates global context into the modelling while adding only a linear computational cost. To predict the probabilities of residue types, the node representations from the last EGCL layer are fed into a fully connected classification layer with a softmax function, which is defined as

$${{\bf{p}}}_{i}^{\rm{b}}={\rm{softmax}}\left({{\bf{l}}}_{i}^{\rm{b}}\right),\quad {{\bf{l}}}_{i}^{\rm{b}}={{\bf{h}}}_{i}^{(L)}{\bf{W}}_{\rm{o}}+{{\bf{b}}}_{\rm{o}},$$

(15)

where \({\bf{W}}_{\rm{o}}\in {{\mathbb{R}}}^{{D}_{h}\times 20}\) and \({{\bf{b}}}_{\rm{o}}\in {{\mathbb{R}}}^{1\times 20}\) are a learnable weight matrix and a bias vector, respectively.
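The global-aware gating of equations (13)-(14) reduces to a few lines; the hedged sketch below assumes a single graph (no batching) and a plain linear layer for ϕm.

```python
# Sketch of the global-aware gating (eqs 13-14).
import torch
import torch.nn as nn

class GlobalAwareGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.phi_m = nn.Linear(2 * dim, dim)

    def forward(self, h_hat: torch.Tensor) -> torch.Tensor:
        m = h_hat.mean(dim=0, keepdim=True).expand_as(h_hat)        # eq. (13): graph-level context
        gate = torch.sigmoid(self.phi_m(torch.cat([m, h_hat], dim=-1)))
        return h_hat * gate                                         # eq. (14): gated node update

h = GlobalAwareGate(128)(torch.randn(30, 128))
```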

Low-confidence residue selection and mask ratio adaptor

As previously mentioned, structural information alone can sometimes be insufficient to determine all residue identities. Certain flexible regions display a weaker correlation with the backbone structure but are strongly influenced by their sequential context. To enhance the denoising network’s performance, we introduce a masked sequence designer module, which refines the residues identified with low confidence by the base sequence predictor. We adopt an entropy-based residue selection strategy, as proposed by Zhou et al. (2023)24, to identify these low-confidence residues. The entropy of the probability distribution \({{\bf{p}}}_{i}^{b}\) for the ith residue is calculated as

$${\rm{en{t}}}_{i}^{\rm{b}}=-\sum _{j}{\,\text{p}}_{ij}^{\rm{b}}\log \left({\text{p}}_{ij}^{\rm{b}}\right).$$

(16)

Given that entropy quantifies the uncertainty in a probability distribution, it can be used to locate the low-confidence predicted residues. Consequently, residues with the highest entropy are masked, whereas the rest remain as sequential context. The masked sequence designer aims to reconstruct the entire sequence from the masked partial sequence in combination with the backbone structure. In addition, to account for the varying noise levels of the input sequence in diffusion models, we designed a simple mask ratio adaptor to dynamically determine the entropy mask percentage at different denoising steps: that is,

$${\rm{mr}}_{t}=\sin {\left(\frac{\uppi }{2}{\beta }_{t}\sigma \right)}+m,$$

(17)

where βt\(\in\) [0, 1] represents the noise weight at step t derived from the noise schedule, and σ and m are the predefined deviation and minimum mask ratio, respectively. As βt increases over the denoising steps, the mask ratio grows accordingly.
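A hedged sketch of the mask ratio adaptor (equation (17)) and the entropy-based selection of low-confidence residues follows; the default σ and m values are taken from the implementation set-up reported below (0.2 and 0.4), and the helper names are our own.

```python
# Sketch of mask ratio adaptor (eq. 17) and entropy-based residue selection (eq. 16).
import math
import torch

def mask_ratio(beta_t: float, sigma: float = 0.2, m: float = 0.4) -> float:
    return math.sin(math.pi / 2 * beta_t * sigma) + m

def select_low_confidence(probs: torch.Tensor, ratio: float) -> torch.Tensor:
    """probs: (N, 20) per-residue distributions; returns boolean mask of residues to re-design."""
    entropy = -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1)   # eq. (16)
    n_mask = max(1, int(ratio * probs.size(0)))
    idx = entropy.topk(n_mask).indices                              # highest-entropy residues
    mask = torch.zeros(probs.size(0), dtype=torch.bool)
    mask[idx] = True
    return mask

probs = torch.softmax(torch.randn(120, 20), dim=-1)
mask = select_low_confidence(probs, mask_ratio(beta_t=0.5))
```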

Mask-prior pretraining

To incorporate prior knowledge of sequential context, we pretrained the masked sequence designer with the masked language modelling objective proposed in BERT40. It is important to clarify that we used the same training data as the diffusion models for pretraining, to avoid any information leakage from external sources. In this process, we randomly sampled a proportion of residues in the native AA sequences and applied the following masking procedure: (1) masking 80% of the selected residues with a special MASK type; (2) replacing 10% of the selected residues with other random residue types; and (3) keeping the remaining 10% of selected residues unchanged. Subsequently, we input the partially masked sequences, along with structural information, into the masked sequence designer. The objective of the pretraining stage was to predict the original residue types from the masked residue representations using a cross-entropy loss function.
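The 80/10/10 corruption scheme can be sketched as below; the 15% selection rate and the index reserved for the MASK type are assumptions for illustration, as the text only specifies the split among selected residues.

```python
# Hedged sketch of BERT-style masking for mask-prior pretraining.
import torch

MASK_TOKEN = 20                                   # assumed index for the special MASK type

def bert_style_mask(seq: torch.Tensor, select_rate: float = 0.15):
    """seq: (N,) integer AA types in [0, 20). Returns corrupted sequence and selection mask."""
    corrupted = seq.clone()
    selected = torch.rand_like(seq, dtype=torch.float) < select_rate
    roll = torch.rand_like(seq, dtype=torch.float)
    mask_pos = selected & (roll < 0.8)                            # 80%: replace with MASK
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)            # 10%: random residue type
    corrupted[mask_pos] = MASK_TOKEN
    corrupted[rand_pos] = torch.randint(0, 20, (int(rand_pos.sum()),))
    return corrupted, selected                                    # loss computed on selected positions

seq = torch.randint(0, 20, (200,))
corrupted, target_mask = bert_style_mask(seq)
```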

Masked sequence designer

We used an IPA network as the masked sequence designer. IPA is a geometry-aware attention mechanism designed to facilitate the fusion of residue representations and spatial relationships, enhancing structure generation within AlphaFold215. In this study, we repurposed the IPA module to refine the low-confidence residues from the base sequence predictor. Given a masked AA sequence, we denote its residue representation as S = [s1, s2, …, sN], which is derived from the residue types and positional encoding. To incorporate geometric information, as with the IPA implementation in Frame2seq41, we constructed a pairwise distance representation \({\bf{Z}}=\{{{\bf{z}}}_{ij}\in {{\mathbb{R}}}^{1\times {d}_{z}}| 1\le i\le N,1\le j\le N\}\) and rigid coordinate frames \({\mathcal{T}}=\{{T}_{i}:= ({{\bf{R}}}_{i}\in {{\mathbb{R}}}^{3\times 3},{{\bf{t}}}_{i}\in {{\mathbb{R}}}^{3})| 1\le i\le N\}\). The pairwise representation Z was obtained by calculating inter-residue spatial distances and relative sequence positions. The rigid coordinate frames were constructed from the coordinates of backbone atoms using a Gram–Schmidt process, providing a consistent local reference that ensures the invariance of IPA to global Euclidean transformations. Subsequently, we took the residue representation, pairwise distance representation and rigid coordinate frames as inputs, and fed them into a stack of IPA layers for representation learning, which is defined as

$${{\bf{S}}}^{(l+1)},{{\bf{Z}}}^{(l+1)}={\rm{IPA}}({{\bf{S}}}^{(l)},{{\bf{Z}}}^{(l)},{{\mathcal{T}}}).$$

(18)
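As a concrete illustration of the rigid frames \({\mathcal{T}}\) used above, the sketch below builds per-residue rotations and translations from backbone atoms with a Gram–Schmidt process; the (N, CA, C) atom ordering and axis convention are assumptions and may differ from the exact AlphaFold2/Frame2seq construction.

```python
# Hedged sketch of per-residue rigid frame construction via Gram-Schmidt.
import torch

def rigid_frames(n: torch.Tensor, ca: torch.Tensor, c: torch.Tensor):
    """n, ca, c: (N, 3) backbone atom coordinates. Returns R (N, 3, 3) and t (N, 3)."""
    v1 = c - ca
    v2 = n - ca
    e1 = v1 / v1.norm(dim=-1, keepdim=True)
    u2 = v2 - (e1 * v2).sum(-1, keepdim=True) * e1       # remove the component along e1
    e2 = u2 / u2.norm(dim=-1, keepdim=True)
    e3 = torch.cross(e1, e2, dim=-1)                     # complete the right-handed basis
    R = torch.stack([e1, e2, e3], dim=-1)                # columns form an orthonormal basis
    return R, ca                                         # translation t_i is the CA position

R, t = rigid_frames(torch.randn(10, 3), torch.randn(10, 3), torch.randn(10, 3))
```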

The IPA network follows the self-attention mechanism. However, it enhances the general attention queries, keys and values by incorporating 3D points that are generated in the rigid coordinate frame of each residue. This operation ensures that the updated residue and pair representations remain invariant under global rotations and translations. More details on the IPA feature construction and algorithm implementation are provided in Supplementary Information Section 6. For the ith residue, the predicted probability distribution and entropy in the masked sequence designer are calculated as

$${{\bf{p}}}_{i}^{\rm{m}}={\rm{softmax}}({{\bf{l}}}_{i}^{\rm{m}}),\quad {{\bf{l}}}_{i}^{\rm{m}}={{\bf{h}}}_{i}^{(L)}{\bf{W}}_{\rm{m}}+{{\bf{b}}}_{\rm{m}},$$

(19)

$${\rm{en{t}}}_{i}^{\rm{m}}=-{\sum _{j}{\,\rm{p}}_{ij}^{\rm{m}}\log ({\rm{p}}_{ij}^{\rm{m}})},$$

(20)

where \({\bf{W}}_{\rm{m}}\in {{\mathbb{R}}}^{{D}_{s}\times 20}\) and \({{\bf{b}}}_{\rm{m}}\in {{\mathbb{R}}}^{1\times 20}\) are the learnable weight matrix and bias vector, respectively. The training objective was to jointly minimize the cross-entropy losses for both the base sequence predictor and masked sequence designer. In the inference stage, we calculated the final predicted probability by weighting the output logits based on their entropy as

$${{\bf{l}}}_{i}^{\,{\rm{f}}\,}=\frac{\exp \left(-{\rm{en{t}}}_{i}^{\rm{b}}\right)}{\exp \left(-{\rm{en{t}}}_{i}^{\rm{b}}\right)+\exp \left(-{\rm{en{t}}}_{i}^{\rm{m}}\right)}{{\bf{l}}}_{i}^{\rm{b}}+\frac{\exp \left(-{\rm{en{t}}}_{i}^{\rm{m}}\right)}{\exp \left(-{\rm{en{t}}}_{i}^{\rm{b}}\right)+\exp \left(-{\rm{en{t}}}_{i}^{\rm{m}}\right)}{{\bf{l}}}_{i}^{\rm{m}}.$$

(21)

$${{\bf{p}}}_{i}^{\,{\rm{f}}\,}={\rm{softmax}}\left({{\bf{l}}}_{i}^{\,{\rm{f}}\,}\right).$$

(22)
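The entropy-weighted fusion of equations (21)-(22) can be written compactly: the branch with lower entropy (higher confidence) receives the larger weight through a softmax over negative entropies. The sketch below is a hedged illustration with toy inputs, not the released implementation.

```python
# Sketch of entropy-weighted fusion of base and masked-designer logits (eqs 21-22).
import torch

def fuse_logits(logits_b: torch.Tensor, logits_m: torch.Tensor) -> torch.Tensor:
    """logits_b, logits_m: (N, 20). Returns fused per-residue probabilities."""
    def entropy(logits):
        p = logits.softmax(dim=-1)
        return -(p * p.clamp(min=1e-12).log()).sum(dim=-1)        # (N,)
    ent = torch.stack([entropy(logits_b), entropy(logits_m)], dim=-1)
    w = torch.softmax(-ent, dim=-1)                               # eq. (21) weights
    fused = w[:, :1] * logits_b + w[:, 1:] * logits_m
    return fused.softmax(dim=-1)                                  # eq. (22)

p_final = fuse_logits(torch.randn(50, 20), torch.randn(50, 20))
```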

By incorporating the mask-prior denoising network into the discrete denoising diffusion process, our framework enhanced the denoising trajectories, leading to more accurate predictions of protein sequences.

Experimental setting

Primary datasets

We evaluated MapDiff on experimentally validated protein structures curated from well-established databases. The CATH database25 is widely used in inverse folding research, enabling fair comparisons across different methodologies. It classifies proteins into hierarchical levels based on class, architecture, topology and homologous superfamily, with filtering to reduce redundancy and ensure structural diversity. Following previous studies13,26,27, proteins are partitioned based on their CATH topology classification codes, ensuring that the training, validation and test sets contain non-overlapping topologies. This partitioning strategy provided a robust evaluation of the model’s generalization to unseen proteins. For CATH 4.2, the dataset consisted of 18,024 structures for training, 608 for validation and 1,120 for testing. Similarly, in CATH 4.3, we followed the topology classification approach in ESM-IF27, resulting in 16,630 proteins for training, 1,516 for validation and 1,864 for testing. By including both CATH 4.2 and CATH 4.3, we assessed the stability of model performance across dataset versions, ensuring robustness to updates in protein-structure databases.

Zero-shot generalization datasets

To further assess MapDiff’s zero-shot generalization ability, we evaluated it on two independent datasets, TS50 and PDB2022. TS50 (ref. 5) is a commonly used benchmark for protein-sequence design, consisting of 50 diverse protein chains covering different structural classes. PDB2022 includes single-chain structures published in the Protein Data Bank (PDB)42 between 5 January 2022 and 26 October 2022, curated by Zhou et al.24, with protein length ≤500 and resolution ≤2.5 Å. This dataset consists of 1,975 proteins published after those in the CATH dataset, providing a strict time-based test split that evaluates real-world temporal generalization. Both datasets are entirely separate from the CATH-derived training set, minimizing data leakage and providing a robust evaluation of structural and temporal generalization.

Baselines

We compared MapDiff with recent deep graph models for inverse protein folding, including StructGNN26, GraphTrans26, GVP43, AlphaDesign44, ProteinMPNN1, PiFold13, LM-Design45 and GRADE-IF1. To ensure a reliable and fair comparison, we reproduced the four strongest open-source state-of-the-art baselines (ProteinMPNN, PiFold, LM-Design and GRADE-IF) under identical settings in our experiments. ProteinMPNN uses a message-passing neural network to encode structure features, and a random decoding scheme to generate protein sequences. PiFold introduces a residue featurizer to extract distance, angle and direction features, and proposes a PiGNN encoder to learn expressive residue representations, enabling sequence generation in a one-shot manner. LM-Design uses structure-based models as encoders and incorporates the protein language model ESM as a protein designer to refine the generated sequences. GRADE-IF employs EGNN to learn residue representations from protein structures, and adopts a graph denoising diffusion model to iteratively generate feasible sequences. All baselines were implemented following the default hyperparameter settings in their original papers.

Implementation set-up

MapDiff is implemented in Python v.3.8 and PyTorch v.1.13.1 (ref. 46), along with functions from BioPython v.1.81 (ref. 47), PyG v.2.4.0 (ref. 48), Scikit-learn v.1.0.2 (ref. 49), NumPy v.1.22.3 (ref. 50) and RDKit v.2023.3.3 (ref. 51). It consists of two training stages: mask-prior pretraining and denoising diffusion model training, both of which use the same CATH 4.2/4.3 training set. The batch size was set to eight, and the models were trained up to 200 epochs in pretraining and 100 epochs in denoising training. We employed the Adam optimizer with a one-cycle scheduler for parameter optimization, setting the peak learning rate to 5 × 10−4. In the denoising network, the structure-based sequence predictor consisted of six global-aware EGCL layers, each with 128 hidden dimensions. In addition, the masked sequence designer stacked six layers of IPA, each with 128 hidden dimensions and four attention heads. The dropout rate was set to 0.2 in both the EGCL and IPA layers. A cosine schedule was applied to control the noise weight at each time step, with a total of 500 time steps. During sampling inference, the skip steps for DDIM were configured to 100, and the Monte-Carlo forward passes were set to 50. For the mask ratio adaptor, we set the minimum mask ratio to 0.4 and the deviation to 0.2. All experiments were conducted on a single Tesla A100 GPU. Following standard deep learning evaluation practice, the best-performing model was selected as the checkpoint from the epoch with the highest recovery on the validation set; this model was then used to evaluate performance on the test set. For the foldability analysis, we applied a single AlphaFold2 pTM model (that is, model_1_ptm) with three recycles to balance accuracy and computational efficiency. Multiple sequence alignment information was generated for each sequence using the MMSeqs2 (refs. 52,53) server provided by ColabFold54. We provide the algorithm details for the training and sampling inference in Supplementary Information Section 5, and the scalability study in Supplementary Information Section 8 and Supplementary Fig. 4.
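For reference, the reported hyperparameters can be collected into a single configuration as in the hedged sketch below; the dictionary keys, the stand-in model and the total step count are our own illustrative choices, not the released code.

```python
# Hedged sketch of the training configuration reported above.
import torch

config = {
    "batch_size": 8, "pretrain_epochs": 200, "denoise_epochs": 100,
    "peak_lr": 5e-4, "egcl_layers": 6, "ipa_layers": 6, "hidden_dim": 128,
    "attention_heads": 4, "dropout": 0.2, "diffusion_steps": 500,
    "ddim_skip": 100, "mc_forward_passes": 50, "min_mask_ratio": 0.4, "mask_deviation": 0.2,
}

model = torch.nn.Linear(128, 20)                      # stand-in for the denoising network
optimizer = torch.optim.Adam(model.parameters(), lr=config["peak_lr"])
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=config["peak_lr"], total_steps=100_000)  # total_steps is illustrative
```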

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
