
Fast, scale-adaptive and uncertainty-aware downscaling of Earth system model fields with generative machine learning

Training data

As a target and ground-truth dataset, we use observational precipitation data from the ERA5 reanalysis34 provided by the European Centre for Medium-Range Weather Forecasts. It covers the period from 1940 to the present, and we split the data into a training set from 1940 to 1990, a validation set from 1991 to 2003 and a test set from 2004 to 2018. We bilinearly interpolate the reanalysis data to a resolution of 0.75° and 0.9375° in the latitudinal and longitudinal directions, respectively (that is, 240 × 384 grid points), which corresponds to a four times higher resolution compared with the raw ESM simulations with 3° × 3.75° resolution (that is, 60 × 96 grid points).
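The regridding from the coarse 60 × 96 ESM grid to the 240 × 384 target grid can be sketched with a minimal numpy implementation of bilinear interpolation. This is purely illustrative; an actual pipeline would typically use a dedicated regridding tool, and the random field below is a placeholder for real precipitation data.

```python
import numpy as np

def bilinear_regrid(field, new_shape):
    """Bilinearly interpolate a 2D (lat, lon) field onto new_shape."""
    ny, nx = field.shape
    my, mx = new_shape
    y_old, y_new = np.linspace(0, 1, ny), np.linspace(0, 1, my)
    x_old, x_new = np.linspace(0, 1, nx), np.linspace(0, 1, mx)
    # interpolate along latitude first, then longitude
    tmp = np.stack([np.interp(y_new, y_old, field[:, j]) for j in range(nx)], axis=1)
    return np.stack([np.interp(x_new, x_old, tmp[i, :]) for i in range(my)], axis=0)

coarse = np.random.rand(60, 96)              # raw ESM grid (3° × 3.75°), toy data
fine = bilinear_regrid(coarse, (240, 384))   # ERA5 target grid (0.75° × 0.9375°)
```

Bilinear interpolation cannot create values outside the range of the coarse field, which is why the interpolated ESM fields remain too smooth at small scales, the problem the generative downscaling addresses.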

For the ESM precipitation fields, we use global simulations from three different ESMs with varying complexities and resolutions. The fully coupled POEM35, which includes model components for the atmosphere, ocean, ice sheets and dynamic vegetation, is used as the primary model for comparison of the two generative downscaling methods for past and future climates. To demonstrate the ability of our CM-based method to correct any ESM with a coarser native resolution than the training ground truth, we further include daily precipitation simulations from the much more comprehensive and complex GFDL-ESM4 (ref. 36), with a native resolution of 1° × 1°. We initially upscale the GFDL-ESM4 resolution to the same grid as the POEM ESM to allow direct comparisons. We further include SpeedyWeather.jl37, with a native resolution of 3.75° × 3.75°, which only has a dynamic atmosphere and is, hence, less comprehensive than the fully coupled POEM ESM. Finally, we also use ERA5 data upscaled to the native POEM resolution as the test data for which a paired ground truth is available. Applying our method to the latter can, hence, be seen as a proof of concept.

For evaluation, we use 14 years of available historical data from each of the simulations, with periods 2004–2018 for POEM, 2000–2014 for GFDL-ESM4, 1956–1970 for SpeedyWeather.jl and 2007–2021 for ERA5.

We apply several preprocessing steps to the simulated input data. We first interpolate the input simulations onto the same high-resolution grid as the ground-truth ERA5 data for downscaling purposes and model evaluation. A low-pass filter is then applied to remove small-scale artefacts created by the interpolation. QDM27 with 500 quantiles is then applied in the standard way to remove distributional biases in the ESM simulations for each grid cell individually. As discussed in the ‘Bias correction’ section, the generative downscaling only corrects biases related to a specified spatial scale. Hence, the QDM step ensures a strong reduction in single-cell biases, whereas the generative downscaling corrects spatial patterns so that they are physically consistent. Finally, the ESM and ERA5 data are log-transformed, \(\tilde{x}=\log \left(x+\epsilon \right)-\log \left(\epsilon \right)\), where ϵ = 0.0001 (ref. 9), followed by a normalization approximately into the range [–1, 1].
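The log transform and its inverse can be sketched as follows. The affine normalization is an assumption, since the text only states that values are mapped approximately into [–1, 1], and the gamma-distributed field is a toy stand-in for daily precipitation.

```python
import numpy as np

EPS = 1e-4  # epsilon = 0.0001 from the text

def log_transform(x, eps=EPS):
    # x_tilde = log(x + eps) - log(eps); maps zero precipitation to zero
    return np.log(x + eps) - np.log(eps)

def inverse_log_transform(x_tilde, eps=EPS):
    # exact inverse of log_transform
    return np.exp(x_tilde + np.log(eps)) - eps

def normalize(x_tilde, lo, hi):
    # affine map of [lo, hi] onto [-1, 1] (assumed form of the normalization)
    return 2.0 * (x_tilde - lo) / (hi - lo) - 1.0

precip = np.random.gamma(0.5, 2.0, size=(60, 96))  # toy daily precipitation field
z = log_transform(precip)
zn = normalize(z, z.min(), z.max())
recovered = inverse_log_transform(z)
```

The offset −log(ϵ) ensures that dry grid cells (x = 0) map exactly to zero before normalization.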

Score-based diffusion models

The underlying idea of diffusion-based generative models is to learn a reverse diffusion process from a known prior distribution x(t = T) ∼ pT, such as a Gaussian distribution, to a target data distribution x(t = 0) ∼ p0, where \({\bf{x}}\in {{\mathbb{R}}}^{d}\) and d is the data dimension, for example, the number of pixels in an image. Score-based generative diffusion models13,38,39 generalize probabilistic denoising diffusion models12,40 to continuous-time SDEs.

In this framework, the forward diffusion process that incrementally perturbs the data can be described as the solution of the SDE:

$${\rm{d}}{\bf{x}}=\mu ({\bf{x}},t){\rm{d}}t+g(t){\rm{d}}{\bf{w}},$$
(1)

where \(\mu ({\bf{x}},t):{{\mathbb{R}}}^{d}\to {{\mathbb{R}}}^{d}\) is the drift term, w denotes a Wiener process and \(g(t)\in {\mathbb{R}}\) is the diffusion coefficient. The corresponding reverse-time SDE is given by

$${\rm{d}}{\bf{x}}=\left[\mu ({\bf{x}},t)-g{(t)}^{2}{\nabla }_{{\bf{x}}}\log {p}_{t}({\bf{x}})\right]{\rm{d}}\bar{t}+g(t){\rm{d}}\bar{{\bf{w}}},$$
(2)

where \(\bar{t}\) denotes a time reversal, \(\bar{{\bf{w}}}\) the corresponding reverse-time Wiener process and \({\nabla }_{{\bf{x}}}\log {p}_{t}({\bf{x}})\) is the score function of the target distribution. The score function is not analytically tractable, but one can train a score network, \(s({\bf{x}},t;{{\phi }}):{{\mathbb{R}}}^{d}\to {{\mathbb{R}}}^{d}\), to approximate it, \(s({\bf{x}},t;{{\phi }})\approx {\nabla }_{{\bf{x}}}\log {p}_{t}({\bf{x}})\), for example, using denoising score matching39 (Supplementary Information). For sampling, we use the Euler–Maruyama solver to integrate the reverse SDE in equation (2) from t = T to t = 0 with 500 steps.
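Euler–Maruyama integration of the reverse SDE can be sketched on a one-dimensional Gaussian toy problem, where the score is analytic and stands in for the trained score network. The variance-exploding convention σ(t) = t with μ = 0 and g(t)² = 2t is an assumption for illustration, not necessarily the SDE used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(x, t):
    # analytic score for a toy target p0 = N(0, 1) noised as x_t = x0 + t z:
    # p_t = N(0, 1 + t^2), so grad_x log p_t(x) = -x / (1 + t^2)
    return -x / (1.0 + t**2)

def euler_maruyama_reverse(x, score_fn, t_max=80.0, n_steps=500):
    # integrate the reverse SDE backwards from t_max to ~0,
    # with mu = 0 and the variance-exploding choice g(t)^2 = 2t (assumed)
    ts = np.linspace(t_max, 1e-3, n_steps + 1)
    for i in range(n_steps):
        t, dt = ts[i], ts[i] - ts[i + 1]
        g2 = 2.0 * t
        x = x + g2 * score_fn(x, t) * dt + np.sqrt(g2 * dt) * rng.standard_normal(x.shape)
    return x

x_T = 80.0 * rng.standard_normal(10_000)   # prior ~ N(0, t_max^2)
samples = euler_maruyama_reverse(x_T, toy_score)
```

With an exact score, the sampler recovers the target distribution up to discretization error; here the sample standard deviation ends up close to 1.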

Consistency models

One major drawback of current diffusion models is that the numerical integration of the differential equation requires around 10–2,000 network evaluations, depending on the solver. This makes the generation process computationally inefficient and costly compared with other generative models such as GANs4 or NFs3,41, which can generate images in a single network evaluation. Distillation techniques can reduce the number of integration steps of diffusion models, which often represent a computational bottleneck21,22.

CMs can be trained from scratch without distillation and only require a single step to generate a new sample. They have been shown to outperform current distillation techniques23. CMs learn a consistency function, f(x(t), t) = x(tmin), which is self-consistent, that is,

$$f({\bf{x}}(t),t)=f({\bf{x}}(t{\prime} ),t{\prime} ),\quad \forall \,t,t{\prime} \in [{t}_{\min },{t}_{\max }],$$
(3)

where the time interval is set to tmin = 0.002 and tmax = 80 (ref. 23). Further, a boundary condition f(x(tmin), tmin) = x(tmin) for t = tmin is imposed. This can be implemented with the following parameterization:

$$f({\bf{x}},t;{{\theta }})={c}_{{\rm{skip}}}(t){\bf{x}}+{c}_{{\rm{out}}}(t)F({\bf{x}},t;{{\theta }}),$$
(4)

where \(F(\cdot )\) is a U-Net with parameters θ. The time information is transformed using a sine–cosine positional embedding in the network. The coefficients cskip(t) and cout(t) are defined19,23 as

$${c}_{{\rm{skip}}}(t)=\frac{{\sigma }_{{\rm{data}}}^{2}}{{(t-{t}_{\min })}^{2}+{\sigma }_{{\rm{data}}}^{2}},\qquad {c}_{{\rm{out}}}(t)=\frac{{\sigma }_{{\rm{data}}}(t-{t}_{\min })}{\sqrt{{t}^{2}+{\sigma }_{{\rm{data}}}^{2}}}.$$
(5)
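The parameterization and its boundary condition can be checked numerically. The value σdata = 0.5 and the exact form of cout follow the conventions of ref. 23 and are assumptions here; the tanh function is a stand-in for the trained U-Net.

```python
import numpy as np

SIGMA_DATA = 0.5            # convention from ref. 23 (assumed)
T_MIN, T_MAX = 0.002, 80.0

def c_skip(t):
    # equals 1 at t = t_min, so the skip connection dominates at the boundary
    return SIGMA_DATA**2 / ((t - T_MIN)**2 + SIGMA_DATA**2)

def c_out(t):
    # vanishes at t = t_min, silencing the network output at the boundary
    return SIGMA_DATA * (t - T_MIN) / np.sqrt(t**2 + SIGMA_DATA**2)

def consistency_fn(F, x, t):
    # f(x, t; theta) = c_skip(t) x + c_out(t) F(x, t; theta)
    return c_skip(t) * x + c_out(t) * F(x, t)

F = lambda x, t: np.tanh(x)        # stand-in for the trained U-Net
x = np.random.randn(4, 4)
# boundary condition: f(x, t_min) = x, regardless of what F does
assert np.allclose(consistency_fn(F, x, T_MIN), x)
```

Because cskip(tmin) = 1 and cout(tmin) = 0, the boundary condition f(x(tmin), tmin) = x(tmin) holds by construction for any network F.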

The training objective is given by

$${\mathcal{L}}\left({{\theta }},\bar{{{\theta }}}\right)={{\mathbb{E}}}_{{\bf{x}},n,{\bf{z}}}\left[d\left(\,f({\bf{x}}+{t}_{n+1}{\bf{z}},{t}_{n+1};{{\theta }}),f\left({\bf{x}}+{t}_{n}{\bf{z}},{t}_{n};\bar{{{\theta }}}\right)\right)\right],$$

(6)

where \({{\mathbb{E}}}_{{\bf{x}},n,{\bf{z}}}\equiv {{\mathbb{E}}}_{{\bf{x}} \sim {p}_{{\rm{data}}},n \sim {\mathcal{U}}(1,N(k)-1),{\bf{z}} \sim {\mathcal{N}}({\bf{0}},{\bf{I}})}\). The discrete time step is determined via

$${t}_{n}={\left({t}_{\min }^{\,1/\rho }+\frac{n-1}{N(k)-1}\left({t}_{\max }^{\,1/\rho }-{t}_{\min }^{\,1/\rho }\right)\right)}^{\rho },$$

(7)

where ρ = 7 and the discretization schedule is given by

$$N(k)=\left\lceil \sqrt{\frac{k}{K}\left({({s}_{1}+1)}^{2}-{s}_{0}^{2}\right)+{s}_{0}^{2}}-1\right\rceil +1,$$

(8)

where k is the current training step and K is the estimated total number of training steps obtained from the PyTorch Lightning library. The initial discretization steps are set to s0 = 2 and the maximum number of steps to s1 = 150 (ref. 23). With \(\bar{{{\theta }}}\), we denote an exponential moving average over the model parameters θ, updated with \(\bar{{{\theta }}}={\rm{stopgrad}}\left[w(k)\bar{{{\theta }}}+(1-w(k)){{\theta }}\right]\), with the decay schedule given by

$$w(k)=\exp \left(\frac{{s}_{0}\log \left({w}_{0}\right)}{N(k)}\right),$$

(9)

where w0 = 0.9 is the initial decay rate23. For the distance measure \(d(\cdot ,\cdot )\), we follow ref. 23 and use a combination of the learned perceptual image patch similarity42 (LPIPS) and the \({\ell }_{1}\) norm:

$$d({\bf{x}},{\bf{y}})={\rm{LPIPS}}({\bf{x}},{\bf{y}})+| | {\bf{x}}-{\bf{y}}| {| }_{1}.$$

(10)

Thus, the training of the CM is self-supervised and closely related to representation learning43, where a so-called online network \(f(\cdot ;{{\theta }})\) is trained to predict the same image representation as a ‘target network’ \(f(\cdot ;\bar{{{\theta }}})\) (ref. 44). Importantly, the CM is, therefore, not trained explicitly for the downscaling tasks, which are performed purely at the inference stage.
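The schedules in equations (7)–(9) can be combined into a toy training-loop sketch. The linear "network" f(x, t; θ) = θx, the absence of a gradient step on θ and the ℓ1-only distance (LPIPS omitted) are all simplifications for illustration.

```python
import numpy as np

T_MIN, T_MAX, RHO = 0.002, 80.0, 7.0
S0, S1, W0 = 2, 150, 0.9

def discretization_schedule(k, K):
    # N(k): number of discretization steps grows from s0 to s1 + 1 over training
    return int(np.ceil(np.sqrt(k / K * ((S1 + 1)**2 - S0**2) + S0**2) - 1)) + 1

def karras_steps(N):
    # t_n for n = 1..N, with spacing controlled by rho = 7
    n = np.arange(1, N + 1)
    return (T_MIN**(1 / RHO) + (n - 1) / (N - 1) * (T_MAX**(1 / RHO) - T_MIN**(1 / RHO)))**RHO

def ema_decay(N_k):
    # w(k) = exp(s0 log(w0) / N(k))
    return np.exp(S0 * np.log(W0) / N_k)

rng = np.random.default_rng(0)
theta, theta_bar = 1.5, 1.5           # toy scalar "network weights"
x = rng.standard_normal(256)
for k in range(10):                   # toy run with K = 10 total steps
    N_k = discretization_schedule(k, 10)
    ts = karras_steps(N_k)
    n = int(rng.integers(1, N_k))     # n ~ U(1, N(k) - 1)
    z = rng.standard_normal(x.shape)
    online = theta * (x + ts[n] * z)          # f(x + t_{n+1} z, t_{n+1}; theta)
    target = theta_bar * (x + ts[n - 1] * z)  # f(x + t_n z, t_n; theta_bar), stopgrad
    loss = np.abs(online - target).mean()     # l1 part of d; LPIPS omitted
    # in real training, theta would now be updated by a gradient step on the loss
    theta_bar = ema_decay(N_k) * theta_bar + (1 - ema_decay(N_k)) * theta
```

Note how the self-supervised objective only compares the model to its own moving-average copy at adjacent noise levels; no clean target image enters the loss directly.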

Network architectures and training

We use a two-dimensional U-Net13,45 from the Diffusers library to train both score and consistency networks from scratch, with four down- and upsampling layers. For these four layers, we use convolutions with 128, 128, 256 and 256 channels, and 3 × 3 kernels, sigmoid linear unit activations, group normalization and an attention layer at the architecture bottleneck. The network has, in total, around 27M trainable parameters.

We train the score network with the Adam optimizer46 for 200 epochs, with a batch size of 1, a learning rate of 2 × 10−4 and an exponential moving average over the model weights with a decay rate of 0.999 (Supplementary Information).

The CM model is trained for 150 epochs23, with the RAdam optimizer47 and the same batch size, learning rate and exponential moving average schedule (with an initial decay rate of w0 = 0.9) as the score network. We find that the loss decreases in a stable way throughout the training (Supplementary Fig. 1). The training of 150 epochs takes around six and a half days for the CM and four and a half days for the SDE on an NVIDIA V100 32 GB graphics processing unit. A summary of the training hyperparameters is given in Supplementary Table 1.

Scale-consistent downscaling

As shown in other work17,48, adding Gaussian noise with a chosen variance to an image (or fluid dynamical snapshot) results in removing spatial patterns up to a specific spatial scale associated with the amount of added noise. The trained generative model can then replace the noise with spatial patterns learned from the training data up to the chosen spatial scale.

In principle, the spatial scale can be chosen depending on the given downscaling task, for example, related to the ESM resolution or variable. Hence, our method allows for much more flexibility after training in which the optimal spatial scale could be defined with respect to any given metric. In general, ESM fields are too smooth at small spatial scales, which presents a key problem for Earth system modelling in general and impact assessments in particular. More specifically, when comparing the frequency distribution of spatial precipitation fields in terms of spatial PSDs, it can be seen that ESMs lack the high-frequency spatial variability, or spatial intermittency, which is a key characteristic of precipitation9. Hence, a natural choice for the spatial scale to be preserved in the ESM fields is the intersection of PSDs from the ESM and the ground-truth ERA5 (ref. 17) (Fig. 3), that is, the scale at which the ESM fields become too smooth.
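A sketch of how the intersection wavenumber k* could be estimated from radially binned PSDs follows. The binning convention, the intersection heuristic and the toy spectra are all assumptions for illustration.

```python
import numpy as np

def radial_psd(field):
    # isotropic PSD via radially binned 2D FFT power (one simple convention)
    f = np.fft.fftshift(np.fft.fft2(field))
    power = np.abs(f)**2 / field.size**2
    ny, nx = field.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(y - ny // 2, x - nx // 2).astype(int)
    return np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())

def intersection_wavenumber(psd_esm, psd_ref):
    # first wavenumber at which the (too smooth) ESM spectrum drops below the reference
    diff = psd_esm - psd_ref
    return int(np.argmax(diff[1:] < 0)) + 1

# toy spectra: the "ESM" lacks small-scale power relative to the reference
k = np.arange(1, 32)
psd_esm = np.concatenate([[1.0], 1.0 / k**2])
psd_ref = np.concatenate([[1.0], 0.5 / k**2 + 0.01])
k_star = intersection_wavenumber(psd_esm, psd_ref)
```

Scales below k* (large wavelengths) are kept from the ESM; scales above it are regenerated by the generative model.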

For Gaussian noise, the variance as a function of time t can be related to the PSD of a given wavenumber k and the grid size N by17

$${\sigma }^{2}(t)={N}^{2}\,{\rm{PSD}}(k).$$
(11)

Using equation (11), we choose k* = 0.0667 (Supplementary Fig. 2), such that it represents the wavenumber or spatial scale at which the PSDs of the ESM and ERA5 precipitation fields intersect. This corresponds to t* = 0.468 for the CM variance schedule and t* = 0.355 for the SDE bridge.

The diffusion bridge17 starts with the forward SDE in equation (1), initialized with a precipitation field from the POEM ESM. The forward SDE is then integrated until t = t*. The reverse SDE (equation (2)), initialized at t = t*, then denoises the field again, adding structure from the target ERA5 distribution.
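The bridge amounts to noising up to t* followed by reverse integration back to t = 0. In this sketch, the variance-exploding convention x_t = x + σ(t)z with σ(t) = t, the toy analytic score and the step count are illustrative assumptions; the real method uses the trained score network.

```python
import numpy as np

rng = np.random.default_rng(1)
T_STAR = 0.355   # SDE bridge noising time from the text

def diffusion_bridge(x_esm, score_fn, t_star=T_STAR, n_steps=100):
    # forward: noise the ESM field up to t*, erasing scales finer than k*
    x = x_esm + t_star * rng.standard_normal(x_esm.shape)
    # reverse: Euler-Maruyama from t* back to ~0, assuming g(t)^2 = 2t
    ts = np.linspace(t_star, 1e-3, n_steps + 1)
    for i in range(n_steps):
        t, dt = ts[i], ts[i] - ts[i + 1]
        x = x + 2 * t * score_fn(x, t) * dt + np.sqrt(2 * t * dt) * rng.standard_normal(x.shape)
    return x

toy_score = lambda x, t: -x / (1.0 + t**2)   # analytic Gaussian score, not the trained network
x_hat = diffusion_bridge(np.zeros((60, 96)), toy_score)
```

Because the reverse integration starts at t* rather than at t = T, only scales masked by the added noise are regenerated, while the large-scale ESM structure survives the bridge.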

For the CM approach, at inference, we apply the ‘stroke guidance’ technique16,23, where we first sample a noised ESM field \({\tilde{{\bf{x}}}}^{{\rm{ESM}}}\in {{\mathbb{R}}}^{d}\) with variance corresponding to t*:

$${\tilde{{\bf{x}}}}^{{\rm{ESM}}} \sim {\mathcal{N}}({{\bf{x}}}^{{\rm{ESM}}},{\sigma }^{2}({t}^{* }){\bf{1}}),$$

(12)

which is then denoised in a single step with the CM:

$$\hat{{\bf{x}}}=f\left({\tilde{{\bf{x}}}}^{{\rm{ESM}}},{t}^{* };{{\theta }}\right),$$

(13)

thereby generating realistic samples \(\hat{{\bf{x}}}\) in a highly efficient manner, preserving the unbiased spatial patterns of the ESM up to the scale k*.
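Equations (12) and (13) amount to a two-line inference procedure. The σ(t) = t noise convention and the closed-form stand-in for the trained CM are assumptions in this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
T_STAR_CM = 0.468   # CM noising time from the text

def stroke_guidance(f, x_esm, t_star=T_STAR_CM):
    # eq. (12): sample a noised ESM field with variance sigma^2(t*) (sigma(t) = t assumed)
    x_noised = x_esm + t_star * rng.standard_normal(x_esm.shape)
    # eq. (13): denoise in a single consistency-model step
    return f(x_noised, t_star)

cm = lambda x, t: x / (1.0 + t**2)   # closed-form stand-in for the trained CM
x_hat = stroke_guidance(cm, np.zeros((60, 96)))
```

In contrast to the diffusion bridge, which needs many solver steps, this requires exactly one network evaluation per sample, which is where the speed-up of the CM approach comes from.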
