
Robust virtual staining of landmark organelles with Cytoland

Datasets

We combined public and in-house datasets to develop the proposed training strategies and the models. Extended Data Table 1 provides a summary of the datasets used for training and testing specific models. Details of cell culture and image acquisition can be found in Supplementary Note 1.

The phase-contrast images from the training and validation split of the LIVECell dataset40 were used for the FCMAE pre-training of VSCyto2D.

We used two subsets generated with different imaging protocols from the Allen Institute for Cell Science (AICS) iPSC dataset5 for training and testing VSCyto3D. We used all 3,446 FOVs from Pipeline 4.1 for training and a random subset of 20 FOVs from Pipeline 4 for testing.

Preprocessing

All internal datasets were acquired in uncompressed lossless formats (that is, OME-TIFF and ND-TIFF) and converted to OME-Zarr46 using iohub (https://github.com/czbiohub-sf/iohub)47. The public dataset was also converted to OME-Zarr from OME-TIFF stacks. The preprocessing, training and evaluation protocols below use OME-Zarr as input/output format to enable parallel processing and efficient storage.

Deconvolution

The reconstruction from bright-field and fluorescence stacks to phase density and fluorescence density was performed with the waveOrder package (https://github.com/mehta-lab/waveOrder)11,32,34.

The acquired bright-field and fluorescence stacks were modelled as filtered versions of the unknown specimen properties, phase density and fluorescence density, respectively. This blur was represented by a low-pass optical transfer function in Fourier space and a point spread function in real space, which were simulated using calibrated parameters of the imaging system (numerical apertures of imaging and illumination, wavelength of illumination and pixel size at the specimen plane). The simulated point spread functions were calibrated using images of beads and test targets. The simulated optical transfer functions were used to restore phase density and fluorescence density, respectively, from the bright-field and fluorescence stacks using a Tikhonov-regularized inverse filter. The regularization parameters for the inverse filter were chosen such that the contrast due to cellular structure in the mid-band of the optical transfer function was maximized34.
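The inverse filtering step can be illustrated with a minimal sketch of a Tikhonov-regularized inverse filter applied in Fourier space, assuming the optical transfer function has already been simulated (for example, with waveOrder); the function name, the regularization value and the assumption that the OTF is sampled on the same unshifted Fourier grid as the image are illustrative, not the calibrated settings used in this work.

```python
import numpy as np

def tikhonov_deconvolve(stack: np.ndarray, otf: np.ndarray, reg: float = 1e-3) -> np.ndarray:
    """Restore specimen properties from a blurred stack given its optical transfer function (OTF)."""
    stack_hat = np.fft.fftn(stack)
    # Tikhonov-regularized inverse filter: conj(OTF) / (|OTF|^2 + regularization)
    inverse_filter = np.conj(otf) / (np.abs(otf) ** 2 + reg)
    return np.real(np.fft.ifftn(stack_hat * inverse_filter))
```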

Registration

The label-free and fluorescence channels were registered with biahub48. After registration, the resulting volumes were cropped to ZYX shape of (50, 2,044, 2,005) for the HEK293T Zernike phase contrast test dataset, (9, 2,048, 2,048) for A549, (12, 2,048, 2,009) for BJ-5ta and (26, 2,048, 2,007) for iNeuron. The neuromast datasets acquired with the wide-field fluorescence microscope were registered to the phase density channel and cropped to (107, 1,024, 1,024). The datasets acquired in the iSIM set-up were cropped to (81, 1,024, 1,024).

Additional preprocessing (iNeuron)

The fluorescence signal in iNeuron cells was further processed to improve contrast for virtual staining and segmentation. Paired 2D images were generated from each imaging volume.

For the calcein channel, the soma is much brighter than the neurites. The mean projection along the axial dimension and natural logarithm of one plus the input (‘log1p’) were applied to compress the dynamic range. The result was normalized so that the 99th percentile is 0 and the 99.99th percentile is 1, and then clipped to a range of 0 to 5.

To suppress fluorescence from dead cells in the Hoechst channel, the maximum projection of Hoechst volumes was multiplied with the mean projection of the raw calcein channel. The result was normalized so that the median is 0 and the 99.99th percentile is 1, and then clipped to a range of 0 to 5.
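The projection and rescaling steps for the two fluorescence channels can be summarized in a short sketch; the helper names are hypothetical, and the percentile-based rescaling is written as (x − low)/(high − low) followed by clipping, following the description above.

```python
import numpy as np

def rescale_and_clip(x: np.ndarray, low: float, high: float) -> np.ndarray:
    """Map `low` to 0 and `high` to 1, then clip to the range [0, 5]."""
    return np.clip((x - low) / (high - low), 0, 5)

def preprocess_calcein(volume: np.ndarray) -> np.ndarray:
    # mean Z-projection followed by log1p to compress the bright somata
    projected = np.log1p(volume.mean(axis=0))
    return rescale_and_clip(projected, *np.percentile(projected, [99, 99.99]))

def preprocess_hoechst(hoechst: np.ndarray, calcein: np.ndarray) -> np.ndarray:
    # weight the nuclear signal by live-cell (calcein) intensity to suppress dead cells
    weighted = hoechst.max(axis=0) * calcein.mean(axis=0)
    return rescale_and_clip(weighted, np.median(weighted), np.percentile(weighted, 99.99))
```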

To match the shape of the fluorescence channels, a single Z-slice (at 8 µm from the bottom of the volumes) was taken from the phase channel as the input to virtual-staining models.

Model architecture

There is an active debate41,49,50,51 about whether transformer models that use attention operations fundamentally outperform convolutional neural networks, which rely on the inductive bias of shift equivariance, for image translation and segmentation tasks. Systematic comparisons suggest that convolutional models perform as well as transformer models51,52 when a large compute budget is spent, and outperform transformer models when a moderate compute budget is spent. Therefore, we opted to use a fully convolutional architecture for this work. We integrated concepts from U-Net36, ConvNeXt v.235,37 and SparK38 to develop an architecture for 2D, 3D or 2.5D image translation. A projection module in the stem and head of the network enables a flexible choice of the number of slices in the input and output stacks (Extended Data Fig. 1). The body of the network is a U-Net-like hierarchical encoder and decoder with skip connections that learns a high-resolution mapping between input and output.
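As one plausible reading of the projection stem and head described above, the sketch below collapses an arbitrary number of input Z-slices into 2D feature maps and expands 2D feature maps back into an output stack. The layer choices (a depth-spanning 3D convolution in the stem, a 1 × 1 convolution in the head), class names and all hyperparameters are illustrative assumptions, not the published UNeXt2 implementation.

```python
import torch
import torch.nn as nn

class ProjectionStem(nn.Module):
    """Project an input stack of `in_depth` slices into 2D feature maps (illustrative)."""
    def __init__(self, in_channels: int, out_channels: int, in_depth: int):
        super().__init__()
        # a 3D convolution whose kernel spans the full input depth collapses Z to a single plane
        self.proj = nn.Conv3d(in_channels, out_channels, kernel_size=(in_depth, 4, 4), stride=(1, 4, 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, D_in, H, W)
        return self.proj(x).squeeze(2)                    # -> (B, out_channels, H/4, W/4)

class ProjectionHead(nn.Module):
    """Expand 2D feature maps into an output stack of `out_depth` slices (illustrative)."""
    def __init__(self, in_channels: int, out_channels: int, out_depth: int):
        super().__init__()
        self.out_channels, self.out_depth = out_channels, out_depth
        self.proj = nn.Conv2d(in_channels, out_channels * out_depth, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, _, h, w = x.shape
        return self.proj(x).view(b, self.out_channels, self.out_depth, h, w)
```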

We chose the layers and blocks of the model as follows. We developed an asymmetric U-Net model with ConvNeXt v.235 blocks for both virtual staining (Extended Data Fig. 1) and FCMAE pre-training (Extended Data Fig. 2). The original ConvNeXt v.2 work explored an asymmetric U-Net configuration for FCMAE pre-training and showed that it has identical fine-tuning performance on an image classification task. Meanwhile, SparK38 used ConvNeXt v.1 blocks in the encoder and plain U-Net blocks in the decoder for its masked image modelling pre-training task. We used the ‘Tiny’ ConvNeXt v.2 backbone in the encoder. For FCMAE pre-training, one ConvNeXt v.2 block was used per decoder stage. For virtual-staining models, each decoder stage consisted of two ConvNeXt v.2 blocks.

The UNeXt2 architecture provides 15 times more learnable parameters for 3D image translation than our previously published 2.5D U-Net at the same computational cost (Table 1). The efficiency gains are even more notable when compared with 3D U-Net. This approach enables the allocation of the available computing budget to train moderate-sized models faster or to train more expressive models that generalize to new imaging conditions and cell types. We evaluated a few different loss functions, shown in Supplementary Table 1. The models trained for joint prediction of nuclei and membranes are slightly more accurate than models trained for prediction of nuclei alone (Table 1).

Table 1 Computational complexity and capacity of architectures

Model training

Intensity statistics, including the mean, standard deviation and median, were calculated at the resolution of individual FOVs and of the whole dataset by subsampling each FOV on a square grid with a spacing of 32 pixels in each camera frame. These pre-computed statistics were then used to apply normalization transforms by subtracting either the median or the mean and dividing by the interquartile range or the standard deviation, respectively. This enables standardization of the training data at the level of the whole dataset, of each FOV or of each patch11, depending on the use case.
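A minimal sketch of this normalization, assuming statistics are estimated from a 32-pixel subsampling grid; the function and argument names are hypothetical.

```python
import numpy as np

def compute_statistics(image: np.ndarray, grid: int = 32) -> dict:
    """Estimate intensity statistics from a sparse grid of pixels in each camera frame."""
    sampled = image[..., ::grid, ::grid]
    q25, median, q75 = np.percentile(sampled, [25, 50, 75])
    return {"mean": sampled.mean(), "std": sampled.std(), "median": median, "iqr": q75 - q25}

def normalize(image: np.ndarray, stats: dict, robust: bool = True) -> np.ndarray:
    """Standardize with median/IQR (robust) or mean/s.d. statistics."""
    if robust:
        return (image - stats["median"]) / stats["iqr"]
    return (image - stats["mean"]) / stats["std"]
```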

Training objectives

The mixed image reconstruction loss53 was adapted as the training objective of the virtual-staining models: \(\mathcal{L}^{\mathrm{mix}} = 0.5\,\mathcal{L}^{\mathrm{2.5D\;MS\text{-}SSIM}} + 0.5\,\mathcal{L}^{\ell_1}\). The first term, \(\mathcal{L}^{\mathrm{2.5D\;MS\text{-}SSIM}}\), is the multi-scale structural similarity index54 measured without downsampling along the depth dimension, and \(\mathcal{L}^{\ell_1}\) is the L1 distance (mean absolute error). The virtual-staining performance of different loss functions is compared in Supplementary Table 1.
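A minimal sketch of this objective in PyTorch, assuming the pytorch_msssim package for the MS-SSIM term and treating that term as 1 − MS-SSIM; the published 2.5D variant, which avoids downsampling along the depth dimension, is not reproduced here, and the data-range estimate is an assumption.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim

def mixed_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """0.5 * (1 - MS-SSIM) + 0.5 * L1 on (B, C, H, W) tensors."""
    data_range = float(target.max() - target.min())
    ssim_term = 1.0 - ms_ssim(pred, target, data_range=data_range)
    l1_term = F.l1_loss(pred, target)
    return 0.5 * ssim_term + 0.5 * l1_term
```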

The mean square error loss is used for FCMAE pre-training on label-free images, following the original implementation35.

Data augmentations

The data augmentations were performed with transformations from the MONAI library43. We used spatial (Supplementary Table 6) and intensity (Supplementary Table 7) augmentations during training to simulate geometric and contrast variations introduced by different imaging systems, and applied them either to both the source and target channels to achieve equivariance, or only to the target channel to achieve invariance.
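A minimal sketch of dictionary-based MONAI transforms illustrating the two cases described above: a spatial transform keyed to both channels (equivariance) and intensity transforms keyed to a single channel (invariance). The specific transforms and parameter values are illustrative assumptions; the published settings are listed in Supplementary Tables 6 and 7.

```python
from monai.transforms import Compose, RandAffined, RandAdjustContrastd, RandGaussianNoised

train_transforms = Compose([
    # spatial augmentation applied to both channels, so the learned mapping is equivariant to it
    RandAffined(keys=["source", "target"], prob=0.5, rotate_range=0.2, scale_range=0.1),
    # intensity augmentations applied to the target channel only, so the prediction is invariant to them
    RandAdjustContrastd(keys=["target"], prob=0.5, gamma=(0.8, 1.2)),
    RandGaussianNoised(keys=["target"], prob=0.5, std=0.1),
])
```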

Normalization

Normalization was performed at both training and evaluation time.

VS-HEK293T

For each channel, the dataset-level median was subtracted from the image volume, and the result was divided by the dataset-level interquartile range. As our Zernike phase contrast microscope generates inverted contrast compared with quantitative phase, the Zernike phase images of HEK293T cells were additionally inverted after normalization.

VSCyto2D, VS-BJ5-ta, VS-iNeuron and VSCyto3D

Each image volume was independently normalized before being used as model input to account for differences in culture confluence and background fluorescence. The phase channel was normalized to zero mean and unit standard deviation, and the fluorescence channels were normalized to zero median and unit interquartile range. For the iNeuron dataset, normalization was applied only to the phase channel, as the fluorescence target was already preprocessed for contrast adjustment.

VSNeuromast

This model normalizes the label-free channel per FOV by subtracting the median and dividing by the interquartile range.

Training data pooling

VSCyto2D

Image volumes of HEK293T cells were downsampled from the 63x dataset with ZYX average pooling ratios of (9, 3, 3). For the VSCyto2D model reported in Fig. 1, training data were sampled from the downsampled HEK293T dataset, the A549 dataset and the BJ-5ta dataset with equal weights.

VSCyto3D and VS-infection

During FCMAE pre-training, phase images of uninfected and OC43-infected HEK293T55, uninfected and ZIKV-infected A549, and the public iPSC dataset from AICS were used. This base model was used to initialize encoder weights for the VSCyto3D and VS-infection models. For VSCyto3D, phase and fluorescence images were sampled from the healthy HEK293T and A549 datasets, and the iPSC dataset from AICS.

VSNeuromast

The data used in our methods were pooled from four OME-Zarr stores, which contain neuromasts from the 3 days post-fertilization (dpf), 6 dpf and 6.5 dpf stages. These stores include both the whole FOVs and centre-cropped versions focused on the neuromast. For the cropped FOVs, a weighted cropping technique was applied to ensure the inclusion of training patches containing the neuromast. Conversely, the uncropped dataset employed an unweighted cropping method to incorporate additional contextual information. A high-content screening dataloader was developed to sample equally from multiple datasets of variable length.

The time-lapse dataset was processed by registering the experimental fluorescence channels to the phase density channel, and the data were downsampled by a factor of 2.1 to match the pixel size between the training and test sets of VSNeuromast.

Training protocol

All models were trained on four graphics processing units with the distributed data parallel strategy. All FCMAE models were trained with a masking ratio of 0.5.

VS-HEK293T

Models were trained with a warmup-cosine-annealing schedule. A mini-batch size of 32 and a base learning rate of 0.0002 were used. The training and validation patch ZYX size was (5, 384, 384). For testing the effect of deconvolution (Fig. 2b), models were trained for 100 epochs. For testing robustness to imaging conditions (Fig. 2d), models were trained for 50 epochs.
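A minimal sketch of a warmup-cosine-annealing schedule in plain PyTorch, assuming a 10-epoch linear warmup for a 100-epoch run; the warmup length, optimizer choice and placeholder model are assumptions, not the settings used in our training pipeline.

```python
import torch

model = torch.nn.Conv2d(1, 1, kernel_size=3)                # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)  # base learning rate of 0.0002

warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[10])

for epoch in range(100):
    # ... run one training epoch over mini-batches here ...
    optimizer.step()   # placeholder for the optimization performed inside the epoch
    scheduler.step()   # advance the learning-rate schedule once per epoch
```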

VSCyto2D, VS-BJ5-ta and VS-iNeuron

A training and validation patch ZYX size of (1, 256, 256), a mini-batch size of 32, automatic mixed precision and a 0.0002 base learning rate were used for all models. FCMAE pre-training was performed for 800 epochs. The mask patch size was 16. Both FCMAE and virtual-staining pre-training used a warmup-cosine-annealing schedule. For the VS-BJ5-ta experiments, the encoder weights were loaded from the FCMAE pre-trained models when applicable. The models were then trained for the virtual-staining task with the encoder weights either frozen or trainable. For testing data scaling with BJ-5ta, models were trained with a constant learning rate: models trained on 6 FOVs were trained for 6,400 epochs, models trained on 27 FOVs for 1,600 epochs and models trained on 117 FOVs for 400 epochs. For VS-iNeuron, the encoder weights were loaded from FCMAE pre-training, and all model parameters were trained using a warmup-cosine-annealing schedule for 1,600 epochs.

VSCyto3D and VS-infection

A training and validation patch ZYX size of (15, 384, 384), automatic mixed precision and a 0.0002 base learning rate were used for all models. FCMAE pre-training used a mini-batch size of 80 for 800 epochs. The mask patch size was 32. The VSCyto3D and VS-infection models had their encoders initialized from the FCMAE pre-training above, and were trained for 100 epochs on the virtual-staining task, using mini-batch sizes of 40 and 32, respectively.

VSNeuromast

A training and validation patch ZYX size of (21, 384, 384), automatic mixed precision and a 0.0002 base learning rate were used for all models. FCMAE pre-training used a mini-batch size of 64 for 8,000 epochs. The mask patch size was 32. The VSNeuromast model was initialized with the encoder from the FCMAE pre-training and trained for a further 65 epochs on the virtual-staining task, using a mini-batch size of 32.

Inference using trained models

For the 2D virtual-staining model VSCyto2D and its fine-tuned derivatives, each slice was predicted separately in a sliding window fashion.

For the 3D virtual-staining models (VS-HEK293T, VSCyto3D, VS-infection and VSNeuromast), a Z-sliding window of the model’s output depth and step size of 1 was used. The predictions from the overlapping windows were then average-blended.
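A minimal sketch of the Z-sliding-window inference with average blending, assuming a model whose input and output depths are equal; XY tiling, normalization and device handling are omitted, and `model` is a placeholder.

```python
import numpy as np
import torch

def predict_volume(model: torch.nn.Module, phase: np.ndarray, depth: int) -> np.ndarray:
    """Slide a window of `depth` slices along Z with step 1 and average overlapping predictions."""
    n_z = phase.shape[0]
    accum = np.zeros(phase.shape, dtype=np.float32)
    weight = np.zeros(n_z, dtype=np.float32)
    model.eval()
    with torch.no_grad():
        for z in range(n_z - depth + 1):
            window = torch.from_numpy(phase[z : z + depth]).float()[None, None]  # (1, 1, D, H, W)
            pred = model(window)[0, 0].numpy()                                    # (D, H, W)
            accum[z : z + depth] += pred
            weight[z : z + depth] += 1.0
    return accum / weight[:, None, None]
```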

Model evaluation

The correspondence between the fluorescence and virtually stained nuclei and plasma membrane channels was measured with regression and segmentation metrics. We describe the segmentation models for each use case below. All segmentation models were also shared with the release of our pipeline, VisCy (‘Code availability’). In situations where the virtual stain rescues the experimental stain (Extended Data Fig. 4), we manually curated the test FOVs to ensure that the experimental fluorescence and its segmentation could be considered a benchmark. The instance segmentations were compared using the AP between nuclei (or cell membranes) segmented from fluorescence density images and from virtually stained images. An instance of a cell was considered a true positive if the intersection over union (IoU) of both segmentations reached a threshold. We computed AP at an IoU of 0.5 (AP@0.5) to evaluate the correspondence between instance segmentations at the coarse spatial scale, and the mean AP across IoU thresholds of 0.5–0.95 to evaluate the correspondence at finer spatial scales.
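A minimal sketch of the AP computation between two instance segmentations, using the TP/(TP + FP + FN) convention common in cell segmentation benchmarks and a greedy IoU matching; an optimal (Hungarian) matching and the batched implementation used in practice are omitted. The mean AP can then be obtained by averaging this quantity over IoU thresholds from 0.5 to 0.95.

```python
import numpy as np

def instance_ap(gt: np.ndarray, pred: np.ndarray, iou_threshold: float = 0.5) -> float:
    """Average precision between labelled masks: TP / (TP + FP + FN) at a given IoU threshold."""
    gt_ids, pred_ids = np.unique(gt)[1:], np.unique(pred)[1:]  # assumes background is labelled 0
    matched, tp = set(), 0
    for g in gt_ids:
        g_mask = gt == g
        best_iou, best_p = 0.0, None
        for p in pred_ids:
            if p in matched:
                continue
            inter = np.logical_and(g_mask, pred == p).sum()
            if inter == 0:
                continue
            iou = inter / np.logical_or(g_mask, pred == p).sum()
            if iou > best_iou:
                best_iou, best_p = iou, p
        if best_iou >= iou_threshold:
            tp += 1
            matched.add(best_p)
    fp, fn = len(pred_ids) - tp, len(gt_ids) - tp
    return tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
```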

VS-HEK293T

Segmentation of H2B-mIFP fluorescence density and virtually stained nuclei was performed with a fine-tuned Cellpose ‘nuclei’ model (Supplementary Table 2). The nuclei segmentation masks were corrected by a human annotator. Segmentation of cells from CAAX-mScarlet fluorescence density and virtually stained plasma membrane was performed with the Cellpose ‘cyto3’ model (Supplementary Table 2). Owing to the loss of CAAX-mScarlet expression in some cells, positive phase density was blended with the CAAX-mScarlet fluorescence density to generate test segmentation targets. For the Zernike phase contrast test dataset, nuclei and cells were also segmented from the phase image using the Cellpose ‘nuclei’ and ‘cyto3’ models, in addition to segmentation from experimental fluorescence images.

PCC was computed between the virtual-staining predictions and fluorescence density images. AP@0.5 and the mean AP over IoU thresholds from 0.5 to 0.95 in 0.05 increments were computed between segmentation masks generated from virtual-staining images and segmentation masks generated from fluorescence density images.

VSCyto2D, VS-BJ5-ta and VS-iNeuron

For HEK293T and A549, segmentation of fluorescence density images as well as virtual-staining predictions was performed with the ‘nuclei’ (nuclei) and ‘cyto3’ (cells) models in Cellpose. For BJ-5ta, the ‘nuclei’ model in Cellpose was used for nuclei segmentation and a fine-tuned ‘cyto3’ model was used for cell segmentation (Supplementary Table 3). The nuclei segmentation target was corrected by a human annotator. PCC was computed between the virtual-staining predictions and fluorescence density images. Average precision at an IoU threshold of 0.5 (AP@0.5) was computed between segmentation masks generated from virtual-staining images and segmentation masks generated from fluorescence density images.

For iNeuron, the soma segmentation was performed with the ‘cyto3’ model in Cellpose (Supplementary Table 3). The neurites were traced from calcein fluorescence or virtual staining with scikit-image56, by multiplying the image with its Meijering-ridge-filtered57 signal, applying Otsu thresholding58, removing small objects and skeletonizing59. The total neurite length in each FOV was approximated by the sum of foreground pixels in the neurite traces. To count the number of neurites connected to each soma, the following steps were taken: (1) the soma foreground mask was subtracted from the neurite traces; (2) the soma labels were then expanded by 6 pixels (~2 µm) without overlapping; and (3) the neurite segments longer than 100 pixels that intersected with these expanded rings were counted as belonging to the respective soma instances.
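A minimal sketch of the neurite-tracing steps with scikit-image; the ridge-filter scales and the small-object size are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np
from skimage import filters, morphology

def trace_neurites(image: np.ndarray, min_size: int = 64) -> np.ndarray:
    """Return a binary neurite skeleton from a 2D calcein (or virtually stained) image."""
    ridge = filters.meijering(image, sigmas=range(1, 4), black_ridges=False)  # enhance bright tubular structures
    enhanced = image * ridge                                                  # weight the image by its ridge response
    binary = enhanced > filters.threshold_otsu(enhanced)                      # Otsu threshold
    cleaned = morphology.remove_small_objects(binary, min_size=min_size)      # remove small objects
    return morphology.skeletonize(cleaned)

# The total neurite length per FOV is approximated by the number of skeleton pixels:
# total_length = trace_neurites(calcein_2d).sum()
```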

VSCyto3D

For the AICS iPSC dataset, segmentation of virtual-staining prediction was performed with the ‘nuclei’ (nuclei) and ‘cyto3’ (cells) models in Cellpose (Supplementary Table 4). Average precision at IoU threshold of 0.5 (AP@0.5) was computed between segmentation masks generated from virtual-staining images and segmentation masks published with the dataset (computationally generated from fluorescence images)5.

VSNeuromast

The nuclei and cell membranes of neuromasts were segmented using Cellpose models, summarized in Supplementary Table 5. We refined the segmented cell instances using the Ultrack45 algorithm, which jointly optimizes instance segmentation and tracking. The segmentation and tracking parameters were fine-tuned individually for the fluorescence and virtual-staining volumes for optimal detection of cells (Fig. 5d).

Model visualization

We visualize principal components of learned features as follows: each XY pixel in the output of a convolutional stage was treated as a sample with channel dimensions and decomposed into eight principal components. The top-three principal components were normalized individually and rendered as RGB values for visualization.
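A minimal sketch of this visualization with scikit-learn, assuming a feature map of shape (C, H, W) with C ≥ 8 taken from one convolutional stage; the function name is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(features: np.ndarray, n_components: int = 8) -> np.ndarray:
    """Treat each XY pixel as a sample over channels and render the top-three principal components as RGB."""
    c, h, w = features.shape
    pixels = features.reshape(c, h * w).T                      # (H*W, C): one sample per pixel
    components = PCA(n_components=n_components).fit_transform(pixels)
    rgb = components[:, :3].reshape(h, w, 3)
    rgb -= rgb.min(axis=(0, 1), keepdims=True)                 # normalize each component independently
    rgb /= rgb.max(axis=(0, 1), keepdims=True) + 1e-8
    return rgb
```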

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
