Next-generation phenotyping of inherited retinal diseases from multimodal imaging with Eye2Gene

Dataset quality control and preparation
The MEH IRD cohort was previously described by Pontikos et al.1 and encompasses 4,501 individuals with IRDs caused by variants in 189 distinct genes, of whom 324 individuals (with variants in 72 genes) were younger than 18 years of age as of 2 August 2019. Individuals with an IRD and a genetic diagnosis confirmed by an accredited genetic diagnostic laboratory were identified, and information about the genetic diagnosis was exported from the MEH electronic health record (OpenEyes) using an SQL query against the Microsoft SQL Server hospital data warehouse.
Images were exported from the MEH Heidelberg Imaging (Heyex) database (Heidelberg Engineering) for all individuals with an IRD, on the basis of their hospital number, for records between 25 March 2004 and 22 October 2019. We selected the Heidelberg Spectralis as it is one of the most widely used medical imaging devices in IRD clinics worldwide and has previously been used in AI-based approaches to IRDs. This resulted in a dataset of 2,103,692 images from 264,299 scans in 4,510 patients. For quality control and data preparation, images were divided by modality, giving 87,534 short-wavelength FAF images in 4,000 patients, 35,608 IR images in 3,731 patients and 1,647,349 SD-OCT images in 3,731 patients. Since each SD-OCT volume comprises many B-scans, we retained only the median four B-scans of each volume (that is, the four scans closest to the scan traversing the fovea), as these are likely to be the most informative. Following this, 141,895 SD-OCT B-scans remained in 3,728 patients.
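A sketch of this B-scan selection step, assuming each SD-OCT volume is available as an ordered list of B-scans whose centre approximates the foveal scan (the indexing convention is illustrative, not the exact Eye2Gene implementation):

```python
import numpy as np

def select_median_bscans(volume_bscans, n_keep=4):
    """Select the n_keep B-scans closest to the centre of an SD-OCT volume.

    Assumes B-scans are ordered by position, so the middle of the list
    approximates the scan traversing the fovea.
    """
    n = len(volume_bscans)
    centre = n // 2
    half = n_keep // 2
    start = max(0, centre - half)
    end = min(n, start + n_keep)
    return volume_bscans[start:end]

# Example: a volume of 49 B-scans yields the four central scans (indices 22-25).
volume = [np.zeros((496, 512)) for _ in range(49)]
selected = select_median_bscans(volume)
print(len(selected))  # 4
```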
For all three modalities, we applied the filtering shown in Supplementary Fig. 16. Any corrupted (unreadable) images were discarded. FAF scans are acquired at two magnification levels, 30 degrees and 55 degrees; we kept the 55-degree images and discarded all others, using the scan metadata to distinguish the two.
To remove low-quality and defective images, we used the Retinograd-AI model to filter out poor-quality scans47. Retinograd-AI was applied to all FAF and SD-OCT images (using only the median B-scan for SD-OCT) in our dataset to obtain a gradeability prediction for each, and all scans graded as partially gradable or ungradable were rejected. Since IR and SD-OCT scans are captured simultaneously, each IR scan was assigned the gradeability score of its corresponding SD-OCT volume and filtered in the same way. After this process, 27,433 FAF, 33,706 IR and 134,293 (33,712 volumes) SD-OCT images remained in 3,315, 3,715 and 3,715 patients, respectively.
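The gradeability filtering logic can be sketched as follows, with `grade_fn` standing in for a hypothetical Retinograd-AI inference call returning 'gradable', 'partially gradable' or 'ungradable' (the actual Retinograd-AI interface may differ); note how IR scans inherit the grade of their paired SD-OCT volume:

```python
def filter_by_gradeability(scans, grade_fn):
    """Keep only scans graded as fully gradable.

    scans    : list of dicts with keys 'path', 'modality' and, for IR scans,
               'paired_oct_path' pointing to the simultaneously acquired OCT.
    grade_fn : callable returning 'gradable', 'partially gradable' or
               'ungradable' for an image path (hypothetical interface).
    """
    kept = []
    for scan in scans:
        if scan["modality"] == "IR":
            # IR and SD-OCT are captured together, so the IR scan takes
            # the gradeability label of its paired OCT volume.
            grade = grade_fn(scan["paired_oct_path"])
        else:
            grade = grade_fn(scan["path"])
        if grade == "gradable":
            kept.append(scan)
    return kept
```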
The number of images rejected at each stage of the process is provided in Supplementary Fig. 16. To ensure sufficient data for training and testing, we restricted our datasets to genes with at least ten patients remaining after filtering, leaving 63 individual genes. The distribution of all 63 genes is presented in Supplementary Fig. 10 and the full breakdown by gene is given in Extended Data Table 2. Following the quality control and gene selection process, 25,233 FAF, 31,357 IR and 124,975 (31,363 volumes) SD-OCT images remained across 3,652 patients and 63 distinct genes. The phenotype distribution of a subset of 2,103 of these patients is provided per gene in Supplementary Table 4 and per phenotype in Supplementary Table 5. The visual acuity distribution across genes is provided in Supplementary Fig. 17.
After quality control, these patients were split into a ‘development’ set of 3,128 patients and a held-out internal test set of 524 patients (28,174 images). Stratified sampling was used to ensure that at least three representative patients for each gene were present in the test set and that no families were present in both test and development sets. The development set was further split into train and validation sets according to an approximate 80/20 split, with 2,451 patients (119,755 images) in the training set and 677 patients (31,605 images) in the validation set. The training set was used to train our 15 constituent Eye2Gene networks, while the internal test set was kept separate to enable testing of the final Eye2Gene model.
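One way to realize such a family-aware, gene-stratified split is sketched below, using a hypothetical per-patient table with 'gene' and 'family_id' columns; the test fraction and the exact stratification logic are illustrative assumptions rather than the procedure used to produce the published split:

```python
import numpy as np
import pandas as pd

def patient_level_split(patients, min_test_per_gene=3, test_frac=0.15, seed=0):
    """Illustrative family-aware, gene-stratified split.

    patients : DataFrame with columns 'patient_id', 'gene', 'family_id'.
    Each gene contributes at least `min_test_per_gene` families (hence patients)
    to the test set, and no family appears on both sides of the split.
    """
    rng = np.random.default_rng(seed)
    test_families = set()
    for gene, group in patients.groupby("gene"):
        families = group["family_id"].unique()
        rng.shuffle(families)
        n_test = max(min_test_per_gene, int(round(test_frac * len(families))))
        test_families.update(families[:n_test])
    is_test = patients["family_id"].isin(test_families)
    return patients[~is_test], patients[is_test]     # development, test

# The development set can then be further split ~80/20 into train and
# validation sets using the same family-aware logic.
```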
In addition to the MEH data described above, we obtained images from a further five centres to enable external validation of Eye2Gene. Oxford Eye Hospital (UK) provided a sample of 29,145 scans from 390 patients with gene diagnoses spanning 33 different genes. The University Eye Hospital of Liverpool (UK) provided a sample of 6,174 scans from 156 patients with gene diagnoses spanning 27 different genes. The University Eye Hospital Bonn (Germany) provided a sample of 2,838 scans from 129 patients with gene diagnoses spanning 12 different genes. The Tokyo Medical Center (Japan) provided a sample of 1,493 scans from 60 patients with gene diagnoses spanning 24 different genes. The Federal University of Sao Paulo (Brazil) provided a sample of 1,494 scans from 40 patients with gene diagnoses spanning ten different genes. The MEH (UK) internal test dataset consisted of 28,174 scans from 524 patients across 63 gene diagnoses. A further breakdown by dataset is available in Extended Data Table 3, and by gene in Supplementary Table 6. Retrospective images were selected by the clinical team at each of the five external centres according to the following requirements: (1) the patient had a confirmed genetic diagnosis in one of the 63 genes currently recognized by Eye2Gene; (2) the patient had retrospective retinal imaging available, either 55-degree FAF images or 30-degree OCT images obtained from the Heidelberg Spectralis as part of routine care; and (3) the retinal images were considered of good quality. No further requirements were given regarding the ethnicity or the sex of the cases. For each patient, a set of scans was selected, typically consisting of one scan per patient per eye per modality, along with the genetic diagnosis. These data were shared with us through our secure online portal, with the exception of the data from Bonn, where the Eye2Gene models were run locally. No preprocessing or standardization of the retinal images took place, but image quality computed using Retinograd-AI47 confirmed that the images were of comparable quality between centres and of slightly higher quality overall than those in the MEH test set, given that the latter were not explicitly selected by a clinician (Extended Data Table 3).
Model training
On each modality, five 63-class CoAtNets were trained, for 100 epochs (passes over the entire dataset) for FAF and IR and for 25 epochs for OCT, which was found to be sufficient for the validation accuracy to converge in most settings (Supplementary Fig. 18). Random initialization with a different random seed was used for each of the 15 networks to ensure ensemble diversity. A CoAtNet0 architecture from the keras-cv-attention-models PyPI library was used, where the final output layer was replaced by a dropout layer followed by a linear layer with 63 outputs and softmax normalization. The CoAtNet architecture was chosen on the basis of an initial comparison of a number of different architectures evaluated on the FAF dataset. Cross-entropy loss was used as the loss function, with additional class weighting inversely proportional to gene frequency in the dataset, where the labels were given by the gene diagnosis of the underlying patients. This helped address dataset imbalance due to the non-uniform gene distribution. For training, the Adam optimizer was used with the default parameters of the Keras library (β1 = 0.9, β2 = 0.999). The learning rate was set to 0.0001, as this was found to work well across a range of architectures, and the batch size was set to 16, the largest that would fit in graphics processing unit (GPU) memory. Dropout probability was fixed at 50%. Training was completed within 8 h per network on a single 3090 24 GB GPU.
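A minimal sketch of how such a network could be assembled and compiled is shown below. The input size, per-class counts and dataset objects are illustrative placeholders, and the keras-cv-attention-models argument convention (num_classes=0 to drop the default head) is an assumption rather than a verbatim reproduction of the Eye2Gene training code:

```python
import tensorflow as tf
from keras_cv_attention_models import coatnet  # PyPI: keras-cv-attention-models

NUM_GENES = 63

# CoAtNet0 backbone; num_classes=0 is assumed to drop the default classification
# head, and the 512x512 input size is illustrative only.
backbone = coatnet.CoAtNet0(input_shape=(512, 512, 3), num_classes=0)

features = backbone.output
if len(features.shape) == 4:
    # Pool spatial features if the backbone returns a feature map rather than a vector.
    features = tf.keras.layers.GlobalAveragePooling2D()(features)
x = tf.keras.layers.Dropout(0.5)(features)                        # 50% dropout
outputs = tf.keras.layers.Dense(NUM_GENES, activation="softmax")(x)
model = tf.keras.Model(backbone.input, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),       # Keras defaults: beta_1=0.9, beta_2=0.999
    loss="sparse_categorical_crossentropy",                       # cross-entropy over the 63 gene classes
    metrics=["accuracy", tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)],
)

# Class weights inversely proportional to gene frequency; gene_counts is a
# hypothetical dict mapping class index to number of training images.
gene_counts = {i: 100 for i in range(NUM_GENES)}                  # placeholder counts
total = sum(gene_counts.values())
class_weight = {i: total / (NUM_GENES * n) for i, n in gene_counts.items()}

# model.fit(train_ds, validation_data=val_ds, epochs=100,         # 25 epochs for OCT
#           class_weight=class_weight)                            # batch size 16 set when building train_ds
```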
To avoid overfitting to the training data, data augmentation was applied: a number of plausible image transformations were applied automatically to the input data during training (Supplementary Fig. 19).
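As an illustration only, an on-the-fly augmentation pipeline of this kind can be expressed with standard Keras preprocessing layers; the transform types and magnitudes below are assumptions, and the actual set of transformations used is shown in Supplementary Fig. 19:

```python
import tensorflow as tf

# Illustrative augmentation pipeline; not the exact Eye2Gene configuration.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomTranslation(0.05, 0.05),
    tf.keras.layers.RandomContrast(0.1),
])

# Applied to training batches only, e.g.:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```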
Evaluation
The output prediction of Eye2Gene is obtained by combining the predictions of its 15 constituent networks using an ensemble approach. For each retinal scan of a given modality, a five-model ensemble is applied by taking the simple arithmetic mean of the output probabilities per gene across the five constituent networks for that modality. Given a collection of retinal scans from a single patient, the ensemble corresponding to each scan’s modality is applied, and the predictions are then averaged in the same manner across all scans within each modality to give a per-modality prediction for the patient. Finally, the average across the three modalities is calculated and used as the final prediction for the patient. This approach was applied across all available scans per patient. Although there may be individual cases where it is better to down-weight or exclude certain images or modalities, overall we find that including more images per patient improves the overall accuracy (Supplementary Fig. 20). Additionally, we experimented with different modality weightings, performing a grid search in 0.1 increments from 0 to 1.0 on the development validation set (the weights do not need to sum to 1 as only their relative values affect the prediction). This improved validation top-five accuracy from 82.6% to 84.0%, with weights of 0.8, 0.1 and 0.5 for FAF, IR and OCT, respectively; however, applying this same weighting to the test data decreased top-five accuracy from 83.9% to 83.5%. With sufficient calibration data it may be possible to determine the optimal modality weighting more accurately; in the absence of other evidence, however, equal weighting provided a sufficiently good heuristic.
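The per-patient aggregation described above can be sketched as follows; the array shapes and the optional modality weights are illustrative assumptions:

```python
import numpy as np

def eye2gene_predict(per_scan_probs, modality_weights=None):
    """Combine per-scan softmax outputs into one per-patient gene prediction.

    per_scan_probs   : dict mapping modality ('FAF', 'IR', 'OCT') to an array of
                       shape (n_scans, n_models, 63) of softmax outputs from the
                       five constituent networks (shapes are illustrative).
    modality_weights : optional dict of relative modality weights; equal
                       weighting is the default behaviour described above.
    """
    per_modality = {}
    for modality, probs in per_scan_probs.items():
        if probs.size == 0:
            continue
        per_scan = probs.mean(axis=1)                    # five-model ensemble per scan
        per_modality[modality] = per_scan.mean(axis=0)   # average over scans of that modality

    if modality_weights is None:
        modality_weights = {m: 1.0 for m in per_modality}

    weighted = [modality_weights[m] * p for m, p in per_modality.items()]
    total_weight = sum(modality_weights[m] for m in per_modality)
    return np.sum(weighted, axis=0) / total_weight       # length-63 probability vector
```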
The model predictions were then compared against the underlying gene diagnoses for each patient to compute the overall accuracy of the model on the test data, the top-k accuracy (the proportion of images where the correct gene was within the highest k predictions of the network) for k = 2, 3, 5 and 10, and the average per-class F1, weighted F1, mean average precision (MAP) and AUROC (Supplementary Table 7). Accuracy was calculated by counting the number of times Eye2Gene’s top prediction matched the gene of the underlying patient. Per-gene precision-recall curves (for MAP) and ROC curves were produced for each gene in a one-versus-rest setup, using the Eye2Gene predictions for each output gene, and the areas under the respective curves were calculated using trapezoid estimation (Supplementary Fig. 21). Confidence intervals were obtained by bootstrapping over 10,000 resamplings and taking the 2.5th and 97.5th percentiles. For convenience, all predictions were compiled into a single CSV file along with additional data about each image (such as patient study ID, gene and appointment date) and the ID of the model used to generate the prediction. Eye2Gene combines predictions across multiple models (ensembling) and across multiple images acquired during one or more patient visits.
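A sketch of the top-k accuracy and bootstrap confidence-interval computations, assuming the compiled per-patient probabilities and integer gene labels are held in NumPy arrays:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """Proportion of cases whose true gene index is within the top-k predictions."""
    topk = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, topk)]))

def bootstrap_ci(probs, labels, metric, n_boot=10_000, seed=0):
    """95% bootstrap confidence interval (2.5th and 97.5th percentiles)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample cases with replacement
        stats.append(metric(probs[idx], labels[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Usage sketch: probs is an (n_cases, 63) array of compiled predictions and
# labels an (n_cases,) integer array of true gene indices.
# lo, hi = bootstrap_ci(probs, labels, lambda p, y: top_k_accuracy(p, y, k=5))
```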
Taking each network individually without ensembling, the mean per-network top-five accuracies per image were 68.9%, 70.8% and 74.9% for FAF, IR and OCT on our full validation dataset (which includes external sites). Applying the ensemble of five models per modality to individual images, we observed accuracies of 71.0%, 72.7% and 77.2% for FAF, IR and OCT. Combining individual model predictions across multiple images (without ensembling the five models per modality) at the per-patient level on the held-out test set, we observed an overall mean top-five accuracy across models of 81.5%. In general, we found that combining all images across all three modalities typically outperformed the best-performing single modality (that is, restricting to images of that modality only) on most genes, demonstrating the advantage of the multimodal approach. In both cases these results were superior to the single-network results but inferior to the overall Eye2Gene model (83.9%), suggesting that both ensembling across networks at the per-image level and ensembling predictions across multiple images were advantageous.
Conformal prediction
Conformal prediction constructs a set of candidate classes instead of a single class output. Crucially, this set is not fixed in size but is dynamically sized to reach a user-defined confidence threshold. This means that for ‘easy’ examples the prediction set may be very small (or even a single class), while more ambiguous examples are allowed larger prediction sets. By adjusting the confidence threshold, we can trade off between the proportion of instances in which the correct class is in the prediction set, defined as coverage, and the set size. Conformal prediction sets can be constructed for any classifier with probability outputs and are a useful tool for interpreting model output probabilities. The basic ‘naive’ conformal set construction algorithm is simple: it adds predicted classes to the prediction set, in decreasing order of predicted probability, until the desired confidence threshold is met; however, model outputs are often poorly calibrated in practice. Hence, various algorithms exist to calibrate conformal prediction sets (Supplementary Table 2).
We applied the least ambiguous adaptive prediction sets conformal prediction method, taking the per-class probability outputs from Eye2Gene and adding classes to the prediction set until the predetermined confidence threshold was exceeded. We selected least ambiguous adaptive prediction sets from among three methods, as it produced smaller average prediction sets for a given coverage value, which we report in Supplementary Fig. 9. We calibrated our conformal prediction confidence levels on the MEH model validation set (n = 677; not seen by the model during training), taking the compiled predictions for each patient.
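A minimal sketch of the split-conformal calibration pattern underlying such methods, using the simple true-class-probability score for illustration (the non-conformity score of the deployed method may differ):

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal calibration on a held-out set: find a score threshold such
    that roughly (1 - alpha) of calibration patients are covered.

    cal_probs  : (n_patients, 63) compiled probabilities on the calibration set.
    cal_labels : (n_patients,) integer indices of the true genes.
    """
    # Non-conformity score: one minus the probability assigned to the true gene.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_set(probs, q_hat):
    """All genes whose non-conformity score falls below the calibrated threshold."""
    return np.where(1.0 - probs <= q_hat)[0]
```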
Evaluating phenotype-driven genetic variant prioritization
Clinical notes and retinal scans of 130 patients with IRD from MEH with a confirmed gene diagnosis, who were part of the Eye2Gene test set, were manually reviewed, and HPO terms were identified by three ophthalmologists with expertise in IRDs, as described in Cipriani et al.21. These HPO terms were used as input to the latest version of the Exomiser-hiPHIVE algorithm (v.14.0.0) (https://github.com/exomiser/Exomiser) to obtain a ranking of candidate genes. The Exomiser-hiPHIVE algorithm computes a gene-specific phenotype score, based on the PhenoDigm algorithm, between the patient’s phenotype encoded as a set of HPO terms and the phenotypic annotations of known gene-associated phenotypes reported in disease databases covering humans and model organisms such as mouse and zebrafish. The retinal scans of these 130 patients with IRD were also analysed using Eye2Gene to obtain gene predictions, which were ranked accordingly for direct comparison with the Exomiser-hiPHIVE ranking. The non-parametric Wilcoxon rank-sum test was used to compare the Exomiser-hiPHIVE and Eye2Gene gene rankings. Note that Exomiser also takes a genetic variant file (VCF) as input, which gives it an advantage over Eye2Gene by reducing the set of candidate genes to those containing variants that may be considered pathogenic.
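The rank comparison could be reproduced along these lines, where the rank lists are hypothetical placeholders for the per-patient rank of the true gene (1 = top):

```python
from scipy.stats import ranksums

# Hypothetical per-patient ranks of the true gene (1 = top); illustrative values only.
eye2gene_ranks = [1, 1, 2, 5, 1, 3]
exomiser_ranks = [1, 3, 4, 2, 6, 8]

stat, p_value = ranksums(eye2gene_ranks, exomiser_ranks)
print(f"Wilcoxon rank-sum statistic = {stat:.3f}, p = {p_value:.4f}")
```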
Visualization and clustering of model embeddings
Visualizing and clustering model embeddings are valuable data-driven approaches for evaluating class diversity and identifying similarity between classes, even for new classes that the model has not been trained on. We applied one of the Eye2Gene FAF networks to all FAF scans in our test dataset, and the activations of the penultimate hidden layer were extracted to give a 768-dimensional vector for each scan. The UMAP dimensionality reduction algorithm was applied to the extracted activations to obtain two-dimensional embeddings. The embeddings in UMAP space were then clustered using hierarchical clustering with Ward linkage to produce a gene-grouping dendrogram. By visualizing the UMAP-projected embeddings of the retinal images obtained from the different centres, we were able to show that no centre clustered separately, and hence the images are unlikely to be systematically different (Supplementary Fig. 22). Using the embeddings, we were also able to derive an approach inspired by prototype-based methods that matches the most similar images in embedding space according to cosine similarity (Supplementary Fig. 8).
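The embedding analysis could be sketched as follows; the arrays are random placeholders, and clustering the per-gene mean UMAP position is one plausible route to a gene dendrogram rather than the exact clustering input used for the published figure:

```python
import numpy as np
import umap                                            # PyPI: umap-learn
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.metrics.pairwise import cosine_similarity

# embeddings: penultimate-layer activations (n_scans, 768); gene_labels: gene of
# each scan. Both are random placeholders here.
embeddings = np.random.rand(200, 768)
gene_labels = np.random.choice(["ABCA4", "USH2A", "RPGR"], size=200)

# Two-dimensional UMAP projection of the 768-dimensional activations.
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)

# Hierarchical clustering (Ward linkage) of per-gene mean positions in UMAP space.
gene_names = np.unique(gene_labels)
gene_means = np.stack([coords[gene_labels == g].mean(axis=0) for g in gene_names])
dendrogram(linkage(gene_means, method="ward"), labels=list(gene_names))

# Prototype-style retrieval: most similar scans in embedding space by cosine similarity.
sims = cosine_similarity(embeddings[:1], embeddings)[0]
nearest = np.argsort(sims)[::-1][1:6]                  # five nearest scans to scan 0
```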
Eye2Gene screening component
For the screening component of Eye2Gene, a neural network was trained to distinguish FAF images of patients with IRD from those of patients without IRD. Hyperparameters, network architecture and training settings were the same as for the main Eye2Gene FAF module, except that no ensembling across models or images was used. For the non-IRD group, a number of conditions with a presentation similar to IRDs on FAF imaging were selected: acute zonal occult outer retinopathy, birdshot uveitis, central serous retinopathy, geographic atrophy and posterior uveitis. Patients with these conditions were extracted from the MEH hospital database and processed in the same manner as the IRD data (n = 2,292). Patients with IRD, selected before filtering for genes with more than one case, were included (n = 3,315) (Supplementary Fig. 15). For evaluation, a held-out set of 20% of patients was used. For plotting the ROC curve and calculating the area under it, the network outputs were treated as a binary classification by taking the output probability of the IRD class as the predictive probability.
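The binary evaluation could be computed as below; the probability and label arrays are illustrative placeholders:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# probs_ird: predicted probability of the IRD class per held-out case;
# y_true: 1 for IRD, 0 for non-IRD. Illustrative values only.
probs_ird = np.array([0.9, 0.2, 0.75, 0.4, 0.95, 0.1])
y_true = np.array([1, 0, 1, 0, 1, 0])

fpr, tpr, _ = roc_curve(y_true, probs_ird)
print(f"AUC = {auc(fpr, tpr):.3f}")
```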
Interpretability of Eye2Gene image classifications
A well-known limitation of deep learning models is that their interpretability remains challenging, and existing approaches such as gradient-based saliency maps are not yet sufficiently reliable for medical decisions. We leveraged the fact that the CoAtNet architecture uses self-attention to extract attention maps from one of the constituent Eye2Gene networks on a selection of FAF images. These attention maps show the areas of the image with the highest attention weights under the network’s self-attention mechanism, and hence the areas that are most strongly incorporated into the network’s final prediction (Supplementary Fig. 7). These maps were promising, consistently attending to areas of pathology that are likely to be indicative of a particular condition, according to our evaluation by human experts (Supplementary Fig. 22).
Human benchmarking
To contextualize the performance of Eye2Gene relative to human experts, a challenge dataset of 50 FAF images from patients sampled from the MEH held-out set was created, covering 36 unique genes with no more than two images of any given gene. FAF was selected since it is one of the imaging modalities most commonly used in IRD clinics and hence the one with which ophthalmologists should overall have the most experience in assessing IRDs. We asked eight ophthalmologists from Moorfields (M.M., A.R.W., O.A.M.), Bonn (F.G.H., P.H., B.L.), Oxford (S.R.D.S.) and Liverpool (S.M.), specializing in IRDs and with 5–15 years of experience, to predict the causative gene based only on the images provided. These eight ophthalmologists were selected on the basis of being board-certified specialists in ophthalmic genetics who run dedicated IRD clinics at their respective hospitals and who would each have reviewed retinal images from, on average, hundreds of patients per year. For each image, each ophthalmologist was asked to name the five genes they thought were most likely out of a list of 36 genes. Eye2Gene was then run on the same images, taking the top-five predictions from the full set of 63 genes that Eye2Gene was trained on, and these were compared to the clinicians’ predictions.
Ethics
This research was approved by the Institutional Review Board and the UK Health Research Authority (HRA) Research Ethics Committee (REC), reference 22/WA/0049, ‘Eye2Gene: accelerating the diagnosis of inherited retinal diseases’, Integrated Research Application System project ID 242050. The study sponsor was the University College London Joint Research Office (UCL JRO). The UCL JRO Data Protection reference number is Z6364106/2021/11/67. A summary of the research study can be found on the HRA website (https://www.hra.nhs.uk/planning-and-improving-research/application-summaries/research-summaries/eye2gene-10/). The REC that approved this study is Wales REC 5 (Wales.REC5@Wales.nhs.uk). All research adhered to the tenets of the Declaration of Helsinki.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.