Data

We designed the data collection to parallel the decision time points in a clinical workflow for FETs. Typically, an embryo is thawed at 0 h, assessed for initial survival, and transferred at 2 h or 3 h post-thaw. However, its sustained survival after this time point cannot be evaluated in vivo because it has been transferred; only the pregnancy outcome can be evaluated. In our design, we maintained the thaw and assessment times of a clinical workflow and determined sustained embryo survival in culture at 25 h following the freeze–thaw process. All embryos were cultured in Global media + 10% LGPS (Life Global Protein Supplement), which is the standard for our center. Survival at 25 h post-thaw was chosen because embryos not meeting criteria to freeze or transfer on day 5 and 6 are cultured an extra day in our center.

Our primary outcome was the survival of the embryo post-thaw at 25 h. In this study, survival at 25 h post-thaw was defined as re-expansion of the blastocoel cavity, minimal cell lysis, and pulsing with progressively larger blastocoel volume. Thawed embryos that did not re-expand, exhibited progressively smaller volumes while pulsing, or had lysis of more than 50% of cells at 25 h post-thaw were defined as having failed to survive.

The dataset consisted of 652 time-lapse videos of post-thaw blastocysts collected at the Center for Reproductive Health at the University of California, San Francisco between January 2019 and May 2020 from 119 patients who volunteered to donate their embryos. Study embryos had been biopsied and collapsed prior to cryopreservation, and consisted of chromosomally abnormal embryos donated to research. These disposition decisions were made in writing by patients prior to cryopreservation and genetic testing. Images were captured using the Embryoscope time-lapse system (Vitrolife, Sweden). Each video was recorded starting at embryo thaw, with an interval between frames of 0.1 to 0.2 h. An embryologist reviewed the complete time-lapse video of each embryo and annotated it with the binary survival outcome: label 1 represented survival, and label 0 represented failure to survive. The embryologist evaluated the continuous development of the embryo and was blinded to the predictions of the algorithm and of the other participating embryologists. There were no censored observations in the dataset.

The dataset was split into training, validation, and test sets. The training set was used to optimize models, the validation set was used to compare and choose models, and the test set was used to evaluate the chosen models. Embryologists were evaluated using the test set so that both the algorithm and the embryologists were evaluated on the same data. The test and validation sets were sampled first, such that each contained an approximately equal number of embryos that survived and embryos that did not. The remaining videos were included in the training set. No patient contributed videos to more than one set.
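A split with these properties can be sketched in Python as follows. This is a minimal illustration only: the record format `(patient_id, video_id, label)`, the helper name `split_by_patient`, the per-class target `n_per_class`, and the greedy assignment of whole patients are all assumptions introduced here, not the procedure used in the study.

```python
import random
from collections import defaultdict


def split_by_patient(records, n_per_class, seed=0):
    """Greedy patient-level split (hypothetical helper, for illustration).

    records: list of (patient_id, video_id, label) tuples, label in {0, 1}.
    n_per_class: maximum number of videos per label in validation and in test.
    Returns a dict with 'train', 'val', and 'test' lists of records.
    """
    # Group videos by patient so that a patient is never divided across sets.
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec[0]].append(rec)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    splits = {"train": [], "val": [], "test": []}
    counts = {"val": [0, 0], "test": [0, 0]}  # videos per label in each held-out set

    for pid in patients:
        videos = by_patient[pid]
        n = [sum(1 for r in videos if r[2] == c) for c in (0, 1)]
        target = "train"  # default: everything not held out goes to training
        for name in ("test", "val"):
            # Accept the whole patient only if neither label exceeds its target.
            if all(counts[name][c] + n[c] <= n_per_class for c in (0, 1)):
                target = name
                counts[name][0] += n[0]
                counts[name][1] += n[1]
                break
        splits[target].extend(videos)
    return splits
```

Because whole patients are assigned at once, the held-out sets are only approximately balanced, which matches the "approximately equal" wording above.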

In this work, we focus on a proof-of-concept: developing a deep learning algorithm and combining its predictions with those of the experts. Saliency maps, such as those enabled by class activation mapping, are possible future work that could enable characterization of model attention. This has not been the focus of this work, and as such, we are unable to evaluate which parts of the embryo image the model uses to make recommendations. We further note that recently shown limitations of saliency maps call into question their trustworthiness as a decision aid11,12.

All methods were carried out in accordance with relevant guidelines and regulations. The study was approved by the UCSF institutional review board (IRB). Informed consent was obtained from all subjects for the use of their tissues in this study.

Model development

The task for the deep learning algorithm was to predict embryo survival to 25 h using images of the embryo within the first 3 h. A deep learning model was trained to output the probability of survival to 25 h from a single image of an embryo, with images taken at 0.5-h intervals post-thaw up to 3 h. In this study, we used a convolutional neural network, a type of deep learning model designed to handle image data. Convolutional neural networks scan over an image to learn features from local structure and aggregate the local features to make a prediction on the full image. The parameters of each network were initialized with parameters from a network pretrained on ImageNet. The final fully connected layer of the pretrained network was replaced with a new fully connected layer producing a 2-dimensional output, after which the softmax function was applied. The softmax function is an activation function that maps a real-valued vector to a probability for each potential outcome, whereby larger input values correspond to larger probabilities. In this instance, it is used to obtain the predicted probabilities of survival and of failure to survive.
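For reference, the two-class softmax can be written as follows, where $z_0$ and $z_1$ denote the two outputs of the new fully connected layer (these symbols are notation introduced here for illustration):

```latex
p_k = \frac{\exp(z_k)}{\exp(z_0) + \exp(z_1)}, \qquad k \in \{0, 1\}, \qquad p_0 + p_1 = 1 .
```

Assuming the output index follows the label coding (1 for survival, 0 for failure to survive), $p_1$ is the predicted probability of survival to 25 h.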

Models were trained, validated, and tested on frames from the same post-thaw time point. Before being input into the network, the images were resized to 224 pixels by 224 pixels and normalized based on the mean and standard deviation (SD) of images in the ImageNet training set. Because the outcome is invariant to flipping and rotation of the input, data augmentation was applied to each image in the training set before it was fed into the model: a random horizontal flip and a random vertical flip, each applied with 50% probability, and a random rotation between 0 and 360 degrees.

All model variants were optimized on the binary cross-entropy loss using stochastic gradient descent with momentum to update the model’s parameters13. The learning rate was tuned for each model and strategy by selecting, from 5e−4, 1e−3, and 2e−3, the value that gave the lowest loss on the validation set. The optimal learning rate across all model architectures was 1e−3. For each model and training strategy, the momentum parameter was set to 0.9 and the dampening was set to 0. The momentum parameter is used to aggregate gradients across iterations to facilitate convergence; in this instance, each update direction is 0.9 times the previously accumulated direction plus the current gradient, which is not down-weighted because the dampening is 0. After every epoch (one pass over the training dataset), the model was evaluated on the validation set, and the checkpoint with the lowest validation loss was kept. To combat overfitting, L2 weight decay of 1e−4 was applied to all trainable parameters. This shrinks larger parameters towards (but not exactly to) zero and reduces the chance of overfitting.
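Assuming the update convention of PyTorch's `torch.optim.SGD` (the text names PyTorch but not the optimizer class, so this is an assumption), one parameter update can be written as below. The symbols are notation introduced here: $\theta$ for the parameters, $\mathcal{L}$ for the binary cross-entropy loss on a batch, $\eta$ for the learning rate, $\mu$ for the momentum, $\tau$ for the dampening, and $\lambda$ for the weight decay.

```latex
\begin{aligned}
g_t      &= \nabla_\theta \mathcal{L}(\theta_{t-1}) + \lambda\,\theta_{t-1}, \\
v_t      &= \mu\, v_{t-1} + (1 - \tau)\, g_t, \qquad v_0 = 0, \\
\theta_t &= \theta_{t-1} - \eta\, v_t,
\end{aligned}
\qquad \mu = 0.9,\ \ \tau = 0,\ \ \lambda = 10^{-4},\ \ \eta = 10^{-3}.
```

Under this convention the weight decay enters through the gradient term $\lambda\,\theta_{t-1}$, which is equivalent to adding $\tfrac{\lambda}{2}\lVert\theta\rVert^2$ to the loss.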

Models were trained using four convolutional neural network architectures that are widely used for medical image classification: ResNet18, ResNet34, ResNet50, and DenseNet12114,15,16,17. A non-proprietary model ensemble, called EmbryoNeXt, was created by averaging the rank predictions of models trained with these different architectures, a common methodology in machine learning. All models were trained on NVIDIA GeForce GTX 1070 GPUs using the PyTorch library v1.4.0 with a batch size of 16 examples.

Embryologist benchmarks and augmented embryologist

We compared the performance of EmbryoNeXt to that of expert embryologists. Four embryologists, two junior (2 years of experience) and two senior (8+ years of experience), graded the test set on a scale of 1 to 10 (with 0.5 increments) at times 2 h and 3 h. The 1–10 score was a composite based on cell survival (0–3), expansion (0–3), cohesion (0–2), and cellularity (0–2), weighted to favor expansion and cell survival. Each embryologist graded the likelihood of survival, examining all frames at 2 h first and then all frames at 3 h independently. They were blinded to the true survival outcomes, clinical histories, and patient identifiers.

We also combined the predictions of EmbryoNeXt with the predictions of each embryologist. In this setup, the embryologist was ‘augmented’ by averaging the rank predictions of EmbryoNeXt with the rank of the embryologist (converted from the composite score) on a particular example, relative to other examples in the test set. Rank predictions were used rather than the absolute probability outputs or embryologist scores since both model outputs and embryologist outputs may not be calibrated. A schematic of the setup, model training, and inference is detailed in Fig. 3.
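The rank-averaging step can be sketched in Python as follows. The helper names and the example values are hypothetical, and assigning tied scores their mean rank is an assumption made here (ties are likely with scores in 0.5 increments); the text does not state how ties were handled.

```python
def average_ranks(scores):
    """Rank scores in ascending order (1 = lowest); ties share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of scores tied with position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_average(*score_lists):
    """Average the per-example ranks produced by several scorers."""
    ranked = [average_ranks(s) for s in score_lists]
    return [sum(r) / len(r) for r in zip(*ranked)]


# Made-up values for four test-set embryos (not study data).
model_probs = [0.91, 0.15, 0.60, 0.40]      # model probability of survival
embryologist_scores = [8.5, 3.0, 3.0, 7.0]  # composite 1-10 score
augmented = rank_average(model_probs, embryologist_scores)
print(augmented)  # [4.0, 1.25, 2.25, 2.5]
```

Passing the outputs of several models to the same `rank_average` helper would give an ensemble in the same spirit as the one described for EmbryoNeXt.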

Figure 3 Schematic of the task, model training, and inference. (a) The task was to predict blastocyst survival at 25 h using an image of the blastocyst up to 3 h post-thaw. (b) At training time, frames at different time points up to 3 h were extracted from the video, and models were trained on videos of both surviving and non-surviving embryos. (c) At inference time, the predictions of the model and embryologist were converted to prediction ranks, which were combined to produce an augmented embryologist rank output.

Statistical analysis

We compared the diagnostic performance of the models and the embryologists using the area under the receiver operating characteristic curve (AUC). To assess whether augmentation significantly changed performance, we computed the difference in performance on the test set with and without augmentation. The nonparametric bootstrap was used to estimate the variability around each of the performance measures; 1000 bootstrap replicates were drawn from the test set, and each performance measure and difference was calculated for the model and the embryologist on these same 1000 bootstrap replicates. This produced a distribution for each estimate, and the 95% bootstrap percentile intervals were reported to assess significance at the p = 0.05 level. Model training and statistical analysis were performed using Python 3.
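The paired bootstrap for an AUC difference can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's analysis code: the function names are hypothetical, scores are taken as precomputed per example, replicates that contain only one class are redrawn, and the percentiles are approximated by order statistics.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """95% percentile interval for AUC(a) - AUC(b) on shared bootstrap replicates."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        # Resample examples with replacement; both scorers use the same indices.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # assumption: AUC is undefined without both classes, so redraw
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper
```

With `scores_a` as the augmented ranks and `scores_b` as the unaugmented embryologist scores, an interval that excludes 0 would correspond to a significant change at the level described above.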