Improving generalization capability of deep learning-based nuclei instance segmentation by non-deterministic train time and deterministic test time stain normalization

With the advent of digital pathology and microscopic systems that can scan and save whole slide histological images automatically, there is a growing trend to use computerized methods to analyze acquired images. Among different histopathological image analysis tasks, nuclei instance segmentation plays a fundamental role in a wide range of clinical and research applications. While many semi- and fully-automatic computerized methods have been proposed for nuclei instance segmentation, deep learning (DL)-based approaches have been shown to deliver the best performances. However, the performance of such approaches usually degrades when tested on unseen datasets. In this work, we propose a novel method to improve the generalization capability of a DL-based automatic segmentation approach. Besides utilizing one of the state-of-the-art DL-based models as a baseline, our method incorporates non-deterministic train time and deterministic test time stain normalization, and ensembling to boost the segmentation performance. We trained the model with one single training set and evaluated its segmentation performance on seven test datasets. Our results show that the proposed method provides up to 4.9%, 5.4%, and 5.9% better average performance in segmenting nuclei based on Dice score, aggregated Jaccard index, and panoptic quality score, respectively, compared to the baseline segmentation model.


Introduction
The evaluation of histopathological images by experts remains an integral part of the diagnostic routine of many human diseases [15]. An essential element of this process is the inspection of the appearance, morphology, and density of cells, which is subsequently used, for example, to diagnose different types of cancer or to assess the progression of certain diseases [4,8,34]. Another important aspect in this context is the shape and structure of the cell nuclei [43]. Nuclear segmentation, the task of finding all individual nuclei in digitized histological images, is therefore a key component of many automated pathology frameworks, as it enables the subsequent extraction of important information, including cell counts or quantities related to the shape and structure of the nuclei [36]. Since automated pathology frameworks can improve diagnostic accuracy, reduce evaluation time, and create more efficient workflows, there has been an increased effort to automate nuclei segmentation with the goal of achieving more robust and objective segmentation results [23,30,56]. Many computerized methods have been proposed for unsupervised, semi-supervised, and fully-supervised nuclei segmentation, ranging from standard image processing techniques to advanced machine learning (ML) and deep learning (DL) algorithms [14,35]. Among these methods, supervised DL algorithms, such as convolutional neural networks (CNNs), have achieved the best performance [1,14]. Although there are many different nuclei instance segmentation algorithms based on CNNs, the state-of-the-art strategies can be broadly grouped into three categories: detection-based methods such as adapted or improved Mask R-CNN [3,17], ternary segmentation models such as deep contour-aware networks (DCAN) or the method of Kumar et al. [6,22], and distance-based methods such as Hover-Net [11], two-stage U-Net algorithms [33], or the dual decoder U-Net-based (DDU-Net) model [31]. Additionally, these approaches can be used jointly to boost the segmentation performance [3,53]. Furthermore, considering the recent advances of large language models, vision transformer-based architectures have also been utilized in encoder-decoder-based models for various histological image analysis tasks, including nuclei instance segmentation [42,57]. Comprehensive overviews of state-of-the-art methodologies for nuclei segmentation can be found in the respective studies [4,14,35,57].
Despite the fact that supervised CNNs achieve excellent nuclei segmentation performance on individual datasets, their performance usually diminishes on external or unseen test data because histological images are usually acquired under different settings [41,44]. Specifically, there are many sources of variation in the appearance of cells acquired in different labs, including color variations caused by minor stain deviations, different organs, or different image acquisition devices. All these variations in image acquisition settings pose a challenge known as domain shift, i.e., the difference between the source domain (the training data) and the target domain (the unseen test data). This challenge arises because the model, during training, learns to recognize patterns and features that are specific to the training domain; when faced with test data from a different domain, the model may encounter unfamiliar variations and struggle to generalize effectively [16,46]. Thus, creating an algorithm that performs well on all datasets is a challenging task. Although many efforts have been made to overcome these obstacles and improve the generalization power of DL-based methods, there is still a lack of robust algorithms with acceptable nuclei segmentation performance on unseen test datasets [11,28].
Normalization-based and augmentation-based approaches are the most widely applied methods to improve generalization [38,44].
Normalization algorithms focus on reducing the variability of input images from different sources. They often match certain properties of the target/input image to those of a reference image [38,44]. Various methods have been proposed to perform normalization in histological images. These include classical image processing techniques such as histogram matching [39], stain-based normalization such as the Macenko et al. method [27] or the Vahadane et al. method [45], and neural network-based approaches such as cycle generative adversarial networks (CGAN) or HistoStarGAN [46,59]. Classical techniques such as Reinhard et al. [39] were not originally designed for computational pathology and can introduce undesirable artifacts in histological images [19]. Stain-based normalization methods were designed specifically for matching the source and target domains in histological images, and they have been shown to keep the structural details while altering the stain matrix of source images [16]. While the application of these methods has led to improved image analysis performance in some studies [18,41], a number of other studies have shown adverse effects or no effects of these techniques on classification or segmentation performance in Hematoxylin & Eosin (H&E)-stained histological images, especially when combined with ML-based or DL-based approaches [2,44,50,52]. GAN-based approaches have shown excellent performance in domain translation where there is a large gap between the source and target domains, e.g., changing the staining type from H&E to immunohistochemistry [59]. Although such approaches generally generate visually plausible image-to-image translations, they are still prone to creating hallucinated image features, and they are quite sensitive to the model architecture and training procedure. Thus, their application to some tasks, including specific image segmentation tasks in histological images, has been shown to be limited [16,47].
Augmentation, on the other hand, exploits various geometrical or mathematical color transformations in order to introduce more variety during training. Common techniques are rotations, mirroring, scaling, elastic deformations, or adding small perturbations to the channels in different color spaces [16,44]. Many widely recognized augmentation techniques have their roots in the domains of natural image classification or segmentation. However, they have demonstrated promising applications in medical image analysis, including histological image segmentation [16].
Numerous valuable efforts have been dedicated to enhancing generalization in computational pathology. However, some studies have faced constraints due to the use of limited test sets or a restricted number of tissues/organs for assessing generalization capabilities [5,13,24,26,46]. Additionally, certain investigations have simultaneously addressed multiple tasks, such as classification and segmentation [25], while others have focused on proposing solutions for distinct tasks, such as virtual stain transfer in histological images [46,47,55].
In this work, we explicitly focus on improving the generalization power for the nuclei instance segmentation task. We proposed and developed a hybrid approach by combining normalization and augmentation techniques in the nuclei instance segmentation workflow. We use DDU-Net [31], one of the state-of-the-art DL-based nuclei instance segmentation models, as the baseline. For training, we incorporate non-deterministic stain normalization based on the Macenko et al. method. Instead of using a single reference image and performing normalization as an offline pre-processing step, we select multiple reference image candidates from different organs. We normalize input images randomly as an online augmentation during the training phase. In the inference phase, besides using morphological test time augmentation, we also apply a deterministic test time stain normalization strategy followed by an ensembling step to create the final segmentation output. In our experiments, we used one single training set (including images from seven distinct organs/tissues) and evaluated the performance on seven test sets (including images from 40 organs/tissues). Our results on all test datasets show that the proposed method improves nuclei instance segmentation performance compared to the baseline model.
The main contributions can be summarized as follows:
• Integrating a novel non-deterministic stain normalization in the training procedure of DDU-Net.

Datasets
In our study, we used one single dataset for training and multiple datasets for testing to evaluate the generalization capability of our proposed approach (i.e., one single trained model was tested on multiple unseen images from various test datasets). The training was done on the training set from the MoNuSeg challenge [23], as it encompasses a large variety of nuclei from different organs and has been widely used as a benchmark dataset in previous studies [11,31,58]. It includes a total of 21,623 nuclei, found in 30 H&E-stained image patches (1000 × 1000 pixels) extracted from whole slide images from The Cancer Genome Atlas (TCGA) repository. This dataset includes images from seven organs, namely liver (6 images), breast (6 images), kidney (6 images), bladder (2 images), prostate (6 images), stomach (2 images), and colon (2 images). For testing, we utilized seven test datasets: the MoNuSeg test data [23], TNBC [37], CryoNuSeg [30], CPM-15 [51], CPM-17 [51], CoNSeP [11], and NuInsSeg [29], all of which include image patches with H&E staining. Further details on the attributes of each dataset can be found in Table 1.

Segmentation model
This work is based on DDU-Net [31] as the baseline nuclei instance segmentation model. This model has shown excellent performance in the nuclear segmentation task on various datasets (e.g., it achieved the first rank on the MoNuSAC post-challenge leaderboard for the multi-organ nuclear segmentation and classification challenge [9,49]). It uses a U-Net-like encoder-decoder structure [40], such that the images are fed into a shared encoder path, whose intermediate results are subsequently passed to two different decoder branches. These decoders are designed to predict nuclear pixels (first decoder) and distance maps (second decoder) of all instances in a given image. The shared encoder consists of convolutional, drop-out, and max pooling layers, while the decoders consist of convolutional, drop-out, and transpose convolutional layers. The only difference between the two decoders is the last layer, where a sigmoid activation function and a linear activation function are used for the first and second decoders, respectively. For the first decoder, a combined Dice loss and binary cross-entropy loss was used, and for the second decoder, a mean square error loss function was utilized. The general workflow of the utilized DDU-Net is shown in Fig. 1.
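The two decoder losses described above can be sketched in NumPy as follows. This is only an illustration: the text does not state how the Dice and binary cross-entropy terms are combined, so the unweighted sum, the smoothing constants, and the function names `soft_dice_loss`, `bce_loss`, `mask_loss`, and `distance_loss` are assumptions rather than the DDU-Net implementation.

```python
import numpy as np

def soft_dice_loss(y_true, y_pred, eps=1e-6):
    """1 - soft Dice coefficient; eps (assumed) avoids division by zero."""
    intersection = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy on sigmoid outputs, clipped for stability."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def mask_loss(y_true, y_pred):
    """First decoder (nuclear pixels): Dice + BCE, summed without weights (assumption)."""
    return soft_dice_loss(y_true, y_pred) + bce_loss(y_true, y_pred)

def distance_loss(d_true, d_pred):
    """Second decoder (distance maps): mean square error."""
    return float(np.mean((d_true - d_pred) ** 2))
```

In a training framework these functions would be written with differentiable tensor operations; the NumPy version only makes the arithmetic explicit.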
The results from the decoders are post-processed using a Gaussian smoothing filter, a watershed algorithm, and morphological operations to form the final instance segmentation results. Further details about the model architecture and workflow can be found in the respective study [31].
Fig. 1. The generic workflow of the DDU-Net [31] for nuclei instance segmentation.

Non-deterministic stain normalization in training
In this work, we made use of the normalization algorithm introduced by Macenko et al. [27], which has been widely applied in the literature to normalize H&E-stained histological images [10,12]. This method is based on deconvolving images to retrieve the pure hematoxylin (i.e., nuclei) and eosin (i.e., other cell parts) components. After applying the negative logarithm to convert the images into the optical density (OD) space, it is possible to model the stains with the following equation:
$$\mathrm{OD} = V S$$
where S and V are matrices containing the stain saturations (S) and the stain vectors (V). The stain vectors, however, are not known and are hence approximated. The Macenko et al. method [27] achieves this by utilizing singular value decomposition. Afterward, the saturations of the single stains can be calculated using the formula above. While this has been shown to preserve histological information [38], the results depend on a careful selection of the reference image, a step that is usually performed manually.
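As an illustration of the decomposition step, the sketch below converts RGB pixels to OD, estimates two stain vectors with a singular value decomposition, and solves for the saturations by least squares. It assumes the usual Beer-Lambert form in which the OD matrix is the product of a 3 × 2 stain matrix and a 2 × N saturation matrix. The thresholds `beta` and `alpha`, the background intensity `io`, the red-channel heuristic for ordering the two stains, and all function names are assumptions for illustration, not values taken from this work.

```python
import numpy as np

def rgb_to_od(rgb, io=255.0):
    """Convert an (H, W, 3) RGB image to optical density, shape (3, N)."""
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    # +1 avoids log(0); a white pixel (255) maps to OD = 0
    return -np.log10((pixels + 1.0) / (io + 1.0)).T

def estimate_stain_vectors(od, beta=0.15, alpha=1.0):
    """Estimate a 3x2 stain matrix V from OD values of shape (3, N)."""
    # keep only pixels that are sufficiently stained in every channel
    stained = od[:, np.all(od > beta, axis=0)]
    # the two leading singular directions span the plane of the stains
    u, _, _ = np.linalg.svd(stained, full_matrices=False)
    plane = u[:, :2]
    if plane[:, 0].sum() < 0:          # fix the sign ambiguity of the SVD
        plane[:, 0] = -plane[:, 0]
    coords = plane.T @ stained         # project the pixels onto the plane
    angles = np.arctan2(coords[1], coords[0])
    lo, hi = np.percentile(angles, [alpha, 100.0 - alpha])
    # robust extreme directions, mapped back to OD space (unit length)
    v = np.stack([plane @ np.array([np.cos(a), np.sin(a)]) for a in (lo, hi)], axis=1)
    if v[0, 0] < v[0, 1]:              # heuristic: larger red OD first (hematoxylin)
        v = v[:, ::-1]
    return v

def stain_saturations(v, od):
    """Solve OD = V S for S in the least-squares sense; returns shape (2, N)."""
    s, *_ = np.linalg.lstsq(v, od, rcond=None)
    return s
```

A full normalization would additionally rescale the saturations and recombine them with the stain vectors of a reference image; that step is omitted here.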
We introduced a new strategy for randomly applying the Macenko et al. method during training to improve the generalization capability, paired with an automated reference image selection algorithm, reducing the impact of reference image selection and creating more diversity in the training data.
To select the reference images automatically, we used the MoNuSeg training data histograms. We calculated the mean intensity value for nuclei regions and background regions using the provided binary segmentation masks for the MoNuSeg training data. We sorted all images by the absolute difference between the mean tissue intensity and the mean nuclear intensity and selected the images with the largest differences between the means (one per organ). Table 2 shows the differences for all images in the training set (selected reference images for each organ are shown in bold). For visualization, examples of selected and non-selected reference images are shown in Fig. 2. This approach allowed us to automatically select images such that the colors in the tissue are different from the ones found in the nuclei, ensuring a good contrast of the nuclear boundaries and high visibility of the nuclei. We selected seven reference images (one per organ) using the described technique. The chosen reference images from the MoNuSeg training data are shown in Fig. 3.
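The selection rule can be sketched as follows. The code assumes that intensity is the per-pixel mean over the color channels and that every mask contains both nuclei and background pixels; the names `contrast_score` and `select_references` and the `(name, organ, image, mask)` tuple layout are hypothetical.

```python
import numpy as np

def contrast_score(image, nuclei_mask):
    """Absolute difference between mean tissue and mean nuclear intensity."""
    # assumption: intensity is the channel-wise mean of an (H, W, 3) image
    gray = image.mean(axis=-1) if image.ndim == 3 else image.astype(np.float64)
    mask = nuclei_mask.astype(bool)
    return abs(float(gray[~mask].mean()) - float(gray[mask].mean()))

def select_references(samples):
    """Pick, per organ, the image with the largest contrast score.

    samples: iterable of (name, organ, image, nuclei_mask) tuples.
    Returns a dict mapping organ -> name of the selected reference image.
    """
    best = {}
    for name, organ, image, mask in samples:
        score = contrast_score(image, mask)
        if organ not in best or score > best[organ][1]:
            best[organ] = (name, score)
    return {organ: name for organ, (name, _) in best.items()}
```

Keeping the two highest-scoring images per organ instead of one would give the 14-image variant examined later in the paper.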
We applied non-deterministic stain normalization as an online augmentation based on the selected reference images. For a given input image in the training phase, we either sent the unmodified image to the model (probability of 50%) or sent it to the normalization pipeline (probability of 50%). In the latter case, the image was normalized against one of the reference images based on the Macenko et al. method, introducing another non-deterministic component to the workflow. Each of the seven reference images was chosen with equal probability (i.e., 50%/7 ≈ 7.14% for each normalization path). This step is depicted for a sample input image in Fig. 4.
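A minimal sketch of this sampling scheme is given below. The `normalize` argument stands for any stain normalization routine taking an image and a reference (for example a Macenko-style implementation); it and the function name `maybe_normalize` are placeholders rather than parts of the original code.

```python
import random

def maybe_normalize(image, references, normalize, rng=random, p_original=0.5):
    """Return the image unchanged with probability p_original; otherwise
    normalize it against one reference drawn uniformly at random."""
    if rng.random() < p_original:
        return image
    reference = rng.choice(references)
    return normalize(image, reference)
```

With seven references and `p_original=0.5`, each normalization path is taken with probability 0.5/7 ≈ 0.0714, matching the 7.14% stated above.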

[Figure panels: Original image; Nuclei image; Tissue image; Histogram]

Test time stain normalization
Test time augmentation (TTA) has been shown to boost the segmentation and classification performance for various medical image analysis tasks in former studies [32,48,54]. In this work, besides morphological TTA (90-degree rotation and horizontal flipping), we propose to use test time stain normalization (TTSN).
The entire workflow of the inference phase for one of the folds in cross-validation (refer to Section 2.6 for more details) is shown in Fig. 5. In this phase, the original test image and seven normalized test images (identical normalization as performed in training) were sent first to the morphological TTA block (blue boxes in Fig. 5) and then to the trained model. After averaging the results of the morphological TTA block (non-weighted averaging), the outputs were merged using a weighted average scheme. The weights were chosen based on the probabilities used in the training phase (i.e., 50 for the original input test image and 7.14 for each of the seven normalized images).
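The inference scheme for a single trained model can be sketched as below. The exact set of morphological transforms is an assumption (all four 90-degree rotations, each with and without a horizontal flip), as are the callables `model` (mapping an (H, W, C) image to an (H, W) map) and `normalize`, and the function names. The ensembling over the five cross-validation models is not shown.

```python
import numpy as np

def tta_predict(model, image):
    """Non-weighted average of predictions over rotations and flips."""
    preds = []
    for k in range(4):                      # 0, 90, 180, 270 degrees
        for flip in (False, True):
            t = np.rot90(image, k, axes=(0, 1))
            if flip:
                t = t[:, ::-1]              # horizontal flip
            p = model(t)
            if flip:                        # undo the flip, then the rotation
                p = p[:, ::-1]
            preds.append(np.rot90(p, -k, axes=(0, 1)))
    return np.mean(preds, axis=0)

def ttsn_ensemble(model, image, references, normalize,
                  w_original=50.0, w_normalized=50.0 / 7.0):
    """Weighted average over the original and the stain-normalized images."""
    variants = [image] + [normalize(image, ref) for ref in references]
    weights = np.array([w_original] + [w_normalized] * len(references))
    preds = np.stack([tta_predict(model, v) for v in variants])
    # normalize the weights so that they sum to one before averaging
    return np.tensordot(weights / weights.sum(), preds, axes=1)
```

The default weights reproduce the 50 versus 7.14 ratio mentioned above, so the original image contributes half of the final prediction.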

Evaluation
To evaluate the performance of the nuclei instance segmentation, we used the Dice score, the aggregate Jaccard index (AJI), and the panoptic quality (PQ) score [11,21]. These metrics have been widely used to evaluate and compare the performance of different nuclei segmentation methods [23,49]. While the Dice score reflects the general semantic segmentation performance, the AJI and PQ scores are sensitive to the capability of the model to separate touching objects (i.e., they reflect the instance segmentation performance) and hence are the more critical metrics in this study.
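For reference, the Dice score and a common formulation of PQ can be sketched as follows. The PQ variant shown (unique matching at IoU > 0.5, sum of matched IoUs divided by TP + 0.5 FP + 0.5 FN) is the standard definition and is assumed, not confirmed, to match the evaluation code used here. AJI is omitted for brevity, and the function names are illustrative.

```python
import numpy as np

def dice_score(pred, gt):
    """Dice coefficient between two masks (any non-zero pixel is foreground)."""
    pred, gt = pred > 0, gt > 0
    denom = pred.sum() + gt.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, gt).sum() / denom

def panoptic_quality(pred_labels, gt_labels, iou_threshold=0.5):
    """PQ for instance label maps in which 0 denotes background."""
    gt_ids = [i for i in np.unique(gt_labels) if i != 0]
    pred_ids = [i for i in np.unique(pred_labels) if i != 0]
    matched, iou_sum = set(), 0.0
    for g in gt_ids:
        g_mask = gt_labels == g
        # only predictions overlapping this ground-truth instance can match
        for p in np.unique(pred_labels[g_mask]):
            if p == 0:
                continue
            p_mask = pred_labels == p
            iou = (g_mask & p_mask).sum() / (g_mask | p_mask).sum()
            if iou > iou_threshold:         # IoU > 0.5 makes the match unique
                matched.add(int(p))
                iou_sum += iou
                break
    tp = len(matched)
    fp = len(pred_ids) - tp
    fn = len(gt_ids) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return 0.0 if denom == 0 else iou_sum / denom
```

Merging two touching nuclei into one predicted instance leaves the Dice score nearly unchanged but lowers PQ, which is why the instance-level metrics are treated as more critical.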

Experimental setup
To show the effectiveness and generalization power of the proposed method, we designed six experiments. The schematic workflow of the experiments is shown in Fig. 6. The details of the experiments are as follows:
Besides the differences mentioned above, all other parameters were kept identical in all experiments. We used 5-fold cross-validation to train the DDU-Net in all experiments, and the five resulting models were ensembled in the inference phase. Each model was trained for 200 epochs with an initial learning rate (LR) of 0.001. We used an LR scheduler (the LR was halved after every 30 epochs). Randomly cropped images with a size of 992 × 992 pixels were used to train the models in the experiments. In the inference phase, for all test datasets except CPM-15, test images were white-padded to form 1024 × 1024 images and then sent to the trained models. The original part was then cropped from the results. For the CPM-15 dataset, all images were white-padded to 1056 × 1056 pixels, as one of the images in the CPM-15 dataset is larger than 1024 pixels in one dimension (1032 × 808 pixels). Moreover, the Adam optimizer [20], a batch size of four, and a threshold of 0.5 (to convert probabilities to binary values) were used in training and testing. We also applied classical augmentation techniques in all experiments mentioned above. The classical augmentations included horizontal and vertical flipping, random 90-degree rotations, and random color, brightness, and contrast shifts.
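Two of the settings above, the step-wise LR schedule and the white padding with subsequent cropping, can be sketched as follows. The zero-based epoch indexing, the placement of the padding at the bottom and right borders, and the function names are assumptions.

```python
import numpy as np

def step_lr(epoch, initial_lr=1e-3, drop=0.5, every=30):
    """Learning rate for a zero-based epoch index, halved every 30 epochs."""
    return initial_lr * drop ** (epoch // every)

def pad_white(image, target=1024, value=255):
    """Pad an (H, W[, C]) image with white pixels to target x target.

    Returns the padded image and the original (H, W) for later cropping.
    """
    h, w = image.shape[:2]
    if h > target or w > target:
        raise ValueError("image is larger than the target size")
    pad = [(0, target - h), (0, target - w)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad, mode="constant", constant_values=value)
    return padded, (h, w)

def crop_back(prediction, original_shape):
    """Cut the region of the original image out of a padded prediction."""
    h, w = original_shape
    return prediction[:h, :w]
```

For the CPM-15 images, `target=1056` would be passed instead of the default, which accommodates the 1032 × 808 pixel image mentioned above.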
To investigate the robustness, all experiments (5-fold cross-validation training and ensembling for each of the six setups) were repeated three times, and the average results and standard deviations are reported. All models were trained on a single workstation with an Intel Core i7-8700 3.20 GHz CPU, 32 GB of RAM, and an NVIDIA TITAN V GPU card with 12 GB of installed memory. The DDU-Net was trained and evaluated using the TensorFlow (version 2.11) DL framework.

Results and Discussion
In this section, we report the results for each test dataset in a separate table. Each table contains the results based on the different experimental setups explained in Section 2.6, including baseline segmentation results (first row), results from the offline normalization technique (second row), results from extended offline normalization (third row), results from the offline normalization approach with atlas image (fourth row), and the results from the proposed method with or without TTSN (fifth and sixth rows).
The results in Table 3 are derived from the MoNuSeg test data. Although the test data and training data came from the same distribution (MoNuSeg dataset) in this experiment, the average results indicate a superior segmentation performance of the proposed method (with or without TTSN) in comparison to the baseline results. The proposed method with TTSN also delivered better semantic and instance segmentation performance compared to offline normalization (single image or atlas image) or extended offline normalization approaches. Comparing the fifth and sixth rows shows that using TTSN slightly improves the segmentation performance for all evaluation indices.
The results of Table 4 to Table 9 show the generalization capability of the models in different experimental setups for unseen test datasets. A number of observations can be inferred from the results of these tables.
First of all, for all test datasets, the proposed method (with or without TTSN) consistently delivers better average segmentation performance in comparison to the baseline model for all evaluation metrics.
Secondly, using conventional offline stain normalization is not always beneficial, and it can even degrade performance (for example, it degrades the average performance on the TNBC dataset across all evaluation metrics, as seen by comparing the first and second rows of Table 4). This is in accordance with previous findings in other studies [2,44]. However, in this study, we developed an approach that consistently delivers superior performance on multiple test datasets compared to the baseline model.
Thirdly, adding TTSN to the workflow (i.e., comparison between the fifth and sixth rows in the tables) improves the average segmentation performance in most test datasets (5 out of 7 based on Dice score, 5 out of 7 for AJI, 5 out of 7 for PQ score) and delivers very competitive performance in the other cases. However, it is worth mentioning that adding the proposed TTSN to the workflow increases the test time eightfold (instead of sending one single test image to the models, eight images have to be sent to the models, as shown in Fig. 5). This could be a barrier to using TTSN when limited computational resources are available to analyze whole slide histological images. However, with proper computational resources and parallel processing, this should not be an issue. To investigate whether incorporating more images in the ensemble phase would change the segmentation results, we performed an additional experiment with 14 images for deterministic stain normalization. We chose two images per organ instead of one image per organ. Again, we used the absolute differences between the mean tissue and nuclear intensity to select the second image per organ (the second most suitable candidate for each organ). The results are reported in Table 10. As the results indicate, the performance is almost identical in all cases (we report the results with two-decimal precision in the table as they are very close). This suggests that adding more images for deterministic stain normalization in the inference phase does not necessarily lead to improved performance but increases the inference time, which is undesirable for practical applications.
Fourthly, for most cases (all except the CryoNuSeg dataset), the proposed method with TTSN delivers superior average semantic and instance segmentation performance compared to the other normalization methods discussed in this study (offline normalization (single image or atlas image) and extended offline normalization). The results in Table 9 show the segmentation performance for the NuInsSeg dataset. NuInsSeg is the largest test dataset used in this study, with 665 images that are derived from 31 different human and mouse organs. Therefore, this dataset can be considered the most important dataset to show the generalization capability of the proposed method, and hence, we describe its results separately. As the results show, we observe the same trend of superior performance of the proposed method (both fifth and sixth rows) in comparison to the baseline model. However, the average difference between baseline and proposed segmentation performance is more evident for the NuInsSeg dataset (4.9%, 5.4%, and 5.9% for Dice, AJI, and PQ score, respectively). The proposed method with TTSN also outperforms the other applied normalization techniques (by 3.4% for Dice, 3.1% for AJI, and 2.5% for PQ score compared to the offline normalization method; by 6.8% for Dice, 6.4% for AJI, and 5.9% for PQ score compared to the extended offline normalization strategy; and by 1.7% for Dice, 2.8% for AJI, and 2.7% for PQ score compared to the offline normalization with atlas image approach). For qualitative comparison, we show sample segmentation results from the baseline model and the proposed method in Fig. 7. The derived standard deviations of the results indicate that the outcomes obtained from the proposed model with TTSN are highly robust, with an average standard deviation of 0.28%, 0.34%, and 0.47% for Dice, AJI, and PQ scores, respectively. Similarly, the model remains robust even without TTSN, with an average standard deviation of 0.41%, 0.48%, and 0.6% for Dice, AJI, and PQ scores, respectively.
There are some limitations in this study that can be addressed in future work. First, while we propose a generalizable framework for the nuclei instance segmentation task, we only used one of the state-of-the-art DL-based models (DDU-Net) in our study. Although DDU-Net has shown excellent nuclei instance segmentation performance [9,31,48], it can be replaced by other state-of-the-art segmentation models in future research. However, we would like to emphasize that most other state-of-the-art DL-based models for nuclei instance segmentation (such as triple U-Net, HoverNet, or the attention augmented distance regression model [7,11,58]) have encoder-decoder-based architectures similar to the utilized DDU-Net model. Second, in this study, we focused on the nuclei instance segmentation task, but the proposed framework can also be evaluated for nuclei detection or nuclei instance segmentation and classification tasks in the future. Finally, using stain normalization, as shown in former studies [16,25], introduces computational overhead, especially in the inference phase. While we observe improved nuclei instance segmentation performance in all test datasets, the gains in some datasets were not as notable as in others. Thus, the trade-off between the improved generalization offered by the proposed TTSN method and its extra computational overhead should be considered, especially when limited computational resources are available.

Conclusion
While many ML-based and DL-based approaches have been proposed for nuclei instance segmentation in histological images, their performance usually degrades when tested on unseen new images. We proposed a framework for generalized nuclei instance segmentation with non-deterministic train time and deterministic test time stain normalization. Applied to seven independent test datasets, the results showed the superior performance of the proposed method compared to the baseline segmentation model. Therefore, the proposed approach can be considered a generalized framework for nuclei instance segmentation.

Fig. 7. Qualitative comparison between the baseline segmentation model and the proposed approach (results from the first run). The first column shows some example test images from the NuInsSeg dataset (from the human bladder, human placenta, and human cerebellum in the first, second, and third rows, respectively). The second column shows the ground truth segmentation masks. The third column shows the prediction by the baseline DDU-Net model, and the fourth column shows the results of the proposed approach. The red bounding boxes in columns two to four show some example nuclei where the proposed method delivered a superior segmentation performance (better semantic segmentation performance in the first row and better instance segmentation performance in the second and third rows) compared to the baseline segmentation model.

Fig. 4. During non-deterministic stain normalization, the input training images were randomly passed either directly to the segmentation model or into the normalization pipeline, where they were normalized to one of seven reference images. The probability for each path in the normalization pipeline was chosen equally.

Fig. 5. Proposed inference approach with deterministic test time stain normalization. The blue dashed boxes in each branch show the morphological test time augmentation (TTA). Trained model n ∈ {1, 2, 3, 4, 5} represents the trained model for each fold of 5-fold cross-validation.

Fig. 6. Schematic design of the experiments. The shown image and normalized images are derived from the MoNuSeg training set for visualization. Norm.: Normalization; Non-det.: Non-deterministic; TTSN: Test Time Stain Normalization.

Table 1. Summary of the datasets used for training (first row) and testing (other rows). Training and testing data contain images from 7 and 40 distinct organs/tissues, respectively.

Table 2. The differences between the mean tissue intensity and the mean nuclear intensity for all training images (30 images) of the MoNuSeg dataset. The selected reference images are shown in bold.
- No normalization (baseline): In this experiment, we did not apply any normalization and just used the raw MoNuSeg training dataset to train the DDU-Net model. To evaluate the model on the test datasets, raw test images were used.
- Offline normalization: In this setup, one single image was chosen as the reference image, and all other images (from the MoNuSeg training set and all test images from all test datasets) were normalized using the Macenko et al. approach. To choose the reference image, we used the histogram analysis described in Section 2.3.
- Extended offline normalization: In this setup, we merged the raw and normalized MoNuSeg training data to train the model (hence, the size of

Table 3. Segmentation results for the MoNuSeg test data (average and standard deviation over three runs). The last two rows represent the results from the proposed approach. Non-det.: Non-deterministic; TTSN: Test Time Stain Normalization; AJI: Aggregate Jaccard Index; PQ: Panoptic Quality.

Table 4. Segmentation results for the TNBC test data (average and standard deviation over three runs). The last two rows represent the results from the proposed approach.

Table 6. Segmentation results for the CPM-15 test data (average and standard deviation over three runs). The last two rows represent the results from the proposed approach. Non-det.: Non-deterministic; TTSN: Test Time Stain Normalization; AJI: Aggregate Jaccard Index; PQ: Panoptic Quality.

Table 7. Segmentation results for the CPM-17 test data (average and standard deviation over three runs). The last two rows represent the results from the proposed approach. Non-det.: Non-deterministic; TTSN: Test Time Stain Normalization; AJI: Aggregate Jaccard Index; PQ: Panoptic Quality.

Table 8. Segmentation results for the CoNSeP test data (average and standard deviation over three runs). The last two rows represent the results from the proposed approach. Non-det.: Non-deterministic; TTSN: Test Time Stain Normalization; AJI: Aggregate Jaccard Index; PQ: Panoptic Quality.

Table 9. Segmentation results for the NuInsSeg test data (average and standard deviation over three runs). The last two rows represent the results from the proposed approach. Non-det.: Non-deterministic; TTSN: Test Time Stain Normalization; AJI: Aggregate Jaccard Index; PQ: Panoptic Quality.

Table 10. Comparison between incorporating 7 or 14 images for deterministic stain normalization in the inference phase. TTSN: Test Time Stain Normalization; AJI: Aggregate Jaccard Index; PQ: Panoptic Quality.