A Dataset and a Technique for Generalized Nuclear Segmentation for Computational Pathology
Neeraj Kumar*, Ruchika Verma, Sanuj Sharma, Surabhi Bhargava, Abhishek Vahadane, and Amit Sethi
Abstract—Nuclear segmentation in digital microscopic tissue images can enable extraction of high-quality features for nuclear morphometrics and other analysis in computational pathology. Conventional image processing techniques, such as Otsu thresholding and watershed segmentation, do not work effectively on challenging cases, such as chromatin-sparse and crowded nuclei. In contrast, machine learning-based segmentation can generalize across various nuclear appearances. However, training machine learning algorithms requires datasets of images in which a vast number of nuclei have been annotated. Publicly accessible and annotated datasets, along with widely agreed upon metrics to compare techniques, have catalyzed tremendous innovation and progress on other image analysis problems, particularly object recognition. Inspired by their success, we introduce a large, publicly accessible dataset of hematoxylin and eosin (H&E) stained tissue images with painstakingly annotated nuclear boundaries, whose quality was validated by a medical doctor. Because our dataset covers a diversity of nuclear appearances from several patients, disease states, and organs, techniques trained on it are likely to generalize well and work right out-of-the-box on other H&E-stained images. We also propose a new metric to evaluate nuclear segmentation results that penalizes object- and pixel-level errors in a unified manner, unlike previous metrics that penalize only one type of error. We also propose a segmentation technique based on deep learning that lays special emphasis on identifying nuclear boundaries, including those between touching or overlapping nuclei, and works well on a diverse set of test images.
Index Terms—Annotation, boundaries, dataset, deep learning, nuclear segmentation, nuclei.
I. INTRODUCTION
WITH improvements in computer vision techniques and hardware, some of the problems of manual assessment of histology images, such as inter- and intra-observer variability, inability to assess subtle visual features, and the time taken to examine whole slides [1], [2], are being alleviated by computational pathology [3]. A key module in several computational pathology pipelines is the one that segments
Fig. 1. Challenges in nuclear segmentation: Original H&E stained tissue images show crowded and chromatin-sparse nuclei. Otsu thresholding [9] leads to merged nuclei (under-segmentation). Marker-controlled watershed segmentation [10] leads to fragmented nuclei (over-segmentation). The proposed technique detects and segments almost all nuclei well. Each segmented nucleus is shown in a separate color.
nuclei. Nuclear morphometric and appearance features, such as density, nucleus-to-cytoplasm ratio, average size, and pleomorphism, can be helpful not only for assessing cancer grades but also for predicting treatment effectiveness [4]–[7]. Identifying different types of nuclei based on their segmentation can also yield information about gland shapes, which, for example, is important for cancer grading [8]. Thus, techniques that accurately segment nuclei in diverse images spanning a range of patients, organs, and disease states can significantly contribute to the development of clinical and medical research software.
The primary goal of this work is to help those working in computational pathology to accurately segment nuclei in a diverse set of H&E stained histology images. For this purpose, we are releasing a large dataset of images with annotated nuclear boundaries that are difficult to segment.
This will enable training and testing of readily usable (or generalized) nuclear segmentation pipelines. The dataset comes from several hospitals and covers multiple organs, patients, and disease states. In Section V, we introduce a readily usable deep learning-based nuclear segmentation technique and have released its source code, which works well in comparison to other publicly available techniques.
Most of the earlier work on nuclear segmentation did not take challenging cases into account. For example, in pathological conditions (such as hyperplasia or certain cancer subtypes), nuclei enlarge and exhibit margination of chromatin, such that they have a lighter inner body and a slightly darker outer ring when stained using hematoxylin (a commonly used bluish-purple dye). Additionally, under such conditions, prominent nucleoli that are darker than the rest of the nucleus appear inside the nuclear boundary. Popular image segmentation techniques, such as Otsu thresholding [9] or marker-controlled watershed segmentation [10]–[12], anticipate relatively uniform and distinguishable colors or textures within a nucleus. This assumption leads to under-segmentation or over-segmentation, as shown in Figure 1, even if we control for the imaging modality by using only one of the most common staining methods based on hematoxylin and eosin (H&E) and magnification (40x objective with a standard 10x eyepiece). We use the term under-segmentation for a segment that is unduly large. That is, it almost completely covers a ground truth nucleus and additionally covers significant area outside that nucleus, which may include neighboring nuclei. Conversely, if a segment corresponds to only one nucleus but leaves a large proportion of the other pixels from the same nucleus uncovered, we call it over-segmentation. This also includes cases where a nucleus is split into multiple detected objects. A rare phenomenon that we ignore is where there is only a small overlap between the areas of a segment and a ground truth nucleus, which is neither under- nor over-segmentation but wrong nonetheless.
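To make the failure mode concrete, the following self-contained sketch (a synthetic image, not the paper's data) applies Otsu's threshold to two touching dark "nuclei" on a bright background and shows that thresholding alone merges them into a single connected component, i.e., under-segmentation. The helper names are illustrative.

```python
import numpy as np
from collections import deque

def otsu_threshold(img):
    """Threshold maximizing between-class variance (Otsu, 1979); img is uint8."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    total = hist.sum()
    mu_total = (hist * np.arange(256)).sum() / total
    best_t, best_var = 0, -1.0
    w = mu = 0.0
    for t in range(256):
        w += hist[t] / total       # cumulative weight of the darker class
        mu += t * hist[t] / total  # cumulative partial mean
        if w <= 0.0 or w >= 1.0:
            continue
        # between-class variance: (mu - w * mu_total)^2 / (w * (1 - w))
        var_between = (mu_total * w - mu) ** 2 / (w * (1.0 - w))
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def label_components(mask):
    """4-connected component labeling via BFS, to count detected objects."""
    labels = np.zeros(mask.shape, dtype=int)
    n = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue
        n += 1
        q = deque([(sy, sx)])
        labels[sy, sx] = n
        while q:
            y, x = q.popleft()
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = n
                    q.append((ny, nx))
    return labels, n

# Synthetic patch: two touching dark nuclei (intensity 60) on a bright background (200).
img = np.full((80, 120), 200, dtype=np.uint8)
yy, xx = np.mgrid[:80, :120]
for cy, cx in ((40, 45), (40, 70)):
    img[(yy - cy) ** 2 + (xx - cx) ** 2 <= 15 ** 2] = 60

t = otsu_threshold(img)
mask = img <= t                        # nuclei are darker than background
_, n_objects = label_components(mask)  # the two touching nuclei merge into one object
```

Although Otsu's threshold separates foreground from background well here, the two nuclei come out as a single object: separating them is exactly where marker-controlled watershed is typically applied, and where it in turn tends to over-fragment chromatin-sparse nuclei.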
Using machine learning, new techniques have shown the potential to accurately segment nuclei in the challenging images [13], [14]. A significant barrier to evaluating, using, or improving these techniques is the unavailability of large and diverse annotated training datasets. We address this problem in Section III by introducing a publicly accessible dataset of H&E stained tissue images with more than 21,000 manually annotated nuclear boundaries. Our dataset spans 30 patients and seven organs. We requested an expert pathologist to assess our annotations, which confirmed that they were of high quality. Further, we suggest how to divide the dataset into a training and two testing sets. One of the testing sets is even more challenging than the other because it covers organs that are not in the training set.
Additionally, we propose a metric to evaluate nuclear segmentation techniques. Although several metrics for measuring nuclear detection and segmentation quality have been reported in the literature, these metrics do not penalize object-level (detection) and pixel-level (segmentation) errors in a unified manner. We show in Section II-D that the previous metrics are inadequate to capture the segmentation quality, and in Section IV-B we suggest a metric that penalizes both types of errors in a unified manner. Finally, we propose a nuclear segmentation technique based on deep learning. It is motivated by the need to identify pixels on nuclear boundaries, irrespective of whether they lie on the boundary between a nucleus and the surrounding cytoplasm or between touching or overlapping nuclei. We therefore introduce a third class of pixels (nuclear boundary) in addition to the two usual classes: background (outside all nuclei) and foreground (inside any nucleus). We compare this technique with other open-source software and another deep learning-based technique in Section VI. We are also releasing the source code of our technique to aid its use, evaluation, and improvement. We conclude in Section VII.
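The three-class pixel labeling described above can be illustrated with a small sketch: given an instance-labeled mask, nucleus pixels whose neighborhood crosses into another nucleus or the background are marked as a separate "nuclear boundary" class. This is an illustrative reconstruction of the idea, not the authors' code.

```python
import numpy as np

def ternary_label_map(inst):
    """0 = background, 1 = nucleus interior, 2 = nuclear boundary.
    A pixel is 'boundary' if it belongs to a nucleus and any 8-neighbor
    belongs to a different nucleus or to the background."""
    h, w = inst.shape
    padded = np.pad(inst, 1, mode='edge')
    boundary = np.zeros((h, w), dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neigh = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            boundary |= (inst > 0) & (neigh != inst)
    return np.where(boundary, 2, (inst > 0).astype(int))

# Two touching nuclei: the shared edge becomes class 2, keeping instances separable.
inst = np.zeros((6, 8), dtype=int)
inst[1:5, 1:4] = 1
inst[1:5, 4:7] = 2
tern = ternary_label_map(inst)
```

Training a pixel classifier on such ternary maps lets a post-processing step recover individual nuclei by removing the predicted boundary class before connected-component labeling.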
II. BACKGROUND AND RELATED WORK
In this section, we review the importance of H&E stained images, nuclear segmentation techniques and metrics, and features of the publicly accessible datasets for computer vision and nuclear segmentation problems.
A. Hematoxylin and Eosin (H&E) Stained Images
Histologic structure of a tissue primarily consists of epithelium (glands), lumen (ducts within glands), adipose (fat), and stroma (connective tissue that holds the glands together). Shape, size, color, and crowding of glands, as well as various nuclei in epithelium and stroma, reveal a lot of information about the health of the tissue to a pathologist. The combination of hematoxylin and eosin, or H&E, is a ubiquitous, general, and inexpensive staining (dyeing) scheme. Hematoxylin renders nuclei dark bluish purple and epithelium light purple, while eosin renders stroma pink. Together, H&E enhance the contrast between nuclei, epithelium, and stroma for examination under a microscope.
There seems to be a vast amount of untapped information in H&E stained images that can be used for specific diagnoses, such as cancer molecular sub-type determination [15], mortality prediction [4], and treatment effectiveness prediction [7], [16]. Due to its low cost and widespread use, our dataset covers H&E stained images. However, most machine learning techniques can easily be trained on tissue images with other types of stains.
B. Nuclear Segmentation Techniques for H&E Stained Images
Most state-of-the-art nucleus segmentation techniques use watershed segmentation, morphological processing, color-based thresholding, active contours, and their variants, along with a multitude of pre- and post-processing techniques, to achieve the aforementioned goals [10], [11], [17]–[20]. However, such methods fail to generalize across a wide spectrum of tissue morphologies (Figure 1) due to inter- and intra-nuclear color variations in crowded and chromatin-sparse nuclei.

Techniques based on machine learning can give better results on the challenging cases of nuclear segmentation because they can be trained to recognize nuclear shape and color variations. One class of learning-based methods uses hand-crafted features, such as color-texture, blue-ratio and color histograms, Laplacian-of-Gaussian responses, geometric features from the gradient or Hessian profiles, and other image characteristics, in standard learning-based models to segment nuclei and non-nuclei regions [5], [21]–[23]. A second class of learning-based methods uses deep learning, specifically convolutional neural networks (CNNs), instead of handcrafted features, and has outperformed previous techniques in nuclear detection and segmentation [13], [14], [24]. These techniques segment nuclear and non-nuclear (two-class) regions based on the learned nuclear appearances and rely on complex post-processing methods to obtain the final nuclear shapes and the separation between touching nuclei. For example, a graph partitioning method was used by [24], while a distance transform of the nuclear map followed by H-minima thresholding and region growing was used by [13]. A more comprehensive review of state-of-the-art nuclei segmentation algorithms can be found in [25] and [26]. These techniques have not yet been demonstrated to work on multiple organs or disease states right out of the box (without re-training), and their source codes are not publicly available.
D. Methods for Evaluation of Segmentation Techniques
A good evaluation criterion for nuclear segmentation techniques should penalize both object-level (nucleus detection) and pixel-level (nucleus shape and size) errors, listed below:
1) Missed detection of ground truth (annotated) objects,
2) False detection of ghost objects,
3) Under-segmentation of correctly detected objects, and
4) Over-segmentation of correctly detected objects.
A commonly used object detection metric is the F1-score. For ground truth objects G_i indexed by i and segmented objects S_j indexed by j, the F1-score is based on true positives TP (count of all ground truth objects G_i with an associated segmented (detected) object S_j), false positives FP (count of all segmented objects S_j without a corresponding ground truth object G_i), and false negatives FN (count of all ground truth objects G_i without a corresponding detected object S_j). The F1-score is defined as follows:
We propose to explicitly find inter-nuclear boundaries using a third class of pixels to separate crowded nuclei later in this paper, to improve upon the binary classes learned by previous deep learning techniques. We also release both an instance of our technique trained to work on multiple organs and its source code.
F1 = 2TP / (2TP + FP + FN).    (1)
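The F1-score defined above is straightforward to compute from the three counts; a minimal helper (hypothetical function name) is:

```python
def f1_score(tp, fp, fn):
    """Object-level F1: harmonic mean of precision TP/(TP+FP) and recall TP/(TP+FN)."""
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp else 0.0
```

For example, with 90 true positives, 10 false positives, and 20 false negatives, F1 = 180/210 ≈ 0.857.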
The key detail in evaluating an F1-score is a criterion for deciding whether a ground truth object G_i has an associated segmented object S_j, that is, whether G_i has been detected successfully. This has been done in different ways, some of which are more generalizable than others for evaluating the detection quality. For example, the association criterion for a ground truth nucleus proposed in [13] was based on finding its closest segmented nucleus and checking if their centroids were less than a threshold apart. This is not generalizable because the distance threshold is set subjectively and may need to change with magnification, imaging modality, object type, organ, and disease state. The association criterion for ground truth glands proposed in [31] was based on finding segmented glands that cover at least 50% of the ground truth (annotated) gland, which is more generalizable for detection than a distance threshold-based criterion.
C. Computer Vision Datasets
Significant progress has been made on certain computer vision problems due to a healthy competition enabled by publicly available datasets and evaluation metrics for benchmarking, such as ImageNet [27] and CIFAR [28] for object recognition in images and UCF for action recognition in videos [29]. The medical imaging community has eventually started to follow this lead with the release of well-annotated datasets and the organization of competitions for segmentation, classification, and detection [30]–[33].
For detecting and segmenting nuclei, a few large datasets have recently been released. One of them has around 29,000 marked nuclear centers for detection [14], but it does not have the annotated nuclear boundaries required for training and testing segmentation techniques, and it contains images from only one organ. Datasets with annotated boundaries of a few thousand nuclei for single organs have previously been released [34]–[36]. We introduce one of the first large datasets of diverse nuclei from multiple organs with annotated boundaries, which contains more than twice the number of annotated nuclei compared to a previous notable effort to introduce a multi-organ, multi-disease state dataset [20]. Additionally, we cover cases of crowded nuclei more extensively.
A major shortcoming of the F1-score using any object association criterion is that it does not take pixel-level (segmentation) errors into account. For example, while the association criterion in [31] does not penalize under-segmentation, the one in [13] does not penalize over-segmentation of crowded nuclei. Thus, most work on nuclear segmentation reports two metrics: one to evaluate detection (object-level errors) and another to report shape concordance (pixel-level errors) between the ground truth objects and their associated segmented objects.
To compute shape concordance between a ground truth object G_i and its associated segmented object S_j, one of the following metrics is often used: Jaccard index [37], Dice's coefficient [38], or Hausdorff distance [39]. Of these, Hausdorff distance, which penalizes the distance between the furthest pixels on the contours of the two shapes, is less popular because of its higher computational complexity.
A useful concept utilized in some human action recognition datasets is that of groups [29]. A group is a set of samples within a class that are similar in the way the data was acquired. For example, videos shot in the same background of different instances of the same action could be clubbed as a group. Keeping an entire test group out of the training set assesses the generalization ability of the classification methods. In histology, we propose that the groups can correspond to patients, batches of slides, disease states, or organs.
TABLE I
COMPOSITION OF THE DATASET AND ITS PROPOSED DIVISION FOR TRAINING, VALIDATION, AND TESTING
Data subset               Nuclei  Images  Breast  Liver  Kidney  Prostate  Bladder  Colon  Stomach
Training and validation   13,372      16       4      4       4         4        -      -        -
Same-organ testing         4,130       8       2      2       2         2        -      -        -
Different-organ testing    4,121       6       -      -       -         -        2      2        2
Total                     21,623      30       6      6       6         6        2      2        2
The computational complexity of Hausdorff distance is O(|G_i||S_j|), while that of the other two is O(|G_i| + |S_j|), where G_i and S_j are the sets of pixels in ground truth object i and segmented object j, respectively. Jaccard index J and Dice's coefficient D are closely related, measure the relative area of overlap between the two sets, and are widely used. Only the true positives among the detected objects affect these shape concordance metrics, and thus object-level errors are not accounted for by such metrics. Later in this paper, we modify the Jaccard index because it is easy to understand and covers the same information as Dice's coefficient (D = 2J / (1 + J)). Our modification penalizes both detection and segmentation errors.
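Since Dice's coefficient is a monotonic function of the Jaccard index, either score can be recovered from the other; a one-line conversion makes the equivalence concrete (helper name is ours):

```python
def dice_from_jaccard(j):
    """Dice and Jaccard carry the same information: D = 2J / (1 + J)."""
    return 2.0 * j / (1.0 + j)
```

For instance, a Jaccard index of 0.5 corresponds to a Dice coefficient of 2/3, and the two metrics agree at 0 and 1.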
These images came from 18 different hospitals, which introduced another source of appearance variation due to the differences in staining practices across labs. The details of the dataset can also be found in the supplementary materials available in the supplementary files/multimedia tab. Since the computational requirements for processing WSIs are high, we cropped sub-images of size 1000 × 1000 from regions dense in nuclei, keeping only one such cropped image per WSI and patient. To further ensure richness of nuclear appearances, we covered seven different organs, viz., breast, liver, kidney, prostate, bladder, colon, and stomach, including both benign and diseased tissue samples. An illustration of the spectrum of tissue appearances and their nuclei is shown in Figure 3, row 1.
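The paper does not specify how nuclei-dense regions were located within a WSI. As a hypothetical sketch, a summed-area table (integral image) can rank candidate crops by the number of nucleus pixels reported by a coarse detector; the function name, window size parameter, and input mask are all our assumptions.

```python
import numpy as np

def densest_crop(nuclei_mask, size=1000):
    """Top-left corner of the size x size window containing the most
    nucleus pixels, found via a 2-D summed-area table (integral image)."""
    h, w = nuclei_mask.shape
    ii = np.zeros((h + 1, w + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(nuclei_mask, axis=0), axis=1)
    # Sum of every size x size window in one vectorized step.
    sums = (ii[size:h + 1, size:w + 1] - ii[:h - size + 1, size:w + 1]
            - ii[size:h + 1, :w - size + 1] + ii[:h - size + 1, :w - size + 1])
    y, x = np.unravel_index(np.argmax(sums), sums.shape)
    return y, x

# Toy example with a 2 x 2 window: the densest crop starts at the block of ones.
mask = np.zeros((5, 6), dtype=int)
mask[1:3, 2:4] = 1
corner = densest_crop(mask, size=2)
```

The integral image makes the window search O(hw) regardless of window size, which matters at WSI scale.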
For two sets, which in our case are the pixels of a ground truth nucleus G_i and its associated segmented nucleus S_j, the Jaccard index is defined as follows:
After obtaining 1000 × 1000 sub-images, we annotated more than 21,000 nuclear boundaries in Aperio ImageScope. Images were enlarged to 200× on a 25-inch monitor, such that each image pixel occupied 5 × 5 screen pixels for clear visibility, and the nuclear boundaries were annotated with a laser mouse. The annotators were engineering students and were trained to identify nuclear boundaries by the co-authors. The generated XML files containing pixel coordinates of the annotated nuclear boundaries are available on our website.1 Our annotations included both epithelial and stromal nuclei. For overlapping nuclei, we assigned each multi-nuclear pixel to the largest nucleus containing that pixel. A representative set of annotations is shown in row 2 of Figure 3, and the composition of the dataset is shown in Table I.
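Such annotation XML can be read with the Python standard library. The element layout assumed below (Region/Vertices/Vertex elements with X and Y attributes) follows Aperio ImageScope's common export format; this is an assumption, so tag names may need adjusting to the actual files.

```python
import xml.etree.ElementTree as ET

def read_boundaries(xml_text):
    """Parse ImageScope-style annotation XML into one list of (x, y)
    vertex tuples per annotated nucleus (assumed schema, see above)."""
    root = ET.fromstring(xml_text)
    nuclei = []
    for region in root.iter('Region'):
        pts = [(float(v.get('X')), float(v.get('Y')))
               for v in region.iter('Vertex')]
        if pts:
            nuclei.append(pts)
    return nuclei

# Tiny synthetic example with one triangular nucleus boundary.
xml_text = """<Annotations><Annotation><Regions>
  <Region Id="1"><Vertices>
    <Vertex X="10.0" Y="12.0"/><Vertex X="14.0" Y="12.0"/><Vertex X="12.0" Y="16.0"/>
  </Vertices></Region>
</Regions></Annotation></Annotations>"""
nuclei = read_boundaries(xml_text)
```

Each vertex list can then be rasterized with a polygon-fill routine to produce per-nucleus binary masks for training.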
J(G_i, S_j) = |G_i ∩ S_j| / |G_i ∪ S_j|.    (2)
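On binary masks, the Jaccard index is a direct translation of its set definition:

```python
import numpy as np

def jaccard(g_mask, s_mask):
    """Pixel-wise Jaccard index |G ∩ S| / |G ∪ S| between two binary masks."""
    union = np.logical_or(g_mask, s_mask).sum()
    return np.logical_and(g_mask, s_mask).sum() / union if union else 0.0

# Two overlapping 4-pixel segments: intersection 2, union 6 -> J = 1/3.
g = np.zeros((1, 8), dtype=bool); g[0, 0:4] = True
s = np.zeros((1, 8), dtype=bool); s[0, 2:6] = True
```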
It isn't clear how two nuclei segmentation techniques can be compared if one has better object detection (e.g., F1-score) and the other has better average shape matching for detected objects (e.g., Jaccard index). A unified metric that combines both object-level and pixel-level performance is desirable. Additionally, we show in Sections IV-B and VI-C that the currently popular metrics do not reflect segmentation quality as expected. Moreover, a detection criterion that is free of hyper-parameters is also needed, so that it can be applied to different magnifications and image types.
III. ANNOTATED DATASET
We sent annotated images to an expert pathologist for examination of annotation quality. In a PowerPoint deck, we used one image per slide. On each slide, we put the unannotated and annotated images side by side to cover a large portion of the slide. The pathologist viewed the slides on a 25-inch monitor and was instructed to place an arrow shape on every problematic annotation, whether it was a false positive, a false negative, an over-segmented, or an under-segmented nucleus. We counted all the arrows and divided the count by the number of annotated nuclei in those images to estimate that our annotators made less than 1% errors on any given image. We left these errors uncorrected due to their low count.
Finding, downloading, and annotating tissue slides is time-consuming, and this has hampered the development of new nuclear segmentation software that can be used in computational pathology. Our publicly hosted dataset, with a diverse set of tissue images and painstakingly annotated nuclear boundaries, can fill this gap. It can be used by the research community to develop and benchmark generalized nuclear segmentation techniques that work on diverse nuclear types. Our dataset consists of one of the most commonly used types of images: H&E stained and captured at 40x magnification. Although we used H&E stained tissue slides digitized at 40x magnification for developing the proposed algorithm, our approach can be easily applied to the commonly available 20x slides by re-training, or by appropriate upsampling to 40x using super-resolution techniques tailored for such images [40].
IV. TRAINING AND TESTING PROTOCOL
To facilitate the development of generalized nuclear segmentation techniques, we propose how to split this dataset into training and testing sets, as well as an evaluation metric that penalizes both object-level and pixel-level errors.
We downloaded 30 whole slide images (WSIs) of digitized tissue samples of several organs from The Cancer Genomic Atlas (TCGA) [41] and used only one WSI per patient to maximize nuclear appearance variation.
A. Training and Testing Sets
As shown in Table I, we propose to keep images corresponding to 16 patients, equally divided among four organs (breast, kidney, liver, and prostate), in the training and validation set. This corresponds to over 13,000 annotated nuclei. For any segmentation technique based on pixel classification, this corresponds to several hundred thousand pixels (along with their surrounding and overlapping patches) belonging to the nuclei class that can be used to train a machine learning system (e.g., a CNN) to produce binary maps (e.g., as in [13]), even without data augmentation (e.g., by rotating and flipping the surrounding patches). We have divided the rest of the images into two test sets.
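The rotation-and-flip augmentation mentioned above can be sketched as the eight symmetries of a square patch (the dihedral group); this is a minimal NumPy example, not the paper's training code:

```python
import numpy as np

def dihedral_augment(patch):
    """Return the eight rotations/flips of a square patch, a common
    label-preserving augmentation for nuclear segmentation."""
    out = []
    for k in range(4):
        r = np.rot90(patch, k)   # rotate by k * 90 degrees
        out.append(r)
        out.append(np.fliplr(r))  # and its mirror image
    return out

# A 3 x 3 patch with distinct values yields eight distinct augmented patches.
patch = np.arange(1, 10).reshape(3, 3)
aug = dihedral_augment(patch)
```

Applying the same transform to the image patch and its label map keeps pixel-wise supervision consistent, multiplying the effective training set size by eight.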
1) Same Organ Test Set: The first test set has images from the same organs (breast, kidney, liver, and prostate) that are represented in the training set, although from different patients. Most nuclear segmentation techniques based on machine learning train and test on only one organ. Using this dataset, their generalization to four different organs can be tested.
2) Different Organ Test Set: The second test set is even more challenging because its images are taken from organs not represented in the training set: bladder, colon, and stomach. An algorithm that performs well on the six images from this set can be expected to generalize nuclear segmentation pretty well for H&E stained images imaged at 40x magnification.
B. Proposed Evaluation Criterion
We propose a parameter-free detection criterion that works regardless of nuclear size and magnification. We also propose a unified evaluation metric that penalizes both object-level and pixel-level errors, to overcome the shortcomings of the other evaluation criteria discussed in Section II-D. We use the spirit of the Jaccard index (Equation 2) for both of these goals: detecting nuclei and evaluating the segmentation results. We also give primacy to the ground truth nuclei.
1) Detection Criterion: With each ground truth nucleus, indexed by i and represented as a set of pixels G_i, we associate the detected nucleus that maximizes their pixel-wise Jaccard index as per Equation 2. This criterion does not depend on a subjective distance or pixel overlap threshold and can be applied across magnifications and object types. Thus, one detected object can correspond to more than one ground truth object (e.g., when under-segmented), but not the other way around. When combined with the evaluation metric proposed below, the detection criterion accounts for the four types of errors listed in Section II-D.
2) Evaluation Metric: We propose the following metric, which we call the aggregated Jaccard index (AJI), to evaluate the performance of a nuclear segmentation method over an image or a dataset of images. It computes an aggregated intersection cardinality numerator and an aggregated union cardinality denominator for all ground truth and segmented nuclei under consideration. That is, for each ground truth nucleus G_i in an image (or a dataset), after associating a segmented nucleus S_j, we add the contributions to the aggregated Jaccard index by adding the pixel count of G_i ∩ S_j to AJI's numerator and that of G_i ∪ S_j
to the denominator. This naturally adds the pixels of those ground truth nuclei that do not find an intersecting segmented nucleus (detection false negatives) to the denominator. We also add the pixel counts of all unclaimed segmented nuclei (detection false positives) to the denominator. Because this metric adds the pixel counts of false positives and false negatives to the denominator, in addition to the pixels of non-overlap between ground truth and detected nuclei (true detections), it penalizes all four types of errors listed in Section II-D. It is worth noting that the previous segmentation metrics are computed only over true positives, and the pixels in false positives and false negatives are completely ignored in evaluating the segmentation quality. When an under-segmented nucleus corresponds to multiple ground truth nuclei, the proposed metric has the potential to count several falsely detected pixels multiple times in the denominator. This is necessary because, otherwise, a segmentation system could be biased towards slight under-segmentation without incurring a penalty. Thus, the aggregated Jaccard index (AJI) in general has a lower value than the F1-score and the mean Jaccard index. In Section VI-C, we show empirical evidence of how a higher AJI better represents segmentation (and detection) quality.
Our detection criterion and evaluation metric are described in detail in Algorithm 1. In case no segmented nucleus intersects with a ground truth nucleus, the intersection pixel cardinality is zero and the union pixel cardinality is the same as |G_i| in step 4 of the algorithm. AJI will range between 0 for the worst case (no intersection between ground truth and segmented objects) and 1 for the best case (perfect detection and segmentation).
We made an additional enhancement to step 3 of Algorithm 1. In case of a tie of Jaccard indices between more than one segmented nucleus, the one that maximized the intersection with the ground truth nucleus was selected.
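Putting the detection criterion, the AJI accumulation, and the tie-break together, a minimal NumPy implementation might look as follows; representing ground truth and predictions as integer-labeled instance maps is our choice of encoding, not prescribed by the paper.

```python
import numpy as np

def aggregated_jaccard_index(gt, pred):
    """AJI per Algorithm 1. gt/pred: 2-D int maps, 0 = background, k > 0 = nucleus id."""
    gt_ids = [i for i in np.unique(gt) if i > 0]
    pred_ids = [k for k in np.unique(pred) if k > 0]
    used, C, U = set(), 0, 0
    for i in gt_ids:
        g = gt == i
        best = None  # (jaccard, intersection, union, pred id)
        for k in pred_ids:
            s = pred == k
            inter = int(np.logical_and(g, s).sum())
            if inter == 0:
                continue  # only intersecting segments can be matched
            union = int(np.logical_or(g, s).sum())
            cand = (inter / union, inter, union, k)
            if best is None or cand[:2] > best[:2]:  # ties on J broken by larger overlap
                best = cand
        if best is None:      # missed nucleus: its pixels count fully against the union
            U += int(g.sum())
        else:
            C += best[1]
            U += best[2]
            used.add(best[3])
    for k in pred_ids:        # unmatched segments are detection false positives
        if k not in used:
            U += int((pred == k).sum())
    return C / U if U else 0.0

# Two ground truth nuclei; a prediction that misses the second one scores 0.5.
gt = np.zeros((4, 8), dtype=int)
gt[1:3, 1:3] = 1
gt[1:3, 5:7] = 2
pred_missing = np.where(gt == 1, 1, 0)
perfect = aggregated_jaccard_index(gt, gt)
partial = aggregated_jaccard_index(gt, pred_missing)
```

The O(|G||S|) pairwise loop is kept for clarity; in practice the search can be restricted to segments overlapping each ground truth nucleus's bounding box.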
Algorithm 1 Computing Aggregated Jaccard Index (AJI)
Input: A set of images with a combined set of annotated nuclei G_i, indexed by i, and a segmented set of nuclei S_k, indexed by k.
Output: Aggregated Jaccard index A.
1: Initialize overall correct and union pixel counts: C ← 0; U ← 0
2: for each ground truth nucleus G_i do
3:   j ← arg max_k (|G_i ∩ S_k| / |G_i ∪ S_k|)
4:   Update pixel counts: C ← C + |G_i ∩ S_j|; U ← U + |G_i ∪ S_j|
5:   Mark S_j as used
6: end for
7: for each segmented nucleus S_k do
8:   if S_k is not used, then U ← U + |S_k|
9: end for
10: A ← C / U

V. DEEP LEARNING-BASED NUCLEAR SEGMENTATION

When a large number of annotated examples are available, deep learning techniques, especially CNNs, have shown