Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Dan C. Ciresan, IDSIA, USI-SUPSI, Lugano 6900, dan@idsia.ch
Alessandro Giusti, IDSIA, USI-SUPSI, Lugano 6900, alessandrog@idsia.ch
Luca M. Gambardella, IDSIA, USI-SUPSI, Lugano 6900, luca@idsia.ch
Jürgen Schmidhuber, IDSIA, USI-SUPSI, Lugano 6900, juergen@idsia.ch

Abstract

We address a central problem of neuroanatomy, namely the automatic segmentation of neuronal structures depicted in stacks of electron microscopy (EM) images. To segment biological neuron membranes, we use a special type of deep artificial neural network as a pixel classifier. The label of each pixel (membrane or non-membrane) is predicted from raw pixel values in a square window centered on it. The input layer maps each window pixel to a neuron. It is followed by a succession of convolutional and max-pooling layers which preserve 2D information and extract features with increasing levels of abstraction. The output layer produces a calibrated probability for each class. The classifier is trained by plain gradient descent on a 512 × 512 × 30 stack with known ground truth, and tested on a stack of the same size (ground truth unknown to the authors) by the organizers of the ISBI 2012 EM Segmentation Challenge. Even without problem-specific postprocessing, our approach outperforms competing techniques by a large margin in all three considered metrics, i.e. rand error, warping error and pixel error. For pixel error, our approach is the only one outperforming a second human observer.
1 Introduction
How is the brain structured? The recent field of connectomics [2] is developing high-throughput techniques for mapping connections in nervous systems, one of the most important and ambitious goals of neuroanatomy. The main tool for studying connections at the neuron level is serial-section Transmitted Electron Microscopy (ssTEM). After preparation, a sample of neural tissue is typically sectioned into 50-nanometer slices; each slice is then recorded as a 2D grayscale image with a pixel size of about 4 × 4 nanometers (see Figure 1), resolving individual neurons and their shapes.
The visual complexity of the resulting stacks makes them hard to handle. Reliable automated segmentation of neuronal structures in ssTEM stacks has so far been infeasible. A solution of this problem, however, is essential for any automated pipeline reconstructing and mapping neural connections in 3D. Recent advances in automated sample preparation and imaging make this increasingly urgent, as they enable acquisition of huge datasets [6, 21] whose manual analysis is simply unfeasible.

Figure 1: Left: the training stack (one slice shown). Right: corresponding ground truth; black lines denote neuron membranes. Note the complexity of image appearance.
Our solution is based on a Deep Neural Network (DNN) [12, 13] used as a pixel classifier. The network computes the probability of a pixel being a membrane, using as input the image intensities in a square window centered on the pixel itself. An image is then segmented by classifying all of its pixels. The DNN is trained on a different stack with similar characteristics, in which membranes were manually annotated.
DNN are inspired by convolutional neural networks, introduced in 1980 [16], improved in the 1990s, refined and simplified in the 2000s, and brought to their full potential by making them both large and deep [12, 13]. Lately, DNN proved their efficiency on data sets extending from handwritten digits (MNIST) [10, 12] and handwritten characters [11] to 3D toys (NORB) [13] and faces [35]. Training huge nets requires months or even years on CPUs, where high data transfer latency prevented multi-threading code from saving the situation. Our fast GPU implementation [10, 12] overcomes this problem, speeding up single-threaded CPU code by up to two orders of magnitude.
Many other types of learning classifiers have been applied to segmentation of TEM images, where different structures are not easily characterized by intensity differences, and structure boundaries are not correlated with high image gradients due to noise and many confounding micro-structures. In most binary segmentation problems, classifiers are used to compute one or both of the following probabilities: (a) probability of a pixel belonging to each class; (b) probability of a boundary dividing two adjacent pixels. Segmentation through graph cuts [7] uses (a) as the unary term and (b) as the binary term. Some use an additional term to account for the expected geometry of neuron membranes [23].
We compute pixel probabilities only (point (a) above), and directly obtain a segmentation by mild smoothing and thresholding, without using graph cuts. Our main contribution lies, therefore, in the classifier itself. Others have used off-the-shelf random forest classifiers to compute unary terms of neuron membranes [22], or SVMs to compute both unary and binary terms for segmenting mitochondria [28, 27]. The former approach uses haar-like features and texture histograms computed on a small region around the pixel of interest, whereas the latter uses sophisticated rotational [17] and ray features computed on superpixels. In contrast, our approach does not rely on hand-designed features: due to their convolutional structure, the first layers of the network automatically learn to compute meaningful features during training.
The main contribution of the paper is a practical state-of-the-art segmentation method for neuron membranes in ssTEM data, described in Section 2. It outperforms existing methods, as validated in Section 3. The contribution is particularly meaningful because our approach does not rely on problem-specific postprocessing: fruitful application to different biomedical segmentation problems is therefore likely.
Figure 2: Overview of our approach (see text).
2 Methods
For each pixel we consider two possible classes, membrane and non-membrane. The DNN classifier (Section 2.1) computes the probability of a pixel p being of the former class, using as input the raw intensity values of a square window centered on p, with an edge of w pixels, w being an odd number to enforce symmetry. When a pixel is close to the image border, its window will include pixels outside the image boundaries; such pixels are synthesized by mirroring the pixels in the actual image across the boundary (see Figure 2).
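The mirror-padded window extraction can be sketched as follows (a minimal NumPy illustration; the function name is ours, not the paper's):

```python
import numpy as np

def extract_window(image, row, col, w):
    """Return the w-by-w window centered on pixel (row, col).

    w must be odd; pixels falling outside the image are synthesized
    by mirroring the image across its boundary.
    """
    assert w % 2 == 1, "window edge must be odd to enforce symmetry"
    r = w // 2
    # mode='reflect' mirrors across the boundary without repeating it
    padded = np.pad(image, r, mode="reflect")
    # in padded coordinates, (row, col) maps to (row + r, col + r),
    # so the window starts at (row, col)
    return padded[row:row + w, col:col + w]

# One such window is fed to the classifier for every pixel:
img = np.arange(25, dtype=float).reshape(5, 5)
win = extract_window(img, 0, 0, 3)  # corner pixel: needs mirroring
```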
The classifier is first trained using the provided training images (Section 2.2). After training, the classifier is applied to all pixels of the test images, producing a map of membrane probabilities, i.e. a new real-valued image the size of the input image. Binary membrane segmentation is obtained by mild postprocessing techniques, discussed in Section 2.3, followed by thresholding.
2.1 DNN architecture
A DNN [13] consists of a succession of convolutional, max-pooling and fully connected layers. It is a general hierarchical feature extractor that maps raw pixel intensities of the input image into a feature vector to be classified by several fully connected layers. All adjustable parameters are jointly optimized through minimization of the misclassification error over the training set.
Each convolutional layer performs a 2D convolution of its input maps with a square filter. The activations of the output maps are obtained by summing the convolutional responses, which are passed through a nonlinear activation function.
The biggest architectural difference between our DNN and earlier CNN [25] are max-pooling layers, used instead of sub-sampling: their outputs are given by the maximum activation over non-overlapping square regions. Max-pooling layers are fixed, non-trainable layers which select the most promising features. The DNN also have many more maps per layer, and thus many more connections and weights.
After 1 to 4 stages of convolutional and max-pooling layers, several fully connected layers further combine the outputs into a 1D feature vector. The output layer is always a fully connected layer with one neuron per class (two in our case). Using a softmax activation function for the last layer guarantees that each neuron's output activation can be interpreted as the probability of a particular input image belonging to that class.
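The three layer types can be sketched in NumPy (toy single-map versions; a real DNN has many maps per layer and learned filters):

```python
import numpy as np

def conv2d_valid(x, f):
    """'Valid' 2D filtering of one input map with a square filter
    (cross-correlation, as is conventional in CNN code; the flip is
    immaterial for learned filters)."""
    n = f.shape[0]
    out = np.empty((x.shape[0] - n + 1, x.shape[1] - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + n, j:j + n] * f)
    return out

def max_pool(x, k):
    """Maximum activation over non-overlapping k-by-k regions."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

def softmax(z):
    """Output activations that sum to 1, interpretable as class
    probabilities (two classes in our case)."""
    e = np.exp(z - z.max())
    return e / e.sum()
```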
2.2 Training
To train the classifier, we use all available slices of the training stack, i.e. 30 images with a 512 × 512 resolution. For each slice, we use all membrane pixels as positive examples (on average, about 50000) and the same amount of pixels randomly sampled (without repetitions) among all non-membrane pixels. This amounts to 3 million training examples in total, in which both classes are equally represented.
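The balanced sampling of one slice can be sketched as follows (NumPy; the function name is ours):

```python
import numpy as np

def balanced_pixel_sample(ground_truth, rng):
    """Return flat indices of all membrane pixels of a slice, plus an
    equal number of non-membrane pixels sampled without repetition,
    so both classes are equally represented."""
    mem = np.flatnonzero(ground_truth.ravel())           # all positives
    non = np.flatnonzero(~ground_truth.ravel())          # all negatives
    non = rng.choice(non, size=mem.size, replace=False)  # balance classes
    return mem, non

rng = np.random.default_rng(0)
gt = np.zeros((8, 8), dtype=bool)
gt[3, :] = True                  # a toy "membrane" row
mem, non = balanced_pixel_sample(gt, rng)
```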
As is often the case in TEM images (but not in other modalities, such as phase-contrast microscopy), the appearance of structures is not affected by their orientation. We take advantage of this property and synthetically augment the training set at the beginning of each epoch, by randomly mirroring each training instance and/or rotating it by ±90°.
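This per-epoch augmentation step amounts to (a NumPy sketch; the function name is ours):

```python
import numpy as np

def augment(window, rng):
    """Randomly mirror and/or rotate a training window by +/-90 degrees.
    Valid because structure appearance in TEM does not depend on
    orientation; pixel values are only rearranged, never changed."""
    if rng.random() < 0.5:
        window = np.fliplr(window)           # random mirroring
    k = rng.choice([0, 1, 3])                # 0, +90 or -90 degrees
    return np.rot90(window, k)

rng = np.random.default_rng(42)
w0 = np.arange(9, dtype=float).reshape(3, 3)
w1 = augment(w0, rng)
```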
2.3 Postprocessing of network outputs
Because the training set is balanced whereas membrane pixels are a minority in the actual data, network outputs cannot be directly interpreted as probability values; instead, they tend to severely overestimate the membrane probability. To fix this issue, a polynomial function post-processor is applied to the network outputs.
To compute its coefficients, a network N is trained on 20 slices of the training volume (Ttrain), and tested on the remaining 10 slices of the same volume (Ttest, for which ground truth is available). We compare all outputs obtained on Ttest (a total of 2.6 million instances) to ground truth, to compute the transformation relating the network output value and the actual probability of being a membrane. For example, we measure that among all pixels of Ttest which were classified by N as having a 50% probability of being membrane, only about 18% have in fact such a ground truth label; the reason being the different prevalence of membrane instances in Ttrain (i.e. 50%) and in Ttest (roughly 20%). The resulting function is well approximated by a monotone cubic polynomial, whose coefficients are computed by least-squares fitting. The same function is then used to calibrate the outputs of all trained networks.
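A minimal sketch of the fitting step (our own binning scheme; the paper does not specify how output/probability pairs are aggregated, and monotonicity is observed rather than enforced here):

```python
import numpy as np

def fit_calibration(net_outputs, labels, n_bins=20):
    """Fit a cubic polynomial mapping raw network output to the
    empirical probability of the pixel actually being membrane."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    centers, freqs = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (net_outputs >= lo) & (net_outputs < hi)
        if mask.any():
            centers.append(net_outputs[mask].mean())
            freqs.append(labels[mask].mean())  # empirical membrane freq.
    coeffs = np.polyfit(centers, freqs, deg=3)  # least-squares cubic
    return np.poly1d(coeffs)
```

The returned `poly1d` is then applied as a grayscale transformation to every output map of every trained network.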
After calibration (a grayscale transformation, in image processing terms), network outputs are spatially smoothed by a 2-pixel-radius median filter. This results in regularized membrane boundaries after thresholding.
2.4 Foveation and nonuniform sampling
We experimented with two related techniques for improving the network performance by manipulating its input data: foveation and nonuniform sampling (see Figure 3).
Foveation is inspired by the structure of human photoreceptor topography [14], and has recently been shown to be very effective for improving nonlocal-means denoising algorithms [15]. It preserves full detail only in the central part of the window (fovea), while the peripheral parts are defocused by means of a convolution with a disk kernel, to remove fine details. The network, whose task is to classify the center pixel of the window, is then forced to disregard such peripheral fine details, which are most likely irrelevant, while still retaining the general structure of the window (context).
Figure 3: Input windows with w = 65 from the training set. The first row shows the original window (Plain); other rows show effects of foveation (Fov), nonuniform sampling (Nu) and both (FovNu). Samples on the left and right correspond to instances of class Membrane and Non-membrane, respectively. The leftmost image illustrates how a checkerboard pattern is affected by such transformations.
Nonuniform sampling is motivated by the observation that (in this and other applications) larger window sizes w generally result in significant performance improvements. However, a large w results in much bigger networks, which take longer to train and, at least in theory, require larger amounts of training data to retain their generalization ability. With nonuniform sampling, image pixels are directly mapped to neurons only in the central part of the window; elsewhere, source pixels are sampled with decreasing resolution as the distance from the window center increases. As a result, the image in the window is deformed in a fisheye-like fashion, and covers a larger area of the input image with fewer neurons.
Simultaneously applying both techniques is a way of exploiting data at multiple resolutions: fine at the center, coarse in the periphery of the window.
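The nonuniform (fisheye-like) sampling step can be sketched as below. The quadratic warp is our illustrative choice, not necessarily the paper's exact mapping:

```python
import numpy as np

def nonuniform_sample(window, out_edge):
    """Resample a large square window to out_edge x out_edge neurons:
    full resolution at the center, decreasing resolution toward the
    periphery, so fewer neurons cover a larger image area."""
    w = window.shape[0]
    c = (w - 1) / 2.0
    # normalized output coordinates in [-1, 1]
    t = np.linspace(-1.0, 1.0, out_edge)
    # quadratic warp: source samples dense near 0, sparse at the borders
    src = c + np.sign(t) * (t ** 2) * c
    idx = np.clip(np.rint(src).astype(int), 0, w - 1)
    return window[np.ix_(idx, idx)]

# e.g. feed a 95-pixel context to a network with a 65-neuron input edge
big = np.arange(95 * 95, dtype=float).reshape(95, 95)
small = nonuniform_sample(big, 65)
```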
2.5 Averaging outputs of multiple networks
We observed that large networks with different architectures often exhibit significant output differences for many image parts, despite being trained on the same data. This suggests that these powerful and flexible classifiers exhibit relatively large variance but low bias. It is therefore reasonable to attempt to reduce such variance by averaging the calibrated outputs of several networks with different architectures.
This was experimentally verified. The submissions obtained by averaging the outputs of multiple large networks scored significantly better in all metrics than the single networks.
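The ensemble step is a plain pixelwise mean of the calibrated maps (function name ours):

```python
import numpy as np

def ensemble_probability(calibrated_maps):
    """Average the calibrated output maps of several trained networks,
    reducing the variance of these low-bias classifiers."""
    return np.mean(np.stack(calibrated_maps), axis=0)

# two toy calibrated membrane-probability maps from different networks
a = np.array([[0.2, 0.8], [0.4, 0.6]])
b = np.array([[0.4, 0.6], [0.2, 0.8]])
avg = ensemble_probability([a, b])
```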
3 Experimental results
All experiments are performed on a computer with a Core i7 950 3.06 GHz processor, 24 GB of RAM and four GTX 580 graphics cards. A GPU implementation [12] accelerates the forward propagation and back propagation routines by a factor of 50.
We validate our approach on the publicly-available dataset [9] provided by the organizers of the ISBI 2012 EM Segmentation Challenge [1], which represents two portions of the ventral nerve cord of a Drosophila larva. The dataset is composed of two 512 × 512 × 30 stacks, one used for training, one for testing. Each stack covers a 2 × 2 × 1.5 μm volume, with a resolution of 4 × 4 × 50 nm/pixel. For the training stack, a manually annotated ground truth segmentation is provided. For the testing stack, the organizers obtained (but did not distribute) two manual segmentations by different expert neuroanatomists. One is used as ground truth, the other to evaluate the performance of a second human observer and provide a meaningful comparison for the algorithms' performance.
A segmentation of the testing stack is evaluated through an automated online system, which computes three error metrics in relation to the hidden ground truth:
Rand error: defined as 1 − Frand, where Frand represents the F1 score of the Rand index [29], which measures the accuracy with which pixels are associated to their respective neurons.
Warping error: a segmentation metric designed to account for topological disagreements [19]; it accounts for the number of neuron splits and mergers required to obtain the candidate segmentation from ground truth.
Pixel error: defined as 1 − Fpixel, where Fpixel represents the F1 score of pixel similarity.
The automated system accepts a stack of grayscale images representing membrane probability values for each pixel; the stack is thresholded using 9 different threshold values, obtaining 9 binary stacks. For each of the stacks, the system computes the error measures above, and returns the minimum error.
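For the simplest of the three metrics, this evaluation protocol can be sketched as follows (the exact 9 threshold values used by the server are not stated in the text; the ones below are an assumption):

```python
import numpy as np

def pixel_error(prob_map, gt, thresholds=np.linspace(0.1, 0.9, 9)):
    """Pixel error = 1 - F1 score of pixel similarity, minimized over
    9 candidate threshold values, mimicking the challenge server."""
    best = 1.0
    for t in thresholds:
        pred = prob_map >= t                 # one binary stack per t
        tp = np.sum(pred & gt)
        fp = np.sum(pred & ~gt)
        fn = np.sum(~pred & gt)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        best = min(best, 1.0 - f1)           # keep the minimum error
    return best
```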
Pixel error is clearly not a suitable indicator of segmentation quality in this context, and is reported mostly for reference. Rand and warping error metrics have various strengths and weaknesses, without clear consensus in favor of either. The former tends to provide a more consistent measure, but penalizes even slightly misplaced borders, which would not be problematic in most practical applications. The latter has a more intuitive interpretation, but completely disregards non-topological errors.
We train four networks, N1, N2, N3 and N4, with slightly different architectures and window sizes w = 65 (for N1, N2, N3) and w = 95 (for N4); all networks use foveation and nonuniform sampling.