An Introduction to Image Segmentation
Chien-Chi Chen
E-mail: zdadadaz@yahoo.com.tw
Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan, ROC

Abstract
Segmentation is used to identify the objects of interest in an image. We introduce three approaches: edge detection, thresholding, and region-based segmentation. These three methods cannot solve every segmentation problem we may meet, but they are the basic methods of segmentation.

1. Introduction
We first discuss the case of monochrome, static images. The fundamental problem in segmentation is to partition an image into regions. Segmentation algorithms for monochrome images generally fall into one of two basic categories: edge-based segmentation and region-based segmentation. A further method is to use a threshold; it belongs to edge-based segmentation. There are three goals we want to achieve:
1. Speed. Segmentation should save time so that the more complicated compression stages have more time to process.
2. Good shape matching, even under less computation time.
3. Connectivity: the segmented shape should be intact rather than fragmentary.

2. Edge-Based Segmentation
The focus of this section is on segmentation methods based on the detection of sharp, local changes in intensity. The three types of image feature in which we are interested are isolated points, lines, and edges. Edge pixels are pixels at which the intensity of the image function changes abruptly.

2.1. Fundamentals
Local changes in intensity can be detected using derivatives. For reasons that will become evident, first- and second-order derivatives are particularly well suited for this purpose.
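As a quick numerical illustration (a sketch added here, not taken from the original text), the digital first derivative f(x+1) - f(x) and second derivative f(x+1) - 2f(x) + f(x-1) can be applied to a 1-D profile containing a ramp and a step; the profile values are an arbitrary example:

```python
# 1-D profile: a downward ramp (indices 1-5), a flat run, then an upward step (7 -> 8)
profile = [5, 5, 4, 3, 2, 1, 1, 1, 6, 6, 6]

# digital first derivative: f(x+1) - f(x)
first = [profile[x + 1] - profile[x] for x in range(len(profile) - 1)]

# digital second derivative: f(x+1) - 2f(x) + f(x-1)
second = [profile[x + 1] - 2 * profile[x] + profile[x - 1]
          for x in range(1, len(profile) - 1)]

print(first)   # -> [0, -1, -1, -1, -1, 0, 0, 5, 0, 0]
print(second)  # -> [-1, 0, 0, 0, 1, 0, 5, -5, 0]
```

The first derivative is nonzero along the whole ramp (a thicker edge), while the second derivative responds only at the two ends of the ramp and gives a double response (+5, -5) with a zero crossing at the step, matching the conclusions drawn from Figure 2.1.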
Figure 2.1 The intensity histogram of an image.

We can draw the following conclusions from Figure 2.1:
1. First-order derivatives generally produce thicker edges in an image.
2. Second-order derivatives have a stronger response to fine detail, such as thin lines, isolated points, and noise.
3. Second-order derivatives produce a double-edge response at ramp and step transitions in intensity.
4. The sign of the second derivative can be used to determine whether a transition into an edge is from light to dark or from dark to light.

2.2. Isolated Points
Based on the conclusions reached in the preceding section, point detection should be based on the second derivative, so we expect to use a Laplacian mask.

Figure 2.2 The mask for isolated-point detection.

We use the mask to scan every point of the image and compute the response R(x,y) at each point. If the magnitude of the response at a point is greater than T (the threshold we set), we set that point to 1 (light); otherwise we set it to 0 (dark):

    g(x,y) = 1 if |R(x,y)| >= T, 0 otherwise    (2.1)

2.3. Line Detection
As discussed in Section 2.1, the second-order derivative has a stronger response and produces thinner lines than the first derivative. We can build masks for four different directions.

Figure 2.3 Line detection masks.

Now consider how to decide which of the four masks fits best. Let R1, R2, R3, and R4 denote the responses of the masks in Figure 2.3. If, at a point in the image, |Rk| > |Rj| for all j != k (for example, |R1| > |Rj| for j = 2, 3, 4), that point is said to be more likely associated with a line in the direction of mask k.

2.4. Edge Detection

Figure 2.4 (a) Two regions of constant intensity separated by an ideal vertical ramp edge. (b) Detail near the edge, showing a horizontal intensity profile.

We conclude from this observation that the magnitude of the first derivative can be used to detect the presence of an edge at a point in an image.
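Returning to Section 2.2, the point-detection rule of Eq. (2.1) can be sketched as follows. This is a minimal illustration: the 8-center Laplacian mask and the use of |R| are the usual choices and are assumed here, and border pixels are simply left at 0:

```python
# 3x3 Laplacian point-detection mask (center 8, neighbors -1), as in Figure 2.2
MASK = [[-1, -1, -1],
        [-1,  8, -1],
        [-1, -1, -1]]

def detect_points(img, T):
    """Mark pixels whose mask response magnitude exceeds T, per Eq. (2.1)."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(1, rows - 1):
        for y in range(1, cols - 1):
            R = sum(MASK[i][j] * img[x - 1 + i][y - 1 + j]
                    for i in range(3) for j in range(3))
            out[x][y] = 1 if abs(R) > T else 0
    return out

img = [[0] * 5 for _ in range(5)]
img[2][2] = 10                 # one isolated bright point
out = detect_points(img, T=50) # only the isolated point gives |R| = 80 > 50
```

A uniform region gives R = 0 everywhere, so only the isolated point survives the threshold.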
The second derivative has two properties: (1) it produces two values for every edge in an image; (2) its zero crossings can be used to locate the centers of thick edges.

2.4.1. Basic Edge Detection (Gradient)
The image gradient finds the edge strength and direction at location (x,y) of an image f, and is defined as the vector

    ∇f = grad(f) = [gx, gy]^T = [∂f/∂x, ∂f/∂y]^T    (2.2)

The magnitude (length) of the vector ∇f is denoted

    M(x,y) = mag(∇f) = sqrt(gx^2 + gy^2)    (2.3)

The direction of the gradient vector is given by the angle

    α(x,y) = tan^(-1)(gy/gx)    (2.4)

The direction of an edge at an arbitrary point (x,y) is orthogonal to the direction of the gradient. Since we are dealing with digital quantities, a digital approximation of the partial derivatives over a neighborhood about a point is required.

1. Roberts cross-gradient operators (Roberts [1965]).

Figure 2.5 Roberts masks.

    gx = ∂f/∂x = z9 - z5    (2.5)
    gy = ∂f/∂y = z8 - z6    (2.6)

2. Prewitt operators.

Figure 2.6 Prewitt masks.

    gx = ∂f/∂x = (z7 + z8 + z9) - (z1 + z2 + z3)    (2.7)
    gy = ∂f/∂y = (z3 + z6 + z9) - (z1 + z4 + z7)    (2.8)

3. Sobel operators.

Figure 2.7 (a)~(g) A region of an image and the various masks used to compute the gradient at the point labeled z5.

    gx = ∂f/∂x = (z7 + 2z8 + z9) - (z1 + 2z2 + z3)    (2.9)
    gy = ∂f/∂y = (z3 + 2z6 + z9) - (z1 + 2z4 + z7)    (2.10)

The Sobel masks use a weight of 2 in the center location to provide image smoothing. The Prewitt masks are simpler to implement than the Sobel masks, but the Sobel masks' better noise-suppression (smoothing) characteristics make them preferable. So far we have computed gx and gy exactly; however, this implementation is not always desirable, so a frequently used approach is to approximate the magnitude of the gradient by absolute values:

    M(x,y) ≈ |gx| + |gy|    (2.11)

2.4.2. The Marr-Hildreth Edge Detector (LoG)
The methods in use at the time were based on small operators, and as discussed previously the second derivative is better than the first derivative for small operators, so we use the Laplacian.
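Before detailing the LoG, the gradient operators of Section 2.4.1 can be sketched at a single pixel; this minimal version applies the Sobel masks of Eqs. (2.9)-(2.10) with the absolute-value magnitude of Eq. (2.11):

```python
def sobel_gradient(img, x, y):
    """Approximate gradient magnitude |gx| + |gy| at (x, y) using Sobel masks."""
    # z[0]..z[8] correspond to z1..z9, the 3x3 neighborhood of Figure 2.7
    z = [img[x - 1 + i][y - 1 + j] for i in range(3) for j in range(3)]
    gx = (z[6] + 2 * z[7] + z[8]) - (z[0] + 2 * z[1] + z[2])  # Eq. (2.9)
    gy = (z[2] + 2 * z[5] + z[8]) - (z[0] + 2 * z[3] + z[6])  # Eq. (2.10)
    return abs(gx) + abs(gy)                                   # Eq. (2.11)

# a vertical edge between columns 1 and 2
img = [[0, 0, 10, 10, 10] for _ in range(5)]
print(sobel_gradient(img, 2, 2))  # -> 40, strong response on the edge
print(sobel_gradient(img, 2, 3))  # -> 0, no response in the flat region
```

On the edge one difference term sums to 40 while the other cancels; inside a constant region both cancel.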
Figure 2.8 A 5x5 mask of the LoG.

The Marr-Hildreth algorithm consists of convolving the LoG filter with an input image f(x,y):

    g(x,y) = [∇²G(x,y)] * f(x,y)    (2.12)

Because these are linear processes, Eq. (2.12) can also be written as

    g(x,y) = ∇²[G(x,y) * f(x,y)]    (2.13)

The edge-detection algorithm may be summarized as follows:
1. Filter the input image with an n×n Gaussian lowpass filter (this smooths away large numbers of small spatial details).
2. Compute the Laplacian of the image resulting from Step 1.
3. Find the zero crossings of the image from Step 2.
To specify the size of the Gaussian filter, recall that about 99.7% of the volume under a 2-D Gaussian surface lies within ±3σ about the mean, so n ≥ 6σ.

Figure 2.9 (a) Input image. (b) Result of the LoG detector with threshold 200.

2.5. Edge Linking and Boundary Detection
Edge detection should yield sets of pixels lying on edges, but noise and nonuniform illumination cause breaks in the edges. Therefore, edge detection is typically followed by linking algorithms designed to assemble edge pixels into meaningful edges and/or region boundaries.

2.5.1. Local Processing
This kind of edge linking analyzes the characteristics of pixels in a small neighborhood about every point (x,y) that has been declared an edge point by the previous techniques. The two principal properties used for establishing similarity of edge pixels are:
1. The strength (magnitude) of the gradient:

    |M(s,t) - M(x,y)| ≤ E, for (s,t) in Sxy    (2.14)

where Sxy denotes the set of coordinates of a neighborhood centered at point (x,y), and E is a positive threshold.
2. The direction of the gradient vector:

    |α(s,t) - α(x,y)| ≤ A    (2.15)

where A is a positive angle threshold.

2.5.2. Regional Processing
Often, the locations of regions of interest in an image are known or can be determined. In such situations, we can use techniques for linking pixels on a regional basis, with the desired result being an approximation to the boundary of the region.
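Stepping back to Section 2.4.2, the three LoG steps can be sketched as follows. This is a deliberately minimal version, with several stated simplifications: a 3x3 smoothing kernel stands in for the n×n Gaussian, convolutions are "valid" (no padding), only horizontal sign changes are marked as zero crossings, and no threshold is applied:

```python
def convolve(img, k):
    """'Valid' convolution of img with a 3x3 kernel k (output shrinks by 2)."""
    rows, cols = len(img), len(img[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for x in range(rows - 2):
        for y in range(cols - 2):
            out[x][y] = sum(k[i][j] * img[x + i][y + j]
                            for i in range(3) for j in range(3))
    return out

GAUSS = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]      # Step 1: lowpass (unnormalized)
LAPLACE = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]   # Step 2: Laplacian

def marr_hildreth(img):
    lap = convolve(convolve(img, GAUSS), LAPLACE)
    # Step 3: mark horizontal zero crossings (sign change between neighbors)
    return [[1 if y + 1 < len(row) and row[y] * row[y + 1] < 0 else 0
             for y in range(len(row))] for row in lap]

# a vertical step edge; every row is identical
img = [[0, 0, 0, 0, 10, 10, 10, 10] for _ in range(7)]
edges = marr_hildreth(img)
print(edges[0])  # -> [0, 1, 0, 0]: one zero crossing, centered on the step
```

The Laplacian of the smoothed step is positive on one side and negative on the other, so the zero crossing localizes the edge at a single position per row.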
We discuss the mechanics of the procedure using Figure 2.10.

Figure 2.10 Illustration of the iterative polygonal-fit algorithm.

An algorithm for finding a polygonal fit may be stated as follows:
1. Specify two starting points, A and B.
2. Connect A and B, find the point farthest from the line AB, and, if its distance is larger than a threshold T, define that point as a vertex of the polygon.
3. Connect all the vertices we have and repeat Step 2 for each segment, until the distance between every point and its connecting line is smaller than T.

2.5.3. Global Processing Using the Hough Transform
In regional processing, it makes sense to link a given set of pixels only if we know that they are part of the boundary of a meaningful region. Often, however, we have to work with unstructured environments in which a set of edge pixels is all we have. We can use the Hough transform, a coordinate transformation, to find the points that lie on a common line.

Figure 2.11 (a) The xy-plane. (b) Parameter space.

Consider a point (xi, yi) in the xy-plane and the general equation of a straight line in slope-intercept form, yi = a·xi + b. A second point (xj, yj) also has a line in parameter space associated with it. A practical difficulty with this approach is that a (the slope of the line) approaches infinity as the line approaches the vertical direction. One way around this is to use the normal representation of a line:

    x·cosθ + y·sinθ = ρ    (2.5-3)

Figure 2.12 (a) A line in the xy-plane. (b) Sinusoidal curves in the ρθ-plane; the point of intersection (ρ,θ) corresponds to the line passing through the points (xi, yi) and (xj, yj) in the xy-plane.

2.6. Segmentation Using Morphological Watersheds
2.6.1. Background
The concept of the watershed is based on visualizing an image in three dimensions: two spatial coordinates and intensity. We consider three types of points:
1. The points belonging to a local minimum.
2. The points at which a drop of water, if placed at those locations, would fall to a single local minimum; the points falling to one minimum form its catchment basin or watershed.
3.
The points at which water would be equally likely to fall into more than one local minimum. These points form the crest lines of the topographic surface and are termed divide lines or watershed lines.

The two main properties of watershed segmentation results are continuous boundaries and over-segmentation. The boundaries produced by the watershed algorithm are exactly the watershed lines in the image, so the number of regions is basically equal to the number of minima in the image. There are two steps in the marker-based solution:
1. Preprocessing.
2. Defining the criteria that the markers have to satisfy.

The following figures show the mechanism of dam construction.

Figure 2.10 (a)~(d) The watershed algorithm.

Suppose that Figure 2.10(a) is the input image and that the height of the "mountains" is proportional to the intensity values of the input image. We flood water from below by letting the water rise through the holes at a uniform rate. In Figure 2.10(b), water has risen into the first and second catchment basins, so we construct a dam to stop it from overflowing, and we repeat this motion step by step.

2.6.2. The Use of Markers
Direct application of the watershed segmentation algorithm in the form discussed in the previous section generally leads to oversegmentation, due to noise and other local irregularities of the gradient. An approach used to control oversegmentation is based on the concept of markers. We have internal markers, associated with objects of interest, and external markers, associated with the background.
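The flooding process described in Section 2.6.1 can be sketched in one dimension. This is a toy illustration (the function name and the strict-minima seeding are assumptions of this sketch, and plateaus of equal height are not handled): basins grow outward from local minima as the water level rises, and a pixel reachable from two different basins becomes a watershed line, i.e. a dam:

```python
def watershed_1d(h):
    """Label a 1-D 'terrain' h: basin numbers 1, 2, ... and -1 for watershed lines."""
    n = len(h)
    labels = [0] * n                      # 0 = not yet flooded
    next_label = 1
    for i in range(n):                    # seed: strict local minima
        left = h[i - 1] if i > 0 else float('inf')
        right = h[i + 1] if i < n - 1 else float('inf')
        if h[i] < left and h[i] < right:
            labels[i] = next_label
            next_label += 1
    for level in sorted(set(h)):          # raise the water one level at a time
        changed = True
        while changed:
            changed = False
            for i in range(n):
                if labels[i] != 0 or h[i] > level:
                    continue
                neigh = {labels[j] for j in (i - 1, i + 1)
                         if 0 <= j < n and labels[j] > 0}
                if len(neigh) == 1:       # water reaches from one basin
                    labels[i] = neigh.pop()
                    changed = True
                elif len(neigh) > 1:      # two basins meet: build a dam
                    labels[i] = -1
                    changed = True
    return labels

print(watershed_1d([3, 1, 2, 4, 2, 0, 3]))  # -> [1, 1, 1, -1, 2, 2, 2]
```

The two minima (heights 1 and 0) seed two basins, and the peak of height 4 between them becomes the watershed line, so there are as many regions as minima, as noted above.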
A procedure for marker selection typically consists of two principal steps: (1) preprocessing (usually smoothing); (2) definition of a set of criteria that markers must satisfy (edge detection is then performed for every small region).

Figure 2.13 (a) Electrophoresis image. (b) Result of applying the watershed segmentation algorithm to the gradient image; oversegmentation is evident. (c) Image obtained by smoothing (b), showing internal markers (light gray regions) and external markers (watershed lines). (d) Result of segmentation; note the improvement over (b). (Courtesy of Dr. S. Beucher, CMM/Ecole des Mines de Paris.)

2.7. Edge Detection Using the Hilbert Transform (HLT)
Compared with the derivative (differential) methods, the impulse response of the HLT is much longer. The longer impulse response reduces the sensitivity of the edge detector and at the same time reduces the influence of noise: the detector is less sensitive, but gives good detection of ramp edges and more robustness against noise. We list the mathematics of the discrete-time version of the HLT below:

    gH[n] = IDFT{ H[p] · DFT{g[n]} }    (2.16)

where

    DFT{g[n]} = Σ_{n=0}^{N-1} g[n] e^{-j2πpn/N}    (2.17)
    IDFT{F[m]} = (1/N) Σ_{m=0}^{N-1} F[m] e^{j2πnm/N}    (2.18)

and

    H[p] = -j for 0 < p < N/2,
    H[p] = j for N/2 < p < N,
    H[0] = H[N/2] = 0.

2.7.1. Short-Response Hilbert Transform (SRHLT)
We have seen the advantages and disadvantages of the derivative method and the HLT method for detecting edges. S. C. Pei and J. J. Ding proposed another method, combining the two, to detect edges in 2007 [D-1][D-4]. They combine the HLT and differentiation to define the short-response Hilbert transform (SRHLT):

    gH(x) = hb(x) * g(x), where hb(x) = b·csch(πbx)    (2.19)
    GH(f) = Hb(f)·G(f), where GH(f) = FT[gH(x)], G(f) = FT[g(x)], Hb(f) = -j·tanh(πf/b)    (2.20)
When b → 0+ (a positive number very near 0), the SRHLT becomes the HLT. When b → ∞, the SRHLT tends to the differentiation operation. A suitable value of b is chosen between the HLT (b → 0) and differentiation (b → ∞).

Figure 2.14 The characteristics of the SRHLT.

Figure 2.15 Impulse responses and their FTs of the SRHLT for different b (time domain and frequency domain; Hilbert transform, SRHLT with b = 0.25, 1, 4, and differentiation).

                       Higher b (differentiation)    Lower b (HLT)
    Impulse response   shorter                       longer
    Noise robustness   bad                           good
    Type of edge       step                          ramp
    Output             sharp                         thick

3. Thresholding
3.1. Basic Global Thresholding
Since we need only the histogram of the image to segment it, segmenting images with the threshold technique does not involve the spatial information of the image. Therefore, problems may be caused by noise, blurred edges, or outliers in the image. That is why we say this method is the simplest concept for segmenting images. When the intensity distributions of object and background pixels are sufficiently distinct, it is possible to use a single (global) threshold applicable over the entire image. The following iterative algorithm can be used for this purpose:
1. Select an initial estimate for the global threshold, T.
2. Segment the image using T as

    g(x,y) = 1 if f(x,y) > T, 0 if f(x,y) ≤ T    (3.1)

This produces two groups of pixels: G1, consisting of all pixels with intensity values > T, and G2, consisting of pixels with values ≤ T.
3. Compute the average (mean) intensity values m1 and m2 for the pixels in G1 and G2.
4. Compute a new threshold value:

    T = (m1 + m2)/2

5. Repeat Steps 2 through 4 until the difference between values of T in successive iterations is smaller than a predefined parameter.

3.2.
Optimum Global Thresholding Using Otsu's Method
Thresholding may be viewed as a statistical-decision-theory problem whose objective is to minimize the average error incurred in assigning pixels to two or more groups. Let {0, 1, 2, ..., L-1} denote the L distinct intensity levels in a digital image of size M×N pixels, and let ni denote the number of pixels with intensity i. The total number of pixels in the image is MN = n0 + n1 + n2 + ... + n(L-1). The normalized histogram has components pi = ni/MN, from which it follows that

    Σ_{i=0}^{L-1} pi = 1, pi ≥ 0    (3.2)

Now we select a threshold T(k) = k, 0 < k < L-1, and use it to threshold the input image into two classes, C1 and C2, where C1 consists of the pixels with intensity in the range [0, k] and C2 of those in [k+1, L-1]. Using this threshold, the probability P1(k) that a pixel is assigned to C1 is given by the cumulative sum

    P1(k) = Σ_{i=0}^{k} pi    (3.3)

    P2(k) = Σ_{i=k+1}^{L-1} pi = 1 - P1(k)    (3.4)

The validity of the following two equations can be verified by direct substitution of the preceding results:

    P1·m1 + P2·m2 = mG    (3.5)
    P1 + P2 = 1    (3.6)

where m1 and m2 are the mean intensities of C1 and C2 and mG is the global mean intensity. In order to evaluate the "goodness" of the threshold at level k, we use the normalized, dimensionless metric

    η = σB²(k) / σG²    (3.7)

where σG² is the global variance,

    σG² = Σ_{i=0}^{L-1} (i - mG)²·pi    (3.8)

and σB² is the between-class variance, defined as

    σB² = P1(m1 - mG)² + P2(m2 - mG)²    (3.9)

        = P1·P2·(m1 - m2)² = (mG·P1(k) - m(k))² / [P1(k)(1 - P1(k))]    (3.10)

where m(k) = Σ_{i=0}^{k} i·pi is the cumulative mean up to level k. Eq. (3.10) indicates that the between-class variance σB² is a measure of separability between classes. The optimum threshold is then the value k* that maximizes σB²(k):

    σB²(k*) = max_{0 ≤ k ≤ L-1} σB²(k)    (3.11)

In other words, to find k* we simply evaluate Eq. (3.11) for all integer values of k. Once k* has been obtained, the input image f(x,y) is segmented as before:

    g(x,y) = 1 if f(x,y) > k*, 0 if f(x,y) ≤ k*    (3.12)

for x = 0, 1, 2, ..., M-1 and y = 0, 1, 2, ..., N-1. The separability measure has values in the range

    0 ≤ η(k*) ≤ 1    (3.13)

3.2.1.
Using Image Smoothing/Edge Detection to Improve Global Thresholding
We can compare the two kinds of preprocessing, smoothing and edge detection:

    Preprocessing     Situation for which the method is more suitable
    Smoothing         the object we are interested in is large
    Edge detection    the object we are interested in is small

3.3. Multiple Thresholds
For three classes consisting of three intensity intervals, the between-class variance is given by

    σB² = P1(m1 - mG)² + P2(m2 - mG)² + P3(m3 - mG)²    (3.14)

The following relationships hold:

    P1·m1 + P2·m2 + P3·m3 = mG    (3.15)
    P1 + P2 + P3 = 1    (3.16)

The optimum threshold values, k1* and k2*, satisfy

    σB²(k1*, k2*) = max_{0 < k1 < k2 < L-1} σB²(k1, k2)    (3.17)

Finally, we note that the separability measure defined in Section 3.2 for one threshold extends directly to multiple thresholds:

    η(k1*, k2*) = σB²(k1*, k2*) / σG²    (3.18)

3.4. Variable Thresholding
Image partitioning
One of the simplest approaches to variable thresholding is to subdivide an image into nonoverlapping rectangles. This approach is used to compensate for non-uniformities in illumination and/or reflection.

Figure 3.1 (a) Noisy, shaded image. (b) Image subdivided into six subimages. (c) Result of applying Otsu's method to each subimage individually.

Image subdivision generally works well when the objects of interest and the background occupy regions of reasonably comparable size. When this is not the case, the method typically fails.

Variable thresholding based on local image properties
We illustrate the basic approach to local thresholding using the standard deviation and mean of the pixels in a neighborhood of every point in an image. Let σxy and mxy denote the standard deviation and mean value of the set of pixels contained in a neighborhood Sxy. Then

    g(x,y) = 1 if Q(local parameters) is true, 0 otherwise    (3.19)

where Q is a predicate based on parameters computed using the pixels in the neighborhood.
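Returning to Otsu's method, the maximization of Eqs. (3.10)-(3.11) can be sketched as follows. This is a minimal version working directly on a histogram; ties are broken by taking the first maximizer, and the function name is a choice of this sketch:

```python
def otsu_threshold(hist):
    """Return k* maximizing the between-class variance of Eq. (3.10).

    hist[i] = number of pixels with intensity i; L = len(hist).
    """
    MN = sum(hist)
    p = [n / MN for n in hist]                      # normalized histogram, Eq. (3.2)
    mG = sum(i * p[i] for i in range(len(p)))       # global mean
    best_k, best_var = 0, 0.0
    P1, m = 0.0, 0.0                                # cumulative sum P1(k), mean m(k)
    for k in range(len(p) - 1):
        P1 += p[k]                                  # Eq. (3.3)
        m += k * p[k]
        if 0 < P1 < 1:                              # both classes nonempty
            var_b = (mG * P1 - m) ** 2 / (P1 * (1 - P1))   # Eq. (3.10)
            if var_b > best_var:
                best_var, best_k = var_b, k
    return best_k

# bimodal histogram: 8 dark pixels (levels 0-1), 8 bright pixels (levels 4-5)
print(otsu_threshold([4, 4, 0, 0, 4, 4]))  # -> 1, splitting the two modes
```

One pass over the histogram suffices because P1(k) and m(k) are cumulative sums, so each candidate k costs O(1).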
For example, the predicate may be taken as

    Q(σxy, mxy) = true if f(x,y) > a·σxy AND f(x,y) > b·mxy, and false otherwise    (3.20)

Using moving averages
We can compute a moving average along the scan lines of an image. This implementation is quite useful in document processing, where speed is a fundamental requirement. Scanning typically is carried out line by line in a zigzag pattern to reduce illumination bias.

    m(k+1) = (1/n) Σ_{i=k+2-n}^{k+1} zi = m(k) + (1/n)(z_{k+1} - z_{k+1-n})    (3.21)

Here z_{k+1} denotes the intensity of the point encountered in the scanning sequence at step k+1, n denotes the number of points used in computing the average, and m(1) = z1/n is the initial value. Segmentation is implemented using Eq. (3.1) with the threshold Txy = b·mxy, where b is a constant and mxy is the moving average at point (x,y) in the input image.

Multivariable thresholding
So far we have been concerned with thresholding based on a single variable: gray-scale intensity. A notable example of multivariable thresholding is color imaging, where the red (R), green (G), and blue (B) components form a composite color image. Each pixel can be represented as a 3-D vector z = (z1, z2, z3)^T whose components are the RGB values at that point. Let a denote the average reddish color in which we are interested, and let D(z, a) be a distance measure between an arbitrary color point z and a; then we segment the input image as follows:

    g(x,y) = 1 if D(z, a) < T, 0 otherwise    (3.22)

Note that the inequalities in this equation are the opposite of those in the equations we used before; the reason is that the equation D(z, a) = T defines a volume. One possible distance measure is the Euclidean distance

    D(z, a) = ||z - a|| = [(z - a)^T (z - a)]^(1/2)    (3.23)

A more powerful distance measure is the so-called Mahalanobis distance

    D(z, a) = [(z - a)^T C^(-1) (z - a)]^(1/2)    (3.24)

where C is the covariance matrix of the z's; when C = I, the identity matrix, Eq. (3.24) reduces to Eq. (3.23).

4. Region-Based Segmentation
4.1. Region Growing
Region growing is an approach that examines the neighboring pixels of initial "seed points" and determines whether those neighbors should be added to the region.
Step 1.
Selecting a set of one or more starting points (seeds) can often be based on the nature of the problem.
Step 2. Grow the regions from these seed points to adjacent points, depending on a threshold or criteria (e.g., 8-connectivity) that we define.
Step 3. Stop region growth when no more pixels satisfy the criteria for inclusion in that region.

Figure 4.1 (a) Original image. (b) Use Step 1 to find seeds based on the nature of the problem. (c) Use Step 2 (4-connectivity here) to grow the regions and find similar points. (d)(e) Repeat Step 2 until no more pixels satisfy the criteria. (f) The final image.

We can then draw several important conclusions about region growing:
1. The suitable selection of seed points is important; the selection depends on the user.
2. More information about the image is better. Obviously, connectivity or pixel-adjacency information helps us determine the threshold and the seed points.
3. The "minimum area threshold": no region in the region-growing result will be smaller than this threshold in the segmented image.
4. The "similarity threshold value": if the difference of pixel values, or the difference of the average gray levels of two sets of pixels, is less than the similarity threshold value, the regions will be considered the same region.
5. The result of an image after region growing may still contain points whose gray level is higher than the threshold but which are not connected to the object in the image.

We briefly summarize the advantages and disadvantages of region growing.
Advantages:
1. Region-growing methods can correctly separate regions that have the same properties we define.
2. Region-growing methods can provide good segmentation results for original images that have clear edges.
3. The concept is simple. We need only a small number of seed points to represent the property we want, and then grow the region.
4. We can choose multiple criteria at the same time.
5. It performs well with respect to noise; that is, the result has good shape matching.
Disadvantages:
1. The computation is expensive, in both time and power.
2. This method may not distinguish the shading of real images.
In conclusion, the region-growing method performs well, with good shape matching and connectivity. Its most serious problem is the time consumed.

4.2. Simulation of Region Growing Using C++

Figure 4.2 Lena image after region growing; 90% of the pixels have been classified. Threshold/time: 20 / 4.7 seconds.

The method produces connected regions, but it needs more time to process.

4.3. Region Splitting and Merging
An alternative method is to subdivide an image initially into a set of arbitrary, disjoint regions and then merge and/or split the regions. Using quadtrees, we subdivide each quadrant into subquadrants as follows:
1. Split into four disjoint quadrants any region Ri for which Q(Ri) = FALSE (the region does not satisfy the homogeneity predicate Q).
2. When no further splitting is possible, merge any adjacent regions Rj and Rk for which Q(Rj ∪ Rk) = TRUE (meaning that Rj and Rk are similar under the predicate we define).
3. Stop when no further merging is possible.

Advantage of region splitting and merging: we can split the image by choosing the criterion we want, such as the variance or the mean of the pixel values within a segment, and the splitting criterion can be different from the merging criterion.
Disadvantages:
1. The computation is intensive.
2. It probably produces blocky segments. The blocky-segment effect can be reduced by splitting at higher resolution, but at the same time the computational problem becomes more serious.

4.4.
Data Clustering
The main idea of data clustering is to use centroids (prototypes) to represent clusters containing huge numbers of points, with two goals: reducing the computational time of the image processing, and providing a better condition on the segmented image for us to compress it. Hierarchical and partitional clustering differ as follows: in hierarchical clustering we can change the number of clusters at any time during the process, while in partitional clustering we have to decide the number of clusters before the process begins.

4.4.1. Hierarchical Clustering
There are two kinds of hierarchical clustering. Agglomerative algorithms (which build) begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms (which break up) begin with the whole set and proceed to divide it into successively smaller clusters. In the tree, divisive algorithms begin at the top, whereas agglomerative algorithms begin at the bottom. (In the figure, the arrows indicate an agglomerative clustering.) We introduce the former one first.

Algorithm of hierarchical agglomeration:
1. See every single data point (for an image, every pixel) in the database (for an image, the whole image) as a cluster Ci.
2. Find the two clusters Ci and Cj with the shortest distance between them in the whole database, and agglomerate them into a new cluster.
3. Repeat Step 2 until the number of clusters satisfies our demand.

Notice that we have to define the "distance" in a hierarchical algorithm. The two most commonly adopted definitions are the single-linkage and complete-linkage agglomerative methods.
Single-linkage agglomerative algorithm:

    D(Ci, Cj) = min d(a, b), for a ∈ Ci, b ∈ Cj    (4.1)

Complete-linkage agglomerative algorithm:

    D(Ci, Cj) = max d(a, b), for a ∈ Ci, b ∈ Cj    (4.2)

Here D(Ci, Cj) is the distance between clusters Ci and Cj, and d(a, b) is the distance between data points a and b (for an image, pixels a and b). For example, suppose we have six elements {a}, {b}, {c}, {d}, {e}, and {f}. The first step is to determine which elements to merge into a cluster; usually, we take the two closest elements. Suppose we have merged the two closest elements b and c. We now have the clusters {a}, {b, c}, {d}, {e}, and {f}, and want to merge them further; but to do that, we need the distance between {a} and {b, c}, and therefore must define the distance between two clusters.

Algorithm of hierarchical division:
1. See the whole database (for an image, the whole image) as one cluster.
2. Find the cluster C having the biggest diameter among the clusters we already have.
3. Find the data point x (for an image, the pixel) in C such that d(x, C) = max d(y, C) for y ∈ C.
4. Split x out as a new cluster C1, and regard the rest of the data points of C as Ci.
5. Compute d(y, C1) and d(y, Ci) for every y ∈ Ci. If d(y, Ci) > d(y, C1), split y out of Ci and classify it into C1.
6. Go back to Step 2 and continue until C1 and Ci no longer change between iterations.

We define the diameter of a cluster Ci as D(Ci) = max d(a, b) for a, b ∈ Ci. The distance between a point x and a cluster C, d(x, C), is defined as the mean of the distances between x and every single point in cluster C.

Figure 4.3 A simple case of hierarchical division.

4.5. Partitional Clustering
Compared with hierarchical algorithms, partitional clustering cannot show the characteristics of the database. However, it saves more computational time than the
hierarchical algorithms. The most famous partitional clustering algorithm is the "K-means" method. Its algorithm is as follows.

Algorithm of K-means:
1. Decide the number of clusters we want in the final classified result. We assume the number is N.
2. Randomly choose N data points (for an image, N pixels) in the whole database (for an image, the whole image) as the N centroids of N clusters.
3. For every single data point (for an image, every pixel), find the nearest centroid and classify the data point into the cluster where that centroid is located. After Step 3, every data point is classified into a specific cluster, and the total number of clusters is always N, as decided in Step 1.
4. For every cluster, calculate a new centroid from the data points in it. The calculation of the centroid can also be defined by the user: it can be the median of the data within the cluster, or the true center. Again we get N centroids of N clusters, just as after Step 2.
5. Repeat Steps 3 and 4 until there is no change between two successive iterations. (Steps 4 and 5 are the checking steps.)

We have to mention that in Step 3 we do not always decide the clusters by the "distance" between the data points and the centroids; distance is used because it is the simplest criterion. We can also use other criteria, depending on the characteristics of the database or the final classified result we want.

Figure 4.4 (a) Original image. (b) Choose 3 clusters and 3 initial points. (c) Classify the other points using the minimum distance from each point to the center of a cluster.

There are some disadvantages of the K-means algorithm:
1. The results are sensitive to the initial random centroids we choose; that is, different choices of initial centroids may lead to different results.
2. We cannot show the segmentation details as hierarchical clustering does.
3.
The most serious problem is that the results of the clustering often have circular shapes, due to the distance-oriented algorithm.
4. It is important to clarify that 10 average values does not mean there are merely 10 regions in the segmentation result; see Figure 4.5.

Figure 4.5 (a) The result of K-means. (b) The result that we want.

To solve the initial problem
There is a solution to overcome the initial problem. We can choose just one initial point in Step 2, use the two points beside the initial point as centroids to classify the data into two clusters, and then use the four points beside those two points, continuing until the number of clusters we want is reached. In other words, the initial point is fixed, and we split the clusters until N clusters are classified. By using this kind of improvement, we avoid the initial problem caused by randomly chosen initial data points. Whatever names such schemes go by, their concepts are all similar.

To determine the number of clusters
There is much research on determining the number of clusters in K-means clustering. Siddheswar Ray and Rose H. Turi define a "validity," the ratio of the "intra-cluster distance" to the "inter-cluster distance," as a criterion. The validity measure tells us the ideal value of K in the K-means algorithm:

    validity = intra / inter    (4.3)

They define the two distances as follows:

    intra-cluster distance = (1/N) Σ_{i=1}^{K} Σ_{x ∈ Ci} ||x - zi||²    (4.4)

where N is the number of pixels in the image, K is the number of clusters, and zi is the cluster centre of cluster Ci. Since the goal of the K-means clustering algorithm is to minimize the sum of squared distances from all points to their cluster centers, we first define the "intra-cluster" distance to describe the distances of the points from their cluster centre (or centroid), and then minimize it.
The inter-cluster distance is

inter = min(||z_i - z_j||^2),  i = 1, 2, ..., K-1;  j = i+1, ..., K   (4.5)

On the other hand, the purpose of clustering a database (or an image) is to separate different clusters from each other. We therefore define the "inter-cluster distance", which describes the difference between clusters, and we want its value to be as large as possible. Obviously, if the distances between the clusters' centroids are large, only a few clusters are needed to segment the data; but if the centroids are close to each other, more clusters are needed to classify the data clearly.

4.5.1. More methods to solve the initial problem without changing the K-means scheme

Particle Swarm Optimization
PSO is a population-based random search process. We assume there are N "particles" that appear randomly in a "solution space". Note that we are solving an optimization problem, and for data clustering there is always a criterion (for example, the squared-error function) evaluated for every particle at its position in the solution space. The N particles keep moving and evaluating the criterion at every position they occupy (called the "fitness" in PSO) until the criterion reaches some threshold we require. Each particle keeps track of the coordinates in the solution space associated with the best solution (fitness) it has achieved so far; this value is called the personal best, pbest. Another best value tracked by PSO is the best value obtained so far by any particle in the neighborhood of that particle; this value is called the global best, gbest.
We state the update rules mathematically below:

v_{i,j}(t) = w * v_{i,j}(t-1) + c1 * r1 * (p_{i,j}(t-1) - x_{i,j}(t-1)) + c2 * r2 * (p_{g,j}(t-1) - x_{i,j}(t-1))   (4.6)

x_{i,j}(t) = x_{i,j}(t-1) + v_{i,j}(t)   (4.7)

where x_i is the current position of the particle, v_i is the current velocity of the particle, p_i is the personal best position of the particle, p_g is the global best position, w, c1, c2 are constant factors, and r1, r2 are random numbers uniformly distributed in the interval [0,1]. We use the last velocity and the last personal-best and global-best positions to predict the current velocity; the current position is then the last position plus the current velocity. By using PSO, we can solve the initial problem of K-means and still maintain the whole partitional clustering scheme. The most important point is to treat clustering as an optimization problem.

4.5.2. Advantages and disadvantages of data clustering

Hierarchical algorithms
Advantages:
1. The concept is simple.
2. The result is reliable.
3. The result shows a strong correlation with the original image.
Disadvantages:
1. Computing time is long: instead of calculating the centroids of clusters, we have to calculate the distances between every pair of data points, so the algorithm is not suitable for large databases.

Partitional algorithms (K-means)
Advantages:
1. Computing speed is fast.
2. The number of clusters is fixed, so the concept is also simple.
Disadvantages:
1. The number of clusters is fixed, so which number of clusters is the best choice?
2. The initial problem.
3. Partitional clustering cannot show the characteristics of the database as hierarchical clustering can.
4. The clustering results are often circular in shape.
5. We cannot improve K-means by setting fewer centroids.

We can solve the choice of the number of clusters by observing the "validity" value proposed by Siddheswar Ray and Rose H. Turi. For the initial problem, we can choose only one initial point or use the PSO algorithm directly.
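A minimal sketch of the update rules (4.6)-(4.7) is shown below, applied to a generic one-dimensional fitness function rather than a clustering criterion; the function name, parameter values, and search interval are illustrative assumptions:

```python
import random

def pso_minimize(f, n_particles=10, iters=100, lo=-10.0, hi=10.0,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic 1-D particle swarm minimizing fitness f over [lo, hi]."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_particles)]  # positions x_i
    vs = [0.0] * n_particles                                # velocities v_i
    pbest = xs[:]                                           # personal bests p_i
    gbest = min(pbest, key=f)                               # global best p_g
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Eq. (4.6): inertia + pull toward pbest + pull toward gbest.
            vs[i] = (w * vs[i] + c1 * r1 * (pbest[i] - xs[i])
                     + c2 * r2 * (gbest - xs[i]))
            xs[i] += vs[i]                                  # Eq. (4.7)
            if f(xs[i]) < f(pbest[i]):                      # update fitness records
                pbest[i] = xs[i]
        gbest = min(pbest, key=f)
    return gbest
```

For clustering, f would instead evaluate, for example, the squared-error criterion of the centroid set encoded by the particle.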
However, we cannot solve the circular-shape problem, because it comes from the core computing scheme of partitional algorithms.

4.6. Simulation of K-means using C++

Figure 4.6 Lena image after using K-means. Clusters/time: 9 clusters / 0.1 seconds. The top-left image is the original image; the others are the images after K-means.

We observe that the result is not connected into single regions. However, K-means is a fast way to perform region-based segmentation.

4.7. Cheng-Jin Kuo's method
We now introduce the method we propose and explain, through "data compression", why we would like to use this kind of algorithm.

4.7.1. The ideal segmentation we would like to obtain
The ideal result we would like to obtain is something like Figure 4.7. It is very important for us to classify similar regions together.

Figure 4.7 The ideal segmentation result we want.

We would like the whole hair section to be classified as one cluster, because after we obtain the result, we can send it directly to the compression stage and compress every region. For almost all segmentation methods, it is unavoidable to over-segment a region like the hair region in the Lena image.

4.7.2. Algorithm of Cheng-Jin Kuo's method
Make the first pixel we scan (usually the top-left one) the first cluster.
1. Treat the pixel (x,1) in the image as one cluster C_i, and the pixel currently being scanned as C_j.
2. In the first column, scan the next pixel (x,1+1) and decide with the threshold whether it is merged into the first cluster or becomes a new cluster:
If |C_j - centroid(C_i)| < threshold, merge C_j into C_i and recompute the centroid of C_i.
If |C_j - centroid(C_i)| >= threshold, make C_j a new cluster C_{i+1}.
3. Repeat step 2 until all pixels in the same column have been scanned.
4. Scan the next column, starting with pixel (x+1,1), and compare it to the region C_u on its upper side.
Make the merge decision to see whether pixel (x+1,1) has to be merged into the region C_u:
If |C_j - centroid(C_u)| < threshold, merge C_j into C_u and recompute the centroid of C_u.
If |C_j - centroid(C_u)| >= threshold, make C_j a new cluster C_n, where n is the cluster count so far.
5. Scan the next pixel (x+1,1+1) and compare it to the regions C_u and C_l, which lie above it and to its left, respectively, and make the merge decision:
If |C_j - centroid(C_u)| < threshold and |C_j - centroid(C_l)| < threshold,
(1) merge C_j into C_u and into C_l;
(2) combine the regions C_u and C_l into one region C_n, where n is the cluster count so far;
(3) recompute the centroid of the combined region.
Else if |C_j - centroid(C_u)| < threshold and |C_j - centroid(C_l)| >= threshold, merge C_j into C_u and recompute the centroid of C_u.
Else if |C_j - centroid(C_u)| >= threshold and |C_j - centroid(C_l)| < threshold, merge C_j into C_l and recompute the centroid of C_l.
Else, make C_j a new cluster C_n, where n is the cluster count so far.
6. Repeat steps 4 and 5 until all pixels in the image have been scanned.
7. Process the small regions produced by the previous steps.
It is important to deal with the isolated small regions carefully, because we do not want too many fragmentary results after segmenting the image with our method. Our goal is therefore to classify each isolated small region into an already-classified big region adjacent to it. The following is the method for merging small regions into big regions.
(a) We process the regions R_i that have small size (for 256x256 input images, we target regions smaller than 32 pixels).
(b) If the region R_i is fully surrounded by a single bigger region C_i, then C_i <- C_i ∪ R_i.
(c) If the region R_i is surrounded by several (say, k) bigger regions C_i, i = 1 ~ k, we compute the mean of R_i and classify R_i into the most similar C_i: if |mean(R_i) - mean(C_h)| = min_{i=1~k} |mean(R_i) - mean(C_i)|, where h is one of 1 ~ k, then C_h <- C_h ∪ R_i.

4.8. Improvement of the Fast Algorithm: Adaptive Local Threshold Decision
In our algorithm, the threshold of Section 4.7 does not change during the whole procedure. We would like a new procedure that adaptively decides the threshold from the local frequency and variance of the original image.

4.8.1. Adaptive threshold decision with local variance
We select the threshold based on the local variance of the image. The steps of the algorithm are:
1. Separate the original image into 4x4 = 16 sections.
2. Compute the variance of each of the 16 sections.
3. Depending on the local variance, select a suitable threshold: the bigger the variance, the bigger the threshold.

Figure 4.8 Lena image separated into 16 sections.

The variances of the 16 sections as a matrix:

 716  447  899 1579
1497 1822 2314 1129
1293 1960 1974 1545
2470 2238 1273 1646   (4.8)

We can imagine that after using the adaptive-threshold method, the segmented results in sections (2,3) and (2,4), whose variances are 1960 and 1974, will be similar to those of the original method, whose global variance is 1943. The new segmented results in sections (1,1) and (1,2) will be more detailed; that is, in these two sections we will have more clusters in the result. Most of the time, the adaptive threshold selection helps us segment more precisely.
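Steps 1 and 2 above (splitting the image into 4x4 sections and computing each section's variance) can be sketched as follows; the function name is ours, and a real implementation would use an array library:

```python
def block_variances(img, blocks=4):
    """Split a square grayscale image (list of rows) into blocks x blocks
    sections and return the variance of each section."""
    n = len(img)
    step = n // blocks
    out = []
    for bi in range(blocks):
        row = []
        for bj in range(blocks):
            # Gather the pixels of section (bi, bj).
            pix = [img[i][j]
                   for i in range(bi * step, (bi + 1) * step)
                   for j in range(bj * step, (bj + 1) * step)]
            mean = sum(pix) / len(pix)
            row.append(sum((p - mean) ** 2 for p in pix) / len(pix))
        out.append(row)
    return out
```

For a 256x256 image and blocks=4, each section is 64x64 pixels, and the returned 4x4 matrix plays the role of (4.8); step 3 then maps each variance to a threshold.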
However, we do not really feel the improvement of selecting the local threshold with the local variance when observing the simulation results. A small variance makes the local threshold small, which produces a more detailed segmentation result in the end.

4.8.2. Adaptive threshold decision with local variance and frequency
For example, baboon.bmp is segmented in more detail because the adaptive local variance selects a smaller threshold in every partitioned area, owing to the low variance. However, we are not satisfied with this: we would like to segment the beard parts of this image more roughly.

Figure 4.9 Baboon.bmp

The local variances of baboon.bmp:

1625 1645 1694 1562
1405  865  757 1058
2346 1222 1256  505
 606  990 1054  635   (4.9)

The local average frequencies of baboon.bmp:

14.0943 12.4850 12.1756 13.6914
12.7597  9.8058  9.4788 12.6781
11.4280 10.4072 10.6333 12.8095
11.8825 12.5687 13.1654 11.8211   (4.10)

As we can see, the bottom-left area has a small variance and a big frequency component. In this area we would choose a small threshold, so the segmentation result would be more detailed; if we want to classify it as one region, we need to set the threshold depending on the local frequency as well. To sum up, we distinguish four situations for our improvement:

1. High frequency, high variance: set the highest threshold.
Figure 4.10 A high-frequency and high-variance image
2. High frequency, low variance: set the second-highest threshold.
Figure 4.11 A high-frequency and low-variance image
3. Low frequency, high variance: set the third-highest threshold.
Figure 4.12 A low-frequency and high-variance image
4. Low frequency, low variance: set the lowest threshold.
Figure 4.13 A low-frequency and low-variance image

For the first case, the reason we select a higher threshold is that there are often many edges and different objects in this kind of area.
A larger threshold may produce a rough segmentation result, but we believe the clear edges and the high variety between different objects will make the segmentation work; the larger threshold removes some over-segmentation caused by the high variance and high frequency. One might think that the smallest threshold in case four would cause an over-segmented result; however, the stable and monotonous characteristic of case four prevents over-segmentation.

4.8.3. Deciding the local threshold
We use a formula to decide the threshold:

threshold = 16 + F + V   (4.11)

The formula for F:

F = A * (local average frequency) + B   (4.12)

The formula for V:

V = C * (local variance) + D   (4.13)

In this thesis we always try to control the threshold value between 16 and 32, because the best testing threshold value with the original method (without the adaptive threshold) is 24. Accordingly, the range of F is 0 to 8, and so is the range of V; the maxima of F and V are both 8, which makes the maximum threshold 32.

If local average frequency > 9, F = 8;
Else if local average frequency < 3, F = 0;
End
If local variance > 3000, V = 8;
Else if local variance < 1000, V = 0;
End

With the range of F defined from 0 to 8 and the range of V defined from 0 to 8, the values of A, B, C, D are 4/3, -4, 0.004, -4, respectively. The parameter values come from simple linear relationships. Equation (4.12) expresses the linear relationship between the local average frequency and F; we use it only when 3 < local average frequency < 9. Equation (4.13) expresses the linear relationship between the local variance and V; we use it only when 1000 < local variance < 3000. We can also change the range of the final threshold; we only have to recompute the parameters A, B, C, D from the equations below:

[A, B] = solve('3*A + B = Fmin', '9*A + B = Fmax');   (4.14)
[C, D] = solve('1000*C + D = Vmin', '3000*C + D = Vmax');   (4.15)
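The rule of Eqs. (4.11)-(4.13), with F and V clamped to [0, 8], can be sketched as follows (the function name is ours; the parameter values 4/3, -4, 0.004, -4 are those given in the text):

```python
def local_threshold(avg_freq, variance):
    """Adaptive threshold = 16 + F + V, with F and V clamped to [0, 8]."""
    A, B = 4.0 / 3.0, -4.0   # F = A*freq + B, linear for 3 < freq < 9
    C, D = 0.004, -4.0       # V = C*var  + D, linear for 1000 < var < 3000
    F = min(8.0, max(0.0, A * avg_freq + B))
    V = min(8.0, max(0.0, C * variance + D))
    return 16 + F + V
```

The threshold stays in [16, 32], centered on the best fixed threshold 24 of the original method.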
4.9. Comparison of all algorithms by data compression

                    Region growing                  | K-means                    | Watershed         | Cheng-Jin Kuo's method
Speed:              bad                             | good (worse than C.J.K.'s) | bad               | good
Shape connectivity: intact                          | fragmentary                | over-segmentation | intact
Shape match:        good (better than C.J.K.'s)     | good (equal to C.J.K.'s)   | bad               | good

5. Boundary Compression Using the Asymmetric Fourier Descriptor for Non-closed Boundary Segments
This chapter briefly introduces the Fourier descriptor and provides an improvement for describing a boundary with it. We define a variable R as the ratio of the number of reserved terms P to the number of original terms K in the discrete Fourier transform; note that R = P/K.

5.1. Fourier Descriptor
The Fourier descriptor is a method of describing a boundary by applying the DFT, with the x-axis as the real part and the y-axis as the imaginary part. Assume there are K boundary points (x_0, y_0), (x_1, y_1), ..., (x_{K-1}, y_{K-1}). These coordinates can be expressed in the form s(k) = [x(k), y(k)], k = 0, 1, ..., K-1. Moreover, each coordinate pair can be treated as a complex number, so that

s(k) = x(k) + j*y(k),  k = 0, 1, ..., K-1.   (5.1)

The discrete Fourier transform (DFT) of s(k) is

a(u) = (1/K) * sum_{k=0}^{K-1} s(k) e^{-j2πuk/K},  u = 0, 1, ..., K-1.   (5.2)

The complex coefficients a(u) are called the Fourier descriptors of the boundary. The inverse Fourier transform of these coefficients recovers s(k). That is,

s(k) = sum_{u=0}^{K-1} a(u) e^{j2πuk/K},  k = 0, 1, ..., K-1.   (5.3)

We get rid of the high-frequency terms whose index u is higher than P-1. Mathematically, this is equivalent to setting a(u) = 0 for u > P-1 in (5.3). The result is the following approximation of s(k):

ŝ(k) = sum_{u=0}^{P-1} a(u) e^{j2πuk/K},  k = 0, 1, ..., K-1.   (5.4)

In Fourier transform theory, high-frequency components account for fine detail, and low-frequency components determine global shape. Thus, the smaller P becomes, the more detail is lost on the boundary.
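Equations (5.2) and (5.4) can be sketched with a naive O(K^2) DFT (illustrative; a practical implementation would use an FFT):

```python
import cmath

def fourier_descriptor(points):
    """Eq. (5.2): DFT of the boundary sequence s(k) = x(k) + j*y(k)."""
    K = len(points)
    s = [complex(x, y) for x, y in points]
    return [sum(s[k] * cmath.exp(-2j * cmath.pi * u * k / K) for k in range(K)) / K
            for u in range(K)]

def reconstruct(a, P):
    """Eq. (5.4): rebuild the boundary keeping only the first P descriptors."""
    K = len(a)
    return [sum(a[u] * cmath.exp(2j * cmath.pi * u * k / K) for u in range(P))
            for k in range(K)]
```

With P = K the boundary is recovered exactly; shrinking P smooths fine detail first, which is exactly the corner-smoothing problem discussed next.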
Problems of the Fourier descriptor
The Fourier descriptor has a serious problem when the compression ratio is below 20%: the corners of the boundary shape are smoothed away. Note that the corners of an image or a boundary usually correspond to high-frequency components in the frequency domain. If we reconstruct the boundary from (5.4) with R less than 20%, the results near the corners of the boundaries are not very good.

5.2. Asymmetric Fourier descriptor of non-closed boundary segments
A method proposed by Ding and Huang can solve the problems mentioned above. This method, called the "asymmetric Fourier descriptor of non-closed boundary segments", improves the efficiency of the Fourier descriptor even when the value of R is below 20%. There are three approaches (steps) in this method, introduced below. [A-1]

5.2.1. Approach 1: Predicting and marking the corners
The first step of the method is to find the corner points of the boundary; in this step, we predict and mark the corners. As we can see in Figure 5.1, the corner points lie at the regional maxima of the error value. In our experiment, we define the corner points as the places where the error value is greater than 0.5 and maximal within the 10 nearby points.

Figure 5.1 (a) A star-shaped boundary. (b) Error between the two boundaries of (a).

Predicting and marking the corners is just the first step. After it, we have to segment the original boundary into several parts and convert these boundary segments with the Fourier descriptor.

5.2.2. Approach 2: Fourier descriptor of a non-closed boundary segment
Using the Fourier descriptor to describe a boundary means getting rid of the high-frequency components.
Figure 5.2 Using the Fourier descriptor with a non-closed boundary segment: boundary segment -> DFT -> Fourier descriptor a(u) -> truncate at P -> inverse DFT -> recovered boundary.

However, if we truncate the high frequencies of the spectrum of a non-closed boundary, the reconstructed boundary will be a closed one. We solve this as follows.

Figure 5.3 The steps to solve the non-closed problem: (a) step 1, (b) step 2 (linear shift), (c) step 3 (adding a new segment).

Step 1: Set the coordinates of the start point as (x_0, y_0) and of the end point as (x_{K-1}, y_{K-1}). See Figure 5.3(a).
Step 2: Shift the boundary points linearly according to the distance along the curve between the two end points. If (x_k, y_k) is a point of the boundary segment s_1(k), for k = 0, 1, ..., K-1, it is shifted to (x_k', y_k') to form s_2(k), see Figure 5.3(b), where

x_k' = x_k - x_0 - (x_{K-1} - x_0) * k/(K-1)   (5.5)
y_k' = y_k - y_0 - (y_{K-1} - y_0) * k/(K-1)   (5.6)

Step 3: Add a boundary segment that is odd-symmetric to the original one. The new boundary segment is then closed and perfectly continuous along the curve between the two end points. See Figure 5.3(c). The new boundary segment is

s_3(k) = s_2(k),  s_3(-k) = -s_2(k),  for k = 0, 1, ..., K-1.   (5.7)

Step 4: Compute the Fourier descriptor of the new boundary segment s_3(k). That is,

a(u) = (1/K) * sum_{k=-(K-1)}^{K-1} s_3(k) e^{-j2πuk/K},  u = 0, 1, ..., 2K-2.   (5.8)

If the signal s(k) is odd-symmetric, its DFT a(u) is also odd-symmetric:

s(-k) = -s(k)  <=>  a(-u) = -a(u).   (5.9)

Because the central point of s_3(k) is the origin, the DC term (the first coefficient of the DFT) is zero.
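Steps 1-3 can be sketched as follows (the function name is ours). The returned sequence has 2K-1 points, is odd-symmetric, and sums to zero, which is why the DC term of its DFT vanishes:

```python
def odd_symmetric_segment(points):
    """Shift a non-closed boundary segment so both end points land on the
    origin (Eqs. 5.5-5.6), then append the odd-symmetric copy (Eq. 5.7)."""
    K = len(points)
    (x0, y0), (xK, yK) = points[0], points[-1]
    s2 = [complex(x - x0 - (xK - x0) * k / (K - 1),
                  y - y0 - (yK - y0) * k / (K - 1))
          for k, (x, y) in enumerate(points)]
    # s3(k) = s2(k) for k >= 0 and s3(k) = -s2(-k) for k < 0.
    return [-p for p in reversed(s2[1:])] + s2
```

Feeding this closed, origin-centered sequence to the DFT of Step 4 yields the odd-symmetric descriptors with a zero DC term.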
We only need to record the second to the K-th coefficients of the Fourier descriptors, as illustrated in Figure 5.4.

Figure 5.4 Fourier descriptor of s_3: the DC term is zero and, by odd symmetry, the coefficients for u = K, ..., 2K-2 are redundant.

After performing all the steps, we can take the Fourier descriptor of the processed boundary, and the problem we mentioned no longer exists.

5.2.3. Approach 3: Boundary compression
The process of Approach 2 can also be used for boundary compression. We reserve only P-1 coefficients and truncate the others; we recover the whole set of coefficients by stuffing zeros and then copying them to the odd-symmetric part.

Figure 5.5 The reserved P-1 coefficients.
Figure 5.6 Recovering the whole set of coefficients from Figure 5.5.

5.2.4. Approach 4: Boundary encoding
In boundary-segment encoding, we have four data to record, which together form the bit stream:
1. The corners: corner distance + difference of each corner, with Huffman encoding.
2. The segment number and the coordinate difference of each boundary segment, with Huffman encoding.
3. The point number of each segment, with Huffman encoding.
4. The coefficients of each segment: truncation and quantization, then zero-run-length and Huffman encoding.

Figure 5.7 Boundary segment encoding.

In the third data, the point number of each segment is tied to the distance between the two end points of each boundary segment. The distance used here is the sum of the two distances along the x-axis and the y-axis.

Figure 5.8 Point numbers of the boundary and distances of the two end points.

As we can see in Figure 5.8, we have a vector n of the point numbers of the boundary segments. Similarly, dx and dy are vectors that record the distances along the x-axis and the y-axis, respectively.
Therefore, we can get the difference vector d, where

d = n - (dx + dy)   (5.10)

The values of the difference vector d are close to zero, so d is appropriate to encode with Huffman encoding. In the decoder, we can recover n by

n = d + (dx + dy)   (5.11)

In the fourth data, we combine the coefficients of all boundary segments of a whole boundary and encode them with zero-run-length and Huffman coding. When a boundary segment is a straight line, its Fourier-descriptor coefficients are all zero; therefore, zero-run-length coding is appropriate when many boundary segments are straight lines. Because we have recorded the point number of each boundary segment, we can calculate the number of reserved coefficients, split the combined coefficient array correctly, and then recover the original coefficients by stuffing zeros into the truncated positions.

Figure 5.9 Result of improved boundary compression: (a) original boundary; (b) recovered boundary with R = 10% and coefficient number greater than 3.

If we use the modified Fourier descriptor method, which splits the boundary at the corner points into several boundary segments, the sharp corners can be preserved even when R is small. We can also see that when R = 10%, the result of the original Fourier descriptor method is obviously distorted; with the modified method, however, the characteristics of the corners are preserved and the longer boundary segments are not obviously distorted. Note that the shorter boundary segments are stretched from curves into straight lines when fewer than one coefficient is reserved for them. Therefore, we force the number of reserved coefficients to be greater than three, since in our experiment three coefficients represent most of the characteristics. The improved result is shown in Figure 5.9.

6. References
1. R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd edition, Prentice Hall, 2010.
2. L. G.
Roberts, "Machine Perception of Three-Dimensional Solids," in Optical and Electro-Optical Information Processing, J. T. Tippett (ed.), MIT Press, Cambridge, MA, 1965.