Medical Image Classification with Advanced Markov Random Field/Gibbs Classification A Dissertation Proposal Zhihong Yang May 30,2000 Advisor committee Dr. Ian R. Greenshields Dr. Howard Sholl Dr. Reda Ammar 5/30/00 Outline Medical image classification and its time-consuming properties Introduction to MRF/Gibbs classification Previous work and novelty of this study Methods and techniques to be used Partial parallel algorithm ensemble-parallel algorithm locally adaptive cooling schedule based on multiresolution tiling Summary of research goals and expected contributions Result evaluation plan Availability and location of research facilities 5/30/00 2 Medical Image Classification with MRF/Gibbs Methods--1/2 Image classification is a procedure by which desired information is extracted from original image data through a designed algorithm four elements are involved in the definition of image classification: original data/classified data/classification algorithm/estimation criterion The scale of image classification problem--original input data: a 256 X 256 lattice grey image classified image: a 256 X 256 lattice binary image The number of possible output is power(2, 256*256). This is a very big number 5/30/00 3 Medical Image Classification with MRF/Gibbs Methods--2/2 Markov Random Field/Gibbs classification is a wellestablished method to classify image based on statistical inference. In order to introduce the problem we are facing, we would like to summarize the properties of MRF/Gibbs classification. MAP estimate slow speed of convergence due to the cooling schedule The principle of MRF/Gibbs classification follows the problem statement. 5/30/00 4 The problem & Proposed methods The time consuming property of MRF/Gibbs classification to medical image data because of both the volume of the input data and the nature of the algorithm Proposed Methods to reduce the time for the MRF/Gibbs algorithm to converge partial parallel algorithm ensemble parallel algorithm locally adaptive cooling schedule based on multiresolution tiling 5/30/00 5 Introduction to MRF/Gibbs Classification Priors and Posteriors Bayes decision rules Maximum A Posterior estimate Markov Random Fields/Neighborhood System Gibbs Fields Markov Chains, Limit Theorems, Convergence of the algorithm Gibbs Sampler/ Visiting Scheme Simulated Annealing/Cooling Schedule 5/30/00 6 Priors and posterior Distributions Prior gives the model that we expect to see in an image before observation, for example--How many classes are actually in the image? What is the percentage distribution of these tissue types? Given a class, what is the distributions of data? If it is normal distribution, what are the parameters of the distribution? Posterior can be interrupted as an adjustment of priors to the real data after the observation. 5/30/00 7 Gibbsian form of priors and posteriors A strictly positive probability distribution will always have the Gibbsian form (ς) where 1 Z exp( H (ς)) ς Z exp(-H(z)) z and H is called the energy function from statistical physics. Large value of energe correspond to small value of probability values. P( x | ˆy) z( x( )zP) P( x(,zˆy,ˆy) ) the posterior has this form where is the prior on the space of image configurations, P( x,) are the distributions of the data(observed images) y given the configuration x, and P is the posterior (the distribution of configurations x given the observed data y. 5/30/00 8 Bayes Decision rules What is the best? Two elements have to be take into account The ideal mode presented by priors The real data that we observed The Bayesian approach takes into account both requirements simultaneously by looking for desired posteriors. A estimator minimizing the risk is called a Bayes estimator MAP estimators are Bayes estimators for the 0-1 loss functions. 5/30/00 9 MAP estimator An estimate from observed data that will maximize the posterior distribution is called a Maximum A Posterior(MAP) estimate.The image is estimated as a whole in MAP. MAP is contextual related estimate. Thus it is equal to look for the estimate that will minimize the energy function of posterior distribution. Why MAP estimator? MAP estimator is the best estimator under 0-1 loss function. Computation complexity--exponential 5/30/00 10 Markov Random Fields/Neighborhood System Random Field---A strictly positive probability distribution on the space of configurations Index Set of Sites/Pixel space of states/Classes space of configurations Neighborhood System---Some axioms One site is not its own neighbor. Site S is site T’s neighbor, if and only if site T is site S’s neighbor. Local characteristic-- the conditional probability of the site S, given the configuration of other sites, is called local characteristic. Markov field-- A site’s local characteristic only relies on its neighbors. We want the neighborhood to be small. 5/30/00 11 Gibbs Fields Probability of Gibbs Form are always strictly positive and hence random fields. Gibbs fields is induced by the energy function. Energy function is given by the sum of potentials. Potential is something related to a site’s class/state/position to reflect the relative relations with other sites’ class/state/position. Thus the image configurations with posterior that is Gibbsian form is a Gibbs field. 5/30/00 12 Markov Chains:Limit Theorem Markov Kernal--the possibility from a old class(x) to new class(y). It is a matrix, with x-th row and y-th column. Markov kernels with a strictly positive power are called primitive. Markov Chain--On the finite space X, given by an initial distribution v and Markov kernels P1,P2,P3,…. If Pi=P for all I then the chain is called homogeneous. The limit Theorem. A primitive Markov kernel P on a finite space has a unique invariant distribution μ and n , vP n μ uniformly in all distributions v. 5/30/00 13 Gibbs Sampler(1) limit theorem tells us that we should look for a strictly positive Markov kernel for which the distribution is invariant. One natural construction is based on the local characteristics of the possibility measure of the random field. A Markov Kernal is defined by a site’s local characteristic. Z 1exp(-H(y x S )) if y S- x S ( x, y) 0, otherwise The Gibbs field P and its local characteristics fulfill the detailed balance equation. If u and P fulfill the detailed balance equation then u is invariant for P. In particular, Gibbs fields are invariant for their local characteristics. After very large number of iterations, one will end up in a sample from a distribution close to the Gibbs field. 5/30/00 14 Gibbs Sampler(2) Visiting Scheme is an enumation of sites whose classes are waiting to be determined. Gibbs Sampler. The above homogeneous Markov chain with transition probability P induces the following algorithm: an initial configuration x is chosen or picked at random according to some initial distribution v. In the first step, x is updated at site 1 by sampling from the single-site characteristic. This yields a new configuration y=y1xs\{1} which in turn is updated at site 2. /This way all the sites in S are sequentially updated. This will be called a sweep. The first sweep results in a sample from vP. Running the chain for many sweeps produces a sample from vP…P. The procedure goes over and over …... 5/30/00 15 Simulated Annealing/Cooling Schedule The computation of MAP estimators for Gibbs fields amounts to the minimization of energy functions. β β 1 β , ( x ) ( Z ) exp(-βH(x)) inverse temperature is defined by Cooling schedule is an increasing sequence of positive numbers β( n) 1 ln n σ For every n>=1, a Markov kernel is defined by Pn ( x, y) βS(1 n) βS(2n) βS(μn) ( x, y) A Theorem by S. and D. Geman(1984) M lim v P1P 2 P n ( x) n 0 5/30/00 1 if x M otherwise 16 A Brief History of MRF/Gibbs Classification S.Geman and D. Geman(1984) build a theoretical schema for this algorithm. M.C. Zhang(1990) applied Geman’s brothers work to image classification problem. Their definition of potential function and energy are inherited in our study. Tom Deggeet and I.R. Greenshields(1998) explored 3-D volume medical image classification. The proposed siteaging will be refined in our study. Numerous other attempts to apply Geman’s theory in classifications 5/30/00 17 Literature Review--Parallel Simulated Annealing Parallel computation by speculative computation [Sohn,1995],[Nabhan,1995],[Witte, 1991] Partial Parallel Algorithm cluster algorithm[Swensdon,1987],[Sokal,1989],[Fox,1995] synchronous updating based on independent set partition [Beba, 1997]---not based on independent set partition, no rigorous result on speed up. [Jeng,1993]---failed to take the communication cost into consideration Multiple trials/ensemble parallel Pseudo parallel algorithm[Deggett, 1999] 5/30/00 18 Comparison of the proposed parallel techniques Parallel techniques Speculative Computation cluster algorithm Indendent set partition Multiple trials pesudo algorithm 5/30/00 spectific problems no isling/potts model no no no Speedup 2-3 at no comm cost 10( data from experiment) pixels/chromic numbers uncertain number of data chunks Properties limited speedup limited image model chromic number is a NP complete problem uncertainty theoretical ground need to be bilit 19 Shorten the run time by adjusting cooling schedule Various cooling schedules polynomial[Young,1999],[Yuan, 1999] adaptive cooling[Steinhofel,1998] Characters Problem-specific limited Speedup preliminary consideration on cooling schedule that is adaptive to the data. Regular data partition. 5/30/00 20 The novelty of this study Partial parallel algorithm based on independent set partition on identified simple neighbor hood system locally adaptive cooling schedule based multiresolution tiling the application on medical image classification 5/30/00 21 Method 1--Partial Parallel Algorithm An example For a 4-neighbor hood with northern, eastern, southern and western neighbors, an update at a 'black' site need no information about the states in other 'black' sites. Hence, given a configuration x, all 'black' processing units may do their job simultaneously and produce a new configuration y' on the basis of x and then the white processors may update y' in the same way and end up with a configuration y. Thus a sweep is finished after two time steps and the transition possibility is the same as for sequential updating over a sweep in |S| time steps. 5/30/00 22 Independent sets If a neighbor hood system is given, then a subset T of S is independent if it contains no pair of neighbors. The transition probability of partially parallel based on independent sets coincides with the transition probability for one sequential sweep. The limit theorem for sampling stays the same as the sequential case. 5/30/00 23 Independent set partition--The graph coloring problem The small number of independent sets is called chromatic number of the neighbor hood system. In fact, it is the smallest number of colors needed to paint the sites in such a fashion that neighbors never have the same color. Two extreme cases. If the classes in the site are independent, then there are no neighbor hood at all and the chromatic number is 1. If the all sites interact each other, then chromatic number is |S|. In general this problem is known as the graph coloring problem, which is NP complete. 5/30/00 24 Independent set partition with small neighborhood Though in general , the independent set partition is a problem that is even harder than the Gibbs algorithm, the problem is still solvable with a small neighborhood system. For example, a 2-D four neighborhood (East, West, North, and South) partition is given by a checkerboard style partition. It is (B—black, W—white) BWBWB WBWBW BWBWB WBWBW 5/30/00 25 5-neighborhood partition A 2-D five neighborhood (East, West, North, South, and southwestern) partition is given by (R—Red, G—Green, B—Blue). RGBRGBRGBRGB GBRGBRGBRGBR BRGBRGBRGBRG RGBRGBRGBRGB GBRGBRGBRGBR 5/30/00 26 3-D checkboard The partition to 3-D case is much more difficult than that to 2-D neighborhood system, but to the 6neighborhood case (East, West, North, South, bottom, up) the partition is elegant. Suppose we have a checkerboard in each plain, we can have one plain beginning from a black block (the upper left corner), and another layer beginning from a white block, respectively. It is a 3-D checkerboard. 5/30/00 27 Parallel strategy based on independent partition With the independent set partition, we can consider the parallel strategy of this algorithm. In a cluster computer, we cut an image into several data block as a simple way, and assign each block to a computer. Here is the key to this implementation: each computer will not update the sites in raster scanner order, it will update the sites with white color first, black later or on the other way around. Each computer does this in same order. After these computers complete one color in the assigned data block, they need to exchange the neighbor column/row. Then they will proceed to update another color. After all color have been updated, the next iteration will begin. 5/30/00 28 Data exchange after the black blocks are updated Processor 1 B B B B B B B B B B B B B B B B B B Processor 2 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B Processor 3 5/30/00 B B B B B B B B B B B B B B B B B Processor 4 29 Data exchange after the white blocks are updated Processor 1 B B B B B B B B B B B B B B B B B B Processor 2 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B Processor 3 5/30/00 B B B B B B B B B B B B B B B B B Processor 4 30 parallel algorithm on multigrid clusters P1 P3 5/30/00 P2 P4 31 Partial parallel algorithm Partition the neighborhood manually, label the pixels with colors. Partition the data into chunks for t = start to end do (t is preselected based on locally adaptive cooling schedule) For colors = color0 to colorN for every pixel (i,j,k) in the image do let s = c(i,j,k) be the current class at (i,j,k) let N(i,j,k) be the neighborhood at (i,jk) compute E[s; i,j,k ] (==energy at (i,j,k) ) from the New Class Selection Rule* let q be a new class for site (i,j,k) compute E[q; i,j,k] (==energy at (i,j,k) with class q If ( E[q;i,j,k] < E[s;i,j,k] ) c(i,j,k) = q Else let D = Abs( E[q; i,j,k] – E[s;i,j,k]) Let y = Exp[ -D/T(t) ] Let x = Random [0..1] /* a uniform rand in 0..1 */ If ( y > x) c(i,j,k) = q; end for every pixel if the other processors with neighor pixels have finished the update, exchange edge pixels between processors end for colors end for t collect the result by one of the machine 5/30/00 32 Method2--Ensemble parallel Multiple trials--the simultaneous execution of N instances of the same algorithm over the same dataset seeded different starting configurations. If these ensemble instances are driven from the same prior, then there is evidence that they will converge locally to a single instance, offering the possibility that K ensemble instances can be collapsed into 1 instance over identified point sets in the volume. 5/30/00 33 Method3--locally adaptive cooling schedule based on multiresolution tiling Site aging multiresolution tiling with self-similar sets local moments Rank temperature based on tile 5/30/00 34 Site aging Observation---Background Area If the possibility that the configuration of site’s neighborhood changes over iterations once the the schedule had reached an inverse temperature given by N0 is a small value, then we can change the visitation schedule so that that site does not invoke the sampler process once the temperature exceeds N0 5/30/00 35 Seek heuristic that predicts the inverse temperature Need a flexible and tunable plane partition Multiresolution tilling with self-similar sets Develop a function that reflect the complexity of the data in the sense of classification Wavelet decomposition---Harr wavelet local moments rank the cooling temperature based on the functions 5/30/00 36 Multiresolution tilings --1 A well-established theory by Mathematicians Dilation Transformation Translation transformation Multiresolution analysis scaling function definition of wavelet If a multiresolution analysis’s scaling function is the characteristic function of a measurable set Q and |Q|=1, then Q can tile a plane. 5/30/00 37 Multiresolution tilings --2 If |Q|=1, Q is the attractor of an affine transformation. Q can be solved by the iteration Qn+1=Union(invA(Qn+k). Each intermediate Qn can tile a plane as well. K belongs to coset representative set. The union is over the complete coset. So we have a geometrically tunable, irregular system over which heuristics for the cooling schedule can be developed 5/30/00 38 Local Moments Local moments definition We call the number p q m pq Α x y f(x, y)dμ A The (p, q)' th local moment of f with respect to A. first moment--mean second moment--variance higher moments 5/30/00 39 Heuristic Choose a tile( inverse problem, but we are considering to choose one from tile banks based on some similarity measures) wavelet decomposition over tilling--tree presentation Ranking Assign cooling schedule 5/30/00 40 Evaluation Plan Validation test Apply the algorithm to synthetic image with known priors comparison with existing classification result Speed up efficiency scalability 5/30/00 41 Summary of research goals and expected contributions Development of a SIMD MRF/Gibbs classification algorithm based on independent set partition Rigorous medical image classification result based on this algorithm with random initial configuration Rigorous result of the speedup, efficiency about this algorithm Exploration of ensemble parallel algorithm in the application of medical image classification Exploration of multiresolution tiling on locally adaptive cooling schedule 5/30/00 42 Research facilities In general, the research facilities for this study is available or can be accessed in Computer Science and Engineering Department and Booth Research Center at UConn. Hardware--SUN, PC, SGI workstation Software--mobile agent, MPI, openMP,... Network--100BaseTEthernet,Gigabit Ethernet,OC3ATM... The data that will be used are Visible Human Data, which Image Processing Lab has the license to acquire this data. Super computer facilities---NPACI 5/30/00 43 Bibliography--1 [1] The visible Human Dataset, National Library Medicine, http://www.nlm.nih.gov/pubs/factsheets/visible_human.html [2] Stuart Geman and Donald Geman, “Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images,”IEEE PAMI Vol 6, No.6, 1984 [3] Ming chuang Zhu, Robert M. Haralick, James B. Campbell, “Multispectral Image Context Classification Using Stochastic Relaxation,” IEEE PAMI Vol. 20, No. 1, 1990 [4] T. Degget, I.R. Greenshields and G. Weerasinghe, “Asynchronous, parallel pseudo Gibbs Classification of the VF Dataset,” Proceedings of twelfth IEEE symposium on computer-based medical systems [5] Gerhard Winkler, “Image Analysis, Random fields and Dynamic Monte Carlo Methods--A Mathematical introduction,” Springer, 1995 [6] W.T. Tutte, “Graph Theory, Encyclopedia of Mathematics and its Applications,” Cambridge University Press, 1984 [7] Laurent Younes, “Synchronous Random Fields and Image Restoration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No.4, April 1998, pp. 380-390 [8] Soo-Young Lee, Kyung Geun Lee, “Synchronous and Asynchronous Parallel Simulated Annealing with Multiple Markov Chains,” IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 10, October 1996, pp. 993-1008 5/30/00 44 Bibliography--2 [9] Hao Chen, Nicholas S. Flann, and Deniel W. Watson, “Parallel Genetic Simulated Annealing: A Massively Parallel SIMD Algorithm,” IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No.2, February 1998, pp. 126-136 [10] B. Hajek, “Cooling Schedules for Optimal Annealing,” Mathematics of Operations Research, Vol 13, pp. 311-329, 1998 [11] Andrew Sohn, “Parallel N-ary Speculative Computation of Simulated Annealing, IEEE Transactions on Parallel and Distributed Systems,” Vol. 6, No. 10, October, 1995 pp. 997-1005 [12] Tarek M. Nabhan and Albert Y. Zomaya, “A parallel Simulated Annealing Algorithm with Low Communication Overhead,” Vol. 6, No. 12, December 1995, pp. 1226-1233 [13] EE. Witte, R.D. Chambelain and M.A. Franklin, “Parallel Simulated Annealing using Speculative Compuation, ” IEEE Transaction on Parallel and Distributed Systems, Vol.2, No.4, pp483-495, April. 1991. [14] Beba C. Vemuri, Chhandomay Mandal, and Shang-Hong Lai, “A Fast Gibbs Sampler for Synthesizing Constrained Fractals,” IEEE Transactions on Visualization and Computer Graphics, Vol.3, No. 4, October-December 1997, pp.337—351 [15] Fure-Ching Jeng, John W. Woods, and Sanjeev Rastoni, “Compound GaussMarkov Random Fields for Parallel Image Processing,” in Markov Random Fields— Theory and Applications, pp. 11-38, Edited by Rama Chellappa and Anil Jain, Academic Press, 1993 5/30/00 45 Bibliography--3 [16] Zhihong Yang, Ian R. Greenshields, “Volume Visible Human Data Classification with Parallel Dynamic Monte Carlo Methods,” to appear, in The 4th World Multiconference on Systemics, Cybernetics and Informatics and the 6th International Conference on Information Systems, Analysis and Synthesis. [17] Madych, W., “Some Elementary Properties of Multiresolution Analyses of,” in Wavelets: A Tutorial in Theory and Applications, Ed. C.K. Chui, Academic Press, 1992 [18] Ian R. Greenshields, Zhihong Yang, “A Multigrid Approach to the Gibbsian Classification of Mammograms,” to appear in the 13th IEEE symposium of Computer Based Medical Systems. [19] Ian R. Greenshields, “Local Moments, Contractive IFS and Multiresolution Decompositions of 3D Imagery,” Proceeding of Microscopy, Holography and Interferometry in Biomedicine, SPIE Vol. 2083, 174-183, 1993 [20] http://www.npaci.com [21] Danny B. Lange, Mitsuru Oshima, “Programming and Deploying Java Mobile Agents with Aglets,” Addison-Wesley, 1998 [22] Ian T. Foster, “Designing and Building Parallel Programs—Concepts and Tools for Parallel Software Engineering,” Addison-Wesley Publishing Company, 1994 [23] Joel A. Rosiene, Ph.D. dissertation, “Affine Transformations and Image Representation,” the University of Connecticut, 1994 [24] Thomas A. Daggett, Ph.D. dissertation, “MRF-Gibbs Context-Dependent Classification on a Small-scale Cluster Computing System,” the University of Connecticut, 1998 5/30/00 46