EVALUATION OF IMAGE SEGMENTATION METHODS

Jayaram K. Udupa
Medical Image Processing Group - Department of Radiology
University of Pennsylvania
423 Guardian Drive - 4th Floor Blockley Hall
Philadelphia, Pennsylvania - 19104-6021

CAVA
CAVA: Computer-Aided Visualization and Analysis
The science underlying computerized methods of image processing, analysis, and visualization to facilitate new therapeutic strategies, basic clinical research, education, and training.

CAD
CAD: Computer-Aided Diagnosis
The science underlying computerized methods for the diagnosis of diseases via images.

Image Segmentation
Recognition: Determining the object’s whereabouts in the scene. (humans > computer)
Delineation: Determining the object’s spatial extent and composition in the scene. (computer > humans)
In CAVA, segmentation essentially means delineation. Recognition is usually manual.

SEGMENTATION EVALUATION
Can be considered to consist of two components:
• Theoretical: Study mathematical equivalence among algorithms.
• Empirical: Study practical performance of algorithms in specific application domains.

SEGMENTATION EVALUATION: Theoretical
Segmentation approaches may be broadly classified into two groups:
• pI approaches: Purely image based – rely mostly on information available in the given image only.
• SM approaches: Shape model based – employ prior shape models for the objects of interest.

SEGMENTATION EVALUATION: Theoretical
pI approaches
• Boundary-based: optimum boundary, active contours/surfaces, level sets
• Region-based: clustering – kNN, CM, FCM; graph cut; fuzzy connectedness; MRF; watersheds; optimum partitioning (Mumford-Shah, Chan-Vese)
SM approaches
• manual tracing, live wire, Active Shape, Active Appearance, m-Reps, atlas-based

SEGMENTATION EVALUATION: Theoretical
Fundamental challenges in image segmentation:
(Ch1) Are major pI frameworks such as active contours, level sets, graph cuts, fuzzy connectedness, and watersheds truly distinct, or does some level of equivalence exist among them?
(Ch2) How to develop truly distinct methods that constitute a real advance?
(Ch3) How to choose a method for a given application domain?
(Ch4) How to set an algorithm optimally for an application domain?
Currently any method A can be shown empirically to be better than any method B, even when they are equivalent.

SEGMENTATION EVALUATION: Theoretical
A general theory of image segmentation:
An idealized image F: a function defined on $\Omega$, where $\Omega$ is a bounded open subset of $\mathbb{R}^n$.
A digital image f: a function defined on C. f is a digitization of F. C is a subset of $\Omega$.
A delineation model M: a mapping $\langle F, p \rangle \mapsto O$. O is a segment of image F, p is a parameter vector.
Ciesielski, Udupa, SPIE Proceedings 6512:65120W-1-65120W-12, 2007.
Ciesielski, Udupa, MIPG Technical Report 335, U of Pennsylvania, November 2007.

SEGMENTATION EVALUATION: Theoretical
A delineation algorithm A: a mapping $\langle f, \theta \rangle \mapsto S$. $\theta$ is a parameter vector, $S \subset C$.
Algorithm A represents model M: a limiting process. As the resolution of f increases, S approaches O:
$\lim A(f, \theta) = M(F, p)$.
Algorithms A1 and A2 are model-equivalent if there exists a model M such that both A1 and A2 represent M.

SEGMENTATION EVALUATION: Theoretical
(1) Theorem: The Malladi-Sethian-Vemuri (PAMI-17, 1995) level set algorithm is model-equivalent to the Udupa-Samarasekera (GMIP-58, 1996) fuzzy connectedness algorithm with gradient-based fuzzy affinity. (The FC method has definite computational advantages over LS.)
(2) Audigier and Lotufo have shown, by a different approach (the Image Foresting Transform), equivalence between particular forms of watershed and fuzzy connectedness.
SEGMENTATION EVALUATION: Theoretical
Attributes used by some well known delineation models

| Model | Connectedness | Gradient | Texture | Smoothness | Shape | Noise | Optimization |
|---|---|---|---|---|---|---|---|
| Fuzzy Connectedness | Yes | Gradient + homogeneity affinity | Object feature affinity | No | No | Scale based FC | In RFC |
| Chan-Vese | No | No | Yes | Yes | No | No | Yes |
| Mumford-Shah | No | No (not for edge detection) | Yes | Yes | No | Yes | Yes |
| KWT snake | Boundary | Yes | No | Yes | No | No | Yes |
| Malladi-Sethian-Vemuri LS | Foreground when expanding | Yes | No | No | No | No | No |
| Live wire | Boundary | Yes | Yes | Yes | User | No | Yes |
| Active shape | Yes | No | No | No | Yes | No | Yes |
| Active appearance | Yes | No | Yes | No | Yes | No | Yes |
| Graph cut | Usually not | Yes | Possible | No | Usually not | No | Yes |
| Clustering | No | No | Yes | No | No | No | Yes |

SEGMENTATION EVALUATION: Empirical
Need to specify the application domain:
T: A task - Example: Estimating the volume of brain.
B: A body region - Example: Head.
P: Imaging protocol - Example: T2 weighted MR imaging with a particular set of parameters.
Application domain: A particular triple $\langle T, B, P \rangle$.
From now on, we denote a digital image by $\mathcal{C} = (C, f)$.

SEGMENTATION EVALUATION: Empirical
The segmentation efficacy of a method M in an application domain $\langle T, B, P \rangle$ may be characterized by three groups of factors:
Precision (Reliability): Repeatability taking into account all subjective actions influencing the result.
Accuracy (Validity): Degree to which the result agrees with truth.
Efficiency (Viability): Practical viability of the method.
Udupa et al., Computerized Medical Imaging and Graphics, 30:75-87, 2006.

SEGMENTATION EVALUATION: Empirical
For determining accuracy, we need true delineations or surrogates of them.
$S$: A given set of images in $\langle T, B, P \rangle$.
$S_{td}$: The corresponding set of images with true delineations.
(1) Manual delineation in the images in $S$ – trace or paint to obtain $S_{td}$.
(2) Simulated images I: Create an ensemble of “cut-outs” of the object from different images and bury them realistically in different images to obtain $S$. The cut-outs are segmented carefully to obtain $S_{td}$.
Figure: A slice (a) of an image simulated from an acquired MR proton density image of a Multiple Sclerosis patient’s brain and its “true” segmentation (b) of the lesions.

(3) Simulated Images II: Start from (binary/fuzzy) objects ($S_{td}$) segmented from real images. Add intensity contrast, blur, noise, and background variation realistically to obtain $S$.
Figure: White matter (WM) in a gray matter background, simulated by segmenting WM from real MR images and by adding blur, noise, and background variation to various degrees: (a) low, (b) medium, and (c) high.

(4) Simulated Images III: As in (3) or (1), but apply realistic deformations to the images in $S$ and $S_{td}$.
Figure: Simulating more images (c) and their “true” segmentations (d) from existing images (a) and their manual segmentations (b) by applying known realistic deformations.

(5) Simulated Images IV: Start from realistic mathematical phantoms ($S_{td}$). Simulate the imaging process with noise, blur, background variation, etc. to create the images $S$.
http://www.bic.mni.mcgill.ca/brainweb/

(6) Estimating surrogate segmentations from manual segmentations. Have many manual segmentations for each image in $S$. Estimate the segmentation that represents the best estimate of truth to obtain $S_{td}$.
Warfield, S.K., Zou, K.H., Wells, W.M.: “Simultaneous Truth and Performance Level Estimation (STAPLE): An Algorithm for the Validation of Image Segmentation.” IEEE Trans Med Imaging 23(7):903-921, 2004.

SEGMENTATION EVALUATION: Empirical
Precision
Repeatability taking into account all subjective actions that influence the segmentation result.
(1) Intra operator variations
(2) Inter operator variations
(3) Intra scanner variations
(4) Inter scanner variations
Inter scanner variations include variations among scanners of the same brand and of different brands.
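
The "Simulated Images II" recipe above (start from a segmented object, then add contrast, blur, noise, and background variation) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the procedure used to produce the figures: the function names `box_blur` and `simulate_scene`, the box-filter blur, the Gaussian noise, the linear background ramp, and all default intensity values are hypothetical choices made for the example.

```python
import numpy as np

def box_blur(image, radius):
    """Separable box blur with edge padding (a stand-in for scanner blur)."""
    out = np.asarray(image, dtype=float)
    if radius <= 0:
        return out.copy()
    size = 2 * radius + 1
    kernel = np.ones(size) / size
    for axis in range(out.ndim):
        pad = [(radius, radius) if a == axis else (0, 0) for a in range(out.ndim)]
        padded = np.pad(out, pad, mode="edge")
        # 'valid' convolution of the padded line restores the original length.
        out = np.apply_along_axis(np.convolve, axis, padded, kernel, mode="valid")
    return out

def simulate_scene(truth, object_mean=100.0, background_mean=60.0,
                   blur_radius=1, noise_sd=5.0, ramp=10.0, seed=0):
    """Build a simulated image from a binary/fuzzy true delineation.

    All parameter values are illustrative assumptions, not values from the slides.
    """
    truth = np.asarray(truth, dtype=float)
    # Intensity contrast between object and background.
    scene = background_mean + (object_mean - background_mean) * truth
    # Blur.
    scene = box_blur(scene, blur_radius)
    # Slowly varying background component: a linear ramp along the last axis.
    scene = scene + np.linspace(-ramp / 2, ramp / 2, truth.shape[-1])
    # Additive Gaussian noise.
    rng = np.random.default_rng(seed)
    return scene + rng.normal(0.0, noise_sd, size=truth.shape)

# Usage: a square "object" plays the role of the true delineation.
truth = np.zeros((16, 16))
truth[4:12, 4:12] = 1
scene = simulate_scene(truth)
```

Varying `blur_radius`, `noise_sd`, and `ramp` gives the low, medium, and high degradation levels; the pair (`scene`, `truth`) then serves as one image with a known delineation.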
SEGMENTATION EVALUATION: Empirical - Precision
A measure of precision for method M in a trial that produces $C_M^{O_1}$ and $C_M^{O_2}$ for situation $T_i$ is given by
$PR_M^{T_i} = \dfrac{|C_M^{O_1} \cap C_M^{O_2}|}{|C_M^{O_1} \cup C_M^{O_2}|}$, $i = 1, 2$ (intra/inter operator),
$PR_M^{T_i} = 1 - \dfrac{\bigl|\,|C_M^{O_1}| - |C_M^{O_2}|\,\bigr|}{|C_M^{O_1}| + |C_M^{O_2}|}$, $i = 3, 4$ (intra/inter scanner).
Surrogates of truth are not needed.

SEGMENTATION EVALUATION: Empirical
Accuracy
The degree to which segmentations agree with the true segmentation. Surrogates of truth are needed.
For any scene $\mathcal{C}$ acquired for application domain $\langle T, B, P \rangle$,
$C_M^O$ - fuzzy segmentation of O in $\mathcal{C}$ by method M,
$C_{td}$ - surrogate of true delineation of O in $\mathcal{C}$.

SEGMENTATION EVALUATION: Empirical – Accuracy
$FNVF_M^d = \dfrac{|C_{td} - C_M^O|}{|C_{td}|}$, $\quad FPVF_M^d = \dfrac{|C_M^O - C_{td}|}{|U_d - C_{td}|}$,
$TPVF_M^d = \dfrac{|C_{td} \cap C_M^O|}{|C_{td}|}$, $\quad TNVF_M^d = \dfrac{|U_d - (C_M^O \cup C_{td})|}{|U_d - C_{td}|}$.
$U_d$: A binary image representing a reference superset (for example, the imaged body region).
$FNVF_M^d$: Amount of tissue truly in O that is missed by M.
$FPVF_M^d$: Amount of tissue falsely delineated by M.

SEGMENTATION EVALUATION: Empirical – Accuracy
Requirements for accuracy metrics:
(1) Capture M’s behavior of trade-off between FP and FN.
(2) Satisfy fractional relations: $FNVF_M^d = 1 - TPVF_M^d$, $FPVF_M^d = 1 - TNVF_M^d$.
(3) Capable of characterizing the range of behavior of M.
(4) Boundary-based FN and FP metrics may also be devised.
(5) Any monotonic function g(FNVF, FPVF) is fine as a metric.
(6) Appropriate for $\langle T, B, P \rangle$.

SEGMENTATION EVALUATION: Empirical – Accuracy
Delineation Operating Characteristic (DOC)
Each value of the parameter vector of M gives a point on the DOC curve.
The DOC curve characterizes the behavior of M over a range of parametric values of M.
Figure: DOC curve for brain WM segmentation in PD MRI images, plotting 1 − FNVF against FPVF.
$A_M$: Area under the DOC curve.

SEGMENTATION EVALUATION: Empirical
Efficiency
Describes practical viability of a method.
Four factors should be considered:
(1) Computational time – for one-time training of M: $t_M^{c1}$
(2) Computational time – for segmenting each scene: $t_M^{c2}$
(3) Human time – for one-time training of M: $t_M^{h1}$
(4) Human time – for segmenting each scene: $t_M^{h2}$
(2) and (4) are crucial. (4) determines the degree of automation of M.

Summary
Accuracy:
$FNVF_M^d$: FN fraction for delineation
$FPVF_M^d$: FP fraction for delineation
$A_M$: Area under the DOC curve
Precision:
$PR_M^{T_1}$: intra operator
$PR_M^{T_2}$: inter operator
$PR_M^{T_3}$: intra scanner
$PR_M^{T_4}$: inter scanner
Efficiency:
$t_M^{c1}$: computational time for algorithm training.
$t_M^{c2}$: computational time for scene segmentation.
$t_M^{h1}$: operator time for algorithm training.
$t_M^{h2}$: operator time for scene segmentation.

SEGMENTATION EVALUATION: Empirical
Software Systems for Segmentation

| Software | OS | Cost | Tools |
|---|---|---|---|
| 3D Doctor [162] | W | fee | Manual tracing |
| 3D Slicer [163] | W, L, U | no fee | Manual, EM methods, level sets |
| 3DVIEWNIX [164] | L, U | binary no fee | Manual, optimal thresh., FC family, live wire family, fuzzy thresh., clustering, live snake |
| Amira [165] | W, L, U, M | fee | Manual, snakes, region growing, live wire |
| Analyze [166] | W, L, U | fee | Manual, region growing, contouring, math morph, interface to ITK |
| Aquarius [167] | Unknown | fee | Unknown |
| Brain Voyager [168] | W, L, U | fee | Thresholding, region growing, histogram methods |
| CAVASS [169] | W, L, U, M | no fee | Manual, opt thresh., FC family, live wire family, fuzzy thresh., clustering, live snake, active shape, interface to ITK |
| etdips [170] | W | no fee | Manual, thresholding, region growing |
| Freesurfer [171] | L, M | no fee | Atlas-based (for brain MRI) |
| Advantage Windows | U, W | fee | Unknown |
| Image Pro [172] | W | fee | Color histogram |

SEGMENTATION EVALUATION: Empirical
Software Systems (cont’d)

| Software | OS | Cost | Tools |
|---|---|---|---|
| Imaris [173] | W | fee | Thresholding (microscopic images) |
| ITK [174] | W, L, U, M | no fee | Thresh., level sets, watershed, fuzzy connectedness, active shape, region growing, etc. |
| MeVisLab [175] | W, L | binary no fee | Manual, thresh., region growing, fuzzy connectedness, live wire |
| MRVision [176] | L, U, M | fee | Manual, region growing |
| Osiris [177] | W, M | no fee | Thresholding, region growing |
| RadioDexter [178] | Unknown | fee | Unknown |
| SurfDriver [179] | W, M | fee | Manual |
| SliceOmatic [180] | W | fee | Thresholding, watershed, region growing, snakes |
| Syngo InSpace [181] | Unknown | fee | Automatic bone removal |
| VIDA [182] | Unknown | fee | Manual, thresholding |
| Vitrea [183] | Unknown | fee | Unknown |
| VolView [184] | W, L, U | fee | Level sets, region growing, watershed |
| Voxar [185] | W, L, U | fee | Unknown |

SEGMENTATION EVALUATION: Empirical
Publicly Available Data Sets

| Data set | Description | True segmentation | Number of images | Use |
|---|---|---|---|---|
| BrainWeb [186] | Simulated brain T1, T2, PD MR images. Objects: CSF, GM, WM, vessels, skull, ... | binary, fuzzy | 20 (3D) | CAVA |
| DDSM [187] | Digital database for screening mammography. Objects: lesions | no | 2,500 (2D) | CAD |
| ICBM [188] | International consortium for brain mapping, MRI, images warped to template | binary | 3,000 (3D) | CAVA |
| LIDC [189] | Lung spiral CT images. Objects: nodules | binary (4 readers) | 85 (3D) | CAD |
| OAI [190] | Osteoarthritis initiative, x-ray and MRI knee images | no | 160 (2D, 3D) | CAVA |
| RIDER [191] | Chest CT images over time of lung cancer patients, radiation therapy follow-up | no | 140 (3D) | CAD |
| VCC [192] | Virtual colonoscopy; CT images of colon | no | 835 (3D) | CAD |
| VH [193-195] | Visible human data sets; whole body sectional, CT, and MR images | binary | 2 (3D) | CAVA |

SEGMENTATION EVALUATION: Empirical
An evaluation framework for CAVA should consist of:
(FW1) Real life image data for several application domains $\langle T, B, P \rangle$.
(FW2) Reference segmentations (of all images) that can be used as surrogates of true segmentations.
(FW3) Specification of computable, effective, meaningful metrics for precision, accuracy, efficiency.
(FW4) Several reference segmentation methods optimized for each $\langle T, B, P \rangle$.
(FW5) Software incorporating (FW1) – (FW4).
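
Item (FW3) asks for computable metrics. As a rough sketch of what that could look like for the precision and accuracy factors, the functions below compute the overlap-based precision and the four volume fractions for binary masks. The assumptions are worth stating: the slides define these measures for fuzzy segmentations, whereas this example handles only binary arrays; the overlap form $|A \cap B| / |A \cup B|$ is used for precision; the function names `precision` and `accuracy` and the dictionary keys are hypothetical; and the truth and its complement within the reference superset are assumed to be non-empty.

```python
import numpy as np

def precision(seg1, seg2):
    """Overlap of two repeated segmentations of the same scene: |A ∩ B| / |A ∪ B|."""
    a = np.asarray(seg1, dtype=bool)
    b = np.asarray(seg2, dtype=bool)
    union = np.logical_or(a, b).sum()
    # Two empty segmentations agree perfectly (a convention chosen for this sketch).
    return 1.0 if union == 0 else float(np.logical_and(a, b).sum() / union)

def accuracy(seg, truth, universe=None):
    """Volume fractions of a binary segmentation against a surrogate of truth.

    `universe` plays the role of the reference superset; it defaults to the
    whole array. Assumes non-empty truth and non-empty background.
    """
    s = np.asarray(seg, dtype=bool)
    t = np.asarray(truth, dtype=bool)
    u = np.ones_like(t) if universe is None else np.asarray(universe, dtype=bool)
    s, t = s & u, t & u
    n_true = t.sum()
    n_background = (u & ~t).sum()
    return {
        "FNVF": float((t & ~s).sum() / n_true),            # truth missed by the method
        "FPVF": float((s & ~t).sum() / n_background),      # falsely delineated
        "TPVF": float((t & s).sum() / n_true),
        "TNVF": float((u & ~s & ~t).sum() / n_background),
    }

# Usage on a tiny 1D example: one voxel missed, one voxel falsely included.
truth = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
seg = np.array([0, 1, 1, 1, 1, 0, 0, 0, 0, 0])
fractions = accuracy(seg, truth)
overlap = precision(seg, truth)
```

With these definitions the fractional relations FNVF = 1 − TPVF and FPVF = 1 − TNVF hold by construction, which is a convenient sanity check when adapting the sketch to fuzzy segmentations.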
SEGMENTATION EVALUATION: Empirical
Remarks
(1) Precision, accuracy, efficiency are interdependent.
• accuracy precision and efficiency.
• accuracy difficult.
(2) “Automatic segmentation method” has no meaning unless the results are proven on a large number of data sets with acceptable precision, accuracy, efficiency, and with $t_M^{h2} = 0$.
(3) A descriptive answer to “is method M1 better than M2 under $\langle T, B, P \rangle$?” in terms of the 11 parameters is more meaningful than a “yes” or “no” answer.
(4) DOC is essential to describe the range of behavior of M.

Concluding Remarks
(1) Need unifying segmentation theories that can explain equivalences/distinctness of existing algorithms. This can ensure true advances in segmentation.
(2) Need evaluation frameworks with FW1-FW5. This can standardize methods of empirical comparison of competing and distinct algorithms.
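
Remark (4) on the DOC can be made concrete with a toy method. In the sketch below the "method M" is plain intensity thresholding and its single parameter is the threshold; both are stand-ins chosen for illustration, not methods evaluated in the slides. Each threshold yields one (FPVF, 1 − FNVF) point, and the area under the resulting curve is estimated with the trapezoidal rule. The names `doc_curve` and `area_under_doc`, and the choice to anchor the curve at (0, 0) and (1, 1), are assumptions of this example.

```python
import numpy as np

def doc_curve(scene, truth, thresholds):
    """One (FPVF, 1 - FNVF) point per parameter value of a thresholding method."""
    scene = np.asarray(scene, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    points = []
    for th in thresholds:
        seg = scene >= th                      # hypothetical method M with parameter th
        fpvf = (seg & ~truth).sum() / (~truth).sum()
        tpvf = (seg & truth).sum() / truth.sum()   # equals 1 - FNVF
        points.append((float(fpvf), float(tpvf)))
    return sorted(points)

def area_under_doc(points):
    """Trapezoidal area under the DOC points, anchored at (0, 0) and (1, 1)."""
    pts = sorted([(0.0, 0.0)] + list(points) + [(1.0, 1.0)])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Usage: an object that is perfectly separable by intensity.
scene = np.array([0.1, 0.2, 0.8, 0.9])
truth = np.array([0, 0, 1, 1])
curve = doc_curve(scene, truth, thresholds=[0.05, 0.5, 0.95])
a_m = area_under_doc(curve)
```

For a real comparison the thresholding step would be replaced by the method under study, run over the range of its parameter vector on every scene in the data set.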