High Performance Kernel Smoothing Library For Biomedical Imaging

A Thesis Presented by Haofu Liao to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering.

Northeastern University, Boston, Massachusetts, May 2015

NORTHEASTERN UNIVERSITY
Graduate School of Engineering — Thesis Signature Page

Thesis Title: High Performance Kernel Smoothing Library For Biomedical Imaging
Author: Haofu Liao
Department: Electrical and Computer Engineering
NUID: 001988944
Approved for Thesis Requirements of the Master of Science Degree

Thesis Advisor: Dr. Deniz Erdogmus
Thesis Committee Member or Reader: Dr. David R. Kaeli
Thesis Committee Member or Reader: Dr. Gunar Schirner
Thesis Committee Member or Reader: Dr. Rafael Ubal
Department Chair: Dr. Sheila S. Hemami
Associate Dean of Graduate School: Dr. Sara Wadia-Fascetti

Contents

List of Figures
List of Tables
Abstract of the Thesis

1 Introduction
  1.1 Background
  1.2 Related Work
  1.3 Contributions
  1.4 Outline of the Thesis

2 Kernel Smoothing
  2.1 Univariate Kernel Density Estimation
  2.2 Multivariate Kernel Density Estimation
  2.3 Kernel Functions
    2.3.1 Univariate Kernels
    2.3.2 Separable Multivariate Kernels
  2.4 Bandwidth
    2.4.1 Types of Bandwidth
    2.4.2 Variable Bandwidth
  2.5 Kernel Density Derivative Estimation

3 Vesselness Measure
  3.1 Gradients and Hessian Matrices of Images
    3.1.1 Gradient
    3.1.2 Hessian
  3.2 Finding 1st and 2nd Order Derivatives From Images
    3.2.1 Gradient Operator
    3.2.2 Gaussian Smoothing
    3.2.3 Kernel Density Derivative Estimation
  3.3 Frangi Filtering
  3.4 Ridgeness Filtering

4 GPU Architecture and Programming Model
  4.1 GPU Architecture
  4.2 Programming Model
  4.3 Thread Execution Model
  4.4 Memory Accesses
5 Algorithms and Implementations
  5.1 Efficient Computation of Separable Multivariate Kernel Derivative
    5.1.1 Definitions and Facts
    5.1.2 Algorithm
    5.1.3 Complexity Analysis
  5.2 High Performance Kernel Density and Kernel Density Derivative Estimators
    5.2.1 Multi-core CPU Implementation
    5.2.2 GPU Implementation in CUDA
  5.3 Efficient k-Nearest Neighbors Bandwidth Selection For Images
    5.3.1 k-Nearest Neighbors Covariance Matrix of Images
    5.3.2 r-Neighborhood Covariance Matrix of Images
    5.3.3 Algorithm
    5.3.4 GPU Implementation

6 Experiments and Results
  6.1 Environment
  6.2 Performance Evaluation
    6.2.1 Efficient SMKD
    6.2.2 High Performance KDE and KDDE
    6.2.3 Efficient k-NN Bandwidth Selector
  6.3 Vesselness Measure
    6.3.1 Frangi Filtering
    6.3.2 Ridgeness Filtering

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work

Bibliography

List of Figures

2.1 The relation between under-five mortality rate and life expectancy at birth
2.2 Univariate kernel density estimate
2.3 Multivariate kernel density estimate
2.4 Truncated Gaussian kernel function
2.5 Univariate kernel density estimates of different bandwidths
2.6 Comparison of three bandwidth matrix parametrization classes
2.7 Univariate sample point kernel density estimate
3.1 Gradient of the standard Gaussian function
3.2 Image gradient
3.3 Visualized eigenvalues with ellipsoid
3.4 Derivatives of Gaussian filters
3.5 Vesselness measure using Frangi filter
4.1 GPU block diagram
4.2 GPU hardware memory hierarchy
4.3 Programming model
4.4 GPU software memory hierarchy
4.5 Warp scheduler
4.6 Aligned and consecutive memory access
4.7 Misaligned memory access
5.1 Relation between nodes in graph G
5.2 Graph based efficient multivariate kernel derivative algorithm
5.3 Memory access patterns of matrices and cubes
5.4 Memory access pattern without using shared memory
5.5 Memory access pattern using shared memory
5.6 The covariance and disk operators of r = 4
5.7 Searching circles of different radii
6.1 Multiplication number comparison between the naive method and the proposed efficient method
6.2 Execution time comparison between the naive method and the proposed efficient method
6.3 The comparison of speed-ups between different optimization methods on synthetic 2D data
6.4 The comparison of speed-ups between different optimization methods on synthetic 3D data
6.5 Performance of the k-NN bandwidth selector on 2D images using the naive algorithm and the CPU efficient algorithm
6.6 Performance of the k-NN bandwidth selector on 3D images using the naive algorithm, the CPU efficient algorithm, and the GPU efficient algorithm
6.7 Vesselness measure results using Frangi filter
6.8 Algorithm pipeline of the ridgeness filtering based vessel segmentation
6.9 Vesselness measure results using ridgeness filter

List of Tables

3.1 Possible orientation patterns in 2D and 3D images
4.1 Compute capability of Fermi and Kepler GPUs
6.1 Experiment environment
6.2 Global memory transactions between different optimization methods

Abstract of the Thesis

High Performance Kernel Smoothing Library For Biomedical Imaging
by Haofu Liao
Master of Science in Electrical and Computer Engineering
Northeastern University, May 2015
Dr. Deniz Erdogmus, Adviser

The estimation of probability density functions and their derivatives has broad potential for applications. In biomedical imaging, the estimation of the first and second derivatives of the density is crucial for extracting tubular structures, such as blood vessels and neuron traces. Probability densities and their derivatives are often estimated using nonparametric, data-driven methods. Among the most popular nonparametric methods are Kernel Density Estimation (KDE) and Kernel Density Derivative Estimation (KDDE). However, a serious drawback of KDE and KDDE is their intensive computational requirements, especially for large data sets.
In this thesis, we develop a high performance kernel smoothing library to accelerate KDE and KDDE methods. A series of hardware optimizations is used to deliver high performance code. On the host side, multi-core platforms and parallel programming frameworks are used to accelerate the execution of the library. For 2- or 3-dimensional data points, the Graphics Processing Unit (GPU) platform is used to provide high levels of performance for the kernel density estimators, kernel gradient estimators, and kernel curvature estimators, and several Compute Unified Device Architecture (CUDA) based techniques are used to optimize their performance. In addition, a graph-based algorithm is designed to calculate the derivatives efficiently, and a fast k-nearest neighbor bandwidth selector is designed to speed up variable bandwidth selection for image data on the GPU.

Chapter 1 Introduction

1.1 Background

Density estimation constructs an estimate of an underlying probability density function from an observed data set. There are three types of approaches to density estimation: parametric, semi-parametric, and nonparametric. Both parametric and semi-parametric techniques require prior knowledge of the underlying distribution of the sample data. In parametric approaches, the data should come from a known family of distributions. In semi-parametric approaches, knowledge of the mixture distribution is assumed. On the contrary, nonparametric methods, which attempt to flexibly estimate an unknown distribution, require less structural information about the underlying distribution. This advantage makes them a good choice for robust and more accurate analysis.

Kernel density estimation (KDE) is the most widely studied and used nonparametric technique. It was first introduced by Rosenblatt [1] and then discussed in detail by Parzen [2]. Typically, a kernel density estimate is constructed as a sum of kernel functions centered at the observed data points, and a smoothing parameter called the bandwidth is used to control the smoothness of the estimated density. KDE has a broad range of applications, such as image processing, medical monitoring, and market analysis.

On the other hand, the estimation of density derivatives, though it has received relatively scant attention, also has great potential for applications. Indeed, nonparametric estimation of higher order derivatives of the density function can provide a lot of important information about a multivariate data set, such as local extrema, valleys, ridges, or saddle points. In the gradient estimation case, the well known mean-shift algorithm can be used for clustering and data filtering; it is very popular in the areas of low-level vision problems, discontinuity preserving smoothing, and image segmentation. Another use of gradient estimation is to find filaments in point clouds, which has applications in medical imaging, remote sensing, seismology, and cosmology. In the Hessian estimation case, the eigenvalues of the Hessian matrix are crucial to manifold extraction and curvilinear structure analysis. Moreover, the prevalent Frangi filter [3] and its variants also require the calculation of the Hessian matrix.

The smoothing parameter, or bandwidth, plays a very important role in KDE and kernel density derivative estimation: it determines the performance of the estimator in practice. However, in most cases only a constrained bandwidth is used.
In the unconstrained case, the bandwidth is a full symmetric positive definite matrix; it allows the kernel estimator to smooth in any direction, whether aligned with the coordinate axes or not. In an even simpler case, the bandwidth matrix is only a positive scalar multiple of the identity matrix. There are three reasons for the wide use of simpler parameterizations over the unconstrained counterpart. First, in practice they require fewer smoothing parameters to be tuned. Second, the mathematical analysis of estimators with unconstrained bandwidths is considerably more difficult. Third, the unconstrained bandwidth is not suitable for most existing fast KDE algorithms.

1.2 Related Work

Since the 1980s, KDE has become the de facto nonparametric method for representing a continuous distribution from a discrete point set. However, a serious drawback of KDE methods is the expensive computation required to evaluate the probability at each target data vector. A typical KDE method has computational order O(n^2 k), where n is the number of observations and k is the number of variables. In many cases, such as database management and wildlife ecology, n can be as large as hundreds of millions. Moreover, data-driven methods of bandwidth selection can add an additional order of computational burden to KDE.

Currently, there are two different approaches to satisfying the computational demands of KDE. The first is to use approximation techniques to reduce the computational burden of kernel estimation. In 1982, Silverman [4] proposed a fast density estimation method based on the Fast Fourier Transform (FFT). However, this method requires the source points to be distributed on an evenly spaced grid, and it can only compute univariate kernels. In 1994, Wand [5] extended Silverman's method to the multivariate case and proposed the well-known binned estimation method, but it still requires a binned data set. Another approach was proposed by Elgammal [6], who designed a Fast Gauss Transform (FGT) method in which the data need not lie on a grid; the problem is that the computational and storage complexity of the FGT grows exponentially with dimension. Therefore, Changjiang Yang et al. [7] proposed an Improved Fast Gauss Transform (IFGT), which can efficiently evaluate sums of Gaussians in higher dimensions. Both algorithms, however, are limited to the Gaussian kernel. Moreover, Sinha and Gupta [8] proposed a new fast KDE algorithm based on PDDP, which they claim is more accurate and efficient than the IFGT. Recently, an ε-sample algorithm was proposed by Phillips [9]. His algorithm studies the worst case error of kernel density estimates via subset approximation, which can be helpful for sampling large data sets and hence can lead to fast kernel density estimates.

The second approach is to use parallel computing. Some of the most important parallel computing technologies are cluster computing, multicore computing, and general-purpose computing on graphics processing units (GPGPU). For cluster computing, Zheng et al. [10] implemented kernel density estimation on Hadoop cluster machines using MapReduce as the distributed and parallel programming framework. Łukasik [11] and Racine [12] presented parallel methods based on the Message Passing Interface standard in a multicomputer environment.
For multicore computing, Michailidis and Margaritis [13] parallelized kernel estimation methods on multi-core platforms using different programming frameworks such as Pthreads, OpenMP, Intel Cilk++, Intel TBB, SWARM, and FastFlow. The same authors also presented some preliminary work on kernel density estimation using a GPU approach [14]. Recently, Andrzejewski et al. [15] proposed a GPU based algorithm to accelerate the bandwidth selection methods of kernel density estimators. However, all of these authors ignore the more complicated unconstrained bandwidth case for multivariate kernel density estimation, and kernel density derivative estimation is not considered either.

1.3 Contributions

We developed a highly efficient and flexible kernel smoothing library. The library supports both univariate and multivariate kernels. Unlike other existing kernel smoothing libraries [16, 17, 18, 19], it supports not only the constrained (restricted) bandwidth but also the more general unconstrained bandwidth. The bandwidth, both constrained and unconstrained, is not limited to being fixed: a sample-point based variable bandwidth is supported as well. The input data have no dimensional limitation; as long as the hardware permits, the library can support data of any dimension. To improve computational efficiency, kernel functions with finite support can be used, in which case only the data points within the kernel function's support are evaluated. Besides the kernel density estimators, kernel density derivative estimators are implemented as well. For separable kernel functions, the library is able to calculate derivatives of any order, and a graph based algorithm is designed to calculate these derivatives efficiently.

A series of hardware optimizations is used to deliver high performance code. On the host side, multi-core platforms and parallel programming frameworks are used to accelerate the execution of the library. For 2- or 3-dimensional data points, the GPU platform is used to speed up the kernel density estimators, kernel gradient estimators, and kernel curvature estimators, and several CUDA based algorithms are designed to optimize their performance. Finally, an efficient k-nearest neighbor based variable bandwidth selector is designed for image data, and a high performance CUDA algorithm is implemented for this selector.

1.4 Outline of the Thesis

This thesis is organized as follows. In Chapter 2, we discuss the background of kernel smoothing theory in detail: we introduce both the univariate and multivariate KDE methods, provide the direct calculation of separable multivariate kernels and kernel derivatives, present a variety of bandwidth types, and give the formal definition of KDDE methods. In Chapter 3, we first introduce the gradients and Hessian matrices of images, then discuss three ways of finding 1st and 2nd order derivatives from images, and finally present two vesselness measure algorithms that use the gradients and Hessian matrices of images. Chapter 4 gives a detailed introduction to the GPU architecture and the CUDA programming framework. We present the three major contributions of this thesis in Chapter 5 and discuss their performance in Chapter 6. Finally, conclusions and future work are given in Chapter 7.

Chapter 2 Kernel Smoothing

Given data X_1, X_2, ..., X_n drawn from a density f, how do we estimate the probability density function f from these observations?
Figure 2.1: The relation between under-five mortality rate and life expectancy at birth in different countries and regions. The original data is from the Department of Economic and Social Affairs, United Nations.

2.1 Univariate Kernel Density Estimation

Given a set of n independent and identically distributed (i.i.d.) random samples X_1, X_2, ..., X_n from a common density f, the univariate kernel density estimator is

    \hat{f}(x; h) = n^{-1} \sum_{i=1}^{n} h^{-1} K\big(h^{-1}(x - X_i)\big).    (2.1)

Here K is a kernel function which satisfies \int K(x)\,dx = 1, and h > 0 is a smoothing parameter called the bandwidth. By introducing the rescaling notation K_h(u) = h^{-1} K(h^{-1} u), the above formula can be written in a more compact way:

    \hat{f}(x; h) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i).    (2.2)

As we can see from Equation (2.2), the kernel density estimate is a summation of scaled kernel functions, each carrying a probability mass of n^{-1}. Intuitively, we can view this as a sum of 'bumps' placed at the observation points X_1, X_2, ..., X_n. The kernel function K determines the shape of the bumps, while the bandwidth h determines their width. An illustration is given in Figure 2.2, where the observations X_i are marked as dots on the x-axis and their corresponding scaled kernel 'bumps' n^{-1} K_h(x - X_i) are shown as dashed lines. Here, the kernel K is chosen to be the standard normal pdf N(0, 1). The resulting univariate kernel density estimate \hat{f} is shown as the solid line. The estimate is bimodal, which reflects the distribution of the observations. Usually it is not appropriate to construct a density estimate from such a small number of samples, but a sample size of 5 has been chosen here for the sake of clarity.

As illustrated in Figure 2.2, the value of the kernel estimate at a point x is simply the average of the n kernel ordinates at that point. The estimate combines contributions from each data point; hence, in regions with many observations, the estimate takes relatively large values. This is consistent with the fact that a densely sampled region has a high probability density, and vice versa. Notice that in this case the scaled kernel K_h is simply the N(0, h^2) density, so the bandwidth parameter h can be seen as a scaling factor which determines the spread of the kernel. In general, the bandwidth controls the amount of smoothing of kernel density estimators; it is the most important factor in KDE and KDDE. We will cover bandwidths in more detail in Section 2.4.

Figure 2.2: Univariate kernel density estimate: dots on x-axis, sample (training) points; solid line, kernel density estimate; dashed lines, scaled kernels at different sample points.

2.2 Multivariate Kernel Density Estimation

The d-dimensional multivariate kernel density estimator, for a set of n i.i.d. samples X_1, X_2, ..., X_n from a common density f, is

    \hat{f}(x; H) = n^{-1} \sum_{i=1}^{n} K_H(x - X_i),    (2.3)

where
- x = (x_1, x_2, ..., x_d)^T and X_i = (X_{i1}, X_{i2}, ..., X_{id})^T, i = 1, 2, ..., n;
- K is the unscaled kernel, which is usually a spherically symmetric probability density function;
- K_H is the scaled kernel, related to the unscaled kernel by K_H(x) = |H|^{-1/2} K(H^{-1/2} x);
- H is the d × d bandwidth matrix, which is non-random, symmetric, and positive definite.
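To make Equation (2.3) concrete, the following minimal Python/NumPy sketch evaluates a multivariate Gaussian kernel density estimate directly by summing scaled kernels over the training points. It is an illustrative transcription of the formula, not the library implementation described in later chapters, and all function and variable names are placeholders.

    import numpy as np

    def gaussian_kernel(u):
        # Spherically symmetric standard Gaussian kernel, evaluated row-wise on u of shape (n, d).
        d = u.shape[-1]
        return np.exp(-0.5 * np.sum(u * u, axis=-1)) / (2.0 * np.pi) ** (d / 2)

    def kde(test_points, train_points, H):
        # Direct evaluation of Equations (2.3)/(2.4): f_hat(x_i) = n^-1 sum_j K_S(x_i - X_j),
        # with a scale matrix S satisfying H^-1 = S^T S and K_S(x) = |S| K(S x).
        S = np.linalg.inv(np.linalg.cholesky(H))     # one valid scale; any S with S^T S = H^-1
        det_S = abs(np.linalg.det(S))                # works for a spherically symmetric K
        estimates = np.empty(len(test_points))
        for i, x in enumerate(test_points):
            u = (x - train_points) @ S.T             # S (x_i - X_j) for every training point
            estimates[i] = det_S * gaussian_kernel(u).mean()
        return estimates

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 2))                # training points
        x = np.array([[0.0, 0.0], [1.0, 1.0]])       # test points
        H = np.array([[0.25, 0.05],                  # full (unconstrained) bandwidth matrix
                      [0.05, 0.25]])
        print(kde(x, X, H))

The double loop structure makes the O(mnd) cost of the direct estimator explicit; the later chapters are concerned with reducing and parallelizing exactly this cost.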
As in the univariate case, the multivariate kernel density estimate is calculated by placing a scaled kernel of mass n^{-1} at each data point and then aggregating them to form the density estimate. Figure 2.3 illustrates a multivariate kernel density estimate in two dimensions. The left-hand figure shows observations (marked as dots) from a density f (denoted by the isolines); on the right is the estimate \hat{f}. Since the ground truth f is actually a linear combination of five bivariate normal density functions, we can see from the right-hand figure that \hat{f} gives a good estimate of this function.

Figure 2.3: Multivariate kernel density estimate. Left: the contours denote the density function f, and the dots are the sample/training points drawn from f. Right: the estimate \hat{f} calculated from the dots in the left figure.

Defining S = H^{-1/2} and evaluating \hat{f} at some points of interest x_1, x_2, ..., x_m, Equation (2.3) can be rewritten as

    \hat{f}(x_i; S) = n^{-1} \sum_{j=1}^{n} K_S(x_i - X_j),   i = 1, 2, ..., m,    (2.4)

where x_i = (x_{i1}, x_{i2}, ..., x_{id})^T, X_j = (X_{j1}, X_{j2}, ..., X_{jd})^T, and K_S(x) = |S| K(Sx). Here x_i is called a test point, X_j is called a training point, and S is called the scale. The scale and the bandwidth are related by H^{-1} = S^T S. Equation (2.4) provides a more direct form when considering implementation and complexity: instead of a continuous function \hat{f}(x), the discrete form \hat{f}(x_i) is more intuitive for a software implementation, and the scale S reduces the complexity by avoiding the calculation of the inverse square root of the bandwidth H. In the subsequent discussion we will mostly use this form of the formulas and equations. Since there are m test points, and for each test point there are n scaled kernel function evaluations at d-dimensional training points, the complexity of Equation (2.4) is O(mnd).

2.3 Kernel Functions

2.3.1 Univariate Kernels

A univariate kernel is a one-dimensional, non-negative, real-valued, integrable function k which satisfies
- \int_{-\infty}^{+\infty} k(u)\,du = 1;
- k(-u) = k(u) for all values of u.

The first requirement ensures that the result of the kernel density estimator is a probability density function. The second requirement makes sure that the kernel function has zero mean, so that a kernel placed at a training point has its average value at that training point. To help reduce the computational complexity, a univariate bounding box can be applied to the kernel to give it finite support, at some cost in accuracy. The truncated kernel is given as

    k_{trunc}(x; a, b) = \Big[\int_{-\infty}^{b} k(u)\,du - \int_{-\infty}^{a} k(u)\,du\Big]^{-1} k(x)\, b(x; a, b),    (2.5)

where b(x; a, b) is the bounding box extending from the lower bound a to the upper bound b,

    b(x; a, b) = 1 if a \le x \le b, and 0 otherwise.    (2.6)

The normalization factor \big[\int_{-\infty}^{b} k(u)\,du - \int_{-\infty}^{a} k(u)\,du\big]^{-1} is introduced to ensure that the truncated kernel function satisfies the requirement

    \int_{-\infty}^{+\infty} k_{trunc}(u)\,du = 1.    (2.7)

If accuracy outweighs computational cost, the bounding box can be removed by setting the lower and upper bounds to -\infty and +\infty respectively, in which case k_{trunc}(x) = k(x).
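As a concrete illustration of Equations (2.5)-(2.7), the sketch below builds a truncated, renormalized Gaussian kernel on a bounding box [a, b]. It is a minimal NumPy example with illustrative names; the bounds used in the demo are arbitrary.

    import numpy as np
    from math import erf, sqrt

    def normal_cdf(x):
        # CDF of the standard normal density: Phi(x) = int_{-inf}^{x} phi(u) du.
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def truncated_gaussian_kernel(x, a=-3.0, b=3.0):
        # Equation (2.5): k_trunc(x; a, b) = [Phi(b) - Phi(a)]^-1 * phi(x) * b(x; a, b),
        # so that the truncated kernel still integrates to one, as required by Eq. (2.7).
        x = np.asarray(x, dtype=float)
        phi = np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)   # standard normal density, Eq. (2.8)
        box = ((x >= a) & (x <= b)).astype(float)           # bounding box b(x; a, b), Eq. (2.6)
        norm = normal_cdf(b) - normal_cdf(a)                # normalization factor
        return phi * box / norm

    if __name__ == "__main__":
        grid = np.linspace(-4.0, 4.0, 8001)
        k = truncated_gaussian_kernel(grid, a=-1.0, b=1.0)
        print(np.trapz(k, grid))    # approximately 1, as required by Equation (2.7)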
There is a range of commonly used univariate kernels, such as the uniform, triangular, biweight, triweight, and Epanechnikov kernels. However, the choice of the univariate kernel function k is not crucial to the accuracy of kernel density estimators [20]. Because of its convenient mathematical properties and the smooth density estimates it produces, the normal kernel k(x) = \phi(x) is often used, where \phi is the standard normal density function defined as

    \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}.    (2.8)

Figure 2.4: Truncated Gaussian kernel function. Solid line: the truncated Gaussian kernel; square-dashed line: the bounding box; dotted line: the untruncated Gaussian kernel.

2.3.2 Separable Multivariate Kernels

Multivariate kernel functions can be divided into two categories based on their separability: separable kernel functions and nonseparable kernel functions. Due to their computational simplicity, we mainly focus on separable multivariate kernel functions in this section. A separable multivariate kernel K(x): R^d \to R can be written as [21]

    K(x) = \prod_{l=1}^{d} k(x_l),    (2.9)

where x_l \in R denotes the l-th component of x = (x_1, x_2, ..., x_d)^T. Notice that the kernels can have either finite or infinite support. In the finite case, we omit the truncation subscript and the bounding box for simplicity. According to Equation (2.5), the separable multivariate kernel K is only valid for x \in support{K(\cdot)}, and its values are zero outside the support. The first order partial derivatives of K can be written as

    \frac{\partial K}{\partial x_c}(x) = k'(x_c) \prod_{l=1, l \neq c}^{d} k(x_l),    (2.10)

where \frac{\partial K}{\partial x_c}(x) is the first order partial derivative of K with respect to x_c, the c-th component of x, and k'(x_c) is the first order derivative of the univariate kernel function k at x_c. The second order partial derivatives of K are

    \frac{\partial^2 K}{\partial x_r \partial x_c}(x) =
        k''(x_c) \prod_{l \neq c} k(x_l),                      if r = c,
        k'(x_r)\, k'(x_c) \prod_{l \neq c,\, l \neq r} k(x_l),  if r \neq c,    (2.11)

where \frac{\partial^2 K}{\partial x_r \partial x_c}(x) is the second order partial derivative of K with respect to x_r and x_c, and k''(x_c) is the second order derivative of the univariate kernel function k at x_c. The above definitions of the first and second order partial derivatives of the kernel K can be extended to higher orders. Given a multiset N = {n_1, ..., n_r | n_i \in {1, ..., d}, i \in {1, ..., r}}, the r-th order partial derivative of K with respect to x_{n_1}, ..., x_{n_r} is given as

    \frac{\partial^r K}{\partial x_{n_1} \cdots \partial x_{n_r}}(x) = \prod_{i=1}^{d} k^{(N(i))}(x_i),    (2.12)

where N(i) denotes the number of elements of value i in the multiset N, and k^{(N(i))} is the N(i)-th order derivative of the univariate kernel function k.
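The product form in Equation (2.12) reduces any partial derivative of a separable kernel to a product of univariate kernel derivatives. The sketch below is a direct, unoptimized transcription of that formula for the separable Gaussian kernel (the efficient graph-based evaluation is the subject of Chapter 5); the helper names are illustrative and only derivatives up to second order are filled in.

    import numpy as np
    from collections import Counter

    def gauss_derivative(order, x):
        # k^(order)(x) for the standard normal kernel phi: phi' = -x*phi, phi'' = (x^2 - 1)*phi.
        phi = np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)
        if order == 0:
            return phi
        if order == 1:
            return -x * phi
        if order == 2:
            return (x * x - 1.0) * phi
        raise NotImplementedError("higher orders would use further Hermite polynomials")

    def separable_kernel_partial(x, derivative_indices):
        # Equation (2.12): the derivative of K(x) = prod_l k(x_l) with respect to
        # x_{n_1}, ..., x_{n_r} equals prod_i k^(N(i))(x_i), where N(i) counts how many
        # times coordinate i appears in the multiset of derivative indices.
        counts = Counter(derivative_indices)          # N(i) for each coordinate i (0-based here)
        return np.prod([gauss_derivative(counts.get(i, 0), xi) for i, xi in enumerate(x)])

    if __name__ == "__main__":
        x = np.array([0.3, -0.5, 1.2])
        # d^2 K / (dx_0 dx_2): one first-order factor for each differentiated coordinate.
        print(separable_kernel_partial(x, derivative_indices=[0, 2]))
        # d^2 K / dx_1^2: a single second-order factor for coordinate 1.
        print(separable_kernel_partial(x, derivative_indices=[1, 1]))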
2.4 Bandwidth

In common with all smoothing problems, the most important choice is the amount of smoothing. For kernel density estimators, the single most important factor is the bandwidth, since it controls both the amount and the orientation of the smoothing.

2.4.1 Types of Bandwidth

For the univariate case, the bandwidth h is a scalar. If the standard normal density function is used to approximate univariate data, and the underlying density being estimated is Gaussian, then it can be shown that the optimal choice for h is [22]

    h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5}.    (2.13)

Figure 2.5: Univariate kernel density estimates of different bandwidths.

For the multivariate case, the bandwidth H is a matrix. The type and orientation of the kernel function is controlled by the parameterization of the bandwidth matrix. There are three main classes of parameterization [23]:

- the class of all symmetric, positive definite matrices:

    H = \begin{bmatrix} h_1^2 & h_{12} & \cdots & h_{1d} \\ h_{12} & h_2^2 & \cdots & h_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ h_{1d} & h_{2d} & \cdots & h_d^2 \end{bmatrix};    (2.14)

- the class of all diagonal, positive definite matrices:

    \mathrm{dg}\,H = \begin{bmatrix} h_1^2 & 0 & \cdots & 0 \\ 0 & h_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_d^2 \end{bmatrix};    (2.15)

- the class of all positive constants times the identity matrix:

    h^2 I = \begin{bmatrix} h^2 & 0 & \cdots & 0 \\ 0 & h^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h^2 \end{bmatrix}.    (2.16)

The first class defines a full bandwidth matrix, which is the most general bandwidth type; it allows the kernel estimator to smooth in any direction, whether along the coordinate axes or not. The second class defines the diagonal matrix parameterization, which is the most commonly used one; a diagonal bandwidth matrix allows different degrees of smoothing along each of the coordinate axes. The third class, h^2 I, applies the same amount of smoothing in every coordinate direction, which is too restrictive for general use. A visualization of scaled kernel functions using these different classes of bandwidths is given in Figure 2.6. It is worth mentioning that, for a bivariate bandwidth matrix, the full bandwidth matrix of the first class can also be parameterized as

    H = \begin{bmatrix} h_1^2 & \rho_{12} h_1 h_2 \\ \rho_{12} h_1 h_2 & h_2^2 \end{bmatrix},    (2.17)

where \rho_{12} is the correlation coefficient, which can be used as a measure of orientation.

Figure 2.6: Comparison of three bandwidth matrix parametrization classes. Left: positive scalar times the identity matrix. Center: all diagonal, positive definite matrices. Right: all symmetric, positive definite matrices.

2.4.2 Variable Bandwidth

So far, the bandwidths we have used in kernel density estimators are fixed, meaning that a single bandwidth is used for every test point x_i, i = 1, ..., m, and training point X_j, j = 1, ..., n. In this section, we generalize these fixed bandwidth estimators to variable bandwidth estimators. There are two main classes of variable bandwidth estimators:

    \hat{f}(x_i; H) = n^{-1} \sum_{j=1}^{n} K_{H(x_i)}(x_i - X_j),   i = 1, ..., m,    (2.18)

and

    \hat{f}(x_i; \Omega) = n^{-1} \sum_{j=1}^{n} K_{\Omega(X_j)}(x_i - X_j),   i = 1, ..., m,    (2.19)

where H(\cdot) and \Omega(\cdot) are bandwidth functions. They are considered to be non-random functions, in the same way as we consider a single bandwidth to be a non-random number or matrix. The first estimator is called the balloon kernel density estimator; its bandwidth differs at each test point x_i. The second is called the sample point kernel density estimator; its bandwidth differs at each training point X_j. In this thesis, we only cover sample point kernel density estimators. We do not cover balloon kernel density estimators for two reasons. First, balloon estimators typically do not integrate to 1, so they are not true density functions, a consequence of estimating locally rather than globally [24]. Second, balloon estimators are generally less accurate than sample point estimators [25, 26].
Figure 2.7: Univariate sample point kernel density estimate: solid line, kernel density estimate; dashed lines, individual kernels.

For sample point kernel density estimators, there are usually two choices for the bandwidth function \Omega. One commonly used form is

    \Omega(X_j) = h^2 f(X_j)^{-1} I,   j = 1, ..., n,    (2.20)

where h is a constant. Using the reciprocal of f leads to an O(h^4) bias rather than the usual O(h^2) bias of fixed bandwidth estimators [26]. This form of the bandwidth is intuitively appealing, since it states that smaller bandwidths should be used in parts of the data set with a high density of points, which is controlled by the value of f, and larger bandwidths in parts with lower density. This combination of small bandwidths near the modes and large bandwidths in the tails should be able to detect fine features near the former and prevent spurious features in the latter. One possible way to estimate the bandwidth function \Omega is to use a pilot estimate \hat{f}, giving \hat{\Omega}(X_j) = h \hat{f}(X_j)^{-1/2}.

The other choice of \Omega uses the k-nearest neighbor function of X_j [27]. The k-nearest neighbor function is defined as a symmetric positive definite second order covariation matrix associated with the neighborhood of X_j. It can be written as

    C(X_j) = n_k^{-1} \sum_{k=1}^{n_k} (X_j - X_{j_k})(X_j - X_{j_k})^T,   j = 1, ..., n,    (2.21)

where n_k = \lceil n^{-\gamma} \rceil is the number of neighbors and X_{j_k} denotes the k-th nearest neighbor of X_j. Here, n_k is chosen to be significantly smaller than the number of samples, but large enough to reflect the local variations. The parameter \gamma depends on the dimension of the space and the sparsity of the data points. The bandwidth function \Omega can then be given as

    \Omega(X_j) = \sigma^2 C(X_j),   j = 1, ..., n,    (2.22)

where \sigma is the scalar kernel width. Following the notation of Equation (2.4), and allowing the scaled kernels to have different weights at each training point, the variable bandwidth kernel density estimator can be written as

    \hat{f}(x_i; S_j, \omega_j) = \sum_{j=1}^{n} \omega_j K_{S_j}(x_i - X_j),   i = 1, ..., m,    (2.23)

where \omega_j, j = 1, ..., n, is the weight of the scaled kernel at training point X_j and S_j is the scale. For simplicity, we write \Omega_j and C_j instead of \Omega(X_j) and C(X_j). The variable bandwidth and the scale are related by \Omega_j^{-1} = S_j^T S_j. We can extract S_j by utilizing the eigendecomposition of C_j:

    \Omega_j = \sigma^2 Q_j \Lambda_j Q_j^T,   j = 1, ..., n,    (2.24)

where the columns of Q_j and the diagonal elements of \Lambda_j are the eigenvectors and eigenvalues of C_j. Therefore, the scale matrix S_j can be written as

    S_j = \sigma^{-1} \Lambda_j^{-1/2} Q_j^T,   j = 1, ..., n.    (2.25)
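A minimal sketch of the sample-point bandwidth of Equations (2.21)-(2.25): for each training point it forms the k-nearest-neighbor covariance matrix, scales it by \sigma^2, and converts it into a scale matrix through an eigendecomposition. It uses a brute-force neighbor search for clarity rather than the efficient image-specific selector developed in Chapter 5, and the names are illustrative.

    import numpy as np

    def knn_covariance(train_points, j, n_k):
        # Equation (2.21): C(X_j) = n_k^-1 sum_k (X_j - X_jk)(X_j - X_jk)^T over the
        # n_k nearest neighbors of X_j (brute-force distance search for clarity).
        diffs = train_points - train_points[j]
        order = np.argsort(np.einsum("ij,ij->i", diffs, diffs))
        nearest = diffs[order[1:n_k + 1]]               # skip the point itself
        return nearest.T @ nearest / n_k

    def sample_point_scale(train_points, j, n_k, sigma):
        # Equations (2.22)-(2.25): Omega_j = sigma^2 C(X_j) = sigma^2 Q_j Lambda_j Q_j^T,
        # and the scale S_j = sigma^-1 Lambda_j^-1/2 Q_j^T, so that Omega_j^-1 = S_j^T S_j.
        C = knn_covariance(train_points, j, n_k)
        eigvals, Q = np.linalg.eigh(C)                  # symmetric eigendecomposition
        eigvals = np.maximum(eigvals, 1e-12)            # guard against degenerate neighborhoods
        return (Q / np.sqrt(eigvals)).T / sigma         # Lambda^-1/2 Q^T, divided by sigma

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 2))
        S_0 = sample_point_scale(X, j=0, n_k=15, sigma=1.0)
        Omega_0 = 1.0 * knn_covariance(X, 0, 15)        # sigma^2 C(X_0) with sigma = 1
        print(np.allclose(np.linalg.inv(Omega_0), S_0.T @ S_0))   # checks Omega^-1 = S^T S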
2.5 Kernel Density Derivative Estimation

Before considering the r-th derivative of a multivariate density, we first introduce the notation for the r-th derivatives of a function [28, 29]. From a multivariate point of view, the r-th derivative of a function is understood as the set of all its partial derivatives of order r, rather than just one of them. All of these r-th partial derivatives can be neatly organized into a single vector as follows: if f is a real d-variate density function and x = (x_1, ..., x_d), then we denote by D^{\otimes r} f(x) \in R^{d^r} the vector containing all the partial derivatives of order r of f at x, arranged so that we can formally write

    D^{\otimes r} f = \frac{\partial^r f}{(\partial x)^{\otimes r}},    (2.26)

where D^{\otimes r} is the r-th Kronecker power [30] of the operator D. Thus we write the r-th derivative of f as a vector of length d^r. Notice that, using this notation, we have D(D^{\otimes r} f) = D^{\otimes(r+1)} f. Also, the gradient of f is just D^{\otimes 1} f, and the Hessian \nabla^2 f = \frac{\partial^2 f}{\partial x \partial x^T} is such that \mathrm{vec}\,\nabla^2 f = D^{\otimes 2} f, where vec denotes the vector operator [31]. According to the previous notation, we can write the r-th kernel density derivative estimator of D^{\otimes r} f as

    D^{\otimes r} \hat{f}(x_i; S_j, \omega_j) = \sum_{j=1}^{n} \omega_j D^{\otimes r} K_{S_j}(x_i - X_j)
                                              = \sum_{j=1}^{n} \omega_j |S_j| (S_j^T)^{\otimes r} D^{\otimes r} K\big(S_j(x_i - X_j)\big),   i = 1, ..., m.    (2.27)

Here we follow the definitions in Equations (2.4) and (2.23), where x_i is the i-th point of the test set, X_j is the j-th point of the training set, S_j is the variable scale, K_{S_j} is the scaled kernel, and \omega_j is the weight of the r-th derivative of the scaled kernel at X_j.

In this thesis, we mainly focus on the first and second derivatives of the kernel density function, because they have the greatest potential for applications and are crucial for identifying significant features of the distribution [32]. The first order derivatives are given by the kernel gradient estimator

    \nabla \hat{f}(x_i; S_j, \omega_j) = \sum_{j=1}^{n} \omega_j \nabla K_{S_j}(x_i - X_j),   i = 1, ..., m,    (2.28)

where \nabla is the column vector of the d first-order partial derivatives and

    \nabla K_S(x) = |S|\, S^T \nabla K(Sx).    (2.29)

Similarly, the second order derivatives are given by the kernel curvature estimator

    \nabla^2 \hat{f}(x_i; S_j, \omega_j) = \sum_{j=1}^{n} \omega_j \nabla^2 K_{S_j}(x_i - X_j),   i = 1, ..., m,    (2.30)

where \nabla^2 denotes the matrix of all second-order partial derivatives and

    \nabla^2 K_S(x) = |S|\, S^T \nabla^2 K(Sx)\, S.    (2.31)

It should be pointed out that the kernel curvature estimator is actually the estimator of the Hessian matrix of the density function.
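To illustrate Equations (2.28)-(2.31), the sketch below evaluates the weighted kernel gradient and kernel curvature (Hessian) estimators at a single test point with a Gaussian kernel. It is a plain, unoptimized transcription of the formulas with illustrative names, written under the assumption of per-point scale matrices; the accelerated versions are the subject of Chapter 5.

    import numpy as np

    def gaussian_grad_hess(u):
        # Gradient and Hessian of the d-variate standard Gaussian kernel K at u:
        # grad K(u) = -u K(u),  hess K(u) = (u u^T - I) K(u).
        d = u.shape[0]
        K = np.exp(-0.5 * u @ u) / (2.0 * np.pi) ** (d / 2)
        return -u * K, (np.outer(u, u) - np.eye(d)) * K

    def kernel_gradient_curvature(x, train_points, scales, weights):
        # Equations (2.28)-(2.31):
        #   grad f_hat(x) = sum_j w_j |S_j| S_j^T grad K(S_j (x - X_j))
        #   hess f_hat(x) = sum_j w_j |S_j| S_j^T hess K(S_j (x - X_j)) S_j
        d = x.shape[0]
        grad = np.zeros(d)
        hess = np.zeros((d, d))
        for X_j, S_j, w_j in zip(train_points, scales, weights):
            u = S_j @ (x - X_j)
            gK, hK = gaussian_grad_hess(u)
            det_S = abs(np.linalg.det(S_j))
            grad += w_j * det_S * S_j.T @ gK
            hess += w_j * det_S * S_j.T @ hK @ S_j
        return grad, hess

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        X = rng.normal(size=(100, 2))
        scales = [np.eye(2) / 0.5] * len(X)      # fixed isotropic scale, bandwidth h = 0.5
        weights = [1.0 / len(X)] * len(X)        # uniform weights, as in the fixed-bandwidth case
        g, Hm = kernel_gradient_curvature(np.zeros(2), X, scales, weights)
        print(g)
        print(Hm)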
Chapter 3 Vesselness Measure

The vesselness measure intuitively describes the likelihood of a point being part of a vessel. It is not reliable to judge whether a point belongs to a vessel based only on the point's intensity, because the analysis of vesselness relies on structural information, such as local extrema, valleys, ridges, or saddle points, which can only be obtained from the derivatives of the intensity function. In this chapter, we first introduce the basics of gradients and Hessian matrices and their relation to structural features, then give three different methods for finding the gradients and Hessian matrices of images, and finally provide two popular algorithms for vesselness measurement.

3.1 Gradients and Hessian Matrices of Images

Gradients and Hessian matrices are crucial for finding structural information in images. In this section we start from the definition of the gradient and the Hessian matrix, then introduce their mathematical properties, and finally discuss how to extract structural information from image gradients and Hessian matrices.

3.1.1 Gradient

Given a differentiable, scalar-valued function f(x), x = (x_1, ..., x_n)^T, of standard Cartesian coordinates in Euclidean space, its gradient is the vector whose components are the n partial derivatives of f. It can be written as

    \nabla f(x) = \frac{\partial}{\partial x} f(x) = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)^T.    (3.1)

In mathematics, the gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction. An example is illustrated in Figure 3.1. On the left, we construct the bivariate Gaussian function f(x, y) = \frac{1}{2\pi} e^{-0.5(x^2 + y^2)}. Its gradients, as well as its contours, are given on the right. Each blue arrow represents the gradient vector of f at its location; the direction of the arrow denotes the direction of the gradient vector, and the length of the arrow denotes its magnitude. We can see that all the blue arrows point toward the center, where the function f reaches its peak value.

Figure 3.1: Gradient of the standard Gaussian function. Left: standard bivariate Gaussian function. Right: gradients (blue arrows) of the standard 2D Gaussian function.

For image processing and computer vision, the gradient of an image is defined in the same way as the mathematical gradient, except that f is now an image intensity function I. Since an image is usually either 2D or 3D, the image gradient can be written as

    g(x, y) = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)^T   or   g(x, y, z) = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}, \frac{\partial I}{\partial z}\right)^T,    (3.2)

where I(x, y) and I(x, y, z) are the image intensity functions for 2D and 3D respectively. For a 2D image, the magnitude and direction of the gradient vector at a point (x_0, y_0) are

    |g(x_0, y_0)| = \sqrt{\Big(\frac{\partial}{\partial x} I(x_0, y_0)\Big)^2 + \Big(\frac{\partial}{\partial y} I(x_0, y_0)\Big)^2}    (3.3)

and

    \theta = \mathrm{atan}\Big(\frac{\partial}{\partial y} I(x_0, y_0),\ \frac{\partial}{\partial x} I(x_0, y_0)\Big).    (3.4)

Figure 3.2: Image gradient. Left: an intensity image of a cameraman. Center: a gradient image in the x direction measuring horizontal change in intensity. Right: a gradient image in the y direction measuring vertical change in intensity.

Usually, the intensity function I(x, y) or I(x, y, z) of a digital image is not given directly; it is only known at discrete points. Therefore, to obtain its derivatives, we assume that there is an underlying continuous intensity function which has been sampled at the image points. With some additional assumptions, the derivatives of the continuous intensity function can be approximated as functions of the sampled intensity function, i.e., the digital image. Approximations of these derivative functions can be defined at varying degrees of accuracy. We discuss them in detail in Section 3.2.

3.1.2 Hessian

Suppose f: R^n \to R is a function taking a vector x = (x_1, ..., x_n)^T \in R^n and outputting a scalar f(x) \in R. If all second order partial derivatives of f exist, then the Hessian matrix of f is the n × n square matrix defined as follows:

    \nabla^2 f(x) = \frac{\partial^2 f}{\partial x \partial x^T} = \begin{bmatrix}
      \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
      \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
      \vdots & \vdots & \ddots & \vdots \\
      \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
    \end{bmatrix},    (3.5)

where \frac{\partial^2 f}{\partial x_i \partial x_j} is the second order partial derivative of f with respect to the variables x_i and x_j. In particular, if f has continuous second partial derivatives at any given point in R^n, then for all i, j \in {1, 2, ..., n}, \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}; thus the Hessian matrix of f is a symmetric matrix. This is true in most "real-life" circumstances. For a digital image, the Hessian matrix is defined in the same way, except that the function f is now the image intensity function I, which is usually a 2D or 3D discrete function.
By convention, we use H to denote the Hessian matrix of an image, and it can be written as

    H(x, y) = \begin{bmatrix} \frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} \\ \frac{\partial^2 I}{\partial y \partial x} & \frac{\partial^2 I}{\partial y^2} \end{bmatrix}
    and
    H(x, y, z) = \begin{bmatrix} \frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} & \frac{\partial^2 I}{\partial x \partial z} \\ \frac{\partial^2 I}{\partial y \partial x} & \frac{\partial^2 I}{\partial y^2} & \frac{\partial^2 I}{\partial y \partial z} \\ \frac{\partial^2 I}{\partial z \partial x} & \frac{\partial^2 I}{\partial z \partial y} & \frac{\partial^2 I}{\partial z^2} \end{bmatrix}.    (3.6)

A symmetric n × n Hessian matrix can be decomposed into the following form using the eigenvalue decomposition,

    H = Q \Lambda Q^T,    (3.7)

where Q is the square matrix whose i-th column is the eigenvector q_i of H, and \Lambda is the diagonal matrix whose diagonal elements are the corresponding eigenvalues:

    Q = [q_1, q_2, \dots, q_n]   and   \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n).    (3.8)

In the 2D case, the eigenvalues of H can be visualized by constructing an ellipse

    v^T H v = 1,    (3.9)

where v = (x, y)^T. Performing the eigenvalue decomposition of H, so that \Lambda is a diagonal matrix and Q is a rotation (orthogonal) matrix,

    \Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}   and   Q = [q_1, q_2],    (3.10)

we have

    v^T Q \Lambda Q^T v = (Q^T v)^T \Lambda (Q^T v) = 1.    (3.11)

Letting v' = Q^T v = (x', y')^T, we get

    v'^T \Lambda v' = \lambda_1 x'^2 + \lambda_2 y'^2 = \frac{x'^2}{(1/\sqrt{\lambda_1})^2} + \frac{y'^2}{(1/\sqrt{\lambda_2})^2} = 1.    (3.12)

We can see that Equation (3.12) is a standard ellipse equation in the coordinates (x', y'); its semi-principal axes are 1/\sqrt{\lambda_1} and 1/\sqrt{\lambda_2}. Since Q is a rotation matrix, Equation (3.9) is actually a rotated ellipse in the coordinates (x, y).

Figure 3.3: Visualized eigenvalues with ellipsoid.

Intuitively, for 2D images, if a pixel is close to the centerline of a vessel, it should satisfy the following properties:
- one of the eigenvalues, \lambda_1, should be very close to zero;
- the absolute value of the other eigenvalue should be much greater than zero, |\lambda_2| \gg 0.

For 3D images, let \lambda_k be the eigenvalue with the k-th smallest magnitude, i.e. |\lambda_1| \le |\lambda_2| \le |\lambda_3|. Then an ideal tubular structure in a 3D image satisfies:
- \lambda_1 should be very close to zero;
- \lambda_2 and \lambda_3 should be of large magnitude and equal sign (the sign is an indicator of brightness/darkness).

The respective eigenvectors point out singular directions: q_1 indicates the direction along the vessel (minimum intensity variation), while q_2 and q_3 form a basis for the orthogonal plane. The possible orientation patterns are summarized in Table 3.1.

Table 3.1: Possible orientation patterns in 2D and 3D images, depending on the values of the eigenvalues \lambda_k (H = high, L = low, N = noisy, usually small; +/- indicates the sign of the eigenvalue). The eigenvalues are ordered |\lambda_1| \le |\lambda_2| \le |\lambda_3| [3].

    2D (l1 l2)   3D (l1 l2 l3)   orientation pattern
    N  N         N  N  N         noisy, no preferred direction
                 L  L  H-        plate-like structure (bright)
                 L  L  H+        plate-like structure (dark)
    L  H-        L  H- H-        tubular structure (bright)
    L  H+        L  H+ H+        tubular structure (dark)
    H- H-        H- H- H-        blob-like structure (bright)
    H+ H+        H+ H+ H+        blob-like structure (dark)
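As a small illustration of how the eigenvalue patterns in Table 3.1 are used in practice, the sketch below orders the eigenvalues of a 3D Hessian by magnitude and applies the bright-tubular-structure test described above. The threshold values and function names are illustrative choices for the sketch, not values prescribed elsewhere in this thesis.

    import numpy as np

    def tubular_score(hessian, low=0.25, high=1.0):
        # Order eigenvalues so that |lambda_1| <= |lambda_2| <= |lambda_3| and test the
        # bright-vessel pattern of Table 3.1: lambda_1 ~ 0, lambda_2 and lambda_3 large,
        # negative, and of comparable magnitude.
        eigvals, eigvecs = np.linalg.eigh(hessian)          # the Hessian is symmetric
        order = np.argsort(np.abs(eigvals))
        l1, l2, l3 = eigvals[order]
        is_tube = (abs(l1) < low and abs(l2) > high and abs(l3) > high
                   and np.sign(l2) == np.sign(l3) == -1)
        vessel_direction = eigvecs[:, order[0]]             # q_1: direction along the vessel
        return is_tube, vessel_direction

    if __name__ == "__main__":
        # Idealized bright tube along the first axis: almost no curvature along the tube,
        # strong negative curvature across it.
        H = np.diag([0.01, -2.0, -2.1])
        print(tubular_score(H))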
3.2 Finding 1st and 2nd Order Derivatives From Images

Typically, the intensity function of a digital image is known only at evenly distributed discrete locations. Thus, instead of the continuous intensity function I(x, y), I(n_1, n_2) is usually used to refer to the image at discrete points. Here, (n_1, n_2) are the indices of a pixel in the image; they are related to (x, y) by (x, y) = (n_1 \Delta n_1, n_2 \Delta n_2), where \Delta n_1 and \Delta n_2 are the distances between two adjacent pixels in the horizontal and vertical directions. To compute the derivatives of an image, we need to use the pixels of I(n_1, n_2) to approximate the underlying I(x, y) and its derivatives. In this section, three approximations are introduced: the gradient operator, Gaussian smoothing, and kernel density derivative estimation.

3.2.1 Gradient Operator

For a 2D image, the partial derivative of its continuous intensity function I(x, y) in the x direction is defined as

    I_x(x, y) = \frac{\partial}{\partial x} I(x, y) = \lim_{h \to 0} \frac{I(x + \frac{h}{2}, y) - I(x - \frac{h}{2}, y)}{h}.    (3.13)

Thus, for a constant h, Equation (3.13) can be approximated by

    \hat{I}_x(x, y) = \frac{I(x + \frac{h}{2}, y) - I(x - \frac{h}{2}, y)}{h},    (3.14)

and the error of the approximation is O(h^2) [33]. In the discrete case, if we let \frac{h}{2} = \Delta n, then the approximation of I_x(n_1, n_2) can be written as

    \hat{I}_x(n_1, n_2) = \frac{I(n_1 + 1, n_2) - I(n_1 - 1, n_2)}{2\Delta n}.    (3.15)

Usually the sampling factor \frac{1}{2\Delta n} is ignored, since it is constant throughout the image. Therefore, the approximation of I_x(n_1, n_2) can be simplified to

    \hat{I}_x(n_1, n_2) = I(n_1 + 1, n_2) - I(n_1 - 1, n_2).    (3.16)

Similarly, the approximation of I_y(n_1, n_2) can be written as

    \hat{I}_y(n_1, n_2) = I(n_1, n_2 + 1) - I(n_1, n_2 - 1).    (3.17)

Let h_1(n_1, n_2) = \delta(n_1 + 1, n_2) - \delta(n_1 - 1, n_2) and h_2(n_1, n_2) = \delta(n_1, n_2 + 1) - \delta(n_1, n_2 - 1), where \delta(n_1, n_2) is defined as

    \delta(n_1, n_2) = 1 if n_1 = n_2 = 0, and 0 otherwise.    (3.18)

Then Equations (3.16) and (3.17) can be written as

    \hat{I}_x(n_1, n_2) = I(n_1, n_2) * h_1(n_1, n_2),    (3.19)
    \hat{I}_y(n_1, n_2) = I(n_1, n_2) * h_2(n_1, n_2).    (3.20)

Here h_1 and h_2 are called gradient operators. Usually, the gradient operators are written in matrix form:

    h_1(n_1, n_2) = \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix};   h_2(n_1, n_2) = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}.    (3.21)

The approximated second order derivatives can be derived directly from the first order derivatives:

    \hat{I}_{xx}(n_1, n_2) = \hat{I}_x(n_1, n_2) * h_1(n_1, n_2) \approx (I(n_1, n_2) * h_1(n_1, n_2)) * h_1(n_1, n_2) = I(n_1, n_2) * h_1(n_1, n_2) * h_1(n_1, n_2).    (3.22)

Similarly,

    \hat{I}_{xy}(n_1, n_2) = I(n_1, n_2) * h_1(n_1, n_2) * h_2(n_1, n_2),    (3.23)
    \hat{I}_{yy}(n_1, n_2) = I(n_1, n_2) * h_2(n_1, n_2) * h_2(n_1, n_2).    (3.24)

The discussion above extends easily to the higher dimensional case. Let I_{x_i} denote the first order derivative of I with respect to x_i; then its approximation is given as

    \hat{I}_{x_i}(n_1, \dots, n_d) = I(n_1, \dots, n_d) * h_i(n_1, \dots, n_d),    (3.25)

where h_i(n_1, \dots, n_d) = \delta(n_1, \dots, n_i + 1, \dots, n_d) - \delta(n_1, \dots, n_i - 1, \dots, n_d) and d is the dimension of the image. Similarly, the second order derivative I_{x_i x_j} can be written as

    \hat{I}_{x_i x_j}(n_1, \dots, n_d) = I(n_1, \dots, n_d) * h_i(n_1, \dots, n_d) * h_j(n_1, \dots, n_d).    (3.26)

This method is computationally efficient due to the simple structure of the gradient operators. But since the estimation of the derivatives involves only adjacent pixels, which contain limited information about the neighborhood, this method is not accurate. In particular, it cannot provide accurate derivative information for images at large scales.
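A minimal sketch of the gradient-operator approach of Equations (3.16)-(3.24): first and second derivatives are obtained by convolving the image with central-difference operators. It uses SciPy's ndimage.convolve; the kernel arrays are laid out for that routine's convolution convention so that I * h_1 reproduces I(n_1 + 1, n_2) - I(n_1 - 1, n_2) as in Equation (3.16), and the array names are illustrative.

    import numpy as np
    from scipy.ndimage import convolve

    # Central-difference operators corresponding to h1 (first index) and h2 (second index)
    # in Equation (3.21), written in the layout expected by true convolution.
    H1 = np.array([[1.0], [0.0], [-1.0]])
    H2 = np.array([[1.0, 0.0, -1.0]])

    def image_derivatives(image):
        # Equations (3.19)-(3.24): I_x ~ I * h1, I_y ~ I * h2, I_xx ~ I * h1 * h1, etc.
        I = image.astype(float)
        Ix = convolve(I, H1, mode="nearest")
        Iy = convolve(I, H2, mode="nearest")
        Ixx = convolve(Ix, H1, mode="nearest")
        Ixy = convolve(Ix, H2, mode="nearest")
        Iyy = convolve(Iy, H2, mode="nearest")
        return Ix, Iy, Ixx, Ixy, Iyy

    if __name__ == "__main__":
        # A ramp along the first index plus a quadratic bump along the second index.
        n1, n2 = np.mgrid[0:64, 0:64]
        image = 0.5 * n1 + 0.1 * (n2 - 32.0) ** 2
        Ix, Iy, Ixx, Ixy, Iyy = image_derivatives(image)
        # Approximately 1.0, 0.0, 0.8: the plain difference operators keep the factors of
        # 2 and 4 that come from dropping the 1/(2*dn) sampling factor of Eq. (3.15).
        print(Ix[32, 32], Iy[32, 32], Iyy[32, 32])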
3.2.2 Gaussian Smoothing

For a 2D image, the sampling theorem [34] tells us that the continuous intensity function I(x, y) can be ideally reconstructed from the discrete image function I(n_1, n_2) by

    I(x, y) = \sum_{k_1=-n}^{n} \sum_{k_2=-n}^{n} I(k_1, k_2)\, K(x - k_1 \Delta n_1,\, y - k_2 \Delta n_2),    (3.27)

where K is a sinc-like function

    K(x, y) = \frac{\sin(\pi x / \Delta n_1)}{\pi x / \Delta n_1} \cdot \frac{\sin(\pi y / \Delta n_2)}{\pi y / \Delta n_2}.    (3.28)

Here, \Delta n_1 and \Delta n_2 are the sampling intervals. However, K decays proportionally to 1/x and 1/y, which is a rather slow rate of decay. Consequently, only values that are very far away from the origin can be ignored in the computation; in other words, the summation limit n must be large, which is a computationally undesirable state of affairs. In addition, if there is aliasing, the sinc function will amplify its effects, since it combines a large number of unrelated pixel values. Instead, a Gaussian function, which passes only frequencies below a certain value and has small support in the spatial domain, is a good replacement. Thus, we have

    \hat{I}(x, y) = \sum_{k_1=-n}^{n} \sum_{k_2=-n}^{n} I(k_1, k_2)\, G(x - k_1 \Delta n_1,\, y - k_2 \Delta n_2),    (3.29)

where G is a Gaussian function at scale h,

    G(x, y) = \frac{1}{2\pi h^2} e^{-\frac{x^2 + y^2}{2h^2}}.    (3.30)

Thus, the approximated first and second order derivatives of I(x, y) with respect to x can be given as

    \hat{I}_x(x, y) = \frac{\partial}{\partial x} \hat{I}(x, y) = \sum_{k_1=-n}^{n} \sum_{k_2=-n}^{n} I(k_1, k_2)\, G_x(x - k_1 \Delta n_1,\, y - k_2 \Delta n_2),    (3.31)

    \hat{I}_{xx}(x, y) = \frac{\partial^2}{\partial x^2} \hat{I}(x, y) = \sum_{k_1=-n}^{n} \sum_{k_2=-n}^{n} I(k_1, k_2)\, G_{xx}(x - k_1 \Delta n_1,\, y - k_2 \Delta n_2),    (3.32)

where G_x(x, y) and G_{xx}(x, y) are the first and second order derivatives of the Gaussian in the x direction:

    G_x(x, y) = \frac{\partial}{\partial x} G(x, y) = -\frac{x}{2\pi h^4} e^{-\frac{x^2 + y^2}{2h^2}},    (3.33)

    G_{xx}(x, y) = \frac{\partial^2}{\partial x^2} G(x, y) = \frac{1}{2\pi h^4}\Big(\frac{x^2}{h^2} - 1\Big) e^{-\frac{x^2 + y^2}{2h^2}}.    (3.34)

Figure 3.4: Derivatives of Gaussian filters. Left: the first order derivative of the bivariate Gaussian function with respect to x. Right: the second order derivative of the bivariate Gaussian function with respect to x.

Sampling \hat{I}_x(x, y) and \hat{I}_{xx}(x, y) in the same way as I(n_1, n_2), we have

    \hat{I}_x(n_1, n_2) = \sum_{k_1=-n}^{n} \sum_{k_2=-n}^{n} I(k_1, k_2)\, G_x(n_1 - k_1, n_2 - k_2) = I(n_1, n_2) * G_x(n_1, n_2),    (3.35)

    \hat{I}_{xx}(n_1, n_2) = \sum_{k_1=-n}^{n} \sum_{k_2=-n}^{n} I(k_1, k_2)\, G_{xx}(n_1 - k_1, n_2 - k_2) = I(n_1, n_2) * G_{xx}(n_1, n_2),    (3.36)

where G_x(n_1, n_2) and G_{xx}(n_1, n_2) are the sampled Gaussian derivative functions. Therefore, the first and second order derivatives of an image can easily be obtained by convolving it with the corresponding Gaussian derivative filters. We call this Gaussian smoothing. In general, for a d-dimensional image I(n_1, ..., n_d), its first and second order derivatives can be given as

    \hat{I}_{x_i}(n_1, \dots, n_d) = I(n_1, \dots, n_d) * G_{x_i}(n_1, \dots, n_d),    (3.37)

    \hat{I}_{x_i x_j}(n_1, \dots, n_d) = I(n_1, \dots, n_d) * G_{x_i x_j}(n_1, \dots, n_d),    (3.38)

where G_{x_i} is the first order Gaussian derivative filter with respect to x_i and G_{x_i x_j} is the second order Gaussian derivative filter with respect to x_i and x_j. Since the computation uses only convolution, and the size of the Gaussian derivative filter is usually small, the Gaussian smoothing method is also computationally efficient. Unlike the gradient operator, which calculates the derivatives at the finest scale, the smoothing of this method is controlled by the scale parameter h, which determines how much neighborhood information is used to calculate the derivatives. The problems with this method are that the smoothing is performed only along the coordinate axes and that the choice of a proper scale h is hard.
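The Gaussian smoothing method of Equations (3.35)-(3.38) amounts to convolving the image with sampled Gaussian derivative filters. The sketch below builds G_x and G_xx on a small grid from Equations (3.33)-(3.34) and applies them with SciPy's ndimage.convolve; the filter radius and all names are illustrative choices.

    import numpy as np
    from scipy.ndimage import convolve

    def gaussian_derivative_filters(h, radius=None):
        # Sampled first- and second-order Gaussian derivative filters along the first
        # image index, Equations (3.33)-(3.34), truncated to a (2*radius+1)^2 window.
        radius = int(np.ceil(3 * h)) if radius is None else radius
        x, y = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
        g = np.exp(-(x ** 2 + y ** 2) / (2 * h ** 2))
        Gx = -x / (2 * np.pi * h ** 4) * g                        # Eq. (3.33)
        Gxx = (x ** 2 / h ** 2 - 1) / (2 * np.pi * h ** 4) * g    # Eq. (3.34)
        return Gx, Gxx

    def gaussian_smoothed_derivatives(image, h=2.0):
        # Equations (3.35)-(3.36): I_x ~ I * G_x and I_xx ~ I * G_xx.
        Gx, Gxx = gaussian_derivative_filters(h)
        I = image.astype(float)
        return convolve(I, Gx, mode="nearest"), convolve(I, Gxx, mode="nearest")

    if __name__ == "__main__":
        n1, n2 = np.mgrid[0:64, 0:64]
        image = 0.5 * n1 + 0.1 * (n1 - 32.0) ** 2     # varies only along the first index
        Ix, Ixx = gaussian_smoothed_derivatives(image, h=2.0)
        print(Ix[32, 32], Ixx[32, 32])                # close to 0.5 and 0.2 away from the borders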
3.2.3 Kernel Density Derivative Estimation

We discussed kernel density derivative estimation in Section 2.5. We know that the approximations of the derivatives of a function f: R^d \to R can be obtained from a set of sample data X_i, i = 1, ..., m, by Equations (2.28) and (2.30). To obtain the derivatives of a 2D image I(x, y), the same idea can be applied. Letting the weights \omega_j = I(k_1, k_2) and rearranging the indices of the summation operator, we can rewrite Equations (2.28) and (2.30) as

    \nabla \hat{I}(x, y) = \sum_{k_1} \sum_{k_2} I(k_1, k_2)\, \nabla K_{S_{k_1 k_2}}(x - X_{k_1},\, y - Y_{k_2}),    (3.39)

    \nabla^2 \hat{I}(x, y) = \sum_{k_1} \sum_{k_2} I(k_1, k_2)\, \nabla^2 K_{S_{k_1 k_2}}(x - X_{k_1},\, y - Y_{k_2}),    (3.40)

where I(k_1, k_2) is the intensity of the image at pixel (k_1, k_2), \nabla \hat{I}(x, y) is the estimated gradient of the image, \nabla^2 \hat{I}(x, y) is the estimated Hessian of the image, S_{k_1 k_2} is the scale matrix, K(\cdot) is the kernel function, and (X_{k_1}, Y_{k_2}) = (k_1 \Delta n_1, k_2 \Delta n_2) is the location of pixel (k_1, k_2) in the image. Sampling \nabla \hat{I}(x, y) and \nabla^2 \hat{I}(x, y) in the same way as I(n_1, n_2), we have

    \nabla \hat{I}(n_1, n_2) = \sum_{k_1} \sum_{k_2} I(k_1, k_2)\, \nabla K_{S_{k_1 k_2}}(n_1 - k_1, n_2 - k_2),    (3.41)

    \nabla^2 \hat{I}(n_1, n_2) = \sum_{k_1} \sum_{k_2} I(k_1, k_2)\, \nabla^2 K_{S_{k_1 k_2}}(n_1 - k_1, n_2 - k_2).    (3.42)

Here, \nabla K_{S_{k_1 k_2}}(n_1, n_2) is the sampled kernel gradient and \nabla^2 K_{S_{k_1 k_2}}(n_1, n_2) is the sampled kernel Hessian. The above equations can easily be extended to the higher dimensional case:

    \nabla \hat{I}(n) = \sum_{k} I(k)\, \nabla K_{S_k}(n - k),    (3.43)

    \nabla^2 \hat{I}(n) = \sum_{k} I(k)\, \nabla^2 K_{S_k}(n - k),    (3.44)

where n = (n_1, ..., n_d) and k = (k_1, ..., k_d). The kernel density derivative estimators give the most accurate estimation of the derivatives of the image, because they are able to smooth the image locally, deciding the smoothing direction and scale level for each pixel individually. The problem with this method is that it is too computationally intensive. To solve this problem, we propose a high performance solution based on the GPU CUDA framework in Chapter 5.

3.3 Frangi Filtering

The Frangi filter [3], developed by Frangi et al. in 1998, is a popular method for highlighting tubular structures in images. It uses the eigenvalues of the Hessian matrices obtained from an image to analyze the likelihood of a pixel lying on a tubular structure. For a 3D image, assume we have obtained a Hessian matrix H at voxel (n_1, n_2, n_3) using any of the methods discussed in Section 3.2; then by performing the eigenvalue decomposition of H and sorting the resulting eigenvalues, we have

    |\lambda_1| \le |\lambda_2| \le |\lambda_3|.    (3.45)

From the discussion in Section 3.1.2, we know that if a voxel lies on a tubular-like structure, the eigenvalues satisfy |\lambda_1| \approx 0, |\lambda_1| \ll |\lambda_2|, and \lambda_2 \approx \lambda_3. Combining these constraints with the relations in Table 3.1, we can define the following three dissimilarity measures:

- To distinguish between blob-like and non-blob-like structures, we define

    R_B = \frac{|\lambda_1|}{\sqrt{|\lambda_2 \lambda_3|}}.    (3.46)

  This ratio attains its maximum for a blob-like structure, which satisfies |\lambda_1| \approx |\lambda_2| \approx |\lambda_3|, and is close to zero whenever \lambda_1 \approx 0, or \lambda_1 and \lambda_2 tend to vanish.

- To distinguish between plate-like and line-like structures, we define

    R_A = \frac{|\lambda_2|}{|\lambda_3|}.    (3.47)

  R_A \to 0 implies a plate-like structure and R_A \to 1 implies a line-like structure.

- To distinguish between background (noise) and foreground, we define

    S = \sqrt{\lambda_1^2 + \lambda_2^2 + \lambda_3^2}.    (3.48)

  This measure is low in the background, where no structure is present and the eigenvalues are small for lack of contrast.

Combining the measures above, we can define a vesselness function as follows:

    \mathcal{V}(n_1, n_2, n_3) =
        0,   if \lambda_2 > 0 or \lambda_3 > 0,
        \big(1 - e^{-\frac{R_A^2}{2\alpha^2}}\big)\, e^{-\frac{R_B^2}{2\beta^2}}\, \big(1 - e^{-\frac{S^2}{2c^2}}\big),   otherwise,    (3.49)

where \alpha, \beta, and c are thresholds which control the sensitivity of the measures R_A, R_B, and S. Similarly, for 2D images the vesselness function can be given as

    \mathcal{V}(n_1, n_2) =
        0,   if \lambda_2 > 0,
        e^{-\frac{R_B^2}{2\beta^2}}\, \big(1 - e^{-\frac{S^2}{2c^2}}\big),   otherwise,    (3.50)

where R_B = \frac{|\lambda_1|}{|\lambda_2|} is the blobness measure in 2D and S = \sqrt{\lambda_1^2 + \lambda_2^2} is the backgroundness measure. Note that Equations (3.49) and (3.50) are given for bright tubular-like structures; for dark objects the conditions should be reversed.

Figure 3.5: Left: original X-ray vessel image. Right: enhanced vessel image using the Frangi filter.
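A compact sketch of the 3D vesselness function in Equation (3.49), evaluated voxel-wise from precomputed Hessian eigenvalues. The threshold values (alpha = beta = 0.5, c set from the data) are illustrative defaults for the sketch rather than values fixed by this thesis, and the names are placeholders.

    import numpy as np

    def frangi_vesselness(l1, l2, l3, alpha=0.5, beta=0.5, c=None):
        # Equation (3.49) for bright tubular structures. l1, l2, l3 are arrays of Hessian
        # eigenvalues per voxel, already ordered so that |l1| <= |l2| <= |l3|.
        eps = 1e-12
        Ra = np.abs(l2) / (np.abs(l3) + eps)                  # plate vs. line, Eq. (3.47)
        Rb = np.abs(l1) / (np.sqrt(np.abs(l2 * l3)) + eps)    # blobness, Eq. (3.46)
        S = np.sqrt(l1 ** 2 + l2 ** 2 + l3 ** 2)              # structureness, Eq. (3.48)
        c = 0.5 * S.max() if c is None else c
        v = (1 - np.exp(-Ra ** 2 / (2 * alpha ** 2))) \
            * np.exp(-Rb ** 2 / (2 * beta ** 2)) \
            * (1 - np.exp(-S ** 2 / (2 * c ** 2)))
        return np.where((l2 > 0) | (l3 > 0), 0.0, v)          # bright vessels need l2, l3 < 0

    if __name__ == "__main__":
        # One tube-like voxel, one blob-like voxel, one background voxel.
        l1 = np.array([0.01, -1.9, 0.001])
        l2 = np.array([-2.0, -2.0, 0.002])
        l3 = np.array([-2.1, -2.1, -0.003])
        print(frangi_vesselness(l1, l2, l3))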
3.4 Ridgeness Filtering

Another approach to extracting tubular structures from images is the ridgeness filter [35, 36]. In ridgeness filtering, a tubular structure is viewed as a ridge, or principal curve, of the continuous intensity function I(x) : R^d → R. A ridge is defined as a set of curves whose points are local maxima of the function in at least one direction. This more mathematically rigorous definition of a tubular structure provides more mathematical tools to analyze the likelihood of a point being on the ridge. Moreover, unlike the Frangi filter, which only uses the information from the local Hessian matrix H(x) of the image, the ridgeness filter combines both the local gradient and the Hessian to measure ridgeness.

Let q_i and λ_i be the i-th eigenvector and eigenvalue pair of the Hessian matrix H(x) of I(x) such that |λ_1| ≤ ... ≤ |λ_d|. In general, a point is on a k-dimensional ideal ridge structure iff it satisfies the following conditions:

• the gradient g(x) is collinear with the first k eigenvectors, i.e. g(x) ∥ q_i(x), i = 1, ..., k, and is orthogonal to the remaining d − k eigenvectors, i.e. g(x)^T q_i(x) = 0, i = k + 1, ..., d;
• λ_{k+1}, ..., λ_d have the same sign;
• |λ_k| ≈ 0.

Thus, a point on the ridge is a local maximum of the function in the subspace spanned by the last d − k eigenvectors, S_⊥ = span(q_{k+1}, ..., q_d), and the tangential space is spanned by the remaining k eigenvectors, S_∥ = span(q_1, ..., q_k). If we consider the inner product between g(x) and H(x)g(x), we get
\[
g(x)^T H(x) g(x) = \sum_{i=1}^{d} \lambda_i \left(g(x)^T q_i(x)\right)^2
= \sum_{i=1}^{k} \lambda_i \left(g(x)^T q_i(x)\right)^2 + \sum_{i=k+1}^{d} \lambda_i \left(g(x)^T q_i(x)\right)^2. \tag{3.51}
\]
Note that since the eigenvalues are sorted in ascending order of magnitude, the third condition holds for all of the first k eigenvalues, i.e. |λ_i| ≈ 0, i = 1, ..., k. Hence, the inner product between g(x) and H(x)g(x) is close to zero for a point x on the ridge. Therefore, a measure for being on the ridge can be formulated in terms of the inner product between g(x) and H(x)g(x),
\[
\zeta(x) = \left| \frac{g(x)^T H(x) g(x)}{\lVert H(x) g(x) \rVert\, \lVert g(x) \rVert} \right|, \tag{3.52}
\]
where |·| denotes the absolute value. This function is bounded between 0 and 1 due to the normalization factor in the denominator (it is the absolute cosine of the angle between g(x) and H(x)g(x)).
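For concreteness, the sketch below evaluates Equation (3.52) for a 3D point given its gradient and Hessian. The plain-array representation and the zero return value for a degenerate (vanishing) denominator are assumptions of this example, not the data layout or convention of the library.

```cpp
#include <cmath>
#include <cstddef>

// Ridgeness measure of Equation (3.52): the absolute cosine of the angle
// between the gradient g and H*g, for a d = 3 point.
double ridgeness3D(const double g[3], const double H[3][3]) {
    double Hg[3] = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            Hg[i] += H[i][j] * g[j];           // H(x) g(x)
    double gHg = 0.0, nHg = 0.0, ng = 0.0;
    for (std::size_t i = 0; i < 3; ++i) {
        gHg += g[i] * Hg[i];                   // g^T H g
        nHg += Hg[i] * Hg[i];
        ng  += g[i] * g[i];
    }
    double denom = std::sqrt(nHg) * std::sqrt(ng);
    if (denom == 0.0) return 0.0;              // degenerate case, by convention here
    return std::fabs(gHg) / denom;
}
```

Following the derivation above, small values of ζ indicate points close to a ridge.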
Chapter 4

GPU Architecture and Programming Model

4.1 GPU Architecture

A GPU is connected to a host through the PCI-E bus. It has its own device memory, which in current GPU architectures can be up to several gigabytes. A GPU manages its device memory independently and cannot work on host memory directly. Typically, data in host memory needs to be transferred to GPU device memory through programmed DMA before it can be read and written by the GPU. The device memory on a GPU supports very high data bandwidth with relatively high latency. Since most data accesses on a GPU begin in device memory, it is very important for programmers to leverage the high bandwidth of device memory to achieve the peak throughput of the GPU.

NVIDIA GPUs consist of several streaming multiprocessors (SMs), each of which works independently from the others. Each multiprocessor contains a group of CUDA cores (processors), load/store units and special function units (SFUs). Each core is capable of performing integer and floating point operations. Multiprocessors create, manage, schedule, and execute threads in groups of 32 parallel threads called warps. A warp is the minimum unit of execution on a GPU. When a multiprocessor is issued a block of threads, it first partitions them into warps and then schedules those warps for execution with a warp scheduler. All the threads in a warp, if there is no divergence, execute one common instruction at a time. This architecture is called SIMT (Single Instruction, Multiple Threads). Each SM also contains several warp schedulers and instruction dispatch units, which select the warps and instructions that will be executed on the SM.

FIGURE 4.1: GPU block diagram.

A GPU's memory hierarchy can be divided into two categories: off-chip memory and on-chip memory, where the chip refers to the multiprocessor. Off-chip memory is usually slower than on-chip memory because it is relatively far from the multiprocessors. There are two types of off-chip memory: the L2 cache and the global memory. The L2 cache is part of the GPU's cache memory hierarchy. It is typically smaller than a CPU's L2 or L3 cache, but has higher bandwidth available, which makes it more suitable for throughput computing.
The L2 cache is shared by all multiprocessors on the GPU and is invisible to programmers. The global memory here refers to the GPU's device memory, although in a more precise definition the global memory is actually only a part of the device memory: global memory is a programming concept, which we will discuss in detail in the Programming Model section. The global memory is also shared by all multiprocessors.

There are five types of on-chip memory: registers, L1 cache, shared memory, constant cache and texture cache (read-only data cache). Unlike the registers on a CPU, the register file on a GPU is very large, and it is the fastest on-chip memory. The L1 cache is also part of the GPU's cache memory hierarchy. It has a larger cache line size and lower latency than the L2 cache. As an on-chip memory, the L1 cache is only accessible by the multiprocessor to which it belongs, and, like the L2 cache, it is invisible to programmers. The shared memory is a programmable cache. It shares the same physical cache component with the L1 cache, which makes shared memory extremely fast: typically, the shared memory/L1 cache is about 100× faster than the global memory. The shared memory is fully visible to programmers. The constant cache and texture cache are used for caching the constant memory and texture memory, which we will discuss in detail in the following section.

FIGURE 4.2: GPU hardware memory hierarchy.

4.2 Programming Model

In this section we describe NVIDIA's CUDA programming model. Since this is a programming model, all the concepts discussed in this section are visible to programmers. Note that some of these concepts have a physical counterpart and some do not.

FIGURE 4.3: Programming model.

In the CUDA programming model, programmers program the GPU through C-like functions called kernels. To distinguish them from the kernels used in kernel-based density and density derivative estimates, we will call these functions gpu-kernels. A gpu-kernel is simply a task that the programmer wants to assign to the GPU. Usually, the task or gpu-kernel is large and cannot be executed by the GPU all at once. Therefore, the gpu-kernel is divided into several equally sized task chunks called blocks. Each block is executed on one multiprocessor, and a multiprocessor can execute several blocks at a time. A block consists of a number of threads; the thread is the minimum task unit on the GPU (recall that a warp is the minimum execution unit on the GPU).
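The following minimal CUDA sketch shows the relationship between a gpu-kernel, its grid of blocks, and the threads inside each block. The kernel name, the block size of 256, and the element-wise scaling task are arbitrary choices for this illustration, not code from the thesis library.

```cuda
#include <cuda_runtime.h>

// A trivial gpu-kernel: each thread scales one element of an array.
__global__ void scaleKernel(float* data, float alpha, int n) {
    // Global thread index built from block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard: the last block may be partially full
        data[i] *= alpha;
}

void scaleOnGpu(float* d_data, float alpha, int n) {
    int block = 256;                      // threads per block
    int grid  = (n + block - 1) / block;  // number of blocks in the grid
    scaleKernel<<<grid, block>>>(d_data, alpha, n);
    cudaDeviceSynchronize();              // wait for the gpu-kernel to finish
}
```

The grid and block sizes passed at launch time are exactly the configuration decision discussed next.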
All the threads in a gpu-kernel must be able to execute independently. All the blocks in a gpu-kernel form the so-called computation grid. When programmers write a gpu-kernel function, they must define the grid size and block size in advance, so that the GPU knows how to assign the task to the multiprocessors.

From the programming model, we can view the memory hierarchy from a different perspective. In this new memory hierarchy, all the memories are visible to programmers. Even though they are called memories, they are actually the memory resources that the GPU assigns to the application or the gpu-kernel: they do not exist as separate physical components but are created and managed at runtime. When a gpu-kernel begins its execution on the GPU, each thread is assigned a number of dedicated registers and, if needed, a private local memory space. A programmer should always avoid using too much local memory, because local memory is allocated in the off-chip device memory, which is much slower than the on-chip registers. Each block has shared memory visible to all threads of the block and with the same lifetime as the block. Note that the shared memory mentioned here is not a physical component; it is the portion of the shared memory resource assigned to this block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. The global memory, constant memory and texture memory are allocated in the off-chip device memory as well, and they are persistent across kernel launches by the same application.

FIGURE 4.4: GPU software memory hierarchy.

4.3 Thread Execution Model

When a kernel is invoked, the CUDA runtime distributes the blocks across the multiprocessors on the device, and when a block is assigned to a multiprocessor it is further divided into groups of 32 threads, i.e., warps. A warp scheduler then selects available warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four special function units. CUDA's warp scheduling mechanism helps hide instruction latency. Each instruction of a kernel may require more than a few clock cycles to execute (for example, an instruction that reads from global memory requires multiple clock cycles). The latency of long-running instructions can be hidden by executing instructions from other warps while waiting for the result of the previous warp.

FIGURE 4.5: Warp scheduler.

It is critical for a GPU to achieve high occupancy in its execution, but, unlike on a CPU, it is usually very hard to keep the GPU busy all the time, because several factors affect the GPU's occupancy.
Those factors include the maximum number of registers per thread, the maximum number of threads in a block, and the shared memory size per block. Think of a multiprocessor as a container with limited resources such as registers, shared memory, cores and other ALU resources. As discussed previously, a gpu-kernel is executed block by block on the multiprocessors. How many blocks can be executed on a multiprocessor is decided by the block size, i.e., the amount of resources a block needs, and the amount of resources available on the multiprocessor. In order to achieve high occupancy on the GPU, our goal is to select properly sized blocks to execute on the multiprocessors so that most of the resources on the multiprocessors are occupied. For example, the Kepler architecture supports a maximum of 63 registers per thread, 1024 threads per block, and 48 KB of shared memory per multiprocessor.

TABLE 4.1: Compute capability of Fermi and Kepler GPUs.

                                   FERMI    FERMI    KEPLER   KEPLER
                                   GF100    GF104    GK104    GK110
  Max Warps / SMX                  48       48       64       64
  Max Threads / SMX                1536     1536     2048     2048
  Max Thread Blocks / SMX          8        8        16       16
  32-bit Registers / SMX           32768    32768    65536    65536
  Max Registers / Thread           63       63       63       255
  Max Threads / Thread Block       1024     1024     1024     1024
  Shared Memory Size               16 KB    16 KB    16 KB    16 KB
  Configurations                   48 KB    48 KB    32 KB    32 KB
                                                     48 KB    48 KB

4.4 Memory Accesses

Current GPU device memory can only be accessed via 32-byte, 64-byte or 128-byte transactions. All memory transactions are naturally aligned: they take place on 32-byte, 64-byte or 128-byte memory segments, i.e., the address of the first byte of the memory segment must be a multiple of the transaction size. If the memory addresses are misaligned and spread across two memory segments rather than one, it takes one more memory transaction to read or write the data. To make full use of each memory transaction, memory accesses are usually coalesced by warp. When a warp executes an instruction that needs to access device/global memory, it looks at the distribution of memory addresses across the threads within it. Instead of generating a memory transaction for each thread, it coalesces the memory accesses that read or write data from the same memory segment into just one memory transaction. Typically, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. For example, if a 32-byte memory transaction is generated for each thread's 4-byte access, throughput is divided by 8.

FIGURE 4.6: Aligned and consecutive memory access.

FIGURE 4.7: Misaligned memory access.

Besides using memory coalescing to increase global memory throughput, programmers can also speed up their applications by reducing unnecessary global memory traffic. One way to achieve this is to use shared memory. As noted earlier, shared memory is essentially a programmable L1 cache. A traditional cache is invisible to programmers, who cannot decide which data is cached; shared memory, on the other hand, is fully controlled by programmers. When the programmer identifies that some data is accessed repeatedly by threads in the same block, he can load that data from global/device memory into shared memory first and serve the later accesses from shared memory instead of global memory, which greatly reduces global memory traffic.
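The sketch below illustrates both points just discussed: consecutive threads read consecutive global memory addresses (so the loads coalesce), and data reused by every thread in the block is staged in shared memory once per block. The kernel and its task (accumulating weighted squared distances over 1D training points) are simplified placeholders for this example, not code from the library.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

// Each thread accumulates a weighted sum over all training points for one
// test point. The training data is staged tile-by-tile in shared memory so
// that every value is read from global memory once per block, not once per thread.
__global__ void weightedSums(const float* test, const float* train,
                             const float* w, float* out, int nTest, int mTrain) {
    __shared__ float sTrain[BLOCK];
    __shared__ float sW[BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one test point per thread
    float x = (i < nTest) ? test[i] : 0.0f;          // coalesced load
    float acc = 0.0f;

    for (int tile = 0; tile < mTrain; tile += BLOCK) {
        int j = tile + threadIdx.x;
        // Coalesced, once-per-block loads of a tile of training data.
        sTrain[threadIdx.x] = (j < mTrain) ? train[j] : 0.0f;
        sW[threadIdx.x]     = (j < mTrain) ? w[j]     : 0.0f;
        __syncthreads();

        int limit = min(BLOCK, mTrain - tile);
        for (int k = 0; k < limit; ++k)              // reuse from shared memory
            acc += sW[k] * (x - sTrain[k]) * (x - sTrain[k]);
        __syncthreads();
    }
    if (i < nTest) out[i] = acc;                     // coalesced store
}
```

This is the same tiling idea that Optimization III in Section 5.2.2 applies to the KDDE gpu-kernels.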
Another advantage of using shared memory is that it can be accessed simultaneously. Shared memory is divided into equally sized memory modules, called banks. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts. One exception to bank conflicts is that if all threads in a warp access the same shared memory address at the same time, only one memory request is generated and the data is broadcast to all the threads. We call this mechanism broadcasting.

Chapter 5

Algorithms and Implementations

In this chapter, we present the three main contributions of this thesis. In Section 5.1, we propose an algorithm which calculates separable multivariate kernel derivatives (SMKD) efficiently. In Section 5.2, we introduce some core functions of our kernel smoothing library, together with several optimization strategies for these functions. Finally, we design a fast k-nearest neighbors bandwidth selector in Section 5.3.

5.1 Efficient Computation of Separable Multivariate Kernel Derivative

As mentioned in Section 2.5, the implementation of Equation (2.27) requires the calculation of D^{⊗r} K(x), which is a vector containing all the partial derivatives of order r of the kernel function K at point x. For a separable kernel, these partial derivatives are given by Equation (2.12). A brute force implementation of D^{⊗r} K(x) calculates these partial derivatives one by one, but this repeats the same set of kernel, kernel derivative and multiplication operations, which is clearly not computationally efficient.

Consider a motivating example. Assume a separable 4-variable kernel K(x) = k(x_1)k(x_2)k(x_3)k(x_4). Its first order partial derivative with respect to x_4 is
\[
\frac{\partial}{\partial x_4} K(x) = k(x_1)k(x_2)k(x_3)k'(x_4),
\]
and, similarly, two of its second order derivatives are
\[
\frac{\partial^2}{\partial x_4^2} K(x) = k(x_1)k(x_2)k(x_3)k''(x_4)
\quad\text{and}\quad
\frac{\partial^2}{\partial x_3\, \partial x_4} K(x) = k(x_1)k(x_2)k'(x_3)k'(x_4).
\]
One can observe that carrying out these calculations separately would compute k(x_1)k(x_2)k(x_3) and k(x_1)k(x_2) redundantly, with three and two repetitions, respectively. This redundancy grows as the number of dimensions and the derivative order increase, which leaves significant room for optimization. Therefore, to avoid these redundant calculations, we propose a graph-based efficient algorithm in this section.

5.1.1 Definitions and Facts

Our algorithm is based on a directed acyclic graph where each node denotes a set of multivariate kernel partial derivatives and each edge denotes a univariate kernel derivative.
In order to give a well-defined and mathematically precise description of our algorithm, the required notation, definitions and facts are introduced in this section.

Definition 1. k_i^{(j)} is the value of the j-th order derivative of the univariate kernel k at x_i,
\[
k_i^{(j)} = k^{(j)}(x_i). \tag{5.1}
\]

Definition 2. N_d^{(r)} denotes the set whose members are the unique partial derivatives of order r of the kernel function K(x_1, ..., x_d) = \prod_{i=1}^{d} k(x_i),
\[
N_d^{(r)} = \left\{ k_1^{(n_1)} k_2^{(n_2)} \cdots k_d^{(n_d)} \;\middle|\; \sum_{i=1}^{d} n_i = r,\; n_i \in \mathbb{N}_0 \right\}, \quad d \in \mathbb{N}_+,\; r \in \mathbb{N}_0. \tag{5.2}
\]

Definition 3. S_d^{(r)} is the number of elements in the set N_d^{(r)},
\[
S_d^{(r)} = |N_d^{(r)}|. \tag{5.3}
\]

Definition 4. The product of a scalar ω and a set A = {a_1, a_2, ..., a_n} is defined using the operator ×, such that
\[
A \times \omega = \{a_1 \omega, a_2 \omega, \dots, a_n \omega\}. \tag{5.4}
\]

Definition 5. Define a directed acyclic graph G(V, E). Each node in V stands for a set and each edge in E has a weight. The relation between nodes and edges in G is given by Figure 5.1. Graph (a) contains two nodes, which stand for two sets A and B, and the edge (A, B) has weight ω; according to the graph, the relation between the sets is B = A × ω, where the operator × is defined in Definition 4. Similarly, in graph (b) the node C is pointed to by two edges, from nodes A and B respectively; in this case, the relation between the three sets is C = (A × ω_A) ∪ (B × ω_B).

FIGURE 5.1: Relation between nodes in graph G. (a) B = A × ω. (b) C = (A × ω_A) ∪ (B × ω_B).

Fact 5.1. The set N_1^{(i)} contains only the i-th derivative of the kernel k(x_1),
\[
N_1^{(i)} = \{k_1^{(i)}\}. \tag{5.5}
\]

Fact 5.2. The set N_i^{(j)} can be derived from the sets N_{i-1}^{(l)}, l ∈ 0, ..., j, by
\[
N_i^{(j)} = \bigcup_{l=0}^{j} \left( N_{i-1}^{(l)} \times k_i^{(j-l)} \right). \tag{5.6}
\]

Proof. According to Definition 2, we have
\[
N_i^{(j)} = \left\{ k_1^{(n_1)} k_2^{(n_2)} \cdots k_i^{(n_i)} \;\middle|\; \sum_{l=1}^{i} n_l = j,\; n_l \in \mathbb{N}_0 \right\}
= \left\{ k_1^{(n_1)} k_2^{(n_2)} \cdots k_i^{(n_i)} \;\middle|\; \sum_{l=1}^{i-1} n_l = j - n_i,\; n_l \in \mathbb{N}_0,\; n_i \in 0, \dots, j \right\}.
\]
Since n_i can be any value from 0 to j, we can split the set N_i^{(j)} into j + 1 mutually disjoint subsets,
\[
N_i^{(j)} = \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(0)} \,\middle|\, \sum_{l=1}^{i-1} n_l = j \right\}
\cup \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(1)} \,\middle|\, \sum_{l=1}^{i-1} n_l = j - 1 \right\}
\cup \dots \cup
\left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(j)} \,\middle|\, \sum_{l=1}^{i-1} n_l = 0 \right\},
\]
with n_l ∈ \mathbb{N}_0 in every subset. According to Definition 4, for any p ∈ 0, ..., j we have
\[
\left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(p)} \,\middle|\, \sum_{l=1}^{i-1} n_l = j - p,\; n_l \in \mathbb{N}_0 \right\} = N_{i-1}^{(j-p)} \times k_i^{(p)}.
\]
Therefore,
\[
N_i^{(j)} = \left( N_{i-1}^{(j-0)} \times k_i^{(0)} \right) \cup \left( N_{i-1}^{(j-1)} \times k_i^{(1)} \right) \cup \dots \cup \left( N_{i-1}^{(j-j)} \times k_i^{(j)} \right)
= \bigcup_{l=0}^{j} \left( N_{i-1}^{(l)} \times k_i^{(j-l)} \right).
\]

Fact 5.3. The number of elements in the set N_i^{(j)} equals the sum of the numbers of elements in the sets N_{i-1}^{(l)}, l ∈ 0, ..., j,
\[
S_i^{(j)} = \sum_{l=0}^{j} S_{i-1}^{(l)}. \tag{5.7}
\]

Proof. According to Fact 5.2, we have
\[
|N_i^{(j)}| = \left| \bigcup_{l=0}^{j} \left( N_{i-1}^{(l)} \times k_i^{(j-l)} \right) \right|,
\]
and since the subsets are mutually disjoint,
\[
S_i^{(j)} = \sum_{l=0}^{j} \left| N_{i-1}^{(l)} \times k_i^{(j-l)} \right|.
\]
According to Definitions 2 and 4, we know
\[
\left| N_{i-1}^{(l)} \times k_i^{(j-l)} \right|
= \left| \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(j-l)} \,\middle|\, \sum_{p=1}^{i-1} n_p = l,\; n_p \in \mathbb{N}_0 \right\} \right|
= \left| \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} \,\middle|\, \sum_{p=1}^{i-1} n_p = l,\; n_p \in \mathbb{N}_0 \right\} \right|
= S_{i-1}^{(l)}.
\]
Therefore,
\[
S_i^{(j)} = \sum_{l=0}^{j} S_{i-1}^{(l)}.
\]

Fact 5.4. The number of elements in the set N_d^{(r)} is \binom{d+r-1}{r},
\[
S_d^{(r)} = \binom{d+r-1}{r}. \tag{5.8}
\]

Proof. This statement can be proved by induction.

Basis: Show that the statement holds for d = 1. According to Fact 5.1, we know that N_1^{(r)} = \{k_1^{(r)}\}. Thus, S_1^{(r)} = 1.
Since \binom{1+r-1}{r} = \binom{r}{r} = 1, we get S_1^{(r)} = \binom{1+r-1}{r}.

Basis: Show that the statement holds for r = 0. According to Definition 2, we know that N_d^{(0)} = \{k_1^{(0)} k_2^{(0)} \cdots k_d^{(0)}\}. Thus, S_d^{(0)} = 1. Since \binom{d+0-1}{0} = \binom{d-1}{0} = 1, we get S_d^{(0)} = \binom{d+0-1}{0}.

Inductive step: Show that if the statement holds for S_{i-1}^{(j)} and S_i^{(j-1)}, then it also holds for S_i^{(j)}. According to Fact 5.3, we know that
\[
S_i^{(j)} = \sum_{l=0}^{j} S_{i-1}^{(l)} = \sum_{l=0}^{j-1} S_{i-1}^{(l)} + S_{i-1}^{(j)}.
\]
Reapplying Fact 5.3 to \sum_{l=0}^{j-1} S_{i-1}^{(l)} = S_i^{(j-1)}, we get
\[
S_i^{(j)} = S_i^{(j-1)} + S_{i-1}^{(j)} = \binom{i+j-2}{j-1} + \binom{i+j-2}{j} = \binom{i+j-1}{j}.
\]
Since both the basis and the inductive step have been established, by mathematical induction the statement S_d^{(r)} = \binom{d+r-1}{r} holds for all d ∈ \mathbb{N}_+ and r ∈ \mathbb{N}_0. Q.E.D.

5.1.2 Algorithm

Our algorithm is illustrated in Figure 5.2. Consistent with Definition 5, each node in Figure 5.2 stands for a set. As defined in Definition 2, a set N_i^{(j)} contains all the partial derivatives of order j of the i-variable function K(x_1, ..., x_i). Each edge in the graph defines a product operation, defined in Definition 4, between its head node and its weight; the weight of an edge is a univariate kernel derivative as given by Definition 1. The relationship between an edge's head, weight and tail is demonstrated in Figure 5.1.

FIGURE 5.2: Graph-based efficient multivariate kernel derivative algorithm.

Ignoring the output node N_d^{(r)}, the graph forms a matrix of nodes. Each column contains the sets whose elements are partial derivatives of the same kernel function, and each row contains the sets whose elements are partial derivatives of the same order. Our algorithm starts from the left side of the graph, where we initialize all the nodes in the first column by assigning each its corresponding univariate kernel derivative, as stated in Fact 5.1. Then, according to Fact 5.2, we compute the sets in each column from the sets in the previous column, and repeat this step until we reach the (d − 1)-th column. Finally, once we have the output of the (d − 1)-th column, we reapply Fact 5.2 through N_d^{(r)} = \bigcup_{i=0}^{r} ( N_{d-1}^{(i)} \times k_d^{(r-i)} ) and output the result. The outline of this algorithm is shown in Algorithm 1.

5.1.3 Complexity Analysis

Instead of computing the multivariate partial derivatives directly as products of univariate derivatives, this algorithm reuses the results from previous columns as much as possible, which removes a great number of operations. Since all the univariate kernel derivatives k_i^{(j)}, i ∈ 1, ..., d, j ∈ 0, ..., r, can be calculated efficiently in advance, the only operations needed for calculating the multivariate kernel derivative are multiplications. Thus, in the rest of this section we focus on counting the number of multiplications in the proposed efficient algorithm and comparing it with the naive method.

The naive algorithm calculates the product of univariate derivatives for each multivariate partial derivative separately. Assume we want to calculate the r-th order partial derivatives of the d-variable kernel function K(x_1, ..., x_d). Then, according to Fact 5.4, the number of r-th order partial derivatives is \binom{d+r-1}{r}.
Since each partial derivative is a product of d univariate kernel derivatives, which requires d − 1 multiplications, the total number of multiplications in the naive algorithm is
\[
M_n = (d - 1) \binom{d+r-1}{r}. \tag{5.9}
\]

Algorithm 1 Efficient Multivariate Kernel Derivative
 1: procedure MULTIVARIATEDERIVATIVE(d, r)
 2:   for i ← 0, r do
 3:     N_1^{(i)} ← {k_1^{(i)}}
 4:   end for
 5:   for i ← 2, d − 1 do
 6:     for j ← 0, r do
 7:       N_i^{(j)} ← Ø
 8:       for l ← 0, j do
 9:         N_i^{(j)} ← N_i^{(j)} ∪ (N_{i−1}^{(l)} × k_i^{(j−l)})
10:       end for
11:     end for
12:   end for
13:   for i ← 0, r do
14:     N_d^{(r)} ← N_d^{(r)} ∪ (N_{d−1}^{(i)} × k_d^{(r−i)})
15:   end for
16:   return N_d^{(r)}
17: end procedure

From Algorithm 1, we know that the multiplications of the proposed efficient algorithm happen in the calculation of N_{i−1}^{(l)} × k_i^{(j−l)} and N_{d−1}^{(i)} × k_d^{(r−i)}. According to Definition 4, the number of multiplications performed by one × operation is equal to the size of the set. Thus, the numbers of multiplications in computing N_{i−1}^{(l)} × k_i^{(j−l)} and N_{d−1}^{(i)} × k_d^{(r−i)} are S_{i−1}^{(l)} and S_{d−1}^{(i)}, respectively. Therefore, applying this to all the for loops in Algorithm 1, we get
\[
M_e = \sum_{i=2}^{d-1} \sum_{j=0}^{r} \sum_{l=0}^{j} S_{i-1}^{(l)} + \sum_{i=0}^{r} S_{d-1}^{(i)}. \tag{5.10}
\]
According to Fact 5.3, the above equation can be simplified to
\[
M_e = \sum_{i=2}^{d-1} S_{i+1}^{(r)} + S_d^{(r)} = \sum_{i=2}^{d-1} \binom{r+i}{r} + \binom{d+r-1}{r}. \tag{5.11}
\]
Comparing Equation (5.11) with (5.9), we find that the number of multiplications in the efficient algorithm is significantly smaller than in the naive algorithm. Hence, our algorithm can achieve a considerable speedup in theory. The detailed experimental results are given in Chapter 6.
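To make the column-by-column construction concrete, the following C++ sketch implements the recursion of Fact 5.2 on numeric values, representing each node N_i^{(j)} as a vector of partial-derivative values. The function name and the nested std::vector representation are choices made for this illustration; the library's actual implementation uses a different data layout.

```cpp
#include <vector>

// k[i][j] holds the univariate derivative value k_i^{(j)} = k^{(j)}(x_i),
// for i = 0..d-1 and j = 0..r. Returns the values of the set N_d^{(r)},
// i.e. all unique order-r partial derivatives of the separable kernel.
// Assumes d >= 2, as in Algorithm 1.
std::vector<double> smkd(const std::vector<std::vector<double>>& k, int r) {
    int d = static_cast<int>(k.size());
    // column[j] is the current node N_i^{(j)}, stored as a list of values.
    std::vector<std::vector<double>> column(r + 1);
    for (int j = 0; j <= r; ++j)
        column[j] = {k[0][j]};                      // Fact 5.1: N_1^{(j)} = {k_1^{(j)}}

    for (int i = 1; i < d - 1; ++i) {               // columns 2 .. d-1
        std::vector<std::vector<double>> next(r + 1);
        for (int j = 0; j <= r; ++j)
            for (int l = 0; l <= j; ++l)            // Fact 5.2
                for (double v : column[l])
                    next[j].push_back(v * k[i][j - l]);
        column.swap(next);
    }

    std::vector<double> out;                        // final node N_d^{(r)}
    for (int l = 0; l <= r; ++l)
        for (double v : column[l])
            out.push_back(v * k[d - 1][r - l]);
    return out;
}
```

For d = 3 and r = 2, this produces the \binom{4}{2} = 6 unique second order partial derivatives while reusing every intermediate product, rather than forming each product from scratch.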
5.2 High Performance Kernel Density and Kernel Density Derivative Estimators

Kernel density and kernel density derivative estimation methods usually have very high computational requirements. From the discussions in Section 2.5 and Section 5.1.1, we know that the direct computation of the KDE and KDDE requires mn\binom{d+r-1}{r} kernel evaluations, where m is the number of test points, n is the number of training points, d is the dimension of the data, and r is the order of the estimator. Data sets have been growing larger and larger in recent years; in our case, the test and training data are usually of size 10^6 to 10^7. Fortunately, the evaluations of KDE and KDDE are independent for different test points, which makes them a perfect fit for parallel computing. In this section, we propose multi-core CPU and GPU based solutions to accelerate the computation of KDE and KDDE, using several optimization techniques to achieve significant performance gains.

5.2.1 Multi-core CPU Implementation

The goal of the multi-core CPU implementation is to deliver a set of kernel smoothing functions with high flexibility as well as good performance. For flexibility, this implementation supports input data of any dimension, can compute kernel density derivatives of any order, and offers a flexible choice of kernel and bandwidth types. To achieve good performance, this implementation uses the POSIX Threads (PThreads) programming interface to exploit parallelism on multi-core CPU platforms. In this section, we focus only on the most general case (unconstrained variable bandwidth, any dimension, any order, and Gaussian kernel) due to its high computational and mathematical complexity.

According to Equation (2.27), the KDDE is calculated at m different test points x_i, and for each KDDE at test point x_i we need to calculate the weighted sum of the scaled r-th order kernel derivative D^{⊗r} K_{S_j} at n different shifted locations x_i − X_j. Thus, there are m × n scaled r-th order kernel derivative calculations involved. Note that the scaled and unscaled r-th order kernel derivatives at the shifted location x_i − X_j are related by D^{⊗r} K_{S_j}(x_i − X_j) = |S_j| S_j^{⊗r} D^{⊗r} K(S_j(x_i − X_j)). Hence, the calculation of the scaled r-th order kernel derivative D^{⊗r} K_{S_j} at x_i − X_j can be divided into four steps:

• calculate the scaled and shifted data y = S_j(x_i − X_j), where y is a d-dimensional vector y = [y_1, y_2, ..., y_d]^T;

• for each variable y_l, l = 1, ..., d, in y, calculate its univariate kernel and kernel derivatives k^{(0)}(y_l), k^{(1)}(y_l), ..., k^{(r)}(y_l), and store the results in a d × (r + 1) matrix F, where F(u, v) = k^{(v)}(y_u);

• calculate the multivariate r-th order kernel derivative D^{⊗r} K from the univariate kernel derivatives in F using the efficient algorithm introduced in Section 5.1. Note that D^{⊗r} K contains d^r r-th order partial derivatives, which include some redundancies, whereas the efficient algorithm only produces the \binom{d+r-1}{r} unique partial derivatives. Hence, we need to copy some results from the efficient algorithm into the redundant locations of D^{⊗r} K;

• calculate |S_j| and S_j^{⊗r}, and update D^{⊗r} K_{S_j} as |S_j| S_j^{⊗r} D^{⊗r} K.

Since the calculations of the KDDE at different test points are independent, we can parallelize the algorithm at the test point level. The resulting general KDDE algorithm for the multi-core platform is given in Algorithm 2.

Algorithm 2 Parallel CPU Kernel Density Derivative Estimation
 1: procedure KDDE(x, X, S, ω, r)
 2:   d, m ← SIZE(x)
 3:   n ← SIZE(X, 2)
 4:   D^{⊗r}f ← ZEROS(m, d^r)
 5:   parfor i ← 0, m − 1 do
 6:     for j ← 0, n − 1 do
 7:       y ← S(j)(x(i) − X(j))
 8:       F ← UNIVARIATEDERIVATIVES(y, r)
 9:       D^{⊗r}K ← MULTIVARIATEDERIVATIVES(F, d, r)
10:       D^{⊗r}f(i) ← D^{⊗r}f(i) + ω(j) |S(j)| S^{⊗r}(j) D^{⊗r}K
11:     end for
12:   end parfor
13:   return D^{⊗r}f
14: end procedure
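The parfor in Algorithm 2 can be realized by statically partitioning the test points across worker threads. The sketch below shows this partitioning with PThreads; the KddeArgs structure, the evaluateTestPoint helper, and the fixed chunk scheme are placeholders for this example rather than the library's actual interface.

```cpp
#include <pthread.h>
#include <algorithm>
#include <vector>

struct KddeArgs {
    int begin, end;                 // half-open range of test point indices
    // Pointers to x, X, S, the weights and the output buffer would go here.
};

// Placeholder for the per-test-point work of Algorithm 2 (the four steps above).
void evaluateTestPoint(int /*i*/) { /* ... */ }

void* worker(void* p) {
    KddeArgs* a = static_cast<KddeArgs*>(p);
    for (int i = a->begin; i < a->end; ++i)
        evaluateTestPoint(i);       // inner loop over all training points
    return nullptr;
}

void parallelKdde(int m, int numThreads) {
    std::vector<pthread_t> tid(numThreads);
    std::vector<KddeArgs> args(numThreads);
    int chunk = (m + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        args[t].begin = t * chunk;
        args[t].end   = std::min(m, (t + 1) * chunk);
        pthread_create(&tid[t], nullptr, worker, &args[t]);
    }
    for (int t = 0; t < numThreads; ++t)
        pthread_join(tid[t], nullptr);  // no locking needed: each thread
                                        // writes a disjoint slice of the output
}
```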
5.2.2 GPU Implementation in CUDA

From Chapter 3, we know that in the context of image processing and pattern recognition, the estimation of the first and second derivatives of the density is crucial for locating significant feature regions in images. Therefore, in this section we focus only on the kernel gradient and curvature estimators, as given by Equations (3.43) and (3.44), for 2D and 3D images. Moreover, since the choice of kernel function is not crucial to the accuracy of KDE and KDDE, we choose the standard Gaussian function as the kernel. Based on these assumptions, and because the multivariate estimators are by far the most computationally and mathematically involved, we present several optimized GPU KDE and KDDE implementations in this section.

Naive Implementation

Like the multi-core CPU implementation, the naive GPU implementation parallelizes at the test point level. We create four gpu-kernel functions (ShiftAndScale, UnivarDeri, MultivarDeri, and Update) corresponding to the four steps in calculating the scaled kernel derivatives. Each gpu-kernel is designed to complete its job for all the test points concurrently. To achieve this, based on the CUDA programming model of Section 4.2, we divide the gpu-kernel into ⌈m/t⌉ equally sized blocks, each containing t threads. Thus, there are roughly m threads in the gpu-kernels, and each of these m threads is responsible for calculating the linear combination of the shifted D^{⊗r} K_{S_j} functions at one test point. The naive implementation is shown in Algorithm 3. To illustrate the problems and optimization techniques clearly, we give the most complex KDDE function from our kernel smoothing library, which computes the kernel density, kernel gradient and kernel curvature at the same time. Since we are only interested in the first and second order derivatives, we implement the estimators according to Equations (2.28) and (2.30).

Optimization I – Kernel Merging, Loop Unrolling, and Memory Layout Optimization

If we take a close look at Algorithm 3, we can find several problems:

• Too many small gpu-kernels. The four gpu-kernels generate many redundant global memory transactions: every gpu-kernel has to save its results back to global memory so that they can be used by the following gpu-kernels. As mentioned in Section 4.1, all global memory transactions are off-chip, which makes them much slower than the on-chip memories. Thus, the redundant off-chip global memory transactions introduce many warp stalls and eventually slow down execution on the GPU.

• Bad memory layouts. The memory layout of matrices and cubes is column-major (first dimension first) in the naive implementation. As illustrated in Figure 5.3 (a) and (c), such a memory layout results in strided memory accesses. As mentioned in Section 4.4, memory accesses are coalesced by warp, so scattered or strided memory accesses require more memory transactions, since they cannot be efficiently grouped into one transaction.

• Unnecessary loops. The calculation of the multivariate kernel derivatives involves many matrix operations. To complete these operations, we need to write many for loops in the gpu-kernels. Usually, the sizes of those for loops are determined by the data dimension. However,
ALGORITHMS AND IMPLEMENTATIONS Algorithm 3 Naive GPU Kernel Density Derivative Estimation 1: procedure KDDE(x, X, S, w) 2: d, m ← S IZE(x) 3: n ← S IZE(X, 2) 4: f ← Z EROS(1, m) 5: g ← Z EROS(d, m) 6: H ← Z EROS(d ∗ d, m) 7: for i ← 0, n − 1 do 8: y i ← S HIFTA ND S CALE(x, X, S, i) 0 00 9: ki , ki , ki ← U NIVAR D ERI(y i ) 0 00 10: f i , g i , H i ← M ULTIVAR D ERI(ki , ki , ki ) 11: f , g, H ← U PDATE(f i , g i , H i , S, w, i) 12: end for 13: return f , g, H 14: end procedure 49: end for 50: g i [j ∗ d + p] ← gp 51: H i [j ∗ d ∗ d + p ∗ d + p] ← Hpp 52: end for 53: f i [j] ← f 54: for p ← 1, d − 1 do 55: for q ← 0, p − 1 do 0 56: Hpq ← ki [j ∗ d + q] ∗ g i [j ∗ d + p]/ki [j ∗ d + q] 57: H i [j ∗ d ∗ d + p ∗ d + q] ← Hpq 58: H i [j ∗ d ∗ d + q ∗ d + p] ← Hpq 59: end for 60: end for 61: return f i , g i , H i 62: end procedure 15: procedure S HIFTA ND S CALE(x, X, S, i) 16: j ← blockDim.x ∗ blockIdx.x + threadIdx.x 17: d, n ← S IZE(x) 18: m ← S IZE(X, 2) 19: for k ← 0, d − 1 do 20: t[j ∗ d + k] ← x[j ∗ d + k] − X[i ∗ d + k] 21: end for 22: for p ← 0, d − 1 do 23: for q ← 0, d − 1 do 24: y i [j ∗ d + p] ← y i [j ∗ d + p] + t[j ∗ d + q] ∗ 63: procedure U PDATE(f i , g i , H i , S, w, i) 64: j ← blockDim.x + blockIdx.x + threadIdx.x 65: f [j] ← f [j] + f i [j] ∗ w[i] 66: for p ← 0, d − 1 do 67: t←0 68: for q ← 0, d − 1 do 69: t ← t + g i [j ∗ d + q] ∗ S[j ∗ d ∗ d + d ∗ q + p] 70: end for 71: g[j ∗ d + p] ← g[j ∗ d + p] + w[j] ∗ t 72: end for 73: for p ← 0, d − 1 do 74: for q ← 0, d − 1 do 75: t←0 76: for r ← 0, d − 1 do 77: t ← t + H i [j ∗ d ∗ d + p ∗ d + r] ∗ S[j ∗ d ∗ S[d ∗ d ∗ i + d ∗ q + p] 25: end for 26: end for 27: return y i 28: end procedure 29: procedure U NIVAR D ERI(y i ) 30: j ← blockDim.x + blockIdx.x + threadIdx.x 31: yij ← y i [j] 32: k ← 1/SQRT(2 ∗ π) ∗ EXP(−0.5 ∗ yij ∗ yij ) 33: ki [j] ← k 0 34: ki [j] ← −yij ∗ k 00 35: ki [j] ← (yij ∗ yij − 1) ∗ k 00 0 36: return ki , ki , ki 37: end procedure 0 00 38: procedure M ULTIVAR D ERI(ki , ki , ki ) 39: j ← blockDim.x + blockIdx.x + threadIdx.x 40: d, n ← S IZE(ki ) 41: f ←1 42: for p ← 0, d − 1 do 43: f ← f ∗ ki [j ∗ d + p] 0 44: gp ← ki [j ∗ d + p] 00 45: Hpp ← ki [j ∗ d + p] 46: for q ← p + 1, d + p − 1 do 47: gp ← gp ∗ ki [j ∗ d + MOD(q, d)] 48: Hpp ← Hpp ∗ ki [j ∗ d + MOD(q, d)] d + r ∗ d + q] end for H tmp [p ∗ d + q] ← t end for end for for p ← 0, d − 1 do for q ← 0, d − 1 do t←0 for r ← 0, d − 1 do t ← t+S[j∗d∗d+r∗d+p]∗H tmp [r∗d+q] end for H[j ∗ d ∗ d + p ∗ d + q] ← H[j ∗ d ∗ d + p ∗ d + q] + w[j] ∗ t 89: if p 6= q then 90: H[j ∗ d ∗ d + q ∗ d + p] ← H[j ∗ d ∗ d + q ∗ d + p] + w[j] ∗ t 91: end if 92: end for 93: end for 94: return f , g, H 95: end procedure 78: 79: 80: 81: 82: 83: 84: 85: 86: 87: 88: 50 CHAPTER 5. ALGORITHMS AND IMPLEMENTATIONS since we are only interested in 2D and 3D data in this section, the for loop size is actually only 2 or 3, which is definitely inefficient and should be unrolled to remove the loop overhead. Global Memory x(0,0) Thread IDs x(1,0) x(2,0) x(0,1) 0 x(1,1) x(2,1) … X(0,N-1) … 1 X(1,N-1) X(2,N-1) x(2,1) … N-1 (a) x(0,0) x(0,1) … X(0,N-1) 0 1 … N-1 x(1,0) … x(1,1) X(1,N-1) x(2,0) X(2,N-1) (b) x(0,0,0) x(1,0,0) x(0,1,0) x(1,1,0) x(0,0,1) x(1,0,1) x(0,1,1) x(1,1,1) 0 … x(0,0,N-1) … 1 x(1,0,N-1) x(0,1,N-1) x(1,1,N-1) N-1 (c) x(0,0,0) x(0,0,1) 0 1 … x(0,0,N-1) … N-1 x(1,0,0) x(1,0,1) … x(1,0,N-1) x(0,1,0) x(0,1,1) … x(0,1,N-1) x(1,1,0) x(1,1,1) … x(1,1,N-1) (d) F IGURE 5.3: Memory access patterns of matrices and cubes. (a) Memory access pattern of column-major matrix. 
(b) Memory access pattern of a row-major matrix. (c) Memory access pattern of a column-major cube (3D matrix). (d) Memory access pattern of a slice-major cube (3D matrix).

Therefore, to solve these problems, we propose the optimized implementation in Algorithm 4. Here, we merge the four small gpu-kernels into two big gpu-kernels to avoid redundant global memory transactions. However, big gpu-kernels usually consume more registers because they use many variables, and, as discussed in Section 4.3, the GPU has limited register resources: if a gpu-kernel uses too many registers, it cannot achieve high occupancy, which results in bad performance. Hence, in optimized implementation I, we reduce register usage by reusing previous results and local variables as much as possible. We also change the column-major memory layout of matrices and cubes into row-major (second dimension first) and slice-major (third dimension first) layouts, respectively. As Figure 5.3 (b) and (d) show, the memory accesses are then contiguous in either case. Furthermore, we unroll the loops in the optimized implementation. Since the order of the statements is no longer restricted to what it was inside the loop, loop unrolling not only avoids executing loop control instructions, but also gives us more flexible control over the statements that were inside the for loop.

Optimization II – Simplified Math Expressions

In Optimization I, we improved the implementation from the perspective of the GPU code. However, the implementation can also be improved at the algorithm level. If we look at the kernel function itself, we find that a better representation can be used to simplify the KDDE and thus speed up the calculation. From Equation (2.9), a separable multivariate Gaussian kernel can be written as
\[
K(x) = \prod_{l=1}^{d} \frac{1}{\sqrt{2\pi}}\, e^{-x_l^2/2}. \tag{5.12}
\]
If we evaluate this equation directly, it results in computing the exponential function d times. However, we can use the properties of the exponential function and simplify the above equation to
\[
K(x) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d} e^{-\frac{1}{2} x^T x}. \tag{5.13}
\]
The simplified kernel function requires only one exponential evaluation. Similarly, instead of calculating Equations (2.10) and (2.11), the gradient and Hessian of the separable multivariate Gaussian kernel can be given as
\[
\nabla K(x) = -\left(\frac{1}{\sqrt{2\pi}}\right)^{d} e^{-\frac{1}{2} x^T x}\, x, \tag{5.14}
\]
\[
\nabla^2 K(x) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d} e^{-\frac{1}{2} x^T x} \left(x x^T - I\right). \tag{5.15}
\]
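The payoff of Equations (5.13)–(5.15) is that a single exponential serves the density, gradient and Hessian alike. The following CUDA device-function sketch evaluates all three for a 3D point y = S_j(x_i − X_j); the function name and the fixed d = 3 normalization constant are assumptions of this example, not the library's interface.

```cuda
// Evaluate the standard Gaussian kernel and its first two derivatives at a
// 3D point y, using Equations (5.13)-(5.15): one expf() is shared by the
// density k, the gradient g = -k*y, and the Hessian h = k*(y*y^T - I).
__device__ void gaussianKernel3(const float y[3], float* k,
                                float g[3], float h[3][3]) {
    const float c = 0.06349363593424097f;          // (1/sqrt(2*pi))^3
    float q = y[0] * y[0] + y[1] * y[1] + y[2] * y[2];
    float e = c * expf(-0.5f * q);                 // the only exponential
    *k = e;
    for (int i = 0; i < 3; ++i) {
        g[i] = -e * y[i];
        for (int j = 0; j < 3; ++j)
            h[i][j] = e * (y[i] * y[j] - (i == j ? 1.0f : 0.0f));
    }
}
```

Scaling by |S_j| and S_j and accumulating the weighted sums then proceed as in the surrounding algorithms.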
ALGORITHMS AND IMPLEMENTATIONS Algorithm 4 Optimized GPU Kernel Density Derivative Estimation I 1: procedure KDDE(x, X, S, w) 2: m ← S IZE(x, 1) 3: n ← S IZE(X, 1) 4: f ← Z EROS(n, 1) 5: g ← Z EROS(n, 3) 6: H ← Z EROS(n, 9) 7: for i ← 0, n − 1 do 8: yi ← S HIFTA ND S CALE(x, X, S, i) 9: f, g, H ← K ERNEL C ORE(x, yi , S, w[i], i) 10: end for 11: return f, g, H 12: end procedure 37: 38: 39: 40: 41: 42: 43: 44: 45: 46: 47: 48: 49: 50: 51: 52: 53: 54: fij ← wi ∗ c fij ← fij ∗ E XP(−0.5 ∗ xj0 ∗ xj0 ) fij ← fij ∗ E XP(−0.5 ∗ xj1 ∗ xj1 ) fij ← fij ∗ E XP(−0.5 ∗ xj2 ∗ xj2 ) t0 ← xj0 ∗ xj1 ∗ fij t1 ← xj0 ∗ xj2 ∗ fij t2 ← xj1 ∗ xj2 ∗ fij t3 ← s0 ∗ (xj0 ∗ xj0 − 1) ∗ fij + s1 ∗ t0 + s2 ∗ t1 t4 ← s0 ∗ t0 + s1 ∗ (xj1 ∗ xj1 − 1) ∗ fij + s2 ∗ t2 t5 ← s0 ∗ t1 + s1 ∗ t2 + s2 ∗ (xj2 ∗ xj2 − 1) ∗ fij f[j] ← f[j] + fij g[j] ← g[j] − fij ∗ (s0 ∗ xj0 + s1 ∗ xj1 + s2 ∗ xj2 ) H[j] ← H[j] + s0 ∗ t3 + s1 ∗ t4 + s2 ∗ t5 13: procedure S HIFTA ND S CALE(x, X, S, i) s0 ← S[i ∗ 9 + 3] 14: j ← blockDim.x ∗ blockIdx.x + threadIdx.x s1 ← S[i ∗ 9 + 4] 15: n ← S IZE(x, 1) s2 ← S[i ∗ 9 + 5] 16: m ← S IZE(X, 1) t6 ← s0 ∗ t3 + s1 ∗ t4 + s2 ∗ t5 17: yi0 ← x[j + n ∗ 0] − X[i + m ∗ 0] g[j + n ∗ 1] ← g[j + n ∗ 1] − fij ∗ (s0 ∗ xj0 + s1 ∗ 18: yi1 ← x[j + n ∗ 1] − X[i + m ∗ 1] xj1 + s2 ∗ xj2 ) 19: yi2 ← x[j + n ∗ 2] − X[i + m ∗ 2] 55: H[j + n ∗ 1] ← H[j + n ∗ 1] + t6 20: yi [j + n ∗ 0] ← yi0 ∗ S[i ∗ 9 + 0] + yi1 ∗ S[i ∗ 9 + 3] + 56: H[j + n ∗ 3] ← H[j + n ∗ 3] + t6 yi2 ∗ S[i ∗ 9 + 6] 57: t6 ← s3 ∗ t3 + s4 ∗ t4 + s5 ∗ t5 21: yi [j + n ∗ 1] ← yi0 ∗ S[i ∗ 9 + 1] + yi1 ∗ S[i ∗ 9 + 4] + 58: H[j + n ∗ 2] ← H[j + n ∗ 2] + t6 yi2 ∗ S[i ∗ 9 + 7] 59: H[j + n ∗ 6] ← H[j + n ∗ 6] + t6 22: yi [j + n ∗ 2] ← yi0 ∗ S[i ∗ 9 + 2] + yi1 ∗ S[i ∗ 9 + 5] + 60: t3 ← s0 ∗ (xj0 ∗ xj0 − 1) ∗ fij + s1 ∗ t0 + s2 ∗ t1 yi2 ∗ S[i ∗ 9 + 8] 61: t4 ← s0 ∗ t0 + s1 ∗ (xj1 ∗ xj1 − 1) ∗ fij + s2 ∗ t2 23: return yi 62: t5 ← s0 ∗ t1 + s1 ∗ t2 + s2 ∗ (xj2 ∗ xj2 − 1) ∗ fij 24: end procedure 63: t6 ← s3 ∗ t3 + s4 ∗ t4 + s5 ∗ t5 64: g[j + n ∗ 2] ← g[j + n ∗ 2] − fij ∗ (s3 ∗ xj0 + s4 ∗ 25: procedure K ERNEL C ORE(x, yi , S, wi , i) xj1 + s5 ∗ xj2 ) 26: j ← blockDim.x ∗ blockIdx.x + threadIdx.x 65: H[j + n ∗ 4] ← H[j + n ∗ 4] + s0 ∗ t3 + s1 ∗ t4 + s2 ∗ t5 27: n ← S IZE(yi , 1) 66: H[j + n ∗ 5] ← H[j + n ∗ 5] + t6 28: xj0 ← x[j + n ∗ 0] 67: H[j + n ∗ 7] ← H[j + n ∗ 7] + t6 29: xj1 ← x[j + n ∗ 1] 68: t3 ← s3 ∗ (xj0 ∗ xj0 − 1) ∗ fij + s4 ∗ t0 + s5 ∗ t1 30: xj2 ← x[j + n ∗ 2] 69: t4 ← s3 ∗ t0 + s4 ∗ (xj1 ∗ xj1 − 1) ∗ fij + s5 ∗ t2 31: s0 ← S[i ∗ 9 + 0] 70: t5 ← s3 ∗ t1 + s4 ∗ t2 + s5 ∗ (xj2 ∗ xj2 − 1) ∗ fij 32: s1 ← S[i ∗ 9 + 1] 71: H[j + n ∗ 8] ← H[j + n ∗ 8] + s3 ∗ t3 + s4 ∗ t4 + s5 ∗ t5 33: s2 ← S[i ∗ 9 + 2] 72: return f, g, H 34: s3 ← S[i ∗ 9 + 6] 73: end procedure 35: s4 ← S[i ∗ 9 + 7] 36: s5 ← S[i ∗ 9 + 8] 53 CHAPTER 5. ALGORITHMS AND IMPLEMENTATIONS Therefore, the simplified kernel density, kernel gradient, kernel curvature estimators can be written as n X 1 T T 1 fˆ(xi ; S j , ωj ) = ( √ )d ωj |S j |e− 2 (xi −X j ) S j S j (xi −X j ) 2π j=1 (5.16) n 1 T T 1 dX ˆ ∇f (xi ; S j , ωj ) = ( √ ) ωj |S j |e− 2 (xi −X j ) S j S j (xi −X j ) S Tj S j (xi − X j ) 2π j=1 (5.17) n X 1 T T 1 ∇2 fˆ(xi ; S j , ωj ) =( √ )d ωj |S j |e− 2 (xi −X j ) S j S j (xi −X j ) 2π j=1 (5.18) (S Tj S j (xi − X j )(xi − X j )T S j S Tj − S Tj S j ) The advantage of these simplified forms is that it not only simplifies the computation but also de1 T S T S (x −X ) j i j j creases the usage of variables. 
First, they contain the common expression ω_j |S_j| e^{-\frac{1}{2}(x_i - X_j)^T S_j^T S_j (x_i - X_j)}; hence, we only need to compute this expression once and save the result for reuse. Second, the expression S_j^T S_j (x_i − X_j) appears repeatedly, so its value can also be reused in multiple places. Third, if S_j^T S_j is computed in advance, no square matrix multiplications remain. Since square matrix multiplication involves many addition and multiplication operations and needs many variables to store temporary results, the simplified forms greatly reduce register usage.

Moreover, in Optimization I we need to call the gpu-kernels once for each training point. Since the number of training points is usually very large, this results in a great number of gpu-kernel calls and therefore significant kernel-call overhead. One solution to this problem is to design one big kernel that encloses the outer for loop, so that there is only a single gpu-kernel call. Previously, this solution was not practical because such a large gpu-kernel would consume too many GPU resources, leading to low GPU occupancy. But thanks to the low variable usage of the simplified forms, it is now possible to merge all the gpu-kernel calls into one single call. Based on the analysis above, we propose the optimized implementation in Algorithm 5. The main function KDDE now contains only two gpu-kernels, SquareAndDet and LinCombKernels. SquareAndDet is responsible for calculating the squared scale S_j^T S_j and the scale determinant |S_j| for each training point. The outer for loop is moved into LinCombKernels, which essentially computes Equations (5.16), (5.17), and (5.18).

Optimization III – Exploiting Temporal Locality Using Shared Memory

As mentioned in Section 4.1, shared memory is a programmable cache on the GPU and is far faster than the off-chip global memory. However, to utilize shared memory efficiently, there has
ALGORITHMS AND IMPLEMENTATIONS Algorithm 5 Optimized GPU Kernel Density Derivative Estimation II 1: procedure KDDE(x, X, S, w) 31: t2 ← xi2 − X[j + m ∗ 2] 2: SS, c ← S QUARE A ND D ET(S, w) 32: t3 ← t1 ∗S[j∗9+0]+t2 ∗S[j∗9+3]+t3 ∗S[j∗9+6] 3: f, g, H ← L IN C OMB K ERNELS(x, X, S, SS, c) 33: t4 ← t1 ∗S[j∗9+1]+t2 ∗S[j∗9+4]+t3 ∗S[j∗9+7] 4: return f, g, H 34: t5 ← t1 ∗S[j∗9+2]+t2 ∗S[j∗9+5]+t3 ∗S[j∗9+8] 5: end procedure 35: fij ← c[j] ∗ E XP(t3 ∗ t3 + t4 ∗ t4 + t5 ∗ t5 ) 36: fi ← fi + fij 6: procedure S QUARE A ND D ET(S, w) 37: ss11 ← SS[j ∗ 6 + 0] 7: i ← blockDim.x ∗ blockIdx.x + threadIdx.x 38: ss12 ← SS[j ∗ 6 + 1] 8: s11 ← S[i∗9+0], s12 ← S[i∗9+3], s13 ← S[i∗9+6] 39: ss13 ← SS[j ∗ 6 + 2] 9: s21 ← S[i∗9+1], s22 ← S[i∗9+4], s23 ← S[i∗9+7] 40: ss22 ← SS[j ∗ 6 + 3] 10: s31 ← S[i∗9+2], s32 ← S[i∗9+5], s33 ← S[i∗9+8] 41: ss23 ← SS[j ∗ 6 + 4] 11: SS[i ∗ 6 + 0] ← s11 ∗ s11 + s21 ∗ s21 + s31 ∗ s31 42: ss33 ← SS[j ∗ 6 + 5] 12: SS[i ∗ 6 + 1] ← s11 ∗ s12 + s21 ∗ s22 + s31 ∗ s32 43: t3 ← ss11 ∗ t0 + ss12 ∗ t1 + ss13 ∗ t2 13: SS[i ∗ 6 + 2] ← s11 ∗ s13 + s21 ∗ s23 + s31 ∗ s33 44: t4 ← ss12 ∗ t0 + ss22 ∗ t1 + ss23 ∗ t2 14: SS[i ∗ 6 + 3] ← s12 ∗ s12 + s22 ∗ s22 + s32 ∗ s32 45: t5 ← ss13 ∗ t0 + ss23 ∗ t1 + ss33 ∗ t2 15: SS[i ∗ 6 + 4] ← s12 ∗ s13 + s22 ∗ s23 + s32 ∗ s33 46: g[i + n ∗ 0] ← g[i + n ∗ 0] − fij ∗ t3 16: SS[i ∗ 6 + 5] ← s13 ∗ s13 + s23 ∗ s23 + s33 ∗ s33 47: g[i + n ∗ 1] ← g[i + n ∗ 1] − fij ∗ t4 17: t0 ← s11 ∗ s22 ∗ s33 + s12 ∗ s23 ∗ s31 + s13 ∗ s21 ∗ s32 48: g[i + n ∗ 2] ← g[i + n ∗ 2] − fij ∗ t5 18: t1 ← s13 ∗ s22 ∗ s31 + s23 ∗ s32 ∗ s11 + s12 ∗ s21 ∗ s33 49: H[i + n ∗ 0] ← H[i + n ∗ 0] + fij ∗ (t3 ∗ t3 − ss11 ) 19: c[i] ← w[i] ∗ c ∗ A BS(t0 − t1 ) 50: H[i + n ∗ 1] ← H[i + n ∗ 1] + fij ∗ (t3 ∗ t4 − ss12 ) 20: return SS, c 21: end procedure 51: H[i + n ∗ 2] ← H[i + n ∗ 2] + fij ∗ (t3 ∗ t5 − ss13 ) 52: H[i + n ∗ 4] ← H[i + n ∗ 4] + fij ∗ (t4 ∗ t4 − ss22 ) 53: H[i + n ∗ 5] ← H[i + n ∗ 5] + fij ∗ (t4 ∗ t5 − ss23 ) 22: procedure L IN C OMB K ERNELS(x, X, S, SS, c) 54: 23: i ← blockDim.x ∗ blockIdx.x + threadIdx.x 55: 24: n ← S IZE(x, 1) 56: f[i] ← fi 25: m ← S IZE(X, 1) 57: H[i + n ∗ 3] ← H[i + n ∗ 1] 26: fi ← 0 58: H[i + n ∗ 6] ← H[i + n ∗ 2] 27: xi0 ← x[i+n∗0], xi1 ← x[i+n∗1], xi2 ← x[i+n∗2] 59: H[i + n ∗ 7] ← H[i + n ∗ 5] 28: for j ← 0, m − 1 do 60: return f, g, H 29: t0 ← xi0 − X[j + m ∗ 0] 30: t1 ← xi1 − X[j + m ∗ 1] H[i + n ∗ 8] ← H[i + n ∗ 8] + fij ∗ (t5 ∗ t5 − ss33 ) end for 61: end procedure 55 CHAPTER 5. ALGORITHMS AND IMPLEMENTATIONS to be enough temporal locality (reuse of data) in the gpu-kernel. Before introducing our optimized implementation using shared memory, let’s take a look at the memory access pattern in Optimization II. Assume m is the number of training points, n is the number of threads (equal to test point number), and r is the number of variables that access data from global memory, then the memory access pattern in Optimization II can be illustrated as Figure 5.4. Here, colored block groups V1 , V2 , . . . , Vr are arrays with m elements each. They all exist in global memory. Gray circles stand for threads. Each step is an iteration of the for loop in Optimization II. As we already knew that each thread will evaluate the kernel density, kernel gradient and kernel curvature for a single test point. Since such evaluation are similar for every thread, they usually read the same data from the global memory in each step. In the figure, we can find that, in the first step, all the threads read the first element in V1 , V2 , . . . , Vr . 
Since there are n threads, it means the same data is read n times, which is obviously inefficient. To analysis this problem quantitatively, we calculate the total number of global memory access of this case, Mg = m × n × r. (5.19) We can solve this problem by introducing shared memory. The new memory access pattern is shown in Figure 5.5. Here, the colored blocks stand for the data in global memory. Each color corresponds to an array in Figure 5.4. The number in the block stands for the index of an element in the array. The reason we choose this form of memory layout instead of the layout in Figure 5.4 is that in this arrangement of memory, the data read by threads are consecutive, which can reduce global memory transactions due to memory coalescing. The threads are divided into colored groups. Each group stands for a thread block mentioned in Section 4.2. Since data in shared memory can only be accessed by threads from the same block, here we draw shared memory separately for each thread block. The idea of using shared memory is that if certain data is used several times, we can store it into shared memory first, then read it from the shared memory directly in later usage. We can see from Figure 5.5 that the threads in the same block read the same data from the global memory only once (threads in different block still have to read the same data repeatedly, because data in shared memory can not be accessed between thread blocks). Then, this data can be accessed by other threads in the same block directly from the shared memory. The total number of global memory accesses is n 1 Mg = m × r × d e × , b c 56 (5.20) CHAPTER 5. ALGORITHMS AND IMPLEMENTATIONS Algorithm 6 Optimized GPU Kernel Density Derivative Estimation III 1: procedure KDDE(x, X, S, w) 2: m, d ← S IZE(X) 3: r ← 1 + d + d ∗ d + (1 + d) ∗ d/2 4: for i ← 0, m − 1 do 5: for j ← 0, d − 1 do 6: y[r ∗ i + j] ← X[m ∗ j + i] 7: for k ← 0, d − 1 do 8: y[r∗i+(j +1)∗d+k] ← S[i∗d∗d+j ∗d+k] 9: end for 10: end for 11: end for 12: y ← S QUARE A ND D ET(y, w) 13: f, g, H ← L IN C OMB G AUSSIAN K ERNELS(x, y) 14: return f, g, H 15: end procedure 16: procedure S QUARE A ND D ET(y, w) 17: i ← (blockDim.x ∗ blockIdx.x + threadIdx.x) ∗ 19 18: s11 ← y[i + 3], s12 ← y[i + 6], s13 ← y[i + 9] 19: s21 ← y[i + 4], s22 ← y[i + 7], s23 ← y[i + 10] 20: s31 ← y[i + 5], s32 ← y[i + 8], s33 ← y[i + 11] 21: t0 ← s11 ∗ s22 ∗ s33 + s12 ∗ s23 ∗ s31 + s13 ∗ s21 ∗ s32 22: t1 ← s13 ∗ s22 ∗ s31 + s23 ∗ s32 ∗ s11 + s12 ∗ s21 ∗ s33 23: y[i + 12] ← w[i/19] ∗ c ∗ A BS(t0 − t1 ) 24: y[i + 13] ← s11 ∗ s11 + s21 ∗ s21 + s31 ∗ s31 25: y[i + 14] ← s11 ∗ s12 + s21 ∗ s22 + s31 ∗ s32 26: y[i + 15] ← s11 ∗ s13 + s21 ∗ s23 + s31 ∗ s33 27: y[i + 16] ← s12 ∗ s12 + s22 ∗ s22 + s32 ∗ s32 28: y[i + 17] ← s12 ∗ s13 + s22 ∗ s23 + s32 ∗ s33 29: y[i + 18] ← s13 ∗ s13 + s23 ∗ s23 + s33 ∗ s33 30: return y 31: end procedure 32: procedure L IN C OMB G AUSSIAN K ERNELS(x, y) 33: i ← blockDim.x ∗ blockIdx.x + threadIdx.x shared t[6144] 34: 35: n ← S IZE(x, 1), m ← S IZE(X, 1) 36: fi ← 0 37: gi0 ← 0, gi1 ← 0, gi2 ← 0 38: H11i ← 0 39: H12i ← 0, H22i ← 0 40: H13i ← 0, H23i ← 0, H33i ← 0 41: xi0 ← x[i+n∗0], xi1 ← x[i+n∗1], xi2 ← x[i+n∗2] 42: for j ← 0, m − 1 do 43: 44: 45: 46: 47: 48: 49: 50: 51: 52: 53: 54: 55: i ← M OD(j, 323) if i == 0 then S YNCTHREADS() t[thread + 0] ← y[j ∗ 19 + thread + 0] t[thread + 1024] ← y[j ∗ 19 + thread + 1024] t[thread + 2048] ← y[j ∗ 19 + thread + 2048] t[thread + 3072] ← y[j ∗ 19 + thread + 3072] t[thread + 4096] ← y[j ∗ 19 + thread + 4096] t[thread + 5120] ← y[j ∗ 19 
+ thread + 5120] S YNCTHREADS() end if i ← i ∗ 19 t0 ← xi0 − t[i + 0], t1 ← xi1 − t[i + 1], t2 ← xi2 − t[i + 2] 56: 57: 58: 59: 60: 61: 62: 63: 64: t3 ← t0 ∗ t[i + 3] + t1 ∗ t[i + 6] + t2 ∗ t[i + 9] t4 ← t0 ∗ t[i + 4] + t1 ∗ t[i + 7] + t2 ∗ t[i + 10] t5 ← t0 ∗ t[i + 5] + t1 ∗ t[i + 8] + t2 ∗ t[i + 11] fij ← t[i + 12] ∗ E XP(t3 ∗ t3 + t4 ∗ t4 + t5 ∗ t5 ) t3 ← t[i + 13] ∗ t0 + t[i + 14] ∗ t1 + t[i + 15] ∗ t2 t4 ← t[i + 14] ∗ t0 + t[i + 16] ∗ t1 + t[i + 17] ∗ t2 t5 ← t[i + 15] ∗ t0 + t[i + 17] ∗ t1 + t[i + 18] ∗ t2 fi ← fi + fij gi0 ← gi0 − fij ∗ t3 , gi1 ← gi1 − fij ∗ t4 , gi2 ← gi2 − fij ∗ t5 65: 66: 67: 68: 69: 70: 71: 72: 73: 74: 75: H11i ← H11i + fij ∗ (t3 ∗ t3 − t[i + 13]) H12i ← H12i + fij ∗ (t3 ∗ t4 − t[i + 14]) H13i ← H13i + fij ∗ (t3 ∗ t5 − t[i + 15]) H22i ← H22i + fij ∗ (t4 ∗ t4 − t[i + 16]) H23i ← H23i + fij ∗ (t4 ∗ t5 − t[i + 17]) H33i ← H33i + fij ∗ (t5 ∗ t5 − t[i + 18]) end for i ← blockDim.x ∗ blockIdx.x + threadIdx.x f[i] ← fi g[i+n∗0] ← gi0 , g[i+n∗1] ← gi1 , g[i+n∗2] ← gi2 H[i+n∗0] ← H11i , H[i+n∗3] ← H12i , H[i+n∗6] ← H13i 76: H[i+n∗1] ← H12i , H[i+n∗4] ← H22i , H[i+n∗7] ← H23i 77: H[i+n∗2] ← H13i , H[i+n∗5] ← H23i , H[i+n∗8] ← H33i 78: return f, g, H 79: end procedure 57 CHAPTER 5. ALGORITHMS AND IMPLEMENTATIONS Global Memory V1 Step 1 0 Thread IDs 1 … m-1 0 1 0 … 0 Thread IDs … m-1 1 0 1 m-1 0 1 0 m-1 n-1 Vr V2 … … … 1 V1 Step 2 Vr V2 … … m-1 0 1 … m-1 … 1 n-1 … V1 Step m 0 Thread IDs 1 V2 … m-1 0 1 0 Vr … … m-1 0 1 … m-1 … 1 n-1 F IGURE 5.4: Memory access pattern without using shared memory. Global Memory 0 Thread IDs 0 Shared Memory 0 Thread IDs 0 0 … 1 1 … … 1 0 1 b-1 b 0 b-1 … 1 b-1 b … … b+1 1 … 1 … b+1 2b-1 … … b-1 … m-1 2b-1 m-1 … k 0 … k+1 1 k F IGURE 5.5: Memory access pattern using shared memory. 58 m-1 k+1 … k+b-1 b-1 … k+b-1 CHAPTER 5. ALGORITHMS AND IMPLEMENTATIONS where b is the block size, c is the memory coalescing factor. We can see that the number of global memory accesses using shared memory is reduced b × c times. The total number of shared memory accesses is Ms = m × n × r × where w is the half warp size. We need the factor 1 w 1 , w (5.21) is because all the threads always read the data at the same location each time. Thus, the shared memory access will be broadcasted to the threads in the same warp. The outline of this optimized implementation is shown in Algorithm 5. 5.3 Efficient k-Nearest Neighbors Bandwidth Selection For Images In Section 2.4.2, we introduced a k-nearest neighbors based bandwidth selection method. The key point of this method is to calculate the covariance matrix of the k-nearest neighbors for each training point X j , j = 1, . . . , n. To compute the covariance matrix at X j , a naive implementation is to find the k-nearest neighbors of X j first and then calculate the covariance matrix using Equation (2.21). However, to find the k-nearest neighbors, one need to calculate the distances between this point and all other training points, and find the k-nearest neighbors points of X j by sorting the resulting distances. It can be easily proven that the time complexity of such k-nearest neighbors search algorithm has O(n2 ) time complexity, where n is the size of training set. Since the size of training set is usually very large, this is clearly computationally intensive. Therefore, to avoid the directly k-nearest neighbors search, we propose a covariance filtering based algorithm in this section. 5.3.1 k-Nearest Neighbors Covariance Matrix of Images Given a set of d-dimensional training points S = {x1 , x2 , . . . 
5.3 Efficient k-Nearest Neighbors Bandwidth Selection For Images

In Section 2.4.2, we introduced a k-nearest neighbors based bandwidth selection method. The key point of this method is to calculate the covariance matrix of the k-nearest neighbors of each training point X_j, j = 1, ..., n. To compute the covariance matrix at X_j, a naive implementation is to find the k-nearest neighbors of X_j first and then calculate the covariance matrix using Equation (2.21). However, to find the k-nearest neighbors, one needs to calculate the distances between this point and all other training points and then sort the resulting distances to find the k nearest ones. It can easily be shown that such a k-nearest neighbors search has O(n²) time complexity, where n is the size of the training set. Since the training set is usually very large, this is clearly computationally intensive. Therefore, to avoid the direct k-nearest neighbors search, we propose a covariance filtering based algorithm in this section.

5.3.1 k-Nearest Neighbors Covariance Matrix of Images

Given a set of d-dimensional training points S = {x_1, x_2, ..., x_n} and an image intensity function
\[
I(x) = \begin{cases} 1, & x \in S, \\ 0, & \text{otherwise}, \end{cases} \tag{5.22}
\]
the k-nearest neighbors covariance matrix at x_i can be written as
\[
C(x_i) = \frac{1}{k} \sum_{j=1}^{k} (x_i - x_{p(i,j)})(x_i - x_{p(i,j)})^T, \quad i = 1, \ldots, n, \tag{5.23}
\]
where the function p(i, j) returns the index of the j-th nearest neighbor of x_i.

Algorithm 7 Naive k-Nearest Neighbors Bandwidth Selection

    procedure ImageBandwidthSelection(I, k, σ)
        for each point x in the image such that I(x) ≠ 0 do
            i ← 0
            for each point y_i in the image such that I(y_i) ≠ 0 and x ≠ y_i do
                d_i ← calculate the distance between x and y_i
                i ← i + 1
            end for
            p(1), p(2), ..., p(k) ← find the indices of the k smallest distances in D = {d_0, d_1, ..., d_{i−1}}
            the covariance matrix C(x) ← 0
            for i ← 1, k do
                C(x) ← C(x) + (1/k)(x − y_{p(i)})(x − y_{p(i)})^T
            end for
            Q(x), Λ(x) ← Eigendecomposition(C(x))
            S(x) ← σ^{−1} Λ(x)^{−1/2} Q(x)^T
        end for
        return S
    end procedure

For a 2D image, the training point x_i = [x_i, y_i]^T, with x_i, y_i ∈ ℤ, i = 1, ..., n, and then
\[
C(x_i) = \frac{1}{k} \sum_{j=1}^{k} \left( \begin{bmatrix} x_i \\ y_i \end{bmatrix} - \begin{bmatrix} x_{p(i,j)} \\ y_{p(i,j)} \end{bmatrix} \right) \left( \begin{bmatrix} x_i \\ y_i \end{bmatrix} - \begin{bmatrix} x_{p(i,j)} \\ y_{p(i,j)} \end{bmatrix} \right)^T
= \begin{bmatrix}
\frac{1}{k}\sum_{j=1}^{k} (x_i - x_{p(i,j)})^2 & \frac{1}{k}\sum_{j=1}^{k} (x_i - x_{p(i,j)})(y_i - y_{p(i,j)}) \\
\frac{1}{k}\sum_{j=1}^{k} (x_i - x_{p(i,j)})(y_i - y_{p(i,j)}) & \frac{1}{k}\sum_{j=1}^{k} (y_i - y_{p(i,j)})^2
\end{bmatrix}. \tag{5.24}
\]
Let
\[
\begin{aligned}
C_{11}(x_i) &= \frac{1}{k} \sum_{j=1}^{k} (x_i - x_{p(i,j)})^2, \\
C_{12}(x_i) &= \frac{1}{k} \sum_{j=1}^{k} (x_i - x_{p(i,j)})(y_i - y_{p(i,j)}), \\
C_{22}(x_i) &= \frac{1}{k} \sum_{j=1}^{k} (y_i - y_{p(i,j)})^2,
\end{aligned} \tag{5.25}
\]
then Equation (5.24) can be written as
\[
C(x_i) = \begin{bmatrix} C_{11}(x_i) & C_{12}(x_i) \\ C_{12}(x_i) & C_{22}(x_i) \end{bmatrix}. \tag{5.26}
\]
Similarly, for a 3D image, the training point x_i = [x_i, y_i, z_i]^T, with x_i, y_i, z_i ∈ ℤ, i = 1, ..., n, and we have
\[
C(x_i) = \begin{bmatrix} C_{11}(x_i) & C_{12}(x_i) & C_{13}(x_i) \\ C_{12}(x_i) & C_{22}(x_i) & C_{23}(x_i) \\ C_{13}(x_i) & C_{23}(x_i) & C_{33}(x_i) \end{bmatrix}, \tag{5.27}
\]
where
\[
\begin{aligned}
C_{13}(x_i) &= \frac{1}{k} \sum_{j=1}^{k} (x_i - x_{p(i,j)})(z_i - z_{p(i,j)}), \\
C_{23}(x_i) &= \frac{1}{k} \sum_{j=1}^{k} (y_i - y_{p(i,j)})(z_i - z_{p(i,j)}), \\
C_{33}(x_i) &= \frac{1}{k} \sum_{j=1}^{k} (z_i - z_{p(i,j)})^2.
\end{aligned} \tag{5.28}
\]

5.3.2 r-Neighborhood Covariance Matrix of Images

Our algorithm is based on the fact that the locations of the pixels are evenly distributed on the image grid. Thus, there is potential for the neighbor search to be done by filtering. In this section, we give a simple problem where the neighbor search can be completed easily by filtering. Consider a training point x_i ∈ S and the set of training points whose distance from x_i is smaller than r, i.e., N_r(x_i) = {x | x ∈ S, x ≠ x_i, ‖x − x_i‖ < r}; we want to calculate the covariance matrix of N_r(x_i) at x_i. Here, N_r(x_i) is called the r-neighborhood of x_i. For a 2D image, according to Equation (5.26) the covariance matrix can be written as
\[
D(x_i) = \begin{bmatrix} D_{11}(x_i) & D_{12}(x_i) \\ D_{12}(x_i) & D_{22}(x_i) \end{bmatrix}, \tag{5.29}
\]
where
\[
\begin{aligned}
D_{11}(x_i) &= |N_r(x_i)|^{-1} \sum_{x \in N_r(x_i)} (x_i - x)^2, \\
D_{12}(x_i) &= |N_r(x_i)|^{-1} \sum_{x \in N_r(x_i)} (x_i - x)(y_i - y), \\
D_{22}(x_i) &= |N_r(x_i)|^{-1} \sum_{x \in N_r(x_i)} (y_i - y)^2,
\end{aligned} \tag{5.30}
\]
and |N_r(x_i)| denotes the number of elements in N_r(x_i). Define the covariance operators h_11, h_12, and h_22 and the disk operator d as follows:
\[
h_{11}(x) = \begin{cases} x^2, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \quad
h_{12}(x) = \begin{cases} xy, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \quad
h_{22}(x) = \begin{cases} y^2, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \quad
d(x) = \begin{cases} 1, & \|x\| < r, \\ 0, & \text{otherwise}. \end{cases} \tag{5.31}
\]
Here, both the covariance and disk operators can be expressed as (2r + 1) × (2r + 1) matrices, as illustrated in Figure 5.6.
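To make Equation (5.31) concrete, the following host-side sketch builds the (2r + 1) × (2r + 1) disk and covariance operator masks for a 2D image. The struct and function names are illustrative only, and the strict test ‖x‖ < r follows the definition above.

    #include <vector>

    // Disk and covariance operators of Equation (5.31) as row-major masks.
    // Entry (row, col) corresponds to the offset (x, y) = (col - r, row - r).
    struct Operators2D {
        int w;                                   // operator width, 2r + 1
        std::vector<double> d, h11, h12, h22;    // (2r+1)*(2r+1) masks
    };

    Operators2D makeOperators2D(int r)
    {
        int w = 2 * r + 1;
        Operators2D op{w, std::vector<double>(w * w, 0.0), std::vector<double>(w * w, 0.0),
                          std::vector<double>(w * w, 0.0), std::vector<double>(w * w, 0.0)};
        for (int row = 0; row < w; ++row) {
            for (int col = 0; col < w; ++col) {
                int x = col - r, y = row - r;
                if (x * x + y * y < r * r) {     // strictly inside the radius, ||x|| < r
                    int idx = row * w + col;
                    op.d[idx]   = 1.0;           // disk operator
                    op.h11[idx] = x * x;         // h11 = x^2
                    op.h12[idx] = x * y;         // h12 = xy
                    op.h22[idx] = y * y;         // h22 = y^2
                }
            }
        }
        return op;
    }

Convolving the indicator image I with these four masks yields the four quantities needed in the next subsection, with the boundary convention at ‖x‖ = r depending on how the strict inequality is implemented.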
It can easily be proved that
\[
\begin{aligned}
I(x_i) * h_{11}(x_i) &= \sum_{x \in N_r(x_i)} (x_i - x)^2, \\
I(x_i) * h_{12}(x_i) &= \sum_{x \in N_r(x_i)} (x_i - x)(y_i - y), \\
I(x_i) * h_{22}(x_i) &= \sum_{x \in N_r(x_i)} (y_i - y)^2, \\
I(x_i) * d(x_i) &= |N_r(x_i)|.
\end{aligned} \tag{5.32}
\]
Thus, the covariance matrix of the neighborhood N_r(x_i) can be calculated by
\[
D(x_i) = \begin{bmatrix} D_{11}(x_i) & D_{12}(x_i) \\ D_{12}(x_i) & D_{22}(x_i) \end{bmatrix}
= \frac{1}{I(x_i) * d(x_i)} \begin{bmatrix} I(x_i) * h_{11}(x_i) & I(x_i) * h_{12}(x_i) \\ I(x_i) * h_{12}(x_i) & I(x_i) * h_{22}(x_i) \end{bmatrix}. \tag{5.33}
\]

Figure 5.6: The covariance and disk operators for r = 4. (a) Disk operator. (b) h_11 covariance operator. (c) h_12 covariance operator. (d) h_22 covariance operator.

Similarly, for a 3D image the covariance matrix of the neighborhood N_r(x_i) is given by
\[
D(x_i) = \begin{bmatrix} D_{11}(x_i) & D_{12}(x_i) & D_{13}(x_i) \\ D_{12}(x_i) & D_{22}(x_i) & D_{23}(x_i) \\ D_{13}(x_i) & D_{23}(x_i) & D_{33}(x_i) \end{bmatrix}
= \frac{1}{I(x_i) * d(x_i)} \begin{bmatrix} I(x_i) * h_{11}(x_i) & I(x_i) * h_{12}(x_i) & I(x_i) * h_{13}(x_i) \\ I(x_i) * h_{12}(x_i) & I(x_i) * h_{22}(x_i) & I(x_i) * h_{23}(x_i) \\ I(x_i) * h_{13}(x_i) & I(x_i) * h_{23}(x_i) & I(x_i) * h_{33}(x_i) \end{bmatrix}, \tag{5.34}
\]
where
\[
h_{13}(x) = \begin{cases} xz, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \quad
h_{23}(x) = \begin{cases} yz, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \quad
h_{33}(x) = \begin{cases} z^2, & \|x\| < r, \\ 0, & \text{otherwise}. \end{cases} \tag{5.35}
\]

5.3.3 Algorithm

Figure 5.7: Searching circles of different radii. Assume k = 6; there are only 2 neighboring training points inside the green searching circle of radius 3. Thus, we increase the searching radius by one and find that the orange searching circle of radius 4 contains 6 neighboring training points. Therefore, if we choose a searching radius r = 4, we have C(x) = D(x).

Based on the discussion in Sections 5.3.1 and 5.3.2, we propose our efficient k-nearest neighbors bandwidth selection algorithm in this section. From the definition of N_r(x_i), we know that for all x ∈ S with x ∉ N_r(x_i) we have ‖x − x_i‖ ≥ r. Thus, all the training points inside N_r(x_i) are closer to x_i than those outside N_r(x_i). The r-neighborhood N_r(x_i) can therefore also be viewed as the |N_r(x_i)|-nearest neighbors of x_i, and the covariance matrix C(x_i) = D(x_i) if and only if k = |N_r(x_i)|. Since D(x_i) can easily be calculated by filtering, then, as long as we can find a proper r such that |N_r(x_i)| = k, we can calculate C(x_i) from D(x_i) efficiently. Therefore, we need to find the correct searching radius r for x_i. Assume that all points within the searching circle (or sphere) are training points, which means πr² = k (or (4/3)πr³ = k); then we can set the initial value of r to ⌈(k/π)^{1/2}⌉ (or ⌈(3k/(4π))^{1/3}⌉).
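This initial radius is straightforward to compute; a small helper (the function name is illustrative) and a worked example are given below.

    #include <cmath>

    // Initial search radius r0 from the assumption that the k nearest neighbors
    // fill a solid disk (2-D) or ball (3-D) of training points around x_i.
    int initialRadius(int k, int dims)
    {
        const double PI = 3.14159265358979323846;
        double r = (dims == 2) ? std::sqrt(k / PI)               // pi * r^2 = k
                               : std::cbrt(3.0 * k / (4.0 * PI)); // (4/3) * pi * r^3 = k
        return static_cast<int>(std::ceil(r));
    }
    // Example: k = 100 gives r0 = ceil(5.64) = 6 in 2-D and r0 = ceil(2.88) = 3 in 3-D.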
According to Equation (5.32), we can calculate |N_r(x_i)| using the disk operator. We then compare the value of |N_r(x_i)| with k. If |N_r(x_i)| < k, we increase r until |N_r(x_i)| ≥ k for all x_i in S. The increasing step of r determines the performance of this algorithm: a small step results in a very accurate approximation of C(x_i), but the speed will be relatively slow. The minimal choice of the increasing step is 1, since the training points are indices of pixels on the image. It should be pointed out that different training points x_i have different searching radii; thus, we need to update each C(x_i) according to its own searching radius r. After we have obtained all the covariance matrices, we can then apply an eigendecomposition to these matrices and calculate the bandwidths accordingly. The outline of this algorithm is shown in Algorithm 8.

5.3.4 GPU Implementation

One advantage of our algorithm is that it can be easily accelerated by GPUs. First, the calculation of the r-neighborhood covariance matrix involves many image convolutions, which can easily be done on GPUs. One way to achieve this is to calculate the convolution directly via Matlab's Parallel Computing Toolbox (PCT). The PCT provides the built-in GPU accelerated functions conv2 and convn for 2D and 3D convolution, respectively. The other way is to perform the convolution through the FFT, based on the convolution theorem [37],
\[
\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}. \tag{5.36}
\]
The time complexity of the FFT is O(n log₂ n) [38], which is much faster than direct convolution's O(n²) when n is large. Many GPU packages are available for high performance FFT implementations, such as PCT, Jacket, cuFFT, and OpenCL FFT. Second, our algorithm needs to perform an eigendecomposition of the covariance matrix of each training point in the image. Since there are a large number of training points and their eigendecompositions are independent, it is a good idea to put these computations on the GPU. Several GPU libraries (CULA, MAGMA, etc.) are available for computing QR decompositions, but they are only efficient and competitive for large matrices, at least 1000 × 1000 [39]. Since the covariance matrices in our case are only 2 × 2 or 3 × 3, we implemented our own GPU-based function to perform millions of small matrix eigendecompositions simultaneously. For more details about the performance of our algorithm, please see Chapter 6.
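As an illustration of this last point, the following CUDA kernel sketches a batched closed-form eigendecomposition for the 2D case and assembles the bandwidth factor S(x) = σ⁻¹ Λ⁻¹ᐟ² Qᵀ. The packed input/output layouts and the kernel name are assumptions made for this sketch, not the library's actual interface; the 3D case would use a closed-form symmetric 3 × 3 solver in the same one-thread-per-matrix pattern.

    #include <cuda_runtime.h>

    // Batched eigendecomposition of 2x2 covariance matrices and assembly of the
    // bandwidth factor S(x) = sigma^-1 * Lambda^-1/2 * Q^T.
    // Input:  cov, n packed matrices (c11, c12, c22), one per training point.
    // Output: S,   n packed 2x2 matrices in row-major order (s11, s12, s21, s22).
    __global__ void bandwidthFromCov2x2(const float* __restrict__ cov, int n,
                                        float sigma, float* __restrict__ S)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float a = cov[3 * i + 0], b = cov[3 * i + 1], c = cov[3 * i + 2];

        // Eigenvalues of the symmetric matrix [[a, b], [b, c]].
        float mean = 0.5f * (a + c);
        float disc = sqrtf(0.25f * (a - c) * (a - c) + b * b);
        float l1 = mean + disc, l2 = mean - disc;          // l1 >= l2 (degenerate l2 == 0 not handled)

        // Unit eigenvector for l1; the second eigenvector is its 90-degree rotation.
        float vx, vy;
        if (fabsf(b) > 1e-12f) { vx = b;    vy = l1 - a; }
        else if (a >= c)       { vx = 1.0f; vy = 0.0f;   }
        else                   { vx = 0.0f; vy = 1.0f;   }
        float inv = rsqrtf(vx * vx + vy * vy);
        vx *= inv; vy *= inv;

        // Row j of S is lambda_j^-1/2 * q_j^T / sigma, with q_j the columns of Q.
        float s1 = rsqrtf(l1) / sigma, s2 = rsqrtf(l2) / sigma;
        S[4 * i + 0] =  s1 * vx;  S[4 * i + 1] = s1 * vy;   // from (l1, q1)
        S[4 * i + 2] = -s2 * vy;  S[4 * i + 3] = s2 * vx;   // from (l2, q2 = (-vy, vx))
    }

Because every matrix is independent, one thread per matrix already exposes millions of threads for a typical image, which is the kind of workload a GPU handles well.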
Algorithm 8 Efficient k-Nearest Neighbors Bandwidth Selection

    procedure ImageBandwidthSelection(I, k, σ)
        if the image I is 2D then
            initialize the filtering radius r ← ⌈(k/π)^{1/2}⌉
        else if the image I is 3D then
            initialize the filtering radius r ← ⌈(3k/(4π))^{1/3}⌉
        end if
        for each point x in the image do
            the number of neighbors N(x) ← 0
            the covariance matrix C(x) ← 0
        end for
        while there exists a point x in the image such that N(x) < k and I(x) ≠ 0 do
            d ← create a disk operator with radius r
            N_r ← filter the image I with the disk operator d
            if the image I is 2D then
                h11, h12, h22 ← create the covariance operators
                C11, C12, C22 ← filter the image I with the covariance operators h11, h12, h22
            else if the image I is 3D then
                h11, h12, h13, h22, h23, h33 ← create the covariance operators
                C11, C12, C13 ← filter the image I with the covariance operators h11, h12, h13
                C22, C23, C33 ← filter the image I with the covariance operators h22, h23, h33
            end if
            for each point x in the image such that N_r(x) ≥ k and C(x) = 0 do
                if the image I is 2D then
                    C(x) ← N_r(x)^{−1} [C11(x) C12(x); C12(x) C22(x)]
                else if the image I is 3D then
                    C(x) ← N_r(x)^{−1} [C11(x) C12(x) C13(x); C12(x) C22(x) C23(x); C13(x) C23(x) C33(x)]
                end if
            end for
            r ← r + 1
        end while
        for each point x such that I(x) ≠ 0 do
            Q(x), Λ(x) ← Eigendecomposition(C(x))
            S(x) ← σ^{−1} Λ(x)^{−1/2} Q(x)^T
        end for
        return S
    end procedure

Chapter 6

Experiments and Results

In this chapter, we present the experimental results for the efficient methods, optimization techniques, and vesselness measures introduced in Chapters 3 and 5. We first introduce the hardware environment of the experiments in Section 6.1. Then, in Section 6.2, we investigate the speed of the efficient methods and optimization techniques used in our kernel smoothing library. Finally, in Section 6.3, we test the overall performance of the kernel smoothing library when applied to two medical imaging techniques.

6.1 Environment

The experiments were performed on two platforms. One platform is a GPU node on Northeastern University's Discovery cluster. This node has an NVIDIA Tesla K20m GPU, dual Intel Xeon E5-2670 CPUs, and 256 GB RAM. The NVIDIA Tesla K20m GPU has 2496 CUDA cores (13 SMs, 192 cores each), a 0.7 GHz clock rate, 5 GB GDDR RAM, and compute capability 3.0; we conducted all the GPU experiments using the CUDA 6.5 Toolkit. The Intel Xeon E5-2670 CPU has a 2.6 GHz clock rate and 8 physical cores. Since each physical core has 2 logical cores and there are 2 Intel Xeon E5-2670 CPUs, this platform has 32 logical cores in total. The other platform is a computer with an Intel Core i7-3615QM CPU, an NVIDIA GeForce GT 650M GPU, and 8 GB RAM. The NVIDIA GeForce GT 650M GPU has a 0.9 GHz clock rate, 384 CUDA cores (2 SMs, 192 cores each), and 1 GB GDDR RAM. The Intel Core i7-3615QM CPU has a 2.3 GHz clock rate and 8 logical cores (4 physical cores, 2 logical cores each). A summary of the specifications of these two platforms is given in Table 6.1.
    Name                  NVIDIA Tesla K20m   NVIDIA GeForce GT 650M   Intel Xeon E5-2670   Intel Core i7-3615QM
    Clock Rate            0.7 GHz             0.9 GHz                  2.60 GHz             2.30 GHz
    GPU/CPU Cores         2496                384                      8                    4
    Device/Host Memory    5120 MB             1024 MB                  256 GB               8 GB

Table 6.1: Experiment environment.

6.2 Performance Evaluation

In this section, we present the experimental results for the algorithms introduced in Chapter 5. We first evaluate the performance of the efficient separable multivariate kernel derivative (SMKD) algorithm in Section 6.2.1 by providing a visualized complexity analysis as well as detailed running times for the naive and the efficient algorithms. In Section 6.2.2, we compare the speed-ups of the different versions of the high-speed KDE and KDDE methods. A memory performance analysis is also provided to illustrate the effectiveness of the different memory optimization techniques used in these methods. Finally, the performance comparison of the efficient k-NN bandwidth selection method on CPU and GPU platforms is given in Section 6.2.3.

6.2.1 Efficient SMKD

We perform all the experiments for the efficient SMKD algorithm on the Intel Xeon E5-2670 CPU platform. For the first set of experiments, we compare the theoretical number of multiplications of the naive and efficient algorithms at a single sample point for different dimensions and derivative orders. This comparison is based on the complexity analysis in Equations (5.9) and (5.10). The experimental results are given in Figure 6.1. As can be seen from the figures, the efficient algorithm outperforms the naive algorithm significantly as the dimension and derivative order increase. For example, the top left panel shows that when the order and data dimension are low, there is only a slight difference in the number of multiplications between the naive and efficient algorithms. However, when the order and data dimension increase, as shown in the bottom right panel, the number of multiplications of the naive algorithm can be several times higher than that of the efficient algorithm.

Figure 6.1: The comparison of the number of multiplications in computing different orders of derivatives of a separable multivariate kernel function using the naive method and the proposed efficient method, with dimensions from 1 to 40 (panels correspond to derivative orders 1 through 8).

Next, we test the performance of the naive and efficient algorithms on synthetic data. The synthetic data are generated based on the univariate Gaussian kernel N(0, 1) and its derivatives. For both algorithms, some basic memory optimization techniques are used to minimize the latency introduced by memory operations.
Thus, our experiments focus mostly on the computational differences between these two algorithms. The experimental results are given in Figure 6.2. We investigate the performance of the two algorithms at different orders and dimensions; here, we choose orders from 1 to 6 and dimensions from 2 to 20. As we can see from the top left panel, when the order is 1 and the dimension is smaller than 20, the proposed algorithm is slightly slower than the naive algorithm. This is because there is a constant computational overhead when determining the number of elements in the set N_i^(j). As the order and dimension increase, the proposed efficient algorithm outperforms the naive algorithm significantly. The growth of the execution times in Figure 6.2 is consistent with the growth of the multiplication counts in Figure 6.1, which confirms that our complexity analysis in Section 5.1.3 is correct.

6.2.2 High Performance KDE and KDDE

We perform our CPU experiments on the Intel Core i7-3615QM platform and our GPU experiments on the NVIDIA Tesla K20m platform. Because the speed of the KDE and KDDE algorithms is insensitive to the data type, our experiments are based on synthetic data: we generate the synthetic training points and test points directly from random number generators. The experiments are divided into three groups. In the first group, we investigate the speed-ups of our different optimization methods on synthetic 2D data. In the second group, we examine the speed of the different optimization methods on synthetic 3D data. In the final group, we measure the GPU device memory performance of these methods. In all these experiments, we use the functions from our kernel smoothing library that perform kernel density estimation, kernel gradient estimation, and kernel curvature estimation at the same time.

For the first group, we perform two sets of experiments. First, as can be seen in the left bar graph of Figure 6.3, we present the speed-up comparison between the CPU serial, CPU parallel, and naive GPU methods. Four experiments have been performed with data sizes ranging from 10⁷ to 10¹⁰. The data size is denoted by m × n, where m is the number of test points and n is the number of training points. Here, we can see that the CPU parallel method is almost three to four times faster than the CPU serial method. This is reasonable, since the Intel Core i7-3615QM contains four CPU cores. We can also see that the naive GPU method is much faster than the CPU methods; in particular, when the data size is 10¹⁰, the naive GPU method is more than 100 times faster than the CPU serial method. However, the naive GPU method does not achieve its best performance when the data size is small. The reason is that the naive GPU method cannot achieve good occupancy when the workload is low.
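One way to check whether a launch configuration can keep the device busy is the occupancy API introduced in CUDA 6.5. The snippet below queries the number of resident blocks per SM for a given block size; the kernel `kddeNaive` and its trivial body are placeholders for illustration only. Together with the total number of blocks actually launched, this indicates whether a small problem can fill all 13 SMs of the Tesla K20m.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for the naive KDDE kernel.
    __global__ void kddeNaive(const float* x, float* f, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) f[i] = x[i];
    }

    // Report how many blocks of kddeNaive can be resident per SM for a given
    // block size, which bounds the theoretically achievable occupancy.
    void reportOccupancy(int blockSize)
    {
        int maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, kddeNaive,
                                                      blockSize, 0 /* dynamic smem */);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
        printf("blocks per SM = %d, theoretical occupancy = %.2f\n", maxBlocksPerSM, occupancy);
    }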
Figure 6.2: The execution time of the naive method and the proposed efficient method in computing different orders of derivatives of the multivariate kernel function on synthetic data. Here the number of samples is 10000, and the data dimension ranges from 2 to 20 (panels correspond to derivative orders 1 through 6).

Second, the right bar graph shows the performance differences between the GPU optimization methods. Since the GPU methods are much faster than the CPU methods, we use larger data sizes in this set of experiments. We can see that the GPU Optimization I method is almost two to three times faster than the naive method, which means the kernel merging, loop unrolling, and memory layout optimization techniques together bring a two to three times speed-up over the naive method. We can also see that, when the data size is large, the Optimization II method brings about a 7 times speed-up over the Optimization I method; the simplification of the Gaussian kernel and the removal of the outer loop account for most of this benefit. The right bar graph also shows that the Optimization III method speeds up the Optimization II method by two times when the data size is large. However, when the data size is small, the Optimization III method performs poorly; it is even slower than the Optimization II method when the data size is 10⁹. This is because we need to rearrange the memory layout for the Optimization III method. The cost of this rearrangement is constant and can be ignored when the total running time is long, but when the data size is small, it cannot be ignored and thus affects the overall performance of the Optimization III method.

Figure 6.3: The comparison of speed-ups between different optimization methods on synthetic 2D data. Left: CPU serial, CPU parallel, and naive GPU methods for data sizes 10⁷ to 10¹⁰. Right: naive GPU and GPU Optimization I–III methods for data sizes 10⁹ to 10¹².

For the second group, we perform experiments similar to those of the first group, except that this time we test the performance on 3D data. The test results are given in Figure 6.4. We can see that, for 3D data, the performance results in the second group are close to the results in the first group. One thing we should notice is that, in the 3D case, the Optimization III method is four to five times faster than the Optimization II method, which is much better than in the 2D case.

Figure 6.4: The comparison of speed-ups between different optimization methods on synthetic 3D data. Left: CPU serial, CPU parallel, and naive GPU methods for data sizes 10⁷ to 10¹⁰. Right: naive GPU and GPU Optimization I–III methods for data sizes 10⁹ to 10¹².
This is because the data structure in the 3D case is more complex than in the 2D case, so more global memory operations are involved when computing on 3D data. Hence, once shared memory is introduced, the 3D KDE and KDDE algorithms benefit more.

    GPU Implementation    Global Memory Transactions
    Naive                 16.4M
    Optimization I        1.96M
    Optimization II       1.56M
    Optimization III      11.7K

Table 6.2: Global memory transactions of the different optimization methods.

For the GPU optimization methods, we perform the third group of experiments. We analyze the number of global memory transactions of each optimization method using the CUDA GPU profiler. The experimental results are shown in Table 6.2. Here, the data size is 10⁶. We can see that, from the naive method to the Optimization I method, the number of global memory transactions is decreased by a factor of 8. Such a big improvement occurs because both the kernel merging and the memory layout optimization techniques aim at reducing global memory transactions. From Optimization I to Optimization II, there is only a slight decrease in global memory transactions. Since the Gaussian kernel simplification only focuses on reducing computation and the outer loop removal only focuses on reducing GPU kernel launch overhead, this result is reasonable. We can also see that the use of shared memory reduces global memory transactions significantly: from Optimization II to Optimization III, the number of global memory transactions is decreased by more than 100 times. However, according to the quantitative global memory analysis given by Equations (5.19) and (5.20), the theoretical difference in global memory transactions should be a factor of b × c. Since, in the experiment, the block size b is 1024 and the memory coalescing factor c is 32, the theoretical decrease should be a factor of 32768, which is much bigger than our experimental result. The reason for this discrepancy is that, in the Optimization II method, the GPU L1 cache already helps reduce global memory transactions.

6.2.3 Efficient k-NN Bandwidth Selector

In this section, we perform our CPU experiments on the Intel Core i7-3615QM platform and our GPU experiments on the NVIDIA Tesla K20m platform. We investigate the performance of the naive and efficient algorithms. The efficient algorithm is implemented on both CPU and GPU, and we call these implementations the CPU efficient algorithm and the GPU efficient algorithm. We divide the experiments into two groups. In the first group, we test the execution times of the naive and efficient k-NN bandwidth selectors on 2D images. In the second group, we run the naive, CPU efficient, and GPU efficient algorithms and test their performance on 3D images.

Figure 6.5: Performance of the k-NN bandwidth selector on 2D images using the naive algorithm and the CPU efficient algorithm, for image sizes from 64 × 64 to 256 × 256.

For the first group, the experimental results are shown in Figure 6.5. Here, we perform our experiments with five different image sizes. When the image size is small, the performance of the naive algorithm and the efficient algorithm is similar. However, the execution time of the naive algorithm increases much faster than that of the efficient algorithm as the image size increases.
When the image size increases to 256 × 256, the efficient algorithm is 6 times faster than the naive algorithm.

Figure 6.6: Performance of the k-NN bandwidth selector on 3D images using the naive algorithm, the CPU efficient algorithm, and the GPU efficient algorithm, for image sizes from 32 × 32 × 32 to 128 × 128 × 128.

For the second group, five different 3D image sizes are used in the experiments. For image sizes of 32 × 32 × 32 and 32 × 32 × 64, we run the naive, CPU efficient, and GPU efficient algorithms. For larger 3D image sizes, the naive algorithm runs out of memory, because it needs to store a huge distance map between points; hence, we only run the CPU and GPU efficient algorithms on the larger 3D images. The experimental results are shown in Figure 6.6. We can see that the GPU efficient algorithm is much faster than both the CPU efficient algorithm and the naive algorithm.

6.3 Vesselness Measure

In this section, we apply our kernel smoothing library to the two vesselness measure algorithms introduced in Section 3.3 and Section 3.4. We compare both the filtering results and the speed of these algorithms when using and not using our kernel smoothing library. All the experiments are conducted on the Discovery cluster platform with the NVIDIA Tesla K20m GPU and the Intel Xeon E5-2670 CPU.

6.3.1 Frangi Filtering

As we discussed in Section 3.3, a Frangi filtering based vesselness measure uses the eigenvalues of the Hessian matrices obtained from an image to analyze the likelihood of a pixel lying on a tubular structure. There are three different ways to calculate the Hessian matrices: the gradient operator, Gaussian smoothing, and KDDE (see Section 3.2). In the original Frangi paper [3], the Hessian matrix is computed through Gaussian smoothing, which is actually identical to the binned estimation method of kernel density estimation theory. However, this method only uses a constrained bandwidth, meaning that the same smoothing is applied along every coordinate direction. Therefore, to get a more accurate Hessian matrix, one can choose a variable and unconstrained bandwidth for the kernel density derivative estimator. In this way, the estimator is allowed to smooth in any direction, whether along a coordinate axis or not.

In this section, we implement the Frangi filter in three different ways. For the first way, we implement the filter exactly as in the original paper: we use the Gaussian smoothing method to calculate the Hessian matrices and perform all the calculations on the CPU only. For the second way, we calculate the Hessian matrices using variable unconstrained bandwidth KDDE; this is still performed on the CPU. The third way is similar to the second, except that we now use our GPU accelerated kernel smoothing library to calculate the KDDE. We test the performance of each implementation. The vesselness measure results are given in Figure 6.7. We can see that the Frangi filtering result using KDDE, in the middle, reveals more details of the retina image than the result using Gaussian smoothing. However, getting such a good vesselness measure result is extremely expensive: the Gaussian smoothing based Frangi filtering takes only 0.38 seconds, whereas calculating the KDDE on the CPU only takes 3433 seconds! Fortunately, this can be accelerated using our GPU accelerated kernel smoothing library, which brings the execution time down to 14.9 seconds. In this case, we spend only a little more time to get a much better vesselness measure result.
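Once the Hessian estimates are available, the vesselness response itself is cheap to evaluate. As a reference for the 2D case, the sketch below computes the standard Frangi measure of [3] per pixel from the Hessian entries. The kernel name, the packed input layout, and the parameters beta and cParam are assumptions made for this illustration; bright tubular structures on a dark background are assumed, and the sign test on λ₂ flips for dark vessels such as those in retina images.

    #include <cuda_runtime.h>

    // Per-pixel 2-D Frangi vesselness from Hessian entries (hxx, hxy, hyy).
    // beta and cParam are the usual tuning constants of the Frangi filter [3].
    __global__ void frangi2D(const float* __restrict__ hxx, const float* __restrict__ hxy,
                             const float* __restrict__ hyy, float* __restrict__ v,
                             int n, float beta, float cParam)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float a = hxx[i], b = hxy[i], c = hyy[i];

        // Eigenvalues of the symmetric Hessian [[a, b], [b, c]].
        float mean = 0.5f * (a + c);
        float disc = sqrtf(0.25f * (a - c) * (a - c) + b * b);
        float e1 = mean + disc, e2 = mean - disc;

        // Order so that |l1| <= |l2|.
        float l1 = (fabsf(e1) <= fabsf(e2)) ? e1 : e2;
        float l2 = (fabsf(e1) <= fabsf(e2)) ? e2 : e1;

        if (l2 >= 0.0f) { v[i] = 0.0f; return; }          // not a bright ridge

        float rb = l1 / l2;                               // blobness measure
        float s  = sqrtf(l1 * l1 + l2 * l2);              // second-order structureness
        v[i] = expf(-rb * rb / (2.0f * beta * beta))
             * (1.0f - expf(-s * s / (2.0f * cParam * cParam)));
    }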
6.3.2 Ridgeness Filtering

We implement a ridgeness filtering based vessel segmentation algorithm according to the pipeline in Figure 6.8. For a given image, we first perform preprocessing algorithms, such as anisotropic diffusion, unsharp masking, and adaptive histogram equalization, to suppress background noise and highlight tubular structures. Then, for the preprocessed image, we use the k-NN bandwidth selector to calculate the bandwidth of each training point (the nonzero points). Based on the bandwidths and the preprocessed image, we calculate the kernel density estimates f, the kernel gradient estimates g, and the kernel curvature estimates H. For each kernel curvature estimate H, we use the lambda selector to calculate its largest absolute eigenvalue |λ|_max. We compute the ridgeness scores s with the ridgeness filter from the kernel gradient and kernel curvature estimates. Finally, based on the values of |λ|_max, s, and f, the classifier makes a combined decision on the vesselness of each pixel. Before outputting the segmented image, a postprocessing procedure is used to refine the results from the classifier.

The vessel segmentation results are shown in Figure 6.9. We compare the results of this algorithm when using and not using our kernel smoothing library. As can be seen from the figure, the results are exactly the same, which means our kernel smoothing library does not introduce any inaccuracies when accelerating the algorithm. The total execution time of the GPU accelerated implementation is 12.56 seconds, while the total execution time of the non-GPU implementation is 938.7293 seconds. This shows that we achieve a 75 times speed-up when using our kernel smoothing library for the ridgeness filtering based vessel segmentation.

Figure 6.7: Vesselness measure results using the Frangi filter. Top: Original image. Middle: Frangi filtering result using KDDE. Bottom: Frangi filtering result using Gaussian smoothing.

Figure 6.8: Algorithm pipeline of the ridgeness filtering based vessel segmentation (Image → Preprocessing → k-NN Bandwidth Selector → KDDE → Lambda Selector → Ridgeness Filtering → Classifier → Postprocessing → Output).

Figure 6.9: Vesselness measure results using the ridgeness filter. Top: Original image. Middle: Ridgeness filtering result with GPU. Bottom: Ridgeness filtering result without GPU.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

We started our discussion with a background introduction to kernel smoothing theory in Chapter 1. Then, in Chapters 2, 3, and 4, we provided the essential background knowledge for the discussion in Chapters 5 and 6. In Chapter 2, Sections 2.2 and 2.5 gave detailed knowledge about the kernel density and kernel density derivative estimation needed for implementing the high performance functions in Section 5.2. Section 2.3 introduced the separable multivariate kernels, which provided the foundation for our discussion in Section 5.1. The discussion of k-nearest neighbors bandwidth selection in Section 2.4 helped the understanding of the efficient method in Section 5.3. In Chapter 3, we introduced two vesselness measure algorithms in Sections 3.3 and 3.4.
We used these two algorithms to demonstrate the full potential of our kernel smoothing library in applications. Sections 3.1 and 3.2 provided the background knowledge for these two algorithms, and Section 6.3 evaluated their performance when using the kernel smoothing library. In Chapter 4, we gave a detailed introduction to the GPU architecture and the CUDA programming framework, which helped in understanding the optimization techniques used in Section 5.2.

Based on the background knowledge introduced in the previous chapters, we presented the three major contributions of our kernel smoothing library in Chapter 5. First, we proposed an efficient method to calculate the separable multivariate kernel derivative. Second, we implemented the kernel density and kernel density derivative estimators using several optimization techniques on multi-core CPU and GPU platforms. Third, we designed an efficient k-nearest neighbors bandwidth selection algorithm for image processing and provided a GPU implementation for this algorithm as well. In Chapter 6, we designed a series of experiments to evaluate the performance of the algorithms and implementations presented in Chapter 5. The experiments show that the presented algorithms and implementations achieve significant speed-ups over their direct or naive counterparts. A performance evaluation of our kernel smoothing library on two vesselness measure algorithms was provided as well.

7.2 Future Work

There are several places where we can improve in the future. First, in the current version, our kernel smoothing library only implements GPU accelerated KDE and KDDE functions for 2D and 3D data; in the future, we can add a GPU implementation for higher dimensional data. Second, bandwidth selection methods are crucial in kernel smoothing, but we have only implemented one bandwidth selection method in our library, so we should add more. Since some bandwidth selection methods are also computationally intensive, there is potential to implement them on the GPU as well. Finally, object-oriented programming could be adopted in our library.

Bibliography

[1] M. Rosenblatt et al., “Remarks on some nonparametric estimates of a density function,” The Annals of Mathematical Statistics, vol. 27, no. 3, pp. 832–837, 1956.

[2] E. Parzen, “On estimation of a probability density function and mode,” The Annals of Mathematical Statistics, pp. 1065–1076, 1962.

[3] A. F. Frangi, W. J. Niessen, K. L. Vincken, and M. A. Viergever, “Multiscale vessel enhancement filtering,” in Medical Image Computing and Computer-Assisted Intervention—MICCAI’98. Springer, 1998, pp. 130–137.

[4] B. Silverman, “Algorithm AS 176: Kernel density estimation using the fast Fourier transform,” Applied Statistics, pp. 93–99, 1982.

[5] M. Wand, “Fast computation of multivariate kernel estimators,” Journal of Computational and Graphical Statistics, vol. 3, no. 4, pp. 433–445, 1994.

[6] A. Elgammal, R. Duraiswami, and L. S. Davis, “Efficient kernel density estimation using the fast Gauss transform with applications to color modeling and tracking,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 11, pp. 1499–1504, 2003.

[7] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis, “Improved fast Gauss transform and efficient kernel density estimation,” in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003, pp. 664–671.
[8] A. Sinha and S. Gupta, “Fast estimation of nonparametric kernel density through PDDP, and its application in texture synthesis,” in BCS Int. Acad. Conf., 2008, pp. 225–236.

[9] J. M. Phillips, “ε-samples for kernels,” in Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2013, pp. 1622–1632.

[10] Y. Zheng, J. Jestes, J. M. Phillips, and F. Li, “Quality and efficiency for kernel density estimates in large data,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013, pp. 433–444.

[11] S. Łukasik, “Parallel computing of kernel density estimates with MPI,” in Computational Science–ICCS 2007. Springer, 2007, pp. 726–733.

[12] J. Racine, “Parallel distributed kernel estimation,” Computational Statistics & Data Analysis, vol. 40, no. 2, pp. 293–302, 2002.

[13] P. D. Michailidis and K. G. Margaritis, “Parallel computing of kernel density estimation with different multi-core programming models,” in Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on. IEEE, 2013, pp. 77–85.

[14] ——, “Accelerating kernel density estimation on the GPU using the CUDA framework,” Applied Mathematical Sciences, vol. 7, no. 30, pp. 1447–1476, 2013.

[15] W. Andrzejewski, A. Gramacki, and J. Gramacki, “Graphics processing units in acceleration of bandwidth selection for kernel density estimation,” International Journal of Applied Mathematics and Computer Science, vol. 23, no. 4, pp. 869–885, 2013.

[16] T. Duong et al., “ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R,” Journal of Statistical Software, vol. 21, no. 7, pp. 1–16, 2007.

[17] M. Wand and B. Ripley, “KernSmooth: Functions for kernel smoothing for Wand & Jones (1995),” R package version 2.22-19, 2006.

[18] T. Hayfield and J. S. Racine, “Nonparametric econometrics: The np package,” Journal of Statistical Software, vol. 27, no. 5, pp. 1–32, 2008.

[19] A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University Press, 1997.

[20] V. A. Epanechnikov, “Non-parametric estimation of a multivariate probability density,” Theory of Probability & Its Applications, vol. 14, no. 1, pp. 153–158, 1969.

[21] M. Shaker, J. N. Myhre, and D. Erdogmus, “Computationally efficient exact calculation of kernel density derivatives,” Journal of Signal Processing Systems, pp. 1–12, 2014.

[22] B. W. Silverman, Density Estimation for Statistics and Data Analysis. CRC Press, 1986, vol. 26.

[23] T. Duong, Bandwidth Selectors for Multivariate Kernel Density Estimation. University of Western Australia, 2004.

[24] G. R. Terrell and D. W. Scott, “Variable kernel density estimation,” The Annals of Statistics, pp. 1236–1265, 1992.

[25] M. Jones, “Variable kernel density estimates and variable kernel density estimates,” Australian Journal of Statistics, vol. 32, no. 3, pp. 361–371, 1990.

[26] I. S. Abramson, “On bandwidth variation in kernel estimates—a square root law,” The Annals of Statistics, pp. 1217–1223, 1982.

[27] L. Breiman, W. Meisel, and E. Purcell, “Variable kernel estimates of multivariate densities,” Technometrics, vol. 19, no. 2, pp. 135–144, 1977.

[28] J. E. Chacón, T. Duong, and M. Wand, “Asymptotics for general multivariate kernel density derivative estimators,” 2009.
[29] J. E. Chacón, T. Duong et al., “Data-driven density derivative estimation, with applications to nonparametric clustering and bump hunting,” Electronic Journal of Statistics, vol. 7, pp. 499–532, 2013.

[30] J. R. Magnus and H. Neudecker, “Matrix differential calculus with applications in statistics and econometrics,” 1995.

[31] H. V. Henderson and S. Searle, “Vec and vech operators for matrices, with some uses in Jacobians and multivariate statistics,” Canadian Journal of Statistics, vol. 7, no. 1, pp. 65–81, 1979.

[32] T. Duong, A. Cowling, I. Koch, and M. Wand, “Feature significance for multivariate kernel density estimation,” Computational Statistics & Data Analysis, vol. 52, no. 9, pp. 4225–4242, 2008.

[33] T. M. Apostol, “Mathematical analysis,” 1974.

[34] C. E. Shannon, “Communication in the presence of noise,” Proceedings of the IRE, vol. 37, no. 1, pp. 10–21, 1949.

[35] U. Ozertem and D. Erdogmus, “Locally defined principal curves and surfaces,” The Journal of Machine Learning Research, vol. 12, pp. 1249–1286, 2011.

[36] E. Bas and D. Erdogmus, “Principal curves as skeletons of tubular objects,” Neuroinformatics, vol. 9, no. 2–3, pp. 181–191, 2011.

[37] Y. Katznelson, An Introduction to Harmonic Analysis. Cambridge University Press, 2004.

[38] V. Y. Pan, “The trade-off between the additive complexity and the asynchronicity of linear and bilinear algorithms,” Information Processing Letters, vol. 22, no. 1, pp. 11–14, 1986.

[39] R. Solcà, T. C. Schulthess, A. Haidar, S. Tomov, I. Yamazaki, and J. Dongarra, “A hybrid Hermitian general eigenvalue solver,” arXiv preprint arXiv:1207.1773, 2012.