High Performance Kernel Smoothing Library For Biomedical Imaging
A Thesis Presented
by
Haofu Liao
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Master of Science
in
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
May 2015
NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Signature Page

Thesis Title: High Performance Kernel Smoothing Library For Biomedical Imaging
Author: Haofu Liao
Department: Electrical and Computer Engineering
NUID: 001988944

Approved for Thesis Requirements of the Master of Science Degree

Thesis Advisor: Dr. Deniz Erdogmus
Thesis Committee Member or Reader: Dr. David R. Kaeli
Thesis Committee Member or Reader: Dr. Gunar Schirner
Thesis Committee Member or Reader: Dr. Rafael Ubal
Department Chair: Dr. Sheila S. Hemami
Associate Dean of Graduate School: Dr. Sara Wadia-Fascetti
Contents

List of Figures
List of Tables
Abstract of the Thesis

1 Introduction
  1.1 Background
  1.2 Related Work
  1.3 Contributions
  1.4 Outline of the Thesis

2 Kernel Smoothing
  2.1 Univariate Kernel Density Estimation
  2.2 Multivariate Kernel Density Estimation
  2.3 Kernel Functions
    2.3.1 Univariate Kernels
    2.3.2 Separable Multivariate Kernels
  2.4 Bandwidth
    2.4.1 Types of Bandwidth
    2.4.2 Variable Bandwidth
  2.5 Kernel Density Derivative Estimation

3 Vesselness Measure
  3.1 Gradients and Hessian Matrices of Images
    3.1.1 Gradient
    3.1.2 Hessian
  3.2 Finding 1st and 2nd Order Derivatives From Images
    3.2.1 Gradient Operator
    3.2.2 Gaussian Smoothing
    3.2.3 Kernel Density Derivative Estimation
  3.3 Frangi Filtering
  3.4 Ridgeness Filtering

4 GPU Architecture and Programming Model
  4.1 GPU Architecture
  4.2 Programming Model
  4.3 Thread Execution Model
  4.4 Memory Accesses

5 Algorithms and Implementations
  5.1 Efficient Computation of Separable Multivariate Kernel Derivative
    5.1.1 Definitions and Facts
    5.1.2 Algorithm
    5.1.3 Complexity Analysis
  5.2 High Performance Kernel Density and Kernel Density Derivative Estimators
    5.2.1 Multi-core CPU Implementation
    5.2.2 GPU Implementation in CUDA
  5.3 Efficient k-Nearest Neighbors Bandwidth Selection For Images
    5.3.1 k-Nearest Neighbors Covariance Matrix of Images
    5.3.2 r-Neighborhood Covariance Matrix of Images
    5.3.3 Algorithm
    5.3.4 GPU Implementation

6 Experiments and Results
  6.1 Environment
  6.2 Performance Evaluation
    6.2.1 Efficient SMKD
    6.2.2 High Performance KDE and KDDE
    6.2.3 Efficient k-NN Bandwidth Selector
  6.3 Vesselness Measure
    6.3.1 Frangi Filtering
    6.3.2 Ridgeness Filtering

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work

Bibliography
List of Figures

2.1 The relation between under-five mortality rate and life expectancy at birth
2.2 Univariate kernel density estimate
2.3 Multivariate kernel density estimate
2.4 Truncated Gaussian kernel function
2.5 Univariate kernel density estimates of different bandwidths
2.6 Comparison of three bandwidth matrix parametrization classes
2.7 Univariate sample point kernel density estimate
3.1 Gradient of the standard Gaussian function
3.2 Image gradient
3.3 Visualized eigenvalues with ellipsoid
3.4 Derivatives of Gaussian filters
3.5 Vesselness measure using Frangi filter
4.1 GPU block diagram
4.2 GPU hardware memory hierarchy
4.3 Programming model
4.4 GPU software memory hierarchy
4.5 Warp scheduler
4.6 Aligned and consecutive memory access
4.7 Misaligned memory access
5.1 Relation between nodes in graph G
5.2 Graph based efficient multivariate kernel derivative algorithm
5.3 Memory access patterns of matrices and cubes
5.4 Memory access pattern without using shared memory
5.5 Memory access pattern using shared memory
5.6 The covariance and disk operators of r = 4
5.7 Searching circles of different radii
6.1 Multiplication number comparison between the naive method and the proposed efficient method
6.2 Execution time comparison between the naive method and the proposed efficient method
6.3 The comparison of speed-ups between different optimization methods on synthetic 2D data
6.4 The comparison of speed-ups between different optimization methods on synthetic 3D data
6.5 Performance of the k-NN bandwidth selector on 2D images using naive algorithm and CPU efficient algorithm
6.6 Performance of the k-NN bandwidth selector on 3D images using naive algorithm, CPU efficient algorithm and GPU efficient algorithm
6.7 Vesselness measure results using Frangi filter
6.8 Algorithm pipeline of the ridgeness filtering based vessel segmentation
6.9 Vesselness measure results using ridgeness filter
List of Tables

3.1 Possible orientation patterns in 2D and 3D images
4.1 Compute capability of Fermi and Kepler GPUs
6.1 Experiment environment
6.2 Global memory transactions between different optimization methods
Abstract of the Thesis
High Performance Kernel Smoothing Library For Biomedical Imaging
by
Haofu Liao
Master of Science in Electrical and Computer Engineering
Northeastern University, May 2015
Dr. Deniz Erdogmus, Adviser
The estimation of probability density and probability density derivatives has broad potential for
applications. In biomedical imaging, the estimation of the first and second derivatives of the density
is crucial for extracting tubular structures, such as blood vessels and neuron traces. Probability density
and probability density derivatives are often estimated using nonparametric, data-driven methods.
Among the most popular nonparametric methods are Kernel Density Estimation (KDE) and Kernel
Density Derivative Estimation (KDDE). However, a very serious drawback of using KDE and KDDE
is their intensive computational requirements, especially for large data sets. In this thesis, we develop
a high performance kernel smoothing library to accelerate KDE and KDDE methods. A series of
hardware optimizations are used to deliver high performance code. On the host side, multi-core
platforms and parallel programming frameworks are used to accelerate the execution of the library.
For 2- or 3-dimensional data points, the Graphics Processing Unit (GPU) platform is used to provide
high levels of performance for the kernel density estimators, kernel gradient estimators, and kernel
curvature estimators. Several Compute Unified Device Architecture (CUDA) based techniques
are used to optimize their performance. In addition, a graph-based algorithm is designed to
calculate the derivatives efficiently, and a fast k-nearest neighbor bandwidth selector is designed to
speed up variable bandwidth selection for image data on the GPU.
Chapter 1
Introduction
1.1 Background
Density estimation constructs an estimate of the underlying probability density function from an observed data set. There are three types of approaches to density estimation: parametric, semi-parametric,
and nonparametric. Both parametric and semi-parametric techniques require prior knowledge of
the underlying distribution of the sample data. In parametric approaches, the data should come from a
known family of distributions. In semi-parametric approaches, knowledge of the mixture distribution is assumed.
In contrast, nonparametric methods, which attempt to flexibly estimate an unknown
distribution, require less structural information about the underlying distribution. This advantage
makes them a good choice for robust and more accurate analysis.
Kernel density estimation (KDE) is the most widely studied and used nonparametric technique. It
was first introduced by Rosenblatt [1] and then discussed in detail by Parzen [2]. Typically, a kernel
density estimate is constructed as a sum of kernel functions centered at the observed data points, and a
smoothing parameter called the bandwidth is used to control the smoothness of the estimated densities.
KDE has a broad range of applications, such as image processing, medical monitoring, and market
analysis.
On the other hand, the estimation of density derivatives, though it has received relatively scant
attention, also has great potential for applications. Indeed, nonparametric estimation of higher order
derivatives of density functions can provide a great deal of important information about a multivariate
data set, such as local extrema, valleys, ridges, or saddle points. In the gradient estimation case, the
well-known mean-shift algorithm can be used for clustering and data filtering. It is very popular in
the areas of low-level vision problems, discontinuity-preserving smoothing, and image segmentation.
Another use of gradient estimation is to find filaments in point clouds, which has applications in
medical imaging, remote sensing, seismology, and cosmology. In the Hessian estimation case, the
eigenvalues of the Hessian matrix are crucial for manifold extraction and curvilinear structure analysis.
Moreover, the prevalent Frangi filter [3] and its variants also require the calculation of the Hessian
matrix.
The smoothing parameter, or bandwidth, plays a very important role in KDE and kernel density
derivative estimation: it determines the performance of the estimator in practice. However, in most
cases only a constrained bandwidth is used. In the unconstrained case, the
bandwidth is a symmetric positive definite matrix; it allows the kernel estimator to smooth in any
direction, whether along a coordinate axis or not. In the simplest case, the bandwidth matrix is only a positive scalar
multiple of the identity matrix. There are three reasons for the widespread use of simpler parameterizations
over the unconstrained counterpart. First, in practice they require fewer smoothing parameters to be
tuned. Second, the mathematical analysis of estimators with an unconstrained bandwidth is more
difficult. Third, an unconstrained bandwidth is not suitable for most existing fast
KDE algorithms.
1.2 Related Work
Around the 1980s, KDE became the de facto nonparametric method for representing a continuous distribution from a discrete point set. However, a very serious drawback of KDE methods is the expensive
computation required to evaluate the probability at each target data vector. A typical KDE
method has computational order O(n^2 k), where n is the number of observations and k is
the number of variables. In many cases, such as database management and wildlife ecology, n
can be as large as hundreds of millions. Moreover, data-driven methods of bandwidth
selection can add a further order of computational burden to KDE.
Currently, there are two different approaches to satisfying the computational demands of KDE. The
first is to use approximate techniques to reduce the computational burden of kernel estimation. In
1982, Silverman [4] proposed a fast density estimation method based on the Fast Fourier Transform
(FFT). However, this method requires the source points to be distributed on an evenly spaced grid,
and it can only compute univariate kernels. In 1994, Wand [5] extended Silverman's method to the
multivariate case and proposed the well-known binned estimation method, but it still requires a binned
data set. Another approach was proposed by Elgammal [6], who designed a Fast Gauss Transform
(FGT) method in which the data are not necessarily on a grid. But the problem is that the complexity of
computation and storage of the FGT grows exponentially with dimension. Therefore, Changjiang
Yang et al. [7] proposed an Improved Fast Gauss Transform (IFGT), which can efficiently evaluate
sums of Gaussians in higher dimensions. But both algorithms are limited to the Gaussian
kernel only. Moreover, Sinha and Gupta [8] proposed a new fast KDE algorithm based on PDDP, which
they claim is more accurate and efficient than IFGT. Recently, an ε-sample
algorithm was proposed by Phillips [9]. His algorithm studies the worst-case error of kernel density
estimates via subset approximation, which can be helpful for sampling large data sets and hence can
lead to a fast kernel density estimate.
The second approach is to use parallel computing. Some of the most important parallel computing
technologies are cluster computing, multicore computing, and general-purpose computing on
graphics processing units (GPGPU). For cluster computing, Zheng et al. [10] implemented
kernel density estimation on Hadoop cluster machines using MapReduce as the distributed and
parallel programming framework. Łukasik [11] and Racine [12] presented parallel methods based
on the Message Passing Interface (MPI) standard in multicomputer environments. For multicore computing,
Michailidis and Margaritis [13] parallelized kernel estimation methods on multi-core platforms using
different programming frameworks such as Pthreads, OpenMP, Intel Cilk++, Intel TBB, SWARM,
and FastFlow. The same authors also presented some preliminary work on kernel density estimation
using a GPU approach [14]. Recently, Andrzejewski et al. [15] proposed a GPU-based algorithm to
accelerate the bandwidth selection methods of kernel density estimators. However, all of these authors
ignore the more complicated unconstrained bandwidth case for multivariate kernel density estimation,
and kernel density derivative estimation is not considered either.
1.3 Contributions
We have developed a highly efficient and flexible kernel smoothing library. The library supports both
univariate and multivariate kernels. Unlike other existing kernel smoothing libraries [16, 17, 18, 19],
it supports not only the constrained (restricted) bandwidth, but also the more general unconstrained
bandwidth. The bandwidth, whether constrained or unconstrained, is not limited to being fixed: a
sample-point based variable bandwidth is supported as well. The input data has no dimensional
limitation; as long as the hardware permits, the library can support data of any
dimension. To improve computational efficiency, kernel functions with finite support can be used,
in which case only the data points within the kernel function's support are evaluated.
Besides the kernel density estimators, the kernel density derivative estimators are implemented as
well. For separable kernel functions, this library is able to calculate the derivatives of any order. A
graph based algorithm is designed to calculate the derivatives efficiently.
A series of hardware optimizations are used to deliver a high performance code. On the host side,
multi-core platforms and parallel programming frameworks are used to accelerate the execution of
the library. For 2 or 3-dimensional data points, the GPU platform is used for speeding up the kernel
density estimators, kernel gradient estimators as well as the kernel curvature estimators. Several
CUDA based algorithms are designed to optimize their performance.
Finally, an efficient k-nearest neighbor based variable bandwidth selector is designed for image
data and a high-performance CUDA algorithm is implemented for this selector.
1.4 Outline of the Thesis
This thesis is organized as follows.
In Chapter 2, we discuss the background of kernel smoothing theory:
we introduce both the univariate and multivariate KDE methods, provide the direct calculation
of separable multivariate kernels and kernel derivatives, present a variety of bandwidth types,
and give the formal definition of KDDE methods. In Chapter 3, we first introduce the gradients
and Hessian matrices of images, then discuss three ways of finding first and second order derivatives
from images, and finally present two vesselness measure algorithms that use the gradients and Hessian
matrices of images. Chapter 4 gives a detailed introduction to the GPU architecture and the CUDA
programming framework. We present the three major contributions of this thesis in Chapter 5 and discuss
their performance in Chapter 6. Finally, conclusions and future work are given in Chapter 7.
Chapter 2
Kernel Smoothing
Given data $X_1, X_2, \ldots, X_n$ drawn from a density $f$, how do we estimate the probability density
function $f$ from these observations?
Figure 2.1: The relation between under-five mortality rate (per 1000 live births) and life expectancy at birth in different countries
and regions. The original data is from the Department of Economic and Social Affairs, United Nations.
2.1 Univariate Kernel Density Estimation
Given a set of n independent and identically distributed (i.i.d.) random samples $X_1, X_2, \ldots, X_n$
from a common density $f$, the univariate kernel density estimator is

$\hat{f}(x; h) = n^{-1} \sum_{i=1}^{n} h^{-1} K(h^{-1}(x - X_i)).$   (2.1)

Here $K$ is a kernel function which satisfies $\int K(x)\,dx = 1$, and $h > 0$ is a smoothing parameter
called the bandwidth. By introducing the rescaling notation $K_h(u) = h^{-1} K(h^{-1} u)$, the above
formula can be written in a more compact way:

$\hat{f}(x; h) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i).$   (2.2)
As we can see from Equation (2.2), the kernel density estimate is a summation of scaled kernel
functions, each carrying a probability mass of $n^{-1}$. Intuitively, we can view this as a sum of
'bumps' placed at the observation points $X_1, X_2, \ldots, X_n$. The kernel function $K$ determines the
shape of the bumps while the bandwidth $h$ determines their width.
An illustration is given in Figure 2.2, where the observations $X_i$ are marked as dots on the x-axis and
their corresponding scaled kernel 'bumps' $n^{-1} K_h(x - X_i)$ are shown as dotted lines. Here, the
kernel $K$ is chosen to be the standard normal pdf $N(0, 1)$. The resulting univariate kernel density
estimate $\hat{f}$ is given by the solid line. We can see that the estimate is bimodal, which reflects
the distribution of the observations. Usually, it is not appropriate to construct a density estimate from
such a small number of samples, but a sample size of 5 has been chosen here for the sake of clarity.
As illustrated in Figure 2.2, the value of the kernel estimate at a point $x$ is simply the average of
the $n$ kernel ordinates at that point. The estimate combines contributions from each data point. Hence,
in regions where there are many observations, the estimate takes relatively large values. This is
consistent with the fact that a densely populated region has a high probability density, and
vice versa. Notice that in this case the scaled kernel $K_h$ is simply the $N(0, h^2)$ density, and
the bandwidth parameter $h$ can be seen as a scaling factor which determines the spread of the kernel.
In general, the bandwidth controls the amount of smoothness of kernel density estimators; it is the
most important factor in KDE and KDDE. We will cover more details of bandwidths in Section 2.4.
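As a concrete illustration of Equation (2.2), the following minimal Python/NumPy sketch evaluates a univariate Gaussian-kernel density estimate on a grid of test points. The function and variable names are illustrative only and are not part of the library described in this thesis.

import numpy as np

def univariate_kde(x_grid, samples, h):
    """Evaluate Equation (2.2): f_hat(x; h) = n^-1 * sum_i K_h(x - X_i),
    using the standard normal kernel K = phi."""
    n = len(samples)
    # Pairwise scaled differences u = (x - X_i) / h, shape (len(x_grid), n).
    u = (x_grid[:, None] - samples[None, :]) / h
    # Scaled kernel K_h(x - X_i) = phi(u) / h.
    kh = np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h)
    return kh.sum(axis=1) / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.concatenate([rng.normal(-1.5, 0.5, 100), rng.normal(1.0, 0.8, 100)])
    xs = np.linspace(-4.0, 4.0, 201)
    fhat = univariate_kde(xs, X, h=0.3)
    print(fhat.sum() * (xs[1] - xs[0]))   # numerical integral, close to 1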
Figure 2.2: Univariate kernel density estimate: dots on the x-axis - sample (training) points; solid line - kernel
density estimate; dashed lines - scaled kernels at different sample points.
2.2 Multivariate Kernel Density Estimation
The d-dimensional multivariate kernel density estimator, for a set of n i.i.d. samples $X_1, X_2, \ldots, X_n$
from a common density $f$, is

$\hat{f}(x; H) = n^{-1} \sum_{i=1}^{n} K_H(x - X_i),$   (2.3)

where

• $x = (x_1, x_2, \ldots, x_d)^T$ and $X_i = (X_{i1}, X_{i2}, \ldots, X_{id})^T$, $i = 1, 2, \ldots, n$;
• $K$ is the unscaled kernel, which is usually a spherically symmetric probability density function;
• $K_H$ is the scaled kernel, related to the unscaled kernel by $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$;
• $H$ is the $d \times d$ bandwidth matrix, which is non-random, symmetric, and positive definite.
As in the univariate case, the multivariate kernel density estimate is calculated by placing
a scaled kernel of mass $n^{-1}$ at each data point and then aggregating to form the density estimate.
Figure 2.3 illustrates a multivariate kernel density estimate in two dimensions. The left-hand figure
shows observations (marked as dots) from the density $f$ (denoted by the isolines). On the right is the
estimate $\hat{f}$. Since the ground truth $f$ is actually a linear combination of five bivariate normal density
functions, we can see from the right-hand plot that $\hat{f}$ gives a good estimate of this function.
Figure 2.3: Multivariate kernel density estimate. Left: the contours denote the density function $f$, and the
dots are the sample/training points drawn from $f$. Right: the estimate $\hat{f}$ calculated from the dots in the left
figure.
Defining $S = H^{-1/2}$ and evaluating $\hat{f}$ at some points of interest $x_1, x_2, \ldots, x_m$, Equation (2.3) can
be rewritten as

$\hat{f}(x_i; S) = n^{-1} \sum_{j=1}^{n} K_S(x_i - X_j), \quad i = 1, 2, \ldots, m,$   (2.4)

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$, $X_j = (X_{j1}, X_{j2}, \ldots, X_{jd})^T$, and $K_S(x) = |S| K(Sx)$. Here $x_i$ is
called a test point, $X_j$ is called a training point, and $S$ is called the scale. The scale and bandwidth
are related by $H^{-1} = S^T S$. Equation (2.4) provides a more direct form when considering its
implementation and complexity: instead of a continuous function $\hat{f}(x)$, the discrete form $\hat{f}(x_i)$ is
more intuitive for a software implementation, and the scale $S$ reduces the complexity by avoiding the
calculation of the inverse square root of the bandwidth $H$. In the subsequent discussions, we will
mostly use this form for the formulas and equations. Since there are $m$ test points and, for each test
point, $n$ scaled kernel function evaluations at the $d$-dimensional training points, the complexity of
Equation (2.4) is $O(mnd)$.
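The discrete form of Equation (2.4) maps directly to code. Below is a minimal NumPy sketch of the O(mnd) direct evaluation with a scale matrix S and the standard d-variate normal kernel; the names are illustrative and the loop structure is chosen for clarity rather than performance (the optimized implementations are the subject of Chapter 5).

import numpy as np

def multivariate_kde(test_points, train_points, S):
    """Evaluate Equation (2.4): f_hat(x_i; S) = n^-1 * sum_j K_S(x_i - X_j),
    with K_S(x) = |S| K(S x) and K the standard d-variate normal kernel."""
    m, d = test_points.shape
    n, _ = train_points.shape
    det_S = np.abs(np.linalg.det(S))
    norm_const = (2.0 * np.pi) ** (-d / 2.0)
    fhat = np.zeros(m)
    for j in range(n):                                    # O(mnd) direct sum
        u = (test_points - train_points[j]) @ S.T         # rows are S (x_i - X_j)
        fhat += det_S * norm_const * np.exp(-0.5 * np.sum(u * u, axis=1))
    return fhat / n

For a fixed bandwidth H, one valid scale is S = np.linalg.cholesky(np.linalg.inv(H)).T, since then S^T S = H^{-1}; this particular factorization is an illustrative choice, not a convention taken from the thesis.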
2.3 Kernel Functions
2.3.1 Univariate Kernels
A univariate kernel is a one-dimensional, non-negative, real-valued, integrable function $k$ which
satisfies

• $\int_{-\infty}^{+\infty} k(u)\,du = 1$;
• $k(-u) = k(u)$ for all values of $u$.

The first requirement ensures that the result of the kernel density estimator is a probability density
function. The second requirement ensures that the kernel function has zero mean, so that a kernel
placed at a given training point has its average value at that training point.
To help reduce the computational complexity, a univariate bounding box can be applied to the
kernel to give it finite support, at some cost in accuracy. The truncated kernel is given as

$k_{\mathrm{trunc}}(x; a, b) = \left[\int_{-\infty}^{b} k(u)\,du - \int_{-\infty}^{a} k(u)\,du\right]^{-1} k(x)\, b(x; a, b),$   (2.5)

where $b(x; a, b)$ is the bounding box extending from the lower bound $a$ to the upper bound $b$:

$b(x; a, b) = \begin{cases} 1, & \text{if } a \le x \le b, \\ 0, & \text{otherwise.} \end{cases}$   (2.6)

The normalization factor $\left[\int_{-\infty}^{b} k(u)\,du - \int_{-\infty}^{a} k(u)\,du\right]^{-1}$ is introduced to ensure that the truncated
kernel function satisfies the requirement

$\int_{-\infty}^{+\infty} k_{\mathrm{trunc}}(u)\,du = 1.$   (2.7)

If accuracy outweighs computational complexity, the bounding box can be ignored
by setting the lower and upper bounds to $-\infty$ and $+\infty$ respectively. In this case, we get
$k_{\mathrm{trunc}}(x) = k(x)$.
A range of univariate kernels are commonly used, such as the uniform, triangular, biweight,
triweight, and Epanechnikov kernels. However, the choice of the univariate kernel function $k$ is not crucial to
the accuracy of kernel density estimators [20]. Due to its convenient mathematical properties
and the smooth density estimates it produces, the normal kernel $k(x) = \phi(x)$ is often used, where $\phi$ is
the standard normal density function, defined as

$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}.$   (2.8)
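As an illustration of Equations (2.5)-(2.7), the short sketch below truncates the standard normal kernel to a bounding box [a, b] and renormalizes it, writing the normal CDF in terms of the error function. The function name and the use of the Gaussian kernel are illustrative assumptions of the sketch.

import math

def truncated_gaussian_kernel(x, a, b):
    """Equation (2.5) with k = phi: renormalize the standard normal kernel
    over the bounding box [a, b]; zero outside the box (Equation (2.6))."""
    if not (a <= x <= b):
        return 0.0
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    # Normal CDF via the error function: Phi(t) = 0.5 * (1 + erf(t / sqrt(2))).
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return phi / (Phi(b) - Phi(a))   # normalization factor of Equation (2.5)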
Figure 2.4: Truncated Gaussian kernel function. Solid line - the truncated Gaussian kernel; square dashed
line - the bounding box; dotted line - the untruncated Gaussian kernel.
2.3.2 Separable Multivariate Kernels
The multivariate kernel functions, based on their separability, can be divided into two categories:
separable and nonseparable kernel functions. Due to their computational simplicity, we
mainly focus on separable multivariate kernel functions in this section.
A separable multivariate kernel $K(x) : \mathbb{R}^d \to \mathbb{R}$ can be written as [21]

$K(x) = \prod_{l=1}^{d} k(x_l),$   (2.9)

where $x_l \in \mathbb{R}$ represents the $l$-th component of $x = (x_1, x_2, \ldots, x_d)^T$. Notice that the kernels can
have either finite or infinite support. In the finite case, we omit the truncation subscript and bounding
box for simplicity. According to Equation (2.5), the separable multivariate kernel $K$ is
only valid for $x \in \mathrm{support}\{K(\cdot)\}$ and its values are zero outside the support.
Similarly, the first order partial derivatives of $K$ can be written as

$\frac{\partial K}{\partial x_c}(x) = k'(x_c) \prod_{\substack{l=1 \\ l \ne c}}^{d} k(x_l),$   (2.10)

where $\frac{\partial K}{\partial x_c}(x)$ is the first order partial derivative of $K$ with respect to $x_c$, the $c$-th component of
$x$, and $k'(x_c)$ is the first order derivative of the univariate kernel function $k(x_c)$. The second order
partial derivatives of $K$ are

$\frac{\partial^2 K}{\partial x_r \partial x_c}(x) = \begin{cases} k''(x_c) \prod_{l \ne c} k(x_l), & r = c, \\ k'(x_r)\, k'(x_c) \prod_{l \ne c,\, l \ne r} k(x_l), & r \ne c, \end{cases}$   (2.11)

where $\frac{\partial^2 K}{\partial x_r \partial x_c}(x)$ is the second order partial derivative of $K$ with respect to $x_r$ and $x_c$, and
$k''(x_c)$ is the second order derivative of the univariate kernel function $k(x_c)$.
The above definition of the first and second order partial derivatives of the kernel $K$ can be extended
to the higher order case. Given a multiset $N = \{n_1, \ldots, n_r \mid n_i \in \{1, \ldots, d\},\ i \in \{1, \ldots, r\}\}$, the $r$-th
order partial derivative of $K$ with respect to $x_{n_1}, \ldots, x_{n_r}$ is given by

$\frac{\partial^r K}{\partial x_{n_1} \cdots \partial x_{n_r}}(x) = \prod_{i=1}^{d} k^{(N(i))}(x_i),$   (2.12)

where $N(i)$ denotes the number of elements of value $i$ in the multiset $N$, and $k^{(N(i))}$ is the $N(i)$-th order derivative
of the univariate kernel function k.
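The product form of Equation (2.12) is straightforward to evaluate once the univariate derivatives k^(m) are available. The sketch below assumes a Gaussian univariate kernel, for which k^(m)(x) = (-1)^m He_m(x) phi(x), with He_m the probabilists' Hermite polynomial; this closed form and the function names are assumptions of the sketch, not formulas taken from the thesis, and the efficient graph-based evaluation scheme of this thesis is described in Chapter 5.

import numpy as np
from numpy.polynomial.hermite_e import hermeval

def gaussian_kderiv(x, m):
    """m-th derivative of the univariate standard normal kernel:
    phi^(m)(x) = (-1)^m He_m(x) phi(x)."""
    phi = np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)
    He_m = hermeval(x, [0.0] * m + [1.0])   # coefficients select degree m
    return (-1.0) ** m * He_m * phi

def separable_kernel_partial(x, orders):
    """Equation (2.12): partial derivative of K(x) = prod_l k(x_l) in which
    coordinate l is differentiated orders[l] times (orders[l] plays the role of N(l))."""
    return np.prod([gaussian_kderiv(xl, m) for xl, m in zip(x, orders)])

# Example: d^2 K / (dx_1 dx_2) at x = (0.3, -0.2, 1.0), i.e. N = {1, 2}.
print(separable_kernel_partial(np.array([0.3, -0.2, 1.0]), (1, 1, 0)))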
2.4 Bandwidth
In common with all smoothing problems, the most important factor is to determine the amount of
smoothing. For kernel density estimators, the single most important factor is the bandwidth since it
controls the amount and orientation of the smoothing.
2.4.1 Types of Bandwidth
For the univariate case, the bandwidth $h$ is a scalar. If the standard normal density function is used to
approximate univariate data, and the underlying density being estimated is Gaussian, then it can be
shown that the optimal choice for $h$ is [22]

$h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5}.$   (2.13)
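A minimal sketch of the rule of thumb in Equation (2.13), using the sample standard deviation as the estimate of sigma-hat (an assumption of the sketch; other scale estimates could be used):

import numpy as np

def rule_of_thumb_bandwidth(samples):
    """Equation (2.13): h = (4 * sigma_hat^5 / (3 n))^(1/5)."""
    n = len(samples)
    sigma_hat = np.std(samples, ddof=1)   # sample standard deviation
    return (4.0 * sigma_hat**5 / (3.0 * n)) ** 0.2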
For the multivariate case, the bandwidth $H$ is a matrix. The type of orientation of the kernel
function is controlled by the parameterization of the bandwidth matrix.

Figure 2.5: Univariate kernel density estimates of different bandwidths.

There are three main classes of parameterization [23]:
• the class of all symmetric, positive definite matrices:

$H = \begin{pmatrix} h_1^2 & h_{12} & \cdots & h_{1n} \\ h_{12} & h_2^2 & \cdots & h_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ h_{1n} & h_{2n} & \cdots & h_n^2 \end{pmatrix}$   (2.14)

• the class of all diagonal, positive definite matrices:

$\mathrm{dg}\,H = \begin{pmatrix} h_1^2 & 0 & \cdots & 0 \\ 0 & h_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n^2 \end{pmatrix}$   (2.15)

• the class of all positive constants times the identity matrix:

$h^2 I = \begin{pmatrix} h^2 & 0 & \cdots & 0 \\ 0 & h^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h^2 \end{pmatrix}$   (2.16)
The first class defines a full bandwidth matrix, which is the most general bandwidth type. It allows
the kernel estimator to smooth in any direction, whether along a coordinate axis or not. The second class defines
the diagonal matrix parameterization, which is the most commonly used one. A diagonal bandwidth
matrix allows for different degrees of smoothing along each of the coordinate axes. The third class,
$h^2 I$, applies the same smoothing to every coordinate direction, which is too restrictive
for general use. A visualization of the scaled kernel functions using these different classes of
bandwidths is given in Figure 2.6. It is worth mentioning that, for a bivariate bandwidth matrix, the
full bandwidth matrix of the first class can also be parameterized as

$H = \begin{pmatrix} h_1^2 & \rho_{12} h_1 h_2 \\ \rho_{12} h_1 h_2 & h_2^2 \end{pmatrix},$   (2.17)

where $\rho_{12}$ is the correlation coefficient, which can be used as a measure of orientation.

Figure 2.6: Comparison of three bandwidth matrix parametrization classes. Left: positive scalar times
the identity matrix. Center: all diagonal, positive definite matrices. Right: all symmetric, positive definite
matrices.
2.4.2 Variable Bandwidth
So far, the bandwidths we have used in kernel density estimators are fixed, which means that a single
bandwidth is used for every test point $x_i$, $i = 1, \ldots, m$, and training point $X_j$, $j = 1, \ldots, n$. In
this section, we generalize these fixed bandwidth estimators to variable bandwidth estimators.
There are two main classes of variable bandwidth estimators:

$\hat{f}(x_i; H) = n^{-1} \sum_{j=1}^{n} K_{H(x_i)}(x_i - X_j), \quad i = 1, \ldots, m,$   (2.18)

and

$\hat{f}(x_i; \Omega) = n^{-1} \sum_{j=1}^{n} K_{\Omega(X_j)}(x_i - X_j), \quad i = 1, \ldots, m,$   (2.19)
where functions H(·) and Ω(·) are bandwidth functions. They are considered to be non-random
functions, in the same way as we consider a single bandwidth to be a non-random number or matrix.
The first kernel density estimator is called the balloon kernel density estimator. Its bandwidths
are different at each testing point xi , i = 1, . . . , m. The second kernel density estimator is called
the sample point kernel density estimator. Its bandwidths are different at each training point
$X_j$, $j = 1, \ldots, n$. In this thesis, we cover only the sample point kernel density estimators. We
do not cover balloon kernel density estimators, for two reasons. First, balloon estimators typically do
not integrate to 1 so they are not true density functions, a result from focusing on estimating locally
rather than globally [24]. Second, balloon estimators are generally less accurate than sample point
estimators [25, 26].
Figure 2.7: Univariate sample point kernel density estimate: solid line - kernel density estimate; dashed lines -
individual kernels.
For the sample point kernel density estimators, there are usually two choices for the bandwidth
function Ω. One commonly used form is
Ω(X j ) = h2 f (X j )−1 I, j = 1, . . . , n,
(2.20)
where h is a constant. Using the reciprocal of f leads to an O(h4 ) bias rather than the usual O(h2 )
bias for fixed bandwidth estimators [26]. This form of the bandwidth appeals intuitively since it
states that the smaller bandwidths should be used in those parts of the data set with high density
of points, which is controlled by the value of f , and larger bandwidths in parts with lower density.
This combination of small bandwidths near the modes and large bandwidths in the tails should be
able to detect fine features near the former and prevent spurious features in the latter. One possible
solution for finding the estimate of the bandwidth function Ω is to use a pilot estimate fˆ to give
Ω̂(X j ) = hfˆ(X j )−1/2 .
The other choice of $\Omega$ is to use the k-nearest neighbor function of $X_j$ [27]. The k-nearest neighbor
function is defined as a symmetric positive definite second order covariation matrix associated with
the neighborhood of $X_j$. It can be written as

$C(X_j) = n_k^{-1} \sum_{k=1}^{n_k} (X_j - X_{jk})(X_j - X_{jk})^T, \quad j = 1, \ldots, n,$   (2.21)

where $n_k = \lceil n^{\gamma} \rceil$ is the number of neighbors and $X_{jk}$ denotes the $k$-th nearest neighbor of $X_j$. Here,
$n_k$ is chosen to be significantly smaller than the number of samples, but large enough to reflect the
variations. The parameter $\gamma$ depends on the dimension of the space and the sparsity of the data points.
Thus, the bandwidth function $\Omega$ can be given as

$\Omega(X_j) = \sigma^2 C(X_j), \quad j = 1, \ldots, n,$   (2.22)

where $\sigma$ is the scalar kernel width.
According to the notation of Equation (2.4), and allowing the scaled kernels to have different
weights at each training point, the variable bandwidth kernel density estimator can be written as

$\hat{f}(x_i; S_j, \omega_j) = \sum_{j=1}^{n} \omega_j K_{S_j}(x_i - X_j), \quad i = 1, \ldots, m,$   (2.23)

where $\omega_j$, $j = 1, \ldots, n$, is the weight of the scaled kernel at training point $X_j$ and $S_j$ is the scale.
For simplicity, we write $\Omega_j$ and $C_j$ instead of $\Omega(X_j)$ and $C(X_j)$. The variable bandwidth and the
scale are related by $\Omega_j^{-1} = S_j^T S_j$. We can extract $S_j$ by utilizing the eigendecomposition of $C_j$:

$\Omega_j = \sigma^2 Q_j \Lambda_j Q_j^T, \quad j = 1, \ldots, n,$   (2.24)

where the columns of $Q_j$ and the diagonal elements of $\Lambda_j$ are the eigenvectors and eigenvalues
of $C_j$. Therefore, the scale matrix $S_j$ can be written as

$S_j = \sigma^{-1} \Lambda_j^{-1/2} Q_j^T, \quad j = 1, \ldots, n.$   (2.25)
(2.25)
Kernel Density Derivative Estimation
Before considering the r-th derivative of a multivariate density, we first introduce the notation of r-th
derivatives of a function [28, 29]. From a multivariate point view, the r-th derivative of a function is
15
CHAPTER 2. KERNEL SMOOTHING
understood as the set of all its partial derivatives of order r, rather than just one of them. All these
r-th partial derivatives can be neatly organized into a single vector as follow: if f is a real d-variate
r
density function and x = (x1 , . . . , xd ), then we denote D⊗r f (x) ∈ Rd the vector containing all the
partial derivatives of order r of f at x, arranged so that we can formally write
D⊗r f =
∂f
,
(∂x)⊗r
(2.26)
where D⊗r is the r-th Kronecker power [30] of the operator D. Thus we write the r-th derivative
of f as a vector of length dr . Notice that, using this notation, we have D(D⊗r f ) = D⊗(r+1) f . Also,
the gradient of f is just D⊗1 f and the Hessian ∇2 f =
∂2f
∂x∂xT
is such that vec ∇2 f = D⊗2 f , where
vec denotes the vector operator [31]. According to the previous notation we can then write the r-th
kernel density derivative estimator D⊗r f as
D
⊗r
fˆ(xi ; S j , ωj ) =
=
n
X
j=1
n
X
j=1
ωj D⊗r KS j (xi − X j )
(2.27)
⊗r
ωj |S j |S ⊗r
j D K(S j (xi
− X j )), i = 1, . . . , m, j = 1, . . . , n,
Here, we follow the definition in Equation (2.4) and Equation (2.23), where xi is the i-th data of the
test set, X j is the j-th data of the training set, S j is the variable scale, KS j is the scaled kernel and
ωj is the weight of the r-th derivative of the scaled kernel.
In this thesis, we will mainly focus on the first and second derivatives of the kernel density
function because they have great potential for applications and are crucial for identifying significant
features of the distribution [32]. The first order derivatives can be obtained from the kernel gradient
estimator

$\nabla \hat{f}(x_i; S_j, \omega_j) = \sum_{j=1}^{n} \omega_j \nabla K_{S_j}(x_i - X_j), \quad i = 1, \ldots, m,$   (2.28)

where $\nabla$ is the column vector of the $d$ first-order partial derivatives and

$\nabla K_S(x) = |S| S^T \nabla K(Sx).$   (2.29)

Similarly, the second order derivatives can be obtained from the kernel curvature estimator

$\nabla^2 \hat{f}(x_i; S_j, \omega_j) = \sum_{j=1}^{n} \omega_j \nabla^2 K_{S_j}(x_i - X_j), \quad i = 1, \ldots, m,$   (2.30)

where $\nabla^2$ denotes the matrix of all second-order partial derivatives and

$\nabla^2 K_S(x) = |S| S^T \nabla^2 K(Sx) S.$   (2.31)
It should be pointed out that the kernel curvature estimator is actually the estimator for the Hessian
matrix of the density function.
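To make Equations (2.28)-(2.31) concrete, the sketch below evaluates the kernel gradient and curvature estimators at a single test point, assuming the standard d-variate Gaussian kernel, for which grad K(u) = -K(u) u and hess K(u) = K(u) (u u^T - I). These closed forms and the naive loop are assumptions made for clarity; the high performance CPU and GPU implementations are the subject of Chapter 5.

import numpy as np

def kde_gradient_hessian(x, train, scales, weights):
    """Sample-point estimators of Equations (2.28)-(2.31) at one test point x,
    using the standard d-variate Gaussian kernel."""
    d = x.shape[0]
    grad = np.zeros(d)
    hess = np.zeros((d, d))
    norm_const = (2.0 * np.pi) ** (-d / 2.0)
    I = np.eye(d)
    for X_j, S_j, w_j in zip(train, scales, weights):
        u = S_j @ (x - X_j)
        K = norm_const * np.exp(-0.5 * u @ u)
        detS = np.abs(np.linalg.det(S_j))
        grad += w_j * detS * S_j.T @ (-K * u)                            # Equation (2.29)
        hess += w_j * detS * S_j.T @ (K * (np.outer(u, u) - I)) @ S_j    # Equation (2.31)
    return grad, hess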
Chapter 3
Vesselness Measure
The vesselness measure intuitively describes the likelihood of a point being part of a vessel. It is
not reliable to judge whether a point belongs to a vessel based only on the point's intensity,
because the analysis of vesselness relies on structural information such as local extrema,
valleys, ridges, or saddle points, which can only be obtained from the derivatives of the intensity function.
In this chapter, we first introduce the basic knowledge of gradients and Hessian matrices and
their relation to structural features, then we give three different methods to find the gradients and
Hessian matrices from images, and finally we provide two popular algorithms for the vesselness measure.
3.1 Gradients and Hessian Matrices of Images
Gradients and Hessian matrices are crucial in finding structural information from images. In this
section we will start from the definition of the gradient and Hessian matrix, then introduce their
mathematical properties and finally discuss how to extract structural information from image gradients
and Hessian matrices.
3.1.1 Gradient
Given a differentiable, scalar-valued function $f(x)$, $x = (x_1, \ldots, x_n)^T$, of standard Cartesian coordinates in Euclidean space, its gradient is the vector whose components are the $n$ partial derivatives of
$f$. It can be written as

$\nabla f(x) = \frac{\partial}{\partial x} f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T.$   (3.1)
In mathematics, the gradient points in the direction of the greatest rate of increase of the function,
and its magnitude is the slope of the graph in that direction. An example is illustrated in Figure 3.1.
On the left, we construct a bivariate Gaussian function $f(x, y) = \frac{1}{2\pi} e^{-0.5(x^2 + y^2)}$. Its gradients as
well as its contours are shown on the right. Each blue arrow represents a gradient vector of $f$ at
its location: the direction of the arrow denotes the direction of the gradient vector and the
length of the arrow denotes its magnitude. We can see that all the blue arrows point to the center,
where the function $f$ reaches its peak value.
Figure 3.1: Gradient of the standard Gaussian function. Left: standard bivariate Gaussian function. Right:
gradients (blue arrows) of the standard 2D Gaussian function.
For image processing and computer vision, the gradient of an image is defined the same way as
the mathematical gradient, except that $f$ is now an image intensity function $I$. Since an image is
usually either 2D or 3D, an image gradient can be written as

$g(x, y) = \left( \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y} \right)^T \quad \text{or} \quad g(x, y, z) = \left( \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}, \frac{\partial I}{\partial z} \right)^T,$   (3.2)

where $I(x, y)$ and $I(x, y, z)$ are the image intensity functions for 2D and 3D respectively. For a 2D
image, the magnitude and direction of the gradient vector at point (x0 , y0 ) is
$|g(x_0, y_0)| = \sqrt{ \left( \frac{\partial}{\partial x} I(x_0, y_0) \right)^2 + \left( \frac{\partial}{\partial y} I(x_0, y_0) \right)^2 }$   (3.3)

and

$\theta = \operatorname{atan}\!\left( \frac{\partial}{\partial y} I(x_0, y_0),\ \frac{\partial}{\partial x} I(x_0, y_0) \right).$   (3.4)
Figure 3.2: Image gradient. On the left, an intensity image of a cameraman. In the center, a gradient image
in the x direction measuring horizontal change in intensity. On the right, a gradient image in the y direction
measuring vertical change in intensity.
Usually, the intensity function I(x, y) or I(x, y, z) of a digital image is not given directly. It is
only known at discrete points. Therefore, to get its derivatives we assume that there is an underlying
continuous intensity function which has been sampled at the image points. With some additional
assumptions, the derivative of the continuous intensity function can be approximated as a function on
the sampled intensity function, i.e., the digital image. Approximations of these derivative functions
can be defined at varying degrees of accuracy. We will discuss them in details in Section 3.2.
3.1.2 Hessian
Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is a function taking a vector $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$ and outputting a scalar
$f(x) \in \mathbb{R}$. If all second order partial derivatives of $f$ exist, then the Hessian matrix of $f$ is an $n \times n$
square matrix, which is defined as follows:

$\nabla^2 f(x) = \frac{\partial^2 f}{\partial x \partial x^T} = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix},$   (3.5)

where $\frac{\partial^2 f}{\partial x_i \partial x_j}$ is the second order partial derivative of $f$ with respect to the variables $x_i$ and $x_j$.
Specifically, if $f$ has continuous second partial derivatives at any given point in $\mathbb{R}^n$, then $\forall i, j \in
\{1, 2, \ldots, n\}$, $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$. Thus, the Hessian matrix of $f$ is a symmetric matrix. This is true in
most "real-life" circumstances.
For a digital image, the Hessian matrix is defined in the same way, except that the function $f$ is now
the image intensity function $I$, which is usually a 2D or 3D discrete function. By convention, we use
$H$ to denote the Hessian matrix of an image, and it can be written as

$H(x, y) = \begin{pmatrix} \frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} \\ \frac{\partial^2 I}{\partial y \partial x} & \frac{\partial^2 I}{\partial y^2} \end{pmatrix} \quad \text{and} \quad H(x, y, z) = \begin{pmatrix} \frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} & \frac{\partial^2 I}{\partial x \partial z} \\ \frac{\partial^2 I}{\partial y \partial x} & \frac{\partial^2 I}{\partial y^2} & \frac{\partial^2 I}{\partial y \partial z} \\ \frac{\partial^2 I}{\partial z \partial x} & \frac{\partial^2 I}{\partial z \partial y} & \frac{\partial^2 I}{\partial z^2} \end{pmatrix}.$   (3.6)
A symmetric $n \times n$ Hessian matrix can be decomposed into the following form using eigenvalue
decomposition,

$H = Q \Lambda Q^T,$   (3.7)

where $Q$ is the square matrix whose $i$-th column is the eigenvector $q_i$ of $H$ and $\Lambda$ is the diagonal
matrix whose diagonal elements are the corresponding eigenvalues. $Q$ and $\Lambda$ can be written as

$Q = [q_1, q_2, \ldots, q_n] \quad \text{and} \quad \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}.$   (3.8)
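In practice, the decomposition of Equations (3.7)-(3.8) is computed numerically. A minimal sketch using NumPy's symmetric eigensolver, with the eigenvalues reordered by magnitude as used in Table 3.1 and by the Frangi filter, is given below; the function name is illustrative.

import numpy as np

def sorted_hessian_eigen(H):
    """Decompose a symmetric Hessian as H = Q Lambda Q^T (Equations (3.7)-(3.8)) and
    return the eigenpairs ordered by magnitude, |lambda_1| <= ... <= |lambda_n|."""
    eigval, eigvec = np.linalg.eigh(H)        # eigh: solver for symmetric matrices
    order = np.argsort(np.abs(eigval))
    return eigval[order], eigvec[:, order]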
In the 2D case, the eigenvalues of $H$ can be visualized by constructing an ellipsoid

$v^T H v = 1,$   (3.9)

where $v = (x, y)^T$. By performing an eigenvalue decomposition of $H$, so that $\Lambda$ is a diagonal matrix
and $Q$ is a rotation (orthogonal) matrix,

$\Lambda = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \quad \text{and} \quad Q = [q_1, q_2],$   (3.10)

we have

$v^T Q \Lambda Q^T v = (Q^T v)^T \Lambda (Q^T v) = 1.$   (3.11)

Letting $v' = Q^T v = (x', y')^T$, we get

$v'^T \Lambda v' = \lambda_1 x'^2 + \lambda_2 y'^2 = \frac{x'^2}{(1/\sqrt{\lambda_1})^2} + \frac{y'^2}{(1/\sqrt{\lambda_2})^2} = 1.$   (3.12)

We can see that Equation (3.12) is a standard ellipsoid equation in the coordinates $(x', y')$, with semi-principal axes $1/\sqrt{\lambda_1}$ and $1/\sqrt{\lambda_2}$. Since $Q$ is a rotation matrix, Equation (3.9) is actually a rotated
ellipsoid in the coordinates $(x, y)$.
Figure 3.3: Visualized eigenvalues with ellipsoid.
Intuitively, for 2D images, if a pixel is close to the centerline of a vessel, it should satisfy the
following properties:

• one of the eigenvalues, $\lambda_1$, should be very close to zero;
• the absolute value of the other eigenvalue should be much greater than zero, $|\lambda_2| \gg 0$.

For 3D images, let $\lambda_k$ be the eigenvalue with the $k$-th smallest magnitude, i.e. $|\lambda_1| \le |\lambda_2| \le |\lambda_3|$.
Then an ideal tubular structure in a 3D image satisfies:

• $\lambda_1$ should be very close to zero;
• $\lambda_2$ and $\lambda_3$ should be of large magnitude and equal sign (the sign is an indicator of brightness/darkness).

The respective eigenvectors point out singular directions: $q_1$ indicates the direction along the vessel
(minimum intensity variation), while $q_2$ and $q_3$ form a basis for the orthogonal plane.
3.2 Finding 1st and 2nd Order Derivatives From Images
Typically, the intensity function of a digital image is only known at evenly distributed discrete
places. Thus, instead of the continuous intensity function $I(x, y)$, $I(n_1, n_2)$ is usually used to refer to
an image at discrete points. Here, $(n_1, n_2)$ are the indices of a pixel in the image. They are related to
$(x, y)$ by $(x, y) = (n_1 \Delta n_1, n_2 \Delta n_2)$, where $\Delta n_1$ and $\Delta n_2$ are the distances between two adjacent
pixels in the horizontal and vertical directions. To compute the derivatives of an image, we need to
use the pixels of $I(n_1, n_2)$ to approximate the underlying $I(x, y)$ and its derivatives. In this section,
three approximations are introduced: Gradient Operator, Gaussian Smoothing, and Kernel Density
Derivative Estimation.

Table 3.1: Possible orientation patterns in 2D and 3D images, depending on the values of the eigenvalues $\lambda_k$
(H = high, L = low, N = noisy, usually small; +/- indicates the sign of the eigenvalue). The eigenvalues are
ordered: $|\lambda_1| \le |\lambda_2| \le |\lambda_3|$ [3].

2D (λ1, λ2) | 3D (λ1, λ2, λ3) | orientation pattern
N  N        | N  N  N         | noisy, no preferred direction
            | L  L  H-        | plate-like structure (bright)
            | L  L  H+        | plate-like structure (dark)
L  H-       | L  H-  H-       | tubular structure (bright)
L  H+       | L  H+  H+       | tubular structure (dark)
H- H-       | H- H- H-        | blob-like structure (bright)
H+ H+       | H+ H+ H+        | blob-like structure (dark)
3.2.1 Gradient Operator
For a 2D image, the partial derivative of its continuous intensity function I(x, y) in the x direction is
defined as
$I_x(x, y) = \frac{\partial}{\partial x} I(x, y) = \lim_{h \to 0} \frac{I(x + \frac{h}{2}, y) - I(x - \frac{h}{2}, y)}{h}.$   (3.13)

Thus, for a constant $h$, Equation (3.13) can be approximated by

$\hat{I}_x(x, y) = \frac{I(x + \frac{h}{2}, y) - I(x - \frac{h}{2}, y)}{h},$   (3.14)

and the error of the approximation is $O(h^2)$ [33]. In the discrete case, if we let $\frac{h}{2} = \Delta n$, then the
approximation of $I_x(n_1, n_2)$ can be written as

$\hat{I}_x(n_1, n_2) = \frac{I(n_1 + 1, n_2) - I(n_1 - 1, n_2)}{2 \Delta n}.$   (3.15)
Usually, the sampling factor $\frac{1}{2\Delta n}$ is ignored, since it is constant throughout the image. Therefore, the
approximation of $I_x(n_1, n_2)$ can be simplified to

$\hat{I}_x(n_1, n_2) = I(n_1 + 1, n_2) - I(n_1 - 1, n_2).$   (3.16)

Similarly, the approximation of $I_y(n_1, n_2)$ can be written as

$\hat{I}_y(n_1, n_2) = I(n_1, n_2 + 1) - I(n_1, n_2 - 1).$   (3.17)

Let $h_1(n_1, n_2) = \delta(n_1 + 1, n_2) - \delta(n_1 - 1, n_2)$ and $h_2(n_1, n_2) = \delta(n_1, n_2 + 1) - \delta(n_1, n_2 - 1)$,
where $\delta(n_1, n_2)$ is defined as

$\delta(n_1, n_2) = \begin{cases} 1, & n_1 = n_2 = 0, \\ 0, & \text{otherwise}. \end{cases}$   (3.18)

Then Equations (3.16) and (3.17) can be written as

$\hat{I}_x(n_1, n_2) = I(n_1, n_2) * h_1(n_1, n_2),$   (3.19)
$\hat{I}_y(n_1, n_2) = I(n_1, n_2) * h_2(n_1, n_2).$   (3.20)

Here $h_1$ and $h_2$ are called gradient operators. Usually, the gradient operators are written in the form
of matrices:

$h_1(n_1, n_2) = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}; \quad h_2(n_1, n_2) = \begin{pmatrix} -1 & 0 & 1 \end{pmatrix}.$   (3.21)

The approximated second order derivatives can be derived from the first order derivatives directly:

$\hat{I}_{xx}(n_1, n_2) = I_x(n_1, n_2) * h_1(n_1, n_2) \approx (I(n_1, n_2) * h_1(n_1, n_2)) * h_1(n_1, n_2) = I(n_1, n_2) * h_1(n_1, n_2) * h_1(n_1, n_2).$   (3.22)

Similarly,

$\hat{I}_{xy}(n_1, n_2) = I(n_1, n_2) * h_1(n_1, n_2) * h_2(n_1, n_2),$   (3.23)
$\hat{I}_{yy}(n_1, n_2) = I(n_1, n_2) * h_2(n_1, n_2) * h_2(n_1, n_2).$   (3.24)
The discussion above can easily be applied to the higher dimensional case. Let $I_{x_i}$ denote the first
order derivative of $I$ with respect to $x_i$; then its approximation is given as

$\hat{I}_{x_i}(n_1, \ldots, n_d) = I(n_1, \ldots, n_d) * h_i(n_1, \ldots, n_d),$   (3.25)

where

$h_i(n_1, \ldots, n_d) = \delta(n_1, \ldots, n_i + 1, \ldots, n_d) - \delta(n_1, \ldots, n_i - 1, \ldots, n_d)$

and $d$ is the dimension of the image. Similarly, the second order derivative $I_{x_i x_j}$ can be written as

$\hat{I}_{x_i x_j}(n_1, \ldots, n_d) = I(n_1, \ldots, n_d) * h_i(n_1, \ldots, n_d) * h_j(n_1, \ldots, n_d).$   (3.26)

This method is computationally efficient due to the simple structure of the gradient operators.
But since the estimation of the derivatives only involves adjacent pixels, which contain limited
information about the neighborhood, this method is not accurate. In particular, it cannot provide
accurate derivative information for image structures at large scales.
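A minimal sketch of the gradient-operator approximations of Equations (3.25)-(3.26) is given below; for brevity it uses wrap-around boundaries via np.roll, which is an assumption of the sketch rather than a boundary treatment prescribed by the thesis.

import numpy as np

def gradient_operator_derivative(I, axis):
    """Equation (3.25): first-order estimate I * h_i, i.e. the central difference
    I(n_i + 1) - I(n_i - 1) along the given axis (boundaries wrap here)."""
    return np.roll(I, -1, axis=axis) - np.roll(I, 1, axis=axis)

def second_derivative(I, axis_i, axis_j):
    """Equation (3.26): apply the two gradient operators in sequence."""
    return gradient_operator_derivative(gradient_operator_derivative(I, axis_i), axis_j)

# Example on a 2D image: I_x, I_xx, and the mixed derivative I_xy.
img = np.random.default_rng(1).random((64, 64))
Ix  = gradient_operator_derivative(img, axis=1)   # x taken as the column index
Ixx = second_derivative(img, 1, 1)
Ixy = second_derivative(img, 1, 0)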
3.2.2 Gaussian Smoothing
For a 2D image, from the sampling theorem[34] we know the continuous intensity function I(x, y)
can be ideally reconstructed from the discrete image function I(n1 , n2 ) by
$I(x, y) = \sum_{k_1 = -n}^{n} \sum_{k_2 = -n}^{n} I(k_1, k_2)\, K(x - k_1 \Delta n_1, y - k_2 \Delta n_2),$   (3.27)

where $K$ is a sinc-like function

$K(x, y) = \frac{\sin(\pi x / \Delta n_1)}{\pi x / \Delta n_1} \cdot \frac{\sin(\pi y / \Delta n_2)}{\pi y / \Delta n_2}.$   (3.28)
Here, ∆n1 and ∆n2 are the sampling intervals. However, K decays proportionally to 1/x and 1/y,
which is a rather slow rate of decay. Consequently, only values that are far away from the origin
can be ignored in the computation. In other words, the summation limit n must be large, which is a
computationally undesirable state of affairs. In addition, if there is aliasing, the sinc function will
amplify its effects, since it combines a large number of unrelated pixel values. Instead, a Gaussian
function, which passes only frequencies below a certain value and has a small support in the spatial
domain, can be a good replacement. Thus, we have
$\hat{I}(x, y) = \sum_{k_1 = -n}^{n} \sum_{k_2 = -n}^{n} I(k_1, k_2)\, G(x - k_1 \Delta n_1, y - k_2 \Delta n_2),$   (3.29)

where $G$ is a Gaussian function at scale $h$:

$G(x, y) = \frac{1}{2\pi h^2} e^{-\frac{x^2 + y^2}{2h^2}}.$   (3.30)
Thus, the approximated first and second order derivatives of $I(x, y)$ with respect to $x$ can be given as

$\hat{I}_x(x, y) = \frac{\partial}{\partial x} \hat{I}(x, y) = \sum_{k_1 = -n}^{n} \sum_{k_2 = -n}^{n} I(k_1, k_2)\, G_x(x - k_1 \Delta n_1, y - k_2 \Delta n_2),$   (3.31)

$\hat{I}_{xx}(x, y) = \frac{\partial^2}{\partial x^2} \hat{I}(x, y) = \sum_{k_1 = -n}^{n} \sum_{k_2 = -n}^{n} I(k_1, k_2)\, G_{xx}(x - k_1 \Delta n_1, y - k_2 \Delta n_2),$   (3.32)
where $G_x(x, y)$ and $G_{xx}(x, y)$ are the first and second order derivatives of the Gaussian in the direction $x$:

$G_x(x, y) = \frac{\partial}{\partial x} G(x, y) = -\frac{x}{2\pi h^4} e^{-\frac{x^2 + y^2}{2h^2}},$   (3.33)

$G_{xx}(x, y) = \frac{\partial^2}{\partial x^2} G(x, y) = \frac{1}{2\pi h^4}\left(\frac{x^2}{h^2} - 1\right) e^{-\frac{x^2 + y^2}{2h^2}}.$   (3.34)

Figure 3.4: Derivatives of Gaussian filters. Left: the first order derivative of the bivariate Gaussian function
with respect to x. Right: the second order derivative of the bivariate Gaussian function with respect to x.
Sampling $\hat{I}_x(x, y)$ and $\hat{I}_{xx}(x, y)$ the same way as $I(n_1, n_2)$, we have

$\hat{I}_x(n_1, n_2) = \sum_{k_1 = -n}^{n} \sum_{k_2 = -n}^{n} I(k_1, k_2)\, G_x(n_1 - k_1, n_2 - k_2) = I(n_1, n_2) * G_x(n_1, n_2),$   (3.35)

$\hat{I}_{xx}(n_1, n_2) = \sum_{k_1 = -n}^{n} \sum_{k_2 = -n}^{n} I(k_1, k_2)\, G_{xx}(n_1 - k_1, n_2 - k_2) = I(n_1, n_2) * G_{xx}(n_1, n_2),$   (3.36)

where $G_x(n_1, n_2)$ and $G_{xx}(n_1, n_2)$ are sampled Gaussian derivative functions. Therefore, the first
and second order derivatives of an image can easily be obtained by convolving it with the corresponding
Gaussian derivative filters. We call this Gaussian smoothing.
In general, for a $d$-dimensional image $I(n_1, \ldots, n_d)$, its first and second order derivatives can be
given as

$\hat{I}_{x_i}(n_1, \ldots, n_d) = I(n_1, \ldots, n_d) * G_{x_i}(n_1, \ldots, n_d),$   (3.37)

$\hat{I}_{x_i x_j}(n_1, \ldots, n_d) = I(n_1, \ldots, n_d) * G_{x_i x_j}(n_1, \ldots, n_d),$   (3.38)

where $G_{x_i}$ is the first order Gaussian derivative filter with respect to $x_i$ and $G_{x_i x_j}$ is the second order
Gaussian derivative filter with respect to $x_i$ and $x_j$.
Since the computation uses only convolution and the size of the Gaussian derivative filter is
usually small, the Gaussian smoothing method is also computationally efficient. Unlike the gradient
operator, which calculates the derivatives at the finest scale, the smoothing of this method is controlled
by the scale parameter $h$, which determines how much neighborhood information is used to calculate
the derivatives. The drawbacks of this method are that the smoothing is only performed along the
coordinate axes and that the choice of a proper scale $h$ is hard.
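Equations (3.37)-(3.38) correspond to separable Gaussian derivative filtering, which is available directly in SciPy. A minimal 2D sketch is shown below; the mapping of x and y to array axes is an assumption of the example.

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_derivatives_2d(I, h):
    """Equations (3.37)-(3.38) via separable Gaussian derivative filters at scale h.
    The `order` argument selects the derivative order along each (row, column) axis."""
    Ix  = gaussian_filter(I, sigma=h, order=(0, 1))   # dI/dx  (x = column axis)
    Iy  = gaussian_filter(I, sigma=h, order=(1, 0))   # dI/dy
    Ixx = gaussian_filter(I, sigma=h, order=(0, 2))
    Iyy = gaussian_filter(I, sigma=h, order=(2, 0))
    Ixy = gaussian_filter(I, sigma=h, order=(1, 1))
    return Ix, Iy, Ixx, Iyy, Ixy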
3.2.3 Kernel Density Derivative Estimation
We’ve discussed kernel density derivative estimation in Section 2.5. We know that the approximation
of the derivatives of a function f : Rd → R can be given from a set of sample data Xi , i = 1, . . . , m
by Equation (2.28) and Equation (2.30). To get the derivatives of a 2D image I(x, y), the same idea
can be applied. Letting the weight $\omega_j = I(k_1, k_2)$ and rearranging the indices of the summation operator,
we can rewrite Equations (2.28) and (2.30) as

$\nabla \hat{I}(x, y) = \sum_{k_1} \sum_{k_2} I(k_1, k_2)\, \nabla K_{S_{k_1 k_2}}(x - X_{k_1}, y - Y_{k_2}),$   (3.39)

$\nabla^2 \hat{I}(x, y) = \sum_{k_1} \sum_{k_2} I(k_1, k_2)\, \nabla^2 K_{S_{k_1 k_2}}(x - X_{k_1}, y - Y_{k_2}),$   (3.40)

where $I(k_1, k_2)$ is the intensity of the image at pixel $(k_1, k_2)$, $\nabla \hat{I}(x, y)$ is the estimated gradient of
the image, $\nabla^2 \hat{I}(x, y)$ is the estimated Hessian of the image, $S_{k_1 k_2}$ is the scale matrix, $K(\cdot)$ is the
kernel function, and $(X_{k_1}, Y_{k_2}) = (k_1 \Delta n_1, k_2 \Delta n_2)$ is the location of the pixel $(k_1, k_2)$ in the image.
Sampling $\nabla \hat{I}(x, y)$ and $\nabla^2 \hat{I}(x, y)$ the same way as $I(n_1, n_2)$, we have

$\nabla \hat{I}(n_1, n_2) = \sum_{k_1} \sum_{k_2} I(k_1, k_2)\, \nabla K_{S_{k_1 k_2}}(n_1 - k_1, n_2 - k_2),$   (3.41)

$\nabla^2 \hat{I}(n_1, n_2) = \sum_{k_1} \sum_{k_2} I(k_1, k_2)\, \nabla^2 K_{S_{k_1 k_2}}(n_1 - k_1, n_2 - k_2).$   (3.42)

Here, $\nabla K_{S_{k_1 k_2}}(n_1, n_2)$ is the sampled kernel gradient and $\nabla^2 K_{S_{k_1 k_2}}(n_1, n_2)$ is the sampled kernel
Hessian. Similarly, the above equations can easily be extended to the higher dimensional case:
X
ˆ
∇I(n)
=
I(k)∇KS k (n − k),
(3.43)
k
ˆ
∇2 I(n)
=
X
k
I(k)∇2 KS k (n − k).
(3.44)
where n = (n1 , . . . , nd ) and k = (k1 , . . . , kd ).
The kernel density derivative estimators give the most accurate estimates of the derivatives of the image, because they smooth the image locally: the smoothing direction and scale can be chosen for each pixel individually. The drawback of this method is that it is very computationally intensive. To address this problem, we propose a high performance solution based on the GPU CUDA framework in Chapter 5.
3.3
Frangi Filtering
The Frangi filter [3], developed by Frangi et al. in 1998, is a popular method for highlighting tubular structures in images. It uses the eigenvalues of the Hessian matrices obtained from an image to analyze the likelihood of a pixel lying on a tubular structure.
For a 3D image, assume we have obtained a Hessian matrix H at voxel (n1, n2, n3) using any of the methods discussed in Section 3.2. By performing an eigenvalue decomposition of H and sorting the resulting eigenvalues, we have
|λ1 | ≤ |λ2 | ≤ |λ3 |.
(3.45)
From the discussion in Section 3.1.2, we know that if a voxel lies on a tubular structure, the eigenvalues satisfy |λ1| ≈ 0, |λ1| ≪ |λ2|, and λ2 ≈ λ3. Combining these constraints with the relations in Table 3.1, we can define the following three dissimilarity measures:
• To distinguish between blob-like and nonblob-like structures, we define
$$R_B = \frac{|\lambda_1|}{\sqrt{|\lambda_2 \lambda_3|}}. \qquad (3.46)$$
This ratio attains its maximum for a blob-like structure, which satisfies |λ1 | ≈ |λ2 | ≈ |λ3 |,
and is close to zero whenever λ1 ≈ 0 or λ1 and λ2 tend to vanish.
• To distinguish between plate- and line-like structures, we define
$$R_A = \frac{|\lambda_2|}{|\lambda_3|}. \qquad (3.47)$$

R_A → 0 indicates a plate-like structure and R_A → 1 indicates a line-like structure.
• To distinguish between background (noise) and foreground, we define
$$S = \sqrt{\lambda_1^2 + \lambda_2^2 + \lambda_3^2}. \qquad (3.48)$$
This measure is low in the background, where no structure is present and the eigenvalues are small due to the lack of contrast.
Combining the measures above, we can define a vesselness function as follows:

$$\mathcal{V}(n_1, n_2, n_3) = \begin{cases} 0, & \text{if } \lambda_2 > 0 \text{ or } \lambda_3 > 0, \\ \left(1 - e^{-\frac{R_A^2}{2\alpha^2}}\right) e^{-\frac{R_B^2}{2\beta^2}} \left(1 - e^{-\frac{S^2}{2c^2}}\right), & \text{otherwise}, \end{cases} \qquad (3.49)$$

where α, β and c are thresholds that control the sensitivity of the measures R_A, R_B and S.
Similarly, for 2D images the vesselness function can be given as

$$\mathcal{V}(n_1, n_2) = \begin{cases} 0, & \text{if } \lambda_2 > 0, \\ e^{-\frac{R_B^2}{2\beta^2}} \left(1 - e^{-\frac{S^2}{2c^2}}\right), & \text{otherwise}. \end{cases} \qquad (3.50)$$

Here, R_B = |λ1|/|λ2| is the blobness measure in 2D and S = √(λ1² + λ2²) is the backgroundness measure.
Note that Equations (3.49) and (3.50) are given for bright tubular structures on a dark background. For dark objects the conditions should be reversed.
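As an illustration, a minimal sketch of the 2D vesselness response of Equation (3.50), assuming the Hessian eigenvalues at a pixel are already available and ordered so that |λ1| ≤ |λ2|; the thresholds β and c are the user-chosen sensitivity parameters, and the function name is hypothetical.

```cpp
#include <cmath>

// 2D vesselness of Equation (3.50) for a bright structure on a dark background.
// lambda1, lambda2 are the Hessian eigenvalues with |lambda1| <= |lambda2|.
float vesselness2D(float lambda1, float lambda2, float beta, float c) {
    if (lambda2 >= 0.0f) return 0.0f;                              // bright ridges need lambda2 < 0
    float RB = std::fabs(lambda1) / std::fabs(lambda2);            // blobness measure
    float S  = std::sqrt(lambda1 * lambda1 + lambda2 * lambda2);   // backgroundness measure
    return std::exp(-RB * RB / (2.0f * beta * beta)) *
           (1.0f - std::exp(-S * S / (2.0f * c * c)));
}
```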
3.4
Ridgeness Filtering
Another approach to extracting tubular structures from images is to use a ridgeness filter [35, 36]. In ridgeness filtering, a tubular structure is viewed as a ridge, or principal curve, of the continuous intensity function I(x) : R^d → R. A ridge is defined as a set of curves whose points are local maxima of the function in at least one direction. This more mathematically rigorous definition of a tubular structure provides us with more mathematical tools to analyze the likelihood of a point lying on the ridge.
FIGURE 3.5: Left: Original X-ray vessel image. Right: enhanced vessel image using the Frangi filter.
the ridge. What’s more, unlike a Frangi filter which only uses the information from a local Hessian
matrix H(x) of images, the ridgness filter combines both local gradient and Hessian to measure the
ridgeness.
Let q_i and λ_i be the i-th eigenvector and eigenvalue pair of the Hessian matrix H(x) of I(x), such that |λ1| ≤ ... ≤ |λd|. In general, a point lies on a k-dimensional ideal ridge structure iff it satisfies the following conditions:

• the gradient g(x) is collinear with the first k eigenvectors, i.e. g(x) ∥ q_i(x), i = 1, ..., k, and is orthogonal to the remaining d − k eigenvectors, i.e. g(x)ᵀ q_i(x) = 0, i = k + 1, ..., d;
• λ_{k+1}, ..., λ_d have the same sign;
• |λ_k| ≈ 0.
Thus, such a point on the ridge is a local maximum of the function in the subspace spanned by the last d − k eigenvectors, i.e. S⊥ = span(q_{k+1}, ..., q_d), and the tangential space is spanned by the remaining k eigenvectors, S∥ = span(q_1, ..., q_k). If we consider the inner product between g(x) and H(x)g(x), we get

$$g(x)^T H(x) g(x) = \sum_{i=1}^{d} \lambda_i \left(g(x)^T q_i(x)\right)^2 = \sum_{i=1}^{k} \lambda_i \left(g(x)^T q_i(x)\right)^2 + \sum_{i=k+1}^{d} \lambda_i \left(g(x)^T q_i(x)\right)^2. \qquad (3.51)$$
Note that since the eigenvalues are sorted in ascending order of magnitude, the third condition holds for all of the first k eigenvalues, i.e. |λ_i| ≈ 0, i = 1, ..., k. Hence, the inner product between g(x) and H(x)g(x) is close to zero for a point x on the ridge.
Therefore, a measure of being on the ridge can be formulated in terms of the inner product between g(x) and H(x)g(x),

$$\zeta(x) = \left| \frac{g(x)^T H(x) g(x)}{\lVert H(x) g(x) \rVert\, \lVert g(x) \rVert} \right|, \qquad (3.52)$$

where |·| denotes the absolute value. By the Cauchy–Schwarz inequality, this function is bounded between 0 and 1 due to the normalization factor in the denominator.
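For concreteness, a minimal sketch of the ridgeness measure of Equation (3.52) in the 2D case, assuming the gradient g and the (symmetric) Hessian H at the point have already been estimated by one of the methods in Section 3.2; the small epsilon guarding the denominator is an implementation assumption.

```cpp
#include <cmath>

// Ridgeness zeta = |g^T H g| / (||H g|| * ||g||) for a 2D point (Equation (3.52)).
// g = (gx, gy); H = [[hxx, hxy], [hxy, hyy]].
float ridgeness2D(float gx, float gy, float hxx, float hxy, float hyy) {
    float hgx = hxx * gx + hxy * gy;                  // (H g)_x
    float hgy = hxy * gx + hyy * gy;                  // (H g)_y
    float num = std::fabs(gx * hgx + gy * hgy);       // |g^T H g|
    float den = std::sqrt(hgx * hgx + hgy * hgy) * std::sqrt(gx * gx + gy * gy);
    const float eps = 1e-12f;                         // avoid division by zero in flat regions
    return num / (den + eps);                         // close to 0 on a ridge, bounded by 1
}
```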
Chapter 4
GPU Architecture and Programming
Model
4.1
GPU Architecture
A GPU is connected to a host through the PCI-E bus. It has its own device memory, which in current GPU architectures usually holds up to several gigabytes. A GPU manages its device memory independently and cannot work on host memory directly. Typically, data in host memory needs to be transferred to GPU device memory through programmed DMA before it can be read and written by the GPU. The device memory on the GPU supports very high data bandwidth at relatively high latency. Since most data accesses on the GPU begin in device memory, it is very important for programmers to leverage the high bandwidth of device memory to achieve the GPU's peak throughput.
NVIDIA GPUs consist of several streaming multiprocessors (SMs), each of which works independently of the others. Each multiprocessor contains a group of CUDA cores (processors), load/store units, and special function units (SFUs). Each core is capable of performing integer and floating point operations. Multiprocessors create, manage, schedule, and execute threads in groups of 32 parallel threads called warps. A warp is the minimum unit of execution on the GPU. When a multiprocessor is issued a block of threads, it first partitions the block into warps and then schedules those warps for execution via a warp scheduler. All the threads in a warp, if there is no divergence, execute one common instruction at a time. Such an architecture is called SIMT (Single Instruction, Multiple Threads). Each SM also contains several warp schedulers and instruction dispatch units, which select the warps and instructions that will be executed on the SM.
FIGURE 4.1: GPU block diagram.
GPU’s memory hierarchy can be divided into two categories: the off-chip memory and the on-chip
memory, where the chip is referred to the multiprocessor. A off-chip memory of GPU is usually
slower than on-chip memory because they are relatively far from GPU. There two types of off-chip
memory: L2 cache and global memory. A L2 cache is a part of the GPU’s cache memory hierarchy.
It is typically smaller than CPU’s L2 or L3 cache, but has higher bandwidth available, which makes
it more suitable for throughput computing. The L2 cache is shared by all multiprocessors on GPU
and it is invisible to programmers. The global memory here is referred to GPU’s device memory.
But in a more precise definition, the global memory is actually only a part of device memory. A
global memory is a programming concept and we will discuss it in detail later in the Programming
Model sections. The global memory is also shared by all multiprocessors. There are five types of
on-chip memory: register, L1 cache, shared memory, constant cache and texture cache (read-only
data cache). Unlike the registers on CPU, the register file on GPU is very large. And it is the fastest
on-chip memory. A L1 cache is also a part of the GPU’s cache memory hierarchy. It is has larger
cache line size and lower latency than the L2 cache. As a on-chip memory, the L1 cache is only
FIGURE 4.2: GPU hardware memory hierarchy.
Like the L2 cache, the L1 cache is also invisible to programmers. The shared memory is a programmable cache. It shares the same physical component as the L1 cache, which makes shared memory extremely fast; typically, the shared memory/L1 cache is on the order of 100× faster than the global memory. The shared memory is fully visible to programmers. The constant cache and texture cache are used for caching the constant memory and texture memory, which we discuss in detail in the following section.
4.2
Programming Model
In this section, we discuss NVIDIA's CUDA programming model. Since this is a programming model, all the concepts discussed in this section are visible to programmers. Note that some of these concepts have a physical counterpart and some do not.
FIGURE 4.3: Programming Model.
In the CUDA programming model, programmers program the GPU through C-like functions called kernels. To distinguish them from the kernels used in kernel-based density and density derivative estimates, we will call these functions gpu-kernels. A gpu-kernel is simply a task that the programmer wants to assign to the GPU. Usually, the task, or gpu-kernel, is too large to be executed by the GPU all at once. Therefore, the gpu-kernel is divided into several equally sized task chunks called blocks. Each block is executed on one multiprocessor, and a multiprocessor can execute several blocks at a time. A block consists of a number of threads, which are the minimum task unit on the GPU (recall that a warp is the minimum execution unit on the GPU). All the threads in a gpu-kernel must be executable independently. All the blocks in a gpu-kernel form the so-called computation grid. When a programmer writes a gpu-kernel function, the grid size and block size must be defined in advance so that the GPU knows how to assign the task to multiprocessors.
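As a minimal illustration of these concepts (not code from the library), the sketch below defines a gpu-kernel that scales an array and launches it with an explicitly chosen block size and grid size; the block size of 256 is an arbitrary illustrative choice.

```cpp
#include <cuda_runtime.h>

// A gpu-kernel: each thread handles one element of the overall task.
__global__ void scaleKernel(float* data, float alpha, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= alpha;                     // guard the last, partially filled block
}

void scaleOnGpu(float* d_data, float alpha, int n) {
    int blockSize = 256;                              // threads per block, chosen by the programmer
    int gridSize  = (n + blockSize - 1) / blockSize;  // enough blocks to cover all n elements
    scaleKernel<<<gridSize, blockSize>>>(d_data, alpha, n);
    cudaDeviceSynchronize();                          // wait until the gpu-kernel finishes
}
```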
From the programming model, we can view the memory hierarchy from a different perspective. In this new memory hierarchy, all the memories are visible to programmers. Even though they are called memories, they are really the memory resources that the GPU assigns to the application or the gpu-kernel. These memories do not necessarily exist as distinct physical components; instead, they are created and managed at runtime. When a gpu-kernel begins its execution on the GPU, each thread is assigned a number of dedicated registers and, if needed, a private local memory space. A programmer should avoid using too much local memory, because local memory is allocated in the off-chip device memory, which is much slower than the on-chip registers. Each block has shared memory visible to all threads of the block and with the same lifetime as the block. Note that the shared memory we mention here is not a physical component; it is the part of the shared memory resource assigned to this block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. The global memory, constant memory and texture memory are allocated in the off-chip device memory as well. They are persistent across kernel launches by the same application.
FIGURE 4.4: GPU software memory hierarchy.
4.3
Thread Execution Model
When a kernel is invoked, the CUDA runtime distributes the blocks across the multiprocessors on the device, and when a block is assigned to a multiprocessor, it is further divided into groups of 32 threads called warps. A warp scheduler then selects available warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four special function units. CUDA's warp scheduling mechanism helps hide instruction latency. Each instruction of a kernel may require more than a few clock cycles to execute (for example, an instruction that reads from global memory requires multiple clock cycles). The latency of long-running instructions can be hidden by executing instructions from other warps while waiting for the result of the previous warp.
FIGURE 4.5: Warp scheduler.
It is critical for a GPU to achieve high occupancy during execution. Unlike on a CPU, however, it is usually very hard to keep the GPU busy all the time, because several factors affect the GPU's occupancy: the maximum number of registers per thread, the maximum number of threads in a block, and the shared memory size per block. Think of a multiprocessor as a container with limited resources such as registers, shared memory, cores and other ALU resources. As discussed previously, a gpu-kernel is executed block by block on the multiprocessors. How many blocks can be executed on a multiprocessor is decided by the block size, i.e., the amount of resources a block needs to use, and the amount of resources available on the multiprocessor. To achieve high occupancy on the GPU, our goal is to select properly sized blocks to execute on the multiprocessors so that most of the multiprocessor resources are occupied. For example, the Kepler architecture supports at most 63 registers per thread (255 on GK110), 1024 threads per block, and up to 48KB of shared memory per multiprocessor.
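As a sketch of how these limits can be checked in practice, recent CUDA runtimes expose an occupancy calculator; the kernel and block size below are illustrative assumptions rather than values used by the library.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* data, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Report how many blocks of a given size can be resident on one multiprocessor.
void reportOccupancy(int blockSize) {
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                  blockSize, 0 /* dynamic shared memory */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = float(blocksPerSM * blockSize) / float(prop.maxThreadsPerMultiProcessor);
    std::printf("resident blocks/SM: %d, occupancy: %.2f\n", blocksPerSM, occupancy);
}
```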
                               FERMI     FERMI     KEPLER    KEPLER
                               GF100     GF104     GK104     GK110
Max Warps / SMX                48        48        64        64
Max Threads / SMX              1536      1536      2048      2048
Max Thread Blocks / SMX        8         8         16        16
32-bit Registers / SMX         32768     32768     65536     65536
Max Registers / Thread         63        63        63        255
Max Threads / Thread Block     1024      1024      1024      1024
Shared Memory Size             16KB      16KB      16KB      16KB
Configurations (bytes)         48KB      48KB      32KB      32KB
                                                   48KB      48KB

TABLE 4.1: Compute capability of Fermi and Kepler GPUs.
4.4
Memory Accesses
Current GPU’s device memory can only be accessed via 32-byte, 64-byte or 128-byte transaction.
All the memory transactions are naturally aligned. They take place at 32-byte, 64-byte or 128-byte
memory segments, i.e., the address of the first byte of the memory segment must be a multiple of
the transaction size. If the memory addresses are misaligned and they distribute across two memory
segments rather than one, then it will take one more memory transaction to read or write the data. To
make fully use of each memory transaction, memory accesses are usually coalesced by warp. When a
warp executes an instruction that need to access the device/global memory, it looks at the distribution
of memory addresses across the threads within it. Instead of generate a memory transaction for each
thread, it coalesces the memory accesses that read/write data from the same memory segment into
just one memory transaction. Typically, the more transactions are necessary, the more unused words
are transferred in addition to the words accessed by the threads, reducing the instruction throughput
accordingly. For example, if a 32-byte memory transaction is generated for each thread’s 4-byte
access, throughput is divided by 8.
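A minimal sketch of the difference (illustrative, not library code): in the first kernel, consecutive threads of a warp read consecutive 4-byte words from the same segment, while in the second a stride spreads the accesses over many segments and forces extra transactions; `in` is assumed to be large enough for the strided reads.

```cpp
__global__ void coalescedCopy(const float* in, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i];             // warp reads one contiguous 128-byte segment
}

__global__ void stridedCopy(const float* in, float* out, int n, int stride) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];    // warp touches words far apart: many transactions
}
```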
Besides using memory coalescing to increase global memory throughput, programmers can also speed up their applications by reducing unnecessary global memory traffic. One way to achieve this is to use shared memory. As noted above, shared memory is essentially a programmable L1 cache. A traditional cache is invisible to programmers, who cannot decide which data is cached. Shared memory, on the other hand, is fully controlled by programmers.
FIGURE 4.6: Aligned and consecutive memory access.
FIGURE 4.7: Misaligned memory access.
When the programmer identifies data that is accessed repeatedly by threads in the same block, that data can be loaded from global/device memory into shared memory first; later accesses of the data then go through shared memory instead of global memory, which eliminates a great deal of global memory traffic. Another advantage of using shared memory is that it can be accessed simultaneously. Shared memory is divided into equally sized memory modules, called banks. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts. One exception to bank conflicts is that if all threads in a warp access the same shared memory address at the same time, only one memory request is generated and the data is broadcast to all the threads. We call this mechanism broadcasting.
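The sketch below illustrates the pattern on a 1D kernel-density-style sum where every thread needs every training value: each block stages TILE values in shared memory once, synchronizes, and all of its threads reuse them via broadcast reads. The tile size and the assumption that the block size equals TILE are illustrative choices, not values fixed by the library.

```cpp
#define TILE 256   // must equal the block size in this sketch

__global__ void kdeTile1D(const float* train, int m, const float* x, float* y, int n) {
    __shared__ float tile[TILE];
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    float xi  = (i < n) ? x[i] : 0.0f;
    float acc = 0.0f;
    for (int base = 0; base < m; base += TILE) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < m) ? train[j] : 0.0f;   // one coalesced global load per block
        __syncthreads();                                 // tile is ready for the whole block
        int limit = min(TILE, m - base);
        for (int k = 0; k < limit; ++k) {
            float d = xi - tile[k];
            acc += expf(-0.5f * d * d);                  // same address for all threads: broadcast
        }
        __syncthreads();                                 // keep the tile intact until everyone is done
    }
    if (i < n) y[i] = acc;
}
```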
Chapter 5
Algorithms and Implementations
In this chapter, we present three main contributions of this thesis. In Section 5.1, we propose an
algorithm which can calculate the separable multivariate kernel derivatives (SMKD) efficiently. In
Section 5.2, we introduce some core functions in our kernel smoothing library. Several optimization
algorithms for these functions are proposed. Finally, we design a fast k-nearest neighbors bandwidth
selector in Section 5.3.
5.1
Efficient Computation of Separable Multivariate Kernel Derivative
As we mentioned in Section 2.5, the implementation of Equation (2.27) requires the calculation of
D⊗r K(x), which is a vector containing all the partial derivatives of order r of the kernel function K
at point x. For a separable kernel, these partial derivatives are given by Equation (2.12). A brute force
implementation of calculating D⊗r K(x) computes each of these partial derivatives separately. But this results in computing the same set of kernel density, kernel derivative and multiplication operations repeatedly, which is clearly not computationally efficient. Let's consider a motivating example. Assume a separable 4-variable kernel K(x) = k(x1)k(x2)k(x3)k(x4). Its first order partial derivative with respect to x4 is ∂K(x)/∂x4 = k(x1)k(x2)k(x3)k′(x4); similarly, two of its second order derivatives are ∂²K(x)/∂x4² = k(x1)k(x2)k(x3)k″(x4) and ∂²K(x)/(∂x3∂x4) = k(x1)k(x2)k′(x3)k′(x4). One can observe that computing these derivatives separately makes the computation of k(x1)k(x2)k(x3) and k(x1)k(x2) redundant, with three and two repetitions, respectively. This redundancy grows as the number of dimensions and the derivative order increase, which leaves significant room for optimization. Therefore, to avoid these redundant calculations, we propose a graph-based efficient algorithm in this section.
5.1.1
Definitions and Facts
Our algorithm is based on a directed acyclic graph in which each node denotes a set of multivariate kernel partial derivatives and each edge denotes a univariate kernel derivative. In order to give a well-defined and mathematically rigorous description of our algorithm, some required notation, definitions and facts are introduced in this section.
Definition 1. $k_i^{(j)}$ is the value of the j-th order derivative of the univariate kernel k at $x_i$,
$$k_i^{(j)} = k^{(j)}(x_i). \qquad (5.1)$$
Definition 2. $N_d^{(r)}$ denotes the set whose members are the unique partial derivatives of order r of the kernel function $K(x_1, \ldots, x_d) = \prod_{i=1}^{d} k(x_i)$,
$$N_d^{(r)} = \left\{ k_1^{(n_1)} k_2^{(n_2)} \cdots k_d^{(n_d)} \,\middle|\, \sum_{i=1}^{d} n_i = r,\; n_i \in \mathbb{N}_0 \right\}, \quad d \in \mathbb{N}_+,\; r \in \mathbb{N}_0. \qquad (5.2)$$
Definition 3. $S_d^{(r)}$ is the number of elements in the set $N_d^{(r)}$,
$$S_d^{(r)} = \left| N_d^{(r)} \right|. \qquad (5.3)$$
Definition 4. The product of a scalar ω and a set A = {a1 , a2 , . . . , an } is defined using the operator
×, such that
A × ω = {a1 ω, a2 ω, . . . , an ω}.
(5.4)
Definition 5. Define a directed acyclic graph G(V, E). Each node in V stands for a set and each
edge in E has a weight. The relation between nodes and edges in G is given by Figure 5.1. Here,
graph (a) contains two nodes, which stands for two sets, A and B. The edge (A, B) has a weight ω.
Then, according to the graph, the relation between set A and B is B = A × ω, where the operator
× is defined in Definition 4. Similarly, we can find from graph (b) that the node C is pointed by
two edges from node A and B respectively. In this case, the relation between these three sets is
C = (A × ωA ) ∪ (B × ωB ).
Fact 5.1. Set $N_1^{(i)}$ contains only the i-th derivative of the kernel $k(x_1)$,
$$N_1^{(i)} = \{ k_1^{(i)} \}. \qquad (5.5)$$
FIGURE 5.1: Relation between nodes in graph G: (a) B = A × ω; (b) C = (A × ω_A) ∪ (B × ω_B).
Fact 5.2. Set $N_i^{(j)}$ can be derived from the sets $N_{i-1}^{(l)}$, $l \in \{0, \ldots, j\}$,
$$N_i^{(j)} = \bigcup_{l=0}^{j} \left( N_{i-1}^{(l)} \times k_i^{(j-l)} \right). \qquad (5.6)$$
Proof. According to Definition 2, we have
$$N_i^{(j)} = \left\{ k_1^{(n_1)} \cdots k_i^{(n_i)} \,\middle|\, \sum_{l=1}^{i} n_l = j,\; n_l \in \mathbb{N}_0 \right\} = \left\{ k_1^{(n_1)} \cdots k_i^{(n_i)} \,\middle|\, \sum_{l=1}^{i-1} n_l = j - n_i,\; n_l \in \mathbb{N}_0,\; n_i \in \{0, \ldots, j\} \right\}.$$
Since $n_i$ can take any value from 0 to j, we can split the set $N_i^{(j)}$ into $j + 1$ mutually disjoint subsets such that
$$N_i^{(j)} = \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(0)} \,\middle|\, \sum_{l=1}^{i-1} n_l = j \right\} \cup \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(1)} \,\middle|\, \sum_{l=1}^{i-1} n_l = j - 1 \right\} \cup \cdots \cup \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(j)} \,\middle|\, \sum_{l=1}^{i-1} n_l = 0 \right\}.$$
Then, according to Definition 4, for any $p \in \{0, \ldots, j\}$ we have
$$\left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(p)} \,\middle|\, \sum_{l=1}^{i-1} n_l = j - p,\; n_l \in \mathbb{N}_0 \right\} = N_{i-1}^{(j-p)} \times k_i^{(p)}.$$
Therefore,
$$N_i^{(j)} = \left( N_{i-1}^{(j-0)} \times k_i^{(0)} \right) \cup \left( N_{i-1}^{(j-1)} \times k_i^{(1)} \right) \cup \cdots \cup \left( N_{i-1}^{(j-j)} \times k_i^{(j)} \right) = \bigcup_{l=0}^{j} \left( N_{i-1}^{(l)} \times k_i^{(j-l)} \right).$$

Fact 5.3. The number of elements in set $N_i^{(j)}$ equals the sum of the numbers of elements in the sets $N_{i-1}^{(l)}$, $l \in \{0, \ldots, j\}$,
$$S_i^{(j)} = \sum_{l=0}^{j} S_{i-1}^{(l)}. \qquad (5.7)$$
Proof. According to Fact 5.2, we have
$$\left| N_i^{(j)} \right| = \left| \bigcup_{l=0}^{j} \left( N_{i-1}^{(l)} \times k_i^{(j-l)} \right) \right|.$$
Thus,
$$S_i^{(j)} = \sum_{l=0}^{j} \left| N_{i-1}^{(l)} \times k_i^{(j-l)} \right|.$$
According to Definitions 2 and 4, we know
$$\left| N_{i-1}^{(l)} \times k_i^{(j-l)} \right| = \left| \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} k_i^{(j-l)} \,\middle|\, \sum_{p=1}^{i-1} n_p = l,\; n_p \in \mathbb{N}_0 \right\} \right| = \left| \left\{ k_1^{(n_1)} \cdots k_{i-1}^{(n_{i-1})} \,\middle|\, \sum_{p=1}^{i-1} n_p = l,\; n_p \in \mathbb{N}_0 \right\} \right| = S_{i-1}^{(l)}.$$
Therefore,
$$S_i^{(j)} = \sum_{l=0}^{j} S_{i-1}^{(l)}.$$
Fact 5.4. The number of elements in set $N_d^{(r)}$ is $\binom{d+r-1}{r}$,
$$S_d^{(r)} = \binom{d+r-1}{r}. \qquad (5.8)$$
Proof. This statement can be proved by induction.

Base case (d = 1): According to Fact 5.1, we know that $N_1^{(r)} = \{k_1^{(r)}\}$, so $S_1^{(r)} = 1$. Since $\binom{1+r-1}{r} = \binom{r}{r} = 1$, we get $S_1^{(r)} = \binom{1+r-1}{r}$.

Base case (r = 0): According to Definition 2, we know that $N_d^{(0)} = \{k_1^{(0)} k_2^{(0)} \cdots k_d^{(0)}\}$, so $S_d^{(0)} = 1$. Since $\binom{d+0-1}{0} = \binom{d-1}{0} = 1$, we get $S_d^{(0)} = \binom{d+0-1}{0}$.

Inductive step: Show that if the statement holds for $S_{i-1}^{(j)}$ and $S_i^{(j-1)}$, then it also holds for $S_i^{(j)}$. According to Fact 5.3, we know that
$$S_i^{(j)} = \sum_{l=0}^{j} S_{i-1}^{(l)} = \sum_{l=0}^{j-1} S_{i-1}^{(l)} + S_{i-1}^{(j)}.$$
Reapplying Fact 5.3 to $\sum_{l=0}^{j-1} S_{i-1}^{(l)}$, we get
$$S_i^{(j)} = S_i^{(j-1)} + S_{i-1}^{(j)} = \binom{i+j-2}{j-1} + \binom{i+j-2}{j} = \binom{i+j-1}{j}.$$
Since both the base cases and the inductive step have been established, by mathematical induction the statement holds for all $d \in \mathbb{N}_+$ and $r \in \mathbb{N}_0$. Q.E.D.
5.1.2
Algorithm
Our algorithm is illustrated in Figure 5.2. Consistent with Definition 5, each node in Figure 5.2 stands for a set. As defined in Definition 2, a set $N_i^{(j)}$ contains all the partial derivatives of order j of the i-variable function K(x1, ..., xi). Each edge in the graph defines a product operation, as defined in Definition 4, between its head node and its weight. The weight of an edge is a univariate kernel derivative as given by Definition 1. The relationship between an edge's head, weight and tail is illustrated in Figure 5.1.

Ignoring the output node $N_d^{(r)}$, the nodes in the graph form a matrix. Each column contains the sets whose elements are partial derivatives of the same kernel function, and each row contains the sets whose elements are partial derivatives of the same order. Our algorithm starts from the left side of the graph, where we initialize all the nodes in the first column by assigning the corresponding univariate kernel derivatives, as stated in Fact 5.1. Then, according to Fact 5.2, we compute the sets in each column from the sets in the previous column. We repeat this step until we reach the (d − 1)-th column. Finally, once we have the output of the (d − 1)-th column, we reapply Fact
FIGURE 5.2: Graph-based efficient multivariate kernel derivative algorithm.
5.2 by $N_d^{(r)} = \bigcup_{i=0}^{r} \left( N_{d-1}^{(i)} \times k_d^{(r-i)} \right)$ and output the result. The outline of this algorithm is shown in Algorithm 1.
5.1.3
Complexity Analysis
Instead of computing the multivariate partial derivatives directly as products of univariate derivatives, this algorithm reuses the results from previous columns as much as possible, which removes a great number of operations. Since all the univariate kernel derivatives $k_i^{(j)}$, $i \in \{1, \ldots, d\}$, $j \in \{0, \ldots, r\}$, can be calculated efficiently in advance, the only operations needed to calculate the multivariate kernel derivatives are multiplications. Thus, in the rest of this section we focus on counting the number of multiplications in the proposed efficient algorithm and comparing it with the naive method.

The naive algorithm calculates the product of univariate derivatives for each multivariate partial derivative separately. Assume we want to calculate the r-th order partial derivatives of the d-variable kernel function K(x1, ..., xd). According to Fact 5.4, the number of r-th order partial derivatives is $\binom{d+r-1}{r}$. Since each partial derivative is a product of d univariate kernel derivatives, which requires d − 1 multiplications, the total number of multiplications in the naive algorithm is
$$M_n = (d-1) \binom{d+r-1}{r}. \qquad (5.9)$$
From Algorithm 1, we know that the multiplications in the proposed efficient algorithm occur in the calculation of $N_{i-1}^{(l)} \times k_i^{(j-l)}$ and $N_{d-1}^{(i)} \times k_d^{(r-i)}$. According to Definition 4, we know that the
Algorithm 1 Efficient Multivariate Kernel Derivative
procedure MULTIVARIATEDERIVATIVE(d, r)
    for i ← 0, r do
        N_1^(i) ← {k_1^(i)}
    end for
    for i ← 2, d − 1 do
        for j ← 0, r do
            N_i^(j) ← Ø
            for l ← 0, j do
                N_i^(j) ← N_i^(j) ∪ (N_{i−1}^(l) × k_i^(j−l))
            end for
        end for
    end for
    for i ← 0, r do
        N_d^(r) ← N_d^(r) ∪ (N_{d−1}^(i) × k_d^(r−i))
    end for
    return N_d^(r)
end procedure
number of multiplications performed by the × operation is equal to the size of the set. Thus, the numbers of multiplications in computing $N_{i-1}^{(l)} \times k_i^{(j-l)}$ and $N_{d-1}^{(i)} \times k_d^{(r-i)}$ are $S_{i-1}^{(l)}$ and $S_{d-1}^{(i)}$, respectively. Therefore, by applying this to all the for loops in Algorithm 1, we get
$$M_e = \sum_{i=2}^{d-1} \sum_{j=0}^{r} \sum_{l=0}^{j} S_{i-1}^{(l)} + \sum_{i=0}^{r} S_{d-1}^{(i)}. \qquad (5.10)$$
According to Fact 5.3, the above equation can be simplified as
$$M_e = \sum_{i=2}^{d-1} S_{i+1}^{(r)} + S_d^{(r)} = \sum_{i=2}^{d-1} \binom{r+i}{r} + \binom{d+r-1}{r}. \qquad (5.11)$$
Comparing Equation (5.10) with (5.9), we find that the number of multiplications in the efficient algorithm is significantly smaller than in the naive algorithm. Hence, our algorithm can achieve a considerable speedup in theory. Detailed experimental results are given in Chapter 6.
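For concreteness, a host-side sketch of Algorithm 1, assuming the univariate derivative values k_i^(j) have been precomputed into a table uni[i][j]; each column of unique partial-derivative products is built from the previous one as in Fact 5.2, and the last column is expanded only for order r. The function and variable names are illustrative.

```cpp
#include <utility>
#include <vector>

// uni[i][j] = k^(j)(x_i): precomputed univariate kernel derivatives, i = 0..d-1, j = 0..r.
// Returns the unique order-r partial derivatives of the separable kernel (the set N_d^(r)).
std::vector<double> multivariateDerivative(const std::vector<std::vector<double>>& uni, int r) {
    const int d = static_cast<int>(uni.size());
    if (d == 1) return { uni[0][r] };                       // Fact 5.1, nothing to combine
    std::vector<std::vector<double>> col(r + 1);            // col[j] holds the products of N_i^(j)
    for (int j = 0; j <= r; ++j)
        col[j] = { uni[0][j] };                             // first column: N_1^(j) = { k_1^(j) }
    for (int i = 1; i < d - 1; ++i) {                       // columns 2 .. d-1
        std::vector<std::vector<double>> next(r + 1);
        for (int j = 0; j <= r; ++j)
            for (int l = 0; l <= j; ++l)                    // Fact 5.2: N_i^(j) = U_l N_{i-1}^(l) x k_i^(j-l)
                for (double v : col[l])
                    next[j].push_back(v * uni[i][j - l]);
        col = std::move(next);
    }
    std::vector<double> out;                                // last column: only order r is needed
    for (int l = 0; l <= r; ++l)
        for (double v : col[l])
            out.push_back(v * uni[d - 1][r - l]);
    return out;
}
```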
5.2
High Performance Kernel Density and Kernel Density Derivative
Estimators
Kernel density and kernel density derivative estimation methods usually have very high computational requirements. From the discussions in Section 2.5 and Section 5.1.1, we know that the direct computation of the KDE and KDDE requires $mn\binom{d+r-1}{r}$ kernel evaluations, where m is the number of test points, n is the number of training points, d is the dimension of the data, and r is the order of the estimator. Data sets have been growing rapidly in recent years; in our case, the test and training data are usually of size $10^6$ to $10^7$. Fortunately, the evaluations of the KDE and KDDE are independent for different test points, which makes them a perfect fit for parallel computing. In this section, we propose a multi-core CPU and GPU based solution to accelerate the computation of the KDE and KDDE. Several optimization techniques are used to achieve significant performance gains.
5.2.1
Multi-core CPU Implementation
The goal of the multi-core CPU implementation is to deliver a set of kernel smoothing functions to
achieve high flexibility as well as a good performance. For flexibility, this implementation supports
input data of any dimension, can compute kernel density derivatives of any order and has a flexible
choice of kernel and bandwidth types. To achieve a good performance, this implementation uses
the POSIX Threads (PThreads) programming interface to utilize parallelism on multi-core CPU platforms. In this section, we focus only on the most general case (unconstrained variable bandwidth,
any dimension, any order, and Gaussian kernel) due to its high computational and mathematical
complexity.
According to Equation (2.27), the KDDE is usually calculated at n different test points $x_i$, and for each KDDE at test point $x_i$ we need to calculate the weighted sum of the scaled r-th order kernel derivative $D^{\otimes r}K_{S_j}$ at m different shifted locations $x_i - X_j$. Thus, there are $m \times n$ scaled r-th order kernel derivative calculations involved. Note that the scaled and unscaled r-th order kernel derivatives at a shifted location $x_i - X_j$ are related by $D^{\otimes r}K_{S_j}(x_i - X_j) = |S_j|\, S_j^{\otimes r} D^{\otimes r}K(S_j(x_i - X_j))$. Hence, the calculation of the scaled r-th order kernel derivative $D^{\otimes r}K_{S_j}$ at $x_i - X_j$ can be divided into four steps:

• calculate the scaled and shifted data $y = S_j(x_i - X_j)$, where y is a d dimensional vector $y = [y_1, y_2, \ldots, y_d]^T$;
• for each variable $y_l$, $l = 1, \ldots, d$, in y, calculate its univariate kernel and kernel derivatives $k^{(0)}(y_l), k^{(1)}(y_l), \ldots, k^{(r)}(y_l)$ and store the results in a $d \times r$ matrix F, where $F(u, v) = k^{(v)}(y_u)$;
• calculate the multivariate r-th order kernel derivative $D^{\otimes r}K$ from the univariate kernel derivatives in F using the efficient algorithm introduced in Section 5.1. Note that $D^{\otimes r}K$ contains $d^r$ r-th order partial derivatives, which carry some redundancy, whereas the efficient algorithm gives only the $\binom{d+r-1}{r}$ unique partial derivatives. Hence, some results of the efficient algorithm need to be repeated to fill the redundant locations in $D^{\otimes r}K$;
• calculate $|S_j|$ and $S_j^{\otimes r}$, and update $D^{\otimes r}K_{S_j}$ as $|S_j|\, S_j^{\otimes r} D^{\otimes r}K$.
Since the calculations of the KDDE at different test points are independent, we can parallelize our algorithm at the test point level. Therefore, we give our general KDDE algorithm for multi-core platforms in Algorithm 2.
Algorithm 2 Parallel CPU Kernel Density Derivative Estimation
procedure KDDE(x, X, S, ω, r)
    d, m ← SIZE(x)
    n ← SIZE(X, 2)
    D⊗r f ← ZEROS(m, d^r)
    parfor i ← 0, m − 1 do
        for j ← 0, n − 1 do
            y ← S(j)(x(i) − X(j))
            F ← UNIVARIATEDERIVATIVES(y, r)
            D⊗r K ← MULTIVARIATEDERIVATIVES(F, d, r)
            D⊗r f(i) ← D⊗r f(i) + ω(j) |S(j)| S⊗r(j) D⊗r K
        end for
    end parfor
    return D⊗r f
end procedure
5.2.2
GPU Implementation in CUDA
From Chapter 3, we know that in the context of image processing and pattern recognition, estimating the first and second derivatives of the density is crucial for locating significant feature regions in images. Therefore, in this section we focus only on the kernel gradient and curvature estimators, as given by Equations (3.43) and (3.44), for 2D and 3D images. Moreover, since the choice of kernel function is not crucial to the accuracy of the KDE and KDDE, we choose the standard Gaussian function as the kernel. Based on these assumptions, and because the multivariate estimators are far more computationally and mathematically involved, we present several optimized GPU KDE and KDDE implementations in this section.
Naive Implementation
Like the multi-core CPU implementation, the naive GPU implementation parallelizes at the test point level. We create four gpu-kernel functions (ShiftAndScale, UnivarDeri, MultivarDeri, and Update) corresponding to the four steps of calculating the scaled kernel derivatives. Each gpu-kernel is designed to complete its job for all the test points concurrently. To achieve this, based on the CUDA programming model of Section 4.2, we divide the gpu-kernel into ⌈m/t⌉ equally sized blocks, each containing t threads. Thus, there are roughly m threads in the gpu-kernels, and each of these m threads is responsible for calculating the linear combination of the shifted $D^{\otimes r}K_{S_j}$ functions at one test point. The naive implementation is shown in Algorithm 3. To illustrate all the problems and optimization techniques clearly, we give the most complex KDDE function from our kernel smoothing library, which computes the kernel density, kernel gradient and kernel curvature at the same time. Since we are only interested in the first and second order derivatives, we implement the estimators according to Equations (2.28) and (2.30).
Optimization I – Kernel Merging, Loop Unrolling, and Memory Layout Optimization
If we take a close look at Algorithm 3, we find several problems:

• Too many small gpu-kernels. The four gpu-kernels generate many redundant global memory transactions: every gpu-kernel has to save its results back to global memory so that they can be used by the following gpu-kernels. As mentioned in Section 4.1, all global memory transactions are off-chip, which makes them much slower than the on-chip memories. Thus, the redundant off-chip global memory transactions introduce many warp stalls and eventually slow down the execution of the GPU.
• Bad memory layouts. The memory layout of matrices and cubes is column-major (first dimension first) in the naive implementation. As illustrated in Figure 5.3 (a) and (c), such a layout results in strided memory accesses. As mentioned in Section 4.4, memory accesses are coalesced by warp, so scattered or strided memory accesses require more memory transactions, since they cannot be efficiently grouped into one transaction.
• Unnecessary loops. The calculation of multivariate kernel derivatives involves many matrix operations. To complete these operations, we need to write many for loop statements in the gpu-kernels. Usually, the sizes of those for loops are determined by the data dimension. However,
Algorithm 3 Naive GPU Kernel Density Derivative Estimation
procedure KDDE(x, X, S, w)
    d, m ← SIZE(x);  n ← SIZE(X, 2)
    f ← ZEROS(1, m);  g ← ZEROS(d, m);  H ← ZEROS(d·d, m)
    for i ← 0, n − 1 do
        y_i ← SHIFTANDSCALE(x, X, S, i)
        k_i, k′_i, k″_i ← UNIVARDERI(y_i)
        f_i, g_i, H_i ← MULTIVARDERI(k_i, k′_i, k″_i)
        f, g, H ← UPDATE(f_i, g_i, H_i, S, w, i)
    end for
    return f, g, H
end procedure

procedure SHIFTANDSCALE(x, X, S, i)                          ⊳ one thread per test point
    j ← blockDim.x · blockIdx.x + threadIdx.x
    d, n ← SIZE(x);  m ← SIZE(X, 2)
    for k ← 0, d − 1 do
        t[j·d + k] ← x[j·d + k] − X[i·d + k]
    end for
    for p ← 0, d − 1 do
        for q ← 0, d − 1 do
            y_i[j·d + p] ← y_i[j·d + p] + t[j·d + q] · S[d·d·i + d·q + p]
        end for
    end for
    return y_i
end procedure

procedure UNIVARDERI(y_i)                                    ⊳ univariate Gaussian and its derivatives
    j ← blockDim.x · blockIdx.x + threadIdx.x
    y ← y_i[j]
    k ← 1/SQRT(2π) · EXP(−0.5 · y · y)
    k_i[j] ← k;  k′_i[j] ← −y · k;  k″_i[j] ← (y · y − 1) · k
    return k_i, k′_i, k″_i
end procedure

procedure MULTIVARDERI(k_i, k′_i, k″_i)                      ⊳ separable products
    j ← blockDim.x · blockIdx.x + threadIdx.x;  d ← SIZE(k_i)
    f_i[j] ← ∏_p k_i[j·d + p]
    for p ← 0, d − 1 do                                      ⊳ gradient and Hessian diagonal
        g_i[j·d + p] ← k′_i[j·d + p] · ∏_{q≠p} k_i[j·d + q]
        H_i[j·d·d + p·d + p] ← k″_i[j·d + p] · ∏_{q≠p} k_i[j·d + q]
    end for
    for p ← 1, d − 1 do                                      ⊳ Hessian off-diagonal entries
        for q ← 0, p − 1 do
            H_pq ← k′_i[j·d + q] · g_i[j·d + p] / k_i[j·d + q]
            H_i[j·d·d + p·d + q] ← H_pq;  H_i[j·d·d + q·d + p] ← H_pq
        end for
    end for
    return f_i, g_i, H_i
end procedure

procedure UPDATE(f_i, g_i, H_i, S, w, i)                     ⊳ accumulate the weighted, scaled contribution
    j ← blockDim.x · blockIdx.x + threadIdx.x
    f[j] ← f[j] + w[i] · f_i[j]
    g(j) ← g(j) + w[i] · Sᵀ(i) g_i(j)                         ⊳ nested loops over p, q
    H(j) ← H(j) + w[i] · Sᵀ(i) H_i(j) S(i)                    ⊳ via a temporary matrix H_tmp
    return f, g, H
end procedure
since we are only interested in 2D and 3D data in this section, the for-loop trip count is only 2 or 3, which is inefficient; these loops should be unrolled to remove the loop overhead.
FIGURE 5.3: Memory access patterns of matrices and cubes. (a) Memory access pattern of a column-major matrix. (b) Memory access pattern of a row-major matrix. (c) Memory access pattern of a column-major cube (3D matrix). (d) Memory access pattern of a slice-major cube (3D matrix).
Therefore, to solve these problems, we propose an optimized implementation in Algorithm 4. Here, we merge the four small gpu-kernels into two larger gpu-kernels to avoid redundant global memory transactions. However, large gpu-kernels usually consume more registers because they use many variables. As discussed in Section 4.3, the GPU has limited register resources; if a gpu-kernel uses too many registers, it cannot achieve high occupancy, which results in bad performance. Hence, in optimized implementation I, we reduce register usage by reusing previous results and local variables as much as possible.

We also change the column-major memory layout of matrices and cubes into row-major (second dimension first) and slice-major (third dimension first) layouts, respectively. As Figure 5.3 (b) and (d) show, the memory accesses are then contiguous in either case.

Furthermore, we unroll the loops in the optimized implementation. Since the order of the statements is no longer restricted to what it was inside the loop, loop unrolling not only avoids executing loop control instructions but also gives us more flexible control over the statements that were inside the for loop.
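A small CUDA sketch of the combined effect of the layout change and unrolling, in the spirit of the ShiftAndScale step of Algorithm 4: with component c of test point j stored at x[j + n*c], consecutive threads read consecutive words, and the fixed 3×3 matrix–vector product is written out without loops. The names and the caller-managed pointers Xi and Si (the i-th training point and its 3×3 scale matrix) are illustrative assumptions.

```cpp
__global__ void shiftAndScale3D(const float* x, const float* Xi, const float* Si,
                                float* y, int n) {
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j >= n) return;
    // Row-major layout: component c of test point j lives at x[j + n*c],
    // so threads j and j+1 touch adjacent addresses (coalesced).
    float t0 = x[j + n * 0] - Xi[0];
    float t1 = x[j + n * 1] - Xi[1];
    float t2 = x[j + n * 2] - Xi[2];
    // Unrolled 3x3 matrix-vector product y = S^T t: no loop-control overhead.
    y[j + n * 0] = t0 * Si[0] + t1 * Si[3] + t2 * Si[6];
    y[j + n * 1] = t0 * Si[1] + t1 * Si[4] + t2 * Si[7];
    y[j + n * 2] = t0 * Si[2] + t1 * Si[5] + t2 * Si[8];
}
```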
Optimization II – Simplified Math Expressions
In Optimization I, we improved the implementation from the perspective of the GPU code. However, the implementation can also be improved at the algorithm level. If we look at the kernel function itself, we find that a better representation can be used to simplify the KDDE and thus speed up the calculation. As we know from Equation (2.9), a separable multivariate Gaussian kernel can be written as
$$K(x) = \prod_{l=1}^{d} \frac{1}{\sqrt{2\pi}}\, e^{-x_l^2/2}. \qquad (5.12)$$
If we evaluate this equation directly, it results in computing the exponential function d times. However, we can use the properties of the exponential function and simplify the above equation to
$$K(x) = \left(\frac{1}{\sqrt{2\pi}}\right)^d e^{-\frac{1}{2} x^T x}. \qquad (5.13)$$
The simplified kernel function requires only one exponential evaluation. Similarly, instead of calculating Equations (2.10) and (2.11), the gradient and Hessian of the separable multivariate Gaussian kernel can be given as
$$\nabla K(x) = -\left(\frac{1}{\sqrt{2\pi}}\right)^d e^{-\frac{1}{2} x^T x}\, x, \qquad (5.14)$$
$$\nabla^2 K(x) = \left(\frac{1}{\sqrt{2\pi}}\right)^d e^{-\frac{1}{2} x^T x} \left(x x^T - I\right). \qquad (5.15)$$
Algorithm 4 Optimized GPU Kernel Density Derivative Estimation I
procedure KDDE(x, X, S, w)
    m ← SIZE(x, 1);  n ← SIZE(X, 1)
    f ← ZEROS(n, 1);  g ← ZEROS(n, 3);  H ← ZEROS(n, 9)
    for i ← 0, n − 1 do
        y_i ← SHIFTANDSCALE(x, X, S, i)
        f, g, H ← KERNELCORE(y_i, S, w[i], i)
    end for
    return f, g, H
end procedure

procedure SHIFTANDSCALE(x, X, S, i)                           ⊳ one thread per test point, row-major layout
    j ← blockDim.x · blockIdx.x + threadIdx.x
    n ← SIZE(x, 1);  m ← SIZE(X, 1)
    (t0, t1, t2) ← (x[j + n·0] − X[i + m·0], x[j + n·1] − X[i + m·1], x[j + n·2] − X[i + m·2])
    y_i[j + n·0] ← t0·S[i·9+0] + t1·S[i·9+3] + t2·S[i·9+6]
    y_i[j + n·1] ← t0·S[i·9+1] + t1·S[i·9+4] + t2·S[i·9+7]
    y_i[j + n·2] ← t0·S[i·9+2] + t1·S[i·9+5] + t2·S[i·9+8]
    return y_i
end procedure

procedure KERNELCORE(y_i, S, w_i, i)                          ⊳ merged, fully unrolled 3-D update
    j ← blockDim.x · blockIdx.x + threadIdx.x;  n ← SIZE(y_i, 1)
    load y0, y1, y2 ← y_i[j + n·0 .. j + n·2] and the nine entries of S(i) into registers
    f_ij ← w_i · c · EXP(−0.5·(y0² + y1² + y2²))               ⊳ common factor shared by f, g and H
    f[j] ← f[j] + f_ij
    g[j + n·0 .. j + n·2] ← g[j + n·0 .. j + n·2] − f_ij · Sᵀ(i) y          ⊳ three unrolled dot products
    H[j + n·0 .. j + n·8] ← H[j + n·0 .. j + n·8] + f_ij · Sᵀ(i)(y yᵀ − I)S(i)   ⊳ unrolled symmetric update
    return f, g, H
end procedure
Therefore, the simplified kernel density, kernel gradient and kernel curvature estimators can be written as
$$\hat{f}(x_i; S_j, \omega_j) = \left(\frac{1}{\sqrt{2\pi}}\right)^d \sum_{j=1}^{n} \omega_j |S_j|\, e^{-\frac{1}{2}(x_i - X_j)^T S_j^T S_j (x_i - X_j)}, \qquad (5.16)$$
$$\nabla \hat{f}(x_i; S_j, \omega_j) = -\left(\frac{1}{\sqrt{2\pi}}\right)^d \sum_{j=1}^{n} \omega_j |S_j|\, e^{-\frac{1}{2}(x_i - X_j)^T S_j^T S_j (x_i - X_j)}\, S_j^T S_j (x_i - X_j), \qquad (5.17)$$
$$\nabla^2 \hat{f}(x_i; S_j, \omega_j) = \left(\frac{1}{\sqrt{2\pi}}\right)^d \sum_{j=1}^{n} \omega_j |S_j|\, e^{-\frac{1}{2}(x_i - X_j)^T S_j^T S_j (x_i - X_j)} \left( S_j^T S_j (x_i - X_j)(x_i - X_j)^T S_j^T S_j - S_j^T S_j \right). \qquad (5.18)$$

The advantage of these simplified forms is that they not only simplify the computation but also decrease the number of variables needed. First, they contain a common factor $\omega_j |S_j| e^{-\frac{1}{2}(x_i - X_j)^T S_j^T S_j (x_i - X_j)}$, so this expression only needs to be computed once and its result saved for reuse. Second, the expression $S_j^T S_j (x_i - X_j)$ appears repeatedly, so its value can also be used in multiple places. Third, if $S_j^T S_j$ is computed in advance, there are no square matrix multiplications left. Since square matrix multiplication involves many additions and multiplications and needs many variables to store temporary results, the simplified forms greatly reduce register usage.
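As a device-side sketch of how this reuse looks in code (an illustration consistent with Equations (5.16)–(5.18), not the library's exact routine), one training point's contribution can be accumulated from a single shared scalar factor and a single reused vector u = SᵀS(x_i − X_j); here SS holds SᵀS row-major, c holds the precomputed ω|S|(2π)^(-3/2), and t = x_i − X_j.

```cpp
#include <cuda_runtime.h>
#include <math.h>

// One training point's contribution to the density f, gradient g and the six unique
// Hessian entries H at a test point, using the simplified forms (5.16)-(5.18).
__host__ __device__ inline
void accumulateOne(const float SS[9], float c, const float t[3],
                   float& f, float g[3], float H[6]) {
    // u = (S^T S) t, reused by the exponent, the gradient and the Hessian.
    float u0 = SS[0] * t[0] + SS[1] * t[1] + SS[2] * t[2];
    float u1 = SS[3] * t[0] + SS[4] * t[1] + SS[5] * t[2];
    float u2 = SS[6] * t[0] + SS[7] * t[1] + SS[8] * t[2];
    // common scalar factor: w |S| (2*pi)^(-d/2) exp(-0.5 t^T S^T S t), with t^T S^T S t = t . u
    float e = c * expf(-0.5f * (t[0] * u0 + t[1] * u1 + t[2] * u2));
    f += e;
    g[0] -= e * u0;  g[1] -= e * u1;  g[2] -= e * u2;   // gradient term  -e u
    H[0] += e * (u0 * u0 - SS[0]);                      // Hessian term  e (u u^T - S^T S)
    H[1] += e * (u0 * u1 - SS[1]);
    H[2] += e * (u0 * u2 - SS[2]);
    H[3] += e * (u1 * u1 - SS[4]);
    H[4] += e * (u1 * u2 - SS[5]);
    H[5] += e * (u2 * u2 - SS[8]);
}
```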
What’s more, in Optimization I, we need to call the gpu-kernels for each training point. Since
the number of training points is usually very large, there will be a great number of gpu-kernel calls,
which results a significant kernel call overhead. One solution of this problem is to design a big kernel
to enclose the outer for loop. Then, there will be only one gpu-kernel call. Previously, this solution is
not practical because such a large gpu-kernel will consume too much GPU resources, which lead to a
low GPU occupancy. But due to the low variable usage of the simplified forms, it is now possible to
merge all the gpu-kernel calls into just one single call.
Based on the analysis above, we propose our optimized implementation in Algorithm 5. We
can find that the main function KDDE now contains only two gpu-kernels SquareAndDet and
LinCombKernels. SquareAndDet is responsible for calculating the squared scale S Tj S j and the scale
determinant |S j | for each training point. The outer for loop is now moved into the LinCombKernels,
which basically computes Equation (5.16), (5.17), and (5.18).
Optimization III – Exploiting Temporal Locality Using Shared Memory
As we mentioned in Section 4.1, shared memory is a programmable cache on GPU. It is way more
faster than the off-chip global memory. However, to utilize the shared memory efficiently, there has
Algorithm 5 Optimized GPU Kernel Density Derivative Estimation II
procedure KDDE(x, X, S, w)
    SS, c ← SQUAREANDDET(S, w)
    f, g, H ← LINCOMBKERNELS(x, X, S, SS, c)
    return f, g, H
end procedure

procedure SQUAREANDDET(S, w)                                  ⊳ one thread per training point
    i ← blockDim.x · blockIdx.x + threadIdx.x
    load the nine entries s11, …, s33 of S(i)
    SS[i·6+0 .. i·6+5] ← the six unique entries of Sᵀ(i)S(i)
    c[i] ← w[i] · c · ABS(det S(i))                            ⊳ common scalar factor of the i-th training point
    return SS, c
end procedure

procedure LINCOMBKERNELS(x, X, S, SS, c)                      ⊳ one thread per test point, loop over all training points
    i ← blockDim.x · blockIdx.x + threadIdx.x
    n ← SIZE(x, 1);  m ← SIZE(X, 1);  f_i ← 0
    load x_i0, x_i1, x_i2 ← x[i + n·0 .. i + n·2]
    for j ← 0, m − 1 do
        t ← (x_i0, x_i1, x_i2) − X(j)
        u ← S(j) t;   v ← Sᵀ(j)S(j) t                           ⊳ read from S and the precomputed SS
        f_ij ← c[j] · EXP(−0.5 · ‖u‖²)
        f_i ← f_i + f_ij
        g[i + n·0 .. i + n·2] ← g[i + n·0 .. i + n·2] − f_ij · v
        H[i + n·0 .. i + n·8] ← H[i + n·0 .. i + n·8] + f_ij · (v vᵀ − Sᵀ(j)S(j))   ⊳ six unique entries, mirrored at the end
    end for
    f[i] ← f_i;  mirror the symmetric entries of H
    return f, g, H
end procedure
to be enough temporal locality (reuse of data) in the gpu-kernel.
Before introducing our shared-memory optimized implementation, let us look at the memory access pattern of Optimization II. Assume m is the number of training points, n is the number of threads (equal to the number of test points), and r is the number of variables that access data in global memory; the memory access pattern of Optimization II can then be illustrated as in Figure 5.4. The colored block groups V1, V2, ..., Vr are arrays with m elements each, and they all reside in global memory. Gray circles stand for threads. Each step is one iteration of the for loop in Optimization II. Each thread evaluates the kernel density, kernel gradient and kernel curvature for a single test point. Since these evaluations are similar for every thread, the threads usually read the same data from global memory in each step: in the figure, in the first step all the threads read the first element of V1, V2, ..., Vr. Since there are n threads, the same data is read n times, which is obviously inefficient. To analyze this problem quantitatively, we compute the total number of global memory accesses in this case,
$$M_g = m \times n \times r. \qquad (5.19)$$
We can solve this problem by introducing shared memory. The new memory access pattern is shown in Figure 5.5. Here, the colored blocks stand for data in global memory; each color corresponds to one of the arrays in Figure 5.4, and the number in a block is the index of an element in the array. The reason we choose this memory layout instead of the layout in Figure 5.4 is that with this arrangement the data read by the threads is consecutive, which reduces global memory transactions thanks to memory coalescing. The threads are divided into colored groups, each of which stands for a thread block as described in Section 4.2. Since data in shared memory can only be accessed by threads of the same block, the shared memory is drawn separately for each thread block.

The idea of using shared memory is that if certain data is used several times, we can store it in shared memory first and then read it directly from shared memory on later uses. As Figure 5.5 shows, the threads in the same block read the same data from global memory only once (threads in different blocks still have to read the same data repeatedly, because shared memory cannot be accessed across thread blocks). The data can then be accessed by the other threads of the same block directly from shared memory. The total number of global memory accesses is
$$M_g = m \times r \times \left\lceil \frac{n}{b} \right\rceil \times \frac{1}{c}, \qquad (5.20)$$
Algorithm 6 Optimized GPU Kernel Density Derivative Estimation III
procedure KDDE(x, X, S, w)
    m, d ← SIZE(X);  r ← 1 + d + d·d + (1 + d)·d/2             ⊳ 19 packed values per training point for d = 3
    for every training point i, pack X(i) and S(i) into y[i·r + 0 .. i·r + r − 1]
    y ← SQUAREANDDET(y, w)
    f, g, H ← LINCOMBGAUSSIANKERNELS(x, y)
    return f, g, H
end procedure

procedure SQUAREANDDET(y, w)                                  ⊳ one thread per training point
    i ← (blockDim.x · blockIdx.x + threadIdx.x) · 19
    load S(i) from y[i + 3 .. i + 11]
    y[i + 12] ← w[i/19] · c · ABS(det S(i))                    ⊳ common scalar factor
    y[i + 13 .. i + 18] ← the six unique entries of Sᵀ(i)S(i)
    return y
end procedure

procedure LINCOMBGAUSSIANKERNELS(x, y)                        ⊳ one thread per test point
    i ← blockDim.x · blockIdx.x + threadIdx.x
    shared t[6144]                                             ⊳ shared-memory tile of packed training data
    initialize the accumulators f_i, g_i0..2, H11_i, …, H33_i to 0;  load x_i0, x_i1, x_i2
    for j ← 0, m − 1 do
        if MOD(j, 323) = 0 then                                ⊳ stage the next 323 training points (323 · 19 ≤ 6144)
            SYNCTHREADS();  cooperatively copy y[j·19 …] into t;  SYNCTHREADS()
        end if
        read X(j), S(j), Sᵀ(j)S(j) and the common factor from the tile t
        accumulate f_i, g_i· and H_i· exactly as in LINCOMBKERNELS of Algorithm 5, with all reads served from shared memory
    end for
    f[i] ← f_i;  write g_i· and the mirrored symmetric entries of H_i· to g and H
    return f, g, H
end procedure
FIGURE 5.4: Memory access pattern without using shared memory.
FIGURE 5.5: Memory access pattern using shared memory.
where b is the block size and c is the memory coalescing factor. We can see that the number of global
memory accesses using shared memory is reduced by a factor of b × c. The total number of shared
memory accesses is

$$M_s = m \times n \times r \times \frac{1}{w}, \qquad (5.21)$$

where w is the half-warp size. The factor 1/w appears because all the threads always read data at the
same location in each step, so the shared memory access is broadcast to the threads in the same warp.
The outline of this optimized implementation is shown in Algorithm 6.
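To make the shared-memory staging in Algorithm 6 more concrete, below is a minimal CUDA sketch of the same pattern, reduced to the density term only. The tile size, the 19-value parameter packing per training point, and the explicit -1/2 factor in the exponent are illustrative assumptions and do not reproduce the library's exact kernel.

#define TILE 256   // assumed number of training points staged per tile
#define R    19    // assumed values packed per training point (d = 3)

// Sketch: one thread per test point; training-point parameters are staged
// through shared memory so each value is read from global memory once per block.
__global__ void kdde_shared_sketch(const float *x, const float *y,
                                   float *f, int n, int m)
{
    __shared__ float t[TILE * R];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float xi0 = (i < n) ? x[i + n * 0] : 0.0f;   // column-major test points
    float xi1 = (i < n) ? x[i + n * 1] : 0.0f;
    float xi2 = (i < n) ? x[i + n * 2] : 0.0f;
    float fi = 0.0f;

    for (int base = 0; base < m; base += TILE) {
        int count = min(TILE, m - base);
        __syncthreads();                                 // previous tile finished
        for (int k = threadIdx.x; k < count * R; k += blockDim.x)
            t[k] = y[base * R + k];                      // coalesced cooperative load
        __syncthreads();

        for (int j = 0; j < count; ++j) {                // broadcast reads from shared memory
            const float *p = &t[j * R];
            float t0 = xi0 - p[0], t1 = xi1 - p[1], t2 = xi2 - p[2];
            float u0 = t0 * p[3] + t1 * p[6] + t2 * p[9];
            float u1 = t0 * p[4] + t1 * p[7] + t2 * p[10];
            float u2 = t0 * p[5] + t1 * p[8] + t2 * p[11];
            // p[12] holds the per-point weight; standard Gaussian form assumed here
            fi += p[12] * expf(-0.5f * (u0 * u0 + u1 * u1 + u2 * u2));
        }
    }
    if (i < n) f[i] = fi;
}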
5.3 Efficient k-Nearest Neighbors Bandwidth Selection For Images
In Section 2.4.2, we introduced a k-nearest neighbors based bandwidth selection method. The key
step of this method is to calculate the covariance matrix of the k nearest neighbors of each training
point X_j, j = 1, ..., n. To compute the covariance matrix at X_j, a naive implementation first finds
the k nearest neighbors of X_j and then calculates the covariance matrix using Equation (2.21).
However, to find the k nearest neighbors, one needs to calculate the distances between this point and
all other training points, and then find the k nearest neighbors of X_j by sorting the resulting
distances. It is easy to show that such a k-nearest neighbors search has O(n^2) time complexity,
where n is the size of the training set. Since the training set is usually very large, this is clearly
computationally intensive. Therefore, to avoid the direct k-nearest neighbors search, we propose a
covariance filtering based algorithm in this section.
5.3.1 k-Nearest Neighbors Covariance Matrix of Images
Given a set of d-dimensional training points S = {x_1, x_2, ..., x_n} and an image intensity function

$$I(x) = \begin{cases} 1, & x \in S, \\ 0, & \text{otherwise}, \end{cases} \qquad (5.22)$$
then the k-nearest neighbors covariance matrix at xi can be written as
$$C(x_i) = \frac{1}{k}\sum_{j=1}^{k} (x_i - x_{p(i,j)})(x_i - x_{p(i,j)})^T, \quad i = 1, \ldots, n, \qquad (5.23)$$
Algorithm 7 Naive k-Nearest Neighbors Bandwidth Selection
1:  procedure IMAGEBANDWIDTHSELECTION(I, k, σ)
2:      for each point x in the image such that I(x) ≠ 0 do
3:          i ← 0
4:          for each point y_i in the image such that I(y_i) ≠ 0 and x ≠ y_i do
5:              d_i ← calculate the distance between x and y_i
6:              i ← i + 1
7:          end for
8:          p(1), p(2), ..., p(k) ← find the indices of the k smallest distances in D = {d_0, d_1, ..., d_{i−1}}
9:          the covariance matrix C(x) ← 0
10:         for i ← 1, k do
11:             C(x) ← C(x) + (1/k)(x − y_{p(i)})(x − y_{p(i)})^T
12:         end for
13:         Q(x), Λ(x) ← EIGENDECOMPOSITION(C(x))
14:         S(x) ← σ^{−1} Λ(x)^{−1/2} Q(x)^T
15:     end for
16:     return S
17: end procedure
where the function p(i, j) returns the index of the j-th nearest neighbor of x_i. For a 2D image, the
training point is x_i = [x_i, y_i]^T with x_i, y_i ∈ Z, i = 1, ..., n; then

$$C(x_i) = \frac{1}{k}\sum_{j=1}^{k}\left(\begin{pmatrix} x_i \\ y_i \end{pmatrix} - \begin{pmatrix} x_{p(i,j)} \\ y_{p(i,j)} \end{pmatrix}\right)\left(\begin{pmatrix} x_i \\ y_i \end{pmatrix} - \begin{pmatrix} x_{p(i,j)} \\ y_{p(i,j)} \end{pmatrix}\right)^{T} = \begin{pmatrix} \frac{1}{k}\sum_{j=1}^{k}(x_i - x_{p(i,j)})^2 & \frac{1}{k}\sum_{j=1}^{k}(x_i - x_{p(i,j)})(y_i - y_{p(i,j)}) \\ \frac{1}{k}\sum_{j=1}^{k}(x_i - x_{p(i,j)})(y_i - y_{p(i,j)}) & \frac{1}{k}\sum_{j=1}^{k}(y_i - y_{p(i,j)})^2 \end{pmatrix}. \qquad (5.24)$$
Let
$$C_{11}(x_i) = \frac{1}{k}\sum_{j=1}^{k}(x_i - x_{p(i,j)})^2, \quad C_{12}(x_i) = \frac{1}{k}\sum_{j=1}^{k}(x_i - x_{p(i,j)})(y_i - y_{p(i,j)}), \quad C_{22}(x_i) = \frac{1}{k}\sum_{j=1}^{k}(y_i - y_{p(i,j)})^2, \qquad (5.25)$$

then Equation (5.24) can be written as

$$C(x_i) = \begin{pmatrix} C_{11}(x_i) & C_{12}(x_i) \\ C_{12}(x_i) & C_{22}(x_i) \end{pmatrix}. \qquad (5.26)$$
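As a small worked example with hypothetical pixel coordinates (purely for illustration), let x_i = (5, 5) and suppose its k = 3 nearest neighbors are (4, 5), (5, 6) and (6, 4). The coordinate differences (x_i − x_{p(i,j)}, y_i − y_{p(i,j)}) are (1, 0), (0, −1) and (−1, 1), so

$$C_{11}(x_i) = \tfrac{1}{3}(1 + 0 + 1) = \tfrac{2}{3}, \quad C_{12}(x_i) = \tfrac{1}{3}(0 + 0 - 1) = -\tfrac{1}{3}, \quad C_{22}(x_i) = \tfrac{1}{3}(0 + 1 + 1) = \tfrac{2}{3},$$

and C(x_i) is the 2 × 2 matrix with these entries, as in Equation (5.26).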
Similarly, for a 3D image, the training point is x_i = [x_i, y_i, z_i]^T with x_i, y_i, z_i ∈ Z, i = 1, ..., n, and we have

$$C(x_i) = \begin{pmatrix} C_{11}(x_i) & C_{12}(x_i) & C_{13}(x_i) \\ C_{12}(x_i) & C_{22}(x_i) & C_{23}(x_i) \\ C_{13}(x_i) & C_{23}(x_i) & C_{33}(x_i) \end{pmatrix}, \qquad (5.27)$$

where

$$C_{13}(x_i) = \frac{1}{k}\sum_{j=1}^{k}(x_i - x_{p(i,j)})(z_i - z_{p(i,j)}), \quad C_{23}(x_i) = \frac{1}{k}\sum_{j=1}^{k}(y_i - y_{p(i,j)})(z_i - z_{p(i,j)}), \quad C_{33}(x_i) = \frac{1}{k}\sum_{j=1}^{k}(z_i - z_{p(i,j)})^2. \qquad (5.28)$$
5.3.2 r-Neighborhood Covariance Matrix of Images
Our algorithm is based on the fact that the pixel locations are evenly distributed on the image. Thus,
the neighbor search can potentially be carried out by filtering. In this section, we give a simple
problem for which the neighbor search can easily be completed by filtering.
Consider a training point x_i ∈ S and the set of training points whose distance from x_i is smaller
than r, i.e. N_r(x_i) = {x | x ∈ S, x ≠ x_i, ‖x − x_i‖ < r}; we want to calculate the covariance
matrix of N_r(x_i) at x_i. Here, N_r(x_i) is called the r-neighborhood of x_i. For a 2D image, according
to Equation (5.26) the covariance matrix can be written as

$$D(x_i) = \begin{pmatrix} D_{11}(x_i) & D_{12}(x_i) \\ D_{12}(x_i) & D_{22}(x_i) \end{pmatrix}, \qquad (5.29)$$

where

$$D_{11}(x_i) = |N_r(x_i)|^{-1}\sum_{x \in N_r(x_i)}(x_i - x)^2, \quad D_{12}(x_i) = |N_r(x_i)|^{-1}\sum_{x \in N_r(x_i)}(x_i - x)(y_i - y), \quad D_{22}(x_i) = |N_r(x_i)|^{-1}\sum_{x \in N_r(x_i)}(y_i - y)^2, \qquad (5.30)$$
and |Nr (xi )| denotes the number of elements in Nr (xi ). Define the covariance operators h11 , h12 ,
and h22 and the disk operator d as follows,
$$h_{11}(x) = \begin{cases} x^2, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \qquad h_{12}(x) = \begin{cases} xy, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \qquad h_{22}(x) = \begin{cases} y^2, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \qquad d(x) = \begin{cases} 1, & \|x\| < r, \\ 0, & \text{otherwise}. \end{cases} \qquad (5.31)$$
Here, both the covariance and disk operators can be expressed by (2r + 1) × (2r + 1) matrices as
illustrated in Figure 5.6. It can be easily proved that
$$I(x_i) * h_{11}(x_i) = \sum_{x \in N_r(x_i)}(x_i - x)^2, \quad I(x_i) * h_{12}(x_i) = \sum_{x \in N_r(x_i)}(x_i - x)(y_i - y), \quad I(x_i) * h_{22}(x_i) = \sum_{x \in N_r(x_i)}(y_i - y)^2, \quad I(x_i) * d(x_i) = |N_r(x_i)|. \qquad (5.32)$$
Thus, the covariance matrix of the neighborhood N_r(x_i) can be calculated by

$$D(x_i) = \begin{pmatrix} D_{11}(x_i) & D_{12}(x_i) \\ D_{12}(x_i) & D_{22}(x_i) \end{pmatrix} = \frac{1}{I(x_i) * d(x_i)} \begin{pmatrix} I(x_i) * h_{11}(x_i) & I(x_i) * h_{12}(x_i) \\ I(x_i) * h_{12}(x_i) & I(x_i) * h_{22}(x_i) \end{pmatrix}. \qquad (5.33)$$
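As an illustration (a sketch under assumptions, not the library's API), the masks of Equation (5.31) could be built on the host as follows; convolving the binary image I with these four masks then yields the four quantities in Equation (5.32). The struct and function names here are hypothetical.

#include <vector>

// Holds the (2r+1) x (2r+1) disk and 2D covariance operators of Equation (5.31).
struct CovOperators2D {
    int size;                              // 2r + 1
    std::vector<float> d, h11, h12, h22;   // row-major masks
};

CovOperators2D make_operators(int r) {
    int s = 2 * r + 1;
    CovOperators2D op{s, std::vector<float>(s * s, 0.f),
                         std::vector<float>(s * s, 0.f),
                         std::vector<float>(s * s, 0.f),
                         std::vector<float>(s * s, 0.f)};
    for (int y = -r; y <= r; ++y) {
        for (int x = -r; x <= r; ++x) {
            if (x * x + y * y >= r * r) continue;   // keep only offsets with ||x|| < r
            int idx = (y + r) * s + (x + r);
            op.d[idx]   = 1.f;                      // disk operator
            op.h11[idx] = float(x * x);             // x^2 term
            op.h12[idx] = float(x * y);             // xy term
            op.h22[idx] = float(y * y);             // y^2 term
        }
    }
    return op;
}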
FIGURE 5.6: The covariance and disk operators of r = 4. (a): disk operator. (b) h11 covariance operator. (c) h12 covariance operator. (d) h22 covariance operator.
Similarly, for a 3D image the covariance matrix of the neighborhood N_r(x_i) is given by

$$D(x_i) = \begin{pmatrix} D_{11}(x_i) & D_{12}(x_i) & D_{13}(x_i) \\ D_{12}(x_i) & D_{22}(x_i) & D_{23}(x_i) \\ D_{13}(x_i) & D_{23}(x_i) & D_{33}(x_i) \end{pmatrix} = \frac{1}{I(x_i) * d(x_i)} \begin{pmatrix} I(x_i) * h_{11}(x_i) & I(x_i) * h_{12}(x_i) & I(x_i) * h_{13}(x_i) \\ I(x_i) * h_{12}(x_i) & I(x_i) * h_{22}(x_i) & I(x_i) * h_{23}(x_i) \\ I(x_i) * h_{13}(x_i) & I(x_i) * h_{23}(x_i) & I(x_i) * h_{33}(x_i) \end{pmatrix}, \qquad (5.34)$$
where
$$h_{13}(x) = \begin{cases} xz, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \qquad h_{23}(x) = \begin{cases} yz, & \|x\| < r, \\ 0, & \text{otherwise}, \end{cases} \qquad h_{33}(x) = \begin{cases} z^2, & \|x\| < r, \\ 0, & \text{otherwise}. \end{cases} \qquad (5.35)$$

5.3.3 Algorithm
FIGURE 5.7: Searching circles of different radii. Assume k = 6; there are only 2 neighboring training points inside the green searching circle of radius 3. Thus, we increase the searching radius by one and find that the orange searching circle of radius 4 contains 6 neighboring training points. Therefore, if we choose a searching radius r = 4, we have C(x) = D(x).
Based on the discussion in Sections 5.3.1 and 5.3.2, we propose our efficient k-nearest neighbors
bandwidth selection algorithm in this section. From the definition of N_r(x_i), we know that for all
x ∈ S with x ∉ N_r(x_i) we have ‖x − x_i‖ ≥ r. Thus, all the training points inside N_r(x_i) are closer
to x_i than the training points outside N_r(x_i), and the r-neighborhood N_r(x_i) can also be viewed as
the |N_r(x_i)|-nearest neighbors of x_i. Therefore, C(x_i) = D(x_i) if and only if k = |N_r(x_i)|. Since
D(x_i) can easily be calculated by filtering, as long as we can find a proper r such that |N_r(x_i)| = k,
we can compute C(x_i) from D(x_i) efficiently.
Therefore, we need to find the correct searching radius r for x_i. Assume that all points within the
searching circle (or sphere) are training points, which means πr² = k (or (4/3)πr³ = k); then we can
set the initial value of r to ⌈(k/π)^{1/2}⌉ (or ⌈(3k/(4π))^{1/3}⌉). According to Equation (5.32), we can
calculate |N_r(x_i)| using the disk operator. We then compare |N_r(x_i)| with k: if |N_r(x_i)| < k, we
increase r until |N_r(x_i)| ≥ k for all x_i in S. The increment of r determines the performance of this
algorithm: a small increment results in a very accurate approximation of C(x_i), but the speed will be
relatively slow. The minimal choice of the increment is 1, since the training points are indices of
pixels on the image. It should be pointed out that different training points x_i have different searching
radii, so we need to update each C(x_i) according to its own searching radius r. After we have
obtained all the covariance matrices, we can then apply eigendecomposition to these matrices and
calculate the bandwidths accordingly. The outline of this algorithm is shown in Algorithm 8.
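A minimal sketch of the radius initialization described above (the function names are assumptions for illustration):

#include <cmath>

// Initial search radius assuming every pixel/voxel inside the circle/sphere is a
// training point: pi*r^2 = k in 2D, (4/3)*pi*r^3 = k in 3D.
static const double PI = 3.14159265358979323846;

int initial_radius_2d(int k) { return (int)std::ceil(std::sqrt(k / PI)); }
int initial_radius_3d(int k) { return (int)std::ceil(std::cbrt(3.0 * k / (4.0 * PI))); }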
5.3.4 GPU Implementation
One advantage of our algorithm is that it can easily be accelerated by GPUs. First, the calculation of
the r-neighborhood covariance matrices involves many image convolutions, which can easily be
done on GPUs. One way to achieve this is to calculate the convolution directly via Matlab's Parallel
Computing Toolbox (PCT), which provides the built-in GPU accelerated functions conv2 and convn
for 2D and 3D convolution, respectively. The other way is to perform the convolution through the
FFT, based on the convolution theorem [37],

$$\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}. \qquad (5.36)$$

The time complexity of the FFT is O(n log₂ n) [38], which is much faster than direct convolution's
O(n²) when n is large. Many GPU packages are available for high-performance FFT implementations,
such as PCT, Jacket, CUFFT, and OpenCL FFT.
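As a sketch of the FFT route (illustrative only, not the library's code), the routine below convolves an image with an operator using cuFFT. Both arrays are assumed to be zero-padded to a common size and already uploaded to the device as complex data; the function and variable names are assumptions.

#include <cufft.h>

// Complex pointwise product, scaled by 1/(rows*cols) because cuFFT's inverse
// transform is unnormalized.
__global__ void pointwise_mul(cufftComplex *a, const cufftComplex *b,
                              float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex x = a[i], y = b[i];
    a[i].x = (x.x * y.x - x.y * y.y) * scale;
    a[i].y = (x.x * y.y + x.y * y.x) * scale;
}

// Circular convolution of a zero-padded image with a zero-padded operator,
// following Equation (5.36); the result overwrites d_img.
void fft_convolve_2d(cufftComplex *d_img, cufftComplex *d_op, int rows, int cols)
{
    cufftHandle plan;
    cufftPlan2d(&plan, rows, cols, CUFFT_C2C);
    cufftExecC2C(plan, d_img, d_img, CUFFT_FORWARD);   // F{f}
    cufftExecC2C(plan, d_op,  d_op,  CUFFT_FORWARD);   // F{g}
    int n = rows * cols, threads = 256, blocks = (n + threads - 1) / threads;
    pointwise_mul<<<blocks, threads>>>(d_img, d_op, 1.0f / n, n);
    cufftExecC2C(plan, d_img, d_img, CUFFT_INVERSE);   // f * g
    cufftDestroy(plan);
}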
Second, our algorithm needs to perform an eigendecomposition of the covariance matrix for each
training point in the image. Since there are a large number of training points and their
eigendecompositions are independent, it is a good idea to put these computations on the GPU.
Several GPU libraries (CULA, MAGMA, etc.) are available for computing QR decompositions, but
they are only efficient and competitive for large matrices, at least over 1000 × 1000 [39]. Since the
covariance matrices in our case are only 2 × 2 or 3 × 3, we implemented our own GPU based
function for this task.
Algorithm 8 Efficient k-Nearest Neighbors Bandwidth Selection
1:  procedure IMAGEBANDWIDTHSELECTION(I, k, σ)
2:      if the image I is 2D then
3:          Initialize the filtering radius r ← ⌈(k/π)^{1/2}⌉
4:      else if the image I is 3D then
5:          Initialize the filtering radius r ← ⌈(3k/(4π))^{1/3}⌉
6:      end if
7:      for each point x in the image do
8:          the neighbor count N_r(x) ← 0
9:          the covariance matrix C(x) ← 0
10:     end for
11:     while there exists a point x in the image such that N_r(x) < k and I(x) ≠ 0 do
12:         d ← create a disk operator with radius r
13:         N_r ← filter the image I with the disk operator d
14:         if the image I is 2D then
15:             h11, h12, h22 ← create the covariance operators
16:             C11, C12, C22 ← filter the image I with covariance operators h11, h12, h22
17:         else if the image I is 3D then
18:             h11, h12, h13, h22, h23, h33 ← create the covariance operators
19:             C11, C12, C13 ← filter the image I with covariance operators h11, h12, h13
20:             C22, C23, C33 ← filter the image I with covariance operators h22, h23, h33
21:         end if
22:         for each point x in the image such that N_r(x) ≥ k and C(x) = 0 do
23:             if the image I is 2D then
24:                 C(x) ← N_r(x)^{−1} [C11(x) C12(x); C12(x) C22(x)]
25:             else if the image I is 3D then
26:                 C(x) ← N_r(x)^{−1} [C11(x) C12(x) C13(x); C12(x) C22(x) C23(x); C13(x) C23(x) C33(x)]
27:             end if
28:         end for
29:         r ← r + 1
30:     end while
31:     for each point x such that I(x) ≠ 0 do
32:         Q(x), Λ(x) ← EIGENDECOMPOSITION(C(x))
33:         S(x) ← σ^{−1} Λ(x)^{−1/2} Q(x)^T
34:     end for
35:     return S
36: end procedure
This function performs millions of small matrix eigendecompositions simultaneously. For more
details about the performance of our algorithm, please see Chapter 6.
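For the 2D case, each eigendecomposition has a closed form, so one thread per pixel is enough. The kernel below is a minimal sketch of this idea (the array layout and names are assumptions, not the library's actual routine); it returns both eigenvalues and the leading eigenvector of each 2 × 2 symmetric covariance matrix, with the second eigenvector being orthogonal to the first.

// One thread per pixel: closed-form eigendecomposition of [[a, b], [b, c]].
__global__ void eig2x2_sym(const float *C11, const float *C12, const float *C22,
                           float *lam1, float *lam2, float *qx, float *qy, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = C11[i], b = C12[i], c = C22[i];
    float tr   = a + c;
    float disc = sqrtf((a - c) * (a - c) + 4.f * b * b);
    float l1 = 0.5f * (tr + disc);            // larger eigenvalue
    float l2 = 0.5f * (tr - disc);            // smaller eigenvalue
    // Eigenvector for l1: (b, l1 - a); fall back to an axis vector when b ~ 0.
    float vx, vy;
    if (fabsf(b) > 1e-12f)      { vx = b;   vy = l1 - a; }
    else if (a >= c)            { vx = 1.f; vy = 0.f;    }
    else                        { vx = 0.f; vy = 1.f;    }
    float inv_norm = rsqrtf(vx * vx + vy * vy);
    lam1[i] = l1;  lam2[i] = l2;
    qx[i] = vx * inv_norm;  qy[i] = vy * inv_norm;
}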
Chapter 6
Experiments and Results
In this chapter, we present the experimental results of the efficient methods, optimization techniques,
and vesselness measures that we introduced in Chapters 3 and 5. We first introduce the hardware
environment of the experiments in Section 6.1. Then, in Section 6.2, we investigate the speed
performance of the efficient methods and optimization techniques used in our kernel smoothing
library. Finally, in Section 6.3, we test the overall performance of the kernel smoothing library when
applied to two medical imaging techniques.
6.1 Environment
The experiments were performed on two platforms. One platform is a GPU node on Northeastern
University's Discovery cluster. This node has an NVIDIA Tesla K20m GPU, dual Intel Xeon E5-2670
CPUs and 256GB RAM. The NVIDIA Tesla K20m GPU has 2496 CUDA cores (13 SMs, 192 cores
each), a 0.7GHz clock rate, 5GB GDDR RAM, and compute capability 3.0; we conduct all the GPU
experiments using the CUDA 6.5 Toolkit. The Intel Xeon E5-2670 CPU has a 2.6GHz clock rate and
8 physical cores. Since each physical core has 2 logical cores and there are 2 Intel Xeon E5-2670
CPUs, this platform has 32 logical cores in total.
The other platform is a computer with an Intel Core i7-3615QM CPU, an NVIDIA GeForce GT
650M GPU, and 8GB RAM. The NVIDIA GeForce GT 650M GPU has a 0.9GHz clock rate, 384
CUDA cores (2 SMs, 192 cores each), and 1GB GDDR RAM. The Intel Core i7-3615QM CPU has a
2.3GHz clock rate and 8 logical cores (4 physical cores, 2 logical cores each). A summary of the
specifications of these two platforms is given in Table 6.1.
                     GPU                                            CPU
Name                 NVIDIA Tesla K20m    NVIDIA GeForce GT 650M    Intel Xeon E5-2670    Intel Core i7-3615QM
Clock Rate           0.7 GHz              0.9 GHz                   2.60 GHz              2.30 GHz
GPU/CPU Cores        2496                 384                       8                     4
Device/Host Memory   5120 MB              1024 MB                   256 GB                8 GB

TABLE 6.1: Experiment environment.
6.2 Performance Evaluation
In this section, we present the experimental results of the algorithms introduced in Chapter 5. We first
evaluate the performance of the efficient separable multivariate kernel derivative (SMKD) algorithm
in Section 6.2.1 by providing a visualized complexity analysis as well as detailed running-time
comparisons between the naive and efficient algorithms. In Section 6.2.2, we compare the speed-ups
of different versions of the high-speed KDE and KDDE methods. A memory performance analysis is
also provided to illustrate the effectiveness of the different memory optimization techniques used in
these methods. Finally, the performance comparison of the efficient k-NN bandwidth selection
method on CPU and GPU platforms is given in Section 6.2.3.
6.2.1 Efficient SMKD
We perform all the experiments for the efficient SMKD algorithm on the Intel Xeon E5-2670 CPU
platform. For the first set of experiments, we compare the theoretical number of multiplications
between the naive and efficient algorithms at a single sample point for different dimensions and
orders. This comparison is based on the complexity analysis in Equations (5.9) and (5.10). The
experimental results are given in Figure 6.1. As can be seen from the figures, the efficient algorithm
outperforms the naive algorithm significantly as the dimension and derivative order increase. For
example, the top left figure shows that when the order and data dimension are low, there is only a
slight difference in the number of multiplications between the naive and efficient algorithms.
However, as the order and data dimension increase, as shown in the bottom right figure, the
multiplication count of the naive algorithm can be several times higher than that of the efficient
algorithm.
FIGURE 6.1: The comparison of the number of multiplications in computing different orders of derivatives of separable multivariate kernel function using the naive method and the proposed efficient method with dimensions from 1 to 40.
Next, we test the performance of the naive and efficient algorithms on the synthetic data. The
synthetic data is generated based on the univariate Gaussian kernel N (0, 1) and its derivatives. For
both algorithms, some basic memory optimization techniques are used to minimize the latency
introduced by memory operations. Thus, our experiments focus mostly on the computational
differences between these two algorithms. The experimental results are given in Figure 6.2. We
investigate the performance of these two algorithms at different orders and dimensions; here, we
choose orders from 1 to 6 and dimensions from 2 to 20. As we can see from the top left figure, when
the order is 1 and the dimension is smaller than 20, the proposed algorithm is slightly slower than the
naive algorithm. This is because there is a constant computational overhead when determining the
number of elements in the set N_i^{(j)}. As the order and dimension increase, the proposed efficient
algorithm outperforms the naive algorithm significantly. The growth rate of the execution time in
Figure 6.2 is consistent with the growth rate of the multiplication counts in Figure 6.1, which
confirms that our complexity analysis in Section 5.1.3 is correct.
6.2.2 High Performance KDE and KDDE
We perform our CPU experiments on the Intel Core i7-3615QM platform and our GPU experiments
on the NVIDIA Tesla K20m platform. Because the speed performance of the KDE and KDDE
algorithms is insensitive to the data type, our experiments are based on synthetic data: we generate
the synthetic training points and test points directly from random number generators.
The experiments are divided into three groups. In the first group, we investigate the speed-ups of our
different optimization methods on synthetic 2D data. In the second group, we examine the speed
performance of the different optimization methods on synthetic 3D data. In the final group, we
analyze the GPU device memory performance of these methods. In all these experiments, we use
functions from our kernel smoothing library that perform kernel density estimation, kernel gradient
estimation and kernel curvature estimation at the same time.
For the first group, we perform two sets of experiments. First, as can be seen in the left bar graph of
Figure 6.3, we present the speed-up comparison between the CPU serial, CPU parallel, and naive
GPU methods. Four experiments were performed with data sizes ranging from 10^7 to 10^10. The data
size is denoted by m × n, where m is the number of test points and n is the number of training
points. Here, we can see that the CPU parallel method is almost three to four times faster than the
CPU serial method, which is reasonable since the Intel Core i7-3615QM contains four CPU cores.
We can also see that the naive GPU method is much faster than the CPU methods; in particular, when
the data size is 10^10, the naive GPU method can be more than 100 times faster than the serial CPU
method. However, the naive GPU method does not achieve its best performance when the data size is
small, because it cannot achieve good occupancy when the workload is low.
FIGURE 6.2: The execution time of the naive method and the proposed efficient method in computing the different orders of derivatives of the multivariate kernel function on synthetic data. Here the number of samples is 10000, and the data dimension ranges from 2 to 20.
Second, the right bar graph shows the performance differences between the GPU optimization
methods. Since the GPU methods are much faster than the CPU methods, we use larger data sizes in
this set of experiments. We can see that the GPU Optimization I method is almost two to three times
faster than the naive method, which means the kernel merging, loop unrolling and memory layout
optimization techniques bring roughly a two to three times speed-up over the naive method.
FIGURE 6.3: The comparison of speed-ups between different optimization methods on synthetic 2D data.
We can also see that when the data size is large, the Optimization II method brings about a 7 times
speed-up over the Optimization I method; the simplification of the Gaussian kernel and the removal
of the outer loop clearly bring substantial benefits to the Optimization II method. The right bar graph
also shows that the Optimization III method speeds up the Optimization II method by about two
times when the data size is large. However, when the data size is small, the Optimization III method
performs poorly; it is even slower than the Optimization II method when the data size is 10^9. This is
because we need to rearrange the memory layout for the Optimization III method. The cost of this
rearrangement is constant and can be ignored when the total running time is long, but when the data
size is small this rearrangement cannot be ignored and thus affects the overall performance of the
Optimization III method.
FIGURE 6.4: The comparison of speed-ups between different optimization methods on synthetic 3D data.
For the second group, we perform similar experiments to the first group, except that this time we test
the performance on 3D data. The results are given in Figure 6.4. We can see that, for 3D data, the
performance results of the second group are close to those of the first group. One thing to notice is
that, in the 3D case, the Optimization III method is four to five times faster than the Optimization II
method, which is much better than in the 2D case. This is because the data structure in the 3D case is
more complex than in the 2D case, so more global memory operations are involved when processing
3D data. Hence, once shared memory is introduced, the 3D KDE and KDDE algorithms benefit more.
GPU Implementation    Global Memory Transactions
Naive                 16.4M
Optimization I        1.96M
Optimization II       1.56M
Optimization III      11.7K

TABLE 6.2: Global memory transactions between different optimization methods.
For the GPU optimization methods, we perform the third group of experiments. We analyze the
number of global memory transactions for each optimization method using the CUDA GPU profiler
gprof. The experimental results are shown in Table 6.2; here, the data size is 10^6. We can see that,
from the naive method to the Optimization I method, the number of global memory transactions is
decreased by 8 times. Such a big improvement is because both the kernel merging and memory
layout optimization techniques aim at reducing global memory transactions. From Optimization I to
Optimization II, there is only a slight decrease in global memory transactions. Since the Gaussian
kernel simplification only focuses on reducing computation and the outer loop removal only focuses
on decreasing gpu-kernel call overhead, this result is reasonable. We can also see that the use of
shared memory reduces global memory transactions significantly: from Optimization II to
Optimization III the number of global memory transactions is decreased by more than 100 times.
However, according to the quantitative global memory analysis given by Equations (5.19) and (5.20),
we know that the theoretical difference in global memory transactions should be a factor of b × c.
Since, in this experiment, the block size is 1024 and the memory coalescing factor c is 32, the
theoretical decrease in global memory transactions should be 32768 times, which is much larger than
our experimental result. The reason is that, in the Optimization II method, the GPU L1 cache already
helps reduce global memory transactions.
6.2.3 Efficient k-NN Bandwidth Selector
In this section, we perform our CPU experiments on the Intel Core i7-3615QM platform and our
GPU experiments on the NVIDIA Tesla K20m platform. We investigate the performance of the naive
and efficient algorithms. The efficient algorithm is implemented on both CPU and GPU; we call these
the CPU efficient algorithm and the GPU efficient algorithm. We divide the experiments into two
groups. In the first group, we test the execution time of the naive and efficient k-NN bandwidth
selectors on 2D images. In the second group, we compare the naive, CPU efficient and GPU efficient
algorithms on 3D images.
FIGURE 6.5: Performance of the k-NN bandwidth selector on 2D images using the naive algorithm and the CPU efficient algorithm.
For the first group, the experimental results are shown in Figure 6.5. Here, we perform our
experiments with five different image sizes. When the image size is small, the performance of the
naive algorithm and the efficient algorithm is similar. However, the execution time of the naive
algorithm increases much faster than that of the efficient algorithm as the image size grows. When
the image size reaches 256 × 256, the efficient algorithm is 6 times faster than the naive algorithm.
FIGURE 6.6: Performance of the k-NN bandwidth selector on 3D images using the naive algorithm, the CPU efficient algorithm and the GPU efficient algorithm.
For the second group, five different 3D image sizes are used in the experiments. For image sizes of
32 × 32 × 32 and 32 × 32 × 64, we compare the naive, CPU efficient and GPU efficient algorithms.
For larger 3D image sizes, the naive algorithm runs out of memory because it needs to store a huge
distance map between points; hence, we only run the CPU and GPU efficient algorithms on large 3D
images. The experimental results are shown in Figure 6.6. We can see that the GPU efficient
algorithm is much faster than the CPU efficient algorithm and the naive algorithm.
6.3 Vesselness Measure
In this section, we apply our kernel smoothing library to the two vesselness measure algorithms
introduced in Sections 3.3 and 3.4. We compare both the filtering results and the speed performance
of these algorithms when using and not using our kernel smoothing library. All the experiments are
conducted on the Discovery cluster platform with an NVIDIA Tesla K20m GPU and an Intel Xeon
E5-2670 CPU.
6.3.1 Frangi Filtering
As we discussed in Section 3.3, a Frangi filtering based vesselness measure uses the eigenvalues of
the Hessian matrices obtained from an image to analyze the likelihood of a pixel being on a tubular
structure. There are three different ways to calculate the Hessian matrices: the gradient operator,
Gaussian smoothing, and KDDE (see Section 3.2). In the original Frangi paper [3], the Hessian
matrix is computed through Gaussian smoothing, which is identical to the binned estimation method
of kernel density estimation theory. However, this method only uses a constrained bandwidth,
meaning that the same smoothing is applied in every coordinate direction. Therefore, to get a more
accurate Hessian matrix, one can choose a variable, unconstrained bandwidth for the kernel density
derivative estimator. In this way, the estimator is allowed to smooth in any direction, not just along
the coordinate axes.
In this section, we implement the Frangi filter in three different ways. For the first way, we
implement the filter exactly as in the original paper: we use the Gaussian smoothing method to
calculate the Hessian matrices and perform all the calculations on the CPU only. For the second way,
we calculate the Hessian matrices using variable unconstrained bandwidth KDDE, still performed on
the CPU. The third way is similar to the second, except that we now use our GPU accelerated kernel
smoothing library to calculate the KDDE.
We test the performance of each of these implementations. The vesselness measure results are given
in Figure 6.7. We can see that the Frangi filtering result using KDDE, in the middle, gives more
details about the retina image than the result using Gaussian smoothing. However, obtaining such a
good vesselness measure result is extremely expensive. The Gaussian smoothing based Frangi
filtering takes only 0.38 seconds to produce its result, whereas computing the KDDE on the CPU
only takes 3433 seconds. Fortunately, this can be accelerated using our GPU accelerated kernel
smoothing library, reducing the execution time to 14.9 seconds. In this case, we only spend a little
more time to get a much better vesselness measure result.
6.3.2 Ridgeness Filtering
We implement a ridgeness filtering based vessel segmentation algorithm according to the pipeline in
Figure 6.8. For a given image, we first perform preprocessing algorithms, such as anisotropic
diffusion, unsharp masking, and adaptive histogram equalization, to suppress background noise and
highlight tubular structures. Then, for the preprocessed image, we use the k-NN bandwidth selector
to calculate the bandwidth of each training point (nonzero point). Based on the bandwidths and the
preprocessed image, we calculate the kernel density estimates f, kernel gradient estimates g, and
kernel curvature estimates H. For each kernel curvature estimate H, we use the lambda selector to
calculate its largest absolute eigenvalue |λ|_max. We compute the ridgeness scores s using the
ridgeness filter from the kernel gradient estimates and kernel curvature estimates. Finally, based on
the values of |λ|_max, s, and f, the classifier makes a combined decision on the vesselness of each
pixel. Before outputting the segmented image, a postprocessing procedure is used to refine the
results from the classifier.
The vessel segmentation results are shown in Figure 6.9. We compare the results of this algorithm
when using and not using our kernel smoothing library. As can be seen from the figure, the results
are exactly the same, which means our kernel smoothing library does not introduce any inaccuracies
when accelerating the algorithm. The total execution time of the GPU accelerated implementation is
12.56 seconds, while the total execution time of the non-GPU implementation is 938.7293 seconds.
This shows that we achieved a 75 times speed-up when using our kernel smoothing library for the
ridgeness filtering based vessel segmentation.
F IGURE 6.7: Vesselness measure results using Frangi filter. Top: Original image. Middle: Frangi filtering
result using KDDE. Bottom: Frangi filtering result using Gaussian smoothing.
[Pipeline: Image → Preprocessing → k-NN Bandwidth Selector → KDDE → Lambda Selector → Ridgeness Filtering → Classifier → Postprocessing → Output]

FIGURE 6.8: Algorithm pipeline of the ridgeness filtering based vessel segmentation.
F IGURE 6.9: Vesselness measure results using ridgeness filter. Top: Original image. Middle: Ridgeness
filtering result with GPU. Bottom: Ridgeness filtering result without GPU.
Chapter 7
Conclusion and Future Work
7.1 Conclusion
We started our discussion with a background introduction to kernel smoothing theory in Chapter 1.
Then, in Chapters 2, 3 and 4, we provided the essential background knowledge for the discussion in
Chapters 5 and 6.
In Chapter 2, Sections 2.2 and 2.5 provided the detailed knowledge about kernel density and kernel
density derivative estimation that we need for implementing the high performance functions in
Section 5.2. Section 2.3 introduced the separable multivariate kernels, which provided the foundation
for our discussion in Section 5.1. The discussion of k-nearest neighbors bandwidth selection in
Section 2.4 helped the understanding of the efficient method in Section 5.3. In Chapter 3, we
introduced two vesselness measure algorithms in Sections 3.3 and 3.4; we used these two algorithms
to demonstrate the full potential of our kernel smoothing library in applications. Sections 3.1 and 3.2
provided the background knowledge for these two algorithms, and Section 6.3 reported their
performance when using the kernel smoothing library. In Chapter 4, we gave a detailed introduction
to the GPU architecture and the CUDA programming framework, which helped to explain the
optimization techniques used in Section 5.2.
Based on the background knowledge introduced in the previous chapters, we presented three
major contributions of our kernel smoothing library in Chapter 5. First, we proposed an efficient
method to calculate the separable multivariate kernel derivative. Second, we implemented the kernel
density and kernel density derivative estimators using several optimization techniques on multi-core
CPU and GPU platforms. Third, we also designed an efficient k-nearest neighbors bandwidth
selection algorithm for image processing. We provided a GPU implementation for this algorithm
as well. In Chapter 6, we designed a series of experiments to evaluate the performance of the
algorithms and implementations presented in Chapter 5. The results show that the presented
algorithms and implementations achieve significant speed-ups over their direct or naive counterparts.
The performance evaluation of our kernel smoothing library on two vesselness measure algorithms
was provided as well.
7.2 Future Work
There are several areas we can improve in the future. First, in the current version, our kernel
smoothing library only implements the GPU accelerated KDE and KDDE functions for 2D and 3D
data; in the future, we can add a GPU implementation for higher dimensional data. Second,
bandwidth selection methods are crucial in kernel smoothing, but we only implemented one
bandwidth selection method in our library. We should add implementations of more bandwidth
selection methods; since some of them are also computationally intensive, there is potential to
implement them on the GPU. Finally, object-oriented programming can be used in our library.
Bibliography
[1] M. Rosenblatt et al., “Remarks on some nonparametric estimates of a density function,” The
Annals of Mathematical Statistics, vol. 27, no. 3, pp. 832–837, 1956.
[2] E. Parzen, “On estimation of a probability density function and mode,” The annals of mathematical statistics, pp. 1065–1076, 1962.
[3] A. F. Frangi, W. J. Niessen, K. L. Vincken, and M. A. Viergever, “Multiscale vessel enhancement
filtering,” in Medical Image Computing and Computer-Assisted Interventation—MICCAI’98.
Springer, 1998, pp. 130–137.
[4] B. Silverman, “Algorithm as 176: Kernel density estimation using the fast fourier transform,”
Applied Statistics, pp. 93–99, 1982.
[5] M. Wand, “Fast computation of multivariate kernel estimators,” Journal of Computational and
Graphical Statistics, vol. 3, no. 4, pp. 433–445, 1994.
[6] A. Elgammal, R. Duraiswami, and L. S. Davis, “Efficient kernel density estimation using the
fast gauss transform with applications to color modeling and tracking,” Pattern Analysis and
Machine Intelligence, IEEE Transactions on, vol. 25, no. 11, pp. 1499–1504, 2003.
[7] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis, “Improved fast gauss transform and efficient kernel density estimation,” in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003, pp. 664–671.
[8] A. Sinha and S. Gupta, “Fast estimation of nonparametric kernel density through pddp, and its
application in texture synthesis.” in BCS Int. Acad. Conf., 2008, pp. 225–236.
[9] J. M. Phillips, “ε-samples for kernels,” in Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2013, pp. 1622–1632.
[10] Y. Zheng, J. Jestes, J. M. Phillips, and F. Li, “Quality and efficiency for kernel density estimates in large data,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013, pp. 433–444.
[11] S. Łukasik, “Parallel computing of kernel density estimates with mpi,” in Computational Science–ICCS 2007. Springer, 2007, pp. 726–733.
[12] J. Racine, “Parallel distributed kernel estimation,” Computational Statistics & Data Analysis,
vol. 40, no. 2, pp. 293–302, 2002.
[13] P. D. Michailidis and K. G. Margaritis, “Parallel computing of kernel density estimation
with different multi-core programming models,” in Parallel, Distributed and Network-Based
Processing (PDP), 2013 21st Euromicro International Conference on. IEEE, 2013, pp. 77–85.
[14] ——, “Accelerating kernel density estimation on the gpu using the cuda framework,” Applied
Mathematical Sciences, vol. 7, no. 30, pp. 1447–1476, 2013.
[15] W. Andrzejewski, A. Gramacki, and J. Gramacki, “Graphics processing units in acceleration of bandwidth selection for kernel density estimation,” International Journal of Applied
Mathematics and Computer Science, vol. 23, no. 4, pp. 869–885, 2013.
[16] T. Duong et al., “ks: Kernel density estimation and kernel discriminant analysis for multivariate
data in r,” Journal of Statistical Software, vol. 21, no. 7, pp. 1–16, 2007.
[17] M. Wand and B. Ripley, “Kernsmooth: Functions for kernel smoothing for wand & jones
(1995),” R package version, vol. 2, pp. 22–19, 2006.
[18] T. Hayfield and J. S. Racine, “Nonparametric econometrics: The np package,” Journal of
statistical software, vol. 27, no. 5, pp. 1–32, 2008.
[19] A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University Press, 1997.
[20] V. A. Epanechnikov, “Non-parametric estimation of a multivariate probability density,” Theory
of Probability & Its Applications, vol. 14, no. 1, pp. 153–158, 1969.
[21] M. Shaker, J. N. Myhre, and D. Erdogmus, “Computationally efficient exact calculation of
kernel density derivatives,” Journal of Signal Processing Systems, pp. 1–12, 2014.
[22] B. W. Silverman, Density estimation for statistics and data analysis. CRC press, 1986, vol. 26.
[23] T. Duong, Bandwidth selectors for multivariate kernel density estimation. University of Western Australia, 2004.
[24] G. R. Terrell and D. W. Scott, “Variable kernel density estimation,” The Annals of Statistics, pp.
1236–1265, 1992.
[25] M. Jones, “Variable kernel density estimates and variable kernel density estimates,” Australian
Journal of Statistics, vol. 32, no. 3, pp. 361–371, 1990.
[26] I. S. Abramson, “On bandwidth variation in kernel estimates-a square root law,” The Annals of
Statistics, pp. 1217–1223, 1982.
[27] L. Breiman, W. Meisel, and E. Purcell, “Variable kernel estimates of multivariate densities,”
Technometrics, vol. 19, no. 2, pp. 135–144, 1977.
[28] J. E. Chacón, T. Duong, and M. Wand, “Asymptotics for general multivariate kernel density derivative estimators,” 2009.
[29] J. E. Chacón, T. Duong et al., “Data-driven density derivative estimation, with applications
to nonparametric clustering and bump hunting,” Electronic Journal of Statistics, vol. 7, pp.
499–532, 2013.
[30] J. R. Magnus and H. Neudecker, “Matrix differential calculus with applications in statistics and
econometrics,” 1995.
[31] H. V. Henderson and S. Searle, “Vec and vech operators for matrices, with some uses in
jacobians and multivariate statistics,” Canadian Journal of Statistics, vol. 7, no. 1, pp. 65–81,
1979.
[32] T. Duong, A. Cowling, I. Koch, and M. Wand, “Feature significance for multivariate kernel
density estimation,” Computational Statistics & Data Analysis, vol. 52, no. 9, pp. 4225–4242,
2008.
[33] T. M. Apostol, “Mathematical analysis,” 1974.
[34] C. E. Shannon, “Communication in the presence of noise,” Proceedings of the IRE, vol. 37,
no. 1, pp. 10–21, 1949.
[35] U. Ozertem and D. Erdogmus, “Locally defined principal curves and surfaces,” The Journal of
Machine Learning Research, vol. 12, pp. 1249–1286, 2011.
[36] E. Bas and D. Erdogmus, “Principal curves as skeletons of tubular objects,” Neuroinformatics,
vol. 9, no. 2-3, pp. 181–191, 2011.
[37] Y. Katznelson, An introduction to harmonic analysis. Cambridge University Press, 2004.
[38] V. Y. Pan, “The trade-off between the additive complexity and the asynchronicity of linear and
bilinear algorithms,” Information processing letters, vol. 22, no. 1, pp. 11–14, 1986.
[39] R. Solcà, T. C. Schulthess, A. Haidar, S. Tomov, I. Yamazaki, and J. Dongarra, “A hybrid
hermitian general eigenvalue solver,” arXiv preprint arXiv:1207.1773, 2012.