Exploring cell tower data dumps for supervised learning-based point-of-interest prediction (industrial paper)

Geoinformatica
DOI 10.1007/s10707-015-0237-7
Ran Wang1 · Chi-Yin Chow1 · Yan Lyu1 ·
Victor C. S. Lee1 · Sarana Nutanong1 · Yanhua Li2 ·
Mingxuan Yuan3
Received: 3 December 2014 / Revised: 20 July 2015 / Accepted: 28 September 2015
© Springer Science+Business Media New York 2015
Abstract Exploring massive mobile data for location-based services has become one of the
key challenges in mobile data mining. In this paper, we investigate a problem of finding a
correlation between the collective behavior of mobile users and the distribution of points
of interest (POIs) in a city. Specifically, we use large-scale cell tower data dumps collected
from cell towers and POIs extracted from a popular social network service, Weibo. Our
objective is to make use of the data from these two different types of sources to build a model
for predicting the POI densities of different regions in the covered area. An application
domain that may benefit from our research is a business recommendation application, where
a prediction result can be used as a recommendation for opening a new store/branch. The
crux of our contribution is the method of representing the collective behavior of mobile users as a histogram of connection counts over a period of time in each region. This representation ultimately enables us to apply a supervised learning algorithm to our problem in order to train a POI prediction model using the POI data set as the ground truth. We studied 12 state-of-the-art classification and regression algorithms; experimental results demonstrate the feasibility and effectiveness of the proposed method.

Keywords Spatio-temporal data analysis · Classification · Regression · Cell tower data dumps · Point-of-interest prediction

Chi-Yin Chow
chiychow@cityu.edu.hk
Ran Wang
ranwang3-c@my.cityu.edu.hk
Yan Lyu
yanlv2-c@my.cityu.edu.hk
Victor C. S. Lee
csvlee@cityu.edu.hk
Sarana Nutanong
snutanon@cityu.edu.hk
Yanhua Li
yli15@wpi.edu
Mingxuan Yuan
yuan.mingxuan@huawei.com

1 Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
2 Department of Computer Science, Worcester Polytechnic Institute (WPI), Worcester, USA
3 Huawei Noah’s Ark Lab, Shatin, Hong Kong
1 Introduction
The ubiquity of mobile devices such as smartphones and tablet computers enables us to
collect useful spatial and temporal data at a large scale and also opens up the possibility
of extracting useful information from the data [10, 17, 33]. For example, a popular mapping service, Google Maps, makes use of real-time GPS records obtained from the users
of Google Location Services to show the current traffic conditions of different road segments on the maps. Another example is a driving direction recommendation system called
T-Drive [36], which makes use of trajectories collected from over 33,000 taxis in a period
of three months to compute the fastest route for users.
In this paper, we focus on a specific type of mobile-user data known as cell tower data
dumps, which contain connection records collected by 9,563 cell towers operated by China Mobile Limited1 in Guangzhou, China, as illustrated in Fig. 1a. This data set was
collected within a time period of six days (from 4 September 2013 to 9 September 2013).
For the purpose of this investigation, we focus on records produced by phone calls and
SMSs. For each record, we use the connection time, and the identifier and location of each
cell tower. We extracted 18,290 restaurants in Guangzhou from Weibo2 , a popular Chinese
social network web site, as our point-of-interest (POI) data set, as depicted in Fig. 1b.
The main objective of our research is to make use of the cell phone and POI data sets to
help predict the existence of a POI and the number of POIs in the vicinity of a cell tower.
An application domain that may benefit from POI prediction is a business recommendation application, where a company is interested in generalizing the pattern of POIs of a
particular type (e.g., a coffee shop) in order to identify areas that have a great potential of
supporting its business but have not been fully utilized yet. Our investigation is driven by a
hypothesis that there is a correlation between the collective behavior of mobile users and
the existence of a certain type of POIs in a certain area.
The main challenge of this work is twofold: (1) Representation. To test our hypothesis,
we should find a meaningful representation of collective mobile user behaviors by summarizing a large amount of data extracted from the cell tower data dumps. For example, the cell
tower network in a city like Guangzhou generates user connection records in the scale of
tens of gigabytes on a daily basis. (2) Application. To provide location-based services (LBS), we should find effective techniques to predict the existence of a certain type of POIs and the number of POIs
in a certain area. For example, if our framework predicts that a certain area should have
restaurants, but that area does not have any restaurant, it has potential for a new restaurant.
1 http://www.chinamobileltd.com
2 http://weibo.com
Fig. 1 Geographical distribution of (a) cell towers and (b) POI restaurants in Guangzhou, China (plotted by longitude and latitude)
To overcome the representation challenge, mobile user data can be summarized in two different ways. The first is to group the records by user, yielding an action list or a moving trajectory for each user; the other is to group them by cell tower, yielding the spatio-temporal features of geographical areas. In this work, we adopt the second method for the following reasons.
– There is no exact location information for users. Each record shows the cell tower that a mobile device is connected to rather than the exact location of the device. In fact, even if a user stays at a fixed location, he or she may connect to different cell towers due to uncontrollable factors such as signal intensity and facility maintenance.
– The number of cell tower connections per user has a small mean and a large standard deviation. That is to say, one user may make a number of connections in a day while another user may make none. As a result, the numbers of connections made by different users vary tremendously, and the average number is too small for the records to be treated as a trajectory data set.
The result from the cell-tower-based summarization method is a spatio-temporal data set
with cell towers spanning the spatial dimensions. In the temporal dimension, each cell tower
is associated with a histogram of connection counts where each histogram bin occupies a
time period of one hour. In this way, the collective behavior of mobile users of an entire
city is compactly represented as connection counts over a period of time from different cell
towers.
For the application challenge, we aim to design a framework to build up a model between
the features of mobile user behaviors and LBS. In particular, we study how to employ state-of-the-art supervised learning algorithms to design (i) a classification model to predict the POI existence (i.e., naive Bayes [21], radial basis function (RBF) network [13], support vector machine (SVM) [28], decision trees (DT) [19], bagging [6], adaboost [8]) and (ii) a regression model to predict the number of POIs (i.e., simple linear regression, linear regression [22], isotonic regression [2], pace regression [31], additive regression [24], and
regression via discretization [27]).
In general, the contributions of our work can be summarized as follows.
– We formulate a generic representation method of summarizing cell tower data dumps for mobile user behaviors.
– We design a framework with classification and regression algorithms to build up a model between mobile users’ behaviors and LBS for business recommendation applications.
– We conduct an extensive evaluation of our framework on real cell tower data dumps and a POI data set. Experimental results show that there is a strong correlation between the collective behavior of mobile users and the restaurant data set, and demonstrate the feasibility and effectiveness of the proposed framework.
The remainder of this paper is organized as follows. Section 2 gives a brief introduction
to supervised machine learning, and highlights related work. In Section 3, we describe how
to predict the POI existence and the number of POIs based on the cell tower data dumps, and
present the proposed framework. In Section 4, we present implementation details and analyze extensive experimental results to study the feasibility and effectiveness of the proposed framework. Finally, Section 5 concludes this paper.
2 Related work
Most existing work on mobile and spatio-temporal data focuses on recommender systems [1, 34, 38], urban planning [3], discovering [35], social networking services [40], etc.
In particular, mobile phone call data and cellular network data are often used to discover
useful information in various scenarios such as traffic anomalies [18], regions of different functions in a city [35], routine behavior patterns of people [16, 39], and important
places [15], etc. Besides, they are also used for urban analysis [20] and urban planning, such
as characterizing dense urban areas [29] and capturing city dynamics [3]. In general, the
most commonly used techniques include collaborative filtering, density estimation, image
and signal processing, etc. However, none of them focuses on machine learning, especially supervised learning, which is also a potential tool for mining useful information and making accurate predictions from mobile phone call data or cellular network data for valuable location-based applications.
Supervised learning [5] refers to the problem of inferring a model from a set of labeled
training samples, in order to achieve accurate predictions on unseen data. Given a training
set $X$ with $N$ labeled samples, i.e., $X = \{(x_i, y_i)\}_{i=1}^{N}$, each sample is associated with a set of conditional attributes $x_i = \{x_{i1}, x_{i2}, \ldots, x_{iL}\}$ and a decision attribute $y_i$. The goal is to learn a function $f: x \rightarrow y$, such that given a new unlabeled sample $\hat{x} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_L\}$, its desired output value can be predicted as $\hat{y} = f(\hat{x})$. Besides, the learning task is
classification or regression if the decision attribute is discrete or continuous, respectively. In
order to solve a supervised learning problem, the solution has to perform the steps as shown
in Fig. 2. Each step has unique significance that may affect the final performance.
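To make this setup concrete, the following sketch (not part of the original paper, whose experiments use WEKA as described in Section 4) fits a classifier and a regressor on synthetic labeled samples with scikit-learn; the feature matrix and targets are placeholders.

```python
# A sketch of the supervised-learning setup described above, on synthetic data
# (placeholder features/targets; the paper's experiments use WEKA, see Section 4).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 5))                         # conditional attributes x_i (L = 5 here)
y_class = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # discrete decision attribute -> classification
y_reg = X @ rng.random(5)                        # continuous decision attribute -> regression

X_tr, X_te, yc_tr, yc_te = train_test_split(X, y_class, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, yc_tr)   # learn f: x -> y (discrete y)
print("classification accuracy:", clf.score(X_te, yc_te))

X_tr, X_te, yr_tr, yr_te = train_test_split(X, y_reg, random_state=0)
reg = DecisionTreeRegressor(random_state=0).fit(X_tr, yr_tr)    # learn f: x -> y (continuous y)
print("regression R^2:", reg.score(X_te, yr_te))
```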
Currently, the most widely used classification models include the naive Bayes classifier (NBC) [21], support vector machines (SVMs) [28], decision trees (DTs) [19], artificial neural networks (ANNs) [13], etc., while the most widely used regression models include linear regression [22], pace regression [31], isotonic regression [2], etc. Due to the well-known no-free-lunch theorem [32], no algorithm can perform best on all problems. Thus, each model has unique advantages that suit certain environments; meanwhile, each one has its own restrictions that may affect the final performance.
Fig. 2 Structure of a supervised learning process
Supervised learning covers a wide range of application domains such as image processing [30], text classification [25], face recognition [7], video indexing [37], etc. Besides,
several learning techniques have been applied on mobile and spatio-temporal data in recent
literature. In [11], kernel-based SVM is used as a classifier in the detection of harmful algal
blooms in the Gulf of Mexico based on mobile data. In [26], the random forest approach is
used to classify the land usage in a city based on mobile phone activities. In [4], a density-based clustering algorithm is proposed for a wide range of spatio-temporal data. To the
best of our knowledge, no one has applied supervised learning models to predict the POI
existence or the number of POIs in a certain region of a city using cell tower data dumps.
3 Using supervised learning for POI predictions
In this section, we will describe how to apply the supervised learning models to predict the
POI existence or the number of POIs in a region of a city based on the cell tower data dumps
and Weibo POI data.
3.1 Pre-clustering of cell towers
As demonstrated in Fig. 1, the geographical distributions of the cell towers and POIs in
Guangzhou city are roughly consistent with each other. That is to say, if a given region has
a larger number of cell towers, it also has a high chance to cover a larger number of POIs,
and vice versa. Besides, the density of cell towers is also related to the user visiting rate. For
example, the downtown is usually the most popular and busiest area in a city, so it records
the highest user visiting rate and thus needs more cell towers. In comparison, far fewer people visit the suburbs, so the density of cell towers there is low. Given these basic observations, it is possible to predict the POI existence or the number of POIs
in a region based on the user visiting rate, which is reflected by the number of connections
established by cell towers in that region.
Given N cell towers T = {T1 , T2 , . . . , TN } with geographical location information, we
denote Ti = (ti1 , ti2 ), where ti1 and ti2 represent the longitude and latitude of Ti , i =
1, 2, . . . , N , respectively. The intuitive scheme is to divide the city into N regions R =
{R1 , R2 , . . . , RN }, such that each region contains one cell tower. These regions could be
defined by the Voronoi diagram [9], which treats each cell tower as a seed. Given a point in
a region, the point is closer to the seed of the region than the seeds in other regions, i.e.,
$$\forall x \in R_i,\; d(x, T_i) \le d(x, T_j),$$
where $i \in \{1, 2, \ldots, N\}$ and $j = 1, \ldots, i-1, i+1, \ldots, N$. An example of the Voronoi
diagram with 10 seeds located in a unit square is given in Fig. 3.
Suppose there is a set of M POIs P = {P1, P2, . . . , PM} with geographical location information. We denote Pi = (pi1, pi2), where pi1 and pi2 represent the longitude and
latitude of Pi , i = 1, 2, . . . , M, respectively. For a given POI Pi , the region that covers
it could be discovered by a nearest neighbor (NN) search process among T. Finally, the
number of POIs in each region is computed as the target that we aim to predict. However,
when it comes to a real application, we have to consider the following two issues:
– The signal intensity of a cell tower is not stable, which leads to an unreliable relation between the POI density and the number of connections. For example, given two neighboring regions with similar POI densities, their user visiting rates are also supposed to be similar. However, the signal intensity of one cell tower may be much stronger than that of the other; thus, when a user is at an intermediate location between them, the stronger one will always make the connection for the user.
– Due to the unbalanced distribution of cell towers, the separated regions may be too small in the downtown and too large in the suburbs. As a result, the numbers of POIs covered by different regions may be evened out and show no obvious difference.
In order to overcome the above-mentioned problems, we conduct a pre-clustering process on the cell towers, such that the cell towers with similar geographical information
are grouped into one cluster. Accordingly, their regions defined by the Voronoi diagram
are merged, and the numbers of their covered POIs are summed up as the target that
we aim to predict. As the most widely used one, k-means clustering technique [12] is
adopted, which aims to partition N observations (i.e., T1 , T2 , . . . , TN ) into k sets (i.e.,
$S = \{S_1, S_2, \ldots, S_k\}$), so as to minimize the within-cluster sum of squares:
$$\arg\min_{S} \sum_{i=1}^{k} \sum_{T_j \in S_i} \|T_j - \mu_i\|^2, \qquad (1)$$
where
$$\mu_i = \frac{1}{k_i} \sum_{T_j \in S_i} T_j, \qquad (2)$$
and ki is the number of cell towers in the i-th cluster.
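A minimal sketch of this pre-clustering step is given below, assuming two NumPy arrays tower_lonlat (N x 2) and poi_lonlat (M x 2) that are not defined in the paper; it runs k-means on the tower coordinates (Eqs. 1-2), assigns each POI to its nearest tower (equivalent to locating it in a Voronoi cell), and sums the POI counts per cluster. Plain Euclidean distance on longitude/latitude is used here as a simplification.

```python
# A sketch of the pre-clustering step: k-means over tower coordinates (Eqs. 1-2),
# nearest-tower assignment of POIs (Voronoi cell membership), and per-cluster POI
# counts n_i as the prediction target. Inputs are assumed NumPy arrays.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans

def cluster_and_count(tower_lonlat, poi_lonlat, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tower_lonlat)
    tower_cluster = km.labels_                        # cluster S_i of each cell tower

    # Nearest-neighbor search among towers <=> finding the Voronoi region of each POI.
    nearest_tower = cKDTree(tower_lonlat).query(poi_lonlat)[1]
    poi_cluster = tower_cluster[nearest_tower]        # merged region R_i* covering each POI

    poi_counts = np.bincount(poi_cluster, minlength=k)  # n_i for each cluster
    return tower_cluster, poi_counts
```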
In this work, k can be treated as an input parameter related to the evaluation unit and defined by the user. For instance, the user could choose a smaller k to evaluate larger regions and a larger k to evaluate smaller regions. In other words, there is no
best or worst value of k; its value is decided by the user's preference. Obviously, it is hard for us to try all possible k values; thus, we test several representative values, i.e., {250, 500, 1000, 2500, 5000}. Due to limited space, we only plot the clustering result when k = 250, as shown in Fig. 4.

Fig. 3 Voronoi diagram with 10 seeds in a unit square
3.2 Density distribution of the number of POIs
We use kernel density estimation to get the distribution characteristics of the number of
POIs, in order to investigate whether the data is suitable for supervised learning models.
Kernel density estimation is a generalized form of the histogram, which gives the continuous
distribution of a set of observations. Given k cell tower clusters grouped by the k-means
algorithm (i.e., {S1 , S2 , . . . , Sk }), the covered region of Si is denoted by Ri∗ , and the number
of POIs located in Ri∗ is denoted by ni , then the kernel density estimation of the number of
POIs is
$$\hat{f}_h(n) = \frac{1}{k}\sum_{i=1}^{k} K_h(n - n_i) = \frac{1}{kh}\sum_{i=1}^{k} K\!\left(\frac{n - n_i}{h}\right), \qquad (3)$$
where n is the argument for the density estimation, i.e., the number of POIs in a region, K is the kernel function, and h is the bandwidth. By applying the Gaussian kernel, i.e.,
$$K(n) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{n^2}{2}\right), \qquad (4)$$
the estimator (3) becomes
$$\hat{f}_h(n) = \frac{1}{kh}\sum_{i=1}^{k} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(n - n_i)^2}{2h^2}\right). \qquad (5)$$
Fig. 4 Pre-clustering result of cell towers when k = 250
According to [23], we select the optimal bandwidth as $h = (4\hat{\sigma}^2 / 3k)^{1/5}$, where $\hat{\sigma}$ is the standard deviation of $\{n_1, \ldots, n_k\}$. Finally, the density distribution of the number of POIs is derived as shown in Fig. 5.
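The sketch below evaluates the estimator of Eq. (5) with the bandwidth rule quoted above, assuming poi_counts holds the per-cluster POI numbers n_1, ..., n_k from the pre-clustering step; it is an illustrative NumPy implementation rather than the tool used by the authors.

```python
# A sketch of the Gaussian kernel density estimate of Eq. (5), with the bandwidth
# rule h = (4*sigma^2 / 3k)^(1/5) quoted above; poi_counts is the assumed array of
# per-cluster POI numbers n_1, ..., n_k.
import numpy as np

def kde_poi_density(poi_counts, grid):
    n_i = np.asarray(poi_counts, dtype=float)
    k = len(n_i)
    h = (4.0 * n_i.std() ** 2 / (3.0 * k)) ** 0.2        # bandwidth as given in the text
    u = (np.asarray(grid, dtype=float)[:, None] - n_i[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (k * h * np.sqrt(2.0 * np.pi))

# Example: density = kde_poi_density(poi_counts, np.linspace(0, poi_counts.max(), 200))
```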
Fig. 5 Density distribution of the number of POIs

Figure 5a gives the distribution of the original data without the pre-clustering process. It is easy to observe that many cell towers do not cover any POI, and when the POI number
is larger than 10, the density is approximately zero. That is to say, the numbers of POIs
covered by the cell towers have no obvious difference. In this case, it is hard to establish a
supervised learning model for both classification and regression. However, the distribution
becomes more rational with a pre-clustering process, as shown in Fig. 5b to f. Basically,
with the decrease of the number of clusters, we have the following observations:
– The difference among clusters becomes more obvious, with a larger range of numbers of POIs per cluster.
– The distribution becomes smoother, with a smaller range of density.
– The percentage of clusters that do not cover any POI becomes much smaller.
Table 1 reports the statistics of the cell tower clusters. It is clear that with
the decrease of the number of clusters, the ratio of empty clusters becomes smaller, and
the average number of POIs per cluster becomes larger. Besides, we define the inter-cluster
standard deviation (SD) as
$$\sigma_1 = \sqrt{\frac{1}{k}\sum_{i=1}^{k} (n_i - \mu)^2}, \qquad (6)$$
and intra-cluster SD as
$$\sigma_2 = \sqrt{\frac{1}{k}\sum_{i=1}^{k} \frac{1}{k_i}\sum_{j \in R_i^*} (n_{ij} - \mu_i)^2}, \qquad (7)$$
where $\mu = \frac{1}{k}\sum_{i=1}^{k} n_i$, $\mu_i = \frac{1}{k_i}\sum_{j \in R_i^*} n_{ij}$, and $k_i$ is the number of cell towers in $R_i^*$.
Obviously, both Eqs. 6 and 7 increase with the decrease of the number of clusters. However,
the increasing amplitude of (6) is much larger than that of Eq. 7, which demonstrates that
the pre-clustering process can enlarge the difference among clusters while retaining the
similarity of cell towers in the same cluster.
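A sketch of how the two quantities can be computed is shown below, following the definitions of Eqs. (6) and (7); tower_poi_counts (the per-tower POI counts n_ij) and tower_cluster (the k-means label of each tower) are assumed arrays.

```python
# A sketch of the inter-cluster SD (Eq. 6) and intra-cluster SD (Eq. 7).
# tower_poi_counts[j] is the POI count n_ij of tower j's Voronoi region and
# tower_cluster[j] its k-means cluster label (both assumed arrays).
import numpy as np

def cluster_sds(tower_poi_counts, tower_cluster, k):
    n_ij = np.asarray(tower_poi_counts, dtype=float)
    n_i = np.array([n_ij[tower_cluster == i].sum() for i in range(k)])    # POIs per region R_i*
    mu_i = np.array([n_ij[tower_cluster == i].mean() for i in range(k)])  # per-cluster mean

    inter_sd = np.sqrt(np.mean((n_i - n_i.mean()) ** 2))                  # Eq. (6)
    intra_sd = np.sqrt(np.mean([np.mean((n_ij[tower_cluster == i] - mu_i[i]) ** 2)
                                for i in range(k)]))                      # Eq. (7)
    return inter_sd, intra_sd
```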
3.3 Refine time resolution
We aim to use spatio-temporal data to perform POI predictions. In Sections 3.1 and 3.2, we
have introduced how to make use of the spatial data. In this section, we further discuss how
to make use of the temporal data.
Table 1 Cell tower cluster information

No. clusters | No. clusters with zero POI | No. clusters with non-zero POI | Minimum No. POIs per cluster | Maximum No. POIs per cluster | Average No. POIs per cluster | Inter-cluster SD | Intra-cluster SD
9,563 | 5,292 | 4,271 | 0 | 192 | 2 | 4.26 | 0
5,000 | 2,187 | 2,813 | 0 | 192 | 4 | 7.12 | 0.78
2,500 | 880 | 1,620 | 0 | 192 | 8 | 13.02 | 1.28
1,000 | 202 | 798 | 0 | 278 | 19 | 27.5 | 1.84
500 | 62 | 438 | 0 | 374 | 38 | 50.66 | 2.09
250 | 17 | 233 | 0 | 591 | 76 | 95.56 | 2.36

Basically, the time in a day can be divided into 24 one-hour slots. Each slot defines a feature for the cell tower T. Each connection record indicates that a user has visited the region covered by T; thus, the connection frequency distribution of a region could possibly reflect the characteristics of its user visiting rate. Given a region, the distributions on different days are supposed to be similar. However, this statement does not hold in reality. Figure 6 demonstrates the connection frequency distributions of a region on different days. We pay attention to the following observations:
– There is no unified pattern for the distributions on different days.
– The distribution on a weekday is more uniform than that on a weekend day, and Friday is in between.
– There may be some missing values, which give zero connections in a time slot.

The reason for the first observation is obvious: since user activity is dynamic, it is hard to find a unified pattern for different days. The second observation is also easy to explain, since people have different living habits on weekdays and weekends. On weekdays, they have a regular time schedule for work and rest, but on weekends, even the same person can take part in different activities. Besides, Friday is a transition between weekdays and weekends, thus it exhibits some mixed characteristics. As for the third observation, it is possibly caused by facility problems, such as poor signal intensity or periodic maintenance of the cell towers.
Furthermore, it is observed from Fig. 6 that the connection distribution in a weekday can
be roughly divided into several intervals. Take Fig. 6a as an instance:
– The frequency is the lowest from 0:00 to 7:00, since this period is the sleeping time for most people.
– The frequency gradually increases from 7:00 to 9:00, and reaches a small peak between 9:00 and 12:00.
– From 12:00 to 14:00, the frequency decreases a little, since this period is the siesta time for some people.
– From 14:00 to 18:00, the frequency maintains a high level and reaches another peak.
– Finally, the frequency begins to decrease until midnight.

Fig. 6 Connection frequency of the cell towers in a region on different days
It is noteworthy that these rules are not strictly obeyed on every weekday. However, they reflect some basic features of the data, which are consistent with people's daily lives. Thus,
we refine the time resolution of a weekday into seven new time slots as listed in Table 2,
and compute the new features as the average number of connections during the slots. As for
the weekend, the 24 time slots are retained. Finally, the feature vector of a given region Ri
is denoted as xi = (xi1 , xi2 , . . . , xiL ), where L = 7 ∗ 4 + 24 ∗ 2 = 76 (i.e., four weekdays
and two weekend days), with each dimension reflecting its user visiting rate in a specific
time slot.
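The following sketch illustrates one way to assemble this 76-dimensional feature vector for a single region, assuming its connection records have already been aggregated into hourly counts per day; the input layout is an assumption, not the authors' data format.

```python
# A sketch of the 76-dimensional feature vector of one region: the average number
# of connections in each of the 7 coarse slots (Table 2) for the 4 weekdays, plus
# the 24 hourly counts for each of the 2 weekend days. The input layout is assumed.
import numpy as np

WEEKDAY_SLOTS = [(0, 7), (7, 9), (9, 12), (12, 14), (14, 18), (18, 21), (21, 24)]

def region_features(hourly_counts, weekday_flags):
    """hourly_counts: (num_days, 24) array of connection counts of one region;
    weekday_flags: length num_days, True for weekdays, False for weekend days."""
    features = []
    for counts, is_weekday in zip(hourly_counts, weekday_flags):
        if is_weekday:
            features.extend(counts[a:b].mean() for a, b in WEEKDAY_SLOTS)  # 7 averages
        else:
            features.extend(counts)                                        # 24 hourly values
    return np.asarray(features)   # length 7*4 + 24*2 = 76 for 4 weekdays + 2 weekend days
```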
3.4 The proposed framework
Given a certain area in a city, a company wants to know whether there should exist any POI
or how many POIs should be there for business planning. Thus, it is useful to resolve these
problems from the viewpoints of both classification and regression.
By considering the issues discussed above, the POI prediction framework is sketched in
Algorithm 1. The algorithm consists of three main steps.
Step 1: The Voronoi diagram step In this step, the city is divided into a number of
consecutive regions based on the Voronoi diagram by taking the cell towers as the seeds.
Then, the number of POIs located in each region is found. (Lines 2 to 3)
Step 2: The clustering step This step performs the k-means clustering algorithm on the
cell towers, and the cell towers with similar geographical locations are grouped into the
same cluster. Each cell tower cluster defines a region of the city, with a feature vector extracted from the cell tower data dumps (i.e., from the cell tower identifier and time of each connection record). (Lines 5 to 15)
Step 3: The supervised learning step Finally, the POI existence (treated as positive if there exists any POI and negative if no POI exists) or the number of POIs is taken as the output target of the region, and a learner f is built up based on these labeled regions as a classification or regression model that will be used to predict the POI existence or the number of POIs in the region, respectively. (Lines 17 to 21)

Table 2 Time slots in a weekday

Time slot | Duration | Activity
0:00 to 7:00 | 7 hours | Sleeping hours
7:00 to 9:00 | 2 hours | Morning rush hours
9:00 to 12:00 | 3 hours | Morning working hours
12:00 to 14:00 | 2 hours | Lunch hours
14:00 to 18:00 | 4 hours | Afternoon working hours
18:00 to 21:00 | 3 hours | Evening rush & dinner hours
21:00 to 24:00 | 3 hours | Home hours
Once the learner f is trained based on a set of given regions, it can be used in two
directions. (1) Prediction of unknown regions: when a new region arrives without any POI information, we can extract its feature vector from the user connection records of the cell tower data dumps and predict the POI existence or the number of POIs by f; then, the company can make business plans based on the classification and regression results.
(2) Evaluation of existing regions: given a region, if the number of POIs predicted by the regression model is smaller than or equal to the actual one, the region may already have an adequate number of POIs; however, if the predicted number is larger than the actual one, it indicates the potential to set up more POIs in the future.
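A minimal sketch of this supervised learning step and of the two usage directions is given below, with scikit-learn estimators standing in for the WEKA learners of Section 4; X (the k x 76 region feature matrix) and poi_counts are assumed to come from the previous steps.

```python
# A sketch of the supervised-learning step and its two usage directions, with
# scikit-learn models standing in for the WEKA learners of Section 4. X is the
# (k, 76) region feature matrix and poi_counts the per-region POI numbers (assumed).
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LinearRegression

def train_poi_models(X, poi_counts):
    y_exist = (poi_counts > 0).astype(int)                      # positive iff the region has any POI
    clf = BaggingClassifier(n_estimators=10).fit(X, y_exist)    # POI-existence classifier
    reg = LinearRegression().fit(X, poi_counts)                 # POI-count regressor
    return clf, reg

def assess_region(clf, reg, x_new, actual_count=None):
    x_new = np.asarray(x_new).reshape(1, -1)
    exists, count = clf.predict(x_new)[0], reg.predict(x_new)[0]
    if actual_count is None:          # (1) prediction for an unknown region
        return exists, count
    # (2) evaluation of an existing region: a predicted count above the actual one
    # suggests room for additional POIs (following the Introduction's interpretation).
    return count - actual_count
```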
4 Implementation and analysis
In this section, we first describe how to implement the classification and regression learning
modes for our proposed framework, and then analyze extensive experimental results to study
the feasibility and effectiveness of the proposed framework.
4.1 Implementation
We here present the implementation details of the classification and regression learning
modes in Algorithm 1. Note that the model selection is not the main concern in this work,
thus we just adopt several widely used parameter settings for the learning algorithms.
Classification learning mode The purpose of this experiment is to correctly identify
whether there exists any POI in a given region of a city. We study six state-of-the-art algorithms for the classification learning model, which are naive bayes classifier (NBC), radial
basis function (RBF) network, SVM, decision tree, bagging, and adaboost. The first four
algorithms are single-classifier-based methods. Among them, NBC is a probabilistic classifier based on Bayes' theorem [21]. We apply the Gaussian function to estimate the class probabilities, where the parameters μ and σ² are computed as the mean and variance of
the training samples in this class, respectively. RBF network [13] is an artificial neural network that adopts the radial basis function as the activation function. SVM [28] is a binary
classification model based on statistical learning theory, which aims to generate an optimal
separating hyper-plane that can maximize the margin between the two referred classes. We
apply the soft-margin SVM with the Gaussian RBF kernel, where the kernel parameter γ
and the slack variable C are set to 1. Decision Tree (DT) [19] is a rule based classifier, which
builds up a knowledge-based expert system by inductive inference from training samples.
The induction of DT is a recursive process that follows a top-down approach by repeatedly splitting the training set. We apply the standard C4.5 algorithm, which adopts the
information gain ratio as the criterion to split nodes. The last two algorithms are aggregated
methods based on ensemble learning. For bagging [6], the bag size is 100, the number of
iterations is 10, and REPTree is employed as the base classifier. For adaboost [8], the weight
threshold is set as 100, the number of iterations is 10, and a decision stump is used as the
base classifier.
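For illustration only, the snippet below lists rough scikit-learn analogues of these six classifiers and parameter settings; the experiments in this paper were run in WEKA, and the RBF network has no direct scikit-learn counterpart, so a small MLP is used as a stand-in.

```python
# Rough scikit-learn analogues of the six WEKA classifiers and settings above
# (illustration only; the RBF network has no direct counterpart, so an MLP stands in).
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

classifiers = {
    "NBC": GaussianNB(),                                  # Gaussian class-conditional estimates
    "RBF-like network": MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000),
    "SVM": SVC(kernel="rbf", gamma=1.0, C=1.0),           # soft margin, Gaussian RBF kernel
    "Decision tree": DecisionTreeClassifier(criterion="entropy"),  # information-gain splits
    "Bagging": BaggingClassifier(n_estimators=10),        # 10 iterations, tree base learners
    "AdaBoost": AdaBoostClassifier(n_estimators=10),      # default base learner is a decision stump
}
```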
Regression learning mode The purpose of this experiment is to directly predict the number of POIs located in a given region of a city. We also study six state-of-the-art algorithms for the regression learning model, which are isotonic regression, linear regression, pace regression, simple linear regression, additive regression, and regression via discretization. The linear regression and the simple linear regression are two regression models based on statistics [22]; the difference between them is that the linear regression has one or more explanatory variables, while the simple linear regression has only one explanatory variable. The isotonic regression [2] is a non-linear model based on numerical analysis, which could be formulated as a quadratic programming (QP) problem. The pace regression [31] is an aggregated method consisting of a group of estimators, which are either overall optimal or conditionally optimal. The additive regression [24] is a nonparametric model, which adopts a smooth function to fit the shape of the training data. Finally, the regression via discretization [27] transforms a regression problem into a classification one, and gets the prediction result by using a classification learning system.
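As an illustration of the last technique, the sketch below implements a simplified regression-via-discretization scheme (bin the continuous target, train a classifier on the bins, and predict the bin mean); it is a stand-in for the WEKA implementation of [27], not a reproduction of it.

```python
# A simplified "regression via discretization" scheme: bin the continuous target,
# train a classifier on the bins, and predict each bin's mean target value
# (a stand-in for the WEKA implementation of [27], not a reproduction of it).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RegressionViaDiscretization:
    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.clf = DecisionTreeClassifier(random_state=0)

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        edges = np.quantile(y, np.linspace(0.0, 1.0, self.n_bins + 1))  # equal-frequency bins
        bins = np.digitize(y, edges[1:-1])              # bin index in 0 .. n_bins-1
        # representative value of each bin = mean target of its training samples
        self.centres_ = np.array([y[bins == b].mean() if np.any(bins == b) else y.mean()
                                  for b in range(self.n_bins)])
        self.clf.fit(X, bins)
        return self

    def predict(self, X):
        return self.centres_[self.clf.predict(X)]
```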
4.2 Experiment settings
We first conduct the experiments on the original data without a pre-clustering process,
then set the number of clusters as {250, 500, 1000, 2500, 5000} and observe the trend of
the result. In order to avoid random effects, we conduct 10-fold cross validation 10 times,
and observe the average values. The experiments are conducted with the standard machine learning toolbox WEKA [14] on a computer with an Intel Core 2 Duo CPU and 4 GB of memory, running 32-bit Windows 7.
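A sketch of this 10 x 10 evaluation protocol with scikit-learn (instead of WEKA) is shown below; the region features X and labels y_exist are assumed to be available from the framework.

```python
# A sketch of the 10 x 10 protocol above: 10 repetitions of 10-fold cross
# validation, averaging the scores (scikit-learn here instead of WEKA).
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import BaggingClassifier

def repeated_cv_accuracy(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()

# Example: mean_acc, std_acc = repeated_cv_accuracy(BaggingClassifier(), X, y_exist)
```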
4.3 Classification results
In a binary classification problem, the result can be summarized into a confusion matrix as
shown in Table 3.
We adopt three metrics to evaluate the performance, i.e., testing accuracy, precision, and
recall, which are respectively defined as:
$$\text{Testing Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (8)$$
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad (9)$$
and
$$\text{Recall} = \frac{TP}{TP + FN}. \qquad (10)$$
Basically, testing accuracy gives the overall rate of correctly classified testing samples, precision gives the correct rate in the set that has been classified as positive, and recall gives
the correct rate in the real positive set.
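For reference, the snippet below computes the three metrics of Eqs. (8)-(10) directly from binary prediction and ground-truth arrays (assumed inputs).

```python
# Testing accuracy, precision, and recall (Eqs. 8-10) from binary prediction and
# ground-truth arrays (assumed inputs with labels 1 = positive, 0 = negative).
import numpy as np

def classification_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (8)
    precision = tp / (tp + fp)                   # Eq. (9)
    recall = tp / (tp + fn)                      # Eq. (10)
    return accuracy, precision, recall
```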
The average values of the 10×10 results (10 trials of 10-fold cross validation) regarding
the three evaluation metrics with different numbers of clusters are shown in Fig. 7. It can
be seen that NBC has obtained the highest precision, but the testing accuracy and recall are
lower than the others. This is probably because NBC is more sensitive to the class-imbalance problem, i.e., it tends to correctly classify the positive samples but wrongly classify the negative samples. Besides, bagging and boosting have shown the most stable performance among the six algorithms. The reason is straightforward: the ensemble mechanism makes the final decision by combining the classification results of multiple classifiers, which usually outperforms a single classifier. Finally, the three statistical algorithms, i.e., RBF network, SVM, and DT, have shown similar performances. They have obtained lower accuracy and precision than bagging and boosting, but have shown the highest recall, which demonstrates that they are less sensitive to the class-imbalance problem.
Table 3 Confusion matrix of classification result

True Positive (TP) | False Positive (FP)
False Negative (FN) | True Negative (TN)

Fig. 7 Comparative results of different classification models (testing accuracy, precision, and recall)

It is observed from Fig. 7 that the changing trend of recall is not clear. From Eqs. 9 and 10, we can see that the difference between precision and recall lies in the denominator, i.e.,
precision is affected by the negative samples that have been wrongly classified as positive, and recall is affected by the positive samples that have been wrongly classified as negative. As shown in Table 1, when a pre-clustering process is performed, the data set becomes imbalanced. Especially when k is small, the number of positive samples is much larger than the number of negative samples. In this case, the learning process gets biased towards predicting positive for almost all the samples by default and frequently misclassifies the negative samples. In other words, it is easy to obtain a False Positive result (classifying a negative sample as positive), but rare to obtain a False Negative result (classifying a positive sample as negative). This explains why the changing trend of recall is not clear.
Furthermore, it can be seen that both precision and recall are mainly decided by the True Positive rate, which represents the number of positive samples that have been correctly classified. Obviously, the smaller k is, the more imbalanced the data set will be. In this case, all the adopted learning algorithms will get biased towards the positive class and lead to a high True Positive rate. However, when k = 5,000, the data set achieves a state that is relatively balanced. In this case, the learning algorithms no longer get biased towards any class; some of them may achieve a higher True Positive rate and others may achieve a lower True Positive rate. As a result, there is an irregularity among different methods when k = 5,000.

Fig. 8 Efficiency of different classification models (training and testing time)

Fig. 9 Comparative results of different regression models (RMSE, RAE, and RRSE)
We put more focus on the testing accuracy and precision, both of which have a clear
increasing trend with the decrease of the number of clusters. In fact, without a pre-clustering
process (shown as 9,563 clusters in Fig. 7), both the testing accuracy and precision are
between 0.5 and 0.6, which are slightly better than a simple random guess. When the number of clusters becomes smaller, both of them have an obvious improvement. That is to
say, the prediction is more effective when the city is divided into larger regions. This
observation is easy to explain, since in a small region, the relation between the number of
connections of cell towers and the user visiting rate may be unreliable due to some uncontrollable factors such as poor signal intensity and periodic maintenance. However, when
the cell towers are grouped geographically, the negative effects of such problems can be
alleviated. For instance, if a cell tower has some missing values, the prediction on it may be
inaccurate. However, if it is grouped into a cluster with other cell towers, such missing values could be alleviated to a certain extent. The larger the cluster is, the better the prediction
result will be. Besides, the adopted learning algorithms have different advantages regarding
different evaluation metrics. For instance, the two ensemble-based learning algorithms can
give much higher testing accuracy than NBC, but fail to outperform it regarding precision.
Thus, it is important to choose an appropriate method based on the requirement and purpose
of the problem.
The training time and testing time of the classification algorithms are given in Fig. 8a and
b, respectively. We can see that SVM is the most time-consuming one for this problem, while the execution times of all the other algorithms are within an acceptable range.
4.4 Regression results
In a regression problem, the most commonly used evaluation metric is the root mean squared
error (RMSE), which is defined as:
$$RMSE = \sqrt{\frac{1}{k}\sum_{i=1}^{k} (y_i - \hat{y}_i)^2}, \qquad (11)$$
where ŷi is the target number of POIs located in the region covered by the i-th cell tower
cluster, yi is the number of POIs predicted by the regression model, and k is the number of
clusters.
Fig. 10 Efficiency of different regression models (training and testing time)
However, as depicted in Table 1, the ranges of the prediction targets differ a lot with
different numbers of clusters, which leads to incomparable results. In this case, we
adopt another two metrics, i.e., relative absolute error (RAE) and root relative squared error
(RRSE), which are respectively defined as:
$$RAE = \frac{\sum_{i=1}^{k} |y_i - \hat{y}_i|}{\sum_{i=1}^{k} |\hat{y}_i - \bar{y}|} \times 100\,\%, \qquad (12)$$
and
$$RRSE = \sqrt{\frac{\sum_{i=1}^{k} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{k} (\hat{y}_i - \bar{y})^2}} \times 100\,\%, \qquad (13)$$
where
$$\bar{y} = \frac{1}{k}\sum_{i=1}^{k} \hat{y}_i. \qquad (14)$$
Basically, the RAE takes the total absolute error and normalizes it by dividing it by the total absolute error of the simple predictor (which always predicts the mean), and the RRSE takes the total squared error and normalizes it by dividing it by the total squared error of the simple predictor. For both the RAE and
RRSE, smaller values are better, and 100 % represents the baseline of just predicting the
mean. Thus, values less than 100 % are considered as effective for predicting the number of
POIs in a region.
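The snippet below computes the RAE and RRSE of Eqs. (12)-(13) with NumPy; y_true plays the role of the actual POI counts (denoted ŷ_i above) and y_pred the model outputs (denoted y_i).

```python
# RAE and RRSE (Eqs. 12-13) in NumPy. y_true holds the actual per-region POI
# counts (the paper's \hat{y}_i) and y_pred the regression outputs (y_i).
import numpy as np

def rae_rrse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    baseline = y_true - y_true.mean()                      # simple predictor: the mean
    rae = 100.0 * np.abs(y_pred - y_true).sum() / np.abs(baseline).sum()
    rrse = 100.0 * np.sqrt(((y_pred - y_true) ** 2).sum() / (baseline ** 2).sum())
    return rae, rrse
```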
The average values of the 10×10 results (10-fold cross validation 10 times) regarding
the RMSE, RAE, and RRSE with different numbers of clusters are depicted in Fig. 9. It can
be seen that most of the selected algorithms, i.e., isotonic regression, linear regression, pace
regression, simple linear regression, and additive regression, have obtained very similar
performances with regard to the three evaluation metrics. This is because all of these algorithms try to construct the regression curve by directly utilizing the input samples. However,
regression via discretization transforms the original samples into some intervals, which may
lose some important information and perform worse than others.
Table 4 The RAE, RRSE, training time, and testing time for regression results when k = 250

Method | RAE (%) | RRSE (%) | Train (s) | Test (s)
Isotonic | 68.05±12.90 | 79.72±14.56 | 0.0175 | 0.0000
Linear | 67.96±12.16 | 78.63±16.91 | 0.0223 | 0.0002
Pace | 69.17±15.18 | 86.12±24.66 | 0.0089 | 0.0002
Simple linear | 68.26±12.54 | 78.48±13.43 | 0.0005 | 0.0000
Additive | 73.40±15.43 | 86.57±21.24 | 0.0238 | 0.0000
Discretization | 81.39±20.72 | 99.72±23.06 | 0.0341 | 0.0125
With the decrease of the number of clusters, the RMSE increases quickly, while the RAE
and RRSE decrease gradually. We put more focus on the RAE and RRSE. Without a pre-clustering process (shown as 9,563 clusters in Fig. 9), both the RAE and RRSE are around
100 %, which cannot perform better than just predicting the mean. However, when the city
is divided into fewer regions with a smaller number of clusters, the error can be reduced in
most cases. In fact, except for regression via discretization, all the other algorithms exhibit a very clear decreasing trend, which demonstrates the effectiveness of the prediction.
The training time and testing time of the regression algorithms are given in Fig. 10a
and b, respectively. Basically, the training time of the six algorithms gradually decreases
when the number of clusters gets smaller. As for the testing time, except for regression via discretization, all the methods are very fast. Finally, the RAE, RRSE, training time,
and testing time for k = 250 are listed in Table 4. When the city is divided into 250 regions,
the linear regression and the simple linear regression can give the lowest RAE and RRSE,
respectively. Besides, both the training time and testing time are below 0.1 second.
4.5 Summary
From both the classification and regression results, we can summarize that when the city is
divided into fewer regions, the predictions of the POI existence and the number of POIs are
more accurate. As aforementioned, one major reason is that the relation between the number
of connections of cell towers and the user visiting rate in a small region is unreliable due
to poor signal intensity and periodic maintenance. However, these negative effects can
be alleviated by defining larger regions. It is hard to tell which learning method is the best,
since different methods can exhibit different advantages with regard to different evaluation
metrics. This leaves us the possibility of improving the performance by designing adaptive algorithms, which could be one of our future research directions.
5 Conclusions
In this paper, we proposed a supervised learning-based framework for predicting the existence of POIs and the number of POIs in a given region using the spatio-temporal features
extracted from cell tower data dumps in Guangzhou, China, and the information of a set
of restaurants collected from the Chinese social network Weibo. The Voronoi diagram is
adopted to divide the Guangzhou city into small and consecutive regions geographically.
Then, a k-means clustering process is performed on the cell towers to merge small regions
into larger ones. The connection frequencies of cell towers are taken as the features of a
region, and a classification or regression model is used to predict the POI existence or the
number of POIs in a given region, respectively. We have studied 12 state-of-the-art classification and regression algorithms. Experimental results show the feasibility and effectiveness
of the proposed framework.
We consider two related research problems as our future work: the problem of determining the value of k and the choice of time resolution. One possible solution to the first
problem is to design a metric for k based on some objective factors. For example, if the metric is based on travel time, we need to find the total length of the roads in the city (length),
the average driving speed of vehicles (speed), and how much time the user is willing to
spend (time); thus, the driving distance speed × time can be taken as the total length of the roads in a cluster, and k can be determined as length/(speed × time). For the second problem, the main idea is to separate two consecutive hours if they exhibit an obvious change of
connection frequency (i.e., a new time slot should begin). In the future, we will collect data
in other cities to verify the effectiveness of this method.
Acknowledgments R. Wang and C.-Y. Chow were partially supported by a research grant (CityU Project
No. 9231131). S. Nutanong was partially supported by a CityU research grant (CityU Project No. 7200387).
This work was also supported by the National Natural Science Foundation of China under the Grant
61402460.
References
1. Bao J, Zheng Y, Mokbel MF (2012) Location-based and preference-aware recommendation using sparse
geo-social networking data. In: ACM SIGSPATIAL
2. Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD (1972) Statistical inference under order
restrictions: The theory and application of isotonic regression. Wiley, New York
3. Becker RA, Caceres R, Hanson K, Loh JM, Urbanek S, Varshavsky A, Volinsky C (2011) A tale of one
city: Using cellular network data for urban planning. IEEE Pervasive Computing 10(4):18–26
4. Birant D, Kut A (2007) ST-DBSCAN: An algorithm for clustering spatial–temporal data. DKE 60(1):208–221
5. Bishop CM (2006) Pattern recognition and machine learning (information science and statistics).
Springer, New York
6. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
7. Chen XM, Liu WQ, Lai JH, Li Z, Lu C (2012) Face recognition via local preserving average
neighborhood margin maximization and extreme learning machine. Soft Comput 16(9):1515–1523
8. Collins M, Schapire RE, Singer Y (2002) Logistic regression, adaboost and bregman distances. Mach
Learn 48(1-3):253–285
9. Ghosh S, Lee K, Moorthy S (1995) Multiple scale analysis of heterogeneous elastic structures using
homogenization theory and voronoi cell finite element method. IJSS 32(1):27–62
10. Goh JY, Taniar D (2004) Mobile data mining by location dependencies. In: IDEAL
11. Gokaraju B, Durbha SS, King RL, Younan NH (2011) A machine learning based spatio-temporal data
mining approach for detection of harmful algal blooms in the Gulf of Mexico. IEEE J-STARS 4(3):710–
720
12. Hartigan JA, Wong MA (1979) Algorithm as 136: A k-means clustering algorithm. J R Stat Soc: Ser C:
Appl Stat 28(1):100–108
13. Haykin S (1994) Neural networks: A comprehensive foundation. Prentice Hall PTR
14. Holmes G, Donkin A, Witten IH (1994) WEKA: A machine learning workbench. In: ANZIIS
15. Isaacman S, Becker R, Cáceres R, Kobourov S, Martonosi M, Rowland J, Varshavsky A (2011)
Identifying important places in people’s lives from cellular network data. In: Pervasive Computing
16. Kanasugi H, Sekimoto Y, Kurokawa M, Watanabe T, Muramatsu S, Shibasaki R (2013) Spatiotemporal
route estimation consistent with human mobility using cellular network data. In: IEEE PerCom
17. Miller HJ, Han J (2009) Geographic data mining and knowledge discovery. CRC Press
18. Pan B, Zheng Y, Wilkie D, Shahabi C (2013) Crowd sensing of traffic anomalies based on human
mobility and social media. In: ACM SIGSPATIAL
19. Quinlan JR (1996) Improved use of continuous attributes in C4.5. JAIR 4:77–90
20. Ratti C, Williams S, Frenchman D, Pulselli RM (2006) Mobile landscapes: using location data from cell
phones for urban analysis. Environ Plan B: Planning and Design 33(5):727
21. Rish I (2001) An empirical study of the naive bayes classifier. In: IJCAI
22. Seber GAF, Lee AJ (2012) Linear regression analysis, volume 936. John Wiley & Sons
23. Sheather SJ, Jones MC (1991) A reliable data-based bandwidth selection method for kernel density
estimation. JRSS, Series B 53(3):683–690
24. Stone CJ (1985) Additive regression and other nonparametric models. Ann Stat:689–705
25. Tong S, Koller D (2002) Support vector machine active learning with applications to text classification.
J Mach Learn Res 2:45–66
26. Toole JL, Ulm M, González MC, Bauer D (2012) Inferring land use from mobile phone activity. In:
ACM UrbComp
27. Torgo L, Gama J (1996) Regression by classification. In: Advances in Artificial Intelligence, pp 51–60
28. Vapnik V (2000) The nature of statistical learning theory. Springer
29. Vieira MR, Frias-Martinez V, Oliver N, Frias-Martinez E (2010) Characterizing dense urban areas from
mobile phone-call data: Discovery and social dynamics. In: IEEE SocialCom
30. Wang L, Huang YP, Luo XY, Wang Z, Luo SW (2011) Image deblurring with filters learned by extreme
learning machine. Neurocomputing 74(16):2464–2474
31. Wang Y, Witten IH (1999) Pace regression. Technical Report 99/12, Department of Computer Science,
The University of Waikato
32. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE TEVC 1(1):67–82
33. Yavaṡ G, Katsaros D, Ulusoy Ö, Manolopoulos Y (2005) A data mining approach for location prediction
in mobile environments. DKE 54(2):121–146
34. Ye M, Yin P, Lee W-C, Lee D-L (2011) Exploiting geographical influence for collaborative point-of-interest recommendation. In: ACM SIGSPATIAL
35. Yuan J, Zheng Y, Xie X (2012) Discovering regions of different functions in a city using human mobility
and pois. In: ACM SIGKDD
36. Yuan J, Zheng Y, Xie X, Sun G (2013) T-drive: Enhancing driving directions with taxi drivers’
intelligence. IEEE TKDE 25(1):220–232
37. Zha Z, Wang M, Zheng Y, Yang Y, Hong R, Chua T (2012) Interactive video indexing with statistical
active learning. IEEE TMM 14(1):17–27
38. Zhang J-D, Chow C-Y (2013) iGSLR: Personalized geo-social location recommendation: A kernel
density estimation approach. In: ACM SIGSPATIAL
39. Zheng J, Liu S, Ni LM (2013) Effective routine behavior pattern discovery from sparse mobile phone
data via collaborative filtering. In: IEEE PerCom
40. Zheng Y, Chen Y, Xie X, Ma WY (2009) Geolife2.0: A location-based social networking service. In:
IEEE MDM
Ran Wang received the B.Sc. degree in computer science from the College of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2009, and the Ph.D. degree from the City University
of Hong Kong, Hong Kong, in 2014. She is currently a Post-Doctoral Senior Research Associate with the
Department of Computer Science, the City University of Hong Kong. Her current research interests include
pattern recognition, machine learning, fuzzy sets and fuzzy logic, and their related applications.
Chi-Yin Chow received the M.S. and Ph.D. degrees from the University of Minnesota-Twin Cities in 2008
and 2010, respectively. He is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. His research interests include spatio-temporal data management and analysis, GIS,
mobile computing, and location-based services. He is the co-founder and co-organizer of ACM SIGSPATIAL
MobiGIS 2012, 2013, and 2014.
Yan Lyu received the M.S. degree in pattern recognition and intelligent systems from University of Science and Technology of China, China, in 2013. She is currently working toward the Ph.D. degree in the
Department of Computer Science, City University of Hong Kong. Her research interests include data mining,
machine learning and location-based services.
Victor C. S. Lee received the Ph.D. degree in computer science from the City University of Hong Kong
in 1997. He is an Assistant Professor with the Department of Computer Science, City University of Hong
Kong. His research interests include data management in mobile computing systems, real-time databases,
and performance evaluation. Dr. Lee is a member of the ACM and IEEE Computer Society. From 2006 to
2007, he was the Chairman of the Computer Chapter, IEEE Hong Kong Section.
Sarana Nutanong is an Assistant Professor in the Department of Computer Science at City University of
Hong Kong. He received his PhD from the University of Melbourne. Before joining CityU in January 2014,
he was a Postdoctoral Research Associate at University of Maryland Institute for Advanced Computer Studies
between 2010 and 2012 and held a research faculty position at the Johns Hopkins University from 2012 to
2013. His research interests include scientific data management, data-intensive computing, spatial-temporal
query processing, and large-scale machine learning. More specifically, his research is aimed at providing a
large-scale, high-throughput support for computational scientific exploration applications.
Yanhua Li is a researcher with HUAWEI Noah’s Ark LAB, Hong Kong. He obtained two PhD degrees
in computer science from University of Minnesota, Twin Cities in 2013, and in electrical engineering from
Beijing University of Posts and Telecommunications in 2009. His broad research interests are in analyzing,
understanding, and making sense of big data generated from various complex networks in many contexts. His
specific interests include large-scale network data sampling, measurement, and performance analysis, and
spatio-temporal data analytics. He has held visiting positions in Bell Labs in New Jersey, Microsoft Research
Asia, and HUAWEI research labs of America. He served on TPC of INFOCOM 2015, ICDCS 2014, 2015,
and he is the co-chair of SIMPLEX 2015.
Mingxuan Yuan is currently a Researcher of Huawei Noah’s Ark lab, Hong Kong. Before that, he served
as a PostDoc fellow in the Department of Computer Science and Engineering of the Hong Kong University
of Science and Technology. His research interests include big telecom (spatiotemporal) data storage/mining,
telecom data mining and data privacy.