Geoinformatica
DOI 10.1007/s10707-015-0237-7

Exploring cell tower data dumps for supervised learning-based point-of-interest prediction (industrial paper)

Ran Wang¹ · Chi-Yin Chow¹ · Yan Lyu¹ · Victor C. S. Lee¹ · Sarana Nutanong¹ · Yanhua Li² · Mingxuan Yuan³

Received: 3 December 2014 / Revised: 20 July 2015 / Accepted: 28 September 2015
© Springer Science+Business Media New York 2015

Chi-Yin Chow: chiychow@cityu.edu.hk
Ran Wang: ranwang3-c@my.cityu.edu.hk
Yan Lyu: yanlv2-c@mycityu.edu.hk
Victor C. S. Lee: csvlee@cityu.edu.hk
Sarana Nutanong: snutanon@cityu.edu.hk
Yanhua Li: yli15@wpi.edu
Mingxuan Yuan: yuan.mingxuan@huawei.com

¹ Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
² Department of Computer Science, Worcester Polytechnic Institute (WPI), Worcester, USA
³ Huawei Noah’s Ark Lab, Shatin, Hong Kong

Abstract Exploring massive mobile data for location-based services becomes one of the key challenges in mobile data mining. In this paper, we investigate a problem of finding a correlation between the collective behavior of mobile users and the distribution of points of interest (POIs) in a city. Specifically, we use large-scale cell tower data dumps collected from cell towers and POIs extracted from a popular social network service, Weibo. Our objective is to make use of the data from these two different types of sources to build a model for predicting the POI densities of different regions in the covered area. An application domain that may benefit from our research is a business recommendation application, where a prediction result can be used as a recommendation for opening a new store/branch. The crux of our contribution is the method of representing the collective behavior of mobile users as a histogram of connection counts over a period of time in each region.
This representation ultimately enables us to apply a supervised learning algorithm to our problem in order to train a POI prediction model using the POI data set as the ground truth. We studied 12 state-of-the-art classification and regression algorithms; experimental results demonstrate the feasibility and effectiveness of the proposed method.

Keywords Spatio-temporal data analysis · Classification · Regression · Cell tower data dumps · Point-of-interest prediction

1 Introduction

The ubiquity of mobile devices such as smartphones and tablet computers enables us to collect useful spatial and temporal data on a large scale and also opens up the possibility of extracting useful information from the data [10, 17, 33]. For example, a popular mapping service, Google Maps, makes use of real-time GPS records obtained from the users of Google Location Services to show the current traffic conditions of different road segments on the maps. Another example is a driving direction recommendation system called T-Drive [36], which makes use of trajectories collected from over 33,000 taxis over a period of three months to compute the fastest route for users.

In this paper, we focus on a specific type of mobile-user data known as cell tower data dumps, which contain connection records collected by 9,563 cell towers operated by China Mobile Limited¹ in Guangzhou, China, as illustrated in Fig. 1a. This data set was collected over a period of six days (from 4 September 2013 to 9 September 2013). For the purpose of this investigation, we focus on records produced by phone calls and SMSs. For each record, we use the connection time, and the identifier and location of the cell tower. We extracted 18,290 restaurants in Guangzhou from Weibo², a popular Chinese social network web site, as our point-of-interest (POI) data set, as depicted in Fig. 1b.
The main objective of our research is to make use of the cell phone and POI data sets to help predict the existence of a POI and the number of POIs in the vicinity of a cell tower. An application domain that may benefit from POI prediction is a business recommendation application, where a company is interested in generalizing the pattern of POIs of a particular type (e.g., a coffee shop) in order to identify areas that have a great potential of supporting its business but have not been fully utilized yet. Our investigation is driven by a hypothesis that there is a correlation between the collective behavior of mobile users and the existence of a certain type of POIs in a certain area. The main challenge of this work is twofold:

(1) Representation. To test our hypothesis, we should find a meaningful representation of collective mobile user behaviors by summarizing a large amount of data extracted from the cell tower data dumps. For example, the cell tower network in a city like Guangzhou generates user connection records on the scale of tens of gigabytes on a daily basis.

(2) Application. To provide location-based services (LBS), we should find effective techniques to predict the existence of a certain type of POIs and the number of POIs in a certain area. For example, if our framework predicts that a certain area should have restaurants, but that area does not have any restaurant, it has potential for a new restaurant.

¹ http://www.chinamobileltd.com
² http://weibo.com

Fig. 1 Geographical distribution of cell towers and restaurants in the Guangzhou city of China: (a) cell towers, (b) restaurants (axes: longitude, latitude)

To overcome the representation challenge, mobile user data can be summarized in two different ways.
The first method is to group the records by users, and get an action list or a moving trajectory of each user; the other is to group them by cell towers, and get the spatio-temporal features of the geographical areas. In this work, we adopt the second method for the following reasons.

– There is no exact location information for users. Each record shows the cell tower that a mobile device is connected to rather than the exact location of the device. In fact, even if the user stays in a fixed location, he or she may connect to different cell towers due to some uncontrollable factors such as signal intensity and facility maintenance.
– The number of cell tower connections of different users has a small mean and a large standard deviation. That is to say, one user may make a number of connections in a day while another user may make no connection at all. As a result, the numbers of connections made by different users vary tremendously, and the average number is too small for the data to be considered as a trajectory data set.

The result of the cell-tower-based summarization method is a spatio-temporal data set with cell towers spanning the spatial dimensions. In the temporal dimension, each cell tower is associated with a histogram of connection counts where each histogram bin occupies a time period of one hour. In this way, the collective behavior of mobile users of an entire city is compactly represented as connection counts over a period of time from different cell towers.

For the application challenge, we aim to design a framework to build up a model between the features of mobile user behaviors and LBS.
In particular, we study how to employ state-of-the-art supervised learning algorithms to design (i) a classification model to predict the POI existence (i.e., naive Bayes [21], radial basis function (RBF) network [13], support vector machine (SVM) [28], decision trees (DT) [19], bagging [6], AdaBoost [8]) and (ii) a regression model to predict the number of POIs (i.e., simple linear regression, linear regression [22], isotonic regression [2], pace regression [31], additive regression [24], and regression via discretization [27]).

In general, the contributions of our work can be summarized as follows.

– We formulate a generic representation method of summarizing cell tower data dumps for mobile user behaviors.
– We design a framework with classification and regression algorithms to build up a model between mobile users’ behaviors and LBS for business recommendation applications.
– We conduct an extensive evaluation of our framework on real cell tower data dumps and a POI data set. Experimental results show that there is a strong correlation between the collective behavior of mobile users and the restaurant data set and demonstrate the feasibility and effectiveness of the proposed framework.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to supervised machine learning, and highlights related work. In Section 3, we describe how to predict the POI existence and the number of POIs based on the cell tower data dumps, and present the proposed framework. In Section 4, we present implementation details and analyze extensive experimental results to study the feasibility and effectiveness of the proposed framework. Finally, Section 5 concludes this paper.

2 Related work

Most existing work on mobile and spatio-temporal data focuses on recommender systems [1, 34, 38], urban planning [3], functional region discovery [35], social networking services [40], etc.
In particular, mobile phone call data and cellular network data are often used to discover useful information in various scenarios such as traffic anomalies [18], regions of different functions in a city [35], routine behavior patterns of people [16, 39], and important places [15]. Besides, they are also used for urban analysis [20] and urban planning, such as characterizing dense urban areas [29] and capturing city dynamics [3]. In general, the most commonly used techniques include collaborative filtering, density estimation, and image and signal processing. However, none of these works focuses on machine learning, especially supervised learning, which is also a potential tool to mine useful information and make accurate predictions on mobile phone call data or cellular network data for valuable location-based applications.

Supervised learning [5] refers to the problem of inferring a model from a set of labeled training samples, in order to achieve accurate predictions on unseen data. Given a training set $X$ with $N$ labeled samples, i.e., $X = \{(x_i, y_i)\}_{i=1}^{N}$, each sample is associated with a set of conditional attributes $x_i = \{x_{i1}, x_{i2}, \ldots, x_{iL}\}$ and a decision attribute $y_i$. The goal is to learn a function $f : x \to y$, such that given a new unlabeled sample $\hat{x} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_L\}$, its desired output value can be predicted by $\hat{y} = f(\hat{x})$. The learning task is classification if the decision attribute is discrete, and regression if it is continuous.

In order to solve a supervised learning problem, the solution has to perform the steps shown in Fig. 2. Each step has unique significance and may affect the final performance. Currently, the most widely used classification models include the naive Bayes classifier (NBC) [21], support vector machines (SVMs) [28], decision trees (DTs) [19], artificial neural networks (ANNs) [13], etc.
The most widely used regression models include linear regression [22], pace regression [31], isotonic regression [2], etc. Due to the well-known no-free-lunch theorem [32], no algorithm can perform best on all problems. Thus, each model has unique advantages that can be exploited under certain conditions, and each one has its own restrictions that may affect the final performance.

Fig. 2 Structure of a supervised learning process

Supervised learning covers a wide range of application domains such as image processing [30], text classification [25], face recognition [7], video indexing [37], etc. Besides, several learning techniques have been applied to mobile and spatio-temporal data in recent literature. In [11], a kernel-based SVM is used as a classifier in the detection of harmful algal blooms in the Gulf of Mexico based on mobile data. In [26], the random forest approach is used to classify the land usage in a city based on mobile phone activities. In [4], a density-based clustering algorithm is proposed for a wide range of spatio-temporal data. To the best of our knowledge, no one has applied supervised learning models to predict the POI existence or the number of POIs in a certain region of a city using cell tower data dumps.

3 Using supervised learning for POI predictions

In this section, we describe how to apply supervised learning models to predict the POI existence or the number of POIs in a region of a city based on the cell tower data dumps and Weibo POI data.

3.1 Pre-clustering of cell towers

As demonstrated in Fig. 1, the geographical distributions of the cell towers and POIs in Guangzhou city are roughly consistent with each other. That is to say, if a given region has a larger number of cell towers, it also has a high chance of covering a larger number of POIs, and vice versa. Besides, the density of cell towers is also related to the user visiting rate.
For example, the downtown is usually the most popular and busiest area in a city, so it records the highest user visiting rate, and thus needs more cell towers. In comparison, very few people visit the suburbs in a day, so the density of cell towers is low in such areas. Based on these observations, it is possible to predict the POI existence or the number of POIs in a region from the user visiting rate, which is reflected by the number of connections established by cell towers in that region.

Given $N$ cell towers $T = \{T_1, T_2, \ldots, T_N\}$ with geographical location information, we denote $T_i = (t_{i1}, t_{i2})$, where $t_{i1}$ and $t_{i2}$ represent the longitude and latitude of $T_i$, $i = 1, 2, \ldots, N$, respectively. The intuitive scheme is to divide the city into $N$ regions $R = \{R_1, R_2, \ldots, R_N\}$, such that each region contains one cell tower. These regions can be defined by the Voronoi diagram [9], which treats each cell tower as a seed. Any point in a region is at least as close to the seed of that region as to the seed of any other region, i.e.,

$$\forall x \in R_i, \quad d(x, T_i) \le d(x, T_j),$$

where $i \in \{1, 2, \ldots, N\}$, and $j = 1, \ldots, i-1, i+1, \ldots, N$. An example of the Voronoi diagram with 10 seeds located in a unit square is given in Fig. 3.

Suppose there is a set of $M$ POIs $P = \{P_1, P_2, \ldots, P_M\}$ with geographical location information. We denote $P_i = (p_{i1}, p_{i2})$, where $p_{i1}$ and $p_{i2}$ represent the longitude and latitude of $P_i$, $i = 1, 2, \ldots, M$, respectively. For a given POI $P_i$, the region that covers it can be found by a nearest neighbor (NN) search among $T$. Finally, the number of POIs in each region is computed as the target that we aim to predict.

However, when it comes to a real application, we have to consider the following two issues:

– The signal intensity of a cell tower is not stable, which leads to an unreliable relation between the POI density and the number of connections.
  For example, given two neighboring regions with similar POI densities, their user visiting rates are also supposed to be similar. However, the signal intensity of one cell tower may be much stronger than that of the other. Thus, when a user is in an intermediate location between them, the stronger one will always make the connection for the user.
– Due to an unbalanced distribution of cell towers, the separated regions may be too small in downtown and too large in the suburbs. As a result, the numbers of POIs covered by them may be balanced out and show no obvious difference.

In order to overcome the above-mentioned problems, we conduct a pre-clustering process on the cell towers, such that the cell towers with similar geographical information are grouped into one cluster. Accordingly, their regions defined by the Voronoi diagram are merged, and the numbers of their covered POIs are summed up as the target that we aim to predict. We adopt the k-means clustering technique [12], the most widely used one, which aims to partition $N$ observations (i.e., $T_1, T_2, \ldots, T_N$) into $k$ sets (i.e., $S = \{S_1, S_2, \ldots, S_k\}$), so as to minimize the within-cluster sum of squares:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{T_j \in S_i} \lVert T_j - \mu_i \rVert^2, \tag{1}$$

where

$$\mu_i = \frac{1}{k_i} \sum_{T_j \in S_i} T_j, \tag{2}$$

and $k_i$ is the number of cell towers in the $i$-th cluster.

In this work, $k$ is treated as an input parameter that is related to the evaluation unit and defined by the user. For instance, the user could choose a smaller $k$ to evaluate larger regions and a larger $k$ to evaluate smaller regions. In other words, there is no best or worst value of $k$; its value depends on the user's needs. Since it is impractical to try all possible values of $k$, we test several representative values, i.e., {250, 500, 1000, 2500, 5000}. Due to limited space, we only plot the clustering result when $k = 250$, as shown in Fig. 4.

Fig. 3 Voronoi diagram with 10 seeds in a unit square

Fig. 4 Pre-clustering result of cell towers when k = 250 (axes: longitude, latitude)

3.2 Density distribution of the number of POIs

We use kernel density estimation to obtain the distribution characteristics of the number of POIs, in order to investigate whether the data is suitable for supervised learning models. Kernel density estimation is a generalized form of the histogram, which gives a continuous distribution for a set of observations. Given $k$ cell tower clusters grouped by the k-means algorithm (i.e., $\{S_1, S_2, \ldots, S_k\}$), the covered region of $S_i$ is denoted by $R_i^*$, and the number of POIs located in $R_i^*$ is denoted by $n_i$. The kernel density estimation of the number of POIs is then

$$\hat{f}_h(n) = \frac{1}{k} \sum_{i=1}^{k} K_h(n - n_i) = \frac{1}{kh} \sum_{i=1}^{k} K\!\left(\frac{n - n_i}{h}\right), \tag{3}$$

where $n$ is the argument of the density estimation, i.e., the number of POIs in a region, $K$ is the kernel function and $h$ is the bandwidth. By applying the Gaussian kernel, i.e.,

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right), \tag{4}$$

the estimator (3) becomes

$$\hat{f}_h(n) = \frac{1}{kh\sqrt{2\pi}} \sum_{i=1}^{k} \exp\!\left(-\frac{(n - n_i)^2}{2h^2}\right). \tag{5}$$

According to [23], we select the optimal bandwidth as $h = (4\hat{\sigma}^2 / 3k)^{1/5}$, where $\hat{\sigma}$ is the standard deviation of $\{n_1, \ldots, n_k\}$. Finally, the density distribution of the number of POIs is derived as shown in Fig. 5.

Fig. 5 Density distribution of the number of POIs (panels a to f: kernel density estimation; x-axis: number of POIs, y-axis: density)

Figure 5a gives the distribution of the original data without the pre-clustering process. It is easy to observe that many cell towers do not cover any POI, and when the number of POIs is larger than 10, the density is approximately zero. That is to say, the numbers of POIs covered by the cell towers show no obvious difference. In this case, it is hard to establish a supervised learning model for either classification or regression. However, the distribution becomes more reasonable with a pre-clustering process, as shown in Fig. 5b to f. As the number of clusters decreases, we have the following observations:

– The difference among clusters becomes more obvious, with a larger range of numbers of POIs per cluster.
– The distribution becomes smoother, with a smaller range of density.
– The percentage of clusters that do not cover any POI becomes much smaller.

Table 1 reports the statistics of the cell tower clusters. It is clear that as the number of clusters decreases, the ratio of empty clusters becomes smaller, and the average number of POIs per cluster becomes larger. Besides, we define the inter-cluster standard deviation (SD) as

$$\sigma_1 = \sqrt{\frac{1}{k} \sum_{i=1}^{k} (n_i - \mu)^2}, \tag{6}$$

and the intra-cluster SD as

$$\sigma_2 = \frac{1}{k} \sum_{i=1}^{k} \sqrt{\frac{1}{k_i} \sum_{j \in R_i^*} (n_{ij} - \mu_i)^2}, \tag{7}$$

where $\mu = \frac{1}{k} \sum_{i=1}^{k} n_i$, $\mu_i = \frac{1}{k_i} \sum_{j \in R_i^*} n_{ij}$, $n_{ij}$ is the number of POIs covered by the $j$-th cell tower in $R_i^*$, and $k_i$ is the number of cell towers in $R_i^*$. Both $\sigma_1$ in Eq. 6 and $\sigma_2$ in Eq. 7 increase as the number of clusters decreases. However, $\sigma_1$ increases much faster than $\sigma_2$, which demonstrates that the pre-clustering process can enlarge the difference among clusters while retaining the similarity of cell towers in the same cluster.
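The quantities of this section can be sketched in a few lines of plain Python. The block below is an illustration only, not the authors' implementation: the function names and the toy counts are hypothetical, the bandwidth `h` is passed in by the caller rather than derived, and the square roots in the two SD functions are an assumption implied by the term "standard deviation" in Eqs. 6 and 7.

```python
import math


def gaussian_kde(counts, h):
    """Return the Gaussian kernel density estimator of Eq. (5).

    counts: POI counts n_1..n_k, one per region; h: bandwidth (> 0).
    """
    k = len(counts)
    norm = 1.0 / (k * h * math.sqrt(2.0 * math.pi))

    def f_hat(n):
        return norm * sum(
            math.exp(-((n - n_i) ** 2) / (2.0 * h * h)) for n_i in counts
        )

    return f_hat


def inter_cluster_sd(cluster_counts):
    """Eq. (6): SD of the per-cluster POI totals n_i around their mean."""
    k = len(cluster_counts)
    mu = sum(cluster_counts) / k
    return math.sqrt(sum((n - mu) ** 2 for n in cluster_counts) / k)


def intra_cluster_sd(tower_counts_by_cluster):
    """Eq. (7): average over clusters of the SD of per-tower counts n_ij."""
    sds = []
    for towers in tower_counts_by_cluster:
        mu_i = sum(towers) / len(towers)
        sds.append(math.sqrt(sum((n - mu_i) ** 2 for n in towers) / len(towers)))
    return sum(sds) / len(sds)


# Hypothetical toy data: three clusters, each a list of per-tower POI counts.
TOWER_COUNTS = [[0, 1, 2], [4, 4], [0, 0, 0, 9]]
CLUSTER_COUNTS = [sum(towers) for towers in TOWER_COUNTS]  # n_i per cluster

if __name__ == "__main__":
    f_hat = gaussian_kde(CLUSTER_COUNTS, h=1.0)  # h chosen arbitrarily here
    print("density at n = 8:", f_hat(8))
    print("inter-cluster SD:", inter_cluster_sd(CLUSTER_COUNTS))
    print("intra-cluster SD:", intra_cluster_sd(TOWER_COUNTS))
```

Keeping the bandwidth as an explicit argument makes it easy to compare a rule-of-thumb value with hand-picked ones when inspecting how smooth the estimated density is.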
Table 1 Cell tower cluster information

No.       Zero-POI  Non-zero-POI  Min. POIs    Max. POIs    Avg. POIs    Inter-cluster  Intra-cluster
clusters  clusters  clusters      per cluster  per cluster  per cluster  SD             SD
9,563     5,292     4,271         0            192          2            4.26           0
5,000     2,187     2,813         0            192          4            7.12           0.78
2,500     880       1,620         0            192          8            13.02          1.28
1,000     202       798           0            278          19           27.5           1.84
500       62        438           0            374          38           50.66          2.09
250       17        233           0            591          76           95.56          2.36

3.3 Refine time resolution

We aim to use spatio-temporal data to perform POI predictions. In Sections 3.1 and 3.2, we have introduced how to make use of the spatial data. In this section, we further discuss how to make use of the temporal data.

Basically, the time in a day can be divided into 24 slots of one hour each. Each slot defines a feature for the cell tower $T$. Each connection record indicates that a user has visited the region covered by $T$, thus the connection frequency distribution of a region can reflect the characteristics of its user visiting rate. Given a region, the distributions on different days are supposed to be similar. However, this assumption does not hold in reality. Figure 6 shows the connection frequency distributions of a region on different days. We make the following observations:

– There is no unified pattern for the distributions on different days.
– The distribution on a weekday is more uniform than that on a weekend day, and Friday is in between.
– There may be some missing values, which give zero connections in a time slot.

The reason for the first observation is obvious. Since user activity is dynamic, it is hard to find a unified pattern for different days. The second observation is also easy to explain, since people have different living habits on weekdays and weekends. On weekdays, they have a regular schedule for work and rest, but on weekends, even the same person can take part in different activities.
Besides, Friday is a transition between weekdays and weekends, thus it exhibits characteristics of both. The third observation is possibly caused by facility problems, such as a poor signal intensity or periodic maintenance of the cell towers.

Fig. 6 Connection frequency of the cell towers in a region in different days

Furthermore, it is observed from Fig. 6 that the connection distribution on a weekday can be roughly divided into several intervals. Take Fig. 6a as an example:

– The frequency is the lowest from 0:00 to 7:00, since this period is the sleeping time for most people.
– The frequency gradually increases from 7:00 to 9:00, and reaches a small peak between 9:00 and 12:00.
– From 12:00 to 14:00, the frequency decreases slightly, since this period is the siesta time for some people.
– From 14:00 to 18:00, the frequency stays at a high level and reaches another peak.
– Finally, the frequency decreases until midnight.

It is noteworthy that these rules are not strictly obeyed by all the weekdays. However, they reflect some basic features of the data, which are consistent with people's daily life. Thus, we refine the time resolution of a weekday into seven new time slots as listed in Table 2, and compute the new features as the average number of connections during the slots. As for the weekend, the 24 time slots are retained. Finally, the feature vector of a given region $R_i$ is denoted as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iL})$, where $L = 7 \times 4 + 24 \times 2 = 76$ (i.e., four weekdays and two weekend days), with each dimension reflecting its user visiting rate in a specific time slot.

3.4 The proposed framework

Given a certain area in a city, a company wants to know whether there should exist any POI or how many POIs should be there for business planning. Thus, it is useful to address these problems from the viewpoints of both classification and regression.
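As a concrete illustration of the feature construction described in Section 3.3, the following minimal sketch turns six days of hourly connection counts into the 76-dimensional feature vector. It is an assumption-laden example rather than the authors' code: the names `WEEKDAY_SLOTS`, `day_features` and `region_features` are hypothetical, the slot boundaries are those of Table 2, "average number of connections" is interpreted as the mean of the hourly counts inside a slot, and the features are concatenated in the order in which the days are supplied.

```python
# Weekday slot boundaries in hours, [start, end), following Table 2.
WEEKDAY_SLOTS = [(0, 7), (7, 9), (9, 12), (12, 14), (14, 18), (18, 21), (21, 24)]


def day_features(hourly_counts, is_weekend):
    """Map 24 hourly connection counts of one day to that day's features.

    Weekend days keep all 24 hourly slots; weekdays are reduced to the
    seven slots above, each represented by its mean hourly count.
    """
    if len(hourly_counts) != 24:
        raise ValueError("expected 24 hourly counts")
    if is_weekend:
        return [float(c) for c in hourly_counts]
    return [sum(hourly_counts[a:b]) / (b - a) for a, b in WEEKDAY_SLOTS]


def region_features(days):
    """Concatenate per-day features for one region.

    days: iterable of (hourly_counts, is_weekend) pairs, one per day.
    """
    features = []
    for hourly_counts, is_weekend in days:
        features.extend(day_features(hourly_counts, is_weekend))
    return features


if __name__ == "__main__":
    # Hypothetical hourly counts; four weekdays followed by two weekend days.
    sample_day = list(range(24))
    days = [(sample_day, False)] * 4 + [(sample_day, True)] * 2
    x = region_features(days)
    print(len(x))  # 7 * 4 + 24 * 2 features
```

Averaging inside each weekday slot keeps slots of different lengths on a comparable scale, which a plain sum would not.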
By considering the issues discussed above, the POI prediction framework is sketched in Algorithm 1. The algorithm consists of three main steps.

Step 1: The Voronoi diagram step. In this step, the city is divided into a number of contiguous regions based on the Voronoi diagram by taking the cell towers as the seeds. Then, the number of POIs located in each region is found. (Lines 2 to 3)

Step 2: The clustering step. This step performs the k-means clustering algorithm on the cell towers, and the cell towers with similar geographical locations are grouped into the same cluster. Each cell tower cluster defines a region of the city, with a feature vector extracted from the cell tower data dumps (i.e., from the cell tower identifier and time of each connection). (Lines 5 to 15)

Step 3: The supervised learning step. Finally, the POI existence (treated as positive if there exists any POI and negative if no POI exists) or the number of POIs is taken as the output target of the region, and a learner $f$ is built from these labeled regions as a classification or regression model that will be used to predict POI existence or the number of POIs in the region, respectively. (Lines 17 to 21)

Table 2 Time slots in a weekday

Time slot       Duration  Activity
0:00 to 7:00    7 hours   Sleeping hours
7:00 to 9:00    2 hours   Morning rush hours
9:00 to 12:00   3 hours   Morning working hours
12:00 to 14:00  2 hours   Lunch hours
14:00 to 18:00  4 hours   Afternoon working hours
18:00 to 21:00  3 hours   Evening rush & dinner hours
21:00 to 24:00  3 hours   Home hours

Once the learner $f$ is trained based on a set of given regions, it can be used in two directions.
(1) Prediction of unknown regions: for a new region without any POI information, we can extract its feature vector from the user connection records of the cell tower data dumps and predict the POI existence or the number of POIs by $f$. The company can then make business plans based on the classification and regression results.

(2) Evaluation of existing regions: given a region, if the number of POIs predicted by the regression model is smaller than or equal to the actual one, the region may already have an adequate number of POIs; however, if the predicted number is larger than the actual one, it indicates a possibility to set up more POIs in the future.

4 Implementation and analysis

In this section, we first describe how to implement the classification and regression learning modes for our proposed framework, and then analyze extensive experimental results to study the feasibility and effectiveness of the proposed framework.

4.1 Implementation

We here present the implementation details of the classification and regression learning modes in Algorithm 1. Note that model selection is not the main concern in this work, thus we simply adopt several widely used parameter settings for the learning algorithms.

Classification learning mode. The purpose of this experiment is to correctly identify whether there exists any POI in a given region of a city. We study six state-of-the-art algorithms for the classification learning mode, which are the naive Bayes classifier (NBC), radial basis function (RBF) network, SVM, decision tree, bagging, and AdaBoost.

The first four algorithms are single-classifier methods. Among them, NBC is a probabilistic classifier based on Bayes' theorem [21]. We apply the Gaussian function to estimate the class probabilities, where the parameters $\mu$ and $\sigma^2$ are computed as the mean and variance of the training samples in the class, respectively. The RBF network [13] is an artificial neural network that adopts the radial basis function as the activation function.
SVM [28] is a binary classification model based on statistical learning theory, which aims to generate an optimal separating hyperplane that maximizes the margin between the two classes. We apply the soft-margin SVM with the Gaussian RBF kernel, where the kernel parameter $\gamma$ and the penalty parameter $C$ are both set to 1. Decision Tree (DT) [19] is a rule-based classifier, which builds up a knowledge-based expert system by inductive inference from training samples. The induction of a DT is a recursive process that follows a top-down approach by repeatedly splitting the training set. We apply the standard C4.5 algorithm, which adopts the information gain ratio as the criterion to split nodes.

The last two algorithms are aggregated methods based on ensemble learning. For bagging [6], the bag size is 100, the number of iterations is 10, and REPTree is employed as the base classifier. For AdaBoost [8], the weight threshold is set to 100, the number of iterations is 10, and a decision stump is used as the base classifier.

Regression learning mode. The purpose of this experiment is to directly predict the number of POIs located in a given region of a city. We also study six state-of-the-art algorithms for the regression learning mode, which are isotonic regression, linear regression, pace regression, simple linear regression, additive regression, and regression via discretization. Linear regression and simple linear regression are two regression models based on statistics [22]; the difference between them is that linear regression has one or more explanatory variables, while simple linear regression has only one explanatory variable. Isotonic regression [2] is a non-linear model based on numerical analysis, which can be formulated as a quadratic programming (QP) problem. Pace regression [31] is an aggregated method consisting of a group of estimators, which are either overall optimal or conditionally optimal.
Additive regression [24] is a nonparametric model, which adopts a smooth function to fit the shape of the training data. Finally, regression via discretization [27] transforms a regression problem into a classification one, and gets the prediction result by using a classification learning system.

4.2 Experiment settings

We first conduct the experiments on the original data without a pre-clustering process, then set the number of clusters to {250, 500, 1000, 2500, 5000} and observe the trend of the result. In order to avoid random effects, we conduct 10-fold cross validation 10 times, and report the average values. The experiments are conducted with the standard machine learning toolbox WEKA [14] on a computer with an Intel Core 2 Duo CPU and 4GB memory, running 32-bit Windows 7.

4.3 Classification results

In a binary classification problem, the result can be summarized in a confusion matrix as shown in Table 3. We adopt three metrics to evaluate the performance, i.e., testing accuracy, precision, and recall, which are respectively defined as:

$$\text{Testing Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{8}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{9}$$

and

$$\text{Recall} = \frac{TP}{TP + FN}. \tag{10}$$

Basically, testing accuracy gives the overall rate of correctly classified testing samples, precision gives the correct rate in the set that has been classified as positive, and recall gives the correct rate in the real positive set.

The average values of the 10×10 results (10 trials of 10-fold cross validation) regarding the three evaluation metrics with different numbers of clusters are shown in Fig. 7. It can be seen that NBC obtains the highest precision, but its testing accuracy and recall are lower than those of the other algorithms. This is probably because NBC is more sensitive to the imbalanced problem, i.e., it tends to correctly classify the positive samples but wrongly classify the negative samples.
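The three metrics of Eqs. 8 to 10 follow directly from the four cells of the confusion matrix. The sketch below is only an illustration: the function name is hypothetical, and the example counts are invented to mimic an imbalanced split of 233 positive and 17 negative regions (the class sizes reported in Table 1 for k = 250); they are not experimental results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (testing accuracy, precision, recall) as in Eqs. (8)-(10).

    Precision and recall fall back to 0.0 when their denominator is zero,
    which is a convention chosen for this sketch.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    # Invented counts: 233 positive regions, 17 negative regions, with most
    # negative regions misclassified as positive.
    acc, prec, rec = classification_metrics(tp=230, fp=15, fn=3, tn=2)
    print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

With such a split, accuracy, precision and recall can all look high even though almost every negative region is misclassified, which is why the three metrics should be read together on imbalanced data.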
Besides, bagging and boosting have shown the most stable performance among the six algorithms. The reason is straightforward: the ensemble mechanism makes the final decision by combining the classification results of multiple classifiers, which usually outperforms a single classifier. Finally, the three statistical algorithms, i.e., RBF network, SVM, and DT, have shown similar performances. They have obtained lower accuracy and precision than bagging and boosting, but have shown the highest recall, which demonstrates that they are less sensitive to the imbalanced problem.

Table 3 Confusion matrix of classification result

                      Actual positive        Actual negative
Predicted positive    True Positive (TP)     False Positive (FP)
Predicted negative    False Negative (FN)    True Negative (TN)

Fig. 7 Comparative results of different classification models [three panels: accuracy, precision, and recall versus the number of clusters (9563, 5000, 2500, 1000, 500, 250) for Naive Bayes, RBF Network, SVM, Decision Tree, Bagging, and AdaBoost]

It is observed from Fig. 7 that the changing trend of recall is not clear. From Eqs. 9 and 10, we can see that the difference between precision and recall lies in the denominator, i.e., precision is affected by the negative samples that have been wrongly classified as positive, and recall is affected by the positive samples that have been wrongly classified as negative. As shown in Table 1, when a pre-clustering process is performed, the data set becomes imbalanced. Especially when k is small, the number of positive samples is much larger than the number of negative samples.
In this case, the learning process gets biased towards predicting positive for almost all the samples by default and frequently misclassifies the negative samples. In other words, it is easy to produce a False Positive result (classifying a negative sample as positive), whereas a False Negative result (classifying a positive sample as negative) occurs only irregularly. This explains why the changing trend of recall is not clear. Furthermore, it can be seen that both precision and recall are mainly decided by the True Positive rate, which represents the number of positive samples that have been correctly classified. Obviously, the smaller k is, the more imbalanced the data set will be. In this case, all the adopted learning algorithms get biased towards the positive class, which leads to a high True Positive rate. However, when k = 5,000, the data set is relatively balanced. In this case, the learning algorithms are no longer biased towards any class; some of them may achieve a higher True Positive rate and others may achieve a lower True Positive rate.

Fig. 8 Efficiency of different classification models [two panels: training time and testing time in seconds versus the number of clusters (9563, 5000, 2500, 1000, 500, 250)]

Fig. 9 Comparative results of different regression models [three panels: root mean squared error, relative absolute error (%), and root relative squared error (%) versus the number of clusters (9563, 5000, 2500, 1000, 500, 250) for Isotonic Regression, Linear Regression, Pace Regression, Simple Linear Regression, Additive Regression, and Regression Via Discretization]
As a result, there is an irregularity among different methods when k = 5,000.

We put more focus on the testing accuracy and precision, both of which show a clear increasing trend as the number of clusters decreases. In fact, without a pre-clustering process (shown as 9,563 clusters in Fig. 7), both the testing accuracy and precision are between 0.5 and 0.6, which is only slightly better than a random guess. When the number of clusters becomes smaller, both of them improve noticeably. That is to say, the prediction is more effective when the city is divided into larger regions. This observation is easy to explain: in a small region, the relation between the number of connections of cell towers and the user visiting rate may be unreliable due to uncontrollable factors such as poor signal intensity and periodical maintenance. However, when the cell towers are grouped geographically, the negative effects of such problems can be alleviated. For instance, if a cell tower has some missing values, the prediction on it may be inaccurate; if it is grouped into a cluster with other cell towers, the effect of such missing values can be alleviated to a certain extent. The larger the cluster is, the better the prediction result tends to be.

Besides, the adopted learning algorithms have different advantages regarding different evaluation metrics. For instance, the two ensemble-based learning algorithms give much higher testing accuracy than NBC, but fail to outperform it in precision. Thus, it is important to choose an appropriate method based on the requirement and purpose of the problem.

The training time and testing time of the classification algorithms are given in Fig. 8a and b, respectively. We can see that SVM is the most time-consuming algorithm for this problem, while the execution times of all the other algorithms are in an acceptable range.
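The alleviating effect of grouping cell towers, described above, can be illustrated with a small sketch. It assumes that the feature vector of a cluster is obtained by summing the per-time-slot connection counts of its member towers and that a missing value is simply skipped; both the aggregation rule and the handling of missing values are assumptions made for this illustration, and the function name `cluster_histogram` and all counts are hypothetical.

```python
def cluster_histogram(tower_histograms):
    """Sum per-slot connection counts over the towers of one cluster.

    Each histogram is a list with one entry per time slot; None marks a
    missing value (assumed convention). Returns (totals, observed), where
    observed[s] is the number of towers that reported a value for slot s.
    """
    if not tower_histograms:
        raise ValueError("a cluster needs at least one tower")
    n_slots = len(tower_histograms[0])
    totals = [0] * n_slots
    observed = [0] * n_slots
    for hist in tower_histograms:
        if len(hist) != n_slots:
            raise ValueError("all histograms must cover the same time slots")
        for slot, count in enumerate(hist):
            if count is not None:
                totals[slot] += count
                observed[slot] += 1
    return totals, observed


# Hypothetical four-slot histograms; tower_a misses two of its four slots.
tower_a = [5, None, None, 7]
tower_b = [3, 4, 6, 2]
tower_c = [1, 2, 3, 1]
totals, observed = cluster_histogram([tower_a, tower_b, tower_c])
print(totals, observed)
```

Taken alone, `tower_a` has no usable value for half of its slots, whereas the cluster built from the three towers has at least two observations in every slot.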
4.4 Regression results

In a regression problem, the most commonly used evaluation metric is the root mean squared error (RMSE), which is defined as:

$$\text{RMSE} = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(y_i - \hat{y}_i\right)^2}, \tag{11}$$

where ŷi is the target number of POIs located in the region covered by the i-th cell tower cluster, yi is the number of POIs predicted by the regression model, and k is the number of clusters.

Fig. 10 Efficiency of different regression models [two panels: training time and testing time in seconds versus the number of clusters (9563, 5000, 2500, 1000, 500, 250) for Isotonic Regression, Linear Regression, Pace Regression, Simple Linear Regression, Additive Regression, and Regression Via Discretization]

However, as depicted in Table 1, the ranges of the prediction targets differ considerably with different numbers of clusters, which makes the RMSE values incomparable across settings. We therefore adopt two further metrics, i.e., relative absolute error (RAE) and root relative squared error (RRSE), which are respectively defined as:

$$\text{RAE} = \frac{\sum_{i=1}^{k}\left|y_i - \hat{y}_i\right|}{\sum_{i=1}^{k}\left|\hat{y}_i - \bar{y}\right|} \times 100\,\%, \tag{12}$$

and

$$\text{RRSE} = \sqrt{\frac{\sum_{i=1}^{k}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{k}\left(\hat{y}_i - \bar{y}\right)^2}} \times 100\,\%, \tag{13}$$

where

$$\bar{y} = \frac{1}{k}\sum_{i=1}^{k}\hat{y}_i. \tag{14}$$

Basically, the RAE takes the total absolute error and normalizes it by dividing it by the total absolute error of the simple predictor, i.e., the predictor that always outputs the mean of the targets; the RRSE takes the total squared error and normalizes it by dividing it by the total squared error of the simple predictor. For both the RAE and RRSE, smaller values are better, and 100 % represents the baseline of just predicting the mean. Thus, values less than 100 % are considered effective for predicting the number of POIs in a region. The average values of the 10×10 results (10-fold cross validation 10 times) regarding the RMSE, RAE, and RRSE with different numbers of clusters are depicted in Fig. 9.
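As a concrete check of Eqs. 11 to 14, the three error measures can be sketched in a few lines of Python. The function name `regression_errors` and the example values are hypothetical; note that the code uses the descriptive names `predicted` and `target` for the quantities written as yi and ŷi in the equations.

```python
import math


def regression_errors(predicted, target):
    """Return (RMSE, RAE in %, RRSE in %) following Eqs. 11-14."""
    k = len(target)
    if k == 0 or len(predicted) != k:
        raise ValueError("predicted and target must be non-empty and equally long")
    mean_target = sum(target) / k  # Eq. 14: mean of the targets
    abs_err = sum(abs(p - t) for p, t in zip(predicted, target))
    sq_err = sum((p - t) ** 2 for p, t in zip(predicted, target))
    abs_base = sum(abs(t - mean_target) for t in target)
    sq_base = sum((t - mean_target) ** 2 for t in target)
    if abs_base == 0 or sq_base == 0:
        raise ValueError("RAE and RRSE are undefined when all targets are equal")
    rmse = math.sqrt(sq_err / k)                # Eq. 11
    rae = 100.0 * abs_err / abs_base            # Eq. 12
    rrse = 100.0 * math.sqrt(sq_err / sq_base)  # Eq. 13
    return rmse, rae, rrse


# Hypothetical POI counts for four regions.
target = [0, 2, 4, 10]
mean_baseline = [4, 4, 4, 4]   # always predicts the mean of the targets
model_output = [1, 2, 5, 8]    # some better predictor

print(regression_errors(mean_baseline, target))
print(regression_errors(model_output, target))
```

With these values the mean predictor yields an RAE and RRSE of 100 %, which is the baseline interpretation given above, while the second predictor falls well below 100 % on both measures.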
It can be seen that most of the selected algorithms, i.e., isotonic regression, linear regression, pace regression, simple linear regression, and additive regression, have obtained very similar performances with regard to the three evaluation metrics. This is because all of these algorithms try to construct the regression curve by directly utilizing the input samples. In contrast, regression via discretization transforms the original samples into intervals, which may lose some important information and cause it to perform worse than the others.

Table 4 The RAE, RRSE, training time and testing time for regression results when k = 250

Method            RAE (%)        RRSE (%)       Train (s)   Test (s)
Isotonic          68.05±12.90    79.72±14.56    0.0175      0.0000
Linear            67.96±12.16    78.63±16.91    0.0223      0.0002
Pace              69.17±15.18    86.12±24.66    0.0089      0.0002
Simple linear     68.26±12.54    78.48±13.43    0.0005      0.0000
Additive          73.40±15.43    86.57±21.24    0.0238      0.0000
Discretization    81.39±20.72    99.72±23.06    0.0341      0.0125

As the number of clusters decreases, the RMSE increases quickly, while the RAE and RRSE decrease gradually. We put more focus on the RAE and RRSE. Without a pre-clustering process (shown as 9,563 clusters in Fig. 9), both the RAE and RRSE are around 100 %, i.e., the models perform no better than just predicting the mean. However, when the city is divided into fewer regions with a smaller number of clusters, the error is reduced in most cases. In fact, except for regression via discretization, all the algorithms exhibit a very clear decreasing trend, which demonstrates the effectiveness of the prediction.

The training time and testing time of the regression algorithms are given in Fig. 10a and b, respectively. Basically, the training time of the six algorithms gradually decreases when the number of clusters gets smaller. As for the testing time, all the methods except regression via discretization run very fast.
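The information loss attributed to regression via discretization in this section can be made concrete with a small sketch. It assumes equal-width binning of the target and a prediction equal to the mean target of the predicted bin; these choices, the function name `discretize_targets`, and the example counts are assumptions for illustration and are not claimed to match the configuration used in the experiments.

```python
import math


def discretize_targets(targets, n_bins):
    """Equal-width binning of numeric targets.

    Returns (labels, means): the bin index of every target and the mean
    target value of every bin (None for an empty bin).
    """
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all targets equal: everything falls into bin 0
    # Clamp so that the maximum target lands in the last bin.
    labels = [min(int((t - lo) / width), n_bins - 1) for t in targets]
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, label in zip(targets, labels):
        sums[label] += t
        counts[label] += 1
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return labels, means


# Hypothetical POI counts for eight regions.
targets = [0, 1, 2, 3, 10, 12, 30, 34]
labels, means = discretize_targets(targets, n_bins=3)

# Even an oracle classifier that always picks the correct bin can only
# output the bin mean, so a residual error remains.
oracle_predictions = [means[label] for label in labels]
oracle_rmse = math.sqrt(
    sum((p - t) ** 2 for p, t in zip(oracle_predictions, targets)) / len(targets)
)
print(labels, means, oracle_rmse)
```

The residual error of the oracle is the part of the target variation that is discarded by the discretization step itself, independent of how well the underlying classifier performs.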
Finally, the RAE, RRSE, training time, and testing time for k = 250 are listed in Table 4. When the city is divided into 250 regions, linear regression and simple linear regression give the lowest RAE and RRSE, respectively. Besides, both the training time and testing time are below 0.1 second.

4.5 Summary

From both the classification and regression results, we can conclude that when the city is divided into fewer regions, the predictions of the POI existence and the number of POIs are more accurate. As mentioned above, one major reason is that the relation between the number of connections of cell towers and the user visiting rate in a small region is unreliable due to factors such as poor signal intensity and periodical maintenance. However, these negative effects can be alleviated by defining larger regions. It is hard to tell which learning method is the best, since different methods exhibit different advantages with regard to different evaluation metrics. This leaves room to improve the performance by designing adaptive algorithms, which could be one of our future research directions.

5 Conclusions

In this paper, we proposed a supervised learning-based framework for predicting the existence of POIs and the number of POIs in a given region using the spatio-temporal features extracted from cell tower data dumps in Guangzhou, China and the information of a set of restaurants collected from the Chinese social network Weibo. The Voronoi diagram is adopted to divide the city of Guangzhou into small and contiguous regions geographically. Then, a k-means clustering process is performed on the cell towers to merge small regions into larger ones. The connection frequencies of cell towers are taken as the features of a region, and a classification or regression model is used to predict the POI existence or the number of POIs in a given region, respectively. We have studied 12 state-of-the-art classification and regression algorithms.
Experimental results show the feasibility and effectiveness of the proposed framework. We consider two related research problems as our future work: the problem of determining the value of k and the choice of time resolution. One possible solution to the first problem is to design a metric for k based on some objective factors. For example, if the metric is based on travel time, we need to find the total length of the roads in the city (length), the average driving speed of vehicles (speed), and how much time the user is willing to spend (time); the driving distance speed × time can then be taken as the total length of the roads in a cluster, and k can be determined as length/(speed × time). For the second problem, the main idea is to separate two consecutive hours if they exhibit an obvious change of connection frequency (i.e., a new time slot should begin). In the future, we will collect data in other cities to verify the effectiveness of this method.

Acknowledgments R. Wang and C.-Y. Chow were partially supported by a research grant (CityU Project No. 9231131). S. Nutanong was partially supported by a CityU research grant (CityU Project No. 7200387). This work was also supported by the National Natural Science Foundation of China under the Grant 61402460.

References

1. Bao J, Zheng Y, Mokbel MF (2012) Location-based and preference-aware recommendation using sparse geo-social networking data. In: ACM SIGSPATIAL
2. Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD (1972) Statistical inference under order restrictions: The theory and application of isotonic regression. Wiley, New York
3. Becker RA, Caceres R, Hanson K, Loh JM, Urbanek S, Varshavsky A, Volinsky C (2011) A tale of one city: Using cellular network data for urban planning. IEEE Pervasive Computing 10(4):18–26
4. Birant D, Kut A (2007) ST-DBSCAN: An algorithm for clustering spatial–temporal data. DKE 60(1):208–221
5. Bishop CM (2006) Pattern recognition and machine learning (information science and statistics). Springer, New York
6. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
7. Chen XM, Liu WQ, Lai JH, Li Z, Lu C (2012) Face recognition via local preserving average neighborhood margin maximization and extreme learning machine. Soft Comput 16(9):1515–1523
8. Collins M, Schapire RE, Singer Y (2002) Logistic regression, AdaBoost and Bregman distances. Mach Learn 48(1-3):253–285
9. Ghosh S, Lee K, Moorthy S (1995) Multiple scale analysis of heterogeneous elastic structures using homogenization theory and Voronoi cell finite element method. IJSS 32(1):27–62
10. Goh JY, Taniar D (2004) Mobile data mining by location dependencies. In: IDEAL
11. Gokaraju B, Durbha SS, King RL, Younan NH (2011) A machine learning based spatio-temporal data mining approach for detection of harmful algal blooms in the Gulf of Mexico. IEEE J-STARS 4(3):710–720
12. Hartigan JA, Wong MA (1979) Algorithm AS 136: A k-means clustering algorithm. J R Stat Soc: Ser C: Appl Stat 28(1):100–108
13. Haykin S (1994) Neural networks: A comprehensive foundation. Prentice Hall PTR
14. Holmes G, Donkin A, Witten IH (1994) WEKA: A machine learning workbench. In: ANZIIS
15. Isaacman S, Becker R, Cáceres R, Kobourov S, Martonosi M, Rowland J, Varshavsky A (2011) Identifying important places in people’s lives from cellular network data. In: Pervasive Computing
16. Kanasugi H, Sekimoto Y, Kurokawa M, Watanabe T, Muramatsu S, Shibasaki R (2013) Spatiotemporal route estimation consistent with human mobility using cellular network data. In: IEEE PerCom
17. Miller HJ, Han J (2009) Geographic data mining and knowledge discovery. CRC Press
18. Pan B, Zheng Y, Wilkie D, Shahabi C (2013) Crowd sensing of traffic anomalies based on human mobility and social media. In: ACM SIGSPATIAL
19. Quinlan JR (1996) Improved use of continuous attributes in C4.5. JAIR 4:77–90
20. Ratti C, Williams S, Frenchman D, Pulselli RM (2006) Mobile landscapes: using location data from cell phones for urban analysis. Environ Plan B: Planning and Design 33(5):727
21. Rish I (2001) An empirical study of the naive Bayes classifier. In: IJCAI
22. Seber GAF, Lee AJ (2012) Linear regression analysis, volume 936. John Wiley & Sons
23. Sheather SJ, Jones MC (1991) A reliable data-based bandwidth selection method for kernel density estimation. JRSS, Series B 53(3):683–690
24. Stone CJ (1985) Additive regression and other nonparametric models. Ann Stat:689–705
25. Tong S, Koller D (2002) Support vector machine active learning with applications to text classification. J Mach Learn Res 2:45–66
26. Toole JL, Ulm M, González MC, Bauer D (2012) Inferring land use from mobile phone activity. In: ACM UrbComp
27. Torgo L, Gama J (1996) Regression by classification. In: Advances in Artificial Intelligence, pp 51–60
28. Vapnik V (2000) The nature of statistical learning theory. Springer
29. Vieira MR, Frias-Martinez V, Oliver N, Frias-Martinez E (2010) Characterizing dense urban areas from mobile phone-call data: Discovery and social dynamics. In: IEEE SocialCom
30. Wang L, Huang YP, Luo XY, Wang Z, Luo SW (2011) Image deblurring with filters learned by extreme learning machine. Neurocomputing 74(16):2464–2474
31. Wang Y, Witten IH (1999) Pace regression. Technical Report 99/12, Department of Computer Science, The University of Waikato
32. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE TEVC 1(1):67–82
33. Yavaş G, Katsaros D, Ulusoy Ö, Manolopoulos Y (2005) A data mining approach for location prediction in mobile environments. DKE 54(2):121–146
34. Ye M, Yin P, Lee W-C, Lee D-L (2011) Exploiting geographical influence for collaborative point-of-interest recommendation. In: ACM SIGSPATIAL
35. Yuan J, Zheng Y, Xie X (2012) Discovering regions of different functions in a city using human mobility and POIs. In: ACM SIGKDD
36. Yuan J, Zheng Y, Xie X, Sun G (2013) T-Drive: Enhancing driving directions with taxi drivers’ intelligence. IEEE TKDE 25(1):220–232
37. Zha Z, Wang M, Zheng Y, Yang Y, Hong R, Chua T (2012) Interactive video indexing with statistical active learning. IEEE TMM 14(1):17–27
38. Zhang J-D, Chow C-Y (2013) iGSLR: Personalized geo-social location recommendation: A kernel density estimation approach. In: ACM SIGSPATIAL
39. Zheng J, Liu S, Ni LM (2013) Effective routine behavior pattern discovery from sparse mobile phone data via collaborative filtering. In: IEEE PerCom
40. Zheng Y, Chen Y, Xie X, Ma WY (2009) Geolife2.0: A location-based social networking service. In: IEEE MDM

Ran Wang received the B.Sc. degree in computer science from the College of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2009, and the Ph.D. degree from the City University of Hong Kong, Hong Kong, in 2014. She is currently a Post-Doctoral Senior Research Associate with the Department of Computer Science, the City University of Hong Kong. Her current research interests include pattern recognition, machine learning, fuzzy sets and fuzzy logic, and their related applications.

Chi-Yin Chow received the M.S. and Ph.D. degrees from the University of Minnesota-Twin Cities in 2008 and 2010, respectively. He is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. His research interests include spatio-temporal data management and analysis, GIS, mobile computing, and location-based services. He is the co-founder and co-organizer of ACM SIGSPATIAL MobiGIS 2012, 2013, and 2014.

Yan Lyu received the M.S. degree in pattern recognition and intelligent systems from University of Science and Technology of China, China, in 2013. She is currently working toward the Ph.D. degree in the Department of Computer Science, City University of Hong Kong.
Her research interests include data mining, machine learning and location-based services.

Victor C. S. Lee received the Ph.D. degree in computer science from the City University of Hong Kong in 1997. He is an Assistant Professor with the Department of Computer Science, City University of Hong Kong. His research interests include data management in mobile computing systems, real-time databases, and performance evaluation. Dr. Lee is a member of the ACM and IEEE Computer Society. From 2006 to 2007, he was the Chairman of the Computer Chapter, IEEE Hong Kong Section.

Sarana Nutanong is an Assistant Professor in the Department of Computer Science at City University of Hong Kong. He received his PhD from the University of Melbourne. Before joining CityU in January 2014, he was a Postdoctoral Research Associate at University of Maryland Institute for Advanced Computer Studies between 2010 and 2012 and held a research faculty position at the Johns Hopkins University from 2012 to 2013. His research interests include scientific data management, data-intensive computing, spatial-temporal query processing, and large-scale machine learning. More specifically, his research is aimed at providing a large-scale, high-throughput support for computational scientific exploration applications.

Yanhua Li is a researcher with HUAWEI Noah’s Ark LAB, Hong Kong. He obtained two PhD degrees in computer science from University of Minnesota, Twin Cities in 2013, and in electrical engineering from Beijing University of Posts and Telecommunications in 2009. His broad research interests are in analyzing, understanding, and making sense of big data generated from various complex networks in many contexts. His specific interests include large-scale network data sampling, measurement, and performance analysis, and spatio-temporal data analytics. He has held visiting positions in Bell Labs in New Jersey, Microsoft Research Asia, and HUAWEI research labs of America.
He served on the TPC of INFOCOM 2015 and ICDCS 2014 and 2015, and he is the co-chair of SIMPLEX 2015.

Mingxuan Yuan is currently a Researcher of Huawei Noah’s Ark lab, Hong Kong. Before that, he served as a PostDoc fellow in the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology. His research interests include big telecom (spatiotemporal) data storage/mining, telecom data mining and data privacy.