Distributed and Parallel Databases manuscript No. (will be inserted by the editor)

SMashQ: Spatial Mashup Framework for k-NN Queries in Time-Dependent Road Networks

Detian Zhang · Chi-Yin Chow · Qing Li · Xinming Zhang · Yinlong Xu

Received: date / Accepted: date

Abstract The k-nearest-neighbor (k-NN) query is one of the most popular spatial query types for location-based services (LBS). In this paper, we focus on k-NN queries in time-dependent road networks, where the travel time between two locations may vary significantly at different times of the day. In practice, it is costly for an LBS provider to collect real-time traffic data from vehicles or roadside sensors to compute the best route from a user to a spatial object of interest in terms of the travel time. Thus, we design SMashQ, a server-side spatial mashup framework that enables a database server to efficiently evaluate k-NN queries using the route information and travel time accessed from an external Web mapping service, e.g., Microsoft Bing Maps. Due to the expensive cost and limitations of retrieving such external information, we propose three shared execution optimizations for SMashQ, namely, object grouping, direction sharing, and user grouping, to reduce the number of external Web mapping requests and provide highly accurate query answers. We evaluate SMashQ using Microsoft Bing Maps, a real road network, real data sets, and a synthetic data set. Experimental results show that SMashQ is efficient and capable of producing highly accurate query answers.

Keywords Spatial mashups · location-based services · k-nearest-neighbor queries · time-dependent road networks · Web mapping services

The work described in this paper was partially supported by grants from City University of Hong Kong (Project No. 7002686 and 7002606) and the National Natural Science Foundation of China under Grant No. 61073185 and 61073038.

Detian Zhang
Department of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Department of Computer Science, City University of Hong Kong, Hong Kong, China
USTC-CityU Joint Advanced Research Center, Suzhou, China
E-mail: tianzdt@mail.ustc.edu.cn

Chi-Yin Chow
Department of Computer Science, City University of Hong Kong, Hong Kong, China
E-mail: chiychow@cityu.edu.hk

Qing Li
Department of Computer Science, City University of Hong Kong, Hong Kong, China
USTC-CityU Joint Advanced Research Center, Suzhou, China
E-mail: itqli@cityu.edu.hk

Xinming Zhang
Department of Computer Science and Technology, University of Science and Technology of China, Hefei, China
USTC-CityU Joint Advanced Research Center, Suzhou, China
E-mail: xinming@ustc.edu.cn

Yinlong Xu
Department of Computer Science and Technology, University of Science and Technology of China, Hefei, China
USTC-CityU Joint Advanced Research Center, Suzhou, China
E-mail: ylxu@ustc.edu.cn

1 Introduction

With the ubiquity of wireless Internet access, GPS-enabled mobile devices and the advance in spatial database management systems, location-based services (LBS) have been realized to provide valuable information for their users based on their locations [15, 17]. LBS are an abstraction of spatio-temporal queries. Typical examples of spatio-temporal queries include range queries (e.g., “How many vehicles are in a certain area”) [9, 20, 24] and k-nearest-neighbor (k-NN) queries (e.g., “What are the three nearest gas stations to my car”) [13, 20, 22, 27]. The distance between two locations in a road network is measured in terms of the network distance, instead of the Euclidean distance, to account for the physical movement constraints of the road network [24]. It is usually defined as the length of their shortest path.
However, this kind of distance measure hides the fact that the user may take a longer time to travel to his/her nearest object of interest (e.g., a restaurant or hotel) than to other ones due to many realistic factors, e.g., heterogeneous traffic conditions and traffic accidents. Travel time (e.g., driving time) is in reality a more meaningful and reliable distance measure for LBS in the road network [6, 7, 10]. Figure 1 depicts an example where Alice wants to find the nearest clinic for emergency medical treatment. A traditional shortest-path based NN query algorithm returns Clinic X. However, a driving-time based NN query algorithm returns Clinic Y because Alice will spend less time to reach Y (3 mins.) than X (10 mins.).

Fig. 1 Shortest-path-based nearest object X versus travel-time-based nearest object Y. (Alice asks for her nearest clinic. Clinic X: distance 500 meters, travel time 10 mins. Clinic Y: distance 1000 meters, travel time 3 mins.)

Travel time is highly dynamic; e.g., the driving time on a segment of the I-10 freeway in Los Angeles, USA, changes from 30 minutes to 18 minutes between 8:30 AM and 9:30 AM, i.e., a 40% decrease in driving time [6]. It is therefore almost impossible to accurately predict the travel time between two locations in the road network based on their network distance. The best way to provide real-time travel time computation is to continuously monitor the traffic in the road network; however, it is difficult for every LBS provider to do so due to the very expensive deployment cost.

A mashup, one of the key Web 2.0 technologies, is a web application that combines data, representation, and/or functionality from multiple web applications to create a new application [30].
A server-side spatial mashup (or GIS mashup) provides a cost-effective way for a database server to access the route and travel time information between two locations in the road network from external Web mapping services, e.g., Google Maps [11], Yahoo! Maps [32], Microsoft Bing Maps [19] and government agencies. Web mapping service providers have many sources to collect data (e.g., historical traffic statistics and real-time traffic conditions) to estimate the travel time and direction between any two locations. The application scenario of this work is that location-based service (LBS) providers can subscribe to direction services from Web mapping services through their APIs to provide services for their users based on current traffic conditions. In other words, our proposed framework enables the LBS provider to outsource the real-time traffic monitoring and traffic analysis operations to the Web mapping service provider, and thus, the LBS provider can focus on its core business activities and reduce its operational costs.

However, existing spatial mashups suffer from the following limitations. (1) It is costly to access direction information from a Web mapping service, e.g., retrieving travel time from the Microsoft MapPoint web service to a database engine takes 502 ms while the time needed to read a cold and hot 8 KB buffer page from disk is 27 ms and 0.0047 ms, respectively [18]. (2) The number of requests that can be issued to a Web mapping service is limited and charged, e.g., Google Maps allows only 2,500 requests per day for evaluation users and 100,000 requests per day for business license users [12]. A location-based service provider needs to pay more to have higher usage limits. (3) The use of retrieved route information is restricted, e.g., the route information must not be pre-fetched, cached, or stored, except that a limited amount of content can be temporarily stored for the purpose of improving system performance [12].
(4) Existing Web mapping services only support primitive operations, e.g., the travel direction and time between two locations. A database server has to issue a large number of external requests to collect small pieces of information from the supported simple operations to process relatively complex spatial queries, e.g., k-NN queries.

In this paper, we design SMashQ, a framework that uses a server-side Spatial Mashup paradigm to process k-NN Queries in the time-dependent road network. Given a set of objects R and a k-NN query with a user U’s location, SMashQ finds at most k objects from R with the shortest travel time from U’s location. The objectives of SMashQ are to reduce the number of external Web mapping requests and provide highly accurate query answers. To accomplish these objectives, SMashQ first prunes from R the objects that are definitely not part of a query answer to find a set of candidate objects for the query answer. Then, three shared execution optimizations, namely, object grouping, direction sharing, and user grouping, are designed for SMashQ. The main idea of the object grouping optimization is to group candidate objects to a set of intersections I. For each intersection in I, one external Web mapping request is issued to retrieve the route and travel time from U to that intersection, and then the travel time from the intersection to each of its objects is estimated. To further reduce the number of external requests, the direction sharing optimization aims at using the route and travel time information from U to one intersection in I to find that from U to other intersections in I. The user grouping optimization groups queries to a set of intersections. For each intersection, its queries are processed at the same time, as they can share the route and travel time information from the intersection.
To evaluate the performance of SMashQ, we compare SMashQ with two basic approaches using a real road network, two real-world data sets, synthetic data sets, and Microsoft Bing Maps [19]. Experimental results show that SMashQ outperforms the basic approaches in terms of the number of external Web mapping requests and it provides highly accurate travel time estimation and k-NN query answers.

The remainder of this paper is organized as follows. Section 2 highlights related work. Section 3 describes the system model. Section 4 presents an exact algorithm. Section 5 describes the three shared execution optimizations for SMashQ. Experimental results are analyzed in Section 6. Finally, Section 7 concludes this paper.

2 Related Work

In this section, we first highlight related work in the areas of spatial queries in distributed systems and shared execution for spatial queries. Then, we focus on the related work in the areas of spatial queries in road networks and spatial queries with external data or services.

Spatial queries in distributed systems. Location-based query processing algorithms have been designed for various distributed systems, e.g., client-server systems [20, 23], peer-to-peer systems [26], mobile peer-to-peer systems [4, 34], and wireless sensor networks [8, 31]. In the client-server systems, a centralized database server stores all objects and employs query execution optimizations to process location-based queries issued by mobile users [20, 23]. A distributed spatial index structure has been designed for the peer-to-peer system to process location-based queries [26]. When a distributed server does not have enough spatial object information to process a location-based query, it uses the spatial index structure to enlist other peer servers that store the required information for help.
Mobile peer-to-peer systems enable users to cache their nearby spatial objects to process location-based queries without enlisting any server for help [4, 34]. Wireless sensor networks (WSNs) are more challenging than peer-to-peer systems, because WSNs have many limitations (e.g., limited storage and computational capacity, scarce battery power, and restricted communication ranges and bandwidth). Spatial query processing algorithms designed for WSNs have to be optimized for such limitations [8, 31].

Shared execution for spatial queries. Shared execution techniques are among the most widely used query execution optimizations for spatial queries. A shared execution scheme has been used to enable a system to scale up to a large number of concurrent continuous range and k-nearest-neighbor queries [20, 21]. The basic idea is to group similar queries together and then evaluate a group of queries at the same time through a spatial join between moving objects and queries. Another shared execution scheme has been designed to optimize the query execution by clustering moving objects and queries based on their common spatial and temporal properties [23]. However, none of the existing shared execution techniques is designed for sharing direction information retrieved from a Web mapping service to process spatial queries.

Spatial queries in road networks. Existing location-based query processing algorithms can be categorized into two main classes: location-based queries in time-independent road networks and in time-dependent road networks. (1) Time-independent road networks. In this class, location-based query processing algorithms assume that the cost or weight of a road segment, which can be in terms of distance or travel time, is constant (e.g., [14, 24, 25]). These algorithms mainly rely on pre-computed distance or travel time information of road segments in road networks.
However, the actual travel time of a road segment may vary significantly during different times of the day due to dynamic traffic on road segments [6, 7]. (2) Time-dependent road networks. The location-based query processing algorithms designed for time-dependent road networks have the ability to support dynamic weights of road segments and topology of a road network, which can change with time. Location-based queries in time-dependent road networks are more realistic but also more challenging. George et al. [10] proposed a time-aggregated graph, which uses time series to represent time-varying attributes. The time-aggregated graph can be used to compute the shortest path for a given start time or to find the best start time for a path that leads to the shortest travel time. In [6, 7], Demiryurek et al. proposed solutions for processing k-NN queries in time-dependent road networks where the weight of each road segment is a function of time.

Spatial queries with external data or services. In this paper, we focus on k-NN queries in time-dependent road networks. This paper distinguishes itself from previous work [6, 7, 10] in that it does not model the underlying road network based on different criteria. Instead, it employs third party Web mapping services, e.g., Microsoft Bing Maps, to compute the travel time of road networks and provide the route and travel time information through server-side spatial mashups. Since the use of external Web mapping requests is more expensive than accessing local data [18], we propose three shared execution optimizations for SMashQ to reduce the number of such external requests and provide highly accurate query answers. There are some query processing algorithms designed to deal with expensive attributes that are accessed from external Web services (e.g., [1, 2, 18]).
To minimize the number of external requests, these algorithms mainly focused on using either some cheap attributes that can be retrieved from local data sources [1, 18] or sampling methods [2] to prune a whole set of objects into a smaller set of candidate objects, and they only issue absolutely necessary external requests for candidate objects. The closest work to this paper is [18], where Levandoski et al. developed a framework for processing spatial skyline queries by pruning candidate objects that are guaranteed not to be part of a query answer and only issuing an external request for each remaining candidate object to retrieve the travel time from the user to the object. However, none of the previous work studied how to use shared execution techniques and the road network topology to reduce the number of external requests, which is exactly the problem that we solve in this paper.

3 System Model

In this section, we describe our system architecture, road network model, and problem definition. Figure 2 depicts the system architecture of SMashQ, which consists of three entities: users, a database server, and a Web mapping service provider. Users send k-NN queries to the database server at an LBS provider through their mobile devices anywhere, anytime. The database server processes queries based on local data (e.g., the location and basic information of restaurants) and external data (i.e., route and travel time information) accessed from the Web mapping service provider. In general, accessing external data is much more expensive than accessing internal data [18].

Fig. 2 System architecture. (1: Users send location-based queries to the database server running SMashQ, which holds low-cost local data, e.g., restaurant data. 2: The server issues Web mapping requests to the Web mapping service provider, which holds high-cost external data. 3: The provider returns route and travel time. 4: The server returns answers to the users.)

A road map is modeled as a graph G = (V, E) comprising a set V of vertices with a set E of edges, where each road segment is an edge and each intersection of road segments is a vertex. For instance, Figure 3a depicts a real road map that is modeled as an undirected graph (Figure 3b), where an edge represents a road segment (e.g., I1 I2 and I1 I5) and a square represents an intersection of road segments (e.g., I1 and I2).

Fig. 3 Road network model: (a) a road map; (b) a graph model with intersections I1 to I15 connected by road segments.

Our problem can be defined as follows. Given a set of objects R and a k-NN query Q = (λ, k) from a user U, where λ is U’s location and k is the maximum number of returned objects, SMashQ returns to U at most k objects in R with the shortest travel time from λ based on the route and travel time information accessed from a Web mapping service provider, e.g., Microsoft Bing Maps [19]. Since the access to the Web mapping service is expensive, our objectives are to reduce the number of external Web mapping requests and provide a highly accurate k-NN query answer.

4 Exact Algorithm

In this section, we present an exact algorithm that produces accurate answers for k-NN queries. Since there could be a very large number of objects in a spatial data set, it is extremely inefficient to issue an external Web mapping request to retrieve the route and travel time from a user to each object in the data set. To this end, the exact algorithm employs the existing network expansion technique designed for processing range queries in a road network [24] and existing pruning techniques designed for query processing with external data [1, 2, 18] to minimize the number of external Web mapping requests. Given an underlying road network modeled as a graph G = (V, E), a spatial object set R, a query Q = (λ, k) from a user U, and U’s maximum movement speed U.max_speed, the exact algorithm performs an incremental search from U to find an accurate k-NN query answer.
Algorithm 1 Exact Algorithm
 1: input: G = (V, E), R, Q = (λ, k), U.max_speed
 2: Initialize a current answer set A, a node set S, and a priority queue Q
 3: distmax ← ∞; Tmax ← ∞
 4: Ni ← the nearest node from U
 5: NDist ← Dist(U, Ni)
 6: Q.enqueue(Ni, NDist); S ← {Ni}
 7: while Q is not empty do
 8:   (Ni, NDist) ← Q.dequeue()
 9:   if A contains k objects and NDist > distmax then
10:     break
11:   end if
12:   if Ni is an object (i.e., Ni ∈ R) then
13:     Issue an external Web mapping request to retrieve Time(U → Ni)
14:     if A contains fewer than k objects then
15:       A ← A ∪ {Ni}
16:       if A contains k objects then
17:         Tmax ← the largest travel time of the objects in A
18:         distmax ← U.max_speed × Tmax
19:       end if
20:     else if Time(U → Ni) < Tmax then
21:       Replace the object with Tmax with Ni in A
22:       Tmax ← the largest travel time of the objects in A
23:       distmax ← U.max_speed × Tmax
24:     end if
25:   end if
26:   for each neighbor node Nj of Ni, where Nj ∉ S do
27:     Q.enqueue(Nj, NDist + Dist(Ni, Nj)); S ← S ∪ {Nj}
28:   end for
29: end while
30: return A

Algorithm 1 depicts the pseudo code of the exact algorithm. It considers the intersections and objects in the graph as two types of nodes that will be incrementally inserted into a priority queue Q. Each entry in Q is defined as (Ni, NDist), where Ni is a node identifier and NDist is the network distance from U to Ni. The dequeue operation of Q returns the node with the smallest network distance from U and removes it from Q. Without loss of generality, we assume that U can only U-turn at an intersection. Initially, U’s nearest node Ni with their distance Dist(U, Ni) is enqueued to Q (Lines 4 to 6 in Algorithm 1). For each iteration, a node (Ni, NDist) is dequeued from Q. Ni is processed based on the following two cases:

Case 1: Ni is an intersection.
For each neighbor node Nj of Ni that has not been enqueued to Q, Nj is enqueued to Q along with the network distance from U to Nj, i.e., Dist(Ni, Nj) + NDist, where Dist(Ni, Nj) is the distance from Ni to Nj (Lines 26 to 28 in Algorithm 1).

Case 2: Ni is an object. An external Web mapping request is issued to retrieve the travel time from U to Ni, Time(U → Ni). If the current answer A contains fewer than k objects, Ni is inserted into A (Lines 15 to 18 in Algorithm 1). If Ni is the k-th object inserted into A, we set Tmax to the largest travel time of the objects in A. In case A already contains k objects and Time(U → Ni) is less than the largest travel time of the objects in A (i.e., Tmax), the object with Tmax is replaced with Ni and Tmax is updated accordingly (Lines 21 to 23 in Algorithm 1). Then, the neighbor nodes of Ni are processed as in Case 1.

The algorithm repeatedly dequeues nodes from Q until an accurate answer is found. Whenever Tmax is updated, we calculate the maximum possible network distance distmax that U can travel within the time interval Tmax (i.e., distmax = U.max_speed × Tmax). For each iteration, if the network distance of the node dequeued from Q is larger than distmax, the algorithm terminates and returns the current answer A to U (Line 9 in Algorithm 1). This is because Q returns nodes in non-decreasing order of network distance, so every unprocessed object is farther than distmax from U; since U cannot cover more than distmax within Tmax, the travel time from U to any unprocessed object is larger than Tmax, and thus, it cannot be part of A.

Fig. 4 An example for the exact algorithm. (A road network with the user U, objects R1 to R10, and intersections I1 to I15; the number beside each edge is its distance.)

Example. Figure 4 shows a running example for the exact algorithm, where a user U issues a query Q = (λ, k = 2), U.λ is represented by a triangle, 10 objects (i.e., R = {R1 , . . .
, R10}) are represented by circles, the distance of each edge (in km) is indicated by a number beside it, and the maximum movement speed U.max_speed is 25 m/s (i.e., 90 km/h). Since U is moving towards I6 and the distance from U to I6 is 1 km, (I6, 1) is enqueued to an empty Q. Figure 5 depicts the 12 iterations performed by the algorithm to find an accurate k-NN query answer.

Fig. 5 The intermediate steps of the exact algorithm.

Iter.  Node Ni   Time(U→Ni)  Current answer A  Tmax   distmax   Priority queue after the iteration
 -     -         -           -                 ∞      ∞         (I6,1)
 1     (I6,1)    -           -                 ∞      ∞         (R5,4), (I5,4), (I10,4)
 2     (R5,4)    350 s       R5                ∞      ∞         (I5,4), (I10,4), (R4,5)
 3     (I5,4)    -           R5                ∞      ∞         (I10,4), (R4,5), (R2,6), (I1,9)
 4     (I10,4)   -           R5                ∞      ∞         (R4,5), (R1,5), (R2,6), (I11,7), (I14,7), (I1,9)
 5     (R4,5)    500 s       R5, R4            500 s  12.5 km   (R1,5), (R2,6), (I2,6), (I11,7), (I14,7), (I1,9)
 6     (R1,5)    400 s       R5, R1            400 s  10 km     (R2,6), (I2,6), (I11,7), (I14,7), (I9,7), (I1,9)
 7     (R2,6)    200 s       R5, R2            350 s  8.75 km   (I2,6), (I11,7), (I14,7), (I9,7), (I1,9)
 8     (I2,6)    -           R5, R2            350 s  8.75 km   (I11,7), (I14,7), (I9,7), (I1,9), (I3,9)
 9     (I11,7)   -           R5, R2            350 s  8.75 km   (I14,7), (I9,7), (R9,8), (I1,9), (I3,9), (R10,9)
10     (I14,7)   -           R5, R2            350 s  8.75 km   (I9,7), (R9,8), (I1,9), (I3,9), (R10,9), (I13,10), (I15,13)
11     (I9,7)    -           R5, R2            350 s  8.75 km   (R9,8), (I1,9), (I3,9), (R10,9), (I13,10), (I15,13)
12     (R9,8)    600 s       R5, R2            350 s  8.75 km   (I1,9), (I3,9), (R10,9), (I13,10), (R8,10), (I15,13)

1. The algorithm dequeues (I6, 1) from Q. Since I6 is an intersection, its neighbor nodes (i.e., (R5, 4), (I5, 4), and (I10, 4)) are enqueued to Q.
2. (R5, 4) is dequeued. Since R5 is an object, an external Web mapping request is issued to get the travel time from U to R5, Time(U → R5) = 350 seconds. Since the current answer A is empty, R5 is simply inserted into A. R4 is the only neighbor of R5 that has not been processed by the algorithm, so (R4, 5) is enqueued to Q.
3. (I5, 4) is dequeued from Q and two of its neighbor nodes, (R2, 6) and (I1, 9), are enqueued to Q.
4. This iteration dequeues another intersection (I10, 4) from Q and enqueues three of its neighbor nodes (R1, 5), (I11, 7), and (I14, 7) to Q.
5. The algorithm dequeues (R4, 5) from Q and issues an external request for R4 to get Time(U → R4) = 500 seconds. Since A contains fewer than k = 2 objects, R4 is inserted into A. Then, Tmax is set to the largest travel time of the objects in A, i.e., 500 seconds. As U.max_speed = 25 m/s, distmax is set to 25 m/s × 500 s = 12.5 km. One of R4’s neighbor nodes, i.e., (I2, 6), is enqueued to Q.
6. Since (R1, 5) is an object, an external request is issued for R1. As Time(U → R1) = 400 seconds is less than Tmax, R4 is replaced with R1. Then, Tmax and distmax are updated to 400 seconds and 25 m/s × 400 s = 10 km, respectively. One of R1’s neighbor nodes, I9, has not been processed by the algorithm, so (I9, 7) is enqueued to Q.
7. (R2, 6) is dequeued from Q and an external request is issued. Since Time(U → R2) = 200 seconds is less than Tmax, R1 is replaced with R2 in A. Then, Tmax and distmax are updated to 350 seconds and 25 m/s × 350 s = 8.75 km, respectively. As all the neighbor nodes of R2 have been processed by the algorithm, no node is enqueued to Q.
8. (I2, 6) is dequeued from Q. (I3, 9) is enqueued to Q.
9. (I11, 7) is dequeued from Q. (R9, 8) and (R10, 9) are enqueued to Q.
10. (I14, 7) is dequeued from Q. (I13, 10) and (I15, 13) are enqueued to Q.
11. (I9, 7) is dequeued from Q and no node needs to be enqueued to Q.
12. The algorithm dequeues (R9, 8) from Q and issues an external request for R9 to get Time(U → R9) = 600 seconds. Since Time(U → R9) is larger than Tmax = 350 seconds, no update is needed for A. One of R9’s neighbor nodes, i.e., (R8, 10), is enqueued to Q.

After the above 12 iterations, (I1, 9) is dequeued from Q. Since NDist(U, I1) = 9 km is larger than distmax = 8.75 km, the algorithm terminates and returns A = {R5, R2} to U as a query answer.
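The expansion and pruning steps of the exact algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name exact_knn, the travel_time callback (a stand-in for one external Web mapping request), and the FIFO tie-breaking among nodes at equal distance are all hypothetical. The graph contains only the edges implied by the enqueue steps of the running example, with lengths derived from the distances in Figure 5; the remaining edges and objects of Figure 4 are omitted.

```python
import heapq
import itertools
import math


def exact_knn(adjacency, objects, start, start_dist, k, max_speed, travel_time):
    """Incremental network expansion with travel-time pruning (after Algorithm 1).

    adjacency   : dict node -> {neighbor: edge length in km}
    objects     : set of nodes that are spatial objects
    start       : node nearest to the user, at network distance start_dist (km)
    max_speed   : the user's maximum speed in km/s
    travel_time : callable standing in for one external Web mapping request
    """
    answer = {}                       # object -> travel time in seconds
    t_max = dist_max = math.inf
    tie = itertools.count()           # FIFO order among equal distances (assumption)
    queue = [(start_dist, next(tie), start)]
    seen = {start}
    requests = 0

    while queue:
        ndist, _, node = heapq.heappop(queue)
        if len(answer) == k and ndist > dist_max:
            break                     # farther nodes cannot be reached within t_max
        if node in objects:
            t = travel_time(node)     # the expensive external request
            requests += 1
            if len(answer) < k:
                answer[node] = t
            elif t < t_max:
                del answer[max(answer, key=answer.get)]
                answer[node] = t
            if len(answer) == k:
                t_max = max(answer.values())
                dist_max = max_speed * t_max
        for nbr, length in adjacency.get(node, {}).items():
            if nbr not in seen:
                seen.add(nbr)
                heapq.heappush(queue, (ndist + length, next(tie), nbr))
    return answer, requests


# Edges implied by the enqueue steps of the running example (lengths in km).
EDGES = [("I6", "R5", 3), ("I6", "I5", 3), ("I6", "I10", 3), ("R5", "R4", 1),
         ("I5", "R2", 2), ("I5", "I1", 5), ("I10", "R1", 1), ("I10", "I11", 3),
         ("I10", "I14", 3), ("R4", "I2", 1), ("R1", "I9", 2), ("I2", "I3", 3),
         ("I11", "R9", 1), ("I11", "R10", 2), ("I14", "I13", 3),
         ("I14", "I15", 6), ("R9", "R8", 2)]
adjacency = {}
for a, b, km in EDGES:
    adjacency.setdefault(a, {})[b] = km
    adjacency.setdefault(b, {})[a] = km

# Travel times (seconds) quoted in the example; a real system would call the service.
TIMES = {"R5": 350, "R4": 500, "R1": 400, "R2": 200, "R9": 600}
objects = {"R1", "R2", "R4", "R5", "R8", "R9", "R10"}

answer, requests = exact_knn(adjacency, objects, "I6", 1, k=2,
                             max_speed=0.025, travel_time=TIMES.__getitem__)
print(sorted(answer), requests)
```

On this reduced graph the sketch issues five simulated requests (for R5, R4, R1, R2, and R9) and stops when (I1, 9) is dequeued, which agrees with the trace above. Passing the request as a callback keeps the costly step explicit, so the number of external calls can be counted or replaced by a real client.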
5 SMashQ

Although the exact algorithm can prune a large number of objects to reduce the number of external Web mapping requests, if the object density in the vicinity of a querying user is very high, the database server would still need to issue a large number of external requests. To this end, we propose three shared execution optimizations, namely, object grouping, direction sharing, and user grouping, for SMashQ based on the road network topology to further reduce the number of external requests. This section is organized as follows: Section 5.1 presents the object grouping optimization for SMashQ. Section 5.2 extends SMashQ to support the direction sharing optimization. Section 5.3 discusses the user grouping optimization that further improves SMashQ. Finally, Section 5.4 analyzes the performance of our algorithms.

5.1 Object Grouping Optimization

We observe that many spatial objects are generally located in clusters in the real world. For example, many restaurants are located in a downtown area (Figure 3a). Thus, it makes sense to group nearby objects to a representative point. The database server issues only one external request to the Web mapping service for an object group. Then, it shares the retrieved travel time and route information from a user to the representative point among the objects in the group to estimate the travel time from the user to each object.

There are two challenges in grouping objects: (a) How to select representative points in a road network? and (b) How to group objects to representative points? Our solution is to select intersections as representative points in a road network and group objects to them. The reason is twofold: (1) Since there is only one possible path from the intersection of a group G to each object in G, it is easy to estimate the travel time from the intersection to each object in G. (2) A road segment should have the same conditions, e.g., speed limit and direction.
The estimation of travel time would be more natural and accurate by considering a road segment as a basic unit [16]. Figure 6 gives an example for object grouping where objects R8, R9 and R10 are grouped together by intersection I11. The database server only needs one external Web mapping request to retrieve the route information and travel time from U to I11, and then it estimates the travel time from I11 to each of its group members.

Fig. 6 Object grouping optimization.

5.1.1 Greedy Object Grouping Algorithm

To minimize the number of external Web mapping requests, we should find a minimal set of intersections that cover all road segments having target objects. Given a k-NN query issued by a user U and a set R of candidate objects, which is computed by Algorithm 2, a subgraph Gu = (Vu, Eu) is extracted from the graph G of the underlying road network model, where Eu is an edge set in which each edge contains at least one candidate object in R and Vu is a vertex set that consists of the intersections of the edges in Eu. Algorithm 2 is similar to the exact algorithm (Algorithm 1), except that Algorithm 2 issues external Web mapping requests only for the k-nearest objects of U to find an initial answer, and then computes Tmax and distmax for the initial answer (Lines 13 to 18 in Algorithm 2). After determining the initial answer, Algorithm 2 does not issue any external Web mapping requests for other candidate objects. Instead, Algorithm 2 adds such candidate objects to a candidate object set R (Line 21 in Algorithm 2). The greedy object grouping algorithm returns a vertex set I ⊆ Vu that covers all the edges in Eu and is intended to be as small as possible. The object grouping problem is an instance of the vertex cover problem [5], which is NP-complete, so we use a greedy algorithm to find I.

Algorithm.
Algorithm 3 depicts the pseudo code of a greedy object grouping algorithm. The input consists of the graph G of the underlying road network model and a set R of candidate objects, which are found by Algorithm 2, for a k-NN query issued by a user U. The algorithm first extracts a subgraph Gu from G and calculates the number of edges connected to a vertex as the degree of that vertex (Lines 3 to 4 in Algorithm 3). Then, it selects a vertex v from Vu with the largest degree and inserts it into I. In case of a tie, the vertex with the shortest distance to U is selected. Each object on each edge connected to v is assigned to the object group of v. Note that each candidate object is assigned to one or two object groups. v is removed from Vu and the edges of v are removed from Eu. After that, the degrees of the vertices of the removed edges are updated accordingly. The selection process is repeated until Vu becomes empty (Lines 5 to 12 in Algorithm 3).

Algorithm 2 Candidate object search algorithm
 1: input: G = (V, E), R, Q = (λ, k), U.max_speed
 2: Initialize a current answer set A, a node set S, a priority queue Q, and a candidate object set R
 3: distmax ← ∞; Tmax ← ∞
 4: Ni ← the nearest node from U
 5: NDist ← Dist(U, Ni)
 6: Q.enqueue(Ni, NDist); S ← {Ni}
 7: while Q is not empty do
 8:     (Ni, NDist) ← Q.dequeue()
 9:     if A contains k objects and NDist > distmax then
10:         break
11:     end if
12:     if Ni is an object (i.e., Ni ∈ R) then
13:         if A contains less than k objects then
14:             Issue an external Web mapping request to retrieve Time(U → Ni)
15:             A ← A ∪ {Ni}
16:             if A contains k objects then
17:                 Tmax ← the largest travel time of the objects in A
18:                 distmax ← U.max_speed × Tmax
19:             end if
20:         else
21:             R ← R ∪ {Ni}
22:         end if
23:     end if
24:     for Each neighbor node Nj of Ni, where Nj ∉ S do
25:         Q.enqueue(Nj, NDist + Dist(Ni, Nj)); S ← S ∪ {Nj}
26:     end for
27: end while
28: return R

Algorithm 3 Greedy object grouping algorithm
 1: input: Graph G = (V, E) and a set of candidate objects R from Algorithm 2
 2: I ← ∅
 3: Extract a subgraph Gu = (Vu, Eu) from G = (V, E) such that each edge in Eu contains some objects in R and Vu contains the intersections of the edges in Eu
 4: Calculate the degree of each vertex in Vu
 5: while Vu is NOT empty do
 6:     v ← the vertex with the largest degree in Vu
 7:     Assign each object on each edge connected to v to the object group of v
 8:     Insert v into I and remove v from Vu
 9:     Remove edges connected to v from Eu
10:     Update the degree of each vertex in Vu
11:     Remove the vertex without any edge from Vu
12: end while

Fig. 7 Example of the object grouping optimization (current answer = {R4, R5}).

Fig. 8 Example of the greedy object grouping algorithm: (a) I = {I11}; (b) I = {I11, I9}; (c) I = {I11, I9}.

Example. Figures 7 and 8 depict an example for the greedy object grouping algorithm, where a subgraph Gu = (Vu, Eu) is extracted from G, as depicted in Figure 8a. Note that since R4 and R5 form the current answer A, which is found by Algorithm 2, the routes and travel times from U to R4 and R5 have already been retrieved from the Web mapping service. Thus, R4 and R5 are not among the candidate objects R = {R1, R2, R8, R9, R10} (Figure 7) found by Algorithm 2. Since edges I5I9, I9I10, I7I11 and I11I12 contain some candidate objects in R, these four edges constitute Eu and their vertices constitute Vu. Then, the greedy algorithm calculates the degree for each vertex in Vu. For example, the degree of I11 is two because two edges I7I11 and I11I12 are connected to I11. Figure 8a shows that I11 has the largest degree and is closer to U than I9, which has the same degree, so I11 is inserted into I and the edges connected to I11 are removed from Gu.
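The selection loop of Algorithm 3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name, the dictionary layout, the distances to U, and the placement of R8, R9 and R10 on edges I7I11 and I11I12 are assumptions made for the example. Objects are assigned from every original edge at the selected vertex, which is one way to read the remark that an object may belong to one or two groups.

```python
def greedy_object_grouping(edge_objects, dist_to_user):
    """Greedy vertex cover over the edges that carry candidate objects.

    edge_objects: {(u, v): [objects on that edge]}   (hypothetical layout)
    dist_to_user: {vertex: network distance from U}, used only to break ties
    """
    all_edges = {frozenset(e): set(objs) for e, objs in edge_objects.items() if objs}
    uncovered = set(all_edges)
    selected, groups = [], {}
    while uncovered:
        # Degree of each vertex, counted over the edges not yet covered.
        degree = {}
        for edge in uncovered:
            for vertex in edge:
                degree[vertex] = degree.get(vertex, 0) + 1
        # Largest degree first; ties go to the vertex closest to the user.
        v = min(degree, key=lambda x: (-degree[x], dist_to_user[x]))
        selected.append(v)
        # Objects on every original edge at v join the group of v, so an
        # object can end up in the groups of both endpoints of its edge.
        groups[v] = set().union(*(objs for e, objs in all_edges.items() if v in e))
        uncovered = {e for e in uncovered if v not in e}
    return selected, groups


# Edges of the running example; object placement on the I11 edges is assumed.
edge_objects = {
    ("I5", "I9"): ["R2"],
    ("I9", "I10"): ["R1"],
    ("I7", "I11"): ["R8"],
    ("I11", "I12"): ["R9", "R10"],
}
# Hypothetical network distances from U; only "I11 is closer than I9" is from the text.
dist_to_user = {"I5": 1.0, "I7": 2.0, "I11": 2.5, "I9": 3.0, "I12": 3.5, "I10": 4.0}

selected, groups = greedy_object_grouping(edge_objects, dist_to_user)
print(selected, groups)
```

With these inputs the sketch selects I11 first and I9 second, matching the order of the example.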
Figure 8b shows that the degrees of I7 and I12, the other endpoints of the two deleted edges, are updated accordingly. Any vertex in Gu without any edge (i.e., I7 and I12) is removed from Vu. Since I9 now has the largest degree, I9 is inserted into I and the edges connected to I9 are removed. Since I5 and I10 have no edge, they are removed from Vu. Finally, Vu becomes empty, so the greedy algorithm terminates and clusters the candidate objects into two object groups (indicated by dotted rectangles), I9 = {R1, R2} and I11 = {R8, R9, R10}, as indicated in Figure 7.

5.1.2 Travel Time Estimation for Grouped Objects

Most Web mapping service providers, e.g., Microsoft Bing Maps, return turn-by-turn route information for a request. For example, if the database server sends a request to the Web mapping service to retrieve the route and travel time information from user U to intersection I9 in Figure 7, the service returns the turn-by-turn route information, i.e., Route(U → I9) = {⟨Dist(U → I6), Time(U → I6)⟩, ⟨Dist(I6 → I5), Time(I6 → I5)⟩, ⟨Dist(I5 → I9), Time(I5 → I9)⟩}, where Dist(A → B) and Time(A → B) are the distance and travel time from location A to location B, respectively. If a group contains only one object, the database server simply issues one external request to retrieve the route information from the user to that object. Otherwise, our algorithm performs two steps to estimate the travel time for each object and select the best route (if appropriate) as follows.

Step 1: Travel time calculation. Based on whether a candidate object is on the last road segment of a retrieved route, we can distinguish two cases.

Case 1: An object is on the last road segment. In this case, we assume that the speed on the last road segment is constant. For example, given the information of the route from U to I9 (i.e., Route(U → I9)) retrieved from the Web mapping service, object R2 is on the last road segment of the route, i.e., I5I9. Hence, the travel time from U to R2 is calculated as:

\[ \mathit{Time}(U \to R_2) = \mathit{Time}(U \to I_9) - \mathit{Time}(I_5 \to I_9) \times \frac{\mathit{Dist}(R_2 \to I_9)}{\mathit{Dist}(I_5 \to I_9)}. \]

Case 2: An object is NOT on the last road segment. In this case, a candidate object is not on a retrieved route, i.e., the object is on a road segment S connected to the last road segment S′ of the route. We assume that the travel speed of S is the same as that of S′. For example, given the retrieved information of the route from U to I9 (i.e., Route(U → I9)), object R1 is not on the last road segment of the route, i.e., I5I9. Hence, the travel time from U to R1 is calculated as:

\[ \mathit{Time}(U \to R_1) = \mathit{Time}(U \to I_9) + \mathit{Time}(I_5 \to I_9) \times \frac{\mathit{Dist}(I_9 \to R_1)}{\mathit{Dist}(I_5 \to I_9)}. \]

Step 2: Route selection. The greedy object grouping algorithm may assign a candidate object to two object groups. In this case, the database server can find the routes from U to the object via each of the two intersections. This step selects the route with the shortest travel time. For example, if both the intersections of edge I9I10, i.e., I9 and I10, are selected (Figure 7), the travel time from user U to R1 is selected as Time(U → R1) = min(Time(U → I9 → R1), Time(U → I10 → R1)).

5.1.3 Algorithm

Algorithm 4 SMashQ with the object grouping optimization
 1: input: G = (V, E), R, Q = (λ, k), U.max_speed
 2: Execute Algorithm 2 to find a set of candidate objects R, a current answer A, Tmax and distmax
 3: Execute the Greedy Object Grouping Algorithm to get a set of intersections I (Algorithm 3)
 4: I is sorted according to the intersection's network distance from U in non-decreasing order
 5: while I is not empty do
 6:     Ii ← the intersection in I with the shortest network distance from U
 7:     Issue a Web mapping request to retrieve Route(U → Ii) and Time(U → Ii)
 8:     for Each object Rj in the object group of Ii do
 9:         Estimate the time from U to Rj, i.e., Time(U → Rj)
10:         if Time(U → Rj) < Tmax then
11:             Replace the object with Tmax in A with Rj
12:             Tmax ← max{Time(U → Rl) | Rl ∈ A}
13:             distmax ← U.max_speed × Tmax
14:             for Each object Rk ∈ R with the shortest network distance from U > distmax do
15:                 Remove Rk from R and its associated object group(s)
16:             end for
17:         end if
18:     end for
19:     Remove Ii and the intersection with an empty object group from I
20: end while
21: return A

Algorithm 4 depicts the pseudo code of SMashQ with the object grouping optimization. It first executes Algorithm 2 to find a candidate object set R, a current answer A, the maximum travel time Tmax of the objects in A, and the maximum travel distance distmax for a querying user U moving at its maximum possible speed (i.e., U.max_speed) for time period Tmax (Line 2 in Algorithm 4). Then, it executes the greedy object grouping algorithm (i.e., Algorithm 3) to find an intersection set I. The intersections in I are sorted based on their shortest network distance from U in non-decreasing order (Line 4 in Algorithm 4). The algorithm repeatedly processes the intersection Ii ∈ I with the shortest network distance from U because Ii has a higher chance to find objects that would be part of a query answer and to prune more objects in R. For each Ii, the algorithm issues an external Web mapping request to retrieve the route and travel time information from U to Ii, denoted as Route(U → Ii) and Time(U → Ii), respectively (Line 7 in Algorithm 4). For each object Rj in the object group of Ii, the travel time from U to Rj is estimated. If Time(U → Rj) < Tmax, the object with Tmax in A is replaced with Rj. Tmax and distmax are then updated accordingly.
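The estimate of Time(U → Rj) used in this step follows the two cases and the route selection of Section 5.1.2. A minimal sketch is given below; the function names and all numeric values are hypothetical and only illustrate the arithmetic under the constant-speed assumption.

```python
def estimate_on_last_segment(time_to_end, last_seg_time, last_seg_dist, dist_obj_to_end):
    """Case 1: the object lies on the last road segment of the retrieved route.

    The route overshoots the object, so the time for the remaining
    stretch (object to end of route) is subtracted.
    """
    return time_to_end - last_seg_time * dist_obj_to_end / last_seg_dist


def estimate_beyond_last_segment(time_to_end, last_seg_time, last_seg_dist, dist_end_to_obj):
    """Case 2: the object lies on a segment connected to the last road segment.

    The connected segment is assumed to have the same speed as the last one,
    so the extra stretch (end of route to object) is added.
    """
    return time_to_end + last_seg_time * dist_end_to_obj / last_seg_dist


def select_route(estimates):
    """Step 2: keep the shortest estimate when several intersections reach the object."""
    return min(estimates)


# Hypothetical values: Time(U -> I9) = 300 s, Time(I5 -> I9) = 60 s, Dist(I5 -> I9) = 600 m.
time_u_r2 = estimate_on_last_segment(300.0, 60.0, 600.0, 200.0)      # R2 is 200 m before I9
time_u_r1 = estimate_beyond_last_segment(300.0, 60.0, 600.0, 150.0)  # R1 is 150 m past I9
best_r1 = select_route([time_u_r1, 330.0])                           # 330 s: assumed estimate via I10
print(time_u_r2, time_u_r1, best_r1)
```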
After that, each object Rk ∈ R whose shortest network distance from U is larger than distmax (i.e., distmax = U.max_speed × Tmax) is pruned from R and its associated object group(s) (Lines 14 to 16 in Algorithm 4). Ii and any intersection with an empty object group are removed from I. This pruning process is repeated until I becomes empty (Lines 5 to 20 in Algorithm 4).

In the running example, the exact algorithm has to issue five external Web mapping requests, i.e., U → R1, U → R2, U → R4, U → R5, and U → R9. With the object grouping optimization, SMashQ only needs to issue four requests, i.e., U → R4, U → R5, U → I9, and U → I11.

5.2 Direction Sharing

As described in Section 5.1, the object grouping optimization can effectively reduce the number of external Web mapping requests. To reduce the number of requests further, we design the direction sharing optimization to boost shared execution in SMashQ. This section is organized as follows. We present the main idea and challenges of the direction sharing optimization (Section 5.2.1), its technical details (Section 5.2.2), and the algorithm of SMashQ with the object grouping and direction sharing optimizations (Section 5.2.3).

5.2.1 Main Idea and Challenges

For a request from location A to location B, most Web mapping service providers return the route Route(A → B) with the shortest travel time, together with its turn-by-turn route and travel time information. Let X be an intermediate intersection in Route(A → B), i.e., T = Route(A → ... → X → ... → B). We observe that every prefix of T, i.e., Route(A → X), is also the route with the shortest travel time from A to X. The correctness of this observation is shown in the following theorem.

Theorem 1 Given a route T = Route(A → ... → X → ... → B), where X is an intermediate intersection in T, with the shortest travel time from location A to location B, every prefix of T (i.e., Route(A → X)) is also the route with the shortest travel time from location A to intersection X.

Proof A simple cut-and-paste argument is used to prove this theorem [5]. Suppose that there is a route T′ = Route(A → ... → Y → ... → X), where Y ∉ T, with a travel time shorter than that of the prefix Route(A → X) in T. Then we could cut Route(A → X) out of T and paste in T′, resulting in a route with a shorter travel time than T, thus contradicting the optimality of T. □

Based on Theorem 1, given a set of intersections I computed by the greedy object grouping algorithm for a k-NN query issued from a user U, we should first process the intersection Ii ∈ I such that the direction information from U to Ii can be shared with the largest subset I′ of other intersections in I (i.e., Ii has the highest sharing ability). The sharing ability of an intersection Ii ∈ I is measured as

\[ \frac{|\mathit{Set}(\mathit{Route}(U \to I_i)) \cap I|}{|I|}, \]

where Set(Route(U → Ii)) is the set of intersections in Route(U → Ii). The direction information from U to Ii will be used to determine the route and travel time from U to each intersection in I′ without issuing additional external Web mapping requests. Consider an example in which the greedy object grouping algorithm finds I = {I5, I6, I8, I9, I11, I15}, which are represented by solid squares (Figure 9). If we know that Route(U → I9) passes through I1, I2, I5, I6, and I9, its direction information can be shared with intersections I5, I6, and I9, and thus the sharing ability of I9 is |{I5, I6, I9}| / 6 = 3/6 = 0.5. In this example, only three external Web mapping requests are needed, i.e., U → I8, U → I9, and U → I15, because their direction information can be shared with all other intersections in I.
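The sharing ability measure is a set intersection followed by a ratio, which the following sketch reproduces for the example above. The function name and the representation of routes as lists of intersection labels are assumptions made for illustration.

```python
def sharing_ability(route_intersections, group_intersections):
    """Fraction of the group intersections in I that lie on Route(U -> Ii)."""
    group = set(group_intersections)
    return len(set(route_intersections) & group) / len(group)


# Intersections selected by object grouping in the example.
I = ["I5", "I6", "I8", "I9", "I11", "I15"]
# Intersections that Route(U -> I9) passes through in the example.
route_to_i9 = ["I1", "I2", "I5", "I6", "I9"]

shared = sorted(set(route_to_i9) & set(I))
ability = sharing_ability(route_to_i9, I)
print(shared, ability)
```

Here three of the six intersections in I lie on the route, so the measure evaluates to 3/6.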
Fig. 9 Direction sharing optimization.

Since the traffic condition would be very dynamic, as discussed in Section 1, we could not accurately predict the route between two locations. We will describe a histogram approach (Section 5.2.2) that is employed to estimate the sharing ability for an intersection based on the historical route information retrieved from the Web mapping service.

5.2.2 Histogram Approach

In this section, we describe a histogram approach that is used to maintain historical route information and provide a heuristic measure of sharing ability for an intersection. Since querying users are mobile users, it is impossible to store an entry from every possible location to an intersection. Coupled with the fact that the sharing ability measure is not based on the querying user's location, the histogram approach only maintains the historical route information from one intersection to another one. Given a querying user U who is moving towards an intersection Iu, Iu is considered as the starting location for the sharing ability measure.

The histogram approach basically maintains a two-dimensional matrix H in which an entry H[Ii, Ij] (1 ≤ i, j ≤ |V|, where |V| is the number of intersections in the graph G of the road network model) stores the number of intersections from intersection Ii to intersection Ij, inclusive of Ii and Ij, based on the latest route information from Ii to Ij. The rationale behind the histogram approach is that H[Ii, Ij] is used to estimate the sharing ability from Ii to Ij because a route with more intersections has a higher chance to be shared with other intersections. Initially, if Ii = Ij, H[Ii, Ij] = 1; otherwise, H[Ii, Ij] = −1. Given a route from Iu to Id, we can distinguish two cases for the sharing ability measure based on the value of H[Iu, Id]:

Case 1: H[Iu, Id] > 0. The sharing ability of Id is simply equal to H[Iu, Id].

Case 2: H[Iu, Id] = −1. Since the histogram has no historical route information from Iu to Id, we employ a shortest path algorithm [5] (e.g., Dijkstra's algorithm or the A∗ algorithm) to find the shortest network path from Iu to Id. Then, H[Iu, Id] and the sharing ability of Id are set to the number of intersections in the shortest network path, inclusive of Iu and Id.

Algorithm 5 SMashQ with the object grouping and direction sharing optimizations
 1: input: G = (V, E), R, Q = (λ, k), U.max_speed, H
 2: Execute Algorithm 2 to find a set of candidate objects R, a current answer A, Tmax and distmax
 3: Execute the Greedy Object Grouping Algorithm to get a set of intersections I (Algorithm 3)
 4: I is sorted according to the intersection's sharing ability in non-increasing order
 5: while I is not empty do
 6:     Ih ← the intersection in I with the highest sharing ability
 7:     Issue a Web mapping request to retrieve Route(U → Ih) and Time(U → Ih)
 8:     I′ ← the intersections included in Set(Route(U → Ih)) ∩ I
 9:     for Each intersection Ii ∈ I′ do
10:         for Each object Rj in the object group of Ii do
11:             Estimate the time from U to Rj, i.e., Time(U → Rj)
12:             if Time(U → Rj) < Tmax then
13:                 Replace the object with Tmax in A with Rj
14:                 Tmax ← max{Time(U → Rl) | Rl ∈ A}
15:                 distmax ← U.max_speed × Tmax
16:                 for Each object Rk ∈ R with the shortest network distance from U > distmax do
17:                     Remove Rk from R and its associated object group(s)
18:                 end for
19:             end if
20:         end for
21:     end for
22:     Remove Ih, the intersection with an empty object group, and I′ from I
23:     Update H based on Route(U → Ih) accordingly
24: end while
25: return A
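Cases 1 and 2 of the histogram lookup can be sketched as follows. This is a hedged illustration rather than the authors' code: H is stored as a dictionary keyed by intersection pairs instead of a dense matrix, Dijkstra's algorithm is used for Case 2, and the small road graph and its edge lengths are invented for the example.

```python
import heapq


def undirected(edges):
    """Build an adjacency map {node: {neighbor: length}} from (a, b, length) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def shortest_path_node_count(graph, src, dst):
    """Number of intersections on the shortest network path, inclusive of both ends."""
    best = {src: 0.0}
    count = {src: 1}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return count[node]
        if d > best[node]:
            continue  # stale queue entry
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                count[nxt] = count[node] + 1
                heapq.heappush(heap, (nd, nxt))
    raise ValueError("no path between the two intersections")


def histogram_sharing_ability(H, graph, i_u, i_d):
    """Heuristic sharing ability of i_d for a user heading towards i_u."""
    if i_u == i_d:
        return 1
    value = H.get((i_u, i_d), -1)  # a missing entry plays the role of -1
    if value > 0:                  # Case 1: historical route information exists
        return value
    value = shortest_path_node_count(graph, i_u, i_d)  # Case 2: fall back to the network
    H[(i_u, i_d)] = value
    return value


# Hypothetical unit-length road graph.
graph = undirected([("I6", "I5", 1.0), ("I5", "I9", 1.0), ("I6", "I2", 1.0),
                    ("I2", "I1", 1.0), ("I1", "I5", 1.0)])
H = {}
first = histogram_sharing_ability(H, graph, "I6", "I9")   # Case 2: path I6, I5, I9
H[("I6", "I9")] = 5                                        # as if refreshed from a retrieved route
second = histogram_sharing_ability(H, graph, "I6", "I9")  # Case 1: stored value is returned
print(first, second)
```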
After SMashQ receives the route and travel time information from U to a selected intersection Id (i.e., Route(U → Id) and Time(U → Id)), it updates the entry in H that corresponds to the route from the first intersection in Route(U → Id) to Id, and every entry in H that corresponds to a prefix of that route. For example, if the database server receives Route(U → I9) = {⟨Dist(U → I6), Time(U → I6)⟩, ⟨Dist(I6 → I2), Time(I6 → I2)⟩, ⟨Dist(I2 → I1), Time(I2 → I1)⟩, ⟨Dist(I1 → I5), Time(I1 → I5)⟩, ⟨Dist(I5 → I9), Time(I5 → I9)⟩} from the Web mapping service provider, it updates H[I6, I2] = 2, H[I6, I1] = 3, H[I6, I5] = 4, and H[I6, I9] = 5.

5.2.3 Algorithm

Algorithm 5 depicts the pseudo code of SMashQ with the object grouping and direction sharing optimizations. Similar to Algorithm 4, Algorithm 5 executes Algorithm 2 to find a set R of candidate objects, a current answer A, the maximum travel time Tmax of the objects in A, and the maximum travel distance distmax for a querying user U moving at its maximum possible speed (i.e., U.max_speed) for time period Tmax (Line 2 in Algorithm 5). Then, Algorithm 3 is executed to find a set of intersections I. There are three modifications in the pruning process for the direction sharing optimization. The first modification is that the intersections in I are sorted by their sharing ability in non-increasing order, instead of by their shortest network distance from U as in Algorithm 4 (Line 4 in Algorithm 5), and the algorithm repeatedly processes the intersection Ih with the highest sharing ability (Line 6 in Algorithm 5).
Since the route and travel time information retrieved for a request from U to Ih can also be shared with the prefix intersections included in I, the second modification is to use such information to process the object groups of the set of intersections I′ included in both the set of intersections in Route(U → Ih) and I (i.e., I′ = Set(Route(U → Ih)) ∩ I) (Lines 8 to 22 in Algorithm 5). The third modification is to update the histogram H based on the retrieved route information, i.e., Route(U → Ih), accordingly.

5.3 User Grouping

To further reduce the number of external Web mapping requests for a database server with a high workload, e.g., a large number of users or continuous queries, we design another shared execution optimization that groups users. Similar to object grouping, we consider intersections in the road network as representative points. The database server only needs to issue one external request from the intersection of a user group Gu to the intersection of an object group Go to estimate the travel time from each user Ui in Gu to each object Rj in Go. The key difference between the user grouping optimization and the object grouping optimization is that users are moving. Grouping users has to consider their movement direction, so a user is grouped to the nearest intersection to which the user is moving. Figure 10 depicts an example, where user A on edge I9I10 is moving towards I10, so A is grouped to the user group of I10. Similarly, users D and E are also grouped to the user group of I10. Users C and B on edges I1I2 and I6I2, respectively, are both moving to I2, so they are grouped to the user group of I2.

After having grouped users, we process queries in terms of user groups instead of users. The database server finds an overall query answer A for all users in a user group, and then extracts the query answer from A for each user in the group.
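The grouping rule itself is simple once the intersection that each user is heading to is known. The sketch below assumes that this heading has already been derived from the user's edge and movement direction; the function name and the input layout are hypothetical, and the user-to-intersection mapping is the one stated for Figure 10.

```python
from collections import defaultdict


def group_users(heading):
    """Group users by the intersection they are moving towards.

    heading: {user: intersection the user is moving to}   (hypothetical layout)
    """
    groups = defaultdict(list)
    for user, intersection in heading.items():
        groups[intersection].append(user)
    return dict(groups)


# Users of the example: A, D and E move towards I10; B and C move towards I2.
heading = {"A": "I10", "B": "I2", "C": "I2", "D": "I10", "E": "I10"}
user_groups = group_users(heading)
print(user_groups)
```

Each key of the result is the representative intersection of one user group, so one set of external requests per key serves every user in that group.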
For example, given a user group Gu and the intersection Iu of Gu, the database server first searches for the kmax nearest objects to Iu, where kmax is the largest value of k of the k-NN queries issued by the users in Gu. Then, kmax and Iu are used to replace k and U in the algorithm of SMashQ with the object grouping optimization (i.e., Algorithm 4) or the algorithm of SMashQ with the object grouping and direction sharing optimizations (Algorithm 5) to find an overall query answer A for all the users in Gu. For the ki-NN query of each user Ui in Gu, the database server takes the ki objects with the shortest travel time in A as Ui's query answer Ai, and estimates the travel time from the user's location to Iu as in Section 5.1.2 to calculate the travel time from Ui to each object in Ai. Therefore, each query in Gu can be effectively processed.

Fig. 10 User grouping optimization.

5.4 Performance Analysis

We now evaluate the performance of our algorithms. Since accessing travel time and direction information from an external Web mapping service is more expensive than local data processing [18], we compare the performance of our algorithms in terms of the number of external Web mapping requests. Let M and N be the number of querying users and the average number of candidate objects per query, respectively.

– The exact algorithm needs to issue N × M external Web mapping requests to answer all M queries; hence, its cost is CostExact = N × M.
– By using the object grouping optimization, SMashQ can reduce the average number of candidate objects from N to N′ through grouping objects to intersections, where N′ ≤ N; hence, CostO = N′ × M.
– If SMashQ employs the object grouping and direction sharing optimization techniques, it can further reduce N′ to N′′ by sharing the direction information among intersections, where N′′ ≤ N′; hence, CostOD = N′′ × M.
– When SMashQ employs all the optimization techniques (i.e., object grouping, direction sharing, and user grouping), it can reduce M to M′ by grouping users to intersections, where M′ ≤ M; hence, CostODU = N′′ × M′.

In general, N′′ ≪ N′ ≪ N and M′ ≪ M, so CostODU ≪ CostOD ≪ CostO ≪ CostExact. We will verify our performance analysis through the experiments in Section 6.

6 Performance Evaluation

In this section, we first give our evaluation model in Section 6.1, including a basic approach, performance metrics, and default experiment settings. Then, we analyze the experiment results in Section 6.2.

6.1 Evaluation Model

Basic approach. To the best of our knowledge, there are no other existing algorithms for k-NN queries using spatial mashups in time-dependent road networks. We therefore use a basic approach (denoted as BasicExact), as proposed in our previous work [33]. The basic idea of BasicExact is to issue external Web mapping requests for the k-nearest objects to find an initial answer, calculate distmax based on the largest travel time in the initial answer and the user's maximum movement speed, and find other objects within the network distance distmax from the user as a candidate object set R for a query answer. Then, BasicExact issues one external Web mapping request from the querying user to each candidate object in R to get an exact k-NN query answer. We also consider the exact algorithm as another basic approach (denoted as Exact) to evaluate the performance of our SMashQ with only the object grouping and direction sharing optimizations (denoted as SMashQ-OD) and our SMashQ with all the three shared execution optimizations (i.e., object grouping, direction sharing, and user grouping) (denoted as SMashQ-ODU).

Performance metrics.
We evaluate the performance of the basic approaches and our SMashQ algorithms in terms of four metrics. (1) The average number of external Web mapping requests per query and (2) the average response time per query are metrics evaluating the efficiency of our algorithms. The response time of a query is defined as the sum of the local CPU processing time and the response time from Microsoft Bing Maps (including the communication cost and the CPU processing time at Microsoft Bing Maps). It is calculated as the local CPU query processing time plus the product of the number of external requests and the average response time per external request. From our experimental results, the average response time per external request from Microsoft Bing Maps is 14.3 milliseconds. (3) The accuracy of estimated travel time (denoted as Acc_Time) is a metric computed by:

\[ \mathit{Acc\_Time} = 1 - \min\left( \frac{|\hat{T} - T|}{T},\ 1 \right), \qquad (1) \]

where T and \hat{T} are the actual travel time (retrieved by the exact algorithm) and the estimated one, respectively, and 0 ≤ Acc_Time ≤ 1. (4) The accuracy of k-NN query answers (denoted as Acc_Ans) evaluates the accuracy of the query answers returned by our SMashQ algorithms and is calculated by:

\[ \mathit{Acc\_Ans} = \frac{|\hat{A} \cap A|}{|A|}, \qquad (2) \]

where A is an exact query answer returned by the exact algorithm, \hat{A} is a query answer returned by our SMashQ algorithms, and 0 ≤ Acc_Ans ≤ 1.

Fig. 11 Two real data sets: (a) the restaurant data; (b) the hotel data.

Experiment settings. We evaluated the performance of the basic approaches (i.e., BasicExact and Exact) and the SMashQ algorithms (i.e., SMashQ-OD and SMashQ-ODU) using C++ and Microsoft Bing Maps [19] with a real road network of Hennepin County, MN, USA [29]. We select an area of 8 × 8 km² that contains 6,109 road segments and 3,593 intersections, and the latitude and longitude of its left-bottom and right-top corners are (44.898441, -93.302791) and (44.970094, -93.204015), respectively.
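Equations 1 and 2 and the response-time definition above translate directly into code. The sketch below is only an illustration: the function names and the sample values are made up, and the default of 0.0143 seconds is the average per-request response time reported above.

```python
def acc_time(actual, estimated):
    """Equation 1: accuracy of an estimated travel time, clamped to [0, 1]."""
    return 1.0 - min(abs(estimated - actual) / actual, 1.0)


def acc_ans(exact_answer, returned_answer):
    """Equation 2: fraction of the exact k-NN answer that the returned answer contains."""
    exact = set(exact_answer)
    return len(set(returned_answer) & exact) / len(exact)


def response_time(local_cpu_seconds, num_requests, avg_request_seconds=0.0143):
    """Local CPU time plus (number of external requests x average time per request)."""
    return local_cpu_seconds + num_requests * avg_request_seconds


# Made-up sample values.
print(acc_time(100.0, 90.0))                                     # estimate is 10% off
print(acc_time(100.0, 250.0))                                    # error above 100% is clamped
print(acc_ans(["R1", "R2", "R3", "R4"], ["R1", "R2", "R3", "R5"]))
print(response_time(0.5, 100))
```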
We retrieved two real data sets within the selected area from Google Maps, i.e., a restaurant data set of 1,835 restaurants and a hotel data set of 363 hotels, as depicted in Figures 11a and 11b, respectively. Unless mentioned otherwise, we used the real restaurant data set and uniformly generated 5,000 querying users in the road network for all the experiments, and the default requested number of nearest objects of k-NN queries (i.e., the value of k) is 10. The maximum allowed speed is 90 km per hour.

Fig. 12 Effect of the Requested Number of Nearest Objects (k): (a) number of external requests; (b) average query response time; (c) accuracy of travel time; (d) accuracy of query answers.

6.2 Experiment Results

We evaluated the scalability and efficiency of the basic approaches (i.e., BasicExact and Exact) and the SMashQ algorithms (i.e., SMashQ-OD and SMashQ-ODU) with respect to the requested number of nearest objects, i.e., the value of k (Section 6.2.1), the number of objects (Section 6.2.2), and the number of querying users (Section 6.2.3).

6.2.1 Effect of the Requested Number of Nearest Objects

Figure 12 shows the performance of the basic approaches and our SMashQ algorithms as the requested number of nearest objects (i.e., k) varies from 1 to 50.
As depicted in Figure 12a, the SMashQ algorithms (i.e., SMashQ-OD and SMashQ-ODU) outperform the basic approaches in terms of the average number of external Web mapping requests. The exact algorithm (Exact) reduces the number of external requests by 32% on average compared to BasicExact. SMashQ-OD further significantly reduces the number of external requests by 62%. With the user grouping optimization, SMashQ-ODU yields the fewest external requests. This is because SMashQ-ODU has the ability to process queries from several users at the same time instead of processing them one by one as in SMashQ-OD. It is expected that when k gets larger, there are more candidate objects, and thus the database server needs to issue more external requests. Since processing external requests is more expensive than local query processing, the query response time has the same trend as the number of external requests (Figure 12b).

Fig. 13 Effect of the number of objects using synthetic data: (a) number of external requests; (b) average query response time; (c) accuracy of travel time; (d) accuracy of query answers.

The two basic approaches BasicExact and Exact provide the exact travel time and k-NN query answers.
Conversely, the shared execution optimizations designed for SMashQ require travel time estimation to reduce the number of external requests, so we need to evaluate the accuracy of estimated travel time (measured by Equation 1) and the accuracy of k-NN query answers (measured by Equation 2) provided by SMashQ. Figures 12c and 12d depict that both SMashQ-OD and SMashQ-ODU provide highly accurate travel time (at least 84% by SMashQ-OD and 83% by SMashQ-ODU) and query answers (at least 83% by SMashQ-OD and 82% by SMashQ-ODU), respectively.

Fig. 14 Effect of the number of objects using real data (363 hotels and 1,835 restaurants): (a) number of external requests; (b) average query response time; (c) accuracy of travel time; (d) accuracy of query answers.

Both SMashQ-OD and SMashQ-ODU need to estimate the travel time from an intersection to each object in an object group, but SMashQ-ODU also needs to estimate the travel time from each querying user in a user group to an intersection. As a result, the accuracy of SMashQ-ODU is slightly worse than that of SMashQ-OD. When k increases, the accuracy of our SMashQ algorithms improves. The reason for this is that a larger k value requires them to process objects further away from the querying user, so the ratio of the distance over which the travel time must be estimated to the total distance between two locations decreases. Thus, the accuracy of the travel time estimation improves as k gets larger (Figure 12c).
The more accurate travel time estimation leads to more accurate query answers, as depicted in Figure 12d. The results of this experiment confirm that our SMashQ algorithms not only effectively reduce the number of external Web mapping requests, but also provide highly accurate k-NN query answers.

Fig. 15 Effect of the number of querying users: (a) number of external requests; (b) average query response time; (c) accuracy of travel time; (d) accuracy of query answers.

6.2.2 Effect of the Number of Objects

Both synthetic data and real data sets were used to evaluate the performance of SMashQ with respect to the number of objects.

Synthetic data. In the synthetic data set, the objects are randomly distributed in the road network. Figure 13 depicts the performance of the basic approaches and our SMashQ algorithms as the number of synthetic objects increases from 1,000 to 5,000. Figure 13a depicts that our SMashQ algorithms require far fewer external requests than the basic approaches when the number of objects increases, and thus our SMashQ algorithms can reduce the query response time (Figure 13b). The accuracy of travel time estimation and query answers gets worse when the number of objects increases, as depicted in Figures 13c and 13d, respectively.
This is because when there are more objects, the required search range is smaller; hence, the portion of the route whose travel time must be estimated becomes a larger fraction of the total distance between the two locations. As a result, a small error in the travel time estimation has a larger impact on the accuracy measure. However, SMashQ-OD and SMashQ-ODU still provide accurate travel time estimation and query answers (over 85% accuracy).

Real data. The experimental results of the two real data sets (363 hotels and 1,835 restaurants) exhibit the same trend as the synthetic data. The SMashQ algorithms perform much better than the basic approaches on both data sets when there are more objects (Figures 14a and 14b). The SMashQ algorithms also achieve high accuracy for travel time estimation and query answers: the accuracy of travel time estimation is over 90% and the accuracy of query answers is over 88%, as depicted in Figures 14c and 14d, respectively.

The experimental results of both the synthetic and real data sets confirm that our SMashQ algorithms are able to scale up to a large number of objects.

6.2.3 Effect of the Number of Querying Users

Figure 15 depicts the performance of the basic approaches and our SMashQ algorithms as the number of querying users increases from 2,000 to 10,000. The number of querying users only slightly affects the performance of the basic approaches and SMashQ-OD because these algorithms process k-NN queries one by one (Figures 15a and 15b). Conversely, SMashQ-ODU is able to process a group of queries issued by different users at the same time. When the number of users increases, there is a higher chance of forming larger user groups, and thus the number of external Web mapping requests per query of SMashQ-ODU decreases. The number of querying users has no obvious impact on the accuracy of travel time estimation and query answers, as depicted in Figures 15c and 15d, respectively.
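The effect of user grouping on the number of external requests per query can be sketched with a toy cost model. The Python code below is only an illustration and not the SMashQ-ODU algorithm: it assumes that users are grouped by a hypothetical road segment identifier and that every group costs the same fixed number of external requests, regardless of its size. The function name, the grouping key, and the numbers are all assumptions.

```python
from collections import defaultdict


def requests_per_query(user_segments, requests_per_group):
    """Average number of external requests per query under a toy grouping model.

    user_segments:      list of (user_id, segment_id) pairs; users that share a
                        segment_id form one user group (hypothetical grouping key)
    requests_per_group: external requests spent once per group (assumed constant)
    """
    groups = defaultdict(list)
    for user_id, segment_id in user_segments:
        groups[segment_id].append(user_id)
    total_requests = len(groups) * requests_per_group
    return total_requests / len(user_segments)


# Three users on three different segments: no sharing is possible.
few_users = [(0, "s1"), (1, "s2"), (2, "s3")]
# Twelve users spread over the same three segments: four users share each group.
many_users = [(u, f"s{u % 3 + 1}") for u in range(12)]

print(requests_per_query(few_users, requests_per_group=10))   # 10.0
print(requests_per_query(many_users, requests_per_group=10))  # 2.5
```

In this simplified model the total cost depends on the number of groups rather than the number of users, so adding users to existing groups lowers the per-query cost.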
The results of this experiment also verify that the SMashQ algorithms can scale up to a large number of users.

7 Conclusion

In this paper, we proposed SMashQ, a spatial mashup framework for a database server to process k-nearest-neighbor (k-NN) queries in time-dependent road networks. SMashQ issues external requests to a Web mapping service, e.g., Google Maps or Microsoft Bing Maps, to retrieve the route and travel time between two locations. Based on the retrieved route and travel time information, SMashQ is able to answer k-NN queries. An exact algorithm was designed for SMashQ based on the existing pruning techniques for external data. Since external Web mapping requests are expensive, three shared execution techniques, namely, object grouping, direction sharing, and user grouping, were proposed for SMashQ to effectively reduce the number of external requests. We conducted extensive experiments to evaluate the performance of SMashQ with the shared execution optimizations using Microsoft Bing Maps, a real road network, real data sets, and a synthetic data set. The experimental results show that SMashQ significantly reduces the number of external requests, provides highly accurate query answers, and scales up to large numbers of objects and users.

7.1 Future Research Directions

We have two future research directions for this work. The first concerns user privacy. In this paper, we assumed that the location-based service provider is trusted, so there is no privacy issue for the user. However, if the user does not trust the service provider, one possible solution is to place a location anonymizer between the user and SMashQ that is responsible for blurring a user’s exact location into a cloaked road segment set S that satisfies the user’s privacy requirements, namely, k-anonymity and minimum road segment length [3].
The location anonymizer can transform the user’s exact location into the set of external intersections of S, and then send all these external intersections to SMashQ as the user’s location to find a k-NN query answer for each external intersection. After the location anonymizer receives the query answers, since it knows the user’s location, it can compute the exact answer and send it to the user. As a result, SMashQ and the Web mapping service provider should not be able to identify the actual querying user with a probability larger than 1/k or pinpoint the user’s location within S. Stricter privacy requirements lead to more external intersections that have to be processed by SMashQ. Thus, new query optimization techniques should also be developed for SMashQ to find query answers for external intersections efficiently.

The Google Maps API has recently provided a new service, called distance matrix, that returns the network distance and travel time for every combination of a set of source locations and a set of destination locations defined in a distance matrix request [28]. We implemented SMashQ using Microsoft Bing Maps, which does not support the distance matrix service. The second research direction is therefore to investigate how to incorporate the distance matrix service into SMashQ, overcome its limitations (e.g., the evaluation license of the Google Distance Matrix API is limited to 100 locations per query and per 10 seconds, and 2,500 locations per 24 hours), and design new query optimization techniques for SMashQ.

References

1. Bruno, N., Gravano, L., Marian, A.: Evaluating top-k queries over web-accessible databases. In: Proceedings of the IEEE International Conference on Data Engineering (2002)
2. Chang, K.C.C., Hwang, S.W.: Minimal probing: Supporting expensive predicates for top-k queries. In: Proceedings of the ACM Conference on Management of Data (2002)
3. Chow, C.Y., Mokbel, M.F., Bao, J., Liu, X.: Query-aware location anonymization for road networks.
GeoInformatica 15(3), 571–607 (2011)
4. Chow, C.Y., Mokbel, M.F., Leong, H.V.: On efficient and scalable support of continuous queries in mobile peer-to-peer environments. IEEE Trans. on Mobile Computing 10(10), 1473–1487 (2011)
5. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms (3rd ed.). MIT Press (2009)
6. Demiryurek, U., Banaei-Kashani, F., Shahabi, C.: Efficient k-nearest neighbor search in time-dependent spatial networks. In: Proceedings of the International Conference on Database and Expert Systems Applications (2010)
7. Demiryurek, U., Banaei-Kashani, F., Shahabi, C.: Towards k-nearest neighbor search in time-dependent spatial network databases. In: International Workshop on Databases in Networked Systems (2010)
8. Fu, T.Y., Peng, W.C., Lee, W.C.: Parallelizing itinerary-based KNN query processing in wireless sensor networks. IEEE Trans. on Knowledge and Data Engineering 22(5), 711–729 (2010)
9. Gedik, B., Liu, L.: MobiEyes: A distributed location monitoring service using moving location queries. IEEE Trans. on Mobile Computing 5(10), 1384–1402 (2006)
10. George, B., Kim, S., Shekhar, S.: Spatio-temporal network databases and routing algorithms: A summary of results. In: Proceedings of the International Symposium on Spatial and Temporal Databases (2007)
11. Google Maps: URL http://maps.google.com
12. Google Maps/Google Earth APIs Terms of Service: URL http://code.google.com/apis/maps/terms.html
13. Hu, H., Xu, J., Lee, D.L.: A generic framework for monitoring continuous spatial queries over moving objects. In: Proceedings of the ACM Conference on Management of Data (2005)
14. Huang, X., Jensen, C.S., Saltenis, S.: The islands approach to nearest neighbor querying in spatial networks. In: Proceedings of the International Symposium on Spatial and Temporal Databases (2005)
15. Jensen, C.S.: Database aspects of location-based services. In: Location-Based Services, pp. 115–148. Morgan Kaufmann (2004)
16.
Kanoulas, E., Du, Y., Xia, T., Zhang, D.: Finding fastest paths on a road network with speed patterns. In: Proceedings of the IEEE International Conference on Data Engineering (2006)
17. Lee, D.L., Zhu, M., Hu, H.: When location-based services meet databases. Mobile Information Systems 1(2), 81–90 (2005)
18. Levandoski, J.J., Mokbel, M.F., Khalefa, M.E.: Preference query evaluation over expensive attributes. In: Proceedings of the International Conference on Information and Knowledge Management (2010)
19. Microsoft Bing Maps: URL http://www.bing.com/maps/
20. Mokbel, M.F., Xiong, X., Aref, W.G.: SINA: scalable incremental processing of continuous queries in spatio-temporal databases. In: Proceedings of the ACM Conference on Management of Data (2004)
21. Mokbel, M.F., Xiong, X., Hammad, M.A., Aref, W.G.: Continuous query processing of spatio-temporal data streams in PLACE. GeoInformatica 9(4), 343–365 (2005)
22. Mouratidis, K., Papadias, D., Hadjieleftheriou, M.: Conceptual partitioning: An efficient method for continuous nearest neighbor monitoring. In: Proceedings of the ACM Conference on Management of Data (2005)
23. Nehme, R.V., Rundensteiner, E.A.: SCUBA: Scalable cluster-based algorithm for evaluating continuous spatio-temporal queries on moving objects. In: Proceedings of the International Conference on Extending Database Technology (2006)
24. Papadias, D., Zhang, J., Mamoulis, N., Tao, Y.: Query processing in spatial network databases. In: Proceedings of the International Conference on Very Large Data Bases (2003)
25. Samet, H., Sankaranarayanan, J., Alborzi, H.: Scalable network distance browsing in spatial databases. In: Proceedings of the ACM Conference on Management of Data (2008)
26. Tanin, E., Harwood, A., Samet, H.: Using a distributed quadtree index in peer-to-peer networks. The International Journal on Very Large Data Bases 16(2), 165–178 (2007)
27. Tao, Y., Papadias, D., Shen, Q.: Continuous nearest neighbor search.
In: Proceedings of the International Conference on Very Large Data Bases (2002)
28. The Google Distance Matrix API (Last visited: May 18, 2012): URL https://developers.google.com/maps/documentation/distancematrix/
29. TIGER/Line Shapefiles 2009 for: Hennepin County, Minnesota: URL http://www2.census.gov/cgi-bin/shapefiles2009/county-files?county=27053
30. Vancea, A., Grossniklaus, M., Norrie, M.C.: Database-driven web mashups. In: IEEE ICWE (2008)
31. Wu, S.H., Chuang, K.T., Chen, C.M., Chen, M.S.: DIKNN: An itinerary-based KNN query processing algorithm for mobile sensor networks. In: Proceedings of the IEEE International Conference on Data Engineering (2007)
32. Yahoo! Maps: URL http://maps.yahoo.com
33. Zhang, D., Chow, C.Y., Li, Q., Zhang, X., Xu, Y.: Efficient evaluation of k-NN queries using spatial mashups. In: Proceedings of the International Symposium on Spatial and Temporal Databases (2011)
34. Zhu, Q., Lee, D.L., Lee, W.C.: Collaborative caching for spatial queries in mobile P2P networks. In: Proceedings of the IEEE International Conference on Data Engineering (2011)