A Generalized (k, m)-Segment Mean Algorithm for Long Term Modeling of Traversable Environments

by Todd Samuel Layton

B.S., Massachusetts Institute of Technology (2013)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Engineering at the Massachusetts Institute of Technology, June 2014.

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 23, 2014
Certified by: Prof. Daniela Rus, Andrew (1956) and Erna Viterbi Professor of EECS, Thesis Supervisor
Accepted by: Prof. Albert R. Meyer, Chairman, Masters of Engineering Thesis Committee

Abstract

We present an efficient algorithm for computing semantic environment models, and activity patterns in terms of those models, from long-term value trajectories defined by sensor data streams. We use an expectation-maximization approach to calculate a locally optimal set of path segments with minimal total error from the given data signal. This process reduces the raw data stream to an approximate semantic representation. The algorithm's speed is greatly improved by the use of lossless coresets during the iterative update step, as they allow operations with otherwise linear runtimes to be performed in constant amortized time. We evaluate the algorithm for two types of data, GPS points and video feature vectors, on several data sets collected from robots and human-directed agents. These experiments demonstrate the algorithm's ability to reliably and quickly produce a model which closely fits its input data, at a speed which is empirically no more than linear in the size of that data set. We analyze several topological maps and representative feature sets produced from these data sets.

Thesis Supervisor: Prof. Daniela Rus
Title: Andrew (1956) and Erna Viterbi Professor of EECS

Acknowledgments

I would like to thank Professor Daniela Rus for finding the kindness in her heart to supervise me. Week after week, she was always willing to help, never lacking advice on how to pursue and improve my research, experimentation, and analysis. I've learned much about how to clearly present and evaluate my ideas, a skill which will be of great value into the future. In addition, I am grateful to the entire Distributed Robotics Laboratory (DRL), for being such a great place to hang my proverbial hat during the long slog. In particular, I would like to thank Danny Feldman for his involvement in the development of this research, as well as Guy Rosman and Mikhail Volkov for their patience throughout the trying task of integrating our respective systems. And finally, I would like to thank my parents, for their incessant weekly support throughout the year.
Especially since they will likely be the only people outside the lab who actually see this thesis.

Contents

1 Introduction
  1.1 Overview
  1.2 Motivation
  1.3 Challenges
  1.4 Contributions
  1.5 Applications
  1.6 Organization
2 Problem Statement
  2.1 Input Structure
  2.2 Output Objective
  2.3 Data Types
3 Related Work
  3.1 SLAM
  3.2 Clustering
  3.3 GPS-Based Mapping
  3.4 Location Recognition
4 Formal Specification
  4.1 Background: Paths and Trajectories
    4.1.1 Segment Constraints
    4.1.2 Path Graphs
  4.2 Data Fitting
    4.2.1 Fitting Costs and Trajectory Means
    4.2.2 The (k, m)-Segment Mean
  4.3 Data Types
    4.3.1 GPS
    4.3.2 Video Features
5 (k, m)-Segment Mean Algorithm
  5.1 k-Partition Initialization
    5.1.1 Applying for Non-Linear Segment Models
    5.1.2 Selecting the k Parameter
  5.2 m-Clustering Initialization
    5.2.1 Selecting the m Parameter
  5.3 Segment m-Set Construction
  5.4 Expectation-Maximization Iteration
    5.4.1 Expectation: k-Partition and m-Clustering Improvement
    5.4.2 Maximization: Segment m-Set Reoptimization
    5.4.3 Termination
6 Optimization
  6.1 Fitting Cost Coresets
    6.1.1 GPS Coresets
    6.1.2 Feature Vector Coreset
  6.2 RDP Partition Initialization
  6.3 K-Means Clustering Initialization
  6.4 Path Segment Calculation
  6.5 2-Segment Trajectory Update
7 Experimental Evaluation
  7.1 Experimental Setup
    7.1.1 Datasets
    7.1.2 Processing Environment
    7.1.3 Proportional Fitting Cost
  7.2 Results
    7.2.1 Accuracy
    7.2.2 Speed
    7.2.3 Selected Parameters
    7.2.4 Sample Results
8 Conclusion

List of Figures

1-1 Left: A geographic view of a GPS point signal with a closely-fitted trajectory. Right: The same signal and trajectory, progressing up through time, along with the underlying topological map. Note that the trajectory and the map are geographically equivalent, indicating the high degree of path repetition present in the trajectory.

2-1 Left: A GPS point signal and the path segments of an approximate (k, m)-segment mean, viewed geographically. Right: The same signal and the same (k, m)-segment mean's trajectory segments, viewed progressing up through time, numbered by their place in the k-partition and color-matched to the underlying path segments. See Chapter 4 for a formal definition of this terminology.

7-1 A conceptual layout of the region traversed by the individual as recorded in the short phone video. In terms of this graph, the individual's trajectory would be labeled as ABCDBAEDCEBCA.

7-2 A conceptual layout of the region traversed by the individual as recorded in the long phone video. In terms of this graph, the individual's trajectory would be labeled as ABCABCABDABCABDABCABCABCABDABCA.

7-3 The proportional fitting cost drops off, at first quickly and then more slowly, as the map size increases. Each data set has a characteristic curve relation to m. Note that the feature vector data sets have noticeably higher proportional costs than the GPS data sets. This could be because their much larger dimensionality introduces greater structural cost into the data, or it could indicate that the 'natural' partition size for those sets is greater than k = 300.

7-4 The algorithm's run time per point varies significantly around its average relation to m. While some data sets' run times generally increase relative to m, others' are independent of it, or even decrease as it increases. This is likely due to the behavior of the EM loop: increasing m might cause the trajectory to be initialized closer to its local minimum, reducing the number of EM iterations needed, and therefore the total run time. Given these run times, the system could easily process data streams of up to 5 Hz in real time.

7-5 Two geographic maps of the ground robot data set, with the (300, 20)- and (300, 200)-segment trajectory outputs of the algorithm, respectively. With far fewer path segments to utilize, the m = 20 trajectory's fit to the GPS points is much rougher, compared to the m = 200 trajectory's close fit.

7-6 Two geographic maps of the quadrotor robot data set, with the (300, 20)- and (300, 200)-segment trajectory outputs of the algorithm, respectively.
Since this data set contains a low degree of actual path repetition, the produced trajectories tend to simply align themselves to the GPS points.

7-7 Two geographic maps of the personal smartphone data set, with the (300, 20)- and (300, 200)-segment trajectory outputs of the algorithm, respectively. Because of the extended discontinuities in the data set, the algorithm has made a best-effort attempt to bridge these gaps. As a result, some parts of the map traverse areas lacking any input points. It may be valuable to 'repair' such signal gaps using data patching, as described in [18].

7-8 Two clustering maps of the short phone video data set. Because this video contains only a few repeated loops, the skew of the transition strength towards the most prominent clusters is relatively low, especially when a large number of clusters are allowed. Despite this, the qualitative appearances of the clusters' representative frames are widely varied, even amongst the most prominent clusters. In the left map, the most prominent cluster (cluster 1) occurs 92 times, while the least prominent (cluster 20) only occurs once. In the right map, the most prominent cluster occurs 7 times, while the least prominent (cluster 200) still only occurs once.

7-9 Two clustering maps of the long phone video data set. Unlike the short video, this video contains a larger number of repeated loops, and so the transitions' strengths tend to skew significantly towards the most prominent clusters. Even at m = 200, the green lines are noticeably thicker and denser around the top-right region of the map, where the larger blue circles are arranged. In the left map, the most prominent cluster (cluster 1) occurs 66 times, while the least prominent (cluster 20) only occurs once. In the right map, the most prominent cluster occurs 8 times, while the least prominent (cluster 200) still only occurs once.

List of Tables

7.1 If the parameters k and m are not provided as input, the algorithm attempts to select good values for them as part of the trajectory initialization. For each experimental data set, it selects a characteristic (k, m) pair. Deviation of these selected values from the ground-truth parameters is primarily a result of the imprecision of the initial assumptions made about the data, such as the linear or constant structure of the path segments. In particular, note that the GPS data sets tend to have low parameter values relative to their ground truths, while the feature vector sets tend to have high values.

List of Algorithms

1 Approximate (k, m)-Segment Mean EM Algorithm
2 RDP-Based Partition Initialization Algorithm
3 K-Means Section Clustering Algorithm
4 Optimal Path Segment Set Construction Algorithm
5 2-Segment Partition and Clustering Update

Chapter 1

Introduction

1.1 Overview

This thesis describes an algorithm for efficiently creating semantic event representations of raw sensor streams, which capture underlying patterns of the stream in the form of an activity map.
We intend for this algorithm to be applicable to the creation of life-logging systems, which can track and summarize a person's activities using sensor data over an extended period of time. The simplest example of such a system would be one which uses a GPS trace from a user's smartphone to develop a topological map of their daily routine. More complex implementations would have the ability to analyze this data to construct textual descriptions of the user's patterns of movement.

Towards the goal of developing such a system, we present an algorithm capable of extracting semantic activities from GPS or other sensor data. Specifically, the algorithm computes a sequence of instances of semantic activities (in the form of trajectory segments) corresponding to a data stream produced by a mobile source over an extended period of time. During this time, if the source's data trajectory contains a significant number of repeated representative paths, each path can be interpreted as a distinct activity. The algorithm identifies and aggregates these repetitions, as shown in Figure 1-1. The source can be a robot with autonomous movement, or a machine under external human direction, such as a personal smartphone.

The algorithm takes as input formatted sensor data, such as GPS points or video feature vectors, and computes both a semantic path graph and a trajectory through that graph, in the respective forms of an activity set and a sequence within that set. It is able to efficiently process very large data streams (inputs in the tens of thousands of points have been tested with no difficulty) by computing trajectory coresets [7]. (Coresets are a strategy for improving the computation runtime of an operation by preprocessing the inputs into a reduced form, the 'coreset'; see Section 6.1.) Applications of this semantic representation algorithm include autonomous mapping, surveillance, and activity recognition.

In this thesis, we describe a technique for developing an activity representation of GPS or video feature data in the form of an activity graph, present an algorithm for producing these graphs quickly and with consistently low error, and show the experimental results of applying this algorithm to several data sets of varied size and origin. The input consists of a GPS stream (a series of geographic coordinates) or a feature vector stream (a series of feature-to-frame correlation weights) with accompanying timestamps. Despite uncertainty in the data, both from error in the data measurements and from the imprecise nature of semantic activity identification in the real world, the algorithm aims to identify the repeated patterns of representative paths that appear in the source's trajectory over an extended period of time. For example, the GPS stream produced by a car moving around a city would provide a partial road map of the city, and that of a robot in long-term operation would reveal its common movement paths. Identifying the best path graph under these constraints, via the simultaneous division of the data stream into simple trajectories and clustering of those trajectories into shared paths, is referred to as the (k, m)-segment mean problem.

1.2 Motivation

With recent advances in device technology, modern society now produces a surfeit of data of all types. Thus, research focus has turned to the field of 'big-data' analysis, the endeavor to extract meaningful results from these large-scale data sets.
Due to the inherent difficulty of processing large input sets, this field is still largely in its infancy, and most big-data systems are designed to solve relatively straightforward, 'number-crunching' mathematical problems. For analysis requiring a more complex, nuanced approach, human involvement is often required, introducing variability and significant slowdown into what could be an automatic process.

Figure 1-1: Left: A geographic view of a GPS point signal with a closely-fitted trajectory. Right: The same signal and trajectory, progressing up through time, along with the underlying topological map. Note that the trajectory and the map are geographically equivalent, indicating the high degree of path repetition present in the trajectory.

For example, the current standard approach to creating topological maps, such as that used to produce Google Maps, relies primarily on information obtained through geographical surveys. Such surveys are expensive and infrequent, and so the data they provide can easily become outdated as the result of changes to the surveyed environment, for example by construction and development. Alternatively, collaborative approaches such as OpenStreetMap can be more quickly amended and updated, but are dependent on the highly variable availability and quality of crowdsourced data, possibly impacting their accuracy and coverage. Furthermore, these approaches both focus on roads and other outdoor paths, neglecting building interiors and other indoor spaces, for which data collection can be greatly complicated by privacy and ownership issues.

This thesis aims to describe an approach to modeling such problems in mathematical form and to demonstrate an algorithmic approach to solving them. Although several assumptions and approximations are necessarily made in reducing the conceptual problem to a quantitative objective, this algorithm shows how complex structural content can be extracted from large data sets in a manageable processing time.

1.3 Challenges

Satisfying constraints. Algorithmically developing a semantic model under any non-negligible constraints, such as those involved in identifying patterns of repetition, can be challenging. Describing a progression of values through time as a single smooth function is infeasible, so the model must be constructed from some number of piecewise components. The standard approach, using dynamic programming, is asymptotically too slow for effectively handling large data sets. Furthermore, as these values are constrained to match an underlying representative model with a smaller number of elements, each component of the value trajectory must be optimized not individually, but in conjunction with all other pieces that traverse the same representative section. These constraints are necessary in order to encourage the expression of the source's patterns of behavior in the structure of the resulting model.

Managing data noise. Determining an agent's location without first deriving its relative position presents a very different set of challenges than using relative-position sensor data.
Agent-local data streams, such as vision or rangefinding, are more complex and tend to have relatively low inherent error, but accumulate inaccuracy over time. Long-term SLAM techniques based on those forms of data, such as those described in [9] and [4], must contend primarily with this drift, by applying loop closing or other correctional methods, which have high computation times. Conversely, global location data, such as GPS, and non-geographic characterization data, such as video feature vectors, have comparatively high error, but error which does not necessarily grow over time. Therefore, a GPS- or feature-based environment modeling algorithm must be able to robustly handle significant error throughout the entire data stream. As a result, this semantic algorithm is less accurate than a SLAM approach, but also much faster and with more consistently bounded error over long operation times.

Generalizing across data types. This system is intended to implement the (k, m)-segment mean algorithm without restricting its application to a single type of data, such as GPS or feature vectors. In order to achieve this, the generalized operation framework of the algorithm (the overarching EM process) must be properly distinguished from logic specific to the data type being processed (input structure, error model, etc.). This requires that each step of the algorithm be implemented in a sufficiently abstract manner, so as to allow it to potentially handle data of an arbitrary type.

1.4 Contributions

This thesis contributes the following theoretical, systemic, and experimental results.

Trajectory and model creation. A time-efficient (linear in the input signal size) algorithm for recognizing repetitions in an agent's trajectory and constructing a semantic model of its behavior, as exemplified in Figure 1-1, by finding a locally optimal approximate solution to the (k, m)-segment mean problem, for a GPS point set or a video feature vector set.

Algorithm implementation. A practical system implementing this algorithm, taking as input a GPS or video feature data set and producing a well-fitted trajectory and underlying representative model.

Experimental evaluation. A statistical analysis of the algorithm's behavior when applied to several different GPS and feature data input sets, numerically demonstrating its accuracy and speed, as well as a selection of constructed geographic maps and representative feature sets, highlighting several qualitative features of the algorithm and its output.

1.5 Applications

The work described in this thesis can be gainfully applied to problems in the following areas.

Map construction. For a system operating long-term in an unknown environment, developing a model of that environment is a foremost priority. By constructing and applying a semantic data space, such as the geographic region map produced by the (k, m)-segment mean algorithm with GPS input, the system can interpret sensor information in a more meaningful context than in its raw form. This approach is especially valuable when the data are noisy or their native format is particularly opaque. Furthermore, by aggregating data from multiple agents within a region, the constructed map's accuracy and coverage can be drastically improved [12].

Pattern recognition. By clustering discrete segments of the data stream according to their structure, this algorithm creates a cluster signal, the sequence of cluster assignment labels through time.
Intelligently evaluating this signal to identify repeated subsequences can reveal semantic patterns underlying the original input signal. These patterns can then be compared to new data as the stream grows in order to predict the source's behavior [19].

Semantic compression. As elaborated in [14] and [18], transforming a data set into an activity-based semantic representation can massively improve its compressibility, resulting in significant savings in data storage space. Moreover, because such semantic compression operates on a completely different basis than traditional text-based compression, the two can be used in conjunction to reduce the compressed data's size even further.

1.6 Organization

Chapter 2 states the (k, m)-segment mean problem, laying out the inputs it receives and the intended results it should provide. Chapter 3 discusses existing works related to this problem and its various aspects, both in general and for particular types of data. Chapter 4 formally defines the (k, m)-segment mean and other terminology involved in this system. Chapter 5 describes the structure, operation, and behavior of this approximate (k, m)-segment mean algorithm. Chapter 6 elaborates the performance optimizations and runtime improvements of the algorithm's implementation. Chapter 7 presents and analyzes experimental results of this implementation's application to various data sets. Chapter 8 concludes this thesis.

Chapter 2

Problem Statement

In this chapter we describe the (k, m)-segment mean algorithm. This algorithm takes as input a sensor data stream and recognizes repeating patterns in that stream. The patterns correspond to a (k, m)-segmentation of the data.

2.1 Input Structure

The primary input of the (k, m)-segment mean algorithm is the data stream to be processed. This consists of a sequence of data values, timestamped and ordered by time. Additionally, the algorithm may receive settings for one or both of its parameters: k, the number of sections into which to partition the data; and m, the number of underlying representative elements to which to assign those sections. If either of these parameters is not specified in the input, the algorithm performs a parameter finding process as part of its initialization step.

2.2 Output Objective

The (k, m)-segment mean algorithm should produce three data structures: a partition of the input data stream into k sections, a collection of m representative semantic elements, and a clustering assignment of those k sections to those m elements. Together, these products should describe a high-accuracy approximation of the (k, m)-segment mean of the input data.

2.3 Data Types

The (k, m)-segment mean algorithm aims to be sufficiently general as to be easily extensible to any arbitrary type of data. In our implementation, we focused on two particular data types which help illustrate this principle: GPS and video features. For GPS, the original motivational application of this algorithm, the input is a GPS trace of an agent's trajectory, and the resulting representative model is a topological map traversed by that trajectory. For video features, the raw data is a video stream of an agent's viewpoint as it traverses the environment, but the algorithm's input is a stream of feature vectors, produced by applying image feature recognition techniques to that video stream. The resulting model is a set of feature vectors, each representing a local region of the environment as perceived by the agent through the video.
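To make these input and output structures concrete, the following is a minimal sketch in Python; the type and field names are our own illustration, not part of this thesis's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Input: a data stream of timestamp-value pairs in non-decreasing time order.
# Each value is a point in R^2 for GPS, or a nonnegative vector in R^d for
# video features (Section 2.3).
Stream = List[Tuple[float, List[float]]]

@dataclass
class KMSegmentMean:
    # A hypothetical container for the algorithm's three outputs (Section 2.2).
    boundaries: List[int]   # k+1 stream indices delimiting the k sections
    assignment: List[int]   # for each of the k sections, its cluster in 0..m-1
    segments: List[object]  # the m representative semantic elements
```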
Figure 2-1: Left: A GPS point signal and the path segments of an approximate (k, m)-segment mean, viewed geographically. Right: The same signal and the same (k, m)-segment mean's trajectory segments, viewed progressing up through time, numbered by their place in the k-partition and color-matched to the underlying path segments. See Chapter 4 for a formal definition of this terminology.

Chapter 3

Related Work

3.1 SLAM

SLAM (simultaneous localization and mapping) describes the class of problems wherein an agent must jointly construct an understanding of its surrounding environment (mapping) and of its position within that environment (localization), from the noisy data produced by its sensors. The development of robust, generalized SLAM algorithms is an ongoing area of research [6]. While the (k, m)-segment mean problem can generally be considered a type of SLAM, it differs in several significant ways from the sort of SLAM systems that are the focus of most research.

'Traditional' SLAM techniques are based on data produced by local sensors, whose measurements are relative to the pose of the agent which hosts them [5]. These sensors, such as cameras and range-finders, tend to have high precision and low error in their actual data. However, deriving a global pose from local data requires the aggregation of those data, and so error tends to accumulate over time. For this reason, one of the main challenges in most SLAM applications is loop closing, determining when the agent has returned to a previously-visited position.

The (k, m)-segment mean, in contrast, does not attempt to integrate sensor data in this way. Instead, it uses the inherent information in its input data (such as global position for GPS, or qualitative environment description for detected video features) to build a description of its surroundings in terms of the same characteristics by which the data describe them. The (k, m)-segment mean algorithm thus does not suffer from the loop closing problem. However, while the error in its sensor data does not accumulate over time, that data does tend to natively have higher error and lower precision than the kinds used in existing SLAM systems. Therefore, whereas most SLAM applications must first and foremost contend with the difficulty of building a self-consistent global environment model, the (k, m)-segment mean's primary challenge is discerning the environment's underlying structure from its noisy input data.

3.2 Clustering

The process of divvying up the input stream's points by their mutual fit is a form of clustering. Clustering is a well-studied domain, including problems with notable parallels to ours, such as projective clustering [1]. However, viewed as a clustering problem, the (k, m)-segment mean construction objective has several distinct challenges. It requires multiple layers of clustering, first of data values into trajectory sections, and then of those into semantic paths. Furthermore, because each trajectory segment must span a continuous, unbroken range of the input stream, the time order of the values is an additional constraint on a cluster assignment's validity.
3.3 GPS-Based Mapping

Extracting geographic and topological maps from GPS data, which is the application of the (k, m)-segment mean algorithm to geographic input, is not a new idea. A variety of algorithmic approaches to this problem have been explored [2]. Most use a clustering strategy similar to ours, though some use different methods, such as trace merging [3] or kernel density estimation [16], in order to construct a path representation from input streams.

3.4 Location Recognition

Similarly, analyzing video data to identify distinct locations or scenes has been attempted using a variety of approaches, each with its own strengths and weaknesses. Some, such as [10], use existing comprehensive databases in order to globally identify the view location. Others, such as [11], focus on performing deep image analysis to extract the semantic structure of the video, in order to facilitate inter-data comparisons. Some approaches specifically focus on the clustering problem [15] as ours does, though they often still incorporate a specific image feature analysis model into their processing. Our approach instead aims to be as general as possible. It makes only the most basic assumptions about the semantic meaning of the input feature data, allowing for its specific clustering behavior to be defined by the intentions of the vector quantization algorithm which preprocesses the video.

Chapter 4

Formal Specification

The following sections precisely define the terminology used in this thesis to describe the mathematical structures involved in the algorithm's operation, building up to a formal specification of the (k, m)-segment mean.

4.1 Background: Paths and Trajectories

Definition 1 (trajectory). A trajectory is a function $\vec f : T \to \mathbb{R}^d$, where $T$ is a subinterval of $\mathbb{R}$. $\vec f(t)$ represents the value of the trajectory at time $t$.

Definition 2 (path). A path is a function $\vec f : T \to \mathbb{R}^d$, where $T$ is a subinterval of $\mathbb{R}$. Unlike a trajectory, however, $\vec f(\tau)$ is parametric, and so does not give any particular significance to $\tau$. The 'velocity' $d\vec f/d\tau$ is not meaningful, as $\vec f(g(\tau'))$ is considered the same path regardless of the choice of monotonic non-decreasing function $g : T' \to T$.

Both paths and trajectories can be thought of as traces of an agent's progression through the value space $\mathbb{R}^d$, where a trajectory includes an explicit measure of the passage of time, but a path does not. It is easy, then, to see how any trajectory can be transformed into a path, by simply removing its time component. Inversely, any path can be expanded into an infinite number of possible trajectories, by applying to it an arbitrary time component. Note that these definitions alone do not include any constraints on the continuity, differentiability, or other forms of 'neatness' of paths and trajectories. However, in order to practically construct and manipulate them, some reasonable assumptions will need to be made.

4.1.1 Segment Constraints

Definition 3 (k-segment function). A k-segment function is a function $\vec f : T \to \mathbb{R}^d$, where $T$ is a subinterval of $\mathbb{R}$, which is smooth on all intervals $T_i$, $i = 1 \to k$, where $\{T_i\}$ is a partition of $T$. (Unlike the usual mathematical definition, this thesis allows partitions to contain empty intervals; thus, every k-partition is also a (k + c)-partition for any positive integer $c$.) Each $T_i$ is called a section of the k-partition, and each subfunction $\vec f_i(t) : T_i \to \mathbb{R}^d$ is called a segment.

Further constraints can be placed on a k-segment function by restricting its segments to a certain mathematical category, such as constants, lines, parabolae, etc.
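As a minimal illustration of Definition 3, the sketch below (ours; the thesis does not prescribe this representation) evaluates a k-segment trajectory whose segments are line segments, the constraint adopted for GPS data in the next paragraph.

```python
import bisect

def eval_k_segment(section_starts, segments, t):
    """Evaluate a k-segment trajectory with line segments at time t.

    section_starts: sorted start times [T_1, ..., T_k] of the k-partition
                    (assumes section_starts[0] <= t)
    segments:       per section, ((t_s, v_s), (t_e, v_e)) giving the segment's
                    endpoint times and endpoint values (vectors)
    """
    i = bisect.bisect_right(section_starts, t) - 1      # section containing t
    (t_s, v_s), (t_e, v_e) = segments[i]
    a = 0.0 if t_e == t_s else (t - t_s) / (t_e - t_s)  # interpolation weight
    return [(1 - a) * s + a * e for s, e in zip(v_s, v_e)]
```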
Henceforth, we will only consider k-segment trajectories and paths, constrained to line segments for GPS data and to constant segments for video feature data.

4.1.2 Path Graphs

Definition 4 (path graph). A path graph is a graph in which each vertex is associated with a path.

In particular, we constrain the paths associated with a path graph's vertices to be 1-segment paths. A k-segment path, and thus a k-segment trajectory, can be considered a sequence of 1-segment paths, so a directed path graph can be constructed from the adjacencies of segments in that sequence. A path graph produced this way is not particularly meaningful unless there are segments which are repeated in the path. This system is intended to work with paths which contain a significant degree of segment repetition. This mapping of the trajectory to a path graph constitutes an m-clustering of the trajectory segments to path segments, where m is the number of path segments (i.e., vertices) in the graph. The edges of the graph can be assigned weights representing the proportional occurrence of that transition in the k-segment path, resulting in a Markov process describing the probabilistic behavior of the agent whose activity the path describes.

4.2 Data Fitting

Definition 5 (data stream). A data stream $S = (T, \vec V) = [(t_i \in \mathbb{R}, \vec v_i \in \mathbb{R}^d) : i = 1 \to n]$ is a sequence of timestamp-value pairs, in non-decreasing order by timestamp.

A data stream can be thought of as a possibly-noisy sampling of a trajectory. It represents the actual measured sensor data which is available as input to the system. The system's goal is to reconstruct (a close approximation of) the original trajectory from a data stream, by finding the trajectory which best 'fits' the stream.

4.2.1 Fitting Costs and Trajectory Means

Definition 6 (error model). An error model is a function $\mathrm{Err} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}$, representing the likelihood error between a model value (the prior) and a data value (the posterior).

To be a useful metric, an error model $\mathrm{Err}(\vec x|\vec\lambda)$ should have three particular properties.

Property (additive joining). For all $\vec X = \{\vec x_1, \ldots, \vec x_n\}$ and all $\vec\lambda$, $\mathrm{Err}(\vec X|\vec\lambda) = \sum_{i=1}^{n} \mathrm{Err}(\vec x_i|\vec\lambda)$.

Property (likelihood maximization). For all $\vec X = \{\vec x_1, \ldots, \vec x_n\}$, $\operatorname{argmin}_{\vec\lambda} \mathrm{Err}(\vec X|\vec\lambda) = \operatorname{argmax}_{\vec\lambda} L(\vec\lambda|\vec X) = \operatorname{argmax}_{\vec\lambda} p(\vec X|\vec\lambda)$.

Property (null optimality). For all $\vec x$, $\min_{\vec\lambda} \mathrm{Err}(\vec x|\vec\lambda) = 0$.

Additive joining declares that the error from multiple points can be aggregated by sum. Likelihood maximization declares that error corresponds to probability in that their optima occur at the same model value. Null optimality guarantees that the error of an individual data value can always reach zero, removing the data's structural likelihood from consideration.

Definition 7 (fitting cost). Given a trajectory $\vec f$, a timestamp-value pair $(t, \vec v)$, where $t$ is in the time range of $\vec f$ and $\vec v$ is of the same data type as $\vec f$, and an error model $\mathrm{Err}(\vec v|\vec\lambda)$, the fitting cost of $\vec f$ to $(t, \vec v)$, $C(t, \vec v|\vec f)$, is equal to $\mathrm{Err}(\vec v|\vec f(t))$. Given a data stream $S = (T, \vec V)$ instead of a single timestamp-value pair, the fitting cost of $\vec f$ to all of $S$, $C(S|\vec f)$, is equal to the sum of its fitting costs to each pair in $S$.

The fitting cost describes how well-matched a trajectory is to a data stream, given an error model applicable to the data type.
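As a concrete instance of these definitions, the sketch below (ours, in Python) implements the squared-distance error model used for GPS data in Section 4.3.1 and the resulting fitting cost; the comments note how the three properties show up.

```python
import numpy as np

def err(x, lam):
    # Squared-distance error model for GPS values (Section 4.3.1).
    # Null optimality: err(x, x) == 0 for any single data value x.
    return float(np.sum((np.asarray(x) - np.asarray(lam)) ** 2))

def fitting_cost(stream, f):
    # Fitting cost of a trajectory f to a data stream (Definition 7).
    # Additive joining: the total is the sum of per-point errors.
    return sum(err(v, f(t)) for t, v in stream)

# Likelihood maximization: for this model, the error-minimizing lam for a set
# of values is their mean, which coincides with the Gaussian MLE.
```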
A path segment can similarly be said to have a fitting cost, equivalent to the fitting cost of a trajectory segment which projects to that path segment and which spans the data stream's time range with some 'reasonable' speed function. In this thesis, we constrain that speed function to be constant.

Definition 8 (trajectory mean). Given a data stream $S$, the trajectory mean $\vec f^*$ of $S$ is the trajectory with the lowest fitting cost to $S$: $\vec f^* = \operatorname{argmin}_{\vec f} C(S|\vec f)$.

Of course, the search space of all possible trajectories whose time ranges contain $S$ is infeasibly large, so it is reasonable to constrain the optimization problem to trajectories of a certain structure.

4.2.2 The (k, m)-Segment Mean

Definition 9 ((k, m)-segment trajectory). A (k, m)-segment trajectory is a k-segment trajectory which can be reduced to a path graph with m vertices.

In other words, a (k, m)-segment trajectory contains k trajectory segments, but only m path segments, which are repeated so as to bridge the difference between those two parameters.

Definition 10 ((k, m)-segment mean). Given a data stream $S$, the (k, m)-segment mean is the trajectory mean of $S$ among the category of (k, m)-segment trajectories.

The ultimate objective of this system is to find the (k, m)-segment mean of the input stream, the (k, m)-segment trajectory best fitted to that stream. This trajectory corresponds to a hidden Markov model underlying the input data, obscured by sampling noise.

4.3 Data Types

The data-type-specific components of the algorithm must also be clearly defined, and specified for the particular types which are implemented and used by this system.

4.3.1 GPS

A GPS point is a geometric value in $\mathbb{R}^2$. It uses the geometric (squared distance) error model, $\mathrm{Err}(\vec x|\vec\lambda) = \|\vec x - \vec\lambda\|^2$. The 1-segment mean of a GPS data stream $S = (T, \vec V)$ can thus be found by solving $\operatorname{argmin}_{\vec f} \sum_{i=1}^{n} \|\vec v_i - \vec f(t_i)\|^2$.

4.3.2 Video Features

Each frame in a video stream is represented by a set of recognized features. Each frame's feature vector is a non-geometric value in $(\mathbb{R}_{\geq 0})^d$, where $d$ is the number of distinct features found throughout the video.

Error Model

Feature data is patently non-geometric, not least because its support does not encompass all of $\mathbb{R}^d$. It thus requires a non-geometric error model to evaluate comparisons between a model feature vector and an element of a data stream. Conceptually, we need a distribution function to describe image feature data. This distribution should reflect the idea of a feature value as the relative 'strength' of that feature in the image, without relying on the specific definition or method of calculation of the feature values. To this end, we choose to base the error model on a Poisson distribution, $\pi(x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$, conceptualizing a feature value as the 'count', or number of instances, of that feature in the image. However, as feature data is not limited to integers, it must be thought of as an expected or approximate count. This necessitates the adaptation of the Poisson distribution (natively defined only for integers) to the continuous support $\mathbb{R}_{\geq 0}$. This idea of a 'continuous Poisson distribution' often arises in attempts to model occurrence data, but a single precise definition of this distribution is deceptively difficult to discern. [8] presents one definition, derived from the Poisson distribution's CDF:
$$\Pi(x|\lambda) = \frac{\Gamma(\lfloor x+1 \rfloor, \lambda)}{\Gamma(\lfloor x+1 \rfloor)}$$

However, this approach is not viable for our modeling purposes, most obviously because its continuous Poisson distribution has $\tilde\pi(0|\lambda > 0) = 0$, meaning that a zero feature value could never be produced by a nonzero model segment.

Instead, this system uses an alternate approach based on a non-uniform prior distribution of $\lambda$. Consider the function $f(x|\lambda) = \frac{\lambda^x e^{-\lambda}}{\Gamma(x+1)}$, which is the extension of the discrete Poisson distribution $\pi(x|\lambda)$ into the continuous domain. (Note that $f(x|\lambda)$ has a hole discontinuity at $x = 0$, $\lambda = 0$. This hole can be 'plugged' by finding the function's limit, $f(0|0) = 1$, using l'Hôpital's rule.) This function is not normalized over $x$, which can be remedied by the addition of a normalization factor, resulting in the distribution $\tilde\pi(x|\lambda) = \frac{f(x|\lambda)}{K_\lambda}$, where $K_\lambda = \int_0^\infty f(x|\lambda)\,dx$. However, it is not feasible to include $K_\lambda$ in an optimization operation, so we include the prior $p(\lambda) = \frac{K_\lambda}{K}$, where $K = \int_0^\infty K_\lambda\,d\lambda$. This results in the joint distribution $\tilde\pi(x, \lambda) = \frac{f(x|\lambda)}{K}$. By the definition of the $\Gamma$-function, $K$ is known to be infinite, which would also be infeasible, except that this term can be extracted from consideration, as will be shown.

Consider the negative logarithmic probability of this distribution, $-\log \tilde\pi(x, \lambda) = \lambda - x\log\lambda + \log\Gamma(x+1) + \log K$. By the mathematical properties of the logarithm, this term fulfills the first two requirements for an error model, additive joining and likelihood maximization. The third requirement, null optimality, is not fulfilled, since at the optimal model value $-\log \tilde\pi(x, x) = x - x\log x + \log\Gamma(x+1) + \log K > 0$. Instead, define the error function:

$$\mathrm{Err}(x, \lambda) = -\log \tilde\pi(x, \lambda) + \log \tilde\pi(x, x) = (\lambda - x) - x(\log\lambda - \log x)$$

Because this differs from the negative logarithmic probability only by an additive term that is independent of $\lambda$, the first two requirements are preserved, and the error function also achieves the third property, while also canceling $K$ out of the function. This is the error model which will be used for each dimension of the image feature data.

In addition to an explicit model of error given the data and model values, it is also necessary to be able to calculate the optimal model value given a set of data values.

Theorem 1. Given a video feature vector data set $\vec X = (\vec X_1, \ldots, \vec X_n)$, the model element which minimizes the total error is $\vec\lambda^* = \operatorname{argmin}_{\vec\lambda} \mathrm{Err}(\vec X, \vec\lambda) = E[\vec X]$.

Proof. Consider a single dimension of a feature data set and the corresponding model value.

$$\frac{d\,\mathrm{Err}(X, \lambda^*)}{d\lambda^*} = \frac{d}{d\lambda^*} \sum_{i=1}^{n} \left((\lambda^* - X_i) - X_i(\log\lambda^* - \log X_i)\right) = 0$$

$$\sum_{i=1}^{n} \left(1 - \frac{X_i}{\lambda^*}\right) = 0$$

$$\lambda^* = \frac{1}{n} \sum_{i=1}^{n} X_i = E[X]$$

The model value can be optimized by simply setting it to the average of the data values. By extension, since the error model operates independently on each dimension, the optimal multi-dimensional model element is $\vec\lambda^* = E[\vec X]$, the average of the feature vector data set. ∎

Production and Preprocessing

Because the (k, m)-segment mean algorithm takes feature vector data as its input rather than a video frame sequence, it is first necessary to use a separate feature recognition system to produce feature data from raw video. This data type and its error model are applicable to any kind of feature vector which purports to signify expected 'counts' of features in frames. To produce such data, this thesis relies on the system described in [13].
That process uses an adaptive vector quantization across the entire video sequence to produce a high-dimensional (d = 5000) bag-of-words feature data set, and then applies median filtering to compress this data down to a low-dimensional (d = 300) representation. The resulting feature data is provided as input into the (k, m)-segment mean algorithm.

Chapter 5

(k, m)-Segment Mean Algorithm

ApproxKMSegmentMean (Algorithm 1) is a local-approximation, expectation-maximization algorithm for solving the (k, m)-segment mean problem, as defined in Chapter 2, using input as specified there. Given a data stream $S = (T, \vec V)$, it initializes a reasonable first-guess (k, m)-segment trajectory, then repeatedly alternates between locally improving the k-partition-and-m-clustering and re-optimizing the m path segments. It terminates once a locally cost-minimal solution is reached, at which point the patterns of movement in the original data should have sufficiently coalesced so as to be expressed in the resulting semantic map. A sketch of this loop in ordinary code follows the listing.

Algorithm 1 Approximate (k, m)-Segment Mean EM Algorithm
1: procedure ApproxKMSegmentMean(S, k?, m?)
2:   initialize k-partition part with RDPPartition(S, k?)
3:   initialize m-clustering cluster with KMeansCluster(S, part, m?)
4:   construct segment m-set segments with OptimalSegments(S, part, cluster)
5:   loop
6:     TwoSegmentUpdate(S, part, segments) with odd boundaries fixed
7:     TwoSegmentUpdate(S, part, segments) with even boundaries fixed
8:     reconstruct segments with OptimalSegments(S, part, cluster)
9:     evaluate total fitting cost cost
10:    if this cost is no less than the previous cost then
11:      return the (k, m)-segment trajectory {part, cluster, segments, cost}
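For orientation, here is a minimal Python skeleton of the EM loop above; it is a sketch only, and the helper functions stand in for the subroutines developed in Sections 5.1 through 5.4, with signatures of our own invention.

```python
def approx_km_segment_mean(S, k=None, m=None):
    """Skeleton of Algorithm 1 (a sketch, not the thesis's implementation)."""
    part = rdp_partition(S, k)                     # Section 5.1: initial k-partition
    cluster = kmeans_cluster(S, part, m)           # Section 5.2: initial m-clustering
    segments = optimal_segments(S, part, cluster)  # Section 5.3: initial path segments

    prev_cost = float("inf")
    while True:
        # Expectation (Section 5.4.1): improve boundaries and assignments
        # in place, with the path segment set held fixed.
        two_segment_update(S, part, cluster, segments, fixed_boundaries="odd")
        two_segment_update(S, part, cluster, segments, fixed_boundaries="even")

        # Maximization (Section 5.4.2): refit the m path segments.
        segments = optimal_segments(S, part, cluster)

        cost = total_fitting_cost(S, part, cluster, segments)
        if cost >= prev_cost:                      # Section 5.4.3: termination
            return part, cluster, segments, cost
        prev_cost = cost
```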
5.1 k-Partition Initialization

A modified version of the Ramer-Douglas-Peucker algorithm, RDPPartition (Algorithm 2), is used to partition the input stream into k complete, non-overlapping sections, the k-partition of the initial (k, m)-segment trajectory.

Algorithm 2 RDP-Based Partition Initialization Algorithm
1: procedure RDPPartition(S, k?)
2:   initialize part as a size-|S| 1-partition
3:   for i = 2 → |S| do
4:     for all data points (x, t) ∈ S do
5:       let L be the spanning segment of the section of part containing (x, t)
6:       calculate the fitting cost from L to (x, t)
7:     find the pair ((x_i, t_i), (x_{i+1}, t_{i+1})) with the greatest combined fitting cost
8:     insert a boundary between these points into part    ▷ part is now an i-partition
9:     if input k is defined then
10:      if i = k then
11:        return part
12:    else
13:      calculate S's optimal i-trajectory with i-partition part
14:      record the fitting cost cost_i of this i-trajectory to S
15:      record the current part as part_i
16:  find the elbow of the (i, cost_i) curve
17:  select k to be the i found this way
18:  return part_k

There are three key modifications from typical RDP:

• The standard RDP algorithm uses a purely geometric definition of the 'distance' cost between the model segment and the input values. In the modified version, at line 6, the stream's timestamps are used, in conjunction with its data type's error model, to calculate the per-point fitting cost as the distance instead.

• The standard RDP algorithm uses each value's individual cost to its containing subsequence's spanning segment (the line segment from the subsequence's first value to its last value) as the metric for determining whether to split that subsequence at that value. In the modified version, since the (k, m)-segment mean allows discontinuities between consecutive segments, the metric, at line 7, is instead the combined costs of each consecutive pair of values within a subsignal.

• The standard RDP algorithm terminates once the maximum cost between a value and its containing subsequence's spanning segment falls within some given ε > 0. In this version, it is fixed that ε = 0, and termination instead occurs once k − 1 splits have been performed, at line 11, resulting in a k-partition, assuming the k parameter value is given.

5.1.1 Applying for Non-Linear Segment Models

The RDP algorithm places splits between segments at data values which deviate the furthest from the existing bounding segments. The efficacy of this approach relies on certain assumptions: namely, that the value discontinuities between adjacent segment ends are negligible, and that individual segments cannot possess internal local extrema, so an extremum must indicate a joint between two segments. GPS data, being derived from a trace of a continuous geographic trajectory with notionally linear segments, fulfills these requirements, but other data types might not. Pertinently, video feature vector data does not, as its constant segments can have arbitrarily large inter-segment discontinuities, and consist entirely of local extrema by dint of each one having a single value across its time span. Thus, the data must be transformed into a structure which does fulfill these assumptions before RDP is applied. For feature data, this transformation is simple: the stream is cumulatively summed. The discontinuity between adjacent sums is of the magnitude of a single input value, and constant segments in the original data space correspond to linear segments in the transformed space. RDP can then be effectively applied to this transformed data stream.

5.1.2 Selecting the k Parameter

If k is not given to the algorithm as input, then the k-partition initialization step attempts to determine the best value of k to fit the input stream. It uses a standard elbow finding method in order to identify the k at which the cost improvements gained from adding clusters shift from significant to incremental [17]. Moreover, because the RDP algorithm is naturally progressive (i.e., it inductively calculates the result for k from the result for k − 1), it is possible to calculate the fitting costs for all k up to the elbow region, sufficient for finding the elbow at line 16, in the same asymptotic runtime as calculating for a single k.

5.2 m-Clustering Initialization

Once the input stream has been partitioned, the k sections are grouped into m clusters using the k-means algorithm, as described in KMeansCluster (Algorithm 3). Each centroid is a model segment, and the appropriate error model is used to calculate the cost of assigning a given section to a given centroid. However, it is important to note that these sections may have internal cost: the fitting cost of a single section to its own centroid is not necessarily zero. In keeping with the principle of null optimality, therefore, each section's fitting cost is subtracted from its centroid-assignment costs. In this way, sections with higher internal cost are not unduly favored. The result of this process is an m-clustering, which assigns the k sections to m clusters.

5.2.1 Selecting the m Parameter

Similar to the k parameter selection process, the m parameter is selected using the elbow method.
However, because k-means is not progressive in the same way, a ternary search is instead used to locate the elbow value among m ∈ {1, ..., k}, performing k-means at each candidate value. This multiplies the asymptotic runtime of the process by a factor of O(log k) compared to performing a single k-means with m as an input.

Algorithm 3 K-Means Section Clustering Algorithm
1: procedure KMeansCluster(S, part, m?)
2:   if m is defined then
3:     set search range to {m}
4:   else
5:     initialize discrete ternary search range {1, ..., part.k}
6:     calculate and record the 1-clustering cost cost_1
7:     calculate and record the (part.k)-clustering cost cost_{part.k}
8:   loop
9:     if search range contains at least two non-boundary values then
10:      select candidate non-boundary values m_a and m_b in search range
11:      let M be {m_a, m_b}
12:    else
13:      let M be the entire search range
14:    for all m_i ∈ M do
15:      initialize m_i-clustering cluster of part's sections with k-means++
16:      perform k-means on cluster
17:      record current cluster as cluster_{m_i}
18:      calculate and record the m_i-clustering cost cost_{m_i}
19:    find elbow in {(m_i, cost_{m_i})} with limits {(1, cost_1), (part.k, cost_{part.k})}
20:    if search range contains at least two non-boundary values then
21:      update search range according to whether the elbow is m_a or m_b
22:    else
23:      return cluster_m for elbow value m

5.3 Segment m-Set Construction

Once the k-partition and m-clustering have been initialized, the set of m underlying path segments can be produced by OptimalSegments (Algorithm 4). For each cluster, a segment is calculated to minimize the total error from it to the partition sections assigned to that cluster. This set will be equivalent to the final centroid set from the k-means process.

Algorithm 4 Optimal Path Segment Set Construction Algorithm
1: procedure OptimalSegments(S, part, cluster)
2:   for all clusters in cluster do
3:     collect all sections of part assigned to that cluster by cluster
4:     extract all substreams of S defined by those sections
5:     calculate the path segment with minimal total fitting cost to those substreams
6:   return the set of those m path segments

5.4 Expectation-Maximization Iteration

Once the components of the (k, m)-segment trajectory have been initialized, the algorithm performs an iterative improvement process using an expectation-maximization approach.

5.4.1 Expectation: k-Partition and m-Clustering Improvement

While the set of m path segments is held fixed, the k-partition boundaries and m-clustering assignments are jointly improved. This is achieved with two substeps: in each one, either the even or odd partition boundaries are held fixed, and each two-section pair between them is independently updated using TwoSegmentUpdate (Algorithm 5) by calculating its 2-segment optimum given the existing path segment set. In this way, an NP-hard problem is reduced through approximation to a series of parallelizable polynomial problems. Moreover, as long as the initialization process was effective enough to place the (k, m)-segment trajectory in the correct local descent region, the effect of this approximation on the resulting cost is negligible.
Algorithm 5 2-Segment Partition and Clustering Update
1: procedure TwoSegmentUpdate(S, part, segments)
2:   for all adjacent section pairs in part do
3:     join that pair into a single section
4:     for all possible boundary positions in that section do
5:       split that section into two at that position
6:       for all segments in segments do
7:         calculate the fitting cost of that segment to S within each section
8:       identify the lowest-cost segment for each section
9:     identify the boundary position with the lowest-cost pair of sections
10:  return the aggregate (k, m)-segment trajectory

5.4.2 Maximization: Segment m-Set Reoptimization

Once the k-partition and m-clustering have been updated, they are in turn held fixed while the set of m path segments is recalculated based on them, again using OptimalSegments. Unlike the expectation step, this optimization is not incremental; rather, the segments are completely and precisely refitted to their assigned sections, in the same way that they were initially constructed.

5.4.3 Termination

This iterative process continues until an iteration occurs in which the total fitting cost does not improve. At this point, the locally optimal (k, m)-segment trajectory has been found, and the algorithm terminates, producing this trajectory as the approximate (k, m)-segment mean. See Chapter 7 for a full experimental analysis of the resulting output.

Chapter 6

Optimization

Given this algorithm's extensive use of the EM paradigm, it is nontrivial to realistically evaluate its runtime, even before the application of any optimization techniques. Instead, we will identify and justify the various individual runtime improvements which this implementation achieves, and evaluate their relative gains over the naive alternative.

6.1 Fitting Cost Coresets

Many of the significant runtime reductions achieved by this implementation are a result of the usage of fitting cost coresets. In general, coresets are a type of intermediate data structure, derived from an initial data set, which can reduce the runtime of performing a certain calculation in aggregate on that data set, possibly with a degree of approximation [7]. In this case, a lossless coreset is used to reduce the amortized runtime of a group of fitting cost calculations. Specifically, these coresets should have three particular properties.

Property (section-segment independence). The fitting cost coreset of a stream section should be constructible independently of any path segment. The fitting cost to a particular segment can then be calculated using only that segment and the previously-constructed coreset. The runtime necessary to construct a coreset should be asymptotically no greater than that which is necessary to perform a traditional fitting cost calculation. The runtime necessary to calculate a fitting cost using an existing coreset should be asymptotically less than that which is necessary to perform a traditional fitting cost calculation. In other words, using an intermediate coreset to calculate a fitting cost should never have a worse runtime than calculating the cost directly, and should have a better runtime when multiple costs are calculated for a single stream section.

Property (cumulative construction). Given a stream section and the corresponding fitting cost coreset, it should be possible, after adding a single point to either end of the section, to update the coreset to include this point in a runtime which is asymptotically faster than completely recalculating the new coreset from the updated section.
Chapter 6

Optimization

Given this algorithm's extensive use of the EM paradigm, it is nontrivial to realistically evaluate its runtime, even before the application of any optimization techniques. Instead, we will identify and justify the various individual runtime improvements which this implementation achieves, and evaluate their relative gains over the naive alternative.

6.1 Fitting Cost Coresets

Many of the significant runtime reductions achieved by this implementation are a result of the use of fitting cost coresets. In general, coresets are a type of intermediate data structure, derived from an initial data set, which can reduce the runtime of performing a certain calculation in aggregate on that data set, possibly with a degree of approximation [7]. In this case, a lossless coreset is used to reduce the amortized runtime of a group of fitting cost calculations. Specifically, these coresets should have three particular properties.

Property (section-segment independence). The fitting cost coreset of a stream section should be constructible independently of any path segment. The fitting cost to a particular segment can then be calculated using only that segment and the previously constructed coreset. The runtime necessary to construct a coreset should be asymptotically no greater than that which is necessary to perform a traditional fitting cost calculation, and the runtime necessary to calculate a fitting cost using an existing coreset should be asymptotically less. In other words, using an intermediate coreset to calculate a fitting cost should never have a worse runtime than calculating the cost directly, and should have a better runtime when multiple costs are calculated for a single stream section.

Property (cumulative construction). Given a stream section and the corresponding fitting cost coreset, it should be possible, after adding a single point to either end of the section, to update the coreset to include this point in a runtime which is asymptotically faster than completely recalculating the new coreset from the updated section. Essentially, this allows the fitting cost coreset to be cumulatively calculated for a given stream section, producing the coresets for all O(n) left- or right-affixed subsections in a reduced runtime compared to what is generally necessary to construct O(n) different coresets.

Property (optimizability). The optimal model segment for a stream section should be calculable from that section's coreset in a runtime asymptotically no worse than that of calculating the segment directly from the stream.

For both GPS and feature data, a fitting cost coreset can be constructed in linear time relative to the stream section's size, at which point a fitting cost can be calculated in constant time relative to that size. A group of cumulative coresets can still be constructed in linear time, rather than the naturally quadratic runtime. The optimal segment can also be calculated from the coreset in constant time per dimension.
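Taken together, the three properties amount to an interface contract. As a hedged illustration (the actual implementation is in MATLAB; see Section 7.1.2), a coreset type satisfying them might take the following shape in Python, with concrete versions derived in Sections 6.1.1 and 6.1.2:

class FittingCostCoreset:
    # Abstract shape of a lossless fitting cost coreset; illustrative only.

    @classmethod
    def build(cls, section):
        # Section-segment independence: construction uses only the
        # stream section, in time no worse than one direct cost
        # evaluation (O(n) here).
        raise NotImplementedError

    def extend(self, point, at_end=True):
        # Cumulative construction: absorb one point at either end in
        # o(n) time (O(1) for the concrete coresets below), so all
        # left- or right-affixed subsection coresets cost O(n) total.
        raise NotImplementedError

    def cost(self, segment):
        # Evaluate the fitting cost to any candidate segment from the
        # stored summary alone, without revisiting the raw points.
        raise NotImplementedError

    def optimal_segment(self):
        # Optimizability: recover the best-fitting model segment
        # directly from the summary.
        raise NotImplementedError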
6.1.1 GPS Coresets

The fitting cost from a trajectory line segment $\vec{\lambda}(t)$ to a GPS stream section $S = (T, \vec{X})$ can be expressed in matrix terms as $\|\vec{Y}_T - \vec{X}\|_F^2$, where $\vec{Y}_T$ consists of the values of $\vec{\lambda}(t)$ at each $T_i \in T$ [18]. For a path line segment described by its endpoints $(\vec{\lambda}_s, \vec{\lambda}_e)$, the segment must first be progressed into the time domain so that it exactly spans the range $[T_1, T_{|T|}] = [T_s, T_e]$. The value of this time-progressed segment is

$\vec{\lambda}_{[T_s,T_e]}(t) = \frac{1}{T_e - T_s}\left((T_e - t)\vec{\lambda}_s + (t - T_s)\vec{\lambda}_e\right).$

Therefore $\vec{Y} = A\vec{\Lambda}$, where $A_{i*} = \frac{1}{T_e - T_s}\begin{bmatrix} T_e - T_i & T_i - T_s \end{bmatrix}$ for $i = 1 \to |S|$ and $\vec{\Lambda} = \begin{bmatrix} \vec{\lambda}_s \\ \vec{\lambda}_e \end{bmatrix}$. By the definition of the Frobenius norm, $\|\vec{Y}_T - \vec{X}\|_F^2 = \mathrm{tr}((\vec{Y}_T - \vec{X})^T(\vec{Y}_T - \vec{X}))$. Applying the value of $\vec{Y}_T$ and rearranging the terms, it is found that

$C(S \mid \vec{\Lambda}) = (A^TA)_{11}\|\vec{\lambda}_s\|^2 + 2(A^TA)_{12}\,\vec{\lambda}_s \cdot \vec{\lambda}_e + (A^TA)_{22}\|\vec{\lambda}_e\|^2 - 2(\vec{X}^TA)_{*1} \cdot \vec{\lambda}_s - 2(\vec{X}^TA)_{*2} \cdot \vec{\lambda}_e + \mathrm{tr}(\vec{X}^T\vec{X}).$

Theorem 2. The fitting cost coreset of GPS data can be expressed in terms of eight parameters of size O(1), each of which can be calculated cumulatively over the point stream. These parameters can be calculated for a cumulative coreset group of size n in O(n) time, and a fitting cost can be calculated from these parameters in O(1) time.

Proof. Let $B_{i*} = \begin{bmatrix} T_e - T_i & T_i - T_s \end{bmatrix}$, so that $A = \frac{1}{T_e - T_s}B$. As shown above, the fitting cost can be expressed in terms of six parameters derived from $A$ and $\vec{X}$, meaning that they can be derived from $B$, $\vec{X}$, $T_s$ and $T_e$. Given these parameters, the fitting cost expression can thus be calculated in constant time. Moreover, each of these parameters can be updated with additional GPS points in constant time per point added. $T_s$ is simply replaced if the new point is at the beginning of the stream section, or $T_e$ if it is at the end. For the other six terms, cumulative sums must be used. First, we will show that this can be achieved for the progressive addition of points to the top of the stream section (i.e., for the set of left-affixed subsets), and then we will show that this calculation can be wrapped in a stream section transformation which allows it to apply to the bottom of the section as well (i.e., right-affixed subsets). Let $j = 0 \to n$ be the right bound index of the subset.

$(B^TB)_{11} = \sum_{i=1}^{j}(T_j - T_i)^2 = jT_j^2 - 2T_j\sum_{i=1}^{j}T_i + \sum_{i=1}^{j}T_i^2$

$(B^TB)_{12} = \sum_{i=1}^{j}(T_i - T_1)(T_j - T_i) = -jT_1T_j + (T_1 + T_j)\sum_{i=1}^{j}T_i - \sum_{i=1}^{j}T_i^2$

$(B^TB)_{22} = \sum_{i=1}^{j}(T_i - T_1)^2 = jT_1^2 - 2T_1\sum_{i=1}^{j}T_i + \sum_{i=1}^{j}T_i^2$

$(\vec{X}^TB)_{*1} = \sum_{i=1}^{j}(T_j - T_i)\vec{X}_{i*} = T_j\sum_{i=1}^{j}\vec{X}_{i*} - \sum_{i=1}^{j}T_i\vec{X}_{i*}$

$(\vec{X}^TB)_{*2} = \sum_{i=1}^{j}(T_i - T_1)\vec{X}_{i*} = \sum_{i=1}^{j}T_i\vec{X}_{i*} - T_1\sum_{i=1}^{j}\vec{X}_{i*}$

$\mathrm{tr}(\vec{X}^T\vec{X}) = \sum_{i=1}^{j}\|\vec{X}_{i*}\|^2 = \|\vec{X}\|_F^2$

These expressions can also be used to calculate the coreset parameters for the right-affixed subset group, by applying a transformation to the stream section before processing, and then reversing the transformation on the calculated values which are input into the coreset parameter expressions. Specifically, the order of the elements in $\vec{X}$ and $T$ must be reversed, and the timestamp values in $T$ must be negated (so as to invert the relative values of any two timestamps in the stream). Accordingly, $T_1$ and $T_j$ must be negated and switched with one another, and the cumulative terms $\sum_{i=1}^{j} T_i$ and $\sum_{i=1}^{j} T_i\vec{X}_{i*}$ must also be negated (these are the only cumulative terms which contain unsquared time values). Calculating the parameters from these modified intermediate values produces coresets corresponding to the group of right-affixed stream subsets. Therefore, the left- and right-affixed subset group coresets of a size-n stream section can be constructed in O(n) time, at which point a fitting cost for any of these subsets can be calculated from the corresponding coreset in O(1) time.

Theorem 3. Given the coreset for a GPS stream section of any size, the optimal line path segment for that section can be calculated in constant time.

Proof. The objective is to calculate $\mathrm{argmin}_{\vec{\Lambda}}\, C(S \mid \vec{\Lambda})$, which must occur where all of the λ-derivatives are equal to zero:

$\frac{d}{d\vec{\lambda}_s} C(S \mid \vec{\Lambda}) = \frac{d}{d\vec{\lambda}_e} C(S \mid \vec{\Lambda}) = \vec{0}.$

By applying the expression for $C(S \mid \vec{\Lambda})$ in terms of the parameters of the coreset, the optimal $\vec{\Lambda}$ is found:

$\begin{bmatrix} \vec{\lambda}_s \\ \vec{\lambda}_e \end{bmatrix} = \frac{1}{(A^TA)_{11}(A^TA)_{22} - ((A^TA)_{12})^2}\begin{bmatrix} (A^TA)_{22}(\vec{X}^TA)_{*1} - (A^TA)_{12}(\vec{X}^TA)_{*2} \\ (A^TA)_{11}(\vec{X}^TA)_{*2} - (A^TA)_{12}(\vec{X}^TA)_{*1} \end{bmatrix} = \frac{T_e - T_s}{(B^TB)_{11}(B^TB)_{22} - ((B^TB)_{12})^2}\begin{bmatrix} (B^TB)_{22}(\vec{X}^TB)_{*1} - (B^TB)_{12}(\vec{X}^TB)_{*2} \\ (B^TB)_{11}(\vec{X}^TB)_{*2} - (B^TB)_{12}(\vec{X}^TB)_{*1} \end{bmatrix}$
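As an illustration of Theorems 2 and 3, the following Python sketch maintains the cumulative sums above and evaluates costs and optimal segments from them. It is a simplified reading of the math, not the thesis code; the class and method names are invented for exposition, and points are assumed to arrive in increasing time order (left-affixed growth).

import numpy as np

class GPSCoreset:
    # Eight O(1)-size parameters, mirroring Theorem 2: n, sum of T_i,
    # sum of T_i^2, sum of X_i, sum of T_i X_i, sum of ||X_i||^2, and
    # the two boundary timestamps T_s and T_e.
    def __init__(self):
        self.n = 0
        self.sum_t = 0.0
        self.sum_t2 = 0.0
        self.sum_x = None
        self.sum_tx = None
        self.sum_xx = 0.0
        self.t_first = None
        self.t_last = None

    def append(self, t, x):
        # O(1) cumulative update with one point appended at the end.
        x = np.asarray(x, dtype=float)
        if self.n == 0:
            self.t_first = t
            self.sum_x = np.zeros_like(x)
            self.sum_tx = np.zeros_like(x)
        self.n += 1
        self.t_last = t
        self.sum_t += t
        self.sum_t2 += t * t
        self.sum_x += x
        self.sum_tx += t * x
        self.sum_xx += float(x @ x)

    def _btb_and_xtb(self):
        # The six derived parameters, via the cumulative-sum identities.
        j, t1, tj = self.n, self.t_first, self.t_last
        btb11 = j * tj**2 - 2 * tj * self.sum_t + self.sum_t2
        btb12 = -j * t1 * tj + (t1 + tj) * self.sum_t - self.sum_t2
        btb22 = j * t1**2 - 2 * t1 * self.sum_t + self.sum_t2
        xtb1 = tj * self.sum_x - self.sum_tx
        xtb2 = self.sum_tx - t1 * self.sum_x
        return btb11, btb12, btb22, xtb1, xtb2

    def cost(self, lam_s, lam_e):
        # O(1) fitting cost of the time-progressed segment (lam_s, lam_e).
        dt = self.t_last - self.t_first
        if dt == 0:  # degenerate single-instant section: use lam_s alone
            return (self.sum_xx - 2 * float(self.sum_x @ lam_s)
                    + self.n * float(lam_s @ lam_s))
        btb11, btb12, btb22, xtb1, xtb2 = self._btb_and_xtb()
        ata11, ata12, ata22 = btb11 / dt**2, btb12 / dt**2, btb22 / dt**2
        xta1, xta2 = xtb1 / dt, xtb2 / dt
        return (ata11 * float(lam_s @ lam_s)
                + 2 * ata12 * float(lam_s @ lam_e)
                + ata22 * float(lam_e @ lam_e)
                - 2 * float(xta1 @ lam_s) - 2 * float(xta2 @ lam_e)
                + self.sum_xx)

    def optimal_segment(self):
        # Theorem 3: closed-form optimal endpoints in O(1); assumes a
        # nondegenerate section (dt > 0, nonzero determinant).
        btb11, btb12, btb22, xtb1, xtb2 = self._btb_and_xtb()
        dt = self.t_last - self.t_first
        det = btb11 * btb22 - btb12**2
        lam_s = dt * (btb22 * xtb1 - btb12 * xtb2) / det
        lam_e = dt * (btb11 * xtb2 - btb12 * xtb1) / det
        return lam_s, lam_e

Appending n points costs O(n) total, after which cost() and optimal_segment() are each O(1), matching the theorem statements; the right-affixed group would be built by the negated-and-reversed transformation described in the proof of Theorem 2.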
6.1.2 Feature Vector Coreset

The fitting cost coreset for video feature vector data is conceptually simpler than that for GPS data. However, unlike GPS data's fixed 2-dimensionality, feature data can have an arbitrary dimension d, so its coreset's behavior must be described in terms of that value as well as n.

Theorem 4. The fitting cost coreset of d-dimensional feature data can be expressed in terms of three parameters of size O(d), each of which can be calculated as a cumulative sum over the vector stream. A fitting cost can be calculated from these parameters in O(d) time, and these parameters can be calculated for a cumulative coreset group of size n in O(nd) time. The optimal model vector can also be calculated from these parameters in O(d) time.

Proof. Consider the fitting cost function for feature vectors:

$C((T, \vec{X}), \vec{\lambda}) = \sum_{j=1}^{d}\sum_{i=1}^{n}\left((\vec{\lambda}_j - \vec{X}_{ij}) - \vec{X}_{ij}(\log\vec{\lambda}_j - \log\vec{X}_{ij})\right)$

$= \sum_{j=1}^{d}\left(n\vec{\lambda}_j - \sum_{i=1}^{n}\vec{X}_{ij} - \log(\vec{\lambda}_j)\sum_{i=1}^{n}\vec{X}_{ij} + \sum_{i=1}^{n}\vec{X}_{ij}\log\vec{X}_{ij}\right)$

$= \sum_{j=1}^{d}\left(n\vec{\lambda}_j - (1 + \log\vec{\lambda}_j)\sum_{i=1}^{n}\vec{X}_{ij} + \sum_{i=1}^{n}\vec{X}_{ij}\log\vec{X}_{ij}\right)$

$= n\left(\vec{1}\cdot\vec{\lambda}\right) - \left(\vec{1} + \log\vec{\lambda}\right)\cdot\sum_{i=1}^{n}\vec{X}_i + \vec{1}\cdot\sum_{i=1}^{n}\vec{X}_i\log\vec{X}_i$

(with the logarithm and the product $\vec{X}_i\log\vec{X}_i$ taken elementwise). A feature data coreset thus consists of three parameters: the scalar n, the size-d vector $\sum_{i=1}^{n}\vec{X}_i$, and the size-d vector $\sum_{i=1}^{n}\vec{X}_i\log\vec{X}_i$. Multiplying each parameter by its corresponding λ-derived coefficient takes O(d) time, so a fitting cost can be calculated from these parameters in that runtime. Since each parameter can be expressed as a sum of terms formed from individual data values in O(d) time, an entire cumulative coreset group of size n can be constructed in O(nd) time. The optimal model segment of a video feature data set is $E[\vec{X}] = \frac{1}{n}\sum_{i=1}^{n}\vec{X}_i$, so it can be calculated from the two parameters n and $\sum_{i=1}^{n}\vec{X}_i$ in O(d) time.
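The corresponding feature-vector coreset is even shorter in code. The sketch below is again illustrative (invented names, numpy assumed), storing exactly the three parameters of Theorem 4; strictly positive feature values are assumed, as the log-based cost requires.

import numpy as np

class FeatureCoreset:
    def __init__(self, d):
        self.n = 0
        self.sum_x = np.zeros(d)       # sum of X_i
        self.sum_xlogx = np.zeros(d)   # sum of X_i log X_i (elementwise)

    def append(self, x):
        # O(d) cumulative update per vector, O(nd) for a group of n.
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.sum_x += x
        self.sum_xlogx += x * np.log(x)

    def cost(self, lam):
        # O(d) evaluation of the derived cost expression.
        lam = np.asarray(lam, dtype=float)
        return (self.n * lam.sum()
                - float((1.0 + np.log(lam)) @ self.sum_x)
                + self.sum_xlogx.sum())

    def optimal_vector(self):
        # The minimizer is the elementwise mean of the section, from
        # the two parameters n and sum_x alone.
        return self.sum_x / self.n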
6.2 RDP Partition Initialization

The standard RDP algorithm has an expected runtime of O(nd log n), with a worst case of O(n²d). Described purely in terms of the input stream's size and dimension, the modified RDP used to initialize the k-partition shares this runtime behavior. However, it can be more precisely analyzed by also accounting for the parameter k. If k is specified as input, then the algorithm can terminate once that k has been achieved, giving an expected runtime of O(nd log k) with a worst case of O(nkd).

If k is not specified and must be automatically selected with the elbow-finding method, then the algorithm cannot necessarily terminate at the selected k, because additional candidates beyond that value must be tested in order to confirm that it is in fact the optimal one. However, this does not mean that all candidates k ∈ {1, ..., n} must always be tested. Since the fitting cost can never be reduced below 0, testing can cease once it reaches a k = k_t where, even if its resulting cost were 0, it would not be preferable to the best k found so far. Given this, the runtime is O(nd log k_t) expected and O(n k_t d) worst case.

In theory, k_t is not guaranteed to be lower than n. However, it is easy to see that, for reasonable input data, k_t is significantly smaller than n. Let k* be the optimal k, which will be selected by the algorithm. Let c_1 and c_{k*} be, respectively, the fitting costs at k = 1 and k = k*. Let r_{k*} be the cost value of the elbow method's reference line at k = k*. The fractional reduction of k_t from n is equal to the fractional reduction of c_{k*} from r_{k*}:

$\frac{n - k_t}{n} = \frac{r_{k^*} - c_{k^*}}{c_1}$

In other words, k_t's improvement of the RDP process's runtime is proportional to the strength of k*'s optimality, the degree to which it improves the fitting cost relative to the reference line. The only way for k_t = n is if c_{k*} = r_{k*}, which is essentially impossible even for unreasonable input data. In practice, k_t usually provides a runtime decrease of at least twofold, and often more.

6.3 K-Means Clustering Initialization

The behavior of the k-means algorithm is very well studied, and is understood to have a high worst-case runtime but to perform much better in practice. Though it is impractical to attempt to fully analyze the runtime of this step, we can still demonstrate significant improvements by speeding up the process's core operation: the calculation of fitting costs between centroid segments and stream sections. Calculating the fitting cost of a size-n stream section to a single path segment takes O(n) time, and so, naively, calculating those costs for m segments takes O(nm) time. Using fitting cost coresets, however, the O(n)-time processing of the stream sections can be separated from the particular path segment being fitted, allowing multiple fitting costs to be evaluated in only O(n + m) time. Since the k-partition boundaries are fixed throughout the entire m-clustering initialization process, the coresets for the k sections can be calculated once at the beginning, replacing the k-means runtime's dependence on n with the (usually) much lower k. This improvement applies both to the k-means++ initialization step and to the iterative clustering steps.

Additionally, the k-means++ step can in its entirety be extracted from the O(log k)-iteration search for the parameter m, by simply creating a complete ordering of the k sections according to the usual k-means++ process, and then truncating this ordering as necessary for each individual k-means run. As well as reducing the total runtime of that step by a factor of O(log k), this modification also reduces the random variation between k-means runs. This stabilizes the reliability of meaningful comparisons between the clustering costs resulting from different values of m, thus improving the algorithm's ability to accurately identify the correct elbow value for that parameter.

6.4 Path Segment Calculation

The specific operation necessary to construct the optimal path segments, given a partition and clustering of a data stream, is dependent on the data type of that stream. For both GPS and video feature data, the optimal segment for a single cluster can be calculated in time linear in the number of data values in that cluster, using the fitting cost coreset, and can be calculated independently for each dimension. Therefore, the full segment set can be calculated in linear time, O(nd).

6.5 2-Segment Trajectory Update

For a single stream section of size n_i and a set of m path segments, there are O(n_i m²) possible 2-segment trajectories. Calculating the fitting cost of each of these takes O(n_i) time, so naively a single 2-segment trajectory update will take O(n_i² m²) time. However, by calculating separately the optimal path segment for each of the two trajectory segments, the optimal 2-segment trajectory, given the location of the segment split, can be calculated in O(n_i m) time, resulting in a total O(n_i² m) runtime to find the optimal segment split location as well.

This runtime can be further reduced through the use of fitting cost coresets. The set of coresets for all left-affixed subsections and for all right-affixed subsections can be calculated in O(n_i) time. The optimal path segment for each of these subsections can then be found in O(n_i m) time altogether. By then finding the best of the n_i pairings of a left-affixed subsection and a right-affixed subsection with the same non-affixed boundary location, the optimal 2-segment trajectory can be found in O(n_i m) time. Therefore, the optimal 2-segment trajectories for all sections of a partitioned size-n stream can be found in

$O\left(\sum_{i=1}^{k=O(n)} n_i m\right) = O(nm)$ time.
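Combining the cumulative coresets with the split search gives the O(n_i m) update. The sketch below illustrates the mechanics for one joined section; it is hypothetical (it assumes a coreset factory with the GPSCoreset-style interface from Section 6.1.1, fits each candidate segment in a single fixed orientation, and returns only the best boundary and its cost, whereas the full TwoSegmentUpdate also reassigns clusters).

def best_two_segment_split(points, segments, make_coreset):
    # points: chronological list of (t, x); segments: candidate
    # (lam_s, lam_e) endpoint pairs; make_coreset: factory returning a
    # fresh coreset (e.g. GPSCoreset).
    n = len(points)
    # Best fitting cost of each left-affixed prefix to any candidate
    # segment; the coreset state is O(1)-size, so each probe is O(m).
    pre = []
    acc = make_coreset()
    for t, x in points[:-1]:              # prefixes of length 1 .. n-1
        acc.append(t, x)
        pre.append(min(acc.cost(ls, le) for ls, le in segments))
    # Same for right-affixed suffixes, built by appending in reverse
    # with negated timestamps (the transformation of Section 6.1.1),
    # which keeps the internal time order increasing. The first cost
    # argument is then the segment's value at the chronological end.
    suf = [0.0] * (n - 1)
    acc = make_coreset()
    for i in range(n - 1, 0, -1):         # suffixes points[i:], length 1 .. n-1
        t, x = points[i]
        acc.append(-t, x)
        suf[i - 1] = min(acc.cost(le, ls) for ls, le in segments)
    # Combine: split s places points[:s+1] on the left, points[s+1:]
    # on the right, for a total of n-1 candidate boundary positions.
    best_s = min(range(n - 1), key=lambda s: pre[s] + suf[s])
    return best_s + 1, pre[best_s] + suf[best_s]

Since the prefix and suffix passes each perform n - 1 coreset updates and n - 1 O(m) cost probes, the whole search is O(n_i m) for a section of size n_i, as claimed above.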
Chapter 7

Experimental Evaluation

We use two metrics to evaluate the ApproxKMSegmentMean algorithm: the fitting cost (i.e., error) and the runtime. A resulting (k, m)-segment mean approximation is meaningless if it does not fit the input data set well, but the algorithm is not of any use if it cannot process large sets with reasonable speed. In order to evaluate the algorithm's effectiveness in these contexts, a series of experiments on several data sets were used to collect empirical measurements. These statistics have been analyzed to provide a quantitative and qualitative understanding of ApproxKMSegmentMean's behavior.

7.1 Experimental Setup

7.1.1 Datasets

Five data sets of varying size, frequency, and qualitative characteristics were used to test the effectiveness of this algorithm. Some of these are the same sets used in [18], though without the additional preprocessing steps described therein.

ground robot: 72,273 points, produced by SLAM localization, first floor of an academic building, indoors only, 4 hours.
Using a 30-meter-capable Hokuyo scanning laser rangefinder and the GMapping SLAM package for ROS, a custom-built omnidirectional ground robot was remotely operated to explore the first floor of an academic building, concurrently mapping its surroundings and localizing itself. The resulting path is high-noise, with loops and repeated sections.

quadrotor robot: 12,000 points, produced by smartphone GPS, courtyard of an academic building, outdoors only, 10 minutes.
Using a Samsung Galaxy SIII smartphone and a single onboard computer with ROS, an Ascending Technologies Pelican quadrotor flying robot was remotely operated above an outdoor courtyard, collecting filtered GPS data at a rate of 20 Hz.

personal smartphone: 20,051 points, produced by smartphone GPS, greater Boston area, indoors and outdoors, 8 months.
Using the Travveler data-logging smartphone application, GPS data was collected from an individual's phone at an approximate rate of 30 Hz, with frequent and significant gaps in collection. The data was sanitized to remove points with non-unique timestamps, but in contrast to the experiments in [18], it was not patched to remove discontinuities, as this (k, m)-segment mean algorithm does not rely on the point signal being of a near-constant rate.

short phone video: 9,900 vectors, produced by smartphone video, third floor of an academic building, indoors only, 5.5 minutes.
Using the built-in camera of a handheld Samsung Galaxy S4 smartphone, a 1920x1080 video was recorded at 30 frames per second over the course of about 5 minutes. This video shows the forward perspective of an individual traversing a significantly varying path several times amongst several different locations, pausing to observe each one upon arrival.

Figure 7-1: A conceptual layout of the region traversed by the individual as recorded in the short phone video, with locations labeled A through E. In terms of this graph, the individual's trajectory would be labeled as ABCDBAEDCEBCA.

long phone video: 19,800 vectors, produced by smartphone video, throughout an academic building, indoors and outdoors, half an hour.
Using the built-in camera of a handheld Samsung Galaxy S4 smartphone, a 1920x1080 video was recorded at 10 frames per second over the course of about half an hour. This video shows the forward perspective of an individual traversing a slightly varying path ten times amongst several very different locations, pausing to observe each one upon arrival.

Figure 7-2: A conceptual layout of the region traversed by the individual as recorded in the long phone video, with locations labeled A through D. In terms of this graph, the individual's trajectory would be labeled as ABCABCABDABCABDABCABCABCABDABCA.

7.1.2 Processing Environment

These results were produced using an implementation of the ApproxKMSegmentMean algorithm in MATLAB (R2013a), running in 64-bit Windows 8 on a 2 GHz Intel Core i7 four-core processor with 6 GB RAM.

7.1.3 Proportional Fitting Cost

The proportional fitting cost was used as the primary metric of a solution's fit to its input point signal.

Definition 11 (proportional fitting cost). Given an input signal $S = (T, \vec{V})$ of size $|S| = n$ and the total fitting cost $C_f$ of a (k, m)-segment trajectory to that signal, the proportional fitting cost is $\tilde{C}_f = C_f / C_S$, where $C_S$ is the fitting cost of S to the single constant segment consisting of the mean value vector of $\vec{V}$, $\vec{\mu}_S = \frac{1}{n}\sum_{i=1}^{n}\vec{V}_i$.

Observation. For the (1, 1)-segment trajectory consisting of the single path segment along $\vec{\mu}_S$, $\tilde{C}_f = 1$. Therefore, the proportional fitting cost of a (k, m)-segment mean can never be greater than 1.

Observation. For the (n, n)-segment mean of S, $\tilde{C}_f = 0$. Therefore, the proportional fitting cost of a (k, m)-segment mean can never be less than zero.

Note that the bounds demonstrated by these observations apply strictly only to true optimal solutions to the (k, m)-segment mean problem, not to the locally approximate solutions produced by ApproxKMSegmentMean.
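As a small worked sketch of Definition 11 (a squared-error fitting cost is assumed, as for GPS data, and the names are illustrative):

import numpy as np

def proportional_fitting_cost(values, total_fitting_cost):
    # C~_f = C_f / C_S, where C_S is the cost of fitting the signal
    # with one constant segment at its mean value vector.
    v = np.asarray(values, dtype=float)   # n x d value matrix
    mu = v.mean(axis=0)
    c_s = float(((v - mu) ** 2).sum())
    return total_fitting_cost / c_s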
7.2 Results

7.2.1 Accuracy

Figure 7-3: Proportional fitting costs of the experimental data sets (k = 300), plotted on a log scale against the m value for all five data sets (ground robot GPS, quadrotor robot GPS, personal smartphone GPS, short phone video, long phone video). The proportional fitting cost drops off, at first quickly and then more slowly, as the map size increases. Each data set has a characteristic curve relation to m. Note that the feature vector data sets have noticeably higher proportional costs than the GPS data sets. This could be because their much larger dimensionality introduces greater structural cost into the data, or it could indicate that the 'natural' partition size for those sets is greater than k = 300.

7.2.2 Speed

Figure 7-4: Run times of the experimental data sets (k = 300), in seconds per input value, plotted against the m value. The algorithm's run time varies significantly around its average relation to m. While some data sets' run times generally increase relative to m, others' are independent of it, or even decrease as it increases. This is likely due to the behavior of the EM loop: increasing m might cause the trajectory to be initialized closer to its local minimum, reducing the number of EM iterations needed, and therefore the total run time. Given these run times, the system could easily process in real time data streams of up to 5 Hz.

7.2.3 Selected Parameters

data set                   k      m     prop. fitting cost   run time (s per input)
ground robot GPS           62     6     0.0601831            0.0285913
quadrotor robot GPS        33     5     0.0709467            0.020304
personal smartphone GPS    161    9     0.0168283            0.0309489
short phone video          555    95    0.263151             0.192963
long phone video           1429   190   0.230193             1.14047

Table 7.1: If the parameters k and m are not provided as input, the algorithm attempts to select good values for them as part of the trajectory initialization. For each experimental data set, it selects a characteristic (k, m) pair. Deviation of these selected values from the ground-truth parameters is primarily a result of the imprecision of the initial assumptions made about the data, such as the linear or constant structure of the path segments.
In particular, note that the GPS data sets tend to have low parameter values relative to their ground truths, while the feature vector sets tend to have high values. The RDP-based partition initialization is reliant on assumptions made about the mathematical shape of the trajectory's segments, and so is rather sensitive to input streams which deviate from these assumptions. This helps to explain the selected parameters' divergence from their ground-truth values, as seen in Table 7.1.

7.2.4 Sample Results

GPS Geographic Maps

Figure 7-5: Two geographic maps of the ground robot data set, showing the input data points and the calculated path segments of the (300, 20)- and (300, 200)-segment trajectory outputs of the algorithm, respectively. With far fewer path segments to utilize, the m = 20 trajectory's fit to the GPS points is much rougher, compared to the m = 200 trajectory's close fit.

Figure 7-6: Two geographic maps of the quadrotor robot data set, with the (300, 20)- and (300, 200)-segment trajectory outputs of the algorithm, respectively. Since this data set contains a low degree of actual path repetition, the produced trajectories tend to simply align themselves to the GPS points.

Figure 7-7: Two geographic maps of the personal smartphone data set, with the (300, 20)- and (300, 200)-segment trajectory outputs of the algorithm, respectively. Because of the extended discontinuities in the data set, the algorithm has made a best-effort attempt to bridge these gaps. As a result, some parts of the map traverse areas lacking any input points. It may be valuable to 'repair' such signal gaps using data patching, as described in [18].

Feature Vector Clustering Maps

In the clustering maps that follow, the size of each blue circle is proportional to the relative prominence of a particular cluster in the produced trajectory, the number of times which that cluster appears in the sequence. The thickness of each green line is proportional to the relative strength of linkage between two clusters, the number of times which one of the two immediately precedes the other in the sequence, with the absence of a line indicating that the trajectory never directly transits between the two clusters. The images are representative frames for the most prominent clusters, each one corresponding to the vector with the lowest fitting cost to its assigned segment. Some of these frames are manually labeled to identify the ground-truth elements with which their clusters correspond, using the letters from the conceptual layouts in Section 7.1.1.
Figure 7-8: Two clustering maps of the short phone video data set, for the (300, 20)- and (300, 200)-segment trajectories, respectively, showing the identified feature clusters and the transitions between them. Because this video contains only a few repeated loops, the skew of the transition strength towards the most prominent clusters is relatively low, especially when a large number of clusters are allowed. Despite this, the qualitative appearances of the clusters' representative frames are widely varied, even amongst the most prominent clusters. In the left map, the most prominent cluster (cluster 1) occurs 92 times, while the least prominent (cluster 20) occurs only once. In the right map, the most prominent cluster occurs 7 times, while the least prominent (cluster 200) still occurs only once.

Figure 7-9: Two clustering maps of the long phone video data set, for the (300, 20)- and (300, 200)-segment trajectories, respectively. Unlike the short video, this video contains a larger number of repeated loops, and so the transitions' strengths tend to skew significantly towards the most prominent clusters. Even at m = 200, the green lines are noticeably thicker and denser around the top-right region of the map, where the larger blue circles are arranged. In the left map, the most prominent cluster (cluster 1) occurs 66 times, while the least prominent (cluster 20) occurs only once. In the right map, the most prominent cluster occurs 8 times, while the least prominent (cluster 200) still occurs only once.

These sample clustering maps demonstrate both the successes and the shortcomings of ApproxKMSegmentMean as applied to feature vector data. On the one hand, the clear non-uniformity of cluster prominence (the number of sections assigned to each cluster), as well as the relative skew of transitions towards the most prominent clusters, demonstrates how the algorithm is able to identify repeated feature characteristics across the input stream. On the other hand, the presence of a non-negligible number of less prominent or outright trivial clusters shows cases where the algorithm has failed to develop a sufficiently robust understanding of the stream's underlying patterns of repetition. To some degree, this is a result of the choice of the (k, m)-segment mean as the mathematical model of these patterns: because the algorithm aims to reduce the aggregate fitting costs of the (k, m)-segment trajectory, it will tend to fit in a way which favors small outlier values over larger but less divergent regions of the stream.
It is thus not surprising that a certain fraction of the total clusters are consistently given over to such outliers, regardless of whether m is large or small.

Chapter 8

Conclusion

The ApproxKMSegmentMean algorithm produces a semantic map representing the underlying patterns found in a long-term data trajectory with significant repetition. It uses a process of intelligent initialization followed by incremental improvement in order to converge on a (k, m)-segment trajectory with a locally optimal fitting cost relative to the original data. This algorithm is sufficiently generalized to be applicable to a wide variety of input data types, such as GPS points and video feature vectors. The results in Section 7.2.1 show that the maps produced are close matches to the input data, and those in Section 7.2.2 show that the algorithm is able to develop these maps quickly enough for real-time applications.

Beyond the objective of solving the (k, m)-segment mean problem to high approximation accuracy, however, these results show the qualitative limitations of this algorithm as applied to the development of semantic activity maps, such as its reliance on structural assumptions and its sensitivity to outliers. It bears investigating whether the algorithm or its implementation can be modified in order to dampen these undesirable operative characteristics.

In addition to improving the robustness of the algorithm to remedy these issues, there are several promising avenues of potential future research stemming from this work. Most obviously, this algorithm could be applied to other types of data, each requiring its own error model. More intriguing, however, is the possibility of processing multiple data streams at once. Compound input, synthesized from multiple sensors on a single agent, could massively improve the algorithm's ability to discern underlying patterns in that agent's behavior. Conversely, analyzing input from multiple agents with overlapping regions of experience could produce a much more detailed map of their shared region.

Bibliography

[1] Pankaj K. Agarwal and Nabil H. Mustafa. K-means projective clustering. In 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2004.

[2] James Biagioni and Jakob Eriksson. Inferring road maps from GPS traces: Survey and comparative evaluation. In Transportation Research Board 91st Annual Meeting, 2012.

[3] Lili Cao and John Krumm. From GPS traces to a routable road map. In 17th ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL GIS), 2009.

[4] Winston Churchill and Paul Newman. Experience-based navigation for long-term localisation. The International Journal of Robotics Research, December (Special Issue on Long-Term Autonomy) 2013.

[5] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localisation and mapping (SLAM): Part I, the essential algorithms. IEEE Robotics & Automation Magazine, 2006. URL: http://www-personal.acfr.usyd.edu.au/tbailey/papers/slamtute1.pdf.

[6] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localisation and mapping (SLAM): Part II, state of the art. IEEE Robotics & Automation Magazine, 2006. URL: http://www-personal.acfr.usyd.edu.au/tbailey/papers/slamtute2.pdf.

[7] Daniel Feldman, Cynthia Sung, and Daniela Rus. The single pixel GPS: Learning big data signals from tiny coresets. In 20th ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL GIS), 2012.

[8] Andrii Ilienko. Continuous counterparts of Poisson and binomial distributions and their properties.
Annales Univ. Sci. Budapest, Sect. Comput., 39:137–147, 2013. URL: http://ac.inf.elte.hu/Vol_039_2013/137_39.pdf.

[9] Yasir Latif, César Cadena, and José Neira. Robust loop closing over time for pose graph SLAM. The International Journal of Robotics Research, December (Special Issue on Long-Term Autonomy) 2013.

[10] Yunpeng Li, Noah Snavely, and Daniel P. Huttenlocher. Location recognition using prioritized feature matching. In 11th European Conference on Computer Vision: Part II, 2010. URL: http://www.cs.cornell.edu/~dph/papers/localization.pdf.

[11] Kai Ni, Anitha Kannan, Antonio Criminisi, and John Winn. Epitomic location recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4587585.

[12] Brian Niehoefer, Ralf Burda, Christian Wietfeld, Franziskus Bauer, and Oliver Lueert. GPS community map generation for enhanced routing methods based on trace collection by mobile phones. In 1st International Conference on Advances in Satellite and Space Communications (SPACOMM), 2009.

[13] Guy Rosman, Mikhail Volkov, Daniel Feldman, and Daniela Rus. Segmentation of big data signals using coresets (provisional title), 2014.

[14] Falko Schmid, Kai-Florian Richter, and Patrick Laube. Semantic trajectory compression. In Advances in Spatial and Temporal Databases, 2009.

[15] Florian Schroff, C. Lawrence Zitnick, and Simon Baker. Clustering videos by location. In British Machine Vision Conference, 2009. URL: http://research.microsoft.com/pubs/81738/bmvc09_cr.pdf.

[16] Wenhuan Shi, Shuhan Shen, and Yuncai Liu. Automatic generation of road network map from massive GPS vehicle trajectories. In 12th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2009.

[17] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2002. URL: http://www.stanford.edu/~hastie/Papers/gap.pdf.

[18] Cathy Wu. GPSZip: Semantic representation and compression system for GPS using coresets. Master's thesis, Massachusetts Institute of Technology, 2013.

[19] J.J.C. Ying, W.C. Lee, T.C. Weng, and V.S. Tseng. Semantic trajectory mining for location prediction. In 19th ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL GIS), 2011.