Human Mobility Modeling at Metropolitan Scales
Sibren Isaacman, Richard Becker, Ramón Cáceres, Margaret Martonosi, James Rowland, Alexander Varshavsky, Walter Willinger
Princeton University, Princeton; AT&T Labs, Florham Park, NJ, USA
MobiSys 2012
Junction 101.07.09

Introduction
• Human mobility model
  • Useful for mobile sensing, opportunistic networking, urban planning, ecology, and epidemiology
• Aim: produce accurate models of how large populations move within different metropolitan areas
  • Generate sequences of locations and associated times that capture how individuals move between important places in their lives, such as home and work.
  • Aggregate the movements of many such individuals to reproduce human densities over time at the geographic scale of metropolitan areas.
  • Take into account how different metropolitan areas exhibit distinct mobility patterns due to differences in geographic distributions of homes and jobs, transportation infrastructures, and other factors.

Shortcomings of previous work
• Random motion -> unrealistic motion
• Lack of memory about recurring movement patterns
• Lack of spatiotemporal realism about population densities
• No account of diverse populations or geographic concerns
• Small geographic area (e.g. a campus)
• Too universal -> fail to adapt to different geographic areas

Sources
• Call Detail Records (CDRs)
  • Human mobility models are created from such data
  • Maintained by cellular network operators
• Information -> sporadic samples of the approximate locations of the phone's owner
  • The time a voice call was placed or a text message was received
  • The identity of the cell tower with which the phone was associated
• Difficulties:
  • It is unclear whether the associated locations correspond to home, work, or other important places for particular cellphone users.
  • Both the spatial and temporal granularity of CDR data are quite coarse
    • Spatially: limited to the granularity of cell-tower spacing
    • Temporally: records are generated only when phones are actively involved in a call or text message

Overall view
• WHERE modeling approach (Work and Home Extracted Regions)
• Human movement (important locations, commute distances, ...) => probability distributions => synthetic CDRs for an arbitrary number of synthetic people

Parameters for mobility modeling
• Human mobility is tightly coupled to the geography of the city people live in.
• => A model should take into account both the area's geography and individual users' mobility patterns.
• Spatial information: important locations
• Spatiotemporal information: hourly population densities
• Temporal information: calling patterns

Spatial: Important Locations
• A full 60% of mobility can be accounted for by just the top two cellphone towers with which a user is associated.
• Important locations: home and work
• Probability distributions
  • Home locations: Home
  • Commute distance: CommuteDistance (d)
  • Work locations: Work

Spatiotemporal: Hourly population densities
• A heavily residential area is likely to be more populated at night; a commercial district is likely to be more populated during the day.
• Hourly population density (Hourly): the probability of people being at a particular location at a particular time
• Assumption: the spatial density of telephone calls is approximately equivalent to the spatial density of people
• Figure: 7 ~ 8 p.m. on weekdays, over a 3-month period

Temporal: calling patterns
• A user's daily call volume is drawn from the distribution PerUserCallsPerDay ~ N(mean, sd)
• Temporal patterns of when those calls are made:
  • How many calls each user makes during each hour of the day: a 24-dimensional vector
  • 2 classes of users (clustered by the k-means method)
  • Hourly call probability distribution: CallTime

Algorithm for model generation
• WHERE2 => two-place model: Work and Home
• Makes use of the fact that most people spend the majority of their time either at home or at work.

WHERE2: Work and Home
• Users' movement: emerges from synthetic CDRs representing calls made at different locations at different times

Extension to Additional Places
• WHERE3: increases the number of important locations
• Tradeoff: model complexity <-> fidelity of the synthetic trace

Evaluation
• Earth Mover's Distance (EMD)
  • A "good" synthetic trace has the synthetic user population distributed in space very similarly to the real trace, at any point during the day.
  • A measure for comparing two spatial probability distributions
  • Attempts to find the minimum amount of energy required to transform one probability distribution into another.
  • This energy is given by the "amount" of probability to be moved and the "distance" to move it.
  • Figure: "distance", the linear mapping used in EMD

Evaluation
• Comparison models
  • Random Waypoint (RWP)
    • Each user selects a random destination from all possible destinations in the area to be simulated.
    • Once a destination is selected, the user moves at a random velocity toward the destination.
    • When the selected destination is reached, the user waits for a random amount of time, then selects a new destination and a new velocity and repeats the procedure.
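The Random Waypoint procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulator used in the paper: the rectangular area, the discrete time steps, the lower bound on speed, and every name (`random_waypoint`, `max_speed`, `max_pause`) are hypothetical choices made for the example.

```python
import math
import random


def random_waypoint(steps, width, height, max_speed, max_pause, rng):
    """Simulate one Random Waypoint user on a width x height rectangle.

    Returns a list with one (x, y) position per time step.
    All parameters are illustrative assumptions, not values from the paper.
    """
    def new_leg():
        # Pick a random destination anywhere in the area and a random speed.
        dest = (rng.uniform(0, width), rng.uniform(0, height))
        speed = rng.uniform(0.1 * max_speed, max_speed)  # keep speed above zero
        return dest, speed

    x, y = rng.uniform(0, width), rng.uniform(0, height)
    dest, speed = new_leg()
    pause = 0
    trace = []
    for _ in range(steps):
        if pause > 0:
            pause -= 1  # waiting at the destination that was just reached
        else:
            dx, dy = dest[0] - x, dest[1] - y
            dist = math.hypot(dx, dy)
            if dist <= speed:
                # Destination reached: wait a random time, then start a new leg.
                x, y = dest
                pause = rng.randint(0, max_pause)
                dest, speed = new_leg()
            else:
                # Move one step in a straight line toward the destination.
                x += speed * dx / dist
                y += speed * dy / dist
        trace.append((x, y))
    return trace


trace = random_waypoint(steps=500, width=10.0, height=5.0,
                        max_speed=1.0, max_pause=3, rng=random.Random(0))
```

Because every destination is drawn uniformly, the sketch carries no notion of home, work, or time of day.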
  • Weighted Random Waypoint (WRWP)
    • The destination is chosen from a location probability distribution
    • Here, the distribution "Hourly" is used

Evaluation
• Sources for input probability distributions
  • Real call detail records (CDRs)
    • ZIP codes within a 50-mile radius of the LA and NY centers
    • Billing addresses lie within the metropolitan regions of interest
    • Phone id, starting time, duration, cell tower locations
  • Data validation (CDRs can accurately represent mobility patterns)
    • The number of sampled phones in each ZIP code is proportional to the population of that ZIP code.
    • The maximum pairwise distance between any 2 cell towers contacted by a phone in one day is a close approximation of how far the phone's owner traveled that day.
    • Applying certain clustering and regression techniques produces accurate estimates of important locations in people's lives, in particular home and work.

Evaluation
• Sources for input probability distributions
  • Census data
    • Need to buy CDRs!
    • Publicly available data regarding home and work locations, as well as commute distances, for large populations
    • Provides little or no information about the hourly probabilities of a given location
    • An assumption must be made about the hourly distributions
  • Combination of public data and CDRs
    • Home, work, and commute distances come from the census
    • Other distributions are drawn from CDR data

Evaluation
• Artificial Test Cases
  • Used to reason about the expected behavior of the model and its strengths and weaknesses
  • Two locations:
    • The model must be able to place synthetic users at home and work locations and move them according to time of day.
    • Since we emulate CDRs with discrete call locations, the synthetic users must exist only at these locations.
    • The entire world is populated by users that move predictably between two locations at highly regimented intervals.
    • From 7am to 7pm on weekdays, all of the probability is concentrated in a single "work" location.
      At all other times, the probability is clustered in a second "home" location.

Validation: Two locations
• Ideal result: the probability is a spike at the home location at 6am and a spike at the work location at 9am.
• RWP: differs greatly, because it is given so little input information regarding spatial or temporal patterns.
• WRWP: doesn't have sufficient temporal information to distinguish how mobility probabilities should differ from 6am to 9am.
• WHERE2: clearly able to correctly model the input test case during both the "home" and "work" phases.

Evaluation
• Artificial Test Cases
  • Used to reason about the expected behavior of the model and its strengths and weaknesses
  • If multiple possible work locations exist for a home, the model must select a realistic one.
  • WHERE2 detects that users make calls from only one of two locations and thus positions users correctly.
  • By contrast, a model that has a single user travel to each of the four possible locations yields a significantly worse EMD.

Validation: Large Scale Real Data
• Modeling based on real CDRs
  • WRWP improves over the original RWP by a factor of 4
  • WHERE outperforms WRWP by an additional 20%
  • WHERE3 improves further (NY: an additional 3 miles of accuracy)

Validation: Large Scale Real Data
• Modeling based on census data
  • Using all-public data from the US Census
    • Average error of 8 miles in NY
  • Using a hybrid of census data with some CDR information
    • Average error of 6.8 miles
  • Using all-CDR information in WHERE3
    • Reduces the average error to some 3 miles

Example Uses
• Daily range: the maximum distance a person travels in one day
  • Serves as an important metric for verifying the correctness of the generated models
  • The EMD metric measures the aggregate behavior of the users, whereas daily range displays the results at a per-user granularity.
  • Figure: daily range shown at the (2, 25, 50, 75, 98) percentiles
• Comparing NY and LA: applicability across cities with very different mobility patterns and geographic characteristics
  • WHERE2: the median is within 0.8 miles of the true value, with only about 1 mile of error in LA.
  • WHERE3: 0.7 miles of error => adding a third point to the model provides very little benefit for this use.

Example Uses
• Message propagation: relevant in opportunistic networking
  • Social contacts, epidemiology, and data carrying
  • Simulator for epidemic routing
  • When users meet (0.5 radius), they exchange all messages
  • Delivery percentage and message delay rate

Example Uses
• Hypothetical cities: what-if scenarios regarding mobility in NY
  • Create a parameterized model of cities and user behavior patterns
  • Lets a city planner experiment with the effects of modifications that are being considered
  • Scenario: 10% of the people whose original work locations were in the borough of Manhattan were instead given work locations that matched their home locations.

Conclusion
• Human mobility modeling
  • Captures the motion of individuals among important places in their lives
  • Aggregates that motion to reproduce human densities over time at the scale of a metropolitan area
  • Accounts for differences between metropolitan areas
• Refinements
  • Produce not only sequences of locations with associated times, but also the routes taken between those locations
  • CDRs and census tables are too coarse to provide route information
  • Sequences of cell towers passed along the way
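The daily range metric from the Example Uses slides (the maximum distance between any two cell towers a phone contacts in one day) can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(day, lat, lon)`, the use of great-circle (haversine) distance, the sample coordinates, and the function names are all assumptions made for the example.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def daily_ranges(records):
    """Map each day to the maximum pairwise distance between contacted towers.

    `records` is an iterable of (day, lat, lon) tuples for ONE phone;
    this layout is a hypothetical simplification of a CDR.
    """
    towers_by_day = defaultdict(set)
    for day, lat, lon in records:
        towers_by_day[day].add((lat, lon))
    return {
        day: max((haversine_miles(a, b) for a, b in combinations(towers, 2)),
                 default=0.0)  # a single tower in a day gives a range of 0
        for day, towers in towers_by_day.items()
    }


# Made-up sample: two towers one degree of latitude apart on Monday,
# a single tower on Tuesday.
sample = [
    ("mon", 40.0, -74.0),
    ("mon", 41.0, -74.0),
    ("mon", 40.0, -74.0),
    ("tue", 40.5, -74.2),
]
ranges = daily_ranges(sample)
```

Computing this per user for both the real and the synthetic traces gives the per-user distributions whose percentiles the slides compare.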