Origin/Destination Applications from Smart Card Data

Jesse Simon

A number of studies have explored the potential use of smart cards in generating Origin/Destination matrices and several agencies have sponsored demonstrations on portions of their systems. The LACMTA has developed a methodology that generates O/D from the entire system’s smart card transactions. The prototype practical applications discussed in this paper are oriented to transactions on specific transit lines or specific locations where the O/D data is queried from the system-wide data collected for a specified date or dates. The Linked Trip application pairs the initial line to the final line of each linked trip. The Line O/D application maps destinations of trips originating on the transit line. The Area O/D application maps origins of trips destined to a study area.

The Los Angeles County Metropolitan Transportation Authority (LACMTA) operates a “TAP Card” fare system with smart cards. One serendipitous benefit of this system is that it allows the agency to generate origin/destination (O/D) trip information from stored TAP card data. The agency has been exploring methods to do so and is developing applications to exploit the results.

The main advantage of knowing O/D is that one can plan a delivery system where the initial stop and the final stop of a patron’s trip are known, even when more than one bus is used. Like many other large agencies, the LACMTA already knows where patrons board and alight within a route from automated passenger counters, but it has had to rely on sparse, expensive, infrequent and possibly biased on-board survey data to estimate trip linkages and final destinations.

BACKGROUND

Transportation modelers have had great hopes for smart card technology to supplement or replace origin/destination data from on-board surveys.
As far back as 2004 Bagchi and White summarized its advantages: (1) it would generate much larger volumes of data than surveys, (2) it would allow linkage of single trips to an individual’s travel on a single card, (3) it constitutes continuous trip data involving longer time periods than collectable via survey, and (4) insofar as the smart cards are subdivided by fare type, it would allow researchers to track different market segments. (1) Others have added that it would have low marginal cost because it is already collected for other agency applications, (2) the database would be more accurate than personal recollection by survey respondents, (3) and it would be available in weeks rather than in a year or more. (3) In addition, the low and declining response rate of surveys (often much lower than 20%) and the potential for self-selection bias among responders are a general concern among researchers. (4) This is not a particular concern about smart card users: the New York City Transit Authority (NYCTA) used travel diaries to test whether smart card subway patrons differed in travel pattern from non-card subway patrons. They did not. (3)

Most of the efforts involving O/D are for transportation modeling applications, particularly development of O/D matrices. (5) They tend to have the modeler’s preference for statistical assignment of uncertain data. For example, Rahbee uses two methods of scaling when data is missing or incomplete, where the model may “reassign the whole passenger or reassign the passenger fractionally”. (5) In addition, because data is usually aggregated, confidence objectives are at the “reasonable approximation” level (i.e., 90% accuracy) that was first stated by the NYCTA in its pioneering 2002 study. They further accept the NYCTA assumption that riders end their last trip of the day at or near the station stop of the first trip of the day, which NYCTA confirmed as a reasonable approximation for its target population of subway riders.
(2) (3)

This paper describes a more direct application to line specific or corridor/location data. It does not make the assumption that the first trip of the day is the mate of the last trip of the day; rather, it tests whether the data from two trips match. As the discussion will show, this comes at a considerable loss of usable matches but with greater trust in data validity. The goal was not to fill in matrices with reasonable approximations but to aid decision making via tables and maps based on valid data that has face validity for non-statistically trained personnel.

On-board surveys will continue to provide information on demographics, trip purpose, and other information that cannot be collected through automated fare and patronage counting systems. Nevertheless, automatically collected material can be organized to provide large volumes of data that is pertinent to route and system planning. It is collected daily and can therefore be used for time-series analysis, which surveys cannot provide.

How LACMTA’s TAP Card System Measures Up

Some providers, such as BART and WMATA, have entry/exit fare transactions on their rail lines. They have full O/D information within their rail systems as a result. The rest of the providers have entry-only fare transactions. Boarding stops are all recorded; intermediate link alighting stops can be easily inferred from the next link’s boarding stop. Final alighting stops, the linked-trip destinations, must be inferred from algorithms or rules of thumb. LACMTA has one of these systems.

LACMTA’s smart card data interface is fairly advanced among transit providers, especially multi-modal transit providers. Most providers have AVL systems that are not strongly linked to their smart card systems. Links must be forged to pursue O/D estimation. (5) (6) LACMTA is fortunate to have an integrated AVL/smart card system. All smart card transactions are both time-stamped and geo-stamped automatically.
But, as with all complex operations involving hardware, firmware, software and multi-system interfaces, system maintenance becomes a key to success. The project was undertaken, in part, to understand system weaknesses and to develop diagnostic reporting and procedures. This effort is not included in this paper except to say that the low match rates currently experienced will be improved by the maintenance reports and procedures that are being developed. LACMTA has succeeded with its APC system where other agencies have had difficulty because of this emphasis on diagnostics and maintenance; we expect similar results with our smart card systems.

WHAT AN O/D MAP CAN SHOW

The following map is a graphic example of the potential of O/D mapping. It is a Desire Line map of the origins of two stations and one stop, each with a different origin recruitment pattern. Desire Lines are “as-the-crow-flies” representations of travel from origin to destination. Desire Line maps, while dramatic, are no longer the method of choice for O/D mapping because they mask detail. Nevertheless, the Desire Line map very clearly shows how the patterns differ.

The red Desire Lines connect origins to Metro Center rail and bus stops. Metro Center has the highest number of origins of any destination area of Los Angeles’s Metro system (the system operated by LACMTA; a different location, Union Station, has more patrons when the Metrolink and Amtrak rail systems are included). Metro Center has an extremely wide origin recruitment area with heavy recruitment along the Metro Rail and Harbor Freeway corridors.

The blue Desire Lines connect origins to El Monte Station, the busway station with the largest number of origins. It too has a wide recruitment area, but mostly from the San Gabriel Valley and Downtown Los Angeles extending throughout the Metro Red Line.
The green Desire Lines connect origins to 3rd and Vermont, the bus stop with the largest number of origins among bus areas not adjacent to a rail or busway station. Its origin recruitment area, while larger than many stops, is much more localized than the other two.

GOALS OF THE O/D PROJECT

The primary goal of the O/D project was to generate map and table applications that can be used for scheduling new service and revising existing service. LACMTA has been incrementally introducing multi-use smart cards as fare media. About 45% of all fare transactions currently involve these cards. Generating O/D data from the card transactions will eventually be seamless. The present trial run was as much an effort to troubleshoot as it was an effort to develop the O/D applications.

A variety of applications were developed but three seem most promising. Two of these are O/D map applications, and the third is a linked trip application. The most intuitive map application is the map of destinations from an originating transit line; it can be discussed in this section. The other two applications must wait until after the methodology/data definition section.

O/D mapping can show how each transit line interfaces with other transit lines to distribute patrons. Below is one of a series of maps that were used to study transit usage in the San Fernando Valley. It is a map of destinations from Orange Line origins. While there were no surprises about the major distribution patterns, there were some about relative pattern strength and the limits of the catchment area.

Inferences from the map:
• The Orange Line itself was a very frequent destination among people originating on the line. This not only represents destinations in the vicinity of the stations, but also park & ride interface and, for a few stations, transfer to non-Metro transit providers.
• The Line distributes patrons throughout the Valley via other Metro Bus lines.
• The Line distributes patrons all along the Red and Purple Lines, but not the Blue or Gold Lines.
• Via the Red Line it distributes patrons through third Metro Bus links to Hollywood and downtown Los Angeles.
• It distributes a small but concentrated number of patrons to Westwood via Line 761.

This map does not show trips where the Orange Line is an intermediate link on a 3-link trip, which would necessarily be shown on another map.

METHODOLOGY/DATA DEFINITION TASKS

Data definition can be broken into two main tasks: (1) creating linked trips from TAP records and (2) inferring the final link’s alighting stop (the linked trip’s destination).

Creating Linked Trips

The smart card dataset is organized by smart card identification number and date. On any given date on any given card there may be one or many fare transactions. Each transaction represents a boarding which, in this context, is called a “link”. The question is how to decide which links are parts of a linked trip.

According to Chu and Chapleau, “the identification of linked trips in previous studies is solely based on a fixed temporal threshold between transactions”. They cite a variety of thresholds: (1) a transfer occurs if wait time is less than 60 minutes, (2) less than a 90 minute elapsed time between successive boardings, and (3) less than a 30 minute elapsed time between successive boardings. The problem with arbitrary temporal thresholds is that they do not account for variation in trip length and service levels. As Chu and Chapleau put it, “this would destroy the disaggregate property of the data”. (7) In their case study their solution was to create a “spatial-temporal path” between successive trips.
This was a several step process: (1) Boarding time was obtained from the fare transaction, (2) alighting time of the cardholder at the stop for the prior trip was obtained from the boarding data of other passengers at that stop of that trip, and (3) if no boarding took place, then it was interpolated from other passenger boardings at other stops. (4) Then the distance between stops was found if walk distance is involved. (5) A walk speed of 1.2 m/second (2.7 mph) is applied (6) with an added 5 minutes to account for variations in walk speed.

Chu and Chapleau’s advocacy of a spatial-temporal context is an improvement over a fixed temporal element but their method is more applicable to a small scale study. In LACMTA’s case where there are millions of TAP transactions every week, referring to boarding times of other passengers to attach to the cardholder’s transaction is a very large processing speed bump, especially if further interpolation is sometimes required.

Instead, LACMTA uses the speed, in mph, implied between the cardholder’s successive boardings. In this case the spatial-temporal context is that a link becomes part of the linked trip if the distance between successive boardings divided by the time elapsed between them is greater than 3 mph, i.e., the patron must have reached the next boarding faster than walking speed if that boarding is to be part of the same linked trip. It should be noted that the 3 mph is really more a characteristic of the service provided than the speed of the passenger. There were very few instances where Metro service, including headways, was slower than 3 mph between any two connecting lines at any two stops (almost all in Downtown LA, and these were rare). A different mph would be appropriate in other cities.

Inferring the Final Destination of the Linked Trip

TAP card data clearly indicates where a patron boards; it is where he taps his card. The initial boarding of a trip is the trip origin.
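The speed-based linking rule described under Creating Linked Trips can be sketched roughly as follows. This is an illustrative sketch only, not LACMTA’s production code: the `Boarding` record, the `link_trips` function, the use of straight-line (haversine) distance between geo-stamps, and the handling of zero elapsed time are all assumptions made for the example. Only the 3 mph threshold comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

MIN_LINK_SPEED_MPH = 3.0     # threshold quoted in the text; other cities may differ
EARTH_RADIUS_MILES = 3958.8


@dataclass
class Boarding:
    """One fare transaction (a "link"); field names are hypothetical."""
    card_id: str
    time: datetime
    lat: float
    lon: float
    line: str


def miles_between(a, b):
    """Straight-line (haversine) distance in miles; an assumed distance measure."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


def link_trips(boardings):
    """Group one card's boardings for one date into linked trips."""
    trips = []
    for b in sorted(boardings, key=lambda x: x.time):
        if trips:
            prev = trips[-1][-1]
            hours = (b.time - prev.time).total_seconds() / 3600
            # Zero elapsed time is treated as a continuation (an assumption).
            if hours <= 0 or miles_between(prev, b) / hours > MIN_LINK_SPEED_MPH:
                trips[-1].append(b)
                continue
        trips.append([b])  # too slow to be a transfer: start a new linked trip
    return trips
```

Under these assumptions, a boarding about 5.7 miles away half an hour after the previous one is joined to the same linked trip, while a boarding at the same location many hours later starts a new one.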
In this exercise one must also find out where he finally alights from the trip’s last link because this is his destination. TAP cards do not directly say where this happens because there is no card tapping upon alighting, so it must be inferred.

The basic method of inferring final destinations starts with finding linked trips that can be matched as the inbound and outbound trips of a “round trip”. Once done, the initial boarding stop of the first linked trip is identified as the final alighting stop of the later linked trip and the initial boarding stop of the later linked trip is identified as the final alighting stop of the first linked trip. The destinations of each of the linked trips are thereby inferred.

There are two separate parts to a round trip, both of which are linked trips. (Multiple site tours are discussed in the “Some Weaknesses in Matching Round Trips” section.) For example, in a home to work round trip, the first linked trip is from home to work and the second linked trip is from work back to home. Note that a “linked trip” can be a one-link trip if no other links were found to be part of it.

As soon as linked trips are identified, the boarding stops of the intermediate links can be eliminated and pertinent information from the initial (the origin) link and the final (the destination) link can be merged into one record. These records present an opportunity to infer Trip Origin Stop - Trip Destination Stop matches (OD). The following graphic shows what is known from TAP data and how OD can be inferred from it. The red and orange must each match to make an O/D pair.

[Graphic: Line to Area Match, by Direction of Travel]

Outgoing Trip                                                   Incoming Trip
First Link’s Boarding Stop Area’s Associated Line Numbers  <->  Last Link’s Line
Last Link’s Line                                           <->  First Link’s Boarding Stop Area’s Associated Line Numbers

The result is that the Outgoing First Link’s Boarding Stop Area will be named the Origin and the Incoming First Link’s Boarding Stop Area will be named the Destination.
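The Line to Area match just described can be illustrated with a short sketch. It is hedged throughout: the `LinkedTrip` record, its field names, and `match_round_trip` are hypothetical, and the sketch assumes each merged linked-trip record already carries the set of line numbers associated with its first boarding stop area.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class LinkedTrip:
    """Merged origin-link / final-link record; field names are hypothetical."""
    first_stop_area: str               # stop area of the first link's boarding
    first_area_lines: FrozenSet[str]   # line numbers associated with that area
    last_line: str                     # line boarded on the final link


def match_round_trip(outgoing, incoming):
    """Return inferred O/D pairs if both Line to Area matches hold, else None."""
    both_match = (
        incoming.last_line in outgoing.first_area_lines
        and outgoing.last_line in incoming.first_area_lines
    )
    if not both_match:
        return None
    return {
        # (origin, destination) for each half of the round trip
        "outgoing": (outgoing.first_stop_area, incoming.first_stop_area),
        "incoming": (incoming.first_stop_area, outgoing.first_stop_area),
    }
```

For example, an outgoing trip that starts in an area served by Lines 2 and 4 and ends on Line 802 would pair with a later trip that starts in an area served by Line 802 and ends on Line 4; if either condition fails, no O/D is inferred for the pair.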
In this application the definition of “stop area” is important. Presently, stop area is defined as a 350 meter circle (just over 1/5 mile) around the stop. This allows for travel to and from bus and rail stops and bus depot stops in selected areas, which while very few, are very busy. This could have been extended to ¼ mile, which is the walk distance many transportation models use to represent the distance people are willing to walk to a bus stop. It was not done for two reasons. The first is that so many options would be available in some areas, especially downtown LA, that matching would become an empty exercise. The second is that in travel surveys walk distance preference questions are about stop distance to and from the true origins and destinations, not distance between transit stops. We may revisit this restrictive approach in the future, but for the present, a distance over 350 meters between two stop areas voids the trip match.

On the other hand, the above is a little looser construct than matching First Link’s Line to Last Link’s Line because Metro’s system has stops where a person returning to the same place could choose more than one line since each would make his desired connections.

The matching criteria were amended even further because TAP cards currently often record only the parent line of a bus run; if it is on a branch line for part of the day, then the wrong line will be recorded and no match will be made. The geo-stamp is unaffected by line assignment, so using geographic assignment would increase the number of matches.
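One way to picture a stop area and its associated line numbers is the sketch below. It rests on several assumptions that are not stated in the text: 350 meters is treated as a radius around the stop, distance is approximated with a flat-earth (equirectangular) formula that is adequate over a few hundred meters, and stops are plain dictionaries with hypothetical `lat`, `lon` and `lines` keys.

```python
from math import cos, hypot, radians

STOP_AREA_RADIUS_M = 350.0    # from the text; treated here as a radius (assumption)
EARTH_RADIUS_M = 6_371_000.0


def meters_between(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of distance in meters (short ranges only)."""
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return EARTH_RADIUS_M * hypot(x, y)


def associated_lines(stop, all_stops, radius_m=STOP_AREA_RADIUS_M):
    """Line numbers serving any stop that falls inside this stop's area."""
    lines = set()
    for other in all_stops:
        d = meters_between(stop["lat"], stop["lon"], other["lat"], other["lon"])
        if d <= radius_m:
            lines |= set(other["lines"])
    return lines
```

Because membership depends only on the geo-stamp, a boarding recorded under the wrong (parent) line number would still fall in the correct stop area, which is the property the amended criteria rely on.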
Here is a graphic similar to the Line to Area graphic, containing the criteria that match Area to Area rather than Line to Area:

[Graphic: Stop Area Match, by Direction of Travel]

Outgoing Trip                                                   Incoming Trip
First Link’s Boarding Stop Area’s Associated Line Numbers  <->  Last Link’s Boarding Stop Area’s Associated Line Numbers
Last Link’s Boarding Stop Area’s Associated Line Numbers   <->  First Link’s Boarding Stop Area’s Associated Line Numbers

Here again, the result is that the Outgoing First Link’s Boarding Stop Area will be named the Origin and the Incoming First Link’s Boarding Stop Area will be named the Destination. A beneficial side-effect is that Areas are determined by geo-stamps; current problems (that are hopefully temporary) with proper designation of lines are thereby avoided.

TABLE AND MAP APPLICATIONS

There are a number of applications that have been developed from this process. The first of these is derived from the designation of linked trips, prior to matching initial and return trips. This is important because of some weaknesses in round-trip matching: many trips are not matched and there is no way to determine if the ones selected represent the population of trips.

Some Weaknesses in Matching Round Trips

Only 38.3% of all smart card transactions were given origins and destinations through the matching method. This low return is not a problem per se, but it would be if matches do not result in a representative sample of the total population of trips.

At the present time many unmatched trips may be due to a recording problem. Two major systems must talk to each other: the passenger counter (APC) system which geo-stamps the boarding and the farebox system which records the TAP. Any failure to make the boarding geostamp part of the TAP record of any multi-link trip, or any part of the round trip, will nullify the ability to match. (Tracking system integration error will be part of another paper.)
Insofar as system error is randomly distributed, this does not substantially contribute to a concern about how representative round trips are of the general population of trips. Currently there is a nonrandom distribution of system error among fareboxes; they are far less likely to talk with the APC systems aboard buses on contracted lines than aboard buses on directly-operated lines. This is an installation problem, not a permanent one. In Metro’s APC experience, non-random patterns that are found are diagnosed, corrected and eliminated. No general non-random pattern, such as a relationship to boarding frequency, has been found.

Another possible explanation is that there was no matching trip for a given trip. Some trips are not part of round trips because the return trip is made on another mode (e.g., a car or bicycle is used). In other cases, the return trip could have been made on a subsequent day.

There are also two situations in which trip tours are not matched as round trips. First, some Metro trips involve transfers to or from other operators. Since only Metro trips are being tracked, the other operator links would not appear and matching would fail. Second, there could be a trip tour with no way to designate which is the origin and which is the destination. This could represent a strong bias in some localities. A particular instance was a stop area serving a commuter college where many trips involve going from home to work to school and then home again. The computer could not break this tour into two matching components of a round trip without additional knowledge about primary destination, which is not collected at the fare box.

These non-random examples lead to questions of sample bias that can only be partially addressed in this paper. The data is a biased sample of the overall population of trips. It primarily represents the travel behavior of regular users who make round trips directly to and from school or work.
The commuter college example shows that in some locales the results will not be useful. But the bias potential should be understood in context. LACMTA on-board surveys indicate that 82% of riders use the service 5 or more days a week, and 82% of this group’s trip productions are either home-work or home-school. Fare card users primarily come from this majority group, even if it is an open question as to how their travel differs from others’.

Several modelers have attempted to coordinate the O/D information with on/off passenger counts. Their efforts focus on transfer estimates in restricted neighborhoods; (8) (9) there is no methodology extant for application to wide areas with multi-link transfers. The tack taken in this paper is to retain the original O/D of the very large sample (millions of cases per week) knowing that it represents behavior of the core group of users of the system.

Line Destiny Report

In contrast to the 38.3% match rate, linked trip attribution was successful for 75.2% of the smart card transactions. Designating the linked trip does not yet identify the final destination of the trip but it does allow the identification of the initial link and final link of each trip. LACMTA’s new “Line Destiny” report is the result. The report rank orders transfers to lines from any given line. The report is generated with more matches, and involves fewer inferences, than LACMTA’s O/D applications. The example below is only the first page of the Line Destiny report which shows every line in the system.

The report is illuminating. Staff is well aware that Los Angeles’ Metro has the highest proportion of multi-link trips in the country but the report shows that the basic travel pattern is still the one-link trip. The report shows that 57.3% of all trips involve only one link. As to multi-link trips, linkage from any given line is widely distributed among transfer points - to many lines.
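A report of this kind could be tabulated along the following lines. This is a sketch under stated assumptions rather than the agency’s actual routine: the function name `line_destiny`, the input format (one `(original_line, final_line)` pair per linked trip), and the returned structure are hypothetical. The 2.5% display cutoff is the one stated in the report’s own subtitle.

```python
from collections import Counter, defaultdict


def line_destiny(linked_trips, min_percent=2.5):
    """Rank final lines for each original line.

    linked_trips: iterable of (original_line, final_line) pairs,
    one pair per linked trip (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for original, final in linked_trips:
        counts[original][final] += 1

    report = {}
    for original, finals in counts.items():
        total = sum(finals.values())
        rows, cumulative = [], 0.0
        # most_common() yields final lines in descending order of frequency
        for final, freq in finals.most_common():
            percent = 100.0 * freq / total
            cumulative += percent
            if percent > min_percent:  # show only destinations above the cutoff
                rows.append((final, freq, round(percent, 1), round(cumulative, 1)))
        report[original] = {"rows": rows, "total": total}
    return report
```

A one-link trip appears as a pair whose original and final line are the same, so the first row for each line gives the share of trips that never leave that line.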
The report shows that, from any given line, the median for the highest percentage of trips destined to end on another specific line is 4.0%. Only six lines have 10% of their patrons destined to a specific line. The highest among these are patrons on Metro Rail’s Gold Line: 21.3% of its patrons are destined to end their trip on Line 802 (the designation for Metro’s two heavy rail routes that share a large corridor segment). Obviously, the Gold Line and the heavy rail routes are closely inter-related; planning and scheduling should be approached with this in mind.

Another interesting general finding is that no Rapid Line has over 10% of its patrons transfer to the companion local line that travels the same corridor. When Rapid was first proposed it was assumed that Rapid Line patrons would begin or complete their trip on the parallel local line; this is not empirically supported. Patrons may game whether to board the Local or Rapid Line but once aboard they do not tend to subsequently transfer to the parallel service.

Line Destiny Report for Weekdays Sept. 7-13, 2010
(Only destinations with over 2.5% of origin boardings)

Original Line   Final Line                          Cumulative
Boarded         Boarded      Frequency   Percent    Percent
2               2               17,532      65.8       65.8
                Total           26,658     100.0
4               4               20,023      67.6       67.6
                802                789       2.7       70.2
                Total           29,632     100.0
10              10              10,506      61.5       61.5
                Total           17,077     100.0
14              14              13,288      57.9       57.9
                204                815       3.6       61.5
                207                710       3.1       64.6
                754                604       2.6       67.2
                Total           22,939     100.0

O/D Applications

The O/D applications discussed in this paper were developed with schedule makers and service planners in mind: the applications focus on specific lines or specific places. The data is prepared for all smart card transactions that can be matched to generate linked trips with origins and destinations. Data is then queried and mapped for specific places or lines. The Line O/D map application has already been discussed.
The discussion below focuses upon the Area O/D application, a map of origins and destinations related to a specified geographic area.

The Area application was applied to Westwood. The map below was part of a series of spatial analyses that began with questions about where to shorten Line 761, which travels along Van Nuys Boulevard in the San Fernando Valley and then crosses the Santa Monica Mountains to end in Westwood. How far up Van Nuys Boulevard was travel demand to and from Westwood? The initial query set up several Van Nuys corridor maps and Line 761 maps, each map generating more questions and more maps. Once an O/D dataset is generated, maps can be drawn to answer location-specific queries as required. The discussion segued to a question about travel to Westwood in general (inspired by the question, “How typical is the long-distance travel from Van Nuys to Westwood?”).

The first map shows the area defined as “Westwood”. It is somewhat different than the standard demarcation of the area. As part of an ongoing process, “Westwood” was defined as areas in or near Westwood that people on the Van Nuys corridor traveled to. Census tracts were used in this study because they have demographic information attached to them; but any standard set of polygons could have been used such as TAZs, Census block-groups, or Census blocks. Use of demographics will not be discussed in this paper, except to say that in real-life applications lots of resources are often used to answer lots of questions.

The second map shows Census Tracts color coded by origins and destinations, where the color represents intensity of travel (number of trips). Destinations are represented by colored outlines of the Westwood tracts; Origins are represented by the colored interiors of tracts both in and outside of Westwood.

The study findings were presented in-house in the following slide.
Findings
• There are two destination tracts in Westwood that dwarf all the others: UCLA (238 trips) and the tract along and south of Wilshire by Westwood Boulevard (128 trips)
  – An optimal stop on the subway to the sea would be on Wilshire between these two census tracts.
• Most of the origin tracts lie on three main corridors: Van Nuys (with a short jog on Ventura), Wilshire/Whittier, and Sunset
  – The heavy origins are as far north as Nordhoff on Van Nuys
  – The heavy origins stretch very far to the east on both Wilshire/Whittier and Sunset
• They trace out a strong path for potential corridors of the subway to the sea.
  – All three corridors represent some long trip-making.
• The UCLA tract has the most origins, which indicates travel within Westwood.
  – These are short trips.

The findings are not important to this paper in themselves; they are exemplary in showing the utility of the O/D material, and how easily the data can be focused and re-focused to regional, line or local area considerations as discussions and queries evolve.

NEXT STEPS

The project was not only trying to explore applications, it was also an exploration of data reliability. Multiple complex systems talk to one another to generate the data, and in such situations there are breakdowns of equipment, software, and firmware. In the immediate future the focus will be on farebox data errata, data structure requirements, and diagnostic reporting and tracking. A major reason LACMTA’s APC system is so reliable is that user and maintenance departments have treated errata, and errata tracking, as “telltales” that ensure identification of what needs to be fixed and where it is hiding.

One of the oversights of the current project was to discard intermediate link information once the origin and destination links were identified. Such information is necessary for investigating critical paths and calculating links per trip.
It would also allow the mapping of O/D where the intermediate link is the transit line that is being investigated. The ultimate goal is the development of data structures and routines for regular, routine processing of O/D datasets for queries and mapping.

The question of sample bias is always on the research agenda. Ameliorating the bias with on/off data is a very attractive proposal by several researchers. It must be considered in the framework of applications that can generate outcomes for very large datasets that are geographically widespread. Further research must also be undertaken on the extent to which travel in the matched sample represents total travel. And this research should not restrict itself to whether the matched sample represents fare card user travel. It is still problematic as to whether fare card user travel represents all travel. The NYCTA finding that it is representative in that city may be perfectly true and still say nothing about transit users elsewhere in the United States.

Even if the data from automatic sources is somewhat biased, it has benefits that survey data lacks. Recognizing, exploiting and combining the strengths and weaknesses of data collected by diverse methods are going to be major endeavors in the coming decade. With relational databases we can already force fit diverse data sources; doing so in a valid manner is the challenge. The real advances will be by researchers who can offer valid ways in which the massive datasets can routinely calibrate or be calibrated to (and can supplement or be supplemented by) the intentionally developed and controlled survey data.

REFERENCES

(1) Bagchi, M. and White, P.R. The Potential of Public Transport Smart Card Data, Transport Policy, vol. 12, 2004, pp. 464-474.
(2) Farzin, J. Constructing an Automated Bus O-D Matrix Using Smart Card and GPS Data in Sao Paulo, Transportation Research Record #2072, 2008, pp. 30-37.
(3) Barry, J., Newhouser, R., Rahbee, A. and Sayeda, S. Using Automated Fare System Data, Transportation Research Record #1817, 2002, pp. 183-187.
(4) Stopher, P. The Travel Survey Toolkit: Where to From Here?, Keynote paper, 8th International Conference on Travel Survey Methods, 2009 (contact: peters@itls.ussyd.edu.au).
(5) Rahbee, A. Smart Card Passenger Flow Model at CTA, Transportation Research Record #2072, 2008, pp. 3-9.
(6) Wang, W., Attanucci, J. and Wilson, N. A Study of Bus Passenger O-D and Travel Behavior using Automated Data Collection Systems in London, unpublished manuscript, 2010 (contact: winniewang@worldbank.org).
(7) Chu, K. and Chapleau, R. Enriching Archived Smart Card Transaction Data for Transit Demand Modeling, Transportation Research Record #2063, 2008, pp. 63-72.
(8) Navick, D. and Furth, P. Estimating Passenger Miles, Origin-Destination Patterns, and Loads with Location Stamped Farebox Data, Transportation Research Record #1799, 2002, pp. 107-113.
(9) Cui, A. Bus Passenger Origin-Destination Matrix Estimation Using Automated Data Collection Systems, Master’s Thesis, MIT, 2006.