Modelling Rail Passenger Movements through e-Science Methods

Jeremy Cohen (a), Claire James (b), Shamim Rahman (b), Vasa Curcin (a), Brian Ball (b), Yike Guo (a), John Darlington (a)

(a) London e-Science Centre, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
(b) AEA Technology Rail, Central House, Upper Woburn Place, London WC1H 0JN, UK
Email: jeremy.cohen@imperial.ac.uk

Abstract. The UK’s railway network is extensive and utilised by many millions of passengers each day. Passenger and train movements around the network create large amounts of data: details of tickets sold, passengers entering and exiting stations, train movements and so on. Knowing how passengers want to use the network is critical in planning services that meet their requirements. However, understanding and managing the vast amounts of data generated daily is not easy using traditional methods. We show how, utilising e-Science methods, it is possible to make understanding this data easier and help the various stakeholders within the rail industry to plan their operations more accurately and offer more efficient services that better meet the requirements of passengers.

1 Introduction

The UK’s railway network is large and complex. To plan and set strategy, government and other stakeholders need to know how passengers travel on the rail network, and the level of demand for rail services now and in the future. Managing such a network is a difficult task: operators need to know how frequently to run trains, to which locations and at what times of day. If there is a demand for more frequent services or longer trains on certain routes, it is in an operator’s interests to provide these. However, contrary to the belief of the average traveller, adding a new service or running a longer train is not straightforward. The timetables that control the movements of trains around the network require complex computational models to produce.
But how do operators know when these enhancements or changes are, or will be, necessary? We aim to tackle this question through our work in the Department for Transport (DfT) funded project, under the Horizons research programme, "Effective Tracking of Rail Passenger Journeys". We begin by looking at the types of rail network data that are available and how these are useful to the project. We then introduce the Discovery Net [8] platform that we have adopted to aid our research and show how it has been used for the necessary data mining and modelling. We conclude by presenting the results of the work.

2 Rail Data

There are currently many types of passenger journey data collected in the rail industry. The data may take a snapshot view of passenger journeys, where the data is collected manually (surveys and counts on trains), or be a continuous feed if collected electronically (through automatic gates or ticket sales). Each source of data is characterised by its attributes such as the method of capture, collation and processing, the size of the dataset, and its coverage (in terms of time period and geography). The challenge lies in combining these data sources effectively, in a way that derives the most benefit in terms of understanding passenger movements.

2.1 Types of rail data

Some of the types of data available are:

• Guards/Conductor Counts: Manual counts taken by train guards/conductors.
• Terminus Counts: Manual counts carried out at terminus stations.
• Automatic Ticket Gates: Data collected by automatic ticket gates installed at many stations.
• Automatic Passenger Counts: Provided by weighing equipment fitted on more modern trains.
• Ticket Sales: Centrally collected data on National Rail tickets sold at stations and some third party ticket sales agencies.
• London Area Travel Survey (LATS): Effectively a census on travel patterns within the London area. Carried out once every ten years; the most recent survey was conducted in 2001.
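The attributes that characterise each source (method of capture, coverage) could be recorded in a small catalogue structure. The following Python sketch is purely illustrative: the class and field names are hypothetical, the coverage strings paraphrase the list above, and treating Automatic Passenger Counts as an electronic, continuous feed is an assumption rather than something stated by the project.

```python
from dataclasses import dataclass
from enum import Enum


class Capture(Enum):
    MANUAL = "manual"          # snapshot view: surveys and counts
    ELECTRONIC = "electronic"  # continuous feed: gates, ticket sales


@dataclass(frozen=True)
class DataSource:
    """Hypothetical catalogue entry for one passenger journey data source."""
    name: str
    capture: Capture
    coverage: str

    @property
    def is_continuous_feed(self) -> bool:
        # Assumption: electronically captured sources are continuous feeds.
        return self.capture is Capture.ELECTRONIC


CATALOGUE = [
    DataSource("Guards/Conductor Counts", Capture.MANUAL, "trains with a guard/conductor"),
    DataSource("Terminus Counts", Capture.MANUAL, "terminus stations"),
    DataSource("Automatic Ticket Gates", Capture.ELECTRONIC, "stations fitted with gates"),
    # Assumption: on-train weighing equipment reports electronically.
    DataSource("Automatic Passenger Counts", Capture.ELECTRONIC, "more modern trains"),
    DataSource("Ticket Sales", Capture.ELECTRONIC, "stations and some third party agencies"),
    DataSource("London Area Travel Survey", Capture.MANUAL, "London area, once every ten years"),
]

# Names of the sources that would arrive as a continuous feed.
continuous = [s.name for s in CATALOGUE if s.is_continuous_feed]
```

A catalogue of this kind makes it easy to ask which sources can be refreshed continuously and which only provide periodic snapshots.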

2.2 Use Cases

Through consultation with key industry stakeholders, a number of use cases were formulated to cover the various ways in which passenger journey data is utilised by different users within the rail industry. These use cases helped highlight key areas where enhanced data was felt to be of greatest importance, and focused the research on those areas where combining the existing passenger journey data would enable types of analysis and decision-making currently limited by the nature of the separate data sources.

One example of a use case is the Route Utilisation Studies (RUS) carried out by Network Rail to develop efficient capacity plans that match stakeholders’ requirements. These studies demand a detailed understanding of travel patterns on particular train routes, including information about train loads on these routes. This is a prime example of a recurring process where multiple sources of data are used to reach a decision. By providing easily accessible, accurate data, a centralised journey system would increase efficiency further through access to better quality information, and reduce the costs of carrying out these studies.

3 Discovery Net

Discovery Net [1, 6, 2, 8] is an EPSRC e-Science Pilot Project based at Imperial College London. The project has developed a service-based computing infrastructure for high throughput informatics that supports the integration and analysis of data collected from various high throughput devices. This infrastructure has been designed and implemented based on a workflow model, allowing the composition of data analysis services and resources declared as Web/grid services. The Discovery Net infrastructure is currently used by research scientists worldwide to conduct complex scientific data analysis in three important research areas: Life Sciences, Geohazard Modelling, and Environmental Modelling.
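As a rough illustration of what a workflow model means in practice, the Python sketch below composes a few components into a small dependency graph and runs each one once its inputs are available. It is not the Discovery Net or InforSense API: the function `run_workflow`, the component names and the passenger figures are all hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used only to order the components.

```python
from graphlib import TopologicalSorter


def run_workflow(components, dependencies):
    """Run every component after the components it depends on.

    components:   name -> callable that receives its upstream results in order
    dependencies: name -> list of upstream component names
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [results[dep] for dep in dependencies.get(name, [])]
        results[name] = components[name](*inputs)
    return results


# Two hypothetical data sources (made-up station codes and counts).
def load_entries():
    return {"STN_A": 100, "STN_B": 40}


def load_exits():
    return {"STN_A": 90, "STN_B": 60}


# A processing component that combines both sources.
def throughput(entries, exits):
    return {station: entries[station] + exits[station] for station in entries}


# A final component that summarises the combined data.
def busiest(totals):
    return max(totals, key=totals.get)


components = {
    "entries": load_entries,
    "exits": load_exits,
    "throughput": throughput,
    "busiest": busiest,
}
dependencies = {
    "throughput": ["entries", "exits"],
    "busiest": ["throughput"],
}

results = run_workflow(components, dependencies)
```

The two source components have no dependency on each other, so a more elaborate runner could execute them concurrently; this sketch runs everything sequentially for simplicity.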

The problem posed by the Horizons research project is characterised by a set of key properties that Discovery Net is particularly well suited to working with: large quantities of data, geographically distributed data sources, a wide variety of different questions that need to be answered, visualisation of results and dissemination to both technical and operational staff. Given the relatively short 1-year duration of the project, development of a bespoke platform would not be possible. The Discovery Net platform provides all the required features and was therefore adopted as the underlying framework for our system. We are utilising the InforSense KDE software, which is based on the research outputs of the Discovery Net project.

Fig. 1. The InforSense KDE client interface displaying one of the workflows developed for this project.

4 Extracting information through Discovery Net

In order to manage railway network capacity effectively, it is necessary to know how passengers use the network. There are many different ticket types, routes and operators. In the case of smaller stations or less complex routes it may be possible to know the operator of the train that a passenger takes, and perhaps even the specific train, but in the case of large cities such as London or Birmingham, knowing which train a passenger has taken from a major interchange station may be almost impossible. Section 2 showed that there is no shortage of data available on many routes; the challenge is to link and interpret the data to reliably understand how passengers use the services available to them. Discovery Net enables us to rapidly prototype possible solutions and extract the more detailed information we require from the available data. This work utilises many of the ideas behind Computational Grids [4, 5] that underpin e-Science and promote easier access to high performance computing resources for carrying out large-scale science.
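To make the idea of linking sources concrete, the following Python sketch joins two per-station datasets and reports where one source has no coverage. It is only an illustration under stated assumptions: the function names, station codes and counts are invented, and the project's actual linkage rules are not described at this level of detail.

```python
def link_by_station(ticket_sales, gate_entries):
    """Outer-join two station -> count mappings, keeping gaps as None."""
    stations = sorted(set(ticket_sales) | set(gate_entries))
    return {
        station: {
            "tickets_sold": ticket_sales.get(station),
            "gate_entries": gate_entries.get(station),
        }
        for station in stations
    }


def stations_missing_a_source(linked):
    """List stations that are covered by only one of the two sources."""
    return [station for station, row in linked.items() if None in row.values()]


# Hypothetical daily figures; real station codes and volumes will differ.
ticket_sales = {"STN_A": 1200, "STN_B": 300, "STN_C": 450}
gate_entries = {"STN_A": 1350, "STN_C": 400, "STN_D": 90}

linked = link_by_station(ticket_sales, gate_entries)
gaps = stations_missing_a_source(linked)
```

Keeping the gaps explicit, rather than silently dropping unmatched stations, makes it visible where a conclusion rests on a single source.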

The domain specific knowledge of the required analysis is held by the project members from AEA Technology Rail. Our aim is to encapsulate this knowledge within the computational platform, in the form of workflows. We worked in pairs consisting of an e-Science and rail data specialist in order to transfer the analysis description into Discovery Net workflows. The collaborative nature of the project combined with time constraints led us to see this as the most practical solution.

A workflow is a description of the processes and data/control flows required to carry out a task. The building blocks of workflows are components. In Discovery Net, a component can be a data source or a process applied to some data. A large toolbox of processes is available within the system and custom components can be built. One way of developing a custom component is to use the scripting language Groovy [7]. Workflows are built in the InforSense KDE client software using an intuitive drag-and-drop interface to select components from the toolbox and add them into the workflow.

4.1 System Platform

A computational platform has been set up specifically for the project. The InforSense KDE software system has been deployed on a 48 processor Sun Sparc IV based server operated by the London e-Science Centre. This multiprocessor system allows for faster operation of the InforSense engine when executing workflows that exhibit parallel features. The InforSense software can connect to distributed data sources running on different servers in different locations. In our demonstration environment, data is split between the London e-Science Centre’s Oracle 10g server and data that has been imported directly into the InforSense server.

4.2 Workflow execution

Figure 1 shows the InforSense KDE client interface displaying one of the workflows we have developed. Execution of the workflows can be carried out on a multiprocessor system which offers significant speed benefits over a standard system. The Discovery Net architecture is designed to take advantage of parallel machines by determining elements of a workflow that can be executed in parallel and then taking advantage of multiple processors on a system by executing these components concurrently. To simulate a fully distributed e-Science architecture, we have used an external Oracle database containing warehoused data as one of the data sources within our system.

5 Results

This work has shown how rail datasets can be combined together in a systematic way, using the latest e-Science technologies, in order to maximise the useful information derived, and that this combination can provide a vital contribution to tackling a range of standard rail industry questions / studies with increased speed and accuracy. The project outputs can be separated into the following elements:

1. Comprehensive and consistent documentation of all major sources of passenger journey data existing in the rail industry.
2. Demonstration of how these may be combined to maximise the useful information, for two particular examples:
   (a) The ticket sales database (LENNON) and the London Area Travel Survey.
   (b) The national rail timetable and on-train automatic passenger counts.
3. Confirmation that a wide range of rail industry questions/studies can be improved by better combination of passenger data sources.
4. Incorporation of an example combination of rail data sources into a system, in order to show how the wide range of different data sources may be combined systematically.
5. Illustrative prototype of the user interface, in the InforSense KDE software.

Presentation of the results of our workflow processing utilises InforSense KDE’s Web-based portal into which workflows can be published. A user logs into the system and is able to see the set of workflows in their Web browser to which they have been granted access privileges. The user can then change some attributes and execute the workflows available to them but may not modify these workflows.

Completed workflows, based on our rail industry use cases, provide a table of results and one or more visualisation options to allow industry decision makers to gain a simple visual view of the results of the analysis task.

5.1 Accuracy

When linking data sources in order to interpolate and extract new information, accuracy of the original data sources becomes very significant. By adding data sets together, the inaccuracies of each data set are multiplied together, resulting in data that may have such a high level of inaccuracy that it is effectively useless. In order to combat this risk, a significant amount of time was devoted to carrying out a comprehensive assessment of the accuracy of data sources and linked data that was to be used. These assessments were carried out with reference to the quality dimensions used by the Office for National Statistics (ONS) and defined for the European Statistical System [3].

We believe that although inaccuracies in source data are a significant problem in this kind of work, our proposed solution takes these risks into account and makes an attempt to quantify them. By working to eliminate accuracy issues in key areas that we have identified, the entities capturing the data we have used can provide more accurate results from our framework in the future.

6 Conclusion

We have aimed to show in a very concrete manner how e-Science research can be applied in the real world to tackle an important problem. Setting the strategy for, and the management of, the UK’s rail infrastructure is a difficult task, partly because each of the situations that need to be considered is so large. Many of these problems are therefore well suited to e-Science solutions.

Using the Discovery Net platform, we have built workflows to represent a set of key analysis tasks that help to answer questions rail industry managers need to address. Our flexible solution provides two points of entry to satisfy the requirements of both technical and operational staff and can present answers to important questions faster than existing solutions, or in some cases, that cannot currently be answered by existing systems.

Fig. 2. Data from distributed sources is processed within the Discovery Net system and disseminated to the relevant industry staff.

Acknowledgements: We would like to thank Ian Hawthorne and the Department for Transport who have funded this project work under the Horizons Research Programme. We would also like to thank the Train Operating Companies and other organisations who have provided data samples without which this project could not have taken place.

References

1. V. Curcin, M. Ghanem, Y. Guo, M. Kohler, A. Rowe, J. Syed, and P. Wendel. Discovery net: Towards a grid of knowledge discovery. In KDD-2002: The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 2002. http://www.discovery-on-the.net/documents/kdd-DNET.pdf.
2. V. Curcin, M. Ghanem, Y. Guo, A. Rowe, W. He, H. Pei, L. Qiang, and Y. Li. IT service infrastructure for integrative systems biology. In IEEE SCC 2004: IEEE Conference on Service Computing, Shanghai, China, September 2004. http://csdl.computer.org/comp/proceedings/scc/2004/2225/00/22250123abs.htm.
3. Office for National Statistics (ONS). Guidelines for measuring statistical quality. Available at http://www.statistics.gov.uk/StatBase/Product.asp?vlnk=13578, 2006.
4. I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, July 1998.
5. I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Open Grid Service Infrastructure WG, Global Grid Forum, June 2002.
6. M. Ghanem, N. Giannadakis, Y. Guo, and A. Rowe. Dynamic information integration for e-science. In UK e-Science All Hands Meeting, Sheffield, UK, September 2002.
7. Groovy. http://groovy.codehaus.org/.
8. Discovery Net. http://www.discovery-on-the.net/.