Modelling Rail Passenger Movements through e-Science Methods

Jeremy Cohen (a), Claire James (b), Shamim Rahman (b), Vasa Curcin (a), Brian Ball (b), Yike Guo (a),
John Darlington (a)
(a) London e-Science Centre, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
(b) AEA Technology Rail, Central House, Upper Woburn Place, London WC1H 0JN, UK
Abstract. The UK’s railway network is extensive and utilised by many millions of passengers each day. Passenger and train movements around the network create large amounts of
data: details of tickets sold, passengers entering and exiting stations, train movements and so on.
Knowing how passengers want to use the network is critical in planning services that meet
their requirements. However, understanding and managing the vast amounts of data generated daily is not easy using traditional methods. We show how, utilising e-Science methods,
it is possible to make understanding this data easier and help the various stakeholders within
the rail industry to more accurately plan their operations and offer more efficient services
that better meet the requirements of passengers.
The UK’s railway network is large and complex.
To plan and set strategy, government and other
stakeholders need to know how passengers travel
on the rail network, and the level of demand
for rail services now and in the future. Managing such a network is difficult: operators need to know how frequently to run
trains, to which locations and at what times of
day. If there is a demand for more frequent services or longer trains on certain routes, it is in
an operator’s interests to provide these. However, contrary to the belief of the average traveller, adding a new service or running a
longer train is not straightforward. The timetables that
control the movements of trains around the network require complex computational models to
produce. But how do operators know when these
enhancements or changes are, or will be, necessary? We aim to tackle this question through
our work in the Department for Transport (DfT)
funded project, under the Horizons research programme, “Effective Tracking of Rail Passenger
We begin by looking at the types of rail network data that are available and how these are
useful to the project. We then introduce the Discovery Net [8] platform that we have adopted to
aid our research and show how it has been used
for the necessary data mining and modelling. We
conclude by presenting the results of the work.
Rail Data
There are currently many types of passenger
journey data collected in the rail industry. The
data may take a snapshot view of passenger
journeys, where the data is collected manually
(surveys and counts on trains), or be a continuous feed if collected electronically (through
automatic gates or ticket sales). Each source of
data is characterised by its attributes such as the
method of capture, collation and processing, the
size of the dataset, and its coverage (in terms of
time period and geography). The challenge lies
in combining these data sources effectively, so as
to derive the most benefit in terms of understanding passenger movements.
Types of rail data
Some of the types of data available are:
• Guards/Conductor Counts: Manual
counts taken by train guards/conductors.
• Terminus Counts: Manual counts carried
out at terminus stations.
• Automatic Ticket Gates: Data collected
by automatic ticket gates installed at many stations.
• Automatic Passenger Counts: Provided
by weighing equipment fitted on more modern trains.
• Ticket Sales: Centrally collected data on
National Rail tickets sold at stations and
some third party ticket sales agencies.
• London Area Travel Survey (LATS):
Effectively a census on travel patterns within
the London area. Carried out once every ten
years; the most recent survey was conducted in 2001.
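The distinction drawn above between manually collected snapshot data and electronically collected continuous feeds can be captured in a small catalogue structure. The following is a minimal sketch only: the class and field names (`RailDataSource`, `Capture`, `view`) are hypothetical and are not part of any system described in this paper, and classifying the automatic passenger counts as a continuous feed is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Capture(Enum):
    """How a data source is captured (hypothetical classification)."""
    MANUAL = "manual"
    ELECTRONIC = "electronic"


@dataclass(frozen=True)
class RailDataSource:
    name: str
    capture: Capture
    note: str

    @property
    def view(self) -> str:
        # Following the text: manual collection gives a snapshot view,
        # electronic collection gives a continuous feed.
        return "snapshot" if self.capture is Capture.MANUAL else "continuous"


SOURCES = [
    RailDataSource("Guards/Conductor Counts", Capture.MANUAL,
                   "manual counts taken by train guards/conductors"),
    RailDataSource("Terminus Counts", Capture.MANUAL,
                   "manual counts at terminus stations"),
    RailDataSource("Automatic Ticket Gates", Capture.ELECTRONIC,
                   "data collected by automatic ticket gates"),
    # Assumption: weighing equipment on trains is treated as an electronic feed.
    RailDataSource("Automatic Passenger Counts", Capture.ELECTRONIC,
                   "weighing equipment on more modern trains"),
    RailDataSource("Ticket Sales", Capture.ELECTRONIC,
                   "centrally collected National Rail ticket sales"),
    RailDataSource("London Area Travel Survey (LATS)", Capture.MANUAL,
                   "travel survey of the London area, last run in 2001"),
]


def sources_with_view(view: str) -> List[RailDataSource]:
    """Return the catalogued sources offering the given view of journeys."""
    return [s for s in SOURCES if s.view == view]


for source in SOURCES:
    print(f"{source.name}: {source.capture.value}, {source.view}")
```

A catalogue of this kind makes it straightforward to ask, for example, which sources could supply a continuous feed for a given analysis.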
Use Cases
Through consultation with key industry stakeholders, a number of use cases were formulated
to cover the various ways in which passenger
journey data is utilised by different users within
the rail industry. These use cases helped highlight key areas where enhanced data was felt to
be of greatest importance, as well as focusing
the research into those areas where combining
the existing passenger journeys data would enable types of analysis and decision-making currently limited by the nature of the separate data
One example of a use case is the Route Utilisation Studies (RUS) carried out by Network Rail
to develop efficient capacity plans that match
stakeholders’ requirements. These studies demand a detailed understanding of travel patterns on particular train routes, including information about train loads on these routes. This
is a prime example of a recurring process where
multiple sources of data are used to reach a
decision. By providing easily accessible, accurate data, a centralised journey data system would
improve the efficiency of these studies and reduce the cost of
carrying them out.
Discovery Net
Discovery Net [1, 2, 6, 8] is an EPSRC e-Science
Pilot Project based at Imperial College London.
The project has developed a service-based computing infrastructure for high throughput informatics that supports the integration and analysis of data collected from various high throughput devices. This infrastructure has been designed and implemented based on a workflow
model, allowing the composition of data analysis
services and resources declared as Web/grid services. The Discovery Net infrastructure is currently used by research scientists worldwide to
conduct complex scientific data analysis in three
important research areas: Life Sciences, Geohazard Modelling, and Environmental Modelling.
The problem posed by the Horizons research
project is characterised by a set of key properties
that Discovery Net is particularly well suited to
working with: large quantities of data, geographically distributed data sources, a wide variety of different questions that need to be answered, visualisation of results, and dissemination to both technical and operational staff.
Fig. 1. The InforSense KDE client interface displaying one of the workflows developed for this project.
Given the relatively short one-year duration of
the project, development of a bespoke platform
was not feasible. The Discovery Net platform provides all the required features and was
therefore adopted as the underlying framework
for our system. We are utilising the InforSense
KDE software, which is based on the research
outputs of the Discovery Net project.
Extracting information through Discovery Net
In order to effectively manage railway network
capacity, it is necessary to know how passengers use the network. There are many different
ticket types, routes and operators. In the case
of smaller stations or less complex routes it may
be possible to know the operator of the train
that a passenger takes, and perhaps even the
specific train, but in the case of large cities such
as London or Birmingham, knowing which train
a passenger has taken from a major interchange
station may be almost impossible. As shown
in section 2, there is no shortage of
data available on many routes; the challenge is
to link and interpret the data to reliably understand how passengers use the services
available to them.
Discovery Net enables us to rapidly prototype possible solutions and extract the more detailed information we require from the available
data. This work utilises many of the ideas behind Computational Grids [4, 5] that underpin
e-Science and promote easier access to high performance computing resources for carrying out
large-scale science.
The domain specific knowledge of the required
analysis is held by the project members from
AEA Technology Rail. Our aim is to encapsulate this knowledge within the computational
platform, in the form of workflows. We worked
in pairs, each consisting of an e-Science specialist and a rail data
specialist, to translate the analysis descriptions into Discovery Net workflows. The collaborative nature of the project, combined with
time constraints, led us to see this as the most
practical solution.
A workflow is a description of the processes
and data/control flows required to carry out a
task. The building blocks of workflows are components. In Discovery Net, a component can be
a data source or a process applied to some data.
A large toolbox of processes is available within
the system and custom components can be built.
One way of developing a custom component is to
use the scripting language Groovy [7]. Workflows
are built in the InforSense KDE client software
using an intuitive drag-and-drop interface to select components from the toolbox and add them
into the workflow.
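The idea of a workflow as a set of components joined by data flows can be sketched in a few lines of Python. This is a minimal illustration only, not the InforSense KDE or Discovery Net API: the `Workflow` class, the component names and the passenger figures are all hypothetical, and components are simply run one after another in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Sequence


class Workflow:
    """A toy workflow: named components plus the data flows between them."""

    def __init__(self) -> None:
        self.components: Dict[str, Callable[..., object]] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add(self, name: str, func: Callable[..., object],
            inputs: Sequence[str] = ()) -> None:
        """Register a component; `inputs` names the components feeding it."""
        self.components[name] = func
        self.inputs[name] = list(inputs)

    def run(self) -> Dict[str, object]:
        """Run every component after all of the components it depends on."""
        results: Dict[str, object] = {}
        for name in TopologicalSorter(self.inputs).static_order():
            args = [results[dep] for dep in self.inputs[name]]
            results[name] = self.components[name](*args)
        return results


wf = Workflow()
# Data-source components (the figures are made up for illustration).
wf.add("ticket_sales", lambda: {"A-B": 120, "A-C": 80})
wf.add("gate_counts", lambda: {"A": 210})
# Process components applied to the data.
wf.add("sold_from_A", lambda sales: sum(sales.values()),
       inputs=["ticket_sales"])
wf.add("unexplained_entries", lambda sold, gates: gates["A"] - sold,
       inputs=["sold_from_A", "gate_counts"])

results = wf.run()
print(results["unexplained_entries"])
```

Here the two data sources have no dependencies, while each process component names the components whose outputs it consumes, which is the same separation between data sources and processes described above.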
4.1 System Platform
A computational platform has been set up specifically for the project. The InforSense KDE
software system has been deployed on a 48 processor Sun Sparc IV based server operated by
the London e-Science Centre. This multiprocessor system allows for faster operation of the
InforSense engine when executing workflows that exhibit parallel features. The InforSense
software can connect to distributed data sources running on different servers in different
locations. In our demonstration environment, data is split between the London e-Science
Centre’s Oracle 10g server and data that has been imported directly into the InforSense server.
4.2 Workflow execution
Figure 1 shows the InforSense KDE client interface displaying one of the workflows we have
developed. Execution of the workflows can be carried out on a multiprocessor system which
offers significant speed benefits over a standard system. The Discovery Net architecture is
designed to take advantage of parallel machines by determining elements of a workflow that can
be executed in parallel and then taking advantage of multiple processors on a system by
executing these components concurrently. To simulate a fully distributed e-Science
architecture, we have used an external Oracle database containing warehoused data as one of
the data sources within our system.
Presentation of the results of our workflow processing utilises InforSense KDE’s Web-based
portal into which workflows can be published. A user logs into the system and is able to see
the set of workflows in their Web browser to which they have been granted access privileges.
The user can then change some attributes and execute the workflows available to them but may
not modify these workflows.
Completed workflows, based on our rail industry use cases, provide a table of results and one
or more visualisation options to allow industry decision makers to gain a simple visual view
of the results of the analysis task.
When linking data sources in order to interpolate and extract new information, accuracy of
the original data sources becomes very significant. By adding data sets together, the
inaccuracies of each data set are multiplied together, resulting in data that may have such a
high level of inaccuracy that it is effectively useless. In order to combat this risk, a
significant amount of time was devoted to carrying out a comprehensive assessment of the
accuracy of data sources and linked data that was to be used. These assessments were carried
out with reference to the quality dimensions used by the Office for National Statistics (ONS)
and defined for the European Statistical System [3].
We believe that although inaccuracies in source data are a significant problem in this kind
of work, our proposed solution takes these risks into account and makes an attempt to quantify
This work has shown how rail datasets can be combined together in a systematic way, using the
latest e-Science technologies, in order to maximise the useful information derived, and that
this combination can provide a vital contribution to tackling a range of standard rail
industry questions/studies with increased speed and accuracy. The project outputs can be
separated into the following elements:
1. Comprehensive and consistent documentation of all major sources of passenger journey data
existing in the rail industry.
2. Demonstration of how these may be combined to maximise the useful information,
for two particular examples:
(a) The ticket sales database (LENNON) and the London Area Travel Survey.
(b) The national rail timetable and on-train automatic passenger counts.
3. Confirmation that a wide range of rail industry questions/studies can be improved
by better combination of passenger data.
4. Incorporation of an example combination of rail data sources into a system, in order to
show how the wide range of different data sources may be combined systematically.
5. Illustrative prototype of the user interface, in the InforSense KDE software.
Using the Discovery Net platform, we have built workflows to represent a set of key analysis
tasks that help to answer questions rail industry managers need to address. Our flexible
solution provides two points of entry to satisfy the requirements of both technical and
operational staff and can present answers to important questions faster than existing
solutions, or in some cases, that cannot currently be answered by existing systems.
Acknowledgements: We would like to thank Ian Hawthorne and the Department for Transport who
have funded this project work under the Horizons Research Programme. We would also like to
thank the Train Operating Companies and other organisations who have provided data samples
without which this project could not have taken place.
