Scalable Management of Real-time Data Streams from Mobile Sources using Utility Computing

Jeremy Cohen (a), John Darlington (a), Robin North (b), John Polak (b)

(a) Internet Centre, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
(b) Centre for Transport Studies, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
Email: jeremy.cohen@imperial.ac.uk

Abstract. The amount of data generated by our everyday activities and movements is increasing rapidly as more and more data collection devices are installed around us. These include ticket barriers in stations, CCTV cameras, tills in shops and so on. However, these devices have generally been static: they are in known, fixed locations and retrieving data from them is usually relatively simple. With the enhancements in wireless data transmission technologies over the past few years, real-time data collection from mobile devices is now a realistic possibility. However, mobile devices are less predictable than static devices. They may join and leave the network unpredictably, and planning the computing capacity required to manage the network can be difficult. We propose a model that uses modern on-demand utility computing provision to develop a data-stream management fabric that scales up or down dynamically to support the number of devices in the network. This makes the most efficient use of hardware resources and lowers costs, since service providers only pay for the resources currently necessary to operate their network.

1 Introduction

Our environment is being filled with increasingly large numbers of data collection devices. Many of our everyday activities produce data that is stored and mined to better map our preferences, movements and intentions. Most of this data currently comes from static devices, such as CCTV cameras, ticket barriers at stations and roadside car number plate recognition cameras. With fixed devices, it is clear how many of them exist and where they are located. It can also be assumed that, while they are switched on and provided there is no technical fault, they will generate and send data in accordance with their monitoring schedule.

As mobile data networks have evolved, obtaining network connectivity while on the move has become commonplace. This brings about the possibility of data collection devices also becoming mobile. Data collection using mobile devices has been used in the past but has relied on the devices storing information locally and downloading it when reaching a base station. This is acceptable for the storage of data but impractical where the requirement is to use the information in near real time.

The EPSRC/DfT funded e-Science project, Mobile Environmental Sensing System Across Grid Environments (MESSAGE), is working to build an environmental sensing network that utilises a computational Grid environment to handle and mine the large quantities of data generated by a complex, geographically distributed fabric of environmental sensors. This is particularly relevant given the growing desire to estimate personal exposure levels in urban environments [7]. An earlier project, the UK e-Science Core Programme project "A Market for Computational Services" [4, 2], pioneered techniques for using on-demand, pay-per-use service-oriented computing and software, and this work builds on some of those techniques.
The work presented in this paper is being undertaken within the context of the MESSAGE project and therefore focuses on the use of pollution sensors; however, it is equally applicable to networks composed of a wide variety of mobile sensors. As such, this project builds on the research conducted in the DiscoveryNet project [9] to develop and demonstrate a mobile environmental sensing network capable of supporting near real-time decision making. Initial use cases include the provision of information for real-time optimisation of traffic control signals, and more detailed investigation of the health impacts of air pollution than has previously been possible.

Sensors are being developed that will be placed on various mobile objects, including motor vehicles, bicycles and, for the smallest sensors, people. These sensors will come into and out of the network in a manner that is unpredictable and therefore difficult to model. Vehicles may pass through a road tunnel, resulting in a short loss of network connectivity, or they may return to their depot or garage, resulting in them going offline for some period of time. Nevertheless, the network aims to ensure that sufficient data is available at any given time to satisfy the chosen use cases. It should also be noted that, in the case of the MESSAGE project, the data collected will be archived and utilised in modelling to provide historic trends and future predictions of pollution events. Methods of combining historic data with real-time information can be used to provide predictions of pollution concentrations when insufficient real-time data is available.

In this paper, the hardware and software architecture that has been developed to manage the distributed sensor network is described. Section 2 describes the sensor gateway architecture that we have developed, while section 3 looks at the utility computing environment that makes this possible. We then describe in section 4 a simulator that has been built to test the scalability of the system, and in section 5 analyse the costs of using a scalable utility platform in comparison with traditional approaches.

2 Sensor Gateway Architecture

The sensors in the network are connected to the rest of the world through a variety of technologies. Ultimately, all sensor data traffic flows onto a common IP network backbone; in the case of the initial work this is the Internet, although data could equally be securely tunnelled over the Internet into a private network. A variety of technologies are being tested for data transmission from sensors. These include 3G/GPRS cellular data transmission, WiFi/WiMAX and ad-hoc, peer-to-peer ZigBee-based communications. Each sensor has a unique ID and sensors will be location aware, through either onboard GPS receivers or wireless communication-based positioning [8].

Sensors may differ in design quite significantly. The approach that has been taken is to fix the data format and mandate that all sensors output data in the defined format. This is not considered unreasonable, since all sensors must already have sufficient onboard computing power to connect to a wireless data network. However, direct transmission of the data in a common XML format may lead to increased data packet sizes. This has an impact on transmission time, reliability and cost, as well as on the energy consumption of the sensor node.
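To illustrate the point, a single pollution observation expressed in an XML format of this kind might look as follows. This is a sketch only: the element names are our own assumptions, standing in for the project's draft schema (described in section 4), which defines the actual packet format.

    <observation>
      <sensorId>sensor-0042</sensorId>
      <timestamp>2007-04-23T08:15:30Z</timestamp>
      <location lat="51.4989" lon="-0.1790"/>
      <pollutants>
        <!-- pollutant elements are optional; a sensor reports only
             those pollutants it is able to sample -->
        <co units="ppm">2.4</co>
        <no2 units="ppb">48.1</no2>
      </pollutants>
    </observation>

Even this small record runs to a few hundred bytes once encoded, where a packed binary representation of the same observation would need only a few tens of bytes; this is the packet-size overhead referred to above.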
To mitigate these issues, intermediate data translation from the output format of the sensor into the defined XML format may be implemented at the gateway. Since all data ultimately flows onto a common IP network, the location of processing resources is of little importance, unless the data volumes to be transferred are sufficiently high that processing resources must be geographically local to the sensors in order to avoid data transfer delays. In the case of our work, cumulative data volumes may be high, but individual packets are small, and it has therefore been decided that the location of compute resources is not significant. This allows us to take advantage of utility computing platforms that offer remote resources for use on an on-demand basis.

2.1 Sensor Gateway

We use the term sensor gateway to denote a hardware resource that takes in data streams from a number of sensors, pre-processes these streams and then ensures that the data is transferred on to the correct location for subsequent processing, storage and so on. The key issues with a sensor gateway are the number of data streams it can manage concurrently and the amount of local storage it has access to. These are governed by the physical hardware that operates the gateway.

The number of data streams a gateway can handle is governed by the processing power of the hardware resource. Software running on the gateway resource maintains a number of open sockets to active sensors. Each of these sockets receives data packets at an interval determined by the sample frequency of the corresponding sensor. The data packets are received and the key pieces of information extracted. This information is then bundled with the information coming from other connected sensors and passes through a series of processes before being sent off to a distributed data warehouse for archiving and subsequent historic processing. Prior to archiving in the data warehouse, the incoming data is used to produce a real-time data stream. The real-time stream is used by frontend user interface components of the system that display current information to end users. In the case of the pollution sensors in use here, the generation of a real-time pollution map is possible. This is displayed to end users via a web portal.

Local storage on the sensor gateway is also important. In addition to archiving the data it receives, a sensor gateway may need to carry out more significant processing and modelling of local blocks of data. This may require local storage of data that could run to significant amounts; however, this will not be permanent storage, just a window of data that will all subsequently be transferred into the main data warehouse. The amount of onboard processing that a sensor gateway carries out lowers its capacity to handle incoming data streams, so it is necessary to balance the requirements for local stream handling and processing to ensure the most efficient overall system setup.

Fig. 1. Layers within the compute architecture: a Presentation Layer (real-time visualisation, user interfaces, etc.) above a Management Layer (sensor/gateway management, service control, etc.), a Data Layer (sensor data capture and archiving, real-time processing, etc.) and a Sensor Layer (the mobile sensor environment).
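As a concrete illustration of the stream handling described above, the following is a minimal sketch of a gateway's socket-handling loop. It is illustrative only: the class name, port and per-gateway capacity are our own assumptions, and a real gateway would parse each packet, bundle the extracted information and forward it to the data warehouse and the real-time stream rather than simply printing it.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Minimal sensor gateway sketch: accepts sensor connections up to a
    // fixed capacity and reads one XML data packet per line from each socket.
    public class SensorGateway {
        private static final int MAX_SENSORS = 180; // assumed capacity per gateway
        private int activeSensors = 0;

        public static void main(String[] args) throws Exception {
            new SensorGateway().serve(9000);
        }

        void serve(int port) throws Exception {
            ServerSocket server = new ServerSocket(port);
            while (true) {
                final Socket sensor = server.accept();
                synchronized (this) {
                    if (activeSensors >= MAX_SENSORS) { sensor.close(); continue; }
                    activeSensors++;
                }
                new Thread(new Runnable() {
                    public void run() { handle(sensor); }
                }).start();
            }
        }

        void handle(Socket sensor) {
            try {
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(sensor.getInputStream()));
                String packet;
                while ((packet = in.readLine()) != null) {
                    // In the full system the packet would be parsed, key fields
                    // extracted, bundled with other sensors' data and forwarded
                    // to the data warehouse and the real-time stream.
                    System.out.println("received: " + packet);
                }
            } catch (Exception e) {
                // sensor went out of range or was switched off
            } finally {
                synchronized (this) { activeSensors--; }
                try { sensor.close(); } catch (Exception ignored) {}
            }
        }
    }

The per-sensor threads make the capacity limit explicit: the number of concurrent streams, and hence the value of MAX_SENSORS, is bounded by the processing power of the gateway's hardware resource.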
2.2 Sensor Gateway Location

Given a number of incoming data streams from pollution sensors, it is necessary to develop a system that has sufficient capacity to handle all the incoming data from all the sensors in the network. While in a static network this would simply be a case of looking at the total number of sensors in the network and the number of sensors each gateway can handle in order to ensure provisioning of sufficient gateways, a network of continuously mobile sensors presents more significant problems of where to site gateways and how to account for various issues that are specific to mobile sensors. These include:

• Geography - In which areas are sensors physically located?
• Concentration - Are there concentrations of sensors in specific locations?
• Reliability - How stable is the connection of a sensor; does it regularly drop in and out of the network?
• Mobility - How far does a sensor move; does it regularly move between different areas?

In a network of heterogeneous sensors monitoring different parameters, the sensor "type" may also be important.

Following evaluation of these issues, it was decided to take an approach whereby the location of the sensor gateway hardware resources is not linked to the physical sensor locations. This is made possible because all sensors will be connected to an IP network that will ultimately provide connection through to the Internet. The gateways or access points (not to be confused with the sensor gateways themselves) that connect the sensors to the public Internet do not need any onboard programmability; they can simply be custom hardware devices equivalent to the wireless routers used in many homes and offices to link wireless devices to the Internet.

3 Scalable Compute Fabric

The number of sensors that will be online at any one time within the sensor network is difficult to model since it varies stochastically. A number of variables will affect the state of a sensor. If the sensor is on a moving vehicle, it may move out of reception range of a base station. It may travel into a tunnel, temporarily blocking reception for a short period of time. The vehicle may reach its destination and have the ignition switched off, cutting power to the sensor device, and so on. In the long term, the maximum number of sensor devices that could join the network is theoretically unbounded. If every vehicle, and a large percentage of pedestrians, were to carry sensors (perhaps built into mobile phones or some other wireless device), the number of active sensors could become very large.

If a compute farm were maintained with the capacity to service the maximum possible number of sensors that may join the network, this would result in a vast number of CPUs that would, for the majority of the time, be highly underutilised. Since the cost of hosting, supporting and managing large quantities of computing resources is very high, due to issues such as power, cooling and space, this is an impractical requirement. To avoid this, our system utilises one of the emerging utility computing platforms that provide on-demand, pay-per-use access to computing resources as and when they are required. The system we are utilising to test our architecture is the Amazon Elastic Compute Cloud (EC2) [3] service.

3.1 Amazon EC2 Utility Computing Service

EC2 is a system set up by Amazon that allows access to compute resources of a fixed specification, on demand. The system is accessible via a Web Service API that can be used to request access to new resources, provision those resources with the required setup and then start them running. The system uses the concept of virtual machine images to provision hardware resources.
A virtual machine image is a file containing the complete setup of a machine, including the operating system, associated settings and any software that is to be available on the machine. The API also provides the ability to shut a machine down and return it to the pool of available resources when it is no longer required.

A virtual machine image, in the case of EC2 an Amazon Machine Image (AMI), containing the full content of a sensor gateway installation is uploaded onto the Amazon S3 storage service [10]. EC2 nodes can access content on the S3 storage service, and new machines can then be provisioned from the image stored in S3. Using this approach, new sensor gateways can be provisioned programmatically. This allows the system to scale as necessary, with a very short lead time. The time taken to provision a new node from scratch as a sensor gateway is of the order of a couple of minutes. This compares with the traditional approach of having to install new hardware resources manually and load all the relevant software, something which can take several hours. Since machines are only paid for when they are required (on a per-hour basis), processing costs can be significantly reduced using this method.

The finite availability of hardware resources means that there may be occasions where it is not possible to obtain new resources when the system needs to scale up. In order to reduce the likelihood of this situation occurring, it is a long-term aim to support multiple utility computing platforms to gain access to a wider potential range of resources.

3.2 Gateway Management

Given a means to quickly provision new sensor gateways, or shut them down when they are no longer required, a management layer is needed that handles the provisioning and shutdown of resources as necessary (see figure 1). This gateway management layer consists of root gateway controllers. Only a small number of root controllers are required. A root controller carries out bootstrap requests from sensors. These requests are high volume but simple and quick to process. Root gateways communicate in a peer-to-peer manner to ensure that the pool of root gateways maintains common information about available sensor gateway capacity.

A root gateway is used by a sensor to bootstrap itself upon startup. When a sensor is powered on, it gains access, via its local router, to the Internet. At this time the sensor has an IP address; this may be a private IP address assigned by its local router using Network Address Translation (NAT), or a global public IP address. The firmware of the sensor contains one or more root gateway addresses. The sensor contacts one of these addresses by way of a bootstrap request. The root gateway gives the sensor the details of a sensor gateway with spare capacity, and the sensor is then paired with this gateway and streams data to it until the sensor is switched off, goes out of range, or is requested to do otherwise by a root gateway.

Since some subset of root gateways must always be available at a known location, we have the concept of primary root gateways and secondary root gateways. A primary root gateway is always available at a known, fixed location. A secondary root gateway is one that has been added to handle demand and is passed workload in response to a request received by a primary root gateway.
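To make the provisioning step concrete, the following sketch launches and terminates gateway instances. It is a sketch under stated assumptions: it shells out to Amazon's EC2 command-line API tools rather than calling the Web Service API directly, and the AMI identifier, error handling and output parsing are simplified.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Sketch of programmatic sensor gateway provisioning against EC2.
    // Assumes the Amazon EC2 command-line API tools are installed and
    // configured; in practice the Web Service API would be called directly.
    public class GatewayProvisioner {
        private final String gatewayAmi; // AMI holding the sensor gateway installation

        public GatewayProvisioner(String gatewayAmi) { this.gatewayAmi = gatewayAmi; }

        // Launch one new sensor gateway instance and return its instance ID.
        public String provisionGateway() throws Exception {
            Process p = Runtime.getRuntime().exec(
                new String[] { "ec2-run-instances", gatewayAmi, "-n", "1" });
            BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = out.readLine()) != null) {
                // The tools print a line of the form "INSTANCE i-xxxxxxxx ..."
                if (line.startsWith("INSTANCE")) return line.split("\\s+")[1];
            }
            throw new Exception("no instance launched");
        }

        // Return an idle gateway's resource to the pool so it is no longer paid for.
        public void shutDownGateway(String instanceId) throws Exception {
            Runtime.getRuntime().exec(
                new String[] { "ec2-terminate-instances", instanceId });
        }
    }

In the deployed system these calls would be issued by the root gateways described below, which are responsible for deciding when new capacity is required.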
If primary root gateways become overloaded, they can in turn provision secondary root gateways dynamically, in order to assist them with their workload.

Fig. 2. Sensor gateway architecture: a pool of root gateways, backed by a distributed database, dynamically provisions sensor gateways as necessary; WiFi gateways serve WiFi/WiMax fixed and mobile sensors, ZigBee gateways serve ZigBee fixed and mobile sensors, and cellular gateways serve 3G/GPRS mobile sensors, all connected over the common IP backbone.

The main task of a root gateway is to manage a set of sensor gateways. The process operates as follows. Initially, a root gateway will have a single attached sensor gateway. As sensors come into the system (they are powered on or come into range of network coverage), a bootstrap request is received by the root gateway, which then passes back the address of the sensor gateway. As more sensors come online and the root gateway services the bootstrap requests, the sensor gateway begins to near its sensor connection capacity. Once a threshold is reached (approximately 85% of capacity, although this is dependent on the arrival rate of bootstrap requests as against the time taken to provision a new sensor gateway), the root gateway initiates the provisioning of a new sensor gateway. This new gateway enters the pool of available sensor gateways that are managed by the root gateway. This process continues until eventually the capacity of the root gateway itself becomes an issue. In this instance, the root gateway can provision a secondary root gateway and hand off management of some sensor gateways to this secondary gateway. The sensor gateway architecture is shown in figure 2.

This two-layer architecture of sensor gateways and root gateways has been adopted in preference to a one-layer architecture, in which sensor gateways themselves would communicate on a peer-to-peer basis, because abstracting the gateway management functionality out from the sensor management was considered preferable to bundling all the functionality into a single entity. Sensor gateways must manage continuous connections to devices, under a constant workload that can be modelled and managed. Root gateways handle a high frequency of requests with a stochastic arrival rate and short duration.

As time passes, sensors will leave the network as well as join it. This will result in the utilisation of sensor gateways dropping. Since there is a charge for every hour that a gateway is in use (the per-hour charge for the compute resource that is running the gateway), it is desirable to keep the utilisation of each gateway as high as possible, and therefore the number of gateways, and the computing costs, as low as possible. To manage this, root gateways run a capacity consolidation process at given intervals. Where it is determined that the total number of sensors in the network could be serviced by a smaller number of sensor gateways, the root gateways calculate a remapping of sensors to gateways and then direct the relevant sensors to switch to another sensor gateway. The sensor gateways that are freed up by this process can then be shut down by the root gateways, so that their resources need no longer be paid for.

4 Simulation and Implementation

Prior to implementation of the full system, a simulation has been produced in order to test the algorithms for sensor management. The simulation is written in Java and makes use of the Java Timing Framework [5] to manage the simulation timesteps and fire events that direct the virtual sensors to generate a data packet.
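The core policy exercised by the simulator, threshold-triggered provisioning and periodic consolidation as described in section 3.2, can be sketched as follows. This is a simplification under stated assumptions: the class and method names are our own, gateway addresses are reduced to list indices, and the provisioning and shutdown calls (see section 3.1) are represented only by comments.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified root gateway policy: threshold-triggered provisioning and
    // periodic consolidation, as described in section 3.2.
    public class RootGateway {
        static final int GATEWAY_CAPACITY = 180;      // sensors per gateway
        static final double PROVISION_THRESHOLD = 0.85;

        // sensor counts per managed gateway; element i is gateway i's load
        private final List<Integer> gatewayLoads = new ArrayList<Integer>();

        public RootGateway() { gatewayLoads.add(0); } // start with one gateway

        // Handle a bootstrap request: assign the sensor to the least-loaded
        // gateway, provisioning a new one if utilisation crosses the threshold.
        public int bootstrap() {
            int target = leastLoaded();
            gatewayLoads.set(target, gatewayLoads.get(target) + 1);
            if (totalLoad() > PROVISION_THRESHOLD * GATEWAY_CAPACITY * gatewayLoads.size()) {
                gatewayLoads.add(0); // provision a new sensor gateway (section 3.1)
            }
            return target; // index stands in for the gateway's address
        }

        // Periodic consolidation: if the current sensors would fit on fewer
        // gateways (with threshold headroom), remap sensors and free the rest.
        public void consolidate() {
            int needed = (int) Math.ceil(
                totalLoad() / (PROVISION_THRESHOLD * GATEWAY_CAPACITY));
            needed = Math.max(needed, 1);
            if (needed < gatewayLoads.size()) {
                int sensors = totalLoad();
                gatewayLoads.clear();
                for (int i = 0; i < needed; i++) {
                    int share = sensors / (needed - i);
                    gatewayLoads.add(share);  // sensors re-addressed to this gateway
                    sensors -= share;
                }
                // surplus gateway instances would now be shut down and
                // returned to the utility provider's pool
            }
        }

        private int leastLoaded() {
            int best = 0;
            for (int i = 1; i < gatewayLoads.size(); i++)
                if (gatewayLoads.get(i) < gatewayLoads.get(best)) best = i;
            return best;
        }

        private int totalLoad() {
            int sum = 0;
            for (int load : gatewayLoads) sum += load;
            return sum;
        }
    }

In the real system the consolidation pass must also re-address live sensors before a gateway can be shut down; the sketch simply redistributes the counts.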
In the simulator, data packets are in the format of XML documents. A draft XML Schema has been developed to specify the format of pollution data packets. The packets include information such as the sensor ID, observation time, location and values for a number of pollutants. Some sensors may be able to sample a number of pollutants; others may only be able to sample a single pollutant. As a result, all pollutants defined within the schema are optional. From the schema, a Java parser has been created that allows simple marshalling of XML to a Java object model and vice versa. The parser is used by the virtual sensors to generate random data packets that are then sent as strings to the relevant sensor gateway. At present, gateways store data packets locally in addition to forwarding them on to a central component that is used to drive the web-based frontend to the simulator.

The simulator allows the number of sensors handled by each sensor gateway to be varied, along with the sample frequency. This has given the opportunity to test how many connections a gateway can handle for a given data sample frequency. The frontend of the simulator is a JSP web application that links to the Google Maps API in order to display the locations of the simulated sensors. To add to the visual effect of the simulation, sensors can be programmed to follow various routes around Central London, and their icons on the map change colour to reflect the pollution in the current location (see figure 3). Clicking on a map marker pops up a window that shows the current pollution levels recorded by that sensor. This display is a prototype to show the possibilities for real-time display of sampled data and is one of the possible visualisation models being considered for the MESSAGE project.

While the initial version of the simulator is a local Java application with classes to represent the various entities in the architecture, the MESSAGE e-Science architectural model will be implemented as a Service Oriented Architecture (SOA). In this model, each entity is represented as a service. A service is a 'black box' with clearly defined input and output interfaces. The SOA principle means that services are loosely coupled and can communicate with each other by knowing only the input and output data types for the operations offered by a service's interface. The internal implementation of a service, including its architecture and implementation language, is irrelevant. There are a number of SOA middleware toolkits that make the generation of services possible. It is intended to adopt the Web Services model, which is based on the key standards Web Service Description Language (WSDL) [1] for describing a service's interface, Simple Object Access Protocol (SOAP) [6] for defining the message encapsulation format for cross-platform compatible message exchange, and Universal Description, Discovery and Integration (UDDI) [11] for discovering services.

5 Cost analysis

Fig. 4. Differing processor usage profiles: processor utilisation over the time of day for a constant-usage profile and for the mobile sensor workload.

The conventional model of purchasing hardware resources is workable in cases where there is a constant and quantifiable demand for computing resources, and where the costs of power, cooling and space to host the resources are sufficiently low.
For example, in areas where real estate is particularly expensive, using a large quantity of space to host racks of hundreds of processors may be impractical, and remote hosting of resources may be a far more reasonable solution.

Given the long-term view of large numbers of pedestrians and vehicles carrying sensors, it can be assumed that more sensors will be active during the morning and evening rush hours, when the greatest number of sensors are likely to be powered up and mobile. This gives a projected hardware resource usage profile as shown in figure 4.

Fig. 3. Google Maps-based simulator graphical display

However, the availability of on-demand, pay-per-use computing combined with our scalable architecture opens up the possibility not just of lowering costs by hosting resources remotely, to avoid local support, power and estate costs, but of using remote resources where only the required capacity is paid for. A simplified cost analysis has been carried out that shows how significant the differences in cost are between hosted and in-house resources in the mobile pollution sensor scenario. We show how the costs of in-house hosted resources compare with the costs of a pay-per-use service such as Amazon EC2. We consider these costs over a year and accept that our calculations are significantly simplified by the assumption that local networking, space and electricity costs are free (given that this is an academic project) for university-hosted resources. We assume that a server similar to those offered by utility computing providers costs $2,000. Given a maximum utilisation of the system requiring 10 hardware resources (these figures will scale linearly), the number of resources in use at particular times of day is shown in table 1 (based on hourly blocks, since this is how Amazon's service is charged).

Given a data packet size of 1kb, a 5 second sample interval and an average of 180 sensors connected to a gateway at a given time, the amount of data transfer per gateway per hour is calculated as 129,600kb, or approximately 127Mb.

Since the maximum possible utilisation requires 10 hardware resources, for locally hosted resources 10 machines would need to be purchased at a total cost of $20,000. These machines would have an expected lifetime of 3 years, so the annual cost would be $6,667. Based on Amazon's current rate of $0.10+tax ($0.12) per CPU hour and the estimated resources required, the cost per day for resources from Amazon is shown in table 1.

    Time           Resources   Cost
    0 - 4          1           $0.48
    4 - 5          3           $0.36
    5 - 6          4           $0.48
    6 - 7          7           $0.84
    7 - 8          8           $0.96
    8 - 9          10          $1.20
    9 - 10         6           $0.72
    10 - 15        3           $1.80
    15 - 16        6           $0.72
    16 - 18        9           $2.16
    18 - 20        5           $1.20
    20 - 24        2           $0.96
    Daily total                $11.88
    Annual total               $4,336.20

Table 1. Resources required daily and costs for the Amazon EC2 platform

It is necessary to add data transfer costs to the CPU costs. Data transfer is charged at $0.20 per Gb. The system uses 99 CPU-hours per day. As calculated above, each CPU hour utilised has an associated 127Mb of data transfer for the data packet transmission; 99 x 127 gives 12,573Mb, or approximately 12.5Gb, of data transferred per day. This would incur transfer costs of 12.5 * 0.20 = $2.50 per day, or $912.50 per year. Adding this to the total CPU costs gives an annual cost for use of the Amazon platform of $5,248.70, undercutting the annual in-house resource cost of $6,667 by just over $1,400. It is clear from these figures that using the Amazon EC2 platform is a practical and economic alternative to using in-house resources.
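The cost arithmetic above can be checked with a short program. This is purely illustrative: the rates, the resource profile and the hardware assumptions are those of table 1 and the surrounding text.

    // Reproduces the simplified cost comparison of section 5.
    public class CostModel {
        public static void main(String[] args) {
            // Hourly resource profile from table 1: {hours in block, machines in use}.
            int[][] profile = { {4, 1}, {1, 3}, {1, 4}, {1, 7}, {1, 8}, {1, 10},
                                {1, 6}, {5, 3}, {1, 6}, {2, 9}, {2, 5}, {4, 2} };
            double cpuRate = 0.12;        // $ per CPU hour including tax
            double transferRate = 0.20;   // $ per Gb transferred
            double mbPerCpuHour = 127.0;  // data transfer per gateway per hour

            int cpuHoursPerDay = 0;
            for (int[] block : profile) cpuHoursPerDay += block[0] * block[1];

            double dailyCpuCost = cpuHoursPerDay * cpuRate;
            double dailyTransferGb = cpuHoursPerDay * mbPerCpuHour / 1000.0;
            double annualEc2 = 365 * (dailyCpuCost + dailyTransferGb * transferRate);
            double annualInHouse = 10 * 2000.0 / 3.0; // 10 machines, 3-year lifetime

            System.out.printf("CPU hours/day: %d%n", cpuHoursPerDay);          // 99
            System.out.printf("Annual EC2 cost: $%.2f%n", annualEc2);          // ~$5,254
            System.out.printf("Annual in-house cost: $%.2f%n", annualInHouse); // ~$6,667
        }
    }

Note that the program carries the full 12.573Gb per day through the calculation, giving an annual EC2 figure a few dollars higher than the $5,248.70 quoted above, which rounds the daily transfer down to 12.5Gb; the conclusion is unaffected.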
It should also be noted that these are the figures before the significant in-house costs of hosting resources are taken into account. In some cases, these costs are difficult to quantify, and a more thorough cost analysis to put a value on them would be a useful piece of future work.

6 Conclusion

We have presented an architecture for managing a distributed network of mobile sensor devices. The architecture suits devices that join and leave the network in a stochastic manner, resulting in wide variations of network utilisation that make the in-house hosting of resources wasteful. The architecture is well suited to the dynamic, use-on-demand, pay-per-use nature of the emerging utility computing platforms, such as the platform we are utilising, Amazon EC2. A basic cost analysis exercise has shown that, even without taking into account the costs of space, cooling, power and on-site management staff, the EC2 platform is still cheaper at providing computing power than purchasing the resources for use in house.

Further work will include continued development of the simulation platform to allow demonstrations and further analysis to be carried out. In parallel, development of the main Web Services based architecture is to be continued. It is intended to validate the simulation results against the real system as it is developed, in order to obtain feedback on the simulation approach. It is hoped that as further utility providers appear on the market, costs will drop as competition ensues and the use of pay-per-use computing will become an accepted model for lower-cost, highly efficient computing.

7 Acknowledgements

This work has been undertaken as part of the e-Science Pilot project Mobile Environmental Sensing System Across Grid Environments (MESSAGE). MESSAGE is a three-year research project which started in October 2006 and is funded jointly by the Engineering and Physical Sciences Research Council and the Department for Transport. The project also has the support of nineteen non-academic organisations from public sector transport operations, commercial equipment providers, systems integrators and technology suppliers. More information is available from the web site www.messageproject.org. The views expressed in this paper are those of the authors and do not represent the view of the Department for Transport or any of the non-academic partners of the MESSAGE project.

References

1. E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana. Web Services Description Language (WSDL) 1.1. http://www.w3.org/TR/wsdl. [23 April 2007].
2. J. Cohen, J. Darlington, and W. Lee. Payment and negotiation for the next generation grid and web. Concurrency and Computation: Practice and Experience, 2007. To appear.
3. Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2. [23 April 2007].
4. "A Market for Computational Services" project. http://www.lesc.imperial.ac.uk/markets. [23 April 2007].
5. Timing Framework. https://timingframework.dev.java.net/. [23 April 2007].
6. M. Gudgin, M. Hadley, N. Mendelsohn, J-J. Moreau, and H. Nielsen. SOAP Version 1.2 Part 1. http://www.w3.org/TR/soap12-part1. [23 April 2007].
7. J. Lawton. The Urban Environment, Summary of the Royal Commission on Environmental Pollution's report. HMSO, Norwich, UK, March 2007.
8. R. Mautz, W.Y. Ochieng, D. Walsh, G. Brodin, A. Kemp, J. Cooper, and T.S. Le. Low cost intelligent pervasive location tracking (iPLOT) in all environments for the management of crime. The Journal of Navigation, 59:263-279, 2006.
9. M. Richards, M. Ghanem, M. Osmond, Y. Guo, and J. Hassard. Grid-based analysis of air pollution data. Ecological Modelling, 194(1-3):274-286, March 2006.
10. Amazon Simple Storage Service (S3). http://aws.amazon.com/s3. [23 April 2007].
11. Universal Description, Discovery and Integration (UDDI), OASIS UDDI Specification TC, v3.0.2. http://www.oasis-open.org/committees/uddi-spec/doc/spec/v3/uddi-v3.0.2-20041019.htm. [23 April 2007].