e-Science from the Antarctic to the GRID

Steve Benford a, Neil Crout c, John Crowe b, Stefan Egglestone a, Malcom Foster a,b, Chris Greenhalgh a, Alastair Hampshire a, Barrie Hayes-Gill b, Jan Humble a, Alex Irune a, Johanna Laybourn-Parry c, Ben Palethorpe b, Timothy Reid c, Mark Sumner b

a. School of Computer Science & IT, University of Nottingham
b. School of Electrical and Electronic Engineering, University of Nottingham
c. School of Life and Environmental Sciences, University of Nottingham

Abstract

Monitoring life-processes in a frozen lake in the Antarctic raises many practical challenges. To supplement manual monitoring we have designed, built and successfully deployed a remote monitoring device on one of the lakes of interest. This returns data to the Antarctic base over the Iridium satellite phone network. This provides us with a new and uniquely detailed view of the life-processes in that environment, and is allowing us to understand that environment in new ways, for example exploring diurnal effects and detailed energy flow models. We have integrated this sensing device into a common Grid-based software infrastructure; this makes the device and its sensors visible on the Grid as services, and also maintains an archive of sensor measurements. A desktop user interface allows non-programmers to work with this data in a flexible way. The experience of creating and deploying this device has given us a rich view of the many elements and processes that must be brought together to make possible this kind of e-Science.

1. Introduction

Professor Laybourn-Parry and her colleagues have been studying the ecology of freshwater Antarctic lakes for 12 years, and in particular the cycling of carbon through the ecosystem. These lakes are scientifically important for a number of reasons:
• They are isolated, pristine ecosystems with no direct human impact.
• They are ice-covered for much or all of the year, which reduces mixing with external materials.
• Few species of plant or animal are present in the lake, and the food web is consequently simpler to analyse and model.
• They are harsh environments that force the evolution of interesting survival adaptations in planktonic organisms.
• They are fragile ecosystems.
• They may be sensitive to global changes, perhaps acting as a kind of “early warning” of climate change and its impact.

Historically, the process of obtaining data has been highly labour-intensive and potentially hazardous, with scientists making journeys of tens of kilometres from the Antarctic base to collect a handful of readings at a location of interest (see figure 1).

Fig. 1. Typical images of the manual data-collection process

Consequently, the available data has been quite limited; for example, gathering data is subject to the availability of personnel, transport (such as helicopters and quad-bikes), the stability of the ice and suitable weather in which to travel and work, and is restricted to the daytime only. As a result, the existing data sets are very sparse, e.g. a set of readings every one to two weeks. The sparseness of the data in turn limits the detail and accuracy at which colleagues back in Nottingham can model and analyse the life-processes in those environments; some of the phenomena of interest occur at time-scales significantly shorter than the available data can address. This paper describes work that has been carried out between March 2002 and August 2003.
This work combines wireless devices and sensors, Grid technologies and desktop visualisation tools to address the challenge of supporting – and enhancing – the Antarctic science outlined above: taking e-Science from the Antarctic, through the Grid, and onto the desktop. The remainder of this paper is structured as follows. Section 2 describes the complete system – hardware, networking and software – that we have designed, built and deployed. Section 3 presents some of the new results that have already been gathered using this system. Section 4 discusses some of the many e-Science-related issues that have already been raised by this work. Finally, section 5 identifies areas of ongoing and possible future work.

2. System Design

2.1. Overview

Figure 2 gives an overview of the system that we have designed, built and deployed over the last 18 months. The total system comprises a number of interlinked components, with sensor measurement data flowing from left to right:
• At the left is the Antarctic sensing device itself, which is deployed on the icy surface of the lake to be monitored.
• This communicates over the Iridium [1] Low-Earth Orbit (LEO) satellite telephony network with a base computer, where its raw data is unpacked and scaled.
• An OGSA-compliant [2] Grid service, the Antarctic device Grid proxy, makes the device – and its data – available on the Grid.
• The data is archived in a Grid-accessible database.
• The Antarctic scientist can then work with the data in this Grid archive, visualising it, analysing it, and increasing their understanding of what is happening in the Antarctic lake and its ecosystem.
Further details of these components are given in the following sub-sections.

Fig. 2. Current Antarctic e-Science system overview (Grid port types: DF = DeviceProxyFactory, D = Device, S = Sensor, DB = Database)

2.2. The Antarctic Device

The Antarctic device is currently deployed on the ice of Crooked Lake, in the Vestfold Hills of the Antarctic, about 15 km from the Davis Base [3] of the Australian Antarctic Division (68° 35’ 31.9” S, 78° 21’ 32.7” E); figure 3 shows the device in position.

Fig. 3. The Antarctic device on Crooked Lake

There are many practical challenges when deploying a device in such harsh conditions, with temperatures dropping to –40°C and wind speeds exceeding 60 mph. Suitable provision of power is particularly important due to the limited physical access, the low temperatures (which drastically affect the performance of batteries) and the extreme latitude (the sun is below the horizon for 38 days at mid-winter). The transport and assembly of the device was also a challenge, occurring in stages between Nottingham in the UK, Hobart in Australia, Davis Base in the Antarctic and the Crooked Lake site.

The device is based on a commercial scientific data logger [4], with various sensor interface and storage modules.
This is wired to the various local sensors, which currently comprise:
• Wind monitor, reporting wind speed and direction;
• Battery level sensor;
• Temperature sensors above the ice, inside the device itself, and at depths of 3m and 5m in the lake;
• A series of temperature sensors straight through and immediately below the ice;
• Photosynthetically Active Radiation (PAR) sensors above the ice, facing the ice (to determine albedo), and at depths of 3m, 5m, 10m and 20m in the lake;
• Ultraviolet-B (UVB) sensors above the ice and at a depth of 3m in the lake;
• Sonar range-finder, which measures the thickness of the ice from below.

We had also planned to attach a GPS receiver to the device, to monitor any change in ice position or height; however, this has not been possible to date due to coordination issues with the other sensors.

We had originally planned a relatively slow measurement cycle (two to four measurements per day); however, with the device in place we have actually been able to support a measurement cycle of one reading every five minutes, continuously. This is a dramatic change from the weekly or fortnightly schedule that was possible with manual measurement.

2.3. Remote Communication

The data logger is connected to an Iridium data modem, by which it can transfer data to a computer back at Davis Base (or anywhere else in the world, for that matter). This connection is relatively expensive and slow, at approximately $2 per minute and a throughput in practical use of around 1000 bits per second. However, this was the only viable method given the remoteness of the site, the latitude and the intervening hills between the site and Davis Base. As well as allowing data to be downloaded from the device, the satellite modem also allows the device to be re-programmed remotely. However, if re-programming fails part-way through then the device may well require manual re-programming on site (this has already happened once). The satellite connection also requires that the batteries be in a reasonably good state of charge: calls cannot be established at lower supply voltages, even though the rest of the device is still operational.

2.4. Grid Components

We have combined efforts with the other EQUATOR IRC e-Science project to develop a common Grid software infrastructure for devices and sensors. This is described in more detail in [5]. Briefly, we have defined new Grid service port types (interfaces) to represent a generic device and a generic sensor “on the Grid”. The supporting tools and services (on the right-hand side of figure 2) exploit these common port types to handle varying devices and sensors in a standard way. The Device Proxy Management Client (at the top of figure 2) allows the person responsible for a device to create a suitable Grid service to represent that device on the Grid (a device “proxy”). The Data (or Trial) Manager and Sensor Data Pumps can be used to archive data from sensors in a common Sensor Database Service. At this point the data from the sensors is made available to interested parties via this Sensor Database Service; this service is tailored to support the data formats and queries appropriate to sensors. However, the internal data model is relational, and we also plan to provide an OGSA-DAI [6] interface to this archive.
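To make the idea of these generic port types concrete, the following is a minimal sketch of the shape such interfaces might take. It is illustrative only: it is written as plain Java interfaces rather than the actual GT3/OGSA service descriptions, and all of the names (Reading, GridSensor, GridDevice and their methods) are hypothetical rather than taken from the implementation described in [5].

import java.util.Date;
import java.util.List;

// A single timestamped measurement from one sensor.
class Reading {
    Date timestamp;
    double value;      // in the sensor's calibrated units
    Reading(Date t, double v) { timestamp = t; value = v; }
}

// Generic sensor port type: the sensor describes itself and exposes its measurements.
interface GridSensor {
    String getName();                              // e.g. "PAR at 3m depth"
    String getUnits();                             // e.g. "umol s-1 m-2"
    Reading getLatestReading();
    List<Reading> getReadings(Date from, Date to); // historical query
}

// Generic device port type: a deployed device hosting several sensors.
interface GridDevice {
    String getDeviceId();
    boolean isConnected();                         // false while the satellite link is down
    List<GridSensor> getSensors();
    void requestReconfiguration(String request);   // queued until the device next connects
}

In the deployed system interfaces of this kind are realised as OGSA service port types, with the device proxy introduced above (and discussed further in section 4.4) providing the implementation on behalf of the remote device.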
2.5. Data Access and Analysis

We have created a simple desktop Graphical User Interface (GUI) to allow scientists to download and analyse the sensor data from the Sensor Database Service. This interface uses a visual data-flow paradigm [7] to allow non-programmers to perform a range of data retrieval, processing and visualisation functions. Figure 4 shows a simple processing network. The first component is a data-loader, which collects data for a particular sensor and time-period from the Sensor Data Archive Grid service. This is then routed through a table viewer, which allows the user to view the data as a table and to select a subset of the data. This subset is then routed to a 2D chart generator, which can create a range of standard graphs. Sample results are shown in section 3.

Fig. 4. A simple processing network

Our colleagues at the University of Glasgow have also been analysing data from the Antarctic device using a similar Hybrid Information Visualisation Environment (HIVE) [19]; this currently supports a range of multidimensional scaling algorithms, but lacks a Grid client facility. We are actively porting components between these tools, such as the fish-eye table viewer from HIVE. Using the device and sensor Grid services it is also possible to monitor measurements as soon as they reach the Grid; this is described further in [5].

We have also been exploring the use of the Visualization Toolkit (VTK) [8]; figure 7 shows data comparable to figure 5(b) rendered using VTK. We plan to use these kinds of 3D visualisations in exploratory virtual reality and augmented reality interfaces.

Fig. 7. PAR readings visualisation

3. Results

In this section we show some of the data that was obtained from the Antarctic device during its summer deployment (17th – 31st January 2003). Figure 5 shows the levels of Photosynthetically Active Radiation (PAR) measured at the surface and at various depths in the lake. Figure 5(a) shows the smooth curve resulting from a clear day, while figure 5(b) shows the effects of varying cloud cover on a partially cloudy day.

Fig. 5. PAR readings (a) clear day (b) cloudy

Figure 6 shows the thickness of the ice during the summer deployment, as measured by the up-looking sonar. The ice begins to melt rapidly towards the end of the period, after which the device was removed while it was still possible to land a helicopter on the ice.

Fig. 6. Ice thickness

4. Discussion

The work reported in this paper is very much work in progress. However, it has already highlighted a range of e-Science issues, ranging from the environmental science being supported to the Grid technologies that are being applied. These issues are explored in the sub-sections that follow.

4.1. New scientific directions

Using the Antarctic device we have been able to capture data regularly, irrespective of weather conditions, approximately 2000 times more frequently than with previous manual methods. This level of temporal detail is providing new insights into the minute-by-minute changes in the lake environment, as seen in the cloudy-day data in figure 5(b). At slightly longer timescales, it now becomes possible to observe and analyse diurnal effects in the environment. This level of detail also allows us to apply new modelling and analysis methods. For example, it is possible to begin to model the complete energy balance within the environment, using the detailed light, temperature and ice-thickness measurements. Such a model can then be used to explore hypothetical changes in environmental conditions much more precisely than the current coarse-grained models based on no more than weekly measurements.
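As a rough illustration of the kind of energy balance calculation that this data makes possible (a textbook-style simplification in our own notation, not the model we are developing, and neglecting for instance conduction through the ice and the heat flux from the underlying water), the rate of change of ice thickness h can be related to the net energy flux at the ice surface:

\[
\rho_{ice}\, L_f\, \frac{dh}{dt} \;\approx\; -\left[\, (1-\alpha)\,Q_{sw} + Q_{lw} + Q_{sens} + Q_{lat} \,\right]
\]

Here Q_sw is the incoming short-wave radiation (constrained by the PAR measurements), α the albedo (from the downward-facing PAR sensor), Q_lw, Q_sens and Q_lat the long-wave, sensible and latent heat fluxes (constrained by the air and water temperatures and wind measurements), ρ_ice ≈ 920 kg m⁻³ the density of ice and L_f ≈ 3.3 × 10⁵ J kg⁻¹ its latent heat of fusion. The point is that the five-minute light, temperature, wind and ice-thickness series described above provide, for the first time, enough temporal resolution to fit and check terms of this kind directly against the observed melt shown in figure 6.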
4.2. Getting and using the data

In order to get maximum utility from the Antarctic device and its data in the short term (before the Grid software components were fully developed and available), we adopted an interim data encoding and exchange method that would fit directly into the environmental scientists’ current working practices, i.e. simple textual data files, compatible with Excel, sent by email from the researchers in the Antarctic. This has allowed the environmental scientists at Nottingham to make immediate use of the data with the tools and methods that they are familiar with. We have now reached the point where the data can also be published, archived and distributed via the device and sensor Grid infrastructure that we have been developing (as outlined in section 2). However, there are still many practical issues and choices to be resolved to determine how best to make this data – and other Grid facilities such as remote computation – available to the environmental scientists within their everyday work. This is one reason for our development of the simple visual data-flow user interface mentioned in section 2.5, since many of the scientists who we are working with are not programmers. Our hope is to develop this desktop user interface to the point where it can be used by the scientists with minimal additional effort compared to their existing practices, and to use it as a point of entry into their working environments through which we can then make other Grid services and facilities available to them. We have chosen to use a standalone desktop application rather than (say) a web portal because we wish to support richer and finer-grained interactivity.

4.3. Remote science issues

Working with a device – and colleagues – on the other side of the planet has raised many complications compared to local working. The researchers have had to overcome a huge variety of pragmatic issues in the process of deploying and operating the Antarctic device, ranging from the coordination of deliveries of parts across the globe, to on-site problems such as the fracturing of fixing bolts in the extremely cold conditions and the extremely short life of laptop batteries in this climate.

One critical set of issues has revolved around establishing and managing confidence in the Antarctic device, and as part of this, the handling of software and hardware failures. A fundamental challenge here is that only certain things can be done remotely; in some situations physical attention is unavoidable. Of course, this is not the complete show-stopper that it would typically be in a satellite-based system, but equally the cost profile is somewhat different (much cheaper devices, correspondingly more limited development effort), and the environmental pressures are also different.

Anderson and Lee [9] consider software fault tolerance in terms of four phases, which are directly relevant in this situation:
1. Error detection, i.e. determining that there is a fault.
2. Damage confinement and assessment, i.e. determining – and limiting – the scope of the problem, e.g. what data is affected, and how far incorrect data may have been distributed.
3. Error recovery, i.e. performing compensatory actions, e.g. to correct or withdraw erroneous data.
4. Fault treatment and continued service, i.e. dealing with the underlying cause of the fault, e.g. replacing or recalibrating a physical sensor.

For example (see also [19]):
1. It was observed at one point that the data logger was reporting negative values for PAR at a depth of 10m. This is clearly impossible, since it would indicate a negative amount of light: the error has been detected (in this case by a bounds or reasonableness check, of the kind sketched after this example).
2. By inspection, it could be seen that only certain values from this single sensor were apparently in error (damage assessment); if appropriate, the publication of the data could be delayed (damage confinement).
3. Only limited recovery is possible in this case, since the historical readings cannot be recaptured; error recovery is therefore limited to the publication of anomaly metadata which warns potential data users of values which should be disregarded (with a suitably flexible data format those values could be individually excluded).
4. The field researcher then went out to the device and determined, by direct inspection of the interface hardware, application of synthetic stimuli and comparison with similar reference devices, that the gain for this particular PAR sensor was too high, so that in bright light it exceeded the working range of the input, giving an overflow error. The gain was turned down, and the sensor recalibrated and redeployed (fault treatment and continued service).
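As a concrete illustration of the first two phases, a reasonableness check of this kind can be expressed very simply. The sketch below is purely illustrative (the class, method names and the assumed upper bound are our own, not part of the deployed logger or Grid software); it shows how an out-of-range reading might be flagged with anomaly metadata rather than silently discarded.

class RangeCheck {
    // PAR cannot be negative; the upper bound is an assumed, generous maximum for illustration.
    static final double PAR_MIN = 0.0;     // umol s-1 m-2
    static final double PAR_MAX = 3000.0;  // assumed value

    // Returns null if the reading looks reasonable, otherwise a short anomaly note
    // that can be published alongside the data rather than deleting the reading.
    static String checkPar(double value) {
        if (value < PAR_MIN) return "disregard: negative PAR (sensor/interface out of range?)";
        if (value > PAR_MAX) return "disregard: PAR above plausible maximum";
        return null;
    }

    public static void main(String[] args) {
        double[] samples = { 512.3, -17.9, 64.0 };  // made-up example values
        for (double v : samples) {
            String anomaly = checkPar(v);
            System.out.println(v + " -> " + (anomaly == null ? "ok" : anomaly));
        }
    }
}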
Another significant issue of remote – or more specifically distributed – science has been the effort required to coordinate the activities of researchers in Nottingham and in the Antarctic. As well as task-specific coordination and collaboration, a great deal of work is also needed if the distributed researchers are to feel a common involvement in the research and the social processes that support it on a day-to-day basis, for example keeping in touch, maintaining an awareness of promising directions to explore, developing a common agenda and mutual understanding, and so on. These things are not well addressed by emerging Grid technologies and approaches. An open-ended Access Grid [10] session would be about the best support on offer at present; however, this cannot be used from Davis Base because of the limited networking (a single shared satellite connection to the Internet) and the lack of a suitable installation on the base.

4.4. Grid software issues

The typical vision of the Grid [11] is of a pervasive, i.e. universal, computing and communication infrastructure, connecting – at least potentially and subject to security policies – everyone, everywhere. In principle, then, we might imagine placing the Antarctic device directly onto the Grid, allowing it to expose its resources (in this case the sensors, data log and logging program) through a standard service interface. However, there are two major problems with getting this device – and many other devices – onto the Grid in this way:
• The device does not have the code storage or computational power to run even a small Grid software stack, so it cannot directly host a Grid service; and
• The device is not – and cannot be – permanently connected to the Grid network, because this is (a) too expensive and (b) subject in any case to periods of non-availability.
Some may argue that these are only temporary problems, or ‘implementation details’, that will be solved in a few years’ time. We prefer (a) to do useful work in the meantime and (b) to wait and see whether this technological future is actually as perfectly uniform and free of practical problems as this view might suggest.

Consequently, we have adopted a dual strategy of defining common device and sensor service interfaces (which could be supported directly by a sufficiently capable networked device) and creating a default implementation framework which uses proxy services on the fixed Grid network to represent our current devices and sensors, with the actual device – in this case the Antarctic device – connected to its proxy as and when it can be, by whatever means are currently available. When the Antarctic device is not connected, the proxy can still provide data and queue reconfiguration requests, allowing Grid-based clients to be written as if the device were a first-class Grid citizen. This is described in more detail in [5].
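The following sketch conveys the proxy idea in a few lines of plain Java. It is an illustration of the pattern only: the class name, methods and the simple in-memory queue are invented for this example, not the interfaces of the actual framework described in [5].

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative device-proxy sketch (hypothetical names): the proxy lives on the fixed
// Grid network, serves cached data while the device is offline, and queues
// reconfiguration requests until the device next dials in.
class DeviceProxy {
    private final List<String> cachedReadings = new ArrayList<>();    // last data received
    private final Queue<String> pendingRequests = new ArrayDeque<>(); // e.g. "set scan interval 5 min"

    // Clients always talk to the proxy, whether or not the device is currently reachable.
    List<String> getReadings() {
        return new ArrayList<>(cachedReadings);
    }

    // Reconfiguration requests are simply queued while the device is offline.
    void requestReconfiguration(String request) {
        pendingRequests.add(request);
    }

    // Called when the satellite link next comes up: store new data, flush queued requests.
    void onDeviceConnected(List<String> newReadings) {
        cachedReadings.addAll(newReadings);
        while (!pendingRequests.isEmpty()) {
            System.out.println("forwarding to device: " + pendingRequests.poll());
        }
    }
}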
4.5. Sensor/device issues

The example of the problem with a PAR sensor in section 4.3 also highlights the need to work with kinds of data additional to the sensor measurements themselves. In that case, the metadata required included:
• Calibration metadata, i.e. what measured voltage corresponds to what actual level of PAR (in µmol s⁻¹ m⁻²), which may change from time to time due to adjustments or drift.
• Accuracy or fidelity metadata, i.e. how accurate the sensor is, and with what resolution it provides its measurements.
• Data validity or availability metadata, i.e. that some readings should be ignored (in that case any readings less than zero), and perhaps a reason for this.
• Structural metadata, i.e. which particular reading from the data logger (e.g. which column in the Excel file) corresponds to which sensor.
• Deployment metadata, i.e. where (in the world) the device and/or sensor is actually deployed.

We have adopted and extended the eXtensible Scientific Interchange Language (XSIL) [12] to describe the structure of tabular text-based data in a standard way. We are also exploring the possible use of SensorML [13] (which is being brought into the OpenGIS Consortium standardisation process) as one standard way of documenting some of this data (especially deployment and accuracy metadata). However, the choice of an XML format is only part of the solution; we also seek to facilitate the associated work with the device itself. For example, even something as simple as unique tagging of sensors and devices (using RFID tags or barcodes), together with suitable handheld support devices, would make it much easier to manage calibration data and link it back to the sensor at issue. Choice of suitable hardware and bus technologies (e.g. comparable to USB [14]) also makes it possible for sensors and devices to (a) describe themselves to some extent and (b) self-discover at least some aspects of their own deployment and structure. We are continuing to explore these issues in various EQUATOR projects.
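To make the metadata categories listed above concrete, the sketch below gathers them into a single hypothetical record that could accompany a sensor’s measurements. It is illustrative only and does not reflect our XSIL extensions or SensorML, which are the formats we are actually investigating for this purpose; all field names and values are invented for this example.

// Illustrative sketch (hypothetical names and values): the kinds of sensor metadata
// discussed above, gathered into one record.
class SensorMetadata {
    // Calibration: measured volts -> PAR in umol s-1 m-2 (may change over time)
    double voltsToParSlope = 250.0;      // assumed example value
    double voltsToParOffset = 0.0;
    String calibrationDate = "2003-01-20";

    // Accuracy / fidelity
    double accuracyPercent = 5.0;        // assumed
    double resolution = 0.1;             // smallest reported step, assumed

    // Validity: readings outside this range should be disregarded
    double validMin = 0.0;
    double validMax = 3000.0;            // assumed

    // Structural: which logger column this sensor's readings appear in
    int loggerColumn = 7;                // assumed

    // Deployment: where the sensor actually is
    String deployment = "Crooked Lake, Vestfold Hills; 3 m below the ice surface";
}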
5. Conclusion and Future Directions

The Grid is only one part of the total scope of e-Science. Through the design, construction and deployment of an environmental sensing device in the Antarctic we have been able to obtain collective primary experience of the many – often apparently mundane – activities and elements that together make up this particular scientific endeavour. This device is already providing data that goes substantially beyond that previously available to us. Making this device available on the Grid – together with the medical wearable computer and phone-based devices described in [5] – is also driving the design and development of new Grid interfaces and supporting technologies for these kinds of devices.

Our ambitions in the remainder of this project are:
• To continue to analyse and exploit the data that we are obtaining from the device, to increase our understanding of this Antarctic lake environment.
• To begin to do this using the desktop Grid interface that has been developed, and to use this as a platform from which we can explore other Grid possibilities, such as the more stereotypically large computational analyses on remote machines.
• To further develop the supporting software and devices, to explore support for configuration, calibration, management and trouble-shooting of physical devices such as this.
• To bridge between the normal Grid software infrastructure that we have been working with to date and some of the other ‘experience-oriented’ infrastructures developed and used on other parts of EQUATOR (e.g. EQUIP [15]).

Links

For more information see the EQUATOR website pages for this project [16], more data from and information about the Antarctic device [17], or the website for the sister medical devices project [18]. We anticipate an open-source release of the software infrastructure before the end of the project; email Chris Greenhalgh (cmg@cs.nott.ac.uk) or Tom Rodden (tar@cs.nott.ac.uk) in the first instance.

Acknowledgements

This work is supported by EPSRC Grant GR/R81985/01 “Advanced Grid Interfaces for Environmental e-science in the Lab and in the Field”, the EQUATOR Interdisciplinary Research Collaboration (EPSRC Grant GR/N15986/01), the Australian Antarctic Division, EPSRC Grant GR/R67743/01 “MYGRID: Directly Supporting the E-Scientist” and the EPSRC DTA scheme. We thank Greg Ross and Matthew Chalmers for contributions to the data analysis tool.

References

[1] Iridium Satellite LLC, http://www.iridium.com/ (verified 2003-07-28).
[2] I. Foster, C. Kesselman, J. Nick, S. Tuecke, “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Open Grid Service Infrastructure WG, Global Grid Forum, June 22, 2002.
[3] Australian Antarctic Division, “Antarctic Families and Communities: Davis”, http://www.antdiv.gov.au/default.asp?casid=404 (verified 2003-07-28).
[4] Campbell Scientific Canada Corp., “CR10X Measurement and Control Module and Accessories”, http://www.campbellsci.ca/CampbellScientific/Products_CR10X.html (verified 2003-07-28).
[5] T. Rodden, C. Greenhalgh, D. DeRoure, A. Friday, L. Tarasenko, H. Muller et al., “Extending GT to Support Remote Medical Monitoring”, Proceedings of the UK e-Science All Hands Meeting 2003, Nottingham, Sept. 2-4.
[6] Open Grid Services Architecture Data Access and Integration (OGSA-DAI), http://www.ogsadai.org.uk/ (verified 2003-07-28).
[7] B. A. Myers, “Taxonomies of visual programming and program visualization”, Journal of Visual Languages and Computing, pp. 97-123, March 1990.
[8] W. J. Schroeder, K. M. Martin, W. E. Lorensen, “The Design and Implementation of an Object-Oriented Toolkit for 3D Graphics and Visualization”, IEEE Visualization ’96, pp. 93-100.
[9] P. A. Lee and T. Anderson, “Fault Tolerance: Principles and Practice (Second Revised Edition)”, Springer-Verlag, 1990, ISBN 3-211-82077-9.
[10] The Access Grid Project, http://www.accessgrid.org/ (verified 2003-07-28).
[11] Ian Foster, Carl Kesselman (eds.), “The Grid: Blueprint for a New Computing Infrastructure”, Morgan Kaufmann, 1998.
[12] Roy Williams, “XSIL: Java/XML for Scientific Data”, http://www.cacr.caltech.edu/projects/xsil/xsil_spec.pdf (verified 2003-07-28).
[13] Open GIS Consortium Inc., “Sensor Model Language (SensorML) for In-situ and Remote Sensors”, OGC 02-026r4, 2002-12-20, http://www.opengis.org/techno/discussions/02-026r4.pdf (verified 2003-07-28).
[14] USB Implementers Forum, “USB 2.0 Specification”, http://www.usb.org/developers/docs (verified 2003-07-28).
[15] University of Nottingham, “The Equator Universal Platform (EQUIP)”, http://www.equator.ac.uk/technology/equip/index.htm (verified 2003-07-28).
[16] EQUATOR, “Environmental e-Science Project”, http://www.equator.ac.uk/projects/environmental/index.htm (verified 2003-07-28).
[17] Malcom Foster, “Data”, http://www.mrl.nott.ac.uk/~mbf/antarctica/data.htm (verified 2003-07-28).
[18] “MIAS EQUATOR Devices”, http://www.equator.ac.uk/mias (verified 2003-07-28).
[19] Greg Ross and Matthew Chalmers, “A Visual Workspace for Hybrid Multidimensional Scaling Algorithms”, to appear in Proc. IEEE Information Visualization (InfoVis) 2003, Seattle.