The monthly newsletter from the National e-Science Centre
NeSC News, Issue 70, June 2009
www.nesc.ac.uk

A Question of Integrity
By Iain Coleman

Spare a thought for poor Geoffrey Chang. He was on top of the world in 2006, a high-flying young protein crystallographer with a prestigious faculty position and a string of high-profile papers to his name. Then someone noticed a problem with his results. It turned out that Chang had been using software in which someone else had swapped two columns of data without his knowledge, and his results were all invalid. To Chang's credit, he published a swift and complete retraction – but if he had fully understood the provenance of his software, all that work by himself and others would not have been spoiled.

Provenance is evidence of authenticity and integrity, an assurance of quality and good process. We invoke provenance whenever we cite a paper from a peer-reviewed journal, or check the quality labels on food. But this is just one side of the provenance coin.

Shares in United Airlines dropped precipitously in just a few hours one day in 2008, even though the company was performing well. The culprit was a news story from 2002, describing the financial problems that the airline was facing at that time. For some reason, this story rose to the top of the Google search results for United Airlines six years later – and there was no date attached to the story, so investors assumed it was breaking news and sold their shares. This is the other side of provenance: records of identity and creation, tracking ownership and influences. We each demonstrate our own provenance whenever we show our birth certificate, and an unbroken record of ownership plays an important role in establishing a work of art as genuine.

Both these aspects of provenance have become more problematic in the digital age. For information on paper, the process of creating it will generally leave a paper trail of notes, drafts and approved versions. Furthermore, modifying the information or creating forgeries is difficult, and will usually leave telltale signs. This all serves to make provenance, while not infallible, at least fairly robust. When information is in electronic form, there is often no "bit trail": plagiarism is a matter of copy-and-paste, and it is easy to forge or alter the information undetected. Thus, a sound system of provenance is essential for judging the quality of data.

Considerations like these led James Cheney to establish the e-Science Institute's research theme on "Principles of Provenance". He wanted to investigate what provenance really is, why we think we need it, and how we can know when we have enough. As well as trying to answer these questions, he wanted to identify key challenges for provenance in the context of e-Science, and understand how it will develop in the future. The theme concluded on 15 May 2009 with a public lecture at eSI, in which Cheney outlined the specific goals of the theme and assessed how far it had come in achieving them.

For e-Science, provenance is primarily about scientific databases and scientific workflows, but is also crucial in the increasing use of electronic lab notebooks. In scientific databases, provenance is needed for two reasons. One is in demonstrating to a reasonable sceptic that the research results captured by the database are valid. Having a human being sign off on the data is still considered to be the most reliable method of quality control, but even that can go wrong. If some data is false or erroneous, can you track down how and where it has been used, and what publications depend upon it?
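That question is, at its core, one of dependency tracking. As a minimal sketch, not any particular provenance system and with the artefact names invented for illustration, a record of what each derived result was built from lets a simple traversal find everything downstream of a suspect item:

```python
# Minimal sketch of dependency-style provenance. Artefact names are invented.
from collections import defaultdict

derived_from = {
    "fig3_2008.png":    ["structure_v2.dat"],
    "structure_v2.dat": ["raw_diffraction.dat", "analysis_script_v1.py"],
    "paper_2006.pdf":   ["structure_v2.dat", "fig3_2008.png"],
}

# Invert the "derived from" links so we can walk forwards from a source.
used_by = defaultdict(set)
for artefact, sources in derived_from.items():
    for s in sources:
        used_by[s].add(artefact)

def downstream(artefact, seen=None):
    """Everything that directly or indirectly depends on `artefact`."""
    seen = set() if seen is None else seen
    for dependant in used_by[artefact]:
        if dependant not in seen:
            seen.add(dependant)
            downstream(dependant, seen)
    return seen

# If the analysis script turns out to be faulty, which outputs are tainted?
print(downstream("analysis_script_v1.py"))
# -> {'structure_v2.dat', 'fig3_2008.png', 'paper_2006.pdf'}
```

Real provenance systems record far more, such as who produced each artefact, when, and with which software, but even this skeleton turns "what depends on it?" into a query rather than detective work.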
The other need for provenance is in curation, ensuring that the data is still comprehensible in twenty years' time. There is an expectation in e-science that you don't need to know why data was gathered, only how it was gathered. But the "how" is not usually specific enough without some understanding of the "why".

Scientific workflows, the interfaces to grid computing, require provenance in order to understand and repeat computations. It is vital to be able to attest to the reliability of the results of a computation that has been shipped to different computers running different environments with different software libraries. Here, as ever, the question of what information you record at each stage of a process depends intrinsically on what you want it for.
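As an illustration of that point, and assuming the deliberately simple aim of rerunning a stage elsewhere and checking its inputs, a hypothetical record for one workflow stage might capture no more than the following (a sketch, not the format used by any actual workflow system; the field and file names are invented):

```python
# Sketch of a per-stage provenance record: what ran, on what inputs, where.
import hashlib, json, platform, sys, time

def provenance_record(stage_name, input_files, parameters):
    """Describe one workflow stage well enough to rerun and audit it."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    return {
        "stage": stage_name,
        "inputs": {p: digest(p) for p in input_files},  # content hashes, not just names
        "parameters": parameters,
        "python": sys.version,
        "platform": platform.platform(),                # operating system and architecture
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

# Record a hypothetical analysis stage before running it.
open("observations.csv", "w").write("x,y\n1,2\n")       # dummy input so the sketch runs
record = provenance_record("fit_model", ["observations.csv"], {"tolerance": 1e-6})
print(json.dumps(record, indent=2))
```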
There's a challenge here to teachers as well as technicians. Science has developed a set of norms around keeping contemporaneous lab notebooks and writing up the details of experiments, and these play a vital role in maintaining the integrity of scientific discovery. As storing and communicating information moves away from paper and onto digital media, this hard-won good practice must be passed on and translated to the new ways of working.

It would seem from all of this that provenance is an unalloyed good, but that may not be quite true. Keeping track of provenance can lead to security problems, particularly when information is cleared to be used by people who are not authorised to know some details of its creation. There can also be political problems, if some information is judged to have come from a source that is not in favour with the prevailing powers. Scientifically, one of the main problems relating to provenance is that researchers can be reluctant to put their more wildly speculative ideas online for all the world to see. There needs to be a safe space for tentative or controversial arguments. Chatham House Rules are a classic approach to provenance in these cases: they allow statements to be reported as long as they are not attributed to an individual. These rules effectively remove provenance so that people can speak more freely.

What has emerged from this theme is the cross-cutting nature of provenance. It recurs in many aspects of computer science, and presents important theoretical and practical problems. This suggests that it should be studied as a topic in itself, much like concurrency, security, or incremental computation. The theme has shown the need for more formal definitions of provenance, and the development of clear use cases and goals. One of the problems with promoting research in this area is the difficulty of getting work on provenance published: greater incentives are needed for people to think about these issues more clearly and in more general cases. An ultimate goal for provenance is the establishment of a complete causal dependence chain for research, but this moves the issues into territory more usually occupied by philosophers.

The key challenges ahead are to combine insights from all the different perspectives on provenance, and to build systems that exhibit the new ideas. Over the next ten years, we are likely to move to a world with provenance everywhere, as our use of data is increasingly tracked automatically by computer systems. If we improve our theoretical and practical understanding of provenance, we will be better able to face the security and privacy challenges to come.

Slides and a webcast from this event can be downloaded from http://www.nesc.ac.uk/esi/events/987/

eSI Public Lecture: "How Web 2.0 Technologies and Innovations are Changing e-Research Activities"
Prof. Mark Baker

The e-Science Institute is delighted to host a public lecture by Prof Mark Baker, Research Professor of Computer Science in the School of Systems Engineering at the University of Reading. The public lecture will be held at 4pm on June 16.

Technologies of various types appear in waves. Some are taken up and are successful, and others die out quickly. These innovations include new hardware, operating systems, tools and utilities, as well as applications, and also the way users interact with systems. The Web 2.0 arena seems to have been one of those areas that has taken off and changed the way we do things, not only on the Internet but also via the Web. When Tim O'Reilly first coined the term 'Web 2.0' back in 2004, many of us thought the area being referred to was fairly empty, but since those days the extent to which people collaborate and communicate, and the range of tools and technologies that have appeared, have dramatically changed the way we do things. In this presentation, we will look at the way Web 2.0 technologies have developed, and investigate their impact and influence on services, applications, users and overall usability.

More information is available at: http://www.nesc.ac.uk/esi/events/960/

Preparing particle physics for the many-core future
By Alan Gray, EPCC

A recent project at EPCC has made major enhancements to an international particle physics code repository. The work has enabled the effective exploitation of emerging technologies, including those incorporating chips with many cores, and has led to new scientific results worldwide.

EPCC collaborated with UK and US academics to develop new functionality within the USQCD code suite. This software, originally developed in the US, is heavily used in the UK and throughout the world. Spacetime is represented computationally as a lattice of points, allowing for a wide range of simulations aimed at probing our current understanding of the laws of physics, and searching for new and improved theories.

The Standard Model of particle physics, the group of theories which encompasses our current understanding of the physical laws, is known to be extremely accurate. However, it cannot explain all observed phenomena, and is therefore thought to be an approximation of a deeper theory yet to be devised. Progress in this fundamental research area requires, in combination with experimental measurements such as those at the Large Hadron Collider, demanding calculations that stress even the world's largest supercomputers: the rate of progress depends on the performance of the code on such machines.

Until recently, increases in the performance of computing systems were achieved largely through increases in the clock frequency of the processors. This trend is reaching its limits due to power requirements and heat dissipation problems. Instead, the drive towards increased processing power is being satisfied by the inclusion of multiple processing elements on each chip. Current systems contain dual- or quad-core chips, and the number of cores per chip is expected to continue to rise.
This architecture poses programming challenges, not least for scientific computing on large parallel systems. Like many HPC codes, the original USQCD software had no mechanism for recognising which processes were associated with which chip: each process was treated as distinct from every other, and all communication was done through the passing of messages. A threading model has now been developed within the software: processes running within a chip are organised as a group of threads that can communicate with one another directly through the memory they share, while one thread per chip (or group) handles the messages needed to communicate with external chips. This new programming model, which maps more closely onto the emerging hardware, shows modest performance advantages on current systems, but the real efficiency gains will be realised as the number of cores per chip rises in the future.
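The shape of that model can be sketched in a few lines. The code below is not the USQCD implementation, only an illustration of the pattern in Python, assuming mpi4py and NumPy are available: each MPI process stands for one chip, its worker threads update a shared array directly in memory, and one designated thread per process performs the inter-chip message passing.

```python
# Illustrative hybrid model: MPI between chips, shared-memory threads within a chip.
import threading
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

THREADS_PER_CHIP = 4
lattice = np.zeros(1_000_000)   # local portion of the lattice, shared by all threads on this chip
halo = np.zeros(1_000)          # boundary data exchanged with neighbouring chips

def worker(thread_id, barrier):
    # Each thread updates its own slice of the shared array directly in memory.
    lo = thread_id * lattice.size // THREADS_PER_CHIP
    hi = (thread_id + 1) * lattice.size // THREADS_PER_CHIP
    lattice[lo:hi] += 1.0       # stand-in for the real lattice update
    barrier.wait()              # wait until every thread on this chip has finished
    if thread_id == 0 and size > 1:
        # Only one thread per chip talks to MPI: exchange halo data around a ring.
        comm.Sendrecv_replace(halo, dest=(rank + 1) % size, source=(rank - 1) % size)

barrier = threading.Barrier(THREADS_PER_CHIP)
threads = [threading.Thread(target=worker, args=(i, barrier))
           for i in range(THREADS_PER_CHIP)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Run under mpirun with one process per chip, this arrangement avoids message-passing overhead for on-chip communication, which is the efficiency gain described above.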
The project has also created additional functionality within the USQCD software. The code has been enhanced to allow calculations using new theories beyond the Standard Model, signals of which may be discovered experimentally at the Large Hadron Collider. Furthermore, improvements have been implemented which allow calculations (within the Standard Model framework) to an unprecedented precision, in turn allowing accurate testing against experimental results: any discrepancies may point to new theories.

USQCD Software: www.usqcd.org/usqcd-software/

Photo: Large Hadron Collider tunnel (credit: Mike Procario)

All Hands: Call for Submission of Abstracts

Authors are invited to submit abstracts of unpublished, original work for this year's All Hands Meeting, to be held in Oxford on 7-9 December. Authors are asked to submit to one of the following themes or as a 'general paper': Social Sciences, Arts and Humanities, Medical and Biological Sciences, Physical and Engineering Sciences, Environmental and Earth Sciences, Sharing and Collaboration, Distributed and High Performance Computing Technologies, Data and Information Management, User Engagement or Foundations of e-Science.

This year, we would especially like to encourage industry collaborators to take a full part in the conference, including contributing to papers. We will be introducing the Tony Hey prize for the best student paper, named in honour of the outstanding contribution Tony has made to UK e-Science. This competition is open to any UK student. The prize winner will be asked to present the paper in a special session. Further details will be available soon.

Important dates: 30 June 2009 - deadline for abstract submission; 19 August 2009 - decisions to authors.

More information is available at: http://www.allhands.org.uk/papers

Gridipedia: relaunched and expanded
By K. Kavoussanakis, EPCC

Gridipedia – the European online repository of Grid tools and information – preserves and makes accessible a whole range of resources on Grid computing and related technologies such as cloud computing and virtualization. Originally populated with results from the BEinGRID research project, it is continuously enriched by commercial organisations and research projects, and welcomes contributions.

Gridipedia (www.gridipedia.eu) has expanded massively recently, and its contents include case studies, software components, business analysis and legal advice. Unique visitor numbers exceed 2,000 a month and include collaborators using Gridipedia to distribute software and research results.

Gridipedia was initially populated with the results of BEinGRID, which focuses on extracting best practice and common components from a series of pilot implementations of Grid computing in diverse business settings. More recently the site has hosted work by commercial organisations, such as case studies from Univa, Sun, Digipede and IBM, as well as other research projects such as BREIN, Akogrimo and TrustCom, and material on behalf of open source middleware groups including GRIA and Globus.

In the long term, Gridipedia aims to become the definitive distribution channel for cloud and Grid technologies: where vendors will meet buyers from the commercial sector; where consultants, suppliers and programmers will exhibit and trade their offerings; and where potential buyers at all levels will find the information and products they require.

Relaunched in May, Gridipedia targets key decision-makers in both business and technical roles who will oversee the implementation of Grid in their businesses or who are looking to further develop their Grid capabilities. It demonstrates the business benefits of Grid technology, for example reduced time to market, new service offerings, reduced costs, improved quality and greater flexibility.

Gridipedia is soliciting contributions from the user community. You can join Gridipedia by contributing to the online Grid voices blog, or you can submit your software or article for publication on the site. We look forward to your contributions!

Advanced Distributed Services Summer School

The NGS, in conjunction with the training, outreach and education team at NeSC, is pleased to announce that registration for the Advanced Distributed Services Summer School 2009 is now open. The summer school will run from 7-22 September at Cosener's House, Oxfordshire, and will bring together students and many of the leading researchers and technology providers in the field of distributed computing. It is a chance for students not only to take part in a unique learning experience, including many hands-on tutorials, but also to spend time in a small group with the leaders in the field.

The aim of the school is to help develop the skills of those involved in providing computational support for research in a wide range of disciplines. In particular, the school will focus on the use of, provision of interfaces to, and development of services based on the composition or aggregation of computational or data services. The school will show how to compose a variety of services into bioinformatics workflows that can be used to support biomedical research processes, how to use and develop lab- or department-scale clusters of computers to run simulations, how to work with the NGS to compose protein simulation models for running on UK or international supercomputers, and how to develop a portal to support legacy applications. The school aims to give students familiarity with, and the tools to use, the facilities available in the UK and internationally, to transform current research practices in the light of the developing services being made available in a highly networked world.

The cost of the summer school, including accommodation and meals (except Tuesday and Thursday), is £276. Please send further queries to adsss09@lists.nesc.ac.uk or register on the ADSSS web site (http://www.iceage-eu.org/adsss09/index.cfm).
Cloud Computing Users and the NGS

The Belfast e-Science Centre (BeSC) offers a hosting-on-demand service within the NGS in support of UK academic users. The service is currently used extensively by BeSC's commercial partners, but BeSC are keen to have more academic users. The service enables a remote user to deploy software onto servers within the BeSC domain and to manage these deployed services remotely.

The BeSC service hosting cloud can be accessed via a web UI and through an API that BeSC are developing (currently called libcloud); a Europar 09 paper on libcloud can be found at http://www.besc.ac.uk/services/public/papers/2009europar-cloud.pdf. The API is intended to provide a provider-neutral interface to remote resources such as those provided by Amazon, Flexiscale and others, as well as the BeSC hosting cloud; plugins for all of these providers are part of the library.

If you have an interest in using the hosting cloud or this library in your development, and/or in helping its development, please contact Terry Harmer (t.harmer@besc.ac.uk) for details.
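The idea of a provider-neutral interface with per-provider plugins can be illustrated with a short, hypothetical sketch. This is not the BeSC libcloud API (the paper linked above documents the real interface); the point is only that deployment logic is written once against a common set of operations, with a plugin per provider filling in the details. The class and method names below are invented, and the plugins are stubs.

```python
# Hypothetical provider-neutral hosting interface with stub plugins.
from abc import ABC, abstractmethod

class HostingProvider(ABC):
    """The small set of operations a deployment tool needs, independent of provider."""
    @abstractmethod
    def deploy(self, image: str) -> str: ...        # returns a server identifier
    @abstractmethod
    def status(self, server_id: str) -> str: ...
    @abstractmethod
    def destroy(self, server_id: str) -> None: ...

class BescHostingCloud(HostingProvider):
    """Stub plugin standing in for the BeSC hosting cloud."""
    def deploy(self, image): return "besc-0001"
    def status(self, server_id): return "running"
    def destroy(self, server_id): pass

class AmazonEC2(HostingProvider):
    """Stub plugin standing in for a commercial provider."""
    def deploy(self, image): return "i-1234abcd"
    def status(self, server_id): return "running"
    def destroy(self, server_id): pass

def deploy_service(provider: HostingProvider, image: str) -> str:
    """Application code is written once against the neutral interface."""
    server = provider.deploy(image)
    if provider.status(server) != "running":
        raise RuntimeError("deployment failed")
    return server

# Swapping providers is a one-line change for the caller.
print(deploy_service(BescHostingCloud(), "my-service.img"))
print(deploy_service(AmazonEC2(), "my-service.img"))
```

Swapping the BeSC cloud for a commercial provider then becomes a one-line change in the calling code, which is the portability the library is aiming for.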
Mathematical, Physical and Engineering Sciences: Online Table of Contents Alert

A new issue of Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences is now available online at http://rsta.royalsocietypublishing.org/, and the table of contents below is available at http://rsta.royalsocietypublishing.org/content/vol367/issue1897/?etoc

Introduction: Crossing boundaries: computational science, e-Science and global e-Infrastructure - by Peter V. Coveney and Malcolm P. Atkinson
Sector and Sphere: the design and implementation of a high-performance data cloud - by Yunhong Gu and Robert L. Grossman
GridPP: the UK grid for particle physics - by D. Britton, A.J. Cass, P.E.L. Clarke, J. Coles, D.J. Colling, A.T. Doyle, N.I. Geddes, J.C. Gordon, R.W.L. Jones, D.P. Kelsey, S.L. Lloyd, R.P. Middleton, G.N. Patrick, R.A. Sansum, and S.E. Pearce
Louisiana: a model for advancing regional e-Research through cyberinfrastructure - by Daniel S. Katz, Gabrielle Allen, Ricardo Cortez, Carolina Cruz-Neira, Raju Gottumukkala, Zeno D. Greenwood, Les Guice, Shantenu Jha, Ramesh Kolluru, Tevfik Kosar, Lonnie Leger, Honggao Liu, Charlie McMahon, Jarek Nabrzyski, Bety Rodriguez-Milla, Ed Seidel, Greg Speyrer, Michael Stubblefield, Brian Voss, and Scott Whittenburg
Building a scientific data grid with DIGS - by Mark G. Beckett, Chris R. Allton, Christine T.H. Davies, Ilan Davis, Jonathan M. Flynn, Eilidh J. Grant, Russell S. Hamilton, Alan C. Irving, R.D. Kenway, Radoslaw H. Ostrowski, James T. Perry, Jason R. Swedlow, and Arthur Trew
Flexible selection of heterogeneous and unreliable services in large-scale grids - by Sebastian Stein, Terry R. Payne, and Nicholas R. Jennings
Standards-based network monitoring for the grid - by Jeremy Nowell, Kostas Kavoussanakis, Charaka Palansuriya, Michal Piotrowski, Florian Scharinger, Paul Graham, Bartosz Dobrzelecki, and Arthur Trew
The Archaeotools project: faceted classification and natural language processing in an archaeological context - by S. Jeffrey, J. Richards, F. Ciravegna, S. Waller, S. Chapman, and Z. Zhang
Integrating Open Grid Services Architecture Data Access and Integration with computational Grid workflows - by Tamas Kukla, Tamas Kiss, Peter Kacsuk, and Gabor Terstyanszky
Improved performance control on the Grid - by M.E. Tellier, G.D. Riley, and T.L. Freeman
Novel submission modes for tightly coupled jobs across distributed resources for reduced time-to-solution - by Promita Chakraborty, Shantenu Jha, and Daniel S. Katz
Real science at the petascale - by Radhika S. Saksena, Bruce Boghosian, Luis Fazendeiro, Owain A. Kenway, Steven Manos, Marco D. Mazzeo, S. Kashif Sadiq, James L. Suter, David Wright, and Peter V. Coveney
Enabling cutting-edge semiconductor simulation through grid technology - by Dave Reid, Campbell Millar, Scott Roy, Gareth Roy, Richard Sinnott, Gordon Stewart, Graeme Stewart, and Asen Asenov
UKQCD software for lattice quantum chromodynamics - by P.A. Boyle, R.D. Kenway, and C.M. Maynard
Adaptive distributed replica-exchange simulations - by Andre Luckow, Shantenu Jha, Joohyun Kim, Andre Merzky, and Bettina Schnor
High-performance computing for Monte Carlo radiotherapy calculations - by P. Downes, G. Yaikhom, J.P. Giddy, D.W. Walker, E. Spezi, and D.G. Lewis

OGSA-DAI: from open source product to open source project
By Mike Jackson

The OGSA-DAI project has been funded by EPSRC for an additional year, until April 2010. This funding will enable us to evolve OGSA-DAI from an open source product into an open source project.

An international community of users and developers has formed around OGSA-DAI, our unique open source product for access to and integration of distributed heterogeneous data resources. This community includes projects and institutions in fields ranging from medical research, environmental science and geosciences to the arts, humanities and business.

Moving to an open source project will provide the community with a focal point for the evolution, development, use and support of OGSA-DAI and its related components, providing a means by which members can develop and release their components alongside the core product. It will also provide an avenue to ensure the sustainability of their components.

Over the next few months we will set in place the governance and infrastructure of the OGSA-DAI open source project. This will be done in conjunction with key community members, and will draw upon the expertise of our OMII-UK partners in Manchester and Southampton and in the Globus Alliance. We aim to roll out our open source project site in October.

Our move to an open source project contributes to OMII-UK's vision of promoting software sustainability, and will ensure that the OGSA-DAI product can live on outwith any single institution or funding stream. In addition, we will continue to develop the product and engage with international standardisation activities:

- Work on distributed query processing will continue, looking at more powerful distributed relational queries and integrating work on relational-XML queries produced by Japan's AIST.
- A review of performance, scalability and robustness will be undertaken, allowing us to identify key areas for redesign.
- New components for security, data delivery via GridFTP, and access to indexed and SAGA file resources will be released.
- A new version of OGSA-DAI, with many improvements including refactored APIs and exploitation of Java 1.6, will be released.
- We will participate in interoperability testing with the OGF DAIS working group, a vital part of the evolution of the WS-DAI specifications into standards.

The OGSA-DAI project, which involves both EPCC and the National e-Science Centre, is funded by EPSRC through OMII-UK.
OGSA-DAI: info@ogsadai.org.uk

eSI Visitor Seminar: "Trust and Security in Distributed Information Infrastructures"

The e-Science Institute is delighted to host a seminar with Professor Vijay Varadharajan, Microsoft Chair in Innovation in Computing at Macquarie University, at 4pm on June 15. The seminar is open to all interested parties in academia and industry.

The Internet and web technologies are transforming the way we work and live. Fundamental to many of the technological and policy challenges in this technology-enabled society and economy are issues of trust. For instance, when an entity receives some information from another entity, questions arise as to how much trust should be placed in the received information, how to evaluate the overall trust, and how to incorporate the evaluated trust into decision-making systems. Such issues can arise at many levels in computing: at a user level, at a service level (in a distributed service-oriented architecture), at a network level between one network device and another (for example in a sensor network), or at a process level within a system. There is also a social dimension involving how different societies and cultures value and evaluate trust in their contexts. Recently we have been witnessing users, especially the younger generation, placing greater trust in the information available on Internet applications such as Facebook and MySpace when it comes to making online decisions.

The notion of trust has been around for many decades, if not centuries, in disciplines such as psychology, philosophy and sociology, as well as in technology. From a security technology point of view, trust has always been regarded as a foundational building block. In this talk, we will take a journey through the different notions of trust in the secure computing technology world and their evolution from the operating systems context to distributed systems, trusted computing platforms and trusted online applications. We will look at some of the challenges involved in developing trusted services and infrastructures, and their influence on growing the digital economy.

More information is available at: http://www.nesc.ac.uk/esi/events/998/

XtreemOS Summer School
Wadham College, University of Oxford, September 7-11, 2009

XtreemOS is a Linux-based operating system that includes Grid functionality. It is characterised by properties such as transparency, hiding the complexity of the underlying distributed infrastructure; scalability, supporting hundreds of thousands of nodes and millions of users; and dependability, providing reliability, high availability and security.

The XtreemOS Summer School will include lectures on modern distributed paradigms such as Grid computing, Cloud computing and network-centric operating systems. It will combine lectures from research leaders shaping the future of distributed systems with contributions from world leaders in deploying and exploiting distributed infrastructures. Hands-on laboratory exercises and practical sessions using XtreemOS will give participants experience of using modern distributed systems.
The aims of the XtreemOS Summer School are: to introduce participants to emerging computing paradigms such as Grid computing and Cloud computing; to provide lectures and practical courses on novel techniques for achieving scalability, high availability and security in distributed systems; to present Grid applications in the domains of e-Science and business; and to provide a forum for participants to discuss their research work and share experience with experienced researchers.

An online registration form is available at http://www.xtreemos.eu/xtreemos-events/xtreemos-summer-school-2009/registration-form. The deadline for registration is July 26th 2009.

More information: http://www.xtreemos.eu/xtreemos-events/xtreemos-summer-school-2009

Forthcoming Events Timetable

June
2-5: BEinGRID EC Review, NeSC - http://www.nesc.ac.uk/esi/events/959/
9: Leaping Hurdles: Planning IT Provision for Researchers, NeSC - http://www.nesc.ac.uk/esi/events/974/
10: Implementation of the Data Audit Framework - Progress and Sustainability, NeSC - http://www.nesc.ac.uk/esi/events/994/
15: eSI Visitor Seminar: "Trust and Security in Distributed Information Infrastructures", eSI - http://www.nesc.ac.uk/esi/events/998/
16: eSI Public Lecture: "How Web 2.0 Technologies and Innovations are Changing e-Research Activities", eSI - http://www.nesc.ac.uk/esi/events/960/
17: Stakeholder meeting to launch the National Managed Clinical Network for Children with Exceptional Healthcare Needs, NeSC - http://www.nesc.ac.uk/esi/events/995/
25 June-3 July: Analysis of Fluid Stability, ICMS - http://www.icms.org.uk/workshops/stability

This is only a selection of the events happening in the next few months. For the full listing go to the following websites:
Events at the e-Science Institute: http://www.nesc.ac.uk/esi/esi.html
External events: http://www.nesc.ac.uk/events/ww_events.html

If you would like to hold an e-Science event at the e-Science Institute, please contact: Conference Administrator, National e-Science Centre, 15 South College Street, Edinburgh, EH8 9AA. Tel: 0131 650 9833. Fax: 0131 650 9819. Email: events@nesc.ac.uk

This NeSC Newsletter was edited by Gillian Law. Email: glaw@nesc.ac.uk
The deadline for the July 2009 issue is June 19, 2009.