NeSC News A Question of Integrity Issue 70 June 2009

The monthly newsletter from the National e-Science Centre
A Question of Integrity
By Iain Coleman
Spare a thought for poor Geoffrey
Chang. He was on top of the world
in 2006, a high-flying young protein
crystallographer with a prestigious
faculty position and a string of
high-profile papers to his name.
Then someone noticed a problem
with his results. It turned out that
Chang had been using software in
which someone else had swapped
two columns of data without his
knowledge, and his results were
all invalid. To Chang’s credit, he
published a swift and complete
retraction – but if he had fully
understood the provenance of his
software, all that work by himself and
others would not have been spoiled.
Provenance is evidence of
authenticity and integrity, an
assurance of quality and good
process. We invoke provenance
whenever we cite a paper from a
peer-reviewed journal, or check the
quality labels on food. But this is just
one side of the provenance coin.
Shares in United Airlines dropped
precipitously in just a few hours
one day in 2008, even though the
company was performing well. The
culprit was a news story from 2002,
describing the financial problems that
the airline was facing at that time. For
some reason, this story rose to the
top of the Google search results for
United Airlines six years later – and
there was no date attached to the
story, so investors assumed it was
breaking news and sold their shares.
This is the other side of provenance:
records of identity and creation,
tracking ownership and influences.
We each demonstrate our own
provenance whenever we show our
birth certificate, and an unbroken
record of ownership plays an
important role in establishing a work
of art as genuine.
Both these aspects of provenance
have become more problematic
in the digital age. For information
on paper, the process of creating
it will generally leave a paper trail
of notes, drafts, and approved
versions. Furthermore, modifying the
information or creating forgeries is
difficult, and will usually leave
telltale signs. This all serves to make
provenance, while not infallible, at
least fairly robust. When information
is in electronic form, often there is
no “bit trail”: plagiarism is a matter
of copy-and-paste, and it is easy
to forge or alter the information
undetected. Thus, a sound system of
provenance is essential for judging
the quality of data.
Considerations like these led
James Cheney to establish the
e-Science Institute’s research theme
on “Principles of Provenance”.
He wanted to investigate what
provenance really is, why we think
we need it, and how we can know
when we have enough. As well as
trying to answer these questions, he
wanted to identify key challenges
for provenance in the context of
e-Science, and understand how it will
develop in the future. The theme
concluded on 15 May 2009 with
a public lecture at eSI, in which
Cheney outlined the specific goals of
the theme and assessed how far it
had come in achieving them.
For e-Science, provenance is
primarily about scientific
databases and scientific
workflows, but is also
crucial in the increasing
use of electronic lab notebooks.
In scientific databases,
provenance is needed
for two reasons. One
is in demonstrating to
a reasonable sceptic
that the research results
captured by the database
are valid. Having a human
being sign off on the data
is still considered to be
the most reliable method of quality
control, but even that can go wrong.
If some data is false or erroneous,
can you track down how and
where it has been used, and what
publications depend upon it?
The other need for provenance is in
curation, ensuring that the data is
still comprehensible in twenty years’
time. There is an expectation in
e-science that you don’t need to know
why data was gathered, only how it
was gathered. But the “how” is not
usually specific enough without some
understanding of the “why”.
Scientific workflows, the interfaces to
grid computing, require provenance
in order to understand and repeat
computations. It is vital to be able to
attest to the reliability of the results of
a computation that has been shipped
to different computers running
different environments with different
software libraries. Here, as ever, the
question of what information you
record at each stage of a process
depends intrinsically on what you
want it for.
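To make the idea concrete, here is a minimal sketch of recording provenance for each step of a workflow. It is an illustration only, not a system described in the lecture: the `run_step` and `dependents` helpers, the record fields and the toy steps are all assumptions made for this example. Each step logs content hashes of its inputs and output together with a few environment details, and the log can then be searched for every step that depends on a suspect piece of data.

```python
import hashlib
import json
import platform
import sys


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that identifies a piece of data by its content."""
    return hashlib.sha256(data).hexdigest()


def run_step(name, func, inputs, log):
    """Run one workflow step and append a provenance record to `log`.

    `inputs` maps (hypothetical) input names to bytes; the step returns bytes.
    """
    output = func(*inputs.values())
    log.append({
        "step": name,
        "inputs": {key: fingerprint(value) for key, value in inputs.items()},
        "output": fingerprint(output),
        # Environment details: the same step may behave differently elsewhere.
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    })
    return output


def dependents(log, digest):
    """Return the steps whose results depend, directly or indirectly, on `digest`."""
    tainted = {digest}
    affected = []
    for record in log:  # records are appended in execution order
        if tainted & set(record["inputs"].values()):
            tainted.add(record["output"])
            affected.append(record["step"])
    return affected


log = []
raw = b"1.0,2.0,3.0"
other = b"unrelated calibration"
cleaned = run_step("clean", lambda d: d.replace(b",", b" "), {"raw": raw}, log)
summary = run_step("summarise", lambda d: str(len(d.split())).encode(),
                   {"cleaned": cleaned}, log)
run_step("calibrate", lambda d: d.upper(), {"other": other}, log)

# Which steps would be invalidated if the raw data turned out to be wrong?
print(dependents(log, fingerprint(raw)))
print(json.dumps(log[0], indent=2))
```

What to record remains a design choice: this sketch keeps only the interpreter version and platform string, whereas a real workflow system would probably also need library versions and parameters, depending on what the provenance is wanted for.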
There’s a challenge here to teachers
as well as technicians. Science has
developed a set of norms around
keeping contemporaneous lab
notebooks and writing up the details
of experiments, and these play a
vital role in maintaining the integrity
of scientific discovery. As storing and
communicating information moves
away from paper and onto digital
media, this hard-won good practice
must be passed on and translated to
the new ways of working.
It would seem from all of this that
provenance is an unalloyed good, but
that may not be quite true. Keeping
track of provenance can lead to
security problems, particularly when
information is cleared to be used by
people who are not authorised to
know some details of its creation.
There can also be political problems,
if some information is judged to have
come from a source that is not in
favour with the prevailing powers.
Scientifically, one of the main
problems relating to provenance is
that researchers can be reluctant
to put their more wildly speculative
ideas online for all the world to see.
There needs to be a safe space for
tentative or controversial arguments.
Chatham House Rules are a classic
approach to provenance in these
cases: they allow statements to be
reported as long as they are not
attributed to an individual. These
rules effectively remove provenance
so that people can speak more freely.
What has emerged from this
theme is the cross-cutting nature
of provenance. It recurs in many
aspects of computer science, and
presents important theoretical and
practical problems. This suggests
that it should be studied as a topic
in itself, much like concurrency,
security, or incremental computation.
The theme has shown the need
for more formal definitions of
provenance, and the development
of clear use cases and goals. One
of the problems with promoting
research in this area is the difficulty
of getting work on provenance
published: greater incentives are
needed for people to think about
these issues more clearly and in
more general cases. An ultimate goal
for provenance is the establishment
of a complete causal dependence
chain for research, but this moves
the issues into territory more usually
occupied by philosophers.
The key challenges ahead are to
combine insights from all the different
perspectives on provenance, and to
build systems that exhibit the new
ideas. Over the next ten years, we
are likely to move to a world with
provenance everywhere, as our
use of data is increasingly tracked
automatically by computer systems.
If we improve our theoretical
and practical understanding of
provenance, we will be better able
to face the security and privacy
challenges to come.
Slides and a webcast from this event
can be downloaded from http://www.
eSI Public Lecture:
“How Web 2.0 Technologies
and Innovations are Changing
e-Research Activities”
Prof. Mark Baker
The e-Science Institute is delighted
to host a public lecture by Prof
Mark Baker, Research Professor of
Computer Science at the University
of Reading in the School of Systems
Engineering. The public lecture will
be held at 4pm on June 16.
Technologies of various types
appear in waves. Some are taken
up and are successful, and others
die out quickly. These innovations
include new hardware, operating
systems, tools and utilities, as well
as applications, and also, the way
users interact with systems. The Web
2.0 arena seems to have been one
of those areas that has taken off and
changed the way we do things; not
only on the Internet, but also via the
Web. When Tim O’Reilly first coined
the term ‘Web 2.0’ back in 2004,
many of us thought the area being
referred to was fairly empty, but since
those days, the extent to which people
collaborate and communicate, and the
range of tools and technologies that
have appeared, have dramatically
changed the way we do things. In
this presentation, we will look at the
way Web 2.0 technologies have
developed, and investigate their
impact and influence on services,
applications, users, and overall
More information is available at:
Preparing particle physics for the many-core future
By Alan Gray, EPCC
A recent project at EPCC has
performed major enhancements to
an international particle physics code
repository. The work has enabled
the effective exploitation of emerging
technologies, including those
incorporating chips with many cores,
and has led to new scientific results.
EPCC collaborated with UK and
US academics to develop new
functionality within the USQCD
code suite. This software, originally
developed in the US, is heavily
used in the UK and throughout the
world. Spacetime is represented
computationally as a lattice of
points, allowing for a wide range of
simulations aimed at probing our
current understanding of the laws of
physics, and searching for new and
improved theories.
The Standard Model of Particle
Physics, a group of theories
which encompasses our current
understanding of the physical
laws, is known to be extremely
accurate. However, it is not able to
explain all observed phenomena,
and is therefore thought to be an
approximation to an as-yet-undevised
deeper theory. Progress in this
fundamental research area requires,
in combination with experimental
measurements such as those at the
Large Hadron Collider, demanding
calculations that stress even the
world’s largest supercomputers: the
rate of progress depends on the code
performance on such machines.
Until recently, increases in the
performance of computing systems
have been achieved largely through
increases in the clock frequency
of the processors. This trend is
reaching its limits due to power
requirements and heat dissipation
problems. Instead the drive towards
increased processing power is being
satisfied by the inclusion of multiple
processing elements on each chip.
Current systems contain dual or
quad-core chips, and the number
of cores per chip is expected to
continue to rise.
This layout poses programming
challenges, not least for scientific
computing on large parallel systems.
Like many HPC codes, the original
USQCD software had no mechanism
for understanding which processes
are associated with which chip:
all processes were treated as equally
distinct from one another, and
communication was universally done
through the passing of messages. A
threading model has been developed
within the software, where processes
running within a chip can be
organised as a group of threads, and
can communicate with one another
directly through memory which
they share. One thread per chip
(or group) handles the messages
needed to communicate with external
chips. This new programming
model, which more closely maps
on to the emerging hardware, can
display some modest performance
advantages in current systems,
but the real efficiency gains will be
realised as the number of cores per
chip rises in the future.
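The model described above can be illustrated with a small sketch. It is not the USQCD implementation: the function names, the use of queues to stand in for message passing between chips, and the toy summation are all assumptions made for this example. Threads on one “chip” write into memory they share, and only one leader per chip exchanges messages with the other chip.

```python
import queue
import threading


def run_chip(chip_id, values, cores, outbox, inbox, results):
    """Simulate one chip: `cores` threads share memory; one leader passes messages."""
    partial = [0] * cores  # shared memory visible to every thread on this chip

    def worker(core):
        # Each thread sums its own slice and writes straight into shared memory.
        partial[core] = sum(values[core::cores])

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Only the leader talks to the other chip, by passing messages.
    local_total = sum(partial)
    outbox.put(local_total)
    remote_total = inbox.get()
    results[chip_id] = local_total + remote_total


# Two chips, each with four cores, exchanging one message in each direction.
chip0_to_chip1 = queue.Queue()
chip1_to_chip0 = queue.Queue()
results = {}
chips = [
    threading.Thread(target=run_chip,
                     args=(0, list(range(0, 50)), 4,
                           chip0_to_chip1, chip1_to_chip0, results)),
    threading.Thread(target=run_chip,
                     args=(1, list(range(50, 100)), 4,
                           chip1_to_chip0, chip0_to_chip1, results)),
]
for chip in chips:
    chip.start()
for chip in chips:
    chip.join()

print(results)  # both chips hold the global sum of 0..99
```

The point of the layout is that the number of messages depends on the number of chips, not the number of cores, so adding cores per chip adds shared-memory work without adding message traffic. Python threads are used here purely to show the structure; they do not deliver the parallel speed-up that a compiled code would see.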
The project has also created
additional functionality within the
USQCD software. The code has
been enhanced to allow calculations
using new theories beyond the
Standard Model, signals of which
may be discovered experimentally
at the Large Hadron Collider.
Furthermore, improvements
have been implemented which
allow calculations (within the
Standard Model framework) to
an unprecedented precision; these
in turn allow accurate testing
against experimental results: any
discrepancies may point to new
USQCD Software:
Large Hadron Collider tunnel
Photo credit: Mike Procario
All Hands: Call for Submission of Abstracts
Authors are invited to submit abstracts of unpublished, original work for this year’s All Hands Meeting, to be held in
Oxford on December 7-9.
Authors are asked to submit to one of the following themes or as a ‘general paper’: Social Sciences, Arts and
Humanities, Medical and Biological Sciences, Physical and Engineering Sciences, Environmental and Earth Sciences,
Sharing and Collaboration, Distributed and High Performance Computing Technologies, Data and Information
Management, User Engagement or Foundations of e-Science.
This year, we would especially like to encourage industry collaborators to take a full part in the conference, including
contributing to papers. We will be introducing the Tony Hey prize for the best student paper, named in honour of the outstanding contribution
Tony has made to UK e-Science. This competition is open to any UK student. The prize winner will be asked to
present the paper in a special session. Further details will be available soon.
Important Dates: 30 June 2009 - Deadline for abstract submission and 19 August 2009 - Decisions to authors
More information is available here:
Gridipedia: relaunched and expanded
By K. Kavoussanakis, EPCC
Gridipedia – the European
online repository of Grid tools
and information – preserves and
makes accessible a whole range
of resources on Grid computing
and related technologies such as
cloud computing and virtualization.
Originally populated with results from
the BEinGRID research project, it is
continuously enriched by commercial
organisations and research projects
and welcomes contributions.
Gridipedia has
expanded massively recently and
its contents include case studies,
software components, business
analysis and legal advice. Unique
visitor numbers exceed 2,000 a
month and include collaborators
using Gridipedia to distribute
software and research results.
Gridipedia was initially populated
with the results of BEinGRID, which
focuses on extracting best practice
and common components from a
series of pilot implementations of
Grid computing in diverse business
settings. More recently
the site has hosted
work by commercial
organisations such
as case studies from
Univa, Sun, Digipede
and IBM, as well
as other research
projects such as
BREIN, Akogrimo and
TrustCom, and on
behalf of open source middleware
groups including GRIA and Globus.
In the long term, Gridipedia aims to
become the definitive distribution
channel for cloud and Grid
technologies: where vendors will
meet buyers from the commercial
sector; where consultants, suppliers
and programmers will exhibit and
trade their offerings; and where
potential buyers at all levels will find
the information and products they
Relaunched in May, Gridipedia
targets key decision-makers in both
business and technical roles who
will oversee the implementation of
Grid in their businesses or who are
looking to further develop their Grid
capabilities. It demonstrates the
business benefits of Grid technology,
for example reduced time to market,
new service offerings, reduced
costs, improved quality and greater
Gridipedia is soliciting contributions
from the user community. You can
join Gridipedia by contributing to
the online Grid voices blog or you
can submit your software or article
for publication on the site. We look
forward to your contributions!
Advanced Distributed Services Summer School
The NGS, in conjunction with the
training, outreach and education
team at NeSC, is pleased to
announce that registration for the
Advanced Distributed Services
Summer School 2009 is now open.
The summer school will run from
7-22 September at Cosener’s
House, Oxfordshire and will bring
together students and many of the
leading researchers and technology
providers in the field of Distributed
Computing. It is a chance for
students to not only take part in a
unique learning experience, including
many hands-on tutorials, but also to
spend time in a small group with the
leaders in the field.
The aim of the school is to help
develop the skills of those involved
in providing computational support
for research in a wide range of
disciplines. In particular, the school
will focus on using, providing
interfaces to, and developing
services that are built by
composing or aggregating
computational or data services.
The school will show how to
compose a variety of services into
bioinformatics workflows that can be
used to support biomedical research
processes, how to use and develop
lab or department scale clusters of
computers to run simulations, how
to work with the NGS to compose
protein simulation models for
running on UK or international
supercomputers, and how to develop
a portal to support legacy applications.
The school aims to provide students
with a familiarity with and the tools
to use the facilities available in the
UK and internationally to transform
current research practices in the light
of the developing services being
made available in a highly networked
The cost of the summer school,
including accommodation and meals
(except Tuesday and Thursday) is
£276. Please send further queries to or register
on the ADSSS web site (http://www.
Cloud Computing Users and the NGS
The Belfast e-Science Centre (BeSC) offers a hosting-on-demand service within the NGS in support of UK academic
users. The service is currently used extensively by BeSC’s commercial partners but BeSC are keen to have more
academic users. The service enables a remote user to deploy software into servers within the BeSC domain and to
manage these deployed services remotely.
The BeSC service hosting cloud can be accessed via a web UI and using an API BeSC are developing (currently
called libcloud); a Europar 09 paper on libcloud can be found at
The API is intended to provide a provider-neutral interface to remote resources such as those provided by Amazon,
Flexiscale, etc., and the BeSC hosting cloud; plugins for all of these providers are part of the library. If you have an
interest in using the hosting cloud or this library in your development and/or helping its development please contact
Terry Harmer for details.
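As a rough illustration of what a provider-neutral interface with plugins can look like, here is a minimal sketch. It does not show the BeSC library’s actual API: the `Provider` class, its `deploy` and `status` methods, the plugin registry and the two toy providers are all hypothetical names invented for this example.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical provider-neutral interface; not the BeSC library's real API."""

    @abstractmethod
    def deploy(self, service_name: str) -> str:
        """Deploy a service and return an identifier for it."""

    @abstractmethod
    def status(self, service_id: str) -> str:
        """Return the state of a previously deployed service."""


PLUGINS = {}  # maps a provider name to the class that implements it


def register(name):
    """Class decorator that adds a provider plugin to the registry."""
    def decorator(cls):
        PLUGINS[name] = cls
        return cls
    return decorator


class InMemoryProvider(Provider):
    """Toy backend that only remembers deployments; stands in for a real cloud."""

    prefix = "mem"

    def __init__(self):
        self._services = {}

    def deploy(self, service_name):
        service_id = f"{self.prefix}-{len(self._services) + 1}"
        self._services[service_id] = service_name
        return service_id

    def status(self, service_id):
        return "running" if service_id in self._services else "unknown"


@register("provider_a")
class ProviderA(InMemoryProvider):
    prefix = "a"


@register("provider_b")
class ProviderB(InMemoryProvider):
    prefix = "b"


def connect(name: str) -> Provider:
    """Look up a plugin by name; calling code never touches provider classes."""
    return PLUGINS[name]()


# The same calling code works unchanged against every registered provider.
for name in sorted(PLUGINS):
    cloud = connect(name)
    service_id = cloud.deploy("demo-service")
    print(name, service_id, cloud.status(service_id))
```

The design choice being illustrated is that user code depends only on the abstract interface and a provider name, so switching between hosting services means changing one string rather than rewriting the deployment logic.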
Mathematical, Physical and Engineering Sciences Online Table of Contents Alert
A new issue of Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences is
now available online: and the Table of Contents below is available at: http://rsta.
Introduction: Crossing boundaries: computational science, e-Science and global e-Infrastructure - by Peter V.
Coveney and Malcolm P. Atkinson
Sector and Sphere: the design and implementation of a high-performance data cloud - by Yunhong Gu and
Robert L. Grossman
GridPP: the UK grid for particle physics - by D. Britton, A.J. Cass, P.E.L. Clarke, J. Coles, D.J. Colling, A.T. Doyle,
N.I. Geddes, J.C. Gordon, R.W.L. Jones, D.P. Kelsey, S.L. Lloyd, R.P. Middleton, G.N. Patrick, R.A. Sansum, and S.E.
Louisiana: a model for advancing regional e-Research through cyberinfrastructure - by Daniel S. Katz, Gabrielle
Allen, Ricardo Cortez, Carolina Cruz-Neira, Raju Gottumukkala, Zeno D. Greenwood, Les Guice, Shantenu Jha,
Ramesh Kolluru, Tevfik Kosar, Lonnie Leger, Honggao Liu, Charlie McMahon, Jarek Nabrzyski, Bety Rodriguez-Milla,
Ed Seidel, Greg Speyrer, Michael Stubblefield, Brian Voss, and Scott Whittenburg
Building a scientific data grid with DIGS - by Mark G. Beckett, Chris R. Allton, Christine T.H. Davies, Ilan Davis,
Jonathan M. Flynn, Eilidh J. Grant, Russell S. Hamilton, Alan C. Irving, R.D. Kenway, Radoslaw H. Ostrowski, James
T. Perry, Jason R. Swedlow, and Arthur Trew
Flexible selection of heterogeneous and unreliable services in large-scale grids - by Sebastian Stein, Terry R.
Payne, and Nicholas R. Jennings
Standards-based network monitoring for the grid - by Jeremy Nowell, Kostas Kavoussanakis, Charaka
Palansuriya, Michal Piotrowski, Florian Scharinger, Paul Graham, Bartosz Dobrzelecki, and Arthur Trew
The Archaeotools project: faceted classification and natural language processing in an archaeological
context - by S. Jeffrey, J. Richards, F. Ciravegna, S. Waller, S. Chapman, and Z. Zhang
Integrating Open Grid Services Architecture Data Access and Integration with computational Grid workflows
- by Tamas Kukla, Tamas Kiss, Peter Kacsuk, and Gabor Terstyanszky
Improved performance control on the Grid - by M.E. Tellier, G.D. Riley, and T.L. Freeman
Novel submission modes for tightly coupled jobs across distributed resources for reduced time-to-solution
- by Promita Chakraborty, Shantenu Jha, and Daniel S. Katz
Real science at the petascale - by Radhika S. Saksena, Bruce Boghosian, Luis Fazendeiro, Owain A. Kenway,
Steven Manos, Marco D. Mazzeo, S. Kashif Sadiq, James L. Suter, David Wright, and Peter V. Coveney
Enabling cutting-edge semiconductor simulation through grid technology - by Dave Reid, Campbell Millar, Scott
Roy, Gareth Roy, Richard Sinnott, Gordon Stewart, Graeme Stewart, and Asen Asenov
UKQCD software for lattice quantum chromodynamics - by P.A. Boyle, R.D. Kenway, and C.M. Maynard
Adaptive distributed replica–exchange simulations - by Andre Luckow, Shantenu Jha, Joohyun Kim, Andre
Merzky, and Bettina Schnor
High-performance computing for Monte Carlo radiotherapy calculations - by P. Downes, G. Yaikhom, J.P. Giddy,
D.W. Walker, E. Spezi, and D.G. Lewis
OGSA-DAI: from open source product to open
source project
By Mike Jackson
The OGSA-DAI project has been
funded by EPSRC for an additional
year, until April 2010. This funding will
enable us to evolve OGSA-DAI from
an open source product into an open
source project.
An international community of users
and developers has formed around
OGSA-DAI, our unique open source
product for access to and integration
of distributed heterogeneous data
resources. This includes projects
and institutions in a myriad of
fields including medical research,
environmental science, geosciences, the arts and humanities
and business.
Moving to an open source project will
provide the community with a focal
point for the evolution, development,
use and support of OGSA-DAI and
its related components, providing
a means by which members
can develop and release their
components alongside the core
product. It will also provide an
avenue to ensure the sustainability
of their components. Over the next
few months we will set in place the
governance and infrastructure of
the OGSA-DAI open source project.
This will be done in conjunction
with key community members, and
will draw upon the expertise of our
OMII-UK partners in Manchester
and Southampton and in the Globus
Alliance. We aim to roll out our open
source project site in October.
Our move to an open source project
contributes to OMII-UK’s vision to
promote software sustainability, and
will ensure that the lifetime of the
OGSA-DAI product extends beyond
any single institution or funding
stream. In addition, we will continue
to develop the product and engage
with international standardisation
Work on distributed query processing
will continue, looking at more
powerful distributed relational queries
and integrating work on relational-XML queries produced by Japan’s
A review of performance, scalability
and robustness will be undertaken,
so allowing us to identify key areas
for redesign.
New components for security, data
delivery via GridFTP and access to
indexed and SAGA file resources will
be released.
A new version of OGSA-DAI, with
many improvements including
refactored APIs and exploiting Java
1.6, will be released.
We will participate in interoperability
testing with the OGF DAIS working
group, a vital part in the evolution
of the WS-DAI specifications into
The OGSA-DAI project, which
involves both EPCC and the National
e-Science Centre, is funded by
EPSRC through OMII-UK.
eSI Visitor Seminar: “Trust and Security in Distributed Information Infrastructures”
The e-Science Institute is delighted to host a seminar with Professor Vijay Varadharajan, Microsoft Chair in Innovation
in Computing at Macquarie University, at 4pm on June 15. The seminar is open to all interested parties in academia
and industry.
The Internet and web technologies are transforming the way we work and live. Fundamental to many of the
technological and policy challenges in this technology-enabled society and economy are the issues of trust. For
instance, when an entity receives some information from another entity, questions arise as to how much trust is to be
placed in the received information, how to evaluate the overall trust and how to incorporate the evaluated trust in the
decision-making systems. Such issues can indeed arise at many levels in computing, e.g. at a user level, at a service
level (in a distributed service-oriented architecture), at a network level between one network device and another
(e.g. in a sensor network) or at a process level within a system. There is also the social dimension involving how
different societies and cultures value and evaluate trust in their contexts. Recently we have been witnessing users,
especially the younger generation, placing greater trust in the information available on Internet applications
such as Facebook and MySpace when it comes to making online decisions. The notion of trust has been around
for many decades, if not for centuries, in different disciplines such as psychology, philosophy and sociology, as well as in
technology. From a security technology point of view, trust has always been regarded as a foundational building block.
In this talk, we will take a journey through the different notions of trust in the secure computing technology world and their
evolution from the operating systems context to distributed systems to trusted computing platforms and trusted online
applications. We will look at some of the challenges involved in developing trusted services and infrastructures,
and their influence on growing the digital economy.
More information is available here:
XtreemOS Summer School
Wadham College, University of Oxford, September 7-11, 2009
XtreemOS is a Linux-based operating system that includes Grid functionalities. It is characterised by properties such
as transparency, hiding the complexity of the underlying distributed infrastructure; scalability, supporting hundreds of
thousands of nodes and millions of users; and dependability, providing reliability, high availability and security.
The XtreemOS Summer School will include lectures on modern distributed paradigms such as Grid computing, Cloud
computing, and network-centric operating systems. The Summer School will combine lectures from research leaders
shaping the future of distributed systems and world leaders in deploying and exploiting distributed infrastructures.
Hands-on laboratory exercises and practical sessions using XtreemOS will give participants experience of using
modern distributed systems.
The aims of the XtreemOS Summer School are: to introduce participants to emergent computing paradigms such
as Grid computing and Cloud computing; to provide lectures and practical courses on novel techniques to achieve
scalability, high availability and security in distributed systems; to present Grid applications in the domains of e-Science and business; and to provide a forum for participants to discuss their research work and share experience with
experienced researchers.
An online registration form is available at the following URL: The deadline for registration is July 26th 2009.
More information:
Forthcoming Events Timetable
BEinGRID EC Review
Leaping Hurdles: Planning IT Provision
for Researchers
Implementation of the Data Audit
Framework - Progress and Sustainability
eSI Visitor Seminar: “Trust and Security in
Distributed Information Infrastructures” (eSI)
eSI Public Lecture: “How Web 2.0
Technologies and Innovations are
Changing e-Research Activities”
Stakeholder meeting to launch the
National Managed Clinical Network for
Children with Exceptional Healthcare
25-3 July
Analysis of Fluid Stability
This is only a selection of events that are happening in the next few months. For the full listing go to the following
Events at the e-Science Institute:
External events:
If you would like to hold an e-Science event at the e-Science Institute, please contact:
Conference Administrator,
National e-Science Centre, 15 South College Street, Edinburgh, EH8 9AA
Tel: 0131 650 9833 Fax: 0131 650 9819
This NeSC Newsletter was edited by Gillian Law.
The deadline for the July 2009 issue is June 19, 2009