Tony Hey
Corporate Vice President for External Research
Microsoft Research
• Data collection
– Sensor networks, global databases, local databases, desktop computer, laboratory instruments, observation devices, …
• Data processing, analysis, visualization
– Legacy codes, workflows, data mining, indexing, searching, graphics, screens, …
• Archiving
– Digital repositories, libraries, preservation, …
SensorMap
Functionality: Map navigation
Data: sensor-generated temperature, video camera feed, traffic feeds, etc.
Scientific visualizations
NSF Cyberinfrastructure report, March 2007
• Thousand years ago – Experimental Science
– Description of natural phenomena
• Last few hundred years – Theoretical Science
– Newton’s Laws, Maxwell’s Equations…
• Last few decades – Computational Science
– Simulation of complex phenomena
• Today – eScience or Data-centric Science
– Unify theory, experiment, and simulation
– Using data exploration and data mining
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks
➢ Scientists overwhelmed with data
➢ Computer Science and IT companies have technologies that will help
(With thanks to Jim Gray)
$$\left(\frac{\dot{a}}{a}\right)^{2} = \frac{4\pi G\rho}{3} - K\frac{c^{2}}{a^{2}}$$
http://www.neptune.washington.edu/
Undersea Sensor Network
Connected & Controllable Over the Internet
Visual Programming
Persistent Distributed Storage
Distributed Computation
Interoperability & Legacy Support via Web Services
• North East Pacific Time-Series Undersea Networked Experiment (NEPTUNE)
• The world’s first plate-scale undersea observatory
• From raw data to useable data products
– Data cleaning, analysis, regridding, interpolation
• Support real-time, on-demand visualization
In-browser workflow editing
Custom activities for oceanographers
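The raw-data-to-product steps listed above (cleaning, regridding, interpolation) can be sketched in a few lines. This is only an illustration, not the NEPTUNE pipeline: the function name, the validity range, and the sample temperature readings are all hypothetical, and linear interpolation is assumed for simplicity.

```python
from typing import List, Optional, Tuple


def clean_and_regrid(
    samples: List[Tuple[float, Optional[float]]],
    valid_range: Tuple[float, float],
    step: float,
) -> List[Tuple[float, float]]:
    """Drop bad readings, then linearly interpolate onto a regular time grid.

    samples: (time, value) pairs in strictly increasing time order;
             value may be None to mark a sensor dropout.
    valid_range: hypothetical plausibility limits used for cleaning.
    step: spacing of the output grid, in the same units as time.
    """
    lo, hi = valid_range
    # Cleaning: discard missing values and readings outside the valid range.
    good = [(t, v) for t, v in samples if v is not None and lo <= v <= hi]
    if len(good) < 2:
        return []

    grid = []
    t = good[0][0]
    i = 0
    while t <= good[-1][0] + 1e-9:
        # Advance to the pair of clean samples that brackets grid time t.
        while i < len(good) - 2 and good[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = good[i], good[i + 1]
        # Interpolation: straight line between the bracketing samples.
        frac = (t - t0) / (t1 - t0)
        grid.append((t, v0 + frac * (v1 - v0)))
        t += step
    return grid


# Hypothetical readings (seconds, degrees C) with one dropout and one spike.
raw = [(0.0, 4.0), (7.0, None), (10.0, 6.0), (15.0, 99.0), (20.0, 8.0)]
product = clean_and_regrid(raw, valid_range=(-2.0, 40.0), step=5.0)
```

The dropout at 7 s and the spike at 15 s are removed during cleaning, and the remaining samples are resampled every 5 s. A production system would likely use a numerical library and a domain-appropriate interpolation scheme instead.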
• In 2003, the NSF published the ‘Atkins Report’ on ‘Revolutionizing Science and Engineering through Cyberinfrastructure’
• Report defined Cyberinfrastructure as:
– Grids of computational centers
– Comprehensive libraries of digital objects
– Well-curated collections of scientific data
– Online instruments and vast sensor arrays
– Convenient software toolkits
Report concluded that Cyberinfrastructure is relevant to the Humanities and Social Sciences, to support what they called ‘Digital Scholarship’:
• Building digital collections of information
• Creating appropriate tools for collection-building
• Creating tools for the analysis and study of collections
• Using tools and digital collections to generate new intellectual products
• Creating authoring tools for these new intellectual products, either in traditional forms or in digital form
➢ eResearch is a more inclusive term than eScience
• Open access to, and unrestricted use of, data promotes scientific progress and facilitates the training of researchers
• Open access will maximise the value derived from public investments in data collection efforts
• Undue restrictions on access to and use of research data from public funding risk diminishing the quality and efficiency of scientific research and innovation
• Large data archives create the opportunity to:
– Do science at the regional and global scale
– Combine data from multiple disciplines
– Perform historical trend analysis
• Small scientific collaborations need help to:
– Perform analyses using more data than they can currently manage
– Enable data handling and versioning
– Store the currently needed data and metadata
– Browse the data for science
Web users...
• Generate content on the Web
– Blogs, wikis, podcasts, videocasts, etc.
• Form communities
– Social networks, virtual worlds
• Interact, collaborate, share
– Instant messaging, web forums, content sites
• Consume information and services
– Search, annotate, syndicate
Researchers...
• Annotate, share, discover data
– Custom, standalone tools
• Conferences, Journals
– Publication process is long, subscriptions, discoverability issues
• Collaborate on projects, exchange ideas
– Email, F2F meetings, video-conferences
• Use workflow tools to compose services
– Domain-specific services/tools
With thanks to Catharine van Ingen
• Working with Berkeley Water Center using modern Database technologies
– 149 Ameriflux sites across the Americas reporting a minimum of 22 common measurements
– Carbon-Climate Data published to and archived at Oak Ridge
– Total data reported to date on the order of 192M half-hourly measurements since 1994
• http://public.ornl.gov/ameriflux/
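A quick back-of-envelope check puts the archive size in context. The site count, measurement count, and total are taken from the slide; gap-free reporting and a 365-day year are simplifying assumptions made only for this sketch.

```python
SITES = 149                    # from the slide
MEASUREMENTS_PER_SITE = 22     # minimum common measurements, from the slide
TOTAL_REPORTED = 192_000_000   # "on the order of 192M", from the slide

# Half-hourly sampling gives 48 readings per day. Leap days are ignored (assumption).
READINGS_PER_YEAR = 48 * 365

# Upper bound per year if every site reported every measurement without gaps.
full_network_per_year = SITES * MEASUREMENTS_PER_SITE * READINGS_PER_YEAR

# Years of gap-free, full-network reporting that the archive amounts to.
equivalent_full_years = TOTAL_REPORTED / full_network_per_year

print(f"Full network, one year: {full_network_per_year:,} measurements")
print(f"Archive is equivalent to {equivalent_full_years:.1f} full-network years")
```

Under these assumptions the archive amounts to a little over three full-network years, well short of the time elapsed since 1994. That would be consistent with sites starting at different times or having gaps in their records, although the slide does not say which.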
Deb Agarwal and Savas Parastatidis
Easy mashups using Popfly and Silverlight
Silverlight Technologies
Sloan Digital Sky Survey / SkyServer http://cas.sdss.org/dr5/en/
BLAST
BLAST service delivered through a Web browser
BLAST service (WSDL) that can be integrated into an application
Taverna Workflow
Google Account
Open ID
Windows Live ID
Amazon EC2 and its custom SOAP-based authentication
SCRIPPS
(see www.microsoft.com/science)
Services not middleware
• No need to install many thousands of lines of middleware
• Take the Web 2.0 example and empower researchers through their browser
Services in the cloud
• Blogs, Wikis
• RSS, Tagging
• Data processing, data mining
• Content upload, sharing, discovery
• Storage, computation, messaging
http://ecrystals.chem.soton.ac.uk
Thanks to Jeremy Frey
Simple Storage Service (S3)
• Storage for the Internet
• Simple Web Services interface to store and retrieve any amount of data from anywhere on the Web
• Standards-based REST and SOAP Web Service interfaces
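As a rough illustration of the REST style mentioned above, storing and retrieving an object maps onto plain HTTP PUT and GET requests against a URL that names the bucket and key. The sketch below only builds the requests and does not send them. The bucket name and key are hypothetical, and the authentication headers that real requests need are deliberately left out.

```python
import urllib.request

# Hypothetical bucket and key, used only for illustration.
BUCKET = "example-research-bucket"
KEY = "data/readings.csv"
URL = f"https://{BUCKET}.s3.amazonaws.com/{KEY}"


def build_store_request(url: str, payload: bytes) -> urllib.request.Request:
    """Build (but do not send) an HTTP PUT that would store an object."""
    return urllib.request.Request(url, data=payload, method="PUT")


def build_retrieve_request(url: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP GET that would retrieve the object."""
    return urllib.request.Request(url, method="GET")


put_req = build_store_request(URL, b"timestamp,value\n")
get_req = build_retrieve_request(URL)

# Sending would be urllib.request.urlopen(put_req). Without signed
# authentication headers, a private bucket would reject the request.
print(put_req.get_method(), put_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

In practice an SDK would normally handle request signing; the point here is only that the interface is ordinary HTTP, which is what lets it be called "from anywhere on the Web".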
Elastic Compute Cloud (EC2)
• Compute on demand
• Virtualization
• Integration with S3
Gene Analysis Virtual Lab Experiment by Jong Youl Choi at Indiana
(For Beth Plale and Sun Kim)
Searching & Visualization
Reputation & Influence
Live Documents
With thanks to Timo Hannay
Link to simulation software and data in archive
Link to data, follow links back to the raw data archive
A record of an experiment that failed…
Publishable?
Useful?
• Identify information sources, tools and services to support research in STM
• Explore the application of new services
– Collaborative filtering of literature, continual queries and more…
• Build on commercial software, to the extent possible
• Intuitive to use and navigate, user configurable
– “If I can’t figure it out in 15 minutes, I will quit using it…”
Home page with RSS & Live Search integration
1. John logs in using his InfoCard. The portal is personalized for the logged-in user
2. The research lifecycle phases are represented as tabs
3. Researcher can move between his projects by clicking on the project name
4. Faculty menu displays Contacts and Wiki. LCS adds presence to contacts for IM communication
5. Researcher can search across multiple online sources
6. User can save searches and rerun them as required
7. References displays URLs saved by the user
8. RSS feeds alert the user about the latest funding topics/opportunities
User can flag a particular article and save it to his list of references
User can save a particular article to his personal bibliographic store, such as OneNote
User can view others' comments
Lets the user collaborate with other researchers by sending documents directly
Lets the user add comments
Use context-sensitive Word ribbons for automating tedious tasks
Use Excel templates for common tasks to increase accuracy
• Journal subscriptions rising faster than library budgets
– Cancelling subscriptions, no freedom for new journals in new and emerging fields
• Web technology and digital media now make dissemination of knowledge ‘easy’ and ‘free’ without the traditional paper journals
– Similar dilemma to that of the music industry with MP3 and ‘free’ digital copies
➢ Curious ‘crisis’ in that the average academic is often unaware of these issues
• To save money, the University of Michigan’s libraries are canceling some of their journal subscriptions because of budget cuts and the increasing costs of the subscriptions.
• Many of the cuts are to print subscriptions only, while the University continues to subscribe to the journals online.
• University Librarian Paul Courant said that about 2,500 were canceled this fiscal year.
In many cases, Courant said, the University starts by canceling duplicate subscriptions, leaving one copy of the journal in at least one library, as opposed to in multiple libraries.
The University's other prominent case is when subscriptions were cancelled to journals with lower demand....
• The University Library budget has gone up by an average of 3.1 percent per year since 2004.
• According to Library Journal magazine, the average subscription price of national arts and humanities journals has increased 6.8 percent per year since 2003. National social science journals increased 9.2 percent and national science journals increased by 8.3 percent....
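The gap between these growth rates compounds quickly. The sketch below uses the percentages quoted above. The ten-year horizon is an illustrative assumption, as is treating the rates as constant and on the same timeline (the quoted figures start in 2003 and 2004 respectively).

```python
def relative_cost(price_growth: float, budget_growth: float, years: int) -> float:
    """Journal price relative to library budget after `years`, both starting at 1.0."""
    return ((1 + price_growth) / (1 + budget_growth)) ** years


BUDGET_GROWTH = 0.031  # library budget growth per year (from the article)
PRICE_GROWTH = {       # average subscription price growth per year (from the article)
    "arts and humanities": 0.068,
    "social science": 0.092,
    "science": 0.083,
}

YEARS = 10  # illustrative horizon, an assumption

gap = {
    field: relative_cost(growth, BUDGET_GROWTH, YEARS)
    for field, growth in PRICE_GROWTH.items()
}
for field, ratio in gap.items():
    print(f"{field}: prices grow {ratio:.2f}x relative to the budget over {YEARS} years")
```

A ratio above 1.0 means the same set of subscriptions takes a larger share of the budget than it did at the start, which is the pressure behind the cancellations described above.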
Paul Ginsparg is the creator of arXiv, an open access repository for prepublication of much of the physics and astronomy literature
Published in the Journal of Neuroscience, September 20, 2006
“On the one-decade time scale, it is likely that more research communities will join some form of global unified archive system without the current partitioning and access restrictions familiar from the paper medium, for the simple reason that it is the best way to communicate knowledge and hence to create new knowledge.”
“Ironically, it is also possible that the technology of the 21st century will allow the traditional players from a century ago, namely the professional societies and institutional libraries, to return to their dominant role in support of the research enterprise.”
• Repositories will contain not only full text versions of research papers but also ‘grey’ literature such as workshop papers, presentations, technical reports and theses
• In the future it is likely that repositories will also contain data, images and software
• As Dean of Engineering at Southampton I was responsible for monitoring the research output of over 200 Faculty and 500 Post Docs and Grad Students
• University library could not afford to subscribe to all the journals that my staff published in, not to mention conference proceedings and workshop contributions …
Individual Data Models and Services
SensorMap
Functionality: Map navigation
Data: sensor-generated temperature, video camera feed, traffic feeds, etc.
• Semantic relationships between different data
• Semantic descriptions of services
• Annotations
• Provenance
• Repositories
• Ontologies
• Folksonomies
• Led by British Library, with National Libraries of Netherlands, Austria and Denmark, and National Archives of UK, Netherlands and Switzerland
• Using new XML-based file formats in Office 2007 as a preservation metadata format
• Open, royalty-free file format specification will allow interoperability
• Office Open XML is now an ECMA Standard
• Creating translators from old versions of Word to OOXML
General infrastructure and eResearch-oriented services
– Reference management, project management, domain-specific services, visualization and analysis services, blogs, scholarly communications & social networking, search, books, citations, knowledge management, knowledge discovery, instant messaging, identity, mail, notification, document store, storage/data services, compute services, virtualization
Access to all services through the cloud
Services accessible through the browser…
... or from desktop applications
➢ Research-as-a-service
• More affordable/sustainable infrastructure for research organizations
– Major IT companies bringing down the cost of the infrastructure
– Will be cheaper to pay-per-use than own/manage private infrastructure
• Combination of Data Intensive HPC facilities and Software + Services
– Will enable research community to address global scientific grand challenges
• An eScience Cyberinfrastructure emerging
– Grids of computational centers
– Comprehensive libraries of digital objects
– Well-curated collections of scientific data
– Online instruments and vast sensor arrays
– Convenient software toolkits
➢ Some successes, encouraging progress
• Scientific services in the Cloud
➢ Future of eScience Cyberinfrastructure is likely to be a mix of software + services
• The ideas presented here were developed with input from many colleagues in the community and at Microsoft:
– Special thanks are due to David De Roure, Geoffrey Fox, Jeremy Frey, Dennis Gannon, Carole Goble and Paul Watson
– And to Roger Barga, Savas Parastatidis and colleagues at Microsoft
• See www.microsoft.com/science for some more details of Microsoft’s activities in Technical Computing