eScience and Digital Scholarship Tony Hey Corporate Vice President for External Research


eScience and Digital Scholarship

Tony Hey

Corporate Vice President for External Research

Microsoft Research

• Data collection

– Sensor networks, global databases, local databases, desktop computer, laboratory instruments, observation devices, …

• Data processing, analysis, visualization

– Legacy codes, workflows, data mining, indexing, searching, graphics, screens, …

• Archiving

– Digital repositories, libraries, preservation, …

SensorMap

Functionality: Map navigation

Data: sensor-generated temperature, video camera feed, traffic feeds, etc.

Scientific visualizations

NSF Cyberinfrastructure report, March 2007

• Thousand years ago – Experimental Science

– Description of natural phenomena

• Last few hundred years – Theoretical Science

– Newton’s Laws, Maxwell’s Equations…

• Last few decades – Computational Science

– Simulation of complex phenomena

• Today – eScience or Data-centric Science

– Unify theory, experiment, and simulation

– Using data exploration and data mining

• Data captured by instruments

• Data generated by simulations

• Data generated by sensor networks

➢ Scientists overwhelmed with data

➢ Computer Science and IT companies have technologies that will help

(With thanks to Jim Gray)

$$\left(\frac{\dot{a}}{a}\right)^{2} = \frac{4\pi G\rho}{3} - K\frac{c^{2}}{a^{2}}$$

http://www.neptune.washington.edu/

Undersea Sensor Network

Connected & Controllable Over the Internet

Visual Programming

Persistent Distributed Storage

Distributed Computation

Interoperability & Legacy Support via Web Services

• North East Pacific Time-Series Undersea Networked Experiment

• The world’s first plate-scale undersea observatory

• From raw data to useable data products

– Data cleaning, analysis, regridding, interpolation

• Support real time, on-demand visualization
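
The step from raw data to usable data products can be sketched as follows. This is a minimal illustration, assuming readings arrive as (time, value) pairs with integer timestamps; the function names `clean` and `regrid` and the sample values are hypothetical and are not taken from the NEPTUNE software. Linear interpolation is only one of several possible regridding choices.

```python
from bisect import bisect_left
import math


def clean(samples):
    """Drop missing readings (None or NaN) and sort the rest by time."""
    good = [(t, v) for t, v in samples if v is not None and not math.isnan(v)]
    return sorted(good)


def regrid(samples, start, stop, step):
    """Linearly interpolate sorted (time, value) samples onto a regular grid."""
    times = [t for t, _ in samples]
    values = [v for _, v in samples]
    grid = []
    t = start
    while t <= stop:
        i = bisect_left(times, t)
        if i < len(times) and times[i] == t:
            # A reading exists exactly at this grid point.
            grid.append((t, values[i]))
        elif 0 < i < len(times):
            # Interpolate between the two neighbouring readings.
            t0, t1 = times[i - 1], times[i]
            w = (t - t0) / (t1 - t0)
            grid.append((t, values[i - 1] + w * (values[i] - values[i - 1])))
        else:
            # Outside the observed range: leave a gap rather than extrapolate.
            grid.append((t, None))
        t += step
    return grid


# Hypothetical sensor readings: (seconds, temperature), one of them invalid.
raw = [(0, 10.0), (8, 18.0), (12, float("nan")), (16, 26.0)]
print(regrid(clean(raw), start=0, stop=20, step=10))
```

Leaving gaps outside the observed range, instead of extrapolating, keeps the derived product honest about where no instrument data exists.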

In-browser workflow editing

Custom activities for oceanographers

eScience and Cyberinfrastructure

• In 2003, the NSF published the ‘Atkins Report’ on ‘Revolutionizing Science and Engineering through Cyberinfrastructure’

• Report defined Cyberinfrastructure as:

• Grids of computational centers

• Comprehensive libraries of digital objects

• Well-curated collections of scientific data

• Online instruments and vast sensor arrays

• Convenient software toolkits

Humanities and Social Sciences?

eResearch and Digital Scholarship

The report concluded that Cyberinfrastructure is relevant to the Humanities and Social Sciences, supporting what its authors called ‘Digital Scholarship’:

• Building digital collections of information

• Creating appropriate tools for collection-building

• Creating tools for the analysis and study of collections

• Using tools and digital collections to generate new intellectual products

• Creating authoring tools for these new intellectual products, either in traditional forms or in digital form

➢ eResearch is a more inclusive term than eScience

• Open access to, and unrestricted use of, data promotes scientific progress and facilitates the training of researchers

• Open access will maximise the value derived from public investments in data collection efforts

• There is a risk that undue restrictions on access to and use of research data from public funding could diminish the quality and efficiency of scientific research and innovation

Present Experience

• Large data archives create the opportunity to:

– Do science at the regional and global scale

– Combine data from multiple disciplines

– Perform historical trend analysis

• Small scientific collaborations need help to:

– Perform analyses using more data than they can currently manage

– Enable data handling and versioning

– Store the currently needed data and metadata

– Browse the data for science

Web users...

• Generate content on the Web

– Blogs, wikis, podcasts, videocasts, etc.

• Form communities

– Social networks, virtual worlds

• Interact, collaborate, share

– Instant messaging, web forums, content sites

• Consume information and services

– Search, annotate, syndicate

Researchers...

• Annotate, share, discover data

– Custom, standalone tools

• Conferences, Journals

– Publication process is long, subscriptions, discoverability issues

• Collaborate on projects, exchange ideas

– Email, F2F meetings, video-conferences

• Use workflow tools to compose services

– Domain-specific services/tools

With thanks to

Catharine van Ingen

Scientific Data Servers for Hydrology

• Working with Berkeley Water Center using modern Database technologies

– 149 Ameriflux sites across the Americas reporting minimum of 22 common measurements

– Carbon-Climate Data published to and archived at Oak Ridge

– Total data reported to date on the order of 192M half-hourly measurements since 1994

• http://public.ornl.gov/ameriflux/


Deb Agarwal and Savas Parastatidis

Easy mashups using Popfly and Silverlight

Silverlight Technologies

Sloan Digital Sky Server/SkyServer http://cas.sdss.org/dr5/en/

BLAST

BLAST service delivered through a Web browser

BLAST service (WSDL) that can be integrated into an application

Taverna Workflow

Google Account

Open ID

Windows Live ID

Amazon EC2 and its custom SOAP-based authentication

SCRIPPS

(see www.microsoft.com/science)

Services not middleware

• No need to install many thousands of lines of middleware

• Take the Web2.0 example and empower researchers through their browser

Services in the cloud

• Blogs, Wikis

• RSS, Tagging

• Data processing, data mining

• Content upload, sharing, discovery

• Storage, computation, messaging

http://ecrystals.chem.soton.ac.uk

Thanks to Jeremy Frey

Simple Storage Service (S3)

• Storage for the Internet

• Simple Web Services interface to store and retrieve any amount of data from anywhere on the Web

• Standards-based REST and SOAP Web Service interfaces
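
As a rough illustration of the REST style of interface described above, the sketch below builds (but does not send) the HTTP requests that would store and retrieve one object. The bucket name, object key and helper names are hypothetical. Real requests also need AWS authentication headers, which this sketch deliberately omits.

```python
from urllib.parse import quote
from urllib.request import Request

# Global S3 endpoint; regional endpoints also exist.
S3_ENDPOINT = "s3.amazonaws.com"


def object_url(bucket: str, key: str) -> str:
    """Virtual-hosted-style URL that addresses a single stored object."""
    return f"https://{bucket}.{S3_ENDPOINT}/{quote(key)}"


def store_request(bucket: str, key: str, data: bytes) -> Request:
    """HTTP PUT: store `data` under `key` (unsigned, for illustration only)."""
    return Request(object_url(bucket, key), data=data, method="PUT")


def retrieve_request(bucket: str, key: str) -> Request:
    """HTTP GET: retrieve the object stored under `key` (unsigned)."""
    return Request(object_url(bucket, key), method="GET")


# Hypothetical bucket and key for a lab data file.
put = store_request("example-lab-data", "runs/2007/ctd cast.csv", b"depth,temp\n")
get = retrieve_request("example-lab-data", "runs/2007/ctd cast.csv")
print(put.get_method(), put.full_url)
print(get.get_method(), get.full_url)
```

The point of the sketch is that each object is just a URL and the operations are ordinary HTTP verbs, which is why no special middleware is needed on the client.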

Elastic Compute Cloud (EC2)

• Compute on demand

• Virtualization

• Integration with S3

Gene Analysis Virtual Lab Experiment by Jong Youl Choi at Indiana

(For Beth Plale and Sun Kim)

Searching & Visualization

Reputation & Influence

Live Documents

Nature’s ‘5 D’s Framework’:

Deep Data

Discussion and Dialogue

Digital Discovery

Dynamic Delivery

Data Display

With thanks to Timo Hannay

Publications as Live Documents

Link to simulation software and data in archive

Link to data, follow links back to the raw data archive

New Forms of Peer Review

Tagging for Researchers

Lab Notebooks as Blogs?

A record of an experiment that failed…

Publishable?

Useful?

• Identify information sources, tools and services to support research in STM

• Explore the application of new services

– Collaborative filtering of literature, continual queries and more…

• Build on commercial software, to the extent possible

• Intuitive to use and navigate, user configurable

– “If I can’t figure it out in 15 minutes, I will quit using it…”

Home page with RSS & Live Search integration

1. John logs in using his InfoCard. The portal is personalized for the logged-in user

2. The research lifecycle phases are represented as tabs

3. Researcher can move between his projects by clicking on the project name

4. Faculty menu displays Contacts and Wiki. LCS adds presence to contacts for IM communication

5. Researcher can search across multiple online sources

6. User can save searches and rerun them as required

7. References displays URLs saved by the user

8. RSS feeds alert the user about the latest funding topics/opportunities

Integration with MS OneNote for bibliography

User can flag a particular article and save it to his list of references

User can save a particular article to a personal bibliographic store such as OneNote

User can view others’ comments

Lets the user collaborate with other researchers by sending documents directly

Lets the user add comments

Integration with Word & Excel 2007

Use context-sensitive Word ribbons for automating tedious tasks

Use Excel templates for common tasks to increase accuracy

• Journal subscriptions rising faster than library budgets

– Cancelling subscriptions, no freedom for new journals in new and emerging fields

• Web technology and digital media now make dissemination of knowledge ‘easy’ and ‘free’ without the traditional paper journals

– Similar dilemma to that of the music industry with MP3 and ‘free’ digital copies

➢ Curious ‘crisis’ in that the average academic is often unaware of these issues

• To save money, the University of Michigan’s libraries are canceling some of their journal subscriptions because of budget cuts and the increasing costs of the subscriptions.

• Many of the cuts are to print subscriptions only, while the University continues to subscribe to the journals online.

• University Librarian Paul Courant said that about 2,500 were canceled this fiscal year. In many cases, Courant said, the University starts by canceling duplicate subscriptions, leaving one copy of the journal in at least one library, as opposed to in multiple libraries. The University's other prominent case is when subscriptions were cancelled to journals with lower demand....

• The University Library budget has gone up by an average of 3.1 percent per year since 2004.

• According to Library Journal magazine, the average subscription price of national arts and humanities journals has increased 6.8 percent per year since 2003. National social science journals increased 9.2 percent and national science journals increased by 8.3 percent....
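
The gap between these growth rates compounds quickly. The sketch below is a back-of-envelope calculation only: it assumes the quoted annual rates hold constant and apply over the same period, which the source does not state, and the function name and the five-year horizon are illustrative choices.

```python
def relative_purchasing_power(budget_growth, price_growth, years):
    """Fraction of the original subscription list still affordable after `years`,
    if the budget and the prices each grow at a constant annual rate."""
    return ((1 + budget_growth) / (1 + price_growth)) ** years


# Annual growth rates quoted above.
BUDGET_GROWTH = 0.031
PRICE_GROWTH = {
    "arts and humanities": 0.068,
    "social science": 0.092,
    "science": 0.083,
}

for field, growth in PRICE_GROWTH.items():
    share = relative_purchasing_power(BUDGET_GROWTH, growth, years=5)
    print(f"{field}: {share:.0%} of the original titles after 5 years")
```

Under these assumptions a library whose budget grows at 3.1 percent can afford a steadily shrinking share of the same journal list, which is consistent with the cancellations described above.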

Paul Ginsparg is the creator of arXiv, an open access repository for prepublication of much of the physics and astronomy literature

Published in the Journal of Neuroscience, September 20, 2006

Ginsparg’s Conclusions?

“On the one-decade time scale, it is likely that more research communities will join some form of global unified archive system without the current partitioning and access restrictions familiar from the paper medium, for the simple reason that it is the best way to communicate knowledge and hence to create new knowledge.”

“Ironically, it is also possible that the technology of the 21st century will allow the traditional players from a century ago, namely the professional societies and institutional libraries, to return to their dominant role in support of the research enterprise.”

• Repositories will contain not only full text versions of research papers but also ‘grey’ literature such as workshop papers, presentations, technical reports and theses

• In the future it is likely that repositories will also contain data, images and software

• As Dean of Engineering at Southampton I was responsible for monitoring the research output of over 200 Faculty and 500 Post Docs and Grad Students

• University library could not afford to subscribe to all the journals that my staff published in, not to mention conference proceedings and workshop contributions …


Individual Data Models and Services

SensorMap

Functionality: Map navigation

Data: sensor-generated temperature, video camera feed, traffic feeds, etc.

• Semantic relationships between different data

• Semantic descriptions of services

• Annotations

• Provenance

• Repositories

• Ontologies

• Folksonomies

myGrid

• Led by British Library, with National Libraries of Netherlands, Austria and Denmark, and National Archives of UK, Netherlands and Switzerland

• Using new XML-based file formats in Office 2007 as a preservation metadata format

• Open, royalty-free file format specification will allow interoperability

• Office OpenXML is now an ECMA Standard

• Creating translators from old versions of Word to OOXML

General infrastructure and eResearch-oriented services

• Reference management

• Project management

• Domain-specific services

• Visualization and analysis services

• Blogs

• Scholarly communications & social networking

• Search, books, citations

• Knowledge management

• Knowledge discovery

• Instant messaging

• Identity

• Mail

• Notification

• Document store

• Storage/data services

• Compute services

• Virtualization

Access to all services through the cloud

Services accessible through the browser…

... or from desktop applications

➢ Research-as-a-service

• More affordable/sustainable infrastructure for research organizations

– Major IT companies bringing down the cost of the infrastructure

– Will be cheaper to pay-per-use than own/manage private infrastructure

• Combination of Data Intensive HPC facilities and Software + Services

– Will enable research community to address global scientific grand challenges

• An eScience Cyberinfrastructure emerging

– Grids of computational centers

– Comprehensive libraries of digital objects

– Well-curated collections of scientific data

– Online instruments and vast sensor arrays

– Convenient software toolkits

➢ Some successes, encouraging progress

• Scientific services in the Cloud

➢ Future of eScience Cyberinfrastructure is likely to be a mix of software + services

• The ideas presented here were developed with input from many colleagues in the community and at Microsoft:

– Special thanks are due to David De Roure, Geoffrey Fox, Jeremy Frey, Dennis Gannon, Carole Goble and Paul Watson

– And to Roger Barga, Savas Parastatidis and colleagues at Microsoft

• See www.microsoft.com/science for some more details of Microsoft’s activities in Technical Computing
