The Future of eResearch

Speeding Science
Solutions for Data Curation
from Microsoft (Research)
Lee Dirks
Director, Education & Scholarly Communication
External Research Division
Microsoft Corporation
Microsoft External Research
Division within Microsoft Research
focused on partnerships between
academia, industry and government
to advance computer science,
education, and research in fields that
rely heavily upon advanced
Supporting groundbreaking research
to help advance human potential and
the wellbeing of our planet
Developing advanced technologies
and services to support every stage
of the research process
Microsoft External Research is
committed to interoperability and to
providing open access, open tools,
and open technology
Optimize and extend
Microsoft software to meet
the specific needs of the
academic community
Our approach:
Conduct applied projects to
enhance academic
productivity by evolving
Microsoft’s scholarly
communication offerings
Microsoft External Research
is uniquely positioned to
drive this initiative across
The Scholarly
Excel 2010
Windows Server HPC
“Astoria” / “Pop Fly”
Research &
Office OpenXML
XPS Format
SQL Server &
Entity Framework
Rights Management
Data Protection Manager
MSR Academic Search
SharePoint 2010
Archiving &
Publication &
Word 2010 + PowerPoint 2010
WPF & Silverlight
“Sea Dragon” / “PhotoSynth” / “Deep Zoom”
This work is licensed under a Creative Commons
Attribution 3.0 United States License.
Office Live
Office 2010:
Tablet PC/UMPC
Goal: Transform Scholarly Communication
• Interoperability is essential
– Actively lobby and drive for consensus around technical standards and standardized protocols
proactively adopted by the community; enable broad community engagement
Customers have told Microsoft that interoperability is OUR responsibility
• Leverage Existing Community Protocols, Practices, Guidelines, etc.
– Example – metadata conventions / taxonomies / ontologies: a traditional strength for libraries –
and a critical component in enabling Web 2.0
• Optimize for data-driven research
– To both data (scientific) and to information (scholarly publications)
– Reproducible research + computational science
– Properly document / annotate scholarly output
• Data preservation (and provenance) should be baseline
– Documentation of the data’s provenance
– Preservation needs to be like “accessibility” features – i.e., assumed as required
• Semantic knowledge discovery & social networking
– Harnessing collective intelligence must be a consideration – since accessing research is a core
step in the life-cycle. Enable knowledge discovery
– Optimize for Web 2.0 scenarios and allow end-users/experts to find things more easily
Open Science
Open Access
Open Source
Open Data
“In order to help catalyze and facilitate
the growth of advanced CI, a critical
component is the adoption of open
access policy for data, publications and
NSF Advisory Committee on
Cyberinfrastructure (ACCI)
Microsoft Interoperability Principles
Open Connections to Microsoft Products
Support for Standards
Data Portability
Open Engagement
Membership / Participation
DataCite is an international consortium to establish easier access to scientific research data on
the Internet, to increase acceptance of research data as legitimate, citable contributions to the
scientific record, and to support data archiving that will permit results to be verified and repurposed for future study.
The Open Planets Foundation has been established to provide practical solutions and expertise in
digital preservation, building on the €15 million investment made by the European Union and Planets
consortium. OPF members benefit from the Planets results, new developments and the growing OPF
community that includes experts at some of the most prestigious research, technology and memory
institutions in Europe.
The Confederation of Open Access Repositories (COAR) is a not-for-profit association of
repository initiatives launched in October 2009. It aims to enhance the visibility and application
of research outputs through global networks of Open Access digital repositories.
The Coalition for Networked Information (CNI) is an organization dedicated to supporting the
transformative promise of networked information technology for the advancement of scholarly
communication and the enrichment of intellectual productivity. Membership includes some 200
institutions representing higher education, publishing, network and telecommunications,
information technology, and libraries and library organizations.
ICSTI, the International Council for Scientific and Technical Information, offers a unique
forum for interaction between organizations that create, disseminate and use scientific and
technical information. ICSTI's mission cuts across scientific and technical disciplines, as well as
international borders, to give member organizations the benefit of a truly global community.
CrossRef is a not-for-profit membership association whose mission is to enable easy
identification and use of trustworthy electronic content by promoting the cooperative
development and application of a sustainable infrastructure. CrossRef's general purpose is to
promote the development and cooperative use of new and innovative technologies to speed and
facilitate scholarly research.
Who we work with
GenePattern Reproducible Research Add-in
Services: Connects to
GenePattern database
Relationships: Inline graphics
are synchronized to dataset
Data: Resulting data (and
provenance) stored within
Word document
Data: Control and
execute query pipelines
into GenePattern
Source code and binary:
Creative Commons Add-in for Office 2007
Intent: Insert Creative Commons
licenses from within Office 2007
Services: Integrates with
Creative Commons Web API
to create new licenses
Relationships: license information stored
as RDF/XML within the document OOXML
Source code and binary:
Ontology Add-in for Word 2007
Services: Ontology
download web service
• John Wilbanks
Intent: Term recognition
& disambiguation
• Phil Bourne
• Lynn Fink
Ontology browser
Source code and binary:
Article Authoring Add-in for Word 2007
Services: repository
deposit via SWORD
Structure: Read, convert, and
author NLM XML documents
Relationships: ORE
Resource Map creation
Structure: Client-side XML validation
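The "ORE Resource Map creation" step can be illustrated with a minimal sketch. It is not the add-in's own code: it only shows the general shape of an OAI-ORE Resource Map serialized as RDF/XML, where a map describes an aggregation and the aggregation lists its member resources. The function name and all example.org URIs are hypothetical, and a complete Resource Map carries further required metadata (such as its creator and modification time) that is omitted here.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for RDF and the OAI-ORE vocabulary.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE = "http://www.openarchives.org/ore/terms/"


def build_resource_map(rem_uri, aggregation_uri, resource_uris):
    """Return a minimal RDF/XML Resource Map as a string (illustrative only)."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("ore", ORE)
    root = ET.Element(f"{{{RDF}}}RDF")

    # The Resource Map describes exactly one aggregation.
    rem = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": rem_uri})
    ET.SubElement(rem, f"{{{ORE}}}describes", {f"{{{RDF}}}resource": aggregation_uri})

    # The aggregation lists each resource that belongs to the article.
    agg = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": aggregation_uri})
    for uri in resource_uris:
        ET.SubElement(agg, f"{{{ORE}}}aggregates", {f"{{{RDF}}}resource": uri})

    return ET.tostring(root, encoding="unicode")


# Hypothetical URIs: an article available as a Word file and as NLM XML.
xml_text = build_resource_map(
    "https://example.org/rem/1",
    "https://example.org/agg/1",
    ["https://example.org/article.docx", "https://example.org/article.xml"],
)
print(xml_text)
```

Keeping the map and the aggregation as separate resources follows the ORE data model, in which the map is a document about the aggregation rather than the aggregation itself.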
Relationships: Citation lookup
and reference management
Binary (version 2.0):
Chem4Word - Chemistry Drawing in Word
Author/edit 1D and 2D chemistry.
Change chemical layout styles.
Intent: Recognizes
chemical dictionary
and ontology terms
Relationships: Navigate and
link referenced chemistry
Data: Semantics
stored in Chemistry
Markup Language
<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="">
  <molecule id="m1">
    <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
    <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
    <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
    <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
    <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
    <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
    <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
    <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    <bond atomRefs2="a1 a2" order="1" />
    <bond atomRefs2="a2 a3" order="1" />
    <bond atomRefs2="a2 a4" order="2" />
    <bond atomRefs2="a1 a5" order="1" />
    <bond atomRefs2="a1 a6" order="1" />
    <bond atomRefs2="a1 a7" order="1" />
    <bond atomRefs2="a3 a8" order="1" />
  </molecule>
</cml>
• Peter Murray-Rust
• Joe Townsend
• Jim Downing
Intelligence: Verifies validity
of authored chemistry
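Because the semantics are stored as Chemistry Markup Language, even a short script can read them back. The sketch below is not Chem4Word's validation logic; it is only an assumed, minimal consistency check on the molecule shown on this slide (coordinates omitted, and the empty xmlns attribute left out). The names check_molecule and hill_formula are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The slide's molecule, with 2D coordinates dropped for brevity.
CML = """<cml version="3" convention="org-synth-report">
  <molecule id="m1">
    <atom id="a1" elementType="C" /> <atom id="a2" elementType="C" />
    <atom id="a3" elementType="O" /> <atom id="a4" elementType="O" />
    <atom id="a5" elementType="H" /> <atom id="a6" elementType="H" />
    <atom id="a7" elementType="H" /> <atom id="a8" elementType="H" />
    <bond atomRefs2="a1 a2" order="1" /> <bond atomRefs2="a2 a3" order="1" />
    <bond atomRefs2="a2 a4" order="2" /> <bond atomRefs2="a1 a5" order="1" />
    <bond atomRefs2="a1 a6" order="1" /> <bond atomRefs2="a1 a7" order="1" />
    <bond atomRefs2="a3 a8" order="1" />
  </molecule>
</cml>"""


def check_molecule(cml_text):
    """Return (element counts, bond references that name no defined atom)."""
    root = ET.fromstring(cml_text)
    atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    dangling = [
        ref
        for bond in root.iter("bond")
        for ref in bond.get("atomRefs2", "").split()
        if ref not in atoms
    ]
    return Counter(atoms.values()), dangling


def hill_formula(counts):
    """Format element counts in Hill order (C, then H, then alphabetical)."""
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order)


counts, dangling = check_molecule(CML)
print(hill_formula(counts), dangling)
```

For the molecule above, every bond refers to a defined atom and the element counts correspond to acetic acid.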
Available soon:
Project Trident: Scientific Workflow Workbench
Author, Execute and Monitor Workflows
Organize collection of
individual workflow activities
View data products, performance
metrics, and provenance data
Compose and modify workflows
via drag & drop canvas
Available now:
Other relevant projects
• The Windows Azure platform offers a flexible, familiar
environment for developers to create cloud applications
and services. With Windows Azure, you can shorten your
time to market and adapt as demand for your service
grows. Windows Azure offers a platform that is easily
implemented alongside your current environment.
• Offerings:
– Windows Azure: operating system as an online service
– Microsoft SQL Azure: fully relational cloud database solution
– Windows Azure platform AppFabric: connects cloud services
and on-premises applications
– Microsoft Codename “Dallas”: information marketplace for
data and web services
Azure – Project “Dallas”
• Microsoft "Dallas" is a service allowing developers
and information workers to easily discover, purchase,
and manage premium data subscriptions in the
Windows Azure platform.
– Dallas is an information marketplace that brings data, imagery, and
real-time web services from leading commercial data providers and
authoritative public data sources together into a single location, under
a unified provisioning and billing framework.
– Dallas APIs allow developers and information workers to consume this
premium content with virtually any platform, application or business
Excel Services & Excel Web Access
• Excel Calculation Services (ECS) is the "engine" of
Excel Services that loads the workbook, calculates in
full fidelity with Microsoft Office Excel 2007, refreshes
external data, and maintains sessions.
• Excel Web Access (EWA) is a Web Part that displays
and enables interaction with the Microsoft Office
Excel workbook in a browser by using Dynamic HTML (DHTML) and
JavaScript without the need for downloading ActiveX
controls on your client computer, and can be
connected to other Web Parts on dashboards and
other Web Part Pages.
• Excel Web Services (EWS) is a Web service hosted in
Microsoft Office SharePoint Services that provides
several methods that a developer can use as an
application programming interface (API) to build
custom applications based on the Excel workbook.
Microsoft’s “OData” Initiative
• What is it?
– The Open Data Protocol (OData) is a Web protocol for querying and updating data that
provides a way to unlock your data and free it from silos that exist in applications today.
OData does this by applying and building upon Web technologies such as HTTP, Atom
Publishing Protocol (AtomPub) and JSON to provide access to information from a variety
of applications, services, and stores. The protocol emerged from experiences
implementing AtomPub clients and servers in a variety of products over the past several years.
– OData is being used to expose and access information from a variety of sources including,
but not limited to, relational databases, file systems, content management systems and
traditional Web sites.
– OData is consistent with the way the Web works - it makes a deep commitment to URIs
for resource identification and commits to an HTTP-based, uniform interface for
interacting with those resources (just like the Web). This commitment to core Web
principles allows OData to enable a new level of data integration and interoperability
across a broad range of clients, servers, services, and tools.
– OData is released under the Open Specification Promise to allow anyone to freely
interoperate with OData implementations.
• Find out more
– Contact Pablo Castro
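Since OData identifies resources by URI and queries them over plain HTTP, a query is ultimately just a URL. The following is a minimal sketch of assembling one with the standard $filter, $orderby, and $top query options. The service root, the "Datasets" entity set, and the "Year" and "Title" properties are hypothetical; a real service publishes its own entity sets and properties in its metadata document.

```python
from urllib.parse import quote, urlencode


def build_odata_query(service_root, entity_set, filter_expr=None, order_by=None, top=None):
    """Return an OData query URL for one entity set (illustrative helper)."""
    options = {}
    if filter_expr:
        options["$filter"] = filter_expr
    if order_by:
        options["$orderby"] = order_by
    if top is not None:
        options["$top"] = str(top)

    url = f"{service_root.rstrip('/')}/{entity_set}"
    if options:
        # Leave "$" readable in option names; percent-encode spaces in values.
        url += "?" + urlencode(options, safe="$", quote_via=quote)
    return url


# Hypothetical service: the five earliest-titled datasets from 2009 onward.
url = build_odata_query(
    "https://example.org/odata",
    "Datasets",
    filter_expr="Year ge 2009",
    order_by="Title",
    top=5,
)
print(url)
```

An HTTP GET on such a URL would return the matching entries as an Atom feed or as JSON, depending on the format the client requests.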
Microsoft’s Open Government Data Initiative
• The Open Government Data Initiative (OGDI) is a cloud-based
collection of software assets that enables publicly available
government data to be easily accessible. Using open standards and
application programming interfaces (API), developers and
government agencies can retrieve the data programmatically for use
in new and innovative online applications, or mash-ups that can help:
– Improve citizen services
– Enhance collaboration between government agencies and private
– Increase government transparency
• OGDI promotes the use of this data by capturing and publishing reusable software assets, patterns, and practices. The data repository
already holds over 60 different government datasets that are readily
available for use in new applications, and is continuously updated
with additional government datasets.
Data Curation Add-in for Microsoft Excel
• In partnership with the California Digital
Library’s Curation Center
– In collaboration with Tricia Cruse & John Kunze
– Part of DataONE (an NSF DataNet project)
Data Curation Add-in for Microsoft Excel
Proposed functionality under consideration:
• Support for versioning, so that revision history and the original raw data can be easily protected and recovered
• Standardized date/time stamps, so that researchers can easily determine when the data were created and last updated
• A “workbook builder” allowing researchers to select from globally shared standardized layouts for capturing data
• Ability to export metadata in a standard format (e.g., a DataCite citation or an EML document that describes the
dataset(s) in a workbook), so that researchers can readily share their data
• Ability to select from a globally shared vocabulary of terms for data descriptions (e.g., column names) and, as needed,
to add new terms to the globally shared vocabulary, to enable wide collaboration between researchers
• Ability to import term descriptions from the shared vocabulary and annotate them locally to refine their definitions as
used in the dataset
• “Speed bumps” to discourage use of macros and customizations that would impede interoperation of data imported
from Excel into other applications
• Ability to deposit data and metadata directly into a data archive, to enable compliance with funding agency requirements
to preserve and publish research data
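Two of the proposed features, standardized date/time stamps and export of a DataCite citation, can be sketched in a few lines. This is not the add-in's design, which was still under consideration. It assumes ISO 8601 UTC timestamps and the "Creator (PublicationYear): Title. Publisher. Identifier" citation pattern, and every name and value in the record is a hypothetical placeholder.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DatasetRecord:
    """Descriptive metadata for one workbook (hypothetical field set)."""
    creators: list
    title: str
    publisher: str
    publication_year: int
    identifier: str


def utc_timestamp(moment=None):
    """Format a timezone-aware datetime (default: now) as an ISO 8601 UTC stamp."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def format_citation(record):
    """Render a citation as 'Creator (Year): Title. Publisher. Identifier'."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.publication_year}): {record.title}. "
        f"{record.publisher}. {record.identifier}"
    )


# Placeholder values only; no real dataset is being cited.
record = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    title="Example Stream Temperatures",
    publisher="Example Data Center",
    publication_year=2010,
    identifier="doi:10.1234/example",
)
print(format_citation(record))
print(utc_timestamp())
```

Recording stamps in UTC with a fixed format keeps "created" and "last updated" values comparable across collaborators in different time zones.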
Lee Dirks
Director—Education & Scholarly Communication
Microsoft External Research
Facebook: Scholarly Communication at Microsoft