Speeding Science
Solutions for Data Curation from Microsoft (Research)

Lee Dirks
Director, Education & Scholarly Communication
External Research Division
Microsoft Corporation

Microsoft External Research
• Division within Microsoft Research focused on partnerships between academia, industry and government to advance computer science, education, and research in fields that rely heavily upon advanced computing
• Supporting groundbreaking research to help advance human potential and the wellbeing of our planet
• Developing advanced technologies and services to support every stage of the research process
• Microsoft External Research is committed to interoperability and to providing open access, open tools, and open technology

Mission
• Optimize and extend Microsoft software to meet the specific needs of the academic community
• Our approach: Conduct applied projects to enhance academic productivity by evolving Microsoft’s scholarly communication offerings
• Microsoft External Research is uniquely positioned to drive this initiative across Microsoft

The Scholarly Communication Lifecycle
[Diagram of lifecycle stages and associated Microsoft products]
Stages: Data Collection, Research & Analysis; Authoring; Publication & Dissemination; Storage, Archiving & Preservation; Discoverability
Products and technologies shown: Excel 2010; Windows Server HPC; “Astoria” / “Pop Fly”; Office OpenXML; XPS Format; SQL Server & Entity Framework; Rights Management; Data Protection Manager; FAST; MSR Academic Search; “Bookweb”; SharePoint 2010; Word 2010 + PowerPoint 2010; WPF & Silverlight; “Sea Dragon” / “PhotoSynth” / “Deep Zoom”

This work is licensed under a Creative Commons Attribution 3.0 United States License.
[Lifecycle diagram, continued] Stage: Collaboration. Products shown: SharePoint; LiveMeeting; Office Live; Office 2010 (Word, PowerPoint, Excel, OneNote); Tablet PC/UMPC

Goal: Transform Scholarly Communication
• Interoperability is essential
  – Actively lobby and drive for consensus around technical standards and standardized protocols proactively adopted by the community; enable broad community engagement
• Customers have told Microsoft that interoperability is OUR responsibility
• Leverage Existing Community Protocols, Practices, Guidelines, etc.
  – Example: metadata conventions / taxonomies / ontologies, a traditional strength for libraries and a critical component in enabling Web 2.0
• Optimize for data-driven research
  – Applies to both data (scientific) and information (scholarly publications)
  – Reproducible research + computational science
  – Properly document / annotate scholarly output
• Data preservation (and provenance) should be baseline
  – Documentation of the data’s provenance
  – Preservation needs to be treated like “accessibility” features, i.e., assumed as required
• Semantic knowledge discovery & social networking
  – Harnessing collective intelligence must be a consideration, since accessing research is a core step in the life-cycle
  – Enable knowledge discovery
  – Optimize for Web 2.0 scenarios and allow end-users/experts to find things more easily

Open Science
• Open Access
• Open Source
• Open Data
“In order to help catalyze and facilitate the growth of advanced CI, a critical component is the adoption of open access policy for data, publications and software.”
NSF Advisory Committee on Cyberinfrastructure (ACCI)

Microsoft Interoperability Principles
http://www.microsoft.com/interop/
• Open Connections to Microsoft Products
• Support for Standards
• Data Portability
• Open Engagement

Membership / Participation
• DataCite is an international consortium that aims to establish easier access to scientific research data on the Internet, to increase acceptance of research data as legitimate, citable contributions to the scientific record, and to support data archiving that will permit results to be verified and repurposed for future study.
• The Open Planets Foundation (OPF) has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and the Planets consortium. OPF members benefit from the Planets results, new developments and the growing OPF community, which includes experts at some of the most prestigious research, technology and memory institutions in Europe.
• The Confederation of Open Access Repositories (COAR) is a not-for-profit association of repository initiatives launched in October 2009. It aims to enhance the visibility and application of research outputs through global networks of Open Access digital repositories.
• The Coalition for Networked Information (CNI) is an organization dedicated to supporting the transformative promise of networked information technology for the advancement of scholarly communication and the enrichment of intellectual productivity. Membership includes some 200 institutions representing higher education, publishing, network and telecommunications, information technology, and libraries and library organizations.
• ICSTI, the International Council for Scientific and Technical Information, offers a unique forum for interaction between organizations that create, disseminate and use scientific and technical information. ICSTI's mission cuts across scientific and technical disciplines, as well as international borders, to give member organizations the benefit of a truly global community.
• CrossRef is a not-for-profit membership association whose mission is to enable easy identification and use of trustworthy electronic content by promoting the cooperative development and application of a sustainable infrastructure. CrossRef's general purpose is to promote the development and cooperative use of new and innovative technologies to speed and facilitate scholarly research.

Who we work with

GenePattern Reproducible Research Add-in
Services: Connects to GenePattern database
Relationships: Inline graphics are synchronized to dataset
Data: Resulting data (and provenance) stored within Word document
Data: Control and execute query pipelines into GenePattern
Source code and binary: http://GenepatternWordAddin.codeplex.com

Creative Commons Add-in for Office 2007
Intent: Insert Creative Commons licenses from within Office 2007
Services: Integrates with Creative Commons Web API to create new licenses
Relationships: License information stored as RDF XML within the document OOXML
Source code and binary: http://ccaddin2007.codeplex.com

Ontology Add-in for Word 2007
Services: Ontology download web service
Intent: Term recognition & disambiguation
Relationships: Ontology browser
• John Wilbanks
• Phil Bourne
• Lynn Fink
Source code and binary: http://research.microsoft.com/ontology/

Article Authoring Add-in for Word 2007
Services: Repository deposit via SWORD
Structure: Read, convert, and author NLM XML documents
Relationships: ORE Resource Map creation
Structure: Client-side XML validation
Relationships: Citation lookup and reference management
Binary (version 2.0): http://research.microsoft.com/authoring/

Chem4Word - Chemistry Drawing in Word
Author/edit 1D and 2D chemistry. Change chemical layout styles.
Intent: Recognizes chemical dictionary and ontology terms
Relationships: Navigate and link referenced chemistry
Data: Semantics stored in Chemistry Markup Language

<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
      <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
      <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
      <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
      <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
      <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
      <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
      <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>

• Peter Murray-Rust
• Joe Townsend
• Jim Downing
Intelligence: Verifies validity of authored chemistry
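As a rough illustration of what "semantics stored in Chemistry Markup Language" makes possible, the sketch below reads the slide's CML molecule with Python's standard library, derives a molecular formula, and checks that every bond refers to a defined atom. This is not how Chem4Word itself validates chemistry; the function names are hypothetical, and the sample omits the XML declaration and the x2/y2 layout coordinates for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace declared on the slide's <cml> root element.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# The slide's molecule, with the x2/y2 layout coordinates left out for brevity.
CML_SAMPLE = """<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" />
      <atom id="a2" elementType="C" />
      <atom id="a3" elementType="O" />
      <atom id="a4" elementType="O" />
      <atom id="a5" elementType="H" />
      <atom id="a6" elementType="H" />
      <atom id="a7" elementType="H" />
      <atom id="a8" elementType="H" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def element_counts(root):
    """Count atoms per element symbol across all <atom> entries."""
    return Counter(a.get("elementType") for a in root.iterfind(".//cml:atom", CML_NS))


def hill_formula(counts):
    """Write a formula in Hill order: C, then H, then the rest alphabetically."""
    order = []
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
    order += sorted(el for el in counts if el not in order)
    return "".join(el + (str(counts[el]) if counts[el] > 1 else "") for el in order)


def dangling_bond_refs(root):
    """Return atom ids that a <bond> references but no <atom> defines."""
    known = {a.get("id") for a in root.iterfind(".//cml:atom", CML_NS)}
    missing = []
    for bond in root.iterfind(".//cml:bond", CML_NS):
        for ref in bond.get("atomRefs2", "").split():
            if ref not in known:
                missing.append(ref)
    return missing


root = ET.fromstring(CML_SAMPLE)
formula = hill_formula(element_counts(root))
print(formula, dangling_bond_refs(root))  # prints: C2H4O2 []
```

The formula C2H4O2 and the bond pattern are consistent with acetic acid. Because the structure is stored as data rather than as a picture, checks of this kind can run on the document itself.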
Available soon: http://research.microsoft.com/chem4word/

Project Trident: Scientific Workflow Workbench
• Author, Execute and Monitor Workflows
• Organize collection of individual workflow activities
• View data products, performance metrics, and provenance data
• Compose and modify workflows via drag & drop canvas
Available now: http://research.microsoft.com/collaboration/tools/trident.aspx

Other relevant projects

• The Windows Azure platform offers a flexible, familiar environment for developers to create cloud applications and services. With Windows Azure, you can shorten your time to market and adapt as demand for your service grows. Windows Azure offers a platform that is easily implemented alongside your current environment.
• Offerings:
  – Windows Azure: operating system as an online service
  – Microsoft SQL Azure: fully relational cloud database solution
  – Windows Azure platform AppFabric: connects cloud services and on-premises applications
  – Microsoft Codename “Dallas”: information marketplace for data and web services

Azure – Project “Dallas”
• Microsoft "Dallas" is a service allowing developers and information workers to easily discover, purchase, and manage premium data subscriptions in the Windows Azure platform.
  – Dallas is an information marketplace that brings data, imagery, and real-time web services from leading commercial data providers and authoritative public data sources together into a single location, under a unified provisioning and billing framework.
  – Dallas APIs allow developers and information workers to consume this premium content with virtually any platform, application or business workflow.
  – More: http://www.microsoft.com/windowsazure/dallas/

Excel Services & Excel Web Access
• Excel Calculation Services (ECS) is the "engine" of Excel Services that loads the workbook, calculates in full fidelity with Microsoft Office Excel 2007, refreshes external data, and maintains sessions.
• Excel Web Access (EWA) is a Web Part that displays the Microsoft Office Excel workbook in a browser and enables interaction with it by using Dynamic HTML (DHTML) and JavaScript, without the need to download ActiveX controls to the client computer. It can be connected to other Web Parts on dashboards and other Web Part Pages.
• Excel Web Services (EWS) is a Web service hosted in Microsoft Office SharePoint Services that provides several methods that a developer can use as an application programming interface (API) to build custom applications based on the Excel workbook.
• More: http://msdn.microsoft.com/enus/library/ms546696.aspx

Microsoft’s “OData” Initiative
• What is it?
  – The Open Data Protocol (OData) is a Web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today. OData does this by applying and building upon Web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores. The protocol emerged from experiences implementing AtomPub clients and servers in a variety of products over the past several years.
  – OData is being used to expose and access information from a variety of sources including, but not limited to, relational databases, file systems, content management systems and traditional Web sites.
  – OData is consistent with the way the Web works: it makes a deep commitment to URIs for resource identification and commits to an HTTP-based, uniform interface for interacting with those resources (just like the Web).
    This commitment to core Web principles allows OData to enable a new level of data integration and interoperability across a broad range of clients, servers, services, and tools.
  – OData is released under the Open Specification Promise to allow anyone to freely interoperate with OData implementations.
• Find out more
  – http://odata.org & http://msdn.com/data
  – Contact Pablo Castro (pablo.castro@microsoft.com) / Blog: http://blogs.msdn.com/pablo

Microsoft’s Open Government Data Initiative
• The Open Government Data Initiative (OGDI) is a cloud-based collection of software assets that makes publicly available government data easily accessible. Using open standards and application programming interfaces (APIs), developers and government agencies can retrieve the data programmatically for use in new and innovative online applications, or mash-ups, that can help:
  – Improve citizen services
  – Enhance collaboration between government agencies and private organizations
  – Increase government transparency
• OGDI promotes the use of this data by capturing and publishing reusable software assets, patterns, and practices. The data repository already holds over 60 different government datasets that are readily available for use in new applications, and is continuously updated with additional government datasets.
• More: http://www.microsoft.com/industry/government/opengovdata/

Data Curation Add-in for Microsoft Excel
• In partnership with the California Digital Library’s Curation Center
  – In collaboration with Tricia Cruse & John Kunze
  – Part of DataONE (an NSF DataNet project)

Data Curation Add-in for Microsoft Excel
Proposed functionality under consideration:
• Support for versioning, so that revision history and the original raw data can be easily protected and recovered,
• Standardized date/time stamps so that researchers can easily determine when the data were created and last updated,
• A “workbook builder” allowing researchers to select from globally shared standardized layouts for capturing data,
• Ability to export metadata in a standard format (e.g., a DataCite citation or an EML document that describes the dataset(s) in a workbook) so that researchers can readily share their data,
• Ability to select from a globally shared vocabulary of terms for data descriptions (e.g., column names), and as needed to add new terms to the globally shared vocabulary, to enable wide collaboration between researchers,
• Ability to import term descriptions from the shared vocabulary and annotate them locally to refine their definitions as used in the dataset,
• “Speed bumps” to discourage use of macros and customizations that would impede interoperation of data imported from Excel into other applications, and
• Ability to deposit data and metadata directly into a data archive to enable compliance with funding agency requirements to preserve and publish research data.

Questions?
Lee Dirks
Director, Education & Scholarly Communication
Microsoft External Research
ldirks@microsoft.com
http://research.microsoft.com/people/ldirks
URL: http://www.microsoft.com/scholarlycomm/
Facebook: Scholarly Communication at Microsoft
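Two of the proposed Data Curation Add-in features, standardized date/time stamps and export of a citation, can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the add-in's design was still under consideration, the `DatasetRecord` fields and sample values are hypothetical, and the citation layout (Creator (Year): Title. Publisher. Identifier) is only loosely modeled on the style DataCite recommends.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DatasetRecord:
    """Hypothetical descriptive fields for the dataset held in one workbook."""
    creator: str
    year: int
    title: str
    publisher: str
    identifier: str
    created: datetime
    updated: datetime


def iso_stamp(moment):
    """Render a timezone-aware datetime as a UTC ISO 8601 string."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must carry a time zone")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def citation(record):
    """Format a citation as Creator (Year): Title. Publisher. Identifier."""
    return (f"{record.creator} ({record.year}): {record.title}. "
            f"{record.publisher}. {record.identifier}")


# Placeholder values only; the identifier is not a real DOI.
pacific = timezone(timedelta(hours=-8))
record = DatasetRecord(
    creator="Example, A.",
    year=2010,
    title="Stream temperature readings",
    publisher="Example Data Archive",
    identifier="doi:10.0000/example",
    created=datetime(2010, 3, 1, 9, 30, tzinfo=pacific),
    updated=datetime(2010, 3, 4, 16, 5, tzinfo=pacific),
)

print(citation(record))
print(iso_stamp(record.created), iso_stamp(record.updated))
```

Normalizing both stamps to UTC means that two researchers in different time zones read the same "created" and "last updated" values, which is the point of standardizing them.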