SEAD
Sustainable Environment –
Actionable Data
Margaret Hedstrom
SEAD PI/Project Director
Professor & Associate Dean
UM School of Information
Robert H. McDonald
SEAD Sr. Personnel
Assoc. Dean/Associate Director
Indiana University
CNI Fall Members Meeting
Arlington, VA
12/12/2011
NSF DataNet Program
• new types of organizations that integrate library & archival sciences,
cyberinfrastructure, computer & information sciences, & domain
science expertise
• provide reliable digital preservation, access, integration, and
analysis capabilities for science and/or engineering data over a
decades-long timeline;
• continuously anticipate and adapt to changes in technologies and in
user needs and expectations;
• engage in research to drive the leading edge forward
• serve as component elements of an interoperable data preservation
and access network
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141
Partners
• SEAD’s Unique Contributions
– Address domain-driven needs & requirements
– Serve scientists and researchers in the “long tail”
– Integrate existing technologies, tools & services (rather than build new from scratch)
Sustainability Science
[Diagram labels: Science; Cooperation; Technology; Policy; Economics; Poverty & Justice]
Data challenges
• Heterogeneity of all kinds
• Multiple scales
• Multidisciplinary
• Many small datasets
The long tail of scientific research
• Small and derived data sets
• Heterogeneous data
• Multiple sources of data
• Short-lived data with long-term value
• Value of data grows when combined & integrated
SEAD’s Goals
• Provide data services that address the needs of
researchers working toward sustainability
• Integrate these services into a generalizable “Active and
Social Curation” infrastructure suited to the social
structure and economics of long-tail research
communities
• Develop capabilities to package and migrate the most
valuable datasets to a federated repository
infrastructure for long-term preservation
• Education, outreach, & training to disseminate SEAD’s
contributions to other projects & communities
SEAD’s Strategy
• Leverage social media for discovery of data,
interest, and expertise
• Move data curation upstream in the data life
cycle
• Involve domain scientists in setting priorities
for evolution of data and services
• Take advantage of existing infrastructures
(Institutional Repositories, ICPSR) for long-term preservation
Active and Social Curation
• Engage researchers during projects, not at the
end
• Automatically capture metadata as defined by
the data producers
• Provide facilities for commentary,
recommendations, and mark-up of data
• Further reduce costs by re-engineering
curation processes to leverage this rich
metadata and volunteered effort
Active Curation Model
[Diagram labels: Active Curation; Social Media; Workflows; Data; Metadata; Review; Rating; Commenting]
SEAD Status
• Phase 1 (Months 1-18): Develop prototype
• Phase 2 (Years 3-5): Grow SEAD users, data, and functionality
SEAD start date: 10/1/2011
In other words, SEAD is not ready to accept your data!
SEAD Personnel
• Margaret Hedstrom, PI (Michigan)
• Praveen Kumar, co-PI (Illinois)
• Jim Myers, co-PI (RPI)
• Beth Plale, co-PI (Indiana)
• Ann Zimmerman, co-PI/Project Manager (Michigan)
• George Alter (ICPSR)
• Bryan Beecher (ICPSR)
• Katy Börner (Indiana)
• Robert McDonald (Indiana)
• Jude Yew, Post-doc (Michigan)
• + many more to come
http://sead-data.net
SEAD TEAM
University of Michigan: Margaret Hedstrom (UM PI), Ann
Zimmerman (Co-PI and Project Manager), George Alter, Bryan
Beecher, Charles Severance, Karen Woollams, Jude Yew.
Indiana University: Beth Plale (IU PI), Katy Börner, Robert H.
McDonald, Kavitha Chandrasekar, Robert Ping, Stacy Kowalczyk,
Robert Light.
University of Illinois: Praveen Kumar (UIUC PI), Rob Kooper, Luigi
Marini, Terry McLaren.
Rensselaer Polytechnic Institute: Jim Myers (RPI PI), Ram Prasanna
Govind Krishnan, Lindsay Todd, Adam Wilson.
SEAD Cyberinfrastructure
• An international resource
for sustainability science
• Novel technical and
business approaches to
supporting the long-tail
of research data
• Lifecycle support:
actionable data services
integrated with curation
and preservation
infrastructure
Key Challenges for SEAD
Cyberinfrastructure
• Managed data storage and services are expensive!
• Begging for metadata doesn’t work!
• Curation and preservation are time consuming!
• The long-tail is not standardized!
• Data collections are always missing something valuable!
• Data models evolve!
• Cyberinfrastructure is obsolete by the time you build it!
• Building community as you leverage cyberinfrastructure
SEAD: Social Networking
• Co-authorship
• Co-funding
• Micro-citation
• Shared project repositories
• Shared tags
• Threaded discussions
• Quoting, forwarding, …
Linked Data and Repositories
• Tag and annotate data
• Overlay it with reference data
• Organize it in domain terminology
• Link it to people, papers, projects, conversations…
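
The tagging and linking listed above can be pictured as a small set of subject-predicate-object statements. The sketch below is a minimal illustration in plain Python and is not SEAD's implementation; every identifier in it (dataset names, predicate names, people, papers, projects) is hypothetical.

```python
from collections import defaultdict

# Hypothetical triples: (subject, predicate, object). All identifiers are
# made up for illustration and do not come from SEAD.
TRIPLES = [
    ("dataset:stream-gauge-2011", "taggedWith", "tag:hydrology"),
    ("dataset:stream-gauge-2011", "createdBy", "person:a-researcher"),
    ("dataset:stream-gauge-2011", "citedIn", "paper:example-2011"),
    ("dataset:soil-moisture-2010", "taggedWith", "tag:hydrology"),
    ("dataset:soil-moisture-2010", "partOf", "project:example-watershed"),
]


def build_index(triples):
    """Index triples by subject so that links can be looked up quickly."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def linked(index, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [obj for pred, obj in index.get(subject, []) if pred == predicate]


def sharing_tag(triples, tag):
    """Return the sorted subjects that carry the given tag."""
    return sorted(s for s, p, o in triples if p == "taggedWith" and o == tag)


index = build_index(TRIPLES)
print(linked(index, "dataset:stream-gauge-2011", "createdBy"))
print(sharing_tag(TRIPLES, "tag:hydrology"))
```

Once data, people, papers, and projects sit in the same set of statements, a shared tag is enough to surface two otherwise unrelated datasets together.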
Using Science of Science to Link
Repositories
KEY SEAD Questions
• What could SEAD capture when?
• How can SEAD provide direct value
to data producers, users, and
curators?
• How can robust web-services and
social computing lower barriers and
reduce/realign costs?
SEAD: Active Content Repository
• With the ‘Big Picture’ graph in hand, curators
can:
▫ Focus on what to curate and when,
▫ Automate parts of the process
▫ Use existing/emerging technologies for packaging
and preserving datasets
▫ Better manage federated repositories
SEAD: Leveraging Existing Resources
• Cyberinfrastructure
▫ IU Data Capacitor/HPC Capabilities
▫ UIUC/NCSA HPC Capabilities
▫ Rensselaer CCNI Capabilities
• Repositories
▫ UM Deep Blue
▫ IU ScholarWorks
▫ ICPSR Repository
▫ UIUC IDEALS
SEAD LayerCake View
• Services over an
active content layer
that is backed
by/harvested into a
federated archive
infrastructure based
on institutional
resources
[Diagram labels: Data Conservancy; Network of Data Producers; User Network; Web User Interface; Active Content Repository; Services Provided (Content Mining, Curation Decisions, Archival data generation, Other services); Virtual Archives; Institutional Repositories (IU, RPI, UIUC, UM, ICPSR)]
CI Technical Approach
[Diagram labels, first panel: Active and Social Curation; Data Acquisition, Analysis and Simulation; VIVO/Linked Data; Active Content Repository; Curation Boundary; Metadata Management; Automated Curation Workflow/Rule Engine; DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …; Operates on Metadata, Content Objects and Trigger Events; OAIS Repository Federation]
[Diagram labels, second panel: Appraisal; Selection; Ingest scripts: fixity, integrity, authentication, and transformation; Ingest, AIPs; Compound Objects - OAI-ORE; Digital Repository Federation (OAIS compliant); Preservation Actions; Migration and Emulation Tools; Wide-Area File System; Dissemination Packages; Access Mechanisms and E-Scholarship Services; Search, Browse, Annotation, Visualization Tools; Use, Reuse, Repurposing Tools; Scholarly Communication; Contributor; User]
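
Fixity is one of the checks named for the ingest scripts in the diagram above. As a rough illustration of what such a check usually involves, the sketch below records a SHA-256 checksum for a file and later verifies that the file still matches it. The function names, the sample file, and the choice of SHA-256 are assumptions made for this example; the slides do not say which algorithm or tooling SEAD uses.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(path, expected_hex):
    """Return True if the file's current digest matches the recorded one."""
    return sha256_of(path) == expected_hex.lower()


# Demonstration on a throwaway file (hypothetical sample data).
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "sample.csv")
    with open(sample, "wb") as handle:
        handle.write(b"site,depth_cm\nA,12\n")

    recorded = sha256_of(sample)             # digest stored at ingest time
    intact = check_fixity(sample, recorded)  # file unchanged since ingest

    with open(sample, "ab") as handle:       # simulate a later modification
        handle.write(b"B,15\n")
    changed = check_fixity(sample, recorded)

print(intact, changed)
```

Storing the digest alongside the object at ingest lets any later preservation action repeat the same comparison.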
Toward PetaScale Data
• Internet2 upgrade:
▫ Total bandwidth from 100 Gbps to 8.8 Tbps
▫ Moving a petabyte of data will go from 10 days to 25 hrs
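
The slide does not say which link speeds the 10-day and 25-hour figures assume. As a rough check, the sketch below assumes a single sustained connection of 10 Gbps before the upgrade and 100 Gbps after it (both rates are assumptions, not taken from the slide), a decimal petabyte, and no protocol overhead.

```python
# 1 PB taken as 10**15 bytes (decimal), expressed in bits.
PETABYTE_BITS = 8 * 10**15


def transfer_seconds(size_bits, rate_bits_per_second, efficiency=1.0):
    """Ideal transfer time in seconds at a sustained rate.

    `efficiency` is the fraction of the nominal rate actually achieved
    (1.0 means no overhead at all).
    """
    if rate_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("rate must be positive and efficiency in (0, 1]")
    return size_bits / (rate_bits_per_second * efficiency)


# Assumed per-connection rates: 10 Gbps before, 100 Gbps after.
days_at_10g = transfer_seconds(PETABYTE_BITS, 10e9) / 86400
hours_at_100g = transfer_seconds(PETABYTE_BITS, 100e9) / 3600

print(f"{days_at_10g:.1f} days at 10 Gbps")
print(f"{hours_at_100g:.1f} hours at 100 Gbps")
```

Under these assumptions the ideal times come to roughly 9.3 days and 22.2 hours, which is in the same range as the slide's figures once some overhead is allowed for.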
SEAD 18 Month Prototype Targets for
Cyberinfrastructure
• Active and Social Content Curation
▫ Pilot Active Content Repository, VIVO deployments
▫ Exemplar services for Data Ingest, Discovery, Reuse, Curation
• CI for Long-term Access
▫ Data model, protocol design/development
▫ Pilot Federated Repository infrastructure
SEAD CI QuickView
• SEAD will quickly build a repository and data services infrastructure
for sustainability research that can be responsively adapted based on
community feedback – Community Agile Development
• SEAD will leverage existing tools and emerging practices to
dramatically enhance the interactions of researchers and data
librarians – Active Curation
• SEAD’s focus on the long-tail will force an emphasis on ease-of-use
and low costs that is critical for long-term sustainability – Leverage
Existing Institution Resources for Long-term Access
• SEAD will leverage experiences in the sustainability research
community to provide guidance for other long-tail communities
making the transition to an interdisciplinary, systems-oriented
approach to research – Sustainability and Resource Growth
Partnership and Collaboration
Acknowledgments
SEAD is funded by the National Science
Foundation under cooperative agreement
#OCI0940824
• For more on SEAD go to: http://sead-data.net
• Follow us on Twitter: @SEADdatanet