Research Data Alliance (RDA)
2nd Working Group Collaboration Meeting
November 13 – 14, 2014

All meetings will be held at:
National Institute of Standards and Technology (NIST)
Administration Building, Lecture Room A
100 Bureau Drive, Building 101, Gaithersburg, MD 20899-1060

Objective

This meeting is a continuation of a meeting series that was started in Garching last February. The objective is to have a detailed update from and focused discussion among Working Groups, something that has proved elusive at the busy Plenary meetings, and to generate discussion on how the WGs can work together to improve overall RDA outputs. The focus of the Garching meeting was on the Data Foundations and Terminology WG; the focus of this meeting will be on metadata, on the relationships among the Working Groups, and on the evolving notion of the Data Fabric as a framework into which the WG outputs would fit.

Schedule (flexible)

Thursday

1:00 – 1:30 Welcome, NIST Introduction

1:30 – 2:30 Metadata Standards Directory

James Warren
Materials Genome Initiative, Data, Open Science and NIST
Goal: develop materials innovation infrastructure
Achieve national goals in energy, etc.
Design process highlighted
NIST role in MGI
Data inputs range from quantum scale to manufacturing capabilities
Vision: improve upon the process, resolve issues around digital data
Goals:
1. Establish data and model exchange protocols
2. Create means to ensure quality
3. New methods, data-driven science, big data possibilities
Use case example to demonstrate constraints: what can the data show us?
Objective: Search over multiple repositories simultaneously for data on all materials that fulfill use case constraints
Success scenario: can search through models to design new materials with improved properties

DISCUSSION
Use case: Incorporating metadata into the use case has helped the presentation connect with materials scientists' understanding.
Problem: Very few materials scientists seem to appreciate the potential
Problems with terminology, labels that exist across domains
Raphael: How does NIST meet the international need beyond the US focus?
Warren: This information is a form of publishing… In short, he believes that it should be open, with an emphasis on urgency.
Write use cases using RDA to solve problem
Rainier – repositories: Are they distributed or centralized?
Warren: desire open architecture; talks are in motion to gain that advantage

2:30 – 2:50 Data Foundations and Terminology

Jane Greenberg
Metadata Standards Directory
Goals and work plan: on target
Progress made thanks to intern participation!
Goals: develop an open collaborative approach to metadata standards
Establish a working group.
Analytics show people are visiting the established directory often to evaluate standards.
Master's students to do papers and earn credit on this research.
Accomplishments
Attempt to move copy of DCC into GitHub, though not yet ready for wider use
Policy development
Outreach success
Action items from RDA 4: Assess GitHub approach, seek new technology and feedback participation.
Do we need firm objectives? What is it specifically that we want to move forward on?
Hope to get updates from all eight working groups today, get working groups to respond to objectives.
Where does metadata fit into data fabric notion?
Kathy Fontaine: Any issues with RDA website? Difficulties with email success…
What proven, convincing impact can we share to drive our initiative?
Jane – RDA has helped to spread the word; need numbers to express compelling use of the website.
RDA needs to prove what can be done. How does prototype help to underline our cause?

Rebecca
Metadata Standards Directory
Metadata Directory on GitHub
Changes are moderated
Former directory: more difficult to navigate
Current prototype is apparently more user-friendly to those who are interested.
Barrier to making changes on GitHub: high complexity for entry
Is the page usable?
How to automate policy of permitting greater openness to directory?
Move beyond human readability to computer readability.
Kathy: Working groups, if there's more to be done, should get to a point where there's a self-contained conclusion and report on next steps.
Peter: Argument to make: where do we invest? At a critical point where we should be blunt in this respect.
Need good presentation package that communicates the value RDA has to offer

Keith Jeffery
Metadata Interest Group
Problem with many different standards. Need standards to share the same elements to permit wider use.
Group sessions trialed template use cases and received feedback. Now, plan to provide revision before ‘P5’
Need to overcome threshold of participation in GitHub
Vision: difference between data and metadata is mode of use.
Metadata not just for data, also for users, software services, computing resources
Metadata is not just for description and discovery; desire to make virtual research environment
Metadata must be machine-understandable as well as human-understandable
Management of (meta)data is also relevant.
Concentration is on datasets
What metadata is required?
Assertions and Questions
Plan: involve not only metadata groups but all RDA
Please test packages for feedback and use separate feedback and apply to all packages.
What is needed from other groups?
Interact to encourage human knowledge base

DISCUSSION
Peter: happy with these plans
Primary issue in previous meeting: length of time. What is needed to accomplish goals? What is possible in RDA framework?
Keith: Need active project from both groups, from Europe effort to U.S.
Keith: Advocate for a technical project, not necessarily a collaborative one
Larry: What are specific plans?
Jane: Exploring at a high level.
Kathy: Belmont Forum for funding. Demonstrate prototype to pull awareness for organizational assembly.
Peter: Need to anticipate situation of having too much data? Directory doesn’t yet help because it’s not yet automated.
Keith: That’s the direction we need. Make directory into package so it is machine-understandable.
Keith: will document list of recommendations
Peter: still does not see how it’s feasible
Rainier: groups struggling to define metadata, esp. details. Concentration on defining vocab for interviews. How to validate packages?
Peter: how to make use of the knowledge?
Jane: two-pronged answer: at least go to the directory.
Peter: Is use of directory traceable? Is it documented for use?
Jane: Directory has use cases, but they aren’t standardized.
Keith: problem of people developing their own standards.
Raphael: need not just wide acceptance of standards, but also tools that can really improve use
Mary: Harvesting existing data on web. Automated harvester to search catalogue.

RDA Data Foundation and Terminology
Looking to propose data terminology interest group.
Main activity is to support other work group interests and foster communication.
Need to “gear up” properly to give support.
Peter: In Germany, meeting on RDA next week. He has flyer to show an explanation of RDA results.
Unfortunately, did not accomplish as much as was hoped.
Many meetings, across disciplines. Intense conversations on data.
People have begun considering the small steps needed for metadata outcomes using PIDs.
Start making training courses on working groups. This will take a lot of time, but we have the people necessary for this in our new round of funding.
Plan is to intensify further community discussion.
Peter: Need a process model that will benefit these new initiatives.

2:50 – 3:10 Break

3:10 – 3:30 Data Type Registries

Larry
Addressed problem of implicit assumptions in data.
In order to share data, it needs to be understood. If it’s not understood and agreed upon, it’s not worth sharing.
Goal: Explicate and share assumptions using types and type registries
What is a data type?
A unique and resolvable identifier.
Need to further automate data collection so we can collect from different sources and conduct data processing.
Raphael: What is scope of automated type registries? What different types of data will be collected?
Larry: In short, there is interest in many.
Gary: How does this relate to other metadata efforts?
Larry: By meeting standards. Being a good “metadata citizen”

3:30 – 3:50 Persistent Identifier Types WG

PID Information Types
Tobias
Report of current status:
RDA outcome process on the move
Ongoing TR / PIT discussions
Future PIT processes
Checkpoint next spring?
Idea is good, but cannot reach goals in scope of current working group.
Need to go back to the users (communities)
Develop practical, central types
Data fabric wish list
PIT relevance: Every object in the DF should bear PITs that enable automated management

3:50 – 4:10 Practical Policies

Rainier
Practical Policies
Policies that can be automated
Identification of 11 policy areas
Policy information should be carried by metadata
Integration in the data fabric
Templates create a crude structure of a vocabulary
Creation of a human + machine accessible vocabulary
How to build a sound vocabulary for practical policies?
Need time and money and people who will do the ‘bug’ (grunt) work

4:10 – 4:30 Wheat Data

Wheat Data
Context for creation of this interest group
Need interoperability framework for collecting this data
Achieving semantic interoperability:
Two paths towards semantic interoperability:
Make everyone speak the same language
Provide “translations” among the existing metadata
Possible interactions with other WGs:
Biosharing registries WG
Data type registries WG
Biodiversity Data Integration IG
Metadata Interest Group
Noted similarities of requirements

4:30 – 4:50 Data Seal of Approval

Repository Audit and Certification
DSA (Data Seal of Approval)-WDS (World Data System) Partnership WG
Mary
Goals: Develop common catalogue, and more!
General Findings: Two catalogues have similarities and differences
Mission / Scope:
Next steps:
Map to Nestor and ISO
Finalize the harmonized requirements
Begin to work on aligning procedures
Determine the relationship of DSA and WDS to each other…
Create testbed for certification
Investigate shared pool of reviewers

4:50 – 5:10 Brokering Governance

Global and Multidisciplinary Interoperability: building on existing infrastructures
Standardization is at base of interoperability
Brokering Benefits
Lowers barriers
Accelerates interconnection of disparate systems
Facilitates sustainability
…
Brokering Concerns
New paradigms pose a cultural challenge
Complexity is shifted to brokering framework
That’s a new tier to be organized and governed
Scalability of brokering framework
Goal: Address the governance of the brokering framework middleware and interconnect existing international e-infrastructures.
Expected outcomes:
Position paper
Test of a selected governance model
Recommendation document for the RDA
Hope that metadata will help to reduce the existing models
Try to push for a more common solution
Push complexity to the broker. This is a practical way of addressing the fact that there will never be a proper standardization of terminology

5:10 – 5:30 Discussion

Wrap up, dinner plans

Friday

09:00 – 09:30 Metadata WG reflections on Wednesday

All groups represented a need for working closer with Metadata Groups in general.
Advice on what standards to use
Assistance in applying metadata standards
Implications
Syntax
Semantics
Temporal information
Integrity
Represented in some form of first-order logic
Keith – Metadata Principles are up for discussion
Noted the need for more formalized version of Dublin Core
Keith’s plan: drive these harder projects first so as to draw out proper builds for the simpler ones
Keith – New groups utilized at a domain level can be difficult
Objective to get some traction with current WGs to make the problem more prevalent.
The community appears to be eager to do the work that needs to be done.
Talpady – Groups should stay vigilant with recording and sharing best practices and guidelines that can support the push forward
Gary – Also, cross-interest use cases can create branches between work groups. This can help highlight “sweet spots” of collaboration and innovation.
Peter: sort out how to move forward with a viable process in order to validate our statements of what is possible. Urges an answer / decision to fill this need
Larry – Suggests a metadata help desk for inter-group support; thought of as part of a service model
Jane / Kathy – Importance of making the distinction between focus of RDA US and Euro funding. Traditionally, US has been used for coordination and Europe for new, continued projects.
Kathy – There is still work to be done by working groups
Keith – Questions the need for continuing with the 18-month project intervals to help working groups commit to a time constraint.
Rainier – Asks where the interest group-type conversations are taking place?
Keith & Jane in agreement – There currently are problems with the forum and mailing list that have confounded participation efforts.
Jane – Leaders of these boards have definitely tried to keep the interest up
Keith – WGs should be able to interoperate and we need the software tools that can support this.
Peter – Also semantics, terminology must be clear for true interoperation.
Keith – Elaborates on need for tool that allows non-local metadata to be accessible to students for interoperation and creating opportunity for corresponding interests
Rainier – metadata core problem of RDA. Is there a need for implementation support?
Interest from Beth urges importance of this question… Jane, Mary, Keith respond
Keith – problem is that we have many participants, groups are fractured
Raphael – make the distinctions and maintain domain differences in metadata.
Harder to work with contextual metadata as it is a tremendous task. It can therefore be tedious to get scientists to participate or to back this kind of project. Then there is the unfortunate problem of false metadata production. Need to seriously consider the integrity of falsified data.
Keith – Idea is to find the origin of elements in research data and store it, or even cache it, so that it’s not so tedious and researchers are supported throughout form fills. Need to reference in right way for this to be successful.
Discuss possibility of organizing more meetings. Greater frequency of these conversations, and moving through important conflicts, will help understandings come to the surface.
Gary – Also supports different categories of metadata that are mutable and related for the virtue of discovery.
Beth – in agreement
Kathy – Is it feasible to do a metadata track?
Keith – would be obliged
Larry – seems clear metadata bunch has to drive continued efforts.
Discussion of the development of marketing material documents to appease stakeholder interest. They are in play.

09:30 – 10:00 Five-minute responses from each of the WGs

10:00 – 10:30 Open discussion, agenda bashing

Beth – urgent matter of organizing activities into areas will be handled in the afternoon.
Machine froze, lost content
Mark – process of iterative review
Kathy – bundle must stand alone
Tobias – comfort of new documentation license
Beth – is the documentation open sourced (i.e., on Git)?
Tobias – yes and seems stable.

10:30 – 10:45 Break

Tentative schedule for the rest of Friday

10:45 – 11:45 Data Fabric as WG integrator

Peter
White paper promised as step 1 in case statement; should be simple and declarative so that those outside of the discussion will understand what this is about.
Diagrams should be simple and well-explained.
Basic terminology ought to be agreed upon
Working through legal aspects of interoperability.
Use cases – demonstrate good solutions that come close to what is meant by “data fabric”.
Motivation…
One issue not discussed at P4 meeting – large-scale infrastructure projects that need directions to prevent “island” solutions again. Peter seeks agreement from group with this issue.
Mark in agreement – definitely should not be run by publishers.
Peter – Find the right moment to interact, no need to integrate.
Working groups to continue: There will be a terminology group, PID group, DTR, PP (policy) WG
Terminology…
Acknowledge John Henry’s suggested use of system engineering terminology
John’s diagram – an abstraction of terminology use, up for discussion on whether this agreed-upon terminology should make it into the Wiki.
RDA does not want to get into business of over-explaining terminology. Group is getting hammered in discussion to explain what words we’re using.
Need to find better terms than data fabric because it’s a loaded term. We need to dwell on explanations so much that it’s become a major hindrance.
Beth – It is a global scope
Peter – need to come up with a joint view so that it can be of use.
Mark – architecture is a loaded term that conflicts with people’s assumptions. Data fabric is our unique term and we don’t have an RDA architecture.
Peter – what term gets used in the white paper?
John – Not a new concept; framework captures a lot of this need, but lacks a clear goal.
Gary – Supports viewpoint of John. Believes framework to be the most useful; we’re not trying to build an architecture, though there is a process of structuring that resembles assembled components and is suggestive of architecture.
Peter – White Paper is not the place to hold a dictionary of our terms, as this can create more questions than answers
Gary – If we don’t go through the details somewhere, people will be confused or misled by the simpler diagrams.
Peter – not easy to get active working groups into Simple Diagram 2 (indicating machine character).
Beth – commentary on usefulness of diagram. Happy that it looks more comprehensive.
Stefano – Would like to say that brokering is part of processing
Between Keith and Peter: “problematic” comments on the registry discussion.
Gary – challenge of diagrams adding complexity to original visual. Lacking a clear description of changes taking place.
Next steps…
Mark – scope of data fabric has gotten bigger than originally conceived. How does TAB group collaborate?
Peter – TAB must make a statement. Should be monitoring what other groups are doing.
Beth – If indeed we agree, there will be an overlap of organization. Strongly encourages action on behalf of this group.
Gary – Logical connections with RDA going forward that are clarified by discussion
Talpady – What is focus of working group? (Mark clarifies)
Larry – How would we relate this to the proposal of test beds?
Peter – we have code out there (defining test beds). Need to assess whether or not it is working together.
Larry – for it to function, it can’t be a paper exercise.
Mark – in agreement.
Peter – clarifies no top-down rule; need to avoid getting so up in arms over terminology, architecture being an example.
John – value is that it helps you make decisions with certainty. We need to help groups see what they can trust and invest in. Do this by backing up with data and scientific evidence.
Kathy, Peter – offer to take this discussion outside of RDA meeting

11:45 – 12:45 RDA Working Group Processes

Proposal to RDA Technical Advisory Board
Beth
Timothy – happy that TAB community should be able to identify gaps.
Kathy – Is the grouping visual intended to show that working groups are working together?
Rebecca – Why is “long tail of research data” within Trust grouping? Asks that representation be consistent in terminology.
Larry – Visual suggests how grouping by domain focus is difficult.
Mark – problem with first diagram is that it isolates activities
Peter – The visuals can help bring groups together to understand the common ground.
Mark – As we’re communicating with our stakeholders, encourage appropriate collaboration between working groups and interest groups. Who are we clustering for?
Mary – for purposes of communication, it’s important bins come together as something people can relate to
Dan – raises question of visibility of this on our website.
Peter – Need for applications grouping
Gary – Attempt to rationalize between two diagrams.
Beth – Need to elucidate the dual role.
Tobias – Usefulness of diagram is that it communicates the different purposes of groups to the outside. Distinct point where we want others to note our cause and join in. Need for orthogonal track.
Rainier – Who came up with area director? Not the impression we want to give.
Beth – Have to do something to get rid of the prevailing reputation we’ve made for ourselves, that we’re very incoherent and made up of 50 disparate groups.
Peter – We’re not big data analytics; that’s not part of our effort.
…
Beth – idea to reject proposal C
Larry – need for communication is abundantly clear. We should have as much structure as needed and no more.
Beth closes and expresses gratitude for feedback
Larry – Shown how difficult it is to keep track of what RDA is doing. Maybe need for an internal awareness service. For example, put out short summaries once a month.
Timothy – General news feed directing users to recent work group activities.
Peter – Agrees; need someone who can keep note of the pace.
Mark – better staffed than we were, so a regular newsletter is possible. Pitch idea example of “how to use your group guidelines”
Kathy – in search of Drupal module to support this type of functionality
Jane – risk of confusion with another RDA registry (Resource Description and Access), an international effort that has been around for a long time.
Kathy – Any new work groups that need a process should refer to her, as she is trying to come up with these types of templates. If you form something new, be prepared to hear from her on this matter.
Kathy… Focus on Adoption Day: Look for finished adopted prototype projects. Vast email list to come out
Mark – His effort is moving away from a start-up approach toward doling out responsibilities in a much more structured manner.

12:45 – 1:45 Lunch

1:45 – 3:00 3rd WG Collaboration Meeting, Open Discussion, Final thoughts

Rainier
Next WG Chairs Working Meeting
Next meeting is in Karlsruhe, June 11-12, 2015, at Karlsruhe Institute of Technology
Hotel Eden is recommended and is less than two miles away from meeting location
Discussion of travel funds