Research Computing and Cyberinfrastructure Governance Working Groups (May 13, 2015)

1. Data Center
   a. Provide input into the specs for the PSU Data Center; develop policies related to colocation of servers and other equipment to be used for research at the Data Center.
   b. Consider development of a set of access rights/policies for users, and responsibilities for data center personnel.
   c. Help draft SLAs (service level agreements). Detail a set of services that will be provided and the charges for each. Purchase of equipment? Rental of storage space? Rental or purchase of CPU time? Access to web servers for making data publicly available? Colocation of equipment to be relocated to the data center by departments, individuals, or research groups?
   d. Who will be allowed access? Can individual faculty buy in? Research groups? Departments? Institutes? Will PIs be allowed to supply their own equipment? Will PIs (or others) have to buy data center equipment? If someone buys in, how can local tech support reps, or PIs, have access (either physically or remotely) to their equipment? What maintenance will be supplied by the data center? What guarantees of backups?
   e. Given the importance of data and archiving, will this be a prime location for the storage and backup of big data? Will there be archiving facilities? Will there be separate segments of the data center where restricted data may be stored? Data that is export controlled? Data with PII? Will access be via fast pipe (e.g. the research network)?

2. Software
   a. Work with the research guru or appropriate representative to catalogue research software on campus, identify where we could combine licenses efficiently, identify what new software needs might exist, and determine how to publicize (or distribute) research software efficiently.
   b. Once data is collected from units on current usage of different packages, work on the cataloging of licenses, and make recommendations for what should be licensed centrally, vs. locally (e.g. at a department level), vs. purchased by just a few researchers as individual licenses. Discuss at what point the cost/efficiency/coordination tradeoff makes it worthwhile to centralize licensing.
   c. Prepare a document showing current (distributed) costs, what the central cost would be, and the savings (a minimal sketch of such a comparison follows this section). Identify what elements of license agreements need to be tracked. Obtain relevant details of license agreements affecting research software, e.g. how broadly a site license can be scaled up, whether monitoring of licenses is necessary, and what restrictions there are on use (e.g. on one workstation, on a work machine and a home machine or laptop, only within the U.S. [export controls], and so on).
   d. Perhaps consider policies on software distribution and local installation of software and extensions/updates. Some faculty have administrative rights and can install software; some cannot. This is a frustration for faculty concerned with productivity, especially in time-crunch situations.
   e. Key participants will include the group developing a new ITS software cataloguing initiative (Mairéad Martin, ITS). Outreach will need to be to/through all administrative and/or IT units in Colleges and Institutes, and perhaps to central purchasing if this seems like a valuable way to identify past software purchases.
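To make the cost-comparison deliverable in item 2.c concrete, the following is a minimal sketch in Python of a distributed-vs.-central license cost comparison. All unit names, package names, seat counts, prices, and the central quote are hypothetical placeholders, not actual Penn State figures; the working group would substitute the data actually collected from units.

```python
# Hypothetical sketch: compare what units pay separately for one package
# against a single centrally negotiated site license. Illustrative data only.
from dataclasses import dataclass


@dataclass
class UnitLicense:
    unit: str             # purchasing unit (college, department, or individual PI)
    package: str          # software package name
    seats: int            # number of seats this unit currently pays for
    cost_per_seat: float  # annual cost per seat paid by this unit


def distributed_cost(purchases: list[UnitLicense], package: str) -> float:
    """Total annual spend on one package when every unit licenses it separately."""
    return sum(p.seats * p.cost_per_seat for p in purchases if p.package == package)


def estimated_savings(purchases: list[UnitLicense], package: str, central_quote: float) -> float:
    """Savings if the package moved to a single centrally negotiated license."""
    return distributed_cost(purchases, package) - central_quote


if __name__ == "__main__":
    # Illustrative figures only -- not real Penn State licensing data.
    purchases = [
        UnitLicense("College A", "StatsPackage", seats=40, cost_per_seat=500.0),
        UnitLicense("College B", "StatsPackage", seats=25, cost_per_seat=650.0),
        UnitLicense("Institute C", "StatsPackage", seats=10, cost_per_seat=800.0),
    ]
    central_quote = 30000.0  # hypothetical quote for a central site license
    current = distributed_cost(purchases, "StatsPackage")
    print(f"Current distributed cost:   ${current:,.0f}")
    print(f"Central site-license quote: ${central_quote:,.0f}")
    print(f"Estimated annual savings:   ${estimated_savings(purchases, 'StatsPackage', central_quote):,.0f}")
```

The same structure extends to tracking the license-agreement elements named in 2.c (scalability of site licenses, monitoring requirements, use restrictions) as additional fields on the record for each purchase.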
3. High-performance computing
   a. Examine HPC in comparison to other universities; look at the costs of HPC. Is HPC serving the faculty well, and what gaps or opportunities are there? Assess high-performance clusters around the University; are there advantages to seeking consolidation? Assess whether PSU should engage in a major effort in this area (e.g. should we try to get a supercomputer? If so, what would it take? Would such an effort have to be Legislature-funded, the way it is in some states? Could it be grant-funded? Would a PA university consortium be feasible?). How can HPC at UP interact with researchers at Hershey (e.g. folks working on genomics)? Should we seek to position PSU at the forefront of the Big 10 in computing power? Do we need to?
   b. What are the differentiated HPC requirements of scientific workflows that are compute-intensive, data-intensive, or both compute- and data-intensive? How can these be reflected in the features of the hardware and software environments needed to serve these classes of workloads?
   c. Key participants will be faculty working with HPC, both through the ICS-ACI and in individual clusters, and perhaps using facilities outside PSU; and ICS-ACI personnel.

4. IT/HR Job Classification and Compensation
   a. Consider issues of IT job classification, compensation, and other HR issues. How do we get and keep the best IT people at Penn State? How do units avoid training, and then losing, IT personnel we want to retain? Examine losses of IT staff to competitors like Dell, Apple, and Google. Examine compensation across units, and competition across units. If PSU prohibits internal competitive bidding, why does it seem to occur, and what are the implications (is there a systemic drain from some units to others, and if so, is this a problem, and what do we do)? Are there systematically problematic compensation patterns across units or job descriptions, for example where a sys admin I is routinely paid more in unit X than for the same job in unit Y? Given competition in this area with the outside world, how can appropriate flexibility be built into hiring CI/IT staff, or is this unnecessary?
   b. Key participants in this discussion will be representatives from HR, along with IT group managers responsible for recruitment and retention.
   c. The Provost and VPR have indicated willingness to entertain proposals for improved “career tracks” for IT colleagues, and for getting training, trading parts of jobs to gain experience, and facilitating graduate study towards an MA or Ph.D. How do we institutionalize this?

5. Research Network and Data Classification Policies
   a. What are the parameters and plans for access to the new Research Network? What segments of campus will be connected? What are the costs for “last mile” connections (to an individual office, to a building, within a building), and who should pay for them? How will researchers access the new fast network, and how can this be made as transparent as possible to facilitate research? How does the research network relate to the data center and access to it, and to ICS-ACI? How easy will it be for faculty to gain access? For what uses is the new network appropriate, and how can such use be ensured while facilitating research?
   b. Data classification has direct impacts on what can be transported on the research network; it has other implications too, such as access rights, cloud storage, treatment of data by Identity Finder, machine security, and other areas. Do current proposed classification policies have enough nuance to cover the varying kinds of research data we have? Not all research data is sensitive, not all is restricted, and some restrictions are more important than others. In some cases research data needs to be kept secure; in other cases security is not critical (e.g. if it is all generated from public sources). In all cases, barriers to appropriate data use should be reduced. What is a researcher-driven (along with liability-driven) set of data classifications, and what types of restrictions and policies (if any) should be placed on use of/access to those types of data? (An illustrative sketch of one possible classification table follows this section.)
   c. The ITS networking group will be a key connection in this discussion.
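As a purely illustrative companion to item 5.b, the sketch below shows one way a researcher-driven classification table could be expressed so that each tier maps directly to handling rules (network transport, cloud storage, encryption, scanning). The tier names and every rule shown are assumptions made for the example only; they are not existing or proposed Penn State policy, and the actual tiers and rules would come out of the working group's discussion.

```python
# Hypothetical sketch of a data classification table: each tier carries the
# handling rules that follow from it. Tiers and rules are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlingRules:
    research_network_ok: bool  # may be moved over the research network
    commercial_cloud_ok: bool  # may be stored with a commercial cloud provider
    encryption_required: bool  # must be encrypted at rest
    scan_required: bool        # subject to automated scanning (e.g. Identity Finder)


# One possible, purely illustrative set of tiers, from least to most restricted.
CLASSIFICATION = {
    "public":            HandlingRules(True,  True,  False, False),
    "internal":          HandlingRules(True,  True,  False, True),
    "restricted":        HandlingRules(True,  False, True,  True),
    "export_controlled": HandlingRules(False, False, True,  True),
}


def rules_for(tier: str) -> HandlingRules:
    """Look up the handling rules for a dataset's classification tier."""
    return CLASSIFICATION[tier]


if __name__ == "__main__":
    for tier, rules in CLASSIFICATION.items():
        print(f"{tier:17s} network={rules.research_network_ok!s:5s} "
              f"cloud={rules.commercial_cloud_ok!s:5s} "
              f"encrypt={rules.encryption_required!s:5s} "
              f"scan={rules.scan_required}")
```

The point of the sketch is only that a small number of well-named tiers, each with explicit handling rules, can cover the range from public to export-controlled data while keeping barriers low for the least restricted tiers.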
6. Data, Data Governance, Data Preservation, Data Dissemination, Data Security, Managing the Scientific Data Life-Cycle
   a. We've gotten a lot of suggestions and issues related to data. Does this need to be split into multiple working groups? One suggested split was to separate 1) Data Classification, 2) Data Security, and 3) Data Archiving/Data Life-Cycle.
   b. The working group will deal with developing and disseminating policies and technologies for data preservation and dissemination. Consider services during active research and, later, for longer-term dissemination, archival, curation, etc. Storage is one key component; speedy access is another; some conversation needs to be around software and tools for access to and manipulation of scientific data, including the roles of the academic units, research, the Libraries, and ITS.
   c. There are many different specific aspects of data storage that tie in: space, cost, archiving, long-term/short-term, publicizing/archiving for public use, internal/external (cloud), local/centralized… Should these all be in one committee or working group? Or (for instance) are discussions of “big data” so distinct from discussions of public replication data sets that these should be handled by different working groups?
   d. Data security/data compliance (for discussion: should we pull out a separate working group on “Security, Data Protection and Compliance”?).
      i. What data storage solutions are appropriate for different types of data, e.g. data that are or are not de-identified, restricted, classified, or public? As case studies, the Network on Child Protection and the Clearinghouse on Military Families (both in SSRI) are having major difficulties with risk management around being able to get basic work accomplished; part of the problem seems to be about protected data storage.
      ii. How do we ensure the security of electronic medical records while not cutting off appropriate access? Penn State's Clinical and Translational Sciences Institute has invested heavily in making such data accessible, but there are some unique and complex security issues.
      iii. How do we ensure compliance by individual faculty with important and appropriate efforts to protect data?
      iv. Is Identity Finder a good security solution? Is it effective? Does it crash a lot of machines? Does it waste a lot of time? It certainly generates a lot of complaints. What is the gain vs. cost of this type of intrusive solution? Are there others? What are appropriate exceptions policies to automatic running of Identity Finder (as case studies, we hear of Identity Finder running in the background, draining laptop batteries, and leading to system shutdown in the middle of a presentation)? Is it in fact suggested or actually mandated/required?
   e. Location/centralization of services (or not): Should we let a thousand storage solutions bloom? What can PSU provide, and what needs to go elsewhere? What size data can the library handle? What size data can the data center handle? Should Penn State have a long-term archiving solution, or a public dissemination solution (e.g. PSU websites)? What happens if a faculty member leaves or retires? Should data be archived in perpetuity, or what restrictions should be placed on this?
   f. How should the Library's ScholarSphere and Data Commons efforts be integrated into efforts to manage data? Library resources are in use only by segments of campus. What should these be used for? How can they be integrated into long-term or big-data storage needs?
   g. Cloud vs. (very) local vs. centralized storage solutions. Coordinate (or develop) policies about cloud computing / cloud storage. There does not seem to be a coordinated answer on whether faculty (via grants or local funds) can purchase access to (e.g.) Amazon cloud services. What issues (security, cost, network transfer, privacy, export controls) exist with use of such services? What services provide what level of protection (e.g. Amazon has a military-grade service; if this is accessible, then security should not be an issue)? Are cloud services a good long-term storage solution? A solution for big data? What are the limits and best uses of the cloud? How can we get purchasing and risk management to recognize the appropriate use of, and approve spending on, cloud services when this is well justified?
   h. Big data and access to big data. The issue of big data is different from simply an issue of faster computers.
   i. Some key participants include the Library's Research Data Working Group, the Library's Digital Preservation Strategies Team, risk management, and the ITS security office.