Services for Sensitive Research Data
Gard Thomassen, PhD
Head of Research Support Services Group
Leader of the “Services for Sensitive Data” project
University Center for Information Technology (USIT)
University of Oslo

Outline
• What is sensitive data?
• Who has sensitive data?
• Project background
• Collaborators and reference group
• System requirements
• System outline
• Technical and security details
• Maintenance
• Advantages and current status
• International collaborations

Gard Thomassen, TSD 2.0

Who has sensitive data?
• Faculty of Medicine / Oslo University Hospital
• Faculty of Theology
• Faculty of Educational Sciences
• Faculty of Social Sciences
• And so the list continues… also outside UiO

Project background
• UiO has an open network structure, but still with a high level of security
• Most of the UiO data is open
• Various UiO/OUS researchers approached USIT asking for an eInfrastructure for sensitive data (the majority was MR images and NGS data)
• The pilot project TSD 1.0 was run

Lessons learned
• The need for our services far exceeded the scalability of our system
• Too much hands-on maintenance and manual setup of new projects and new users
• There is a need for a High Performance Computing (HPC) resource within a secure environment
• Not very user friendly (at both ends)

Main collaborators on TSD 2.0
Collaborators
• Norwegian Storage Infrastructure (NorStore)
• Norwegian Genetics Analysis Platform (GenAP)
• Norwegian Dietary Registry (Faculty of Medicine)
• Institute of Psychology (Faculty of Social Sciences)
• Norwegian Cancer Sequencing Consortium (NCGC)
Reference group
Oslo University Hospital, NorStore, Regional Ethical Committee, National Institute of Public Health, Norwegian Cancer Registry, Research Network at OUS, Elixir Norway, NCGC, GenAP and Institute of Psychology, UiO.

System requirements
• Security, isolation and access control as given by law
• Large storage capacity
• Multiple users
• High performance computing resource
• High bandwidth
• Easy to maintain
• Easy to use (including audio and video)
• Some freedom within user space
• Accessible from anywhere through authentication
• A variety of software and public DBs must be available
• Windows and Linux support (OS X if possible)
• Data collection service
• Data sharing service
• National scope (so far…)

Solution outline

System outline
(Diagram; labels only: Internet; Gateway; VM-server; HPC - Colossus; Storage; secure encrypted network to special high-volume data production sites; cardinality labels: n, 1, 1 (project), 1 (storage area))

Using TSD 2.0 for analysis
(Diagram; labels only: User B1 P1; User B2 P1; GW; VM B1 P1; VM B2 P1; P1 DB; TSD disk; Colossus front end; Colossus P1; Colossus disk)

Data import and export using TSD 2.0 “Sluice-server”
(Diagram; labels only: virtual “sluice server” with “Sluice HD”; virtual project server with project HD; NFS mount; numbered steps 1 to 4; note: data copied here by ssh+scp or web-drive (2-factor authentication), encrypted if sensitive)

Data collection using TSD 2.0
(Diagram; labels only: minID; “Nettskjema”; encrypted XML (PGP); import mechanism; project VM; project disk)

Data-import for NGS-centers and other large scale data producers
(Diagram; labels only: HiSEQ; TSD-controlled box on-site; /tmp/ storage; encrypted connection; GW; project VM; project disk)

Technical outline
(Diagram; labels only, listed by group)
Closed network at USIT
Admin services
- Provisioning system
- AD
- Surveillance
- Software repo
- Cfengine
- Vcenter
- Backup
- Antivirus
- Log service
Storage / DBs
- PostgreSQL
- Archiving
- Compartmentalized disk
Clinical health data projects; other sensitive data projects; HPC-resource
Management
- Mgmt of storage
- Mgmt of network
- Mgmt of hardware
- Mgmt of VMs
Access network
- National Health network
- Terminal servers
- Thin client servers
VPN Clients (2-factor login)
- Remote desktop clients
- Thin-clients on dedicated network
- Special network for large-scale data production centers
Publicly available network segment through “minID”
- Web questionnaire
- Web portal
- Electronic consent

Technical details
• KVM for virtualization (RedHat Linux)
• Cerebrum as provisioning system (a USIT application)
• AD system administration guided by the provisioning system (duplicated)
• FreeBSD firewall and gateway (duplicated)
• Integration with IDporten (Norwegian governmental eID system) for web enquiries and applications
• Storage with separation between projects (Hitachi disc system and encrypted backup to tape)
• IPv6 on the inside (… and private IPv4)

HPC resource – Colossus
• At present about 500 cores
• No project users are to log in on any nodes
• One global job daemon to control data integrity (to ensure project data separation)
• /tmp/ and /work/ will be per project and cleaned after each job finishes
• As similar to Abel as possible
• Separate disk and more nodes will come soon

Security details
• OATH TOTP 2-factor authentication
  – Smart phones or programmable hardware tokens
• Special roles for those allowed to export data
• Import/export is under strict control
• No open connection to the internet
• Strong separation between projects (VLAN)
• Special security measures with remote desktops
• Extremely hardened FreeBSD gateway and firewall
• Encrypted backup, one key per project
• Sys admins are single users (traceability)
• Sys admins have to use the same authentication process
• Most hardware is physically separated from other UiO hardware

Maintenance
• Reuse as much as possible from the USIT eInfrastructure
• Virtualize as much as possible
• Management/surveillance data can be pushed, but not pulled (Nagios, Collectd)
• Surveillance based on existing systems
• Sys admins have different access levels

Opportunities enabled by TSD 2.0
• NGS research on humans is possible
• Large-scale imaging studies are possible
• “HUNT-like” studies online for the respondents and the scientists
• Off-site analysis of sensitive data
• Secure storage for verification of published research
• Electronic consent
• Possible work-area for making exams?
• TSD to host all human NGS research data from UiO/OUS?

Nordic collaboration opportunities
• Laws are fairly similar (Norway very strict)
• Difficult to exchange data for research
• We should learn from each other, as these systems demand very specialized IT knowledge
• System development and system administration are non-sensitive and may be shared
• Building TSD raises many novel security questions in a university setting, which others can learn from
• Large DBs of health data may enable very interesting research in the future (NeGI)
• NeIC has shown interest in TSD 2.0
• TSD collaborates with CSC in Finland and with BILS / Elixir Sweden. BBMRI is interested

Current status
• Pilot project data is being transferred now
• System is being finalized for setting up new projects and going into production
• Storage is up
• Secure Nettskjema is up
• Working on risk evaluation
• Project registration when risk evaluation is finished
• HPC-resource 4th quarter 2013
• Video and sound will be the main target during further work
• System Whitepaper (v1.0) written

People involved
Project group / developers
• Dag-Erling Smørgrav
• Petter Reinholtsen
• Elisabeth Ytterdal
• Tor Fuglerud
• DBA (PostgreSQL team)
• Cerebrum team
• Morten Werner Forsbring
• Espen Grøndahl
• HPC – Colossus team
• Gard Thomassen
Administration / associated
• IT-dir Lars Oftedal
• Hans A. Eide
• Märtha Felton

Cost per project
• First year establishment price (per project)
• Regular yearly project fee
• License cost (licensed software usage)
• Storage cost for storage exceeding basic allocation
• Cost of DB administration (if DB needed)
• Cost of CPU hours on Colossus

Project administration in TSD 2.0 - technical
• Application through the National ID-portal + Nettskjema
• The project is created in Cerebrum with role-categories
• The project is connected to resources (VM + disc + VLAN + DB + HPC)
• Users are created and given their roles
• Username, password and one-time-passwords are distributed
• Usage of storage, HPC CPU time and additional VMs is accounted for, to enable control and book-keeping
• NorStore may offer “free” storage within TSD (there might be a small security mgmt overhead cost)
• In the future there will be some level of self service through a web portal within TSD

Conclusion
• It is very hard to make something secure and user-friendly at the same time
  – Researchers want the freedom of using the internet while doing research on sensitive data…
• A thorough risk assessment must be made during and after the planning and implementation phase to make the best choices
• What you cannot avoid should at least be detected by some surveillance mechanism
• More (inter)national / local cooperation wanted

Pilot project (TSD 1.0)
• Secure storage for large amounts of NGS data and MR images (>100 TB)
• Secure Windows “research server” enabling usage of MS Office, STATA, SPSS etc. on sensitive data
• Research server is based on an isolated system using VMware ESX
• Two-factor login system
• Encrypted backup

“The Ultimate Goal is…
…to be able to provide the same services that are available for researchers working with non-sensitive data, with the necessary security, with minimum impact on the user experience, and minimum extra overhead and cost.”
Hans Eide, 2012 (my boss)
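
Appendix sketch: OATH TOTP
The Security details slide lists OATH TOTP 2-factor authentication with smart phones or programmable hardware tokens. The slides do not describe how TSD implements it, so the following is only a minimal sketch of the standard algorithm (RFC 4226 HOTP and RFC 6238 TOTP), assuming HMAC-SHA1, a 30-second time step and a verification window of one step on either side. The function names `hotp`, `totp` and `verify` and the demo secret are illustrative assumptions, not part of TSD.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HOTP: HMAC-SHA1 over the 8-byte big-endian counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # low nibble of the last byte selects a 4-byte window
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HOTP with the counter derived from Unix time."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify(secret: bytes, candidate: str, at: Optional[float] = None,
           step: int = 30, window: int = 1) -> bool:
    """Accept 6-digit codes from the current step and `window` steps either side."""
    now = time.time() if at is None else at
    counter = int(now // step)
    first = max(0, counter - window)  # counters cannot be negative
    return any(
        hmac.compare_digest(hotp(secret, c), candidate)
        for c in range(first, counter + window + 1)
    )


if __name__ == "__main__":
    demo_secret = b"12345678901234567890"  # RFC 6238 test key, not a real secret
    print(totp(demo_secret, at=59, digits=8))  # RFC 6238 test vector: 94287082
```

In this sketch the server and the token share only the secret and a roughly synchronized clock; the small verification window is a common way to tolerate clock drift, at the cost of accepting a code for slightly longer.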