Services for Sensitive Research Data
Gard Thomassen, PhD
Head of Research Support Services Group
Leader of the “Services for Sensitive Data” project
University Center for Information Technology (USIT)
University of Oslo
Outline
• What is sensitive data?
• Who has sensitive data?
• Project background
• Collaborators and reference group
• System requirements
• System outline
• Technical and security details
• Maintenance
• Advantages and current status
• International collaborations
Gard Thomassen,TSD 2.0
Who has sensitive data?
• Faculty of Medicine / Oslo University Hospital
• Faculty of Theology
• Faculty of Educational Sciences
• Faculty of Social Sciences
• And so the list continues… also outside UiO
Project background
• UiO has an open network structure, but still
with a high level of security
• Most of the UiO data is open
• Various UiO/OUS researchers approached
USIT asking for an eInfrastructure for
sensitive data (majority was MR-images and
NGS data)
• The pilot project TSD 1.0 was run
Lessons learned
• The need for our services far exceeded the
scalability of our system
• Too much hands-on maintenance and manual
setup of new projects and new users
• There is a need for a High Performance
Computing (HPC) resource within a secure
environment
• Not very user friendly (both ends)
Main collaborators on TSD 2.0
Collaborators
• Norwegian Storage Infrastructure (NorStore)
• Norwegian Genetics Analysis Platform (GenAp)
• Norwegian Dietary Registry (Faculty of Medicine)
• Institute of Psychology (Faculty of Social Sciences)
• Norwegian Cancer Genomics Consortium (NCGC)
Reference group
Oslo University Hospital, NorStore, Regional Ethical Committee, National
Institute of Public Health, Norwegian Cancer Registry, Research Network
at OUS, Elixir Norway, NCGC, GenAP and Institute of Psychology, UiO.
System requirements
• Security, isolation and access control as given by law
• Large storage capacity
• Multiple users
• High performance computing resource
• High bandwidth
• Easy to maintain
• Easy to use (including audio and video)
• Some freedom within user space
• Accessible from anywhere through authentication
• A variety of software and public DBs must be available
• Windows and Linux support (OS X if possible)
• Data collection service
• Data sharing service
• National scope (so far)
Solution outline
System outline
[Diagram: Internet, Gateway, VM-server, HPC (Colossus) and Storage, plus a secure encrypted network to special high-volume data production sites. Cardinality labels in the figure: n, 1, 1 (project), 1 (storage area).]
Using TSD 2.0 for analysis
[Diagram labels: TSD 2.0; User B1 P1 and User B2 P1; GW; VM B1 P1 and VM B2 P1; P1 DB; TSD disk; Colossus front end; Colossus (P1); Colossus disk.]
Data import and export using TSD 2.0
[Diagram with numbered steps 1 to 4. Data is copied to the “sluice server” by ssh+scp or web-drive (2-factor authentication), encrypted if sensitive. Other labels: TSD 2.0; virtual “sluice server” with the “Sluice HD”; NFS mount; virtual project server with the Project HD.]
Data collection using TSD 2.0
[Diagram labels, in order: minID; “Nettskjema”; encrypted XML (PGP); import mechanism; project VM and project disk inside TSD 2.0.]
Data-import for NGS-centers and other
large scale data producers
[Diagram labels: HiSEQ; /tmp/ storage; TSD-controlled box on-site; encrypted connection; GW; project VM and project disk inside TSD 2.0.]
Technical outline
[Diagram; labels grouped by box]
Closed network at USIT
• Admin services: provisioning system, AD, surveillance, software repo, Cfengine, vCenter, backup, antivirus, log service
• Storage / DBs: PostgreSQL, archiving, compartmentalized disk
• Clinical health data projects
• Other sensitive data projects
• HPC resource
• Management: storage, network, hardware, VMs
Access network
• National Health network
• Terminal servers
• Thin client servers
• VPN
Clients (2-factor login)
• Remote desktop clients
• Thin clients on dedicated network
• Special network for large-scale data production centers
Publicly available network segment through “minID”
• Web questionnaire
• Web portal
• Electronic consent
Technical details
• KVM for virtualization (Red Hat Linux)
• Cerebrum as provisioning (a USIT application)
• AD system administration guided by the provisioning
system (duplicated)
• FreeBSD firewall and gateway (duplicated)
• Integration with IDporten (Norwegian governmental
eID system) for www-enquiries and applications
• Storage with separation between projects (Hitachi
disc system and encrypted backup to tape)
• IPv6 on the inside (… and private IPv4)
HPC resource – Colossus
• At present about 500 cores
• No project users are to log in on any nodes
• One global job daemon to control data
integrity (to ensure project data separation)
• /tmp/ and /work/ will be per project and
cleaned after each job finishes
• As similar to Abel as possible
• Separate disk and more nodes will come
soon
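The per-project scratch areas that are cleaned when a job finishes could look roughly like the sketch below. It is not how Colossus does it; the slides give no implementation details. The function name `job_scratch`, the directory layout and the permission mode are all assumptions made for illustration.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def job_scratch(root: Path, project: str, job_id: str):
    """Yield a private scratch directory for one job and remove it afterwards.

    Hypothetical layout: <root>/<project>/<job_id>-<random suffix>
    """
    project_dir = root / project
    project_dir.mkdir(parents=True, exist_ok=True)
    project_dir.chmod(0o700)  # assumption: only the project's owner may enter
    scratch = Path(tempfile.mkdtemp(prefix=f"{job_id}-", dir=project_dir))
    try:
        yield scratch
    finally:
        # Runs even if the job fails, so no project data is left behind.
        shutil.rmtree(scratch, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with job_scratch(Path(root), "p1", "job42") as scratch:
            (scratch / "partial.dat").write_text("intermediate result")
            print("working in", scratch)
        print("still exists after job:", scratch.exists())
```

Doing the cleanup in a `finally` clause means a crashed job cannot leave one project's intermediate files where the next job might find them.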
Security details
• OATH TOTP 2-factor authentication
– Smart phones or programmable hardware tokens
• Special roles for those allowed to export data
• Import/export is under strict control
• No open connection to the internet
• Strong separation between projects (VLAN)
• Special security measures with remote desktops
• Extremely hardened FreeBSD gateway and firewall
• Encrypted backup, one key per project
• Sys admins are single users (traceability)
• Sys admins have to use same authentication process
• Most hardware is physically separated from other UiO hardware
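For readers unfamiliar with OATH TOTP, the sketch below shows how a time-based one-time code is computed according to RFC 6238 (built on HOTP, RFC 4226), using only the Python standard library. It is not TSD's implementation. The `verify` helper and its `window` parameter (how many time steps of clock drift to tolerate) are assumptions added for illustration.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over the 8-byte big-endian counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with the counter derived from Unix time."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now) // step, digits)


def verify(secret: bytes, candidate: str, now: Optional[float] = None,
           step: int = 30, window: int = 1) -> bool:
    """Accept codes from `window` steps before or after the current one."""
    if now is None:
        now = time.time()
    counter = int(now) // step
    return any(
        hmac.compare_digest(hotp(secret, counter + drift), candidate)
        for drift in range(-window, window + 1)
        if counter + drift >= 0
    )


if __name__ == "__main__":
    # Shared secret from the RFC 6238 test vectors (ASCII "1234...890").
    print(totp(b"12345678901234567890"))
```

The smartphone app or hardware token and the server share the secret and a clock; nothing else has to be transmitted at login except the short code.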
Maintenance
• Reuse as much as possible from the USIT
eInfrastructure
• Virtualize as much as possible
• Management/ surveillance data can be
pushed, but not pulled (Nagios, Collectd)
• Surveillance based on existing systems
• Sys admins have different access levels
Opportunities enabled by TSD 2.0
• NGS research on humans is possible
• Large scale imaging studies possible
• “HUNT-like” studies online for the respondents and the
scientists
• Off-site analysis of sensitive data
• Secure storage for verification of published research
• Electronic consent
• Possible work-area for making exams?
• TSD to host all human NGS research data from
UiO/OUS?
Nordic collaboration opportunities
• Laws are fairly similar (Norway very strict)
• Difficult to exchange data for research
• We should learn from each other, as these systems
demand very specialized IT knowledge
• System development and system-administration is
non-sensitive and may be shared
• Building TSD addresses many novel security
questions in a University setting, to be learnt from
• Large DBs of health data may enable very
interesting research in the future (NeGI)
• NeIC has shown interest in TSD 2.0
• TSD collaborates with CSC in Finland and with BILS /
Elixir Sweden. BBMRI is interested
Current status
• Pilot project data is being transferred now
• System is being prepared and finalized for setting up
new projects and going into production
• Storage is up
• Secure Nettskjema is up
• Working on risk evaluation
• Project registration when risk evaluation is finished
• HPC-resource 4th quarter 2013
• Video and sound will be the main target during
further work
• System Whitepaper (v1.0) written
People involved
Project group / developers
• Dag-Erling Smørgrav
• Petter Reinholtsen
• Elisabeth Ytterdal
• Tor Fuglerud
• DBA (PostgreSQL team)
• Cerebrum team
• Morten Werner Forsbring
• Espen Grøndahl
• HPC – Colossus team
• Gard Thomassen
Administration / associated
• IT-dir Lars Oftedal
• Hans A. Eide
• Märtha Felton
Cost per project
• First year establishment price (per project)
• Regular yearly project fee
• License cost (licensed software usage)
• Storage cost for storage exceeding basic
allocation
• Cost of DB administration (if DB needed)
• Cost of CPU hours Colossus
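As a rough illustration of how these cost components might add up for one project year, here is a small sketch. The slides give no prices and do not say how the components combine, so every rate is a placeholder and the structure is an assumption. In particular, the sketch assumes the establishment price is charged in the first year in place of the regular yearly fee.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical price list; none of these figures come from TSD."""
    establishment: float     # first-year establishment price
    yearly_fee: float        # regular yearly project fee
    basic_storage_tb: float  # storage included in the basic allocation
    storage_per_tb: float    # price per TB above the basic allocation
    cpu_hour: float          # price per Colossus CPU hour
    db_admin: float          # DB administration, charged only if a DB is needed


def project_year_cost(rates: Rates, *, first_year: bool, storage_tb: float,
                      cpu_hours: float, license_cost: float = 0.0,
                      needs_db: bool = False) -> float:
    """Sum the cost components for one project year."""
    # Assumption: the establishment price replaces the yearly fee in year one.
    total = rates.establishment if first_year else rates.yearly_fee
    total += license_cost
    # Only storage beyond the basic allocation is billed.
    total += max(0.0, storage_tb - rates.basic_storage_tb) * rates.storage_per_tb
    total += cpu_hours * rates.cpu_hour
    if needs_db:
        total += rates.db_admin
    return total


if __name__ == "__main__":
    rates = Rates(establishment=100.0, yearly_fee=50.0, basic_storage_tb=1.0,
                  storage_per_tb=10.0, cpu_hour=0.5, db_admin=20.0)
    print(project_year_cost(rates, first_year=True, storage_tb=3.0,
                            cpu_hours=100.0, license_cost=5.0, needs_db=True))
```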
Project administration in TSD 2.0 - technical
• Application through the National ID-portal + Nettskjema
• The project is created in Cerebrum with role-categories
• The project is connected to resources (VM + disc + VLAN + DB
+ HPC)
• Users are created and given their roles
• Username, pwd and one-time-passwords are distributed
• Accounts kept on storage, HPC CPU time and additional VMs
to enable control and book-keeping
• NorStore may offer “free” storage within TSD (there might be a
small security mgmt overhead cost)
• In the future there will be some level of self-service through
a web portal within TSD
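The steps above (create the project with role categories, connect it to resources, create users and give them roles) can be pictured with a small data model. This is only a sketch and not Cerebrum's data model or API. The role names, field names and the `may_export` check are hypothetical, loosely following the slide on security details, which mentions special roles for those allowed to export data.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set


class Role(Enum):
    """Hypothetical role categories."""
    ADMIN = "admin"
    MEMBER = "member"
    EXPORTER = "exporter"  # may move data out of the project


@dataclass
class Resources:
    """Resources a project is connected to: VM + disc + VLAN + DB + HPC."""
    vm: str
    disk: str
    vlan: int
    database: Optional[str] = None  # only if the project needs a DB
    hpc: bool = False


@dataclass
class Project:
    name: str
    resources: Resources
    members: Dict[str, Set[Role]] = field(default_factory=dict)

    def add_user(self, username: str, *roles: Role) -> None:
        """Create the user in the project (if new) and grant the roles."""
        self.members.setdefault(username, set()).update(roles)

    def may_export(self, username: str) -> bool:
        """Only users holding the export role may take data out."""
        return Role.EXPORTER in self.members.get(username, set())


if __name__ == "__main__":
    p1 = Project("p1", Resources(vm="p1-vm", disk="/tsd/p1", vlan=101, hpc=True))
    p1.add_user("p1-alice", Role.MEMBER, Role.EXPORTER)
    p1.add_user("p1-bob", Role.MEMBER)
    print(p1.may_export("p1-alice"), p1.may_export("p1-bob"))
```

Keeping the resource bindings on the project record is also what makes the book-keeping of storage, CPU time and additional VMs straightforward.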
Conclusion
• It is very hard to make something secure and user-friendly at the same time
– Researchers want the freedom of using the internet while
doing research on sensitive data…
• A thorough risk assessment must be made during
and after the planning and implementation phase to
make the best choices
• What you cannot avoid should at least be detected
by some surveillance mechanism.
• More (inter)national / local cooperation wanted
Pilot project (TSD 1.0)
• Secure storage for large amounts of NGS
data and MR-images (>100TB)
• Secure Windows “research server” enabling
usage of MS Office, STATA, SPSS etc. on
sensitive data
• Research server is based on an isolated
system using VMware ESX
• Two-factor login-system
• Encrypted backup
“The Ultimate Goal is…
…to be able to provide the same services that
are available for researchers working with non-sensitive data, with the necessary security, with
minimum impact on the user experience, and
minimum extra overhead and cost.”
Hans Eide, 2012 (my boss)