Technical Approaches for
Protecting Privacy in the
PCORnet Distributed Research
Network V1.0
Guidance Document
Prepared by: PCORnet Data Privacy Task Force
Submitted to the PMO: March 31, 2015
Approved by the PMO: April 2, 2015
Submitted to PCORI: <<Date Submitted to PCORI>>
Approved by PCORI: <<Date Approved by PCORI>>
TABLE OF CONTENTS
EXECUTIVE SUMMARY
1.0 MINIMUM THRESHOLD
2.0 PERTURBATION OF QUERY RESULTS
3.0 OBFUSCATION OF IDENTIFIERS FOR RECORD LINKAGE
4.0 DE-IDENTIFICATION OF RECORD-LEVEL DATA
A. CAPRICORN APPROACHES
B. NEPHCURE PPRN’S APPROACHES TO DE-IDENTIFICATION
C. PEDSNET APPROACHES TO DE-IDENTIFICATION
TABLES AND FIGURES
REFERENCES
EXECUTIVE SUMMARY
PCORnet is a federated network, with PCORnet network partners retaining discretion and responsibility
with respect to the collection, access, use, and disclosure of patient information; network partners also
make determinations about when they will participate in any particular PCORnet query.
The Data Privacy Task Force is working collectively with the CDRNs and PPRNs to develop a set of privacy
policies to govern data sharing by PCORnet.
This guidance is intended to augment the PCORnet policies to provide examples of methods to reduce
the risk of re-identification with respect to the generation, collection, maintenance, or return of
Network Data. Terms used in this guidance are defined in the PCORnet policies.
This guidance is intended to be modified over time as the PCORnet Distributed Research Network gains
experience. The guidance covers the following privacy protective techniques:
(Threshold) Minimum count thresholds for Aggregate Data;
(Perturb) Perturbation of PCORnet Data;
(Obfuscate) Obfuscation of identifiers for record linkage; and
(De-identify) De-identification of record-level research participant information.
MINIMUM THRESHOLD
One way personal information can be exploited for re-identification is triangulation on small
groups of individuals. To mitigate such attacks, PCORnet Policy currently states that Network Data
Affiliates cannot release Network Data with cell counts of five or fewer unless authorized by the
research protocol and the IRB(s) approving the query (see PCORnet Policy 6.2.2).
PCORnet policies permit network partners to apply their local rules for masking cell counts, or for
rejecting queries where the return of results would not match their thresholds for releasing Aggregate
Data. Such local policies must be consistent with commitments made to patients/data subjects with
respect to use of their information.
Other examples of thresholds are shown in Table 1.
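As an illustration of the minimum threshold rule, the following minimal Python sketch masks small
cells in an aggregate result before release. The function name, the stratum labels, and the choice to
report masked cells as None are assumptions made for this example; they are not taken from PCORnet
policy or from any PCORnet tool.

```python
from typing import Dict, Optional

# Assumption for illustration: suppress any cell count of five or fewer.
MIN_CELL_THRESHOLD = 5


def suppress_small_cells(
    counts: Dict[str, int], threshold: int = MIN_CELL_THRESHOLD
) -> Dict[str, Optional[int]]:
    """Return a copy of `counts` with every cell at or below `threshold` masked.

    Masked cells are reported as None so that a downstream consumer cannot
    mistake a suppressed value for a true zero.
    """
    return {
        cell: (count if count > threshold else None)
        for cell, count in counts.items()
    }


# Hypothetical aggregate result keyed by stratum label.
example = {"age 0-17": 31, "age 18-44": 5, "age 45-64": 6, "age 65+": 0}
masked = suppress_small_cells(example)
```

A network partner with a stricter local rule could pass a larger `threshold` value to the same
function.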
PERTURBATION OF QUERY RESULTS
Another way personal information can be exploited for re-identification is by issuing overlapping
queries, removing their intersection, and thereby disclosing the remaining individuals. Consider an
example of how this might be achieved. First, an Authorized User issues a query for how many juvenile
diabetics were on drug A or drug B with an adverse outcome and the answer is X, which, for this case,
let us assume corresponds to 31. The User then issues a subsequent query asking how many juvenile
diabetics were on drug A with an adverse outcome, and the answer is 30. At this point, the User learns
that exactly 1 juvenile diabetic with the adverse outcome was on drug B but not drug A.
There are a number of ways in which this type of attack could be prevented. In practice, systems tend to
apply either 1) rounding (or coarsening) or 2) injection of a certain degree of noise to the query result.
As noted in PCORnet policies, the PCORnet query should specify the approach to be used to de-identify
data or reduce re-identification risks (see PCORnet Policy 5.2.1.1).
If a rounding (or coarsening) approach is used, the result X could be rounded to the nearest multiple
of 10. For instance, in the above scenario, the answers to both queries would be 30. However, it should
be noted that the utility of the query answers is tied directly to the rounding value: the larger the
value, the less precise the answers. An initial rounding value of 10 is recommended.
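The rounding approach can be sketched in a few lines of Python. The function name and the choice to
round halves upward are assumptions for this example; the guidance above specifies only the rounding
value.

```python
def round_count(count: int, base: int = 10) -> int:
    """Round a non-negative count to the nearest multiple of `base`.

    Values exactly halfway between two multiples are rounded up
    (an assumption; the guidance does not say how to break ties).
    """
    if base <= 0:
        raise ValueError("base must be a positive integer")
    if count < 0:
        raise ValueError("count must be non-negative")
    return ((count + base // 2) // base) * base


# The two query answers from the example above collapse to the same value,
# so subtracting one reported answer from the other reveals nothing.
first_answer = round_count(31)
second_answer = round_count(30)
difference = first_answer - second_answer
```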
An alternative to rounding is the injection of a certain amount of noise into the results. This is the
strategy that query-response tools such as i2b2 [Murphy 2009] (specifically in SHRINE [Lowe 2009])
apply. In this scheme, the result would be reported as 30 + ε, where ε is a random value selected from
a known distribution. This distribution could be uniform, Gaussian, Laplacian, or something else. It
should be noted that i2b2 applies a Gaussian distribution. If random noise is to be added, the approach
needs to specify the standard deviation of the distribution from which the value is selected.
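A minimal sketch of Gaussian noise injection follows. It is not the i2b2 or SHRINE implementation;
the function name, the standard deviation of 1.5, and the decision to clamp negative results to zero
are all assumptions chosen for illustration.

```python
import random


def perturb_count(true_count: int, sigma: float, rng: random.Random) -> int:
    """Return `true_count` plus zero-mean Gaussian noise, rounded to an integer.

    `sigma` is the standard deviation that the query approach would need to
    specify. Negative results are clamped to zero (an assumption) so that a
    reported count is never below zero.
    """
    noise = rng.gauss(0.0, sigma)
    return max(0, round(true_count + noise))


# Hypothetical usage: a seeded generator is used here only to make the
# example repeatable; a production system would not use a fixed seed.
rng = random.Random(2015)
reported = [perturb_count(30, sigma=1.5, rng=rng) for _ in range(1000)]
average_reported = sum(reported) / len(reported)
```

Because the noise has zero mean, the average of many independent answers to the same query drifts
back toward the true count. A system that adopts this approach may therefore also need to limit or
audit repeated queries.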
OBFUSCATION OF IDENTIFIERS FOR RECORD LINKAGE
To mitigate bias in investigations, it is important to determine when a patient’s data resides in
multiple resources. This process, called record linkage, is non-trivial because a patient’s record often
contains typographical and semantic errors. Sophisticated record linkage strategies have been proposed
to resolve these problems, but they rely on patient identifiers, such as personal name and Social
Security Number, that network partners may be unable to share in the clear. To overcome this barrier, a
growing list of techniques has been proposed to support private record linkage (PRL).
From a high level, the PRL process has a lifecycle that entails (but is not necessarily limited to) the
following steps [Toth 2014]:
1. Generation and storage of keys for cryptosystems, or salt values for hash functions, invoked in a
PRL protocol;
2. Communication of keys and salt to the entities encoding the records upon request;
3. Transformation of identifiers into their protected form as specified by the protocol;
4. Separation of salt hosting and de-duplication trusted entities for enhanced security;
5. Execution of the record linkage framework (e.g., feature weighting, blocking, and comparison of
record pairs to predict which correspond to the same individual); and
6. Transfer of records and parameters related to the linkage protocol (i.e., all communication
between parties).
Under no circumstances can the keys or salt values be disclosed to any entity beyond PCORnet network
partners.
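The transformation step (step 3) can be illustrated with a keyed hash. The sketch below is not the
protocol of any tool named in this guidance. The function names, the normalization rules, and the use
of HMAC-SHA256 with the shared salt as the key are assumptions made for the example.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Reduce superficial variation before hashing.

    Applies Unicode NFKD normalization, case folding, and whitespace
    collapsing. These rules are assumptions; a real protocol would fix them
    in its specification so that every site encodes identically.
    """
    text = unicodedata.normalize("NFKD", value)
    return " ".join(text.casefold().split())


def protect_identifier(value: str, salt: bytes) -> str:
    """Return a hex digest of the normalized identifier keyed by `salt`."""
    message = normalize(value).encode("utf-8")
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


# Hypothetical salt, shown inline only for illustration. In practice it would
# be issued by the trusted party and never stored alongside the data.
shared_salt = b"example-salt-issued-by-trusted-party"
token_a = protect_identifier("  Jane   DOE ", shared_salt)
token_b = protect_identifier("jane doe", shared_salt)
```

Two sites that hold the same salt produce the same token for the same normalized identifier. A single
typographical error changes the whole digest, which is one reason PRL tools generally use encodings
that tolerate small differences rather than exact hash matching.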
A number of network partners are exploring different approaches to private record linkage. Some
network partners report using NIH’s Global Unique Identifier (GUID) Tool
(https://fitbir.nih.gov/jsp/contribute/guid-overview.jsp). The CAPriCORN Clinical Data Research Network
has developed private record de-duplication software [insert link to JAMIA paper when it is available].
The Secure Open Master Patient Indexing System (SOEMPI), developed by researchers at Vanderbilt
University and the University of Texas at Dallas, is another approach. Private companies also offer
de-duplication software options. Although it is too early to require that all PCORnet participants
adopt a specific approach, evolving to the same approach would be beneficial, as it would allow for
centralized de-duplication to occur, versus having network participants individually engage in these
efforts.
To apply such an approach, PCORnet would need to agree on:
1. Who is the third party (trusted party A) who generates the keys/salt values of the functions?
2. Who is the third party (trusted party B) who gets to perform the linkage?
3. Who gets to see the linkage results? In other words, do the member sites get to know when
their constituents went to other sites?
4. What is the similarity threshold by which we could claim that two records correspond to the
same individual?
There are no standards and no standard software available at this time. SOEMPI is one option, but it
will require either PCORnet or some other organization to adopt the source code and support its
operations. An alternative would be to build on the software developed by the Chicago CDRN; the paper
describing this system is under review at JAMIA and is provided separately. There are benefits and
drawbacks to both systems in their design and linkage algorithms.
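To make the similarity threshold question concrete, the sketch below scores two values with the Dice
coefficient over character bigrams and declares a match when the score reaches a threshold. This is an
illustrative assumption, not the algorithm used by SOEMPI or by the Chicago CDRN software. It also
operates on plain strings for readability, whereas a PRL protocol would compare protected encodings.
The threshold of 0.8 is a placeholder.

```python
from typing import Set


def bigrams(text: str) -> Set[str]:
    """Return the set of character bigrams of `text`, padded at both ends."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient of the two bigram sets, between 0.0 and 1.0."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Declare a match when the similarity reaches the agreed threshold."""
    return dice_similarity(a, b) >= threshold


# "smith" and "smyth" share four of their six bigrams, giving a score of 2/3.
score = dice_similarity("smith", "smyth")
```

Lowering the threshold links more true pairs that differ by small errors, at the cost of more false
links. That trade-off is why the value would need to be agreed across the network.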
DE-IDENTIFICATION OF RECORD-LEVEL DATA
A predominant model for research using the PCORnet Distributed Research Network is one where the
individual, record-level or patient-level data remains under the control of the network partner (or
Network Data Affiliate); the research query is run on the Network Data, and only Aggregate Data is
returned in response. This privacy-preserving architecture reduces the need to adopt de-identification
strategies for data shared in response to a query. [Mini Sentinel 2012]
However, PCORnet policies recognize that at times, responses to queries may require the sharing of
record- or patient-level de-identified data. In addition, network partners (particularly those consisting of
disparate organizations) may choose as a matter of local policy to create de-identified datasets for
research purposes. There are a number of ways by which de-identification can be achieved. The latest
guidance from the HHS Office for Civil Rights on HIPAA de-identification is available at:
http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
In circumstances where the query requires the return of de-identified data, PCORnet policies require the
query to specify the “definition and approach or procedures required to de-identify data.” In addition,
some network partners may be required to abide by NIH’s recently released Genomic Data Sharing
Policy (http://gds.nih.gov/03policy2.html), which includes specifications on the de-identification
approach to be used.
For initial queries requiring the return of de-identified data, the PCORnet Coordinating Center (CC),
with input from network partners participating in the queries, may need to set the approach to be used;
however, PCORnet should develop a robust set of policies and best practices that over time may reduce
or eliminate the need for CC control.
These approaches focus on reducing risk of re-identification using demographic identifiers; future
iterations of the guidance may need to deal with risk of re-identification from exposure of clinical data.
PCORnet network partners are invited to share their approaches to de-identification of record-level
data, in order to share resources and begin to develop a library of best practices. The following
record-level de-identification approaches have been shared and are also available on the PCORnet
Central Desktop:
A. CAPRICORN APPROACHES
CAPriCORN proposes initially to validate and use limited data sets with randomly seeded, time-shifted
temporal references and geographical references restricted to the first three digits of zip codes. Expert
statistical determination will be sought for the method of time-stamping events to confirm that it also
meets the Safe Harbor de-identification criteria of the HIPAA Privacy Rule. Until such determination has
been achieved, the data sets will be considered limited, rather than de-identified, datasets. In the event
that this proves infeasible, CAPriCORN will adhere to Safe Harbor until the situation has evolved and use
of date shifting is accepted.
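The two transformations named above, time shifting and three-digit zip code restriction, can be
sketched as follows. This is not CAPriCORN’s implementation. The function names, the shift window of
182 days, and the derivation of each patient’s offset from a secret via HMAC are assumptions. A single
constant offset per patient is assumed so that the intervals between that patient’s events are
preserved.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(patient_id: str, secret: bytes, max_shift_days: int = 182) -> int:
    """Derive a fixed offset in [-max_shift_days, +max_shift_days] for a patient.

    The offset depends only on the patient identifier and the secret, so every
    date for the same patient moves by the same number of days.
    """
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * max_shift_days + 1
    return int.from_bytes(digest[:8], "big") % span - max_shift_days


def shift_date(event_date: date, patient_id: str, secret: bytes) -> date:
    """Shift one event date by the patient's fixed offset."""
    return event_date + timedelta(days=patient_offset_days(patient_id, secret))


def zip3(zip_code: str) -> str:
    """Keep only the first three digits of a zip code.

    Note: this sketch does not handle sparsely populated three-digit areas,
    which Safe Harbor treats separately.
    """
    return zip_code.strip()[:3]


# Hypothetical secret and patient identifier.
secret = b"site-held-secret"
admit = shift_date(date(2014, 1, 10), "patient-001", secret)
discharge = shift_date(date(2014, 1, 20), "patient-001", secret)
```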
A separate important piece of information useful for epidemiologic investigations is geographic location.
We may need to incorporate these data through IRB approval of limited data sets rather than addresses
that can be geocoded. ZIP-code-level data will need to be considered when applying our minimum
threshold and query-result perturbation rules.
B. NEPHCURE PPRN’S APPROACHES TO DE-IDENTIFICATION
1. Encrypted hash (SHA1) on a sequential ID number assigned as the surveys come in.
2. Randomizing birth dates within six months, with a new random birth date generated for each
query.
3. The Common Data Model has been constructed as views in a separate schema, so no queries
can get to the underlying data.
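Items 1 and 2 can be sketched in Python as shown below. This is not NephCure’s code. Reading
“encrypted hash (SHA1)” as a keyed hash (HMAC-SHA1) is an assumption, as are the function names and
the 183-day window used to approximate six months.

```python
import hashlib
import hmac
import random
from datetime import date, timedelta


def hashed_id(sequential_id: int, key: bytes) -> str:
    """Return a keyed SHA-1 digest of a sequential ID as 40 hex characters.

    A key is used because an unkeyed hash of a small sequential number could
    be reversed by simply hashing 1, 2, 3, and so on.
    """
    message = str(sequential_id).encode("ascii")
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def randomized_birth_date(
    birth_date: date, rng: random.Random, window_days: int = 183
) -> date:
    """Return the birth date moved by a fresh random offset within the window.

    Calling this once per query yields a new random birth date each time.
    """
    return birth_date + timedelta(days=rng.randint(-window_days, window_days))


# Hypothetical key and record; SystemRandom draws from the OS entropy source.
key = b"registry-held-key"
rng = random.SystemRandom()
true_birth_date = date(2001, 6, 15)
token = hashed_id(42, key)
query_one = randomized_birth_date(true_birth_date, rng)
query_two = randomized_birth_date(true_birth_date, rng)
```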
C. PEDSNET APPROACHES TO DE-IDENTIFICATION
1. Institution replaces PHI with a site encrypted identifier, and maintains link between the two.
2. DCC replaces “site encrypted identifier” with a PEDSnet encrypted identifier (PEI) to ensure
uniqueness across sites.
3. All datasets stored or sent out of the DCC use the PEI.
What this means in the study context is that the investigator gets a set of PEIs in response to a
case-finding query. If they want to re-identify patients, they tell the DCC, which translates each PEI
back to a site and site encrypted identifier, and sends that back to the site of origin. That site is
then able to link to PHI and re-contact the patient or provide additional data (e.g., chart review).
We’re planning to cycle a test of this process in December, if the DUAs get sorted by then.
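The two-level identifier flow described above can be sketched as follows. This is not PEDSnet’s
implementation. The class and method names are hypothetical, and random tokens held in lookup tables
stand in for whatever encryption scheme the sites and the DCC actually use.

```python
import secrets
from typing import Dict, Tuple


class Site:
    """Holds the only link between local PHI keys and site identifiers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._phi_to_site_id: Dict[str, str] = {}
        self._site_id_to_phi: Dict[str, str] = {}

    def encode(self, phi_key: str) -> str:
        """Return the site identifier for a PHI key, creating it if needed."""
        if phi_key not in self._phi_to_site_id:
            site_id = secrets.token_hex(8)
            self._phi_to_site_id[phi_key] = site_id
            self._site_id_to_phi[site_id] = phi_key
        return self._phi_to_site_id[phi_key]

    def resolve(self, site_id: str) -> str:
        """Link a site identifier back to PHI; only the site can do this."""
        return self._site_id_to_phi[site_id]


class DCC:
    """Maps (site, site identifier) pairs to network-wide PEIs."""

    def __init__(self) -> None:
        self._origin_to_pei: Dict[Tuple[str, str], str] = {}
        self._pei_to_origin: Dict[str, Tuple[str, str]] = {}

    def assign_pei(self, site_name: str, site_id: str) -> str:
        origin = (site_name, site_id)
        if origin not in self._origin_to_pei:
            pei = secrets.token_hex(8)
            self._origin_to_pei[origin] = pei
            self._pei_to_origin[pei] = origin
        return self._origin_to_pei[origin]

    def origin_of(self, pei: str) -> Tuple[str, str]:
        """Translate a PEI back to its site and site identifier."""
        return self._pei_to_origin[pei]


# Hypothetical round trip: the investigator sees only the PEI.
site = Site("site_a")
dcc = DCC()
site_id = site.encode("MRN-001")
pei = dcc.assign_pei(site.name, site_id)
origin_site, origin_id = dcc.origin_of(pei)
recovered = site.resolve(origin_id)
```

In this arrangement the DCC never holds PHI and the investigator never holds a site identifier, so
re-identification requires the cooperation of both the DCC and the originating site.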
TABLES AND FIGURES
Table 1. Examples of thresholds applied in the minimum threshold rule
AGENCY | MINIMUM THRESHOLD
Washington State Department of Health [WA 2012] | 10
Centers for Disease Control Healthy People 2010 [Klein 2002] | 5 - 10
Arkansas HIV/AIDS Data Release Policy [AR 2010] | 5
Colorado State Department of Public Health and Environment [CO 2010] | 5
National Center for Health Statistics [NCHS 2004] | 5
UK Department of Enterprise, Trade, and Investment [DETI 2010] | 5
Utah State Department of Health [UT 2005] | 5
Iowa Department of Public Health [IA 2005] | 4
NASA [SEDAC 2005] | 3
REFERENCES
[AR 2010] Arkansas HIV/AIDS Surveillance Section. Arkansas HIV/AIDS Data Release Policy. Available
online:
http://www.healthy.arkansas.gov/programsServices/healthStatistics/Documents/STDSurveillance/Datadeissemination.pdf.
First published: May 2010. Last Accessed: April 29, 2014.
[CO 2010] Colorado State Department of Public Health and Environment. Guidelines for working with
small numbers. Available online: http://www.cdphe.state.co.us/cohid/smnumguidelines.html. Last
Accessed: April 29, 2014.
[DETI 2010] U.K. Department of Enterprise, Trade, and Investment. DETI Data Confidentiality Statement.
Available online: http://www.detini.gov.uk/deti-stats-index/stats-national-statistics/datasecurity.htm. Last Accessed: April 29, 2014.
[Klein 2002] R. Klein, S. Proctor, M. Boudreault, K. Turczyn. Healthy people 2010 criteria for data
suppression. Centers for Disease Control Statistical Notes Number 24. 2002.
[Mini Sentinel 2012] J. Rassen, et al. Mini Sentinel Methods: Evaluating Strategies for Data Sharing and
Analyses in Distributed Data Settings. November 2012. Available online:
http://www.minisentinel.org/work_products/Statistical_Methods/Mini-Sentinel_Methods_Evaluating-Strategies-forData-Sharing-and-Analyses.pdf.
[Murphy 2009] S. Murphy, et al. Strategies for maintaining patient privacy in i2b2. Journal of the
American Medical Informatics Association. 2011; 18: 103-108.
[SEDAC 2005] Socioeconomic Data and Applications Center. Confidentiality issues and policies related to the
utilization and dissemination of geospatial data for public health application; a report to the public
health applications of earth science program, national aeronautics and space administration, science
mission directorate, applied sciences program. 2005. Available online:
http://www.ciesin.org/pdf/SEDAC_ConfidentialityReport.pdf. Last Accessed: April 29, 2014.
[Toth 2014] C. Toth, et al. SOEMPI: A Secure Open Master Patient Index Software Toolkit for private
record linkage. Proceedings of the 2014 American Medical Informatics Association Annual
Symposium. 2014: in press.
[UT 2005] Utah State Department of Health. Data release policy for Utah’s IBIS-PH web-based query
system, Utah Department of Health. Available online:
http://health.utah.gov/opha/IBIShelp/DataReleasePolicy.pdf. First published: 2005. Last Accessed:
April 29, 2014.
[WA 2012] Washington State Department of Health. Guidelines for working with small numbers.
Available online: http://www.doh.wa.gov/Portals/1/Documents/5500/SmallNumbers.pdf. First
published 2001, last updated October 15 2012. Last Accessed: April 29, 2014.