Privacy, Confidentiality and Data Security (PCDS) in HSR: Best Practices
Alan M. Zaslavsky
Department of Health Care Policy
Harvard Medical School

Privacy, Confidentiality and Data Security (PCDS)
• Importance and sensitivity of PCDS
• Basic concepts of disclosure risk
  – Deidentification and reidentification
  – Disclosure control
• Institutional and regulatory frameworks
  – Common Rule, HIPAA, Data use agreements
• File organization, data flow and computer security

• This presentation is offered in our department at least annually
  – Required attendance by all programmers, students, fellows, and project managers with data responsibilities
  – Presented to faculty at meetings
  – Shortened version for lower-level staff
  – Tracking of attendance by personnel manager
  – Sanction is loss of computer account
• Seek to fully involve project management in PCDS issues

Definitions
• Privacy: the right of an individual to keep information about herself or himself from others.
• Confidentiality: safeguarding, by a recipient, of information about another individual
• Disclosure: release (direct or indirect) of information about an identifiable individual

Definitions (continued)
• Data security: protections on data to prevent unauthorized access or destruction
• Informed consent: a person's agreement to allow personal data to be provided for research and statistical purposes
• Research: study producing generalizable knowledge
  – excludes internal operations, quality assurance

Importance of PCDS
Nexus for balance between
• benefits of information to society
• possible harms of information use to individuals
in conducting the research enterprise.
One person’s “invasion of privacy” is another’s “essential use of information.”

Inherent conflicts
• Law enforcement / legal process
• General access to research data
  – Freedom of Information Act (FOIA)
• Commercial use / beneficial products & services?
• Prevention of harm
• Need to save data for verification, revision

Costs of violations of PCDS
• Damage to subjects
  – Material
  – Psychological/social
• Damage to the research enterprise
• Exposure to legal/administrative sanctions for researchers, data providers, and their institutions

Direct and indirect identifiers
Key: variable or combination of variables whose value makes a record unique in the target and population data
Direct identifier: Information that is uniquely associated with a person.
Indirect identifier: Data that, in combination, are uniquely associated with a person; information that facilitates such associations.

Direct Identifiers (keys)
• Name
• Telephone number
• Street / e-mail address
• Unique features (SSN, Medicare ID, Health plan, Medical record #, Certificate/License, voice and finger prints, photos)

Re-identification by Matching
[Figure: De-identification removes Name from the original target file (Name, abcdefghijkl), leaving the anonymized target file (abcdefghijkl). Re-identification uses a key to match the anonymized target file (abcdefghijkl) to a population file (abcdefmnop, Name).]

Data in Combination
Variables might be identifying in combination that are not identifying by themselves
• Month, day and year of birth
• Gender
• Zip code

Example of reidentification using three variables
Variables list          % Unique in Maine state voter registration
Birthdate alone         12
Birthdate + gender      29
Birthdate + Zip (5)     69
Birthdate + Zip (9)     97
Sweeney, 1997

Population (External) Data Bases
• Voter Registration Lists
• Research files
• State & Federal Files
  – Survey files with added administrative data
• Information Vendor Files
• The unknown: what might an “intruder” know about some or all members of your population?
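
One way to gauge the risk described under "Data in Combination" is to count how many records are unique on each combination of indirect identifiers, in the spirit of the voter-registration example. The following is a minimal Python sketch, not a validated disclosure-risk tool; the function name `percent_unique`, the field names (`birthdate`, `gender`, `zip5`) and the toy records are all invented for illustration.

```python
from collections import Counter


def percent_unique(records, fields):
    """Percent of records whose values on `fields` occur exactly once."""
    if not records:
        return 0.0
    # Build one key per record from the chosen combination of variables.
    keys = [tuple(rec[f] for f in fields) for rec in records]
    counts = Counter(keys)
    n_unique = sum(1 for k in keys if counts[k] == 1)
    return 100.0 * n_unique / len(records)


# Toy records (hypothetical values, not real data).
records = [
    {"birthdate": "1950-01-01", "gender": "F", "zip5": "02138"},
    {"birthdate": "1950-01-01", "gender": "M", "zip5": "02138"},
    {"birthdate": "1950-01-01", "gender": "F", "zip5": "02139"},
    {"birthdate": "1962-07-15", "gender": "F", "zip5": "02138"},
]

for fields in (["birthdate"],
               ["birthdate", "gender"],
               ["birthdate", "gender", "zip5"]):
    print(fields, percent_unique(records, fields))
```

In this toy file the share of unique records rises from 25% to 50% to 100% as variables are added. A real assessment would also have to consider which external population files an intruder could match against, which this sketch does not attempt.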

Identifiable population groups (entire data set highly identifiable)
• Rare diseases
• Sample drawn from a particular area

Unique/unusual cases: rare values
• 110-year-old woman
• Man who weighs 350 pounds
• Income > $100 million
• Verbatim text containing identifying details

Unique/unusual cases: rare combinations of values
• 16-year-old widow
• 20-year-old Ph.D.
• Asian race in rural Midwest
• Female/Asian Executive
• 60-year-old male married to 30-year-old female
• Cause of death = prostate cancer for 30-year-old male

Micro Data Protection 1
• Remove direct identifiers
• Restrict geographical detail
• Code to remove detail
  – larger categories, top/bottom coding
• Remove, code or edit verbatim comments
• Case suppression
• Variable suppression

Micro Data Protection 2
• Special handling (e.g. coding) of data from external sources (esp. area data)
• Statistical modification (“noise”)
• Sample/subsample
• Eliminate link between persons and establishments

Tabular data
• Information on individuals deduced from unique cases in tables
• Reidentification usually related to small groups, small cell counts
• Rounding, cell suppression, complementary suppression might be required

Disclosure of individual information from a table
                 Income ($’000)
Cancer type    <10   10-25   25-50   >50
Colon           60      25      19    22
Lung            80      36      12    14
Kidney           0       0       2     0
Breast          24      36      17    35

Technical issues
• Highly technical issues in both microdata and tabular nondisclosure
  – Intersection of stats, math, computer science
• Software for detecting disclosure risk
  – RTI, μ-ARGUS, etc.
• Nontechnical variables
  – Resources and intentions of “intruder”

Disclosure control in released data
• Affects us as producers and consumers of data
• Masking
  – Affects analyses if performed on data we receive
  – Complex to implement on our releases
• Limited access data centers

Restricted access data centers
• Alternative to fully deidentified public-use microdata files
• Data are held at restricted center
  – Limited set of researchers submit analyses through intermediaries
  – Output reviewed for nondisclosure
• Only feasible for organizations with substantial, persistent resources
  – e.g. NCHS, Census

Institutional and regulatory frameworks for PCDS
• Common Rule / IRB
• HIPAA
• Data Use Agreements
• State regulations

Common Rule
• Governs protection of research subjects in all Federally-funded research
  – IRB evaluates adherence by researcher
  – Institutional sanctions for violations
  – Many institutions extend to all research
• Objective: protection of subject from harm
  – In HSR, often there is no intervention
  – Typically, commitment to minimal risk of disclosure

Common Rule (continued)
• Informed consent
  – generally required in primary data collection
  – appropriate information about use of data
  – might be waived where impractical to obtain (e.g.
    intrusive), if risks minimal & rights not injured
• Exemption from (full) review
  – No intervention that could harm subject
  – Secondary data with no identifiable data
  – Requires determination by IRB (but less tedious)

Implications for researchers
• Commitments are made
  – To subjects: consent language
  – To IRB: safeguards promised in IRB application
  – To funding agencies: in grant application
• May involve
  – Protection of data while used
  – Limits on duration of use

HIPAA
Health Insurance Portability and Accountability Act
• Specific rules for electronic transmission of health data
  – Primarily for efficiency but includes Privacy Rule
• Obligations imposed on health care providers
  – Includes direct providers, health plans and insurers
  – Research data distinguished from health plan / provider operational functions
• Researchers must respect these obligations

Who is Covered by HIPAA?
• A health care provider who transmits health information in electronic transactions
  Example: a physician or hospital that electronically bills for services
• A health plan
• A health care clearinghouse

HIPAA implications for research
• Practical implications of HIPAA
  – What data providers will be looking for
  – Need to work around restrictions on content
  – More elaborate paths for data control
• HIPAA provisions for releasing data for research
  – fully deidentified
  – limited use dataset
  – waiver

Option 1: De-identified Health Information
• Completely de-identified information (18 elements removed) and no knowledge that remaining information can identify the individual.
OR
• Statistically “de-identified” information where a qualified statistician determines that there is a “very small risk” that the information could be used to identify the individual and documents the methods and analysis.

Removal of These Identifiers Makes Information De-identified
  – Names
  – Geographic info (including city and ZIP)
  – Elements of dates (except year)
  – Telephone #s
  – Fax #s
  – E-mail address
  – Social Security #
  – Medical record, prescription #s
  – Health plan beneficiary #s
  – Account #s
  – Certificate/license #s
  – VIN and Serial #s, license plate #s
  – Device identifiers, serial #s
  – Web URLs
  – IP address #s
  – Biometric identifiers (finger prints)
  – Full face, comparable photo images
  – Unique identifying #s
If the covered entity has actual knowledge that remaining information can be used to identify the individual, the information is considered individually identifiable, and therefore, generally is PHI.

Option 2: Limited Data Set with Data Use Agreement
• The Privacy Rule permits limited types of identifiers to be released for research with health information (referred to as a Limited Data Set).
• Limited Data Sets can only be used and released in accordance with a Data Use Agreement between the covered entity and the recipient.
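
As a rough illustration of the identifier-removal route (Option 1), the sketch below drops fields treated as identifiers and reduces dates to the year. It is only a sketch under stated assumptions: every field name is hypothetical, the drop list is an abbreviated stand-in rather than the full set of 18 elements, and running code like this does not by itself make a file de-identified, since that also depends on having no actual knowledge that the remaining information can identify an individual.

```python
from datetime import date

# Hypothetical field names, assumed for illustration only. This is an
# abbreviated stand-in for the identifier list, not the full 18 elements.
DROP_FIELDS = {
    "name", "street", "city", "zip", "phone", "fax", "email",
    "ssn", "medical_record_no", "health_plan_id", "account_no",
}
# Hypothetical date fields; only the year is retained.
DATE_FIELDS = {"birth_date", "admission_date"}


def deidentify(record):
    """Return a copy with listed identifiers dropped and dates cut to year."""
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # remove the identifier entirely
        if field in DATE_FIELDS:
            # Rename e.g. birth_date -> birth_year and keep the year only.
            year = value.year if isinstance(value, date) else None
            out[field.replace("_date", "_year")] = year
        else:
            out[field] = value
    return out


# Toy record (invented values, not real data).
record = {
    "name": "Jane Doe",
    "zip": "02138",
    "ssn": "000-00-0000",
    "birth_date": date(1950, 1, 1),
    "admission_date": date(2004, 3, 17),
    "diagnosis": "asthma",
}

print(deidentify(record))
```

An explicit drop list is easy to audit, but it silently passes any identifier-bearing field it does not name (free-text comments, for instance), so an allow list of approved analytic variables may be the safer design in practice.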

Limited Data Set w/ Data Use Agreement
• The Limited Data Set CAN contain
  – Elements of Dates
  – City and ZIP
  – Other unique identifiers, characteristics and codes not previously listed as direct identifiers (see the identifier list above)
• CANNOT contain other direct identifiers (among the 18)

Option 3: Waiver of Authorization
May use or disclose personal information for research if IRB or Privacy Board determines that:
  – research involves no more than minimal risk
  – research does not adversely affect the “rights and welfare” of subjects
  – the research could not be done without a waiver

Data Use Agreements (DUA)
• Between data provider and data user
• Restrictions:
  – access by specific personnel
  – use for a specific reason
  – defined duration of retention
• Implements commitments made by data provider

State regulations
• Variable from state to state
• Some are relatively restrictive
  – require negotiation with data provider

Iron-clad protection?
• Certificate of Confidentiality
  – Issued by DHHS
  – Protects data against legal process
  – Typically for sensitive topics, e.g. illicit drugs
• O, Canada!

Data security in complex projects
• Multisite projects: special needs
• Careful mapping of data flow and access
• Minimal identifying information at each stage
• Particular care in technical aspects of security

Example of a data flow plan (with security provisions)

File management for PCDS
• General practices of good management
  – Practices necessary to maintain project continuity
• Well-structured directory organization and naming
• Include documentation with files
• Separate project data from personal directories
• Separate datasets from programs
• Separate raw data from analytic datasets

• We typically follow this presentation with a 15-minute tutorial on good practices for data and file management

Backups
• Conflict of privacy/confidentiality (restrict) and data security (maintain)
• Basic backup schedule (undeletable)
  – All Unix files: 4-month retention
  – PC files: 2-month retention
• Project-specific backup: by request
  – Only possible if material is properly organized
  – Permanent media, physical security

• The backup policy described here was adopted after several months of faculty discussion
  – Computer system managers wanted longer retention
  – Faculty concerned about unexpected discovery of material intended to be deleted
  – Conflicts of DUA requirements with rules regarding retention of data for verification, revision of manuscripts, etc.

General computer security
• Proper use of computer accounts, only by authorized individuals
• Secure connections for outside access
  – Remote users
  – Home or “on road” access via Internet
  – Applications can be “tunneled” securely
• Good practices with passwords
• Maintain file permissions to restrict access to authorized users

• We follow this up with training on the mechanics of computer security
  – Permissions, file organization, etc.
• More or less fine-grained tools for protection of various files
• IT staff included in training
  – Responsible for implementing security and data retention policies for various project datasets
• Teach methods for both Unix and Windows sides of our system

Conclusions
• Know your data
• Be prepared to accommodate restrictions required by data providers
• Maintain general security
• Seek guidance for tough situations!
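
As one illustration of the file-permission practice listed under general computer security, the sketch below restricts a project data directory on a Unix-style system so that the owner has full access, the project group can read, and all other users have none. The function name `restrict_tree`, the directory layout, and the choice of modes (750 for directories, 640 for files) are assumptions made for the example, not the department's actual policy; it relies only on `os.chmod`, `os.walk` and the `stat` constants from the Python standard library, and Windows access controls would need a different mechanism.

```python
import os
import stat
import tempfile

# Assumed modes for illustration: owner full access, group read-only,
# no access for anyone else.
DIR_MODE = stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP    # 0o750
FILE_MODE = stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP   # 0o640


def restrict_tree(root):
    """Apply DIR_MODE to every directory and FILE_MODE to every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        os.chmod(dirpath, DIR_MODE)
        for name in filenames:
            os.chmod(os.path.join(dirpath, name), FILE_MODE)


# Demonstration on a throwaway directory (hypothetical layout).
with tempfile.TemporaryDirectory() as root:
    raw_dir = os.path.join(root, "raw")
    os.mkdir(raw_dir)
    data_file = os.path.join(raw_dir, "claims.csv")
    with open(data_file, "w") as f:
        f.write("id,value\n")

    restrict_tree(root)
    print(oct(stat.S_IMODE(os.stat(raw_dir).st_mode)))
    print(oct(stat.S_IMODE(os.stat(data_file).st_mode)))
```

Group-based access only works as intended if group membership is kept in step with the personnel named in the relevant data use agreement.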