Data extracts depersonalization for PoC

Terms definition

a) Depersonalized data – Depersonalization (d15n) of data is the de-association of personal identifiers from data to assure the anonymity of the individuals studied. It should enable the creation of realistic data in non-production environments without the risk of exposing sensitive information to unauthorized users. Such data must remain comparable after d15n, i.e. the same input must produce the same output, and data relationships must be maintained. At the same time, transformation of depersonalized data back to the original data must be prevented.

b) Hash function – A procedure or mathematical function that converts an arbitrarily large amount of data into a short string representation, usually of fixed length. The same input always generates the same output representation. The values returned by a hash function are called hash values, hash codes, hash sums or hashes. An important feature of a hash function is that it should not be possible to mathematically deduce the original value from the hash value. The hash algorithm preferred by Deloitte is SHA-2 (function SHA-256).

c) Brute force attack – Preparation of pairs of known, potential or frequent original values and their corresponding hash values, and comparison of those against the available hash values. Protection against a brute force attack is the use of a “salted hash value”.

d) Salted hash value – Brute force attacks can be prevented by adding a text string of arbitrary length and structure (known only to the d15n operator) to the data before submitting it to the hash function. This should prevent reversal of a hash value to its original value by comparison against known pairs of original value and hash. Frequent common terms will still produce frequent salted hash codes (comparability of data has to be preserved); however, it will not be possible to claim with certainty what the original value was.
Frequency analysis will and must remain possible, by the nature of further data processing, while the original data contents and the privacy of the underlying subjects stay protected.

e) Garbled/random data – Data that contains randomly replaced or generated values and whose information value equals that of random data. Such data can be and is used e.g. for software testing, where some data is needed to test software functionality but only the presence of data, not its content, is relevant.

f) Unification vocabularies – Words that have the same meaning, for example Robert - Bob, Henry - Harry, John - Johann etc., can be unified to one standard form in order to increase data quality and make data more comparable and connectable.

Proposed Client depersonalization (d15n) of different data types for PoC:

The content below describes Deloitte best practice in d15n, unless the Client requires otherwise. For technical details refer to section “Technical details of d15n implementation”.

Data fields to be hashed completely:
a) Account number*
b) Date of Birth (if the Client permits, extracting Month and Year)
c) National personal ID (Czech RČ) (if the Client permits, extracting Month and Year)
d) Business/Tax or similar register ID
e) Phone (landline or mobile) (if the Client permits, extracting the prefix)
*Note: Bank Sort Code, if available, should not be part of depersonalization.

Excluded from d15n (remains visible):
f) Gender

To be hashed partially:
g) Address
Recommendation:
   Street Name hashed
   House No hashed
   Columns should be hashed separately
   City and ZIP remain in plain form
Reasoning: With only Street Name and House No hashed, other fields such as City and ZIP remain usable; they are important e.g. for subject proximity or concentration analysis. If the address is hashed completely, location-based analysis techniques will not be possible.
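The address recommendation in g) could be sketched as follows. This is a minimal illustration, not the actual d15n implementation: the salt value, the field names and the choice to prepend the salt to the value are all assumptions made for the example.

```python
import hashlib

# Hypothetical salt; in practice it would be known only to the d15n operator.
SALT = "example-salt-known-only-to-operator"


def salted_sha256(value: str, salt: str = SALT) -> str:
    """Return the SHA-256 hex digest of salt + value; empty input stays empty."""
    if not value:
        return value
    # Assumption: the salt is simply prepended to the value before hashing.
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def depersonalize_address(record: dict) -> dict:
    """Hash street name and house number separately; keep city and ZIP plain."""
    return {
        "street_name": salted_sha256(record["street_name"]),
        "house_no": salted_sha256(record["house_no"]),
        "city": record["city"],
        "zip": record["zip"],
    }


if __name__ == "__main__":
    # Illustrative record with hypothetical field names.
    sample = {"street_name": "DLOUHA", "house_no": "12", "city": "PRAHA", "zip": "11000"}
    print(depersonalize_address(sample))
```

Because the columns are hashed separately, two subjects living on the same street still share the same street hash, while City and ZIP stay available for proximity or concentration analysis.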
h) Email
Recommendation:
   Alias and domain hashed separately
Note: email structure is <alias>@<domain>
Reasoning: With the email alias and domain hashed separately, the domain part can be used for concentration analysis, e.g. to detect a user creating fake email accounts with the same email service provider. If the email address is hashed as a whole, concentration-based analysis techniques will not be possible.

i) Name, Surname
Recommendation:
   Name remains visible in original form**
   Surname hashed
   Name to be hashed separately from Surname
**Less frequent “unique” Names (below the 10th percentile of occurrence in the total distribution) will be hashed.
Reasoning: Hashing only the Surname should be sufficient to prevent exact identification of a person, while still making it possible e.g. to identify the gender of a person if not otherwise available, or to identify potential family members with different Names but the same Surname. Hashing of low-frequency (first) Names shall ensure that such unique customers cannot be identified in the data extract.

Data request technical:
j) The data format preferred for d15n processing is column-delimited text in UTF-8 encoding. Recommended column delimiters are “^”, “|” or the tab character. Comma or semicolon must not be used as column delimiter.
k) Data for d15n procedures should be extracted and provided separately from all other data needed for further processing/analysis. The internal id/key should be left in the data for d15n so that the depersonalized data can be connected with the remaining data.
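As an illustration of items h), j) and k), the sketch below processes lines of the form <internal id/key><column delimiter><email>, keeps the internal key untouched and hashes alias and domain separately. The salt, the “|” delimiter choice, the function names and the way the salt is combined with the value are assumptions for this example only.

```python
import hashlib

SALT = "example-salt"  # hypothetical; known only to the d15n operator
DELIMITER = "|"        # one of the recommended column delimiters


def salted_sha256(value: str, salt: str = SALT) -> str:
    """Return the SHA-256 hex digest of salt + value; empty input stays empty."""
    if not value:
        return value
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def depersonalize_email(email: str) -> str:
    """Hash alias and domain separately, keeping the <alias>@<domain> shape."""
    if not email:
        return email
    alias, sep, domain = email.rpartition("@")
    if not sep:
        # No '@' present: treat the whole value as a single token.
        return salted_sha256(email)
    return salted_sha256(alias) + "@" + salted_sha256(domain)


def depersonalize_lines(lines):
    """Yield '<key><delimiter><hashed email>' for each input line."""
    for line in lines:
        key, _, email = line.rstrip("\n").partition(DELIMITER)
        # The internal id/key is passed through unchanged so the result
        # can be joined back to the remaining (non-personal) data.
        yield key + DELIMITER + depersonalize_email(email)


if __name__ == "__main__":
    for row in depersonalize_lines(["1|bob@example.com", "2|eve@example.com", "3|"]):
        print(row)
```

In this sketch both sample addresses yield the same domain hash, which is what keeps concentration analysis by email provider possible.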
Recommended and requested structure examples:

File name.txt:                 <internal id/key><column delimiter><name>
File surname.txt:              <internal id/key><column delimiter><surname>
File address_streetname.txt:   <internal id/key><column delimiter><str.name>
File address_streetno.txt:     <internal id/key><column delimiter><str.no>
File address_city.txt:         <internal id/key><column delimiter><city>
File address_zip.txt:          <internal id/key><column delimiter><zip>
File address_state.txt:        <internal id/key><column delimiter><state>
File email_home.txt:           <internal id/key><column delimiter><email>
File email_business.txt:       <internal id/key><column delimiter><email>
File bank_account.txt:         <internal id/key><column delimiter><bank acc>
File phone_mobile.txt:         <internal id/key><column delimiter><phone>
File phone_fixed.txt:          <internal id/key><column delimiter><phone>
…

If there is a need to differentiate between systems or subsystems, data files can be split into different directories per system, or prefixed/suffixed with the system name.

Technical details of d15n implementation:

l) Empty / NULL data will not be hashed; it will remain empty in the d15n output.
m) Various date value formats will be unified to ISO format YYYY-MM-DD before submission to the hash function.
n) Phone data will be unified to international format with the '+' sign before submission to the hash function.
o) Unification vocabularies of external parties (participating vendors) can be used in d15n procedures; however, they will be used only if provided in the text format defined by Deloitte (refer below for definition) and if disclosure of their content to project participants is allowed.
p) Deloitte defined text format of unification vocabularies: delimited text files in UTF-8 encoding. Recommended column delimiters are “^”, “|” or the tab character. Comma or semicolon must not be used as column delimiter.
q) If a unification vocabulary (Deloitte or external) is applied, its content and the order in which it was applied to the data (in relation to other unification vocabularies) will be provided to all parties for information.
r) Deloitte reserves the right to decide, in each particular case/dataset, whether or not it will apply any internal or externally provided unification vocabulary in its d15n procedures. Deloitte will not explain or justify its decisions on why or in which order any unification vocabulary was or was not applied in d15n procedures.
s) Data normalization LEVEL 0, performed on all data destined for d15n, in order of application:
   a. replacement of accented characters (CE Latin character sets),
   b. replacement of inappropriate characters, e.g. phone data may contain only “+” or 0-9,
   c. removal of titles if present, e.g. in surname or name data,
   d. capitalization of data,
   e. sorting of multiple words one after another,
   f. replacement of multiple space characters (tab, space) by one space (optional).
t) Data normalization LEVEL 1, performed or not at the decision of the d15n operator on data destined for d15n, in order of application:
   a. decisions on and preparation (LEVEL 0) of unification vocabularies,
   b. application of unification vocabularies.
u) Data normalization LEVEL 2 on data destined for d15n, in order of application:
   a. review of resulting data quality,
   b. preparation/adjustment of correction routines,
   c. running of correction routines,
   d. review of resulting data quality,
   e. if quality is not satisfactory, go back to b.
v) The data normalization levels performed and their order of application on all data sets will be disclosed to all parties for information.
w) The hash algorithm used and preferred by Deloitte is SHA-2 (function SHA-256, in binary mode for all files).
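The LEVEL 0 normalization steps and the handling of empty values, dates and phone numbers described above can be sketched as follows. This is only an illustration under several assumptions: the salt, the title list, the accepted input date formats, the default country prefix “+420”, the reading of step e as alphabetical sorting of words, and the reading of step a as stripping diacritics are all hypothetical choices, not part of the specification.

```python
import hashlib
import re
import unicodedata
from datetime import datetime

SALT = "example-salt"                                # hypothetical salt
TITLES = {"ING", "MGR", "DR", "MUDR", "BC"}          # hypothetical title list
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")  # assumed input formats


def strip_accents(text: str) -> str:
    """Step a (assumed meaning): replace accented letters by their base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(text: str) -> str:
    """LEVEL 0 sketch for name-like fields: steps a, c, d, e, f."""
    text = strip_accents(text)                                   # a. accents
    words = [w.strip(".,") for w in text.split()]                # split on whitespace
    words = [w for w in words if w and w.upper() not in TITLES]  # c. titles
    words = [w.upper() for w in words]                           # d. capitalization
    return " ".join(sorted(words))                               # e. sort, f. single spaces


def normalize_phone(text: str, default_prefix: str = "+420") -> str:
    """Step b and item n: keep only '+' and digits, unify to '+' format."""
    cleaned = re.sub(r"[^0-9+]", "", text)
    if cleaned.startswith("00"):
        cleaned = "+" + cleaned[2:]
    elif not cleaned.startswith("+"):
        cleaned = default_prefix + cleaned  # assumption: national number
    return cleaned


def normalize_date(text: str) -> str:
    """Item m: unify supported date formats to ISO YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def d15n(value, normalizer) -> str:
    """Normalize, then hash with salted SHA-256; empty / NULL stays empty (item l)."""
    if value is None or value.strip() == "":
        return ""
    normalized = normalizer(value)
    # Assumption: the salt is prepended to the normalized value.
    return hashlib.sha256((SALT + normalized).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(d15n("Ing.  Dvořák   Jan", normalize_name))
    print(d15n("00420 602 123 456", normalize_phone))
    print(d15n("24.12.1980", normalize_date))
```

Normalizing before hashing is what keeps the data comparable: differently formatted inputs for the same subject produce the same hash value, which would not be the case if the raw strings were hashed directly.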