template_110727_Depersonalization d15n - pepe

Data extracts depersonalization for PoC - Terms definition
a)
Depersonalized data – depersonalization (d15n) of data means dissociating
personal identifiers from data to assure the anonymity of the individuals
studied. It should enable the creation of realistic data in non-production
environments without the risk of exposing sensitive information to
unauthorized users. Such data must remain comparable after d15n, i.e. the
same input should always produce the same output, and data relationships
must be maintained after d15n. At the same time, transformation of
depersonalized data back to the original data must be prevented.
b)
Hash function – A hash function is a procedure or mathematical function
that converts an arbitrarily large amount of data into a short string
representation, usually of fixed length. The same input always generates the
same output representation. The values returned by a hash function are called
hash values, hash codes, hash sums or hashes. An important feature of a hash
function is that it should not be feasible to mathematically deduce the
original value from the hashed value. The hash algorithm preferred by
Deloitte is SHA-2 (function SHA-256).
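The properties named above (fixed-length output, same input gives same output) can be illustrated with a minimal sketch. It assumes Python's standard `hashlib` module and UTF-8 encoding of the input; the helper name `sha256_hex` is hypothetical and not part of the original procedure.

```python
import hashlib

def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a text value as 64 hexadecimal characters."""
    # Assumption: text is encoded as UTF-8 before hashing.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

short_input = sha256_hex("abc")
long_input = sha256_hex("abc" * 10_000)

print(short_input)
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
print(len(short_input), len(long_input))  # 64 64
```

Both digests have the same length regardless of input size, and repeating the call with the same input returns the same digest.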
c)
Brute force attack – Preparation of pairs of known, potential or frequent
original values and their corresponding hash values, and comparison of those
pairs against the available hash values. Protection against a brute force
attack is the use of a “salted hash value”.
d)
Salted hash value – Brute force attacks can be prevented by adding a text
string of arbitrary length and structure (known only to the d15n operator) to
the data before submitting it to the hash function. This should prevent
reversal of a hash value to its original value by comparison against known
original value – hash pairs. Frequent common terms will still produce
frequent salted hash codes (comparability of data has to be preserved);
however, it will not be possible to claim with certainty what the original
value was. Frequency analysis remains possible, and must remain possible,
because further data processing depends on it. The original data contents and
the privacy of the underlying subjects stay protected.
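A minimal sketch of the salting idea follows. It assumes the salt is simply prepended to the value (the text above only says it is "added"), and the names `SALT` and `salted_hash` as well as the salt string itself are hypothetical placeholders.

```python
import hashlib

# Hypothetical salt; in practice it would be known only to the d15n operator.
SALT = "example-secret-salt"

def salted_hash(value: str, salt: str = SALT) -> str:
    """Hash salt + value with SHA-256 (prepending the salt is an assumption)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# An attacker's precomputed table of unsalted hashes for frequent surnames.
lookup_table = {
    hashlib.sha256(name.encode("utf-8")).hexdigest(): name
    for name in ("NOVAK", "SVOBODA", "DVORAK")
}

protected = salted_hash("NOVAK")
print(protected in lookup_table)         # False: the table no longer matches
print(protected == salted_hash("NOVAK")) # True: comparability is kept
```

A keyed construction such as HMAC (Python's `hmac` module) is a common alternative to plain concatenation; the source does not say which variant is used.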
e)
Garbled/random data – Data whose content has been randomly replaced or
generated, so that its information value equals that of random data. Such
data can be, and is, used e.g. for software testing, where some data is
needed to test software functionality but only the presence of data, not its
content, is relevant.
f)
Unification vocabularies – Words that have the same meaning, for example
Robert - Bob, Henry - Harry, John - Johann etc., can be unified to one
standard form in order to increase data quality and make data more
comparable and connectable.
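As an illustration, a unification vocabulary can be thought of as a lookup from variant to standard form. The sketch below is an assumption about how such a lookup might behave: the choice of standard forms, the upper-casing, and the names `VOCABULARY` and `unify` are all hypothetical.

```python
# Hypothetical vocabulary: variant -> standard form (pairs taken from the text above;
# which side is "standard" is an assumption).
VOCABULARY = {
    "BOB": "ROBERT",
    "HARRY": "HENRY",
    "JOHANN": "JOHN",
}

def unify(value: str, vocabulary: dict = VOCABULARY) -> str:
    """Map a value to its standard form; unknown values pass through upper-cased."""
    key = value.strip().upper()
    return vocabulary.get(key, key)

print(unify("Bob"))     # ROBERT
print(unify("robert"))  # ROBERT
print(unify("Eva"))     # EVA
```

After unification, "Bob" and "Robert" produce the same hash, which is what makes the depersonalized records connectable.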
Proposed Client depersonalization (d15n) of different data types for PoC:
The content below describes Deloitte best practice in d15n, unless the Client
requires otherwise. For technical details refer to section “Technical details
of d15n implementation”.
Data fields to be hashed completely:
a) Account number*
b) Date of Birth (if the Client permits, extracting Month and Year)
c) National personal ID (Czech RČ) (if the Client permits, extracting Month and Year)
d) Business/Tax or similar register ID
e) Phone (landline or mobile) (if the Client permits, extracting the prefix)
*Note: Bank Sort Code if available should not be part of depersonalization.
Excluded from d15n (remains visible):
f) Gender
To be hashed partially:
g) Address
Recommendation:
Street Name hashed
House No hashed
Columns should be hashed separately
City and ZIP remain in plain form
Reasoning: If only Street Name and House No. are hashed, other fields such as
City and ZIP can still be used, and they are important e.g. for subject
proximity or concentration analysis. If the address is hashed completely, no
location-based analysis techniques will be possible.
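The address recommendation can be sketched as follows. This is an illustrative assumption, not the original implementation: the record layout, field names, salt and helper names are hypothetical, and the salt is prepended to the value.

```python
import hashlib

SALT = "example-secret-salt"  # hypothetical; known only to the d15n operator

def salted_hash(value: str, salt: str = SALT) -> str:
    """Salted SHA-256; empty values are returned unchanged."""
    if not value:
        return value
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def depersonalize_address(record: dict) -> dict:
    """Hash street name and house number separately; keep city and ZIP in plain form."""
    result = dict(record)
    for field in ("street_name", "house_no"):  # hypothetical field names
        result[field] = salted_hash(record[field])
    return result

address = {"id": 1, "street_name": "DLOUHA", "house_no": "12",
           "city": "PRAHA", "zip": "11000"}
print(depersonalize_address(address))
```

Because the two columns are hashed separately, two subjects living on the same street share the street hash even when their house numbers differ.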
h) Email
Recommendation:
Alias and domain hashed separately
Note:
Email structure: <alias>@<domain>
Reasoning: If email alias and domain are hashed separately, the domain part
can be used for concentration analysis, e.g. to detect a user creating fake
email accounts with the same email service provider. If the email address is
hashed as a whole, no concentration-based analysis techniques will be
possible.
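A minimal sketch of hashing alias and domain separately is shown below. It assumes the value has already been normalized, splits on the last "@", and falls back to hashing the whole value when no "@" is present; that fallback, the salt and the function names are hypothetical.

```python
import hashlib

SALT = "example-secret-salt"  # hypothetical; known only to the d15n operator

def salted_hash(value: str, salt: str = SALT) -> str:
    """Salted SHA-256 of a non-empty text value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def depersonalize_email(email: str) -> str:
    """Return <hash(alias)>@<hash(domain)>; empty input stays empty."""
    if not email:
        return email
    alias, separator, domain = email.rpartition("@")
    if not separator:
        # Assumption: a value without "@" is hashed as a whole.
        return salted_hash(email)
    return f"{salted_hash(alias)}@{salted_hash(domain)}"

first = depersonalize_email("jan.novak@example.com")
second = depersonalize_email("eva@example.com")
# Same provider gives the same domain hash, so concentration analysis still works.
print(first.split("@")[1] == second.split("@")[1])  # True
print(first.split("@")[0] == second.split("@")[0])  # False
```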
i) Name, Surname
Recommendation:
Name remains visible in original form**
Surname hashed
Name to be hashed separately from Surname
**Less frequent “unique” Names (below the 10th percentile of occurrence in the
total distribution) will be hashed.
Reasoning: Hashing the Surname only should be enough to prevent exact
identification of a person, while still making it possible e.g. to identify
the gender of a person if not available otherwise, or to identify potential
family members with different Names but the same Surname. Hashing of
low-frequency (first) Names shall ensure that such unique customers cannot be
identified in the data extract.
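The rule for rare first Names leaves room for interpretation. The sketch below shows one possible reading, stated as an assumption: the 10th percentile is taken (nearest-rank method) over the occurrence counts of the distinct Names, and every Name whose count is at or below that threshold is hashed. Using "at or below" rather than strictly "below" is a deliberate, more cautious choice so that ties at the lowest count are still hashed. All names in the code are hypothetical.

```python
import hashlib
import math
from collections import Counter

SALT = "example-secret-salt"  # hypothetical; known only to the d15n operator

def salted_hash(value: str, salt: str = SALT) -> str:
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def rare_names(names, percentile=10):
    """Names whose occurrence count is at or below the nearest-rank percentile."""
    counts = Counter(names)
    if not counts:
        return set()
    ordered = sorted(counts.values())
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    threshold = ordered[rank - 1]
    return {name for name, count in counts.items() if count <= threshold}

def depersonalize_first_names(names, percentile=10):
    """Keep frequent Names visible; replace rare Names with their salted hash."""
    rare = rare_names(names, percentile)
    return [salted_hash(name) if name in rare else name for name in names]

sample = ["JAN"] * 50 + ["PETR"] * 40 + ["EVA"] * 30 + ["ANNA"] * 20 + ["ZEBULON"]
print(rare_names(sample))  # {'ZEBULON'}
```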
Data request technical:
j)
The data format preferred for d15n processing is column-delimited text in
UTF-8 encoding. Recommended column delimiters are “^”, “|” or the tab
character. Comma or semicolon must not be used as column delimiter.
k)
Data for d15n procedures should be extracted and provided separately from
all other data needed for further processing/analysis. The internal id/key
should be left in the data for d15n so that the depersonalized data can be
connected with the remaining data. Recommended and requested structure
examples:
File name.txt: <internal id/key><column delimiter><name>
File surname.txt: <internal id/key><column delimiter><surname>
File address_streetname.txt: <internal id/key><column delimiter><str.name>
File address_streetno.txt: <internal id/key><column delimiter><str.no>
File address_city.txt: <internal id/key><column delimiter><city>
File address_zip.txt: <internal id/key><column delimiter><zip>
File address_state.txt: <internal id/key><column delimiter><state>
File email_home.txt: <internal id/key><column delimiter><email>
File email_business.txt: <internal id/key><column delimiter><email>
File bank_account.txt: <internal id/key><column delimiter><bank acc>
File phone_mobile.txt: <internal id/key><column delimiter><phone>
File phone_fixed.txt: <internal id/key><column delimiter><phone>
…
If there is a need to differentiate between sub/systems, data files can be
split into different directories per system, or prefixed/suffixed with the
system name.
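The per-field file layout requested above could be produced along these lines. This is a sketch under assumptions: the record dictionaries, the `FIELDS` subset and the function name are hypothetical, and "|" is picked from the recommended delimiters.

```python
import csv
from pathlib import Path

# Hypothetical subset of the files listed above; each becomes <field>.txt.
FIELDS = ("name", "surname", "address_city")

def write_field_files(records, out_dir, delimiter="|"):
    """Write one UTF-8 file per field, each line being <internal id><delimiter><value>."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for field in FIELDS:
        path = out_dir / f"{field}.txt"
        with open(path, "w", encoding="utf-8", newline="") as handle:
            writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
            for record in records:
                # Missing values are written as empty strings.
                writer.writerow([record["id"], record.get(field, "")])
```

For example, `write_field_files([{"id": 1, "name": "Jiří", "surname": "Novák"}], "extract")` would create `extract/name.txt` containing the line `1|Jiří`. Keeping the internal id in every file is what allows the depersonalized columns to be joined back to the remaining data.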
Technical details of d15n implementation:
l)
Empty / NULL data will not be hashed; it will remain empty on d15n output.
m)
Various date value formats will be unified to ISO format YYYY-MM-DD before
submitting to hash function.
n)
Phone data will be unified to international format with '+' sign before submitting
to hash function.
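Items l) to n) can be sketched as small normalization helpers applied before hashing. The list of accepted input date formats, the treatment of a leading "00" as the international prefix, and the default country code "420" are assumptions made for illustration; the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed input formats; the real list depends on the source systems.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y", "%Y%m%d")

def normalize_date(value):
    """Return the date as ISO YYYY-MM-DD; empty/NULL values pass through unchanged."""
    if value is None or value.strip() == "":
        return value
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def normalize_phone(value, default_country_code="420"):
    """Return the phone number as '+<digits>'; empty/NULL values pass through unchanged."""
    if value is None or value.strip() == "":
        return value
    text = value.strip()
    digits = re.sub(r"\D", "", text)
    if text.startswith("+"):
        return "+" + digits
    if digits.startswith("00"):
        # Assumption: "00" is the international call prefix.
        return "+" + digits[2:]
    # Assumption: numbers without a prefix belong to the default country.
    return "+" + default_country_code + digits

print(normalize_date("24.12.1980"))      # 1980-12-24
print(normalize_phone("00420 777 123 456"))  # +420777123456
```

National trunk prefixes (a single leading "0" in some countries) are not handled in this sketch.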
o)
Unification vocabularies of external parties (participating vendors) can be
used in d15n procedures; however, they will be used only if provided in the
text format defined by Deloitte (refer below for definition) and if their
content may be disclosed to project participants.
p)
Text format of unification vocabularies defined by Deloitte: delimited text
files with UTF-8 encoding. Recommended column delimiters are “^”, “|” or the
tab character. Comma or semicolon must not be used as column delimiter.
q)
If a unification vocabulary (Deloitte or external) is applied, its content
and the order in which it was applied (in relation to other unification
vocabularies) will be provided to all parties for information.
r)
Deloitte reserves the right to decide, in each particular case/dataset,
whether or not its d15n procedures will apply any internal or externally
provided unification vocabulary. Deloitte will not explain and/or justify its
decisions on why and/or in which order any unification vocabulary was or was
not applied in d15n procedures.
s)
Data normalization LEVEL 0, performed on all data destined for d15n, consists
of the following steps, in order of application:
a. replacement of accented characters for CE Latin character sets,
b. replacement of inappropriate characters, e.g. phone data may contain only “+” or 0-9,
c. removal of titles if present, e.g. in surname or name data,
d. capitalization of data,
e. sorting of multiple words behind each other,
f. replacement of multiple whitespace characters (tab, space) with one space
t)
(optional) Data normalization LEVEL 1, performed or not at the discretion of
the d15n operator, consists of the following steps, in order of application:
a. decisions and preparation (LEVEL 0) of unification vocabularies
b. application of unification vocabularies
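The LEVEL 0 steps listed under s) can be sketched as a single function that applies them in the stated order. Several details are assumptions made for illustration: the title list, the set of characters allowed in non-phone data, reading "capitalization" as upper-casing, and reading step e as alphabetical sorting of the words within a value. The names `TITLES`, `strip_accents` and `normalize_level0` are hypothetical.

```python
import re
import unicodedata

# Hypothetical title list; the real one would be agreed for the dataset.
TITLES = ("ING", "MGR", "MUDR", "JUDR", "DR", "BC", "PHD")
_TITLE_RE = re.compile(r"\b(?:" + "|".join(TITLES) + r")\b", re.IGNORECASE)

def strip_accents(text: str) -> str:
    """Decompose characters and drop combining marks, e.g. 'ř' becomes 'r'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_level0(value, kind="name", sort_words=True):
    """Apply steps a-f in order; empty/NULL values pass through unchanged."""
    if not value:
        return value
    text = strip_accents(value)                            # a. accented characters
    if kind == "phone":
        text = re.sub(r"[^0-9+]", "", text)                # b. phone: only + and 0-9
    else:
        text = re.sub(r"[^A-Za-z0-9\s'-]", " ", text)      # b. other data (assumed set)
    if kind == "name":
        text = _TITLE_RE.sub(" ", text)                    # c. titles
    text = text.upper()                                    # d. capitalization
    if sort_words and kind != "phone":
        text = " ".join(sorted(text.split()))              # e. word sorting (assumed)
    return re.sub(r"\s+", " ", text).strip()               # f. whitespace

print(normalize_level0("  Ing.  Jiří   Novák "))  # JIRI NOVAK
print(normalize_level0("Novák Jiří"))             # JIRI NOVAK
```

Letters without a Unicode decomposition (for example "ł") are not covered by `strip_accents` and would need an explicit replacement table before step b.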
u)
Data normalization LEVEL 2 on data destined for d15n consists of the
following steps, in order of application:
a. review of resulting data quality
b. preparation/adjustment of correction routines
c. running of the correction routines
d. review of resulting data quality
e. if quality is not satisfactory, go back to b
v)
The data normalization levels performed and their order of application on all
data sets will be disclosed to all parties for information.
w)
The hash algorithm used and preferred by Deloitte is SHA-2 (function SHA-256
in binary mode for all files).
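The text does not define "binary mode". One plausible reading, assumed here, is that files are hashed as raw bytes without any newline or encoding translation. A minimal sketch of that reading follows; the function name and chunk size are hypothetical.

```python
import hashlib

def sha256_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file read as raw bytes."""
    digest = hashlib.sha256()
    # "rb" opens the file in binary mode: bytes are hashed exactly as stored.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use constant for large extracts, and hashing raw bytes means the digest does not depend on the platform's line-ending conventions.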