Cross Portfolio Data Integration Secretariat de

Dear Sir,
Thank you for the opportunity to comment on the De-identification of data and information consultation draft April 2013. The following comments are submitted by the Cross Portfolio Data
Integration Secretariat and relate to both Information Policy Agency Resource 1 and the Privacy
Business Resource 2.
The Cross Portfolio Data Integration Secretariat is pleased to note that the guidelines largely align
with the information in the National Statistical Service's Confidentiality Information series and that this
source is referenced for further information. We would like to suggest some changes that would
improve this alignment and clarify some technical aspects of the guidelines.
Our general comments are written in black text and suggested changes to wording are shown in blue.

What is de-identification? The Secretariat suggests alternative wording that would align more
closely with the definition of confidentialisation used in the Confidentiality Information series:
"De-identification is a process that involves removing or altering information within a
collection of data or information (e.g. a dataset) to ensure that no individual is likely to
be identified in the data. De-identification is a two step process: removing personal
identifiers (such as name, address and date of birth), and removing or altering other
personal information that would allow the identification of an individual who is the
source or subject of the data or information (such as a rare characteristic or a
combination of unique or remarkable characteristics)."
We also note that the Australian National Data Service guidelines for De-identifying your data,
which are linked as a resource, make the point that the National Statement on Ethical Conduct in
Human Research cautions against the use of the term de-identify because the meaning is
unclear. If this term is used, it needs to be clear what is meant (i.e. more than just taking off
name and address). In our research we found that the meaning of other terms such as
anonymised can be just as unclear as the term de-identified (for similar reasons), which is
why our preference is for the term confidentialise.

How to de-identify There are some issues in your description of the techniques used to de-identify data:
(i) Removing or modifying identifying features such as a person's name etc. - The OAIC definition of
de-identification refers to 'identifying features' as 'personal identifiers'. We suggest you use the term
personal identifiers in both instances. We would also suggest that removing name, address and
possibly date of birth is an essential (not an optional) component of de-identifying the data. Using
age instead of date of birth is less identifying (and ages can be combined into broader age categories), and
limiting address information to broader areas such as the city or town of residence would also be less
identifying than address or postcode.
(ii) Removing or modifying quasi-identifiers (for example, gender, significant dates, profession,
income) which are unique to an individual, or in combination with other information may identify an
individual. Removal of particular quasi-identifiers should be considered only if the identity of an
individual cannot be protected through other de-identification methods as this information may need to
be retained for the information or data to continue to be meaningful and usable.
'Quasi-identifier' is a vague concept, and the examples you give would not be likely to identify an
individual unless that individual is highly visible in the data, for instance because they have an unusual job (e.g. pop star or
judge) or a very large income. However, such variables may be identifying if they are combined with other
information about a person (for example, a very elderly person with a very high income for someone
of that age). We are surprised that gender is included as a quasi-identifier as this is not considered
an identifying variable by the ABS.
(iii) Suppressing data. There seems to be some confusion about the concepts of data suppression
and combining categories in the OAIC information sheets. The definition of suppressing data in your
information sheets actually describes the concept of combining categories. Further explanation and
alternative definitions of data suppression and combining information or data are provided below:
'Data suppression' as a method can be described as follows:

Data suppression involves not releasing the particular information that may enable
identification to anyone, or deleting that information from the dataset. To maximise the
usefulness of the data, data suppression should be used if there is no other method that will
adequately de-identify the file.
The original wording discussed combining categories. 'Category' in this instance does not refer to a
range of data or data that has been classified into broad categories; it refers to response categories.
For example, a question may ask for age to be reported in single years (each age would be a
response category) and combining categories in this instance would involve combining age into (for
example) five year groups; alternatively, a question may ask which age range a person belongs to
(the ranges specified would be the response categories) and combining categories in this instance
would involve making the age range categories bigger.
Given the confusion that the word 'categories' creates, the Secretariat suggests a new dot point to
define 'combining information or data' and deleting the dot point relating to 'combining data
categories':

Combining information or data that is likely to enable identification of an individual into
categories is a popular method of altering the information to de-identify it. For example, age
may be combined and expressed in ranges, rather than single years, so the age of people
who are 27 might be changed to an age of 25-35 years. Extreme values above an upper limit
or below a lower limit may be placed in an open-ended range, such as an age value of less
than 15 years or 80+.
(iv) Combining data categories... delete this as it is covered above.
(v) Another method that you have not mentioned in your guide is introducing small amounts of
random error by altering the identifying information in a small way such that the aggregate information
or data are not significantly affected. Some words to describe these methods have been included
below should you wish to include them in your guide:


The identity of an individual may be protected by altering identifiable information in a small
way such that the aggregate information or data are not significantly affected but the original
values cannot be known with certainty. Rounding values is one way of introducing small
amounts of error. There are a variety of methods for rounding data, some of which are
described in the National Statistical Service's Confidentiality Information Sheet 4 - How to
confidentialise data: the basic principles.
Data swapping is another method that may be used to hide the uniqueness of some
information. This involves swapping identifying information for one person with the information
for another person with similar characteristics. For example, a person from a particular town
in Australia may speak a language that is unique in that town. The language spoken
information could be swapped with the language spoken for another person with otherwise
similar characteristics (based on age, gender, income etc.) in an area where the language is
more commonly spoken.
(vi) We do not have any comment to make on the synthetic data method.
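By way of illustration only, the methods described under (iii) and (v) above could be sketched in Python as follows. The function names, the five year band width, the limits of 15 and 80 years and the rounding base of 5 are assumptions chosen for this example; they are not taken from the OAIC guidelines or from the Confidentiality Information series, and other rounding methods exist.

```python
def suppress_fields(record, fields):
    """Data suppression: return a copy of the record without the named fields."""
    return {key: value for key, value in record.items() if key not in fields}


def age_band(age, width=5, lower=15, upper=80):
    """Combining information: replace a single-year age with an age range.

    Ages below `lower` or at/above `upper` go into open-ended ranges.
    The default limits and width are illustrative assumptions.
    """
    if age < lower:
        return f"under {lower}"
    if age >= upper:
        return f"{upper}+"
    start = lower + ((age - lower) // width) * width
    end = min(start + width - 1, upper - 1)
    return f"{start}-{end}"


def round_to_base(count, base=5):
    """Rounding: move a non-negative count to the nearest multiple of `base`."""
    return base * int((count + base / 2) // base)


# Hypothetical record used only to show the calls.
person = {"name": "A. Citizen", "address": "1 Example St", "age": 27, "town": "Sampleton"}
released = suppress_fields(person, {"name", "address"})
released["age"] = age_band(released["age"])
```

In this sketch the personal identifiers are removed first and the remaining age value is then placed in a range, which mirrors the two step process described under "What is de-identification?".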
If you have any questions regarding our response please give me a call.
_________________________________________________________________
Debbie Hansard
Cross Portfolio Data Integration Secretariat | Australian Bureau of Statistics
(P)
(E)
(W) www.nss.gov.au/dataintegration