Dear Sir,

Thank you for the opportunity to comment on the De-identification of data and information consultation draft April 2013. The following comments are submitted by the Cross Portfolio Data Integration Secretariat and relate to both Information Policy Agency Resource 1 and the Privacy Business Resource 2.

The Cross Portfolio Data Integration Secretariat is pleased to note that the guidelines largely align with the information in the National Statistical Service's Confidentiality Information series and that this source is referenced for further information. We would like to suggest a couple of changes that would improve this alignment and/or clarify some technical aspects of the guidelines. Our general comments are written in black text and suggested changes to wording are shown in blue.

What is de-identification?

The Secretariat suggests alternative wording that would align more closely with the definition of confidentialisation used in the Confidentiality Information series:

"De-identification is a process that involves removing or altering information within a collection of data or information (e.g. a dataset) to ensure that no individual is likely to be identified in the data. De-identification is a two-step process: removing personal identifiers (such as name, address and date of birth), and removing or altering other personal information that would allow the identification of an individual who is the source or subject of the data or information (such as a rare characteristic or a combination of unique or remarkable characteristics)."

We also note that the Australian National Data Service guidelines for De-identifying your data, which are linked as a resource, make the point that the National Statement on Ethical Conduct in Human Research cautions against the use of the term 'de-identify' because its meaning is unclear. If this term is used, it needs to be clear what is meant (i.e. more than just removing name and address).
In our research we found that the meaning of other terms, such as 'anonymised', can be just as unclear as the term 'de-identified' (for similar reasons), which is why our preference is for the term 'confidentialise'.

How to de-identify

There are some issues in your description of the techniques used to de-identify data:

(i) Removing or modifying identifying features such as a person's name etc. - The OAIC definition of de-identification refers to 'identifying features' as 'personal identifiers'. We suggest you use the term 'personal identifiers' in both instances. We would also suggest that removing name, address and possibly date of birth is an essential (not an optional) component of de-identifying the data. Using age instead of date of birth is less identifying (and age can be combined with other age categories), and limiting address information to broader areas such as the city or town of residence would also be less identifying than address or postcode.

(ii) Removing or modifying quasi-identifiers (for example, gender, significant dates, profession, income) which are unique to an individual or which, in combination with other information, may identify an individual. Removal of particular quasi-identifiers should be considered only if the identity of an individual cannot be protected through other de-identification methods, as this information may need to be retained for the information or data to continue to be meaningful and usable.

'Quasi-identifiers' is a vague concept, and the examples you give would not be likely to identify an individual unless that individual is highly visible in the data, for example because they have an unusual job (e.g. pop star or judge) or a very large income. However, these examples may be identifying if they are combined with other information about a person (for example, a very elderly person with a very high income for someone of that age). We are surprised that gender is included as a quasi-identifier, as it is not considered an identifying variable by the ABS.
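By way of illustration only, points (i) and (ii) above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a recommended implementation: the field names (name, address, date_of_birth, town, occupation), the reference date and the sample records are all hypothetical and are not drawn from the OAIC resources or the Confidentiality Information series. The sketch removes personal identifiers, keeps age in place of date of birth, and then lists any combination of remaining values that is held by only one record.

```python
from collections import Counter
from datetime import date

# Hypothetical field names treated as personal identifiers.
PERSONAL_IDENTIFIERS = {"name", "address", "date_of_birth"}


def age_on(dob, reference):
    """Return age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (dob.month, dob.day)
    return reference.year - dob.year - (0 if had_birthday else 1)


def strip_identifiers(record, reference):
    """Drop personal identifiers, keeping age in place of date of birth."""
    out = {k: v for k, v in record.items() if k not in PERSONAL_IDENTIFIERS}
    out["age"] = age_on(record["date_of_birth"], reference)
    return out


def unique_combinations(records, keys):
    """Return the combinations of values for `keys` held by exactly one record."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [combo for combo, n in counts.items() if n == 1]


# Fabricated sample records, for illustration only.
SAMPLE = [
    {"name": "Person A", "address": "1 Example St", "town": "Town X",
     "date_of_birth": date(1925, 3, 10), "occupation": "judge"},
    {"name": "Person B", "address": "2 Example St", "town": "Town X",
     "date_of_birth": date(1980, 1, 15), "occupation": "teacher"},
    {"name": "Person C", "address": "3 Example St", "town": "Town X",
     "date_of_birth": date(1980, 2, 20), "occupation": "teacher"},
]

REFERENCE_DATE = date(2013, 4, 30)  # assumed date on which age is calculated
stripped = [strip_identifiers(r, REFERENCE_DATE) for r in SAMPLE]
at_risk = unique_combinations(stripped, ["age", "occupation"])
```

In this fabricated sample, the only combination of age and occupation held by a single record is the 88-year-old judge, which is the kind of highly visible record that may need further treatment even after name, address and date of birth have been removed.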
(iii) Suppressing data. There seems to be some confusion about the concepts of data suppression and combining categories in the OAIC information sheets. The definition of suppressing data in your information sheets actually describes the concept of combining categories. Further explanation and alternative definitions of data suppression and combining information or data are provided below.

'Data suppression', as a method, can be described as follows:

Data suppression involves not releasing the particular information that may enable identification to anyone, or deleting that information from the dataset. To maximise the usefulness of the data, data suppression should be used only if there is no other method that will adequately de-identify the file.

The original wording discussed combining categories. 'Category' in this instance does not refer to a range of data or to data that has been classified into broad categories; it refers to response categories. For example, a question may ask for age to be reported in single years (each age would be a response category), and combining categories in this instance would involve combining age into (for example) five-year groups. Alternatively, a question may ask which age range a person belongs to (the ranges specified would be the response categories), and combining categories in this instance would involve making the age range categories bigger.

Given the confusion that the word 'categories' creates, the Secretariat suggests a new dot point to define 'combining information or data' and deleting the dot point relating to 'combining data categories':

Combining information or data that is likely to enable identification of an individual into categories is a popular method of altering the information to de-identify it. For example, age may be combined and expressed in ranges, rather than single years, so the age of a person who is 27 might be changed to an age of 25-35 years.
Extreme values above an upper limit or below a lower limit may be placed in an open-ended range, such as an age of less than 15 years or 80+.

(iv) Combining data categories... delete this as it is covered above.

(v) Another method that you have not mentioned in your guide is introducing small amounts of random error by altering the identifying information in a small way such that the aggregate information or data are not significantly affected. Some words to describe these methods have been included below should you wish to include them in your guide:

The identity of an individual may be protected by altering identifiable information in a small way such that the aggregate information or data are not significantly affected but the original values cannot be known with certainty. Rounding values is one way of introducing small amounts of error. There are a variety of methods for rounding data, some of which are described in the National Statistical Service's Confidentiality Information Sheet 4 - How to confidentialise data: the basic principles.

Data swapping is another method that may be used to hide the uniqueness of some information. This involves swapping identifying information for one person with the information for another person with similar characteristics. For example, a person from a particular town in Australia may speak a language that is unique in that town. The language spoken by that person could be swapped with the language spoken by another person with otherwise similar characteristics (based on age, gender, income etc.) in an area where the language is more commonly spoken.

(vi) We do not have any comment to make on the synthetic data method.

If you have any questions regarding our response please give me a call.

_________________________________________________________________
Debbie Hansard
Cross Portfolio Data Integration Secretariat | Australian Bureau of Statistics
(P) (E) (W) www.nss.gov.au/dataintegration