The identification of individuals within a dataset can cause serious harm to those individuals and can also affect your agency and the government. De-identification removes personal identifiers from personal information so that the individuals who are the subject of the data cannot be identified.

You should consider obtaining the proper approvals to undertake de-identification within your agency.
South Australian state government information privacy is guided by the Information Privacy Principles Instruction (IPPI), otherwise known as PC012.  Local government councils and universities are not covered by the IPPI.

The IPPI only applies to personal information and not data that has been de-identified. To ensure your agency is not in breach of the IPPI, care must be taken to determine that the information is properly de-identified.

Identification risks

There are two key types of identification risks associated with the release of government data:

  • Spontaneous recognition
    The risk that identification is made without any deliberate attempt to identify a person. This can result from the release of a dataset that includes the data of individuals with rare characteristics. The risk of identification is proportionate to the rarity of the characteristic.
  • Deliberate recognition
    The risk associated with a malicious or deliberate attempt to identify a person from the released dataset. This can result from list matching, that is, matching common characteristics in the released dataset to other publicly available datasets or information. It can also result from targeting a particular individual using a characteristic in the dataset that is already known to the person attempting to identify them.

Several techniques can be applied to properly de-identify data and mitigate any risks of identification of an individual.

The simplest method of de-identification is to remove obvious identifying variables from the data such as an individual’s name or address.  For example, consider the following data:

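| Name     | Address / Postcode                 | Age | Gender | Profession         | Annual Salary |
|----------|------------------------------------|-----|--------|--------------------|---------------|
| B. Johns | 10 Record Street Woodville SA 5011 | 52  | Male   | Driving Instructor | $75,000       |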

By removing basic identifiers this can become:

| Postcode | Age | Gender | Profession         | Annual Salary |
|----------|-----|--------|--------------------|---------------|
| 5011     | 52  | Male   | Driving Instructor | $75,000       |

This data has been stripped of its direct identifiers, but the potential for re-identification remains high. The data still exists at an individual level and other potentially identifying information has been retained. For example, some South Australian postcodes have small populations, so identifying a 52-year-old male driving instructor in one of them would be easy. Your agency may then have disclosed the individual's salary to the community without his permission.
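The removal step above can be sketched in a few lines of Python. This is a minimal illustration only: the field names and the choice of which fields count as direct identifiers are assumptions made for the example, not a prescribed schema.

```python
# Hypothetical record mirroring the example above; field names are assumptions.
records = [
    {
        "name": "B. Johns",
        "address": "10 Record Street Woodville SA",
        "postcode": "5011",
        "age": 52,
        "gender": "Male",
        "profession": "Driving Instructor",
        "annual_salary": 75000,
    },
]

# Fields treated as direct identifiers in this sketch (an assumption).
DIRECT_IDENTIFIERS = {"name", "address"}


def remove_direct_identifiers(rows, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the rows without the direct identifier fields."""
    return [
        {key: value for key, value in row.items() if key not in identifiers}
        for row in rows
    ]


stripped = remove_direct_identifiers(records)
print(stripped)
```

The original records are left unchanged, and the remaining fields (postcode, age, gender, profession, salary) still carry the re-identification risk described above.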

However, consider how much information to remove before it becomes meaningless. It is important for the dataset to have a purpose and only include data that suits the objective of that purpose.

Pseudonymisation replaces recognisable identifiers with artificially generated identifiers, such as a coded reference or pseudonym.

Continuing with the example above, B. Johns would be assigned a randomly generated reference:

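| Individual reference | Postcode | Age | Gender | Profession         | Annual Salary |
|----------------------|----------|-----|--------|--------------------|---------------|
| SR23597              | 5011     | 52  | Male   | Driving Instructor | $75,000       |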

Pseudonymisation allows different information about an individual, often held in different datasets, to be correlated without directly identifying the individual.

For example, the information above could be correlated with:

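| Individual reference | Marital status | Number of children | Highest level of education attained | Number of cars owned by household |
|----------------------|----------------|--------------------|-------------------------------------|-----------------------------------|
| SR23597              | Divorced       | 2                  | Diploma                             | 3                                 |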

Pseudonymised data exists on an individual level with other potentially identifying information being retained and has a relatively high potential for re-identification.  Also, because pseudonymisation is generally used when an individual is tracked over more than one dataset, if re-identification does occur, more personal information will be revealed concerning the individual.
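One way to picture pseudonymisation and the correlation of two datasets is the sketch below. The reference format ("SR" plus five digits), the field names and the helper functions are all assumptions made for illustration. The lookup that links names to references is itself identifying, so in practice it would need to be kept apart from anything that is released.

```python
import secrets


def new_reference(existing):
    """Generate an unused reference such as 'SR23597' (format is an assumption)."""
    while True:
        candidate = f"SR{secrets.randbelow(100000):05d}"
        if candidate not in existing:
            return candidate


def pseudonymise(rows, id_field, lookup):
    """Replace id_field with a reference, reusing references already in lookup."""
    result = []
    for row in rows:
        identity = row[id_field]
        if identity not in lookup:
            lookup[identity] = new_reference(set(lookup.values()))
        rest = {k: v for k, v in row.items() if k != id_field}
        result.append({"individual_reference": lookup[identity], **rest})
    return result


# Maps real identities to references; never released with the data.
lookup = {}

# Hypothetical source datasets that both identify the person by name.
employment = [{"name": "B. Johns", "postcode": "5011", "age": 52,
               "profession": "Driving Instructor"}]
household = [{"name": "B. Johns", "marital_status": "Divorced", "children": 2}]

employment_p = pseudonymise(employment, "name", lookup)
household_p = pseudonymise(household, "name", lookup)

# Correlate the two pseudonymised datasets on the shared reference.
by_reference = {row["individual_reference"]: row for row in household_p}
combined = [{**row, **by_reference[row["individual_reference"]]}
            for row in employment_p]
print(combined)
```

The combined row contains no name, but it now holds more attributes about one person, which is why re-identification of pseudonymised data reveals more.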

Rendering personal information less precise can make re-identification less likely.
For example, dates of birth or ages can be replaced by age groups and specific salaries can be replaced by salary ranges.

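B. Johns' data now becomes:

| Individual reference | Postcode | Age range | Gender | Profession         | Annual Salary range |
|----------------------|----------|-----------|--------|--------------------|---------------------|
| SR23597              | 5011     | 50-60     | Male   | Driving Instructor | $60,000-$80,000     |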
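Replacing exact values with ranges can be sketched as follows. The band widths (10 years, $20,000) are assumptions chosen so that the output matches the ranges in the example. Each band includes its lower bound and excludes its upper bound, which is also an assumption of this sketch.

```python
def to_band(value, width):
    """Return the (lower, upper) band containing value.

    The lower bound is inclusive and the upper bound is exclusive.
    """
    lower = (value // width) * width
    return lower, lower + width


def generalise(row, age_width=10, salary_width=20000):
    """Replace exact age and salary with ranges; other fields are kept."""
    age_low, age_high = to_band(row["age"], age_width)
    sal_low, sal_high = to_band(row["annual_salary"], salary_width)
    out = {k: v for k, v in row.items() if k not in ("age", "annual_salary")}
    out["age_range"] = f"{age_low}-{age_high}"
    out["annual_salary_range"] = f"${sal_low:,}-${sal_high:,}"
    return out


# Hypothetical pseudonymised record; field names are assumptions.
record = {
    "individual_reference": "SR23597",
    "postcode": "5011",
    "age": 52,
    "gender": "Male",
    "profession": "Driving Instructor",
    "annual_salary": 75000,
}

generalised = generalise(record)
print(generalised)
```

Wider bands lower the re-identification risk but also make the data less useful, so the widths would normally be chosen with the purpose of the dataset in mind.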

Techniques for reducing data precision include suppressing cells with low values and conducting statistical analysis to determine whether values can be traced back to individuals. In such cases, you can apply a frequency rule that sets the minimum count a cell must reach before it is displayed.

For example, if we apply a frequency rule with a minimum count of 3 to the following table, the row showing driving instructors aged 35 to 40, which has a count of only 2, may be suppressed or aggregated into a wider age range.

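| Age      | Postcode | Number of Driving Instructors | Average Annual Salary |
|----------|----------|-------------------------------|-----------------------|
| 25 to 30 | 5011     | 20                            | $50,000               |
| 35 to 40 | 5011     | 2                             | $60,000               |
| 45 to 50 | 5011     | 10                            | $65,000               |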
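A frequency rule of this kind could be applied as in the sketch below. The field names are assumptions, and suppression is shown by blanking the count and average salary of any row below the minimum. Merging the row into a wider age range would be an alternative.

```python
MIN_COUNT = 3  # minimum count a row must reach to be displayed

# The example table above; field names are assumptions.
table = [
    {"age": "25 to 30", "postcode": "5011", "count": 20, "average_salary": 50000},
    {"age": "35 to 40", "postcode": "5011", "count": 2, "average_salary": 60000},
    {"age": "45 to 50", "postcode": "5011", "count": 10, "average_salary": 65000},
]


def apply_frequency_rule(rows, minimum=MIN_COUNT):
    """Blank the count and salary of rows whose count is below the minimum."""
    released = []
    for row in rows:
        if row["count"] < minimum:
            released.append({**row, "count": None, "average_salary": None})
        else:
            released.append(dict(row))
    return released


released = apply_frequency_rule(table)
for row in released:
    print(row)
```

Only the 35 to 40 row is blanked here, because its count of 2 is below the minimum of 3.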

More advanced techniques include combining data so that the original values cannot be known with certainty, but the aggregate results are unaffected.

Individual data can be combined to provide information about groups or populations. The larger the group, the less specific the data is, and therefore there is less potential for identifying an individual within the group.

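For example, consider the following data before aggregation:

| Profession         | State           | Annual Salary | Number of drivers |
|--------------------|-----------------|---------------|-------------------|
| Driving Instructor | South Australia | $49,500       | 200               |
| Driving Instructor | South Australia | $40,000       | 10,000            |
| Driving Instructor | South Australia | $45,000       | 2,000             |
| Driving Instructor | South Australia | $56,000       | 3,748             |
| Driving Instructor | South Australia | $58,000       | 11,414            |
| Driving Instructor | South Australia | $66,000       | 31,203            |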

Aggregated data:

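| Profession         | State           | Annual Salary | Number of drivers |
|--------------------|-----------------|---------------|-------------------|
| Driving Instructor | South Australia | <$50,000      | 12,200            |
| Driving Instructor | South Australia | >$55,000      | 46,365            |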
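The aggregation in this example can be reproduced with a short sketch. The pairing of each salary with a driver count follows the order in which the figures appear in the example and is an assumption. The band totals do not depend on that pairing within each band.

```python
# (annual salary, number of drivers) pairs from the example above.
source_rows = [
    (49500, 200),
    (40000, 10000),
    (45000, 2000),
    (56000, 3748),
    (58000, 11414),
    (66000, 31203),
]


def salary_band(salary):
    """Assign a salary to one of the bands used in the aggregated table."""
    if salary < 50000:
        return "<$50,000"
    if salary > 55000:
        return ">$55,000"
    # No example value falls here; the label is an assumption.
    return "$50,000-$55,000"


totals = {}
for salary, drivers in source_rows:
    band = salary_band(salary)
    totals[band] = totals.get(band, 0) + drivers

for band, drivers in totals.items():
    print(f"{band}: {drivers:,}")
```

The totals come to 12,200 drivers below $50,000 and 46,365 above $55,000, matching the aggregated table.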

Tools to assist in de-identification

Tools and software packages are available to assist in de-identifying datasets.  These tools apply automated de-identification methods and can even assist you to determine the success of the de-identification methods and privacy risks of publicly releasing the data.

Your agency should conduct its own research to identify a tool best suited to your objectives.

Testing de-identification

It is good privacy practice to test the methods you have employed to mitigate the privacy risks of publishing the dataset. Primarily this will involve attempting to re-identify individuals from the de-identified dataset.
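One simple check that could form part of such testing is to count how many records share each combination of potentially identifying fields and to flag combinations that occur too rarely. The sketch below does this. The choice of fields, the threshold of 3 and the sample records are assumptions, and passing this check alone would not show that a dataset is safe to release.

```python
from collections import Counter

# Fields an outsider might plausibly already know (an assumption).
QUASI_IDENTIFIERS = ("postcode", "age_range", "profession")


def group_sizes(rows, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of the key fields."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def at_risk(rows, minimum=3, keys=QUASI_IDENTIFIERS):
    """Return records whose combination of key fields occurs fewer than minimum times."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] < minimum]


# Hypothetical de-identified dataset.
dataset = [
    {"postcode": "5011", "age_range": "50-60", "profession": "Teacher"},
    {"postcode": "5011", "age_range": "50-60", "profession": "Teacher"},
    {"postcode": "5011", "age_range": "50-60", "profession": "Teacher"},
    {"postcode": "5011", "age_range": "50-60", "profession": "Driving Instructor"},
]

risky = at_risk(dataset)
print(risky)
```

Here the single driving instructor is flagged, because no other record shares that combination of postcode, age range and profession.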

Page last updated: 14 June 2024