Data privacy: could traditional approaches work for machine learning models?
In the second of their two-part series on privacy in the age of big data, Taylor Fry’s Stephanie Russell and Jonathan Cohen explore the potential to apply data privacy approaches used in traditional settings to machine learning models.
Read Part 1: The privacy act is changing – how will this affect your machine learning models and what can you do?
Our lives are increasingly tied to artificial intelligence and machine learning, yet the consequences seem unclear – especially for our privacy – as countries scramble to keep pace with a world dominated by algorithms. What does this mean for our privacy in Australia? In Part 1 of our series, we looked at changes to the Privacy Act 1988 under consideration by the Australian Government, and how these might affect machine learning models, the industries that use them and the consumers they target. This leads us to ask: could current approaches used to protect people’s data in traditional data contexts also be applied to machine learning models? In considering the question, we explore:
- The challenges in applying data privacy approaches in traditional data contexts
- A traditional data privacy approach – ‘differential privacy’ – and its applicability to machine learning models
- Emerging ‘unlearning’ techniques, and whether these can reduce the cost and time of retraining models when customers request deletion of their personal information under the ‘right to erasure’, one of the proposed changes to the Privacy Act.
1. Challenges of applying data privacy in traditional data contexts
The application of data privacy approaches to machine learning models is still a maturing research area. Even in traditional data contexts, masking private information can be difficult, and unexpected privacy issues can arise despite the use of standard, well-developed de-identification techniques. For example, the Australian Federal Department of Health unintentionally breached privacy laws when it published de-identified health records of 2.5 million people online in 2016. Although the published dataset complied with protocols for anonymisation and de-identification, data entries for certain individuals with rare conditions could easily be re-identified by cross-referencing the dataset with a few simple facts from other sources, such as Wikipedia and Facebook. It’s not all bad news, however: several techniques are emerging to address these data privacy risks as the quest to find solutions gains momentum. We explore some of these below, and whether they might be applicable in machine learning environments.
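To make the re-identification risk concrete, the toy sketch below shows how a handful of quasi-identifiers can link a ‘de-identified’ record back to a named individual. All column names, records and the person shown are invented for illustration and bear no relation to the actual published dataset.

```python
import pandas as pd

# Toy "de-identified" health dataset: names removed, but quasi-identifiers
# (year of birth, postcode, condition) are retained.
deidentified = pd.DataFrame({
    "record_id":  [101, 102, 103],
    "birth_year": [1954, 1987, 1990],
    "postcode":   ["2000", "3121", "4000"],
    "condition":  ["rare_condition_x", "influenza", "asthma"],
})

# Publicly available facts about one person (e.g. gleaned from news stories
# or social media): name, birth year and postcode.
public_facts = pd.DataFrame({
    "name":       ["Jane Citizen"],
    "birth_year": [1954],
    "postcode":   ["2000"],
})

# A simple join on the quasi-identifiers is enough to re-attach a name
# to a supposedly anonymous health record.
reidentified = public_facts.merge(deidentified, on=["birth_year", "postcode"])
print(reidentified[["name", "record_id", "condition"]])
```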
2. Differential privacy – a modern approach for traditional data contexts
Developed by researchers at Microsoft in 2006, differential privacy is a system that permits the public sharing of aggregate statistics and information about cohorts within datasets, without disclosing private information about specific individuals. It ensures that an attacker cannot confidently infer whether any particular individual’s data is included, regardless of what other background information is available on individuals within the population. In practice, differential privacy is implemented by adding noise or randomness to a dataset in a controlled manner that preserves aggregate statistics.

In broad terms, the introduction of noise under differential privacy conflicts with the philosophy of machine learning models, which typically rely on individual variation. Nevertheless, researchers have developed tools to apply differential privacy in machine learning, with the goal of limiting the influence of individual records in the training dataset – particularly sensitive features – on model outputs. Noise is commonly introduced into the training dataset, the predicted model outputs or the gradients computed during training. However, studies to date have found that settings which add enough noise to provide strong privacy protections typically result in models with little practical utility, whereas settings which reduce the noise to improve utility increase the risk of privacy leakage. Despite differential privacy’s reputation for ensuring privacy in traditional data contexts, researchers have concluded that alternative means of protecting individual privacy are needed in settings where utility and privacy cannot be balanced.
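As a rough illustration of how differential privacy operates in a traditional aggregate-statistics setting, the sketch below applies the Laplace mechanism to a simple count query. The data, threshold and epsilon values are illustrative assumptions, and a production system would rely on a vetted differential privacy library rather than hand-rolled noise. Smaller values of epsilon add more noise, strengthening privacy but degrading the accuracy of the released statistic – the same trade-off the studies above observe for machine learning models.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, threshold, epsilon):
    """Return a differentially private count of values above a threshold.

    The true count changes by at most 1 when any one individual's record is
    added or removed (sensitivity = 1), so adding Laplace noise with scale
    sensitivity / epsilon satisfies epsilon-differential privacy.
    """
    true_count = int(np.sum(np.asarray(values) > threshold))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative data: annual claim amounts for 1,000 customers.
claims = rng.gamma(shape=2.0, scale=5_000, size=1_000)

# Smaller epsilon => more noise => stronger privacy, less accurate answers.
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(dp_count(claims, threshold=20_000, epsilon=epsilon), 1))
```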
3. ‘Unlearning’ techniques
In Part 1, we also explored the implications of the ‘right to erasure’ for machine learning models, and found that, in addition to deleting the individual’s data itself, organisations may also be required to remove any influence that data has had on their models upon customer request. The most straightforward way to achieve this is to retrain machine learning models from scratch on an amended dataset that excludes the individual’s data, but this is often computationally costly and inefficient. To address these inefficiencies, alternative ‘unlearning’ techniques have been developed to ensure a model no longer relies on a data record that has been selected for erasure. Ideally, unlearning guarantees that training a model on a person’s data record and unlearning it afterwards produces the same model distribution as if the model had never been trained on that record at all. Emerging unlearning techniques include:
- Sharded, Isolated, Sliced and Aggregated (SISA) training, a framework that speeds up unlearning by strategically limiting the influence of each individual data record during training. The training data is divided into multiple shards so that any given record is included in one shard only. Because models are trained in isolation on each shard, only the affected shard’s model needs to be retrained when a request to erase an individual’s data is made, limiting retraining costs (a minimal sketch follows this list).
- Data removal-enabled (DaRE) forests, a variant of random forests that supports adding and removing training data with minimal retraining. DaRE randomises both the variables used and the thresholds adopted for splitting in the upper layers of trees. This randomised structure reduces the retraining required: only the portions of the model whose structure must change to match the updated dataset are retrained.
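To make the sharding idea concrete, below is a minimal, illustrative sketch of SISA-style training and unlearning in Python using scikit-learn. The shard count, model choice, class name and synthetic data are assumptions made for this example; the published SISA framework also slices data within shards and checkpoints models during training, which this sketch omits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SisaEnsemble:
    """Minimal SISA-style ensemble: each record is assigned to exactly one
    shard, so erasing a record only requires retraining that shard's model."""

    def __init__(self, n_shards=5):
        self.n_shards = n_shards
        self.shards = [dict() for _ in range(n_shards)]  # shard -> {id: (x, y)}
        self.record_shard = {}                           # id -> shard index
        self.models = [None] * n_shards

    def fit(self, record_ids, X, y):
        # Assign records to shards round-robin, then train one model per shard.
        for i, (rid, xi, yi) in enumerate(zip(record_ids, X, y)):
            shard = i % self.n_shards
            self.record_shard[rid] = shard
            self.shards[shard][rid] = (xi, yi)
        for s in range(self.n_shards):
            self._train_shard(s)

    def _train_shard(self, s):
        X = np.array([x for x, _ in self.shards[s].values()])
        y = np.array([label for _, label in self.shards[s].values()])
        self.models[s] = LogisticRegression(max_iter=1000).fit(X, y)

    def unlearn(self, record_id):
        # Drop the record and retrain only the affected shard's model.
        s = self.record_shard.pop(record_id)
        self.shards[s].pop(record_id)
        self._train_shard(s)

    def predict(self, X):
        # Aggregate shard models by majority vote.
        votes = np.array([m.predict(X) for m in self.models])
        return (votes.mean(axis=0) >= 0.5).astype(int)


# Illustrative usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
ids = [f"cust_{i}" for i in range(500)]

ensemble = SisaEnsemble(n_shards=5)
ensemble.fit(ids, X, y)
ensemble.unlearn("cust_42")  # retrains one of the five shard models only
```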
Studies using real-world datasets have shown that both of these techniques are materially faster than retraining models from scratch, while sacrificing very little predictive performance. For instance, researchers found that DaRE models were, on average, up to twice as fast as retraining from scratch with no loss in model accuracy, and two to three times faster if slightly worse predictive performance was tolerated. However, these emerging techniques come with drawbacks, including large storage costs and limited applicability across machine learning algorithms. There are also likely to be challenges in implementing them in practice, as they would require retraining of existing models and a fundamental redesign of existing machine learning pipelines, with unclear effects.
The future looks …?
The proposed changes to the Privacy Act may place significant governance and compliance burdens on organisations. Although the application of existing data privacy approaches and emerging unlearning techniques to machine learning has shown some promise, this area of research is still in the very early stages of development, and there may be fundamental limits on achieving compliance via shortcuts. In future, it will be important to promote and invest in research for commercially robust solutions that maintain privacy, while also providing other desirable model properties such as transparency and explainability. In the interim, if the proposed amendments to the Privacy Act are enacted and interpreted as applying to machine learning models, organisations may be forced to significantly simplify their modelling approaches and infrastructure to have a better chance of meeting compliance obligations.
This is an edited version of an original article published by Taylor Fry.