Statistical anonymity: Quantifying reidentification risks without reidentifying users
Abstract
Data anonymization is an approach to privacy-preserving data release aimed at preventing the reidentification of participants, and it is an important alternative to differential privacy in applications that cannot tolerate noisy data. Existing algorithms for enforcing k-anonymity in the released data assume that the curator performing the anonymization has complete access to the original data. Reasons for limiting this access range from undesirability to complete infeasibility. This paper explores ideas (objectives, metrics, protocols, and extensions) for reducing the trust that must be placed in the curator, while still maintaining a statistical notion of k-anonymity. We suggest trust (the amount of information provided to the curator) and privacy (the anonymity of the participants) as the primary objectives of such a framework. We describe a class of protocols aimed at achieving these goals, proposing new metrics of privacy in the process, and proving related bounds. We conclude by discussing a natural extension of this work that completely removes the need for a central curator.

Google Research, New York, US; Gatsby Unit, University College London, UK. Correspondence to: Gecia Bravo-Hermsdorff <gecia@google.com>.

1. Releasing private data (Background)

As the use of big data continues to permeate modern society, so does the sharing of our personal data with centralized third parties. For example, the U.S. Census Bureau shares aggregated population statistics with lawmakers (Abowd, 2018), and hospitals share medical information with insurance companies (Crellin & BCE, 2011). If unregulated, this type of information sharing poses a threat to individual privacy. A trivial way to completely protect the privacy of individuals would be to simply not share any of their information, but such an absolutist approach is neither feasible nor useful. A sensible compromise is to develop methods that balance the usefulness of the data against the privacy lost by the individuals. Two common frameworks for privacy-preserving data release are: differential privacy, i.e., DP (and its various extensions, e.g., Rényi differential privacy), and k-anonymity (and its various extensions, e.g., t-closeness).

1.1. A quick (incomplete) summary of DP

In the central model of differential privacy (Dwork et al., 2006), a trusted curator stores the database, and an analyst issues queries about the database to the curator, who returns noisy responses. (Note that here the "analyst" and the "public" are the same entity, since any data observed by the analyst could be seen by anyone else.) Such an approach requires the users to trust the curator with the entirety of their private data. Several models have been proposed to relax this requirement. In the local model, each user adds noise to their own data and responds to the analyst directly (Evfimievski et al., 2003). In the shuffle model, each user encrypts their noisy data (such that only the analyst may read them) and sends them to a trusted shuffler. The shuffler then randomly permutes these encrypted messages before forwarding them to the analyst (Cheu et al., 2019).

1.2. A quick (incomplete) summary of k-anonymity

A dataset satisfies k-anonymity if, for every individual whose data are contained in the dataset, their data are indistinguishable from those of at least k − 1 other individuals (also present in this dataset). Since k-anonymity was first introduced (Sweeney, 2002), efficient algorithms for anonymizing a database (while preserving the maximum amount of information possible) have received increasing interest.
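To make this definition concrete, the following is a minimal sketch (not taken from this paper) of how one might check k-anonymity of a small table with respect to a set of quasi-identifying columns; the table layout and column names are hypothetical.

    from collections import Counter

    def is_k_anonymous(rows, quasi_identifiers, k):
        """Return True if every combination of quasi-identifier values
        appears in at least k rows of the table."""
        groups = Counter(tuple(row[qi] for qi in quasi_identifiers) for row in rows)
        return all(count >= k for count in groups.values())

    # Hypothetical toy table: "age" and "zip" are quasi-identifiers,
    # "diagnosis" is a sensitive attribute.
    records = [
        {"age": 34, "zip": "10001", "diagnosis": "A"},
        {"age": 34, "zip": "10001", "diagnosis": "B"},
        {"age": 51, "zip": "10002", "diagnosis": "C"},
    ]
    print(is_k_anonymous(records, ["age", "zip"], k=2))  # False: the last row is unique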
Local suppression algorithms aim to achieve this by redacting specific (feature, user) entries of the database (Meyerson & Williams, 2004), while global suppression algorithms redact the same set of features for every user (El Emam et al., 2009). Meyerson & Williams (2004) showed that the problem of optimally anonymizing a database by either local or global suppression is NP-hard. In light of these results, several approximation algorithms have been proposed, particularly for local suppression (Aggarwal et al., 2005; Gkoulalas-Divanis et al., 2014). Similar to the central model of differential privacy, these algorithms/curators require access to the entire private data. Unlike differential privacy, variants of k-anonymity that reduce the trust that participants must place in the curator remain relatively unexplored.

2. Why we focus on k-anonymity (Motivation)

Differential privacy (DP) (Dwork et al., 2014) is a measure of privacy loss (typically denoted by ε) that holds no matter what (e.g., even if additional information is released in the future). As a result of this strong property, any DP algorithm must be stochastic (e.g., by adding noise to the data). This, however, can be undesirable in a variety of applications (see Section 2.3 for examples). In contrast, while k-anonymity can be satisfied without adding noise to the data, its privacy guarantees are contingent on the auxiliary information available to an adversary (see (Narayanan & Shmatikov, 2008) for a famous example involving Netflix).

2.1. There is no panacea for private data

As differential privacy offers an upper bound on each instance of privacy loss that holds regardless of anything else, it has a simple composition rule that can be invoked without further assumptions. Perhaps for this reason, DP is currently the de facto academic definition of privacy. As privacy becomes an increasingly pressing societal issue, it seems natural that the entities managing our private data would like to offer meaningful privacy guarantees. Unfortunately, despite being the "gold standard", DP is often touted with essentially meaningless parameters (Domingo-Ferrer et al., 2021). For example, the 2020 US census claims a "mathematical algorithm to ensure that the privacy of individuals is sufficiently protected" with a "budget" of ε = 19.61 (US Census Press Release CB21-CN.42). Setting aside a conspicuous similarity with the natural logarithm of the US population (estimated at 331 million in 2020; ln((331 ± 1) · 10^6) ≈ 19.617 ± 0.002), the guarantee being made is essentially meaningless: "your participation in the census will not change the likelihood of any outcome by more than a factor of 331 million." Given this clear rift in communication between theory and practice, it is fruitful to also consider privacy notions that might have fewer "translation" issues, despite their technically "weaker" guarantees.

2.2. Natural extensions of k-anonymity

For simplicity, consider the following setting: a database is to be released containing i.i.d. samples from the population, and its values can be split into two disjoint sets, "Quasi-Identifiers" (QI) and "Sensitive Attributes" (SA). QI are features that are not known to the adversary a priori, but could be learned (for some cost) via exogenous means. SA are features that are not known to the adversary, cannot be learned exogenously, and would be detrimental to the participant (and correspondingly valuable to the adversary) if learned by the adversary.
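As a rough illustration of how this QI/SA split interacts with the global suppression approach mentioned in Section 1.2, the following is a hedged sketch (not this paper's protocol) of a greedy heuristic that redacts entire QI columns, the same ones for every user, until the released QI are k-anonymous; it makes no attempt at the NP-hard optimum, and all names are hypothetical.

    from collections import Counter

    def smallest_group(rows, cols):
        """Size of the smallest set of rows sharing identical values on `cols`."""
        groups = Counter(tuple(row[c] for c in cols) for row in rows)
        return min(groups.values()) if groups else 0

    def greedy_global_suppression(rows, quasi_identifiers, k):
        """Greedily drop whole QI columns until every remaining QI combination
        occurs at least k times. A heuristic sketch only; optimal suppression
        is NP-hard (Meyerson & Williams, 2004)."""
        kept = list(quasi_identifiers)
        while kept and smallest_group(rows, kept) < k:
            # Drop the column whose removal leaves the largest smallest group.
            drop = max(kept, key=lambda col: smallest_group(rows, [c for c in kept if c != col]))
            kept.remove(drop)
        return kept  # QI columns safe to release; all others are redacted for everyone

For instance, on the toy table from the sketch in Section 1.2, this heuristic ends up redacting both QI columns for k = 2, since the third record is unique even on a single quasi-identifier.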
Many "scalar-word" anonymity measures can be classified by the assumptions they make about the sensitive attributes. The use of k-anonymity assumes that all SA are completely incomparable, while l-diversity (Machanavajjhala et al., 2007) allows for the possibility of identical SA (but is still blind to the magnitude of differences between them). Other metrics, such as t-closeness, δ-disclosure, and β-likeness, allow for a more general similarity metric between different SA (Khan et al., 2021).

The main goal of this paper is to understand the trade-off between the anonymization guarantees given to the participants and the trust they must place in the entity performing the anonymization. We believe that k-anonymity is a suitable notion to use as a proof of concept for introducing such a statistical relaxation. Extending this framework to more nuanced measures of anonymity would be of considerable practical interest.

2.3. Application examples

Essentially, we consider a setting in which the private variables (the Sensitive Attributes) are incomparable (i.e., there is no metric of similarity between them) and unique (no two private variables are identical). In such a setting, k-anonymity is equivalent to l-diversity, and extensions such as t-closeness and β-likeness do not make sense (as the SA have no notion of similarity).

For example, consider a database containing X-ray images (SA), along with some (quasi-identifying) demographics of the patients. The latter could likely be obtained by an adversary with minimal effort, whereas the former are essentially impossible to measure directly (without the explicit cooperation of the individual). Given the exposing nature of these SA, it is not a stretch to imagine an adversary using them for personal gain at the expense of the owner of the images. Moreover, the details of everyone's insides are rather unique.

Another application is that of preventing browser fingerprinting (Laperdrix et al., 2020). Malicious websites engaged in browser fingerprinting query detailed information about a user's device (e.g., which fonts they have installed). If these details are sufficiently unique, they can be used to covertly track a user across the web. While certain system details can be made less amenable to fingerprinting by adding noise to them (e.g., window size/resolution), the option of returning noisy responses is often not practical (e.g., for categorical attributes such as the browser type). Several browsers have proposed to prevent fingerprinting by ensuring that the information queried by a website is always k-anonymous, and blocking the query otherwise. However, the only way to completely guarantee k-anonymity is to grant a central curator access to the full data of every user.
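As a rough illustration of the fingerprinting defense just described (and not of this paper's protocol), the sketch below shows a central service that answers a fingerprinting-style query only when the querying user's configuration is shared by at least k users it has already seen; all names and the data layout are hypothetical.

    from collections import Counter

    class KAnonymousQueryGate:
        """Toy central curator: release a device configuration only if at least
        k registered users share exactly the same configuration; otherwise block."""

        def __init__(self, k):
            self.k = k
            self.config_counts = Counter()

        def register(self, config):
            # `config` is a hashable summary of the queried attributes,
            # e.g., (fonts_hash, timezone, platform).
            self.config_counts[config] += 1

        def answer_query(self, config):
            if self.config_counts[config] >= self.k:
                return config   # safe: the response is shared by at least k users
            return None         # blocked: answering would single out the user

    # Hypothetical usage
    gate = KAnonymousQueryGate(k=3)
    for cfg in [("fonts_a", "UTC+1", "linux")] * 3 + [("fonts_b", "UTC-5", "mac")]:
        gate.register(cfg)
    print(gate.answer_query(("fonts_a", "UTC+1", "linux")))  # released
    print(gate.answer_query(("fonts_b", "UTC-5", "mac")))    # None (blocked)

Note that this toy gate is precisely the fully-trusted central curator that the remainder of the paper seeks to avoid: it must observe every user's configuration in the clear.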
3. The big picture (What we did)

In seeking a version of k-anonymity that does not require a fully-trusted curator, ...