Travis Dick, Cynthia Dwork, Michael Kearns, Terrance Liu, Aaron Roth, Giuseppe Vietri, Zhiwei Steven Wu
A reconstruction attack on a private dataset $D$ takes as input some publicly accessible information about the dataset and produces a list of candidate elements of $D$. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of $D$ from aggregate query statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identity theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled, demonstrating that they are exploiting information in the aggregate statistics $Q(D)$, and not simply the overall structure of the distribution. In other words, the queries $Q(D)$ are permitting reconstruction of elements of this dataset, not the distribution from which $D$ was drawn. These findings are established both on 2010 U.S. decennial Census data and queries and Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.
Sumegha Garg, Christopher Jung, Omer Reingold, Aaron Roth
A recent line of work has shown a surprising connection between multicalibration, a multi-group fairness notion, and omniprediction, a learning paradigm that provides simultaneous loss minimization guarantees for a large family of loss functions. Prior work studies omniprediction in the batch setting. We initiate the study of omniprediction in the online adversarial setting. Although there exist algorithms for obtaining notions of multicalibration in the online adversarial setting, unlike batch algorithms, they work only for small finite classes of benchmark functions $F$, because they require enumerating every function $f \in F$ at every round. In contrast, omniprediction is most interesting for learning theoretic hypothesis classes $F$, which are generally continuously large. We develop a new online multicalibration algorithm that is well defined for infinite benchmark classes $F$, and is oracle efficient (i.e. for any class $F$, the algorithm has the form of an efficient reduction to a no-regret learning algorithm for $F$). The result is the first efficient online omnipredictor -- an oracle efficient prediction algorithm that can be used to simultaneously obtain no regret guarantees to all Lipschitz convex loss functions. For the class $F$ of linear functions, we show how to make our algorithm efficient in the worst case. Also, we show upper and lower bounds on the extent to which our rates can be improved: our oracle efficient algorithm actually promises a stronger guarantee called swap-omniprediction, and we prove a lower bound showing that obtaining $O(\sqrt{T})$ bounds for swap-omniprediction is impossible in the online setting. On the other hand, we give a (non-oracle efficient) algorithm which can obtain the optimal $O(\sqrt{T})$ omniprediction bounds without going through multicalibration, giving an information theoretic separation between these two solution concepts.
Krishna Acharya, Eshwar Ram Arunachaleswaran, Sampath Kannan, Aaron Roth, Juba Ziani
We study the problem of online prediction, in which at each time step $t$, an individual $x_t$ arrives, whose label we must predict. Each individual is associated with various groups, defined based on their features such as age, sex, race, etc., which may intersect. Our goal is to make predictions that have regret guarantees not just overall but also simultaneously on each sub-sequence comprised of the members of any single group. Previous work such as [Blum & Lykouris] and [Lee et al.] provides attractive regret guarantees for these problems; however, these are computationally intractable on large model classes. We show that a simple modification of the sleeping experts technique of [Blum & Lykouris] yields an efficient reduction to the well-understood problem of obtaining diminishing external regret absent group considerations. Our approach gives similar regret guarantees compared to [Blum & Lykouris]; however, we run in time linear in the number of groups, and are oracle-efficient in the hypothesis class. This in particular implies that our algorithm is efficient whenever the number of groups is polynomially bounded and the external-regret problem can be solved efficiently, an improvement on [Blum & Lykouris]'s stronger condition that the model class must be small. Our approach can handle online linear regression and online combinatorial optimization problems like online shortest paths. Beyond providing theoretical regret bounds, we evaluate this algorithm with an extensive set of experiments on synthetic data and on two real data sets -- Medical costs and the Adult income dataset, both instantiated with intersecting groups defined in terms of race, sex, and other demographic characteristics. We find that uniformly across groups, our algorithm gives substantial error improvements compared to running a standard online linear regression algorithm with no groupwise regret guarantees.
Katrina Ligett, Aaron Roth
In this paper, we consider the problem of estimating a potentially sensitive (individually stigmatizing) statistic on a population. In our model, individuals are concerned about their privacy, and experience some cost as a function of their privacy loss. Nevertheless, they would be willing to participate in the survey if they were compensated for their privacy cost. These cost functions are not publicly known, however, nor do we make Bayesian assumptions about their form or distribution. Individuals are rational and will misreport their costs for privacy if doing so is in their best interest. Ghosh and Roth recently showed in this setting, when costs for privacy loss may be correlated with private types, if individuals value differential privacy, no individually rational direct revelation mechanism can compute any non-trivial estimate of the population statistic. In this paper, we circumvent this impossibility result by proposing a modified notion of how individuals experience cost as a function of their privacy loss, and by giving a mechanism which does not operate by direct revelation. Instead, our mechanism has the ability to randomly approach individuals from a population and offer them a take-it-or-leave-it offer. This is intended to model the abilities of a surveyor who may stand on a street corner and approach passers-by.
Anupam Gupta, Moritz Hardt, Aaron Roth, Jonathan Ullman
Suppose we would like to know all answers to a set of statistical queries C on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in C. Can we do any better? First, we show that the number of statistical queries necessary and sufficient for this task is---up to polynomial factors---equal to the agnostic learning complexity of C in Kearns' statistical query (SQ) model. This gives a complete answer to the question when running time is not a concern. Second, we show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis: here, our second result leads to the first algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average error. This presents significant progress on a key open problem in privacy-preserving data analysis. Our first result, on the other hand, gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ-model as a new barrier in the design of differentially private algorithms.
Avrim Blum, Katrina Ligett, Aaron Roth
In this paper we demonstrate that, ignoring computational constraints, it is possible to privately release synthetic databases that are useful for large classes of queries -- much larger in size than the database itself. Specifically, we give a mechanism that privately releases synthetic data for a class of queries over a discrete domain with error that grows as a function of the size of the smallest net approximately representing the answers to that class of queries. We show that this in particular implies a mechanism for counting queries that gives error guarantees that grow only with the VC-dimension of the class of queries, which itself grows only logarithmically with the size of the query class. We also show that it is not possible to privately release even simple classes of queries (such as intervals and their generalizations) over continuous domains. Despite this, we give a privacy-preserving polynomial time algorithm that releases information useful for all halfspace queries, given a slight relaxation of the utility guarantee. This algorithm does not release synthetic data, but instead another data structure capable of representing an answer for each query. We also give an efficient algorithm for releasing synthetic data for the class of interval queries and axis-aligned rectangles of constant dimension. Finally, inspired by learning theory, we introduce a new notion of data privacy, which we call distributional privacy, and show that it is strictly stronger than the prevailing privacy notion, differential privacy.
Justin Hsu, Aaron Roth, Jonathan Ullman
We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting differential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data analysts even if the other data analysts collude or register multiple accounts. In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer analyst privacy), and we are able to extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the private query-release problem as a two-player zero-sum game, which may be of independent interest.
Hadi Elzayn, Shahin Jabbari, Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, Zachary Schutzman
Settings such as lending and policing can be modeled by a centralized agent allocating a resource (loans or police officers) amongst several groups, in order to maximize some objective (loans given that are repaid or criminals that are apprehended). Often in such problems fairness is also a concern. A natural notion of fairness, based on general principles of equality of opportunity, asks that conditional on an individual being a candidate for the resource, the probability of actually receiving it is approximately independent of the individual's group. In lending this means that equally creditworthy individuals in different racial groups have roughly equal chances of receiving a loan. In policing it means that two individuals committing the same crime in different districts would have roughly equal chances of being arrested. We formalize this fairness notion for allocation problems and investigate its algorithmic consequences. Our main technical results include an efficient learning algorithm that converges to an optimal fair allocation even when the frequency of candidates (creditworthy individuals or criminals) in each group is unknown. The algorithm operates in a censored feedback model in which only the number of candidates who received the resource in a given allocation can be observed, rather than the true number of candidates. This models the fact that we do not learn the creditworthiness of individuals we do not give loans to nor learn about crimes committed if the police presence in a district is low. As an application of our framework, we consider the predictive policing problem. The learning algorithm is trained on arrest data gathered from its own deployments on previous days, resulting in a potential feedback loop that our algorithm provably overcomes. We empirically investigate the performance of our algorithm on the Philadelphia Crime Incidents dataset.
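The fairness notion in the abstract above can be illustrated with a small check. This is only a sketch under assumptions that are not from the paper: the function names, the tolerance `alpha`, and the example numbers are hypothetical, and the sketch assumes the true number of candidates per group is known, which is exactly what the censored feedback model does not provide.

```python
def discovery_rates(candidates, served):
    """Fraction of each group's candidates who actually received the resource."""
    return {g: served[g] / candidates[g] for g in candidates if candidates[g] > 0}


def is_fair(candidates, served, alpha):
    """True if the per-group rates differ by at most alpha (hypothetical tolerance)."""
    rates = discovery_rates(candidates, served)
    return max(rates.values()) - min(rates.values()) <= alpha


# Hypothetical numbers: candidates per group and how many of them were served.
candidates = {"district_a": 40, "district_b": 10}
served = {"district_a": 20, "district_b": 4}
rates = discovery_rates(candidates, served)
```

In this illustration a candidate in `district_a` receives the resource with probability 0.5 and a candidate in `district_b` with probability 0.4, so the allocation passes the check only for tolerances of roughly 0.1 or more.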
Emily Diana, Wesley Gill, Ira Globus-Harris, Michael Kearns, Aaron Roth, Saeed Sharifi-Malvajerdi
We extend the notion of minimax fairness in supervised learning problems to its natural conclusion: lexicographic minimax fairness (or lexifairness for short). Informally, given a collection of demographic groups of interest, minimax fairness asks that the error of the group with the highest error be minimized. Lexifairness goes further and asks that amongst all minimax fair solutions, the error of the group with the second highest error should be minimized, and amongst all of those solutions, the error of the group with the third highest error should be minimized, and so on. Despite its naturalness, correctly defining lexifairness is considerably more subtle than minimax fairness, because of inherent sensitivity to approximation error. We give a notion of approximate lexifairness that avoids this issue, and then derive oracle-efficient algorithms for finding approximately lexifair solutions in a very general setting. When the underlying empirical risk minimization problem absent fairness constraints is convex (as it is, for example, with linear and logistic regression), our algorithms are provably efficient even in the worst case. Finally, we show generalization bounds -- approximate lexifairness on the training sample implies approximate lexifairness on the true distribution with high probability. Our ability to prove generalization bounds depends on our choosing definitions that avoid the instability of naive definitions.
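The informal definition above can be made concrete on a finite set of candidate models. The sketch below implements the exact (naive) lexicographic comparison, not the approximate notion the paper develops to handle sensitivity to approximation error; the function name, model names, and error values are hypothetical.

```python
def lexifair_choice(group_errors_by_model):
    """Pick the model whose group errors, sorted from highest to lowest,
    form the lexicographically smallest vector."""
    return min(
        group_errors_by_model,
        key=lambda name: sorted(group_errors_by_model[name], reverse=True),
    )


# Hypothetical per-group error rates for three candidate models.
errors = {
    "A": [0.30, 0.10, 0.20],
    "B": [0.30, 0.25, 0.05],
    "C": [0.35, 0.01, 0.01],
}
best = lexifair_choice(errors)
```

Models A and B tie on the worst-group error (0.30), so minimax fairness alone cannot separate them; the lexicographic rule then compares the second-highest errors (0.20 versus 0.25) and selects A.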
Christopher Jung, Sampath Kannan, Changhwa Lee, Mallesh M. Pai, Aaron Roth, Rakesh Vohra
There is increasing regulatory interest in whether machine learning algorithms deployed in consequential domains (e.g. in criminal justice) treat different demographic groups "fairly." However, there are several proposed notions of fairness, typically mutually incompatible. Using criminal justice as an example, we study a model in which society chooses an incarceration rule. Agents of different demographic groups differ in their outside options (e.g. opportunity for legal employment) and decide whether to commit crimes. We show that equalizing type I and type II errors across groups is consistent with the goal of minimizing the overall crime rate; other popular notions of fairness are not.
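As an illustration of the criterion named in the abstract above, per-group type I and type II error rates can be computed from labeled decisions. This is a sketch under stated assumptions: the record format and function name are hypothetical, and it adopts the common convention that a type I error is a positive decision for a true negative and a type II error is a negative decision for a true positive.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """records: iterable of (group, truth, decision) with boolean truth/decision.
    Returns {group: (type_I_rate, type_II_rate)}."""
    counts = defaultdict(lambda: {"fp": 0, "neg": 0, "fn": 0, "pos": 0})
    for group, truth, decision in records:
        c = counts[group]
        if truth:
            c["pos"] += 1
            c["fn"] += int(not decision)  # missed a true positive
        else:
            c["neg"] += 1
            c["fp"] += int(bool(decision))  # flagged a true negative
    return {
        g: (
            c["fp"] / c["neg"] if c["neg"] else 0.0,
            c["fn"] / c["pos"] if c["pos"] else 0.0,
        )
        for g, c in counts.items()
    }


# Hypothetical records for two groups.
records = [
    ("a", True, True), ("a", True, False), ("a", False, False), ("a", False, True),
    ("b", True, True), ("b", True, True), ("b", False, False), ("b", False, True),
]
group_rates = error_rates_by_group(records)
```

In this toy data the two groups have equal type I rates but different type II rates, so the decision rule would not satisfy the equalization criterion.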
Aaron Roth, Aleksandrs Slivkins, Jonathan Ullman, Zhiwei Steven Wu
We study the problem of a seller dynamically pricing $d$ distinct types of indivisible goods, when faced with the online arrival of unit-demand buyers drawn independently from an unknown distribution. The goods are not in limited supply, but can only be produced at a limited rate and are costly to produce. The seller observes only the bundle of goods purchased each day, but nothing else about the buyer's valuation function. Our main result is a dynamic pricing algorithm for optimizing welfare (including the seller's cost of production) that runs in time and a number of rounds that are polynomial in $d$ and the approximation parameter. We are able to do this despite the fact that (i) the price-response function is not continuous, and even its fractional relaxation is a non-concave function of the prices, and (ii) the welfare is not observable to the seller. We derive this result as an application of a general technique for optimizing welfare over \emph{divisible} goods, which is of independent interest. When buyers have strongly concave, Hölder continuous valuation functions over $d$ divisible goods, we give a general polynomial time dynamic pricing technique. We are able to apply this technique to the setting of unit demand buyers despite the fact that in that setting the goods are not divisible, and the natural fractional relaxation of a unit demand valuation is not strongly concave. In order to apply our general technique, we introduce a novel price randomization procedure which has the effect of implicitly inducing buyers to "regularize" their valuations with a strongly concave function. Finally, we also extend our results to a limited-supply setting in which the number of copies of each good cannot be replenished.
Michael Kearns, Aaron Roth, Zhiwei Steven Wu, Grigory Yaroslavtsev
Motivated by tensions between data privacy for individual citizens, and societal priorities such as counterterrorism and the containment of infectious disease, we introduce a computational model that distinguishes between parties for whom privacy is explicitly protected, and those for whom it is not (the targeted subpopulation). The goal is the development of algorithms that can effectively identify and take action upon members of the targeted subpopulation in a way that minimally compromises the privacy of the protected, while simultaneously limiting the expense of distinguishing members of the two groups via costly mechanisms such as surveillance, background checks, or medical testing. Within this framework, we provide provably privacy-preserving algorithms for targeted search in social networks. These algorithms are natural variants of common graph search methods, and ensure privacy for the protected by the careful injection of noise in the prioritization of potential targets. We validate the utility of our algorithms with extensive computational experiments on two large-scale social network datasets.
Sampath Kannan, Jamie Morgenstern, Ryan Rogers, Aaron Roth
We consider the problem of implementing an individually rational, asymptotically Pareto optimal allocation in a barter-exchange economy where agents are endowed with goods and have preferences over the goods of others, but may not use money as a medium of exchange. Because one of the most important instantiations of such economies is kidney exchange -- where the "input" to the problem consists of sensitive patient medical records -- we ask to what extent such exchanges can be carried out while providing formal privacy guarantees to the participants. We show that individually rational allocations cannot achieve any non-trivial approximation to Pareto optimality if carried out under the constraint of differential privacy -- or even the relaxation of \emph{joint} differential privacy, under which it is known that asymptotically optimal allocations can be computed in two-sided markets, where there is a distinction between buyers and sellers and we are concerned only with privacy of the buyers~\citep{Matching}. We therefore consider a further relaxation that we call \emph{marginal} differential privacy -- which promises, informally, that the privacy of every agent $i$ is protected from every other agent $j \neq i$ so long as $j$ does not collude or share allocation information with other agents. We show that, under marginal differential privacy, it is possible to compute an individually rational and asymptotically Pareto optimal allocation in such exchange economies.
Arpita Ghosh, Katrina Ligett, Aaron Roth, Grant Schoenebeck
We consider the problem of designing a survey to aggregate non-verifiable information from a privacy-sensitive population: an analyst wants to compute some aggregate statistic from the private bits held by each member of a population, but cannot verify the correctness of the bits reported by participants in his survey. Individuals in the population are strategic agents with a cost for privacy, i.e., they not only account for the payments they expect to receive from the mechanism, but also their privacy costs from any information revealed about them by the mechanism's outcome---the computed statistic as well as the payments---to determine their utilities. How can the analyst design payments to obtain an accurate estimate of the population statistic when individuals strategically decide both whether to participate and whether to truthfully report their sensitive information? We design a differentially private peer-prediction mechanism that supports accurate estimation of the population statistic as a Bayes-Nash equilibrium in settings where agents have explicit preferences for privacy. The mechanism requires knowledge of the marginal prior distribution on bits $b_i$, but does not need full knowledge of the marginal distribution on the costs $c_i$, instead requiring only an approximate upper bound. Our mechanism guarantees $ε$-differential privacy to each agent $i$ against any adversary who can observe the statistical estimate output by the mechanism, as well as the payments made to the $n-1$ other agents $j\neq i$. Finally, we show that with slightly more structured assumptions on the privacy cost functions of each agent, the cost of running the survey goes to $0$ as the number of agents diverges.
Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, Zhiwei Steven Wu
Bandit learning is characterized by the tension between long-term exploration and short-term exploitation. However, as has recently been noted, in settings in which the choices of the learning algorithm correspond to important decisions about individual people (such as criminal recidivism prediction, lending, and sequential drug trials), exploration corresponds to explicitly sacrificing the well-being of one individual for the potential future benefit of others. This raises a fairness concern. In such settings, one might like to run a "greedy" algorithm, which always makes the (myopically) optimal decision for the individuals at hand - but doing this can result in a catastrophic failure to learn. In this paper, we consider the linear contextual bandit problem and revisit the performance of the greedy algorithm. We give a smoothed analysis, showing that even when contexts may be chosen by an adversary, small perturbations of the adversary's choices suffice for the algorithm to achieve "no regret", perhaps (depending on the specifics of the setting) with a constant amount of initial training data. This suggests that "generically" (i.e. in slightly perturbed environments), exploration and exploitation need not be in conflict in the linear setting.
Stephen Gillen, Christopher Jung, Michael Kearns, Aaron Roth
We consider the problem of online learning in the linear contextual bandits setting, but in which there are also strong individual fairness constraints governed by an unknown similarity metric. These constraints demand that we select similar actions or individuals with approximately equal probability (arXiv:1104.3913), which may be at odds with optimizing reward, thus modeling settings where profit and social policy are in tension. We assume we learn about an unknown Mahalanobis similarity metric from only weak feedback that identifies fairness violations, but does not quantify their extent. This is intended to represent the interventions of a regulator who "knows unfairness when he sees it" but nevertheless cannot enunciate a quantitative fairness metric over individuals. Our main result is an algorithm in the adversarial context setting that has a number of fairness violations that depends only logarithmically on $T$, while obtaining an optimal $O(\sqrt{T})$ regret bound to the best fair policy.
Justin Hsu, Sanjeev Khanna, Aaron Roth
In this paper, we give efficient algorithms and lower bounds for solving the heavy hitters problem while preserving differential privacy in the fully distributed local model. In this model, there are $n$ parties, each of which possesses a single element from a universe of size $N$. The heavy hitters problem is to find the identity of the most common element shared amongst the $n$ parties. In the local model, there is no trusted database administrator, and so the algorithm must interact with each of the $n$ parties separately, using a differentially private protocol. We give tight information-theoretic upper and lower bounds on the accuracy to which this problem can be solved in the local model (giving a separation between the local model and the more common centralized model of privacy), as well as computationally efficient algorithms even in the case where the data universe $N$ may be exponentially large.
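To make the local model above concrete, here is a minimal sketch of a standard baseline, k-ary randomized response with debiased counts. It is not the algorithm of the paper, and it is only practical for small universes because it enumerates every element; all function names and parameters are hypothetical. Each party keeps its true element with probability $p = e^{\epsilon}/(e^{\epsilon}+N-1)$ and otherwise reports a uniformly random other element, which satisfies $\epsilon$-differential privacy for that party's report.

```python
import math
import random
from collections import Counter


def randomize(value, universe_size, epsilon, rng):
    """Local randomizer: keep `value` w.p. p, else a uniform *other* element."""
    p = math.exp(epsilon) / (math.exp(epsilon) + universe_size - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(universe_size - 1)  # uniform over the remaining elements
    return other if other < value else other + 1


def estimate_counts(reports, universe_size, epsilon):
    """Unbiased count estimates: invert E[observed_v] = n_v * p + (n - n_v) * q."""
    n = len(reports)
    e = math.exp(epsilon)
    p = e / (e + universe_size - 1)
    q = 1.0 / (e + universe_size - 1)
    observed = Counter(reports)
    return [(observed[v] - n * q) / (p - q) for v in range(universe_size)]


def heavy_hitter(reports, universe_size, epsilon):
    """Identity of the element with the largest estimated count."""
    est = estimate_counts(reports, universe_size, epsilon)
    return max(range(universe_size), key=est.__getitem__)


# Hypothetical usage: 10,000 parties, universe of size 8, element 3 is most common.
rng = random.Random(0)
data = [3] * 6000 + [rng.randrange(8) for _ in range(4000)]
reports = [randomize(x, 8, 1.0, rng) for x in data]
winner = heavy_hitter(reports, 8, 1.0)
```

The server only ever sees `reports`, never `data`. The noise in each estimated count grows with the universe size and shrinks with the number of parties, which is one reason large universes call for more careful algorithms than this baseline.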
Aaron Roth, Grant Schoenebeck
We consider the problem of conducting a survey with the goal of obtaining an unbiased estimator of some population statistic when individuals have unknown costs (drawn from a known prior) for participating in the survey. Individuals must be compensated for their participation and are strategic agents, and so the payment scheme must incentivize truthful behavior. We derive optimal truthful mechanisms for this problem for the two goals of minimizing the variance of the estimator given a fixed budget, and minimizing the expected cost of the survey given a fixed variance goal.
Ira Globus-Harris, Varun Gupta, Michael Kearns, Aaron Roth
There is a long history in machine learning of model ensembling, beginning with boosting and bagging and continuing to the present day. Much of this history has focused on combining models for classification and regression, but recently there is interest in more complex settings such as ensembling policies in reinforcement learning. Strong connections have also emerged between ensembling and multicalibration techniques. In this work, we further investigate these themes by considering a setting in which we wish to ensemble models for multidimensional output predictions that are in turn used for downstream optimization. More precisely, we imagine we are given a number of models mapping a state space to multidimensional real-valued predictions. These predictions form the coefficients of a linear objective that we would like to optimize under specified constraints. The fundamental question we address is how to improve and combine such models in a way that outperforms the best of them in the downstream optimization problem. We apply multicalibration techniques that lead to two provably efficient and convergent algorithms. The first of these (the \emph{white box} approach) requires being given models that map states to output predictions, while the second (the \emph{black box} approach) requires only policies (mappings from states to solutions to the optimization problem). For both, we provide convergence and utility guarantees. We conclude by investigating the performance and behavior of the two algorithms in a controlled experimental setting.
Osbert Bastani, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ramalingam, Aaron Roth
We give a simple, generic conformal prediction method for sequential prediction that achieves target empirical coverage guarantees against adversarially chosen data. It is computationally lightweight -- comparable to split conformal prediction -- but does not require having a held-out validation set, and so all data can be used for training models from which to derive a conformal score. It gives stronger than marginal coverage guarantees in two ways. First, it gives threshold calibrated prediction sets that have correct empirical coverage even conditional on the threshold used to form the prediction set from the conformal score. Second, the user can specify an arbitrary collection of subsets of the feature space -- possibly intersecting -- and the coverage guarantees also hold conditional on membership in each of these subsets. We call our algorithm MVP, short for MultiValid Prediction. We give both theory and an extensive set of empirical evaluations.