CSCI-2390: Privacy-Conscious Computer Systems

Assignment 3: Differential Privacy in Practice

In this assignment, you are given access to aggregate data collected from the students in the class. The Instructor and TAs are also in that data. You will experiment with various attacks to identify Kinan's data, and then implement a differentially private histogram mechansim to protect that data.

Overall Instructions

Overview of what you will do:

Analyze the output of certain aggregates to identify Kinan's data.
Implement and analyze a differentially private histogram mechansim.
Study the impact of composing many DP releases on the overall privacy.

Please read the detailed instructions below, as they are designed to help make the assignment easier for you. Make notes about what you did when working on the different questions below, you will need to write a report describing these steps and your results!

Submission instructions: Your submission will consist of a written report and all of the python scripts that you filled into the stencil. In your report, please write a few sentences or a short paragraph for every question, highlighting your approach and your findings. Do not just provide your answer in one word! Please show us your rational/steps.

Setup

Task: Please fork and clone this template Github repo with the assignment stencil to your computer.

You have API access to the data we collected from the class prior to releasing this assignment. You will use API access to run various aggregate queries over the data to solve the assignment questions.

Please use the client.py file we provide to interact with the dataset. The dataset is hosted on our own server, and client.py sends your queries to that server via a REST API. The server executes the query over the data and sends back the aggregate results.

You need python3 to use the client script. You may also need to install the required Python packages listed in requirements.txt. We recommend you use a Python virtual environment for this. A common way to do this is by using the following:

$ python3 -m venv venv
$ . venv/bin/activate
$ pip install -r requirements.txt

The stencil also includes three Python files, one for each of the programming questions below. The files contain some boilerplate setup and testing code. To answer the questions, you will need to implement the missing functionality.

After downloading and extracting the stencil, you can find information on how to run and use client.py and the API it exposes by running:

$ python3 client.py help

Part 1: Plain Aggregates and Privacy

To motivate the need for differential privacy, we will look at some simple attacks on non-noised aggregates that allow you to reconstruct or learn information about users in the dataset.

Our data set contains innocuous information collected from all the students in our class. The data also includes Kinan's answers to the same questions. Your task is to expose Kinan's information, and learn their answers.

Kinan kindly agreed to provide truthful answers to all the questions. You can also assume that Kinan never chooses "Prefer not to answer", and that they submitted exactly one response for each question.

We will start by looking at a specific aggregate: the count of people in the dataset that like a certain music genre, grouped by age.

Run this aggregate query using the provided client.py script as follows:

$ python3 client.py count age music

Question 1: Look at the result of this count query. Note that it does not include any name, email, or other personally identifiable information. What can you nevertheless learn about Kinan's musical tastes? What possible genres might they have chosen? Alternatively, what genres is it impossible for them to have chosen?

Regardless of your approach, it is unlikely that you can identify a single genre as Kinan's music preference with absolute certainty. However, you may be able to use other information available about Kinan online, and link it with the deductions you made earlier to learn more information about Kinan. This type of attack is referred to as a linking attack. While you're using a simple form of linking, far more complex linking attacks that can operate over probability distributions, rather than exact known answers, exist. This exercise is only meant to give you a brief intuition.

Task: Look Kinan up online and try to find information about their musical taste. You can start by Googling Kinan Bab and checking the first few pages of results. Prioritize results from online services where musical taste may be relevant.

Note that the name Kinan Bab is fairly common: we know of at least two other profiles with the same name on different services. Try to verify that the information you found truly belongs to our Kinan. Kinan generally uses the same profile picture on most services.

Hint: Try to use the music taste you learned to find out Kinan's exact age, which will help you identify the rest of his data. Try to see if you can find his age online.

Question 2: What did you find out about Kinan? Are your findings consistent with Question 1? Combine the two together to learn Kinan's exact age.

Now let's move on to another query. We will look at a similar grouped-count query (commonly known as a histogram), this time involving age and favorite color. You can get this query by running:

$ python3 client.py count age color

Question 3: Identify Kinan's favorite color. What is it? How easy or obvious is this to do, and why?

All the above attacks work in part because of the fine granularity of the histogram. Specifically, the counts are grouped by the exact age, and given the size and spread of our class, this means that many groups consist of a single (or handful of) person(s). This makes it easy to identify their data. (While we discourage you from trying to identify any classmates other than Kinan, think about how you would go about doing so and what auxiliary information you could use.)

One intuitive idea is to make the released groups more coarse-grained, so that any group contains data from many data subjects. This is the main intuition behind k-anonymity. This approach can sometimes work well in practice, but is vulnerable to some attacks. We will not consider k-anonymity in this assignment.

Let us take a look at a coarser-grained query. The count of people that prefer each sport grouped by an age group, rather than an exact age. Execute this query by running:

$ python3 client.py count agegroup sport

Question 4: What information can you learn about Kinan's favorite sport from the above query?

Let's assume that similar data was collected from last year's CSCI 2390. Kinan's data must be part of this data collected last time. We provide the same coarse grained histogram over this data below. You can assume Kinan's answer did not change from last year.

Age group	Favorite sport	Count
20 or less	Baseball	1
20 or less	E-Sports	1
20 or less	Basketball	3
20-25	E-Sports	4
20-25	Soccer	2
20-25	Hockey	1
20-25	American Football	3
20-25	Basketball	2
25-30	American Football	1
25-30	Soccer	1
25-30	E-Sports	3

Task: Combine what you learn from both datasets to learn more information about Kinan's favorite sport.
[Hint: If looking at the age group alone is not decisive enough, look at the exact age!]

This type of attack is called a differencing attack, because it uses the difference between two data sets to infer information.

Question 5: What is Kinan's favorite sport?

Part 2: Implementing Differential Privacy

We have seen examples of how various attacks can help an attacker learn sensitive information about individuals from aggregates. In this part, you will protect against such attacks using differential privacy.

Your task is to help Kinan release some of these helpful statistics about the course, without the risk that data of any given individual gets exposed. Specifically, we will implement a differential privacy mechansim to protect the first query (music tastes grouped by age) by adding Laplace noise.

In order to successfuly protect the data, we must first determine the parameters of the Laplace distribution from which we'll sample our noise.

Sensitivity: The first step is to find out the sensitivity of the function or query we want to protect. The sensitivity is the largest possible difference (in absolute value) between the function/query output when run over any two neighboring datasets. For us, this function/query is a histogram. In other words, it's a count of individuals that fall in any one of several non-overlapping groups. "Neighboring datasets" are two datasets that match in all but one user's data, which is present in only one of them.

Identify the sensitivity of the histogram query grouped by age and music taste. You may want to run the query again to inspect what the output schema looks like. Think of two concrete neighbor datasets. For example, the dataset in this assignment, and the same dataset with your own answers removed. How many groups in the query's result have a different count when your data is removed and by how much? Does this generalize to any two neighboring datasets?

Laplace distribution: You need to sample random noise from a Laplace distribution. The Laplace distribution has two parameters: the mean μ and scale b. For the Laplace differential privacy mechanism, these parameters are set to μ = 0 and b = sensitivity / epsilon, where epsilon is the privacy parameter. For now, set epsilon to 0.5. Note that this epsilon is not the total privacy budget, but a privacy parameter to the Laplace distribution for one query.

To get samples from a Laplace distribution, you can either implement your own function for sampling using inverse transform sampling, or you can use numpy's numpy.random.laplace(μ, b) out of the box.

Putting it all together: Think about what steps, conceptually, the code needs to perform:

It needs to get the non-noised query results. This happens by calling the count(["age", "music"], False) function to retrieve the histogram.
For each count in the histogram, the code needs to sample Laplace noise with the aforementioned parameters. Because counts are whole integers, you'll want to round the noise after sampling it.
The code needs to add the corresponding sampled noise to each count to produce the new noised histogram and print it to the screen.

Task: Implement the differentially private histogram release for the age and music histogram query. Add your implementation to the dp.py file in the provided stencil. Include your modified dp.py in your submission.
Note: Remember to round the noised values so that the counts remain integers.

Question 6: Run dp.py several times varying the epsilon privacy parameter for different values between 10 and 0.01, like so:

$ python3 dp.py 0.01 [...] $ python3 dp.py 0.1 [...] [etc.]
What happens when the privacy parameter grows larger or smaller? How does that affect privacy?

Now, let's investigate how the the noise affects one particular record. Consider the first row in the histogram. If you run the query multiple times (with the same privacy parameter epsilon), you will observe different reported counts, because the noise added is different every time. Ideally, we would add enough noise to create uncertainty about the count even when many queries run over the data. Let's investigate how well this works for different privacy parameters.

Task: Use the plotting code in dp.py to plot the frequency of observing different counts after applying the noise. The plotting code runs your dp_histogram for 150 iterations, and counts the frequency of each value in the first row. The x-axis represents the different counts observed, and the y-axis shows the frequency of each count value across the 150 queries.

Question 7: Look at the plot generated with privacy parameter epsilon = 0.5. What is the most likely value? What is the expected (i.e., average) value? How do they relate to the actual value (i.e., the query excuted without any noise via client.py)? How does the plot change for different values of the privacy parameter?
Please include the generated plot for epsilon = 0.5 in your submission.

Part 3: Differential Privacy and Composition

So far, we have only looked at histogram queries. Part of why differential privacy is powerful is because it allows other types of queries as well. In this part of the assignment, we will look at averages. Specifically, we will consder the average age grouped by programming experience. Our server implements a similar Laplace DP mechanism as you just implemented in part 2. For the programming experience category, the server only allows access to differentially private outputs! Because this function is an average, the sensitivity used by our server is different to that of a count. Our configured sensitivity relies on known bounds on the age of students in the class, but we don't tell you the sensitivity we use.

Run the DP average query once and observe the output:

$ python3 client.py dp avg age programming

Can you deduce with high certainity Kinan's programming experience level? What if you run the query two, three, or more times? (The average query excludes submissions that did not disclose their age or programming level.)

Differential privacy guarantees compose when multiple (or the same) functions are computed over the same dataset (or correlated datasets). The more aggregates are released over the dataset, the less privacy that dataset has overall. In our case, every time you run the query, new noise is sampled and added. Hence, we can treat each run as evaluating a new function.

You can programmatically access the noised averages by calling the avg(["programming"], "age", True) function in client.py.

Task: Complete the composition.py stencil file. The stencil makes many (200) calls to the underlying query function; think of a way to combine all these outputs to factor out the noise and reconstruct the exact values! You'll start by attacking your DP histogram from part 2.

Implementation note: The stencil is designed to help you make your attack re-usable for different queries. By default, it calls expose(lambda: dp_histogram(0.5)), which works with the DP histogram for part 2, but you'll find pre-defined code for exposing averages and counts.

Testing: The stencil helps you test your attack by print the output from the non-noised histogram query from part 1, followed by the results of your attack applied to the DP histogram. When they are similar or identical, you have succeeded. (You may wish to round floating point numbers to integers for this part, as the original histogram is based on integers.)

Question 8: Run the composition attack against the average age grouped by programming experience. What can you deduce from the exposed averages about Kinan's programming experience level? How confident are you in what you have deduced? Are there scenarios where they might be wrong?

To increase your confidence in your results and make it easier to identify Kinan's data, we can combine the exact averages from above with the counts of people with every level of programming expertise. Our server exposes a noised version of these counts (excluding submissions that did not disclose their age or programming level). You can retrieve these counts using:

$ python3 client.py dp count0 programming

You can also access these counts programmatically via the count0(["programming"], True) function in client.py.

Question 9: Reuse your composition attack from question 8 to compute the exact non-noised counts per programming experience level. Deduce Kinan's programming experience level, with high confidence, by looking at both the exposed counts and the previously exposed averages. Now summarize everything you've learned about Kinan!

Differential privacy guarantees are tightly governed by the privacy parameter (i.e., epsilon) of each query made. When making multiple queries q₁, ..., q_n, the privacy parameters compose linearly to give an overal privacy parameter epsilon = epsilon₁ + ... + epsilon_n. This is why you are able to reconstruct the actual data and remove the noise when you make many queries. For example, if you made 100 queries each with epsilon 0.5, the composition only provides DP guarantees with epsilon = 100 * 0.5 = 50, which are not meaningful. (Recall how poorly epsilon = 10 did in part 2!)

One common way composition is reasoned about in practice is via the use of a privacy budget. This is a recurring idea in many DP systems, including the ones we look at in class (PINQ and others). The system assigns each dataset a privacy budget B. Every time a query with some privacy parameter epsilon runs against that data, the system checks that the remaining privacy budget is larger than epsilon before computing the query. The system is responsible for updating the budget by subtracting epsilon from it after the query completes. Queries whose epsilons are larger than the privacy budget must be rejected.

Task Complete the BudgetTracker class in budget.py from the stencil. This class manages calls to the avg, count, and count0 functions in client.py. Your class should be initialized with a privacy budget, and should check and update that budget whenever a query happens. When the budget is consumed, your class must stop any further queries from executing. You can assume that any query costs epsilon = 0.5.

Question 10: Does the class you implemented suffice to truly enforce that a dataset is never used beyond a certain privacy budget? Can developers intentionally or unintentionally over-use the dataset beyond the privacy budget? At a very high level, how would you design a different privacy budget enforcment mechanism that does not suffer these drawbacks?

This is the end of the assignment! 🥳 Now, package up your answers and code and submit your work.

Task: Submit your report, plot, and code (in a ZIP file) by email to cs2390tas@lists.brown.edu by 11pm (Eastern time) on Friday, November 1, 2024.