
MTG 7 student questions

 On Sep 30, 2020, at 10:59 PM, Anonymous wrote:
> What is the relationship between data auditing and reporting a breach, such that they are
> categorized together?

Auditing is a technique, and a data breach is an event. You need auditing after a breach in order to
find out what data got compromised; if you don't have the technology set up to audit what happened,
you will not be able to comply with the requirements of the GDPR for what data controllers and
processors need to do after a data breach (e.g., notify affected data subjects). Auditing in
technology terms often involves fine-grained logging of events on a system, so that you can find out
what happened after the fact.
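
As a concrete illustration (a minimal sketch of my own, not from the paper), an audit layer might wrap every access to personal data and append a record to an append-only log:

```python
import json
import time

def audit(log_path, actor, action, key):
    # Append one audit record per data access to an append-only log.
    entry = {"ts": time.time(), "actor": actor, "action": action, "key": key}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def read_user_record(db, log_path, actor, key):
    audit(log_path, actor, "read", key)  # log *before* serving the data
    return db.get(key)
```

After a breach, searching this log for the affected keys tells you which data subjects you need to notify.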

Malte 
 On Sep 30, 2020, at 10:56 PM, Anonymous wrote:

> Despite the fact that the paper does manage to cover most GDPR articles for data storage, what
> would be some better ways to address these regulations without significantly sacrificing
> performance?

This particular interpretation indeed appears to be done not from an optimization or efficiency
perspective, but rather from a pure compliance perspective (which the paper notes several times is
antithetical to performance, cost, and reliability).

Finding ways to achieve compliance (strict or less strict) without burdening developers and while
preserving good performance is an important active research topic. For some ideas, see e.g., the
following papers:
 * https://arxiv.org/abs/2008.04936
 * https://cs.brown.edu/people/malte/pub/papers/2019-poly-gdpr.pdf
 * https://www.researchgate.net/profile/Michael_Brodie/publication/336822029_SchengenDB_A_Data_Protection_Database_Proposal/links/5dfa8b0d299bf10bc3643efe/SchengenDB-A-Data-Protection-Database-Proposal.pdf

Malte 
 On Wed, Sep 30, 2020 at 9:22 PM Anonymous wrote:
> Some of the GDPR compliance retrofitting the authors did was to the client as opposed to the
> database itself.  Is the client considered to be part of the database? Is it safe to put logic
> there and consider the system compliant?

I believe they did this because they wanted to minimize the changes to the database system. In the
case of Redis, adding features like metadata indexing would have required major changes to the
system, so they instead (for expediency) implemented these features in the client.

If the database and the client are run by the same company and all accesses go through the client
library, it seems reasonable to implement some features in the client library. (In fact, such "fat
clients" are quite common in large web companies like Facebook.) It becomes trickier when there
are ways to circumvent the client library, or when the database operator doesn't trust the clients
or cannot force them to use a particular client library.
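
For a rough idea of what this looks like, here is a sketch of mine using the redis-py client (the key names and class are invented, not the paper's actual code): the client library maintains secondary metadata indexes that Redis itself knows nothing about.

```python
import redis  # pip install redis

class GDPRClient:
    """Hypothetical 'fat client': GDPR metadata indexing lives here, not in Redis."""

    def __init__(self):
        self.r = redis.Redis()

    def put(self, key, value, owner, purpose):
        self.r.set(key, value)
        # Secondary indexes maintained by the client; to Redis, these are
        # just ordinary sets under invented "idx:" key names.
        self.r.sadd(f"idx:owner:{owner}", key)
        self.r.sadd(f"idx:purpose:{purpose}", key)

    def keys_for_owner(self, owner):
        # Enables subject-access and erasure requests without a full scan.
        return self.r.smembers(f"idx:owner:{owner}")
```

The caveat above applies directly: any writer that bypasses this client library leaves the indexes stale.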

Malte 
 On Wed, Sep 30, 2020 at 8:23 PM Anonymous wrote:
> In many cases, the controller and processor are the same entity, with potentially other processors
> downstream. I am unsure whether having an entity with operations belonging to both controller and
> processor (as categorized in 3.3) would have a meaningful impact on the makeup of these
> benchmarks and the design of the GDPR DBMSs. Two specific questions I have are:
> 1. Does a controller+processor entity end up possessing additional capability because of the
> composition not already listed in the operations in 3.3 and in the benchmark?

My take is that the controller role in the legal text of the GDPR isn't so much a technical role as
one consisting of responsibility and direction, and of defining rules by writing program code. The
controller can delegate most or all of the actual actions (including getting consent, serving data,
etc.) to data processors, but remains responsible for defining the rules.

In this sense, I find the paper's attempt to define particular workflows for users, controllers, and
processors somewhat artificial; in the real world there may well be processors who complete
operations beyond those listed, and controllers that run operations not listed in the paper.

In practice, many controllers are also processors, so the composition is common. But within the
taxonomy of the paper, I have no reason to expect that composing the roles would add extra
operations.

> 2. Does a controller+processor entity exhibit a different operation profile in the real world
> compared to a separate controller and processor? If so, does that mean that the GDPR benchmark
> workloads in the paper need to be extended or modified (different weights or distributions or maybe
> an entire new workload for the composed controller+processor)?

Perhaps. Consider a startup that builds its entire infrastructure on AWS services: they may not
perform or operate any actual data processing, but merely write code and glue these service APIs
together. Even when receiving a GDPR request, they might use the processor's services to satisfy
it. On the other hand, many companies do their own in-house processing (including, perhaps, on
employee laptops). So the set of operations listed in the paper might happen at the controller or
at the processor, or there may be different operations than those listed in the paper. My take is
that they largely defined these in order to have a workload for their benchmark; whether that
workload is generalizable or representative is open for debate.

Malte 
 On Oct 1, 2020, at 2:04 AM, Anonymous wrote:
> Why did they choose SystemC? I understand Postgres and Redis as common choices, but I have never
> heard of this resource.

While Postgres is somewhat of a reference architecture for databases, it's also known to be a
relatively slow DB compared to commercial offerings like Oracle, IBM DB2, or Microsoft SQL Server.
We don't know which DBMS SystemC is, but I would guess it's one of these three. It is likely that
the authors added it to compare to an "industry-strength" DBMS in addition to PostgreSQL to show
that their results generalize.

> What would a comparable performance on a graph database look like, where metadata storage is a
> first-class operation and an expectation of the platform?

I don't know what the performance on a graph database would be; my guess is that it depends on how
well you can map your metadata and data schemas to a graph. Facebook clearly does this (cf.
TAO/DELF), but it may not be natural for other applications. Running GDPRBench on a graph DB and
tuning its performance would be a fun CSCI 2390 class project!
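
To make the mapping question concrete, here is a toy sketch (my own illustration, unrelated to TAO/DELF internals) of GDPR metadata as first-class graph structure: users, data items, and purposes are nodes, while ownership and consent are edges, so "find all data for user U" is a single traversal.

```python
# Edges of a tiny property graph: (source, relationship, destination).
edges = [
    ("alice", "OWNS", "photo:42"),
    ("photo:42", "COLLECTED_FOR", "purpose:ads"),
    ("alice", "CONSENTED_TO", "purpose:ads"),
]

def data_owned_by(user):
    # One-hop traversal: all data items the user owns.
    return [dst for (src, rel, dst) in edges if src == user and rel == "OWNS"]

print(data_owned_by("alice"))  # ['photo:42']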

Malte 
 On Wed, Sep 30, 2020 at 9:51 PM Anonymous wrote:
> The way that they are implementing the GDPR regulation by deletion seems reasonable to me,
> although tedious. Are there other methods being used by other companies? How are they dealing with
> it?

Instantaneous deletion as described in the paper is unlikely to work in practical settings, since
companies in reality typically have offline backups (e.g., in Google's case, they actually have
magnetic tapes stored in a secure site, and erasing the data from those takes time). Instantaneous
deletion is also not required by the GDPR, which allows the data controller up to a month to respond
to a request.
In that sense, the paper targets a stricter compliance than is required by the regulation, and
(likely) pays a performance penalty for it.

Another practical approach we've seen is Facebook's DELF, which schedules an asynchronous deletion
(i.e., it's not instantaneous) within the "maximum allowable retention period". Many other companies
likely action these requests manually, so the deletion may happen when a human gets around to
running a script.
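
A rough sketch of the DELF-style approach (my simplification; the names and the retention window are illustrative):

```python
import heapq
import time

MAX_RETENTION_SECS = 30 * 24 * 3600  # illustrative one-month window

deletion_queue = []  # min-heap of (deadline, key)

def request_erasure(key):
    # Acknowledge immediately; the actual deletion happens asynchronously,
    # but must complete before the retention deadline.
    heapq.heappush(deletion_queue, (time.time() + MAX_RETENTION_SECS, key))

def deletion_worker(db):
    # Runs periodically in the background and deletes eagerly; the stored
    # deadline makes it easy to alert if a deletion ever runs late.
    while deletion_queue:
        deadline, key = heapq.heappop(deletion_queue)
        assert time.time() <= deadline, "retention period exceeded!"
        db.pop(key, None)
```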

Malte 
 On Wed, Sep 30, 2020 at 9:15 PM Anonymous wrote:
> I am wondering if the benchmark designed in this paper can be generalized to evaluate all kinds of
> database systems.

For the purpose of evaluating the impact of GDPR compliance, I believe so; at least it is the
authors' goal that the benchmark be generalizable in this way.  For other purposes, the benchmark is
likely not very useful (although YCSB, the benchmark it was derived from, is widely used for a
variety of purposes).

Malte 
 On Wed, Sep 30, 2020 at 9:01 PM Anonymous wrote:
> 1. Is it sufficient to assume that the database is in a steady state? What would change if we
> didn't assume so?

"Steady state" here means that over the course of the experiment, the total size of the database
neither increases nor decreases even as individual users apply writes and deletes. This makes it
easier to run a controlled experiment, but whether it is realistic is a good question. Most real
websites' databases tend to grow over time; the paper's argument that GDPR-mandated expiry dates
change this is somewhat debatable, since the expiry date only plays a role when the data no longer
needs to be processed for the purpose it was originally collected for. On Facebook or Twitter, old
posts remain accessible indefinitely (at least for now); hence, their data volume would seem to grow
continuously.
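
Concretely, a steady-state workload pairs every insert with a delete (my illustration, not the paper's actual harness):

```python
import random

def steady_state_ops(db, n_ops):
    # Each iteration deletes one existing record and inserts one new one,
    # so the total database size stays constant while records churn.
    for i in range(n_ops):
        victim = random.choice(list(db))
        del db[victim]
        db[f"key:{i}"] = f"value:{i}"
    return len(db)

db = {f"key:init:{i}": "v" for i in range(100)}
assert steady_state_ops(db, 1000) == 100  # size unchanged
```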

> 2. Is there similar information about other DBMSs?

Not to my knowledge! This was the first paper on GDPR compliance in the systems literature.

> 3. Why does PostgreSQL not support TTL?

I suspect because nobody needed the feature before; it's also tricky to implement efficiently, since
the database needs to either set timers for individual records' expiry times or continuously scan
the whole DB contents for expired records.
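
The scan-based approach might look like this in practice (a hypothetical schema of mine; PostgreSQL has no built-in TTL, so an expires_at column plus a periodic sweep emulates what Redis's EXPIRE gives you natively):

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical schema: personal_data(key, value, expires_at timestamptz).
conn = psycopg2.connect("dbname=gdpr_test")
with conn, conn.cursor() as cur:
    # One sweep; run it on a timer (cron, pg_cron, or an app-level loop).
    cur.execute("DELETE FROM personal_data WHERE expires_at < now()")
```

The trade-off mentioned above is visible here: the sweep repeatedly scans the table (or walks an index on expires_at), whereas per-record timers avoid scans but require one outstanding timer per row.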

Malte 
 On Wed, Sep 30, 2020 at 2:51 PM Anonymous wrote:
> The paper mentions that they expect rights violations/compliance complaints to follow a zipf
> distribution, which I interpreted as meaning that complaints will not be evenly distributed across
> companies but rather there will be a subset of companies responsible for the majority of
> complaints. However, I was wondering what they used to come to this conclusion? It seems as though
> it could be correct based on our GDPR case studies (for example, Google was disproportionately
> represented), which raises another question: how effective is the GDPR if the same companies keep
> violating its regulations instead of learning from their mistakes?

The Zipf distribution is a highly skewed distribution, often associated with skewed popularity of
online content. My read of the part you reference is that they assume that for each company, a small
fraction of its users will be responsible for most of the complaints it receives. In particular, the
paper references the "Google RTBF experience" in §4.2.2, which it earlier describes by saying that
"the requests showed a heavy skew towards a small number of users (top 0.25% users generated 20.8%
delisting)". In other words, there are some privacy-conscious or obnoxious users who generate many
requests; most users probably generate none.
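
You can get a feel for this skew by sampling per-user request counts from a Zipf distribution (the exponent a=1.5 is my guess; the paper doesn't specify one):

```python
import numpy as np

rng = np.random.default_rng(0)
requests = rng.zipf(1.5, size=100_000)  # one draw per "user"
requests.sort()
top_share = requests[-250:].sum() / requests.sum()  # top 0.25% of users
print(f"top 0.25% of users generate {top_share:.0%} of requests")
```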

Malte 
 On Mon, Oct 21, 2019 at 8:16 PM Anonymous wrote:
> As "Analyzing the Impact of GDPR on Storage Systems" points out, the overhead of logging everything that
> happened (even a read) in both the control plane and data plane seems too large and will penalize the
> performance badly. However, since this is required by GDPR, does that simply implies that it will be a huge
> pain to be Fully GDPR compliant since every operation must be persisted onto the disk for future audits? It
> seems not very feasible for me to become fully GDPR compliant for large internet companies.

This depends on how strict the logging has to be in the face of failures -- something that hasn't been tested
(in court) with regard to the GDPR yet. Some high-risk, regulated processes (e.g., financial transactions)
are already subject to stronger legal requirements for persistent logging, even at the cost of performance.
However, it seems plausible that batched logging (i.e., flushing to disk only occasionally) would be
deemed compliant in most situations.
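
A minimal sketch of what such batched logging could look like (my assumption about an acceptable design, not one that has been tested in court):

```python
import json
import os
import time

class BatchedLogger:
    """Buffer audit entries in memory; flush to disk every N entries or T seconds.

    Trades durability of the most recent entries for much better throughput.
    """

    def __init__(self, path, max_buffer=1000, max_delay=5.0):
        self.path, self.max_buffer, self.max_delay = path, max_buffer, max_delay
        self.buffer, self.last_flush = [], time.time()

    def log(self, entry):
        self.buffer.append(json.dumps(entry))
        if (len(self.buffer) >= self.max_buffer
                or time.time() - self.last_flush >= self.max_delay):
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with open(self.path, "a") as f:
            f.write("\n".join(self.buffer) + "\n")
            f.flush()
            os.fsync(f.fileno())  # the expensive part being amortized
        self.buffer, self.last_flush = [], time.time()
```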

Malte 
 On Mon, Oct 21, 2019 at 8:44 PM Anonymous wrote:
> Q. If GDPR compliance had to be divided into broad segments, what would be the other components other than
> storage?

At a guess: consent by the data subject, security of processing and data transfer, and implementation/fines.

> Q. Does the GDPR have any mandate on the ownership of shared data? Have there been 'GDPR property disputes',
> for lack of a better term?

The GDPR lacks a strict specification of what constitutes "ownership", presumably leaving this to be worked
out through examples adjudicated in the courts. I found this example related to different interpretations of
what "information associated with a data subject" should be taken to mean:
https://www.technologylawdispatch.com/2019/04/in-the-courts/how-not-to-restrict-gdpr-access-requests-in-employment-proceedings-german-court-establishes-high-threshold/.
Anecdotally, Spotify also appears to have a narrow interpretation that some users have challenged
successfully: https://twitter.com/steipete/status/1025024813889478656?lang=en.

Malte