Lecture notes: Algorithmic bias, testing tables, plotting

Algorithmic bias discussion

For Homework 2, you read two articles about algorithmic bias: the possibility that computer algorithms used to make decisions involving real people will end up being biased against or in favor of particular groups. One of the articles discussed the use of algorithms in criminal sentencing, and presented evidence that commonly used algorithms end up being biased against African-American defendants. For the in-class discussion, we talked in small groups about the following questions:

Do you think it could ever be a good idea to use algorithms in criminal sentencing? If not, how would you convince a judge of this? If so, how would you reduce the risk that algorithms lead to unfair outcomes? Think about both social and technical mechanisms.

Testing table programs

Here’s the program we ended up with last time:

include tables
include gdrive-sheets

include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")

ssid = "1jHvn5CPE6RkTTQRIXQbY5n5p4aiOH7fZsnwK2s6s6tc"
spreadsheet = load-spreadsheet(ssid)

all-municipalities = load-table: name :: String, city :: Boolean,
  population-2000 :: Number, population-2010 :: Number
  # true because the sheet has a "header" row
  source: spreadsheet.sheet-by-name("municipalities", true)
end

fun is-town(r :: Row) -> Boolean:
  not(r["city"])
end

fun percent-change(r :: Row) -> Number:
  (r["population-2010"] - r["population-2000"]) /
  r["population-2000"]
end

fun fastest-growing-towns(municipalities :: Table) -> Table:
  towns = filter-by(municipalities, is-town)
  towns-with-percent-change = build-column(towns, "percent-change", percent-change)
  sort-by(towns-with-percent-change, "percent-change", false)
end

We’ve done a bit of a bad thing here: we have written three functions, but we don’t have tests for any of them! Let’s see how we can rectify this.

We can test table programs by using test tables–tables with the same structure as the larger tables we are interested in, but which are smaller and contain data that are useful for testing. What would good test data look like for this problem?

test-municipalities = table: name, city, population-2000, population-2010
  row: "City", true, 100, 101
  row: "Town 1", false, 100, 102
  row: "Town 2", false, 100, 99
  row: "Town 3", false, 50, 54
end

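This table has one city, which should be filtered out, and three towns: two that grew and one that shrank, listed in an order that is not already sorted by growth. Let’s see how we use these test data to test our functions.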

First, we can test is-town. This is a boolean function; we want to make sure it works on town and city rows.

fun is-town(r :: Row) -> Boolean:
  not(r["city"])
where:
  is-town(test-municipalities.row-n(0)) is false
  is-town(test-municipalities.row-n(1)) is true
  is-town(test-municipalities.row-n(2)) is true
end

How about percent-change? In the end we’ll only want to use it on towns, but we can test it on city rows, too.

fun percent-change(r :: Row) -> Number:
  (r["population-2010"] - r["population-2000"]) /
  r["population-2000"]
where:
  percent-change(test-municipalities.row-n(0)) is 0.01
  percent-change(test-municipalities.row-n(1)) is 0.02
  percent-change(test-municipalities.row-n(2)) is -0.01
end

Finally, we can test our top-level table function. It produces a table, so we’ll have to construct a new table that it should return.

fun fastest-growing-towns(municipalities :: Table) -> Table:
  towns = filter-by(municipalities, is-town)
  towns-with-percent-change = build-column(towns, "percent-change", percent-change)
  sort-by(towns-with-percent-change, "percent-change", false)
where:
  test-municipalities-after = table: name, city, population-2000, population-2010, percent-change
    row: "Town 3", false, 50, 54, 0.08
    row: "Town 1", false, 100, 102, 0.02
    row: "Town 2", false, 100, 99, -0.01
  end
  fastest-growing-towns(test-municipalities) is test-municipalities-after
end

It’s important to write out the result table before you see the table produced by the function–we’re trying to make sure that the function does what it’s supposed to, which means that we have to predict its behavior! If we just copy in the table it produces, we’re testing the function against its own behavior rather than against our intuitions.

Testing a function against its own previous behavior can be useful if you are planning on changing the function and want to ensure that it still has the same behavior. This is called a regression test.
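As a rough sketch of what a regression test could look like here, suppose we wanted to rewrite fastest-growing-towns using Pyret's built-in table methods (.filter, .build-column, .order-by) instead of the course library functions. The names regression-input, recorded-output, and fastest-growing-towns-v2 are made up for this illustration, and the rewrite is only one possible version, not part of the program above. The recorded output is the same table that the where: block above expects.

```pyret
include tables

# Same rows as test-municipalities in the notes above, under a
# different (hypothetical) name so the two definitions do not clash.
regression-input = table: name, city, population-2000, population-2010
  row: "City", true, 100, 101
  row: "Town 1", false, 100, 102
  row: "Town 2", false, 100, 99
  row: "Town 3", false, 50, 54
end

# Output recorded from the existing function before changing anything.
recorded-output = table: name, city, population-2000, population-2010, percent-change
  row: "Town 3", false, 50, 54, 0.08
  row: "Town 1", false, 100, 102, 0.02
  row: "Town 2", false, 100, 99, -0.01
end

# Hypothetical rewrite using built-in table methods.
fun fastest-growing-towns-v2(municipalities :: Table) -> Table:
  # keep only the rows that are not cities
  towns = municipalities.filter(lam(r): not(r["city"]) end)
  # add the growth column, computed the same way as percent-change
  towns-with-percent-change = towns.build-column("percent-change",
    lam(r):
      (r["population-2010"] - r["population-2000"]) / r["population-2000"]
    end)
  # false asks for descending order, so the fastest-growing town comes first
  towns-with-percent-change.order-by("percent-change", false)
end

check "regression: the rewrite reproduces the recorded output":
  fastest-growing-towns-v2(regression-input) is recorded-output
end
```

If the check fails after a change, the new version behaves differently from the old one. That tells us something changed, not which version is right; deciding that still takes tests written from our own predictions.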

Plotting data

Plots help us visually understand the shape of data, and are often much more readable than a large table of numbers. Data scientists use plots for both exploratory and explanatory purposes–they are useful for understanding data in preparation for further analysis and in presenting data to a general audience.

Our tables library includes several functions to generate different kinds of plots. Here are a few examples using our municipalities data.

# how is population distributed in the state?
pie-chart(all-municipalities, "name", "population-2010")

# how many municipalities of various sizes are there?
histogram(all-municipalities, "population-2010", 1000)

# how much, and how, does population vary?
box-plot(all-municipalities, "population-2010")

ft = fastest-growing-towns(all-municipalities)

# visually present the growth data
bar-chart(ft, "name", "percent-change")

# is a town's size (in 2000) correlated with its growth?
scatter-plot(ft, "population-2000", "percent-change")
# linear regression
lr-plot(ft, "population-2000", "percent-change")