More combining tables, handling errors in data

Multiple tables continued

Here’s the code we ended with last time:

include tables
include gdrive-sheets

include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX")

ssid = "1jHvn5CPE6RkTTQRIXQbY5n5p4aiOH7fZsnwK2s6s6tc"
spreadsheet = load-spreadsheet(ssid)

ri-municipalities = load-table: name :: String, city :: Boolean,
  population-2000 :: Number, population-2010 :: Number
  # true because the sheet has a "header" row
  source: spreadsheet.sheet-by-name("municipalities", true)
end

ri-areas = load-table: name :: String, area :: Number
  source: spreadsheet.sheet-by-name("municipalities-area", true)
end

fun area-for-municipality(areas :: Table, name :: String) -> Number:
  filter-with(areas, lam(r): r["name"] == name end).row-n(0)["area"]
end

And here’s the rest of the density computation. See the lecture capture for details!

include tables
include gdrive-sheets

include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX")

ssid = "1jHvn5CPE6RkTTQRIXQbY5n5p4aiOH7fZsnwK2s6s6tc"
spreadsheet = load-spreadsheet(ssid)

ri-municipalities = load-table: name :: String, city :: Boolean,
  population-2000 :: Number, population-2010 :: Number
  # true because the sheet has a "header" row
  source: spreadsheet.sheet-by-name("municipalities", true)
end

ri-areas = load-table: name :: String, area :: Number
  source: spreadsheet.sheet-by-name("municipalities-area", true)
end

fun area-for-municipality(areas :: Table, name :: String) -> Number:
  filter-with(areas, lam(r): r["name"] == name end).row-n(0)["area"]
end

fun add-area(municipalities :: Table, areas :: Table) -> Table:
  build-column(municipalities, 
    "area",
    lam(r): area-for-municipality(areas, r["name"]) end)
end

fun density(r :: Row) -> Number:
  r["population-2010"] / r["area"]
end

fun add-density(municipalities :: Table) -> Table:
  build-column(municipalities, "density", density)
end

fun densest-municipalities(municipalities :: Table, areas :: Table) -> Table:
  with-area = add-area(municipalities, areas)
  with-density = add-density(with-area)
  sort-by(with-density, "density", false)
end

Errors in data

So far in this class, we’ve been dealing with “clean” data–tables with correct and complete information. In practice, you’ll often have to deal with errors: missing data, formatting issues, and incorrect values.

Take a look at the data in this spreadsheet. What do you notice?

In class, we noticed a number of different errors:

  • Malformed email addresses (“kurt”, etc.)
  • Differently-formatted number (“three”)
  • Various different discount code formats (“student” vs. “STUDENT”, “none” vs. empty cells)
  • Odd-looking totals
  • A weird entry in “Flavor” (2 oatmeal, 2 chocolate).

What happens when we load the data into Pyret and try to do the same analysis?

include gdrive-sheets
include tables
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")

ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw"
cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
  end

# --- redo the analysis from google-sheets ---
# do we get the same answers?

total-cookies = sum(cookie-data, "count")

student-cookies = 
  sum(filter-with(cookie-data, lam(r): r["discount"] == "student" end),
    "count")

chocolate-cookies = 
  sum(filter-with(cookie-data, lam(r): r["flavor"] == "Chocolate" end),
    "count")

full-price-cookies = 
  sum(filter-with(cookie-data, lam(r): r["discount"] == "" end),
    "count")

When we run this program, we get an error because of the “three” field. Once that’s fixed, we still get some unexpected results (see the lecture capture for details). We’ll fix these next time.