More combining tables, handling errors in data
Multiple tables continued
Here’s the code we ended with last time:
include tables include gdrive-sheets include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX") ssid = "1jHvn5CPE6RkTTQRIXQbY5n5p4aiOH7fZsnwK2s6s6tc" spreadsheet = load-spreadsheet(ssid) ri-municipalities = load-table: name :: String, city :: Boolean, population-2000 :: Number, population-2010 :: Number # true because the sheet has a "header" row source: spreadsheet.sheet-by-name("municipalities", true) end ri-areas = load-table: name :: String, area :: Number source: spreadsheet.sheet-by-name("municipalities-area", true) end fun area-for-municipality(areas :: Table, name :: String) -> Number: filter-with(areas, lam(r): r["name"] == name end).row-n(0)["area"] end
And here’s the rest of the density computation. See the lecture capture for details!
include tables include gdrive-sheets include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX") ssid = "1jHvn5CPE6RkTTQRIXQbY5n5p4aiOH7fZsnwK2s6s6tc" spreadsheet = load-spreadsheet(ssid) ri-municipalities = load-table: name :: String, city :: Boolean, population-2000 :: Number, population-2010 :: Number # true because the sheet has a "header" row source: spreadsheet.sheet-by-name("municipalities", true) end ri-areas = load-table: name :: String, area :: Number source: spreadsheet.sheet-by-name("municipalities-area", true) end fun area-for-municipality(areas :: Table, name :: String) -> Number: filter-with(areas, lam(r): r["name"] == name end).row-n(0)["area"] end fun add-area(municipalities :: Table, areas :: Table) -> Table: build-column(municipalities, "area", lam(r): area-for-municipality(areas, r["name"]) end) end fun density(r :: Row) -> Number: r["population-2010"] / r["area"] end fun add-density(municipalities :: Table) -> Table: build-column(municipalities, "density", density) end fun densest-municipalities(municipalities :: Table, areas :: Table) -> Table: with-area = add-area(municipalities, areas) with-density = add-density(with-area) sort-by(with-density, "density", false) end
Errors in data
So far in this class, we’ve been dealing with “clean” data–tables with correct and complete information. In practice, you’ll often have to deal with errors: missing data, formatting issues, and incorrect values.
Take a look at the data in this spreadsheet. What do you notice?
In class, we noticed a number of different errors:
- Malformed email addresses (“kurt”, etc.)
- Differently-formatted number (“three”)
- Various different discount code formats (“student” vs. “STUDENT”, “none” vs. empty cells)
- Odd-looking totals
- A weird entry in “Flavor” (2 oatmeal, 2 chocolate).
What happens when we load the data into Pyret and try to do the same analysis?
include gdrive-sheets include tables include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX") ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw" cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) end # --- redo the analysis from google-sheets --- # do we get the same answers? total-cookies = sum(cookie-data, "count") student-cookies = sum(filter-with(cookie-data, lam(r): r["discount"] == "student" end), "count") chocolate-cookies = sum(filter-with(cookie-data, lam(r): r["flavor"] == "Chocolate" end), "count") full-price-cookies = sum(filter-with(cookie-data, lam(r): r["discount"] == "" end), "count")
When we run this program, we get an error because of the “three” field. Once that’s fixed, we still get some unexpected results (see the lecture capture for details). We’ll fix these next time.