Dealing with errors in data

So far in this class, we’ve been dealing with “clean” data–tables with correct and complete information. In practice, you’ll often have to deal with errors: missing data, formatting issues, and incorrect values.

Take a look at the data in this spreadsheet. What do you notice?

In class, we noticed a number of different errors:

What happens when we load the data into Pyret and try to do the same analysis?

include gdrive-sheets
include tables
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")

ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw"
cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
  end

# --- redo the analysis from google-sheets ---
# do we get the same answers?

total-cookies = sum(cookie-data, "count")

student-cookies = 
  sum(filter-by(cookie-data, lam(r): r["discount"] == "student" end),
    "count")

chocolate-cookies = 
  sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end),
    "count")

full-price-cookies = 
  sum(filter-by(cookie-data, lam(r): r["discount"] == "" end),
    "count")

When we run this program, we get an error because of the “three” field. Once that’s fixed, we still get some unexpected results (see the lecture capture for details). In order to make sure the data end up in the format we want, we’ll use sanitizers, which convert data from an external source into a specific Pyret data type. We’ll sanitize the “discount” field (because it contains blanks) as follows:

include data-source # to get the sanitizers

cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize discount using string-sanitizer
  end

We could, and should, do this for the other fields as well:

include data-source # to get the sanitizers

cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize name using string-sanitizer
    sanitize email using string-sanitizer
    sanitize count using num-sanitizer
    sanitize flavor using string-sanitizer
    sanitize discount using string-sanitizer
  end

Our computed results still don’t match Google sheets. This is down to the “discount” field–in order to find “student” cookies, Google sheets is doing a case-insensitive comparison. Pyret, on the other hand, only thinks two strings are equal if the cases match. So, we’ll clean up the data a bit in Pyret by upper-casing all of the discount codes:

  fun fun cleanup-discount(discount :: String):
  string-to-upper(discount)
where:
  cleanup-discount("A") is "A"
  cleanup-discount("student") is "STUDENT"
  cleanup-discount("") is ""
end

cookie-data-clean = transform-column(cookie-data, "discount", 
  cleanup-discount)

The transform-column function is used to clean up or otherwise alter the data in a single column of a table. It returns a new table by applying its function argument to each value of the given column.

Here’s the program we ended up with. We’ll continue to clean it up on Friday.

include gdrive-sheets
include data-source
include tables
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")

ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw"
cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize discount using string-sanitizer
  end

# --- redo the analysis from google-sheets ---
# do we get the same answers?

total-cookies = sum(cookie-data, "count")

student-cookies = 
  sum(filter-by(cookie-data, lam(r): string-to-upper(r["discount"]) == "STUDENT" end),
    "count")

chocolate-cookies = 
  sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end),
    "count")

full-price-cookies = 
  sum(filter-by(cookie-data, lam(r): r["discount"] == "" end),
    "count")

fun cleanup-discount(discount :: String):
  string-to-upper(discount)
where:
  cleanup-discount("A") is "A"
  cleanup-discount("student") is "STUDENT"
  cleanup-discount("") is ""
end

cookie-data-clean = transform-column(cookie-data, "discount", 
  cleanup-discount)

fun invalid-email(email :: String) -> Boolean:
  not(string-contains(email, "@"))
end

Cookies!

include gdrive-sheets
include data-source # to get the sanitizers
include tables
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")

ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw"
cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize name using string-sanitizer
    sanitize email using string-sanitizer
    sanitize count using num-sanitizer
    sanitize flavor using string-sanitizer
    sanitize discount using string-sanitizer
  end

# --- redo the analysis from google-sheets ---
# do we get the same answers?

total-cookies = sum(cookie-data, "count")

student-cookies = 
  sum(filter-by(cookie-data, lam(r): r["discount"] == "student" end),
    "count")

chocolate-cookies = 
  sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end),
    "count")

full-price-cookies = 
  sum(filter-by(cookie-data, lam(r): r["discount"] == "" end),
    "count")

# --- clean the data table -------

fun remove-spaces(str :: String) -> String:
  doc: "remove spaces from given string"
  string-replace(str, " ", "")
end

fun none-to-empty(str :: String) -> String:
  doc: "convert string 'none' to empty string"
  if str == "none":
    ""
  else:
    str
  end
end

# create a cleaned-up dataset that we can use for analysis hereafter
cookie-data-clean = 
  transform-column(
    transform-column(
      transform-column(cookie-data, "discount", remove-spaces),
      "discount", none-to-empty),
    "discount", string-toupper)


# --- check for errors -----------

# make sure email addresses are valid
fun is-email(str :: String) -> Boolean:
  doc: "Check whether string contains an valid email address"
  strlen = string-length(str)
  tail = string-substring(str, strlen - 4, strlen)
  string-contains(str, "@") and 
  ((tail == ".com") or (tail == ".org") or (tail == ".edu"))
where:
  is-email("myname") is false
  is-email("myname@home") is false
  is-email("myname@homeorg") is false
  is-email("myname@home.org") is true
end

fun invalid-email(rw :: Row) -> Boolean:
  doc: "return true if data in email column is not a valid address"
  not(is-email(rw["email"]))
end

fun malformed-emails(t :: Table) -> Table:
  doc: "extract rows that have invalid email addresses"
  filter-by(t, invalid-email)
end

# check for known flavors 
fun invalid-flavor(rw :: Row) -> Boolean:
  not(
    (rw["flavor"] == "Chocolate") or (rw["flavor"] == "Oatmeal"))
end

# identify the malformed rows
filter-by(cookie-data, invalid-flavor) 
filter-by(cookie-data, invalid-email)

#|
   Since email and delivery errors can't be easily fixed in code, 
   you are likely to just correct these in the source spreadsheet
|#


# --- now we have a cleaned-up dataset. Generate plots
#     to make sure the data make sense before analyzing

# did we get a reasonable distribution of flavor options
freq-bar-chart(cookie-data-clean, "flavor")
# wait -- what's that extra column? One of the cells says
#  ChocoIate rather than Chocolate. We should write a script to fix
#  that too (or fix in source)

# are the tickcounts normal?
bar-chart(cookie-data-clean, "name", "count")
# how did Ernie end up with no tickets?  Is 10 over a purchase limit?