Dealing with errors in data
So far in this class, we’ve been dealing with “clean” data–tables with correct and complete information. In practice, you’ll often have to deal with errors: missing data, formatting issues, and incorrect values.
Take a look at the data in this spreadsheet. What do you notice?
In class, we noticed a number of different errors:
- Malformed email addresses (“kurt”, etc.)
- Differently-formatted number (“three”)
- Various different discount code formats (“student” vs. “STUDENT”, “none” vs. empty cells)
- Odd-looking totals
- A weird entry in “Flavor” (2 oatmeal, 2 chocolate).
What happens when we load the data into Pyret and try to do the same analysis?
include gdrive-sheets include tables include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX") ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw" cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) end # --- redo the analysis from google-sheets --- # do we get the same answers? total-cookies = sum(cookie-data, "count") student-cookies = sum(filter-by(cookie-data, lam(r): r["discount"] == "student" end), "count") chocolate-cookies = sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end), "count") full-price-cookies = sum(filter-by(cookie-data, lam(r): r["discount"] == "" end), "count")
When we run this program, we get an error because of the “three” field. Once that’s fixed, we still get some unexpected results (see the lecture capture for details). In order to make sure the data end up in the format we want, we’ll use sanitizers, which convert data from an external source into a specific Pyret data type. We’ll sanitize the “discount” field (because it contains blanks) as follows:
include data-source # to get the sanitizers cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize discount using string-sanitizer end
We could, and should, do this for the other fields as well:
include data-source # to get the sanitizers cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize name using string-sanitizer sanitize email using string-sanitizer sanitize count using num-sanitizer sanitize flavor using string-sanitizer sanitize discount using string-sanitizer end
Our computed results still don’t match Google sheets. This is down to the “discount” field–in order to find “student” cookies, Google sheets is doing a case-insensitive comparison. Pyret, on the other hand, only thinks two strings are equal if the cases match. So, we’ll clean up the data a bit in Pyret by upper-casing all of the discount codes:
fun fun cleanup-discount(discount :: String): string-to-upper(discount) where: cleanup-discount("A") is "A" cleanup-discount("student") is "STUDENT" cleanup-discount("") is "" end cookie-data-clean = transform-column(cookie-data, "discount", cleanup-discount)
The transform-column
function is used to clean up or otherwise alter the data
in a single column of a table. It returns a new table by applying its function
argument to each value of the given column.
Here’s the program we ended up with. We’ll continue to clean it up on Friday.
include gdrive-sheets include data-source include tables include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX") ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw" cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize discount using string-sanitizer end # --- redo the analysis from google-sheets --- # do we get the same answers? total-cookies = sum(cookie-data, "count") student-cookies = sum(filter-by(cookie-data, lam(r): string-to-upper(r["discount"]) == "STUDENT" end), "count") chocolate-cookies = sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end), "count") full-price-cookies = sum(filter-by(cookie-data, lam(r): r["discount"] == "" end), "count") fun cleanup-discount(discount :: String): string-to-upper(discount) where: cleanup-discount("A") is "A" cleanup-discount("student") is "STUDENT" cleanup-discount("") is "" end cookie-data-clean = transform-column(cookie-data, "discount", cleanup-discount) fun invalid-email(email :: String) -> Boolean: not(string-contains(email, "@")) end
Cookies!
- fix loading error
- what are the results?
- look at the data
- add sanitizers
- the results are still different–capitalization, blanks, “none”
- how about those emails? how about the flavors?
- finding invalid data
- are we done? bar chart–oops, ChocoIate!
include gdrive-sheets include data-source # to get the sanitizers include tables include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX") ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw" cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize name using string-sanitizer sanitize email using string-sanitizer sanitize count using num-sanitizer sanitize flavor using string-sanitizer sanitize discount using string-sanitizer end # --- redo the analysis from google-sheets --- # do we get the same answers? total-cookies = sum(cookie-data, "count") student-cookies = sum(filter-by(cookie-data, lam(r): r["discount"] == "student" end), "count") chocolate-cookies = sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end), "count") full-price-cookies = sum(filter-by(cookie-data, lam(r): r["discount"] == "" end), "count") # --- clean the data table ------- fun remove-spaces(str :: String) -> String: doc: "remove spaces from given string" string-replace(str, " ", "") end fun none-to-empty(str :: String) -> String: doc: "convert string 'none' to empty string" if str == "none": "" else: str end end # create a cleaned-up dataset that we can use for analysis hereafter cookie-data-clean = transform-column( transform-column( transform-column(cookie-data, "discount", remove-spaces), "discount", none-to-empty), "discount", string-toupper) # --- check for errors ----------- # make sure email addresses are valid fun is-email(str :: String) -> Boolean: doc: "Check whether string contains an valid email address" strlen = string-length(str) tail = string-substring(str, strlen - 4, strlen) string-contains(str, "@") and ((tail == ".com") or (tail == ".org") or (tail == ".edu")) where: is-email("myname") is false is-email("myname@home") is false is-email("myname@homeorg") is false is-email("myname@home.org") is true end fun invalid-email(rw :: Row) -> Boolean: doc: "return true if data in email column is not a valid address" not(is-email(rw["email"])) end fun malformed-emails(t :: Table) -> Table: doc: "extract rows that have invalid email addresses" filter-by(t, invalid-email) end # check for known flavors fun invalid-flavor(rw :: Row) -> Boolean: not( (rw["flavor"] == "Chocolate") or (rw["flavor"] == "Oatmeal")) end # identify the malformed rows filter-by(cookie-data, invalid-flavor) filter-by(cookie-data, invalid-email) #| Since email and delivery errors can't be easily fixed in code, you are likely to just correct these in the source spreadsheet |# # --- now we have a cleaned-up dataset. Generate plots # to make sure the data make sense before analyzing # did we get a reasonable distribution of flavor options freq-bar-chart(cookie-data-clean, "flavor") # wait -- what's that extra column? One of the cells says # ChocoIate rather than Chocolate. We should write a script to fix # that too (or fix in source) # are the tickcounts normal? bar-chart(cookie-data-clean, "name", "count") # how did Ernie end up with no tickets? Is 10 over a purchase limit?