More data cleaning, introduction to lists
Table errors, continued
In order to make sure the data end up in the format we want, we’ll use sanitizers, which convert data from an external source into a specific Pyret data type. We’ll sanitize the “discount” field (because it contains blanks) as follows:
include data-source # to get the sanitizers cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize discount using string-sanitizer end
We could, and should, do this for the other fields as well:
include data-source # to get the sanitizers cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize name using string-sanitizer sanitize email using string-sanitizer sanitize count using num-sanitizer sanitize flavor using string-sanitizer sanitize discount using string-sanitizer end
Our computed results still don’t match Google sheets. This is down to the “discount” field–in order to find “student” cookies, Google sheets is doing a case-insensitive comparison. Pyret, on the other hand, only thinks two strings are equal if the cases match. So, we’ll clean up the data a bit in Pyret by upper-casing all of the discount codes:
fun cleanup-discount(discount :: String): string-to-upper(discount) where: cleanup-discount("A") is "A" cleanup-discount("student") is "STUDENT" cleanup-discount("") is "" end cookie-data-clean = transform-column(cookie-data, "discount", cleanup-discount)
The transform-column
function is used to clean up or otherwise alter the data
in a single column of a table. It returns a new table by applying its function
argument to each value of the given column.
There are some weird email addresses in the data. We could use this function to find them:
fun is-email(address :: String) -> Boolean: string-contains(address, "@") and string-contains(address, ".") where: is-email("ty@gmail.com") is true is-email("ty@gmail") is false is-email("kurt") is false end
We can use filter-with
to find email addresses for further cleanup–either
manually or in Pyret:
> filter-with(cookie-data, lam(r): not(is-email(r["email"])) end)
What should we do about the differently-formatted “flavor” entry? We could combine it into two orders. We might be able to write a program to do it, but if there’s only one we can just do it manually.
Now that we’ve done this cleanup, we can do some analysis. How about this:
> freq-bar-chart(cookie-data-clean, "flavor")
This chart has a an extra column! Looks like we’ve got a bad flavor, “ChocoIate”, in the mix. It’s always a good idea to look at your data, generate little graphs like this, etc.–you’ll find problems that way!
Here’s our final program.
include gdrive-sheets include tables include data-source include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX") import lists as L ssid = "1gKlJTN0qQ6OYZ4DznWBKs_fjRVK1AxSfjMgXnh8Z9Qg" cookie-data = load-table: name :: String, email :: String, count :: Number, flavor :: String, discount :: String source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize discount using string-sanitizer sanitize name using string-sanitizer sanitize count using num-sanitizer end fun clean-discount(discount :: String) -> String: string-replace(string-to-lower(discount), " ", "") where: clean-discount("STUDENT") is "student" clean-discount("student") is "student" clean-discount("") is "" clean-discount(" ") is "" end fun is-email(address :: String) -> Boolean: string-contains(address, "@") and string-contains(address, ".") where: is-email("ty@gmail.com") is true is-email("ty@gmail") is false #is-email("ty.gmail@com") is false is-email("kurt") is false end cookie-data-clean = transform-column(cookie-data, "discount", clean-discount) total-cookies = sum(cookie-data-clean, "count") chocolate-cookies = sum(filter-with(cookie-data-clean, lam(r): r["flavor"] == "Chocolate" end), "count") student-cookies = sum(filter-with(cookie-data-clean, lam(r): r["discount"] == "student" end), "count") full-price-cookies = sum(filter-with(cookie-data-clean, lam(r): r["discount"] == "" end), "count") bad-email-rows = filter-with(cookie-data-clean, lam(r): not(is-email(r["email"])) end)
Introduction to lists
Our cookie shop only sells particular cookie flavors. Let’s write a function to find flavors we don’t have available.
fun invalid-flavor(r :: Row) -> Boolean: (r["flavor"] == "Chocolate") or (r["flavor"] == "Oatmeal") or (r["flavor"] == "Snickerdoodle") end
This code should work, but there’s something a bit unsatisfying about it. We’re
mixing logic (the equality tests and the or
’s) and data (the actual flavor
names). What if we decide to start selling a new flavor–say, pumpkin spice?
Here’s a better way:
import lists as L cookie-flavors = [list: "Chocolate", "Oatmeal", "Snickerdoodle", "Pumpkin Spice"] fun invalid-flavor(r :: Row) -> Boolean: L.member(cookie-flavors, r["flavor"]) end
Here we’re defining a list of flavors. A list is an ordered collection of single elements, of any length. The next several lectures will focus on lists and list operations.
The first of these operations is shown: L.member
.1 L.member
takes a list and a potential element; it returns true
if the element is in the
list and false
otherwise.
Tables vs. lists
Instead of a list of flavors, we could have used a table:
flavors = table: flavor row: "Chocolate" row: "Oatmeal" row: "Snickerdoodle" end
Which representation is better?
Why are we working with lists now instead of tables? Lists are a simpler data
structure (while still being flexible enough to be the basis of lots of
important applications!), and as we work with them further we’ll see how to
create new list operations based on very simple building blocks. When working
with tables we viewed sort-by
, filter-with
, build-column
, etc. as
fundamental operations; in the context of lists, we’ll see how to write
similar operations. It’s going to be fun!
Footnotes:
We have imported the
lists
library with the name L for boring technical reasons–some list
operations have the same names as operations built in to Pyret. You should do
the same in your own code, with the same import lists as L
line.