More data cleaning, introduction to lists

Table errors, continued

In order to make sure the data end up in the format we want, we’ll use sanitizers, which convert data from an external source into a specific Pyret data type. We’ll sanitize the “discount” field (because it contains blanks) as follows:

include data-source # to get the sanitizers

cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize discount using string-sanitizer
  end

We could, and should, do this for the other fields as well:

include data-source # to get the sanitizers

cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize name using string-sanitizer
    sanitize email using string-sanitizer
    sanitize count using num-sanitizer
    sanitize flavor using string-sanitizer
    sanitize discount using string-sanitizer
  end

Our computed results still don’t match Google sheets. This is down to the “discount” field–in order to find “student” cookies, Google sheets is doing a case-insensitive comparison. Pyret, on the other hand, only thinks two strings are equal if the cases match. So, we’ll clean up the data a bit in Pyret by upper-casing all of the discount codes:

fun cleanup-discount(discount :: String):
  string-to-upper(discount)
where:
  cleanup-discount("A") is "A"
  cleanup-discount("student") is "STUDENT"
  cleanup-discount("") is ""
end

cookie-data-clean = transform-column(cookie-data, "discount", 
  cleanup-discount)

The transform-column function is used to clean up or otherwise alter the data in a single column of a table. It returns a new table by applying its function argument to each value of the given column.

There are some weird email addresses in the data. We could use this function to find them:

fun is-email(address :: String) -> Boolean:
  string-contains(address, "@") and string-contains(address, ".")
where:
  is-email("ty@gmail.com") is true
  is-email("ty@gmail") is false
  is-email("kurt") is false
end

We can use filter-with to find email addresses for further cleanup–either manually or in Pyret:

> filter-with(cookie-data, lam(r): not(is-email(r["email"])) end)

What should we do about the differently-formatted “flavor” entry? We could combine it into two orders. We might be able to write a program to do it, but if there’s only one we can just do it manually.

Now that we’ve done this cleanup, we can do some analysis. How about this:

> freq-bar-chart(cookie-data-clean, "flavor")

This chart has a an extra column! Looks like we’ve got a bad flavor, “ChocoIate”, in the mix. It’s always a good idea to look at your data, generate little graphs like this, etc.–you’ll find problems that way!

Here’s our final program.

include gdrive-sheets
include tables
include data-source
include shared-gdrive("cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX")
import lists as L

ssid = "1gKlJTN0qQ6OYZ4DznWBKs_fjRVK1AxSfjMgXnh8Z9Qg"
cookie-data =
  load-table: name :: String, email :: String, count :: Number, 
    flavor :: String, discount :: String
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize discount using string-sanitizer
    sanitize name using string-sanitizer
    sanitize count using num-sanitizer
  end

fun clean-discount(discount :: String) -> String:
  string-replace(string-to-lower(discount), " ", "")
where:
  clean-discount("STUDENT") is "student"
  clean-discount("student") is "student"
  clean-discount("") is ""
  clean-discount("     ") is ""
end

fun is-email(address :: String) -> Boolean:
  string-contains(address, "@") and string-contains(address, ".")
where:
  is-email("ty@gmail.com") is true
  is-email("ty@gmail") is false
  #is-email("ty.gmail@com") is false
  is-email("kurt") is false
end

cookie-data-clean = transform-column(cookie-data, "discount", clean-discount)

total-cookies = sum(cookie-data-clean, "count")

chocolate-cookies =
  sum(filter-with(cookie-data-clean, lam(r): r["flavor"] == "Chocolate" end), "count")

student-cookies =
  sum(filter-with(cookie-data-clean, lam(r): r["discount"] == "student" end), "count")

full-price-cookies =
  sum(filter-with(cookie-data-clean, lam(r): r["discount"] == "" end), "count")

bad-email-rows = filter-with(cookie-data-clean, lam(r): not(is-email(r["email"])) end)

Introduction to lists

Our cookie shop only sells particular cookie flavors. Let’s write a function to find flavors we don’t have available.

fun invalid-flavor(r :: Row) -> Boolean:
  (r["flavor"] == "Chocolate") or
  (r["flavor"] == "Oatmeal") or
  (r["flavor"] == "Snickerdoodle")
end

This code should work, but there’s something a bit unsatisfying about it. We’re mixing logic (the equality tests and the or’s) and data (the actual flavor names). What if we decide to start selling a new flavor–say, pumpkin spice?

Here’s a better way:

import lists as L
cookie-flavors = [list: "Chocolate", "Oatmeal", "Snickerdoodle", "Pumpkin Spice"]

fun invalid-flavor(r :: Row) -> Boolean:
  L.member(cookie-flavors, r["flavor"])
end

Here we’re defining a list of flavors. A list is an ordered collection of single elements, of any length. The next several lectures will focus on lists and list operations.

The first of these operations is shown: L.member.1 L.member takes a list and a potential element; it returns true if the element is in the list and false otherwise.

Tables vs. lists

Instead of a list of flavors, we could have used a table:

flavors = table: flavor
row: "Chocolate"
row: "Oatmeal"
row: "Snickerdoodle"
end

Which representation is better?

Why are we working with lists now instead of tables? Lists are a simpler data structure (while still being flexible enough to be the basis of lots of important applications!), and as we work with them further we’ll see how to create new list operations based on very simple building blocks. When working with tables we viewed sort-by, filter-with, build-column, etc. as fundamental operations; in the context of lists, we’ll see how to write similar operations. It’s going to be fun!

Footnotes:

1

We have imported the lists library with the name L for boring technical reasons–some list operations have the same names as operations built in to Pyret. You should do the same in your own code, with the same import lists as L line.