More table errors, list intro

HW5, Project 1 Discussion

Going forward, we will be grading on readability and testing! Please include type annotations, docstrings, and tests for all functions, and do your best to follow the Pyret Style Guide.

Table errors, continued

Here’s the program we ended up with last class in order to analyze the data in this spreadsheet.

include gdrive-sheets
include data-source
include tables
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")

ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw"
cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize discount using string-sanitizer
  end

# --- redo the analysis from google-sheets ---
# do we get the same answers?

total-cookies = sum(cookie-data, "count")

student-cookies = 
  sum(filter-by(cookie-data, lam(r): string-to-upper(r["discount"]) == "STUDENT" end),
    "count")

chocolate-cookies = 
  sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end),
    "count")

full-price-cookies = 
  sum(filter-by(cookie-data, lam(r): r["discount"] == "" end),
    "count")

fun cleanup-discount(discount :: String):
  string-to-upper(discount)
where:
  cleanup-discount("A") is "A"
  cleanup-discount("student") is "STUDENT"
  cleanup-discount("") is ""
end

cookie-data-clean = transform-column(cookie-data, "discount", 
  cleanup-discount)

fun invalid-email(email :: String) -> Boolean:
  not(string-contains(email, "@"))
end

We’ve written a function to identify bad email addresses. We can use it to find email addresses for further cleanup–either manually or in Pyret:

> filter-by(cookie-data, invalid-email)

What should we do about the differently-formatted “flavor” entry? We could combine it into two orders. We might be able to write a program to do it, but if there’s only one we can just do it manually.

Now that we’ve done this cleanup, we can do some analysis. How about this:

> freq-bar-chart(cookie-data-clean, "flavor")

This chart has a an extra column! Looks like we’ve got a bad flavor, “ChocoIate”, in the mix. It’s always a good idea to look at your data, generate little graphs like this, etc.–you’ll find problems that way!

Here’s our final program.

include gdrive-sheets
include data-source # to get the sanitizers
include tables
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")

ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw"
cookie-data =
  load-table: name, email, count, flavor, discount
    source: load-spreadsheet(ssid).sheet-by-name("Cookies", true)
    sanitize name using string-sanitizer
    sanitize email using string-sanitizer
    sanitize count using num-sanitizer
    sanitize flavor using string-sanitizer
    sanitize discount using string-sanitizer
  end

# --- redo the analysis from google-sheets ---
# do we get the same answers?

total-cookies = sum(cookie-data, "count")

student-cookies = 
  sum(filter-by(cookie-data, lam(r): r["discount"] == "student" end),
    "count")

chocolate-cookies = 
  sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end),
    "count")

full-price-cookies = 
  sum(filter-by(cookie-data, lam(r): r["discount"] == "" end),
    "count")

# --- clean the data table -------

fun remove-spaces(str :: String) -> String:
  doc: "remove spaces from given string"
  string-replace(str, " ", "")
end

fun none-to-empty(str :: String) -> String:
  doc: "convert string 'none' to empty string"
  if str == "none":
    ""
  else:
    str
  end
end

# create a cleaned-up dataset that we can use for analysis hereafter
cookie-data-clean = 
  transform-column(
    transform-column(
      transform-column(cookie-data, "discount", remove-spaces),
      "discount", none-to-empty),
    "discount", string-toupper)


# --- check for errors -----------

# make sure email addresses are valid
fun is-email(str :: String) -> Boolean:
  doc: "Check whether string contains an valid email address"
  strlen = string-length(str)
  tail = string-substring(str, strlen - 4, strlen)
  string-contains(str, "@") and 
  ((tail == ".com") or (tail == ".org") or (tail == ".edu"))
where:
  is-email("myname") is false
  is-email("myname@home") is false
  is-email("myname@homeorg") is false
  is-email("myname@home.org") is true
end

fun invalid-email(rw :: Row) -> Boolean:
  doc: "return true if data in email column is not a valid address"
  not(is-email(rw["email"]))
end

fun malformed-emails(t :: Table) -> Table:
  doc: "extract rows that have invalid email addresses"
  filter-by(t, invalid-email)
end

# check for known flavors 
fun invalid-flavor(rw :: Row) -> Boolean:
  not(
    (rw["flavor"] == "Chocolate") or (rw["flavor"] == "Oatmeal"))
end

# identify the malformed rows
filter-by(cookie-data, invalid-flavor) 
filter-by(cookie-data, invalid-email)

#|
   Since email and delivery errors can't be easily fixed in code, 
   you are likely to just correct these in the source spreadsheet
|#


# --- now we have a cleaned-up dataset. Generate plots
#     to make sure the data make sense before analyzing

# did we get a reasonable distribution of flavor options
freq-bar-chart(cookie-data-clean, "flavor")
# wait -- what's that extra column? One of the cells says
#  ChocoIate rather than Chocolate. We should write a script to fix
#  that too (or fix in source)

Introduction to lists

Our cookie shop only sells particular cookie flavors. Let’s write a function to find flavors we don’t have available.

fun invalid-flavor(r :: Row) -> Boolean:
  (r["flavor"] == "Chocolate") or
  (r["flavor"] == "Oatmeal") or
  (r["flavor"] == "Snickerdoodle")
end

This code should work, but there’s something a bit unsatisfying about it. We’re mixing logic (the equality tests and the or’s) and data (the actual flavor names). What if we decide to start selling a new flavor–say, pumpkin spice?

Here’s a better way:

import lists as L
cookie-flavors = [list: "Chocolate", "Oatmeal", "Snickerdoodle", "Pumpkin Spice"]

fun invalid-flavor(r :: Row) -> Boolean:
  L.member(cookie-flavors, r["flavor"])
end

Here we’re defining a list of flavors. A list is an ordered collection of single elements, of any length. The next several lectures will focus on lists and list operations.

The first of these operations is shown: L.member.1 L.member takes a list and a potential element; it returns true if the element is in the list and false otherwise.

Here’s another use for lists: extracting columns from a table. We have already seen Row, the data type representing rows: it needs to have a value for every named column in the table. What about columns? Do we need a Column data type? A column consists of an ordered collection of values, of unbounded length. So a column is really just a list!

We can get the list of values for a table column with get-column:

fun order-flavors(t :: Table) -> List:
  all-flavors = t.get-column("flavors")
  L.distinct(all-flavors)
end

The L.distinct operation takes a list and returns a list of its distinct values.

Tables vs. lists

Instead of a list of flavors, we could have used a table:

flavors = table: flavor
row: "Chocolate"
row: "Oatmeal"
row: "Snickerdoodle"
end

Which representation is better?

Why are we working with lists now instead of tables? Lists are a simpler data structure (while still being flexible enough to be the basis of lots of important applications!), and as we work with them further we’ll see how to create new list operations based on very simple building blocks. When working with tables we viewed sort-by, filter-by, build-column, etc. as fundamental operations; in the context of lists, we’ll see how to write similar operations. It’s going to be fun!

Footnotes:

1
We have imported the lists library with the name L for boring technical reasons–some list operations have the same names as operations built in to Pyret. You should do the same in your own code, with the same import lists as L line.