More table errors, list intro
HW5, Project 1 Discussion
Going forward, we will be grading on readability and testing! Please include type annotations, docstrings, and tests for all functions, and do your best to follow the Pyret Style Guide.
Table errors, continued
Here’s the program we ended up with last class in order to analyze the data in this spreadsheet.
include gdrive-sheets include data-source include tables include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX") ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw" cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize discount using string-sanitizer end # --- redo the analysis from google-sheets --- # do we get the same answers? total-cookies = sum(cookie-data, "count") student-cookies = sum(filter-by(cookie-data, lam(r): string-to-upper(r["discount"]) == "STUDENT" end), "count") chocolate-cookies = sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end), "count") full-price-cookies = sum(filter-by(cookie-data, lam(r): r["discount"] == "" end), "count") fun cleanup-discount(discount :: String): string-to-upper(discount) where: cleanup-discount("A") is "A" cleanup-discount("student") is "STUDENT" cleanup-discount("") is "" end cookie-data-clean = transform-column(cookie-data, "discount", cleanup-discount) fun invalid-email(email :: String) -> Boolean: not(string-contains(email, "@")) end
We’ve written a function to identify bad email addresses. We can use it to find email addresses for further cleanup–either manually or in Pyret:
> filter-by(cookie-data, invalid-email)
What should we do about the differently-formatted “flavor” entry? We could combine it into two orders. We might be able to write a program to do it, but if there’s only one we can just do it manually.
Now that we’ve done this cleanup, we can do some analysis. How about this:
> freq-bar-chart(cookie-data-clean, "flavor")
This chart has a an extra column! Looks like we’ve got a bad flavor, “ChocoIate”, in the mix. It’s always a good idea to look at your data, generate little graphs like this, etc.–you’ll find problems that way!
Here’s our final program.
include gdrive-sheets include data-source # to get the sanitizers include tables include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX") ssid = "1oJhQqMQdxbbNkw5d__FG3Uz6o8tzovN_ZNyYspqZRhw" cookie-data = load-table: name, email, count, flavor, discount source: load-spreadsheet(ssid).sheet-by-name("Cookies", true) sanitize name using string-sanitizer sanitize email using string-sanitizer sanitize count using num-sanitizer sanitize flavor using string-sanitizer sanitize discount using string-sanitizer end # --- redo the analysis from google-sheets --- # do we get the same answers? total-cookies = sum(cookie-data, "count") student-cookies = sum(filter-by(cookie-data, lam(r): r["discount"] == "student" end), "count") chocolate-cookies = sum(filter-by(cookie-data, lam(r): r["flavor"] == "Chocolate" end), "count") full-price-cookies = sum(filter-by(cookie-data, lam(r): r["discount"] == "" end), "count") # --- clean the data table ------- fun remove-spaces(str :: String) -> String: doc: "remove spaces from given string" string-replace(str, " ", "") end fun none-to-empty(str :: String) -> String: doc: "convert string 'none' to empty string" if str == "none": "" else: str end end # create a cleaned-up dataset that we can use for analysis hereafter cookie-data-clean = transform-column( transform-column( transform-column(cookie-data, "discount", remove-spaces), "discount", none-to-empty), "discount", string-toupper) # --- check for errors ----------- # make sure email addresses are valid fun is-email(str :: String) -> Boolean: doc: "Check whether string contains an valid email address" strlen = string-length(str) tail = string-substring(str, strlen - 4, strlen) string-contains(str, "@") and ((tail == ".com") or (tail == ".org") or (tail == ".edu")) where: is-email("myname") is false is-email("myname@home") is false is-email("myname@homeorg") is false is-email("myname@home.org") is true end fun invalid-email(rw :: Row) -> Boolean: doc: "return true if data in email column is not a valid address" not(is-email(rw["email"])) end fun malformed-emails(t :: Table) -> Table: doc: "extract rows that have invalid email addresses" filter-by(t, invalid-email) end # check for known flavors fun invalid-flavor(rw :: Row) -> Boolean: not( (rw["flavor"] == "Chocolate") or (rw["flavor"] == "Oatmeal")) end # identify the malformed rows filter-by(cookie-data, invalid-flavor) filter-by(cookie-data, invalid-email) #| Since email and delivery errors can't be easily fixed in code, you are likely to just correct these in the source spreadsheet |# # --- now we have a cleaned-up dataset. Generate plots # to make sure the data make sense before analyzing # did we get a reasonable distribution of flavor options freq-bar-chart(cookie-data-clean, "flavor") # wait -- what's that extra column? One of the cells says # ChocoIate rather than Chocolate. We should write a script to fix # that too (or fix in source)
Introduction to lists
Our cookie shop only sells particular cookie flavors. Let’s write a function to find flavors we don’t have available.
fun invalid-flavor(r :: Row) -> Boolean: (r["flavor"] == "Chocolate") or (r["flavor"] == "Oatmeal") or (r["flavor"] == "Snickerdoodle") end
This code should work, but there’s something a bit unsatisfying about it. We’re
mixing logic (the equality tests and the or
’s) and data (the actual flavor
names). What if we decide to start selling a new flavor–say, pumpkin spice?
Here’s a better way:
import lists as L cookie-flavors = [list: "Chocolate", "Oatmeal", "Snickerdoodle", "Pumpkin Spice"] fun invalid-flavor(r :: Row) -> Boolean: L.member(cookie-flavors, r["flavor"]) end
Here we’re defining a list of flavors. A list is an ordered collection of single elements, of any length. The next several lectures will focus on lists and list operations.
The first of these operations is shown: L.member
.1 L.member
takes a list and a potential element; it returns true
if the element is in the
list and false
otherwise.
Here’s another use for lists: extracting columns from a table. We have already
seen Row
, the data type representing rows: it needs to have a value for every
named column in the table. What about columns? Do we need a Column
data type?
A column consists of an ordered collection of values, of unbounded length. So a
column is really just a list!
We can get the list of values for a table column with get-column
:
fun order-flavors(t :: Table) -> List: all-flavors = t.get-column("flavors") L.distinct(all-flavors) end
The L.distinct
operation takes a list and returns a list of its distinct
values.
Tables vs. lists
Instead of a list of flavors, we could have used a table:
flavors = table: flavor row: "Chocolate" row: "Oatmeal" row: "Snickerdoodle" end
Which representation is better?
Why are we working with lists now instead of tables? Lists are a simpler data
structure (while still being flexible enough to be the basis of lots of
important applications!), and as we work with them further we’ll see how to
create new list operations based on very simple building blocks. When working
with tables we viewed sort-by
, filter-by
, build-column
, etc. as
fundamental operations; in the context of lists, we’ll see how to write
similar operations. It’s going to be fun!
Footnotes:
lists
library with the name L for boring technical reasons–some list
operations have the same names as operations built in to Pyret. You should do
the same in your own code, with the same import lists as L
line.