Class summary: Table Errors
Copyright (c) 2017 Kathi Fisler
1 Errors in Data
We began by looking at a Google Sheet with data gathered about people registering for an event (like a show). The middle sheet includes some simple analysis on the data, such as the numbers of tickets sold, student tickets sold, and full-fare (no discount) tickets sold.
We looked at the table and the analysis data and enumerated possible errors in the data. These included invalid email addresses, numbers written in words, differences in capitalization, and options that seem inconsistent with one another. All of these errors are the kinds of things that can happen with real data.
We loaded the spreadsheet into Pyret and wrote expressions to do the same analysis (in the starter file). Pyret produced different answers to some questions. What gives?
Key idea in CS: While programming tools (which include Excel and Google Sheets) typically produce the same answers on valid data, different tools often take very different approaches on data with errors. Some tools skip over missing data, some interpret it as zero or an empty string. Some tools warn you about invalid data, others do their best to produce answers. There are arguments for and against each of these positions, depending on context.
Your job is to be aware of this, to know how to check for invalid data problems, and to have a habit of checking and cleaning your data before use (especially if you got that data from someone else!).
2 Cleaning and Repairing Data
We talked about what kinds of errors must be fixed in the source spreadsheet, and what errors we can fix through code. We also introduced a Pyret concept called a sanitizer, which converts all data in a column to a given type (turning missing data into empty strings for string columns, missing numbers into 0 on numeric columns, and so on). The final code (also posted on the lectures page) shows how to use sanitizers.
The final posted code shows using filter-by to detect errors such as invalid email addresses. It showed a new table operation called transform-column, which replaces the contents of a cell with a new computed value (this is good for converting a column to uppercase, for example). The final code and the lecture capture show how these were developed.
3 Cleaning and Error Checking Tips
The general takeaways from this lecture are as follows:
Always run sanity checks on your data before doing analysis, especially if the data is from an external source.
Sanity checks can be done by filtering for unexpected inputs, or by generating plots that make sure the data looks as you might expect. For example, creating a bar chart of ticket orders highlights that Ernie and Zander have unusually high and low numbers of tickets – that might be an error to fix. Creating a frequency bar chart of delivery methods reveals a misspelling of "email" as "emall" in one row. Making initial plots to poke around is a good way to look for problems in your data.
In general, when you get a new dataset, use code to sanity check it, repair it, and save a copy of the clean data in a different variable name that you then use for the actual analysis. You will see this structure reflected in the final code file for this lecture.