Class summary:   Table Errors
1 Errors in Data
2 Cleaning and Repairing Data
3 Cleaning and Error Checking Tips

Class summary: Table Errors

Copyright (c) 2017 Kathi Fisler

1 Errors in Data

We began by looking at a Google Sheet with data gathered about people registering for an event (like a show). The middle sheet includes some simple analysis on the data, such as the numbers of tickets sold, student tickets sold, and full-fare (no discount) tickets sold.

We looked at the table and the analysis data and enumerated possible errors in the data. These included invalid email addresses, numbers written in words, differences in capitalization, and options that seem inconsistent with one another. All of these errors are the kinds of things that can happen with real data.

We loaded the spreadsheet into Pyret and wrote expressions to do the same analysis (in the starter file). Pyret produced different answers to some questions. What gives?

Key idea in CS: While programming tools (which include Excel and Google Sheets) typically produce the same answers on valid data, different tools often take very different approaches on data with errors. Some tools skip over missing data, some interpret it as zero or an empty string. Some tools warn you about invalid data, others do their best to produce answers. There are arguments for and against each of these positions, depending on context.

Your job is to be aware of this, to know how to check for invalid data problems, and to have a habit of checking and cleaning your data before use (especially if you got that data from someone else!).

2 Cleaning and Repairing Data

We talked about what kinds of errors must be fixed in the source spreadsheet, and what errors we can fix through code. We also introduced a Pyret concept called a sanitizer, which converts all data in a column to a given type (turning missing data into empty strings for string columns, missing numbers into 0 on numeric columns, and so on). The final code (also posted on the lectures page) shows how to use sanitizers.

The final posted code shows using filter-by to detect errors such as invalid email addresses. It showed a new table operation called transform-column, which replaces the contents of a cell with a new computed value (this is good for converting a column to uppercase, for example). The final code and the lecture capture show how these were developed.

3 Cleaning and Error Checking Tips

The general takeaways from this lecture are as follows: