Class summary: Handling Table Errors with Programs
Copyright (c) 2017 Kathi Fisler
1 Identifying Table Errors
Last class, we talked about the importance of sanity checking data before you begin to work with it. Let’s briefly review what that means with a concrete example, then look at how to use programs to help us manage the sanity-checking process.
Open up this table of partial salary information for a small business. Spend a couple of minutes looking at the table. What do you notice about this data that warrants being checked or fixed?
Potential issues include:
Names of departments aren’t spelled or capitalized consistently
One salary level seems much higher than the others
Someone at a lower salary level is making more than someone else at another level, even within the same department
2 Managing Table Errors
How many of these errors could we check (or correct) with small Pyret programs? Many, as it turns out, though some point to additional programming constructs that would be nice to have.
Here is the starter file that loads the salary table into Pyret.
We wrote a series of functions to work on different issues from the original table. You can see the resulting code in the final code handout posted to the lectures page.
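As a rough illustration of the detection side, here is a minimal sketch, not the code from the handout. The table `sample`, its column names, the list `expected-depts`, and the 200000 cutoff are all assumptions made up for this example; the real salary table may be laid out differently.

```pyret
# Hypothetical stand-in for the salary table; real columns and values may differ.
sample = table: name, dept, level, salary
  row: "A", "Sales", 1, 40000
  row: "B", "sales", 2, 52000
  row: "C", "Support", 1, 450000
end

# Assumed list of correctly spelled and capitalized department names
expected-depts = [list: "Sales", "Support"]

# Rows whose department name is not written exactly as expected
odd-depts = sieve sample using dept:
  not(expected-depts.member(dept))
end

# Rows whose salary exceeds an assumed plausibility cutoff
suspicious = sieve sample using salary:
  salary > 200000
end
```

Both results are themselves tables, so a person can inspect the flagged rows and decide whether each one is a real error or just an unusual value.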
3 Setting Up a Data Analysis File
Now that we have a slew of programs to help detect, and in some cases clean, a data file, how should we use all of this in setting up a data table for doing some sort of deeper statistical or other analysis?
Partly, this is a question of how to balance naming intermediate tables and keeping the collection of table names manageable for people. As a general rule, it makes sense to name the raw data table that you import into your program, to name the cleaned-up table, and to name any portions of the table that you want to use in multiple analyses. For example:
#|
   THE CLEANING CODE TEMPLATE

   raw-data = load-table(...)
   check for issues
   repair issues via programs
   sanity check values
   clean-data = result of all repairs
   ... then work only with the clean data
|#
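One way the template might be filled in is sketched below. The rows are hypothetical, and a literal table stands in for the `load-table(...)` step so that the example runs on its own. The only repair shown is lowercasing department names, which fixes inconsistent capitalization but not misspellings.

```pyret
# Stand-in for raw-data = load-table(...); these rows are hypothetical.
raw-data = table: name, dept, salary
  row: "A", "Sales", 40000
  row: "B", "sales", 52000
  row: "C", "SUPPORT", 45000
end

# Repair issues via programs: make department capitalization consistent
clean-data = transform raw-data using dept:
  dept: string-to-lower(dept)
end

# Sanity check values: every department now uses one consistent spelling
check "departments are normalized":
  clean-data.get-column("dept") is [list: "sales", "sales", "support"]
end

# ... then work only with the clean data
sales-only = sieve clean-data using dept:
  dept == "sales"
end
```

Only three names are introduced here: `raw-data`, `clean-data`, and `sales-only`, the last standing in for a portion of the table that several analyses would share.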