Class summary: Handling Table Errors with Programs
1 Identifying Table Errors
2 Managing Table Errors
3 Setting Up a Data Analysis file

Copyright (c) 2017 Kathi Fisler

1 Identifying Table Errors

Last class, we talked about the importance of sanity checking data before you begin to work with it. Let’s briefly review what that means with a concrete example, then look at how to use programs to help us manage the sanity-checking process.

Open up this table of partial salary information for a small business. Spend a couple of minutes looking at the table. What do you notice about this data that warrants being checked or fixed?

Potential issues include:

2 Managing Table Errors

How many of these errors could we check (or correct) with small Pyret programs? Many, as it turns out, though some point to additional programming constructs that would be nice to have.

Here is the starter file that loads the salary table into Pyret.

We wrote a series of functions to work on different issues from the original table. You can see the resulting code in the final code handout posted to the lectures page.
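
To illustrate the kind of small function involved, here is a minimal sketch. It is not the code from the class handout. The salaries table, its hourly-wage column, the plausible range of 0 to 500, and both function names are assumptions made up for this example. It relies on Pyret's built-in table methods filter and transform-column, which may differ from the operations used in class.

```pyret
# Hypothetical stand-in for the salary table; the real columns may differ.
salaries = table: name, hourly-wage
  row: "Ana", 15
  row: "Ben", -12
  row: "Cal", 1500
end

# Detect: keep only the rows whose wage falls outside an assumed plausible range.
fun suspicious-wages(t :: Table) -> Table:
  t.filter(lam(r): (r["hourly-wage"] < 0) or (r["hourly-wage"] > 500) end)
end

# Repair: assume a negative wage is a sign typo and flip it to positive.
fun fix-negative-wages(t :: Table) -> Table:
  t.transform-column("hourly-wage", num-abs)
end

check:
  suspicious-wages(salaries).length() is 2
  fix-negative-wages(salaries).row-n(1)["hourly-wage"] is 12
end
```

A detection function like this one only reports rows for a person to inspect. The repair function encodes a guess about what went wrong, so it is safe only once that guess has been confirmed against the data's source.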

3 Setting Up a Data Analysis file

Now that we have a slew of programs to help detect, and in some cases repair, errors in a data file, how should we use them when setting up a data table for deeper statistical or other analysis?

Partly, this is a question of how to balance naming intermediate tables against keeping the collection of table names manageable for people. As a general rule, it makes sense to name the raw data table that you import into your program, to name the cleaned-up table, and to name any portions of the table that you want to use in multiple analyses. For example:

  #|
     THE CLEANING CODE TEMPLATE

     raw-data = load-table(...)
     check for issues
     repair issues via programs
     sanity check values
     clean-data = result of all repairs
     ... then work only with the clean data
  |#
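
The template could be filled in along the following lines. This is a sketch under stated assumptions: the inline table stands in for whatever load-table would produce, and the column names, the two repair functions, and the choice of repairs are all hypothetical.

```pyret
# Hypothetical raw data; in practice this would come from load-table.
raw-data = table: name, hourly-wage
  row: "Ana", 15
  row: "Ben", -12
  row: "", 20
end

# Check for issues: name the problem rows so they can be inspected.
blank-names = raw-data.filter(lam(r): r["name"] == "" end)

# Repair issues via programs (assumed repairs, for illustration only).
fun remove-blank-names(t :: Table) -> Table:
  t.filter(lam(r): r["name"] <> "" end)
end

fun make-wages-positive(t :: Table) -> Table:
  t.transform-column("hourly-wage", num-abs)
end

# clean-data is the result of all repairs.
clean-data = make-wages-positive(remove-blank-names(raw-data))

# Sanity check values on the cleaned table.
check "clean-data has no known issues":
  clean-data.length() is 2
  clean-data.filter(lam(r): r["hourly-wage"] < 0 end).length() is 0
end

# Name a portion of the table only if several analyses will reuse it.
wages = clean-data.get-column("hourly-wage")
```

Only three table names survive here: raw-data, clean-data, and blank-names for inspection. The intermediate result of remove-blank-names is never named, which keeps the set of names small.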