Class summary: Table Errors (Catching and Preventing)

This material is not explicitly in the textbook.

These notes take our first look at errors in the data in tables. For these notes, we work with manually-created tables, rather than tables loaded from spreadsheet files. We’ll come back to these same issues on tables from files in a few days.

1 Inconsistent Types in the Same Column

Consider a basic table of wage data:

wages =

table: name, hourly-wage, dept

row: "Pat", 18, "shipping"

row: "Jay", 15, "sales"

row: "Tina", 23, "finance"

row: "Zoe", 20, "sales"

end

We might want to write a program to give everyone in the company a 10% raise, such as the following:

transform wage-table using hourly-wage:

hourly-wage: hourly-wage * 1.10

end

In practice, data sources can contain errors. Let’s look at what programs can do to guard against errors.

Here’s a version of the table where Tina’s hourly wage has been entered not as a number, but as her job level (which determines pay in some organizations)

wages-pay-error =

table: name, hourly-wage, dept

row: "Pat", 18, "shipping"

row: "Jay", 15, "sales"

row: "Tina", "level 8", "finance"

row: "Zoe", 20, "sales"

end

What will happen when we run the pay-raise program? We may expect an error of some kind, but at what point in the computation? We get it when Pyret tries to evaluate

"level 8" * 1.1

We had bad data, so we should get an error. The question is whether we could have written our programs to clean up the data in the first place, so we wouldn’t try to run programs on poor data. This way, perhaps we can clean up the data once (before we run any programs), and our programs can then be written assuming that the data will at least have the shape and types that we wanted.

1.1 Using Column Types

In this case, the underlying problem is that someone has entered a string in a column where our programs expect to see a number. This expectation isn’t expressed in the code, however. We can do so by putting types on the columns:

wages-coltypes =

table: name :: String, hourly-wage :: Number, dept :: String

row: "Pat", 18, "shipping"

row: "Jay", 15, "sales"

row: "Tina", "level 8", "finance"

row: "Zoe", 20, "sales"

end

Now, we get an error from Pyret when we try to create the table, as we should: our rows violate the assumption on the table contents.

2 Bad Data of the Right Type

Here’s another example of what can go wrong with data in a table: the data can have the right type, but not make sense:

wages-min-error =

table: name :: String, hourly-wage :: Number, dept :: String

row: "Pat", 18, "shipping"

row: "Jay", 1.5, "sales"

row: "Tina", 23, "finance"

row: "Zoe", 20, "sales"

end

Here, someone accidentally entered Jay’s pay as 1.5 rather than 15. The 1.5 is valid according to the types, but it doesn’t make sense as an hourly wage (it violates minimum wage requirements, if nothing else!). What could we do to prevent this situation?

There are two issues here: reporting the problem and adjusting the data. We could report the problem by running a program to show all rows in which the person is not being paid minimum wage:

sieve wages-min-error using hourly-wage:

hourly-wage < 9.6

end

How might we adjust the data to deal with the problem? Ideally, we would ask the creator of the data to fix it, but that isn’t always feasible (perhaps the data is being loaded from an old website, perhaps you don’t know who created the data, etc). If you can’t fix the data at the source, you have to fix it within your code. Possible options are:

delete the rows that don’t pay minimum wage (though that could mean those employees don’t get paid!)
automatically give everyone at least minimum wage
ask someone to indicate fixes for each individual employee that has a bad wage

We haven’t learned what we need to do 3, and 1 isn’t appropriate in this situation, so let’s implement 2.

raise-to-min = transform wages-min-error using hourly-wage:

hourly-wage: num-max(9.6, hourly-wage)

end

The raise-to-min table now has everyone over minimum wage.

3 Creating Constants in Code

As a side note, our minimum wage functions could use an additional edit to make them easier to read and maintain: notice that we refer to the number 9.6 in two places. That number is the 2017 minimum wage for Rhode Island. Rather than use the number explicitly, we should just name this value and add it to our dictionary:

min-wage-ri = 9.6

sieve wages-min-error using hourly-wage:

hourly-wage < min-wage-ri

end

raise-to-min = transform wages-min-error using hourly-wage:

hourly-wage: num-max(min-wage-ri, hourly-wage)

end

Why is this a good idea? Using the name

reduces the chance of typos in the number
makes it clear what the number means
indicates that the multiple occurrences of 9.6 refer to the same concept, and aren’t just an accidental use of the same number
allows us to edit the program in just one place when the minimum wage changes

As a general rule, when your program uses a specific value that has a clear meaning in your context, you should give that value a name in the dictionary and use it consistently in your program.

1	Inconsistent Types in the Same Column
2	Bad Data of the Right Type
3	Creating Constants in Code