CS1950Y Lecture 1: Oracle Testing
Friday, January 26, 2018
Weaknesses of Tests
Here are some reasons for testing that students suggested:
- To expose logical and implementation errors
- To prevent regressions
- As a form of documentation for your code
- To find edge cases you might not have thought of while writing your code
Tim added that reasoning through what's worth testing can be a more valuable exercise than writing the code.
And here are some weaknesses of testing that students gave:
- They rely on your intuition
- They're not exhaustive and can't cover the entire space.
Sorting
Let's think about sorting records by keys. We'll have records with just a name and an ID number. The key of this record is going to be the name.
What's an example test case that we might write?
- A list we know is sorted:
  [{ name: "Alice", idnum: 1 }, { name: "Bob", idnum: 2 }] -> [{ name: "Alice", idnum: 1 }, { name: "Bob", idnum: 2 }]
- The empty list:
  [] -> []
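As a rough sketch, the two cases above could be written as input/output pairs in Python. The names Rec, sort_by_name, CASES, and run_concrete_tests are made up for illustration, and sort_by_name just wraps the built-in sorted() as a stand-in for whatever sort is under test.

```python
from typing import List, NamedTuple, Tuple


class Rec(NamedTuple):
    name: str   # the sort key
    idnum: int


def sort_by_name(records: List[Rec]) -> List[Rec]:
    """Hypothetical sort under test; here just a wrapper around sorted()."""
    return sorted(records, key=lambda r: r.name)


# Each case is an (input, expected output) pair.
CASES: List[Tuple[List[Rec], List[Rec]]] = [
    ([Rec("Alice", 1), Rec("Bob", 2)], [Rec("Alice", 1), Rec("Bob", 2)]),
    ([], []),
]


def run_concrete_tests(sort) -> bool:
    # Copy each input so a sort that mutates its argument cannot affect CASES.
    return all(sort(list(inp)) == expected for inp, expected in CASES)
```

Note that the identity function (lambda rs: rs) passes both cases too, which is one concrete way to see that these two tests say very little.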
Are these tests enough? No, we have this gut intuition that maybe two tests aren't enough. But there are also some things we aren't testing.
Suppose we have identical keys:
[{ name: "Alice", idnum: 1}, { name: "Alice", idnum: 100}, { name: "Bob", idnum: 2}]
Do we care if this is a stable sorting algorithm that preserves the order of duplicate keys?
We'll call this harmless nondeterminism — we have multiple correct outputs for a given input. What if we change the sorting algorithm a year later and it's not stable anymore? We don't want to have to update 1000 test cases, or lose real failures in the noise.
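To make the problem concrete, here is a small sketch (all names hypothetical) with two sorts that both order the records by name. Only one of them matches a test that pins down a single expected output.

```python
from typing import List, NamedTuple


class Rec(NamedTuple):
    name: str
    idnum: int


records = [Rec("Alice", 1), Rec("Alice", 100), Rec("Bob", 2)]


def stable_sort(rs: List[Rec]) -> List[Rec]:
    # Python's sorted() is stable, so ties keep their input order.
    return sorted(rs, key=lambda r: r.name)


def other_sort(rs: List[Rec]) -> List[Rec]:
    # Still sorted by name, but ties come out in descending idnum order.
    return sorted(rs, key=lambda r: (r.name, -r.idnum))


def names(rs: List[Rec]) -> List[str]:
    return [r.name for r in rs]


# A concrete test that hard-codes one of the correct outputs.
expected = [Rec("Alice", 1), Rec("Alice", 100), Rec("Bob", 2)]
```

Here stable_sort(records) == expected holds and other_sort(records) == expected does not, even though names(...) is the same for both outputs. If stability was never a requirement, the second "failure" is just noise.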
We're going to solve this problem with a technique that also partially addresses the two weaknesses above: reliance on intuition and lack of exhaustiveness.
Let's think about types! Some of you might recognize this as part of the design recipe. What is the type of a sort on these records?
SORT: listof(Rec) -> listof(Rec)
But what's the type of a test suite for a sort?
SUITE: (listof(Rec) -> listof(Rec)) -> boolean
There are lots of things you could imagine that satisfy this type signature without being a sequence of input/output pairs. We're going to look at a technique called oracle testing.
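One way to write these two signatures down in Python's typing notation is sketched below. Sort, Suite, io_pair_suite, and length_suite are illustrative names, not anything from the course. The second suite has the right type but checks a property rather than comparing against a fixed output.

```python
from typing import Callable, List, NamedTuple


class Rec(NamedTuple):
    name: str
    idnum: int


# SORT: listof(Rec) -> listof(Rec)
Sort = Callable[[List[Rec]], List[Rec]]
# SUITE: (listof(Rec) -> listof(Rec)) -> boolean
Suite = Callable[[Sort], bool]


def io_pair_suite(sort: Sort) -> bool:
    """A suite in the usual style: a fixed input paired with a fixed output."""
    return sort([]) == []


def length_suite(sort: Sort) -> bool:
    """Also a Suite, but it checks a property instead of a fixed output."""
    inputs = [[], [Rec("Bob", 2), Rec("Alice", 1)]]
    return all(len(sort(list(inp))) == len(inp) for inp in inputs)
```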
Oracle Testing
For oracle testing, we won't provide a concrete output. Instead, for a given input and output, we're going to ask whether or not the output is valid.
What is it that makes a sorting algorithm correct? What does it have to do?
- The output has to be sorted
- Nothing is missing — everything that was in the input list is in the output list
- Nothing is added — everything in the output list was in the input
These are all fairly straightforward to write individually. This is what writing a validator for an oracle is — we combine functions checking the individual properties we want to hold.
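A minimal sketch of such a validator, taking the three properties exactly as listed (membership checks only). The function names are made up for illustration.

```python
from typing import List, NamedTuple


class Rec(NamedTuple):
    name: str
    idnum: int


def is_sorted(out: List[Rec]) -> bool:
    # Every adjacent pair is in non-decreasing key order.
    return all(a.name <= b.name for a, b in zip(out, out[1:]))


def nothing_missing(inp: List[Rec], out: List[Rec]) -> bool:
    return all(x in out for x in inp)


def nothing_added(inp: List[Rec], out: List[Rec]) -> bool:
    return all(y in inp for y in out)


def is_valid_sort(inp: List[Rec], out: List[Rec]) -> bool:
    # The validator is just the conjunction of the individual properties.
    return is_sorted(out) and nothing_missing(inp, out) and nothing_added(inp, out)
```

The validator never mentions what the output should be, only whether a given output is acceptable for a given input.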
A student asked: what do we do if we care about the sort being stable?
We add that as a fourth property! Notice that the act of thinking about what you want from your sort here is a much more powerful idea than thinking about some concrete examples. This act of figuring out what correctness means, figuring out what you want from your algorithm, is the start of what this course is about.
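One possible way to phrase the stability property as a check (a sketch, with is_stable as a hypothetical name): for every key, the records with that key must appear in the same relative order in the output as in the input.

```python
from typing import List, NamedTuple


class Rec(NamedTuple):
    name: str
    idnum: int


def is_stable(inp: List[Rec], out: List[Rec]) -> bool:
    keys = {r.name for r in inp}
    # For each key, compare the subsequence of records carrying that key.
    return all(
        [r for r in inp if r.name == k] == [r for r in out if r.name == k]
        for k in keys
    )
```

This check is only meant to be combined with the other properties; on its own it says nothing about whether the output is sorted.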
As it stands, our properties aren't quite right. They don't have anything to say about duplicates, so 3 (Alice, 1) records in the input could turn into 5 or 2 in the output!
Let's make them a bit more precise:
for all x : input | count(y : output | y = x) = count(y : input | y = x)
We'll use this weird SQL-like, logic-like notation for now.
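Read literally, the formula translates almost word for word into Python. This is a sketch with a made-up function name. It uses list.count, which is quadratic but mirrors the notation closely.

```python
from typing import List, NamedTuple


class Rec(NamedTuple):
    name: str
    idnum: int


def counts_preserved(inp: List[Rec], out: List[Rec]) -> bool:
    # for all x : input | count(y : output | y = x) = count(y : input | y = x)
    return all(out.count(x) == inp.count(x) for x in inp)
```

With three (Alice, 1) records in the input, an output containing five or two of them is now rejected. Since x ranges only over the input, a record that appears only in the output is not caught by this check alone, so the "nothing is added" property still has to be checked alongside it.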
How many of you would have tested this? In fact, 3 years ago, Tim had a fix for this in lecture that was wrong! He checked that the lengths of the two lists were equal, but that allows for bugs like [A, A, B] -> [A, B, B].
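The bug is easy to reproduce. The sketch below uses plain strings as stand-ins for records, and the function names are hypothetical. It compares a validator that adds a length check against one that compares per-element counts.

```python
from collections import Counter
from typing import List


def in_order(out: List[str]) -> bool:
    return all(a <= b for a, b in zip(out, out[1:]))


def length_based_check(inp: List[str], out: List[str]) -> bool:
    # Sorted, same members, same length: the fix that turned out to be wrong.
    return in_order(out) and set(inp) == set(out) and len(inp) == len(out)


def count_based_check(inp: List[str], out: List[str]) -> bool:
    # Sorted, and every element occurs the same number of times on both sides.
    return in_order(out) and Counter(inp) == Counter(out)


inp = ["A", "A", "B"]
bad_out = ["A", "B", "B"]
```

Here length_based_check(inp, bad_out) is True while count_based_check(inp, bad_out) is False: the bad output is sorted, has the same members, and has the same length, yet it has lost an A and gained a B.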
See how thinking abstractly has given us some concrete test cases we didn't have before, and given us properties we think are correct. This is a process we'll be using a lot in Logic for Systems. We'll be talking concretely and talking abstractly in ways that make the two go very well together.
This validator deals with multiple correct answers, but we still have the problem of thinking of inputs. We could think of inputs ourselves, but that still has the intuition problem. Instead, we can randomize and generate inputs.
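Putting the pieces together, a bare-bones oracle might look like the sketch below. It uses only the standard library. The generator, the name pool, the trial count, and the fixed seed are all arbitrary choices made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List, NamedTuple, Optional


class Rec(NamedTuple):
    name: str
    idnum: int


NAMES = ["Alice", "Bob", "Carol"]  # small pool so duplicate keys are common


def random_records(rng: random.Random, max_len: int = 8) -> List[Rec]:
    n = rng.randint(0, max_len)  # includes the empty list
    return [Rec(rng.choice(NAMES), rng.randint(1, 3)) for _ in range(n)]


def is_valid_sort(inp: List[Rec], out: List[Rec]) -> bool:
    in_order = all(a.name <= b.name for a, b in zip(out, out[1:]))
    return in_order and Counter(inp) == Counter(out)


def find_counterexample(
    sort: Callable[[List[Rec]], List[Rec]], trials: int = 200, seed: int = 0
) -> Optional[List[Rec]]:
    """Return the first random input whose output fails validation, else None."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        inp = random_records(rng)
        if not is_valid_sort(inp, sort(list(inp))):
            return inp
    return None
```

A None result only means that no failure was found in these trials, not that the sort is correct. A returned input is a ready-made candidate for a new concrete test.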
This is a technique that multiple software companies are starting to adopt. It started as a Haskell library called QuickCheck. After a few years, people liked the idea enough to want QuickCheck for their own languages, and ports now exist for many of them. That means that today, after you leave class, you can do this right away in most languages you'd want!
Is there still a place for concrete testing? Absolutely! If we can think of an edge case and want to make sure it is handled correctly, we should still write a test case for it, because an oracle won't necessarily generate that case again. Oracle testing is a complement to normal testing. You can also think of it as a way to generate new test cases. Randomization can give us inputs and edge cases we would never have thought of, and those can be added as concrete tests.
Next, we'll look at some examples and think about how we'd write oracles for them.
Factoring
Let's look abstractly at another example: prime integer factorization.
What's the type signature of this?
pfactor: int -> listof(int)
What are some properties we want to validate?
- The output multiplies to the input
- All the output components are prime
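Which of these is easy and which is hard? This is worth thinking about. Multiplying the output back together is easy. Testing for primality is *hard* by comparison: doing it well takes a nontrivial algorithm of its own, which would then need testing too. To be useful, our oracle doesn't have to test everything. For now, we can aim for partial correctness, and just check that the output multiplies to the input.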
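A sketch of this partial oracle, assuming Python 3.8+ for math.prod. The pfactor below is a hypothetical trial-division implementation standing in for the code under test. The input range and trial count are arbitrary.

```python
import math
import random
from typing import Callable, List


def pfactor(n: int) -> List[int]:
    """Hypothetical implementation under test: plain trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


def multiplies_to_input(n: int, factors: List[int]) -> bool:
    # The easy property: the output multiplies back to the input.
    return math.prod(factors) == n


def check_pfactor(factor: Callable[[int], List[int]],
                  trials: int = 200, seed: int = 0) -> bool:
    rng = random.Random(seed)
    inputs = (rng.randint(2, 10**6) for _ in range(trials))
    return all(multiplies_to_input(n, factor(n)) for n in inputs)
```

Because the check is only partial, a "factorizer" such as lambda n: [n] also passes it. That is the price of skipping the primality property.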
Sudoku
Some people like playing Sudoku. Since learning how to write a solver for it, Tim is no longer one of them. But how do we know if the solver is correct?
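One thing is to check that the output is a valid filled-in board: every 3x3 box, row, and column contains each of the numbers 1 through 9 exactly once. We also have to make sure we actually solved the problem we were given: every number that was filled in on the original board must still be there in the output.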
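Both checks are short to write down. In the sketch below, a board is assumed to be a 9x9 list of lists with 0 marking a blank. That representation and all the names are assumptions made for illustration. SOLVED is a filled board built from a simple shifting pattern, used only as sample data.

```python
from typing import List

Grid = List[List[int]]  # 9x9; 0 marks a blank cell in a puzzle

DIGITS = set(range(1, 10))


def units(grid: Grid) -> List[List[int]]:
    """All 27 units: 9 rows, 9 columns, and 9 three-by-three boxes."""
    rows = [list(row) for row in grid]
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[br + i][bc + j] for i in range(3) for j in range(3)]
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return rows + cols + boxes


def is_complete(grid: Grid) -> bool:
    # Nine cells whose set is {1..9} means each digit appears exactly once.
    return all(set(u) == DIGITS for u in units(grid))


def respects_clues(puzzle: Grid, solution: Grid) -> bool:
    return all(puzzle[r][c] in (0, solution[r][c])
               for r in range(9) for c in range(9))


def is_valid_solution(puzzle: Grid, solution: Grid) -> bool:
    return is_complete(solution) and respects_clues(puzzle, solution)


# Sample data: a filled board from a shifting pattern, and a puzzle made
# by blanking every cell where row + column is even.
SOLVED: Grid = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
                for r in range(9)]
PUZZLE: Grid = [[0 if (r + c) % 2 == 0 else SOLVED[r][c] for c in range(9)]
                for r in range(9)]
```

Checking the clues matters: swapping every 1 and 2 in SOLVED gives another complete board, but it no longer answers PUZZLE.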
But how hard is it to randomly generate Sudoku boards? It turns out that it's very hard — we don't know how to easily tell that a randomly-generated board has one unique solution, which is a requirement of a Sudoku puzzle.
This is the opposite of the situation with the factorization problem. There, it was easy to generate inputs but hard to validate outputs. Now, we can validate outputs but can't easily generate inputs.