CS1950Y Lecture 14: Inside Alloy
February 28, 2018
Announcements
Tomorrow (3/6) from 12-1pm, Caroline Trippel of Princeton will be giving a talk on detecting hardware vulnerabilities using Alloy. If you're interested, the talk will be held in the Engineering Research Center, Room 125.
Many of you have reached out to get more details about case study presentations. We will not be providing a formal specification for exactly what case studies will look like, because they are meant to be informal. The goal here is to decompress for a week and reflect on what you've learned already!
Back to Logic
Last class, we talked about two kinds of logic: Boolean logic (also called propositional logic) and first-order logic.
Formulas in Boolean logic consist of variables with truth values (or propositions), strung together using logical connectives like "and" and "or". First-order logic is a bit more expressive; it also includes existential and universal quantifiers. In addition, the variables of a first-order formula refer to concrete entities in the universe.
Just like programming languages, logics have different levels of expressive power. Sometimes, restricting expressive power
can be beneficial; although it reduces the scope of the things we can model in a particular language or logic, it makes it easier
to reason about and understand that language or logic.
Previously, we used Alloy to model the semantics of Boolean logic by recursively evaluating the truth values of the subformulas of a given formula, combining the results to get an overall truth value for the formula. Defining the semantics for first-order logic will prove a bit more complicated.
Consider the Alloy specification below:
sig Node { edges : set Node }
run { all n : Node | n->n in edges } for exactly 2 Node
How would we go about evaluating the truth of all n : Node | n->n in edges in a particular situation, like this one:
We'll certainly need to know something about the truth value of n->n in edges, but this subformula alone can't have a truth value if we don't know what n is. Somehow, we need to evaluate this subformula multiple times with different possible values of n, like N1 and N2.
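To make "evaluate the subformula once per value of n" concrete, here is a minimal Python sketch. The instance below (two nodes and three edges) is a hypothetical stand-in chosen for illustration, and the function names are not part of Alloy.

```python
# Hypothetical instance: the atoms in Node and the edges relation as a set of pairs.
nodes = {"N1", "N2"}
edges = {("N1", "N1"), ("N1", "N2"), ("N2", "N2")}


def self_loop(n, edges):
    """Truth value of `n->n in edges` once n is bound to a concrete atom."""
    return (n, n) in edges


def all_self_loops(nodes, edges):
    """Truth value of `all n : Node | n->n in edges`: try every binding of n."""
    return all(self_loop(n, edges) for n in nodes)


def some_self_loop(nodes, edges):
    """Truth value of `some n : Node | n->n in edges`."""
    return any(self_loop(n, edges) for n in nodes)


print(all_self_loops(nodes, edges))
print(some_self_loop(nodes, edges))
```

The body of the quantifier only gets a truth value inside self_loop, where n has been bound to a specific atom.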
The way Alloy tackles this problem is to convert the user's specification (which may contain quantifiers, as well as operators like relational join and transitive closure) into a formula in Boolean logic. Once this conversion has taken place, Alloy passes the Boolean formula to a SAT solver. If the SAT solver reports that the specification is UNSAT, the user is told that their constraints were unsatisfiable. Otherwise, the output of the SAT solver is translated back into an instance.
How exactly does Alloy do the translation from the user-level specification into Boolean logic? For the spec above, we can rewrite all n : Node | n->n in edges as N1->N1 in edges and N2->N2 in edges. That is, universal quantification can be rewritten as conjunction. We can similarly rewrite existential quantification as disjunction.
However, these reformulations rely upon our knowledge that there are exactly 2 Node. If we were to remove the keyword exactly, we could get an instance with one or zero nodes, for which N1->N1 in edges and N2->N2 in edges would be false, even if all n: Node | n->n in edges was true.
We fix this problem using implication:
sig Node { edges : set Node }
run { all n : Node | n->n in edges } for 2 Node
(N1 in Node implies N1->N1 in edges) and (N2 in Node implies N2->N2 in edges)
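One way to convince yourself that the implication guards are needed is to brute-force every instance within scope 2 and compare the quantified formula against both rewrites. This is a sketch in Python rather than anything Alloy itself runs; the helper names (instances, quantified, naive, guarded) are made up for illustration, and it assumes edges may only relate atoms that are actually in Node.

```python
from itertools import chain, combinations, product

ATOMS = ["N1", "N2"]
PAIRS = list(product(ATOMS, ATOMS))


def subsets(items):
    """All subsets of items, as tuples."""
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))


def instances():
    """Every instance within scope 2: a choice of Node atoms plus edges among them."""
    for node_tuple in subsets(ATOMS):
        node = set(node_tuple)
        legal = [p for p in PAIRS if p[0] in node and p[1] in node]
        for edge_tuple in subsets(legal):
            yield node, set(edge_tuple)


def quantified(node, edges):
    """all n : Node | n->n in edges"""
    return all((n, n) in edges for n in node)


def naive(node, edges):
    """N1->N1 in edges and N2->N2 in edges"""
    return all((a, a) in edges for a in ATOMS)


def guarded(node, edges):
    """(N1 in Node implies N1->N1 in edges) and (N2 in Node implies N2->N2 in edges)"""
    return all((a not in node) or ((a, a) in edges) for a in ATOMS)


all_instances = list(instances())
naive_mismatches = [i for i in all_instances if naive(*i) != quantified(*i)]
guarded_mismatches = [i for i in all_instances if guarded(*i) != quantified(*i)]

print(len(all_instances), len(naive_mismatches), len(guarded_mismatches))
```

The naive conjunction disagrees with the quantifier exactly on instances with fewer than two nodes where every existing node has a self loop; the guarded version agrees everywhere.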
After we get rid of pesky non-Boolean quantifications, what comes next? Our formula is in terms of the objects in the instance: it refers to specific nodes and the edge relation, neither of which the SAT solver understands. We need to replace complex statements like N1 in Node and N2->N2 in edges with Boolean variables. This particular situation requires exactly 6 variables:
- V1: N1 in Node
- V2: N2 in Node
- V3: N1->N1 in edges
- V4: N1->N2 in edges
- V5: N2->N1 in edges
- V6: N2->N2 in edges
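The list above can be generated mechanically: one variable per possible member of Node, plus one per possible pair in edges. The following Python sketch does that for any list of atoms. The V1, V2, ... numbering is only a convention for this illustration; Alloy's internal numbering is not something these notes specify.

```python
from itertools import product


def primary_variables(atoms):
    """Map V1, V2, ... to the membership statement each variable stands for."""
    statements = [f"{a} in Node" for a in atoms]
    statements += [f"{a}->{b} in edges" for a, b in product(atoms, repeat=2)]
    return {f"V{i}": s for i, s in enumerate(statements, start=1)}


for name, statement in primary_variables(["N1", "N2"]).items():
    print(f"{name}: {statement}")
```

For this model the count grows as n + n^2 for a scope of n nodes, so a scope of 3 would already need 12 variables.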
Alloy calls these replacements the primary variables. If you run Alloy in verbose mode, you can see exactly how many primary variables are generated in the translation process. Typically, relating the number of primary vars to your model is simple, but in some cases it's not immediately obvious. For example, consider altering our model to have an existential quantifier rather than a universal one:
sig Node { edges : set Node }
run { some n : Node | n->n in edges } for exactly 2 Node
If you run this version of the model, you'll notice it also has 6 primary vars. But since the scope is exactly 2 Node, both N1 and N2 are guaranteed to be in all satisfying instances, so we can toss out the first two variables from the list above (N1 in Node and N2 in Node), leaving us with just 4. So where are the extra 2 variables coming from?
Recall that existential quantification is carried out by introducing a skolem relation which holds all those atoms for which the statement inside the existential quantifier holds. That is, we have some new one-place relation which holds only those members of Node which have self loops. Those extra 2 variables represent the statements N1 in skolem_relation and N2 in skolem_relation.
What about all that other information in the verbose Alloy output? The clauses and non-primary variables? These are a consequence of some additional work we need to do to get our Boolean spec into the SAT solver-friendly format known as Conjunctive Normal Form (CNF).
A formula in CNF is the conjunction of a set of clauses, where each clause is the disjunction of some number of variables, each of which may or may not be negated. In order to get our formula in CNF, then, we have to remove anything that isn't "or", "and", or "not". We also have to move things around so that the whole formula is a conjunction of disjunctions. There are a variety of logical equivalences we can use to make this transformation, but in practice doing repeated transformations of the entire formula is quite inefficient, both space-wise and time-wise. For example, consider trying to transform a disjunction of n conjunctions to CNF:
(F1 and G1) or ... or (Fn and Gn)
The distributive laws of Boolean logic guarantee that (P and Q) or R is equivalent to (P or R) and (Q or R), so we can start like so:
(F1 or ((F2 and G2) or ... or (Fn and Gn))) and (G1 or ((F2 and G2) or ... or (Fn and Gn)))
But now we need to straighten out the second disjunct of both conjuncts, creating 4 new clauses, each of which needs more rearranging (which will give us 8 new clauses to fix). Ultimately, we generate 2^n clauses, since each final clause picks either Fi or Gi for every i. That is exponential in n, which is pretty dismal.
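The blowup is easy to see by carrying the distribution all the way through. In the Python sketch below, each conjunction (Fi and Gi) is a pair of variable names, and fully distributing "or" over "and" yields one clause for every way of picking one variable from each pair. The function names are illustrative only.

```python
from itertools import product


def dnf_terms(n):
    """The n conjunctions of (F1 and G1) or ... or (Fn and Gn), as pairs of names."""
    return [(f"F{i}", f"G{i}") for i in range(1, n + 1)]


def distribute(terms):
    """CNF obtained by distribution: each clause picks one variable from every pair."""
    return [frozenset(choice) for choice in product(*terms)]


for n in (1, 2, 3, 10):
    print(n, len(distribute(dnf_terms(n))))
```

For n = 2 the result is (F1 or F2) and (F1 or G2) and (G1 or F2) and (G1 or G2); every extra conjunction doubles the clause count.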
Alloy solves this problem by introducing some extra variables. In addition to having variables that directly describe the relations in any instance, Alloy adds variables that represent entire subformulas of the Boolean formula. In our example above with the F's and G's, Alloy would add n extra variables V1, ..., Vn, where Vi stands for the whole subformula (Fi and Gi). To encode the equivalence of Vi with (Fi and Gi), Alloy introduces just three more clauses for each i:
2 clauses to encode V implies F and G:
(not Vi or Fi) and (not Vi or Gi)
1 clause to encode F and G implies V:
not Fi or not Gi or Vi
The final CNF formula is the conjunction of the single clause V1 or V2 or ... or Vn with the 3n additional clauses that relate the original conjunctions to the V variables. That is, the number of clauses we've created is linear in n, the number of original disjuncts, rather than exponential. It is these extra V variables, and the clauses created using them, which account for the large numbers of vars and clauses Alloy generates even for small specifications.
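As a sanity check on this encoding, the Python sketch below builds the 3n + 1 clauses and verifies by brute force, for small n, that an assignment to the F and G variables satisfies the original formula exactly when it can be extended with values for the V variables that satisfy every clause. The representation (a literal as a name paired with a polarity) and the function names are assumptions made for this illustration, not Alloy's internal data structures.

```python
from itertools import product


def encode(n):
    """Clauses for (F1 and G1) or ... or (Fn and Gn) using helper variables V1..Vn.

    A literal is (name, polarity); a clause is a list of literals.
    """
    clauses = [[(f"V{i}", True) for i in range(1, n + 1)]]  # V1 or ... or Vn
    for i in range(1, n + 1):
        f, g, v = f"F{i}", f"G{i}", f"V{i}"
        clauses.append([(v, False), (f, True)])              # not Vi or Fi
        clauses.append([(v, False), (g, True)])              # not Vi or Gi
        clauses.append([(f, False), (g, False), (v, True)])  # not Fi or not Gi or Vi
    return clauses


def satisfies(assignment, clauses):
    """True if every clause contains at least one literal made true by assignment."""
    return all(any(assignment[name] == pol for name, pol in clause) for clause in clauses)


def original(assignment, n):
    """Truth value of (F1 and G1) or ... or (Fn and Gn)."""
    return any(assignment[f"F{i}"] and assignment[f"G{i}"] for i in range(1, n + 1))


def encoding_agrees(n):
    """Check the encoding against the original formula on every F/G assignment."""
    clauses = encode(n)
    fg_names = [f"{x}{i}" for i in range(1, n + 1) for x in ("F", "G")]
    v_names = [f"V{i}" for i in range(1, n + 1)]
    for fg_values in product([False, True], repeat=2 * n):
        base = dict(zip(fg_names, fg_values))
        extendable = any(
            satisfies({**base, **dict(zip(v_names, v_values))}, clauses)
            for v_values in product([False, True], repeat=n)
        )
        if extendable != original(base, n):
            return False
    return True


for n in (1, 2, 3):
    print(n, len(encode(n)), encoding_agrees(n))
```

The clause count here is 3n + 1, compared with 2^n for plain distribution; the price is n extra variables that do not correspond to anything in the instance.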