CS1950Y Lecture 14: Inside Alloy
February 28, 2018


Announcements

Tomorrow (3/6) from 12-1pm, Caroline Trippel of Princeton will be giving a talk on detecting hardware vulnerabilities using Alloy. If you're interested, the talk will be held in the Engineering Research Center, Room 125.

Many of you have reached out to get more details about case study presentations. We will not be providing a formal specification for exactly what case studies will look like, because they are meant to be informal. The goal here is to decompress for a week and reflect on what you've learned already!

Back to Logic

Last class, we talked about two kinds of logic: Boolean logic (also called propositional logic) and first-order logic. Formulas in Boolean logic consist of variables with truth values (or propositions), strung together using logical connectives like "and" and "or". First-order logic is a bit more expressive; it also includes existential and universal quantifiers. In addition, the variables of a first-order formula refer to concrete entities in the universe. Just like programming languages, logics have different levels of expressive power. Sometimes, restricting expressive power can be beneficial; although it reduces the scope of the things we can model in a particular language or logic, it makes it easier to reason about and understand that language or logic.

Previously, we used Alloy to model the semantics of Boolean logic by recursively evaluating the truth values of the subformulas of a given formula, combining the results to get an overall truth value for the formula. Defining the semantics for first-order logic will prove a bit more complicated.

Consider the Alloy specification below:


sig Node { edges : set Node }

run { all n : Node | n->n in edges } for exactly 2 Node

How would we go about evaluating the truth of all n : Node | n->n in edges in a particular situation, like this one?

[Figure: an instance with two nodes, N1 and N2]

We'll certainly need to know something about the truth value of n->n in edges, but this subformula alone can't have a truth value if we don't know what n is. Somehow, we need to evaluate this subformula multiple times with different possible values of n, like N1 and N2.

The way Alloy tackles this problem is to convert the user's specification (which may contain quantifiers, as well as operators like relational join and transitive closure) into a formula in Boolean logic. Once this conversion has taken place, Alloy passes the Boolean formula to a SAT solver. If the SAT solver reports that the specification is UNSAT, the user is told that their constraints were unsatisfiable. Otherwise, the output of the SAT solver is translated back into an instance.
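The pipeline described above can be sketched in Python for the two-node spec. This is only an illustration under stated assumptions: the brute-force search stands in for a real SAT solver, the scope is taken to be exactly 2 Node, and the names ATOMS, VARIABLES, brute_force_solve, and to_instance are hypothetical, not part of Alloy.

```python
from itertools import product

ATOMS = ["N1", "N2"]
# Assumption: one Boolean variable per potential tuple of the edges relation.
VARIABLES = [(a, b) for a in ATOMS for b in ATOMS]

def formula(assignment):
    """Boolean form of: all n : Node | n->n in edges (for exactly 2 Node)."""
    return assignment[("N1", "N1")] and assignment[("N2", "N2")]

def brute_force_solve(variables, formula):
    """Stand-in for a SAT solver: return a satisfying assignment, or None."""
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if formula(assignment):
            return assignment
    return None

def to_instance(assignment):
    """Translate a satisfying assignment back into an edges relation."""
    return {pair for pair, value in assignment.items() if value}

solution = brute_force_solve(VARIABLES, formula)
instance = None if solution is None else to_instance(solution)
print("UNSAT" if instance is None else sorted(instance))
```

A real solver is far smarter than exhaustive enumeration, but the three stages (translate, solve, translate back) have the same shape.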

How exactly does Alloy do the translation from the user-level specification into Boolean logic? For the spec above, we can rewrite all n : Node | n->n in edges as N1->N1 in edges and N2->N2 in edges. That is, universal quantification can be rewritten as conjunction. We can similarly rewrite existential quantification as disjunction.
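As a sanity check on this rewriting, the following Python sketch compares each quantified formula with its expansion over every possible edges relation on two nodes. It assumes the scope is exactly 2 Node; the function names are hypothetical and chosen only for this illustration.

```python
from itertools import product

NODES = ["N1", "N2"]
PAIRS = [(a, b) for a in NODES for b in NODES]

def forall_self_loop(edges):
    """all n : Node | n->n in edges, evaluated by iterating over the nodes."""
    return all((n, n) in edges for n in NODES)

def expanded_conjunction(edges):
    """N1->N1 in edges and N2->N2 in edges."""
    return (("N1", "N1") in edges) and (("N2", "N2") in edges)

def exists_self_loop(edges):
    """some n : Node | n->n in edges, evaluated by iterating over the nodes."""
    return any((n, n) in edges for n in NODES)

def expanded_disjunction(edges):
    """N1->N1 in edges or N2->N2 in edges."""
    return (("N1", "N1") in edges) or (("N2", "N2") in edges)

def all_edge_relations():
    """Yield every subset of Node->Node (16 relations for two nodes)."""
    for bits in product([False, True], repeat=len(PAIRS)):
        yield {pair for pair, bit in zip(PAIRS, bits) if bit}

agree = all(
    forall_self_loop(e) == expanded_conjunction(e)
    and exists_self_loop(e) == expanded_disjunction(e)
    for e in all_edge_relations()
)
print(agree)
```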

However, these reformulations rely upon our knowledge that there are exactly 2 Node. If we were to remove the keyword exactly, we could get an instance with one or zero nodes, for which N1->N1 in edges and N2->N2 in edges would be false, even if all n: Node | n->n in edges was true. We fix this problem using implication:


sig Node { edges : set Node }

run { all n : Node | n->n in edges } for 2 Node

(N1 in Node implies N1->N1 in edges) and (N2 in Node implies N2->N2 in edges)
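The effect of the implication guard can be checked by brute force. The Python sketch below enumerates every instance within a scope of 2 (any subset of {N1, N2} as Node, and any edges relation over the nodes that are present) and compares the quantified formula against the guarded and unguarded expansions. The enumeration is an assumption made for illustration; Alloy itself may prune symmetric instances, and the helper names are hypothetical.

```python
from itertools import product

UNIVERSE = ["N1", "N2"]

def instances():
    """Yield (nodes, edges) for every instance within scope 2."""
    for present in product([False, True], repeat=len(UNIVERSE)):
        nodes = {a for a, p in zip(UNIVERSE, present) if p}
        pairs = [(a, b) for a in sorted(nodes) for b in sorted(nodes)]
        for bits in product([False, True], repeat=len(pairs)):
            yield nodes, {pair for pair, bit in zip(pairs, bits) if bit}

def quantified(nodes, edges):
    """all n : Node | n->n in edges, over the nodes actually present."""
    return all((n, n) in edges for n in nodes)

def unguarded(nodes, edges):
    """N1->N1 in edges and N2->N2 in edges, ignoring which nodes exist."""
    return all((a, a) in edges for a in UNIVERSE)

def guarded(nodes, edges):
    """(a in Node implies a->a in edges) for each atom a in the universe."""
    return all((a not in nodes) or ((a, a) in edges) for a in UNIVERSE)

guarded_matches = all(quantified(n, e) == guarded(n, e) for n, e in instances())
unguarded_matches = all(quantified(n, e) == unguarded(n, e) for n, e in instances())
print(guarded_matches, unguarded_matches)
```

The unguarded expansion already disagrees on the empty instance, where the quantified formula is vacuously true.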

After we get rid of pesky non-Boolean quantifications, what comes next? Our formula is in terms of the objects in the instance: it refers to specific nodes and the edge relation, neither of which the SAT solver understands. We need to replace complex statements like N1 in Node and N2->N2 in edges with Boolean variables. This particular situation requires exactly 6 variables: two for membership in Node (N1 in Node, N2 in Node) and four for the possible edges (N1->N1 in edges, N1->N2 in edges, N2->N1 in edges, N2->N2 in edges).
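One way to see where the count of 6 comes from is to enumerate the statements that need a Boolean variable: one per atom for membership in Node, and one per ordered pair of atoms for edges. The Python sketch below does this for a scope of 2. The variable names x1 through x6 and the function primary_variables are hypothetical; Alloy's internal numbering may differ.

```python
ATOMS = ["N1", "N2"]

def primary_variables(atoms):
    """List the statements that each get their own Boolean variable."""
    membership = [f"{a} in Node" for a in atoms]
    edge_vars = [f"{a}->{b} in edges" for a in atoms for b in atoms]
    return membership + edge_vars

variables = primary_variables(ATOMS)
for index, name in enumerate(variables, start=1):
    print(f"x{index}: {name}")
```

Under this counting, a scope of k atoms needs k + k*k variables for this signature.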

Alloy calls these replacements the primary variables. If you run Alloy in verbose mode, you can see exactly how many primary variables are generated in the translation process. Typically, relating the number of primary vars to your model is simple, but in some cases it's not immediately obvious. For example, consider altering our model to have an existential quantifier rather than a universal one:


sig Node { edges : set Node }

run { some n : Node | n->n in edges } for exactly 2 Node

If you run this version of the model, you'll notice it also has 6 primary vars. But since the scope is exactly 2 Node, both N1 and N2 are guaranteed to be in every satisfying instance, so we can toss out the two membership variables that were necessary for the universal quantification, leaving us with just 4. So where are the extra 2 variables coming from?

Recall that existential quantification is carried out by introducing a Skolem relation which holds a witness: an atom for which the statement inside the existential quantifier holds. That is, we have some new one-place relation which picks out a member of Node that has a self loop. Those extra 2 variables represent the statements N1 in skolem_relation and N2 in skolem_relation.

What about all that other information in the verbose Alloy output? The clauses and non-primary variables? These are a consequence of some additional work we need to do to get our Boolean spec into the SAT solver-friendly format known as Conjunctive Normal Form (CNF).

A formula in CNF is the conjunction of a set of clauses, where each clause is the disjunction of some number of variables, each of which may or may not be negated. In order to get our formula in CNF, then, we have to remove any connective that isn't "or", "and", or "not". We also have to move things around so that the whole formula is a conjunction of disjunctions. There are a variety of logical equivalences we can use to make this transformation, but in practice doing repeated transformations of the entire formula is quite inefficient, both space-wise and time-wise. For example, consider trying to transform a disjunction of n conjunctions to CNF.

(F1 and G1) or ... or (Fn and Gn)

The distributive laws of Boolean logic guarantee that (P and Q) or R is equivalent to (P or R) and (Q or R), so we can start like so:

(F1 or ((F2 and G2) or ... or (Fn and Gn))) and (G1 or ((F2 and G2) or ... or (Fn and Gn)))

But now we need to straighten out the second disjunct of both conjuncts, creating 4 new clauses, each of which needs more rearranging (which will give us 8 new clauses to fix). Ultimately, we generate a number of clauses that is exponential in n (2^n when each Fi and Gi is a single variable), which is pretty dismal.
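The blow-up is easy to observe directly. The Python sketch below applies the distributive law exhaustively to a disjunction of n two-variable conjunctions: each resulting clause picks one variable from every conjunction. It assumes each Fi and Gi is a distinct atomic variable; the function names are hypothetical.

```python
from itertools import product

def distribute(dnf):
    """Convert a disjunction of conjunctions into CNF by distribution.

    dnf is a list of conjunctions, each a list of variable names.
    The result is a set of clauses, one for each way of choosing
    a single variable from every conjunction.
    """
    return {frozenset(choice) for choice in product(*dnf)}

def example(n):
    """Build (F1 and G1) or ... or (Fn and Gn) as a list of conjunctions."""
    return [[f"F{i}", f"G{i}"] for i in range(1, n + 1)]

for n in range(1, 6):
    print(n, len(distribute(example(n))))
```

For n = 2 this yields the four clauses (F1 or F2), (F1 or G2), (G1 or F2), and (G1 or G2), and the count doubles with each additional conjunction.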

Alloy solves this problem by introducing some extra variables. In addition to having variables that directly describe the relations in any instance, Alloy adds variables that represent entire subformulas of the Boolean formula. In our example above with the F's and G's, Alloy would add n extra variables V1, ..., Vn, where Vi stands for the whole subformula (Fi and Gi). To encode the equivalence of Vi and (Fi and Gi), Alloy introduces just three more clauses for each i:

2 clauses to encode Vi implies (Fi and Gi):

(not Vi or Fi) and (not Vi or Gi)

1 clause to encode (Fi and Gi) implies Vi:

not Fi or not Gi or Vi

The final CNF formula is the conjunction of the single clause V1 or V2 or ... or Vn with the 3n additional clauses that relate the original conjuncts to the V variables, for 3n + 1 clauses in total. That is, the number of new clauses grows linearly in n, the number of original disjuncts. It is these extra V variables, and the clauses created using them, which account for the large numbers of vars and clauses Alloy generates even for small specifications.
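The encoding with extra variables can be written out and checked by brute force. The Python sketch below builds the clauses described above for a given n and then verifies, for every assignment to the F's and G's, that the original formula is true exactly when some choice of the V's satisfies the CNF. It is a sketch under assumptions: each Fi and Gi is treated as an atomic variable, a literal is represented as a (name, polarity) pair, and the function names are hypothetical rather than anything Alloy exposes.

```python
from itertools import product

def encode(n):
    """CNF for (F1 and G1) or ... or (Fn and Gn) using extra variables V1..Vn."""
    clauses = [[(f"V{i}", True) for i in range(1, n + 1)]]  # V1 or ... or Vn
    for i in range(1, n + 1):
        f, g, v = f"F{i}", f"G{i}", f"V{i}"
        clauses.append([(v, False), (f, True)])              # not Vi or Fi
        clauses.append([(v, False), (g, True)])              # not Vi or Gi
        clauses.append([(f, False), (g, False), (v, True)])  # not Fi or not Gi or Vi
    return clauses

def satisfies(assignment, clauses):
    """True if every clause contains at least one literal made true."""
    return all(
        any(assignment[name] == polarity for name, polarity in clause)
        for clause in clauses
    )

def original(assignment, n):
    """Evaluate (F1 and G1) or ... or (Fn and Gn) directly."""
    return any(assignment[f"F{i}"] and assignment[f"G{i}"] for i in range(1, n + 1))

def check(n):
    """Compare the original formula with the CNF on all F/G assignments."""
    clauses = encode(n)
    base = [f"{x}{i}" for i in range(1, n + 1) for x in "FG"]
    extra = [f"V{i}" for i in range(1, n + 1)]
    for base_bits in product([False, True], repeat=len(base)):
        assignment = dict(zip(base, base_bits))
        extendable = any(
            satisfies({**assignment, **dict(zip(extra, extra_bits))}, clauses)
            for extra_bits in product([False, True], repeat=n)
        )
        if extendable != original(assignment, n):
            return False
    return True

for n in range(1, 4):
    print(n, len(encode(n)), check(n))
```

The clause count is 3n + 1, compared with 2^n for the purely distributive approach.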