CS1950Y Lecture 25: Intro to Datalog
April 11, 2018

Scheduling Announcements

Thursday, 4/19: Design Check 2 is due by the end of the day; make sure to schedule an appointment with your mentor TA on or before this date.

Friday 4/20: Molhan Aref will be giving a guest lecture!

Friday 4/27: No lecture this day. Have a great Spring Weekend!

Friday 4/30: Ed Wilson will be giving a guest lecture! In addition, this is the first day of final project presentations.

And now back to logic!

What features does a good programming language have? When Tim asked this in class, he got some of the answers below:

syntax that is easy to learn, read and remember
useful documentation
informative error messages
library support
polymorphism
arithmetic operators
file I/O
fast build system
efficiency

But wait--answering this question well involves knowing what the language in question is being used for. The features that are useful for one application may not actually be all that helpful in an entirely different application.

In fact, for some applications we don't even need our language to be Turing complete! If you're not familiar with the notion of Turing completeness, you can think of it as a measure of a language's expressive power; in a Turing complete language, you can compute anything that you could compute in any other Turing complete language, like Java or C.

For the next few lectures, we'll be looking at a language called Datalog, which is not Turing complete. Datalog is used to encode databases as relations, and then build new relations (called database views) on top of these existing relations.

Let's start by describing a database for some social network, which holds information about who is friends with whom, as well as the senders and receivers of recently sent messages. We can declare that alice is friends with bob thus:

friend(alice, bob).

Note that the relation friend is not symmetric; the code above does not mean that bob is friends with alice.

Let's add a few more pieces of information to our database. Firstly, alice is also friends with eve, and alice recently sent a message to bob:

friend(alice, bob).
friend(alice, eve).
recentMessage(alice, bob).

Our database now includes two binary relations, friend and recentMessage. Let's say we want to distinguish especially good friendships on our social network so that we can choose whose content to display first in a user's hypothetical news feed. We can define a new relation goodFriend based upon the existing relations. The definition of goodFriend below can be read as a backwards implication: if X is friends with Y and X recently sent a message to Y, then X stands in the goodFriend relation to Y:

goodFriend(X,Y) :- friend(X,Y), recentMessage(X,Y).

If we ask Datalog to give us all the pairs of people that stand in the goodFriend relation, it gives us only (alice, bob). This might seem surprising, given your experience with Alloy; since goodFriend is defined using an implication, wouldn't it be possible for other pairs of people for whom the antecedent is not true to be in this relation along with (alice, bob)?

Although that would be possible in Alloy, Datalog always presents the smallest possible version of a newly defined relation like goodFriend. So although the semantics of implication allows for larger satisfying versions of the relation, we won't see them.
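To make the "smallest possible relation" idea concrete, here is a minimal sketch in Python (not Datalog) that computes goodFriend from the example facts. The function name good_friend and the set-of-pairs representation are assumptions made for illustration; a real Datalog engine does this work for you.

```python
# Facts from the example database, stored as sets of (X, Y) pairs.
friend = {("alice", "bob"), ("alice", "eve")}
recent_message = {("alice", "bob")}


def good_friend(friend, recent_message):
    """Smallest relation satisfying
    goodFriend(X,Y) :- friend(X,Y), recentMessage(X,Y).
    (Hypothetical helper, not part of any Datalog system.)"""
    # A pair is included only when both body atoms hold for it;
    # nothing else is ever added.
    return {pair for pair in friend if pair in recent_message}


print(good_friend(friend, recent_message))  # {('alice', 'bob')}
```

The pair (alice, eve) is left out because nothing forces it in: the rule only ever adds pairs whose body is satisfied.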

Next up, we'll define the classic notion of graph reachability using Datalog. Let edge be a binary relation on nodes such that edge(X,Y) indicates that there is an edge from node X to node Y. We'll define reachability inductively--a node X stands in the reach relation to node Y if one of the following things is true:

There is an edge directly from X to Y
There is an edge directly from X to Z, and Y is reachable from Z

We encode these two possibilities like so:

reach(X,Y) :- edge(X,Y).
reach(X,Y) :- edge(X,Z), reach(Z,Y).
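One way to picture how the smallest reach relation can be found is to start from the empty relation and keep applying both rules until nothing new appears. The sketch below does this in Python; it is a naive illustration under that assumption, not a description of how any particular Datalog engine is implemented, and the name reach_from_edges is made up for this example.

```python
def reach_from_edges(edge):
    """Least relation satisfying
         reach(X,Y) :- edge(X,Y).
         reach(X,Y) :- edge(X,Z), reach(Z,Y).
    `edge` is a set of (X, Y) pairs."""
    reach = set()
    while True:
        # Base rule: every edge is a reach pair.
        derived = set(edge)
        # Inductive rule: join edge(X,Z) with reach(Z,Y) on the shared Z.
        derived |= {(x, y) for (x, z) in edge for (z2, y) in reach if z == z2}
        if derived == reach:
            # Applying the rules adds nothing new, so we are done.
            return reach
        reach = derived


edge = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(reach_from_edges(edge)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

Because the loop only ever adds pairs that some rule demands, the result contains no pair without a supporting path.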

There are actually a few different ways we could have formulated the inductive case of this definition. Two possible alternatives are these:

reach(X,Y) :- edge(X,Y).
reach(X,Y) :- reach(X,Z), edge(Z,Y).

reach(X,Y) :- edge(X,Y).
reach(X,Y) :- reach(X,Z), reach(Z,Y).
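All three formulations describe the transitive closure of edge, so their smallest solutions should coincide. The following Python sketch checks this on one small graph; the helper names (least_fixpoint, compose, and the three *_recursive functions) are hypothetical and exist only for this illustration.

```python
def least_fixpoint(step):
    """Apply `step` starting from the empty relation until it stabilizes."""
    rel = set()
    while True:
        nxt = step(rel)
        if nxt == rel:
            return rel
        rel = nxt


def compose(r, s):
    """Pairs (x, y) such that r(x, z) and s(z, y) hold for some z."""
    return {(x, y) for (x, z) in r for (z2, y) in s if z == z2}


def right_recursive(edge):   # reach(X,Y) :- edge(X,Z), reach(Z,Y).
    return least_fixpoint(lambda reach: edge | compose(edge, reach))


def left_recursive(edge):    # reach(X,Y) :- reach(X,Z), edge(Z,Y).
    return least_fixpoint(lambda reach: edge | compose(reach, edge))


def doubly_recursive(edge):  # reach(X,Y) :- reach(X,Z), reach(Z,Y).
    return least_fixpoint(lambda reach: edge | compose(reach, reach))


# A small graph with a cycle a -> b -> c -> a and an extra edge c -> d.
edge = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
assert right_recursive(edge) == left_recursive(edge) == doubly_recursive(edge)
```

Each function includes the base rule reach(X,Y) :- edge(X,Y) through the `edge |` term; only the inductive rule differs.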

We can also define the notion of non-reachability as the relation nr using our notion of reachability like this:

nr(X,Y) :- node(X), node(Y), not(reach(X,Y)).
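Here is a rough Python sketch of what the nr rule computes. It assumes that reach is computed in full before the negation is evaluated, so that "not reach(X,Y)" has a definite answer for every pair; the helper names transitive_reach and non_reach are invented for this example.

```python
def transitive_reach(edge):
    """Least reach relation for a set of (X, Y) edge pairs."""
    reach = set()
    while True:
        derived = set(edge) | {
            (x, y) for (x, z) in edge for (z2, y) in reach if z == z2
        }
        if derived == reach:
            return reach
        reach = derived


def non_reach(node, edge):
    """nr(X,Y) :- node(X), node(Y), not(reach(X,Y))."""
    # Finish computing reach first, then take the complement over node x node.
    reach = transitive_reach(edge)
    return {(x, y) for x in node for y in node if (x, y) not in reach}


node = {"a", "b", "c"}
edge = {("a", "b"), ("b", "c")}
print(sorted(non_reach(node, edge)))
# [('a', 'a'), ('b', 'a'), ('b', 'b'), ('c', 'a'), ('c', 'b'), ('c', 'c')]
```

The node(X), node(Y) atoms matter: they give the rule a finite set of candidate pairs to range over, which the negated atom alone could not provide.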

If we were writing reachability in Alloy, we would probably use the handy dandy transitive closure operator to do so. However, we could also translate our Datalog definition of reachability to Alloy. Such a translation would look something like this:

sig Node {
    edge: set Node
}

one sig Helper {
    reach: Node -> Node
}

fact reach {
    all n1, n2, n3: Node |
        (n1->n3 in edge or (n1->n2 in edge and n2->n3 in Helper.reach)) implies n1->n3 in Helper.reach
}

However, this definition won't quite do! Since Alloy will show larger instances than strictly necessary, using implication to define reach isn't quite enough; we'll need to use iff instead. It turns out this isn't the only modification this solution will need; stay tuned for next lecture for more info!
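To see concretely why the implication alone is too permissive, the Python sketch below checks whether a candidate reach relation satisfies the implication-only constraint from the Alloy fact. The function satisfies_implication and the tiny three-node graph are assumptions made for illustration; this is not Alloy code and says nothing about how Alloy searches for instances.

```python
from itertools import product


def satisfies_implication(node, edge, reach):
    """True if, for all n1, n2, n3 in node,
    (edge(n1,n3) or (edge(n1,n2) and reach(n2,n3))) implies reach(n1,n3)."""
    for n1, n2, n3 in product(node, repeat=3):
        antecedent = (n1, n3) in edge or (
            (n1, n2) in edge and (n2, n3) in reach
        )
        if antecedent and (n1, n3) not in reach:
            return False
    return True


node = {"a", "b", "c"}
edge = {("a", "b")}

minimal = {("a", "b")}                     # what Datalog would report
everything = set(product(node, repeat=2))  # every pair of nodes

print(satisfies_implication(node, edge, minimal))     # True
print(satisfies_implication(node, edge, everything))  # True
print(satisfies_implication(node, edge, set()))       # False
```

Both the minimal relation and the relation containing every pair pass the check, so an implication-only fact cannot rule out the oversized one.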
