CS1950Y Lecture 25: Intro to Datalog
April 11, 2018
Scheduling Announcements
Thursday, 4/19: Design Check 2 is due by the end of the day; make sure to schedule an appointment with your mentor TA on or before this date.
Friday 4/20: Molhan Aref will be giving a guest lecture!
Friday 4/27: No lecture this day. Have a great Spring Weekend!
Friday 4/30: Ed Wilson will be giving a guest lecture! In addition, this is the first day of final project presentations.
And now back to logic!
What features does a good programming language have? When Tim asked this in class, he got some of the answers below:
- syntax that is easy to learn, read and remember
- useful documentation
- informative error messages
- library support
- polymorphism
- arithmetic operators
- file I/O
- fast build system
- efficiency
But wait--answering this question well involves knowing what the language in question is being used for. The features that are useful for one application may not actually be all that helpful in an entirely different application.
In fact, for some applications we don't even need our language to be Turing complete! If you're not familiar with the notion of Turing completeness, you can think of it as a measure of a language's expressive power: anything you can compute in one Turing complete language, like Java or C, you can also compute in any other.
For the next few lectures, we'll be looking at a language called Datalog, which is not Turing complete. Datalog is used to encode databases as relations, and then build new relations (called database views) on top of these existing relations.
Let's start by describing a database for some social network, which holds information about who is friends with whom, as well as the senders and receivers of recently sent messages. We can declare that alice is friends with bob thus:
friend(alice, bob).
Note that the relation friend is not symmetric; the code above does not mean that bob is friends with alice.
Let's add a few more pieces of information to our database. Firstly, alice is also friends with eve, and alice recently sent a message to bob:
friend(alice, bob).
friend(alice, eve).
recentMessage(alice, bob).
Our database now includes two binary relations, friend and recentMessage. Let's say we want to distinguish especially good friendships on our social network so that we can choose whose content to display first in a user's hypothetical news feed. We can define a new relation goodFriend based upon the existing relations. The definition of goodFriend below can be read as a backwards implication: if X is friends with Y and X recently sent a message to Y, then X stands in the goodFriend relation to Y:
goodFriend(X,Y) :- friend(X,Y), recentMessage(X,Y).
If we ask Datalog to give us all the pairs of people that stand in the goodFriend relation, it gives us only (alice, bob). This might seem surprising, given your experience with Alloy; since goodFriend is defined using an implication, wouldn't it be possible for other pairs of people for whom the antecedent is not true to be in this relation along with (alice, bob)?
Although that would be possible in Alloy, Datalog always presents the smallest possible version of a newly defined relation like goodFriend. So although the semantics of implication allows for larger satisfying versions of the relation, we won't see them.
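To make the "smallest possible version" concrete, here is a minimal Python sketch of what evaluating the goodFriend rule amounts to on our example database. Python stands in for a Datalog engine here, and the names good_friend and recent_message are ours, not part of Datalog; the sketch assumes each relation is simply a set of pairs.

```python
# Facts from the example database, stored as sets of (X, Y) tuples.
friend = {("alice", "bob"), ("alice", "eve")}
recent_message = {("alice", "bob")}

def good_friend(friend, recent_message):
    """Pairs derived by: goodFriend(X,Y) :- friend(X,Y), recentMessage(X,Y)."""
    # A pair is derived only when both body atoms hold for the same X and Y.
    # Nothing else is ever added, which is why the result is the smallest
    # relation that satisfies the implication.
    return {(x, y) for (x, y) in friend if (x, y) in recent_message}

print(good_friend(friend, recent_message))  # {('alice', 'bob')}
```

The pair (alice, eve) is left out because there is no matching recentMessage fact, and no rule gives us any other reason to include it.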
Next up, we'll define the classic notion of graph reachability using Datalog. Let edge be a binary relation on nodes such that edge(X,Y) indicates that there is an edge from node X to node Y. We'll define reachability inductively: a node X stands in the reach relation to node Y if one of the following things is true:
- There is an edge directly from X to Y
- There is an edge directly from X to some node Z, and Y is reachable from Z
We encode these two possibilities like so:
reach(X,Y) :- edge(X,Y).
reach(X,Y) :- edge(X,Z), reach(Z,Y).
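One way to picture how a recursive definition like this gets evaluated is to apply the rules over and over until nothing new is derived. The Python sketch below does exactly that (often called naive bottom-up evaluation). It is an illustration under the assumption that relations are sets of pairs, not a description of how any particular Datalog engine is implemented, and the example graph is made up.

```python
def reach(edge):
    """Least relation satisfying:
         reach(X,Y) :- edge(X,Y).
         reach(X,Y) :- edge(X,Z), reach(Z,Y).
    """
    result = set()
    while True:
        # Apply both rules once to the facts derived so far.
        derived = set(edge) | {
            (x, y) for (x, z) in edge for (z2, y) in result if z == z2
        }
        if derived <= result:  # nothing new was derived: we are done
            return result
        result |= derived

# Hypothetical example graph: a -> b -> c, plus a self-loop on d.
edge = {("a", "b"), ("b", "c"), ("d", "d")}
print(sorted(reach(edge)))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('d', 'd')]
```

Because the graph is finite, there are only finitely many pairs to derive, so the loop always stops.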
There are actually a few different ways we could have formulated the inductive case of this definition. Two possible alternatives are these:
reach(X,Y) :- edge(X,Y).
reach(X,Y) :- reach(X,Z), edge(Z,Y).

reach(X,Y) :- edge(X,Y).
reach(X,Y) :- reach(X,Z), reach(Z,Y).
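If you'd like to convince yourself that all three formulations describe the same relation, the following Python sketch evaluates each one on a small made-up graph and compares the results. The helpers fixpoint and join and the dictionary formulations are our own names for illustration, and checking one graph is of course a sanity check rather than a proof.

```python
def join(left, right):
    """Pairs (x, y) such that left has (x, z) and right has (z, y) for some z."""
    return {(x, y) for (x, z) in left for (z2, y) in right if z == z2}

def fixpoint(recursive_rule, edge):
    """Apply the base rule and the given recursive rule until nothing changes."""
    result = set()
    while True:
        derived = set(edge) | recursive_rule(edge, result)
        if derived <= result:
            return result
        result |= derived

# The body of the recursive rule in each of the three formulations.
formulations = {
    "edge(X,Z), reach(Z,Y)": lambda edge, reach: join(edge, reach),
    "reach(X,Z), edge(Z,Y)": lambda edge, reach: join(reach, edge),
    "reach(X,Z), reach(Z,Y)": lambda edge, reach: join(reach, reach),
}

# Hypothetical example graph: 1 -> 2 -> 3 -> 4 -> 2 (contains a cycle).
edge = {(1, 2), (2, 3), (3, 4), (4, 2)}
results = [fixpoint(rule, edge) for rule in formulations.values()]
print(results[0] == results[1] == results[2])  # True
```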
We can also define the notion of non-reachability as the relation nr using our notion of reachability (here node is a unary relation that holds of every node in the graph) like this:
nr(X,Y) :- node(X), node(Y), not(reach(X,Y)).
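Here is a rough Python sketch of what this rule computes, again treating relations as sets. Two things are worth noticing: reach is computed completely before we negate it, and node(X), node(Y) give the negation a finite set of candidates to range over. The function names and the example graph are ours.

```python
def reach(edge):
    """Least relation closed under the two reach rules (naive evaluation)."""
    result = set()
    while True:
        derived = set(edge) | {
            (x, y) for (x, z) in edge for (z2, y) in result if z == z2
        }
        if derived <= result:
            return result
        result |= derived

def non_reach(node, edge):
    """Pairs derived by: nr(X,Y) :- node(X), node(Y), not(reach(X,Y))."""
    reachable = reach(edge)  # finish computing reach before negating it
    return {(x, y) for x in node for y in node if (x, y) not in reachable}

# Hypothetical example graph: a -> b -> c.
node = {"a", "b", "c"}
edge = {("a", "b"), ("b", "c")}
print(sorted(non_reach(node, edge)))
# [('a', 'a'), ('b', 'a'), ('b', 'b'), ('c', 'a'), ('c', 'b'), ('c', 'c')]
```

Note that (a, a) is in nr: with these rules a node only reaches itself if it lies on a cycle.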
If we were writing reachability in Alloy, we would probably use the handy dandy transitive closure operator to do so. However, we could also translate our Datalog definition of reachability to Alloy. Such a translation would look something like this:
sig Node {
    edge: set Node
}
one sig Helper {
    reach: Node -> Node
}
fact reach {
    all n1, n2, n3: Node |
        n1->n3 in edge or (n1->n2 in edge and n2->n3 in reach) implies n1->n3 in reach
}
However, this definition won't quite do! Since Alloy will show larger instances than strictly necessary, using implication to define reach isn't quite enough; we'll need to use iff instead.
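To see why implication alone leaves room for larger instances, we can check candidate reach relations against the two implications directly. The Python sketch below is only an illustration on a made-up three-node graph; satisfies_rules is a helper name of ours and has nothing to do with how Alloy searches for instances.

```python
from itertools import product

def satisfies_rules(edge, reach):
    """True if reach satisfies both implications from the reach definition."""
    # edge(X,Y) implies reach(X,Y)
    base = all(pair in reach for pair in edge)
    # edge(X,Z) and reach(Z,Y) implies reach(X,Y)
    step = all((x, y) in reach
               for (x, z) in edge for (z2, y) in reach if z == z2)
    return base and step

# Hypothetical example graph: a -> b -> c.
node = {"a", "b", "c"}
edge = {("a", "b"), ("b", "c")}
smallest = {("a", "b"), ("b", "c"), ("a", "c")}  # what Datalog would report
everything = set(product(node, repeat=2))         # every pair of nodes

print(satisfies_rules(edge, smallest))    # True
print(satisfies_rules(edge, everything))  # True
print(satisfies_rules(edge, edge))        # False: (a, c) is missing
```

Both smallest and everything satisfy the implications, so a tool that is free to pick any satisfying relation may well show the larger one.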
It turns out this isn't the only modification this solution will need; stay tuned for next lecture for more info!