Posted on October 8, 2018 by Mistral Contrastin

Automatic Subgoal Scheduling for Datalog

In which we investigate dataflow in declarative programs and talk about how to make Datalog safer by statically reordering goals when predicates with dataflow constraints are involved.

I recently presented “Automatic Reordering for Dataflow Safety of Datalog” (can be accessed freely if you follow the link from my homepage) in the 20th Symposium on Principles and Practice of Declarative Languages. I collaborated with Dr Dominic Orchard and Dr Andrew Rice for this work. A few commented that the paper is technically dense and consequently hard going. I don’t want this work to be ignored because I suck at writing. So in an attempt to make things right and also to do some PR, I will explain the major ideas of the paper without the formalism and with numerous examples.

I want this post to be understandable with ease. If you find yourself confused or bored at any point or found some terms that are not very well explained, contact me via Twitter or email (both on my homepage).

After explaining some standard terminology about Datalog, we start by describing what extralogical predicates are, why one might want them in Datalog, and what problems they bring along. Then we clarify what is meant by invocation safety of such predicates. This will lead us to the naïve, incomplete, and inefficient solution and that will lead us to a better solution. I’ll conclude by giving summarising the main points, mentioning where this work work might go and why you might like to look at the paper.

Terminology

Let’s briefly look at basic Datalog terminology through an example:

ancestor(X,Z) :- parent(X,Y), parent(Y,Z).

                 -----------  -----------
                   Subgoal      Subgoal
-------------    ------------------------
    Head                   Body
------------------------------------------
                  Rule

This is a rule in Datalog (as it happens also in Prolog) with some parts highlighted.
A rule consists of a head (to the left of :-) and a body (everything to the right of :-). A body is a comma separated list of subgoals.
A subgoal is a predicate applied to a tuple.
A predicate qualifies a tuple with a name (sort). Throughout the text, we write subgoals in monospace and predicates in \(italic\). We sometimes say “the \(p\) subgoal” to refer to a subgoal formed by applying the predicate \(p\) to a tuple. We also say “the \(p\) rule(s)” to mean the rules with the predicate \(p\) featured in the head.
A tuple of subgoal consists of terms which are either constants like "Dragon Fruit" (with quotation marks) or variables like Fruit (no quotations, always capitalised).

We’re good to go! All other terms will be defined in the text as we go along.

Motivation

Example 1

Here’s a hypothetical Datalog rule that authenticates a user:

auth(User,Pass) :- password(User,Pass), hash(Pass,Hash), valid(User,Hash).

The reading of any rule in Datalog is “if every subgoal in the body holds, then we can conclude the head.” In this specific case, we can authenticate a user if we have a known password for a that user and the hash of that user is known to be valid. One of the nicest parts of this reading is that it doesn’t matter which order the subgoals are presented, we still reach the same conclusion. For example,

auth(User,Pass) :- hash(Pass,Hash), password(User,Pass), valid(User,Hash).

would authenticate exactly the same users as the previous example.

When the predicates in a program are logical, there are no dataflow restrictions for evaluation of a subgoal with that predicate. For example, password("Milner",Pass) would bind the password of Milner to Pass while password(User,Pass) would generate all known user and password pairs. There are no inherent input and output parameters to a logical predicate.

However, the \(hash\) predicate stands out because hash functions (if they are any good) work only in one direction. It is unreasonable to expect \(hash\) to bind a value for the password given a hash. Let’s refer to predicates like \(hash\) (those with dataflow requirements) as extralogical predicates.

In light of this knowledge, the versions of the rule defining \(auth\) above are not equivalent. Evaluation of the subgoal hash(Pass,Hash) produces a runtime error if the value of Pass is unknown. Suddenly, we have to be careful about the way we order subgoals. Not only those that are obviously extralogical like \(hash\), but also any user-defined predicates that make use of extralogical ones! To make the matters worse, although semantically arithmetic relations such as \(>\) are logical, they are usually implemented as extralogical predicates (or relations).

Example 2

Just when you think things are looking bleak, it gets worse. Extralogical predicates may also cause code duplication. Let’ look at another example:

check_client(Pass) :- weak(Pass,Hash).
check_server(Hash) :- weak(Pass,Hash).

weak(Pass,Hash) :- hash(Pass,Hash), rainbow(Pass,Hash).

Here, we define a predicate \(weak\) that checks if a password or a hash is weak. It does so looking for the hash of the password in a rainbow function (a function that can reverse some hashes). Like \(hash\), \(rainbow\) is also extralogical but in its second argument rather than the first. Now it is beneficial for both the client and the server to check if a hash is weak, but the client shouldn’t know the hash of a given password (otherwise it can construct a rainbow table of its own!) and the server shouldn’t know the password (in case it gets stolen). Hence, there are two predicates \(check\_client\) that takes a password and \(check\_server\) that takes a hash that can be used by the client and server respectively.

This example is worse than the previous one because regardless how careful the programmer is, she can’t rewrite the definition of \(weak\) in a single rule such that all dataflow requirements are satisfied. Now when \(check\_client\) is used, \(weak\) should have the subgoal hash(Pass,Hash) first, since the password is known (satisfying \(hash\)’s dataflow requirement) and evaluating this subgoal produces a hash, satisfying the dataflow requirement of \(rainbow\). If \(check\_server\) is used, we need the opposite ordering because this time the hash value is known initially. So neither rule alone can be used to satisfy the requirements of the whole program. The only solution is a code duplicating rewrite:

check_client(Pass) :- weak_client(Pass,Hash).
check_server(Hash) :- weak_server(Pass,Hash).

weak_client(Pass,Hash) :- hash(Pass,Hash), rainbow(Pass,Hash).
weak_server(Pass,Hash) :- rainbow(Pass,Hash), hash(Pass,Hash).

Example 3

If you’re still not sold on the idea, I have one final motivating example. Even in the purely logical implementations of Datalog, dataflow problems exist. Negation is indispensable in logic programming. However, there is a difficulty with it. Consider the following rule:

accessed("Mistral").
accessed("Hattie").
accessed("Rebecca").

password("Hattie" ,"171717").
password("Rebecca","242424").

guest(User) :- not password(User,Pass).

?- accessed(User), guest(User).

This program poses the query “out of those who are known to access the system, which ones are guests?”. Being a guest is defined as not having a password recorded for the given user identifier. Most implementations would reject this program because of something called allowedness. This means we can only conclude facts about constants that are known to the system. In this example, we can only conclude positive or negative facts about Mistral, Rebecca, and Hattie, but not about Alice.

In principle, this should be fine because the values User variable can take are already restricted by the accessed(User) subgoal in the query. However, most implementations of Datalog would reject this as this subgoal does not appear directly within the rule of \(guest\) and before the negated subgoal. This is a conservative approximation. Notice that we have not used any extralogical predicates, yet negated subgoals automatically require all variables appearing inside them to be bound by the subgoals preceding them. This is precisely the problem we have with extralogical predicates.

Overall sentiment

The chances are if you chose a declarative language such as Datalog over an imperative alternative, these sort of details are exactly what you are trying to run away from. It seems here we’re forced to choose between useful functionality and high-level programming. The work that follows allows you to have both and do so without incurring a runtime performance penalty.

Understanding invocation safety

Although the examples illustrate the problems with extralogical predicates, they do not provide us anything concrete to compute with. We need a way of representing bindings of parameters and dataflow requirements in the program. As soon as we define these, the precise definition of invocation safety is simple.

Adornments: concrete parameter binding

Adornments indicate whether a parameter is bound or not. Adornment transformation is the standard analysis that computes it. Let’s understand this using Example 1:

auth(User,Pass) :- password(User,Pass), hash(Pass,Hash), valid(User,Hash).

Is the argument Pass in the subgoal hash(Pass,Hash) bound? Yes, it is because the preceding subgoal password(User,Pass) has this variable in it and its evaluation binds Pass. How about the Pass argument in password(User,Pass)? It is free because we haven’t even seen this variable before¹, there is no way something else could have bound it. How about the User argument appearing in the same subgoal? It depends on the caller of the \(auth\) predicate.

Different queries yield different answers:

?- auth(User,Pass) asks which user & password pairs can be authenticated. Since we do not know which user to authenticate, the variables in password(User,Pass) are not bound to values.
?- auth("Alice",Pass) asks if Alice is authenticated. Since we know which user to authenticate, User in password(User,Pass) is bound to a value but not Pass.

What we learn from that example is the following:

Adornment is a top-down procedure starting from the query, the binding information flows down.
Constants such as "Alice" are ground, hence are bound.
Variables of a subgoal are bound if they appear in subgoals that precede them like password(User,Pass) preceding hash(Pass,Hash).
Variables of a subgoal are also bound when they are bound in the head of a rule. For example, ?- auth("Alice",Pass) makes the User variable in the head of the \(user\) rule bound. Consequently, the User variable in password(User,Pass) is bound.

These four rules define a top-down procedure for binding. Let’s invent something concrete to compute with. Let’s annotate parameters that are bound with b and those that are free with f. In the query, auth("Alice",Pass) receives the pattern bf due to the constant and this information flows down to the rules with head predicate \(auth\). Then, password(User,Pass) in the body of \(auth\) rule receives bf because User is bound in the head and Pass is not. The subgoal hash(Pass,Hash) would have bf because Pass appears in the preceding subgoal and Hash has never been observed before. As an exercise you can try to figure out the binding pattern for valid(User,Hash) from the previous example (the answer²).

Example 4

Before moving on from adornments, let’s look at an extension of Example 2. There is something that brings us closer to the solution of the code duplication problem described above.

?- check_client("123456").
?- check_server("deadbeaf").

check_client(Pass) :- weak(Pass,Hash).
check_server(Hash) :- weak(Pass,Hash).

weak(Pass,Hash) :- hash(Pass,Hash), rainbow(Pass,Hash).

This is the same program in Example 2 except that we have two queries one using the \(check\_client\) predicate and the other using \(check\_server\). Now let’s adorn this program but rather than noting the adornment patterns separately, let’s make them suffixes to the names of the predicates. This is how this analysis is usually conduct and is the reason why it is referred as a transformation despite being an analysis in spirit. As it is a top-down procedure we start from the queries.

?- check_client_b("123456").
?- check_server_b("deadbeaf").

Both queries receive the pattern b because all arguments are constants. Next we propagate this to the rules defining \(check\_client\) and \(check\_server\) predicates.

check_client_b(Pass) :- weak_bf(Pass,Hash).
check_server_b(Hash) :- weak_fb(Pass,Hash).

The heads of each predicate receive their adornments directly from their callers. Since both callers have the pattern b, so do the heads. Now in the bodies, we have \(weak\_bf\) and \(weak\_fb\) predicates because check_client_bf(User) binds the first parameter of weak(Pass,Hash) and check_client_fb(User) binds the second parameter of weak(Pass,Hash). Since we have two different bindings for \(weak\), we need two versions of \(weak\) body.

weak_bf(Pass,Hash) :- hash_bf(Pass,Hash), rainbow_bb(Pass,Hash).
weak_fb(Pass,Hash) :- hash_fb(Pass,Hash), rainbow_bb(Pass,Hash).

The adornment of these rules are done exactly as before, so I won’t describe them again. The important point is the adornment patterns of subgoals are effected by the adornments patern of the head. The binding patterns of hash above are bf and fb and this difference stems only from the head of the rules.

This looks remarkably similar to the code duplication solution we found to the problem described in Example 2. Once again, we ended up with two different versions of weak with suffixes are bf and fb instead of client and server. This is a superfluous difference. The important difference is that the \(weak\_fb\) rule is not reordered while the \(weak\_check\) rule is. So it can’t be a solution to the invocation safety on its own, but it is a step in the right direction and we develop this further below.

Modes: concrete dataflow requirements

A mode is an indication of input/output behaviour. Each predicate logical or otherwise have a mode pattern associated with it. We only need two modes for Datalog + and ?. If the invocation a predicate is going to be safe, parameters with mode + must be bound and those with mode ? are always safe bound or not.

For example, \(hash\) from Example 1 has a +? pattern because its first parameter needs to be bound at the time of invocation and the second parameter may be bound or not. If we evaluate the subgoal hash("Hattie","42"), "Hattie" must be there to safely execute \(hash\), but it doesn’t matter "42" is there because if the hash of "Hattie" is "0", then we can compare "0" and "42" safely. Let’s look at \(password\) now. That predicate is a lot like a database table, there are no restrictions on what you can use to look up values. Hence, it has the mode pattern ??. In standard Datalog, all predicates have ? mode associated with all of their parameters³.

Let’s adopt a scheme of amending the predicate name with mode patterns to make things syntactically obvious just as we did in the adornments example. Example 1 becomes

auth_xx(User,Pass) :-
  password_??(User,Pass), hash_+?(Pass,Hash), valid_??(User,Hash).

You notice there is an xx in place of fs and bs for the \(auth\) predicate. That’s because we don’t know a way of determining modes of user-defined predicates. Intuitively, dataflow requirements of a user-defined predicate is some function of the subgoals that define those predicates. Indeed, finding an effective procedure to precisely determine mode patterns of user-defined predicates is the only hard part of our quest to ensure invocation safety. We explore this soon in the solution.

Relating adornments to modes (aka. invocation safety)

Remember b means the parameter is bound and + means the parameter needs to be bound. Modes and adornments are clearly related. Their main differences are:

Adornment transformation needs an entry point (a query) and dataflow information flows top-down from heads to the bodies. We don’t have a procedure for computing modes (yet), but we mentioned the predicates defined in the heads should be a function of those in the body. This suggests a bottom-up information flow.
Binding (adornment) patterns qualify subgoals (predicates applied to tuples of variable and constants). Mode patterns, on the other hand, qualify predicates and hence don’t change from one subgoal to another.
Related to the previous point, adornment is entirely dependent on the query, whereas modes are not. This suggests we can perhaps compute modes beforehand and consult them as queries change.

The agreement between adornments and modes be made precise: in an adorned program, invocations of extralogical predicates are safe when all subgoals with extralogical predicates are adorned with b for parameters with mode +. This can be restated as the following almost tautological statement: if something needs to be bound, then it is bound.

In the literature, invocation safety is called well-modedness. A well-moded query & program pair doesn’t produce invocation errors just as “well-typed programs don’t go wrong” as Robin Milner put it.

Solution

Now we start with the stupidest possible thing we can do and refine our approach.

Naïve solution

The examples so far suggest that we need a simple reordering of bodies. So why not give that a go. Recall the variation of Example 1 with mode embellishments:

auth_xx(User,Pass) :-
  password_??(User,Pass), hash_+?(Pass,Hash), valid_??(User,Hash).

Let’s adorn versions of this rule with the subgoals permuted with respect to the query ?- auth_xx(User,Pass). The heads of the rules are omitted as they are always auth_xx_ff(User,Pass).

✖password_??_ff(User,Pass), hash_+?_bf(Pass,Hash), valid_??_bb(User,Hash).
✖password_??_ff(User,Pass), valid_??_bf(User,Hash), hash_+?_bb(Pass,Hash).
✖valid_??_ff(User,Hash), password_??_bf(User,Pass), hash_+?_bb(Pass,Hash).
✓valid_??_ff(User,Hash), hash_+?_fb(Pass,Hash), password_??_bb(User,Pass).
✓hash_+?_ff(Pass,Hash), valid_??_fb(User,Hash), password_??_bb(User,Pass).
✓hash_+?_ff(Pass,Hash), password_??_fb(User,Pass), valid_??_bb(User,Hash).

The last three are invocation safe while the first three are not. Since \(hash\) is the only extralogical predicate with dataflow requirements, namely on its first parameter Pass, orderings that leads to a bound mode (b) at the first parameter are well-moded, while the others are not.

The obvious problem is if there are \(n\) subgoals we may need to adorn \(n\) factorial orderings to find a suitable ordering. But perhaps that’s not a big problem, if in practice programmers stick to 3 to 5 subgoals per rule. The real problem is that we cannot isolate rules.

Example 5

Let’s say we do some software engineering and factor hash_+?(Pass,Hash) and valid_??(User,Hash) into a new rule with head predicate \(check\) as follows:

?- auth_xx(User,Pass)

auth_xx(User,Pass) :- check_xx(User,Pass), password_??(User,Pass).
check_xx(User,Pass) :- hash_+?(Pass,Hash), valid_??(User,Hash).

Now let’s try adorning this program while considering different orderings of \(check\) as that is where the extralogical predicate \(hash\) occurs.

?- auth_xx_ff(User,Pass)

auth_xx_ff(User,Pass) :-
  check_xx_ff(User,Pass), password_??_bb(User,Pass).

check_xx_ff(User,Pass) :- hash_+?_ff(Pass,Hash), valid_??_fb(User,Hash).
check_xx_ff(User,Pass) :- valid_??_ff(User,Hash), hash_+?_fb(Pass,Hash).

This looks bad because when we reorder \(check\) rule alone, neither of the orderings lead to an adornment that makes the invocation of \(hash\) safe. They both lead to a free first argument.

At this point we can just say that we have a sound but incomplete analysis, tough luck. Alternatively, we recognise that the first parameter of the subgoal with \(hash\) is Pass which appear in the head of the rule. As we learn in adornment transformation rules, if a parameter is bound in the head then it is also bound in the body. So all we need is to turn the ff binding pattern of the head to fb or bb. This is indeed possible if we reorder the use of \(check\) within the \(auth\) rule as follows:

?- auth_xx_ff(User,Pass)
auth_xx_ff(User,Pass) :-
  password_??_ff(User,Pass), check_xx_bb(User,Pass).

check_xx_bb(User,Pass) :- hash_+?_bf(Pass,Hash), valid_??_bb(User,Hash).
check_xx_bb(User,Pass) :- valid_??_bf(User,Hash), hash_+?_bb(Pass,Hash).

After permuting the subgoals with \(password\) and \(check\), we end up with bb adornment for the head of \(check\) and regardless the order of subgoals in the body \(check\). This makes invocations of \(hash\) safe in both cases. Computationally, however, this is problematic. Search space is exponential not only in the size of individual bodies but it is a multiplication of all possible orderings of all rules from query down to the relevant rule. This is intractable even for small programs. We address this not with elegance but with some greed.

Greed is good

We can’t do better than exponential asymptotically, but we can make it exponential the way Hindley-Milner type inference is—only in degenerate cases.

The most important insight to our solution is the following: although dataflow requirements are inherent to the predicates, they are contextual. That is to say if a predicate parameter has mode + (requiring it to be bound) and the subgoal using that predicate has the variable V at that parameter, if a preceding subgoal already binds V, we can safely pretend as if the predicate has mode ? in this context. So we can treat

... :- password_??(User,Pass), hash_+?(Pass,Hash), ...

... :- password_??(User,Pass), hash_??(Pass,Hash), ...

so \(hash\) in that context behaves as if it is a logical predicate.

Now we exploit the contextual nature of dataflow. We schedule subgoals with no dataflow requirements first such as \(password\) and \(valid\). These subgoals can be freely placed anywhere in the body as they are always safe to execute. So we adopt a greedy approach and schedule subgoals with such predicates as early as possible.

The benefits of this approach are twofold:

Combinatorially, there are fewer orderings to consider as placement of logical subgoals are restricted.
More importantly, variables of scheduled subgoals are propagated to the remaining alternatives to relax the dataflow constraints as described above. This would lead to further opportunities to apply the greedy scheduling principle.

Scheduling of subgoals fit nicely into a graph where the edges are labelled with subgoals to schedule and orderings are the paths from the root to vertices with no successors.

Example 6

Let’s apply this strategy to Example 1:

auth_xx(User,Pass) :-
  password_??(User,Pass), hash_+?(Pass,Hash), valid_??(User,Hash).

The corresponding graph is:

     {password, valid}             {hash}
V0 --------------------> V1 -------------------> V2

This graph stores orderings where subgoals with predicates \(password\) and \(valid\) are scheduled first and it doesn’t matter in which order, only then the \(hash\) subgoal is scheduled.

The subgoals with predicates \(password\) and \(valid\) could be scheduled first because they have no dataflow requirements (indicated by ?? mode pattern). The subgoal with \(hash\) could not be scheduled at that point due to +? mode pattern. However, as we established earlier in this section, mode patterns can be relaxed in the right context. So if we treat the vertex V1 in the graph as a context, hash_+?(Pass,Hash) behaves as hash_??(Pass,Hash) because on this path the subgoal password(User,Pass) precedes it and binds Pass, thus we can schedule it at that point using the greedy principle.

More concretely, this leads to the following orderings:

password_??(User,Pass), valid_??(User,Hash), hash_+?(Pass,Hash).
valid_??(User,Hash), password_??(User,Pass), hash_+?(Pass,Hash).

These are the two of the three well-moded orderings found through brute-force search in the naïve solution described above. The ordering we miss has valid_??(User,Hash) at the end of the body. Greedy scheduling prevents us from discovering this ordering.

This raises the question “are there cases in which we discard the only safe ordering?” The answer is no. If there is at least one safe ordering that can be found by reordering, greedy scheduling will also find at least one valid ordering. Surprisingly, greed is good⁴ and it doesn’t compromise completeness.

Before we move on to other examples of scheduling graphs, we can finally address the pending problem of determining mode patterns for user-defined predicates posed where modes are introduced.

Modes of user-defined predicates

Let’s continue with Example 6 and deduce what xx in auth_xx(User,Pass) should be.

Since a mode marks a requirement of boundness, the first x should be + if User needs to be bound for the invocation to be safe and it should be ? if invocation of all predicates in the body are safe regardless binding status of User. The same reasoning applies to the second parameter. If invocation is safe with a ? mode, then it is also safe with a + mode, so we are looking for the most relaxed mode⁵. The orderings from the graph above didn’t make any conditional reasoning about boundedness of User or Pass, hence ?? is suitable for \(auth\). We can revise the two versions of the rule as follows:

auth_??(User,Pass) :-
  password_??(User,Pass), valid_??(User,Hash), hash_+?(Pass,Hash).
auth_??(User,Pass) :-
  valid_??(User,Hash), password_??(User,Pass), hash_+?(Pass,Hash).

Here’s the beauty of this analysis result: despite the fact that we defined auth with the use an extralogical predicate, we manage to make it behave as a logical predicate wherever it is invoked. We just restored some of the lost declarative promise!

Example 7

We first look at Example 5 again:

auth_xx(User,Pass) :- check_xx(User,Pass), password_??(User,Pass).
check_xx(User,Pass) :- hash_+?(Pass,Hash), valid_??(User,Hash).

Let’s examine the interesting scheduling graph first that is the one for check rule.

     {valid}        {hash}
V1 ----------> V2 ---------> V3

This graph looks similar to the previous one. We apply the greedy principle and schedule valid_??(User,Hash) first. The difference is the variable bound by this subgoal (User and Hash) don’t include Pass which is the dataflow restricted argument of hash_+?(Pass,Hash). So the only way we can schedule \(hash\) is if Pass is already bound at the moment we start evaluating the rule. This is same as saying the second parameter of \(check\) (which is Pass) has to be bound at the caller. For the first parameter, there are no restrictions posed. Then, the overall mode pattern for check is ?+.

In the description above, we kept track of which variables on the path has to be resolved by the invoker of the predicate being defined. This information naturally fits in the vertices of the graph. This is how we do it in the paper, but we elide the details here to simplify the presentation.

Now that we know the mode pattern for \(check\), we can draw a scheduling graph for the rule that defines \(auth\) and deduce its predicate (again). This time we do not give a detailed explanation. It’d be a good exercise to see why we scheduled which predicate and what the mode pattern of \(auth\) is.

     {password}        {check}
V1 -------------> V2 ----------> V3

We cheat here a bit and impose an order on building the graphs. In general, we can’t find such an order because rules can refer to themselves recursively or be mutually recursive with other rules. Proper way of handling that situation is to assume ? in place of x for modes of user-defined predicates initially, and repeatedly apply this graph-based analysis to update the mode patterns until they stabilise (fixpoint) for all user-defined predicates. You can refer to the paper on why this works and why it terminates at all.

Example 8

So far it looks like all the scheduling graphs are just single paths. This is not the case in general and we can use Example 2 to demonstrate it:

weak_xx(Pass,Hash) :- hash_+?(Pass,Hash), rainbow_?+(Pass,Hash).

We don’t have any logical predicates in this body. This can’t be a good start. There is only one thing we do, that is to default to brute-force search for a moment and try scheduling both subgoals.

      {hash}          {rainbow}
  -------------> V2 -------------> V3
 /
V1
 \
  -------------> V5 -------------> V6
    {rainbow}           {hash}

Finally, some more interesting ASCII art.

Let’s inspect the branch we try scheduling hash_+?(Pass,Hash) first. In this path, scheduling the subgoal with \(hash\) predicate binds Pass and Hash, as a result at the context of \(V2\), rainbow behaves like a logical predicate (the dataflow constrained second parameter has Pass which is already bound). Since we couldn’t schedule the \(hash\) subgoal for free, it seems we need to constraint the first parameter (Pass) of \(weak\) leading +? mode pattern.

But wait a second! if we inspect the other branch instead, we commit to requiring Hash to be bound in the head of the rule (and consequently dealt with by the caller) instead. This suggests a mode pattern ?+.

So which is it? In a way, it is both. Depending on the ordering of the rules we have different dataflow requirements. This is good! Now we can just select a representative ordering for each mode pattern. When the rule is used differently, we use the appropriate ordering. So the overall mode pattern of \(weak\) is +?/?+ to signify either of them can be used.

One reason we ended with two alternatives is that the mode patterns we obtained are incomparable. If we had options +? and ++, instead we could have discarded ++ because +? is a more favourable alternative. This allows us to store fewer orderings. We say more about that in the paper.

Multiple rules

So far we have been assuming that a predicate is defined by a single rule. Now we look at those defined by multiple rules. Let’s extend Example 8 to have a custom extralogical predicate that checks weakness of a password & hash pair and requires both of its parameters to be bound to execute safely.

weak_xx(Pass,Hash) :- hash_+?(Pass,Hash), rainbow_?+(Pass,Hash).
weak_xx(Pass,Hash) :- custom_weak_++(Pass,Hash).

We already know the first rule leads to the mode pattern +?/?+. The second rule is so simple that we can tell the mode pattern it leads without drawing the relevant scheduling graph, namely ++ inherited from its single defining subgoal.

Now what does it mean to invoke a predicate in Datalog? It means we are going to try each of its defining rules to collect all possible conclusions. This implies any invocation of \(weak\) is going to evaluate both rules. Therefore, the only thing we can do to ensure safety is to satisfy the dataflow requirements of each rule simultaneously. What this means in our example is that we generate all possible combinations that is ++ & +? and ++ & ?+. In each case, ++ is stricter for every parameter, so it shadows the other. Hence, we end up with ++/++, but since both alternatives are identical, the overall mode pattern for \(weak\) is ++. Here this sounds ad hoc, but there is an algebraic method to this madness⁶. The details of which are in the paper.

Generalised adornment

Now we know how to determine dataflow requirements of user-defined predicates as well as orderings of subgoals that lead to these requirements. However, this does not help very much with our definition of invocation safety. The reason is that definition rely on adorning a program in the given order of subgoals. This is already in the result of Example 4 with two different versions of the \(weak\) rule with different adornments but static ordering of subgoals.

So the final step in our quest to invocation safety is to generalise adornment procedure such that it can take reorderings into the account.

We modify the adornment procedure in a single way. We let the procedure take a reordering function that given a rule and a binding pattern produces an ordering. Of course, we compute these orderings through scheduling graphs as above. Then, right before we start adorning the body of a rule, we apply the reordering first, then adorn from left to right as usual.

Example 9

Consider the following example adapted from Example 4.

?- weak_+?/?+(Pass,"deadbeaf").

weak_+?/?+(Pass,Hash) :- hash_+?(Pass,Hash), rainbow_?+(Pass,Hash).

So the head of \(weak\) rule will have the adornment fb due to the query. We almost computed the ordering function for \(weak\) in Example 8. For ?+ mode pattern alternative which is the only alternative compatible with fb binding pattern, we have an ordering in which the subgoal with \(rainbow\) comes first and the subgoal with \(hash\) comes second. So we reorder the body first and then perform the adornment and that leads to the following adorned program:

?- weak_+?/?+_fb(Pass,"deadbeaf")

weak_+?/?+_fb(Pass,Hash) :- rainbow_?+_fb(Pass,Hash), hash_+?_bb(Pass,Hash).

And voilà, we have a \(rainbow\) subgoal with an adornment for its second parameter b and \(hash\) subgoal with an adornment for its first parameter b as required. As all extralogical predicates are used in subgoals with adornments that match their modes, the program is safe to execute.

Checking invocation safety

Once we compute modes for all-user defined predicates including the query, we don’t have to adorn the program to see if the program is well-moded. We only need to look at the query.

Example 10

In Example 9 the query was ?- weak_+?/?+_fb(Pass,"deadbeaf"), here the fact that fb is consistent with one of the mode pattern alternatives, namely ?+ immediately indicates that we have a ordering function that satisfy all dataflow requirements.

If, on the other, hand we pose the same query to the version of weak used in explaining modes of predicates with multiple rules:

?- weak_++(Pass,"deadbeaf")

weak_++(Pass,Hash) :- hash_+?(Pass,Hash), rainbow_?+(Pass,Hash).
weak_++(Pass,Hash) :- custom_weak_++(Pass,Hash).

Just by looking at the query, we know the program is not well-moded with respect to this query because the predicate \(weak\) in this case requires both of its parameters to be bound but the first argument to the query is the free variable Pass.

Some properties

This algorithm enjoys a number of properties. I am neither going to get into their proofs nor the mathematical details, but it would be amiss to omit them entirely:

Soundness: if the algorithm finds a reordering function for a program, the the reordered rules will be safe to invoke.
Completeness: if there are any orderings of rules that ensure invocation safety, then the algorithm will find a (possibly different) set of orderings that ensure invocation safety.
Incremental computation: addition of new rules don’t invalidate the old results, hence the mode patterns of existing user-defined predicates can be used to seed the analysis leading to faster execution.
Termination: the analysis terminates on all inputs.

In particular, incremental computation is useful for making this analysis scalable. In the special case that the added rule does not extend an existing predicate, i.e., have a fresh head, we don’t have to reanalyse old rules. This means we can analyse libraries and ship the mode patterns of predicates defined in the library with them. The users of the library won’t have to reanalyse the code in the library.

Summary & concluding thoughts

The main take away is declarative programming is awesome! In the case of Datalog, we can supplement expressivity by allowing extralogical predicates. This brings the syntactic order of execution problem with it, but our analysis completely eliminates this and restores the promise of declarative programming!

The analysis relies on computing dataflow requirements (modes) of user-defined predicates and verifying their consistency with respect to the actual bindings (adornments) of the program & query pairs.

We avoid inefficiency through a greedy scheduling algorithm, but we don’t sacrifice completeness. Additionally, our analysis is incremental and thus scalable to multi module programs with relative ease.

Here are few reasons why you might like to read the paper:

understanding the mathematical foundation for modes & bindings,
gaining insights into about an implementation (paper is effectively executable maths),
need more advanced examples (illustrating omitted aspects of the algorithm),
looking for generalisation from Datalog to other contexts,
and seeing how this algorithm behaves with common Datalog examples such as negation, aggregation, and effectful extralogical predicates.

There is a number of directions this work may go. I believe the constraint-based approach of the work can be generalised to other declarative languages which need more sophisticated modes. I firmly believe there is a fundamental link between the mode operations we defined and dataflow in general. We didn’t discuss user-level annotations as you noticed. There are interesting design questions about an annotation language for modes that might work exploring. Finally (and a bit more abstractly), extralogical predicates are inevitable in logic programming but they are afterthoughts to the semantics of the language. I’d like them to be first-class citizens, so we can reason about them better.

Pass in this rule is an existentially quantified variable, so there is indeed nothing that could have bound it out of the rule context.↩︎
It is bb because both User and Pass appear in preceding subgoals.↩︎
This seems to contradict what I said in Example 3, so let me clarify. In a subgoal like not r(X,Y,Z), the predicate \(r\) still has ??? mode pattern. It is the negation that imposes makes r behave as if it has +++ pattern.↩︎
This is a quote from a famous movie speech by the character Gordon Gekko in Wall Street. Highly recommended.↩︎
In the paper, we show and heavily use the fact that constraints (which are derived from modes) form a bounded partial order. We can talk about principal moding or sub moding, but they are only relevant if we have explicit annotations for modes which are not strictly necessary for safety checking and automated reordering. Perhaps another paper? Feel free to contact me if you want to discuss this more.↩︎
This is the second time we are combining mode patterns and there is a form of subsumption going on. When we consider two alternative dataflow requirements and their combination leads to a more relaxed mode pattern. When we combine two mode patterns that need to be satisfied simultaneously, the combination is more strict. You must have noticed that these two operations sound very algebraic. Indeed, to our surprise, they form Martelli’s semiring (sorry, there aren’t any non-academic resources I could find) originally used to compute cut-sets of a graph. To the best of my knowledge this hasn’t been used in dataflow context before and seems general enough to be used beyond Datalog and extralogical predicates.↩︎