Declarative Logic Programming. Michael Kifer
first argument, which is used to distribute tuples in the network. Further, any “non-local” rule must contain a
#link
literal, and every predicate in the rule must have the source or destination of that link as its location specifier. An example from the paper is
which says there is a path from source @S
to destination @D
with first hop to @Z
. (NDlog supports list values through built-in functions.) We see that all predicates in the rule have either @S
or @Z
as the location specifier. NDlog is compiled into local dataflow graphs that run at each node in the network. A non-local rule such as the one above is split into two local rules communicating through messages across the link. Evaluation is a relaxed form of semi-naïve, in which iterations happen locally, with tuples that arrive from other nodes during an iteration either being used immediately or buffered for the next iteration. The system uses both traditional and newly proposed Datalog optimizations. The former include predicate reordering, Magic Sets and aggregate selection, in which presence of an aggregate function, such as min
in shortest path, allows suppression of non-optimal results [Furfaro et al. 2002]. The authors also present optimizations particular to the distributed setting, based on caching of results that pass through a node, and merging of messages with certain attribute values in common, as well as choosing between (or possibly combining) alternative evaluation strategies based on costs.
Abiteboul et al. [2005] apply Datalog to determining possible event sequences leading to alarm reports in a networked application. (A distributed telecom system is their example.) They target Datalog because of the need to reason over extensional (alarms) and intensional (potential execution flows) knowledge with recursive dependencies. The authors note that Datalog’s convergence properties are a good match to asynchronous, distributed settings, such as reasoning in and about networks. Their distributed version of Datalog—dDatalog—assigns whole relations to nodes. They adapt QSQ to evaluating dDatalog in a distributed setting. When a node encounters a remote predicate in evaluating a query, it ships that sub-goal, its binding pattern, and the remainder of the query to the node with the remote predicate.
The Cassandra access-control framework [Becker and Sewell 2004] is another example of the use of Datalog in a distributed setting. Cassandra augments Datalog with location and credential-issuer qualifiers, as well as constraints. The domain for the constraints is pluggable, and gives rise to “a sequence of increasingly expressive policy specification languages.” For example, a constraint domain for arithmetic allows specification of policies that limit the length of a delegation chain. Policy rules that can invoke remote policies over the network, to request resources, roles, and credentials, and are interpreted in a policy-evaluation engine. Cassandra chooses a tabled, top-down evaluation approach, because of potential problems with bottom-up evaluation. In particular, the distributed setting means that right-to-left evaluation of rules would require distributed query plans.
Datalog and Big Data. More recently, Datalog has been the basis for scalable dataanalytics systems, especially those targeting graph structures such as arise in social networks. SociaLite [Lam et al. 2013] extends Datalog for analyzing social networks. It supports some kinds of recursive aggregates, and allows the programmer to provide hints on data representation and evaluation order, which leads to performance comparable to imperative implementations, but with programs that are an order of magnitude smaller. Distributed SociaLite [Seo et al. 2013] further allows the programmer to specify how data should be shared across servers, and derives how to distribute evaluation and organize communication from that specification. The Myria system supports iterative analytic routines using recursion in Datalog [Wang et al. 2015]. Myria supports asynchronous and prioritized evaluation of Datalog on a shared-nothing cluster, as well as fault tolerance. BigDatalog [Shkapsky et al. 2016] also supports recursive analyses, in this case for Spark. While Spark supports iterative processing, the authors point out that directly coding recursive computations via iteration is challenging in Spark, and tends to be inefficient, as there are no opportunities for global analysis and optimization provided in Datalog. BigDatalog is especially well suited to graph analytics, where it outperforms systems specifically targeted at graphs, such as GraphX [Gonzalez et al. 2014]. One advantage systems such as SociaLite, Myria, and BigDatalog have over earlier Datlog systems for analytics application is support for aggregates—for example, max and sum—in recursive rules [Zaniolo et al. 2017].
Datalog is also the basis for parallel approaches to data analysis. Google’s Yedalog [Chin et al. 2015] is a recent Datalog-based system that is designed specifically to exploit data parallelism and aggregation of massive amounts of data. Also, the DeALS system [Shkapsky et al. 2013] has compilation techniques to support multicore processing [Yang et al. 2017].
Other Application Areas. We briefly mention other areas in which Datalog—and its derivatives—have been applied. The DLV system [Leone et al. 2006] uses Disjunctive Datalog to represent knowledge and to reason about alternatives. Disjunctive Datalog allows the “or” of predicates in the head of a rule, which gives rise to multiple minimal answer sets, similarly to what is seen for negation in a rule body. DLV also includes integrity constraints that rule out certain answer sets, and “soft” constraints that prioritize among answer sets. Ashley-Rollman et al. [2007] use a Datalog-like language called MELD to plan shape-shifting maneuvers for a swarm of modular robots. They find that the declarative approach allows programs that are 20 times smaller and allow more optimizations, compared to imperative alternatives.
A variety of Web-related applications also use Datalog. The Lixto system [Gottlob et al. 2004] supports Web data extraction and service definition. With Lixto, a designer can specify a data-extraction wrapper via a special browser and the system generates a Datalog program to perform the extraction. Its Elog language is based on Monadic Datalog (where all IDB predicates are unary), and supports visual wrapper specification. Lixto technology is now part of the McKinsey Periscope product. A follow-on project, DIADEM [Furche et al. 2014], targets fully automated data extraction from collections of web pages. DIADEM uses a form of Datalog± internally to represent domain knowledge. Orchestra [Ives et al. 2008] supports data sharing on the Web, and translates data-exchange programs to an extended version of Datalog with Skolem functions to support virtual-entity creation. WebDamlog [Abiteboul et al. 2011] is a Datalog-style language for implementing distributed applications among peer systems on the Web. Toward that goal, WebDamlog supports the exchange of both facts and rules.
We will also see uses of Datalog for data management in LogicBlox and Datomic in the next section. Also see chapters 7 and 9 to learn about applications in bioinformatics and natural language processing.
Summary. Datalog is gaining more acceptance of late, as evaluation methods and optimization techniques proposed on a theoretical level in the past have been implemented and tested on actual applications. These applications have demonstrated the utility of the language as a direct interface, as an intermediate language, and as a framework for additional features. These application have also inspired new extensions, such as for distribution, along with new evaluation and optimization approaches.
There is also a growing number of Datalog vendors and companies that rely on Datalog for their mission-critical applications, including LogicBlox, Datomic, XSB, Inc., Semmle, DLV Systems, Ontotext, and Coherent Knowledge Systems.
1.8 Current Systems and Comparison
This section describes and compares four current systems that implement Datalog: XSB, LogicBlox, Datomic, and Flora-2/Ergo. We show how a selection of Datalog rules and queries are written for each system, and report on experiments that illustrate their performance.
These systems all implement Datalog (or a superset). However, the observed performance of each system is unique. They show asymptotically different behavior due to fundamental factors, including their choices for evaluation strategies, selection of underlying data structures, and data indexing. Even when these factors are the same, there are differences in performance due to implementation details, such as the choice of implementation