Research: data interlinking with link keys

We are pursuing our work on link keys for data interlinking [Atencia 2021a] in two specific directions:

Link key processing work flow involving extraction and reasoning.

Tableau method for $\mathcal{ALC}$+LK reasoning

Link keys can also be thought of as axioms in a description logic. It is thus possible to reason with link keys in combination with ontologies and alignments. As such, they can contribute to infer ABox axioms, such as links, terminological axioms, or other link keys. This has important practical applications, such as link key inference, link key consistency and link key redundancy checking. Yet, no reasoning support existed for link keys.

We previously extended the tableau method designed for the $\mathcal{ALC}$ description logic to support reasoning with link keys in $\mathcal{ALC}$ [Gmati 2016a]. We showed how this extension enables combining link keys with classical terminological reasoning with and without ABox and TBox and generating non-trivial link keys. We further extended the method and have proven that this extended method terminates, is sound, complete, and that its complexity is 2ExpTime [Atencia 2019d].

We have also designed a worst-case optimal tableau algorithm, based on compressed tableau and directed by rule application, for reasoning in the description logic $\mathcal{ALC}$ with individuals and link keys. This algorithm is sound, complete, and has exponential complexity. A proof of concept for this algorithm has been implemented and evaluations were carried out. We have also defined a compressed tableau for reasoning in the description logic $\mathcal{ALC}$ extended with link keys and inverse roles. This provides a solid basis for designing a worst-case optimal algorithm for reasoning in the description logic $\mathcal{ALCI}$+LK [Jradeh 2022a].

In addition, we proposed and evaluated different strategies to integrate reasoning with link keys in the data interlinking pipeline. In particular, it is possible to use the reasoner, during link key extraction, to discard link keys introducing inconsistencies.

Link key extraction with relational concept analysis

A first method has been designed to extract and select link keys from two classes which deals with multiple values but not object values [Atencia 2014b]. Moreover, the extraction step has been rephrased in formal concept analysis (FCA) allowing to generate link keys across relational tables [Atencia 2014d].

Two concept lattices corresponding to circularly dependent link keys (generated by Linkky).

We have extended this latter work so that it can deal with multiple object values when the data set is cycle free. This encoding does not necessarily generate the optimal link keys. Hence, we use relational concept analysis (RCA), an extension of FCA taking relations between concepts into account [Atencia 2019z]. We showed that a new expression of this problem is able to extract the optimal link keys even in the presence of cyclic dependencies. Moreover, the proposed process does not require information about the alignments of the ontologies to find out from which pairs of classes to extract link keys.

We implemented these methods and evaluated them by reproducing the experiments made in previous studies [Vizzini 2017a]. This shows that the method extracts the expected results as well as (also expected) scalability issues.

Fixed-point semantics for a reduced version of relational concept analysis

We have used relational concept analysis (RCA) to extract link keys. This led us to notice that when there exist circular dependencies between objects, it extracts a unique stable concept lattice family grounded on the initial formal contexts. However, other stable families may exist whose structure depends on the same relational context. These may be useful in applications that need to extract a richer structure than the minimal grounded one.

We first considered this issue in a reduced version of RCA, which only retains the relational structure. We redefined the semantics of RCA in terms of concept lattice families closed by a fixed-point operation induced by this relational structure. We showed that these families admit a least and greatest fixed point and that the well-grounded RCA semantics is characterised by the least fixed point [Euzenat 2021a]. We then characterised the interesting lattices as the self-supported fixed points. We provided an algorithm to compute the greatest fixed point (dual to the RCA algorithm) and discussed strategies to extract all self-supported fixed points [Atencia 2021b].

Link key extraction with pattern structures

We also used pattern structures, an extension of formal concept analysis (FCA) with ordered structures, to reformulate the simple link key extraction problem [Abbas 2019a]. The strategies used to automatically discover link keys are not able to select the pair of classes on which the link key candidate applies. Indeed, a link key may be relevant for some pair of classes and not relevant for another. Then, discovering link keys for one pair of classes at a time may be computationally expensive if all pairs have to be considered. To overcome this limitation, we have introduced a specific and original pattern structure where link keys can be discovered in one pass while specifying the pair of classes associated with each link key, focusing on the discovery process and allowing more flexibility [Abbas 2020a]. This approach has been implemented in the linkex tool.

Strategies for identifying high-quality link keys

Link keys correspond to closed sets of a specific Galois connection and can be discovered thanks to a concept analysis algorithm (FCA, RCA or pattern structure). Given a pattern concept lattice whose intents are link key candidates, we aim at identifying the most relevant candidates with respect to adapted quality measures. To achieve this task, we introduced the Sandwich algorithm which is based on a combination of dual bottom-up and top-down strategies for traversing the pattern concept lattice [Abbas 2021a]. The output of the Sandwich algorithm is a partially ordered set of the most relevant link key candidates.

We also introduced the notion of non-redundant link key candidate [Abbas 2021b]. Some link keys are redundant from an extensional standpoint when the closure of the sameAs relation on the links they induce is the same as another link key. We have used the formalism of Partition Pattern Structures (pps) to detect such redundancy [Abbas 2022a]. In the pps concept lattice, every concept has an extent representing a link key candidate and an intent representing a partition of instances into sets of equivalent instances. Experiments showed two main results. First, redundancy of link keys is not so significant in real-world datasets. Nevertheless, the link key discovery approach based on pps returns a reduced number of non redundant link key candidates.

Link key extraction under ontological constraints

We investigated the use of link keys taking advantage of ontologies. This can be carried out in two different directions: exploiting the ontologies under which data sets are published, and extracting link keys using ontology constructors for combining attribute and class names.

Following the first approach, we extended our existing algorithms to extract link keys involving inverse (-1), union (⊔), intersection (⊓) and paths (˙) of properties. This helps providing link keys when it is not possible otherwise (without inverse, there is no possible correspondence if one data set is using parents and the other is using children). We showed how the paths could be normalised to reduce the search space. Extracting link keys under these conditions required to introduce better indexing techniques to avoid unnecessary link key generation and even looping.

Combining link keys

For certain data sets, it may be necessary to use several link keys, even on the same pair of classes, for retrieving a more complete link set. We introduced operators to combine link keys over the same pair of classes, investigated their relations and extended measures to evaluate their quality.

We specifically proposed strategies to extract disjunctions from RDF data and apply existing quality measures to evaluate them. We experimented with these strategies showing their benefits [
Atencia 2019c].

Publications on data interlinking