Data interlinking

The web of data uses semantic web technologies to publish data on the web in such a way that they can be interpreted and connected together. It is thus critical to be able to establish links between these data, both for the web of data and for the semantic web that it contributes to feed. We call data interlinking the process of generating links identifying the same resource described in two data sets. Data interlinking parallels ontology matching: from two data sets (d and d') it generates a set of links (also called a linkset, L), which are pairs of resource identifiers ⟨u, u'⟩, one from each data set, related by the owl:sameAs property asserting the identity of the resources.
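
As an illustration, a linkset can be represented as a set of identifier pairs and serialised as owl:sameAs statements. The sketch below is only a minimal example: the function name and the example.org identifiers are hypothetical, and the output uses the N-Triples syntax.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def linkset_to_ntriples(links):
    """Serialise a linkset (pairs of IRIs <u, u'>) as owl:sameAs N-Triples lines."""
    return [f"<{u}> <{OWL_SAME_AS}> <{v}> ." for u, v in sorted(links)]


# Hypothetical identifiers from two data sets d and d', for illustration only.
linkset = {
    ("http://example.org/d/paris", "http://example.org/d2/city75"),
    ("http://example.org/d/lyon", "http://example.org/d2/city69"),
}

for line in linkset_to_ntriples(linkset):
    print(line)
```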

Goal: Our work on data interlinking aims to propose new interlinking techniques and to take advantage of alignments in data interlinking.

Data interlinking from expressive alignments

We have proposed a general framework for analysing the task of linking data and we have shown how the diverse techniques developed for establishing these links fit in the framework [Scharffe 2010a, 2011b]. We have also proposed an architecture that allows various interlinking systems to be associated and to collaborate with systems developed for ontology matching, which have much in common with link discovery techniques.

In the context of the Datalift project, we have developed a data interlinking module which generates partial data interlinking scripts from ontology alignments [Fan 2012a]. For that purpose, we have integrated existing technologies within the Datalift platform [Fan 2012a, Scharffe 2012a]: the Alignment API, for taking advantage of the EDOAL language, and Silk, developed by Freie Universität Berlin, for processing linking scripts.

We have further developed an algorithm able to determine potential attribute correspondences of two classes depending on their features. For that purpose, we use k-means or k-medoids clustering to identify groups of properties which can be compared. This provides property correspondences used to construct a Silk script which generates an initial link set. Some of the links are presented to users who assess their validity. These are taken as positive and negative examples by an extension of the disjunctive version space method to find an interlinking pattern that can justify correct links and rule out incorrect ones. Such a technique can be iterated until fully satisfactory links are found. Experiments show that, with only 1% of sample links, this method reaches an F-measure over 96% [Fan 2014a, b].

This work is part of the PhD of Zhengjie Fan, co-supervised with François Scharffe (LIRMM), within the Datalift project.

Key and pseudo-key detection for web dataset cleansing

We have proposed a method for analysing web datasets based on key dependencies. Keys are sets of properties which uniquely identify individuals (instances of a class). We have refined the notion of database keys in a way which is better suited to the context of description logics and the openness of the semantic web [Atencia 2014c].

In order to better deal with web data of variable quality, we have introduced the definition of pseudo-keys [Atencia 2012b]. We have also designed and implemented an algorithm for discovering pseudo-keys. Experimental results show that, even for a large dataset such as DBpedia, the runtime of the algorithm is still reasonable [David 2012b]. This work has made it possible to automatically detect duplicates within Wikipedia.
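
The idea of checking a key and tolerating exceptions can be sketched as follows. This is a minimal illustration, not the algorithm of [Atencia 2012b]: we assume here that a pseudo-key is a set of properties that behaves as a key except for a bounded ratio of individuals, and that individuals lacking one of the properties are ignored. The function names, the max_exception_ratio parameter and the sample data are all hypothetical.

```python
from collections import defaultdict


def key_exceptions(instances, properties):
    """Return the groups of individuals sharing the same values on all `properties`.

    `instances` maps an individual to a dict {property: set of values}.
    Each returned group is a candidate set of duplicates.
    """
    groups = defaultdict(list)
    for subject, description in instances.items():
        # Assumption: individuals missing one of the properties are skipped.
        if not all(description.get(p) for p in properties):
            continue
        signature = tuple(frozenset(description[p]) for p in properties)
        groups[signature].append(subject)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def is_pseudo_key(instances, properties, max_exception_ratio=0.0):
    """Assumed criterion: the ratio of colliding individuals stays under a threshold."""
    colliding = sum(len(group) for group in key_exceptions(instances, properties))
    return colliding / max(len(instances), 1) <= max_exception_ratio


# Hypothetical data: ex:p1 and ex:p2 are likely duplicates.
people = {
    "ex:p1": {"name": {"Ada Lovelace"}, "born": {"1815"}},
    "ex:p2": {"name": {"Ada Lovelace"}, "born": {"1815"}},
    "ex:p3": {"name": {"Alan Turing"}, "born": {"1912"}},
    "ex:p4": {"name": {"Alan Turing"}},
}

print(key_exceptions(people, ["name", "born"]))
print(is_pseudo_key(people, ["name", "born"], max_exception_ratio=0.5))
```

With a threshold of 0, the test amounts to a strict key check; the groups returned by key_exceptions are the individuals that a user would inspect as possible duplicates.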

This work is developed partly in the Lindicle and Datalift projects. A proof of concept implementation is available at http://rdfpkeys.inrialpes.fr/.

Link keys for data interlinking

However, ontologies do not necessarily come with key descriptions, and these may prove useless when interlinking data. We have refined the notion of link keys introduced in [Euzenat 2013c]. Like alignments, link keys are assertions across ontologies and are not part of a single ontology. A link key is a combination of such keys with alignments. More precisely, a link key is an expression ⟨K^eq, Kⁱⁿ, C⟩ such that:

K^eq is a set of pairs of property expressions;
Kⁱⁿ is a set of pairs of property expressions;
C is a correspondence between classes.

Such a link key holds if and only if, for any pair of resources belonging to the classes put in correspondence by C, whenever the values of their properties in K^eq are pairwise equal and the values of those in Kⁱⁿ pairwise intersect, the resources are the same. Link keys can thus be used for finding equal individuals across the two data sets and generating the corresponding links.
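
The way a link key generates links can be sketched as follows. This is a minimal illustration under stated assumptions, not the Alignment API implementation: each data set is a dict from resources to {property: set of values}, property expressions are reduced to plain property names, class membership is read from an "rdf:type" entry, and an equality pair is only satisfied when the value sets are non-empty. All names and data are hypothetical.

```python
def values(description, prop):
    """Set of values of `prop` in a resource description (empty if absent)."""
    return description.get(prop, set())


def apply_link_key(d1, d2, k_eq, k_in, classes):
    """Return the pairs (u, v) identified by the link key <k_eq, k_in, classes>.

    k_eq and k_in are lists of property pairs (p, q), p from d1 and q from d2;
    classes is the pair of classes put in correspondence.
    """
    class1, class2 = classes
    links = set()
    for u, du in d1.items():
        if class1 not in values(du, "rdf:type"):
            continue
        for v, dv in d2.items():
            if class2 not in values(dv, "rdf:type"):
                continue
            # Equality pairs: the two value sets must be equal (assumed non-empty).
            eq_ok = all(
                bool(values(du, p)) and values(du, p) == values(dv, q)
                for p, q in k_eq
            )
            # Intersection pairs: the two value sets must share a value.
            in_ok = all(bool(values(du, p) & values(dv, q)) for p, q in k_in)
            if eq_ok and in_ok:
                links.add((u, v))
    return links


d1 = {
    "a:1": {"rdf:type": {"a:Person"}, "a:name": {"Ada Lovelace"}, "a:phone": {"111", "222"}},
    "a:2": {"rdf:type": {"a:Person"}, "a:name": {"Alan Turing"}, "a:phone": {"333"}},
}
d2 = {
    "b:x": {"rdf:type": {"b:Human"}, "b:label": {"Ada Lovelace"}, "b:tel": {"222", "999"}},
    "b:y": {"rdf:type": {"b:Human"}, "b:label": {"Alan Turing"}, "b:tel": {"444"}},
}

links = apply_link_key(
    d1, d2,
    k_eq=[("a:name", "b:label")],
    k_in=[("a:phone", "b:tel")],
    classes=("a:Person", "b:Human"),
)
print(links)
```

In this example only a:1 and b:x are linked: a:2 and b:y have equal names but no phone number in common. The nested loop is quadratic and only meant to make the definition explicit.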

As can be seen, link key validity only relies on pairs of objects in two different data sets. We further qualify link keys as weak, plain or strong depending on the further constraints they satisfy: a weak link key is only valid on pairs of individuals of different data sets; a plain link key must, in addition, apply to pairs of individuals of the same data set as soon as one of them is identified with an individual of the other data set; a strong link key is a link key which is also a key for each data set, so it can be thought of as a link key made of two keys.

We have extended a classical key extraction technique for extracting weak link keys (extracting strong link keys is even easier). We have designed an algorithm to first generate a small set of candidate link keys [Atencia 2014b]. Depending on whether some links, valid or invalid, are known, we defined supervised and unsupervised measures for selecting the appropriate link keys. The supervised measures approximate precision and recall on a sample, while the unsupervised measures are the ratio of pairs of entities a link key covers (coverage), and the ratio of entities from the same data set it identifies (discrimination). We have experimented with these types of measures, showing the accuracy and robustness of both [Atencia 2014b].
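
The two unsupervised measures can be sketched as follows. The formulas are one possible reading of the descriptions above and are an assumption, not necessarily the exact definitions of [Atencia 2014b]: coverage is taken as the share of all instances that appear in at least one generated link, and discrimination as the smaller number of distinct linked instances on either side divided by the number of links (1 when links are one-to-one). Function names and data are hypothetical.

```python
def coverage(links, instances1, instances2):
    """Assumed definition: share of instances occurring in at least one link."""
    linked = {u for u, _ in links} | {v for _, v in links}
    total = len(set(instances1) | set(instances2))
    return len(linked) / total if total else 0.0


def discrimination(links):
    """Assumed definition: min(#linked sources, #linked targets) / #links."""
    if not links:
        return 0.0  # Convention chosen for this sketch only.
    sources = {u for u, _ in links}
    targets = {v for _, v in links}
    return min(len(sources), len(targets)) / len(links)


# Links generated by some candidate link key (hypothetical data).
candidate_links = {("a1", "b1"), ("a2", "b1"), ("a3", "b3")}
print(coverage(candidate_links, ["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]))
print(discrimination(candidate_links))
```

Here b1 is linked to both a1 and a2, which lowers discrimination below 1; a candidate link key would be preferred when both measures are high.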

This approach has been adapted to the simpler context of relational databases, and we have shown how candidate link keys can be encoded in the formal concept analysis framework [Atencia 2014d]. We are pursuing this work with full link keys.

Link keys can also be thought of as axioms in a description logic. As such, they can contribute to inferring ABox axioms, such as links, or terminological axioms and other link keys. Yet, no reasoning support existed for link keys. We extended the tableau method designed for ALC to take link keys into account [Gmati 2016a]. We showed how this extension enables combining link keys with classical terminological reasoning, with and without ABox and TBox, and generating non-trivial link keys.

Link keys have been implemented within the Alignment API, i.e., they can be associated with class correspondences. It is then possible to automatically generate SPARQL CONSTRUCT queries which generate links between entities.
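
To give an idea of what such a query may look like, the sketch below builds a SPARQL CONSTRUCT query from the intersection part of a link key. It is an illustration only and does not reproduce the queries produced by the Alignment API: we assume that both data sets are accessible through the same endpoint, that property expressions are plain property IRIs, and we leave out equality pairs, which would require additional FILTER NOT EXISTS patterns. The function name and the example.org IRIs are hypothetical.

```python
def link_key_to_sparql(class1, class2, k_in):
    """Build a CONSTRUCT query linking instances that share a value for each (p, q)."""
    lines = [
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>",
        "CONSTRUCT { ?x owl:sameAs ?y }",
        "WHERE {",
        f"  ?x a <{class1}> .",
        f"  ?y a <{class2}> .",
    ]
    for i, (p, q) in enumerate(k_in):
        # Using the same variable on both sides requires a common value.
        lines.append(f"  ?x <{p}> ?v{i} .")
        lines.append(f"  ?y <{q}> ?v{i} .")
    lines.append("}")
    return "\n".join(lines)


query = link_key_to_sparql(
    "http://example.org/o1/Person",
    "http://example.org/o2/Human",
    [("http://example.org/o1/phone", "http://example.org/o2/tel")],
)
print(query)
```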

This work is developed partly in the Lindicle project.

Data interlinking by iterative import-by-query

We modelled the problem of data interlinking as a reasoning problem on possibly decentralised data. We have proposed a rule-based approach to infer all certain sameAs and differentFrom statements that are logically entailed from a given set of domain constraints and facts. Our main contribution is a novel algorithm, called Import-By-Query, that enables the scalable deployment of such an approach in the decentralised setting of linked data [Al-Bakri 2015a]. The main challenge is to identify the data, possibly distributed over several datasets, useful for inferring sameAs and differentFrom statements of interest. For doing so, Import-By-Query alternates steps of sub-query rewriting and of tailored querying of the linked data cloud in order to import data as specific as possible for inferring or contradicting given target sameAs and differentFrom statements. It is an extension of the well-known query-subquery algorithm for answering Datalog queries over deductive databases. Experiments conducted on a real-world dataset have demonstrated the feasibility of this approach and its usefulness in practice for data linkage and disambiguation.
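
The rule-based view of interlinking can be illustrated on a small scale. The sketch below is not Import-By-Query: it is a naive forward-chaining saturation over facts that are assumed to be already available locally, whereas the point of Import-By-Query is precisely to import only the relevant data. It applies two rules, sameAs(x, y) if x and y share a value for an inverse-functional property, and transitivity of sameAs. The function name, the properties and the facts are hypothetical.

```python
from itertools import combinations


def infer_same_as(facts, inverse_functional):
    """Saturate sameAs over (subject, property, value) facts.

    Returns a set of pairs (x, y) with x < y.
    """
    same = set()
    # Rule 1: sameAs(x, y) <- p(x, v), p(y, v) for an inverse-functional p.
    by_value = {}
    for s, p, o in facts:
        if p in inverse_functional:
            by_value.setdefault((p, o), set()).add(s)
    for subjects in by_value.values():
        for x, y in combinations(sorted(subjects), 2):
            same.add((x, y))
    # Rule 2: transitivity, applied until no new pair is derived.
    changed = True
    while changed:
        changed = False
        for a, b in list(same):
            for c, d in list(same):
                shared = {a, b} & {c, d}
                if len(shared) == 1:
                    x, y = sorted({a, b, c, d} - shared)
                    if (x, y) not in same:
                        same.add((x, y))
                        changed = True
    return same


facts = {
    ("ex:a", "foaf:mbox", "mailto:x@example.org"),
    ("ex:b", "foaf:mbox", "mailto:x@example.org"),
    ("ex:b", "ex:orcid", "0000-1"),
    ("ex:c", "ex:orcid", "0000-1"),
    ("ex:d", "foaf:mbox", "mailto:d@example.org"),
}
print(sorted(infer_same_as(facts, {"foaf:mbox", "ex:orcid"})))
```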

This work is part of the PhD thesis of Mustafa Al-Bakri (LIG-Hadas team), co-supervised by Manuel Atencia and Marie-Christine Rousset, developed in the Qualinca project.

Uncertainty-sensitive reasoning for inferring sameAs facts in linked data

Data interlinking requires designing tools that effectively deal with incomplete and noisy data, and exploit uncertain knowledge. We modelled data interlinking as a reasoning problem with uncertainty. For that purpose, we introduced a probabilistic framework for modelling and reasoning over uncertain RDF facts and rules that is based on the semantics of probabilistic Datalog. We have designed an algorithm, ProbFR, based on this framework. Experiments on real-world datasets have shown the usefulness and effectiveness of our approach for data linkage and disambiguation [Al-Bakri 2016a].

This work was carried out in collaboration with Mustafa Al-Bakri and Marie-Christine Rousset (LIG).

Crosslingual data interlinking

Another key challenge of linked data is to be able to discover links across datasets when entities are described in different natural languages. Indeed, even systems based on graph structure ultimately rely on anchors based on language fragments. In this context, data interlinking requires specific approaches in order to tackle cross-lingualism. We proposed a general framework for interlinking cross-lingual RDF data. It represents resources as (virtual) text documents and compares them using different strategies [Lesnikova 2013a, b, 2016b].

In order to assess the quality of possible measures, we have investigated two directions:

a translation-based approach where the virtual documents are automatically translated (through Bing or Google Translate) [Lesnikova 2014a];
a language-independent approach where important terms found in documents are mapped to a terminological resource, such as BabelNet [Lesnikova 2015b], to compute document similarity.
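
The virtual-document representation shared by both directions can be sketched as follows. This is a simplified illustration, not the implementation evaluated in [Lesnikova 2014a]: the graph is a dict from a resource to its values, a value is treated as a resource when it has its own entry, documents are plain bags of lower-cased words, and the translation or BabelNet mapping step is assumed to have been applied beforehand. The comparison uses the cosine of term-frequency vectors, one of several possible similarity measures. All names and data are hypothetical.

```python
import math
from collections import Counter


def virtual_document(graph, resource, depth=1):
    """Collect the words of literals reachable from `resource` within `depth` hops."""
    words = []
    frontier = [resource]
    seen = {resource}
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for value in graph.get(node, []):
                if value in graph:  # the value is itself a described resource
                    if value not in seen:
                        seen.add(value)
                        next_frontier.append(value)
                else:  # the value is a literal
                    words.extend(value.lower().split())
        frontier = next_frontier
    return Counter(words)


def cosine(doc1, doc2):
    """Cosine similarity between two bags of words."""
    dot = sum(doc1[t] * doc2[t] for t in doc1.keys() & doc2.keys())
    norm1 = math.sqrt(sum(c * c for c in doc1.values()))
    norm2 = math.sqrt(sum(c * c for c in doc2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


graph = {
    "ex:paris": ["Paris", "capital of France", "ex:france"],
    "ex:france": ["France", "country in Europe"],
}
shallow = virtual_document(graph, "ex:paris", depth=1)
deep = virtual_document(graph, "ex:paris", depth=2)
print(shallow)
print(cosine(shallow, deep))
```

The depth parameter controls how far the neighbourhood of a resource is explored when building its document: with depth 2 the description of ex:france is also included.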

We evaluated variations of these two settings, in particular comparing their efficiency on both generic entities, named with a common noun or term, and individual entities. We performed experiments on the DBpedia (in English) and XLore (in Chinese) datasets as well as on the multilingual TheSoz GESIS thesaurus (in English, French and German) and the Agrovoc and Eurovoc thesauri (in English and Chinese respectively). Both approaches demonstrated promising results [Lesnikova 2014a]. We found machine translation to be more efficient on both generic and individual entities [Lesnikova 2016a]. However, the necessary depth to create virtual documents has an influence on the result quality. We conjecture that this depends on the generic-specific character of entities and have worked on testing this hypothesis.

This work is part of the PhD of Tatiana Lesnikova developed in the Lindicle project.

Benchmarking data interlinking

We have also proposed some guidelines for data interlinking evaluation [Euzenat 2012b].

References on linked data and data interlinking

http://exmo.inria.fr/research/interlinking.html

Feel free to comment to Jerome:Euzenat#inria:fr, $Id: interlinking.html,v 1.12 2016/12/26 09:56:52 euzenat Exp $