Zhengjie Fan, Concise pattern learning for RDF data sets interlinking, Thèse d'informatique, Université de Grenoble, Grenoble (FR), April 2014
There are many data sets being published on the web with Semantic Web technology. The data sets contain analogous data which represent the same resources in the world. If these data sets are linked together by correctly building links, users can conveniently query data through a uniform interface, as if they are querying one data set. However, finding correct links is very challenging because there are many instances to compare. Many existing solutions have been proposed for this problem. (1) One straight-forward idea is to compare the attribute values of instances for identifying links, yet it is impossible to compare all possible pairs of attribute values. (2) Another common strategy is to compare instances according to attribute correspondences found by instance-based ontology matching, which can generate attribute correspondences based on instances. However, it is hard to identify the same instances across data sets, because there are the same instances whose attribute values of some attribute correspondences are not equal. (3) Many existing solutions leverage Genetic Programming to construct interlinking patterns for comparing instances, while they suffer from long running time. In this thesis, an interlinking method is proposed to interlink the same instances across different data sets, based on both statistical learning and symbolic learning. The input is two data sets, class correspondences across the two data sets and a set of sample links that are assessed by users as either "positive" or "negative". The method builds a classifier that distinguishes correct links and incorrect links across two RDF data sets with the set of assessed sample links. The classifier is composed of attribute correspondences across corresponding classes of two data sets, which help compare instances and build links. The classifier is called an interlinking pattern in this thesis. On the one hand, our method discovers potential attribute correspondences of each class correspondence via a statistical learning method, the K-medoids clustering algorithm, with instance value statistics. On the other hand, our solution builds the interlinking pattern by a symbolic learning method, Version Space, with all discovered potential attribute correspondences and the set of assessed sample links. Our method can fulfill the interlinking task that does not have a conjunctive interlinking pattern that covers all assessed correct links with a concise format. Experiments confirm that our interlinking method with only 1% of sample links already reaches a high F-measure (around 0.94-0.99). The F-measure quickly converges, being improved by nearly 10% than other approaches.
Interlinking, Ontology Matching, Machine Learning