PhD position / Thesis topic

Interlinking crosslingual RDF data sets

The linked data initiative aims at publishing structured and interlinked data at web scale by using semantic web technologies [1]. These technologies provide different languages for expressing data as graphs (RDF), describing its organization through ontologies (OWL) and querying it (SPARQL) [2].

Linked data facilitates the implementation of applications that reuse data distributed on the web. Until now, access to web data has relied on limited Web APIs which provide interfaces and data formats specific to each data provider. Thus programmers have to build a custom solution for each data source [3]. To facilitate interoperability between applications, data issued by different providers has to be interlinked, i.e., the same entity in different data sets must be identified. However, in a heterogeneous system such as the web, there is no reason for two organizations (or providers) to use the same ontologies to express their data or the same keys to identify entities.

One of the key challenges of linked data is to be able to discover links across datasets [4]. This problem is particularly difficult when entities are described in different natural languages, because many interlinking tools rely mostly on a syntactic comparison of entity labels. Since it is not possible to rely on simple string comparison, more global measures must be considered.
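
The limitation of syntactic label comparison can be illustrated with a minimal sketch. It uses Python's standard difflib.SequenceMatcher as a stand-in for a string-based label matcher; the function name label_similarity and the example labels are assumptions made for illustration and do not come from any specific interlinking tool.

```python
from difflib import SequenceMatcher


def label_similarity(a: str, b: str) -> float:
    """Return a purely syntactic similarity in [0, 1] between two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Illustrative labels for the same entity (the country) in three languages.
labels = {"en": "Germany", "fr": "Allemagne", "zh": "德国"}

# Compare the English label against each label, including itself.
for lang, label in labels.items():
    score = label_similarity(labels["en"], label)
    print(f"en vs {lang}: {score:.2f}")
```

Only the identical label reaches the maximum score. The French label scores well below it, and the Chinese label shares no characters with the English one, so a threshold on such a measure would miss these links.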

The main objective of this work is to develop efficient, reliable and scalable methods and tools for linking open data in a multilingual context.

Several approaches could be explored and combined:

The challenges are to select the most suitable approaches, invent new ones and combine them in an effective way.

Given the increasing size of datasets, another important aspect of the work would be to enable the scalability of the methods and tools by designing efficient pruning and/or segmentation strategies.
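
As a hedged illustration of what a pruning strategy can look like, the sketch below uses blocking: entities are grouped by a cheap, language-independent key, and only entities falling in the same block are kept as candidate pairs. This is one common technique chosen for illustration, not the method prescribed by this proposal. The helper names (block_by_key, candidate_pairs), the birth_year key and the toy records are all assumptions.

```python
from collections import defaultdict
from itertools import product


def block_by_key(entities, key):
    """Group entity records by the value of a (hypothetical) blocking key."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[entity[key]].append(entity)
    return blocks


def candidate_pairs(source, target, key):
    """Yield only the pairs whose blocking key values are equal."""
    source_blocks = block_by_key(source, key)
    target_blocks = block_by_key(target, key)
    # Only key values present in both data sets can produce a pair.
    for value in source_blocks.keys() & target_blocks.keys():
        yield from product(source_blocks[value], target_blocks[value])


# Toy records, invented for this example.
english = [
    {"id": "en:1", "label": "Victor Hugo", "birth_year": 1802},
    {"id": "en:2", "label": "Marie Curie", "birth_year": 1867},
]
chinese = [
    {"id": "zh:1", "label": "维克多·雨果", "birth_year": 1802},
    {"id": "zh:2", "label": "玛丽·居里", "birth_year": 1867},
    {"id": "zh:3", "label": "鲁迅", "birth_year": 1881},
]

pairs = list(candidate_pairs(english, chinese, "birth_year"))
print(len(pairs), "candidate pairs instead of", len(english) * len(chinese))
```

The more expensive cross-lingual comparison then only needs to run on the retained pairs. The trade-off is that a noisy or missing key value silently discards a correct link, so the choice of key matters.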

The successful candidate is expected to consider these problems from both theoretical and experimental perspectives. Part of the research may be developed in collaboration with other groups (Getalp team at LIG, Prof. Juanzi Li at Tsinghua University, Jorge Gracia at UP Madrid).

References:
[1] Christian Bizer, Tom Heath and Tim Berners-Lee (2009). Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst., 5(3), 1-22.
[2] Pascal Hitzler, Markus Krötzsch and Sebastian Rudolph (2009). Foundations of semantic web technologies, Chapman & Hall/CRC.
[3] Tom Heath and Christian Bizer (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1), 1-136. Morgan & Claypool.
[4] Alfio Ferrara, Andriy Nikolov and François Scharffe (2011). Data Linking for the Semantic Web. Int. J. Semantic Web Inf. Syst., 7(3), 46-76.


http://exmo.inria.fr/training/Th-2012-multilink.html

$Id: Th-2012-multilink.html,v 1.5 2017/01/13 19:59:25 euzenat Exp $