Lost in Semantics? Ballooning the Web of Data

While Linked Open Data has grown enormously in volume, there is still no single point of access for querying the more than 200 SPARQL repositories. The Balloon project aims to create a Meta Web of Data focusing on structural information by crawling co-reference relationships in all registered and reachable Linked Data SPARQL endpoints. Besides introducing the main idea behind the crawling, we also critically reflect on the current status of the Linked Open Data cloud: although it is huge, access via SPARQL endpoints is in most cases complicated by a lack of quality of service and maintenance.

Today’s vision of a common Web of Data has largely been shaped and realized by the Linked Open Data movement. The first wave of this movement transformed silo-based portions of data into a plethora of openly accessible and interlinked data sets. The community itself provided guidelines (e.g., 5 ★ Open Data) as well as open source tools to foster interaction with the Web of Data. Harmonization between those data sets has been established at the modelling level, with unified description schemes that define a formal syntax and common data semantics. Without doubt, Linked Open Data is the de facto standard for publishing and interlinking distributed data sets on the Web, commonly exposed through SPARQL endpoints. However, conveniently querying this globally described data set is only possible with strong limitations:

  1. The distributed nature of the Linked Open Data cloud, combined with the large number of reachable endpoints, makes it hard for novice users to interact with the data.
  2. Following the Linked Data principles, URIs are used to identify specific entities in the endpoints and can be resolved to obtain further information on the given entity. The problem is that each endpoint uses its own URI for the same semantic entity, which leads to semantic ambiguities.

One outcome of the EU FP7 CODE project is the Balloon framework. It tackles exactly this situation and aims to create a Meta Web of Data focusing on structural information. Its foundation is a crawled subset of the Linked Data cloud, which yields a co-reference index as well as structural information. The main idea behind this index is to resolve the aforementioned semantic ambiguities by creating sets of semantically equivalent URIs, which eases consumption of Linked Open Data. This is enabled by crawling the information that expresses links between endpoints. For this purpose, we consider a specific set of predicates to be relevant, e.g., owl:sameAs or skos:exactMatch. The complete crawling process relies on SPARQL queries and considers each LOD endpoint registered at the CKAN platform; RDF dumps are explicitly excluded. During crawling, a clustering approach builds the co-reference clusters, which gives a bi-directional view on the co-reference relationships. The index is the result of a continuous indexing process over SPARQL endpoints. Besides properties defining the equality of URIs, the indexing service also takes into account properties that enable structural analysis of the data corpus, e.g., rdfs:subClassOf. On the basis of this data corpus, interesting modules and application scenarios can be defined. As an example, ongoing research focuses on the following two modules as a starting point:
  1. Intelligent, on-the-fly query rewriting that utilizes co-reference clusters and SPARQL 1.1 Federated Query.
  2. Data analysis, e.g., retrieving common properties or super types for a given set of entities.
These modules are integrated into the overall Balloon platform and serve as a starting point for further applications. To foster community uptake and to increase the number of modules available in the platform, the Balloon project, along with the data corpus, will soon be made available as an open source project.
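To make the crawling and clustering idea more concrete, the following Python sketch shows one possible way to group co-referent URIs and to expand a single URI into its cluster for query rewriting. It is not the Balloon implementation: the union-find data structure, the class `CoreferenceIndex`, the helper names `build_crawl_query` and `values_clause`, and the exact predicate list are assumptions made for illustration only.

```python
# Assumed predicate set; the article names sameAs and exactMatch only as examples.
COREF_PREDICATES = (
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://www.w3.org/2004/02/skos/core#exactMatch",
)


def build_crawl_query(predicate):
    """Return a SPARQL query selecting all subject/object URI pairs for one predicate."""
    return "SELECT ?s ?o WHERE { ?s <%s> ?o . FILTER(isIRI(?o)) }" % predicate


class CoreferenceIndex:
    """Union-find over URIs; every disjoint set is one co-reference cluster.

    Hypothetical structure, chosen here because it makes the equivalence
    symmetric (bi-directional) regardless of the direction of the crawled link.
    """

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Register unknown URIs as their own single-element cluster.
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def add_link(self, subject, obj):
        """Merge the clusters of two URIs connected by a co-reference predicate."""
        root_s, root_o = self._find(subject), self._find(obj)
        if root_s != root_o:
            self._parent[root_o] = root_s

    def cluster_of(self, uri):
        """Return all URIs considered equivalent to `uri`, sorted (linear scan)."""
        root = self._find(uri)
        return sorted(u for u in list(self._parent) if self._find(u) == root)


def values_clause(index, variable, uri):
    """Expand one URI into a SPARQL 1.1 VALUES clause listing its whole cluster."""
    members = " ".join("<%s>" % u for u in index.cluster_of(uri))
    return "VALUES ?%s { %s }" % (variable, members)
```

Each result row of a crawl query would be fed into `add_link`; a rewriting module could then replace a fixed URI in a user query with a variable bound by `values_clause`, so that the query matches the entity under any of its known identifiers.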

The idea of leveraging co-reference information is not new to the research community: the Silk framework [1], SchemEX [2] and the well-known sameAs.org project proposed similar techniques. Nevertheless, the Balloon co-reference approach additionally considers consistent data provenance chains and the possibility of manipulating clusters to enhance overall quality and correctness. Further, the explicit limitation to LOD endpoints sets a clear focus on data that is (in principle) retrievable, in contrast to RDF dumps, which are not searchable out of the box.

While creating the co-reference index, we encountered several issues in the current Linked Open Data cloud. Endpoints left unmaintained for years and a lack of quality of service prevent the Linked Open Data cloud from unfolding its full potential. The findings gathered during the crawling process are in line with the current statistics on the Linked Open Data cloud provided by the LOD2 project: of 700 official data sets overall, only approx. 210 are exposed through a SPARQL endpoint and registered at the CKAN platform. Further, roughly half of these endpoints had to be excluded because of insufficient SPARQL support or because they were unreachable. In the end, only 112 endpoints have been actively crawled for co-reference information, leading to an overall amount of 22.4M distinct URIs (approx. 8.4M synonym groups). During the crawling phase we also encountered the need for a SPARQL feature lookup service. Its main purpose would be to describe, in a standardized way, the retrieval abilities an endpoint actually supports. This topic is currently being discussed on community mailing lists.
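As an illustration of what such a feature lookup could record, the sketch below probes an endpoint with one small query per feature and notes whether each succeeds. Everything here is hypothetical: the feature names, the probe queries, and the `run_query` callable (assumed to send a SPARQL string to an endpoint and raise an exception on failure) are not part of Balloon or of any standard.

```python
# Hypothetical probe set: feature name -> minimal query exercising that feature.
FEATURE_PROBES = {
    "select": "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
    "ask": "ASK { ?s ?p ?o }",
    # May be slow on large stores; shown only to illustrate an aggregate probe.
    "aggregates": "SELECT (COUNT(?s) AS ?n) WHERE { ?s ?p ?o }",
    "values": "SELECT ?x WHERE { VALUES ?x { 1 } }",
    "property_paths": (
        "SELECT ?s WHERE { ?s "
        "<http://www.w3.org/2000/01/rdf-schema#subClassOf>+ ?o } LIMIT 1"
    ),
}


def probe_endpoint(run_query, probes=FEATURE_PROBES):
    """Run every probe through `run_query` and report which ones succeeded.

    `run_query` is an assumed callable: it takes a SPARQL query string and
    raises an exception if the endpoint rejects the query or cannot be reached.
    """
    report = {}
    for feature, query in probes.items():
        try:
            run_query(query)
            report[feature] = True
        except Exception:
            # Any failure (syntax error, timeout, HTTP error) counts as unsupported.
            report[feature] = False
    return report
```

In practice `run_query` would wrap an HTTP client bound to one endpoint URL; keeping it as a parameter means the probing logic can be exercised without network access. A failed probe does not distinguish between a missing feature and a temporary outage, so a real service would likely need repeated checks.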


[1] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov, “Silk: A link discovery framework for the Web of Data,” in Proceedings of the 2nd Linked Data on the Web Workshop, 2009, pp. 559–572.
[2] M. Konrath, T. Gottron, S. Staab, and A. Scherp, “SchemEX: Efficient construction of a data catalogue by stream-based indexing of linked data,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 16, no. 5, 2012.

Useful Links:

EU FP7 CODE Project – http://code-research.eu/
Overview of Balloon – http://theseus.dimis.fim.uni-passau.de:8090/balloon/endpoints (under construction)
Crawled data – ftp://moldau.dimis.fim.uni-passau.de/data/ (on-going research, frequently/live updated)

Contact address:

Florian Stegmaier, Kai Schlegel, Michael Granitzer