Querying & Aggregation

CODE will develop a federated SPARQL search service that exploits provenance information, following the design principles of a middleware system. The Linked Open Data community articulates a clear need for provenance information, yet it is still disconnected from the retrieval process. Taking provenance into account during retrieval will substantially increase the quality of the retrieved data. The following targets will be achieved:

  1. A federated SPARQL engine equipped with federated indices and virtual views on top of existing triple stores.
  2. Full-fledged integration of provenance information into the query execution process (a sketch follows this list).
  3. Cumulative calculation of statistical aggregation functions (e.g., sum, mean, variance) and reintegration of the results into the LOD cloud (e.g., as a SPARQL endpoint).
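
As an illustration of the second target, the following sketch shows one way provenance could be consulted during query evaluation: data is partitioned into named graphs, provenance statements describe those graphs, and the query only evaluates patterns against graphs derived from a trusted source. The namespaces, graph IRIs and the rdflib-based setup are illustrative assumptions, not the CODE implementation.

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")                    # hypothetical namespace
PROV = Namespace("http://www.w3.org/ns/prov#")

ds = Dataset()

# Two named graphs, each holding data harvested from a different source.
g1 = ds.graph(URIRef("http://example.org/graph/1"))
g1.add((EX.alice, RDFS.label, Literal("Alice")))
g2 = ds.graph(URIRef("http://example.org/graph/2"))
g2.add((EX.bob, RDFS.label, Literal("Bob")))

# Provenance about the graphs themselves is kept in the default graph.
ds.add((URIRef("http://example.org/graph/1"), PROV.wasDerivedFrom, EX.trustedSource))
ds.add((URIRef("http://example.org/graph/2"), PROV.wasDerivedFrom, EX.unknownSource))

# The provenance statements restrict evaluation to graphs from the trusted source.
results = ds.query("""
    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/>
    SELECT ?s ?label WHERE {
        GRAPH ?g { ?s rdfs:label ?label }
        ?g prov:wasDerivedFrom ex:trustedSource .
    }
""")
for row in results:
    print(row.s, row.label)    # only data derived from the trusted source
```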

Only complex queries across different data sources exploit the full potential of LOD. Supporting such queries requires a federated SPARQL retrieval infrastructure.
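
A minimal sketch of such a federated request, assuming a local entry endpoint and one remote source (both endpoint URLs are placeholders): the SPARQL 1.1 SERVICE keyword ships a sub-pattern to the second source, and the executing engine joins the partial results.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

FEDERATOR_ENDPOINT = "http://localhost:8890/sparql"    # hypothetical entry point
REMOTE_ENDPOINT = "http://example.org/remote/sparql"   # hypothetical second source

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name ?homepage WHERE {
    ?person foaf:name ?name .            # matched at the local endpoint
    SERVICE <%s> {                       # sub-pattern shipped to the remote source
        ?person foaf:homepage ?homepage .
    }
}
LIMIT 10
""" % REMOTE_ENDPOINT

client = SPARQLWrapper(FEDERATOR_ENDPOINT)
client.setQuery(query)
client.setReturnFormat(JSON)

bindings = client.query().convert()["results"]["bindings"]
for row in bindings:
    print(row["name"]["value"], row["homepage"]["value"])
```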

This infrastructure is divided into two main components:

  1. the query interface and
  2. the federator.
  • The query interface is the entry point for any client request.
  • The federator manages the complete query processing chain. Its central phases include query parsing, discovery of applicable data sources, query optimization and distribution, and finally the consolidation of results.
  • To ensure an efficient distributed query process, the federator maintains data statistics and indices. A schematic sketch of this chain follows the list.
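
The following sketch only mirrors the processing chain described above; the class names, method signatures and the index structure are hypothetical and greatly simplified.

```python
from dataclasses import dataclass, field


@dataclass
class SourceIndex:
    """Per-source statistics used for source discovery and query optimization."""
    endpoint: str
    predicates: set = field(default_factory=set)   # predicates known to occur at this source
    triple_count: int = 0


class Federator:
    """Manages the query processing chain: parse, discover, execute, consolidate."""

    def __init__(self, indices):
        self.indices = indices                      # list of SourceIndex

    def parse(self, sparql: str):
        # Placeholder: a real federator would build a SPARQL algebra tree here.
        return [p.strip() for p in sparql.split(".") if p.strip()]

    def discover_sources(self, triple_patterns):
        # Route each pattern to the sources whose index mentions one of its predicates.
        plan = {}
        for pattern in triple_patterns:
            plan[pattern] = [
                idx.endpoint
                for idx in self.indices
                if any(pred in pattern for pred in idx.predicates)
            ]
        return plan

    def execute(self, plan):
        # Placeholder: ship each sub-pattern to its endpoints and collect bindings.
        return {pattern: [] for pattern in plan}

    def consolidate(self, partial_results):
        # Placeholder: join/merge the partial bindings into a single result set.
        merged = []
        for rows in partial_results.values():
            merged.extend(rows)
        return merged


class QueryInterface:
    """Entry point for client requests; delegates the whole chain to the federator."""

    def __init__(self, federator: Federator):
        self.federator = federator

    def ask(self, sparql: str):
        patterns = self.federator.parse(sparql)
        plan = self.federator.discover_sources(patterns)
        partial_results = self.federator.execute(plan)
        return self.federator.consolidate(partial_results)


# Example: route a toy query against two indexed sources.
interface = QueryInterface(Federator([
    SourceIndex("http://example.org/a/sparql", {"foaf:name"}),
    SourceIndex("http://example.org/b/sparql", {"foaf:homepage"}),
]))
print(interface.ask("?p foaf:name ?n . ?p foaf:homepage ?h"))
```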

These concepts are already well known in the database community, but they are not well reflected in the Semantic Web community. It has to be examined whether different join implementations, e.g., semi-joins or mediator joins, have a noteworthy effect on query evaluation time and how they can be applied to distributed RDF data. Furthermore, the federated processing of aggregation functions (e.g., sum and average) to statistically summarize Linked Open Data queries is not covered by currently available technologies. Most scalable triple stores, such as Virtuoso, Jena or Sesame, support aggregate functions, but do not federate their calculation across nodes.
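
To make the federation of aggregates concrete, the sketch below combines partial aggregates (count, sum, sum of squares) returned by individual sources into a global sum, mean and variance without shipping the raw values. The per-source values and function names are invented for illustration.

```python
from math import isclose


def partial_aggregates(values):
    """What a single endpoint could return for, e.g.,
    SELECT (COUNT(?v) AS ?n) (SUM(?v) AS ?s) (SUM(?v * ?v) AS ?sq) ..."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def combine(partials):
    """Cumulative combination of per-source aggregates into global statistics."""
    n = sum(p["n"] for p in partials)
    total = sum(p["sum"] for p in partials)
    sum_sq = sum(p["sum_sq"] for p in partials)
    mean = total / n
    variance = sum_sq / n - mean ** 2          # population variance
    return {"count": n, "sum": total, "mean": mean, "variance": variance}


# Two hypothetical sources holding disjoint value sets.
source_a = [2.0, 4.0, 4.0]
source_b = [4.0, 5.0, 5.0, 7.0, 9.0]

global_stats = combine([partial_aggregates(source_a), partial_aggregates(source_b)])

# Sanity check against computing the statistics over the pooled raw data.
pooled = source_a + source_b
assert global_stats["count"] == len(pooled)
assert isclose(global_stats["mean"], sum(pooled) / len(pooled))
print(global_stats)
```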

The main task in this topic is to ensure data consistency with the underlying concepts by integrating update functionality or transactional models. In order to increase the quality of the retrieved data, provenance information will be injected into central phases of the query evaluation process. This plays a vital role especially for the discovery of data sources and for the ranking of results: with a growing number of data sources it becomes increasingly important to trace the origin of result items and to establish trust in the different data sources.
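
A minimal sketch of provenance-based ranking, assuming each result binding records the named graph it was retrieved from and each source carries a trust score (both the source IRIs and the scores are illustrative assumptions):

```python
# Trust scores per source; in practice these could be derived from provenance records.
TRUST = {
    "http://example.org/source/curated": 0.9,
    "http://example.org/source/crawled": 0.4,
}

# Result items annotated with the graph (i.e. source) they were retrieved from.
results = [
    {"item": "ex:alice", "graph": "http://example.org/source/crawled"},
    {"item": "ex:bob",   "graph": "http://example.org/source/curated"},
]

# Order results by the trust attached to their source; unknown sources rank last.
ranked = sorted(results, key=lambda r: TRUST.get(r["graph"], 0.0), reverse=True)
for r in ranked:
    print(r["item"], r["graph"], TRUST.get(r["graph"], 0.0))
```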