Virtual data warehouses: Towards the full potential of the Web of Data

Linked Data and Big Data are currently very hot topics on the Web. Data warehousing offers promising approaches for efficient analytical processing of statistical data available in the Web of Data. The CODE project has developed technologies to lift statistical data into the Linked Open Data cloud, ensuring the creation of meaningful data provenance chains and integrating openly available background knowledge to improve analytical processes.

In current press releases, the two major (Web-related) buzzwords Big Data and Linked Data are frequently discussed, especially with regard to their interconnections. Without doubt, the two research fields benefit substantially from each other. On the one hand, Big Data aims at creating efficient processing pipelines that enable analytics on very large data corpora. On the other hand, such corpora can be established and maintained in the Linked Open Data cloud, the de facto standard for publishing large volumes of data in distributed endpoints, fostering open data activities such as open government data. Major administrative authorities already publish their statistical data in a Linked Data aware format. A prominent example is the Eurostat database, a leading provider of high-quality statistics on Europe maintained by the European Commission. In this regard, one way to create added value is to apply data warehousing techniques to analyse and aggregate statistical data. Inside a data warehouse, the central data structure is the data cube model: a multi-dimensional dataset with an arbitrary number of dimensions. In addition to the dimensions, attributes and measures are used to describe individual observations.
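The data cube model sketched above can be illustrated in a few lines of Python. The field names (year, country, gdp_growth) are purely illustrative and not taken from any actual Eurostat dataset:

```python
# Minimal sketch of the data cube model: an observation fixes a value
# for every dimension and carries one or more measures.
# All names below are hypothetical, for illustration only.

dimensions = ("year", "country")   # dimensions span the data space
measures = ("gdp_growth",)         # measures hold the observed values

# Each observation is one cell of the multi-dimensional space.
observations = [
    {"year": 2012, "country": "AT", "gdp_growth": 0.9},
    {"year": 2012, "country": "DE", "gdp_growth": 0.4},
    {"year": 2013, "country": "AT", "gdp_growth": 0.3},
]

def cell(obs: dict) -> tuple:
    """Key an observation by its dimension values (its coordinates)."""
    return tuple(obs[d] for d in dimensions)

# A cube is then a mapping from coordinates to measure values.
cube = {cell(o): {m: o[m] for m in measures} for o in observations}
```

In the RDF Data Cube Vocabulary, the same roles are played by qb:DimensionProperty, qb:AttributeProperty and qb:MeasureProperty attached to individual qb:Observation resources.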

The fruitful combination of Linked Data publishing principles and data warehousing analytics has also been recognized by the W3C, leading to the specification of the RDF Data Cube Vocabulary [1] by the Government Linked Data Working Group. This standard defines a Linked Data aware statistical data model ready for online analytical processing (OLAP). Owing to its recent date of publication, broad uptake is not yet visible. Large data corpora incorporating meaningful statistics can certainly be found on the Web, but they are published in heterogeneous formats (e.g., CSV or Excel files) and lack explicit semantics. In view of this observation, the FP7 CODE project created a crowd-sourced triplification process [2] for statistical data based on the following principles:

  1. Data publishing and interlinking: The web-based triplification prototype extracts statistical data from PDF documents, Excel or CSV files. The user defines the dimensions and measures that span the desired data space. During processing, semantic entities are disambiguated with URIs from the Linked Open Data cloud to resolve semantic ambiguities. Further, techniques for data curation and structural manipulation of tabular data are integrated. To prepare the enriched data for statistical analysis, each dimension is classified by a scale type, distinguishing between nominal, ordinal, interval, and ratio scales.
  2. Creation of provenance chains: The Linked Data aware publication of statistical data is not the end of the story, as questions arise such as: Who generated the data? When was it released? Who interacted with it? Information answering these questions is collected and modelled with the W3C PROV Ontology [3]. This leads to provenance information chains that enable the justification of data with respect to its impact and quality.
  3. Crowd-sourced workflows: These ensure both the creation and the validation of the semantically enriched knowledge.
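The scale types mentioned in step 1 matter because they determine which statistical operations are meaningful on a dimension or measure. The following sketch illustrates that distinction; it is a simplified lookup table from measurement theory, not the CODE prototype's actual classification logic:

```python
# Scale types and the aggregation operations they admit (simplified
# illustration based on classic measurement theory, not the CODE
# prototype's implementation).

ADMISSIBLE_OPS = {
    "nominal":  {"count", "mode"},                                   # unordered categories
    "ordinal":  {"count", "mode", "median", "min", "max"},           # ordered categories
    "interval": {"count", "mode", "median", "min", "max", "mean"},   # equal steps, no true zero
    "ratio":    {"count", "mode", "median", "min", "max", "mean", "sum"},  # true zero point
}

def allows(scale: str, operation: str) -> bool:
    """Check whether a statistical operation is meaningful on a scale type."""
    return operation in ADMISSIBLE_OPS[scale]
```

For example, summing is meaningful for ratio-scaled measures such as monetary amounts, while averaging nominal values such as country codes is not.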

Because the semantic entities in the dimensions of data cubes are disambiguated, it is very likely, especially in a governmental use case, that single data cubes can be merged into a global data cube. Merging two or more data cubes primarily reduces to the problem of identifying common dimensions and common observations. If such structures are found, the dataset structure definitions and the observations can be combined into a single data cube. This is not a trivial problem, owing to unequal dataset structure definitions and overlapping observations. In our approach, we distinguish between dimension-centric and data-centric merging. Currently, a large-scale cube merging evaluation is envisioned on the data cubes published by Eurostat. First results showed that more than 5,200 data cubes could be merged into fewer than 1,000 data cubes. By merging data cubes, it is highly probable that hidden insights or dependencies between previously unconnected data cubes can be revealed. These efforts aim towards the creation of distributed, linked and open virtual data warehouses that improve analytical processes by integrating openly available background knowledge.
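The dimension-centric case described above can be sketched as follows: two cubes are merged only if their dimension sets coincide, and observations that share the same coordinates (the overlapping observations) are kept only once. This is a simplified illustration with hypothetical data, not the CODE project's implementation:

```python
# Dimension-centric cube merging (simplified sketch, not the CODE
# project's implementation): cubes with identical dimension sets are
# combined, and overlapping observations are de-duplicated by their
# coordinates in the data space.

def merge_cubes(dims_a, obs_a, dims_b, obs_b):
    """Merge two cubes if their dimensions coincide; return None otherwise."""
    if set(dims_a) != set(dims_b):
        return None  # dataset structure definitions are incompatible
    dims = tuple(sorted(dims_a))
    merged = {}
    for obs in obs_a + obs_b:
        coords = tuple(obs[d] for d in dims)
        merged.setdefault(coords, obs)  # keep the first observation per cell
    return list(merged.values())

# Hypothetical example: one overlapping observation between the cubes.
a = [{"year": 2012, "country": "AT", "gdp_growth": 0.9}]
b = [{"year": 2012, "country": "AT", "gdp_growth": 0.9},
     {"year": 2012, "country": "DE", "gdp_growth": 0.4}]
result = merge_cubes(("year", "country"), a, ("year", "country"), b)
```

A real implementation additionally has to reconcile the dataset structure definitions themselves, e.g. dimensions denoting the same concept under different URIs, which is where the entity disambiguation described above comes into play.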
Currently, the Government Linked Data Working Group is seeking implementation reports in order to move the specification to proposed recommendation and finally to recommendation status. We plan to submit a full implementation report, including a description of our services and lessons learned while using the specification, to provide feedback to the community on this topic.


[1] Richard Cyganiak and Dave Reynolds, “The RDF Data Cube Vocabulary”, W3C Candidate Recommendation, 25 June 2013.
[2] Kai Schlegel, Sebastian Bayerl, Stefan Zwicklbauer, Florian Stegmaier, Christin Seifert, Michael Granitzer and Harald Kosch, “Trusted Facts: Triplifying Primary Research Data Enriched with Provenance Information”, In The Semantic Web: ESWC 2013 Satellite Events (LNCS 7955), pp. 268-270. 2013.
[3] Timothy Lebo, Satya Sahoo, and Deborah McGuinness, “PROV-O: The PROV Ontology”, W3C Recommendation, 30 April 2013.

Useful Links:

EU FP7 CODE Project
CODE Data Extractor (research prototype, constantly under development)

Contact address:

Florian Stegmaier, Kai Schlegel, Michael Granitzer