CODE will develop new, semi-supervised machine-learning techniques for crowd-sourcing fact extraction and linking. Cloud-based workflows will involve users as validation and quality-assurance sources while keeping their effort to a minimum. The developed algorithms, bootstrapped with existing techniques and training data, will be made available as new services for linking academic research papers to Linked Open Data repositories and lightweight ontologies with high accuracy. In particular, the following challenges have to be solved:
- Intelligent, web-based workflows for minimizing human effort in annotating textual data and integrating ontological concepts.
- Scalable, cloud-based semantic enrichment algorithms utilizing MapReduce, stochastic learning techniques, and intelligent data-set selection.
- Scalable, machine-learning-based integration and disambiguation algorithms.
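To make the second challenge concrete, the following is a minimal sketch of a MapReduce-style enrichment step: a map phase emits (mention, document) pairs for terms from a concept gazetteer, and a reduce phase groups them into an inverted index that later linking steps could consume. All names (`map_phase`, `reduce_phase`, the toy gazetteer) are illustrative assumptions, not CODE's actual services.

```python
from collections import defaultdict

def map_phase(documents, gazetteer):
    """Emit (mention, doc_id) pairs for every gazetteer term found in a document."""
    for doc_id, text in documents.items():
        for term in gazetteer:
            if term in text.lower():
                yield term, doc_id

def reduce_phase(pairs):
    """Group the emitted pairs by mention, yielding mention -> sorted document list."""
    grouped = defaultdict(set)
    for term, doc_id in pairs:
        grouped[term].add(doc_id)
    return {term: sorted(ids) for term, ids in grouped.items()}

# Toy corpus standing in for a distributed document collection.
docs = {"d1": "Aspirin reduces fever.", "d2": "Fever and aspirin interact."}
gazetteer = {"aspirin", "fever"}
index = reduce_phase(map_phase(docs, gazetteer))
```

In a real deployment, the map and reduce functions would run on a cluster framework rather than in-process; the point here is only the shape of the computation, which parallelizes trivially over documents.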
Current open, cross-domain information extraction systems are capable of extracting simple, high-frequency factual relationships, such as birth dates, from open data sets like the Web. However, extracting and disambiguating infrequent entities in specialized domains, such as drug effects in the biomedical domain, and integrating them into distributed Linked Data repositories remain unsolved problems.
Solving this challenge requires involving human experts in information extraction tasks by supporting the collaborative assessment of extracted facts. Only crowd-sourcing fact extraction will allow CODE to scale beyond single research fields while achieving satisfactory quality.
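One way to keep expert effort low, sketched below under assumed thresholds, is to route only uncertain extractions to human validators and accept a fact once a majority of validators confirms it. The confidence bounds and the triple format are illustrative assumptions, not a specification of CODE's workflow.

```python
def needs_validation(confidence, low=0.4, high=0.9):
    """Route only uncertain extractions to validators (assumed threshold values)."""
    return low <= confidence < high

def majority_vote(votes):
    """Accept a fact when more than half of the validators confirm it."""
    return sum(votes) > len(votes) / 2

# Extracted (subject, predicate, object, confidence) triples; values are invented.
facts = [("aspirin", "treats", "fever", 0.55), ("ibuprofen", "treats", "pain", 0.95)]
to_review = [f for f in facts if needs_validation(f[3])]
```

High-confidence facts bypass the crowd entirely, so human attention concentrates on the long tail of infrequent, domain-specific relations where automatic extraction is weakest.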
Therefore, CODE will develop iterative workflows around semi-supervised, stochastic machine-learning techniques to extract and disambiguate facts from academic research papers. The iterative, user-centered enrichment workflow will minimize the user effort required to create training data and validate the results. CODE will go beyond the state of the art by developing the algorithms and data structures necessary to support this workflow.
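The iterative workflow described above can be sketched as a self-training round: the model is trained on seed labels, confident predictions on unlabeled items are folded back into the training set, and only the uncertain remainder is queued for user validation. The toy cue-word "model" and the `self_training_round` helper are assumptions for illustration; CODE's actual learners would be stochastic classifiers.

```python
def train(labeled):
    """Toy model: remember the cue words that appeared in positive seed examples."""
    return {cue for cue, is_fact in labeled if is_fact}

def predict(cues, cue):
    """Return (label, confidence); a crude 1.0 / 0.5 split stands in for a real score."""
    return (True, 1.0) if cue in cues else (False, 0.5)

def self_training_round(labeled, unlabeled, threshold=0.9):
    """One iteration: auto-accept confident predictions, queue the rest for users."""
    cues = train(labeled)
    confident, ask_user = [], []
    for cue in unlabeled:
        label, conf = predict(cues, cue)
        (confident if conf >= threshold else ask_user).append((cue, label))
    return labeled + confident, ask_user

seed = [("treats", True), ("causes", True)]
newly_labeled, queue = self_training_round(seed, ["treats", "mentions"])
```

Each round grows the training set automatically while shrinking the validation queue, which is precisely how the workflow keeps the burden on human experts minimal.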