Linking datasets and allowing for their joint exploration and exploitation.
Recent years have witnessed an impressive development of technologies for integrating heterogeneous data, both open and non-open. Within the RISIS project, the SMS platform deploys semantic web technologies for linking and enriching such data.
This supports the integration of the growing amount of data available for science, technology and innovation (policy) studies. Most of these datasets, fueled by scholarly work but also by the needs of policymaking, focus on a single type of information, in many cases available at different levels of aggregation, including micro-level data on publications, patents, careers, etc. Typical examples within RISIS are the Leiden Ranking (bibliometrics), MORE (careers) and EUPRO (European projects). A few more recent datasets focus on the meso-level of organizations, including Higher Education Institutions (ETER), public research organizations, and research funding agencies and programs (JOREP). Many other such datasets are available within the SMS Data Store, such as CORDIS (European projects), GRID (organizations doing R&D), USPTO (patents), and OpenAIRE (open access publications).
The different origins of these datasets, their specific uses and the characteristics of the respective data sources have also led to different structures and technical characteristics, as regards the main analytical units considered, the classifications adopted and the technical specifications (software tool, access, updating). It is important to note that ad hoc design has some relevant advantages: the dataset can be designed to provide specific, use-tailored indicators; a close linkage to the original source is possible (for example concerning classifications); and the hosting organization has the relevant competences for managing and updating the dataset.
At the same time, from the perspective of the users (both scholars and policy-makers), this variety comes at a price: combining data proves to be difficult and time-consuming. Typical problems include differing organization names that require manual matching, the use of different classifications of scientific fields, different data formats, and differences in coverage of the relevant population of individuals or organizations.
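The name-matching problem mentioned above can be illustrated with a minimal sketch. It is not the method used by RISIS or SMS; the function names `normalize` and `best_match` and the 0.85 threshold are assumptions chosen for illustration. It normalizes organization names (case, accents, punctuation) and then picks the most similar candidate using the standard library's `difflib`.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", without_accents.lower())
    return " ".join(cleaned.split())


def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to `name`, or None.

    The threshold is an illustrative assumption; a real pipeline would
    tune it and still review borderline matches by hand.
    """
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None
```

String similarity of this kind only reduces the amount of manual matching; it cannot resolve renamed or merged organizations, which is one reason persistent identifiers are needed.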
Interoperability, as a key component of RISIS, includes different actions:
- First, developing a systematic and standardized documentation of datasets available in the RISIS data portal and in the SMS data store, in order to allow users to identify the data available: coverage, variables, measurement, and classifications.
- Second, developing the SMS services, which support the linking and enriching of datasets and the harmonization of the categories deployed. In parallel, several more ad hoc integrations of some of the RISIS datasets take place.
- Third, developing a set of registers of organizations that uniquely identify organizations and track them across years by means of unique organizational identifiers, which also allows for the analysis of organizational demography. Organization IDs will then be introduced in the RISIS databases and matched with their own identifiers; this will allow users to match organizations straightforwardly and to build panels across time. The register of public and private organizations (OrgReg), which includes Public Research Organizations, Higher Education Institutions and funding agencies, is available for remote access at orgreg.joanneum.at, while a register of private companies (FirmReg) will be available by 2018.
- Fourth, incorporating these registers in the SMS Data Store, so that they become part of the larger data network provided by SMS.
- Fifth, introducing joint classifications and taxonomies across different datasets. A key area of work in RISIS concerns geographic delineation, since no satisfactory geographical classification currently exists at the European level that allows for fine-grained multi-level spatial analysis (beyond the rigid structure of the EUROSTAT NUTS classification). As a solution, the SMS geo-services enable flexible geographical delineation, where the user can define the geographical boundaries at the level of detail needed.
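
The register-based matching described in the third point can be sketched as follows. This is a toy illustration under stated assumptions, not the OrgReg or SMS implementation: the crosswalk, the dataset names, the identifiers (`REG-001`, `a-17`, ...) and the record fields are all hypothetical placeholders. The idea is that each dataset keeps its own local identifiers, a crosswalk maps them to a shared register ID, and records from different datasets can then be combined into a panel keyed by organization and year.

```python
from collections import defaultdict

# Hypothetical crosswalk: (dataset name, local identifier) -> register ID.
CROSSWALK = {
    ("dataset_a", "a-17"): "REG-001",
    ("dataset_b", "b-03"): "REG-001",
    ("dataset_b", "b-09"): "REG-002",
}


def build_panel(records):
    """Group records from several datasets by (register ID, year).

    Returns the panel and the list of records whose local identifier
    has no entry in the crosswalk.
    """
    panel = defaultdict(dict)
    unmatched = []
    for rec in records:
        reg_id = CROSSWALK.get((rec["dataset"], rec["local_id"]))
        if reg_id is None:
            unmatched.append(rec)
            continue
        panel[(reg_id, rec["year"])][rec["variable"]] = rec["value"]
    return dict(panel), unmatched


# Made-up example records, for illustration only.
records = [
    {"dataset": "dataset_a", "local_id": "a-17", "year": 2015,
     "variable": "students", "value": 100},
    {"dataset": "dataset_b", "local_id": "b-03", "year": 2015,
     "variable": "projects", "value": 3},
    {"dataset": "dataset_b", "local_id": "b-99", "year": 2015,
     "variable": "projects", "value": 1},
]
panel, unmatched = build_panel(records)
```

Here the first two records end up in the same panel cell because both local identifiers map to the same register ID, while the third is reported as unmatched rather than silently dropped.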