Connecting datasets and allowing for their joint exploitation.
Recent years have witnessed an impressive development of datasets concerning science, technology and innovation policies, fueled by scholarly work but also by the needs of policymaking (especially at the European level). Largely for reasons related to technical issues and data sources, most of these datasets focus on a single type of information, which is in many cases available at different levels of aggregation, including micro-level data on publications, patents, careers, etc. Typical examples of these datasets within RISIS are the Leiden ranking (bibliometrics), MORE (careers) and EUPRO (European projects). A few more recent datasets focus on the meso-level of organizations, including Higher Education Institutions (ETER), public research organizations and research funding agencies and programs (JOREP).
The different origins of these datasets, their specific uses and the characteristics of the respective data sources have also led to different structures and technical characteristics, with respect to the main analytical units considered, the classifications adopted and the technical specifications (software tool, access, updating). It is important to note that ad hoc design offers some significant advantages: the dataset can be designed to provide specific, use-tailored indicators; a close linkage to the original source is possible (for example concerning classifications); and the hosting organization has the competences required for managing and updating the dataset.
At the same time, from the perspective of the users (both scholars and policy-makers), this distributed architecture comes at a price: combining data proves difficult and time-consuming. Typical problems include differing organization names that require hand matching, the use of different classifications of scientific fields, different data formats, and differences in coverage of the relevant population of individuals or organizations.
Therefore, a major contribution of RISIS to research and innovation studies will be to simplify this process and to reduce the effort required to combine different types of datasets. Interoperability, as a key component of RISIS, comprises several complementary actions:
- First, developing a systematic and standardized documentation of all RISIS datasets, in order to allow users to identify data and classifications and to devise matching procedures. In parallel, harmonizing access to the databases in both technical aspects and legal conditions (through a common gateway and a single accreditation system for the different datasets).
- Second, developing a register of public and private organizations, including Public Research Organizations, Higher Education Institutions, funding agencies and, in a further step, private companies. The register will uniquely identify organizations and track them across years by means of unique organizational identifiers, which will also allow for the analysis of organizational demography. These organization IDs will then be introduced into the RISIS databases and matched with the databases' own identifiers; this will allow users to match organizations straightforwardly and to build panels across time.
- Third, introducing joint classifications and taxonomies across different datasets. A key area of work in RISIS will concern geographic delineation, since no satisfactory geographical classification currently exists at the European level that allows for fine-grained multi-level spatial analysis (beyond the rigid structure of the EUROSTAT NUTS classification).
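
As an illustration of the second action, the following minimal sketch shows how a register of organizational identifiers could let users combine two datasets without hand matching. It is not the RISIS implementation: the class OrganizationRegister, the function join_on_register, the dataset names, and all identifiers and values are hypothetical and chosen only for the example.

```python
class OrganizationRegister:
    """Maps (dataset, local identifier) pairs to a single register ID."""

    def __init__(self):
        self._by_local_id = {}

    def link(self, dataset, local_id, register_id):
        # Record that `local_id` in `dataset` refers to `register_id`.
        self._by_local_id[(dataset, local_id)] = register_id

    def resolve(self, dataset, local_id):
        # Return the register ID, or None if the record is not yet matched.
        return self._by_local_id.get((dataset, local_id))


def join_on_register(register, left_name, left_rows, right_name, right_rows):
    """Pair rows of two datasets that share a register ID and a year."""
    # Index the right-hand dataset by (register ID, year).
    index = {}
    for row in right_rows:
        org = register.resolve(right_name, row["local_id"])
        if org is not None:
            index[(org, row["year"])] = row

    joined = []
    for row in left_rows:
        org = register.resolve(left_name, row["local_id"])
        if org is None:
            continue  # unmatched organizations are left out of the panel
        match = index.get((org, row["year"]))
        if match is not None:
            joined.append({
                "register_id": org,
                "year": row["year"],
                left_name: row["value"],
                right_name: match["value"],
            })
    return joined


# Hypothetical example data: two datasets naming the same organization
# with different local identifiers.
register = OrganizationRegister()
register.link("dataset_a", "A-17", "ORG-0001")
register.link("dataset_b", "univ_x", "ORG-0001")

rows_a = [{"local_id": "A-17", "year": 2012, "value": 350}]
rows_b = [
    {"local_id": "univ_x", "year": 2012, "value": 12},
    {"local_id": "unknown", "year": 2012, "value": 3},
]

panel = join_on_register(register, "dataset_a", rows_a, "dataset_b", rows_b)
print(panel)
```

In this sketch the matching effort is spent once, when local identifiers are linked to the register; every later combination of datasets reduces to a lookup on the shared ID and year. Records whose identifier is not yet in the register are simply skipped, which is one possible design choice among several.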