Introducing SMS Platform
Goal of the Platform
Up to now, science, technology and innovation (STI) studies have been either rich but small in scale (qualitative case studies) or large in scale but under-complex. Large-scale studies generally use a single dataset such as Patstat, Scopus, WoS or the OECD STI indicators, and therefore deploy only the few variables that the available data provide. In our view, however, progress in the STI research field depends on the ability to do large-scale studies with the many variables that relevant theories specify: there is a need for studies that are big and rich at the same time. That requires combining and integrating STI data with data from beyond the STI domain, so that the huge amount of data that is ‘out there’ can be exploited in an innovative and meaningful way. The aim of the SMS (Semantically Mapping Science) data integration platform is to produce richer data for social research by enriching and interlinking heterogeneous datasets, ranging from tabular statistical data to unstructured data found on the Web. SMS adopts an entity-centric approach to data integration, with a predefined set of interlinked entities that can be extended to meet user needs in the STI domain.
An Overview on SMS Linked Data Services and Applications
The SMS platform exposes a set of Web services and applications to make the interlinked data easier for researchers to use. An important benefit of exposing data as a service is that one or more services can be combined with other existing services and applications to build novel STI applications. The services fall into the following categories:
- Metadata Services and Applications: Metadata helps potential users of a dataset decide whether it is appropriate for their purposes. The RISIS project aims to provide a distributed infrastructure for research and innovation dynamics and policies. This infrastructure holds a collection of heterogeneous datasets that are not always publicly accessible because of privacy issues, and that often require a researcher to be physically present at the dataset location. To access these datasets, a researcher must first have an access request granted. This administrative detour, which comes before the researcher can even tell which dataset suits a particular research question, can reduce the number of visitors to RISIS datasets. It has been shown that research publications that provide access to their base data yield consistently higher citation rates than those that do not. Therefore, to attract more users to visit and cite RISIS datasets, SMS provides a dataset metadata service and application, modelled using the Resource Description Framework (RDF), that allows researchers to search for data and gain an in-depth understanding of it without needing to access it directly. The metadata service allows dataset holders to describe their datasets in a detailed, consistent and uniform way, to store the description and, if needed, to modify the stored metadata.
- Named Entity Recognition Services and Applications: Named-entity recognition (NER), also known as entity identification, entity chunking and entity extraction, is a subtask of information extraction that seeks to locate named entities in text and classify them into predefined categories such as names of persons, organizations and locations, expressions of time, quantities, monetary values and percentages. Given a dataset that has one or more attributes with textual values, the SMS NER service can extract named entities from the text and, more importantly, connect the extracted entities to a knowledge graph or taxonomy, which can then provide more data about those entities.
- Data Harmonization Services and Applications: The goal of the SMS data harmonization service is to improve the quality and expressiveness of a dataset by enriching it with an existing standard classification. Harmonized datasets can be interlinked more easily with other datasets. For example, geographic data can be enriched with HASC (Hierarchical Administrative Subdivision Codes) or ISO 3166 country codes, and publication or patent data with FoS (Field of Science), WoS (Web of Science) or IPC (International Patent Classification) classifications.
- Geo-enrichment Services and Applications: Geo-enrichment is an instrument for enriching data by linking through geo-location. Many (open) datasets provide variables that are measured at some level of geographical aggregation, e.g., environmental data, educational data or socio-economic data. To exploit these linking and enriching possibilities, the SMS platform provides a variety of geo-services. The geo-services system is based on a series of open geo-resources, such as GADM, OpenStreetMap and Flickr geotagged data. By integrating these geo-resources, the service can return the geo-location of an entity’s address at up to 11 different levels.
- Data Linking Services and Applications: Data linking is the process of creating a relationship between entities that meet preset conditions. If globally unique identifiers for the entities are available, linking is straightforward. If not, a variety of techniques can be used, from (fuzzy) string matching to comparing the attributes available in the different databases. The SMS linking service emphasizes providing contextual information that helps eliminate ambiguity after a relationship between entities has been established and that enables re-use of the link.
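The data linking idea described in the last category can be illustrated with a minimal Python sketch. This is not the SMS implementation: the record fields (id, name, country, gid), the example records, the use of Python's difflib for fuzzy matching and the 0.85 threshold are all assumptions made for illustration. The sketch links records directly when both carry the same global identifier, otherwise falls back to fuzzy name matching restricted by a second attribute, and stores the evidence for each link as context.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and replace punctuation with single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def link_records(left, right, threshold=0.85):
    """Return links between two lists of record dicts, with evidence per link.

    The keys "id", "name", "country" and "gid" are hypothetical field names.
    """
    links = []
    for a in left:
        for b in right:
            gid_a, gid_b = a.get("gid"), b.get("gid")
            if gid_a and gid_b:
                # Both sides carry a global identifier: it alone decides.
                if gid_a == gid_b:
                    links.append({"left": a["id"], "right": b["id"],
                                  "method": "identifier", "score": 1.0})
                continue
            # Use a second attribute (here: country) to rule out namesakes.
            if a["country"] != b["country"]:
                continue
            score = SequenceMatcher(
                None, normalize(a["name"]), normalize(b["name"])
            ).ratio()
            if score >= threshold:
                links.append({"left": a["id"], "right": b["id"],
                              "method": "name+country",
                              "score": round(score, 2)})
    return links


# Made-up example records; "org:001" is a placeholder identifier.
left_records = [
    {"id": "a1", "name": "VU Amsterdam", "country": "NL", "gid": "org:001"},
    {"id": "a2", "name": "Leiden University", "country": "NL", "gid": None},
]
right_records = [
    {"id": "b1", "name": "Vrije Universiteit Amsterdam", "country": "NL", "gid": "org:001"},
    {"id": "b2", "name": "Leiden  university", "country": "NL", "gid": None},
    {"id": "b3", "name": "Leiden University", "country": "US", "gid": None},
]

links = link_records(left_records, right_records)
for link in links:
    print(link)
```

Keeping the method and score with each link is one simple way to preserve the context in which a link was made, so that a later user can decide whether to re-use it.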
Recent updates
- Added 20+ open datasets related to STI
- Added support for automatic semantic annotation of datasets
- Added support for automatic geo-enrichment of datasets
- Added a flexible faceted browser to extract patterns from interlinked data
Upcoming updates or activities
- Opening of the platform on March 31, first for on-site visits and later for online access
- Training course ‘Linked Data for Science and Innovation Studies’ on 23-24 March 2017
- Adding support for user-defined lenses on interlinked data