FAIR Data Train specifications

1 Introduction

The FAIR Data Train (FDT) comprises a set of application types that support the findability, accessibility, interoperability and reusability of data, following the FAIR principles [FAIR-principles]. The FDT follows the analogy of a train system with two main elements: stations and trains. Stations are responsible for making data (or other types of digital objects) available and for providing metadata about themselves (the Station) and their content (the data or other types of digital objects). Trains represent analysis/processing algorithms that visit stations to process and analyse data. Additionally, a Station Directory can be consulted to find out which stations provide which data. The Station Directory is a special kind of Station that indexes the metadata of other stations and provides search capabilities.

In this document, we take data to include not only the content of artefacts such as databases, tables, graphs, etc., but also other types of digital objects such as controlled vocabularies, ontologies, models, etc. From now on, unless stated otherwise, we use the terms data and other types of digital objects interchangeably. The FAIR Data Train architecture defines a number of key elements to support:

Provisioning of metadata to describe digital objects;
A common representation format [RDF] to express the metadata in a machine-actionable manner;
A common approach to inform clients on how to navigate through the metadata structure;
A common representation format [SHACL] to represent each metadata record's schema;
A common way of describing how data sources support interactions with their content in terms of metadata properties of the station;
A common way to represent how client applications can interact with data, i.e., interfaces and protocols for the station to accept incoming trains.
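To make the idea of machine-actionable metadata in RDF more concrete, the following is a minimal sketch that serialises two statements about a dataset as N-Triples using only the Python standard library. It is an illustration, not part of the specification: the dataset IRI under `example.org` and the helper names `iri`, `literal` and `to_ntriples` are hypothetical, and the use of the DCAT and Dublin Core terms here is an assumption made for the example rather than a vocabulary mandated by this document.

```python
# Well-known vocabulary IRIs (RDF, Dublin Core Terms, DCAT).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_TITLE = "http://purl.org/dc/terms/title"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal.

    Simplified: only backslashes and double quotes are escaped.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialise (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical dataset IRI, used only for illustration.
dataset = "https://example.org/station/dataset/1"

triples = [
    (iri(dataset), iri(RDF_TYPE), iri(DCAT_DATASET)),
    (iri(dataset), iri(DCT_TITLE), literal("Example dataset")),
]

print(to_ntriples(triples))
```

A real implementation would more likely rely on an RDF library and validate each record against its SHACL schema; the hand-written serialiser above only shows the shape of the statements.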

The main goal of the FDT general architecture is to define a set of behaviours, interfaces and protocols that improve interoperability among data sources and data processing services. To fulfil this goal, this document contains a set of specifications to help developers build new applications, or extend the functionality of their existing applications, so that these applications can also be part of the FAIR Data Train ecosystem. The envisioned scenario is one in which a multitude of trains, stations and client applications are created independently and are still able to interact with one another because they all follow the same base guidelines (interfaces, protocols, representation formats, etc.).

1.1 Purpose

The purpose of this document is to present the general architecture of the FAIR Data Train. It covers the requirements, architecture, design principles and design of the FDT. This architecture is primarily intended as a reference for developers who want to add FDT functionality to their existing applications or to develop their own FAIR Data Train implementation. Understanding this specification requires knowledge of RDF, LDP, SHACL and REST APIs.

1.2 Document conventions

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

2 Main components

As depicted in Figure 2.1, the main elements (applications, roles and interaction mechanisms) of the FAIR Data Train are:

FAIR Data Station: application type responsible for making data and their related metadata available to users under the accessibility conditions determined by applicable regulations and the related Data Controllers. The Data Station is also able to register itself in a Station Directory to have its metadata indexed by that directory;
Personal Gateway: application responsible for mediating the communication between Data Stations and Data Controllers. Data Controllers exercise their control over the data available in different Data Stations through the Personal Gateway;
Station Directory: application responsible for indexing metadata from the reachable Data Stations and allowing users (client applications) to search for data available on those stations;
Train: represents the way data consumers interact with the data available on Data Stations. The Trains can be of different types, such as API calls, container images, queries, messages, etc. Trains can only visit Data Stations that support the same type of interaction, e.g., a SQL query train requires a station that supports this type of query;
Train Handler: application that represents client applications, which can interact with the Station Directory to discover the existence and location of data, and with Data Stations to actually work with the available data;
Station Owner: role played by stakeholders that are responsible for managing and running Data Stations. Since the Station Directory is a specialisation of Data Station, the same role of Station Owner applies to the Station Directory;
Data Controller: role played by individuals or organizations that have controlling rights over data;
Train Owner: role played by individuals or organizations responsible for sending Trains to Data Stations;
Train Provider: role played by individuals or organizations that create Trains to be used by themselves or others. When a Train Provider sends a Train it created to access data in Data Stations, it is, at that moment, also playing the role of Train Owner.
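Two of the relationships in the list above can be sketched in code: a Train can only visit a Data Station that supports its type of interaction, and the Station Directory is a specialisation of Data Station that indexes other stations. The following is a minimal illustration under stated assumptions, not a reference implementation: the class names, the `interaction_types` field, the interaction labels such as `"sql"`, and the `register`, `search` and `can_visit` helpers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataStation:
    """A station exposing metadata and the interaction types it supports."""
    name: str
    # Hypothetical labels for supported interactions, e.g. {"sql", "sparql"}.
    interaction_types: set[str] = field(default_factory=set)
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Train:
    """A train carries exactly one type of interaction (query, image, ...)."""
    name: str
    interaction_type: str


@dataclass
class StationDirectory(DataStation):
    """A special kind of station that indexes other stations."""
    index: dict[str, DataStation] = field(default_factory=dict)

    def register(self, station: DataStation) -> None:
        """Index a station under its name."""
        self.index[station.name] = station

    def search(self, interaction_type: str) -> list[str]:
        """Return the sorted names of indexed stations supporting the type."""
        return sorted(
            name
            for name, station in self.index.items()
            if interaction_type in station.interaction_types
        )


def can_visit(train: Train, station: DataStation) -> bool:
    """A train may only visit stations that support its interaction type."""
    return train.interaction_type in station.interaction_types
```

For example, after registering a station that supports only `"sql"`, `can_visit` returns `False` for a train whose `interaction_type` is `"sparql"`, and `search("sparql")` does not list that station.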

Figure 2.1 High-level architecture of the FAIR Data Train

2.1 FAIR Data Station

Since the purpose of the FAIR Data Train architecture is to define a set of desired behaviours that applications should expose and support, the definition of the FAIR Data Station (FDS) should specify the station's interface. For simplicity, we start by dividing the FDS API into three major groups, namely the Metadata Interface, the Station Interface and the Content Interaction Interface, as depicted in Figure 2.2. Each of these interface groups is intended to expose the interfaces of a number of services available at the FDS.

Figure 2.2 Interface groups of the FAIR Data Station

Figure 2.2 also makes explicit that interaction with the data happens through the FDS's Interaction Component, which connects to the actual Data Storage component.

FAIR Data Train general architecture

Working Draft, 15 February 2023
