Requirements and Challenges for Query Execution across Decentralized Environments

Abstract

Due to the economic and societal problems caused by the Web’s growing centralization, there is an increasing interest in decentralizing data on the Web. This decentralization does, however, introduce a number of technical challenges. If we want to give users in decentralized environments the same level of user experience as they are used to with centralized applications, we need solutions to these challenges. We discuss how query engines can act as a layer between applications on the one hand and decentralized environments on the other hand. Query engines thereby act as an abstraction layer that hides the complexities of decentralized data management from application developers. In this article, we outline the requirements for query engines over decentralized environments. Furthermore, we show how existing approaches meet these requirements, and which challenges remain. As such, this article offers a high-level overview of a roadmap in the query and decentralization research domains.

ACM Reference Format: Taelman, R. Requirements and Challenges for Query Execution across Decentralized Environments. In Companion Proceedings of the ACM Web Conference 2024, May 13–17, 2024, Singapore, Singapore. https://doi.org/10.1145/3589335.3652523

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

WWW ’24 Companion, May 13-17, 2024, Singapore, Singapore.
© 2024 Copyright is held by the owner/author(s).
ACM ISBN 979-8-4007-0172-6/24/5.
https://doi.org/10.1145/3589335.3652523

Introduction

In recent decades, numerous instances of personal data exploitation have occurred as a consequence of the growing centralization of data on the Web. Due to this, there has been increasing interest in the decentralization of such personal data, which has spawned various decentralization initiatives such as Solid [1], Bluesky [2], and Mastodon [3]. Common among these decentralized environments is that they distribute personal and permissioned data across a large number of authoritative Web sources. For example, Solid calls these sources personal data pods, which can contain RDF-based Linked Data documents.

Within such decentralized environments, the problem of finding data is of critical importance when we want to build effective decentralized applications. While centralized query processing [4] over RDF data using the SPARQL query language [5] is a well-understood and extensively investigated problem, query processing across decentralized environments is not.

The area of SPARQL query federation [6, 7, 8] that allows queries to be executed across multiple SPARQL endpoints may offer us a solution to part of the problem. However, existing federation techniques are optimized for handling a small number (~10) of large sources [9], whereas decentralized environments such as Solid are characterized by a large number (>1000) of small sources. Furthermore, federation techniques assume sources to be known prior to query execution, which is not feasible in decentralized environments due to the lack of a central index. Alternatively, Link Traversal Query Processing (LTQP) is a technique that discovers sources on-the-fly and can handle a very large number of them. With LTQP, the follow-your-nose principle of Linked Data is used to enable the query engine to follow links between documents to discover additional information. Initial experiments have shown [10] that LTQP is promising for decentralized environments such as Solid.
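To make the follow-your-nose principle concrete, the sketch below shows a minimal link-traversal loop: it fetches a seed document, parses its triples, and queues every dereferenceable IRI it encounters as a potential next source. This is an illustrative simplification rather than the algorithm of any particular engine; it uses the N3.js parser, and the seed URL is a hypothetical pod document.

```typescript
import { Parser, Quad } from 'n3';

async function traverse(seedUrl: string, maxDocuments = 50): Promise<Quad[]> {
  const queue: string[] = [seedUrl];
  const visited = new Set<string>();
  const quads: Quad[] = [];

  while (queue.length > 0 && visited.size < maxDocuments) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    const response = await fetch(url, { headers: { accept: 'text/turtle' } });
    if (!response.ok) continue;  // be lenient towards unreachable documents

    const documentQuads = new Parser({ baseIRI: url }).parse(await response.text());
    quads.push(...documentQuads);  // these triples feed the query operators

    for (const quad of documentQuads) {
      // Follow-your-nose: IRIs in object position are links to further
      // documents that may contain relevant data.
      if (quad.object.termType === 'NamedNode' && quad.object.value.startsWith('http')) {
        queue.push(quad.object.value.split('#')[0]);  // strip fragment
      }
    }
  }
  return quads;
}

// Usage: traverse('https://alice.pod.example/profile/card').then((q) => console.log(q.length));
```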

Requirements

In order to build applications across decentralized environments, an interaction layer is needed that abstracts away the complexities of reading from and writing to these environments. Such an abstraction layer can be achieved in the form of query engines, where queries are the declarative language of exchange. Below, we highlight the key requirements for query engines across decentralized environments such as Solid. We limit ourselves to the requirements of read-only queries, and leave write queries for future work.

Execution of arbitrary structured queries

A query engine should be able to execute queries without any restrictions. Concretely, any SPARQL query should be executable, and the full expressiveness of the SPARQL language should be usable.
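As a minimal sketch of what this looks like from an application's perspective, the snippet below hands an arbitrary SPARQL query to a query engine (here the Comunica JavaScript engine) and consumes the results, leaving all planning and execution decisions to the engine. The pod document URL and the use of the FOAF vocabulary are illustrative assumptions.

```typescript
import { QueryEngine } from '@comunica/query-sparql';

const engine = new QueryEngine();

// Any SPARQL query should be accepted; the engine decides how to execute it.
const bindingsStream = await engine.queryBindings(`
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?friendName WHERE {
    ?me foaf:name ?name ;
        foaf:knows ?friend .
    ?friend foaf:name ?friendName .
  } LIMIT 100`, {
  sources: ['https://alice.pod.example/profile/card'],  // hypothetical pod document
});

bindingsStream.on('data', (bindings) => console.log(bindings.get('friendName')?.value));
```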

Discovery of data within pods

Decentralized environments such as Solid group personal data in so-called pods. Such a pod therefore contains all data created or owned by a person or agent. A pod can contain multiple resources, where each resource contains RDF triples that have specific access control. As such, pods lead to data being spread over different resources. Query engines aiming to query over these pods should be able to discover all RDF triples across all resources in the pod.
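Because Solid's document interface is based on the Linked Data Platform (LDP), a container's RDF representation lists its members via ldp:contains triples, which a query engine can use to enumerate the resources in a pod. The sketch below illustrates this for a hypothetical container; nested containers would be traversed recursively in the same way.

```typescript
import { QueryEngine } from '@comunica/query-sparql';

const engine = new QueryEngine();
const container = 'https://alice.pod.example/photos/';  // hypothetical pod container

// The container document itself is queried for its ldp:contains members.
const results = await (await engine.queryBindings(`
  PREFIX ldp: <http://www.w3.org/ns/ldp#>
  SELECT ?resource WHERE { <${container}> ldp:contains ?resource . }`, {
  sources: [container],
})).toArray();

for (const bindings of results) {
  // Each discovered resource can now be fetched (or traversed further) to
  // obtain the RDF triples it contains.
  console.log(bindings.get('resource')?.value);
}
```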

Discovery of data across pods

Since data can be interlinked between resources in different pods, it can also be useful to materialize these interlinks using queries. As such, query engines should be able to discover and follow links between resources in different pods.
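The sketch below illustrates such a cross-pod query with the experimental Comunica link-traversal distribution for Solid: starting from a single seed document, the engine follows foaf:knows links into the contacts' pods at query time, without those pods being listed as sources upfront. The WebID, pod URLs, and vocabulary usage are illustrative assumptions.

```typescript
import { QueryEngine } from '@comunica/query-sparql-link-traversal-solid';

const engine = new QueryEngine();
const bindingsStream = await engine.queryBindings(`
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?contactName WHERE {
    <https://alice.pod.example/profile/card#me> foaf:knows ?contact .
    ?contact foaf:name ?contactName .
  }`, {
  sources: ['https://alice.pod.example/profile/card'],  // single seed; other pods are discovered
  lenient: true,  // skip documents that cannot be fetched or parsed
});

bindingsStream.on('data', (bindings) => console.log(bindings.get('contactName')?.value));
```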

Handling location heterogeneity

Data pods can be heterogeneous in terms of their data locations. For example, one pod may store pictures grouped by year, while another pod may store pictures by location. As such, in order for queries to be generic and usable for different pods, query engines should make no assumptions about data paths.
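As a small illustration, under hypothetical URLs and vocabulary: the first approach below only works for pods that happen to use one specific folder layout, while the declarative pattern works regardless of where a pod stores its pictures.

```typescript
// Brittle: assumes every pod stores pictures under /pictures/<year>/.
const byPath = await fetch('https://alice.pod.example/pictures/2023/');

// Generic: match pictures by type, wherever the pod's documents place them,
// e.g. executed via a traversal-capable engine as in the earlier sketches.
const byPattern = `
  PREFIX schema: <http://schema.org/>
  SELECT ?picture WHERE { ?picture a schema:ImageObject . }`;
```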

Handling schema heterogeneity

Data pods can be heterogeneous in terms of their data model. As such, query engines should be able to discover and infer the schemas used in a pod, even if they differ from the one used in the query. Furthermore, schema alignment should also take place across multiple pods.
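One possible client-side direction is sketched below: a query-rewriting step that expands a predicate into the set of predicates known to be equivalent, using SPARQL 1.1 property path alternatives. The equivalence table and rewrite strategy are illustrative assumptions, not an existing engine feature.

```typescript
// Hypothetical table of predicates considered equivalent across vocabularies.
const EQUIVALENT_PREDICATES: Record<string, string[]> = {
  'http://xmlns.com/foaf/0.1/name': [
    'http://xmlns.com/foaf/0.1/name',
    'http://schema.org/name',
    'http://www.w3.org/2006/vcard/ns#fn',
  ],
};

// Expand a predicate into a property path alternative over its equivalents.
function expandPredicate(predicate: string): string {
  const alternatives = EQUIVALENT_PREDICATES[predicate] ?? [predicate];
  return alternatives.map((iri) => `<${iri}>`).join('|');
}

// The rewritten query matches names expressed in any of the known schemas.
const query = `
  SELECT ?name WHERE {
    <https://alice.pod.example/profile/card#me> ${expandPredicate('http://xmlns.com/foaf/0.1/name')} ?name .
  }`;
```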

Handling API heterogeneity

While pods are currently exposed using a document-oriented interface, different kinds of APIs may be added in the future. Data pods can be heterogeneous in terms of their API: some pods may fully expose their data using one API, while others may use a different API, and a single pod may even expose subsets of its full dataset through different APIs. Therefore, query engines should be able to understand and exploit these different APIs as effectively as possible.
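A minimal sketch of what exploiting this could look like from the client side: Comunica, for instance, accepts typed sources next to plain URLs, so that the engine can use the most efficient access pattern each API offers. The URLs are hypothetical, and the exact set of supported source types should be checked against the engine's documentation.

```typescript
import { QueryEngine } from '@comunica/query-sparql';

const engine = new QueryEngine();
const bindingsStream = await engine.queryBindings(
  'SELECT * WHERE { ?s ?p ?o } LIMIT 10',
  {
    sources: [
      'https://alice.pod.example/profile/card',                      // plain document, API auto-detected
      { type: 'sparql', value: 'https://bob.pod.example/sparql' },   // SPARQL endpoint in front of a pod
      { type: 'file', value: 'https://carol.pod.example/data.ttl' }, // static RDF document
    ],
  },
);
```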

Authentication

While pods may contain fully public data, data is usually private and only accessible to certain people or groups. Since different data within pods may be exposed at different levels of access control, query engines need to be able to request this data on behalf of the user through authentication mechanisms.
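A sketch of what this can look like with Solid-OIDC: the user's session yields an authenticated fetch function that the query engine uses for all HTTP requests. The identity provider, client credentials, and resource URL are placeholders, and the exact context option for passing the authenticated fetch should be treated as an assumption to verify against the engine's documentation.

```typescript
import { Session } from '@inrupt/solid-client-authn-node';
import { QueryEngine } from '@comunica/query-sparql';

// Log in on behalf of the user via their Solid-OIDC identity provider.
const session = new Session();
await session.login({
  oidcIssuer: 'https://login.example.org',  // hypothetical identity provider
  clientId: 'my-client-id',                 // hypothetical client credentials
  clientSecret: 'my-client-secret',
});

const engine = new QueryEngine();
const bindingsStream = await engine.queryBindings(
  'SELECT * WHERE { ?s ?p ?o } LIMIT 10',
  {
    sources: ['https://alice.pod.example/private/notes.ttl'],  // access-controlled resource
    fetch: session.fetch,  // every HTTP request now carries the user's credentials
  },
);
```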

User-perceived performance

Decentralized applications should be usable with a sufficient level of user-perceived performance. As such, when a user performs an action that translates to a query execution, the user should be served with at least a partial response based on query results in the order of seconds, so that user attention is kept within the application.
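Streaming result interfaces make this possible: engines such as Comunica expose results as a push-based bindings stream, so an application can render the first answers within seconds while slower sources are still being processed. A minimal sketch, with a hypothetical source:

```typescript
import { QueryEngine } from '@comunica/query-sparql';

const engine = new QueryEngine();
const start = Date.now();

const bindingsStream = await engine.queryBindings(
  'SELECT ?s WHERE { ?s ?p ?o } LIMIT 100',
  { sources: ['https://alice.pod.example/profile/card'] },
);

bindingsStream.on('data', (bindings) => {
  // Render each partial result immediately, e.g. by appending a row in the UI.
  console.log(`+${Date.now() - start} ms`, bindings.get('s')?.value);
});
bindingsStream.on('end', () => console.log('All results received.'));
```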

Challenges

In this section, we analyze how the requirements above are met by existing systems, and discuss which challenges remain.

To the best of our knowledge, three approaches exist at the time of writing that attempt to query or search through Solid pods in one way or another. A first approach (LTQP Solid) [10] applies LTQP to Solid, and makes use of the structural properties of Solid pods to optimize link traversal. It allows full SPARQL queries to be executed across one or multiple Solid pods. ESPRESSO [11] is a second approach that focuses on enabling keyword search over Solid pods. It makes use of a distributed index for Solid pods, which can be accumulated in a single location. This accumulated index can then be queried using keyword search to find the pods relevant to a query. POD-QUERY [12] is a third approach that places a SPARQL query engine agent in front of a Solid pod, which enables full SPARQL queries to be executed over single Solid pods. This approach does not enable query execution across multiple pods.

Requirement                                  LTQP Solid   ESPRESSO   POD-QUERY
Execution of arbitrary structured queries    ✓                       ✓
Discovery of data within pods                ✓            ✓          ✓
Discovery of data across pods                ✓            ~
Handling location heterogeneity              ✓            ✓          ✓
Handling schema heterogeneity
Handling API heterogeneity
Authentication                               ✓            ✓          ✓
User-perceived performance                   ~            ✓          ✓

Fig. 1: An illustration of which requirements are met by which approaches. A checkmark (✓) indicates that the requirement is fully met. A tilde (~) indicates that the requirement is partially met. Otherwise, the requirement is not met.

Fig. 1 indicates how well each approach meets the requirements listed above. This table shows that none of the approaches meets all of the requirements for querying over decentralized environments. Most of the requirements are met by at least some of the approaches, but the requirements for “Handling schema heterogeneity” and “Handling API heterogeneity” are met by none of the approaches. If SPARQL querying across multiple pods is desired, a traversal-based approach such as LTQP Solid will be needed. However, this approach only achieves acceptable levels of performance for non-complex queries.

As such, future research is needed in at least the following three areas. First, the heterogeneity of schemas across pods must be handled. Schema alignment could happen server-side, by requesting data in specific vocabularies, or client-side at query time if the server does not provide such alignment capabilities. Second, the heterogeneity of pods in terms of their APIs must be handled. For this, query engines need to be able to discover the capabilities of heterogeneous APIs and use them as efficiently as possible during query planning. Third, more work is needed towards improving the performance of traversal-based queries. This could be done by focusing on client-side optimization techniques, such as link prioritization techniques for traversal or better query planning algorithms. Alternatively, it could be done by exposing additional metadata, or by introducing auxiliary summaries in third-party aggregators, which could reduce the amount of work that needs to be done in the client-side query engine if such an aggregator already has part of the query answer available.
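As an example of the first kind of client-side optimization, the sketch below replaces a first-in-first-out link queue with one that pops links by a heuristic score, here preferring same-pod and shallow URLs. The scoring function is purely illustrative; finding good prioritization heuristics is exactly the open research question.

```typescript
interface QueuedLink { url: string; score: number; }

// A link queue that hands out the most promising links first.
class PriorityLinkQueue {
  private links: QueuedLink[] = [];

  push(url: string, seedOrigin: string): void {
    const sameOrigin = new URL(url).origin === seedOrigin ? 2 : 0;  // prefer the seed pod
    const shallow = 5 - Math.min(5, url.split('/').length - 3);     // prefer shallow documents
    this.links.push({ url, score: sameOrigin + shallow });
    this.links.sort((a, b) => b.score - a.score);                   // highest score first
  }

  pop(): string | undefined {
    return this.links.shift()?.url;
  }
}

// Usage: a traversal loop would pop() the next link to dereference,
// and push() newly discovered links together with the seed's origin.
```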

Acknowledgements

This research was supported by SolidLab Vlaanderen (Flemish Government, EWI and RRF project VV023/10). Ruben Taelman is a postdoctoral fellow of the Research Foundation – Flanders (FWO) (1202124N).