Data is highly decentralized
- Produced by governments, companies, individuals, …
- Stored at different locations across the world
- Consumed through interactive applications, processes, intelligent agents, …
Decentralized data integration is challenging for user-facing app developers
-
Discovering data
How can I find data sources? How can I find data within a source?
-
Combining data
How to combine data across different data sources?
-
Preserving privacy
How to avoid leaking sensitive data?
Query processing over centralized data
Centralization not always possible
-
Private data
Technical and legal reasons
-
Evolving data
Requires continuous re-indexing
-
Web scale data
Indexing the whole Web is infeasible (for non-tech-giants)
How to query over decentralized data?
-
Data and query engine are not collocated
Query engine runs on a separate machine
-
Not just one dataset
Data is spread over the Web into multiple documents
Approaches for querying over decentralized data
-
Federated Query Processing
Distributing query execution across known sources
-
Link Traversal Query Processing
Local query execution over sources that are discovered by following links
Federation distributes queries over APIs
-
Clients perform limited effort
Split up the query, distribute it across sources (source selection), and combine the results (see the sketch below)
-
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets
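To make this split concrete, below is a minimal sketch of a SPARQL 1.1 federated query; the SERVICE keyword and the SPARQL Protocol are standard, but the endpoint URLs are hypothetical placeholders. Federation engines such as FedX or Comunica can also perform source selection automatically, without hand-written SERVICE clauses.

```typescript
// Sketch: a SPARQL 1.1 federated query. Each SERVICE block is evaluated
// against one (hypothetical) endpoint; the federating engine joins the
// partial results and returns the combined bindings to the client.
const federatedQuery = `
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?person ?name WHERE {
    SERVICE <https://endpoint-a.example.org/sparql> { ?person a foaf:Person. }
    SERVICE <https://endpoint-b.example.org/sparql> { ?person foaf:name ?name. }
  }`;

// The query is sent over the SPARQL Protocol to a federation-capable
// endpoint, or evaluated directly by a client-side federation engine.
console.log(federatedQuery);
```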
Link traversal follows linked documents
-
Documents are linked to each other
Following the Linked Data principles
-
Query engine can follow links
Start from one document, and discover new documents on the fly
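As a sketch of link traversal in code, the snippet below assumes the experimental @comunica/query-sparql-link-traversal package (its API mirrors the main Comunica engine); the seed document and profile URIs are hypothetical.

```typescript
// Sketch of link-traversal query execution, assuming the experimental
// @comunica/query-sparql-link-traversal package. The seed URL is hypothetical.
import { QueryEngine } from '@comunica/query-sparql-link-traversal';

async function main(): Promise<void> {
  const engine = new QueryEngine();
  // Start from one seed document; the engine dereferences it, and follows
  // links found in retrieved documents to discover new sources on the fly.
  const bindingsStream = await engine.queryBindings(`
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?friendName WHERE {
      <https://alice.example.org/profile#me> foaf:knows ?friend.
      ?friend foaf:name ?friendName.
    }`, {
    sources: ['https://alice.example.org/profile'],
    lenient: true, // ignore documents that fail to download or parse
  });

  bindingsStream.on('data', (bindings) => {
    console.log(bindings.get('friendName')?.value);
  });
}

main().catch(console.error);
```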
Limitations of querying approaches
-
All federation members must be known before execution starts
Source selection distributes the query across a fixed list of sources
No discovery of new sources
-
Limited scalability of federation in terms of number of sources
Current federation techniques scale to the order of 10 sources
-
Link traversal can be too slow in practice
Too many links are followed for complex queries
Focus on Knowledge Graphs
Publishing Knowledge Graphs as SPARQL Endpoints
SPARQL endpoint: an HTTP API that accepts SPARQL queries and replies with their results (see the protocol sketch below).
-
Most popular way to publish Knowledge Graphs
Alternatives are data dumps and Linked Data Documents
-
Very powerful
Very complex queries can be formulated with SPARQL
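As a sketch of what such an API looks like on the wire: the SPARQL Protocol lets clients send a query over HTTP and receive results in a standard format (the endpoint URL below is a hypothetical placeholder).

```typescript
// Sketch: querying a SPARQL endpoint via the SPARQL Protocol (HTTP GET).
// The endpoint URL is a hypothetical placeholder.
const endpoint = 'https://example.org/sparql';
const query = 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10';

async function querySparqlEndpoint(): Promise<void> {
  const response = await fetch(`${endpoint}?query=${encodeURIComponent(query)}`, {
    headers: { Accept: 'application/sparql-results+json' },
  });
  const json = await response.json();
  // The SPARQL JSON results format contains one object per solution mapping.
  for (const row of json.results.bindings) {
    console.log(row.s.value, row.p.value, row.o.value);
  }
}

querySparqlEndpoint().catch(console.error);
```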
SPARQL endpoints have low availability
Public endpoints have an availability of 95%.
→ ~1.5 days downtime per month!
Vandenbussche, Pierre-Yves, et al. "SPARQLES: Monitoring public SPARQL endpoints." Semantic Web 8.6 (2017): 1049-1065.
-
SPARQL query expressivity is high
Some queries can be computationally intensive
-
Publicly accessible
An unbounded number of clients can send queries
SPARQL endpoints have restrictions
To counter availability issues
-
Timeouts
Limit execution time per query
-
Rate limits
Limit number of queries a client can send
→ Limits the types and number of queries that can be executed!
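On the client side, these restrictions typically surface as aborted requests or HTTP 429 responses; below is a hedged sketch of how a client can respect them (the time budget and retry count are arbitrary illustrative choices).

```typescript
// Sketch: dealing with endpoint restrictions on the client side.
// HTTP 429 (Too Many Requests) and the Retry-After header are standard HTTP;
// the 60-second budget and 3 retries are arbitrary illustrative choices.
async function fetchWithRestrictions(url: string): Promise<Response> {
  for (let attempt = 0; attempt < 3; attempt++) {
    // Mirror the server-side timeout with a local time budget per request.
    const response = await fetch(url, { signal: AbortSignal.timeout(60_000) });
    if (response.status !== 429) {
      return response;
    }
    // Rate limited: wait for the period announced by the server, then retry.
    const retryAfterSeconds = Number(response.headers.get('Retry-After') ?? '10');
    await new Promise((resolve) => setTimeout(resolve, retryAfterSeconds * 1000));
  }
  throw new Error('Rate limit still exceeded after retries');
}
```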
Alternatives to SPARQL endpoints
To limit expressivity
Linked Data Fragments (LDF): an axis of Web APIs that investigates the trade-off between server and client effort for query execution (see the TPF sketch below).
Verborgh, Ruben, et al. "Triple Pattern Fragments: a low-cost knowledge graph interface for the Web." Journal of Web Semantics 37 (2016): 184-206.
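Triple Pattern Fragments (TPF) is the best-known point on this axis: the server only answers single triple-pattern requests, and the client-side engine decomposes SPARQL queries into such requests and joins the results locally. Below is a hedged sketch of one fragment request; the dataset URL is hypothetical, and real clients discover the request parameters from the fragment's hypermedia controls rather than hard-coding them.

```typescript
// Sketch: fetching one Triple Pattern Fragment. The server returns the triples
// matching a single pattern, plus count metadata and hypermedia controls
// (paging, search form). The dataset URL is hypothetical.
const fragmentUrl = new URL('https://fragments.example.org/dataset');
fragmentUrl.searchParams.set('predicate', 'http://xmlns.com/foaf/0.1/name');

async function fetchFragment(): Promise<string> {
  const response = await fetch(fragmentUrl, { headers: { Accept: 'text/turtle' } });
  return response.text(); // matching triples + metadata, one page at a time
}

fetchFragment().then(console.log).catch(console.error);
```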
LDF interfaces complicate federation
→ Federation engines must combine data across heterogeneous APIs
Personal data
Decentralization initiatives offer users full control over where data is stored and who can access it
How to query protected data?
-
Privacy preservation
Ensure only authorized agents can access data.
Trusted servers? Client-side? Combinations?
-
Trustworthiness
Allow inspection of how and where data was combined
→ Lack of understanding of how to do this for decentralized data!
Solid as decentralized environment
-
Set of standards to decentralize data
Linked Data, LDP, Authentication, …
-
Solid pods decouple data storage from apps
Users are in full control over their pod
-
Foundation for Linked Web Storage
Being standardized in the W3C Linked Web Storage Working Group
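As a sketch of what this looks like in practice: a public Solid pod resource is a Linked Data document behind plain HTTP (the pod URL below is hypothetical; private resources additionally require Solid-OIDC authentication, which is omitted here).

```typescript
// Sketch: reading a public resource from a (hypothetical) Solid pod.
// Pod resources are exposed as LDP resources over HTTP, typically as RDF;
// authentication for private resources (Solid-OIDC) is omitted in this sketch.
async function readPodResource(): Promise<string> {
  const response = await fetch('https://alice.solidpod.example/public/profile', {
    headers: { Accept: 'text/turtle' },
  });
  if (!response.ok) {
    throw new Error(`Could not read resource: HTTP ${response.status}`);
  }
  return response.text(); // an RDF document describing the resource
}

readPodResource().then(console.log).catch(console.error);
```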
We develop the Comunica framework
-
Abstraction layer to query decentralized data
Federation, link traversal, Solid, …
SPARQL, GraphQL-LD, …
-
Modular query engine framework
Flexible experimentation ground for research (link traversal, federation, …)
-
Open-source (MIT)
Used for research and beyond
Used by 1,500+ open-source projects, 380,000+ monthly downloads
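A minimal usage sketch of Comunica's query API, assuming the @comunica/query-sparql package; the source URLs are hypothetical, and the engine detects each source's type (SPARQL endpoint, TPF interface, plain RDF document, …) before federating over them.

```typescript
// Sketch of Comunica's query API, assuming the @comunica/query-sparql package.
// Source URLs are hypothetical; Comunica auto-detects each source's type
// (SPARQL endpoint, TPF interface, plain RDF document, …) and federates over them.
import { QueryEngine } from '@comunica/query-sparql';

async function main(): Promise<void> {
  const engine = new QueryEngine();
  const bindingsStream = await engine.queryBindings(
    'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10',
    {
      sources: [
        'https://endpoint.example.org/sparql',           // SPARQL endpoint
        'https://fragments.example.org/dataset',         // Triple Pattern Fragments
        'https://alice.solidpod.example/public/profile', // plain Linked Data document
      ],
    },
  );
  const bindings = await bindingsStream.toArray();
  console.log(`${bindings.length} results`);
}

main().catch(console.error);
```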
The future of RDF and SPARQL
-
Contributing to the RDF and SPARQL W3C working group
Towards RDF and SPARQL 1.2
-
And towards SPARQL 1.3
Better list support, lateral joins, rate limit announcement, …
LDES enables data silo synchronization
-
Linked Data Event Streams
A low-cost Linked Data API for replication and synchronization across datasets
-
For Linked Data publishers
Recommended by Flanders, SEMIC (EU), …
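A hedged sketch of what an LDES looks like on the wire: an ldes:EventStream whose members are immutable, timestamped objects, shown here as a Turtle string (all identifiers are hypothetical).

```typescript
// Sketch: the shape of a Linked Data Event Stream, serialized as Turtle.
// All identifiers are hypothetical. Clients replicate a dataset by fetching
// the stream once, and synchronize by polling for newly added members.
const ldesExample = `
@prefix ldes: <https://w3id.org/ldes#> .
@prefix tree: <https://w3id.org/tree#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<https://data.example.org/stream> a ldes:EventStream ;
    ldes:timestampPath dct:created ;
    tree:member <https://data.example.org/observations/1> .

<https://data.example.org/observations/1>
    dct:created "2024-01-01T10:00:00Z"^^xsd:dateTime .
`;

console.log(ldesExample);
```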
From access to usage control with ODRL
-
ODRL: Open Digital Rights Language
Policy expression language
-
Usage control
Traditional access control mechanisms are too limited
→ Permissions, prohibitions, duties
Retain data for max 10 days, use data only for medical research, …
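A hedged sketch of the "use data only for medical research" example as an ODRL policy in JSON-LD; the identifiers are hypothetical, and real deployments typically pin down purposes and duties in an ODRL profile.

```typescript
// Sketch: an ODRL policy (JSON-LD) that permits using a dataset only for a
// given purpose. All identifiers are hypothetical; duties such as
// "delete after 10 days" would be modelled similarly via an ODRL profile.
const policy = {
  '@context': 'http://www.w3.org/ns/odrl.jsonld',
  '@type': 'Set',
  uid: 'https://policies.example.org/medical-research-only',
  permission: [{
    target: 'https://data.example.org/health-records',
    action: 'use',
    constraint: [{
      leftOperand: 'purpose',
      operator: 'eq',
      rightOperand: 'https://vocab.example.org/MedicalResearch',
    }],
  }],
};

console.log(JSON.stringify(policy, null, 2));
```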
Conclusion: Query-driven data integration for decentralization
-
Knowledge Graph technologies
Universal semantics to interlink data across organizational boundaries
-
Need for intelligent query engines
Abstract away complexities of decentralized data
-
Querying over decentralized data → research challenges
Open to collaborations!