Data is highly decentralized
- Produced by governments, companies, individuals, …
- Stored at different locations across the world
- Consumed through interactive applications, processes, intelligent agents, …
Decentralized data integration is challenging for application developers
-
Discovering data
How can I find data sources? How can I find data within a source?
-
Combining data
How to combine data across different data sources?
-
Preserving privacy
How to not leak sensitive data?
SPARQL processing over centralized data
Centralization not always possible
-
Private data
Technical and legal reasons
-
Evolving data
Requires continuous re-indexing
-
Web scale data
Indexing the whole Web is infeasible (for non-tech-giants)
How to query over decentralized data?
-
Data and query engine are not collocated
Query engine runs on a separate machine
-
Not just one dataset
Data is spread over the Web into multiple data sources
Approaches for querying over decentralized data
-
Federated Query Processing
Distributing query execution across known sources
-
Link Traversal Query Processing
Local query execution over sources that are discovered by following links
Client distributes query over query APIs
-
Clients perform only limited effort
Split up the query, distribute it (source selection), and combine results
-
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets
Federation over SPARQL endpoints
-
Servers are SPARQL endpoints (most common)
They accept any valid SPARQL query
-
Manual source selection
Use SERVICE clauses to define which sources apply to query parts
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
SERVICE <http://example.com/kegg> {
?keggDrug rdf:type kegg:Drug.
?keggDrug bio2rdf:xRef ?id.
?keggDrug purl:title ?title.
}
}
Automatic source selection
-
User may not know how data is distributed across endpoints
Automatically rewrite query + sources in terms of SERVICE clauses
# Source 1: http://example.com/drb
# Source 2: http://example.com/kegg
SELECT ?drug ?title WHERE {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
?keggDrug rdf:type kegg:Drug.
?keggDrug bio2rdf:xRef ?id.
?keggDrug purl:title ?title.
}
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
SERVICE <http://example.com/kegg> {
?keggDrug rdf:type kegg:Drug.
?keggDrug bio2rdf:xRef ?id.
?keggDrug purl:title ?title.
}
}
-
Query as if all triples are merged into a single endpoint
Main federation bottleneck: data transfer
-
Challenging the data locality principle
Data locality: keep data physically close to the place where it's processed
Federated querying leads to a separation of data and processing
-
Optimizations focus on maximizing data locality
Minimize size and number of HTTP requests
Through query planning and intelligent join algorithms
Exhaustive source selection (worst-case)
-
Wrap each triple pattern into UNIONs of SERVICEs
To ensure completeness
# Source 1: http://example.com/drb
# Source 2: http://example.com/kegg
SELECT ?drug ?title WHERE {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
SELECT ?drug ?title WHERE {
{
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
}
} UNION {
SERVICE <http://example.com/kegg> {
?drug db:drugCategory dbc:micronutrient.
}
}
{
SERVICE <http://example.com/drb> {
?drug db:casRegistryNumber ?id.
}
} UNION {
SERVICE <http://example.com/kegg> {
?drug db:casRegistryNumber ?id.
}
}
}
Identifying exclusive groups
-
Exclusive group
Set of triple patterns that can be sent to a single endpoint
-
Identified during query planning
Using ASK or COUNT queries, VoID descriptions, …
-
Exclusive groups are more selective
Reduces number and size of HTTP requests
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
}
SERVICE <http://example.com/drb> {
?drug db:casRegistryNumber ?id.
}
}
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
}
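The ASK-based identification of exclusive groups can be sketched as follows. This is a minimal sketch: the in-memory `SOURCES` table and the `ask()` helper are hypothetical stand-ins for real SPARQL endpoints answering `ASK` queries.

```python
from collections import defaultdict

# Hypothetical stand-ins for SPARQL endpoints: each source maps to the set of
# predicates it can answer (a real engine would send ASK queries over HTTP).
SOURCES = {
    "http://example.com/drb": {"db:drugCategory", "db:casRegistryNumber"},
    "http://example.com/kegg": {"rdf:type", "bio2rdf:xRef", "purl:title"},
}

def ask(source, predicate):
    """Simulate `ASK { ?s <predicate> ?o }` against a source."""
    return predicate in SOURCES[source]

def exclusive_groups(patterns):
    """Group triple patterns by the single source that can answer them.

    A pattern answerable by exactly one source joins that source's exclusive
    group, so the whole group can be shipped in one SERVICE clause.
    """
    groups = defaultdict(list)
    for pattern in patterns:
        matching = [s for s in SOURCES if ask(s, pattern[1])]
        if len(matching) == 1:  # exclusive to one endpoint
            groups[matching[0]].append(pattern)
    return dict(groups)

query = [
    ("?drug", "db:drugCategory", "dbc:micronutrient"),
    ("?drug", "db:casRegistryNumber", "?id"),
]
groups = exclusive_groups(query)
```

Both patterns end up in the exclusive group of the `drb` endpoint, matching the merged SERVICE clause above.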
Federation-specific join algorithms
-
Bound join
Variant of nested-loop join grouped by blocks
Multiple iterations of a join grouped in a single query (e.g. using VALUES)
Example of one iteration (block size 16):
Joining (?category) ⋈ (?category, ?drug, ?id)
SELECT ?drug ?id WHERE {
VALUES ?category {
db:drugCategory1 db:drugCategory2 db:drugCategory3 db:drugCategory4
db:drugCategory5 db:drugCategory6 db:drugCategory7 db:drugCategory8
db:drugCategory9 db:drugCategory10 db:drugCategory11 db:drugCategory12
db:drugCategory13 db:drugCategory14 db:drugCategory15 db:drugCategory16
}
?drug db:drugCategory ?category.
?drug db:casRegistryNumber ?id.
}
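The block-wise grouping behind a bound join can be sketched as follows; the category bindings and the block size are illustrative, matching the example above.

```python
def bound_join_queries(bindings, block_size=16):
    """Build one SPARQL query per block of bindings, injecting each block
    via a VALUES clause instead of issuing one request per binding."""
    queries = []
    for i in range(0, len(bindings), block_size):
        block = bindings[i:i + block_size]
        queries.append(
            "SELECT ?drug ?id WHERE {\n"
            f"  VALUES ?category {{ {' '.join(block)} }}\n"
            "  ?drug db:drugCategory ?category.\n"
            "  ?drug db:casRegistryNumber ?id.\n"
            "}"
        )
    return queries

# 20 illustrative category bindings -> 2 queries (blocks of 16 and 4)
cats = [f"db:drugCategory{i}" for i in range(1, 21)]
qs = bound_join_queries(cats)
```

With 20 intermediate bindings, only 2 HTTP requests are needed instead of 20.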
Federation over heterogeneous sources
-
Servers are not only SPARQL endpoints
Other types of Linked Data Fragments: TPF, WiseKG, brTPF, ...
Different levels of server expressivity
-
Clients may have to take up more effort
Executing parts of queries client-side
-
Trade-off between server and client effort
Low-cost publishing and preventing server availability issues
Limitations of federated querying
-
All federation members must be known before execution starts
Source selection distributes query across list of sources
No discovery of new sources
-
Limited scalability in terms of number of endpoints
Current federation techniques scale to the order of 10 sources
-
Public endpoints have restrictions
Timeouts, rate limits, …
Current federation techniques cannot cope with them
Exploit interlinking of documents
-
Linked Data documents are linked to each other
Following the Linked Data principles
-
Query engine can follow links
Start from one document, and discover new documents on the fly
Example: decentralized address book
Example: Find Alice's contact names
SELECT ?name WHERE {
<https://alice.pods.org/profile#me>
foaf:knows ?person.
?person foaf:name ?name.
}
Query process:
- Start from Alice's address book
- Follow links to profiles of Bob and Carol
- Query over union of all profiles
- Find query results:
[ { "name": "Bob" }, { "name": "Carol" } ]
The Link Traversal Query Process
Link queue: stores links to documents that need to be fetched
-
1. Initialize link queue with seed URLs
Seed URLs can be user-defined, or derived from query
-
2. Iterate and append link queue
Iteratively pop head off queue, and follow link to document
Add all URLs in document to link queue
-
3. Execute query
Over union of all discovered RDF triples
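The three steps above can be sketched as a queue-driven loop. This is a minimal sketch: the `WEB` dictionary is a hypothetical stand-in for dereferencing documents over HTTP, mirroring the address-book example.

```python
from collections import deque

# Hypothetical web: each document maps to (its triples, links found in it).
WEB = {
    "https://alice.pods.org/profile": (
        [("alice#me", "foaf:knows", "bob#me"),
         ("alice#me", "foaf:knows", "carol#i")],
        ["https://bob.pods.org/profile", "https://carol.org/"],
    ),
    "https://bob.pods.org/profile": ([("bob#me", "foaf:name", "Bob")], []),
    "https://carol.org/": ([("carol#i", "foaf:name", "Carol")], []),
}

def traverse(seeds):
    """Steps 1-3: seed the link queue, follow links, union the triples."""
    queue = deque(seeds)            # 1. initialize link queue with seed URLs
    seen, triples = set(seeds), []
    while queue:
        url = queue.popleft()       # 2. pop head, dereference the document
        doc_triples, links = WEB.get(url, ([], []))
        triples.extend(doc_triples)
        for link in links:          #    append newly discovered links
            if link not in seen:    #    (no re-adding of handled links)
                seen.add(link)
                queue.append(link)
    return triples                  # 3. query executes over this union

triples = traverse(["https://alice.pods.org/profile"])
```

The `seen` set is what avoids re-adding the link to Alice's profile in step (3) of the walkthrough below.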
Example: Evolution of the Link Queue (1)
Initialize link queue with seed URL(s)
| https://alice.pods.org/profile |
Example: Evolution of the Link Queue (2)
Follow link to head of queue
| https://alice.pods.org/profile |
Example: Evolution of the Link Queue (3)
Push URLs in https://alice.pods.org/profile to queue
| https://alice.pods.org/profile |
| https://bob.pods.org/profile |
| https://carol.org/ |
</profile#me> :knows
<https://bob.pods.org/profile#me>,
<https://carol.org/#i>.
No need to re-add link to https://alice.pods.org/profile
Example: Evolution of the Link Queue (4)
Remove handled link from queue
| https://bob.pods.org/profile |
| https://carol.org/ |
Example: Evolution of the Link Queue (5)
Follow link to head of queue
| https://bob.pods.org/profile |
| https://carol.org/ |
Example: Evolution of the Link Queue (6)
No more URLs found in https://bob.pods.org/profile
| https://bob.pods.org/profile |
| https://carol.org/ |
</profile#me> :name "Bob";
:email <mailto:bob@mail.org>;
:telephone "0499 12 34 56".
mailto:bob@mail.org is not an HTTP(S) link that can be followed
Example: Evolution of the Link Queue (7)
Remove handled link from queue
Example: Evolution of the Link Queue (8)
Follow link to head of queue
Example: Evolution of the Link Queue (9)
No more URLs found in https://carol.org/
</#i> :name "Carol";
:email <mailto:i@carol.org>;
:telephone "0499 11 22 33".
Example: Evolution of the Link Queue (10)
Remove handled link from queue
Link queue is empty
Execute query over union of all discovered RDF triples
Link Traversal has specific challenges
That do not exist in centralized querying
-
Link queue iteration blocks evaluation
Following links is expensive
-
Number of links can become very large
Each document typically contains many links → exponential growth
-
Traditional query planning does not work
No a priori knowledge about cardinalities and other statistics
Link queue iteration blocks evaluation
→ Process query during link queue handling
-
Extension of pipelining approach
Link queue is a component in pipeline
May produce an infinite data stream
Number of links can become very large
→ Pre-filter links before they enter queue
Different filtering strategies exist
-
Follow all
No filtering, follow all possible links
-
Follow none
Follow no links
-
Follow if triple matches query
Follow the links in a triple if that triple matches at least one triple pattern in the query
Follow if triple matches query: example
SELECT ?name WHERE {
<https://alice.pods.org/profile#me> foaf:knows ?person.
?person foaf:name ?name.
}
Contents of https://alice.pods.org/profile:
✅ </profile#me> :knows <https://bob.pods.org/profile#me>.
✅ </profile#me> :knows <https://carol.org/#i>.
❌ </profile#me> :likes <https://bob.pods.org/posts/hello-world>.
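The matching check from this example can be sketched as follows; the `matches` and `links_to_follow` helpers are illustrative, using full IRIs for the terms abbreviated in the slides.

```python
def matches(triple, pattern):
    """A pattern term starting with '?' is a variable and matches anything."""
    return all(p.startswith("?") or p == t for t, p in zip(triple, pattern))

def links_to_follow(triples, patterns):
    """Keep only links (HTTP(S) IRIs) appearing in triples that match
    at least one triple pattern of the query."""
    links = set()
    for triple in triples:
        if any(matches(triple, p) for p in patterns):
            links.update(t for t in triple if t.startswith("http"))
    return links

patterns = [
    ("https://alice.pods.org/profile#me", "foaf:knows", "?person"),
    ("?person", "foaf:name", "?name"),
]
doc = [  # contents of https://alice.pods.org/profile
    ("https://alice.pods.org/profile#me", "foaf:knows",
     "https://bob.pods.org/profile#me"),
    ("https://alice.pods.org/profile#me", "foaf:knows",
     "https://carol.org/#i"),
    ("https://alice.pods.org/profile#me", "foaf:likes",
     "https://bob.pods.org/posts/hello-world"),
]
follow = links_to_follow(doc, patterns)
```

The `foaf:likes` triple matches no pattern, so the link to the blog post never enters the queue.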
Filtering links and query semantics
-
SPARQL queries have universal meaning
Queries can be executed over all (RDF) datasets
-
Query results depend on datasets
Executing same query over different datasets may produce different results
-
Different link filtering strategies lead to different ranges of documents
Query results may differ for different strategies
Results also depend on the seed URLs
Traditional query planning does not work
→ Zero-knowledge query planning
Score-based heuristics for determining triple pattern order in BGPs
-
Selectivity
Triple patterns with fewer variables are prioritized
-
Seed-based
Triple patterns that contain a seed URL are prioritized
-
No ontologies
Triple patterns for well-known ontologies (e.g. rdf:type) are deprioritized
These often produce many results
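The three heuristics above can be combined into a simple scoring function. This is a minimal sketch: the score weights are arbitrary illustrative choices, not a published formula.

```python
def score(pattern, seeds):
    """Lower score = executed earlier. Combines the three heuristics:
    selectivity (fewer variables first), seed-based (seed URLs first),
    and deprioritizing well-known ontology predicates like rdf:type."""
    s = sum(1 for term in pattern if term.startswith("?"))  # selectivity
    if any(term in seeds for term in pattern):
        s -= 1                                              # seed bonus
    if pattern[1] == "rdf:type":
        s += 2                                              # ontology penalty
    return s

def plan(patterns, seeds):
    """Order the triple patterns of a BGP without any statistics."""
    return sorted(patterns, key=lambda p: score(p, seeds))

seeds = {"https://alice.pods.org/profile#me"}
bgp = [
    ("?person", "foaf:name", "?name"),
    ("?person", "rdf:type", "foaf:Person"),
    ("https://alice.pods.org/profile#me", "foaf:knows", "?person"),
]
ordered = plan(bgp, seeds)
```

The seed-anchored `foaf:knows` pattern is executed first, and the `rdf:type` pattern last.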
Link Traversal: too slow for querying over Linked Open Data
-
Massive number of possible links on the open Web
New content is produced faster than you can follow links
-
Inefficient query plans
Traversal engine can not optimize sufficiently due to lack of statistics
Link Traversal becomes feasible with structural assumptions
-
Environments with structural properties
Subsets of the Web that follow specific structures
-
Traversal engines can make additional assumptions
Assumptions about how data is structured and interlinked
→ Guided link traversal
Solid pods follow structural properties
-
Pod contents listed through a REST API
Linked Data Platform
-
WebID profile
User name and link to storage
-
Type index
Type-based resource discovery
Conclusions
-
Query engines can integrate decentralized data
Abstract away complexities of decentralized data
-
Trade-offs between federated querying and link traversal
Federation is faster, but lacks source discovery and is limited in the number of sources
Link traversal currently only works over documents
Need for a hybrid between federation and link traversal?
-
You can become a pioneer in this domain!
IDLab is doing research in these fields
We offer student jobs, master theses, PhD jobs
Query results arrive incrementally
within human attention limits
Single-pod queries are fast, but multi-pod queries can lead to link overload
Current techniques for following cross-pod links are unselective
If pods expose more information, complex querying can become faster
-
When following links, engines can discover optimizations
If pods expose them
-
A pod can guide engines more efficiently towards relevant data
Type indexes, shape trees, shape indexes, …
-
Cross-pod indexes can prune out irrelevant pods
Aggregations or summarization of data across collections of pods
Dangers: staleness, trust, censorship, …