Data is highly decentralized
- Produced by governments, companies, individuals, …
- Stored at different locations across the world
- Consumed through interactive applications, processes, intelligent agents, …
Decentralized data integration is challenging for application developers
-
Discovering data
How can I find data sources? How can I find data within a source?
-
Combining data
How to combine data across different data sources?
-
Preserving privacy
How to not leak sensitive data?
SPARQL processing over centralized data
Centralization not always possible
-
Private data
Technical and legal reasons
-
Evolving data
Requires continuous re-indexing
-
Web scale data
Indexing the whole Web is infeasible (for non-tech-giants)
How to query over decentralized data?
-
Data and query engine are not collocated
Query engine runs on a separate machine
-
Not just one dataset
Data is spread over the Web into multiple data sources
Approaches for querying over decentralized data
-
Federated Query Processing
Distributing query execution across known sources
-
Link Traversal Query Processing
Local query execution over sources that are discovered by following links
Client distributes query over query APIs
-
Clients perform only limited effort
Split up the query, distribute it (source selection), and combine results
-
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets
Federation over SPARQL endpoints
-
Servers are SPARQL endpoints (most common)
They accept any valid SPARQL query
-
Manual source selection
Use SERVICE clauses to define which sources apply to query parts
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
SERVICE <http://example.com/kegg> {
?keggDrug rdf:type kegg:Drug.
?keggDrug bio2rdf:xRef ?id.
?keggDrug purl:title ?title.
}
}
Automatic source selection
-
User may not know how data is distributed across endpoints
Automatically rewrite query + sources in terms of SERVICE clauses
# Source 1: http://example.com/drb
# Source 2: http://example.com/kegg
SELECT ?drug ?title WHERE {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
?keggDrug rdf:type kegg:Drug.
?keggDrug bio2rdf:xRef ?id.
?keggDrug purl:title ?title.
}
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
SERVICE <http://example.com/kegg> {
?keggDrug rdf:type kegg:Drug.
?keggDrug bio2rdf:xRef ?id.
?keggDrug purl:title ?title.
}
}
-
Query as if all triples are merged into a single endpoint
Main federation bottleneck: data transfer
-
Challenging the data locality principle
Data locality: keep data physically close to the place where it's processed
Federated querying leads to a separation of data and processing
-
Optimizations focus on maximizing data locality
Minimize size and number of HTTP requests
Through query planning and intelligent join algorithms
Exhaustive source selection (worst-case)
-
Wrap each triple pattern into UNIONs of SERVICEs
To ensure completeness
# Source 1: http://example.com/drb
# Source 2: http://example.com/kegg
SELECT ?drug ?title WHERE {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
SELECT ?drug ?title WHERE {
{
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
}
} UNION {
SERVICE <http://example.com/kegg> {
?drug db:drugCategory dbc:micronutrient.
}
}
{
SERVICE <http://example.com/drb> {
?drug db:casRegistryNumber ?id.
}
} UNION {
SERVICE <http://example.com/kegg> {
?drug db:casRegistryNumber ?id.
}
}
}
Identifying exclusive groups
-
Exclusive group
Set of triple patterns that can be sent to a single endpoint
-
Identified during query planning
Using ASK or COUNT queries, VoID descriptions, …
-
Exclusive groups are more selective
Reduces number and size of HTTP requests
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
}
SERVICE <http://example.com/drb> {
?drug db:casRegistryNumber ?id.
}
}
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
}
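The ASK-based identification of exclusive groups can be sketched as follows. This is a minimal sketch: the in-memory `SOURCES` table and the `ask()` helper are hypothetical stand-ins for real SPARQL endpoints answering `ASK` queries.

```python
from collections import defaultdict

# Hypothetical stand-ins for SPARQL endpoints: each source maps to the set of
# predicates it can answer (a real engine would send ASK queries over HTTP).
SOURCES = {
    "http://example.com/drb": {"db:drugCategory", "db:casRegistryNumber"},
    "http://example.com/kegg": {"rdf:type", "bio2rdf:xRef", "purl:title"},
}

def ask(source, predicate):
    """Simulate `ASK { ?s <predicate> ?o }` against a source."""
    return predicate in SOURCES[source]

def exclusive_groups(patterns):
    """Group triple patterns by the single source that can answer them.

    A pattern answerable by exactly one source joins that source's exclusive
    group, so the whole group can be shipped in one SERVICE clause.
    """
    groups = defaultdict(list)
    for pattern in patterns:
        matching = [s for s in SOURCES if ask(s, pattern[1])]
        if len(matching) == 1:  # exclusive to one endpoint
            groups[matching[0]].append(pattern)
    return dict(groups)

query = [
    ("?drug", "db:drugCategory", "dbc:micronutrient"),
    ("?drug", "db:casRegistryNumber", "?id"),
]
groups = exclusive_groups(query)
```

Both patterns end up in the exclusive group of the `drb` endpoint, matching the merged SERVICE clause above.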
Federation-specific join algorithms
-
Bound join
Variant of nested-loop join grouped by blocks
Multiple iterations of a join grouped in a single query (e.g. using VALUES)
Example of one iteration (block size 16):
Joining (?category) ⋈ (?category, ?drug, ?id)
SELECT ?drug ?id WHERE {
VALUES ?category {
db:drugCategory1 db:drugCategory2 db:drugCategory3 db:drugCategory4
db:drugCategory5 db:drugCategory6 db:drugCategory7 db:drugCategory8
db:drugCategory9 db:drugCategory10 db:drugCategory11 db:drugCategory12
db:drugCategory13 db:drugCategory14 db:drugCategory15 db:drugCategory16
}
?drug db:drugCategory ?category.
?drug db:casRegistryNumber ?id.
}
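The block-wise grouping behind a bound join can be sketched as follows; the category bindings and the block size are illustrative, matching the example above.

```python
def bound_join_queries(bindings, block_size=16):
    """Build one SPARQL query per block of bindings, injecting each block
    via a VALUES clause instead of issuing one request per binding."""
    queries = []
    for i in range(0, len(bindings), block_size):
        block = bindings[i:i + block_size]
        queries.append(
            "SELECT ?drug ?id WHERE {\n"
            f"  VALUES ?category {{ {' '.join(block)} }}\n"
            "  ?drug db:drugCategory ?category.\n"
            "  ?drug db:casRegistryNumber ?id.\n"
            "}"
        )
    return queries

# 20 illustrative category bindings -> 2 queries (blocks of 16 and 4)
cats = [f"db:drugCategory{i}" for i in range(1, 21)]
qs = bound_join_queries(cats)
```

With 20 intermediate bindings, only 2 HTTP requests are needed instead of 20.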
Federation over heterogeneous sources
-
Servers are not only SPARQL endpoints
Other types of Linked Data Fragments: TPF, WiseKG, brTPF, ...
Different levels of server expressivity
-
Clients may have to take up more effort
Executing parts of queries client-side
-
Trade-off between server and client effort
Low-cost publishing and preventing server availability issues
Limitations of federated querying
-
All federation members must be known before execution starts
Source selection distributes query across list of sources
No discovery of new sources
-
Limited scalability in terms of number of endpoints
Current federation techniques scale to the order of 10 sources
-
Public endpoints have restrictions
Timeouts, rate limits, …
Current federation techniques cannot cope with them
Exploit interlinking of documents
-
Linked Data documents are linked to each other
Following the Linked Data principles
-
Query engine can follow links
Start from one document, and discover new documents on the fly
Example: decentralized address book
Example: Find Alice's contact names
SELECT ?name WHERE {
<https://alice.pods.org/profile#me>
foaf:knows ?person.
?person foaf:name ?name.
}
Query process:
- Start from Alice's address book
- Follow links to profiles of Bob and Carol
- Query over union of all profiles
- Find query results:
[ { "name": "Bob" }, { "name": "Carol" } ]
The Link Traversal Query Process
Link queue: stores links to documents that need to be fetched
-
1. Initialize link queue with seed URLs
Seed URLs can be user-defined, or derived from query
-
2. Iterate and append link queue
Iteratively pop head off queue, and follow link to document
Add all URLs in document to link queue
-
3. Execute query
Over union of all discovered RDF triples
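The three steps above can be sketched as a queue-driven loop. This is a minimal sketch: the `WEB` dictionary is a hypothetical stand-in for dereferencing documents over HTTP, mirroring the address-book example.

```python
from collections import deque

# Hypothetical web: each document maps to (its triples, links found in it).
WEB = {
    "https://alice.pods.org/profile": (
        [("alice#me", "foaf:knows", "bob#me"),
         ("alice#me", "foaf:knows", "carol#i")],
        ["https://bob.pods.org/profile", "https://carol.org/"],
    ),
    "https://bob.pods.org/profile": ([("bob#me", "foaf:name", "Bob")], []),
    "https://carol.org/": ([("carol#i", "foaf:name", "Carol")], []),
}

def traverse(seeds):
    """Steps 1-3: seed the link queue, follow links, union the triples."""
    queue = deque(seeds)            # 1. initialize link queue with seed URLs
    seen, triples = set(seeds), []
    while queue:
        url = queue.popleft()       # 2. pop head, dereference the document
        doc_triples, links = WEB.get(url, ([], []))
        triples.extend(doc_triples)
        for link in links:          #    append newly discovered links
            if link not in seen:    #    (no re-adding of handled links)
                seen.add(link)
                queue.append(link)
    return triples                  # 3. query executes over this union

triples = traverse(["https://alice.pods.org/profile"])
```

The `seen` set is what avoids re-adding the link to Alice's profile in step (3) of the walkthrough below.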
Example: Evolution of the Link Queue (1)
Initialize link queue with seed URL(s)
| https://alice.pods.org/profile |
Example: Evolution of the Link Queue (2)
Follow link to head of queue
| https://alice.pods.org/profile |
Example: Evolution of the Link Queue (3)
Push URLs in https://alice.pods.org/profile to queue
| https://alice.pods.org/profile |
| https://bob.pods.org/profile |
| https://carol.org/ |
</profile#me> :knows
<https://bob.pods.org/profile#me>,
<https://carol.org/#i>.
No need to re-add link to https://alice.pods.org/profile
Example: Evolution of the Link Queue (4)
Remove handled link from queue
| https://bob.pods.org/profile |
| https://carol.org/ |
Example: Evolution of the Link Queue (5)
Follow link to head of queue
| https://bob.pods.org/profile |
| https://carol.org/ |
Example: Evolution of the Link Queue (6)
No more URLs found in https://bob.pods.org/profile
| https://bob.pods.org/profile |
| https://carol.org/ |
</profile#me> :name "Bob";
:email <mailto:bob@mail.org>;
:telephone "0499 12 34 56".
mailto:bob@mail.org is not an HTTP(S) link that can be followed
Example: Evolution of the Link Queue (7)
Remove handled link from queue
Example: Evolution of the Link Queue (8)
Follow link to head of queue
Example: Evolution of the Link Queue (9)
No more URLs found in https://carol.org/
</#i> :name "Carol";
:email <mailto:i@carol.org>;
:telephone "0499 11 22 33".
Example: Evolution of the Link Queue (10)
Remove handled link from queue
Link queue is empty
Execute query over union of all discovered RDF triples
Link Traversal has specific challenges
That do not exist in centralized querying
-
Link queue iteration blocks evaluation
Following links is expensive
-
Number of links can become very large
Each document typically contains many links → exponential growth
-
Traditional query planning does not work
No a priori knowledge about cardinalities and other statistics
Link queue iteration blocks evaluation
→ Process query during link queue handling
-
Extension of pipelining approach
Link queue is a component in pipeline
May produce an infinite data stream
Number of links can become very large
→ Pre-filter links before they enter queue
Different filtering strategies exist
-
Follow all
No filtering, follow all possible links
-
Follow none
Follow no links
-
Follow if triple matches query
Follow the links in a triple if that triple matches at least one triple pattern in the query
Follow if triple matches query: example
SELECT ?name WHERE {
<https://alice.pods.org/profile#me> foaf:knows ?person.
?person foaf:name ?name.
}
Contents of https://alice.pods.org/profile:
✅ </profile#me> :knows <https://bob.pods.org/profile#me>.
✅ </profile#me> :knows <https://carol.org/#i>.
❌ </profile#me> :likes <https://bob.pods.org/posts/hello-world>.
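The matching check from this example can be sketched as follows; the `matches` and `links_to_follow` helpers are illustrative, using full IRIs for the terms abbreviated in the slides.

```python
def matches(triple, pattern):
    """A pattern term starting with '?' is a variable and matches anything."""
    return all(p.startswith("?") or p == t for t, p in zip(triple, pattern))

def links_to_follow(triples, patterns):
    """Keep only links (HTTP(S) IRIs) appearing in triples that match
    at least one triple pattern of the query."""
    links = set()
    for triple in triples:
        if any(matches(triple, p) for p in patterns):
            links.update(t for t in triple if t.startswith("http"))
    return links

patterns = [
    ("https://alice.pods.org/profile#me", "foaf:knows", "?person"),
    ("?person", "foaf:name", "?name"),
]
doc = [  # contents of https://alice.pods.org/profile
    ("https://alice.pods.org/profile#me", "foaf:knows",
     "https://bob.pods.org/profile#me"),
    ("https://alice.pods.org/profile#me", "foaf:knows",
     "https://carol.org/#i"),
    ("https://alice.pods.org/profile#me", "foaf:likes",
     "https://bob.pods.org/posts/hello-world"),
]
follow = links_to_follow(doc, patterns)
```

The `foaf:likes` triple matches no pattern, so the link to the blog post never enters the queue.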
Filtering links and query semantics
-
SPARQL queries have universal meaning
Queries can be executed over all (RDF) datasets
-
Query results depend on datasets
Executing same query over different datasets may produce different results
-
Different link filtering strategies lead to different ranges of documents
Query results may differ for different strategies
Results also depend on the seed URLs
Traditional query planning does not work
→ Zero-knowledge query planning
Score-based heuristics for determining triple pattern order in BGPs
-
Selectivity
Triple patterns with fewer variables are prioritized
-
Seed-based
Triple patterns that contain a seed URL are prioritized
-
No ontologies
Triple patterns for well-known ontologies (e.g. rdf:type) are deprioritized
These often produce many results
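The three heuristics above can be combined into a simple scoring function. This is a minimal sketch: the score weights are arbitrary illustrative choices, not a published formula.

```python
def score(pattern, seeds):
    """Lower score = executed earlier. Combines the three heuristics:
    selectivity (fewer variables first), seed-based (seed URLs first),
    and deprioritizing well-known ontology predicates like rdf:type."""
    s = sum(1 for term in pattern if term.startswith("?"))  # selectivity
    if any(term in seeds for term in pattern):
        s -= 1                                              # seed bonus
    if pattern[1] == "rdf:type":
        s += 2                                              # ontology penalty
    return s

def plan(patterns, seeds):
    """Order the triple patterns of a BGP without any statistics."""
    return sorted(patterns, key=lambda p: score(p, seeds))

seeds = {"https://alice.pods.org/profile#me"}
bgp = [
    ("?person", "foaf:name", "?name"),
    ("?person", "rdf:type", "foaf:Person"),
    ("https://alice.pods.org/profile#me", "foaf:knows", "?person"),
]
ordered = plan(bgp, seeds)
```

The seed-anchored `foaf:knows` pattern is executed first, and the `rdf:type` pattern last.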
Link Traversal: too slow for querying over Linked Open Data
-
Massive number of possible links on the open Web
New content is produced faster than you can follow links
-
Inefficient query plans
Traversal engine can not optimize sufficiently due to lack of statistics
Link Traversal becomes feasible with structural assumptions
-
Environments with structural properties
Subsets of the Web that follow specific structures
-
Traversal engines can make additional assumptions
Assumptions about how data is structured and interlinked
→ Guided link traversal
Solid pods follow structural properties
-
Pod contents listed through a REST API
Linked Data Platform
-
WebID profile
User name and link to storage
-
Type index
Type-based resource discovery
Conclusions
-
Query engines can integrate decentralized data
Abstract away complexities of decentralized data
-
Trade-offs between federated querying and link traversal
Federation is faster, but lacks source discovery and is limited in the number of sources
Link traversal currently only works over documents
Need for a hybrid between federation and link traversal?
-
You can become a pioneer in this domain!
IDLab is doing research in these fields
We offer student jobs, master theses, PhD jobs
Query results arrive incrementally
within human attention limits
Single-pod queries are fast, but multi-pod queries can lead to link overload
Current techniques for following cross-pod links are unselective
If pods expose more information, complex querying can become faster
-
When following links, engines can discover optimizations
If pods expose them
-
A pod can guide engines more efficiently towards relevant data
Type indexes, shape trees, shape indexes, …
-
Cross-pod indexes can prune out irrelevant pods
Aggregations or summarization of data across collections of pods
Dangers: staleness, trust, censorship, …