Data is highly decentralized
- Produced by governments, companies, individuals, …
- Stored at different locations across the world
- Consumed through interactive applications, processes, intelligent agents, …
Decentralized data integration is challenging for user-facing app developers
-
Discovering data
How can I find data sources? How can I find data within a source?
-
Combining data
How to combine data across different data sources?
-
Preserving privacy
How to avoid leaking sensitive data?
Query processing over centralized data
Centralization not always possible
-
Private data
Technical and legal reasons
-
Evolving data
Requires continuous re-indexing
-
Web scale data
Indexing the whole Web is infeasible (for non-tech-giants)
How to query over decentralized data?
-
Data and query engine are not collocated
Query engine runs on a separate machine
-
Not just one dataset
Data is spread over the Web into multiple documents
Approaches for querying over decentralized data
-
Federated Query Processing
Distributing query execution across known sources
-
Link Traversal Query Processing
Local query execution over sources that are discovered by following links
Federation distributes queries over APIs
-
Clients perform limited effort
Split up the query, distribute it across sources (source selection), and combine the results (see the sketch below)
-
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets
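To make this split concrete, below is a minimal sketch of a SPARQL 1.1 federated query; the SERVICE keyword and the SPARQL Protocol are standard, but the endpoint URLs are hypothetical placeholders. Federation engines such as FedX or Comunica can also perform source selection automatically, without hand-written SERVICE clauses.

```typescript
// Sketch: a SPARQL 1.1 federated query. Each SERVICE block is evaluated
// against one (hypothetical) endpoint; the federating engine joins the
// partial results and returns the combined bindings to the client.
const federatedQuery = `
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?person ?name WHERE {
    SERVICE <https://endpoint-a.example.org/sparql> { ?person a foaf:Person. }
    SERVICE <https://endpoint-b.example.org/sparql> { ?person foaf:name ?name. }
  }`;

// The query is sent over the SPARQL Protocol to a federation-capable
// endpoint, or evaluated directly by a client-side federation engine.
console.log(federatedQuery);
```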
Link traversal follows linked documents
-
Documents are linked to each other
Following the Linked Data principles
-
Query engine can follow links
Start from one document, and discover new documents on the fly
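As a sketch of link traversal in code, the snippet below assumes the experimental @comunica/query-sparql-link-traversal package (its API mirrors the main Comunica engine); the seed document and profile URIs are hypothetical.

```typescript
// Sketch of link-traversal query execution, assuming the experimental
// @comunica/query-sparql-link-traversal package. The seed URL is hypothetical.
import { QueryEngine } from '@comunica/query-sparql-link-traversal';

async function main(): Promise<void> {
  const engine = new QueryEngine();
  // Start from one seed document; the engine dereferences it, and follows
  // links found in retrieved documents to discover new sources on the fly.
  const bindingsStream = await engine.queryBindings(`
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?friendName WHERE {
      <https://alice.example.org/profile#me> foaf:knows ?friend.
      ?friend foaf:name ?friendName.
    }`, {
    sources: ['https://alice.example.org/profile'],
    lenient: true, // ignore documents that fail to download or parse
  });

  bindingsStream.on('data', (bindings) => {
    console.log(bindings.get('friendName')?.value);
  });
}

main().catch(console.error);
```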
Limitations of querying approaches
-
All federation members must be known before execution starts
Source selection distributes the query across a fixed list of sources
No discovery of new sources
-
Limited scalability of federation in terms of number of sources
Current federation techniques scale to the order of 10 sources
-
Link traversal can be too slow in practice
Too many links are followed for complex queries
Focus on Knowledge Graphs
Publishing Knowledge Graphs as SPARQL Endpoints
SPARQL endpoint: an HTTP API that accepts SPARQL queries and replies with their results (see the protocol sketch below).
-
Most popular way to publish Knowledge Graphs
Alternatives are data dumps and Linked Data Documents
-
Very powerful
Very complex queries can be formulated with SPARQL
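As a sketch of what such an API looks like on the wire: the SPARQL Protocol lets clients send a query over HTTP and receive results in a standard format (the endpoint URL below is a hypothetical placeholder).

```typescript
// Sketch: querying a SPARQL endpoint via the SPARQL Protocol (HTTP GET).
// The endpoint URL is a hypothetical placeholder.
const endpoint = 'https://example.org/sparql';
const query = 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10';

async function querySparqlEndpoint(): Promise<void> {
  const response = await fetch(`${endpoint}?query=${encodeURIComponent(query)}`, {
    headers: { Accept: 'application/sparql-results+json' },
  });
  const json = await response.json();
  // The SPARQL JSON results format contains one object per solution mapping.
  for (const row of json.results.bindings) {
    console.log(row.s.value, row.p.value, row.o.value);
  }
}

querySparqlEndpoint().catch(console.error);
```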
SPARQL endpoints have low availability
Public endpoints have an availability of 95%.
→ ~1.5 days downtime per month!
Vandenbussche, Pierre-Yves, et al. "SPARQLES: Monitoring public SPARQL endpoints." Semantic Web 8.6 (2017): 1049-1065.
-
SPARQL query expressivity is high
Some queries can be computationally intensive
-
Publicly accessible
An unbounded number of clients can send queries
SPARQL endpoints have restrictions
To counter availability issues
-
Timeouts
Limit execution time per query
-
Rate limits
Limit number of queries a client can send
→ Limits the types and number of queries that can be executed!
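On the client side, these restrictions typically surface as aborted requests or HTTP 429 responses; below is a hedged sketch of how a client can respect them (the time budget and retry count are arbitrary illustrative choices).

```typescript
// Sketch: dealing with endpoint restrictions on the client side.
// HTTP 429 (Too Many Requests) and the Retry-After header are standard HTTP;
// the 60-second budget and 3 retries are arbitrary illustrative choices.
async function fetchWithRestrictions(url: string): Promise<Response> {
  for (let attempt = 0; attempt < 3; attempt++) {
    // Mirror the server-side timeout with a local time budget per request.
    const response = await fetch(url, { signal: AbortSignal.timeout(60_000) });
    if (response.status !== 429) {
      return response;
    }
    // Rate limited: wait for the period announced by the server, then retry.
    const retryAfterSeconds = Number(response.headers.get('Retry-After') ?? '10');
    await new Promise((resolve) => setTimeout(resolve, retryAfterSeconds * 1000));
  }
  throw new Error('Rate limit still exceeded after retries');
}
```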
Alternatives to SPARQL endpoints
To limit expressivity
Linked Data Fragments (LDF): an axis of Web APIs that investigates the trade-off between server and client effort for query execution (see the TPF sketch below).
Verborgh, Ruben, et al. "Triple Pattern Fragments: a low-cost knowledge graph interface for the Web." Journal of Web Semantics 37 (2016): 184-206.
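Triple Pattern Fragments (TPF) is the best-known point on this axis: the server only answers single triple-pattern requests, and the client-side engine decomposes SPARQL queries into such requests and joins the results locally. Below is a hedged sketch of one fragment request; the dataset URL is hypothetical, and real clients discover the request parameters from the fragment's hypermedia controls rather than hard-coding them.

```typescript
// Sketch: fetching one Triple Pattern Fragment. The server returns the triples
// matching a single pattern, plus count metadata and hypermedia controls
// (paging, search form). The dataset URL is hypothetical.
const fragmentUrl = new URL('https://fragments.example.org/dataset');
fragmentUrl.searchParams.set('predicate', 'http://xmlns.com/foaf/0.1/name');

async function fetchFragment(): Promise<string> {
  const response = await fetch(fragmentUrl, { headers: { Accept: 'text/turtle' } });
  return response.text(); // matching triples + metadata, one page at a time
}

fetchFragment().then(console.log).catch(console.error);
```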
LDF interfaces complicate federation
→ Federation engines must combine data across heterogeneous APIs
Personal data
Decentralization initiatives offer users full control over where data is stored and who can access it
How to query protected data?
-
Privacy preservation
Ensure only authorized agents can access data.
Trusted servers? Client-side? Combinations?
-
Trustworthiness
Allow inspection of how and where data was combined
→ Lack of understanding of how to do this for decentralized data!
Solid as decentralized environment
-
Set of standards to decentralize data
Linked Data, LDP, Authentication, …
-
Solid pods decouple data storage from apps
Users are in full control over their pod
-
Foundation for Linked Web Storage
Being standardized in the W3C Linked Web Storage Working Group
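As a sketch of what this looks like in practice: a public Solid pod resource is a Linked Data document behind plain HTTP (the pod URL below is hypothetical; private resources additionally require Solid-OIDC authentication, which is omitted here).

```typescript
// Sketch: reading a public resource from a (hypothetical) Solid pod.
// Pod resources are exposed as LDP resources over HTTP, typically as RDF;
// authentication for private resources (Solid-OIDC) is omitted in this sketch.
async function readPodResource(): Promise<string> {
  const response = await fetch('https://alice.solidpod.example/public/profile', {
    headers: { Accept: 'text/turtle' },
  });
  if (!response.ok) {
    throw new Error(`Could not read resource: HTTP ${response.status}`);
  }
  return response.text(); // an RDF document describing the resource
}

readPodResource().then(console.log).catch(console.error);
```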
We develop the Comunica framework
-
Abstraction layer to query decentralized data
Federation, link traversal, Solid, …
SPARQL, GraphQL-LD, …
-
Modular query engine framework
Flexible experimentation ground for research (link traversal, federation, …)
-
Open-source (MIT)
Used for research and beyond
Used by 1,500+ open-source projects, 380,000+ monthly downloads
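A minimal usage sketch of Comunica's query API, assuming the @comunica/query-sparql package; the source URLs are hypothetical, and the engine detects each source's type (SPARQL endpoint, TPF interface, plain RDF document, …) before federating over them.

```typescript
// Sketch of Comunica's query API, assuming the @comunica/query-sparql package.
// Source URLs are hypothetical; Comunica auto-detects each source's type
// (SPARQL endpoint, TPF interface, plain RDF document, …) and federates over them.
import { QueryEngine } from '@comunica/query-sparql';

async function main(): Promise<void> {
  const engine = new QueryEngine();
  const bindingsStream = await engine.queryBindings(
    'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10',
    {
      sources: [
        'https://endpoint.example.org/sparql',           // SPARQL endpoint
        'https://fragments.example.org/dataset',         // Triple Pattern Fragments
        'https://alice.solidpod.example/public/profile', // plain Linked Data document
      ],
    },
  );
  const bindings = await bindingsStream.toArray();
  console.log(`${bindings.length} results`);
}

main().catch(console.error);
```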
The future of RDF and SPARQL
-
Contributing to the RDF and SPARQL W3C working group
Towards RDF and SPARQL 1.2
-
And towards SPARQL 1.3
Better list support, lateral joins, rate limit announcement, …
LDES enables data silo synchronization
-
Linked Data Event Streams
A low-cost Linked Data API for replication and synchronization across datasets
-
For Linked Data publishers
Recommended by Flanders, SEMIC (EU), …
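A hedged sketch of what an LDES looks like on the wire: an ldes:EventStream whose members are immutable, timestamped objects, shown here as a Turtle string (all identifiers are hypothetical).

```typescript
// Sketch: the shape of a Linked Data Event Stream, serialized as Turtle.
// All identifiers are hypothetical. Clients replicate a dataset by fetching
// the stream once, and synchronize by polling for newly added members.
const ldesExample = `
@prefix ldes: <https://w3id.org/ldes#> .
@prefix tree: <https://w3id.org/tree#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<https://data.example.org/stream> a ldes:EventStream ;
    ldes:timestampPath dct:created ;
    tree:member <https://data.example.org/observations/1> .

<https://data.example.org/observations/1>
    dct:created "2024-01-01T10:00:00Z"^^xsd:dateTime .
`;

console.log(ldesExample);
```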
From access to usage control with ODRL
-
ODRL: Open Digital Rights Language
Policy expression language
-
Usage control
Traditional access control mechanisms are too limited
→ Permissions, prohibitions, duties
Retain data for max 10 days, use data only for medical research, …
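A hedged sketch of the "use data only for medical research" example as an ODRL policy in JSON-LD; the identifiers are hypothetical, and real deployments typically pin down purposes and duties in an ODRL profile.

```typescript
// Sketch: an ODRL policy (JSON-LD) that permits using a dataset only for a
// given purpose. All identifiers are hypothetical; duties such as
// "delete after 10 days" would be modelled similarly via an ODRL profile.
const policy = {
  '@context': 'http://www.w3.org/ns/odrl.jsonld',
  '@type': 'Set',
  uid: 'https://policies.example.org/medical-research-only',
  permission: [{
    target: 'https://data.example.org/health-records',
    action: 'use',
    constraint: [{
      leftOperand: 'purpose',
      operator: 'eq',
      rightOperand: 'https://vocab.example.org/MedicalResearch',
    }],
  }],
};

console.log(JSON.stringify(policy, null, 2));
```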
Conclusion: Query-driven data integration for decentralization
-
Knowledge Graph technologies
Universal semantics to interlink data across organizational boundaries
-
Need for intelligent query engines
Abstract away complexities of decentralized data
-
Querying over decentralized data → research challenges
Open to collaborations!