Storing data as Knowledge Graphs
1989, CERN Switzerland
Tim Berners-Lee
inventor of the World Wide Web
The Web's foundational ideas
-
Data items are linked to each other
Across heterogeneous systems (decentralized)
-
Browser programs
Visualise data and help users navigate the Web
-
Everyone can read data
Using Web browsers
-
Open standards
Anyone can implement tools (browsers, ...) on top of it
1990-... World-wide adoption
Not just for researchers anymore
The Web is a global information space
a.k.a. The World Wide Web (WWW)
Mostly used by humans through Web browsers
The Web is focused on humans
-
Web pages show information
Visualized using Web browsers
-
Clicking on links
To discover new information
-
Search for information
Using search engines such as Google, Bing, ...
Achieving tasks requires manual effort
-
Will it rain next week?
- Find a weather prediction website
- Select your location
- Navigate to next week
-
Book a trip for a group of people
- Comparing agendas
- Comparing interests: nature, museums, ...
- Regional temperatures
- Geopolitical status
- ...
What if intelligent agents
could do these tasks for us?
I Have a Dream
A Web where intelligent
agents are able to achieve
our day-to-day tasks.
2001
The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.
The dream: Getting there step by step
2015: Agents are being introduced that can perform simple tasks
Google Assistant, Alexa, Siri, ...
How do these agents work?
→ Execute Queries over Knowledge Graphs
Query = Structured Question
Knowledge Graph = Structured information
→ Structured questions over structured information
A Knowledge Graph is
a collection of structured information
Semantic Web technologies
-
RDF
Standard for representing and exchanging graph data
Knowledge Graphs are collections of such graph data
-
SPARQL
Standard for querying over RDF data
Generative AI models
2022: LLMs offer human-like conversations based on crawled Web data
Differences from Knowledge Graphs:
- Adaptive understanding of unstructured data
- Inconsistent answers and hallucinations
- Black box
RDF enables data exchange on the Web
-
RDF
Resource Description Framework
-
RDF 1.0: W3C Recommendation since 1999
-
RDF 1.1: W3C Recommendation since 2014
-
RDF 1.2: upcoming in 2025
Facts are represented as RDF triples
A triple consists of a subject, a predicate, and an object.
Multiple RDF triples form RDF datasets:
:Alice :knows :Bob .
:Alice :knows :Carol .
:Alice :name "Alice" .
:Bob :name "Bob" .
:Bob :knows :Carol .
RDF Resources are identified by IRIs
https://example.org/Alice
-
IRI
Internationalized Resource Identifier
-
Global identifiers
Can be used by different data sources → interlinking!
-
IRIs can be URLs
A URL is an IRI that can be dereferenced (e.g. looked up in Web browsers).
Allows additional information to be looked up for a resource.
Multiple syntaxes exist for RDF
Turtle
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
dbr:Alice foaf:knows dbr:Bob .
dbr:Alice foaf:name "Alice" .
JSON-LD
{
  "@context": {
    "dbr": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/"
  },
  "@id": "dbr:Alice",
  "foaf:knows": { "@id": "dbr:Bob" },
  "foaf:name": "Alice"
}
SPARQL querying over RDF datasets
- SPARQL: language to read and update RDF datasets via declarative queries.
-
Different query forms:
SELECT: select values in tabular form → focus of this presentation
CONSTRUCT: construct new triples
ASK: check whether data exists (see the example below)
DESCRIBE: describe a given resource
INSERT: insert new triples (SPARQL Update)
DELETE: delete existing triples (SPARQL Update)
Specification: https://www.w3.org/TR/sparql-query/
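As a small illustration, an ASK query over the :Alice dataset from earlier returns a boolean (a minimal sketch reusing that snippet's default prefix):
ASK WHERE {
  :Alice :knows :Bob .
}
This yields true, since the dataset contains a matching triple.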
Find all artists born in York
name                   | deathDate
Albert Joseph Moore    |
Charles Francis Hansom | 1888
David Reed (comedian)  |
Dustin Gee             |
E Ridsdale Tate        | 1922
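A query along the following lines could produce this table (a sketch assuming the DBpedia vocabulary used later in this presentation; the original query is not shown):
SELECT ?name ?deathDate WHERE {
  ?person a dbpedia-owl:Artist ;
          rdfs:label ?name ;
          dbpedia-owl:birthPlace dbr:York .
  OPTIONAL { ?person dbpedia-owl:deathDate ?deathDate . }
}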
How do query engines process a query?
RDF dataset + SPARQL query
↓
SPARQL query processing
↓
query results
Basic Graph Patterns enable graph pattern matching
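For example, over the :Alice dataset from earlier, the following Basic Graph Pattern (a minimal illustration reusing that snippet's default prefix) only matches where both triple patterns hold for the same bindings:
SELECT ?x ?name WHERE {
  ?x :knows ?y .
  ?y :name ?name .
}
On that dataset, this yields a single solution mapping: ?x → :Alice, ?name → "Bob" (via ?y → :Bob); :Carol is known but has no :name triple to match.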
Query results representation
-
Solution Mapping
A mapping from a set of variables to a set of RDF terms.
-
Solution Sequence
A list of solution mappings.
1 solution sequence with 3 solution mappings:
name              | birthplace
Bob Brockmann     | http://dbpedia.org/resource/Louisiana
Bennie Nawahi     | http://dbpedia.org/resource/Honolulu
Weird Al Yankovic | http://dbpedia.org/resource/Downey,_California
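On the wire, such a solution sequence is commonly serialized in the standard SPARQL Query Results JSON Format; the first mapping above would look like this:
{
  "head": { "vars": ["name", "birthplace"] },
  "results": {
    "bindings": [
      {
        "name": { "type": "literal", "value": "Bob Brockmann" },
        "birthplace": { "type": "uri", "value": "http://dbpedia.org/resource/Louisiana" }
      }
    ]
  }
}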
Steps in SPARQL query processing
-
1. Parsing
Transform a SPARQL query string into an algebra expression
-
2. Optimization
Transform algebra expression into a query plan
-
3. Evaluation
Execute the query plan to obtain query results
Publishing Knowledge Graphs as SPARQL Endpoints
SPARQL endpoint: an API that accepts SPARQL queries and replies with results.
-
Most popular way to publish Knowledge Graphs
Alternatives are data dumps and Linked Data Documents
-
Very powerful
Very complex queries can be formulated with SPARQL
-
Power comes with a cost
SPARQL endpoints can require very powerful servers
Parsing: transform query string into SPARQL Algebra
-
SPARQL Algebra = Abstract Syntax Tree (AST)
Algebraic data structure that represents the query structure
Tree where each node is a separate operator: join, BGP, filter, ...
Easier for engines to work with than strings
Enables algebraic query optimizations
A.k.a. query plan → basis for query planning
SPARQL Algebra example
Input:
SELECT ?x ?y ?z WHERE {
  ?x ?y ?z
}
Output:
{
  "type": "project",
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": { "termType": "Variable", "value": "x" },
        "predicate": { "termType": "Variable", "value": "y" },
        "object": { "termType": "Variable", "value": "z" }
      }
    ]
  },
  "variables": [
    { "termType": "Variable", "value": "x" },
    { "termType": "Variable", "value": "y" },
    { "termType": "Variable", "value": "z" }
  ]
}
Why is optimization needed?
-
Queries are declarative
They say what needs to be done, not how.
Transformation into executable code can happen in many different ways.
-
Factors that influence performance
Dataset
Order of operators in AST
Available operator algorithm implementations
Available data indexes
Different levels of optimization
-
Algebraic optimization
Independent of underlying data
Based on formally provable equivalence of operations
-
Low-level optimization
Based on the underlying data
Select and configure specific algorithms of algebra operators
Algebraic optimization
-
Data-independent
Independent of the underlying dataset
Based on heuristics
-
Preserves evaluation semantics
The optimized algebra produces the same results as the unoptimized algebra
-
Trade-off between optimization time and execution time
Algebraic optimization can take a lot of time,
sometimes even more than executing the sub-optimal algebra.
Example: ORDER BY + DISTINCT
-
ORDER BY: sort solution mappings by the given variable
-
DISTINCT: remove duplicate solution mappings
-
SPARQL requires DISTINCT to happen after ORDER BY
-
Performance issues when sorting many non-distinct solution mappings
-
Optimization: apply DISTINCT before ORDER BY
Input:
{
  "type": "distinct",
  "input": {
    "type": "orderby",
    "variable": "x",
    "input": { ... }
  }
}
Output:
{
  "type": "orderby",
  "variable": "x",
  "input": {
    "type": "distinct",
    "input": { ... }
  }
}
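The rewrite itself is a small transformation on the algebra tree. A minimal sketch in TypeScript over the JSON shape above (the Op type and function name are illustrative, not a specific engine's API):
// Simplified algebra node; real engines use richer representations.
interface Op {
  type: string;
  input?: Op;
  variable?: string;
}

// If a "distinct" directly wraps an "orderby", swap them so that
// deduplication happens before the (now cheaper) sort.
function pushDistinctBelowOrderBy(op: Op): Op {
  if (op.type === "distinct" && op.input?.type === "orderby") {
    const orderby = op.input;
    return { ...orderby, input: { type: "distinct", input: orderby.input } };
  }
  return op; // rule does not apply: leave the plan unchanged
}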
Low-level optimization
-
Data-dependent
Based on the underlying dataset, statistics, and derived indexes.
-
Minimize computational complexity
Reduce the number of intermediary solution mappings.
-
Select and configure specific algorithms of algebra operators
Different algorithms behave better in certain situations.
Example: joining in BGPs
How, and in what order, should triple patterns be evaluated?
1. ?person a dbpedia-owl:Artist.
2. ?person rdfs:label ?name.
3. ?person dbpedia-owl:birthPlace ?birthPlace.
Different factors influence join order
-
Cardinality of each pattern
Can be exact, or an estimate
-
Complexity of join algorithms
Based on cardinalities of patterns
-
Precomputing steps of join algorithms
Precomputation may be too expensive for low cardinalities
-
Restrictions of join algorithms
Some algorithms only work in certain cases
Obtain cardinality of each pattern
1. ?person a dbpedia-owl:Artist. → 200
2. ?person rdfs:label ?name. → 10,000
3. ?person dbpedia-owl:birthPlace ?birthPlace. → 1,000
Can be pre-computed, or estimated at runtime.
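For instance, an exact cardinality for the first pattern could be obtained with a COUNT query (one possible approach, reusing the prefixes from the patterns above):
SELECT (COUNT(*) AS ?cardinality) WHERE {
  ?person a dbpedia-owl:Artist.
}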
Select appropriate join algorithms
Joins are usually performed pairwise, over two solution sequences of sizes n and m
-
Nested-loop join
Double for-loop → O(n*m)
Works well for small sequences due to the lack of preprocessing.
-
Hash join
Precomputes a hash table over all mappings in one solution sequence.
Iterates over the other solution sequence with lookups into the hash table → O(n+m) (see the sketch after this list)
Not possible with OPTIONAL
-
Sort-merge join
Sort solution sequences.
Iterates over both solution sequences in order → O(n+m)
Sorting can become expensive with larger solution sequences
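A minimal hash join sketch in TypeScript (assumptions: a solution mapping is a Map from variable names to terms, and the two sequences share exactly one join variable):
type Mapping = Map<string, string>;

function hashJoin(left: Mapping[], right: Mapping[], joinVar: string): Mapping[] {
  // Build phase: index the left sequence by the join variable's value → O(n).
  const index = new Map<string, Mapping[]>();
  for (const mapping of left) {
    const key = mapping.get(joinVar)!;
    if (!index.has(key)) index.set(key, []);
    index.get(key)!.push(mapping);
  }
  // Probe phase: one hash lookup per mapping in the right sequence → O(m).
  const results: Mapping[] = [];
  for (const mapping of right) {
    for (const match of index.get(mapping.get(joinVar)!) ?? []) {
      results.push(new Map([...match, ...mapping])); // merge compatible mappings
    }
  }
  return results;
}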
Join orders and algorithms influence worst-case complexity
-
All nested-loop joins
((1 ⋈ 2) ⋈ 3) = ((200 * 10,000) * 1,000) = 2,000,000,000
-
All hash joins
((1 ⋈ 2) ⋈ 3) = ((200 + 10,000) + 1,000) = 11,200
-
Nested-loop join and hash join
((1 ⋈ 3) ⋈ 2) = ((200 * 1,000) + 10,000) = 210,000
→ Restrictions and preprocessing steps (sorting, hashing, ...) not yet taken into account
→ Trade-off between different factors that influence join order
Evaluating the optimized query plan
-
Bottom-up
Start with leaves in query plan, and work up to the root.
-
Parallelisation
Evaluate independent branches in parallel on different threads.
-
Pipelining
Reduce time to first result.
-
Adaptive processing
Advanced: adaptively modify the query plan at runtime.
Evaluation iteratively handles leaf nodes
Based on operator type
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      { ...solution sequence... },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      { ...solution sequence... },
      { ...solution sequence... }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": { ...solution sequence... }
}
Evaluation iteratively handles leaf nodes
...solution sequence...
Some operations can be parallelised
-
Operators with multiple independent inputs
UNION branches are fully independent
Triple patterns in BGPs can be handled independently based on the algorithm
Some joins can be partially handled independently (e.g. sorting step in sort-merge join)
-
Local threads vs. distributed machines
Communication cost may be high → parallelising not always beneficial
-
I/O vs. CPU cost
Depending on how data is stored, I/O may be the bottleneck
E.g. remotely stored data → HTTP is bottleneck instead of CPU
Pipelining produces results as soon as possible
-
Evaluating operators one after the other is sub-optimal
Requires intermediary results to be kept in memory
Results only become available once everything is done processing
-
Pipelining enables intermediary results to flow through operators
In general, intermediary results do not have to be kept in memory
Query results are produced as a stream, as soon as they are available
A pipeline of operators
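A minimal sketch of pipelining in TypeScript, using generators: each operator consumes and produces an iterator of solution mappings, so the first result can appear before the full input has been processed (simplified illustrative operators, not a specific engine's API):
type Mapping = Map<string, string>;

// DISTINCT: emits each unique mapping as soon as it is first seen.
function* distinct(input: Iterable<Mapping>): Generator<Mapping> {
  const seen = new Set<string>();
  for (const mapping of input) {
    const key = JSON.stringify([...mapping.entries()].sort());
    if (!seen.has(key)) {
      seen.add(key);
      yield mapping; // flows to the next operator immediately
    }
  }
}

// PROJECT: keeps only the requested variables of each mapping.
function* project(input: Iterable<Mapping>, vars: string[]): Generator<Mapping> {
  for (const mapping of input) {
    const projected = new Map<string, string>();
    for (const v of vars) {
      if (mapping.has(v)) projected.set(v, mapping.get(v)!);
    }
    yield projected;
  }
}

// Operators compose into a pipeline, e.g.:
// for (const result of project(distinct(bgpResults), ["name"])) { ... }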
Adaptive processing responds to changing conditions
-
Query optimization as preprocessing is restrictive
Assumes that all information is known beforehand
Cannot cope with volatile distributed environments
-
Adaptive query processing: optimization during evaluation
Modify query plan based on network or data changes
Algorithms are typically more complex
SPARQL processing over centralized data
-
Dataset is collocated with query engine
All data is known beforehand
-
Single dataset
Combining multiple datasets is hard
Centralization not always possible
-
Private data
Technical and legal reasons
-
Evolving data
Requires continuous re-indexing
-
Web scale data
Indexing the whole Web is infeasible (for non-tech-giants)
How to query over decentralized data?
-
Data and query engine are not collocated
Query engine runs on a separate machine
-
Not just one dataset
Data is spread over the Web into multiple documents
Approaches for querying over decentralized data
-
Federated Query Processing
Distributing query execution across known sources
-
Link Traversal Query Processing
Local query execution over sources that are discovered by following links
Client distributes query over query APIs
-
Clients perform limited effort
Split up the query, distribute it (source selection), and combine the results
-
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets
Federation over SPARQL endpoints
-
Servers are SPARQL endpoints (most common)
They accept any valid SPARQL query
-
Client-side source selection
Rewrite query in terms of SERVICE clauses
SELECT ?drug ?title WHERE {
  ?drug db:drugCategory dbc:micronutrient.
  ?drug db:casRegistryNumber ?id.
  ?keggDrug rdf:type kegg:Drug.
  ?keggDrug bio2rdf:xRef ?id.
  ?keggDrug purl:title ?title.
}
SELECT ?drug ?title WHERE {
  SERVICE <http://example.com/drb> {
    ?drug db:drugCategory dbc:micronutrient.
    ?drug db:casRegistryNumber ?id.
  }
  SERVICE <http://example.com/kegg> {
    ?keggDrug rdf:type kegg:Drug.
    ?keggDrug bio2rdf:xRef ?id.
    ?keggDrug purl:title ?title.
  }
}
Federation over heterogeneous sources
-
Servers are not only SPARQL endpoints
Other types of Linked Data Fragments: TPF, WiseKG, brTPF, ...
Different levels of server expressivity
-
Clients may have to take up more effort
Executing parts of queries client-side
-
Trade-off between server and client effort
Low-cost publishing and preventing server availability issues
Limitations of federated querying
-
All federation members must be known before execution starts
Source selection distributes query across list of sources
No discovery of new sources
-
Limited scalability in terms of number of endpoints
Current federation techniques scale to the order of 10 sources
Exploit interlinking of documents
-
Linked Data documents are linked to each other
Following the Linked Data principles
-
Query engine can follow links
Start from one document, and discover new documents on the fly
Link Traversal-based Query Processing
= Querying by following links between documents
Example: decentralized address book
Example: Find Alice's contact names
SELECT ?name WHERE {
  <https://alice.pods.org/profile#me> foaf:knows ?person.
  ?person foaf:name ?name.
}
Query process:
- Start from Alice's address book
- Follow links to profiles of Bob and Carol
- Query over union of all profiles
- Find query results:
[ { "name": "Bob" }, { "name": "Carol" } ]
The Link Traversal Query Process
Link queue: stores links to documents that need to be fetched
-
1. Initialize link queue with seed URLs
Seed URLs can be user-defined, or derived from query
-
2. Iteratively process and append to the link queue
Pop the head off the queue, and follow the link to its document
Add all URLs in the document to the link queue (see the sketch after these steps)
-
3. Execute query
Over the union of all discovered RDF triples
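The loop above can be sketched in TypeScript as follows (fetchDocument is an assumed helper that dereferences a URL into the document's triples and links):
interface Triple { subject: string; predicate: string; object: string; }
declare function fetchDocument(url: string):
  Promise<{ triples: Triple[]; links: string[] }>;

async function traverse(seedUrls: string[]): Promise<Triple[]> {
  const queue = [...seedUrls];        // 1. initialize the link queue with seed URLs
  const visited = new Set<string>();
  const triples: Triple[] = [];
  while (queue.length > 0) {          // 2. iteratively pop the head off the queue
    const url = queue.shift()!;
    if (visited.has(url)) continue;   // skip documents we already fetched
    visited.add(url);
    const document = await fetchDocument(url);
    triples.push(...document.triples);
    queue.push(...document.links);    // discovered links are appended to the queue
  }
  return triples;                     // 3. the query executes over this union
}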
Link Traversal is unsuitable in practice
-
Link queue takes a long time to become empty
It keeps growing due to the massive size of the Web
Query process is delayed!
-
Too many links are followed
Each document typically contains many links → exponential growth
-
Existing query planning algorithms don't work
There is no a priori knowledge about cardinalities and other statistics
Decentralized querying has open problems
-
Link Traversal is suitable for querying decentralized environments
Copes with limitations of federated querying
Relatively new research area → many open problems
-
Main bottleneck is number of links
Future research should focus on guidance towards query-relevant links
Notable SPARQL engines
-
SPARQL endpoint as a service
-
Open-source engine in Java
-
Open-source engine in C
-
Open-source engine in TypeScript, which focuses on decentralized querying
Example Federated queries
Requires prior knowledge of sources.
-
Joins Uniprot and Rhea
-
Joins Uniprot, Rhea, and ChEMBL
Example Link Traversal queries
Discover sources on the fly.
-
Queries over real-world and synthetic environments
Conclusions
-
Knowledge Graphs use (Semantic) Web technologies
RDF, SPARQL, …
-
Knowledge Graphs can be interlinked
Thanks to RDF's global identifiers
-
Federated Querying and Link Traversal
Combining data across multiple Knowledge Graphs
Complex queries can be slow