Storing data as Knowledge Graphs
1989, CERN Switzerland
Tim Berners-Lee
inventor of the World Wide Web
The Web's foundational ideas
-
Data items are linked to each other
Across heterogeneous systems (decentralized)
-
Browser programs
Visualise data and help users navigate the Web
-
Everyone can read data
Using Web browsers
-
Open standards
Anyone can implement tools (browsers, ...) on top of it
1990-... World-wide adoption
Not just for researchers anymore
The Web is a global information space
a.k.a. The World Wide Web (WWW)
Mostly used by humans through Web browsers
The Web is focused on humans
-
Web pages show information
Visualized using Web browsers
-
Clicking on links
To discover new information
-
Search for information
Using search engines such as Google, Bing, ...
Achieving tasks requires manual effort
-
Will it rain next week?
- Find a weather prediction website
- Select your location
- Navigate to next week
-
Book a trip for a group of people
- Comparing agendas
- Comparing interests: nature, museums, ...
- Regional temperatures
- Geopolitical status
- ...
What if intelligent agents
could do these tasks for us?
I Have a Dream
A Web where intelligent
agents are able to achieve
our day-to-day tasks.
2001
The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.
The dream: Getting there step by step
2015: Agents are being introduced that can perform simple tasks
Google Assistant, Alexa, Siri, ...
How do these agents work?
→ Execute Queries over Knowledge Graphs
Query = Structured Question
Knowledge Graph = Structured information
→ Structured questions over structured information
A Knowledge Graph is
a collection of structured information
Semantic Web technologies
-
RDF
Standard for representing and exchanging graph data
Knowledge Graphs are collections of such graph data
-
SPARQL
Standard for querying over RDF data
Generative AI models
2022: LLMs offer human-like conversations based on crawled Web data
Differences from Knowledge Graphs:
- Adaptive understanding of unstructured data
- Inconsistent answers and hallucinations
- Black box
RDF enables data exchange on the Web
-
RDF
Resource Description Framework
-
RDF 1.0: W3C Recommendation since 1999
-
RDF 1.1: W3C Recommendation since 2014
-
RDF 1.2: upcoming in 2025
Facts are represented as RDF triples
A triple consists of a subject, a predicate, and an object.
Multiple RDF triples form RDF datasets:
:Alice :knows :Bob .
:Alice :knows :Carol .
:Alice :name "Alice" .
:Bob :name "Bob" .
:Bob :knows :Carol .
RDF Resources are identified by IRIs
https://example.org/Alice
-
IRI
Internationalized Resource Identifier
-
Global identifiers
Can be used by different data sources → interlinking!
-
IRIs can be URLs
A URL is an IRI that can be dereferenced (e.g. looked up in Web browsers).
Allows additional information to be looked up for a resource.
Multiple syntaxes exist for RDF
Turtle
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
dbr:Alice foaf:knows dbr:Bob .
dbr:Alice foaf:name "Alice" .
JSON-LD
{
  "@context": {
    "dbr": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/"
  },
  "@id": "dbr:Alice",
  "foaf:knows": { "@id": "dbr:Bob" },
  "foaf:name": "Alice"
}
SPARQL querying over RDF datasets
- SPARQL: language to read and update RDF datasets via declarative queries.
-
Different query forms:
SELECT: select values in tabular form → focus of this presentation
CONSTRUCT: construct new triples
ASK: check whether data exists (see the example below)
DESCRIBE: describe a given resource
INSERT: insert new triples (SPARQL Update)
DELETE: delete existing triples (SPARQL Update)
Specification: https://www.w3.org/TR/sparql-query/
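As a small illustration, an ASK query over the :Alice dataset from earlier returns a boolean (a minimal sketch reusing that snippet's default prefix):
ASK WHERE {
  :Alice :knows :Bob .
}
This yields true, since the dataset contains a matching triple.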
Find all artists born in York
name                   | deathDate
Albert Joseph Moore    |
Charles Francis Hansom | 1888
David Reed (comedian)  |
Dustin Gee             |
E Ridsdale Tate        | 1922
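A query along the following lines could produce this table (a sketch assuming the DBpedia vocabulary used later in this presentation; the original query is not shown):
SELECT ?name ?deathDate WHERE {
  ?person a dbpedia-owl:Artist ;
          rdfs:label ?name ;
          dbpedia-owl:birthPlace dbr:York .
  OPTIONAL { ?person dbpedia-owl:deathDate ?deathDate . }
}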
How do query engines process a query?
RDF dataset + SPARQL query
↓
SPARQL query processing
↓
query results
Basic Graph Patterns enable graph pattern matching
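For example, over the :Alice dataset from earlier, the following Basic Graph Pattern (a minimal illustration reusing that snippet's default prefix) only matches where both triple patterns hold for the same bindings:
SELECT ?x ?name WHERE {
  ?x :knows ?y .
  ?y :name ?name .
}
On that dataset, this yields a single solution mapping: ?x → :Alice, ?name → "Bob" (via ?y → :Bob); :Carol is known but has no :name triple to match.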
Query results representation
-
Solution Mapping
A mapping from a set of variables to a set of RDF terms.
-
Solution Sequence
A list of solution mappings.
1 solution sequence with 3 solution mappings:
name              | birthplace
Bob Brockmann     | http://dbpedia.org/resource/Louisiana
Bennie Nawahi     | http://dbpedia.org/resource/Honolulu
Weird Al Yankovic | http://dbpedia.org/resource/Downey,_California
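On the wire, such a solution sequence is commonly serialized in the standard SPARQL Query Results JSON Format; the first mapping above would look like this:
{
  "head": { "vars": ["name", "birthplace"] },
  "results": {
    "bindings": [
      {
        "name": { "type": "literal", "value": "Bob Brockmann" },
        "birthplace": { "type": "uri", "value": "http://dbpedia.org/resource/Louisiana" }
      }
    ]
  }
}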
Steps in SPARQL query processing
-
1. Parsing
Transform a SPARQL query string into an algebra expression
-
2. Optimization
Transform algebra expression into a query plan
-
3. Evaluation
Execute the query plan to obtain query results
Publishing Knowledge Graphs as SPARQL Endpoints
SPARQL endpoint: an API that accepts SPARQL queries and replies with results.
-
Most popular way to publish Knowledge Graphs
Alternatives are data dumps and Linked Data Documents
-
Very powerful
Very complex queries can be formulated with SPARQL
-
Power comes with a cost
SPARQL endpoints can require very powerful servers
Parsing: transform query string into SPARQL Algebra
-
SPARQL Algebra = Abstract Syntax Tree (AST)
Algebraic data structure that represents the query structure
Tree where each node is a separate operator: join, BGP, filter, ...
Easier for engines to work with than strings
Enables algebraic query optimizations
A.k.a. query plan → basis for query planning
SPARQL Algebra example
Input:
SELECT ?x ?y ?z WHERE {
  ?x ?y ?z
}
Output:
{
  "type": "project",
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": { "termType": "Variable", "value": "x" },
        "predicate": { "termType": "Variable", "value": "y" },
        "object": { "termType": "Variable", "value": "z" }
      }
    ]
  },
  "variables": [
    { "termType": "Variable", "value": "x" },
    { "termType": "Variable", "value": "y" },
    { "termType": "Variable", "value": "z" }
  ]
}
Why is optimization needed?
-
Queries are declarative
They say what needs to be done, not how.
Transformation into executable code can happen in many different ways.
-
Factors that influence performance
Dataset
Order of operators in AST
Available operator algorithm implementations
Available data indexes
Different levels of optimization
-
Algebraic optimization
Independent of underlying data
Based on formally provable equivalence of operations
-
Low-level optimization
Based on the underlying data
Select and configure specific algorithms of algebra operators
Algebraic optimization
-
Data-independent
Independent of the underlying dataset
Based on heuristics
-
Preserves evaluation semantics
The optimized algebra produces the same results as the unoptimized algebra
-
Trade-off between optimization time and execution time
Algebraic optimization can take a lot of time,
sometimes even more than executing the sub-optimal algebra.
Example: ORDER BY + DISTINCT
-
ORDER BY: sort solution mappings by the given variable
-
DISTINCT: remove duplicate solution mappings
-
SPARQL requires DISTINCT to happen after ORDER BY
-
Performance issues when sorting many non-distinct solution mappings
-
Optimization: apply DISTINCT before ORDER BY
Input:
{
  "type": "distinct",
  "input": {
    "type": "orderby",
    "variable": "x",
    "input": { ... }
  }
}
Output:
{
  "type": "orderby",
  "variable": "x",
  "input": {
    "type": "distinct",
    "input": { ... }
  }
}
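The rewrite itself is a small transformation on the algebra tree. A minimal sketch in TypeScript over the JSON shape above (the Op type and function name are illustrative, not a specific engine's API):
// Simplified algebra node; real engines use richer representations.
interface Op {
  type: string;
  input?: Op;
  variable?: string;
}

// If a "distinct" directly wraps an "orderby", swap them so that
// deduplication happens before the (now cheaper) sort.
function pushDistinctBelowOrderBy(op: Op): Op {
  if (op.type === "distinct" && op.input?.type === "orderby") {
    const orderby = op.input;
    return { ...orderby, input: { type: "distinct", input: orderby.input } };
  }
  return op; // rule does not apply: leave the plan unchanged
}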
Low-level optimization
-
Data-dependent
Based on the underlying dataset, statistics, and derived indexes.
-
Minimize computational complexity
Reduce the number of intermediary solution mappings.
-
Select and configure specific algorithms of algebra operators
Different algorithms behave better in certain situations.
Example: joining in BGPs
How, and in what order, should triple patterns be evaluated?
1. ?person a dbpedia-owl:Artist.
2. ?person rdfs:label ?name.
3. ?person dbpedia-owl:birthPlace ?birthPlace.
Different factors influence join order
-
Cardinality of each pattern
Can be exact, or an estimate
-
Complexity of join algorithms
Based on cardinalities of patterns
-
Precomputing steps of join algorithms
Precomputation may be too expensive for low cardinalities
-
Restrictions of join algorithms
Some algorithms only work in certain cases
Obtain cardinality of each pattern
1. ?person a dbpedia-owl:Artist. → 200
2. ?person rdfs:label ?name. → 10,000
3. ?person dbpedia-owl:birthPlace ?birthPlace. → 1,000
Can be pre-computed, or estimated at runtime.
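For instance, an exact cardinality for the first pattern could be obtained with a COUNT query (one possible approach, reusing the prefixes from the patterns above):
SELECT (COUNT(*) AS ?cardinality) WHERE {
  ?person a dbpedia-owl:Artist.
}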
Select appropriate join algorithms
Joins are usually performed pairwise, over two solution sequences of sizes n and m
-
Nested-loop join
Double for-loop → O(n*m)
Works well for small sequences due to the lack of preprocessing.
-
Hash join
Precomputes a hash table over all mappings in one solution sequence.
Iterates over the other solution sequence with lookups into the hash table → O(n+m) (see the sketch after this list)
Not possible with OPTIONAL
-
Sort-merge join
Sort solution sequences.
Iterates over both solution sequences in order → O(n+m)
Sorting can become expensive with larger solution sequences
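A minimal hash join sketch in TypeScript (assumptions: a solution mapping is a Map from variable names to terms, and the two sequences share exactly one join variable):
type Mapping = Map<string, string>;

function hashJoin(left: Mapping[], right: Mapping[], joinVar: string): Mapping[] {
  // Build phase: index the left sequence by the join variable's value → O(n).
  const index = new Map<string, Mapping[]>();
  for (const mapping of left) {
    const key = mapping.get(joinVar)!;
    if (!index.has(key)) index.set(key, []);
    index.get(key)!.push(mapping);
  }
  // Probe phase: one hash lookup per mapping in the right sequence → O(m).
  const results: Mapping[] = [];
  for (const mapping of right) {
    for (const match of index.get(mapping.get(joinVar)!) ?? []) {
      results.push(new Map([...match, ...mapping])); // merge compatible mappings
    }
  }
  return results;
}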
Join orders and algorithms influence worst-case complexity
-
All nested-loop joins
((1 ⋈ 2) ⋈ 3) = ((200 * 10,000) * 1,000) = 2,000,000,000
-
All hash joins
((1 ⋈ 2) ⋈ 3) = ((200 + 10,000) + 1,000) = 11,200
-
Nested-loop join and hash join
((1 ⋈ 3) ⋈ 2) = ((200 * 1,000) + 10,000) = 210,000
→ Restrictions and preprocessing steps (sorting, hashing, ...) not yet taken into account
→ Trade-off between different factors that influence join order
Evaluating the optimized query plan
-
Bottom-up
Start with leaves in query plan, and work up to the root.
-
Parallelisation
Evaluate independent branches in parallel on different threads.
-
Pipelining
Reduce time to first result.
-
Adaptive processing
Advanced: adaptively modify the query plan at runtime.
Evaluation iteratively handles leaf nodes
Based on operator type
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      { ...solution sequence... },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      { ...solution sequence... },
      { ...solution sequence... }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": { ...solution sequence... }
}
Evaluation iteratively handles leaf nodes
...solution sequence...
Some operations can be parallelised
-
Operators with multiple independent inputs
UNION branches are fully independent
Triple patterns in BGPs can be handled independently based on the algorithm
Some joins can be partially handled independently (e.g. sorting step in sort-merge join)
-
Local threads vs. distributed machines
Communication cost may be high → parallelising not always beneficial
-
I/O vs. CPU cost
Depending on how data is stored, I/O may be the bottleneck
E.g. remotely stored data → HTTP is bottleneck instead of CPU
Pipelining produces results as soon as possible
-
Evaluating operators one after the other is sub-optimal
Requires intermediary results to be kept in memory
Results only become available once everything is done processing
-
Pipelining enables intermediary results to flow through operators
In general, intermediary results do not have to be kept in memory
Query results are produced as a stream, as soon as they are available
A pipeline of operators
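A minimal sketch of pipelining in TypeScript, using generators: each operator consumes and produces an iterator of solution mappings, so the first result can appear before the full input has been processed (simplified illustrative operators, not a specific engine's API):
type Mapping = Map<string, string>;

// DISTINCT: emits each unique mapping as soon as it is first seen.
function* distinct(input: Iterable<Mapping>): Generator<Mapping> {
  const seen = new Set<string>();
  for (const mapping of input) {
    const key = JSON.stringify([...mapping.entries()].sort());
    if (!seen.has(key)) {
      seen.add(key);
      yield mapping; // flows to the next operator immediately
    }
  }
}

// PROJECT: keeps only the requested variables of each mapping.
function* project(input: Iterable<Mapping>, vars: string[]): Generator<Mapping> {
  for (const mapping of input) {
    const projected = new Map<string, string>();
    for (const v of vars) {
      if (mapping.has(v)) projected.set(v, mapping.get(v)!);
    }
    yield projected;
  }
}

// Operators compose into a pipeline, e.g.:
// for (const result of project(distinct(bgpResults), ["name"])) { ... }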
Adaptive processing responds to changing conditions
-
Query optimization as preprocessing is restrictive
Assumes that all information is known beforehand
Cannot cope with volatile distributed environments
-
Adaptive query processing: optimization during evaluation
Modify query plan based on network or data changes
Algorithms are typically more complex
SPARQL processing over centralized data
-
Dataset is collocated with query engine
All data is known beforehand
-
Single dataset
Combining multiple datasets is hard
Centralization not always possible
-
Private data
Technical and legal reasons
-
Evolving data
Requires continuous re-indexing
-
Web scale data
Indexing the whole Web is infeasible (for non-tech-giants)
How to query over decentralized data?
-
Data and query engine are not collocated
Query engine runs on a separate machine
-
Not just one dataset
Data is spread over the Web into multiple documents
Approaches for querying over decentralized data
-
Federated Query Processing
Distributing query execution across known sources
-
Link Traversal Query Processing
Local query execution over sources that are discovered by following links
Client distributes query over query APIs
-
Clients perform limited effort
Split up the query, distribute it (source selection), and combine the results
-
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets
Federation over SPARQL endpoints
-
Servers are SPARQL endpoints (most common)
They accept any valid SPARQL query
-
Client-side source selection
Rewrite query in terms of SERVICE clauses
SELECT ?drug ?title WHERE {
  ?drug db:drugCategory dbc:micronutrient.
  ?drug db:casRegistryNumber ?id.
  ?keggDrug rdf:type kegg:Drug.
  ?keggDrug bio2rdf:xRef ?id.
  ?keggDrug purl:title ?title.
}
SELECT ?drug ?title WHERE {
  SERVICE <http://example.com/drb> {
    ?drug db:drugCategory dbc:micronutrient.
    ?drug db:casRegistryNumber ?id.
  }
  SERVICE <http://example.com/kegg> {
    ?keggDrug rdf:type kegg:Drug.
    ?keggDrug bio2rdf:xRef ?id.
    ?keggDrug purl:title ?title.
  }
}
Federation over heterogeneous sources
-
Servers are not only SPARQL endpoints
Other types of Linked Data Fragments: TPF, WiseKG, brTPF, ...
Different levels of server expressivity
-
Clients may have to take up more effort
Executing parts of queries client-side
-
Trade-off between server and client effort
Low-cost publishing and preventing server availability issues
Limitations of federated querying
-
All federation members must be known before execution starts
Source selection distributes query across list of sources
No discovery of new sources
-
Limited scalability in terms of number of endpoints
Current federation techniques scale to the order of 10 sources
Exploit interlinking of documents
-
Linked Data documents are linked to each other
Following the Linked Data principles
-
Query engine can follow links
Start from one document, and discover new documents on the fly
Link Traversal-based Query Processing
= Querying by following links between documents
Example: decentralized address book
Example: Find Alice's contact names
SELECT ?name WHERE {
  <https://alice.pods.org/profile#me> foaf:knows ?person.
  ?person foaf:name ?name.
}
Query process:
- Start from Alice's address book
- Follow links to profiles of Bob and Carol
- Query over union of all profiles
- Find query results:
[ { "name": "Bob" }, { "name": "Carol" } ]
The Link Traversal Query Process
Link queue: stores links to documents that need to be fetched
-
1. Initialize link queue with seed URLs
Seed URLs can be user-defined, or derived from query
-
2. Iteratively process and append to the link queue
Pop the head off the queue, and follow the link to its document
Add all URLs in the document to the link queue (see the sketch after these steps)
-
3. Execute query
Over the union of all discovered RDF triples
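The loop above can be sketched in TypeScript as follows (fetchDocument is an assumed helper that dereferences a URL into the document's triples and links):
interface Triple { subject: string; predicate: string; object: string; }
declare function fetchDocument(url: string):
  Promise<{ triples: Triple[]; links: string[] }>;

async function traverse(seedUrls: string[]): Promise<Triple[]> {
  const queue = [...seedUrls];        // 1. initialize the link queue with seed URLs
  const visited = new Set<string>();
  const triples: Triple[] = [];
  while (queue.length > 0) {          // 2. iteratively pop the head off the queue
    const url = queue.shift()!;
    if (visited.has(url)) continue;   // skip documents we already fetched
    visited.add(url);
    const document = await fetchDocument(url);
    triples.push(...document.triples);
    queue.push(...document.links);    // discovered links are appended to the queue
  }
  return triples;                     // 3. the query executes over this union
}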
Link Traversal is unsuitable in practice
-
Link queue takes a long time to become empty
It keeps growing due to the massive size of the Web
Query process is delayed!
-
Too many links are followed
Each document typically contains many links → exponential growth
-
Existing query planning algorithms don't work
There is no a priori knowledge about cardinalities and other statistics
Decentralized querying has open problems
-
Link Traversal is suitable for querying decentralized environments
Copes with limitations of federated querying
Relatively new research area → many open problems
-
Main bottleneck is number of links
Future research should focus on guidance towards query-relevant links
Notable SPARQL engines
-
SPARQL endpoint as a service
-
Open-source engine in Java
-
Open-source engine in C
-
Open-source engine in TypeScript, which focuses on decentralized querying
Example Federated queries
Requires prior knowledge of sources.
-
Joins Uniprot and Rhea
-
Joins Uniprot, Rhea, and ChEMBL
Example Link Traversal queries
Discover sources on the fly.
-
Queries over real-world and synthetic environments
Conclusions
-
Knowledge Graphs use (Semantic) Web technologies
RDF, SPARQL, …
-
Knowledge Graphs can be interlinked
Thanks to RDF's global identifiers
-
Federated Querying and Link Traversal
Combining data across multiple Knowledge Graphs
Complex queries can be slow