An Introduction to Querying over Knowledge Graphs on the Web

Ruben Taelman

VLIR-UOS ITP 2025, 19 June 2025

An Introduction to Querying
over Knowledge Graphs
on the Web

Ghent University – imec – IDLab, Belgium

Storing data as Knowledge Graphs

1989, CERN Switzerland

Tim Berners-Lee
inventor of the World Wide Web

The Web's foundational ideas

1990-... World-wide adoption

Not just for researchers anymore

The Web is a global information space

a.k.a. The World Wide Web (WWW)

News
Social
Media

Web is focused on humans

Achieving tasks requires manual effort

What if intelligent agents
could do these tasks for us?

I Have a Dream

A Web where intelligent
agents are able to achieve
our day-to-day tasks.
2001

The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.

The dream: Getting there step by step

2015: Agents are being introduced that can perform simple tasks

Google Assistant Alexa Siri
Google Assistant, Alexa, Siri, ...

How do these agents work?

→ Execute Queries over Knowledge Graphs

Query = Structured Question
Knowledge Graph = Structured information

→ Structured questions over structured information

A Knowledge Graph is
a collection of structured information

Semantic Web technologies

Generative AI models

2022: LLMs offer human-like conversations based on crawled Web data

Differences to Knowledge Graphs:

ChatGPT Copilot DeepSeek

RDF enables data exchange on the Web

Facts are represented as RDF triples

RDF triple
RDF triple

Multiple RDF triples form RDF datasets

:Alice :knows :Bob .
:Alice :knows :Carol .
:Alice :name "Alice" .
:Bob :name "Bob" .
:Bob :knows :Carol .

Multiple RDF triples form RDF datasets

RDF Resources are identified by IRIs

https://example.org/Alice

Multiple syntaxes exist for RDF

Turtle

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

dbr:Alice foaf:knows dbr:Bob .
dbr:Alice foaf:name "Alice" .

JSON-LD

{
    "@context": {
        "dbr": "http://dbpedia.org/resource/",
        "foaf": "http://xmlns.com/foaf/0.1/"
    },
    "@id": "dbr:Alice",
    "foaf:knows": "dbr:Bob",
    "foaf:name": "Alice"
}

SPARQL querying over RDF datasets

Specification: https://www.w3.org/TR/sparql-query/

Find all artists born in York

namedeathDate
Albert Joseph Moore
Charles Francis Hansom1888
David Reed (comedian)
Dustin Gee
E Ridsdale Tate1922

How do query engines process a query?

RDF dataset + SPARQL query

...

query results

How do query engines process a query?

RDF dataset + SPARQL query

SPARQL query processing

query results

Basic Graph Patterns enable graph pattern matching

Query results representation

1 solution sequence with 3 solution mappings:
namebirthplace
Bob Brockmannhttp://dbpedia.org/resource/Louisiana
Bennie Nawahihttp://dbpedia.org/resource/Honolulu
Weird Al Yankovichttp://dbpedia.org/resource/Downey,_California

Steps in SPARQL query processing

Publishing Knowledge Graphs as SPARQL Endpoints

SPARQL endpoint: API that accepts SPARQL queries, and replies with results.

Parsing: transform query string into SPARQL Algebra

SPARQL Algebra example

Input:
SELECT ?x ?y ?z WHERE {
  ?x ?y ?z
}
Output:
{
  "type": "project",
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": {
          "termType": "Variable",
          "value": "x"
        },
        "predicate": {
          "termType": "Variable",
          "value": "y"
        },
        "object": {
          "termType": "Variable",
          "value": "z"
        }
      }
    ]
  },
  "variables": [
    { "termType": "Variable",
      "value": "x" },
    { "termType": "Variable",
      "value": "y" },
    { "termType": "Variable",
      "value": "z" }
  ]
}

Why is optimization needed?

Different levels of optimization

Algebraic optimization

Example: ORDER BY + DISTINCT

Input:
{
  "type": "distinct",
  "input": {
    "type": "orderby",
    "variable": "x",
    "input": { ... }
  }
}
Output:
{
  "type": "orderby",
  "variable": "x",
  "input": {
    "type": "distinct",
    "input": { ... }
  }
}

Low-level optimization

Example: joining in BGPs

How, and in what order should triple patterns be evaluated?

1. ?person a dbpedia-owl:Artist.

2. ?person rdfs:label ?name.

3. ?person dbpedia-owl:birthPlace ?birthPlace.

Different factors influence join order

Obtain cardinality of each pattern

1. ?person a dbpedia-owl:Artist. 200

2. ?person rdfs:label ?name. 10.000

3. ?person dbpedia-owl:birthPlace ?birthPlace. 1.000

Can be pre-computed, or estimated at runtime.

Select appropriate join algorithms

Usually happens in pairs of two operators (n, m)

Join orders and algorithms influence worst-case complexity

→ Restrictions and preprocessing steps (sorting, hashing, ...) not yet taken into account

→ Trade-off between different factors that influence join order

Evaluating the optimized query plan

Evaluation iteratively handles leaf nodes

Based on operator type

{
  "type": "project",
  "variables": [ ... ]
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}

Evaluation iteratively handles leaf nodes

{
  "type": "project",
  "variables": [ ... ]
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}

Evaluation iteratively handles leaf nodes

{
  "type": "project",
  "variables": [ ... ]
  "input": {
    "type": "bgp",
    "patterns": [
      {
        ...solution sequence...
      },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}

Evaluation iteratively handles leaf nodes

{
  "type": "project",
  "variables": [ ... ]
  "input": {
    "type": "bgp",
    "patterns": [
      {
        ...solution sequence...
      },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}

Evaluation iteratively handles leaf nodes

{
  "type": "project",
  "variables": [ ... ]
  "input": {
    "type": "bgp",
    "patterns": [
      {
        ...solution sequence...
      },
      {
        ...solution sequence...
      }
    ]
  }
}

Evaluation iteratively handles leaf nodes

{
  "type": "project",
  "variables": [ ... ]
  "input": {
    "type": "bgp",
    "patterns": [
      {
        ...solution sequence...
      },
      {
        ...solution sequence...
      }
    ]
  }
}

Evaluation iteratively handles leaf nodes

{
  "type": "project",
  "variables": [ ... ]
  "input": {
    ...solution sequence...
  }
}

Evaluation iteratively handles leaf nodes

{
  "type": "project",
  "variables": [ ... ]
  "input": {
    ...solution sequence...
  }
}

Evaluation iteratively handles leaf nodes

...solution sequence...

Some operations can be parallelised

Pipelining produces results as soon as possible

A pipeline of operators

Adaptive processing responds to changing conditions

SPARQL processing over centralized data

Centralization not always possible

How to query over decentralized data?

Approaches for querying over decentralized data

Client distributes query over query APIs

Federation over SPARQL endpoints

SELECT ?drug ?title WHERE {
  ?drug db:drugCategory dbc:micronutrient.
  ?drug db:casRegistryNumber ?id.
  ?keggDrug rdf:type kegg :Drug.
  ?keggDrug bio2rdf:xRef ?id.
  ?keggDrug purl:title ?title.
}


SELECT ?drug ?title WHERE {
  SERVICE <http://example.com/drb> {
    ?drug db:drugCategory dbc:micronutrient.
    ?drug db:casRegistryNumber ?id.
  }
  SERVICE <http://example.com/kegg> {
    ?keggDrug rdf:type kegg :Drug.
    ?keggDrug bio2rdf:xRef ?id.
    ?keggDrug purl:title ?title.
  }
}

Federation over heterogeneous sources

Limitations of federated querying

Example: decentralized address book

Example: Find Alice's contact names

SELECT ?name WHERE {
    <https://alice.pods.org/profile#me>
        foaf:knows ?person.
    ?person foaf:name ?name.
}

Query process:

  1. Start from Alice's address book
  2. Follow links to profiles of Bob and Carol
  3. Query over union of all profiles
  4. Find query results: [ { "name": "Bob" }, { "name": "Carol" } ]

Notable SPARQL engines

Popular SPARQL endpoints

Example Federated queries

Requires prior knowledge of sources.

Example Link Traversal queries

Discover sources on the fly.

Conclusions