Did AI Crawlers Kill SPARQL Federation?

Workshop at KG4S 2026, May 10/11 2026

Abstract

RDF provides the basis for distributing Knowledge Graphs (KGs) across different locations, which is useful when KGs cover different data domains with varying purposes, are managed by different teams and organizations, or are exposed under different access policies. One of the most popular ways of publishing a KG is through a SPARQL endpoint, which offers queryable access. When multiple of these KGs need to be integrated, techniques such as SPARQL federation can be used. While many KGs have been available as public SPARQL endpoints, their openness is currently being challenged by the huge load placed on them by the modern AI crawlers that power LLMs. Recently, public SPARQL endpoints have started putting usage restrictions in place to avoid going down under this increased server load. While these restrictions limit the range of SPARQL queries that can be executed over them in general, they are especially problematic for SPARQL federated queries, which often involve sending many smaller queries to endpoints within a short timeframe. The goal of this position paper is to raise the alarm regarding the state of SPARQL federation, as many federated SPARQL queries that used to work simply cannot be executed anymore with state-of-the-art techniques. In this article, we discuss where and how these restrictions have been put in place, based on an analysis over Wikidata, DBpedia, and Uniprot, and we outline possible mitigation strategies. Through this, we aim to trigger discussions about the sustainability of public KG infrastructure, and to challenge future research towards new querying and publishing techniques that can cope with this new reality.

SPARQL Federation Integrates Knowledge Graphs

SPARQL [1] is the standard language for querying over RDF-based Knowledge Graphs (KGs), with SPARQL endpoints [2] being a popular way of exposing access to such KGs through a Web-based API. Since RDF is based on using global identifiers for resources, these identifiers can be used and interlinked across multiple distributed KGs. While each SPARQL endpoint offers queryable access to just a single KG, SPARQL federated queries [3] allow combining data from multiple endpoints through SERVICE clauses.
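As an illustration, a federated query can join data across endpoints via shared identifiers. The following is a minimal, hypothetical sketch (not one of the paper's listings) that assumes DBpedia's owl:sameAs links to Wikidata entities; the exact results depend on the current state of both endpoints:

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?dbpCity ?population WHERE {
  # Evaluated against DBpedia: cities linked to their Wikidata counterparts
  SERVICE <https://dbpedia.org/sparql> {
    ?dbpCity a dbo:City ;
             owl:sameAs ?wdCity .
    FILTER(STRSTARTS(STR(?wdCity), "http://www.wikidata.org/entity/"))
  }
  # Evaluated against Wikidata: the population (P1082) of each linked city
  SERVICE <https://query.wikidata.org/sparql> {
    ?wdCity wdt:P1082 ?population .
  }
}
```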

As users may not always know exactly which RDF triples originate from which endpoints, writing these SERVICE clauses manually within their query may not always be feasible. For this reason, various federation approaches [4, 5, 6, 7, 8] exist to automatically decompose a query into one with SERVICE clauses and join data across multiple SPARQL endpoints efficiently. Furthermore, federation techniques have also been introduced to federate over heterogeneous interfaces [9, 10, 11], which include not only SPARQL endpoints, but also interfaces such as TPF [12] and brTPF [13]. These techniques are implemented in SPARQL federation engines such as Comunica [14] and HeFQUIN [15], leading to virtually integrated Knowledge Graphs that can be queried as if the data were centralized.

AI Crawlers Disrupt Open Data Infrastructure

In recent years, we have seen the rapid rise of generative AI tools such as ChatGPT, Grok, and Copilot. These rely on Large Language Models (LLMs) [16] that strongly depend on data that is accessible to them on the Web. LLMs use this data during their training, for gathering content in real time based on user queries, and for agentic actions across services using the Model Context Protocol [17] (e.g., a user navigating the Web through a headless browser). Especially the training step requires massive amounts of data, which involves crawling large parts of the Web [18].

Crawlers [19] have been common since the early days of the Web, for example to build the indexes powering search engines such as Google and Bing. However, with the rising popularity of LLM tools, the Web is experiencing a large increase in traffic due to AI crawlers [20]. While robots.txt [21] has been a common technique for server administrators to tell crawlers what data they may access and at what frequency, Content Signals [22] extend this by expressing for what purposes LLMs may use that content.
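A robots.txt combining classic crawler directives with a content signal might look roughly as follows. This is a hypothetical sketch: the user-agent name is just an example, Crawl-delay is a non-standard but widely recognized directive, and the Content-Signal line follows the current proposal, whose syntax may still evolve:

```
# Hypothetical robots.txt for a host serving a SPARQL endpoint
User-agent: GPTBot        # an LLM-training crawler
Disallow: /sparql         # keep it away from the query endpoint

User-agent: *
Crawl-delay: 10           # at most one request every 10 seconds
Content-Signal: search=yes, ai-train=no
Allow: /
```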

Unfortunately, many AI crawlers do not follow these guidelines and are more aggressive than traditional crawlers, causing this added traffic to become unmanageable for many Web servers. As such, administrators who want to avoid their servers being overloaded have to resort to mitigation techniques [23] such as rate limiting, human verification, and blocking. Other initiatives include requiring AI crawlers to pay per crawl [24]. Since many crawlers are smart enough to work around such mitigation techniques, there are even techniques to trap misbehaving crawlers in AI labyrinths [25].

Findings on Public SPARQL Endpoints

Since 2018, we have been serving a Web-based version of the Comunica engine, with which users can execute SPARQL queries directly within their Web browser. This Web client offers example queries, including several that federate over public SPARQL endpoints. However, within the last year, we have started seeing queries fail that used to work without any problems. Hereafter, we discuss three of our queries that started failing, which federate over three of the most popular and largest public SPARQL endpoints: Wikidata [26], DBpedia [27], and Uniprot [28]. These three queries can be found in Listing 1, Listing 2, and Listing 3. The first two federate purely over SPARQL endpoints, while the last one federates over a SPARQL endpoint and two TPF interfaces [12].

PREFIX rh: <http://rdf.rhea-db.org/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
SELECT ?uniprot ?mnemo ?rhea ?accession ?equation 
WHERE {
  { 
    VALUES (?taxid) { (taxon:83333) }
    GRAPH <http://sparql.uniprot.org/uniprot> {
      ?uniprot up:reviewed true . 
      ?uniprot up:mnemonic ?mnemo . 
      ?uniprot up:organism ?taxid .
      ?uniprot up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea . 
    }
  }
  ?rhea rh:accession ?accession .
  ?rhea rh:equation ?equation .
}

Listing 1: A federated query over Uniprot and Rhea to find Escherichia coli reactions [29].

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  ?cat wdt:P31 wd:Q146 ;
       wdt:P19 [ wdt:P17 wd:Q30 ] ; # wd:Q695511
       rdfs:label ?name .
  FILTER(LANG(?name) = "en")
}

Listing 2: A federated query over Wikidata and DBpedia to find all cats in Wikidata.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name ?book ?title {
  ?person dbp:birthPlace [ rdfs:label "San Francisco"@en ].
  ?viafID schema:sameAs ?person;
          schema:name ?name.
  ?book dc:contributor [ foaf:name ?name ];
        dc:title ?title.
}

Listing 3: A federated query over DBpedia, VIAF (TPF), and Harvard Library (TPF) to find San Franciscans in the Harvard library.

Comunica implements state-of-the-art SPARQL federation algorithms such as FedX [4] and SPLENDID [6]. In order for a federation engine such as Comunica to execute a SPARQL query, the engine has to split up the original SPARQL query into several smaller queries, which are sent to different endpoints, after which results are joined together locally within the federation engine. Depending on the complexity of the query, dataset size, and the used federation algorithms, the number of HTTP requests to the SPARQL endpoints can vary greatly.

While Comunica could successfully execute the queries above in the past, it is no longer able to, even though no relevant code changes were made since then. When executing any of these queries within Comunica, the engine errors out and stops execution after just a few seconds, reporting HTTP 429 (Too Many Requests) errors from the SPARQL endpoints. This starts occurring after 34 HTTP requests (0.7 seconds) for the query in Listing 1, 116 requests (0.9 seconds) for Listing 2, and 30 requests (0.5 seconds) for Listing 3.

Both Uniprot and Wikidata report the text “Rate Limit Exceeded” within their HTTP response body. While Wikidata provides no further information on this rate limit, Uniprot returns a Retry-After: 10 header, indicating that the client should wait 10 seconds before making another request. DBpedia provides more information in the form of an HTML page, which states that the site is configured to allow 100 simultaneous connections and 50 requests per second from the same IP address, and advises the user to “Please try again soon”. These rate limits appear to have been enabled within the last year, or have at least been lowered significantly.
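Clients can at least honor such signals when they appear. The following sketch (a hypothetical helper, not Comunica's actual retry logic) illustrates how a federation engine might back off according to a Retry-After header; the send callable is injected so the behavior can be shown without a live endpoint:

```python
import time

def fetch_with_retry(send, url, max_attempts=3, sleep=time.sleep):
    """Send a request, honoring HTTP 429 Retry-After headers.

    `send` is any callable returning (status_code, headers, body);
    it is injected so the logic can be exercised without a network.
    """
    for attempt in range(max_attempts):
        status, headers, body = send(url)
        if status != 429:
            return status, body
        # Respect the server-indicated delay; fall back to 1s if absent.
        delay = float(headers.get("Retry-After", 1))
        if attempt < max_attempts - 1:
            sleep(delay)
    return status, body
```

In a real engine, this back-off would need to be coordinated across all concurrent subqueries to the same endpoint, not applied per request.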

While these are just the results of 3 example queries, we see the same problems occurring for other federated queries over public endpoints.

Discussion and Future Work

AI crawlers are placing open data infrastructure under great pressure. Our findings show that public SPARQL endpoints are no exception, as well-known endpoints are starting to put strict rate limits in place to cope with this added traffic. Unfortunately, not only AI crawlers are impacted by these limits; federated query engines are impacted as well, as an unintended consequence.

The findings above should come as no surprise, as we have known for a long time that public SPARQL endpoints have had availability issues [30]. Since the recent advancements around LLMs, this existing problem is simply being amplified. If we want to publish public Knowledge Graphs in a sustainable manner, we will have to rethink how we publish and consume Knowledge Graphs.

Due to the rising popularity of LLMs, these rate limits are likely to stay with us long-term. Hence, there is a need for a new generation of query planning techniques for SPARQL federation that take into account such restrictions [31]. These should depend on new or extended standards that allow these restrictions to be communicated to clients in a machine-readable manner. The Retry-After header and DBpedia’s HTML page are steps in the right direction, but they are only visible to clients after a limit has been exceeded, while this information would be needed during query planning when discovering the endpoint’s capabilities.
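For instance, if an endpoint could advertise a limit such as DBpedia's 50 requests per second in machine-readable form, a query planner could throttle itself proactively instead of reacting to 429 errors. A minimal client-side token-bucket sketch (the rate value is a hypothetical, manually configured stand-in for such advertised metadata):

```python
import time

class EndpointThrottle:
    """Client-side token bucket capping requests per second to one endpoint.

    `rate` would ideally come from machine-readable endpoint metadata;
    here it is assumed to be configured manually.
    """
    def __init__(self, rate, clock=time.monotonic):
        self.rate = rate    # allowed requests per second
        self.tokens = rate  # current request budget
        self.clock = clock  # injectable for testing
        self.last = clock()

    def wait_time(self):
        """Seconds to wait before the next request may be sent."""
        now = self.clock()
        # Refill the budget proportionally to the time elapsed.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        return (1 - self.tokens) / self.rate
```

A planner could consult such per-endpoint throttles when choosing join strategies, preferring plans that send fewer subqueries to the most constrained endpoints.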

Besides quick-fix solutions such as rate limits and API keys, we may have to alter our publishing approaches more fundamentally. It may, for example, be worth revisiting research towards alternative low-cost and cache-friendly KG interfaces [12, 13, 32, 33, 34, 35] and link-traversal-based querying over plain Linked Data documents [36, 37], which come with the trade-off of higher client-side effort when querying. Or this may be an indication that free access to KGs is simply not sustainable, and that publishers will have to charge clients per request [24, 38], which will require intelligent federated query planning techniques that take total monetary costs into account within their cost model.

The goal of this position paper is to raise awareness of these issues. Our findings are based on just a limited set of queries, so there is certainly a need for more thorough analyses of real-world federation across general and domain-specific endpoints; existing federation techniques have so far only been evaluated in the context of closed and ideal scenarios [39, 40] that lack rate limits and other real-world effects such as timeouts and temporary downtime. For instance, 657 of the 1573 datasets within the Linked Open Data Cloud [41] are exposed as public SPARQL endpoints at the time of writing, but it is unknown how many of these endpoints have such restrictions and therefore break current SPARQL federation engines. Furthermore, discussions with KG publishers must be held to determine whether AI crawlers are indeed the main cause of these restrictions, or whether there are other reasons for putting them in place. Finally, there is a need for KG publishers and developers of KG consumer software to come together and define best practices on how to mitigate availability issues originating from AI crawlers.

One of the main selling points of Knowledge Graphs and the Semantic Web [42] is the ability to distribute and interlink data across different data sources, and to integrate them through techniques such as SPARQL federated queries. However, we are on a trajectory where usage restrictions make it impossible for such federated queries to be executed. This problem is so significant that one might start questioning the fundamental motivations behind Knowledge Graph technologies: if we cannot integrate data across multiple Knowledge Graphs anymore, what is their value compared to closed and silo-oriented databases?

Acknowledgements

Ruben Taelman is a postdoctoral fellow of the Research Foundation – Flanders (FWO) (1202124N). Elias Crum is a predoctoral fellow of the Research Foundation – Flanders (FWO) (1S27825N). This publication is based upon work from COST Action CA23147 GOBLIN - Global Network on Large-Scale, Cross-domain and Multilingual Open Knowledge Graphs, supported by COST (European Cooperation in Science and Technology, https://www.cost.eu).