A low-cost triple pattern interface
Servers only accepts triple pattern queries.
Clients execute remainder of SPARQL query client-side.
Reduces server load, at the cost of increased client effort and bandwidth:
high client cost
high availability
high bandwidth
high server cost
low availability
low bandwidth
data dump
subject page
Triple Pattern Fragments
SPARQL endpoints
(R. Verborgh 2014)
TPF leads to longer execution times
Caused by large number of HTTP requests
HTTP delays form the main bottleneck in TPF query execution
1/3 queries: 50% of HTTP requests are membership requests
Membership request = triple pattern query w/o variables
→ check if triple exists in dataset
(M. Vander Sande 2015 )
Example of membership requests
Input query
SELECT ?p WHERE {
?p dbo:birthPlace dbr:York. # Evaluate first
?p a dbo:Artist.
}
Bind results to remaining query → membership queries
dbp:Albert_Joseph_Moore a dbo:Artist
dbp:Charles_Francis_Hansom a dbo:Artist
dbp:Dustin_Gee a dbo:Artist
... (207 in total on DBpedia)
→ 207 HTTP membership requests
Approximate membership filters (AMFs) to reduce HTTP requests
Bloom filters and Golomb-coded sets
Probabilistic datastructures to efficiently determine membership of a set
Chance of false-positives
AMFs are included in TPF responses for triple patterns
?p a dbo:Artist
includes a Bloom filter containing all bindings for ?p
→ Avoid HTTP requests for true negatives
→ Still need HTTP requests for true/false positives
Example with AMF
Input query
SELECT ?p WHERE {
?p dbo:birthPlace dbr:York. # Evaluate first
?p a dbo:Artist. # Download AMF
}
Run results through AMF of second pattern
AMF(dbp:Albert_Joseph_Moore)
AMF(dbp:Charles_Francis_Hansom)
AMF(dbp:Dustin_Gee)
...
→ 207 32 HTTP membership requests
AMFs reduce number of HTTP requests
+ On average, 33% fewer HTTP requests
Pre-filtering of true negatives
- Total execution time increased for most queries
Server delays when generating AMFs
Goal of this work:
How to reduce query execution time with AMFs over TPF?
Investigated directions for optimization
Subselection, more in the article
1. Use AMFs higher up in the query plan
Also use for partially bound triple patterns
2. Caching of AMFs
To avoid regenerating large AMFs
3. Determine the optimal false-positive probability
Probability impacts number of HTTP requests and AMF sizes
1. Using AMFs higher up in query plan reduces time for most queries
None: Default client-side TPF algorithm
Triple: Existing triple-level AMF algoritm
BGPSimple, BGPCombined, TripleBGP: New higher-level algorithms
Speedup for 15 / 20 queries
WatDiv (dataset scale: 100; query count: 5)
1. Using AMFs higher up in query plan changes result arrival times
Delayed time to first result
AMFs used for higher-level algorithms are larger → more HTTP delay
Faster result rate
2. Any form of caching significantly reduces query execution time
HTTP: Server-wide NGINX cache
Filter: Only caching AMFs
3. No clear winner among the different false-positive probabilities
Optimal probability depends on dataset
WatDiv: 1/64
Experiments are fully reproducible
Findings
Using AMFs higher up in the query plan improves query efficiency
But can slow down time to first result
Caching of AMFs is essential
Pre-computation is also possible
Bloom filters are more effective than Golomb-coded sets
Bloom filters are easier to decompress for clients
False-positive rate is dataset-dependent
1/64 leads to a good trade-off for WatDiv