Optimizing AMF in TPF for Clients and Servers

Optimizing Approximate Membership Metadata in Triple Pattern Fragments

A low-cost triple pattern interface

Servers only accepts triple pattern queries.
Clients execute remainder of SPARQL query client-side.
Reduces server load, at the cost of increased client effort and bandwidth:

(R. Verborgh 2014)

TPF leads to longer execution times

Caused by large number of HTTP requests
HTTP delays form the main bottleneck in TPF query execution
1/3 queries: 50% of HTTP requests are membership requests
Membership request = triple pattern query w/o variables
→ check if triple exists in dataset
(M. Vander Sande 2015)

Example of membership requests

Input query

SELECT ?p WHERE {
  ?p dbo:birthPlace dbr:York. # Evaluate first
  ?p a dbo:Artist.
}

Bind results to remaining query → membership queries

dbp:Albert_Joseph_Moore    a dbo:Artist
dbp:Charles_Francis_Hansom a dbo:Artist
dbp:Dustin_Gee             a dbo:Artist
... (207 in total on DBpedia)

Existing Optimization
New Optimizations
Findings

Approximate membership filters (AMFs) to reduce HTTP requests

Bloom filters and Golomb-coded sets
Probabilistic datastructures to efficiently determine membership of a set
Chance of false-positives
AMFs are included in TPF responses for triple patterns
?p a dbo:Artist includes a Bloom filter containing all bindings for ?p
→ Avoid HTTP requests for true negatives
→ Still need HTTP requests for true/false positives

Example with AMF

Input query

SELECT ?p WHERE {
  ?p dbo:birthPlace dbr:York. # Evaluate first
  ?p a dbo:Artist. # Download AMF
}

Run results through AMF of second pattern

AMF(dbp:Albert_Joseph_Moore)
AMF(dbp:Charles_Francis_Hansom)
AMF(dbp:Dustin_Gee)
...

→ 207 32 HTTP membership requests

AMFs reduce number of HTTP requests

+ On average, 33% fewer HTTP requests
Pre-filtering of true negatives
- Total execution time increased for most queries
Server delays when generating AMFs

Goal of this work:
How to reduce query execution time with AMFs over TPF?

Existing Optimization
New Optimizations
Findings

Investigated directions for optimization

Subselection, more in the article

1. Use AMFs higher up in the query plan
Also use for partially bound triple patterns
2. Caching of AMFs
To avoid regenerating large AMFs
3. Determine the optimal false-positive probability
Probability impacts number of HTTP requests and AMF sizes

1. Using AMFs higher up in query plan reduces time for most queries

None: Default client-side TPF algorithm
Triple: Existing triple-level AMF algoritm
BGPSimple, BGPCombined, TripleBGP: New higher-level algorithms

Speedup for 15 / 20 queries

WatDiv (dataset scale: 100; query count: 5)

1. Using AMFs higher up in query plan changes result arrival times

Delayed time to first result
AMFs used for higher-level algorithms are larger → more HTTP delay
Faster result rate

2. Any form of caching significantly reduces query execution time

HTTP: Server-wide NGINX cache
Filter: Only caching AMFs

3. No clear winner among the different false-positive probabilities

Optimal probability depends on dataset
WatDiv: 1/64

Experiments are fully reproducible

Implementations are open-source

Linked Data Fragments Server extension
https://github.com/LinkedDataFragments/Server.js/tree/feature-handlers-amf-2

Comunica client-side querying modules
https://github.com/comunica/comunica-feature-amf
Reproducible Docker-based experiments

Using declarative configuration files
https://github.com/comunica/Experiments-AMF

Existing Optimization
New Optimizations
Findings

Findings

Using AMFs higher up in the query plan improves query efficiency
But can slow down time to first result
Caching of AMFs is essential
Pre-computation is also possible
Bloom filters are more effective than Golomb-coded sets
Bloom filters are easier to decompress for clients
False-positive rate is dataset-dependent
1/64 leads to a good trade-off for WatDiv

Optimizing Approximate Membership Metadata in Triple Pattern Fragments

Optimizing Approximate Membership Metadata in Triple Pattern Fragments

Triple Pattern Fragments (TPF)

A low-cost triple pattern interface

TPF leads to longer execution times

Caused by large number of HTTP requests

1/3 queries: 50% of HTTP requests are membership requests

Example of membership requests

Input query

Bind results to remaining query → membership queries

→ 207 HTTP membership requests

Approximate membership filters (AMFs) to reduce HTTP requests

Bloom filters and Golomb-coded sets

AMFs are included in TPF responses for triple patterns

Example with AMF

Input query

Run results through AMF of second pattern

→ 207 32 HTTP membership requests

AMFs reduce number of HTTP requests

+ On average, 33% fewer HTTP requests

- Total execution time increased for most queries

Goal of this work:
How to reduce query execution time with AMFs over TPF?

Investigated directions for optimization

1. Use AMFs higher up in the query plan

2. Caching of AMFs

3. Determine the optimal false-positive probability

1. Using AMFs higher up in query plan reduces time for most queries

1. Using AMFs higher up in query plan changes result arrival times

2. Any form of caching significantly reduces query execution time

3. No clear winner among the different false-positive probabilities

Experiments are fully reproducible

Implementations are open-source

Reproducible Docker-based experiments

Findings

Using AMFs higher up in the query plan improves query efficiency

Caching of AMFs is essential

Bloom filters are more effective than Golomb-coded sets

False-positive rate is dataset-dependent

Optimizing Approximate Membership Metadata in Triple Pattern Fragments

Triple Pattern Fragments (TPF)

A low-cost triple pattern interface

TPF leads to longer execution times

Caused by large number of HTTP requests

1/3 queries: 50% of HTTP requests are membership requests

Example of membership requests

Input query

Bind results to remaining query → membership queries

→ 207 HTTP membership requests

Approximate membership filters (AMFs) to reduce HTTP requests

Bloom filters and Golomb-coded sets

AMFs are included in TPF responses for triple patterns

Example with AMF

Input query

Run results through AMF of second pattern

→ 207 32 HTTP membership requests

AMFs reduce number of HTTP requests

+ On average, 33% fewer HTTP requests

- Total execution time increased for most queries

Goal of this work: How to reduce query execution time with AMFs over TPF?

Investigated directions for optimization

1. Use AMFs higher up in the query plan

2. Caching of AMFs

3. Determine the optimal false-positive probability

1. Using AMFs higher up in query plan reduces time for most queries

1. Using AMFs higher up in query plan changes result arrival times

2. Any form of caching significantly reduces query execution time

3. No clear winner among the different false-positive probabilities

Experiments are fully reproducible

Implementations are open-source

Reproducible Docker-based experiments

Findings

Using AMFs higher up in the query plan improves query efficiency

Caching of AMFs is essential

Bloom filters are more effective than Golomb-coded sets

False-positive rate is dataset-dependent

Goal of this work:
How to reduce query execution time with AMFs over TPF?