Abstract
Big-data management systems must handle multiple concurrent queries over multi-dimensional data sets. To achieve high throughput, such systems could implement various techniques to avoid redundant computations and data fetches. One such approach is to cache a subset of the query results and reuse these results to (partially) fulfill future query requests. This approach can be quite effective for query-at-a-time processing. However, we suspect that even greater performance is being left on the table if queries are only optimized in isolation, and that higher throughput can be extracted through a systematic examination of the relationships between queries in a given workload.
This paper describes a framework that captures inter-query relationships to reveal increased opportunities to exploit caching. We present a heuristic used for scheduling queries and a novel workload-informed cache replacement policy. When these methods are applied in combination, our system is able to extract impressive speedup of the total execution time of batches of queries, using only modest cache sizes. In this paper we show that the proposed replacement algorithm easily outstrips the performance of the classic algorithms FIFO and LRU. Under certain conditions, our system was able to achieve roughly 2 to 4 time speedup over these traditional replacement schemes.