Thesis Proposal


"Query Processing for Data Analytics on Modern Multicore Systems"

Kayhan Dursun

Wednesday, May 9, 2018 at 11:00 A.M.

Room 506 (CIT 5th Floor)

Developments in the hardware landscape have always influenced the design of data management systems. In the last decade alone, we have seen the advent of large-memory, multi-core architectures and the design of massively parallel, main-memory DBMSs that leverage these new platforms. Moreover, largely due to power constraints, modern processors are evolving towards heterogeneous designs in which general-purpose CPU cores are replaced with high-performance, energy-efficient specialized compute units. Motivated by these trends, we propose novel query processing techniques that target these modern processing environments.

We first focus on result-reuse techniques for data-intensive analytical workloads. While existing reuse approaches require heavy-weight materialization operations, our novel reuse technique caches internal data structures, in particular hash tables, created by query operators and makes them directly reusable for upcoming operations with little or no additional overhead. We implement a prototype called HashStash to confirm the feasibility of our approach, and demonstrate significant performance gains for typical analytical workloads.
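
As a rough illustration of the reuse idea only (not HashStash's actual design), the sketch below caches the hash table built for a hash join so that a later join on the same table and key can probe it without rebuilding. The names HashTableCache, get_or_build, and hash_join are hypothetical, and the sketch assumes the cached table is unchanged between queries.

```python
from collections import defaultdict


class HashTableCache:
    """Hypothetical cache of join hash tables, keyed by (table name, key column)."""

    def __init__(self):
        self._tables = {}
        self.builds = 0  # counts how many hash tables were actually built

    def get_or_build(self, table_name, rows, key):
        cache_key = (table_name, key)
        if cache_key not in self._tables:
            # Build phase: group the rows of the build side by join key.
            table = defaultdict(list)
            for row in rows:
                table[row[key]].append(row)
            self._tables[cache_key] = table
            self.builds += 1
        return self._tables[cache_key]


def hash_join(cache, build_name, build_rows, probe_rows, key):
    """Equi-join that reuses a cached hash table for the build side if one exists."""
    table = cache.get_or_build(build_name, build_rows, key)
    # Probe phase: look up each probe row's key in the (possibly reused) table.
    return [{**b, **p} for p in probe_rows for b in table.get(p[key], [])]


cache = HashTableCache()
customers = [{"cid": 1, "name": "a"}, {"cid": 2, "name": "b"}]
orders = [{"cid": 1, "total": 10}, {"cid": 1, "total": 5}, {"cid": 3, "total": 7}]
returns = [{"cid": 2, "reason": "x"}]

first = hash_join(cache, "customers", customers, orders, "cid")
second = hash_join(cache, "customers", customers, returns, "cid")  # reuses the table
```

In this toy example the second join skips the build phase entirely, which is the kind of saving the abstract attributes to reusing internal hash tables rather than materialized results.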

Then, we study novel query processing techniques for main-memory DBMSs with heterogeneous compute environments. These emerging processor designs host compute units with varying performance, functionality, and execution characteristics, and thus create new challenges for efficient query processing. To target these systems, we propose SiliconDB, a new query processing approach that uses a fine-grained, adaptive workload execution model. SiliconDB splits queries into small work units and uses queuing theory to dynamically assign them to available compute units to maximize overall resource utilization.
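
The fine-grained execution model can be pictured with a small sketch. It is not SiliconDB's scheduler: in place of a queuing-theoretic model it uses a simple greedy rule that sends each work unit to the compute unit with the earliest estimated finish time. The function names and the per-unit speeds (rows per time unit) are assumptions made for illustration.

```python
def split_into_work_units(num_rows, chunk_size):
    """Split a scan over num_rows rows into work units of at most chunk_size rows."""
    return [min(chunk_size, num_rows - start) for start in range(0, num_rows, chunk_size)]


def assign_work_units(work_units, unit_speeds):
    """Greedily assign each work unit to the compute unit that would finish it earliest.

    unit_speeds maps a compute unit name to an assumed throughput in rows per time unit.
    Returns the per-unit assignment and the estimated busy time of each compute unit.
    """
    backlog = {name: 0.0 for name in unit_speeds}
    assignment = {name: [] for name in unit_speeds}
    for rows in work_units:
        # Estimated finish time = work already queued + time for this work unit.
        best = min(unit_speeds, key=lambda name: backlog[name] + rows / unit_speeds[name])
        backlog[best] += rows / unit_speeds[best]
        assignment[best].append(rows)
    return assignment, backlog


work_units = split_into_work_units(12, 3)
assignment, backlog = assign_work_units(work_units, {"cpu": 1.0, "accel": 3.0})
```

Because the work units are small, the faster (hypothetical) accelerator absorbs more of them while the slower unit still contributes, which keeps both busy instead of leaving one idle.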

As the final component of our work, we propose to extend SiliconDB’s relational query processing model and optimization techniques to modern data science pipelines, including machine learning. Our goal is to make hardware acceleration accessible to typical data scientists using commodity hardware.

Host: Professor Ugur Cetintemel