Thesis Defense


"Application-Aware Cluster Resource Management"

Jeff Rasley

Friday, December 14, 2018, at 1:30 P.M.

Room 368 (CIT 3rd Floor)

The adoption of cloud computing has pushed many big-data analytics frameworks to run as multi-tenant services where users submit jobs and a cluster resource manager is responsible for provisioning resources and executing jobs. To support a wide range of data processing models, designers of modern cluster resource management frameworks advocate for a strict decoupling of resource management infrastructure (e.g., Apache YARN) from specific data processing models (e.g., Hadoop, TensorFlow, Spark).

This dissertation introduces application-aware resource management, which advocates for judiciously pushing semantics from the application into the cluster manager so that it can make more informed scheduling decisions. We show that application-aware cluster management can lead to significant gains in terms of decreased job completion times, increased cluster utilization, and improved job utility. We present prototypes that exemplify the benefits of our approach in systems we call Yaq and HyperDrive.

We identify inefficiencies in existing cluster management framework designs that lead to poor utilization and degraded user experience in the presence of heterogeneous workloads. We present Yaq, a cluster manager that prioritizes task execution based on application-level metrics such as task durations or the total amount of work remaining in a job. We show that Yaq can significantly improve both cluster utilization and workload completion times.

We identify several difficulties and inefficiencies in the building and training of machine-learning models in the context of shared clusters. We present HyperDrive, a cluster manager that classifies and prioritizes jobs based on user-defined utility (e.g., model accuracy). We implement a scheduling algorithm in HyperDrive that uses probabilistic model-based classification with dynamic scheduling and early termination to improve model performance and decrease workload completion times.

Host: Professor Rodrigo Fonseca