Sysread Guest Lecture Fall 2017
"The Cloudera Cluster Validation System: Checking for hundreds of thousands of known issues per day"
Dr. Ariel Rabkin, Cloudera
Friday, October 13, 2017 at 12:00 Noon
Room 368 (CIT 3rd Floor)
Cloudera is a leading provider of big-data processing software, particularly in the Hadoop ecosystem. High-performance distributed systems are notoriously difficult to set up, manage, and troubleshoot, so we have invested heavily in tools to automate this work. In particular, we check all incoming diagnostic data bundles from customers against a library of hundreds of known problems.
This talk will describe the how and why of that system. Our programming model is simple enough that support staff can write the checks, expressive enough for hundreds of checks, scalable enough to grow with the business, and efficient enough to leave running constantly. We will also explain how we exploit a limited form of execution tracing to automate regression testing for these checks. Finally, the talk will discuss the way the system fits into our business processes, and how the two have co-evolved.
Ariel Rabkin is a software engineer at Cloudera, a company selling software and services for big-data processing using Hadoop and related technologies. Ari conducts research and builds tools to improve the efficiency of the support organization. Before joining Cloudera, he was a postdoctoral researcher at Princeton, working with Mike Freedman on wide-area stream processing. He received his PhD from UC Berkeley in 2012, working in the AMPLab, advised by Randy Katz.
Ari believes that software engineering is as much a social science as a technology problem, and is acutely interested in adapting software-development technology to the preferred work patterns of developers and other users.
(If you want to meet with Ari after the talk, send a message to email@example.com)
Host: Professor Rodrigo Fonseca