New England Database Symposium

New England Database Society

Friday, March 30, 2012

sponsored by Netezza Corporation

NEDS

CrowdDB: Query Processing with People and Machines
Michael Franklin
University of California, Berkeley

Friday, March 30, 2012, 4:00 pm
HP/Vertica Computer Science Lounge (Volen 104), Brandeis University

(preceded by a wine and cheese reception at 3:00 pm, and followed by dinner at 6:00 pm)

Abstract:

The challenge of "Big Data" analytics is more than simply one of data size. Rather, as the scope of data analysis widens, issues such as diversity, ambiguity, and incompleteness in both queries and the underlying data increasingly complicate query processing. While advances in scalable data processing are helping to address the data size problem, there remain many important data-centric tasks where humans are more proficient than current state-of-the-art algorithms. Crowdsourcing has emerged as a major problem-solving and data-gathering paradigm, providing the ability to leverage human intelligence and activity at large scale. Emerging popular crowdsourcing platforms have programmatic interfaces (APIs), which provide the opportunity to create hybrid human/computer systems for data-intensive applications.

The CrowdDB project is an ongoing effort to better understand the development of such hybrid computation systems. Built in conjunction with colleagues at ETH Zurich, CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. While CrowdDB leverages many aspects of traditional database systems, there are also important differences. From an implementation perspective, human-oriented query operators are needed to solicit, integrate and cleanse crowdsourced data. Furthermore, query performance and cost depend on a number of new factors including worker affinity, training, fatigue, motivation and location, making query optimization a major challenge. From a conceptual perspective, a major change is that the traditional closed-world assumption on which database query processing has always been based no longer holds in such a system. Thus, there is a need to rethink the meaning of queries and query results in a hybrid human/computer database system.

CrowdDB was developed as part of a new effort at Berkeley called the AMPLab, where AMP stands for "Algorithms, Machines and People". AMPLab is a collaboration of machine learning, systems, database, and networking researchers. AMPLab envisions a world where massive data, cloud computing, communication and people resources can be continually, flexibly and dynamically brought to bear on a range of hard data-intensive problems. We are developing a new data analytics stack that implements this vision, and we work with application partners in data-rich areas such as participatory sensing, urban planning, cancer genomics, and network security to evaluate and validate our technologies. AMPLab's research is supported in part by 18 leading technology companies, including founding sponsors Google and SAP. In this talk, I will give an overview of the broader AMPLab research agenda, and then focus on our initial results in one part of that agenda: developing hybrid human/machine query processing systems.

Speaker's Bio:

Michael Franklin is a Professor of Computer Science at UC Berkeley, focusing on new approaches for data management and data analysis. His recent research projects have included work on data stream processing and continuous analytics, scalable query processing, large-scale sensing environments, data integration, and hybrid human/computer data processing systems. At Berkeley he directs the Algorithms, Machines and People Laboratory (AMPLab), a cross-disciplinary collaboration taking a new approach to the data analytics problem. He is founder and CTO of Truviso, Inc., a real-time data analytics company that enables customers to quickly make sense of diverse, high-speed, continuous streams of information. He is a Fellow of the Association for Computing Machinery, and a recipient of the National Science Foundation CAREER award, the ACM SIGMOD "Test of Time" award, and the 2011 Outstanding Advisor Award from the Computer Science Graduate Student Association at Berkeley. He is currently serving as a committee member on the US National Academy of Sciences study on Analysis of Massive Data. He received his Ph.D. from the University of Wisconsin in 1993.