New England Database Society sponsored by Netezza Corporation
NEDS |
CrowdDB: Query Processing with People and Machines
Michael Franklin
University of California, Berkeley
Friday, March 30, 2012, 4PM
HP/Vertica Computer Science Lounge (Volen 104),
Brandeis University
(preceded by a wine and cheese reception at 3:00 pm, and followed by dinner at 6:00 pm)
Abstract:
The challenge of "Big Data"
analytics is more than simply one of data size. Rather, as
the scope of data analysis widens, issues such as diversity, ambiguity,
and incompleteness in both queries and the underlying data increasingly
complicate query processing. While advances in scalable
data processing are helping to address the data size problem, there
remain many important data-centric tasks where humans are more
proficient than current state-of-the-art algorithms.
Crowdsourcing has emerged as a major problem-solving and data-gathering
paradigm, providing the ability to leverage human intelligence and
activity at large scale. Emerging popular crowdsourcing platforms
have programmatic interfaces (APIs), which provide the opportunity to
create hybrid human/computer systems for data-intensive applications.
The CrowdDB project is an ongoing effort to better understand the
development of such hybrid computation systems. Built in
conjunction with colleagues at ETH Zurich, CrowdDB uses human input
via crowdsourcing to process queries that neither database
systems nor search engines can adequately answer. While CrowdDB
leverages many aspects of traditional database systems, there are also
important differences. From an implementation perspective,
human-oriented query operators are needed to solicit, integrate and
cleanse crowdsourced data. Furthermore, query performance and cost
depend on a number of new factors including worker affinity, training,
fatigue, motivation and location, making query optimization a major
challenge. From a conceptual perspective, a major change is
that the traditional closed-world assumption on which database query
processing has always been based no longer holds in such a
system. Thus, there is a need to rethink the meaning of
queries and query results in a hybrid human/computer database system.
CrowdDB was developed as part of a new effort at Berkeley called the
AMPLab, where AMP stands for "Algorithms, Machines and
People". AMPLab is a collaboration of machine learning,
systems, database, and networking researchers. AMPLab envisions a
world where massive data, cloud computing, communication and people
resources can be continually, flexibly and dynamically brought to
bear on a range of hard data-intensive problems. We are
developing a new data analytics stack that implements this vision and
we work with application partners in data-rich areas such as
participatory sensing, urban planning, cancer genomics, and network
security to evaluate and validate our technologies.
AMPLab's research is supported in part by 18 leading technology
companies, including founding sponsors Google and SAP. In this
talk, I will give an overview of the broader AMPLab research agenda,
and then focus on our initial results in one part of that agenda:
developing hybrid human/machine query processing systems.
Michael Franklin is a
Professor of Computer Science at UC Berkeley, focusing on new
approaches for data management and data analysis. His recent
research projects have included work on data stream processing and
continuous analytics, scalable query processing, large-scale sensing
environments, data integration, and hybrid human/computer data
processing systems. At Berkeley he directs the Algorithms,
Machines and People Laboratory (AMPLab), a cross-disciplinary
collaboration taking a new approach to the data analytics
problem. He is founder and CTO of Truviso, Inc., a real-time
data analytics company that enables customers to quickly make sense of
diverse, high-speed, continuous streams of information. He is a Fellow
of the Association for Computing Machinery, and a recipient of the
National Science Foundation CAREER award, the ACM SIGMOD "Test of Time"
award, and the 2011 Outstanding Advisor Award from the Computer
Science Graduate Student Association at Berkeley. He is currently
serving as a committee member on the US National Academy of Sciences
study on Analysis of Massive Data. He received his Ph.D.
from the University of Wisconsin in 1993.
Maintained by Olga Papaemmanouil olga AT cs.brandeis.edu