Systems For Interactive Data Exploration

News

[2016-10-26] Slides from today's lecture on how to set up Vizdom/IDEA can be found here

[2016-10-21] Group sign-up form online

[2016-09-08] We still have 15 open slots. Sign up now.

[2016-09-08] Paper preference form is online

As part of the class, you have to present 1-2 papers. Please select your favorite papers with this form.

[2016-09-01] Sign-up form is online

The class is by override code only. To receive an override code, you have to sign up here. Our plan is to limit the course to 30 students. If we receive more than 30 applicants, we will select applicants based on the short summary they provide (see application form). If we receive significantly more applicants, we might in addition ask for a small programming assignment, which we will hand out in the second week.

[2016-08-31] Syllabus online

The syllabus for the class can now be found here.

About

Enabling visual data exploration at "human speed" is key to democratizing data science and maximizing human productivity. Unfortunately, traditional data management systems like PostgreSQL, Microsoft SQL Server, or more recent analytical frameworks, like Hadoop, Spark, and many others, are ill-suited for that purpose. Historically, these systems assume (1) long data loading times (e.g., for index construction), (2) text-based input and output (e.g., a SQL terminal), and (3) a one-shot (i.e., stateless) batch querying paradigm, therefore making them an exceptionally bad fit for interactive data exploration.

Therefore a new class of systems is needed that better supports the requirements of interactive data exploration. At the same time, the expectation that a new system supporting both more classical batch-oriented and interactive analytical workloads will replace the existing data management stacks is simply unrealistic. Instead, a system designed specifically for interactive data exploration must integrate and work seamlessly with existing data infrastructures (e.g., data warehouses, distributed file systems, analytics platforms).

The goal of this research seminar is threefold:

Learn about the State-of-The-Art in Data Exploration Systems

Throughout the first part of the semester, we will study recently proposed techniques to make data exploration over large data sets more interactive. We will investigate how database tools and systems can optimize user productivity and experience, focusing on the following questions:
How should next-generation data systems be designed to be more interactive and user-friendly? How can these systems better support analysis and exploration of large data sets and decision making? This entails topics ranging from in-memory data management systems, visual and interactive interfaces, and novel data exploration and mining techniques to online/approximate and progressive query processing.
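
As a rough illustration of what online/approximate and progressive query processing means, the Python sketch below refines an estimate of an average while rows are still arriving and reports an approximate 95% confidence interval after each batch. It is a minimal sketch, not code from Vizdom/IDEA or from any of the papers below: the name progressive_mean, the batch_size parameter, and the normal-approximation interval are assumptions made for illustration, and the sketch presumes that rows arrive in random order.

```python
import math
import random


def progressive_mean(values, batch_size=1000, z=1.96):
    """Yield (rows_seen, running_mean, half_width) after every batch.

    Assumption: `values` arrive in random order, so each prefix behaves
    like a uniform random sample of the full data set.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % batch_size == 0:
            variance = m2 / (n - 1) if n > 1 else 0.0
            # Normal-approximation interval: mean +/- z * standard error.
            yield n, mean, z * math.sqrt(variance / n)


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(50.0, 10.0) for _ in range(5000)]
    for rows, estimate, half_width in progressive_mean(data, batch_size=1000):
        print(f"{rows:5d} rows: {estimate:.2f} +/- {half_width:.2f}")
```

A user interface could redraw a bar with its error margin after every yielded tuple instead of waiting for the full scan to finish.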

Make a (Small) Research Contribution

In the second part of the research seminar, we will build a first prototype of a novel interactive data exploration tool called A-ware. We have already started to develop a basic framework based on our current Vizdom system (see https://vimeo.com/139165014 for a video demonstration) to enable students to work in parallel on a single system. After an initial introduction to the system, small student groups will pick a research challenge to be integrated into the system. These tasks are strongly research-focused and all have the potential to lead to a publication. Potential topics include: (1) developing statistical tests that can be computed efficiently and progressively; (2) new sampling strategies and result caching for interactive data exploration; (3) progressive ANOVA analysis; (4) developing efficient interactive methods for working with time series data; (5) mechanisms to efficiently retrieve a random sample from a database; (6) new types of visualization to convey different types of errors to the user; (7) making existing machine learning libraries usable with progressive sampling techniques.
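
To make topic (5) more concrete, the following Python sketch shows reservoir sampling, a standard textbook way to draw a uniform random sample of k rows in a single pass without knowing the number of rows in advance. It is only a baseline for orientation, not the mechanism used in Vizdom/IDEA; the function name reservoir_sample and its parameters are hypothetical.

```python
import random


def reservoir_sample(rows, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Classic reservoir sampling: keep the first k rows, then replace a
    random slot with decreasing probability as more rows arrive.
    """
    rng = rng or random.Random()
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            # randint is inclusive on both ends, so j is uniform in [0, i].
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


if __name__ == "__main__":
    # Hypothetical usage: sample 5 row ids from a "table" of 1,000 rows.
    print(reservoir_sample(range(1000), 5, random.Random(7)))
```

The same loop works over any row iterator, such as a database cursor or a CSV reader, but it still scans every row once, so reaching a random sample more cheaply than a full scan remains the open part of the challenge.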

Learn About Systems Research

The final goal of the research seminar is for students to gain an understanding of systems research. As part of the paper discussions in the first part, students learn how to identify the research contributions and flaws of a paper and get a sense of how to write a research paper. The ambitious goal of this last part is that every student group writes a short research paper and that the best projects continue into the next semester as an independent study, turning the work into a full paper for a research conference or workshop.



Location: CIT 368

Time: Wednesdays 3:00-5:20

Prerequisites: One of CSCI 0320, CSCI 0330; and one of CSCI 1270, CSCI 1951-A, CSCI 1670

Schedule

Below is the tentative schedule (subject to change!!!):

  • 9/7

  • 9/14

    Overview of Interactive Data Exploration

    We will give an overview of current Interactive Data Exploration Systems and discuss the criteria for a good research paper and how to write a good review.

  • 9/21

    Paper presentation + discussion I

    Student paper presentation and mock program committee discussion

  • 9/28

    Paper presentation + discussion II

    Student paper presentation and mock program committee discussion

  • 10/5

    Paper presentation + discussion III

    Student paper presentation and mock program committee discussion

  • 10/12

    Paper presentation + discussion IV

    Student paper presentation and mock program committee discussion

  • 10/19

    Paper presentation + discussion V

    Student paper presentation and mock program committee discussion

  • 10/26

    We will discuss the current Vizdom/IDEA code base and, in the spirit of a boot camp, implement a small operator and visualization.

  • 11/2

    1-on-1 Meetings

    We have individual project discussions

  • 11/9

    1-on-1 Meetings

    We have individual project discussions

  • 11/16

    We will discuss general principles for how to define contributions, how to write research papers, how to design system experiments, etc.

  • 11/30

    Short Progress Demo

    The students will present their project progress

  • 12/07

    1-on-1 Meetings

    We have individual project discussions

  • 12/14

    Final Presentations and Project Hand-in

    You have to present and hand in your short research paper and project

Papers

A list of potential papers we will discuss

Fernando Chirigati, Harish Doraiswamy, Theodoros Damoulas, Juliana Freire:
Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets.
SIGMOD Conference 2016: 1011-1025

Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya G. Parameswaran, Neoklis Polyzotis:
SEEDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics.
PVLDB 8(13): 2182-2193 (2015)

Uwe Jugel, Zbigniew Jerzak, Gregor Hackenbroich, Volker Markl:
M4: A Visualization-Oriented Time Series Data Aggregation.
PVLDB 7(10): 797-808 (2014)

Roee Ebenstein, Niranjan Kamat, Arnab Nandi:
FluxQuery: An Execution Framework for Highly Interactive Query Workloads.
SIGMOD Conference 2016: 1333-1345

Niranjan Kamat, Prasanth Jayachandran, Karthik Tunga, Arnab Nandi:
Distributed and interactive cube exploration.
ICDE 2014: 472-483

Max Grazier G'Sell, Stefan Wager, Alexandra Chouldechova, Robert Tibshirani:
Sequential selection procedures and false discovery rate control
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(2), 2016.

Evan R. Sparks, Ameet Talwalkar, Daniel Haas, Michael J. Franklin, Michael I. Jordan, Tim Kraska:
Automating model search for large scale machine learning.
SoCC 2015: 368-380

Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang:
Online Aggregation.
SIGMOD Conference 1997: 171-182

Peter J. Haas, Joseph M. Hellerstein:
Ripple Joins for Online Aggregation.
SIGMOD Conference 1999: 287-298

Albert Kim, Eric Blais, Aditya G. Parameswaran, Piotr Indyk, Samuel Madden, Ronitt Rubinfeld:
Rapid Sampling for Visualizations with Ordering Guarantees.
PVLDB 8(5): 521-532 (2015)

Leilani Battle, Remco Chang, Michael Stonebraker:
Dynamic Prefetching of Data Tiles for Interactive Visualization.
SIGMOD Conference 2016: 1363-1375

Manas Joglekar, Hector Garcia-Molina, Aditya G. Parameswaran:
Interactive data exploration with smart drill-down.
ICDE 2016: 906-917

Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, Aditya G. Parameswaran:
zenvisage: Effortless Visual Data Exploration.
CoRR abs/1604.03583 (2016)

Arvind Satyanarayan, Ryan Russell, Jane Hoffswell, Jeffrey Heer:
Reactive Vega: A Streaming Dataflow Architecture for Declarative Interactive Visualization.
IEEE Trans. Vis. Comput. Graph. 22(1): 659-668 (2016)

Thibault Sellam, Martin L. Kersten:
Cluster-Driven Navigation of the Query Space.
IEEE Trans. Knowl. Data Eng. 28(5): 1118-1131 (2016)

Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas:
Indexing for interactive exploration of big data series.
SIGMOD Conference 2014: 1555-1566

Feifei Li, Bin Wu, Ke Yi, Zhuoyue Zhao:
Wander Join: Online Aggregation via Random Walks.
SIGMOD Conference 2016: 615-629

Projects

Please sign up using the group sign-up form
  • Native ML: SVM with library
  • Native ML: Decision Trees + Random Forest
  • Native ML: Matrix Completion? ALS?
  • Native ML: PCA? Making it usable might be hard
  • Native ML: Feature engineering: Scaling, Normalization, bucketing, ....
  • Native ML: Clustering
  • Native ML: Integrate automatic hyper-parameter tuning (TuPAQ paper): bandit strategy/random search (first), afterwards more complicated
  • Data Ingest: Random stream over CSV
  • Data Ingest: ODBC with views
  • Data Ingest: HDFS
  • Data Ingest: Transform operations (e.g., take [Name, Movie] transform to [Name, Forrest Gump, Martian,...])
  • External ML: Progressive inclusion of Python ML
  • External ML: Progressive inclusion of R
  • External ML: Progressive inclusion of Spark
  • Data Cleaning Operators (drop null, regular expression, substitution (mean, correlation, etc.),...)
  • Data Cleaning Scripting (have a standard way to add some custom code)
  • Data Cleaning: Automatically find errors
  • Time-series: Re-sampling technique + visualization
  • Time-series: use cases
  • Time-series: Stats (ARIMA)
  • Time-series: SVR
  • Time-series: CNNs/DNNs (TensorFlow)
  • Time-series: filter based on timestamps
  • Geo-data: visualization and support
  • Infrastructure: Distributed Storage
  • Infrastructure: Distributed Engine
  • Infrastructure: Rare-Item Index
  • Infrastructure: Resource management
  • Other: Implement Interactive Data Exploration Benchmark

Faculty

Carsten Binnig

Office hours Fridays 4-5pm or on request

Tim Kraska

Office hours on request