Lectures

Lecture Description Readings Data and/or Code

Section 10a

HW3 part 1 review

Section 10b

HW3 part 2 review

Section 10c

Using ggmap, with Minard's celebrated visualization of Napoleon's 1812 march on Russia as the example.

napoleon.Rmd

Lecture 26: Simulating the Electoral College

Lecture 25: Social Network Analysis

The Small-World Effect is a Modern Phenomenon

RTweet Demo

Julia, one of our TAs, put together a neat little demo on how to access and analyze Twitter data in R with rtweet.

rtweet_demo_recording.mp4

rtweet_demo.R

Lecture 24: Text Analysis

The basics of text analysis, including an example Sentiment Analysis of tweets.

Text Mining the Democratic Debates

Lecture 23: Clustering

The k-means algorithm, illustrated with a middle school dance example, and types of clustering.

Visualizing k-Means Clustering

Similarity Metrics

hierarchical.R

cluster_viz.R

Lecture 21: Naive Bayes

Naive Bayes: A simple probabilistic generative classifier.

Naive Bayes: Theory and Example in R

Lecture 20a: Maximum Likelihood Estimation

Finding parameters that maximize the likelihood of the data.

Lecture 20b: Bayes' Rule

Bayes' Rule, with examples, like the Monty Hall Problem!

Play the Monty Hall Problem

Monty Hall, Explained
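
The Monty Hall result can be checked by brute force. The sketch below is a minimal Python illustration, not part of the course materials (the course scripts are in R); the function name `win_probability` is made up for this example. It enumerates every combination of car location and first pick, assuming the host always opens a goat door that the contestant did not choose.

```python
from fractions import Fraction
from itertools import product

DOORS = (0, 1, 2)

def win_probability(switch: bool) -> Fraction:
    """Exact chance of winning the car over all (car, first pick) pairs."""
    wins = 0
    for car, pick in product(DOORS, repeat=2):
        # The host opens a door that hides a goat and is not the pick.
        # When two such doors exist, the choice does not affect the outcome.
        opened = next(d for d in DOORS if d != pick and d != car)
        if switch:
            # Switching means taking the one door that is neither picked nor opened.
            final = next(d for d in DOORS if d != pick and d != opened)
        else:
            final = pick
        wins += final == car
    return Fraction(wins, len(DOORS) ** 2)

print("stay:  ", win_probability(switch=False))
print("switch:", win_probability(switch=True))
```

Staying wins in 1/3 of the cases and switching in 2/3, which is what Bayes' Rule predicts.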

Section 9

Review of Homework 2.

Lecture 19: Model Selection

The bias-variance tradeoff, linear regression with regularizers, and a bit about variable selection.

Curse of Dimensionality

Lecture 18a: Decision Trees

Classification via decision trees.

Decision Trees, Summarized

Lecture 17: k Nearest Neighbors

Classification via the k Nearest Neighbors algorithm.

k Nearest Neighbors, Summarized

Bias vs. Variance in kNN

iris.R

Lecture 16c: Properties of Estimators

Three desiderata of estimators: consistency, unbiasedness, and efficiency.

Categorical Variables

birthWeight.R

birthWeight.tsv

Lecture 16b: Regression in Practice

How to gauge the goodness of fit of a linear model, and how to improve a model that fits poorly.

Why Economists Love the log-log Transformation

ggplotOutliers.R

xSquared.R

brains.R

brains.csv

Lecture 16a: Simple Linear Regression

Using least squares to compute the line of best fit, and where regression got its name.

Simple Linear Regression

Regression Fallacy

Beauty in the Classroom

iceCream.R

iceCream.csv

pearson.R

pearson.csv
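
As a rough illustration of the least squares idea, the Python sketch below computes the slope and intercept of the line of best fit directly from the textbook formulas (slope = sum of cross-deviations divided by sum of squared x-deviations). It is not taken from the course scripts, which are in R; the function name `fit_line` and the sample numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```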

Lecture 15: Introduction to Machine Learning

An overview of machine learning, including regression, classification, and clustering.

Incredible Examples of AI in the Wild

Guest Speaker Kate Miller: Data Visualization

Simple Rules for Better Graphs

Lecture 14c: Hypothesis Testing

Hypothesis Testing

Lecture 14b: Confidence Intervals

Interval Estimation

cholera.R

Lecture 14a: Introduction to Statistical Inference

We use descriptive statistics to summarize observed data. We use inferential statistics to draw conclusions about unobserved data from observed data.

Lecture 13a: The Normal Distribution

The normal distribution, with special guest the standard normal, via the z-transform.

Clearly Explained: The Normal Distribution

Z-Transform

If I Didn't Have You
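
The z-transform rescales a normal variable to the standard normal via z = (x − mean) / sd. A minimal Python sketch, using the standard library's `statistics.NormalDist`; the helper name `z_score` and the numbers (mean 100, standard deviation 15) are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / sd

z = z_score(130, mean=100, sd=15)

# The probability of falling below x is the same whether it is computed
# on the original scale or on the standard normal scale.
p_original = NormalDist(100, 15).cdf(130)
p_standard = NormalDist().cdf(z)
print(z, p_original, p_standard)
```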

Lecture 13b: Probability Distributions and CLT

The normal distribution approximates the binomial, and it ain't an accident!

The Mighty Central Limit Theorem

Lecture 12b: Random Variables

Random Variables, their expectation, and their variance.

Seeing Theory: A Visual Introduction to Probability and Statistics

More Visual Explanations of Probability & Statistics

Lecture 12a: Introduction to Probability

Introduction to Probability.

What is Probability?

Kolmogorov's Axioms

Section 6

A data cleaning exercise using TA birthdays.

Data Cleaning: Birthdays

birthdays.Rmd

birthdays.csv

Section 5

Review of Homework 1 (including a discussion of Hilary as the most poisoned name in U.S. history).

Hilary Parker Blog Post

Lecture 10b: Tidy Data

An introduction to tidy data and tidyr.

Tidy Data

iris_airQuality.R

Lecture 10a: Data Cleaning

Introduction to data cleaning with stringr and lubridate.

Janitorial Work

clean.R

Homework 0

Review of Homework 0, and the inherent untrustworthiness of rankings!

College_Rankings.ods

Section 3

In-class activity: Electricity Consumption

electricity.Rmd

electricity_consumption.csv

Programming Basics: Iteration (For and While Loops)

Introduction to the concept of iteration in programming, including for and while loops. More about R's data structures: vectors, matrices, arrays, lists, etc.

Lecture 8a: Programming Basics: Functions and Conditionals

Diving into programming fundamentals with functions and conditionals.

Programming in R

practice.R

Lecture 7a: Measures of Dispersion

A discussion of spread as it pertains to data, including defining variance and standard deviation. A follow-on discussion of quartiles, interquartile range (IQR), and the IQR rule of thumb for identifying outliers. And a study that purports to show that pets relieve stress.

OnlineStatBook, again

A study of friends vs. pets as stress reducers in women.

Variance.ods
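
The IQR rule of thumb flags any value more than 1.5 × IQR below the first quartile or above the third. The Python sketch below is one way to apply it; it is not from the course files. The function name `iqr_outliers` and the data are hypothetical, and the "inclusive" quartile convention is an assumption (other conventions can give slightly different fences).

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    # quantiles() sorts the data itself; n=4 yields Q1, the median, and Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Made-up sample with one suspiciously large value.
print(iqr_outliers([2, 4, 4, 5, 6, 7, 8, 9, 30]))
```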

Lecture 7b: Covariance and Correlation

Bivariate data, and metrics to measure how data covary and/or correlate with one another. An analysis of ice cream sales as a function of temperature (yes, sales increase as the temperature increases!). Finally, some guidance on how to (and not to) interpret correlation.

Eat Chocolate, Win the Nobel Prize?

Chocolate Consumption and Nobel Prizes: A Bizarre Link

Covariance.ods

cov_corr.R

caution.R

Lecture 6a: Probability Distributions

Introduction to probability distributions.

M&M Color Distribution

crew.R

crew.csv

Lecture 6b: Histograms

Histograms are used to visualize univariate data.

Histograms.

movies.R

movies.csv

Section 2: Introduction to plotting in R

Introduction to plotting in R, with special guest: ggplot!

Graphs

ggplot2 Cheatsheet

Visualization Tips, and Getting Started with ggplot

Section-2-handout.Rmd

Lecture 9a: Exploratory Data Analysis

Exploratory Data Analysis: how to conduct an initial investigation of data. Anscombe's quartet: on data visualization vs. descriptive statistics. Finally, an exploration of data gathered from a draft procedure used during the Vietnam War.

Datasaurus Dozen

Getting Down to Data

tips.R

tips.csv

vietnam.R

vietnam.csv

Lecture 9b: EDA Again

Another EDA example on air pollution.

Air Pollution

Lecture 5b: Introduction to dplyr

Introduction to the dplyr library.

dplyr Tutorial

Another dplyr Tutorial

Data Wrangling Cheat Sheet

intro_dplyr.R

responses.csv

responses.txt

Lecture 5a: Introduction to R

Introduction to R.

Tutorial: Data Camp Introduction to R

R Style Guide

intro_R.R

Section 1: Data Exploration using Spreadsheets

Explore international development data in spreadsheets, using pivot tables to group data.

InternationalDevelopment.ods

Lecture 4c: The Join Operation in spreadsheets, VLOOKUP

Merging data in spreadsheets.

DogWeights.csv

DogWeights.ods

Lecture 4b: Pivot Tables in Spreadsheets

Grouping, and then aggregating, data in spreadsheets.

Obama.ods

SalesForce.ods

Lecture 4a: Sort and Filter in Spreadsheets

Sorting and filtering data in spreadsheets.

SeattleWagesByGender.csv

GameOfThrones.csv

Lecture 3c: Measures of Central Tendency

The mean, the median, and the mode. (Including speculations about why both W and Bernie used the mean rather than the median to make claims about tax cuts and donations, respectively.)
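
Why the choice between mean and median matters is easiest to see with skewed data. In the Python sketch below the amounts are invented for illustration (they are not the figures from the lecture): a single very large value pulls the mean far above what a typical observation looks like, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical amounts: four modest values and one very large one.
amounts = [30, 35, 40, 45, 1000]

print("mean:  ", mean(amounts))    # dominated by the single large value
print("median:", median(amounts))  # the middle value of the sorted data
```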

Lecture 3b: Descriptive Statistics

The benefits and risks of descriptive statistics.

Five Measures of Growth that are Better than GDP

Lecture 3a: Qualitative vs. Quantitative Data

An overview of basic data types.

Lecture 2a: Databases vs. Spreadsheets

Databases are structured stores of organized data. Database software makes it easy to organize, analyze, and visualize information.

Attendance.ods

Lecture 2b: Introduction to Spreadsheets

Spreadsheets provide quick methods for summarizing (mostly numerical) data, and for rudimentary visualizations.

IvyLeagueMoney.csv

IvyLeagueMoney.ods

Lecture 1b: Three Modern Case Studies

Three modern case studies highlight the breadth of applications of data science, from sports to politics. Netflix's collaborative filtering algorithm, which is used to predict user ratings of films, is the third.

The Netflix Prize

Lecture 1a: Two Historical Case Studies

An anesthesiologist named John Snow (John, not Jon) mapped cholera cases to trace an outbreak to its source, making him a pioneer of data visualization. Florence Nightingale is remembered today as a nurse, but she also saved many lives through her skills in statistics and data visualization.

John Snow and the Broad Street Pump

Barnyard Dust Offers a Clue to Stopping Asthma in Children

Lecture 0: Introduction to Data Fluency

A brief introduction to data science, including our favorite visualizations, as well as an overview of the many applications of data science and its exploratory, explanatory, and predictive goals.

Fernanda Viégas and Martin Wattenberg on new ways for people to talk and think about data

The Beauty of Data Visualization

Talithia Williams on knowing your body's data