CS100: Data Fluency for All

Lectures

Lecture	Description	Readings	Data and/or Code
Section 10a	HW3 part 1 review
Section 10b	HW3 part 2 review
Section 10c	Using ggmap, with the example of Minard's excellent visualization of Napoleon's 1812 march against Russia.		napoleon.Rmd
Lecture 26: Simulating the Electoral College
Lecture 25: Social Network Analysis		The Small-World Effect is a Modern Phenomenon
RTweet Demo	Julia, one of our TAs, put together a neat little demo on how to access and use Twitter data in R with rtweet.		rtweet_demo_recording.mp4 rtweet_demo.R
Lecture 24: Text Analysis	The basics of text analysis, including an example Sentiment Analysis of tweets.	Text Mining the Democratic Debates
Lecture 23: Clustering	k-means middle school dance example and types of clustering.	Visualizing k-Means Clustering Similarity Metrics	hierarchical.R cluster_viz.R
Lecture 21: Naive Bayes	Naive Bayes: A simple probabilistic generative classifier.	Naive Bayes: Theory and Example in R
Lecture 20a: Maximum Likelihood Estimation	Finding parameters that maximize the likelihood of the data.
Lecture 20b: Bayes' Rule	Bayes' Rule, with examples, like the Monty Hall Problem!	Play the Monty Hall Problem Monty Hall, Explained
Section 9	Review of Homework 2.
Lecture 19: Model Selection	The bias-variance tradeoff, linear regression with regularizers, and a bit about variable selection.	Curse of Dimensionality
Lecture 18a: Decision Trees	Classification via decision trees.	Decision Trees, Summarized
Lecture 17: k Nearest Neighbors	Classification via the k Nearest Neighbors algorithm.	k Nearest Neighbors, Summarized Bias vs. Variance in kNN	iris.R
Lecture 16c: Properties of Estimators	Three desiderata of estimators: consistency, unbiasedness, and efficiency.	Categorical Variables	birthWeight.R birthWeight.tsv
Lecture 16b: Regression in Practice	How to gauge the goodness of fit of a linear model, and how to improve a model that fits poorly.	Why Economists Love the log-log Transformation	ggplotOutliers.R xSquared.R brains.R brains.csv
Lecture 16a: Simple Linear Regression	Using least squares to compute the line of best fit, and where regression got its name.	Simple Linear Regression Regression Fallacy Beauty in the Classroom	iceCream.R iceCream.csv pearson.R pearson.csv
Lecture 15: Introduction to Machine Learning	An overview of machine learning, including regression, classification, and clustering.	Incredible Examples of AI in the Wild
Guest Speaker Kate Miller: Data Visualization	Simple Rules for Better Graphs
Lecture 14c: Hypothesis Testing		Hypothesis Testing
Lecture 14b: Confidence Intervals		Interval Estimation	cholera.R
Lecture 14a: Introduction to Statistical Inference	We use descriptive statistics to summarize observed data. We use inferential statistics to draw conclusions about unobserved data from observed data.
Lecture 13a: The Normal Distribution	The normal distribution, with special guest the standard normal, via the z-transform.	Clearly Explained: The Normal Distribution Z-Transform If I Didn't Have You
Lecture 13b: Probability Distributions and CLT	The normal distribution approximates the binomial, and it ain't an accident!	The Mighty Central Limit Theorem
Lecture 12b: Random Variables	Random Variables, their expectation, and their variance.	Seeing Theory: A Visual Introduction to Probability and Statistics More Visual Explanations of Probability & Statistics
Lecture 12a: Introduction to Probability	Introduction to Probability.	What is Probability? Kolmogorov's Axioms
Section 6	A data cleaning exercise using TA birthdays.	Data Cleaning: Birthdays	birthdays.Rmd birthdays.csv
Section 5	Review of Homework 1 (including a discussion of Hilary, as the most poisoned name in U.S. history).	Hilary Parker Blog Post
Lecture 10b: Tidy Data	An introduction to tidy data and tidyr.	Tidy Data	iris_airQuality.R
Lecture 10a: Data Cleaning	Introduction to data cleaning with stringr and lubridate.	Janitorial Work	clean.R
Homework 0	Review of Homework 0, and the inherent untrustworthiness of rankings!		College_Rankings.ods
Section 3	In-class activity: Electricity Consumption		electricity.Rmd electricity_consumption.csv
Programming Basics: Iteration (For and While Loops)	Introduction to the concept of iteration in programming, including for and while loops. More about R's data structures: vectors, matrices, arrays, lists, etc.
Lecture 8a: Programming Basics: Functions and Conditionals	Diving into programming fundamentals with functions and conditionals.	Programming in R	practice.R
Lecture 7a: Measures of Dispersion	Spread in data, including definitions of variance and standard deviation. A follow-on discussion of quartiles, the interquartile range (IQR), and the IQR rule of thumb for identifying outliers. Finally, a study that purports to show that pets relieve stress.	OnlineStatBook, again A study of friends vs. pets as stress reducers in women.	Variance.ods
Lecture 7b: Covariance and Correlation	Bivariate data, and metrics to measure how data covary and/or correlate with one another. An analysis of ice cream sales as a function of temperature (yes, sales increase as the temperature increases!). Finally, some guidance on how to (and not to) interpret correlation.	Eat Chocolate, Win the Nobel Prize? Chocolate Consumption and Nobel Prizes: A Bizarre Link	Covariance.ods cov_corr.R caution.R
Lecture 6a: Probability Distributions	Introduction to probability distributions.	M&M Color Distribution	crew.R crew.csv
Lecture 6b: Histograms	Histograms are used to visualize univariate data.	Histograms.	movies.R movies.csv
Section 2: Introduction to plotting in R	Introduction to plotting in R, with special guest: ggplot!	Graphs ggplot2 Cheatsheet Visualization Tips, and Getting Started with ggplot	Section-2-handout.Rmd
Lecture 9a: Exploratory Data Analysis	Exploratory Data Analysis: how to conduct an initial investigation of data. Anscombe's quartet: on data visualization vs. descriptive statistics. Finally, an exploration of data gathered from a draft procedure used during the Vietnam War.	Datasaurus Dozen Getting Down to Data	tips.R tips.csv vietnam.R vietnam.csv
Lecture 9b: EDA Again	Another EDA example on air pollution.	Air Pollution
Lecture 5b: Introduction to dplyr	Introduction to the dplyr library.	dplyr Tutorial Another dplyr Tutorial Data Wrangling Cheat Sheet	intro_dplyr.R responses.csv responses.txt
Lecture 5a: Introduction to R	Introduction to R.	Tutorial: Data Camp Introduction to R R Style Guide	intro_R.R
Section 1: Data Exploration using Spreadsheets	Explore international development data in spreadsheets, using pivot tables to group data.		InternationalDevelopment.ods
Lecture 4c: The Join Operation in spreadsheets, VLOOKUP	Merging data in spreadsheets.		DogWeights.csv DogWeights.ods
Lecture 4b: Pivot Tables in Spreadsheets	Grouping, and then aggregating, data in spreadsheets.		Obama.ods SalesForce.ods
Lecture 4a: Sort and Filter in Spreadsheets	Sorting and filtering data in spreadsheets.		SeattleWagesByGender.csv GameOfThrones.csv
Lecture 3c: Measures of Central Tendency	The mean, the median, and the mode. (Including speculations about why both W and Bernie used the mean rather than the median to make claims about tax cuts and donations, respectively.)
Lecture 3b: Descriptive Statistics	The benefits and risks of descriptive statistics.	Five Measures of Growth that are Better than GDP
Lecture 3a: Qualitative vs. Quantitative Data	An overview of basic data types.
Lecture 2a: Databases vs. Spreadsheets	Databases are structured stores of organized data. Database software makes it easy to organize, analyze, and visualize information.		Attendance.ods
Lecture 2b: Introduction to Spreadsheets	Spreadsheets provide quick methods for summarizing (mostly numerical) data, and for rudimentary visualizations.		IvyLeagueMoney.csv IvyLeagueMoney.ods
Lecture 1b: Three Modern Case Studies	Three modern case studies highlight the breadth of applications of data science, from sports to politics. Netflix's collaborative filtering algorithm, which is used to predict user ratings of films, is the third.	The Netflix Prize
Lecture 1a: Two Historical Case Studies	An anesthesiologist named John Snow (John, not Jon) used visualization to map out cholera. He was a pioneer in data visualization. Florence Nightingale might be remembered today as a nurse, but she also saved many lives using her skills in statistics and data visualization.	John Snow and the Broad Street Pump Barnyard Dust Offers a Clue to Stopping Asthma in Children
Lecture 0: Introduction to Data Fluency	A brief introduction to data science, including our favorite visualizations, as well as an overview of the many applications of data science and its exploratory, explanatory, and predictive goals.	Fernanda Viégas and Martin Wattenberg on new ways for people to talk and think about data The Beauty of Data Visualization Talithia Williams on knowing your body's data