Lab 4: Generating Data

Get in the mystery-solving mood!

Backstory

You’ve just arrived at Brown University to meet with one of the resident psychology professors. The professor wants to learn more about human behavior from an expert detective such as yourself.

They need your help to collect meaningful data and analyze it for them so they can publish their next paper!

Your Mission

Brainstorm. What defines you as a person? What kind of information would be good for a report on human behavior? Some types of data are better for different things. For example, line graphs are often best at representing data in the form of numbers, while pie charts may be better for presenting tallies and votes.

For their paper, the professor would like a variety of data types, from numbers to strings. After thinking for a while, the professor decides that they want to study the life of a student – namely, Brown University students (how convenient!).

A Google Form with the professor’s questions can be found here.

Go to the response spreadsheet, make a copy of it (select “Make a Copy” under the “File” tab), and import it into your program. This copy of the spreadsheet is yours to tweak manually. There are some values that Pyret will not allow you to read in, which you will have to correct by hand directly in the spreadsheet.

To import the spreadsheet, include this at the top of your program:
(Replace the ssid with that of your own copy.)

include tables
include image
include shared-gdrive("cs111-2019.arr", "1PK_J7RfR-h4cS89j12TkKGB2yIWAjDpA")
include gdrive-sheets
import data-source as ds

# spreadsheet id from Google Sheets
ssid = "<Copy of Your SSID>" # aka that weird sequence of letters and numbers in the url
data-sheet = load-spreadsheet(ssid)
student-data =
  load-table: timestamp, average-sleep, weekend-sleep, concentration, stem-humanities, classes, schoolwork-hours, extracurricular-activities, extracurricular-hours, on-campus, on-mealplan, avg-commute
    source: data-sheet.sheet-by-name("Form Responses 1", true)
 end

Preparing the Data

Now that we have the data collected, we need to make sure that the data is clean and processed before analysis.

  1. Talk about why it will be difficult to draw insights from the columns about concentration name and hours of extracurricular obligations. Be able to explain your reasoning to a TA. Create a new table (within Google Sheets) that doesn’t include these columns, and use this new table for the rest of the lab.

  2. Once you import your table of student-data, you will notice that the values in certain columns are in the format some(x) or none. In order to turn these values into the data types we all know and love (String, Num, etc.) use the sanitize functions upon import. Read the documentation here.

  3. Look at the other columns. Discuss with your partner which columns might need to be cleaned, and come up with a plan to do this. In particular, come up with functions to clean up the columns about the number of classes, the hours spent doing schoolwork, the number of extracurriculars, and the commute time.

  4. Write these functions, and create a new table with cleaned columns.


CHECKPOINT: Call over a TA once you reach this point.


Analyze the data

Now that the data has been tidied up, we need to put it together and analyze it in a way that will be meaningful for a research paper! What interesting ways can we represent this information? How should we visualize it?

  1. Filter your table so that you only have STEM concentrators (you can decide which value on the scale defines the cutoff).
  2. Filter your table to people who get 6-8 hours of sleep on an average weeknight.
    1. What is the average number of classes these people are taking?
    2. What percentage of them live on campus?
  3. Filter your table so that you only have humanities concentrators who get at least 7 hours of sleep. Once again, you can choose the cutoff.
  4. Create a scatterplot to compare hours of sleep on weeknights against minutes spent traveling to class.
  5. Create a scatterplot to compare hours spent working vs. hours spent sleeping.
  6. Create a scatterplot to compare average hours of sleep on weekends to average hours of sleep on weekdays.
  7. Create a new column that sums the number of hours spent sleeping and doing homework.
    1. Do you notice any impossible results?
    2. With your partner, discuss the pros and cons of filtering out responses that seem impossible.
  8. Create a new column that calculates the ratio of hours doing homework to hours sleeping (on weeknights).
  9. Find the average number of minutes people spend on their commute.

CHECKPOINT: Call over a TA once you reach this point.


Reporting back…

You get a message from an old client of yours and they’re wondering what is taking so long – they need your help solving a big-time jewel theft! Consult with your lab partner on how to explain what you have been up to. Telling your client that you’re “relaxing on Brown’s beautiful campus” probably isn’t going to cut it. You need to convince your client by showing them your findings. Show them something cool to persuade them that your trip is worthwhile (and that they shouldn’t look for a new detective to replace you)!

  1. We want to transpose the table (flip rows and cols) to look at the average commute times and average hours of sleep on weekdays of students on meal plan and students off meal plan. Since we don’t yet know how to transpose an entire table through code, we’ll build the table we need manually.

    1. Filter the table to compute average commute times and average hours of sleep on weekdays of students on meal plan.
    2. Do the same for students not on meal plan.
    3. Create a new table (in Pyret) named meal-plan-table with the columns on-meal-plan, avg-commute, and avg-sleep-weekdays.
    4. Fill in the table with the appropriate data.
    5. Note: there should only be two rows – one for those who are on meal plan and one for those who are not. It will look something like the following:
      Meal plan status Commute time Sleep on weekdays
      Off meal plan Average Average
      On meal plan Average Average
  2. Plot some columns that you think might be related to one another.

    1. Do you notice any cool correlations?

CHECKPOINT: Call over a TA once you reach this point.


Presentation Time!

It’s now time for the professor’s paper to be published and presented to a bunch of other smart professors! The professor has asked you to help present the findings since you were such an integral part of their studies. The other professors at the presentation are asking some really tough questions. The professor gets a little shy when it comes to answering questions and presenting – so they have asked you to do all of it.

Discuss the answers to the following questions with your partner:

  1. What preparation did you have to do before you could start analyzing?
  2. Why does messy data exist?
  3. When can you safely assume your data matches a specific format?
  4. How can we expand what we did today to work on even larger, even messier data sets?
  5. In what scenarios do people have to sort through messy data?
    1. Do you have any suggestions for them on how to clean their data?
    2. What should they watch out for?
  6. How can we prevent “messy data”? What could we have changed in the form to make the data / results nicer?

Great job!

Thanks to your excellent work, the presentation was a success and the professor is very pleased. Who knew detectives could make such good psychologists?

If you have extra time, check out these spurious correlations.