Lab 4: Generating Data

Here comes the sun doo doo doo doo…it’s all right…duhduhduhduhduhduhduh

Backstory

You’ve just landed on the fiery ball that is the sun. As you step off of your spaceship in search of water (I know – what are you thinking?), you hear strange murmurs from your left and you go over to investigate.

More aliens?!

You use your universal translators to communicate and discover that these folks are alien scientists who specialize in human behavior. For the life of them, they cannot even begin to understand the human race…

They need your help to collect meaningful data and analyze it for them so they can win the next science fair!!

Your Mission

Brainstorm. What defines you as a person? What kind of information would be good for a science fair? Some types of data are better for different things. For example, line graphs are often best at representing data in the form of numbers, while pie charts may be better for presenting tallies and votes.

For their science fair, these alien scientists would like a variety of data types, from numbers to strings. After thinking for a while, the aliens decide that they want to study the life of a student – namely, Brown University students (how convenient!).

A Google Form with their questions can be found here (yeah, who knew that Google was a thing on the sun?)

Go to the response spreadsheet, make a copy of it, and import it into your program. This copy of the spreadsheet is yours to tweak manually. There are some values that Pyret will not allow you to read in, which you will have to correct by hand directly in the spreadsheet.

To import the spreadsheet, include this at the top of your program:
(Replace the ssid with that of your own copy.)

include tables
include image
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")
include gdrive-sheets
import data-source as ds

# spreadsheet id from Google Sheets
ssid = "<Copy of Your SSID>" # aka that weird sequence of letters and numbers in the url
data-sheet = load-spreadsheet(ssid)
student-data =
  load-table: timestamp, average-sleep, weekend-sleep, concentration, stem-humanities, classes, schoolwork-hours, extracurricular-activities, extracurricular-hours, on-campus, on-mealplan, avg-commute
    source: data-sheet.sheet-by-name("Form Responses 1", true)
 end

Preparing the Data

Now that we have the data collected, we need to make sure that the data is clean and processed before analysis.

  1. Talk about why it will be difficult to draw insights from the columns about concentration name and hours of extracurricular obligations. Be able to explain your reasoning to a TA. Create a new table (within Google Sheets) that doesn’t include these columns, and use this new table for the rest of the lab.

  2. Once you import your table of student-data, you will notice that the values in certain columns are in the format some(x) or none. In order to turn these values into the data types we all know and love (String, Num, etc.) use the sanitize functions upon import. Read the documentation here.

  3. Look at the other columns. Discuss with your partner which columns might need to be cleaned, and come up with a plan to do this. In particular, come up with functions to clean up the columns about the number of classes, the hours spent doing schoolwork, the number of extracurriculars, and the commute time.

  4. Write these functions, and create a new table with cleaned columns.


CHECKPOINT: Call over a TA once you reach this point.


Analyze the data

Now that the data has been tidied up, we need to put it together and analyze it in a way that will be meaningful for a science fair! What interesting ways can we represent this information? How should we visualize it?

  1. Filter your table so that you only have STEM concentrators (you can decide what value the cutoff should be at)
  2. Filter your table to people who get 6-8 hours of sleep on an average weeknight
    1. What is the average number of classes these people are taking?
    2. What percentage of them live on campus?
  3. Filter your table so that you only have humanities concentrators who get at least 7 hours of sleep. Once again, you can choose the cutoff.
  4. Plot hours of sleep on weeknights against minutes spent traveling to class
  5. Plot hours spent working vs hours spent sleeping
  6. Compare average hours of sleep on weekends to average hours of sleep on weekdays
  7. Create a new column that sums the number of hours spent sleeping and doing homework.
    1. Do you notice any impossible results?
    2. With your partner, discuss the pros and cons of filtering out responses that seem impossible.
  8. Create a new column, finding the ratio of hours doing homework to hours sleeping (weeknights)
  9. Find the average number of minutes people spend on their commute
    1. Calculate the average number of minutes spent commuting to each class?

CHECKPOINT: Call over a TA once you reach this point.


Mission Control

You get a transmission from home and they’re wondering what is taking so long. Consult with your lab partner on how to explain what you have been up to. Telling Mission Control that “aliens needed us for a science fair” probably isn’t going to cut it. You need to convince Mission Control by showing them your findings. Show them something cool to persuade them that your missions are worthwhile (and that they shouldn’t cut your funding)!

  1. We want to transpose the table (flip rows and cols) to look at the average commute times and average hours of sleep on weekdays of students on meal plan and students off meal plan. Since we don’t yet know how to transpose an entire table through code, we’ll build the table we need manually.

    1. Filter the table to compute average commute times and average hours of sleep on weekdays of students on meal plan.
    2. Do the same for students not on meal plan.
    3. Create a new table (in Pyret) named meal-plan-table with the columns on-meal-plan, avg-commute, and avg-sleep-weekdays
    4. Fill in the table with the appropriate data.
    5. Note: there should only be two rows – one for those who are on meal plan and one for those who are not. It will look something like the following:
      Meal plan status Commute time Sleep on weekdays
      Off meal plan Average Average
      On meal plan Average Average
  2. Plot some columns that you think might be related against one another.

    1. Do you notice any cool correlations?

CHECKPOINT: Call over a TA once you reach this point.


Science Fair

It’s now time for the science fair! The alien scientists have asked you to participate as well since you were such an integral part of their studies. The judges at the science fair are perusing the exhibits and asking some really tough questions. The alien scientists get a little shy when it comes to answering questions and presenting – so they have asked you to do all of it.

Discuss the answers to the following questions with your partner:

  1. What preparation did you have to do before you could start analyzing?
  2. Why does messy data exist?
  3. When can you safely assume your data matches a specific format?
  4. How can we expand what we did today to work on even larger, even messier data sets?
  5. In what scenarios do people have to sort through messy data?
  6. How can we prevent “messy data”? What could we have changed in the form to make the data / results nicer?