Project 1: Cowboy Conspiracy

Important notes:

Make sure to copy the code we give for loading the spreadsheets exactly as written

Do not use lists in this project

Due Date Information

Out: Wednesday, Feb 19

Design check sign-up deadline: Thursday, Feb 20 11:59pm

Design check dates: Friday, Feb 21 10am - Sunday, Feb 23 11pm

Optional second check-in dates: Tuesday, Feb 25 - Sunday, Mar 1

In: Mar 3

Summary

Howdy, partner!

You and your cowboy posse are in pursuit of an outlaw who has robbed the local saloon. Rumor has it they have escaped to the city and are masquerading as a taxi cab driver. Your sense of justice and revenge compels you to venture beyond the Wild West into the Urban East in order to catch them. Unfortunately, your horses refuse to ride into the city because of their batophobia (fear of tall buildings). Y’all must navigate dealing with rainy weather for the first time and learn how to hail a NYC taxi cab. Use the data available to figure out when they are likely to be driving. Lasso that data to catch the outlaw!

NYC publishes a lot of open data (see this link). You found records of every taxi ride taken in city cabs during 2015 and 2016, and want to analyze the data, with a particular look at how taxi usage varies by time of day and weather conditions.

Visit this webpage on the 2016 taxi data to get familiar with the columns that NYC provides in these datasets. Please note that this link might take a while to load.

For better or worse, this dataset is HUGE – it has 131 million rows and consumes more than 17GB of space (so don’t download it!). The raw dataset is too big to open easily in Excel or Pyret, so you’re going to work with a summarized version that we have already computed from the raw data.

The Project

For this project, you need to answer the following Analysis Questions about the 2016 taxi data, which will help you catch the outlaw:

  1. To what extent does bad weather affect how many rides people take? There are many ways to interpret bad weather and you can analyze this question through different lenses, such as rain, snow, and temperature.

  2. Do the number of rides and total fares follow similar patterns for each day of the week across the year? In other words, is there a reasonably consistent pattern across all Mondays of a year? What about across Saturdays? And so on.

  3. Are some days of the week more likely than others to have high numbers of rides?

In addition, you need to provide a way to produce a table that summarizes statistics about the numbers of rides at different times of day under different weather conditions. Specifically, given a table and a function to use to summarize the values for a particular weather condition and time of day, you will write a function summary-table to produce a (Pyret) table of the following form, where each cell contains some statistic about the number of rides in the given time period on a day with the given weather:

|            | Rain   | Snow   |  Clear  |
| ---------- | ------ | ------ | ------- |
| Morning    |  num   |  ...   |   ...   |
| Afternoon  |  ...   |        |         |
| Evening    |  ...   |        |         |
| Night      |  ...   |        |         |

where num might be the sum of all rides on rainy-day mornings, or the daily average on rainy-day mornings, etc.

The project will be completed in two stages: Design and Analysis. In the design stage, you will plan the data, tables, and functions that you will need to conduct the analysis during week two. You’ll do little to no coding for the analysis until after you meet with a TA to review your plans during the Design Check. At the end of Analysis, you will turn in both a Pyret file with your code and a PDF file describing your findings. Expectations for each phase are described in separate sections below.

Note: We believe the hardest part of this assignment lies in figuring out what analyses you will do and in creating the tables you need for those analyses. Once you have created the tables, the remaining code should be similar to what you have written for homework and lab. Plan enough time to think out your table and analysis designs.

Accessing the Data

The following code will load the summarized 2016 taxi data into Pyret:

include tables
include shared-gdrive("cs111-2019.arr", "1PzXKPvJHTi3N_QTShsALKgYaV77ybeqx")
include gdrive-sheets
include image
import math as M
import statistics as S

include shared-gdrive(
   "taxi-project-support.arr",  "1cN92aQzBeURXjpFWM48pAbm7vwE7p0Sj") 

taxi-ssid = "1ZbiTAuBpy55akMtA-gWjRBBW0Jo6EP0h_mQWmLMyfkc" # Spring 2020
taxi-sheet = load-spreadsheet(taxi-ssid) # load spreadsheet
taxi-data-sheet = taxi-sheet.sheet-by-name("data", true) # get data sheet
taxi-data-long =
  load-table:
    day, weekday, timeframe, num-rides, avg-dist, total-fare
    source: taxi-data-sheet
  end

Note: The source code file imported above, taxi-project-support.arr, contains functions that might be useful for this project but are not in the standard CS0111 Pyret Documentation. Details on this can be found below.

For weather data, we have extracted 2016 data for LaGuardia Airport in New York City (from NCDC) and put it in a Google Sheet. You can access it with the following code:

weather-ssid = "1uiWXHjKAeZ7aUjiL6V_IFN5j9uLRHv_b1ji_Nc3IZm4" # Spring 2020
wdata-sheet = load-spreadsheet(weather-ssid)
weather-data =
  load-table: date, weekday, awnd, prcp, snow, tavg, tmax, tmin
    source: wdata-sheet.sheet-by-name("final2", true)
  end

Project Learning Goals

Our hopes for this project are that you

Files to Submit on Gradescope

Design Check: project-1-design-check.pdf

Final Handin: transit-analysis.arr and transit-report.pdf

Deadline 1: Design Check

With any large computer science project, a large amount of planning and design often occurs before anyone even begins to code.

So, the first deadline is the design check, a time when you meet your project TA (the TA who will grade your project). It’s a low-stress way to make sure you are on the right track as you start implementing your project.

Signing Up For a Design Check
Please read the following information carefully.

All projects in CS0111 are done in pairs, so first you must find a partner for the project. Since the deadline to sign up for a design check is Thursday, Feb 20 at 11:59pm, if you haven’t found a partner yet, please email the HTAs ASAP so we can help you find one.

To sign up for a design check slot, please fill out this Google form. You’ll be asked for both students’ CS/Brown banner logins, what time slot you’d like, and the Gradescope Anonymous ID of the partner who will be submitting your design check handin. Only one student needs to fill out this form, and it is very important that your login information is correct.

Once you fill out the form, you will get two confirmation emails:

  1. One, from Google forms, that will send you a copy of your responses. Only the person who filled out the form will get this email.
  2. An email from the staff, saying that your slot has either been confirmed or is now unavailable to you. Both partners should get this email. If you don’t get this email within a minute or two, you might have typed your logins incorrectly.

If you get an email saying that the slot you signed up for is no longer available, the slot might belong to a TA who has blocklisted you, or someone else might have signed up for it while you were filling out the form. This means your response was not recorded on our end, so please edit your response to the form and submit again, choosing a different slot.

Setup and Handin Info

Answer the following questions in a PDF file, project-1-design-check.pdf, and submit it on Gradescope under Project 1 Design Check at least 2 hours before your design check. You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc.) and then saving or exporting to PDF. Ask the TAs if you need help with this. Please put both your and your partner’s CS logins (banner usernames) at the top of the file.

Questions

  1. Look at the summarized 2016 taxi data. Compare the summarized data with the sample of the original table shown on the NYC website. What operations or steps could produce the summarized data from the original? Write a bulleted list of steps (in English, not code) that explain how to produce the summarized form from the original. Make sure you have some ideas of what functions from the Pyret tables documentation you might use.

    (The point of this question is to show you that you know almost everything you’d need to do this conversion yourself, had the source data not been so huge – within a couple of weeks you will know how to do all of these steps yourself.)

  2. For each of the three analysis questions listed above at the beginning, describe how you plan to do the analysis. You should try to answer these questions:

    • What charts, plots and statistics do you plan to generate to answer the analysis questions? Why? What are the types and the axes of these charts, plots and statistics?
    • What table(s) will you need to generate those charts, plots and statistics?
    • If the table(s) you need have different columns or rows than those that we gave you, provide a sample of the table that you need.
    • For each of the new tables that you identified, describe how you plan to create these tables from the ones that we’ve given you. This can include the overall summary table produced by the summary-table function. Make sure to list all Pyret operators and functions you might use (with input/output types and a description of what they do, but without the actual code). If you don’t know how to create a table, discuss it with the TA at your design check, or feel free to discuss it with TAs at hours beforehand.

    Important note: You can use any of the Pyret table, chart, and plot operations as you see fit. Some that you could use (you are neither limited to these nor required to use them) are: sort-by, filter-by, stdev, mean, sum, scatter-plot, freq-bar-chart, histogram. You can read more about these in the Tables Documentation.

    Sample Answer:
    For example, if you were asked to analyze whether municipalities with a population (in 2000) larger than 30,000 have an increase or decrease in population, your answer might be:

    "I’d start with a table of municipalities that have a population in 2000 of over 30,000, and then make a scatterplot of the population of those cities in 2000 and 2010. I’d add a linear regression line, then check whether there was a pattern in changes between the two population values.

    I’d obtain a table of municipalities with a population of greater than 30,000 in 2000 by using the filter-by function."

  3. For the summary-table function, you will be filling in the body of the following function (you do not have to implement it for the design check, but you do eventually have to implement it):

    fun summary-table(t :: Table, f :: (Table, String -> Number)) -> Table:
      doc: ```Produces a table that uses the given function f to summarize
            rides for each of rain/snow/clear weather during morning/
            afternoon/evening/night timeframes.```
      ...
    end
    
    # the type of f is function that takes Table and String and returns a Number
    # the String should correspond to the name of the column f will operate on
    

    For the design check, come up with a general idea of how you want to implement this function. For example, it might be called as summary-table(mytable, sum) or summary-table(mytable, mean) to summarize the total or average numbers of rides within the dates represented in mytable. You are welcome to create any other helper functions to work with summary-table that you see fit for your analysis.

    • Provide an example of how this function summary-table will be used. Your answer should include an example of the input table, an input function that takes in a Table and String and returns a Number, and an output Table.
  4. Given these two tables:

    Table 1:

    | date       | prcp |
    | ---------- | ---- |
    | 2019/10/14 | 1.0  |
    | 2019/10/15 | 1.1  |
    | 2019/10/16 | 0.0  |

    Table 2:

    | date       | number_of_rides |
    | ---------- | --------------- |
    | 2019/10/15 | 28591           |
    | 2019/10/14 | 2355            |
    | 2019/10/17 | 14513           |
    | 2019/10/16 | 4810            |

    Write a bulleted list of steps (in English) to combine these two tables into one that looks like the table below. If a step corresponds to a specific Pyret tables function, make sure to name the function, even if you’re not completely sure how it will be used!

    | date       | prcp | number_of_rides |
    | ---------- | ---- | --------------- |
    | 2019/10/14 | 1.0  | 2355            |
    | 2019/10/15 | 1.1  | 28591           |
    | 2019/10/16 | 0.0  | 4810            |
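
To make the type of f in question 3 concrete, here is a minimal sketch of a function with the shape (Table, String -> Number). The names sum-in-thousands and tiny-rides are made up for illustration, and the sketch assumes that sum from the course library takes a table and a column name, as the call summary-table(mytable, sum) suggests.

```pyret
include tables
include shared-gdrive("cs111-2019.arr", "1PzXKPvJHTi3N_QTShsALKgYaV77ybeqx")

# A tiny made-up table, only for testing; this is not the real taxi data
tiny-rides =
  table: timeframe :: String, num-rides :: Number
    row: "morning", 1000
    row: "evening", 2500
  end

fun sum-in-thousands(t :: Table, colname :: String) -> Number:
  doc: "Totals the named column of t and reports the result in thousands"
  # Assumption: sum(t, colname) adds up the values in the named column
  sum(t, colname) / 1000
where:
  sum-in-thousands(tiny-rides, "num-rides") is 3.5
end
```

Under these assumptions, a call such as summary-table(mytable, sum-in-thousands) would fill each cell with a total expressed in thousands of rides.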

Grading

Your design check grade will be based on whether you had viable ideas for each of the questions and were able to explain them adequately to the TA (for example, we expect you to be able to describe why you picked a particular plot or table format). Your answers do not have to be perfect, but they do need to illustrate that you’ve thought about the questions and what will be required to answer them. The TA will give you feedback to consider as part of your final implementation of the project.

Your design check grade will be worth roughly a third of your overall project grade. Failure to account for key design feedback in your final solution may result in deductions.

Remember that the goal of the design check is to review your project plans and to give you feedback well before the final deadline! Many students make changes to their designs following the check: doing so is common and will not cost you points.

Requirements

Optional Second Check-in

During your design check, you will also have the opportunity to schedule a personal check-in with your project TA where you can ask them any questions you have at that point, or work on a bug you might be having. These are 20-30 minute meetings with your project TA from Feb 25 - Mar 1. Meeting earlier in this time frame allows you to get higher-level help, while meeting later might allow you to get more focused help.

While this meeting is optional, it’s highly recommended you schedule one with your project TA. In addition, although both partners must go to the design check, only one partner needs to go to the check-in (although it is best if both go).

Note: If you schedule a personal check-in but wish to cancel, do so at least 12 hours before the time of the check-in. We want to respect both your time and the staff’s time! Failure to do so may result in point deductions on your final project grade.

Deadline 2: Analysis and Report

Analysis

For the analysis, you will be submitting a Pyret file named transit-analysis.arr that contains the function summary-table, the tests for the function, and all the functions used to generate the report (charts, plots, and statistics).

Note:

  1. Create at least two different example tables in your tests for the summary-table function.
  2. Make sure to test all helper functions that you create unless they return images.
  3. If you copy a table or plot into your analysis, you must tell us what it is called in your code so we can reproduce your results.
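
As one possible reading of notes 1 and 2, the sketch below tests a helper against a small hand-written table in a where: block. The table example-rides, the helper busy-day, and the threshold of 20 rides are all hypothetical and chosen only to keep the example short; filter-by and the Row annotation are used the same way as elsewhere in this handout.

```pyret
include tables
include shared-gdrive("cs111-2019.arr", "1PzXKPvJHTi3N_QTShsALKgYaV77ybeqx")

# Hypothetical example table: small enough to check answers by hand
example-rides =
  table: day :: String, num-rides :: Number
    row: "day-1", 10
    row: "day-2", 25
  end

fun busy-day(r :: Row) -> Boolean:
  doc: "Determines whether a row records more than 20 rides (made-up threshold)"
  r["num-rides"] > 20
where:
  busy-day(example-rides.row-n(0)) is false
  busy-day(example-rides.row-n(1)) is true
end

check "busy-day used with filter-by":
  # Only the second row has more than 20 rides
  filter-by(example-rides, busy-day).length() is 1
end
```

Keeping example tables this small makes it practical to work out the expected results by hand before running the tests.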

Sample Answer: Continuing with the municipality population example from the design check, we’d expect to see something in your Pyret file like the following:

# ------ Analysis for question on municipality population change --- #

fun more-than-thirty-thousand(r :: Row) -> Boolean:
  ...
end

qualifying-munis = filter-by(municipalities, more-than-thirty-thousand)
munis-2000-2010-scatter = lr-plot(qualifying-munis, "population-2000", "population-2010")

Then, your report may look like this:

Guidelines on the Analysis

In order to do these analyses, you will need to get day-of-the-week information into the tables and combine data from the two tables based on common dates.

Combining data across tables: Both tables store data by dates, which means you should be able to combine information to create a single table. However, these two tables have different date formats (this was intentional on our part). Handle aligning the date formats in Pyret, not in Google Sheets. One of our goals for this project is making sure you know how to use coding to manipulate tables for combining data. Load both tables into Pyret, then figure out how to combine the information. Pyret String documentation might be your friend!
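
As a small illustration of the kind of string manipulation involved, the sketch below rewrites a date from one made-up format into another using built-in string operations. The YYYYMMDD and YYYY-MM-DD formats and the name add-dashes are invented for this example and are not necessarily the formats used in the two spreadsheets.

```pyret
fun add-dashes(d :: String) -> String:
  doc: "Converts an 8-character YYYYMMDD string into YYYY-MM-DD (hypothetical formats)"
  # string-substring takes a start index (inclusive) and an end index (exclusive)
  string-substring(d, 0, 4) + "-"
    + string-substring(d, 4, 6) + "-"
    + string-substring(d, 6, 8)
where:
  add-dashes("20160105") is "2016-01-05"
  add-dashes("20161231") is "2016-12-31"
end
```

A function like this could be applied to every row to produce a new column, so that both tables end up with dates in a common format before they are combined.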

Note: As we saw in the lecture on errors in data tables, small errors and typos can lurk in datasets. While you might be tempted to just combine columns from the tables by relying on them having the same dates in the same order, this would not be a safe option unless you also had code to check this assumption about the dates. For now, your approach should look up each date from one table in the other. We will revisit how to write this check in lecture once we finish teaching you what you need to do that.

Hint: If you feel your code is getting too complicated to test, add helper functions! You will almost certainly have computations that get done multiple times with different data for this problem. Create and test a helper or two to keep the problem manageable. You don’t need helpers for everything, though – it is fine for you to have nested build-column expressions in your solution, for example. Don’t hesitate to reach out to us if you want to review your ideas for breaking down this problem.

Report

For the report, you will be submitting a file named transit-report.pdf. Include in this file the copies of your charts and the written part of your analysis. Your report should address the three analysis questions outlined at the beginning of this assignment.

You should make a report of your findings in a Word or Google Document, which you can then convert to a PDF for submission. Pyret makes it easy to make this kind of report. When you make a plot, there is an option in the top left-hand side of the window to save the chart as a .png file, which you can then copy into your document.

Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into some spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.

Your report should contain any relevant plots and tables, any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making sure the report reads well (use section headings, full sentences, spell-check it, etc). There’s no specified length – just say what you need to say to present your analyses to answer the questions.

An example of what part of your report might look like:

Reflection

Your report should also include a section with answers to each of the following questions. Do this after you have finished the coding portion of the project!

  1. Describe one key insight that each partner gained about programming or data analysis from working on this project and one mistake or misconception that each partner had to work through.
  2. Based on the data and analysis techniques you had, how confident are you in the quality of your results? What other information or skills could have improved the accuracy and precision of your analysis?
  3. State one or two followup questions that you have about programming or data analysis after working on this project.
  4. Imagine you are an urban planner using this dataset to identify dates and times that are more likely than others to have high numbers of commuters. How could only using this dataset make your analysis less accurate? Think about data or populations that are missing from this dataset.
  5. Imagine that the following attributes were added to the public taxi data set. For each of the following attributes, identify a potential ethical issue that could arise due to its addition. Think about who could use the dataset, how it could be analyzed, or for what purposes the analysis could be used.
    • the address of the start and end point of the ride
    • the individual who took the ride (identified by a unique number, not their name) along with their pickup location
    • any “protected attributes” (defined in the U.S. as gender, race, disability, age, etc.)

Handin Information

For your final handin, submit transit-analysis.arr and transit-report.pdf on Gradescope under Project 1. Nothing is required to print in the interactions window when we run transit-analysis.arr, but your analysis answers should include comments indicating which variable names or expressions yield the data on which you based your answers.

Grading

You will be graded on Functionality, Testing, and Design/Style for this assignment. Key metrics for each of these categories are described below.

Functionality:

Testing:

Design/Style:

Additional helper functions

taxi-project-support.arr contains a function that might be helpful in manipulating your data. This is not in the original CS0111 Pyret Tables Documentation, but feel free to use it if you’d like:

Campuswire and Feedback