Project 1: Cab Conspiracy

Due date information

Out: Oct 2, 9:00 AM EST

Design check sign-up deadline: Oct 4, 4:30 PM EST

In: Oct 17, 9:00 PM EST (extended from Oct 15, 9:00 PM EST)

Summary

Your detective services have been requested to investigate the suspicious connection between bad weather and taxi usage. According to an anonymous tip, the owner of New York City’s premier taxi company has created a machine that controls the weather to increase his profits.

NYC publishes a lot of open data (see this link). You found records of every taxi ride taken in city cabs during 2015 and 2016, and want to analyze the data, with a particular look at how taxi usage varies by time of day and weather conditions.

Visit this webpage on the 2016 taxi data to get familiar with the columns that NYC provides in these datasets. For better or worse, this dataset is HUGE – it has 131 million rows and takes up more than 17GB of space (so don’t download it!). The raw dataset is too big to open easily in Excel or Pyret, so you’re going to work with a summarized version that we have already computed from the raw data.

The Project

For this project, you need to answer the following questions about the 2016 taxi data:

In addition, you need to provide a way to produce tables that summarize statistics about the numbers of rides at different times of day under different weather conditions (the company strategy team will generate these as they do their planning). Specifically, given a table and a function to use to summarize the values for a particular weather condition and time of day, you will need to produce a (Pyret) table of the following form, where each cell contains a summary of the number of rides in the given time period on days with the given weather:

|            | Rain   | Snow   |  Clear  |
| ---------- | ------ | ------ | ------- |
| Morning    |  num   |  ...   |   ...   |
| Afternoon  |  ...   |        |         |
| Evening    |  ...   |        |         |
| Night      |  ...   |        |         |

where num might be the sum of all rides on rainy-day mornings, or the daily average on rainy-day mornings, etc.

The project will be completed in two stages: Design and Analysis. In the design stage, you will plan the data, tables, and functions that you will need to conduct the analysis during week two. You’ll do little to no coding for the analysis until after you meet with a TA to review your plans. Expectations for each phase are described in separate sections below. At the end of Analysis, you will turn in both a Pyret file with your code and a PDF file describing your findings.

Note: We believe the hardest part of this assignment lies in figuring out what analyses you will do and in creating the tables you need for those analyses. Once you have created the tables, the remaining code should be similar to what you have written for homework and lab. Plan enough time to think out your table and analysis designs.

Accessing the Data

The following code will load the summarized 2016 taxi data into Pyret:

include tables
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")
include gdrive-sheets
include image
import math as M
import statistics as S

ssid = "1ZbiTAuBpy55akMtA-gWjRBBW0Jo6EP0h_mQWmLMyfkc" # Fall 2019
ss = load-spreadsheet(ssid) # load spreadsheet
s1 = ss.sheet-by-name("data", true) # get data sheet
taxi-data-long =
  load-table:
    day, weekday, timeframe, num-rides, avg-dist, total-fare
    source: s1
  end

For weather data, we have extracted 2016 data for LaGuardia Airport in New York City (from NOAA) and stored it in a Google Sheet. You can access it with the following code:

w-days-ssid = "1uiWXHjKAeZ7aUjiL6V_IFN5j9uLRHv_b1ji_Nc3IZm4" # Fall 2019
wdata-sheet = load-spreadsheet(w-days-ssid)
weather-data =
  load-table: date, weekday, awnd, prcp, snow, tavg, tmax, tmin
    source: wdata-sheet.sheet-by-name("final2", true)
  end

Deadline 1: The Design Stage

Submit your work for the design check as a PDF file named project-1-design-check.pdf. You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc.) and then saving or exporting to PDF. Ask the TAs if you need help with this. Please put both your and your partner’s login information at the top of the file.

  1. Setup and Handin Info

    • Find a partner for the project. If you prefer, you can use this when2meet to find a partner whose schedule matches yours: mark the time slots when you are free (use your Brown email as your name!) and reach out to people who have marked the same times.
    • Sign up for a design check here before Friday, October 4th at 4:30 PM. Design checks are held Wednesday and Thursday (the 9th and 10th) and are mandatory for all groups. Only one of you has to sign up; whoever signs up must then invite their partner to the Google Calendar event. If you’re unsure how to do this, there’s a video at the end of this document.
    • Handin: Submit the PDF before your design check starts to Project 1 Design Check on Gradescope. Please add your project partner to your submission on Gradescope as well.
  2. Look at the summarized 2016 taxi data. Compare the summarized data with the sample of the original table shown on the NYC website. What operations or steps could produce the summarized data from the original? Write a bulleted list of steps (in English) that explain how to produce the summarized form from the original. If a step corresponds to a specific Pyret tables operation, name the operation.

    (The point of this question is to show you that you know almost everything you’d need to do this conversion yourself, had the source data not been so huge – within a couple of weeks you will know how to do all of these steps yourself.)

  3. For each of the three analysis questions listed above, describe how you plan to do the analysis. You should try to answer these questions:

    • What charts, plots and statistics do you plan to generate to answer the analysis questions? Why? What are the types and the axes of these charts, plots and statistics?
    • What table(s) will you need to generate those charts, plots and statistics?
    • If the table(s) you need have different columns or rows than those that we gave you, provide a sample of the table that you need.
    • For each of the new tables that you identified, describe how you plan to create the table from the ones that we’ve given you. Make sure to list all Pyret operators and functions you will use (with input/output types and a description of what they do, but without the actual code). If you don’t know how to create a table, discuss it with the TA at your design check.

    Important note: You can use any of the Pyret table, chart and plot operations as you see fit. Some that you could use (you are neither limited to these nor required to use them) are: sort-by, filter-by, stdev, mean, sum, scatter-plot, freq-bar-chart, histogram. You can read more about these in the Tables Documentation.

    Sample Answer:
    Using the municipalities data that we have been covering in lectures as an example: if you were asked to analyze whether municipalities with a population (in 2000) larger than 30,000 saw an increase or decrease in population, your answer might be: "I’d start with a table of municipalities that have a population in 2000 of over 30,000, and then make a scatterplot of the population of those cities in 2000 and 2010. I’d add a linear regression line, then check whether there was a pattern in changes between the two population values.

    I’d obtain a table of municipalities with a population of greater than 30,000 in 2000 by using the filter-by function."

  4. For the summary-table generation question, you will be filling in the body of the following function:

    fun summary-table(t :: Table, f :: (Table, String -> Number)) -> Table:
      doc: ```Produces a table that uses the given function f to summarize
            rides for each of rain/snow/clear weather during morning/
            afternoon/evening/night timeframes.```
      ...
    end
    
    # the type of f is a function that takes a Table and a String and returns a Number
    

    This might be called as summary-table(mytable, sum) or summary-table(mytable, mean) to summarize the total or average numbers of rides within the dates represented in mytable.

    • Provide an example of how this function will be used. Your answer should include an example of the input table, an input function that takes in a Table and String and returns a Number, and an output Table.
  5. Given these two tables:

    Table 1:

    date prcp
    2019/10/14 1.0
    2019/10/15 1.1
    2019/10/16 0.0

    Table 2:

    date number_of_rides
    2019/10/15 28591
    2019/10/14 2355
    2019/10/17 14513
    2019/10/16 4810

    Write a bulleted list of steps (in English) to combine these two tables to one that looks like the table below. If a step corresponds to a specific Pyret tables operation, name the operation.

    date prcp number_of_rides
    2019/10/14 1.0 2355
    2019/10/15 1.1 28591
    2019/10/16 0.0 4810
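Returning to the summary-table function in step 4: its second argument can be any function that takes a Table and a column name and returns a Number, not only sum or mean. The sketch below shows one function of that shape. The name max-in-column and the two-row sample table are hypothetical and made up for illustration; the code relies only on Pyret's built-in get-column table method, the foldl list method, and num-max.

```pyret
include tables  # already present if you use the loading code in Accessing the Data

# Hypothetical sample data, not taken from the real taxi spreadsheet
tiny-rides = table: day, num-rides
  row: "day-1", 120
  row: "day-2", 340
end

fun max-in-column(t :: Table, col :: String) -> Number:
  doc: "Returns the largest value in the named numeric column; assumes t has at least one row"
  vals = t.get-column(col)
  # start from the first value and keep the larger one at each step
  vals.rest.foldl(num-max, vals.first)
where:
  max-in-column(tiny-rides, "num-rides") is 340
end
```

Because max-in-column has the type (Table, String -> Number), it could be passed wherever the handout passes sum or mean.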

Details on the Design Check itself are in a section further down in this handout.

Deadline 2: The Analysis Stage

The deliverables for this stage include:

  1. A Pyret file named transit-analysis.arr that contains the function summary-table, the tests for the function, and all the functions used to generate the report (charts, plots, and statistics).
  2. A report file named transit-report.pdf. Include in this file the copies of your charts and the written part of your analysis. Your report should address the three analysis questions outlined at the beginning of this assignment.

Note:

  1. Use at least two different tables in your tests for the summary-table function.
  2. If you copy a table or plot into your analysis, you must tell us what it is called in your code so we can reproduce your results.

Sample Answer: Continuing with the municipalities population question as an example, we’d expect to see something in your Pyret file like the following:

# ------ Analysis for question on municipalities with population over 30,000 --- #

fun more-than-thirty-thousand(r :: Row) -> Boolean:
  ...
end

qualifying-munis = filter-by(municipalities, more-than-thirty-thousand)
munis-2000-2010-scatter = lr-plot(qualifying-munis, "population-2000", "population-2010")

Then, your report may look like this:

Guidelines on the Analysis

In order to do these analyses, you will need to get day-of-the-week information into the tables and combine data from the two tables based on common dates.

Combining data across tables: Both tables store data by dates, which means you should be able to combine information to create a single table. However, these two tables have different date formats (this was intentional on our part). Handle aligning the date formats in Pyret, not in Google Sheets. Making sure you know how to use coding to massage tables for combining data is one of our goals for this project. Load both tables into Pyret, then figure out how to combine the information. Pyret String documentation might be your friend!
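As a rough illustration of the kind of string work involved, the sketch below rewrites a date from one text format into another using Pyret's built-in string functions. Both formats here (M/D/YYYY in, YYYY-MM-DD out) are assumptions chosen for the example and are not necessarily the formats used in the two spreadsheets; the function names pad2 and slash-to-dashed are likewise hypothetical.

```pyret
fun pad2(s :: String) -> String:
  doc: "Adds a leading zero to a one-character string; leaves longer strings alone"
  if string-length(s) == 1: "0" + s else: s end
where:
  pad2("5") is "05"
  pad2("12") is "12"
end

fun slash-to-dashed(d :: String) -> String:
  doc: "Converts an assumed M/D/YYYY date string into YYYY-MM-DD"
  # split "1/5/2016" into the list ["1", "5", "2016"]
  parts = string-split-all(d, "/")
  mm = parts.get(0)
  dd = parts.get(1)
  yyyy = parts.get(2)
  yyyy + "-" + pad2(mm) + "-" + pad2(dd)
where:
  slash-to-dashed("1/5/2016") is "2016-01-05"
  slash-to-dashed("10/14/2016") is "2016-10-14"
end
```

Check the actual formats in the loaded tables before adapting anything like this; the same idea (split, rearrange, rejoin) applies to other layouts.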

Note: As we saw in the lecture on errors in data tables, small errors and typos can lurk in datasets. While you might be tempted to just combine columns from the tables by relying on them having the same dates in the same order, this would not be a safe option unless you also had code to check this assumption about the dates. For now, your approach should look up each date from one table in the other. We will revisit how to write this check in lecture once we have taught you what you need to do that.

Hint: If you feel your code is getting too complicated to test, add helper functions! You will almost certainly have computations that get done multiple times with different data for this problem. Create and test a helper or two to keep the problem manageable. You don’t need helpers for everything, though – it is fine for you to have nested build-column expressions in your solution, for example. Don’t hesitate to reach out to us if you want to review your ideas for breaking down this problem.
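One possible shape for such a helper, tested against a tiny hand-made table, is sketched below. The helper name rides-during, the sample rows, and the lowercase timeframe labels are all assumptions made for illustration; check the real values in taxi-data-long before relying on them. The sketch uses Pyret's built-in sieve expression and the get-column method rather than the course library, so it is one option among several.

```pyret
include tables  # already present if you use the loading code in Accessing the Data

# Hypothetical rows in the shape of taxi-data-long; the timeframe labels are assumed
tiny-taxi = table: day, timeframe, num-rides
  row: "day-1", "morning", 10
  row: "day-1", "night", 20
  row: "day-2", "morning", 30
end

fun rides-during(t :: Table, tf :: String) -> Number:
  doc: "Total num-rides over the rows of t whose timeframe equals tf"
  # keep only the rows for the requested timeframe
  matching = sieve t using timeframe:
    timeframe == tf
  end
  # add up the ride counts of the remaining rows (0 if none remain)
  matching.get-column("num-rides").foldl(lam(n, acc): n + acc end, 0)
where:
  rides-during(tiny-taxi, "morning") is 40
  rides-during(tiny-taxi, "night") is 20
  rides-during(tiny-taxi, "evening") is 0
end
```

A small table like tiny-taxi makes the expected answers easy to compute by hand, which is what makes the where: block trustworthy.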

Report

You should write up a report of your findings in a Word or Google document, which you will submit as a PDF. Pyret makes it easy to produce this kind of report. When you make a plot, there is an option in the top left-hand corner of the window to save the chart as a .png file, which you can then copy into your document.

Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into some spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.

Your report should contain any relevant plots and tables, any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making the report read well (use section headings, full sentences, spell-check it, etc.). There’s no specified length – just say what you need to say to present your analyses.

Reflection

Have a section in your report document with answers to each of the following questions after you have finished the coding portion of the project:

  1. Describe one key insight that each partner gained about programming or data analysis from working on this project and one mistake or misconception that each partner had to work through.
  2. Based on the data and analysis techniques you had, how confident are you in the quality of your results? What other information or skills could have improved the accuracy and precision of your analysis?
  3. State one or two follow-up questions that you have about programming or data analysis after working on this project.
  4. Imagine you are an urban planner using this dataset to identify dates and times that are more likely than others to have high numbers of commuters. How could only using this dataset make your analysis less accurate? Think about data or populations that are missing from this dataset.
  5. Imagine that the following attributes were added to the public taxi data set. For each of the following attributes, identify a potential ethical issue that could arise due to its addition. Think about who could use the dataset, how it could be analyzed, or for what purposes the analysis could be used.
    • the address of the start and end point of the ride
    • the individual who took the ride (identified by a unique number, not their name) along with their pickup location
    • any “protected attributes” (defined in the U.S. as gender, race, disability, age, etc.)

Design Check

The design check is a one-on-one meeting between your team and a TA to review your project plans and to give you feedback well before the final deadline. Many students make changes to their designs following the check: doing so is common and will not cost you points.

Requirements

Design Check Grading

Your design check grade will be based on whether you had viable ideas for each of the questions and were able to explain them adequately to the TA (for example, we expect you to be able to describe why you picked a particular plot or table format). Your answers do not have to be perfect, but they do need to illustrate that you’ve thought about the questions and what will be required to answer them. The TA will give you feedback to consider as part of your final implementation of the project.

Your design check grade will be worth roughly a third of your overall project grade. Failure to account for key design feedback in your final solution may result in a deduction on your analysis stage grade (for example, a check moving to a check minus).

Final Handin (Analysis Portion)

For your final handin, submit one code file named transit-analysis.arr containing all of your code for producing plots and tables for this project. Put a summary of the plots, tables, and conclusions into a separate document called transit-report.pdf. Also put your project reflection into this file. Nothing is required to print in the interactions window when we run your file, but your analysis answers should include comments indicating which variable names or expressions yield the data on which you based your answers.

Final Grading

You will get grades on each of Functionality, Design, Testing, and Code Clarity for this assignment.

Functionality – Key metrics:

Testing – Key metrics:

Design – Key metrics:

Code Clarity – Key metrics:

With regard to functionality, you should write appropriate code to perform each analysis and write a working summary-table function. Strive to use appropriate functions to produce your tables and analyses.

For design, your computations that create additional tables should be clear and well-structured, rather than appearing as if you made some messy choices just to get things to work.

Handin

Please hand in both of your files to Project 1 on Gradescope. If you forget to include either of them, you will receive no credit.

Design Check Signup extra instructions

Note, 2019: You will not see a name (like Eli Berkowitz in the video), but by 7 PM on Friday you will receive a notification telling you who your Design Check TA will be!

Campuswire and Feedback