You’ve been hired by a new ride-hailing service to help them figure out how many drivers they will need in New York City, one of their initial planned locations. NYC publishes a lot of open data (see this link). Your boss found records of every taxi ride taken in city cabs during 2015 and 2016, and wants you to analyze the data, with a particular look at how taxi usage varies by time of day and weather conditions.
Visit this webpage on the 2016 taxi data, to get familiar with the columns that NYC provides in these datasets. For better or worse, this dataset is HUGE – it has 131 million rows and consumes more than 17GB of space (so don’t download it!). The raw dataset is too big to open easily in Excel or Pyret, so you’re going to work with a summarized version that we have already computed from the raw data.
For this project, you need to answer the following questions about the 2016 taxi data:
To what extent does bad weather affect how many rides people take?
Do numbers of rides and total fares follow similar patterns within each day of the week across the year? In other words, is there a reasonably consistent pattern across all Mondays? What about across Saturdays? And so on.
Are some days of the week more likely than others to have high numbers of rides?
In addition, you need to provide a way to produce tables that summarize statistics about the numbers of rides at different times of day under different weather conditions (the company strategy team will generate these as they do their planning). Specifically, given a table and a function to use to summarize the values for a particular weather condition and time of day, you will need to produce a (Pyret) table of the following form, where each cell contains the number of rides in the given time period on a day with the given weather:
| | Rain | Snow | Clear |
| ---------- | ------ | ------ | ------- |
| Morning | num | ... | ... |
| Afternoon | ... | | |
| Evening | ... | | |
| Night | ... | | |
where num might be the sum of all rides on rainy-day mornings,
or the daily average on rainy-day mornings, etc.
The project will be completed in two stages: Design and Analysis. In the design stage, you will plan the data, tables, and functions that you will need to conduct the analysis during week two. You’ll do little to no coding for the analysis until after you meet with a TA to review your plans. Expectations for each phase are described in separate sections below. At the end of Analysis, you will turn in both a Pyret file with your code and a PDF file describing your findings.
Note: We believe the hardest part of this assignment lies in figuring out what analyses you will do and in creating the tables you need for those analyses. Once you have created the tables, the remaining code should be similar to what you have written for homework and lab. Plan enough time to think out your table and analysis designs.
The following code will load the summarized 2016 taxi data into Pyret:
include tables
include shared-gdrive("cs111-2018.arr", "1XxbD-eg5BAYuufv6mLmEllyg28IR7HeX")
include gdrive-sheets
import math as M
import statistics as S
ssid = "1AU_4pU4PpTdwm1RGJY_XcofNYml-aqla3gFypU5NxzU"
ss = load-spreadsheet(ssid) # load spreadsheet
s1 = ss.sheet-by-name("data", true) # get data sheet
taxi-data-long =
  load-table:
    day, timeframe, num-rides, avg-dist, total-fare
    source: s1
  end
For weather data, we have extracted data for La Guardia airport in New York City in 2016 (from NOAA) and left it in a Google Sheet. You can access it with the following code:
w-ssid = "1DITBIRW_jpqEpLWACum885KPt5LiNVm-5LZoCmcfBno"
wdata-sheet = load-spreadsheet(w-ssid)
weather-data =
  load-table: date, awnd, rain, snow, tavg, tmax, tmin
    source: wdata-sheet.sheet-by-name("la-guardia-2016", true)
  end
Submit your work for the design check as a PDF file named project-1-design-check.pdf. You can create a PDF by writing in your favorite word processor (Word, Google Docs, etc) then saving or exporting to PDF. Ask the TAs if you need help with this. Please put both your and your partner’s login information at the top of the file.
Setup and Handin Info
Look at the summarized 2016 taxi data. Contrast the shape of this data to the sample of the original table shown on the NYC website. What operations or steps could produce the summarized data from the original? Write a bulleted list of steps (in English) that explain how to produce the summarized form from the original. If a step corresponds to a specific Pyret tables operation, name the operation.
(The point of this question is to show you that you know almost everything you’d need to do this conversion yourself, had the source data not been so huge – within a couple of weeks you will know how to do all of these steps yourself.)
For each of the three analysis questions listed above, describe how you plan to do the analysis. Your answer should describe charts and statistics you would generate, and how you would interpret them to get your answer. You may use any of the operations (sort-by, filter-by, stdev, mean, sum, etc) or charts (scatter-plot, freq-bar-chart, histogram, etc) that are detailed in the Tables Documentation.
Sample Answer: Using the gradebook table as an example, if you were asked to analyze whether C students improved or declined from exam1 to exam2, your answer to this might be: “I’d start with a table of students who earned C’s, then make a scatterplot of the scores on exam1 versus exam2. I’d add a linear-regression line, then check whether there was a pattern in changes between the two exam scores.”
Do not write code to produce the answer for the design stage. You may find it useful to outline some code to help you work out your ideas. Your analysis plan should include: (a) descriptions of the charts or statistics you want to generate, (b) a description of what you will look for when interpreting this information, and (c) the table(s) you will need to generate the information in (a). If you need a table with different columns or rows than those we gave you, provide a sample of the table you need (you can write it in Google Sheets then copy/paste it into your document), covering two dates. If multiple problems need the same new table, just write the sample table once.
For each new table that you identified, describe how you plan to create the table from the ones we’ve given you. Describe which Pyret operators you will use, and list the functions (with input/output types) that you will need to write (don’t write the actual code until the analysis stage). If you don’t know how to create the table you want (which could indeed happen), discuss it with the TA at your design check. We have some support code with other table operations that you will get access to after your design check (for now, we want you to think about what you need without being influenced by our support code).
Sample Answer: Using the gradebook table as an example, if you were asked to analyze whether C students improved or declined from exam1 to exam2, your answer to this might be: “I’d obtain a table of students who earned C’s by adding letter grades to the gradebook then using filter-by to keep only rows for students with C’s.”
For the summary-table generation question, you will be filling in the body of the following function:
fun summary-table(t :: Table, f :: (Table, String -> Number)) -> Table:
  doc: ```Produces table that uses the given function f to summarize
       rides for each of rain/snow/clear weather during morning/
       afternoon/evening/night timeframes.```
  ...
end
# the type of f is a function that takes a Table and a String and returns a Number
This might be called as summary-table(mytable, sum) or summary-table(mytable, mean) to summarize the total or average numbers of rides within the dates represented in mytable.
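To make the type of f concrete, here is a minimal sketch of a custom summarizer with the same shape as sum and mean. The name count-rows and the sample-rides table are hypothetical and are not part of the course library; the sketch assumes the built-in length method on tables.

```pyret
include tables  # mirrors the handout's setup code

# Hypothetical two-row table, used only to exercise the function below
sample-rides =
  table: day, num-rides
    row: "day-1", 100
    row: "day-2", 250
  end

# Hypothetical summarizer: not part of the course library.
# It accepts a column name only so that it matches the type of f.
fun count-rows(t :: Table, col :: String) -> Number:
  doc: "Counts the rows of t, ignoring the column name"
  t.length()
where:
  count-rows(sample-rides, "num-rides") is 2
end
```

Under the signature above, such a function could be passed as summary-table(mytable, count-rows), in the same way as sum or mean.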
For the design stage, provide a where example for this function, so we can check that you understand what it will do (you will write the actual body in the Analysis Stage).
Details on the Design Check itself are in a section further down in this handout.
Actually conduct your analyses for the three questions, and write and test the summary-table function (which you are welcome to also use as part of your analyses if you wish). This involves creating tables, generating plots, and computing summary statistics in Pyret code. Your summary-table tests should use at least two different input tables that check different scenarios for this function.
You will submit two files for this part: a Pyret file named transit-analysis.arr and a report file named transit-report.pdf. Any code you write to conduct analysis goes in the Pyret file. Copies of your charts and the written part of your analysis go in the report file.
Sample Answer: Continuing with comparing exam grades for C students as an example, we’d expect to see something in your Pyret file like the following:
# ------ Analysis for question on exam grades for C students --- #
fun has-C(r :: Row) -> Boolean:
  ...
end
c-students = filter-by(gradebook, has-C)
c-ex1-ex2-scatter = lr-plot(c-students, "exam1", "exam2")
Then, your report may look like this:
Note: If you copy a table or plot into your analysis, you must tell us what it is called in your code so we can reproduce your results.
In order to do these analyses, you will need to get day-of-the-week information into the tables and combine data from the two tables based on common dates.
Days of the week: In general, there are two ways you could go about getting the day of the week into your data:
Add the days of the week to the source tables in Google Sheets (… w-ssid)
Compute the day of the week in Pyret. num-floor and num-modulo will be useful in doing so (see the Pyret Numbers documentation for more on these functions).
Either approach is fine (no difference in points, etc). In the big picture, though, there are advantages and disadvantages to each approach, which you will discuss in your report.
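As an illustration of the second approach, the sketch below maps a day number within 2016 to a weekday name using num-modulo. It assumes you have already reduced each date to a day count where 1 means January 1, 2016 (a Friday); the helper name day-of-week and that numbering are hypothetical, and the date columns in the actual tables are not stored this way.

```pyret
# Hypothetical helper: day-num counts the days of 2016, with 1 = January 1.
# January 1, 2016 fell on a Friday, so an offset of 0 maps to "Friday".
fun day-of-week(day-num :: Number) -> String:
  doc: "Names the weekday for the given day number of 2016"
  offset = num-modulo(day-num - 1, 7)
  if offset == 0: "Friday"
  else if offset == 1: "Saturday"
  else if offset == 2: "Sunday"
  else if offset == 3: "Monday"
  else if offset == 4: "Tuesday"
  else if offset == 5: "Wednesday"
  else: "Thursday"
  end
where:
  day-of-week(1) is "Friday"
  day-of-week(4) is "Monday"
  day-of-week(8) is "Friday"
  day-of-week(366) is "Saturday"
end
```

The if chain is used here instead of a lookup list because list-based operations are off limits for this project.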
Combining data across tables: Both tables store data by dates, which means you should be able to combine information to create a single table. However, these two tables have different date formats (this was intentional on our part). Handle aligning the date formats in Pyret, not in Google Sheets. Making sure you know how to use coding to massage tables for combining data is one of our goals for this project. Load both tables into Pyret, then figure out how to combine the information.
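One way to think about aligning date formats is to write a small conversion function and test it on its own before using it on a table. The sketch below assumes, purely for illustration, that one table stores dates as numbers like 20160115 and the other as strings like "1/15/2016"; inspect the real columns before relying on either assumption. The helper name num-date-to-string is hypothetical.

```pyret
# Hypothetical formats: a yyyymmdd number on one side, an m/d/yyyy string
# on the other. The real tables may use different formats.
fun num-date-to-string(d :: Number) -> String:
  doc: "Converts a yyyymmdd number into an m/d/yyyy string"
  year = num-floor(d / 10000)                   # drop the last four digits
  month = num-modulo(num-floor(d / 100), 100)   # keep the middle two digits
  day = num-modulo(d, 100)                      # keep the last two digits
  num-to-string(month) + "/" + num-to-string(day) + "/" + num-to-string(year)
where:
  num-date-to-string(20160115) is "1/15/2016"
  num-date-to-string(20161231) is "12/31/2016"
end
```

A helper like this uses only number and string functions, so it can be applied row by row when building a new column without any list-based operations.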
Other than adding days of the week, do not edit the source tables. They have been chosen carefully to exercise certain concepts by having you process them with Pyret code.
Note: As we saw in the lecture on errors in data tables, small errors and typos can lurk in datasets. While you might be tempted to just combine columns from the tables by relying on them having the same dates in the same order, this would not be a safe option unless you also had code to check this assumption about the dates. For now, your approach should look up each date from one table in the other. We will revisit how to write this check in lecture once we finish teaching you what we need to do that.
Hint (Added 10/9) If you aren’t sure how to go about combining the tables, write out the list of steps/tasks that you need to do before you start writing code. Most steps will end up corresponding to an expression or function that you write separately and then combine with code from other steps to yield a clear program. If you feel your code is getting too complicated to test, add helper functions! You will almost certainly have computations that get done multiple times with different data for this problem. Create and test a helper or two to keep the problem manageable. You don’t need helpers for everything, though – it is fine for you to have nested build-column expressions in your solution, for example. Don’t hesitate to reach out to us if you want to review your ideas for breaking down this problem.
Do not use any list-based operations for this project.
You should make a report of your findings in a Word or Google Document, which you will submit as a PDF. Pyret makes it easy to make this kind of report. When you make a plot, there is an option in the top left-hand side of the window to save the chart as a .png file which you can then copy into your document.
Additionally, whenever you output a table in the interactions window, Pyret gives you the option to copy the table. If you copy the table into some spreadsheet, it will be formatted as a table that you can then copy into Word or Google Docs.
Your report should contain any relevant plots and tables, any conclusions you have made, and your reflection on the project (see next section). We are not looking for fancy or specific formatting, but you should put some effort into making the report read well (use section headings, full sentences, spell-check it, etc). There’s no specified length – just say what you need to say to present your analyses.
Have a section in your report document with answers to each of the following questions after you have finished the coding portion of the project:
The design check is a one-on-one meeting between your team and a TA to review your project plans and to give you feedback well before the final deadline. Many students make changes to their designs following the check: doing so is common and will not cost you points.
Submit your work on the handin form before your design check. You should submit a PDF with this work. Bring your work for the design phase to the meeting either on laptop (files already open and ready to go) or as a printout. Use whichever format you will find it easier to take notes on.
We expect that both partners have participated in designing the project. The TA may ask either one of you to answer questions about the work you present. Splitting the work such that each of you does 1-2 of the analysis questions is likely to backfire, as you might have inconsistent tables or insufficient understanding of work done by your partner.
Be on time to your design check. If one partner is sick, contact the TA and try to reschedule rather than have only one person do the design check.
Your design check grade will be based on whether you had viable ideas for each of the questions and were able to explain them adequately to the TA (for example, we expect you to be able to describe why you picked a particular plot or table format). Your answers do not have to be perfect, but they do need to illustrate that you’ve thought about the questions and what will be required to answer them. The TA will give you feedback to consider as part of your final implementation of the project.
Your design check grade will be worth roughly a third of your overall project grade. Failure to account for key design feedback in your final solution may result in a deduction on your analysis stage grade (for example, a check moving to a check minus).
For your final handin, submit one code file named transit-analysis.arr containing all of your code for producing plots and tables for this project. Put a summary of the plots, tables, and conclusions into a separate document called transit-report.pdf. Also put your project reflection into this file. Nothing is required to print in the interactions window when we run your file, but your analysis answers should include comments indicating which variable names or expressions yield the data on which you based your answers.
You will get grades on each of Functionality, Design, Testing, and Code Clarity for this assignment.
Functionality – Key metrics:
Testing – Key metrics:
Design – Key metrics:
Code Clarity – Key metrics:
You can pass the project (with a check-minus) even if you either (a) skip the table-summary function or (b) have to massage some of the tables by hand rather than through code. A project that does not meet either of these baseline requirements will earn a fail on functionality.
A high-check on functionality will require that you wrote appropriate code to perform each analysis and wrote a working table-summary function. The difference between check-plus and check will lie in whether you chose and used appropriate functions to produce your tables and analyses.
For design, the difference between high-check and check will lie in whether your computations that create additional tables are clear and well-structured, rather than appearing as if you made some messy choices just to get things to work.
Hand in your two files to the Google Form (https://goo.gl/forms/ctsggckercw11gSm2).
As always, make sure you receive your confirmation email. Simply submitting the Google Form does not mean you have successfully submitted your project.