Class summary: Table Organization
Copyright (c) 2017 Kathi Fisler
1 A Recap on Nested Functions
There were some lingering questions from the last lecture about nested functions, especially for use with table functions. We started with a review, using the following example.
fun filter-by-discount(t :: Table, d :: String) -> Table:
  doc: "filter table to rows with given discount"
  fun has-discount(r :: Row) -> Boolean:
    r["discount"] == d
  end
  filter-by(t, has-discount)
end

student-tickets =
  sum(filter-by-discount(event-data, "student"),
    "tickcount")
We went through a justification of why has-discount needs to be nested within filter-by-discount. We also reviewed how the call to filter-by-discount works, showing how d gets its value, then how has-discount gets used on the table rows.
The lecture capture shows both explanations. We have also summarized them in an animated PowerPoint document [PDF version].
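To make the walkthrough concrete, here is a minimal sketch that runs the same function on a small table. The rows in event-data are invented for illustration, and the code assumes the course's table functions (filter-by and sum) are already loaded, as in the example above.

```pyret
# Assumption: the course's table functions (filter-by, sum) are loaded.
# The table contents below are made up for illustration.
event-data = table: name, discount, tickcount
  row: "Josie", "student", 2
  row: "Sam", "none", 1
  row: "Bart", "student", 5
end

fun filter-by-discount(t :: Table, d :: String) -> Table:
  doc: "filter table to rows with given discount"
  # has-discount can refer to d only because it is defined
  # inside filter-by-discount, where d is a parameter
  fun has-discount(r :: Row) -> Boolean:
    r["discount"] == d
  end
  filter-by(t, has-discount)
end

check:
  # d is "student" for this call, so only the rows for
  # Josie (2 tickets) and Bart (5 tickets) are kept
  sum(filter-by-discount(event-data, "student"), "tickcount") is 7
end
```

If has-discount were moved outside filter-by-discount, its body would mention a d that is not defined anywhere it can see, which is why the nesting is needed.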
1.1 Anonymous Functions (lambda)
If you are getting comfortable writing filter-by and build-column expressions, you may be getting tired of having to write out the nested functions all the time. If you are ready for a shorthand, this section is for you (if you aren’t ready for this yet, that’s also fine).
Notice how in filter-by-discount the function has-discount is somewhat temporary: we create it just so we can give it as the argument to filter-by. We aren’t planning to use the name again after writing the filter-by call.
For situations such as these, Pyret provides the ability to define anonymous functions – functions with arguments and bodies, but no names. Here’s how the same code appears written with an anonymous function instead:
fun filter-by-discount(t :: Table, d :: String) -> Table:
  doc: "filter table to rows with given discount"
  filter-by(t, lam(r): r["discount"] == d end)
end
lam (short for the Greek letter lambda, which is used for functions in the mathematical foundations of programming languages) says "make an anonymous function". The function still takes the same parameters and still has the same body, as well as the end marker. But the name has been stripped off.
We have also stripped off the types (though we could have included them). Why? Because we tend to use anonymous functions in very localized situations like this, where the types are easy to see from the usage context (i.e., we know that the function argument to filter-by takes a Row and returns a Boolean).
You are welcome to use lam functions or not, as you see fit. If you’re comfortable with them, they do make your code tighter. But it’s also fine to continue to use named functions, as the idea of anonymous functions can be a bit abstract at first.
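If it helps to see the three styles side by side, here is a small sketch that uses a plain list instead of a table, so it does not depend on the course's table functions. The list contents and the names counts and is-big are invented for illustration. It shows a named function, the same function written with lam, and the lam version with its types included.

```pyret
# made-up data for illustration
counts = [list: 3, 8, 1, 12]

# a named helper that we only need once
fun is-big(n :: Number) -> Boolean:
  n > 5
end

check:
  # named function passed as the argument
  counts.filter(is-big) is [list: 8, 12]
  # the same function, written anonymously without types
  counts.filter(lam(n): n > 5 end) is [list: 8, 12]
  # the anonymous version with the types written out
  counts.filter(lam(n :: Number) -> Boolean: n > 5 end) is [list: 8, 12]
end
```

All three calls produce the same result, so the choice between them is a matter of readability.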
2 Organizing Tables to Tasks
We began by looking at a Google Sheet with data about the regional demographics of recent Brown entering classes. This data comes from the Brown Admissions Factbook, First-Year Class Geographic Profile (the overall factbook has a lot of data on Brown’s student and employee populations).
Here are some questions we might want to ask about this data:
Which region has had the largest range of population over the last four years?
What percentage of the class is from a particular region in a given year?
What percentage of the total student body (first-years through seniors) in a given year is from a particular region?
We asked which types of charts and plots made the most sense for each of these three questions. We also asked what organization of the data would be best suited to generating each of those charts. As a reminder, the available charts are listed in the tables-functions documentation page.
We generally concluded that:
The first question calls for a box-plot. Generating that needs a version of the table with rows for years and columns for regions.
The second question calls for a pie chart, which we can generate with existing columns.
The third question also calls for a pie chart, but the table needs an additional column that sums data from the years to be included in the aggregate.
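As a rough sketch of the extra column needed for the third question, suppose the table has one row per region and one column per entering year. The table name, column names, and numbers below are all invented for illustration (they are not the Factbook data), and the code assumes the course's build-column and sum functions are loaded, with build-column taking a table, a new column name, and a function from a Row to the new value.

```pyret
# Assumption: the course's table functions (build-column, sum) are loaded.
# Made-up numbers: one row per region, one column per entering year.
regions = table: region, y2014, y2015, y2016, y2017
  row: "New England", 300, 310, 290, 305
  row: "Mid-Atlantic", 400, 420, 410, 415
end

# add a column holding each region's total across the four classes
with-totals =
  build-column(regions, "total",
    lam(r): r["y2014"] + r["y2015"] + r["y2016"] + r["y2017"] end)

check:
  # 1205 for the first row plus 1645 for the second
  sum(with-totals, "total") is 2850
end
```

The new total column is the one a pie chart of the whole student body would be built from.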
2.1 Takeaways
Any data-oriented problem that you work on has a collection of variables, which represent information about a series of observations.
Typically, software tools that process data expect that you have a column for each variable and a row for each observation.
Many variables have ranges of values which might be important in your analysis. This raises a question of whether you should have a column tracking the range or a separate column for each range. Which one you want depends on the analysis question you are asking, and what inputs your plotting operations expect (i.e., an entire column versus some sort of range specification).
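To illustrate the two organizations, here is a small sketch that stores the same invented numbers both ways. The table names, column names, and values are assumptions made up for this example, and sum is assumed to be the course's table function used earlier in these notes.

```pyret
# Assumption: the course's sum function is loaded. All data is made up.

# one column per variable (region, year, count), one row per observation
one-row-per-observation = table: region, year, count
  row: "New England", 2016, 290
  row: "New England", 2017, 305
  row: "Mid-Atlantic", 2016, 410
  row: "Mid-Atlantic", 2017, 415
end

# the year variable spread out into a separate column for each value
one-column-per-year = table: region, y2016, y2017
  row: "New England", 290, 305
  row: "Mid-Atlantic", 410, 415
end

check:
  # both organizations hold the same information
  sum(one-row-per-observation, "count") is 1420
  (sum(one-column-per-year, "y2016") + sum(one-column-per-year, "y2017")) is 1420
end
```

The first layout suits operations that take an entire column as input. The second suits questions that compare the years within each region.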
Key idea in CS: Being able to choose a table organization based on your analysis questions is a key skill in data science. Identifying manipulations that would properly reformat a table is a skill from computer science. Actually using code to reformat a table from one organization to another is a skill from programming.
In short, the course is starting to move from learning mechanics to really bringing our topics together to work on problems that arise in real analysis contexts. Let the fun begin!