# Homework 1-4

Due September 29, 2015, 09:00am

## Reminders

If a problem is marked as “(Independent)”, you may only discuss the problem with course staff. Otherwise, you are free to discuss the concepts that will help you solve the problems with classmates as well as course staff. However, you are never allowed to simply copy answers.

## Other notes

For this following task, you will be sharing a spreadsheet with the handin account - `cs0931handinfall2015@gmail.com` . In the text file please label your answers clearly so that we know which answers go with which question.

Ranking is one way of understanding the distances between one senator and every other senator. Maybe we want to use this data to do other kinds of labeling or conditional formatting. To do that, we need to be fluent in writing formulas that use "conditions", like IF statements. In this task, we'll practice writing IF statements and other logical statements, like AND and OR.

1. Copy `HW1-4` into your Google Drive. Look at the formulas in cells `C1` through `C5`. In cell `D1`, write a formula that's `TRUE` if `A1` is less than `5` or if both `A2` and `A3` are bigger than `2`, but is `FALSE` otherwise.
2. In cell D2, write a formula that's `TRUE` if both `A1` equals `A2` and `A3` does not equal `A4`. You can use `<>` to mean “not equals” (so that `5 <> 3` is `TRUE`), or you can use the `NOT` function.
3. In cell `D3` write an expression whose value is `7` if either `A2` or `A3` is greater than 0, and is `A5` otherwise.
4. Rename your spreadsheet as `FirstLast_HW1-4`.

During the first days of this class, the TAs and I had to deal with the waitlist, for which we had student email addresses. But instead of a list of email addresses, we wanted to make up a nicely sorted class list. As you know, a typical Brown email address looks like `John_Hughes@brown.edu`. We wanted to make this into something like "Hughes, John", or perhaps "Hughes" in one column of a spreadsheet and "John" in the next column, so that sorting would be easy.

1. On the tab `Problem 2`, we've provided a list of seven brown emails; your job is to produce, on the tab "Problem 2 solution", a table of the names, sorted, where the first column is the last name (e.g. "Hughes"), the second is the first name ("John"), and the third is the email address ("John_Hughes@brown.edu"). The table must be sorted by last name and then further sorted by first name. The first row of your solution should therefore be
` Balan | Andrew | Andrew_Balan@brown.edu `.

To produce this, you'll need to do some work. I'd suggest you look at the google spreadsheets "Split()" function. Your intermediate work can go in the marked area on the `Problem 2 Solution` page (which you may extend to the right if you need it). Your final table should go in the area marked `Solution`. HINT: The SPLIT function can split over multiple characters at the same time. For example, `=Split("Data-and&stuff", "-&")` will produce `Data | and | stuff` in 3 cells. Also, it is possible to enter two ranges as the first two arguments in the SORT function. If you do this, the spreadsheet will sort the FIRST range of data based on the SECOND range of data (assuming that the ranges are of equal length).

Your solution should have the property that if the TAs change the data on the `Problem 2` tab, your solution will update appropriately. You might think that since sorting is needed, that's impossible. But look at the area on the `Problem 2` page marked `example`; you'll see that the data in the second column is just the data from the first column, sorted. Check out the formula in cell `C12` to see how that was done. The key point is that you can do sorting using a formula rather than the `data > sort` menu item.

To simplify matters a little, you may assume that the TAs will only change the data by replacing a name, NOT by adding or deleting names.

Now let's address the possibility that the amount of data in a sorting task might change. In the "Problem 3" tab, there's a list of numbers in the first column -- eleven of them -- and in the third column they appear sorted, these sorted values being produced by the formula `=sort(A2:A13, 1, true)` in cell C2.

1. In cell `B2`, write a formula that will tell how many data items there are in column 1. You may assume (i) The data are all together at the top, i.e., there are no blanks between the label "data" and the data items, and each data item is nonblank, so that the first blank entry in column A signals the end of the data. (ii) there are at most 40 data items, i.e., you only need to look at cells `A1` through `A41`.

You might want to use `Countif`, or `IsBlank`, or `CountA` to do this. Make certain that you don't count the "Data" label as a data item! Your answer, for the data count, should be 11. (Don't just type 11: type a formula which looks at the data and produces a the number of data items, which happens to be 11.)

2. In cell `C2`, change the formula to use the value from `B2` to produce the sorted data, even if there happen to be more that 11 or fewer than 11 (but no more than 40!). You'll need to use `OFFSET()` to produce the range of values to be sorted. Reading the documentation for "Offset" may not help a lot, so here's an example and an explanation:

In entry `D2`, the formula is `=max(OFFSET(A2, 0, 0, 5, 1))`. It produces the value 5, which is the maximum value in the first five entries of the data in column A. How does that work?

Well, first, `MAX()` produces the maximum of any number of numbers, so `=max(3, 5)` produces 5, while `=max(A1, B2, C17)` produces the largest of the values in the three named cells. (If any of them are not numbers, it produces an error.) But `MAX()` can also operate on a RANGE of cells, like `A1:A5`, which means "all the cells between A1 and A5, inclusive." (In our case, that would be a problem, since `A1` has the value "Data", which isn't a number!). A range can even be a block of cells, like `A1:C5`.

Back to the formula in `D2`: `=max(OFFSET(A2, 0, 0, 5, 1))`. It's taking the max of a range, so the `OFFSET()` must be producing a range. Here's what it's doing: it's making a range starting at cell `A2`, but with a few subtleties. First, `=OFFSET(A2, 1, 3)` would refer to cell `D3`: it's offset from `A2` by 1 down and 3 across. You've already seen this in an earlier homework. But a second variant of `OFFSET()` lets you produce a range of cells that starts at the cell described by the first three arguments, and extends down and to the right, by amounts specified by the next two. So `OFFSET(A2, 1, 3, 2, 5)` describes a range that starts with its upper left corner at D3 (just as above), but the resulting range is 2 rows deep and 5 columns wide. So the range is D3:H4. You might be thinking "It should be D3:I5", because "5" is two rows below 3, and I is five columns more than D. But the rule is that the last two items in the `OFFSET(A2, 1, 3, 2, 5)` expression tell you how TALL and how WIDE the results are; to get something two rows tall, starting at `D3`, we need to go to row 4, not row 5 and to column H, not column I.

Back to the formula we've actually got: `=max(OFFSET(A2, 0, 0, 5, 1))`. It says to create a range from A2 (offset not at all: that's what the two zeroes mean!), and the range should be 5 rows tall and 1 column wide. So the "OFFSET" produces the range `A2:A6`. The MAX then takes the maximum value in this set of numbers, which happens to be "5", and that's what's displayed in cell D2.

Now: you need to use OFFSET to make a range in this same way, but it should go from `A2` down to `A12` (unless you add a few more data items, in which case it needs to go further). Ask yourself, "Where will I get the last two items that go in the OFFSET expression?" How many columns do you want your range to have? How many rows? Once you've figured this out, you'll wrap your `OFFSET` expression in a `SORT()` and produce the sorted data.

1. Would your solution to problem 2 work if the data included some of the names shown in the `Problem 4` tab of the spreadsheet instead of the ones we gave you in problem 2? Briefly explain why or why not in the space given. Don't go fix your solution to problem 2: instead, explain what assumptions you made (if any) that turn out not to be valid.

Now we're going to go over a little data importing!

Go to the ` ImportExamples` tab of the spreadsheet. At the top is an example of performing a google search and using `importXML` to extract the data that was returned (i.e., all the web pages you could click on). The xpath part of the formula is more complex than anything we saw in class; it's

`//h3[@class='r']/a/@href`

Let's interpret that. The first part, `//h3`, says to search for anything with the tag `h3`, which is the style used to display results in a google search.

To see this, go to Google, search for "poodle", and when the results show up, note two things:

1. The URL has (probably) changed to a long, complicated string of text that contains `https://www.google.com/search?q=poodle` somewhere within it.
2. If you place your cursor above the first site returned (Wikipedia) and right click and select "inspect element", a new panel will open down below showing you all about this element: it has an `h3` tag, with the attribute `class="r"`...which exactly matches the first part of that Xpath: that first part says "look for h3 tags with the class attribute equal to 'r'". The second part says "and within there, look for 'a' tags, and tell me the value of the 'href' attribute." It turns out that "a" tags are hyperlinks, and the "href" attribute is the URL for the linked page. (Note: these instructions work on Google Chrome or Firefox. If you are using Safari, after you click "Inspect Element", you have to click on the source code button just over the results, and change it to "DOM Tree".)

Note: The URLs found by this XPath begin with a "/url..." because they are relative URLs. In other words, all these links require "https://www.google.com" prepended to them in order to work. The absolute URL for each result is kept in the `data-href` attribute, but we cannot access these because of a bug in Google Spreadsheets (it cannot correctly parse hypenated names). To get around this, we use substitute(cell name or string, target, replacement) to replace every instance of "/url?q=" with "", effectively getting rid of it. We also use substitute to get the right urls for image or news searches google returns, since those behave a bit differently than regular url results.
3. Fill in the formula in cell B5 to instead return the number of "results" listed at the top of the page (about 28.8 million when we tried it!). Once again, you can use "inspect element" to help you. Hint: Everything from `/a` onwards isn't needed, because you want the actual contents of this entity, not the value of some attribute.
4. Let's import earthquake data provided by USGS from online CSV data. Create a new sheet called "Earthquake" and import all the earthquakes 2.5+ Richter in the last day.

In cell `A1`, use the `IMPORTDATA` command, using the following URL as the data source: `"http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.csv"`

5. Now go back to the ImportExamples sheet and look at cell `B20`: we've used `importHTML`, which is a lot like `importXML`, except that it handles less structured webpages better (but does less with them). In this case, we've opened a webpage that lists incidents of piracy on the high seas, and said that we're looking for a TABLE within that page (we tried both "list" and "table", and "table" worked; we could have used the "inspect element" trick to save ourselves the trouble). We've said that we want the very first table (that's the "1" at the end of the formula), and the results show us several piracy incidents.

The sad thing is that it doesn't show as many as the webpage itself does. Web servers don't have to give the same result to Google Spreadsheets as they do to a browser, and this seems to be an example of that happening: we only get about 10 results.

Below that, in cell `B34`, is a formula that should load up data from the national earthquake center...but their webpage's XML is broken, and Google's XML-reading program can't make sense of it.

Below that, in cell `B36`, is a formula for reading the low temperatures listed on the weather page for zipcode 02912; I found the URL by going to weather.gov and entering "02912" in the search box. I then figured out what to search for (i.e., the Xpath) by using "inspect element".

Change this so that cell `B36` shows the current temperature instead of starting a table of low temperatures.

## Handin

Share the following file with `cs0931handinfall2015@gmail.com`

• `FirstLast_HW1-4` google spreadsheet
• In the above file name, "FirstLast" should be replaced with your first and last name. If it is not, we will take off points. Make sure every task has been completed.