Homework 1-4

Due September 29, 2015, 09:00am

Reminders

If a problem is marked as “(Independent)”, you may only discuss the problem with course staff. Otherwise, you are free to discuss the concepts that will help you solve the problems with classmates as well as course staff. However, you are never allowed to simply copy answers.

Other notes

For this following task, you will be sharing a spreadsheet with the handin account - cs0931handinfall2015@gmail.com . In the text file please label your answers clearly so that we know which answers go with which question.

Task 1

Ranking is one way of understanding the distances between one senator and every other senator. Maybe we want to use this data to do other kinds of labeling or conditional formatting. To do that, we need to be fluent in writing formulas that use "conditions", like IF statements. In this task, we'll practice writing IF statements and other logical statements, like AND and OR.

  1. Copy HW1-4 into your Google Drive. Look at the formulas in cells C1 through C5. In cell D1, write a formula that's TRUE if A1 is less than 5 or if both A2 and A3 are bigger than 2, but is FALSE otherwise.
  2. In cell D2, write a formula that's TRUE if both A1 equals A2 and A3 does not equal A4. You can use <> to mean “not equals” (so that 5 <> 3 is TRUE), or you can use the NOT function.
  3. In cell D3 write an expression whose value is 7 if either A2 or A3 is greater than 0, and is A5 otherwise.
  4. Rename your spreadsheet as FirstLast_HW1-4.

Task 2

During the first days of this class, the TAs and I had to deal with the waitlist, for which we had student email addresses. But instead of a list of email addresses, we wanted to make up a nicely sorted class list. As you know, a typical Brown email address looks like John_Hughes@brown.edu. We wanted to make this into something like "Hughes, John", or perhaps "Hughes" in one column of a spreadsheet and "John" in the next column, so that sorting would be easy.

  1. On the tab Problem 2, we've provided a list of seven brown emails; your job is to produce, on the tab "Problem 2 solution", a table of the names, sorted, where the first column is the last name (e.g. "Hughes"), the second is the first name ("John"), and the third is the email address ("John_Hughes@brown.edu"). The table must be sorted by last name and then further sorted by first name. The first row of your solution should therefore be
    Balan | Andrew | Andrew_Balan@brown.edu .

    To produce this, you'll need to do some work. I'd suggest you look at the google spreadsheets "Split()" function. Your intermediate work can go in the marked area on the Problem 2 Solution page (which you may extend to the right if you need it). Your final table should go in the area marked Solution. HINT: The SPLIT function can split over multiple characters at the same time. For example, =Split("Data-and&stuff", "-&") will produce Data | and | stuff in 3 cells. Also, it is possible to enter two ranges as the first two arguments in the SORT function. If you do this, the spreadsheet will sort the FIRST range of data based on the SECOND range of data (assuming that the ranges are of equal length).

    Your solution should have the property that if the TAs change the data on the Problem 2 tab, your solution will update appropriately. You might think that since sorting is needed, that's impossible. But look at the area on the Problem 2 page marked example; you'll see that the data in the second column is just the data from the first column, sorted. Check out the formula in cell C12 to see how that was done. The key point is that you can do sorting using a formula rather than the data > sort menu item.

    To simplify matters a little, you may assume that the TAs will only change the data by replacing a name, NOT by adding or deleting names.

Task 3

Now let's address the possibility that the amount of data in a sorting task might change. In the "Problem 3" tab, there's a list of numbers in the first column -- eleven of them -- and in the third column they appear sorted, these sorted values being produced by the formula =sort(A2:A13, 1, true) in cell C2.

  1. In cell B2, write a formula that will tell how many data items there are in column 1. You may assume (i) The data are all together at the top, i.e., there are no blanks between the label "data" and the data items, and each data item is nonblank, so that the first blank entry in column A signals the end of the data. (ii) there are at most 40 data items, i.e., you only need to look at cells A1 through A41.

    You might want to use Countif, or IsBlank, or CountA to do this. Make certain that you don't count the "Data" label as a data item! Your answer, for the data count, should be 11. (Don't just type 11: type a formula which looks at the data and produces a the number of data items, which happens to be 11.)

  2. In cell C2, change the formula to use the value from B2 to produce the sorted data, even if there happen to be more that 11 or fewer than 11 (but no more than 40!). You'll need to use OFFSET() to produce the range of values to be sorted. Reading the documentation for "Offset" may not help a lot, so here's an example and an explanation:

    In entry D2, the formula is =max(OFFSET(A2, 0, 0, 5, 1)). It produces the value 5, which is the maximum value in the first five entries of the data in column A. How does that work?

    Well, first, MAX() produces the maximum of any number of numbers, so =max(3, 5) produces 5, while =max(A1, B2, C17) produces the largest of the values in the three named cells. (If any of them are not numbers, it produces an error.) But MAX() can also operate on a RANGE of cells, like A1:A5, which means "all the cells between A1 and A5, inclusive." (In our case, that would be a problem, since A1 has the value "Data", which isn't a number!). A range can even be a block of cells, like A1:C5.

    Back to the formula in D2: =max(OFFSET(A2, 0, 0, 5, 1)). It's taking the max of a range, so the OFFSET() must be producing a range. Here's what it's doing: it's making a range starting at cell A2, but with a few subtleties. First, =OFFSET(A2, 1, 3) would refer to cell D3: it's offset from A2 by 1 down and 3 across. You've already seen this in an earlier homework. But a second variant of OFFSET() lets you produce a range of cells that starts at the cell described by the first three arguments, and extends down and to the right, by amounts specified by the next two. So OFFSET(A2, 1, 3, 2, 5) describes a range that starts with its upper left corner at D3 (just as above), but the resulting range is 2 rows deep and 5 columns wide. So the range is D3:H4. You might be thinking "It should be D3:I5", because "5" is two rows below 3, and I is five columns more than D. But the rule is that the last two items in the OFFSET(A2, 1, 3, 2, 5) expression tell you how TALL and how WIDE the results are; to get something two rows tall, starting at D3, we need to go to row 4, not row 5 and to column H, not column I.

    Back to the formula we've actually got: =max(OFFSET(A2, 0, 0, 5, 1)). It says to create a range from A2 (offset not at all: that's what the two zeroes mean!), and the range should be 5 rows tall and 1 column wide. So the "OFFSET" produces the range A2:A6. The MAX then takes the maximum value in this set of numbers, which happens to be "5", and that's what's displayed in cell D2.

    Now: you need to use OFFSET to make a range in this same way, but it should go from A2 down to A12 (unless you add a few more data items, in which case it needs to go further). Ask yourself, "Where will I get the last two items that go in the OFFSET expression?" How many columns do you want your range to have? How many rows? Once you've figured this out, you'll wrap your OFFSET expression in a SORT() and produce the sorted data.

Task 4

  1. Would your solution to problem 2 work if the data included some of the names shown in the Problem 4 tab of the spreadsheet instead of the ones we gave you in problem 2? Briefly explain why or why not in the space given. Don't go fix your solution to problem 2: instead, explain what assumptions you made (if any) that turn out not to be valid.

Task 5: Data importing

Now we're going to go over a little data importing!

Go to the ImportExamples tab of the spreadsheet. At the top is an example of performing a google search and using importXML to extract the data that was returned (i.e., all the web pages you could click on). The xpath part of the formula is more complex than anything we saw in class; it's

//h3[@class='r']/a/@href

Let's interpret that. The first part, //h3, says to search for anything with the tag h3, which is the style used to display results in a google search.

To see this, go to Google, search for "poodle", and when the results show up, note two things:

  1. The URL has (probably) changed to a long, complicated string of text that contains https://www.google.com/search?q=poodle somewhere within it.
  2. If you place your cursor above the first site returned (Wikipedia) and right click and select "inspect element", a new panel will open down below showing you all about this element: it has an h3 tag, with the attribute class="r"...which exactly matches the first part of that Xpath: that first part says "look for h3 tags with the class attribute equal to 'r'". The second part says "and within there, look for 'a' tags, and tell me the value of the 'href' attribute." It turns out that "a" tags are hyperlinks, and the "href" attribute is the URL for the linked page. (Note: these instructions work on Google Chrome or Firefox. If you are using Safari, after you click "Inspect Element", you have to click on the source code button just over the results, and change it to "DOM Tree".)

    Note: The URLs found by this XPath begin with a "/url..." because they are relative URLs. In other words, all these links require "https://www.google.com" prepended to them in order to work. The absolute URL for each result is kept in the data-href attribute, but we cannot access these because of a bug in Google Spreadsheets (it cannot correctly parse hypenated names). To get around this, we use substitute(cell name or string, target, replacement) to replace every instance of "/url?q=" with "", effectively getting rid of it. We also use substitute to get the right urls for image or news searches google returns, since those behave a bit differently than regular url results.
  3. Fill in the formula in cell B5 to instead return the number of "results" listed at the top of the page (about 28.8 million when we tried it!). Once again, you can use "inspect element" to help you. Hint: Everything from /a onwards isn't needed, because you want the actual contents of this entity, not the value of some attribute.
  4. Let's import earthquake data provided by USGS from online CSV data. Create a new sheet called "Earthquake" and import all the earthquakes 2.5+ Richter in the last day.

    In cell A1, use the IMPORTDATA command, using the following URL as the data source: "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.csv"

  5. Now go back to the ImportExamples sheet and look at cell B20: we've used importHTML, which is a lot like importXML, except that it handles less structured webpages better (but does less with them). In this case, we've opened a webpage that lists incidents of piracy on the high seas, and said that we're looking for a TABLE within that page (we tried both "list" and "table", and "table" worked; we could have used the "inspect element" trick to save ourselves the trouble). We've said that we want the very first table (that's the "1" at the end of the formula), and the results show us several piracy incidents.

    The sad thing is that it doesn't show as many as the webpage itself does. Web servers don't have to give the same result to Google Spreadsheets as they do to a browser, and this seems to be an example of that happening: we only get about 10 results.

    Below that, in cell B34, is a formula that should load up data from the national earthquake center...but their webpage's XML is broken, and Google's XML-reading program can't make sense of it.

    Below that, in cell B36, is a formula for reading the low temperatures listed on the weather page for zipcode 02912; I found the URL by going to weather.gov and entering "02912" in the search box. I then figured out what to search for (i.e., the Xpath) by using "inspect element".

    Change this so that cell B36 shows the current temperature instead of starting a table of low temperatures.

Handin

Share the following file with cs0931handinfall2015@gmail.com

  • FirstLast_HW1-4 google spreadsheet
  • In the above file name, "FirstLast" should be replaced with your first and last name. If it is not, we will take off points. Make sure every task has been completed.