Final-Exam Practice Problem Solutions

The answers are highlighted in colored boxes interleaved with the questions.

Working with Tables

You are working for a public health organization that is concerned about the spread of malaria. The organization has been working in a particular town, distributing mosquito nets to some homes.

The team is trying to create a formula (a.k.a. a model) that will predict which homes are most likely to have people infected with malaria. Currently, they believe that:

those living within 500 feet of the lake but without mosquito nets have the highest risk,
those living between 500 and 2000 feet from the lake without nets have medium risk, and
those more than 2000 feet from the lake have low risk (whether or not they have nets).

Here’s the initial table that the team received from a previous organization that was working (unsuccessfully) on this problem:

    HouseID     dist        nets?
    -----------------------------------
      1         250          yes
     Amy's      600         false
      2        1 mile        no

What problems do you see in the initial table?
Answer
- The data in the HouseID column do not have consistent type or nature (names vs some sort of id)
- The data in the dist column do not appear to be in the same units (feet vs miles, perhaps), and is not in a consistent format
- The data in the nets? column has different values for the same situation (false and no), and does not have a consistent type
Which types should each column contain to best enable future computation?
Answer
- HouseID: either strings capturing unique names or numbers capturing unique ids. As long as each row has a distinct value and they can be easily compared to each other (for purposes of looking up other data), either type is fine.
- dist: numbers representing feet. The problem description is in terms of feet, so maintaining those same units will reduce the likelihood of errors.
- nets? should be Boolean, since that type has only two values, which is what we want here. Strings like “yes” or “no” open up opportunities for typos unnecessarily.
To what extent could you clean up the table with programs, versus manual edits? You don’t need to write these programs, but describe the programs you would write if you were asked to actually clean this up.
Answer
- If HouseID is just some unique number per house, one could write a program to transform the column to contain unique numbers in order. If instead the ids were descriptive names, they’d have to be fixed manually.
- one could write a program to detect values in dist that look suspiciously large as feet, or that aren’t numbers, but a final human sanity check would be needed on any unusually large or small numeric values
- One could write a program to identify all values in the nets? column that aren’t already booleans. Specific values (like “yes”) could be replaced by appropriate booleans by transforming the column.

Eventually, the table will also need to contain two more pieces of information: (1) a prediction of whether people in the house will contract malaria, and (2) data on whether people in the house did actually contract malaria. The team could then contrast these two pieces of information to see whether their formula/model is accurate (which would in turn tell them where to distribute mosquito nets).

How might someone obtain the additional information (prediction and real data)? Which data can be computed, which come from other tables, which must be gathered by other means?
Answer
- The predictions can be computed from the dist and nets? columns by following the rules given earlier in the problem statement.
- Data about which houses actually have people with malaria would have to be collected via some sort of field work or other observation of the households.
Assume the initial table is stored in a variable named malaria-data. Write an expression that produces a table that also has the prediction information.

As a reminder, the tables operations are:
- filter-by(table, row -> bool) -> table
- sort-by(table, colname, true-if-ascending-order) -> table
- transform-column(table, colname, row -> value) -> table
- build-column(table, colname, row -> value) -> table
- table.select-columns(list[colnames]) -> table
Also, you access a cell by writing `row[“colname”]``.
Answer
```
build-column(malaria-data, "prediction",
  lam(r :: Row): 
    if (r["dist"] < 500) and not(r["nets"]):
      "high"
    else if (r["dist"] > 500) and (r["dist"] < 2000) and not(r["nets"]):
      "medium"
    else: # includes all with nets, as per diagram
      "low"
    end
  end)
```
Assume the malaria-data table has been modified to hold the prediction and real data. What might you do with the table/data to get an initial sense (so not fully analyzed, just a first pass) of whether the formula is a good predictor?

Answer
There are a couple of options. You could create some sort of graph or plot showing the prediction and reality for each house and see whether they seem to align.

You could also compute a new column showing how close the prediction and reality were for each house, then graph or compute the differences between the two values.

The point is that you can produce graphs or simple extra columns for a quick look at your data before you start applying more serious statistical analyses.

Working with Trees

Recall the datatype for ancestor trees that we developed earlier in the semester:

data AncTree:
  | person(
      name :: String, 
      eye :: String, 
      mother :: AncTree,
      father :: AncTree )
  | unknown
end

Write a program that determines whether everyone in a given tree with the name “John” has a given eye color.

Answer

fun john-eyes(t :: AncTree, c :: color) -> Boolean:
  doc: "determine whether everyone named John has given eye color"
  cases (AncTree) t:
    | unknown => true
    | person(nm, eye, mo, fa) =>
      if (nm == "John"):
        (eye == c) and john-eyes(mo, c) and john-eyes(fa, c)
      else:
        john-eyes(mo, c) and john-eyes(fa, c)
   end
  end
end

Write a good set of tests for a program that checks whether every person shares their eye color with one of their parents. You may draw the trees as pictures (rather than write them in code). You don’t have to write the function, just the tests.
Answer
The tests should include:
- the unknown tree
- simple trees of three people (a person and their two parents), some where the condition holds and some where it does not
- trees with at least three generations where the condition is violated at each of the root, and inner mother, and an inner father
- a tree with at least three generations in which the condition holds

Consider the following AncTree:

tree1 = person("Bearna", "brown",
          person("Bluno", "blue", unknown, 
            person("Carter", "brown", unknown, unknown)),
          person("Josiah", "blue",
            person("Miekie", "green", unknown, unknown),
            unknown))

Assume you evaluate the expression john-eyes(tree1, "green"). At the point where the function is called on the person named Carter, what do the program dictionary and memory look like?

Answer

Prog Dict             Memory
-----------------------------------------------
 john-eyes -> func
 tree1 -> loc 1005    loc 1001 -> person(Carter, "brown", unknown, unknown)                            
                      loc 1002 -> person(Bluno, "blue", unknown, loc 1001)
                      loc 1003 -> person(Miekie, "green", unknown, unknown)
                      loc 1004 -> person(Josiah, "blue", loc 1003, unknown)
                      loc 1005 -> person(Bearna, "brown", loc 1002, loc 1004)
***************
t -> loc 1005 
c -> "green"
***************
t -> loc 1002                
c -> "green"
***************
t -> loc 1001                
c -> "green"

The asterisks mark off the local areas of dictionary for the respective recursive calls to john-eyes. As each call finishes, the local dictionary will be erased and the previous one will get restored.

Note that memory does not change once the initial tree is set up. Only the dictionary changes.

This dictionary did not include all the names for the fields of the current person. You can include those or not (if not, the names reference position in the person objects in memory).

Someone is trying to write a program to return the longest name in an ancestor tree (or one of the longest in case of a tie). So far, they have the following code:

fun longest-name(t :: AncTree) -> String:
  doc: "produce the longest name in the given tree"
  cases (AncTree) t:
    | unknown => "none"
    | person(n, e, m, f) => 
      if string-length(longest-name(m)) > string-length(longest-name(f)):
        longest-name(m)
      else: 
        longest-name(f)
      end
  end
end

Without running the code (either in a computer or manually), does anything look suspicious, or do you expect this to work?

Answer

returning “none” in the unknown case is suspicious – what if everyone in the tree had names shorter than 4 characters?
the program never works with the length of the name n of the current person.

For each thing that looks suspicious, can you draw an example tree for which the code would produce the wrong answer due to what you observed?

Answer

For the first case, have a tree in which all names were shorter than 4 characters
For the second case, have a tree with one person and two unknown parents

Fix the code to produce the right answer

Answer

fun longest-name(t :: AncTree) -> String:
  doc: "produce the longest name in the given tree"
  cases (AncTree) t:
    | unknown => "none"
    | person(n, e, m, f) => 
      if string-length(longest-name(m)) > string-length(longest-name(f)):
        if string-length(longest-name(m)) > string-length(n):
          longest-name(m)
        else:
          n
        end
      else: 
        if string-length(longest-name(f)) > string-length(n):
          longest-name(f)
        else:
          n
        end
      end
  end
end

Side note: If you think the nested ifs look messy (which they are), you could write a helper function that returns the longer of two strings, then use that twice, as in:

longer(str1, longer(str2, str3))

Comment on the run-time performance of your code. Do you see any work/computation/expressions that could be eliminated to make the code do fewer computations? If so, how would you change the code to avoid the extra work?

Answer
Ideally, the code would run longest-name only once on each of the mother and father, as follows:

fun longest-name2(t :: AncTree) -> String:
  doc: "produce the longest name in the given tree"
  cases (AncTree) t:
    | unknown => "none"
    | person(n, e, m, f) => 
      mlong = longest-name2(m)
      flong = longest-name2(f)
      if string-length(mlong) > string-length(flong):
        if string-length(mlong) > string-length(n):
          mlong
        else:
          n
        end
      else: 
        if string-length(flong) > string-length(n):
          flong
        else:
          n
        end
      end
  end
end

You could similarly name the string-lengths of the various strings, but the recursive calls are more expensive and hence more of an issue.