The answers are highlighted in colored boxes interleaved with the questions.
You are working for a public health organization that is concerned about the spread of malaria. The organization has been working in a particular town, distributing mosquito nets to some homes.
The team is trying to create a formula (a.k.a. a model) that will predict which homes are most likely to have people infected with malaria. Currently, they believe that:
Here’s the initial table that the team received from a previous organization that was working (unsuccessfully) on this problem:
HouseID dist nets? ----------------------------------- 1 250 yes Amy's 600 false 2 1 mile no
What problems do you see in the initial table?
The data in the HouseID column do not have consistent type or nature (names vs some sort of id)
The data in the dist column do not appear to be in the same units (feet vs miles, perhaps), and is not in a consistent format
The data in the nets? column has different values for the same situation (false and no), and does not have a consistent type
Which types should each column contain to best enable future computation?
HouseID: either strings capturing unique names or numbers capturing unique ids. As long as each row has a distinct value and they can be easily compared to each other (for purposes of looking up other data), either type is fine.
dist: numbers representing feet. The problem description is in terms of feet, so maintaining those same units will reduce the likelihood of errors.
nets? should be Boolean, since that type has only two values, which is what we want here. Strings like “yes” or “no” open up opportunities for typos unnecessarily.
To what extent could you clean up the table with programs, versus manual edits? You don’t need to write these programs, but describe the programs you would write if you were asked to actually clean this up.
If HouseID is just some unique number per house, one could write a program to transform the column to contain unique numbers in order. If instead the ids were descriptive names, they’d have to be fixed manually.
one could write a program to detect values in dist that look suspiciously large as feet, or that aren’t numbers, but a final human sanity check would be needed on any unusually large or small numeric values
One could write a program to identify all values in the nets? column that aren’t already booleans. Specific values (like “yes”) could be replaced by appropriate booleans by transforming the column.
Eventually, the table will also need to contain two more pieces of information: (1) a prediction of whether people in the house will contract malaria, and (2) data on whether people in the house did actually contract malaria. The team could then contrast these two pieces of information to see whether their formula/model is accurate (which would in turn tell them where to distribute mosquito nets).
How might someone obtain the additional information (prediction and real data)? Which data can be computed, which come from other tables, which must be gathered by other means?
The predictions can be computed from the dist and nets? columns by following the rules given earlier in the problem statement.
Data about which houses actually have people with malaria would have to be collected via some sort of field work or other observation of the households.
Assume the initial table is stored in a variable named
malaria-data. Write an expression that produces a table that also has the prediction information.
As a reminder, the tables operations are:
filter-by(table, row -> bool) -> table
sort-by(table, colname, true-if-ascending-order) -> table
transform-column(table, colname, row -> value) -> table
build-column(table, colname, row -> value) -> table
table.select-columns(list[colnames]) -> table
Also, you access a cell by writing `row[“colname”]``.
build-column(malaria-data, "prediction", lam(r :: Row): if (r["dist"] < 500) and not(r["nets"]): "high" else if (r["dist"] > 500) and (r["dist"] < 2000) and not(r["nets"]): "medium" else: # includes all with nets, as per diagram "low" end end)
malaria-data table has been modified to hold the prediction and real data. What might you do with the table/data to get an initial sense (so not fully analyzed, just a first pass) of whether the formula is a good predictor?
There are a couple of options. You could create some sort of graph or plot showing the prediction and reality for each house and see whether they seem to align.
You could also compute a new column showing how close the prediction and reality were for each house, then graph or compute the differences between the two values.
The point is that you can produce graphs or simple extra columns for a quick look at your data before you start applying more serious statistical analyses.
Recall the datatype for ancestor trees that we developed earlier in the semester:
data AncTree: | person( name :: String, eye :: String, mother :: AncTree, father :: AncTree ) | unknown end
Write a program that determines whether everyone in a given tree with the name “John” has a given eye color.
fun john-eyes(t :: AncTree, c :: color) -> Boolean: doc: "determine whether everyone named John has given eye color" cases (AncTree) t: | unknown => true | person(nm, eye, mo, fa) => if (nm == "John"): (eye == c) and john-eyes(mo, c) and john-eyes(fa, c) else: john-eyes(mo, c) and john-eyes(fa, c) end end end
Write a good set of tests for a program that checks whether every person shares their eye color with one of their parents. You may draw the trees as pictures (rather than write them in code). You don’t have to write the function, just the tests.
The tests should include:
Consider the following
tree1 = person("Bearna", "brown", person("Bluno", "blue", unknown, person("Carter", "brown", unknown, unknown)), person("Josiah", "blue", person("Miekie", "green", unknown, unknown), unknown))
Assume you evaluate the expression
john-eyes(tree1, "green"). At the point where the function is called on the person named Carter, what do the program dictionary and memory look like?
Prog Dict Memory ----------------------------------------------- john-eyes -> func tree1 -> loc 1005 loc 1001 -> person(Carter, "brown", unknown, unknown) loc 1002 -> person(Bluno, "blue", unknown, loc 1001) loc 1003 -> person(Miekie, "green", unknown, unknown) loc 1004 -> person(Josiah, "blue", loc 1003, unknown) loc 1005 -> person(Bearna, "brown", loc 1002, loc 1004) *************** t -> loc 1005 c -> "green" *************** t -> loc 1002 c -> "green" *************** t -> loc 1001 c -> "green"
The asterisks mark off the local areas of dictionary for the respective recursive calls to john-eyes. As each call finishes, the local dictionary will be erased and the previous one will get restored.
Note that memory does not change once the initial tree is set up. Only the dictionary changes.
This dictionary did not include all the names for the fields of the current person. You can include those or not (if not, the names reference position in the person objects in memory).
Someone is trying to write a program to return the longest name in an ancestor tree (or one of the longest in case of a tie). So far, they have the following code:
fun longest-name(t :: AncTree) -> String: doc: "produce the longest name in the given tree" cases (AncTree) t: | unknown => "none" | person(n, e, m, f) => if string-length(longest-name(m)) > string-length(longest-name(f)): longest-name(m) else: longest-name(f) end end end
returning “none” in the unknown case is suspicious – what if everyone in the tree had names shorter than 4 characters?
the program never works with the length of the name
n of the current person.
For the first case, have a tree in which all names were shorter than 4 characters
For the second case, have a tree with one person and two unknown parents
fun longest-name(t :: AncTree) -> String: doc: "produce the longest name in the given tree" cases (AncTree) t: | unknown => "none" | person(n, e, m, f) => if string-length(longest-name(m)) > string-length(longest-name(f)): if string-length(longest-name(m)) > string-length(n): longest-name(m) else: n end else: if string-length(longest-name(f)) > string-length(n): longest-name(f) else: n end end end end
Side note: If you think the nested ifs look messy (which they are), you could write a helper function that returns the longer of two strings, then use that twice, as in:
longer(str1, longer(str2, str3))
Ideally, the code would run longest-name only once on each of the mother and father, as follows:
fun longest-name2(t :: AncTree) -> String: doc: "produce the longest name in the given tree" cases (AncTree) t: | unknown => "none" | person(n, e, m, f) => mlong = longest-name2(m) flong = longest-name2(f) if string-length(mlong) > string-length(flong): if string-length(mlong) > string-length(n): mlong else: n end else: if string-length(flong) > string-length(n): flong else: n end end end end
You could similarly name the string-lengths of the various strings, but the recursive calls are more expensive and hence more of an issue.