Class summary: Introduction to Trees
Copyright (c) 2017 Kathi Fisler
This material is not in the textbook.
1 Data Structures for Family Trees
Imagine that we wanted to represent genealogy information (information about people’s biological parents and genetic traits). Here’s a picture showing the relationships between people and their parents (a "family tree").
To capture this in code, we might create a table such as the following:
family = table: name, birthyear, eyecolor, mother, father
  row: "Anna", 1997, "blue", "Susan", "Charlie"
  row: "Susan", 1971, "blue", "Ellen", "Bill"
  row: "Charlie", 1972, "green", "NoInfo", "NoInfo"
  row: "Ellen", 1945, "brown", "Laura", "John"
  ...
end
Assume we wanted to be able to answer questions such as the following:
How frequent is each eye color?
How many generations do we have information for?
What’s the average age of mothers (or fathers) at time of birth?
Is one specific person an ancestor of another specific person?
1.1 Family Trees as Tables
Let’s say I wanted to write a function to compute someone’s grandparents (at least, those grandparents known in the tree):
fun grandparents(of-name :: String) -> List<String>:
  ...
where:
  grandparents("Anna") is [list: "Ellen", "Bill"]
  grandparents("Laura") is [list:]
  grandparents("Kathi") is [list:]
end
What would be involved in doing that computation? What subtasks would we identify/what functions would we write?
Need to go from a name to the mother
Need to go from a name to the father
Let’s write one of these functions to see what it would look like:
import lists as L

fun get-mother(of-name :: String, from-family :: Table):
  person-row =
    filter-by(from-family,
      lam(r :: Row): r["name"] == of-name end).row-n(0)
  person-row["mother"]
where:
  get-mother("Anna", family) is "Susan"
end
What happens if the person we asked for isn’t in the table (meaning that we don’t know their family history)? Right now, we get a Pyret error. The error arises because we shouldn’t call row-n(0) unless we know that the filter found a row for the named person. We could modify the code, but that would be premature.
As always, start with examples: what should the function produce if the named person doesn’t have a row in the table?
if we raise an error, we can’t use this function to get whichever grandparents are known (the raise would terminate the function)
if we use something like "unknown", we can’t tell the difference between a real name and this value (both are strings)
in practice, we want to return an answer of a _different type_, to avoid both problems. Here, we could return false (the boolean) to indicate that the person wasn’t found.
fun get-mother2(of-name :: String, from-family :: Table):
  person-table =
    filter-by(from-family, lam(r :: Row): r["name"] == of-name end)
  if person-table.length() > 0:
    person-table.row-n(0)["mother"]
  else:
    false
  end
where:
  get-mother2("Anna", family) is "Susan"
  get-mother2("Fred", family) is false
end
If you imagine chaining together calls to get-mother in order to find ancestors (and having to also do that on the father’s side), you’ll quickly see that we end up doing a lot of table filtering, which seems inefficient.
Look back at the family tree picture. We don’t do any complicated filtering there – we just follow the line in the picture immediately from a person to their mother or father. Can we get that idea in code instead? Yes, through data blocks.
1.2 Creating a Data block for Trees
For this approach, we want to create a data block for Family Trees that has a variant (constructor) for setting up a person. Look back at our picture – what information makes up a person? Their name, their mother, and their father. That suggests the following pattern, which basically turns a row into a data block:
data FamTree:
  | person(
      name :: String,
      mother :: String,
      father :: String
    )
end
Try to build the family tree from the picture using this data:
anna-person = person("Anna", "Susan", "Charlie")
susan-person = person("Susan", "Ellen", "Bill")
Wait – this seems weird – we have one family (tree), but we’re setting up separate people? Do we maybe want a list of this information instead?
family-lst =
  [list:
    person("Anna", "Susan", "Charlie"),
    person("Susan", "Ellen", "Bill")
  ]
This is better (one piece of data for the entire family tree), but it still seems to be missing the "tree-ness" of the picture. Note that in the picture, it is easy to get from Anna to her grandparents. Here, there’s this list and we have to look across the people to find the next generation. Could we do better?
Remember that we can make the mother and father be any type we would like. They don’t have to be Strings. In fact when we look at the picture, what we see up the mother and father sides is an entire family tree. Wouldn’t this then be better?
data FamTree2:
  | person2(
      name :: String,
      mother :: FamTree2,
      father :: FamTree2
    )
end
Try writing the family tree using this definition instead. Do the part starting just from Susan for now.
Hopefully, you got this far, but there’s a question of what to put in the ellipses (the cases in which we don’t know what person goes in there)
susan-as-tree =
  person2("Susan",
    person2("Ellen", ..., ...),
    person2("Bill",
      person2("Laura", ..., ...),
      person2("John", ..., ...))
    )
How do we fill in the ellipses? Could we use something like false?
susan-as-tree =
  person2("Susan",
    person2("Ellen", false, false),
    person2("Bill",
      person2("Laura", false, false),
      person2("John", false, false))
    )
Oops – that didn’t work. Why not? Our data block requires the mother and father to be family trees, but false isn’t a family tree. Maybe we could relax the type of mother/father to allow either a family tree or a Boolean, but there’s actually a better approach. We were only using false because we needed some kind of data that we could distinguish from a real name. We can get the same effect by adding another variant of family tree, one corresponding to an "empty" tree (or a tree with no people):
data FamTree:
  | unknown
  | person(
      name :: String,
      mother :: FamTree,
      father :: FamTree
    )
end
Now, we can finish our example
susan-tree =
  person("Susan",
    person("Ellen", unknown, unknown),
    person("Bill",
      person("Laura", unknown, unknown),
      person("John", unknown, unknown))
    )
Or we can build up the entire family:
the-family =
  person("Anna",
    susan-tree,
    person("Charlie", unknown, unknown))
How would we find Susan’s mother?
susan-tree.mother
This gives the entire person structure. What if I want her name?
susan-tree.mother.name
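Direct field access like this is what makes the earlier grandparents question easy to answer on a tree. The following is a minimal sketch rather than code from the class: the names known-name, parent-names, and tree-grandparents are hypothetical helpers introduced only for illustration, and the tree repeats susan-tree and the-family from above so the block stands on its own.

```pyret
data FamTree:
  | unknown
  | person(name :: String, mother :: FamTree, father :: FamTree)
end

susan-tree =
  person("Susan",
    person("Ellen", unknown, unknown),
    person("Bill",
      person("Laura", unknown, unknown),
      person("John", unknown, unknown)))

the-family =
  person("Anna",
    susan-tree,
    person("Charlie", unknown, unknown))

# hypothetical helper: the root's name as a one-element list,
# or an empty list when the tree is unknown
fun known-name(ft :: FamTree) -> List<String>:
  cases (FamTree) ft:
    | unknown => empty
    | person(name, mother, father) => [list: name]
  end
end

# hypothetical helper: names of the known parents of the root person
fun parent-names(ft :: FamTree) -> List<String>:
  cases (FamTree) ft:
    | unknown => empty
    | person(name, mother, father) =>
      known-name(mother).append(known-name(father))
  end
end

# hypothetical: names of the known grandparents of the root person
fun tree-grandparents(ft :: FamTree) -> List<String>:
  cases (FamTree) ft:
    | unknown => empty
    | person(name, mother, father) =>
      parent-names(mother).append(parent-names(father))
  end
where:
  tree-grandparents(the-family) is [list: "Ellen", "Bill"]
  tree-grandparents(susan-tree) is [list: "Laura", "John"]
  tree-grandparents(unknown) is empty
end
```

Notice that no searching or filtering is needed: each step just follows the mother or father field, and unknown yields an empty list instead of an error.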
We still need to come back to the discussion comparing tables and trees, but first, let’s write some programs over trees.
2 Programming Over Trees
Write count-gens, which takes a FamTree and determines the maximum number of generations up any branch of the tree. Don’t forget to write examples!
It might be easier to try to think out what computations would have to occur to build up the answer here. Start with
count-gens(Anna)
What should the answer be? Anna contributes a generation. The number of generations in her family must be based on the number of generations in each of her mother’s and father’s families. Do we add all of those up? No, we don’t want to count her mother and father as separate generations. Perhaps we should use max to keep the largest number of generations from her parents. This is the sequence of steps informally (informal because we are using the names of the people to refer to their trees without having defined those trees separately)
> count-gens(Anna)
> 1 + max(count-gens(Susan), count-gens(Charlie))
> 1 + max(1 + max(count-gens(Ellen), count-gens(Bill)),
          count-gens(Charlie))
> ...
Having seen this, let’s turn it into code:
fun count-gens(ft :: FamTree) -> Number:
  doc: "produce number of generations in longest branch of the tree"
  cases (FamTree) ft:
    | unknown => 0
    | person(name, mother, father) =>
      1 + num-max(count-gens(mother), count-gens(father))
  end
where:
  count-gens(unknown) is 0
  count-gens(the-family) is 4
end
2.1 The Trees Template
Did you dive in and try writing count-gens from scratch? Remember that when we did lists we had the notion of a template that captured how we traverse (aka, walk along) the entire data structure. The template expanded the data structure into cases, then made a recursive call on the rest of the list (which was also a list).
We can use that same approach here, developing a template for trees. In the tree case, however, there are recursive calls on each of the mother and the father. Here is the template for a family tree:
fun ft-func(ft :: FamTree) -> ???:
  cases (FamTree) ft:
    | unknown => ...
    | person(name, mother, father) =>
      ... name
      ... ft-func(mother)
      ... ft-func(father)
  end
end
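To see how the template gets filled in, here is a small sketch. The function count-people is a hypothetical example, not one from the class: it fills the unknown case with 0 and combines the two recursive calls with addition.

```pyret
data FamTree:
  | unknown
  | person(name :: String, mother :: FamTree, father :: FamTree)
end

# hypothetical example of filling in the template
fun count-people(ft :: FamTree) -> Number:
  doc: "count the known people in the tree"
  cases (FamTree) ft:
    | unknown => 0
    | person(name, mother, father) =>
      # one for this person, plus everyone on each parent's side
      1 + count-people(mother) + count-people(father)
  end
where:
  count-people(unknown) is 0
  count-people(person("Ellen", unknown, unknown)) is 1
  count-people(
    person("Susan",
      person("Ellen", unknown, unknown),
      person("Bill", unknown, unknown))) is 3
end
```

The shape is the same as count-gens; only the answer for unknown and the way the recursive results are combined change from one function to the next.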
Think about starting from this template as you try the next example.
2.2 Another Example
Write in-family, which takes a name and a FamTree and determines whether there is a person in the tree with that name. Don’t forget to write examples!
fun in-family(a-name :: String, ft :: FamTree) -> Boolean:
  doc: "determine whether family has a person with the given name"
  cases (FamTree) ft:
    | unknown => false
    | person(name, mother, father) =>
      (name == a-name) or
      in-family(a-name, mother) or
      in-family(a-name, father)
  end
where:
  in-family("Bill", unknown) is false
  in-family("Zoe", unknown) is false
  in-family("Susan", the-family) is true
  in-family("Zoe", the-family) is false
  in-family("John", the-family) is true
end
3 Binary Trees
What we have worked out here as family trees is actually a common data structure in CS called a binary tree (binary because each position in the tree refers to two other positions). If you go on in CS, you will see binary trees in many different contexts. Here, we are just pointing out the common term for what we have built.
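To make the connection concrete, here is a sketch of what a binary tree holding numbers might look like. Everything in it (BinTree, bt-leaf, bt-node, the field names, and tree-height) is a hypothetical naming chosen for illustration, not a standard Pyret library. Compare it with FamTree: bt-leaf plays the role of unknown, and bt-node plays the role of person.

```pyret
# hypothetical binary tree of numbers, shaped like FamTree
data BinTree:
  | bt-leaf
  | bt-node(item :: Number, left-child :: BinTree, right-child :: BinTree)
end

fun tree-height(bt :: BinTree) -> Number:
  doc: "number of nodes along the longest branch (same idea as count-gens)"
  cases (BinTree) bt:
    | bt-leaf => 0
    | bt-node(item, left-child, right-child) =>
      1 + num-max(tree-height(left-child), tree-height(right-child))
  end
where:
  tree-height(bt-leaf) is 0
  tree-height(bt-node(5, bt-node(3, bt-leaf, bt-leaf), bt-leaf)) is 2
end
```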
4 Tables Versus Trees
Let’s get back to the discussion about tables vs trees – what are the benefits of each?
Trees:
allow direct access to parents, rather than needing another table lookup to find parents
better support multiple people with the same name in the family
structure captures generations naturally
Tables:
capture siblings easily
feels more like a database of data on people
There are clearly tradeoffs here. In Computer Science, trees are often used instead of tables, because of the direct access to parents (and generally capturing the structure of the underlying data).
5 More Practice
If you finish those, extend the data block so that a person also has a birth year and an eye color. Think of some programs that you could write now that you have this information as well.
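One possible starting point is sketched below, under some assumptions: the names FamTreeInfo, unknown-info, person-info, one-if, and count-eye-color are hypothetical, and new names are used only so this block does not clash with the FamTree definition above. The birth years and eye colors come from the table at the start of these notes; Susan's father is left as unknown-info because the rows shown there do not include his details.

```pyret
# hypothetical extension of the person variant with two more fields
data FamTreeInfo:
  | unknown-info
  | person-info(
      name :: String,
      birthyear :: Number,
      eyecolor :: String,
      mother :: FamTreeInfo,
      father :: FamTreeInfo)
end

anna-info =
  person-info("Anna", 1997, "blue",
    person-info("Susan", 1971, "blue",
      person-info("Ellen", 1945, "brown", unknown-info, unknown-info),
      unknown-info),
    person-info("Charlie", 1972, "green", unknown-info, unknown-info))

# hypothetical helper: 1 when the condition holds, 0 otherwise
fun one-if(b :: Boolean) -> Number:
  if b: 1 else: 0 end
end

fun count-eye-color(ft :: FamTreeInfo, wanted :: String) -> Number:
  doc: "count the known people in the tree with the given eye color"
  cases (FamTreeInfo) ft:
    | unknown-info => 0
    | person-info(name, birthyear, eyecolor, mother, father) =>
      one-if(eyecolor == wanted) +
        count-eye-color(mother, wanted) +
        count-eye-color(father, wanted)
  end
where:
  count-eye-color(unknown-info, "blue") is 0
  count-eye-color(anna-info, "blue") is 2
  count-eye-color(anna-info, "brown") is 1
  count-eye-color(anna-info, "hazel") is 0
end
```

This follows the same template as count-gens and in-family; the extra fields simply show up as additional names in the person-info case.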