Trees: what are they good for?

HW 3

We discussed Homework 3 a bit. See the lecture capture for details.

Who cares about binary search trees?

We introduced binary search trees via the find operation, which determines whether and where a particular value is in the binary tree. We also defined add and remove operations. So: we can add and remove values from a tree data structure, and can search for values in that data structure. Who cares?

Sets

Imagine I’m tracking all of the books I’ve read. I sometimes re-read books, but I don’t care about tracking that; I just want to know, for a given book, whether I’ve read it. How could we implement something like this?

Rather than imagining a particular data structure (e.g., a list of books), let’s raise our level of abstraction slightly. I don’t particularly care how the books I’ve read are stored in the computer’s memory: I just need to be able to:

  • Record that I’ve read a book
  • Check to see if I’ve read a given book
  • Remove a book from my list (if I get a book’s title wrong, maybe)
  • Count how many books I’ve read

So it seems like we need some data type that has these operations:

  • add(item)
  • contains(item)
  • remove(item)
  • count()

As it turns out, computer scientists have a name for this data type: it’s called a Set, and it is based on the mathematical concept of a set. A Set is an example of an abstract data type. Notice that we’ve only specified operations on sets: we haven’t said anything about how sets are actually stored in memory.
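One way to write such an interface down in Python without committing to any storage is a typing.Protocol. The following is only a sketch: the name SetADT is hypothetical, chosen for illustration, and is not part of the course code.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SetADT(Protocol):
  """The four Set operations, with no storage decisions (hypothetical name)."""

  def add(self, item: Any) -> None:
    """Record item; adding the same item twice has no further effect."""
    ...

  def contains(self, item: Any) -> bool:
    """Return True if item has been added and not removed since."""
    ...

  def remove(self, item: Any) -> None:
    """Forget item."""
    ...

  def count(self) -> int:
    """Return the number of distinct items stored."""
    ...
```

Any class that defines these four methods satisfies the protocol, whatever it uses internally to hold the items.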

How could we implement such a data type?

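class TreeSet:
  def __init__(self):
    # BST is the binary search tree class from earlier in the course.
    self.tree = BST()

  def add(self, value):
    # Guard against duplicates in case BST.insert does not already reject them.
    if not self.contains(value):
      self.tree.insert(value)

  def contains(self, value):
    # find returns the matching node, or a falsy value if there is none.
    return bool(self.tree.find(value))

  def remove(self, value):
    # Removing a value that is not present does nothing.
    node = self.tree.find(value)
    if node:
      self.tree.remove(node)

  def count(self):
    return self.tree.number_of_nodes()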

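class HashSet:
  def __init__(self):
    # A dict is a hash table; only its keys matter here.
    self.data = {}

  def add(self, value):
    # Re-adding an existing key just overwrites its placeholder value.
    self.data[value] = True

  def contains(self, value):
    return value in self.data

  def remove(self, value):
    # Removing a value that is not present does nothing.
    if value in self.data:
      del self.data[value]

  def count(self):
    return len(self.data)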

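class ListSet:
  def __init__(self):
    self.data = []

  def add(self, value):
    # Check first so the list never holds duplicates.
    if value not in self.data:
      self.data.append(value)

  def contains(self, value):
    return value in self.data

  def remove(self, value):
    # list.remove raises ValueError for a missing value, so check first
    # to match the other implementations.
    if value in self.data:
      self.data.remove(value)

  def count(self):
    return len(self.data)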

All of these are implementations of the same abstract data type.

def use_set(s):
  s.add(4)
  s.add(2)
  s.add(9)
  s.remove(2)
  print(s.contains(4))
  print(s.contains(2))
  print(s.count())

This function’s behavior will be the same regardless of which implementation of Set we pass in. We’re taking advantage of two ideas: encapsulation (we don’t need to know how our chosen set implementation stores data) and polymorphism (we can pass any object implementing the Set methods to use_set and it will work).
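To make the point concrete, here is a sketch of yet another implementation, one that was not covered in lecture: a set stored as a sorted Python list and searched with the standard bisect module. The names SortedListSet and exercise are hypothetical; exercise follows the same steps as use_set but returns its results instead of printing them.

```python
from bisect import bisect_left


class SortedListSet:
  """Set stored as a sorted Python list (hypothetical, not from the lecture)."""

  def __init__(self):
    self.data = []

  def _index(self, value):
    # Position where value is, or where it would have to be inserted.
    return bisect_left(self.data, value)

  def add(self, value):
    i = self._index(value)
    if i == len(self.data) or self.data[i] != value:
      self.data.insert(i, value)

  def contains(self, value):
    i = self._index(value)
    return i < len(self.data) and self.data[i] == value

  def remove(self, value):
    i = self._index(value)
    if i < len(self.data) and self.data[i] == value:
      del self.data[i]

  def count(self):
    return len(self.data)


def exercise(s):
  """Same steps as use_set, returning the results instead of printing them."""
  s.add(4)
  s.add(2)
  s.add(9)
  s.remove(2)
  return (s.contains(4), s.contains(2), s.count())


print(exercise(SortedListSet()))  # (True, False, 2)
```

The calling code never looks at how the values are stored, so it cannot tell this class apart from any other class that offers the same four methods.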

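Depending on our use case, we might choose any of these implementations based on running time, memory usage, code simplicity, and so on. For example, ListSet is the simplest, but its contains scans the list, taking time proportional to the number of items; HashSet answers contains in roughly constant time on average; and TreeSet takes time proportional to the height of the tree.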
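Running time is the easiest of these trade-offs to observe directly. The sketch below uses the standard timeit module to time a membership test on a list and on a dict holding the same integers. The function name time_membership and the sizes are arbitrary choices for illustration, and the numbers printed will vary from machine to machine.

```python
import timeit


def time_membership(n=20_000, repeats=200):
  """Time a worst-case `in` test on a list and on a dict of n integers."""
  as_list = list(range(n))
  as_dict = dict.fromkeys(range(n), True)
  missing = -1  # not present, so the list scan must look at every element

  list_seconds = timeit.timeit(lambda: missing in as_list, number=repeats)
  dict_seconds = timeit.timeit(lambda: missing in as_dict, number=repeats)
  return list_seconds, dict_seconds


list_seconds, dict_seconds = time_membership()
print(f"list: {list_seconds:.6f}s  dict: {dict_seconds:.6f}s")
```

Increasing n should make the list timing grow roughly in proportion, while the dict timing should stay about the same.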

Python’s built-in sets

Python includes hash-based sets as a built-in datatype. They look like this:

> s = {1, 2, 3}
> s.add(4)
> 4 in s
True
> s.remove(2)
> 2 in s
False
> s
{1, 3, 4}
> s.add(3)
> s
{1, 3, 4}
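The four Set operations from earlier map directly onto the built-in type. A brief sketch using the book-tracking example (the titles are chosen arbitrarily):

```python
books_read = set()

# add(item): recording the same book twice leaves a single entry
books_read.add("Dune")
books_read.add("Emma")
books_read.add("Dune")

# contains(item) is spelled with the `in` operator
print("Dune" in books_read)     # True
print("Ulysses" in books_read)  # False

# remove(item): set.remove raises KeyError for a missing item,
# while set.discard silently does nothing
books_read.discard("Ulysses")
books_read.remove("Emma")

# count() is spelled len(...)
print(len(books_read))          # 1
```

Note that an empty set is written set(), because {} creates an empty dictionary.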