Tree-structured data: documents

Web documents

You’re probably viewing these notes in your web browser. Open up the source of the page (most browsers have a “View Source” menu item somewhere). What does the source look like?

The notes page is fairly complicated, so let’s look at a simpler page instead. Its source looks like this:

<html>
  <head>
    <title>This is a page</title>
  </head>
  <body>
    <p>Here is a paragraph of text</p>
    <p>Another paragraph, with <strong>bold</strong> text</p>
    <p>And a list:
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
      </ul>
    </p>
    <p>And a bolded list:
      <strong>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
        </ul>
      </strong>
    </p>
  </body>
</html>

What do we notice about the structure of this source?

HTML is made up of tags (the things in angle brackets). A tag can contain text; it can also contain other tags, which can contain other tags, and so on.

Why does HTML look like this? HTML documents have structure. They aren’t just plain text: there are formatted lists, bold text, and so on.

We won’t really learn how to write HTML in this course. We will, however, learn how to do computations over HTML documents in Python.

So: how might we represent the HTML document above?

Trees

Tree-structured data is quite common in computer science. In CSCI 0111, we saw one example: ancestry trees, recording the parents of individuals (and their parents, and their parents, and so on). In this class we’ll see another example: we’ll use a tree to represent HTML documents.

Our tree type will look like this:

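from dataclasses import dataclass

@dataclass
class HTMLTree:
    tag: str         # tag name, e.g. "p" or "strong"; "text" for plain text
    children: list   # child HTMLTree values, in document order
    text: str = ""   # only used by "text" tags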

Each instance of the HTMLTree datatype represents a single tag. The name of the tag (p, strong, etc.) is in the tag field. The tag’s children are in the children field.

In the HTML document above, there’s text in addition to tags. In our HTMLTree class, text is represented by a special text tag, with the text itself stored in the text field. A text tag never has children.

So, here’s a simple document:

HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])

This corresponds to the HTML:

<p>Text in a paragraph</p>
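
Tags nest in the tree the same way they nest in the HTML. As a sketch, here is the second paragraph of the page above built by hand, along with a small recursive function that counts the tags in a tree. The HTMLTree definition is repeated so the example runs on its own; count_tags is a hypothetical helper written for illustration, not something from the course library.

```python
from dataclasses import dataclass


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


# <p>Another paragraph, with <strong>bold</strong> text</p>
paragraph = HTMLTree("p", [
    HTMLTree("text", [], "Another paragraph, with "),
    HTMLTree("strong", [HTMLTree("text", [], "bold")]),
    HTMLTree("text", [], " text"),
])


def count_tags(tree: HTMLTree) -> int:
    """Count every tag in the tree, text tags included (hypothetical helper)."""
    # One for this tag, plus however many tags each child contains.
    return 1 + sum(count_tags(child) for child in tree.children)


print(count_tags(paragraph))  # 5
```

The function follows the shape of the data: a tree is a tag plus a list of smaller trees, so the function handles the tag and calls itself on each child.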

If our documents get much bigger, defining them by hand like this is going to get pretty annoying. We’ve written a little HTML library for the class, and that’s where HTMLTree is defined. There’s a function in the library, parse, that takes a string of HTML and turns it into a tree:

> tree = parse("<p>Text in a paragraph</p>")
> tree
HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])
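
We don’t need to know how the library’s parse works in order to use it, but it may help to see that a function like it is not magic. Below is a minimal sketch built on HTMLParser from Python’s standard library. The names TreeBuilder and parse_sketch are made up for this example, and this is not necessarily how the course library does it. The sketch assumes well-formed HTML in which every opening tag has a matching closing tag, and it ignores attributes.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


class TreeBuilder(HTMLParser):
    """Builds an HTMLTree from well-formed HTML (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        # A placeholder root collects the top-level tags; the top of the
        # stack is always the tag we are currently inside.
        self.stack = [HTMLTree("root", [])]

    def handle_starttag(self, tag, attrs):
        node = HTMLTree(tag, [])
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes this closing tag matches the most recent opening tag.
        self.stack.pop()

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.stack[-1].children.append(HTMLTree("text", [], data))


def parse_sketch(html: str) -> HTMLTree:
    """Return the tree for the first top-level tag in html."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.stack[0].children[0]


tree = parse_sketch("<p>Text in a paragraph</p>")
print(tree.tag, tree.children[0].text)
```

The stack is what turns a flat string into a tree: an opening tag pushes a new node, a closing tag pops it, and anything seen in between becomes a child of whatever is on top.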

We can also print out the HTML string a tree corresponds to:

> print_html(tree)
<p>Text in a paragraph</p>
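
Going from a tree back to a string is a natural recursive computation. Here is a rough sketch of how it could work. The function to_html is a hypothetical stand-in written for these notes, not the library’s print_html, and it assumes the text contains no characters that need escaping (such as < or &).

```python
from dataclasses import dataclass


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


def to_html(tree: HTMLTree) -> str:
    """Return the HTML string for a tree (hypothetical helper)."""
    if tree.tag == "text":
        # Text tags have no children and no angle brackets of their own.
        return tree.text
    # Render each child, then wrap the result in this tag's open/close pair.
    inner = "".join(to_html(child) for child in tree.children)
    return f"<{tree.tag}>{inner}</{tree.tag}>"


tree = HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])
print(to_html(tree))  # <p>Text in a paragraph</p>
```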