Tree-structured data: documents
Web documents
You’re probably viewing these notes in your web browser. Open up the source of the page (most browsers have a “View Source” menu item somewhere). What does the source look like?
This page is simpler than the notes, so let’s look at it instead. Its source looks like this:
<html>
  <head>
    <title>This is a page</title>
  </head>
  <body>
    <p>Here is a paragraph of text</p>
    <p>Another paragraph, with <strong>bold</strong> text</p>
    <p>And a list:
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
      </ul>
    </p>
    <p>And a bolded list:
      <strong>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
        </ul>
      </strong>
    </p>
  </body>
</html>
What do we notice about the structure of this source?
HTML is made up of tags (the things in angle brackets). A tag can contain text; it can also contain other tags, which can contain other tags, and so on.
Why does HTML look like this? HTML documents have structure. They aren’t just plain text: there are formatted lists, bold text, and so on.
We won’t really learn how to write HTML in this course. We will, however, learn how to do computations over HTML documents in Python.
So: how might we represent the HTML document above?
Trees
Tree-structured data is quite common in computer science. In CSCI 0111, we saw one example: ancestry trees, recording the parents of individuals (and their parents, and their parents, and so on). In this class we’ll see another example: we’ll use a tree to represent HTML documents.
Our tree type will look like this:
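from dataclasses import dataclass

@dataclass
class HTMLTree:
    tag: str            # name of the tag, e.g. "p" or "strong"
    children: list      # child HTMLTree nodes, in document order
    text: str = ""      # only used by "text" nodes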
Each instance of the HTMLTree datatype represents a single tag. The name of the tag (p, strong, etc.) is in the tag field. The tag’s children are in the children field.
In the HTML document above, there’s text in addition to tags. In our HTMLTree class, these are represented as text tags; the text goes in the text field. text tags never have children.
So, here’s a simple document:
HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])
This corresponds to the HTML:
<p>Text in a paragraph</p>
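To see how nesting works, here is a slightly bigger example built by hand: the second paragraph of the page above, which mixes text with a strong tag. This is only a sketch; the class definition is repeated so the snippet runs on its own, and the variable name bold_paragraph is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


# <p>Another paragraph, with <strong>bold</strong> text</p>
bold_paragraph = HTMLTree("p", [
    HTMLTree("text", [], "Another paragraph, with "),
    # The strong tag is a child of p, and has a text child of its own.
    HTMLTree("strong", [HTMLTree("text", [], "bold")]),
    HTMLTree("text", [], " text"),
])

# The paragraph has three children: text, strong, text.
print(len(bold_paragraph.children))
print(bold_paragraph.children[1].tag)
print(bold_paragraph.children[1].children[0].text)
```

Notice that the text before and after the strong tag becomes two separate text nodes, one on each side of it.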
If our documents get much bigger, defining them by hand like this is going to get pretty annoying. We’ve written a little HTML library for the class, which is where HTMLTree is defined. There’s a function in the library to take a string of HTML and turn it into a tree:
> tree = parse("<p>Text in a paragraph</p>")
> tree
HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])
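If you’re curious how a function like this could work, here is one possible sketch built on HTMLParser from Python’s standard library. It is not the course library’s implementation. The names TreeBuilder and parse_sketch are hypothetical, and the sketch assumes well-formed input with a single top-level tag, every tag explicitly closed, and no attributes worth keeping.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


class TreeBuilder(HTMLParser):
    """Hypothetical helper: grows an HTMLTree as the parser reads the string."""

    def __init__(self):
        super().__init__()
        # A placeholder root collects top-level nodes.
        # The stack holds the tags that are currently open.
        self.root = HTMLTree("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # A new tag becomes a child of the innermost open tag.
        node = HTMLTree(tag, [])
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes each closing tag matches the innermost open tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.stack[-1].children.append(HTMLTree("text", [], data))


def parse_sketch(html: str) -> HTMLTree:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Assumes the input has exactly one top-level tag.
    return builder.root.children[0]


print(parse_sketch("<p>Text in a paragraph</p>"))
```

The stack is the key idea: whichever tag is on top is the one currently open, so new tags and text become its children.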
We can also print out the HTML string a tree corresponds to:
> print_html(tree)
<p>Text in a paragraph</p>
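Going from a tree back to a string is a nice first example of a computation over a tree. Here is a minimal sketch, assuming we only need the tag names and text; the function name to_html is made up for illustration, and the library’s print_html may well work differently.

```python
from dataclasses import dataclass


@dataclass
class HTMLTree:
    tag: str
    children: list
    text: str = ""


def to_html(tree: HTMLTree) -> str:
    """Return the HTML string that a tree corresponds to."""
    # Base case: a text node is just its text.
    if tree.tag == "text":
        return tree.text
    # Recursive case: convert each child, then wrap the result in this tag.
    inner = "".join(to_html(child) for child in tree.children)
    return f"<{tree.tag}>{inner}</{tree.tag}>"


tree = HTMLTree("p", [HTMLTree("text", [], "Text in a paragraph")])
print(to_html(tree))
```

The function has the same shape as the data: one case for text nodes, which have no children, and one case for every other tag, which recurs on each child.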