Class summary: Circular References and Intro to Hashtables
Copyright (c) 2017 Kathi Fisler
1 Circular References
One of the interesting things about our Book and Patron definitions is that they refer to each other: A patron refers to books, which refer to their borrowers, which refer to their books, and so on. Let’s look at these circular dependencies in data a bit more closely.
Here are versions of the Account and Customer classes that set up a circular reference: each Account refers to its owners, and each Customer refers to their account. To keep the example simple, we will assume that each Customer can have only one account.
from dataclasses import dataclass |
|
@dataclass |
class Account: |
    id: int |
    balance: int |
    owners: list # of Customer |
|
@dataclass |
class Customer: |
    name: str |
    acct: Account |
|
Let’s now create a new customer and a new account for them:
new_acct = Account(5, 150, [Customer("Kathi", __________)]) |
How do we fill in the blank in the Customer? We’d like to say new_acct, but Python (like most other languages) will raise an error saying that new_acct isn’t defined. Why is that?
When given this assignment, Python first evaluates the right side, to get the value or memory location that should be stored in the dictionary for new_acct. If we filled in the blank with new_acct, Python would start by running:
Account(5, 150, [Customer("Kathi", new_acct)]) |
To do this, it needs to look up new_acct in the dictionary, but that name isn’t in the dictionary yet (it only goes in after we compute the value to store for that name). Hence the error.
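To see this for ourselves, here is a minimal sketch that runs the problematic line and catches the error, which Python reports as a NameError. The class definitions repeat the ones above. The try/except wrapper and the saw_name_error variable are additions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: int
    balance: int
    owners: list  # of Customer

@dataclass
class Customer:
    name: str
    acct: Account

saw_name_error = False  # illustrative flag, not part of the original example
try:
    # The right side is evaluated first, so new_acct is looked up
    # before the assignment has had a chance to define it.
    new_acct = Account(5, 150, [Customer("Kathi", new_acct)])
except NameError as err:
    saw_name_error = True
    print("Python reported:", err)
```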
1.1 Use assignment to create circular data
To get around this, we leverage the ability to update the contents of memory locations after names for data are in place. We’ll create the account partially, but without filling in the Customer. Then we create the customer to reference the new Account. Then we update the account owners with the now-created customer:
new_acct = Account(5, 150, []) # note the empty Customer list |
new_cust = Customer("Kathi", new_acct) |
new_acct.owners = [new_cust] |
Note here that each part gets a spot in memory and an entry in the dictionary, but the data hasn’t been finished yet. Once we have the data set up in memory though, we can update the owners component to the correct value.
Here’s what this looks like at the level of memory and the dictionary after running the first two lines:
Dictionary                  Memory |
----------------------------------------------------------------------- |
                            (loc 1015) -> [] |
new_acct -> (loc 1016)      (loc 1016) -> Account(5, 150, loc 1015) |
new_cust -> (loc 1017)      (loc 1017) -> Customer("Kathi", loc 1016) |
Then, when we run the third line, we create a new list containing new_cust and update the owner list within new_acct:
Dictionary                  Memory |
----------------------------------------------------------------------- |
                            (loc 1015) -> [] |
new_acct -> (loc 1016)      (loc 1016) -> Account(5, 150, loc 1018) |
new_cust -> (loc 1017)      (loc 1017) -> Customer("Kathi", loc 1016) |
                            (loc 1018) -> [loc 1017] |
Notice that the two owners lists each live in memory but aren’t associated with names in the dictionary. They are only reachable going through new_acct, and after the update, the empty list isn’t reachable at all.
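As a quick check of the circular structure, the sketch below builds the same data and follows the references around the loop. It uses Python's is operator, which asks whether two expressions refer to the same item in memory. The class definitions are repeated so the example runs on its own.

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: int
    balance: int
    owners: list  # of Customer

@dataclass
class Customer:
    name: str
    acct: Account

new_acct = Account(5, 150, [])          # note the empty Customer list
new_cust = Customer("Kathi", new_acct)
new_acct.owners = [new_cust]

# Each step follows one reference; going around the loop returns to the start.
print(new_acct.owners[0] is new_cust)        # True
print(new_cust.acct is new_acct)             # True
print(new_acct.owners[0].acct is new_acct)   # True

# Printing works too: dataclasses show "..." where the data loops back on itself.
print(new_acct)
```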
If we had instead done the owner update using append, as in:
new_acct = Account(5, 150, []) # note the empty Customer list |
new_cust = Customer("Kathi", new_acct) |
new_acct.owners.append(new_cust) |
we would have updated the list at location 1015 instead of creating a new list at a new location, as follows:
Dictionary                  Memory |
----------------------------------------------------------------------- |
                            (loc 1015) -> [loc 1017] |
new_acct -> (loc 1016)      (loc 1016) -> Account(5, 150, loc 1015) |
new_cust -> (loc 1017)      (loc 1017) -> Customer("Kathi", loc 1016) |
Either approach (append or a new list) works fine. The only difference is whether a new list gets created, as shown in these two memory examples.
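The difference between the two approaches can be observed from within Python. In this sketch, the names ending in _a and _b and the first_list variables are illustrative additions: we hold on to the original empty list under an extra name only so that we can inspect it afterwards.

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: int
    balance: int
    owners: list  # of Customer

@dataclass
class Customer:
    name: str
    acct: Account

# Approach 1: assign a brand-new list to owners
acct_a = Account(5, 150, [])
first_list_a = acct_a.owners            # extra name for the original empty list
cust_a = Customer("Kathi", acct_a)
acct_a.owners = [cust_a]
print(acct_a.owners is first_list_a)    # False: owners now refers to a new list
print(first_list_a)                     # []: the original list was left untouched

# Approach 2: append to the existing list
acct_b = Account(5, 150, [])
first_list_b = acct_b.owners
cust_b = Customer("Kathi", acct_b)
acct_b.owners.append(cust_b)
print(acct_b.owners is first_list_b)    # True: same list, now with one element
```

In the first approach, the original list would be unreachable if we had not given it the extra name first_list_a.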
1.2 Testing Circular Data
When you want to write a test involving circular data, you can’t write out the circular data manually. For example, imagine that we wanted to write out new_acct from the previous examples:
test("data test", new_acct, |
     Account(5, 150, [Customer("Kathi", Account(5, 150, ...))])) |
Because of the circularity, you can’t finish writing down the data. You have two options: write tests in terms of the names of data, or write tests on the components of the data.
Here’s an example that illustrates both. After setting up the account, we might want to check that the owner of the new account is the new customer:
test("new owner", new_acct.owners, [new_cust]) |
Here, rather than write out the Customer by hand, we drop in the name of the existing item in memory. This doesn’t require you to write ellipses. We also focused on just the owners component, as the part of the Account value that we expected to change.
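The test function used above comes from the course's testing library. As a rough sketch of the same checks in plain Python, the following uses assert statements instead. That substitution is an assumption, made only so the example runs on its own.

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: int
    balance: int
    owners: list  # of Customer

@dataclass
class Customer:
    name: str
    acct: Account

new_acct = Account(5, 150, [])
new_cust = Customer("Kathi", new_acct)
new_acct.owners.append(new_cust)

# Option 1: test in terms of the names of existing data
assert new_acct.owners == [new_cust]
assert new_cust.acct is new_acct

# Option 2: test individual components that are not circular
assert new_acct.id == 5
assert new_acct.balance == 150
assert new_acct.owners[0].name == "Kathi"

print("all checks passed")
```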
1.3 Bonus: A Function to Create Accounts for New Customers
What if we turned the sequence for creating dependencies between customers and their accounts into a function? We might get something like the following:
def create_acct(new_id: int, init_bal: int, cust_name: str) -> Account: |
    new_acct = Account(new_id, init_bal, []) # note the empty Customer list |
    new_cust = Customer(cust_name, new_acct) |
    new_acct.owners.append(new_cust) |
    return new_acct |
This looks useful, but it does have a flaw – we could accidentally create two accounts with the same id number. It would be better for us to maintain a variable containing the next account id to use, to guarantee that the same id gets used only once. How might we augment the code to do this?
next_id = 1 # stores the next available id number |
|
def create_acct(init_bal: int, cust_name: str) -> Account: |
    new_acct = Account(next_id, init_bal, []) # note the empty Customer list |
    next_id = next_id + 1 |
    new_cust = Customer(cust_name, new_acct) |
    new_acct.owners.append(new_cust) |
    return new_acct |
Here, we create the next_id variable to hold the next id number to use. When we create an Account, we update next_id to the next unused number. Problem solved!
Well, almost. Now we’re at something that is specific to Python.
When we were still working in Pyret, we talked about what happens to the dictionary when we call a function: we make a separate area of the dictionary for that function, and we put the values of the parameters in that area of the dictionary. When the function ends, its piece of dictionary goes away.
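In our code above, we have a variable next_id set up outside the create_acct function. Inside the function, we are assigning to a variable next_id. Is this the same variable from outside the function, or are we trying to create a new variable (as we do for new_acct and new_cust)? Python assumes the latter: any name assigned inside a function is treated as a new variable local to that function, unless we say otherwise. As a result, the function fails with an UnboundLocalError on the line that creates the Account, because it reads the local next_id before that variable has been given a value.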
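Here is a stripped-down sketch of the problem. The function next_account_id is a hypothetical helper that keeps only the two lines involving next_id. Calling it raises an UnboundLocalError, which we catch so we can print the message.

```python
next_id = 1  # stores the next available id number

def next_account_id() -> int:
    """Broken on purpose: assigns to next_id without saying which one we mean."""
    this_id = next_id        # fails here: next_id is treated as local to the function
    next_id = next_id + 1    # this assignment is what makes next_id local
    return this_id

error_seen = None  # illustrative variable recording which error occurred
try:
    next_account_id()
except UnboundLocalError as err:
    error_seen = type(err).__name__
    print("Python reported:", err)

print(next_id)  # 1: the variable outside the function was never changed
```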
To tell Python that we want the variable from outside the function, we have to add a line saying so. We do that with a global declaration inside the function. Here’s the final code:
next_id = 1 # stores the next available id number |
|
def create_acct(init_bal: int, cust_name: str) -> Account: |
    # next line says to use the next_id from outside the function |
    global next_id |
    new_acct = Account(next_id, init_bal, []) # note the empty Customer list |
    next_id = next_id + 1 |
    new_cust = Customer(cust_name, new_acct) |
    new_acct.owners.append(new_cust) |
    return new_acct |
This isn’t something you’ll be tested on – it is just here to show you how to do this in case you are interested.
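For those who are interested, this sketch shows the final version in use. The function is the one from above. The two calls, and the second customer name "Ana", are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: int
    balance: int
    owners: list  # of Customer

@dataclass
class Customer:
    name: str
    acct: Account

next_id = 1  # stores the next available id number

def create_acct(init_bal: int, cust_name: str) -> Account:
    # next line says to use the next_id from outside the function
    global next_id
    new_acct = Account(next_id, init_bal, [])
    next_id = next_id + 1
    new_cust = Customer(cust_name, new_acct)
    new_acct.owners.append(new_cust)
    return new_acct

acct_one = create_acct(150, "Kathi")
acct_two = create_acct(300, "Ana")   # hypothetical second customer

print(acct_one.id)   # 1
print(acct_two.id)   # 2
print(next_id)       # 3: ready for the next account
```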
2 Introduction to Hashtables
2.1 Finding Items in Memory
When someone asks for an account balance, they usually provide their account number. This suggests that we need a function that takes an account number and returns the account value with that id. From there, we can dig into the account to get the balance.
This means we need some data structure to store all of the accounts. We’ve seen lists and tables for this purpose. Let’s use a (Python) list. Such a list might look as follows:
all_accts = [Account(8, 100, []), |
             Account(2, 300, []), |
             Account(10, 225, []), |
             Account(3, 200, []), |
             Account(1, 120, []), |
             ... |
            ] |
For now, we’ll ignore the owners component – it isn’t relevant to our discussion and omitting it will make examples easier to write out.
So we need to search a list for the account with a given id. We’ve written such search code many times at this point. Here it is again for reference, with a Python test function for good measure:
def find_acct(which_id: int, acct_list: list) -> Account: |
    """returns the account with the given id number""" |
    for acct in acct_list: |
        if acct.id == which_id: |
            return acct |
    raise ValueError("no account has id " + str(which_id)) |
|
def test_find(): |
    test("match num", find_acct(3, all_accts).id, 3) |
    test("match exactly", find_acct(1, all_accts).id, 1) |
    testValueError("no account", lambda: find_acct(22, all_accts)) |
As we discussed earlier in the course, this function takes linear time in the number of accounts in the worst case. For a real bank, that’s a significant number. How can we do better?
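To make the linear cost concrete, here is a sketch of a variant that also reports how many accounts it had to inspect. The name find_acct_counting is hypothetical, and the list contains only the five accounts written out above.

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: int
    balance: int
    owners: list  # of Customer

all_accts = [Account(8, 100, []),
             Account(2, 300, []),
             Account(10, 225, []),
             Account(3, 200, []),
             Account(1, 120, [])]

def find_acct_counting(which_id: int, acct_list: list) -> tuple:
    """returns the matching account and the number of accounts inspected"""
    checked = 0
    for acct in acct_list:
        checked += 1
        if acct.id == which_id:
            return acct, checked
    raise ValueError("no account has id " + str(which_id))

first_acct, first_checked = find_acct_counting(8, all_accts)
last_acct, last_checked = find_acct_counting(1, all_accts)
print(first_checked)  # 1: the account with id 8 is at the front of the list
print(last_checked)   # 5: the account with id 1 is at the end
```

In the worst case, the number of accounts inspected equals the length of the list.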
Now that we’ve been thinking about memory a bit, we can reframe the question about finding an account. Finding an account is ultimately about finding the account in memory. Once we know the memory location of the account we’re trying to find, we can operate on that account however we want.
In the find_acct version, we use the dictionary to find the list in memory, then iterate through all the accounts in the list, checking their ids until we find the one that we want. But what if we had a way to predict where the account we want would be, so we could go directly to that memory location? How would we have to set up the data to enable that?
Looking at our memory diagrams, we see that the items in the list are in order in memory. So if we know which position we want within the list, we should be able to hop to that spot immediately once we know the memory location of the list itself (basically, we take the address of the list itself and add the list position to get to the right memory spot). Things are a touch more complicated than this in practice (because we’ve simplified the structure of lists in memory the last several lectures), but this is the intuition.
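One way to picture this intuition with tools we already have: Python lists let us jump straight to a position using square brackets. The sketch below is a hypothetical setup, not a description of how hashtables work, in which each account is stored at the list position equal to its id. The name accts_by_position is made up for this example.

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: int
    balance: int
    owners: list  # of Customer

accounts = [Account(8, 100, []),
            Account(2, 300, []),
            Account(10, 225, []),
            Account(3, 200, []),
            Account(1, 120, [])]

# Make a list long enough that every id can be used as a position.
largest_id = max(acct.id for acct in accounts)
accts_by_position = [None] * (largest_id + 1)
for acct in accounts:
    accts_by_position[acct.id] = acct

# No searching: go directly to position 3.
found = accts_by_position[3]
print(found.balance)  # 200

# The cost of this setup: positions with no matching id sit unused.
empty_slots = sum(1 for slot in accts_by_position if slot is None)
print(empty_slots)    # 6
```

This only works when ids are small whole numbers, and the unused positions waste space, which is part of why a more careful data structure is needed.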
There’s nothing in the programming language that lets us grab memory locations from the dictionary and add offsets to them for specific elements. But there are data structures that effectively do that for you. Enter the hashtable.
2.2 The Hashtable Datatype
For today, we’re just going to show you a hashtable and how to use it. Next class, we’ll discuss how it works.
Hashtables are data structures that let us map quickly from some notion of keys (like account numbers or course codes) to values. When we set up a hashtable, we write a series of entries of the form
key: value |
Here’s what a hashtable mapping id numbers to accounts might look like:
accts_HM = {5: Account(5, 225, []), |
            3: Account(3, 200, []), |
            2: Account(2, 300, []), |
            4: Account(4, 75, []), |
            1: Account(1, 100, []) |
           } |
We create a hashtable by using curly braces instead of square brackets (as we do for creating lists) on the outside, and by using the key: value form for each item. The hashtable above has five account IDs that map to accounts with those same ID numbers.
Now, if we want to get the account with ID 1 from the hashtable, we can simply write:
accts_HM[1] |
This says "in the accts_HM hashtable, get the value associated with key 1". Because of the way hashtables are set up within Python (and in other languages), Python can hop to the memory address in constant time and extract the Account with ID 1, rather than checking the entries one by one.
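Here is a small sketch of the hashtable in use. Python's own name for this datatype is dict. The function find_acct_in_table is a hypothetical counterpart to find_acct, written under the assumption that we want the same ValueError behavior when no account matches.

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: int
    balance: int
    owners: list  # of Customer

accts_HM = {5: Account(5, 225, []),
            3: Account(3, 200, []),
            2: Account(2, 300, []),
            4: Account(4, 75, []),
            1: Account(1, 100, [])}

# Look up a value by its key, then dig into the account as usual.
acct_one = accts_HM[1]
print(acct_one.balance)   # 100

# The in operator asks whether a key is present.
print(22 in accts_HM)     # False

def find_acct_in_table(which_id: int, acct_table: dict) -> Account:
    """returns the account stored under the given id number"""
    if which_id not in acct_table:
        raise ValueError("no account has id " + str(which_id))
    return acct_table[which_id]

print(find_acct_in_table(3, accts_HM).balance)   # 200
```

Unlike find_acct, this version contains no loop: the lookup goes through the key directly.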
There are many more things to discuss about hashtables: how to use keys that aren’t numbers, how they work, and when to use them. All of these are coming up in the next couple of lectures.