NB: Will and Kathi will begin limiting the scope of the help they can provide during office hours
Suppose you want to build an app that lets people search for businesses by name, zip code, or both.
How do we organize the data to handle these searches quickly?
Assume that businesses have names, locations, phone numbers (with area codes), and descriptive tags.
Options to make this structured data include:
Without knowledge we'll acquire later in this class, we might have written something like this:
# Let's assume there are no duplicate business names, to simplify things
# how we might have written this function before considering runtime
def find_bus_name_exact(busname : str, bus_dir : list) -> dict:
    """Return BusInfo from bus_dir with given name"""
    matches = list(filter(lambda bus: bus["name"] == busname, bus_dir))
    if len(matches) > 0:
        return matches[0]
    else:
        raise Exception("Business name not found: " + busname)
This will get us the answer we want, but is this the approach we always want to take?
To illustrate that this might not always be the best idea, let's think about Google. There are approximately 4.54 billion pages, but when we query Google it takes less than a second to return results. How can this be? Does Google just have a list of web pages and loop through?
(Spoiler: They don't).
We could talk about the speed of executing a piece of code in terms of hard time (e.g. minutes or seconds or milliseconds). This gives us a physical way of conceptualizing the time cost of a piece of code, but presents an issue if we want to make comparisons. Since people operate on different types of computers, it becomes really difficult to compare runtimes between people.
We actually want a way to describe the time cost that is transferable across different computers.
In order to do this, we will talk about running time in proportion to the size of our data.
# ignore this code, just focus on the graphs that illustrate how the time cost grows
import matplotlib.pyplot as plt
plt.subplot(1, 2, 1)
plt.plot([1,2,3,4])
plt.yticks([])
plt.xticks([])
plt.ylabel('# of computations')
plt.xlabel('data size')
plt.title('Linear-time function')
plt.subplot(1, 2, 2)
plt.axhline(y=0.5, color='r', linestyle='-')
plt.yticks([])
plt.xticks([])
plt.ylabel('# of computations')
plt.xlabel('data size')
plt.title('Constant-time function')
plt.tight_layout()
plt.show()
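The two growth shapes plotted above can also be seen by counting operations instead of measuring seconds. The sketch below is only an illustration: `count_linear_steps` and `count_constant_steps` are hypothetical helpers, not part of the lecture code.

```python
def count_linear_steps(data: list) -> int:
    """Count how many elements a full scan of data touches."""
    steps = 0
    for _ in data:
        steps += 1
    return steps


def count_constant_steps(data: list) -> int:
    """Reading one element by index is a single step, whatever the size."""
    steps = 0
    if data:
        _first = data[0]
        steps += 1
    return steps


# The linear count grows with the data; the constant count does not.
for size in [10, 100, 1000]:
    data = list(range(size))
    print(size, count_linear_steps(data), count_constant_steps(data))
```

Because these counts do not depend on how fast a particular computer is, they can be compared across machines.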
# a more efficient implementation of finding bus names
def find_bus_name_exact_loop(busname : str, bus_dir : list) -> dict:
    """Return BusInfo from bus_dir with given name"""
    for bus in bus_dir:
        if bus["name"] == busname:
            return bus
    # if we get here without returning, no such name in list
    raise Exception("Business name not found: " + busname)
# return stops the function. by putting it inside the if statement, we effectively
# break the loop when our function encounters something that matches busname
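To see why the loop version can be cheaper, we can count name comparisons in both styles. This is a rough sketch: the two counting functions and `sample_dir` are made up for illustration and only mirror the shape of the functions above.

```python
def comparisons_with_filter(busname: str, bus_dir: list) -> int:
    """Count name comparisons when filter checks every entry."""
    count = 0

    def is_match(bus: dict) -> bool:
        nonlocal count
        count += 1
        return bus["name"] == busname

    list(filter(is_match, bus_dir))  # filter always walks the whole list
    return count


def comparisons_with_early_return(busname: str, bus_dir: list) -> int:
    """Count name comparisons when we stop at the first match."""
    count = 0
    for bus in bus_dir:
        count += 1
        if bus["name"] == busname:
            break
    return count


sample_dir = [{"name": "Brown"}, {"name": "Starbucks"}, {"name": "Kabob"}]
print(comparisons_with_filter("Brown", sample_dir))        # 3
print(comparisons_with_early_return("Brown", sample_dir))  # 1
```

When the name is last in the list (or missing), both versions make the same number of comparisons, so the loop only helps when a match shows up early.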
We now need to consider two different cases for how we compute running time: the average case and the worst case.
We can think of average-case running time as the time cost given some random input data. Worst-case running time can be thought about if our input data is organized in a way that makes our computation the most time-expensive.
We can think about worst case using the below example. Say we are looking for the number 6 in the list below:
[0, 1, 2, 3, 4, 5, 6]
If we wrote a function to look for six by starting at the beginning of the list and looping through, it would have to look at every element of the list before finding 6. This is our worst case for this example.
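The difference between the two cases can be checked on this list by counting how many elements a front-to-back scan inspects. This is a small sketch; `steps_to_find` is a hypothetical helper, and the average assumes each number in the list is equally likely to be the one we search for.

```python
def steps_to_find(target: int, numbers: list) -> int:
    """Return how many elements a front-to-back scan inspects to find target."""
    for steps, value in enumerate(numbers, start=1):
        if value == target:
            return steps
    return len(numbers)  # target absent: every element was inspected


numbers = [0, 1, 2, 3, 4, 5, 6]
worst = max(steps_to_find(t, numbers) for t in numbers)
average = sum(steps_to_find(t, numbers) for t in numbers) / len(numbers)
print(worst)    # 7
print(average)  # 4.0
```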
The structure of data can make a huge difference in our runtime.
We're familiar with a tree based structure from Pyret. It might look something like:
        12
       /  \
      6    15
     / \     \
    5   8     21
   /   /
  3   7
Tree-based data can give us a logarithmic runtime. Specifically, the tree illustrated above is a special kind of tree called a binary search tree. It achieves logarithmic runtime if it is "bushy" (balanced), because we can throw away about half of the remaining numbers at each comparison (each time we move down the tree).
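A minimal sketch of searching the binary search tree described above, counting one comparison per level visited. The `Node` class and `bst_contains` function are illustrative names, not code from the course.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bst_contains(node: Optional[Node], target: int) -> tuple:
    """Return (found, comparisons) for a search starting at node."""
    comparisons = 0
    while node is not None:
        comparisons += 1
        if target == node.value:
            return True, comparisons
        # Smaller values live on the left, larger ones on the right,
        # so one whole subtree is discarded at every step.
        node = node.left if target < node.value else node.right
    return False, comparisons


# The same tree as in the picture: 12 at the root, 6 and 15 below it, etc.
tree = Node(12,
            Node(6, Node(5, Node(3)), Node(8, Node(7))),
            Node(15, None, Node(21)))

print(bst_contains(tree, 7))   # (True, 4)
print(bst_contains(tree, 21))  # (True, 3)
print(bst_contains(tree, 10))  # (False, 3)
```

The number of comparisons is bounded by the height of the tree, which is why a bushy tree matters: a tree that degenerates into one long branch behaves like a list.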
If we instead were to organize our data by 'buckets', where data are split up by their tens digit, we could get something that looks like:
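{5, 6, 8, 3, 7}, {12, 15}, {21}
In the worst case, each bucket will have at most 10 elements (assuming no duplicates), so we will have to make at most 10 comparisons. We can decide which bucket to go to with simple arithmetic on the number: here integer division by 10 gives the tens digit, and hash tables more generally use the modulo operator to pick a bucket. This will give us a constant-time lookup. This kind of data structure is known as a hash table.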
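A minimal sketch of the tens-digit buckets, assuming whole numbers from 0 to 99 with no duplicates. The function names are made up for illustration; the bucket is chosen with integer division (`n // 10`), which yields the tens digit.

```python
NUM_BUCKETS = 10  # one bucket per tens digit, 0 through 9


def bucket_index(n: int) -> int:
    """Tens digit of n picks the bucket (assumes 0 <= n < 100)."""
    return n // 10


def build_buckets(numbers: list) -> list:
    """Place every number in the bucket for its tens digit."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for n in numbers:
        buckets[bucket_index(n)].append(n)
    return buckets


def in_buckets(buckets: list, target: int) -> bool:
    """Jump straight to one bucket, then scan at most 10 elements."""
    return target in buckets[bucket_index(target)]


buckets = build_buckets([12, 6, 15, 5, 8, 21, 3, 7])
print(buckets[0])               # [6, 5, 8, 3, 7]
print(buckets[1])               # [12, 15]
print(buckets[2])               # [21]
print(in_buckets(buckets, 15))  # True
print(in_buckets(buckets, 16))  # False
```

However many numbers we store in this range, a lookup never scans more than one bucket, so the work per lookup stays bounded.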
When we're thinking about accessing data quickly, there are really two big techniques we can use.
Both of these options can interact with where data is in memory.
Now let's suppose we want to look up runners by their finishing position in a race.
We might have some dataclass to store their name and age.
from dataclasses import dataclass
@dataclass
class Runner:
    name: str
    age: int
Runner("Amy", 32)
Runner("Willa", 76)
Runner("Bill", 45)
Runner("J", 52)
So our runners are:
1st: Runner("Amy", 32)
2nd: Runner("Willa", 76)
3rd: Runner("Bill", 45)
4th: Runner("J", 52)
...
It would take three 'operations' to get to the third runner. But there is a way to do this quicker!
If you were in Python and you asked for RunnersList[3], Python doesn't set up a for loop. It's a little more efficient than that:
Environment (Known Names)
RunnersList => loc 1021
Python Memory
label | slot |
---|---|
1021 | Runner("Amy", 32) |
1022 | Runner("Willa", 76) |
1023 | Runner("Bill", 45) |
1024 | Runner("J", 52) |
Using this information, we know that RunnersList is located at loc 1021. We can then just add three to that to get the memory location we want (loc 1021 + 3 = loc 1024). Since Python knows the memory address of the list and the index of the element, it can jump straight to the element at index 3 (the fourth runner, because Python counts from 0). This allows Python to return elements of the list in constant time.
What did we rely on? We relied on the fact that there is an order to our data, and specifically that it is indexed by a number. The order is what lets us (and Python) use addition, and the indexing by number allows us to just add 3 to the memory location.
Whenever our data is indexed by a number, our language can use this trick to look things up in constant time!
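The address arithmetic can be imitated with a toy model of memory. This is only a sketch of the idea: the `memory` mapping and `element_at` are hypothetical stand-ins, and real Python memory is more involved than numbered slots.

```python
# Toy memory: location number -> stored value (mirrors the table above).
memory = {1021: "Amy", 1022: "Willa", 1023: "Bill", 1024: "J"}
runners_list_start = 1021  # where the list begins


def element_at(start: int, index: int) -> str:
    """One addition plus one memory read; no loop over earlier elements."""
    return memory[start + index]


print(element_at(runners_list_start, 0))  # Amy
print(element_at(runners_list_start, 3))  # J
```

The cost is the same for index 0 and index 3, which is what makes indexing a constant-time operation.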
Now that we understand this, we can go back to our business directory problem from the beginning of class.
Wouldn't it be nice to be able to ask BusDir["Kabob"] and have it return a result in constant time?
How can we get from "Kabob" to a number? We can't use string length because it is not unique: there are plenty of strings of length 5. We have to use a special, built-in feature: the dictionary.
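One way to picture how a string becomes a number is Python's built-in `hash` function combined with the modulo operator. This is a simplified sketch, not how Python's dictionary is actually implemented; `bucket_for` and the bucket count of 8 are assumptions made for illustration.

```python
def bucket_for(key: str, num_buckets: int = 8) -> int:
    """Map a string to a slot number between 0 and num_buckets - 1."""
    return hash(key) % num_buckets


slot = bucket_for("Kabob")
print(0 <= slot < 8)                # True
print(slot == bucket_for("Kabob"))  # True: same key, same slot within one run
```

No specific slot number is shown because Python deliberately varies string hashes from one run of a program to the next. Two different strings can also land in the same slot, which a real hash table has to handle.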
# Dictionaries in Python (aka hashtables)
def Phone(area : int, base : int) -> dict:
    """Create a phone number with area code and base number"""
    return {'area': area, 'base': base}

def BusInfo(name : str, loc : str, phone : dict, tags : list) -> dict:
    """Create business info with name, location, phone, and type tags.
    The location is a zip code stored as a string (so leading zeros are kept).
    The phone number will be a Phone dictionary"""
    return {'name': name, 'location': loc, 'phone' : phone, 'tags': tags}

BusDirName = {"Brown": BusInfo("Brown", "02912", Phone(401,8635000), ["University"]),
              "Starbucks": BusInfo("Starbucks", "02906", Phone(401,5551212), ["coffee"])
             }
We're telling Python that we want it to look up on a 'key', the string before the colon, and return the 'value', the rest of the stuff after the colon. This is what's known as a key-value pair.
We've now achieved constant time lookup!
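Looking up a key that is not in the dictionary raises a `KeyError`, so it helps to know the gentler alternatives. The small `bus_dir` below is a made-up stand-in for the directory.

```python
bus_dir = {"Brown": {"location": "02912"}, "Starbucks": {"location": "02906"}}

print("Brown" in bus_dir)    # True: membership test
print(bus_dir.get("Kabob"))  # None: .get returns a default instead of failing

try:
    bus_dir["Kabob"]         # square brackets insist that the key exists
except KeyError as err:
    print("missing key:", err)  # missing key: 'Kabob'
```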
Now what if we wanted to add to our dictionary? If we wanted to add a new entry, we could do:
# we can add our new entry
BusDirName["Froyo"] = BusInfo("FroyoWorld", "02906", Phone(401, 5551213), ["dessert"])
# check to make sure our entry was input properly
print(BusDirName["Froyo"])
We said we wanted to be able to search by name, which we have achieved. But we also want to be able to search by zip code.
One thing we can do is to set up another dictionary:
# define dictionary listed by zip
BusDirZip = {"02906": [BusInfo("Starbucks", "02906", Phone(401,5551212), ["coffee"]),
                       BusInfo("FroyoWorld", "02906", Phone(401, 5551213), ["dessert"])],
             "02912": [BusInfo("Brown", "02912", Phone(401,8635000), ["University"])]}
# check this
print(BusDirZip["02906"])
There are two important restrictions to keep in mind when making a dictionary:
You can use a for loop to go through a hashtable:
for key in BusDirName:
    print(key)
for key in BusDirName:
    print(BusDirName[key]['location'])
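When both the key and the value are needed, the dictionary's `.items()` method hands them over together. A short sketch with a made-up `bus_dir`:

```python
bus_dir = {
    "Brown": {"location": "02912"},
    "Starbucks": {"location": "02906"},
}

# .items() yields (key, value) pairs, so no second lookup is needed
for name, info in bus_dir.items():
    print(name, info["location"])

# The same pairs can build a new dictionary in one expression
locations = {name: info["location"] for name, info in bus_dir.items()}
print(locations)  # {'Brown': '02912', 'Starbucks': '02906'}
```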
We can make a dictionary of zipcodes and names (a dictionary inside a dictionary):
BusDirZipName = {"02906": {
                     "Froyo": BusInfo("FroyoWorld", "02906", Phone(401, 5551213), ["dessert"]),
                     "Starbucks": BusInfo("Starbucks", "02906", Phone(401,5551212), ["coffee"])
                 },
                 "02912": {
                     "Brown": BusInfo("Brown", "02912", Phone(401,8635000), ["University"])
                 }
                }
print(BusDirZipName["02906"]["Starbucks"])
If we want to update this dictionary, we will have to approach it carefully. We can do this like so:
# starbucks phone number before update
print(BusDirName["Starbucks"])
def update_phone(for_bus: str, new_num: int):
    """change the phone number for the named business to a new number (same area code)"""
    BusDirName[for_bus]["phone"]["base"] = new_num
# call our function to update dictionary
update_phone("Starbucks", 8675309)
# starbucks phone number after update
print(BusDirName["Starbucks"])
But there is an issue! If I were to look at BusDirZip, the Starbucks number would not be updated.
print(BusDirZip["02906"])
We still need to update the zip directory. We can do this by changing the function we wrote above:
# starbucks phone number before update
print("Before update")
print(BusDirZip["02906"])
def update_phone(for_bus: str, new_num: int):
    """change the phone number for the named business to a new number (same area code)"""
    BusDirName[for_bus]["phone"]["base"] = new_num
    # still need to update in the zip dictionary
    for bus in BusDirZip[BusDirName[for_bus]["location"]]:
        if bus["name"] == for_bus:
            bus["phone"]["base"] = new_num
update_phone("Starbucks", 8675309)
# starbucks phone number after update
print("\nAfter update")
print(BusDirZip["02906"])
What's the running time of this function?
A for loop usually rules out constant time (unless the list is fixed in size). Here the loop visits every business in the zip code, so the time grows with the number of businesses sharing it. However, we want constant time! There is something we can do to achieve that, but it's not in the function: we need to change the way our data is constructed.
If we think about how our data is set up, we have two copies of every business in memory. How do we know this? We called BusInfo() twice for each business.
BusDirName = {"Brown": BusInfo("Brown", "02912", Phone(401,8635000), ["University"]),
              "Starbucks": BusInfo("Starbucks", "02906", Phone(401,5551212), ["coffee"])
             }
BusDirZip = {"02906": [BusInfo("Starbucks", "02906", Phone(401,5551212), ["coffee"]),
                       BusInfo("FroyoWorld", "02906", Phone(401, 5551213), ["dessert"])],
             "02912": [BusInfo("Brown", "02912", Phone(401,8635000), ["University"])]}
If we had made only one of each business in memory and shared it between the dictionaries, we could change it in only one place and see the change reflected in all our dictionaries.
starbucks = BusInfo("Starbucks", "02906", Phone(401,5551212), ["coffee"])
brown = BusInfo("Brown", "02912", Phone(401,8635000), ["University"])
froyoworld = BusInfo("FroyoWorld", "02906", Phone(401, 5551213), ["dessert"])
BusDirName = {"Brown": brown, "Starbucks": starbucks}
BusDirZip = {"02906": [starbucks, froyoworld], "02912": [brown]}
print("Before change")
print("Business directory names")
print(BusDirName["Starbucks"])
print("Business directory zips")
print(BusDirZip["02906"])
print("\nAfter change")
update_phone("Starbucks", 8675309)
print("Business directory names")
print(BusDirName["Starbucks"])
print("Business directory zips")
print(BusDirZip["02906"])
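Once both dictionaries share the same business object, the update no longer needs to walk the zip list at all. The sketch below uses made-up names (`by_name`, `by_zip`, `update_phone_shared`) and a cut-down business record; Python's `is` operator confirms that both dictionaries point at one object rather than at two equal copies.

```python
starbucks = {"name": "Starbucks", "location": "02906",
             "phone": {"area": 401, "base": 5551212}}
by_name = {"Starbucks": starbucks}
by_zip = {"02906": [starbucks]}

# Same object reachable from both dictionaries
print(by_name["Starbucks"] is by_zip["02906"][0])  # True

# A separately built record is equal in content but is a different object
separate_copy = {"name": "Starbucks", "location": "02906",
                 "phone": {"area": 401, "base": 5551212}}
print(separate_copy == starbucks)  # True
print(separate_copy is starbucks)  # False


def update_phone_shared(for_bus: str, new_num: int) -> None:
    """One dictionary lookup and one assignment; no loop over the zip list."""
    by_name[for_bus]["phone"]["base"] = new_num


update_phone_shared("Starbucks", 8675309)
print(by_zip["02906"][0]["phone"]["base"])  # 8675309
```

The separately built copy keeps the old number after the update, which is exactly the problem the two-copy setup had.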
If there is a new business we want to add, we can write a function that registers the business in both dictionaries.
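def register_business(new_bus: dict):
    """Add a BusInfo dictionary to both directories (the same object is shared)"""
    BusDirName[new_bus["name"]] = new_bus
    # setdefault creates an empty list the first time a zip code appears
    BusDirZip.setdefault(new_bus["location"], []).append(new_bus)
register_business(BusInfo("Kabob", "02906", Phone(401, 5551234), ["lunch"]))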
print("Business Directory Names")
print(BusDirName["Kabob"])
print("\nBusiness Directory Zips")
print(BusDirZip["02906"])
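With both directories in place, the search from the start of these notes (by name, by zip code, or both) can be sketched as a single function. `find_businesses` and the small sample directories are hypothetical; the notes do not define such a function.

```python
from typing import Optional


def find_businesses(by_name: dict, by_zip: dict,
                    name: Optional[str] = None,
                    zipcode: Optional[str] = None) -> list:
    """Return the businesses matching a name, a zip code, or both."""
    if name is not None:
        bus = by_name.get(name)  # one dictionary lookup
        if bus is None:
            return []
        if zipcode is not None and bus["location"] != zipcode:
            return []
        return [bus]
    if zipcode is not None:
        # Copy the list so callers cannot alter the directory by accident
        return list(by_zip.get(zipcode, []))
    return []


kabob = {"name": "Kabob", "location": "02906"}
brown = {"name": "Brown", "location": "02912"}
by_name = {"Kabob": kabob, "Brown": brown}
by_zip = {"02906": [kabob], "02912": [brown]}

print(find_businesses(by_name, by_zip, name="Kabob"))
print(find_businesses(by_name, by_zip, zipcode="02912"))
print(find_businesses(by_name, by_zip, name="Kabob", zipcode="02912"))  # []
```

A search by name costs one dictionary lookup; a search by zip code costs one lookup plus copying that zip code's list.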