Pandas
Datatypes as dictionary keys
What if we wanted our dictionary to work the other way, mapping Coord instances (the keys) to strings (the values)?
So let’s say we have a dictionary like this:
    p = Coord(1, 2)
    objects = { p: "tree", Coord(4, 3): "rock" }
What happens, though, if we change one of the keys (p.x = 3)? A dictionary finds a value by computing a hash of its key, so if a key’s contents change after it has been stored, Python might look in the wrong location for the value. Dictionaries aren’t going to work if our keys can be changed! Because of this, the objects dictionary above will actually cause Python to throw an error, something like TypeError: unhashable type: 'Coord'.
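To see the error concretely, here is a minimal sketch. It assumes Coord is a plain (non-frozen) dataclass with integer fields x and y, matching the frozen version used in these notes; the exact wording of the message varies between Python versions.

```python
from dataclasses import dataclass


# Assumed definition: a plain dataclass, mutable by default.
@dataclass
class Coord:
    x: int
    y: int


p = Coord(1, 2)

try:
    # Building the dictionary hashes each key, which fails for a
    # mutable dataclass.
    objects = {p: "tree", Coord(4, 3): "rock"}
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)
```

The failure happens as soon as the dictionary is built, not later when a key is modified, because Python refuses to hash a mutable dataclass at all.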
We can get around this by modifying our Coord class:
    @dataclass(frozen=True)
    class Coord:
        x: int
        y: int
This tells Python that Coord instances should be treated as atomic values that never change after they are created. Now we can use them as keys in a dictionary, but we can’t modify instances:
    > p = Coord(1, 2)
    > p.x = 3
    ERROR
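Putting the pieces together, the following sketch shows a frozen Coord used as a dictionary key. The variable names found and mutated are illustrative only; the error raised on assignment is dataclasses.FrozenInstanceError.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Coord:
    x: int
    y: int


p = Coord(1, 2)
objects = {p: "tree", Coord(4, 3): "rock"}

# Two Coords with equal fields are equal and hash the same, so a
# freshly built Coord(1, 2) finds the entry stored under p.
found = objects[Coord(1, 2)]

try:
    p.x = 3  # frozen instances reject field assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

print(found, mutated)
```

Because a key can never change after it is stored, the dictionary always looks in the right place.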
Pandas: municipalities data
Remember our old RI municipalities data? We worked with these data to learn about table operations in Pyret. We’re nearing the end of the class, so we wanted to make sure to talk about how to do the same types of operations in Python.
Based on what we’ve seen so far in the class, how would we represent these data in Python? Pyret has a built-in Table type that was perfect for this kind of tabular data; Python does not. So how would we do it?
We could use:
- A list of lists, where each inner list is a row
- A list of lists, where each inner list is a column
- A list of dictionaries, where each dictionary is a row
- A dictionary of lists, where each list is a column
- A list of dataclasses
- etc.
Each of these representations might be appropriate for your task; it depends on what you want to do with your data!
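As a rough illustration of two of the options above, the sketch below stores the same small table as a list of dictionaries (one per row) and as a dictionary of lists (one per column). The field names and every value are invented for illustration and are not taken from the municipalities data.

```python
# Hypothetical rows: names and populations are made up.
rows = [
    {"name": "Town A", "city": False, "population": 1200},
    {"name": "City B", "city": True, "population": 90000},
    {"name": "Town C", "city": False, "population": 3400},
]

# List of dictionaries: filtering row by row is natural.
towns = [row for row in rows if not row["city"]]

# Dictionary of lists: the same data, organized by column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Whole-column operations, such as a total, are natural here.
total_population = sum(columns["population"])

print([row["name"] for row in towns], total_population)
```

Row-oriented layouts tend to suit filtering and per-record work, while column-oriented layouts tend to suit aggregates over a single field.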
Another choice would be to use a library that does have built-in support for tabular data. One such library that has become quite popular among data scientists is pandas. pandas includes built-in support for many common operations (like the ones we saw on Pyret tables). It also includes tools for writing efficient programs that access very large amounts of data; these are out of scope for this class.
So: when should we use pandas instead of one of the representations described above? pandas is great for data analysis (or data science!) tasks. If you’re trying to explore a dataset, with an eye towards producing a graph or a table of summary statistics, pandas (or another similar library) is a great fit. If not, sticking with Python’s built-in datatypes might make more sense.
Using pandas
Here’s the pandas code we wrote in class; see the lecture capture for details.
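    # install pandas, xlrd, matplotlib
    # (recent pandas versions read .xlsx files with openpyxl rather than xlrd)
    import pandas as pd

    # read the first sheet (index 0) of the spreadsheet into a DataFrame
    munis = pd.read_excel("municipalities.xlsx", 0)

    # keep only the rows where the boolean City column is False
    towns = munis[~munis['City']]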
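The filtering step can be tried without the spreadsheet. The sketch below builds a small DataFrame by hand as a stand-in for the file; only the City column name comes from the code above, while the Name and Population columns and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for municipalities.xlsx; all values are invented.
munis = pd.DataFrame({
    "Name": ["Town A", "City B", "Town C"],
    "City": [False, True, False],
    "Population": [1200, 90000, 3400],
})

# munis["City"] is a column of booleans; ~ flips each True/False, and
# indexing with the result keeps only the rows marked True.
towns = munis[~munis["City"]]

# A summary statistic over the remaining rows.
mean_town_population = towns["Population"].mean()

print(towns)
print(mean_town_population)
```

This plays the role of a Pyret row filter: the boolean column acts as a mask that selects which rows survive.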