Pandas

Datatypes as dictionary keys

What if we wanted our dictionary to work the other way, storing mappings between Coord instances as keys and strings as values?

So let’s say we have a dictionary like this:

p = Coord(1, 2)
objects = {
  p: "tree",
  Coord(4, 3): "rock"
}

What happens, though, if we change one of the keys (p.x = 3)? A dictionary finds a value by computing a hash from its key, so after the change Python might look in the wrong location for the value. Dictionaries aren’t going to work if our keys can be changed! Because of this, creating the objects dictionary above will actually cause Python to raise a TypeError, something like unhashable type: 'Coord'.
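To see the error concretely, here is a minimal sketch. It assumes Coord was defined earlier as a plain @dataclass (that definition is not shown in this section), and the name error_message is just for illustration.

```python
from dataclasses import dataclass


@dataclass  # assumption: the earlier Coord is a plain, non-frozen dataclass
class Coord:
    x: int
    y: int


p = Coord(1, 2)

try:
    # Building the dictionary hashes each key, which fails for a mutable dataclass.
    objects = {p: "tree", Coord(4, 3): "rock"}
    error_message = None
except TypeError as err:
    error_message = str(err)

# The exact wording varies by Python version, but it mentions "unhashable type".
print(error_message)
```

Note that the error appears as soon as we try to build the dictionary, not later when a key is changed.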

We can get around this by modifying our Coord class:

@dataclass(frozen=True)
class Coord:
  x: int
  y: int

This tells Python that Coord instances should be treated as atomic values: once created, their fields never change. Now we can use them as keys in a dictionary, but we can’t modify instances:

> p = Coord(1, 2)
> p.x = 3
dataclasses.FrozenInstanceError: cannot assign to field 'x'
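Putting the pieces together, the following sketch shows the frozen Coord working as a dictionary key. The variable names found and blocked are only for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Coord:
    x: int
    y: int


p = Coord(1, 2)
objects = {p: "tree", Coord(4, 3): "rock"}

# Any Coord equal to a key finds the value; it need not be the same object.
found = objects[Coord(1, 2)]

try:
    p.x = 3  # frozen instances reject assignment to their fields
    blocked = False
except FrozenInstanceError:
    blocked = True

print(found, blocked)
```

Because the fields cannot change, a Coord’s hash stays the same for its whole lifetime, which is what the dictionary relies on.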

Pandas: municipalities data

Remember our old RI municipalities data? We worked with these data to learn about table operations in Pyret. We’re nearing the end of the class, so we wanted to make sure to talk about how to do the same types of operations in Python.

Based on what we’ve seen so far in the class, how would we represent these data in Python? Pyret has a built-in Table type that was perfect for this kind of tabular data; Python does not. So how would we do it?

We could use:

  • A list of lists, where each inner list is a row
  • A list of lists, where each inner list is a column
  • A list of dictionaries, where each dictionary is a row
  • A dictionary of lists, where each list is a column
  • A list of dataclasses
  • etc.

Each of these representations might be appropriate for your task; it depends on what you want to do with your data!
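As a rough sketch, here is how a few of these representations might look, and how the same question ("which rows are towns?") reads in each. The rows, field names, and the Municipality class are made-up placeholders, not the real RI data.

```python
from dataclasses import dataclass

# Placeholder rows: names and numbers are invented for illustration only.
rows_as_lists = [
    ["Town A", False, 1000],
    ["City B", True, 50000],
]

rows_as_dicts = [
    {"name": "Town A", "city": False, "population": 1000},
    {"name": "City B", "city": True, "population": 50000},
]

columns_as_lists = {
    "name": ["Town A", "City B"],
    "city": [False, True],
    "population": [1000, 50000],
}


@dataclass
class Municipality:  # hypothetical class, one instance per row
    name: str
    city: bool
    population: int


rows_as_dataclasses = [
    Municipality("Town A", False, 1000),
    Municipality("City B", True, 50000),
]

# The same filter, written against each representation.
towns_from_lists = [row[0] for row in rows_as_lists if not row[1]]
towns_from_dicts = [row["name"] for row in rows_as_dicts if not row["city"]]
towns_from_columns = [
    name
    for name, is_city in zip(columns_as_lists["name"], columns_as_lists["city"])
    if not is_city
]
towns_from_dataclasses = [m.name for m in rows_as_dataclasses if not m.city]

print(towns_from_lists, towns_from_dicts, towns_from_columns, towns_from_dataclasses)
```

The row-based versions make it easy to work with one municipality at a time, while the dictionary of lists makes it easy to work with one column at a time. The list of lists is the most compact, but you have to remember which position holds which field.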

Another choice would be to use a library that does have built-in support for tabular data. One such library that has become quite popular among data scientists is pandas. pandas includes built-in support for many common operations (like the ones we saw on Pyret tables). It also includes tools for writing efficient programs that access very large amounts of data; these are out of scope for this class.

So: when should we use pandas instead of one of the representations described above? pandas is great for data analysis (or data science!) tasks. If you’re trying to explore a dataset, with an eye towards producing a graph or a table of summary statistics, pandas (or another similar library) is a great fit. If not, sticking with Python’s built-in datatypes might make more sense.

Using pandas

Here’s the pandas code we wrote in class; see the lecture capture for details.

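# install pandas, matplotlib, and an Excel reader
# (openpyxl for .xlsx files; xlrd 2.0+ reads only the older .xls format)

import pandas as pd

# read the first sheet (index 0) of the spreadsheet into a DataFrame
munis = pd.read_excel("municipalities.xlsx", sheet_name=0)

# keep only the rows that are not cities (assumes 'City' is a boolean column)
towns = munis[~munis['City']]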
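If you don’t have the spreadsheet handy, the sketch below builds a small DataFrame by hand so you can try the same kind of filtering. The rows and the Name and Population columns are made-up placeholders; only the boolean City column comes from the code above.

```python
import pandas as pd

# Placeholder data standing in for municipalities.xlsx; all values are invented.
munis = pd.DataFrame({
    "Name": ["Town A", "City B", "Town C"],
    "City": [False, True, False],
    "Population": [1000, 50000, 2500],
})

# munis["City"] is a column of booleans; ~ flips each True/False,
# so this keeps only the rows where City is False.
towns = munis[~munis["City"]]

# A comparison on a column also produces booleans we can filter with.
big_towns = towns[towns["Population"] > 2000]

# Sort all rows by population, largest first.
by_size = munis.sort_values("Population", ascending=False)

town_names = towns["Name"].tolist()
print(town_names)
```

Filtering and sorting return new DataFrames and leave munis unchanged, much like the table operations we used in Pyret.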