Lab 12

In today’s lab, you have the chance to learn how to use some Python packages that might be useful in the future!

We’ll start with a brief explanation of what a package is and how to install one in PyCharm. Then, the rest of the lab will consist of a number of explanations and practice problems meant to familiarize you with popular Python packages.

You do not have to do this lab in order, and we do not expect you to finish all of the problems.

Instead, skim through the sections and pick one that seems interesting. Once you finish a section, feel free to either move on to another section or look up the package documentation online and see what else you can do with it!

What’s a package?

First, some terminology:

There are hundreds of thousands of Python packages available online. Some are so commonly used that you’ll find them in almost every large-scale Python application; others serve highly specific purposes.

PyCharm makes it very easy to download and use packages in your projects.

On a Mac computer: go to PyCharm --> Preferences --> Project Interpreter and press the + button.

On a Windows computer: go to File --> Settings --> Project --> Project Interpreter and press the + button.

This will bring up a list of common packages; search for the one that you want and press “Install Package”.

Section 1: Numpy

Numpy is a widely used library for manipulating numbers, vectors, and matrices (many other libraries are built on top of Numpy). Numpy has great support for manipulating vectors with arbitrary numbers of dimensions. We’ll go through a couple examples of problems where Numpy might be helpful. Note that a lot of this can be done in Python, without Numpy, but Numpy tends to make math-related computations easier and more efficient.

It might be helpful to look at documentation here. And in general, googling the name of a function should turn up some helpful documentation.

Getting started

Once you’ve installed Numpy, include import numpy as np at the top of your Python file.

Part 1

Numpy works with arrays, which are essentially lists that can have one, two, or more dimensions. Let’s get our feet wet by creating some Numpy arrays. For each of these problems, assign the array to a variable and then print the array (no need to write functions):

  1. Make the array [0, 1, 2] with np.array
  2. Make the array [0, 1, 2] with np.arange
  3. Make the array [0, 0, 0] with np.zeros
  4. Make the matrix [[0, 1], [2, 3]] with np.array
  5. Make the matrix [[0, 1], [2, 3]] with np.arange and np.reshape.
    Note that the ‘shape’ of an array is essentially its dimensions. The above matrix has two rows and two columns, so its shape is (2, 2).
  6. Print the shape of the matrix from the previous part with .shape.
    Note that if v is an array, v.shape will return a tuple that contains its shape.
  7. Print the number of rows in the matrix from part 5 with .shape.
    Note that the number of rows is the first element of the shape tuple.
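As a rough illustration of the constructors named above, here is a small sketch. It deliberately uses different values from the exercises, and the variable names are just placeholders.

```python
import numpy as np

# np.array builds an array from an existing Python list.
a = np.array([5, 6, 7])

# np.arange(n) counts from 0 up to (but not including) n.
b = np.arange(6)

# np.zeros(n) makes an array of n zeros (floats by default).
z = np.zeros(4)

# np.reshape rearranges the same six values into 2 rows and 3 columns.
m = np.reshape(b, (2, 3))

print(m.shape)      # the shape tuple: (2, 3)
print(m.shape[0])   # its first element is the number of rows: 2
```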

Part 2

Great, now let’s use some Numpy functions to manipulate arrays. Write a couple example vectors (these can be small) and print the result of each of these computations. Again, no need to write functions here.

  1. Take the average of elements in a vector (np.average)
  2. Add two vectors element-wise, which means that each element in the result vector is the sum of the corresponding elements in the two vectors.
    Note that you can add vectors element-wise with ‘+’!
  3. Take the sum of elements in a vector (np.sum)
  4. Take the square root of a number (np.sqrt)
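One thing that may be surprising is that ‘+’ behaves differently on Numpy arrays than on plain Python lists. The sketch below (with made-up values) contrasts the two, and also shows that np.sqrt works element-wise when given an array.

```python
import numpy as np

# With plain lists, + glues the lists together.
list_sum = [1, 2] + [3, 4]           # [1, 2, 3, 4]

# With Numpy arrays, + adds matching positions.
v = np.array([1.0, 4.0, 9.0])
w = np.array([10.0, 20.0, 30.0])
array_sum = v + w                    # [11., 24., 39.]

# Functions like np.sqrt are applied to every element of an array.
roots = np.sqrt(v)                   # [1., 2., 3.]

print(list_sum, array_sum, roots)
```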

Part 3

Finally, let’s tie everything together and write a function that does a substantial mathematical computation.

One simple method for detecting outliers is to mark each element that’s more than two standard deviations from the mean as an outlier. Write a function that takes in a one-dimensional numpy array and returns a list of outliers, using this method.

np.abs might be helpful. You can also use the first formula for standard deviation in this Wikipedia article.

Also make sure to verify that your function works by inputting an array that clearly has a couple outliers.
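If you haven’t seen it before, the sketch below shows how np.abs and a comparison can be combined to pick elements out of an array. It uses a fixed, made-up threshold (more than 10 away from 50) rather than the standard-deviation rule, so it is only a building block, not a solution.

```python
import numpy as np

readings = np.array([48, 52, 50, 75, 49])

# np.abs works element-wise: each reading's distance from 50.
distance = np.abs(readings - 50)

# Comparing an array to a number gives an array of True/False values ...
far = distance > 10

# ... which can be used as an index to keep only the True positions.
far_readings = readings[far].tolist()

print(far_readings)
```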

If you’re excited about learning more about Numpy, we recommend the following tutorials:

There are also some Numpy image manipulation tutorials linked in the Pillow section of the lab.

Section 2: Pandas

Pandas is a really powerful and fun library for data manipulation/analysis, with easy syntax and fast operations. Because of this, it is probably the most popular library for data analysis in the Python programming language.

Millions of people around the world use Pandas. In October 2017 alone, Stack Overflow, a website for programmers, recorded 5 million visits to questions about Pandas from more than 1 million unique visitors. Data scientists at Google, Facebook, JP Morgan, and virtually every other major company that analyzes data use Pandas.

In this lab section, we’re going to learn the basics of Pandas and use its functionality to analyze some datasets.

Series and DataFrames

In order to master pandas you need to learn its two main data structures: Series and DataFrame.

Series

A Series is an object that is similar to Python’s built-in list data structure, but differs from it because each element has an associated label. This distinctive feature makes it act more like a hashtable/dictionary.

>>> import pandas as pd
>>> my_series = pd.Series([5, 6, 7, 8, 9, 10])
>>> my_series
0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64
>>>

If desired, you can give each item in the series its own custom key, creating a key-value pairing much like Python’s dictionary data structure.

>>> my_series3 = pd.Series({'a': 5, 'b': 6, 'c': 7, 'd': 8})
>>> my_series3
a    5
b    6
c    7
d    8
dtype: int64
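To make the “list with labels” idea concrete, here is a minimal sketch of looking values up in a Series. The fruit prices are invented for illustration.

```python
import pandas as pd

prices = pd.Series({'apple': 3, 'pear': 5, 'plum': 4})

pear_price = prices['pear']      # look up by label, like a dictionary
first_price = prices.iloc[0]     # look up by position, like a list
labels = list(prices.index)      # the labels themselves

print(pear_price, first_price, labels)
```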

DataFrame

Simply put, a DataFrame is a table. It has rows and columns. Each column in a DataFrame is a Series, and each row is made up of one element from each of those Series.

DataFrame can be constructed using built-in Python dicts:

>>> df = pd.DataFrame({
...     'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
...     'population': [17.04, 143.5, 9.5, 45.5],
...     'square': [2724902, 17125191, 207600, 603628]
... })
>>> df
      country  population    square
0  Kazakhstan       17.04   2724902
1      Russia      143.50  17125191
2     Belarus        9.50    207600
3     Ukraine       45.50    603628
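Here is a short sketch of a few things you can do with the DataFrame above: pulling out a column, filtering rows with a condition, and finding the row with the largest value. The variable names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [17.04, 143.5, 9.5, 45.5],
    'square': [2724902, 17125191, 207600, 603628]
})

# A single column is a Series.
pop = df['population']
is_series = isinstance(pop, pd.Series)

# A True/False condition on a column selects the matching rows.
big = df[df['population'] > 40]
big_countries = big['country'].tolist()

# idxmax gives the row label of the largest value in a column.
largest = df.loc[df['square'].idxmax(), 'country']

print(is_series, big_countries, largest)
```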

Reading and Writing to Files

Reading and writing data between a program and files is incredibly easy, and pandas supports many file formats, including CSV, XML, HTML, Excel, SQL, and JSON (check out official docs).

For example, if we wanted to save our previous DataFrame df to a file, we only need a single line of code:

>>> df.to_csv('filename.csv')

We have saved our DataFrame, but what about reading data? No problem:

>>> df = pd.read_csv('filename.csv', sep=',')
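One detail worth knowing: by default, to_csv also writes the row labels (0, 1, 2, ...) as an unnamed first column, which then shows up as an extra column when you read the file back. The sketch below passes index=False to leave them out. It writes to a temporary folder (an assumption made here so the example doesn’t clutter your project).

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'country': ['Belarus', 'Ukraine'],
                   'population': [9.5, 45.5]})

# A throwaway folder for the example file.
path = os.path.join(tempfile.mkdtemp(), 'filename.csv')

# index=False: don't write the row labels into the file.
df.to_csv(path, index=False)

loaded = pd.read_csv(path)
print(list(loaded.columns))   # only the two real columns
```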

Now that we know the basics of pandas, let’s go ahead and analyze some datasets! Here are some links to the official documentation and a cheat sheet if you get stuck. And don’t forget about our good friend Stack Overflow!

Candy

The first one is a dataset you know and love: the candy dataset! (See Lab 10 for instructions on how to create the candy.csv file).

  1. Use pandas to get the candy with the highest sugar percentage.
  2. Use pandas to get candy that contains both chocolate and caramel.
  3. Save the DataFrame with candy containing chocolate and caramel as a csv file called chocolate_and_caramel.csv.

(Very Basic) Stock Price Predictor

The second dataset is the stock data for American Airlines. We’re going to use some basic statistical analysis to predict the stock’s value on a certain date.

  1. Go to this link and use the data to create a csv file called aal.csv.
  2. Read the csv file into a pandas dataframe, and sort the values by date.
  3. Create a new column in the DataFrame called average_price which is equal to the (high + low) / 2 of the stock price on a certain day.
  4. In order to predict the stock market price for a certain day, k, we’re going to calculate the average stock price of all the previous days from 0 to k - 1.
  5. Add a new column to the dataframe called predicted_price which is equal to the value calculated in step 4.
  6. Save the dataframe to a csv file named aal_predictions.csv.
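If the “average of all previous days” in step 4 is hard to picture, here is one possible way to express it, shown on four made-up prices rather than the real data. This is a sketch, not the only approach: a plain loop works too.

```python
import pandas as pd

# Hypothetical prices, already sorted by date.
prices = pd.DataFrame({'average_price': [10.0, 20.0, 30.0, 40.0]})

# expanding().mean() is the running average of days 0..k;
# shift(1) moves it down one row, so row k only sees days 0..k-1.
running_average = prices['average_price'].expanding().mean()
prices['predicted_price'] = running_average.shift(1)

print(prices)
```

The first day has no previous days to average, so its prediction is NaN (“not a number”).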

If you want to learn more about Pandas, below are a few tutorials that you can try:

Section 3: Matplotlib

Setup

Matplotlib is a Python package used for plotting! It gives access to the same kind of plotting functions we used in Pyret, except with more power (since we’re in Python). Now that you know how to use lists, plotting will be much more flexible (before we were stuck with tables).

To get started with matplotlib, install it in PyCharm, then import it with the following line:

import matplotlib.pyplot as plt

This imports the pyplot submodule from the matplotlib package, and renames it plt so you don’t have to write matplotlib.pyplot every time you want to plot something.

Throughout these questions, you will need to use the online documentation for matplotlib. Try to figure out how to Google the approaches to these questions (for example, “how to plot a line in matplotlib”). The official matplotlib documentation and Stack Overflow are both great sources.

Here’s a page to get started: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html

Part 1: Simple Plotting

Beep is working on a logo for his business. He likes things simple, so he wants you to make a (hollow) red rectangle with a (hollow) green triangle inside of it.

Using the plt.plot() function (imported above), make Beep his business card logo. Show the plot by using plt.show()!

You can use plt.plot multiple times to build up a plot with multiple lines before using plt.show(), but once you use plt.show(), the plot will be erased and plt.plot starts from scratch.
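As a small sketch of building up a plot with several plt.plot calls (the shapes and colors here are arbitrary, not the logo):

```python
import matplotlib.pyplot as plt

plt.figure()   # start from a fresh, empty plot

# Each call takes a list of x-coordinates and a list of y-coordinates
# and connects the points in order.
plt.plot([0, 1, 2], [0, 2, 0], color="blue")     # a "^" shape
plt.plot([0, 2], [1, 1], color="orange")         # a horizontal line

# Both lines are now waiting on the same plot.
line_count = len(plt.gca().lines)
print(line_count)
```

Calling plt.show() at this point would display both lines in one window.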

Part 2: Timeseries and Scatter Plotting

Timeseries

The plt.plot function is great for looking at trends over time. Download this csv file and load it using the csv module (which you can figure out by Googling), by using tools from the Pandas section of this lab, or by using tools from numpy (look up “loading a text file in numpy”).

This dataset gives the day of the year and the number of hours of daylight in that day (for Providence, RI). For example, day 350 has 9.054 hours of daylight, whereas day 173 has 15.014 hours.

  1. Plot hours per day versus day.
  2. Give the plot a title and informative labels on the axes.
  3. Show Beep and Boop the plot using plt.show()

NOTE: Be careful about types! You cannot plot a list of strings.
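To illustrate the note about types: the csv module hands you every value as a string, so you’ll need to convert before plotting. The sketch below uses a two-line stand-in for the downloaded file and assumes there is no header row (check whether the real file has one).

```python
import csv
import io

# Hypothetical stand-in for the downloaded file: day, hours.
fake_file = io.StringIO("350,9.054\n173,15.014\n")

days = []
hours = []
for row in csv.reader(fake_file):
    # Every field arrives as a string, e.g. ['350', '9.054'].
    days.append(int(row[0]))
    hours.append(float(row[1]))

print(days, hours)
```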

Scatter plots

The plt.scatter function is good for looking at trends in observations.

Download the data from the 2017 Boston Marathon here. It contains one row for each runner, with the first column being age (in years) and the second being how long it took that runner to complete the marathon (in minutes).

Again use csv to load the data, then plt.scatter() to plot it. Put a title and labels on the axes. Is the data surprising to you, or what you expected?

Note: You may want to pass s=2 to plt.scatter so the data is more viewable: plt.scatter(..., ..., s=2)

If you have done the numpy section of the lab, use StackOverflow to figure out how to fit a line of best fit to this scatter plot!

Part 3: Math Plotting

If you have done the numpy section of the lab, try to use numpy arrays for this plotting. matplotlib accepts numpy arrays as input to pyplot.plot.

  1. Plot these functions on top of each other for x between 0 and 5.

  2. Make a legend so it’s clear which function is which. You should only need plt.plot, not any of matplotlib's other functions.

Note: Python’s ** operator may be useful; it is the exponentiation operator.

>>> 2 ** 3
8
>>> 4 ** 0.5 # (a ^ 0.5) power is the same as square root of a
2.0

HINT 1: You cannot plot a function in most programming languages; it’s impossible to apply the function to every single point in some domain. So you need to choose a range of x values, then apply the function to those x values. For example, x is all numbers between 0 and 5 with a step size of 0.05.

HINT 2: You may find map helpful for these problems.
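The two hints can be sketched together as follows. The function f here is a hypothetical stand-in for whichever function you are plotting.

```python
# Hypothetical function, standing in for the one you want to plot.
def f(x):
    return x ** 2

# 0, 0.05, 0.10, ..., 5.0  (101 points)
xs = [i * 0.05 for i in range(101)]

# map applies f to every x; list() collects the results.
ys = list(map(f, xs))

print(len(xs), xs[:3], ys[:3])
```

The lists xs and ys can then be passed to plt.plot as the x- and y-coordinates.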

Part 4a: Subplots

Look up matplotlib subplots and use them to arrange the first four plots from part 3 in a 2 by 2 grid. Be sure to put a title on each subplot so that you can tell which function is which.
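For orientation, here is a minimal sketch of a 2 by 2 grid using plt.subplots. The four functions are made up for illustration; swap in the ones from part 3.

```python
import matplotlib.pyplot as plt

xs = [i * 0.05 for i in range(101)]

# Hypothetical functions, keyed by the title to show above each plot.
functions = {
    "x": lambda x: x,
    "x ** 2": lambda x: x ** 2,
    "x ** 3": lambda x: x ** 3,
    "x ** 0.5": lambda x: x ** 0.5,
}

# axes is a 2x2 array of plotting areas; axes.flat visits them in order.
fig, axes = plt.subplots(2, 2)
for ax, (name, f) in zip(axes.flat, functions.items()):
    ax.plot(xs, [f(x) for x in xs])
    ax.set_title(name)

fig.tight_layout()   # keeps titles from overlapping neighboring plots
```

Note that each plotting area has its own plot and set_title methods, which are used here instead of plt.plot and plt.title.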

Part 4b: Tea time (Math-heavy)

Newton’s law of cooling states that an object cools faster the colder its surroundings (and vice versa). So if you put an ice cube in a pot of boiling water, it’ll melt faster than if you put it into a cup of tap water. But a few seconds later, once the temperature of the ice cube has increased a bit, it’ll warm less quickly than before.

In any given second, you can write the change in temperature of an object (ΔT(t)) as some constant (which depends on the material of the object) multiplied by the difference between the current temperature of the object (T(t)) and its surroundings.

$$\Delta T_{\text{obj}}(t) = -c\,\bigl(T_{\text{obj}}(t) - T_{\text{surroundings}}\bigr)$$

Tobj is written as a function of t since the temperature of the object relies on the time elapsed.

Here’s how you would want to write this programmatically:

$$T_{\text{new}} = T_{\text{old}} - c\,\bigl(T_{\text{old}} - T_{\text{surroundings}}\bigr)$$
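In Python, one step of this update might look like the sketch below. It assumes the rule reads “new temperature = old temperature minus c times (old temperature minus surroundings)”. The function name and the numbers are hypothetical, chosen only to show the pattern.

```python
def next_temperature(t_old, t_surroundings, c):
    """Temperature one second later: t_old - c * (t_old - t_surroundings)."""
    return t_old - c * (t_old - t_surroundings)

# Made-up values: an 80-degree object in 20-degree surroundings, c = 0.1.
temps = [80.0]
for second in range(3):
    temps.append(next_temperature(temps[-1], 20.0, 0.1))

print(temps)   # the drop gets smaller each second
```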

OK, now the question:

Boop has a cup of tea that starts off at 95 degrees (Celsius), and wants to drink the tea. Boop wants to see the trajectory of the temperature, and wants to know the temperature after 3 minutes (180 seconds) have elapsed.

It’s summer time, so things are pretty toasty. Set Tsurroundings=40. The constant c for tea has been experimentally found to be 0.02.

First, sketch out what you think the temperature of the tea will look like over the first 3 minutes. Then, use matplotlib to do so, so Boop can determine whether the tea is cool enough to drink or not.

Remember, you will need a list of x-coordinates and y-coordinates to make the plot. The x-coordinates should correspond to the second, and the y-coordinates to the temperature of the tea in that second.

If you’re excited about Matplotlib, check out the following links:

Section 4: Pillow

Intro to Pillow

Pillow is a package meant for the programmatic creation and modification of images. The documentation is here, if you want to look through it.

Part 1: Pillow Basics

Start by creating a new Python file wherever you normally store Python files. Install Pillow by following the instructions at the beginning of the lab. Finally, download “puppy.jpg” from the course website (you can find it under Additional Materials for this week’s lab). Move it from your downloads folder to the same folder as your newly created Python file.

Include the following line at the top of your code:

from PIL import Image, ImageDraw

Image and ImageDraw are the Pillow modules that we’ll be using for the following exercises.

We’ll start by loading our image into Python as follows:

pupper = Image.open("puppy.jpg")

Image.open opens the input file and returns a Pillow Image object; the pixel data itself is read from the file the first time it is needed. We can now use Pillow functions to augment our “puppy” image.

To view pupper as a regular image, use the function:

pupper.show()

This will open pupper using your computer’s default image software.

NOTE: The file that opens will NOT be named “puppy.jpg”. This is because show() saves pupper to a new temporary file that just looks identical to “puppy.jpg”.

Finally, to save pupper, use the function:

pupper.save("pupper.jpg", "jpeg")

The first argument represents the new filename and the second argument represents the file type. The file will be saved in the same folder as your Python code.
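Here is a small sketch of the open-then-save round trip. Since it can’t assume puppy.jpg is available, it first creates a plain orange stand-in image in a temporary folder; the file names are made up.

```python
import os
import tempfile

from PIL import Image

folder = tempfile.mkdtemp()
original_path = os.path.join(folder, "stand_in.jpg")

# Stand-in for puppy.jpg: a plain 64x48 orange image.
Image.new("RGB", (64, 48), "orange").save(original_path, "jpeg")

picture = Image.open(original_path)
size = picture.size                  # a (width, height) tuple

# Saving with a different file type converts the image.
copy_path = os.path.join(folder, "copy.png")
picture.save(copy_path, "png")

print(size, os.path.exists(copy_path))
```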

Filtering Images

Pillow has several built-in filters that can be used to augment Image objects. Applying a filter to an Image creates a new Image (rather than mutating the original image). To apply a filter to an Image, first add ImageFilter to your import line (from PIL import Image, ImageDraw, ImageFilter), then use the following code:

blurred = pupper.filter(ImageFilter.BLUR)

ImageFilter.BLUR is one of several preset filter options. The full list can be found in the documentation.
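The sketch below checks the “creates a new Image” claim on a small stand-in picture (white with a black square), so it runs without puppy.jpg. Note the extra ImageFilter import.

```python
from PIL import Image, ImageFilter

# Stand-in image: white with a black square in the middle.
img = Image.new("RGB", (40, 40), "white")
img.paste((0, 0, 0), (15, 15, 25, 25))

blurred = img.filter(ImageFilter.BLUR)
sharper = img.filter(ImageFilter.SHARPEN)

# The original is untouched; each filter call returned a new Image.
center_before = img.getpixel((20, 20))
print(center_before, blurred is img)
```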

  1. Use a few of the filters in the documentation to augment pupper. Open and save each of the images that you create using Pillow functions.

  2. If you’re interested in trying out more image augmentation, try looking through this tutorial from the Pillow documentation. See what other transformations you can make to pupper! Otherwise, move on to part 2.

Creating New Images

Pillow has an ImageDraw module that facilitates the programmatic creation of images – similar to the Pyret image library that we used at the beginning of CS-111!

To draw an image from scratch, the first thing you’ll need to do is create a new blank Image object with the following function:

new_image = Image.new("RGBA", (300, 200), "white")

The first argument specifies the image mode, which you can read about here.

NOTE: In this case, “RGBA” stands for “red, green, blue, alpha”. This is a standard way of encoding color and transparency values for pixels, but there are other possibilities that you might find useful. For example, “L” creates a grayscale image and “HSV” specifies colors by “hue”, “saturation”, and “value”.

The second argument specifies the dimensions of the image as a (width, height) tuple, and the third argument specifies the color. You can pass the function standard HTML color names as strings (about 140 are supported), or you can use RGB or HSL color specifiers.

  1. Try using new_image.show() to see your Image. Right now, it should just be a blank white rectangle.

In order to draw on an Image, you first need to create an ImageDraw object for it. You can use the following code to do so:

new_idraw_object = ImageDraw.Draw(new_image)

This creates a new ImageDraw object called new_idraw_object based on new_image.

To draw an ellipse, add the following line:

new_idraw_object.ellipse((10,30,20,50), fill="blue")

The first input is an (x1,y1,x2,y2) tuple describing the box that the ellipse fits inside, where (x1, y1) represent the coordinates of the box’s top left corner, and (x2, y2) represent the coordinates of the box’s bottom right corner. The second input, as before, represents the ellipse’s color.
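Putting the last few steps together, here is a sketch that draws the ellipse and then inspects two pixels to confirm where the color landed. The names canvas and pen are placeholders.

```python
from PIL import Image, ImageDraw

canvas = Image.new("RGBA", (300, 200), "white")
pen = ImageDraw.Draw(canvas)

# The box runs from (10, 30) to (20, 50): 10 wide and 20 tall.
pen.ellipse((10, 30, 20, 50), fill="blue")

inside = canvas.getpixel((15, 40))      # center of the box
outside = canvas.getpixel((100, 100))   # far away from the box

print(inside, outside)
```

Each pixel comes back as a (red, green, blue, alpha) tuple, matching the “RGBA” mode.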

You can find a list of ImageDraw functions here, but the most useful ones (other than .ellipse) are:

The last one is worth talking about explicitly: the first input is an (x, y) tuple giving the position of the text’s top left corner, the second input is a string, and the third input is the fill. The function stamps the string onto the image.
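As a sketch of stamping a string onto an image, the example below uses ImageDraw’s text function with an (x, y) position and Pillow’s default font. The image size and the string are arbitrary.

```python
from PIL import Image, ImageDraw

blank = Image.new("RGB", (120, 40), "white")
card = blank.copy()

# text takes an (x, y) position, the string, and a fill color.
ImageDraw.Draw(card).text((10, 10), "Beep", fill="black")

# The copy now differs from the blank image, so pixels were drawn.
changed = card.tobytes() != blank.tobytes()
print(changed)
```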

NOTE: All of the ImageDraw functions that we’ve mentioned so far mutate ImageDraw objects, so all you need to do is call them in order to see their effects.

  1. Use ImageDraw functions to create a blank image and draw this image. Save the image after it’s drawn.

  2. Beep and Boop have always had a passion for photography and recently opened a small photography business. In order to ensure that their competitors don’t steal their images, they want to put a unique watermark on all of them.

    Write a function add_watermark(image_list) that takes in a list of file names, applies a watermark to them, and saves the watermarked images.

    Your watermark can be whatever you want: it might be a combination of shapes, filters, and text. Be creative! The new image should be saved with the name “<original image>-watermark”. For example, if one of the original images is “puppy.jpg”, the watermarked image should be saved as “puppy-watermark.jpg”.

    Test your functions by downloading images onto your computer and moving them to the same folder as your code!
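For the naming rule, one possible approach is to split the file name at its extension. The helper below is hypothetical (its name is made up) and covers only the naming, not the watermark itself.

```python
import os

def watermark_name(filename):
    """Hypothetical helper: 'puppy.jpg' -> 'puppy-watermark.jpg'."""
    # splitext separates the name from its final extension.
    stem, extension = os.path.splitext(filename)
    return stem + "-watermark" + extension

print(watermark_name("puppy.jpg"))
```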

If you’re excited about learning more about Pillow, we recommend the following tutorials:

You can also process images using a different package, Numpy (another section in this lab!). Image manipulation in Numpy works directly with the numerical color values of each image pixel. This allows for significantly more control over small changes. Here are a few tutorials on that, if you’re interested:

So long, farewell

As this is the final lab, Beep and Boop want to bid you all adieu. Beep and Boop have learned so much from y’all, and hopefully, you have learned from them too! Beep and Boop are confident that you are all ready to take on the world, even without them by your side, and so, they are finally heading back home to the sun.

You are all stars. Don’t ever let the fire in you stop burning.
– Beep and Boop