This activity will be an introduction to Pandas. Pandas is a library often used for performing computations on large data sets.
Let's explore pandas in your terminal's interactive mode. Before opening Python, though, use cd to change your current directory to the CS 3 workspace.
- Begin by calling
import pandas as pd. This imports the pandas module under the shorter name pd, which saves keystrokes when we use pandas a lot in our code.
- Create a series with the numbers -5 to 5. Creating a series requires a collection or an iterable, like a list or a range, to be passed in.
- You can use the mathematical operators you are already familiar with between a series and a single number. Test out the various operators by trying to add, subtract, multiply and divide the series by a number.
- Create a series with the numbers from 10 up to but not including 20 and store it in a variable called series2
- Pandas also supports these same operators between two series of data. Again, try adding, subtracting, multiplying and dividing each series by one another. This provides convenient manipulation of large arrays of data in Python.
- When performing these operations, the two series should be the same shape. Try adding series to a new series
pd.Series(range(20)). Notice that the last ten values are NaN. This is a special floating-point value that stands for Not a Number; it appears here because pandas matches values by index label, and the last ten labels exist in only one of the two series. Therefore, it's important to ensure that the dimensions of the series you are operating on match.
- Add the two series together and store the new series in a variable called summation
- Pandas also supports comparison operators like
== != > >= < <= with either another series or with float values. For example,
series == 5 will create a resulting series which is True for any value that equals 5 and False otherwise. Using the modulo operator (
%) and a comparison operator, write an expression to find the values in series that are evenly divisible by 2 and assign this to a variable called divisible_by_two.
- You can get multiple elements from a series by indexing with a list of indices:
indices = [0, 1, 2, 3, 4]
series[indices]
You can also use an array of booleans (True/False) that marks the items you want pulled from the series:
are_positive = series > 0
series[are_positive]
Using the variables series and divisible_by_two, try creating a series with just the elements that are even.
series = pd.Series(range(-5, 5))
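The steps in this list can be tried together in one short session. This is only a sketch: the names doubled, mismatched and evens are illustrative choices that do not appear in the activity, and it assumes series holds range(-5, 5), which runs from -5 up to 4.

```python
import pandas as pd

series = pd.Series(range(-5, 5))      # -5, -4, ..., 4
series2 = pd.Series(range(10, 20))    # 10, 11, ..., 19

# Arithmetic with a single number is applied to every element.
doubled = series * 2

# Arithmetic between two series pairs values up by index label.
summation = series + series2

# A longer series leaves unmatched labels, which become NaN.
mismatched = series + pd.Series(range(20))

# A comparison produces a boolean series that can be used as a mask.
divisible_by_two = series % 2 == 0
evens = series[divisible_by_two]

print(summation.tolist())
print(mismatched.isna().sum())   # number of NaN entries
print(evens.tolist())
```

Calling isna().sum() on the mismatched result is a quick way to count how many positions failed to line up.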
The pandas Series includes many other helpful methods. Still using the interactive terminal, try out these methods. When performing these operations in your programs, you just need to replace series with the pd.Series variable you have in your own code.
- Compute the sum of the series, using
series.sum()
- Compute the mean, median and mode, using
series.mean() and similar for the other items
- Compute the max and min of the series using
series.max() and series.min()
- You can sort the elements of a series extremely quickly using
series.sort_values()
- You can also find the smallest or largest n items using
series.nsmallest(n) or series.nlargest(n). Replace n with the number of items you want listed.
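Here is a minimal sketch of these methods on the same ten-value series. The variable names are illustrative, and ascending=False is just one of the options sort_values accepts.

```python
import pandas as pd

series = pd.Series(range(-5, 5))  # -5, -4, ..., 4

total = series.sum()
average = series.mean()
middle = series.median()
modes = series.mode()   # returns a Series; if values tie, all of them are listed
largest, smallest = series.max(), series.min()

descending = series.sort_values(ascending=False)  # returns a new, sorted series
three_smallest = series.nsmallest(3)
three_largest = series.nlargest(3)

print(total, average, middle, largest, smallest)
print(three_smallest.tolist(), three_largest.tolist())
```

Note that mode() hands back a series rather than a single number, because several values can be tied for most frequent.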
In this task, we'll write a small program that performs some operations on some data. Keep your interactive terminal open; we'll be using it to try out new functions. We will analyze happiness data on countries worldwide. Download the CSV file here and save it in your CS 3 workspace.
We'll be going back and forth between the interactive terminal and your program code. We'll note which one we want you to use for each step of a task.
Now you'll begin to set up your Python program. We have provided you with an initial stencil here to start. Be sure to rename the file to FirstLast_ACT3-1.py
In your program: Remember to import pandas at the top of your code:
import pandas as pd
In the interactive terminal: Your first step will be to read the CSV file using Pandas. This can be done with the following code:
data_frame = pd.read_csv('data.csv')
In the interactive terminal: Copy and paste this line into the terminal. Pandas is able to read a CSV file and store it in a data structure called a DataFrame. A DataFrame is similar to a mini-spreadsheet. Click here to learn more about DataFrames.
In the interactive terminal: You can view the first few rows of data by typing
data_frame.head(). It can also be helpful to glance at summary statistics of the DataFrame. You can do this by calling
data_frame.describe().
In your program: Place the
read_csv() code within your main() function.
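One possible shape for main() is sketched below. So that it runs without the downloaded file, it reads from a small in-memory stand-in; SAMPLE_CSV and its rows are made up for illustration and are not real survey values. In your program you would pass 'data.csv' to read_csv instead.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: invented rows, not real data.
SAMPLE_CSV = """Country,Region,Happiness Rank,Happiness Score
Atlantis,Western Europe,1,7.5
Borduria,Central and Eastern Europe,2,6.1
Cascadia,North America,3,5.9
"""


def main():
    # In the real program this would be pd.read_csv('data.csv').
    data_frame = pd.read_csv(io.StringIO(SAMPLE_CSV))
    print(data_frame.head())      # the first rows (up to five)
    print(data_frame.describe())  # summary statistics for the numeric columns
    return data_frame


if __name__ == "__main__":
    main()
```

Returning the data frame from main() is a choice made here so the result is easy to inspect; the stencil may be organized differently.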
Now that we have the data stored in a data frame, Pandas lets us perform many of the same operations we learned earlier in the course in Google Sheets. Let's try out a few things:
In the interactive terminal: In a data frame, the columns can be indexed by their name. For example, the data set contains the column Country that contains the name of all of the countries. We can observe this column by running:
print(data_frame["Country"]). To observe multiple columns, we can chain them together in a list as follows:
print(data_frame[["Country", "Happiness Rank"]])
In your program: In the main function, print a table with the Country Name, its Happiness Rank, and its Happiness Score.
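The difference between selecting one column and selecting several can be seen on a toy table. The rows below are made up so the sketch runs on its own; only the column names come from this activity.

```python
import pandas as pd

# Toy rows with invented values; the real file has more rows and columns.
data_frame = pd.DataFrame({
    "Country": ["Atlantis", "Borduria", "Cascadia"],
    "Happiness Rank": [1, 2, 3],
    "Happiness Score": [7.5, 6.1, 5.9],
    "Freedom": [0.6, 0.4, 0.7],
})

one_column = data_frame["Country"]                        # a Series
two_columns = data_frame[["Country", "Happiness Score"]]  # a smaller DataFrame

print(one_column)
print(two_columns)
```

A single name in the brackets gives back a series, while a list of names (note the double brackets) gives back a data frame.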
In the interactive terminal: In the data frame, the rows can be indexed numerically. You can do that with
data_frame[start_index:end_index]. Try printing out a few rows of data. A specific row of data can be printed out with
data_frame.iloc[index]. Try printing out a single row.
In your program: Print out the top ten rows.
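As a quick check of how row indexing behaves, here is a sketch on a four-row toy table with made-up values. The names middle_rows and single_row are illustrative.

```python
import pandas as pd

# Invented rows, used only to show the indexing behavior.
data_frame = pd.DataFrame({
    "Country": ["Atlantis", "Borduria", "Cascadia", "Dinotopia"],
    "Happiness Rank": [1, 2, 3, 4],
})

middle_rows = data_frame[1:3]    # rows at positions 1 and 2; the end index is excluded
single_row = data_frame.iloc[0]  # one row, returned as a Series

print(middle_rows)
print(single_row)
```

As with Python lists, the slice stops just before the end index.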
Pandas provides a multitude of tools for analyzing data.
In the interactive terminal: When learning Google Sheets, we found sorting data to be very useful. Pandas offers similar sorting functionality with the sort_values method:
data_frame.sort_values(columns_to_sort_by). Here, the variable
columns_to_sort_by is a list of column names that indicate the order in which you would like to sort. For example, if you wanted to sort by "Happiness Score" you could type:
data_frame.sort_values(["Happiness Score"])
In the interactive terminal, try sorting in different ways.
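A few of the "different ways" might look like the sketch below. The table is a made-up stand-in with shortened region names, and ascending=False is an option of sort_values that this activity does not otherwise cover.

```python
import pandas as pd

# Invented rows and shortened region names, for illustration only.
data_frame = pd.DataFrame({
    "Country": ["Atlantis", "Borduria", "Cascadia", "Dinotopia"],
    "Region": ["West", "East", "West", "East"],
    "Happiness Score": [6.1, 7.5, 5.9, 6.8],
})

lowest_first = data_frame.sort_values(["Happiness Score"])
highest_first = data_frame.sort_values(["Happiness Score"], ascending=False)

# Sort by region first, then by score within each region.
by_region = data_frame.sort_values(["Region", "Happiness Score"])

print(highest_first)
print(by_region)
```

sort_values returns a new data frame and leaves the original order untouched, so store the result in a variable if you want to keep it.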
In the interactive terminal: Often, we want just the top or bottom items in a sorted list. Pandas provides two convenient methods for DataFrames, nlargest() and nsmallest(). Each method requires two parameters: the number of items to return and the columns to sort by. For example, you can find the 5 rows of data with the highest "Freedom" rating with
data_frame.nlargest(5, "Freedom")
In your program: Write a function with a single input parameter for the data frame that returns just the names of the 10 countries which have the highest happiness score.
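To see nlargest() and nsmallest() in action before writing your function, here is a sketch on a toy table. The values are made up, and it deliberately uses the Freedom column and two rows rather than solving the task above.

```python
import pandas as pd

# Invented rows, for illustration only.
data_frame = pd.DataFrame({
    "Country": ["Atlantis", "Borduria", "Cascadia", "Dinotopia"],
    "Freedom": [0.6, 0.4, 0.7, 0.5],
})

most_free = data_frame.nlargest(2, "Freedom")    # highest values first
least_free = data_frame.nsmallest(2, "Freedom")  # lowest values first

# Pull out just one column of the result as a plain list.
names = most_free["Country"].tolist()
print(names)
```

Both methods return whole rows, so selecting a column afterwards is how you narrow the result down to names.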
In the interactive terminal: You can filter a DataFrame very easily using Pandas. The following code provides the row corresponding to the statistics for the United States.
data_frame[data_frame["Country"] == "United States"]
This line of code finds the rows where the "Country" column equals "United States". It returns a data frame that is a subset of the original data frame.
In the interactive terminal: You can also link queries together using the & and | symbols. Create a DataFrame that includes all the European countries by combining the rows of Western Europe with those of Central and Eastern Europe:
data_frame[(data_frame["Region"] == "Central and Eastern Europe") | (data_frame["Region"] == "Western Europe")]
Also, create a DataFrame for the countries of North America with a Freedom index greater than 0.5 by writing
data_frame[(data_frame["Region"] == "North America") & (data_frame["Freedom"] > 0.5)]
Importantly, with these chained expressions you must wrap each condition in parentheses (). The & and | operators bind more tightly than comparisons such as == and >, so your program will throw an error otherwise.
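The two queries above can be checked on a small stand-in table. The rows are made up; is_europe is an illustrative name for a mask stored in its own variable, which can make long conditions easier to read.

```python
import pandas as pd

# Invented rows, for illustration only.
data_frame = pd.DataFrame({
    "Country": ["Atlantis", "Borduria", "Cascadia", "Dinotopia"],
    "Region": ["Western Europe", "Central and Eastern Europe",
               "North America", "North America"],
    "Freedom": [0.6, 0.4, 0.7, 0.3],
})

# | means "or": keep a row if either condition is True.
is_europe = ((data_frame["Region"] == "Central and Eastern Europe")
             | (data_frame["Region"] == "Western Europe"))
europe = data_frame[is_europe]

# & means "and": keep a row only if both conditions are True.
free_north_america = data_frame[(data_frame["Region"] == "North America")
                                & (data_frame["Freedom"] > 0.5)]

print(europe)
print(free_north_america)
```

Each condition sits in its own pair of parentheses before being combined.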
In your program: Write a function called multi_query that takes in 3 parameters: the data frame, a region, and a happiness score. It should return a DataFrame that contains just the rows of data corresponding to that region that have a happiness score above the provided value.
In your program: In the homework you will create your own rating method based on the different features in the data set. Let's create a new rating which is the sum of family and freedom.
- Select the family column from the data frame
- Select the freedom column from the data frame
- Create a variable called new_rating that contains the sum of these two columns
- You can add this new column to the data frame as follows
data_frame['New Rating'] = new_rating
If you print out the data frame, you should see a new column called New Rating with your calculated scores.
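Put together, the steps for the new rating might look like the following sketch. It uses a two-row table with made-up values and assumes the columns are named "Family" and "Freedom"; check data_frame.columns for the exact names in your file.

```python
import pandas as pd

# Invented rows; the column names "Family" and "Freedom" are assumed.
data_frame = pd.DataFrame({
    "Country": ["Atlantis", "Borduria"],
    "Family": [1.0, 0.5],
    "Freedom": [0.5, 0.25],
})

family = data_frame["Family"]
freedom = data_frame["Freedom"]

# Adding two columns adds them row by row.
new_rating = family + freedom
data_frame["New Rating"] = new_rating

print(data_frame)
```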
If you have extra time
Feel free to explore more about Pandas and NumPy. We have barely scratched the surface. You can find more information here:
Once you're done, please check off your lab with a TA or share your file with firstname.lastname@example.org by midnight, 4/11.