HW 1: Python Refresher

Due: September 14, 2021 at 9:00PM EDT.

The goal of this assignment is to refresh your memory of how to write Python programs, to get practice organizing data for particular computations, and to get practice working with CSV files.

A Grading Note

During the first lecture, Tim asked students to write down what they needed and what they were worried about from cs112. Many students said that they needed a Python review. For this reason, this assignment will be graded for completion, but we will still give detailed feedback the way we usually do. In other words, you will receive a score, but as long as you finish the assignment that score will be 100%, and you will still get feedback on what went wrong and how to improve.

This assignment is meant as a Python review, and we want to make sure that when you work through it, your thought process is "how can I use Python to code this" as opposed to "oh no if I can't use Python for this I will be unsuccessful." By making this grading completion-based, we hope to decrease the stress you may have if you struggle with this assignment.

No matter how much Python you remember at the start of this assignment, you can succeed in the course and we will help you learn and review Python.

Python Review Session

There will be a Python Review Session at 10am EDT on Sunday, September 12 over Zoom. You can find the Zoom link on the course website's hours page.

The Assignment

Some professors at Brown got curious about how many courses students take from different departments, and have asked you to analyze data from a CSV file to help them decide how to change course offerings for future semesters.

There is a CSV (comma-separated-value) file containing 3 columns: 1. Brown student names, 2. different academic departments at Brown, and 3. the number of courses that these students took in those particular academic departments. So, an example might look like this:

Ashley,ANTH,4

This means that Ashley has taken four courses in the Anthropology Department.

In courses.csv, every row is a student’s name, then an academic department, then the number of courses the student took in that department.

Here’s an example of what that file could look like (the real file is much longer):

Ashley,MATH,4
Ben,MATH,8
Ben,EAST,1
David,BIOL,5

Notice that any student could have taken courses in multiple departments, and any given department could have multiple students taking courses in it. In addition, there can be two entries with the same student and the same department but different counts; in that case, the counts accumulate for that student and department. For example, if Ashley,ANTH,4 appears early in the file and Ashley,ANTH,12 appears later, Ashley has taken a total of 16 courses in the Anthropology department.
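The accumulation rule above can be sketched as follows. This is only one possible way to organize the data, not the required one: it assumes a nested dictionary mapping each student to a dictionary of department totals, and the helper name accumulate and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows; note the repeated (Ashley, ANTH) pair.
SAMPLE = "Ashley,ANTH,4\nBen,MATH,8\nAshley,ANTH,12\n"

def accumulate(rows):
    """Sum course counts per (student, department) pair."""
    totals = {}
    for name, dept, count in rows:
        per_student = totals.setdefault(name, {})
        # Add to the running total instead of overwriting it.
        per_student[dept] = per_student.get(dept, 0) + int(count)
    return totals

totals = accumulate(csv.reader(io.StringIO(SAMPLE)))
print(totals["Ashley"]["ANTH"])  # 16
```

Overwriting the entry instead of adding to it would report 12 for Ashley here, which is the mistake this rule is meant to rule out.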

Your goal is to write functions that answer the following questions (and in parentheses, answers for the example table given above):

What is the department with the most courses taken, by the total number of courses taken by all students? (MATH)
Which student has the widest-ranging course set (i.e. has taken courses from the largest number of departments)? (Ben) (A department should not be counted in terms of diversity if the total course count is 0)
How many total courses did each student take, on average? (6, since (4 + 9 + 5) / 3 = 6)
Which departments have had only one student take courses in them? (EAST, BIOL)

Note that the answers will be different on the real CSV file!
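To make the arithmetic behind the example average concrete, here is a small check using the four example rows written out by hand. The variable names are illustrative, and the list-of-tuples layout is used only for this check; it is not a suggestion for how to structure your data.

```python
# The four example rows, written out by hand.
rows = [
    ("Ashley", "MATH", 4),
    ("Ben", "MATH", 8),
    ("Ben", "EAST", 1),
    ("David", "BIOL", 5),
]

# Total courses per student, across all departments.
per_student = {}
for name, _dept, count in rows:
    per_student[name] = per_student.get(name, 0) + count

average = sum(per_student.values()) / len(per_student)
print(per_student)  # {'Ashley': 4, 'Ben': 9, 'David': 5}
print(average)      # 6.0
```

Note that the average is taken over students (3), not over rows (4).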

You’ll also need to write a function to take the CSV data and load it into a structure which is helpful for answering these questions.

Details

The courses.csv file can be found here.

You’ll write your implementation code in a file named courses.py. Here’s some code you can copy in to get started:

import csv

def load_data(filename: str):
    """load a CSV file into a useful format for computation"""
    data = {}
    # the with block closes the file once we are done reading it
    with open(filename, encoding="utf-8", newline="") as file:
        reader = csv.reader(file)
        for row in reader:
            # TODO: do something useful with each row
            # for instance, add or modify an entry in a dictionary
            # row is a list of data--for instance,
            # on the first row in the example data, it’s:
            # ["Ashley", "MATH", "4"]
            pass
    return data

# TODO: implement this function
def most_taken(data) -> str:
    """
    return the department with the highest total
    number of courses taken by all students
    """
    pass

# TODO: implement this function
def widest_ranging(data) -> str:
    """
    return the name of the student who took the most diverse set of courses from different departments
    """
    pass

# TODO: implement this function
def average_courses(data) -> float:
    """return the average total number of courses taken per student"""
    pass

# TODO: implement this function
def only_once(data) -> list:
    """return all of the departments in which exactly one student took courses"""
    pass

# This code allows the program to be run as a script.
# You shouldn't need to modify it!
if __name__ == '__main__':
    import sys
    # the first argument to the script is the filename
    filename = sys.argv[1]
    # the second argument to the script is the name of the function
    function_name = sys.argv[2]
    data = load_data(filename)
    if function_name == "most-taken":
        print(most_taken(data))
    elif function_name == "widest-ranging":
        print(widest_ranging(data))
    elif function_name == "average-courses":
        print(average_courses(data))
    elif function_name == "only-once":
        print(only_once(data))
    else:
        print("Unknown function name")

The first thing you should do is to decide how you want to structure your data. There are multiple approaches that will work fine, and some that are less good – think about how you will want to access the data in order to write each function! Then, complete the load_data function, which takes in the name of a CSV file and should return your structured data. Something that might be useful: you can convert a str to an int with the int function; for instance, int("4") == 4.
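As a small illustration of why the conversion matters: csv.reader yields every field as a string, so counts have to be converted before you add them. The sample line below is made up.

```python
import csv

# csv.reader accepts any iterable of lines, so a list works for a quick check.
row = next(csv.reader(["Ashley,MATH,4"]))
print(row)              # ['Ashley', 'MATH', '4']
print(row[2] + row[2])  # 44  (string concatenation, not addition)
print(int(row[2]) * 2)  # 8
```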

For your other functions, you don’t need to worry about CSV files–each one takes in your structured data. You should complete each function as specified:

most_taken should return the name of the department with the highest total number of courses taken by all students.
widest_ranging should return the name of the student who has taken courses from the maximum number of departments.
average_courses should return the average total number of courses taken per student.
only_once should return a list of departments in which only one student took a course.

You may find that once you start writing your analysis functions you want to change the way your data are structured – this is expected! Keep in mind, though, that every analysis function needs to use the same structure.

The bottom of the provided code (starting with if __name__ == '__main__') lets you run your code as a script from the terminal. It calls your load_data function to read in CSV data, then calls one of your other functions and prints the result. So in order to get the average number of courses taken per student in the courses.csv file, you can run the following command in the terminal:

python3 courses.py courses.csv average-courses

If you’re not sure what this means, don’t worry – it will be covered in the first lab!

Testing

You should write tests for your functions in courses_test.py. You should include tests for all of your analysis functions; you don’t need to write tests for load_data.

With Pytest, tests are written as Python functions in a testing file (in this case, courses_test.py). For example, to test your only_once function you’d write something like:

from courses import only_once

def test_only_once():
    assert only_once(...) == ["EAST", "BIOL"]
    assert only_once(...) == []

Remember: each analysis function takes as its argument whatever data structure you’ve decided to use to represent the courses data. If you pass in a CSV string or something similar, they will not work!
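To show what passing in your data structure could look like, here is a sketch of a test that builds the input by hand. It assumes, purely for illustration, that the data is a dictionary mapping each student to a dictionary of department counts; your structure may differ. The only_once defined here is a stand-in so the snippet runs on its own; in courses_test.py you would import your own function from courses.py instead.

```python
# Stand-in only; in courses_test.py use: from courses import only_once
def only_once(data):
    """Assumes data maps student -> {department: count} (hypothetical structure)."""
    students_per_dept = {}
    for depts in data.values():
        for dept in depts:
            students_per_dept[dept] = students_per_dept.get(dept, 0) + 1
    return [dept for dept, n in students_per_dept.items() if n == 1]

def test_only_once():
    # The example table from this handout, built by hand.
    example = {
        "Ashley": {"MATH": 4},
        "Ben": {"MATH": 8, "EAST": 1},
        "David": {"BIOL": 5},
    }
    # sorted() keeps the test independent of the order the function returns.
    assert sorted(only_once(example)) == ["BIOL", "EAST"]
    assert only_once({}) == []
```

Comparing sorted lists is a design choice: the handout does not say what order the departments should come back in, so the test avoids depending on it.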

Note: your load_data function should work with any CSV file containing students, departments, and course counts, not just the courses.csv file. See this Guide for help with getting VSCode to work.

Code style

Please follow these Python testing and clarity guidelines.

Readme

You should include a README.txt with answers to the following questions:

How did you structure your data? How did you decide on this structure?
Did you end up needing to change the structure once you started writing your analysis functions?
Would any of the functions have been easier to write if you had chosen a different structure?
Did you discuss this assignment with any other students? Please list their cs logins.
How many late days are you using on this assignment?

The README template can be found here.

Handin

Hand in your work on Gradescope.

You may submit as many times as you want. Only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day – so do NOT submit after the deadline unless you plan on using late days. When you submit, the autograder will automatically check your code and display results for each part of our test suite. If it gives you a score of 0.0/1.0 for a section, that means you are failing at least one test in that section. If it gives you a score of 1.0/1.0 for a section, then you are passing all of our tests in that section. Note that these point values do not reflect your final score or the final rubric used to grade your code.

Please don't put your name anywhere in any of the handin files -- we grade assignments anonymously!

Don’t forget to follow the design and clarity guide!

After completing the homework, you will submit:
- README.txt
- courses.py
- courses_test.py