Due Dates

  • Fill out the Project 3 Partner Form by Nov. 9th at 9pm EST
  • Design Checks Nov. 11th – Nov. 13th
  • Design Check Response due Nov. 13th at 9pm EST
  • Project Code due Nov. 20th at 9pm EST

Summary

For this project, we’re asking you to scrape some data from Craigslist and write a few queries over your scraped data. Specifically, you’re going to examine rental housing listings for a number of major US cities, plus Providence.

Learning goals

This project will give you experience:

  • Designing data structures
  • Scraping web data

Project tasks

  1. Find a partner: Find a partner and fill out the Project 3 Partner Form by Nov. 9th at 9pm EST. If you would like to keep the same partner as in Project 1 or Project 2, that is fine! You may also select a new partner if you would like. After you submit the form, you will be paired with a TA, who will send you an email to schedule a Design Check. If you’re having trouble finding a partner, please post on Piazza under the post titled “Find Project 3 Partner Thread”!
  2. Read through this document in its entirety!
  3. Setup: Create a new PyCharm project and install some libraries:
    • pip install pytest
    • pip install requests
    • pip install bs4
    Then, copy the starting point code below into a file called scraper.py.
  4. Design Check: Complete the Design Check questions, due by Nov. 13th at 9pm EST.
  5. Implement your web scraper!

When writing code, please follow the testing and style guide.

Implementation starting point

Put this code in a file called scraper.py.

import requests
from bs4 import BeautifulSoup

CITIES = [
    "providence",
    "atlanta",
    "austin",
    "boston",
    "chicago",
    "dallas",
    "denver",
    "detroit",
    "houston",
    "lasvegas",
    "losangeles",
    "miami",
    "minneapolis",
    "newyork",
    "philadelphia",
    "phoenix",
    "portland",
    "raleigh",
    "sacramento",
    "sandiego",
    "seattle",
    "washingtondc",
]


class NoCityError(Exception):
    pass


def craigslist_get_city(city_name) -> BeautifulSoup:
    """gets a BeautifulSoup object for a given city from Craigslist"""
    url_template = "https://{}.craigslist.org/search/apa"
    try:
        resp = requests.get(url_template.format(city_name))
        resp.raise_for_status()
        return BeautifulSoup(resp.content, "html.parser")
    except requests.exceptions.RequestException:
        raise NoCityError("No city named {} found".format(city_name))


def local_get_city(city_name) -> BeautifulSoup:
    """gets a BeautifulSoup object for a given city from the local filesystem"""
    file_template = "localdata/{}.html"
    try:
        with open(file_template.format(city_name), "r") as f:
            return BeautifulSoup(f.read(), "html.parser")
    except OSError:
        raise NoCityError("No city named {} found".format(city_name))


def scrape_data(city_pages: dict):
    """Scrapes data from a collection of pages.
    The keys of city_pages are city names. The values are BeautifulSoup objects."""
    pass


def scrape_craigslist_data():
    """Scrape data from Craigslist"""
    return scrape_data({city: craigslist_get_city(city) for city in CITIES})


def summarize_local_data():
    """Scrape data from the local filesystem"""
    return scrape_data({city: local_get_city(city) for city in CITIES})


def interesting_word(word: str) -> bool:
    """Determines whether a word in a listing is interesting"""
    return word.isalpha() and word not in [
        "to",
        "at",
        "your",
        "you",
        "and",
        "for",
        "in",
        "the",
        "with",
        "bedroom",
        "bed",
        "bath",
        "unit",
    ]

Implementation tasks

Like HW1, your implementation will have two parts: the scraper and the query functions.

The Scraper

Your scraper will be a function called scrape_data. It takes in a dictionary where the keys are city names from the CITIES list and the values are BeautifulSoup objects. Your scraper should use BeautifulSoup methods to transform the data into a format that you can query in your query functions.

You’ll need to scrape the following data from each rental listing:

  • Price
  • Number of bedrooms
  • Description (i.e., the text of the link to that listing)

Some listings may be missing the number of bedrooms. You should skip these listings (i.e., you should not include them in your data).
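
To make these requirements concrete, here is one possible shape for the scraped data. This is a sketch only (the structure and values are illustrative, not required), and deciding on your own representation is part of the Design Check.

# Illustrative only: scrape_data could return a dictionary mapping each city
# name to a list of listings, where each listing records the three required
# fields. Your representation may differ.
example_data = {
    "providence": [
        {"price": 1250, "bedrooms": 2, "description": "sunny east side apartment"},
        {"price": 950, "bedrooms": 1, "description": "cozy downtown one bed"},
    ],
    "boston": [
        {"price": 2400, "bedrooms": 3, "description": "spacious unit near the T"},
    ],
}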

Implementation hints

Some string methods may be useful; our implementation uses .split, .strip, .replace, and .endswith.
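
For instance, supposing (purely for illustration) that a listing’s price appears in the page as the text "$1,250", you might clean it up like this:

price_text = "$1,250"
price = int(price_text.strip().replace("$", "").replace(",", ""))  # 1250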

In class, we saw the find_all method called with one argument, a tag name. Another form may be useful for this assignment: you can call soup.find_all("table", "cls") to find table tags with the class "cls". Classes are a way to indicate structure in HTML, and browsers use them to determine how to display elements. The tag

<div class="assignments red">

has both "assignments" and "red" as classes; we could find it with soup.find_all("div", "assignments") or soup.find_all("div", "red").
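
As a small self-contained illustration of this form of find_all (the HTML here is made up, not Craigslist’s actual markup):

from bs4 import BeautifulSoup

html = '<div class="assignments red">Project 3</div><div class="red">Other</div>'
soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", "assignments"):
    print(div.text)  # prints "Project 3" only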

Query functions

You will write several queries over your data:

  • A function to find the average number of bedrooms across all of the cities. The function should take in the data and return a float representing the average bedrooms.
  • A function to find the city with the highest average price for a given number of bedrooms. The function should take in your structured data and a number of bedrooms and return a city name.
  • A function to find the most commonly occurring “interesting” word in the listings for a given city (see the sketch after this list). The function should use the interesting_word function in the starter code to determine which words are interesting, and should count upper- and lower-case versions of a word as the same. It should take in the data and a city name and return a single word.
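
Here is a minimal sketch of the case-insensitive counting for the third query, assuming (hypothetically) that you can produce a list of description strings for a city; the function name and input shape below are illustrative, not required, and interesting_word is the starter-code function:

from collections import Counter

def most_common_interesting_word(descriptions: list) -> str:
    """Illustrative sketch: descriptions is a hypothetical list of
    listing-description strings for a single city."""
    counts = Counter(
        word.lower()
        for description in descriptions
        for word in description.split()
        if interesting_word(word.lower())
    )
    return counts.most_common(1)[0][0]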

Testing and clarity

You should write good tests for all of your query functions in a file called test_scraper.py.
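
For example, a test might build a small hand-made data value and check a query against it. The data shape and function name here are illustrative assumptions, not requirements; adapt them to your own design:

# In test_scraper.py
from scraper import average_bedrooms  # hypothetical query-function name

def test_average_bedrooms():
    data = {
        "providence": [(1200, 2, "sunny two bed"), (950, 1, "cozy one bed")],
        "boston": [(2100, 3, "spacious place with parking")],
    }
    assert average_bedrooms(data) == 2.0  # (2 + 1 + 3) / 3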

You don’t need to write automated tests for your scraping function. You should convince yourself that it works on Craigslist data by calling scrape_craigslist_data.

Please follow the design and clarity guide; part of your grade will be for code style and clarity.

Local data

In order to be able to test your project against consistent data, we’ve saved a copy of the Craigslist listings for every city; you can download these data here. If you want to run your code without internet access, or on consistent data, save the contents of that download to a directory called localdata in your PyCharm project. You can then run summarize_local_data to run your scraping function against the locally-saved files.
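
For example, once the files are in place, one quick way to confirm that they load (run from your project root) might be:

soup = local_get_city("providence")  # reads localdata/providence.html
print(soup.title)  # prints the page's <title> tag, if present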

Design check

Before your design check meeting, you should read through this document and examine a Craigslist Rental Listings Page in a web browser using the Web Inspector (see Doug’s lecture from Nov. 6 for more on how to do this). Then, you should answer the following questions:

  • What are some things you notice about the way the data you need to scrape are structured on the page? How can you use this structure to scrape your data?
  • List some subtasks you will need to solve in order to scrape the data.
  • How will you structure your data internally in order to implement your query functions?
  • What are the names and signatures of your three query functions?

Handin

You may submit as many times as you want; only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day, so do NOT submit after the deadline unless you plan on using late days.

The README template can be found here.

In addition to the questions in the template, in your README, please tell us how you tested your scraper function.

After completing the homework, you will submit:

  • README.txt
  • scraper.py
  • test_scraper.py

Because all we need are the files, you do not need to submit the whole project folder. As long as the files you create have the same names as those listed above, you’re good to go! Only one partner should submit the project on Gradescope. Make sure to add your partner as a Group Member on your Gradescope submission so that you both can see the submission. Please DO NOT write your partner’s name in the README when listing other collaborators’ cslogins.

If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late and you do NOT have any more late days, no credit will be given.

Please don’t put your name anywhere in any of the handin files; we grade assignments anonymously!

You can follow the step-by-step guide for submitting assignments through Gradescope here.