Lab 5: Web Scraping

Overview:

For this week's lab, we'll be scraping the Brown academic calendar for event names and dates!

Part 1: Getting Started

To start this lab, go to the academic calendar (https://www.brown.edu/about/administration/registrar/academic-calendar). In your browser, pull up the web inspector: right-click the page and select Inspect or Inspect Element (we recommend that you use Firefox or Google Chrome). Take a moment to look through the tags and notice where the event names and dates are located.

Also, please run pip3 install bs4 requests to install the libraries needed to scrape the web (the starter code below imports both).

Part 2: Scraping the Site

Create a file called lab5.py and complete the following tasks:

  • Copy down the following code:
from bs4 import BeautifulSoup
import requests

calendar_url = "https://www.brown.edu/about/administration/registrar/academic-calendar"
calendar_page = BeautifulSoup(
    requests.get(calendar_url).content, features="html.parser"
)


def scrape_events(page: BeautifulSoup) -> dict:
    # TODO: Write function that creates a dict (key is date and value is event name)
    pass  # placeholder so the file is valid Python; replace with your implementation
  • Write scrape_events. We expect this function to return a dictionary that maps each date to its event name, built from the tags you identified in the inspector.
  • Run scrape_events from your terminal, passing calendar_page (the BeautifulSoup object containing the academic calendar page) as the argument.
  • Verify that some of the dates and event names in the dictionary match what you saw in the inspector.
  • Notify your TA that you have completed the lab.
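
As a rough sketch of the pattern scrape_events might follow, the example below builds a date-to-event dictionary from a small hard-coded HTML snippet. The markup, the class names "date" and "event", the sample values, and the helper name scrape_sample are all invented for illustration. The real calendar page is structured differently, so use the inspector to find the tags that actually apply.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; it is NOT the structure
# of the real academic calendar page.
SAMPLE_HTML = """
<table>
  <tr><td class="date">Jan 1</td><td class="event">Example event one</td></tr>
  <tr><td class="date">Jan 2</td><td class="event">Example event two</td></tr>
  <tr><td>row without the expected cells</td></tr>
</table>
"""


def scrape_sample(page: BeautifulSoup) -> dict:
    """Map each date string to its event name in the sample table."""
    events = {}
    for row in page.find_all("tr"):
        date_cell = row.find("td", class_="date")
        event_cell = row.find("td", class_="event")
        # Skip rows that are missing either cell instead of crashing.
        if date_cell is None or event_cell is None:
            continue
        events[date_cell.text.strip()] = event_cell.text.strip()
    return events


sample_page = BeautifulSoup(SAMPLE_HTML, features="html.parser")
sample_events = scrape_sample(sample_page)

# Print each pair so it can be compared against what the inspector shows.
for date, name in sample_events.items():
    print(f"{date}: {name}")
```

Checking for None before reading .text matters because find returns None when nothing matches, and real pages often contain header or spacer rows that lack the cells you expect.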

A few tips as you are scraping:

  • BeautifulSoup's find and find_all methods can take multiple arguments. If you pass a keyword argument along with the tag name, they will look only for HTML nodes that have both the specified tag and the specified id or class. Because class is a reserved word in Python, the keyword for classes is spelled class_. For example, find("h1", class_="mac") will find the first node with tag 'h1' and class 'mac', while find_all("div", id="cheese") will find all nodes with tag 'div' and id 'cheese'.
  • BeautifulSoup's find and find_all search all levels of your page. For example, node.find("div", class_="pretzel") searches not only node's children but also its grandchildren, their children, and so on through every level of the HTML tree, until it finds a node with tag "div" and class "pretzel". If no such node exists, find returns None.
  • BeautifulSoup objects, and the nodes returned by find and find_all, have a .text attribute, which you can use to get their text rather than all of their HTML.
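
The tips above can be tried out on a small snippet before touching the real page. The sketch below uses made-up HTML that borrows the 'mac', 'cheese', and 'pretzel' names from the examples; none of it comes from the academic calendar. Note that in Python the class filter is written class_, since class is a reserved word.

```python
from bs4 import BeautifulSoup

# Made-up HTML for practice; the names mirror the tips, not the real page.
html = """
<div id="cheese">
  <section>
    <h1 class="mac">Title</h1>
    <div class="pretzel"><span>nested</span> text</div>
  </section>
</div>
<div class="pretzel">second</div>
"""
soup = BeautifulSoup(html, features="html.parser")

# Tag name plus class: the first <h1> whose class is "mac".
heading = soup.find("h1", class_="mac")

# Tag name plus id: every <div> whose id is "cheese".
cheese_divs = soup.find_all("div", id="cheese")

# Deep search: the pretzel div is a grandchild of the cheese div
# (cheese > section > pretzel), and find still reaches it.
outer = soup.find("div", id="cheese")
deep = outer.find("div", class_="pretzel")

# find_all on the whole page returns both pretzel divs.
all_pretzels = soup.find_all("div", class_="pretzel")

# .text gives only the text; str() gives the full HTML of the node.
heading_text = heading.text
deep_text = deep.text
deep_html = str(deep)

print(heading_text)
print(deep_text)
print(deep_html)
```

Searching from outer rather than soup narrows the search to that subtree, which is a handy way to avoid matching similarly named nodes elsewhere on a page.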

Part 3: You're all set!

That's all for lab this week! Be sure to ask any questions you have, since we will follow this same web scraping format in Project 3, and it may be helpful in the final project as well.