Lab 5: Web Scraping
Overview:
For this week's lab we'll be web scraping the Brown academic calendar for information about event names and dates!
Part 1: Getting Started
To start this lab, go to the academic calendar (https://www.brown.edu/about/administration/registrar/academic-calendar). In your browser, open the web inspector: right-click the page and select "Inspect" or "Inspect Element" (we recommend that you use Firefox or Google Chrome). Take a moment to look through the tags and note where the event names and dates are located.
Also, please run pip3 install bs4 to install the library needed to scrape the web. The starter code also imports requests; if it is not already installed, run pip3 install requests as well.
Part 2: Scraping the Site
Create a file called lab6.py and complete the following tasks:
- Copy down the following code:
from bs4 import BeautifulSoup
import requests

calendar_url = "https://www.brown.edu/about/administration/registrar/academic-calendar"

# Download the page and parse its HTML into a BeautifulSoup object
calendar_page = BeautifulSoup(
    requests.get(calendar_url).content, features="html.parser"
)

def scrape_events(page: BeautifulSoup) -> dict:
    # TODO: Write function that creates a dict (key is date and value is event name)
    pass  # placeholder so the file runs before you fill in the function
- Write scrape_events. We expect this function to return a dictionary that maps each date to its event name, built from the tags you identified in the inspector.
- Run scrape_events in your terminal, passing the BeautifulSoup object that contains the academic calendar page (calendar_page) as the argument.
- Verify that some of the dates and event names in the dictionary match what you saw in the inspector.
- Notify your TA that you have completed the lab.
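The general shape of a date-to-event dictionary scraper can be sketched on a toy page. This is only an illustration under stated assumptions: the HTML, the class names "date" and "event", the sample values, and the function name scrape_sample_events are all made up here and do not reflect the real calendar's markup, which you still need to work out from the inspector.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; the real calendar page uses different tags.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Event</th></tr>
  <tr><td class="date">Day 1</td><td class="event">Sample event A</td></tr>
  <tr><td class="date">Day 2</td><td class="event">Sample event B</td></tr>
</table>
"""


def scrape_sample_events(page: BeautifulSoup) -> dict:
    """Map each date string to its event name in the sample table."""
    events = {}
    for row in page.find_all("tr"):
        date_cell = row.find("td", class_="date")
        event_cell = row.find("td", class_="event")
        # Skip rows that lack either cell, such as the header row.
        if date_cell is None or event_cell is None:
            continue
        events[date_cell.text.strip()] = event_cell.text.strip()
    return events


sample_page = BeautifulSoup(SAMPLE_HTML, features="html.parser")
sample_events = scrape_sample_events(sample_page)
print(sample_events)
```

Spot-checking a few entries of the printed dictionary against the page source is the same kind of check the verification step asks for on the real calendar.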
A couple of tips as you are scraping:
- BeautifulSoup's find and find_all methods can take multiple arguments. If you give them a tag name plus an id or class keyword argument, they will look for HTML nodes that have not only the specified tag but also the specified id or class. Because class is a reserved word in Python, the keyword argument is spelled class_. For example, find("h1", class_="mac") will find the first node with tag 'h1' and class 'mac', while find_all("div", id="cheese") will find all nodes with tag 'div' and id 'cheese'.
- BeautifulSoup's find and find_all search every level of your page. If you say node.find("div", class_="pretzel"), it will search not only node's children but also node's grandchildren, and their children, and so on through all levels of the HTML tree, until it finds a node with tag "div" and class "pretzel".
- BeautifulSoup objects have a .text attribute, which you can use to get their text rather than all of their HTML.
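The tips above can be tried out on a small, made-up HTML snippet before you touch the real page. The snippet and variable names below are invented for this sketch. Note that the class filter is passed as class_, since class is a reserved word in Python, and that recursive=False is an optional argument that limits the search to direct children.

```python
from bs4 import BeautifulSoup

# Invented HTML that reuses the tag, class, and id names from the tips.
html = """
<div id="cheese">
  <h1 class="mac">Title</h1>
  <section>
    <div class="pretzel">nested <b>deep</b></div>
  </section>
</div>
"""
soup = BeautifulSoup(html, features="html.parser")

# Tag plus class: find returns the first match (or None if nothing matches).
heading = soup.find("h1", class_="mac")

# Tag plus id: find_all returns a list of every match.
cheese_divs = soup.find_all("div", id="cheese")
outer = cheese_divs[0]

# The search is deep by default: the pretzel div is a grandchild of outer.
pretzel = outer.find("div", class_="pretzel")

# With recursive=False only direct children are checked, so nothing is found.
direct_only = outer.find("div", class_="pretzel", recursive=False)

print(heading.text)      # Title
print(len(cheese_divs))  # 1
print(pretzel.text)      # nested deep
print(direct_only)       # None
```

Printing pretzel itself, rather than pretzel.text, shows the full HTML of the node including the inner b tag, which is the difference the last tip describes.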
Part 3: You're all set!
That's all for lab this week! Be sure to ask any questions you have, since Project 3 follows this same web scraping format and it may be helpful in the final project as well.