Activity 3-6

Webscraping

Beautiful Soup is a Python library for pulling data out of HTML and XML files. In this activity, we’ll extract the text of articles from the New York Times.

Task 1:

The Beautiful Soup library is already included in your Anaconda Python installation. We’ll begin by trying out Beautiful Soup in the interactive Python interpreter. To access the library, open the interpreter and run from bs4 import BeautifulSoup

Also run import requests. This library lets you download web pages over HTTP.
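Together, the two statements you’ll type in the interpreter are:

	from bs4 import BeautifulSoup
	import requests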

Task 2:

For this activity, we’ll scrape articles from the New York Times. In Google Chrome, open any NYT article. (Navigation is different for other web browsers, so if you don’t have Chrome, you may want to ask the TAs for help).

Right click anywhere on the page and navigate to “View Page Source.” This is the webpage’s HTML source, which structures the layout of text, images, links, etc. Scroll through the HTML to get a general overview.

We want to extract the title and story of the article. They are denoted by the tags <title> and <p class="story-body-text story-content">, respectively. Find these tags in the HTML.

Task 3:

Let’s try extracting the article’s title in the Python interpreter. Follow the steps below:

  1. Specify your article’s webpage with url = "". For example, url = "https://www.nytimes.com/2017/04/16/us/politics/north-korea-missile-crisis-slow-motion.html"
  2. Create a web request to access the url: r = requests.get(url)
  3. Check that the URL request was successful by calling r.status_code. The code 200 indicates that the request was successful. If you obtain another code, then there may be an error in your URL or program.
  4. Parse the URL’s content into Beautiful Soup HTML format: soup = BeautifulSoup(r.content,"html.parser") Then print soup to see the results. Note: if you had an XML document, you could parse it with Beautiful Soup’s “xml” parser instead (this requires the lxml package).
  5. Locate the title tag: title = soup.find('title'). Print title to see the results. Then, to print the title's text without its tags, use title.get_text()
  6. Once you have this working in the Python interpreter, transfer the code over to a new Python program, adding it to a function named get_article_content() that takes in url as a parameter (a sketch appears below).
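At this point, the function from steps 1–5 might look like the sketch below. This is just one possible arrangement; the status-code handling (printing a message and returning early) is an assumption about how you might report errors.

	import requests
	from bs4 import BeautifulSoup

	def get_article_content(url):
		# Request the page and confirm the request succeeded
		r = requests.get(url)
		if r.status_code != 200:
			print("Request failed with status code", r.status_code)
			return
		# Parse the HTML and extract the <title> tag's text
		soup = BeautifulSoup(r.content, "html.parser")
		title = soup.find('title')
		print(title.get_text())

Later tasks will extend this function to extract the story content and write it to a file.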

Task 4:

Now we’ll extract the article’s story content. You can do this just in your Python program (not the interpreter).

Call paragraphs = soup.find_all('p', {'class':'story-body-text story-content'}). This locates all of the text encapsulated within <p class="story-body-text story-content"> tags.

When you print out paragraphs, you’ll see that both the text and the tags are returned. paragraphs also behaves like a list, where each element is one paragraph of the story.

To extract just the text (no tags), iterate through each paragraph in paragraphs and call paragraph.get_text(). Join together all the paragraphs into a single string, separated by the newline character '\n'.
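Inside get_article_content(), that step might look like the sketch below (story_text is just an illustrative variable name, not a required one):

	paragraphs = soup.find_all('p', {'class':'story-body-text story-content'})
	# Pull the plain text out of each <p> tag, then join with newlines
	story_text = '\n'.join(paragraph.get_text() for paragraph in paragraphs)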

Task 5:

Now that you’ve extracted the title and story content, write them to a text file. We want the filename to match the webpage name found in the URL. For this, we can use a regex to get every character of the URL after the last forward slash. This part of the URL may contain the extension “.html”, which we’ll want to replace with the extension “.txt”.


	match = re.search(r"[^\s/]+$", url).group(0)  # everything after the last slash (requires import re)
	match = match.replace(".html", "")
	filename = match + ".txt"
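With the filename in hand, writing the results might look like this, assuming you stored the title’s text and the joined story in variables such as title_text and story_text (illustrative names from the earlier tasks):

	# title_text and story_text are the strings built in Tasks 3 and 4
	with open(filename, 'w') as f:
		f.write(title_text + '\n')
		f.write(story_text)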

Task 6:

Because the HTML format of most NYT articles is consistent, you can use this code to scrape other NYT articles. Try running your program on a couple more articles. All you need to do is change url!

Task 7:

In the next phase of the activity, you will create a web crawler that will find the contents of all of the articles on the front page of the New York Times.

To do this, you will need to create a function called get_links() that will return a list of URL links to the articles that are on the front page.

You will first need to create a new Response object with r = requests.get("https://www.nytimes.com/"). Then create a new BeautifulSoup() object: soup = BeautifulSoup(r.content,"html.parser")

Looking at "View Page Source," you'll see that the titles and links for the articles are in <h2> tags with the class "story-heading."

Use BeautifulSoup to find all of these instances: titles = soup.find_all('h2', {'class':'story-heading'}). This will return a list of all of the titles.

Task 8:

Iterate through the list titles and print out each element. One element should look something like this:


<h2 class="story-heading">
<a href="https://www.nytimes.com/2017/04/20/automobiles/autoreviews/jaguar-xe-review.html">
            Driven: Video Review: Jaguar Figures Out the Compact Sport Sedan        </a>
</h2>
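For reference, the loop that produces this output is simply:

	for title in titles:
		print(title)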

We need to extract the link from each element. To do this, call .find('a')['href'] on each element.

Do this and try running the program. You will get an error: TypeError: 'NoneType' object is not subscriptable. This is because not every element in titles contains a link.

Task 9:

To handle this, as you iterate over each title in titles, first check whether the element includes a link, denoted by an <a> tag. If there is a link, then extract the URL from the tag's href attribute. You can use the code below:


link = title.find('a')  # find() returns None when there is no <a> tag
if link is not None:
	link_url = link['href']

Add all of these URLs to a list. Return the list at the end of the function.
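Putting Tasks 7–9 together, get_links() might look like the sketch below (it assumes the imports from Task 1 are at the top of your program):

	def get_links():
		# Fetch and parse the NYT front page
		r = requests.get("https://www.nytimes.com/")
		soup = BeautifulSoup(r.content, "html.parser")
		# Each story heading is an <h2 class="story-heading">
		titles = soup.find_all('h2', {'class':'story-heading'})
		links = []
		for title in titles:
			link = title.find('a')  # None when the heading has no link
			if link is not None:
				links.append(link['href'])
		return links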

Task 10:

In main(), call the function get_links() to return the list of links on the front page of the NYT. Iterate through this list and call get_article_content() to write the title and content of each article to its own text file. Because of the large number of files, you may want to make a new directory within your cs0030_workspace and write your files to that directory.
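One way main() might tie everything together is sketched below. It assumes get_article_content() now writes its own output file (Task 5), and the directory name articles is just an example:

	import os

	def main():
		links = get_links()
		# Write all of the text files into their own directory (optional)
		os.makedirs("articles", exist_ok=True)
		os.chdir("articles")
		for url in links:
			get_article_content(url)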

If you have extra time ...

  • Try scraping information from some of your favorite websites
  • Check out this tutorial on parsing XML data into a pandas data frame. You could try this on the Congressional data that we analyzed at the beginning of the semester.

Once you're done, please check off your lab with a TA or share your file with cs0030handin@gmail.com by midnight, 4/27.