Activity 3-4

Accessing the Twitter API via Tweepy

In this activity, you’ll learn how to obtain and clean tweets with Tweepy, a Python library for accessing the Twitter API.

Task 1

The first step is to install the Tweepy library. At the command line, type pip install tweepy. The “pip” command is a tool for installing Python packages.


Task 2

In order to access the Twitter API, you need to set up a Twitter Application and obtain authentication credentials. Follow the steps below:

  1. Open https://apps.twitter.com
  2. Sign in with an existing Twitter account, or set up a new account.
  3. Click “Create New App”
  4. Provide a name and a description for your app (it can be anything you wish). For the website field, you can enter our course webpage: http://cs.brown.edu/courses/csci0030/. You can leave the “Callback URL” field blank.
  5. Agree to the Twitter Developer Agreement, and then click “Create your Twitter application”
  6. In the new page that opens, navigate to the tab “Keys and Access Tokens.”
  7. Click on “Create my access token” at the bottom of the page.
  8. The page should now list a Consumer Key, Consumer Secret, Access Token, and Access Token Secret. Open the Python stencil program here, and then copy these strings into their variable assignments.
  9. In the Python program, look over the initialize() function. This function uses your specific credentials to provide you access to the API.
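
The stencil's initialize() is not reproduced in this handout, so here is a minimal sketch of what such a function typically looks like, assuming Tweepy's OAuthHandler class. The parameter names are our own choice for illustration, and the stencil's version may be organized differently.

```python
def initialize(consumer_key, consumer_secret, access_token, access_token_secret):
    """Return an authenticated Tweepy API object (a sketch, not the stencil's exact code)."""
    # Imported inside the function so this file still loads before Task 1 is done.
    import tweepy

    # The consumer key and secret identify your Twitter application.
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # The access token and secret identify your Twitter account.
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth)
```

You would call it once with the four strings from the “Keys and Access Tokens” tab and keep the returned object as api.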

Task 3

Now we can use the API to return the most recent tweets for a certain topic.

In main(), you’ll see the line results = tweepy.Cursor(api.search, q='', lang='en').items(number_tweets_to_get)

This stencil code uses the API’s search function to find the most recent tweets that include the query term “q” and are written in English. The number of tweets is specified by the constant number_tweets_to_get. We’ve set number_tweets_to_get to have a default value of 50. You can change this value later (while being mindful of Twitter’s API access limits, which we’ll discuss at the end of the activity).

For example, to return the most recent tweets that include the term “basketball” and are in English, call: results = tweepy.Cursor(api.search, q='basketball', lang='en').items(number_tweets_to_get)

Also, in the Python stencil, notice the “for” loop that iterates through results and prints out each tweet. (Note: while our stencil code opens two files, don’t write to them just yet). Run your code now to see how your tweets get printed out.

In your printed results, you’ll see that each tweet includes not just its text, but also multiple additional fields. For example, you can see whether the tweet was a retweet (“retweeted=False” or “retweeted=True”) and the “screen_name” of the user who wrote the tweet.

To see just the text of each tweet, change print(tweet) to print(tweet.text)

Another note: sometimes a single tweet is split across several lines. To merge it into one line, replace the newline characters ('\n') with spaces: tweet.text = tweet.text.replace('\n', ' ')

If you would rather not print out retweets, use the following “if” statement:


if (not tweet.retweeted) and ('RT @' not in tweet.text):
	print(tweet.text)
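
To try the newline fix and the retweet filter without calling the API, here is a small sketch that uses stand-in tweet objects. The sample tweets and the helper name printable_texts are made up for illustration; real tweets come from tweepy.Cursor.

```python
from types import SimpleNamespace

# Stand-in tweets (made up); each has the two fields the filter looks at.
sample_tweets = [
    SimpleNamespace(text="Go Bears!\nWhat a game", retweeted=False),
    SimpleNamespace(text="RT @someone: Go Bears!", retweeted=False),
]

def printable_texts(tweets):
    """Return the one-line text of every tweet that is not a retweet."""
    texts = []
    for tweet in tweets:
        if (not tweet.retweeted) and ('RT @' not in tweet.text):
            # Merge a multi-line tweet into a single line.
            texts.append(tweet.text.replace('\n', ' '))
    return texts

for text in printable_texts(sample_tweets):
    print(text)
```

The second condition catches retweets whose retweeted field is False, as in the second stand-in tweet.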


Task 4

You can also return the most recent tweets from a specific user with the API’s user_timeline function.

In your stencil code, you should see the commented code results = tweepy.Cursor(api.user_timeline, screen_name='').items(number_tweets_to_get) The "screen_name" parameter specifies which user's tweets to return.

Let's return the most recent tweets by Ellen DeGeneres, posted on her Twitter account “TheEllenShow.” Replace the line above with results = tweepy.Cursor(api.user_timeline, screen_name='TheEllenShow').items(number_tweets_to_get) and then run your code to print out the results.

If you would rather not print out replies, add the following “if” statement:


if tweet.in_reply_to_status_id is None and tweet.in_reply_to_user_id is None:
	print(tweet.text)
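
The same check can be wrapped in a small helper. This is only a sketch: is_reply is a hypothetical name, and the two stand-in tweets (including the user ID) are made up so the code runs without the API.

```python
from types import SimpleNamespace

def is_reply(tweet):
    """True if the tweet answers another tweet or is addressed to a user."""
    return (tweet.in_reply_to_status_id is not None
            or tweet.in_reply_to_user_id is not None)

# Stand-in tweets; the ID below is invented for the example.
original = SimpleNamespace(text="Happy Tuesday!",
                           in_reply_to_status_id=None, in_reply_to_user_id=None)
reply = SimpleNamespace(text="@RamsNFL thanks!",
                        in_reply_to_status_id=None, in_reply_to_user_id=12345)

for tweet in (original, reply):
    if not is_reply(tweet):
        print(tweet.text)
```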


Task 5

The format of our tweets is somewhat messy, so we can use regex to clean them up!

Let’s go over some regexes and see their matches for a few example tweets. Open regexr.com and copy the following three tweets from Ellen:


@RamsNFL I love representing the Rams on my show. Usually I do it with a baby goat.

I hope your Tuesday is as wonderful as a montage of puppies. Happy #NationalPetDay! #LaughDancePartner https://t.co/KsmABi8fLm

Happy birthday, @Eltonofficial! I ❤ a Saturday bday. Saturday night's the night I like. Saturday night's alright alright alright. 

Try out the following regexes:

  • Urls: http[^\s]+
  • Special characters (e.g. punctuation, emojis): [^\w\s]
    • If you would like to retain hashtags and @ mentions: [^#@\w\s]
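
If you want to check the same matches in Python rather than on regexr.com, re.findall lists everything a pattern matches. The sketch below runs the three patterns on the second example tweet; the variable names are our own.

```python
import re

tweet = ("I hope your Tuesday is as wonderful as a montage of puppies. "
         "Happy #NationalPetDay! #LaughDancePartner https://t.co/KsmABi8fLm")

url_pattern = re.compile(r'http[^\s]+')       # "http" up to the next whitespace
special_pattern = re.compile(r'[^\w\s]')      # anything not a word character or whitespace
keep_tags_pattern = re.compile(r'[^#@\w\s]')  # same, but "#" and "@" are left alone

print(url_pattern.findall(tweet))        # ['https://t.co/KsmABi8fLm']
print(special_pattern.findall(tweet))    # includes '#' and the URL's ':', '/', '.'
print(keep_tags_pattern.findall(tweet))  # same list without the two '#'
```

Notice that the special-character patterns also match the punctuation inside the URL, so it is easier to remove URLs first.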

Then return to your Python program. Fill in the function clean_tweet() to remove the regex matches above. You can use re.sub().

One other issue: Twitter encodes the ampersand (&) as the HTML entity “&amp;”, which turns into the string “amp” once special characters are removed. Thus, you may want to replace it with “and.”

Also, if you would like, you can make the text lowercase.

Now, as you iterate through results in main(), call clean_tweet() on each tweet’s text, and then print out the output.
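
Here is one possible way to fill in clean_tweet(), offered as a sketch rather than the expected solution. The keep_tags parameter and the final whitespace cleanup are our own additions, and the stencil's function signature may differ.

```python
import re

def clean_tweet(text, keep_tags=True):
    """Return a cleaned, lowercase, one-line version of a tweet's text."""
    text = text.replace('\n', ' ')
    # Remove URLs first, while their punctuation is still intact.
    text = re.sub(r'http[^\s]+', '', text)
    # Replace the encoded ampersand before punctuation is stripped.
    text = text.replace('&amp;', 'and')
    # Remove special characters, optionally keeping "#" and "@".
    pattern = r'[^#@\w\s]' if keep_tags else r'[^\w\s]'
    text = re.sub(pattern, '', text)
    # Collapse the gaps left behind by the removals (an extra step, not required).
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

print(clean_tweet("Happy #NationalPetDay! #LaughDancePartner https://t.co/KsmABi8fLm"))
```

Replacing “&amp;” before the punctuation is stripped avoids touching ordinary words that happen to contain “amp,” such as “camp.”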


Task 6

The last task is to write the tweets (both the uncleaned and cleaned versions) to text files.

In your “for” loop that iterates through results, add the following code. The newline character '\n' ensures that each tweet is written on its own line: unclean_file.write(tweet.text + '\n')

Use a similar syntax for writing your cleaned tweet with clean_file.

Note: we predefined the filenames "uncleanedtweets.txt" and "tweets.txt" for you. You can replace these with filenames of your choice, or have them specified with sys.argv inputs.
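
As a sketch of the sys.argv option, the helper below uses two command-line arguments as filenames when they are given and falls back to the defaults otherwise. Both function names, and the order of the two arguments, are assumptions made for illustration; the stencil opens its files in its own way.

```python
import sys

def output_filenames(argv):
    """Return (uncleaned filename, cleaned filename) from argv, or the defaults."""
    if len(argv) >= 3:
        return argv[1], argv[2]
    return "uncleanedtweets.txt", "tweets.txt"

def write_tweets(pairs, unclean_name, clean_name):
    """Write (uncleaned, cleaned) text pairs, one tweet per line in each file."""
    with open(unclean_name, "w", encoding="utf-8") as unclean_file, \
         open(clean_name, "w", encoding="utf-8") as clean_file:
        for raw, cleaned in pairs:
            # Keep each uncleaned tweet on a single line as well.
            unclean_file.write(raw.replace('\n', ' ') + '\n')
            clean_file.write(cleaned + '\n')

# Typical use inside main():
#   unclean_name, clean_name = output_filenames(sys.argv)
#   write_tweets(pairs, unclean_name, clean_name)
```

Running python yourprogram.py raw.txt clean.txt would then write to raw.txt and clean.txt instead of the default files.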


Some notes about the API

Our access to tweets is somewhat limited:

  • api.search can only return tweets from the past 7 days.
  • There is no time restriction for api.user_timeline.
  • There is an API rate limit: we can obtain a maximum of ~2500 tweets in a 15-minute window. Thus, when testing your code, we suggest that you use small values for number_tweets_to_get to ensure you don’t exceed the limit.
    • If you see a Twitter error response status code = 429 when running your code, you’ve exceeded the limit. You’ll need to wait 15 minutes before your program will work again.
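
If you would rather have your program wait out the limit than stop, the pattern looks roughly like the sketch below. RateLimited is a hypothetical stand-in for whatever error your Tweepy version raises on a 429 response, and fetch_with_retry is our own helper name, not part of Tweepy.

```python
import time

class RateLimited(Exception):
    """Hypothetical stand-in for the error raised on a 429 response."""

def fetch_with_retry(fetch, wait_seconds=15 * 60, max_attempts=2, sleep=time.sleep):
    """Call fetch(); if it is rate-limited, wait and then try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts:
                raise  # out of attempts, so let the error through
            sleep(wait_seconds)  # wait for the 15-minute window to reset
```

Tweepy can also do the waiting for you: passing wait_on_rate_limit=True when creating tweepy.API makes it pause automatically until the limit resets.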

If you have extra time

This page has a section called “Query operators.” It lists the various options for building a query, which you can use with the “q” parameter of the API’s search function.

  • For example, if you wanted to return tweets mentioning the Twitter account for Brown University, you would use q='@BrownUniversity'
  • Try out some of the other query operators for topics or accounts of your choice!

Also, in the section “Additional parameters,” there’s a description of the parameter geocode, which you can use to return tweets from a latitude/longitude radius. You can try visualizing your results on a Plotly map.
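
The geocode parameter takes a single string of the form "latitude,longitude,radius", where the radius ends in mi or km. A small helper for building that string might look like this; geocode_string is a hypothetical name, and the Providence coordinates are approximate.

```python
def geocode_string(latitude, longitude, radius, unit="mi"):
    """Build the "latitude,longitude,radius" string used by the geocode parameter."""
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    return f"{latitude},{longitude},{radius}{unit}"

# Approximate coordinates for Providence, RI, with a 10-mile radius.
providence = geocode_string(41.824, -71.4128, 10)
print(providence)  # 41.824,-71.4128,10mi
```

You could then add geocode=providence next to q and lang in your tweepy.Cursor(api.search, ...) call.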


Once you're done, please check off your lab with a TA or share your file with cs0030handin@gmail.com by midnight, 4/20.