Accessing the Twitter API via Tweepy
In this activity, you’ll learn how to obtain and clean tweets with Tweepy, a Python library for accessing the Twitter API.
The first step is to install the Tweepy library. At the command line, type
pip install tweepy
The “pip” command is a tool for installing Python packages.
In order to access the Twitter API, you need to set up a Twitter Application and obtain authentication credentials. Follow the steps below:
- Open https://apps.twitter.com
- Sign in with an existing Twitter account, or set up a new account.
- Click “Create New App”
- Provide a name and a description for your app (it can be anything you wish). For the website field, you can enter our course webpage: http://cs.brown.edu/courses/csci0030/. You can leave the “Callback URL” field blank.
- Agree to the Twitter Developer Agreement, and then click “Create your Twitter application”
- In the new page that opens, navigate to the tab “Keys and Access Tokens.”
- Click on “Create my access token” at the bottom of the page.
- The page should now list a Consumer Key, Consumer Secret, Access Token, and Access Token Secret. Open the Python stencil program here, and then copy these strings into their variable assignments.
- In the Python program, look over the initialize() function. This function uses your specific credentials to give you access to the API.
Now we can use the API to return the most recent tweets for a certain topic.
In main(), you’ll see the line
results = tweepy.Cursor(api.search, q='', lang='en').items(number_tweets_to_get)
This stencil code uses the API’s search function to find the most recent tweets that include the query term “q” and are written in English. The number of tweets is specified by the constant number_tweets_to_get. We’ve set number_tweets_to_get to a default value of 50. You can change this value later (while being mindful of Twitter’s API access limits, which we’ll discuss at the end of the activity).
For example, to return the most recent tweets that include the term “basketball” and are in English, call:
results = tweepy.Cursor(api.search, q='basketball', lang='en').items(number_tweets_to_get)
Also, in the Python stencil, notice the “for” loop that iterates through results and prints out each tweet. (Note: while our stencil code opens two files, don’t write to them just yet.) Run your code now to see how your tweets get printed out.
In your printed results, you’ll see that each tweet includes not just its text, but also multiple additional fields. For example, you can see whether the tweet was a retweet (“retweeted=False” or “retweeted=True”) and the “screen_name” of the user who wrote the tweet.
To see just the text of each tweet, change the print statement in the loop so that it prints tweet.text instead of the whole tweet.
Another note: Sometimes a single tweet is split across several lines. To replace the newline characters ('\n') and merge the tweet into a single line, use
tweet.text = tweet.text.replace('\n', ' ')
If you would not like to print out retweets, use the following “if” statement
if (not tweet.retweeted) and ('RT @' not in tweet.text): print(tweet.text)
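The two steps above (merging a tweet onto one line and skipping retweets) can be tried without calling the API at all. The sketch below is only an illustration: printable_text() is a hypothetical helper name, and the sample objects are stand-ins that carry just the text and retweeted attributes, not real Tweepy results.

```python
from types import SimpleNamespace


def printable_text(tweet):
    """Return the tweet text on one line, or None if it looks like a retweet."""
    if tweet.retweeted or "RT @" in tweet.text:
        return None
    return tweet.text.replace("\n", " ")


# Stand-in objects with only the attributes used above (assumption:
# real results expose the same attribute names).
sample_results = [
    SimpleNamespace(text="Game night!\nWho is watching?", retweeted=False),
    SimpleNamespace(text="RT @someone: Great basketball game", retweeted=False),
]

for tweet in sample_results:
    line = printable_text(tweet)
    if line is not None:
        print(line)
```

Only the first sample tweet is printed, because the second one contains 'RT @'.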
You can also return the most recent tweets from a specific user with the API’s user_timeline function.
In your stencil code, you should see the commented code
results = tweepy.Cursor(api.user_timeline, screen_name='').items(number_tweets_to_get)
The "screen_name" parameter is used to specify which user.
Let's return the most recent tweets by Ellen DeGeneres, posted on her Twitter account “TheEllenShow.” Replace this line with
results = tweepy.Cursor(api.user_timeline, screen_name='TheEllenShow').items(number_tweets_to_get)
and then run your code to print out the results.
If you would not like to print out replies, add the following “if” statement
if tweet.in_reply_to_status_id is None and tweet.in_reply_to_user_id is None: print(tweet.text)
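The reply check can also be pulled into a small helper so the loop stays readable. This is a sketch under assumptions: is_reply() is a hypothetical name, and the timeline below is made of stand-in objects with invented ID values rather than real results from the API.

```python
from types import SimpleNamespace


def is_reply(tweet):
    """A tweet counts as a reply if either reply field is set."""
    return (tweet.in_reply_to_status_id is not None
            or tweet.in_reply_to_user_id is not None)


# Stand-in objects; the numeric ID is made up for illustration.
timeline = [
    SimpleNamespace(text="Happy #NationalPetDay!",
                    in_reply_to_status_id=None, in_reply_to_user_id=None),
    SimpleNamespace(text="@RamsNFL I love representing the Rams on my show.",
                    in_reply_to_status_id=None, in_reply_to_user_id=12345),
]

originals = [tweet.text for tweet in timeline if not is_reply(tweet)]
print(originals)
```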
The format of our tweets is somewhat messy, but we can use regular expressions (regex) to clean them up!
Let’s go over some regexes and see their matches for a few example tweets. Open regexr.com and copy the following three tweets from Ellen:
@RamsNFL I love representing the Rams on my show. Usually I do it with a baby goat.
I hope your Tuesday is as wonderful as a montage of puppies. Happy #NationalPetDay! #LaughDancePartner https://t.co/KsmABi8fLm
Happy birthday, @Eltonofficial! I ❤ a Saturday bday. Saturday night's the night I like. Saturday night's alright alright alright.
Try out the following regexes:
- Special characters (e.g. punctuation, emojis):
- If you would like to retain hashtags and @ mentions:
Then return to your Python program. Fill in the function
clean_tweet() to remove the regex matches above. You can use
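One other issue: Twitter encodes the ampersand (&) as the HTML entity “&amp;”, so once punctuation is removed the string “amp” is left behind. Thus, you may want to replace “&amp;” with “and” before removing special characters.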
Also, if you would like, you can make the text lowercase.
Now, as you iterate through results, call clean_tweet() on each tweet’s text, and then print out the output.
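As a rough sketch of what a finished clean_tweet() might look like, the version below removes links and special characters while keeping hashtags and @ mentions. The two patterns are assumptions chosen for illustration and may differ from the regexes intended for this activity; the function signature in your stencil may differ too.

```python
import re

# Assumed patterns, not necessarily the ones from the activity.
URL_PATTERN = re.compile(r"https?://\S+")          # links such as https://t.co/...
SPECIAL_PATTERN = re.compile(r"[^A-Za-z0-9\s#@]")  # keep letters, digits, spaces, # and @


def clean_tweet(text):
    """Return a lowercase copy of text without links or special characters."""
    text = text.replace("&amp;", "and")       # handle the ampersand entity first
    text = URL_PATTERN.sub("", text)          # drop links
    text = SPECIAL_PATTERN.sub("", text)      # drop punctuation and emojis
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
    return text.lower()


example = "Happy #NationalPetDay! #LaughDancePartner https://t.co/KsmABi8fLm"
print(clean_tweet(example))
```

The link is removed before the special characters because stripping punctuation first would break the URL into fragments that the URL pattern no longer matches.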
The last task is to write the tweets (both the uncleaned and cleaned versions) to text files.
In your “for” loop that iterates through
results, add the following code. The newline character ‘\n’ ensures that each tweet is written on its own line:
unclean_file.write(tweet.text + '\n')
Use a similar syntax for writing your cleaned tweet with
Note: we predefined the filenames "uncleanedtweets.txt" and "tweets.txt" for you. You can replace these with filenames of your choice, or have them specified with
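The file-writing step can be sketched on its own, away from the API. In the example below, write_tweets() and the clean_file variable are hypothetical names (only unclean_file appears in the stencil line above), str.lower stands in for your own cleaning function, and the files go to a temporary folder so nothing in your working directory is overwritten.

```python
import os
import tempfile


def write_tweets(texts, unclean_path, clean_path, cleaner=str.lower):
    """Write each raw text and its cleaned version, one tweet per line."""
    with open(unclean_path, "w", encoding="utf-8") as unclean_file, \
            open(clean_path, "w", encoding="utf-8") as clean_file:
        for text in texts:
            one_line = text.replace("\n", " ")  # keep each tweet on a single line
            unclean_file.write(one_line + "\n")
            clean_file.write(cleaner(one_line) + "\n")


# Demonstration in a temporary folder, using the filenames from the note above.
with tempfile.TemporaryDirectory() as folder:
    unclean_path = os.path.join(folder, "uncleanedtweets.txt")
    clean_path = os.path.join(folder, "tweets.txt")
    write_tweets(["Happy\nTuesday!", "Go Rams"], unclean_path, clean_path)
    with open(clean_path, encoding="utf-8") as handle:
        cleaned_lines = handle.read().splitlines()

print(cleaned_lines)
```

Opening both files in a single "with" statement means they are closed automatically, even if an error occurs partway through the loop.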
Some notes about the API
Our access to tweets is somewhat limited:
- api.search can only return tweets from the past 7 days.
- There is no time restriction for
- There is an API rate limit: we can obtain a maximum of ~2500 tweets in a 15-minute window. Thus, when testing your code, we suggest that you use small values for number_tweets_to_get to ensure you don’t exceed the limit.
- If you see a Twitter error response with status code = 429 when running your code, you’ve exceeded the limit. You’ll need to wait 15 minutes before your program will work again.
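To get a feel for the rate limit, it can help to work out how many times you can run your program within one 15-minute window. The sketch below simply divides the approximate limit quoted above by number_tweets_to_get; the 2500 figure is the rough value from this handout, and runs_per_window() is a hypothetical helper name.

```python
WINDOW_LIMIT = 2500  # approximate tweets per 15-minute window, as quoted above


def runs_per_window(number_tweets_to_get, window_limit=WINDOW_LIMIT):
    """Estimate how many full runs fit inside one rate-limit window."""
    if number_tweets_to_get <= 0:
        raise ValueError("number_tweets_to_get must be positive")
    return window_limit // number_tweets_to_get


for count in (50, 500, 2500):
    print(count, "tweets per run ->", runs_per_window(count), "runs per window")
```

With the default of 50 tweets per run, this estimate allows about 50 runs before the limit is reached.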
If you have extra time
This page has a section “Query operators.” It provides the various options for setting up a query, which you can use with the “q” parameter for the API’s search function.
- For example, if you wanted to return tweets mentioning the Twitter account for Brown University, you would use
- Try out some of the other query operators for topics or accounts of your choice!
Also, in the section “Additional parameters,” there’s a description of the parameter
geocode, which you can use to return tweets from a latitude/longitude radius. You can try visualizing your results on a Plotly map.
Once you're done, please check off your lab with a TA or share your file with email@example.com by midnight, 4/20.