Homework 3-4

Analyzing Tweets

For this homework, you’ll gather tweets on a topic or from a user of your choice. You’ll then analyze the tweets’ word frequencies and their average sentiment (positive vs. negative).

Task 1

Create a Python program to gather tweets from the API, clean the tweets, and write the tweets (both the unclean and clean versions) to text files. This program should be similar to the activity, except now you should use command-line arguments to specify:

  • a program "mode" that determines whether to use api.search or api.user_timeline
  • the query/username
  • the number of tweets to return
  • the filenames for your text files

Using command-line arguments makes it easy to set up different queries and save the results in different text files. While we encourage you to run your code for multiple queries, you only need to submit the tweets from one query.
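
As a rough illustration of the arguments listed above, here is a minimal sketch using Python's standard argparse module. The argument names (mode, query, count, raw_file, clean_file) and the mode values "search" and "user" are assumptions made for this example, not requirements of the assignment; sys.argv would work just as well.

```python
import argparse


def build_parser():
    """Build a parser for the Task 1 arguments (names are illustrative only)."""
    parser = argparse.ArgumentParser(description="Gather and clean tweets.")
    # "search" would map to api.search, "user" to api.user_timeline.
    parser.add_argument("mode", choices=["search", "user"],
                        help="which API call to use")
    parser.add_argument("query", help="search query or username")
    parser.add_argument("count", type=int, help="number of tweets to return")
    parser.add_argument("raw_file", help="output file for uncleaned tweets")
    parser.add_argument("clean_file", help="output file for cleaned tweets")
    return parser


# Passing an explicit list stands in for the real command line here;
# in your program you would call parse_args() with no arguments.
args = build_parser().parse_args(["search", "apple", "50", "raw.txt", "clean.txt"])
print(args.mode, args.query, args.count, args.raw_file, args.clean_file)
```

With a parser like this, a run might look like: python FirstLast_Task1_HW3-4.py search apple 50 raw.txt clean.txt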

Because of rate limits, you may want to return just a few dozen tweets at a time while you are creating and debugging your program. For the final text files that you submit and analyze, gather at least 500 tweets.


Task 2

Create a second Python program to read clean tweets from a text file and analyze them. Specify the text file on the command line.

Analysis Part 1: Word Frequencies

What are the words that are associated with your query? For example, if you search for “apple,” some of the associated words may be “iphone,” “laptop,” and “pie.”

For your program, use a Counter to print out the 20 most common words in your text file of tweets. In order to remove stopwords, use the text file provided here.
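
The counting step described above can be sketched as follows. This is only one possible approach: the function name top_words, the one-tweet-per-line layout, and the tiny in-memory sample data are assumptions for illustration. In your program the tweets and stopwords would come from the text files.

```python
from collections import Counter


def top_words(tweets, stopwords, n=20):
    """Return the n most common non-stopword words across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # Assumes the tweets are already cleaned, so splitting on
        # whitespace is enough to separate words.
        for word in tweet.lower().split():
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


# Hypothetical sample data standing in for the contents of your files.
sample_tweets = ["apple pie is the best pie", "the new apple laptop"]
sample_stopwords = {"is", "the"}

for word, count in top_words(sample_tweets, sample_stopwords, n=3):
    print(word, count)
```

Storing the stopwords in a set rather than a list keeps each membership check fast, which matters once you are processing 500 or more tweets.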

Analysis Part 2: Sentiment Analysis

Does a user tweet with more positive or more negative language? Are certain topics discussed with a certain sentiment? For example, we could expect that tweets about puppies will generally involve positive language, while tweets about the plague will involve negative language.

As a crude measure of the sentiment of a tweet, you can count its number of positive vs. negative words. We’re providing you with reference lists of positive and negative words.

Rate the sentiment for a single tweet as: (num_pos_words - num_neg_words) / (num_pos_words + num_neg_words)

If a tweet has no negative or positive words, the denominator above is zero, so set its rating to 0. The ratings fall in the range [-1, 1], where -1 = most negative, 0 = neutral, and 1 = most positive. You may want to remove stopwords.

Print out the average sentiment of your file of Tweets.
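
To make the rating formula and the averaging step concrete, here is a minimal sketch. The function names and the small word sets are made up for this example; your program would load the provided reference lists instead, and you may choose to handle an empty file differently.

```python
def rate_tweet(tweet, pos_words, neg_words):
    """Rate one tweet as (pos - neg) / (pos + neg), or 0 if it has neither."""
    words = tweet.lower().split()
    num_pos = sum(1 for w in words if w in pos_words)
    num_neg = sum(1 for w in words if w in neg_words)
    total = num_pos + num_neg
    if total == 0:
        # No sentiment words at all: avoid dividing by zero.
        return 0.0
    return (num_pos - num_neg) / total


def average_sentiment(tweets, pos_words, neg_words):
    """Average the per-tweet ratings; an empty list is treated as 0 here."""
    ratings = [rate_tweet(t, pos_words, neg_words) for t in tweets]
    return sum(ratings) / len(ratings) if ratings else 0.0


# Hypothetical word lists and tweets, for illustration only.
pos = {"love", "great"}
neg = {"hate"}
tweets = ["i love this great puppy", "i love and hate mondays", "nothing to see"]

print(average_sentiment(tweets, pos, neg))
```

In this example the three tweets rate 1.0, 0.0, and 0.0 (two positive words; one positive and one negative; no sentiment words), so the average is one third.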

While your program should be flexible enough to handle any text file of Tweets, in your handin, you only need to report your results for one text file.


Extra Credit

We encourage you to extend your programs or analyses. We’ve provided you with a few suggestions below, but feel free to come up with your own ideas!

  • Compare the word frequencies or sentiments of 2+ groups of tweets (e.g. tweets about Uber vs. tweets about Lyft).
  • Gather and analyze geocoded tweets.
  • Find the most common hashtags or @ mentions from a specific user.
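
For the hashtag and @ mention suggestion, one possible starting point is a regular expression combined with a Counter. This is a sketch under assumptions: the function name count_tags and the patterns are illustrative, and it presumes you run it on text that still contains the # and @ characters (for example, the uncleaned tweets, if your cleaning step strips punctuation).

```python
import re
from collections import Counter


def count_tags(tweets):
    """Count hashtags and @ mentions, ignoring case."""
    hashtags = Counter()
    mentions = Counter()
    for tweet in tweets:
        # \w+ matches letters, digits, and underscores after the symbol.
        hashtags.update(tag.lower() for tag in re.findall(r"#\w+", tweet))
        mentions.update(name.lower() for name in re.findall(r"@\w+", tweet))
    return hashtags, mentions


# Hypothetical tweets, for illustration only.
sample = ["Loving the #Python class with @alice",
          "#python and #data with @Bob and @alice"]
tags, names = count_tags(sample)
print(tags.most_common(5))
print(names.most_common(5))
```

Note that the simple @ pattern would also match part of an email address, so you may want to refine it for real data.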

Handin

You should submit:

  • Both of your Python programs. These can be named FirstLast_Task1_HW3-4.py and FirstLast_Task2_HW3-4.py.
  • For your chosen topic or user:
    • The uncleaned and clean versions of your text files of Tweets.
    • The results of your analysis, written up in a Google doc.
  • In the Google doc, also include the number of hours you worked on this assignment, any collaborators you worked with, and whether you went to TA hours for this assignment.

Please submit this to cs0030handin@gmail.com by midnight, 4/24.