(Independent) This project may take one of two forms, which we'll call the "hypothesis" and "computation" forms.
Pose a computational question based on textual data, or describe a computational activity you'll perform on textual data. You may use your own data source, or choose from the data sources we discussed in class:
- Project Gutenberg: http://www.gutenberg.org/
- Dictionary: http://www.mso.anu.edu.au/~ralph/OPTED/
- American Presidency Project Debates: http://www.presidency.ucsb.edu/debates.php
For your project you must either
- Present a testable hypothesis, carry out the required analyses, report your findings in a clear and understandable way, and discuss your results (just as in project 1) or
- Describe a computation you'll perform, including the form that the results will take, demonstrate that your computations do in fact work as claimed, and discuss the results on some data.
To present your results, you may report descriptive summary statistics (such as count, mean, median, and standard deviation). You will also be able to import your results into Google spreadsheets or Excel to analyze basic trends (via color formatting and plotting). However, you will not be graded on any spreadsheet work beyond presenting your results in a clear manner. This applies to both the "hypothesis" and "computation" forms of the project.
Below are several themes for projects that you are free to build upon in your project. You do not need to limit yourself to these ideas.
- Hypothesize about historical outcomes based on a collection of documents written during those times. For example, analyze a collection of transcripts of political debates, define what it means to "win" the debate based on your analysis, and compare that to what the media reported, etc.
- Hypothesize about the influence of historical contexts on the writing. For example, see whether writers during the American Civil War were affected by their geographic region. Do northern writers have a writing style that is distinct from southern writers?
- Hypothesize about differences between genres of writing. For example, test whether the word frequencies or vocabulary of early science fiction writers differed from those of their contemporaries.
- Hypothesize about authorship. For example, analyze The Federalist Papers available on Project Gutenberg. The authors are included in this version, but at one point in time it was unclear who wrote each paper. Ignoring the names given in the text, see whether you can identify who authored each paper, and compare your results with the known authors.
- Hypothesize about how a single author changed over time. For example, see whether an author's writing style changed over time and whether those changes correspond to events in the author's life.
- Compute the fraction of tweets involving some words or phrases that come from east or west of some longitude line.
- Compare the average length of tweets with that of re-tweets, or find other interesting characteristics of tweets. Maybe short, punchy tweets are more often retweeted. Or maybe tweets that include images are more popular.
- Write a basic spell checker. Check that all words in a Supreme Court decision are spelled correctly. Allow the user to include an optional dictionary of extra words, like "certiorari", which may appear in SCOTUS decisions but not in your dictionary. And don't try to spell-check capitalized words, in case they're names.
- Google trends for tweets. Let the user pick a word (or two words) and make a plot in Excel showing how often those words are used over the course of a day. Maybe Miley Cyrus is most often discussed late at night, while Nabisco shows up more during working hours. We can provide you with Twitter posts from a certain user over time (say 1000 tweets), or from a certain search query (e.g. NATO). In the latter case, we can give you a couple of different datasets, either from different parts of the day or from different geographical locations. All the information is given to you in CSV format, and you would parse that using Python.
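As one illustration of a "computation" project, the spell-checker idea above could start from something like the sketch below. The function name find_misspelled, its parameters, and the rule for what counts as a word are all assumptions made for this example, not requirements of the assignment; your own design may differ.

```python
import re

def find_misspelled(text, dictionary, extra_words=None):
    """Return a sorted list of lowercase words in text that are not known.

    text: the string to check.
    dictionary: a set of known lowercase words.
    extra_words: optional list of additional accepted words.
    Capitalized words are skipped, since they may be names.
    """
    known = set(dictionary)
    if extra_words is not None:
        known.update(w.lower() for w in extra_words)

    misspelled = set()
    # Assumption: a word is a run of letters, optionally with one apostrophe part.
    for word in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        if word[0].isupper():
            continue  # possibly a name, so do not check it
        if word.lower() not in known:
            misspelled.add(word.lower())
    return sorted(misspelled)
```

Note that skipping capitalized words also skips the first word of every sentence; whether that trade-off is acceptable is a design choice to discuss in your write-up.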
Regardless of your hypothesis or computation, you must write new Python functions for your analysis. You can use code from the homeworks and in-class activities, but you cannot only use functions that were provided in class, like compareTwo(), without adding new functions and/or modifying existing ones. What kind of new code could you add?
- Use your concordance code in your project. Maybe instead of counting word frequencies, you could look up the part of speech of each word using the online dictionary and compute the frequencies of parts of speech.
- Compute the frequencies of lengths of words. Maybe part of a writer's "signature" is how often they use big words.
- Use more than one "feature" (e.g., stop-word frequencies) in your compareTwo() function.
- Use the raw_input() function to prompt the user for some text input. Let them choose which texts to analyze or which features to use in the analysis. The user input should probably be an input argument into some other function(s).
- Use your imagination!
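For instance, the word-length idea from the list above might be sketched as follows. The name word_length_frequencies and the choice to treat any run of letters as a word are assumptions made for illustration only.

```python
import re
from collections import Counter

def word_length_frequencies(text):
    """Return a dictionary mapping word length to the fraction of words
    in text that have that length.

    text: a string.
    Returns an empty dictionary if text contains no words.
    """
    # Assumption: a word is any run of letters; digits and punctuation are ignored.
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {}
    counts = Counter(len(w) for w in words)
    total = float(len(words))
    return {length: count / total for length, count in counts.items()}
```

Two such dictionaries, one per text, could then serve as an additional feature when comparing authors.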
Project Description (Google document called FirstLast_Proj2_Proposal)
Write a concise (one-page) description of the project you would like to execute. This description should include the following parts, but double-check the rubric for full details:
- Background: put your project idea in context.
- Claim: the specific hypothesis you plan to test (which is a statement, not a question), or the computation you plan to carry out.
- Data: a brief description of your data source.
- Programming Elements: a few sentences describing the problems you will need to write Python functions for.
- Potential Roadblocks: a list of potential obstacles.
- Backup Plan: Suppose your project is much harder than you anticipate. What parts of the project would you change to still get somewhat interesting results?
Skeleton Code (FirstLast_Proj2_Proposal.py)
Write a Python file that contains an outline of the code you anticipate writing. This file should run without errors! It should include the following:
- Comments at the top of the file describing what the program does.
- Functions that you will write (of course, you might change this later).
- Function descriptions (in triple quotes) that describe (1) what the function does, (2) what the inputs are, and (3) what the outputs are.
- Some lines of code and comments that help describe what the functions will contain.
Don't get too wrapped up in the details here — the goal of the skeleton code is to provide you with an outline of what you have to program.
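A skeleton along these lines might look like the sketch below. The project it outlines (comparing two texts) and every function name in it are hypothetical, chosen only to show the expected shape: a header comment, docstrings covering purpose, inputs, and outputs, and placeholder bodies that still run.

```python
# FirstLast_Proj2_Proposal.py
# Hypothetical example: compares two texts by how often they use each word.

def load_text(filename):
    """Read a text file and return its contents.

    Input: filename (string), the path to a plain-text file.
    Output: a string containing the whole file.
    """
    # Plan: open the file, read everything, close the file.
    return ""  # placeholder until implemented

def count_features(text):
    """Count how often each word appears in a text.

    Input: text (string).
    Output: a dictionary mapping each word to its count.
    """
    # Plan: lowercase the text, split it into words, tally the words.
    return {}  # placeholder until implemented

def compare_texts(features_a, features_b):
    """Score how similar two texts are.

    Input: two dictionaries as returned by count_features.
    Output: a float; larger means more similar.
    """
    # Plan: loop over the words the two dictionaries share.
    return 0.0  # placeholder until implemented

def main():
    """Run the whole analysis on two placeholder file names."""
    a = count_features(load_text("textA.txt"))
    b = count_features(load_text("textB.txt"))
    print(compare_texts(a, b))

if __name__ == "__main__":
    main()
```

Because every function returns a placeholder value, the file runs from start to finish even though no real analysis happens yet.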
Handin
Create a Google folder called FirstLast_Proj2_Proposal. Replace FirstLast with your actual first and last name or we will take off points. Make sure the folder contains both your skeleton code and the proposal document.
Share the folder with cs0931handinfall2015@gmail.com.
Carry out the project you proposed. It's OK if the project changes — that's why it was a proposal.
Python Program
After filling in your skeleton code, you are almost done. However, to make this code usable for others, you must do a few more things.
- Provide instructions on how to run your program (in comments).
- Provide at least one test function and/or test file that verifies that your code does what it should do. Include instructions in the comments.
- If you have a tricky function that has a regular expression, write a test file and show that the function returns the proper result.
- If you have a function that counts occurrences of words, write a test file and show that the function returns the proper result.
- Handle data and input errors and notify the user.
- One way to notify the user is to print a string (such as "Error! Input should be an integer, not a float") and return nothing.
- Remember the type() function. The following expressions all evaluate to True:
    type('a') is str
    type([1,2,3]) is list
    type(2.5) is float
- Suppose you are using data where you know that each line should be split into exactly 13 elements. To skip any lines that do not have 13 elements, you could write:
    if len(myList) != 13:
        print("Skipping line with != 13 elements", myList)
        pass
    else:
        # continue with code...
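Putting the testing and error-handling requirements together, a word-counting function and its test function might look like this minimal sketch. The names count_word and test_count_word are made up for the example, and the function splits on whitespace only, so punctuation attached to a word is not stripped.

```python
def count_word(text, word):
    """Return how many times word occurs in text, ignoring case.

    Inputs: text (string) and word (string).
    Output: an integer, or None if either input is not a string.
    """
    if type(text) is not str or type(word) is not str:
        print("Error! Both inputs should be strings")
        return None
    # Splits on whitespace only; "the." would not match "the".
    return text.lower().split().count(word.lower())

def test_count_word():
    """Print True for each check that passes."""
    print(count_word("the cat and the hat", "the") == 2)
    print(count_word("The the THE", "the") == 3)
    print(count_word("", "the") == 0)
    print(count_word(42, "the") is None)  # also prints the error message
```

To run the checks, call test_count_word() and confirm that every comparison prints True.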
Website
You will create a website that presents your analysis and results. It should contain the following things:
- Project description and hypothesis.
- Concise explanation of your methods.
- Your results, presented in a clear and informative manner.
- Discussion of the results of your analysis or computation. You should point out expected and unexpected results.
- Reflection on the project. What went well? What didn't?
- Python and data/test files available for download.
Refer to the Project 2 Rubric for more details on the code and website requirements.
DON'T FORGET to change the permissions on your website so that we are able to view it. You may make the site public or restrict it to only people with a Brown email address if you like, but we must be able to access it in some way.
Handin
Create a Google Folder named FirstLast_Proj2. Please make sure to replace FirstLast with your first and last name or we will take off points. It should contain the following:
- All files you used in your project, including Python files, Excel files/Google Spreadsheets, data files, and test files.
- A Google document named 'README' that contains (1) the url of your web page and (2) a list of all files contained in the folder with a short description of what they are.
Share the folder with cs0931handinfall2015@gmail.com.