Project 2


This is an independent assignment. You may discuss this assignment with course staff only.

Project Description

This project may take one of two forms, which we'll call the "hypothesis" and "computation" forms.

Pose a computational question based on textual data, or describe a computational activity you'll perform on textual data. You are free to choose your own dataset. A few options are listed here:

  1. Project Gutenberg:
  2. Dictionary:
  3. American Presidency Documents Archive:

For your project you must either

  1. Present a testable hypothesis, carry out the required analyses, report your findings in a clear and understandable way, and discuss your results (just as in Project 1), or
  2. Describe a computation you'll perform, including the form that the results will take, demonstrate that your computations do in fact work as claimed, and discuss the results on some data.

To present your results, you may report descriptive summary statistics (such as count, mean, median, and standard deviation). You may also import your results into Google Sheets or Excel to analyze basic trends (via color formatting and plotting). However, you will not be graded on any spreadsheet work beyond presenting your results in a clear manner. This applies to both the "hypothesis" and "computation" forms of the project.

    Refer to the project rubric to ensure that your proposal and project include all the required elements.
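As a rough illustration of the summary statistics mentioned above, here is a minimal sketch that uses Python's standard statistics and csv modules. The function names (summarize, write_summary_csv), the column names, and the sample numbers are made up for this example; they are not required by the assignment.

```python
import csv
import statistics
import sys


def summarize(values):
    """Return the count, mean, median, and sample standard deviation of a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # statistics.stdev needs at least two values, so report 0.0 otherwise
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def write_summary_csv(rows, out_file):
    """Write a list of summary dictionaries to an open file in CSV format."""
    fieldnames = ["label", "count", "mean", "median", "stdev"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sentence lengths (in words) for two hypothetical documents
    data = {"doc_a": [12, 15, 9, 22], "doc_b": [30, 28, 35]}
    rows = [{"label": name, **summarize(values)} for name, values in data.items()]
    write_summary_csv(rows, sys.stdout)
```

To get a file you can import into a spreadsheet, pass a file opened with open("summary.csv", "w", newline="") instead of sys.stdout.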

Project Themes

Below are several themes for projects that you are free to build upon in your project. You do not need to limit yourself to these ideas.

  • Hypothesize about historical outcomes based on a collection of documents written during those times. For example, analyze a collection of transcripts of political debates, define what it means to "win" the debate based on your analysis, and compare that to what the media reported, etc.
  • Hypothesize about the influence of historical contexts on the writing. For example, see whether writers during the American Civil War were affected by their geographic region. Do northern writers have a writing style that is distinct from southern writers?
  • Hypothesize about differences between genres of writing. For example, test whether early science fiction writers used word frequencies or vocabulary different from those of their contemporaries.
  • Hypothesize about authorship. For example, analyze The Federalist Papers available on Project Gutenberg. The authors are included in this version, but at one point in time it was unclear who wrote each paper. Ignoring the names given in the text, see whether you can identify who authored each paper, and compare your results with the known authors.
  • Hypothesize about how a single author changed over time. For example, see whether an author's writing style changed over time and whether those changes correspond to events in the author's life.
  • Hypothesize how readability and style vary across authorship. For example, do Supreme Court justices write in notably different styles and at different readability levels?
  • Hypothesize how readability of a class of text has changed over time. Are patents getting harder to read? What about legislative bills? What has changed that makes them harder or easier to read now?
  • Write a basic spell checker. Check that all words in a Supreme Court decision are spelled correctly. Allow the user to include an optional dictionary of extra words, like "certiorari", which may appear in SCOTUS decisions but not in your dictionary. And don't try to spell-check capitalized words, in case they're names.
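To make the spell-checker theme a little more concrete, here is one possible starting point. It is only a sketch: the function name find_misspellings, the tiny word list, and the sample sentence are invented for illustration, and a real project would load its dictionary from a file and handle punctuation, hyphens, and numbers more carefully.

```python
import re


def find_misspellings(text, dictionary, extra_words=None):
    """Return a sorted list of words in text that are not in the dictionary.

    text: the document to check (str)
    dictionary: an iterable of known words
    extra_words: an optional iterable of additional known words
    """
    known = set(word.lower() for word in dictionary)
    if extra_words:
        known.update(word.lower() for word in extra_words)

    misspelled = set()
    for word in re.findall(r"[A-Za-z]+", text):
        if word[0].isupper():
            continue  # skip capitalized words, in case they are names
        if word.lower() not in known:
            misspelled.add(word)
    return sorted(misspelled)


if __name__ == "__main__":
    # A deliberately tiny, made-up dictionary
    words = ["the", "court", "granted", "to", "and", "case", "reviewed"]
    sample = "The court granted certiorari to Smith and reveiwed the case"
    print(find_misspellings(sample, words))
    print(find_misspellings(sample, words, extra_words=["certiorari"]))
```

Note that skipping every capitalized word also skips the first word of each sentence, which is a trade-off you would want to discuss in your write-up.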

Regardless of your hypothesis or computation, you must write new Python functions for your analysis. You can use code from the homeworks and in-class activities, but you must include original code of your own. What kind of new functionality could you add?

  • Use your concordance code in your project. Maybe instead of counting word frequencies, you could look up the part of speech of each word using the online dictionary and compute the frequencies of parts of speech.
  • Compute the frequencies of lengths of words. Maybe part of a writer's "signature" is how often they use big words.
  • Compare the readabilities of a collection of documents. How does readability change with respect to time or authorship?
  • Use your imagination!
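For instance, the word-length idea above could start from something like the following sketch. The function name word_length_frequencies and the simple definition of a "word" (a run of letters, optionally with internal apostrophes) are assumptions made for this example; your own definition may differ.

```python
import re
from collections import Counter


def word_length_frequencies(text):
    """Return a Counter mapping each word length to how many words have that length."""
    # Treat runs of letters, optionally joined by apostrophes, as words
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    return Counter(len(word) for word in words)


if __name__ == "__main__":
    counts = word_length_frequencies("The cat sat on the mat.")
    for length in sorted(counts):
        print(length, counts[length])
```

With this definition an apostrophe counts toward a word's length, so "don't" has length 5. Whether that is what you want is a choice to justify in your methods.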

Project Proposal (Due Friday, March 24th at 11:59 pm)

Proposal: Project Description (Google document called FirstLast_Proj2_Proposal)

Write a concise (one-page) description of the project you would like to execute. This description should include the following parts, but double-check the rubric for full details:

  1. Background: put your project idea in context.
  2. Claim: the specific hypothesis you plan to test (which is a statement, not a question), or the computation you plan to carry out.
  3. Data: a brief description of your data source.
  4. Programming Elements: a few sentences describing the problems you will need to write Python functions for. How will you need to pre-process and clean your text?
  5. Potential Roadblocks: a list of potential obstacles.
  6. Backup Plan: Suppose your project is much harder than you anticipate. What parts of the project would you change to still get somewhat interesting results?

Proposal: Skeleton Code (

Write a Python file that contains an outline of the code you anticipate writing. This file should run without errors! It should include the following:

  1. Comments at the top of the file describing what the program does.
  2. Function definitions, including a good function name, the input parameter variables and a return statement with a default version of the output value type.
  3. Function descriptions (docstrings) (in triple quotes) that describe (1) what the function does, (2) what the inputs are, and (3) what the outputs are.
  4. A main function with comments that describe how your functions will tie together to accomplish the goal of the program.

Don't get too wrapped up in the details here — the goal of the skeleton code is to provide you with an outline of what you have to program.
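To give a sense of the level of detail expected, here is a sketch of what a skeleton file might look like. The project idea (comparing average word length in two files) and every function name in it are hypothetical; your own skeleton should reflect your own proposal.

```python
# Hypothetical example skeleton.
# This program will compare the average word length of two text files
# and report which file uses longer words on average.


def load_text(filename):
    """Read a text file.

    Input: filename (str), the path to a plain-text file.
    Output: the contents of the file (str).
    """
    return ""


def clean_text(raw_text):
    """Remove punctuation and convert the text to lowercase words.

    Input: raw_text (str), the unprocessed contents of a file.
    Output: a list of cleaned words (list of str).
    """
    return []


def average_word_length(words):
    """Compute the average length of the words in a list.

    Input: words (list of str).
    Output: the average word length (float).
    """
    return 0.0


def main():
    # 1. Load each of the two input files with load_text.
    # 2. Clean each text with clean_text.
    # 3. Compute the average word length of each with average_word_length.
    # 4. Print both averages and report which one is larger.
    pass


if __name__ == "__main__":
    main()
```

Each function returns a default value of the right type, so the file runs without errors even though nothing is implemented yet.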

Proposal: Handin

Create a Google folder called FirstLast_Proj2_Proposal. Replace FirstLast with your actual first and last name or we will take off points. Make sure the folder contains both your skeleton code and the proposal document.

Share the folder with

Project (Due Wednesday, April 12 at 11:59pm)

Carry out the project you proposed. It's OK if the project changes — that's why it was a proposal.

Project: Python Program

After filling in your skeleton code, you are almost done. However, to make this code usable for others, you must do a few more things.

  1. Provide instructions on how to run your program (in comments).
  2. Provide at least three test cases that verify that your code does what it should do. In the comments, include instructions on how to evaluate the test cases and the expected output when running your program on each test case.
  3. Your program should pre-process and clean input data so that if we were to get another example from your original data source, the program would run as expected. For example, if your program was designed to run on texts from Project Gutenberg, we should be able to download a text ourselves and run it with your program.
  4. If the input data does not match what your program expects, the program should handle these errors and notify the user.
    • One way to notify the user is to print a string (such as "Error! Input should be an integer, not a float") and return nothing.
    • You can use the type() function to check that an input is the correct type.
    • If each line of an input file should be split into a certain number of elements, and during iteration you encounter a line which is the incorrect length, you could print out a warning about which line you are skipping.
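The error-handling suggestions above might look something like the sketch below. Both functions (repeat_word and parse_lines), the comma delimiter, and the sample data are invented for illustration. The check uses type() as suggested above; isinstance() is a common alternative.

```python
def repeat_word(word, times):
    """Return word repeated `times` times, separated by spaces.

    Prints an error and returns None if `times` is not an integer.
    """
    if type(times) != int:
        print("Error! Input should be an integer, not a " + type(times).__name__)
        return None
    return " ".join([word] * times)


def parse_lines(lines, expected_fields):
    """Split each comma-separated line into fields.

    Lines with the wrong number of fields are skipped with a warning.
    Returns a list of lists of strings.
    """
    rows = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.strip().split(",")
        if len(fields) != expected_fields:
            print("Warning: skipping line", line_number,
                  "(expected", expected_fields, "fields, found", str(len(fields)) + ")")
            continue
        rows.append(fields)
    return rows


if __name__ == "__main__":
    print(repeat_word("hi", 2))
    print(repeat_word("hi", 2.0))
    print(parse_lines(["a,b,c", "a,b", "d,e,f"], 3))
```

Calls like the three in the main block can double as simple test cases if you record the expected output for each one in your comments.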

Project: Website

You will create a website that presents your analysis and results. It should contain the following things:

  1. Project description and hypothesis.
  2. Concise explanation of your methods.
  3. Your results, presented in a clear and informative manner.
  4. Discussion of the results of your analysis or computation. You should point out expected and unexpected results.
  5. Reflection on the project. What went well? What didn't?
  6. Python and data/test files available for download with an explanation of how to use your program.

Refer to the Project 2 Rubric for more details on the code and website requirements.

DON'T FORGET to change the permissions on your website so that it is publicly viewable. For this project, we will create a website listing each student's project.

Project: Handin

Create a Google Folder named FirstLast_Proj2. Please make sure to replace FirstLast with your first and last name or we will take off points. It should contain the following:

  1. All files you used in your project, including Python files, Excel files/Google Spreadsheets, data files, and test files.
  2. A Google document named 'README' that contains (1) the URL of your web page and (2) a list of all files contained in the folder with a short description of what they are.
  3. Make sure we have permission to view your website and all the files associated with your project.

Share the folder with