The American Presidency Project (http://www.presidency.ucsb.edu/) is an archive that contains hundreds of thousands of documents related to American politics. This site was used in the Politilines example we looked at in the beginning of the semester (http://politilines.periscopic.com/).
- Go to the Debates site: http://www.presidency.ucsb.edu/debates.php. Click on "Republican Candidates Debate in Mesa, Arizona".
- First look for patterns in the text. How do we know who speaks when? Very generally, how would you design a regular expression to get the speaker and the phrases for each line/set of lines?
- Right-click on the page and view the page source. We want to get the transcript from this file. Where is the text that we want to extract? Hint: search for a couple of phrases you KNOW appear in the transcript.
- The function getTranscript in DataImport.py pulls the source from a URL, formats the text, and writes the result to an output file. Run getTranscript on the URL for the Arizona Republican Debate. Open the output file and verify that it worked.
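
As a starting point for the regular-expression question above, here is a minimal sketch of one way to split a transcript into speaker turns. It assumes that each turn begins with an upper-case speaker label followed by a colon and that any following unlabeled lines belong to the same speaker; check this against the real transcript before relying on it. The names SPEAKER_RE and split_turns and the sample text are made up for illustration and are not part of DataImport.py.

```python
import re

# Assumption: a turn starts with an upper-case label and a colon,
# e.g. "MODERATOR: ...". Adjust the pattern to what the page really uses.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z .'\-]+):\s*(.*)$")


def split_turns(text):
    """Group transcript lines into (speaker, speech) pairs."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_RE.match(line)
        if match:
            # A new speaker label starts a new turn.
            turns.append([match.group(1), match.group(2)])
        elif turns:
            # No label: treat the line as a continuation of the last turn.
            turns[-1][1] += " " + line
    return [(speaker, speech.strip()) for speaker, speech in turns]


# Invented sample text, not taken from the debate transcript.
sample = """MODERATOR: Welcome to the lab.
CANDIDATE: Thank you.
It is good to be here.
MODERATOR: Let's begin."""

for speaker, speech in split_turns(sample):
    print(speaker, "->", speech)
```

Once getTranscript has written its output file, you could read that file into a string and pass it to split_turns to see how well the pattern holds up. Lines that the pattern mistakes for a speaker label (for example, an all-caps word followed by a colon inside a sentence) are a good sign that the expression needs refining.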