Tech Report CS-98-06

Getting Useful Gender Statistics from English Text

John Hale and Eugene Charniak

May 1998


Gender, understood as a lexical feature, is important for anaphora resolution because it narrows the set of possible referents in a typical pronoun resolution situation. This work describes an automatic method for obtaining reliable guesses about the gender of entities in a corpus of free text. Running a simple but unreliable anaphora algorithm repeatedly over a large corpus makes it possible to compile the probable genders of referenced entities and give them a salience ranking. These statistics are an inexpensive way to add gender-feature information to a statistical anaphora resolution algorithm.
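The counting idea in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: it assumes some resolver has already proposed (antecedent, pronoun) links, and the names PRONOUN_GENDER, tally_genders, and gender_guesses, the toy input pairs, and the ranking rule (more links first, then higher relative frequency) are all hypothetical stand-ins. In particular, the ranking here is only a placeholder for the salience ranking the report describes.

```python
from collections import Counter, defaultdict

# Hypothetical pronoun-to-gender map; an assumption, not taken from the report.
PRONOUN_GENDER = {
    "he": "masculine", "him": "masculine", "his": "masculine",
    "she": "feminine", "her": "feminine", "hers": "feminine",
    "it": "neuter", "its": "neuter",
}


def tally_genders(pairs):
    """Count, per antecedent word, how often each gender's pronoun was linked to it."""
    counts = defaultdict(Counter)
    for antecedent, pronoun in pairs:
        gender = PRONOUN_GENDER.get(pronoun.lower())
        if gender is not None:  # skip pronouns that carry no gender signal
            counts[antecedent.lower()][gender] += 1
    return counts


def gender_guesses(counts):
    """Return (word, best_gender, relative_frequency, total_links) tuples."""
    guesses = []
    for word, c in counts.items():
        total = sum(c.values())
        gender, n = c.most_common(1)[0]
        guesses.append((word, gender, n / total, total))
    # Placeholder ranking (an assumption): more links first, then higher frequency.
    guesses.sort(key=lambda g: (-g[3], -g[2], g[0]))
    return guesses


# Toy, hand-made resolver output; not real corpus data. Some links are
# deliberately wrong to mimic an unreliable resolver.
pairs = [
    ("president", "he"), ("president", "he"), ("president", "she"),
    ("company", "it"), ("company", "its"), ("company", "it"), ("company", "he"),
    ("Mary", "she"),
]

for word, gender, freq, total in gender_guesses(tally_genders(pairs)):
    print(f"{word}: {gender} ({freq:.2f} over {total} links)")
```

Individual links may be wrong, but with enough of them the majority gender for a frequently mentioned word tends to dominate, which is why an unreliable resolver can still yield usable statistics over a large corpus.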
