The DATA/ directory is an alternate data directory, with a model trained
from WSJ and NANC data using self-training. WSJ is given a relative weight
of 5 and is combined with approximately 1,750k sentences from NANC
(1,765,736 sentences total). On section 23 of the Penn Treebank, it achieves
an f-score of 92.1% with the reranking parser. For more details, please see:
David McClosky, Eugene Charniak, and Mark Johnson.
Effective Self-Training for Parsing. Proceedings of the Conference
on Human Language Technology and North American chapter of the Association for Computational Linguistics (HLT-NAACL 2006), Brooklyn, New York.
[PS] [PDF] [slides]
Make sure you have a sufficiently recent release of the BLLIP reranking
parser; older releases will not be able to handle the larger vocabulary.
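
As a rough illustration of what a relative weight of 5 for WSJ could mean, the sketch below merges event counts from two corpora and counts each WSJ event five times. This is only one plausible reading of the weighting, not the parser's actual training code; the function name combine_counts and the toy rule counts are hypothetical and made up for illustration.

```python
from collections import Counter


def combine_counts(wsj_counts, nanc_counts, wsj_weight=5):
    """Merge event counts, counting each WSJ event wsj_weight times.

    Hypothetical helper: it assumes weighting is done by multiplying
    counts, which is an assumption, not the parser's documented behavior.
    """
    combined = Counter()
    for event, n in wsj_counts.items():
        combined[event] += wsj_weight * n
    for event, n in nanc_counts.items():
        combined[event] += n
    return combined


# Toy rule counts (invented for illustration, not real corpus statistics).
wsj = Counter({"NP -> DT NN": 3, "VP -> VBD NP": 2})
nanc = Counter({"NP -> DT NN": 10, "NP -> NNP": 4})

combined = combine_counts(wsj, nanc)
print(combined["NP -> DT NN"])  # 3 * 5 + 10 = 25
```

Under this reading, the hand-annotated WSJ data keeps a meaningful influence on the model even though the automatically parsed NANC sentences far outnumber it.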