PaperMine

Description: PageMine is a web service to assist with research. Services such as Google Scholar and Entrez PubMed fall far short of providing a feature set that would expedite and enhance the research process.

Systems model diagram: See attached

Annotation:
Spider - This component searches the internet for papers. It accepts requests from the paper database (i.e., if a citation isn't entered in the database, a request will be issued to the spider).
Parser - Extracts citation information from papers, converts each paper into a standardized format, and strips out common words as a pre-processing step.
Paper Database - Stores each paper in a standardized format for use by the graph handler, cluster'er and crumb generator.
Graph Handler - Maintains the paper graph and performs graph operations on request.
Cluster'er - This component attempts to group papers based on topic and semantic relationships. Given a user query, the cluster'er attempts to refine this grouping to provide maximally relevant information.
Crumbs - Maintains a list of papers that a given user has searched. Automatically generates BibTeX entries upon request.
User Database - Stores entries (preferences, crumb trails, etc.) for each user.
Server - Accepts external requests and delegates them to the relevant components of the back end.
UI - A sleek "web 2.0" interface for interacting with the service.

Non-functional Requirements:
Performance: The server needs to be able to cope with an arbitrarily large paper graph.
Portability: The server will run on a *nix-based operating system. The front end will be provided as a website, allowing for platform independence.
Ease of Use: The user interface is one of the major goals of this project. By offering a slick web interface, PaperMine should be at once easy to use and extremely powerful.

Risks: There is a tremendous external reliance on both the availability and formatting of academic papers. Copyright could potentially strangle this product.
Inconsistencies in formatting could make parsing papers a great challenge.

Brendan Hickey (bhickey)
CS190 Project Requirements
PageMine

1. Title: PageMine

2. Description: PageMine is a web service to assist with research. Services such as Google Scholar and Entrez PubMed fall far short of providing a feature set that would expedite and enhance the research process.

3. Features:
* Enhanced search functionality
* Citation-based graph representation of papers
* Semantic clustering
* Search string clustering
* Automated classification
* Reference generation
* Breadcrumb trail generation and suggestions
* Open access
* Enhanced UI
* Multi-discipline and interdisciplinary

4. Priorities:
* Citation graphing -- If papers cannot be reliably identified and linked, all other functionality /will/ fail.
* Automated classification -- Papers need to be correctly classified by topic. If they aren't, clustering algorithms might inadvertently group papers on Unix with those about penguin habitats. Furthermore, topical classification will allow the system to treat the papers as a set of smaller graphs rather than a single connected graph.
* Clustering & Suggestions -- In order to offer an experience superior to existing services, the system must be able to reliably suggest papers similar to those that the user is reading.
* Reference generation -- Researcher time should not be wasted on this labor-intensive and intellectually fruitless task.
* Enhanced UI -- While providing much information, Entrez PubMed's UI is complex and cumbersome. On the other hand, Google Scholar's UI is spartan and hides useful functionality.

5. Usage: Shriram, while not enamored with the idea, said he would use it if it were available. Other researchers, including myself, would certainly find it to be valuable.

6. Requirements: Must be able to import and graph scientific papers and generate references for papers from any discipline.
Automated classification and clustering must function for at least a single discipline. This way, PaperMine could be usefully deployed on a per-subject basis.

7. Divisibility: The project has several discrete components: paper processing, graphing, clustering algorithms, database work, and UI. These tasks could be further subdivided.

8. Risks: Copyrights and inconsistent formatting might make the acquisition and processing of papers challenging. The system has the potential to generate very large graphs, which, even with the best algorithms, could cause operations to be too slow for use as a web service. Finally, semantic analysis is hard, and clustering could potentially fail to offer useful results.
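
Illustrative sketch: One way the graph handler might keep operations tractable on a very large graph is to restrict traversals to papers sharing a topic label and to cap the number of citation hops. The Python below is a minimal sketch under those assumptions, not a design taken from this document. The class name CitationGraph, its methods, the max_hops parameter, and the paper ids are all hypothetical. The missing_papers method illustrates how citations to papers not yet in the database could be collected for a spider request.

```python
from collections import defaultdict, deque


class CitationGraph:
    """Directed citation graph with a topic label per paper (hypothetical design)."""

    def __init__(self):
        self.cites = defaultdict(set)     # paper id -> ids it cites
        self.cited_by = defaultdict(set)  # paper id -> ids that cite it
        self.topic = {}                   # paper id -> topic label

    def add_paper(self, paper_id, topic):
        """Register a paper that has been parsed and classified."""
        self.topic[paper_id] = topic

    def add_citation(self, src, dst):
        """Record that paper src cites paper dst."""
        self.cites[src].add(dst)
        self.cited_by[dst].add(src)

    def missing_papers(self):
        """Cited ids with no stored paper: candidates for a spider request."""
        return sorted(set(self.cited_by) - set(self.topic))

    def related(self, start, max_hops=2):
        """Same-topic papers within max_hops citation links of start.

        Breadth-first search that follows citations in both directions.
        The hop limit and topic filter bound the work per request.
        """
        topic = self.topic[start]
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            neighbors = self.cites.get(node, set()) | self.cited_by.get(node, set())
            for nxt in neighbors:
                if nxt not in seen and self.topic.get(nxt) == topic:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
        seen.discard(start)
        return sorted(seen)
```

With papers "unix1", "unix2" and "unix3" labeled "os", "penguin1" labeled "biology", and citations unix1 -> unix2, unix2 -> unix3 and unix1 -> penguin1, calling related("unix1") returns only the two other "os" papers; the penguin paper is skipped because its topic differs. Whether a hop limit gives acceptable suggestions is an open question that the clustering work would need to answer.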