PaperMine

Description:
PaperMine is a web service to assist with research. Services such as Google
Scholar and Entrez PubMed fall far short of providing a feature set that would
expedite and enhance the research process.

Systems model diagram:
    See attached

Annotation:

    Spider - This component searches the internet for papers. It accepts
    requests from the paper database (i.e., if a citation isn't entered in the
    database, a request will be issued to the spider).
    Parser - Extracts citation information from papers, converts each paper
    into a standardized format, and strips out common words as a
    pre-processing step.
    Paper Database - Stores each paper in a standardized format for use by the
    graph handler, cluster'er and crumb generator.
    Graph Handler - Maintains the paper graph and performs graph operations on
    request.
    Cluster'er - This component attempts to group papers based on topic and
    semantic relationships. Given a user query, the cluster'er attempts to
    refine this grouping to provide maximally relevant information.
    Crumbs - Maintains a list of papers that a given user has searched.
    Automatically generates BibTeX entries upon request.
    User Database - Stores entries (preferences, crumb trails, etc.) for each
    user.

    Server - Accepts external requests and delegates them to the relevant
    components of the back end.

    UI - A sleek "web 2.0" interface for interacting with the service.
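
    The Parser's pre-processing step could look roughly like the sketch below.
    It is only an illustration: the stop-word list and the function name
    strip_common_words are assumptions, not part of the design above.

```python
import re

# Hypothetical stop-word list; a real deployment would need a much larger one.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for", "with"}
)


def strip_common_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


print(strip_common_words("The habitat of the penguin in Antarctica"))
# prints ['habitat', 'penguin', 'antarctica']
```

    The remaining tokens are what the cluster'er would plausibly consume when
    comparing papers.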

Non-functional Requirements:
    
    Performance: The server must be able to cope with an arbitrarily large
    paper graph.

    Portability: The server will run on a *nix-based operating system. The
    front end will be provided as a website, allowing for platform
    independence.

    Ease of Use: The user interface is one of the major goals of this project.
    By offering a slick web interface, PaperMine should be at once easy to use
    and extremely powerful.

    Risks: The project relies heavily on both the availability and the
    formatting of academic papers. Copyright could potentially strangle this
    product. Inconsistencies in formatting could make parsing papers a great
    challenge.

Brendan Hickey (bhickey)
CS190 Project Requirements

PageMine

1. Title: PageMine
2. Description:

PageMine is a web service to assist with research. Services such as Google
Scholar and Entrez PubMed fall far short of providing a feature set that would
expedite and enhance the research process.

3. Features:

    * Enhanced search functionality
    * Citation-based graph representation of papers
    * Semantic clustering
    * Search string clustering
    * Automated classification
    * Reference generation
    * Breadcrumb trail generation and suggestions
    * Open access
    * Enhanced UI
    * Multi-discipline and interdisciplinary
    
4. Priorities:

    * Citation graphing -- If papers cannot be reliably identified and linked,
      all other functionality /will/ fail.
    * Automated classification -- Papers need to be correctly classified by
      topic. If they aren't, clustering algorithms might inadvertently group
      papers on Unix with those about penguin habitats. Furthermore, topical
      classification will allow the system to treat the papers as a set of
      smaller graphs rather than a single connected graph.
    * Clustering & Suggestions -- In order to offer an experience superior to
      existing services, the system must be able to reliably suggest papers
      similar to those that the user is reading.
    * Reference generation -- Researcher time should not be wasted on this
      labor-intensive and intellectually fruitless task.
    * Enhanced UI -- While providing much information, Entrez PubMed's UI is
      complex and cumbersome. On the other hand, Google Scholar's UI is
      spartan and hides useful functionality.
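
    As a rough illustration of the first two priorities, the sketch below
    stores citations as a directed graph and pulls out the smaller per-topic
    graph mentioned above. The class name PaperGraph, its methods, the paper
    ids, and the topic labels are all hypothetical; how topics actually get
    assigned is left to the classifier.

```python
from collections import defaultdict


class PaperGraph:
    """Directed citation graph: an edge A -> B means paper A cites paper B."""

    def __init__(self):
        self.cites = defaultdict(set)  # paper id -> ids of papers it cites
        self.topic = {}                # paper id -> topic label

    def add_paper(self, paper_id, topic):
        self.topic[paper_id] = topic
        self.cites[paper_id]  # touch the key so uncited papers are kept

    def add_citation(self, citing, cited):
        self.cites[citing].add(cited)

    def topic_subgraph(self, topic):
        """Return the papers and citation edges that stay inside one topic."""
        members = {p for p, t in self.topic.items() if t == topic}
        return {p: sorted(self.cites[p] & members) for p in sorted(members)}


# Made-up paper ids, purely for illustration.
g = PaperGraph()
g.add_paper("unix-fs", "operating systems")
g.add_paper("unix-sched", "operating systems")
g.add_paper("penguin-habitat", "ecology")
g.add_citation("unix-sched", "unix-fs")
g.add_citation("penguin-habitat", "unix-fs")  # cross-topic edge

print(g.topic_subgraph("operating systems"))
print(g.topic_subgraph("ecology"))
```

    Edges that cross topics are dropped from the subgraph, so graph operations
    for one discipline never have to touch papers from another.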

5. Usage:
    Shriram, while not enamored with the idea, said he would use it if it were
    available. Other researchers, including myself, would certainly find it to
    be valuable.

6. Requirements:
    Must be able to import and graph scientific papers and generate references
    for papers from any discipline.
    Automated classification and clustering must function for at least a single
    discipline. This way, PaperMine could be usefully deployed on a
    per-subject basis.
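
    Reference generation could be as simple as the sketch below, which renders
    one BibTeX entry from already-parsed citation fields. The function name
    to_bibtex and the sample entry are placeholders, and the sketch assumes
    the field values need no escaping.

```python
def to_bibtex(key, fields, entry_type="article"):
    """Render one BibTeX entry from a citation key and a dict of fields."""
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields.items():
        # Assumption: values contain no unbalanced braces or special characters.
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Placeholder data, not a real paper.
entry = to_bibtex(
    "doe2000example",
    {"author": "Doe, Jane", "title": "An Example Paper", "year": "2000"},
)
print(entry)
```

    Because the fields come from the standardized format in the paper
    database, the same routine should work for any discipline.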

7. Divisibility:
    The project has several discrete components: paper processing, graphing,
    clustering algorithms, database work, and UI. These tasks could be further
    subdivided.

8. Risks:
    Copyrights and inconsistent formatting might make the acquisition and
    processing of papers challenging. The system has the potential to generate
    very large graphs, which, even with the best algorithms, could make
    operations too slow for use as a web service. Finally, semantic analysis
    is hard, and clustering could fail to offer useful results.
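
    One possible way to limit the graph-size risk is to cap how far any single
    request may walk the citation graph. The sketch below is a breadth-first
    traversal bounded by depth and by the number of papers visited. The name
    nearby_papers and both limits are assumptions, not measured values.

```python
from collections import deque


def nearby_papers(cites, start, max_depth=2, max_nodes=1000):
    """Breadth-first walk over citations, capped by depth and visited count.

    `cites` maps a paper id to the ids of the papers it cites.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand papers at the depth limit
        for cited in cites.get(paper, ()):
            if cited not in seen:
                if len(seen) >= max_nodes:
                    return seen  # visited budget exhausted
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen


# Made-up chain of citations: a cites b, b cites c, c cites d.
chain = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(sorted(nearby_papers(chain, "a", max_depth=2)))
# prints ['a', 'b', 'c']
```

    With both caps in place, the work per request stays bounded no matter how
    large the full paper graph grows, at the cost of possibly missing distant
    but relevant papers.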