Syllabus

Course Details:

CS196-9 Document Engineering

Instructor: David Durand

Meeting Time: G Hour (M, W, F 2:00-2:50 PM)

Prerequisites: CS 16 or 18, 31, 51.

Overview:

Document Engineering is a new term for an area of computing that is receiving increasing attention in Computer Science. Many estimates calculate that 80-90% of all information stored and managed in computers is "loosely structured" or "unstructured" data. But these terms really refer to documents: words, written by people, for other people to read. Making computers really useful in handling documents is difficult because the utility of documents lies in how they are used and manipulated by people; where enforcing a regularity for a machine violates human expectations or time constraints, the machine must yield to the human.

This course will present a thorough survey of the theory and practice of document engineering. We will delve into semi-structured databases, markup languages (XML, XSLT, XML schema languages), hypertext systems, document data management, and the web. Readings will be drawn from the research literature and current industrial standards. Programming assignments will build to a final publication project using real data.

Like many other areas of computing, the problems of document processing are multiple. Unlike in most other areas, with the exception of databases, the specific problems of particular data sets are a critical factor. Even compared to databases, documents are more variable, much harder to simplify, and less well understood. For this reason, we will be examining and revising our fundamental data models throughout the course. Once we have reached the most ambitious models, we will have reached the state of the art, but not necessarily a perfectly satisfactory final model.

Throughout the course, we will consider content markup, primarily XML, first as a practical tool, then as a model for what documents are, and finally as a starting point for developing more powerful models.

This course is suitable for any undergraduate or graduate student who is interested in document management or text processing. Document engineering is an interesting area because it is a place where traditionally humanistic issues of the meaning, form and interpretation of documents intersect with and have real impact on the practice of computing in an intellectually and commercially important application domain. While there is significant technical content, this area requires a broader perspective than many purely technical areas of computer science. For undergraduates, it should be a fun way to get exposure to research and the research literature in a growing area, where many of the papers are relatively accessible without years of specialized background.

Goals:

To develop an understanding of the unique problems of document management and document processing, and the engineering practices needed to cope with them.

To read enough of the literature on document engineering to have an overview of current problems, techniques and interfaces, as well as an understanding of the history of the field.

To learn some of the most interesting and important algorithms and technical results.

To learn how to analyze documents.

To learn some of the basic processing models for document transformation: functional, data-driven, template-driven, and imperative.

To have a sense of the full range of problems involved, including indexing, information retrieval, and multilingual text processing.
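As an illustration of one of the processing models named above, here is a minimal sketch of a data-driven transformation in Python. It is not taken from the course materials: the sample document, the element names, and the RULES table are all hypothetical, and a real project would more likely use XSLT for this kind of work.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = "<article><title>Markup</title><para>Tags <em>describe</em> text.</para></article>"

# Data-driven model: the output is determined by a table of per-element rules,
# not by hand-written control flow for each kind of element.
RULES = {
    "article": ("<div>", "</div>"),
    "title": ("<h1>", "</h1>"),
    "para": ("<p>", "</p>"),
    "em": ("<i>", "</i>"),
}

def transform(node):
    """Render a node by looking up its rule and processing children in document order."""
    open_tag, close_tag = RULES.get(node.tag, ("", ""))  # unknown elements are unwrapped
    parts = [open_tag, escape(node.text or "")]
    for child in node:
        parts.append(transform(child))
        parts.append(escape(child.tail or ""))  # text that follows the child element
    parts.append(close_tag)
    return "".join(parts)

html_out = transform(ET.fromstring(SAMPLE))
print(html_out)
```

An imperative version of the same transformation would instead walk the document with explicit branches for each element type; keeping the rules in a table makes it easier to revise the mapping as the document model changes.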

Outcomes:

At the end of the course, students should have developed a number of specific skills and areas of knowledge that further the aims of the course:

· Document analysis, and the creation of document schemas in DTDs, RELAX NG and XML Schema.

· Processing models for documents, and the ability to implement them in (at least) a general programming language, XSLT, and JSP.

· Web publishing and server-side technologies; basic client-side technologies.

· Knowledge of basic algorithms in text processing: line breaking and justification, IR, text indexing, XPath pattern matching.

· Basic knowledge of web-related Internet protocols and standards; extensive knowledge of how these protocols and standards are layered and integrated.
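To give a flavor of two of the text-processing algorithms listed above, the following Python sketch shows a first-fit (greedy) line breaker and a tiny inverted index. Both are deliberately simple stand-ins chosen for illustration, not necessarily the algorithms covered in lectures; the function names and sample inputs are assumptions.

```python
from collections import defaultdict

def greedy_break(text, width):
    """First-fit line breaking: add each word to the current line if it still fits."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= width or not current:
            # The word fits, or the line is empty (an overlong word gets its own line).
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

def build_index(docs):
    """Inverted index: map each lowercased word to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

lines = greedy_break("the quick brown fox jumps", 10)
index = build_index({1: "XML is text", 2: "text is data"})
print(lines)
print(index["text"])
```

The greedy breaker never revisits an earlier line, so it can leave very uneven line lengths; approaches that optimize over a whole paragraph trade that simplicity for better-looking output.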

Course Mechanics:

We will read 1-3 papers or web pages a week (depending on length and difficulty). Estimated reading time: 3-4 hours.

For the first 6 weeks, we will work on smaller assignments. There will be 4 small programming or data manipulation assignments, each allotted 1 week for completion (each should take 3-4 hours total), and one longer assignment (8 hours).

Final projects will be completed in the remainder of the semester.

Every 2 weeks, each student will be required to prepare a review (2-3 paragraphs, no more than one page) of one of the papers. These reviews should include a reaction to, and an evaluation of, the work presented, and should not summarize the work (estimated time: 1-2 hours). These will be submitted and returned via email, and circulated to the class.

Credit will be apportioned as follows: 40% final project, 35% assignments, 25% class participation and reviews.
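As a worked example of this weighting, the sketch below combines three component scores into a course score. The component scores are made up for illustration, and the assumption that each component is scored on a 0-100 scale is not stated in the syllabus.

```python
# Weights in percent, taken from the credit breakdown above.
WEIGHTS = {"final_project": 40, "assignments": 35, "participation_and_reviews": 25}

def course_score(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100

# Hypothetical student: (40*90 + 35*80 + 25*100) / 100
example = {"final_project": 90, "assignments": 80, "participation_and_reviews": 100}
print(course_score(example))
```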

Assignments:

1. (1 wk): tag a document

2. (1 wk): create a DTD for the previous assignment (+ validate the data)

3. (1 wk): Revise that DTD based on more complete information, use XSLT to publish it for the web

4. (1 wk): Simple XSL-FO publishing

5. (2 wk): Algorithm assignment: simple XPath, line-breaking or text indexing (TBD)

6. (6 wks): final project

This will be broken into stages:

· Proposal based on sample data (1.5 wks)

· Prototyping (2.5 wks)

· Final delivery from final data (2 wks)

Readings:

Article readings are available online.

We will use XML in a Nutshell as an introduction and backup to the practical content of the lectures. It is strongly recommended, but not mandatory.

The XML Pocket Reference is useful for looking up details quickly: handy, small, and inexpensive.

Topical overview: