Next: Filters
Up: Techniques for Extracting Data
Previous: Techniques for Extracting Data
- Issues
- refer to the section on Data Models and Schema Integration
- Mediator systems usually require more complex wrappers than do most
warehouse systems
- Ways of creating wrappers
- Manual
Why is it impractical for some sources?
In the case of Web sources:
- large number of sources
- new sources are added frequently
- formats of sources change
The result is high maintenance costs.
- Semi-automatic (interactive)
Observation: only a small part of the wrapper code deals with the
source-specific access details. The rest is common among
wrappers, or the data transformation can be expressed in a
declarative (high-level) fashion. Typical tool support: graphical
interfaces, programming by demonstration.
- Automatic
- site-specific or generic
- usually needs training, often supervised learning
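The semi-automatic idea above (a small source-specific part, a large shared part) can be sketched as follows. This is only an illustration under assumptions: the spec layout, the names BOOK_SOURCE_SPEC and run_wrapper, and the sample page are all hypothetical and do not come from any particular wrapper toolkit.

```python
import re

# Hypothetical declarative description of one source. Only this part is
# source-specific; each field regex has exactly one capture group.
BOOK_SOURCE_SPEC = {
    "record": r"<li>(.*?)</li>",
    "fields": {
        "title": r"<b>(.*?)</b>",
        "price": r"\$(\d+\.\d{2})",
    },
}


def run_wrapper(spec, page):
    """Generic engine shared by all wrappers: apply a spec to a page."""
    records = []
    for chunk in re.findall(spec["record"], page, flags=re.S):
        row = {}
        for name, pattern in spec["fields"].items():
            match = re.search(pattern, chunk, flags=re.S)
            # A field that is absent from a record is reported as None.
            row[name] = match.group(1).strip() if match else None
        records.append(row)
    return records


# Made-up sample page, used only to exercise the sketch.
SAMPLE_PAGE = "<ul><li><b>Dune</b> $9.99</li><li><b>Emma</b> $4.50</li></ul>"

if __name__ == "__main__":
    for record in run_wrapper(BOOK_SOURCE_SPEC, SAMPLE_PAGE):
        print(record)
```

Supporting a new source, or a changed page format, then means editing only the spec, which is the kind of task a graphical or by-demonstration tool could generate.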
- Tools for semi-automatic/automatic wrapper construction for
structured/semistructured data
- Template-based wrappers
- Inductive learning techniques for automatically
learning a wrapper (using labeled data)
Inductive learning: the task of computing
a generalization from a set of examples
Methods:
- zero-order (decision tree learners)
- first-order (inductive logic programming)
- bottom-up/top-down approaches
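As a toy illustration of learning a wrapper from labeled data, the sketch below generalizes bottom-up from examples: it keeps the longest text that always precedes a labeled value and the longest text that always follows it, and uses those as delimiters. It is a deliberately simplified stand-in, not one of the zero-order or first-order learners listed above; all function names and the example pages are assumptions.

```python
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def learn_delimiters(examples):
    """Learn (left, right) delimiters from (page, labeled_value) pairs."""
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)  # first occurrence of the labeled value
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    left = common_suffix(befores)
    right = os.path.commonprefix(afters)
    if not left or not right:
        raise ValueError("examples share no usable delimiter")
    return left, right


def extract(page, left, right):
    """Apply the learned wrapper to a new page."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page, flags=re.S)


# Made-up labeled training pages: (page text, value the user marked).
TRAINING = [
    ("<p>Name: <b>Ann</b></p>", "Ann"),
    ("<div>Name: <b>Bob</b></div>", "Bob"),
]

if __name__ == "__main__":
    left, right = learn_delimiters(TRAINING)
    print(repr(left), repr(right))
    print(extract("<td>Name: <b>Cy</b></td>", left, right))
```

Each additional labeled example can only shorten the delimiters, so the learned wrapper becomes more general as examples are added.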
- Tools for data extraction from unstructured documents
- Using ontologies and conceptual models to extract
and structure information from data-rich, unstructured
documents.
- Using heuristic approaches to find record boundaries
in web documents.
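A minimal sketch of one possible record-boundary heuristic: assume that the tag occurring most often in the page is the record separator, and split the page on it. The highest-count rule is chosen here only for illustration; practical systems typically combine several heuristics. Function names and the sample page are hypothetical.

```python
import re
from collections import Counter


def guess_separator(html):
    """Return the most frequent opening tag name, or None if no tags."""
    tags = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>", html)
    counts = Counter(tag.lower() for tag in tags)
    return counts.most_common(1)[0][0] if counts else None


def split_records(html):
    """Split a page into text chunks at the guessed separator tag."""
    tag = guess_separator(html)
    if tag is None:
        parts = [html]
    else:
        parts = re.split(r"<%s\b[^>]*>" % tag, html, flags=re.I)
    chunks = []
    for part in parts:
        # Drop remaining markup and normalize whitespace.
        text = " ".join(re.sub(r"<[^>]+>", " ", part).split())
        if text:
            chunks.append(text)
    return chunks


# Made-up page of classified ads separated by <hr> tags.
SAMPLE_ADS = (
    "<body><h1>Ads</h1><hr>Car 1995 $3000"
    "<hr>Bike <b>new</b> $200<hr>Boat $9000</body>"
)

if __name__ == "__main__":
    print(guess_separator(SAMPLE_ADS))
    for chunk in split_records(SAMPLE_ADS):
        print(chunk)
```

Note that the text before the first separator (here the page heading) also comes back as a chunk; a real extractor would need a further step to discard such non-record text.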
Emine N. Tatbul
2001-03-19