OUTLINE

1. Introduction
  1.1 Overview
  1.2 Problem statement
  1.3 Chapter outline
2. Data collection
  2.1 Introduction
  2.2 Data Acquisition
  2.3 Data Representation
  2.4 Case study - Amazon.com
3. Data processing
  3.1 Introduction
  3.2 KDD for personalized content delivery
  3.3 Profile generation and information filtering
4. Data presentation
  4.1 Means of delivery
  4.2 Who is looking?
  4.3 PrizeChoice Example
5. Research / Challenges
  5.1 Scalability
  5.2 Data Integration
  5.3 System Targeting
  5.4 Data Collection Methods
  5.5 Security/Privacy Concerns
6. Summary  

1. Introduction
1.1 Overview
  • What is Customization? (brief subsection)
    • Delivery of content that is personalized in one way or another
    • Web as the medium - Outline of Customization systems on the Internet
      • Enterprise Portals - what are these creatures?
      • Are there better ways of delivering custom content?
  • Customization as newest trend on the web
      • Ultimate goal: increase customer retention and, consequently, customer utility
      • Ultimate tool: personalized content delivery.
      • Overview of examples
        • Simple solutions such as enterprise portals (e.g. MyYahoo)
        • Sophisticated Recommender and data mining systems (e.g. DoubleClick targeted ads, MyCDNow)
        • PrizeChoice - intro to our running example

  • Goals of this chapter: We state that we'll do the following: define the customization problem, explain what makes it a data management problem, and propose a unified solution.
1.2 Problem statement
  • Present customization as an information filtering problem. Given a content database and search criteria, perform a query on the database to find content that satisfies the search criteria.

  • Introduce the notion of a User Profile
    • Profile - information filtering criteria. We state that the criteria are user-specific. We also outline what can qualify as filtering criteria (e.g. user age, implicitly inferred user personality type, the user's preferred news category)
    • Profile - compact representation of "knowledge" that is user specific
      • "Knowledge" is explicitly provided (e.g. MyYahoo)
      • "Knowledge" is implicitly inferred through the process of Knowledge Discovery in Databases (KDD)
        • Concise overview of KDD pipeline

  • Introduce Profile-Driven Customization (PDC). PDC as a generic solution to the problem of personalized content delivery on the web. We state that any web customization pipeline can be viewed as user profile-oriented.
    • Diagram of PDC
    • PDC Stages (Assembly of User Profile, Content filtering based on profile parameters)
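
A minimal sketch of the two PDC stages named above, in Python. It is an illustration under stated assumptions, not the chapter's implementation: the `UserProfile` fields, the `assemble_profile` and `filter_content` helpers, and the sample content records are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserProfile:
    """Hypothetical profile: a compact set of user-specific filtering criteria."""
    user_id: str
    age: Optional[int] = None                      # explicitly provided
    preferred_categories: set = field(default_factory=set)


def assemble_profile(user_id, explicit, inferred):
    """Stage 1: merge explicit answers with implicitly inferred knowledge."""
    categories = set(explicit.get("categories", [])) | set(inferred.get("categories", []))
    return UserProfile(user_id, explicit.get("age"), categories)


def filter_content(profile, content_db):
    """Stage 2: keep only the content items that match the profile."""
    return [item for item in content_db
            if item["category"] in profile.preferred_categories]


# Hypothetical content database.
content_db = [
    {"title": "New chip announced", "category": "technology"},
    {"title": "Cup final tonight", "category": "sports"},
    {"title": "Markets close higher", "category": "finance"},
]

profile = assemble_profile(
    "u1",
    explicit={"age": 30, "categories": ["technology"]},
    inferred={"categories": ["sports"]},   # e.g. the output of a KDD step
)
selected = filter_content(profile, content_db)
print([item["title"] for item in selected])
```

The point of the split is that stage 2 never needs to know whether a criterion was stated by the user or inferred; it only sees the profile.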
1.3 Chapter Outline
  • Data Collection
  • Data Processing (KDD)
  • Data Presentation
 
2. Data collection

2.1 Introduction

  • A short paragraph on how data gathered from the user forms the foundation of the customization process. The focus of this section is to explore the types of data that will allow for more productive information extraction that can then be applied to customization. 

  • The data that is gathered from the user allows us to build an information filter - the user profile. At the highest level, we can separate data into two categories: explicit and implicit.
    • Explicit: custom content is generated based on what the user has explicitly requested. The user knows that s/he has provided this information. 
      • Enterprise portals. excite.com, yahoo.com are two portals that offer customizable layout and content. Corporate intranets may also offer similar services (TODO: find an intranet example). Show screenshots.
    • Implicit: custom content is presented to the user based on knowledge that the customization engine has inferred. Often, the user does not know what data has been collected to determine the content.
      • PrizeChoice. Ads and next set of prizes are generated based on previous choices. The user is not aware of this.
  • Re-introduce the concept of a 'profile' (as defined in the introduction). Explain how profiles are constructed out of data gathered from the user and are used to verify and match knowledge gathered from the customization engine.
    • The data you want to gather really depends on what you want to mine, and what you want to filter. Emphasize that profiles will most likely be site-specific, which is why we rely so much on examples in this section.
    • Use our running example, PrizeChoice, to introduce a concrete example of what a profile looks like. Show screenshots of the DB. Introduce the types of data that are gathered to produce this profile.
    • (TODO: Can we get another example of a profile? Something from an enterprise portal would be nice. Check cookies on HD.)
    • (Can we use the Pub/Sub group for another notion of a profile?)
    • User profiles need not be categorized on a per user basis. They can also extend across genres and demographics (for example, weather will be filtered by geographic region rather than on an individual basis. In this case, we could use zip codes/IP addresses as simple profiles [views?] to filter content).

  • Give a brief section outline:
    • What data should you collect?
    • How do you store this data? (e.g. how does clickstream data get from a browser into the DB? What format is it stored in?)

2.2 Data Acquisition

  • Discussion on how you can collect information about anything that the user does on the screen - from mouse movements and keystrokes to browsing patterns and purchase history. There is an almost infinite amount of data, ranging across many different types and granularities.

  • Collecting everything is a bad idea.
    • Space/scalability and processing time problems. 
    • Usability problems. Trying to gather too much information may result in too complex an interface for the user, giving you inaccurate data.
      • PrizeChoice: initially had the user rate the prize they wanted. Supposedly this gave us more information, because it not only told us which prize the user preferred, but the degree to which they preferred it over the others. However, users did not pay any attention to the ratings - so they became meaningless. Eventually we removed the rating system and just had users choose the prize they wanted. Show screenshots of both prize screens.

  • Data flow. How does this information get from the user into the DB? 
    • PrizeChoice: choices are recorded by making an entry of the four prizes offered, with another entry storing the choice.
    • MyYahoo: user preferences are stored in a cookie (TODO: find this on my hard drive and use as example). (http://privacy.yahoo.com/privacy/us/my/)
    • DoubleClick: ads served by DoubleClick that are clicked on go through a central server that records some information before forwarding the user onto the appropriate page. (http://www.doubleclick.net:80/us/corporate/privacy/)
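
One way the PrizeChoice data flow described above could look in a relational store is sketched below with Python's built-in `sqlite3` module. The table names, column names and the `record_round` helper are assumptions made for illustration; the outline does not specify the actual PrizeChoice schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offer (              -- one row per prize shown in a round
        round_id INTEGER NOT NULL,
        prize    TEXT    NOT NULL
    );
    CREATE TABLE choice (             -- one row per round: the prize picked
        round_id INTEGER PRIMARY KEY,
        user_id  TEXT NOT NULL,
        prize    TEXT NOT NULL
    );
""")


def record_round(conn, round_id, user_id, offered, chosen):
    """Store the four offered prizes and the user's choice for one round."""
    if len(offered) != 4 or chosen not in offered:
        raise ValueError("expected four offered prizes, one of them chosen")
    with conn:  # commit both entries together, or neither
        conn.executemany(
            "INSERT INTO offer (round_id, prize) VALUES (?, ?)",
            [(round_id, p) for p in offered],
        )
        conn.execute(
            "INSERT INTO choice (round_id, user_id, prize) VALUES (?, ?, ?)",
            (round_id, user_id, chosen),
        )


record_round(conn, 1, "u1", ["camera", "bike", "watch", "tent"], "bike")

# The prizes that were offered but not chosen are informative as well.
rejected = [row[0] for row in conn.execute(
    "SELECT o.prize FROM offer o JOIN choice c ON o.round_id = c.round_id "
    "WHERE o.prize <> c.prize ORDER BY o.prize"
)]
print(rejected)
```

Keeping the offered set alongside the choice preserves the relational nature of the data: each round records what was preferred over what.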

2.3 Data Representation

  • Questions to deal with in this section:
    • How is your representation of data affected by the type of knowledge discovery you want to do?
      • What types of data work well with what kinds of models? 
        • Choice data. This is where the user chooses one or more things over one or more other things (e.g. choosing apples over oranges or cats over dogs). Good for finding association rules.
        • Purchase history. Also good for clustering like users and finding association rules. This is a good source as it is concrete information. Also interesting to note is delivery address (if any), payment method (if any), and purchase type (e.g. gift, personal etc).
        • Demographic information.
        • Pageviews/clickstreams (browsing history). Good for tracking which links the user likes - perhaps to augment or update the user's profile. 
          • Suppose the user explicitly states on a site's registration page that s/he does not like sports. If her browsing history indicates that she consistently visits espn.com or checks sports stats, then her behaviour is revealing hidden information.
          • Putting something in your shopping cart and then removing it, clicking on links by a certain author/artist, and keywords used in searches may give you clues as to what the user may or may not like.
    • Can you keep your representation as general as possible? If not, how can you restructure your data (if at all)?
    • How do you represent your profile? Is it necessary to have one profile per user, or can you do it by category, demographic etc.? Can you dynamically generate these?
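
The registration-versus-browsing example above can be sketched as a small check that compares stated dislikes against clickstream data. This is only an illustration: the `(url, category)` pageview format, the `min_share` threshold and both helper names are hypothetical, and how pages are mapped to categories is left open.

```python
from collections import Counter


def infer_interests(pageviews, min_share=0.3):
    """Return categories that account for at least min_share of the pageviews."""
    counts = Counter(category for _url, category in pageviews)
    total = sum(counts.values())
    return {cat for cat, n in counts.items() if total and n / total >= min_share}


def reconcile(stated_dislikes, pageviews):
    """Find categories the user said they dislike but visits often anyway."""
    return stated_dislikes & infer_interests(pageviews)


# Hypothetical clickstream: each pageview is already tagged with a category.
pageviews = [
    ("espn.com/scores", "sports"),
    ("espn.com/nba", "sports"),
    ("example.com/tech/1", "technology"),
    ("espn.com/stats", "sports"),
]
hidden = reconcile({"sports", "finance"}, pageviews)
print(hidden)
```

Categories returned by `reconcile` are candidates for updating the profile, since the implicit behaviour contradicts the explicit statement.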

2.4 Case study - Amazon.com

  • Profiles and the types of data you gather tend to differ from site to site. This section is here to give another concrete example.

  • Show the types of information that amazon.com collects, and the way they use it. This will mainly be based off their privacy policy, which is pretty superficial in terms of details. amazon.com makes a good example, however, because they collect both explicit and implicit user data. 

  • The main point to note is that they have identified the set of data that makes the best recommendations for users. (TODO: see if I can find examples of cookies on my HD to use)

  • The hunt for a patent or white paper is in progress.....
Related readings:
+ Learning about the user: A General Approach and its Applications
+ Analyzing Web Site Traffic
+ Integrating Web Usage and Content Mining for More Effective Personalization
+ Automatic Personalization Based on Web Usage Mining
+ Privacy Interfaces for Information Management
 
3. Data processing

3.1 Introduction

Reiterate problem statement. Here we look back at the Profile-Driven Customization (PDC) pipeline for personalized content delivery and see where data processing fits in.

  • Present the process (KDD) responsible for extracting hidden information patterns out of the collected data.
  • Outline history of KDD.
    • Overview of conventional applications of statistical analysis (e.g. credit card and insurance industries, quality control in manufacturing)
    • Statistical analysis applied to new information medium - the Web.

We also state that an explanation of how the 'knowledge' is used (content filtering) is to follow.

 

3.2 KDD for personalized content delivery

3.2.1 Overview

Here we give a brief outline of the section, explaining exactly what each subsection will be doing and how we will be using PrizeChoice as a running example. At each subsection, we will illustrate the concepts and issues discussed by using PrizeChoice as an example.

3.2.2 PrizeChoice

Basic Idea: Why take PrizeChoice as an example dataset? PrizeChoice is a system that collects relational item preference data. The beauty of such data is its generic nature. Pretty much everything on the web is a choice. We can categorize and classify series of choices (e.g. prize choices, web site choices; a click stream can be viewed as a sequence of choices, etc.). The logic and methodology behind the data analysis that we subject PrizeChoice data to can be easily reapplied and extended to the analysis of much broader data.

3.2.3 Pre-Data Mining steps

  • Why bother? Here we outline wrong approaches to mining data.
    • Example: let's flip a coin and do some data mining!
    • Spurious patterns - what are they?

  • Data selection and Pre-processing
    • Issues: mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.
    • Table Normalization and Pivoting

  • Transformation
    • OLAP and Data Mining. OLAP provides basis for conceptual and descriptive modeling. Here we talk of facts and dimensions, aggregations.
    • Mining provides explanatory modeling
      • Mining can build on the OLAP cubes and visualization tools
      • "bundling" with OLAP considered to be ideal combination
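
The pre-processing issues listed above (a single naming convention, uniform handling of missing data, pivoting) can be sketched in plain Python. The sample rows, the field names and the drop-missing policy are assumptions chosen for illustration; a real pipeline might impute missing values instead.

```python
from collections import defaultdict

# Raw choice log with inconsistent naming and a missing value (hypothetical).
raw = [
    {"user": "u1", "prize": "Digital Camera ", "chosen": 1},
    {"user": "u1", "prize": "mountain bike", "chosen": 0},
    {"user": "u2", "prize": "digital camera", "chosen": None},  # missing
    {"user": "u2", "prize": "Mountain Bike", "chosen": 1},
]


def clean(rows):
    """Map prize names to one convention and drop rows with missing values."""
    for row in rows:
        if row["chosen"] is None:
            continue  # one simple policy; imputation is another option
        yield {**row, "prize": row["prize"].strip().lower()}


def pivot(rows):
    """Turn long (user, prize, chosen) rows into one wide row per user."""
    table = defaultdict(dict)
    for row in rows:
        table[row["user"]][row["prize"]] = row["chosen"]
    return dict(table)


wide = pivot(clean(raw))
print(wide)
```

The wide, one-row-per-user form is the shape most mining algorithms expect as input.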

3.2.4 Data Mining

  • The Core Problem.
    • Find statistical model that describes the data well.
    • Attempt to generate probability distribution function (find relationships among variables of interest)

  • Typical Data Mining Tasks. Here we define major goals that data mining might accomplish (e.g. classification determines what predefined class the data belongs to, where class can be a value of a variable (see diagram); clustering solves a problem of partitioning data into mutually exclusive subsets according to some metric)
    • Classification. We define classification and describe typical classification approaches.

      • Decision rule + density estimation
      • Assign to a class based on proximity (e.g. K-nearest neighbor that is widely used by recommender systems such as MovieLens)
        • Role of a good metric
      • Ways to define decision regions. Goal: project attribute space into mutually exclusive classes.
        • Decision trees
        • Neural nets
    • Clustering. Identify a finite set of categories to describe the data.
      • Metric-based
        • Numerical data (continuous). Easy case
        • Categorical data. Hard, since metric notions have to be defined (e.g. in PrizeChoice, categorizing prizes in a large DB is a difficult task)
      • Model-based
      • Partition based
    • Pattern discovery (e.g. finding a strong association rule). Example: assuming binary transaction data (consume or not consume), find clusters of consumers that satisfy a pattern: 7 out of 10 products are bought and there are at least x consumers in the cluster.
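
Proximity-based classification, mentioned above, can be sketched as a small k-nearest-neighbour classifier. The feature vectors, class labels and the Euclidean metric are hypothetical choices for illustration; this is not how MovieLens or any particular recommender is implemented.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Distance metric; the choice of metric largely determines the result."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, labelled, k=3, metric=euclidean):
    """Assign the majority class among the k labelled points nearest to query."""
    nearest = sorted(labelled, key=lambda item: metric(query, item[0]))[:k]
    votes = Counter(label for _point, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical users described by (age, purchases per month).
labelled = [
    ((22, 1), "casual"),
    ((25, 2), "casual"),
    ((24, 1), "casual"),
    ((41, 9), "frequent"),
    ((38, 8), "frequent"),
    ((45, 10), "frequent"),
]
print(knn_classify((40, 7), labelled))
```

Because `metric` is a parameter, the same code can be reused with a distance defined for categorical data, which is where most of the modelling effort goes.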

  • Examples of scalable algorithms (high level pseudo code will be provided in the appendix)
    • Decision trees.
      • Construction: iterative partitioning.
      • Allows a class to be assigned to given input data
    • Association rules. Example: given the items found in a consumer's basket, offer complementary items.
      • Definition: Itemset1 => Itemset2
      • What is a confidence threshold?
      • Major Issue: search for frequently occurring itemsets
    • Clustering - optimization problem (typically NP-hard). Given a clustering criterion, find the optimal data partition.
      • Criterion functions used (e.g. Min variance, Sum-of-Squared Error)
      • Clustering Methods: iterative, hierarchical, region growing, graph based.
    • Density estimation. Goal: estimate the joint Probability Distribution Function from the given data. Example: Bayesian Network.
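
Support and confidence for a rule Itemset1 => Itemset2 can be computed directly from transaction data, as in the sketch below. The baskets are made up, and the pair search is deliberately brute force; scalable algorithms such as Apriori prune the candidate itemsets instead of enumerating all of them.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"camera", "memory card", "tripod"},
    {"camera", "memory card"},
    {"camera", "battery"},
    {"memory card", "battery"},
    {"camera", "memory card", "battery"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs).

    Assumes lhs occurs in at least one basket.
    """
    return support(lhs | rhs, baskets) / support(lhs, baskets)


def frequent_pairs(baskets, min_support):
    """Brute-force search for item pairs that meet the support threshold."""
    items = sorted(set().union(*baskets))
    return [
        pair for pair in combinations(items, 2)
        if support(set(pair), baskets) >= min_support
    ]


print(frequent_pairs(baskets, min_support=0.6))
print(confidence({"camera"}, {"memory card"}, baskets))
```

A rule is reported only if its confidence meets the chosen threshold; here a customer with a camera in the basket would be offered a memory card if the threshold were, say, 0.7.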

3.2.5 Interpretation/Evaluation

  • Scoring or fit function. Why do we need it.
    • After mining for patterns we need to evaluate how good the pattern or model we constructed is.

  • Human input into this stage
    • Human input is important in passing final judgment on whether the knowledge discovered is useful or not.
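
As one concrete instance of a scoring function, the sum-of-squared-error criterion listed under clustering can be used to compare two candidate partitions of the same data. The points and partitions below are made up for illustration.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(partition):
    """Sum over clusters of squared distances from each point to its centroid."""
    total = 0.0
    for cluster in partition:
        c = centroid(cluster)
        total += sum(
            sum((x - m) ** 2 for x, m in zip(point, c)) for point in cluster
        )
    return total


points_a = [(1.0, 1.0), (1.0, 2.0)]
points_b = [(8.0, 8.0), (9.0, 8.0)]

good = [points_a, points_b]                         # tight, well-separated
bad = [[points_a[0], points_b[0]], [points_a[1], points_b[1]]]

print(sse(good), sse(bad))
```

A lower score means a tighter partition, but the score alone does not say whether the clusters are useful; that judgment is the human input described above.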

3.3 Profile generation and information filtering

  • Profile generation.
    • Inputs
      • General "knowledge" model found through data mining (e.g. decision tree)
      • User specific behavior (e.g. click stream, prize choices). This is the data collected.
    • Infer probable profile. Example: infer the sex of a person (here it is one of the User Profile parameters) from the given Prize Choices, using a previously constructed Bayesian Network.

  • Information Filtering
    • Given filtering criteria (User Profile) we extract content out of the content DB that satisfies the criteria. The content that is most fit for the user is served to the user. For example, based on the known (or probable) fact that the user likes technology news, we query the DB for content with the tech news category as our profiling criterion. In SQL: SELECT * FROM All_Content WHERE Content_Category = 'technology news'
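
The two steps above, inferring a probable profile parameter and then filtering on it, can be sketched end to end. The sketch swaps in naive Bayes, the simplest Bayesian network structure, in place of a full network; the probabilities, the `likes_tech` class, the `Title` column and the sample rows are all hypothetical. Only `All_Content` and `Content_Category` come from the query in the outline.

```python
import sqlite3

# Hypothetical model parameters, as if learned earlier from labelled users.
prior = {"likes_tech": 0.4, "other": 0.6}
# P(prize chosen | class) for each prize the user may have picked.
likelihood = {
    "likes_tech": {"laptop": 0.7, "spa voucher": 0.2, "headphones": 0.6},
    "other":      {"laptop": 0.2, "spa voucher": 0.5, "headphones": 0.3},
}


def posterior(choices):
    """Naive Bayes: P(class | choices) is proportional to prior * likelihoods."""
    scores = {}
    for cls, p in prior.items():
        for prize in choices:
            p *= likelihood[cls][prize]
        scores[cls] = p
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


def filter_content(conn, category):
    """Parameterized form of the SELECT in the outline."""
    rows = conn.execute(
        "SELECT Title FROM All_Content WHERE Content_Category = ? ORDER BY Title",
        (category,),
    )
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE All_Content (Title TEXT, Content_Category TEXT)")
conn.executemany(
    "INSERT INTO All_Content VALUES (?, ?)",
    [("New chip announced", "technology news"),
     ("Cup final tonight", "sports news"),
     ("Browser update ships", "technology news")],
)

probs = posterior(["laptop", "headphones"])
# Serve tech news only if the inferred parameter is more likely than not.
titles = filter_content(conn, "technology news") if probs["likes_tech"] > 0.5 else []
print(probs, titles)
```

Passing the category as a bound parameter rather than concatenating it into the SQL string keeps the query safe when the value comes from a stored profile.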
 
4. Data presentation

This brief section describes techniques used to present custom content generated through Data Processing to the user. In summary, given a profile and given the custom content, how exactly do you present the data?

4.1 Means of delivery

This will very briefly touch upon the dynamic nature of our content and have a big pointer over to the dynamic content chapter.

  • Dynamic Web Page
  • Publish/Subscribe
  • Mobile Devices

4.2 Who is looking?

4.2.1 User - customer
This is the most common destination for the personalized content. The information should be tailored to user needs in a manner consistent with the type of information requested.

4.2.2 User - web site administrator
Corporations often wish to take a composite look at the data, for various purposes. This is KDD, but not strictly "web site customization". However, it is still an important issue in this area, so it should be mentioned here.

  • Sales metrics
  • Site usage statistics
  • Advertiser effectiveness

4.3 PrizeChoice Example

The mechanisms for display used in PrizeChoice will be described.

 
5. Research / Challenges

5.1 Scalability

These systems must be able to support large numbers of users in real-time. Techniques such as dimensionality reduction and parallelism can be used, but more are needed.

5.1.1 Issues

  • Need to support large numbers of users
  • Concurrent need for real-time systems
  • Sparsity of Information (see above)

5.1.2 Current Techniques

  • Dimensionality Reduction
  • Parallelism

5.2 Data Integration

A current area of concentration is the integration of several sources of data to address sparse information and user apathy about explicitly providing information.

5.2.1 Issues

  • As above, sparse data sets
  • User apathy or reluctance to provide information

5.2.2 Current Techniques

  • See the Data Integration chapter!

5.3 System Targeting

Web sites are typically used by an individual consumer, while traditional KDD and data mining systems were used by companies and large corporations. Algorithms are needed to make the data useful for the individual.

5.3.1 Issues

  • Hard to know which algorithm is appropriate
  • Algorithms have too many parameters
  • Hide implementation specifics

5.3.2 Current Techniques

  • Make them robust
5.4 Data Collection Methods

The current methods of data collection (as discussed in Section 2) are
non-optimal. It is very difficult to get users to supply data (a problem
already addressed by implicit techniques), but it is also difficult to know
whether the data that they have supplied is "good".

5.5 Security/Privacy Concerns
  • How do you keep your user's data private?
  • How do you securely store filtered information?
 
6. Summary

The summary will reiterate the concepts presented in the introduction, but in slightly more detail and on a more technical level. In particular, the summary will cover:

  • The notion of a profile, and how it allows you to approach KDD from a general standpoint.

  • The KDD process in brief, how you apply it, and what types of information you can gather (this will be very short, with references to PrizeChoice).

  • The challenges that lie ahead and areas of research that we feel have the most potential (e.g. AI / machine learning).