||1.2 Problem statement
||1.3 Chapter outline
||2.2 Data Acquisition
||2.3 Data Representation
||2.4 Case study - Amazon.com
||3.2 KDD for personalized content delivery
||3.3 Content filtering - the post-KDD process
||4.1 Means of delivery
||4.2 Who is looking?
||4.3 PrizeChoice Example
Research / Challenges
Data Collection Methods
- What is Customization? (brief subsection)
- Delivery of content that is personalized in one way or another
- Web as the medium - Outline of Customization systems on the Internet
- Enterprise Portals - what are these creatures?
- Are there better ways of delivering custom content?
- Customization as the newest trend on the web
- Goal: increase customer retention and, consequently, customer utility
- Tool: personalized content delivery
- Solutions such as enterprise portals (e.g. MyYahoo)
- Recommender and data mining systems (e.g. DoubleClick targeted ads)
- PrizeChoice - intro to our running example
- Outline of this chapter: We state that we'll do the following: define
the customization problem, explain what makes it a data management problem,
and propose a unified solution.
1.2 Problem statement
- Present customization as an information filtering problem. Given a
content database and search criteria, perform a query on the database
to find content that satisfies the search criteria.
- Introduce the notion of a User Profile
- Profile - information filtering criteria. We state that the criteria
are user-specific. We also outline what can qualify as filtering criteria
(e.g. user age, implicitly inferred user personality type, user preferences)
- Profile - compact representation of "knowledge" that is user specific
- "Knowledge" is explicitly provided (e.g. MyYahoo)
- "Knowledge" is implicitly inferred through the process of Knowledge
Discovery in Databases (KDD)
- Concise overview of KDD pipeline
- Introduce Profile-Driven Customization (PDC). PDC as a generic solution
to the problem of personalized content delivery on the web. We state
that any web customization pipeline can be viewed as user profile-oriented.
- Diagram of PDC
- PDC Stages (Assembly of User Profile, Content filtering based
on profile parameters)
- Data Collection
- Data Processing (KDD)
- Data Presentation
|2. Data collection
- A short paragraph on how data gathered from the user forms the foundation
of the customization process. The focus of this section is to explore
the types of data that will allow for more productive information extraction
that can then be applied to customization.
- The data that is gathered from the user allows us to build an information
filter - the user profile. At the highest level, we can separate data
into two categories: explicit and implicit.
- Explicit: custom content is generated based on what the user has
explicitly requested. The user knows that s/he has provided this data.
- Enterprise portals. excite.com,
yahoo.com are two portals
that offer customizable layout and content. Corporate intranets
may also offer similar services (TODO: find an intranet example).
- Implicit: custom content is presented to the user based on knowledge
that the customization engine has inferred. Often, the user does
not know what data has been collected to determine the content.
- PrizeChoice. Ads and next set of prizes are generated based
on previous choices. The user is not aware of this.
- Re-introduce the concept of a 'profile' (as defined in the introduction).
Explain how profiles are constructed out of data gathered from the user
and are used to verify and match knowledge gathered by the customization
engine.
- The data you want to gather really depends on what you want to
mine, and what you want to filter. Emphasize that profiles will
most likely be site-specific, which is why we rely so much on examples
in this section.
- Use our running example, PrizeChoice, to introduce a concrete
example of what a profile looks like. Show screenshots of the DB.
Introduce the types of data that are gathered to produce this profile.
- (TODO: Can we get another example of a profile? Something from
an enterprise portal would be nice. Check cookies on HD.)
- (Can we use the Pub/Sub group for another notion of a profile?)
- User profiles need not be categorized on a per user basis. They
can also extend across genres and demographics (for example, weather
will be filtered by geographic region rather than on an individual
basis. In this case, we could use zip codes/IP addresses as simple
profiles [views?] to filter content).
- Give a brief section outline:
- What data should you collect?
- How do you store this data? (e.g. how does clickstream data get
from a browser into the DB? What format is it stored in?)
2.2 Data Acquisition
- Discussion on how you can collect information about anything that
the user does on the screen - from mouse movements and keystrokes to
browsing patterns and purchase history. There is an almost infinite
amount of data, ranging across many different types and granularities.
- Collecting everything is a bad idea.
- Space/scalability and processing time problems.
- Usability problems. Trying to gather too much information may
result in too complex an interface for the user, giving you inaccurate
data.
- PrizeChoice: initially had the user rate the prizes offered.
Supposedly this gave us more information, because it not only
told us which prize the user preferred, but also the degree
to which they preferred it over the others. However, users did
not pay any attention to the ratings - so they became meaningless.
Eventually we removed the rating system and just had users choose
the prize they wanted. Show screenshots of both prize screens.
- Data flow. How does this information get from the user into the DB?
- PrizeChoice: choices are recorded by making an entry of the four
prizes offered, with another entry storing the choice.
- MyYahoo: user preferences are stored in a cookie (TODO: find this
on my hard drive and use as example). (http://privacy.yahoo.com/privacy/us/my/)
- DoubleClick: ads served by DoubleClick that are clicked on go
through a central server that records some information before forwarding
the user onto the appropriate page. (http://www.doubleclick.net:80/us/corporate/privacy/)
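The PrizeChoice data flow described above (one entry for the four prizes offered, another entry for the choice) can be sketched roughly as follows. The table layout and column names here are illustrative assumptions, not the actual PrizeChoice schema:

```python
import sqlite3

# Minimal sketch of the PrizeChoice data flow: one row per offer (the
# four prizes shown), one row per recorded choice. Hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offers (offer_id INTEGER PRIMARY KEY, user_id INTEGER,
                     prize1 TEXT, prize2 TEXT, prize3 TEXT, prize4 TEXT);
CREATE TABLE choices (offer_id INTEGER, user_id INTEGER, chosen TEXT);
""")

def record_offer_and_choice(user_id, prizes, chosen):
    """Store the four prizes offered, then the user's choice."""
    cur = conn.execute(
        "INSERT INTO offers (user_id, prize1, prize2, prize3, prize4) "
        "VALUES (?, ?, ?, ?, ?)", (user_id, *prizes))
    conn.execute("INSERT INTO choices VALUES (?, ?, ?)",
                 (cur.lastrowid, user_id, chosen))

record_offer_and_choice(42, ("mug", "t-shirt", "pen", "cap"), "t-shirt")
print(conn.execute("SELECT chosen FROM choices").fetchone()[0])  # t-shirt
```

Keeping the offer and the choice in separate tables preserves what the user rejected as well as what they picked, which is what makes choice data minable later.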
2.3 Data Representation
- Questions to deal with in this section:
- How is your representation of data affected by the type of knowledge
discovery you want to do?
- What types of data work well with what kinds of models?
- Choice data. This is where the user chooses one or more
things over one or more other things (e.g. choosing apples
over oranges or cats over dogs). Good for finding association
rules.
- Purchase history. Also good for clustering like users
and finding association rules. This is a good source as
it is concrete information. Also interesting to note is
delivery address (if any), payment method (if any), and
purchase type (e.g. gift, personal etc).
- Demographic information.
- Pageviews/clickstreams (browsing history). Good for tracking
which links the user likes - perhaps to augment or update
the user's profile.
- Suppose the user explicitly states that s/he does
not like sports in a site's registration page. If her
browsing history indicates that she consistently visits
espn.com or checks sports stats, then her behaviour
is revealing hidden information.
- Putting something in your shopping cart and then removing
it, clicking on links by a certain author/artist, and
keywords used in searches may give you clues as to what
the user may or may not like.
- Can you keep your representation as general as possible? If not,
how can you restructure your data (if at all)?
- How do you represent your profile? Is it necessary to have one
profile per user, or can you do it by category, demographic etc.?
Can you dynamically generate these?
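One way to keep the representation of choice data general, sketched below: reduce raw choice events to per-item win/appearance counts, from which a preference score falls out. The event format `(user, offered_items, chosen_item)` and the scoring rule are illustrative assumptions:

```python
from collections import defaultdict

# Toy choice events: (user, items offered, item chosen).
events = [
    ("u1", ["mug", "pen", "cap"], "mug"),
    ("u1", ["mug", "cap", "t-shirt"], "mug"),
    ("u1", ["pen", "cap"], "cap"),
]

def preference_scores(events, user):
    wins = defaultdict(int)    # times the item was chosen
    shown = defaultdict(int)   # times the item was offered
    for u, offered, chosen in events:
        if u != user:
            continue
        for item in offered:
            shown[item] += 1
        wins[chosen] += 1
    # score = fraction of times chosen when offered
    return {item: wins[item] / shown[item] for item in shown}

print(preference_scores(events, "u1")["mug"])  # 1.0
```

The same reduction applies to any "choice" stream (prize picks, link clicks), which is the generality argument made for choice data above.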
2.4 Case study - Amazon.com
- Profiles and the type of data you gather tend to differ from site
to site. This section is here to give another concrete example.
- Show the types of information that amazon.com collects, and the way
they use it. This will mainly be based on their privacy
policy, which is pretty superficial in terms of details. amazon.com
makes a good example, however, because they collect both explicit and
implicit user data.
- The main point to note is that they have identified the set of data
that makes the best recommendations for users. (TODO: see if I can find
examples of cookies on my HD to use)
- (TODO: the hunt for a patent or white paper is in progress.)
|3. Data processing
3.1 Problem statement
- Here we look back at the Profile-Driven Customization (PDC) pipeline
for personalized content delivery and look at where data processing fits in.
- Data processing is the process (KDD) responsible for extracting hidden
information patterns out of the collected data.
- Brief history of KDD.
- Overview of conventional applications of statistical analysis
(e.g. credit card and insurance industries, quality control in manufacturing)
- Statistical analysis applied to a new information medium - the Web.
- Following that, we state that an explanation of how the 'knowledge' is
used (content filtering) is to follow.
3.2 KDD for personalized content delivery
- Give a brief outline of the section, explaining exactly what each subsection
will be doing and how we will be using PrizeChoice as a running example.
At each subsection, we will illustrate the concepts and issues discussed
by using PrizeChoice as an example.
- Why take PrizeChoice as an example dataset? PrizeChoice is a system that
collects relational item preference data. The beauty of such data is its
generic nature. Pretty much everything on the web is a choice. We can
categorize and classify series of choices (e.g. prize choices, web site
choices; a click stream can be viewed as a sequence of choices, etc.). The
logic and methodology behind the data analysis that we subject PrizeChoice
data to can be easily reapplied and extended to the analysis of very broad
classes of web data.
- Pre-Data Mining steps
- Why bother? Here we outline wrong approaches to mining data
("let's flip a coin and do some data mining!").
- Hidden patterns - what are they?
- Data selection and Pre-processing
- Mapping data to a single naming convention, uniformly representing
and handling missing data, and handling noise and errors when possible.
- Normalization and Pivoting
- OLAP and Data Mining. OLAP provides a basis for conceptual and descriptive
modeling. Here we talk of facts and dimensions, aggregations.
- Data Mining provides explanatory modeling.
- Data Mining can build on the OLAP cubes and visualization tools,
with Data Mining plus OLAP considered to be an ideal combination.
- Goal: find a statistical model that describes the data well,
i.e. generate a probability distribution function (find relationships
among variables of interest).
Data Mining Tasks. Here we define major goals that data mining might
accomplish (e.g. classification determines what predefined class the
data belongs to, where class can be a value of a variable (see diagram);
clustering solves a problem of partitioning data into mutually exclusive
subsets according to some metric)
- Classification. We define classification and describe typical
classification approaches:
- Decision rule + density estimation
- Assign to a class based on proximity (e.g. K-nearest neighbor
that is widely used by recommender systems such as MovieLens)
- Ways to define decision regions. Goal: project attribute space
into mutually exclusive classes.
- Decision trees
- Neural nets
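The K-nearest-neighbor decision rule mentioned above can be sketched on toy data. The attributes, class labels, and squared-Euclidean distance below are illustrative assumptions:

```python
from collections import Counter

# Each labeled example is a point in attribute space with a known class;
# a new point gets the majority class among its k closest neighbors.
labeled = [((25, 1), "sports"), ((30, 1), "sports"),
           ((45, 0), "finance"), ((50, 0), "finance")]

def knn_classify(point, labeled, k=3):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(labeled, key=lambda pc: dist(point, pc[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((28, 1), labeled))  # sports
```

This is the same "assign to a class based on proximity" idea used by recommender systems such as MovieLens, where neighbors are users with similar rating vectors.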
- Clustering. Identify a finite set of categories to describe the data.
- Numerical data (continuous). Easy case.
- Categorical data. Hard, since metric notions have to be defined (e.g.
in PrizeChoice, categorizing prizes in a large DB is a difficult task).
- Pattern discovery (e.g. finding a strong association rule). Example:
assuming binary transaction data (consume or not consume), find clusters
of consumers that satisfy a pattern: 7 out of 10 products are
bought and there are at least x consumers in the cluster.
- Overview of scalable algorithms (high level pseudo code will be
provided in the chapter) to assign a class to given input data.
- Association rules. Example: given items found in a consumer's basket,
offer related items.
- Definition: itemset
- What is a confidence threshold?
- Issue: search for frequently occurring itemsets
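The itemset/support/confidence notions can be made concrete on toy basket data (the baskets below are made up; support(X) is the fraction of baskets containing X, confidence(X -> Y) = support(X and Y) / support(X)):

```python
# Toy binary transaction data: each basket is a set of items bought.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    # fraction of baskets that contain every item in the itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # how often the consequent appears when the antecedent does
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 2/3
```

Searching for all itemsets whose support exceeds a threshold is exactly the "frequently occurring itemsets" issue noted above; scalable algorithms prune that search rather than enumerate every subset.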
- Clustering as an optimization problem (typically NP-hard). Given a
clustering criterion, find the optimal data partition.
- Criterion functions used (e.g. Min variance, Sum-of-Squared Error)
- Methods: iterative, hierarchical, region growing, graph based.
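A minimal sketch of the Sum-of-Squared-Error criterion paired with an iterative (k-means-style) method, on made-up 1-D data:

```python
# Toy 1-D points forming two obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]

def assign(points, centers):
    # each point joins the nearest center
    return [min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            for p in points]

def sse(points, centers, labels):
    # Sum-of-Squared Error: total squared distance to assigned centers
    return sum((p - centers[i]) ** 2 for p, i in zip(points, labels))

centers = [0.0, 10.0]
for _ in range(5):                      # iterate: assign, then re-center
    labels = assign(points, centers)
    centers = [sum(p for p, l in zip(points, labels) if l == i) /
               max(1, sum(l == i for l in labels)) for i in range(2)]

labels = assign(points, centers)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each iteration can only lower (or keep) the SSE, which is why the iterative method converges, though possibly to a local optimum - the NP-hardness noted above is about finding the global one.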
- Density estimation. Goal: given data, estimate the joint Probability
Distribution Function. Example: Bayesian Network.
- Evaluation / fit function. Why do we need it? When
mining for patterns we need to evaluate how good a pattern/model is.
- Human input into this stage. Such input is important in passing final
judgment on whether the knowledge discovered is useful or not.
Profile generation and information filtering
- Profile generation.
- General "knowledge" model found through data mining
(e.g. decision tree)
- User specific behavior (e.g. click stream, prize choices). This
is the data collected.
- Infer probable profile. Example: infer the sex (here it is one of the
User Profile parameters) of a person from Prize Choices, using a
previously constructed Bayesian Network.
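Profile inference along these lines can be sketched with naive Bayes, the simplest Bayesian-network structure. The conditional probabilities below are invented for illustration, standing in for parameters a real system would mine from collected choice data:

```python
from math import prod

# Hypothetical model parameters: prior P(sex) and P(chose prize | sex).
p_sex = {"M": 0.5, "F": 0.5}
p_choice = {
    "M": {"mug": 0.2, "t-shirt": 0.5, "perfume": 0.3},
    "F": {"mug": 0.3, "t-shirt": 0.2, "perfume": 0.5},
}

def infer_sex(choices):
    # posterior over the profile parameter, given observed prize choices
    score = {s: p_sex[s] * prod(p_choice[s][c] for c in choices)
             for s in p_sex}
    total = sum(score.values())
    return {s: v / total for s, v in score.items()}

posterior = infer_sex(["t-shirt", "mug"])
print(max(posterior, key=posterior.get))  # M
```

The inferred parameter then becomes part of the User Profile and drives the information-filtering step that follows.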
- Information Filtering
- Given filtering criteria (User Profile) we extract content out of
the content DB that satisfies the criteria. The content that is most
fit for the user is served to the user. For example, based on the known
(or probable) fact that the user likes technology news, we query the DB
for content with the tech news category as our profiling criterion. In SQL:
SELECT * FROM All_Content WHERE Content_Category = 'technology news'
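The query above can be sketched end-to-end against a toy content DB. The All_Content layout here is an assumption based only on the SQL above; the profile value is passed as a bound parameter rather than spliced into the query string:

```python
import sqlite3

# Toy content DB with a hypothetical All_Content table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE All_Content (title TEXT, Content_Category TEXT)")
conn.executemany("INSERT INTO All_Content VALUES (?, ?)", [
    ("New CPU released", "technology news"),
    ("Cup final tonight", "sports"),
])

profile_category = "technology news"   # taken from the User Profile
rows = conn.execute(
    "SELECT title FROM All_Content WHERE Content_Category = ?",
    (profile_category,)).fetchall()
print([r[0] for r in rows])  # ['New CPU released']
```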
|4. Data presentation
section describes techniques used to present custom content generated
through Data Processing to the user. In summary, given a profile and given
the custom content, how exactly do you present the data?
4.1 Means of delivery
This will very briefly touch upon the dynamic nature of our content and
have a big pointer over to the dynamic content chapter.
4.2 Who is looking?
4.2.1 User - customer
This is the most common destination for the personalized content. The
information should be tailored to user needs in a manner consistent with
the type of information requested.
4.2.2 User - web site administrator
Corporations often wish to take a composite look at the data, for various
purposes. This is KDD, but not strictly "web site customization". However,
it is still an important issue in this area, so it should be mentioned:
- Sales metrics
- Site usage statistics
- Advertiser effectiveness
4.3 PrizeChoice Example
The mechanisms for display used in PrizeChoice will be described.
|5. Research / Challenges
5.1 Scalability
These systems must be able to support large numbers of users in real time.
Techniques such as dimensionality reduction and parallelism can be used,
but more are needed.
- Need to support large numbers of users
- Concurrent need for real-time systems
- Sparsity of Information (see above)
5.1.2 Current Techniques
- Dimensionality Reduction
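Dimensionality reduction is only named above; one cheap, scalable variant is the feature-hashing trick sketched here (our choice for illustration, not necessarily what deployed recommenders use). It maps an unbounded space of features (URLs, item ids) into a small fixed-size vector:

```python
import zlib

DIMS = 8  # target dimensionality (tiny, for illustration)

def hashed_vector(features, dims=DIMS):
    # each feature increments one bucket chosen by a deterministic hash
    vec = [0] * dims
    for f in features:
        vec[zlib.crc32(f.encode()) % dims] += 1
    return vec

v = hashed_vector(["espn.com", "cnn.com/tech", "espn.com"])
print(len(v), sum(v))  # 8 3
```

The payoff for scalability is that the vector size is fixed no matter how many distinct URLs or items the site accumulates, at the cost of occasional hash collisions.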
5.2 Data Integration
A current area of concentration is the integration of several sources
of data to solve problems with lack of information and user apathy about
explicitly defining information.
- As above, sparse data sets
- User apathy or reluctance to provide information
5.2.2 Current Techniques
- See the Data Integration chapter!
5.3 System Targeting
Web sites are typically used by an individual consumer, while traditional
KDD and data mining systems were used by companies and large corporations.
Algorithms are needed to make the data useful for the individual.
- Hard to know which algorithm is appropriate
- Algorithms have too many parameters
- Hide implementation specifics
5.3.2 Current Techniques
5.4 Data Collection Methods
The current methods of data collection (as discussed in Section 2) are
non-optimal. It is very difficult to get users to supply data (already
addressed by implicit techniques), but it is also difficult to know
whether the data that they have supplied is "good".
- How do you keep your users' data private?
- How do you securely store filtered information?
|6. Summary
The summary will reiterate the concepts presented in the introduction,
but in slightly more detail and on a more technical level. In particular,
the summary will cover:
- The notion of a profile, and how it allows you to approach KDD from
a general standpoint.
- The KDD process in brief, how you apply it, and what types of information
you can gather (this will be very short, with references to PrizeChoice).
- The challenges that lie ahead and areas of research that we feel have
the most potential (e.g. AI / machine learning).