CSCI 1580: Information Retrieval and Web Search

Course Information

Syllabus

Introduction: Goals and history of IR. The impact of the web on IR. The role of artificial intelligence (AI) in IR.
Basic IR Models: Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval: Simple tokenizing, stop-word removal, and stemming; inverted indices; efficient processing with sparse vectors; Python implementation.
Experimental Evaluation of IR: Performance metrics (recall, precision, and F-measure); evaluations on benchmark text collections.
Query Operations and Languages: Relevance feedback; query expansion; query languages.
Text Representation: Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages (SGML, HTML, XML).
Web Search: Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g., hubs and authorities, Google PageRank); shopping agents.
Text Categorization and Clustering: Categorization algorithms: naive Bayes, decision trees, and nearest neighbor. Clustering algorithms: agglomerative clustering, k-means, and expectation maximization (EM). Applications to information filtering, organization, and relevance feedback.
Recommender Systems: Collaborative filtering and content-based recommendation of documents and products.
Information Extraction and Integration: Extracting data from text; XML; semantic web; collecting and integrating specialized information on the web.
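
To make the vector-space topics above concrete (TF-IDF weighting, sparse vectors, cosine similarity, ranked retrieval), here is a minimal Python sketch. It is an illustration only, not the course's implementation: the function names (tokenize, build_tfidf, cosine, rank) are hypothetical, tokenizing is a plain whitespace split with no stop-word removal or stemming, and the IDF variant log(N / df) is one common choice among several.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def build_tfidf(docs):
    """Return (sparse TF-IDF vectors as dicts, IDF table) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Assumed IDF variant: log(N / df). Terms in every document get weight 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = build_tfidf(docs)
    # Query terms unseen in the collection get weight 0.
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)
```

Calling rank("web search", docs) on a small list of strings returns the documents ordered by similarity to the query. Storing each vector as a dict keeps only the nonzero weights, which is the sparse-vector idea the syllabus mentions.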

Who, When, Where

Professor: Eli Upfal
Professor: Tim Kriska
HTA: Matt Mahoney
UTA: David Storch
GTA: Ahmad Mahmoody
Spring 2013