An example of a typical bag of words classification pipeline. Figure by Chatfield et al.

Project 3: Scene recognition with bag of words
CS 143: Introduction to Computer Vision



Bag of words models are a popular technique for image classification inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based only on a histogram of the frequency of visual words. Visual words are identified by clustering a large corpus of local features. See Szeliski chapter 14.4.1 for more details on category recognition with quantized features. In addition, 14.3.2 discusses vocabulary creation and 14.1 covers classification techniques.
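To make the histogram idea concrete, here is a minimal sketch of assigning local features to their nearest visual word and counting the assignments. It is written in plain Python for illustration rather than in the project's MATLAB, and the toy 2-D vocabulary, the descriptors, and the function names (`nearest_word`, `word_histogram`) are all assumptions made for this example, not part of the starter code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def word_histogram(descriptors, vocabulary):
    """Count how many descriptors are assigned to each visual word."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    return counts


# Toy 2-D data (hypothetical): three visual words and five local descriptors.
vocabulary = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [(1.0, 1.0), (9.0, 0.5), (0.5, 8.0), (0.2, 0.1), (11.0, 1.0)]
histogram = word_histogram(descriptors, vocabulary)
print(histogram)  # [2, 2, 1]
```

Note that the positions of the descriptors within the image play no role: only the counts per word survive, which is exactly the spatial information the model discards.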

For this project you will be implementing a basic bag of words model with many opportunities for extra credit. You will classify scenes into one of 15 categories by training and testing on the 15 scene database (introduced in Lazebnik et al. 2006, although built on top of previously published datasets). Lazebnik et al. 2006 is a great paper to read, although we will be implementing the baseline method the paper discusses (equivalent to the zero-level pyramid) and not the more sophisticated spatial pyramid (which is extra credit). For a more recent review of feature encoding methods for bag of words models, see Chatfield et al. 2011.

Example scenes from each category in the 15 scene dataset. Figure from Lazebnik et al. 2006.

The basic flow of this project is as follows:

Details and Starter Code

The top level script is proj3.m. It breaks the project pipeline into 5 steps:

The first two steps will require you to decide on a local feature representation for each scene. We suggest starting with the VLFeat library's vl_dsift function. We suggest using vl_kmeans to build the vocabulary. You can also implement other local features or clustering methods.

A baseline version of the final three steps is written for you. We have included an SVM implementation, primal_svm.m. This code is fast, portable, and accepts arbitrary kernel matrices. The stencil code is configured to train linear SVMs, although you can use non-linear kernels for improved performance and extra credit.
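As a rough illustration of what a kernel matrix is, the sketch below builds one for a linear kernel and one for a histogram intersection kernel, a common non-linear choice for histogram features. It is plain Python rather than MATLAB, the toy histograms and function names are assumptions for this example, and it does not show how a matrix is passed to primal_svm.m, which is not described here.

```python
def linear_kernel(a, b):
    """Dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def intersection_kernel(a, b):
    """Histogram intersection: sum of the bin-wise minima."""
    return sum(min(x, y) for x, y in zip(a, b))


def kernel_matrix(rows, cols, kernel):
    """Entry (i, j) holds kernel(rows[i], cols[j])."""
    return [[kernel(r, c) for c in cols] for r in rows]


# Three toy normalized histograms over a 3-word vocabulary (hypothetical).
histograms = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
]
K_linear = kernel_matrix(histograms, histograms, linear_kernel)
K_intersection = kernel_matrix(histograms, histograms, intersection_kernel)
for row in K_intersection:
    print(row)
```

For training, rows and columns both index the training images. For testing, the rows would typically index test images and the columns the training images.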

All the images are under the data directory. data/training and data/test both have folders for each scene category; training has exactly 100 scenes per directory, and test has a variable number. The stencil code is configured to train and test on the same number of images per category.

Whichever local feature representation you decide to use, it is not necessary to use the entire training set to build a visual word vocabulary. Instead you can randomly sample tens or hundreds of thousands of local descriptors to cluster. Make sure to sample from all the training images, though. We recommend starting with a vocabulary of about 200 words.
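One possible way to sample descriptors while still covering every training image is to give each image an equal quota. The sketch below is plain Python for illustration only; `sample_descriptors`, the per-image quota strategy, and the toy data are assumptions for this example, and other sampling schemes are equally valid.

```python
import random


def sample_descriptors(per_image_descriptors, total, seed=0):
    """Draw about `total` descriptors, spread evenly over every image."""
    rng = random.Random(seed)
    # Equal quota per image, so no image is left out of the sample.
    quota = max(1, total // len(per_image_descriptors))
    sample = []
    for descriptors in per_image_descriptors:
        take = min(quota, len(descriptors))
        sample.extend(rng.sample(descriptors, take))
    return sample


# Toy data (hypothetical): 4 images with 50 two-element descriptors each.
# The first element records which image the descriptor came from.
images = [[(i, j) for j in range(50)] for i in range(4)]
sample = sample_descriptors(images, total=20)
print(len(sample))  # 20
```

The resulting sample is what would be handed to the clustering step to produce the vocabulary.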

You should normalize your bag of words histograms, so that image size does not influence histogram counts.
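A minimal sketch of one normalization choice, assuming L1 normalization (dividing by the total count); other norms are possible and the handout does not prescribe one. It is plain Python for illustration, and `normalize_histogram` is a hypothetical name.

```python
def normalize_histogram(counts):
    """Scale counts so they sum to 1 (L1 normalization)."""
    total = sum(counts)
    if total == 0:
        # An image with no descriptors gets an all-zero histogram.
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Same word proportions, but the second image yields 100x more descriptors.
small_image = [2, 2, 1]
large_image = [200, 200, 100]
print(normalize_histogram(small_image))  # [0.4, 0.4, 0.2]
print(normalize_histogram(large_image))  # [0.4, 0.4, 0.2]
```

After normalization the two images are indistinguishable, which is the intended effect: the descriptor count, and hence the image size, no longer affects the feature.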

Useful MATLAB functions: histc

Write up

For this project, and all other projects, you must write a project report in HTML. In the report you will describe your algorithm and any decisions you made to write your algorithm a particular way. Then you will show and discuss the results of your algorithm. Discuss any extra credit you did, and clearly show what effect it had on the results (e.g. performance with and without each extra credit component).

It would be interesting (although not required) to see where the classifier is making mistakes in the spirit of the SUN database results page (Warning: large web page).

For this project you should also include a confusion matrix for your classifier. (You can include a graphic or MATLAB's text output in a <pre> tag.)
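For reference, a confusion matrix can be tallied from true and predicted labels as sketched below. This is plain Python for illustration; the three category names and the label lists are toy data, and the row/column convention (rows are true categories, columns are predictions) is an assumption you should state in your report whichever way you choose.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Rows are true categories, columns are predicted categories."""
    index = {name: i for i, name in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of images on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Toy labels (hypothetical), two test images per category.
categories = ["kitchen", "forest", "street"]
truth = ["kitchen", "kitchen", "forest", "forest", "street", "street"]
guess = ["kitchen", "street", "forest", "forest", "street", "kitchen"]
matrix = confusion_matrix(truth, guess, categories)
for name, row in zip(categories, matrix):
    print(f"{name:>8}", row)
```

Off-diagonal entries show which pairs of categories the classifier confuses, here kitchen and street.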

Extra Credit

For all extra credit, be sure to analyze on your web page whether your extra credit has improved classification accuracy. Each item is "up to" some amount of points because trivial implementations may not be worthy of full extra credit.

Some ideas:

Finally, there will be extra credit and recognition for the student who achieves the highest recognition rate. To make sure the accuracy measurement is trustworthy, you must use cross-validation across random test/train splits (first extra credit item). To make sure that comparisons are fair, if you use a validation set, your training + validation set cannot be more than 100 images per category.
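A minimal sketch of generating several random train/test splits while respecting the cap of 100 training images per category. It is plain Python for illustration; `random_split`, the file names, and the choice of five splits are assumptions for this example.

```python
import random


def random_split(images_by_category, train_per_category, seed):
    """Shuffle each category and cut it into train and test lists."""
    rng = random.Random(seed)
    train, test = {}, {}
    for category, images in images_by_category.items():
        shuffled = list(images)  # copy so the input is left untouched
        rng.shuffle(shuffled)
        train[category] = shuffled[:train_per_category]
        test[category] = shuffled[train_per_category:]
    return train, test


# Toy file lists (hypothetical): 150 images in each of two categories.
images_by_category = {
    "kitchen": [f"kitchen_{i:03d}.jpg" for i in range(150)],
    "forest": [f"forest_{i:03d}.jpg" for i in range(150)],
}

# One split per seed; any validation images must come out of the 100.
splits = [random_split(images_by_category, 100, seed) for seed in range(5)]
for train, test in splits:
    for category in images_by_category:
        assert len(train[category]) <= 100
        assert not set(train[category]) & set(test[category])
print(len(splits), "splits built")
```

The full pipeline would then be trained and tested once per split, with the mean and spread of the resulting accuracies reported rather than a single number.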

Graduate Credit

To get graduate credit on this project you must do 10 points worth of extra credit. Those 10 points will not be added to your grade, but additional extra credit will be.

Web-Publishing Results

All the results for each project will be put on the course website so that the students can see each other's results. In class we will highlight the best projects as determined by the professor and TAs. If you do not want your results published to the web, you can choose to opt out. If you want to opt out, email cs143tas[at] saying so.

Handing in

This is very important as you will lose points if you do not follow instructions. Every time after the first that you do not follow instructions, you will lose 5 points. The folder you hand in must contain the following:

Then run: cs143_handin proj3
If it is not in your path, you can run it directly: /course/cs143/bin/cs143_handin proj3


Final Advice


Project description and code by Sam Birch and James Hays. Figures in this handout from Chatfield et al. and Lana Lazebnik.

Good Luck!