Project 3 (Scene recognition with bag of words)
- Emanuel Zgraggen (ez), October 24th, 2011
Introduction
The task given by this assignment is to implement the baseline method for scene recognition presented in Lazebnik et al. 2006. The algorithm will try to classify scenes into one of 15 categories by training and testing on the 15 scene database introduced by Lazebnik et al. 2006.
Approach
			Baseline 
			The baseline algorithm collects many densely sampled SIFT features for all the training images. 
			By running k-means, these features are clustered into M visual words, the vocabulary. For each of the
			training images a histogram indicating how often each visual word is occurring gets computed. This gives us a M-dimensional 
			representation ("bag of words") of all training images. We then train 1-vs-all classifiers (linear SVM) for each scene category 
			based on the observed "bags of words" representation for all the training images. To classify a test image we calculate its "bag of words"
			representation and evaluate all of our trained classifiers. The test image is assigned to the scene category of the classifier 
			with the highest confidence.
			The general idea is that similar images will have similar distributions of visual words. 
		
The vocabulary size parameter M has a significant influence on the performance of this approach. Different values have been tested to find a good trade-off between speed and accuracy.
			GIST 
			This approach uses the GIST descriptor as a global representation of an image. The algorithm
			does not compute a vocabulary, but instead trains 1-vs-all classifiers (linear SVM) for each scene category 
			based on the GIST representation of all the training images. To classify a test image we calculate its GIST descriptor
			and evaluate all of our trained classifiers.  
		
			Spatial Pyramid 
			This approach works similar to the baseline method. The only difference is how the histograms of images
			are computed. Instead of computing one overall histogram, we subdivide the image into 
			different frames and compute histograms for each spatial bin. The algorithm weights all the
			histograms according to their level and concatenates them into one large vector, which will be used
			to train our 1-vs-all classifiers. The weight of each histogram depends on its level and is calculated with
			the following formula: 1 / 2^(L - l) where L is the number of levels and l is the current level.
			The dimension of the final concatenated histogram is: M * 1/3 * (4^(L + 1) -1).
		
 
		
			Combination of Approaches 
			Different combination of approaches have been tested. To combine two methods, the algorithm
			trains tow sets of 1-vs-all classifiers, one for each method. 
			A test image is assigned to the scene category where the mean of the classifier confidences is highest.
		
Results
All experiments were performed on 4 random splits of the 15 scene database (100 training images / 100 test images).












