Name: Chen Xu
login: chenx
The basic flow is as follows:
The result of the basic part is shown in Table 1. The accuracy is 0.6353, which is acceptable given that no improvements have been applied yet.
Accuracy  Confusion Matrix 

0.6353 
Table 1: Accuracy and confusion matrix of the basic part.
Improvements have been made in the following aspects:
The single-level method uses only the highest level (level L) of the matching pyramid to build histograms, which are fed to the SVM. The resulting histogram length is vocab_size * 4 ^ L, corresponding to 4 ^ L bins and vocab_size channels. The SVM is a kernelized nonlinear SVM that uses the Histogram Intersection Kernel. Table 2 shows that the best accuracy, 0.7593, is achieved when L = 1.
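As an illustration of the single-level histogram and the Histogram Intersection Kernel described above, here is a minimal sketch in plain Python. It is not the code used for the reported results: the function names, the `(x, y, word)` feature format, and the cell-major ordering of the histogram are all assumptions made for this example.

```python
def single_level_histogram(features, width, height, vocab_size, level):
    """Concatenate one visual-word histogram per cell of a 2^level x 2^level grid.

    features: iterable of (x, y, word) with 0 <= x < width, 0 <= y < height,
    and word an integer in [0, vocab_size). The layout (cell-major, then word)
    is an assumption of this sketch.
    """
    cells = 2 ** level
    hist = [0.0] * (vocab_size * cells * cells)  # length vocab_size * 4^level
    for x, y, word in features:
        # Clamp so that points on the far border fall into the last cell.
        cx = min(int(x * cells / width), cells - 1)
        cy = min(int(y * cells / height), cells - 1)
        hist[(cy * cells + cx) * vocab_size + word] += 1.0
    return hist


def histogram_intersection(h1, h2):
    """Histogram intersection: the sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def kernel_matrix(rows, cols):
    """Precomputed kernel matrix with K[i][j] = intersection(rows[i], cols[j])."""
    return [[histogram_intersection(r, c) for c in cols] for r in rows]
```

A kernel matrix computed this way between all training histograms can be passed to any SVM implementation that accepts a precomputed kernel.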
The spatial pyramid uses a concatenated histogram that joins the weighted histograms of all channels at all levels, so the SVM can still use the Histogram Intersection Kernel. The length of the histogram is vocab_size * (1 / 3) * (4 ^ (L + 1) - 1). The weight of the histogram at each level follows equation (3) in Lazebnik et al. 2006. Table 2 shows that the best result, an accuracy of 0.7907, is obtained when L = 2. Table 3 shows the confusion matrix and kernel matrix of the pyramid matching method at each level.
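The length formula and the level weights can be checked with a short sketch. The weights below follow equation (3) of Lazebnik et al. 2006 (1 / 2^L for level 0 and 1 / 2^(L - l + 1) for level l >= 1). The function names are hypothetical, and the level-by-level concatenation order is an assumption of this example rather than the layout used in the experiments.

```python
def pyramid_length(vocab_size, L):
    """Total length vocab_size * (4^(L+1) - 1) / 3, summed over levels 0..L."""
    return vocab_size * (4 ** (L + 1) - 1) // 3


def level_weight(level, L):
    """Weight of one pyramid level, per Lazebnik et al. 2006, equation (3)."""
    if level == 0:
        return 1.0 / 2 ** L
    return 1.0 / 2 ** (L - level + 1)


def pyramid_histogram(level_histograms):
    """Concatenate weighted per-level histograms.

    level_histograms[l] is the level-l histogram (length vocab_size * 4^l);
    the top level L is taken to be len(level_histograms) - 1.
    """
    L = len(level_histograms) - 1
    out = []
    for level, hist in enumerate(level_histograms):
        w = level_weight(level, L)
        out.extend(w * v for v in hist)
    return out
```

Because min(w * a, w * b) = w * min(a, b) for any positive weight w, the plain intersection of two such weighted, concatenated histograms equals the weighted sum of the per-level intersections. For vocab_size = 200 and L = 2 the length is 200 * 21 = 4200.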
Strong features (M = 200)

L          Single level  Pyramid
0 (1 X 1)  0.712
1 (2 X 2)  0.7593        0.7740
2 (4 X 4)  0.7527        0.7907
3 (8 X 8)  0.7327        0.778
Table 2: Both the single-level and the pyramid results are much better than the basic result (accuracy = 0.6353). As expected, the pyramid results are better than the single-level results at each L, and no improvement is observed when L > 2. In all of the above experiments, the training and testing sets are each fixed at 100 images per class. Dense SIFT features are extracted with binsize = 4 and step = 8.
L  Confusion matrix  Kernel matrix  Kernel size 

1 (2 X 2)  1500 X 1500  
2 (4 X 4)  1500 X 1500  
3 (8 X 8)  1500 X 1500 
Table 3: Confusion matrix and kernel matrix of the pyramid matching method at each level L. As the confusion matrices indicate, confusion occurs among the indoor classes (kitchen, bedroom, living room). The kernel matrices indicate that confusion is much stronger at the bottom-right corner, where the brighter colors show that confidence is much lower.
Cross-validation is measured at three scales, corresponding to three different sizes of training and testing sets. Table 2 shows that the best recognition accuracy is achieved when L = 2 in the pyramid matching method, so I run cross-validation for that setting. The three scales are: (1) randomly select 100 images for training and another 100 different images for testing for every class, and iterate 5 times; (2) randomly select 30 images for training and another 30 images for testing for every class, and iterate 5 times; (3) randomly select 10 images for training and 10 for testing for every class, and iterate 10 times. Table 4 shows the cross-validation results of L = 2 pyramid matching at the different data scales. Table 5 shows the confusion matrix of the cross-validated spatial pyramid matching performance (L = 2), as well as the confidence of every class along the matrix diagonal.
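The repeated random-split procedure described above could be sketched as follows. This is only an illustration: `random_split`, `repeated_evaluation`, and the `evaluate` callback (which would train on the first index list, test on the second, and return an accuracy) are hypothetical names, and the use of the population standard deviation is an assumption, since the report does not say which variant was used.

```python
import random
import statistics


def random_split(labels, n_train, n_test, rng):
    """Pick disjoint train/test index lists with a fixed count per class."""
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    train, test = [], []
    for c in sorted(by_class):
        idx = by_class[c][:]
        rng.shuffle(idx)
        train += idx[:n_train]
        test += idx[n_train:n_train + n_test]
    return train, test


def repeated_evaluation(labels, n_train, n_test, n_iter, evaluate, seed=0):
    """Run n_iter random splits and return (mean, std) of the accuracies.

    evaluate(train_idx, test_idx) is a caller-supplied function that
    trains a classifier and returns its test accuracy.
    """
    rng = random.Random(seed)
    accs = [evaluate(*random_split(labels, n_train, n_test, rng))
            for _ in range(n_iter)]
    return statistics.mean(accs), statistics.pstdev(accs)
```

Under these assumptions, scale (1) corresponds to `repeated_evaluation(labels, 100, 100, 5, evaluate)`, scale (2) to `repeated_evaluation(labels, 30, 30, 5, evaluate)`, and scale (3) to `repeated_evaluation(labels, 10, 10, 10, evaluate)`.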
100 per class & 5 iterations  30 per class & 5 iterations  10 per class & 10 iterations

mean  std  mean  std  mean  std
0.7851  0.0101  0.7236  0.0258  0.6067  0.0357
L = 2  L = 2  L = 3
Table 4: Cross-validation measurement of the pyramid matching method at L = 2. Comparing the different training data scales also shows that more training data gives better recognition accuracy.

Table 5: The most confused classes are those at the bottom-right corner (Livingroom, Office, Industry). The lowest confidence, less than 0.5, comes from Industry; the highest, nearly 1, comes from Suburb.
I tuned the SVM training parameter lambda over three values (lambda = 0.1, 0.5, 1) with L = 3. The resulting recognition accuracy is shown in Table 6, and I decided to use lambda = 0.1.
lambda  accuracy 

0.1  0.778 
0.5  0.7733 
1  0.7667 
Table 6: Effects of different lambda.
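The selection of lambda amounts to a small parameter sweep, which can be sketched as below. The `sweep` helper is hypothetical, and the accuracies copied from Table 6 stand in for an actual training run.

```python
def sweep(values, evaluate):
    """Evaluate each candidate value and return (best_value, accuracy_by_value)."""
    accuracy_by_value = {v: evaluate(v) for v in values}
    best = max(accuracy_by_value, key=accuracy_by_value.get)
    return best, accuracy_by_value


# Accuracies copied from Table 6; a real run would train an SVM per value.
table6 = {0.1: 0.778, 0.5: 0.7733, 1: 0.7667}
best_lambda, _ = sweep(table6, table6.get)
```

With the Table 6 values, `best_lambda` is 0.1, matching the choice made above.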
The final results are:
Mean accuracy = 0.7851
Standard deviation = 0.0101
Method: Spatial pyramid matching & cross-validation