Content-based image retrieval aims to obtain, for a query image, the most similar images in a collection according to some feature. The objective is to define a metric to compare these features and then retrieve the k nearest neighbors of the query. A basic approach, for example, would compute the Euclidean distance between every pixel of the query and the corresponding pixels of each candidate image. Clearly, this becomes extremely expensive as the number of images and the dimensionality of each grow. Different techniques are used to reduce the dimensionality of the feature (GIST, LSH and PCA). In addition, binary features tend to make the process even faster, since the similarity function reduces to the Hamming distance (equivalent to the Manhattan distance on binary vectors).
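As an illustrative sketch (not the pipeline used in this work), comparing binary codes reduces to counting differing bits, and the k nearest neighbors follow from sorting those distances:

```python
import numpy as np

def hamming_distance(a, b):
    """Number of differing bits between two binary (0/1) vectors."""
    return int(np.count_nonzero(a != b))

query = np.array([1, 0, 1, 1, 0], dtype=np.uint8)
db = np.array([[1, 0, 1, 0, 0],
               [0, 1, 1, 1, 0],
               [1, 0, 1, 1, 1]], dtype=np.uint8)

# For binary codes, summed Manhattan distance |a - b| equals the Hamming distance.
dists = np.abs(db.astype(int) - query.astype(int)).sum(axis=1)
nearest = np.argsort(dists)  # indices of the database codes, nearest first
```

Because the distance is an integer bit count, real implementations pack the bits and use XOR plus popcount, which is what makes binary features so much faster than floating-point comparisons.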
[Lin, 2015] defined a framework for using a neural network as an autoencoder, and showed that AlexNet fine-tuned for image classification produced state-of-the-art binary hash codes for image retrieval. In this work, I aim to extend the concept to object detection. Using one of the main approaches for object detection, Faster R-CNN, I trained an autoencoder whose objective is to define features for object-based image retrieval.
The proposed framework is an application of the technique proposed by [Lin, 2015] to content-based image retrieval. In this work, the term means that, given a query image, images containing similar region proposals will be retrieved. A region proposal is defined as a bounding box for an object of a specific class.
In this case, Faster R-CNN is used as the object detector. It proves suitable for the task because it is a single neural network with no need for extra input or output information; the latent layer inserted into this network will therefore directly compress the data used by the detector. It is also the state of the art for object detection and reaches real-time detection in some configurations.
Comparing the network structure with the original ImageNet approach, the twin output layers (class probability and bounding-box location) prevent a direct mapping of the framework onto the network. However, in its last layers, Faster R-CNN presents a clear distinction between region localization and object classification. Following the framework, I decided to attach the latent layer only to the classification branch, so the feature tends to ignore localization. The objective is to compress only information about the object classes.
After including the new layer, the network is fine-tuned on the desired dataset. The objective is to reach a performance similar to the original detector; this indirectly shows that the information is being compressed, since otherwise the output would not perform close to the original structure. The output of the latent layer defines the binary code according to a threshold.
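A minimal sketch of that binarization step, assuming a sigmoid latent layer and the 0.5 threshold used later in the experiments:

```python
import numpy as np

def binarize(latent_activations, threshold=0.5):
    """Convert sigmoid activations of the latent layer into a binary hash code."""
    return (np.asarray(latent_activations) >= threshold).astype(np.uint8)

# e.g. four units of a sigmoid latent layer, each in [0, 1]
code = binarize([0.91, 0.12, 0.55, 0.03])
# code -> array([1, 0, 1, 0], dtype=uint8)
```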
The image retrieval algorithm first builds a set of features covering all the images that can be retrieved, with each feature associated with its image. However, Faster R-CNN with the latent layer produces one feature per proposal for a given input image; in the detector configuration used, the maximum was 300 proposals. I did not increase this number because of processing limitations, and doing so is out of scope.
However, most of the proposed regions do not strongly define a class in the output, and when they do, many proposals often refer to the same object. These features are therefore filtered using the detector output. First, to address multiple proposals for the same object, non-maximum suppression is applied; this algorithm is already implemented in Faster R-CNN. Given a set of bounding boxes and scores, it suppresses boxes whose overlapping area exceeds a threshold, with higher-scoring boxes given priority in the suppression. The remaining features are then limited by another threshold, which defines the minimum confidence in a class for a proposal to be kept.
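A simplified version of greedy non-maximum suppression (the actual implementation ships with Faster R-CNN); boxes are (x1, y1, x2, y2), and each kept high-scoring box suppresses lower-scoring boxes that overlap it above the IoU threshold:

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of box i with each remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]    # suppress high-overlap boxes
    return keep
```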
A GeForce Titan was used to run the network. Faster R-CNN offers three different configurations; given the gap in processing power between the original work (eight GPUs) and this one, the smallest network, Zeiler & Fergus Net, was adopted first. Difficulties in reaching results similar to the original approach, however, directed the work towards VGG16 Net instead. It is worth mentioning that, even though VGG16 is deeper and more computationally expensive to train, it proved to be almost eight times faster; the reasons are still a topic of research.
The NMS rate was originally taken from the default values, as was the confidence threshold. For the autoencoder features, the confidence was lowered to 0.6 to put more images into the set. The objective was to restrict the output to one or two regions per image, keeping it computationally feasible and avoiding false positives. According to observations, the thresholds can be tuned to control the size of the region-proposal set: in the experiments, 0.3 for NMS and 0.8 for the confidence threshold returned 8521 features for 5000 images. Coincidentally, the network with the autoencoder yielded the same number of features.
The learning rate was initially lowered because the network was not converging. Over time, however, VGG16 Net proved reliable in terms of mAP, and as the learning rate was raised the accuracy improved as well. It was finally set to 0.01, the same value used to train the original detector from ImageNet.
For the experiments, Faster R-CNN was first trained in the end-to-end configuration, which produces a single neural network responsible for the whole task. I followed the preset values, in this case 70000 iterations. The resulting network yielded 70.8% mAP, close to the 69.9% reported in the paper. The next step was to include the latent layer and fine-tune for the chosen dataset. Since the source code used is an implementation on top of Caffe, extending it presented challenges; consequently, I kept the same dataset to fine-tune the autoencoder, which can clearly lead to an overfitted network.
In this experiment, image proposals from 4000 images of PASCAL VOC 2007 were used to constitute the image set, with the confidence level and NMS restricting the number of features as explained before. The testing set is composed of all the features of 1000 images from the testing dataset. The mean accuracy is computed as the average of the per-class accuracies.
The accuracy for each class is computed as follows. Given an image proposal, the k nearest neighbors are retrieved and the set of classes present in those images is built. If the class of the queried proposal is in this set, a hit is counted; otherwise, a miss. Finally, these counts are divided by the total number of proposals of that class, normalizing the result.
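The metric above can be sketched as follows; the function names and data layout here are illustrative, not the actual evaluation code:

```python
def class_accuracy(queries, retrieve_knn, k):
    """queries: iterable of (feature, true_class) pairs for the test proposals.
    retrieve_knn(feature, k) returns the set of classes present among the
    k nearest neighbours of the feature in the pool."""
    hits, totals = {}, {}
    for feature, true_class in queries:
        totals[true_class] = totals.get(true_class, 0) + 1
        if true_class in retrieve_knn(feature, k):   # hit: class appears in the k-NN set
            hits[true_class] = hits.get(true_class, 0) + 1
    per_class = {c: hits.get(c, 0) / totals[c] for c in totals}
    mean_accuracy = sum(per_class.values()) / len(per_class)
    return per_class, mean_accuracy
```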
The autoencoder feature corresponds to the 256 bits of the latent layer (whose activation function is a sigmoid), with the output converted to binary using a 0.5 threshold. The fc7 feature is the 4096-dimensional vector from the network, and the PCA feature is a 48-dimensional vector extracted from fc7 through Principal Component Analysis. The last two features were not discretized.
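The PCA baseline can be reproduced along these lines; this numpy-only projection is an assumption for illustration, not the dimensionality-reduction code actually used, and the feature matrix here is random stand-in data:

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project features onto their top principal components (SVD-based PCA sketch)."""
    centered = features - features.mean(axis=0)
    # right singular vectors of the centered data are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
fc7 = rng.normal(size=(200, 512))      # stand-in for real 4096-d fc7 activations
reduced = pca_reduce(fc7, 48)          # shape (200, 48), as in the 2007 experiment
```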
This chart presents a test with the same autoencoder, but with images taken from PASCAL VOC 2012, which includes images from the previous years. The pool of features was built from region proposals of 5000 images of the training set, and the test used 2000 images from the validation set. The performance is similar to the 2007 analysis, and the better performance of the autoencoder indicates that it is not as overfitted as expected.
It is clear that, as k increases, fc7 surpasses the autoencoder performance. The worse performance of PCA in this experiment is surprising, since the number of dimensions was increased from 48 to 128.
Analyzing the accuracy at a finer grain, it is clear that the three approaches share the same problem. This may indicate that the problem is not related to the features but to limitations of the network itself; in other words, swapping the neural network and k-NN would not necessarily give a better result. However, the autoencoder outperforms the other methods for small k. In terms of application this is important, because as k increases the algorithm has less influence over which proposals end up retrieved.
[Chart: per-class accuracy for PCA, FC7, and the autoencoder (AUTO) at k = 1, 5, 10, and 20]
For the experiments, brute force was used to compute the k nearest neighbors. Comparing the average time to retrieve k images from a set of 5000, the experiment shows that this range of k values has practically no effect on retrieval time. The number of features affected the final time considerably more than the data type did. It is also worth considering that PCA achieved the worst results while demanding an extra step to fit the model.
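Brute-force retrieval amounts to one full distance computation over the pool plus a top-k selection, which is why k barely affects the time; a numpy sketch (illustrative, not the timing code used here):

```python
import numpy as np

def brute_force_knn(pool, query, k):
    """Indices of the k pool features closest to the query (Euclidean distance)."""
    dists = np.linalg.norm(pool - query, axis=1)  # one distance per pooled feature
    topk = np.argpartition(dists, k)[:k]          # O(n) selection of the k smallest
    return topk[np.argsort(dists[topk])]          # sort only those k by distance

pool = np.array([[0., 0.], [1., 1.], [5., 5.], [0.5, 0.5]])
idx = brute_force_knn(pool, np.array([0., 0.]), k=2)
# idx -> array([0, 3]): the query's own feature, then its nearest neighbour
```

The dominant cost is the distance pass over all features, which grows with the pool size and the feature dimensionality, matching the observation that the number of features matters more than the data type.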
This approach is clearly not the most suitable for object-based image retrieval, as it completely ignores the location of the feature. It also requires that every proposal of the query image be considered valid: when the accuracy was evaluated, all three hundred proposals were used. The same strategy used to build the image set (non-maximum suppression plus a score threshold) could be applied to the query as well, which could be done as a next experiment. Faster R-CNN also imposed some problems, especially when training the autoencoder. Changing the dataset is another open point: I could not use a non-PASCAL VOC dataset because of time constraints in implementing the code for a new dataset, although tutorials exist for ImageNet.

For the next steps, I would highlight a technique to select the proposals to be considered in the query image, as well as a more complete set of experiments with different datasets. Overall, however, the autoencoder outperformed fc7 and PCA in the tests, showing that the approach does in fact improve object retrieval for object detectors. In terms of overfitting, the tests with a different dataset, PASCAL VOC 2012, attest that it did not damage the accuracy.