This document provides an overview of how filters -- a concept drawn from signal processing -- and receptive fields -- a concept drawn from the study of biological vision -- relate to what we're doing in developing a computational model of the visual cortex. The task is made difficult by the fact that these concepts have become muddied and overused and that the relevant fields of study have grown enormously in recent years; this document will, I hope, give you some guidance in navigating the increasingly intertwined fields of biological and machine vision.
Filters are typically introduced to engineers in the context of signal processing. Students are expected to have been exposed to systems theory and in particular linear systems and basic concepts from control theory such as transfer functions and frequency domain analysis. In signal processing, continuous signals are typically sampled and digital signal processing is the norm. Matlab has extensive digital signal processing capabilities in its various toolboxes -- one of its strengths and a reason why Matlab is adopted in so many engineering curricula.
Engineers are taught to move easily back and forth between the time domain and the frequency domain; the Laplace transform and its inverse allow this for continuous signals, and the z-transform and its inverse serve the same purpose for sampled signals. Some calculations that are difficult to perform in the time domain turn out to be simple in the frequency domain, e.g., convolutions in the time domain become simple products in the frequency domain. I've studied control theory, but I don't pretend to be generally conversant in the use of the z-transform.
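As a concrete illustration of that frequency-domain shortcut, here is a minimal numerical sketch (in Python with NumPy, an illustrative choice -- the text itself is tool-agnostic) checking that convolution in the time domain matches multiplication of the zero-padded discrete Fourier transforms:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, -0.5])           # a simple differencing filter

# Time-domain (linear) convolution.
y_time = np.convolve(x, h)

# Frequency domain: zero-pad both signals to the full output length,
# multiply their DFTs, and transform back.
n = len(x) + len(h) - 1
y_freq = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)).real

print(np.allclose(y_time, y_freq))  # True
```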
Another important tool in signal processing is the Fourier transform, its inverse, and their discrete analogs. The Fourier transform allows us to express an arbitrary function (or signal) as a linear combination of a general set of sinusoidal basis functions. In some cases, Fourier analysis allows us to express a complex signal in terms of a small number of coefficients, thereby enabling a compact representation or compression of the signal. The Fourier transform works well for stationary signals -- signals whose frequency content doesn't change over time -- but it is less useful for the non-stationary, time-varying signals of the sort that we're most interested in.
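A small sketch of that compression idea (again Python/NumPy; the particular signal and the number of retained coefficients are made up for illustration): build a signal from two sinusoids, keep only the largest-magnitude Fourier coefficients, and reconstruct from that compact representation.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 256, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

coeffs = np.fft.fft(signal)

# Zero out all but the 8 largest-magnitude coefficients.
k = 8
keep = np.argsort(np.abs(coeffs))[-k:]
compressed = np.zeros_like(coeffs)
compressed[keep] = coeffs[keep]

reconstruction = np.fft.ifft(compressed).real
print("max error:", np.max(np.abs(signal - reconstruction)))  # near zero
```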
Wavelets, in particular the Haar and Daubechies wavelets, are especially useful in dealing with time-varying signals. Wavelets and the Fourier transform are often used to decompose a signal into a simpler form and then reconstruct the signal at a later time; an obvious application arises when decomposition corresponds to compression or coding and reconstruction to decompression or decoding -- an application in which the following issues are important (a minimal Haar decomposition and reconstruction is sketched just after this list):
how compact is the representation resulting from decomposition
how much time does it require to perform the decomposition
how accurately does the decomposition represent the signal
how much time does it require to perform the reconstruction
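Here is a minimal sketch of one level of the Haar decomposition and its exact reconstruction (Python/NumPy; the particular signal is made up for illustration). The detail coefficients are mostly zero where the signal is locally flat, which is where the compactness of the representation comes from:

```python
import numpy as np

def haar_decompose(x):
    """One level of the (unnormalized) Haar transform; len(x) must be even."""
    x = np.asarray(x, dtype=float)
    averages = (x[0::2] + x[1::2]) / 2.0   # coarse approximation
    details = (x[0::2] - x[1::2]) / 2.0    # detail coefficients
    return averages, details

def haar_reconstruct(averages, details):
    """Invert haar_decompose exactly."""
    x = np.empty(2 * len(averages))
    x[0::2] = averages + details
    x[1::2] = averages - details
    return x

signal = np.array([4.0, 4.0, 5.0, 5.0, 7.0, 1.0, 2.0, 2.0])
avg, det = haar_decompose(signal)
print(avg)   # [4. 5. 4. 2.]
print(det)   # [0. 0. 3. 0.]  -- zero wherever the signal is locally flat
print(np.allclose(signal, haar_reconstruct(avg, det)))  # True
```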
Our interest in filters stems from our wanting to capture key features of images and image sequences. Hence our interests are more in line with those of an engineer trying to eliminate noise from a signal than with those of an engineer trying to design a more efficient codec (for compressor-decompressor or coder-decoder).
The notion of applying a filter to a signal is as general as applying a function or procedure to a signal; e.g., you might apply a differencing operator to a time series $x_1, x_2, \ldots, x_n$ to obtain the filtered series $(x_1 - x_2), (x_2 - x_3), \ldots, (x_{n-1} - x_n)$, or apply an averaging operator to obtain the series $(x_1 + x_2)/2, (x_2 + x_3)/2, \ldots, (x_{n-1} + x_n)/2$. We're interested in capturing particular features of the input -- features that we hope will simplify extracting useful patterns from the data. The danger in specifying a particular set of features in advance is that you might overlook key features and thus undermine your ability to recognize important patterns. It is for this reason that engineers try to ensure that they have a complete basis, preferably one in which the basis functions are independent of one another or orthonormal.
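For concreteness, here are the two operators just described, written out in Python/NumPy (an illustrative choice); note that both are simply convolutions with two-tap kernels:

```python
import numpy as np

x = np.array([3.0, 5.0, 4.0, 8.0, 6.0])

differences = x[:-1] - x[1:]         # (x_1 - x_2), (x_2 - x_3), ..., (x_{n-1} - x_n)
averages = (x[:-1] + x[1:]) / 2.0    # (x_1 + x_2)/2, ..., (x_{n-1} + x_n)/2

# Both are convolutions with short kernels (np.convolve flips its kernel,
# hence [-1, 1] rather than [1, -1] for the differencing operator).
print(np.allclose(differences, np.convolve(x, [-1.0, 1.0], mode='valid')))  # True
print(np.allclose(averages, np.convolve(x, [0.5, 0.5], mode='valid')))      # True
```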
There are many such complete and orthonormal sets of basis functions, including the Daubechies family of wavelets (which includes the Haar wavelet); Gabor functions provide another widely used family of local filters. Unlike the sinusoidal basis used in Fourier analysis, Gabor filters and wavelets in the Daubechies family are local -- essentially zero everywhere except on a small interval about zero. Imagine a sinusoid multiplied by a Gaussian -- a wave function in a Gaussian envelope is how you might see it characterized. Gabor filters have real and imaginary parts that can be used to capture symmetric and antisymmetric image features, respectively. In many wavelet compression schemes, all of the component filters are characterized in terms of translations, rotations and scalings of a single basis -- or mother -- wavelet. If you want to learn more about wavelets, use Matlab Help; in particular, if you want to understand the reference to wavelet de-noising in the Roth and Black [7] paper, type ``De-Noising Images'' into the search window in the Matlab Help Navigator.
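A sketch of a one-dimensional complex Gabor filter (Python/NumPy; the particular length, bandwidth, and frequency are arbitrary illustrative values): a complex sinusoid inside a Gaussian envelope, whose real part is even-symmetric and whose imaginary part is odd-symmetric about the center.

```python
import numpy as np

def gabor_1d(length=65, sigma=8.0, frequency=0.1):
    """Complex Gabor filter sampled at `length` points centred on zero."""
    t = np.arange(length) - (length - 1) / 2.0
    envelope = np.exp(-t**2 / (2.0 * sigma**2))    # Gaussian envelope
    carrier = np.exp(2j * np.pi * frequency * t)   # complex sinusoid
    return envelope * carrier

g = gabor_1d()
even_part = g.real   # symmetric: responds to even (bar-like) structure
odd_part = g.imag    # antisymmetric: responds to odd (edge-like) structure
print(np.allclose(even_part, even_part[::-1]))   # True
print(np.allclose(odd_part, -odd_part[::-1]))    # True
```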
In vision research, the terms filter and receptive field are often used interchangeably, reflecting their origins in signal/image processing and psychology/neuroscience respectively. A receptive field refers to the portion of an organism's sensory input space that serves to determine the activity of a particular cell or neuron. Suppose we position an experimental subject in front of a computer screen displaying various patterns. The displayed patterns are projected onto the retina, which is the relevant sensory surface in this case. The receptive field of a particular cell in the visual cortex corresponds to that portion of the sensory surface that determines the cell's response, e.g., its firing rate (see John Krantz's tutorial on receptive fields for more information).
In psychophysics1, experimenters use microelectrodes inserted in the brain to measure the response of cells while stimulating their receptive fields, typically with simple images. In the machine and biological vision literature, you'll often see references to the work of Hubel and Wiesel mapping cells in the primary visual cortex -- known as the striate cortex or V1 -- to their corresponding receptive fields in the retina. Some mathematical psychophysicists posit that particular cells in V1 implement a basis consisting of Gabor functions, or at least behave as if they did (check out Christoph von der Malsburg's web pages on Gabor Wavelets if you want to learn more about Gabor filters and image processing in the visual cortex).
Hubel and Wiesel suggested that such cells serve as edge detectors. Taking a hint from research in psychophysics, machine vision researchers have used Gabor filters to detect edges, but Gabor filters are not the only filters used for such purposes. For more information on how various filters are used for edge detection, see a quick introduction to the mathematics of edge detection2 and Robert Fisher's web pages on the Sobel edge detector and the related Prewitt gradient edge detector. Kwabena Boahen [1] gives an account of cortical maps from the point of view of an engineer attempting to build an implantable retina chip, including an interesting description of how cells in the retina are wired during development to produce the retinotopic map that Hubel and Wiesel identified.3 Edge detection is but one step in extracting useful information from images, and, while edges are important, there are many other image features whose presence we might want to detect for purposes of inference, e.g., features concerning texture, color, etc.
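To make the Sobel detector concrete, here is a small sketch (Python/NumPy; the toy image and the helper function are illustrative, while the kernels are the standard Sobel kernels). The gradient magnitude is large only near the edge:

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2-D correlation with a 3x3 kernel, written out explicitly."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# A toy image: dark on the left, bright on the right (a vertical edge).
image = np.zeros((8, 8))
image[:, 4:] = 1.0

gx = correlate2d_valid(image, sobel_x)  # horizontal gradient
gy = correlate2d_valid(image, sobel_y)  # vertical gradient
magnitude = np.hypot(gx, gy)            # edge strength
print(magnitude.round(1))               # large values only in the edge columns
```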
It's interesting to note that George and Hawkins [4] dispense with the need for filters/features altogether by specifying the most primitive level in their graphical model as corresponding to four-by-four image regions, tiling each 32-by-32-pixel image to get sixty-four non-overlapping regions. This approach is fraught with potential problems, not the least of which is that the size of the state space for each node at the lowest level is $2^{16}$ (16 binary-valued pixels). As a practical expedient, they identify a subset of this sample space corresponding to the four-by-four regions that occur most often in the training data. Since noisy images will contain regions that don't correspond to elements of this subset, they have to use various ad hoc methods to compensate, e.g., finding the element nearest in terms of Hamming distance. Even with a sample space much smaller than $2^{16}$, they run into problems in estimating the parameters of the conditional probability tables, typically ending up with many zero entries due to the relatively small number of samples acquired during training.
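The following sketch (Python/NumPy; the codebook size, random training data, and function names are illustrative assumptions, not the George and Hawkins implementation) shows the kind of preprocessing described above: tiling a 32-by-32 binary image into sixty-four non-overlapping four-by-four regions, keeping the most frequently occurring regions, and mapping a novel region to its nearest kept region by Hamming distance.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
images = rng.integers(0, 2, size=(100, 32, 32))   # stand-in binary training images

def patches(img):
    """Yield the 64 non-overlapping 4x4 patches of a 32x32 image as bit tuples."""
    for i in range(0, 32, 4):
        for j in range(0, 32, 4):
            yield tuple(img[i:i + 4, j:j + 4].ravel())

# Keep only the most frequently occurring patches: a tiny subset of the
# 2**16 possible 4x4 binary patterns.
counts = Counter(p for img in images for p in patches(img))
codebook = np.array([p for p, _ in counts.most_common(256)])

def nearest_by_hamming(patch):
    """Index of the codebook entry with the fewest differing pixels."""
    distances = np.sum(codebook != np.array(patch), axis=1)
    return int(np.argmin(distances))

noisy_patch = tuple(rng.integers(0, 2, size=16))
print(nearest_by_hamming(noisy_patch))
```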
The George and Hawkins learning algorithm is attractive because it is particularly simple and unsupervised.4 The trick will be to devise an alternative learning algorithm that builds a more robust graphical model but retains the best features of the George and Hawkins algorithm, e.g., that it is on-line, incremental and operates in stages that mimic biological development.
In discussing the problem of learning priors for the Lee and Mumford model [6] in [2], I suggested that Stefan Roth and Michael Black's Fields of Experts model [7] might be of some help. Stefan now has a Fields of Experts web page that includes image data and Matlab code. It's worth thinking about how we might use the filters they learn as the basis for the lowest level in the Lee and Mumford model. Stu Geman's invited talk at CVPR 2004 [3] and Gimelfarb's paper [5] on texture modeling may prove relevant to creating a basis for the next-to-lowest level.
1 Psychophysics is a branch of psychology dealing with the relationship between physical stimuli and their perception. Historically, psychophysics has tended to emphasize biological vision, but its methods are applicable to any sort of biological perception.
2 The figures on applying the Canny filter are missing from this introduction, but see Robert Fisher's web pages on the Canny edge detector, or search for ``Canny'' in Matlab Help and you'll find examples in the documentation for the Image Processing Toolbox.
3 The primary visual cortex has a retinotopic map such that each region in the retinal visual field maps directly to a region of the primary visual cortex.
4 Supervised learning characterizes algorithms for learning functions from training data consisting of input/output pairs drawn from the target function. The training data is said to be labeled in this case, and the objective is to infer the target function -- or a good approximation to it -- from the data. In unsupervised learning the training data is not labeled (no outputs are provided) and the task is to infer some structure in the unlabeled data (inputs). Clustering algorithms are typically unsupervised. Reinforcement learning algorithms are sometimes characterized as unsupervised, even though some hints as to the desired outputs are provided in the reinforcement signal.