Example of a cat
    Look at the image to the right. Hopefully, you see a cat. Why do you see a cat? Eliding nearly all of the object recognition process, we can imagine that your visual pipeline is picking up on features it detects in the image, and with those features your brain decides that it is most likely that you are looking at a cat. Which features of the image are most salient? This is a very complex topic, but generally speaking, your vision system puts more stock in processing and interpreting the fine details of the image - fine detail in image scenes contains most of the information required for perceiving textures and physical characteristics of objects in the world, so it makes sense that we developed such a system.
    Fine detail in images is easy to conceptualize - an edge, a corner, a furry texture - but in interpreting images we make use of the entire spectrum of detail, more technically referred to as 'spatial frequency'. You can think of spatial frequency in an image as a measure of how rapidly the intensity of its pixels changes in a certain direction. For example, the transition between the different hues of orange on the cat's head can be thought of as low spatial frequency elements, while the sharp detail of his whiskers is a high spatial frequency element.
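    To make "spatial frequency" a little more concrete, here is a small sketch (my own illustration using numpy's FFT, not something from the original project) that computes an image's log-magnitude spectrum. Energy near the center of the result corresponds to low spatial frequencies (broad gradients), while energy toward the edges corresponds to high spatial frequencies (fine detail).

```python
import numpy as np

def magnitude_spectrum(gray):
    """gray: 2D float array. Returns the log-magnitude of the 2D DFT,
    with the zero (lowest) frequency shifted to the center of the array."""
    f = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(f))
```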
    Given the premise that our vision system relies primarily on high spatial frequency information to make object recognition judgements, might it be possible to embed information from one image into the low frequencies of another, such that when viewing the result under normal conditions we can still identify the object in the latter, with minimal perception of the imposter low spatial frequencies?
    The answer is 'yes'! The technique is most effective when we replace the low spatial frequency elements with those of another image that has similar structure in the very low frequency range, so that the broad, blobby information those ranges encode still correlates reasonably well with the high frequency elements of the original image. Images created this way are called hybrid images, first proposed by Oliva, Torralba, and Schyns in 2006. Below is a reproduction of the quintessential hybrid image, shown in nearly every exposition of the topic. The image shows a cat when viewed normally, but squinting blurs the image you receive, stripping away its high spatial frequencies and revealing the stealthy pup they had been concealing.
Alternatively, if you don't want to hurt your eyes squinting, you can look at the sequence of images below - recursively downsampling the image also results in the loss of high spatial frequency information, and the dog becomes visible in the images toward the right of the series.
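A downsampling sequence like this is straightforward to produce. Here is a minimal sketch (the function name and step count are my own, not from the project) that repeatedly halves the image by averaging 2x2 blocks; each halving discards the finest remaining detail, so later images keep only the low frequencies:

```python
import numpy as np

def downsample_sequence(gray, steps=4):
    """Repeatedly halve a 2D float image by averaging 2x2 blocks,
    returning the list of progressively smaller images."""
    seq = [gray]
    for _ in range(steps):
        g = seq[-1]
        h, w = g.shape[0] // 2 * 2, g.shape[1] // 2 * 2  # trim to even size
        g = g[:h, :w]
        # Average each 2x2 block into a single output pixel.
        g = 0.25 * (g[0::2, 0::2] + g[1::2, 0::2] + g[0::2, 1::2] + g[1::2, 1::2])
        seq.append(g)
    return seq
```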
    In this case, "well-aligned" refers to what was described earlier: having enough structural similarity for the illusion to be plausible - the difference has to be in the details! Below are the two images used to make the hybrid image shown above.
| Dog | Cat |
| --- | --- |
| ![]() | ![]() |
    We know that images possess information at all spatial frequencies, but how do we extract just the highs or just the lows? The key is filtering. Filtering takes one image, called a kernel or filter, and another image, which we'll call the "input", and returns an image, which we'll call the "output", that is typically the same size as the input. The output is only 'typically' the same size because matching dimensions is arguably the most useful choice for comparing the output against the input; to achieve it, the input must be padded with zeros or reflected across its boundaries during filtering so that the kernel can be superimposed on edge pixels. In implementing the general filtering algorithm used in this project, I opted to reflect the image across its boundaries, which minimizes the artifacts in the output that zero-padding and other approaches produce.
    The output is constructed in the following way: the value of any pixel (x,y) in the output is generated by centering the kernel at pixel (x,y) in the input, multiplying the corresponding pixels of the kernel and the image region it overlays, and summing all of those products. This has the nice property that filtering with a sharply peaked kernel yields an output with high intensity at the locations of high frequency elements in the input, while filtering with a smooth Gaussian kernel blurs the input, letting only low spatial frequency information pass to the output. Below is a table of filter/output pairs for the cat image at the top of the page; it should give some intuition about how the shape of the filter determines the look of the output.
| | Identity | Box Blur | Large Gaussian Blur | Sobel Horizontal Gradient | Laplacian High-Pass Filter |
| --- | --- | --- | --- | --- | --- |
| Kernel | ![]() | ![]() | ![]() | ![]() | ![]() |
| Output | ![]() | ![]() | ![]() | ![]() | ![]() |
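    As a concrete sketch of the procedure described above, here is a direct (unoptimized) implementation, assuming the image and kernel are 2D float numpy arrays; the function and variable names are my own, and a production version would vectorize this or use a library routine such as scipy.ndimage.correlate:

```python
import numpy as np

def filter2d(image, kernel):
    """Filter a 2D grayscale image with a kernel, reflecting the image
    across its borders so the output matches the input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Reflect the image across its boundaries to handle edge pixels.
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="reflect")
    out = np.empty_like(image, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            # Center the kernel at (x, y), multiply the overlapping
            # pixels, and sum the products.
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out
```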
| | Low Spatial Frequencies | High Spatial Frequencies |
| --- | --- | --- |
| Kernel | ![]() | ![]() |
| Output | ![]() | ![]() |
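    The two kernels above are presumably a Gaussian low-pass filter and its complementary high-pass filter. Here is a sketch of how such a pair might be built (a common construction; the sizes and sigmas are placeholders of mine, not the project's actual parameters):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2D Gaussian; passes low spatial frequencies."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def highpass_kernel(size, sigma):
    """Impulse minus Gaussian: filtering with this subtracts the blurred
    (low-frequency) content from the image, leaving only the highs."""
    k = -gaussian_kernel(size, sigma)
    k[size // 2, size // 2] += 1.0
    return k
```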
    Now that we have extracted the high and low frequencies from the two images, all that remains is to combine the two outputs into one. It turns out that a simple pixelwise addition is all we need for this final step in constructing our hybrid image.
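    Using the helpers sketched earlier (filter2d, gaussian_kernel, and highpass_kernel are my names, not the project's), the final combination might look like this; dog and cat are assumed to be grayscale float arrays in [0, 1], and the kernel sizes and sigmas are placeholders:

```python
import numpy as np

# dog, cat: grayscale float arrays in [0, 1], loaded elsewhere (hypothetical).
low = filter2d(dog, gaussian_kernel(31, sigma=8.0))    # keep the dog's low frequencies
high = filter2d(cat, highpass_kernel(31, sigma=4.0))   # keep the cat's high frequencies
hybrid = np.clip(low + high, 0.0, 1.0)                 # simple pixelwise sum
```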
| | Low Spatial Frequencies | High Spatial Frequencies |
| --- | --- | --- |
| Output | ![]() | ![]() |
| Downsamples | ![]() | |

| | Low Spatial Frequencies | High Spatial Frequencies |
| --- | --- | --- |
| Output | ![]() | ![]() |
| Downsamples | ![]() | |

| | Low Spatial Frequencies | High Spatial Frequencies |
| --- | --- | --- |
| Output | ![]() | ![]() |
| Downsamples | ![]() | |

| | Low Spatial Frequencies | High Spatial Frequencies |
| --- | --- | --- |
| Output | ![]() | ![]() |
| Downsamples | ![]() | |