Introduction

The objective of this video texture extraction algorithm is to construct a representation of the patterns and transitions in a video sequence for use in interesting applications. One application that will be presented is synthesis, where a new video sequence of arbitrary length is generated from the frames of the original. Another application that will be shown is using the video texture information to generate a dynamic visualization when combined with analysis of music. The implementation described below is mainly based on the algorithm presented by Schodl et al. [1].


Video Texture Extraction Algorithm

The following steps describe generating the video texture for use in synthesis and music visualization.
  1. Calculate L2 distance between all frame subsequences

    For each video frame i, a local subsequence centered on the very next frame, i+1, is compared to the subsequence centered on every other frame j. The metric used for comparison is the sum of squared differences (SSD). Using a subsequence instead of comparing single frames helps preserve motion dynamics. The result of these distance calculations is an n x n transition cost matrix, where n is the total number of frames (see the first sketch after this list).

  2. Propagate cost of future transitions

    The transition cost matrix is then modified by adding the cost of future transitions, as shown below.

    Ci→j ← Ci→j + α · minₖ(Cj→k)

    Ci→j is the cost of transitioning from frame i to frame j, and minₖ(Cj→k) is the future cost of the best transition out of frame j. It is assumed that this future transition will be taken in a greedy manner, with minimal L2 distance cost. A discount factor, α, of 0.999 is used, and the update is applied over a preset number of iterations to modify the transition matrix (see the second sketch after this list). An example of the modified transition cost matrix can be seen below.

    Original Transition Costs (left) vs. Modified Transition Costs (right)

  3. Convert costs to transition probabilities

    The probability of a transition, Pi→j, is inversely related to its cost as shown below.

    Pi→j = e^(-Ci→j / σ²)

    σ is a parameterized value; as it approaches 0, the probability mass concentrates on the best (lowest-cost) transitions. The probabilities are then normalized, and all probabilities below a threshold of 0.1 are removed to eliminate the possibility of making a costly bad transition. The result is a fairly sparse matrix of transition probabilities for each frame (see the third sketch after this list).
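
The sketch below illustrates step 1, assuming grayscale frames in a NumPy array; the function name and the uniform subsequence weighting are illustrative simplifications (Schodl et al. [1] weight the window with a binomial kernel).

    import numpy as np

    def transition_costs(frames, window=2):
        # Step 1: cost of jumping from frame i to frame j, measured as the
        # average SSD between the subsequence around frame i+1 and the
        # subsequence around frame j.
        n = len(frames)
        flat = frames.reshape(n, -1).astype(np.float64)
        sq = (flat ** 2).sum(axis=1)
        d = sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T  # pairwise SSD
        cost = np.zeros((n, n))
        idx = np.arange(n)
        offsets = range(-window, window + 1)
        for o in offsets:
            ii = np.clip(idx + 1 + o, 0, n - 1)  # window around frame i+1
            jj = np.clip(idx + o, 0, n - 1)      # window around frame j
            cost += d[np.ix_(ii, jj)]
        return cost / len(offsets)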
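
Step 2 can then be sketched as the fixed-point iteration below; the iteration count stands in for the preset number mentioned above.

    import numpy as np

    def propagate_future_costs(cost, alpha=0.999, iters=100):
        # Step 2: fold anticipated future costs into each transition by
        # iterating C[i, j] = C0[i, j] + alpha * min_k C[j, k], assuming
        # the transition out of frame j is taken greedily.
        c = cost.copy()
        for _ in range(iters):
            future = c.min(axis=1)       # min_k C[j, k] for every frame j
            c = cost + alpha * future[None, :]
        return c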
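
Finally, a sketch of step 3; the default σ (tied to the mean cost) and the rule of never pruning a row's single best transition are assumptions added so the sketch stays well-defined, not details from the write-up.

    import numpy as np

    def transition_probabilities(cost, sigma=None, threshold=0.1):
        # Step 3: map costs to probabilities via P = exp(-C / sigma^2),
        # normalize each row, zero out entries below the threshold, and
        # renormalize, leaving a sparse row of probabilities per frame.
        if sigma is None:
            sigma = np.sqrt(cost.mean())              # illustrative default
        p = np.exp(-cost / sigma ** 2)
        p /= p.sum(axis=1, keepdims=True)
        keep = p >= threshold
        keep[np.arange(len(p)), p.argmax(axis=1)] = True  # keep row's best
        p = np.where(keep, p, 0.0)
        return p / p.sum(axis=1, keepdims=True)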


Synthesis

The transition probabilities can be directly used to generate an infinitely long video sequence in a Monte Carlo fashion. After starting with an initial video frame, the next frame is simply chosen based on Pi→j. Some examples are shown below.
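
A minimal sketch of this random walk, assuming the probability matrix from the sketches above (the function name and default length are illustrative):

    import numpy as np

    def synthesize(p, start=0, length=300, rng=None):
        # Monte Carlo synthesis: starting from an initial frame, repeatedly
        # sample the next frame index from the current row of P.
        rng = np.random.default_rng() if rng is None else rng
        seq = [start]
        for _ in range(length - 1):
            seq.append(int(rng.choice(p.shape[0], p=p[seq[-1]])))
        return seq  # indices into the original frame array

Rendering the result is then just indexing the original frames with the returned sequence, e.g. frames[synthesize(p, length=900)] for thirty seconds of output at 30 fps.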

Candle (Input)

Candle (Synthesis)

In the candle example, it can be seen that the generated video sequence tends to settle into stable, repeating patterns of high-probability transitions. However, occasionally a lower-probability transition occurs, such as at 1:20-1:25.

Pendulum (Input)

Pendulum (Synthesis)

In the pendulum example, it can be seen that the motion dynamics have been preserved. However, the generated video sequence is not perfect: it occasionally moves at an unrealistic speed or stutters slightly, likely due to lower-probability transitions that preserve the dynamics less well.


Music Visualization

The following steps describe combining the extracted video texture with an analysis of a music track to generate a dynamic visualization.
  1. Find notes based on audio signal

    A derivative-of-Gaussian filter is convolved with both channels of the input audio signal to determine the edges of the notes and their corresponding amplitudes and times (see the first sketch after this list). The notes detected from the first 32 seconds of "Galvanize" by The Chemical Brothers are shown below.

  2. Select a video frame to correspond to zero audio signal

    A video frame is selected to correspond with having zero audio signal. In the "Galvanize" example shown later, the first frame of the input pendulum video is used as this zero signal video frame.

  3. Use L2 distance from the zero signal video frame when making random transitions

    The unmodified transition cost matrix from the video texture extraction provides the L2 distance between every frame and the zero signal video frame. An example of this matrix is shown below.

    Transitions are then made using the probability matrix from the video texture extraction while taking into account the current audio signal and the distance from the zero signal video frame (see the second sketch after this list). When a note plays, video frame transitions are made towards a target frame that corresponds with the current audio signal: a higher-amplitude signal corresponds with a target frame that is further, in L2 distance, from the zero signal video frame. If a new note occurs before the target frame for the current note has been reached, the transitions instead head towards the target frame for the new note.
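
A sketch of the note detection in step 1, assuming a mono mix of the two channels; the filter width, threshold, and minimum note gap are illustrative parameters, as the exact settings are not given above.

    import numpy as np
    from scipy.signal import fftconvolve

    def detect_notes(signal, rate, sigma_s=0.01, min_gap_s=0.05):
        # Step 1: convolve the rectified signal with a derivative-of-
        # Gaussian filter; peaks in the response mark note onsets.
        t = np.arange(-4 * sigma_s, 4 * sigma_s, 1.0 / rate)
        dog = -t / sigma_s ** 2 * np.exp(-t ** 2 / (2 * sigma_s ** 2))
        edges = fftconvolve(np.abs(signal), dog, mode="same")
        thresh = 0.5 * edges.max()            # illustrative threshold
        gap = int(min_gap_s * rate)
        times, amps = [], []
        i = 1
        while i < len(edges) - 1:
            if (edges[i] > thresh and edges[i] >= edges[i - 1]
                    and edges[i] >= edges[i + 1]):
                times.append(i / rate)        # note time in seconds
                amps.append(edges[i])         # note amplitude
                i += gap                      # enforce a minimum note gap
            else:
                i += 1
        return np.array(times), np.array(amps)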
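
The write-up does not spell out the exact rule for heading towards the target frame, so the sketch of step 3 below uses one plausible scheme: the target is the frame whose L2 distance from the zero signal frame best matches the note amplitude, and each row of the probability matrix is reweighted by proximity (in frame index) to that target.

    import numpy as np

    def visualize(p, dist_to_zero, times, amps, fps, n_out, rng=None):
        # dist_to_zero[j]: unmodified L2 cost between frame j and the
        # zero signal frame. Louder notes map to targets further from it.
        rng = np.random.default_rng() if rng is None else rng
        cur, target, k = 0, 0, 0
        seq = []
        for f in range(n_out):
            while k < len(times) and times[k] <= f / fps:
                want = (amps[k] / amps.max()) * dist_to_zero.max()
                target = int(np.argmin(np.abs(dist_to_zero - want)))
                k += 1                        # retarget on each new note
            allowed = np.flatnonzero(p[cur])  # surviving transitions
            w = p[cur, allowed] / (1.0 + np.abs(allowed - target))
            cur = int(rng.choice(allowed, p=w / w.sum()))
            seq.append(cur)
        return seq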

Pendulum Music Visualization

In the example shown below, the input pendulum video has had its video texture combined with the track "Galvanize" by The Chemical Brothers.


References

[1] A. Schodl, R. Szeliski, D. H. Salesin, and I. Essa. Video textures. In Computer Graphics (SIGGRAPH'2000 Proceedings), pages 489-498, New Orleans, July 2000. ACM SIGGRAPH.

[2] Hays, J. Data-driven methods: Video Textures [PowerPoint slides]. Retrieved from http://cs.brown.edu/courses/csci1290/lectures/12.ppt.

[3] Gostin, A. Piano Note Detection. Retrieved from http://cnx.org/content/m14196/latest/.