Content-Based Temporal Processing of Video

Robert A. Joyce

Ph.D. thesis
Department of Electrical Engineering
Princeton University
November, 2002

Multimedia information is most often stored, browsed, and transmitted simply as ``raw'' data: a set of opaque files. Digital video and audio in particular benefit tremendously from ``content-aware'' processing; because the salient content information is often temporal in nature, we study both the extraction and the applications of the temporal structure of media streams.

We begin by examining some of the fundamental issues behind and goals of automated temporal processing. From there, the problem of gradual transition detection in video is explored, and we present methods to detect both dissolve and wipe-based transitions, even in the presence of special graphical effects. Combining video transition detection with neural network-based predictors, we apply the principles of content-aware processing to improve the channel multiplexing efficiency of variable bit rate video streams.
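One widely used cue for dissolve detection (a standard technique in the literature, not necessarily the exact method developed in this work) is that alpha-blending two largely uncorrelated shots makes the per-frame pixel variance dip toward the midpoint of the transition, tracing a U-shaped parabola. A minimal sketch, with the window size as an illustrative parameter:

```python
import numpy as np

def dissolve_score(variances, window=15):
    """Score each frame position for a dissolve-like parabolic dip in the
    frame-variance curve.  `variances` is a 1-D array of per-frame pixel
    variances; high scores suggest a dissolve centered at that frame."""
    n = len(variances)
    scores = np.zeros(n)
    half = window // 2
    for t in range(half, n - half):
        seg = variances[t - half : t + half + 1]
        # Fit a parabola; a dissolve shows a positive quadratic term
        # (U shape) with its minimum near the window center.
        a, b, c = np.polyfit(np.arange(len(seg)), seg, 2)
        if a > 0:
            vertex = -b / (2 * a)
            # Reward parabolas whose minimum lies near the window center.
            scores[t] = a * max(0.0, 1.0 - abs(vertex - half) / half)
    return scores
```

Detectors of this kind must still be combined with further tests (edge or motion analysis, for example) to reject false alarms from fades within a single shot or from graphical effects.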

The integration of video, audio, and other data is essential to any temporal analysis of media streams. We develop segmentation methods for these modalities, as well as distance metrics between segments of the same stream. We examine the issues that arise in comparing distance metrics across modalities, and develop a normalization scheme that takes into account both the distance metrics' statistics and prior probabilities on perceptual segment distances.
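As a simple stand-in for such cross-modal normalization (a sketch of the general idea, not the scheme developed in the thesis), each modality's raw distances can be mapped to empirical percentile ranks, optionally warped by a prior on perceptual distances; the `prior_cdf` callable here is a hypothetical placeholder:

```python
import numpy as np

def normalize_distances(d, prior_cdf=None):
    """Map one modality's raw distances onto a comparable [0, 1] scale.

    Converts distances to empirical percentile ranks, so that audio and
    video distances with very different dynamic ranges become comparable.
    `prior_cdf` is a hypothetical callable that warps the ranks according
    to a prior on perceptual segment distances."""
    d = np.asarray(d, dtype=float)
    # Double argsort yields each element's rank; divide to land in [0, 1].
    ranks = d.argsort().argsort() / max(len(d) - 1, 1)
    return prior_cdf(ranks) if prior_cdf is not None else ranks
```

For example, audio distances of `[0.1, 5.0, 2.0]` and video distances of `[100, 3, 40]` both normalize to `[0.0, 1.0, 0.5]`, so segments from the two modalities can be compared on equal footing.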

Using this distance information, we construct a matrix-based representation that allows quick identification of ``idiomatic'' sequences, such as dialog or character introductions, in both audio and video. This representation also has a graphical interpretation, which allows the use of shortest-path and similar algorithms, and can associate related but visually dissimilar segments by crossing the boundary between audio and video. Such a graph is itself a useful visualization tool, as it can show transitive connections between segments that would not otherwise be clear. Using detected idiomatic sequences and other criteria, we generate a hierarchy of such graphs, which allows a user to zoom in on sections of interest without being presented with hundreds of segments at once.
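The graphical interpretation can be sketched as follows (an illustrative reconstruction under assumed conventions, not the representation defined in the thesis): treat the pairwise segment-distance matrix as a weighted graph with edges only where the distance falls below a threshold, then run a shortest-path algorithm, so that segments become transitively related even when their direct distance is large:

```python
import heapq

def segment_graph_paths(dist, start, threshold=0.5):
    """Dijkstra over a pairwise segment-distance matrix `dist`, with an
    edge between segments u and v only when dist[u][v] < threshold.
    Returns the shortest-path cost from `start` to every segment
    (inf if unreachable), exposing transitive segment relationships."""
    n = len(dist)
    cost = [float("inf")] * n
    cost[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        c, u = heapq.heappop(heap)
        if c > cost[u]:
            continue  # stale queue entry
        for v in range(n):
            w = dist[u][v]
            if v != u and w < threshold and c + w < cost[v]:
                cost[v] = c + w
                heapq.heappush(heap, (c + w, v))
    return cost
```

With three segments where 0 and 2 are directly distant (0.9) but both close to segment 1 (0.2 each), segment 2 is still reachable from segment 0 at cost 0.4, illustrating the transitive connections the graph makes visible.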