Behavior Recognition via Sparse Spatio-Temporal Features
In this work we develop a general framework for detecting and characterizing behavior from video sequences, making few underlying assumptions about the domain and subjects under observation. Consider some of the well known difficulties faced in behavior recognition. Subjects under observation can vary in posture, appearance and size. Occlusions and complex backgrounds can impede observation, and variations in the environment, such as in illumination, can further make observations difficult. Moreover, there are variations in the behaviors themselves.
The inspiration for our approach comes from object recognition methods that rely on sparsely detected features in a particular arrangement to characterize an object, e.g. [6, 1, 18]. Such approaches tend to be robust to pose, image clutter, occlusion, object variation, and the imprecise nature of the feature detectors. In short, they can provide a robust descriptor for objects without relying on too many assumptions.
We propose to characterize behavior through the use of spatio-temporal feature points (see figure 1). A spatiotemporal feature is a short, local video sequence such as an eye opening or a knee bending, or for a mouse a paw rapidly moving back and forth. A behavior is then fully described in terms of the types and locations of feature points present.
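This paper characterizes a behavior through spatio-temporal feature points. A spatio-temporal feature point is a very short, local video sequence, such as an eye blinking or a knee bending. A behavior is then described by the types and locations of its spatio-temporal feature points.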
The motivation is that an eye opening can be characterized as such regardless of global appearance, posture, nearby motion, occlusion, and so forth (for example, see figure 2). The complexity of discerning whether two behaviors are similar is shifted to the detection and description of a rich set of features.
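The gradient vectors, in turn, are obtained by taking first-order derivatives of a smoothed version of the image: $L(x, y, \sigma) = I(x, y) * g(x, y, \sigma)$, where $g$ is the Gaussian smoothing kernel.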
The response strength at each point is then based on the rank of the covariance matrix of the gradient calculated in a local window.
The general idea of interest point detection in the spatio-temporal case is similar to the spatial case. Instead of an image I(x, y), interest point detection must operate on a stack of images denoted by I(x, y, t). Localization must proceed not only along the spatial dimensions x and y but also along the temporal dimension t. Likewise, detected features also have temporal extent.
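Interest point extraction in the spatio-temporal domain is similar to extraction in the spatial domain, except that it operates not on a single image I(x, y) but on an image sequence I(x, y, t). Besides the spatial coordinates x and y, feature extraction in the spatio-temporal domain also needs time t as a dimension.
The spatio-temporal interest point detector known so far is the 3D Harris detector (an extension of the Harris corner detector to the 3D case). Its reference paper is:
In practical tests, however, the 3D Harris detector yields too few feature points and does not handle spatio-temporal feature extraction well.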
Like much of the work on interest point detectors, our response function is calculated by application of separable linear filters. We assume a stationary camera or a process that can account for camera motion. The response function has the form $R = (I * g * h_{ev})^2 + (I * g * h_{od})^2$, where $g(x, y; \sigma)$ is the 2D Gaussian smoothing kernel, applied only along the spatial dimensions, and $h_{ev}$ and $h_{od}$ are a quadrature pair [10] of 1D Gabor filters applied temporally. These are defined as $h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}$ and $h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}$. In all cases we use $\omega = 4/\tau$, effectively giving the response function R two parameters $\sigma$ and $\tau$, corresponding roughly to the spatial and temporal scale of the detector.
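The response function is $R = (I * g * h_{ev})^2 + (I * g * h_{od})^2$, where $g(x, y; \sigma)$ is the 2D Gaussian smoothing kernel, applied only in the spatial domain, and $h_{ev}$ and $h_{od}$ are 1D Gabor filters applied in the temporal domain (G. Granlund and H. Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1995.). The two filters are defined as:
$h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}$
$h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}$
$\omega = 4/\tau$
$\sigma$ and $\tau$ are the scale parameters in the spatial and temporal domains, respectively.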
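As a rough illustration, the response function can be sketched in NumPy as follows. This is not the authors' implementation: the function names, the kernel truncation radii (3σ and 2τ), the zero padding at the borders, and the sampling of t at integer frame indices are all assumptions of the sketch. Under that sampling assumption, ω = 4/τ stays below the Nyquist limit of 0.5 cycles per frame only when τ > 8, so ω is left as an overridable parameter.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    # Normalized 1D Gaussian, truncated at 3*sigma (truncation is an assumption).
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gabor_pair(tau, omega=None):
    # Quadrature pair h_ev, h_od as defined in the text; omega defaults to 4/tau.
    # t is sampled at integer frame offsets (an assumption of this sketch).
    if omega is None:
        omega = 4.0 / tau
    radius = int(np.ceil(2 * tau))
    t = np.arange(-radius, radius + 1)
    envelope = np.exp(-t**2 / tau**2)
    h_ev = -np.cos(2 * np.pi * t * omega) * envelope
    h_od = -np.sin(2 * np.pi * t * omega) * envelope
    return h_ev, h_od

def filter_axis(volume, kernel, axis):
    # 1D convolution along one axis; borders are implicitly zero padded.
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, volume)

def response(video, sigma, tau, omega=None):
    # video: float array of shape (T, H, W), each axis longer than the kernels.
    g = gaussian_kernel_1d(sigma)
    # The 2D Gaussian is separable: smooth along y, then along x.
    smoothed = filter_axis(filter_axis(video, g, 1), g, 2)
    h_ev, h_od = gabor_pair(tau, omega)
    even = filter_axis(smoothed, h_ev, 0)  # temporal axis only
    odd = filter_axis(smoothed, h_od, 0)
    return even**2 + odd**2
```

Because the two temporal filters form a quadrature pair, the summed squares respond to periodic intensity changes near frequency ω regardless of their phase, while a temporally constant signal produces almost no response.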
At each interest point (local maxima of the response function defined above), a cuboid is extracted which contains the spatio-temporally windowed pixel values. The size of the cuboid is set to contain most of the volume of data that contributed to the response function at that interest point; specifically, cuboids have a side length of approximately six times the scale at which they were detected.
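A minimal sketch of this step, assuming the response R has already been computed as a (T, H, W) array. The strict 3x3x3 neighborhood test, the threshold, the rounding of the half side length, and the choice to skip points too close to the border are illustrative assumptions, not details given in the source.

```python
import numpy as np

def local_maxima(R, threshold=0.0):
    # Points strictly greater than all 26 neighbors and above the threshold.
    T, H, W = R.shape
    padded = np.pad(R.astype(float), 1, mode="constant",
                    constant_values=-np.inf)
    is_max = R > threshold
    for dt in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dt == 0 and dy == 0 and dx == 0:
                    continue
                neighbor = padded[1 + dt:1 + dt + T,
                                  1 + dy:1 + dy + H,
                                  1 + dx:1 + dx + W]
                is_max &= R > neighbor
    return np.argwhere(is_max)  # rows of (t, y, x)

def extract_cuboids(video, points, sigma, tau, factor=6):
    # Side length is about factor * scale: spatial from sigma, temporal from tau.
    half_s = int(round(factor * sigma / 2))
    half_t = int(round(factor * tau / 2))
    T, H, W = video.shape
    cuboids = []
    for t, y, x in points:
        t0, t1 = t - half_t, t + half_t + 1
        y0, y1 = y - half_s, y + half_s + 1
        x0, x1 = x - half_s, x + half_s + 1
        # Assumption: drop interest points whose cuboid would leave the video.
        if t0 < 0 or y0 < 0 or x0 < 0 or t1 > T or y1 > H or x1 > W:
            continue
        cuboids.append(video[t0:t1, y0:y1, x0:x1].copy())
    return cuboids
```

With σ = 1 and τ = 1, for instance, each cuboid is 7 x 7 x 7 pixels, which is the nearest odd size to six times the scale.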
To compare two cuboids, a notion of similarity needs to be defined. Given the large number of cuboids we deal with in some of the datasets (on the order of $10^5$), we opted to use a descriptor that could be computed once for each cuboid and compared using Euclidean distance.
The brightness gradient is calculated at each spatio-temporal location (x, y, t), giving rise to three channels $(G_x, G_y, G_t)$, each the same size as the cuboid.
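The three gradient channels can be sketched with `numpy.gradient`, assuming the cuboid is stored as a (T, H, W) array. The axis order and the use of central differences (one-sided at the borders) are assumptions of this sketch; the source does not say which derivative filter is used.

```python
import numpy as np

def gradient_channels(cuboid):
    # np.gradient returns one array per axis, in axis order (t, y, x).
    G_t, G_y, G_x = np.gradient(np.asarray(cuboid, dtype=float))
    # Each channel has the same shape as the input cuboid.
    return G_x, G_y, G_t
```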
To extract motion information we calculate Lucas-Kanade optical flow [20] between each pair of consecutive frames, creating two channels $(V_x, V_y)$. Each channel is the same size as the cuboid, minus one frame.
To extract motion information, Lucas-Kanade optical flow is computed between each pair of consecutive video frames (B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, pages 674–679, 1981.).
We use one of three methods to create a feature vector given the transformed cuboid (or multiple resulting cuboids when using the gradient or optical flow). The simplest method involves flattening the cuboid into a vector, although the resulting vector is potentially sensitive to small cuboid perturbations. The second method involves histogramming the values in the cuboid. Such a representation is robust to perturbations but also discards all positional information (spatial and temporal). Local histograms, used as part of Lowe's 2D SIFT descriptor [19], provide a compromise solution. The cuboid is divided into a number of regions and a local histogram is created for each region. The goal is to introduce robustness to small perturbations while retaining some positional information. For all the methods, to reduce the dimensionality of the final descriptors we use PCA [12].
The third method uses the local histograms of the SIFT descriptor (D.G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov 2004.). Compared with the first two methods, it offers a compromise.
For all of the methods, PCA is used to reduce the dimensionality of the final descriptors (T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer Verlag, Basel, 2001.).
The methods mentioned above all build on research into 2D feature extraction. For a detailed discussion of that research, see: K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In CVPR, pages II: 257–263, 2003.
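The three ways of turning a transformed cuboid into a vector, followed by PCA, might look roughly like this. It is a sketch under stated assumptions: the bin count, the value range, the 2 x 2 x 2 region grid, and PCA computed through an SVD of the centered data are illustrative choices, and all function names are hypothetical.

```python
import numpy as np

def flatten(cuboid):
    # Method 1: keep every value, and with it every position.
    return np.asarray(cuboid, dtype=float).ravel()

def global_histogram(cuboid, bins=8, value_range=(0.0, 1.0)):
    # Method 2: one histogram over all values; positions are discarded.
    hist, _ = np.histogram(cuboid, bins=bins, range=value_range)
    return hist.astype(float)

def local_histograms(cuboid, splits=(2, 2, 2), bins=8, value_range=(0.0, 1.0)):
    # Method 3: one histogram per region, concatenated in a fixed order.
    parts = []
    for t_block in np.array_split(cuboid, splits[0], axis=0):
        for y_block in np.array_split(t_block, splits[1], axis=1):
            for region in np.array_split(y_block, splits[2], axis=2):
                parts.append(global_histogram(region, bins, value_range))
    return np.concatenate(parts)

def pca_reduce(X, k):
    # X: one descriptor per row. Project onto the top-k principal directions.
    mean = X.mean(axis=0)
    centered = X - mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:k]
    return centered @ components.T, mean, components
```

To project a new descriptor `d` with the same basis, one would compute `(d - mean) @ components.T`.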
In all experiments reported later in the paper we used the flattened gradient as the descriptor, which is essentially a generalization of the PCA-SIFT descriptor [15].
In all the experiments described below, we use the flattened gradient as the descriptor. It is derived from the PCA-SIFT descriptor, whose main reference is:
Recall that the descriptors we use involve first transforming the cuboid into: (1) normalized brightness, (2) gradient, or (3) windowed optical flow, followed by a conversion into a vector by (1) flattening, (2) global histogramming, or (3) local histogramming, for a total of nine methods, along with multi-dimensional histograms when they apply. Using the gradient in any form gave very reliable results, as did using the flattened vector of normalized brightness values.
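To restate the descriptor algorithm: first, the cuboid is transformed in one of three ways: (1) normalized brightness, (2) gradient, or (3) windowed optical flow. The transformed cuboid is then mapped to a feature vector by one of three methods: (1) flattening, (2) global histogramming, or (3) local histogramming. Of these combinations, the ones finally chosen apply the gradient or normalized brightness first and then flattening.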
Our approach is based on the idea that although two instances of the same behavior may vary significantly in terms of their overall appearance and motion, many of the interest points they give rise to are similar. Under this assumption, even though the number of possible cuboids is virtually unlimited, the number of different types of cuboids is relatively small. In terms of recognition, the exact form of a cuboid becomes unimportant; only its type matters.
Here the cuboids are grouped by k-means clustering, and each cuboid is assigned to one of the resulting cuboid prototypes.
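A compact sketch of the clustering step, using plain Lloyd's iterations in NumPy. The random initialization from data points, the iteration cap, and the handling of empty clusters are assumptions; the source only says that k-means is used.

```python
import numpy as np

def assign_types(X, prototypes):
    # Index of the closest prototype (squared Euclidean distance) per row of X.
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    # X: one cuboid descriptor per row. Returns k prototype vectors.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_types(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```

After training, `assign_types(descriptors, prototypes)` gives the type of each cuboid, which is all that later stages need.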
After extraction of the cuboids the original clip is discarded. The rationale for this is that once the interest points have been detected, together their local neighborhoods contain all the information necessary to characterize a behavior. Each cuboid is assigned a type by mapping it to the closest prototype vector, at which point the cuboids themselves are discarded and only their type is kept.
Here a histogram of cuboid types is used as the behavior descriptor, and the distance between two behavior descriptors is measured by the Euclidean distance or the χ² distance.
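The behavior descriptor and the two distances can be sketched as below. Normalizing the histogram to sum to one and the particular χ² form, 0.5 · Σ (a − b)² / (a + b), are assumptions of this sketch; the source does not specify either.

```python
import numpy as np

def behavior_descriptor(types, n_prototypes):
    # Histogram of cuboid types for one clip (normalization is an assumption).
    hist = np.bincount(np.asarray(types, dtype=int),
                       minlength=n_prototypes).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

def euclidean_distance(h1, h2):
    return float(np.sqrt(((h1 - h2) ** 2).sum()))

def chi2_distance(h1, h2, eps=1e-12):
    # eps avoids division by zero for bins that are empty in both histograms.
    return float(0.5 * ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum())
```

Either distance can then drive a simple classifier such as nearest neighbor over the labeled clips.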
