Embedding local properties of an image, such as its color intensities or gradient magnitudes and orientations, into a representative feature is a critical component of computer vision tasks, e.g., detection, classification, and tracking. A feature that is representative yet invariant to nuisance factors supports the downstream modules and leads to better performance. Statistical moments have been used to build such descriptors because they provide a quantitative measure of the shape of the underlying data distribution. Examples include the covariance matrix feature, bilinear pooling encoding, and Gaussian descriptors.
However, these features are currently limited to moments of up to second order and can therefore be poor descriptors of non-Gaussian distributions. This dissertation examines this problem in depth and identifies possible solutions. In particular, we propose feature descriptors based on the empirical moment matrix, which gathers higher-order moments and embeds them into the manifold of symmetric positive definite (SPD) matrices. The effectiveness of the proposed approach is illustrated on two computer vision problems: person re-identification (re-ID) and fine-grained classification.
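To make the construction concrete, the sketch below computes an empirical moment matrix M_d = (1/N) Σ_i v_d(x_i) v_d(x_i)^T, where v_d(x) stacks all monomials of x up to degree d. This is an illustration only: the variable names, monomial ordering, and normalization are our own and may differ from the thesis's exact formulation. By construction M_d is symmetric positive semidefinite, and for generic data it is positive definite, so it lives on the SPD manifold.

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_vector(x, d):
    """Stack all monomials of x up to degree d, starting with the constant 1."""
    terms = [1.0]
    for k in range(1, d + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            terms.append(np.prod(x[list(idx)]))
    return np.array(terms)

def moment_matrix(X, d=2):
    """Empirical moment matrix M_d = (1/N) sum_i v_d(x_i) v_d(x_i)^T."""
    V = np.stack([monomial_vector(x, d) for x in X])
    return V.T @ V / len(X)

# Toy pixel-feature samples: N = 500 points in R^3 (e.g., three color channels).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
M = moment_matrix(X, d=2)
print(M.shape)  # (10, 10): 1 constant + 3 linear + 6 quadratic monomials
print(np.allclose(M, M.T))                 # symmetric
print(np.all(np.linalg.eigvalsh(M) > 0))   # positive definite for generic data
```

With d = 1 the moment matrix reduces to (a version of) the familiar covariance/Gaussian descriptor; increasing d injects higher-order moment information while keeping the SPD structure.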
Person re-ID is the problem of matching images of a pedestrian across cameras with non-overlapping fields of view. Due to the extremely large intra-class variations across different cameras (e.g., in pose, illumination, and viewpoint), the performance of state-of-the-art person re-ID algorithms is still far from ideal. This dissertation proposes a novel descriptor, based on the on-manifold mean of a moment matrix (moM), which approximates the complex, non-Gaussian distributions of the pixel features within a mid-sized local patch. Extensive experiments on five widely used public re-ID datasets and a systematic benchmark on a new large-scale dataset demonstrate improved re-ID performance using moM.
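Because SPD matrices do not form a vector space, averaging them requires an on-manifold mean rather than an entrywise arithmetic mean. As an illustration only, the sketch below uses the log-Euclidean mean (the matrix exponential of the average matrix logarithm), one common choice on the SPD manifold; it is not necessarily the exact mean used for moM in the thesis.

```python
import numpy as np

def spd_log(M):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, U = np.linalg.eigh(M)
    return (U * np.log(w)) @ U.T

def sym_exp(S):
    """Matrix exponential of a symmetric matrix (result is SPD)."""
    w, U = np.linalg.eigh(S)
    return (U * np.exp(w)) @ U.T

def log_euclidean_mean(mats):
    """Log-Euclidean mean: exp of the arithmetic mean of matrix logs."""
    return sym_exp(np.mean([spd_log(M) for M in mats], axis=0))

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)); A = A @ A.T + np.eye(4)  # random SPD matrices
B = rng.normal(size=(4, 4)); B = B @ B.T + np.eye(4)
M = log_euclidean_mean([A, B])
print(np.allclose(M, M.T))                    # the mean stays symmetric
print(np.all(np.linalg.eigvalsh(M) > 0))      # and positive definite
print(np.allclose(log_euclidean_mean([A, A]), A))  # mean of identical inputs
```

Unlike the arithmetic mean, this mean respects the manifold geometry: it always returns an SPD matrix and is invariant to inversion of the inputs.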
Unlike general object recognition, fine-grained classification aims to distinguish objects at the sub-category level, such as different makes of cars or different species of birds. The main challenge of this task is the relatively small inter-class yet large intra-class variations. The most successful approaches use deep convolutional neural networks (CNNs), in which the convolutional layers extract local representations and the fully connected layers perform an encoding step. Bilinear pooling and Gaussian embedding have been shown to be the best encoding options, but at the price of an enormous feature dimensionality. Previous research has explored approximate compact pooling methods and matrix normalization as two separate remedies for this weakness, each yielding significant performance gains; however, their combination has not been explored. In this thesis, we unify the bilinear pooling layer and the global Gaussian embedding layer through the empirical moment matrix in a novel deep architecture, the moment embedding network (MoNet). We also propose a novel sub-matrix square-root layer that normalizes the output of the convolutional layer directly and mitigates the dimensionality problem together with off-the-shelf compact pooling methods. Experiments on three widely used fine-grained classification datasets show that MoNet achieves similar or better performance than state-of-the-art architectures. Furthermore, when combined with compact pooling techniques, it obtains comparable performance with only 4% of the dimensions.
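The MoNet layers themselves are defined in the thesis; as a generic, assumption-level sketch, the following shows bilinear (second-order) pooling of convolutional features followed by matrix square-root normalization. The names and the simple eigendecomposition-based square root are our own illustration, not the thesis's sub-matrix square-root layer.

```python
import numpy as np

def bilinear_pool(F):
    """Bilinear pooling of conv features F (N spatial locations x C channels)."""
    return F.T @ F / F.shape[0]          # C x C second-order statistic

def matrix_sqrt_sym(G, eps=1e-12):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, U = np.linalg.eigh(G)
    return (U * np.sqrt(np.clip(w, eps, None))) @ U.T

rng = np.random.default_rng(0)
F = rng.normal(size=(196, 8))            # e.g., a 14x14 spatial grid, 8 channels
G = bilinear_pool(F)
Y = matrix_sqrt_sym(G)
print(np.allclose(Y @ Y, G, atol=1e-8))  # Y is the PSD square root of G
feat = Y[np.triu_indices(8)]             # vectorize the upper triangle
print(feat.shape)                        # (36,): quadratic in channel count
```

The quadratic growth of the descriptor with the channel count C (C(C+1)/2 entries here, far larger for real networks with C in the hundreds) is exactly the dimensionality problem that compact pooling methods are meant to mitigate.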
- Professor Octavia Camps (Advisor)
- Professor Jennifer Dy
- Professor Richard J. Radke
- Professor Mario Sznaier