Extracting informative representations from data is a critical task in many visual learning applications, as it bridges the gap between low-level observed data and high-level semantic knowledge. Many traditional visual learning algorithms impose strong assumptions on the underlying distribution of the data. In practice, however, the data might be corrupted, contaminated with severe noise, or captured by different types of sensors, any of which can violate these assumptions. As a result, it is of great importance to learn robust data representations that can handle noisy visual data effectively and efficiently.
Recent advances in low-rank and sparse modeling have shown promising performance in recovering clean data from noisy observations, which motivates us to develop new models for robust visual learning. This dissertation focuses on extracting mid-level feature representations from visual data such as images and videos. The research goals of this dissertation are twofold: (1) learning robust data representations from visual data by exploiting low-dimensional subspace structures; and (2) evaluating the performance of the learned data representations on various image and video analytics tasks.
Three types of data representations are studied in this dissertation: graphs, subspaces, and dictionaries. First, two novel graph construction schemes are proposed by integrating low-rank modeling with graph sparsification strategies. Each sample is represented in the low-rank coding space, and similarity measured in this space is shown to be more robust than similarity measured in the original sample space. The proposed graphs greatly enhance the performance of graph-based clustering and semi-supervised classification. Second, low-dimensional discriminative subspaces are learned in single-view and multi-view scenarios, respectively. The single-view robust subspace discovery model is motivated by low-rank modeling and the Fisher criterion, and it is able to accurately classify noisy images. The multi-view subspace learning model is designed to extract compact features from multimodal time series data; it leverages a shared latent space and fuses information from multiple data views. Third, dictionaries serve as expressive bases for characterizing visual data. A non-negative dictionary with Laplacian regularization is learned to extract robust features from human motion videos, which leads to promising motion segmentation results. In addition, a robust dictionary learning method is designed to transfer knowledge from a source domain to a target domain with limited training samples.
In summary, this dissertation aims to address the challenges of processing noisy visual data captured in the real world. The proposed robust data representations have shown promising performance in a wide range of visual learning tasks, such as image clustering, face recognition, human motion segmentation, and multimodal classification.
- Professor Yun Raymond Fu (Advisor)
- Professor Jennifer G. Dy
- Professor Lu Wang