Feature Subset Selection and Order Identification for Unsupervised Learning

This paper explores the problem of feature subset selection for unsupervised learning within the wrapper framework. In particular, we examine feature subset selection wrapped around expectation-maximization (EM) clustering with order identification (identifying the number of clusters in the data). We investigate two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood. When the "true" number of clusters k is unknown, our experiments on simulated Gaussian data and real data sets show that incorporating the search for k within the feature selection procedure yields better "class" accuracy than fixing k to be the number of classes. There are two reasons: 1) the "true" number of Gaussian components is not necessarily equal to the number of classes, and 2) clustering with different feature subsets can result in different numbers of "true" clusters. Our empirical evaluation shows that feature selection reduces the number of features and improves clustering performance with respect to the chosen performance criteria.
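The wrapper idea summarized above can be illustrated with a minimal sketch, which is not the paper's implementation. It makes several assumptions that the abstract does not state: a greedy forward search over features, Gaussian components with diagonal covariances, BIC as the rule for choosing k, and a diagonal simplification of the trace(S_w^{-1} S_b) form of scatter separability. All function names (`em_diag`, `pick_k`, `separability`, `forward_select`) are hypothetical.

```python
import math

def em_diag(X, k, iters=100, var_floor=1e-6):
    """EM for a k-component Gaussian mixture with diagonal covariances (assumed form)."""
    n, d = len(X), len(X[0])
    order = sorted(range(n), key=lambda i: X[i])
    # Deterministic start: means at evenly spaced positions of the sorted data.
    means = [list(X[order[int((j + 0.5) * n / k)]]) for j in range(k)]
    gmean = [sum(x[c] for x in X) / n for c in range(d)]
    gvar = [max(sum((x[c] - gmean[c]) ** 2 for x in X) / n, var_floor) for c in range(d)]
    varis = [gvar[:] for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities via log-sum-exp for numerical stability.
        loglik, resp = 0.0, []
        for x in X:
            logs = [math.log(weights[j]) - 0.5 * sum(
                        math.log(2 * math.pi * varis[j][c])
                        + (x[c] - means[j][c]) ** 2 / varis[j][c]
                        for c in range(d))
                    for j in range(k)]
            top = max(logs)
            lse = top + math.log(sum(math.exp(v - top) for v in logs))
            loglik += lse
            resp.append([math.exp(v - lse) for v in logs])
        # M-step: re-estimate weights, means, and (floored) variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            for c in range(d):
                mu = sum(r[j] * x[c] for r, x in zip(resp, X)) / nj
                var = sum(r[j] * (x[c] - mu) ** 2 for r, x in zip(resp, X)) / nj
                means[j][c], varis[j][c] = mu, max(var, var_floor)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return loglik, labels

def pick_k(X, k_max):
    """Order identification: choose k in 1..k_max by BIC (an assumed rule)."""
    n, d = len(X), len(X[0])
    best = None
    for k in range(1, k_max + 1):
        loglik, labels = em_diag(X, k)
        n_params = 2 * k * d + (k - 1)  # means, variances, mixing weights
        bic = -2.0 * loglik + n_params * math.log(n)
        if best is None or bic < best[0]:
            best = (bic, k, labels)
    return best[1], best[2]

def separability(X, labels):
    """Sum over features of between-cluster / within-cluster scatter."""
    n, d = len(X), len(X[0])
    score = 0.0
    for c in range(d):
        gmean = sum(x[c] for x in X) / n
        s_w = s_b = 0.0
        for j in set(labels):
            vals = [x[c] for x, l in zip(X, labels) if l == j]
            mu = sum(vals) / len(vals)
            s_w += sum((v - mu) ** 2 for v in vals) / n
            s_b += len(vals) / n * (mu - gmean) ** 2
        score += s_b / max(s_w, 1e-12)
    return score

def forward_select(X, k_max=3, n_select=1):
    """Greedy forward search; k is re-identified for every candidate subset."""
    chosen, best = [], None
    while len(chosen) < n_select:
        best = None
        for f in range(len(X[0])):
            if f in chosen:
                continue
            sub = [[x[c] for c in chosen + [f]] for x in X]
            k, labels = pick_k(sub, k_max)
            score = separability(sub, labels)
            if best is None or score > best[0]:
                best = (score, f, k)
        chosen.append(best[1])
    return chosen, best[2]
```

Calling `forward_select(X, k_max, n_select)` on a list of numeric rows returns the selected feature indices and the k identified for the final subset. Because k is searched anew for each candidate subset, different subsets are free to yield different numbers of clusters, which is the second reason given in the abstract.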
