Clustering divides objects into groups based on their similarity. It is commonly used as a tool for exploratory knowledge discovery, i.e., to extract potentially useful and previously unknown knowledge from data before experts have any insight into it.
Because clustering is exploratory in nature, it is usually not enough to provide results that merely separate samples into groups. Domain scientists, and data analysts in general, also want to gain insight into the data. It is therefore desirable to develop interpretable clustering models, which help experts attain deeper knowledge by showing what characterizes a cluster and how it is distinguished from the others.
This dissertation focuses on improving interpretability of clustering algorithms by targeting the following three aspects:
1) Clustering with interpretable rules. One possible strategy to improve interpretability is to describe the clusters using interpretable rules.
This dissertation first introduces a model that defines each cluster by rectangular decision rules over all features. Building on this model, a generative model and a discriminative model are developed that incorporate feature selection, so that each cluster is defined using only a subset of the features.
2) Interpretable clustering with similarity matrices.
Similarity-matrix-based clustering methods are usually less interpretable.
This dissertation introduces a clustering model that improves the interpretability of similarity-matrix-based clustering methods. The model generates a set of interpretable rules for each cluster, using a selected subset of features from a feature matrix, while forcing the clustering results to be consistent with the observed similarity matrix.
3) Interpretable crowdclustering with partition labels. Most existing crowdclustering methods analyze pairwise similarity labels provided by different experts, without explaining how different expert solutions are related. This dissertation presents a crowdclustering method that analyzes the partition labels from experts. This model explicitly learns the relationship between the latent consensus cluster solution and each expert solution, revealing the agreements and disagreements across different experts.
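As a rough illustration of the rule-based description in item 1, the sketch below shows how a rectangular decision rule might be represented and applied. It is not the dissertation's generative or discriminative model, and it does not learn the rules. It only assumes that a rule is a set of per-feature intervals and that a rule restricted to a subset of features leaves the other features unconstrained. The class `RectangularRule`, the function `assign`, and the feature names `f0` to `f2` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class RectangularRule:
    """Hypothetical cluster rule: maps a feature index to a (lower, upper) interval.

    Features that do not appear in `bounds` are unconstrained, which mimics
    defining a cluster with only a subset of the features.
    """
    bounds: Dict[int, Tuple[float, float]]

    def covers(self, x: Sequence[float]) -> bool:
        # A point satisfies the rule if every constrained feature lies in its interval.
        return all(lo <= x[j] <= hi for j, (lo, hi) in self.bounds.items())

    def describe(self, names: Sequence[str]) -> str:
        # Human-readable form of the rule, one interval per constrained feature.
        return " AND ".join(
            f"{lo} <= {names[j]} <= {hi}"
            for j, (lo, hi) in sorted(self.bounds.items())
        )


def assign(x: Sequence[float], rules: List[RectangularRule]) -> Optional[int]:
    """Return the index of the first rule that covers x, or None if no rule does."""
    for k, rule in enumerate(rules):
        if rule.covers(x):
            return k
    return None


# Illustrative rules only; the intervals are made up, not learned from data.
names = ["f0", "f1", "f2"]
rules = [
    RectangularRule({0: (0.0, 1.0)}),                 # cluster 0 uses one feature
    RectangularRule({0: (2.0, 3.0), 2: (5.0, 9.0)}),  # cluster 1 uses two features
]
```

Because each cluster is described by a conjunction of intervals, `rules[1].describe(names)` reads as a statement an expert can check directly against the data.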
The methods introduced in this dissertation are applied to discover subtypes of Chronic Obstructive Pulmonary Disease (COPD), a heterogeneous lung disease.
- Professor Jennifer Dy (Advisor)
- Professor Dana Brooks
- Professor Stratis Ioannidis