
ECE PhD Defense: "Clustering with Flexible Constraints and Application to Disease Subtyping," Yale Chang


TF 378

November 29, 2017 10:00 am

Clustering algorithms are widely used to extract knowledge from large amounts of unlabeled data (for example, discovering subtypes of complex diseases to enable personalized treatment of patients). Clustering is a challenging problem because, given the same data, samples can be grouped from multiple different perspectives (views). Which of these alternative groupings is useful depends on the application. Thus, incorporating domain expert input often improves clustering performance. In this dissertation, we explore various ways to incorporate expert input to guide clustering. First, domain experts often have an idea of the properties that clustering solutions should have in order to be useful, based on domain-relevant scores. We propose a framework to jointly optimize the usefulness and quality of a clustering solution. Second, besides instance-level constraints, feature-level structures can also be utilized to improve clustering. We consider two types of feature-level structures: 1) decision rules on a small set of features to provide interpretable clusterings; and 2) a feature similarity matrix used to guide the embeddings for clustering. Third, instead of supervision from one expert, it is becoming more common for supervision to be available from multiple experts, as data can be shared and processed by increasingly large audiences. To address this new clustering paradigm, we make the following contributions: 1) Because experts are not oracles, their inputs are prone to errors. We build a probabilistic model to learn the shared latent clustering structure in the data by explicitly modeling the accuracy of each expert. 2) Since different experts might provide supervision with varying views in mind, we build a Bayesian probabilistic model for learning multiple latent clustering views from multiple experts.
Besides demonstrating the superior performance of our proposed approaches on synthetic and benchmark data sets, we also applied them to discover subtypes of chronic obstructive pulmonary disease (COPD), a complex lung disease, and obtained clinically meaningful results.

  • Professor Jennifer Dy (Advisor)
  • Professor Adam Ding
  • Professor Stratis Ioannidis
  • Dr. Peter J. Castaldi
  • Dr. Michael H. Cho
  • Dr. Edwin K. Silverman