Tutorial 4: On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled
Abstract:
Ensemble methods have emerged as a powerful approach for improving the robustness as well
as the accuracy of both supervised and unsupervised solutions. Moreover, as enormous amounts
of data are continuously generated from different views, it is important to consolidate different
concepts for intelligent decision making. In the past decade, there have been numerous studies
on the problem of combining competing models into a committee, and the success of ensemble
techniques has been observed in multiple disciplines, including recommendation systems, anomaly
detection, stream mining, and web applications.
Ensemble techniques have mostly been studied separately in the supervised and unsupervised
learning communities. However, they share the same basic principle: combining diversified base models strengthens weak models. Also, when both supervised and unsupervised models
are available for a single task, merging all of the results leads to better performance. Therefore,
there is a need for a systematic introduction to and comparison of the ensemble techniques, combining
the views of both supervised and unsupervised learning ensembles.
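As a rough illustration of why combining base models can strengthen weak ones, the sketch below computes the accuracy of a simple majority vote. It rests on a strong assumption that is not part of the tutorial text: the base models err independently and each is correct with the same probability p. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a majority of n base models is correct.

    Assumes (for illustration only) that the n models make independent
    errors and that each is correct with probability p. n must be odd
    so that a strict majority always exists.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd")
    smallest_majority = n // 2 + 1
    # Sum the binomial probabilities of k correct models, k > n/2.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(smallest_majority, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 11, 51):
        print(n, majority_vote_accuracy(0.6, n))
```

Under these assumptions the committee accuracy grows with n whenever p is above 0.5. Real base models are correlated, which is why diversity among them matters.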
In this tutorial, we will present an organized picture of ensemble methods with a focus on
the mechanism to merge the results. We start with the description and applications of ensemble
methods. Through reviews of well-known and state-of-the-art ensemble methods, we show that
supervised learning ensembles usually "learn" this mechanism based on the available labels in the
training data, whereas unsupervised ensembles simply combine multiple clustering solutions based
on "consensus". We end the tutorial with a systematic approach to combining both supervised and
unsupervised models.
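The contrast between a "learned" and a "consensus" merging mechanism can be sketched as follows. This is a minimal illustration and not the tutorial's own method: the supervised side weights each classifier by its accuracy on labeled data (one simple choice among many), and the unsupervised side builds a co-association matrix recording how often two objects share a cluster. All function names are hypothetical.

```python
def learn_weights(predictions, labels):
    """Supervised: weight each model by its accuracy on labeled data.

    predictions[m][i] is model m's predicted label for example i.
    """
    return [
        sum(p == y for p, y in zip(model_preds, labels)) / len(labels)
        for model_preds in predictions
    ]


def weighted_vote(predictions, weights):
    """Merge classifier outputs using the learned weights."""
    merged = []
    for votes in zip(*predictions):
        scores = {}
        for vote, weight in zip(votes, weights):
            scores[vote] = scores.get(vote, 0.0) + weight
        merged.append(max(scores, key=scores.get))
    return merged


def co_association(clusterings):
    """Unsupervised: fraction of clusterings that put i and j together.

    No labels are needed, and cluster ids need not correspond
    across clusterings; only co-membership is counted.
    """
    n = len(clusterings[0])
    m = len(clusterings)
    return [
        [sum(c[i] == c[j] for c in clusterings) / m for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    train_preds = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 0, 1]]
    weights = learn_weights(train_preds, [1, 0, 1, 1])
    print(weights)
    print(weighted_vote([[1], [0], [0]], weights))
    print(co_association([[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]))
```

In the usage example, the most accurate classifier outweighs the two weaker ones even though it is outvoted, whereas the co-association matrix relies purely on agreement among the clusterings.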
Tutors' Biographies:
Jing Gao received the BEng and MEng degrees, both in Computer
Science, from Harbin Institute of Technology, China, in 2002 and 2004, respectively. She is
currently working toward the Ph.D. degree in the Department of Computer Science, University
of Illinois at Urbana-Champaign. She is broadly interested in data and information analysis
with a focus on data mining and machine learning. In particular, her research interests include
ensemble methods, transfer learning, mining data streams and anomaly detection. She has
published more than 20 papers in refereed journals and conferences, including KDD, NIPS,
ICDCS, ICDM and SDM conferences.
Wei Fan received his PhD in Computer Science from Columbia University
in 2001 and has been working at IBM T.J. Watson Research since 2000. He has published more
than 60 papers in top data mining, machine learning and database conferences, such as KDD,
SDM, ICDM, ECML/PKDD, SIGMOD, VLDB, ICDE, AAAI, and ICML. Dr. Fan has served
as Area Chair or Senior PC member of SIGKDD'06, SDM'08 and ICDM'08/09, sponsorship co-chair of
SDM'09, award committee member of ICDM'09, as well as PC member of several prestigious conferences in the area, including KDD'09/08/07/05, ICDM'07/06/05/04/03, SDM'09/07/06/05/04,
CIKM'09/08/07/06, ECML/PKDD'07/06, ICDE'04, AAAI'07, PAKDD'09/08/07, EDBT'04,
WWW'09/08/07, etc. He is on the advisory board of KD2U. Dr. Fan was invited to speak at
ICMLA'06. He served as a US NSF panelist in 2007/08. His main research interests and experience are in various areas of data mining and database systems, such as risk analysis, high
performance computing, extremely skewed distribution, cost-sensitive learning, data streams,
ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, novel applications and commercial
data mining systems. He is particularly interested in simple, unconventional, but effective methods to solve difficult problems. His thesis work on intrusion detection has been licensed by a
start-up company since 2001. His team's submission, which uses Random Decision Tree,
won the ICDM'08 Contest Crown Awards. His co-authored paper in ICDM'06 that uses "Randomized Decision Tree" to predict skewed ozone days won the best application paper award. His
co-authored paper in KDD'97 on the distributed learning system "JAM" won the runner-up best
application paper award.
Jiawei Han (Ph.D., Univ. of Wisconsin at Madison) is a professor
in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has
been working on research into data mining, data warehousing, stream data mining, spatial and
multimedia data mining, and bio-medical data mining, with over 300 conference and journal
publications. He has chaired or served on over 100 program committees of international conferences and workshops, including ACM SIGKDD Conferences (2001 best paper award chair, 2002
student award chair, 1996 PC co-chair), SIAM-Data Mining Conferences (2001 and 2002 PC
co-chair), ACM SIGMOD Conferences (2000 exhibit program chair), International Conferences
on Data Engineering (2004 and 2002 PC vice-chair), International Conferences on Data Mining
(2005 PC co-chair) and the International Conference on Very Large Data Bases (2006 VLDB Americas Chair). He has also served or is serving as EIC of ACM Transactions on Knowledge Discovery
from Data and on the editorial boards for Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, Journal of Intelligent Information Systems, and
Journal of Computer Science and Technology. Jiawei has received the Outstanding Contribution
Award at the 2002 International Conference on Data Mining, the ACM Service Award (1999), the
ACM SIGKDD Innovations Award (2004), and the IEEE CS Technical Achievement Award (2005).
He is an ACM and IEEE Fellow. He is the first author of the textbook "Data Mining: Concepts
and Techniques", 2nd ed. (Morgan Kaufmann, 2006).