Large Scale Data Clustering
Anil K. Jain
Department of Computer Science and Engineering
Michigan State University
The goal of data clustering is to organize a set of n objects into k clusters
such that objects in the same cluster are more similar to each other than
to objects in different clusters. Clustering is one of the most popular tools for
data exploration and data organization that has been widely used in almost
every scientific discipline that collects data. Given the exponential growth in
data generation (estimated to be over 35 trillion gigabytes by the year 2020),
clustering is receiving renewed interest and use in applications such as social
networks, image retrieval, web search and gene expression analysis. In this
talk I will introduce the data clustering problem and discuss the challenges
and opportunities in research on large-scale clustering, with a focus on
two main issues: (i) how to define pairwise similarity between objects, and
(ii) how to efficiently cluster hundreds of millions of objects. I will present
our recent work on approximating the well-known kernel k-means
clustering algorithm. I will show both analytically and empirically that the
performance of approximate kernel k-means is similar to that of the kernel
k-means algorithm, but with significantly lower run-time complexity and
memory requirements.
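
For readers unfamiliar with the baseline, the following is a minimal sketch of standard (exact) kernel k-means on a precomputed kernel matrix, not the approximate algorithm presented in the talk. The function names `rbf_kernel` and `kernel_kmeans`, the RBF kernel choice, the `gamma` value, and the `init_labels` parameter are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma=0.5):
    """Pairwise RBF similarities: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))


def kernel_kmeans(K, k, init_labels=None, max_iter=100, seed=0):
    """Exact kernel k-means on an n x n kernel matrix K (illustrative sketch)."""
    n = K.shape[0]
    if init_labels is None:
        # Assumption: random initial assignment when none is supplied.
        labels = np.random.default_rng(seed).integers(0, k, size=n)
    else:
        labels = np.asarray(init_labels).copy()
    diag = np.diag(K)
    for _ in range(max_iter):
        # dist[i, c] = squared feature-space distance from point i to the
        # mean of cluster c, expressed only through kernel values.
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # an empty cluster stays unreachable
            cross = K[:, members].sum(axis=1) / size
            within = K[np.ix_(members, members)].sum() / size ** 2
            dist[:, c] = diag - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels


# Two well-separated groups; one point starts in the wrong cluster.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
K = rbf_kernel(X)
labels = kernel_kmeans(K, 2, init_labels=[0, 0, 1, 1, 1, 1])
```

Note that this sketch stores and repeatedly scans the full n x n kernel matrix, so memory grows quadratically with the number of objects. That cost is what makes the exact algorithm impractical at the scale of hundreds of millions of objects and motivates an approximation.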
Anil K. Jain is a university distinguished professor in the Department of
Computer Science and Engineering at Michigan State University. His
research interests include pattern recognition, computer vision and biometric
authentication. He served as the editor-in-chief of the IEEE Transactions on
Pattern Analysis and Machine Intelligence (1991-1994). The holder of six
patents in the area of fingerprints, he is the author of a number of books,
including Introduction to Biometrics (2011), Handbook of Face Recognition
(2011), Handbook of Fingerprint Recognition (2009), Handbook of
Biometrics (2007), Handbook of Multibiometrics (2006), BIOMETRICS:
Personal Identification in Networked Society (1999), and Algorithms for
Clustering Data (1988). He served as a member of the Defense Science
Board and The National Academies committees on Whither Biometrics and
Improvised Explosive Devices. Dr. Jain received the 1996 IEEE
Transactions on Neural Networks Outstanding Paper Award and the Pattern
Recognition Society best paper awards in 1987, 1991, and 2005. He is a
fellow of the AAAS, ACM, IAPR, IEEE, and SPIE. He has received
Fulbright, Guggenheim, Alexander von Humboldt, IEEE Computer Society
Technical Achievement, IEEE Wallace McDowell, ICDM Research
Contributions, and IAPR King-Sun Fu awards.