
Curriculum Vitae
HWANJO YU
Associate Professor
Department of Computer Science and Engineering
POSTECH, Pohang, Korea
Office: PIRL 335
Phone: +82-54-279-2388 (Mobile: +82-10-4118-7006)
Fax: +82-54-279-2299
Email: hwanjoyu@postech.ac.kr
Homepage: http://hwanjoyu.org

Positions
2011-        Associate Professor, Department of Computer Science and Engineering, POSTECH, Pohang, Korea
2008-2010    Assistant Professor, Department of Computer Science and Engineering, POSTECH, Pohang, Korea
2004-2008    Assistant Professor, Computer Science Department, University of Iowa, Iowa City, USA
1999-2004    Research Assistant, National Center for Supercomputing Applications (NCSA), University of Illinois, Urbana-Champaign (UIUC), USA
1995-1998    Database System and Application Developer, Sunwave Co., Seoul, Korea & Los Angeles, USA

Education
1998-2004    Ph.D., Computer Science, University of Illinois, Urbana-Champaign (Advisor: Jiawei Han)
1993-1997    B.S., Computer Science and Engineering, Chung-Ang University, Seoul
Awards
2013      Best Poster Award at IEEE Int. Conf. Data Engineering (ICDE) (out of 150 full and short papers)
2010      2nd and 3rd places at the UCSD Data Mining Contest (graduate advisees)
2004      Nominated from U. Iowa for the Microsoft junior faculty fellowship
2003      The 2003 UIUC Data Mining Research Gold Award
2003      IBM Research Student Scholarship Award from ACM SIGKDD’03 (International Conference on Knowledge Discovery and Data Mining)
2003      Student Scholarship Award from IJCAI’03 (International Joint Conference on Artificial Intelligence)
2003      Student Scholarship Award from CIKM’03 (International Conference on Information and Knowledge Management)
2002      IBM Research Student Scholarship Award from ACM SIGKDD’02
1996      Samsung Electronics Scholarship for outstanding undergraduates
1994-96   Chung-Ang Scholarship for outstanding undergraduates (graduated with the highest major GPA in the department)

Teaching
POSTECH (2008~)
  Mining Big Data
  Data Structures and Algorithms
  Database Systems
  Advanced Topics in Data Mining
  Introduction to Data Mining
  Introduction to Computing
U. of Iowa (2004~2008)
  Database Systems
  Knowledge Discovery and Data Mining
  Data Mining and Machine Learning
Advising
Hwanjo Yu has graduated 5 MS students and currently advises 1 PhD and 8 MS/PhD students. His graduate
advisees have published papers in ACM SIGMOD, ACM SIGKDD, IEEE ICDE, IEEE ICDM, and ACM CIKM,
and have received prestigious awards including the IEEE ICDE best poster award (2013) and 2nd and 3rd
places at the UCSD data mining contest (2010). His graduate advisees have also been selected for internships
at Microsoft Research and Microsoft Research Asia. He has also advised undergraduate students at POSTECH;
one of his undergraduate advisees published a paper in IEEE ICDM and received a student travel award (2012).
An undergraduate advisee also received the Samsung Human-Tech paper award (2012) as the only
undergraduate student among the recipients, which is very unusual for that award.
Services
I have served as an associate editor of the Neurocomputing journal since 2005 and served on an NSF
proposal panel in 2006. I have also served on the organizing committees of international conferences,
including as Proceedings Chair of APWeb 2010 and PC Chair of EDB 2013. I was also an invited speaker at
an international conference, PRIB 2012. I have been actively serving on the program committees of ACM KDD,
ACM SIGMOD, VLDB, IEEE ICDE, IEEE ICDM, and ACM CIKM.
Projects
Hwanjo Yu has been the PI or Co-PI of 7 national and 8 industrial projects since he joined POSTECH in
2008, and the total amount of funding he has managed is about 3.5 million USD. The following are
representative national projects (ongoing or finished).
 Development of Enabling Software Technology for Big Data Mining (360k USD / year, 5 years):
This project is supported by the Next-Generation Information Computing Development Program through the
National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and
Technology (No. 2012M3C4A7033344). The goal of this project is the development of enabling software
technologies for big data mining. Through this project, we research data mining techniques for big data
in natural sciences and social networks. We will also develop personalized service technologies based
on unstructured big data analysis and customer behavior models. Furthermore, we will produce well-trained software engineers who are experts in big data mining.
 Developing Search and Mining Technologies for Mobile Devices (100k USD / year, 3 years): This
project is supported by Mid-career Researcher Program through NRF grant funded by the MEST (No.
KRF-2011-0016029). Combining the highly profitable information search industry and the mobile
computing paradigm, the mobile information search industry has been growing rapidly. This project aims at
advancing the technologies in the areas of mobile search and mining, low-power consumption utility
mining, and mining for mobile online advertising.
 User-Friendly Search Engine for MEDLINE (100k USD / year, 3 years): This project is supported by
Mid-career Researcher Program through NRF grant funded by the MEST (No. KRF-2009-0080667).
PubMed MEDLINE, a database of biomedical and life science journal articles, is one of the most
important information sources for medical doctors and bio-researchers. Finding the right information in
MEDLINE is nontrivial because it is not easy to express the intended relevance using the current
PubMed query interface, and its query processor focuses on fast matching rather than accurate
relevance ranking. This project develops techniques for building a user-friendly MEDLINE search engine.
 Enabling Relevance Search in Bio Databases (50k USD / year, 3 years): This project is supported
by the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-331-D00528).
Most online data retrieval systems, built on relational database management systems (RDBMS),
support fast processing of Boolean queries but offer little support for relevance or preference ranking.
Unified support of Boolean and ranking constraints in a query is essential for user-friendly data retrieval.
This project develops foundational techniques that enable data retrieval systems in which users
intuitively express ranking constraints and the system efficiently processes the queries.
The following are representative industrial projects (ongoing or finished).
 Developing Big Video Search and Analysis Technology (Korea Telecom) (50k USD): This
proposal aims to develop a distributed video processing (DiViP) engine and techniques supporting
distributed storage, information retrieval, and object recognition for large-scale video datasets. More
specifically, it develops 1) a distributed storage system for large-scale video datasets, 2) a distributed
processing engine that supports traditional video processing techniques for large-scale video, 3) location-based
video retrieval, and 4) an efficient recognized-object-oriented video retrieval technique.
 Developing Distributed Machine Learning Algorithms for Classification and Recommendation
(Samsung Elec.) (60k USD): Existing recommendation systems (e.g., the Netflix competition) focus on
accurate prediction of purchases, as such systems are evaluated based on prediction accuracy.
However, such systems tend to recommend popular items. Recommending popular items, however,
might not effectively influence users' purchase decisions, as users likely already know the items
and have pre-made decisions about purchasing them, e.g., a recommendation to watch Star Wars or
Titanic. Effective recommendation must suggest unexpected or novel items that could surprise
users and affect their purchase decisions. This project develops an effective recommendation system for
digital TV customers.
 Mining for High-Utility Advertising in Search Engines (Microsoft) (60k USD): Ads are the dominant
revenue source for search engine companies, and the success of search engine ads relies on finding
relevant ads that are highly likely to trigger user clicks and eventually conversions (i.e., actual
purchases). Eliciting ad clicks and conversions from Web search requires investigating
several research problems, such as (1) finding users' commercial intent from queries, (2) mining, for
each query, the factors that impact ad clicks, (3) mining click patterns or periods (time and
location) for each type of ad, and (4) designing ad styles that maximize clicks and conversions. We
define high-utility ads as ads that are properly selected and well designed for a specific query, specific
time, specific location, and specific user, and thus are highly likely to induce clicks and
conversions. This project takes on these four research problems to select and design high-utility
ads. The four problems will be researched based on the log data collected via our proxy
server during the 8-month project. After the project, during an internship at MSRA, we plan to investigate
the problems using Microsoft log data and expand the research using Microsoft ads auction data.
Finally, we plan to integrate the results of the four tasks to build tools that automatically select and design
high-utility ads given a user's query.
 Developing Multi-Variables Optimization Method based on Data Analysis (POSCO) (100k USD):
This project generates a prediction model for detecting problems in the iron rolling process from
historical data and estimates the optimal values of the parameters of each rolling procedure.
Selected Publications
Conferences
 W Han, S Lee, K Park, J Lee, M Kim, J Kim, H Yu, “TurboGraph: A Fast Parallel Graph Engine
Handling Billion-scale Graphs in a Single PC”, ACM SIGKDD 2013
 S Jeon, S Kim, H Yu, “Don't be Spoiled by Your Friends: Spoiler Detection in TV Program Tweets”,
AAAI ICWSM 2013
 J Kim, S Kim, H Yu, “Scalable and Parallelizable Processing of Influence Maximization for Large-Scale
Social Network”, IEEE ICDE 2013 (best poster award)
 W Lee, J Kim, H Yu, “CT-IC: Continuously activated and Time-restricted Independent Cascade Model
for Viral Marketing”, IEEE ICDM 2012 (student travel award)
 J Oh, H Yu, “iSampling: Framework for Developing Sampling Methods Considering User’s Interest”,
ACM CIKM 2012
 S Kim, K Toutanova, H Yu, “Multilingual Named Entity Recognition using Parallel Data and Metadata
from Wikipedia”, ACL 2012
 Y Kim, J Kim, H Yu, “GeoSearch: Georeferenced Video Retrieval System”, ACM SIGKDD 2012
 J Oh, T Kim, S Park, H Yu, “PubMed Search and Exploration with Real-Time Semantic Network
Construction”, ACM SIGKDD 2012
 J Oh, S Park, H Yu, M Song, S Park, “Novel Recommendation based on Personal Popularity
Tendency”, IEEE ICDM 2011
 S Kim, T Qin, TY Liu, H Yu, “Advertiser-Centric Approach to Understand User Click Behavior in
Sponsored Search”, ACM CIKM 2011
 B Lee, J Oh, H Yu, J Kim, “Protecting Location Privacy using Location Semantics”, ACM SIGKDD 2011
 H Yu, I Ko, Y Kim, S Hwang, WS Han, “Exact Indexing for Support Vector Machines”, ACM SIGMOD
2011
 WS Han, J Lee, YS Moon, S Hwang, H Yu, “A New Approach for Processing Ranked Subsequence
Matching Based on Ranked Union”, ACM SIGMOD 2011
 H Yu, S Kim, “Passive Sampling for Regression”, IEEE ICDM 2010
 H Yu, S Kim, S Na, “RankSVR: Can Preference Data Help Regression?”, ACM CIKM 2010
 WS Han, WS Kwak, H Yu, “On Supporting Effective Web Extraction”, IEEE ICDE 2010
 H Yu, J Oh, WS Han, “Efficient Feature Weighting Method for Ranking”, ACM CIKM 2009
 H Yu, T Kim, J Oh, I Ko, S Kim, “RefMed: Relevance Feedback Retrieval System for PubMed”, ACM
CIKM 2009
 NA Vien, VH Viet, T Chung, H Yu, S Kim, B Cho, “VRIFA: A Nonlinear SVM Visualization Tool using
Nomogram and LRBF Kernels”, ACM CIKM 2009
 H Yu, Y Kim, S Hwang, “An Efficient Method for Learning Ranking SVM”, PAKDD 2009
 H Yu, J Vaidya, X Jiang, “Privacy-Preserving SVM Classification on Vertically Partitioned Data”,
PAKDD 2006
 H Yu, “SVM Selective Sampling for Ranking with Application to Data Retrieval”, ACM SIGKDD 2005
 H Yu, S Hwang, KCC Chang, “RankFP: A Framework for Supporting Rank Formulation and
Processing”, IEEE ICDE 2005
 H Yu, D Searsmith, X Li, J Han, “Scalable Construction of Topic Directory with Nonparametric Closed
Termset Mining”, IEEE ICDM 2004
 H Yu, J Yang, J Han, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, ACM
SIGKDD 2003
 H Yu, “SVMC: Single-Class Classification With Support Vector Machines”, IJCAI 2003
 H Yu, C Zhai, J Han, “Text Classification from Positive and Unlabeled Documents”, ACM CIKM 2003
 H Yu, “General MC: Estimating Boundary of Positive Class from Small Positive Data”, IEEE ICDM 2003
 H Yu, J Han, KCC Chang, “PEBL: Positive Example Based Learning for Web Page Classification Using
SVM”, ACM SIGKDD 2002
 H Yu, KCC Chang, J Han, “Heterogeneous Learner for Web Page Classification”, IEEE ICDM 2002
Journals
 S Kim, L Sael, H Yu, “Efficient Protein Structure Search using Indexing Methods”, BMC Medical
Informatics and Decision Making, Springer 2013 (IF: 1.6)
 J Oh, T Kim, S Park, H Yu, Y Lee, “Efficient Semantic Network Construction with Application to
PubMed Search”, Knowledge-Based Systems, Elsevier 2013 (IF: 4.104)
 H Yu, J Kim, Y Kim, S Hwang, YH Lee, “An Efficient Method for Learning Nonlinear Ranking SVM
Functions”, Information Sciences 2012 (IF: 3.643)
 M Song, H Yu, WS Han, “Combining Active Learning and Semi-Supervised Learning Techniques to
Extract Protein Interaction Sentences”, BMC Bioinformatics 2011 (IF: 3.02)
 J Lee, MD Pham, J Lee, WS Han, H Cho, H Yu, JH Lee, “Processing SPARQL queries with regular
expressions in RDF databases”, BMC Bioinformatics 2011 (IF: 3.02)
 H Yu, “Selective Sampling Techniques for Feedback-based Data Retrieval”, Data Mining and
Knowledge Discovery 2011 (IF: 2.877)
 NA Vien, H Yu, TC Chung, “Hessian Matrix Distribution for Bayesian Policy Gradient Reinforcement
Learning”, Information Sciences 2011 (IF: 3.643)
 H Yu, T Kim, J Oh, I Ko, S Kim, WS Han, “Enabling Multi-Level Relevance Feedback on PubMed by
Integrating Rank Learning into DBMS”, BMC Bioinformatics 2010 (IF: 3.02)
 G Yu, S Hwang, H Yu, “Supporting Personalized Ranking over Categorical Attributes”, Information
Sciences 2008 (IF: 3.643)
 B Cho, H Yu, J Lee, Y Chee, I Kim, S Kim, “Nonlinear Support Vector Machine Visualization for Risk
Factor Analysis using Nomograms and Localized Radial Basis Function Kernels”, IEEE T. Information
Technology in Biomedicine 2008 (IF: 1.978)
 J Vaidya, H Yu, X Jiang, “Privacy-Preserving SVM Classification”, Knowledge and Information Systems
2008 (IF: 2.225)
 B Cho, H Yu, K Kim, T Kim, I Kim, S Kim, “Application of irregular and unbalanced data to predict
diabetic nephropathy using visualization and feature selection methods”, Artificial Intelligence in
Medicine 2008 (IF: 1.355)
 H Yu, S Hwang, KCC Chang, “Enabling Soft Queries for Data Retrieval”, Information Systems 2007 (IF:
1.768)
 H Yu, “Single-Class Classification with Mapping Convergence”, Machine Learning 2005 (IF: 1.467)
 H Yu, J Yang, J Han, X Li, “Making SVMs Scalable to Large Data Sets using Hierarchical Cluster
Indexing”, Data Mining and Knowledge Discovery 2005 (IF: 2.877)
 H Yu, J Han, KCC Chang, “PEBL: Web Page Classification without Negative Examples”, IEEE TKDE
2004 (IF: 1.892)
Research Statement
Hwanjo Yu is one of the pioneers in classification without negative examples and privacy-preserving SVM. He
has also developed influential algorithms and systems in the areas of data mining, databases, and machine learning,
including (1) SVM-JAVA: a widely used open-source Java implementation of SVM, (2) RefMed: the world's first relevance
feedback search engine for PubMed, (3) iKernel: the first exact indexing method for SVM, (4) IPA: a scalable
and parallelizable influence maximization algorithm for large-scale social networks, and (5) TurboGraph: a fast
parallel graph engine handling billion-scale graphs in a single PC. His methods and algorithms have been published
in prestigious journals and conferences including ACM SIGMOD, ACM SIGKDD, IEEE ICDE, IEEE ICDM,
and ACM CIKM, where he also serves as a program committee member.
Research Achievements
His research achievements can be categorized into the following four (nearly mutually exclusive) categories.
1. Search Engine for “Complex” Queries: Typical search engines find relevant results from keyword
queries. Keyword queries are not sufficient in “complex” databases where the user’s search intention is
often too complex to express in a few keywords. For example, with the same keyword query “breast
cancer” in PubMed, i.e., a popularly used life science journal database
(http://www.ncbi.nlm.nih.gov/pubmed), the user may want to search for articles about recent treatments or
for articles about related genes. He developed technologies for enabling real-time relevance
feedback search and built RefMed (http://hwanjoyu.org/refmed), a relevance feedback search
engine for PubMed. RefMed is distinct from existing relevance feedback search engines in that, in RefMed,
the user can specify her notion of relevance by relative ordering or multi-degree relevance (e.g., highly
relevant, somewhat relevant, not relevant), whereas existing systems require the user to specify it by
binary relevance (relevant or not). Thereby, RefMed learns an accurate relevance function from a relatively
small amount of feedback. The enabling technologies we developed are threefold:
(1) How to accurately learn the user’s hidden preference or relevance function from a small amount of
feedback? He developed methods for SVM active learning or selective sampling, nonlinear SVM
ranking, feature weighting for ranking, etc. In particular, he developed theorems for optimal selective
sampling for ranking SVM (KDD 2005, DAMI 2011), and successfully applied the theorems in the
real-time search engine RefMed to minimize the amount of feedback needed to achieve a given
accuracy of Ranking SVM. Related papers were published in KDD 2005, CIKM 2009, CIKM 2010,
ICDM 2010, KDD 2011, DAMI 2011, and Information Sciences 2012.
(2) How to efficiently retrieve top-k results according to the learned relevance function? He developed an
SVM indexing method which is the first exact indexing method for SVM ranking queries. Such an index
is particularly critical for a real-time relevance feedback search engine like RefMed because, once a
ranking function is learned, the top-k results must be returned in real time; without the index, finding
the top-k results by scanning all the candidates would take far too long. Related papers were
published in SIGMOD 2011 (two papers) and Information Sciences 2013.
(3) How to seamlessly integrate rank learning and rank processing? He developed methods for
seamlessly integrating the two. Related papers were published in ICDE 2005, CIKM 2009, KDD 2012,
and Knowledge-Based Systems 2013.
RefMed is currently widely used by bio-scientists, and it is ranked 1st on Google for the query “PubMed
relevance search”, even higher than PubMed itself. A minimal sketch of the pairwise rank-learning and
top-k retrieval idea is shown below.
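Below is a minimal, illustrative Python sketch of this idea, not the actual RefMed or iKernel implementation: a linear ranking function is learned from pairwise relevance feedback by training a linear SVM on difference vectors (a standard approximation of Ranking SVM), and the top-k documents are then retrieved by scoring. The function names (learn_ranking_function, top_k) and toy data are hypothetical, and RefMed uses an exact SVM index instead of the full scan shown here.

import numpy as np
from sklearn.svm import LinearSVC

def learn_ranking_function(doc_vectors, feedback_pairs, C=1.0):
    """Learn a linear ranking weight vector w from pairwise feedback.
    feedback_pairs: list of (i, j) meaning the user judged document i
    more relevant than document j."""
    diffs, labels = [], []
    for i, j in feedback_pairs:
        diffs.append(doc_vectors[i] - doc_vectors[j]); labels.append(+1)
        diffs.append(doc_vectors[j] - doc_vectors[i]); labels.append(-1)
    # A linear SVM on pairwise difference vectors approximates the
    # pairwise hinge loss of Ranking SVM.
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(np.asarray(diffs), np.asarray(labels))
    return svm.coef_.ravel()

def top_k(doc_vectors, w, k=10):
    """Score every document with w.x and return the indices of the top k
    (RefMed avoids this full scan with its exact SVM index)."""
    scores = doc_vectors @ w
    return np.argsort(-scores)[:k]

# toy usage with random data
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))    # 100 documents, 8 features
pairs = [(0, 5), (2, 7), (3, 9)]    # feedback: doc 0 > doc 5, etc.
w = learn_ranking_function(docs, pairs)
print(top_k(docs, w, k=5))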
2. Support Vector Machines (Partially-Supervised, Scalable, Privacy-Preserving, and Ranking): SVM
(Support Vector Machine) has been popularly used for learning classification, regression, and ranking
functions. He made significant contributions to the advancement of SVM, and the following are
representative works.
(1) Classification without negative examples, also called partially-supervised learning, is to learn a
classification function from positive and unlabeled data. His pioneering papers in this area were
published in KDD 2002, ICDM 2003, CIKM 2003, IJCAI 2003, TKDE 2004, and Machine Learning 2005
(a minimal sketch of the two-step idea behind this line of work appears at the end of this section).
(2) He developed a disk-based SVM classification method called CB-SVM, which combines hierarchical
clustering iteratively with SVM learning. Related papers were published in KDD 2003 and DAMI 2005.
(3) Privacy-preserving SVM is to learn SVM models in a distributed environment without disclosing private
data or information to other parties. His pioneering papers in this area were published in PAKDD 2006,
ACM SAC 2006, and KAIS 2008.
(4) RankSVM (or Ranking SVM) is to learn a ranking function (relevance or preference function) from
relative ordering or multi-degree labeled data. He developed an efficient active learning or selective
sampling for RankSVM (KDD 2005, CIKM 2010, DAMI 2011), an effective method for learning nonlinear
RankSVM (PAKDD 2009, Information Sciences 2012), and feature weighting method for RankSVM
(CIKM 2009).
He also developed a Java implementation of SVM (http://hwanjoyu.org/svm-java) for educational and
research purposes; it is ranked 2nd and 4th on Google, after LIBSVM, for the query “svm java”.
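Below is a minimal, illustrative Python sketch of the general two-step idea behind learning from positive and unlabeled data, in the spirit of PEBL and mapping convergence but not the exact published algorithms: a crude heuristic first picks "reliable negatives" from the unlabeled set, then an SVM is trained iteratively and its negative predictions are folded back into the negative set. The similarity heuristic, threshold, and function names are illustrative assumptions.

import numpy as np
from sklearn.svm import LinearSVC

def pu_learn(X_pos, X_unlabeled, n_iter=5):
    """Two-step learning from positive and unlabeled data (illustrative).
    Step 1: treat the unlabeled points least similar to the positive
            centroid as reliable negatives (crude heuristic).
    Step 2: iteratively train an SVM on positives vs. current negatives
            and move unlabeled points predicted negative into the
            negative set, loosely mimicking PEBL's iterative refinement."""
    centroid = X_pos.mean(axis=0)
    sim = X_unlabeled @ centroid
    cut = np.percentile(sim, 20)
    negatives = X_unlabeled[sim < cut]
    remaining = X_unlabeled[sim >= cut]
    clf = None
    for _ in range(n_iter):
        X = np.vstack([X_pos, negatives])
        y = np.r_[np.ones(len(X_pos)), -np.ones(len(negatives))]
        clf = LinearSVC().fit(X, y)
        if len(remaining) == 0:
            break
        pred = clf.predict(remaining)
        newly_negative = remaining[pred == -1]
        if len(newly_negative) == 0:
            break
        negatives = np.vstack([negatives, newly_negative])
        remaining = remaining[pred != -1]
    return clf

# toy usage: positives cluster around +2, half the unlabeled around -2
rng = np.random.default_rng(1)
X_pos = rng.normal(loc=2.0, size=(50, 5))
X_unl = np.vstack([rng.normal(loc=2.0, size=(50, 5)),
                   rng.normal(loc=-2.0, size=(50, 5))])
model = pu_learn(X_pos, X_unl)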
3. Social Network and Graph Analysis and Processing: As social network sites such as Facebook,
Twitter, and LinkedIn are rapidly growing, analysis of social network data has gained much attention and
many applications have been introduced. He has recently developed a social network analysis algorithm for
viral marketing and a parallel processing engine for graphs, which are detailed in the following.
(1) The influence maximization problem, introduced at KDD 2003, is to find k people in a social network
such that the union of their influence is maximized. Since the problem was proven NP-hard in 2003,
many approximate algorithms have been introduced. We developed a scalable and parallelizable
algorithm for influence maximization called IPA (http://dm.postech.ac.kr/ipa_demo), which runs tens of
times faster than the previous method while using 5 times less memory. IPA is also applicable to all
IC-based influence diffusion models, whereas the previous method is limited to a specific model, and IPA is
easily parallelized. IPA was published and won the best poster award at ICDE 2013 (out of 150 full and
short papers). A minimal sketch of the classical greedy baseline for this problem appears at the end of
this section.
(2) He developed TurboGraph, a fast parallel graph engine handling billion-scale graphs in a single PC
(http://wshan.net/turbograph). TurboGraph outperforms the state-of-the-art graph engine, GraphChi, by
up to four orders of magnitude for a wide range of queries such as BFS, targeted queries (e.g., k-NN),
and global queries (e.g., PageRank, connected components). TurboGraph was published at KDD
2013, and the paper was the 2nd most downloaded among all KDD papers and the most downloaded
among KDD 2013 papers in the ACM Digital Library over the last six weeks (as of September 9, 2013).
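Below is a minimal, illustrative Python sketch of the classical greedy baseline for influence maximization under the independent cascade (IC) model described above; it is not IPA, which replaces this simulation-heavy loop with a much faster, parallelizable computation. The graph representation, parameters, and function names are illustrative assumptions.

import random

def simulate_ic(graph, seeds, p=0.1):
    """One Monte-Carlo run of the independent cascade model.
    graph: dict node -> list of out-neighbors; seeds: initially active set.
    Each newly activated node gets one chance to activate each neighbor
    with probability p. Returns the number of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_influence_max(graph, k, p=0.1, runs=100):
    """Greedy baseline: repeatedly add the node with the largest estimated
    marginal gain in expected spread, estimated by Monte-Carlo simulation."""
    seeds = []
    for _ in range(k):
        base = (sum(simulate_ic(graph, seeds, p) for _ in range(runs)) / runs
                if seeds else 0.0)
        best_node, best_gain = None, float("-inf")
        for v in graph:
            if v in seeds:
                continue
            spread = sum(simulate_ic(graph, seeds + [v], p) for _ in range(runs)) / runs
            if spread - base > best_gain:
                best_node, best_gain = v, spread - base
        seeds.append(best_node)
    return seeds

# toy usage on a small directed graph
g = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
print(greedy_influence_max(g, k=2, p=0.3, runs=200))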
4. Mining Biological and Medical Data: He has collaborated with medical doctors on analyzing medical
data to improve medical care and processes, such as improving lung disease classification using 3D
images (Academic Radiology 2006), improving the prediction of diabetic nephropathy (AI in Medicine 2008),
and analyzing the risk factors of diabetes (IEEE T. Information Technology in Biomedicine 2008). He has also
collaborated with researchers in biology and bioinformatics on analyzing biological data to discover new
knowledge, such as discovering protein interactions by analyzing bio-articles (BMC Bioinformatics 2011)
and improving related-protein search by indexing protein structures (BMC Medical Informatics and Decision
Making 2013).
Research Vision
My main research thrusts will be placed in the following themes.
1. Integration of Machine Learning and Database for Enabling Big Data Analytics: The big data world
has introduced a new challenge: enabling existing statistical and machine learning-based analysis on
data whose volume, variety, and velocity are unprecedentedly high. On the other hand, “complex” data
analysis in the machine learning community and “efficient” data processing in the database community have
evolved independently, without an intimate connection between them. Such connection or integration is
no longer optional for producing consolidated solutions for big data analytics. My
previous research efforts have also been pursued under this theme. As a result, my research
outputs have had practical impacts, e.g., (1) integrating rank learning and rank processing to
enable a real-time relevance feedback search engine (i.e., RefMed), (2) integrating disk-based clustering
with SVM to make SVM scalable (i.e., CB-SVM), and (3) implementing graph analysis operators using
DBMS engine technologies (i.e., TurboGraph). While I keep pushing to stand at the top of both
fields, I will continue to place my main research thrust on the integration of the two, which is becoming
ever more critical to enabling big data analytics technology.
2. Graph Analysis and Processing: As a particular effort in the near future, I will focus on
developing graph analysis and processing engines, as part of the main research theme presented
above. Specifically, TurboGraph, a fast parallel graph engine handling billion-scale graphs in a single
machine (KDD 2013), achieved the ideal speed of graph processing by adopting DBMS engine
technology and full parallelism, i.e., CPU parallelism using multi-cores and I/O parallelism using SSDs.
TurboGraph currently supports a basic set of graph operators including BFS, k-NN, community detection,
PageRank, matrix-vector multiplication, etc. My first effort will go into extending the graph operators to
cover more complex operations such as matrix-matrix multiplication, matrix factorization, clustering, and
community detection. My second effort will go into developing a distributed version of TurboGraph by
extending the full parallelism to include network parallelism on top of the CPU and I/O parallelism.