Curriculum Vitae

HWANJO YU
Associate Professor
Department of Computer Science and Engineering
POSTECH, Pohang, Korea

Office: PIRL 335
Phone: +82-54-279-2388 (Mobile: +82-10-4118-7006)
Fax: +82-54-279-2299
Email: hwanjoyu@postech.ac.kr
Homepage: http://hwanjoyu.org

Positions
2011-      Associate Professor, Department of Computer Science and Engineering, POSTECH, Pohang, Korea
2008-2010  Assistant Professor, Department of Computer Science and Engineering, POSTECH, Pohang, Korea
2004-2008  Assistant Professor, Computer Science Department, University of Iowa, Iowa City, USA
1999-2004  Research Assistant, National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign (UIUC), USA
1995-1998  Database System and Application Developer, Sunwave Co., Seoul, Korea & Los Angeles, USA

Education
1998-2004  Ph.D., Computer Science, University of Illinois at Urbana-Champaign (Advisor: Jiawei Han)
1993-1997  B.S., Computer Science and Engineering, Chung-Ang University, Seoul

Awards
2013     Best Poster Award at IEEE Int. Conf. Data Engineering (ICDE) (out of 150 full and short papers)
2010     2nd and 3rd places at the UCSD Data Mining Contest (graduate advisees)
2004     Nominated by U. of Iowa for the Microsoft junior faculty fellowship
2003     The 2003 UIUC Data Mining Research Gold Award
2003     IBM Research Student Scholarship Award from ACM SIGKDD'03 (International Conference on Knowledge Discovery and Data Mining)
2003     Student Scholarship Award from IJCAI'03 (International Joint Conference on Artificial Intelligence)
2003     Student Scholarship Award from CIKM'03 (International Conference on Information and Knowledge Management)
2002     IBM Research Student Scholarship Award from ACM SIGKDD'02
1996     Samsung Electronics Scholarship for outstanding undergraduates
1994-96  Chung-Ang Scholarships for outstanding undergraduates (graduated with the highest major GPA in the department)

Teaching
POSTECH (2008~): Mining Big Data, Data Structures and Algorithms, Database Systems, Advanced Topics in Data Mining, Introduction to Data Mining, Introduction to Computing
U. of Iowa (2004~2008): Database Systems, Knowledge Discovery and Data Mining, Data Mining and Machine Learning

Advising
Hwanjo Yu has graduated 5 MS students and currently advises 1 PhD and 8 MS/PhD students. His graduate advisees have published papers in ACM SIGMOD, ACM SIGKDD, IEEE ICDE, IEEE ICDM, and ACM CIKM, and have received prestigious awards, including the IEEE ICDE best poster award (2013) and 2nd and 3rd places at the UCSD data mining contest (2010). His graduate advisees have also been selected for internships at Microsoft Research and Microsoft Research Asia. He has also advised undergraduate students at POSTECH; one of his undergraduate advisees published a paper in IEEE ICDM and received a student travel award (2012). His undergraduate advisee also received the Samsung Human-Tech paper award (2012) as the only undergraduate student among the recipients, which is very unusual for this award.

Services
I have served as an associate editor of the Neurocomputing journal since 2005 and served on an NSF proposal panel in 2006. I have also served on the organizing committees of international conferences, including as Proceedings Chair of APWeb 2010 and PC Chair of EDB 2013, and was an invited speaker at an international conference, PRIB 2012. I have been actively serving on the program committees of ACM KDD, ACM SIGMOD, VLDB, IEEE ICDE, IEEE ICDM, and ACM CIKM.
Projects

Hwanjo Yu has been the PI or Co-PI of 7 national and 8 industrial projects since he joined POSTECH in 2008, and the total amount of funds he has managed is about 3.5 million USD. The following are representative national projects (ongoing or finished).

Development of Enabling Software Technology for Big Data Mining (360k USD / year, 5 years): This project is supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (No. 2012M3C4A7033344). The goal of this project is the development of enabling software technologies for big data mining. Through this project, we research data mining techniques for big data in the natural sciences and social networks. We will also develop personalized service technologies based on unstructured big data analysis and customer behavior models. Furthermore, we will produce well-trained software engineers who are experts in big data mining.

Developing Search and Mining Technologies for Mobile Devices (100k USD / year, 3 years): This project is supported by the Mid-career Researcher Program through an NRF grant funded by the MEST (No. KRF-2011-0016029). Combining the highly profitable information search industry and the mobile computing paradigm, the mobile information search industry has been growing rapidly. This project aims at advancing technologies in the areas of mobile search and mining, low-power-consumption utility mining, and mining for mobile online advertising.

User-Friendly Search Engine for MEDLINE (100k USD / year, 3 years): This project is supported by the Mid-career Researcher Program through an NRF grant funded by the MEST (No. KRF-2009-0080667). PubMed MEDLINE, a database of biomedical and life science journal articles, is one of the most important information sources for medical doctors and bio-researchers. Finding the right information in MEDLINE is nontrivial because it is not easy to express the intended relevance using the current PubMed query interface, and its query processor focuses on fast matching rather than accurate relevance ranking. This project develops techniques for building a user-friendly MEDLINE search engine.

Enabling Relevance Search in Bio Databases (50k USD / year, 3 years): This project is supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-331-D00528). Most online data retrieval systems, built on relational database management systems (RDBMS), support fast processing of Boolean queries but offer little support for relevance or preference ranking. Unified support for Boolean and ranking constraints in a query is essential for user-friendly data retrieval. This project develops foundational techniques that enable data retrieval systems in which users intuitively express ranking constraints and the system efficiently processes the queries.

The following are representative industrial projects (ongoing or finished).

Developing Big Video Search and Analysis Technology (Korea Telecom) (50k USD): This project aims to develop a distributed video processing (DiViP) engine and techniques supporting distributed storage, information retrieval, and object recognition for large-scale video datasets.
More specifically, it develops 1) a distributed storage system for large-scale video datasets, 2) a distributed processing engine that supports traditional video processing techniques at large scale, 3) location-based video retrieval, and 4) an efficient recognized-object-oriented video retrieval technique.

Developing Distributed Machine Learning Algorithms for Classification and Recommendation (Samsung Elec.) (60k USD): Existing recommendation systems (e.g., the Netflix competition) focus on accurate prediction of purchases, as the systems are evaluated based on prediction accuracy. However, such systems tend to recommend popular items. Recommending popular items might not effectively influence users' purchase decisions, as users likely already know the items and have pre-made decisions about purchasing them, e.g., a recommendation to watch Star Wars or Titanic. Effective recommendation must suggest unexpected or novel items that could surprise users and affect their purchase decisions. This project develops an effective recommendation system for digital TV customers.

Mining for High-Utility Advertising in Search Engines (Microsoft) (60k USD): Ads are the dominant revenue source for search engine companies, and the success of search engine ads relies on finding relevant ads that are highly likely to draw users' clicks and eventually lead to conversions (i.e., actual purchases). Eliciting ad clicks and conversions from Web search requires investigating several research problems, such as (1) identifying users' commercial intent from queries, (2) mining the factors, for each query, that impact ad clicks, (3) mining click patterns or periods (time and location) for each type of ad, and (4) designing ad styles that maximize clicks and conversions. We define high-utility ads as ads that are properly selected and well designed for a specific query, time, location, and user, and thus are highly likely to induce users' clicks and conversions. This project takes on these four research problems to select and design high-utility ads. The four problems will be researched based on the log data we collected via our proxy server during the 8-month project. After the project, during an internship at MSRA, we plan to investigate the problems using Microsoft log data and expand the research using Microsoft ads auction data. Finally, we plan to integrate the results of the four tasks to build tools that automatically select and design high-utility ads given a user's query.

Developing a Multi-Variable Optimization Method based on Data Analysis (POSCO) (100k USD): This project generates a prediction model for detecting problems in the iron rolling process from historical data and estimates the optimal values of the parameters of each rolling procedure.
Selected Publications

Conferences

W Han, S Lee, K Park, J Lee, M Kim, J Kim, H Yu, "TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC", ACM SIGKDD 2013
S Jeon, S Kim, H Yu, "Don't be Spoiled by Your Friends: Spoiler Detection in TV Program Tweets", AAAI ICWSM 2013
J Kim, S Kim, H Yu, "Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Network", IEEE ICDE 2013 (best poster award)
W Lee, J Kim, H Yu, "CT-IC: Continuously activated and Time-restricted Independent Cascade Model for Viral Marketing", IEEE ICDM 2012 (student travel award)
J Oh, H Yu, "iSampling: Framework for Developing Sampling Methods Considering User's Interest", ACM CIKM 2012
S Kim, K Toutanova, H Yu, "Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia", ACL 2012
Y Kim, J Kim, H Yu, "GeoSearch: Georeferenced Video Retrieval System", ACM SIGKDD 2012
J Oh, T Kim, S Park, H Yu, "PubMed Search and Exploration with Real-Time Semantic Network Construction", ACM SIGKDD 2012
J Oh, S Park, H Yu, M Song, S Park, "Novel Recommendation based on Personal Popularity Tendency", IEEE ICDM 2011
S Kim, T Qin, TY Liu, H Yu, "Advertiser-Centric Approach to Understand User Click Behavior in Sponsored Search", ACM CIKM 2011
B Lee, J Oh, H Yu, J Kim, "Protecting Location Privacy using Location Semantics", ACM SIGKDD 2011
H Yu, I Ko, Y Kim, S Hwang, WS Han, "Exact Indexing for Support Vector Machines", ACM SIGMOD 2011
WS Han, J Lee, YS Moon, S Hwang, H Yu, "A New Approach for Processing Ranked Subsequence Matching Based on Ranked Union", ACM SIGMOD 2011
H Yu, S Kim, "Passive Sampling for Regression", IEEE ICDM 2010
H Yu, S Kim, S Na, "RankSVR: Can Preference Data Help Regression?", ACM CIKM 2010
WS Han, WS Kwak, H Yu, "On Supporting Effective Web Extraction", IEEE ICDE 2010
H Yu, J Oh, WS Han, "Efficient Feature Weighting Method for Ranking", ACM CIKM 2009
H Yu, T Kim, J Oh, I Ko, S Kim, "RefMed: Relevance Feedback Retrieval System for PubMed", ACM CIKM 2009
NA Vien, VH Viet, T Chung, H Yu, S Kim, B Cho, "VRIFA: A Nonlinear SVM Visualization Tool using Nomogram and LRBF Kernels", ACM CIKM 2009
H Yu, Y Kim, S Hwang, "An Efficient Method for Learning Ranking SVM", PAKDD 2009
H Yu, J Vaidya, X Jiang, "Privacy-Preserving SVM Classification on Vertically Partitioned Data", PAKDD 2006
H Yu, "SVM Selective Sampling for Ranking with Application to Data Retrieval", ACM SIGKDD 2005
H Yu, S Hwang, KCC Chang, "RankFP: A Framework for Supporting Rank Formulation and Processing", IEEE ICDE 2005
H Yu, D Searsmith, X Li, J Han, "Scalable Construction of Topic Directory with Nonparametric Closed Termset Mining", IEEE ICDM 2004
H Yu, J Yang, J Han, "Classifying Large Data Sets Using SVM with Hierarchical Clusters", ACM SIGKDD 2003
H Yu, "SVMC: Single-Class Classification With Support Vector Machines", IJCAI 2003
H Yu, C Zhai, J Han, "Text Classification from Positive and Unlabeled Documents", ACM CIKM 2003
H Yu, "General MC: Estimating Boundary of Positive Class from Small Positive Data", IEEE ICDM 2003
H Yu, J Han, KCC Chang, "PEBL: Positive Example Based Learning for Web Page Classification Using SVM", ACM SIGKDD 2002
H Yu, KCC Chang, J Han, "Heterogeneous Learner for Web Page Classification", IEEE ICDM 2002

Journals

S Kim, L Sael, H Yu, "Efficient Protein Structure Search using Indexing Methods", BMC Medical Informatics and Decision Making, Springer 2013 (IF: 1.6)
J Oh, T Kim, S Park, H Yu, Y Lee, "Efficient Semantic Network Construction with Application to PubMed Search", Knowledge-Based Systems, Elsevier 2013 (IF: 4.104)
H Yu, J Kim, Y Kim, S Hwang, YH Lee, "An Efficient Method for Learning Nonlinear Ranking SVM Functions", Information Sciences 2012 (IF: 3.643)
M Song, H Yu, WS Han, "Combining Active Learning and Semi-Supervised Learning Techniques to Extract Protein Interaction Sentences", BMC Bioinformatics 2011 (IF: 3.02)
J Lee, MD Pham, J Lee, WS Han, H Cho, H Yu, JH Lee, "Processing SPARQL queries with regular expressions in RDF databases", BMC Bioinformatics 2011 (IF: 3.02)
H Yu, "Selective Sampling Techniques for Feedback-based Data Retrieval", Data Mining and Knowledge Discovery 2011 (IF: 2.877)
NA Vien, H Yu, TC Chung, "Hessian Matrix Distribution for Bayesian Policy Gradient Reinforcement Learning", Information Sciences 2011 (IF: 3.643)
H Yu, T Kim, J Oh, I Ko, S Kim, WS Han, "Enabling Multi-Level Relevance Feedback on PubMed by Integrating Rank Learning into DBMS", BMC Bioinformatics 2010 (IF: 3.02)
G Yu, S Hwang, H Yu, "Supporting Personalized Ranking over Categorical Attributes", Information Sciences 2008 (IF: 3.643)
B Cho, H Yu, J Lee, Y Chee, I Kim, S Kim, "Nonlinear Support Vector Machine Visualization for Risk Factor Analysis using Nomograms and Localized Radial Basis Function Kernels", IEEE T. Information Technology in Biomedicine 2008 (IF: 1.978)
J Vaidya, H Yu, X Jiang, "Privacy-Preserving SVM Classification", Knowledge and Information Systems 2008 (IF: 2.225)
B Cho, H Yu, K Kim, T Kim, I Kim, S Kim, "Application of irregular and unbalanced data to predict diabetic nephropathy using visualization and feature selection methods", Artificial Intelligence in Medicine 2008 (IF: 1.355)
H Yu, S Hwang, KCC Chang, "Enabling Soft Queries for Data Retrieval", Information Systems 2007 (IF: 1.768)
H Yu, "Single-Class Classification with Mapping Convergence", Machine Learning 2005 (IF: 1.467)
H Yu, J Yang, J Han, X Li, "Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing", Data Mining and Knowledge Discovery 2005 (IF: 2.877)
H Yu, J Han, KCC Chang, "PEBL: Web Page Classification without Negative Examples", IEEE TKDE 2004 (IF: 1.892)

Research Statement

Hwanjo Yu is one of the pioneers in classification without negative examples and privacy-preserving SVM. He has also developed influential algorithms and systems in the areas of data mining, databases, and machine learning, including (1) SVM-JAVA, a widely used Java open-source implementation of SVM; (2) RefMed, the world's first relevance feedback search engine for PubMed; (3) iKernel, the first exact indexing method for SVM; (4) IPA, a scalable and parallelizable influence maximization algorithm for large-scale social networks; and (5) TurboGraph, a fast parallel graph engine handling billion-scale graphs in a single PC. His methods and algorithms were published in prestigious journals and conferences including ACM SIGMOD, ACM SIGKDD, IEEE ICDE, IEEE ICDM, and ACM CIKM, on whose program committees he also serves.

Research Achievements

His research achievements can be categorized into the following four (largely non-overlapping) areas.

1. Search Engine for "Complex" Queries: Typical search engines find relevant results from keyword queries. Keyword queries are not sufficient for "complex" databases, where the user's search intention is often too complex to express in a few keywords.
For example, with the same keyword query "breast cancer" in PubMed, a widely used life science journal database (http://www.ncbi.nlm.nih.gov/pubmed), the user may want to search for articles about recent treatments or for articles about related genes. He developed technologies for enabling real-time relevance feedback search and built RefMed (http://hwanjoyu.org/refmed), a relevance feedback search engine for PubMed. RefMed is distinct from existing relevance feedback search engines in that the user can specify her notion of relevance by relative ordering or multi-degree relevance (e.g., highly relevant, somewhat relevant, not relevant), whereas existing systems require the user to specify it by binary relevance (relevant or not). Thereby, RefMed learns an accurate relevance function from a relatively small amount of feedback. The enabling technologies he developed are three-fold:

(1) How to accurately learn the user's hidden preference or relevance function from a small amount of feedback? He developed methods for SVM active learning or selective sampling, nonlinear SVM ranking, feature weighting for ranking, etc. In particular, he developed theorems for optimal selective sampling for ranking SVM (KDD 2005, DAMI 2011) and successfully applied them in the real-time search engine RefMed to minimize the amount of feedback needed to reach a given Ranking SVM accuracy (a minimal sketch of the underlying pairwise ranking idea appears at the end of this section). Related papers were published in KDD 2005, CIKM 2009, CIKM 2010, ICDM 2010, KDD 2011, DAMI 2011, and Information Sciences 2012.

(2) How to efficiently retrieve the top-k results according to the learned relevance function? He developed an SVM indexing method, the first exact indexing method for SVM ranking queries. Such an index is particularly critical for a real-time relevance feedback search engine like RefMed because, once a ranking function is learned, the top-k results must be returned in real time; without the index, finding them would require a slow scan over all candidates. Related papers were published in two ACM SIGMOD 2011 papers and Information Sciences 2013.

(3) How to seamlessly integrate rank learning and rank processing? He developed methods for seamlessly integrating the two. Related papers were published in ICDE 2005, CIKM 2009, KDD 2012, and Knowledge-Based Systems 2013.

RefMed is currently widely used by bio-scientists, and it is ranked 1st on Google for the query "PubMed relevance search", higher than PubMed itself.
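For readers outside the field, the following is a minimal sketch of the pairwise reduction on which RankSVM-style learning relies: graded relevance feedback is converted into preference pairs and a linear SVM is trained on the difference vectors. It uses toy data and scikit-learn, and it illustrates only the basic idea, not the published selective-sampling, nonlinear-ranking, or RefMed algorithms.

```python
# Minimal pairwise RankSVM-style sketch (illustrative only; not the published
# RefMed / selective-sampling algorithms). Assumes NumPy and scikit-learn.
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y):
    """Turn graded relevance labels (higher = more relevant) into pairwise
    difference vectors labeled by which document should rank higher."""
    X_pairs, y_pairs = [], []
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue  # no preference between equally relevant documents
            sign = 1 if y[i] > y[j] else -1
            X_pairs.append(X[i] - X[j])
            y_pairs.append(sign)
            X_pairs.append(X[j] - X[i])   # add the mirrored pair so both
            y_pairs.append(-sign)         # classes are present and balanced
    return np.array(X_pairs), np.array(y_pairs)

# Toy feedback: 5 documents, 3 features, multi-degree relevance labels
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([2, 2, 1, 0, 0])             # highly / somewhat / not relevant

X_p, y_p = pairwise_transform(X, y)
model = LinearSVC(C=1.0, fit_intercept=False).fit(X_p, y_p)

scores = X @ model.coef_.ravel()          # higher score = predicted more relevant
print(np.argsort(-scores))                # document indices, best first
```

The learned weight vector plays the role of the relevance function; selective sampling then concerns which feedback pairs to request so that few labels are needed, which is where the cited theorems come in.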
2. Support Vector Machines (Partially Supervised, Scalable, Privacy-Preserving, and Ranking): SVMs (Support Vector Machines) have been widely used for learning classification, regression, and ranking functions. He made significant contributions to the advancement of SVMs; the following are representative works.

(1) Classification without negative examples, also called partially supervised learning, learns a classification function from positive and unlabeled data. He published pioneering papers in this area in KDD 2002, ICDM 2003, CIKM 2003, IJCAI 2003, TKDE 2004, and Machine Learning 2005.

(2) He developed a disk-based SVM classification method called CB-SVM, which combines hierarchical clustering with iterative SVM learning. Related papers were published in KDD 2003 and DAMI 2005.

(3) Privacy-preserving SVM learns SVM models in a distributed environment without disclosing private data or information to other parties. He published pioneering papers in this area in PAKDD 2006, ACM SAC 2006, and KAIS 2008.

(4) RankSVM (or Ranking SVM) learns a ranking function (a relevance or preference function) from relative ordering or multi-degree labeled data. He developed efficient active learning or selective sampling for RankSVM (KDD 2005, CIKM 2010, DAMI 2011), an effective method for learning nonlinear RankSVM (PAKDD 2009, Information Sciences 2012), and a feature weighting method for RankSVM (CIKM 2009).

He also developed a Java implementation of SVM (http://hwanjoyu.org/svm-java) for educational and research purposes; it is ranked 2nd and 4th on Google, after LIBSVM, for the query "svm java".

3. Social Network and Graph Analysis and Processing: As social network sites such as Facebook, Twitter, and LinkedIn grow rapidly, analysis of social network data has gained much attention and many applications have been introduced. He has recently developed a social network analysis algorithm for viral marketing and a parallel processing engine for graphs, detailed in the following.

(1) The influence maximization problem, introduced at KDD 2003, is to find k people in a social network whose combined influence is maximized. Since the problem was proven NP-hard in 2003, many approximate algorithms have been introduced (a sketch of the classic greedy baseline is given after this Research Achievements section). He developed a scalable and parallelizable algorithm for influence maximization called IPA (http://dm.postech.ac.kr/ipa_demo), which runs tens of times faster than the previous method while using 5 times less memory. IPA is applicable to all IC-based influence diffusion models, whereas the previous method is limited to a specific model, and it is easily parallelized. IPA was published in ICDE 2013 and won the best poster award (out of 150 full and short papers).

(2) He developed TurboGraph, a fast parallel graph engine handling billion-scale graphs in a single PC (http://wshan.net/turbograph). TurboGraph outperforms the state-of-the-art graph engine GraphChi by up to four orders of magnitude for a wide range of queries such as BFS, targeted queries (e.g., kNN), and global queries (e.g., PageRank, connected components). TurboGraph was published in KDD 2013, and the paper has been the 2nd most downloaded among all KDD papers and the most downloaded among KDD 2013 papers in the ACM Digital Library over the last six weeks (as of September 9, 2013).

4. Mining Biological and Medical Data: He has collaborated with medical doctors to analyze medical data and improve medical care and processes, for example improving lung disease classification using 3D images (Academic Radiology 2006), improving the prediction of diabetic nephropathy (AI in Medicine 2008), and analyzing risk factors of diabetes (IEEE T. Information Technology in Biomedicine 2008). He has also collaborated with researchers in biology and bioinformatics to analyze biological data and discover new knowledge, such as discovering protein interactions by analyzing bio-articles (BMC Bioinformatics 2011) and improving related-protein search by indexing protein structures (BMC Medical Informatics and Decision Making 2013).
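As context for the influence maximization work above, the sketch below shows the classic Monte-Carlo greedy baseline under the independent cascade (IC) model, i.e., the kind of slow baseline that scalable methods such as IPA improve upon. It is not the IPA algorithm, and the toy graph, propagation probability, and simulation count are illustrative.

```python
# Greedy influence maximization under the independent cascade (IC) model.
# This is only the classic Monte-Carlo greedy baseline; it is NOT IPA.
import random

def simulate_ic(graph, seeds, p=0.1):
    """One IC cascade: each newly activated node tries once to activate each
    inactive out-neighbor with probability p. Returns number activated."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_im(graph, k, p=0.1, mc=1000):
    """Pick k seeds by greedily adding the node with the largest estimated
    marginal gain in expected spread (averaged over mc cascades)."""
    seeds = []
    for _ in range(k):
        best, best_spread = None, -1.0
        for v in graph:
            if v in seeds:
                continue
            spread = sum(simulate_ic(graph, seeds + [v], p) for _ in range(mc)) / mc
            if spread > best_spread:
                best, best_spread = v, spread
        seeds.append(best)
    return seeds

# Toy directed graph as an adjacency list
toy = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
print(greedy_im(toy, k=2))
```

The greedy loop repeatedly adds the node with the largest estimated marginal gain; its cost, roughly k times the number of nodes times mc cascade simulations, is exactly what motivates faster, parallelizable approximations.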
Research Vision

My main research thrusts will be placed on the following themes.

1. Integration of Machine Learning and Databases for Enabling Big Data Analytics: The big data world has introduced a new challenge: enabling existing statistical and machine learning-based analysis on data whose volume, variety, and velocity are unprecedentedly high. Meanwhile, "complex" data analysis in the machine learning community and "efficient" data processing in the database community have evolved independently, without an intimate connection between them. Such a connection, or integration, is no longer optional for producing consolidated solutions for big data analytics. My previous research efforts have also been pursued under this theme, and as a result my research outputs have had practical impact, e.g., (1) integrating rank learning and rank processing to enable a real-time relevance feedback search engine (RefMed), (2) integrating disk-based clustering with SVM to make SVM scalable (CB-SVM), and (3) implementing graph analysis operators using DBMS engine technologies (TurboGraph). While I keep pushing to stand at the top of both fields, I will continue to place my main research thrust on their integration, which is becoming ever more critical for enabling big data analytics.

2. Graph Analysis and Processing: As a particular effort in the near future, I will work on graph analysis and processing engines, as part of the main research theme presented above. Specifically, TurboGraph, a fast parallel graph engine handling billion-scale graphs in a single machine (KDD 2013), achieved the ideal speed of graph processing by adopting DBMS engine technology and full parallelism, i.e., CPU parallelism using multi-cores and I/O parallelism using SSDs. TurboGraph currently supports a basic set of graph operators including BFS, k-NN, community detection, PageRank, matrix-vector multiplication, etc. My first effort will go into extending the graph operators to cover more complex operations such as matrix-matrix multiplication, matrix factorization, clustering, and community detection. My second effort will go into developing a distributed version of TurboGraph by extending the full parallelism to include network parallelism on top of CPU and I/O parallelism.
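To make the graph-operator discussion concrete, the following is a minimal power-iteration PageRank over a tiny in-memory adjacency list. It illustrates only the operator itself; it does not reflect TurboGraph's out-of-core storage, CPU/I-O parallelism, or API, and the toy graph and parameters are made up.

```python
# Minimal power-iteration PageRank over an in-memory adjacency list.
# Purely illustrative of the PageRank operator; TurboGraph's out-of-core,
# parallel implementation is far more involved.
def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            out = graph[u]
            if not out:                        # dangling node: spread rank evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                share = damping * rank[u] / len(out)
                for v in out:
                    new[v] += share
        rank = new
    return rank

# Toy directed graph as an adjacency list
toy = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
print(pagerank(toy))
```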