Data Mining in Practice: Techniques and Practical Applications
Junling Hu
May 14, 2013

What is data mining?
• Mining patterns from data
• Is it statistics?
    • Functional form?
    • Computation speed concern?
        • Data size
        • Variable size
• Is it machine learning?
    • Big data issue
    • New methods: network mining

Examples of data mining
• Frequently bought together
• Movie recommendation

More examples of data mining
• Keyword suggestions
• Genome & disease mining
• Heart monitoring

Overview of data mining
• Frequent pattern mining
• Machine learning
    • Supervised
    • Unsupervised
• Stream mining
• Recommender system
• Graph mining
• Unstructured data
    • Text, audio
    • Image and video
• Big data technology

Frequent Pattern Mining
• Diaper and beer?
• Product assortment
• Click behavior
• Machine breakdown

The case of Amazon

User   Items
1      {Princess dress, crown, gloves, t-shirt}
2      {Princess dress, crown, gloves, pink dress, t-shirt}
3      {Princess dress, crown, gloves, pink dress, jeans}
4      {Princess dress, crown, gloves, pink dress}
5      {crown, gloves}

• Count frequency of co-occurrence
• Efficient algorithm

Machine Learning Process

Machine Learning
• Supervised
• Unsupervised (clustering)

Binary classification
Each row is a data point. The first five columns are input features; "Risky?" is the output class.

Checking   Duration (years)   Savings ($k)   Current loans   Loan purpose   Risky?
Yes        1                  10             Yes             TV             0
Yes        2                  4              No              TV             1
No         5                  75             No              Car            0
Yes        10                 66             No              Repair         1
Yes        5                  83             Yes             Car            0
Yes        1                  11             No              TV             0
Yes        4                  99             Yes             Car            0

Classification (1)
• Decision tree

Classification (2): Neural network
• Perceptron
• Multi-layer neural network

Head pose detection

Support Vector Machine (SVM)
• Search for a separating hyperplane
• Maximize margin

Perceived advantage of SVM
• Transform data into higher dimension

Applications of SVM: Spam Filter
Input features:
• Transmission
    • IP address: 167.12.24.555
    • Sender URL: one-spam.com
• Email header
    • From: “admin@one-spam.cpm”
    • To: “undisclosed”
    • cc
• Email body
    • # of paragraphs
    • # of words
• Email structure
    • # of attachments
    • # of links

Logistic regression
Advantage:
• Simple functional form
• Can be parallelized
• Large scale

Applications of logistic regression
• Click prediction
• Search ranking (web pages, products)
• Online advertising
• Recommendation
The model
• Output: click / no click
• Input features: page content, search keyword, user information

Regression
• Linear regression
• Non-linear regression
Application:
• Stock price prediction
• Credit scoring
• Employment forecast

History of Supervised learning

Semi-supervised learning
Application:
• Speech dialog system

Unsupervised learning: Clustering
• No labeled data
• Methods
    • K-means

Categories of machine learning

Applications of Clustering
• Malware detection
• Document clustering: topic detection

Graphs in our life
• Social network
    • Friend recommendation
• Molecular compound
    • Drug discovery

Graph and its matrix representation
[Figure: a graph with nodes 1 to 6]
Adjacency matrix

     1  2  3  4  5  6
1    0  1  0  0  0  1
2    1  0  1  1  0  0
3    0  1  0  1  1  0
4    0  1  1  0  1  0
5    0  0  1  1  0  1
6    1  0  0  0  1  0

The web graph
[Figure: Page 1, Page 2 and Page 3 connected by hyperlinks, each labeled with its anchor text]

PageRank as a steady state
Transition matrix P =

     1     2     3     4     5     6
1    0     0.5   0.25  0     0     0.5
2    0.33  0     0.25  1     0     0
3    0.33  0.5   0     0     0.33  0
4    0     0     0.25  0     0.33  0
5    0     0     0.25  0     0     0.5
6    0.33  0     0     0     0.33  0

PageRank is a probability vector that is a steady state of the transition matrix P.

Discover influencers on Twitter
• The Twitter graph
    • Node
    • Link
• A PageRank approach: TwitterRank
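
The steady-state view of PageRank from the "PageRank as a steady state" slide can be sketched with power iteration. This is a minimal sketch under stated assumptions, not the method used in the talk: it assumes the 0.33 entries stand for 1/3, that column j of P holds the probabilities of moving from page j to each page i (the columns then sum to 1), and that the steady state is the vector p with p = P p. The function names step and steady_state are hypothetical.

```python
# Transition matrix from the slide, reading 0.33 as 1/3 (an assumption).
# Column j holds the probabilities of moving from page j to each page i.
P = [
    [0,   1/2, 1/4, 0, 0,   1/2],
    [1/3, 0,   1/4, 1, 0,   0],
    [1/3, 1/2, 0,   0, 1/3, 0],
    [0,   0,   1/4, 0, 1/3, 0],
    [0,   0,   1/4, 0, 0,   1/2],
    [1/3, 0,   0,   0, 1/3, 0],
]


def step(P, p):
    """One step of the random walk: the matrix-vector product P p."""
    n = len(p)
    return [sum(P[i][j] * p[j] for j in range(n)) for i in range(n)]


def steady_state(P, tol=1e-12, max_iter=10_000):
    """Power iteration: apply P repeatedly until p stops changing."""
    n = len(P)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = step(P, p)
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


pagerank = steady_state(P)
for page, score in enumerate(pagerank, start=1):
    print(f"page {page}: {score:.3f}")
```

Plain power iteration is enough for this small matrix because every page can reach every other page and the walk is not periodic. Web-scale PageRank usually adds a damping (random jump) term, which is left out here.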
[Figure: the Twitter graph, with numbered user nodes connected by "Following" links]

Facebook graph search
• Entity graph
• Natural language search
    • “Restaurants liked by my friends”

Recommending a game

Recommendation in Travel site

Prediction Problems
• Rating prediction
    • Given how a user rated other items, predict the user’s rating for a given item
• Top-N recommendation
    • Given the list of items liked by a user, recommend new items that the user might like

Explicit vs. Implicit Feedback Data
• Explicit feedback
    • Ratings and reviews
• Implicit feedback (user behavior)
    • Purchase behavior: recency, frequency, …
    • Browsing behavior: # of visits, time of visit, length of stay, clicks

Collaborative Filtering
Hypotheses
• User/item similarities
    • Similar users purchase similar items
    • Similar items are purchased by similar users
• Matching characteristics
    • A match exists between the user’s and the item’s characteristics

User-User similarity
User’s movie rating

         Out of Africa   Star Wars   Air Force One   Liar, Liar
John     4               4           5               1
Adam     1               1           2               5
Laura    ?               4           5               2

Item-item similarity

                 John   Adam   Laura
Out of Africa    4      1      ?
Star Wars        4      1      4
Air Force One    5      2      5
Liar, Liar       1      5      2

Application of item-item similarity
• Amazon

SVD (Singular Value Decomposition)

Latent factors

Application of Latent Factor Model
• GetJar

Ranking-based recommendation

Application in LinkedIn
• Ranking-based model

Thanks and Contact
Co-author: Patricia Hoffman
Contact: junlinghu@gmail.com
Twitter: @junling_tech
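
As a follow-up to the user-user similarity slide, the missing rating (Laura's rating for Out of Africa) can be estimated from the users most similar to her. The slides do not say which similarity measure or prediction rule is used, so this sketch assumes Pearson correlation over co-rated movies and simply copies the rating of the single most similar user. The names pearson and predict_from_nearest_user are hypothetical.

```python
from math import sqrt

# Ratings from the slide; Laura has not rated "Out of Africa".
ratings = {
    "John":  {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam":  {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2},
}


def pearson(a, b):
    """Pearson correlation of two users over the movies both have rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[m] for m in common]
    ys = [b[m] for m in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs)) * sqrt(sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0


def predict_from_nearest_user(user, movie):
    """Return (most similar user who rated the movie, that user's rating)."""
    candidates = [
        (pearson(ratings[user], other_ratings), other)
        for other, other_ratings in ratings.items()
        if other != user and movie in other_ratings
    ]
    _, neighbour = max(candidates)
    return neighbour, ratings[neighbour][movie]


neighbour, predicted = predict_from_nearest_user("Laura", "Out of Africa")
print(f"Most similar user: {neighbour}; predicted rating: {predicted}")
```

A production recommender would typically average over several neighbours, weighted by similarity. The single-neighbour rule is used here only to keep the example short.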