• Introduction
• Functionalities
• Hot topics
• Research Groups
• Useful Resources
• Introduction
– What is data mining?
– General Process
– Related Fields
– Different Views
• ( From Prof. Jiawei Han’s Slides ): Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• ( From Prof. Sunita Sarawagi’s slides ): Process of semi-automatically analyzing large databases to find patterns that are
– valid: hold on new data with some certainty
– novel: non-obvious to the system
– useful: should be possible to act on the item
– understandable: humans should be able to interpret the pattern
• ( From Prof. Vipin Kumar’s slides ): Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
– What is not Data Mining?
• Look up phone number in phone directory
• Query a Web search engine for information about “Amazon”
– What is Data Mining?
• Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
• Group together similar documents returned by search engine according to their context
- Tan, Steinbach, Kumar, Introduction to Data Mining
[Figure: the knowledge discovery process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
- Han & Kamber, Data Mining: Concepts and Techniques
• Confluence of multiple disciplines
– Database technology, statistics, machine learning, visualization, algorithms, and other disciplines all feed into data mining
- Han & Kamber, Data Mining: Concepts and Techniques
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• But different…
[Figure: data mining at the overlap of statistics/AI, machine learning/pattern recognition, and database systems]
- Tan, Steinbach, Kumar, Introduction to Data Mining
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
From Prof. Vipin Kumar’s slides
• Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on
– scalability in the number of features and instances
– algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
– automation for handling large, heterogeneous data
- From Prof. Sunita Sarawagi’s slides
• Categorize a data mining task from different views
• By general functionality and operations:
– Descriptive data mining
• Find human-interpretable patterns that describe the data.
• Clustering / similarity matching
• Association rules and variants
• Deviation detection
– Predictive data mining
• Use some variables to predict unknown or future values of other variables.
• Regression
• Classification
• Collaborative Filtering
• By data to be mined
– Relational, data warehouse, transactional, stream, object-oriented, sequence, graph, social network, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• By knowledge to be discovered
– Characterization, discrimination, frequent patterns, association, classification, clustering, trend/deviation, outlier analysis, etc
• By techniques utilized
– Database-oriented, data warehouse (OLAP), combinational algorithms, machine learning, statistics, visualization, etc.
• By application adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
- Han & Kamber, Data Mining: Concepts and Techniques
• Functionalities
– Data Warehousing and OLAP
– Frequent patterns, association, correlation and causality
– Classification and prediction
– Clustering
– Outlier analysis, Trend and evolution analysis
• Data Warehousing:
– “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• OLAP: on-line analytical processing
– Major task of data warehouse system
– Data analysis and decision making
– Drill-down, roll-up, exception/discovery driven
• Methodology
– Data cubing
– Iceberg cube
– Multi-way, BUC, Star, MM, shell, close-cube, etc.
[Figure: cuboid lattice over the dimensions product, date, and country: all; product, date, country; (product, date), (product, country), (date, country); (product, date, country)]
- Han & Kamber, Data Mining: Concepts and Techniques
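To make the cuboid lattice concrete, here is a minimal sketch of full cube computation over the three dimensions named above. The fact table, the `sales` measure, and the helper names `compute_cube` and `iceberg` are hypothetical; this is a naive enumeration for illustration, not Multi-way, BUC, or any other method listed on the slide.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (product, date, country, sales)
rows = [
    ("TV", "Q1", "USA", 100),
    ("TV", "Q2", "USA", 150),
    ("PC", "Q1", "USA", 200),
    ("PC", "Q1", "Canada", 50),
]
dims = ("product", "date", "country")


def compute_cube(rows, dims):
    """Sum the measure for every subset of dimensions (one cuboid per subset)."""
    cube = {}
    for k in range(len(dims) + 1):
        for idx in combinations(range(len(dims)), k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)
                cuboid[key] += row[-1]  # last column is the measure
            cube[tuple(dims[i] for i in idx)] = dict(cuboid)
    return cube


def iceberg(cuboid, min_total):
    """Keep only the cells whose aggregate reaches the threshold."""
    return {cell: total for cell, total in cuboid.items() if total >= min_total}


cube = compute_cube(rows, dims)
print(cube[()])                             # apex cuboid ("all")
print(cube[("product",)])                   # roll-up to product only
print(iceberg(cube[("country",)], 100))     # iceberg condition on one cuboid
```

Moving from `cube[("product", "date")]` to `cube[("product",)]` is a roll-up; the reverse direction is a drill-down.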
• Frequent pattern : a pattern (itemsets, subsequences, substructures, etc.) that occurs frequently in a data set
– Comparing to n-grams, phrases, etc.
• Motivation : Finding inherent regularities in data
• Applications : Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
• Association rule mining:
– Given a set of records each of which contain some number of items from a given collection;
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
– Frequent patterns → association rules → correlations
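A dependency rule is usually scored by its support and confidence. Those two measures are the standard ones from the association rule literature rather than something stated on this slide, and the basket data below is made up for illustration.

```python
# Hypothetical market-basket records.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"diapers", "beer"}, transactions))
print(confidence({"diapers"}, {"beer"}, transactions))
```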
• Types of data:
– Itemsets, sequences, graphs.
• Scalable mining methods: Three major approaches
– Apriori (Agrawal & Srikant @VLDB’94)
– FP-growth (Han, Pei & Yin @SIGMOD’00)
• PrefixSpan, CloSpan, gSpan, CloseGraph, etc.
– Vertical data format approach (CHARM: Zaki & Hsiao @SDM’02)
• Apriori:
– Candidate pattern generation and pruning
– Breadth-first search over pattern space
• FP-growth:
– Pattern growth through FP-tree, no candidate generation
– Depth-first search, doing pruning smartly
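The Apriori bullets (candidate generation, pruning, breadth-first search) can be sketched as a level-wise loop. This is a simplified illustration, not the optimized algorithm from the VLDB’94 paper; the function name `apriori`, the `min_count` parameter, and the sample data are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count records."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:  # breadth-first: one pass per itemset size
        prev = set(level)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # One scan of the data to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

FP-growth avoids the candidate sets entirely by growing patterns from a compressed prefix tree, which is why it needs no join or prune step.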
• Supervised Learning, already discussed in Machine Learning.
– Classification: classifies data (constructs a model) based on the training set and the values ( categorical class labels ) in a classifying attribute and uses it in classifying new data
– Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
• Algorithms:
– Decision tree based: C4.5, ID3, RainForest, etc.
– Bayesian methods: Naïve Bayes, Bayesian networks, and many others covered in Machine Learning
– Discriminative: Perceptron/Winnow, NN, SVM, CB-SVM, etc.
– Rule-based, associative, k-NN, etc.
– Prediction: regression
• Bagging, Boosting, Model Selection, Cross-Validation
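As one small instance of the algorithms and evaluation tools listed above, the sketch below pairs a k-NN classifier with leave-one-out cross-validation. The toy data, the squared Euclidean distance, and the function names are illustrative assumptions.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (features, label). Majority vote among the k nearest points."""
    nearest = sorted(
        train,
        key=lambda fl: sum((a - b) ** 2 for a, b in zip(fl[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(data, k=3):
    """Cross-validation with one held-out example per fold."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)


# Hypothetical labeled points with two well-separated classes.
data = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
print(knn_predict(data, (1.1, 1.0)))
print(leave_one_out_accuracy(data, k=3))
```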
• Unsupervised Learning, as discussed in Machine Learning
– Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
– Similarities/distances: many!
• Algorithms:
– Partition based: K-means, K-Medoids, CLARA, etc
– Hierarchical: Bottom-up (single/complete/average link), top-down, BIRCH
– Density-based/Grid-based: DBSCAN, DENCLUE, CLIQUE, etc.
– Model-based: EM, COBWEB, SOM, etc.
– High-Dimensional, Constraint based
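For the partition-based family, a bare-bones K-means (Lloyd-style iteration) might look like the following. The squared Euclidean distance, random initialization from the data points, and the sample points are assumptions made for the sketch.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Alternate between assigning points to centroids and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            j = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

K-medoids differs in the update step: the new center must be an actual data point, which makes it less sensitive to extreme values.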
• outliers: The set of objects that are considerably dissimilar from the remainder of the data
– Statistical: hypothesis testing, bug mining
– Density based
– Clustering based, etc
• Deviation/Anomaly Detection
• Fraud Detection
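A very simple statistical approach flags values that lie many standard deviations from the mean. The z-score rule, the 2.5 threshold, and the sample readings below are illustrative assumptions, not a method prescribed by the slides.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [x for x in values if abs(x - mu) / sigma > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
print(zscore_outliers(readings))
```

Because the extreme value inflates both the mean and the standard deviation, this rule is weak on small samples; density-based and clustering-based methods avoid that dependence on a single global distribution.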
• Trend and Evolution:
– Usually coupled with outlier analysis
– Basic functionalities in temporal data mining
– Trend, cycle, seasonal, irregular patterns
• Hot topics
– Mining data streams, mining time series, spatiotemporal data mining, mining social networks, sequential data mining, graph mining, biology data mining, privacy-preserving data mining
– Text and Web mining
• Data: data streams (continuous, ordered, fast-changing, huge in volume)
• Characteristics and Challenges :
– Huge volumes
– Fast changing, requires fast and real-time response
– Random access is expensive, so single-scan algorithms are needed
– The whole stream is difficult to keep, so approximations are needed
• Basic problems:
– Multi-dimensional on-line analysis of streams
– Mining outliers and unusual patterns in stream data
– Clustering data streams
– Classification of stream data
• Methods:
– Basic: Sliding windows, Tilted time frames
– Counting (FP mining, etc):
• Random sampling
• Approximated counting
– OLAP:
• Keep Critical layers in stream cube computation
• Partial materialization
• outlier: exception-based exploration
– Clustering:
• Online micro-clustering and offline macro-clustering
• Text Related Applications:
– Web logs and Web page click streams
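One common way to do random sampling in a single scan is reservoir sampling. It is swapped in here as a concrete example; the slides do not name a specific sampling scheme, and the function name and parameters are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(n)    # uniform integer in [0, n)
            if j < k:               # happens with probability k/n
                reservoir[j] = item
    return reservoir


# The stream is consumed once and never stored in full.
sample = reservoir_sample(iter(range(10000)), k=10, seed=1)
print(sample)
```

Memory stays at k items regardless of stream length, which matches the single-scan and approximation constraints listed above.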
• Data: Time-series database
– Consists of sequences of values or events changing with time
– Data is recorded at regular intervals
• Characteristics and Challenges :
– Characteristic time-series components: Trend, cycle, seasonal, irregular patterns
• Basic Problems:
– Trends discovery, Similarity Search, outlier detection, prediction and clustering
• Methods:
– Statistical modeling (Regression, Spline, Mixture Model, etc)
– Data transformation (DFT, DWT)
– Sliding windows, atomic matching, window stitching, subsequence ordering
– Clustering
- Han & Kamber, Data Mining: Concepts and Techniques
• Text Related Applications:
– Transliteration mining, temporal text mining, word bursting, etc.
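A sliding-window moving average is one of the simplest ways to expose the trend component of a time series. The helper name `moving_average` and the sample series are assumptions for illustration.

```python
def moving_average(series, window):
    """Mean of each consecutive window, computed with a running total."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    out = [total / window]
    for i in range(window, len(series)):
        total += series[i] - series[i - window]  # slide the window by one step
        out.append(total / window)
    return out


print(moving_average([1, 2, 3, 4, 5], 3))
```

Subtracting the smoothed series from the original leaves the cyclic, seasonal, and irregular components for separate analysis.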
• Data: object data sets, spatial/spatiotemporal databases and data warehouses
• Characteristics and Challenges:
– Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
– Handling objects in space that have identity and well-defined extents, locations, and relationships
– Requires merging sets of geographic areas through spatial operations
• Basic Problems:
– Querying objects; distribution/cluster/correlation/evolution/trend analysis
• Methods
– GIS (Geographic Information System): Analysis and visualization of geographic data
• Search, location analysis, terrain analysis, distribution, spatial analysis/statistics, measurement
– Indexing Spatial data (R-tree, etc. )
– Modeling single objects with points, lines and regions
– Modeling spatially related collection of objects: plane partitions and networks.
– Spatiotemporal patterns, correlations, trend analysis, clustering …
• Text Related Applications:
– Spatiotemporal text mining; community evolution in weblogs;
– Information diffusion; web evolution
• Association rule mining and frequent itemset mining are pretty old topics
• However, some special topics of frequent pattern mining are still hot
– Sequential pattern mining
– Graph mining
– Pattern post-processing
• Data: sequential databases
• Basic problems:
– Discovery of frequent subsequences (gaps allowed, in contrast to n-grams); closed subsequences
– Sequence similarity search, sequence alignment
• Methods:
– Apriori: GSP
– FP-growth: PrefixSpan, CloSpan
– BLAST, hidden Markov models, CRF, etc.
• Text Related Applications:
– Most text patterns are sequential patterns
– Phrase extraction, entity/relation extraction, opinion mining, etc
– Biology sequence modeling
- Han & Kamber, Data Mining: Concepts and Techniques
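The difference between a gapped subsequence and an n-gram shows up directly in how support is counted. The tiny tokenized database and the function names below are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in order within sequence, with gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` advances the iterator


def is_ngram(pattern, sequence):
    """True if pattern occurs as a contiguous run (no gaps)."""
    n = len(pattern)
    return any(
        list(sequence[i:i + n]) == list(pattern)
        for i in range(len(sequence) - n + 1)
    )


def seq_support(pattern, database, match=is_subsequence):
    """Number of sequences in the database that contain the pattern."""
    return sum(match(pattern, s) for s in database)


db = [
    "data mining finds patterns".split(),
    "data stream mining is hard".split(),
    "mining data is fun".split(),
]
pattern = ["data", "mining"]
print(seq_support(pattern, db))                  # gapped subsequence support
print(seq_support(pattern, db, match=is_ngram))  # contiguous n-gram support
```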
• Data: graph databases (like social network, but multiple graphs, more general), examples include
– Chemical component, protein structure, program flow, XML/Web,
– Directed, undirected, labeled/unlabeled, weighted, 2-D/3-D, etc.
• Characteristics and Challenges:
– In theory, most of these problems have high complexity, but in practice real graphs are tractable
– Too many substructures to index
– …
• Basic problems
– Frequent subgraph mining
– Closed subgraph mining
– Graph indexing by substructures
– Similarity search
-Han & Kamber, Data Mining: Concepts and Techniques
• Methods:
– Subgraph mining: Apriori (e.g. FSG), Pattern Growth (e.g. gSpan)
– gSpan : pattern growth, depth first search, active elimination of duplicated subgraphs; Flatten a graph into a sequence using depth first search; enumerate graph using right-most extension.
– CloseGraph: mining closed subgraph patterns
– gIndex: identify frequent structures, prune redundancy to keep discriminative structures, and build an index on those structures
– Similarity search: indexing; feature-based similarities; estimating missing features
• Text Related Applications:
– Multi-resolution topic map, entity-relation network, pathway extraction, etc.
• More on Graph Indexing and Similarity Search
• Comparing to Text Retrieval:
                | Text Retrieval    | Graph Indexing & Search
Objects         | Documents         | Graphs
Basic units     | Words             | Frequent structures
Pruning         | Stopwords         | Need to mine frequent subgraphs
Redundancy?     | Stemming          | Need discriminative structures
Representation  | Term vectors      | Feature vectors
Dimensions      | Terms             | Substructures
Relevance       | Vector similarity | Vector similarity
Approximation   | No                | Yes, need to estimate missing features (relax substructures)
• What if we want to index on phrases instead of words?
– Need to extract phrases first
– N-grams/sequential patterns, have to remove redundancy
• E.g. “natural language processing” vs. “language processing”
– Substructures are like phrases…
• Can IR help?
– Representation and similarity measures? (Vector space models, probabilistic models…)
– How to weight features? (TF-IDF, …)
– Generative models?
– Query expansion? Feedback?
• Data: frequent patterns extracted by mining algorithms
• Challenge:
– Mining algorithms output an explosively large number of patterns
– How to interpret the frequent patterns extracted
• Basic Problems:
– Pattern summarization
– Mining compressed patterns
– Top-K patterns
– Pattern annotation
– User-oriented ranking
• Methods:
– Modeling Pattern profiles, coverage and contexts
– Using Clustering to summarize and compress patterns
– Bridging IR/NLP and frequent pattern mining: profile, context, ranking, feedback, filtering, summarization, MMR, etc.
• Data: Graphs/networks with nodes and links
– Example: communication networks, webpages, citations, biological pathways, etc.
• Characteristics and Challenges:
– Connected Components: few
– Network diameter: small
– Clustering: high degree
– Degree distribution: heavy-tailed
– Modeling Logical/statistical dependencies
• Basic Problems:
– Model the generation of graphs/networks
– Link-based object ranking, classification, identification, clustering, entity resolution
– Link prediction, querying, community discovery
- H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)
• Methods:
– Graph generation models: derive generative models that explain the characteristics and evolution of social networks/graphs
– Vertex Ranking: PageRank, HITS, etc.
– Community Detection: Hierarchical Clustering, Spectral clustering, Stochastic modeling, etc.
– Link based classification: semi-supervised learning, propagation
– Entity resolution: duplicate prediction, collective resolution, probabilistic models
– Link Prediction: binary classification problem, local conditional probabilistic models
– Substructure mining: graph pattern mining, indexing
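As an example of vertex ranking, here is a compact power-iteration sketch of PageRank. The damping factor 0.85, the even redistribution of rank from dangling nodes, and the toy link graph are common conventions assumed for illustration.

```python
def pagerank(links, damping=0.85, iters=100, tol=1e-10):
    """links: {node: [nodes it points to]}. Returns {node: rank}, summing to 1."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}  # random-jump share
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

HITS differs by keeping two scores per vertex, hub and authority, and updating them in alternation.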
• Generative Models of social network/graph generation and evolution
• Random graphs (Erdős-Rényi models)
– Fix vertices, generate each edge independently with probability p
– N(N-1)/2 trials of a biased coin flip, p ~ 1/N
– Degree distribution is Poisson, E[d] = p(N-1); E[# of e] = pN(N-1)/2
– Parameter: p
• Graph process model:
– starting with no edges, just keep adding one edge at a time
– always choose next edge randomly from among all missing edges
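The random-graph recipe (fix the vertices, flip a biased coin for each of the N(N-1)/2 possible edges) translates almost line for line into code. The function names and the parameter values below are illustrative.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Include each of the n(n-1)/2 possible edges independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def degrees(n, edges):
    """Degree of every vertex 0..n-1."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


n, p = 200, 0.05
edges = erdos_renyi(n, p, seed=42)
deg = degrees(n, edges)
print("edges:", len(edges), "expected:", p * n * (n - 1) / 2)
print("mean degree:", sum(deg) / n, "expected:", p * (n - 1))
```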
• α-model (Watts-Strogatz models, Small-world)
– For vertices u, v, define m(u,v) to be the number of common neighbors (so far)
– Define the propensity R(u,v) of u to connect to v
• if m(u,v) >= k, R(u,v) = 1 (share too many friends, must connect)
• if m(u,v) = 0, R(u,v) = p (no mutual friends no bias to connect)
• else, R(u,v) = p + (m(u,v)/k)^α (1-p) (biased to connect)
– Generate network incrementally, with R(u,v) as the edge probability
– As α → ∞, the model becomes similar to Erdős-Rényi models
– Need to tune parameter α, p, k
• Scale-free models: N (# of vertices) is not fixed
– Start with (say) two vertices connected by an edge
– Let Z = Σ d(j), where d(j) = degree of vertex j so far
– Add new vertex i with k edges back to {1, …, i-1}: i is connected back to j with probability d(j)/Z
– The rich get richer …
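The scale-free growth rule can be sketched by keeping a list in which every vertex appears once per incident edge, so that a uniform draw from the list selects vertex j with probability d(j)/Z. Vertices are numbered from 0 here, and forcing the k targets of a new vertex to be distinct is an assumption the slide leaves open.

```python
import random


def preferential_attachment(n, k=1, seed=None):
    """Grow a graph to n vertices; each new vertex links to k degree-biased targets."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # start with two vertices joined by an edge
    endpoints = [0, 1]    # vertex j appears d(j) times in this list
    for i in range(2, n):
        want = min(k, i)  # cannot pick more distinct targets than exist
        chosen = set()
        while len(chosen) < want:
            chosen.add(rng.choice(endpoints))  # P(j) = d(j) / Z
        for j in chosen:
            edges.append((i, j))
            endpoints.extend((i, j))
    return edges


graph = preferential_attachment(100, k=2, seed=0)
degree = {}
for u, v in graph:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print("edges:", len(graph))
print("highest degrees:", sorted(degree.values(), reverse=True)[:5])
```

Early vertices tend to accumulate the most links, which is the rich-get-richer effect behind the heavy-tailed degree distribution.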
• Evaluation of generative models
– Can they explain all the characteristics of social networks?
– Parameter tuning?
• Other models for Social network analysis
– Copying model: leads to communities
– Forest Fire Model
– Electricity network (not generative model, but interesting)
• Text Related Applications: quite a lot!
– Ranking webpages
– Multi-resolution Concept/Topic Map
– Citation Impact of scientific literature
– Entity-relation extraction
– Bioinformatics: Pathway extraction
– Reference Reconciliation
– Web structure evolution
– Community discovery in weblogs
• Data: text, unstructured/semi-structured; webpages with linkages, user logs;
– E.g. webpage, news, email, weblogs, scientific literature, citations, customer reviews, forums, search logs, chatting logs, legal documents, etc.
• Challenges:
– Modeling unstructured/semi-structured data
– Coupling with Natural Language Processing
– Handling high dimensionality
– Handling data sparseness and ambiguity
– The Web is too complicated!
• Selected Problems:
– Text categorization/clustering ( Already covered in NLP and ML )
– Word sense disambiguation ( Covered in NLP )
– Information Extraction ( Covered in NLP )
– Dimension Reduction ( Overlapping with ML and IR )
– Collaborative Filtering, User-interest modeling
– Topic Detection and Tracking
– Comparative Text Mining, Theme based text mining
– Transliteration mining
– Email clustering / spam detection
– Opinion mining ( Overlapping with NLP )
– Social Networks Related (Already covered)
– Temporal Text Mining
– Vision based page segmentation / Block based search
• Methods: Confluence of Multiple Disciplines
– Database: data integration, schema matching, XML
– Data mining: sequential pattern mining, association rule mining, …
– IR: Search, language models, feedback, …
– Machine Learning: SVD, supervised/unsupervised learning, semi-supervised learning, topic models, …
– NLP: POS tagging, parsing, context modeling, sentiment extraction, entity extraction, …
– Statistical Learning: Bayesian methods, word bursting, time-series analysis, hypothesis testing, other statistical models, …
• Resolution:
– Word level: Word sense disambiguation, word bursting, transliteration mining
– Entity level: information extraction, entity-relation network
– Pattern level: opinion mining, relation extraction
– Document level: document classification/clustering
– Theme level: PLSI, LDA, comparative text mining, temporal text mining/spatiotemporal text mining
– Topic level: topic detection and tracking, email threading
– Web level: social network, weblog mining, block based search
• Selected topics will be discussed in the next meeting.
• Research Groups
– Stanford, CMU, UIUC, Wisc, Helsinki, UMN
– IBM, Microsoft, MSRA, Yahoo!
– Others
• Rakesh Agrawal
– One of the Leaders in Data Mining
– Frequent patterns, privacy-preserving data mining
• Stanford: Jerome H. Friedman
– http://www-stat.stanford.edu/~jhf/
– Strong Statistical flavor, machine learning, boosting
• CMU: Christos Faloutsos
– http://www.cs.cmu.edu/~christos/
– Graph mining, Social Networks, Stream data mining, Image/Multimedia mining, time-series mining
• UIUC: Jiawei Han
– http://www-sal.cs.uiuc.edu/~hanj/
– Many! Frequent pattern mining, graph mining, OLAP/Cubing, Stream data mining, Classification, Clustering, …
• University of Helsinki: Heikki Mannila
– http://www.cs.helsinki.fi/research/fdk/
– http://www.cs.helsinki.fi/u/mannila/
– Frequent itemset mining, computational biology
• Wisconsin: Raghu Ramakrishnan
– http://www.cs.wisc.edu/dmi/
– http://www.cs.wisc.edu/~raghu/
– Data warehousing, cubing, classification/clustering
• Minnesota: Vipin Kumar
– http://www-users.cs.umn.edu/~kumar/
– Spatiotemporal data mining
• IBM T.J. Watson: Philip S. Yu
– http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.html
– http://www.research.ibm.com/people/p/psyu/index.html
– Frequent pattern mining, graph mining, data streams
• Microsoft Research Redmond: Surajit Chaudhuri
– http://research.microsoft.com/dmx/
– Database-related topics, data cleaning, etc.
• Microsoft Research Redmond: Eric Brill
– http://research.microsoft.com/tmsn/
– http://research.microsoft.com/~brill/
– Text Mining, Search and Navigation Research, NLP
• Microsoft Research Asia:
– http://research.microsoft.com/wsm/
– Web search, web/text mining
• Yahoo! Research: Prabhakar Raghavan
– http://research.yahoo.com/researcher.shtml
– http://theory.stanford.edu/~pragh/
– Web/Text Mining, Social Networks
• IBM Webfountain
– http://www.almaden.ibm.com/webfountain/
• UIC: Bing Liu
– http://www.cs.uic.edu/~liub/
– Association rule mining, web/text mining
• UNC: Wei Wang
– http://www.cs.unc.edu/~weiwang/
– Biology data mining, frequent pattern mining
• Simon Fraser: Jian Pei
– http://www.cs.sfu.ca/~jpei/
– Sequential pattern mining, OLAP
• National University of Singapore: Anthony K.H. Tung
– http://www.comp.nus.edu.sg/~atung/
– Spatial data mining, Biology data mining
• …
• Useful Resources
– Text Books
– Toolkits
– Conferences
– Others
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
• J. Han and M. Kamber. Data Mining: Concepts and Techniques . Morgan Kaufmann, 2nd ed., 2006
• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
• P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining , Wiley, 2005
• S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2nd ed., 2005
From Prof. Jiawei Han’s slides
• Weka: Data mining software in Java
– http://www.cs.waikato.ac.nz/%7Eml/weka/
• IlliMine (Illinois Data Mining System)
– http://illimine.cs.uiuc.edu/
– Data Cubing
– Frequent Pattern Mining
– Sequential pattern mining
– Graph pattern Mining
– Classification
• Collected by Vipin Kumar:
– http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm
• KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
• Other related conferences
– ACM SIGMOD
– VLDB
– (IEEE) ICDE
– WWW, SIGIR
– ICML, CVPR, NIPS
• Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
From Prof. Jiawei Han’s slides
• KDnuggets
– http://www.kdnuggets.com/
• Tutorial: Machine Learning Techniques for Data Mining (WEKA) slides, Eibe Frank, University of Waikato
– http://books.elsevier.com/companions/1558605525?country=United+States
• Ideas for course projects in data mining
– Collected by Vipin Kumar
– http://www-users.cs.umn.edu/~kumar/dmbook/projects.htm