What is Data Mining?

• ( From Prof. Jiawei Han’s Slides ): Data mining (knowledge discovery from data)

– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

• ( From Prof. Sunita Sarawagi’s slides ): Process of semi-automatically analyzing large databases to find patterns that are

– valid: hold on new data with some certainty

– novel: non-obvious to the system

– useful: should be possible to act on the item

– understandable: humans should be able to interpret the pattern

• ( From Prof. Vipin Kumar’s slides ): Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is Data Mining? (cont.)

• Under these definitions:

– What is not Data Mining?

• Look up phone number in phone directory

• Query a Web search engine for information about “Amazon”

– What is Data Mining?

• Certain names are more prevalent in certain US locations

(O’Brien, O’Rourke, O’Reilly… in Boston area)

• Group together similar documents returned by search engine according to their context

General Process of KDD

– Data mining: core of the knowledge discovery process

– Stages shown in the figure, in order: Data Cleaning and Data Integration, Data Warehouse, Task-relevant Data, Data Mining, Pattern Evaluation


Related Fields






• Confluence of multiple disciplines

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

• But different…

(Figure: Data Mining at the overlap of related fields, including machine learning and database systems)

Differences to Related Fields

• Traditional Techniques may be unsuitable due to

– Enormity of data

– High dimensionality of data

– Heterogeneous, distributed nature of data

• Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on

– scalability in the number of features and instances

– algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning

– automation for handling large, heterogeneous data

Different Views of Data Mining

• Categorize a data mining task from different views

• By general functionality and operations:

– Descriptive data mining

• Find human-interpretable patterns that describe the data.

• Clustering / similarity matching

• Association rules and variants

• Deviation detection

– Predictive data mining

• Use some variables to predict unknown or future values of other variables.

• Regression

• Classification

• Collaborative Filtering

Different Views of Data Mining (II)

• By data to be mined

– Relational, data warehouse, transactional, stream, object-oriented, sequence, graph, social network, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW

• By knowledge to be discovered

– Characterization, discrimination, frequent patterns, association, classification, clustering, trend/deviation, outlier analysis, etc

• By techniques utilized

– Database-oriented, data warehouse (OLAP), combinational algorithms, machine learning, statistics, visualization, etc.

• By application adapted

– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Warehousing and OLAP

• Data Warehousing:

– “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)

• OLAP: on-line analytical processing

– Major task of data warehouse system

– Data analysis and decision making

– Drill-down, roll-up, exception/discovery driven

• Methodology

– Data Cubing

– Iceberg cube

– Multi-way, BUC, Star, MM, shell, close-cube, etc.

(Figure: cuboid lattice over the dimensions product, date, country: all; product; date; country; product,date; product,country; date,country)
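To make the cubing idea concrete, here is a minimal sketch that aggregates a measure over every subset of the dimensions (every cuboid) and optionally drops small cells, which is the iceberg condition in its simplest form. It is a naive full-cube computation, not Multi-way, BUC, or Star cubing; the fact table, the `sales` measure, and the names `compute_cube` and `min_sum` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (product, date, country, sales).
rows = [
    ("tv", "2006-01", "US", 10),
    ("tv", "2006-02", "US", 5),
    ("pc", "2006-01", "US", 7),
    ("pc", "2006-01", "CA", 3),
]
dims = ("product", "date", "country")


def compute_cube(rows, dims, min_sum=0):
    """Aggregate SUM(sales) for every cuboid (subset of the dimensions).

    Cells whose aggregate is below `min_sum` are dropped (iceberg condition).
    """
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            cells = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group)
                cells[key] += row[-1]  # last column is the measure
            name = tuple(dims[i] for i in group)  # () is the apex cuboid "all"
            cube[name] = {c: v for c, v in cells.items() if v >= min_sum}
    return cube


cube = compute_cube(rows, dims)
total_sales = cube[()][()]                 # roll-up to "all"
by_product = cube[("product",)]            # roll-up to product
by_product_country = cube[("product", "country")]  # drill-down by country
```

Moving from `cube[("product",)]` to `cube[("product", "country")]` corresponds to a drill-down, and the reverse direction to a roll-up.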

Frequent Patterns and Associations

• Frequent pattern: a pattern (itemsets, subsequences, substructures, etc.) that occurs frequently in a data set

– Compare with n-grams, phrases, etc.

• Motivation: finding inherent regularities in data

• Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

• Association rule mining:

– Given a set of records, each of which contains some number of items from a given collection,

– produce dependency rules that predict the occurrence of an item based on occurrences of other items.

– Frequent pattern → association rules → correlations
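A dependency rule is usually scored by its support and confidence. The sketch below computes both for one rule on a toy basket data set; the transactions and the helper names `support` and `confidence` are made up for illustration.

```python
# Toy market-basket data (hypothetical, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): supp(A and C) / supp(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule {diapers} -> {beer}
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
```

Here the rule {diapers} → {beer} holds in 3 of 5 baskets, and in 3 of the 4 baskets that contain diapers.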

Mining Frequent Patterns

• Types of data:

– Itemsets, sequences, graphs.

• Scalable mining methods: Three major approaches

– Apriori (Agrawal & Srikant @VLDB’94)

– FPgrowth (Han, Pei & Yin @SIGMOD’00)

• PrefixSpan, CloSpan, gSpan, CloseGraph, etc.

– Vertical data format approach (CHARM, Zaki & Hsiao @SDM’02)

• Apriori:

– Candidate pattern generation and pruning

– Breadth-first search over pattern space

• FPgrowth:

– Pattern growth through FP-tree, no candidate generation

– Depth-first search, doing pruning smartly
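The Apriori steps listed above (candidate generation, pruning, level-wise search) can be sketched in a few lines. This is a simplified teaching version on in-memory sets, not the optimized algorithm from the paper; the function name `apriori` and the use of an absolute `min_support` count are choices made here.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Support counting: one pass over the data per level (breadth-first).
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(baskets, min_support=3)
```

In this toy run all single items and pairs are frequent, while {a, b, c} is generated as a candidate but rejected because it appears in only two baskets.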

Classification and Prediction

• Supervised Learning, already discussed in Machine Learning.

– Classification: constructs a model from the training set and the values (categorical class labels) of a classifying attribute, then uses the model to classify new data

– Prediction: models continuous-valued functions, i.e., predicts unknown or missing values

• Algorithms:

– Decision tree based: C4.5, ID3, RainForest, etc.

– Bayesian methods: Naïve Bayes, Bayesian networks, and many others covered in Machine Learning

– Discriminative: Perceptron/Winnow, NN, SVM, CB-SVM, etc.

– Rule-based, Associative, k-NN, etc.

– Prediction: Regression

• Bagging, Boosting, Model Selection, Cross-Validation
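As one small instance of the classifiers listed above, here is a sketch of k-NN: a new point gets the majority label of its k nearest training points. The training data, the labels, and the function name `knn_predict` are hypothetical; squared Euclidean distance is an assumption, and any other distance could be swapped in.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label); returns the majority label
    among the k training points closest to `query`."""
    nearest = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training set with a categorical class label.
train = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((8, 9), "high"), ((9, 8), "high"),
]
prediction = knn_predict(train, (1.5, 1.5))
```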


Clustering

• Unsupervised Learning, as discussed in Machine Learning

– Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that

– Data points in one cluster are more similar to one another.

– Data points in separate clusters are less similar to one another.

– Similarities/distances: many!

• Algorithms:

– Partition based: K-means, K-Medoids, CLARA, etc.

– Hierarchical: bottom-up (single/complete/average link), top-down


– Density-based/Grid-based: DBSCAN, DENCLUE, CLIQUE, etc.

– Model-based: EM, COBWEB, SOM, etc.

– High-Dimensional, Constraint based
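A partition-based method such as K-means alternates between assigning points to the nearest center and recomputing the centers. The sketch below is plain Lloyd's algorithm under simplifying assumptions made here: the first k points serve as initial centers, and squared Euclidean distance is the dissimilarity.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm on tuples of floats.

    Assumption for this sketch: the first k points are the initial centers.
    """
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment


points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.8, 1.2), (8.2, 7.8), (7.8, 8.2)]
centers, labels = kmeans(points, k=2)
```

Real implementations usually choose initial centers more carefully, since the result depends on the initialization.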

Outlier, Trend and Evolution

• Outliers: the set of objects that are considerably dissimilar from the remainder of the data

– Statistical: hypothesis testing, bug mining

– Density based

– Clustering based, etc

• Deviation/Anomaly Detection

• Fraud Detection

• Trend and Evolution:

– Usually coupled with outlier analysis

– Basic functionalities in temporal data mining

– Trend, cycle, seasonal, irregular patterns
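The statistical view of outliers can be illustrated with a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. This is only one simple statistical criterion, and the threshold of 2.0 and the function name `zscore_outliers` are assumptions of this sketch.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose |z-score| exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(readings)
```

Note that a single extreme value inflates the mean and standard deviation it is tested against, which is one reason density-based and clustering-based methods are also used.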

Mining Data Streams

• Data: data streams (continuous, ordered, fast-changing, huge in volume)

• Characteristics and Challenges:

– Huge volumes

– Fast changing, requires fast and real-time response

– Random access is expensive: need single-scan algorithms

– Difficult to store the whole stream: need approximations

• Basic problems:

– Multi-dimensional on-line analysis of streams

– Mining outliers and unusual patterns in stream data

– Clustering data streams

– Classification of stream data

Mining Data Streams (II)

• Methods:

– Basic: Sliding windows, Tilted time frames

– Counting (FP mining, etc.):

• Random sampling

• Approximate counting

• Keep critical layers in stream cube computation

• Partial materialization

• Outlier: exception-based exploration

– Clustering:

• Online microclustering and offline macroclustering

• Text Related Applications:

– Web logs and Web page click streams
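Random sampling over a stream has to work in a single scan without knowing the stream length in advance. One standard way to do this is reservoir sampling, sketched below; the slides do not name a specific sampling method, so treat this as one possible instance. The function name and the optional `rng` parameter are choices made here.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in one pass."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:             # happens with probability k/n
                sample[j] = item  # replace a random slot
    return sample


sample = reservoir_sample(iter(range(1000)), 10, random.Random(0))
```

Memory use is O(k) regardless of how long the stream runs, which matches the single-scan and approximation constraints of stream mining.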

Mining Time series

• Data: Time-series database

– Consists of sequences of values or events changing with time

– Data is recorded at regular intervals

• Characteristics and Challenges :

– Characteristic time-series components: Trend, cycle, seasonal, irregular patterns

• Basic Problems:

– Trends discovery, Similarity Search, outlier detection, prediction and clustering

Mining Time series (II)

• Methods:

– Statistical modeling (Regression, Spline, Mixture Model, etc)

– Data transformation (DFT, DWT)

– Sliding windows, atomic matching, window stitching, subsequence ordering

– Clustering

– Transliteration mining, Temporal text mining, word bursting, etc.
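Sliding-window similarity search can be sketched as follows: slide a window of the query's length over the series and keep the offset with the smallest distance. Euclidean distance on raw values is an assumption of this sketch; practical systems often normalize the windows or compare DFT/DWT coefficients instead.

```python
import math


def best_match(series, query):
    """Return (offset, distance) of the window of `series` closest to `query`."""
    m = len(query)
    best = (None, math.inf)
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - q) ** 2 for w, q in zip(window, query)))
        if dist < best[1]:
            best = (start, dist)
    return best


series = [0, 1, 2, 3, 2, 1, 0, 5, 6, 5, 0]
offset, distance = best_match(series, [5, 6, 5])
```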

Spatiotemporal data mining

• Data: object data sets, spatial/spatiotemporal databases and data warehouses

• Characteristics and Challenges:

– Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage

– Handle objects in space that have identity and well-defined extents, locations, and relationships

– Require merging sets of geographic areas by spatial operations

• Basic Problems:

– Querying objects; distribution/cluster/correlation/evolution/trend analysis

Spatiotemporal data mining (II)

• Methods

– GIS (Geographic Information System): Analysis and visualization of geographic data

• Search, location analysis, terrain analysis, distribution, spatial analysis/statistics, measurement

– Indexing Spatial data (R-tree, etc. )

– Modeling single objects with points, lines and regions

– Modeling spatially related collection of objects: plane partitions and networks.

– Spatiotemporal patterns, correlations, trend analysis, clustering …

• Text Related Applications:

– Spatiotemporal text mining; community evolution in weblogs

– Information diffusion; web evolution

Special topics in Frequent Pattern


• Association rule mining and frequent itemset mining are pretty old topics

• However, some special topics of frequent pattern mining are still hot

– Sequential pattern mining

– Graph mining

– Pattern post-processing

Sequential pattern mining

• Data: sequence databases

• Basic problems:

– Discovery of frequent subsequences (gaps allowed, unlike n-grams); closed subsequences

– Sequence similarity search, sequence alignment

• Methods:

– Apriori: GSP

– FP-Growth: PrefixSpan, CloSpan

– BLAST, Hidden Markov models, CRF, etc.

• Text Related Applications:

– Most text patterns are sequential patterns

– Phrase extraction, entity/relation extraction, opinion mining, etc

– Biology sequence modeling
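The difference between a sequential pattern and an n-gram is that the pattern's items only need to appear in order, not contiguously. A minimal sketch of the support count for one pattern follows; the toy database and the function names are hypothetical, and this is only the counting step, not a mining algorithm such as GSP or PrefixSpan.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, with gaps allowed."""
    it = iter(sequence)
    # `item in it` advances the iterator until it finds `item`,
    # so later items are only searched for after earlier ones.
    return all(item in it for item in pattern)


def sequence_support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


database = [
    ["a", "b", "c", "d"],
    ["a", "c", "d"],
    ["b", "a", "d"],
    ["d", "a"],
]
count = sequence_support(["a", "d"], database)
```

The pattern a, d is supported by the first three sequences even though a and d are adjacent in only one of them.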

Graph Mining

• Data: graph databases (like social networks, but with multiple, more general graphs); examples include

– Chemical compounds, protein structures, program flow, XML/Web

– Directed/undirected, labeled/unlabeled, weighted, 2-D/3-D, etc.

• Characteristics and Challenges:

– In theory, most problems (e.g., subgraph isomorphism) have high complexity, but in practice they are tractable on real graphs

– Too many substructures to index

– …

• Basic problems

– Frequent subgraph mining

– Closed subgraph mining

– Graph indexing by substructures

– Similarity search

Graph Mining (II)

• Methods:

– Subgraph mining: Apriori (e.g. FSG), Pattern Growth (e.g. gSpan)

– gSpan: pattern growth, depth-first search, active elimination of duplicate subgraphs; flattens a graph into a sequence using depth-first search; enumerates graphs using right-most extension

– CloseGraph: mining closed subgraph patterns

– gIndex: identify frequent structures, prune redundancy to keep discriminative structures, create an index on such structures

– Similarity search: indexing; feature-based similarities; estimating missing features

• Text Related Applications:

– Multi-resolution topic map, entity-relation network, pathway extraction, etc.

Graph Mining (III): Graph Indexing &


• More on Graph Indexing and Similarity Search

• Comparing to Text Retrieval (text retrieval vs. graph indexing & search):

– Basic units: … vs. frequent structures (need to mine frequent subgraphs)

– Pruning (redundancy?): stopwords, stemming vs. need discriminative structures

– Representation: term vectors vs. feature vectors

– Dimensions: terms vs. …

– Relevance: vector similarity vs. vector similarity

– Approximation: no vs. yes, need to estimate missing features (relax substructures)

Graph Mining (IV): Graph Indexing &


• What if we want to index on phrases instead of words?

– Need to extract phrases first

– N-grams/sequential patterns, have to remove redundancy

• E.g. “natural language processing” vs. “language processing”

– Substructures are like phrases…

• Can IR help?

– Representation and similarity measures? (Vector space models, probabilistic models…)

– How to weight features? (TF-IDF, …)

– Generative models?

– Query expansion? Feedback?

Pattern Post-processing

• Data: frequent patterns extracted by mining algorithms

• Challenge:

– Mining algorithms output an explosively large number of patterns

– How to interpret the extracted frequent patterns?

• Basic Problems:

– Pattern summarization

– Mining compressed patterns

– Top-K patterns

– Pattern annotation

– User-oriented ranking

• Methods:

– Modeling Pattern profiles, coverage and contexts

– Using Clustering to summarize and compress patterns

– Bridging IR/NLP and frequent pattern mining: profile, context, ranking, feedback, filtering, summarization, MMR, etc.

Mining Social Networks

• Data: Graphs/networks with nodes and links

– Example: communication networks, webpages, citations, biological pathways, etc.

• Characteristics and Challenges:

– Connected Components: few

– Network diameter: small

– Clustering: high degree

– Degree distribution: heavy-tailed

– Modeling Logical/statistical dependencies

• Basic Problems:

– Model the generation of graphs/networks

– Link-based object ranking, classification, identification, clustering, entity resolution

– Link Prediction, querying, community discovery

(Figure source: H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001))

Mining Social Networks (II)

• Methods:

– Graph Generation Models: trying to derive generative models that explain the characteristics and evolution of social networks/graphs.

– Vertex Ranking: PageRank, HITS, etc.

– Community Detection: Hierarchical Clustering, Spectral clustering, Stochastic modeling, etc.

– Link based classification: semi-supervised learning, propagation

– Entity resolution: duplicate prediction, collective resolution, probabilistic models

– Link Prediction: binary classification problem, local conditional probabilistic models

– Substructure mining: graph pattern mining, indexing
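As an example of vertex ranking, the sketch below runs PageRank by power iteration on a small link graph. The damping factor 0.85, the fixed iteration count, the uniform treatment of dangling nodes, and the toy graph are all assumptions of this sketch rather than details from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {node: [nodes it points to]}; every node must appear as a key."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation share that every node receives.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank


# Hypothetical four-page web graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Page c, which collects links from every other page, ends up with the highest rank, while d, which nobody links to, keeps only the teleportation share.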

Mining Social Networks (III)

• Generative Models of social network/graph generation and evolution

• Random graphs (Erdős-Rényi models)

– Fix vertices, generate each edge independently with probability p

– N(N-1)/2 trials of a biased coin flip, p ~ 1/N

– Degree distribution is approximately Poisson for large N, E[d] = p(N-1); E[# of edges] = pN(N-1)/2

– Parameter: p

• Graph process model:

– starting with no edges, just keep adding one edge at a time

– always choose next edge randomly from among all missing edges
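The random graph model is easy to simulate: flip one biased coin per vertex pair. The sketch below generates such a graph and evaluates the two expectations stated above; the function names and the optional `rng` parameter are choices made here.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, rng=None):
    """Generate each of the n(n-1)/2 possible edges independently with prob. p."""
    rng = rng or random.Random()
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def expected_degree(n, p):
    return p * (n - 1)


def expected_edges(n, p):
    return p * n * (n - 1) / 2


n, p = 50, 0.1
edges = erdos_renyi(n, p, random.Random(1))
```

Comparing `len(edges)` with `expected_edges(n, p)` over several seeds is a quick way to see the model's concentration around its mean.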

Mining Social Networks (IV)

• α-model (Watts-Strogatz models, Small-world)

– For vertices u, v, define m(u,v) to be the number of common neighbors (so far)

– Define the propensity R(u,v) of u to connect to v

• if m(u,v) >= k, R(u,v) = 1 (share too many friends, must connect)

• if m(u,v) = 0, R(u,v) = p (no mutual friends → no bias to connect)

• else, R(u,v) = p + (m(u,v)/k)^α (1-p) → biased to connect

– Generate network incrementally, with R(u,v) as the edge probability

– α → ∞ is similar to Erdős-Rényi models

– Need to tune parameters α, p, k
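The propensity function of the α-model translates directly into code. The sketch below follows the three cases above; the function and argument names are chosen here for illustration.

```python
def propensity(m, k, p, alpha):
    """Propensity R(u,v) of u to connect to v in the alpha-model.

    m: number of common neighbors of u and v so far
    k: threshold above which the pair must connect
    p: baseline connection probability
    alpha: how strongly mutual friends bias the connection
    """
    if m >= k:
        return 1.0
    if m == 0:
        return p
    return p + (m / k) ** alpha * (1 - p)


r_mid = propensity(2, 4, 0.1, 2)        # some mutual friends
r_large_alpha = propensity(2, 4, 0.1, 1000)  # alpha very large
```

With a very large α the middle case collapses to p for every m < k, which is why the model then behaves like an Erdős-Rényi random graph.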

Mining Social Networks (V)

• Scale-free models: N (# of vertices) is not fixed

– Start with (say) two vertices connected by an edge

– Let Z = Σ d(j), where d(j) = degree of vertex j so far

– Add new vertex i with k edges back to {1, …, i-1}: i is connected back to j with probability d(j)/Z

– The rich get richer …

• Evaluation of generative models

– Can they explain all the characteristics of social networks?

– Parameter tuning?

• Other models for Social network analysis

– Copying model: leads to communities

– Forest Fire Model

– Electricity network (not generative model, but interesting)
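The scale-free growth process described above (attach each new vertex to existing vertices with probability proportional to their degree) can be sketched as follows. Vertices are numbered from 0 here, and one simplification is an assumption of this sketch: the k targets are drawn with replacement and duplicate picks are collapsed, so a new vertex may get fewer than k edges.

```python
import random


def preferential_attachment(n, k=1, rng=None):
    """Grow a graph on n vertices; returns a list of (old_vertex, new_vertex)."""
    rng = rng or random.Random()
    edges = [(0, 1)]      # start with two vertices joined by an edge
    endpoints = [0, 1]    # vertex j appears d(j) times, so len(endpoints) == Z
    for i in range(2, n):
        # A uniform draw from `endpoints` picks j with probability d(j)/Z.
        targets = {rng.choice(endpoints) for _ in range(k)}
        for j in targets:
            edges.append((j, i))
            endpoints.extend((j, i))
    return edges


edges = preferential_attachment(200, k=1, rng=random.Random(0))
```

Keeping the flat `endpoints` list avoids recomputing Z and the degree distribution at every step, at the cost of O(number of edges) extra memory.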

Mining Social Networks (VI)

• Text Related Applications: quite a lot!

– Ranking webpages

– Multi-resolution Concept/Topic Map

– Citation Impact of scientific literature

– Entity-relation extraction

– Bioinformatics: Pathway extraction

– Reference Reconciliation

– Web structure evolution

– Community discovery in Weblogs

Text and Web mining

• Data: text, unstructured/semi-structured; webpages with linkages, user logs;

– E.g. webpage, news, email, weblogs, scientific literature, citations, customer reviews, forums, search logs, chatting logs, legal documents, etc.

• Challenges:

– Modeling unstructured/semi-structured data

– Coupling with Natural Language Processing

– Handling high dimensionality

– Handling data sparseness and ambiguity

– The Web is too complicated!

Text and Web mining (II)

• Selected Problems:

– Text categorization/clustering ( Already covered in NLP and ML )

– Word sense disambiguation ( Covered in NLP )

– Information Extraction ( Covered in NLP )

– Dimension Reduction ( Overlapping with ML and IR )

– Collaborative Filtering, User-interest modeling

– Topic Detection and Tracking

– Comparative Text Mining, Theme based text mining

– Transliteration mining

– Email clustering / spam detection

– Opinion mining ( Overlapping with NLP )

– Social Networks Related (Already covered)

– Temporal Text Mining

– Vision based page segmentation / Block based search

Text and Web mining (III)

• Methods: Confluence of Multiple Disciplines

– Database: data integration, schema matching, XML

– Data mining: sequential pattern mining, association rule mining, …

– IR: Search, language models, feedback, …

– Machine Learning: SVD, supervised/unsupervised learning, semi-supervised learning, topic models, …

– NLP: POS tagging, parsing, context modeling, sentiment extraction, entity extraction, …

– Statistical Learning: Bayesian methods, word bursting, time-series analysis, hypothesis testing, other statistical models, …

Text and Web mining (IV)

• Resolution:

– Word level: Word sense disambiguation, word bursting, transliteration mining

– Entity level: information extraction, entity-relation network

– Pattern level: opinion mining, relation extraction

– Document level: document classification/clustering

– Theme level: PLSI, LDA, comparative text mining, temporal text mining/spatiotemporal text mining

– Topic level: topic detection and tracking, email threading

– Web level: social network, weblog mining, block based search

• Selected topics will be discussed in the next meeting.

Research Groups

• Rakesh Agrawal

– One of the Leaders in Data Mining

– Frequent patterns, Privacy Preserved Data Mining

• Stanford: Jerome H. Friedman

– http://www-stat.stanford.edu/~jhf/

– Strong Statistical flavor, machine learning, boosting

• CMU: Christos Faloutsos

– http://www.cs.cmu.edu/~christos/

– Graph mining, Social Networks, Stream data mining, Image/Multimedia mining, time-series mining

• UIUC: Jiawei Han

– http://www-sal.cs.uiuc.edu/~hanj/

– Many! Frequent pattern mining, graph mining, OLAP/Cubing, Stream data mining, Classification, Clustering, …

Research Groups (II)

• University of Helsinki: Heikki Mannila

– http://www.cs.helsinki.fi/research/fdk/

– http://www.cs.helsinki.fi/u/mannila/

– Frequent itemset mining, computational biology

• Wisconsin: Raghu Ramakrishnan

– http://www.cs.wisc.edu/dmi/

– http://www.cs.wisc.edu/~raghu/

– Data warehousing, cubing, classification/clustering

• Minnesota: Vipin Kumar

– http://www-users.cs.umn.edu/~kumar/

– Spatiotemporal data mining

• IBM T.J. Watson: Philip S. Yu

– http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.html

– http://www.research.ibm.com/people/p/psyu/index.html

– Frequent pattern mining, graph mining, data streams

Research Groups (III)

• Microsoft Research Redmond: Surajit Chaudhuri

– http://research.microsoft.com/dmx/

– Database related, data cleaning, etc.

• Microsoft Research Redmond: Eric Brill

– http://research.microsoft.com/tmsn/

– http://research.microsoft.com/~brill/

– Text Mining, Search and Navigation Research, NLP

• Microsoft Research Asia:

– http://research.microsoft.com/wsm/

– Web search, web/text mining

• Yahoo! Research: Prabhakar Raghavan

– http://research.yahoo.com/researcher.shtml

– http://theory.stanford.edu/~pragh/

– Web/Text Mining, Social Networks

Research Groups (IV)

• IBM Webfountain

– http://www.almaden.ibm.com/webfountain/

• UIC: Bing Liu

– http://www.cs.uic.edu/~liub/

– Association rule mining, web/text mining

• UNC: Wei Wang

– http://www.cs.unc.edu/~weiwang/

– Biology data mining, frequent pattern mining

• Simon Fraser: Jian Pei

– http://www.cs.sfu.ca/~jpei/

– Sequential pattern mining, OLAP

• National University of Singapore: Anthony K.H. Tung

– http://www.comp.nus.edu.sg/~atung/

– Spatial data mining, Biology data mining

• …

Text Books

• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002

• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience

• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996

• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001

• J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006

• D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001

• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001

• T. M. Mitchell. Machine Learning. McGraw Hill, 1997

• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991

• P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005

• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998

• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005

• Weka: Data mining software in Java

– http://www.cs.waikato.ac.nz/%7Eml/weka/

• IlliniMine (Illinois Data Mining System)

– http://illimine.cs.uiuc.edu/

– Data Cubing

– Frequent Pattern Mining

– Sequential pattern mining

– Graph pattern Mining

– Classification

• Collected by Vipin Kumar:

– http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm


• KDD Conferences

– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)

– SIAM Data Mining Conf. (SDM)

– (IEEE) Int. Conf. on Data Mining (ICDM)

– Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)

– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

• Other related conferences

• Journals

– Data Mining and Knowledge Discovery (DAMI or DMKD)

– IEEE Trans. on Knowledge and Data Eng. (TKDE)

– KDD Explorations

– ACM Trans. on KDD

• KDnuggets

– http://www.kdnuggets.com/

• Tutorial: Machine Learning Techniques for Data Mining (WEKA) slides, Eibe Frank, University of Waikato

– http://books.elsevier.com/companions/1558605525?country=United


• Ideas for course projects in data mining

– Collected by Vipin Kumar

– http://www-users.cs.umn.edu/~kumar/dmbook/projects.htm

