Data Mining Stat557/IST557 - Penn State Department of Statistics

advertisement
Data Mining Stat557/IST557
Data Mining
Stat557/IST557
Jia Li
Department of Statistics
The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/∼jiali
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
General Information
I
I
Course homepage:
http://www.stat.psu.edu/˜jiali/stat557
Prerequisite:
I
I
I
Jia Li
Elementary probability theory
Conditional distribution, expectation
C, Matlab, or S-plus programming
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
I
Text books:
I
I
Required: The Elements of Statistical Learning, by T. Hastie,
R. Tibshirani, and J. Friedman
(ElemStatLearn).
Optional:
1. Classification and Regression Trees by L. Breiman, J. H.
Friedman, R. A. Olshen, and C. J. Stone
2. Pattern Recognition and Neural Networks by B. Ripley
3. Principles of Data Mining by H. Mannila, P. Smyth and D. J.
Hand
4. Data Mining: Concepts and Techniques by J. Han and M.
Kamber
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
What Is Data Mining?
Data mining: tools, methodologies, and theories for revealing
patterns in data—a critical step in knowledge discovery.
Driving forces:
I Explosive growth of data in a great variety of fields
I
I
I
Jia Li
Cheaper storage devices with higher capacity
Faster communication
Better database manage systems
I
Rapidly increasing computing power
I
Make data to work for us
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Research fields
Jia Li
I
Statistics
I
Machine learning
I
Pattern recognition
I
Signal processing
I
Database
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Applications
I
Business
I
I
I
Genomics
I
I
I
I
I
Jia Li
Terrabytes of data on the internet
Multimedia information
Communication systems
I
I
Human genome project: DNA sequences
Microarray data
Information retrieval
I
I
Wal-Mart data warehouse
Credit card companies
Speech recognition
Image analysis
Many other scientific fields
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Problems Focused: Prediction
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Terminology
Notation
I
Input X : X is often multidimensional. Each dimension of X is
denoted by Xj and is referred to as a feature, predictor, or
independent variable/variable.
I
Output Y : response, dependent variable.
Categorization
I Supervised learning vs. unsupervised learning
I
I
Is Y available in the training data?
Regression vs. Classification
I
I
Is Y quantitative or qualitative?
For qualitative Y , it is also denoted by
G ∈ G = {1, 2, ..., K }.
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Examples
Email spam: (ElemStatLearn)
Jia Li
I
Goal: predict whether an email is a junk email, i.e., “spam”.
I
Raw data: text email messages.
I
Input X : relative frequencies of 57 of the most commonly
occurring words and punctuation marks in the email message.
I
Training data set: 4601 email messages with email type
known (supervised learning).
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Examples
Handwritten digit recognition:(ElemStatLearn)
I
I
Goal: identify single digits 0 ∼ 9 based on images.
Raw data: images that are scaled segments from five digit
ZIP codes.
I
I
I
Jia Li
16 × 16 eight-bit grayscale maps
Pixel intensities range from 0 (black) to 255 (white).
Input data: a 256 dimension vector, or feature vectors with
lower dimensions.
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Examples
Image segmentation:
Jia Li
I
Goal: segment images into regions of different types, e.g.,
man-made vs. natural in aerial images, graph and picture vs.
text in document images.
I
Raw data: grayscale images represented by matrices of size
m × n, or color images represented by 3 such matrices.
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Aerial images. Left: Original image of size 512 × 512 with pixel intensity
ranging from 0 to 255, Right: Hand-labeled classified images. White:
man-made, Gray: natural.
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
I
Input data:
I
I
I
I
Methodologies:
I
I
Jia Li
Divide images into blocks of pixels or form a neighborhood
around each pixel.
Compute statistics using pixel intensities in each block.
An image is converted to an array of input vectors.
Assume the feature vectors are independent.
Employ spatial models to capture dependence among the
vectors.
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Examples
Speech recognition:
I
Goal: identify words spoken according to speech signals
I
I
I
Jia Li
Automatic voice recognition systems used by airline companies
Automatic stock price reporting
Raw data: voice amplitude sampled at discrete time spots (a
time sequence).
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
I
I
Input data: speech feature vectors computed at the sampling
time.
Methodology:
I
I
I
Jia Li
Estimate an Hidden Markov Model (HMM) for each word,
e.g., State College, San Francisco,
Pittsburgh.
For a new word, find the HMM that yields the maximum
likelihood.
Identify the word as the one associated with the HMM.
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Examples
DNA Expression Microarray:
I
I
I
Goal: identify disease or tissue types
Raw data: for each sample taken from a tissue of a particular
disease type, the expression levels of a large collection of
genes are measured.
Input data: cleaned-up gene expression data
I
I
I
I
I
Example data set: 4026 genes, 96 samples taken from 9
classes of tissues.
Challenges:
I
I
Jia Li
Normalization
Denoising.
Ample literature on the topic of cleaning microarray data
very high dimensional data
very limited number of samples
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Examples
DNA sequence classification:
Jia Li
I
Goal: distinguish “junk” segments from coding segments.
I
Raw data: sequences of letters, e.g., A,C,G,T for DNA
sequences.
I
Input data: likelihood ratio statistics computed from
stochastic models.
I
Supervised learning: estimate stochastic models, select
models.
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Supervised Learning
Two types of learning:
I
Regression: the response Y is quantitative.
I
Classification: the response Y is qualitative, or categorical.
Two aspects in learning:
I
Fit the data well.
I
Robust
Equivalent concepts:
Jia Li
I
Training error vs. testing error
I
Bias vs. variance
I
Fitting vs. overfitting
I
Empirical risk vs. model complexity (capacity)
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Learning Spectrum
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Regression
Overview:
I Linear models:
I
I
I
Generalized linear models
Expand basis:
I
I
I
I
Jia Li
The mean response is a linear function of the independent
variables.
Splines (polynomials)
Reproducing Kernel Hilbert Spaces
Wavelet smoothing
Kernel methods
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Classification: A graphic View
Jia Li
http://www.stat.psu.edu/∼jiali
Data Mining Stat557/IST557
Outlines
Jia Li
I
Linear regression
I
Linear methods for classification
I
Prototype methods
I
Classification and regression tree (CART)
I
Mixture discriminant analysis
I
Hidden Markov models and its applications
http://www.stat.psu.edu/∼jiali
Download