Data Mining
Stat557/IST557

Jia Li
Department of Statistics
The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/~jiali

General Information
- Course homepage: http://www.stat.psu.edu/~jiali/stat557
- Prerequisite:
  - Elementary probability theory
  - Conditional distribution, expectation
  - C, Matlab, or S-plus programming
- Textbooks:
  - Required: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, and J. Friedman (ElemStatLearn).
  - Optional:
    1. Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone
    2. Pattern Recognition and Neural Networks by B. Ripley
    3. Principles of Data Mining by H. Mannila, P. Smyth and D. J. Hand
    4. Data Mining: Concepts and Techniques by J. Han and M. Kamber

What Is Data Mining?
Data mining: tools, methodologies, and theories for revealing patterns in data, a critical step in knowledge discovery.
Driving forces:
- Explosive growth of data in a great variety of fields
  - Cheaper storage devices with higher capacity
  - Faster communication
  - Better database management systems
- Rapidly increasing computing power
- Make data work for us

Research fields
- Statistics
- Machine learning
- Pattern recognition
- Signal processing
- Database

Applications
- Business
  - Wal-Mart data warehouse
  - Credit card companies
- Genomics
  - Human genome project: DNA sequences
  - Microarray data
- Information retrieval
  - Terabytes of data on the internet
  - Multimedia information
- Communication systems
  - Speech recognition
  - Image analysis
- Many other scientific fields

Problems Focused: Prediction
[Figure]

Terminology
Notation
- Input X: X is often multidimensional. Each dimension of X is denoted by X_j and is referred to as a feature, predictor, or independent variable.
- Output Y: response, dependent variable.
Categorization
- Supervised learning vs. unsupervised learning
  - Is Y available in the training data?
- Regression vs. classification
  - Is Y quantitative or qualitative?
A qualitative Y is also denoted by G ∈ 𝒢 = {1, 2, ..., K}.

Examples
Email spam (ElemStatLearn):
- Goal: predict whether an email is a junk email, i.e., "spam".
- Raw data: text email messages.
- Input X: relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message.
- Training data set: 4601 email messages with email type known (supervised learning).

Examples
Handwritten digit recognition (ElemStatLearn):
- Goal: identify single digits 0 to 9 based on images.
- Raw data: images that are scaled segments from five-digit ZIP codes.
  - 16 × 16 eight-bit grayscale maps
  - Pixel intensities range from 0 (black) to 255 (white).
- Input data: a 256-dimensional vector, or feature vectors with lower dimensions.

Examples
Image segmentation:
- Goal: segment images into regions of different types, e.g., man-made vs. natural in aerial images, graph and picture vs. text in document images.
- Raw data: grayscale images represented by matrices of size m × n, or color images represented by 3 such matrices.

[Figure] Aerial images. Left: original image of size 512 × 512 with pixel intensity ranging from 0 to 255. Right: hand-labeled classified image. White: man-made. Gray: natural.

- Input data:
  - Divide images into blocks of pixels or form a neighborhood around each pixel.
  - Compute statistics using pixel intensities in each block.
  - An image is converted to an array of input vectors.
- Methodologies:
  - Assume the feature vectors are independent.
  - Employ spatial models to capture dependence among the vectors.

Examples
Speech recognition:
- Goal: identify words spoken according to speech signals
  - Automatic voice recognition systems used by airline companies
  - Automatic stock price reporting
- Raw data: voice amplitude sampled at discrete time points (a time sequence).
- Input data: speech feature vectors computed at the sampling time.
- Methodology:
  - Estimate a hidden Markov model (HMM) for each word, e.g., State College, San Francisco, Pittsburgh.
  - For a new word, find the HMM that yields the maximum likelihood.
  - Identify the word as the one associated with that HMM.

Examples
DNA Expression Microarray:
- Goal: identify disease or tissue types
- Raw data: for each sample taken from a tissue of a particular disease type, the expression levels of a large collection of genes are measured.
- Input data: cleaned-up gene expression data
  - Normalization
  - Denoising
  - Ample literature on the topic of cleaning microarray data
- Example data set: 4026 genes, 96 samples taken from 9 classes of tissues.
- Challenges:
  - Very high-dimensional data
  - Very limited number of samples

Examples
DNA sequence classification:
- Goal: distinguish "junk" segments from coding segments.
- Raw data: sequences of letters, e.g., A, C, G, T for DNA sequences.
- Input data: likelihood ratio statistics computed from stochastic models.
- Supervised learning: estimate stochastic models, select models.

Supervised Learning
Two types of learning:
- Regression: the response Y is quantitative.
- Classification: the response Y is qualitative, or categorical.
Two aspects in learning:
- Fit the data well.
- Be robust.
Equivalent concepts:
- Training error vs. testing error
- Bias vs. variance
- Fitting vs. overfitting
- Empirical risk vs. model complexity (capacity)

Learning Spectrum
[Figure]

Regression
Overview:
- Linear models:
  - The mean response is a linear function of the independent variables.
- Generalized linear models
- Expand basis:
  - Splines (polynomials)
  - Reproducing kernel Hilbert spaces
  - Wavelet smoothing
  - Kernel methods

Classification: A Graphic View
[Figure]

Outline
- Linear regression
- Linear methods for classification
- Prototype methods
- Classification and regression trees (CART)
- Mixture discriminant analysis
- Hidden Markov models and their applications
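
Sketch: word recognition with one HMM per word
The speech recognition example describes the methodology as: estimate an HMM for each word, then label a new utterance with the word whose HMM yields the maximum likelihood. The Python sketch below illustrates only the second step. It is a minimal sketch under several assumptions that are not in the slides: observations are discrete symbols (0 or 1) rather than real speech feature vectors, the likelihood is computed with the standard scaled forward algorithm, and the parameters in WORD_MODELS are hand-picked toy values, not estimated ones. The names forward_log_likelihood, classify, WORD_MODELS, word_a, and word_b are hypothetical.

```python
import math


def forward_log_likelihood(obs, init, trans, emit):
    """Log-likelihood of a non-empty discrete observation sequence under an HMM.

    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability that state s emits symbol o
    Uses the forward algorithm, rescaling at each step to avoid underflow.
    """
    n_states = len(init)
    # Forward variable for the first observation.
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transition matrix, then emit obs[t].
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def classify(obs, models):
    """Return the label whose HMM gives the highest likelihood for obs."""
    return max(models, key=lambda word: forward_log_likelihood(obs, *models[word]))


# Toy models (assumed values): two hidden states, two output symbols.
SHARED_TRANS = [[0.8, 0.2], [0.2, 0.8]]
WORD_MODELS = {
    # (init, trans, emit); this model tends to emit symbol 0.
    "word_a": ([0.9, 0.1], SHARED_TRANS, [[0.9, 0.1], [0.3, 0.7]]),
    # This model tends to emit symbol 1.
    "word_b": ([0.1, 0.9], SHARED_TRANS, [[0.7, 0.3], [0.1, 0.9]]),
}

if __name__ == "__main__":
    for sequence in ([0, 0, 0, 0], [1, 1, 1, 1]):
        print(sequence, "->", classify(sequence, WORD_MODELS))
```

In practice the per-word parameters would be estimated from training utterances, which this sketch does not attempt, and the emission distributions would be defined over continuous feature vectors.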