CS 584 Theory and Applications of Data Mining
George Mason University, Fall 2023
Instructor: Mahendra Panagoda, Ph.D., ASA
22 Aug 2023

Useful Course Information
• Check out the course outline posted on BB
• Important to note: homework, mid-term and final
• This is an outline only – if items change, please be on the lookout for an e-mail or a posting.
• Use a variety of sources for learning
• The text that is commonly followed is Tan, Steinbach, Karpatne and Kumar (2e), Introduction to Data Mining
• We will follow the general flow and outline of the text

01/23/2023 CS 584 Fall 2022 – George Mason University

Introduction
• Data is ever present, and the volumes grow
• Why do we have lots of data?
• Where do we get “large data” sets?
  • E-commerce
  • Traffic flow in major cities
  • Cybersecurity
  • Simulations
  • Weather

Quick Quiz – no grade!
• You receive a file from IT named Data for E-commerce analysis. What steps would you take in your approach to mine the data therein?
• A. Immediately open the file and run the K-Means algorithm
• B. Verify that this is indeed what you were expecting and that it is from a trusted source
• C. After proper precautions, have a look at the data
• D. If possible, spot any outliers or questionable input
• E. Delete any empty rows or columns and run the neural network algorithm that you recently developed
• F.
Review your notes from a prior large data project

Large data-sets
• Large data-sets → voluminous and need some automation to analyze
• A major use of this data is in predictive analytics, also known as data mining
• Sometimes, these data sets are used in Machine Learning and Artificial Intelligence efforts
• Think of Test/Train data
• Insights from the data can be used to:
  • Improve productivity
  • Protect development
  • Test hypotheses

Formal Outline of Data Mining
• Nontrivial evidence
• Automation is required
• Isolated actionable insights
Figure: Data → Processing Step → Data Mining → Post Processing → Actionable Information

Formal Outline of Data Mining
• Data Mining has its origins in Statistical Analysis and Probability
• Modern applications involve ML and AI
• The large scale of the data requires:
  • Pre-processing
  • Mathematical descriptions and techniques
  • Automation
  • Interpretation of results

Some examples of Data Mining
• Classification:
  • Creditworthiness of a customer
  • Graduate school admissions
  • Employment
  • Finding investments
  • Medical applications
  • Fraud detection and many more…

Regression
• A very useful statistical technique
• Old but still in use
• Easy to understand
• The basic idea is to predict a value of interest based on other variables
• Two models: can be linear or non-linear
• Various statistical methods are incorporated to enhance the model's predictive ability

Clustering Analysis
• Market studies based on geographical and other lifestyle attributes
• Text / document searches, especially in legal and investment contexts
• Medical records: detection of anomalies

What is large
data?
• Very basically, we can say it is information on something we are interested in
• Technically, we can define data as objects with certain attributes
• Attributes are broadly identified as certain data characteristics
• Attributes can be discrete (for example, zip codes) or continuous (for example, the height of a customer)

Types of Data (by attributes)
• Nominal (values used only as labels, even when numeric). Example: SSN, G#, etc.
• Ordinal (scaled values / preferences). Example: reviews, grades
• Interval (range of values). Example: measurements, scales
• Ratio (usually dimensionless). Example: average time to complete under two conditions

Types of Data (by attributes)
• The above data types have their own limitations and restrictions.
• Examples:
  • For nominal data, one can use statistical tests such as $\chi^2$ and correlation
  • For ordinal data, an order-preserving transformation has to be used
  • For ratio types, the geometric and harmonic means are more meaningful

How do we approach data?
• First and foremost, the data need to be looked at – this is called EDA, or Exploratory Data Analysis
• What is it – think of it as a curious look at the data
• No need for heavy algorithms – just look
• What is the benefit?
  • Data issues
  • What type of algorithms will work
  • Data reproducibility
  • What type of insights (granularity)

Some Characteristics of Data
• Dimensionality:
  • Think of this as how many attributes each data point has
  • High-dimensional data is difficult to manage
  • Dimensionality reduction must be considered
• Sparsity:
  • Lack of information
  • Usually matrix or algebraic manipulations do NOT work
  • Need to look into why the data is sparse
• Resolution:
  • Patterns depend on the scale

Data Types
• Record
  • Data Matrix
  • Document Data
  • Transaction Data
• Graph
  • WWW
  • Chemical structure (molecules)
• Ordered
  • Spatial
  • Temporal
  • Sequential

Record Data
• Record Data is a collection of records with a fixed set of attributes
• Example:
    G#    Major    GPA    Graduated (Y/N)
    :     :        :      :
• A Data Matrix is the representation of data by an m × n matrix
• Example: the above data record can be thought of as a “matrix”

Document Data
• Document Data is a frequency tabulation of certain words of interest
• This has applications in investments and in trademark law, for example
• Transaction data is a representation of commercial activity by customer or some other unique identifier
• For example, think of purchases at a mall
  • Customer 1 – food, tea, chocolates
  • Customer 2 – holiday gifts, coffee
  • …

Graph Data
• Data that shows a connection to other relevant data points
• Examples of this can be found in chemistry and drug/vaccine development
• Can be used to analyze a social network
• Active research area for ML/AI
• Graph-theoretic approaches can be used

Quality of Data
• Important in data-driven learning
• Need to address the quality as well as the process
• Robust methods are needed for detecting issues with data
• Cleaning the data and making it usable
• Sparse data makes predictions less accurate

Quality of Data
• “Bad” data can arise from:
  • Noise and possible outliers
  • Wrong measurements at the point of origin
  • Fake data
  • Data with missing components
  • Duplicate entries
  • Process shortcomings resulting in bad outcomes

Quality of Data
• Handling missing data is an important area of research
• Why do we “miss” values?
  • Data not collected, but available
  • Corrupt data
  • Data storage/transform issues
  • Data not possible to capture
• Several methods exist to treat the missing data in a systematic way

Measurements of Data
• We need to establish a process and a standard to compare data sets
• One obvious way is to compare each point and check how similar or different it is from a reference point
• This gives the opportunity to use the “distance” between points
• General case of the Euclidean distance

What is distance in large data sets?
• Can we define the distance between two data points in a large data set in a consistent way?
• Yes – we can
• How – think basics
• Look at the distance measure we have in 3-D
• This forms the basis of the idea of distance between two data points in a large data set
• This definition is consistent!
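
The step from the familiar 3-D distance to a distance between two data points with many attributes can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `euclidean_distance` and the sample points are hypothetical and are not part of the course material.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    # Sum the squared differences attribute by attribute, then take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# 3-D case: the usual Pythagorean distance between (0, 0, 0) and (1, 2, 2).
d3 = euclidean_distance((0, 0, 0), (1, 2, 2))

# Same function, five attributes per data point (hypothetical values).
d5 = euclidean_distance((1, 0, 2, 4, 3), (2, 0, 2, 0, 1))

print(d3, d5)
```

The same function handles 3 attributes or 300, which is the sense in which the definition carries over consistently. In Python 3.8 and later, the standard library's `math.dist` computes the same quantity.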
General Distance Formula
$$ d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} $$
where $x$ and $y$ are vectors such that $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$.
When $p = 2$, we get the usual Euclidean distance.

Distance computation
• With $p = 2$, the above reduces to the Pythagorean Theorem in the special case of 3 dimensions
• Think of this scenario to get a good visual idea of the distance in data sets
• Are there any other distance formulae?
• Example with Euclidean distance in class

Similarity and Dissimilarity
• Similarity:
  • A measure of how “similar” or “alike” two data points are
  • Numerical value
  • Higher value for similar objects
  • Range of [0,1]
• Dissimilarity:
  • Measures how different two data points are
  • Lower value when objects/points are alike or similar
  • No upper limit, lower limit of 0

Similarity and Dissimilarity
• For a nominal data attribute:
$$ \text{Similarity} = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{if } x \neq y \end{cases} \qquad \text{Dissimilarity} = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{if } x \neq y \end{cases} $$
• Proximity is defined as how similar OR dissimilar data objects are

How do we measure distance?
• When computing the Euclidean distance we used what is called “squared distance”
• This is very common and is also intuitive
• But we can use other, similar measures
• When do we use these other measures?
  • Situational
  • Algorithm
  • Different error estimates

Mahalanobis Distance
• A measure involving the distribution of the data and its variance-covariance matrix
• Complicated to compute by hand
• Automation is needed
• Helps detect any outliers present
• Strong connection to PCA

What is the connection to sequences?
• Data sets can be closely approximated by a sequence
• Properties of sequences can be used in extracting information
• Data points can be compared to a sequence
• Does it need to be infinite?
• What is the measure?

Vector Norms
• When $p = 1$, we get the taxicab or Manhattan distance
• When $p \to \infty$, we get the supremum or maximum distance between two data objects
• It is measured as the maximum absolute difference between corresponding components of the data points

Vector Norms
• The above distance formulae are all cases of what we call a “vector norm”, applied to the difference $x - y$
• The case $p = 1$ is the 1-Norm
• When $p = 2$, we have the 2-Norm
• For $p \to \infty$, we have the Sup or Max Norm
• Generally, it is denoted by $\|\cdot\|_p$
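
The three named norms can be compared side by side on a single pair of points. The sketch below is illustrative only: `minkowski_distance` is a hypothetical helper name, the points are made up, and the case $p \to \infty$ is handled by passing `math.inf` and taking the largest componentwise difference directly.

```python
import math


def minkowski_distance(x, y, p):
    """Distance between x and y under the p-norm, for p >= 1 or p = math.inf."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # Limit p -> infinity: the largest componentwise difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)


x = (0, 0)  # hypothetical data points
y = (3, 4)

manhattan = minkowski_distance(x, y, 1)        # 1-norm: 3 + 4
euclidean = minkowski_distance(x, y, 2)        # 2-norm: sqrt(9 + 16)
supremum = minkowski_distance(x, y, math.inf)  # max norm: max(3, 4)

print(manhattan, euclidean, supremum)
```

For this pair of points the distance shrinks as p grows (7, then 5, then 4), which is a quick way to see that the choice of norm changes which points count as "close".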