CS584-Fall 2023-8-22-23

CS 584
Theory and Applications of Data Mining
George Mason University, Fall 2023
Instructor: Mahendra Panagoda, Ph.D., ASA
22 Aug 2023
Useful Course Information
• Check out the course outline posted on BB
• Important to note homework, mid-term and final
• This is an outline only – if items change, please be on the lookout for an e-mail or a posting.
• Use a variety of sources for learning
• Text that is commonly followed is Tan, Steinbach, Karpatne and
Kumar (2e), Introduction to Data Mining
• We will follow the general flow and outline of the text
01/23/2023
CS 584 Fall 2022 – George Mason University
2
Introduction
• Data is ever present, and the volumes grow
• Why do we have lots of data?
• Where do we get “Large-data” sets?
• E-commerce
• Traffic flow in major cities
• Cybersecurity
• Simulations
• Weather
Quick Quiz – no grade !
• You receive a file from IT named Data for E-commerce analysis. What
steps would you take in your approach to mine the data therein?
• A. Immediately open the file and run K-Means Algorithm
• B. Verify that this is indeed what you were expecting and it is from a trusted
source
• C. After proper precautions, have a look at the data
• D. If possible, spot any outliers or questionable input
• E. Delete any empty rows or columns and run the Neural Network Algorithm
that you recently developed
• F. Review your notes from prior large data project
Large data-sets
• Large data-sets → Voluminous and need some automation to analyze
• Major use of this data is in predictive analytics also known as
data mining
• Sometimes, these data sets are used in Machine Learning and
Artificial Intelligence efforts
• Think of Test/Train data
• Insights from the data can be used to:
• Improve productivity
• Protect development
• Test hypotheses
Formal Outline of Data Mining
• Nontrivial evidence
• Automation is required
• Isolated actionable insights
Data → Processing Step → Data Mining → Post Processing → Actionable Information
Formal Outline of Data Mining
• Data Mining has its origins in Statistical Analysis and Probability
• Modern applications involve ML and AI
• The large scale of the data requires:
• Pre-processing
• Mathematical descriptions and techniques
• Automation
• Interpretation of results
Some examples of Data Mining
• Classification:
• Creditworthiness of a customer
• Graduate school admissions
• Employment
• Finding Investments
• Medical applications
• Fraud detection and many more…
Regression
• A very useful statistical technique
• Old but still in use
• Easy to understand
• Basic idea is to predict a value of interest based on other
variables
• Two models - can be linear or non-linear
• Various statistical methods are incorporated to enhance the
model predictability
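As a minimal sketch of the basic idea, the following fits a straight line y ≈ a + b·x by ordinary least squares in plain Python and uses it to predict a new value. The data and the names `fit_line` and `predict` are invented for illustration and do not come from the course material.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predict the value of interest for a new x."""
    return intercept + slope * x


# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

a, b = fit_line(hours, scores)
print("intercept:", a, "slope:", b)
print("predicted score for 6 hours:", predict(a, b, 6))
```

A non-linear model would replace the straight line with another functional form; the fitting idea (minimize the squared prediction error) stays the same.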
Clustering Analysis
• Market studies based on geographical and other lifestyle
attributes
• Text / Document searches, especially in legal and investment contexts
• Detection of anomalies in medical records
What is large data?
• Very basically, we can say it is information on something we
are interested in
• Technically, we can define data as objects with certain
attributes
• Attributes are broadly identified as certain data characteristics
• Attributes can be discrete, for example zip codes OR
continuous, for example height of a customer
Types of Data (by attributes)
• Nominal (labels or identifiers, even when written as numbers). Example: SSN, G #, etc.
• Ordinal (scaled values / preferences). Example: Reviews, grades
• Interval (range of values). Example: measurements, scales
• Ratio (dimensionless usually). Example: Average time to complete in $Y_1$ and $Y_2$
Types of Data (by attributes)
• The above data types have their own limitations and
restrictions.
• Examples:
• For nominal data, one can use statistical tests such as the χ² (chi-square) test and contingency correlation
• For ordinal data, an order-preserving transformation has to be used
• For ratio types, the geometric and harmonic means are more meaningful
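To illustrate the last point, here is a small sketch using Python's standard `statistics` module (Python 3.8 or later). The growth factors and speeds are hypothetical values chosen only to show where the geometric and harmonic means are the natural choice for ratio-type data.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical yearly growth factors (ratios): +10%, +20%, -10%
growth_factors = [1.10, 1.20, 0.90]

# Hypothetical speeds (km/h) over two legs of equal distance
speeds_kmh = [40, 60]

# The geometric mean gives the constant yearly factor with the same total growth
avg_growth = geometric_mean(growth_factors)

# The harmonic mean gives the true average speed over equal distances
avg_speed = harmonic_mean(speeds_kmh)

print("arithmetic mean of growth factors:", mean(growth_factors))
print("geometric mean of growth factors: ", avg_growth)
print("arithmetic mean of speeds:", mean(speeds_kmh))
print("harmonic mean of speeds:  ", avg_speed)
```

Applying `avg_growth` three times reproduces the product of the three factors, which the arithmetic mean does not.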
How do we approach data?
• First and foremost, the data need to be looked at – this is called EDA or Exploratory Data Analysis
• What is it? Think of it as a curious look at the data
• No need for heavy algorithms – just look
• What is the benefit?
• Data issues
• What type of algorithms will work
• Data reproducibility
• What type of insights (granularity)
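A "just look" pass can be as simple as counting missing values and checking the range of a column. The sketch below assumes a tiny, made-up list of records and a hypothetical helper named `summarize`; it is one possible starting point, not a prescribed EDA procedure.

```python
from statistics import mean, median

# Hypothetical records; None marks a missing value
rows = [
    {"customer": "A", "amount": 25.0},
    {"customer": "B", "amount": 31.5},
    {"customer": "C", "amount": None},
    {"customer": "D", "amount": 28.0},
    {"customer": "E", "amount": 950.0},  # suspiciously large: worth a closer look
]


def summarize(rows, column):
    """Basic summary of one numeric column, ignoring missing entries."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(rows),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


summary = summarize(rows, "amount")
for name, value in summary.items():
    print(f"{name}: {value}")
```

A large gap between the mean and the median, as in this example, is a quick hint that an outlier or a data-entry problem may be present.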
Some Characteristics of Data
• Dimensionality:
• Think of this as how many attributes the data point has
• High dimensional data is difficult to manage
• Dimensionality reduction must be considered
• Sparsity:
• Lack of information
• Usually matrix or algebraic manipulations do NOT work
• Need to look into why the data is sparse
• Resolution:
• Patterns depend on the scale
Data Types
• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• WWW
• Chemical structure (molecules)
• Ordered
• Spatial
• Temporal
• Sequential
Record Data
• Record Data is a collection of records with a fixed set of
attributes
• Example:
G# | Major | GPA | Graduated (Y/N)
:  | :     | :   | :
• Data Matrix is the representation of data by an m x n matrix (m records, n attributes)
• Example: The above data record can be thought of as a
“matrix”
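A minimal sketch of this idea: each record becomes one row and each attribute one column. The student records below are invented (including the G# values), and the variable names are only illustrative.

```python
# Hypothetical student records with a fixed set of attributes
records = [
    {"G#": "G001", "Major": "CS", "GPA": 3.6, "Graduated": "Y"},
    {"G#": "G002", "Major": "STAT", "GPA": 3.2, "Graduated": "N"},
    {"G#": "G003", "Major": "CS", "GPA": 3.9, "Graduated": "Y"},
]

# Fix the column order so every row lines up the same way
columns = ["G#", "Major", "GPA", "Graduated"]

# One row per record, one column per attribute
matrix = [[rec[col] for col in columns] for rec in records]

m, n = len(matrix), len(matrix[0])
print(f"{m} x {n} matrix")
for row in matrix:
    print(row)
```

For numerical work, non-numeric attributes such as Major would first need to be encoded as numbers; that step is left out here.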
Document Data
• Document Data is a frequency tabulation of certain words of
interest
• This has applications in Investments and in trademark laws, for
example
• Transaction data is a representation of commercial activity by
customer or some unique identifiers
• For example, think of purchases at a mall
• Customer 1 – food, tea, chocolates
• Customer 2 – holiday gifts, coffee
•…
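Both representations can be sketched with the standard library. The document texts and the list of words of interest below are made up; the two customers reuse the mall example above. The helper name `term_frequencies` is hypothetical.

```python
from collections import Counter

# Hypothetical documents and words of interest
documents = {
    "doc1": "data mining finds patterns in data",
    "doc2": "mining companies report data",
}
words_of_interest = ["data", "mining", "patterns"]


def term_frequencies(text, vocabulary):
    """Count how often each word of interest occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Document data: one frequency vector per document
doc_term = {name: term_frequencies(text, words_of_interest)
            for name, text in documents.items()}
print(doc_term)

# Transaction data: one set of purchased items per customer
transactions = {
    "Customer 1": {"food", "tea", "chocolates"},
    "Customer 2": {"holiday gifts", "coffee"},
}
item_counts = Counter(item for basket in transactions.values() for item in basket)
print(item_counts)
```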
Graph Data
• Data that shows a connection to other relevant data points
• Examples of this can be found in chemistry and
drug/vaccination development
• Can be used to analyze a social network
• Active research area for ML/AI
• Graph Theoretic approaches can be used
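As a small sketch of graph data, a social network can be stored as an adjacency list, and a simple graph-theoretic quantity such as the degree (number of connections) can be read off directly. The names and friendships below are invented for illustration.

```python
# Hypothetical friendship network, given as undirected edges
edges = [("Ana", "Ben"), ("Ana", "Cy"), ("Ben", "Cy"), ("Cy", "Dee")]

# Build an adjacency list: each node maps to the set of its neighbours
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

# Degree of a node = number of neighbours
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
most_connected = max(degree, key=degree.get)

print(degree)
print("most connected:", most_connected)
```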
Quality of Data
• Important in Data driven learning
• Need to address the quality as well as the process
• Robust method of detecting issues with data
• Cleaning the data and making it usable
• Sparse data makes predictions less accurate
Quality of Data
• “Bad” data can arise from:
• Noise and possible outliers
• Wrong measurements at point of origin
• Fake data
• Data with missing components
• Duplicate entries
• Process shortcomings resulting in bad outcomes
Quality of Data
• Handling missing data is an important area of research
• Why do we “miss” values?
• Data not collected, but available
• Corrupt data
• Data storage/transform issues
• Data not possible to capture
• Several methods exist to treat the missing data in a systematic
way
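Two of the simplest treatments are dropping incomplete values and filling them with the mean of the observed values. The sketch below assumes missing entries are marked with `None`; the GPA values and the function names `drop_missing` and `impute_mean` are hypothetical, and these are only two of many possible methods.

```python
from statistics import mean

# Hypothetical GPA column; None marks a missing value
gpas = [3.6, None, 3.2, 3.9, None]


def drop_missing(values):
    """Treatment 1: keep only the observed values."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Treatment 2: replace each missing value with the mean of the observed ones."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


print(drop_missing(gpas))
print(impute_mean(gpas))
```

Dropping values shrinks the data set, while mean imputation keeps its size but reduces the variability of the column; which trade-off is acceptable depends on why the values are missing.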
Measurements of Data
• We need to establish a process and a standard to compare
data sets
• One obvious way is to compare each point and check how
similar or different it is from a reference point
• This gives the opportunity to use the “distance” between
points
• General case of the Euclidean Distance
What is distance in large data sets?
• Can we define the distance between two data points in a large data set in a consistent way?
• Yes – we can
• How – think basics
• Look at the distance measure we have in 3-D
• This forms the basis of the idea of distance between two data points in a large data set
• This definition is consistent!
General Distance Formula
$$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^p \right)^{1/p}$$
where $x$ and $y$ are vectors such that
$x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$.
When $p = 2$, we get the usual Euclidean Distance
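The general distance formula above translates almost line by line into code. This is a minimal sketch for finite p ≥ 1; the function name `minkowski_distance` and the sample points are assumptions made for illustration.

```python
def minkowski_distance(x, y, p=2):
    """General distance: (sum over k of |x_k - y_k|^p) ^ (1/p), for finite p >= 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of components")
    if p < 1:
        raise ValueError("p must be at least 1")
    total = sum(abs(a - b) ** p for a, b in zip(x, y))
    return total ** (1 / p)


# Hypothetical data points with n = 3 attributes
x = (1, 2, 3)
y = (4, 6, 3)

print("p = 2 (Euclidean):", minkowski_distance(x, y, p=2))
print("p = 1:            ", minkowski_distance(x, y, p=1))
```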
Distance computation
• The above reduces to the Pythagorean Theorem in the special 3-dimensional case (with p = 2)
• Think of this scenario to get a good visual idea of the distance in data sets
• Are there any other distance formulae?
• Example with Euclidean distance in class
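A short worked example may help here. The two points are chosen arbitrarily for illustration and are not the example used in class; the computation is the general distance formula with p = 2 and n = 3.

```latex
\begin{aligned}
x &= (2, 1, 0), \qquad y = (4, 4, 6) \\
d(x, y) &= \sqrt{(2 - 4)^2 + (1 - 4)^2 + (0 - 6)^2} \\
        &= \sqrt{4 + 9 + 36} \\
        &= \sqrt{49} = 7
\end{aligned}
```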
Similarity and Dissimilarity
• Similarity:
• A measure of how “similar” or “alike” two data points are
• Numerical value
• Higher value for similar objects
• Often in the range [0,1]
• Dissimilarity:
• Measures how different two data points are
• Lower value when objects/points are alike or similar
• No upper limit, lower limit of 0
Similarity and Dissimilarity
• For a nominal data attribute:
• Similarity = $\begin{cases} 1 & \text{if } x = y \\ 0 & \text{if } x \neq y \end{cases}$
• Dissimilarity = $\begin{cases} 0 & \text{if } x = y \\ 1 & \text{if } x \neq y \end{cases}$
• Proximity is defined as how similar OR dissimilar data objects
are
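The single-attribute definitions above are easy to code. The sketch also shows one way to combine them over several attributes, by averaging the per-attribute similarities; that averaging step is an assumption added here for illustration and is not stated in the slides. All names and records are hypothetical.

```python
def nominal_similarity(x, y):
    """1 if the two attribute values are equal, 0 otherwise."""
    return 1 if x == y else 0


def nominal_dissimilarity(x, y):
    """0 if the two attribute values are equal, 1 otherwise."""
    return 1 - nominal_similarity(x, y)


def average_similarity(record_a, record_b):
    """Assumed extension: fraction of attributes on which two records agree."""
    matches = sum(nominal_similarity(a, b) for a, b in zip(record_a, record_b))
    return matches / len(record_a)


# Hypothetical records: (major, state, graduated)
student_1 = ("CS", "VA", "Y")
student_2 = ("CS", "MD", "Y")

print(nominal_similarity("CS", "CS"))
print(nominal_dissimilarity("VA", "MD"))
print(average_similarity(student_1, student_2))
```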
How do we measure distance?
• When computing the Euclidean distance, we summed squared differences, which is called “squared distance”
• This is very common and is also intuitive
• But we can also use other, similar measures
• When do we use these other measures ?
• Situational
• Algorithm
• Different Error Estimates
Mahalanobis Distance
• A measure involving the distribution of the data and its variance-covariance matrix.
• Complicated to compute by hand
• Automation is needed
• Useful for detecting outliers
• Strong connection to PCA
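For reference, the usual definition is d(x, μ) = sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)), where μ is the mean vector and Σ the variance-covariance matrix of the data. The sketch below restricts itself to two attributes so that the 2 x 2 matrix inverse can be written out by hand; the data points and function names are hypothetical, and a real project would normally use a linear algebra library instead.

```python
import math


def mean_vector(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def covariance_2d(points):
    """Sample variance-covariance matrix (n - 1 denominator) of 2-D points."""
    n = len(points)
    mx, my = mean_vector(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def mahalanobis_2d(point, points):
    """Mahalanobis distance of `point` from the mean of `points`."""
    mx, my = mean_vector(points)
    (a, b), (c, d) = covariance_2d(points)
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # Inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]]
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Hypothetical data in which the two attributes rise together
data = [(1, 1), (2, 2), (3, 3), (4, 5), (5, 4)]

# Both test points are equally far from the mean (3, 3) in the Euclidean sense
along_trend = mahalanobis_2d((5, 5), data)
against_trend = mahalanobis_2d((5, 1), data)
print("along the trend:  ", along_trend)
print("against the trend:", against_trend)
```

The point that goes against the trend of the data gets a much larger distance, which is why the measure is useful for flagging outliers that plain Euclidean distance would miss.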
What is the connection to sequences ?
• Data Sets can be closely approximated by a sequence
• Properties of sequences can be used in extracting information
• Data points can be compared to a sequence
• Does it need to be infinite ?
• What is the measure ?
Vector Norms
• When the value of p=1, we get the taxicab or the Manhattan
distance
• When p → ∞, we get the supremum or the maximum distance between two data objects.
• It is measured as the maximum absolute difference between corresponding components of the data points.
Vector Norms
• The above distance formulae are all cases of what we call a “vector norm”, applied to the difference x − y
• The case p=1 is the 1-Norm
• When p=2, we have the 2-Norm
• For p → ∞, we have the Sup or Max Norm
• Generally, it is denoted by $\| \cdot \|_p$
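The three norms can be compared side by side on the same difference vector. In this sketch the function names `p_norm` and `max_norm` and the sample points are assumptions for illustration; the last line shows numerically that a large p already comes close to the Max Norm.

```python
def p_norm(v, p):
    """p-norm of a vector for finite p >= 1."""
    return sum(abs(c) ** p for c in v) ** (1 / p)


def max_norm(v):
    """Sup or Max Norm: the largest absolute component."""
    return max(abs(c) for c in v)


# Hypothetical data points and their difference vector
x = (1, 5, 2)
y = (4, 1, 2)
diff = [a - b for a, b in zip(x, y)]  # [-3, 4, 0]

print("1-Norm (Manhattan):", p_norm(diff, 1))
print("2-Norm (Euclidean):", p_norm(diff, 2))
print("Max Norm:          ", max_norm(diff))
print("p = 50:            ", p_norm(diff, 50))
```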