Lecture 1 – Introduction to Data Analytics and Descriptive Analytics
Corporate Gurukul – Data Analytics using Deep Learning, June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35

Learning Objectives
At the end of this lecture, you should understand:
- What is data analytics?
- What is descriptive analytics?
- Exploratory data analysis
- Descriptive statistical measures: measures of location, measures of dispersion, measures of shape, and measures of association.

What is Data Analytics?
Data analytics:
- Uses data, computing technology, statistical analysis, quantitative methods, and mathematical or computer-based models.
- Aims to gain improved insights about business operations and to make better, fact-based decisions.
Scope of data analytics:
- Descriptive analytics – Uses data to understand past and present performance.
- Predictive analytics – Uses data to predict future performance.
- Prescriptive analytics – Improves upon what we know from descriptive and predictive analytics.

What is Data Analytics? (cont.)
Hypothetical example of data analytics in a typical business scenario – retail markdown decisions:
- Most department stores clear seasonal inventory by reducing prices. The question is: when to reduce the price, and by how much?
- Descriptive analytics – Examine historical data for similar products (prices, units sold, advertising, ...).
- Predictive analytics – Predict sales based on price.
- Prescriptive analytics – Find the best combination of pricing and advertising to maximise sales revenue.

What is Data Analytics? (cont.)
Real-world example of data analytics – Harrah's Entertainment, which owns numerous hotels and casinos:
- Uses predictive analytics to forecast demand for rooms and to segment customers by gaming activities.
- Uses prescriptive analytics to set room rates, allocate rooms, and offer perks and rewards to customers.

What is Data Mining?
Data mining:
- Refers to the process of finding anomalies, patterns and correlations within large data sets to predict outcomes.
- The most prevalent form of predictive analytics in use in the modern business environment.
Data in data mining:
- A collection of facts usually obtained as the result of experiences, observations or experiments.
- May consist of numbers, words and images.
- The lowest level of abstraction (from which information and knowledge are derived).

What is Data Mining? (cont.)
[Figure: taxonomy of data – data is divided into categorical (nominal, ordinal) and numerical (interval, ratio) types.]

Categorical Data
Categorical data:
- Represent the labels of multiple classes used to divide a variable into specific groups.
- Examples: race, sex, age group and educational level.
- Certain variables such as age group and educational level may be represented in numerical format using their exact values, but it is often more informative to categorise them into a small number of ordered classes, e.g.:

  Age Group                   Age Range
  Primary school students     7–12
  Secondary school students   13–16
  Pre-tertiary students       17–19
  Tertiary students           20 and above

Categorical Data (cont.)
Categorical data are discrete in nature since there are a finite number of values with no continuum between them.
Nominal data:
- Categorical variables without natural ordering, e.g., marital status can be categorised into (1) single, (2) married and (3) divorced.
- Nominal data can be represented as either:
  - Binomial values with two possible values, e.g., yes/no or true/false.
  - Multinomial values with three or more possible values, e.g., marital status or ethnicity (white/black/Latino/Asian).

Categorical Data (cont.)
Ordinal data:
- Categorical variables that lend themselves to natural ordering, i.e., the codes assigned to objects represent the rank order among them. However, it makes no sense to calculate differences or ratios between the values.
- Examples:
  - Credit score can be categorised as (1) low, (2) medium and (3) high.
  - Education level can be categorised as (1) high school, (2) college and (3) graduate school.
  - Age group.
- The additional rank-order information is useful in certain data mining algorithms for building a better model.

Numerical Data
Numerical data:
- Represent the numerical values of specific variables.
- Continuous in nature since the variable contains continuous measures on a specific scale that allows the insertion of interim values.
- Examples: age, number of children, total household income (in SGD), travel distance (in kilometres) and temperature (in Celsius).
- Numerical values can be integer (whole numbers only) or real (including fractional numbers).
- A discrete variable represents finite, countable data; a continuous variable represents scalable measures and may contain an infinite number of fractional values.

Numerical Data (cont.)
Interval data:
- Numerical variables that are measured on interval scales.
- Example: one degree on the Celsius scale is 1/100 of the difference between the melting temperature and the boiling temperature of water. There is no notion of an absolute zero value.
Ratio data:
- Numerical variables that are measured on ratio scales; a ratio scale has a non-arbitrary zero value defined.
- Examples: common measurements in physical science and engineering such as mass, length and time. The Kelvin temperature scale has a non-arbitrary zero value of absolute zero, which is equal to -273.15 degrees Celsius.

Other Data Types
Data mining may involve many other data types:
- Unstructured text, image and audio.
- These data types need to be converted into some form of categorical or numerical representation before they can be processed by data mining algorithms.
Data may also be classified as:
- Static.
- Dynamic, i.e., temporal or time series.

Data Type and Data Mining
Some data mining methods are particular about the data types that they can handle. Providing incompatible data types may lead to:
- Incorrect models.
- Halting of the model development process.
Examples:
Neural networks, support vector machines and logistic regression require numerical variables:
- Categorical variables are converted into numerical representations using 1-of-N pseudo variables.
- Both nominal and ordinal variables with N unique values can be transformed into N pseudo variables, each with a binary value of 1 or 0.

Data Type and Data Mining (cont.)
ID3 (a classic decision tree algorithm) and rough sets (a relatively new rule induction algorithm) require categorical variables:
- Numeric variables are discretised into categorical representations.
Multiple regression with categorical predictor variables:
- Binomial categorical variable – Simply represent the variable with a binary value of 1 or 0.
- Multinomial categorical variable – Requires dummy coding: a multinomial variable with k possible values may be represented by k-1 binomial variables, each coded with a binary value of 1 or 0. This is similar in concept to 1-of-N pseudo variables, except that one value is excluded. (A minimal coding sketch follows below.)

Data Type and Data Mining (cont.)
The good news is:
- Some, but not all, software tools accept a mix of numeric and nominal variables and internally make the necessary conversion before processing the data.
- For learning purposes, you need to know when and how to use the correct data types.
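To make 1-of-N encoding and dummy coding concrete, here is a minimal pandas sketch; the column name and data are hypothetical, not taken from the lecture's datasets. pd.get_dummies produces the N pseudo variables, and drop_first=True gives the k-1 dummy-coded variant used in regression.

```python
import pandas as pd

# Hypothetical nominal variable with three possible values.
df = pd.DataFrame({"marital_status": ["single", "married", "divorced", "married"]})

# 1-of-N pseudo variables: one 0/1 column per unique value (N columns).
one_of_n = pd.get_dummies(df["marital_status"], prefix="status", dtype=int)

# Dummy coding for regression: k-1 columns, the dropped value becomes the baseline.
dummy = pd.get_dummies(df["marital_status"], prefix="status", drop_first=True, dtype=int)

print(one_of_n)
print(dummy)
```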
Tukey's Exploratory Data Analysis
Exploratory Data Analysis (EDA) was first proposed by John Tukey.
EDA is more of a mindset for analysis than an explicit set of techniques and models:
- EDA emphasises data observation, visualisation and the careful application of techniques to make sense of data.
- EDA is NOT about fitting data into an analytical model BUT rather about fitting a model to the data.
Two main EDA approaches:
- Descriptive analytics – Use of descriptive statistical measures.
- Data visualisation.

Why Study Exploratory Data Analysis?
Visually examining data to understand patterns and trends:
- Raw data should be examined to learn the trends and patterns over time, and between dimensions in the data.
- Visual examination can help to frame which analytical methods can be applied to the data.
Using the best possible methods to gain insights into not just the data, but what the data is saying:
- Going beyond the details of the data to understand what the data is saying in the context of answering our question.

Why Study Exploratory Data Analysis? (cont.)
Identifying the best performing variables and model for the data:
- With so much data available, how do we decide what is useful? EDA helps ascertain which variables are influential and important.
Detecting anomalous and suspicious outlier data:
- Outliers and anomalies in the data may be important, highly relevant, and meaningful to the business.
- They can also be just random noise that can be ignored and excluded from analytical consideration.

Why Study Exploratory Data Analysis? (cont.)
Testing hypotheses and assumptions:
- Use insights derived from data to create hypotheses, then test hypothesis-driven changes. The gist is to use data to test hypotheses.
Finding and applying the best possible model to fit the data:
- Predictive modelling and analysis requires an EDA approach that is more focused on the data rather than the model.

General Process of Exploratory Data Analysis
1. Identify the key dimensions and metrics of the dataset.
2. Use analytical software to visualise the data using the techniques that we will review in this lecture.
3. Apply the appropriate statistical model to generate value.
The general idea is to:
- Use tools and pattern recognition abilities to observe the data relationships and unusual data movements.
- Visually examine and explore the data so that it becomes possible to focus the approach to the analysis work.

Data Quality Report
To ascertain how good a dataset is, we can evaluate the dataset with a data quality report:
- Use a pandas DataFrame to calculate the required statistics.
- See src01 for a sample script (an assumed minimal version appears at the end of this sub-section).

More About Exploratory Data Analysis and Descriptive Analytics
Three main phases of descriptive analytics:
- Univariate analysis – Investigate the properties of each single variable in the dataset.
- Bivariate analysis – Measure the intensity of the relationship existing between pairs of variables. For supervised learning models, the focus is on the relationships between the predictive variables and the target variable.
- Multivariate analysis – Investigate the relationships holding within a subset of variables.

More About Exploratory Data Analysis and Descriptive Analytics (cont.)
[Figure: visual illustration of the three phases – univariate analysis looks at Variable A, Variable B, Variable C and the target variable each on its own; bivariate analysis pairs each variable with the target variable (important for supervised learning models); multivariate analysis considers Variables A, B, C and the target variable jointly.]

Univariate Analysis
Study the behaviour of each variable as an entity independent of the other variables in the dataset.
Three characteristics that are of interest:
- Location – Tendency of the values of a given variable to arrange themselves around a specific central value.
- Dispersion – Propensity of the variable to assume a more or less wide range of values.
- The underlying probability distribution.
Objectives of univariate analysis:
- Verify the validity of assumptions regarding a variable's distribution, e.g., that the values of a variable are exactly or approximately normal (regression assumes that variables have normal distributions).

Univariate Analysis (cont.)
- Verify the information content of each variable. E.g., a variable that assumes the same value for 95% of the available records may be considered fairly constant for practical purposes; the information it provides is of little value.
- Identify outliers – anomalies and non-standard values.
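The data quality report script referred to above (src01) is not reproduced here; the sketch below is an assumed, minimal version of such a report built only from standard pandas DataFrame methods, with a hypothetical file name.

```python
import pandas as pd

df = pd.read_csv("students.csv")  # hypothetical dataset

# One row per column: type, missing values, cardinality and basic statistics.
report = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "count": df.count(),
    "missing": df.isna().sum(),
    "missing_pct": (df.isna().mean() * 100).round(2),
    "cardinality": df.nunique(),
})

# Add location/dispersion statistics for the numeric columns only.
numeric = df.select_dtypes(include="number")
report.loc[numeric.columns, "mean"] = numeric.mean()
report.loc[numeric.columns, "std"] = numeric.std()
report.loc[numeric.columns, "min"] = numeric.min()
report.loc[numeric.columns, "max"] = numeric.max()

print(report)
```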
For the remaining discussion, suppose the dataset D contains m records and denote by a_j the generic variable being analysed:
- The subscript j will be suppressed for clarity.
- The vector (x_1j, x_2j, ..., x_mj) of m records corresponding to a_j will simply be denoted by a = (x_1, x_2, ..., x_m).

Measures of Location
Mean:
- The sample arithmetic mean is defined as
  µ = (x_1 + x_2 + ... + x_m) / m = (1/m) Σ_{i=1..m} x_i
- The sum of the differences between each value and the sample mean, i.e., the deviations or spreads, is equal to 0:
  Σ_{i=1..m} (x_i − µ) = 0
- The mean is also the value c that minimises the sum of squared deviations:
  Σ_{i=1..m} (x_i − µ)² = min_c Σ_{i=1..m} (x_i − c)²

Measures of Location (cont.)
Weighted sample mean:
- At times, each value x_i is found to be associated with a numerical coefficient w_i, i.e., the weight of that value:
  µ = (w_1 x_1 + w_2 x_2 + ... + w_m x_m) / (w_1 + w_2 + ... + w_m) = Σ_{i=1..m} w_i x_i / Σ_{i=1..m} w_i
- Example: if x_i is the unit sale price of the i-th shipping lot consisting of w_i product units, the average sale price would be the weighted sample mean.

Measures of Location (cont.)
Median:
- Suppose x_1, x_2, ..., x_m are m observations arranged in non-decreasing order.
- If m is an odd number, the median is the observation occupying position (m + 1)/2, i.e., x_med = x_((m+1)/2).
- If m is an even number, the median is the middle point of the interval between the observations in positions m/2 and (m + 2)/2.
- Example with m odd: -2, 0, 5, 7, 10 → the median is 5.
- Example with m even: 50, 85, 100, 102, 110, 200 → the median is (100 + 102)/2 = 101, i.e., using the 3rd and 4th observations.

Measures of Location (cont.)
Mode:
- The value that corresponds to the peak of the empirical density curve for the variable a, i.e., the value that appears most often in the dataset for variable a.
- If the empirical density curve has been calculated by partitioning continuous values into intervals, each value of the interval that corresponds to the maximum empirical frequency is the mode. (Source: Science Buddies)

Measures of Location (cont.)
Midrange:
- The midpoint of the interval between the minimum value and the maximum value:
  x_midr = (x_max + x_min) / 2, where x_max = max_i x_i and x_min = min_i x_i
- Not robust with respect to extreme values, i.e., the presence of outliers (just like the mean).
- Example: 50, 85, 100, 102, 110, 200 → x_midr = (200 + 50)/2 = 125, whereas µ = 107.8333.

Measures of Location (cont.)
Geometric mean:
- The m-th root of the product of the m observations of the variable a:
  µ_geom = (x_1 · x_2 · ... · x_m)^(1/m)
- Example (Source: University of Toronto Mathematics Network): suppose you have an investment which earns 10% the first year, 60% the second year, and 20% the third year. What is its average rate of return?
  - After one year, a $1 investment becomes $1.10. After two years, it becomes (1.10 + 0.6 × 1.10) = $1.76. After three years, we have (1.76 + 0.2 × 1.76) = $2.112.
  - By what constant factor would your investment need to be multiplied each year in order to achieve the same effect as multiplying by 1.10 one year, 1.60 the next, and 1.20 the third?
Measures of Location (cont.)
- The constant factor is the geometric mean: (1.1 × 1.6 × 1.2)^(1/3) ≈ 1.283, and 1.283³ ≈ 2.112.
- The average rate of return is therefore about 28% (not 30%, which is what the arithmetic mean of 10%, 60% and 20% would give).
- The arithmetic mean is calculated for observations that are independent of each other; the geometric mean is used to calculate the mean of observations that are dependent on each other.

Measures of Dispersion
Location measures provide an indication of the central part of the observed values for a numerical variable. We also need measures of dispersion to represent the variability expressed by the observations with respect to the central value.
[Figure: two empirical density curves that both have mean, median and mode equal to 50, but very different spreads – one narrow, one wide.]

Measures of Dispersion (cont.)
- In most applications, it is desirable to have data with small dispersion. E.g., in a manufacturing process, a wide variation in a critical measure might point to an undesirable level of defective items.
- In some applications, a higher dispersion may be desired, for example for the purpose of classification or discriminating between classes. E.g., for test results intended to discriminate between the abilities of candidates, it is desirable to have a fairly wide spectrum of values.

Measures of Dispersion (cont.)
Range:
- The simplest measure of dispersion:
  x_range = x_max − x_min, where x_max = max_i x_i and x_min = min_i x_i
- Useful to identify the interval in which the values of a variable fall, but unable to capture the actual dispersion of the values.
[Figure: two densities over the interval 5–10 that have the same range, but the dispersion of the density on the right is greater than that of the one on the left.]

Measures of Dispersion (cont.)
Mean absolute deviation (MAD):
- The deviation, or spread, of a value is defined as the signed difference from the sample arithmetic mean:
  s_i = x_i − µ, i ∈ M, and Σ_{i=1..m} s_i = 0
- MAD measures the dispersion of the observations around their sample mean through the sum of the absolute values of the spreads:
  MAD = (1/m) Σ_{i=1..m} |s_i| = (1/m) Σ_{i=1..m} |x_i − µ|
- The lower the MAD, the more the values fall in proximity to their sample mean and the lower the dispersion.

Measures of Dispersion (cont.)
Variance:
- More widely used than MAD. The sample variance is defined as:
  σ² = (1/(m − 1)) Σ_{i=1..m} s_i² = (1/(m − 1)) Σ_{i=1..m} (x_i − µ)²
- A lower sample variance implies a lower dispersion of the values around the sample mean.
- As the size of the sample increases, the sample mean µ approximates the population mean and the sample variance σ² approximates the population variance.

Measures of Dispersion (cont.)
To bring the measure of dispersion back to the original scale in which the observations are expressed, the sample standard deviation is defined as:
  σ = √(σ²)
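As a quick illustration (not taken from the lecture's src files), the sketch below computes the location and dispersion measures discussed above on a small hypothetical sample using NumPy, pandas and SciPy; the weights and the investment growth factors are also made up for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series([50, 85, 100, 102, 110, 200])   # hypothetical sample

mean = x.mean()                                # 107.83...
median = x.median()                            # 101.0 (average of 3rd and 4th values)
mode = x.mode()                                # every value occurs once here, so all are modes
midrange = (x.max() + x.min()) / 2             # 125.0

# Weighted sample mean: unit prices weighted by (hypothetical) lot sizes.
w = np.array([10, 20, 30, 15, 15, 10])
weighted_mean = np.average(x, weights=w)

# Geometric mean of the growth factors from the investment example.
growth = np.array([1.10, 1.60, 1.20])
geom_mean = stats.gmean(growth)                # ~1.283, i.e., ~28% average return

# Dispersion measures.
rng = x.max() - x.min()
mad = (x - mean).abs().mean()                  # mean absolute deviation
var = x.var()                                  # sample variance (divides by m - 1)
std = x.std()                                  # sample standard deviation

print(mean, median, midrange, weighted_mean, geom_mean, rng, mad, var, std)
```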
Measures of Dispersion (cont.)
Normal distribution – if the distribution of variable a is normal or approximately normal:
- The interval (µ ± σ) contains approximately 68% of the observed values.
- The interval (µ ± 2σ) contains approximately 95% of the observed values.
- The interval (µ ± 3σ) contains approximately 100% of the observed values.
- Values that fall outside (µ ± 3σ) can be considered suspicious outliers.

Measures of Dispersion (cont.)
Coefficient of variation:
- Defined as the ratio between the sample standard deviation and the sample mean, expressed in percentage terms:
  CV = 100 · σ / µ
- Provides a relative measure of dispersion.
- Used to compare two or more groups of data, usually obtained from different distributions.

Relative Measures of Dispersion
Relative measures of dispersion are used to examine the localisation of a value with respect to the other values in the sample.
Quantiles:
- Suppose we arrange the m values {x_1, x_2, ..., x_m} of a variable a in non-decreasing order. Given any value p, with 0 ≤ p ≤ 1, the p-order quantile is the value q_p such that pm observations fall to the left of q_p and the remaining (1 − p)m fall to its right.
- p-order quantiles are sometimes called 100p-th percentiles.
- The 0.5-order quantile coincides with the median.
- q_L = 0.25-order quantile, also called the lower quartile.
- q_U = 0.75-order quantile, also called the upper quartile.

Relative Measures of Dispersion (cont.)
The interquartile range (IQR) is defined as the difference between the upper and lower quartiles:
  D_q = q_U − q_L = q_0.75 − q_0.25
(Source: http://www.cdc.gov/osels/scientific_edu/ss1978/Lesson2/Section7.html)

Identification of Outliers for Numerical Variables
z-index:
- Calculated as z_i = (x_i − µ) / σ
- Identifies outliers in most cases. Data values which are outside 3 standard deviations from the sample mean are considered to be suspicious:
  - |z_i| > 3 – suspicious
  - |z_i| >> 3 – highly suspicious
(A small sketch of these checks follows below.)
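The z-index and quartile-based quantities above translate directly into a few lines of pandas; the sketch below is illustrative only, with made-up observations rather than the lecture's src files.

```python
import pandas as pd

s = pd.Series([50, 85, 100, 102, 110, 200])   # hypothetical observations

# z-index: values more than 3 standard deviations from the mean are suspicious.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# Quartiles and interquartile range.
q_lower, q_upper = s.quantile(0.25), s.quantile(0.75)
iqr = q_upper - q_lower

# Coefficient of variation, in percentage terms.
cv = 100 * s.std() / s.mean()

print(z_outliers, iqr, cv)
```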
Identification of Outliers for Numerical Variables (cont.)
Box plots or box-and-whisker plots:
- The median and the lower and upper quartiles are represented on the axis where the observations are placed.
- The length of the box represents the interquartile range (the distance between the 25th and the 75th percentiles).
- A dot in the box interior represents the mean; a horizontal line in the box interior represents the median.
- Vertical lines issuing from the box extend to the minimum and maximum values of the analysis variable; at the end of these lines are the whiskers.
An observation is identified as an outlier if it falls outside four thresholds (values from the lecture's example):
- External lower edge = q_L − 3·D_q = −101.5
- Internal lower edge = q_L − 1.5·D_q = −34.75
- Internal upper edge = q_U + 1.5·D_q = 143.25
- External upper edge = q_U + 3·D_q = 210

Measures of Shape
Skewness refers to the lack of symmetry in the distribution of the data values.
- It refers to the empirical density of empirical relative frequencies, calculated by dividing each raw frequency by m, i.e., the number of records.
- Negatively (left) skewed: Mean < Median < Mode.
- Positively (right) skewed: Mode < Median < Mean.

Measures of Shape (cont.)
An index of asymmetry, the sample skewness, based on the third sample moment, may be used to measure shape:
- The third sample moment is defined as:
  µ_3 = (1/m) Σ_{i=1..m} (x_i − µ)³
- The sample skewness is defined as:
  I_skew = µ_3 / σ³
- Sample skewness is interpreted as follows:
  - If the density curve is symmetric, then I_skew = 0.
  - If the density curve is skewed to the right, then I_skew > 0.
  - If the density curve is skewed to the left, then I_skew < 0.

Measures of Shape (cont.)
Kurtosis:
- The kurtosis index expresses the level of approximation of an empirical density to the normal curve, using the fourth sample moment:
  µ_4 = (1/m) Σ_{i=1..m} (x_i − µ)⁴
- The kurtosis is defined as:
  I_kurt = µ_4 / σ⁴
- Note that for a normal density µ_4 / σ⁴ = 3, so the excess kurtosis µ_4 / σ⁴ − 3 is the convention consistent with the interpretation below, where the normal case corresponds to 0.

Analysis of the Empirical Density (cont.)
Kurtosis is interpreted as follows:
- If the empirical density perfectly fits a normal density, then I_kurt = 0.
- If the empirical density is hyponormal, it shows greater dispersion than the normal density, i.e., it assigns lower frequencies to values close to the mean: I_kurt < 0.
- If the empirical density is hypernormal, it shows lower dispersion than the normal density, i.e., it assigns higher frequencies to values close to the mean: I_kurt > 0.
(Source: Vercellis 2009, pp. 133)

Descriptive Statistics
DataFrame.describe() can generate the descriptive statistics for each Series (i.e., column) in the DataFrame:
- This works only for numeric columns.
- It summarises the central tendency, dispersion and shape of a dataset's distribution, excluding missing values.
DataFrame.skew() can generate the skewness index.
DataFrame.kurtosis() can generate the kurtosis.
We have discussed histograms earlier. The empirical density can be generated with DataFrame.plot.density().

Descriptive Statistics (cont.)
[Figure (src02): empirical density curve for hours.]
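The lecture's src02 is not reproduced here; a minimal stand-in using the pandas methods just mentioned might look like the following, with a hypothetical hours column instead of the lecture's dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"hours": [2, 3, 3, 4, 5, 5, 5, 6, 7, 12]})  # hypothetical study hours

print(df.describe())     # count, mean, std, min, quartiles, max (numeric columns only)
print(df.skew())         # sample skewness index
print(df.kurtosis())     # sample (excess) kurtosis, 0 for a normal distribution

# Empirical density curve (pandas uses a kernel density estimate; requires SciPy).
df["hours"].plot.density()
plt.show()
```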
Measures of Association
Summary indicators may be used to express the nature and intensity of the relationship between numerical variables.
Covariance:
- Given a pair of variables a_j and a_k, let µ_j and µ_k be the corresponding sample means. The sample covariance is defined as:
  v_jk = cov(a_j, a_k) = (1/(m − 2)) Σ_{i=1..m} (x_ij − µ_j)(x_ik − µ_k)
- Concordance: values of a_j greater (lower) than the mean µ_j are associated with values of a_k also greater (lower) than the mean µ_k. The two factors of each product in the summation then agree in sign and provide a positive contribution to the sum.

Measures of Association (cont.)
- Discordance: values of a_j greater (lower) than the mean µ_j are associated with values of a_k lower (greater) than the mean µ_k. The two factors of each product in the summation then disagree in sign and provide a negative contribution to the sum.
- Positive (negative) covariance values indicate that the variables a_j and a_k are concordant (discordant).
Limitation of covariance:
- Covariance is a number expressed on a variable scale and is thus inadequate to assess the intensity of the relationship. Correlation is more useful.

Measures of Association (cont.)
Correlation:
- Correlation refers to the strength and direction of the relationship between two numerical variables.
- Interpreting the correlation value:
  - 1 means a strong correlation in the positive direction.
  - -1 means a strong correlation in the negative direction.
  - 0 means no correlation.
Some machine learning algorithms do not work optimally in the presence of multicollinearity:
- Multicollinearity refers to the presence of correlations among the independent variables.
- It is important to review all pairwise correlations among the independent variables.

Measures of Association (cont.)
Pearson R correlation:
- The most common measure of the relationship between two numerical variables which are linearly related and appear to be normally distributed.
- Defined as:
  r_jk = corr(a_j, a_k) = v_jk / (σ_j σ_k)
  where σ_j and σ_k are the sample standard deviations of a_j and a_k.
- r_jk always lies in the interval [-1, 1].
- r_jk represents a relative index expressing the intensity of a possible linear relationship between variables a_j and a_k.

Measures of Association (cont.)
The main properties of the Pearson coefficient:
- If r_jk > 0, the attributes are concordant and the pairs of observations will tend to lie on a line with positive slope.
- If r_jk < 0, the attributes are discordant and the pairs of observations will tend to lie on a line with negative slope.
- If r_jk = 0 or r_jk ≈ 0, no linear relationship exists.

Measures of Association (cont.)
Kendall rank correlation:
- A non-parametric test to determine the strength and direction of the relationship between two numeric variables that follow a distribution other than the normal distribution.
Spearman rank correlation:
- Also a non-parametric test.
- Does not make any assumptions about the distribution of the data.
- Most suitable for ordinal (i.e., categorical) data.

Measures of Association (cont.)
DataFrame.corr() can be used to compute the pairwise correlation of columns, excluding null values. The following three methods are supported:
- pearson – Standard correlation coefficient.
- kendall – Kendall correlation coefficient.
- spearman – Spearman rank correlation.
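As an illustration of DataFrame.corr(), the sketch below computes all three correlation types for two hypothetical numeric columns; the names hours and grade echo the lecture's example but the data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [2, 4, 5, 6, 8, 10],
    "grade": [55, 62, 70, 68, 81, 90],
})

print(df.corr(method="pearson"))    # linear relationship, assumes roughly normal data
print(df.corr(method="kendall"))    # rank-based, no normality assumption
print(df.corr(method="spearman"))   # rank-based, suitable for ordinal data

# Pairwise correlation of a single pair of columns.
r = df["hours"].corr(df["grade"])
print(r)
```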
[Figure (src03): scatter plot of grade against hours.]

Lecture 2 – Data Visualization and Predictive Analytics
Corporate Gurukul – Data Analytics using Deep Learning, June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35

Learning Objectives
At the end of this lecture, you should understand:
- Data visualisation.
- More predictive analytics and machine learning.
- Different types of data mining patterns.

Matplotlib Architecture
Matplotlib provides a set of functions and tools that allow the representation and manipulation of a Figure (the main object) and its associated internal objects.
Other than graphics, Matplotlib also handles events and graphics animation. Thus, Matplotlib can produce interactive charts that can respond to events triggered by the keyboard or by mouse movement.
The architecture of Matplotlib is structured into three layers with unidirectional communication:
- Each layer can communicate with the underlying layer only.

Matplotlib Architecture (cont.)
- Backend layer – The Matplotlib API and a set of classes to represent the graphic elements.
- Artist layer – An intermediate layer representing all the elements that make up a chart, e.g., title, axis, labels, markers, etc.
- Scripting layer – Consists of the pyplot interface for actual data calculation, analysis and visualisation.

Matplotlib Architecture (cont.)
[Figure: the three main artist objects in the hierarchy of the Artist layer.]
[Figure: each instance of a chart corresponds to an instance of Artist structured in a hierarchy.]

pyplot
The pyplot module is a collection of command-style functions that allow the data analyst to operate on or make changes to the Figure, e.g., create a new Figure.
A simple interactive chart can be created using the pyplot object's plot() function.
The plot() function uses a default configuration that does not have a title, axis labels, legend, etc. (src01)

pyplot (cont.)
The default configuration can be changed to obtain the desired chart. (src02)

Data Visualisation – Line Plot
To create a simple line plot from data in a pandas DataFrame, we can use DataFrame.plot(), which relies on matplotlib:
- By default, you will get a line plot. (src03)

Data Visualisation – Line Plot (cont.)
To customise the line plot, we need to use matplotlib directly. (src04)

Data Visualisation – Bar Plot
To create a bar plot, we need to set the kind attribute in DataFrame.plot() to "bar".
- By default, the index of the DataFrame is a zero-based number and thus the x-axis shows the numbers 0 to 4.
- We can change the index of the DataFrame to the Name column and replot the bar plot.
- See the sample script in src05.
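The lecture's src01–src05 scripts are not reproduced here; the following is an assumed minimal example of the workflow just described: a default line plot from a DataFrame, a customised version via pyplot, and a bar plot re-indexed by name. The Name/Grade data are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cat", "Dan", "Eve"],
    "Grade": [78, 85, 62, 90, 74],
})

# Default line plot: DataFrame.plot() uses the index (0-4) for the x-axis.
df.plot(y="Grade")
plt.show()

# Customising via pyplot directly: title, axis labels, markers and legend.
plt.plot(df["Name"], df["Grade"], marker="o", label="Grade")
plt.title("Grades by student")
plt.xlabel("Name")
plt.ylabel("Grade")
plt.legend()
plt.show()

# Bar plot: set kind="bar"; re-index by Name so the x-axis shows names instead of 0-4.
df.set_index("Name").plot(kind="bar", y="Grade")
plt.show()
```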
Data Visualisation – Box Plot
A box plot can be created using DataFrame.boxplot() and specifying the required column, in this case "Grade". (src06)

Data Visualisation – Box Plot (cont.)
A box plot is a method for graphically depicting groups of numerical data through their quartiles:
- The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2).
- The whiskers extend from the edges of the box to show the range of the data. The position of the whiskers can represent several values:
  - If there are no outliers, the whiskers show the minimum and maximum values of the data.
  - If there are outliers, the whiskers show the lowest datum still within 1.5 IQR of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile, where IQR is the interquartile range, IQR = Q3 − Q1.
- Outlier points are those past the ends of the whiskers.

Data Visualisation – Box Plot (cont.)
Example (src06):
- IQR = 18.
- Outliers would fall outside ±1.5 × IQR from Q1 and Q3, i.e., outside 50 and 122.
- The minimum and maximum values are within this range, so in this example the whiskers show the minimum and maximum values since there is no outlier.

Data Visualisation – Box Plot (cont.)
Example (src07) – adding one extra observation with a Grade of 130 leads to an outlier:
- IQR = 20.75.
- Outliers fall outside ±1.5 × IQR from Q1 and Q3, i.e., outside 46.125 and 129.125.
- The maximum value is outside this range. In this example, the whiskers show the lowest datum still within 1.5 IQR of the lower quartile (i.e., 76) and the highest datum still within 1.5 IQR of the upper quartile (i.e., 99).

Data Visualisation – Box Plot (cont.)
We can create a box plot with categorisation, in this case by Gender. We can also adjust the y-axis so that it runs from 0 to 100. (src08)

Data Visualisation – Box Plot (cont.)
[Figure (src08): box plots with categorisation; the one on the right has the y-axis limit set to 0-100.]

Data Visualisation – Histogram
A histogram can be created using DataFrame.hist():
- By default, histograms for all columns with numeric values will be generated.
- To generate a histogram for a particular column, specify the name of the required column in the column attribute, e.g., df.hist(column='hours'). (src09)

Data Visualisation – Histogram (cont.)
We can also generate a histogram for a numerical column grouped by another column, typically a categorical column, using the by attribute. See src09 for a sample script grouped by "gender".

Data Visualisation – Pie Chart
A pie chart can be created using pyplot.pie(). Customisations can be made to the pie chart to make it more meaningful. (src10)
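A minimal sketch of the box plot and histogram calls described above, with hypothetical Grade/hours/gender columns standing in for the lecture's dataset (not the actual src06–src09 scripts).

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Grade": [76, 81, 85, 90, 94, 99, 130],
    "hours": [5, 6, 7, 8, 9, 10, 15],
    "gender": ["F", "M", "F", "M", "F", "M", "F"],
})

# Box plot of a single column; an extreme Grade such as 130 appears as an outlier point.
df.boxplot(column="Grade")
plt.show()

# Box plot grouped by a categorical column, with the y-axis limited to 0-100.
df.boxplot(column="Grade", by="gender")
plt.ylim(0, 100)
plt.show()

# Histogram of one numeric column, grouped by a categorical column.
df.hist(column="hours", by="gender")
plt.show()
```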
Data Visualisation – Scatter Plot
A scatter plot can be created using pyplot.scatter(). (src11)

Data Visualisation – Scatter Plots of the Iris Flower Dataset
This dataset contains data from three different species of iris (Iris setosa, Iris virginica and Iris versicolor).
- The variables include the length and width of the sepals, and the length and width of the petals.
- This dataset is widely used for classification problems.
- 150 observations with 4 independent attributes and one target attribute. (src12)

Data Visualisation – Scatter Plots of the Iris Flower Dataset (cont.)
Which variables are better for predicting iris species?
- src13 – Scatter plot of sepal sizes.
- src14 – Scatter plot of petal sizes.

Cross Tabulation
What is a crosstab?
- A crosstab is a table showing the relationship between two or more categorical variables.
- It compares the results for one or more variables with the results of another variable.

Cross Tabulation (cont.)
In pandas:
- By default, a crosstab computes a frequency table of the variables.
- To override the default, we can pass in an array of values and provide a suitable NumPy aggregation function.
- Refer to src15 for some examples with the automobiles dataset (an assumed minimal illustration follows below).

Pivot Table
A pivot table is generally similar to a crosstab, but there are important differences:
- Data type – A crosstab is used for categorical variables only, whereas a pivot table works with both categorical and numerical variables.
- Display format – A crosstab analyses the relationship between two variables, whereas a pivot table can work with more than two variables; a pivot table is generally more flexible than a crosstab.
- Aggregation – A pivot table goes beyond raw frequencies and can perform statistical calculations.

Pivot Table (cont.)
In pandas, pivot tables and crosstabs are very similar. Refer to src16 for some examples with the automobiles dataset.
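Since src15 and src16 are not included here, the sketch below is an assumed minimal illustration of pd.crosstab and pd.pivot_table on a made-up automobiles-style table; the column names make, body_style and price are hypothetical.

```python
import pandas as pd

autos = pd.DataFrame({
    "make": ["toyota", "toyota", "honda", "honda", "bmw"],
    "body_style": ["sedan", "hatchback", "sedan", "sedan", "sedan"],
    "price": [21000, 19000, 22000, 24000, 41000],
})

# Crosstab: a frequency table of two categorical variables by default.
print(pd.crosstab(autos["make"], autos["body_style"]))

# Crosstab with values and an aggregation function instead of raw counts.
print(pd.crosstab(autos["make"], autos["body_style"],
                  values=autos["price"], aggfunc="mean"))

# Pivot table: works with numeric values and supports richer aggregation.
print(pd.pivot_table(autos, index="make", columns="body_style",
                     values="price", aggfunc=["mean", "count"]))
```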
More About Machine Learning
Recall that machine learning employs statistical and mathematical techniques to build a model based on sample data. The objective is to identify patterns among variables in the data, i.e., data mining.
Data mining involves three main patterns:
- Prediction (classification and regression).
- Segmentation.
- Association.

More About Data Mining
Prediction:
- Forecast the outcome of a future event or unknown phenomenon.
- Intuitively, we can think of prediction as learning an A→B mapping, where A is the input and B is the output.
- Classification – Predict the weather outlook: A are weather data such as temperature and humidity; B is a class label representing the weather outlook such as "Rain" or "No Rain".
- Regression – Predict rainfall: A are weather data such as temperature and humidity; B is a real number representing the amount of rainfall in millimetres.

More About Data Mining (cont.)
Segmentation:
- Partition a collection of things (e.g., objects, events) in a dataset into natural groupings.
- Clustering: create groups so that the members within each group have maximum similarity, while members across groups have minimum similarity.
- Examples: segment daily temperature data into hot days or cold days; segment customers based on their demographics and past purchase behaviours.

More About Data Mining (cont.)
Association:
- Discover interesting relationships among variables in a large database.
- Market basket analysis: discover regularities among products in large-scale transactions recorded by point-of-sale systems in supermarkets, i.e., each product purchased in a transaction being a variable. Identify products that are commonly purchased together, e.g., beer and diapers.

More About Data Mining (cont.)
Data mining tasks to extract the different types of patterns rely on learning algorithms. Learning algorithms can be classified according to the way patterns are extracted from historical data:
- Supervised learning methods – Training data include both the independent variables and the dependent variable.
- Unsupervised learning methods – Training data include only the independent variables.

More About Data Mining (cont.)
- Prediction involves an A→B mapping in which we know the B to help determine the pattern. This is known as supervised learning.
- Association and segmentation involve just the A without the B; we determine the pattern without the help of the B. This is known as unsupervised learning.
(Source: Sharda et al. (2020) – Analytics, Data Science, & Artificial Intelligence: Systems for Decision Support, pp. 206, Figure 4.2)

Hands-on with Supervised Learning versus Unsupervised Learning (cont.)
The Zoo animals dataset consists of 101 animals and 18 variables:
- animal – Identifier (text) – Name of the zoo animal.
- type – Dependent variable or class label (categorical) – Type of the zoo animal; this is the B; 7 types: amphibian, bird, fish, insect, invertebrate, mammal, and reptile.
- hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, tail, domestic, catsize – Independent variables (Boolean) – Various characteristics of the zoo animal; these are the A; 15 such attributes.
- legs – Independent variable (numeric) – Number of legs; also part of the A.

Hands-on with Supervised Learning versus Unsupervised Learning (cont.)
[Figure: the dataset columns split into identifier, independent variables and dependent variable.]

Classifying Zoo Animals
Use a decision tree classifier to learn the function or mapping between the characteristics of animals and their membership of each type. This is a supervised learning process.
[Figure: a small decision tree – "Produces milk?" Yes → mammal; No → "Has feathers?" Yes → bird; No → further tests.]

Classifying Zoo Animals (cont.)
Use a common two-step methodology known as split validation: model training followed by model testing. In this case, we are using a simple split validation:
- 70% of the sample is used to train the model and derive a decision tree.
- 30% of the sample is used to test the model.
[Figure: the 101 animals are split into 72 training animals and 29 testing animals.]

Classifying Zoo Animals (cont.)
Compute the predictive accuracy, i.e., the model's ability to correctly predict the class label of new or previously unseen data:
[Table: actual versus predicted animal type, with the number of correctly classified cases per type – a for amphibian, b for bird, ..., g for reptile. Testing accuracy = (a + b + ... + g) / 101.]
In this example, the predictive accuracy is about 90%.

Classifying Zoo Animals (cont.)
The actual testing accuracy may vary as we are using random stratified sampling to perform the split validation.
Example: 10% of all animals are invertebrates, so about 10% of the animals in each of the training dataset and the testing dataset would be invertebrates.
[Figure: of the 101 animals, 7 invertebrates fall in the 72 training animals and 3 invertebrates in the 29 testing animals.]

Classifying Zoo Animals (cont.)
Refer to sample source file src17 for the example.
[Figures: confusion matrices for training and testing; the decision tree for classifying zoo animals.]

Clustering Zoo Animals
Use the k-means clustering algorithm to segment the zoo animals into natural groupings. This is an unsupervised learning process:
- The class attribute, i.e., animal type, is NOT used by the algorithm.
- The clustering process groups animals together using some statistical measure of distance: the distance between animals is calculated using the descriptive variables, i.e., the A.

Clustering Zoo Animals (cont.)
Animals that are closer to each other (i.e., shorter distance) are grouped together. Therefore, animals in different groups are further apart (i.e., greater distance).
[Figure: example clusters – Cluster 1 (fish): bass, carp, catfish; Cluster 4 (amphibian): frog, newt, toad; Cluster 5 (bird): flamingo, duck, crow, dove.]

Clustering Zoo Animals (cont.)
- We cannot compute the predictive accuracy because we are NOT supposed to know the actual class attribute. The quality of a clustering model is evaluated using other statistical measures.
- Refer to sample code src18 and src19 for the example.
- Observe that zoo animals of the same type are generally segmented into the same cluster. We do NOT observe a case in which zoo animals of the same type are randomly segmented into different clusters.

Making Sense of Zoo Animals
Our general conclusion is that unsupervised learning is useful in the real world. Without knowing the actual animal type, our clustering model is able to segment the zoo animals into natural groupings that are close to the actual types.
In real-world ML problem solving, we can usefully apply unsupervised learning if:
- We do not have a B.
- It is not possible to obtain a labelled training dataset that contains a B.
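src17–src19 are not reproduced here. The sketch below shows the same two ideas with scikit-learn on a zoo-style table: a decision tree trained with a stratified 70/30 split validation, then k-means clustering that ignores the class label. The file path and exact column layout are assumptions, not the actual lecture files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.cluster import KMeans

zoo = pd.read_csv("zoo.csv")                 # hypothetical path to the zoo dataset
X = zoo.drop(columns=["animal", "type"])     # independent variables (the "A")
y = zoo["type"]                              # class label (the "B")

# Supervised learning: simple split validation with stratified sampling (70% train, 30% test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pred = tree.predict(X_test)
print("Testing accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))

# Unsupervised learning: k-means on the same features, without using the class label.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=42).fit(X)
zoo["cluster"] = kmeans.labels_
print(pd.crosstab(zoo["type"], zoo["cluster"]))  # compare clusters against actual types
```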
Data Mining Process
A systematic approach to carrying out a data mining project maximises the chance of success. There are various standardised processes based on best practices:
- CRISP-DM (Cross-Industry Standard Process for Data Mining)
- SEMMA (Sample, Explore, Modify, Model, and Assess)
- KDD (Knowledge Discovery in Databases)
(Source: KDNuggets.com, August 2007)

CRISP-DM
Proposed by a consortium of European companies in the mid-1990s as a non-proprietary standard. It consists of a sequence of six steps:
- The steps are sequential in nature but usually involve backtracking.
- The whole process is iterative and can be time consuming.
- The outcome from each step feeds into the next step, and thus each step must be carefully conducted.
- Data mining is largely driven by experience and experimentation – is it an art or a science?
[Figure: the six CRISP-DM steps and the feedback loops between them.]

Step 1 – Business Understanding
Know the purpose of the data mining study:
- A thorough understanding of the managerial need for new knowledge.
- An explicit specification of the business objective.
Examples of specific business questions:
- What are the common characteristics of the customers we have lost to our competitors recently?
- What are the typical profiles of our customers, and how much value does each of them provide to us?

Step 2 – Data Understanding
Identify the relevant data from the many available data sources – internal databases and external sources.
Key consideration factors in identifying and selecting data:
- A clear and concise description of the data mining task.
Develop an in-depth understanding of the data and variables:
- Are there any synonymous and/or homonymous variables?
- Are the variables independent of each other?
Use different techniques to help understand the data:
- Statistical techniques: statistical summaries and correlation analysis.
- Graphical techniques: scatter plots, histograms and box plots.

Step 2 – Data Understanding (cont.)
Common data sources:
- Demographic data – E.g., income, education, number of households and age.
- Sociographic data – E.g., hobbies, club memberships and entertainment.
- Transactional data – Sales records, credit card spending and issued cheques.

Step 3 – Data Preparation
[Figure: the four main data preparation steps – consolidation, cleaning, transformation and reduction.]

Step 3 – Data Preparation (cont.)
Prepare the data identified in Step 2 for data mining analysis, i.e., data preprocessing:
- This step consumes the most time and effort, roughly 80% of the total.
- Four main steps convert raw, real-world data into minable datasets.
Data consolidation:
- Relevant data are collected from the identified sources.
- Required records and variables are selected and integrated.
Data cleaning:
- Missing data are imputed or ignored.
- Noisy data, i.e., outliers, are identified and smoothed out.
- Inconsistent data are handled with domain knowledge or expert opinion.

Step 3 – Data Preparation (cont.)
Data transformation:
- Data are transformed for better processing.
- Normalisation – Data may be normalised between certain minimum and maximum values to mitigate potential bias.
- Discretisation – Numeric variables are converted to categorical values.
- Aggregation – A nominal variable's unique value range may be reduced to a smaller set using concept hierarchies.
Construct new variables:
- Derive new and more informative variables from existing ones, e.g., use a single blood-type-match variable instead of separate multinomial values for the blood types of both donor and recipient.
(A small pandas sketch of these transformation and reduction operations appears after Step 6 below.)

Step 3 – Data Preparation (cont.)
Data reduction – a dataset that is too large can also cause problems:
- Too many variables or dimensions:
  - Reduce the number of variables to a more manageable and most relevant subset.
  - Use findings from the extant literature or consult domain experts.
  - Run appropriate statistical tests such as principal component analysis.
- Too many records:
  - Processing a large number of records may not be practical or feasible.
  - Use sampling to obtain a subset of the data that reflects all relevant patterns of the complete dataset.

Step 3 – Data Preparation (cont.)
Skewed data:
- Potential bias in the analysis output.
- Oversample the less represented class or undersample the more represented class.

Step 4 – Model Building
Various modelling techniques are selected and applied to the dataset prepared in Step 3.
Assessment and comparative analysis of various models:
- There is no universally known "best" method or algorithm for a data mining task.
- Use a variety of viable model types with a well-defined experimentation and assessment strategy to identify the "best" method for a given purpose.
- Different methods may have specific requirements for data format, and it may be necessary to go back to Step 3.

Step 5 – Testing and Evaluation
The models developed in Step 4 need to be evaluated for their accuracy and generality. Two options in general:
- Assess the degree to which the selected model(s) meet the business objectives.
- Test the developed model(s) in a real-world scenario if time and budget constraints permit.
The bottom line is that business value should be obtained from the discovered knowledge patterns. Close interaction between data analysts, business analysts and decision makers is required.

Step 6 – Deployment
The end user needs to be able to understand the knowledge gained from the data mining study and benefit from it. The knowledge needs to be properly organised and presented:
- A simple approach involves generating a report.
- A complex approach involves implementing a repeatable data mining process across the organisation.
Deployment also includes maintenance activities:
- Data reflecting business activities may change.
- Models built on old data may become obsolete, irrelevant or misleading.
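Returning to the Step 3 operations above, here is a small illustrative pandas sketch (not part of CRISP-DM itself, and the column names are hypothetical): min-max normalisation, discretisation of a numeric variable, and simple random sampling to reduce the number of records.

```python
import pandas as pd

df = pd.DataFrame({"income": [2500, 4000, 5500, 8000, 12000],
                   "age": [22, 35, 41, 53, 67]})

# Normalisation: rescale a numeric variable to the [0, 1] range to mitigate scale bias.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Discretisation: convert a numeric variable into ordered categories.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])

# Data reduction by sampling: keep a random subset of the records.
sample = df.sample(frac=0.6, random_state=1)

print(df)
print(sample)
```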
Data Mining Goes to Hollywood
Decision situation – predicting the box-office receipts (i.e., financial success) of a particular movie.
- Traditional approach: frame it as a forecasting (or regression) problem and attempt to predict the point estimate of a movie's box-office receipts.
- Sharda and Delen's approach: convert it into a multinomial classification problem – classify a movie, based on its box-office receipts, into one of nine categories ranging from "flop" to "blockbuster", and use variables representing different characteristics of a movie to train various classification models.

Data Mining Goes to Hollywood (cont.)
Proposed solution:
- The dataset consists of 2,632 movies released between 1998 and 2006.
- Training set – 1998 to 2005; test set – 2006.
[Table: movie classification categories based on box-office receipts.]

Data Mining Goes to Hollywood (cont.)
[Table: independent variables used to characterise each movie.]
A variety of data mining methods were used – neural networks, decision trees, support vector machines (SVM) and three types of ensembles.

Data Mining Goes to Hollywood (cont.)
[Figure: data mining process map in IBM SPSS Modeler.]

Data Mining Goes to Hollywood (cont.)
[Table: prediction results.]

Data Mining Goes to Hollywood (cont.)
Performance measures:
- Bingo – Percent correct classification rate.
- 1-Away correct classification rate – Correct within one category.
Findings:
- Among the individual prediction models: SVM > ANN > CART.
- The ensemble models performed better than the individual prediction models, with the fusion algorithm being the best.
- The ensemble models also have significantly lower standard deviation compared to the individual prediction models.

Lecture 3 – Simple Linear Regression
Corporate Gurukul – Data Analytics using Deep Learning, June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35

Learning Objectives
At the end of this lecture, you should understand:
- The structure of regression models.
- Simple linear regression.
- Validation of regression models.

Overview of Regression Analysis
In data mining, we are interested in predicting the value of a target variable from the value of one or more explanatory variables.
Example – predict a child's weight based on his/her height:
- Weight is the target variable.
- Height is the explanatory variable.
Regression analysis builds statistical models that characterise relationships among numerical variables.
Two broad categories of regression models:
- Cross-sectional data – The focus of this lecture.
- Time-series data – The focus of a subsequent lecture: the independent variables are time or some function of time.

Structure of Regression Models
The purpose of regression models is to identify a functional relationship between the target variable and a subset of the remaining variables in the data.
The goal of regression models is twofold:
- Highlight and interpret the dependency of the target variable on the other variables.
- Predict the future value of the target variable based upon the functional relationship identified and the future values of the explanatory variables.
The target variable is also known as the dependent, response or output variable. An explanatory variable is also known as an independent or predictor variable.

Structure of Regression Models (cont.)
Suppose dataset D is composed of m observations, a target variable and n explanatory variables:
- The explanatory variables of each observation may be represented by a vector x_i, i ∈ M, in the n-dimensional space ℝⁿ.
- The target variable is denoted by y_i.
The m observation vectors are written as a matrix X of dimension m × n. The target variable is written as the vector $y = (y_1, y_2, \ldots, y_m)$. Let Y be the random variable representing the target attribute and $X_j, j \in N$, the random variables associated with the explanatory variables. CG DADL (June 2023) Lecture 3 – Simple Linear Regression Structure of Regression Models (cont.) Regression models conjecture the existence of a function $f: \mathbb{R}^n \to \mathbb{R}$ that expresses the relationship between the target variable Y and the n explanatory variables $X_j$: $Y = f(X_1, X_2, \ldots, X_n)$ 5 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Linear Regression Models If we assume that the functional relationship $f: \mathbb{R}^n \to \mathbb{R}$ is linear, we have linear regression models. This assumption may be restrictive, but most nonlinear relationships can be reduced to a linear one by applying an appropriate preliminary transformation: 6 A quadratic relationship of the form $Y = b + wX + dX^2$ can be linearized through the transformation $Z = X^2$ into a linear relationship with two explanatory variables: $Y = b + wX + dZ$ CG DADL (June 2023) Lecture 3 – Simple Linear Regression Linear Regression Models (cont.) 7 An exponential relationship of the form $Y = e^{b + wX}$ can be linearized through a logarithmic transformation $Z = \log Y$, which converts it into the linear relationship: $Z = b + wX$ A simple linear regression model with one explanatory variable is of the form: $Y = \alpha + \beta X + \varepsilon$ CG DADL (June 2023) Lecture 3 – Simple Linear Regression Simple Linear Regression Bivariate linear regression models a random variable Y as a linear function of another random variable X: $Y = \alpha + \beta X + \varepsilon$ $\varepsilon$ is a random variable, referred to as the error, which indicates the discrepancy between the response Y and the prediction $f(X) = \alpha + \beta X$. When the regression coefficients are determined by minimizing the sum of squared errors SSE, $\varepsilon$ is assumed to follow a normal distribution with mean 0 and standard deviation $\sigma$: $E(\varepsilon_i \mid X_i) = 0$, $\operatorname{var}(\varepsilon_i \mid X_i) = \sigma^2$ Note: Standard deviation is the square root of variance, and variance is the average of the squared differences from the mean. 8 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Simple Linear Regression (cont.) The preceding model is known as simple linear regression because there is only one explanatory variable. When there are multiple explanatory variables, the model is a multiple linear regression model of the form: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n + \varepsilon$ 9 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Simple Linear Regression (cont.) For a simple linear regression model of the form $Y = \alpha + \beta X + \varepsilon$, given the data samples $(x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s)$, the error for the prediction is: $\varepsilon_i = y_i - f(x_i) = y_i - \alpha - \beta x_i = y_i - \hat{y}_i$ The regression coefficients $\alpha$ and $\beta$ can be computed by the method of least squares, which minimizes the sum of the squared errors SSE: $SSE = \sum_{i=1}^{s} \varepsilon_i^2$ 10 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Simple Linear Regression (cont.) The regression coefficients $\alpha$ and $\beta$ that minimize SSE are: $\beta = \dfrac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}$ and $\alpha = \bar{y} - \beta \bar{x}$, where $\bar{x} = \dfrac{\sum_{i=1}^{s} x_i}{s}$ and $\bar{y} = \dfrac{\sum_{i=1}^{s} y_i}{s}$ 11 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Simple Linear Regression (cont.) Suppose we have a linear equation $y = 2 + 3x$ for which $SSE = 0$: 12 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Example of Simple Linear Regression Predict a child’s weight based on height: The dataset contains 19 observations (a minimal code sketch of fitting this model follows below).
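A minimal sketch of fitting this simple linear regression in Python, using both Scikit Learn and StatsModels as the lecture does. The actual data file is not reproduced in these notes, so the 19 height/weight values below are made-up stand-ins and the fitted numbers will only roughly resemble those on the slides.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the 19-observation child dataset; the real
# file and its exact values are not shown in the slides.
df = pd.DataFrame({
    "Height": [57.3, 60.1, 62.4, 64.0, 65.5, 66.3, 67.8, 69.1, 70.2, 71.5,
               58.6, 61.2, 63.7, 64.8, 66.9, 68.0, 69.5, 70.8, 72.1],
    "Weight": [54.0, 54.6, 55.1, 55.4, 55.7, 55.8, 56.1, 56.3, 56.5, 56.8,
               54.2, 54.8, 55.3, 55.5, 55.9, 56.0, 56.4, 56.6, 56.9],
})

# Scikit Learn: point estimates of alpha (intercept_) and beta (coef_).
lr = LinearRegression().fit(df[["Height"]], df["Weight"])
print(lr.intercept_, lr.coef_)

# StatsModels: full summary table (R-squared, F-test, t-tests, ...).
X = sm.add_constant(df["Height"])
ols = sm.OLS(df["Weight"], X).fit()
print(ols.summary())
```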
There are four variables altogether – Name, Weight, Height and Age. Note that for linear regression, we are using both Scikit Learn and StatsModels. StatsModels provides more summary statistics than Scikit Learn. We could also manually calculate the required statistics... 13 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Example of Simple Linear Regression (cont.) src01 14 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Example of Simple Linear Regression (cont.) Scikit Learn’s output 15 StatsModels’ output CG DADL (June 2023) Lecture 3 – Simple Linear Regression Example of Simple Linear Regression (cont.) The regression equation is $y = 42.5701 + 0.1976x + \varepsilon$ $\beta = 0.1976$: A one-unit increase in height leads to an expected increase of 0.1976 unit in weight. $\alpha = 42.5701$: When $x = 0$, the expected $y$ value is 42.5701 (danger of extrapolation). $N = 19$, the number of observations: Most of the dots, i.e., the actual $(x_i, y_i)$ values, are close to the fitted line. 16 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Validation of Model – Coefficient of Determination R-Square = $R^2$ = 0.7705: 77.05% of the variation in $y_i$ is explained by the model. This value is the proportion of total variance explained by the predictive variable(s): $R^2 = \dfrac{\text{Model Sum of Squares}}{\text{Corrected Total Sum of Squares}} = 1 - \dfrac{\text{Error Sum of Squares}}{\text{Corrected Total Sum of Squares}}$ where Model Sum of Squares $= \sum_{i=1}^{S} (\hat{y}_i - \bar{y})^2$, Error Sum of Squares $= \sum_{i=1}^{S} (y_i - \hat{y}_i)^2$ and Corrected Total Sum of Squares $= \sum_{i=1}^{S} (y_i - \bar{y})^2$. 17 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Validation of Model – Coefficient of Determination (cont.) $R^2$ near zero indicates very little of the variability in $y_i$ is explained by the linear relationship of Y with X. $R^2$ near 1 indicates almost all of the variability in $y_i$ is explained by the linear relationship of Y with X. $R^2$ is known as the coefficient of determination or multiple R-Squared. 18 Root Mean Squared Error (RMSE): $RMSE = \sqrt{MSE} = \sqrt{\dfrac{SSE}{S}} = \sqrt{\dfrac{\sum_{i=1}^{S} e_i^2}{S}} = \sqrt{\dfrac{\sum_{i=1}^{S} (y_i - \hat{y}_i)^2}{S}}$ Recall that in linear regression, the goal is to minimize SSE. So a smaller value of RMSE, i.e., close to 0.0, is better. A smaller RMSE indicates a model with better fit. CG DADL (June 2023) Lecture 3 – Simple Linear Regression Validation of Model – Coefficient of Determination (cont.) • R-Square = Model SS/Corrected Total SS = 0.771 • 77.1% of the variance in weight can be explained by the simple linear model with height as the independent variable. • Adjusted R-Square = 1 – (1 – $R^2$)(m – 1)/(m – n – 1) = 1 – (1 – 0.771)(19 – 1)/(19 – 1 – 1) = 0.757 • R-Square always increases when a new term is added to a model, but adjusted R-Square increases only if the new term improves the model more than would be expected by chance. Root MSE = $\sqrt{MSE_{Error}}$ = 2.3906 This is an estimate of the standard error of the residuals $\sigma$. 19 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Validation of Model – Coefficient of Determination (cont.) • DF Model = 1, since there is only one independent variable in this model. • DF Corrected Total = S – 1 = 18, because $\sum_{i=1}^{19} (y_i - \bar{y}) = 0$: knowing 18 of the differences, we will know the value of the 19th. Analysis of Variance • F-value = MS Model/MS Error = 57.08 • The F-value has n and m – n – 1 DF. • The corresponding p-value is < .0001, indicating that at least one of the independent variables is useful for predicting the dependent variable.
• In this case, there is only 1 independent variable: the value of height is useful for predicting the value of weight. 20 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Validation of Model – Significance of Coefficient of Determination • Is the value of β significantly different from zero? • Hypothesis test: H 0 : β = 0 versus Ha : β ≠ 0 • The t-value for the test is (0.1976/0.026) = 7.555 with corresponding p-value of < 0.0001. • Since the p-value is lower than 5%, we may conclude with 95% confidence to reject H 0 that β is 0. • Note for simple linear regression models: t-value of the β parameter is the square root of the F-value. In this example, 7.555× 7.555 ≈ 57.08 21 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Validation of Model – Significance of Coefficient of Determination (cont.) • The area under the curve to the left of -7.555 and to the right of +7.555 is less than 0.0001. • We reject the null hypothesis and conclude that the slope β is not 0, i.e. the variable height is useful for predicting the dependent variable weight. 22 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Validation of Model – Coefficient of Linear Correlation • In a simple linear regression model, the coefficient of determination = the squared of the coefficient of linear correlation between X and Y. • In our example: X = Height; Y = Weight • r = 0.877785 • R2 = 0.7705 = 0.877785 x 0.877785 23 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Assumptions of Linear Regression Linear regression has five key assumptions. Linear relationship – Relationship between the independent and dependent variables is linear. Homoscedasticity – Residuals are equal across the regression line. No auto-correlation – Residuals must be independent from each other. Multivariate normality – All variables being normally distributed on a univariate level. No or little multicollinearity – Independent variables are not correlated with each other. 24 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Evaluating the Assumptions of Linear Regression Linearity: Relationship between the independent and dependent variables is linear. The linearity assumption can best be tested with scatter plots. Recall the scatter and line plot of the linear regression line that we have created earlier. The 𝑥𝑥𝑖𝑖 , 𝑦𝑦𝑖𝑖 values appear to be linear. 25 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Evaluating the Assumptions of Linear Regression (cont.) Homoscedasticity: Residuals are equal across the regression line. Scatter plots between residuals and predicted values are used to confirm this assumption. Any pattern would result in a violation of this assumption and point toward a poor fitting model. See the sample script in src02. In the child’s weight example, no regular pattern/trend is observed. 26 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Evaluating the Assumptions of Linear Regression (cont.) We can also check the normality of the residuals using a Q-Q plot. See the sample script in src03. • QQ plot on the left shows the residuals in the child’s weight example. • Data points must fall (approximately) on a straight line for normal distribution. 27 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Evaluating the Assumptions of Linear Regression (cont.) Auto-correlation: Residuals must be independent from each other. Residuals are randomly distributed with no pattern in the scatter plot from src02. 
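A minimal sketch of the kind of residual diagnostics described above (the actual scripts src02 and src03 are not reproduced in these notes). The data here are synthetic values generated from the slide's coefficients, so the plots only illustrate the procedure, not the exact output of the lecture scripts.

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in data mimicking the child's weight example
# (intercept 42.57, slope 0.1976, residual sd roughly 2.4).
rng = np.random.default_rng(0)
height = rng.uniform(57, 72, size=19)
weight = 42.57 + 0.1976 * height + rng.normal(0, 2.4, size=19)

ols = sm.OLS(weight, sm.add_constant(height)).fit()
fitted, resid = ols.fittedvalues, ols.resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Homoscedasticity/independence check: residuals versus predicted values
# should show no obvious pattern or trend.
axes[0].scatter(fitted, resid)
axes[0].axhline(0, linestyle="--")
axes[0].set_xlabel("Predicted weight")
axes[0].set_ylabel("Residual")

# Normality check: points should fall roughly on the reference line.
sm.qqplot(resid, line="s", ax=axes[1])

plt.tight_layout()
plt.show()
```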
We can also use the Durbin-Watson test to test the null hypothesis that the residuals are not linearly auto-correlated: 28 While d can assume values between 0 and 4, values around 2 indicate no autocorrelation. As a rule of thumb values of 1.5 < d < 2.5 show that there is no autocorrelation in the data. StatsModels’ will report the Durbin-Watson’s d value, which is 2.643 in the child’s weight example. CG DADL (June 2023) Lecture 3 – Simple Linear Regression Evaluating the Assumptions of Linear Regression (cont.) Multivariate normality: All variables in the linear regression model are normally distributed on a univariate level. We can perform visual/graphical test to check for normality of the data using Q-Q plot and also histogram. See the sample script in src04. • Top (left to right) – Q-Q plot and histogram of Height. • Bottom (left to right) – Q-Q plot and histogram of Weight. 29 CG DADL (June 2023) Lecture 3 – Simple Linear Regression Evaluating the Assumptions of Linear Regression (cont.) No or little multicollinearity: Independent variables are not correlated with each other. For simple linear regression, this is obviously not a problem We will revisit the multicollinearity assumption in the multiple linear regression model. Child’s weight example: 30 We may conclude that the residuals are normal and independent. The linear regression model fits the data well. CG DADL (June 2023) Lecture 3 – Simple Linear Regression Confidence in the Linear Regression Model Linear regression is considered as a low variance/high bias model: Under repeated sampling, the line will stay roughly in the same place (low variance). But the average of those models will not do a great job in capturing the true relationship (high bias). Note that low variance is a useful characteristic when you do not have a lot of training data. A closely related concept is confidence intervals: 31 StatsModels calculates 95% confidence intervals for our model coefficients. CG DADL (June 2023) Lecture 3 – Simple Linear Regression Confidence in the Linear Regression Model (cont.) We can interpret the confidence intervals as follows: 32 If the population from which this sample was drawn was sampled 100 times. Approximately 95 of those confidence intervals would contain the “true” coefficient. CG DADL (June 2023) Lecture 3 – Simple Linear Regression Confidence in the Linear Regression Model (cont.) We can compare the true relationship to the predictions by using StatsModels to calculate the confidence intervals of the predictions: 33 See the sample script in src05. CG DADL (June 2023) Lecture 3 – Simple Linear Regression Can We Assess the Accuracy of a Linear Regression Model? A linear regression model is intended to perform point predictions: It is difficult to make an exact point prediction of the actual continuous numerical value. Thus, it is not viable to assess accuracy in a conventional way. Other than the various measures of goodness, a model with a tight 95% confidence interval is preferred. But we can perform split validation to assess model overfitting: 34 The model from the testing data should return comparable values for the measures of goodness. CG DADL (June 2023) Lecture 3 – Simple Linear Regression Lecture 4 Multiple Linear Regression Corporate Gurukul – Data Analytics using Deep Learning June 2023 Lecturer: A/P TAN Wee Kek Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35 Learning Objectives At the end of this lecture, you should understand: 1 Multiple linear regression. 
Standardization and normalization of data. Selection of predictive variables. Treatment of categorical variables. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Limitations of Simple Linear Regression Simple linear regression only allows us to examine the relationship between two variables. In many real-world scenarios, there are likely more than one independent variables that are correlated with the dependent variable. Multiple linear regression provides a tool that allows us to examine the relationship between two or more explanatory variable and a response variable. Multiple linear regression is especially useful when trying to account for potential confounding factors in observational studies. 2 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Multiple Linear Regression A multiple linear regression model is of the form: Y = α + β1 X 1 + β 2 X 2 + ... + β n X n + ε The regression coefficient β j expresses the marginal effect of the variable X j on the target, conditioned on the current value of the remaining predictive variables. Scale of the values influences the value of the corresponding regression coefficient and thus it might be useful to standardize the predictive variables. There are different techniques to standardize or normalize data: 3 E.g., decimal scaling, min-max normalization and z-index normalization. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Multiple Linear Regression (cont.) MLR scenarios: Product Sales and Advertisement Channels: Relationship between product sales and investments made in advertisement through several media communication channels. The regression coefficients indicate the relative advantage afforded by the different channels. Colleges and Universities Graduation Rate: Predict the percentage of students who eventually graduate (GraduationPercent) using a set of 4 explanatory variables: 4 MedianSAT – Median SAT score of students. AcceptanceRate – Acceptance rate of applicants. ExpendituresPerStudent – Education budget per student. Top10PercentHS – % of students in the top 10% of their high school. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Transformation Data transformation is a general approach to improving the accuracy of data analytics models such as multiple linear regression. Preventive standardization or normalization of data: 5 Expressing a variable in smaller units will lead to a larger value range for that variable. Tendency to give such a variable greater effect or “weight”. To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0]. May be done in three ways. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Transformation (cont.) Decimal scaling: ' ij Based on the transformation: x = where h is a given parameter to shift the decimal point by h positions toward the left. h is fixed at a value that gives transformed values in the range [-1,1]. Example: 10 h Recorded values of variable A range from -986 to 917. Maximum absolute value of A is 986. Divide each value by 1000 (i.e., h=3): 6 xij -986 normalizes to -0.986 917 normalizes to 0.917. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Transformation (cont.) 
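Before the remaining normalization techniques, here is a minimal sketch of the decimal-scaling example above, with h chosen so that the transformed values fall in [-1, 1]; the helper function name is ours, not taken from the lecture scripts.

```python
import numpy as np

def decimal_scale(values):
    """Decimal scaling: divide by 10**h, where h is the smallest integer
    such that the largest absolute transformed value lies within [-1, 1]."""
    values = np.asarray(values, dtype=float)
    max_abs = np.abs(values).max()
    h = int(np.ceil(np.log10(max_abs))) if max_abs > 0 else 0
    return values / (10 ** h), h

scaled, h = decimal_scale([-986, 917])
print(h)       # 3
print(scaled)  # [-0.986  0.917]
```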
Min-max normalization: Based on the following transformation: $x'_{ij} = \dfrac{x_{ij} - x_{\min,j}}{x_{\max,j} - x_{\min,j}} \left(x'_{\max,j} - x'_{\min,j}\right) + x'_{\min,j}$ where $x_{\min,j} = \min_i x_{ij}$ and $x_{\max,j} = \max_i x_{ij}$ are the minimum and maximum values of the attribute $a_j$ before transformation, while $x'_{\min,j}$ and $x'_{\max,j}$ are the minimum and maximum values that we wish to obtain after transformation. 7 In general, we use either [-1,1] or [0,1]. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Transformation (cont.) Example: Suppose that the minimum and maximum values for the variable income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. A value of $73,600 for income is transformed to: $\dfrac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000} (1.0 - 0.0) + 0.0 = 0.716$ 8 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Transformation (cont.) z-index normalization: Based on the transformation: $x'_{ij} = \dfrac{x_{ij} - \mu_j}{\sigma_j}$ where $\mu_j$ and $\sigma_j$ are the sample mean and sample standard deviation of attribute $a_j$. If the distribution of the values is approximately normal, the transformed values will fall within the range (-3, 3). Useful if the actual minimum and maximum values are unknown, or there are outliers dominating the min-max normalization. Example: Suppose the mean and standard deviation of the values for the variable income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to: $\dfrac{73{,}600 - 54{,}000}{16{,}000} = 1.225$ 9 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Transformation (cont.) Standardization techniques replace the original values of a variable with transformed values: This will affect the way the multiple linear regression model is interpreted. In Python, the Scikit Learn library provides support for data transformation via the sklearn.preprocessing package: 10 For example, z-index normalization can be performed using the preprocessing.scale() function. See the sample script src01 for an example. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Multiple Linear Regression Predict colleges and universities graduation rate: 11 The source of the dataset is attributed to Evans (2013). Predicting the percentage of students accepted into a college program who would eventually graduate: The dataset contains 49 observations. There are five possible independent variables altogether – Type, MedianSAT, AcceptanceRate, ExpendituresPerStudent, Top10PercentHS. For now, we will only use the latter four numerical variables, i.e., excluding Type. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Multiple Linear Regression (cont.) src02 12 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Multiple Linear Regression (cont.) • Scikit-Learn’s output • Top: Correlation matrix. • Bottom: Regression model 13 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Multiple Linear Regression (cont.) StatsModels’ output 14 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Multiple Linear Regression (cont.) The regression equation is: GraduationPercent = 17.9210 + 0.0720×MedianSAT – 24.8592×AcceptanceRate – 0.0001×ExpendituresPerStudent – 0.1628×Top10PercentHS Interpreting the regression coefficients: 15 Higher median SAT scores and lower acceptance rates suggest higher graduation rates. A 1-unit increase in median SAT score increases the graduation rate by 0.0720 unit, all other things being equal.
1 unit increase in acceptance rates decrease graduation rates by 24.8592 unit, all other things being equal. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Multiple Linear Regression (cont.) Overall: 16 Do the regression coefficients make sense? R2 is only 0.53. R2 and Adjusted R2 are quite similar. RMSE is 5.03 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Evaluating the Assumptions of Linear Regression Residuals analysis: 17 The scatter plot of residuals against predicted values and the Q-Q plot of the residuals shows that the residuals are normal and independent. See the sample script src03. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Evaluating the Assumptions of Linear Regression (cont.) Multicollinearity: Occurs when significant linear correlation exists between two or more predictive variables. Potential problems: 18 Regression coefficients are inaccurate. Compromises overall significance of the model. Possible that the coefficient of determination is close to 1 while the regression coefficients are not significantly different from 0. Pairwise linear correlation coefficients may be calculated. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Evaluating the Assumptions of Linear Regression (cont.) 19 We can observe mild correlation among the four independent variables. Consequently, you would see that even though 𝑅𝑅2 is moderately high, two of the independent variables, i.e., ExpenditurePerStudent and Top10PercentHS, are just below the 0.05 (or 95%) significance threshold. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Evaluating the Assumptions of Linear Regression (cont.) To identify multiple linear relationships or multicollinearity among predictive variables: 20 Calculate the variance inflation factor (VIF) for each predictor X j as: 1 VIF j = 1 − R 2j 2 where R j is the coefficient of determination for the model that explains X j , treated as a response, through the remaining independent variables. VIF j > 5 indicates multicollinearity. VIF can be calculated in StatsModels – See the sample script src04. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Selection of Predictive Variables In multiple linear regression model: We typically select a subset of all predictive variables that are most effective in explaining the response variable. Relevance analysis identifies variables that do not contribute to the prediction process. This is known as feature selection. Rationales: 21 Model based on all predictive variables may not be significant due to multicollinearity. Model with too many variables tend to overfit the training data samples and they may not predict new/unseen data with good accuracy. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Selection of Predictive Variables (cont.) General methods for variables selection: Forward Selection – Starts with no variables in the model and adds variables using the F statistics. Backward Elimination – Starts with all variables in the model and deletes variables using the F statistics. Stepwise: 22 Similar to forward selection except the variables in the model do not necessarily stay there. A hybrid of forward selection and backward elimination. A variable that was initially added could be deleted if the addition of a new variable negatively affects its level of significance. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Selection of Predictive Variables (cont.) 
However, variables selection suffers from some potential problems: May introduce bias in the selection of variables. May lack theoretical grounding on the final set of selected variables. May require significant computing power, and thus time, if the dataset is large. Variables selection with Scikit Learn: 23 Does NOT support variables selection in its linear regression routine. Supports a standalone routine to calculate the F-score and pvalues corresponding to each of the regressors (sklearn.feature_selection.f_regression). CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Selection of Predictive Variables (cont.) Supports a standalone recursive feature elimination (sklearn.feature_selection.RFE). src05 24 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Selection of Predictive Variables (cont.) • Based on the F-scores, we would select MedianSAT and AcceptanceRate. • Based on the recursive elimination ranking, we would select AcceptanceRate and Top10PercentHS 25 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Variables Selection Predict the newsprint consumption in US cities: 39 observations altogether. Six independent variables: X1: Number of newspaper in the city. X2: Proportion of the city population under the age of 18. X3: Median school year completed (city resident). X4: Proportion of city population employed in white collar occupation. X5: Logarithm of the number of families in the city. X6: Logarithm of total retail sales. See the sample script src06 (similar to src05). 26 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Variables Selection (cont.) • Based on the F-scores, we would select X5 and X6. (Note that the F-scores of X1 to X4 are not significant at the 0.95 level.) • Based on the recursive elimination ranking, we would also select X5 and X6. 27 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Treatment of Categorical Predictive Variables Categorical variables may be included as predicators in a regression model using dummy variables. A nominal categorical variable X j with H distinct values denote by V = {v1 , v2 ,..., vH } may be represented in two ways: Using arbitrary numerical values: 28 Regression coefficients will be affected by the chosen scale. Compromises significance of the model. Using H − 1 binary variables D j1 , D j 2 ,..., D j , H −1 , called dummy variables Each binary variable D jh is associated with level vh of X j . D jh takes the value of 1 if xij = vh . CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Encoding Examples with Multiple Linear Regression Nominal categorical data: 29 Housing estate – Multinomial variable with 5 different possible values represented using 4 binomial variables I1 to I4: Housing Estate I1 l2 l3 l4 Ang Mo Kio 1 0 0 0 Bishan 0 1 0 0 Clementi 0 0 1 0 Dover 0 0 0 1 East Coast 0 0 0 0 Predict price of HDB flats (y) based on the size (x1), floor level (x2) and distance to MRT station (x3). Model: y = β 0 + β1 x1 + β 2 x2 + β 3 x3 + β 4 I1 + β 5 I 2 + β 6 I 3 + β 7 I 4 In East Coast, the model is actually: y = β 0 + β1 x1 + β 2 x2 + β 3 x3 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Encoding Examples with Multiple Linear Regression (cont.) 
30 In Dover, the model is: y = β 0 + β1 x1 + β 2 x2 + β 3 x3 + β 7 Suppose β 7 = −12000 : The price of a HDB flat in Dover is expected to be $12000 less than another HDB flat in East Coast if both HDB flats have the same size, are on the same level and have the same distance to the nearest MRT station. Interpretation of the dummy variables always with respect to East Coast. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Encoding Examples with Multiple Linear Regression (cont.) Ordinal categorical data: 31 Education level – Multinomial variable with 4 different possible values represented using 3 binomial variables I1 to I3: Education Level I1 l2 l3 Elementary School 0 0 0 High School 0 0 1 College 0 1 1 Graduate School 1 1 1 Predict starting salary (y) based on education level. Model: y = β 0 + β1 I1 + β 2 I 2 + β 3 I 3 Suppose β1 = β 2 = β 3 = 0 : No difference in starting salary. CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Data Encoding Examples with Multiple Linear Regression (cont.) Suppose β1 = 0, β 2 > 0, β 3 = 0 : 32 Model: y = β 0 + β 2 I 2 When education level is high school or lower, y = β 0 When education level is college or higher, y = β 0 + β 2 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Categorical Predictive Variables Predict colleges and universities graduation rate: 33 We will now include Type as one of the independent variables. Type is a nominal categorical variable with two values. Values of the variable Type are Lib Arts, University , i.e., 𝐻𝐻 = 2 So we use one dummy variable University with a numeric 1 representing University and a numeric 0 representing Lib Arts. A University has a 1.43632 unit higher graduation rate compared to Lib Arts, all other things being equal. But the regression coefficient of University is not significant in this case (𝑝𝑝 = 0.524). CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Categorical Predictive Variables (cont.) Dummy coding of “University” from line 11 to 15 src07 34 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Example of Categorical Predictive Variables (cont.) 35 CG DADL (June 2023) Lecture 4 – Multiple Linear Regression Lecture 5 Introduction to Classification Corporate Gurukul – Data Analytics using Deep Learning June 2023 Lecturer: A/P TAN Wee Kek Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35 Learning Objectives At the end of this lecture, you should understand: Overview of Machine Learning Limitations of linear regression models. Definitions of classification problem and classification models. Evaluation of classification models. Usefulness of classification and clustering. CG DADL (June 2023) Lecture 5 – Introduction to Classification Overview of Classification CG DADL (June 2023) Lecture 5 – Introduction to Classification Limitation of Linear Regression Models Regression analysis is useful but suffers from an important limitation. In linear regression models, the numerical dependent variable must be continuous: The dependent variable can take on any value, or at least close to continuous. In some data analytics scenarios, the dependent variable may not be continuous. In other scenarios, it may be unnecessary to make a point prediction. It is possible to convert a regression problem into a classification problem. CG DADL (June 2023) Lecture 5 – Introduction to Classification Limitations of Linear Regression Models (cont.) 
Linear regression requires a linear relationships between the dependent and independent variables: The assumption that there is a straight-line relationship between them does not always hold. Linear regression models only look at the mean of the dependent variable: E.g., in the relationship between the birth weight of infants and maternal characteristics such as age: Linear regression will look at the average weight of babies born to mothers of different ages. But sometimes we need to look at the extremes of the dependent variable, e.g., babies are at risk when their weights are low. CG DADL (June 2023) Lecture 5 – Introduction to Classification Limitations of Linear Regression Models (cont.) Linear regression is sensitive to outliers: Outliers can have huge effects on the regression. Data must be independent: Linear regression assumes that the data are independent: I.e., the scores of one subject (such as a person) have nothing to do with those of another. This assumption does not always make sense: E.g., students in the same class tend to be similar in many ways such as coming from the same neighborhoods, taught by the same teachers, etc. In the above example, the students are not independent. CG DADL (June 2023) Lecture 5 – Introduction to Classification Parametric versus Non-parametric Linear regression is parametric: Parametric ML algorithms: Assumes that sample data comes from a population that can be adequately modelled by a probability distribution that has a fixed set of parameters. Assumptions can greatly simplify the learning process, but can also limit what can be learned. Algorithms that simplify the function to a known form. Non-parametric ML algorithms: Algorithms that do not make strong assumptions about the form of the mapping function. Free to learn any functional form from the training data. CG DADL (June 2023) Lecture 5 – Introduction to Classification Parametric versus Non-parametric (cont.) Non-parametric ML methods are good when: You have a lot of data and no prior knowledge. You do not want to worry too much about choosing just the right features. Classification algorithms include both parametric and non-parametric: Parametric – Logistic Regression, Linear Discriminant Analysis, Perceptron, Naive Bayes, Simple Neural Networks Non-parametric – k-Nearest Neighbors, Decision Trees, Support Vector Machines CG DADL (June 2023) Lecture 5 – Introduction to Classification Data Mining Goes to Hollywood Data mining scenario – Predicting the box-office receipt (i.e., financial success) of a particular movie. Problem: Traditional approach: Frames it as a forecasting (or regression) problem. Attempts to predict the point estimate of a movie’s box-office receipt. Sharda and Delen’s (2006) approach: Convert the regression problem into a multinomial classification problem. Classify a movie based on its box-office receipts into one of nine categories, ranging from “flop” to “blockbuster”. Use variables representing different characteristics of a movie to train various classification models. CG DADL (June 2023) Lecture 5 – Introduction to Classification Overview of Classification Classification models: Aim of classification models: Supervised learning methods for predicting value of a categorical target variable. In contrast, regression models deal with numerical (or continuous) target variable. Generate a set of rules from past observations with known target class. Rules are used to predict the target class of future observations. 
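A minimal sketch of this supervised workflow: learn classification rules from past observations with known target classes, then predict the class of new observations. The k-nearest-neighbours classifier and the iris data used here are only stand-ins for illustration, not the lecture's own example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: observations with known classes play the role of past
# observations; the held-out portion stands in for new/unseen observations.
X, y = load_iris(return_X_y=True)
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # training phase
print(clf.predict(X_new[:5]))   # prediction phase: assign classes to new observations
print(clf.score(X_new, y_new))  # test phase: accuracy on observations with known classes
```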
Classification holds a prominent position in learning theory. CG DADL (June 2023) Lecture 5 – Introduction to Classification Overview of Classification (cont.) From a theoretical viewpoint: Classification algorithm development represents a fundamental step in emulating inductive capabilities of the human brain. From a practical viewpoint: Classification is applicable in many different domains. Examples: Selection of target customers for a marketing campaign. Fraud detection. Image recognition. Early diagnosis of disease. Text cataloguing. Spam email recognition. CG DADL (June 2023) Lecture 5 – Introduction to Classification Classification Problems We have a dataset D containing m observations described in terms of n explanatory variables and a categorical target variable (a class or label). The observations are also termed as examples or instances. The target variable takes a finite number of values: Binary classification – The instances belong to two classes only. Multiclass or multicategory classification – There are more than two classes in the dataset. CG DADL (June 2023) Lecture 5 – Introduction to Classification Classification Problems (cont.) A classification problem consists of defining an appropriate hypothesis space F and an algorithm AF that identifies a function f * ∈ F that can optimally describe the relationship between the predictive variables and the target class. n F is a class of functions f ( x ) : R ⇒ H called hypotheses that represent hypothetical relationship of dependence between yi and xi . R n is the vector of values taken by the predictive variables for an instance. H could be {0,1} or {− 1,1} for a binary classification problem. CG DADL (June 2023) Lecture 5 – Introduction to Classification Components of a Classification Problem Generator – Extract random vectors Χ of data instances. Supervisor – For each vector Χ , return the value of the target class. Classification algorithm (or classifier) – Choose a function f * ∈ F in the hypothesis space so as to minimize a suitably defined loss function. Generator x y Supervisor Classification Algorithm f(x) CG DADL (June 2023) Lecture 5 – Introduction to Classification Development of a Classification Model Development of a classification model consists of three main phases. Training phase: The classification algorithm is applied to the instances belonging to a subset T of the dataset D . T is called the training data set. Classification rules are derived to allow users to predict a class to each observation Χ. Test phase: The rules generated in the training phase are used to classify observations in D but not in T . Accuracy is checked by comparing the actual target class with the predicted class for all instances in V = D − T . CG DADL (June 2023) Lecture 5 – Introduction to Classification Development of a Classification Model (cont.) Observations in V form the test set. The training and test sets are disjoint: V ∩T = ∅ . Prediction phase: The actual use of the classification model to assign target class to completely new observations. This is done by applying the rules generated during the training phase to the variables of the new instances. CG DADL (June 2023) Lecture 5 – Introduction to Classification Development of a Classification Model (cont.) 
Training Training Tuning Training Data Test Data New Data Test Prediction Rules Accuracy Assessment Knowledge CG DADL (June 2023) Lecture 5 – Introduction to Classification Taxonomy of Classification Models Heuristic models: Classification is achieved by applying simple and intuitive algorithms. Examples: Classification trees – Apply divide-and-conquer technique to obtain groups of observations that are as homogenous as possible with respect to the target variable. Also known as decision trees. Nearest neighbor methods – Based on the concept of distance between observations. Separation models: Divide the variable space into H distinct regions. All observations in a region are assigned the same class. CG DADL (June 2023) Lecture 5 – Introduction to Classification Taxonomy of Classification Models (cont.) How to determine these regions? Neither too complex or many, nor too simple or few. Define a loss function to take into account the misclassified observations and apply an optimization algorithm to derive a subdivision into regions that minimizes the total loss. Examples – Discriminant analysis, perceptron methods, neural networks (multi-layer perceptron) and support vector machines (SVM). Regression model: Logistic regression is an extension of linear regression suited to handling binary classification problems. Main idea – Convert binary classification problem via a proper transformation into a linear regression problem. CG DADL (June 2023) Lecture 5 – Introduction to Classification Taxonomy of Classification Models (cont.) Probabilistic models: A hypothesis is formulated regarding the functional form of the conditional probabilities PΧ| y (Χ | y ) of the observations given the target class. This is known as class-conditional probabilities. Based on an estimate of the prior probabilities Py ( y ) and using Bayes’ theorem, calculate the posterior probabilities Py|Χ ( y | Χ ) of the target class. Examples – Naive Bayes classifiers and Bayesian networks. CG DADL (June 2023) Lecture 5 – Introduction to Classification Evaluation of Classification Models In a classification analysis: It is advisable to develop alternative classification models. The model that affords the best prediction accuracy is then selected. To obtain alternative models: Different classification methods may be used. The values of the parameters may also be modified. Accuracy: The proportion of the observations that are correctly classified by the model. Usually, one is more interested in the accuracy of the model on the test data set V . CG DADL (June 2023) Lecture 5 – Introduction to Classification Evaluation of Classification Models (cont.) If yi denotes the class of the generic observation Χ i ∈ V and f (Χ i ) the class predicted through the function f ∈ F identified by the learning algorithm A = AF , the following loss function can be defined: 0, if yi = f (Χ i ) L( yi , f (Χ i )) = 1, if yi ≠ f (Χ i ) The accuracy of model A can be evaluated as: 1 v acc A (V ) = acc AF (V ) = 1 − ∑ L( yi , f (Χ i )) v i =1 where v is the number of observations. The proportion of errors made is defined as: 1 v errA (V ) = errAF (V ) = 1 − acc AF (V ) = ∑ L( yi , f (Χ i )) v i =1 CG DADL (June 2023) Lecture 5 – Introduction to Classification Evaluation of Classification Models (cont.) Speed: Robustness: The method is robust if the classification rules generated and the corresponding accuracy do not vary significantly as the choice of training data and test datasets varies. It must also be able to handle missing data and outliers well. 
Scalability: Long computation time on large datasets can be reduced by means of random sampling scheme. Able to learn from large datasets. Interpretability: Generated rules should be simple and easily understood by knowledge workers and domain experts. CG DADL (June 2023) Lecture 5 – Introduction to Classification Holdout Method Divide the available m observations in the dataset D into training dataset T and test dataset V . The t observations in T is usually obtained by random selection. The number of observations in T is suggested to be between one half and two thirds of the total number of observations in D . The accuracy of the classification algorithm via the holdout method depends on the test set V . In order to better estimate accuracy, different strategies have been recommended. CG DADL (June 2023) Lecture 5 – Introduction to Classification Repeated Random Sampling Simply replicate the holdout method r times. For each repetition k = 1,2,..., r : A random training dataset Tk having t observations is generated. Compute acc AF (Vk ) , the accuracy of the classifier on the corresponding test set Vk , where Vk = D − Tk . Compute the average accuracy as: acc A = acc AF 1 r = ∑ acc AF (Vk ) r k =1 Drawback – No control over the number of times each observation may appear, outliers may cause undesired effects on the rules generated and the accuracy. CG DADL (June 2023) Lecture 5 – Introduction to Classification Cross-validation Divide the data into r disjoint subsets, L1 , L2 ,..., Lr of (almost) equal size. For iterations k = 1,2,..., r : Let the test set be Vk = Lk And the training set be Tk = D − Lk Compute acc AF (Vk ) Compute the average accuracy: acc A = acc AF 1 r = ∑ acc AF (Vk ) r k =1 Usual value for r is r = 10 (i.e., ten-fold cross-validation). CG DADL (June 2023) Lecture 5 – Introduction to Classification Cross-validation (cont.) Also known as k-fold cross-validation or rotation estimation. L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 Illustration of ten-fold cross-validation CG DADL (June 2023) Lecture 5 – Introduction to Classification Leave-One-Out Cross-validation method with the number of iterations r being set to m. This means each of the m test sets consists only of 1 sample and the corresponding training data set consists of m − 1 samples. Intuitively, every observation is used for testing once on as many models developed as there are number of data points. Time consuming methodology but a viable option for small dataset. CG DADL (June 2023) Lecture 5 – Introduction to Classification Stratified Random Sampling Instead of random sampling to partition the dataset 𝐷𝐷 into training set 𝑇𝑇 and test set 𝑉𝑉, stratified random sampling could be used to ensure the same proportion of observations belonging to each target class is the same in both 𝑇𝑇 and test set 𝑉𝑉. In cross-validation, each subset 𝐿𝐿𝑘𝑘 should also contain the same proportion of L L L L L L L L L L observations belonging to L L L L L L L L L L each target class. In this example: • Purple/Blue – Class 0 • Red – Class 1 1 2 3 4 5 6 7 8 9 1 0 1 2 3 4 5 6 7 8 9 1 0 L1 L2 L3 L4 L5 L6 L7 L8 L9 L1 L1 L2 L3 L4 L5 L6 L7 L8 L9 L1 0 CG DADL (June 2023) Lecture 5 – Introduction to Classification 0 Confusion Matrices In many situations, just computing the accuracy of the classifier may not be sufficient: Example 1 – Medical Domain: The value of 1 means the patient has a given medical condition and -1 means the patient does not. 
If only 2% of all patients in the database have the condition, then we achieve an accuracy rate of 98% with the trivial rule that “the patient does not have the condition”. Example 2 – Customer Retention: The value of 1 means the customer has cancelled the service, 0 means the customer is still active. If only 2% of the available data correspond to customers who have cancelled the service, the trivial rule “the customer is still active” has an accuracy rate of 98%. CG DADL (June 2023) Lecture 5 – Introduction to Classification Confusion Matrices (cont.) Confusion matrix for a binary target variable encoded with the class values {−1, +1}:
Instances \ Predictions | −1 (Negative) | +1 (Positive) | Total
−1 (Negative) | p | q | p + q
+1 (Positive) | u | v | u + v
Total | p + u | q + v | m
Accuracy – Among all instances, what is the proportion that are correctly predicted? $acc = \dfrac{p+v}{m} = \dfrac{p+v}{p+q+u+v}$ CG DADL (June 2023) Lecture 5 – Introduction to Classification Confusion Matrices (cont.) True negative rate – Among all negative instances, the proportion of correct predictions: $tn = \dfrac{p}{p+q}$ False negative rate – Among all positive instances, the proportion of incorrect predictions: $fn = \dfrac{u}{u+v}$ False positive rate – Among all negative instances, the proportion of incorrect predictions: $fp = \dfrac{q}{p+q}$ True positive rate – Among all positive instances, the proportion of correct predictions (also known as recall): $tp = \dfrac{v}{u+v}$ CG DADL (June 2023) Lecture 5 – Introduction to Classification Confusion Matrices (cont.) Precision – Among all positive predictions, the proportion of actual positive instances: $prc = \dfrac{v}{q+v}$ The geometric mean is defined as: $gm = \sqrt{tp \times tn}$ and sometimes also as: $gm = \sqrt{tp \times prc}$ The F-measure is defined as: $F = \dfrac{(\beta^2 + 1)\, tp \times prc}{\beta^2\, prc + tp}$ where $\beta \in [0, \infty)$ regulates the relative importance of the precision with respect to the true positive rate. The F-measure is also equal to 0 if all the predictions are incorrect. CG DADL (June 2023) Lecture 5 – Introduction to Classification ROC Curve Charts Receiver operating characteristic (ROC) curve charts: Allow the user to visually evaluate the accuracy of a classifier and to compare different classification models. Visually express the information content of a sequence of confusion matrices. Allow the trade-off between the number of correctly classified positive observations (true positive rate on the y-axis) and the number of incorrectly classified negative observations (false positive rate on the x-axis) to be assessed. CG DADL (June 2023) Lecture 5 – Introduction to Classification ROC Curve Charts (cont.) [Figure: ROC curve chart plotting the true positive rate against the false positive rate.] CG DADL (June 2023) Lecture 5 – Introduction to Classification ROC Curve Charts (cont.) A ROC curve chart is a two-dimensional plot: fp on the horizontal axis and tp on the vertical axis. The point (0,1) represents the ideal classifier. The point (0,0) corresponds to a classifier that predicts class {−1} for all samples. The point (1,1) corresponds to a classifier that predicts class {+1} for all samples. Parameters in a classifier may be adjusted so that tp increases, but at the cost of also increasing fp. A classifier with no parameters to be (further) tuned yields only one point on the chart. The area beneath the ROC curve provides a means to compare the accuracy of various classifiers. The ROC curve with the greatest area is preferable.
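A hedged sketch, using scikit-learn on a synthetic imbalanced dataset, of how the confusion-matrix quantities p, q, u, v and the area under the ROC curve can be computed. The 0/1 labels play the role of the −1/+1 classes above, and the logistic-regression classifier is only a convenient stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with roughly 10% positives, echoing the
# class-imbalance examples above.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
pred = (scores >= 0.5).astype(int)

# Counts in the same layout as the slide's confusion matrix:
# rows are true classes, columns are predicted classes.
(p, q), (u, v) = confusion_matrix(y_te, pred)
acc = (p + v) / (p + q + u + v)
tp, tn = v / (u + v), p / (p + q)     # recall and true negative rate
prc = v / (q + v)                     # precision
print(acc, tp, tn, prc)

# Area under the ROC curve; a larger area is preferable.
print(roc_auc_score(y_te, scores))
```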
CG DADL (June 2023) Lecture 5 – Introduction to Classification Lecture 6 Decision Tree Corporate Gurukul – Data Analytics using Deep Learning June 2023 Lecturer: A/P TAN Wee Kek Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35 Learning Objectives At the end of this lecture, you should understand: 1 Overview of decision tree or classification tree. Components of a decision tree. Splitting rules and criteria. Using decision tree in Python with Scikit Learn CG DADL (June 2023) Lecture 6 – Decision Tree Decision Trees The best known and most widely used learning methods in data mining applications. Reasons for its popularity include: 2 Conceptual simplicity. Ease of usage. Computational speed. Robustness with respect to missing data and outliers. Interpretability of the generated rules. CG DADL (June 2023) Lecture 6 – Decision Tree Decision Trees (cont.) The development of a decision tree involves recursive, heuristic, top-down induction: 1. 2. 3. Initialization phase – All observations are placed in the root of the tree. The root is placed in the active node list L . If the list L is empty, stop the procedure. Otherwise, node J ∈ L is selected, removed from the list and used as the node for analysis. The optimal rule to split the observations in J is then determined, based on an appropriate preset criterion: 3 If J does not need to be split, node J becomes a leaf, target class is assigned according to majority class of observations. Otherwise, split node J , its children are added to the list. Go to Step 2. CG DADL (June 2023) Lecture 6 – Decision Tree Components of Decision Trees Components of the top-down induction of decision trees: Splitting rules – Optimal way to split a node (i.e., assigning observations to child nodes) and for creating child nodes. Stopping criteria – If the node should be split or not. If not, this node becomes a leaf of the tree. Pruning criteria – Avoid excessive growth of the tree (prepruning) during tree generation phase, and reduce the number of nodes after the tree has been generated (post-pruning). Exam ≥ 80? A simple decision tree: • Left branch for “No” • Right branch for “Yes” 4 Assignment ≥ 80%? C B Assignment ≥ 50%? B A CG DADL (June 2023) Lecture 6 – Decision Tree Splitting Rules Two splitting rules based on variable value selection: Binary split – Each node has at most two branches: Multi-split classification trees – Each node has an arbitrary number of branches: 5 Example – Customers who agree to a mailing campaign are placed on the right child node, those who do not agree on the left child node. Example – Customers residing in areas {1,2} are on the right child node, {3,4} on the left. Example – Customers who are 45 years old or younger are on the right, others on the left. It is easier to handle multi-valued categorical variables. For numerical variables, it is necessary to group together adjacent values. This can be achieved by discretization. CG DADL (June 2023) Lecture 6 – Decision Tree Splitting Rules (cont.) Empirical evidence suggests no significant difference in performance of classification trees with regards to the number of children nodes. Two splitting rules based on the number of variables selected: Univariate: Based on the value of a single explanatory variable X j . 
Examples: Authorized communication No Xj ≤ 45 Yes Univariate split for a binary variable 6 Area of residence Age Young Xj > 45 Old Univariate split for a numerical variable Xj = 1 North Xj = 2 Center Xj = 3 South Xj =4 Island Univariate split for a nominal variable CG DADL (June 2023) Lecture 6 – Decision Tree Splitting Rules (cont.) Univariate trees are also called axis-parallel trees. Example: X1 ≤ 5 X2 X2 ≤ 2 X2 > 2 X1 > 5 X2≤0 X2 X 2 >0 X2 X2≤ 4.5 7 X2 = 4.5 X2 = 2 X2 > 4.5 X2 = 0 X1 = 5 X1 CG DADL (June 2023) Lecture 6 – Decision Tree Splitting Rules (cont.) Multivariate trees – Also called oblique decision trees: Observations are separated based on the expression: n ∑w x j =1 j j ≤b where the threshold value b and the coefficients w1 , w2 ,..., wn of the linear combination have to be determined, for example by solving an optimization problem for each node. X2 X1 8 CG DADL (June 2023) Lecture 6 – Decision Tree Univariate Splitting Criteria Let ph be the proportion of instances of target class vh , h ∈ H , at a given node q and let Q be the total number of instances at q . We have H ∑p h =1 h =1 The heterogeneity index I (q ) is a function of the relative frequencies ph , h ∈ H of the target class values for the instances at the node. The index must satisfy the following 3 criteria: 9 It must be maximum when the instances in the node are distributed homogenously among all the classes. CG DADL (June 2023) Lecture 6 – Decision Tree Univariate Splitting Criteria (cont.) It must be minimum when all the instances at the node belong to only one class. It must be a symmetric function w.r.t. the relative frequencies ph , h ∈ H . The heterogeneity indices of a node q that satisfy the three impurity/inhomogeneity criteria: Misclassification index: Miscl (q ) = 1− max ph h Entropy index: Gini index: H Entropy (q ) = −∑ ph log 2 ph h =1 H Gini (q ) = 1 − ∑ ph2 h =1 10 CG DADL (June 2023) Lecture 6 – Decision Tree Univariate Splitting Criteria (cont.) In a binary classification problem, the three impurity measures defined preceding reach: 11 Their maximum value when p1 = p2 = 1 − p1 = 0.5 . Their minimum value, i.e., 0, when p1 = 0 or p1 = 1 . CG DADL (June 2023) Lecture 6 – Decision Tree Univariate Splitting Criteria (cont.) Selection of variable for splitting: Node q Node q X1=0 Child node q1 12 X1=1 Child node q2 X2=0 Child node q3 X2=1 Child node q4 Suppose node q needs to be split – Is it better to split according to the values of variable X 1 or variable X 2 ? Or any other variable? Choose a split that minimizes the impurity of the child nodes q1 , q2 ,..., qk : K Q I (q1 , q2 ,...qk ) = ∑ k I (qk ) k =1 Q CG DADL (June 2023) Lecture 6 – Decision Tree Univariate Splitting Criteria (cont.) where: k = the number of child nodes Q = the number of observations in node q Qk = the number of observations in child node qk : Q1 + Q2 + Q3 + ... + Qk = Q I (qk ) = the impurity of the observations in child node q . k The information gain is defined as the reduction in the impurity after splitting: ∆(q1 , q2 ,..., qk ) = I (q ) − I (q1 , q2 ,..., qk ) The best split is the one with the largest information gain. This is clearly equivalent to the split that results in the smallest impurity of the partition I (q1 , q2 ,...qk ) . 13 CG DADL (June 2023) Lecture 6 – Decision Tree Example of a Decision Tree Given the dataset: Observation # Income Credit Rating Loan Risk 0 23 High High 1 17 Low High 2 43 Low High 3 68 High Low 4 32 Moderate Low 5 20 High High The task is to predict Loan-Risk. 
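A minimal sketch computing the three heterogeneity indices at the root node of this Loan-Risk dataset (4 High and 2 Low observations); the Gini and entropy values agree with the worked example on the following slides.

```python
import numpy as np

def misclassification(p):
    # Misclassification index: 1 minus the largest class proportion.
    return 1 - max(p)

def entropy(p):
    # Entropy index: -sum(p_h * log2(p_h)) over non-empty classes.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gini(p):
    # Gini index: 1 - sum(p_h^2).
    return 1 - sum(ph ** 2 for ph in p)

# Root node of the Loan-Risk example: 4 "High" and 2 "Low" observations.
p = [4 / 6, 2 / 6]
print(misclassification(p))  # 0.3333...
print(entropy(p))            # ~0.9183
print(gini(p))               # ~0.4444 (= 4/9)
```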
14 CG DADL (June 2023) Lecture 6 – Decision Tree Example of a Decision Tree (cont.) Given the data set D , we start building the tree by creating a root node. If this node is sufficiently “pure”, then we stop. If we do stop building the tree at this step, we use the majority class to classify/predict. In this example, we classify all patterns as having LoanRisk = “High”. Correctly classify 4 out of 6 input samples to achieve classification accuracy of: (4 / 6)×100% = 66.67% This node is split according to impurity measures: 15 Gini Index (used by CART) Entropy (used by ID3, C4.5, C5) Loan-Risk = High Acc = 66.67% CG DADL (June 2023) Lecture 6 – Decision Tree Using Gini Index CART (Classification and Regression Trees) uses the Gini index to measure the impurity of a dataset: Gini index for the observations in node q is: H Gini (q ) = 1 − ∑ ph2 h =1 16 where q is the node that contains Q examples from H classes ph is a relative frequency of class h in node q In our dataset, there are 2 classes High and Low, H = 2 . 2 1 4 2 pLow = = pHigh = = 4+2 3 4+2 3 2 2 1 1 4 Gini (q ) = 1 − × − × = = 0.4444 3 3 3 3 9 CG DADL (June 2023) Lecture 6 – Decision Tree Using Gini Index (cont.) Should Income be used as the variable to split the root node? Income is a variable with continuous values. Sort the data according to Income values: Observation # Income Credit Rating Loan Risk 1 17 Low High 5 20 High High Split 1 0 23 High High Split 2 4 32 Moderate Low Split 3 2 43 Low High 3 68 High Low 17 CG DADL (June 2023) Lecture 6 – Decision Tree Using Gini Index (cont.) We consider 3 possible splits when there are changes in the value of Loan-Risk. Case 1 – Split condition Income ≤ 23 versus Income > 23 Impurity after split: Loan Risk = High Acc = 66.67% Gini(q) = 4/9 Income ≤ 23 3 High Loan-Risks 0 Low Loan Risk Gini(q1) = 0 18 3 3 4 2 ( ) I G q1 , q2 = × 0 + × = = 0.2222 6 Income > 23 1 High-Loan Risk 2 Low Loan Risk Gini(q2) = 4/9 6 9 𝐼𝐼𝐺𝐺 𝑞𝑞1 9 𝐼𝐼𝐺𝐺 𝑞𝑞2 CG DADL (June 2023) Lecture 6 – Decision Tree Using Gini Index (cont.) Case 2 – Split condition Income ≤ 32 versus Income > 32: 4 3 2 1 5 = 0.41667 I G (q1 , q2 ) = × + × = 6 8 6 2 12 Case 3 – Split condition Income ≤ 43 versus Income > 43: 5 8 1 4 I G (q1 , q2 ) = × + × 0 = = 0.26667 6 25 6 15 Case 1 is the best. Loan Risk = High Acc = 66.67% Instead of splitting between Gini (q) = 4/9 Income ≤ 23 versus Income > 23, Income ≤ 27.5 Income > 27.5 the midpoint is selected as actual splitting point: (23 + 32)/2. 3 High Loan-Risks 0 Low Loan Risk Gini(q1) = 0 19 1 High-Loan Risk 2 Low Loan Risk Gini(q2) = 4/9 CG DADL (June 2023) Lecture 6 – Decision Tree Using Gini Index (cont.) Apply the tree generating method recursively to nodes that are still not “pure”. Loan Risk = High Acc = 66.67% Gini(q) = 4/9 Income ≤ 27.5 Loan Risk = High Income > 27.5 Loan Risk = ? Develop a subtree by examining the variable CreditRating. Credit-Rating is a discrete variable with ordinal values, i.e., they can be ordered in a meaningful sequence. 20 CG DADL (June 2023) Lecture 6 – Decision Tree Using Gini Index (cont.) Possible values are {Low, Moderate, High} . Check for best split: Case 1 – Low versus (Moderate or High) Case 2 – (Low or Moderate) versus High Compute the Gini index for splitting the node: Loan Risk = ? 21 CG DADL (June 2023) Lecture 6 – Decision Tree Using Gini Index (cont.) Case 1 – Split Credit-Rating = Low versus Credit-Rating = Moderate or High: Loan Risk = ? 
Credit-Rating = Low: 0 Low Loan-Risk, 1 High Loan-Risk, Gini(q1) = 0.
Credit-Rating = Moderate or High: 2 Low Loan-Risk, 0 High Loan-Risk, Gini(q2) = 0.
I_G(q1, q2) = (1/3) × 0 + (2/3) × 0 = 0

Using Gini Index (cont.)
Case 2 – Split Credit-Rating = Low or Moderate versus Credit-Rating = High:
Credit-Rating = Low or Moderate: 1 Low Loan-Risk, 1 High Loan-Risk, Gini(q1) = 1/2.
Credit-Rating = High: 0 Low Loan-Risk, 1 High Loan-Risk, Gini(q2) = 0.
I_G(q1, q2) = (2/3) × (1/2) + (1/3) × 0 = 1/3
The Case 2 split is not as good as the Case 1 split.

Using Gini Index (cont.)
Complete tree:
Root, Income ≤ 27.5: leaf with Loan-Risk = High.
Root, Income > 27.5: intermediate node, split on Credit-Rating:
Credit-Rating = Low: Loan-Risk = High.
Credit-Rating = Moderate or High: Loan-Risk = Low.

Observation # | Income | Credit Rating | Loan Risk | Predicted Loan Risk
0 | 23 | High | High | High
1 | 17 | Low | High | High
2 | 43 | Low | High | High
3 | 68 | High | Low | Low
4 | 32 | Moderate | Low | Low
5 | 20 | High | High | High

Using Gini Index (cont.)
The tree achieves 100% accuracy on the training data set. It may overfit the training data instances.
Trees may be simplified by pruning: removing nodes or branches to improve the accuracy on the test samples.
Tree growing could also be terminated when the number of instances in a node is less than a pre-specified number.
Notice that we have built a binary tree in which every non-leaf node has 2 branches.
For ordinal discrete variables with N values, check N − 1 possible splits.

Using Gini Index (cont.)
For nominal discrete variables with N values, check 2^(N−1) − 1 possible splits. For example, for {Red, Green, Blue}, check 2^(3−1) − 1 = 3 possible splits:
Red versus Green, Blue
Green versus Red, Blue
Blue versus Red, Green

Decision Tree in Scikit Learn
We can perform decision tree classification using Scikit Learn's tree.DecisionTreeClassifier.
However, this class cannot process categorical independent variables, so we need to recode CreditRating using one-hot encoding (the one-of-K scheme).
CreditRating has three levels – Low, Moderate and High – so we create three binary variables: CreditRatingLow, CreditRatingModerate and CreditRatingHigh. For each observation, exactly one of these three variables is set to 1.
This is a small dataset and so we won't do any validation.
See the sample script src01 and its output (an illustrative sketch is also shown below).

Classification Rule Generation
Trace each path from the root node to a leaf node to generate a rule:
If Income ≤ 27.5, then Loan-Risk = High
Else if Income > 27.5 and Credit-Rating = Low, then Loan-Risk = High
Else if Income > 27.5 and Credit-Rating = Moderate or High, then Loan-Risk = Low

Using Entropy Measure
ID3 and its successors (C4.5 and C5.0) are the most widely used decision tree algorithms, developed by Ross Quinlan at the University of Sydney.
ID3 tests a single best variable at each node of the tree: it selects the most useful variable for classifying the instances.
The "goodness" or "usefulness" of a variable is measured by its information gain, computed from entropy.
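The following is a minimal, hypothetical sketch of the Scikit Learn workflow described above (the course's actual src01 script is not reproduced here); the DataFrame column names are illustrative. CreditRating is one-hot encoded with pandas and a Gini-based tree is fitted to the six observations.

```python
import pandas as pd
from sklearn import tree

# Toy Loan-Risk dataset from the lecture (column names are illustrative).
df = pd.DataFrame({
    "Income":       [23, 17, 43, 68, 32, 20],
    "CreditRating": ["High", "Low", "Low", "High", "Moderate", "High"],
    "LoanRisk":     ["High", "High", "High", "Low", "Low", "High"],
})

# One-hot encode the nominal variable (one-of-K scheme).
X = pd.get_dummies(df[["Income", "CreditRating"]], columns=["CreditRating"])
y = df["LoanRisk"]

# Gini-based CART-style tree; small dataset, so no validation split.
clf = tree.DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

print(clf.score(X, y))   # training accuracy (1.0 on this tiny dataset)
print(tree.export_text(clf, feature_names=list(X.columns)))
```

On this data the fitted tree should reproduce the hand-derived splits (Income ≤ 27.5 at the root, then Credit-Rating), since the implementation also places numerical thresholds at midpoints between consecutive sorted values.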
In the Loan-Risk example:
4 observations have Loan-Risk = High, so p1 = 4/6.
2 observations have Loan-Risk = Low, so p2 = 2/6.
Entropy(q) = −(4/6) log2(4/6) − (2/6) log2(2/6) = −(2/3)(−0.58496) − (1/3)(−1.58496) = 0.91830

Using Entropy Measure (cont.)
Let's consider Split 1 again, Income ≤ 23 versus Income > 23.
Subset q1 contains patterns #1, #5 and #0, all with Loan-Risk = High, so Entropy(q1) = 0.

Obs # | Income | Credit Rating | Loan Risk
1 | 17 | Low | High
5 | 20 | High | High
0 | 23 | High | High   (Split 1 falls here)
4 | 32 | Moderate | Low
2 | 43 | Low | High
3 | 68 | High | Low

Using Entropy Measure (cont.)
Subset q2 contains patterns #4 and #3 with Loan-Risk = Low and pattern #2 with Loan-Risk = High:
Entropy(q2) = −(2/3) log2(2/3) − (1/3) log2(1/3) = −(2/3)(−0.58496) − (1/3)(−1.58496) = 0.91830
Entropy after splitting:
I_E(q1, q2) = (3/6) × 0 + (3/6) × 0.91830 = 0.45915

Using Entropy Measure (cont.)
Suppose instead we split the dataset using Credit-Rating into 3 subsets:
Subset q1 if Credit-Rating = Low: patterns #1 and #2, both with Loan-Risk = High.
Subset q2 if Credit-Rating = Moderate: pattern #4 with Loan-Risk = Low.
Subset q3 if Credit-Rating = High: patterns #5 and #0 with Loan-Risk = High and pattern #3 with Loan-Risk = Low.
Entropy after splitting:
I_E(q1, q2, q3) = (2/6) × 0 + (1/6) × 0 + (3/6) × 0.91830 = 0.45915
This is the same as before using Income.

Using Entropy Measure (cont.)
Let's use Credit-Rating as the first variable to split the root node:
Credit-Rating = Low: Loan-Risk = High.
Credit-Rating = Moderate: Loan-Risk = Low.
Credit-Rating = High: Loan-Risk = ?
Apply the algorithm again to reduce the entropy of the instances in q3.
Maximizing information gain – when selecting a variable for splitting, pick the one that reduces the entropy the most.

Using Entropy Measure (cont.)
Use Income to split q3 and obtain:
Credit-Rating = High and Income ≤ 45.5: Loan-Risk = High.
Credit-Rating = High and Income > 45.5: Loan-Risk = Low.
Note that this is not a binary tree.

The Iris Flower Dataset
This dataset contains data from three different species of iris (Iris setosa, Iris virginica and Iris versicolor).
The variables are the length and width of the sepals and the length and width of the petals.
The dataset is widely used for classification problems: 150 observations with 4 independent attributes and one target attribute.
See the sample script src02.

The Iris Flower Dataset (cont.)
We can obtain a parsimonious tree with reasonable accuracy by setting the maximum depth of the tree to 2 (see src02).

Lecture 7 Bayesian Classifier and Logistic Regression
Corporate Gurukul – Data Analytics using Deep Learning, June 2023
Lecturer: Dr. TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35

Learning Objectives
At the end of this lecture, you should understand:
Various advanced classification models.
Naïve Bayesian classifier.
Logistic regression.
Multinomial logistic regression. Using these advanced classification models in Python with Scikit Learn. CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Conditional Probability In probability theory, conditional probability is a measure of the probability of an event occurring given that another event has occurred. For an event of interest 𝐴𝐴 and another known event 𝐵𝐵 that is assumed to have occurred, the conditional probability of 𝐴𝐴 given 𝐵𝐵 is written as 𝑃𝑃 𝐴𝐴|𝐵𝐵 Example: 2 The probability that any given person has a cough on any given day may be only 5%, i.e., 𝑃𝑃 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0.05. But if we know or assume that the person has a cold, then they are much more likely to be coughing. The conditional probability of coughing given that you have a cold might be much higher at 75%, i.e., 𝑃𝑃 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶|𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0.75. CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Conditional Probability (cont.) Kolmogorov definition: 3 The conditional probability of 𝐴𝐴 given 𝐵𝐵 is defined as the quotient of the probability of the joint of events 𝐴𝐴 and 𝐵𝐵, and the probability of 𝐵𝐵: 𝑃𝑃 𝐴𝐴 ∩ 𝐵𝐵 𝑃𝑃 𝐴𝐴|𝐵𝐵 = 𝑃𝑃 𝐵𝐵 Where 𝑃𝑃 𝐴𝐴 ∩ 𝐵𝐵 is the probability that both events 𝐴𝐴 and 𝐵𝐵 occur. And assuming that the unconditional probability of 𝐵𝐵 is greater than zero, i.e., 𝑃𝑃 𝐵𝐵 > 0. Conditional probability may also be written as an axiom of probability: 𝑃𝑃 𝐴𝐴 ∩ 𝐵𝐵 = 𝑃𝑃 𝐴𝐴|𝐵𝐵 𝑃𝑃 𝐵𝐵 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Bayesian Methods Bayesian methods belong to the family of probabilistic classification models. Given the information about the values of explanatory variables Χ , what is the probability that the instance belongs to class y ? We need to calculate the posterior probability P( y | Χ ) by means of Bayes’ theorem. This could be done if we have the values of the prior probability P( y ) and the class conditional probability P(Χ | y ) . 4 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Bayesian Methods (cont.) Suppose there are H distinct values for the target variable y denoted as Η = {v1 , v2 ,..., vH } . According to Bayes’ theorem, the posterior probability P( y | Χ ) , that is, the probability of observing the target class y given the instance Χ : P( y | Χ ) = P(Χ | y )P( y ) ∑ H l =1 5 P(Χ | y )P( y ) = P(Χ | y )P( y ) P(Χ ) CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Bayesian Methods (cont.) An illustration: Suppose H = 2 . What is the probability that y = v1 given Χ ? Answer – P( y = v1 | Χ ) For simplicity, let A be the event that y = v1 and B be the event that y = v2 . P( A ∩ Χ ) P( A | Χ ) = P(Χ ) We know that: P( A ∩ Χ ) = P(Χ ∩ A) and so 6 P(Χ ∩ A) P(Χ | A) = P ( A) P(Χ | A)P( A) P( A | Χ ) = P(Χ ) P( y | Χ ) = P(Χ | y )P( y ) ∑ H l =1 P(Χ | y )P( y ) = P(Χ | y )P( y ) P(Χ ) CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Bayesian Methods (cont.) Also: P(Χ ) = P(Χ ∩ A) + P(Χ ∩ B ) = P(Χ | A)P( A) + P(Χ | B )P(B ) P( y | Χ ) = P(Χ | y )P( y ) ∑ H l =1 P(Χ | y )P( y ) = P(Χ | y )P( y ) P(Χ ) Χ y = v1 y = v1 ∩ Χ y = v2 ∩ Χ y = v2 7 Event 𝐴𝐴 is 𝑦𝑦 = 𝑣𝑣1 and event 𝐵𝐵 is 𝑦𝑦 = 𝑣𝑣2 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Bayesian Methods (cont.) Maximum A Posteriori Hypothesis (MAP): 8 Calculate the posterior probability P( A | Χ ) and P(B | Χ ) P(Χ | A)P( A) P(Χ | B )P(B ) P( A | Χ ) = P (B | Χ ) = P(Χ ) P(Χ ) then see which one is higher. 
Since the denominator is the same for both posterior probabilities, need only to compare: P(Χ | A)P( A) P(Χ | B )P(B ) We conclude that y = v1 if P(Χ | A)P( A) is the larger of the two, otherwise conclude y = v2 . CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Bayesian Methods (cont.) To obtain P(Χ | A)P( A) and P(Χ | B )P(B ) . y = v1 Χ y = v1 ∩ Χ y = v2 ∩ Χ y = v2 P(Χ ∩ A) P ( A) P( A ∩ Χ ) = P(Χ | A)P( A) = P( x1 | A)P(x2 | A)...P(xn | A)P( A) P(Χ | A) = 9 P(Χ ∩ B ) P (B ) P(B ∩ Χ ) = P(Χ | B )P(B ) = P( x1 | B )P( x2 | B )...P(xn | B )P(B ) P(Χ | B ) = CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Naïve Bayesian Classifier The classifier assumes that given the target class, the explanatory variables are conditionally independent: P(Χ | y ) = P( x1 | y )× P( x2 | y )× P( x3 | y )× ...... × P(xn | y ) The class conditional probability values are calculated from the available data: 10 For categorical or discrete numerical variable a j : s jhk P (x j | y ) = P (x j = rjk | y = vh ) = mh where s jhk is the number of instances class vh for which the variable 𝑥𝑥𝑗𝑗 takes value 𝑟𝑟𝑗𝑗𝑘𝑘 mh is the total number of instances of class vh in dataset D . CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Naïve Bayesian Classifier (cont.) For numerical attributes: 11 P (x j | y ) is estimated by making some assumption regarding its distribution. Often, this conditional probability is assumed to follow the Gaussian distribution (also known as normal distribution) and we compute the Gaussian density function. CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression An Example of Naïve Bayesian Classifier For tomorrow, the weather forecast is: DAY OUTLOOK SUNNY TEMPERATURE HOT HUMIDITY HIGH WIND WEAK PLAYTENNIS D2 SUNNY HOT HIGH STRONG NO D3 OVERCAST HOT HIGH WEAK YES D4 RAIN MILD HIGH WEAK YES D5 RAIN COOL NORMAL WEAK YES D6 RAIN COOL NORMAL STRONG NO D7 OVERCAST COOL NORMAL STRONG YES D8 SUNNY MILD HIGH WEAK NO D9 SUNNY COOL NORMAL WEAK YES D10 RAIN MILD NORMAL WEAK YES D11 SUNNY MILD NORMAL STRONG YES D12 OVERCAST MILD HIGH STRONG YES D13 OVERCAST HOT NORMAL WEAK YES D14 RAIN HIGH STRONG NO D1 12 MILD NO Outlook – Sunny Temperature – Cool Humidity – High Wind – Strong Do we play tennis? Prior probabilities: 9 14 5 P(PlayTennis = No ) = 14 P(PlayTennis = Yes ) = CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression An Example of Naïve Bayesian Classifier (cont.) Estimate class conditional probabilities: DAY OUTLOOK SUNNY TEMPERATURE HOT HUMIDITY HIGH WIND WEAK PLAYTENNIS D2 SUNNY HOT HIGH STRONG NO D3 OVERCAST HOT HIGH WEAK YES D4 RAIN MILD HIGH WEAK YES D5 RAIN COOL NORMAL WEAK YES D6 RAIN COOL NORMAL STRONG NO D7 OVERCAST COOL NORMAL STRONG YES D8 SUNNY MILD HIGH WEAK NO D9 SUNNY COOL NORMAL WEAK YES P ( Sunny | Yes) P (Cool | Yes) P ( High | Yes) P ( Strong | Yes) P (Yes) = D10 RAIN MILD NORMAL WEAK YES D11 SUNNY MILD NORMAL STRONG YES D12 OVERCAST MILD HIGH STRONG YES 2 3 3 3 9 = 0.00529 9 9 9 9 14 D13 OVERCAST HOT NORMAL WEAK YES P( Sunny | No) P(Cool | No) P( High | No) P ( Strong | No) P( No) = D14 RAIN HIGH STRONG NO 3 1 4 3 5 = 0.02057 5 5 5 5 14 D1 MILD Note on posterior probabilities: P(Sunny, Cool , High, Strong ) = 0.00529 + 0.02057 = 0.02586 P(Yes | Sunny, Cool , High, Strong ) = 0.00529 0.02586 = 0.2045 3 9 3 P(Wind = Strong | PlayTennis = No ) = 5 P(Wind = Strong | PlayTennis = Yes ) = NO Compute class conditional probabilities for the other variables. 
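As a check on the arithmetic, the class-conditional probabilities and the two products P(X|Yes)P(Yes) and P(X|No)P(No) for the query (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong) can be tallied directly from the 14 training days. A minimal sketch assuming pandas (it is not the course's script):

```python
import pandas as pd

# PlayTennis training data from the lecture (days D1-D14).
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

query = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}

scores = {}
for label, group in data.groupby("PlayTennis"):
    prior = len(group) / len(data)                 # P(Yes) = 9/14, P(No) = 5/14
    likelihood = 1.0
    for col, value in query.items():
        likelihood *= (group[col] == value).mean() # e.g. P(Sunny | Yes) = 2/9
    scores[label] = prior * likelihood

print(scores)                                            # approx {'No': 0.02057, 'Yes': 0.00529}
print({k: v / sum(scores.values()) for k, v in scores.items()})  # posterior probabilities
```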
Then compute: Decision: No, we do not play tennis. P( No | Sunny, Cool , High, Strong ) = 0.02057 0.02586 = 0.7955 13 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Naïve Bayesian Classifier with Scikit Learn The sklearn.naive_bayes module implements various Naive Bayes algorithms: Supervised learning methods based on applying Bayes’ theorem with strong (naive) feature independence assumptions. MultinomialNB: BernoulliNB: 14 Naive Bayes classifier for multinomial models. Suitable for classification with discrete features having a distribution like word frequencies (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. Naive Bayes classifier for multivariate Bernoulli models. Suitable for discrete data such as binary/boolean features. CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Naïve Bayesian Classifier with Scikit Learn (cont.) GaussianNB: Gaussian Naive Bayes classifier. Features with distribution that is assumed to be normal and the values can be continuous. See the sample script file src01 for the tennis example with GaussianNB. 15 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Logistic regression is a technique for converting binary classification problems into linear regression. Values of response variables are assumed to be 0 or 1. Using logistic regression, the posterior probability P( y | Χ ) of the target variable y conditioned to the input Χ = ( X 1 , X 2 ,..., X n ) is modeled according to the logistic function (where e = 2.718281828... ): P (Y = 1 | X 1 , X 2 ,..., X n ) = 1 1 + e −( β 0 + β1 X 1 +...+ β n X n ) e β 0 + β1 X 1 +...+ β n X n = 1 + e β 0 + β1 X 1 +...+ β n X n P (Y = 0 | X 1 , X 2 ,..., X n ) = 1 − P(Y = 1 | X 1 , X 2 ,..., X n ) = 1 − 16 1 1 + e −( β 0 + β1 X 1 +...+ β n X n ) = 1 1 + e β 0 + β1 X 1 +...+ β n X n CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression (cont.) 1 Graph of the logistic function f ( x ) = : −x 1+ e Hence, 0 ≤ P(Y = 1 | X 1 , X 2 ,..., X n ) ≤ 1 and The above is known as the sigmoid function. 0 ≤ P(Y = 0 | X 1 , X 2 ,..., X n ) ≤ 1 17 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression (cont.) The ratio of the two conditional probabilities is: P(Y = 1|X 1,X 2 ,...,X n ) = e β 0 + β1 X 1 +...+ β n X n P(Y = 0|X 1,X 2 ,...,X n ) This is the odds in favor of y = 1 And its logarithm: P(Y = 1|X 1,X 2 ,...,X n ) = β 0 + β1 X 1 + ... + β n X n ln P(Y = 0|X 1,X 2 ,...,X n ) This is the logit function, or the logarithm of the odds. 18 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression (cont.) If X 1 is increased by 1: logit | X +1 = logit | X + β i i i odds | X +1 = odds | X e β i i i is the odds-ratio – The multiplicative increase in the odds when X 1 increases by one (other variables remaining constant): β i > 0 ⇒ e β > 1 ⇒ odds and probability increase with X 1 β i < 0 ⇒ e β < 1 ⇒ odds and probability decrease with X 1 e βi i i 19 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Example 1 A system analyst studied the effect of computer programming experience on ability to complete a complex programming task within a specified time. They had varying amount of experience (measured in months). 
All persons were given the same programming task and their success in the task was recorded: 20 Y = 1 if task was completed successfully within the allotted Person Months-Experience Success time. 1 14 0 Y = 0 otherwise. 2 29 0 … … … 24 22 1 25 8 1 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Example 1 (cont.) A standard logistic package was run on the data and the parameter values found are: β 0 = −3.0595 and β1 = 0.1615 . The estimated mean response for i = 1 , where X 1 = 14 is: a = β 0 + β1 X 1 = −3.0595 + 0.1615(14 ) = −0.7985 e a = e −0.7985 = 0.4500 0.4500 ea = = 0.3103 P(Y = 1 | X 1 = 14 ) = a 1+ e 1 + 0.4500 The estimated probability that a person with 14 months experience will successfully complete the programming task is 0.3103. The odds in favor of completing the task = 0.3103 (1 − 0.3103) = 0.4499 21 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Example 1 (cont.) Suppose there is another programmer with 15 months experience, i.e., X 1 = 15 . Recall the parameter values are β 0 = −3.0595 and β1 = 0.1615 b = β 0 + β1 X 1 = −3.0595 + 0.1615(15) = −0.637 eb = e −0.637 = 0.5289 eb 0.5289 P(Y = 1 | X 1 = 15) = = = 0.3459 b 1+ e 1 + 0.5289 The estimated probability that a person with 15 months experience will successfully complete the programming task is 0.3459. = 0.3459 (1 − 0.3459 ) The odds in favor of completing the task = 0.5288 22 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Example 1 (cont.) Comparing the two odds: 0.5288 = 1.1753 = e 0.1615 0.4499 The odds increase by 17.53% with each additional month of experience. 23 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Example 2 This example originates from UCLA. The data were collected from 200 high school students. The variables read, write, math, science, and socst are the results of standardized tests on reading, writing, math, science and social studies (respectively). The variable female is coded 1 if female, 0 if male. The response variable is honcomp with two possible values: 24 High writing test score if the writing score is greater than or equal to 60 (honcomp = 1), Low writing test score, otherwise (honcomp = 0). CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Example 2 (cont.) The predictor variables used are gender (female), reading test score (read) and science test score (science). See the sample script file src02. 25 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Example 2 (cont.) 26 CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Logistic Regression Example 2 (cont.) Selected output from StatsModels: Estimate for Intercept β 0 = −12.7772 Estimate for β1 = 1.4825 (corresponds to variable female) β Odds ratio point estimate corresponds to variable female = 4.404 = e 1 Estimate for β 2 = 0.1035 (corresponds to variable read) Estimate for β 3 = 0.0948 (corresponds to variable science): 27 The odds of a female student getting high writing test score is more than 4-fold higher than a male student (given the same reading and science test scores). The estimate logistic regression coefficient for a one unit change in science score, given the other variables in the model are held constant. 
If a student were to increase his/her science score by one point, the difference in log-odds (logit response values) for high writing score is expected to increase by 0.0948 units, all other variables held constant. CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression Lecture 8 Support Vector Machines Corporate Gurukul – Data Analytics using Deep Learning June 2023 Lecturer: A/P TAN Wee Kek Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35 Learning Objectives At the end of this lecture, you should understand: 1 Overview of Support vector machines (SVMs). Maximum-margin hyperplane. Non-linear SVMs. Using SVMs in Python with Scikit Learn. Hyperparameter tuning. CG DADL (June 2023) Lecture 8 – Support Vector Machines Overview of Support Vector Machines Support vector machines (SVMs): SVMs are a family of separation methods for classification and regression. SVMs are developed in the context of statistical learning theory. Among the best supervised learning algorithm: 2 Most theoretically motivated. Practically most effective classification algorithms in modern machine learning. SVMs are initially designed to fit a linear boundary between the samples of a binary problem, ensuring the maximum robustness in terms of tolerance to isotropic uncertainty. CG DADL (June 2023) Lecture 8 – Support Vector Machines Overview of Support Vector Machines (cont.) SVMs can achieve better performance in terms of accuracy with respect to other classifiers in several application domains: Text and hypertext categorization – SVMs can significantly reduce the need for labeled training instances. Classification of images. Recognition of hand-written characters. SVMs have also been widely applied in the biological and other sciences – E.g., classify proteins with up to 90% accuracy. Efficiently scalable for large problems: 3 If we have a big data set that needs a complicated model, the full Bayesian framework is very computationally expensive. CG DADL (June 2023) Lecture 8 – Support Vector Machines What is SVM? – A Video Introduction 4 CG DADL (June 2023) Lecture 8 – Support Vector Machines How Does SVM Work? Given a set of labelled training examples for two categories: 5 An SVM training algorithm builds a model that assigns new examples to one category or the other. This makes SVM a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. The gist of SVM is to establish an optimal hyperplane for linearly separable patterns. CG DADL (June 2023) Lecture 8 – Support Vector Machines How Does SVM Work? (cont.) The figure below depicts the SVM decision boundary and the support vectors: 6 The boundary displayed has the largest distance to the closest point of both classes. Any other separating boundary will have a point of a class closer to it than this one. The figure also shows the closest points of the classes to the boundary. These special points are called support vectors. CG DADL (June 2023) Lecture 8 – Support Vector Machines How Does SVM Work? (cont.) In fact, the boundary only depends on the support vectors. If we remove any other point from the dataset, the boundary remains intact. However, in general, if any of these support vectors is removed, the boundary will change. 
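This dependence on the support vectors alone is easy to check empirically. The sketch below (illustrative data and names, not from the course material) fits a linear SVM, removes one observation that is not a support vector, refits, and compares the two boundaries:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two well-separated classes in 2-D (illustrative data).
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vector indices:", clf.support_)

# Drop one point that is NOT a support vector and refit.
drop = next(i for i in range(len(X)) if i not in clf.support_)
keep = np.arange(len(X)) != drop
clf2 = SVC(kernel="linear", C=1.0).fit(X[keep], y[keep])

# The separating hyperplane is essentially unchanged (up to solver tolerance);
# dropping a support vector instead would, in general, move the boundary.
print(clf.coef_, clf.intercept_)
print(clf2.coef_, clf2.intercept_)
```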
In addition to performing linear classification, SVMs can also efficiently perform a non-linear classification: 7 SVMs can be extended to patterns that are not linearly separable by transformations of original data to map into new space using Kernel functions. This approach is known as the kernel trick – transforming data into another dimension that has a clear dividing margin between classes of data. CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vectors Support vectors are the data points that lie closest to the decision surface (or hyperplane): They are the data points that are the most difficult to classify. They have direct bearing on the optimum location of the decision surface. Which separating hyperplane should we use? 8 In general, there are many possible solutions. SVM finds an optimal solution. CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vector Machine SVMs maximize the margin around the separating hyperplane: This is known as the “street”. This is known as the maximummargin hyperplane. The decision function is fully specified by a (usually very small) subset of training samples, i.e., the support vectors. This is essentially a quadratic programming problem that is easy to solve by standard methods. 9 CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vector Machine (cont.) Separation by hyperplanes: We will assume linear separability for now and will relax this assumption later. In two dimensions, we can separate by a line. In higher dimensions, we need hyperplanes. General input/output for SVMs: Similar to neural nets but there is one important addition. Input: 10 Set of (input, output) training pair samples. The input are the sample features 𝑥𝑥1 , 𝑥𝑥2 , … , 𝑥𝑥𝑛𝑛 . The output is the result 𝑦𝑦. Typically, there can be lots of input features 𝑥𝑥𝑖𝑖 . CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vector Machine (cont.) Output: Important difference: 11 A set of weights 𝑤𝑤𝑖𝑖 , one for each feature. The linear combination of the weights predicts the value of 𝑦𝑦. Thus far, this is similar to neural nets. We use the optimization of maximizing the margin (“street width”) to reduce the number of weights that are nonzero to just a few that correspond to the important features that ‘matter’ in deciding the separating line (hyperplane). These nonzero weights correspond to the support vectors (because they ‘support’ the separating hyperplane). CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vector Machine (cont.) Two-dimensional case: 12 Find 𝑎𝑎, 𝑏𝑏 and 𝑐𝑐 such that 𝑎𝑎𝑥𝑥 + 𝑏𝑏𝑏𝑏 ≥ 𝑐𝑐 for the red points. 𝑎𝑎𝑎𝑎 + 𝑏𝑏𝑏𝑏 ≤ (or <)𝑐𝑐 for the green points. CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vector Machine (cont.) Which hyperplane to choose? There are lots of possible solutions for 𝑎𝑎, 𝑏𝑏 and 𝑐𝑐. Some methods find a separating hyperplane, but not the optimal one (e.g., neural net). But the important question is which points should influence optimality? All points? Or only “difficult points” close to decision boundary? 13 Linear regression Neural nets Support vector machines CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vectors for Linearly Separable Case Recall that support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed. Support vectors are the critical elements of the training set. The problem of finding the optimal hyper plane is an optimization problem: 14 Can be solved by optimization techniques. 
E.g., Lagrange multipliers can be used to get this problem into a form that can be solved analytically. CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vectors for Linearly Separable Case (cont.) Support vectors are input vectors that just touch the boundary of the margin (street): 15 There are three of them as circled in the figure below. Think of them as the ‘tips’ of the vectors. CG DADL (June 2023) Lecture 8 – Support Vector Machines Support Vectors for Linearly Separable Case (cont.) The figure below shows the actual support vectors, 𝑣𝑣1 , 𝑣𝑣2 , 𝑣𝑣3 , instead of just the 3 circled points at the tail ends of the support vectors. 16 𝑑𝑑 denotes 1/2 of the street “width”. CG DADL (June 2023) Lecture 8 – Support Vector Machines Definitions Define the hyperplanes 𝐻𝐻 such that: 𝐻𝐻1 and 𝐻𝐻2 are the planes: 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 ≥ +1 when 𝑦𝑦𝑖𝑖 = +1 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 ≤ −1 when 𝑦𝑦𝑖𝑖 = −1 𝐻𝐻1 : 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 = +1 𝐻𝐻2 : 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 = −1 The points on the planes 𝐻𝐻1 and 𝐻𝐻2 are the tips of the support vectors. 17 CG DADL (June 2023) Lecture 8 – Support Vector Machines Definitions (cont.) The plane 𝐻𝐻0 is the median in between, where 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 = 0: 𝑑𝑑 + = The shortest distance to the closest positive point. 𝑑𝑑 − = The shortest distance to the closest negative point. The margin (gutter) of a separating hyperplane is 𝑑𝑑 + + 𝑑𝑑 −. The optimization algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary. 18 CG DADL (June 2023) Lecture 8 – Support Vector Machines Definitions (cont.) 19 CG DADL (June 2023) Lecture 8 – Support Vector Machines Defining the Separating Hyperplane Form of equation defining the decision surface separating the classes is a hyperplane of the form: 𝑊𝑊 𝑇𝑇 𝑋𝑋 + 𝑏𝑏 = 0 𝑊𝑊 is a weight vector. 𝑋𝑋 is input vector. 𝑏𝑏 is bias. This allows us to write: 𝑊𝑊 𝑇𝑇 𝑋𝑋 + 𝑏𝑏 ≥ 0 for 𝑑𝑑1 = +1 𝑊𝑊 𝑇𝑇 𝑋𝑋 + 𝑏𝑏 < 0 for 𝑑𝑑1 = −1 20 CG DADL (June 2023) Lecture 8 – Support Vector Machines Defining the Separating Hyperplane (cont.) Margin of Separation (𝑑𝑑) – The separation between the hyperplane and the closest data point for a given weight vector 𝑊𝑊 and bias 𝑏𝑏. Optimal Hyperplane (maximal margin) – The particular hyperplane for which the margin of separation 𝑑𝑑 is maximized. 21 CG DADL (June 2023) Lecture 8 – Support Vector Machines Maximizing the Margin (a.k.a. Street Width) We want a classifier (linear separator) with as big a margin as possible. The distance from a point 𝑥𝑥0 , 𝑦𝑦0 to a line 𝐴𝐴𝐴𝐴 + 𝐵𝐵𝐵𝐵 + 𝑐𝑐 = 0 is: 𝐴𝐴𝑥𝑥0 + 𝐵𝐵𝑦𝑦0 + 𝑐𝑐 𝐴𝐴2 + 𝐵𝐵2 The distance between 𝐻𝐻0 and 𝐻𝐻1 is then: 𝑤𝑤 ⋅ 𝑥𝑥 + 𝑏𝑏 1 = 𝑤𝑤 𝑤𝑤 The total distance between 𝐻𝐻1 and 𝐻𝐻2 is thus: 2 𝑤𝑤 22 CG DADL (June 2023) Lecture 8 – Support Vector Machines Maximizing the Margin (a.k.a. Street Width) (cont.) In order to maximize the margin, we thus need to minimize 𝑤𝑤 and subject to the condition that there are no data points between 𝐻𝐻1 and 𝐻𝐻2 : 𝑥𝑥𝑖𝑖 ⋅ 𝑤𝑤 + 𝑏𝑏 ≥ +1 when 𝑦𝑦𝑖𝑖 = +1 𝑥𝑥𝑖𝑖 ⋅ 𝑤𝑤 + 𝑏𝑏 ≤ −1 when 𝑦𝑦𝑖𝑖 = −1 Can be combined into: 𝑦𝑦𝑖𝑖 𝑥𝑥𝑖𝑖 ⋅ 𝑤𝑤 ≥ +1 23 CG DADL (June 2023) Lecture 8 – Support Vector Machines Maximizing the Margin (a.k.a. Street Width) (cont.) We need to solve a quadratic programming problem: Minimize 𝑤𝑤 , s.t. discrimination boundary is obeyed, i.e., min𝑓𝑓 𝑥𝑥 s.t. 𝑔𝑔 𝑥𝑥 = 0. We can rewrite this as min𝑓𝑓: 1⁄2 𝑤𝑤 2 s.t. 𝑔𝑔: 𝑦𝑦𝑖𝑖 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 − 𝑏𝑏 = 1 or 𝑦𝑦𝑖𝑖 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 − 𝑏𝑏 − 1 = 0 This is a constrained optimization problem: 24 It can be solved by the Lagrangian multiplier method. 
Because it is quadratic, the surface is a paraboloid, with just a single global minimum. CG DADL (June 2023) Lecture 8 – Support Vector Machines A Video Explanation 25 CG DADL (June 2023) Lecture 8 – Support Vector Machines How about Non-Linear Separable? We would be finding a line that penalizes points on “the wrong side”. 26 CG DADL (June 2023) Lecture 8 – Support Vector Machines Non-Linear SVM – Continue Video 27 CG DADL (June 2023) Lecture 8 – Support Vector Machines How about Non-Linear Separable? (cont.) We can transform the data points such that they are linearly separable: 28 CG DADL (June 2023) Lecture 8 – Support Vector Machines How about Non-Linear Separable? (cont.) Non-Linear SVMs: 29 The idea is to gain linearly separation by mapping the data to a higher dimensional space. The following set cannot be separated by a linear function, but can be separated by a quadratic one. So if we map 𝑥𝑥 ↦ 𝑥𝑥 2 , 𝑥𝑥 , we gain linear separation: CG DADL (June 2023) Lecture 8 – Support Vector Machines How about Non-Linear Separable? (cont.) But what happen if the decision function is not linear? What transformation would separate the data points in the figure below? 30 CG DADL (June 2023) Lecture 8 – Support Vector Machines How about Non-Linear Separable? (cont.) The kernel trick: 31 Imagine a function 𝜙𝜙 that maps the data into another space. A kernel function performs the transformation and defines the inner dot products 𝑥𝑥𝑖𝑖 ⋅ 𝑥𝑥𝑗𝑗 , i.e., it also defines similarity in the transformed space. CG DADL (June 2023) Lecture 8 – Support Vector Machines How about Non-Linear Separable? (cont.) Examples of non-linear SVMs: Example of a Gaussians kernel 32 CG DADL (June 2023) Lecture 8 – Support Vector Machines Performing SVM in Scikit Learn We can apply linear SVM to the binary classification problem of the tennis dataset. See the sample script src01 We can also apply linear SVM to the multiclass classification problem of the Iris flower dataset. See the sample script src02 33 CG DADL (June 2023) Lecture 8 – Support Vector Machines Hyperparameter Tuning In order to improve the model accuracy, there are several parameters that need to be tuned. The three major parameters include: Parameter Description Kernel Kernel takes low dimensional input space and transform it into a higherdimensional space. It is mostly useful in non-linear separation problem. C C is the penalty parameter, which represents misclassification or error (Regularisation) term. C tells the SVM optimisation how much error is bearable. Control the trade-off between decision boundary and misclassification term. When C is high it will classify all the data points correctly, also there is a chance to overfit. Gamma Defines how far influences the calculation of plausible line of separation. When gamma is higher, nearby points will have high influence; low gamma means far away points also be considered to get the decision boundary. 34 CG DADL (June 2023) Lecture 8 – Support Vector Machines Hyperparameter Tuning (cont.) C (Regularisation) 35 CG DADL (June 2023) Lecture 8 – Support Vector Machines Hyperparameter Tuning (cont.) Gamma 36 CG DADL (June 2023) Lecture 8 – Support Vector Machines Hyperparameter Tuning (cont.) Hyperparameters are parameters that are not directly learnt within estimators: In scikit-learn, they are passed as arguments to the constructor of the estimator classes. 
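For instance, the three parameters in the table above are all arguments of the SVC constructor in Scikit-Learn. A minimal sketch on the Iris data with illustrative values (the course's src01-src03 scripts are not reproduced here):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Kernel, C and gamma are passed to the constructor; they are not learnt from the data.
clf = SVC(kernel="rbf", C=1.0, gamma=0.1)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Trying different values of C and gamma by hand and refitting is exactly the search that grid search, discussed next, automates.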
Grid search is commonly used as an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid. GridSearchCV helps us combine an estimator with a grid search preamble to tune hyperparameters. See the sample script src03 37 CG DADL (June 2023) Lecture 8 – Support Vector Machines Lecture 9 Clustering Corporate Gurukul – Data Analytics using Deep Learning June 2023 Lecturer: A/P TAN Wee Kek Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35 Learning Objectives At the end of this lecture, you should understand: 1 What is clustering? Clustering methods. Affinity measures. Partition methods. Hierarchical methods. Evaluation of clustering models. Using Clustering in Python with Scikit Learn. CG DADL (June 2023) Lecture 9 – Clustering Overview of Clustering Clusters are homogeneous groups of observations. To measure similarity between pairs of observations, a distance metric must be defined. Clustering is an unsupervised learning process. Focus of our discussions will be on: 2 Features of clustering models. Two partition methods: K-means and K-medoids. Two hierarchical methods: Agglomerative and divisive clustering methods. Quality indicators for clustering methods. CG DADL (June 2023) Lecture 9 – Clustering Clustering Methods Aim – To subdivide the records of a dataset into homogeneous groups of observations called clusters. Observations in a cluster are similar to one another and are dissimilar from observations in other clusters. Purpose of clustering: As a tool which could provide meaningful interpretation of the phenomenon of interest: 3 Example – Grouping consumers based on their purchase behavior may reveal the existence of a market niche. CG DADL (June 2023) Lecture 9 – Clustering Clustering Methods (cont.) As a preliminary phase of a data mining project that will be followed by other methodologies within each cluster: Example: 4 Clustering is done before classification. In retention analysis, distinct classification models may be developed for various clusters to improve the accuracy in spotting customers with high probability of churning. As a way to highlight outliers and identify an observation that might represent its own cluster. CG DADL (June 2023) Lecture 9 – Clustering General Requirements for Clustering Methods Flexibility: Some methods can be applied to data having numerical variables only. Other methods can be applied to datasets containing categorical variables as well. Robustness: 5 The stability of the clusters with respect to small changes in the values of variables of each observation. Noise should not affect the clusters. The clusters formed should not depend on the order of appearance of the observations in the dataset. CG DADL (June 2023) Lecture 9 – Clustering General Requirements for Clustering Methods (cont.) Efficiency: 6 The method must be able to generate clusters efficiently within reasonable computing time even for a large dataset with many observations and large number of variables. CG DADL (June 2023) Lecture 9 – Clustering Taxonomy of Clustering Methods Based on the logic used for deriving the clusters. Partition methods: Develop a subdivision of the given dataset into a predetermined number K of non-empty subsets. They are usually applied to small or medium sized data sets. Hierarchical methods: 7 Carry out multiple subdivisions into subsets. 
Based on a tree structure and characterized by different homogeneity thresholds within each cluster and inhomogeneity threshold between distinct clusters. No predetermined number of clusters is required. CG DADL (June 2023) Lecture 9 – Clustering Taxonomy of Clustering Methods (cont.) Density-based methods: Derive clusters such that for each observation in a specific cluster, a neighborhood with a specified diameter must contain at least a pre-specified number of observations. E.g., Defined distance (DBSCAN): If the Minimum Features per Cluster cannot be found within the Search Distance from a particular point, then that point will be marked as noise. In other words, if the core-distance (the distance required to reach the minimum number of features) for a feature is greater than the Search Distance, the point is marked as noise. The Search Distance, when using DBSCAN, is treated as a search cut-off. 8 CG DADL (June 2023) Lecture 9 – Clustering Taxonomy of Clustering Methods (cont.) Grid methods: 9 Derive a discretization of the space of observations, obtaining grid structure consisting of cells. Subsequent clustering operations are developed with respect to the grid structure to achieve reduced computing time, but lower accuracy. CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures Clustering models are typically based on a measure of similarity between observations. The measure can typically be obtained by defining an appropriate notion of distance between each pair of observations. There are many popular metrics depending on the type of variables being analyzed. 10 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures (cont.) Given a dataset D having m observations Χ1, Χ 2 , Χ 3 ,...Χ m each described by n-dimensional variables, we compute the distance matrix D: 0 d12 d1,m −1 0 d 2,m −1 D = [d ik ] = 0 d1m d 2m d m −1,m 0 where d ik is the distance between observations Χi and Χ k . d ik = dist (Χ i , Χ k ) = dist (Χ k , Χ i ) for i, k = 1,2,..., m D is a symmetric m × m matrix with zero diagonal. 11 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures (cont.) Similarity measure can be obtained by letting: s ik = 1 1 + d ik or s ik = d max − d ik d max where d max = max i ,k d ik is the max value of D . 12 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Numerical Variables If all n variables of the observations Χ1, Χ 2 , Χ 3 ,...Χ m are numerical, the distance between Χi and Χ k can be computed in four ways. Euclidean distance (or 2 norm): dist (Χ i , Χ k ) = n ∑ (x j =1 ij − xkj ) = 2 (xi1 − xk1 )2 + (xi 2 − xk 2 )2 + ... + (xin − xkn )2 Manhattan distance (or 1 norm): n dist (Χ i , Χ k ) = ∑ xij − xkj = xi1 − xk1 + xi 2 − xk 2 + ... + xin − xkn j =1 13 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Numerical Variables (cont.) Infinity norm: Minkowski distance, which generalizes both the Euclidean and Manhattan metrics: dist (Χ i , Χ k ) = max( xi1 − xk1 , xi 2 − xk 2 ,..., xin − xkn ) dist (Χ i , Χ k ) = 14 n q ∑ j =1 xij − xkj q q q = q xi1 − xk1 + xi 2 − xk 2 + ... + xin − xkn q The Minkowski distance reduces to the Manhattan distance when q = 1 , and to the Euclidean distance when q = 2 . CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Numerical Variables (cont.) 
Example: Χ1 = (5,0 ) and Χ2 = (1,−3) Euclidean distance (or 2 norm): dist (Χ1 , Χ 2 ) = (5 − 1)2 + (0 − (− 3))2 = 16 + 9 = 5 Manhattan distance (or 1 norm): dist (Χ1 , Χ 2 ) = 5 − 1 + 0 − (− 3) 0 1 Infinity norm: dist (Χ1 , Χ 2 ) = max( 5 − 1 , 0 − (− 3) ) -3 = max(4,3) = 4 4 5 -2 3 -1 = 4+3= 7 2 -4 15 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Numerical Variables (cont.) Distance can also be measured by computing the cosine of the angle formed by the two vectors representing the observations. The similarity coefficient is defined as: ∑ xx ∑ x∑ n Bcos (Χ i , Χ k ) = cos(Χ i , Χ k ) = j =1 ij kj n 2 j =1 ij n 2 x j =1 kj The value of Bcos (Χ i , Χ k ) lies in the interval [0,1] . The distance itself is the angle between the two vectors Χi and Χ k derived as follow: dist (Χ i , Χ k ) = arccos(Χ i , Χ k ) 16 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Numerical Variables (cont.) The coefficient Bcos (Χ i , Χ k ) is closer to 1 when Χi and Χ k are similar to each other (parallel) It is closer to 0 when they are dissimilar. Note that cos(90) = 0 and cos(0) = 1 . Cluster 1 xi Cluster 2 xk αi Origin αk xnew xnew is more similar to xk than to xi 17 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Binary Variables Suppose the variable a = (x , x ,..., x ) can only have value j ij 2j mj either 0 or 1. Even if the distance between observations can be computed, the quantity would not represent a meaningful measure. The values 0 and 1 are purely conventional and their meaning could be interchanged. 18 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Binary Variables (cont.) Assuming all n variables in the dataset D are binary, define a contingency table: Observation xk Observation xi 19 0 1 total 0 p q p+q 1 u v u+v total p+u q+v n p – The number of variables for which Χi and Χ k assume the value 0. v – The number of attributes for which Χi and Χ k assume the value 1. q – The number of attributes for which Χi assumes the value of 0 and Χ k assumes the value 1. u – the number of attributes for which Χi assumes the value of 1 and Χ k assumes the value 0. CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Binary Variables (cont.) Example: Χi Χk = (0,0,1,1,0,1,0,0) = (1,1,1,0,0,0,0,0 ) 1 2 3 4 5 6 7 n=8 Xi 0 0 1 1 0 1 0 0 Xj 1 1 1 0 0 0 0 0 Observation xk Observation xi 20 0 1 total 0 p=3 q=2 p+q=5 1 u=2 v=1 u+v=3 total p+u=5 q+v=3 n=8 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Binary Variables (cont.) Symmetric binary variables: The presence of the value 0 is as interesting as the presence of value 1. E.g., Customer authorization of promotional mailings assumes the value {yes, no} . Asymmetric binary variables: 21 We are mostly interested in the presence of the value 1, which can be found in a small number of observations. E.g., If the customer purchased item A in a supermarket which carries more than 100,000 items. CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Binary Variables (cont.) For symmetric binary variables, the distance is given by the coefficient of similarity: dist (Χ i , Χ j ) = q+u q+u = p+q+u +v n For asymmetric binary variables, it is more interesting to match positives (coded as 1’s) with respect to matching negatives (coded as 0’s). 
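The distances worked out above, and the contingency-table counts of the binary example, can be reproduced with SciPy and NumPy. A small sketch (not part of the course scripts):

```python
import numpy as np
from scipy.spatial import distance

# Numerical example: X1 = (5, 0) and X2 = (1, -3).
x1 = np.array([5, 0])
x2 = np.array([1, -3])
print(distance.euclidean(x1, x2))        # 5.0  (2-norm)
print(distance.cityblock(x1, x2))        # 7    (Manhattan, 1-norm)
print(distance.chebyshev(x1, x2))        # 4    (infinity norm)
print(distance.minkowski(x1, x2, p=2))   # 5.0  (Minkowski reduces to Euclidean for q = 2)
print(1 - distance.cosine(x1, x2))       # cosine similarity B_cos(X1, X2)

# Binary example: count the contingency cells p, q, u, v.
xi = np.array([0, 0, 1, 1, 0, 1, 0, 0])
xk = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p = int(np.sum((xi == 0) & (xk == 0)))   # 3: both 0
v = int(np.sum((xi == 1) & (xk == 1)))   # 1: both 1
q = int(np.sum((xi == 0) & (xk == 1)))   # 2: xi = 0, xk = 1
u = int(np.sum((xi == 1) & (xk == 0)))   # 2: xi = 1, xk = 0
print(p, q, u, v)
```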
The Jaccard coefficient is defined as: dist (Χ i , Χ j ) = q+u q+u +v Observation xk Observation xi 22 0 1 total 0 p q p+q 1 u v u+v total p+u q+v n CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Nominal Categorical Variables A nominal categorical variable can be interpreted as a generalization of a symmetric binary variable for which the number of distinct values is more than 2. The distance measure can then be computed as: dist (Χ i , Χ j ) = n− f n where f is the number of variables for which the observations Χi and Χ k have the same nominal value. Example: = (male, Chinese, Clementi, Full − Time Employee) Χ k = (female, Caucasian, Clementi, Temporary Worker ) 4 −1 n=4 f =1 dist (Χ i , Χ j ) = = 0.75 4 Χi 23 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Ordinal Categorical Variables An ordinal categorical variable can be placed on a natural ordering scale, but the numerical values are, however, arbitrary. Standardization is required before affinity measures can be computed. For example, level of education have 4 possible values {elementary school, high school, bachelor's degree, post - graduate} , which is coded as follows: 24 1 = elementary school 2 = high school 3 = bachelor’s degree 4 = post-graduate CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Ordinal Categorical Variables (cont.) If the values associated with the ordinal variable are {1,2,3,..., H j } then standardize by the transformation: ' ij x = xij − 1 H j −1 where H j is the maximum assigned numerical value of the variable and xij' lies in the interval [0,1] . After transformation, the variable level of education have numerical values as follows: 25 0 = elementary school 1 3 = high school 2 3 = bachelor’s degree 1 = post-graduate We can then apply measures of distance for numerical variable accordingly. CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Mixed Composition Variables Suppose the n variables in the dataset D has mixed composition – numerical, symmetric/asymmetric binary and nominal/categorical variables. How do we define an affinity measure between two observations Χi and Χ k in D ? Let δ ikj be a binary indicator where δ ikj = 0 if and only if one of the two cases below is true: At least one of the two values xij or xkj is missing in the corresponding observations; The variable a j is binary asymmetric and xij = xkj = 0. For other cases, δ ikj = 1 . 26 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Mixed Composition Variables (cont.) Example: aj j = 2 xi 0 3.8 -1 xk 0 missing +1 δikj = 0 δikj = 1 δikj = 0 aj j = 5 aj j = 7 Define the contribution ∆ ikj of the variable a j to the similarity between Χi and Χ k as follows: If a j is binary or nominal, ∆ ikj = 0 if xij = xkj and ∆ ikj = 1 , otherwise. If the attribute a j is numerical, we set: ∆ ikj = xij − xkj maxι xιj where maxι xιj is the maximum value of variable a j . 27 CG DADL (June 2023) Lecture 9 – Clustering Affinity Measures for Mixed Composition Variables (cont.) If variable a j is ordinal, its standardized value is computed as before and we set ∆ ikj as for numerical variables. The similarity coefficient between the observations Χi and Χ k can be computed as: dist (Χ i , Χ j δ ∆ ∑ )= ∑ δ n j =1 ikj n j =1 28 ikj ikj CG DADL (June 2023) Lecture 9 – Clustering Partition Methods Given a dataset D , each represented by a vector in ndimensional space, construct a collection of subsets C = {C1 , C2 ,..., C K } where K ≤ m . 
K is the number of clusters and is generally predetermined. Clusters generated are usually exhaustive and mutually exclusive – Each observation belongs to only one cluster. Partition methods are iterative: 29 Assign m observations to the K clusters. Then iteratively reallocate to improve overall quality of clusters. CG DADL (June 2023) Lecture 9 – Clustering Partition Methods (cont.) Criteria for quality: Degree of homogeneity of observations in the same clusters. Degree of heterogeneity with respect to observations in other clusters. The methods terminate when during the same iteration no reallocation occurs, i.e., clusters are stable. 30 CG DADL (June 2023) Lecture 9 – Clustering K-means Algorithm 1. 2. 3. 4. 31 Initialize: choose K observations arbitrarily as the centroids of the clusters. Assign each observation to a cluster with the nearest centroid. If no observation is assigned to different cluster with respect to previous iteration, stop. For each cluster, the new centroid is computed as the mean of the values belonging to that cluster. Go to Step 2. CG DADL (June 2023) Lecture 9 – Clustering K-means Algorithm (cont.) Source: Vercellis (2009), pp. 304 32 CG DADL (June 2023) Lecture 9 – Clustering K-means Algorithm (cont.) Given a cluster Ch , h = 1,2,..., K , the centroid of the cluster is the point zh having coordinates equal to the mean value of each variable in the observations belonging to that cluster: z hj ∑ = Χ i ∈C h xij card{Ch } where card{Ch } is the number of observations in cluster Ch . 33 CG DADL (June 2023) Lecture 9 – Clustering K-means Algorithm (cont.) Example – Suppose we have 2-dimensional data with the variables {Weight, Height} : In Cluster 1, the observations are: {65,168}, {69,172} . In Cluster 2, the observations are: {50,165}, {58,158}, {54,157} . The centroids are: Cluster 1: 65 + 69 168 + 172 z1 = {z11 , z12 } = , = {67,170} 2 2 Cluster 2: 50 + 58 + 54 165 + 158 + 157 z 2 = {z 21 , z 22 } = , = {54,160} 3 3 34 CG DADL (June 2023) Lecture 9 – Clustering K-medoids Algorithm K-medoids algorithms, also known as partitioning around medoids, is a variant of the K-means method. Given a cluster Ch , a medoid U h is the most central observation among those assigned to this cluster. Instead of the means, the K-medoids algorithm assigns observations Χ i , i = 1,2,..., m according to the distance to the medoids, Essentially, we find cluster Ch such that dist (Χ i , U h ) is minimized, h = 1,2,..., K . 35 CG DADL (June 2023) Lecture 9 – Clustering K-medoids Algorithm (cont.) K-medoids requires a large number of iterations to assess the advantage afforded by an exchange of medoids: Check all (m − K )K pairs of (Χ i , U h ) composed of: An observation Χ i ∉ U that is not a medoid. And a medoid U h ∈ U . If it is beneficial to exchange between Χ i and U h , the observation Χ i becomes a medoid in place of U h . Thus, K-medoids is not suited for large datasets. 36 CG DADL (June 2023) Lecture 9 – Clustering Hierarchical Methods Hierarchical methods are based on a tree structure. They do not require the number of clusters to be predetermined. Distance between two clusters is computed before merging of clusters. Let Ch and C f be two clusters and Z h and Z f their corresponding centroids. To evaluate the distance between two clusters, there are five alternative measures that can be used. 37 CG DADL (June 2023) Lecture 9 – Clustering Hierarchical Methods (cont.) 
Minimum distance (single linkage): The dissimilarity between two clusters is given by the minimum distance among all pair of observations such that one belongs to the first cluster and the other to the second cluster: dist (Ch , C f ) = min dist (Χ i , Χ k ) Χ i ∈C h Χ k ∈C f Maximum distance (complete linkage): The dissimilarity between two clusters is given by the maximum distance among all pair of observations such that one belongs to the first cluster and the other to the second clusters: dist (C , C ) = max dist (Χ , Χ ) h 38 f Χ i ∈C h Χ k ∈C f i k CG DADL (June 2023) Lecture 9 – Clustering Hierarchical Methods (cont.) Mean distance: The dissimilarity between two clusters is computed as the mean of the distances between all pairs of observations belonging to the two clusters: dist (Ch , C f ) = ∑ Χ i ∈C h ∑ Χ k ∈C f dist (Χ i , Χ k ) card{Ch } card{C f } Distance between centroids: The dissimilarity between two clusters is determined by the distance between the two centroids of the clusters: dist (Ch , C f ) = dist (Z h , Z f ) 39 CG DADL (June 2023) Lecture 9 – Clustering Hierarchical Methods (cont.) Ward distance: Distance is computed based on the analysis of the variance of the Euclidean distances between the observations. Two types of hierarchical clustering methods: 40 Agglomerative methods. Divisive methods. CG DADL (June 2023) Lecture 9 – Clustering Agglomerative Hierarchical Methods Bottom-up techniques: 1. 2. 3. Initialization phase – Start with each observation representing a cluster. The minimum distance between the clusters is computed and the two clusters with the minimum distance are merged into a new cluster. If all the observations have been merged into one cluster, stop. Otherwise, go to step 2. Dendrogram: 41 A graphical representation of the merging process. On one axis is the value of the minimum distance corresponding to each merger. On the other axis are the observations. CG DADL (June 2023) Lecture 9 – Clustering Agglomerative Hierarchical Methods (cont.) A dendrogram provides a whole hierarchy of clusters corresponding to different threshold values for the minimum distance between clusters. Lower threshold gives more clusters. Example: 42 CG DADL (June 2023) Lecture 9 – Clustering Divisive Hierarchical Methods Top down approach: 1. 2. 3. Initialization – Start with all observations placed in a single cluster. Split clusters into smaller clusters so that the distances between the generated new clusters are minimized. Repeat Step 2 until every cluster contains only one observation, or until a similar stopping condition is met. In order to reduce the computation time, not all possible splits are considered: 43 At any given iteration, determine for each cluster the two observations that are furthest from each other. Subdivide the cluster by assigning the remaining records to one or the other based on their similarity. CG DADL (June 2023) Lecture 9 – Clustering Pros and Cons of Hierarchical Clustering Methods Hierarchical clustering is appealing because: It does not require specification of number of clusters. It’s ability to represent the clustering process and results through dendrograms makes it easier to understand and interpret. Hierarchical clustering has certain limitations: 44 Has greater computational complexity. Has lower robustness because reordering data or dropping a few observations can lead to very different solution. Is sensitive to outliers. 
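A minimal sketch of agglomerative clustering and its dendrogram using SciPy (illustrative only; it is not the course's src04/src05 scripts). Any of the linkage measures above can be passed as the method argument ('single', 'complete', 'average', 'centroid' or 'ward'):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Agglomerative merging with Ward's distance on the Euclidean metric.
Z = linkage(X, method="ward")

# Cutting the dendrogram at a number of clusters (or at a distance threshold)
# yields the flat clusters; lower thresholds give more clusters.
labels = fcluster(Z, t=3, criterion="maxclust")

dendrogram(Z)
plt.show()
```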
CG DADL (June 2023) Lecture 9 – Clustering Evaluation of Clustering Models Unlike supervised learning, evaluation of unsupervised learning is more subjective. It is still useful to compare different algorithms/metrics and number of clusters. Various performance indicators can be used once the set of K clusters C = {C1 , C2 ,..., CK } has been obtained: 45 Cohesion. Separation. Silhouette coefficient. CG DADL (June 2023) Lecture 9 – Clustering Evaluation of Clustering Models (cont.) Cohesion: An indicator of homogeneity of observations within each cluster Ch is defined as: coh (Ch ) = ∑ dist (Χ , Χ ) Χ i ∈C h Χ k ∈C h ∑ coh(C ) C h ∈C 46 k The overall cohesion of the partition C is defined as: coh (C ) = i h One clustering is preferable over another, in term of homogeneity within each cluster, if it has a smaller overall cohesion. CG DADL (June 2023) Lecture 9 – Clustering Evaluation of Clustering Models (cont.) Separation: An indicator of inhomogeneity between a pair of clusters is defined as: sep(Ch , C f ) = ∑ dist (Χ , Χ ) ∑ sep(C , C ) C h ∈C C f ∈C 47 k The overall separation of the partition C is defined as: sep(C ) = i Χ i ∈C h Χ k ∈C f h f One clustering is preferable over another, in term of inhomogeneity among all clusters, if it has a greater overall separation. CG DADL (June 2023) Lecture 9 – Clustering Evaluation of Clustering Models (cont.) Silhouette coefficient: A combination of cohesion and separation. Calculation of silhouette coefficient for a single observation 1. 2. 3. Χi: The mean distance ui of Χ i from all the remaining observations in the same cluster is computed. For each cluster C f other than the cluster containing Χ i , the mean distance wif between Χ i and all the observations in C f is calculated. The minimum vi among the distances wif is determined by varying the cluster C f . The silhouette coefficient of Χ i is defined as: vi − ui silh (Χ i ) = max(ui , vi ) 48 CG DADL (June 2023) Lecture 9 – Clustering Evaluation of Clustering Models (cont.) The silhouette coefficient varies between -1 and 1. A negative value indicates that the mean distance ui of the observation Χ i from the points of its cluster is greater than the minimum value vi of the mean distances from the observations of the other clusters, A negative value is therefore undesirable since the membership of Χ i in its cluster is not well characterized. Ideally, silhouette coefficient should be positive and ui should be as close as possible to 0: 49 That is, the best value is 1. Overall silhouette coefficient of a clustering may be computed as the mean of the silhouette coefficients for all observations in the dataset D . CG DADL (June 2023) Lecture 9 – Clustering Evaluation of Clustering Models (cont.) Silhouette diagram: 50 Used to illustrate silhouette coefficients. The observations are placed on vertical axis, subdivided by clusters. The value of silhouette coefficient for each observation are shown on the horizontal axis. The mean value of silhouette coefficient for each cluster and the overall mean may also be shown in the diagram. CG DADL (June 2023) Lecture 9 – Clustering Evaluation of Clustering Models (cont.) Example of silhouette diagrams: Generated from iris dataset using K-means clustering. Source – MathWorks K=2 51 K=3 CG DADL (June 2023) Lecture 9 – Clustering Clustering Example 1 – K-means Iris classification problem: 3 classes – Setosa,Versicolor and Virginica. 4 variables – Sepal length, sepal width, petal length and petal width. 
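A sketch of how these indicators can be computed for a K-means partition of the Iris data with Scikit-Learn (illustrative; it is not the course's src01 script). The silhouette coefficient needs only the data and the cluster labels, while the homogeneity score additionally uses the known species labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import homogeneity_score, silhouette_score

X, y = load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Overall silhouette coefficient: mean of the per-observation coefficients.
print("silhouette:", silhouette_score(X, labels))
# Homogeneity compares the clusters against the known species labels.
print("homogeneity:", homogeneity_score(y, labels))
```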
We use K-means clustering with K=3: 52 Silhouette Score = 0.5526 (positive and close to 1.0 is better) Homogeneity Score = 0.7515 (close to 1.0 is better) CG DADL (June 2023) Lecture 9 – Clustering Clustering Example 1 – K-means (cont.) src01 53 CG DADL (June 2023) Lecture 9 – Clustering Clustering Example 1 – K-means (cont.) We can generate the silhouette diagrams for K=2 and K=3 for comparison: 54 See the sample script src02. CG DADL (June 2023) Lecture 9 – Clustering Clustering Example 1 – K-means (cont.) To identify the distinguishing characteristics of observations in each cluster: 55 We can compute the within-cluster means and standard deviations of the independent variables. Plot scatter plots of the observations using the required independent variables. See the sample script src03. CG DADL (June 2023) Lecture 9 – Clustering Clustering Example 2 – Agglomerative Hierarchical We perform agglomerative hierarchical clustering on the Iris dataset. The number of clusters to find is set to 3: Silhouette Score = 0.5541 (positive and close to 1.0 is better) Homogeneity Score = 0.7608 (close to 1.0 is better) These statistics are slightly better than those of K-means. See the sample script src04. A dendrogram is generated to visualize the merging process – See the sample script src05. 56 CG DADL (June 2023) Lecture 9 – Clustering Clustering Example 2 – Agglomerative Hierarchical (cont.) 57 CG DADL (June 2023) Lecture 9 – Clustering Clustering Example 2 – Agglomerative Hierarchical (cont.) • Dendrogram showing the merging process. • For 3 clusters: • The first cluster would have 50 observations. • The second cluster would have 50 observations. • The third cluster would have 50 observations. Sample generated with SAS 58 CG DADL (June 2023) Lecture 9 – Clustering Lecture 10 Association Corporate Gurukul – Data Analytics using Deep Learning June 2023 Lecturer: A/P TAN Wee Kek Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35 Learning Objectives At the end of this lecture, you should understand: 1 Motivation and structure of association rules. Single-dimension association rules. Apriori algorithm. General association rules. Using association in Python with Mlxtend. CG DADL (June 2023) Lecture 10 – Association Overview of Association Rules Association rules are a class of unsupervised learning models. The aim of association rules is to identify regular patterns and recurrences within a large set of transactions. Fairly simple and intuitive. Frequently used to investigate: 2 Sales transactions in market basket analysis. Navigation paths within websites. CG DADL (June 2023) Lecture 10 – Association Motivations of Association Rules Many real-world applications involve the systematic collection of data that gives rise to massive lists of transactions. These transactions may be analyzed using association rules to identify possible recurrences in the data. Market basket analysis: 3 In typical retail purchase transactions, the list of purchased items is stored along with the price, time and date of purchase. Analyze the transactions to identify recurrent rules that relate the purchase of a product, or group of products, to the purchase of another product, or group of products. E.g., A customer buying breakfast cereals will also buy milk with a probability of 0.68. CG DADL (June 2023) Lecture 10 – Association Motivations of Association Rules (cont.) Web mining: 4 The list of web pages visited during a session is recorded as a transaction together with a sequence number and time of visit.
Understand the pattern of navigation paths and the frequency with which combinations of web pages are visited by a given individual during a single session or consecutive sessions. Identify the association of one or more web pages being viewed with visits to other pages. E.g., An individual visiting the website timesonline.co.uk will also visit the website economist.com within a week with a probability of 0.87. CG DADL (June 2023) Lecture 10 – Association Motivations of Association Rules (cont.) Credit card purchases: Each transaction consists of the purchases and payments made by card holders. Association rules are used to analyze the purchases in order to direct future promotions. Fraud detection: 5 Transactions consist of incident reports and applications for damage compensation. The existence of special combinations may reveal potentially fraudulent behaviors and trigger investigation. CG DADL (June 2023) Lecture 10 – Association Structure of Association Rules Given two propositions Y and Z, which may be true or false, we can state in general terms that a rule is an implication of the type Y ⇒ Z with the following meaning: If Y is true then Z is also true. A rule is called probabilistic if the validity of Z is associated with a probability p. That is, if Y is true then Z is also true with probability p. The notation ⇒ is read as "material implication": 6 A ⇒ B means if A is true then B is also true; if A is false then nothing is said about B. CG DADL (June 2023) Lecture 10 – Association Structure of Association Rules (cont.) Rules represent a classical paradigm for knowledge representation and are popular for their simple and intuitive structure. Rules should be non-trivial and interpretable so that they can be translated into concrete action plans: A marketing analysis may generate a rule that only reflects the effects of historical promotional campaigns, or that merely states the obvious. Rules may sometimes reverse the causal relationship of an implication: 7 E.g., Buyers of an insurance policy will also buy a car with a probability of 0.98. This rule confuses the cause with the effect and is useless. CG DADL (June 2023) Lecture 10 – Association Representation of Association Rules Let O = {o1, o2, ..., on} be a set of n objects. A generic subset L ⊆ O is called an itemset. An itemset that contains k objects is called a k-itemset. A transaction represents a generic itemset recorded in a database in conjunction with an activity or cycle of activities. The dataset D is composed of a list of m transactions Ti, each associated with a unique identifier denoted by ti. 8 Market basket analysis – The objects represent items from the retailer and each transaction corresponds to the items listed in a sales receipt. CG DADL (June 2023) Lecture 10 – Association Representation of Association Rules (cont.) Web mining – The objects represent the web pages in a website and each transaction corresponds to the list of web pages visited by a user during one session. Example on market basket analysis: • In this example, t1 = 001 and T1 = {a, c} = {bread, cereals}. • Similarly, t3 = 003 and the corresponding T3 = {b, d} = {milk, coffee}. Source: Vercellis (2009), pp. 279 9 CG DADL (June 2023) Lecture 10 – Association Representation of Association Rules (cont.) A dataset of transactions can be represented by a two-dimensional matrix X: The n objects of the set O correspond to the columns of the matrix.
The m transactions Ti are the rows. The generic element of X is defined as: $x_{ij} = 1$ if object $o_j$ belongs to transaction $T_i$, and $x_{ij} = 0$ otherwise. 10 CG DADL (June 2023) Lecture 10 – Association Representation of Association Rules (cont.) Same example on market basket analysis: Source: Vercellis (2009), pp. 280 • Recall that T1 = {bread, cereals} = {a, c} • And T3 = {milk, coffee} = {b, d} 11 CG DADL (June 2023) Lecture 10 – Association Representation of Association Rules (cont.) The representation could be generalized: Assuming that each object oj appearing in a transaction Ti is associated with a number fij. fij represents the frequency with which oj appears in Ti. This makes it possible to fully describe multiple sales of a given item in a single transaction. Let L ⊆ O be a given set of objects; then a transaction T is said to contain the set L if L ⊆ T. In the market basket analysis example, the 2-itemset L = {a, c} is contained in the transaction with identifier ti = 005. But it is not contained in ti = 006. 12 CG DADL (June 2023) Lecture 10 – Association Representation of Association Rules (cont.) The empirical frequency f(L) of an itemset L is defined as the number of transactions Ti existing in the dataset D that contain the set L: $f(L) = \mathrm{card}\{T_i : L \subseteq T_i,\ i = 1, 2, \ldots, m\}$ For a large sample (i.e., as m increases), the ratio f(L)/m approximates the probability Pr(L) of occurrence of itemset L: That is, the probability that L is contained in a new transaction T recorded in the database. In the market basket analysis example: 13 The set of objects L = {a, c} has a frequency f(L) = 4. The probability of occurrence is estimated as Pr(L) = 4/10 = 0.4. CG DADL (June 2023) Lecture 10 – Association Single-dimension Association Rules Given two itemsets L ⊂ O and H ⊂ O such that L ∩ H = ∅ and a transaction T, an association rule is a probabilistic implication denoted by L ⇒ H with the following meaning: If L is contained in T, then H is also contained in T with a given probability p. p is termed the confidence of the rule in D and is defined as: $p = \mathrm{conf}\{L \Rightarrow H\} = \frac{f(L \cup H)}{f(L)}$ 14 The set L is called the antecedent or body of the rule. H is the consequent or head. CG DADL (June 2023) Lecture 10 – Association Single-dimension Association Rules (cont.) The confidence of the rule indicates the proportion of transactions containing the set H among those that include L. This refers to the inferential reliability of the rule. As the number m of transactions increases, the confidence approximates the conditional probability that H belongs to a transaction T given that L does belong to T: $\Pr\{H \subseteq T \mid L \subseteq T\} = \frac{\Pr\{\{H \subseteq T\} \cap \{L \subseteq T\}\}}{\Pr\{L \subseteq T\}}$ Higher confidence thus corresponds to a greater probability that itemset H exists in a transaction that also contains the itemset L. 15 CG DADL (June 2023) Lecture 10 – Association Single-dimension Association Rules (cont.) The rule L ⇒ H is said to have a support s in D if the proportion of transactions containing both L and H is equal to s: $s = \mathrm{supp}\{L \Rightarrow H\} = \frac{f(L \cup H)}{m}$ 16 The support of the rule expresses the proportion of transactions containing both the body and head of the rule. It measures the frequency with which an antecedent-consequent pair appears together in the transactions of a dataset. A low support suggests that a rule may have occurred only occasionally, is of little interest to the decision maker, and is typically discarded. CG DADL (June 2023) Lecture 10 – Association Single-dimension Association Rules (cont.)
As m increases, the support approximates the probability that both L and H are contained in some future transaction. In the market basket analysis example: Given the itemsets L = {a, c} and H = {b} for the rule L ⇒ H, we have: $p = \mathrm{conf}\{L \Rightarrow H\} = \frac{f(L \cup H)}{f(L)} = \frac{2}{4} = 0.5$ and $s = \mathrm{supp}\{L \Rightarrow H\} = \frac{f(L \cup H)}{m} = \frac{2}{10} = 0.2$ 17 CG DADL (June 2023) Lecture 10 – Association Strong Association Rules Once a dataset D of m transactions has been assigned: Determine a minimum threshold value smin for the support. Determine a minimum threshold value pmin for the confidence. All strong association rules should be determined, characterized by: A support s ≥ smin; and A confidence p ≥ pmin. For a large dataset: 18 Extracting all association rules through a complete enumeration procedure requires excessive computation time. The number NT of possible association rules increases exponentially as n increases: $N_T = 3^n - 2^{n+1} + 1$. CG DADL (June 2023) Lecture 10 – Association Strong Association Rules (cont.) Moreover, most rules are not strong and a complete enumeration will likely end up discarding many unfeasible rules. We need to devise a method capable of deriving strong rules only, implicitly filtering out rules that do not meet the minimum threshold requirements: 19 The generation of strong association rules may be divided into two successive phases. First phase – Generation of frequent itemsets. Second phase – Generation of strong rules. CG DADL (June 2023) Lecture 10 – Association Generation of Frequent Itemsets Recall that the support of a rule depends only on the union of the itemsets L and H and not on the actual distribution of objects between the body and head. For example, all six rules that can be generated using the objects {a, b, c} have support s = 2/10 = 0.2: {a, b} ⇒ {c}, {b, c} ⇒ {a}, {b} ⇒ {a, c}, {a, c} ⇒ {b}, {a} ⇒ {b, c}, {c} ⇒ {a, b}. If the threshold value for support is smin = 0.25, all six rules can be eliminated based on the analysis of the union set {a, b, c}. 20 CG DADL (June 2023) Lecture 10 – Association Generation of Frequent Itemsets (cont.) The aim of generating frequent itemsets is to extract all sets of objects whose relative frequency is greater than the assigned minimum support smin. This phase is more computationally intensive than the subsequent phase of rule generation: 21 Several algorithms have been proposed for obtaining the frequent itemsets efficiently. The most popular one is known as the Apriori algorithm. CG DADL (June 2023) Lecture 10 – Association Generation of Strong Rules We need to separate the objects contained in each frequent itemset according to all possible combinations of head and body, and verify whether the confidence of each rule exceeds the minimum threshold pmin. In the previous example, if the itemset {a, b, c} is frequent, we can obtain six rules although only those with confidence higher than pmin can be considered strong: conf({a, b} ⇒ {c}) = 2/4 = 0.50, conf({b, c} ⇒ {a}) = 2/3 = 0.67, conf({b} ⇒ {a, c}) = 2/7 = 0.29, conf({a, c} ⇒ {b}) = 2/4 = 0.50, conf({a} ⇒ {b, c}) = 2/7 = 0.29, conf({c} ⇒ {a, b}) = 2/5 = 0.40. If we choose pmin = 0.50, we will have three strong rules, but zero if pmin = 0.70. 22 CG DADL (June 2023) Lecture 10 – Association Lift of a Rule Strong rules are not always meaningful or of interest to decision makers. Example – Sales of Consumer Electronics: 23 Analyze a set of transactions to identify associations between sales of color printers and sales of digital cameras.
Assume there are 1000 transactions, of which 600 include cameras, 750 include printers and 400 include both, and that the threshold values smin = 0.3 and pmin = 0.6 are used. The rule {camera} ⇒ {printer} would be a strong rule since it has support s = 400/1000 = 0.4 and confidence p = 400/600 ≈ 0.66, both of which exceed the threshold values. CG DADL (June 2023) Lecture 10 – Association Lift of a Rule (cont.) The rule seems to suggest that the purchase of a camera also induces the purchase of a printer. However, the probability of purchasing a printer is equal to 750/1000 = 0.75 and is greater than 0.66, the probability of purchasing a printer conditioned on the purchase of a camera. In fact, sales of printers and cameras show a negative correlation, since the purchase of a camera reduces the probability of buying a printer. This example highlights a shortcoming in the evaluation of the effectiveness of a rule based only on its support and confidence. 24 CG DADL (June 2023) Lecture 10 – Association Lift of a Rule (cont.) A third measure of the significance of an association rule is the lift, defined as: $l = \mathrm{lift}\{L \Rightarrow H\} = \frac{\mathrm{conf}\{L \Rightarrow H\} \times m}{f(H)} = \frac{f(L \cup H) \times m}{f(L)\, f(H)}$ 25 Lift values greater than 1: Indicate that the rule being considered is more effective than the relative frequency of the head in predicting the probability that the head is contained in some transaction of the dataset. The body and head of the rule are positively associated. Lift values less than 1: The rule is less effective than the estimate obtained through the relative frequency of the head. The body and head of the rule are negatively associated. CG DADL (June 2023) Lecture 10 – Association Lift of a Rule (cont.) In the preceding example: $l = \mathrm{lift}\{\{\text{camera}\} \Rightarrow \{\text{printer}\}\} = \frac{\mathrm{conf}\{\{\text{camera}\} \Rightarrow \{\text{printer}\}\} \times m}{f(\{\text{printer}\})} = \frac{0.66 \times 1000}{750} = 0.88$ If the lift is less than 1, the rule that negates the head, expressed as {L ⇒ (O − H)}, is more effective than the original rule: The negated rule has confidence 1 − conf{L ⇒ H}. 26 Therefore the lift of the negated rule is greater than 1. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm A dataset D of m transactions defined over a set O of n objects may contain up to $2^n - 1$ frequent itemsets. In real-world applications, n is likely to be at least of the order of a few dozen objects. Complete enumeration of the itemsets is not practical. The Apriori algorithm is a more efficient method of extracting strong rules: 27 In the first phase, the algorithm generates the frequent itemsets in a systematic way, without exploring the space of all candidates. In the second phase, it extracts the strong rules. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm (cont.) Apriori principle – If an itemset is frequent, then all its subsets are also frequent. Example: Assuming that the itemset {a, b, c} is frequent, it is clear that each transaction containing {a, b, c} should also contain each of its six proper subsets: 28 2-itemsets {a, b}, {a, c} and {b, c}. 1-itemsets {a}, {b} and {c}. Therefore, these six itemsets are also frequent. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm (cont.) There is also a consequence of the Apriori principle that is fundamental in reducing the search space: If we suppose that the itemset {a, b, c} is not frequent, then 29 each of the itemsets containing it must in turn not be frequent.
Thus, once a non-frequent itemset has been identified in the course of the algorithm, it is possible to eliminate all itemsets with a greater cardinality that contain it. This significantly increases overall efficiency. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Frequent Itemsets 30 1. The transactions in the dataset are scanned to compute the relative frequency of each object. Objects with a frequency lower than the support threshold smin are discarded. At the end of this step, the collection of all frequent 1-itemsets has been generated. 2. The iteration counter is set to k = 2. The candidate k-itemsets are iteratively generated starting from the (k – 1)-itemsets, determined during the previous iteration. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Frequent Itemsets (cont.) 31 3. The support of each candidate k-itemset is computed by scanning through all transactions included in the dataset. 4. Candidates with a support lower than the threshold smin are discarded. 5. The algorithm stops if no k-itemset has been generated. Otherwise, set k = k + 1 and return to step 2. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Frequent Itemsets (cont.) Example – Market basket analysis: Minimum threshold for support smin = 0.2. Step 1 – Determine the relative frequency for each object in O = {a, b, c, d, e}: Source: Vercellis (2009), pp. 286 32 All 1-itemsets are frequent since their frequency is higher than the threshold smin. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Frequent Itemsets (cont.) The second iteration proceeds by generating all candidate 2-itemsets, which are obtained from the 1-itemsets. Then determine the relative frequency for each 2-itemset and compare it with the threshold value smin: Source: Vercellis (2009), pp. 286 33 CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Frequent Itemsets (cont.) The next iteration generates the candidate 3-itemsets that can be obtained from the frequent 2-itemsets. Then determine the relative frequency for each 3-itemset and compare it with the threshold value smin: Source: Vercellis (2009), pp. 286 34 Since there are no candidate 4-itemsets, the procedure stops. The total number of frequent itemsets is 5 + 6 + 2 = 13. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Strong Rules 35 1. The list of frequent itemsets generated during the first phase is scanned. If the list is empty, the procedure stops. Otherwise, let B be the next itemset to be considered, which is then removed from the list. 2. The set B of objects is subdivided into two non-empty disjoint subsets L and H = B − L, according to all possible combinations. There are altogether $2^k - 2$ candidate association rules, excluding the rules having the body or head as the empty set. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Strong Rules (cont.) 36 3. For each candidate rule L ⇒ H, the confidence is computed as: $p = \mathrm{conf}\{L \Rightarrow H\} = \frac{f(B)}{f(L)}$ Note that f(B) and f(L) are already known at the end of the first phase, so there is no need to rescan the dataset to compute them. 4. If p ≥ pmin the rule is included in the list of strong rules; otherwise, it is discarded. CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Strong Rules (cont.) Example – Market basket analysis: Source: Vercellis (2009), pp. 288
37 CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Strong Rules (cont.) The number of operations required by the Apriori algorithm grows exponentially as the number of objects, i.e., n, increases. To generate a frequent 100-itemset, we need to examine at least $\sum_{h=1}^{100} \binom{100}{h} = 2^{100} - 1 \approx 10^{30}$ candidate itemsets. 38 CG DADL (June 2023) Lecture 10 – Association Apriori Algorithm – Generation of Strong Rules (cont.) Several tactics have been suggested to increase the efficiency: Adopt more advanced data structures for representing the transactions, such as dictionaries and binary trees. Partition the transactions into disjoint subsets: 39 Separately apply the Apriori algorithm to each partition to identify local frequent itemsets. At a later stage, starting from the local frequent itemsets, the entire set of transactions is considered to obtain the global frequent itemsets and the corresponding strong rules. Randomly extract a significant sample of transactions and obtain strong association rules for the extracted sample with greater efficiency but at the expense of possible inaccuracies. CG DADL (June 2023) Lecture 10 – Association Binary, Asymmetric and One-dimensional Association Rules The association rules that we have discussed in the preceding section are binary, asymmetric and one-dimensional. Binary: The rules refer to the presence or absence of each object in the transactions that make up the dataset. • 1 indicates an object's presence in a transaction. • 0 indicates an object's absence from a transaction. • Assuming that we have a large number of objects and transactions, we should observe more 0s than 1s since each transaction is likely to involve a small subset of objects. 40 CG DADL (June 2023) Lecture 10 – Association Binary, Asymmetric and One-dimensional Association Rules (cont.) Asymmetric: There is an implicit assumption that the presence of an object in a single transaction, corresponding to a purchase, is far more relevant than its absence, corresponding to a non-purchase. That is, we are mostly interested in the presence of the value 1, which can be found in a small number of observations. One-dimensional: The rules involve only one logical dimension of data from multi-dimensional data warehouses and data marts. In this example, we only look at the product dimension. It is possible to identify more general association rules that may be useful across a range of applications. 41 CG DADL (June 2023) Lecture 10 – Association General Association Rules Association rules for symmetric binary variables: Handle binary variables for which the 0 and 1 values are equally relevant. E.g., An online registration form seeking the activation of a newsletter service or asking whether the customer wishes to receive a loyalty card. Transform each symmetric binary variable into two asymmetric binary variables, which correspond to the presence of a given value and the absence of a given value. E.g., Activation of a newsletter service: 42 Two asymmetric binary variables, consent and do not consent, each with the values {0, 1}. CG DADL (June 2023) Lecture 10 – Association General Association Rules (cont.) Association rules for categorical variables: Transform each categorical variable by introducing a set of asymmetric binary variables, equal in number to the levels of the categorical variable. Each binary variable takes the value 1 if in the corresponding record the categorical variable assumes the level associated with it. Association rules for continuous variables: 43 E.g., Age or income of customers. Transform the continuous variables in two sequential steps.
Transform each continuous variable into a categorical variable using a suitable discretization technique. Then transform each discretized variable into a set of asymmetric binary variables. CG DADL (June 2023) Lecture 10 – Association General Association Rules (cont.) Multidimensional association rules: Data are stored in data warehouses or data marts, organized according to several logical dimensions. Association rules may thus involve multiple dimensions. E.g., Market basket analysis: 44 If a customer buys a digital camera, is 30-40 years old, and has an annual average expenditure of $300-500 on electronic equipment, then she will also buy a color printer with probability 0.78. This is a three-dimensional rule, since the body consists of three dimensions – purchased item, age and expenditure amount. The head refers to a purchase. Notice that the continuous variables age and expenditure amount have already been discretized. CG DADL (June 2023) Lecture 10 – Association General Association Rules (cont.) Multi-level association rules: 45 In some applications, association rules do not allow strong associations to be extracted due to rarefaction. E.g., Each of the items for sale can be found in too small a proportion of transactions to be included in the frequent itemsets, thus preventing the search for association rules. However, objects making up the transactions usually belong to hierarchies of concepts. E.g., Items are usually grouped by type and in turn by sales department. To remedy rarefaction, transfer the analysis to a higher level in the hierarchy of concepts. CG DADL (June 2023) Lecture 10 – Association General Association Rules (cont.) Sequential association rules: Transactions are often recorded according to a specific temporal sequence. Examples: Transactions for a loyalty card holder correspond to the sequence of sale receipts. Transactions that gather the sequence of navigation paths followed by a user are associated with the temporal sequence of the sessions. Analysts are interested in extracting association rules that take into account temporal dependencies. The algorithms used to extract general association rules usually consist of extensions of the Apriori algorithm and its variants. 46 CG DADL (June 2023) Lecture 10 – Association Example – Market Basket Analysis RapidMiner Studio: 47 The dataset consisting of five binary variables, one per object, is prepared in Excel. This dataset is imported into RPM. The "Numerical to Binary" operator is used to convert the integer variables to actual binary variables. The transformed dataset is then fed into the "FP-Growth" operator to generate the frequent itemsets with the threshold value smin = 0.2. The frequent itemsets are then fed into the "Create Association Rules" operator with the threshold value pmin = 0.55. CG DADL (June 2023) Lecture 10 – Association Example – Market Basket Analysis (cont.) 48 CG DADL (June 2023) Lecture 10 – Association Example – Market Basket Analysis (cont.) Frequent itemsets with smin = 0.2 Association rules with pmin = 0.55 49 CG DADL (June 2023) Lecture 10 – Association Example – Market Basket Analysis (cont.) Scikit Learn does not support the Apriori algorithm, so we will use Mlxtend (machine learning extensions) – see the sample script src01 and the sketch below: 50 CG DADL (June 2023) Lecture 10 – Association Example – Market Basket Analysis (cont.) Original transactions dataset 51 Frequent itemsets CG DADL (June 2023) Lecture 10 – Association Example – Market Basket Analysis (cont.)
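The sample script src01 is not reproduced here, but a minimal sketch of the Mlxtend workflow on an illustrative five-object transactions dataset (the transaction contents below are made up, not the lecture's actual data) could look as follows:

```python
# Minimal Mlxtend sketch: frequent itemsets with Apriori, then association rules.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative transactions over the objects a..e (made-up data).
transactions = [
    ["a", "c"], ["a", "b", "d"], ["b", "d"], ["a", "c", "e"], ["a", "b", "c"],
    ["b", "e"], ["a", "c", "d"], ["c", "d", "e"], ["a", "b", "c", "e"], ["b", "c"],
]

# Encode the transactions as a binary (one-hot) matrix X.
te = TransactionEncoder()
X = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with minimum support 0.2, then rules with minimum confidence 0.55.
frequent_itemsets = apriori(X, min_support=0.2, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.55)

print(frequent_itemsets)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```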
52 CG DADL (June 2023) Lecture 10 – Association Lecture 11 Text Mining Corporate Gurukul – Data Analytics using Deep Learning June 2023 Lecturer: A/P TAN Wee Kek Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35 Learning Objectives At the end of this lecture, you should understand: 1 Concepts and definitions of text mining. Text mining process. Extracting knowledge from text data. Using text mining in Python with NLTK and Scikit Learn. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Concepts 85-90% of all corporate data is in some kind of unstructured form (e.g., text). Unstructured corporate data is doubling in size every 18 months. Tapping into these information sources is not an option but a necessity in order to stay competitive. Text mining is the answer: 2 A semi-automated process of extracting knowledge from a large amount of unstructured data sources. Also known as text data mining or knowledge discovery in textual databases. CG DADL (June 2023) Lecture 11 – Text Mining Data Mining versus Text Mining Both seek novel and useful patterns. Both are semi-automated processes. The main difference lies in the nature of the data: Structured versus unstructured data. Structured data – databases. Unstructured data – Word documents, PDF files, text excerpts, XML files, etc. Text mining generally involves: 3 Imposing structure on the data. Then mining the structured data. CG DADL (June 2023) Lecture 11 – Text Mining Benefits of Text Mining The benefits of text mining are obvious, especially in text-rich data environments: E.g., law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc. Electronic communication records (e.g., email): 4 Spam filtering. Email prioritization and categorization. Automatic response generation. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Application Areas Information extraction: Identification of key phrases and relationships within text. Look for predefined sequences in text via pattern matching. Topic tracking: Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user. Summarization: 5 Summarizing a document to save time on the part of the reader. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Application Areas (cont.) Categorization: Identifying the main themes of a document. Place the document into a predefined set of categories based on those themes. Clustering: Group similar documents without having a predefined set of categories. Concept linking: 6 Connect related documents by identifying their shared concepts. Help users find information that they perhaps would not have found using traditional search methods. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Application Areas (cont.) Question answering: Find the best answer to a given question through knowledge-driven pattern matching. Text mining initiatives by Elsevier, a leading publisher of research journals. (Source: http://www.elsevier.com/connect/elsevier-updates-text-mining-policy-to-improve-access-for-researchers) 7 CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Terminologies Unstructured data (versus structured data): Structured data: Have a predetermined format. Usually organized into records with simple data values (categorical, ordinal and continuous variables) and stored in databases. Unstructured data: 8 Do not have a predetermined format.
Stored in the form of textual documents. Structured data are for computers to process while unstructured data are for humans to process and understand. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Terminologies (cont.) Corpus: A corpus is a large and structured set of texts (usually stored and processed electronically). A corpus is prepared for the purpose of conducting knowledge discovery. Terms: 9 A term is a single word or multiword phrase extracted directly from the corpus of a specific domain, using natural language processing (NLP) methods. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Terminologies (cont.) Concepts: Concepts are features generated from a collection of documents, using a manual, statistical, rule-based or hybrid categorization methodology. Compared to terms, concepts are the result of higher-level abstraction. Stemming: 10 The process of reducing inflected words to their stem (or base or root) form. For example, stemmer, stemming, stemmed are all based on the root stem. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Terminologies (cont.) Stop words: Stop words (or noise words) are words that are filtered out prior to or after processing of natural language data, i.e., text. There is no universally accepted list of stop words. Most NLP tools use a list that includes articles and other function words (a, an, the, of, etc.), auxiliary verbs (is, are, was, were, etc.) and context-specific words that are deemed not to have differentiating value. Synonyms and polysemes: Synonyms are syntactically different words (i.e., spelled differently) with identical or at least similar meanings. 11 E.g., movie, film and motion picture. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Terminologies (cont.) Polysemes (or homonyms) are syntactically identical words with different meanings. E.g., Bow can mean "to bend forward", "the front of the ship", "the weapon that shoots arrows," or "a kind of tied ribbon". Tokenizing: 12 A token is a categorized block of text in a sentence. The block of text corresponding to the token is categorized according to the function it performs. This assignment of meaning to blocks of text is known as tokenizing. A token can look like anything; it just needs to be a useful part of the structured text. E.g., an email address, URL or phone number. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Terminologies (cont.) Term dictionary: A collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus. Word frequency: The number of times a word is found in a specific document. Term frequency refers to the raw frequency of a term in a document. Term-by-document matrix (occurrence matrix or term-document matrix): 13 A common representation schema of the frequency-based relationship between the terms and documents in tabular format. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Terminologies (cont.) 14 Terms are listed in rows and documents are listed in columns. The frequency between the terms and documents is listed in the cells as integer values. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Terminologies (cont.) Singular-value decomposition (latent semantic indexing): 15 A dimensionality reduction method used to transform the term-by-document matrix to a manageable size. Generates an intermediate representation of the frequencies using a matrix manipulation method similar to principal component analysis.
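A minimal sketch of how several of these terminologies (tokenizing, stop word removal and stemming) look in NLTK; the sample sentence is made up, and the punkt and stopwords resources are downloaded on first use:

```python
# Minimal NLTK sketch: tokenizing, stop word removal and stemming.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # English stop word list

text = "Stemmer, stemming and stemmed are all based on the root stem."   # made-up example

tokens = word_tokenize(text.lower())                        # tokenizing
stop_words = set(stopwords.words("english"))                # NLTK English stop words
content_tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

stemmer = PorterStemmer()                                    # reduce words to their stem
stems = [stemmer.stem(t) for t in content_tokens]

print(content_tokens)   # e.g. ['stemmer', 'stemming', 'stemmed', 'based', 'root', 'stem']
print(stems)            # e.g. ['stemmer', 'stem', 'stem', 'base', 'root', 'stem']
```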
CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Process The three-step text mining process: Task 1 – Establish the Corpus: Collect and organize the domain-specific unstructured data. The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc. The output of Task 1 is a collection of documents in some digitized format for computer processing. Task 2 – Create the Term-Document Matrix: Introduce structure to the corpus. The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies. Task 3 – Extract Knowledge: Discover novel patterns from the term-document matrix. The output of Task 3 is a number of problem-specific classification, association and clustering models and visualizations. Feedback loops connect the later tasks back to the earlier ones. 16 CG DADL (June 2023) Lecture 11 – Text Mining Task 1 – Establish the Corpus Collect all the documents related to the context (domain of interest) being studied: May include textual documents, XML files, emails, web pages and short notes. Voice recordings may also be transcribed using speech-recognition algorithms. Text documents are transformed and organized into the same representational form (e.g., ASCII text file). Place the collection in a common place (e.g., in a flat file, or in a directory as separate files). 17 CG DADL (June 2023) Lecture 11 – Text Mining Task 2 – Create the Term-Document Matrix The digitized and organized documents (the corpus) are used to create the term-document matrix (TDM): Rows represent documents and columns represent terms. The relationships between the terms and documents are characterized by indices. An index is a relational measure that can be as simple as the number of occurrences of the term in the respective documents. The goal is to convert the corpus into a TDM where the cells are filled with the most appropriate indices. The assumption is that the essence of a document can be represented with a list of the frequencies of the terms used in that document. 18 CG DADL (June 2023) Lecture 11 – Text Mining Task 2 – Create the Term-Document Matrix (cont.) (Sample term-document matrix: Documents 1–6 as rows; extracted terms such as investment risk, project management, software engineering, development and SAP as columns; the cell values are the term occurrence counts.) 19 CG DADL (June 2023) Lecture 11 – Text Mining Task 2 – Create the Term-Document Matrix (cont.) On the one hand, not all the terms are important when characterizing the documents: Some terms, such as articles, auxiliary verbs and terms used in almost all of the documents in the corpus, have no differentiating power. This list of terms, known as stop terms or stop words, should be identified by domain experts. Stop terms should be excluded from the indexing process. On the other hand, the analyst might choose a set of predetermined terms, i.e., a term dictionary, under which the documents are to be indexed. 20 CG DADL (June 2023) Lecture 11 – Text Mining Task 2 – Create the Term-Document Matrix (cont.) To create more accurate index entries: Synonyms and specific phrases (e.g., "Eiffel Tower") can be provided. Stemming may be used such that different grammatical forms or declinations of a verb are identified and indexed as the same word. The first generation of the TDM: 21 Includes all unique terms identified in the corpus as its columns, excluding the stop terms; all documents as its rows; and the occurrence count of each term for each document as its cell values.
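A minimal sketch of building such a first-generation TDM with Scikit Learn's CountVectorizer; the three short documents are made up for illustration:

```python
# Minimal sketch: a first-generation term-document matrix with raw occurrence counts.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus of three short documents.
corpus = [
    "Project management involves investment risk.",
    "Software engineering and software development.",
    "SAP development and project management.",
]

# English stop words are excluded from the indexing process.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)       # sparse documents-by-terms matrix

tdm = pd.DataFrame(counts.toarray(),
                   index=[f"Document {i + 1}" for i in range(len(corpus))],
                   columns=vectorizer.get_feature_names_out())
print(tdm)   # rows = documents, columns = terms, cells = occurrence counts
```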
CG DADL (June 2023) Lecture 11 – Text Mining Task 2 – Create the Term-Document Matrix (cont.) For a large corpus, the TDM will contain a very large number of terms: This increases processing time. It can also lead to inaccurate patterns. We need to decide: 22 What is the best representation of the indices? How can we reduce the dimensionality of the TDM to a manageable size? CG DADL (June 2023) Lecture 11 – Text Mining Representing the Indices The raw term frequencies generally reflect how salient or important a term is in each document: Terms that occur with greater frequency in a document are better descriptors of the contents of that document. However, term counts themselves might not be proportional to their importance as descriptors of the documents. E.g., a term that occurs one time in document A but three times in document B does not necessarily mean that this term is three times as important a descriptor of B compared to A. These raw indices need to be normalized to obtain a more consistent TDM for further analysis. 23 CG DADL (June 2023) Lecture 11 – Text Mining Representing the Indices (cont.) Other than the actual frequency, the numerical representation between terms and documents can be normalized with several methods. Log frequencies: This transformation "dampens" the raw frequencies and how they affect the results of subsequent analysis. Given that wf is the raw word (or term) frequency and f(wf) is the result of the log transformation: $f(wf) = \begin{cases} 1 + \log(wf) & \text{if } wf > 0 \\ 0 & \text{if } wf = 0 \end{cases}$ 24 This transformation is applied to all raw frequencies in the TDM that are greater than 0. CG DADL (June 2023) Lecture 11 – Text Mining Representing the Indices (cont.) Binary frequencies: This is a simple transformation to enumerate whether a term is used in a document: $f(wf) = \begin{cases} 1 & \text{if } wf > 0 \\ 0 & \text{if } wf = 0 \end{cases}$ 25 The resulting TDM will contain 1s and 0s to indicate the presence or absence of the respective term. This transformation also dampens the effect of the raw frequency counts on subsequent computation and analyses. CG DADL (June 2023) Lecture 11 – Text Mining Representing the Indices (cont.) Inverse document frequencies: Sometimes, it may be useful to consider the relative document frequencies (df) of different terms. Example: A term such as guess may occur frequently in all documents, while another term such as software may appear only a few times. 26 One might make guesses in various contexts regardless of the specific topic, whereas software is a more semantically focused term that is only likely to occur in documents that deal with computer software. Inverse document frequency is a transformation that deals with the specificity of words (document frequencies) as well as the overall frequencies of their occurrences (term frequencies). CG DADL (June 2023) Lecture 11 – Text Mining Representing the Indices (cont.) This transformation for the ith term and jth document is defined as: $\mathrm{idf}(i, j) = \begin{cases} 0 & \text{if } wf_{ij} = 0 \\ (1 + \log(wf_{ij}))\, \log\frac{N}{df_i} & \text{if } wf_{ij} \geq 1 \end{cases}$ where: N is the total number of documents. dfi is the document frequency for the ith term, i.e., the number of documents that include this term. This transformation includes both the dampening of the simple term frequencies via the log function and a weighting factor that: 27 Evaluates to 0 if the word occurs in all documents, i.e., log(N/N) = log 1 = 0. Evaluates to the maximum value when a word only occurs in a single document, i.e., log(N/1) = log N.
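A minimal NumPy sketch of the three index transformations above, implemented directly from the definitions given here (scikit-learn's TfidfTransformer uses a slightly different idf formula, so it is not used); the small count matrix is made up:

```python
# Minimal sketch: log, binary and inverse-document-frequency transformations of a
# raw term-document count matrix, following the definitions above.
import numpy as np

# Made-up raw counts: 4 documents (rows) x 3 terms (columns).
wf = np.array([[1, 0, 3],
               [0, 2, 0],
               [1, 1, 0],
               [4, 0, 1]], dtype=float)

# Log frequencies: f(wf) = 1 + log(wf) where wf > 0, else 0.
log_freq = np.zeros_like(wf)
mask = wf > 0
log_freq[mask] = 1 + np.log(wf[mask])

# Binary frequencies: 1 where wf > 0, else 0.
binary_freq = mask.astype(float)

# Inverse document frequencies: (1 + log(wf_ij)) * log(N / df_i) where wf_ij >= 1, else 0.
N = wf.shape[0]                  # total number of documents
df = mask.sum(axis=0)            # document frequency of each term
idf = log_freq * np.log(N / df)  # log_freq is already 0 wherever wf == 0

print(log_freq, binary_freq, idf, sep="\n\n")
```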
CG DADL (June 2023) Lecture 11 – Text Mining Reducing the Dimensionality of the Matrix The TDM is often very large and rather sparse (most of the cells are filled with 0s). There are three options to reduce the dimensionality of this matrix to a manageable size: Use a domain expert to go through the list of terms and eliminate those that do not make much sense for the study's context. 28 This is a manual process that might be labor-intensive. Eliminate terms with very few occurrences in very few documents. Transform the matrix using singular value decomposition (SVD). CG DADL (June 2023) Lecture 11 – Text Mining Reducing the Dimensionality of the Matrix (cont.) Singular value decomposition (SVD): 29 Closely related to principal components analysis. Reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower-dimensional space. Each consecutive dimension represents the largest degree of variability (between words and documents) possible. Identify the two or three most salient dimensions that account for most of the variability (differences) between the words and documents. Once the salient dimensions are identified, the underlying "meaning" of what is contained in the documents is extracted. CG DADL (June 2023) Lecture 11 – Text Mining Task 3 – Extract the Knowledge Using the well-structured TDM: We can extract novel patterns within the context of the specific problem. The knowledge extraction process may potentially be augmented with other structured data elements. The main categories of knowledge extraction methods are: 30 Classification. Clustering. Association. CG DADL (June 2023) Lecture 11 – Text Mining Classification Classify a given data instance into a predetermined set of categories (or classes). In text mining, this is known as text categorization. For a given set of categories (subjects, topics or concepts) and a collection of text documents, the goal is to find the correct category for each document. Automated text classification may be applied to: 31 Indexing of text. Spam filtering. Web page categorization under hierarchical catalogs. Automatic generation of metadata. Detection of genre, and many others. CG DADL (June 2023) Lecture 11 – Text Mining Classification (cont.) Two main approaches: 32 Knowledge engineering – An expert's knowledge about the categories is encoded into the system either declaratively or in the form of procedural classification rules. Machine learning – A general inductive process builds a classifier by learning from a set of pre-classified examples. The increasing number of documents and the decreasing availability of knowledge experts have resulted in a trend towards machine learning. CG DADL (June 2023) Lecture 11 – Text Mining Clustering An unsupervised process whereby objects are placed into "natural" groups called clusters. In classification, descriptive features of the classes in pre-classified training objects are used to classify a new object. In clustering, an unlabeled collection of objects is grouped into meaningful clusters without any prior knowledge. Clustering is useful in a wide range of applications: 33 Document retrieval. Enabling better web content searches. CG DADL (June 2023) Lecture 11 – Text Mining Clustering (cont.) Analysis and navigation of very large text collections such as web pages is a prominent application of clustering: 34 The underlying assumption is that relevant documents tend to be more similar to each other than to irrelevant ones.
Clustering of documents based on content similarity thus improves search effectiveness. CG DADL (June 2023) Lecture 11 – Text Mining Association Generating association rules involves identifying the frequent sets that go together. In text mining, association refers to the direct relationships between concepts (terms) or sets of concepts. A concept association rule A ⇒ C [S, C] relates two frequent concept sets A and C with support S and confidence C: 35 Support is the proportion of documents that include the concepts in A and C. Confidence is the proportion of documents that include all the concepts in C within the subset of documents that include A. CG DADL (June 2023) Lecture 11 – Text Mining Association (cont.) Example: 36 In a document collection, the concept "Software Implementation Failure" may appear most often in association with "Enterprise Resource Planning" and "Customer Relationship Management" with support 0.04 and confidence 0.55. This means that 4% of all documents had all three concepts represented in the same document. Of the documents that included "Software Implementation Failure", 55% of them also included "Enterprise Resource Planning" and "Customer Relationship Management". CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Example – SMS Spam Filtering The SMS spam collection dataset is taken from the UCI Machine Learning repository: 37 https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection The SMS messages are extracted from various sources such as the NUS SMS Corpus (NSC). The authors have classified each SMS as either spam or ham (i.e., a legitimate SMS message). The pre-classification by domain experts allows us to perform supervised learning. In contrast, most SMS message corpora such as the NSC might not have been pre-classified. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Example – SMS Spam Filtering (cont.) We can skip over Task 1 of establishing the corpus since it is already provided. Task 2 of creating the Term-Document Matrix (TDM) can be done in Python using NLTK (Natural Language Toolkit) and Scikit Learn: 38 Tokenize each SMS message. Lowercase the SMS message and remove special characters and leading/trailing spaces. Remove stop words using NLTK's list of English stop words. Perform spelling correction and lemmatization using NLTK. Generate a word list from all the SMS messages and their respective raw term frequencies using Scikit Learn's CountVectorizer. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Example – SMS Spam Filtering (cont.) The actual TDM exported to a TSV file, viewed in Excel: 39 CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Example – SMS Spam Filtering (cont.) Task 3 of extracting knowledge is based on the TDM with binary frequencies: Classification – Performed with a classification tree, a Naïve Bayes classifier and a neural network. Association rules – Binary, asymmetric and one-dimensional association rules were extracted. The classification models allow us to classify future SMS messages as spam or ham based on the presence/absence of key words/terms. 40 CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Example – SMS Spam Filtering (cont.)
Using a decision tree classifier with a maximum depth of 5, we can obtain fairly good predictions: Training accuracy was 93%. Test accuracy was 93%. The true positive rate (recall) for the test data was 93% (48% for the spam class). Precision was 93%. If we allow unlimited depth: 41 Test accuracy would be 96%. The true positive rate would be 96% (80% for the spam class). Precision would be 96%. CG DADL (June 2023) Lecture 11 – Text Mining Text Mining Example – SMS Spam Filtering (cont.) The decision tree also provided some fairly intuitive rules for classifying whether an SMS message is spam: • Full view of the decision tree with a maximum depth of 5, generated from the 80% training dataset, with a prediction accuracy of 93% (the true positive rate of classifying spam is 48%). • E.g., a new SMS message containing the words "call" and "claim" would be classified as spam. Another SMS message containing the word "call" but not "claim" would be classified as ham. • A decision tree with unlimited depth would be more accurate but the rules would be less interpretable. 42 CG DADL (June 2023) Lecture 11 – Text Mining
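The exact scripts used for this example are not reproduced in these notes. A minimal sketch of the kind of pipeline described above, using a binary-frequency TDM and a depth-5 decision tree, might look as follows; the file name, column layout and 80/20 split are assumptions, and spelling correction and lemmatization are omitted for brevity:

```python
# Minimal sketch: SMS spam classification with a binary term-document matrix and a
# decision tree of maximum depth 5 (simplified; the file path/column layout are assumptions).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Assumed tab-separated file: one label ('ham'/'spam') and one message text per row.
sms = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "text"])

# Binary frequencies: 1/0 for the presence/absence of each term.
vectorizer = CountVectorizer(lowercase=True, stop_words="english", binary=True)
X = vectorizer.fit_transform(sms["text"])
y = sms["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

print("Training accuracy:", round(accuracy_score(y_train, tree.predict(X_train)), 2))
print("Test accuracy:", round(accuracy_score(y_test, tree.predict(X_test)), 2))
print(classification_report(y_test, tree.predict(X_test)))   # per-class precision/recall
```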