AMS1301 Foundations of Data Science
Chapter 1 - Descriptive Statistics

Bachelor of Science (Hons) in
• Actuarial Studies and Insurance (BSC-AIN)
• Data Science and Business Intelligence (BSC-DSBI)

What is Data Science?
A collection of analytical skills and techniques derived from mathematics, statistics and computer science for extracting information from data.

Why Data Science?
1. The world is data-driven (Octopus card, Google, etc.).
2. Demand for data scientists is increasing worldwide.
3. Reports from Ernst & Young and the McKinsey Global Institute: analyzing business data sets will become a key basis of competition in the future.

1.1 Statistical Terminology

Statistics is a branch of mathematics that transforms numbers into useful information for decision making. It
1. lets you know the risk in a business;
2. allows you to understand and reduce variation so as to make an appropriate business decision.

Statistics involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information.

Why Statistics?
1. To summarize business data
2. To draw conclusions from the data
3. To make reliable forecasts about business activities
4. To improve business processes

1.1.1 Basic Terminology
1. Variable: a characteristic of an item or individual
2. Data: values associated with a variable (singular: datum)
3. Population: the set of all elements of interest for a given problem
4. Sample: a portion of a population selected for analysis
5. Sampling: the process of selecting samples from the population
6. Census: a study of every element of the population
7. Survey: a study of part of the elements (a sample) of the population
8. Parameter: a descriptive measure of the population, e.g. $\mu$ and $\sigma$
9. Statistic: a descriptive measure of the sample, e.g. $\bar{x}$ and $s$

Descriptive statistics uses numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present that information in a convenient form.

Inferential statistics draws conclusions about the characteristics of the population based on the sample data.

[Diagram: sampling takes population data (described by a parameter) to sample data; calculation / analysis of the sample data gives a statistic; statistical inference uses the statistic to draw conclusions about the parameter.]

1.1.2 Types of Data

Different types of data are summarized in different ways. Before we move on to the ways of converting data into useful information, we need to be able to classify the various types of data.

Data
   Quantitative / Numerical: Ratio, Interval
   Qualitative / Categorical: Ordinal, Nominal

Categorical data:
(1) Also known as qualitative data.
(2) Data are separated into categories, e.g. yes/no, true/false, male/female, etc.

Numerical data:
(1) Also known as quantitative data.
(2) Data are represented by numerical values, e.g. height, weight, temperature, income, distance, etc.

Numerical data can be further identified as being either discrete or continuous.

Discrete variable: produces numerical responses that arise from a counting process.
e.g.: How many modules have you taken in this semester? (a whole number between 0 and 7)

Continuous variable: produces numerical responses that arise from a measuring process.
e.g.: How tall are you? (any value in a range such as 170 cm to 190 cm)

Example 1.1 Categorical vs Numerical Data

Question | Response | Data type
1. Do you currently have a profile on Facebook? | Yes/No | Categorical
2. How many modules have you taken in this semester? | 5 | Numerical
3. How long did you take to travel from home to University? | 30 mins | Numerical

Statisticians use the terms
(1) nominal scale and ordinal scale to describe categorical variables;
(2) interval scale and ratio scale to describe numerical variables.

Nominal data (Categorical)
(1) No particular order (e.g. yes/no, true/false).
(2) Information is obtained by counting the number of occurrences.
(3) Numbers, letters, symbols, colours, etc. are used to represent nominal data.

Ordinal data (Categorical)
(1) Ranked categorical items, e.g. fail/pass/credit/distinction.
(2) Information is lost if the categories are re-organized, e.g. fail/pass/credit/distinction → fail/satisfactory/distinction.

Example 1.2 Nominal vs Ordinal Data

Categorical variable | Categories | Data type
1. Gender | Male / Female | Nominal
2. Profile on Facebook? | Yes / No | Nominal
3. Types of investments | Cash / Stocks / Bonds / None | Nominal
4. Rating of the bus services | Excellent to Poor | Ordinal
5. Student grade | A, A-, …, D, U | Ordinal
6. Standard & Poor's bond credit ratings | AAA, AA, A, BBB, … | Ordinal

Remark: We often record categorical data by arbitrarily assigning a number to each category. For instance, Male = 1, Female = 2.

Interval data (Numerical)
(1) Ordered data.
(2) The difference between measurements is a meaningful quantity, but the scale does not have a true zero point (the zero is arbitrary).
e.g.: The difference between a temperature of 100 degrees and 90 degrees is the same as that between 90 degrees and 80 degrees.

Ratio data (Numerical)
(1) Have all the properties of an interval variable.
(2) Have a clear definition of 0.0. When the variable equals 0.0, there is none of that variable (a true zero point).
e.g.: Temperature (in F or C) is interval, but temperature (in Kelvin) is ratio because 0.0 Kelvin means "no thermal energy".
e.g.: Height, weight, age and length are ratio.

Example 1.3 Interval vs Ratio Data

Numerical variable | Data type
1. Temperature (in C or F) | Interval
2. Temperature (in Kelvin) | Ratio
3. Time | Ratio
4. Weight | Ratio
5. Age | Ratio

Remark: The distinction between interval and ratio data is subtle, but fortunately it is often not important. For most statistical purposes, ratio and interval data are treated in the same way.
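The interval/ratio distinction can be seen numerically. The sketch below is only an illustration: the two temperatures (10 and 20 degrees Celsius) and the helper function names are assumptions, not part of the lecture note. It shows that the ratio "warm / cold" changes with the unit on the interval scales (C and F), so only the Kelvin ratio can be interpreted.

```python
# Sketch: ratios are only meaningful on a ratio scale.
# The temperatures and function names are illustrative assumptions.

def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return c + 273.15


t_cold_c, t_warm_c = 10.0, 20.0  # hypothetical readings in Celsius

# The same two temperatures expressed on three scales.
scales = {
    "Celsius": (t_cold_c, t_warm_c),
    "Fahrenheit": (celsius_to_fahrenheit(t_cold_c), celsius_to_fahrenheit(t_warm_c)),
    "Kelvin": (celsius_to_kelvin(t_cold_c), celsius_to_kelvin(t_warm_c)),
}

# Ratio warm / cold on each scale.
ratios = {name: warm / cold for name, (cold, warm) in scales.items()}

for name, ratio in ratios.items():
    print(f"{name}: warm / cold = {ratio:.3f}")
```

On the Celsius scale the warm reading looks "twice" the cold one, but the same two temperatures give a different ratio in Fahrenheit and in Kelvin. Because only Kelvin has a true zero, only its ratio has a physical meaning.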
1.2 Measures of Central Tendency and Variability

1.2.1 Measures of Central Location

Three different measures
-- Arithmetic mean
-- Median
-- Mode

1.2.1.1 Arithmetic Mean

Let $x_1, x_2, \dots, x_n$ be observations in a sample (ungrouped data), where $n$ is the sample size. The sample mean is denoted by $\bar{x}$ (X-bar). In a population, the number of observations is $N$ and the population mean is $\mu$.

Population mean:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Sample mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Example 1.4
The numbers of TV watching hours per week of 5 students randomly selected from a class are:
5 7 3 8 7

(Solution)
The sample mean is
$$\bar{x} = \frac{1}{5}\sum_{i=1}^{5} x_i = \frac{5+7+3+8+7}{5} = 6$$

Remark: Note that the mean is very sensitive to extreme values (outliers). For instance, if 8 is changed to 38, the mean changes substantially from 6 to 12.

For Grouped data

Suppose $n$ observations are grouped into $k$ classes c1 – d1, c2 – d2, …, ck – dk with frequencies $f_1, f_2, \dots, f_k$, and the class marks (the mid-value of each class) are $x_1, x_2, \dots, x_k$.

Class | Class mark | Frequency
c1 – d1 | x1 | f1
c2 – d2 | x2 | f2
… | … | …
ck – dk | xk | fk

Sample mean:
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_k x_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$

Remark: We use the mid-value of a class to represent that class.

Example 1.5
Suppose the weights (in kg) of 100 students are tabulated in the following table. Find the mean weight of the students.

Weight (kg) | Class mark | Frequency
30-34 | 32 | 2
35-39 | 37 | 8
40-44 | 42 | 15
45-49 | 47 | 30
50-54 | 52 | 23
55-59 | 57 | 16
60-64 | 62 | 6

(Solution)
$$\bar{x} = \frac{\sum_{i=1}^{7} f_i x_i}{\sum_{i=1}^{7} f_i} = \frac{2(32) + 8(37) + \dots + 6(62)}{2 + 8 + \dots + 6} = 48.8$$

1.2.1.2 Median

The median is found by placing all the observations in order (ascending or descending). The observation that falls in the middle is the median. For $n$ observations, we have two cases.
➢ When $n$ is odd, the median is the $\frac{n+1}{2}$th ranked value.
➢ When $n$ is even, the median is the average of the $\frac{n}{2}$th and the $\left(\frac{n}{2}+1\right)$th ranked values.

Example 1.6
Compute the median for each of the following sequences of numbers.
1. 1.2 1.4 1.8 2.1 2.7 3.5 3.9
2. 1.2 1.4 1.8 2.1 2.7 3.5 3.9 4.1

(Solution)
For the 1st sequence, $n = 7$, thus the median is the 4th ranked value, 2.1.
For the 2nd sequence, $n = 8$, thus the median is $\frac{1}{2}(2.1 + 2.7) = 2.4$.

1.2.1.3 Mode

The mode is defined as the observation (or observations) that occurs with the greatest frequency.

Remark: Note that a distribution may have more than one mode. If every value appears only once, then there is no mode.

Example 1.7
Find the mode of the data.
29 31 35 39 39 40 43 44 44 52

(Solution)
There are two modes, 39 and 44; each appears twice.

Some other measures of central tendency are also useful. The geometric mean is used whenever we wish to find the "average" growth rate, or rate of change, in a variable over time.

1.2.1.4 Which is Best?

The mean is generally our first choice.
-- simple and easy to compute.
Sometimes the median is better.
-- not sensitive to extreme values.
The mode is seldom the best measure of central location.

1.2.2 Measures of Variability

Data can be characterized by its variation and shape. Variation measures the spread or dispersion of values in a data set.

1.2.2.1 Range

The range is the difference between the largest and the smallest values in a data set. The larger the range, the larger the variation of the data.

Range = Largest observation – Smallest observation

Advantage:
➢ Simplicity
Disadvantage:
➢ Simplicity
-- calculated from only two observations.
-- tells us nothing about the other observations.

Example 1.8
Find the range for the following sets of data:
Set 1: 4 4 4 4 4 50
Set 2: 4 8 15 24 39 50

(Solution)
The range of both sets is 50 – 4 = 46, even though the two sets are spread very differently.

1.2.2.2 Interquartile Range

Quartiles
Quartiles split a set of ranked data into four equal parts.
Q1: the middle value between the smallest observation and the median
Q2: the median
Q3: the middle value between the largest observation and the median

[Diagram: ranked data split by Q1, Q2 and Q3 into four parts, each containing 25% of the data.]

$Q_1$ = the $\frac{n+1}{4}$th ranked value
$Q_3$ = the $\frac{3(n+1)}{4}$th ranked value

Remark: Note that there are different ways to find quartiles.
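Before going further with quartiles, the measures of central location and the range introduced above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module on the data of Examples 1.4 to 1.8; the variable names and the helper `data_range` are assumptions made for illustration, and `statistics.multimode` assumes Python 3.8 or later.

```python
import statistics

# Example 1.4: weekly TV hours, and the same data with 8 replaced by 38.
tv_hours = [5, 7, 3, 8, 7]
tv_hours_outlier = [5, 7, 3, 38, 7]
mean_tv = statistics.mean(tv_hours)
mean_tv_outlier = statistics.mean(tv_hours_outlier)
median_tv = statistics.median(tv_hours)
median_tv_outlier = statistics.median(tv_hours_outlier)

# Example 1.5: grouped mean from class marks and frequencies.
class_marks = [32, 37, 42, 47, 52, 57, 62]
frequencies = [2, 8, 15, 30, 23, 16, 6]
grouped_mean = sum(f * x for f, x in zip(frequencies, class_marks)) / sum(frequencies)

# Example 1.6: median for an odd and an even number of observations.
seq_odd = [1.2, 1.4, 1.8, 2.1, 2.7, 3.5, 3.9]
seq_even = seq_odd + [4.1]
median_odd = statistics.median(seq_odd)
median_even = statistics.median(seq_even)

# Example 1.7: multimode returns every value with the highest frequency.
modes = statistics.multimode([29, 31, 35, 39, 39, 40, 43, 44, 44, 52])


def data_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)


# Example 1.8: two very different data sets with the same range.
range_set1 = data_range([4, 4, 4, 4, 4, 50])
range_set2 = data_range([4, 8, 15, 24, 39, 50])

print("mean:", mean_tv, "-> with outlier:", mean_tv_outlier)
print("median:", median_tv, "-> with outlier:", median_tv_outlier)
print("grouped mean:", round(grouped_mean, 1))
print("medians:", median_odd, round(median_even, 1))
print("modes:", modes)
print("ranges:", range_set1, range_set2)
```

The outlier moves the mean from 6 to 12 while the median stays at 7, which is the sense in which the median is "not sensitive to extreme values".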
In general, the $p$th $q$-tile is

$Q_p$ = the $\frac{p(n+1)}{q}$th ranked value

If $q = 4$, it is called a quartile. If $q = 10$, it is called a decile. If $q = 100$, it is called a percentile.

Some rules to compute the $p$th $q$-tile (here the rank position is $p(n+1)/q$)
➢ If the rank position is an integer, the $q$-tile is simply the measurement at that ranked value.
e.g., $q = 4$, $n = 7$: the rank position of $Q_1$ is 2 → the 2nd ranked value
➢ If the rank position is a fractional half (e.g. 2.5, 3.5, 4.5, etc.), the $q$-tile is the average of the two ranked values involved.
e.g., $q = 4$, $n = 9$: the rank position of $Q_1$ is 2.5 → the average of the 2nd and the 3rd ranked values
➢ If the rank position is neither an integer nor a fractional half, we round it to the nearest integer and the $q$-tile is the measurement at that ranked value.
e.g., $q = 4$, $n = 10$: the rank position of $Q_1$ is 2.75 → the 3rd ranked value

Interquartile Range
➢ Difference between the 3rd quartile and the 1st quartile of the data set.
➢ It measures the spread in the middle 50% of the data.
➢ Not influenced by extreme values.

Interquartile range: IQR = Q3 – Q1

Example 1.9
The readings of diastolic blood pressure (mm Hg) of 16 randomly selected males are:
66 70 74 75 79 81 81 82 85 91 91 93 95 99 99 100
(a) Find the mean, mode, median, Q1 and Q3.
(b) Find also the 10th and 85th percentiles of the readings.

(Solution)
(a) mean = 85.0625; mode = 81, 91, and 99
median = $\frac{82 + 85}{2} = 83.5$
$Q_1$ = the $\frac{16+1}{4}$th ranked value = the 4.25th ranked value → the 4th ranked value = 75
$Q_3$ = the $\frac{3(16+1)}{4}$th ranked value = the 12.75th ranked value → the 13th ranked value = 95
Interquartile range: IQR = Q3 – Q1 = 95 – 75 = 20

(b) The 10th and 85th percentiles of the readings:
Recall: $Q_p$ = the $\frac{p(n+1)}{q}$th ranked value

$Q_{10}$ = the $\frac{10(16+1)}{100}$th ranked value = the 1.7th ranked value → the 2nd ranked value = 70
$Q_{85}$ = the $\frac{85(16+1)}{100}$th ranked value = the 14.45th ranked value → the 14th ranked value = 99

1.2.2.3 Variance and Standard Deviation (Ungrouped data)

They are used to measure variability.

Population variance:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

Population standard deviation:
$$\sigma = \sqrt{\sigma^2}$$

Sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$$

Sample standard deviation:
$$s = \sqrt{s^2}$$

Characteristics of the Range, Variance and Standard Deviation
➢ The greater (smaller) the spread or dispersion of the data, the larger (smaller) the range, variance and standard deviation.
➢ If the observations are all the same, then there is no variation in the data. Thus, the range, variance and standard deviation are all equal to zero.
➢ All these measures are non-negative.

Example 1.10
Consider a set of sample data:
70 74 75 79 81 81 82
Find the variance and standard deviation.

(Solution)
$s^2 = 20.2857$
$s = 4.5040$

For Grouped data

Sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2\right]$$

where $n = \sum_{i=1}^{k} f_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i$.

Example 1.11
The daily overflow call data for the last 50 business days are as follows:

Daily overflow calls | Frequency
1-15 | 14
16-30 | 21
31-45 | 8
46-60 | 4
61-75 | 3

Find the mean, variance, and standard deviation of the sample.

(Solution)

Daily overflow calls | Class mark (x) | Frequency (f)
1-15 | 8 | 14
16-30 | 23 | 21
31-45 | 38 | 8
46-60 | 53 | 4
61-75 | 68 | 3

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{14(8) + 21(23) + \dots + 3(68)}{14 + 21 + \dots + 3} = 26.3$$

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2\right] = \frac{1}{50-1}\left(48665 - 50 \times 26.3^2\right) = 287.3571$$

$s = 16.9516$

1.3 Covariance and Correlation Coefficient

1.3.1 Measures of Linear Relationship

-- Two variables x and y have a linear relationship if y = mx + c.
-- We are interested in the direction and strength of the linear relationship between two variables.
-- Three measures: covariance, coefficient of correlation and coefficient of determination.

1.3.2 Covariance

Let X and Y be two variables with corresponding sample data $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$, respectively.

Population covariance:
$$\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$

Sample covariance:
$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right]$$

where $\mu_x$ and $\mu_y$ are the population means of X and Y, respectively, and $\bar{x}$ and $\bar{y}$ are the sample means.

To illustrate how covariance measures linear relationship, consider the following 3 sample data sets.

Set 1:
-- As x increases, y increases.
-- The products $(x_i - \bar{x})(y_i - \bar{y})$ are mostly positive, and thus $s_{xy} > 0$.
-- When X and Y move in the same direction (both increase or both decrease), the covariance will be a large positive number.

Set 2:
-- As x increases, y decreases.
-- The products $(x_i - \bar{x})(y_i - \bar{y})$ are mostly negative, and thus $s_{xy} < 0$.
-- When X and Y move in opposite directions, the covariance will be a large negative number.

Set 3:
-- As x increases, y does not exhibit any particular pattern.
-- When there is no particular pattern, the covariance is a small number.

The covariance gives two pieces of information:
(1) The sign of the covariance tells us the nature of the relationship (i.e., positive linear relationship or negative linear relationship).
(2) The magnitude describes the strength of the association between X and Y. The larger the covariance in absolute value, the stronger the linear relationship.

However, how large should the covariance be before we can say that the two variables have a strong linear relationship? To answer this we need another measure, called the coefficient of correlation.

1.3.3 Coefficient of Correlation

Population correlation:
$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Sample correlation:
$$r = \frac{s_{xy}}{s_x s_y}$$

where $\sigma_x$ and $\sigma_y$ are the population standard deviations of X and Y, respectively; $s_x$ and $s_y$ are the sample standard deviations of X and Y, respectively. In addition, we have $-1 \le \rho \le +1$ and $-1 \le r \le +1$.

Drawback: The correlation is hard to interpret. E.g., if r = 0.3, we can only say that the linear relationship is weak.
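The sample covariance and correlation formulas above can be sketched in a few lines of Python. The three small data sets below are assumptions made for illustration; they are not the data sets of the lecture note, although they were chosen so that the covariances (17.5, -17.5 and -3.5) agree with the values quoted in Example 1.12. The function names are also hypothetical.

```python
import math


def sample_covariance(xs, ys):
    """s_xy = sum((x_i - x_bar)(y_i - y_bar)) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_sd(xs):
    """Sample standard deviation: the square root of s_xx."""
    return math.sqrt(sample_covariance(xs, xs))


def sample_correlation(xs, ys):
    """r = s_xy / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (sample_sd(xs) * sample_sd(ys))


# Hypothetical data, NOT taken from the lecture note.
x = [2, 6, 7]
y_up = [13, 20, 27]      # y increases with x
y_down = [27, 20, 13]    # y decreases as x increases
y_mixed = [20, 27, 13]   # no particular pattern

for label, y in [("up", y_up), ("down", y_down), ("mixed", y_mixed)]:
    print(label, sample_covariance(x, y), round(sample_correlation(x, y), 4))

# Changing the unit of x (multiplying by 100) rescales the covariance
# but leaves the correlation unchanged.
x_rescaled = [100 * value for value in x]
cov_rescaled = sample_covariance(x_rescaled, y_up)
r_rescaled = sample_correlation(x_rescaled, y_up)
print("rescaled", cov_rescaled, round(r_rescaled, 4))
```

The last two lines show why the size of the covariance alone cannot tell us how strong the relationship is: it depends on the units of measurement, whereas r always lies between -1 and +1.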
→ Introduce another measure of the strength of the linear relationship: the coefficient of determination.

Example 1.12
Calculate the coefficient of correlation for the three sets of data illustrated in Section 1.3.2 (pages 48-50 of the lecture note).

(Solution)
Set 1: $r = \frac{s_{xy}}{s_x s_y} = \frac{17.5}{(2.6458)(7)} = 0.9449$
Set 2: $r = \frac{s_{xy}}{s_x s_y} = \frac{-17.5}{(2.6458)(7)} = -0.9449$
Set 3: $r = \frac{s_{xy}}{s_x s_y} = \frac{-3.5}{(2.6458)(7)} = -0.1890$

Remark: The correlation can also be found using a calculator.

1.3.4 Coefficient of Determination

Coefficient of determination:
$$R^2 = r^2$$

It measures the proportion of the variation in the dependent variable Y that is explained by the variation in the independent variable X in the linear equation.
(1) If $r = \pm 1$, then $R^2 = 1$ → 100% of the variation in Y is explained by the variation in X.
(2) If $r = 0$, then $R^2 = 0$ → no linear pattern → none of the variation in Y is explained by the variation in X.

In Example 1.12, Set 1: $r = 0.9449$ → $R^2 = 0.8928$.
89.28% of the variation in Y is explained by the variation in X. The remaining 10.72% is unexplained.

Remark: The sample covariance, correlation and coefficient of determination will be discussed in more detail in other AMS modules.

1.4 Graphical Representations and Comparison of Data Sets

1.4.1 Data Organization
-- Raw data
-- Data visualization (categorical or numerical)
-- Tabulation and graphical representations

1.4.2 Organizing Categorical Data

1.4.2.1 Summary Table
-- Represents the number of responses as frequencies or percentages for each category.
-- Helps to identify the differences among categories by displaying the frequency, amount, or percentage of items in a set of categories in separate columns.

1.4.2.2 Contingency Table
-- Used to study patterns that may exist between the responses of two or more categorical variables.

1.4.3 Visualizing Categorical Data

1.4.3.1 Bar Chart and Pie Chart

1.4.4 Organizing Numerical Data

1.4.4.1 Ordered Array
An ordered array arranges the values of a numerical variable in ranked order, from the smallest value to the largest value.
For instance, the ordered array of the numbers of members in 10 households, 2, 2, 2, 6, 3, 4, 2, 5, 3, 7, is 2, 2, 2, 2, 3, 3, 4, 5, 6, 7.

1.4.4.2 Frequency Distribution
-- Counts the number of numerical observations that fall into each of a series of intervals, called classes.
-- In general, the number of classes should be more than 4 and fewer than 16. Too few or too many classes provide little information.

$$\text{Interval width} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$$

Example 1.13
Construct frequency distributions (with interval width 10) of the following ordered arrays, which show the cost per person at 50 city restaurants and 50 suburban restaurants.

(Solution)
The smallest and largest values for the ordered arrays are 21 and 79, respectively. We use 20 and 80 as the smallest and largest values for convenience, which gives (80 – 20) / 10 = 6 classes.

When comparing two or more groups (such as the city and suburban restaurants), the proportion or percentage for each class is more useful and meaningful than the frequency count of each class.

1.4.5 Visualizing Numerical Data

1.4.5.1 Stem-and-Leaf Plot
It
(i) shows the range and distribution of the data;
(ii) identifies outliers/unusual observations.

Example 1.14
The numbers of hours spent on the internet weekly by 10 students are: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38. Construct a stem-and-leaf plot for the data set.

(Solution)
Data in an ordered array from smallest to largest:
21 24 24 26 27 27 30 32 38 41

Stem (in 10) | Leaf (in 1)
2 | 1 4 4 6 7 7
3 | 0 2 8
4 | 1

Remark: A stem may have as many digits as needed, but a leaf should contain only a single digit.

Example 1.15
The weights (in kg) of 26 girls and 34 boys among the BSc-DSBI year 1 students of HSUHK are recorded and plotted in a so-called half-stem-and-leaf plot.

We call the distribution bimodal as it has two peaks.

Example 1.16
Based on the data in Example 1.15, we can compare the weights of girls and boys using a back-to-back half-stem-and-leaf plot.
Weight of girls (in kg) | Stem | Weight of boys (in kg)
                    4 2 |  4   |
                8 6 5 5 |  4   |
      4 3 3 3 2 2 1 0 0 |  5   | 2
              7 7 6 6 5 |  5   | 5 6
                  3 1 0 |  6   | 0 1 2 2 3 3 4
                    9 8 |  6   | 5 5 5 6 6 6 7 7 8 9
                      0 |  7   | 0 0 1 3 4
                        |  7   | 5 5 6 7
                        |  8   | 0 2 4
                        |  8   | 6
                        |  9   | 1

1.4.5.2 Histogram and Polygon

Histogram: a bar chart for grouped numerical data in which vertical bars represent the frequencies or percentages in each group.

Polygon (or percentage polygon): plots the frequency (or percentage) of each class at the midpoint of its class interval and links these points to form a line.

Example 1.17
Histograms and polygons can be drawn from the ordered arrays and frequency distributions of the cost per person at 50 city restaurants and 50 suburban restaurants in Example 1.13.

Shapes of Histogram
(1) A histogram is said to be symmetric if, when we draw a vertical line down the center of the histogram, the two sides are identical in shape and size.
(2) A histogram is said to be positively/negatively skewed if it has a long tail extending to the right/left.

Example 1.18
Suppose the weights (in kg) of 100 students are tabulated in the following table. Construct a histogram, frequency polygon and cumulative frequency polygon/curve.

Weight (kg) | Class boundary | Class mark | Frequency | Cumulative frequency
30-34 | 29.5 – 34.5 | 32 | 2 | 2
35-39 | 34.5 – 39.5 | 37 | 8 | 10
40-44 | 39.5 – 44.5 | 42 | 15 | 25
45-49 | 44.5 – 49.5 | 47 | 30 | 55
50-54 | 49.5 – 54.5 | 52 | 23 | 78
55-59 | 54.5 – 59.5 | 57 | 16 | 94
60-64 | 59.5 – 64.5 | 62 | 6 | 100
Total | | | 100 |

[Figure: Histogram of the weights (in kg) of 100 students; x-axis: Weight (kg).]
Remark: We use the class boundaries and frequencies to construct the histogram.

[Figure: Frequency polygon of the weights (in kg) of 100 students; x-axis: Weight (in kg), marked from 27 to 67; y-axis: Frequency, from 0 to 30.]
Remark: We use the class marks and frequencies to construct the frequency polygon.

[Figure: Cumulative frequency polygon of the weights (in kg) of 100 students, with Q1, the median, Q3 and the IQR marked.]
Remarks:
-- We use the upper class boundaries to plot the points.
-- Polygon means using straight lines to join the points, whereas curve means using a smooth curve to join the points.
-- Note that Q1, the median and Q3 can be read directly from the cumulative frequency polygon/curve.

1.4.5.3 Scatter Plot
A scatter plot is used to examine a possible relationship between two numerical variables.

Example 1.19
The volumes per day and costs per day of a store on 11 days are as follows. Draw a scatter diagram of the 11 pairs of volume and cost.

Volume per day | 23 | 24 | 26 | 29 | 33 | 38 | 41 | 42 | 50 | 55 | 60
Cost per day | 131 | 120 | 140 | 151 | 160 | 167 | 185 | 170 | 188 | 195 | 200

[Figure: Scatter plot of the volume and cost per day.]

1.4.5.4 Box-and-Whisker Plot
A box-and-whisker plot (or simply boxplot) displays the five-number summary of a data set and provides a way to determine the shape of a distribution.

The five numbers of a data set are
Minimum, Q1, Median, Q3, Maximum

[Figure: Box-and-whisker plot. The lower whisker runs from L (min. value) to Q1 and covers the lower 25% of the data; the box runs from Q1 to Q3, with Q2 = median inside, and covers the middle 50% of the data; the upper whisker runs from Q3 to H (max. value) and covers the upper 25% of the data.]

(i) The median line inside the box shows the location of the center of the data.
(ii) The length of the box shows the spread of the middle half of the data.
(iii) The lengths of the whiskers show the spread of the lower and upper quarters of the data.

End of Chapter 1