MATH 1060‐ Statistics for Data Analytics Descriptive Stats An understanding of statistics requires us to make a number of definitions. While they may seem unnecessarily complex, we need to be speaking the same language and a good understanding is useful to deal with others who may be using statistics to mislead you. Example of Population and Sample: A polling company wants to find out how many Canadian households (10,820,050) have internet access. The company randomly phones 1700 households in Vancouver. Population: Sample: Descriptive Statistics – Inferential Statistics – More Definitions (related to Data): Quantitative data – Categorical data – Discrete data – Continuous data – Page | 1 DESCRIPTIVE STATISTICS Organizing Data: When conducting a statistical study, the researcher must gather data for a particular variable under study. To describe situations, draw conclusions, or make inferences about the data, the researcher must organize the data in some meaningful way. Example: Your boss has just dumped a set of Radiation Exposure measurements on your desk and asked you to prepare a presentation of the results (and it is Friday afternoon ‐ of course.) The measurements are from 51 different patients who all received the same procedure ‐ radiation therapy or x‐rays perhaps. The variation between patients reflects differences in your company’s equipment from machine to machine. How are you going to get the presentation ready? The data set (as given) is: Radiation Exposure Data Patient 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Radiation Exposure (rd) 13.6 2.8 2.9 3.8 15.9 1.7 3.4 13.7 6.1 16.8 7.9 3.5 2.2 4.1 3.2 2.9 3.7 2.9 2 2.9 11.2 1.9 2 6 2.9 7.7 Patient 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 Radiation Exposure (rd) 5.1 13.2 3.8 13.9 2.4 7.9 1.4 5.9 6.5 11.8 13.2 2.8 6.9 0.7 12.9 3.6 3.6 8.1 17 8.2 9.8 13 11.3 4 6 Page | 2 Graphs “Pictures are worth a thousand words” 1) Scatter plot. strip Plot 0 2) Line Graph. 5 10 Radiation Exposure (rd) 15 20 2) Line graph 20 15 10 5 0 0 10 3) Bar Graph. 20 30 Patient Number 40 50 60 Bar Graph Radiation Exposure (rd) 20 15 10 5 0 1 5 9 13 17 21 25 29 33 37 41 45 49 Radiation Exposure (rd) Line Graph Patient Number Page | 3 Depending on the feature we are interested in, we choose the type of chart that best fits our purpose. For example, to show the composition (in %) we can use a pie chart. For instance, to show the type of the devices used to visit your website, given that the number of categories is less than 6, a pie chart is a good option. Note that all charts and graphs must be properly labelled. Devices used to access the internet in 2017 5% 11% Desktop Computer 16% Smartphone Laptop 26% 42% Tablet Computer Other In order to understand the distribution of data we may choose a stem and leaf chart, a bar chart (or column chart), a histogram, or a density chart depending on the data. A line chart is suitable for analyzing trends in data. For example to visualize the monthly sales of a certain brand of smartphone, or the price of a certain stock, a line chart is a good choice. We will create some of the charts mentioned above using the radiation data. Stem and leaf chart: A mixture of table and chart that you may run across is the Stem and Leaf Diagram. It involves breaking the numerical measurements into a stem and a leaf. i.e. 2.34 becomes stem=2 and leaf=34. You may use any digit to break the numerical value but you should do so in such way as to display the variation in the measurements. The “Leaf Unit” should be labelled to allow reconstruction of the original data. You may use any digit to break the numerical value but you should do so in such way as to display the variation in the measurements. The “Leaf Unit” should be labelled to allow reconstruction of the original data. Page | 4 Radiation Example: Stem Leaves Page | 5 Histograms Other common graphs require a certain amount of data analysis where you divide the range of all the measurements into m equal intervals called Classes Typically there should be between 5 and 20 classes in a plot depending on the number of measurements. The size of each class is called the Class Width. Class Boundaries : We can set up the class boundaries as: [a,b). Here the lower class boundary is included in the class and the upper class boundary is not. We can easily change this to “(a,b]” in R Class Boundary Rule: Class boundaries must be set up so that no single data point lies in 2 different classes. Radiation Example: No of Classes: Class Width: (high – low)/# of classes = Class Boundaries: Class Mark: The number in middle of each Class is called class mark. Radiation Exposure Frequency Relative Frequency Class Mark Page | 6 Frequency Histogram:. Frequencyi Radiation Exposure 20 18 16 14 12 10 8 6 4 2 0 1.8 5.3 8.8 12.3 15.8 Rad exposure Relative Frequency Histogram: There is also a relative frequency histogram where each class now has a column representing the percentage of the total that the class represents. In this case, you divide each class by the total number of measurements in the experiment. This gives the relative percentage of each class. Relative frequency 0.40 0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.00 1.8 5.3 8.8 12.3 15.8 Radiation Exposure (rd) Page | 7 Distributions: A relative frequency histogram can also be considered a distribution of the values that the variable can take in a population. If it is a histogram such as shown above it is called a Discrete Distribution since there are only a finite number of classes or values that the variable can take. If we take the class sizes to be very small, perhaps as we measure more and more data points, then we get a Continuous Distribution. You have seen a similar situation in calculus where the derivative is calculated using a slope formula and then the x x which is infinitesimally small. For example a continuous distribution of radiation levels may look like: 0 1.25 2.5 3.75 5 6.25 7.5 8.75 10 11.25 12.5 13.75 15 16.25 17.5 Relative Frequency A Distribution 0.05 0.04 0.03 0.02 0.01 0 Radiation Exposure (rd) A continuous distribution would represent the relative frequency for an infinite population. Page | 8 Descriptive Stats (2) NUMERICAL DESCRIPTIVE STATISTICS While being able to make graphical displays of your data is a valuable tool, you often need some more quantitative measure of your data that are well understood and tell people something about your data without listing all the measurement values. The simplest quantitative description of data tends to fall into 3 basic categories: 1. 2. 3. We will first consider the measure of the central tendencies which helps us answer the question "Does our data tend to cluster around a central value?" or "Does our data have a trend towards a particular value?" Measures of Central Tendency Mean or Arithmetic Average Example: The data representing the annual chocolate sales (in billions of dollars) for a sample of seven countries in the world is represented below. Find the mean. $2.0, 4.9, 6.5, 2.1, 5.1, 3.2, 16.6 Page | 9 Advantage of Using Mean: Disadvantage of Using Mean: Median Median overcomes the sensitivity issues in using the Mean. Median – Example: (a) Find the median for the annual chocolate sales for 7 countries. $2.0, 4.9, 6.5, 2.1, 5.1, 3.2, 16.6 (b) Find the median for the annual chocolate sales for 6 countries. $2.0, 4.9, 6.5, 2.1, 5.1, 3.2 Page | 10 Mode The third measure of average is called the mode. Mode – Example: Find the mode of the following data set. 5, 5, 5, 3, 1, 5, 1, 4, 3, 5 If you have put your data into Classes then you can talk of the Modal Class which has the greatest frequency. Example: A study of reaction times involved 30 left‐handed subjects, 50 right‐handed subjects, and 20 ambidextrous subjects. Find the mode. Page | 11 SUMMARY OF MEASURES OF CENTRAL TENDENCIES There are several different ways to define the center of a set of data. The figure below illustrates the differences among the mean, median and mode. Which one is best? The answer is dependent on the objective of the data. The table below summarizes the different measures of center. Unfortunately, the term average is sometimes used for any measure of centre and is sometimes used for the mean. Avoid using the term average and be more specific and use words like mean or median. Page | 12 MEASURES OF DISPERSION (OR VARIATION) It is a trivial matter to design two distributions which have the same mean, median and mode but which have significantly different degrees of clustering around the mean values. Take the following example: Example: A testing lab wishes to test two experimental brands of outdoor paint to see how long each will last before fading. The testing lab takes 6 gallons of each paint to test. The results in months are shown. Find the mean of each group. Brand A 10 60 50 30 40 20 Brand B 35 45 30 35 40 25 Three Measures of Dispersion: 1. 2. 3. Page | 13 Range Range is the largest data value minus the smallest data value. Example: Find the ranges for the paints. Variance and Standard Deviation Variance and Standard Deviation are important measures of variation but the values must be interpreted correctly. Variance – the average of the squares of the distance each value is from the mean Standard Deviation – the square root of the variance. Page | 14 Example: Find the variance and standard deviation for the data set of Paint A and B. Paint A: 10, 60, 50, 30, 40, 20 Values (X) X X X 35 X X 2 X X 2 10 60 50 30 40 20 Sum: Paint B: 35, 45, 30, 35, 40, 25 Values (X) X X 35 45 30 35 40 25 Sum: Page | 15 MEASURES OF DISPERSION WRAP‐UP Understanding Standard Deviation Standard deviation measures the variation among values. Values close together will yield a small standard deviation while values spread farther apart will yield a larger standard deviation. Empirical Rule If the distribution is “bell” shaped (or "normal" as it is referred to) we can make a stronger statement about the significance of the standard deviation. Page | 16 MEASURES OF POSITION In this section we introduce z scores, which enable us to standardize values so that they can be compared more easily. We also introduce quartiles and percentiles which help us better understand data by showing their positions relative to the whole data set. z scores z Score Number of standard deviations that a given value x is above or below the mean. Example: A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative positions on the two tests. Quartiles and Percentiles Just as the median divides the data into two equal parts, the three quartiles, denoted by Q1, Q2 and Q3, divide the sorted values into four equal parts. Roughly speaking, Q1 , separates the bottom 25% of the sorted values from the top 75%, Q2 is the median and Q3 separates the top 25% from the bottom 75%. Page | 17 Just as there are three quartiles separating a data set into four parts, there are 99 percentiles, P1, P2, P3, etc., which partition the data into 100 groups with about 1% of the scores in each group. 25% (minimum) 25% Q1= P25 25% 25% Q2= P50 Q3= P75 (maximum) (median) Finding the Percentile/Quartile of a Given Score: Percentile of score x = Finding the Score of a Given Percentile/Quartile: L k n 100 n = total number of values in the data set k = percentile being used L = locator that gives the position of a value Pk = kth percentile Steps for finding the Score: Step 1: Arrange the data in order from lowest to highest Step 2: Substitute into the formula for L above Step 3a: If L is not a whole number, round up to the next whole number. Starting at the lowest value, count over to the number that corresponds to the rounded‐up value. Step 3b: If L is a whole number, use the value halfway between the Lth and the (L+1)th values when counting up from the lowest value. Page | 18 Other Useful Definitions: Interquartile Range (or IQR): Semi‐interquartile Range: Midquartile: 10 – 90 Percentile Range: Example (Pulse Rate of Smokers): 52 52 60 60 60 60 63 63 66 67 68 69 71 72 73 75 78 80 82 83 88 90 Find Q1 (or P25), Q3 (or P75) and IQR. Outliers: a value that is located very far away from almost all the other values. an extreme value can have a dramatic effect on the mean, standard deviation, and on the scale of the histogram so that the true nature of the distribution is totally obscured Steps for Identifying Outliers Step 1: Arrange the data in order and find Q1 and Q2 Step 2: Find the interquartile range: IQR = Q3 – Q1 Step 3: Multiply the IQR by 1.5. Step 4: Subtract the value obtained in step 3 from Q1 and add the value to Q3. Step 5: Check the data set for any data value that is smaller than Q1 – 1.5(IQR) or larger than Q1 + 1.5(IQR). Boxplots: Reveals the: center of the data spread of the data distribution of the data presence of outliers Page | 19 Excellent for comparing two or more data sets 5‐number summary Minimum First quartile Q1 Median Q2 Third quartile Q3 Maximum 1.5 IQR 1.5 IQR Outlier * Q1 Median Q3 Example (Pulse Rate of Smokers): 52 69 52 71 60 72 60 73 60 75 60 78 63 80 63 82 66 83 67 88 68 90 Graph the boxplot for the data. Minimum: Q1, first quartile: Median: Q3, third quartile: Maximum: Check for Outliers IQR = 1.5(IQR) = Q1 ‐ 1.5(IQR) = Q3 + 1.5(IQR) = Page | 20