Statistics 205 Fall 2023: Visual Representations of Data

The first step in exploring the data produced by our data collection method is to understand how the data ‘behave’. Technically, this means understanding how the variable on which the data were collected ‘tends to vary’. To achieve this understanding we have to visualize the data. How we visualize the data, or what graph to make, depends on the type of variable.

Visualizing Categorical Variables

Categorical variables can be visualized in two ways:
1. Pie Charts
2. Bar Graphs (Pareto Charts)

Example 1: Consider the sample of n = 101 first-year University of Calgary students. A pie chart of the breakdown of the sample, by gender, is given below.

Example 2: Each first-year student was asked the following question: “What political party in Canada most closely represents your political views?” Responses were coded as follows: Liberal - 1, Progressive Conservative - 2, NDP - 3, Green Party - 4, Other - 5. A series of bar graphs summarizes (i) the overall results and (ii) the results broken down by gender.

Statistics 205: ©Jim Stallard 2023

Caution! Data having ‘one dimension’, that is, data on one variable, should not be displayed in a graph beyond two dimensions. For example, consider the following pie chart that used to be a sticker on all gas pumps at Petro Canada stations.

Visualizing Numerical Variables

We will find that when we collect data on a numerical variable, the data will tend to behave in a certain way. This ‘behaviour’, or the tendency for the data to take on certain values more often than others, is what we call the distribution of the variable. There are three basic types of distributions:
1. Symmetrical:
2. Right-skewed:
3. Left-skewed:

Data collected from numerical variables can be visualized in one of two forms:
1. Dotplots
2. Histograms

Dotplots

Example of a Dotplot: The following data represent the length of survival, in days, of a random sample of people diagnosed with varying stages of lung cancer:

37, 63, 63, 65, 72, 138, 151, 155, 166, 166, 223, 245, 246, 450, 859

Create a dotplot of these data.

Answer: The dotplot is a one-dimensional plot, where the x-axis represents the values of the associated numerical variable. (We have seen dotplots before: days survived after diagnosis for various forms of cancer.) In this instance, the target population consists of all persons who have been diagnosed with lung cancer and have passed away as a result; the variable of interest is the survival time (in days).

The dotplot below was produced with StatCrunch. Included is a portion of the StatCrunch screenshot, showing the spreadsheet consisting of the data above. This dotplot can be created in StatCrunch with the following:

Graph → Dotplot
Select Column(s): LungCancer Survival

Example 2: Using Dotplots as Visual Comparisons

The following dotplots were generated in StatCrunch. The CancerSurvival.csv (comma separated value) file contains data in columns, where each column holds data on the variable ‘survival time in days’ for a random sample of people who were diagnosed with one of various forms of cancer.

Histograms

A histogram is a bar graph of a frequency distribution. Often, a data set is summarized into a frequency distribution. Simply stated, a frequency distribution assigns each data point to a class. After this has been completed, the number of data points assigned to each class (the class frequency) is given in the form of a table.

Class    Class Interval (Boundaries)    Frequency, f_i    Relative Frequency, p_i = f_i / n
1        LB_1 ≤ x < UB_1                f_1               p_1
2        LB_2 ≤ x < UB_2                f_2               p_2
3        LB_3 ≤ x < UB_3                f_3               p_3
...      ...                            ...               ...
k        LB_k ≤ x < UB_k                f_k               p_k
Total                                   Σ f_i = n         Σ p_i = 1

Some guidelines to consider when constructing a frequency/percentage distribution:

1. Calculate the range of the data set (sample). Range = Max − Min, where Max is the maximum (largest) value in the sample and Min is the minimum (smallest) value in the sample.
2. Determine the number of classes/intervals the frequency/relative frequency distribution will have. The number of classes/intervals is equal to k, where 5 ≤ k ≤ 15. The value of k is often chosen arbitrarily, but it can be determined as a function of the sample size n using the Square Root Rule: k ≈ √n. For example, n = 60 gives k ≈ √60 ≈ 7.7, so about 8 classes.
3. Divide the range of the data into k equal-sized intervals. The size of each interval is called the class width, where width = Range / k.
4. The lower bound of the first class starts at a point ≤ Min. Continue to define classes such that the upper bound of the last class, UB_k, is greater than the Max.
5. Assign each data point to its corresponding class. When this is complete, count the number of data points assigned to each class. This count is called the class frequency, or simply the frequency of class i, i = 1, 2, ..., k.
5(b). One can convert the class frequencies into percentages and produce a relative-frequency distribution. This simply indicates what percentage of the data falls into each class.

A histogram is a bar graph of the frequency or percentage distribution. It is a visual tool used in an attempt to understand how a population variable behaves, that is, its distribution shape, based on a random sample of data taken from that population. In the scope of this course, we will consider three types of histograms:
1. A frequency, or count, histogram.
2. A relative frequency histogram.
3. A density histogram.

Example 1: A Relative Frequency Histogram: The salaries of n = 60 randomly chosen professional hockey players with NHL contracts for the 2019-2020 season were observed, in $1,000,000s.
The data are given below, sorted in ascending order for convenience.

0.7000, 0.7000, 0.7000, 0.7000, 0.7000, 0.7000, 0.7500, 0.8000, 0.8000, 0.8325,
0.8325, 0.8741, 0.9000, 0.9000, 0.9000, 0.9050, 0.9250, 0.9250, 0.9250, 0.9250,
0.9250, 0.9250, 0.9250, 1.0500, 1.0500, 1.3250, 1.5000, 1.7000, 1.9000, 1.9500,
2.0000, 2.3500, 2.5000, 2.9000, 3.0000, 3.0000, 3.2000, 3.5000, 3.5000, 3.7000,
4.0000, 4.0000, 4.4500, 4.5000, 5.0000, 5.0000, 5.2500, 5.2500, 5.5250, 6.0000,
6.5000, 6.7500, 7.0000, 7.5000, 7.5000, 8.2750, 8.8000, 9.8000, 10.0000, 11.0000

From these data, we will create a relative frequency histogram.

Answer:

Class          Count/Frequency, f_i    Relative Frequency, f_i / n
0.5 < 2.0      30                      30/60 = 0.5000
2.0 < 3.5      7                       7/60 = 0.1167
3.5 < 5.0      7                       7/60 = 0.1167
5.0 < 6.5      6                       6/60 = 0.1000
6.5 < 8.0      5                       5/60 = 0.0833
8.0 < 9.5      2                       2/60 = 0.0333
9.5 < 11.0     2                       2/60 = 0.0333
11.0 < 12.5    1                       1/60 = 0.0167

The following are two histograms constructed with StatCrunch, the first being a relative frequency histogram and the second being the count/frequency histogram.

A problem with relative frequency (or count) histograms is that they are not appropriate when the classes of the distribution are not all the same width. Consider the last example: there are 2 NHL salaries in the class 9.5 < 11.0 and 1 salary in the class 11.0 < 12.5. What if these last two classes were combined into a single class? Then there would be 3 players with salaries in the class 9.5 < 12.5, and the relative frequency would be 3/60 = 0.05. A bar of height 0.05 over this wider class would be taller than either of the two bars it replaces (0.0333 and 0.0167), which exaggerates how common these salaries are.

To remove this problem altogether, a density-scale histogram is constructed. A density-scale histogram differs from a relative frequency histogram or a histogram of counts in this sense: the total area of a density-scale histogram, that is, the sum of the areas of all bars, is equal to 1 or 100%.
As a result, the height of each bar in a density-scale histogram is NOT equal to the relative frequency or count associated with its class; rather, the height of each bar is called the class’s density, and is computed in the following way:

\[ \text{Area}_{\text{Bar } i} = \text{Relative Frequency}_{\text{Class } i} \]
\[ \text{Width}_{\text{Class } i} \times \text{Height}_{\text{Class } i} = \text{Relative Frequency}_{\text{Class } i} \]
\[ \text{Height}_{\text{Class } i} = \text{Density}_{\text{Class } i} = \frac{\text{Relative Frequency}_{\text{Class } i}}{\text{Width}_{\text{Class } i}} \]

Example 1, Part II: Reconsider the relative frequency table of the NHL salary data created earlier, but with the last two classes combined into one larger class:

Class          Count/Frequency, f_i    Relative Frequency, f_i / n    Density = Column 3 / class width
0.5 < 2.0      30                      30/60 = 0.5000                 0.5000/1.5 = 0.3333
2.0 < 3.5      7                       7/60 = 0.1167                  0.1167/1.5 = 0.0778
3.5 < 5.0      7                       7/60 = 0.1167                  0.1167/1.5 = 0.0778
5.0 < 6.5      6                       6/60 = 0.1000                  0.1000/1.5 = 0.0667
6.5 < 8.0      5                       5/60 = 0.0833                  0.0833/1.5 = 0.0556
8.0 < 9.5      2                       2/60 = 0.0333                  0.0333/1.5 = 0.0222
9.5 < 12.5     3                       3/60 = 0.0500                  0.0500/3.0 = 0.0167

A density-scale histogram of the NHL Salary data (Example 1, Part II) will be drawn below.

Example 1, Part III: From the histograms given, what can you say about the proportion of all professional hockey players with an NHL contract for the 2019-2020 season who make

(a) less than $3 million a season?

Answer: To compute this probability from the density-scale histogram, we need to compute the area to the left of 3.0 million.

\[ P(\text{salary} < 3.0) = (\text{Area between } 0.5 \text{ and } 2.0) + (\text{Area between } 2.0 \text{ and } 3.0) \]
\[ = \underbrace{1.5}_{\text{width of class 1}} \times \underbrace{0.3333}_{\text{density}} + \underbrace{(3.0 - 2.0)}_{\text{partial width of class 2}} \times \underbrace{0.0778}_{\text{density}} = 0.50 + 0.0778 = 0.5778 \approx 0.58 \]

Here we have used a random sample (data) and empirical probability to estimate the proportion of ALL professional hockey players with an NHL contract for the 2019-2020 season who make less than $3.0 million a season.

(b) (i) between $2.0 and $4.0 million a season?

Answer to (i): We need to compute the area that is between 2.0 million and 4.0 million.
\[ P(2.0 < \text{salary} < 4.0) = (\text{Area between } 2.0 \text{ and } 3.5) + (\text{Area between } 3.5 \text{ and } 4.0) \]
\[ = ((3.5 - 2.0) \times 0.0778) + ((4.0 - 3.5) \times 0.0778) = 0.1167 + 0.0389 = 0.1556 \]

Again using the data to compute an empirical probability, we infer from the sample that approximately 15.56% of all professional hockey players with an NHL contract for the 2019-2020 season make between $2.0 million and $4.0 million.

One More Example, Time Permitting: In this instance, we are referring to the Professor Salaries 20182019.csv data file. I will be loading this into StatCrunch and creating various visualizations of these data in class.

Wrap Up Exercise: The data below were obtained from a random sample of n = 36 men. The creatine phosphokinase concentration, or CK-level (measured in u), was measured for each. The data can also be found in the NHLSampleSalaries19 20.csv file.

25 42 48 57 58 60 62 64 67 68 70 78
82 83 84 92 93 94 95 95 100 101 104 110
110 113 118 119 121 123 139 145 151 163 201 203

Please answer the question parts posed below within Top Hat.

(a) Using class limits of 25 - 55 - 85 - 115 - 145 - 175 - 205, create a relative frequency histogram of these data.
(b) What proportion of men in this sample had a CK-level that was at least 145 and less than 175?
(c) Think about the shape of the distribution of CK-levels for a population of males. What can you say about the distribution shape of the CK-levels?
(d) Create a density histogram using the same classes as above. What is the density associated with the 55 < 85 class?
(e) Suppose you were to take the upper two classes, 145 < 175 and 175 < 205, and combine these into one class. What would be the density of this new class?
(f) Estimate the proportion of all males that have a CK-level below 100. Use four decimals in your answer, keeping your answer in decimal form.
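
The frequency, relative frequency and density calculations of Example 1 (NHL salaries) can also be checked outside StatCrunch. What follows is only a minimal Python sketch under stated assumptions, not part of the course software: the function names `density_table` and `proportion_below` are hypothetical, and classes are assumed to include their lower bound and exclude their upper bound, as in the tables above.

```python
from bisect import bisect_right

# NHL salary sample from Example 1, in $1,000,000s (n = 60).
salaries = [
    0.7000, 0.7000, 0.7000, 0.7000, 0.7000, 0.7000, 0.7500, 0.8000, 0.8000, 0.8325,
    0.8325, 0.8741, 0.9000, 0.9000, 0.9000, 0.9050, 0.9250, 0.9250, 0.9250, 0.9250,
    0.9250, 0.9250, 0.9250, 1.0500, 1.0500, 1.3250, 1.5000, 1.7000, 1.9000, 1.9500,
    2.0000, 2.3500, 2.5000, 2.9000, 3.0000, 3.0000, 3.2000, 3.5000, 3.5000, 3.7000,
    4.0000, 4.0000, 4.4500, 4.5000, 5.0000, 5.0000, 5.2500, 5.2500, 5.5250, 6.0000,
    6.5000, 6.7500, 7.0000, 7.5000, 7.5000, 8.2750, 8.8000, 9.8000, 10.0000, 11.0000,
]


def density_table(data, boundaries):
    """Hypothetical helper: one (lower, upper, count, rel_freq, density) row per class.

    Assumes each class is lower <= x < upper and boundaries are sorted.
    """
    n = len(data)
    counts = [0] * (len(boundaries) - 1)
    for x in data:
        # Index of the class whose lower bound is the largest one <= x.
        i = bisect_right(boundaries, x) - 1
        if not 0 <= i < len(counts):
            raise ValueError(f"{x} falls outside the class boundaries")
        counts[i] += 1
    rows = []
    for lower, upper, f in zip(boundaries, boundaries[1:], counts):
        rel_freq = f / n
        # Density = relative frequency / class width, so bar area = rel_freq.
        rows.append((lower, upper, f, rel_freq, rel_freq / (upper - lower)))
    return rows


def proportion_below(rows, cutoff):
    """Hypothetical helper: area of the density histogram to the left of cutoff."""
    area = 0.0
    for lower, upper, _f, _p, density in rows:
        if cutoff <= lower:
            break
        # Use the full class width, or only the part of it below the cutoff.
        area += (min(cutoff, upper) - lower) * density
    return area


# Classes of Example 1, Part II (last two classes combined into 9.5 < 12.5).
boundaries = [0.5, 2.0, 3.5, 5.0, 6.5, 8.0, 9.5, 12.5]
rows = density_table(salaries, boundaries)

for lower, upper, f, p, d in rows:
    print(f"{lower:4.1f} < {upper:4.1f}  {f:3d}  {p:.4f}  {d:.4f}")

print("P(salary < 3.0)       =", round(proportion_below(rows, 3.0), 4))
print("P(2.0 < salary < 4.0) =",
      round(proportion_below(rows, 4.0) - proportion_below(rows, 2.0), 4))
```

Because the sketch carries unrounded densities, its results may differ from the hand calculations in the last decimal place. The same two functions could be pointed at any other sample and set of class boundaries.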