Statistics 205 Fall 2023: Visual Representations of Data
The first step in exploring the data produced by our data collection method is to understand how
the data ‘behave’. Technically, this means understanding how the variable on which the data were collected
‘tends to vary’. To achieve this understanding we have to visualize the data. How we visualize the data - that
is, which graph we make - depends on the type of variable.
Visualizing Categorical Variables
Categorical variables can be visualized in two ways:
1. Pie Charts (Pareto Charts)
2. Bar Graphs
Example 1: Consider the sample of n = 101 first-year University of Calgary students. A pie-chart of the
breakdown of the sample, by gender, is given below.
Example 2: Each first-year student was asked the following question: “What political party in Canada
most closely represents your political views?” Students responded the following way: Liberal - 1, Progressive
Conservative - 2, NDP - 3, Green Party - 4, Other - 5. A series of bar graphs summarizes (i) the overall
results and (ii) the results broken down by gender.
Statistics 205:
©Jim Stallard 2023
Caution! Data having ‘one dimension’ - that is, data on a single variable - should not be displayed in a graph
with more than two dimensions. For example, consider the following pie chart, which used to appear on a
sticker on all gas pumps at Petro Canada stations.
Visualizing Numerical Variables
We will find that when we collect data on a numerical variable, the data will tend to behave in a certain
way. This ‘behaviour’, or the tendency for the data to take on certain values more often than others, is what
we call the distribution of the variable.
There are three basic types of distributions:
1. Symmetrical:
2. Right-skewed:
3. Left-skewed:
Data collected from numerical variables can be visualized in one of two forms:
1. Dotplots
2. Histograms
Dotplots
Example of a Dotplot: The following data represents the length of survival in days of a random sample
of people diagnosed with varying stages of lung-cancer:
37, 63, 63, 65, 72, 138, 151, 155, 166, 166, 223, 245, 246, 450, 859
Create a dotplot of this data.
Answer: The dotplot is a one-dimensional plot, where the x-axis represents the values of the associated
numerical variable. (We have seen dotplots before: days survived after diagnosis for various forms of
cancer.) In this instance, the target population consists of all persons who have been diagnosed
with lung cancer and have passed away as a result, and the variable of interest is the survival time (in
days). The dotplot below was produced with StatCrunch. Included is a portion of the StatCrunch
screenshot, showing the spreadsheet consisting of the data above.
This dotplot can be created in StatCrunch with the following:
Graph → Dotplot
Select Column(s): LungCancer Survival
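For readers without StatCrunch, the idea behind a dotplot can be sketched in a few lines of Python: each distinct value gets a stack of dots, one dot per observation. This is only a rough, text-only illustration (rotated so that values run down the page, and without a scaled axis); the function names `dot_stacks` and `text_dotplot` are made up for this sketch and are not part of StatCrunch.

```python
from collections import Counter

# Survival times (in days) from the example above.
survival_days = [37, 63, 63, 65, 72, 138, 151, 155, 166, 166,
                 223, 245, 246, 450, 859]


def dot_stacks(data):
    """Return (value, number_of_dots) pairs in increasing order of value."""
    return sorted(Counter(data).items())


def text_dotplot(data):
    """Build a rotated text dotplot: one row per distinct value, one 'o' per observation."""
    rows = []
    for value, count in dot_stacks(data):
        rows.append(f"{value:>5} " + "o" * count)
    return "\n".join(rows)


if __name__ == "__main__":
    print(text_dotplot(survival_days))
```

Values that occur twice (63 and 166) show up as a stack of two dots, which is exactly what the StatCrunch dotplot displays above the x-axis.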
Example 2: Using Dotplots as Visual Comparisons
The following dotplots were generated in StatCrunch. The CancerSurvival.csv (comma separated value)
file contained data in columns, where each column contained data on a variable ‘survival time in days’ for a
random sample of people who were diagnosed with various forms of cancer.
Histograms
A histogram is a bar-graph of a frequency distribution.
Often, a data set is summarized into a frequency distribution. Simply stated, a frequency distribution
assigns each data point to a class. After this has been completed, the frequency of data points assigned to
each class is provided and given in the form of a table.
Class   Class Intervals (Boundaries)   Frequency $f_i$             Relative Frequency $p_i = f_i / n$
1       $LB_1 \le UB_1$                $f_1$                       $p_1$
2       $LB_2 \le UB_2$                $f_2$                       $p_2$
3       $LB_3 \le UB_3$                $f_3$                       $p_3$
...     ...                            ...                         ...
k       $LB_k \le UB_k$                $f_k$                       $p_k$
Total                                  $\sum_{i=1}^{k} f_i = n$    $\sum_{i=1}^{k} p_i = 1$
Some guidelines to consider when constructing a frequency/percentage distribution:
1. Calculate the range of the data set (sample). Range = Max − Min, where Max is the maximum
(largest) value in the sample and Min is the minimum (smallest) value in the sample.
2. Determine the number of classes/intervals the frequency/relative frequency distribution will have. The
number of classes/intervals is equal to k, where 5 ≤ k ≤ 15. The value of k is often chosen arbitrarily,
but can be determined as a function of the sample size n:
Square Root Rule: $k \approx \sqrt{n}$
3. Divide the range of the data into k equal-sized intervals. The size of each interval is called the class width, where
$\text{width} = \dfrac{\text{Range}}{k}$.
4. The lower bound of the first class starts at a point ≤ Min. Continue to define classes such that
the upper bound of the last class - $UB_k$ - is greater than Max.
5. Assign each data point to its corresponding class. When this is complete, count the number of data
points assigned to each class. This count is called the class frequency, or simply frequency of class i,
i = 1, 2, · · · , k.
5(b). One can convert the class frequencies into percentages and produce a relative-frequency distribution.
This simply indicates what percentage of the data falls into each class.
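The guidelines above can be sketched in Python as follows, using the lung-cancer survival times from the dotplot example. Two choices here are assumptions of this sketch rather than part of the guidelines: the square-root rule is rounded and then clamped into 5 ≤ k ≤ 15, and the class width is rounded up to the next whole number so that the last upper bound is strictly greater than the maximum (step 4). The function name `frequency_distribution` is hypothetical.

```python
import math

# Survival times (in days) from the dotplot example.
survival_days = [37, 63, 63, 65, 72, 138, 151, 155, 166, 166,
                 223, 245, 246, 450, 859]


def frequency_distribution(data, k=None):
    """Return (class bounds, class frequencies, relative frequencies)."""
    n = len(data)
    smallest, largest = min(data), max(data)
    data_range = largest - smallest                   # Step 1: Range = Max - Min

    if k is None:                                     # Step 2: square-root rule,
        k = min(15, max(5, round(math.sqrt(n))))      # clamped to 5 <= k <= 15 (assumption)

    # Step 3: class width. Rounded up (assumption) so that k * width > Range.
    width = math.floor(data_range / k) + 1

    # Step 4: first lower bound is Min; last upper bound exceeds Max.
    bounds = [smallest + i * width for i in range(k + 1)]

    # Step 5: assign each data point to its class and count.
    freqs = [0] * k
    for x in data:
        freqs[int((x - smallest) // width)] += 1

    # Step 5(b): convert counts to relative frequencies.
    rel_freqs = [f / n for f in freqs]
    return bounds, freqs, rel_freqs


if __name__ == "__main__":
    bounds, freqs, rel_freqs = frequency_distribution(survival_days)
    for i, f in enumerate(freqs):
        print(f"{bounds[i]} < {bounds[i + 1]}: f = {f}, p = {rel_freqs[i]:.4f}")
```

With n = 15 the square-root rule suggests about 4 classes, so the clamp raises this to k = 5. The resulting classes are 165 days wide and start at 37.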
A histogram is a bar-graph of the frequency (or relative frequency) distribution. It is a visual tool that is
used in an attempt to understand how a population variable behaves - its distribution shape - based on a
random sample of data taken from that population. In the scope of this course, we will consider three types
of histograms:
1. A frequency, or count, histogram.
2. A relative frequency histogram.
3. A density histogram.
Example 1: A Relative Frequency Histogram: The salaries of n = 60 randomly chosen professional
hockey players with NHL contracts for the 2019-2020 season were observed, in $1,000,000s. The data are
given below, and sorted in ascending order for convenience.
0.7000, 0.7000, 0.7000, 0.7000, 0.7000, 0.7000, 0.7500, 0.8000, 0.8000, 0.8325, 0.8325, 0.8741, 0.9000, 0.9000,
0.9000, 0.9050, 0.9250, 0.9250, 0.9250, 0.9250, 0.9250, 0.9250, 0.9250, 1.0500, 1.0500, 1.3250, 1.5000, 1.7000,
1.9000, 1.9500, 2.0000, 2.3500, 2.5000, 2.9000, 3.0000, 3.0000, 3.2000, 3.5000, 3.5000, 3.7000, 4.0000, 4.0000,
4.4500, 4.5000, 5.0000, 5.0000, 5.2500, 5.2500, 5.5250, 6.0000, 6.5000, 6.7500, 7.0000, 7.5000, 7.5000,
8.2750, 8.8000, 9.8000, 10.0000, 11.0000
From these data, we will create a relative frequency histogram.
Answer:
Class          Count/Frequency, $f_i$    Relative Frequency or $f_i / n$
0.5 < 2.0      30                        30/60 = 0.5000
2.0 < 3.5      7                         7/60 = 0.1167
3.5 < 5.0      7                         7/60 = 0.1167
5.0 < 6.5      6                         6/60 = 0.1000
6.5 < 8.0      5                         5/60 = 0.0833
8.0 < 9.5      2                         2/60 = 0.0333
9.5 < 11.0     2                         2/60 = 0.0333
11.0 < 12.5    1                         1/60 = 0.0167
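The counts in this table can be checked with a short Python sketch that tallies the 60 salaries into the stated classes. It assumes each class includes its lower bound and excludes its upper bound (so a salary of exactly 2.0 falls in 2.0 < 3.5), which is how the counts above work out. The helper name `tally` is made up for this illustration.

```python
from bisect import bisect_right

# NHL salaries (in $1,000,000s) from Example 1, in ascending order.
salaries = (
    [0.7000] * 6
    + [0.7500, 0.8000, 0.8000, 0.8325, 0.8325, 0.8741, 0.9000, 0.9000]
    + [0.9000, 0.9050] + [0.9250] * 7
    + [1.0500, 1.0500, 1.3250, 1.5000, 1.7000]
    + [1.9000, 1.9500, 2.0000, 2.3500, 2.5000, 2.9000, 3.0000, 3.0000,
       3.2000, 3.5000, 3.5000, 3.7000, 4.0000, 4.0000]
    + [4.4500, 4.5000, 5.0000, 5.0000, 5.2500, 5.2500, 5.5250, 6.0000,
       6.5000, 6.7500, 7.0000, 7.5000, 7.5000]
    + [8.2750, 8.8000, 9.8000, 10.0000, 11.0000]
)

# Class boundaries from the table: 0.5 < 2.0, 2.0 < 3.5, ..., 11.0 < 12.5.
bounds = [0.5, 2.0, 3.5, 5.0, 6.5, 8.0, 9.5, 11.0, 12.5]


def tally(data, bounds):
    """Count how many values fall in each class [bounds[i], bounds[i+1])."""
    counts = [0] * (len(bounds) - 1)
    for x in data:
        if not bounds[0] <= x < bounds[-1]:
            raise ValueError(f"{x} lies outside the class boundaries")
        counts[bisect_right(bounds, x) - 1] += 1
    return counts


if __name__ == "__main__":
    counts = tally(salaries, bounds)
    n = len(salaries)
    for i, f in enumerate(counts):
        print(f"{bounds[i]} < {bounds[i + 1]}: {f}/{n} = {f / n:.4f}")
```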
The following are a series of histograms constructed with StatCrunch, the first being a relative frequency
histogram and the second being the count/frequency histogram.
A problem with relative frequency (or count) histograms is that they are not appropriate when the classes
of the relative frequency distribution are not all the same width.
Consider the last example: there are 2 NHL salaries in the class 9.5 < 11.0 and 1 salary in the class
11.0 < 12.5. What if these last two classes were combined into a single class? Then there would be 3 players
with salaries in the class 9.5 < 12.5, and the relative frequency would be 3/60 = 0.05. Drawn at a height of
0.05 over a class twice as wide as the others, this bar would cover more area than the two original bars
together, and so would overstate how common these salaries are.
To remove this problem altogether, a density-scale histogram is constructed. A density-scale histogram
differs from a relative frequency histogram or histogram of counts in this sense: the total area of a
density-scale histogram - the sum of the areas of each bar - is equal to 1 or 100%. As a result, the height of each
bar in a density-scale histogram is NOT equal to the relative frequency or count associated with its class;
rather, the height of each bar is called the class’s density, and is computed in the following way:
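$$
\begin{aligned}
\text{Area}_{\text{Bar } i} &= \text{RelativeFrequency}_{\text{Class } i} \\
\text{Width}_{\text{Class } i} \times \text{Height}_{\text{Class } i} &= \text{RelativeFrequency}_{\text{Class } i} \\
\text{Height}_{\text{Class } i} &= \frac{\text{RelativeFrequency}_{\text{Class } i}}{\text{Width}_{\text{Class } i}} \\
\text{Density}_{\text{Class } i} &= \frac{\text{RelativeFrequency}_{\text{Class } i}}{\text{Width}_{\text{Class } i}}
\end{aligned}
$$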
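As a rough numerical check of this formula, the sketch below computes the density of each class for the NHL salary data, with the last two classes merged into the single class 9.5 < 12.5 (3 salaries). The variable and function names are illustrative only.

```python
# (lower bound, upper bound, frequency) for each class of the NHL salary data,
# with 9.5 < 11.0 and 11.0 < 12.5 merged into one class of width 3.0.
classes = [
    (0.5, 2.0, 30),
    (2.0, 3.5, 7),
    (3.5, 5.0, 7),
    (5.0, 6.5, 6),
    (6.5, 8.0, 5),
    (8.0, 9.5, 2),
    (9.5, 12.5, 3),
]
n = 60  # sample size


def density(rel_freq, lower, upper):
    """Density of a class = relative frequency / class width."""
    return rel_freq / (upper - lower)


densities = [density(f / n, lb, ub) for lb, ub, f in classes]

# Total area of the density-scale histogram: sum of width * height over all bars.
total_area = sum(d * (ub - lb) for d, (lb, ub, _) in zip(densities, classes))

if __name__ == "__main__":
    for (lb, ub, f), d in zip(classes, densities):
        print(f"{lb} < {ub}: relative frequency = {f / n:.4f}, density = {d:.4f}")
    print(f"total area = {total_area:.4f}")
```

Even though the merged class is twice as wide as the others, its bar area is still its relative frequency (0.05), so the bar areas continue to add up to 1.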
Example 1, Part II: Reconsider the relative frequency table of the NHL salary data created on the previous
page, but below the last two classes are combined into one larger class:
Class          Count/Frequency, $f_i$    Relative Frequency or $f_i / n$    Density = Column 3 / corresponding class width
0.5 < 2.0      30                        30/60 = 0.5000
2.0 < 3.5      7                         7/60 = 0.1167
3.5 < 5.0      7                         7/60 = 0.1167
5.0 < 6.5      6                         6/60 = 0.1000
6.5 < 8.0      5                         5/60 = 0.0833
8.0 < 9.5      2                         2/60 = 0.0333
9.5 < 11.0     2                         2/60 = 0.0333
11.0 < 12.5    1                         1/60 = 0.0167
A density-scale histogram of the NHL Salary data (Example 1, Part II) will be drawn below.
Example 1, Part III: From the histograms given, what can you say about the proportion of all professional
hockey players with an NHL contract for the 2019-2020 season who make:
(a) less than $3 million a season?
Answer: To compute this probability from the density-scale histogram, we need to compute the area
that is less than 3.0 million.
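$$
\begin{aligned}
P(< 3.0) &= (\text{Area between } 0.5 < 2.0) + (\text{Area between } 2.0 < 3.0) \\
&= (\underbrace{1.5}_{\text{width of class 1}} \times \underbrace{0.3333}_{\text{density}}) + (\underbrace{(3.0 - 2.0)}_{\text{partial width of class 2}} \times \underbrace{0.0778}_{\text{density}}) \\
&= 0.50 + 0.0778 \\
&= 0.5778 \approx 0.58
\end{aligned}
$$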
Here we have used a random sample - data - and empirical probability to estimate the proportion of
ALL professional hockey players with an NHL contract for the 2019-2020 season who make less than $3.0
million a season.
(b) (i) between $2.0 and $4.0 million a season?
Answer to (i): We need to compute the area that is between 2.0 million and 4.0 million.
$$
\begin{aligned}
P(\text{between } 2.0 \text{ and } 4.0) &= (\text{Area between } 2.0 < 3.5) + (\text{Area between } 3.5 < 4.0) \\
&= ((3.5 - 2.0) \times 0.0778) + ((4.0 - 3.5) \times 0.0778) \\
&= 0.1167 + 0.0389 \\
&= 0.1556
\end{aligned}
$$
Again, using the data to compute an empirical probability, we infer from these data (the sample) that
approximately 15.56% of all professional hockey players with an NHL contract for the 2019-2020 season make between $2.0 million and $4.0 million.
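Both area calculations follow the same pattern: for every class, multiply the part of the class width that lies inside the interval of interest by the class density, then add up the pieces. A minimal Python sketch of that pattern is given below; the function name `proportion_between` is hypothetical, and the class list uses the frequencies from the NHL salary table.

```python
# (lower bound, upper bound, frequency) for each class of the NHL salary data.
classes = [
    (0.5, 2.0, 30),
    (2.0, 3.5, 7),
    (3.5, 5.0, 7),
    (5.0, 6.5, 6),
    (6.5, 8.0, 5),
    (8.0, 9.5, 2),
    (9.5, 11.0, 2),
    (11.0, 12.5, 1),
]


def proportion_between(classes, n, a, b):
    """Estimate the proportion of values between a and b as the area
    under the density-scale histogram over that interval."""
    area = 0.0
    for lower, upper, freq in classes:
        class_density = (freq / n) / (upper - lower)
        # Width of the part of this class that lies inside [a, b]; 0 if none.
        overlap = max(0.0, min(b, upper) - max(a, lower))
        area += overlap * class_density
    return area


if __name__ == "__main__":
    print(f"{proportion_between(classes, 60, 0.5, 3.0):.4f}")  # part (a)
    print(f"{proportion_between(classes, 60, 2.0, 4.0):.4f}")  # part (b)(i)
```

This approach treats salaries as spread evenly within each class, which is the same assumption made when a partial class width is multiplied by the class density.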
One More Example, Time Permitting: In this instance, we are referring to the Professor Salaries 20182019.csv data file. I will be loading this into StatCrunch, and creating various visualizations of these data
in class.
Wrap Up Exercise: The data below was obtained from a random sample of n = 36 men. The creatine
phosphokinase concentration, or CK-level (measured in u) was measured for each. The data is given below
and can be found in the NHLSampleSalaries19 20.csv file.
25 42 48 57 58 60 62 64 67 68 70 78 82 83 84 92 93 94 95 95 100 101 104 110 110 113 118 119 121 123 139
145 151 163 201 203
Please answer the question-parts posed below within Top Hat.
(a) Using class limits of 25 - 55 - 85 - 115 - 145 - 175 - 205, create a relative frequency histogram of
these data.
(b) What proportion of men in this sample had a CK-level that was at least 145 and less than 175?
(c) Think about the shape of the distribution of CK-levels for a population of males. What can you say
about the distribution shape of the CK-levels?
(d) Create a density histogram using the same classes as above. What is the density associated with the
55 < 85 class?
(e) Suppose you were to take the upper two classes, 145 < 175 and 175 < 205 and combine these into one
class. What would be the density of this new class?
(f) Compute the proportion of all males that have a CK-level below 100. Use four decimals in your
answer, keeping your answer in decimal form.