Uploaded by John Lee

Math 1060 - Lecture 1

MATH 1060‐ Statistics for Data Analytics
Descriptive Stats
An understanding of statistics requires us to make a number of definitions. While they may seem
unnecessarily complex, we need to be speaking the same language, and a good understanding helps
when dealing with others who may be using statistics to mislead you.
Example of Population and Sample: A polling company wants to find out how many Canadian
households (10,820,050) have internet access. The company randomly phones 1700 households in
Vancouver.
Population:
Sample:
Descriptive Statistics –
Inferential Statistics –
More Definitions (related to Data):
Quantitative data –
Categorical data –
Discrete data –
Continuous data –
DESCRIPTIVE STATISTICS
Organizing Data:
When conducting a statistical study, the researcher must gather data for a particular variable under
study. To describe situations, draw conclusions, or make inferences about the data, the researcher
must organize the data in some meaningful way.
Example: Your boss has just dumped a set of Radiation Exposure measurements on your desk and
asked you to prepare a presentation of the results (and it is Friday afternoon ‐ of course.) The
measurements are from 51 different patients who all received the same procedure ‐ radiation
therapy or x‐rays perhaps. The variation between patients reflects differences in your company’s
equipment from machine to machine.
How are you going to get the presentation ready?
The data set (as given) is:
Radiation Exposure Data
Patient   Radiation Exposure (rd)   Patient   Radiation Exposure (rd)
1         13.6                      27        5.1
2         2.8                       28        13.2
3         2.9                       29        3.8
4         3.8                       30        13.9
5         15.9                      31        2.4
6         1.7                       32        7.9
7         3.4                       33        1.4
8         13.7                      34        5.9
9         6.1                       35        6.5
10        16.8                      36        11.8
11        7.9                       37        13.2
12        3.5                       38        2.8
13        2.2                       39        6.9
14        4.1                       40        0.7
15        3.2                       41        12.9
16        2.9                       42        3.6
17        3.7                       43        3.6
18        2.9                       44        8.1
19        2                         45        17
20        2.9                       46        8.2
21        11.2                      47        9.8
22        1.9                       48        13
23        2                         49        11.3
24        6                         50        4
25        2.9                       51        6
26        7.7
Graphs “Pictures are worth a thousand words”
1) Scatter plot.
[Figure: Strip Plot. Horizontal axis: Radiation Exposure (rd), 0 to 20.]
2) Line graph.
[Figure: Line Graph. Horizontal axis: Patient Number, 0 to 60. Vertical axis: Radiation Exposure (rd), 0 to 20.]
3) Bar graph.
[Figure: Bar Graph. Horizontal axis: Patient Number, 1 to 49. Vertical axis: Radiation Exposure (rd), 0 to 20.]
Depending on the feature we are interested in, we choose the type of chart that best fits our
purpose. For example, a pie chart can show composition (in %), such as the types of devices used to
visit your website. A pie chart is a good option when there are fewer than 6 categories.
Note that all charts and graphs must be properly labelled.
[Figure: pie chart titled "Devices used to access the internet in 2017". Legend: Desktop Computer, Smartphone, Laptop, Tablet Computer, Other. Slice labels: 42%, 26%, 16%, 11%, 5%.]
In order to understand the distribution of data we may choose a stem and leaf chart, a bar chart (or
column chart), a histogram, or a density chart depending on the data.
A line chart is suitable for analyzing trends in data. For example, to visualize the monthly sales of a
certain brand of smartphone, or the price of a certain stock, a line chart is a good choice.
We will create some of the charts mentioned above using the radiation data.
Stem and leaf chart:
A mixture of table and chart that you may run across is the Stem and Leaf Diagram. It involves
breaking each numerical measurement into a stem and a leaf. For example, 2.34 becomes stem=2 and
leaf=34.
You may use any digit to break the numerical value, but you should do so in such a way as to display
the variation in the measurements. The “Leaf Unit” should be labelled to allow reconstruction of
the original data.
Radiation Example:
Stem Leaves
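One way to check a hand-drawn diagram is a short script. The sketch below is not part of the original notes: it assumes the stem is the whole number of rd and the leaf is the tenths digit (leaf unit = 0.1 rd), which is only one of the possible ways to split the values. The function name stem_and_leaf is a hypothetical helper.

```python
from collections import defaultdict

# Radiation exposure (rd) for the 51 patients, in patient order.
radiation = [
    13.6, 2.8, 2.9, 3.8, 15.9, 1.7, 3.4, 13.7, 6.1, 16.8, 7.9, 3.5, 2.2,
    4.1, 3.2, 2.9, 3.7, 2.9, 2, 2.9, 11.2, 1.9, 2, 6, 2.9, 7.7,
    5.1, 13.2, 3.8, 13.9, 2.4, 7.9, 1.4, 5.9, 6.5, 11.8, 13.2, 2.8, 6.9,
    0.7, 12.9, 3.6, 3.6, 8.1, 17, 8.2, 9.8, 13, 11.3, 4, 6,
]


def stem_and_leaf(values):
    """Map each stem (whole rd) to its sorted leaves (tenths of an rd)."""
    plot = defaultdict(list)
    for x in sorted(values):
        tenths = round(x * 10)           # 2.9 -> 29; rounding avoids float error
        stem, leaf = divmod(tenths, 10)  # 29 -> stem 2, leaf 9
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(radiation)

print("Leaf unit = 0.1 rd")
for stem in range(min(plot), max(plot) + 1):
    # Stems with no measurements are still printed so gaps stay visible.
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Printing the empty stems (10 and 14 here) keeps the vertical scale even, so the shape of the diagram reflects the spread of the data.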
Histograms
Other common graphs require a certain amount of data analysis, in which you divide the range of all the
measurements into m equal intervals called Classes.
Typically there should be between 5 and 20 classes in a plot depending on the number of
measurements.
The size of each class is called the Class Width.
Class Boundaries:
We can set up the class boundaries as: [a,b). Here the lower class boundary is included in the class and
the upper class boundary is not.
We can easily change this to “(a,b]” in R.
Class Boundary Rule: Class boundaries must be set up so that no single data point lies in 2
different classes. 
Radiation Example:
No of Classes:
Class Width: (high – low)/# of classes =
Class Boundaries:
Class Mark:
The number in the middle of each class is called the class mark.
Radiation Exposure   Frequency   Relative Frequency   Class Mark
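The table can also be tabulated with a short script. The sketch below is not part of the original notes. It assumes 5 classes of width 3.5 starting at 0, chosen because (17 − 0.7)/5 = 3.26 rounds up to a convenient 3.5 and because the resulting class marks agree with the axis labels 1.8, 5.3, 8.8, 12.3, 15.8 on the histograms in these notes; other choices are equally valid. Classes use the [a,b) convention described above, and frequency_table is a hypothetical helper name.

```python
# Radiation exposure (rd) for the 51 patients, in patient order.
radiation = [
    13.6, 2.8, 2.9, 3.8, 15.9, 1.7, 3.4, 13.7, 6.1, 16.8, 7.9, 3.5, 2.2,
    4.1, 3.2, 2.9, 3.7, 2.9, 2, 2.9, 11.2, 1.9, 2, 6, 2.9, 7.7,
    5.1, 13.2, 3.8, 13.9, 2.4, 7.9, 1.4, 5.9, 6.5, 11.8, 13.2, 2.8, 6.9,
    0.7, 12.9, 3.6, 3.6, 8.1, 17, 8.2, 9.8, 13, 11.3, 4, 6,
]

# Assumed class layout (not given in the notes).
NUM_CLASSES = 5
CLASS_WIDTH = 3.5
START = 0.0


def frequency_table(values, start, width, num_classes):
    """Return one row per class using [lower, upper) boundaries."""
    rows = []
    for i in range(num_classes):
        lower = start + i * width
        upper = lower + width
        # The lower boundary is included, the upper boundary is not.
        freq = sum(1 for x in values if lower <= x < upper)
        rows.append({
            "lower": lower,
            "upper": upper,
            "mark": (lower + upper) / 2,      # class mark = class midpoint
            "freq": freq,
            "rel_freq": freq / len(values),   # proportion of all measurements
        })
    return rows


table = frequency_table(radiation, START, CLASS_WIDTH, NUM_CLASSES)

print("Class          Frequency  Rel. Freq.  Class Mark")
for row in table:
    label = f"[{row['lower']:.1f}, {row['upper']:.1f})"
    print(f"{label:<14} {row['freq']:>9}  {row['rel_freq']:>10.3f}  {row['mark']:>10.2f}")
```

Because every value falls in exactly one half-open interval, the frequencies add up to 51 and the relative frequencies add up to 1, which is a quick check on the class boundary rule.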
Frequency Histogram:
[Figure: frequency histogram. Horizontal axis: Radiation Exposure (rd), class marks 1.8, 5.3, 8.8, 12.3, 15.8. Vertical axis: Frequency, 0 to 20.]
Relative Frequency Histogram:
There is also a relative frequency histogram, where each class has a column representing the
proportion of the total that the class represents. In this case, you divide the frequency of each class
by the total number of measurements in the experiment. This gives the relative frequency of each class.
[Figure: relative frequency histogram. Horizontal axis: Radiation Exposure (rd), class marks 1.8, 5.3, 8.8, 12.3, 15.8. Vertical axis: Relative frequency, 0.00 to 0.40.]
Distributions:
A relative frequency histogram can also be considered a distribution of the values that the variable can
take in a population. If it is a histogram such as shown above, it is called a Discrete Distribution, since
there are only a finite number of classes or values that the variable can take.
If we take the class sizes to be very small, perhaps as we measure more and more data points, then we
get a Continuous Distribution. You have seen a similar situation in calculus, where the derivative is
calculated using a slope formula and then letting $\Delta x \to 0$, so that $\Delta x$ becomes infinitesimally small.
For example a continuous distribution of radiation levels may look like:
[Figure: "A Distribution". Horizontal axis: Radiation Exposure (rd), 0 to 17.5. Vertical axis: Relative Frequency, 0 to 0.05.]
A continuous distribution would represent the relative frequency for an infinite population.
Descriptive Stats (2)
NUMERICAL DESCRIPTIVE STATISTICS
While being able to make graphical displays of your data is a valuable tool, you often need
quantitative measures of your data that are well understood and tell people something about your data
without listing all the measurement values.
The simplest quantitative description of data tends to fall into 3 basic categories:
1.
2.
3.
We will first consider the measures of central tendency, which help us answer the question
"Does our data tend to cluster around a central value?"
or
"Does our data have a trend towards a particular value?"
Measures of Central Tendency
Mean or Arithmetic Average
Example:
The annual chocolate sales (in billions of dollars) for a sample of seven countries in the world
are listed below. Find the mean.
$2.0, 4.9, 6.5, 2.1, 5.1, 3.2, 16.6
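As a check on the hand calculation, the mean is the sum of the values divided by the number of values. The sketch below is not part of the original notes. It also recomputes the mean without the largest value (16.6) purely to illustrate how strongly one extreme value can pull the mean; the variable names are illustrative.

```python
# Annual chocolate sales (billions of dollars) for seven countries.
sales = [2.0, 4.9, 6.5, 2.1, 5.1, 3.2, 16.6]

total = sum(sales)
n = len(sales)
sample_mean = total / n   # mean = (sum of values) / (number of values)

print(f"sum = {total:.1f}, n = {n}, mean = {sample_mean:.2f}")

# Illustration only: the same calculation with the largest value left out.
without_largest = sorted(sales)[:-1]
mean_without_largest = sum(without_largest) / len(without_largest)
print(f"mean without 16.6 = {mean_without_largest:.2f}")
```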
Advantage of Using Mean:
Disadvantage of Using Mean:
Median
Median overcomes the sensitivity issues in using the Mean.
Median –
Example:
(a) Find the median for the annual chocolate sales for 7 countries.
$2.0, 4.9, 6.5, 2.1, 5.1, 3.2, 16.6
(b) Find the median for the annual chocolate sales for 6 countries.
$2.0, 4.9, 6.5, 2.1, 5.1, 3.2
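Parts (a) and (b) differ only in whether the number of values is odd or even. The sketch below is not part of the original notes; it assumes the usual convention that an even-sized data set takes the value halfway between the two middle values. The function name median is a hypothetical helper (Python's statistics.median follows the same convention).

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: halfway between the middle pair


seven_countries = [2.0, 4.9, 6.5, 2.1, 5.1, 3.2, 16.6]
six_countries = [2.0, 4.9, 6.5, 2.1, 5.1, 3.2]

median_seven = median(seven_countries)   # middle (4th) of the 7 sorted values
median_six = median(six_countries)       # halfway between the 3rd and 4th sorted values

print(f"(a) median of 7 values: {median_seven}")
print(f"(b) median of 6 values: {median_six:.2f}")
```

Note that removing the extreme value 16.6 moves the median only from 4.9 to about 4.05, a much smaller shift than the one the mean would show.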
Mode
The third measure of average is called the mode.
Mode –
Example:
Find the mode of the following data set.
5, 5, 5, 3, 1, 5, 1, 4, 3, 5
If you have put your data into Classes then you can talk of the Modal Class which has the greatest
frequency.
Example:
A study of reaction times involved 30 left‐handed subjects, 50 right‐handed subjects, and 20
ambidextrous subjects. Find the mode.
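Both mode examples above come down to counting how often each value occurs, which works for categorical data as well as numbers. The sketch below is not part of the original notes; the helper name modes is hypothetical, and it returns a list because a data set can have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Numerical example from the notes.
scores = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5]

# Categorical example: 30 left-handed, 50 right-handed, 20 ambidextrous subjects.
handedness = ["left-handed"] * 30 + ["right-handed"] * 50 + ["ambidextrous"] * 20

score_modes = modes(scores)
handedness_modes = modes(handedness)

print("mode of scores:", score_modes)
print("mode of handedness:", handedness_modes)
```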
SUMMARY OF MEASURES OF CENTRAL TENDENCIES
There are several different ways to define the center of a set of data. The figure below illustrates the
differences among the mean, median and mode. Which one is best? The answer is dependent on the
objective of the data. The table below summarizes the different measures of center.
Unfortunately, the term average is sometimes used for any measure of centre and sometimes used
specifically for the mean. Avoid the term average; be more specific and use words like mean or median.
MEASURES OF DISPERSION (OR VARIATION)
It is a trivial matter to design two distributions which have the same mean, median and mode but which
have significantly different degrees of clustering around the mean values. Take the following example:
Example: A testing lab wishes to test two experimental brands of outdoor paint to see how long each will
last before fading. The testing lab takes 6 gallons of each paint to test. The results in months are
shown. Find the mean of each group.
Brand A   Brand B
10        35
60        45
50        30
30        35
40        40
20        25
Three Measures of Dispersion:
1.
2.
3.
Range
Range is the largest data value minus the smallest data value.
Example:
Find the ranges for the paints.
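The paint example can be checked numerically. The sketch below is not part of the original notes; it computes the mean and the range (largest minus smallest) for each brand, using illustrative names.

```python
# Months before fading for 6 gallons of each paint.
paints = {
    "A": [10, 60, 50, 30, 40, 20],
    "B": [35, 45, 30, 35, 40, 25],
}

means = {}
ranges = {}
for brand, months in paints.items():
    means[brand] = sum(months) / len(months)
    ranges[brand] = max(months) - min(months)   # range = largest - smallest
    print(f"Brand {brand}: mean = {means[brand]:.0f} months, range = {ranges[brand]} months")
```

The two brands share the same mean, so the mean alone cannot distinguish them; the range is what shows that Brand A is far less consistent.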
Variance and Standard Deviation
Variance and Standard Deviation are important measures of variation but the values must be
interpreted correctly.
Variance – the average of the squares of the distance each value is from the mean
Standard Deviation – the square root of the variance.
Example: Find the variance and standard deviation for the data set of Paint A and B.
Paint A: 10, 60, 50, 30, 40, 20
$\bar{X} = 35$
Values (X)   $X - \bar{X}$   $(X - \bar{X})^2$
10
60
50
30
40
20
Sum:
Paint B: 35, 45, 30, 35, 40, 25
Values (X)   $X - \bar{X}$   $(X - \bar{X})^2$
35
45
30
35
40
25
Sum:
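The table columns above translate directly into code: subtract the mean, square, and sum. The sketch below is not part of the original notes. The definition given earlier ("the average of the squares") corresponds to dividing by n; many texts divide by n − 1 when the data are a sample, so both versions are shown and which one the course intends is an assumption left open here. The helper name dispersion is hypothetical.

```python
import math


def dispersion(values):
    """Return mean, sum of squared deviations, and both variance conventions."""
    n = len(values)
    mean = sum(values) / n
    squared_devs = [(x - mean) ** 2 for x in values]   # the (X - X-bar)^2 column
    ss = sum(squared_devs)                             # the "Sum:" row
    return {
        "mean": mean,
        "sum_sq": ss,
        "var_n": ss / n,              # divide by n ("average of the squares")
        "var_n_minus_1": ss / (n - 1),  # divide by n - 1 (common sample convention)
    }


paint_a = dispersion([10, 60, 50, 30, 40, 20])
paint_b = dispersion([35, 45, 30, 35, 40, 25])

for name, result in (("A", paint_a), ("B", paint_b)):
    print(
        f"Paint {name}: sum of squares = {result['sum_sq']:.0f}, "
        f"variance (n) = {result['var_n']:.1f}, sd = {math.sqrt(result['var_n']):.1f}; "
        f"variance (n-1) = {result['var_n_minus_1']:.1f}, "
        f"sd = {math.sqrt(result['var_n_minus_1']):.1f}"
    )
```

Under either convention Paint A has the larger standard deviation, which matches the wider spread seen in its range.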
MEASURES OF DISPERSION WRAP‐UP
Understanding Standard Deviation
Standard deviation measures the variation among values. Values close together will yield a small
standard deviation while values spread farther apart will yield a larger standard deviation.
Empirical Rule
If the distribution is “bell” shaped (or "normal" as it is referred to) we can make a stronger statement
about the significance of the standard deviation.
MEASURES OF POSITION
In this section we introduce z scores, which enable us to standardize values so that they can be
compared more easily. We also introduce quartiles and percentiles which help us better understand
data by showing their positions relative to the whole data set.
z scores
z Score: the number of standard deviations that a given value x is above or below the mean.
Example:
A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored
30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative positions on
the two tests.
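Following the definition above, a z score is the distance of a value from the mean, measured in standard deviations: z = (x − mean) / (standard deviation). The sketch below is not part of the original notes; the helper name z_score is hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_calculus = z_score(65, mean=50, sd=10)
z_history = z_score(30, mean=25, sd=5)

print(f"calculus z = {z_calculus}, history z = {z_history}")

# The larger z score marks the better position relative to the rest of the class.
better = "calculus" if z_calculus > z_history else "history"
print("higher relative position:", better)
```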
Quartiles and Percentiles
Just as the median divides the data into two equal parts, the three quartiles, denoted by Q1, Q2 and Q3,
divide the sorted values into four equal parts.
Roughly speaking, Q1 separates the bottom 25% of the sorted values from the top 75%, Q2 is the
median, and Q3 separates the top 25% from the bottom 75%.
Just as there are three quartiles separating a data set into four parts, there are 99 percentiles, P1, P2, P3,
etc., which partition the data into 100 groups with about 1% of the scores in each group.
[Diagram: (minimum) | 25% | Q1 = P25 | 25% | Q2 = P50 (median) | 25% | Q3 = P75 | 25% | (maximum)]
Finding the Percentile/Quartile of a Given Score:
Percentile of score x =
Finding the Score of a Given Percentile/Quartile:
$L = \dfrac{k}{100} \cdot n$
n = total number of values in the data set
k = percentile being used
L = locator that gives the position of a value
Pk = kth percentile
Steps for finding the Score:
Step 1: Arrange the data in order from lowest to highest.
Step 2: Substitute into the formula for L above.
Step 3a: If L is not a whole number, round up to the next whole number. Starting at the
lowest value, count over to the number that corresponds to the rounded‐up value.
Step 3b: If L is a whole number, use the value halfway between the Lth and the (L+1)th values
when counting up from the lowest value.
Other Useful Definitions:
Interquartile Range (or IQR):
Semi‐interquartile Range:
Midquartile:
10 – 90 Percentile Range:
Example (Pulse Rate of Smokers):
52  52  60  60  60  60  63  63  66  67  68
69  71  72  73  75  78  80  82  83  88  90
Find Q1 (or P25), Q3 (or P75) and IQR.
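The locator steps described earlier can be applied to the pulse-rate data with a short script. The sketch below is not part of the original notes. It follows the rule stated in these notes (round L up if it is not whole, otherwise average the Lth and (L+1)th values); statistical software often uses other percentile conventions and may give slightly different quartiles. The helper names locator and percentile_score are hypothetical.

```python
pulse = [52, 52, 60, 60, 60, 60, 63, 63, 66, 67, 68,
         69, 71, 72, 73, 75, 78, 80, 82, 83, 88, 90]


def locator(k, n):
    """Return (position, is_whole) for L = (k/100) * n, using integer arithmetic."""
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        return whole, True        # L is a whole number (Step 3b applies)
    return whole + 1, False       # L is not whole: round up (Step 3a applies)


def percentile_score(values, k):
    """Score at the kth percentile (intended for k from 1 to 99)."""
    ordered = sorted(values)                        # Step 1
    position, is_whole = locator(k, len(ordered))   # Step 2
    if is_whole:
        # Step 3b: halfway between the Lth and (L+1)th values (1-based positions).
        return (ordered[position - 1] + ordered[position]) / 2
    # Step 3a: the value at the rounded-up position.
    return ordered[position - 1]


q1 = percentile_score(pulse, 25)   # L = 5.5, rounds up to the 6th value
q3 = percentile_score(pulse, 75)   # L = 16.5, rounds up to the 17th value
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Using divmod on k × n avoids floating-point surprises when deciding whether L is a whole number.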
Outliers:
• a value that is located very far away from almost all the other values
• an extreme value
• can have a dramatic effect on the mean, standard deviation, and on the scale of the histogram
so that the true nature of the distribution is totally obscured
Steps for Identifying Outliers
Step 1: Arrange the data in order and find Q1 and Q3.
Step 2: Find the interquartile range: IQR = Q3 – Q1.
Step 3: Multiply the IQR by 1.5.
Step 4: Subtract the value obtained in step 3 from Q1 and add the value to Q3.
Step 5: Check the data set for any data value that is smaller than
Q1 – 1.5(IQR) or larger than Q3 + 1.5(IQR).
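The outlier steps above can be sketched as a small function that takes the quartiles as inputs. This sketch is not part of the original notes. The quartile values used here (Q1 = 60, Q3 = 78 for the pulse-rate data) assume the round-up locator rule from these notes; a different quartile convention would change the fences slightly. The helper name find_outliers is hypothetical.

```python
pulse = [52, 52, 60, 60, 60, 60, 63, 63, 66, 67, 68,
         69, 71, 72, 73, 75, 78, 80, 82, 83, 88, 90]


def find_outliers(values, q1, q3):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    iqr = q3 - q1                   # Step 2
    step = 1.5 * iqr                # Step 3
    lower_fence = q1 - step         # Step 4: subtract from Q1
    upper_fence = q3 + step         # Step 4: add to Q3
    # Step 5: keep any value outside the two fences.
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


# Assumed quartiles for the pulse-rate data (see the lead-in above).
lower_fence, upper_fence, outliers = find_outliers(pulse, q1=60, q3=78)

print(f"lower fence = {lower_fence}, upper fence = {upper_fence}")
print("outliers:", outliers if outliers else "none")
```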
Boxplots:
Reveals the:
• center of the data
• spread of the data
• distribution of the data
• presence of outliers
Excellent for comparing two or more data sets
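5‐number summary
• Minimum
• First quartile Q1
• Median Q2
• Third quartile Q3
• Maximum
[Figure: boxplot with the box marked at Q1, Median, and Q3, a span of 1.5 IQR shown on each side of the box, and an outlier marked with *.]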
Example (Pulse Rate of Smokers):
52  69
52  71
60  72
60  73
60  75
60  78
63  80
63  82
66  83
67  88
68  90
Graph the boxplot for the data.
Minimum:
Q1, first quartile:
Median:
Q3, third quartile:
Maximum:
Check for Outliers
IQR =
1.5(IQR) =
Q1 ‐ 1.5(IQR) =
Q3 + 1.5(IQR) =