Very Basic Statistics

Course Content
• Data Types
• Descriptive Statistics
• Data Displays
Data Types
Variables
• Quantitative Variable
• A variable that is counted or measured on a
numerical scale
• Can be continuous or discrete (discrete data is always a whole number).
• Qualitative Variable
• A non-numerical variable that can be classified into
categories, but can’t be measured on a numerical
scale.
• Can be nominal or ordinal
Continuous Data
• Continuous data is measured on a scale.
• The data can have almost any numeric value
and can be recorded at many different points.
• For example
• Temperature (39.25 °C)
• Time (2.468 seconds)
• Height (1.25 m)
• Weight (66.34 kg)
Discrete Data
• Discrete data is based on counts, for example;
• The number of cars parked in a car park
• The number of patients seen by a dentist each day.
• Only separate, whole-number values are possible, e.g. a dentist could see 10, 11 or 12 people but not 12.3 people
Nominal Data
• A Nominal scale is the most basic level of measurement.
The variable is divided into categories and objects are
‘measured’ by assigning them to a category.
• For example,
• Colours of objects (red, yellow, blue, green)
• Types of transport (plane, car, boat)
• There is no order of magnitude to the categories i.e.
blue is no more or less of a colour than red.
Ordinal Data
• Ordinal data is categorical data where the categories can be placed in a logical ascending order, e.g.
• 1 – 5 scoring scale, where 1 = poor and 5 = excellent
• Strength of a curry (mild, medium, hot)
• There is some measure of magnitude: a score of ‘5 – excellent’ is better than a score of ‘4 – good’.
• But this says nothing about the degree of difference between the categories, i.e. we cannot assume a customer who thinks a service is excellent is twice as happy as one who thinks the same service is good.
Task 1
• Look at the following variables and decide if they are
qualitative or quantitative, ordinal, nominal, discrete
or continuous
• Age
• Year of birth
• Sex
• Height
• Number of staff in a department
• Time taken to get to work
• Preferred strength of coffee
• Company size
Descriptive Statistics
Session Content
• Measures of Location
• Measures of Dispersion
Measures of Location
Common Measures
• Measures of location summarise the data with
a single number
• There are three common measures of location
• Mean
• Mode
• Median
• Quartiles are another measure
Mean
• The mean (more precisely, the arithmetic mean) is
commonly called the average
• In formulas the mean is usually represented by $\bar{x}$, read as ‘x-bar’.
• The formula for calculating the mean from ‘n’ individual data-points is:
$$\bar{x} = \frac{\sum x}{n}$$
X bar equals the sum of the data divided by the number of data-points
Pros & Cons
• Advantages
– basic calculation is easily understood
– all data values are used in the calculation
– used in many statistical procedures.
• Disadvantages
– It may not be an actual ‘meaningful’ value, e.g. an average of 2.4 children per family.
– Can be greatly affected by extreme values in a dataset, e.g. seven students take a test and receive the following scores.
40 42 45 50 53 54 99
– The average score is 54.7 – but is this really representative of the group?
– If the extreme value of 99 is dropped, the average falls to 47.3
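The effect of the extreme value in the test-score example can be checked with a few lines of Python. This is only a minimal sketch: the function name `mean` and the variable names are illustrative choices, not part of the course material.

```python
def mean(values):
    """Arithmetic mean: the sum of the data divided by the number of data-points."""
    return sum(values) / len(values)


scores = [40, 42, 45, 50, 53, 54, 99]  # the seven test scores from the slide

mean_all = mean(scores)
# Drop the extreme value of 99 and recompute.
mean_without_extreme = mean([s for s in scores if s != 99])

print(round(mean_all, 1))              # 54.7
print(round(mean_without_extreme, 1))  # 47.3
```

A single extreme score moves the mean by more than seven points, which is the sensitivity described in the disadvantages list.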
Mode
• The mode represents the most commonly occurring
value within a dataset.
• We usually find the mode by creating a frequency
distribution in which we tally how often each value
occurs.
• If we find that every value occurs only once, the distribution
has no mode.
• If we find that two or more values are tied as the most
common, the distribution has more than one mode.
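The tallying approach described above might be sketched in Python as follows. The function name `modes` is hypothetical, the sample data is made up for illustration, and returning an empty list for "no mode" is one possible convention.

```python
from collections import Counter


def modes(values):
    """Return a list of modes; empty if every value occurs only once."""
    counts = Counter(values)      # frequency distribution: value -> tally
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                 # no value repeats, so there is no mode
    # Two or more values tied for the highest tally are all modes.
    return [value for value, tally in counts.items() if tally == highest]


print(modes(["silver", "red", "silver", "blue"]))  # ['silver']
print(modes([1, 2, 3]))                            # []
print(modes([1, 1, 2, 2, 3]))                      # [1, 2]
```

Because the function only counts occurrences, it works for qualitative data such as colours as well as for numbers.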
Pros & Cons
• Advantages
– easy to understand
– not affected by outliers
(extreme values)
– can also be obtained for
qualitative data
e.g. when looking at the
frequency of colours of cars
we may find that silver occurs
most often
• Disadvantages
– not all sets of data have a modal
value
– some sets of data have more
than one modal value
– multiple modal values are often
difficult to interpret
Task 2
• The following values are the ages of students in their
first year of a course
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
• Find the mean age of the students
• Find the modal value
• In your opinion which is the better measure of location
for this data set?
Median
• Median means middle, and the median is the middle of
a set of data that has been put into rank order.
• Specifically, it is the value that divides a set of data into
two halves, with one half of the observations being
larger than the median value, and one half smaller.
Example (rank-ordered data): 18, 24, 29, 30, 32
The median is 29: half the remaining data is < 29 and half is > 29.
Finding the Median from Individual Data
• Step 1: Arrange the observations in increasing order, i.e. rank order. The median will be the number that corresponds to the middle rank.
• Step 2: Find the middle rank with the following formula:
Middle rank = ½*(n+1)
• Step 3: Identify the value of the median
• If ‘n’ is an odd number the middle rank will fall on an observation. The median is then the value of that observation.
Finding the Median from Individual Data
• If ‘n’ is an even number, the middle rank will fall between two observations. In this case the median is equal to the arithmetic mean of the values of the two observations
Example data: 40, 42, 45, 50, 53, 54, 70, 99 (n = 8)
Position of Median = ½*(n+1) = 4.5
$$\text{Median} = \frac{\text{data-point 4} + \text{data-point 5}}{2} = \frac{50 + 53}{2} = 51.5$$
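The three steps for finding the median can be sketched in Python. This is an illustrative implementation under the middle-rank rule ½*(n+1); the function name `median` is chosen here and is not from the course.

```python
def median(values):
    """Median via the middle rank ½*(n+1) of the rank-ordered data."""
    ordered = sorted(values)          # Step 1: arrange in rank order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    middle_rank = (n + 1) / 2         # Step 2: 1-based middle rank
    lower = int(middle_rank)          # rank at or just below the middle
    if middle_rank == lower:          # odd n: the rank falls on an observation
        return ordered[lower - 1]
    # even n: mean of the two observations either side of the middle rank
    return (ordered[lower - 1] + ordered[lower]) / 2


print(median([40, 42, 45, 50, 53, 54, 70, 99]))  # 51.5
print(median([18, 24, 29, 30, 32]))              # 29
```

The `- 1` offsets are there because the ranks in the formula start at 1 while Python lists start at 0.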
Pros & Cons
• Advantages
– the concept is easy to understand
– the median can be determined for any type of data (with the exception of nominal)
– the median is not unduly influenced by extreme values in the dataset
• Disadvantages
– data must be arranged in rank order (ascending or descending)
– cannot combine medians in statistical calculations as with mean values
Task 3
• Using the student age data below, find the
median age
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Quartiles
• Quartiles are specific percentiles (the 25th, 50th and 75th)
• Lower quartile - 25% of the data is below this
• Position of Q1 = ¼*(n+1)
• Upper quartile – 75% of the data is below this
• Position of Q3 = ¾*(n+1)
• If a quartile falls on an observation, the value of the
quartile is the value of that observation.
• For example, if the position of a quartile is 20, its value is
the value of the 20th observation.
Quartiles
• If a quartile lies between observations, the value of the
quartile is the value of the lower observation plus the
specified fraction of the difference between the two
observations.
Example data: 40, 42, 45, 50, 53, 54, 70, 99 (n = 8)
Position of Upper Quartile = ¾*(n+1) = 6.75
Upper quartile = data-point 6 + 0.75*(data-point 7 – data-point 6)
Upper quartile = 54 + 0.75*(70 – 54) = 66
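The position rule and the interpolation step above could be written in Python roughly as follows. The function name `quartile` is hypothetical, and clamping the position to the range 1..n for very small datasets is an assumption added here, not something the slides specify.

```python
def quartile(values, fraction):
    """Quartile at position fraction*(n+1), interpolating between observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("quartile of empty data")
    position = fraction * (n + 1)         # 1-based position, may be fractional
    position = min(max(position, 1), n)   # assumption: clamp for very small n
    lower = int(position)                 # rank of the lower observation
    remainder = position - lower          # the "specified fraction"
    if remainder == 0:                    # the quartile falls on an observation
        return ordered[lower - 1]
    # lower observation plus the fraction of the gap to the next observation
    return ordered[lower - 1] + remainder * (ordered[lower] - ordered[lower - 1])


scores = [40, 42, 45, 50, 53, 54, 70, 99]
print(quartile(scores, 0.75))  # 66.0
print(quartile(scores, 0.25))  # 42.75
```

Other textbooks and software packages use slightly different position rules, so quartiles from other tools may not match these values exactly.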
Task 4
• Using the student age data below find the
upper and lower quartiles
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Measures of Dispersion
Common Measures
• The dispersion in a set of data is the variation among
the set of data values.
• It measures whether they are all close together, or
more scattered.
[Figure: two plots of report turnaround time (days), one with an axis running from 2 to 16 and one from 2 to 12]
Common Measures
• The four common measures of spread are
• the range
• the inter-quartile range
• the variance
• the standard deviation
Range
• The range is the difference between the largest and the
smallest values in the dataset i.e. the maximum
difference between data-points in the list.
• It is sensitive to only the most extreme values in the list.
The range of a list is 0 if and only if all the data-points
in the list are equal.
[Figure: the range marked on a Days axis, from 4 to 16]
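As a quick illustration, the range can be computed in Python in one line. The function name `data_range` is chosen here to avoid shadowing the built-in `range`; the first dataset is the set of test scores used in the median example, and the second is made up to show the zero case.

```python
def data_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


print(data_range([40, 42, 45, 50, 53, 54, 70, 99]))  # 59
print(data_range([7, 7, 7]))                          # 0, all data-points equal
```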
Pros & Cons
• Advantages
– best for symmetric data with no outliers
– easy to compute and understand
– good option for ordinal data
• Disadvantages
– doesn’t use all of the data, only the extremes
– very much affected if the extremes are outliers
– only shows maximum spread, does not show shape
Task 5
• Using the student age data find the range of
the data.
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Inter-quartile Range
• Inter-quartile range = upper quartile – lower quartile
• Essentially describes how much the middle 50% of your dataset varies
• Example: if all patients in a dentist surgery took more-or-less the same time to be treated, with only one or two exceptionally quick or long appointments, you would expect the inter-quartile range to be very small
• But if all appointments were either very quick or very long, with few in between, then the inter-quartile range would be larger.
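The dentist example can be made concrete with a small Python sketch. The appointment times below are hypothetical values invented for illustration. The code relies on the standard library's `statistics.quantiles` with `method="exclusive"`, which places the quartiles at the ¼*(n+1) and ¾*(n+1) positions; the function name `interquartile_range` is an illustrative choice.

```python
import statistics


def interquartile_range(values):
    """Upper quartile minus lower quartile."""
    # n=4 asks for the three cut points that split the data into quarters.
    lower_q, _, upper_q = statistics.quantiles(values, n=4, method="exclusive")
    return upper_q - lower_q


# Hypothetical appointment times in minutes (not from the course).
similar = [5, 29, 30, 30, 30, 31, 31, 32, 32, 60]   # mostly alike, two extremes
polarised = [5, 5, 6, 6, 7, 55, 56, 58, 59, 60]     # very quick or very long

iqr_similar = interquartile_range(similar)
iqr_polarised = interquartile_range(polarised)
print(iqr_similar, iqr_polarised)
```

Both datasets have the same range (55 minutes), but the inter-quartile range is small for the first and large for the second, because it only looks at the middle 50% of the data.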
Pros & Cons
• Advantages
– Good for ordinal data
– Ignores extreme values
– More stable than the range because it ignores outliers
• Disadvantages
– Harder to calculate and understand
– Doesn’t use all the information (ignores half of the data-points, not just the outliers)
• Tails almost always matter in data and these aren’t included
• Outliers can also sometimes matter and again these aren’t included.
Task 6
• Using the student age data find the interquartile range.
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Variance and Standard Deviation
(σ², s²) = (population notation, sample notation)
• The variance (σ², s²) and standard deviation (σ, s) are measures of the deviation or dispersion of observations (x) around the mean (μ) of a distribution
• Variance is an ‘average’ squared deviation from the mean
Variance and Standard Deviation
• The standard deviation (SD) is the square root of the
variance.
• small SD = values cluster closely around the mean
• large SD = values are scattered
[Figure: two distributions on a Days axis with the mean and 1 SD marked; one axis runs from 4 to 16, the other from 8 to 12]
Variance and Standard Deviation
• The following formulae define these measures
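Population
$$\text{Variance} = \sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
$$\text{Standard Deviation} = \sigma = \sqrt{\sigma^2}$$
Sample
$$\text{Variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
$$\text{Standard Deviation} = s = \sqrt{s^2}$$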
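The population and sample formulae differ only in the divisor, which a short Python sketch may make easier to see. The function names are illustrative and the dataset is a made-up example chosen so that the population results come out as whole numbers.

```python
import math


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two data-points")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0 (population SD)
print(sample_variance(data))                 # 32/7, slightly larger
print(math.sqrt(sample_variance(data)))      # sample SD
```

Python's standard `statistics` module offers `pvariance`, `pstdev`, `variance` and `stdev` for the same calculations.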
Variance
• Advantages:
• uses all of the data values
• Disadvantages:
• the variance is measured in the original units
squared
• extreme values or outliers affect the variance considerably
• hard to calculate manually
Standard Deviation
• Advantages:
• same units of measurement as the values
• useful in theoretical work and statistical methods
and inference
• Disadvantages:
• hard to calculate manually
Task 7
• Using the student age data find the variance
and the standard deviation
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Session Summary
• Measures of Location
• Mean
• Mode
• Median
• Quartiles
• Measures of Dispersion
• Range
• Interquartile Range
• Variance
• Standard Deviation
Data Displays
Session Content
– Histograms
– Run charts
– Box plots
– Bar charts
– Pareto charts
– Pie charts
– Scatter plots
– Contingency tables
Histograms
[Figure: Histogram of dataset 1 (normal); x-axis: dataset 1 (normal), 45.0 to 90.0; y-axis: Frequency, 0 to 30]
Run Charts
[Figure: Time Series Plot of Time Taken; x-axis: Day (mon to fri, repeated over four weeks); y-axis: Time Taken, 25.0 to 35.0]
Boxplots
[Figure: Boxplot of dataset 1 (normal), dataset 2 (exponential) and dataset 3 (uniform); y-axis: Data, 0 to 400]
Bar Charts
[Figure: Chart of Frequency; x-axis: Causes of Medication Errors (missed dose, wrong patient, wrong dose, wrong time, wrong medicine); y-axis: Frequency, 0 to 20]
Pareto Charts
[Figure: Pareto Chart of Causes of Medication Errors; left axis: Frequency, 0 to 40; right axis: Percent, 0 to 100]
Causes of Medication Errors   wrong dose   wrong time   wrong medicine   wrong patient   Other
Frequency                     18           15           4                2               1
Percent                       45.0         37.5         10.0             5.0             2.5
Cum %                         45.0         82.5         92.5             97.5            100.0
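The Frequency, Percent and Cum % rows of a Pareto chart can be derived from raw counts. The Python sketch below shows one way to do it using the medication-error frequencies from the chart; the function name `pareto_table` and the dictionary layout are assumptions made for illustration.

```python
counts = {
    "wrong patient": 2,
    "wrong dose": 18,
    "other": 1,
    "wrong time": 15,
    "wrong medicine": 4,
}


def pareto_table(counts):
    """Rows of (category, frequency, percent, cumulative percent), largest first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0                      # running total of frequencies so far
    for category, frequency in ordered:
        running += frequency
        rows.append((category, frequency,
                     100 * frequency / total,
                     100 * running / total))
    return rows


for row in pareto_table(counts):
    print(row)
```

Sorting from the largest category to the smallest is what distinguishes a Pareto chart from an ordinary bar chart; the cumulative percentage shows how few causes account for most of the errors.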
Pie Charts
[Figure: Pie Chart of Causes of Medication Errors; categories: missed dose, wrong patient, wrong dose, wrong time, wrong medicine; slices: 18 (45.0%), 15 (37.5%), 4 (10.0%), 2 (5.0%), 1 (2.5%)]
Scatterplots
[Figure: Scatterplot of Weight Loss vs Time on Diet; x-axis: Time on Diet, 0 to 25; y-axis: Weight Loss, 0 to 80]
Contingency Tables
                 Colour of eyes
Colour of hair   Brown   Green/grey   Blue   Total
Black            50      54           41     145
Brown            38      46           48     132
Fair             22      30           31     83
Ginger           10      10           20     40
Total            120     140          140    400 = N
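The row and column totals of a contingency table are simple sums over the cell counts. A minimal Python sketch using the hair and eye colour counts from the table might look like this; the dictionary layout and variable names are illustrative choices.

```python
eye_colours = ["Brown", "Green/grey", "Blue"]

# One row per hair colour; counts are in the same order as eye_colours.
table = {
    "Black":  [50, 54, 41],
    "Brown":  [38, 46, 48],
    "Fair":   [22, 30, 31],
    "Ginger": [10, 10, 20],
}

# Row totals: number of people with each hair colour.
row_totals = {hair: sum(cells) for hair, cells in table.items()}

# Column totals: zip(*rows) regroups the cells by eye colour.
column_totals = [sum(column) for column in zip(*table.values())]

grand_total = sum(row_totals.values())  # N

print(row_totals)
print(dict(zip(eye_colours, column_totals)))
print(grand_total)  # 400
```

The row totals and the column totals must both add up to the same grand total N, which is a useful check when building a table by hand.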
Session Summary
– Histograms
– Run charts
– Box plots
– Bar charts
– Pareto charts
– Pie charts
– Scatter plots
– Contingency tables
Course Summary
• Data Types
• Descriptive Statistics
• Data Displays