Uploaded by Yin Man LO

Descriptive Statistics: Foundations of Data Science

AMS1301
Foundations of Data Science
Chapter 1 - Descriptive Statistics
Bachelor of Science (Hons) in
• Actuarial Studies and Insurance (BSC-AIN)
• Data Science and Business Intelligence (BSC-DSBI)
1
What is Data Science?
A collection of analytical skills and techniques
derived from mathematics, statistics and computer
science for extracting information from data.
Why Data Science?
1. The world is data-driven (Octopus card, Google, etc.).
2. Increasing demand for data scientists in the world.
3. Reports from Ernst & Young and McKinsey Global
Institute.
-- analyzing business data sets will become a key
basis of competition in the future.
2
1.1
Statistical Terminology
Statistics is a branch of mathematics that
transforms numbers into useful information for
decision making.
1. It lets you know the risk in a business.
2. It allows you to understand and reduce variation
so as to make an appropriate business decision.
Statistics involves collecting, classifying, summarizing,
organizing, analyzing, and interpreting numerical
information.
3
Why Statistics?
1. To summarize business data
2. To draw conclusions from the data
3. To make reliable forecasts about business
activities
4. To improve business processes
4
1.1.1
Basic Terminology
1. Variable : a characteristic of an item or individual
2. Data : values associated with a variable (Singular: Datum)
3. Population : a set of elements of interest for a given problem
4. Sample : a portion of a population selected for analysis
5. Sampling : a process of selecting samples from the
population
5
6. Census : a study of every element of the population
7. Survey : a study of part of the elements (a sample) of the
population
8. Parameter : a descriptive measure of the population
e.g.: $\mu$ and $\sigma$
9. Statistic : a descriptive measure of the sample
e.g.: $\bar{x}$ and $s$
6
Descriptive statistics utilizes numerical
and graphical methods to look for
patterns in a data set, to summarize
the information revealed in a data set,
and to present information in a
convenient form.
Inferential statistics draws conclusions
about the characteristics of the
population based on the sample data.
7
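[Diagram: Sampling draws sample data from the population data. Calculation / analysis of the sample data gives a statistic. Statistical inference uses the statistic to draw conclusions about the population parameter.]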
8
1.1.2 Types of Data
There are different ways to summarize
different types of data. Before we
move on to the ways to convert data into
useful information, we need to be
able to classify various types of data.
9
[Diagram: classification of data]
Data
   Quantitative / Numerical:  Ratio, Interval
   Qualitative / Categorical: Ordinal, Nominal
10
Categorical data:
(1) Also known as qualitative data.
(2) Data are separated into categories,
e.g.: yes/no, true/false,
male/female, etc.
Numerical data:
(1) Also known as quantitative data.
(2) Data are represented by numerical values,
e.g.: height, weight, temperature,
income, distance, etc
11
Numerical data can be further
identified as being either discrete or
continuous.
Discrete variable:
Produces numerical responses that arise from a
counting process.
e.g.: How many modules have you taken in this
semester? (a whole number, say from 0 to 7)
Continuous variable:
Produces numerical responses that arise from a
measuring process.
e.g.: How tall are you? (any value in a range, say from 170 cm to 190 cm)
12
Example 1.1 Categorical vs Numerical Data
Question                                              Response   Data Type
1. Do you currently have a profile on Facebook?       Yes/No     Categorical
2. How many modules have you taken in this semester?  5          Numerical
3. How long did you take to travel from home to
   University?                                        30 mins    Numerical
13
Statisticians use the terms
(1) nominal scale and ordinal scale to
describe categorical variables.
(2) interval scale and ratio scale to
describe numerical variables.
14
Nominal data (Categorical)
(1) No particular order (e.g. yes/no, true/false).
(2) Information is obtained by doing counts
of the number of occurrences.
(3) Numbers, letters, symbols, colours, etc
are used to represent nominal data.
15
Ordinal data (Categorical)
(1) Ranked categorical items,
e.g., fail/pass/credit/distinction
(2) Information is lost if data structure
is re-organized,
e.g., fail/pass/credit/distinction
→ fail/satisfactory/distinction
16
Example 1.2 Nominal vs Ordinal Data
Categorical Variable                       Categories                     Data Type
1. Gender                                  Male / Female                  Nominal
2. Profile on Facebook?                    Yes / No                       Nominal
3. Types of investments                    Cash / stocks / bonds / None   Nominal
4. Rating of the bus services              Excellent to Poor              Ordinal
5. Student Grade                           A, A-, …, D, U                 Ordinal
6. Standard & Poor’s bond credit ratings   AAA, AA, A, BBB, …             Ordinal
Remark: We often record categorical data by
arbitrarily assigning a number to each category.
For instance, Male = 1, Female = 2.
17
Interval data (Numerical)
(1) Ordered data
(2) The difference between measurements
is a meaningful quantity but does not
involve a true zero point (arbitrary zero).
e.g.: The difference between a temperature of 100
degrees and 90 degrees is the same as
between 90 degrees and 80 degrees.
18
Ratio data (Numerical)
(1) have all the properties of an interval variable
(2) have a clear definition of 0.0. When the
variable equals 0.0, there is none of that
variable (involve a true zero point).
e.g.: Temperature (in F or C) is interval, but
temperature (in Kelvin) is ratio because
0.0 Kelvin means “no thermal energy”.
e.g.: Height, weight, age, length are ratio.
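The difference between the two scales can be sketched numerically. The readings of 10 °C and 20 °C and the helper name `celsius_to_kelvin` below are illustrative assumptions, not part of the lecture notes.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (0 K is the true zero point)."""
    return celsius + 273.15

cold_c, warm_c = 10.0, 20.0   # hypothetical readings in Celsius
cold_k, warm_k = celsius_to_kelvin(cold_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cold_c
diff_k = warm_k - cold_k

# Ratios are only meaningful on the ratio (Kelvin) scale.
ratio_c = warm_c / cold_c     # 2.0, yet 20 C is not "twice as hot" as 10 C
ratio_k = warm_k / cold_k     # roughly 1.035

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
```

The equal differences are what make Celsius an interval scale. The ratio of 2 on the Celsius scale is an artefact of its arbitrary zero.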
19
Example 1.3 Interval vs Ratio Data
Numerical Variable            Data Type
1. Temperature (in C or F)    Interval
2. Temperature (in Kelvin)    Ratio
3. Time                       Ratio
4. Weight                     Ratio
5. Age                        Ratio
Remark: The distinction between interval and ratio
data is subtle, but fortunately it is
often not important. For most statistical purposes, there is
no practical difference between ratio and interval data.
20
1.2 Measures of Central Tendency
and Variability
1.2.1 Measures of Central Location
Three different measures
-- Arithmetic mean
-- Median
-- Mode
21
1.2.1.1 Arithmetic Mean
Let $x_1, x_2, \ldots, x_n$ be observations in a sample
(ungrouped data), where $n$ is the sample size.
The sample mean is denoted by $\bar{x}$ (x-bar).
In a population, the number of observations is $N$
and the population mean is $\mu$.
Population Mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Sample Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
22
Example 1.4
The numbers of TV watching hours per week of 5
students randomly selected from a class are:
5 7 3 8 7
(Solution)
The sample mean is
$\bar{x} = \frac{1}{5} \sum_{i=1}^{5} x_i = \frac{5 + 7 + 3 + 8 + 7}{5} = 6$
Remark: Note that the mean is very sensitive to
extreme values (outliers). For instance, if 8 is changed
to 38, the mean changes substantially from 6 to 12.
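The effect of the outlier can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean

hours = [5, 7, 3, 8, 7]            # data from Example 1.4
hours_outlier = [5, 7, 3, 38, 7]   # same data with 8 replaced by 38

mean_hours = mean(hours)
mean_outlier = mean(hours_outlier)

print(mean_hours, mean_outlier)    # 6 12
```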
23
For Grouped data
Suppose n observations are grouped into k classes $c_1$ – $d_1$,
$c_2$ – $d_2$, …, $c_k$ – $d_k$ with frequencies $f_1, f_2, \ldots, f_k$, and the class
marks (mid-value of each class) are $x_1, x_2, \ldots, x_k$.
Class       Class Mark   Frequency
c1 – d1     x1           f1
c2 – d2     x2           f2
…           …            …
ck – dk     xk           fk
Sample Mean: $\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
Remark: we use the mid-value of a class to represent that class.
24
Example 1.5
Suppose the weights (in kg) of 100 students are tabulated
in the following table. Find the mean weight of the students.
Weight (kg)   Class Mark   Frequency
30-34         32           2
35-39         37           8
40-44         42           15
45-49         47           30
50-54         52           23
55-59         57           16
60-64         62           6
(Solution)
$\bar{x} = \frac{\sum_{i=1}^{7} f_i x_i}{\sum_{i=1}^{7} f_i} = \frac{2 \times 32 + 8 \times 37 + \cdots + 6 \times 62}{2 + 8 + \cdots + 6} = 48.8$
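The grouped-mean calculation of Example 1.5 can be reproduced with a minimal sketch. The function name `grouped_mean` is an illustrative choice.

```python
class_marks = [32, 37, 42, 47, 52, 57, 62]   # mid-values of the weight classes
frequencies = [2, 8, 15, 30, 23, 16, 6]      # number of students in each class

def grouped_mean(marks, freqs):
    """Mean of grouped data: sum of f_i * x_i divided by sum of f_i."""
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, marks)) / n

mean_weight = grouped_mean(class_marks, frequencies)
print(mean_weight)   # 48.8
```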
25
1.2.1.2 Median
The median is calculated by placing all the
observations in order (ascending or descending).
The observation that falls in the middle is the
median.
For n observations, we have two cases.
➢ When n is odd, the median is the $\left(\frac{n+1}{2}\right)$th
ranked value.
➢ When n is even, the median is the average
of the $\left(\frac{n}{2}\right)$th and the $\left(\frac{n}{2}+1\right)$th ranked values.
26
Example 1.6
Compute the median for each of the following
sequences of numbers.
1. 1.2 1.4 1.8 2.1 2.7 3.5 3.9
2. 1.2 1.4 1.8 2.1 2.7 3.5 3.9 4.1
(Solution)
For the 1st sequence, n = 7, thus the median is the 4th ranked value, 2.1.
For the 2nd sequence, n = 8, thus the median is the average of the 4th and 5th ranked values,
$\frac{1}{2}(2.1 + 2.7) = 2.4$
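The two cases (odd and even n) can be sketched as a small function. The name `median_by_rank` is illustrative, and the standard library's `statistics.median` gives the same result.

```python
def median_by_rank(values):
    """Median via ranked positions: middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n + 1)/2-th ranked value sits at 0-based index n // 2
        return ordered[mid]
    # n even: average of the (n/2)-th and (n/2 + 1)-th ranked values
    return (ordered[mid - 1] + ordered[mid]) / 2

seq1 = [1.2, 1.4, 1.8, 2.1, 2.7, 3.5, 3.9]
seq2 = [1.2, 1.4, 1.8, 2.1, 2.7, 3.5, 3.9, 4.1]

median1 = median_by_rank(seq1)   # 2.1
median2 = median_by_rank(seq2)   # approximately 2.4 (floating-point arithmetic)
```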
27
1.2.1.3 Mode
The mode is defined as the observation (or
observations) that occurs with the greatest
frequency.
Remark: Note that a distribution may have more than one
mode. If all data appear only once, then there is no mode.
Example 1.7
Find the mode of the data.
29 31 35 39 39 40 43 44 44 52
(Solution)
There are two modes, 39 and 44, each appearing
twice.
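A minimal sketch of the mode, using `statistics.multimode` from the Python standard library. The wrapper `modes_or_none` is a hypothetical helper that applies the convention in the remark above: if every value appears only once, there is no mode.

```python
from statistics import multimode

def modes_or_none(values):
    """Return all modes, or an empty list when every value appears only once."""
    if len(values) == len(set(values)):
        return []                 # all data appear once: no mode
    return multimode(values)      # every value tied for the greatest frequency

data = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
modes = modes_or_none(data)
print(modes)   # [39, 44]
```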
28
Some other useful measures of central tendency.
The geometric mean is used whenever we wish to find the
“average” growth rate, or rate of change, in a variable
over time.
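As a sketch of how the geometric mean gives an average growth rate, the example below uses hypothetical yearly growth rates of +10%, +20% and -5%. These figures are illustrative, not taken from the lecture notes. `statistics.geometric_mean` is available in Python 3.8 and later.

```python
from statistics import geometric_mean

growth_rates = [0.10, 0.20, -0.05]            # hypothetical yearly growth rates
growth_factors = [1 + r for r in growth_rates]

# The geometric mean of the growth factors is the average growth factor per year.
average_factor = geometric_mean(growth_factors)
average_growth_rate = average_factor - 1

# Growing at the average rate for three years reproduces the compound growth.
compound = 1.0
for factor in growth_factors:
    compound *= factor

print(round(average_growth_rate, 4))
```

The arithmetic mean of the three rates would overstate the growth, because growth compounds multiplicatively rather than additively.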
1.2.1.4 Which is Best?
The mean is generally our first choice
-- simple and easy to compute.
Sometimes median is better.
-- not sensitive to extreme values.
The mode is seldom the best measure of central
location.
29
1.2.2 Measures of Variability
Data can be characterized by its variation
and shape. Variation measures the spread or
dispersion of values in a data set.
1.2.2.1 Range
Difference between the largest and the
smallest values in a data set. The larger the
range, the larger the variation of the data.
Range = Largest observation – Smallest observation
30
Advantage:
➢ Simplicity
Disadvantage:
➢ Simplicity
-- calculated from only two observations.
-- tells us nothing about the other observations.
Example 1.8
Find the range for the following sets of data:
Set 1: 4 4 4 4 4 50
Set 2: 4 8 15 24 39 50
(Solution)
The range of both sets is 46.
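A short sketch of Example 1.8 shows that the range cannot distinguish the two sets. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)

set1 = [4, 4, 4, 4, 4, 50]
set2 = [4, 8, 15, 24, 39, 50]

range1 = value_range(set1)
range2 = value_range(set2)
print(range1, range2)   # 46 46
```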
31
1.2.2.2 Interquartile Range
Quartiles
It splits a set of ranked data into four equal parts.
Q1 : the middle between the smallest observation
and the median
Q2 : median
Q3 : the middle between the largest observation
and the median
[Diagram: Q1, Q2 and Q3 divide the ranked data into four parts, each containing 25% of the observations.]
32
$Q_1$ = the $\left(\frac{n+1}{4}\right)$th ranked value
$Q_3$ = the $\left(\frac{3(n+1)}{4}\right)$th ranked value
Remark: Note that there are different ways to find quartiles.
In general, the pth q-tile is
$Q_p$ = the $\left(\frac{p(n+1)}{q}\right)$th ranked value
If q = 4, then this is called a quartile.
If q = 10, then this is called a decile.
If q = 100, then this is called a percentile.
33
Some rules to compute the pth q-tile
Here the ranked position means the value of p(n + 1)/q.
➢ If the ranked position is an integer, the q-tile is simply equal to
the measurement at that ranked position.
e.g., q = 4, n = 7: the position for Q1 is (7 + 1)/4 = 2 → the 2nd ranked value
➢ If the ranked position is a fractional half (e.g.: 2.5, 3.5, 4.5, etc.),
the q-tile is equal to the average of the two ranked
values involved.
e.g., q = 4, n = 9: the position for Q1 is 2.5 → the average of
the 2nd and the 3rd ranked values
34
➢ If the ranked position p(n + 1)/q is neither an integer nor a fractional
half, we round it to the nearest integer
and the q-tile is equal to the measurement
at that ranked position.
e.g., q = 4, n = 10: the position for Q1 is 2.75 → the 3rd
ranked value
35
Interquartile Range
➢ Difference between the 1st quartile and
the 3rd quartile in the data set.
➢ It measures the spread in the middle 50%
of the data.
➢ Not influenced by extreme values.
Interquartile range IQR = Q3 – Q1
36
Example 1.9
The readings of diastolic blood pressure (mm
Hg) of 16 randomly selected males are:
66 70 74 75 79 81 81 82
85 91 91 93 95 99 99 100
(a) Find the mean, mode, median, Q1 and Q3.
(b) Find also the 10th and 85th percentile of
the readings.
37
(Solution)
(a) mean = 85.0625; mode = 81, 91, and 99
median = $\frac{82 + 85}{2} = 83.5$
$Q_1$ = $\left(\frac{16+1}{4}\right)$th ranked value = 4.25th ranked value → 4th ranked value = 75
$Q_3$ = $\left(\frac{3(16+1)}{4}\right)$th ranked value = 12.75th ranked value → 13th ranked value = 95
Interquartile range IQR = Q3 – Q1 = 95 – 75 = 20
38
(b) Find also the 10th and 85th percentile of
the readings.
Recall: $Q_p$ = the $\left(\frac{p(n+1)}{q}\right)$th ranked value
$Q_{10}$ = $\left(\frac{10(16+1)}{100}\right)$th ranked value = 1.7th ranked value → 2nd ranked value = 70
$Q_{85}$ = $\left(\frac{85(16+1)}{100}\right)$th ranked value = 14.45th ranked value → 14th ranked value = 99
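The ranked-position rules used in Example 1.9 can be sketched in code. The function name `qtile` and the choice to raise an error when the position falls outside the data are assumptions for illustration. Other textbooks and software use different quartile conventions and may give slightly different answers.

```python
from fractions import Fraction

def qtile(values, p, q):
    """p-th q-tile using the ranked position p(n + 1)/q and the rounding rules above."""
    ordered = sorted(values)
    n = len(ordered)
    position = Fraction(p * (n + 1), q)        # exact 1-based ranked position
    if position < 1 or position > n:
        raise ValueError("ranked position falls outside the data")
    whole = position.numerator // position.denominator
    fraction = position - whole
    if fraction == 0:                          # integer position
        return ordered[whole - 1]
    if fraction == Fraction(1, 2):             # fractional half: average two neighbours
        return (ordered[whole - 1] + ordered[whole]) / 2
    # otherwise round the position to the nearest integer
    nearest = whole + 1 if fraction > Fraction(1, 2) else whole
    return ordered[nearest - 1]

readings = [66, 70, 74, 75, 79, 81, 81, 82, 85, 91, 91, 93, 95, 99, 99, 100]

q1 = qtile(readings, 1, 4)       # 75
q3 = qtile(readings, 3, 4)       # 95
iqr = q3 - q1                    # 20
p10 = qtile(readings, 10, 100)   # 70
p85 = qtile(readings, 85, 100)   # 99
```

Using `Fraction` keeps the position exact, so the test for a fractional half does not depend on floating-point rounding.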
39
1.2.2.3 Variance and Standard Deviation
(Ungrouped data)
They are used to measure variability.
Population Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)$
Sample Standard Deviation: $s = \sqrt{s^2}$
40
Characteristics of the Range, Variance
and Standard Deviation
➢ The greater (smaller) the spread or
dispersion of the data, the larger (smaller)
the range, variance and standard deviation.
➢ If the observations are all the same, then
there is no variation in the data. Thus, the
range, variance and standard deviation
must be equal to zero.
➢ All these measures are non-negative.
41
Example 1.10
Consider a set of sample data:
70 74 75 79 81 81 82
Find the variance and standard deviation.
(Solution)
$s^2 = 20.2857$
$s = \sqrt{20.2857} = 4.5040$
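The figures in Example 1.10 can be checked with a minimal sketch that evaluates both forms of the sample variance (the definition and the computational shortcut). Variable names are illustrative.

```python
from math import sqrt

data = [70, 74, 75, 79, 81, 81, 82]   # sample from Example 1.10
n = len(data)
x_bar = sum(data) / n

# Definition: sum of squared deviations divided by n - 1
s2_definition = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Shortcut: (sum of x_i^2 - n * x_bar^2) divided by n - 1
s2_shortcut = (sum(x * x for x in data) - n * x_bar ** 2) / (n - 1)

s = sqrt(s2_definition)
print(round(s2_definition, 4), round(s, 4))   # 20.2857 4.504
```

The two forms agree up to floating-point rounding. The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.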
42
For Grouped data
Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n-1} \left( \sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2 \right)$
where $n = \sum_{i=1}^{k} f_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i$
43
Example 1.11
The daily overflow call data for the last 50 business
days are as follows:
Daily Overflow Calls   1-15   16-30   31-45   46-60   61-75
Frequency              14     21      8       4       3
Find the mean, variance, and standard deviation
of the sample.
(Solution)
Daily Overflow Calls   1-15   16-30   31-45   46-60   61-75
Class Mark (x)         8      23      38      53      68
Frequency (f)          14     21      8       4       3
44
$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{14 \times 8 + 21 \times 23 + \cdots + 3 \times 68}{14 + 21 + \cdots + 3} = 26.3$
$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2 \right) = \frac{1}{50-1} \left( 48665 - 50 \times 26.3^2 \right) = 287.3571$
$s = 16.9516$
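The grouped-data calculation of Example 1.11 can be reproduced with a short sketch. Variable names are illustrative.

```python
from math import sqrt

class_marks = [8, 23, 38, 53, 68]   # mid-values of the classes 1-15, ..., 61-75
frequencies = [14, 21, 8, 4, 3]     # number of business days in each class

n = sum(frequencies)                                              # 50
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))     # 1315
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, class_marks))  # 48665

x_bar = sum_fx / n
s2 = (sum_fx2 - n * x_bar ** 2) / (n - 1)
s = sqrt(s2)

print(round(x_bar, 1), round(s2, 4), round(s, 4))   # 26.3 287.3571 16.9516
```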
45
1.3 Covariance and Correlation
Coefficient
1.3.1 Measures of Linear Relationship
-- Two variables x and y have a linear relationship if
y = mx + c.
-- Direction and strength of the linear relationship
between two variables.
-- Covariance, coefficient of correlation and
coefficient of determination.
46
1.3.2 Covariance
Let X and Y be two variables and the corresponding
sample data are $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$,
respectively.
Population Covariance: $\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$
Sample Covariance: $s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i - n\bar{x}\,\bar{y} \right)$
where $\mu_x$ and $\mu_y$ are the population means of X and Y,
respectively.
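A minimal sketch of the sample covariance, computed both from the definition and from the shortcut form. The five (x, y) pairs are hypothetical data chosen only to keep the arithmetic easy to follow.

```python
def sample_covariance(xs, ys):
    """Sample covariance from the definition: sum of deviation products / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_covariance_shortcut(xs, ys):
    """Shortcut form: (sum of x_i * y_i - n * x_bar * y_bar) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (n - 1)

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]   # hypothetical data

cov_definition = sample_covariance(x, y)
cov_shortcut = sample_covariance_shortcut(x, y)
print(cov_definition, cov_shortcut)   # 1.5 1.5
```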
47
To illustrate how covariance measures linear
relationship, consider the following 3 sample
data sets.
-- As x increases, y increases.
-- $(x_i - \bar{x})(y_i - \bar{y}) > 0$ and thus $s_{xy} > 0$
-- When X and Y move in the same direction (both
increase or both decrease), the covariance will be a
large positive number.
48
-- As x increases, y decreases.
-- $(x_i - \bar{x})(y_i - \bar{y}) < 0$ and thus $s_{xy} < 0$
-- When X and Y move in the opposite direction, the
covariance will be a large negative number.
49
-- As x increases, y does not exhibit any particular
pattern.
-- When there is no particular pattern, the
covariance is a small number.
50
Two pieces of information:
(1) The sign of the covariance tells us the nature of the
relationship (i.e., positive linear relationship or
negative linear relationship).
(2) The magnitude describes the strength of the
association between X and Y. The larger the covariance,
the stronger the linear relationship.
However, how large should the covariance be before we can
say that the two variables have a strong linear relationship?
The covariance depends on the units of X and Y, so
we need another measure called the Coefficient of Correlation.
51
1.3.3 Coefficient of Correlation
Population Correlation: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Sample Correlation: $r = \frac{s_{xy}}{s_x s_y}$
where $\sigma_x$ and $\sigma_y$ are the population SD of X and Y,
respectively; $s_x$ and $s_y$ are the sample SD of X and Y,
respectively.
In addition, we have
$-1 \le \rho \le +1$
and
$-1 \le r \le +1$
52
Drawback:
Hard to interpret the correlation. E.g.: r = 0.3, we
can only say that the linear relationship is weak.
→ Introduce another measure of the strength of linear
relationship: Coefficient of Determination.
53
Example 1.12
Calculate the coefficient of correlation for the
three sets of data on pages 48-50 of the lecture
notes.
(Solution)
Set 1: $r = \frac{s_{xy}}{s_x s_y} = \frac{17.5}{(2.6458)(7)} = 0.9449$
Set 2: $r = \frac{s_{xy}}{s_x s_y} = \frac{-17.5}{(2.6458)(7)} = -0.9449$
Set 3: $r = \frac{s_{xy}}{s_x s_y} = \frac{-3.5}{(2.6458)(7)} = -0.1890$
Remark: Find the correlation using a calculator.
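The correlation calculation can be sketched in code. The three data sets themselves are not reproduced in this text, so the values below are assumed, illustrative data chosen to match the summary figures quoted in the solution (s_x = 2.6458, s_y = 7, and s_xy = 17.5, -17.5 and -3.5). They should not be read as the original data.

```python
from math import sqrt

def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_sd(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return sqrt(sample_cov(xs, xs))

def correlation(xs, ys):
    """Sample coefficient of correlation r = s_xy / (s_x * s_y)."""
    return sample_cov(xs, ys) / (sample_sd(xs) * sample_sd(ys))

# Assumed illustrative data (not taken from the lecture notes).
x = [2, 6, 7]
y_sets = {
    "Set 1": [13, 20, 27],   # y rises with x
    "Set 2": [27, 20, 13],   # y falls as x rises
    "Set 3": [20, 27, 13],   # no clear pattern
}

results = {name: round(correlation(x, y), 4) for name, y in y_sets.items()}
print(results)   # {'Set 1': 0.9449, 'Set 2': -0.9449, 'Set 3': -0.189}
```

Unlike the covariance, r does not change if X or Y is rescaled, which is why its magnitude can be compared across data sets.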
54
1.3.4 Coefficient of Determination
Coefficient of Determination: $R^2 = r^2$
It measures the amount of variation in the dependent
variable Y that is explained by the variation in the
independent variable X in the linear equation.
(1) If $r = 1$, then $R^2 = 1$
→ 100% of the variation in Y is explained by the
variation in X.
(2) If $r = 0$, then $R^2 = 0$
→ No linear pattern → None of the variation in Y is
explained by the variation in X.
55
In Example 1.12 Set 1:
$r = 0.9449$ → $R^2 = 0.8928$
89.28% of the variation in Y is explained by the variation
in X. The remaining 10.72% is unexplained.
Remark: The sample covariance, correlation and
determination will be discussed in more detail in other
AMS modules.
56
1.4 Graphical Representations and
Comparison of Data Sets
1.4.1 Data Organization
-- Raw data
-- Data visualization (categorical or numerical)
-- Tabulation and graphical representations
57
1.4.2 Organizing Categorical Data
1.4.2.1 Summary Table
-- Represent number of responses as frequencies or
percentages for each category.
-- Helps to identify the differences among categories
by displaying the frequency, amount, or percentage of
items in a set of categories in separate columns.
58
1.4.2.2 Contingency Table
-- Study patterns that may exist between the responses
of two or more categorical variables.
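A summary table and a contingency table can both be sketched with `collections.Counter`. The six survey records below are hypothetical and serve only to show the mechanics.

```python
from collections import Counter

# Hypothetical survey records: (gender, has a Facebook profile?)
records = [
    ("Male", "Yes"), ("Female", "Yes"), ("Female", "No"),
    ("Male", "No"), ("Female", "Yes"), ("Male", "Yes"),
]

# Summary table: frequency and percentage for each category of one variable.
profile_counts = Counter(answer for _, answer in records)
summary_table = {
    category: (count, 100 * count / len(records))
    for category, count in profile_counts.items()
}

# Contingency table: joint counts for two categorical variables.
contingency_table = Counter(records)

for category, (count, percent) in summary_table.items():
    print(f"{category}: {count} ({percent:.1f}%)")
for (gender, answer), count in sorted(contingency_table.items()):
    print(f"{gender} / {answer}: {count}")
```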
59
1.4.3 Visualizing Categorical Data
1.4.3.1 Bar Chart and Pie Chart
60
1.4.4 Organizing Numerical Data
1.4.4.1 Ordered Array
It arranges the values of a numerical
variable in a ranked order, from the smallest
value to the largest value.
For instance, an ordered array of number of
members in 10 households: 2, 2, 2, 6, 3, 4, 2, 5,
3, 7 is 2, 2, 2, 2, 3, 3, 4, 5, 6, 7.
61
1.4.4.2 Frequency Distribution
-- counts the number of numerical observations that
fall into each of a series of intervals, called
Classes .
-- In general, the number of classes should be more than 4 and fewer than 16.
Too few or too many classes provide little information.
Interval width = (Largest value − Smallest value) / Number of classes
62
Example 1.13
Construct frequency distributions (with interval
width 10) of the following ordered arrays,
which show the cost per person at 50 city
restaurants and 50 suburban restaurants.
63
(Solution)
The smallest and largest values for the ordered arrays
are 21 and 79, respectively. We use 20 and 80 as the
smallest and largest values for convenience.
64
When comparing two or more classes, the proportion or
percentage for each class is more useful and
meaningful than the frequency count of each class.
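A frequency distribution with class width 10 between 20 and 80, together with the percentage in each class, can be sketched as follows. The ten cost values are hypothetical, since the restaurant data are not reproduced in this text. The convention that each class includes its lower limit but not its upper limit is also an assumption.

```python
def frequency_distribution(values, lower, upper, width):
    """Count integer values in classes [lower, lower + width), ..., up to upper."""
    n_classes = (upper - lower) // width
    counts = [0] * n_classes
    for value in values:
        if not lower <= value < upper:
            raise ValueError(f"{value} lies outside [{lower}, {upper})")
        counts[(value - lower) // width] += 1
    total = len(values)
    # Each row: (class lower limit, class upper limit, frequency, percentage)
    return [
        (lower + i * width, lower + (i + 1) * width, count, 100 * count / total)
        for i, count in enumerate(counts)
    ]

costs = [21, 25, 33, 38, 39, 42, 47, 55, 61, 79]   # hypothetical costs per person
table = frequency_distribution(costs, 20, 80, 10)

for low, high, count, percent in table:
    print(f"{low} but less than {high}: {count} ({percent:.0f}%)")
```

Percentages make two distributions comparable even when the groups contain different numbers of observations.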
65
1.4.5 Visualizing Numerical Data
1.4.5.1 Stem-and-Leaf Plot
It (i) shows the range and distribution of the
data.
(ii) identifies outliers/unusual observations.
66
Example 1.14
The numbers of hours spent on the internet weekly by
10 students are: 24, 26, 24, 21, 27, 27, 30, 41, 32,
38. Construct a stem-and-leaf plot for the data set.
(Solution)
Data in ordered array from smallest to largest:
21 24 24 26 27 27 30 32 38 41
Stem (in 10)   Leaf (in 1)
2              1 4 4 6 7 7
3              0 2 8
4              1
Remark: Stem may have as many digits as needed but leaf
should contain only a single digit.
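The construction in Example 1.14 can be sketched in a few lines, assuming non-negative whole-number data with the tens as the stem and the units as the leaf. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted non-negative integers by stem (tens) and list their leaves (units)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

hours = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]   # data from Example 1.14
plot = stem_and_leaf(hours)

for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 2 | 1 4 4 6 7 7
# 3 | 0 2 8
# 4 | 1
```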
67
Example 1.15
The weights (in kg) of 26 girls and 34 boys of BSc-DSBI
year 1 students of HSUHK are recorded and plotted
into a so-called half-stem-and-leaf plot.
68
We call it a bimodal distribution as it has
two peaks.
69
Example 1.16
Based on the data in Example 1.15, we can compare the weights of girls
and boys using a back-to-back half-stem-and-leaf plot.
Weight of girls (in kg) | Stem | Weight of boys (in kg)
                    4 2 |  4   |
                8 6 5 5 |  4   |
      4 3 3 3 2 2 1 0 0 |  5   | 2
              7 7 6 6 5 |  5   | 5 6
                  3 1 0 |  6   | 0 1 2 2 3 3 4
                    9 8 |  6   | 5 5 5 6 6 6 7 7 8 9
                      0 |  7   | 0 0 1 3 4
                        |  7   | 5 5 6 7
                        |  8   | 0 2 4
                        |  8   | 6
                        |  9   | 1
70
1.4.5.2 Histogram and Polygon
Histogram:
is a bar chart for grouped numerical data
where vertical bars are used to represent
frequencies or percentages in each group.
Polygon (or percentage polygon):
uses the midpoints for all class intervals and
then link these midpoints to form a line.
71
Example 1.17
According to the ordered arrays and frequency
distributions of cost per person at 50 city
restaurants and 50 suburban restaurants in
Example 1.13.
72
Shapes of Histogram
(1) A histogram is said to be symmetric if, when we draw a
vertical line down the center of the histogram, the
two sides are identical in shape and size.
(2) A histogram is said to be positively/negatively
skewed if it has a long tail extending to the
right/left.
73
Example 1.18
Suppose the weights (in kg) of 100 students are tabulated
in the following table. Construct a histogram, frequency
polygon and cumulative frequency polygon/curve.
Weight (kg)   Class boundary   Class Mark   Frequency   Cumulative frequency
30-34         29.5 – 34.5      32           2           2
35-39         34.5 – 39.5      37           8           10
40-44         39.5 – 44.5      42           15          25
45-49         44.5 – 49.5      47           30          55
50-54         49.5 – 54.5      52           23          78
55-59         54.5 – 59.5      57           16          94
60-64         59.5 – 64.5      62           6           100
Total                                       100
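The class boundaries, class marks and cumulative frequencies in the table can be derived from the class limits and frequencies, as in this minimal sketch. Variable names are illustrative, and the half-unit adjustment assumes weights recorded to the nearest kilogram.

```python
from itertools import accumulate

class_limits = [(30, 34), (35, 39), (40, 44), (45, 49), (50, 54), (55, 59), (60, 64)]
frequencies = [2, 8, 15, 30, 23, 16, 6]

# Class boundaries close the gaps between adjacent classes (e.g. 34 and 35 meet at 34.5).
class_boundaries = [(low - 0.5, high + 0.5) for low, high in class_limits]

# Class mark = mid-value of each class.
class_marks = [(low + high) / 2 for low, high in class_limits]

# Running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Points of the cumulative frequency polygon: (upper class boundary, cumulative frequency).
polygon_points = [(upper, total) for (_, upper), total in zip(class_boundaries, cumulative)]

print(cumulative)   # [2, 10, 25, 55, 78, 94, 100]
```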
74
[Figure: Histogram of the weights (in kg) of 100 students. Horizontal axis: Weight (kg).]
Remark: We use class boundary and frequency to
construct the histogram.
75
[Figure: Frequency polygon of the weights (in kg) of 100 students. Horizontal axis: Weight (in kg); vertical axis: Frequency.]
Remark: We use class mark and frequency to construct
the frequency polygon.
76
[Figure: Cumulative frequency polygon of the weights (in kg) of 100 students, with Q1, the median, Q3 and the IQR marked.]
77
Remarks:
-- We use the upper class boundaries to plot
the points.
-- Polygon means using a straight line to join
the points, whereas curve means using a
smooth curve to join the points.
-- Note that Q1, median and Q3 can be read
directly from the cumulative frequency
polygon/curve.
78
1.4.5.3 Scatter Plot
A scatter plot is used to examine a possible relationship
between two numerical variables.
Example 1.19
The volume per day and cost per day of a store on 11 days are as
follows. Draw a scatter diagram of the volumes and
costs.
Volume per day:
23 24 26 29 33 38 41 42 50 55 60
Cost per day:
131 120 140 151 160 167 185 170 188 195 200
79
Scatter plot for the volume and cost per day
80
1.4.5.4 Box-and-Whisker Plot
A Box-and-Whisker Plot (or simply Boxplot) is a graphical
display of the Five-Number Summary and provides a way to
determine the shape of a distribution. The five
numbers of a data set are
Minimum, Q1, Median, Q3, Maximum
81
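[Diagram: Box and Whisker Plot. The box runs from Q1 to Q3 (middle 50% of data) with Q2 = Median marked inside. The lower whisker extends to L (min. value) and covers the lower 25%; the upper whisker extends to H (max. value) and covers the upper 25%.]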
(i) The median inside the bar shows the location
of the center of the data.
(ii) The length of the box shows the spread of the
middle half of the data.
(iii)The lengths of the whiskers show the spread
of the lower and upper quarters of the data.
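The five numbers behind a boxplot can be computed with a short sketch. It uses the ranked-position rule p(n + 1)/4 from Section 1.2.2.2 and the blood-pressure readings of Example 1.9. The function names are illustrative, and the sketch assumes at least two observations.

```python
from fractions import Fraction

def ranked_value(ordered, position):
    """Value at a 1-based ranked position: integer, fractional half, or rounded."""
    whole = position.numerator // position.denominator
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    if fraction == Fraction(1, 2):
        return (ordered[whole - 1] + ordered[whole]) / 2
    # round the position down or up to the nearest integer
    return ordered[whole - 1] if fraction < Fraction(1, 2) else ordered[whole]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    q1, q2, q3 = (ranked_value(ordered, Fraction(p * (n + 1), 4)) for p in (1, 2, 3))
    return ordered[0], q1, q2, q3, ordered[-1]

readings = [66, 70, 74, 75, 79, 81, 81, 82, 85, 91, 91, 93, 95, 99, 99, 100]
summary = five_number_summary(readings)
box_length = summary[3] - summary[1]         # spread of the middle half (IQR)
lower_whisker = summary[1] - summary[0]      # spread of the lower quarter
upper_whisker = summary[4] - summary[3]      # spread of the upper quarter

print(summary)   # (66, 75, 83.5, 95, 100)
```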
82
End of Chapter 1
83