A Brief Review of Some Important Statistical Concepts

Lecture 3
The Meaning of a Variable

A variable refers to any quantity that may take on more than one value
• Population is a variable because it is not fixed or constant; it
changes over time
• The unemployment rate is a variable because it may take on any
value from 0 to 100%

A random variable can be thought of as an unknown value that may
change every time it is inspected.

A random variable may be either discrete or continuous
• A variable is discrete if its possible values have jumps or breaks
  - Population: measured in integers or whole units: 1, 2, 3, …
• A variable is continuous if there are no jumps or breaks
  - Unemployment rate: need not be measured in whole units:
    1.77, …, 8.99, …
Descriptive Statistics

Descriptive statistics are used to describe the main features of a
collection of data in quantitative terms.
• Descriptive statistics aim to quantitatively summarize a data set

Some statistical summaries are especially common in descriptive
analyses. For example:
• Frequency Distribution
• Central Tendency
• Dispersion
• Association
Frequency Distribution



Every set of data can be described in terms of how frequently certain
values occur.
In statistics, a frequency distribution is a tabulation of the values that
one or more variables take in a sample.
Consider the hypothetical prices of Dec CME Live Cattle Futures:

Month        Price (cents/lb)
May          67.05
June         66.89
July         67.45
August       68.39
September    67.45
October      70.10
November     68.39
Frequency Distribution

Univariate frequency distributions are often presented as lists
ordered by quantity showing the number of times each value appears.

A frequency distribution may be grouped or ungrouped
• For a small number of observations: ungrouped frequency distribution
• For a large number of observations: grouped frequency distribution

Ungrouped                   Grouped
Price (X)    Frequency      Price (X)      Frequency
66.89        1              65.00-66.99    1
67.05        1              67.00-68.99    5
67.45        2              69.00-70.99    1
68.39        2              71.00-72.99    0
70.10        1              73.00-74.99    0
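A minimal Python sketch of how an ungrouped and a grouped frequency distribution could be tabulated from the seven hypothetical prices. The helper name class_label and its start and width parameters are assumptions made for illustration; the classes are 2 cents wide and begin at 65.00.

```python
from collections import Counter

# Hypothetical Dec CME Live Cattle futures prices (cents/lb), May to November.
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

# Ungrouped: count each distinct price and list the values in ascending order.
ungrouped = sorted(Counter(prices).items())


def class_label(price, start=65.0, width=2.0):
    """Return the label of the class interval that contains price."""
    k = int((price - start) // width)      # index of the class
    low = start + k * width                # lower class limit
    return f"{low:.2f}-{low + width - 0.01:.2f}"


# Grouped: count how many prices fall into each class interval.
grouped = Counter(class_label(p) for p in prices)

for value, count in ungrouped:
    print(f"{value:.2f}  {count}")
for label in sorted(grouped):
    print(label, grouped[label])
```

Classes that contain no observation (such as 71.00-72.99) simply do not appear in the Counter; their frequency is zero.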
Central Tendency

In statistics, the term central tendency relates to the way in which
quantitative data tend to cluster around a “central value”.

A measure of central tendency is any of a number of ways of
specifying this "central value."

There are three important descriptive statistics that give
measures of the central tendency of a variable:
• The Mean
• The Median
• The Mode
The Mean




The arithmetic mean is the most commonly used type of average and
is often referred to simply as the average.
In mathematics and statistics, the arithmetic mean (or simply the
mean) of a list of numbers is the sum of all numbers in the list divided
by the number of items in the list.
• If the list is a statistical population, then the mean of that population
is called a population mean.
• If the list is a statistical sample, we call the resulting statistic a
sample mean.
If we denote a set of data by X = (x1, x2, ..., xn), then the sample mean
is typically denoted with a horizontal bar over the variable ($\bar{X}$,
pronounced "x bar").
The Greek letter μ is used to denote the arithmetic mean of an entire
population.
The Sample Mean

In mathematical notation, the sample mean of a set of data denoted as
X = (x1, x2, ..., xn) is given by

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\,(X_1 + X_2 + \dots + X_n)$$

To calculate the mean, all of the observations (values) of X are added
and the result is divided by the number of observations (n).
In the previous example, the mean price of the Dec CME Live Cattle futures
contract is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{7}\,(67.05 + 66.89 + \dots + 68.39) = 67.96$$
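The same calculation as a short Python sketch, using the seven hypothetical prices from the example (the variable names are illustrative):

```python
# Hypothetical Dec CME Live Cattle futures prices (cents/lb).
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

n = len(prices)          # number of observations
mean = sum(prices) / n   # add all observations, divide by n

print(round(mean, 2))    # 67.96
```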
The Median

In statistics, a median is described as the numeric value separating
the higher half of a sample or population from the lower half.

The median of a finite list of numbers can be found by arranging all
the observations from lowest value to highest value and picking the
middle one.

If there is an even number of observations, then there is no single
middle value, so one often takes the mean of the two middle values.

Organize the price data in the previous example in ascending order:
66.89, 67.05, 67.45, 67.45, 68.39, 68.39, 70.10

The median of this price series is 67.45, the fourth of the seven
ordered values
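A small Python sketch of the procedure described above: sort, then pick the middle value, or average the two middle values when the count is even. The function name median and the four-value list are illustrative assumptions.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical Dec CME Live Cattle futures prices (cents/lb).
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

print(median(prices))        # 67.45 (odd number of observations)
print(median([1, 2, 3, 4]))  # 2.5   (even number: mean of 2 and 3)
```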
The Mode

In statistics, the mode is the value that occurs the most frequently in
a data set.

The mode is not necessarily unique, since the same maximum
frequency may be attained at different values.

Organize the price data in the previous example in ascending order:
66.89, 67.05, 67.45, 67.45, 68.39, 68.39, 70.10

There are two modes in the given price data: 67.45 and 68.39

Thus the mode of the sample data is not unique

The sample price dataset may be said to be bimodal

A population or sample data may be unimodal, bimodal, or
multimodal
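Because the mode need not be unique, a helper that returns every most-frequent value is convenient. A minimal sketch using statistics.multimode from the Python standard library (available from Python 3.8):

```python
import statistics

# Hypothetical Dec CME Live Cattle futures prices (cents/lb).
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

# multimode returns all values that share the highest frequency.
modes = statistics.multimode(prices)

print(sorted(modes))   # [67.45, 68.39]
print(len(modes))      # 2, so the sample is bimodal
```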
Statistical Dispersion


In statistics, statistical dispersion (also called statistical variability
or variation) is the variability or spread in a variable or probability
distribution.
In particular, a measure of dispersion is a statistic (formula) that
indicates how dispersed (i.e., spread out) the values of a given variable are

Common measures of statistical dispersion are
• The Variance, and
• The Standard Deviation

Dispersion is contrasted with location or central tendency, and
together they are the most used properties of distributions
The Variance

In statistics, the variance of a random variable or distribution is the
expected (mean) value of the square of the deviation of that variable
from its expected value or mean.

Thus the variance is a measure of the amount of variation within the
values of that variable, taking account of all possible values and their
probabilities.

If a random variable X has the expected (mean) value E[X] = μ, then
the variance of X can be given by:

$$\mathrm{Var}(X) = E[(X - \mu)^2] = \sigma_x^2$$
The Variance

The above definition of variance encompasses random variables that
are discrete or continuous. It can be expanded as follows:

$$
\begin{aligned}
\mathrm{Var}(X) &= E[(X - \mu)^2] \\
&= E[X^2 - 2\mu X + \mu^2] \\
&= E[X^2] - 2\mu E[X] + \mu^2 \\
&= E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - \mu^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}
$$
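As a quick numerical check of the expansion, the sketch below evaluates both forms for a discrete random variable. The fair six-sided die is an assumption chosen for illustration (it does not appear in the lecture); exact fractions are used so the two results can be compared without rounding.

```python
from fractions import Fraction

# Assumed example: X is the outcome of a fair six-sided die.
outcomes = range(1, 7)
p = Fraction(1, 6)                       # probability of each outcome

mu = sum(p * x for x in outcomes)        # E[X]

# Definition: Var(X) = E[(X - mu)^2]
var_definition = sum(p * (x - mu) ** 2 for x in outcomes)

# Expanded form: Var(X) = E[X^2] - (E[X])^2
var_expanded = sum(p * x ** 2 for x in outcomes) - mu ** 2

print(mu, var_definition, var_expanded)  # 7/2 35/12 35/12
```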
The Variance: Properties

Variance is non-negative because it is an average of squared
deviations, which are positive or zero.

The variance of a constant a is zero, and the variance of a variable
in a data set is 0 if and only if all entries have the same value.

$$\mathrm{Var}(a) = 0$$

Variance is invariant with respect to changes in a location
parameter. That is, if a constant is added to all values of the
variable, the variance is unchanged.

$$\mathrm{Var}(X + a) = \mathrm{Var}(X)$$

If all values are scaled by a constant, the variance is scaled by the
square of that constant.

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$
$$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$$
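These properties can be checked numerically on a data set. The sketch below uses the hypothetical price series and two arbitrary constants (a = 3, b = 10, chosen only for illustration); statistics.variance is the standard-library sample variance, and the same identities hold for the population version.

```python
import math
import statistics

# Hypothetical Dec CME Live Cattle futures prices (cents/lb).
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
a, b = 3.0, 10.0   # arbitrary constants, assumed for illustration

var_x = statistics.variance(prices)
var_const = statistics.variance([a] * len(prices))          # all entries equal
var_shift = statistics.variance([x + b for x in prices])    # X + b
var_scale = statistics.variance([a * x for x in prices])    # aX
var_affine = statistics.variance([a * x + b for x in prices])

print(var_const == 0)                            # variance of a constant
print(math.isclose(var_shift, var_x))            # shifting changes nothing
print(math.isclose(var_scale, a ** 2 * var_x))   # scaling multiplies by a^2
print(math.isclose(var_affine, a ** 2 * var_x))  # shift and scale combined
```

Comparisons use math.isclose because floating-point arithmetic introduces tiny rounding differences.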
The Sample Variance

If we have a series of n measurements of a random
variable X, written Xi where i = 1, 2, ..., n, then the sample
variance can be used to estimate the population variance
of X. The sample variance is calculated as

$$S_x^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} = \frac{1}{n-1}\left[(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2\right]$$

The Sample Variance

The denominator, (n − 1), is known as the degrees of freedom in
calculating $S_x^2$. Intuitively, once $\bar{X}$ is known, only n − 1
observation values are free to vary; one is predetermined by $\bar{X}$.

When n = 1, the deviation of the single observation from its own mean
is zero regardless of the true variance. Dividing by n would therefore
understate the variance; this bias needs to be corrected for, and it
matters most when n is small.

$$S_x^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} = \frac{1}{n-1}\left[(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2\right]$$

The Sample Variance

For the hypothetical price data for the Dec CME Live Cattle futures
contract, 67.05, 66.89, 67.45, 67.45, 68.39, 68.39, 70.10, the sample
variance can be calculated as

$$S_x^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} = \frac{1}{7-1}\left[(67.05 - 67.96)^2 + \dots + (70.10 - 67.96)^2\right] = 1.24$$

The Standard Deviation


In statistics, the standard deviation of a random variable
or distribution is the square root of its variance.
If a random variable X has the expected value (mean)
E[X] = μ, then the standard deviation of X can be given by:

$$\sigma_x = \sqrt{\sigma_x^2} = \sqrt{E[(X - \mu)^2]}$$

That is, the standard deviation σ (sigma) is the square root
of the average value of (X − μ)².
The Standard Deviation

If we have a series of n measurements of a random
variable X, written Xi where i = 1, 2, ..., n, then the sample
standard deviation can be used to estimate the
population standard deviation of X. The
sample standard deviation is calculated as

$$S_x = \sqrt{S_x^2} = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$$

For the hypothetical price data, $S_x = \sqrt{1.24} = 1.114$.
The Mean Absolute Deviation

The mean or average deviation of X from its mean,

$$\frac{\sum d_i}{n} = \frac{\sum (X_i - \bar{X})}{n},$$

is always zero. The positive and negative deviations cancel out in the
summation, which makes it a useless measure of dispersion.

The mean absolute deviation (MAD), calculated by

$$\frac{\sum |d_i|}{n} = \frac{\sum |X_i - \bar{X}|}{n},$$

solves the "canceling out" problem.
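Both quantities can be illustrated with the hypothetical price data. In the sketch below the plain mean deviation comes out as zero (up to floating-point rounding), while the mean absolute deviation does not.

```python
# Hypothetical Dec CME Live Cattle futures prices (cents/lb).
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

n = len(prices)
mean = sum(prices) / n
deviations = [x - mean for x in prices]

# Positive and negative deviations cancel out.
mean_deviation = sum(deviations) / n

# Taking absolute values removes the cancellation.
mad = sum(abs(d) for d in deviations) / n

print(abs(round(mean_deviation, 6)))   # 0.0
print(round(mad, 2))                   # 0.86
```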
The MSD and RMSD

An alternative way to address the canceling out problem is to
square the deviations from the mean to obtain the mean squared
deviation (MSD):

$$MSD = \frac{\sum d_i^2}{n} = \frac{\sum (X_i - \bar{X})^2}{n}$$

Squaring leaves the MSD in squared units of X. This can be undone by
taking the square root of the MSD to obtain the root mean squared
deviation (RMSD):

$$RMSD = \sqrt{MSD} = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}}$$
RMSD vs. Standard Deviation


When calculating the RMSD, the squaring of the deviations gives a
greater importance to the deviations that are larger in absolute value,
which may or may not be desirable.
For statistical reasons, it turns out that a slight variation of the RMSD,
known as the standard deviation ($S_x$), is more desirable as a measure
of dispersion.

$$RMSD = \sqrt{MSD} = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}}$$

$$S_x = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$$
Variance vs. MSD
Standard Deviation vs. RMSD

Price (X)   Mean     (Xi−Mean)   |Xi−Mean|   |Xi−Mean|²
67.05       67.96    -0.91       0.91        0.83
66.89       67.96    -1.07       1.07        1.14
67.45       67.96    -0.51       0.51        0.26
68.39       67.96     0.43       0.43        0.18
67.45       67.96    -0.51       0.51        0.26
70.10       67.96     2.14       2.14        4.58
68.39       67.96     0.43       0.43        0.18
Total                 0.00       6.00        7.44

Variance  = 1.24        MAD  = 0.86
Std. Dev. = 1.11        MSD  = 1.06
                        RMSD = 1.03
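The summary figures can be reproduced with the sketch below, which also makes the link between the two pairs of measures explicit: the sample variance equals the MSD multiplied by n/(n − 1), so the standard deviation is always slightly larger than the RMSD. Variable names are illustrative.

```python
import math

# Hypothetical Dec CME Live Cattle futures prices (cents/lb).
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

n = len(prices)
mean = sum(prices) / n
sum_sq = sum((x - mean) ** 2 for x in prices)   # sum of squared deviations

msd = sum_sq / n              # mean squared deviation (divide by n)
rmsd = math.sqrt(msd)         # root mean squared deviation
variance = sum_sq / (n - 1)   # sample variance (divide by n - 1)
std_dev = math.sqrt(variance)

print(round(msd, 2), round(rmsd, 2))          # 1.06 1.03
print(round(variance, 2), round(std_dev, 2))  # 1.24 1.11
print(math.isclose(variance, msd * n / (n - 1)))
```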
Association

Bivariate statistics can be used to examine the degree to
which two variables are related or associated, without
implying that one causes the other

Multivariate statistics can be used to examine the degree to
which multiple variables are related or associated, without
implying that one causes any or some of the others

Two common bivariate and multivariate measures of association are
• Covariance
• Correlation Coefficient
Association: Bivariate Statistics

In Figure 3.3 (a) Y and X are positively but weakly correlated while
in 3.3 (b) they are negatively and strongly correlated
The Covariance

The covariance between two real-valued random variables X and Y,
with means (expected values) E[X] = μ and E[Y] = ν, is

$$
\begin{aligned}
\mathrm{Cov}(X, Y) &= E[(X - E[X])(Y - E[Y])] = E[(X - \mu)(Y - \nu)] \\
&= E[XY - \mu Y - \nu X + \mu\nu] \\
&= E[XY] - \mu E[Y] - \nu E[X] + \mu\nu \\
&= E[XY] - \mu\nu - \mu\nu + \mu\nu \\
&= E[XY] - \mu\nu
\end{aligned}
$$

Cov(X, Y) can be negative, zero, or positive
Random variables whose covariance is zero are called uncorrelated
Covariance

If X and Y are independent, then their covariance is zero. This
follows because under independence,

$$E[XY] = E[X]\,E[Y] = \mu\nu$$

Recalling the final form of the covariance derivation given above,
and substituting, we get

$$\mathrm{Cov}(X, Y) = \mu\nu - \mu\nu = 0$$

The converse, however, is generally not true: Some pairs of random
variables have covariance zero although they are not independent.
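A standard counterexample, sketched in Python: let X take the values −1, 0, 1 with equal probability and let Y = X². This particular pair is an assumption chosen for illustration and does not come from the lecture. Y is completely determined by X, yet the covariance is exactly zero.

```python
from fractions import Fraction

# Assumed example: X is uniform on {-1, 0, 1} and Y = X**2.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)             # E[X]  = 0
e_y = sum(p * x ** 2 for x in support)        # E[Y]  = 2/3
e_xy = sum(p * x * x ** 2 for x in support)   # E[XY] = E[X^3] = 0

cov_xy = e_xy - e_x * e_y                     # Cov(X, Y) = E[XY] - E[X]E[Y]

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = p          # Y = 0 happens exactly when X = 0
p_product = p * p    # P(X=0) = 1/3 and P(Y=0) = 1/3

print(cov_xy)                # 0
print(p_joint == p_product)  # False: X and Y are not independent
```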
The Covariance: Properties

If X and Y are real-valued random variables and a and b are
constants ("constant" in this context means non-random), then the
following facts are a consequence of the definition of covariance:

$$\mathrm{Cov}(X, a) = 0$$
$$\mathrm{Cov}(X, X) = \mathrm{Var}(X)$$
$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$$
$$\mathrm{Cov}(aX, bY) = ab\,\mathrm{Cov}(X, Y)$$
$$\mathrm{Cov}(X + a, Y + b) = \mathrm{Cov}(X, Y)$$
Variance of the Sum of Correlated
Random Variables

If X and Y are real-valued random variables and a and b are
constants ("constant" in this context means non-random), then the
following facts are a consequence of the definition of variance and
covariance:

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$$
$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$$

The variance of a finite sum of uncorrelated random variables is
equal to the sum of their variances.

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$

This is because, if X and Y are uncorrelated, their covariance is 0.
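The identities also hold for sample variances and covariances computed with a common denominator, which allows a quick numerical check. The paired data x and y and the constants a and b below are hypothetical values made up for illustration, as are the helper names.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_cov(u, v):
    """Sample covariance with the n - 1 denominator."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / (len(u) - 1)


def sample_var(v):
    return sample_cov(v, v)   # Cov(X, X) = Var(X)


# Hypothetical paired observations and constants.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = 2.0, -3.0

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
lhs_sum = sample_var([xi + yi for xi, yi in zip(x, y)])
rhs_sum = sample_var(x) + sample_var(y) + 2 * sample_cov(x, y)

# Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
lhs_lin = sample_var([a * xi + b * yi for xi, yi in zip(x, y)])
rhs_lin = (a ** 2 * sample_var(x) + b ** 2 * sample_var(y)
           + 2 * a * b * sample_cov(x, y))

print(lhs_sum, rhs_sum)   # 7.0 7.0
print(lhs_lin, rhs_lin)   # 5.5 5.5
```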
The Sample Covariance

The covariance is one measure of how closely the values taken by
two variables X and Y vary together.

If we have a series of n measurements of X and Y written as Xi and
Yi where i = 1, 2, ..., n, then the sample covariance can be used to
estimate the population covariance between X = (X1, X2, …, Xn) and
Y = (Y1, Y2, …, Yn). The sample covariance is calculated as

$$S_{x,y} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
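A minimal Python sketch of the sample covariance formula. The paired data are hypothetical values made up for illustration (the lecture gives no second series), and sample_cov is an illustrative helper name. The last lines show that rescaling X changes the covariance by the same factor.

```python
def sample_cov(u, v):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v)) / (n - 1)


# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

s_xy = sample_cov(x, y)
print(s_xy)   # 1.5

# Measuring X in different units (multiplying by 100) scales the covariance.
x_scaled = [100 * xi for xi in x]
s_xy_scaled = sample_cov(x_scaled, y)
print(s_xy_scaled)   # 150.0
```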
Correlation Coefficient


A disadvantage of the covariance statistic is that its magnitude
cannot be easily interpreted, since it depends on the units in
which we measure X and Y
The related and more widely used correlation coefficient remedies
this disadvantage by standardizing the deviations from the
mean:

$$\rho_{x,y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\sigma_{X,Y}}{\sigma_X\,\sigma_Y}$$

The correlation coefficient is symmetric, that is

$$\rho_{x,y} = \rho_{y,x}$$
Correlation Coefficient

If we have a series of n measurements of X and Y written as Xi
and Yi, where i = 1, 2, ..., n, then the sample correlation
coefficient can be used to estimate the population correlation
coefficient between X and Y. The sample correlation coefficient
is calculated as

$$r_{x,y} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\,S_x\,S_y}$$
Correlation Coefficient

The value of the correlation coefficient falls between −1 and 1:

$$-1 \le r_{x,y} \le 1$$

• rx,y = 0 => X and Y are uncorrelated
• rx,y = 1 => X and Y are perfectly positively correlated
• rx,y = −1 => X and Y are perfectly negatively correlated
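A minimal Python sketch of the sample correlation coefficient and its limiting cases. The data are hypothetical values made up for illustration, and the helper names are assumptions. Results are compared with math.isclose because floating-point rounding can leave the perfect cases a hair away from exactly ±1.

```python
import math


def sample_cov(u, v):
    """Sample covariance with the n - 1 denominator."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v)) / (n - 1)


def sample_corr(u, v):
    """Sample correlation: covariance divided by both standard deviations."""
    s_u = math.sqrt(sample_cov(u, u))
    s_v = math.sqrt(sample_cov(v, v))
    return sample_cov(u, v) / (s_u * s_v)


# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_xy = sample_corr(x, y)                          # positive, less than 1
r_pos = sample_corr(x, [2 * xi + 1 for xi in x])  # exact increasing line
r_neg = sample_corr(x, [-xi for xi in x])         # exact decreasing line
r_units = sample_corr([100 * xi for xi in x], y)  # X measured in other units

print(round(r_xy, 4))                 # 0.7746
print(math.isclose(r_pos, 1.0))       # True
print(math.isclose(r_neg, -1.0))      # True
print(math.isclose(r_units, r_xy))    # True: unaffected by the change of units
```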