Lecture 3: A Brief Review of Some Important Statistical Concepts

The Meaning of a Variable

A variable refers to any quantity that may take on more than one value. Population is a variable because it is not fixed or constant; it changes over time. The unemployment rate is a variable because it may take on any value from 0 to 100%.
A random variable can be thought of as an unknown value that may change every time it is inspected. A random variable may be either discrete or continuous:
- A variable is discrete if its possible values have jumps or breaks. Population, for example, is measured in integers or whole units: 1, 2, 3, ...
- A variable is continuous if there are no jumps or breaks. The unemployment rate, for example, need not be measured in whole units: 1.77, ..., 8.99, ...

Descriptive Statistics

Descriptive statistics are used to describe the main features of a collection of data in quantitative terms; they aim to summarize a data set quantitatively. Some statistical summaries are especially common in descriptive analyses, for example:
- Frequency distribution
- Central tendency
- Dispersion
- Association

Frequency Distribution

Every set of data can be described in terms of how frequently certain values occur. In statistics, a frequency distribution is a tabulation of the values that one or more variables take in a sample. Consider the hypothetical prices of the Dec CME Live Cattle futures contract:

Month       Price (cents/lb)
May         67.05
June        66.89
July        67.45
August      68.39
September   67.45
October     70.10
November    68.39

Univariate frequency distributions are often presented as lists ordered by quantity, showing the number of times each value appears. A frequency distribution may be grouped or ungrouped: for a small number of observations an ungrouped frequency distribution is convenient; for a large number of observations, a grouped one is.

Ungrouped                      Grouped
Price (X)    Frequency         Price (X)      Frequency
67.05        1                 65.00-66.99    1
66.89        1                 67.00-68.99    5
67.45        2                 69.00-70.99    1
68.39        2                 71.00-72.99    0
70.10        1                 73.00-74.99    0

Central Tendency

In statistics, the term central tendency relates to the way in which quantitative data tend to cluster around a "central value". A measure of central tendency is any of a number of ways of specifying this "central value". Three important descriptive statistics give measures of the central tendency of a variable:
- The mean
- The median
- The mode

The Mean

The arithmetic mean is the most commonly used type of average and is often referred to simply as the average. In mathematics and statistics, the arithmetic mean (or simply the mean) of a list of numbers is the sum of all the numbers in the list divided by the number of items in the list.
If the list is a statistical population, then the mean of that population is called a population mean. If the list is a statistical sample, we call the resulting statistic a sample mean. If we denote a set of data by X = (x1, x2, ..., xn), then the sample mean is typically denoted with a horizontal bar over the variable ($\bar{X}$, enunciated "x bar"). The Greek letter $\mu$ is used to denote the arithmetic mean of an entire population.

The Sample Mean

In mathematical notation, the sample mean of a set of data denoted as X = (x1, x2, ..., xn) is given by

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\left(X_1 + X_2 + \dots + X_n\right)$$

To calculate the mean, all of the observations (values) of X are added and the result is divided by the number of observations (n). In the previous example, the mean price of the Dec CME Live Cattle futures contract is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{7}\left(67.05 + 66.89 + \dots + 68.39\right) = 67.96$$
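As a quick numerical check, the sample mean can be reproduced in a few lines of Python. This is a minimal sketch, not part of the original lecture; the price list simply re-enters the hypothetical futures data from the table above.

```python
# Hypothetical Dec CME Live Cattle futures prices (cents/lb), May-November
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

n = len(prices)
mean = sum(prices) / n  # X-bar = (1/n) * (X_1 + X_2 + ... + X_n)
print(f"n = {n}, mean = {mean:.2f}")  # prints: n = 7, mean = 67.96
```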
The Median

In statistics, a median is described as the numeric value separating the higher half of a sample or population from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. If there is an even number of observations, then there is no single middle value, so one often takes the mean of the two middle values.
Organizing the price data in the previous example in ascending order gives
66.89, 67.05, 67.45, 67.45, 68.39, 68.39, 70.10
The median of this price series is 67.45.

The Mode

In statistics, the mode is the value that occurs most frequently in a data set. The mode is not necessarily unique, since the same maximum frequency may be attained at different values.
Organizing the price data in the previous example in ascending order gives
66.89, 67.05, 67.45, 67.45, 68.39, 68.39, 70.10
There are two modes in the given price data, 67.45 and 68.39, so the mode of the sample data is not unique; the sample price dataset may be said to be bimodal. A population or sample may be unimodal, bimodal, or multimodal.

Statistical Dispersion

In statistics, statistical dispersion (also called statistical variability or variation) is the variability or spread in a variable or probability distribution. In particular, a measure of dispersion is a statistic (formula) that indicates how dispersed (i.e., spread out) the values of a given variable are. Common measures of statistical dispersion are
- The variance, and
- The standard deviation
Dispersion is contrasted with location or central tendency, and together they are the most used properties of distributions.

The Variance

In statistics, the variance of a random variable or distribution is the expected (mean) value of the square of the deviation of that variable from its expected value or mean. Thus the variance is a measure of the amount of variation within the values of that variable, taking account of all possible values and their probabilities. If a random variable X has the expected (mean) value E[X] = μ, then the variance of X is given by

$$\operatorname{Var}(X) = E\left[(X - \mu)^2\right] = \sigma_x^2$$

This definition of variance encompasses random variables that are discrete or continuous. It can be expanded as follows:

$$\begin{aligned}
\operatorname{Var}(X) &= E\left[(X - \mu)^2\right] \\
&= E\left[X^2 - 2\mu X + \mu^2\right] \\
&= E\left[X^2\right] - 2\mu E[X] + \mu^2 \\
&= E\left[X^2\right] - 2\mu^2 + \mu^2 \\
&= E\left[X^2\right] - \mu^2 \\
&= E\left[X^2\right] - \left(E[X]\right)^2
\end{aligned}$$

The Variance: Properties

Variance is non-negative because the squares are positive or zero. The variance of a constant a is zero, and the variance of a variable in a data set is zero if and only if all entries have the same value:

$$\operatorname{Var}(a) = 0$$

Variance is invariant with respect to changes in a location parameter. That is, if a constant is added to all values of the variable, the variance is unchanged:

$$\operatorname{Var}(X + a) = \operatorname{Var}(X)$$

If all values are scaled by a constant, the variance is scaled by the square of that constant:

$$\operatorname{Var}(aX) = a^2\operatorname{Var}(X), \qquad \operatorname{Var}(aX + b) = a^2\operatorname{Var}(X)$$

The Sample Variance

If we have a series of n measurements of a random variable X as Xi, where i = 1, 2, ..., n, then the sample variance can be used to estimate the population variance of X = (x1, x2, ..., xn). The sample variance is calculated as

$$S_x^2 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1} = \frac{1}{n - 1}\left[\left(X_1 - \bar{X}\right)^2 + \left(X_2 - \bar{X}\right)^2 + \dots + \left(X_n - \bar{X}\right)^2\right]$$

The denominator, (n − 1), is known as the degrees of freedom in calculating $S_x^2$. Intuitively, once $\bar{X}$ is known, only n − 1 of the observations are free to vary; one is predetermined by $\bar{X}$. Had we divided by n instead, then at n = 1 the estimated variance of a single observation would be zero regardless of the true variance; this bias needs to be corrected for when n is small.
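To make the n versus (n − 1) distinction concrete, the sketch below (illustrative Python, not part of the original lecture) computes both the divide-by-n average of squared deviations and the degrees-of-freedom-corrected sample variance for the futures prices:

```python
# Hypothetical futures prices from the running example
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

n = len(prices)
mean = sum(prices) / n
ss = sum((x - mean) ** 2 for x in prices)  # sum of squared deviations

biased_var = ss / n        # divide-by-n estimator (the MSD introduced below)
sample_var = ss / (n - 1)  # degrees-of-freedom-corrected sample variance
print(f"divide-by-n: {biased_var:.2f}, sample variance: {sample_var:.2f}")
# prints: divide-by-n: 1.06, sample variance: 1.24
```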
The Sample Variance: An Example

For the hypothetical price data for the Dec CME Live Cattle futures contract, 67.05, 66.89, 67.45, 67.45, 68.39, 68.39, 70.10, the sample variance can be calculated as

$$S_x^2 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1} = \frac{1}{7 - 1}\left[\left(67.05 - 67.96\right)^2 + \dots + \left(70.10 - 67.96\right)^2\right] = 1.24$$

The Standard Deviation

In statistics, the standard deviation of a random variable or distribution is the square root of its variance. If a random variable X has the expected value (mean) E[X] = μ, then the standard deviation of X is given by

$$\sigma_x = \sqrt{E\left[(X - \mu)^2\right]} = \sqrt{\sigma_x^2}$$

That is, the standard deviation σ (sigma) is the square root of the average value of (X − μ)².
If we have a series of n measurements of a random variable X as Xi, where i = 1, 2, ..., n, then the sample standard deviation can be used to estimate the population standard deviation of X = (x1, x2, ..., xn). The sample standard deviation is calculated as

$$S_x = \sqrt{S_x^2} = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}} = \sqrt{1.24} = 1.114$$

The Mean Absolute Deviation

The mean or average deviation of X from its mean,

$$\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)}{n},$$

is always zero: the positive and negative deviations cancel out in the summation, which makes it a useless measure of dispersion. The mean absolute deviation (MAD), calculated by

$$\mathrm{MAD} = \frac{\sum_{i=1}^{n}\left|X_i - \bar{X}\right|}{n},$$

solves the "canceling out" problem.

The MSD and RMSD

An alternative way to address the canceling-out problem is to square the deviations from the mean, obtaining the mean squared deviation (MSD):

$$\mathrm{MSD} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n}$$

The problem introduced by squaring (the MSD is in squared units) can be solved by taking the square root of the MSD to obtain the root mean squared deviation (RMSD):

$$\mathrm{RMSD} = \sqrt{\mathrm{MSD}} = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n}}$$

RMSD vs. Standard Deviation

When calculating the RMSD, the squaring of the deviations gives greater weight to the deviations that are larger in absolute value, which may or may not be desirable. For statistical reasons, it turns out that a slight variation of the RMSD, known as the standard deviation (Sx), is more desirable as a measure of dispersion:

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n}}, \qquad S_x = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}}$$

Variance vs. MSD, Standard Deviation vs. RMSD

Price (X)   Mean    (Xi − Mean)   |Xi − Mean|   |Xi − Mean|²
67.05       67.96   -0.91         0.91          0.83
66.89       67.96   -1.07         1.07          1.14
67.45       67.96   -0.51         0.51          0.26
68.39       67.96    0.43         0.43          0.18
67.45       67.96   -0.51         0.51          0.26
70.10       67.96    2.14         2.14          4.58
68.39       67.96    0.43         0.43          0.18
Total                0.00         6.00          7.44

Variance = 1.24    Std. Dev. = 1.11    MAD = 0.86    MSD = 1.06    RMSD = 1.03

Association

Bivariate statistics can be used to examine the degree to which two variables are related or associated, without implying that one causes the other. Multivariate statistics can be used to examine the degree to which multiple variables are related or associated, without implying that one causes any or some of the others. Two common measures of bivariate and multivariate association are
- Covariance
- Correlation coefficient

Association: Bivariate Statistics

[Figure 3.3: two scatter plots of Y against X. In panel (a), Y and X are positively but weakly correlated; in panel (b), they are negatively and strongly correlated.]

The Covariance

The covariance between two real-valued random variables X and Y, with means (expected values) E[X] = μ and E[Y] = ν, is

$$\begin{aligned}
\operatorname{Cov}(X, Y) &= E\left[(X - \mu)(Y - \nu)\right] \\
&= E\left[XY - \nu X - \mu Y + \mu\nu\right] \\
&= E[XY] - \nu E[X] - \mu E[Y] + \mu\nu \\
&= E[XY] - \nu\mu - \mu\nu + \mu\nu \\
&= E[XY] - \mu\nu
\end{aligned}$$

Cov(X, Y) can be negative, zero, or positive. Random variables whose covariance is zero are called uncorrelated.
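The identity Cov(X, Y) = E[XY] − μν can be checked numerically. The Python sketch below (hypothetical data, not from the lecture) computes both forms of the covariance using divide-by-n averages over a small sample and confirms that they agree:

```python
# Hypothetical paired observations of X and Y
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mu = sum(xs) / n   # mean of X
nu = sum(ys) / n   # mean of Y

# E[(X - mu)(Y - nu)], the definition of the covariance
cov_def = sum((x - mu) * (y - nu) for x, y in zip(xs, ys)) / n
# E[XY] - mu*nu, the final form of the derivation above
cov_alt = sum(x * y for x, y in zip(xs, ys)) / n - mu * nu

print(f"{cov_def:.4f} vs {cov_alt:.4f}")  # identical up to rounding: 3.9200
```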
Covariance and Independence

If X and Y are independent, then their covariance is zero. This follows because, under independence,

$$E[XY] = E[X] \cdot E[Y] = \mu\nu$$

Recalling the final form of the covariance derivation given above and substituting, we get

$$\operatorname{Cov}(X, Y) = \mu\nu - \mu\nu = 0$$

The converse, however, is generally not true: some pairs of random variables have covariance zero although they are not independent.

The Covariance: Properties

If X and Y are real-valued random variables and a and b are constants ("constant" in this context means non-random), then the following facts are a consequence of the definition of covariance:

$$\operatorname{Cov}(X, a) = 0$$
$$\operatorname{Cov}(X, X) = \operatorname{Var}(X)$$
$$\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$$
$$\operatorname{Cov}(aX, bY) = ab\operatorname{Cov}(X, Y)$$
$$\operatorname{Cov}(X + a, Y + b) = \operatorname{Cov}(X, Y)$$

Variance of the Sum of Correlated Random Variables

If X and Y are real-valued random variables and a and b are constants, then the following facts are a consequence of the definitions of variance and covariance:

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$$
$$\operatorname{Var}(aX + bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y) + 2ab\operatorname{Cov}(X, Y)$$

The variance of a finite sum of uncorrelated random variables is equal to the sum of their variances,

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y),$$

because if X and Y are uncorrelated, their covariance is zero.

The Sample Covariance

The covariance is one measure of how closely the values taken by two variables X and Y vary together. If we have a series of n measurements of X and Y written as Xi and Yi, where i = 1, 2, ..., n, then the sample covariance can be used to estimate the population covariance between X = (X1, X2, ..., Xn) and Y = (Y1, Y2, ..., Yn). The sample covariance is calculated as

$$S_{x,y} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{n - 1}$$

Correlation Coefficient

A disadvantage of the covariance statistic is that its magnitude cannot be easily interpreted, since it depends on the units in which X and Y are measured. The related and more widely used correlation coefficient remedies this disadvantage by standardizing the deviations from the mean:

$$\rho_{x,y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)}\sqrt{\operatorname{Var}(Y)}} = \frac{\operatorname{Cov}(X, Y)}{\sigma_x \sigma_y}$$

The correlation coefficient is symmetric, that is, $\rho_{x,y} = \rho_{y,x}$.
If we have a series of n measurements of X and Y written as Xi and Yi, where i = 1, 2, ..., n, then the sample correlation coefficient can be used to estimate the population correlation coefficient between X and Y. The sample correlation coefficient is calculated as

$$r_{x,y} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{(n - 1)\, S_x S_y}$$

The value of the correlation coefficient falls between −1 and 1:

$$-1 \le r_{x,y} \le 1$$

- r = 0 => X and Y are uncorrelated
- r = 1 => X and Y are perfectly positively correlated
- r = −1 => X and Y are perfectly negatively correlated
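As an illustration, the sample correlation coefficient can be computed directly from its definition. The Python sketch below (the same hypothetical data as in the covariance example, not from the lecture) builds r from the sample covariance and the two sample standard deviations and confirms that it falls in [−1, 1]:

```python
import math

# Hypothetical paired observations of X and Y
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Sample covariance and sample standard deviations (n - 1 denominators)
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
s_x = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))

r = s_xy / (s_x * s_y)
print(f"r = {r:.4f}")  # about 0.999: a nearly perfect positive association
assert -1.0 <= r <= 1.0
```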