OBJECTIVE, DATA, DESCRIPTIVE STATISTICS
OBJECTIVE
The first step in an empirical study is to clearly state the objective. The objective is to use a
sample of data to draw conclusions about an economic relationship in a population. The
population is the entire group of units in which you are interested. The sample is a subset of units
from the population. The unit can be an individual, household, firm, city, county, state, nation,
etc.
DATA
The next step is to obtain a sample of data. The process that generates the data can be a
controlled or an uncontrolled experiment. A controlled experiment yields experimental data; an
uncontrolled experiment yields observational data.
Random Sample or Nonrandom Sample?
Once we have our sample, we must ask: Is this a random sample or a nonrandom sample? A
random sample tends to be representative of the population. A nonrandom sample may overrepresent
or underrepresent different groups in the population. Sample selection results in a nonrandom
sample.
Consequences of a Nonrandom Sample
A nonrandom sample can cause bias. Bias is a systematic error in the conclusions drawn about a
population.
Sources of Sample Selection
There are three important sources of sample selection. 1) Nonrandom method of sampling. 2)
Nonresponse. 3) Response error. A nonrandom method of sampling selects units from the population
into the sample by a systematic choice process, not by chance. Nonresponse occurs when units
selected into the sample choose not to respond or cannot be contacted. Response error occurs when
units in the sample give incorrect information, or the information is incorrectly coded.
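The link between a nonrandom method of sampling and bias can be illustrated with a minimal sketch. The population values and the selection rule (income of at least 45) are hypothetical, chosen only for illustration. The sketch compares the mean of the selected sample with the population mean, and then averages the sample mean over every possible sample of five units, which is what simple random sampling draws from with equal probability.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population: incomes (in $1,000s) of 10 households.
population = [20, 25, 30, 35, 40, 45, 50, 60, 80, 115]
population_mean = mean(population)

# Assumed nonrandom selection rule: only households with income >= 45
# are selected into the sample (a systematic choice process, not chance).
selected_sample = [x for x in population if x >= 45]
selected_mean = mean(selected_sample)

# Systematic error of the selected sample relative to the population.
bias = selected_mean - population_mean

# Under simple random sampling, every sample of 5 units is equally likely.
# Averaging the sample mean over all of them shows no systematic error.
sample_means = [sum(s) / 5 for s in combinations(population, 5)]
average_random_mean = sum(sample_means) / len(sample_means)

print("population mean:", population_mean)
print("mean of selected sample:", selected_mean)
print("bias of selected sample:", bias)
print("average mean over all random samples:", average_random_mean)
```

Any single random sample can still miss the population mean by chance; the point of the sketch is only that the error of the selected sample is systematic, while the errors of random samples average out.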
DESCRIPTIVE STATISTICS: GET TO KNOW THE DATA
The next step in an empirical study is to get to know the data. To get to know the data, we must
do the following. 1) Get to know the variables. 2) Get to know the observations. 3) Organize,
summarize, and describe the data. This is called the descriptive study of data. The descriptive
study of data involves describing the characteristics of the sample data, not drawing conclusions
about the population.
GET TO KNOW THE VARIABLES
A variable is a quantifiable characteristic of a unit. It is a known number that can differ from unit
to unit.
Quantitative Variable
A variable that can be measured numerically on a well defined scale.
Qualitative Variable
A variable that indicates the presence or absence of a quality or characteristic of a unit. It has as
many categories as possible characteristics.
Quantifying a Qualitative Variable
To quantify a qualitative variable, one or more artificial variables are created. These artificial
variables are called dummy variables. A dummy variable can take two values: 0 or 1. It takes a
value of 1 if a characteristic is present and zero if absent.
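As a minimal sketch of how dummy variables are created, the following code builds one dummy variable for each category of a qualitative variable. The variable `region` and its categories are hypothetical and serve only as an illustration.

```python
# Hypothetical qualitative variable: region of residence for five units.
region = ["north", "south", "west", "south", "north"]

# The distinct categories of the qualitative variable.
categories = sorted(set(region))

# One dummy variable per category: 1 if the characteristic is present
# for a unit, 0 if it is absent.
dummies = {c: [1 if r == c else 0 for r in region] for c in categories}

for c in categories:
    print(c, dummies[c])
```

Each unit has a value of 1 for exactly one of the dummy variables, because each unit falls into exactly one category.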
Discrete Variable
A variable that can take a finite number of values on a given interval.
Continuous Variable
A variable that can take an infinite number of values on a given interval.
Variable Measurement
Variable measurement refers to how a variable is defined and the unit in which it is measured.
GET TO KNOW THE OBSERVATIONS
There are three possible types of observations on a unit. A univariate observation is a numerical
value for a single variable. A bivariate observation is two numerical values for two variables. A
multivariate observation is three or more numerical values for three or more variables. The
number of observations always equals the number of units in the sample. Let X, Y, and Z be
variables. Let $x_t$, $y_t$, $z_t$ be the values of the variables X, Y, Z for the t-th unit in the
sample. Then $(x_t)$ is a univariate observation; $(x_t, y_t)$ is a bivariate observation;
$(x_t, y_t, z_t)$ is a multivariate observation.
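One way to picture the three types of observations is to store each observation as a tuple, with one tuple per unit. This is only a sketch; the data values are hypothetical.

```python
# Hypothetical sample of n = 3 units, with values of variables X, Y, Z.
x = [12, 15, 9]
y = [3.1, 2.8, 4.0]
z = [0, 1, 1]

# One observation per unit; the t-th tuple belongs to the t-th unit.
univariate = [(value,) for value in x]   # one value per observation
bivariate = list(zip(x, y))              # two values per observation
multivariate = list(zip(x, y, z))        # three values per observation

# The number of observations equals the number of units in each case.
print(len(univariate), len(bivariate), len(multivariate))
print(multivariate[0])
```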
ORGANIZE, SUMMARIZE, DESCRIBE THE DATA
To organize, summarize, and describe the data we do two things. 1) Look for patterns in the data.
2) Calculate descriptive statistics.
Look for Patterns in the Data
To look for patterns in the data, we use 2 statistical tools. 1) Frequency distribution and histogram
for a single variable. 2) Scatter diagram for two variables.
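A frequency distribution for a single variable can be sketched with the standard library alone. The sample below (number of children in 12 households) is hypothetical, and the text histogram is a stand-in for a plotted one.

```python
from collections import Counter

# Hypothetical sample: number of children in 12 households.
children = [0, 2, 1, 2, 3, 0, 1, 2, 2, 4, 1, 2]

# Frequency distribution: maps each value to the number of units with it.
frequency = Counter(children)

# Text histogram: one '*' per observation with that value.
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value}: {'*' * count} ({count})")
```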
Calculate Descriptive Statistics
A statistic is a quantifiable characteristic of a sample. It is a known number that can differ from
sample to sample. A descriptive statistic is a numerical measure that describes a characteristic of
the sample data. There are two types of descriptive statistics. 1) Univariate descriptive statistics.
2) Bivariate descriptive statistics.
Univariate Descriptive Statistics
Univariate descriptive statistics describe the characteristics of the data for a single variable. The
two characteristics of most interest are the following. 1) Center of the data. (Measured by mean,
median, or mode). 2) Dispersion of the data. (Measured by range, variation, variance, standard
deviation, or coefficient of variation).
Measures of Center of Data
The most often used measure of the center of data is the sample mean. Let X denote a variable.
Let $x_t$ denote the value of X for the t-th unit in the sample. Let n denote the number of
units/observations in the sample. The sample mean is $\bar{x} = \frac{1}{n} \sum_{t=1}^{n} x_t$.
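The measures of center can be computed as in the following sketch. The data values are hypothetical; the sample mean is computed directly from its definition, and the median and mode come from Python's standard `statistics` module.

```python
from statistics import median, mode

# Hypothetical values of X for n = 7 units.
x = [4, 8, 6, 5, 3, 8, 8]
n = len(x)

# Sample mean: the sum of the values divided by the number of observations.
x_bar = sum(x) / n

# Median: the middle value when the observations are sorted.
x_median = median(x)

# Mode: the value that occurs most often.
x_mode = mode(x)

print("mean:", x_bar)
print("median:", x_median)
print("mode:", x_mode)
```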
Measures of Dispersion of Data
The range is the difference between the maximum and minimum values of X. The sample
variation (also called the total sum of squares) is $TSS = \sum_{t=1}^{n} (x_t - \bar{x})^2$. The
sample variance is $s^2 = \sum_{t=1}^{n} (x_t - \bar{x})^2 / (n - 1) = TSS / (n - 1)$. The sample
standard deviation is $s = \sqrt{s^2}$. The sample coefficient of variation is
$CV = (s / \bar{x}) \times 100$.
Comparison of Measures of Dispersion
The range is the least used measure because it wastes information about dispersion: it uses only
the two most extreme values. Larger values of variation (TSS), variance ($s^2$), and standard
deviation (s) indicate more dispersion in the values of X about its mean. The advantage of the
standard deviation is that it is measured in the same units as X and can be interpreted, roughly,
as the typical deviation of X from its mean. The major disadvantage of these measures is that
none of them is unit-free. We cannot use them to compare dispersion for two or more variables
measured in different units. The major advantage of the coefficient of variation is that it is
unit-free, and can be used to make such a comparison.
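The unit-free property of the coefficient of variation can be checked with a small sketch. The wage data are hypothetical; the same wages are expressed once in dollars and once in cents, and `coefficient_of_variation` is a helper defined here for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percent of the sample mean."""
    return stdev(values) / mean(values) * 100

# Hypothetical hourly wages, measured in dollars and then in cents.
wage_dollars = [10.0, 12.0, 15.0, 18.0, 25.0]
wage_cents = [100 * w for w in wage_dollars]

# The standard deviation depends on the unit of measurement ...
s_dollars = stdev(wage_dollars)
s_cents = stdev(wage_cents)

# ... but the coefficient of variation does not.
cv_dollars = coefficient_of_variation(wage_dollars)
cv_cents = coefficient_of_variation(wage_cents)

print("standard deviation:", s_dollars, "vs", s_cents)
print("coefficient of variation:", cv_dollars, "vs", cv_cents)
```

Changing the unit multiplies the standard deviation by 100 but leaves the coefficient of variation unchanged, up to floating-point rounding.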
Bivariate Descriptive Statistics
Bivariate descriptive statistics describe the characteristics of the data for two variables. The
characteristic of most interest is linear association between two variables. The most often used
measures of linear association are the following. 1) Covariation. 2) Covariance. 3) Correlation
coefficient. The basic idea for each of these measures is as follows. Two variables X and Y have
a positive linear association if when X is above (below) its mean Y tends to be above (below) its
mean. Two variables have a negative linear association if when X is above (below) its mean Y
tends to be below (above) its mean. If two variables do not display this tendency, then they have
no linear association. The stronger this tendency, the stronger the degree of linear association.
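The sample covariation of X and Y is $\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})$. The sample
covariance is $s_{xy} = \sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y}) / (n - 1)$. The sample
correlation coefficient is $r_{xy} = s_{xy} / (s_x s_y)$, where $s_x$ and $s_y$ are the sample
standard deviations of X and Y.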
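The three measures of linear association can be computed directly from their definitions, as in this sketch. The paired data are hypothetical.

```python
from math import sqrt

# Hypothetical bivariate observations for n = 5 units.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariation: sum of products of deviations from the means.
covariation = sum((xv - x_bar) * (yv - y_bar) for xv, yv in zip(x, y))

# Sample covariance: covariation divided by n - 1.
s_xy = covariation / (n - 1)

# Sample standard deviations of X and Y.
s_x = sqrt(sum((xv - x_bar) ** 2 for xv in x) / (n - 1))
s_y = sqrt(sum((yv - y_bar) ** 2 for yv in y) / (n - 1))

# Sample correlation coefficient: covariance scaled by both standard deviations.
r_xy = s_xy / (s_x * s_y)

print("covariation:", covariation)
print("covariance:", s_xy)
print("correlation coefficient:", r_xy)
```

The positive values indicate a positive linear association: in this sample, Y tends to be above its mean when X is above its mean.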
Advantages and Disadvantages of Covariation, Covariance, Correlation Coefficient
The major advantage of these 3 measures is that they tell us something about the linear
relationship between X and Y in the sample. If X and Y have a linear relationship in the sample,
then they may be related in the population. These measures have 3 major disadvantages. 1) They
tell us nothing about the causal relationship between X and Y in the sample, and therefore in the
population. They can’t tell us if X has an independent causal effect on Y, and if so the direction
and size of the effect. 2) They do not measure nonlinear association between X and Y. For
example, if X and Y have a close U-shaped relationship that is roughly symmetric about the mean
of X, covariation, covariance, and correlation would be zero or close to zero. While it is true
there is no linear association, there is a nonlinear association.
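The U-shaped case can be checked with a small sketch. The data are hypothetical: Y is set equal to the square of X, with the values of X placed symmetrically around zero, so Y is completely determined by X even though the linear measures find nothing.

```python
from math import sqrt

# Hypothetical data with an exact U-shaped relationship: y = x squared.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Covariation: positive and negative products of deviations cancel out.
covariation = sum((xv - x_bar) * (yv - y_bar) for xv, yv in zip(x, y))

s_xy = covariation / (n - 1)
s_x = sqrt(sum((xv - x_bar) ** 2 for xv in x) / (n - 1))
s_y = sqrt(sum((yv - y_bar) ** 2 for yv in y) / (n - 1))
r_xy = s_xy / (s_x * s_y)

print("covariation:", covariation)
print("covariance:", s_xy)
print("correlation coefficient:", r_xy)
```

All three measures are zero here, yet knowing X tells us Y exactly, which is why a scatter diagram is worth inspecting alongside these statistics.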