Lecture1 - Oregon State University

Conducting Social Research
Statistical Principles and
An Overview of
Regression Analysis
Univariate, Bivariate, and
Multivariate Statistics
Roger B. Hammer
Assistant Professor
Department of Sociology
Oregon State University
Basic Notation
\(Y\): A random variable (data vector) that we want to model.
\(Y_i\): The ith observation in our data vector.
Example: \(Y = \{4, 5, 6, 7, 8\}\), so \(Y_2 = 5\).
Notation Notation: It varies, so be flexible.
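The subscript convention can be checked in a few lines of Python. This is only a minimal sketch: the name `Y` mirrors the slide, and the one wrinkle is that Python lists count from 0 while the slide's subscripts count from 1.

```python
# The data vector from the slide.
Y = [4, 5, 6, 7, 8]

# The slide's Y_2 is the second observation. Python indexes from 0,
# so the i-th observation is Y[i - 1].
i = 2
Y_2 = Y[i - 1]
print(Y_2)  # 5
```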
Basic Notation
\(\bar{Y}\): Y-Bar is the mean of Y.
\(Y_i\): The observed value of the ith observation.
\(\hat{Y}_i\): Y-Hat is the estimated or predicted value of the ith observation.
Random Variable
• A variable whose numerical value
is determined by chance, the
outcome of a random phenomenon.
• Discrete has a countable number
of values.
• Continuous can take on any value
in an interval.
Is “statistical anxiety” continuous or discrete?
Probability
• Probability is the likelihood or chance that
something (an event) is the case or will happen.*
• The probability of an event is represented by a real
number in the range from 0 to 1.*
• An impossible event has a probability of 0, and a
certain event has a probability of 1.*
P(A), p(A) or Pr(A) *
P[X] **
*Wikipedia
**Studenmund
Probability Distribution
• Assigns probabilities to the possible
values of a discrete variable.
P[X] + P[Not X] = 1
P[Not X] = 1 - P[X]
In the Statistical Anxiety Survey data, what
is the probability of having taken a
previous statistics course?
Of not having taken one?
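The complement rule can be sketched in Python as follows. The counts are hypothetical and chosen only for illustration; they are not the actual Statistical Anxiety Survey results.

```python
# Hypothetical counts for illustration only; these are NOT the actual
# Statistical Anxiety Survey results.
took_course = 12
no_course = 28
n = took_course + no_course

p_x = took_course / n      # P[X]: took a previous statistics course
p_not_x = 1 - p_x          # P[Not X], from the complement rule

print(p_x, p_not_x)
```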
Normal (Gaussian) Distribution
The Bell Curve
Law of Large Numbers
• The first theorem of probability that
describes the long-term stability of a
random variable.
• Given a sample of independent and
identically distributed (iid) random
variables with a finite population mean,
the average of these observations will
eventually approach and stay close to
the population mean.
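A small simulation can make the theorem concrete. The sketch below assumes a fair six-sided die as the random variable (an example chosen here, not taken from the lecture), so the population mean is 3.5. The function name `mean_of_die_rolls` is made up for this illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def mean_of_die_rolls(n_rolls):
    """Return the average of n_rolls independent fair six-sided die rolls."""
    total = 0
    for _ in range(n_rolls):
        total += random.randint(1, 6)
    return total / n_rolls

# The averages should drift toward the population mean of 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```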
The Central Limit Theorem
• The second theorem of probability that
describes the distribution of the sample
mean.
• Given a sample of independent and
identically distributed (iid) random
variables with a finite, nonzero standard
deviation, the probability distribution of the
sample mean approaches the normal distribution
as the sample size increases.
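One way to see this is to simulate sample means from a clearly non-normal population and watch their skewness shrink toward 0, the value for a normal distribution. The sketch below assumes an exponential population (a choice made here for illustration). The helper names `sample_means` and `skewness` are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples=5_000):
    """Means of n_samples iid samples from a skewed (exponential) population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

def skewness(values):
    """Sample skewness: close to 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

# Skewness of the sample mean should fall as the sample size grows.
for n in (1, 5, 50):
    print(n, round(skewness(sample_means(n)), 2))
```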
Sampling
• Population is the entire
group of items of interest.
• Sample is the observed
part of the population.
Is the Statistical Anxiety Survey data sample
or population based?
Statistical Inference
• The use of a sample to draw
conclusions about the population
from which the sample came.
• Inference is necessary because it is
often impractical to “scrutinize” the
entire population.
Are medical blood tests based on inference?
Is the U.S. Census based on inference?
Random Sampling
• A sample in which every member of the
population has an equal chance of being
selected.
Selection Bias
• The exclusion or underrepresentation of certain types of
respondents/observations in a
sample, resulting in a nonrepresentative sample.
Can you give an example of selection bias
highlighted recently in the media?
Is the Statistical Anxiety Survey data sample
biased? Why or Why not?
The Expected Value of a
Random Variable
• A weighted average of all the
possible values of the random
variable (population mean).
\[ \mu = E[X] = \sum_i X_i \, P[X_i] \]
Notation Notation: The italics don’t exactly conform to Studenmund. Remember to be flexible.
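As a worked example of the weighted average, assume a fair six-sided die (chosen here for illustration, not taken from the lecture), so each value has probability 1/6:

```latex
\[
\mu = E[X] = \sum_{i=1}^{6} X_i \, P[X_i]
    = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5
\]
```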
The Variance of a Random
Variable
• The extent to which the values may
differ from the expected value.
• The expected value of the squared difference.
\[ \sigma^2 = E[(X - \mu)^2] = \sum_i (X_i - \mu)^2 \, P[X_i] \]
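The variance formula can be evaluated directly for a small discrete distribution. The sketch below assumes a fair six-sided die (an illustrative choice, not from the lecture) and uses exact fractions so the results are not blurred by rounding.

```python
from fractions import Fraction
import math

# Assumed example: a fair six-sided die, each face with probability 1/6.
values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)

mu = sum(x * prob for x in values)                # expected value
var = sum((x - mu) ** 2 * prob for x in values)   # variance
sd = math.sqrt(var)                               # standard deviation

print(mu, var, sd)
```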
Similarity of Mean and
Variance Formulas
• Substitution of the squared difference
for the value.
\[ \mu = \sum_i X_i \, P[X_i] \]
\[ \sigma^2 = \sum_i (X_i - \mu)^2 \, P[X_i] \]
The Standard Deviation of
a Random Variable
• The square root of the variance.
• Absolute value of the difference.
• Residuals.
\[ \sigma = \sqrt{E[(X - \mu)^2]} = \sqrt{\sum_i (X_i - \mu)^2 \, P[X_i]} \]
Population Parameters and
Sample Statistics
Concept              Sample Statistic    Population Parameter
Mean                 \(\bar{Y}\)         \(\mu = E[Y]\)
Variance             \(s_Y^2\)           \(\sigma_Y^2 = \mathrm{Var}[Y]\)
Standard Deviation   \(s_Y\)             \(\sigma_Y = \sqrt{\mathrm{Var}[Y]}\)
Sample Statistics Example
We have obtained a sample of 40 housing sales
that took place somewhere in some year. The
data contain two variables, price (in dollars) and size
(total above-grade finished area in square feet).
Price and Size
Do you think that price and size would be
related to each other?
Would one “cause” the other?
Which variable would you consider to be
independent (X) and which dependent (Y)?
Why?
Independent and
Dependent Variables
• X= Size and Y = Price
• For a buyer the price that they are willing to
pay is a function of the size of the house, along
with other factors.
• X= Price and Y = Size
• For a builder the price that they want to receive
for a home will determine its size, along with
other factors.
Univariate Statistics
The Sample Mean of Price
\[ \bar{Y} = (Y_1 + Y_2 + Y_3 + \dots + Y_n) / n \]
\[ \bar{Y} = \sum_{i=1}^{n} Y_i / n = \$3,481,200 / 40 = \$87,030 \]
Population Mean and
Sample Means
If we drew a second sample of 40 housing sales
would the mean be exactly the same as the
mean of the first sample?
Is the sample mean exactly the same as the
population mean?
The Expectation of the
Sample Means
\[ \mu = E[X] = E[\bar{X}] \]
• The Law of Large Numbers.
\[ \bar{X} \sim N(\mu, \sigma^2 / n) \text{ approximately, for large } n \]
• The Central Limit Theorem.
The Sample Mean of Size
\[ \bar{X} = (X_1 + X_2 + X_3 + \dots + X_n) / n \]
\[ \bar{X} = \sum_{i=1}^{n} X_i / n = 177,097 / 40 \approx 4,427 \]
The Sum of the Deviations
The Zero-sum Property
\[ E(X_i - \bar{X}) = \sum_i (X_i - \bar{X}) = 0 \]
\[ E(Y_i - \bar{Y}) = \sum_i (Y_i - \bar{Y}) = 0 \]
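The zero-sum property can be verified on the small data vector from the notation slide, Y = 4, 5, 6, 7, 8. This is only an illustrative check, not a proof.

```python
Y = [4, 5, 6, 7, 8]          # example data vector from the notation slide
y_bar = sum(Y) / len(Y)      # sample mean: 6.0

deviations = [y - y_bar for y in Y]
print(deviations)            # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(sum(deviations))       # 0.0
```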
The Sum of the Squared
Deviations
Total Sum of Squares
\[ \sum_i (X_i - \bar{X})^2 = 405,415,599 \]
\[ \sum_i (Y_i - \bar{Y})^2 = \$114,245,084,000 \]
The Sample Variance
\[ s_X^2 = \sum_i (X_i - \bar{X})^2 / (n - 1) = 405,415,599 / 39 = 10,395,271 \]
\[ s_Y^2 = \sum_i (Y_i - \bar{Y})^2 / (n - 1) = 114,245,084,000 / 39 = 2,929,361,128 \]
Sample Standard Deviation
\[ s_X = \sqrt{s_X^2} = 3,224 \]
\[ s_Y = \sqrt{s_Y^2} = \$54,124 \]
Bivariate Statistics
(Skipping Ahead to Chapter 2)
Covariance of X and Y
\[ s_{XY} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) / (n - 1) = 6,760,921,922 / 39 = 173,356,972 \]
Covariance of Y and Y
is the Variance of Y
\[ s_{YY} = \sum_i (Y_i - \bar{Y})(Y_i - \bar{Y}) / (n - 1) = s_Y^2 = 114,245,084,000 / 39 = 2,929,361,128 \]
Correlation of X and Y
\[ r = \frac{s_{XY}}{s_X s_Y} = \frac{173,356,972}{3,224 \times 54,124} = .9934 \]
Regression Analysis
• Econometricians use regression analysis to make
quantitative estimates of economic relationships
that previously have been completely theoretical in
nature.
• Sociologists use regression analysis to make
quantitative estimates of social relationships that
previously have been completely theoretical in
nature.
• Political scientists use regression analysis to make
quantitative estimates of political relationships that
previously have been completely theoretical in
nature.
The Basic (Theoretical)
Linear Model
\[ y = f(x), \text{ e.g. } \mathit{price} = f(\mathit{size}) \]
\[ f(X) = \beta_0 + \beta_1 X \]
\[ Y = \beta_0 + \beta_1 X \]
• β0 is the Y-intercept, the point at which the
regression line crosses the vertical axis.
• β1 is the slope of the regression line, a 1 unit
change in Xi results in a β1 unit change in Yi.
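The roles of the intercept and slope can be illustrated with a short function. The coefficient values below are hypothetical, picked only to show the mechanics. They are not estimates from the housing data.

```python
# Hypothetical coefficients, chosen only to illustrate the formula;
# they are not estimates from the housing data.
BETA_0 = 10_000.0   # intercept: value of f(x) when x = 0
BETA_1 = 20.0       # slope: change in f(x) per one-unit change in x

def f(x):
    """Deterministic linear model f(x) = beta_0 + beta_1 * x."""
    return BETA_0 + BETA_1 * x

print(f(0))                  # the intercept, 10000.0
print(f(1501) - f(1500))     # the slope, 20.0
```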
Change in the Expected Value of Y
\[ E[Y_i] = \beta_0 + \beta_1 X_i \]
Other determinants of Y
\[ \varepsilon_i = Y_i - E[Y_i] = Y_i - \beta_0 - \beta_1 X_i \]
Change in the Observed Value of Y
\[ Y_i = E[Y_i] + \varepsilon_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
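The decomposition of an observed value into its expected value plus an error term can be sketched numerically. All numbers here are hypothetical and serve only to show how the three quantities relate.

```python
# Hypothetical values for illustration; not taken from the housing data.
beta_0, beta_1 = 10_000.0, 20.0
x_i = 1_500.0        # size of house i
y_i = 43_500.0       # observed price of house i

expected_y_i = beta_0 + beta_1 * x_i     # E[Y_i] = beta_0 + beta_1 * X_i
epsilon_i = y_i - expected_y_i           # other determinants of Y_i

# The observed value is the expected value plus the error term.
print(expected_y_i, epsilon_i, expected_y_i + epsilon_i == y_i)
```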