Uploaded by Rohit Arora

ECON 3318 – Data Analysis
and Visualization
David Quigley
Statistics and Probability Review – Aug. 28
Population vs. Sample
• Population – the entire group of interest for a
particular research question
• Examples: Do tariffs help domestic steel
companies?
• Population = all domestic steel companies
• Do tax cuts change what consumers buy?
• Population = all consumers
• While it may be possible to survey all domestic
steel companies, it would be prohibitively
expensive to survey all consumers about purchases
Population vs. Sample
• The U.S. Census surveys the entire U.S. population,
but due to cost and effort, most surveys only reach
a sample of the desired population
• Sample – a subset or part of the population that
can be used to make inferences about the entire
population
• Example: What do Americans watch on TV? It
would be too expensive to survey all TV watching
Americans, but a representative sample can give us
a really good guess about which shows are popular
Population vs. Sample
• The larger a representative sample is, the more
likely the inferences we draw from the sample
accurately reflect the entire population
• Example: The larger the sample of voters in a poll,
the more likely the poll will be close to the actual
results of the election
• Statistics are about using the sample to make
guesses about the population and quantifying how
likely our guesses are to be accurate
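The polling example can be checked with a small simulation. This is only a sketch under stated assumptions: the candidate's true support (52%) is a made-up number, each sampled voter is treated as an independent draw, and `poll_error` is a hypothetical helper name, not something from the course.

```python
import random

def poll_error(sample_size, true_share=0.52, trials=500, seed=0):
    """Average absolute error of simulated polls of a given size.

    true_share is an assumed (hypothetical) population parameter.
    """
    rng = random.Random(seed)
    total_error = 0.0
    for _ in range(trials):
        # Each sampled voter supports the candidate with probability true_share.
        votes = sum(1 for _ in range(sample_size) if rng.random() < true_share)
        total_error += abs(votes / sample_size - true_share)
    return total_error / trials

# Average polling error for increasingly large samples.
errors = {n: poll_error(n) for n in (25, 100, 400, 1600)}
for n, err in errors.items():
    print(f"sample size {n:>5}: average error {err:.4f}")
```

In this sketch the average error shrinks as the sample grows, which is the pattern the voter-poll example describes.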
Statistical Inference
• Statistical Inference – the process of using statistics
from the sample as guesses about the population
of interest and assessing how likely those statistics
are to be accurate
• Parameter – a relationship that exists in the
population
• Example: For every $1 of taxes cut, a consumer
increases spending by $0.70 on average.
• Statistic – a relationship that exists in the sample,
often used as a guess (an estimate) for a parameter
Statistical Inference
• Example: In the sample of consumers used, a $1 cut
in taxes resulted in those consumers increasing
spending by $0.71 on average.
• The statistic is close to the population parameter,
but isn’t 100% accurate because we didn’t have the
spending behavior of the population of consumers
• However, the statistic is likely close enough given
the additional costs that surveying the entire
population of consumers would have entailed
Statistical Inference
• Furthermore, a different sample is likely to have a
slightly different statistic
• Example: In a different sample of consumers, a $1
cut in taxes resulted in those consumers increasing
spending by $0.68 on average.
• Therefore, a statistic has an element of randomness
to it since the sample has an element of
randomness in it
• A parameter is NOT random since it covers the
entire population, but how close is our statistic?
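To see a fixed parameter next to statistics that change from sample to sample, here is a minimal sketch. The population is simulated: the 100,000 consumers, the 0.70 average response, and the 0.25 spread are all assumptions chosen to echo the tax-cut example, not real data.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: each value is one consumer's extra spending
# (in dollars) per $1 of taxes cut. These numbers are simulated.
population = [rng.gauss(0.70, 0.25) for _ in range(100_000)]

# The parameter is computed from the whole population, so it is fixed.
parameter = statistics.fmean(population)

# Two different random samples give two slightly different statistics.
sample_a = rng.sample(population, 500)
sample_b = rng.sample(population, 500)
stat_a = statistics.fmean(sample_a)
stat_b = statistics.fmean(sample_b)

print(f"parameter (population): {parameter:.3f}")
print(f"statistic (sample A):   {stat_a:.3f}")
print(f"statistic (sample B):   {stat_b:.3f}")
```

Re-running with a different seed changes both statistics, but the parameter for a given population stays the same.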
Sampling Distribution
• Sampling Distribution – a frequency distribution for
a sample statistic over all possible samples of a
given size
• Example: The Normal Distribution
• The Normal Distribution is so named because many
common statistics are (at least approximately)
distributed this way
Sampling Distribution
• A higher value of the Sampling Distribution
indicates that a statistic near that value is more
likely, i.e. it would occur more frequently across all
possible samples of the given size
• A lower value of the Sampling Distribution indicates
that while those values of the statistic are possible
for some samples of the given size, those values are
unlikely or those samples are rare
• Consequently, the Sampling Distribution gives you
some idea about how far off your statistic is likely
to be
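For a very small population, the Sampling Distribution can be listed out exactly by enumerating all possible samples of a given size. The sketch below assumes a made-up population of five values and samples of size 2 drawn without replacement; real populations are far too large for this, so it is only an illustration of the definition.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]   # hypothetical tiny population
sample_size = 2

# Every possible sample of the given size, and the sample mean of each one.
sample_means = [mean(s) for s in combinations(population, sample_size)]

# Frequency distribution of the statistic over all possible samples.
distribution = Counter(sample_means)

for value in sorted(distribution):
    count = distribution[value]
    print(f"sample mean {value}: {'#' * count} ({count} of {len(sample_means)} samples)")
```

Sample means near the middle (5, 6, 7) occur in more samples than the extremes (3 and 9), so a statistic from one random sample is more likely to land near the middle.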
Summary Statistics
• Summary Statistics include statistics as simple as
the minimum value and the maximum value in the
data
• Before conducting any complicated analysis of the
data, it’s always a good idea to better understand
the data first through Summary Statistics and Data
Visualization
• Let’s go through some math and notation
necessary for this course
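As a minimal sketch of the simplest Summary Statistics, the minimum and maximum can be read directly off the data. The credit card balances below are hypothetical values invented for illustration.

```python
# Hypothetical credit card balances (in dollars) for six people.
balances = [120, 0, 845, 2310, 410, 95]

smallest = min(balances)   # minimum value in the data
largest = max(balances)    # maximum value in the data

print(f"minimum balance: {smallest}")
print(f"maximum balance: {largest}")
```

A quick look at these two numbers can already flag surprises, such as a zero balance or an unusually large one, before any further analysis.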
Summary Statistics
• Suppose you have data on credit card users
• The size of the sample is denoted by $n$, so $n$ =
number of observations
• The value of a particular observation is denoted in
the following way: for the 1st person in the sample,
the amount of the person’s credit card balance is
$x_1$, for the 2nd person in the sample, the amount of
the person’s credit card balance is $x_2$, for the 3rd
person, $x_3$, and so on…
Summary Statistics
• Up to the last observation, $x_n$
• The Sample Mean or Average is denoted by $\bar{x}$
• Therefore, $\bar{x} = \frac{1}{n}\left(x_1 + x_2 + x_3 + \dots + x_n\right)$
• To save space on the summation, the following
notation is used: $x_1 + x_2 + x_3 + \dots + x_n = \sum_{i=1}^{n} x_i$
• Therefore, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Summary Statistics
• Remember that since the Sample Mean depends
on the sample, it will be different for different
samples
• However, the Population Mean or Average does
NOT depend on the sample since by definition the
population covers everything
• The Population Mean or Average is denoted by µ
• It follows then that the Sample Mean has a
Sampling Distribution and will be closer to or
further away from the Population Mean depending
on the sample
Summary Statistics
• Sample Median – the value in the data such that
half the data has a larger value and half the data
has a lower value
• Depending on the data, it is typically the case that
the Sample Mean ≠ the Sample Median
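• For a sample where the Sample Mean > the Sample
Median, the data are typically described as
Right-skewed (a long tail of large values)
• For a sample where the Sample Mean < the Sample
Median, the data are typically described as
Left-skewed (a long tail of small values)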
Right-Skewed Distribution
Left-Skewed Distribution
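The comparison of Sample Mean and Sample Median can be sketched as a small check. The function name `skew_direction` and both data sets are hypothetical, and comparing the mean to the median is only the rule of thumb stated above, not a formal measure of skewness.

```python
from statistics import mean, median

def skew_direction(data):
    """Classify data by comparing the sample mean to the sample median."""
    sample_mean = mean(data)
    sample_median = median(data)
    if sample_mean > sample_median:
        return "right-skewed"
    if sample_mean < sample_median:
        return "left-skewed"
    return "no skew indicated"

# Hypothetical data: one very large value pulls the mean above the median.
incomes = [30, 35, 40, 45, 50, 55, 250]
# Hypothetical data: one very small value pulls the mean below the median.
scores = [20, 70, 75, 80, 85, 90, 95]

print("incomes:", skew_direction(incomes))
print("scores: ", skew_direction(scores))
```

In both made-up data sets a single extreme value moves the mean while the median stays at the middle observation, which is why the two statistics differ.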