Uploaded by Meghan Franco

BA216 - introduction to statistics - final exam study guide

BA216 – MFranco – Study guide, final exam
BA216 – Review for Final Exam
Contents
Exam 1 Material
  How to describe variables, datasets, and other basic descriptive concepts
  Sampling & Study design (basic)
  Univariate summary statistics – mean, median, IQR, standard deviation
  Bivariate summary and descriptive statistics
  The normal distribution, including percentiles and the z-score
  R coding concepts
Exam 2 Material
  Foundations for inference: Central Limit Theorem (CLT), standard error, and sampling distributions
    Central limit theorem (CLT)
    Sampling distribution
    Standard error
    Experimental design & Systematic Bias
  Constructing confidence intervals (CI)
  Hypothesis testing with the z and t-statistics (i.e. quantifying strength with p-value), for a single 𝑥̅ or two 𝑥̅
  Comparing three or more groups with ANOVA
Post-exam 2, regression
  The basics
  Ordinary Least Squares (OLS) linear regression
Exam 1 Material
How to describe variables, datasets, and other basic descriptive concepts:
Study questions:
1. Why do we care whether a variable is categorical or numerical?
2. What are the subtypes of categorical vs. numerical, and how do you identify them?
3. How do we visualize frequencies in numerical data? How do we visualize counts of categorical data? (hint:
think about the difference between histograms vs. bar plots/charts)
4. You should be able to identify, interpret, and use the following data visualizations: frequency plots, relative
frequency plots, bar charts, histograms, box-and-whisker plots, scatterplots, time series
a. What does it mean that a histogram “bins” observations?
b. How can you tell the difference between a bar chart and a histogram?
c. What do the different parts of a box-and-whisker plot represent? How can you interpret this kind
of graph?
5. What is a data matrix?
6. What are variables vs. observations? How can you tell the difference between an observation and a variable
in a data matrix?
7. Make sure you can describe the shape and characteristics of data distributions with the concepts: central
tendency, dispersion, outliers, modality/shape, and skewness.
• Dependent variable:
o Usually graphed on the y-axis.
o It may be called an outcome variable, criterion variable, or endogenous variable. In this class, we
will use dependent and outcome variable.
o There is only ever one outcome/dependent variable.
• Independent variable:
o Usually graphed on the x-axis, and is the variable thought to affect the dependent/outcome
variable (like how growing taller usually contributes to increased weight throughout puberty).
o Can be called a predictor variable or an exogenous variable. In this class, we will use the terms
predictor and independent variable.
o There may be one, or more than one, predictor/independent variable.
Sampling & Study design (basic)
Study questions:
• What is the difference between anecdotes and data?
• Why do we need to sample? Why is working with the full population difficult?
• What is the difference between a sample and a population?
• What are the different kinds of probability vs. non-probability sampling?
• What does it mean to have a non-representative sample? Why does it muddy your results?
• What is the definition of random error? When do we have to deal with random error? Is it a sign that
something is wrong with the study design?
• What is the definition of statistical systematic bias? Where are the places that statistical systematic bias might
sneak into a study design?
• What is the difference between random error and statistical systematic bias?
• What is the difference between an observational and an experimental study?
o When might each be used? What are the benefits of each? The drawbacks?
o What kind of study is best for exploring the causal relationship between two (or more) variables?
o What kind of study can only show correlation/association?
Univariate summary statistics – mean, median, IQR, standard deviation
Study questions:
• What are the measures of central tendency? Why do we care about this? Why is it important that we
describe a data distribution with regards to central tendency?
o Make sure you can calculate mean, median, and mode. Make sure you can eyeball where these
would be on a histogram, bar chart, and dot plot.
o Understand the different parts of the formula for mean. What is n?
o What is the statistical shorthand for a population mean versus a sample mean?
• What are the measures of dispersion? Why do we care about dispersion? Why is it important that we
describe a data distribution with regards to dispersion?
o Make sure you can calculate and understand variance, standard deviation, Q1, Q3, and IQR.
o What is the statistical shorthand for a population standard deviation versus a sample standard
deviation?
• What is the difference between standard deviation and IQR?
o How does each relate to the mean versus the median? Which are conceptually paired with the other?
Does IQR go with the median or mean? Does standard deviation go with the median or the mean?
• What is the difference between variance and central tendency?
o How do distributions look if they have the same mean but different standard deviations?
o How do distributions look if they have the same standard deviation but different means?
• How do these tools function in symmetrical data? What about skewed data?
o Which ones are robust statistics, and which ones are susceptible to outliers?
o What happens to the mean and median in a right-skewed vs. a left-skewed dataset?
o With very skewed data, which tools are best for the job of summary statistics?
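To practice the summary statistics above, here is a minimal sketch in R. The scores are made-up numbers for illustration only, and the object names (scores, scores_outlier) are hypothetical. It computes the mean, median, standard deviation, and IQR, then adds one extreme value so you can see which statistics move a lot and which barely move.

```r
# Hypothetical sample of 9 exam scores (made-up numbers, not course data)
scores <- c(62, 70, 71, 75, 78, 80, 84, 88, 93)

mean(scores)    # sample mean (x-bar)
median(scores)  # middle value of the sorted data
sd(scores)      # sample standard deviation (s)
IQR(scores)     # Q3 minus Q1

# Add one extreme observation and compare how each statistic reacts
scores_outlier <- c(scores, 250)

mean_shift   <- mean(scores_outlier) - mean(scores)      # how far the mean moved
median_shift <- median(scores_outlier) - median(scores)  # how far the median moved
sd_ratio     <- sd(scores_outlier) / sd(scores)          # relative change in sd
iqr_ratio    <- IQR(scores_outlier) / IQR(scores)        # relative change in IQR
```

Comparing mean_shift with median_shift, and sd_ratio with iqr_ratio, is one way to check your answer to the "robust statistics" question.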
Bivariate summary and descriptive statistics
Study questions:
1. What is the difference between bivariate and univariate statistics?
2. Why do we consider the correlation coefficient to be a summary statistic for bivariate analyses?
3. What is the potential range of the correlation coefficient? How do you interpret the number?
4. When can the correlation coefficient be used, and when is it NOT the right tool for the job?
5. What do the dots in a scatterplot represent? How are the dots linked to the underlying data matrix?
6. Make sure you can eyeball a trendline over a cloud of data points.
7. Make sure you can describe a bivariate relationship using form, direction, strength, and outliers, and that you
can put it in the context of the two variables you are describing.
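For the correlation questions above, a short sketch using cor() may help. It assumes the mtcars dataset that ships with base R (not a dataset from this class), where wt is car weight and mpg is miles per gallon.

```r
# mtcars is built into R; each row (observation) is one car
r <- cor(mtcars$wt, mtcars$mpg)  # correlation between weight and fuel economy
r                                # the sign gives direction, the size gives strength

# The correlation coefficient always falls between -1 and 1
in_range <- r >= -1 && r <= 1
```

Each dot in a scatterplot of these two variables would be one row of the data matrix, placed at that car's (wt, mpg) pair.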
The normal distribution, including percentiles and the z-score
Study questions
1. What is the normal distribution? What does it look like? Can different normal distributions have different
standard deviations and means?
a. What is the relationship between a histogram and the smoothed probability curve we usually see as
the normal distribution?
b. What are the parameters for the normal distribution? How do you write the normal distribution
using statistical shorthand?
c. How do you draw the normal distribution if you are given only a sample mean and standard
deviation? Where do you put the ‘tick’ marks along the x-axis based on the mean and standard
deviation?
d. What is a z-score? Why do we use a z-score? When might it be useful?
e. What is the relationship between the concepts of percentiles and percentages?
f. What is the standard normal distribution? How does that relate to the concept of a z-score?
g. What is the 68-95-99.7 Rule (the Empirical Rule)? What does it mean? When do we say an
observation is unusual, very unusual, or a suspected outlier, based on the 68-95-99.7 Rule?
2. How do you calculate a percentage/percentile/proportion from a measurement/cutoff value/observation?
a. What is the relationship between a proportion (that you can find using pnorm()) and how you report
that as a percentage (i.e. how do you do the mathematical conversion between the two)?
b. How do you know if you are looking for the ‘area under the normal curve’ to the left/below the
value? Or, to the right/above? How does that change what you do in R?
c. How do you find the percent of observations between two values?
3. How do you calculate a measurement/cutoff value/observation from a percentage/percentile/proportion?
a. How do you know if you are looking for the ‘area under the normal curve’ to the left/below the
percentile? Or, to the right/above? How does that change what you do in R?
b. How do you find the values that fall between two percentiles?
4. What is the difference between pnorm() and qnorm()? When do you use each?
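The questions above can be practiced with a small sketch like the one below. The mean of 100 and standard deviation of 15 are illustrative assumptions, not values from class. It shows a z-score, the area below, above, and between values with pnorm(), and going back from a percentile to a cutoff with qnorm().

```r
# Hypothetical normal distribution (illustrative values only)
mu    <- 100   # mean
sigma <- 15    # standard deviation
x     <- 130   # an observation of interest

# z-score: how many standard deviations x sits above (+) or below (-) the mean
z <- (x - mu) / sigma

# pnorm(): value -> proportion (area under the curve to the LEFT by default)
below   <- pnorm(x, mean = mu, sd = sigma)       # proportion below x
above   <- 1 - pnorm(x, mean = mu, sd = sigma)   # proportion above x
between <- pnorm(115, mean = mu, sd = sigma) -
           pnorm(85,  mean = mu, sd = sigma)     # proportion between 85 and 115

below * 100   # convert a proportion to a percentage

# qnorm(): proportion -> value (the cutoff at a given percentile)
cutoff_90 <- qnorm(0.90, mean = mu, sd = sigma)  # value at the 90th percentile
```

Here 85 and 115 are one standard deviation on either side of the mean, so between is a handy check against the 68-95-99.7 Rule.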
R coding concepts from exam 1
Concepts:
• The difference between “Base R” and the IDE we use called “RStudio”
• The difference between the four windows in R (R script, the console, the environment, and the misc
window)
o What is an R Script, and why to use them vs. working directly in the console
o Opening help and/or data documentation with “?” or using the help window
• Using R like a calculator, including PEMDAS
• Creating variables/objects, and assigning numbers (and lists of numbers) into objects with <-
• Commenting out the “text for humans” with #
• Working with a data matrix in R
o Using the '$' operator
• The basics of ggplot graphing (geom_bar, geom_histogram, geom_point, geom_line)
 Calculating z-scores
 Calculating percentiles/percentages from measurements/values/observations
 Converting back to measurements/values/observations from percentiles/percentages
After reading and playing around with the demo code in your own RStudio, you should understand:
1. The difference between an R script, the console, and the environment inside of RStudio. What is the
kitchen analogy?
2. How to store ("assign") numbers or a list of numbers inside of an 'object'
3. What functions look like and what they do in R. How to run a function on a list of numbers stored in an
object. How to calculate a mean/median/sd.
4. The difference between installing a package and activating a package (we've learned about the ggplot2 and
openintro packages so far)
5. How to tell R to run a function on a particular column inside a data matrix using $
6. The basics of what a ggplot() function looks like, and what it creates. Where in the ggplot() code do you
usually tell R which dataset to use, which variables to visualize, and what kind of plot to create? What are
the different geom_ functions for histograms, bar charts, line graphs, and scatterplots?
Functions:
• library()
• mean(), median(), sd(), summary(), names(), dim(), head(), tail()
• ggplot()
• pnorm(), qnorm()
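As a quick refresher on the concepts and functions listed above, here is a minimal sketch. The heights are made-up numbers, mtcars is a dataset built into R (not one from class), and the plot is only built if the ggplot2 package happens to be installed.

```r
# Assign a list (vector) of numbers into an object with <-
heights <- c(150, 162, 171, 168, 180)  # made-up values for illustration

# Run functions on the object
avg_height <- mean(heights)
median(heights)
sd(heights)

# Look at a data matrix (data frame): variables are columns, observations are rows
names(mtcars)  # variable names
dim(mtcars)    # number of rows, number of columns
head(mtcars)   # first six observations

# Use $ to run a function on one particular column
mean_mpg <- mean(mtcars$mpg)

# ggplot: dataset first, variables inside aes(), plot type chosen by the geom_ layer
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)  # activate the (already installed) package
  p <- ggplot(data = mtcars, aes(x = mpg)) +
    geom_histogram(bins = 10)
  # typing p in the console would display the histogram
}
```

Swapping geom_histogram() for geom_bar(), geom_point(), or geom_line() (with suitable variables in aes()) changes the kind of plot.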
Exam 2 Material
Foundations for inference: Central Limit Theorem (CLT), standard error, and
sampling distributions
What is the difference between descriptive and inferential statistics?
• How do both relate to samples vs. populations?
• What parts of the course have been about descriptive stats, and which parts have focused on inferential?
What does it mean to “quantify uncertainty”? Why do we need to do this with statistics? How have we learned to
do it with both confidence intervals and hypothesis testing?
How are categorical vs. numerical data related to proportions vs. means?
Point estimates:
• What is the definition of a point estimate?
• What are the different types of point estimates we’ve learned about? What is their statistical shorthand
notation?
• How do point estimates relate to population parameters when conducting inferential statistics?
What are the statistical notation symbols for the following:
• Sample mean & population mean
• Sample proportion & population proportion
• Sample standard deviation & population standard deviation
Central limit theorem (CLT)
What are the conditions for the CLT to hold for:
• Single proportion (p-hat)
• Single group average (x-bar)
• Two paired group averages (x-bar_diff)
• Two unpaired group averages (x-bar_1 – x-bar_2)
• Three or more groups (ANOVA)
What should you do if even one of the CLT conditions fails?
If you need more of a hint for what should be included in the CLT conditions, think about:
• How do the conditions for the central limit theorem vary when you’re working with proportions vs. means?
• Why is independence one of the two key conditions for the CLT, i.e. why is it so important for avoiding bias,
and why does a random sample help ensure independence?
• What are the characteristics of a normal distribution, and how do these appear in the CLT conditions? What
do we look for?
• How do you spot outliers in (1) a histogram, (2) a scatterplot, or (3) a box-and-whisker plot? How do
you spot skew?
• What is the rule of thumb to check for normality (CLT#2) for numerical data?
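One way to rehearse the second CLT condition in R is sketched below. The cutoffs are assumptions: it uses the common textbook success-failure check (at least 10 expected successes and 10 expected failures) for a proportion, and a sample size of at least 30 for a mean. Use the thresholds from lecture if they differ. All numbers are made up, and independence still has to be judged from the study design, not from code.

```r
# Proportion: success-failure check (the cutoff of 10 is an assumed rule of thumb)
n     <- 200    # hypothetical sample size
p_hat <- 0.12   # hypothetical sample proportion
successes <- n * p_hat
failures  <- n * (1 - p_hat)
sf_ok <- successes >= 10 && failures >= 10

# Mean: sample-size check (the cutoff of 30 is an assumed rule of thumb)
set.seed(1)
x <- rexp(40)               # made-up, right-skewed sample
size_ok <- length(x) >= 30

# Points a box-and-whisker plot would draw beyond the whiskers
flagged <- boxplot.stats(x)$out
# hist(x) and boxplot(x) would show the skew and any flagged points visually
```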
Sampling distribution
What is a sampling distribution?
Why don’t we usually see a sampling distribution? What is stopping us from taking a few hundred or a few thousand
separate samples of, say, 100 people each, in order to map out the full sampling distribution?
What is a “simulation” in R? How does this help us see and understand what a sampling distribution is?
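A simulation along the following lines can make the sampling distribution visible. This is a sketch under stated assumptions: the "population" is made-up, skewed data generated in R, and the numbers of samples (2,000) and people per sample (100) are arbitrary choices.

```r
set.seed(216)

# Made-up, right-skewed population (in real life we never get to see this)
population <- rexp(100000, rate = 1 / 50)

# Take 2,000 separate samples of 100 each and keep every sample mean
sample_means <- replicate(2000, mean(sample(population, size = 100)))

# The collection of sample means is a simulated sampling distribution
center_sim <- mean(sample_means)          # lands close to the population mean
spread_sim <- sd(sample_means)            # this spread is the standard error
spread_formula <- sd(population) / sqrt(100)  # sd / sqrt(n), for comparison

# hist(sample_means) would look roughly bell-shaped even though the population is skewed
```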
Standard error
How are standard deviation, standard error, and a sampling distribution related?
How are standard error and sample size related?
Does standard error refer to a sample, or just to a population? Why or why not?
What happens to the standard error as the sample size increases?
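The link between standard error and sample size can be checked with a few lines of arithmetic. The standard deviation of 12 and the proportion of 0.3 are hypothetical values, and the formulas are the usual standard errors for a mean and for a proportion.

```r
# Standard error of a mean: s / sqrt(n)
s <- 12                 # hypothetical sample standard deviation
n <- c(25, 100, 400)    # three sample sizes, each 4 times the previous one
se_mean <- s / sqrt(n)
se_mean                 # 2.4 1.2 0.6

# Standard error of a proportion: sqrt(p_hat * (1 - p_hat) / n)
p_hat <- 0.3            # hypothetical sample proportion
se_prop <- sqrt(p_hat * (1 - p_hat) / n)
```

Quadrupling the sample size cuts the standard error in half, because n sits under a square root.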
Experimental design & Systematic Bias
What is the difference between random error and systematic bias in a study? Which do we try to avoid, and which is
(to a certain extent) inevitable? How do scientists and researchers deal with error vs. bias?
• What are some examples of unintentional biasing of study results?
• A confidence interval shows the likely “wiggle room” around a point estimate, where the (true, hidden)
population parameter is likely to fall – does this margin of error account for random error or systematic
bias? Which one of these two invalidates a study?
What is the difference between random sampling and random assignment? How are these both beneficial for
ensuring independence for the CLT? How do they differ?
What is the difference between an experiment and a survey?
What are the four principles of randomized experiments?
What is the difference between treatment and control conditions? Can you think of examples of treatment and
control conditions beyond just medical trials?
What is a blinded experiment? What is a double-blind experiment? What is a placebo? What is the placebo effect?
They appear on the surface to be very similar processes, but what is the difference between stratified random
sampling and blocking?
• At what stages of the research process are they used?
• Why are they useful?
• When might each be useful in the context of a study?
Constructing confidence intervals (CI)
What are the steps for constructing a CI for:
• Single proportion (p-hat)
• Single group average (x-bar)
• Two paired group averages (x-bar_diff)
• Two unpaired group averages (x-bar_1 – x-bar_2)
Note: I’d highly recommend mapping each of these to the four steps (prepare/check/calculate/conclude) as
well as their conditions for the CLT, and the R code you’ll need
Confidence Interval basics:
• What does it mean to “fish with a net” vs. “fish with a spear”?
• Do confidence intervals help capture and quantify the random error or the systematic bias present in a sample?
• What is the relationship between a confidence level and a margin of error?
• When do you use pnorm() and qnorm() vs. pt() and qt()? When do you use the three z* I provided you in
class?
• What is the proper phrasing for reporting on the outcome of a confidence interval, and why is it important?
• When do we use p0 vs. p-hat for checking the conditions when building a confidence interval?
What is standard error?
• How is it related to the concepts of a sampling distribution and standard deviation?
• What processes and tests have we used it in so far?
• Can we calculate standard error for means, proportions, or both?
Reminder – make sure you understand the difference between z* and t*
• What does the book call these values? (they have a special name in the formulae)
• In what cases do we use each?
• How are they similar, how are they different?
• What R functions do you use for each?
What are the three common confidence levels for confidence intervals?
• How does the process change for finding each of these?
• Which is the “widest net” and which is the narrowest?
• How do the conclusions change for each of these? (think about the amount of uncertainty, the size of the
“net,” the level of the standard error, etc.)
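The "calculate" step for a confidence interval can be sketched as below. Everything here is illustrative: the data and the proportion are made up, the sample is small and assumed to be roughly normal, and the levels 90%, 95%, and 99% are assumed to be the three common ones. The pattern is point estimate plus or minus (critical value times standard error), with qt() giving t* for a mean and qnorm() giving z* for a proportion.

```r
# 95% CI for a single mean, using t*
x  <- c(12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 12.5)  # made-up sample
n  <- length(x)
se <- sd(x) / sqrt(n)              # standard error of x-bar
t_star  <- qt(0.975, df = n - 1)   # 2.5% in each tail leaves 95% in the middle
ci_mean <- mean(x) + c(-1, 1) * t_star * se

# 95% CI for a single proportion, using z*
p_hat <- 0.40                      # hypothetical sample proportion
n_p   <- 500                       # hypothetical sample size
se_p  <- sqrt(p_hat * (1 - p_hat) / n_p)
z_star  <- qnorm(0.975)
ci_prop <- p_hat + c(-1, 1) * z_star * se_p

# z* for 90%, 95%, and 99% confidence (assumed to be the three common levels)
z_levels <- qnorm(c(0.95, 0.975, 0.995))
```

A higher confidence level gives a larger critical value, which widens the "net."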
Hypothesis testing with the z and t-statistics (i.e. quantifying strength with p-value), for a single group 𝑝̂, single group 𝑥̅, or 2-group 𝑥̅
What are the alternative and null hypotheses?
• How do we phrase them?
• What is the statistical shorthand for each?
• What are the common null and alternative hypotheses for 2-group (both paired and unpaired) hypothesis
tests? For ANOVA?
• Why does the null hypothesis need to be set to an actual numerical value?
• Do hypothesis tests use population notation or sample/point estimate notation?
Overarching comparisons which are absolutely key to understand:
o Differences when hypothesis testing ideas with sample proportions vs. sample means
o Differences when hypothesis testing ideas with one sample vs. two samples
o One-tailed vs. two-tailed hypothesis test (and how this shows up in the R code)
o When do we use p0 vs. p-hat for checking the conditions in a hypothesis test?
o At what stages in the computation process you should use population vs. sample parameters
o NOTE: I would recommend mapping the four steps (Prepare/Check/Calculate/Conclude) to each
of the different use cases above
Hypothesis testing basics for:
• Single proportion (p-hat)
• Single group average (x-bar)
• Two paired group averages (x-bar_diff)
• Two unpaired group averages (x-bar_1 – x-bar_2)
• (And an ANOVA is technically a hypothesis test, but there’s a separate ANOVA section below)
• What are the similarities and differences when working with two samples, i.e. with paired data vs. the
difference in two means?
• When do you use pnorm() and qnorm() vs. pt() and qt()?
• What is the proper phrasing for reporting on the outcome of a hypothesis test, and why is it important?
Significance levels:
• What is the most common significance level? Is this a hard and fast rule, or more of a guideline?
• If we’re more worried about a false positive (i.e. moving away from the status quo), do you think we would
set the significance level higher or lower than 0.05? What about if you’re worried about a false negative
(failing to move away from status quo when there is a real difference)?
The p-value, i.e. the strength-of-evidence score
• Do you find a p-value for: single-xbar t-test, 2 x-bar t-test, an ANOVA, and/or creating a confidence
interval? (hint: it’s not a yes for all of those options)
• What is a p-value? What is the difference between the size of the effect/size of the difference, and the
p-value?
• What are common misunderstandings about the p-value?
• Is the p-value going to be higher or lower (according to the research discussed in class) than the true error
rate?
• What is the difference between statistical and practical significance?
• Why are you more likely to detect a statistically significant difference in two groups that are only slightly
different, if you have a really big sample size?
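To connect the test statistic, the p-value, and the one-tailed vs. two-tailed choice, here is a minimal sketch of a single-group t-test done by hand and then compared with R's built-in t.test(). The data and the null value of 10 are made up for illustration, and the small sample is assumed to be roughly normal.

```r
x   <- c(12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 12.5)  # made-up sample
mu0 <- 10                      # hypothetical null value (H0: mu = 10)
n   <- length(x)

# Test statistic: (point estimate - null value) / standard error
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-tailed p-value (HA: mu != 10): area in BOTH tails
p_two <- 2 * pt(-abs(t_stat), df = n - 1)

# One-tailed p-value (HA: mu > 10): area in the upper tail only
p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# The built-in function should report the same statistic and two-tailed p-value
check <- t.test(x, mu = mu0)
```

The p-value measures strength of evidence against the null. The size of the difference itself is mean(x) - mu0, which is a separate number.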
Comparing 3+ group 𝑥̅ with ANOVA
ANOVA concepts:
• What is a t-test vs. an ANOVA?
• How do you (theoretically and on an ANOVA output table) find the F-statistic in an ANOVA? What is
MSE vs. MSG, and what kind of variability do they represent?
• What is a “pairwise comparison” compared to an ANOVA test?
ANOVA process:
• How do the conditions vary for an ANOVA compared to the other types of hypothesis testing?
• What are the steps for an ANOVA test?
• How do you write the hypotheses for an ANOVA, and do you use the notation for a sample or population
mean?
• How do you determine if there is strong evidence (statistical significance) for real multi-group differences in
an ANOVA? How do we properly report on these results? How do we phrase it?
• When do you move on to pairwise comparisons? Will these pairwise t-test comparisons use the paired or
unpaired 2-group method? What are the other changes you’ll do to this final process to guard against type 1
error?
• When do you use the Bonferroni Corrected significance level (alpha)? Before or after running an ANOVA?
With the ANOVA or with pairwise t-test comparisons? With statistical significance or no statistical
significance? How do you properly report on these results? How do we phrase it?
Avoiding inflated type 1 error:
• What are you trying to avoid by first using an ANOVA to compare three groups, rather than three different
pairwise t-tests, or a pairwise t-test of the most different groups?
• What are you trying to avoid by using a Bonferroni corrected significance level (alpha) to compare three
groups, rather than three different t-tests?
• What “inflates” if you just pick the most “interesting” groups to compare with a single t-test?
• Define data fishing or data snooping, and briefly outline some ways to avoid it.
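The ANOVA workflow above can be sketched in R as follows. This is an illustration under assumptions: it uses PlantGrowth, a three-group dataset built into R (not a dataset from class), a significance level of 0.05, and the Bonferroni correction in the form alpha divided by the number of pairwise comparisons.

```r
# One-way ANOVA: does mean plant weight differ across the three groups?
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]         # the ANOVA output table

msg <- tab[["Mean Sq"]][1]       # mean square between groups (MSG)
mse <- tab[["Mean Sq"]][2]       # mean square error within groups (MSE)
f_stat <- msg / mse              # F-statistic = MSG / MSE
p_val  <- tab[["Pr(>F)"]][1]     # p-value for the overall test

# Only if the ANOVA shows strong evidence: move on to pairwise comparisons
k <- nlevels(PlantGrowth$group)      # number of groups
n_comparisons <- k * (k - 1) / 2     # number of possible pairs
alpha_bonf <- 0.05 / n_comparisons   # Bonferroni corrected significance level

# Unadjusted pairwise p-values, to be compared against alpha_bonf
pw <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                      p.adjust.method = "none")
pw$p.value
```

An equivalent route is to set p.adjust.method = "bonferroni" and compare the adjusted p-values to the original 0.05.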
Post-exam 2, regression
The basics:
• Make sure you can describe scatterplots using the methods we discussed for exam 1.
• Why is regression a kind of bivariate analysis? What is it trying to do?
• What is an OLS linear regression model?
• How is regression similar to, and different from, calculating a correlation coefficient R? When can they be
used? What are the conditions for each, and how do they overlap/differ?
• How can you tell positive vs. negative association of two variables using a scatterplot, the sign on the
correlation coefficient, and the sign on the beta-one?
• What are the other names for the independent variable? What are the other names for the dependent
variable?
• What is a residual?
o How do you calculate it? How do you recognize the “observed” value? How do you calculate the
predicted value? How do you use these two to calculate the residual?
o Does a positive residual suggest that the observation was over- or under-predicted by the linear
regression line? What about a negative residual?
• Can a regression analysis have more than one outcome/dependent variable? Can a regression analysis have
more than one predictor/independent variable?
• What is the difference between simple linear regression and multiple linear regression? We didn’t have
enough time to cover multiple linear regression in any mathematical detail, but we have touched on the
concept several times, and why it is powerful. You should be able to write at least a sentence describing the
difference.
Simple Ordinary Least Squares (OLS) linear regression
• Why do we try to minimize residuals in regression analysis? How is OLS regression related to the concept of
SSR? How does that mathematically tell us where to position the regression trend line?
• What is the y-intercept? Where is it located in the equation? What does it mean? How do you interpret it
in context?
o When is the y-intercept mathematically necessary? When is it practically useful? Is it always both of
these, or always one of these, or can it vary?
• What is the slope coefficient for the predictor? Where is it located in the equation? What does it mean?
o (important) How do you interpret the number verbally, in the context of the problem?
• How can you find the R-squared, the beta-naught, the beta-one, and the p-value from BOTH the lm()
output in R, and from a cleaned up table?
• What is the R-squared? How do you interpret it in context?
• What are the 2 hypotheses for a linear regression analysis?
o What is the p-value?
o Do we look at the p-value for the beta-naught, the beta-one, or both?
o How many p-values do we examine in simple linear regression?
• What are the dangers of over-extrapolation far away from the data? How does this relate to the idea that,
sometimes, the y-intercept isn’t practically useful but is always mathematically necessary? (we talked about
this at the end of lecture 21)
• What is a residual plot? How can you recognize that vs. a standard bivariate scatterplot? What is a residual
histogram?
• What are the three conditions for linear regression? How can you recognize them in plots? Which ones do
you check before running lm(), and which do you check afterwards? Which ones use the residual plot,
and which one uses the residual histogram?
• What is an indicator variable? When is it used?
• How do you work with and interpret categorical variables?
• What is the difference between variables and predictors?
• What is a reference level?
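To practice reading lm() output and computing a residual, here is a minimal sketch. It assumes the mtcars dataset built into R (not a dataset from class), with weight (wt) as the predictor and miles per gallon (mpg) as the outcome.

```r
# Simple linear regression: outcome ~ predictor
fit <- lm(mpg ~ wt, data = mtcars)

b0 <- coef(fit)[["(Intercept)"]]   # beta-naught (y-intercept)
b1 <- coef(fit)[["wt"]]            # beta-one (slope for the predictor)

# Residual for the first car: observed minus predicted
observed  <- mtcars$mpg[1]
predicted <- b0 + b1 * mtcars$wt[1]
residual  <- observed - predicted  # positive means the line under-predicted

# Where the numbers live in the output
s <- summary(fit)
s$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared      # R-squared

# In simple linear regression, R-squared is the correlation coefficient squared
r <- cor(mtcars$wt, mtcars$mpg)

# plot(fitted(fit), resid(fit)) would draw a residual plot; hist(resid(fit)) a residual histogram
```

The sign of b1 matches the sign of r, which is one of the three ways listed above to tell positive from negative association.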
Multiple OLS Regression
• What is omitted variable bias?
• Why do we use multiple linear regression, and how can that help us avoid omitted variable bias?
• What are the examples we discussed in class? What are others you could think of?
• What are the differences between the equations for multiple and simple linear regression? Make sure you
can write out both a simple linear regression generic equation, and a 3-predictor multiple linear regression
generic equation.
• How many predictors can there be in multiple linear regression? How many outcome variables can there
be?
• How do you (verbally) interpret the slope coefficient (beta) of each variable?
• How many p-values will you interpret when hypothesis testing with multiple linear regression with, say, four
predictors?
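For a feel of what multiple regression output looks like, here is a short sketch with three predictors. The choice of mtcars (built into R) and of the predictors wt, hp, and qsec is arbitrary and only for illustration.

```r
# Multiple linear regression: one outcome, three predictors joined with +
fit3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

coefs <- summary(fit3)$coefficients
coefs            # one row for the intercept, plus one row per predictor
rownames(coefs)  # "(Intercept)" "wt" "hp" "qsec"

# Each slope row has its own estimate and its own p-value
slope_p_values <- coefs[-1, "Pr(>|t|)"]
```

Each slope is interpreted for its own predictor while the other predictors in the model are held constant.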


