BA216 – MFranco – Study guide, final exam

BA216 – Review for Final Exam

Contents
Exam 1 Material
    How to describe variables, datasets, and other basic descriptive concepts
    Sampling & Study design (basic)
    Univariate summary statistics – mean, median, IQR, standard deviation
    Bivariate summary and descriptive statistics
    The normal distribution, including percentiles and the z-score
    R coding concepts
Exam 2 Material
    Foundations for inference: Central Limit Theorem (CLT), standard error, and sampling distributions
    Central limit theorem (CLT)
    Sampling distribution
    Standard error
    Experimental design & Systematic Bias
    Constructing confidence intervals (CI)
    Hypothesis testing with the z and t-statistics (i.e. quantifying strength with p-value), for a single 𝑥̅ or two 𝑥̅
    Comparing three or more groups with ANOVA
Post-exam 2, regression
    The basics
    Ordinary Least Squares (OLS) linear regression

Exam 1 Material

How to describe variables, datasets, and other basic descriptive concepts:

Study questions:
1. Why do we care whether a variable is categorical or numerical?
2. What are the subtypes of categorical vs. numerical, and how do you identify them?
3. How do we visualize frequencies in numerical data? How do we visualize counts of categorical data? (hint: think about the difference between histograms vs. bar plots/charts)
4. You should be able to identify, interpret, and use the following data visualizations: frequency plots, relative frequency plots, bar charts, histograms, box-and-whisker plots, scatterplots, time series
   a. What does it mean that a histogram “bins” observations?
   b. How can you tell the difference between a bar chart and a histogram?
   c. What do the different parts of a box-and-whisker plot represent? How can you interpret this kind of graph?
5. What is a data matrix?
6. What are variables vs. observations? How can you tell the difference between an observation and a variable in a data matrix?
7. Make sure you can describe the shape and characteristics of data distributions with the concepts: central tendency, dispersion, outliers, modality/shape, and skewness.

Dependent variable:
    o Usually graphed on the y-axis.
    o It may be called an outcome variable, criterion variable, or endogenous variable. In this class, we will use the terms dependent and outcome variable.
    o There is only ever one outcome/dependent variable.

Independent variable:
    o Usually graphed on the x-axis, and is the variable thought to affect the dependent/outcome variable (like how growing taller usually contributes to increased weight throughout puberty).
    o Can be called a predictor variable or an exogenous variable. In this class, we will use the terms predictor and independent variable.
    o There may be one, or more than one, predictor/independent variable.

Sampling & Study design (basic)

Study questions:
What is the difference between anecdotes and data?
Why do we need to sample? Why is working with the full population difficult?
What is the difference between a sample and a population?
What are the different kinds of probability vs. non-probability sampling?
What does it mean to have a non-representative sample? Why does it muddy your results?
What is the definition of random error? When do we have to deal with random error? Is it a sign that something is wrong with the study design?
What is the definition of statistical systematic bias? Where are the places that statistical systematic bias might sneak into a study design?
What is the difference between random error and statistical systematic bias?
What is the difference between an observational vs. an experimental study?
    o When might each be used? What are the benefits of each? The drawbacks?
    o What kind of study is best for exploring the causal relationship between two (or more) variables?
    o What kind of study can only show correlation/association?

Univariate summary statistics – mean, median, IQR, standard deviation

Study questions:
What are the measures of central tendency? Why do we care about this? Why is it important that we describe a data distribution with regard to central tendency?
    o Make sure you can calculate mean, median, and mode. Make sure you can eyeball where these would be on a histogram, bar chart, and dot plot.
    o Understand the different parts of the formula for the mean. What is n?
    o What is the statistical shorthand for a population mean versus a sample mean?
What are the measures of dispersion? Why do we care about dispersion? Why is it important that we describe a data distribution with regard to dispersion?
    o Make sure you can calculate and understand variance, standard deviation, Q1, Q3, and IQR.
    o What is the statistical shorthand for a population standard deviation versus a sample standard deviation?
What is the difference between standard deviation and IQR?
    o How does each relate to the mean versus the median? Which are conceptually paired with the other? Does IQR go with the median or the mean? Does standard deviation go with the median or the mean?
What is the difference between variance and central tendency?
    o How do distributions look if they have the same mean but different standard deviations?
    o How do distributions look if they have the same standard deviation but different means?
How do these tools function in symmetrical data? What about skewed data?
    o Which ones are robust statistics, and which ones are susceptible to outliers?
    o What happens to the mean and median with a right-skewed vs. a left-skewed dataset?
    o With very skewed data, which tools are best for the job of summary statistics?
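To practice the univariate questions above, here is a minimal R sketch. The vector `x` is made-up data (an assumption for illustration, not course data) with one large value, so you can see which statistics move when an outlier is present. Note that R's default quantile method can give slightly different Q1/Q3 values than a by-hand method.

```r
# Hypothetical right-skewed sample; 40 is a suspected outlier
x <- c(2, 3, 3, 4, 5, 6, 7, 40)

n <- length(x)                    # n = number of observations
x_bar <- sum(x) / n               # mean by the formula; same as mean(x)
med <- median(x)                  # middle value of the sorted data
s <- sd(x)                        # sample standard deviation (divides by n - 1)
q <- quantile(x, c(0.25, 0.75))   # Q1 and Q3 (R's default method)
iqr <- IQR(x)                     # Q3 - Q1

# Drop the outlier and compare the two pairs of statistics
x_trim <- x[x < 40]
c(mean = mean(x), mean_trim = mean(x_trim))
c(sd = sd(x), sd_trim = sd(x_trim))
c(median = median(x), median_trim = median(x_trim))
c(IQR = IQR(x), IQR_trim = IQR(x_trim))
```

Running the four comparison lines side by side is a quick way to check your answer about which statistics are robust.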

Bivariate summary and descriptive statistics

Study questions:
1. What is the difference between bivariate and univariate statistics?
2. Why do we consider the correlation coefficient to be a summary statistic for bivariate analyses?
3. What is the potential range of the correlation coefficient? How do you interpret the number?
4. When can the correlation coefficient be used, and when is it NOT the right tool for the job?
5. What do the dots in a scatterplot represent? How are the dots linked to the underlying data matrix?
6. Make sure you can eyeball a trendline over a cloud of data points.
7. Make sure you can describe a bivariate relationship using form, direction, strength, and outliers, and that you can put it in the context of the two variables you are describing.

The normal distribution, including percentiles and the z-score

Study questions:
1. What is the normal distribution? What does it look like? Can different normal distributions have different standard deviations and means?
   a. What is the relationship between a histogram and the smoothed probability curve we usually see as the normal distribution?
   b. What are the parameters for the normal distribution? How do you write the normal distribution using statistical shorthand?
   c. How do you draw the normal distribution if you are given only a sample mean and standard deviation? Where do you put the ‘tick’ marks along the x-axis based on the mean and standard deviation?
   d. What is a z-score? Why do we use a z-score? When might it be useful?
   e. What is the relationship between the concepts of percentiles and percentages?
   f. What is the standard normal distribution? How does that relate to the concept of a z-score?
   g. What is the 68-95-99.7 Rule (the Empirical Rule)? What does it mean? When do we say an observation is unusual, very unusual, or a suspected outlier, based on the 68-95-99.7 Rule?
2. How do you calculate a percentage/percentile/proportion from a measurement/cutoff value/observation?
   a. What is the relationship between a proportion (which you can find using pnorm()) and how you report it as a percentage (i.e. how do you do the mathematical conversion between the two)?
   b. How do you know if you are looking for the ‘area under the normal curve’ to the left/below the value? Or, to the right/above? How does that change what you do in R?
   c. How do you find the percent of observations between two values?
3. How do you calculate a measurement/cutoff value/observation from a percentage/percentile/proportion?
   a. How do you know if you are looking for the ‘area under the normal curve’ to the left/below the percentile? Or, to the right/above? How does that change what you do in R?
   b. How do you find the values that fall between two percentiles?
4. What is the difference between pnorm() and qnorm()? When do you use each?

R coding concepts from exam 1

Concepts:
The difference between “Base R” and the IDE we use called “R Studio”
The difference between the four windows in R Studio (R script, the console, the environment, and the misc window)
    o What an R script is, and why to use one vs. working directly in the console
    o Opening help and/or data documentation with “?” or using the help window
Using R like a calculator, including PEMDAS
Creating variables/objects, and assigning numbers (and lists of numbers) into objects with <-
Commenting out the “text for humans” with #
Working with a data matrix in R
    o Using the '$' operator
The basics of ggplot graphing (geom_bar, geom_histogram, geom_point, geom_line)
Calculating z-scores
Calculating percentiles/percentages from measurements/values/observations
Converting back to measurements/values/observations from percentiles/percentages

After reading and playing around with the demo code in your own R Studio, you should understand:
1. The difference between an R script, the console, and the environment inside of R Studio.
   What is the kitchen analogy?
2. How to store ("assign") numbers or a list of numbers inside of an 'object'
3. What functions look like and what they do in R. How you run a function on a list of numbers stored in an object. How do you calculate a mean/median/sd?
4. The difference between installing a package and activating a package (we've learned about the ggplot2 and openintro packages so far)
5. How to tell R to run a function on a particular column inside a data matrix using $
6. The basics of what a ggplot() function looks like, and what it creates. Where in the ggplot() code do you usually tell R which dataset to use, which variables to visualize, and what kind of plot to create? What are the different geom_ functions for histograms, bar charts, line graphs, and scatterplots?

Functions:
library()
mean(), median(), sd(), summary(), names(), dim(), head(), tail()
ggplot()
pnorm(), qnorm()

Exam 2 Material

Foundations for inference: Central Limit Theorem (CLT), standard error, and sampling distributions

What is the difference between descriptive and inferential statistics? How do both relate to samples vs. populations? What parts of the course have been about descriptive stats, and which parts have focused on inferential?
What does it mean to “quantify uncertainty?” Why do we need to do this with statistics? How have we learned to do it with both confidence intervals and hypothesis testing?
How are categorical vs. numerical data related to proportions vs. means?
Point estimates: What is the definition of a point estimate? What are the different types of point estimates we’ve learned about? What is their statistical shorthand notation? How do point estimates relate to population parameters when conducting inferential statistics?
What are the statistical notation symbols for the following:
    o Sample mean & population mean
    o Sample proportion & population proportion
    o Sample standard deviation & population standard deviation

Central limit theorem (CLT)

What are the conditions for the CLT to hold for:
    o Single proportion (p-hat)
    o Single group average (x-bar)
    o Two paired group averages (x-bar_diff)
    o Two unpaired group averages (x-bar1 – x-bar2)
    o Three or more groups (ANOVA)
What should you do if even one of the CLT conditions fails?
If you need more of a hint for what should be included in the CLT conditions, think about:
    o How do the conditions for the central limit theorem vary when you’re working with proportions vs. means?
    o Why is independence one of the two key conditions for the CLT, i.e. why is it so important for avoiding bias, and why does a random sample help ensure independence?
    o What are the characteristics of a normal distribution, and how do these appear in the CLT conditions? What do we look for?
    o How do you spot an outlier in (1) a histogram, (2) a scatterplot, or (3) a box-and-whisker plot? How do you spot skew?
    o What is the rule of thumb to check for normality (CLT#2) for numerical data?

Sampling distribution

What is a sampling distribution?
Why don’t we usually see a sampling distribution? What is stopping us from taking a few hundred or a few thousand separate samples of, say, 100 people each, in order to map out the full sampling distribution?
What is a “simulation” in R? How does this help us see and understand what a sampling distribution is?

Standard error

How are standard deviation, standard error, and a sampling distribution related?
How are standard error and sample size related?
Does standard error refer to a sample, or just to a population? Why or why not?
What happens to the standard error as the sample size increases?
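The sampling distribution and standard error questions above can be explored with a small simulation. This is only a sketch under stated assumptions: the "population" is made up (an exponential distribution chosen because it is right-skewed), and the seed, sample sizes, and the helper name `sample_means` are arbitrary choices for illustration.

```r
set.seed(216)  # arbitrary seed so the simulation is repeatable

# Hypothetical right-skewed "population" of 10,000 values
population <- rexp(10000, rate = 1 / 50)

# Draw many samples of size n and keep the mean of each one
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

means_25 <- sample_means(25)     # sampling distribution for n = 25
means_100 <- sample_means(100)   # sampling distribution for n = 100

# The standard deviation of the sample means is the standard error
sd(means_25)
sd(means_100)

# Theory for comparison: SE = sigma / sqrt(n)
sd(population) / sqrt(25)
sd(population) / sqrt(100)
```

Plotting `means_100` as a histogram and comparing it with a histogram of `population` shows the CLT at work: the population is skewed, but the sample means pile up in a roughly bell-shaped curve. In a real study we only get one sample, which is why we estimate the standard error from that sample instead of simulating.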

Experimental design & Systematic Bias

What is the difference between random error and systematic bias in a study? Which do we try to avoid, and which is (to a certain extent) inevitable? How do scientists and researchers deal with error vs. bias? What are some examples of unintentional biasing of study results?
A confidence interval shows the likely “wiggle room” around a point estimate, where the (true, hidden) population parameter is likely to fall – does this margin of error account for random error or systematic bias? Which one of these two invalidates a study?
What is the difference between random sampling and random assignment? How are these both beneficial for ensuring independence for the CLT? How do they differ?
What is the difference between an experiment and a survey?
What are the four principles of randomized experiments?
What is the difference between treatment and control conditions? Can you think of examples of treatment and control conditions beyond just medical trials?
What is a blinded experiment? What is a double-blind experiment? What is a placebo? What is the placebo effect?
They appear on the surface to be very similar processes, but what is the difference between stratified random sampling and blocking? What stages of the research process are they used at? Why are they useful? When might each be useful in the context of a study?

Constructing confidence intervals (CI)

What are the steps for constructing a CI for:
    o Single proportion (p-hat)
    o Single group average (x-bar)
    o Two paired group averages (x-bar_diff)
    o Two unpaired group averages (x-bar1 – x-bar2)
Note: I’d highly recommend mapping each of these to the four steps (prepare/check/calculate/conclude) as well as their conditions for the CLT, and the R code you’ll need.

Confidence Interval basics:
What does it mean to “fish with a net” vs. “fish with a spear”?
Do confidence intervals help capture and quantify the random error or the systematic bias present in a sample?
What is the relationship between a confidence level and a margin of error?
When do you use pnorm() and qnorm() vs. pt() and qt()? When do you use the three z* I provided you in class?
What is the proper phrasing for reporting on the outcome of a confidence interval, and why is it important?
When do we use p0 vs. p-hat for checking the conditions when building a confidence interval?
What is standard error? How is it related to the concepts of a sampling distribution and standard deviation? What processes and tests have we used it in so far? Can we calculate standard error for means, proportions, or both?
Reminder – make sure you understand the difference between z* and t*
    o What does the book call these values? (they have a special name in the formulae)
    o In what cases do we use each? How are they similar, how are they different?
    o What R functions do you use for each?
What are the three common confidence levels for confidence intervals? How does the process change for finding each of these? Which is the “widest net” and which is the narrowest? How do the conclusions change for each of these? (think about the amount of uncertainty, the size of the “net,” the level of the standard error, etc.)

Hypothesis testing with the z and t-statistics (i.e. quantifying strength with p-value), for a single group 𝑝̂, single group 𝑥̅, or 2-group 𝑥̅

What is the alternative and null hypothesis? How do we phrase them? What is the statistical shorthand for each?
What are the common null and alternative hypotheses for 2-group (both paired and unpaired) hypothesis tests? For ANOVA?
Why does the null hypothesis need to be set to an actual numerical value?
Do hypothesis tests use population notation or sample/point estimate notation?
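To connect the confidence interval and hypothesis testing questions above to R, here is a hedged sketch for a single group average. The data in `x` and the null value `mu0` are invented for illustration, and the by-hand steps are cross-checked against R's built-in t.test(); this is one way to lay out the calculation, not necessarily the exact workflow from class.

```r
# Hypothetical sample of 12 measurements (made-up numbers)
x <- c(98, 102, 95, 110, 104, 99, 101, 97, 106, 103, 100, 105)
mu0 <- 100                        # null value, H0: mu = 100

n <- length(x)
x_bar <- mean(x)                  # point estimate
se <- sd(x) / sqrt(n)             # standard error of the mean

# Critical values for a 95% confidence level
z_star <- qnorm(0.975)            # about 1.96
t_star <- qt(0.975, df = n - 1)   # larger than z* when n is small

# 95% confidence interval for the mean, using t*
ci <- x_bar + c(-1, 1) * t_star * se

# Two-tailed t-test by hand: test statistic, then tail area from pt()
t_stat <- (x_bar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Cross-check against the built-in function
check <- t.test(x, mu = mu0, conf.level = 0.95)
check$conf.int
check$p.value
```

Changing 0.975 to 0.95 or 0.995 gives the critical values for 90% and 99% confidence levels, which is a quick way to see how the “net” widens as the confidence level goes up. For a one-tailed test, the p-value would use a single tail from pt() instead of doubling it.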
Overarching comparisons which are absolutely key to understand:
    o Differences when hypothesis testing ideas with sample proportions vs. sample means
    o Differences when hypothesis testing ideas with one sample vs. two samples
    o One-tailed vs. two-tailed hypothesis test (and how this shows up in the R code)
    o When do we use p0 vs. p-hat for checking the conditions in a hypothesis test?
    o At what stages in the computation process you should use population parameters vs. sample statistics
    o NOTE: I would recommend mapping the four steps (Prepare/Check/Calculate/Conclude) to each of the different use cases above

Hypothesis testing basics for:
    o Single proportion (p-hat)
    o Single group average (x-bar)
    o Two paired group averages (x-bar_diff)
    o Two unpaired group averages (x-bar1 – x-bar2)
    o (And an ANOVA is technically a hypothesis test, but there’s a separate ANOVA section below)
What are the similarities and differences when working with two samples, i.e. with paired data vs. the difference in two means?
When do you use pnorm() and qnorm() vs. pt() and qt()?
What is the proper phrasing for reporting on the outcome of a hypothesis test, and why is it important?

Significance levels:
What is the most common significance level? Is this a hard and fast rule, or more of a guideline?
If we’re more worried about a false positive (i.e. moving away from the status quo), do you think we would set the significance level higher or lower than 0.05? What about if you’re worried about a false negative (failing to move away from the status quo when there is a real difference)?

The p-value, i.e. the strength-of-evidence score
Do you find a p-value for: a single x-bar t-test, a 2 x-bar t-test, an ANOVA, and/or creating a confidence interval? (hint: it’s not a yes for all of those options)
What is a p-value?
What is the difference between the size of the effect/size of the difference, and the p-value?
What are common misunderstandings about the p-value?
Is the p-value going to be higher or lower (according to the research discussed in class) than the true error rate?
What is the difference between statistical and practical significance? Why are you more likely to detect a statistically significant difference in two groups that are only slightly different, if you have a really big sample size?

Comparing 3+ group 𝑥̅ with ANOVA

ANOVA concepts:
What is a t-test vs. an ANOVA?
How do you (theoretically and on an ANOVA output table) find the F-statistic in an ANOVA? What is MSE vs. MSG, and what kind of variability do they represent?
What is a “pairwise comparison” compared to an ANOVA test?

ANOVA process:
How do the conditions vary for an ANOVA compared to the other types of hypothesis testing?
What are the steps for an ANOVA test?
How do you write the hypotheses for an ANOVA, and do you use the notation for a sample or population mean?
How do you determine if there is strong evidence (statistical significance) for real multi-group differences in an ANOVA? How do we properly report on these results? How do we phrase it?
When do you move on to pairwise comparisons? Will these pairwise t-test comparisons use the paired or unpaired 2-group method? What are the other changes you’ll make to this final process to guard against type 1 error?
When do you use the Bonferroni corrected significance level (alpha)? Before or after running an ANOVA? With the ANOVA or with pairwise t-test comparisons? With statistical significance or no statistical significance? How do you properly report on these results? How do we phrase it?

Avoiding inflated type 1 error:
What are you trying to avoid by first using an ANOVA to compare three groups, rather than three different pairwise t-tests, or a pairwise t-test of the most different groups?
What are you trying to avoid by using a Bonferroni corrected significance level (alpha) to compare three groups, rather than three different t-tests?
What “inflates” if you just pick the most “interesting” groups to compare with a single t-test?
Define data fishing or data snooping, and briefly outline some ways to avoid it.

Post-exam 2, regression

The basics:
Make sure you can describe scatterplots using the methods we discussed for exam 1.
Why is regression a kind of bivariate analysis? What is it trying to do?
What is an OLS linear regression model?
How is regression similar to, and different from, calculating a correlation coefficient R? When can they be used? What are the conditions for each, and how do they overlap/differ?
How can you tell positive vs. negative association of two variables using a scatterplot, the sign on the correlation coefficient, and the sign on the beta-one?
What are the other names for the independent variable? What are the other names for the dependent variable?
What is a residual?
    o How do you calculate it? How do you recognize the “observed” value? How do you calculate the predicted value? How do you use these two to calculate the residual?
    o Does a positive residual suggest that the observation was over- or under-predicted by the linear regression line? What about a negative residual?
Can a regression analysis have more than one outcome/dependent variable? Can a regression analysis have more than one predictor/independent variable?
What is the difference between simple linear regression and multiple linear regression? We didn’t have enough time to cover multiple linear regression in any mathematical detail, but we have touched on the concept several times, and why it is powerful. You should be able to write at least a sentence describing the difference.

Simple Ordinary Least Squares (OLS) linear regression

Why do we try to minimize residuals in regression analysis? How is OLS regression related to the concept of SSR? How does that mathematically tell us where to position the regression trend line?
What is the y-intercept?
Where is it located in the equation? What does it mean? How do you interpret it in context?
    o When is the y-intercept mathematically necessary? When is it practically useful? Is it always both of these, or always one of these, or can it vary?
What is the slope coefficient for the predictor? Where is it located in the equation? What does it mean?
    o (important) How do you interpret the number verbally, in the context of the problem?
How can you find the R-squared, the beta-naught, the beta-one, and the p-value from BOTH the lm() output in R, and from a cleaned up table?
What is the R-squared? How do you interpret it in context?
What are the 2 hypotheses for a linear regression analysis?
    o What is the p-value?
    o Do we look at the p-value for the beta-naught, the beta-one, or both?
    o How many p-values do we examine in simple linear regression?
What are the dangers of over-extrapolation far away from the data? How does this relate to the idea that, sometimes, the y-intercept isn’t practically useful but is always mathematically necessary? (we talked about this at the end of lecture 21)
What is a residual plot? How can you recognize that vs. a standard bivariate scatterplot? What is a residual histogram?
What are the three conditions for linear regression? How can you recognize them in plots? Which ones do you check before running lm(), and which do you check afterwards? Which ones use the residual plot, and which one uses the residual histogram?
What is an indicator variable? When is it used?
How do you work with and interpret categorical variables? What is the difference between variables and predictors? What is a reference level?

Multiple OLS Regression

What is omitted variable bias? Why do we use multiple linear regression, and how can that help us avoid omitted variable bias? What are the examples we discussed in class? What are others you could think of?
What are the differences between the equations for multiple and simple linear regression?
Make sure you can write out both a simple linear regression generic equation, and a 3-predictor multiple linear regression generic equation.
How many predictors can there be in multiple linear regression? How many outcome variables can there be?
How do you (verbally) interpret the slope coefficient (beta) of each variable?
How many p-values will you interpret when hypothesis testing with multiple linear regression with, say, four predictors?
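
As a way to practice the regression questions above, here is a minimal R sketch. It uses the built-in mtcars dataset purely as a stand-in (an assumption; it is not a course dataset), and the object names are illustrative. It fits a simple regression with lm(), rebuilds the residuals by hand, and then fits a 3-predictor multiple regression.

```r
# Built-in mtcars data, used here only as a stand-in example
fit_simple <- lm(mpg ~ wt, data = mtcars)   # outcome ~ predictor
summary(fit_simple)   # coefficient table with p-values, plus R-squared

b0 <- coef(fit_simple)[["(Intercept)"]]   # beta-naught (y-intercept)
b1 <- coef(fit_simple)[["wt"]]            # beta-one (slope)

# Residual = observed - predicted
predicted <- b0 + b1 * mtcars$wt
resid_by_hand <- mtcars$mpg - predicted

# R-squared pulled out of the summary object
r_squared <- summary(fit_simple)$r.squared

# Multiple regression: add predictors on the right-hand side
fit_multi <- lm(mpg ~ wt + hp + qsec, data = mtcars)
coef(fit_multi)   # one intercept plus one slope per predictor
```

A residual plot for checking conditions can be drawn afterwards with plot(fitted(fit_simple), residuals(fit_simple)), and a residual histogram with hist(residuals(fit_simple)).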