ECMT1010: Business and Economics Statistics A Notes
The University of Sydney

Summary of Notes Covered
1. Collecting Data
2. Describing Data
3. Confidence Intervals
4. Hypothesis Tests
5. Approximating with a Distribution
6. Inference for Means and Proportions
7. Inference for Regression
8. Probability

Part 1 -> Collecting Data

Data: A set of measurements collected and stored in a dataset, based on individuals, groups or countries.
• Cases: Any individual item that can be analysed or observed
• Variables: Any characteristic or trait recorded for each case

Variables: Characteristics of each case under observation.
Categorical: Variables classified into groups, such as gender, medals, hobbies etc.
Quantitative: Variables that take a numerical value, such as age, height, weight etc.
Explanatory: Variables that help us explain the cause in a scenario. They generally come first.
Response: Variables whose value is affected by the value of the explanatory variable.
• Essentially, they highlight the effect the explanatory variable has

The Big Picture (sample–population cycle): Population -> Sampling -> Sample -> Statistical Inference -> Population

Statistical Inference: Data collected from a sample group is used to make a generalisation about the population as a whole.

Sample: A group of individuals who are a subset of a whole population, from whom data is collected.

Sample Bias: Sample bias occurs where a chosen sample differs systematically from the overall population.
• Any generalisation from such a sample will therefore be misleading and inaccurate

When sampling, we should be careful to minimise any forms of bad sampling. Some of these forms are:
• Sampling units that are obviously related to the variable you are studying: To sample accurately, we SHOULDN'T sample a group that is closely tied to the variable being studied. An example of this is a personal trainer poll on a fitness website.
• Volunteer bias, where the sample is whoever chooses to participate: Allowing people to volunteer can be bad, as volunteers tend to have stronger, more personal opinions. An example of this is emailing customers about flight experiences.
• Context: Sometimes the context of a scenario hints at what the answer "should" be, which defeats the purpose of the survey. An example of this is conducting a pregnancy survey while providing additional information about the negatives of having children; this would obviously push respondents against pregnancy.
• Wording: The way a question is worded can influence the outcome. For example, if the government proposed spending on medicine as opposed to tax cuts, rather than just tax cuts by themselves, the majority would prefer medicine.
• Lazy responses: Surveys may be answered carelessly or neglected simply because respondents don't care much about the survey.

Confounding Variable: A third variable that is associated with both the explanatory (cause) variable and the response (effect) variable.

Part 2 -> Describing Data

One Categorical Variable: Important where we consider the proportion of cases that fall into a certain outcome (e.g. comparing people who agree against those who disagree or don't know).
NOTE: Proportions are also known as relative frequencies.

We are able to calculate proportions (relative frequencies) as follows:

Proportion in a category = (number in that category) / (total number)

NOTE: When writing the notation for a proportion:
• p denotes the proportion of a population
• p̂ denotes the proportion of a sample
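A minimal sketch of the proportion formula in Python, using made-up survey responses (the data and category names are illustrative, not from the course):

```python
# Sample proportion p-hat = number in category / total number
responses = ["agree", "disagree", "agree", "agree", "dont_know", "agree"]

p_hat = responses.count("agree") / len(responses)
print(f"p-hat (agree) = {p_hat:.3f}")  # 4/6 ~ 0.667
```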
Two Categorical Variables
Sometimes we might ask questions like "does the opinion differ between male and female?" These types of questions ask about a relationship between two categorical variables. NOTE that the categorical variables here are the opinion and also gender. For this, we set up a two-way table, which has one of the categorical variables along the rows and the other along the columns. It is often useful to add a total row and column.
NOTE: The total for the two-way table should be the same as the total for a frequency table if you are looking at the same scenario.

Two-Way Table: A two-way table shows the relationship between two categorical variables by placing the categories of one variable along the rows and the other along the columns.
NOTE: The categorical variables used in this case are gender and type of award.

When observing a histogram, we have to ask ourselves whether the data is symmetric or skewed. There are 4 cases that can occur:
• Symmetric: The data is symmetric if we could fold the graph in the middle and both sides would roughly match
• Right skewed: Most of the data is on the left, with an extended tail running to the right
• Left skewed: Most of the data is on the right, with an extended tail running to the left
• Bell-shaped: A bell-shaped histogram looks like a single hill that slopes up and then down

Mean: The mean of a given quantitative variable is the average of the numerical data. This is denoted in the following ways:
• µ denotes the mean of a population
• x̄ denotes the mean of a sample from that population

Mean = (sum of all data values) / (number of data values) = (x₁ + x₂ + x₃ + ... + xₙ) / n = Σx / n

Median: The median is the middle entry of an ordered data list if the list has an odd number of values, or the average of the two middle values if the ordered list has an even number. This means the median splits the data in half.

Resistance
Sometimes in statistics we get outliers, which raises the question of resistance (robustness). Outliers affect the mean and the median differently.
Resistance/Robust: A statistic is resistant when it is relatively unaffected by extreme values. The median is resistant but the mean is not.
• In other words, the median is resistant because no matter how extreme a value is, the median is relatively unaffected, whereas the mean can be severely impacted.

Standard Deviation: Standard deviation is the quantitative measure of the spread of data in a dataset. As the standard deviation increases, the spread of the data also increases. For a sample it is calculated by:

s = √( Σ(x − x̄)² / (n − 1) )

Similar to proportion, standard deviation also has its own notation:
• s denotes the standard deviation of a sample, measuring how spread out the data is from the sample mean x̄
• σ denotes the standard deviation of a population, measuring how far the data is from the population mean µ

Note: Standard deviation allows us to describe how many deviations a certain value is from the mean (e.g. 1 deviation, 2 deviations etc.).
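A short sketch of these summary statistics in Python, with a made-up sample chosen to show the resistance point: the outlier drags the mean but barely moves the median. Assumes numpy is available.

```python
import numpy as np

# Made-up quantitative sample (e.g., ages); 58 is a deliberate outlier
x = np.array([19, 21, 22, 23, 24, 25, 27, 58])

print("mean   =", x.mean())        # pulled upward by the outlier (not resistant)
print("median =", np.median(x))    # relatively unaffected (resistant)
print("s      =", x.std(ddof=1))   # sample standard deviation (divides by n - 1)
```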
IMPORTANT: For a bell-shaped, symmetric curve, approximately 95% of the values fall within 2 standard deviations of the mean. Mathematically, this means the values should fall between −2s and +2s of the mean. This is shown in the graph on the right.

When analysing data, we are particularly interested in the centre and spread of a distribution. To compare values, we look at how many standard deviations a value is from the mean. This is called the z-score.

Z-Score: Measures how many standard deviations a given data value is from the mean. Mathematically:

z = (x − x̄) / s

Percentiles: When looking at a range of data, we can describe the data using percentiles. EXACTLY LIKE THE HSC MARKS, the percentile of a given mark indicates the percentage of all other marks that it beat. Some examples are:
• 92nd percentile in mathematics means the student's mark beat 92% of all other mathematics marks
• 21.5th percentile in visual arts means the student's mark beat 21.5% of all other visual arts marks

Using the above, we now look at the five-number summary. This method uses the median, the minimum and maximum (first and last values), and the midpoints between the minimum and median and between the median and maximum (Q1 and Q3).

Five-Number Summary: We divide the data into the minimum, Q1, median, Q3 and maximum.

Using the five-number summary, we can also find the range and interquartile range of a given data spread:
• Range: Maximum − Minimum
• Interquartile Range: Q3 − Q1

Correlation: Correlation measures the strength and direction of a linear association between 2 quantitative variables. The notation for correlation is:
• r for the correlation between two quantitative variables in a sample
• ρ for the correlation between two quantitative variables in a population

From scatterplots of correlated variables, we can see the properties of correlation:
• Correlation is always between −1 and 1
• The sign of the correlation indicates the direction of association
• Correlations closer to −1 or 1 indicate a stronger linear association
• Correlation = 0 indicates no linear association

Correlation Cautions
When looking at correlations, there are a number of pitfalls we need to avoid:
• A strong correlation between two quantitative variables does NOT mean that there is causation (cause and effect)
• A correlation near 0 doesn't mean two quantitative variables are not associated, because correlation only tests linear association
• Outliers can heavily influence correlations. Be sure to plot your data carefully
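A compact sketch tying these pieces together in Python: a z-score, a five-number summary, and the correlation r. The two variables are made-up illustrative data; assumes numpy.

```python
import numpy as np

x = np.array([2, 4, 5, 7, 8, 10, 11, 13])   # explanatory variable (made-up)
y = np.array([1, 3, 4, 6, 8,  9, 12, 14])   # response variable (made-up)

# Five-number summary of x: minimum, Q1, median, Q3, maximum
print(np.percentile(x, [0, 25, 50, 75, 100]))

# z-score of one value: how many standard deviations from the mean
z = (x[0] - x.mean()) / x.std(ddof=1)

# Correlation r between two quantitative variables (always in [-1, 1])
r = np.corrcoef(x, y)[0, 1]
print(f"z = {z:.2f}, r = {r:.3f}")
```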
As we already know the equation of a straight line, we are able to adapt that equation to find the regression line for any given explanatory and response variable. Mathematically, the regression line for a sample is:

ŷ = b0 + b1·x

where:
• b1 = the slope of the regression line
• b0 = the constant, or vertical intercept

As shown in the above formula, the response variable is on the left (playing the role of "y") while the explanatory variable is on the right (playing the role of "x"). This means the response variable is always a function of the explanatory variable. Using this general regression formula, we can estimate the value of the response variable for a given value of the explanatory variable.

However, it's important to note that we are predicting the response value; the true value could be above or below this point. This gives the following definitions:
• y is the observed response value, the actual value for a particular data point
• ŷ (y-hat) is the predicted response value, the value we estimate from the regression line formula

Residual -> Residuals are the difference between the observed response values and the predicted response values. On the graph, a residual is the vertical distance of the observed value y from the regression line value ŷ.

Residual = Observed − Predicted
Residual = y − ŷ

REMEMBER: Our objective is to find the regression line of best fit. To do this we must calculate a line that is as close to all the scatterplot values as possible, using the least squares line.

Least Squares Line (LSL): The regression line that minimises the sum of all the squared residuals from the scatterplot:

LSL = minimise Σ(y − ŷ)²

NOTE -> In regression modelling, it is HIGHLY IMPORTANT that the explanatory and response variables are properly distinguished, otherwise different values will be calculated.
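A minimal least-squares sketch in Python on the same made-up data as above: fit the line, form predictions ŷ, and compute the residuals whose squared sum the fit minimises. Assumes numpy.

```python
import numpy as np

x = np.array([2, 4, 5, 7, 8, 10, 11, 13])
y = np.array([1, 3, 4, 6, 8,  9, 12, 14])

# Least squares slope b1 and intercept b0 (minimise sum of squared residuals)
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x          # predicted response values
residuals = y - y_hat        # observed - predicted

print(f"y-hat = {b0:.2f} + {b1:.2f} x")
print("sum of squared residuals:", np.sum(residuals**2))
```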
Part 3 -> Confidence Intervals

Point Estimate: A single value or statistic used to estimate a population parameter.
• NOTE that this number isn't exactly the population parameter, but a close approximation of the true value.

Since the population parameter is fixed and won't change, we can use the variability of sample statistics to get closer and closer estimates of the true value. This means we take different samples OF THE SAME SIZE from a population and calculate the statistic in each sample. From this we then compare the samples and look at the variation (variability):
• Low variation: the sample statistics from each sample are very close, suggesting accuracy
• High variation: the sample statistics are far apart from each other, so there is a question about accuracy

CASE STUDY EXAMPLE: Using case study example 1 from above about US presidential voting polls, we get the following sample statistics:
• Sample 1: 48% voted for Obama
• Sample 2: 47% voted for Obama
• Sample 3: 50% voted for Obama
As we can see, there is only a 3% variation in the sample statistics gathered. Based on the context, this is accurate and therefore reliable for estimating the true value of the population parameter.

Sampling Distribution: The distribution of sample statistics across many samples from a given population. Sampling distributions have the following characteristics:
• The sample statistics are centred around the population parameter. EG: the sample means are centred around the population mean (which is the parameter)
• If the sample size is large enough, the distribution is symmetric and bell-shaped
• The standard error shows how much the samples vary

NOTE: All these concepts rely on random samples. Non-random samples will give heavily inaccurate results.

Standard Error: A type of standard deviation, the standard error measures how spread out the sample statistics are in a given population. We can calculate it with a formula similar to standard deviation. It is important to understand the difference in concepts here: the standard error is the standard deviation of sample statistics across many samples, whereas the ordinary standard deviation describes the spread of individual values within one sample.

NOTE: As the size of a sample gets larger, the standard error (variation) decreases, because the sample becomes a more accurate representation of the population.

Interval Estimate: A range of values in which the parameter is situated (the parameter lies within the range of values).
Margin of Error: The precision of the sample statistic as a point estimate of the parameter.
• The "give or take" value around the statistic

Confidence Interval: An interval built so that, for a stated proportion of all samples, the interval (range of values) contains the parameter.
• EG: 85% of the sample intervals given contain the parameter proportion p

As we already know, for a bell-shaped, symmetric curve, 95% of the data falls within 2 standard deviations. Therefore, we can calculate a confidence interval such that 95% of the sample intervals contain the parameter. The formula for calculating the confidence interval is:

Confidence Interval = Statistic ± 2 × Standard Error
Confidence Interval = Statistic ± Margin of Error

NOTE: The difference between margin of error and standard error:
• The margin of error is the value added to or subtracted from the statistic to form the confidence interval
• The standard error is the standard deviation of many sample statistics put together

The diagram above shows a dot plot containing all of the sample statistics. Using the 95% rule, when observing a symmetric, bell-shaped curve, approximately 95% of the statistics fall within ±2 standard errors of the population parameter (the centre in this case). This is similar to the 95% standard deviation rule for the values of a given sample.

Bootstrapping: Using a single sample to estimate the standard error and construct a 95% confidence interval for the population parameter.
• This is done because it is very difficult to collect multiple samples

Bootstrapping therefore involves resampling the data so as to mimic an artificial population.
NOTE: This concept was covered in the tutorial questions when we used StatKey to produce 7000 samples and place them onto a dot plot.
Graph: The right graph shows how we used bootstrapping to generate 7000 samples and plot them as a dot plot.

NOTE that building an actual artificial population isn't possible in practice. Therefore we sample with replacement from the original sample: we randomly choose an item from the sample and then place it back before choosing the next, thus creating bootstrap samples.

Centre: The difference between the centre of a sampling distribution and a bootstrap distribution:
• Sampling distribution: centred at the population parameter
• Bootstrap distribution: centred at the original sample statistic, before any resampling occurred

Some important things to note for bootstrap samples:
• A bootstrap sample must be the same size as the original sample
• A bootstrap sample contains the same values as the original sample, although the frequencies of those values need not match
• Bootstrapping only works when the original sample is random

Bootstrap Definitions
• Bootstrap sample: a sample created from the original sample by sampling with replacement
• Bootstrap statistic: the statistic calculated from a bootstrapped sample
• Bootstrap distribution: the distribution of many bootstrap statistics

We can use the concept of bootstrapping to estimate the standard error of the sample statistic. To do this, we use the bootstrap distribution and calculate its standard deviation; this gives a good approximation of the sample statistic's standard error:

SE of a statistic ≈ standard deviation of the bootstrap statistics

With the above bootstrapping, we can construct a 95% confidence interval.
• NOTE -> This is very similar to both the 95% rule for a sample and the 95% rule for a sampling distribution

Confidence Interval = Statistic ± 2 × Standard Error
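A minimal bootstrap sketch in Python mirroring the StatKey exercise: resample with replacement many times (7000 here, matching the tutorial), take the standard deviation of the bootstrap statistics as the SE, and form Statistic ± 2 × SE. The original sample is made-up; assumes numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([3, 5, 6, 6, 7, 8, 9, 12, 13, 15])  # made-up original sample

# Resample WITH replacement, same size as the original, many times
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(7000)
])

se = boot_means.std(ddof=1)      # SE ~ std dev of the bootstrap statistics
stat = sample.mean()             # original sample statistic (distribution centre)
print(f"95% CI ~ {stat - 2*se:.2f} to {stat + 2*se:.2f}")
```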
Part 4 -> Hypothesis Tests

The focus of this section is statistical data and whether sample statistics are convincing enough to support an inference about the population.

Statistical Test -> Uses data from a sample to study how convincing a claim about the population is.
• We answer how convincing the data is by using a null hypothesis and an alternative hypothesis.

Null Hypothesis (H0): A claim or statement that there is no difference or no effect.
Alternative Hypothesis (Ha): The claim we seek evidence for, namely that there is an effect.
• The aim is to provide enough evidence so that we can rule out the null hypothesis.

CASE STUDY QUESTION -> In this question, we are looking at whether leniency is greater when a student smiles. The experimenters have no prior beliefs about the effect of smiling on leniency and are testing to see if facial expression has any effect.
ANSWER: We use the parameters µsmile for average leniency with a smile and µneutral for average leniency with a neutral expression. We are testing whether smiling has any effect on leniency, so the hypotheses are:
• Null Hypothesis: µsmile = µneutral
• Alternative Hypothesis: µsmile ≠ µneutral

Statistical Significance: When results as extreme as the sample statistic are unlikely to occur by chance alone.
• Simply put, the extreme results are unlikely to be random or coincidental
The importance of statistical significance is that if a sample is statistically significant, we have enough evidence to support the alternative hypothesis (Ha) and reject the null hypothesis (H0).

Another useful measure is the p-value. P-values measure the strength of the sample statistic as evidence for the alternative hypothesis.

P-value -> Assuming the null hypothesis is true, the probability of obtaining a sample statistic at least as extreme as the one observed.

The idea with p-values is that as the p-value gets smaller and smaller, approaching 0, the strength of the statistical evidence gets bigger. As a result, the alternative hypothesis is favoured even more.
• IDEA -> smaller p-values are stronger evidence because they show there is much less chance that such an extreme value occurs randomly or by coincidence

Method: Use randomisation methods to estimate p-values (assuming the null hypothesis is true).

Procedure
• Generate randomisation samples: simulate many samples that are consistent with the null hypothesis
• Calculate the sample statistic for each generated sample and plot these as a distribution
• If the observed statistic falls in a section of the distribution that is unlikely, such as the tails or extreme outliers, then we have evidence against the null hypothesis and in favour of the alternative hypothesis (see the sketch below)
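A sketch of this procedure in Python for the smile/leniency setup, under the assumption (mine, for illustration) that the data are two small groups of made-up leniency scores. Shuffling the group labels generates samples consistent with H0: no difference between groups. Assumes numpy.

```python
import numpy as np

rng = np.random.default_rng(1)
smile   = np.array([7, 6, 8, 5, 9, 7, 8])   # made-up leniency scores
neutral = np.array([5, 4, 6, 5, 7, 4, 6])

observed = smile.mean() - neutral.mean()
pooled = np.concatenate([smile, neutral])

# Shuffle group labels: each shuffle is a sample consistent with H0
diffs = np.empty(10000)
for i in range(10000):
    perm = rng.permutation(pooled)
    diffs[i] = perm[:len(smile)].mean() - perm[len(smile):].mean()

# Two-tail p-value: proportion of shuffles at least as extreme as observed
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed diff = {observed:.2f}, p-value = {p_value:.4f}")
```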
Randomisation Distribution: Assuming the null hypothesis is true, we generate many samples and check where each statistic falls on the distribution curve. Statistics that fall in an unlikely region provide strong evidence against the null.

The right graph shows the randomisation distribution for the number of dogs that run to their owners. When using a randomisation distribution, we look at how often an outcome occurs at the observed value and beyond it; this proportion becomes our p-value. Using the graph as an example, the original sample proportion is 16/25 = 0.64. Generating many randomisation samples and plotting them, we find the proportion of values greater than or equal to 16 to be 0.1145, which becomes our p-value. This p-value shows us that a person randomly guessing would get 16/25 or more only 11.45% of the time.

Methods to Estimate the P-value
When using a randomisation distribution, we can use the following methods:
• One-tail alternative: Find the proportion of randomisation samples that are at least as extreme as the observed value in the indicated direction (to the left or right of the curve). This is the method we have been using so far.
  o NOTE: One-tail alternatives are used for Ha: x1 > x2 or x1 < x2
• Two-tail alternative: Using the smaller tail, find the proportion of samples at least as extreme as the observed value and double it to account for the other tail.
  o NOTE: Two-tail alternatives are used for Ha: x1 ≠ x2

Left diagram: The left diagram shows how to apply the two-tail method, where you find the proportion in the smaller tail and then double it.
Right diagram: The right diagram shows how to apply the one-tail method, where you find the proportion in the direction given (in this case, to the left).

For very small p-values, there is a very small chance that the sample result occurs by random guessing. Therefore we have evidence to favour the alternative hypothesis Ha over the null hypothesis H0.

As p-value -> 0, we favour Ha.

Decision | Outcome
Reject H0 | Found evidence to support the alternative hypothesis Ha
Don't reject H0 | No significant evidence for either hypothesis; both must remain under consideration

Significance Level -> The cut-off point for a p-value where:
• Below this value, we have statistical evidence to reject the null hypothesis and favour the alternative hypothesis
  o p < significance level
• Greater than or equal to this value, we cannot reject the null hypothesis
  o p ≥ significance level
• Generally, statisticians use common significance levels (cut-off points) of α = 0.01, α = 0.05 and α = 0.10

We can also use a graph to visually determine the cut-off point (significance level). For any statistic beyond this point, we would reject the null hypothesis H0.

Type I and Type II Errors
These are the ways the decision in a hypothesis test can be wrong. They are broken down into Type I errors and Type II errors.
Type I Error -> A Type I error occurs where the null hypothesis H0 is true but we have in fact rejected it in favour of the alternative hypothesis.
Type II Error: A Type II error occurs where the null hypothesis H0 is false but we in fact fail to reject it.

However, if the significance level is set too high, we could end up rejecting many null hypotheses, treating some unremarkable results as significant. The significance level is the probability of making a Type I error, so when choosing one we should keep that probability reasonably low.

The following criteria must be met when creating randomisation samples:
1. They must be consistent with the null hypothesis
2. They must use the original sample's data
3. They must mimic the way the original data was collected

Randomisation Distribution Centre
• When the null hypothesis is true, the distribution is centred at the parameter value stated in the null hypothesis
  o EG: If the parameter is a proportion and the null hypothesis H0: p = 0.9 is true, then the centre of the distribution will be p = 0.9

To conduct hypothesis testing, we can use any of the following randomisation tests:
• Difference in proportions
• Test for correlation
• Test for a mean

1.) Randomisation Distribution for a Difference in Proportions
When creating randomisation distributions to find the p-value for a difference in proportions, we take the difference in proportions and set the null hypothesis to zero difference between the two. This means:

H0: p1 = p2, or p1 − p2 = 0

Using this null hypothesis, we generate the randomisation distribution centred at this zero difference and then, based on the alternative hypothesis, observe the relevant side of the graph. We calculate the observed difference and take the proportion of randomisation statistics at least that extreme in the given direction as our p-value.

2.) Randomisation Distribution for Correlation
When testing a correlation with randomisation methods, we again centre the distribution at the null hypothesis value. Then we use the alternative hypothesis to find the proportion of randomisation statistics to the left or to the right of the sample correlation.

3.) Randomisation Distribution for a Mean
When looking at questions about means, we again centre the randomisation distribution at the null hypothesis mean. The difference with mean questions is that the alternative hypothesis will typically argue that the given mean is not true (≠), so:
• use the two-tail method when calculating the p-value
The graph below shows a randomisation distribution centred at the null hypothesis mean. When we calculate and plot our sample mean, we find the proportion in the nearer tail and multiply by 2 to cover both sides of the graph.

Bootstrap connection: We can also use the confidence interval from a bootstrap distribution and check whether the null hypothesis value falls inside it. Depending on the case, the null hypothesis will generate either a small p-value or a relatively large one, which can be used to determine whether or not to reject the null hypothesis.

NOTE -> When looking at confidence intervals, the significance level corresponds to the percentage of data not within that confidence interval. For example:
• a 95% confidence interval corresponds to a 5% significance level
• a 99% confidence interval corresponds to a 1% significance level

H0 outside the Confidence Interval
If the null hypothesis value is outside the confidence interval, then we should reject the null hypothesis, as the p-value will be smaller than the significance level.
This is because the p-value calculated for the null hypothesis value covers a smaller area of the bootstrap distribution than the 5% of the graph lying outside the 95% confidence interval.

If H0 is outside the confidence interval, reject H0.

H0 inside the Confidence Interval
If the null hypothesis value is found to be inside the confidence interval (say, inside a 95% confidence interval), then we can't reject the null hypothesis, because the p-value will be greater than the significance level.

If H0 is inside the confidence interval, do not reject H0.

If the null hypothesis value is inside the 95% confidence interval, then taking the tail area beyond it and doubling gives a p-value greater than the 5% significance level (the other 5% of the graph). Therefore, we cannot reject the null hypothesis.

Part 5 -> Approximating with a Distribution

Density Curve -> A theoretical curve that describes the distribution of values. The characteristics of a density curve are:
• The total area underneath the curve equals 1
• Consequently, the area under the curve over a given interval equals the proportion of values in that interval

Graph: The graph on the left shows a black curve along the data, which represents the density curve.

A density curve can take on any shape; however, we will be focusing on normal density curves, which are symmetric and bell-shaped.
The left graph shows the area underneath a given density curve. The area (proportion) is visualised by the red shaded interval.

Normal Density Curve: A bell-shaped, symmetric density curve with parameters mean (µ) and standard deviation (σ). Normally distributed variables can be written as:

X ~ N(mean, sd)

Where:
• N specifies that the given distribution is a normal distribution
• X denotes the variable that follows the normal distribution curve we are observing
• the mean µ is the centre value around which the symmetric, bell-shaped curve is centred
• the standard deviation shows the spread of the curve

Characteristics of the Normal Density Curve
1. Bell-shaped and symmetric
2. Centred at the mean
3. 95% of the data falls within 2 standard deviations

Calculating Percentiles and Normal Areas/Probabilities/Proportions
We can visualise the area under the curve for a given interval; however, the integral involved in calculating the area is very complicated, so we use StatKey. When using this method, we need to provide the following information:
1. The mean and standard deviation
2. The endpoints of the interval (which values to calculate between)
3. The direction in which to calculate (values to the left or values to the right)

Standard Normal
Standard Normal -> The standard normal is a distribution that follows the normal density curve BUT is centred at 0 and has a standard deviation of 1. The significance of standard normal graphs is that they show how many standard deviations a statistic is from the population parameter; in other words, they show the z-score for a given statistic. The characteristics of a standard normal distribution are that it has mean = 0 and standard deviation = 1.
When referring to standard normal distributions, we use the following notation:

Z ~ N(0, 1)

Where:
• Z specifies that we are looking at a standard normal curve
• N specifies it is a normal distribution curve
• µ = 0
• standard deviation = 1

With bell-shaped symmetric curves, it is possible to convert a normal distribution curve to the standard normal curve and vice versa. To convert from X ~ N(mean, sd) to Z ~ N(0, 1), we use the following formula, which resembles the z-score formula and gives us a z-score for that statistical value:

Z = (X − µ) / σ

Where:
• X is the value
• µ is the mean
• σ is the standard deviation

Question: How do we find percentiles using a standard normal distribution?
Answer: We reverse the process: find the endpoint on the standard normal distribution curve that gives us the percentile, then convert it using X = µ + Zσ to find the corresponding statistic value at which that percentile occurs.

Central Limit Theorem (CLT): The Central Limit Theorem says that for a sufficiently large sample size, the distribution of the sample statistics can be approximated by a normal distribution.

CLT Characteristics
• For a skewed distribution, the sample size n needs to be very large
• For a quantitative variable: n ≥ 30
• For a categorical variable: counts of at least 10 in each category

Therefore, using normal distributions in place of bootstrap distributions, we can use StatKey to calculate a confidence interval via the Central Limit Theorem. NOTE that this confidence interval should be approximately the same as the confidence interval calculated from a bootstrap distribution.

When using a standard normal curve N(0, 1), we can calculate the endpoints of the confidence interval with:

Confidence Interval = Statistic ± z* × SE

Where:
• z* is the z-score chosen so that the area between −z* and +z* gives us the desired confidence level
For example:
• z* = 1.645 gives us a 90% confidence interval when using the formula

Summary: how to calculate a P% confidence interval
Step 1 -> Confirm that the sampling distribution can be approximated with a normal distribution (check whether the sample size is big enough)
Step 2 -> Find z* for the P% confidence interval (this is given on the formula sheet)
Step 3 -> Use Statistic ± z* × SE to calculate the confidence interval

Normal Distributions and P-values
In some cases, we can use a randomisation distribution curve to calculate the p-value for hypothesis testing. If our randomisation distribution curve has the shape of a normal distribution, we can assume the null hypothesis is true and use our statistic to calculate a p-value (the area underneath the curve beyond the statistic).

Test Statistic -> The number of standard errors (a z-score) a sample statistic lies from the null hypothesis value. Following this, the p-value is the proportion of the distribution to the left or right of this z-score.

Summary -> Calculating a p-value for H0 with a standard normal distribution
Step 1 -> Find the standardised test statistic z
Step 2 -> Calculate the p-value by taking the proportion to the left or right, depending on the alternative hypothesis
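A short sketch of both directions of this machinery in Python, using scipy's standard normal: recovering z* for common confidence levels (this also confirms the 90% value of 1.645), and turning a test statistic z into one- and two-tail p-values. The z = 2.1 is a made-up example.

```python
from scipy.stats import norm

# Endpoint z* for a P% confidence interval: area between -z* and +z* is P%
for p in (0.90, 0.95, 0.99):
    z_star = norm.ppf(1 - (1 - p) / 2)
    print(f"{p:.0%}: z* = {z_star:.3f}")   # 1.645, 1.960, 2.576

# p-value from a standardised test statistic z
z = 2.1                                     # made-up test statistic
print("one-tail p =", 1 - norm.cdf(z))      # right-tail alternative
print("two-tail p =", 2 * (1 - norm.cdf(abs(z))))
```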
Part 6 -> Inference for Means and Proportions

For a given random sample, we can use the sample size and sample proportion to calculate the standard error (SE). We assume that the sample proportion is representative of the population proportion. Mathematically, the formula is:

SE = √( p(1 − p) / n )

Sample Size -> Central Limit Theorem
When using random samples, there are cases where a normal distribution may not apply. For a large sample size, a normal distribution describes the sample proportion and the standard error so long as the following conditions are satisfied:

np ≥ 10 and n(1 − p) ≥ 10

When the above Central Limit Theorem criterion is satisfied, our sample size is sufficiently large and the distribution of the sample proportion is:

p̂ ~ N( p, √(p(1 − p)/n) )

Since the population proportion is not usually given, we know from the CLT that for a large sample the sample statistic is very close to the population parameter. Therefore, for a sample satisfying:

n·p̂ ≥ 10 and n(1 − p̂) ≥ 10

we can calculate the confidence interval as follows:

p̂ ± z* √( p̂(1 − p̂) / n )

Sample Size for a Confidence Interval
When using a confidence interval, we want to know how large our sample needs to be in order to obtain that confidence interval. If we know the margin of error (ME) for our proportion, we can use it to calculate the sample size.

Special -> the sample size n can be chosen using the ME. From the confidence interval formula, the ME is:

ME = z* √( p̂(1 − p̂) / n )

Rearranging, we can use the ME to calculate the sample size:

n = (z* / ME)² × p̂(1 − p̂)

Z-score: To test a sample proportion, we calculate its distance from the null hypothesis value. For hypothesis testing this is:

z = (p̂ − p0) / SE

Where:
• p̂ is the sample statistic that we are given
• p0 is the value we assume under the null hypothesis

Using the Central Limit Theorem, we calculate the SE by assuming the null hypothesis is true:

SE = √( p0(1 − p0) / n )

Therefore, when the Central Limit Theorem criterion is valid, the z-score statistic can be calculated using the above formulas.
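A minimal sketch of single-proportion inference in Python: CLT check, z*-based confidence interval, and a z-test against H0: p = 0.5. The counts are made up; assumes numpy and scipy.

```python
import numpy as np
from scipy.stats import norm

count, n = 58, 100            # made-up: 58 successes out of 100
p_hat = count / n

# CLT check: n*p-hat and n*(1 - p-hat) should both be >= 10
assert n * p_hat >= 10 and n * (1 - p_hat) >= 10

# 95% CI: p-hat +/- z* sqrt(p-hat(1 - p-hat)/n)
z_star = norm.ppf(0.975)
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(f"CI: {p_hat - z_star*se:.3f} to {p_hat + z_star*se:.3f}")

# Test H0: p = 0.5 (the SE uses p0 because we assume H0 is true)
p0 = 0.5
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
print("z =", round(z, 3), " two-tail p =", round(2 * (1 - norm.cdf(abs(z))), 4))
```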
When observing means, we are working with quantitative variables. To calculate the SE for a normal distribution we use:

SE = σ / √n

However, since in some cases we may not know the population standard deviation, we can estimate it from the sample standard deviation using inference. Therefore, our formula for the SE can also be:

SE = s / √n

t-distribution: Used when the SE is estimated from sample statistics. Since we are using sample statistics, the standardised statistic does not follow a normal distribution curve but instead a t-distribution curve. It is important to know that t curves look similar to normal distribution curves but have fatter tails due to the added level of uncertainty.

t-distribution -> Degrees of freedom distinguish the different t-distribution curves. We denote this df (degrees of freedom).
• as the degrees of freedom increase, the closer the distribution resembles a normal distribution curve

IMPORTANT: When we are using a sample mean and sample standard deviation, we associate the curve with a t-distribution that has n − 1 df.
• Hence for a sample mean and sample stdev, we use t(n−1)

Characteristics: The characteristics of using sample means with sample standard deviations are:
• Centre: the mean is the same as the population mean µ
• Spread: the spread of the data is given by the SE formula, where we use s for the sample standard deviation
• Shape: standardised sample means follow a t-distribution with n − 1 degrees of freedom, written t(n−1)
• When the sample size n is at least 30, the t(n−1) distribution is a good approximation; its tails are only a tiny bit fatter than a normal distribution curve

Confidence intervals can still be calculated with a t-distribution. Consequently, when using sample means with t-distributions, we are effectively using a normal-style interval with n − 1 degrees of freedom. The confidence interval for a single mean is:

x̄ ± t* × s/√n

Where:
• t* is the endpoint of a t-distribution with n − 1 degrees of freedom

Sample Size for a Confidence Interval
Similar to proportions, we can determine how large our sample should be using the margin of error (ME). From the confidence interval formula, our ME will be:

ME = t* × s/√n

In cases where we don't have a sample standard deviation, we can estimate one using any of the 4 following methods:
• Use previous sample data
• Use a small pilot sample to estimate s
• Find the range and divide it by 4
• Make any reasonable guess

Therefore, when we want to find the sample size for a given margin of error, we can use:

n = (z* × s̃ / ME)²

where s̃ is the estimated standard deviation.

Hence, when testing a claim about a population mean, we calculate the standardised statistic:

t = (x̄ − µ0) / (s/√n)

where:
• µ0 is the population mean stated in the hypothesis test

Using the Central Limit Theorem, we find this t-statistic and use it to test the null hypothesis. NOTE that the t-statistic plays the same role as a z-score: it gives us the p-value we need to either reject or not reject the null.
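A compact single-mean sketch in Python: a t-based confidence interval with n − 1 df and a two-tail t-test of H0: µ = 10. The sample and the null value are made up; assumes numpy and scipy.

```python
import numpy as np
from scipy.stats import t

x = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 10.9, 11.7, 9.5])  # made-up sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)
se = s / np.sqrt(n)

# 95% CI for a single mean: x-bar +/- t* s/sqrt(n), with n - 1 df
t_star = t.ppf(0.975, df=n - 1)
print(f"CI: {xbar - t_star*se:.2f} to {xbar + t_star*se:.2f}")

# Test H0: mu = 10 against Ha: mu != 10 (two-tail)
t_stat = (xbar - 10) / se
p_value = 2 * (1 - t.cdf(abs(t_stat), df=n - 1))
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```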
Difference in Proportions
To estimate a difference in population proportions p1 − p2, we use the sample statistic p̂1 − p̂2. From this estimate, the standard error for the difference in sample proportions is:

SE = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

Central Limit Theorem for a Difference in Proportions
Similar to a single proportion, we need to check a criterion to determine whether the sample sizes are large enough for the Central Limit Theorem. The criterion is essentially the same but is applied to each sample proportion:

n1·p̂1 ≥ 10, n1(1 − p̂1) ≥ 10, n2·p̂2 ≥ 10, n2(1 − p̂2) ≥ 10

When these criteria are sufficiently met, we have large enough sample sizes to invoke the Central Limit Theorem, and the difference in proportions follows the normal distribution:

p̂1 − p̂2 ~ N( p1 − p2, SE )

Using the above mean and standard error, we can calculate the confidence interval for the difference in proportions:

(p̂1 − p̂2) ± z* √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

When using hypothesis testing for a difference in proportions, the null hypothesis is:

H0: p1 = p2

Substituting this into the z-score formula gives:

z = ( (p̂1 − p̂2) − 0 ) / SE

NOTE: In this formula, the population proportions are equal under the null, so the null difference is 0, which is why it cancels out of the formula.

When calculating the standard error, we assume that the null hypothesis is true. HOWEVER, this makes the two population proportions equal, which creates a problem for calculating the SE. As a result, we use a pooled proportion to solve this and therefore correctly calculate the SE.

Pooled Proportion -> We combine both samples into one big sample and then calculate the proportion of successes within it. The pooled proportion p̂ allows us to calculate the SE properly. The z-score with the modified standard error is:

z = (p̂1 − p̂2) / √( p̂(1 − p̂)(1/n1 + 1/n2) )

NOTE: In this z-score formula, the SE uses the pooled p̂ instead of proportion 1 and proportion 2.

Difference in Means
Where the Central Limit Theorem criteria hold and our sample sizes are sufficient, the distribution is centred at the difference in means, and the SE is found using:

SE = √( σ1²/n1 + σ2²/n2 )

Therefore under a normal distribution we get:

x̄1 − x̄2 ~ N( µ1 − µ2, SE )

Using sample standard deviations means we estimate with a t-distribution, centred at the difference in population means, with SE:

SE = √( s1²/n1 + s2²/n2 )

For the degrees of freedom of the t-distribution, we use the sample that gives the lower degrees of freedom when finding the standardised test statistic. If our samples are large enough (each n ≥ 30), the confidence interval for the difference in means is:

(x̄1 − x̄2) ± t* √( s1²/n1 + s2²/n2 )

NOTE -> the t* is found using the sample with the smaller degrees of freedom.

The same setup is used for hypothesis testing with a null hypothesis of no difference in the means. The necessary steps to carry out a hypothesis test for a difference in means are:
1. Check the sample sizes obey the Central Limit Theorem
2. Take the lower degrees of freedom
3. Calculate the SE
4. Use any method to calculate the p-value
5. Conclude based on the p-value and the significance level

Matched Pairs
Using matched-pairs statistics, we take the difference within each pair and then carry out a single-mean test to find confidence intervals or p-values. After finding the mean difference x̄d from the matched pairs experiment, we can use:

t = (x̄d − µ0) / (sd/√nd)

Our hypothesis test then becomes -> H0: µd = value (usually 0).

Similarly, the confidence interval formula for a matched pairs experiment is:

x̄d ± t* × sd/√nd
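A matched-pairs sketch in Python: reduce the pairs to a single sample of differences, then run the single-mean t machinery on it. The before/after measurements are made up; assumes numpy and scipy.

```python
import numpy as np
from scipy.stats import t

# Made-up matched pairs: the same subjects measured before and after
before = np.array([12.0, 11.5, 13.2, 10.8, 12.9, 11.1])
after  = np.array([12.8, 12.0, 13.1, 11.9, 13.5, 11.6])

d = after - before                  # reduce matched pairs to one sample
n, dbar, sd = len(d), d.mean(), d.std(ddof=1)
se = sd / np.sqrt(n)

# Test H0: mu_d = 0; CI: d-bar +/- t* sd/sqrt(n), with n - 1 df
t_stat = dbar / se
p_value = 2 * (1 - t.cdf(abs(t_stat), df=n - 1))
t_star = t.ppf(0.975, df=n - 1)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"95% CI: {dbar - t_star*se:.2f} to {dbar + t_star*se:.2f}")
```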
Part 7 -> Inference for Regression

Regression: An equation that estimates the relationship between two quantitative variables. The least squares line for a sample was given by:

ŷ = b0 + b1·x

Although this was for a sample, we can extend it to find the regression line for a population:

y = β0 + β1·x + ε

Where:
• β0 is the intercept for the population
• β1 is the slope for the population regression
• ε is the random error for each data point, since points lie either above or below the line

Similar to means and proportions, we estimate the values of both the intercept and the slope, because we don't exactly know the population parameters β0 and β1.

Inference for the Intercept and Slope -> Confidence Intervals and Hypothesis Testing
Since we don't know the true values of the population intercept and slope, we use both confidence intervals and hypothesis tests to determine them:

Confidence interval: bᵢ ± t* × SE(bᵢ)
Test statistic: t = bᵢ / SE(bᵢ)

NOTE -> Since we are estimating two values (slope and intercept), we use a t curve with degrees of freedom df = n − 2.
NOTE -> To find the SE, we use either a randomisation or bootstrap distribution.

Hypothesis Testing: We generally use the following hypotheses:

H0: β1 = 0 (no linear relationship)
Ha: β1 ≠ 0 (a linear relationship exists)

Hypothesis Testing for Correlation -> Tests the linear association between two variables.
When testing the association between two variables without using regression, the appropriate t-statistic is:

t = r√(n − 2) / √(1 − r²)

which we compare against a t(n−2) distribution to determine the p-value.

Relationship between Correlation and Regression
As we know, a correlation value always falls between −1 and +1. In fact, when we square it, we obtain the coefficient of determination, and therefore a relationship between correlation and regression.

Definition: R² is the proportion of the variability in the response variable that is explained by the model (the explanatory variable). Since R² is a proportion, the formula is:

R² = SSModel / SSTotal

Conditions Criteria: The conditions that must be checked for regression inference to apply are:
1. The error values ε are randomly scattered above and below the line
2. The variability is constant along the line
3. The data shows a linear (not curved) pattern

y = model + error
Data = Model + Error

When observing the total variability, we can break the formula into sections in order to analyse each source of variability or error. This means splitting the total variability of y into:
1. Variability explained by the model
2. Error (unexplained) variability

This variability partitioning is expressed in the following formula:

SSTotal = SSModel + SSError
Total = Model + Error

These quantities are calculated by taking the sums of squared deviations:

SSTotal = Σ(y − ȳ)²
SSModel = Σ(ŷ − ȳ)²
SSError = Σ(y − ŷ)²
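A numeric sketch of this partition in Python on the same made-up data as earlier: it verifies SSTotal = SSModel + SSError and computes R² from the ratio. Assumes numpy.

```python
import numpy as np

x = np.array([2, 4, 5, 7, 8, 10, 11, 13])
y = np.array([1, 3, 4, 6, 8,  9, 12, 14])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sse   = np.sum((y - y_hat) ** 2)         # error (unexplained) variability
ssm   = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the model
sstot = np.sum((y - y.mean()) ** 2)      # total variability

print("SSTotal = SSModel + SSError:", round(sstot, 3), "=", round(ssm + sse, 3))
print("R^2 =", round(ssm / sstot, 4))    # equals the square of the correlation r
```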
To test whether or not the model is effective, we consider the mean square for the model and the mean square error for the variability of the data. Mathematically, the formulas for Mean Square Model and Mean Square Error are:

Mean Square Model = MSModel = SSModel / 1
Mean Square Error = MSE = SSError / (n − 2)

For a hypothesis test to check whether the model is effective or not, we define:

H0: the model is ineffective (slope = 0)
Ha: the model is effective (slope ≠ 0)

using the following F-statistic:

F = MSModel / MSE

From the response variable equation y, we assume the random errors ε occur by random chance, so the errors have a standard deviation. This means we can calculate the average amount by which values deviate from the regression line. There are two cases:
1. If this standard deviation is small, then our least squares line is quite accurate, as the residuals are very small
2. If this standard deviation is large, then our least squares line is not so accurate, as the residuals are very large

Mathematically, the formula for calculating the standard error of the residual errors within the model is:

sε = √( SSError / (n − 2) )

Similarly, we can also calculate the standard error of the slope using:

SE(b1) = sε / √( Σ(x − x̄)² )

NOTE: An ANOVA table will give us these values; however, we need to be able to locate and interpret them.

REGRESSION CAUTIONS
When using regression modelling, we must be careful of the following:
1. Don't use the regression equation to predict for x values outside the scope of the data
2. Always plot your data! The regression equation should only be used where there is some linear association between the quantitative variables
3. Be careful of outliers, as they can heavily influence the regression line
4. Only randomised experiments allow a causal claim that a change in x produces a change in y

Part 8 -> Probability

Probability: The probability of an event occurring is the proportion of times it occurs. Therefore probabilities always satisfy:

0 ≤ probability of an event ≤ 1

Throughout this section, we will be looking at cases where outcomes are equally likely; that is, every outcome has the same chance of being chosen (e.g. a fair die, a coin toss). The probability of an event with equally likely outcomes is calculated by:

P(event) = (number of outcomes in the event) / (total number of outcomes)

Sometimes we can use a Venn diagram to determine the probability of an event occurring. Using the diagram below:
• blue dots represent all the outcomes that can occur
• all the blue dots inside A's red circle represent P(event A) occurring
• all the dots outside the circle represent P(not event A), i.e. A doesn't occur

Using probability rules, we can find the probability of any given combination of events. However, these can sometimes be hard to distinguish. The table below summarises the differences between the combinations.

Rule 1 -> Additive Rule (A or B)
The additive rule gives the probability that event A occurs or event B occurs (or both). We subtract the probability that both occur so that the overlap is not counted twice. Mathematically, the formula for the additive rule is:

P(A or B) = P(A) + P(B) − P(A and B)

EXCEPTION: For disjoint events, where the 2 events have no common outcome, we can simplify the additive rule. Mathematically, the formula for disjoint events is:

P(A or B) = P(A) + P(B)

Diagram: The Venn diagram on the right shows that where the additive rule is used, we must subtract the overlap: it is the yellow circle plus the purple circle minus the middle overlap.
Rule 2 -> Complement Rule (Not A)
The complement rule gives the probability that an event doesn't occur. Mathematically, the formula is:

P(not A) = 1 − P(A)

Diagram: The diagram on the right shows the probability that event A doesn't occur.

Rule 3 -> Conditional Probability
Conditional probability is the probability of A given that we know B occurred. This can be expressed as:
• the probability of A if we know B
• the probability of A given B

Mathematically, the formula is:

P(A | B) = P(A and B) / P(B)

Rule 4 -> Multiplication Rule (And Rule)
The multiplication rule gives the probability of 2 events occurring together: the probability of A occurring times the probability of B occurring given that A has already occurred. Mathematically:

P(A and B) = P(A) × P(B | A)

Special Case -> Independent Events
When using the multiplication rule, we can sometimes have independent events. Independent events are events where the occurrence of A does not influence the probability of B. Mathematically, this simplifies the conditional rule and multiplication rule to:

Conditional rule -> P(A | B) = P(A)
Multiplication rule -> P(A and B) = P(A) × P(B)

Difference between disjoint and independent?
Disjoint: A disjoint pair of events has no common outcome or overlap. Therefore in 1 trial, only 1 of the outcomes can occur.
Independent: Independent events can overlap, but one event occurring gives no information about whether the other occurs.

Diagram: Summary of probability rules for any 2 events occurring.

Law of Total Probability: For disjoint events B1, ..., Bk covering all possibilities, the probability that an event A occurs is the sum over all the ways it can occur. Mathematically:

P(A) = P(A and B1) + P(A and B2) + ... + P(A and Bk)

When looking at probabilities of one event and probabilities of conditional events, we can easily organise these into tree diagrams. In the tree diagram:
• the first set of branches gives us the probability of each initial event occurring
• the next set of branches gives the conditional probabilities of events occurring (that is, the probability of an event occurring after the first branch)

Bayes Rule
For conditional probability, instead of using a tree diagram we can use Bayes rule, which is a quicker method. Mathematically, Bayes rule for any 2 events is:

P(A | B) = P(B | A) × P(A) / P(B)

Extending this with the law of total probability, the conditional probability over 3 or more events B1, ..., Bk is:

P(Bi | A) = P(A | Bi) P(Bi) / [ P(A | B1) P(B1) + ... + P(A | Bk) P(Bk) ]
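A worked Bayes-rule example in plain Python with made-up numbers (a test for some condition; all probabilities are illustrative). It follows the tree-diagram logic: law of total probability for P(A), then Bayes rule.

```python
# Made-up inputs: a test (A = positive) for a condition (B)
p_b    = 0.02   # P(B): condition present
p_a_b  = 0.95   # P(A | B): test positive given condition
p_a_nb = 0.10   # P(A | not B): false positive rate

# Law of total probability: P(A) = P(A and B) + P(A and not B)
p_a = p_a_b * p_b + p_a_nb * (1 - p_b)

# Bayes rule: P(B | A) = P(A | B) P(B) / P(A)
p_b_a = p_a_b * p_b / p_a
print(f"P(B | A) = {p_b_a:.3f}")   # about 0.162
```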
Random Variables: As we already know from HSC Mathematics, a random variable is a value that can change for each scenario or random sample/trial. However, we can classify random variables further as either:
• Discrete variables: a set number of values
• Continuous variables: an infinite number of values

Discrete Variable: A random variable that has a definite or set number of values. Generally, these variables are written with {} to signify that only the values within the brackets can occur. Some examples are:
• A die roll: {1, 2, 3, 4, 5, 6}
• The number of females in a class
• The sum of two dice: {2, 3, ..., 12}

Continuous Variables: Continuous variables are those that can take on any value within an interval. Unlike discrete variables, continuous variables don't have a fixed set of values, as they can take on any value ONLY IF it is within the defined interval. Some examples are:
• Weight
• Height

Probability Function for Discrete Variables
Notation: The probability of a certain discrete value occurring is denoted p(x) (or P(EVENT)).
Sum: For discrete variables, the sum of all probabilities must always equal 1. Mathematically:

Σ p(x) = 1

Mean of a Random Variable
For a certain random variable, if we know its probability function, we can calculate its mean. This is done in the following manner:
• multiply each discrete value by its corresponding probability
• add up all of these products
Notation: The mean of a random variable with a probability function is denoted µ. Mathematically:

µ = Σ x · p(x)

Standard Deviation
Using the random variable and its probability function, we can also calculate the standard deviation. This is done in the following manner:
• multiply each squared deviation from the mean by the probability, i.e. (x − µ)² · p(x)
• take the sum of all these products to get the variance
• take the square root to get the standard deviation

Mathematically, the standard deviation for probability functions is calculated by:

Variance = σ² = Σ(x − µ)² · p(x) and σ = √σ²

Binomial Probability
Binomial probability looks at the idea of success and failure in probability: what is the probability an event does or doesn't work.

Binomial Random Variable: A binomial random variable counts the number of times an event is successful. Its characteristics are:
• n is the number of trials
• p is the probability of success on any single trial
• each trial of a binomial random variable is independent of the others

GENERAL RULE -> The general rule of binomial probability is the idea of success or failure, where an event either occurs or it doesn't occur.

The mathematical formula for binomial probability is:

P(X = k) = (n choose k) pᵏ (1 − p)ⁿ⁻ᵏ

The notation for the formula is:
• k = number of times that the trial is successful
• n = number of trials that occur
• p = probability that the event occurs

Mean of a Binomial Random Variable
For a certain binomial random variable, we are able to calculate the mean. Mathematically:

µ = np

The notation for this formula is:
• µ = mean of the binomial random variable
• n = number of trials that occur
• p = probability of success on each trial

Standard Deviation of a Binomial Random Variable
For a certain binomial random variable, we can also calculate the standard deviation. Mathematically:

σ = √( np(1 − p) )

NOTE -> The notation is the same for the mean as well as the standard deviation.
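A final sketch of the binomial formulas in Python, with made-up n, p and k (any values illustrate the same formulas). Assumes numpy and scipy.

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3                      # made-up: 20 trials, success probability 0.3

mu    = n * p                        # mean of a binomial random variable
sigma = np.sqrt(n * p * (1 - p))     # standard deviation

k = 8                                # made-up number of successes
print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")   # (n choose k) p^k (1-p)^(n-k)
print(f"mu = {mu}, sigma = {sigma:.3f}")
```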