ISDS 361A - Final Cheat Sheet
Business Analytics I (California State University, Fullerton)

Statistics – a way to get information from data

Scales of measurement:
Nominal – qualitative or categorical; labels are used to denote the categories
Ordinal – same as nominal, but the ordering or ranking is meaningful
Interval – quantitative or numerical in nature (SAT scores, grades, etc.)

Descriptive statistics – summary of important aspects of a data set (includes collecting data, organizing the data, presenting it with charts, etc.)
Business statistics – analyzing the data
Parameter (Variable) – a descriptive measure of the population that is of interest
Statistic – a descriptive measure that is calculated from the sample

Sampling: examine part of the population, because examining the whole is impractical, prohibitive, or costly
Randomized sample – every item in the population has an equal chance of being in the sample

Non-random sampling errors:
Selection bias – one subset has an unequal chance of being selected
Non-response bias – data is unavailable or unattainable
Measurement errors – inaccuracies in getting/recording data, ambiguous questions, etc.

Symmetrical (normal distribution):
Single mode
Mean is recommended
Mean = Median = Mode
50% of the probability on each side of the mean

Skewed to the left (negatively skewed):
Single mode
Median is the better measure
Mean < Median
Q2 close to Q3

Bi-modal – 2 modes

Empirical Rule (symmetrical distribution):
68% within 1 SD
95% within 2 SD
99.7% (nearly 100%) within 3 SD

Coefficient of variation (C.V.) – used to compare variability when 2 population means are not equal (C.V. = SD / mean)

Data – facts and figures.
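The empirical rule and the coefficient of variation above can be checked numerically. The following is a minimal Python sketch using only the standard library; the helper names and the two sample lists are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist, mean, stdev


def within_k_sd(k: float) -> float:
    """Probability that a normal variable falls within k SDs of its mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)


def coefficient_of_variation(data) -> float:
    """Sample SD divided by the sample mean (relative variability)."""
    return stdev(data) / mean(data)


# Hypothetical samples whose means differ by a factor of 100
small_orders = [8, 10, 12, 9, 11]
large_orders = [800, 1000, 1200, 900, 1100]

# Empirical rule: about 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3
empirical = {k: round(within_k_sd(k), 4) for k in (1, 2, 3)}

# The SDs differ by a factor of 100, but the relative spread is the same
cv_small = coefficient_of_variation(small_orders)
cv_large = coefficient_of_variation(large_orders)

print(empirical)
print(cv_small, cv_large)
```

Because the two samples have different means, comparing their standard deviations directly would be misleading; the C.V. puts them on the same relative scale.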
Decision Science – getting info from data
Cross-sectional data – collected at the same point in time
Time series data – collected over several time periods
Inferential statistics – goes beyond the data to draw conclusions about a population based on sample data
Population – the set of items under study
Random sample – a random subset chosen from the population
Purpose of inferential stats – to make inferences about a parameter of a population based on information obtained from a statistic of the sample

Sources of statistical data:
Designed experiment
Public source
Survey
Observational studies

Shape (distribution) – histogram (symmetry, skewness, modality)
Location – central tendency (mean, median, mode)
Variability / spread – range, IQR, variance, standard deviation

Skewed to the right (positively skewed):
Single mode
Median is the better measure
Mean > Median
Q2 close to Q1

Range (Max – Min) – best for limited data (under 10 values); only uses 2 values
Interquartile range – IQR = Q3 – Q1; measures variability of the middle 50% of the data
Variance – not directly meaningful (squared units)
Standard deviation (σ) – uses all of the data (most efficient)
Outlier limits – below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)
Chebyshev's Rule (non-symmetrical distribution) – at least 1 - 1/k^2 of the data lies within k SD of the mean (k > 1)

Matched pair, with population std.:
Test stat =(Xbar1-Xbar2)/SQRT((std1^2/samp1)+(std2^2/samp2))
p-value =1-NORM.S.DIST(Test Stat,TRUE)

Random variable – a numerical description of the outcome of an event
Probability distribution – the collection of all possible values of the random variable X and the associated probabilities P(X = x)
Continuous – fractions allowed (time, height, weight); any value within an interval, so the probability of being exactly any single value is 0 (361 always uses continuous)
Discrete – no fractions, whole numbers; a countable number of possible values, e.g. P(X = 1)

Higher standard deviation – wider spread
Lower standard deviation – narrower spread
Standard normal distribution – mean is 0 and standard deviation is 1

EXCEL COMMANDS
Location:
Mean =AVERAGE(range of data)
Median =MEDIAN(range of data)
Mode (single) =MODE.SNGL(range of data)
Mode (multiple) =MODE.MULT(range of data)
Variability:
Range =MAX(range of data)-MIN(range of data)
IQR =QUARTILE.EXC(range of data,3)-QUARTILE.EXC(range of data,1)
Standard deviation =STDEV.P(range of data) for population; =STDEV.S(range of data) for sample

If Test Stat < C.V., do not reject H0

Example: 3 treatments, 10 observations each, SSE = 399.6. MSE? 399.6 / (3*10 - 3) = 14.8

Example: for a random variable X with mean = 10 and variance = 25, P(X > 20)? Z = (x - mean)/SQRT(var) = (20 - 10)/5 = 2, so P(X > 20) =1-NORM.S.DIST(2,TRUE) ≈ 0.0228 (assuming X is normal)

Inferential stats goes beyond the data at disposal; descriptive does not.

Sampling Distribution, Sample Mean (x̄)
1. Convert x̄ into: Z = (x̄ - µ) / SE
   a. SE of x̄ = σ/SQRT(n)
2. Solve for probability
   a. P(x̄ < x1) =NORM.S.DIST(Z,TRUE)
   b. P(x̄ > x1) =1-NORM.S.DIST(Z,TRUE)
   c. (between) P(x1 < x̄ < x2) =NORM.S.DIST(Z2,TRUE)-NORM.S.DIST(Z1,TRUE); for an interval within ±d of the mean, =NORM.S.DIST(d/SE,TRUE)-NORM.S.DIST(-d/SE,TRUE)

Interval Estimation of p (proportions %)
1. Find: CV =-NORM.S.INV(α/2)
2. Find: SE of p̂ =SQRT(p̂*(1-p̂)/n)
3. Find confidence interval (UL and LL) of p with p̂ ± ME
   a. ME = CV * SE of p̂

Sampling Distribution, Sample Proportion (p̂)
1. Find: p̂ = x/n
2. Find: SD of p̂ =SQRT(p*(1-p)/n)
3. Check distribution: when n*p ≥ 5 and n(1-p) ≥ 5, the distribution is approximately normal
4. Same "P rules" as finding probabilities for the normal distribution (NORM.S.DIST functions)

Sample size in interval estimation of p
1. Find: Zα/2 =NORM.S.INV(α/2) (negative, but the sign vanishes when squared)
2. Determine ME (desired)
3. Solve: n = (Zα/2 / ME)^2 * p̂(1 - p̂), rounded up
   e.g. α = 0.05, ME = 0.06, p̂ = 0.47: =((NORM.S.INV(0.05/2)/0.06)^2)*(0.47*(1-0.47)) ≈ 265.8, so n = 266

Type I Error – incorrectly rejecting H0 when H0 is true
Type II Error – incorrectly accepting / not rejecting H0 when H0 is false

Null Hypothesis (H0): tentative assumption that can possibly be disproved using sample evidence; always contains =, ≥, or ≤
Alternative Hypothesis (Ha): opposite of the null, often what the test attempts to establish
If x̄ > the hypothesized value in H0, use everything to the right (right-tail area) in Excel
If x̄ < the hypothesized value in H0, use everything to the left (left-tail area) in Excel

Hypothesis Testing (known SD)
1. Determine H0 and Ha
2. Find: SE of x̄ = σ/√n
   a. =(σ/SQRT(n))
3. Find: Z = (x̄ - µ0) / SE of x̄, where µ0 is the hypothesized mean
4. Find: CV =NORM.S.INV(α) (left) or =-NORM.S.INV(α) (right); use α/2 for two-tailed
5. Find: p-value =1-NORM.S.DIST(Z,TRUE) for a right-tailed test, e.g. =1-NORM.S.DIST(6.45,TRUE)
6. Reject H0 if Z > CV (right tail) or Z < -CV (left tail), with CV taken as a positive number

Interval Estimation of µ with known population σ
1. Find: CV =-NORM.S.INV(α/2)
2. Find: SE of x̄ = pop σ/√n
   a. =σ/SQRT(n)
3. Find confidence interval (UL and LL) of µ with x̄ ± ME
   a. ME = CV * SE of x̄
   b. ME =CONFIDENCE.NORM(α,σ,n)

Recommended sample size, in interval estimation when σ is known
1. Find: Zα/2 =NORM.S.INV(α/2)
2. Determine ME (desired margin of error)
3. Solve: n =SD^2*(Z/ME)^2, rounded up

Interval Estimation of µ with unknown population σ
1. Find: CV =-T.INV(α/2,n-1)
2. Find: SE of x̄ = s/√n
   a. =s/SQRT(n)
3. Find confidence interval (UL and LL) of µ with x̄ ± ME
   a. ME = CV * SE of x̄, or =CONFIDENCE.T(α,s,n)

Hypothesis Testing (P-Value Method, average)
1. Determine H0 and Ha
2. Find: SE of x̄ = σ/√n
3. Test statistic: Z = (x̄ - µ0)/SE, where µ0 is the hypothesized mean
4. Find:
   a. P-value =1-NORM.S.DIST(Z,TRUE) for P(Z > Zobs) test
   b. P-value =NORM.S.DIST(Z,TRUE) for P(Z < Zobs) test
5. Reject H0 if P-value < α (equivalently, if Z is beyond the standard C.V.)

Hypothesis Testing (Population Proportion %)
1. Determine H0 and Ha
2. Find: SE of p̂
   a. =SQRT((p0*(1-p0))/n), where p0 is the hypothesized proportion
3. Test statistic: Z = (p̂ - p0)/SE
4. Find:
   a. P-value =1-NORM.S.DIST(Z,TRUE) for P(Z > Zobs) test (upper tail)
   b. P-value =NORM.S.DIST(Z,TRUE) for P(Z < Zobs) test (lower tail)
5. Reject H0 if P-value < α

Hypothesis Testing (µ Differences Known SD) (matched samples)
Step 1 – Find critical value (Zα/2) =NORM.S.INV(α/2), i.e. for 95% confidence it's =NORM.S.INV(.05/2)
At 90% confidence, it will always be ± 1.65
At 95% confidence, it will always be ± 1.96
At 98% confidence, it will always be ± 2.33
At 99% confidence, it will always be ± 2.58
Step 2 – Find margin of error
   a. ME =CV*(σ/SQRT(n)); the limits are x̄ + ME and x̄ - ME
   b. e.g. CV =-NORM.S.INV(0.025) = 1.96

Sample proportion p̂ is normal when: np ≥ 5 and n(1-p) ≥ 5
Margin of error of a proportion =Z*(SQRT(P*(1-P)/n))
If sample size increases, the SE shrinks, so the probability that the sample statistic falls within a fixed distance of the true value increases. If sample size decreases, that probability decreases.

1. Find P-value =NORM.S.DIST(Z,TRUE)
2. Find CV =NORM.S.INV(α/2)
3. Test statistic =(x̄-µ)/(σ/SQRT(n))

Value on the horizontal axis (from a probability) =NORM.INV
Area under the curve (probability from a value) =NORM.DIST
If you know the population SD, it's a Z test. If you do not know the population SD, it's a T test.

In ANOVA, treatments refer to different levels of a factor.
Between groups – treatments
Within groups – error
If F is less than the F critical value, do not reject H0
If P-value > alpha, do not reject H0
P-value =1-F.DIST(F,df1,df2,TRUE)
CV =F.INV(.95,df(treatments),df(error)) for α = .05

A descriptive measure of linear association between two variables is the correlation coefficient.
The mathematical solution procedure is called the least squares method.
R^2 is the coefficient of determination; the higher the R^2, the better the model.

Central Limit Theorem – even if X does not have a normal distribution, x̄ will be approximately normal if n ≥ 30. The only other way for x̄ to be normal is if X is normal.
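The z-test steps for a mean with known population SD (SE, test statistic, critical value, p-value, decision) can be sketched in Python with the standard library's NormalDist, which plays the role of NORM.S.DIST and NORM.S.INV. This is a minimal sketch: the function name `z_test` and the sample numbers are hypothetical, not taken from the course.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha=0.05, tail="right"):
    """One-sample z-test for a mean when the population SD is known."""
    z_dist = NormalDist()              # standard normal
    se = sigma / sqrt(n)               # SE of x-bar = sigma / SQRT(n)
    z = (xbar - mu0) / se              # test statistic
    if tail == "right":
        cv = z_dist.inv_cdf(1 - alpha)          # same value as -NORM.S.INV(alpha)
        p_value = 1 - z_dist.cdf(z)             # 1-NORM.S.DIST(Z,TRUE)
    elif tail == "left":
        cv = z_dist.inv_cdf(alpha)              # NORM.S.INV(alpha), negative
        p_value = z_dist.cdf(z)                 # NORM.S.DIST(Z,TRUE)
    elif tail == "two":
        cv = z_dist.inv_cdf(1 - alpha / 2)      # -NORM.S.INV(alpha/2)
        p_value = 2 * (1 - z_dist.cdf(abs(z)))  # both tails
    else:
        raise ValueError("tail must be 'right', 'left' or 'two'")
    return {"se": se, "z": z, "cv": cv, "p_value": p_value,
            "reject_h0": p_value < alpha}


# Hypothetical numbers: H0: mu <= 50, Ha: mu > 50
result = z_test(xbar=52, mu0=50, sigma=6, n=36, alpha=0.05, tail="right")
print(result)
```

With these inputs SE is 1 and Z is 2, which exceeds the right-tail critical value of about 1.645, so the p-value rule and the critical-value rule both reject H0.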
ANOVA example:
Factor – car waxes (top column)
Treatments – 3 types of car waxes (# of columns)
Experimental units – cars
Response variable – # of washes

Estimated regression equation: ŷ = b0 + b1x
ŷ is the predicted value of the dependent variable
b0 is the estimate of B0
b1 is the estimate of B1
x is the independent variable
Regression analysis is a statistical procedure that describes how one dependent variable and one or more independent variables are related.

Paired t-test in Excel
1. Enter the information in Excel
2. Go to Data – Data Analysis (must be enabled)
3. Choose "t-Test: Paired Two Sample for Means"
   a. Var 1 range: choose first column (w/header)
   b. Var 2 range: choose second column (w/header)
   c. Hypothesized Mean Difference = 0
   d. Click box for Labels (if you included the header)
   e. Alpha will be given (.05 or .01)
   f. Output range: click any empty cell in Excel
   g. You have your t stat; determine right tail or left tail and you have your critical value & p-value

Hypothesis Testing (Unknown SD)
1. Determine H0 and Ha
2. Find: SE of x̄ = s/√n =s/SQRT(n)
3. Find: T = (x̄ - µ) / SE of x̄ =(x̄-µ)/SE
4. Find: CV =T.INV(α,n-1) (left) or =-T.INV(α,n-1) (right) or ± T.INV(α/2,n-1) for 2-tailed
5. P-value =T.DIST(T,n-1,TRUE) for P(t < T) test
6. P-value =1-T.DIST(T,n-1,TRUE) for P(t > T) test
7. P-value =T.DIST.2T(ABS(T),n-1) for two-tailed test
8. Reject H0 if P-value < α

SE of x̄ =(σ/SQRT(n))
SE of p̂ =SQRT(P*(1-P)/n)

Hypothesis Testing (µ Differences Unknown SD) (chart)
1. Use spreadsheet from Canvas with SE
2. Determine H0 and Ha (hypothesized difference is always 0)
3. SE of X1-X2
   a. =SQRT((s1^2/n1)+(s2^2/n2))
   b. Test stat =(x1-x2)/SE (SE from Excel)
   c. P-value =1-T.DIST(T,DF,TRUE) (right tail)
   d. The degrees of freedom (DF) are on the spreadsheet.

Regression: Y = B0 + B1X + E
Y = dependent variable (the one we'd like to predict)
B0 = intercept
B1 = slope
X = independent variable
E = error term (unexplained variation)

Regression test of the slope
H0: B1 = 0
Ha: B1 ≠ 0
T-stat = (b1 - B1) / se(b1), which is b1 / se(b1) under H0
P-value = 2*P(t > |t-stat|) or =2*(1-T.DIST(ABS(t-stat),df,TRUE)), where df is the error degrees of freedom (n - 2 with one independent variable)

ANOVA with blocks: rows (blocks), columns (treatments)
Test stat for treatments: F
C.V. = F.INV
P-value = 1-F.DIST; if P-value > alpha, do not reject
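The F test statistic for treatments (between groups over within groups) can be computed by hand as MSTR / MSE. Below is a minimal Python sketch for the one-way case; the function name and the three small wax samples are hypothetical. Python's standard library has no F distribution, so the critical value and p-value are assumed to come from Excel's F.INV and F.DIST as listed on this sheet.

```python
def one_way_anova(groups):
    """Return SSTR, SSE, MSTR, MSE and F for a one-way ANOVA."""
    k = len(groups)                                  # number of treatments
    n_total = sum(len(g) for g in groups)            # total observations
    grand_mean = sum(x for g in groups for x in g) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between groups (treatments)
    sstr = sum(len(g) * (m - grand_mean) ** 2
               for g, m in zip(groups, group_means))
    # Within groups (error)
    sse = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g)

    mstr = sstr / (k - 1)          # df for treatments = k - 1
    mse = sse / (n_total - k)      # df for error = n_total - k
    return {"sstr": sstr, "sse": sse, "mstr": mstr, "mse": mse,
            "f": mstr / mse}


# Hypothetical response values for 3 treatments (e.g. 3 car waxes)
waxes = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
anova = one_way_anova(waxes)
print(anova)

# The sheet's own MSE example: 3 treatments, 10 observations each, SSE = 399.6
mse_from_sheet = 399.6 / (3 * 10 - 3)
print(mse_from_sheet)
```

For the hypothetical data, SSTR is 26 and SSE is 6, giving MSTR = 13, MSE = 1 and F = 13; that F would then be compared with =F.INV(.95,2,6) in Excel.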