Business Statistics 1

Authors: Alma Jismyr, Qi Peng, Elina Hakimzadeh

Table of Contents
1. Graphical Analysis
2. Measures of Location and Spread
3. Confidence Interval and Hypothesis Test
4. Repeated Hypothesis Test by Splitting the Data
5. ANOVA

1. Graphical Analysis

a. What is the proportion of movies available on Hulu in the dataset?
12.07%, according to Figure 1.

Figure 1 – Movies available on Hulu

b. What is the mode for the “Hulu” variable and what does it mean?
The mode is 0 according to Table A. This means that 0 is the most frequent value, i.e. most movies in the dataset are not available on Hulu.

Table A – Mode for Hulu

c. Briefly discuss whether the mode is a good measure of central tendency for the variable "Hulu"?
Yes, it is a good measure of central tendency for this variable. “Hulu” is a binary categorical variable, so the mode can be interpreted directly: it identifies the most common category.

d. What is the mode of the “Age” variable for each of the four streaming platforms?
Hulu: 5 (Table B)
Prime Video: 5 (Table C)
Disney: 5 (Table D)
Netflix: 5 (Table E)

Table B – Mode of the age for Hulu
Statistics
             Age     Hulu
N Valid      4227    4227
N Missing    0       0
Mode         5.00    0

Table C – Mode of the age for Prime Video
Statistics
             Age     PrimeVideo
N Valid      4227    4227
N Missing    0       0
Mode         5.00    0

Table D – Mode of the age for Disney
Statistics
             Age     Disney
N Valid      4227    4227
N Missing    0       0
Mode         5.00    0

Table E – Mode of the age for Netflix
Statistics
             Age     Netflix
N Valid      4227    4227
N Missing    0       0
Mode         5.00    0

2. Measures of Location and Spread

a. At what measurement of scale (i.e., ratio, interval, ordinal or nominal scale) are the variables measured?
Discuss the three variables: “Age”, “Netflix” and “Runtime”.
Age: Ratio scale, since age has an absolute zero point.
Netflix: Nominal scale, since Netflix can be categorized but lacks a natural ranking.
Runtime: Ratio scale, since runtime has an absolute zero point and both differences and ratios of runtimes are meaningful.

b. Are all three measures of central tendency (mode, median & mean) relevant for all three variables? Discuss each variable (“Age”, “Netflix” and “Runtime”) and each measure of tendency: for instance: Age: mean can/cannot be used here because...median can/cannot be used because...
For “Age”, all three measures of central tendency can be used since it is measured on a ratio scale.
For “Netflix”, only the mode is relevant, as it is a nominal categorical variable without order.
For “Runtime”, both the mean and the median are relevant, while the relevance of the mode depends on how often individual values recur in the dataset.

Table F – Measures of central tendency
Statistics
                  Age        Netflix    Runtime
N Valid           4227       4227       4207
N Missing         0          0          20
Mean              3.4845     0.35       99.251
Median            3.0000     0.00       97.000
Mode              5.00       0          90.0
Std. Deviation    1.50262    0.478      22.6142
Variance          2.258      0.229      511.402
Range             4.00       1          180.0
Minimum           1.00       0          30.0
Maximum           5.00       1          210.0

c. Are the measures of variability (variance, standard deviation & range) relevant for all three variables? Discuss each variable (“Age”, “Netflix” and “Runtime”) and each measure of variability.
Age: Variance, standard deviation and range are relevant measures of variability because age is measured on a ratio scale with meaningful numerical values (see Table F).
Netflix: None of these measures (variance, standard deviation or range) is relevant, because "Netflix" is a nominal categorical variable whose 0/1 codes are labels rather than quantities (see Table F).
Runtime: Variance and standard deviation are relevant measures of variability for Runtime, because its numerical values are meaningful quantities.
The range is also relevant, but because it uses only the minimum and maximum values it is a cruder measure of spread than the variance and standard deviation (see Table F).

d. Are there missing data? If so, how many, and for which variables (consider only “Age”, “Netflix” and “Runtime”)?
There are 20 missing values for Runtime according to Table F. Age and Netflix have no missing values.

3. Confidence Interval and Hypothesis Test

a. Hypothesis
H0: μ1 ≥ μ2
H1: μ1 < μ2
μ1 = the mean IMDB score for movies released before 2001
μ2 = the mean IMDB score for movies released after 2001

(Numbers from Table G)
T-value = 3.979
P-value = 0.001
Critical value = 2.326 (one-tailed, α = 0.01)

Conclusion:
P-value < α, therefore we reject the null hypothesis H0.
T-value > critical value, therefore we reject the null hypothesis H0.
Furthermore, we can conclude that the mean IMDB score for movies released before 2001 is lower than the mean score for movies released after 2001.

Table G – Independent Samples Test

b. Calculate by hand based on the output in the contingency table a 95% confidence interval of the share of good movies on Disney+. Show your calculations and interpret the confidence interval.
(The sample proportion comes from the calculation in question c.)
α = 0.05
p = 0.332
n = 517
z_{α/2} = z_{0.025} = 1.96

p ± z_{α/2} · √(p(1 − p)/n)
= 0.332 ± 1.96 · √(0.332 · (1 − 0.332)/517)
= 0.332 ± 1.96 · √(0.332 · 0.668/517)
= 0.332 ± 1.96 · √0.00042897
= 0.332 ± 1.96 · 0.0207
= 0.332 ± 0.0406
Upper limit: 0.332 + 0.0406 = 0.3726
Lower limit: 0.332 − 0.0406 = 0.2914

In conclusion, we can be 95% confident that the share of good movies on Disney+ is between 0.2914 and 0.3726, i.e. between roughly 29% and 37%.

c. By hand, conduct a hypothesis test on the 90% confidence level (i.e., α = 0.10) to test the null hypothesis that the share of good movies on Disney+ is at most 28% vs. the alternative hypothesis that the share of good movies on Disney+ is greater than 28%. State the hypotheses, significance level, etc. and show your calculations. Use both the p-value- and the critical value approach. State your conclusions based on the hypothesis test.
H0: π ≤ 0.28
H1: π > 0.28
α = 0.10
p = x/n = 172/517 = 0.332 (numbers from Table H)

z = (p − π0) / √(π0(1 − π0)/n)
= (0.332 − 0.28) / √(0.28 · 0.72/517)
= (0.332 − 0.28) / √(0.2016/517)
= 2.63

P-value: P(Z > 2.63) = 1 − P(Z ≤ 2.63) = 1 − 0.9957 = 0.0043
Since 0.0043 < 0.10, we reject H0.

Critical value: with df = 517 − 1 = 516, the critical value can be read from the infinity row of the t distribution table. For a one-tailed test at α = 0.10 it is 1.282.
Since 2.63 > 1.282, we reject H0.

In conclusion:
Since z > critical value, we reject the null hypothesis.
Since P-value < α, we reject the null hypothesis.

Table H – Number of good movies on Disney+

d. Based on the result, would you advise the enthusiast to subscribe to Disney+?
Based on the calculations, we reject the hypothesis that the share of good movies on Disney+ is at most 28%, so we would advise the enthusiast to subscribe to Disney+.

4. Repeated Hypothesis Test by Splitting the Data

a. Calculate the t-statistics by hand for one streaming platform and check that SPSS provided the correct value of the t-statistics. Use the values available in the descriptive table provided in SPSS. You can double click on values in the table to see more decimals.
H0: μ1 = μ2
H1: μ1 ≠ μ2
H0: SPSS provided the correct value of the t-statistic
H1: SPSS provided an incorrect value of the t-statistic

Platform: Hulu
Standard Error Mean: 0.0469 (Table I)
Mean difference: −0.0231 (Table J)
t = mean difference / standard error mean = −0.0231/0.0469 = −0.4925 ≈ −0.493

In conclusion, the t-statistic calculated by hand equals the t-statistic reported by SPSS, so we conclude that SPSS provided the correct value.

Table I – One-sample statistics for all the platforms
Table J – One-sample test for all the platforms

b. Test H0: μ = 6.25 against H1: μ ≠ 6.25 on the 5% level (i.e., α = 0.05) by comparing the t-statistic to the critical value for all platforms. The critical values are found in the distribution table uploaded on Canvas; it is not a part of the SPSS output.
H0: μ = 6.25
H1: μ ≠ 6.25
(Numbers from Table J)

The test is two-tailed, so H0 is rejected when |t| exceeds the critical value t_{α/2}. For every platform the degrees of freedom are large enough to use the infinity row of the table, which gives t_{0.025} = 1.960.

Netflix:
df = 1497
critical value = 1.960
t = 1.803
|t| < critical value. Therefore, we fail to reject the null hypothesis: there is not sufficient evidence that the population mean for Netflix differs from 6.25.

Hulu:
df = 509
critical value = 1.960
t = −0.493
|t| < critical value. Therefore, we fail to reject the null hypothesis: there is not sufficient evidence that the population mean for Hulu differs from 6.25.

Prime Video:
df = 1701
critical value = 1.960
t = −14.236
|t| > critical value. Therefore, we reject the null hypothesis that the population mean for Prime Video is equal to 6.25.

Disney+:
df = 516
critical value = 1.960
t = 4.045
|t| > critical value. Therefore, we reject the null hypothesis that the population mean for Disney+ is equal to 6.25.

c. Test H0: μ = 6.25 against H1: μ ≠ 6.25 on the 5% level (i.e., α = 0.05) by comparing the p-value (given in the table by SPSS) to α. No calculations required.
(Numbers from Table J)
Netflix: p-value = 0.072, p-value > α
Hulu: p-value = 0.622, p-value > α
Prime Video: p-value < 0.001, p-value < α
Disney+: p-value < 0.001, p-value < α

d. Briefly comment on the result for each streaming platform. What can you say?
Netflix: We fail to reject the null hypothesis, since there is not sufficient evidence to conclude that the average IMDB score for Netflix differs from 6.25.
Hulu: We fail to reject the null hypothesis, since there is not sufficient evidence to conclude that the average IMDB score for Hulu differs from 6.25.
Prime Video: We reject the null hypothesis, because there is sufficient evidence to conclude that the average IMDB score for Prime Video is not equal to 6.25. The negative t-statistic indicates that the sample mean lies below 6.25.
Disney+: We reject the null hypothesis, because there is sufficient evidence to conclude that the average IMDB score for Disney+ is not equal to 6.25. The positive t-statistic indicates that the sample mean lies above 6.25.

5. ANOVA

a.
Based on the sum of squares between groups and sum of squares within groups found in the output, show how to calculate the mean square treatment (between groups) and the mean square error (within groups) and finally how to calculate the F-statistic.

Table K – ANOVA

SSTR = 71701.941, k = 5
MSTR = SSTR/(k − 1) = 71701.941/4 = 17925.485
SSE = 2079255.490, nT − k = 4202
MSE = SSE/(nT − k) = 2079255.490/4202 = 494.825
F = MSTR/MSE = 17925.485/494.825 = 36.226

b. Use the output to test H0: μ1 = μ2 = μ3 = μ4 = μ5 against H1: Not all μi (i = 1,2,3,4,5) are equal on the 1% level (i.e., α = 0.01) by comparing the F-statistic to the critical value. You can use df=1000 for the denominator. State your conclusion of the test.
H0: μ1 = μ2 = μ3 = μ4 = μ5
H1: Not all population means are equal
We have df1 = 4 for the numerator and df2 = 1000 for the denominator. The critical value at α = 0.01 equals 3.34. Since 36.226 > 3.34, there is sufficient evidence to reject the null hypothesis, so not all population means are equal.
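
The ANOVA arithmetic in section 5 can be checked with a few lines of Python. This is only a sketch of the hand calculation, not part of the SPSS output: the function name anova_f is hypothetical, n_total = 4207 is derived from nT − k = 4202 with k = 5, and the critical value 3.34 is the table value quoted in 5b rather than something the code computes.

```python
def anova_f(sstr, sse, n_total, k):
    """Return (MSTR, MSE, F) for a one-way ANOVA from its sums of squares.

    sstr    -- sum of squares between groups (treatment)
    sse     -- sum of squares within groups (error)
    n_total -- total number of observations (assumed: 4207, from nT - k = 4202)
    k       -- number of groups
    """
    mstr = sstr / (k - 1)        # mean square treatment, df = k - 1
    mse = sse / (n_total - k)    # mean square error, df = nT - k
    return mstr, mse, mstr / mse


# Values taken from Table K as quoted in section 5a.
mstr, mse, f_stat = anova_f(sstr=71701.941, sse=2079255.490, n_total=4207, k=5)

# Critical value for alpha = 0.01 with df1 = 4, df2 = 1000, as stated in 5b.
critical_value = 3.34
reject_h0 = f_stat > critical_value

print(f"MSTR = {mstr:.3f}, MSE = {mse:.3f}, F = {f_stat:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Keeping the critical value as a constant mirrors the table-lookup approach used in the report; a statistics library could supply it instead, but that would go beyond the hand calculation shown here.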