
lab assignment

Business Statistics 1
Authors:
Alma Jismyr
Qi Peng
Elina Hakimzadeh
Table of Contents
1. Graphical Analysis
2. Measures of Location and Spread
3. Confidence Interval and Hypothesis Test
4. Repeated Hypothesis Test by Splitting the Data
5. ANOVA
1. Graphical Analysis
a. What is the proportion of movies available on Hulu in the dataset?
12.07% according to Figure 1.
Figure 1 – movies available on Hulu
b. What is the mode for the “Hulu” variable and what does it mean?
The mode is 0 according to Table A. This means that 0 (not available on Hulu) is the most frequent value of the variable.
Table A – Mode for Hulu
c. Briefly discuss whether the mode is a good measure of central tendency for the
variable "Hulu"?
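Yes, it is a good measure of central tendency for this particular variable. "Hulu" is a
binary (nominal) variable, so the mode is the only measure of central tendency that can
be interpreted directly: it shows which of the two categories is the most common.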
d. What is the mode of the “Age” variable for each of the four streaming platforms?
Hulu: 5, Table B.
Prime video: 5, Table C.
Disney: 5, Table D.
Netflix: 5, Table E.
Table B – Mode of the age for Hulu
Statistics      Age     Hulu
N Valid         4227    4227
N Missing       0       0
Mode            5.00    0
Table C – Mode of the age for Prime Video
Statistics      Age     PrimeVideo
N Valid         4227    4227
N Missing       0       0
Mode            5.00    0
Table D – Mode of the age for Disney
Statistics      Age     Disney
N Valid         4227    4227
N Missing       0       0
Mode            5.00    0
Table E – Mode of the age for Netflix
Statistics      Age     Netflix
N Valid         4227    4227
N Missing       0       0
Mode            5.00    0
2. Measures of Location and Spread
a. At what measurement of scale (i.e., ratio, interval, ordinal or nominal scale) are
the variables measured? Discuss the three variables: “Age”, “Netflix” and
“Runtime”.
Age: Ratio scale, since age has an absolute zero point.
Netflix: Nominal Scale, since Netflix can be categorized but it lacks natural ranking.
Runtime: Ratio scale, since runtime is measured in minutes and zero minutes is an absolute zero point.
b. Are all three measures of central tendency (mode, median & mean) relevant for
all three variables? Discuss each variable (“Age”, “Netflix” and “Runtime”) and
each measure of tendency: for instance: Age: mean can/cannot be used here
because...median can/cannot be used because...
For “Age”, all three measures of central tendency can be used since it is measured on
a ratio scale.
For “Netflix”, only the mode is relevant as it is a nominal categorical variable without
order.
For “Runtime”, both the mean and the median are relevant, while the mode's relevance
depends on how often individual runtime values repeat in the dataset.
Table F – Measures of central tendency
Statistics        Age        Netflix    Runtime
N Valid           4227       4227       4207
N Missing         0          0          20
Mean              3.4845     0.35       99.251
Median            3.0000     0.00       97.000
Mode              5.00       0          90.0
Std. Deviation    1.50262    0.478      22.6142
Variance          2.258      0.229      511.402
Range             4.00       1          180.0
Minimum           1.00       0          30.0
Maximum           5.00       1          210.0
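As a quick consistency check on the spread measures in Table F, the sketch below verifies that each reported variance is approximately the square of the reported standard deviation and that each range equals maximum minus minimum. The dictionary layout and the helper name check_spread are illustrative assumptions; only the numbers come from the table.

```python
import math

# Values copied from Table F: (std_dev, variance, minimum, maximum, range)
table_f = {
    "Age":     (1.50262, 2.258, 1.0, 5.0, 4.0),
    "Netflix": (0.478, 0.229, 0.0, 1.0, 1.0),
    "Runtime": (22.6142, 511.402, 30.0, 210.0, 180.0),
}

def check_spread(std_dev, variance, minimum, maximum, value_range):
    """Return True if variance ~ std_dev**2 and range == maximum - minimum."""
    # A loose tolerance absorbs the rounding in the printed SPSS output.
    variance_ok = math.isclose(std_dev ** 2, variance, rel_tol=5e-3)
    range_ok = math.isclose(maximum - minimum, value_range)
    return variance_ok and range_ok

results = {name: check_spread(*row) for name, row in table_f.items()}
print(results)
```

The check only confirms that the numbers are internally consistent; it says nothing about whether a measure is meaningful for a given scale of measurement.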
c. Are the measures of variability (variance, standard deviation & range) relevant
for all three variables? Discuss each variable (“Age”, “Netflix” and “Runtime”)
and each measure of variability.
Age: Variance, standard deviation, and range are relevant measures of variability
because age is measured on a ratio scale with meaningful numerical values (see Table F).
Netflix: None of these measures (variance, standard deviation, or range) is relevant
because "Netflix" is a nominal categorical variable whose codes 0 and 1 are labels
rather than quantities (see Table F).
Runtime: Variance and standard deviation are relevant measures of variability for
Runtime because it is measured on a numerical scale with meaningful values. The range
is also relevant, although it is a cruder measure of spread than the variance and
standard deviation because it depends only on the minimum and the maximum (see Table F).
d. Are there missing data? If so, how many, and for which variables (consider only
“Age”, “Netflix” and “Runtime”)?
Yes. Runtime has 20 missing values, while Age and Netflix have none, according to Table F.
3. Confidence Interval and Hypothesis Test
a.
Hypothesis
H0: μ1 ≥ μ2
H1: μ1 < μ2
μ1 = the mean IMDB score for movies released before 2001
μ2 = the mean IMDB score for movies released after 2001
(Numbers from Table G)
t-value = 3.979
p-value = 0.001
Critical value = 2.326 (the one-tailed critical value for α = 0.01 at large degrees of freedom)
Conclusion: the p-value is smaller than α and the t-value exceeds the critical value, so
both approaches lead us to reject the null hypothesis H0.
We therefore conclude that the mean IMDB score for movies released before 2001 is lower
than the mean score for movies released after 2001.
Table G – Independent Samples Test
b. Calculate by hand based on the output in the contingency table a 95%
confidence interval of the share of good movies on Disney+. Show your calculations
and interpret the confidence interval.
(The numbers come from the calculation in question c).
α = 0.05
p = 0.332
n = 517
z(α/2) = z(0.025) = 1.96
p ± z(α/2) × √(p(1 − p)/n)
= 0.332 ± 1.96 × √(0.332 × (1 − 0.332)/517)
= 0.332 ± 1.96 × √(0.332 × 0.668/517)
= 0.332 ± 1.96 × √0.00042897
= 0.332 ± 1.96 × 0.0207
= 0.332 ± 0.0406
Upper limit: 0.332 + 0.0406 = 0.3726
Lower limit: 0.332 − 0.0406 = 0.2914
In conclusion, we can be 95% confident that the share of good movies on Disney+ is
between 0.2914 and 0.3726.
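For reference, here is a minimal sketch of the standard large-sample confidence interval for a proportion, p ± z(α/2) × √(p(1 − p)/n), using the counts 172 and 517 from Table H. The function name proportion_ci is a hypothetical helper, and the critical value is taken from Python's standard normal distribution rather than from a printed table. Because p is not rounded to 0.332 here, the bounds can differ from a hand calculation in the third decimal.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample (Wald) confidence interval for a population proportion."""
    p_hat = successes / n
    alpha = 1 - confidence
    # Two-sided critical value z(alpha/2), about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 172 good movies out of 517 Disney+ movies (Table H).
lower, upper = proportion_ci(172, 517)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

The multiplier in front of the square root is the critical value z(α/2), not the test statistic from part c.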
c. By hand, conduct a hypothesis test on the 90% confidence level (i.e., α = 0.10)
to test the null hypothesis that the share of good movies on Disney+ is at most 28%
vs. the alternative hypothesis that the share of good movies on Disney+ is greater
than 28%. State the hypotheses, significance level, etc. and show your calculations.
Use both the p-value- and the critical value approach. State your conclusions based
on the hypothesis test.
H0: π ≤ 0.28
H1: π > 0.28
α = 0.10
p = x/n = 172/517 = 0.332 (numbers from Table H)
z = (p − π0) / √(π0(1 − π0)/n) = (0.332 − 0.28) / √(0.2016/517) = 2.63
P-value: P(Z > 2.63) = 1 − P(Z ≤ 2.63) = 1 − 0.9957 = 0.0043
Since 0.0043 < 0.10, we reject H0.
Critical value: with df = 517 − 1 = 516 we use the infinity row of the t distribution
table, which equals the standard normal value; the one-tailed critical value at
α = 0.10 is 1.282.
Since 2.63 > 1.282, we reject H0.
In conclusion: both the p-value approach (p-value < α) and the critical value approach
(z > critical value) lead us to reject the null hypothesis, so there is evidence that
the share of good movies on Disney+ is greater than 28%.
Table H - number of good movies on Disney+
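The one-sided test of a proportion in part c can be sketched as follows. The function name one_sided_proportion_test is an illustrative assumption; the inputs (172 of 517, π0 = 0.28, α = 0.10) come from the text. Since the sample proportion is left unrounded, the z-value comes out slightly above the hand-calculated 2.63.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_proportion_test(successes, n, pi0, alpha):
    """Upper-tailed z-test of H0: pi <= pi0 against H1: pi > pi0."""
    p_hat = successes / n
    # The standard error uses pi0 because the test is carried out under H0.
    z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
    p_value = 1 - NormalDist().cdf(z)            # P(Z > z)
    critical = NormalDist().inv_cdf(1 - alpha)   # z(alpha), about 1.282
    return z, p_value, critical

z, p_value, critical = one_sided_proportion_test(172, 517, 0.28, 0.10)
reject_h0 = z > critical and p_value < 0.10
print(f"z = {z:.3f}, p-value = {p_value:.4f}, critical = {critical:.3f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Both decision rules always agree: z exceeds the critical value exactly when the p-value is below α.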
d. Based on the result, would you advise the enthusiast to subscribe to Disney+?
Based on the calculations, we reject the hypothesis that the share of good movies on
Disney+ is at most 28%, so we would advise the enthusiast to subscribe to Disney+.
4. Repeated Hypothesis Test by Splitting the Data
a. Calculate the t-statistics by hand for one streaming platform and check that
SPSS provided the correct value of the t-statistics. Use the values available in the
descriptive table provided in SPSS. You can double click on values in the table to
see more decimals.
H0: μ1 = μ2
H1: μ1 ≠ μ2
H0: SPSS provided the correct value of the t-statistics
H1: SPSS provided the incorrect value of the t-statistics
Standard Error Mean: 0.0469 (Table I)
Mean difference: −0.0231 (Table J)
t = −0.0231/0.0469 = −0.4925 ≈ −0.493
In conclusion, the t-statistic calculated by hand equals the t-statistic provided by
SPSS, so we do not reject the null hypothesis that SPSS provided the correct value.
Table I – one-sample statistics for all the platforms
Table J – one-sample test for all the platforms
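The hand calculation in part a divides the mean difference by the standard error of the mean. A minimal sketch with the two values quoted from Tables I and J is shown below; the function name one_sample_t is a hypothetical helper, and the variable name assumes the platform is Hulu because the result matches the t-value listed for Hulu in part b.

```python
def one_sample_t(mean_difference, standard_error):
    """t = (sample mean - hypothesised mean) / standard error of the mean."""
    return mean_difference / standard_error

# Mean difference (Table J) and standard error of the mean (Table I).
t_hulu = one_sample_t(-0.0231, 0.0469)
print(round(t_hulu, 3))
```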
b. Test H0: μ = 6.25 against H1: μ ≠ 6.25 on the 5% level (i.e., α = 0.05) by
comparing the t-statistic to the critical value for all platforms. The critical values
are found in the distribution table uploaded on Canvas; it is not a part of the SPSS
output.
H0: μ = 6.25
H1: μ ≠ 6.25
(Numbers from Table J)
Because the test is two-tailed, H0 is rejected when the absolute value of t exceeds the
critical value for α/2 = 0.025, which is 1.960 for large degrees of freedom.
Netflix:
df = 1497
critical value = ±1.960
t = 1.803
|t| < critical value. Therefore, we fail to reject the null hypothesis: there is not
sufficient evidence that the population mean for Netflix differs from 6.25.
Hulu:
df = 509
critical value = ±1.960
t = −0.493
|t| < critical value. Therefore, we fail to reject the null hypothesis: there is not
sufficient evidence that the population mean for Hulu differs from 6.25.
Prime Video:
df = 1701
critical value = ±1.960
t = −14.236
|t| > critical value. Therefore, we reject the null hypothesis that the population
mean of Prime Video is equal to 6.25.
Disney+:
df = 516
critical value = ±1.960
t = 4.045
|t| > critical value. Therefore, we reject the null hypothesis: there is sufficient
evidence that the population mean for Disney+ differs from 6.25.
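For a two-sided test at α = 0.05, the rejection region is |t| > t(α/2). The sketch below applies that rule to the four t-statistics from Table J. It approximates the t critical value with the standard normal quantile (about 1.96), which is an assumption that is reasonable here because every df is above 500; the dictionary names are illustrative.

```python
from statistics import NormalDist

ALPHA = 0.05
# Normal approximation to the two-sided t critical value (assumes large df).
critical = NormalDist().inv_cdf(1 - ALPHA / 2)

# t-statistics reported in Table J for H0: mu = 6.25.
t_statistics = {
    "Netflix": 1.803,
    "Hulu": -0.493,
    "Prime Video": -14.236,
    "Disney+": 4.045,
}

# Reject H0 when the absolute t-value exceeds the critical value.
decisions = {
    platform: "reject H0" if abs(t) > critical else "fail to reject H0"
    for platform, t in t_statistics.items()
}

for platform, decision in decisions.items():
    print(f"{platform}: {decision}")
```

Taking the absolute value matters for Prime Video: a large negative t-statistic is as strong evidence against H0 as a large positive one.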
c. Test H0: μ = 6.25 against H1: μ ≠ 6.25 on the 5% level (i.e., α = 0.05) by
comparing the p-value (given in the table by SPSS) to α. No calculations required.
(Numbers from Table J)
Netflix:
p-value: 0.072
p-value > α
Hulu:
p-value: 0.622
p-value > α
Prime Video:
p-value: < 0.001
p-value < α
Disney+:
p-value: < 0.001
p-value < α
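Although part c requires no calculation, the link between the t-statistics and the SPSS p-values can be illustrated with a short sketch. It uses the standard normal distribution as a large-sample stand-in for the t distribution (an assumption), so the results should be close to, but not necessarily identical with, the SPSS values. The helper name two_sided_p_value is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(t_statistic):
    """Two-sided p-value under a normal approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t_statistic)))

# t-statistics reported in Table J.
t_statistics = {
    "Netflix": 1.803,
    "Hulu": -0.493,
    "Prime Video": -14.236,
    "Disney+": 4.045,
}

p_values = {name: two_sided_p_value(t) for name, t in t_statistics.items()}
for name, p in p_values.items():
    verdict = "p < alpha" if p < 0.05 else "p > alpha"
    print(f"{name}: p = {p:.4f} ({verdict})")
```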
d. Briefly comment on the result for each streaming platform. What can
you say?
Netflix: We fail to reject the null hypothesis; there is not sufficient evidence that
the average IMDB score for Netflix differs from 6.25.
Hulu: We fail to reject the null hypothesis; there is not sufficient evidence that
the average IMDB score for Hulu differs from 6.25.
Prime Video: We reject the null hypothesis; there is sufficient evidence that
the average IMDB score for Prime Video is not equal to 6.25.
Disney+: We reject the null hypothesis; there is sufficient evidence that
the average IMDB score for Disney+ is not equal to 6.25.
5. ANOVA
a. Based on the sum of squares between groups and sum of squares within groups
found in the output, show how to calculate the mean square treatment (between
groups) and the mean square error (within groups) and finally how to calculate
the F-statistic.
Table K - ANOVA
SSTR = 71701.941, k = 5
MSTR = SSTR/(k − 1) = 71701.941/4 = 17925.485
SSE = 2079255.490, nT − k = 4202
MSE = SSE/(nT − k) = 2079255.490/4202 = 494.825
F-statistic = MSTR/MSE = 17925.485/494.825 = 36.226
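The ANOVA arithmetic above can be sketched in a few lines. The function name anova_f is a hypothetical helper, and n_total = 4207 is inferred from nT − k = 4202 with k = 5; the sums of squares are the values quoted from Table K.

```python
def anova_f(sstr, sse, k, n_total):
    """Return (MSTR, MSE, F) for a one-way ANOVA with k groups."""
    mstr = sstr / (k - 1)        # mean square treatment (between groups)
    mse = sse / (n_total - k)    # mean square error (within groups)
    return mstr, mse, mstr / mse

# Sums of squares from Table K; n_total inferred from nT - k = 4202.
mstr, mse, f_stat = anova_f(sstr=71701.941, sse=2079255.490, k=5, n_total=4207)
print(f"MSTR = {mstr:.3f}, MSE = {mse:.3f}, F = {f_stat:.3f}")
```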
b. Use the output to test H0: μ1 = μ2 = μ3 = μ4 = μ5 against H1: Not all μi (i
= 1,2,3,4,5) are equal on the 1% level (i.e., α = 0.01) by comparing the F-statistic
to the critical value. You can use df=1000 for the denominator. State your
conclusion of the test.
H0: μ1 = μ2 = μ3 = μ4 = μ5
H1: not all population means are equal
We have df1=4 for the numerator and df2=1000 for the denominator.
The critical value at α = 0.01 equals 3.34. Since 36.226>3.34, we can conclude that
there is sufficient evidence to reject the null hypothesis, so not all population means
are equal.