Uploaded by Adam Semple

DSA8001 - Practicals NS (1)

22/09/2018
DSA8001 - Practicals & Solutions
Analysing Normally Distributed Data
Question 1
Question 2
Question 3
Question 4
Question 5
Question 6
Statistical Inference
Analysing Normally Distributed Data
Question 1
Head lengths of brushtail possums follow a nearly normal distribution with mean 92.6 mm and standard deviation 3.6
mm.
1. Compute the Z-scores for possums with head lengths of 95.4 mm and 85.8 mm.
2. Use calculated Z-scores to determine how many standard deviations above or below the mean measured head
lengths of these two possums fall
3. Head length of which possum is more unusual?
Solution
1. Compute the Z-scores for possums with head lengths of 95.4 mm and 85.8 mm
Code
[1] "z1 = 0.78"
Code
[1] "z2 = -1.89"
2.
A possum with a head length of 95.4 mm is 0.78 standard deviations ABOVE the mean.
A possum with a head length of 85.8 mm is 1.89 standard deviations BELOW the mean.
3.
Because |z2| = 1.89 > |z1| = 0.78, the possum with a head length of 85.8 mm is more unusual than the possum with a
head length of 95.4 mm.
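The Z-scores in this solution follow from z = (x - mu) / sigma. A minimal R sketch of that calculation is shown here; the variable names are illustrative and not taken from the original notebook.

```r
# Parameters of the head-length distribution (mm), as given in the question
mu <- 92.6
sigma <- 3.6

# Standardise each observation: z = (x - mu) / sigma
z1 <- (95.4 - mu) / sigma
z2 <- (85.8 - mu) / sigma

print(paste("z1 =", round(z1, 2)))
print(paste("z2 =", round(z2, 2)))

# The observation with the larger |z| lies further from the mean
abs(z2) > abs(z1)
```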
Question 2
Suppose the average number of Facebook friends is approximated well by the normal model N(mu = 1500, sigma =
300). A randomly selected person, Julie, has 1800 friends.
1. At what percentile does she fall among Facebook users?
2. What percentage of people have more friends than Julie?
Solution
1.
Code
[1] 84.13447
Julie is at the 84.13th percentile.
2.
If 84.13% of people have fewer Facebook friends than Julie, then the percentage of people who have more friends is 15.87%.
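A percentile under a normal model is the cumulative probability at the observed value, which `pnorm()` returns. The sketch below assumes that this is how the figure was obtained; the variable names are illustrative.

```r
mu <- 1500
sigma <- 300

# Percentage of users with fewer friends than Julie (her percentile)
percentile <- pnorm(1800, mean = mu, sd = sigma) * 100
percentile

# Percentage of users with more friends than Julie
100 - percentile
```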
Question 3
Suppose the average number of Facebook friends is approximated well by the normal model N(mu = 1500, sigma =
300). What is the probability that a randomly selected person has AT LEAST 1630 friends on Facebook?
NOTE: Round solution to 3 decimal places.
Solution
Code
[1] 0.332
Code
[1] 0.332
Code
[1] TRUE
The probability that a randomly selected person has at least 1630 friends on Facebook is 0.332.
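An "at least" probability is an upper-tail area. The sketch below computes it in two equivalent ways and checks that they agree, which is one plausible reading of the two identical outputs and the `TRUE` shown in this solution.

```r
mu <- 1500
sigma <- 300

# Upper tail directly
p_upper <- pnorm(1630, mean = mu, sd = sigma, lower.tail = FALSE)

# Upper tail as the complement of the lower tail
p_complement <- 1 - pnorm(1630, mean = mu, sd = sigma)

round(p_upper, 3)
round(p_complement, 3)

# Both routes should give the same value
isTRUE(all.equal(p_upper, p_complement))
```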
Question 4
Suppose the average number of Facebook friends is approximated well by the normal model N(mu = 1500, sigma =
300). A randomly selected person is at the 79.95th percentile. How many Facebook friends does this person have?
Solution
Code
[1] 1751.951
A randomly selected person at the 79.95th percentile has about 1752 friends on Facebook.
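Going from a percentile back to a value uses the quantile function `qnorm()`, the inverse of `pnorm()`. A minimal sketch, with an illustrative variable name:

```r
# Value below which 79.95% of the N(1500, 300) distribution falls
friends <- qnorm(0.7995, mean = 1500, sd = 300)
friends

# Friend counts are whole numbers, so round the result
round(friends)
```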
Question 5
At the Heinz factory, the amounts which go into bottles of ketchup are supposed to be normally distributed with mean 36
oz. and standard deviation 0.11 oz. Once every 30 minutes a bottle is selected from the production line, and its contents
are noted precisely. If the amount of ketchup in the bottle is below 35.8 oz. or above 36.2 oz., then the bottle fails the
quality control inspection.
1. What percentage of bottles have less than 35.8 ounces of ketchup?
2. What percentage of bottles PASS the quality control inspection?
NOTE: Round solutions to 2 decimal places
Solution
1.
Code
[1] "3.45% of bottles have less than 35.8 oz of ketchup."
2.
Code
[1] 0.9309637
Code
[1] 93.1
93.1% of bottles pass inspection.
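Both answers are areas under the N(36, 0.11) curve: the first is the lower tail below 35.8 oz, the second is the area between the two inspection limits. A sketch under that reading, with illustrative variable names:

```r
mu <- 36
sigma <- 0.11

# Part 1: proportion of bottles below the lower limit
p_below <- pnorm(35.8, mean = mu, sd = sigma)
round(p_below * 100, 2)

# Part 2: proportion of bottles between the two limits (these pass)
p_pass <- pnorm(36.2, mean = mu, sd = sigma) - pnorm(35.8, mean = mu, sd = sigma)
p_pass
round(p_pass * 100, 2)
```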
Question 6
Body temperatures of healthy humans are distributed nearly normally with mean 98.2F and standard deviation 0.73F.
What is the cutoff for the lowest 3% of human body temperatures?
NOTE: Round solution to 1 decimal place.
Solution
Code
[1] "The cutoff value for the lowest 3% of human body temperature is 96.8F"
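The cutoff for the lowest 3% is the 0.03 quantile of the temperature distribution. A minimal sketch using `qnorm()`:

```r
# Temperature below which the lowest 3% of body temperatures fall
cutoff <- qnorm(0.03, mean = 98.2, sd = 0.73)
round(cutoff, 1)
```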
Statistical Inference
Test of single mean (mu) using the Z statistic
Question 1
The mean content of a sample of 120 bottles of milk from one day's output of a dairy was found to be 0.9975 litres,
while the standard deviation of the sample was 0.012 litres. Investigate whether there is evidence to suggest that the mean
content of that day's output is different from 1 litre.
Solution
Code
[1] "z = -2.3"
Code
[1] "p_value = 0.022"
Code
[1] "Test is significant at 5% level, because 1% < p = 2.2% <= 5%. There is considerable evidence for rejection H0 in favour HA."
Code
[1] "95% C.I. is [0.99, 1.01]"
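The test statistic is z = (x_bar - mu0) / (s / sqrt(n)), with a two-sided p-value taken from the standard normal distribution. The sketch below reproduces that calculation with illustrative variable names. It prints the confidence interval to four decimals, because the standard error here is about 0.001 and coarser rounding hides the interval's width.

```r
n <- 120
x_bar <- 0.9975
s <- 0.012
mu0 <- 1

# Standard error of the mean and Z statistic
se <- s / sqrt(n)
z <- (x_bar - mu0) / se
print(paste("z =", round(z, 1)))

# Two-sided p-value
p_value <- 2 * pnorm(-abs(z))
print(paste("p_value =", round(p_value, 3)))

# 95% confidence interval for the mean content
ci <- x_bar + c(-1, 1) * qnorm(0.975) * se
round(ci, 4)
```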
Test of the comparison of two means using the Z statistic
Question 1
Suppose we wish to determine if there is a difference in mean weight between the two sexes in a particular bird
species, at a 5% significance level. The following data were obtained:
Male sample size n1 = 125, mean weight x1_bar = 92.31 g, and variance var1 = 56.22 g^2
Female sample size n2 = 85, mean weight x2_bar = 88.84 g, and variance var2 = 65.41 g^2
If significant, give a 95% CI for mu1 - mu2.
Solution
Code
[1] "z = 3.1"
Code
[1] "p_value = 0.002"
Code
[1] "Test is highly significant at 5% level, because 0.1% < p = 0.2% <= 1%. There is considerable evidence for rejection H0 in favour HA."
Code
[1] "95% C.I. is [1.31, 5.63]"
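For two large independent samples, the standard error of the difference is sqrt(var1/n1 + var2/n2). The sketch below shows how the statistic, p-value and interval can be obtained from the summary figures in the question; the variable names follow the question where possible.

```r
n1 <- 125; x1_bar <- 92.31; var1 <- 56.22
n2 <- 85;  x2_bar <- 88.84; var2 <- 65.41

# Standard error of the difference in means and Z statistic
se_diff <- sqrt(var1 / n1 + var2 / n2)
z <- (x1_bar - x2_bar) / se_diff
print(paste("z =", round(z, 1)))

# Two-sided p-value
p_value <- 2 * pnorm(-abs(z))
print(paste("p_value =", round(p_value, 3)))

# 95% confidence interval for mu1 - mu2
ci <- (x1_bar - x2_bar) + c(-1, 1) * qnorm(0.975) * se_diff
round(ci, 2)
```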
Test of single proportion
Question 1
In a random sample of 120 graduates, 78 spent 3 years at university and 42 more than 3 years. Test the hypothesis
that 70% of graduates obtain degrees in 3 years. Give a 95% c.i. for the population proportion.
NOTE: Round solution to 3 decimal places.
Solution
Code
[1] "z = -1.2"
Code
[1] "p_value = 0.232"
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p_value = 23.2% > 5%."
Code
[1] "95% C.I. is [0.568, 0.732]"
Code
1-sample proportions test without continuity correction
data: c(78) out of c(120), null probability c(0.7)
X-squared = 1.4286, df = 1, p-value = 0.232
alternative hypothesis: true p is not equal to 0.7
95 percent confidence interval:
0.5612132 0.7294810
sample estimates:
p
0.65
Code
[1] "p_value_alt = 0.232"
Code
[1] TRUE
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p_value_alt = 23.2% > 5%."
Code
[1] "95% C.I. is [0.561, 0.729]"
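The manual route uses z = (p_hat - p0) / sqrt(p0 (1 - p0) / n), and `prop.test()` without continuity correction reports the square of that statistic. The sketch below shows both, with illustrative variable names. `prop.test()` returns a Wilson score interval, so its confidence limits can differ slightly from an interval built by hand from a normal approximation.

```r
x <- 78
n <- 120
p0 <- 0.7
p_hat <- x / n

# Manual Z test using the standard error under H0
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * pnorm(-abs(z))
print(paste("z =", round(z, 1)))
print(paste("p_value =", round(p_value, 3)))

# Built-in equivalent; X-squared equals z^2
prop_result <- prop.test(x, n, p = p0, correct = FALSE)
prop_result

# The two p-values should agree
isTRUE(all.equal(p_value, prop_result$p.value))
```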
Test of two proportions
Question 1
We wish to compare the germination rates of spinach seeds for two different methods of preparation:
Method A: 80 seeds sown, 65 germinate
Method B: 90 seeds sown, 80 germinate
NOTE: Round solution to 3 decimal places.
Solution
Code
[1] "z = -1.4"
Code
[1] "p_value = 0.16"
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p_value = 16% > 5%."
Code
[1] "95% C.I. is [-0.184, 0.031]"
Code
2-sample test for equality of proportions without continuity correction
data: c(x1, x2) out of c(n1, n2)
X-squared = 1.9703, df = 1, p-value = 0.1604
alternative hypothesis: two.sided
95 percent confidence interval:
-0.1837708 0.0309930
sample estimates:
   prop 1    prop 2
0.8125000 0.8888889
Code
[1] "p_value_alt = 0.16"
Code
[1] TRUE
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p_value_alt = 16% > 5%."
Code
[1] "95% C.I. is [-0.184, 0.031]"
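Under H0 the two germination rates are equal, so the test statistic uses a pooled proportion for its standard error, while the confidence interval uses the separate sample proportions. The sketch below assumes that approach and compares it with `prop.test()`; the names `x1`, `x2`, `n1`, `n2` match the output in this solution, the rest are illustrative.

```r
x1 <- 65; n1 <- 80   # Method A
x2 <- 80; n2 <- 90   # Method B
p1 <- x1 / n1
p2 <- x2 / n2

# Z test with the pooled proportion under H0
p_pool <- (x1 + x2) / (n1 + n2)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z <- (p1 - p2) / se_pool
p_value <- 2 * pnorm(-abs(z))
print(paste("z =", round(z, 1)))
print(paste("p_value =", round(p_value, 3)))

# 95% confidence interval with the unpooled standard error
se_unpooled <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci <- (p1 - p2) + c(-1, 1) * qnorm(0.975) * se_unpooled
round(ci, 3)

# Built-in equivalent
prop_result <- prop.test(c(x1, x2), c(n1, n2), correct = FALSE)
isTRUE(all.equal(p_value, prop_result$p.value))
```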
Tests based on the t-distribution (Single Mean)
Question 1
What proportion of the t-distribution with 18 degrees of freedom falls below -2.10?
NOTE: Round solution to 3 decimal places.
Solution
Code
[1] 0.025
Code
[1] 0.025
Question 2
What proportion of the t-distribution with 20 degrees of freedom falls above 1.65?
NOTE: Round solution to 3 decimal places.
Solution
Code
[1] 0.057
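The tail areas in Questions 1 and 2 come from the cumulative distribution function of the t-distribution, `pt()`. A minimal sketch, with illustrative variable names:

```r
# Question 1: area below -2.10 with 18 degrees of freedom
p_below <- pt(-2.10, df = 18)
round(p_below, 3)

# Question 2: area above 1.65 with 20 degrees of freedom
p_above <- pt(1.65, df = 20, lower.tail = FALSE)
round(p_above, 3)
```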
Question 3
What proportion of the t-distribution with 2 degrees of freedom falls 3 standard deviations from the mean (above or
below)?
NOTE: Round solution to 3 decimal places.
Solution
Code
[1] "Proportion below 3 SD: 0.0477329831333546"
Code
[1] "Proportion above 3 SD: 0.0477329831333546"
Code
[1] "Proportion of the t-distribution that falls 3SD above or below the mean is: 0.095"
Code
[1] "Proportion of the t-distribution that falls 3SD above or below the mean is: 0.095"
Code
[1] "Proportion of the t-distribution that falls 3SD above or below the mean is: 0.095"
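Because the t-distribution is symmetric about zero, the area more than 3 units from the mean is twice the lower-tail area at -3. The sketch below shows the calculation that the outputs in this solution are consistent with.

```r
df <- 2

# Area in each tail beyond 3 units from the mean
p_lower <- pt(-3, df = df)
p_upper <- pt(3, df = df, lower.tail = FALSE)

# Total two-tailed area
p_both <- p_lower + p_upper
round(p_both, 3)

# Symmetry shortcut gives the same value
isTRUE(all.equal(p_both, 2 * p_lower))
```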
Question 4
The temperature of warm water springs in a basin is reported to have a mean of 38C. A sample of 12 springs from the
west end of the basin had mean temperature 39.4 and variance 1.92.
1. Have springs at the west end a greater mean temperature?
2. Have springs at the west end a different mean temperature? Give a 95% c.i. for the mean temperature.
NOTE: Round solution to 3 decimal places.
Solution
Code
[1] "Under H0, t=3.5 is an observation from t11"
1.
Code
[1] "p_value = 0.002"
Code
[1] "Test is highly significant at 5% level, because 0.1% < p = 0.2% <= 1%. There is considerable evidence for rejection H0 in favour HA."
2.
Code
[1] "p_value = 0.004"
Code
[1] "Test is highly significant at 5% level, because 0.1% < p = 0.4% <= 1%. There is considerable evidence for rejection H0 in favour HA."
Code
[1] 38.5196 40.2804
Code
[1] "95% C.I. is [38.5196, 40.2804]"
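With only summary statistics available, the t statistic is t = (x_bar - mu0) / sqrt(s^2 / n) on n - 1 degrees of freedom. Part 1 uses the upper tail only and part 2 uses both tails. The sketch below works from the figures in the question, with illustrative variable names.

```r
n <- 12
x_bar <- 39.4
s2 <- 1.92      # sample variance
mu0 <- 38

se <- sqrt(s2 / n)
t_stat <- (x_bar - mu0) / se
df <- n - 1

# Part 1: one-sided test, HA: mu > 38
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)

# Part 2: two-sided test, HA: mu != 38
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval for the mean temperature
ci <- x_bar + c(-1, 1) * qt(0.975, df = df) * se
round(ci, 4)
```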
Question 5
A sweets-producing company was interested in the mean net weight of contents in an advertised 80-gram pack. The
manufacturer has precisely weighed the contents of 24 randomly selected 80-gram packs from different stores and
recorded the weights as follows:
Code
1. Investigate the hypothesis that the sweets content in the packages is less than what is claimed on the
package.
2. Investigate the hypothesis that the sweets content in the packages is different from what is claimed on the
package. Give a 95% c.i. for the mean weight.
NOTE: Round solution to 3 decimal places.
Solution
Code
[1] 79.36917
Code
[1] "Under H0, t=-0.947 is an observation from t23"
1.
Code
[1] "p_value = 0.177"
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA, because
p = 17.7% > 5%."
Code
[1] TRUE
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA, because
p = 17.7% > 5%."
2.
Code
[1] "p_value = 0.354"
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA, because
p = 35.4% > 5%."
Code
[1] 77.991 80.748
Code
[1] "95% C.I. is [77.991, 80.748]"
Code
[1] TRUE
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA, because
p = 35.4% > 5%."
Code
[1] "95% C.I. is [77.991, 80.747]"
Tests based on the t-distribution (Paired Comparison)
Question 1
Consider an experiment to compare the effects of two sleeping drugs A and B. There are 10 subjects and each subject
receives treatment with each of the two drugs (the order of treatment being randomised). The number of hours slept by
each subject is recorded. Is there any difference between the effects of the two drugs? Give a 95% c.i. for the
unknown mean difference.
NOTE: Round solution to 3 decimal places.
Code
Solution
Code
[1] 1.58
Code
[1] "Under H0, t=4.062 is an observation from t9"
Code
[1] "p_value = 0.003"
Code
[1] "Test is highly significant at 5% level, because 0.1% < p = 0.3% <= 1%. There is considerable evidence for rejection H0 in favour HA."
Code
[1] 0.70 2.46
Code
[1] "95% C.I. is [0.7, 2.46]"
Code
[1] TRUE
Code
[1] "Test is highly significant at 5% level, because 0.1% < p_alt = 0.3% <= 1%. There is considerable evidence for rejection H0 in favour HA."
Code
[1] "95% C.I. is [0.7, 2.46]"
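The data for this question are not shown in this copy. The reported mean difference (1.58), t statistic (4.062 on 9 degrees of freedom) and interval [0.70, 2.46] match R's built-in `sleep` data set, so the sketch below assumes, without confirmation from the source, that `sleep` is the data used. A paired test is a one-sample t-test on the within-subject differences.

```r
# ASSUMPTION: the question uses R's built-in 'sleep' data (not confirmed by the source)
data(sleep)
drug_a <- sleep$extra[sleep$group == 1]
drug_b <- sleep$extra[sleep$group == 2]

# Within-subject differences and their mean
d <- drug_b - drug_a
mean(d)

# Manual paired t statistic on n - 1 degrees of freedom
n <- length(d)
t_stat <- mean(d) / (sd(d) / sqrt(n))
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Built-in equivalent
paired_result <- t.test(drug_b, drug_a, paired = TRUE)
paired_result

# Manual and built-in p-values should agree
isTRUE(all.equal(p_value, paired_result$p.value))
```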
Tests based on the t-distribution (Independent Samples)
Question 1
Two methods of oxidation care are used in an industrial process. Repeated measurements of the oxidation time are
made to test the hypothesis that the oxidation time of method 1 is different than that of method 2 on average.
Method 1: Sample size = 9, Sample mean = 41.3, Sample Variance = 20.7
Method 2: Sample size = 8, Sample mean = 48.9, Sample Variance = 34.2
Assuming that the unknown variances are equal, investigate whether there is any difference between the oxidation
times of the two methods. Give a 95% c.i. for the unknown mean difference.
Solution
Code
[1] "Under H0, t=-3.01 is an observation from t15"
Code
[1] "p_value = 0.009"
Code
[1] "Test is highly significant at 5% level, because 0.1% < p = 0.9% <= 1%. There is considerable evidence for rejection H0 in favour HA."
Code
[1] -12.981  -2.219
Code
[1] "95% C.I. is [-12.981, -2.219]"
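With equal variances assumed, the two sample variances are combined into a pooled variance, weighted by their degrees of freedom, and the test has n1 + n2 - 2 degrees of freedom. The sketch below works from the summary statistics in the question, with illustrative variable names. Its interval may differ from the one printed in this solution in the last decimal place, depending on where intermediate values are rounded.

```r
n1 <- 9; x1_bar <- 41.3; var1 <- 20.7
n2 <- 8; x2_bar <- 48.9; var2 <- 34.2

# Pooled variance and standard error of the difference
df <- n1 + n2 - 2
var_pooled <- ((n1 - 1) * var1 + (n2 - 1) * var2) / df
se_diff <- sqrt(var_pooled * (1 / n1 + 1 / n2))

# t statistic and two-sided p-value
t_stat <- (x1_bar - x2_bar) / se_diff
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
print(paste("t =", round(t_stat, 2), "on", df, "degrees of freedom"))
print(paste("p_value =", round(p_value, 3)))

# 95% confidence interval for mu1 - mu2
ci <- (x1_bar - x2_bar) + c(-1, 1) * qt(0.975, df = df) * se_diff
round(ci, 3)
```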
Question 2
Two methods of oxidation care are used in an industrial process. Repeated measurements of the oxidation time are
made to test the hypothesis that the oxidation time of method 1 is different than that of method 2 on average. The
following measurements were recorded:
Method 1:
c(29.915269, 8.920123, 36.647273, 54.038639, 37.583526, 19.860171, 13.470132, 43.139612, 39.825299)
Method 2:
c(28.970122, 43.563546, 4.161069, 39.774523, 5.705720, 93.562336,
3.801087, 79.906087)
Assuming that the unknown variances are equal, investigate whether there is any difference between the oxidation
times of the two methods. Give a 95% c.i. for the unknown mean difference.
Solution
Code
[1] "Under H0, t=-0.472 is an observation from t15"
Code
[1] "p_value = 0.644"
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p = 64.4% > 5%."
Code
[1] -32.762  20.879
Code
[1] "95% C.I. is [-32.762, 20.879]"
Or Alternatively
Code
[1] TRUE
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p = 64.4% > 5%."
Code
[1] "95% C.I. is [-32.768, 20.884]"
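When the raw measurements are available, `t.test()` with `var.equal = TRUE` performs the pooled-variance test directly. The sketch below uses the data listed in the question; the object names are illustrative.

```r
method_1 <- c(29.915269, 8.920123, 36.647273, 54.038639, 37.583526,
              19.860171, 13.470132, 43.139612, 39.825299)
method_2 <- c(28.970122, 43.563546, 4.161069, 39.774523, 5.705720,
              93.562336, 3.801087, 79.906087)

# Two-sample t-test assuming equal variances (pooled standard deviation)
pooled_result <- t.test(method_1, method_2, var.equal = TRUE)
pooled_result

# Components used in the written conclusion
round(unname(pooled_result$statistic), 3)
round(pooled_result$p.value, 3)
round(as.numeric(pooled_result$conf.int), 3)
```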
Tests based on the Chi-Square distribution
Question 1
What proportion of the chi-square distribution with 9 degrees of freedom falls above 17?
Solution
Code
[1] 0.049
Code
[1] 0.049
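The upper-tail area of a chi-square distribution comes from `pchisq()`. The sketch below computes it two equivalent ways, which is one plausible reading of the two identical outputs in this solution.

```r
# Upper tail directly
p_upper <- pchisq(17, df = 9, lower.tail = FALSE)

# Upper tail as the complement of the lower tail
p_complement <- 1 - pchisq(17, df = 9)

round(p_upper, 3)
round(p_complement, 3)
```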
Question 2
The geneticist Mendel evolved the theory that for a certain type of pea, the characteristics Round and Yellow, Round and
Green, Angular and Yellow, Angular and Green occurred in the ratio 9:3:3:1. He classified 556 seeds and the observed frequencies
were 315, 108, 101 and 32. Test Mendel’s theory on the basis of these data.
Solution
Code
[1] "Under H0, phi_squared=0.47 is an observation from Chi-square_3"
Code
[1] "p_value = 0.925"
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p = 92.5% > 5%."
Or Alternatively
Code
[1] TRUE
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p = 92.5% > 5%."
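The goodness-of-fit statistic is the sum of (observed - expected)^2 / expected over the four categories, with 4 - 1 = 3 degrees of freedom. The sketch below computes it by hand and with `chisq.test()`; the object names are illustrative.

```r
observed <- c(315, 108, 101, 32)
ratio <- c(9, 3, 3, 1)

# Expected counts under Mendel's 9:3:3:1 theory
expected <- sum(observed) * ratio / sum(ratio)
expected

# Manual chi-square statistic and p-value
chi_sq <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
round(chi_sq, 2)
round(p_value, 3)

# Built-in equivalent
gof_result <- chisq.test(observed, p = ratio / sum(ratio))
isTRUE(all.equal(p_value, gof_result$p.value))
```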
Question 3
In a random sample of 120 graduates, 78 spent 3 years at University and 42 more than 3 years. Test hypothesis that
70% obtain degree in 3 years.
Solution
Code
[1] 1.2
Code
[1] "Under H0, phi_squared=1.2 is an observation from Chi-square_1"
Code
[1] "p_value = 0.273"
Code
[1] "The test is not significant at 5% level and H0 is not rejected in favour of HA because
p = 27.3% > 5%."
Analysis of Variance (ANOVA)
Question 1
The iris data set, which comes preloaded with R, gives the measurements in centimeters of the variables sepal length and
width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa,
versicolor, and virginica.
1. Create a copy of the iris dataset that contains relevant variables, and:
1.1. Perform exploratory data analysis which will reveal how many flowers are in each category, how many sepal
width values are missing, as well as the sepal width mean and standard deviation for each category.
1.2. Perform graphical analysis which will reveal how collected sepal width values compare across the species
(histograms and boxplots).
1.3. Investigate whether there is a difference between the means of the sepal width variable among these three species,
and if there is, perform further investigation into which species have a statistically significant difference between
their means.
2. Similarly, repeat the investigation process for the other variables.
Solution
1. create a copy of the iris dataset that contains relevant variables
Code
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Code
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7,...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7,...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3,...
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa...
Code
1.1 Perform exploratory data analysis which will reveal how many flowers are in each category, how many sepal width
values are missing, as well as what is the sepal width mean and standard deviation per each category.
Code
# A tibble: 3 x 5
  Species    n_samples n_missing mean_sepal_width sd_sepal_width
  <fct>          <int>     <int>            <dbl>          <dbl>
1 setosa            50         0             3.43          0.379
2 versicolor        50         0             2.77          0.314
3 virginica         50         0             2.97          0.322
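The summary shown here looks like tibble output, so the original probably used dplyr. The sketch below produces the same per-species counts, missing-value counts, means and standard deviations with base R only, so it runs without extra packages. The name `iris_copy` appears in the notebook's output; the other names are illustrative.

```r
iris_copy <- iris
sw <- iris_copy$Sepal.Width
sp <- iris_copy$Species

# One row per species: size, missing values, mean and SD of sepal width
summary_table <- data.frame(
  Species          = levels(sp),
  n_samples        = as.integer(table(sp)),
  n_missing        = as.integer(tapply(is.na(sw), sp, sum)),
  mean_sepal_width = as.numeric(tapply(sw, sp, mean, na.rm = TRUE)),
  sd_sepal_width   = as.numeric(tapply(sw, sp, sd, na.rm = TRUE))
)
summary_table
```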
1.2 Perform graphical analysis which will reveal how collected sepal width values compare across the species
(histograms and boxplots)
Code
Code
Code
Code
1.3. Investigate whether there is a difference between the means of the sepal width variable among these three species, and if
there is, perform further investigation into which species have a statistically significant difference between their means.
H0: The mean sepal width is the same across all species.
HA: At least one mean is different from the others.
Code
             Df Sum Sq Mean Sq F value Pr(>F)
Species       2  11.35   5.672   49.16 <2e-16 ***
Residuals   147  16.96   0.115
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Code
[1] "p_value = 0"
Code
[1] "The test is very significant at 5% level because p = 0% < 5%. We are very confident that HA is to be preferred to H0, i.e. at least one mean is different from the others"
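A one-way ANOVA table like the one in this solution can be produced with `aov()`. A minimal sketch, assuming `iris_copy` is a plain copy of `iris`; the name `anova_fit` is illustrative.

```r
iris_copy <- iris

# One-way ANOVA of sepal width by species
anova_fit <- aov(Sepal.Width ~ Species, data = iris_copy)
summary(anova_fit)

# Extract the F statistic and p-value for Species
anova_table <- anova(anova_fit)
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
print(paste("p_value =", round(p_value, 3)))
```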
Check ANOVA conditions
Assuming the independence of data, we should check whether the other conditions are valid as well.
Condition 1: The variability across the groups should be about equal.
Code
In the plot above, there is no evident relationship between residuals and fitted values, which implies equal variances
across the groups (homogeneity of variances).
Alternatively, we can also use Levene’s test to test equality of variances:
Code
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   2  0.5902 0.5555
      147
Because the p-value obtained from Levene’s test is p = 55.5% > 5%, the test is not significant and there is no
evidence to suggest that the variance across species is statistically significantly different (i.e. we can assume the
equality of variance).
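The output header suggests `leveneTest()` from the car package. With `center = median`, Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group median, so it can be sketched in base R without that package. The object names below are illustrative.

```r
iris_copy <- iris

# Absolute deviation of each sepal width from its species median
group_median <- ave(iris_copy$Sepal.Width, iris_copy$Species, FUN = median)
iris_copy$abs_dev <- abs(iris_copy$Sepal.Width - group_median)

# Levene's test (median-centred) = one-way ANOVA on those deviations
levene_table <- anova(lm(abs_dev ~ Species, data = iris_copy))
levene_table

levene_f <- levene_table[["F value"]][1]
levene_p <- levene_table[["Pr(>F)"]][1]
```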
Condition 2: The observations within each group should be nearly normal.
This can be very difficult to determine in many real situations. To assess it, we mean-center each
sepal width by its respective group mean. These group-wise, mean-centered values are also known as residuals, and
by using them we can assess the normality of all observations as a whole.
Code
Alternatively, for testing normality of residuals we can use Shapiro-Wilk test:
Code
Shapiro-Wilk normality test
data: anova_test_residuals
W = 0.98948, p-value = 0.323
Because the p-value obtained from the Shapiro-Wilk test is p = 32.3% > 5%, the test is not significant and there is no
evidence to suggest that the normality assumption is violated (i.e. we can assume normality of the residuals).
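The residuals of the one-way ANOVA are the group-wise mean-centered sepal widths, so they can be passed directly to `shapiro.test()`. In the sketch below, the name `anova_test_residuals` matches the output in this solution; the other names are illustrative.

```r
iris_copy <- iris

# Residuals = sepal width minus its species mean
anova_fit <- aov(Sepal.Width ~ Species, data = iris_copy)
anova_test_residuals <- residuals(anova_fit)

# Shapiro-Wilk test of normality on the pooled residuals
shapiro_result <- shapiro.test(anova_test_residuals)
shapiro_result
```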
As we concluded that at least one pair of means differs, and because we do not know which one, we need to use t-tests with Bonferroni correction to compare each pair of means (i.e. multiple comparisons).
Code
Pairwise comparisons using t tests with pooled SD
data:  iris_copy$Sepal.Width and iris_copy$Species
           setosa  versicolor
versicolor < 2e-16 -
virginica  1.4e-09 0.0094
P value adjustment method: bonferroni
Conclusions:
1. In the case of the mean difference in sepal widths between the species versicolor and virginica, the test is highly
significant at the 5% level because 0.1% < p = 0.94% < 1%. There is considerable evidence for rejecting H0 in
favour of HA, i.e. considerable evidence of a difference between the average sepal widths of the species
versicolor and virginica.
2. In the other two comparisons (setosa-versicolor and setosa-virginica) the test is very highly significant at the 5% level
because p ≈ 0% < 0.1%. Therefore, we are very confident that HA is to be preferred to H0, i.e. very confident that
there is a difference between the average sepal widths of the species setosa-versicolor and setosa-virginica.
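The pairwise comparison table can be produced with `pairwise.t.test()`, which by default pools the standard deviation across all groups. With the Bonferroni method each raw p-value is multiplied by the number of comparisons (three here) and capped at 1. A minimal sketch, with an illustrative result name:

```r
iris_copy <- iris

# All pairwise t-tests between species, Bonferroni-adjusted
pairwise_result <- pairwise.t.test(iris_copy$Sepal.Width,
                                   iris_copy$Species,
                                   p.adjust.method = "bonferroni")
pairwise_result

# Adjusted p-value for versicolor vs virginica
p_vers_virg <- pairwise_result$p.value["virginica", "versicolor"]
signif(p_vers_virg, 2)
```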
Simple Linear Regression
Question 1
The iris data set, which comes preloaded with R, gives the measurements in centimeters of the variables sepal length and
width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa,
versicolor, and virginica.
Create a copy of the iris dataset that contains the variables Petal.Width and Petal.Length, and:
1. Fit a simple linear regression model for predicting petal widths based on the petal lengths.
2. Plot the line of best fit against the input dataset.
3. State the estimated simple linear regression equation.
4. State whether there is a significant relationship between the predictor and response variable.
5. Assuming that both petal width and length values are given in millimetres (mm), state the interpretation of the
estimated slope parameter.
6. If the petal lengths are 1.5, 1.6 and 1.7 mm, what are their estimated petal widths?
Solution
Code
1.
Code
2.
Code
3.
Code
Call:
lm(formula = Petal.Width ~ Petal.Length, data = iris_copy)

Residuals:
     Min       1Q   Median       3Q      Max
-0.56515 -0.12358 -0.01898  0.13288  0.64272

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2065 on 148 degrees of freedom
Multiple R-squared:  0.9271,	Adjusted R-squared:  0.9266
F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16
The summary output shows the following components:
Call: Displays the formula that has been used to fit the regression model.
Residuals: Displays summary statistics of residuals, which by definition should have a mean equal to zero.
Therefore, as an indication of the normally distributed residuals, median should not be far from zero, and the
minimum and maximum should be roughly equal in absolute value. NOTE: Similar to ANOVA, normality of the
residuals can be inspected using Shapiro-Wilk test.
Coefficients: Displays the values of the intercept and slope parameters, and their statistical significance.
ANSWER: The estimated simple regression equation is: Petal.Width_Estimated = -0.363076 + 0.415755 *
Petal.Length
4.
By looking at the summary output above, we can see that the p-value < 0.01%, so we can reject the null hypothesis
that β_1 = 0. Because the test is very highly significant at 5% level, we are very confident that there is a significant
relationship between the predictor and the response variable in the linear regression model.
5. For each additional 1 mm increase in petal length, we would expect petal width to increase by 0.415755 mm.
6.
Code
        1         2         3
0.2605576 0.3021331 0.3437087
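The fit, the coefficients behind the regression equation, and the three predictions can be sketched as follows. The formula and the name `iris_copy` are taken from the `Call:` line of the summary output; `fit`, `new_lengths` and `predicted` are illustrative names.

```r
# Keep only the two variables used in the model
iris_copy <- iris[, c("Petal.Width", "Petal.Length")]

# Simple linear regression of petal width on petal length
fit <- lm(Petal.Width ~ Petal.Length, data = iris_copy)
summary(fit)

# Intercept and slope of the estimated regression equation
coef(fit)

# Predicted petal widths for the three new petal lengths
new_lengths <- data.frame(Petal.Length = c(1.5, 1.6, 1.7))
predicted <- predict(fit, newdata = new_lengths)
predicted
```

Each prediction is the intercept plus the slope times the new petal length, for example -0.363076 + 0.415755 * 1.5 for the first value.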