SCIENTIFIC INQUIRY AND ANALYSIS
UNIT 2
STATISTICAL DATA ANALYSIS
OBJECTIVES:
The student will be able to:
• Create a frequency table from a set of data. (CCCS.HSS.ID.A.1)
• Compute, interpret, and analyze the measures of central tendency (mean, median, and mode) of a set of data. (CCCS.HSS.ID.A.2)
• Compute measures of spread (variance, standard deviation, quartiles, and interquartile range). (CCCS.HSS.ID.2)
• Graph one variable by hand (histogram, boxplot). (CCCS.HSS.ID.A.1)
• Identify outliers informally and recognize their effect on a set of data. (CCCS.HSS.ID.A.3)
• Define the characteristics of the Normal distribution by examining a histogram. (CCCS.HSS.ID.4)
• Explain how a histogram, which is a discrete probability distribution, is related to the Normal distribution curve, a continuous probability distribution. (CCCS.HSS.ID.A.4)
• Determine if a given set of data is approximately Normal using the empirical rule (68 - 95 - 99.7 rule). (CCCS.HSS.ID.A.4)
• Estimate areas under the Normal curve using the empirical rule.
• Graph two variables by hand (scatterplot). (CCCS.HSS.ID.B.6)
• Describe a scatterplot in terms of form, direction, strength, and the presence of outliers. (CCCS.HSS.ID.B.6)
• Find equations of lines of best fit by fitting a line by hand and using technology (TI-84 regression function and/or Excel). (CCCS.HSS.ID.B.6.A)
• Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data. (CCCS.HSS.ID.C.7)
• Compute the correlation coefficient using technology (TI-84 or Excel) and interpret it in the context of the data. (CCCS.HSS.ID.C.8)
• Informally assess the fit of a function by plotting and analyzing residuals. (CCCS.HSS.ID.B.6.B)
• Make predictions based upon analysis of data. (5.2.12.A.3)
• Distinguish between correlation and causation. (CCCS.HSS.ID.C.9)
• Statistics – a collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions from the data.
• Measures of Central Tendency – methods that describe an entire sample or population with a single number known as an average (mode, median, and mean).
• Mode – the value that occurs most frequently in the data.
– Example 1: What is the mode of the following data: (2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5)?
• Mode
– Example 2: What is the mode of the following data: (2, 5, 3, 10, 1, 4, 4, 10, 1, 2, 3, 4, 1, 10, 3, 2, 5, 5)?
– Mode is not a stable average, but it gives you the most common value in a distribution if that is the information desired.
– A data set can have more than one mode.
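The two mode examples can be checked with a short sketch. It assumes Python 3.8 or later, whose standard library provides `statistics.multimode`; the variable names are illustrative only.

```python
from statistics import multimode

example_1 = [2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5]
example_2 = [2, 5, 3, 10, 1, 4, 4, 10, 1, 2, 3, 4, 1, 10, 3, 2, 5, 5]

# multimode returns every value that is tied for the highest frequency,
# so it also handles data with more than one mode.
modes_1 = multimode(example_1)
modes_2 = multimode(example_2)

print(modes_1)          # [2]
print(sorted(modes_2))  # [1, 2, 3, 4, 5, 10]
```

In Example 2 every value occurs three times, so all six distinct values tie for most frequent.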
• Median – the central value of an ordered distribution of data.
– If there is an odd number of data values, the median is the center value.
– If there is an even number of data values, there are two center values; therefore:
Median = (sum of the two middle values) / 2
• Median
– Example 1: What is the median of the following data: (62, 3, 5, 28, 67, 33, 22, 2, 10)?
– Example 2: What is the median of the following data: (62, 3, 5, 28, 67, 33, 22, 2, 10, 120)?
– Median is a more stable average than the mode, but it does not indicate the range of values above or below it.
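The odd and even cases of the median rule can be sketched as follows. The function name `median_by_hand` is hypothetical and chosen only for this illustration.

```python
def median_by_hand(values):
    """Median using the odd/even rule: order the data, then take the center."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the single center value.
        return ordered[mid]
    # Even number of values: mean of the two center values.
    return (ordered[mid - 1] + ordered[mid]) / 2

example_1 = [62, 3, 5, 28, 67, 33, 22, 2, 10]
example_2 = example_1 + [120]

print(median_by_hand(example_1))  # 22
print(median_by_hand(example_2))  # 25.0
```

Adding the large value 120 moves the median only from 22 to 25, which illustrates why the median is considered a stable average.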
• Mean – the sum of all values in a distribution of data divided by the number of values.
$\text{sample mean} = \bar{x} = \frac{\sum x}{n} = \frac{\text{sum of all \#'s}}{\text{the tot. of \#'s}}$
$\text{population mean} = \mu = \frac{\sum x}{N}$
• Mean
– Trimmed Mean: removes the highest and lowest values of a group of data before taking the mean. The typical trim amounts are either 5% or 10%.
– 5% Trimmed Mean: take 5% of the number of data points, round the result to a whole number, remove that many values from both the top and the bottom of the ordered data, and then take the mean of what remains.
• Mean
– Example: Given the following data, take the 5% trimmed mean: 34, 56, 72, 74, 78, 82, 85, 85, 88, 90, 90, 92, 95, 95, 99, 100.
• 5% of 16 values is 0.8; therefore round up to 1 and remove the top and bottom scores.
• Remove 34 & 100; add up the remaining values: 1181 / 14 = 84.4%
• If no trimming is done, then the mean would be 82.2%.
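The trimmed-mean example can be reproduced with the sketch below. The function name `trimmed_mean` is hypothetical, and rounding the trim count to the nearest whole number is an assumption (the example only shows 0.8 becoming 1).

```python
def trimmed_mean(values, trim_fraction):
    """Mean after removing the same number of values from each end."""
    ordered = sorted(values)
    # Assumption: the trim count is rounded to the nearest whole number,
    # e.g. 5% of 16 values = 0.8, which becomes 1.
    k = int(trim_fraction * len(ordered) + 0.5)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = [34, 56, 72, 74, 78, 82, 85, 85, 88, 90, 90, 92, 95, 95, 99, 100]

print(round(trimmed_mean(scores, 0.05), 1))   # 84.4
print(round(sum(scores) / len(scores), 1))    # 82.2
```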
• Measures of Variation – describe how spread out the data are.
• Range – the difference between the largest and smallest values of a distribution.
– Example 1: What is the range of the following data: (2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5)?
– Range fails to tell how much the values vary from one another.
• Sample Standard Deviation – a measurement that gives you a better idea of how the data entries differ from the mean.
$\text{sample std. deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
$\text{sample variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
– $x$ = a value in the distribution
– $\bar{x}$ = the sample mean value of the distribution
– $n$ = the total number of values in a sample distribution
• Population Standard Deviation – this is the same as the sample standard deviation, except that it covers the complete population that you are studying, not just a sample set. NOTE: the symbol is different and you divide by the size of the whole population ($N$).
$\text{population std. deviation} = \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
– $x$ = a value in the distribution
– $\mu$ = the population mean value of the distribution
– $N$ = the total number of values in the population
• Standard Deviation
– Example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10).
$x$     $x - \bar{x}$        $(x - \bar{x})^2$
1       1 – 6.5 = -5.5       30.3
2
7
9
10
10
mean = $\bar{x}$ =
$\sum (x - \bar{x})^2$ =
$s^2$ =
$s$ =
• Standard Deviation – the following is an alternate means to calculate the sample std. deviation.
$\text{sample std. deviation} = s = \sqrt{\frac{SS_x}{n - 1}}$ where $SS_x = \sum (x^2) - \frac{(\sum x)^2}{n}$
• Standard Deviation
– Previous example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10) using the alternate method.
$x$     $x^2$
1       1
2       4
7
9
10
10
$\sum x$ =
$\sum x^2$ =
$SS_x$ =
$s$ =
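The sketch below computes the sum of squares for (1, 2, 7, 9, 10, 10) both ways, from the definition and from the alternate $SS_x$ formula, to show that the two routes agree. Variable names are illustrative.

```python
import math

values = [1, 2, 7, 9, 10, 10]
n = len(values)
mean = sum(values) / n

# Definition: sum of squared deviations from the mean.
ss_definition = sum((x - mean) ** 2 for x in values)

# Alternate method: SS_x = sum(x^2) - (sum x)^2 / n.
ss_shortcut = sum(x * x for x in values) - sum(values) ** 2 / n

# Sample standard deviation divides by n - 1.
s = math.sqrt(ss_shortcut / (n - 1))

print(mean)           # 6.5
print(ss_definition)  # 81.5
print(ss_shortcut)    # 81.5
print(round(s, 2))    # 4.04
```

When the squared deviations are rounded to one decimal place before adding, as in a hand calculation, the sum can come out slightly higher (81.8 instead of 81.5).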
• Coefficient of Variation – while the standard deviation measures the spread of the data around the mean in the units of the data, the coefficient of variation (CV) expresses that spread as a % of the mean.
$\text{CV for a sample} = \frac{s}{\bar{x}} \times 100$
$\text{CV for a population} = \frac{\sigma}{\mu} \times 100$
– $s$ = sample standard deviation
– $\bar{x}$ = the sample mean value of the distribution
– $\sigma$ = population standard deviation
– $\mu$ = the population mean value of the distribution
• Histograms
– Sometimes it is difficult to see how data is distributed by just looking at the numbers. To see how data is distributed, a histogram is used.
– A histogram is a type of bar graph in which adjacent bars touch and the width of each bar represents a class (a range of data values).
• Histograms
[Figure: histogram titled "Probability Test". Horizontal axis: Test Scores, in seven classes from 59.5 - 65.5 through 95.5 - 101.5. Vertical axis: 0 to 10.]
• Histograms Procedure
1. Decide how many classes (bars) you want. This will be given by the problem.
2. To figure out the width of the bars, divide the range by the # of bars and then round up to the next whole number. (NOTE: Always round up, even if the decimal part is less than 0.5, e.g. 5.41 rounds up to 6.)
$\text{Bar width} = \frac{\text{highest value} - \text{lowest value}}{\text{\# of bars}}$
3. Add the bar width to the lowest value to get the range of the first bar, then add the bar width to the upper limit of that bar to get the range of the next bar. Keep going until you have all of your bar ranges. (e.g. if the lowest value is 60 and the bar width is 6, then the first bar is 60 – 66, the second bar is 66 – 72, etc.)
• Histograms Procedure (continued)
A problem occurs if a data point is 66, as in the example, because it falls on the limit shared by two bars. To avoid this problem, a boundary is calculated for the bars.
4. Calculate the boundaries of each bar:
a. Find the interval of the data. Is the data given to whole numbers, tenths, hundredths, etc.? (Note: the data will always have the same interval.)
b. Take the interval and divide by 2. This is the boundary adjustment. (e.g. whole numbers means an interval of 1, so 1/2 = 0.5)
c. For each bar range calculated previously in step 3, subtract the boundary adjustment from both the lower and upper limits. These will be your new bar ranges, or boundaries. (e.g. 60 – 0.5 = 59.5 and 66 – 0.5 = 65.5; first bar 59.5 – 65.5)
• Histograms Procedure (continued)
5. Calculate the midpoint of each bar:
a. Add the upper and lower limits of a bar together and divide by 2. This is the midpoint. (e.g. (59.5 + 65.5) / 2 = 62.5)
$\text{bar midpoint} = \frac{\text{bar upper limit} + \text{bar lower limit}}{2}$
b. Do this for all of the rest of the bars.
c. The midpoint is sometimes used instead of the boundaries to graph the bars.
• Histograms Procedure (continued)
6. Construct a frequency table by using tally marks.
59.5 – 65.5      ||
65.5 – 71.5      |
71.5 – 77.5      ||
77.5 – 83.5      |
83.5 – 89.5      ||
89.5 – 95.5      |||| ||
95.5 – 101.5     |||| ||||
7. Graph the frequency table using a bar graph arrangement.
• Draw the histogram for the following data. Put it into 5 classes. The data is the number of passing touchdowns for the top 20 rated quarterbacks in the 2011 season.
45   41   17   13    9
46   15   21   18   20
39   29   27   21   13
31   29   16   18   20
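Steps 1 through 6 of the histogram procedure can be sketched in code and applied to the touchdown data. The function name `histogram_classes` and its parameters are hypothetical; the sketch also assumes that an exact whole-number quotient is moved up by one, so that the highest value always fits in the last bar.

```python
import math

def histogram_classes(values, num_classes, interval=1):
    """Return bar width, class boundaries, midpoints, and frequencies."""
    low, high = min(values), max(values)
    # Step 2: range / number of bars, always rounded up to the next whole
    # number (assumption: an exact whole number also moves up by one).
    width = math.floor((high - low) / num_classes) + 1
    # Step 4: shift every limit down by half the data interval.
    start = low - interval / 2
    boundaries = [(start + i * width, start + (i + 1) * width)
                  for i in range(num_classes)]
    # Step 5: midpoint of each bar.
    midpoints = [(lo + hi) / 2 for lo, hi in boundaries]
    # Step 6: frequency table (a count instead of tally marks).
    counts = [0] * num_classes
    for v in values:
        counts[int((v - start) // width)] += 1
    return width, boundaries, midpoints, counts

touchdowns = [45, 41, 17, 13, 9, 46, 15, 21, 18, 20,
              39, 29, 27, 21, 13, 31, 29, 16, 18, 20]

width, boundaries, midpoints, counts = histogram_classes(touchdowns, 5)
for (lo, hi), mid, count in zip(boundaries, midpoints, counts):
    print(f"{lo} - {hi}   midpoint {mid}   frequency {count}")
```

For data recorded to tenths, the same function would be called with `interval=0.1`.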
• Histograms
– If the midpoint of each class is plotted, the points can be connected with straight line segments.
– This straight-line graph of the midpoints is known as a Frequency Polygon.
• Histograms
– What did the histogram indicate?
– Histograms can be used as a means of predicting outcome or probability. These are known as probability distributions.
– One of the famous probability distributions is the normal distribution, also known as the normal curve or bell curve.
• Normal Distribution
– The graph to the right is an example of a normal distribution. Not only does it indicate the results of the scores, but it can also be used for probability or predictions.
• Normal Distribution Properties
– The curve is bell shaped with the highest point at the mean value.
– It is symmetrical about a vertical line through the mean value.
– The curve approaches the horizontal axis but never touches it.
– The transition points (between cup down and cup up) occur at (mean + standard deviation) and (mean – standard deviation).
• Empirical Rule
– For a normal distribution the following can be said about the data:
• 68.2% of the data will lie within 1 standard deviation on either side of the mean.
• 95.4% of the data will lie within 2 standard deviations on either side of the mean.
• 99.7% of the data will lie within 3 standard deviations on either side of the mean.
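• Normal Distribution Properties
Percentage of the data in each band on one side of the mean:
• Between the mean and 1$\sigma$ = 34.1%
• Between 1$\sigma$ and 2$\sigma$ = 13.6%
• Between 2$\sigma$ and 3$\sigma$ = 2.15%
• Beyond 3$\sigma$ (>3$\sigma$) = 0.15%
• These %'s are used to indicate probabilities.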
• Example: Assume the heights of college women are normally distributed, with a mean of 65 inches and a SD of 2.5 inches.
– What % of women are taller than 65 inches, or, what is the probability that one woman selected is taller than 65 inches?
– Shorter than 65 inches?
– Between 62.5 and 67.5 inches?
– Between 60 and 70 inches?
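The empirical-rule answers to the height questions can be compared with the exact Normal areas. This sketch assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Heights of college women: mean 65 inches, SD 2.5 inches.
heights = NormalDist(mu=65, sigma=2.5)

taller_than_65 = 1 - heights.cdf(65)                      # area right of the mean
shorter_than_65 = heights.cdf(65)                         # area left of the mean
within_1_sd = heights.cdf(67.5) - heights.cdf(62.5)       # 62.5 to 67.5 inches
within_2_sd = heights.cdf(70) - heights.cdf(60)           # 60 to 70 inches

print(f"taller than 65:  {taller_than_65:.1%}")
print(f"shorter than 65: {shorter_than_65:.1%}")
print(f"62.5 to 67.5:    {within_1_sd:.1%}")
print(f"60 to 70:        {within_2_sd:.1%}")
```

The last two areas are the "within 1 standard deviation" and "within 2 standard deviations" figures of the empirical rule, computed without rounding.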
• Percentiles
– Sometimes it is more important to see the relative position of a piece of data rather than its exact value.
– Percentile refers to where data lies relative to the other data in the distribution. A data point at the nth percentile means n% of the data falls at or below that point and (100 – n)% falls at or above that point.
– Example: You scored in the 85th percentile; therefore 85% of the people who took the test scored at or below you while 15% scored at or above you. Note: this does NOT mean you scored 85% on the test.
• Percentiles
– The median is a type of percentile. It is the middle data point in the distribution; therefore it is at the 50th percentile.
– A special type of percentile known as the quartile is also used to evaluate the position of data.
– Quartiles split data into fourths.
– The 1st quartile (Q1) is the 25th percentile, the 2nd quartile (Q2) is the median, and the 3rd quartile (Q3) is the 75th percentile.
• Quartiles
[Figure: positions of Q1, Q2, and Q3]
– Interquartile Range (IQR) = Q3 – Q1
• Quartiles
– Procedure to compute quartiles:
1. Order the data from smallest to largest.
2. Find the median; this is the 2nd quartile, Q2.
3. The first quartile Q1 is then the median of the lower half of the data. It is the median of the data falling below Q2 and not including Q2.
4. The third quartile Q3 is then the median of the upper half of the data. It is the median of the data falling above Q2 and not including Q2.
• Quartiles
– Example (even # of data):
– Find Q1, Q2, Q3 & IQR for the following data: (3, 4, 9, 13, 20, 24)
1. Find Q2. Find the median of all of the data. There is no center data point, so take the mean of the two center data points: (9 + 13) / 2 = 11.
2. Find Q1. Find the median of the first half of the data, not including Q2. Q1 = 4
3. Find Q3. Find the median of the second half of the data, not including Q2. Q3 = 20
4. IQR = Q3 – Q1 = 20 – 4 = 16.
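The quartile procedure (median of each half, excluding Q2) can be sketched as below. The function names `median` and `quartiles` are hypothetical helpers written for this illustration; library routines may use a different quartile convention and give slightly different values.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 as the medians of the lower half, all data, and upper half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd count, the middle value is Q2 itself and is left out of both halves.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([3, 4, 9, 13, 20, 24])
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 4 11.0 20 16
```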
• Quartiles
– Example: A study of ice cream bars was done. Twenty-seven bars tested were rated as tasting "fair." The cost per bar is listed below. Find the quartiles and the IQR.
0.99   1.07   1.00   0.50   0.37   1.03   1.07   1.07   0.97
0.63   0.33   0.50   0.97   1.08   0.47   0.84   1.23   0.25
0.50   0.40   0.33   0.35   0.17   0.38   0.20   0.18   0.16
• Quartiles
– Together, Q1, Q2, Q3, the highest value, and the lowest value in a table of data are known as a Five-Number Summary.
– In order to graphically represent the five-number summary, a Box-and-Whisker Plot will be used.
• Quartiles
– Box-and-Whisker Plot (shown vertically but can be done horizontally as well)
[Figure: vertical box-and-whisker plot labeled, from top to bottom: Highest Value, Q3, Q2, Q1, Lowest Value.]
• Quartiles
– Procedure to make a Box-and-Whisker Plot:
• Draw a vertical scale to include the lowest and highest data values.
• To the right of the scale draw a box from Q1 to Q3.
• Include a solid line through the box at the median level.
• Draw solid lines called whiskers from Q1 to the lowest value and from Q3 to the highest value.
– EXAMPLE: Go back to the ice cream problem and create a box-and-whisker plot.
• Outliers
– Sometimes an extreme data value can skew the average of a set of data.
– When a data point lies more than 1.5 times the interquartile range (the difference between the 3rd and 1st quartiles) below Q1 or above Q3, then it may be considered an outlier.
– Outliers are sometimes removed from the data so that they do not skew the results.
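A minimal sketch of the 1.5 × IQR check follows, using the 16-value data set from the mode and range examples. The helper names are hypothetical, and the quartiles are found with the median-of-halves procedure described earlier in this unit.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def iqr_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    # Skip the middle value when the count is odd (it is Q2 itself).
    q3 = median(ordered[half + len(ordered) % 2:])
    reach = 1.5 * (q3 - q1)
    low_fence, high_fence = q1 - reach, q3 + reach
    return [x for x in ordered if x < low_fence or x > high_fence]

data = [2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5]
outliers = iqr_outliers(data)
kept = [x for x in data if x not in outliers]

print(outliers)                # [44]
print(sum(data) / len(data))   # 6.5
print(sum(kept) / len(kept))   # 4.0
```

Dropping the single value 44 moves the mean of this data set from 6.5 to 4.0, which shows how strongly one outlier can pull the mean.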
• Scatter Plots
– Remember from last unit that data can be plotted as a series of x and y points known as a scatter plot.
– We estimated a line of best fit. In doing this, we were finding a linear correlation that exists between the x and y points.
– We shall analyze the data of a scatter plot more closely in the next couple of slides.
Time (seconds)     Position (meters)
0.7                3.8
1.8                3.2
2.6                2.8
3.4                2.2
3.8                1.8
4.1                1.4
4.9                0.8
6.0                0.2
6.5                0
• Scatter Plots
– The y-distance between a data point and the line of best fit is known as a Residual.
– The optimal line of best fit occurs when the sum of the squares of all of the residual values is the smallest. This is known as finding the line of best fit through the Least Squares method.
• Least Squares Method
– Recall that the equation of a line has the form: $y = mx + b$
– This method will allow us to find the optimal slope ($m$) and the y-intercept ($b$) based on the data.
– We will use a similar method here as we did for calculating standard deviation.
• Least Squares Method: $y = mx + b$
– To find the slope $m$, the following equation is used:
$m = \frac{SS_{xy}}{SS_x}$ where $SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ and $SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$
– To find the y-intercept $b$, the following equation is used:
$b = \bar{y} - m\bar{x}$ where $\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x
X-data: Time (seconds)     Y-data: Position (meters)     $x^2$     $xy$
0.7                        3.8
1.8                        3.2
2.6                        2.8
3.4                        2.2
3.8                        1.8
4.1                        1.4
4.9                        0.8
6.0                        0.2
6.5                        0
$\sum x$ =                 $\sum y$ =                    $\sum x^2$ =     $\sum xy$ =
$\bar{x}$ =                $\bar{y}$ =
• Example:
1. From the example on the previous page find the slope:
$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} =$
$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} =$
$m = \frac{SS_{xy}}{SS_x} =$
2. From the example on the previous page find the y-intercept:
3. Write the equation for the line of least squares.
$y = mx + b$
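The slope and intercept for the time and position data can be computed directly from the sums. The sketch below uses a hypothetical function name, `least_squares_line`, and follows the $SS_{xy}$ and $SS_x$ definitions of the least squares method, with the intercept taken as the mean of y minus the slope times the mean of x.

```python
time_s = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]
position_m = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]

def least_squares_line(xs, ys):
    """Slope m and intercept b of the least squares line y = mx + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    m = ss_xy / ss_x
    # The line passes through the point (mean of x, mean of y).
    b = sum_y / n - m * sum_x / n
    return m, b

m, b = least_squares_line(time_s, position_m)
print(f"y = {m:.4f}x + {b:.3f}")  # y = -0.6974x + 4.419
```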
[Figure: Graph 1: Movement of a Car. Horizontal axis: Time (seconds), 0 to 7. Vertical axis: -0.5 to 4.5. Trendline: y = -0.6974x + 4.419, R² = 0.9886]
• Measuring the Spread of Data
– There are three methods for measuring the spread of the data around the line of least squares:
• Standard Error of Estimate
• Coefficient of Correlation
• Coefficient of Determination
• Standard Error of Estimate
– To compute it, we look at how far each y data point is from the least squares line.
– This method calculates a single value that represents the spread of all of the data.
– We will use values that were already calculated in figuring out the least squares line.
• Standard Error of Estimate
$S_e = \sqrt{\frac{SS_y - m\,SS_{xy}}{n - 2}}$
– Why would it be n – 2? (In other words, why does n have to be > 2?)
– Use the same method as before to find $m$, $SS_{xy}$, and $SS_x$; and:
$SS_y = \sum y^2 - \frac{(\sum y)^2}{n}$
X-data: Time (seconds)     Y-data: Position (meters)     $y^2$
0.7                        3.8
1.8                        3.2
2.6                        2.8
3.4                        2.2
3.8                        1.8
4.1                        1.4
4.9                        0.8
6.0                        0.2
6.5                        0
$\sum x$ =                 $\sum y$ =                    $\sum y^2$ =
Previously Calculated Data: $SS_{xy}$ =     $m$ =
• Example:
1. From the example on the previous page find the following:
$SS_y = \sum y^2 - \frac{(\sum y)^2}{n} =$
2. From the above calculation and previously calculated data find the standard error of estimate:
$S_e = \sqrt{\frac{SS_y - m\,SS_{xy}}{n - 2}} =$
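The standard error of estimate for the time and position data can be sketched two ways: from the $SS$ shortcut, and directly from the residuals, which should give the same value. Variable names are illustrative.

```python
import math

xs = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]   # time (seconds)
ys = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]   # position (meters)
n = len(xs)

ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n

m = ss_xy / ss_x
b = sum(ys) / n - m * sum(xs) / n

# Shortcut form: S_e = sqrt((SS_y - m * SS_xy) / (n - 2)).
s_e = math.sqrt((ss_y - m * ss_xy) / (n - 2))

# Same quantity from the residuals (y minus the value on the line).
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
s_e_from_residuals = math.sqrt(sum(r * r for r in residuals) / (n - 2))

print(round(s_e, 4), round(s_e_from_residuals, 4))
```

The result carries the units of y (meters here).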
• Linear Correlation Coefficient, r
– So far, we have been able to figure the line of best fit by using the line of least squares (which is also known as the "least squares regression line of y on x").
– We then wanted to determine the quality of our line by using the standard error of estimate.
– The problem with the standard error of estimate is that it has units of y; therefore, when looking at two different sets of data, you cannot say that one graph is better than the other because the units may skew the result.
– The linear correlation coefficient helps to alleviate this problem by calculating a number that is unitless and therefore independent of the units.
• Linear Correlation Coefficient, r
$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$
– The value of r:
Value of r           Indication
0                    There is no linear relationship between the data points.
1 or -1              There is a perfect linear relationship between the x and y data points; all points lie on the least-squares line.
Between 0 and 1      The x and y data points have a positive correlation (+ slope).
Between 0 and -1     The x and y data points have a negative correlation (- slope).
X-data: Time (seconds)     Y-data: Position (meters)
0.7                        3.8
1.8                        3.2
2.6                        2.8
3.4                        2.2
3.8                        1.8
4.1                        1.4
4.9                        0.8
6.0                        0.2
6.5                        0
Previously Calculated Data: $SS_{xy}$ =     $SS_x$ =     $SS_y$ =
• Example:
1. From the example on the previous page find the following:
$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} =$
2. What does the value of r indicate about the correlation of the data points?
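The correlation coefficient for the time and position data follows from the three sums of squares. This is a sketch with illustrative variable names.

```python
import math

xs = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]   # time (seconds)
ys = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]   # position (meters)
n = len(xs)

ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n

# r = SS_xy / sqrt(SS_x * SS_y); the units of x and y cancel.
r = ss_xy / math.sqrt(ss_x * ss_y)
print(round(r, 4))  # -0.9943
```

A value this close to -1 indicates a strong negative linear correlation: position decreases almost linearly as time increases.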
• Coefficient of Determination, $r^2$
– Another way of looking at the quality of your data is to look at how far some y-data point ($y$) is from the mean of the y-data ($\bar{y}$). This is simply the deviation, $y - \bar{y}$.
– The deviation is made up of two parts:
• The first part indicates how far the least squares line ($y_p$) is from the mean of the y-data ($\bar{y}$). This is simply $y_p - \bar{y}$, and this is known as the explained portion of the deviation.
• The second part indicates how far a particular y-data point ($y$) is from the least squares line ($y_p$). This is simply $y - y_p$, and this is known as the unexplained portion of the deviation.
• Coefficient of Determination, $r^2$
– Recall that when the deviation is squared we get the variance or variation. Based on the explanation before, the variance has two parts: the explained variation and the unexplained variation.
– The Coefficient of Determination is a ratio of the explained variation to the total variation and is simply calculated by taking the Correlation Coefficient ($r$) and squaring it.
• Coefficient of Determination, $r^2$
– So what does $r^2$ indicate?
– Change $r^2$ into a %.
– The % indicates what % of the variation of the y data is explained by the variation of the x data if we use the least squares line.
– $100\% - r^2$ indicates what % of the variation of the y data is due to random chance or some other variable besides x that may influence y.
• Example:
1. From the previous example find the Coefficient of Determination, $r^2$:
$r^2 = \frac{(SS_{xy})^2}{SS_x \, SS_y} =$
2. What does the value of $r^2$ indicate about the explained and unexplained portions of the variation?
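The sketch below computes the explained, unexplained, and total variation for the time and position data, then checks that the ratio of explained to total variation matches $(SS_{xy})^2 / (SS_x \, SS_y)$. The name `y_pred` stands in for the value on the least squares line ($y_p$) and, like the other variable names, is illustrative.

```python
xs = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]   # time (seconds)
ys = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]   # position (meters)
n = len(xs)
y_mean = sum(ys) / n

ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n

m = ss_xy / ss_x
b = y_mean - m * sum(xs) / n
y_pred = [m * x + b for x in xs]   # points on the least squares line

explained = sum((yp - y_mean) ** 2 for yp in y_pred)            # line vs. mean
unexplained = sum((y - yp) ** 2 for y, yp in zip(ys, y_pred))   # data vs. line
total = sum((y - y_mean) ** 2 for y in ys)                      # data vs. mean

r_squared = explained / total
print(f"{r_squared:.2%} explained, {1 - r_squared:.2%} unexplained")
```

For this data roughly 98.9% of the variation in position is explained by time, in line with the R² shown on the graph of the car's movement.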
• Correlation vs Causation
– Correlation refers to one variable changing as another variable changes.
– Causation refers to one variable changing because of another variable changing. (Cause & Effect)
– Just because there is a correlation between two variables does not mean that one causes the other.
DO NOW / HW Unit 2-1 Check
• Have out your homework and do the following: Find the mode, median, mean and standard deviation.
60%    63%    66%    74%    74%
77%    86%    89%    89%    91%
91%    94%    94%    94%    94%
94%    97%    97%    100%   100%
100%   100%   100%   100%   100%
HW Assignment 2-1 Check
• 10, 12, 14, 18, 36, 37, pg. 449 – 50
10.   8.33          9            9
12.   85.625        85.5         91
14.   2.77          2.9          2.9
18.   14            4
36.   $233,071.43   $142,000     none
37.   $645,000      $213,242.66
DO NOW / HW Unit 2-2 Check
• Have out your homework and do the following: Make a histogram of the following data in 7 classes. These were the top 32 quarterback ratings in the NFL in 2012.
108.0   105.8   102.4   100.0
99.1    98.7    97.0    96.3
90.7    90.5    88.6    87.7
87.4    87.2    86.2    85.3
83.3    82.6    81.6    81.3
81.2    79.8    79.1    78.1
77.4    76.5    76.1    74.0
72.6    72.2    66.9    66.7
RANGE: 41.3
CLASSES: 7.0
BAR WIDTH: 6
BAR STARTING POINT: 66.7
UPPER BAR RANGES: 72.7, 78.7, 84.7, 90.7, 96.7, 102.7, 108.7
INTERVAL: 0.1
BOUNDARY ADJUSTMENT: 0.05
BOUNDARIES STARTING POINT: 66.65
BOUNDARY RANGES: 72.65, 78.65, 84.65, 90.65, 96.65, 102.65, 108.65
Boundaries          # OF QB'S
66.65 - 72.65       4
72.65 - 78.65       5
78.65 - 84.65       7
84.65 - 90.65       7
90.65 - 96.65       2
96.65 - 102.65      5
102.65 - 108.65     2
HW Assignment 2-2
EXPERIMENTAL DESIGN
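• Standard Deviation
– Example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10).
$x$     $x - \bar{x}$        $(x - \bar{x})^2$
1       1 – 6.5 = -5.5       30.3
2       2 – 6.5 = -4.5       20.3
7       7 – 6.5 = 0.5        0.3
9       9 – 6.5 = 2.5        6.3
10      10 – 6.5 = 3.5       12.3
10      10 – 6.5 = 3.5       12.3
Mean = 39/6 = 6.5
$\sum (x - \bar{x})^2$ = 81.8
– $s^2$ = 81.8 / 5 = 16.4
$s$ = 4.05
(Each squared deviation above is rounded to one decimal place. Without that rounding the sum is 81.5, giving $s^2$ = 16.3 and $s$ = 4.04.)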
• Standard Deviation
– Previous example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10) using the alternate method.
$x$     $x^2$
1       1
2       4
7       49
9       81
10      100
10      100
$\sum x$ = 39
$\sum x^2$ = 335
– $SS_x$ = 335 – 39²/6 = 81.5
$s$ = 4.04