SIA Unit 3

SCIENTIFIC INQUIRY AND

ANALYSIS

UNIT 2

STATISTICAL DATA ANALYSIS

OBJECTIVES:

The student will be able to:

Create a frequency table from a set of data.

(CCCS.HSS.ID.A.1)

Compute, interpret, and analyze the measures of central tendency (mean, median, and mode) of a set of data.

(CCCS.HSS.ID.A.2)

Compute measures of spread (variance, standard deviation, quartiles, and interquartile range)

(CCCS.HSS.ID.A.2)

Graph one variable by hand (histogram, boxplot).

(CCCS.HSS.ID.A.1)

OBJECTIVES:

The student will be able to:

Identify outliers informally and recognize their effect on a set of data. (CCCS.HSS.ID.A.3)

Define the characteristics of the Normal distribution by examining a histogram. (CCCS.HSS.ID.A.4)

Explain how a histogram, which is a discrete probability distribution, is related to the Normal distribution curve, a continuous probability distribution. (CCCS.HSS.ID.A.4)

Determine if a given set of data is approximately Normal using the empirical rule (68 - 95 - 99.7 rule). (CCCS.HSS.ID.A.4)

Estimate areas under the Normal curve using the empirical rule.

OBJECTIVES:

The student will be able to:

Graph two variables by hand (scatterplot).

(CCCS.HSS.ID.B.6)

Describe a scatterplot in terms of form, direction, strength, and the presence of outliers. (CCCS.HSS.ID.B.6)

Find equations of lines of best fit by fitting a line by hand and using technology (TI-84 regression function and/or Excel). (CCCS.HSS.ID.B.6.A)

Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.

(CCCS.HSS.ID.C.7)

OBJECTIVES:

The student will be able to:

Compute the correlation coefficient using technology (TI-84 or Excel) and interpret it in the context of the data. (CCCS.HSS.ID.C.8)

Informally assess the fit of a function by plotting and analyzing residuals. (CCCS.HSS.ID.B.6.B)

Make predictions based upon analysis of data.

(5.2.12.A.3)

Distinguish between correlation and causation.

(CCCS.HSS.ID.C.9)

Statistics

– collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions from the data.

Measures of Central Tendency

– a method to describe the entire sample or population in a single number known as an average (mode, median and mean)

Mode

– the value that occurs most frequently in data.

Example 1: What is the mode of the following data: (2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5)?

Mode

Example 2: What is the mode of the following data: (2, 5, 3, 10, 1, 4, 4, 10, 1, 2, 3, 4, 1, 10, 3, 2, 5, 5)?

Mode is not a stable average, but it gives you the most common value in a distribution if that is the information desired.

There can sometimes be more than one mode in a given set of data.
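The two mode examples can be checked with a short Python sketch. It assumes the standard-library function `statistics.multimode`, which returns every value tied for the highest count; the variable names are illustrative only.

```python
from statistics import multimode

example_1 = [2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5]
example_2 = [2, 5, 3, 10, 1, 4, 4, 10, 1, 2, 3, 4, 1, 10, 3, 2, 5, 5]

# multimode lists all values that share the highest frequency,
# in the order in which each value first appears in the data.
modes_1 = multimode(example_1)
modes_2 = multimode(example_2)

print("Example 1 mode(s):", modes_1)
print("Example 2 mode(s):", modes_2)
```

In Example 2 every value occurs the same number of times, so the function returns all of them; some textbooks describe that situation as having no mode.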

Median

– the central value that occurs in an ordered distribution of data.

If there is an odd number of data, it is the center value.

If there is an even number of data, there are two center values therefore:

Median = sum of two middle values / 2

Median

Example 1: What is the median of the following data: (62, 3, 5, 28, 67, 33, 22, 2, 10)?

Example 2: What is the median of the following data: (62, 3, 5, 28, 67, 33, 22, 2, 10, 120)?

Median is a more stable average than the mode, but it does not indicate the range of values above or below it.
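As a rough check on the two median examples, the sketch below uses the standard-library `statistics.median`, which sorts the data and averages the two middle values when the count is even. The variable names are assumptions made for illustration.

```python
from statistics import median

odd_data = [62, 3, 5, 28, 67, 33, 22, 2, 10]   # 9 values: one center value
even_data = odd_data + [120]                   # 10 values: two center values

# median() orders the data first, so the input does not need to be sorted.
median_odd = median(odd_data)
median_even = median(even_data)

print(sorted(odd_data), "->", median_odd)
print(sorted(even_data), "->", median_even)
```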

Mean

– the sum of all values in a distribution of data divided by the number of values.

Sample mean: $\bar{x} = \dfrac{\sum x}{n} = \dfrac{\text{sum of all values}}{\text{number of values}}$

Population mean: $\mu = \dfrac{\sum x}{N}$

Mean

Trimmed Mean: removes the highest and lowest values of a group of data before taking a mean. The typical trim amounts are either 5% or 10%.

5% Trimmed Mean: take 5% of the number of data points, round the answer to a whole number, remove that many values from the top and from the bottom, and then take the average of what is left.

Mean

Example: Given the following data take the 5% trimmed mean: 34, 56, 72, 74, 78, 82, 85, 85, 88, 90, 90, 92, 95, 95, 99, 100.

5% of 16 values is 0.8, therefore round up to 1 and remove the top and bottom scores.

Remove 34 & 100; the remaining 14 values add up to 1181, and 1181 / 14 = 84.4%

If no trimming is done, then the mean would be 82.2%.
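The trimmed-mean steps can be sketched in Python as follows. The function name `trimmed_mean` is hypothetical, and the rounding rule (round half up) is an assumption, since the slide only says to round the answer.

```python
def trimmed_mean(values, trim_fraction):
    """Drop k values from each end of the sorted data, then average the rest."""
    ordered = sorted(values)
    # Number of values to remove from EACH end, rounded half up (assumed rule).
    k = int(len(ordered) * trim_fraction + 0.5)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = [34, 56, 72, 74, 78, 82, 85, 85, 88, 90, 90, 92, 95, 95, 99, 100]

trimmed = trimmed_mean(scores, 0.05)    # 5% of 16 = 0.8, so one value per end
untrimmed = trimmed_mean(scores, 0.0)   # no trimming: the ordinary mean

print(round(trimmed, 1), round(untrimmed, 1))
```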

Measures of Variation

– a description of how spread out the data are.

Range

– the difference between the largest and smallest values of a distribution.

Example 1: What is the range of the following data: (2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5)?

Range fails to tell how much values vary from one another.

Sample Standard Deviation

– a measurement that gives you a better idea of how the data entries differ from the mean.

Sample standard deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Sample variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

– $x$ = a value in the distribution

– $\bar{x}$ = the sample mean value of the distribution

– $n$ = the total number of values in a sample distribution

Population Standard Deviation

– this is the same as the sample standard deviation, except that it covers the complete population that you are studying, not just a sample set. NOTE: the symbol is different, and you divide by the size of the whole population ( N ) instead of n – 1.

Population standard deviation: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

– $x$ = a value in the distribution

– $\mu$ = the population mean value of the distribution

– $N$ = the total number of values in the population
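To see the effect of dividing by n – 1 versus N, the sketch below applies the standard-library functions `statistics.stdev` (sample) and `statistics.pstdev` (population) to the six values used in the standard deviation example. Treating the same six numbers as both a sample and a population is only for comparison.

```python
from statistics import mean, variance, stdev, pstdev

data = [1, 2, 7, 9, 10, 10]

x_bar = mean(data)       # 39 / 6
s2 = variance(data)      # sample variance: divides by n - 1
s = stdev(data)          # sample standard deviation
sigma = pstdev(data)     # population standard deviation: divides by N

print(f"mean = {x_bar}, s^2 = {s2:.2f}, s = {s:.2f}, sigma = {sigma:.2f}")
```

The population value comes out smaller than the sample value because the same sum of squared deviations is divided by a larger number.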

Standard Deviation

Example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10).

$x$      $x - \bar{x}$         $(x - \bar{x})^2$
1      1 – 6.5 = -5.5      30.25
2
7
9
10
10

Mean = $\bar{x}$ =

$\sum (x - \bar{x})^2$ =

$s^2$ =

$s$ =

Standard Deviation

– the following is an alternate way to calculate the sample standard deviation.

$s = \sqrt{\dfrac{SS_x}{n - 1}}$, where $SS_x = \sum x^2 - \dfrac{(\sum x)^2}{n}$

Standard Deviation

Previous example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10) using the alternate method.

$x$      $x^2$
1      1
2      4
7
9
10
10

$\sum x$ =      $\sum x^2$ =

$SS_x$ =

$s$ =
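The alternate method only needs two running totals, Σx and Σx². A minimal sketch in Python, with variable names chosen here for illustration:

```python
import math

data = [1, 2, 7, 9, 10, 10]
n = len(data)

sum_x = sum(data)                     # Σx
sum_x2 = sum(x * x for x in data)     # Σx²

# SS_x = Σx² − (Σx)² / n
ss_x = sum_x2 - sum_x ** 2 / n

# s = sqrt(SS_x / (n − 1))
s = math.sqrt(ss_x / (n - 1))

print(sum_x, sum_x2, ss_x, round(s, 2))
```

Because no intermediate value is rounded, this route avoids the small rounding drift that can creep in when each squared deviation is rounded before adding.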

Coefficient of Variation

– while the standard deviation indicates the spread of the data around the mean value in the units of the data, the coefficient of variation (CV) expresses that spread as a % of the mean.

CV for a sample: $CV = \dfrac{s}{\bar{x}} \times 100$

CV for a population: $CV = \dfrac{\sigma}{\mu} \times 100$

– $s$ = sample standard deviation

– $\bar{x}$ = the sample mean value of the distribution

– $\sigma$ = population standard deviation

– $\mu$ = the population mean value of the distribution
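As an illustration, the sample CV can be computed for the six values from the standard deviation example. This is a sketch using the standard-library `statistics` module; reusing that data set here is a choice made for this example, not something the slides prescribe.

```python
from statistics import mean, stdev

data = [1, 2, 7, 9, 10, 10]

# CV for a sample = (s / x-bar) * 100
cv_sample = stdev(data) / mean(data) * 100

print(f"CV = {cv_sample:.1f}%")
```

A CV this large says the typical spread is more than half the size of the mean, which is easier to compare across data sets than a standard deviation with units.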

Histograms

Sometimes it is difficult to see how data is distributed by just looking at the numbers. To see how data is distributed, a histogram is used.

A histogram is a type of bar graph, except that all of the bars touch and the width of each bar represents a range of values (a class).

Histograms

[Figure: histogram titled "Probability Test". Horizontal axis: Test Scores, in classes 59.5 - 65.5, 65.5 - 71.5, 71.5 - 77.5, 77.5 - 83.5, 83.5 - 89.5, 89.5 - 95.5, and 95.5 - 101.5. Vertical axis: scale from 0 to 10.]

Histograms Procedure

1. Decide how many classes (bars) you want. The number will be given by the problem.

2. To figure out the width of the bars, divide the range by the # of bars and then round up to the next whole number. (NOTE: Always round up, even if the decimal part is less than .5; e.g. 5.41 rounds up to 6.)

$\text{Bar width} = \dfrac{\text{highest value} - \text{lowest value}}{\text{\# of bars}}$

3. Add the bar width to the lowest value to get the range of the first bar, then add the bar width to the upper end of that bar to get the range of the next bar. Keep going until you have all of your bar ranges. (e.g. if the lowest value is 60 and the bar width is 6, then the first bar is 60 – 66, the second bar is 66 – 72, etc.)

Histograms Procedure (continued)

A problem occurs if a data point falls exactly on a limit shared by two bars, such as 66 in the example: it could be counted in either bar.

In order to avoid this problem, a boundary is calculated for the bars.

4. Calculate the boundaries of each bar:

a. Find the interval of the data. Is the data given to whole numbers, tenths, hundredths, etc.? (Note: all of the data will be given to the same interval.)

b. Divide the interval by 2. This is the boundary adjustment. (e.g. whole numbers means an interval of 1, so 1 / 2 = 0.5)

c. For each bar range calculated in step 3, subtract the boundary adjustment from both the lower and the upper limit. These will be your new bar ranges, or boundaries. (e.g. 60 – 0.5 = 59.5 and 66 – 0.5 = 65.5; the first bar is 59.5 – 65.5)

Histograms Procedure (continued)

5. Calculate the midpoint of each bar:

a. Add the upper and lower limits of a bar together and divide by 2. This will be the midpoint. (e.g. (59.5 + 65.5) / 2 = 62.5)

$\text{bar midpoint} = \dfrac{\text{bar upper limit} + \text{bar lower limit}}{2}$

b. Do this for all of the rest of the bars.

c. The midpoint is sometimes used instead of the boundaries to graph the bars.

Histograms Procedure (continued)

6. Construct a frequency table by using tally marks.

Boundaries        Tally
59.5 – 65.5       ||
65.5 – 71.5       |
71.5 – 77.5       ||
77.5 – 83.5       |
83.5 – 89.5       ||
89.5 – 95.5       |||| ||
95.5 – 101.5      |||| ||||

7. Graph the frequency table using a bar graph arrangement.

Draw the histogram for the following data.

Put it into 5 classes. The data is the number of passing touchdowns for the top 20 rated quarterbacks in the 2011 season.

45

41

17

13

9

46

15

21

18

20

39

29

27

21

13

31

29

16

18

20
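The histogram procedure (steps 2 through 6) can be sketched in Python and applied to the touchdown data above. The function names `class_boundaries` and `frequencies` are hypothetical, and `interval=1` reflects the assumption that the data is given to whole numbers.

```python
import math

def class_boundaries(data, num_classes, interval=1):
    """Return (lower, upper) boundaries for each class, following steps 2 to 4."""
    low, high = min(data), max(data)
    # Step 2: bar width = range / number of bars, rounded up to a whole number.
    width = math.ceil((high - low) / num_classes)
    # Step 4: shift every limit down by half of the data interval.
    start = low - interval / 2
    return [(start + i * width, start + (i + 1) * width) for i in range(num_classes)]

def frequencies(data, boundaries):
    """Step 6: count the data points that fall between each pair of boundaries."""
    return [sum(1 for v in data if lo <= v < hi) for lo, hi in boundaries]

touchdowns = [45, 41, 17, 13, 9, 46, 15, 21, 18, 20,
              39, 29, 27, 21, 13, 31, 29, 16, 18, 20]

boundaries = class_boundaries(touchdowns, 5)
midpoints = [(lo + hi) / 2 for lo, hi in boundaries]   # step 5
counts = frequencies(touchdowns, boundaries)

for (lo, hi), mid, count in zip(boundaries, midpoints, counts):
    print(f"{lo} - {hi}  midpoint {mid}  frequency {count}")
```

One caveat of this sketch: if the range divides evenly by the number of classes, `math.ceil` leaves the width unchanged and the highest value falls just outside the last class, so a wider bar would be needed in that case.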

Histograms

If the midpoint of each class is plotted at the height of its bar, the points can be connected with straight line segments.

This line graph of the midpoints is known as a Frequency Polygon.

Histograms

What did the histogram indicate?

Histograms can be used as a means of estimating how likely an outcome is. Used this way, they are known as probability distributions.

One of the famous probability distributions is the normal distribution, also known as the normal curve or bell curve.

Normal Distribution

The graph to the right is an example of a normal distribution. Not only does it indicate the results of the scores, but it can also be used for probability or predictions.

Normal Distribution Properties

The curve is bell shaped with the highest point at the mean value.

It is symmetrical about a vertical line through the mean value.

The curve approaches the horizontal axis but never touches it.

The transition points (between concave down and concave up) occur at (mean + standard deviation) and (mean – standard deviation).

Empirical Rule

For a normal distribution the following can be said about the data:

About 68.2% of the data will lie within 1 standard deviation on either side of the mean.

About 95.4% of the data will lie within 2 standard deviations on either side of the mean.

About 99.7% of the data will lie within 3 standard deviations on either side of the mean.

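Normal Distribution Properties

• On each side of the mean, the area under the curve divides as follows:

Between the mean and 1σ = 34.1%

Between 1σ and 2σ = 13.6%

Between 2σ and 3σ = 2.15%

Beyond 3σ = 0.15%

• These %’s are used to indicate probabilities.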

Example: Assume the heights of college women are normally distributed, with a mean of 65 inches and a SD of 2.5 inches.

– What % of women are taller than 65 inches? Equivalently, what is the probability that one woman selected at random is taller than 65 inches?

Shorter than 65 inches?

Between 62.5 and 67.5 inches?

Between 60 and 70 inches?
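The empirical-rule estimates for the height example can be compared with exact areas under the Normal curve. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Heights of college women: mean 65 inches, standard deviation 2.5 inches.
heights = NormalDist(mu=65, sigma=2.5)

# cdf(x) is the area under the curve to the left of x.
taller_than_65 = 1 - heights.cdf(65)
shorter_than_65 = heights.cdf(65)
within_1_sd = heights.cdf(67.5) - heights.cdf(62.5)   # 62.5 to 67.5 inches
within_2_sd = heights.cdf(70) - heights.cdf(60)       # 60 to 70 inches

print(f"{taller_than_65:.1%} {shorter_than_65:.1%} {within_1_sd:.1%} {within_2_sd:.1%}")
```

The exact areas differ from the empirical rule only in later decimal places, which is why the rule is adequate for estimates made by hand.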

Percentiles

Sometimes it is more important to see the relative position of a piece of data than its exact value.

Percentile refers to where data lies relative to the other data in the distribution. A data point at the nth percentile means n% of the data falls at or below that point and (100 – n)% falls at or above that point.

Example: You scored in the 85th percentile; therefore 85% of the people who took the test scored at or below you, while 15% scored at or above you. Note: this does NOT mean you scored 85% on the test.

Percentiles

The median is a type of percentile. It is the middle data point in the distribution; therefore it is at the 50th percentile.

A special type of percentile known as the quartile is also used to evaluate the position of data.

Quartiles split data into fourths.

The 1st quartile (Q1) is the 25th percentile, the 2nd quartile (Q2) is the median, and the 3rd quartile (Q3) is the 75th percentile.

Quartiles

[Figure: positions of Q1, Q2, and Q3]

Interquartile Range (IQR) = Q3 – Q1

Quartiles

Procedure to compute quartiles:

1. Order the data from smallest to largest.

2. Find the median; this is the 2nd quartile, Q2.

3. The first quartile, Q1, is then the median of the lower half of the data. It is the median of the data falling below Q2, not including Q2.

4. The third quartile, Q3, is then the median of the upper half of the data. It is the median of the data falling above Q2, not including Q2.

Quartiles

Example (even # of data):

Find Q1, Q2 & Q3 & IQR for the following data:

(3, 4, 9, 13, 20, 24)

1. Find Q2. Find the median of all of the data. There is no center data point, so take the mean of the two center data points: (9 + 13) / 2 = 11.

2. Find Q1. Find the median of the first half of the data not including Q2. Q1 = 4

3. Find Q3. Find the median of the second half of the data not including Q2. Q3 = 20

4. IQR = Q3 – Q1 = 20 – 4 = 16.
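The quartile procedure can be written as a short Python sketch. The function name `quartiles` is hypothetical; it follows the slides' rule of taking the median of each half without including Q2. Note that library routines such as `statistics.quantiles` use interpolation rules that can give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule; needs 2+ values."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]     # values below Q2
    upper = ordered[-half:]    # values above Q2 (middle value skipped when n is odd)
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([3, 4, 9, 13, 20, 24])
iqr = q3 - q1

print(q1, q2, q3, iqr)
```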

Quartiles

Example: A study of ice cream bars was done. Twenty-seven bars tested were rated as tasting “fair.” The cost per bar is listed below. Find the quartiles and the IQR.

0.99

1.07

1.00

0.50

0.37

1.03

1.07

1.07

0.97

0.63

0.33

0.50

0.97

1.08

0.47

0.84

1.23

0.25

0.50

0.40

0.33

0.35

0.17

0.38

0.20

0.18

0.16

Quartiles

Q1, Q2, Q3, the highest value, and the lowest value in a table of data are together known as a Five-Number Summary.

In order to graphically represent the five-number summary, a Box-and-Whisker Plot will be used.

Quartiles

[Figure: Box-and-Whisker Plot, shown vertically (it can be drawn horizontally as well). Labels from top to bottom: Highest Value, Q3, Q2, Q1, Lowest Value.]

Quartiles

Procedure to make a Box-and-Whisker Plot:

Draw a vertical scale to include the lowest and highest data values.

To the right of the scale, draw a box from Q1 to Q3.

Include a solid line through the box at the median level.

Draw solid lines called whiskers from Q1 to the lowest value and from Q3 to the highest value.

EXAMPLE: Go back to the ice cream problem and create a box-and-whisker plot.

Outliers

Sometimes a few extreme values can skew the average of a set of data.

When a data point lies more than 1.5 × IQR (the difference between the 1st and 3rd quartiles) below Q1 or above Q3, then it may be considered an outlier.

Outliers are sometimes removed from the data so that they do not skew the results.
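A common way to apply the 1.5 × IQR idea is to flag values below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR. The sketch below assumes that convention and the median-of-halves quartile rule; the function names are hypothetical, and the data is the set used earlier for the mode and range examples.

```python
from statistics import mean, median

def quartiles(data):
    """(Q1, Q2, Q3) by the median-of-halves rule, not including Q2 in either half."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered), median(ordered[-half:])

def find_outliers(data, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (assumed convention)."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

data = [2, 5, 3, 2, 1, 6, 4, 10, 44, 2, 4, 1, 10, 3, 2, 5]
outliers = find_outliers(data)
kept = [x for x in data if x not in outliers]

# Compare the mean with and without the flagged values.
print(outliers, mean(data), mean(kept))
```

Comparing the two means shows how much a single extreme value can pull the average.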

Scatter Plots

Remember from last unit that data can be plotted as a series of x and y points known as a scatter plot.

We estimated a line of best fit. In doing this, we were finding a linear correlation that exists between the x and y points.

We shall analyze the data of a scatter plot more closely in the next couple of slides.

Time (seconds)      Position (meters)
0.7                 3.8
1.8                 3.2
2.6                 2.8
3.4                 2.2
3.8                 1.8
4.1                 1.4
4.9                 0.8
6.0                 0.2
6.5                 0

Scatter Plots

The y-distance that a data point is away from the line of best fit is known as a Residual.

The optimal line of best fit occurs when the sum of the squares of all of the residual values is the smallest. This is known as finding the line of best fit through the Least Squares method.

Least Squares Method

Recall that the equation of a line has the form: $y = mx + b$

This method will allow us to find the optimal slope ( m ) and the y-intercept ( b ) based on the data.

We will use a similar method here as we did for calculating standard deviation.

Least Squares Method: $y = mx + b$

To find the slope m, the following equation is used:

$m = \dfrac{SS_{xy}}{SS_x}$, where $SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$ and $SS_x = \sum x^2 - \dfrac{(\sum x)^2}{n}$

To find the y-intercept b, the following equation is used:

$b = \bar{y} - m\bar{x}$, where $\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x
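The least squares formulas can be tried on the time and position data from the scatter plot. This is a minimal Python sketch; the variable names are chosen for illustration and mirror the symbols in the formulas.

```python
time_s = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]       # x data
position_m = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]   # y data
n = len(time_s)

sum_x = sum(time_s)
sum_y = sum(position_m)
sum_x2 = sum(x * x for x in time_s)
sum_xy = sum(x * y for x, y in zip(time_s, position_m))

ss_xy = sum_xy - sum_x * sum_y / n    # SS_xy = Σxy − (Σx)(Σy)/n
ss_x = sum_x2 - sum_x ** 2 / n        # SS_x  = Σx² − (Σx)²/n

m = ss_xy / ss_x                      # slope
b = sum_y / n - m * (sum_x / n)       # y-intercept: b = y-bar − m * x-bar

print(f"y = {m:.4f}x + {b:.3f}")
```

The result should agree with a trendline produced by a TI-84 or Excel for the same data.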

X-data: Time (seconds)      Y-data: Position (meters)      $x^2$      $xy$
0.7                         3.8
1.8                         3.2
2.6                         2.8
3.4                         2.2
3.8                         1.8
4.1                         1.4
4.9                         0.8
6.0                         0.2
6.5                         0

$\sum x$ =      $\sum y$ =      $\sum x^2$ =      $\sum xy$ =

$\bar{x}$ =      $\bar{y}$ =

Example:

1. From the example on the previous page find the slope:

$SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} =$

$SS_x = \sum x^2 - \dfrac{(\sum x)^2}{n} =$

$m = \dfrac{SS_{xy}}{SS_x} =$

2. From the example on the previous page find the y-intercept:

3. Write the equation for the line of least squares.

$y = mx + b$

[Figure: Graph 1: Movement of a Car. Horizontal axis: Time (seconds), 0 to 7. Vertical axis: scale from -0.5 to 4.5. Fitted line: y = -0.6974x + 4.419, R² = 0.9886]

Measuring the Spread of Data

There are three methods for measuring the spread of the data around the line of least squares:

Standard Error of Estimate

Coefficient of Correlation

Coefficient of Determination

Standard Error of Estimate

To do this, we look at how far each y data point is from the least squares line.

This method will calculate a single value that is representative of the spread of all of the data around the line.

We will use values that were already calculated in figuring out the least squares line.

Standard Error of Estimate

$S_e = \sqrt{\dfrac{SS_y - m\,SS_{xy}}{n - 2}}$

Why would it be n – 2? (In other words, why does n have to be > 2?)

Use the same method as before to find m, $SS_{xy}$, and $SS_x$, and:

$SS_y = \sum y^2 - \dfrac{(\sum y)^2}{n}$

X-data: Time (seconds)      Y-data: Position (meters)      $y^2$
0.7                         3.8
1.8                         3.2
2.6                         2.8
3.4                         2.2
3.8                         1.8
4.1                         1.4
4.9                         0.8
6.0                         0.2
6.5                         0

$\sum x$ =      $\sum y$ =      $\sum y^2$ =

Previously Calculated Data:

$SS_{xy}$ =

$m$ =

Example:

1. From the example on the previous page find the following:

$SS_y = \sum y^2 - \dfrac{(\sum y)^2}{n} =$

2. From the above calculation and the previously calculated data find the standard error of estimate:

$S_e = \sqrt{\dfrac{SS_y - m\,SS_{xy}}{n - 2}} =$
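The standard error of estimate for the time and position data can be sketched in Python as below. The helper name `sum_of_squares` is hypothetical; it simply applies the SS formulas from the slides.

```python
import math

time_s = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]       # x data
position_m = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]   # y data
n = len(time_s)

def sum_of_squares(a, b):
    """SS_ab = Σab − (Σa)(Σb)/n; passing the same list twice gives SS_a."""
    return sum(p * q for p, q in zip(a, b)) - sum(a) * sum(b) / len(a)

ss_xy = sum_of_squares(time_s, position_m)
ss_x = sum_of_squares(time_s, time_s)
ss_y = sum_of_squares(position_m, position_m)
m = ss_xy / ss_x

# S_e = sqrt((SS_y − m * SS_xy) / (n − 2)), in the units of y (meters here).
s_e = math.sqrt((ss_y - m * ss_xy) / (n - 2))

print(f"SS_y = {ss_y:.2f}, S_e = {s_e:.3f} m")
```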

Linear Correlation Coefficient, r

So far, we have been able to figure the line of best fit by using the line of least squares (which is also known as the “least squares regression line of y on x”).

We then wanted to determine the quality of our line by using the standard error of estimate.

The problem with the standard error of estimate is that it has units of y; therefore, when looking at two different sets of data, you cannot say that one graph is better than the other because the units may skew the result.

The linear correlation coefficient helps to alleviate this problem by calculating a number that is unitless and therefore independent of the units.

Linear Correlation Coefficient, r

$r = \dfrac{SS_{xy}}{\sqrt{SS_x\,SS_y}}$

The value of r        Indication
0                     There is no linear relationship between the data points.
1 or -1               There is a perfect linear relationship between the x and y data points; all points lie on the least-squares line.
Between 0 and 1       The x and y data points have a positive correlation (+ slope).
Between 0 and -1      The x and y data points have a negative correlation (- slope).

X-data: Time (seconds)      Y-data: Position (meters)
0.7                         3.8
1.8                         3.2
2.6                         2.8
3.4                         2.2
3.8                         1.8
4.1                         1.4
4.9                         0.8
6.0                         0.2
6.5                         0

Previously Calculated Data:

$SS_{xy}$ =

$SS_x$ =

$SS_y$ =

Example:

1. From the example on the previous page find the following:

$r = \dfrac{SS_{xy}}{\sqrt{SS_x\,SS_y}} =$

2. What does the value of r indicate about the correlation of the data points?
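A short Python sketch of the correlation coefficient for the time and position data follows. The helper name `sum_of_squares` is an assumption made for this example; it applies the SS formulas given earlier.

```python
import math

time_s = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]       # x data
position_m = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]   # y data

def sum_of_squares(a, b):
    """SS_ab = Σab − (Σa)(Σb)/n; passing the same list twice gives SS_a."""
    return sum(p * q for p, q in zip(a, b)) - sum(a) * sum(b) / len(a)

ss_xy = sum_of_squares(time_s, position_m)
ss_x = sum_of_squares(time_s, time_s)
ss_y = sum_of_squares(position_m, position_m)

# r = SS_xy / sqrt(SS_x * SS_y); the units cancel, so r is unitless.
r = ss_xy / math.sqrt(ss_x * ss_y)

print(f"r = {r:.4f}")
```

A value this close to -1 indicates a strong negative linear correlation: position decreases almost linearly as time increases.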

Coefficient of Determination, r²

Another way of looking at the quality of your data is to look at how far away some y-data point ( $y$ ) is from the mean of the y-data ( $\bar{y}$ ). This is simply the deviation, $y - \bar{y}$.

The deviation is made up of two parts:

• The first part indicates how far away the least squares line ( $y_p$ ) is from the mean of the y-data ( $\bar{y}$ ). This is simply $y_p - \bar{y}$, and this is known as the explained portion of the deviation.

• The second part indicates how far away a particular y-data point ( $y$ ) is from the least squares line ( $y_p$ ). This is simply $y - y_p$, and this is known as the unexplained portion of the deviation.

Coefficient of Determination, r²

Recall that when the deviations are squared and added up we get the variation, the quantity that the variance is built from. Based on the explanation before, the total variation has two parts: the explained variation and the unexplained variation.

The Coefficient of Determination is the ratio of the explained variation to the total variation and is simply calculated by taking the Correlation Coefficient ( r ) and squaring it.

Coefficient of Determination, r²

So what does r² indicate?

Change r² into a %.

The % indicates what % of the variation of the y data is explained by the variation of the x data if we use the least squares line.

– 100% − r² (as a %) indicates what % of the variation of the y data is due to random chance or some other variable besides x that may influence y.

Example:

1. From the previous example find the Coefficient of Determination, r²:

$r^2 = \dfrac{(SS_{xy})^2}{SS_x\,SS_y} =$

2. What does the value of r² indicate about the explained and unexplained portions of the variation?
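The coefficient of determination for the same time and position data can be sketched as follows. The helper name `sum_of_squares` is hypothetical, and expressing the result as explained and unexplained percentages follows the interpretation given above.

```python
time_s = [0.7, 1.8, 2.6, 3.4, 3.8, 4.1, 4.9, 6.0, 6.5]       # x data
position_m = [3.8, 3.2, 2.8, 2.2, 1.8, 1.4, 0.8, 0.2, 0.0]   # y data

def sum_of_squares(a, b):
    """SS_ab = Σab − (Σa)(Σb)/n; passing the same list twice gives SS_a."""
    return sum(p * q for p, q in zip(a, b)) - sum(a) * sum(b) / len(a)

ss_xy = sum_of_squares(time_s, position_m)
ss_x = sum_of_squares(time_s, time_s)
ss_y = sum_of_squares(position_m, position_m)

# r² = (SS_xy)² / (SS_x * SS_y)
r_squared = ss_xy ** 2 / (ss_x * ss_y)

explained_pct = r_squared * 100          # variation in y explained by x
unexplained_pct = 100 - explained_pct    # variation left unexplained

print(f"r^2 = {r_squared:.4f}: {explained_pct:.1f}% explained, {unexplained_pct:.1f}% unexplained")
```

The value agrees with the R² reported on Graph 1: Movement of a Car.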

Correlation vs Causation

Correlation refers to one variable changing as another variable changes.

Causation refers to one variable changing because of another variable changing. (Cause & Effect)

Just because there is a correlation between two variables does not mean that one causes the other.

DO NOW / HW Unit 2-1 Check

Have out your homework and do the following: Find the mode, median, mean and standard deviation.

60%

63%

66%

74%

74%

77%

86%

89%

89%

91%

91%

94%

94%

94%

94%

94%

97%

97%

100%

100%

100%

100%

100%

100%

100%

HW Assignment 2-1 Check

10, 12, 14, 18, 36, 37, pg. 449 – 50

10. 8.33, 9, 9

12. 85.625, 85.5, 91

14. 2.77, 2.9, 2.9

18. 14, 4

36. $233,071.43, $142,000, none

37. $645,000, $213,242.66

DO NOW / HW Unit 2-2 Check

Have out your homework and do the following: Make a histogram of the following data in 7 classes.

These were the top 32 quarterback ratings in the NFL in 2012.

108.0

99.1

90.7

87.4

83.3

81.2

77.4

72.6

105.8

98.7

90.5

87.2

82.6

79.8

76.5

72.2

102.4

97.0

88.6

86.2

81.6

79.1

76.1

66.9

100.0

96.3

87.7

85.3

81.3

78.1

74.0

66.7

RANGE: 41.3

CLASSES: 7

BAR WIDTH: 6

BAR STARTING POINT: 66.7

UPPER BAR RANGES: 72.7, 78.7, 84.7, 90.7, 96.7, 102.7, 108.7

INTERVAL: 0.1

BOUNDARY ADJUSTMENT: 0.05

BOUNDARIES STARTING POINT: 66.65

BOUNDARY RANGES: 72.65, 78.65, 84.65, 90.65, 96.65, 102.65, 108.65

Boundaries            # OF QB'S
66.65 - 72.65         4
72.65 - 78.65         5
78.65 - 84.65         7
84.65 - 90.65         7
90.65 - 96.65         2
96.65 - 102.65        5
102.65 - 108.65       2

HW Assignment 2-2

EXPERIMENTAL DESIGN

Standard Deviation

Example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10).

$x$      $x - \bar{x}$          $(x - \bar{x})^2$
1      1 – 6.5 = -5.5       30.25
2      2 – 6.5 = -4.5       20.25
7      7 – 6.5 = 0.5        0.25
9      9 – 6.5 = 2.5        6.25
10     10 – 6.5 = 3.5       12.25
10     10 – 6.5 = 3.5       12.25

Mean = 39/6 = 6.5

$\sum (x - \bar{x})^2$ = 81.5

– $s^2$ = 81.5 / 5 = 16.3

$s$ = 4.04

Standard Deviation

Previous example: Find the standard deviation of the following values: (1, 2, 7, 9, 10, 10) using the alternate method.

$x$      $x^2$
1      1
2      4
7      49
9      81
10     100
10     100

$\sum x$ = 39      $\sum x^2$ = 335

$SS_x$ = 335 – 39² / 6 = 81.5

$s$ = 4.04
