Population Analysis

Revised 2009
Statistics Analyzed on Spread Sheets
SYNOPSIS
Students are introduced to laboratory record keeping and reporting, to determining the standard
deviations associated with random error, and to the use of Excel. Before coming to lab, read the
Needleman article from the New England Journal of Medicine
(http://www.luc.edu/faculty/afitch/Articles/Needleman%20NEJM%201979.pdf) and
“Introduction to Statistics and Sampling”.
INTRODUCTION
In this experiment students will collect data from two or more visibly different populations.
The random spread of the populations as a function of sample size will be followed on a
spreadsheet as a histogram. The resolution between the populations will also be followed. In addition,
students will compute the standard deviation of the populations and follow the magnitude of the
standard deviation as a function of population size. Finally, students will determine by ANOVA
(analysis of variance) whether the two populations are statistically (significantly) different.
Testable SKILL OUTCOMES
1. Review and/or learn basic spreadsheet “tricks” (see end of document): copy, paste,
highlight, keyboard shortcuts.
2. Review and/or learn basic spreadsheet calculations for standard deviation, average,
searching minimum, maximum, histogram and ANOVA functions.
3. Review and/or learn graphing in a spreadsheet.
4. Quality Control: multiple operators and random sampling as sources of error.
5. Learn how to create Gaussian curves and understand how the parameters of a
normal population allow us to maximize the quality of our measurements.
6. Understand qualitatively and quantitatively how to place a numerical value on
certainty.
READING: “Introduction to Statistics and Sampling” posted on the Lead lab web page.
MATERIALS Several sacks of red and white potatoes.
SCENARIO A potato farmer has arrived in town with several sample sacks of sorted potatoes to
show to potential grocers to demonstrate quality. The farmer also had several truckloads of
potatoes to sell; however, the trucks overturned, so all of the non-sample potatoes were jumbled together.
Your job is to use the sample sacks to create a method to re-sort the potatoes. When done, you need
to be able (using an ANOVA) to tell the grocers the probable impurity of the re-sorted potatoes.
The farmer is employing family members to help in the re-sorting process, all of whom are color
blind. Furthermore, the farmer is cheap and will not purchase any equipment beyond strings and
rulers to help in the sorting.
PROCEDURE
a. Break into several groups of 2. Divide each type of potato proportionally between groups.
b. Create a method to analyze the red and white potatoes within the lab without using color.
The method must be numerical.1 The method must include quality control measures that
identify the various sources of error and measure or account for those errors. The errors may
stem from the instruments used or from the operator. The method must also include a
pre-determined procedure for obtaining a representative subset of the population for
measurement.
c. Confer with other groups and discuss your method. Choose a method which all groups will
use that appears to be the most efficient and which has the best quality control.
d. Transcribe your data into an Excel spreadsheet as a series of 2 columns per type of potato.
The first column should be the number of the potato measured, from 1 to N. The second
column should be the measurement you have chosen in your method. This is most easily
accomplished if one student reads out the data to another who is recording the data. To
facilitate graphing and other Excel manipulations leave a blank row at the top and bottom
of your column of data.
e. Do not sort your data. Exchange data with other groups, keeping track of measurements
deriving from other groups.
1. Create a Frequency Plot of your first set of potatoes (your sub-sample of one kind)
ii. Determine the largest and smallest measurements made. At the end of the
measurement column of data for the first type of potato, skip one row, and then type in the
formula:
=min(data set)
To highlight the data set, click on the first measurement, then hold down
CTRL, SHIFT, ↓ (the down arrow) simultaneously. (Note that this command will
highlight all the data until it reaches an empty row, which is why we left a blank row at
the end of the column of data.)
In the next row find the largest measurement for that type of potato by typing in the
formula:
=max(data set)
Highlight the data set by holding down CTRL, SHIFT, ↓.
Copy your formulas left to right across your various data sets. Highlight the two formulas,
then use the keyboard: ALT, E, C. The “Alt” command allows you to access the
command tab at the top of the page. “E” indicates that you wish to open the “edit” tab, and
the “C” indicates that you wish to use the “copy” command. Alternatively, to copy you can place the
cursor on the small box at the lower right of the highlighted cells and drag to the right.
1 Instructor: bring strings, rulers, scissors, and a bucket of water if you wish. Students will normally choose to measure the
circumference. The best result is obtained when using the long axis for the circumference. Students should be asked what
sources of error will occur in the method and how to create a greater throughput. For example, if each student samples then there
will be variability from student to student. A good method is to ask students to measure the same potato 4 times and
choose the student with the smallest standard deviation to make the measurements.
iii. Create “bins”. You are familiar with this process when an instructor shows the
distribution of grades in a class, in which the number of students getting an exam score is
plotted on the y axis vs the grade “bin”. The range of the grade bins is from the minimum
grade to the maximum grade or from 0 to 100%. The bin width for the grades is variable
(90-91% vs 90-95% vs 90-100%).
Your range of bins for your measurements should run from slightly smaller than your
smallest number to slightly larger than your largest number across all classes of potatoes, as
you will be plotting multiple sets of data in a single graph. The bin width will affect the
graph you ultimately get. You want neither too large a bin (all grades between 0-100%)
nor too small a bin (sorting grades by 0.5%).
Create a column for your bins. In the first cell type the value of the lowest bin, then
ENTER. Return to the value and highlight it. Create the bins by typing ALT, E, I, S. This
will open a command box labeled series. Type ALT, C to indicate your bins will be in a
column rather than in a row. Type ALT, S to highlight the step box. This sets the value by
which each bin increases (the bin width). Type ALT, O to highlight the stop box. Enter the maximum
value for the range of bins. ENTER to create the column of bins.
iv. Sort your data into bins. ALT, T, D, D, ENTER, then use the arrow keys to move to “Histogram”, and then
ENTER, which opens a command box “Histogram” in which the cursor is in the Input
Range box. Either type in the range of your data (for example A1:A20) and then type ALT
I to access the bin range, or click on the Excel sheet icon at the right of the command line
and then highlight the data range; click on the Excel icon again to re-enter the command
box. Type ALT I to access the bin command line. Using a similar procedure enter the bin
range. Type ALT O to highlight the command line which specifies where the sorted data is
to be placed. Place the cursor in the command line and then specify the location by typing in a cell
address or by using the icon to move to the cell address. ENTER to get your sorted data.
In order to facilitate graphing you need to create an empty row above and below the sorted
data. Do not include the label row and do not include the “More” row.
v. Create the Frequency Plot. Highlight the bins (x value) and the sorted data (y value).
ALT, I, H to activate the chart commands. ALT, C and the arrow keys to move to XY scatter. ALT, T to
activate the subtypes. Use the arrow keys to select either points, or points with straight lines. (NEVER use the
curved lines, as this implies that you have some knowledge of the mathematics relating the
value of y to x.) ALT, N to enter the commands for the plot.
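For readers who want to check the spreadsheet result independently, the bin creation and sorting steps can be sketched in Python. This is only an illustration: the measurements are made-up numbers, the function names make_bins and frequency are hypothetical, and the counting rule (a value is counted in the first bin whose value is greater than or equal to it, with a final “More” count) is an assumption about how the Excel Histogram tool sorts data.

```python
import math

def make_bins(lowest, step, stop):
    """Return bin values lowest, lowest+step, ... up to and including stop."""
    n = int(math.floor((stop - lowest) / step + 1e-9)) + 1
    return [round(lowest + i * step, 10) for i in range(n)]

def frequency(data, bins):
    """Count measurements per bin.

    Assumed rule: a value is counted in the first bin whose value is
    greater than or equal to the measurement. The extra final entry
    counts values larger than the last bin (the "More" row).
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        for i, upper in enumerate(bins):
            if value <= upper:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts

# Hypothetical circumferences in cm, for illustration only.
red = [18.2, 19.5, 20.1, 20.4, 21.0, 21.3, 22.8, 23.5]

# Bins run from slightly below the minimum to slightly above the maximum.
bins = make_bins(18.0, 1.0, 24.0)
counts = frequency(red, bins)

for b, c in zip(bins, counts):
    print(b, c)
print("More", counts[-1])
```

Changing the step passed to make_bins is a quick way to see how the bin width changes the shape of the frequency plot.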
2. Create a Gaussian Curve associated with your Histogram.
i. Calculate the standard deviation of the population of potatoes. Go to the bottom
of the column of data (below your minimum and maximum calculations) and type
=stdev(cell range)
Again you can highlight the cell range by activating the first or last cell in the
range and then pressing CTRL SHIFT ↓ or ↑.
ii. Name the cell containing the stdev. To name this cell, highlight the cell. You should see
a small box just above the cell ranges and below the command bar that displays the address
of the cell you have highlighted. Click and the address will go gray. Backspace to erase.
Type in a unique name for this cell. Remember you will be making several calculations of
standard deviation for different size populations of the same potato group and for different
types of populations, so you should choose a name like stdevredall for a standard deviation
of all of the red potatoes.
iii. Calculate the average of the population of potatoes. In the next row type
=average(cell range)
Name this cell so that you can refer to it when writing formulas.
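iv. Calculate the predicted frequency or Gaussian. A perfectly random population
should have a frequency plot described by the equation:
$$ y = \frac{A}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] \qquad (1) $$
where A is the peak value (with the normalizing factor shown, the maximum of y is A/(σ√(2π)), so adjust
A until the model matches the height of your histogram), σ is the standard deviation of the population
(its square, σ², is the variance), and μ (or x̄) is the mean or average of the population. You will calculate the
expected frequency (y) associated with each x (the bin).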
Move your cursor to an empty column and to the same row as your first (lowest
numerical value) bin. Type in Equation 1 using Excel language as:
=(peak height/(stdevredall*sqrt(2*pi())))*(exp(-0.5*((bin-averedall)/stdevredall)^2))
In this equation, where it says bin, highlight the “bin” for the row you are in. Stdevredall
refers to the cell in which you have calculated the standard deviation using all of the red
potatoes. Averedall refers to the cell in which you calculated the average using all of the
red potatoes. Peak height refers to your estimate of the height of the experimental
histogram.
(Another way to always refer to a unique cell is to insert $ into the cell address. A $ before
the column ($A1, for example) indicates that Excel should always refer to column A when
copying the formula, yet allows the row to move down as the formula is copied down. A
$ before the row (A$1) indicates that Excel should always refer to row 1 when copying the
formula but allow the column to move as the formula is copied to the right or left. If the
cell is referred to as ($A$1), then when the formula is copied it will always refer to cell A1.)
Copy your formula down so that each “bin” has a projected frequency value.
v. Add the expected value to the histogram plot. Highlight the top or bottom cell of
the calculated, predicted frequencies and simultaneously press CTRL SHIFT ↓
or ↑ to highlight your calculated values. ALT, E, C to copy the cells. Then go to
your plot page. ALT, E, P to paste.
To insert a set of data into your graph that has different X values, use paste special and indicate
that the first column contains the x values.
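The calculation in step 2 can also be sketched outside Excel as a cross-check. The Python below mirrors the spreadsheet formula for Equation 1. The data, the bin values and the peak height of 3.0 are made-up illustrations, and the names gaussian, ave_red_all and stdev_red_all are hypothetical stand-ins for the named cells.

```python
import math
import statistics

def gaussian(x, peak_height, mean, stdev):
    """Equation 1, written the same way as the spreadsheet formula."""
    return (peak_height / (stdev * math.sqrt(2 * math.pi))) * math.exp(
        -0.5 * ((x - mean) / stdev) ** 2
    )

# Hypothetical circumferences in cm, for illustration only.
red = [18.2, 19.5, 20.1, 20.4, 21.0, 21.3, 22.8, 23.5]

ave_red_all = statistics.mean(red)
# statistics.stdev is the sample standard deviation (n - 1 in the
# denominator), which is what Excel's STDEV also computes.
stdev_red_all = statistics.stdev(red)

bins = [18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0]
peak_height = 3.0  # assumed estimate read off the experimental histogram

# One predicted frequency per bin, like copying the formula down the column.
model = [gaussian(b, peak_height, ave_red_all, stdev_red_all) for b in bins]

for b, y in zip(bins, model):
    print(b, round(y, 3))
```

Note that, as the formula is written, the largest value the curve can reach is peak_height divided by stdev*sqrt(2*pi), so the peak height estimate may need adjusting before the model overlays the histogram well.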
3. (optional). Calculate the estimated width of your Gaussian by taking the derivative of the
Gaussian.
i. Insert 2 columns adjacent to your calculated (theoretical) frequency values.
ii. The first column will be the mid bin value. If, for example, the cell containing the
first bin is Q10, in row 10 of your column type
=(Q11+Q10)/2
iii. Copy this formula down to the N-1 bin row.
iv. The second column will contain the derivative of the frequency (y axis). If the
computed frequency begins in cell R10, in row 10 of your second column type
=(R11-R10)/(Q11-Q10)
v. Plot the first column as x and the second column as y in the same graph as your
theoretical frequency. You should see that the derivative reaches its maximum and
minimum at x values one standard deviation below and above the mean.
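A small Python sketch of the same finite-difference calculation may help in checking the derivative columns. The mean of 20.0, standard deviation of 1.5 and bin spacing of 0.05 are arbitrary illustrative values, and the helper name gaussian is hypothetical.

```python
import math

def gaussian(x, a, mean, stdev):
    """Equation 1 with A written as a."""
    return (a / (stdev * math.sqrt(2 * math.pi))) * math.exp(
        -0.5 * ((x - mean) / stdev) ** 2
    )

mean, stdev = 20.0, 1.5  # assumed values, for illustration only

# Finely spaced bins from 14.0 to 26.0.
bins = [14.0 + 0.05 * i for i in range(241)]
y = [gaussian(b, 1.0, mean, stdev) for b in bins]

# First column: mid bin value, like =(Q11+Q10)/2 copied down.
mid = [(bins[i] + bins[i + 1]) / 2 for i in range(len(bins) - 1)]

# Second column: slope between neighbors, like =(R11-R10)/(Q11-Q10).
slope = [(y[i + 1] - y[i]) / (bins[i + 1] - bins[i]) for i in range(len(bins) - 1)]

x_steepest_rise = mid[slope.index(max(slope))]
x_steepest_fall = mid[slope.index(min(slope))]

print("derivative maximum near x =", x_steepest_rise)
print("derivative minimum near x =", x_steepest_fall)
print("half the distance between them:", (x_steepest_fall - x_steepest_rise) / 2)
```

Half the distance between the two extremes of the derivative gives an estimate of the standard deviation, which is the width information this optional step is after.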
4. Repeat 1 & 2
i. Repeat the analysis for an increasing population of measurements by including data for this
kind of potato from other groups. Be sure to get a mean and standard deviation for each
group’s sample of measurements so that you can discuss the effect of multiple operators
in your data analysis.
j. Repeat the analysis for the different kinds of potatoes.
k. Copy (without sorting) all of the data into one large population of measurements and repeat
1 & 2.
5. Difference between Experimental Histogram and Model Gaussian
One way to test if your model Gaussian is “good” is to calculate the
difference between the Gaussian and the experimental histogram at every point and
sum. Squaring each difference is a convenient way to make every term positive, so that
positive and negative differences do not cancel.
i. In a new column in the same row as the first bin number type the Excel formula:
=(histogram value-model value)^2
ii. Copy this down to the end of the bin numbers.
iii. Sum this column of data. This represents the sum of squares.
iv. Repeat for different populations. What happens to the sum of squares for different
size populations?
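The sum of squares in step 5 can be sketched in a few lines of Python. The histogram and model values below are made up for illustration, and the function name sum_of_squares is hypothetical.

```python
def sum_of_squares(histogram, model):
    """Sum of squared differences between observed and modeled frequencies."""
    return sum((h - m) ** 2 for h, m in zip(histogram, model))

# Hypothetical values, one entry per bin, for illustration only.
histogram = [1, 3, 1]
model = [1.5, 2.5, 1.0]

print(sum_of_squares(histogram, model))
```

A smaller sum means the model Gaussian follows the experimental histogram more closely. Because the sum grows with the number of bins and with the counts in each bin, comparisons between populations are most meaningful when the same bins are used.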
6. Calculate the Resolution of your red and white potato populations
Resolution refers to how well separated your histograms are. Resolution is calculated as:
$$ R = \frac{\bar{x}_b - \bar{x}_a}{\dfrac{W_a}{2} + \dfrac{W_b}{2}} $$
where $\bar{x}_i$ refers to the mean of population i and $W_i$ refers to the baseline width of peak i. The
baseline width can be obtained by triangulating the peak.
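As a worked illustration of the resolution formula, here is a minimal Python sketch. The means and baseline widths are invented numbers, the function name resolution is hypothetical, and population b is assumed to be the one with the larger mean so that R comes out positive.

```python
def resolution(mean_a, mean_b, width_a, width_b):
    """Difference of the means divided by the sum of the half baseline widths."""
    return (mean_b - mean_a) / (width_a / 2 + width_b / 2)

# Hypothetical values in cm, for illustration only.
mean_red, width_red = 20.0, 6.0
mean_white, width_white = 26.0, 6.0

r = resolution(mean_red, mean_white, width_red, width_white)
print(r)
```

With these numbers the two triangulated peaks just touch at the baseline. Larger values of R mean better separated populations; smaller values mean the histograms overlap.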
7. Calculate the Analysis of Variance of your red and white potato populations
i. You will need to have the raw data for your red and white potato populations in adjacent
columns. Copy and move your data to some convenient location.
ii. ALT, T, D, D, ENTER. Use the arrow key to move to ANOVA, single factor. ENTER. The
command box for the ANOVA is now displayed. Highlight both columns of data for the input
range.
iii. If you wish to test that you can be 95% certain that there are two different populations, set the
alpha factor to 0.05. If you wish to test that you can be 99% certain that there are two different
populations, set the alpha factor to 0.01.
iv. Activate the output range and set it to some convenient location. ENTER.
v. If the calculated F value is greater than the F critical value, then you can be (1 - alpha) confident
(for example, 95% confident for alpha = 0.05) that you have two different populations of potatoes.
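To see what the single factor ANOVA is computing, the F value can be sketched by hand in Python. This is an illustrative sketch with made-up data and a hypothetical function name, one_way_anova_f. It computes only F and the degrees of freedom; the F critical value still has to come from the Excel output or from an F table.

```python
import statistics

def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a single factor ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means about the grand mean.
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of each measurement about its own group mean.
    ss_within = sum(
        sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups
    )

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical circumferences in cm, for illustration only.
red = [20.0, 21.0, 22.0]
white = [26.0, 27.0, 28.0]

f_value, df_between, df_within = one_way_anova_f(red, white)
print(f_value, df_between, df_within)
```

F is the ratio of the variance between the groups to the variance within the groups, so a large F means the group means differ by much more than the scatter inside each group would explain.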
8. Compute running averages and running standard deviations.2
i. A running value is one that is calculated with an ever increasing population. That is, for
two points compute the average and standard deviation of two points. For three points
compute the average of three points and the standard deviation of those three points.
Place your cursor in an empty column in the row of your first data point. For example, if
your first measurement for red potatoes is in cell A10, place your cursor in some column,
row 10. Type in the following formula:
=average(A10:A$10)
Copy this formula down the column to the last row of data. You should see that the
formula starts by calculating the average of A10:A10; then A11:A10; then A12:A10; and
so on.
ii. Repeat this process for a running standard deviation:
=stdev(A10:A$10)
The first row will show an error, since a standard deviation needs at least two points.
iii. Make a plot of the running average and running standard deviation of your measurement
(for example, circumference) as a function of sample number.
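The running calculation in step 8 can be sketched in Python as follows. The measurements are made up, and the function name running_stats is hypothetical. The standard deviation is left empty for the first point, which corresponds to the error Excel shows for a single value.

```python
import statistics

def running_stats(data):
    """Return running means and running sample standard deviations."""
    means, stdevs = [], []
    for n in range(1, len(data) + 1):
        subset = data[:n]  # the first n measurements, in recorded order
        means.append(statistics.mean(subset))
        # A standard deviation needs at least two points.
        stdevs.append(statistics.stdev(subset) if n > 1 else None)
    return means, stdevs

# Hypothetical circumferences in cm, in the order measured (not sorted).
red = [20.0, 22.0, 21.0, 21.0]

means, stdevs = running_stats(red)
for n, (m, s) in enumerate(zip(means, stdevs), start=1):
    print(n, m, s)
```

The data must stay in the order in which it was measured. Sorting first would make the running values drift steadily instead of showing how the estimates settle as more potatoes are sampled.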
9. Analyze the Needleman Data. You can obtain the Needleman data as an Excel sheet from
the Lead Lab web page.
This data was abstracted from the NIH report ORI 91-27.
Data is reported as a percent of the total, where the total is N. The ORI report investigated several
allegations of scientific misconduct brought against Needleman. One of these charges was that the
selection of children as part of the low lead cohort (<6 ppm tooth lead) and of the high lead cohort
(>24 ppm tooth lead) was misleading as the selection criteria changed. A second charge was that
the data was amended between 1979 and 1982 in such a way as to imply that high tooth lead
affected Verbal IQs in a uniform fashion. The ORI report stated (p. 33): “According to the Hearing
Board, the preponderance of evidence indicated that Dr. Needleman deliberately misrepresented
the subject inclusion/exclusion procedures in his original and subsequent publications. The Board
speculated that this misrepresentation may have been done to make the subject inclusion/exclusion
procedures appear much more rigorous than they were. The Board determined that this violates a
principle of scientific inquiry, namely that procedures be described so that the observations could
be replicated by other investigators (Hearing Board Report, pages 39 and 64).” Later in the ORI
report (p. 75): “As alleged by the complainants, the DRI analysis shows that the graphs presented do
not explicitly deal with the possible effects of covariates such as age. The DRI analysis indicates
that the 1982 Note contains errors, inconsistencies and misleading statements whose combined
impact is to favor a “simple shift” lead effect throughout the entire VIQ range. The correction and
clarification of the points raised above would serve scientific interests as well as, potentially, those
of public policy.”
2 Instructor: At this point it is very important that the sampling has been random, as well as having a random
population of potatoes. If the sampling procedure was random, the running standard deviation should settle toward a
steady value, and the uncertainty in the mean should decrease as the square root of the number of measurements
increases. If the sampling was not random, the plot will not demonstrate this point.
i. Use the data to construct frequency plots of the 1978 comparison between the
verbal IQ of children with high tooth lead and children with low tooth lead.
ii. Use the data to construct frequency plots of the 1982 amended set of data as
compared to the 1978 data for the low tooth lead children.
iii. What has happened to the data between 1979 and 1982?
iv. From the data estimate the Resolution between the verbal IQ of children with high
and low lead. Why might public policy be based on different standards for R than
analytical chemistry? What might be the consequence of failure to act?
REPORT
Your report should have
A. A meaningful or descriptive title. (Neither Red and White Potatoes, nor First Lab are
descriptive titles).
B. A marked section (I) for the introduction or purpose.
C. A marked section (II) for materials and methods.
This section includes reagents used and their manufacturers and dilutions used in
the lab. If there are any changes in what is used or how much of a reagent is used, it
should be noted here. Additionally, all instrumentation, including manufacturer
and model number, any variable used and the settings for the lab should be noted. If
reagents were made by the students, all calculations involving dilutions etc. should
be included.
D. A marked section (III) for results in which graphs and tables are presented. Graphs are
both numbered and given a title. Graphs follow within the report immediately after the first
time they are mentioned, and should be in numerical sequence.
E. A marked section (IV) for summary/discussion.
1. Calculate the resolution between red and white potatoes from your bar graph with
10 samples each and with 30 samples each. (Use the estimated s from the bar
graphs). How does resolution change with total population size?
2. What is a source of indeterminate error in your measurement? How did you
account for these errors?
3. What is a source of determinate error in your measurement? How did you account
for these errors?
4. What are the elements of record keeping during the experiment necessary to allow a
stranger to write a report? (I.e. for a stranger to "audit" you?)
5. What elements of the report are necessary to make it readable?
6. Identify the professional ethics involved in:
a. Recording experimental observations.
b. Reporting on experimental observations.
EXCEL TIP SHEET
Π is written as       pi()
Times                 *
Divide                /
Square root           sqrt(number here)
Standard deviation    stdev(cell range)
Average               average(cell range)
Minimum               min(cell range)
Maximum               max(cell range)
Exp                   exp(number here)
Making a copied equation always refer to a unique cell (for example A10):
$A10 always refers to column A when copying the formula. When copying down
or up, the row will change accordingly.
A$10 always refers to row 10. When copying left or right, the column will change
accordingly.
$A$10 always refers to cell A10, even when copying the formula up, down,
left or right.
ALT will allow you to reach the command line at the top of the sheet from the
keyboard instead of using the mouse. If you get in the habit of doing this
you will save literally hours of time. Follow with the indicated underlined
letter for the tab of the command line.
To highlight a column (or row) of data to copy, alter or paste, the easiest way to
avoid endless mouse scrolling is to make certain that your data is always
bookended by an empty row at the top and bottom (or an empty column at the left and right).
Highlight the cell at the top (or bottom) of the column to be copied, then
simultaneously press CTRL, SHIFT and the arrow key to get all the data
highlighted.
To fill a column with a set of numbers: type the first number you want in the desired cell; Enter.
Then arrow up to highlight the cell. ALT, E, I, S. This will open a command box labeled
series. Type ALT, C to indicate you want to create the number series in a column rather
than in a row. Type ALT, S to highlight the step box. This sets the value by which each number increases.
Type ALT, O to highlight the stop box. Enter the maximum value for the range of numbers.
ENTER to create the column of numbers.