Biostatistics
Introduction to Biostatistics
BIOSTATISTICS:
The application of statistical methods to data derived from the biological sciences, such as medicine.
STATISTICS:
Statistics is the branch of science that deals with the theories and methods of collection, classification, analysis and interpretation of data.
IMPORTANT TERMS
POPULATION:
The term population is used to denote the "units" under study. It includes all persons, events and objects under study.
Ex: If the objective is to assess the quality of tablets of a batch, then all tablets from that
batch form the population.
Population is described in terms of size, structure, time frame, geography and nature.
Homogeneous: When there is practically very little variation in the characteristic of the units
in the population.
Ex: Size of tablets in a bottle
Heterogeneous: When there is wide variation in the characteristic of the units in the
population under study.
Ex: Gender
Finite: When the units in the population are countable, the population is finite.
Infinite: When the units in the population are not easily countable (Ex: the world population) or can be created by infinite permutations and combinations (Ex: throws of dice), the population is said to be infinite.
Time Frame/Geography: When we speak of the population of a city or a country, we must specify the time we are referring to (e.g., the population in 1991 or 2001).
Dynamic: When the units in the population change frequently, thereby affecting the parameter, the population is said to be dynamic.
Ex: Patients in a hospital.
Static: When the population units do not change frequently, the population is said to be
static.
Ex: Doctors in a hospital.
SAMPLE
A sample is a part of the population which represents the entire population.
Instead of studying the entire population, only the sample is studied.
The process of selecting samples is known as sampling.
Types of sampling methods
Random sampling: A random sample is one where each item of the population has an equal chance of being included in the sample. A random sample may be taken from an infinite or a finite population. Random sampling is a scientific method of obtaining a sample from the population. This method is also known as the "unrestricted random sampling" device.
Stratified sampling: If a population is divided into relatively homogeneous groups or strata
and a random sample is drawn from each group or stratum to produce an overall sample, it is
known as “stratified sampling”.
Cluster sampling: It is also known as sampling in stages. In the cluster sampling method, the population is divided into recognizable subgroups called clusters. A random sample of these clusters is then drawn, and all the units belonging to the selected clusters constitute the sample.
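The following is a minimal Python sketch of the three sampling methods just described; the tablet population, the batch strata and the cluster sizes are hypothetical illustrations, not data from the text.

```python
import random

population = [f"tablet-{i}" for i in range(1000)]

# Simple random sampling: every unit has an equal chance of selection.
simple_sample = random.sample(population, k=50)

# Stratified sampling: split into homogeneous strata, sample each stratum.
strata = {"batch-A": population[:500], "batch-B": population[500:]}
stratified_sample = [u for units in strata.values()
                     for u in random.sample(units, k=25)]

# Cluster sampling: randomly pick whole clusters and keep every unit in them.
clusters = [population[i:i + 100] for i in range(0, 1000, 100)]
cluster_sample = [u for c in random.sample(clusters, k=2) for u in c]
```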
FREQUENCY
The number of times a value of the variable occurs is called the frequency.
OBSERVATION
Measurement of an event is called observation.
For Ex: B.P, Temperature of the body, etc.
DATA
Data is a collection of observations expressed in numerical figures
CLASS INTERVAL
Each group into which the raw data are condensed is called a class interval.
Class intervals are of two types: 1. Overlapping 2. Non-overlapping.
CLASS LIMIT
The difference between the upper limit and the lower limit of a class is called the class limit.
Class limit = upper limit - lower limit
CLASS MARK:
The mid-point of a class interval: Class mark = (upper limit + lower limit) / 2.
IMPORTANT SYMBOLS
∑: Summation
E: Expected number
O: Observed number
N or n: Number of observations
P: Probability
f: Frequency
C.F.: Cumulative frequency
x̄: Mean
M: Median
Mo: Mode
Q: Quartile deviation
δ: Mean deviation
σ: Standard deviation
χ²: Chi-square test
't' test: Student's test or 't' ratio
r: Correlation
b: Regression
DATA COLLECTION
1. Measurement: The required information is collected by actual measurement on the object, element or person.
2. Questionnaire: A standardized and pre-tested questionnaire is sent, and the respondents are expected to give the information by answering it.
3. Interview: This method can be used as a supplement to the questionnaire or can be used
independently. Here the information is collected by face to face dialogue with the
respondents.
4. Records: Sometimes the required information is available in existing records, such as census and hospital records.
Variation is of two types:
•Biological Variation
•Sampling Variation
Biological Variation: This term is used for the variation seen in measurements/counts in the same individual, even if the measurement/enumeration method is standardized and even if the person taking the measurement/making the count is the same.
Ex: The blood pressure of an individual can show variation even if it is taken by an identical method, applying identical criteria, and even if it is measured by the same person.
Sampling Variation: The term sampling variation is used for the variation seen in the statistics of two samples from the same population.
Ex: Even if there are 40% girls in a college, two samples of identical size drawn from this population may vary from this parameter and may show a difference between them.
Mistakes & Errors
These are of three types
•Instrumental/Technical Error
•Systematic Error
•Random Error
Instrumental Error/Technical Error: These are introduced as a result of faulty and
unstandardized instruments, improper calibration, etc.
Systematic Error: This is an error introduced due to a peculiar fault in the machine or technique. It gives rise to the same error again and again.
Ex: If the ‘zero’ of the weighing machine is not adjusted properly, it will give
rise to a systematic error
Random Error: It is introduced by changes in the conditions in which the observations are made or the measurements are taken.
Ex: A person may stand in different positions at two different times when his height is being taken. A person may state his age differently when asked on two different occasions. In such cases, even if the instrument/method is good, an error may occur. The error due to this phenomenon will not be constant or systematic.
Mistakes / Errors can be prevented by :
i. Using standard calibrated instruments.
ii. Using standardized pretested questionnaire.
iii. Using trained & skilled persons.
iv. Using multiple observations and averaging them.
v. Using correct recording procedures.
vi. Applying standard and widely accepted statistical manipulations/calculations.
Data Types
Based on Characteristics
Data is of two types based on characteristics
1. Attributes: Attributes are non-measurable characteristics which cannot be numerically expressed in terms of a unit. These are qualitative characteristics. Eg: Sex.
2. Variables: Variables are measurable characteristics which can be numerically expressed in terms of some unit. These are quantities capable of being measured directly by quantitative methods. An individual observation of any variable is known as a variate. If we measure the heights of some individuals of a population, each obtained value is a variate. Examples: height and length in cm, weight in g, Hb in %, etc., of individuals.
Variables are of two types
A) A Discrete Variable is one which cannot take all values; there is a gap between one value and the next. For example, the number of persons in a family and the number of books in a library are discrete variables. One cannot say that there are 3.5 persons in a family or 500.6 books in a library. A discrete variable may take any integer value from 0 upwards.
B) A Continuous Variable is one which can take any value; there is no gap. For example, the weight and height of human beings are continuous variables because they may take any value. The height of patients may be 120 cm, 120.2 cm, 120.5 cm and so on. Measurements of Hb%, etc., are also continuous variables.
In general, a discrete variable takes integer values while a continuous variable can take fractional values.
Based on Source
Data is of two types based on Source
1. Primary Data: Data derived from direct measurement/observation on the population units.
2. Secondary Data: Data not derived from the primary source, but from sources like records.
Based on Fields
In computer database management software, data is arranged in tabular form. The columns
are called fields and the rows are records. Each field can hold record of its type which needs
to be described at the time of creation of data base. The common types of fields are character
type, numeric type, date type and logical type.
1. Character Type Ex: name, Address.
2. Numeric Type Ex: Age, height, Weight, Blood sugar level
3. Date Type Ex: Date of Birth, Date of Admission, Date of Discharge, etc. The date can be expressed in British (dd/mm/yyyy), American (mm/dd/yyyy) or ANSI (yyyy/mm/dd) format.
4. Logical Type: This refers to dichotomous data. Ex: Sex (male/female), result of a drug trial (cured/not cured).
Data Presentation
Data presentation is of three types:
1. Tabular Presentation
• Reference Table / Master Table
• Correlation Table
• Association Table
• Two-by-two Table
• Text Table
2. Diagrammatic Presentation
• Line diagram
• Bar diagram
• Pie diagram
3. Graphical Presentation (important types in relation to frequency distribution)
• Histogram
• Frequency Polygon
• Frequency Curve
• Cumulative Frequency Curve / Ogive
1. Tabular Presentation
Reference Table / Master Table: This table shows all variables that can be cross-classified. It contains all the results of data reduction.
Correlation Table: This shows two quantitative variables cross-classified in many classes. It is used to calculate the correlation coefficient (r).
Ex: Correlation table of weight (kg) against height (cm)

Weight (kg)                 Height (cm)
            150-154.9   155-159.9   160-164.9   165-169.9
40-44.9        50           10          10          10
45-49.9        30           50          20          20
50-54.9        20           30          50          30
55-59.9        10           20          20          50
Association Table: This table shows the association between two qualitative variables. It is also required for calculating the sensitivity and specificity of a screening test.
Two-by-two Table: This table shows the frequency distribution of two variables or two classes.
Ex: Sex distribution of patients in two hospitals.

Hospital   Male   Female   Total
A          600    400      1000
B          500    500      1000
Total      1100   900      2000
Text Table: It is a descriptive table; it does not contain numerical data.
Ex: Some information about drugs

DRUG          MANUFACTURER   LOCATION
Histac        Ranbaxy        Delhi
Anastrozole   AstraZeneca    Bangalore
Imatinib      Novartis       Mumbai
2. Diagrammatic Presentation
Diagrams help biostatisticians visualize the meaning of a numerical complex at a glance.
Line diagram: This is the simplest type of diagram. For a diagrammatic representation of data, the frequencies of a discrete variable can be presented by a line diagram. The data variable is taken on the X-axis and the frequencies of the observations on the Y-axis. Straight lines are drawn whose lengths are proportional to the frequencies.
Ex: The frequency distribution of a discrete variable (rate of reproduction of 50 fishes) is given in the following table.

Rate of reproduction   10   20   30   40   50   60   70   80   90
Frequency               3    4    7    8    9    9    2    6    2

The line diagram of the data presented in the above table is given in the following figure.
Bar diagram: Bar diagrams are one-dimensional diagrams, because only the length of the bar is important, not the width. Rectangular bars of equal width are drawn.
Ex: The frequency distribution of a discrete variable (rate of reproduction of 50 fishes) is given in the following table.
Rate of reproduction   10   20   30   40   50   60   70   80   90
Frequency               3    4    7    8    9    9    2    6    2

The bar diagram of the data presented in the above table is given in the following figure.
Pie diagram:
It is an easy way of presenting discrete data of qualitative characters such as blood groups, Rh factors, age groups, sex groups, etc. The frequencies of the groups are shown in a circle; the degrees of angle denote the frequency and the area of the sector. It presents comparative differences at a glance. The size of each angle is calculated by multiplying the class percentage by 3.6 (i.e., 360/100), or by the formula:

Degrees = (class frequency / total frequency) × 360
Blood Group   No. of Persons (Male / Female / Total)   Percentage   Degrees
A                  427 /  317 /  744                      26.5        95.3
B                  559 /  412 /  971                      34.5       124.4
O                  521 /  367 /  888                      31.6       113.8
AB                 122 /   85 /  207                       7.4        26.5
Total             1629 / 1181 / 2810                     100.0       360.0
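As a quick check of the degree calculation described above, here is a minimal Python sketch using the blood-group totals from the table.

```python
totals = {"A": 744, "B": 971, "O": 888, "AB": 207}
grand_total = sum(totals.values())           # 2810

for group, n in totals.items():
    percentage = 100 * n / grand_total
    degrees = percentage * 3.6               # 360 degrees / 100 percent
    print(f"{group}: {percentage:.1f}% -> {degrees:.1f} degrees")
```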
3. Graphical Presentation
This is the visual presentation of data. The important types of graphs in relation to frequency distribution are described below.
Histogram
This is the method of choice for quantitative continuous data. It is an area diagram consisting of a series of adjacent blocks (rectangles). The entire area covered by the rectangles represents the total frequency, and the area covered by an individual block represents the frequency of the variable represented by that block. The X-axis represents the class intervals and the Y-axis represents the frequency per unit of class interval.
The presentation of a distribution in a histogram is illustrated below.

Distribution of Total Serum Protein Levels (g/100 ml) in 436 Individuals

Total Serum Protein (g/100 ml)   No. of Individuals
4.0                               4
5.0                               12
6.0                               7
6.2                               9
6.3                               34
6.5                               105
7.0                               237
8.0                               27
9.0-10.0                          1
Total                             436
Frequency Polygon
A frequency polygon is a slight variation of the histogram. Instead of rectangles erected over the intervals, points are plotted at the mid-points of the tops of the corresponding rectangles of a histogram, and the successive points are joined by straight lines. A frequency polygon may be chosen to compare two frequency distributions.
Frequency Curve
When the total frequency is large and we adopt much narrower class intervals, the frequency polygon will most often have a much smoother appearance; this is called a frequency curve.
Cumulative Frequency Curve
The figure illustrates a cumulative frequency polygon, which is also known as an ogive.
Frequency Distribution
Frequency Distribution is the summary of the number of times different values of a variable
occur.
For example: Hemoglobin Values of 50 subjects
9.8, 10.5, 8.0, 9.2, 11.8, 13.2, 11.4, 10.1, 7.7, 11.9
14.1, 10.8, 12.1, 9.0, 12.7, 10.9, 8.8, 11.9, 9.6, 13.1
10.0, 14.1, 10.9, 8.6, 9.9, 13.8, 11.7, 9.9, 12.8, 10.0
13.9, 10.2, 11.9, 10.3, 13.3, 10.2, 10.8, 9.6, 10.7, 11.1
10.5, 11.3, 10.7, 11.7, 10.9, 12.0, 10.6, 12.3, 11.2, 11.3
Class Interval   Frequency   Cumulative   % of        Cumulative %
                             Frequency    Frequency   of Frequency
7.5-8.4          2            2            4            4
8.5-9.4          4            6            8           12
9.5-10.4         11          17           22           34
10.5-11.4        15          32           30           64
11.5-12.4        9           41           18           82
12.5-13.4        5           46           10           92
13.5-14.4        4           50            8          100
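A minimal Python sketch of how such a frequency distribution can be tallied from the 50 hemoglobin values above, using the same class intervals:

```python
hb = [9.8, 10.5, 8.0, 9.2, 11.8, 13.2, 11.4, 10.1, 7.7, 11.9,
      14.1, 10.8, 12.1, 9.0, 12.7, 10.9, 8.8, 11.9, 9.6, 13.1,
      10.0, 14.1, 10.9, 8.6, 9.9, 13.8, 11.7, 9.9, 12.8, 10.0,
      13.9, 10.2, 11.9, 10.3, 13.3, 10.2, 10.8, 9.6, 10.7, 11.1,
      10.5, 11.3, 10.7, 11.7, 10.9, 12.0, 10.6, 12.3, 11.2, 11.3]

n = len(hb)
cumulative = 0
for lower in [7.5, 8.5, 9.5, 10.5, 11.5, 12.5, 13.5]:
    f = sum(1 for x in hb if lower <= x < lower + 1.0)  # class width = 1.0
    cumulative += f
    print(f"{lower}-{lower + 0.9}: f={f}, cf={cumulative}, "
          f"%={100 * f / n:.0f}, cum%={100 * cumulative / n:.0f}")
```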
Measure of central tendency
Centering constants are also termed as “measures of central tendency”. A measure of
central tendency is a typical value around which other figures congregate.
MEAN
The mean is obtained by summing the observations and dividing by the total number of observations.
Ungrouped data:

Mean (M or x̄) = ΣX / n

where
X = character observed
Σ = summation of values
n = number of observations
Ex: Serum Albumin Levels (g%) of 24 Pre-School Children (Ungrouped Data)

2.90   3.57   3.73
3.55   3.72   3.88
2.98   3.61   3.75
3.45   3.71   3.84
3.30   3.62   3.76
3.38   3.66   3.76
3.43   3.69   3.77
3.43   3.68   3.76
The total of all these values, ΣX = 85.93
Total number of observations, n = 24
Mean = ΣX / n = 85.93 / 24 = 3.58 g%
Grouped data:

Mean (M or x̄) = Σfx / Σf

where
x = mid-point of the class interval
Σ = summation
f = frequency
Ex: Protein Intake of 400 Families

Protein Intake/Day (g)   No. of Families   Mid-Point of          fx
(Class Interval)         (f)               Class Interval (x)
15-25                     30               20                     600
25-35                     40               30                    1200
35-45                    100               40                    4000
45-55                    110               50                    5500
55-65                     80               60                    4800
65-75                     30               70                    2100
75-85                     10               80                     800
Total                    400                             Σfx = 19000

Mean = Σfx / Σf = 19000 / 400 = 47.5 g
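Both mean calculations can be verified with a short Python sketch using the data above:

```python
# Ungrouped data: sum of the 24 albumin values divided by n.
albumin = [2.90, 3.57, 3.73, 3.55, 3.72, 3.88, 2.98, 3.61, 3.75, 3.45,
           3.71, 3.84, 3.30, 3.62, 3.76, 3.38, 3.66, 3.76, 3.43, 3.69,
           3.77, 3.43, 3.68, 3.76]
ungrouped_mean = sum(albumin) / len(albumin)        # 85.93 / 24 = 3.58

# Grouped data: (mid-point of class interval, frequency) pairs.
classes = [(20, 30), (30, 40), (40, 100), (50, 110),
           (60, 80), (70, 30), (80, 10)]
grouped_mean = (sum(x * f for x, f in classes)      # sum(fx) = 19000
                / sum(f for _, f in classes))       # n = 400 -> 47.5
```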
IMPORTANCE:
The arithmetic mean is affected by all the observations, as each contributes to its calculation. However, the effect of extreme values is greater than that of values near the mean.
MEDIAN
The median is known as a measure of location; that is, it tells where the data are. It is an average which divides the series into two equal halves.
The median is the middle observation when the series is arranged in ascending or descending order. When there is an even number of observations, the arithmetic mean of the middle two observations is taken as the median.
Ungrouped Data
Ex: Series of albumin levels of 24 pre-school children

2.90   3.57   3.73
3.55   3.72   3.88
2.98   3.61   3.75
3.45   3.71   3.84
3.30   3.62   3.76
3.38   3.66   3.76
3.43   3.69   3.77
3.43   3.68   3.76
Arranging all 24 values in ascending order of magnitude, we get the following data:
2.90   2.98   3.30   3.38   3.43   3.43
3.45   3.55   3.57   3.61   3.62   3.66
3.68   3.69   3.71   3.72   3.73   3.75
3.76   3.76   3.76   3.77   3.84   3.88
The 12th value is 3.66 and the 13th is 3.68; the median is the average of these two: (3.66 + 3.68) / 2 = 3.67 g%.
Grouped Data

Median = L + ((n/2 - F) / f) × C

where L = lower limit of the median class
n = total number of observations (total cumulative frequency)
F = cumulative frequency prior to the median class
f = actual frequency of the median class
C = class interval of the median class
Ex: Protein Intake of 400 Families

Protein Intake/Day   No. of Families   Cumulative Frequency
15-25                 30                30
25-35                 40                70
35-45                100               170
45-55                110               280
55-65                 80               360
65-75                 30               390
75-85                 10               400 = n
Total                400
Steps: 1. Find the cumulative frequencies
2. Find out the median class (n/2)
3. Apply the formula
Procedure:
n = 400
Find the median class using n/2 = 400/2 = 200.
The value 200 lies between the cumulative frequencies 170 and 280, so we take the higher cumulative frequency, i.e., 280.
The class interval corresponding to the cumulative frequency 280 is 45-55, so 45-55 is the median class.
45 is the lower limit of the median class (L).
110 is the actual frequency of the median class (f).
170 is the cumulative frequency prior to the median class (F).
10 is the class interval of the median class (C).
Applying the formula:
Median = 45 + ((200 - 170) / 110) × 10 = 45 + 2.73 = 47.73 g
The median is not affected to a great degree by extreme values. However, the median does not use all the information in the data, so it can be shown to be less efficient than the mean, which does use all the values of the data.
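A minimal Python sketch of the grouped-data median formula, applied to the protein-intake table (the helper function grouped_median is an illustrative name, not from the text):

```python
def grouped_median(classes):
    """classes: list of (lower_limit, frequency) with equal class width."""
    n = sum(f for _, f in classes)
    width = classes[1][0] - classes[0][0]
    cum = 0
    for lower, f in classes:
        if cum + f >= n / 2:                 # first class reaching n/2
            return lower + (n / 2 - cum) / f * width
        cum += f

classes = [(15, 30), (25, 40), (35, 100), (45, 110),
           (55, 80), (65, 30), (75, 10)]
print(grouped_median(classes))               # 45 + (200-170)/110*10 = 47.73
```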
MODE
Mode is the most frequently occurring observation.
Ungrouped Data
Ex: Serum albumin levels (g%) of 16 pre-school children

3.57   3.76   3.73   3.55
3.72   3.76   3.55   3.76
3.43   3.55   3.57   3.76
3.72   3.55   3.76   3.57
x   3.43   3.55   3.57   3.72   3.73   3.76
f   1      4      3      2      1      5
Here, the observation 3.76 occurs most frequently, and hence the mode is 3.76.
Also, by the empirical formula: Mode = 3 × Median - 2 × Mean.
Grouped Data

Mode = LM + (Δ1 / (Δ1 + Δ2)) × C

where
LM = lower limit of the modal class
Δ1 = difference between the frequency of the modal class and that of the preceding class (f1 - f0)
Δ2 = difference between the frequency of the modal class and that of the succeeding class (f1 - f2)
C = class interval of the modal class
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
Ex: Classification of the mode of protein intake of 400 families

Protein Intake/Day (g)   No. of Families
(Class Interval)         (f)
15-25                     30
25-35                     40
35-45                    100
45-55                    110
55-65                     80
65-75                     30
75-85                     10
The highest frequency (f1) is 110.
The corresponding class is 45-55; it is the modal class.
Therefore, 45 is the lower limit of the modal class.
f0, the frequency preceding the modal class, = 100
f1, the frequency of the modal class, = 110
f2, the frequency succeeding the modal class, = 80
C, the class interval, = 10
Applying the formula:
Mode = 45 + ((110 - 100) / ((110 - 100) + (110 - 80))) × 10 = 45 + (10/40) × 10 = 47.5 g
The mode is unaffected by extreme values. It is a positional average and can be located easily by inspection.
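A similar sketch for the grouped-data mode (the helper grouped_mode is an illustrative name; it assumes the modal class is neither the first nor the last class):

```python
def grouped_mode(classes):
    """classes: list of (lower_limit, frequency) with equal class width."""
    freqs = [f for _, f in classes]
    i = freqs.index(max(freqs))          # modal class: highest frequency
    f1, f0, f2 = freqs[i], freqs[i - 1], freqs[i + 1]
    width = classes[1][0] - classes[0][0]
    return classes[i][0] + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * width

classes = [(15, 30), (25, 40), (35, 100), (45, 110),
           (55, 80), (65, 30), (75, 10)]
print(grouped_mode(classes))             # 45 + 10/40*10 = 47.5
```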
Measures of Dispersion/Variation
Centering constants are representative values of a series, but they do not by themselves express the spread of the values. Centering constants together with measures of variation help in understanding the data better than centering constants alone.
“The degree to which numerical data tend to spread about an average value is called the
variation or dispersion of the data”
Example: Suppose there are three series of nine items each, as follows:

        Series A   Series B   Series C
        40         36          1
        40         37          9
        40         38         20
        40         39         30
        40         40         40
        40         41         50
        40         42         60
        40         43         70
        40         44         80
Total   360        360        360
Mean    40         40         40
In the 1st series (A), the mean is 40 and the values of all the items are identical. The items are not at all scattered, and the mean fully discloses the characteristics of this distribution.
In the 2nd series (B), though the mean is 40, all the items of the series have different values. But the items are not very scattered, as the minimum value of the series is 36 and the maximum value is 44. In this case also the mean is a good representative of the series. Here the mean cannot replace each item, yet the difference between the mean and the other items is not very significant.
In the 3rd series (C), the mean is 40 and the values of the items are different and widely scattered. The minimum value of the series is 1 and the maximum value is 80. So the average does not satisfactorily represent the individual items in this group.
In order to have a correct analysis of these three series, it is essential to study something more than their averages, because the averages are identical and yet the series differ widely from each other in their formation.
Range
Range is defined as the difference between the largest and the smallest values of the variable in a series. The value of the range depends only upon the two extreme observations; it ignores all the other observations.
Ungrouped Data
Ex: Hemoglobin values (g%) of 26 normal children

11.8   12.9   12.4   13.3   13.8   11.4   12.3   11.7   12.9   12.2
10.4   10.8   12.7   13.2   11.6   12.0   12.2   14.2   10.8   10.5
11.6   13.5   12.2   11.2   12.6   13.0

The lowest value in these observations is 10.4 and the highest value is 14.2. Therefore the range is 14.2 - 10.4 = 3.8 g% (i.e., from 10.4 to 14.2 g%).
Grouped Data
Ex: Protein intake of 400 Families

Protein Intake/Day (g)   No. of Families
15-25                     30
25-35                     40
35-45                    100
45-55                    110
55-65                     80
65-75                     30
75-85                     10
Total                    400

For this frequency distribution table the accurate range cannot be found, but we can approximately give the lowest and the highest values of the class intervals. Thus the range is approximately 15 g to 85 g.
Interquartile Range
The interquartile range of a group of observations is the interval between the values of the upper quartile and the lower quartile for that group. The lower quartile is the value below which 25% of the observations fall. The upper quartile is the value above which 25% of the observations fall.
This measure gives us the range which covers the middle 50% of the observations in the group. Unlike the range, the value given by this measure is unaffected by the occurrence of rare extreme values, and it makes a good measure of dispersion.
Ungrouped data
Eg: To find the interquartile range for the hemoglobin values (g%) of 26 normal children:

11.8   12.9   12.4   13.3   13.8   11.4   12.3   11.7   12.9   12.2
10.4   10.8   12.7   13.2   11.6   12.0   12.2   14.2   10.8   10.5
11.6   13.5   12.2   11.2   12.6   13.0
Arranging these observations in ascending order of magnitude:

10.4   10.5   10.8   10.8   11.2   11.4   11.6   11.6   11.7   11.8
12.0   12.2   12.2   12.2   12.3   12.4   12.6   12.7   12.9   12.9
13.0   13.2   13.3   13.5   13.8   14.2
The lower quartile Q1 is 11.6, i.e., about 25% of the observations fall below the value 11.6. The upper quartile Q3 is 12.9, i.e., nearly 25% of the observations lie above 12.9. Therefore the interquartile range is 11.6-12.9.
The series of observations is divided into two halves and the median is located. If n is an even number, medians for both halves are located, presuming each half to be an independent series. If n is an odd number, the median of the series participates in locating the medians of both the upper and lower halves. The interval between the lower median and the upper median is the interquartile range, and it contains the middle 50% of the observations. It is a better indicator of variation than the range.
Grouped Data: To obtain the quartile deviation from grouped data, one has to obtain Q1 and Q3 first. The formulas for Q1 and Q3 are:

Q1 = L + ((n/4 - F) / f) × C
Q3 = L + ((3n/4 - F) / f) × C

where L = lower limit of the Q1 (or Q3) class
n = total number of observations (total cumulative frequency)
F = cumulative frequency prior to the Q1 (or Q3) class
f = actual frequency of the Q1 (or Q3) class
C = class interval

Q1 class: the class containing the (n/4)th observation.
Q3 class: the class containing the (3n/4)th observation.
Ex: Protein Intake of 400 Families

Protein Intake/Day   No. of Families   Cumulative Frequency
15-25                 30                30
25-35                 40                70
35-45                100               170
45-55                110               280
55-65                 80               360
65-75                 30               390
75-85                 10               400 = n
Total                400
Find the Q1 class using n/4 = 400/4 = 100.
The value 100 lies between the cumulative frequencies 70 and 170, so we take the higher cumulative frequency, i.e., 170. The class interval corresponding to the cumulative frequency 170 is 35-45, so 35-45 is the Q1 class.
35 is the lower limit of the Q1 class (L); 100 is the actual frequency of the Q1 class (f); 70 is the cumulative frequency prior to the Q1 class (F); 10 is the class interval (C).
Q1 = 35 + ((100 - 70) / 100) × 10 = 38

Next, find the Q3 class using 3n/4 = 1200/4 = 300.
The value 300 lies between the cumulative frequencies 280 and 360, so we take the higher cumulative frequency, i.e., 360. The class interval corresponding to the cumulative frequency 360 is 55-65, so 55-65 is the Q3 class.
55 is the lower limit of the Q3 class (L); 80 is the actual frequency of the Q3 class (f); 280 is the cumulative frequency prior to the Q3 class (F); 10 is the class interval (C).
Q3 = 55 + ((300 - 280) / 80) × 10 = 57.5

Quartile deviation = (Q3 - Q1) / 2 = (57.5 - 38) / 2 = 9.75
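A minimal Python sketch generalizing the same interpolation to Q1 and Q3 (the helper grouped_quantile is an illustrative name):

```python
def grouped_quantile(classes, k):
    """k = n/4 for Q1, 3n/4 for Q3; classes = (lower_limit, frequency)."""
    width = classes[1][0] - classes[0][0]
    cum = 0
    for lower, f in classes:
        if cum + f >= k:                     # first class reaching k
            return lower + (k - cum) / f * width
        cum += f

classes = [(15, 30), (25, 40), (35, 100), (45, 110),
           (55, 80), (65, 30), (75, 10)]
n = sum(f for _, f in classes)
q1 = grouped_quantile(classes, n / 4)        # 35 + (100-70)/100*10 = 38.0
q3 = grouped_quantile(classes, 3 * n / 4)    # 55 + (300-280)/80*10 = 57.5
print(q1, q3, (q3 - q1) / 2)                 # quartile deviation = 9.75
```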
Mean Deviation
The mean deviation is the arithmetic mean of the deviations of the observations from the arithmetic mean, ignoring the sign of these deviations. The mean deviation is based on all observations in the group.
Ungrouped Data

Mean deviation = Σ|x - x̄| / n

where |x - x̄| indicates the difference between the value of an observation and the arithmetic mean, ignoring the sign of the difference, and n indicates the total number of observations.
Ex: Hemoglobin values (g%) of 12 subjects

Hb g% Level   Deviation from Mean (without sign)
7.2           2.1
7.6           1.7
7.8           1.5
8.6           0.7
8.9           0.4
9.2           0.1
9.4           0.1
9.8           0.5
10.0          0.7
10.6          1.3
11.2          1.9
11.6          2.3
Mean = 9.3    Total of deviations (without sign) = 13.3

Mean deviation = 13.3 / 12 = 1.1 g%
Grouped Data

Mean deviation = Σf|x - x̄| / Σf

Ex: Protein intake of 400 Families (mean = 47.5)

Protein Intake/Day (g)   No. of         Mid-Point   Deviation of mid-     f|x - mean|
(Class Interval)         Families (f)   (x)         point from mean
15-25                     30            20          27.5                   825
25-35                     40            30          17.5                   700
35-45                    100            40           7.5                   750
45-55                    110            50           2.5                   275
55-65                     80            60          12.5                  1000
65-75                     30            70          22.5                   675
75-85                     10            80          32.5                   325
Total                    400                                              4550

Mean deviation = 4550 / 400 = 11.4 g
Standard Deviation
The standard deviation is the square root of the average of the squared deviations of the observations from the arithmetic mean. The deviation from the mean is considered without its sign in calculating the mean deviation, but in calculating the standard deviation it is squared.
The standard deviation of the population is usually denoted by σ, and that of the sample by S.
Ungrouped Data
If the sample size is more than 30, the formula for ungrouped data is:

σ = √(Σ(x - x̄)² / n)

If the sample size is less than 30, the formula for ungrouped data is:

S = √(Σ(x - x̄)² / (n - 1))
Ex: Hemoglobin values (g%) of 12 subjects (arithmetic mean = 9.3)

Hb g% Level   Deviation from Mean   Square of Deviation
7.2           -2.1                  4.41
7.6           -1.7                  2.89
7.8           -1.5                  2.25
8.6           -0.7                  0.49
8.9           -0.4                  0.16
9.2           -0.1                  0.01
9.4           +0.1                  0.01
9.8           +0.5                  0.25
10.0          +0.7                  0.49
10.6          +1.3                  1.69
11.2          +1.9                  3.61
11.6          +2.3                  5.29
Mean = 9.3                          Σ(x - x̄)² = 21.55

Since n = 12 (less than 30): S = √(21.55 / 11) = √1.96 ≈ 1.4 g%
Grouped Data
If the sample size is more than 30, the formula for grouped data is:

σ = √(Σf(x - x̄)² / n)

If the sample size is less than 30, the formula for grouped data is:

S = √(Σf(x - x̄)² / (n - 1))
Ex: Protein intake of 400 Families (mean = 47.5)

Protein Intake/Day (g)   f     x    Deviation of mid-   Squared     f × Squared
(Class Interval)                    point from mean     Deviation   Deviation
15-25                     30   20   -27.5                756.25     22687.5
25-35                     40   30   -17.5                306.25     12250.0
35-45                    100   40    -7.5                 56.25      5625.0
45-55                    110   50     2.5                  6.25       687.5
55-65                     80   60    12.5                156.25     12500.0
65-75                     30   70    22.5                506.25     15187.5
75-85                     10   80    32.5               1056.25     10562.5
Total                    400                                        79500.0

Since n = 400 (more than 30): σ = √(79500 / 400) = √198.75 ≈ 14.1 g
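Both standard-deviation formulas can be checked with a short Python sketch:

```python
import math

# Ungrouped data, n = 12 (< 30): divide by n - 1.
hb = [7.2, 7.6, 7.8, 8.6, 8.9, 9.2, 9.4, 9.8, 10.0, 10.6, 11.2, 11.6]
mean = sum(hb) / len(hb)
s = math.sqrt(sum((x - mean) ** 2 for x in hb) / (len(hb) - 1))   # ~1.4

# Grouped data, n = 400 (> 30): divide by n.
classes = [(20, 30), (30, 40), (40, 100), (50, 110),
           (60, 80), (70, 30), (80, 10)]      # (mid-point, frequency)
n = sum(f for _, f in classes)
gmean = sum(x * f for x, f in classes) / n    # 47.5
sigma = math.sqrt(sum(f * (x - gmean) ** 2 for x, f in classes) / n)  # ~14.1
```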
Variance
Variance is the square of the standard deviation (σ²). It is expressed in the square of the unit of measurement of the variable.
Coefficient of Variation
It is the ratio of the S.D. to the mean, expressed as a percentage:

C.V. = (S.D. / Mean) × 100

The C.V. is useful in comparing the variation of two characteristics with different units of measurement, like height and weight, Hb% and ESR, etc. The following is an example where the coefficient of variation is used for comparison of variability in different characteristics.
Characteristic   No. Recorded   Arith. Mean   Range         Stan. Dev.   Coeff. of Var.
Height (cm)      33             164.6         142.2-180.3   7.64         4.7%
Weight (kg)      33             43.1          22.0-55.1     6.48         15.0%
Brain (g)        14             1317.0        1100-2335     296.1        22.5%
Heart (g)        33             249.5         110-1000      150.8        60.4%
Liver (g)        33             1205.0        540-2500      376.3        30.2%
Spleen (g)       32             367.2         53-2700       561.4        152.9%
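A minimal sketch of the coefficient of variation applied to the first two rows of the table (the values round to roughly 4.6% and 15.0%):

```python
rows = {"Height (cm)": (164.6, 7.64), "Weight (kg)": (43.1, 6.48)}
for name, (mean, sd) in rows.items():
    print(f"{name}: CV = {sd / mean * 100:.1f}%")   # ~4.6% and 15.0%
```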
NORMAL DISTRIBUTION
The normal distribution is the most commonly observed probability distribution. The mathematicians de Moivre and Laplace used this distribution in the 1700s. In the early 1800s, the German mathematician and physicist Karl Gauss used it to analyze astronomical data, and it consequently became known as the Gaussian distribution in the scientific community.
The sampling distribution formed by actually taking samples from the population is called the observed sampling distribution. In several situations, theoretical sampling distributions are very close approximations of the observed sampling distribution. From the theoretical distribution, the necessary evaluation of a sample can be done using mathematical models.
The normal distribution is a symmetrical distribution and is fundamental to many tests of significance. The two parameters of the normal distribution are the mean and the standard deviation (σ).
The normal distribution curve is a symmetrical, bell-shaped curve. The shape of the normal distribution resembles that of a bell, so it is sometimes referred to as the "bell curve".
Bell Curve Characteristics
The bell curve has the following characteristics:
• Symmetric
• Unimodal
• Extends to +/- infinity
• Two parameters: the mean and the standard deviation
Characteristics of the Normal Curve:
• The highest point in the frequency distribution represents the mean, median and mode.
• In the normal distribution curve, the mean, median and mode are identical.
• The frequency of measurements goes on increasing from one side, reaches a peak and then declines exactly as it mounted.
• It is symmetrical, with a bell-shaped curve.
Relationship between the Normal Curve and Standard Deviation:
• The 1st and 3rd quartiles (the semi-interquartile range) lie at +/- 0.6745 standard deviations, which covers 50% of the area.
• +/- 1 standard deviation covers 68.27% of the area.
• +/- 2 standard deviations cover 95.45% of the area.
• +/- 3 standard deviations cover 99.73% of the area.
• Only 0.27% of the area remains outside +/- 3 standard deviations.
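These areas can be verified with a minimal Python sketch using the standard normal cumulative distribution built from math.erf:

```python
import math

def phi(z):                       # standard normal cumulative distribution
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for z in (0.6745, 1, 2, 3):
    area = phi(z) - phi(-z)       # area within +/- z standard deviations
    print(f"+/-{z} SD covers {100 * area:.2f}% of the area")
# prints approximately 50.00, 68.27, 95.45, 99.73
```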
Probability
Probability is the possibility of occurrence of an event or of a permutation/combination of events.
In the science of genetics, it is uncertain whether an offspring will be male or female, but in the long run it is known approximately what percentage of offspring will be male and what percentage will be female. This long-term regularity provides us with a measure of the amount of chance, which is denoted by probability.
Chance is measured on a probability scale having zero at one end and unity at the other. The zero end represents "absolute impossibility" and the unity end represents "absolute certainty".
In complex situations, the evaluation of probability has to be based on observational or experimental evidence. For example, if we want to know the probability of success of a surgical procedure, a review of past experience of this procedure under similar conditions will provide the data for estimating this probability. The longer the series we have, the closer the estimate will be to the true value.
The probability scale has a range of 0 to 1. P = 0 means that there is absolutely no chance that the observed difference could be due to sampling variation. P = 1 means that the observed difference in the two samples is certainly due to sampling variation.
Eg: In a given case, P lies between 0 and 1. If P = 0.4, the chances that the given difference is due to sampling variation are 4 in 10. The counterpart of this statement is that the chances that the observed difference is not due to sampling variation are 1 - 0.4 = 0.6, i.e., 6 in 10.
Test of Significance
The essence of any test of significance in biostatistics is to find the P value and draw the inference. It is customary to accept the difference as being due to chance (i.e., sampling variation) if P is 0.05 or more; under this condition the observed difference in the samples under study is said to be "statistically not significant". If the P value is less than 0.05, the observed difference is considered not to be due to sampling variation but to some real difference in the samples themselves; under these circumstances the observed difference is said to be "statistically significant".
Null Hypothesis
The null hypothesis is another important concept in statistics. In any test of significance, we start with the hypothesis (assumption) that "the observed difference in the samples under study is due to sampling variation" and proceed to prove/disprove this hypothesis. The essence of any test of significance is to calculate the probability. It is customary to accept the null hypothesis if the probability value is 0.05 or more. With every null hypothesis there is an alternate hypothesis. Usually, the alternate hypothesis is that "the observed difference in the samples is not due to sampling variation, but due to a real difference in the samples". In fact, this itself is the objective of the study. A null hypothesis is thus a hypothesis which is tested for possible rejection under the assumption that it is true.
Hypothesis (Assumption)
• Accept the alternate hypothesis: the observed difference in the samples is not due to sampling variation, but is due to a difference in the samples.
• Accept the null hypothesis: the observed difference in the samples is due to sampling variation.
If alternate hypothesis is accepted, the null hypothesis is automatically rejected.
However, if null hypothesis is accepted, two possibilities exist:
1. That the alternate hypothesis is rejected and
2. That the sample size may be inadequate to detect the difference.
Determination of significance
Based on the data, we have different types of tests to help us determine whether observed differences between samples are actually due to chance or whether they are really significant.
Level of significance
Statistical tests fix the probability at a certain level, called the level of significance. The commonly used levels of significance are 5% (0.05) and 1% (0.01). If we choose the 5% level of significance, it implies that in 5 samples out of 100 we risk drawing a wrong conclusion; in other words, we are 95% confident of our decision. The level of significance desired is always fixed in advance, before applying the test.
Interpretation
• If the calculated value is less than the table value, the null hypothesis is accepted, the alternate hypothesis is rejected, and the difference between the two means is statistically not significant.
• If the calculated value is more than the table value, the alternate hypothesis is accepted, the null hypothesis is rejected, and the difference between the two means is statistically significant.
Normally the 5% level of significance (α = 0.05) is used in testing a hypothesis and taking a decision, unless any other level of significance is specifically stated.
t-Test
Mr. William Gosset (1908) applied a statistical tool called the t-test. The pen name of Mr. Gosset was 'Student', and hence this test is called Student's t-test or the t-ratio, because it is a ratio based on the difference between two means. In the t-test, we make a choice between two alternatives:
i. To accept the null hypothesis (no difference between the two means).
ii. To reject the null hypothesis, i.e., the difference between the means of the two samples is statistically significant.
Determination of significance
The probability of occurrence of any calculated value of 't' is determined by comparing it with the value given in the 't' table corresponding to the combined degrees of freedom, derived from the number of observations in the samples under study.
If the calculated value of 't' exceeds the value given at P = 0.05 (5% level) in the table, it is said to be significant. If the calculated value of 't' is less than the value given in the 't' table, it is not significant.
Degrees of Freedom
The quantity which is one less than the independent number of observations in a sample is called the degrees of freedom. In a paired t-test, df = n - 1. In an unpaired t-test, df = n1 + n2 - 2 (n1 and n2 are the numbers of observations in each of the two series).
T-Test:
The t-test is an estimate of the extent to which values in a small set of data deviate from the mean; it is used to determine the variation within a set of data and to compare two sets of data.
Unpaired t – Test: two groups
Eg: The Hb % of 10 pulmonary TB patients (x1) and 12 comparable controls (x2) is
given below
TB patients: 9.0, 8.6, 7.5, 8.0, 7.3, 8.0, 7.0, 9.0, 8.0, 8.6
(n1= 10, m1 = 8.1)
Controls: 9.5, 9.0, 7.7, 8.8, 8.0, 9.0, 8.1, 9.2, 8.5, 8.6, 9.0, 10.0
(n2=12, m2= 8.78)
Calculation of pooled S.D. (PSD)

x1         x1 - m1   (x1 - m1)²
9.0         0.9      0.81
8.6         0.5      0.25
7.5        -0.6      0.36
8.0        -0.1      0.01
7.3        -0.8      0.64
8.0        -0.1      0.01
7.0        -1.1      1.21
9.0         0.9      0.81
8.0        -0.1      0.01
8.6         0.5      0.25
m1 = 8.1             Σ(x1 - m1)² = 4.36

x2         x2 - m2   (x2 - m2)²
9.5         0.72     0.5184
9.0         0.22     0.0484
7.7        -1.08     1.1664
8.8         0.02     0.0004
8.0        -0.78     0.6084
9.0         0.22     0.0484
8.1        -0.68     0.4624
9.2         0.42     0.1764
8.5        -0.28     0.0784
8.6        -0.18     0.0324
9.0         0.22     0.0484
10.0        1.22     1.4884
m2 = 8.78            Σ(x2 - m2)² = 4.68
Calculation of 't'

PSD = √((Σ(x1 - m1)² + Σ(x2 - m2)²) / (n1 + n2 - 2)) = √((4.36 + 4.68) / 20) = √0.452 = 0.672

SE of the difference between means = PSD × √(1/n1 + 1/n2) = 0.672 × √(1/10 + 1/12) = 0.288

t = (m2 - m1) / SE = (8.78 - 8.1) / 0.288 = 2.36

Calculation of degrees of freedom: the quantity which is one less than the independent number of observations in each sample is called the degrees of freedom.
df = n1 + n2 - 2 = 10 + 12 - 2 = 20
Finding the table value: Here the level of significance is 5%, so the probability is 0.05 (see Statistical Tables); the table 't' value for 20 df at a probability of 0.05 is 2.09.
Interpret
In our example the calculated 't' value (2.36) is more than the table 't' value (2.09). So the null hypothesis is rejected, the alternate hypothesis is accepted, and the difference between the two means is statistically significant. In other words, it indicates that Hb% is affected and is significantly lower in TB patients as compared to controls.
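The same unpaired t-test can be reproduced with a minimal Python sketch; this assumes SciPy is available:

```python
from scipy import stats

tb       = [9.0, 8.6, 7.5, 8.0, 7.3, 8.0, 7.0, 9.0, 8.0, 8.6]
controls = [9.5, 9.0, 7.7, 8.8, 8.0, 9.0, 8.1, 9.2, 8.5, 8.6, 9.0, 10.0]

t, p = stats.ttest_ind(controls, tb)   # pooled-variance (equal-variance) test
print(round(t, 2), round(p, 4))        # t ~= 2.36, p < 0.05 -> significant
```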
Paired t-Test
Eg: The antihypertensive effect of a drug was tested on 15 individuals. The recordings of diastolic blood pressure (mm Hg) are shown in the table.
S.No   Before    After     Diff (d)    d - md   (d - md)²
       Trt (x1)  Trt (x2)  = x1 - x2
1      96        90         6          -7.6      57.76
2      98        92         6          -7.6      57.76
3      110       100       10          -3.6      12.96
4      112       100       12          -1.6       2.56
5      118       98        20           6.4      40.96
6      120       100       20           6.4      40.96
7      140       100       40          26.4     696.96
8      102       90        12          -1.6       2.56
9      98        88        10          -3.6      12.96
10     124       126       -2         -15.6     243.36
11     118       120       -2         -15.6     243.36
12     120       100       20           6.4      40.96
13     122       100       22           8.4      70.56
14     120       98        22           8.4      70.56
15     98        90         8          -5.6      31.36
Total  1696      1492      204                  Σ(d - md)² = 1625.6
Mean                       md = 13.6

(md is the mean of d.)
Calculate the pooled standard deviation (PSD), here the standard deviation of the differences:
PSD = √(Σ(d - md)² / (n - 1)) = √(1625.6 / 14) = 10.77
Calculate the standard error of PSD (SE-PSD):
SE = PSD / √n = 10.77 / √15 = 2.78
Calculate 't':
t = md / SE = 13.6 / 2.78 = 4.89
Calculate the degrees of freedom:
df = n - 1 = 15 - 1 = 14
Here the level of significance is 5%, so the probability is 0.05 (see Statistical Tables); the table 't' value for 14 df at a probability of 0.05 is 2.14.
Interpret
In our example the calculated 't' value (4.89) is more than the table 't' value (2.14). So the null hypothesis is rejected, the alternate hypothesis is accepted, and the difference between the two means is statistically significant. In other words, the difference between the before and after values is considered statistically significant.
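A minimal Python sketch of the paired t-test, again assuming SciPy is available:

```python
from scipy import stats

before = [96, 98, 110, 112, 118, 120, 140, 102, 98, 124, 118, 120, 122, 120, 98]
after  = [90, 92, 100, 100, 98, 100, 100, 90, 88, 126, 120, 100, 100, 98, 90]

t, p = stats.ttest_rel(before, after)
print(round(t, 2), round(p, 4))   # t ~= 4.89, p < 0.05 -> significant
```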
ANOVA
(Analysis of Variance)
ANOVA is used to examine the significance of the difference amongst more than two sample means at the same time, for example, when we want to compare more than two populations, such as the yield of a crop from several varieties of seeds. With the help of this technique, one can draw inferences about whether the samples have been drawn from populations having the same mean.
The test is also called the "F" test, as it was developed by R. A. Fisher in 1920. He developed a systematic procedure for the analysis of variation. It consists of classifying and cross-classifying statistical results and testing whether the means of specific classifications differ significantly.
For example, five fertilizers are applied to five plots of paddy, and we may be interested in finding out whether the effects of these fertilizers on the yields are significantly different. We make use of ANOVA to answer this type of problem. It enables us to analyze the total variation into components which may be attributed to various 'sources' or 'causes'. It can provide us with meaningful comparisons of sample data which are classified according to two or more variables.
Types of ANOVA: The analysis of variance has been classified into:
a) One-Way Classification: Only one factor is considered and its effect on the elementary units is studied, i.e., the data are classified according to only one criterion. Eg: yield of a crop affected by the type of seed only.
b) Two-Way Classification: Two (or more) independent factors have an effect on the response variable of interest. Eg: yield of a crop affected by the type of seed as well as the type of fertilizer.
One-Way Classification
In a one-way classification, the data are classified according to only one criterion. The null hypothesis is that the arithmetic means of the populations from which the k samples are randomly drawn are equal to one another.
Principle: We take two estimates of the population variance, one based on the variance between samples and the other on the variance within samples. These two estimates are then compared with the 'F' test:

F = variance between samples / variance within samples
The value of F is compared with the F-limit for the given degrees of freedom. If the calculated F value exceeds the F-table value, we can say that there is a significant difference between the sample means.
Eg: A certain manure was used on four plots of land, A, B, C and D. Four beds were prepared in each plot and the manure applied. The output of the crop in the beds of plots A, B, C and D is given below.

Land A   Land B   Land C   Land D
6        15       9        8
8        10       3        12
10       4        7        1
8        7        1        3

Using ANOVA, find out whether the difference in the means of the production of the crops of the plots is significant or not.
Step 1: Total sum of all the items of the various samples (here, 4 samples):
T = 32 + 36 + 20 + 24 = 112 (sample totals: A = 32, B = 36, C = 20, D = 24)
Step 2: Correction Factor = T² / N = 112² / 16 = 784, where N is the number of items (16).
Step 3: Total sum of squares (Total SS) = sum of squares of all items - correction factor = 1012 - 784 = 228
Step 4: Sum of squares between samples (SS-between) = (32²/4 + 36²/4 + 20²/4 + 24²/4) - 784 = 824 - 784 = 40
Step 5: Sum of squares within samples (SS-within)
SS-within = Total sum of squares - sum of squares between the samples
= [Total SS] - [SS-between]
= [Value of step 3] - [Value of step 4]
= 228 - 40 = 188
Step 6: Make the ANOVA table

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)
Between Samples        40                   3 (c - 1)                   40 / 3 = 13.33
Within Samples        188                   12 (N - c)                  188 / 12 = 15.67
Total                 228                   15
Step 7: Find out the F value
F = MS-between / MS-within = 13.33 / 15.67 = 0.851
The table value of F for n1 = 3 and n2 = 12 at the 5% level of significance = 3.49.
Inference
The calculated value (0.851) is less than the table value (3.49). Therefore the difference in the means of the production of the crops of the plots is not significant. The null hypothesis is accepted and the alternate hypothesis is rejected.
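The same one-way ANOVA can be reproduced with a minimal Python sketch, assuming SciPy is available:

```python
from scipy import stats

a = [6, 8, 10, 8]
b = [15, 10, 4, 7]
c = [9, 3, 7, 1]
d = [8, 12, 1, 3]

f, p = stats.f_oneway(a, b, c, d)
print(round(f, 3), round(p, 3))   # F ~= 0.851, p > 0.05 -> not significant
```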
Two-Way ANOVA
(Analysis of Variance in Two-Way Classification)
It is used when the data are classified on the basis of two or more factors. In a two-way classification the data are classified according to two different criteria. The F-ratios are:

F (columns) = MSC / MSE
F (rows) = MSR / MSE

where SSC = sum of squares between columns
SSR = sum of squares between rows
SSE = sum of squares due to error (residual)
MSC = mean square between columns = SSC / (c - 1)
MSR = mean square between rows = SSR / (r - 1)
MSE = mean square due to error = SSE / ((c - 1)(r - 1))
SST = total sum of squares

The sum of squares for the source 'Residual' (SSE) is obtained by subtracting the sums of squares between columns (SSC) and between rows (SSR) from the total sum of squares (SST), i.e.,
SSE = SST - (SSC + SSR)
Eg: Set up a two-way ANOVA table for the following per-acre production data for sorghum.

Name of Fertilizer     Variety of Sorghum Seeds
                       Co.1   Co.5   Co.9
Urea                   6      5      5
Ammonium Sulphate      7      5      4
Zinc Sulphate          3      3      3
Potash                 8      7      4
Solution:
Step 1: Calculate the total: T = 6 + 7 + 3 + 8 + 5 + 5 + 3 + 7 + 5 + 4 + 3 + 4 = 60; number of items N = 12.
Step 2: Correction Factor = T² / N = 60² / 12 = 300.
Step 3: Sum of squares of all items = 6² + 7² + 3² + 8² + 5² + 5² + 3² + 7² + 5² + 4² + 3² + 4² = 332.
Step 4: Total sum of squares (SST) = sum of squares of all items - correction factor = 332 - 300 = 32.
Step 5: Sum of squares between varieties of sorghum seeds (between columns), SSC

Name of Fertilizer       Co.1   Co.5   Co.9   Total
Urea (x1)                6      5      5      16 = Σx1
Ammonium Sulphate (x2)   7      5      4      16 = Σx2
Zinc Sulphate (x3)       3      3      3       9 = Σx3
Potash (x4)              8      7      4      19 = Σx4
Total                    24     20     16     60

SSC = (24²/4 + 20²/4 + 16²/4) - correction factor = (144 + 100 + 64) - 300 = 308 - 300 = 8
Step 6: Sum of squares between fertilizers (between rows), SSR
SSR = (16²/3 + 16²/3 + 9²/3 + 19²/3) - correction factor = (85.33 + 85.33 + 27.00 + 120.33) - 300 = 318 - 300 = 18
Step 7: Sum of squares for error (SSE)
SSE = SST - (SSC + SSR)
= total sum of squares - (sum of squares between columns + sum of squares between rows)
= [Value of step 4] - [Value of step 5 + value of step 6]
= 32 - (8 + 18) = 6
Step 8 :
Degrees of freedom
d.f for total variance = (C × r ) – 1 = (3 × 4)- 1 = 11
d.f for variance between columns = (C – 1 ) = 3 – 1 = 2
d.f for variance between rows = (r – 1 ) = 4 – 1 = 3
d.f for residual variance = (C - 1)(r-1) = (3 – 1)(4 – 1) = 6
Step 9: Mean squares
MSC = SSC / (c - 1) = 8 / 2 = 4
MSR = SSR / (r - 1) = 18 / 3 = 6
MSE = SSE / ((c - 1)(r - 1)) = 6 / 6 = 1

Step 10: Set up the two-way ANOVA table

Source of Variation              SS   d.f.   MS   F-ratio
Between columns (seed variety)    8   2      4    MSC/MSE = 4.0
Between rows (fertilizer)        18   3      6    MSR/MSE = 6.0
Residual (error)                  6   6      1
Total                            32   11
Step 11: Inference:
Since the F-ratio concerning the varieties of sorghum seeds (4.0) is less than the table value (5.14), the differences concerning the varieties of sorghum seeds are insignificant at 5%. The null hypothesis is accepted and the alternate hypothesis is rejected.
But the differences concerning fertilizers are significant at 5%, because the calculated F value (6.0) is more than the table value (4.76). The alternate hypothesis is accepted and the null hypothesis is rejected.
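The sums of squares in steps 1-10 can be reproduced with a minimal Python sketch:

```python
data = [[6, 5, 5],    # Urea
        [7, 5, 4],    # Ammonium Sulphate
        [3, 3, 3],    # Zinc Sulphate
        [8, 7, 4]]    # Potash
r, c = len(data), len(data[0])
N = r * c
T = sum(sum(row) for row in data)
cf = T ** 2 / N                                        # 300
sst = sum(x ** 2 for row in data for x in row) - cf    # 32
ssr = sum(sum(row) ** 2 / c for row in data) - cf      # 18
ssc = sum(sum(row[j] for row in data) ** 2 / r for j in range(c)) - cf  # 8
sse = sst - (ssc + ssr)                                # 6
msc, msr, mse = ssc / (c - 1), ssr / (r - 1), sse / ((c - 1) * (r - 1))
print(msc / mse, msr / mse)                            # F = 4.0 and 6.0
```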
Chi-Square Test
The chi-square test (χ²) is a measure to study the difference between actual and expected frequencies. In sampling studies there is rarely a perfect coincidence between expected and observed frequencies; chi-square measures the difference between the expected and observed frequencies. If there is no difference between the actual and expected frequencies, chi-square is zero. Thus, the chi-square test describes the discrepancy between theory and observation.
The chi-square test in its modern form was developed by Karl Pearson in the year 1900. Chi-square is derived from the Greek letter χ (chi) and is pronounced "ki".
Formula for determining χ²:

χ² = Σ (O - E)² / E

O = Observed frequency, E = Expected frequency
From this equation the chi-square value will be zero if O = E in each class, but due to chance error this rarely happens; the observed results are judged against the number of degrees of freedom (d.f.) and the critical level of probability P (0.05).
Degrees of Freedom (d.f.): In the chi-square test, while comparing the calculated value of chi-square with the tabulated value, we have to calculate the degrees of freedom. The degrees of freedom are calculated from the number of classes: the number of degrees of freedom in a chi-square test is equal to the number of classes minus one. In a contingency table (the association of attributes can be shown by a table called a contingency table), the degrees of freedom are calculated as:
d.f. = (r - 1)(c - 1)
r = number of rows in the table
c = number of columns in the table
Eg: The RBC count (lac/mm³) and Hb% (g/100 ml) of 500 persons of a test locality were recorded as follows. Is there any significant relation between RBC count and Hb%? Find out by the chi-square method.

RBC Count       Hb% Above Normal   Hb% Below Normal   Total
Above Normal     85                 75                160
Below Normal    165                175                340
Total           250                250                500
O (Observed   E (Expected              O - E            (O - E)²   (O - E)²/E
Frequency)    Frequency)
85            (250×160)/500 = 80       85 - 80 =  5     25         25/80 = 0.31
165           (250×340)/500 = 170      165 - 170 = -5   25         25/170 = 0.15
75            (250×160)/500 = 80       75 - 80 = -5     25         25/80 = 0.31
175           (250×340)/500 = 170      175 - 170 =  5   25         25/170 = 0.15

Total: χ² = Σ(O - E)²/E = 0.31 + 0.15 + 0.31 + 0.15 ≈ 0.92
Here the degrees of freedom (d.f.) = (r - 1)(c - 1), where r = number of rows and c = number of columns in the table.
d.f. = (2 - 1) × (2 - 1) = 1
Inference: At the 5% significance level (0.05), the table value of χ² at 1 d.f. = 3.84. The calculated value (0.92) is less than the table value (3.84), so the association between RBC count and Hb% is not significant. It indicates that Hb% and RBC count are independent of each other. Here the null hypothesis is accepted and the alternate hypothesis is rejected.
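A minimal Python sketch of this chi-square test, assuming SciPy is available; correction=False reproduces the hand calculation (no Yates continuity correction):

```python
from scipy import stats

observed = [[85, 75],
            [165, 175]]
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof, round(p, 3))   # chi2 ~= 0.92, dof = 1, p > 0.05
```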
Correlation & Regression
Correlation
Correlation (r)
Definition: The statistical tool for measuring the degree of relationship between two variables, i.e., where a change in one variable results in a positive or negative change in the other variable, is known as correlation.
Kinds of correlation: positive, negative and zero correlation.
Positive or direct correlation: When the values of the two variables vary together, the condition is known as positive correlation. For instance, body weight increasing with increasing height is a positive relationship.
Negative or inverse correlation: If one variable increases (or decreases) and the other decreases (or increases), they are said to be negatively correlated. E.g., if temperature is increasing and the lipid content of the body of a sample is decreasing, it is a case of negative correlation.
Zero correlation: If the variation of one variable has no relation with the variation in the other, it is called zero correlation.
Coefficient of correlation: The extent or degree of relationship between the variables is measured in terms of another parameter called the coefficient of correlation. It is denoted by "r", with -1 ≤ r ≤ 1.
Properties of the coefficient of correlation:
(1) It is a measure of the closeness of the relationship between the two variables.
(2) It lies between -1 and +1, i.e., -1 ≤ r ≤ 1.
(3) The correlation is perfect and positive if r = 1, and perfect and negative if r = -1.
(4) If r = 0, there is no correlation between the two variables and they are said to be independent.
Correlation measures the linear association (or relationship) between two quantitative variables. The magnitude of r indicates the degree of correlation: the higher the value of r, the higher the correlation. A positive sign indicates positive correlation (if one variable increases, the other also increases, as with height and weight); a negative sign indicates negative correlation (as one variable increases, the other decreases). An r value of 0.0 indicates no correlation at all, an r value of 1.0 indicates perfect positive correlation, and an r value of -1.0 indicates perfect negative correlation.
For calculating r, we must have n pairs of measurements of x and y. Traditionally, y is the dependent variable and x is the independent variable.
Ex: In measuring the correlation between the height and weight of 10 persons, height will be x and weight will be y. Having obtained n pairs of observations of x and y, the correlation coefficient (r) is calculated by:

r = (nΣXY - ΣXΣY) / √((nΣX² - (ΣX)²)(nΣY² - (ΣY)²))
Eg: Table of Heights & Weights of 10 Individuals

Ht cm (X)   Wt kg (Y)   XY        X²       Y²
150         52          7800      22500    2704
160         58          9280      25600    3364
170         71          12070     28900    5041
175         74          12950     30625    5476
155         58          8990      24025    3364
165         61          10065     27225    3721
172         70          12040     29584    4900
179         75          13425     32041    5625
154         56          8624      23716    3136
163         60          9780      26569    3600
ΣX = 1643   ΣY = 635    ΣXY = 105024   ΣX² = 270785   ΣY² = 40931

X̄ = 1643/10 = 164.3, Ȳ = 635/10 = 63.5
Correlation coefficient 'r':
r = (10 × 105024 - 1643 × 635) / √((10 × 270785 - 1643²)(10 × 40931 - 635²))
= 6935 / √(8401 × 6085) = 6935 / 7150 = 0.97
Inference: There is a strong positive correlation between the height and weight of the persons; the two variables are highly correlated.
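The correlation coefficient can be verified with a minimal Python sketch using the sums above:

```python
import math

ht = [150, 160, 170, 175, 155, 165, 172, 179, 154, 163]
wt = [52, 58, 71, 74, 58, 61, 70, 75, 56, 60]
n = len(ht)

sx, sy = sum(ht), sum(wt)
sxy = sum(x * y for x, y in zip(ht, wt))
sxx = sum(x * x for x in ht)
syy = sum(y * y for y in wt)

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 2))   # 0.97 -> strong positive correlation
```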
Regression Analysis
Regression is used to denote the estimation or prediction of the average value of one variable for a specified value of the other variable, and is denoted by the regression coefficient 'b'. It is also called the "slope": the vertical distance divided by the horizontal distance between any two points on the regression line, which is the rate of change along the regression line. It is calculated by the following equation:

b = (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²)
Eg: The same table of heights and weights of 10 individuals as above applies, with ΣX = 1643, ΣY = 635, ΣXY = 105024, ΣX² = 270785, ΣY² = 40931, X̄ = 164.3 and Ȳ = 63.5.
The regression coefficient is useful to give an estimate of Y for an unknown value of X, using the equation Y = a + bX.

b = (10 × 105024 - 1643 × 635) / (10 × 270785 - 1643²) = 6935 / 8401 = 0.8255
a = Ȳ - bX̄ = 63.5 - 0.8255 × 164.3 = -72.12

Having obtained these values, Y (weight) for any known value of X (height) can be estimated (predicted) with the help of the equation Y = a + bX.
Eg: For height = 175 cm,
Weight = -72.12 + 0.8255 × 175 = 72.34 kg
Regression Line: Calculate Y for two extreme values of X (say X1 and X2) using the equation Y = a + bX; we get estimates of the corresponding values of Y (say Y1 and Y2). In the above example, X1 = 150 cm and X2 = 179 cm; the estimates of Y1 and Y2 are 51.70 kg and 75.64 kg respectively.
We now have two points:
1. X1 = 150 cm, Y1 = 51.70 kg
2. X2 = 179 cm, Y2 = 75.64 kg
Plot these points on a graph and join them with a straight line. What we get is the regression line. We can then obtain estimates of Y for any known value of X with the help of the regression line.
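A minimal Python sketch of the fitted line Y = a + bX and the prediction for 175 cm:

```python
ht = [150, 160, 170, 175, 155, 165, 172, 179, 154, 163]
wt = [52, 58, 71, 74, 58, 61, 70, 75, 56, 60]
n = len(ht)

sx, sy = sum(ht), sum(wt)
sxy = sum(x * y for x, y in zip(ht, wt))
sxx = sum(x * x for x in ht)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope b = 0.8255
a = sy / n - b * sx / n                          # intercept a = -72.12
print(round(a + b * 175, 2))                     # predicted weight ~72.34 kg
```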
Tables
Students' 't' Table
Distribution of 't' values (Student's distribution) - probabilities

df    0.05   0.01   0.001
10    2.23   3.17   4.59
11    2.20   3.10   4.44
12    2.18   3.05   4.32
13    2.16   3.01   4.22
14    2.14   2.98   4.14
15    2.13   2.95   4.07
16    2.12   2.92   4.02
17    2.11   2.90   3.96
18    2.10   2.88   3.92
19    2.09   2.86   3.88
20    2.09   2.84   3.85
21    2.08   2.83   3.82
22    2.07   2.82   3.79
23    2.07   2.81   3.77
24    2.06   2.80   3.74
25    2.06   2.79   3.72
Chi-Square Table
The probabilities of exceeding different chi-square values for degrees of freedom from 1 to 15 when the null hypothesis is true.

df    0.05    0.01    0.001
1     3.84    6.64    10.83
2     5.99    9.21    13.82
3     7.82    11.35   16.27
4     9.49    13.28   18.47
5     11.07   15.09   20.52
6     12.59   16.81   22.46
7     14.07   18.48   24.32
8     15.51   20.09   26.13
9     16.92   21.67   27.88
10    18.31   23.21   29.59
11    19.68   24.73   31.26
12    21.03   26.22   32.91
13    22.36   27.69   34.53
14    23.69   29.14   36.12
15    25.00   30.58   37.70
F Table (5% level of significance)

n2\n1  1          2          3          4          5          6          7          8
1      161.4476   199.5000   215.7073   224.5832   230.1619   233.9860   236.7684   238.8827
2      18.5128    19.0000    19.1643    19.2468    19.2964    19.3295    19.3532    19.3710
3      10.1280    9.5521     9.2766     9.1172     9.0135     8.9406     8.8867     8.8452
4      7.7086     6.9443     6.5914     6.3882     6.2561     6.1631     6.0942     6.0410
5      6.6079     5.7861     5.4095     5.1922     5.0503     4.9503     4.8759     4.8183
6      5.9874     5.1433     4.7571     4.5337     4.3874     4.2839     4.2067     4.1468
7      5.5914     4.7374     4.3468     4.1203     3.9715     3.8660     3.7870     3.7257
8      5.3177     4.4590     4.0662     3.8379     3.6875     3.5806     3.5005     3.4381
9      5.1174     4.2565     3.8625     3.6331     3.4817     3.3738     3.2927     3.2296
10     4.9646     4.1028     3.7083     3.4780     3.3258     3.2172     3.1355     3.0717
11     4.8443     3.9823     3.5874     3.3567     3.2039     3.0946     3.0123     2.9480
12     4.7472     3.8853     3.4903     3.2592     3.1059     2.9961     2.9134     2.8486
13     4.6672     3.8056     3.4105     3.1791     3.0254     2.9153     2.8321     2.7669
14     4.6001     3.7389     3.3439     3.1122     2.9582     2.8477     2.7642     2.6987
15     4.5431     3.6823     3.2874     3.0556     2.9013     2.7905     2.7066     2.6408
16     4.4940     3.6337     3.2389     3.0069     2.8524     2.7413     2.6572     2.5911
17     4.4513     3.5915     3.1968     2.9647     2.8100     2.6987     2.6143     2.5480
18     4.4139     3.5546     3.1599     2.9277     2.7729     2.6613     2.5767     2.5102
19     4.3807     3.5219     3.1274     2.8951     2.7401     2.6283     2.5435     2.4768
20     4.3512     3.4928     3.0984     2.8661     2.7109     2.5990     2.5140     2.4471
21     4.3248     3.4668     3.0725     2.8401     2.6848     2.5727     2.4876     2.4205
22     4.3009     3.4434     3.0491     2.8167     2.6613     2.5491     2.4638     2.3965
23     4.2793     3.4221     3.0280     2.7955     2.6400     2.5277     2.4422     2.3748
24     4.2597     3.4028     3.0088     2.7763     2.6207     2.5082     2.4226     2.3551
25     4.2417     3.3852     2.9912     2.7587     2.6030     2.4904     2.4047     2.3371