Introduction to Biostatistics

BIOSTATISTICS: The application of statistical methods to data derived from the biological sciences, such as medicine.

STATISTICS: The branch of science that deals with the theories and methods of collection, classification, analysis and interpretation of data.

IMPORTANT TERMS

POPULATION: The term population denotes the "units" under study. It includes all persons, events and objects under study. Ex: If the objective is to assess the quality of tablets of a batch, then all tablets from that batch form the population. A population is described in terms of size, structure, time frame, geography and nature.

Homogeneous: There is practically very little variation in the characteristic of the units in the population. Ex: Size of tablets in a bottle.

Heterogeneous: There is wide variation in the characteristic of the units in the population under study. Ex: Gender.

Finite: When the units in the population are countable, the population is finite.

Infinite: When the units in the population are not easily countable (Ex: world population) or can be generated by endless permutations and combinations (Ex: throws of dice), the population is said to be infinite.

Time Frame/Geography: When we speak of the population of a city or a country, we must specify the time we are referring to (i.e., the population of 1991 or of 2001).

Dynamic: When the units in the population change frequently, thereby affecting the parameter, the population is said to be dynamic. Ex: Patients in a hospital.

Static: When the population units do not change frequently, the population is said to be static. Ex: Doctors in a hospital.

SAMPLE

A sample is a part of the population which represents the entire population. Instead of studying the entire population, only the sample is studied. The process of selecting samples is known as sampling.

Types of sampling methods

Random sampling: A random sample is one where each item of the population has an equal chance of being included in the sample. A random sample may be taken from an infinite or a finite population. Random sampling is a scientific method of obtaining a sample from the population; this method is also known as "unrestricted random sampling".

Stratified sampling: If a population is divided into relatively homogeneous groups, or strata, and a random sample is drawn from each stratum to produce the overall sample, the method is known as stratified sampling.

Cluster sampling: Also known as sampling in stages (multistage sampling). The population is divided into recognizable subgroups called clusters, a random sample of these clusters is drawn, and all the units belonging to the selected clusters constitute the sample (a code sketch of the three methods follows).
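The three sampling methods can be illustrated in a few lines of code. This is a minimal sketch, assuming a toy population of tablets grouped into three hypothetical batch sections (A, B, C); the names and sizes are illustrative only.

```python
import random

random.seed(1)  # reproducible illustration

# Hypothetical population: 12 tablets identified by (batch_section, tablet_id)
population = [(section, i) for section in "ABC" for i in range(4)]

# Simple random sampling: every unit has an equal chance of selection
simple = random.sample(population, 4)

# Stratified sampling: draw separately from each homogeneous stratum
strata = {s: [u for u in population if u[0] == s] for s in "ABC"}
stratified = [u for units in strata.values() for u in random.sample(units, 1)]

# Cluster sampling: randomly choose whole clusters, then take every unit in them
clusters = random.sample("ABC", 1)
cluster_sample = [u for u in population if u[0] in clusters]

print(simple, stratified, cluster_sample, sep="\n")
```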
FREQUENCY: The number of times a value of the variable occurs is called the frequency.

OBSERVATION: A measurement of an event is called an observation. Ex: B.P., body temperature, etc.

DATA: Data is a collection of observations expressed in numerical figures.

CLASS INTERVAL: Each group into which the raw data is condensed is called a class interval. Class intervals are of two types: 1. Overlapping 2. Non-overlapping.

CLASS LIMIT: The difference between the upper limit and the lower limit of a class is called the class limit.
Class limit = upper limit - lower limit

CLASS MARK: The mid-point of a class interval.
Class mark = (upper limit + lower limit) / 2

IMPORTANT SYMBOLS
Σ : summation
E : expected number
O : observed number
N or n : number of observations
P : probability
f : frequency
C.F. : cumulative frequency
x̄ : mean
M : median
Mo : mode
Q : quartile deviation
δ : mean deviation
σ : standard deviation
χ² : chi-square test
't' test : Student's test or 't' ratio
r : correlation
b : regression

DATA COLLECTION
1. Measurement: The required information is collected by actual measurement on the object, element or person.
2. Questionnaire: A standardized and pre-tested questionnaire is sent, and the respondents are expected to give the information by answering it.
3. Interview: This method can be used as a supplement to the questionnaire or independently. The information is collected by face-to-face dialogue with the respondents.
4. Records: Sometimes the required information is available in existing records such as the census, hospital records, etc.

Variation is of two types:
• Biological Variation
• Sampling Variation

Biological Variation: This term is used for the variation seen in measurements/counts in the same individual, even when the measurement/enumeration method is standardized and the person taking the measurement/making the count is the same. Ex: The blood pressure of an individual can show variation even if it is taken by an identical method, applying identical criteria, and measured by the same person.

Sampling Variation: This term is used for the variation seen in the statistics of two samples drawn from the same population. Ex: Even if there are 40% girls in a college, two samples of identical size drawn from this population may vary from this parameter and may differ from each other.

Mistakes & Errors
These are of three types:
• Instrumental/Technical Error
• Systematic Error
• Random Error

Instrumental/Technical Error: Introduced as a result of faulty or unstandardized instruments, improper calibration, etc.

Systematic Error: Introduced by a peculiar fault in the machine or technique, which gives rise to the same error again and again. Ex: If the 'zero' of a weighing machine is not adjusted properly, it will give rise to a systematic error.

Random Error: Introduced by changes in the conditions under which observations are made or measurements are taken. Ex: A person may stand in a different position at two different times when his height is being taken, or may state his age differently when asked on two different occasions. In such cases an error may occur even if the instrument/method is good. The error due to this phenomenon is neither constant nor systematic.

Mistakes/errors can be prevented by (see the sketch after this list):
i. Using standard calibrated instruments.
ii. Using a standardized, pre-tested questionnaire.
iii. Using trained and skilled persons.
iv. Taking multiple observations and averaging them.
v. Using correct recording procedures.
vi. Applying standard and widely accepted statistical calculations.
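The value of point (iv), taking multiple observations and averaging them, can be demonstrated with a small simulation. This is an illustrative sketch only; the true height and the error spread are invented for the demonstration.

```python
import random
import statistics

random.seed(7)
TRUE_HEIGHT = 170.0  # hypothetical true value, in cm

def measure():
    # One reading subject to random error (posture, reading angle, ...)
    return TRUE_HEIGHT + random.gauss(0, 1.5)

single = measure()
averaged = statistics.mean(measure() for _ in range(25))

print(f"single reading : {single:.2f} cm (error {single - TRUE_HEIGHT:+.2f})")
print(f"mean of 25     : {averaged:.2f} cm (error {averaged - TRUE_HEIGHT:+.2f})")
```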
Data Types

Based on Characteristics
Data is of two types based on characteristics:

1. Attributes: Attributes are non-measurable characteristics which cannot be numerically expressed in terms of a unit. These are qualitative characters. Ex: Sex.

2. Variables: Variables are measurable characteristics which can be numerically expressed in terms of some unit. These are quantities capable of being measured directly by quantitative methods. An individual observation of any variable is known as a variate: if we measure the heights of some individuals of a population, each obtained value is a variate. Ex: height and length in cm, weight in g, Hb in g%, etc. of individuals.

Variables are of two types:

A) Discrete: A discrete variable cannot take all values; there is a gap between one value and the next. For example, the number of persons in a family or the number of books in a library is a discrete variable: one cannot say that there are 3.5 persons in a family or 500.6 books in a library. A discrete variable may take any integer from 0 to ∞.

B) Continuous: A continuous variable can take any value; there is no gap. For example, the weight and height of a human being are continuous variables because they may take any value: the height of patients may be 120 cm, 120.2 cm, 120.5 cm and so on. Measurements of Hb%, etc. are also continuous variables. In general, a discrete variable takes integer values while a continuous variable can take fractional values.

Based on Source
Data is of two types based on source:
1. Primary Data: Data derived from direct measurement/observation on the population units.
2. Secondary Data: Data not derived from the primary source but from sources such as records.

Based on Fields
In computer database management software, data is arranged in tabular form. The columns are called fields and the rows are records. Each field holds data of its declared type, which must be described at the time of creation of the database. The common types of fields are character, numeric, date and logical.
1. Character Type. Ex: Name, address.
2. Numeric Type. Ex: Age, height, weight, blood sugar level.
3. Date Type. Ex: Date of birth, date of admission, date of discharge, etc. The date can be expressed in British (dd/mm/yyyy), American (mm/dd/yyyy) or ANSI (yyyy/mm/dd) format.
4. Logical Type: This refers to dichotomous data. Ex: Sex (male/female), result of a drug trial (cured/not cured).

Data Presentation
Data presentation is of three types:
1. Tabular Presentation
   - Reference Table / Master Table
   - Correlation Table
   - Association Table
   - Two-by-two Table
   - Text Table
2. Diagrammatic Presentation
   - Line diagram
   - Bar diagram
   - Pie diagram
3. Graphical Presentation (important types in relation to frequency distribution)
   - Histogram
   - Frequency Polygon
   - Frequency Curve
   - Cumulative Frequency Curve / Ogive

1. Tabular Presentation

Reference Table / Master Table: This table shows all variables that can be cross-classified. It contains all the results of data reduction.

Correlation Table: This shows two quantitative variables cross-classified in many classes. It is used to calculate the correlation coefficient (r).
Ex: Weight (kg, rows) cross-classified against height (cm, columns):

Weight (kg)  | 150-154.9  155-159.9  160-164.9  165-169.9  (Height, cm)
40-44.9      |    50          10         10         10
45-49.9      |    30          50         20         20
50-54.9      |    20          30         50         30
55-59.9      |    10          20         20         50

Association Table: This table shows the association between two qualitative variables. It is also required for calculating the sensitivity and specificity of a screening test.

Two-by-two Table: This table shows the frequency distribution of two variables in two classes each. Ex: Sex distribution of patients in two hospitals (a code sketch for building such a table follows).

Hospital   Male   Female   Total
A           600     400     1000
B           500     500     1000
Total      1100     900     2000
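A two-by-two table like the one above can be tabulated directly from raw records. The following is a minimal sketch assuming the pandas library is available; the record list simply re-creates the counts of the hospital example.

```python
import pandas as pd

# Two-by-two table: sex distribution of patients in two hospitals.
# Counts are taken from the table above (600/400 in A, 500/500 in B).
records = pd.DataFrame({
    "hospital": ["A"] * 1000 + ["B"] * 1000,
    "sex": ["Male"] * 600 + ["Female"] * 400
         + ["Male"] * 500 + ["Female"] * 500,
})

table = pd.crosstab(records["hospital"], records["sex"], margins=True)
print(table)
```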
Text Table: This is a descriptive table; it does not contain numerical data.
Ex: Some information about drugs

DRUG          MANUFACTURER   LOCATION
Histac        Ranbaxy        Delhi
Anastrazole   AstraZeneca    Bangalore
Imatinib      Novartis       Mumbai

2. Diagrammatic Presentation
Diagrams help biostatisticians visualize the meaning of a numerical complex at a glance.

Line diagram: This is the simplest type of diagram. The frequencies of a discrete variable can be presented by a line diagram. The variable is taken on the X-axis and the frequencies of the observations on the Y-axis; straight lines are drawn whose lengths are proportional to the frequencies.
Ex: The frequency distribution of a discrete variable (rate of reproduction of 50 fishes):

Rate of reproduction   10   20   30   40   50   60   70   80   90
Frequency               3    4    7    8    9    9    2    6    2

(Figure: line diagram of the data in the above table.)

Bar diagram: Bar diagrams are one-dimensional diagrams because only the length of the bar matters, not the width; rectangular bars of equal width are drawn. For the same frequency distribution as above, each rate of reproduction is represented by a bar whose height is proportional to its frequency.

(Figure: bar diagram of the data in the above table.)

Pie diagram: This is an easy way of presenting discrete data of qualitative characters such as blood groups, Rh factors, age groups, sex groups, etc. The frequencies of the groups are shown in a circle: the angle (and hence the area) of each sector denotes the frequency of a group, so the diagram presents comparative differences at a glance. The size of each angle is calculated by multiplying the class percentage by 3.6 (i.e., 360/100).

Blood Group   Male   Female   Total   Percentage   Degrees
A              427     317     744       26.5        95.4
B              559     412     971       34.5       124.2
O              521     367     888       31.6       113.8
AB             122      85     207        7.4        26.6
Total         1629    1181    2810      100.0       360.0

3. Graphical Presentation
This is visual presentation of data. The important types of graphs relate to frequency distributions.

Histogram
This is the method of choice for quantitative continuous data. It is an area diagram consisting of a series of adjacent blocks (rectangles). The entire area covered by the rectangles represents the total frequency, and the area of an individual block represents the frequency of the class represented by that block. The X-axis represents the class intervals and the Y-axis the frequency per unit of class interval. An example is the following distribution:

Distribution of Total Serum Protein levels (g/100 ml) in 436 individuals

Total Serum Protein (g/100 ml)   No. of individuals
4.0                                    4
5.0                                   12
6.0                                    7
6.2                                    9
6.3                                   34
6.5                                  105
7.0                                  237
8.0                                   27
9.0-10.0                               1
Total                                436

Frequency Polygon
A frequency polygon is a slight variation of the histogram. Instead of erecting rectangles over the intervals, points are plotted at the mid-points of the tops of the corresponding rectangles of the histogram, and the successive points are joined by straight lines. A frequency polygon may be chosen to compare two frequency distributions. A sketch for drawing a bar diagram and a frequency polygon follows.
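The diagrams described above can be drawn with any plotting tool. Below is a minimal sketch using matplotlib (assumed available) for the fish-reproduction data, treating the discrete rates as class marks for the frequency polygon.

```python
import matplotlib.pyplot as plt

# Rate of reproduction of 50 fishes (table above)
rate = [10, 20, 30, 40, 50, 60, 70, 80, 90]
freq = [3, 4, 7, 8, 9, 9, 2, 6, 2]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

# Bar diagram: bars of equal width, height proportional to frequency
ax1.bar(rate, freq, width=5)
ax1.set(title="Bar diagram", xlabel="Rate of reproduction", ylabel="Frequency")

# Frequency polygon: points at the class marks joined by straight lines
ax2.plot(rate, freq, marker="o")
ax2.set(title="Frequency polygon", xlabel="Rate of reproduction", ylabel="Frequency")

plt.tight_layout()
plt.show()
```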
Frequency Curve
When the total frequency is large and much narrower class intervals are adopted, the frequency polygon takes on a much smoother appearance, which is called a frequency curve.

Cumulative Frequency Curve
A cumulative frequency polygon is also known as an ogive.

(Figure: cumulative frequency polygon (ogive).)

Frequency Distribution
A frequency distribution is a summary of the number of times different values of a variable occur.
For example, hemoglobin values (g%) of 50 subjects:

 9.8  10.5   8.0   9.2  11.8  13.2  11.4  10.1   7.7  11.9
14.1  10.8  12.1   9.0  12.7  10.9   8.8  11.9   9.6  13.1
10.0  14.1  10.9   8.6   9.9  13.8  11.7   9.9  12.8  10.0
13.9  10.2  11.9  10.3  13.3  10.2  10.8   9.6  10.7  11.1
10.5  11.3  10.7  11.7  10.9  12.0  10.6  12.3  11.2  11.3

Class Interval   Frequency   Cumulative   Relative          Cumulative
                             frequency    frequency (f/n)   relative frequency
7.5-8.4               2           2           0.04               0.04
8.5-9.4               4           6           0.08               0.12
9.5-10.4             11          17           0.22               0.34
10.5-11.4            15          32           0.30               0.64
11.5-12.4             9          41           0.18               0.82
12.5-13.4             5          46           0.10               0.92
13.5-14.4             4          50           0.08               1.00

Measures of Central Tendency
Centering constants are also termed "measures of central tendency". A measure of central tendency is a typical value around which the other figures congregate.

MEAN
The mean is obtained by summing the observations and dividing by the total number of observations.

Ungrouped data:
x̄ = ΣX / n
where x̄ (or M) = arithmetic mean, X = observed value, Σ = summation, n = number of observations.

Ex: Serum albumin levels (g%) of 24 pre-school children (ungrouped data):

2.90  3.57  3.73  3.55  3.72  3.88
2.98  3.61  3.75  3.45  3.71  3.84
3.30  3.62  3.76  3.38  3.66  3.76
3.43  3.69  3.77  3.43  3.68  3.76

The total of all these values is ΣX = 85.93, and the total number of observations is n = 24, so
x̄ = 85.93 / 24 = 3.58 g%

Grouped data:
x̄ = Σfx / Σf
where x̄ (or M) = arithmetic mean, x = mid-point of the class interval, Σ = summation, f = frequency.

Ex: Protein intake of 400 families

Protein Intake/Day (g)   No. of families   Mid-point of          f × x
(Class Interval)              (f)          class interval (x)
15-25                          30                 20                600
25-35                          40                 30               1200
35-45                         100                 40               4000
45-55                         110                 50               5500
55-65                          80                 60               4800
65-75                          30                 70               2100
75-85                          10                 80                800
Total                         400                                 19000

x̄ = Σfx / Σf = 19000 / 400 = 47.5 g

IMPORTANCE: The arithmetic mean is affected by all the observations, as each contributes to its calculation (a code sketch follows). However, the effect of extreme values is greater than that of values near the mean.
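Both forms of the mean can be checked with a few lines of Python; the sketch below reproduces the two worked examples above.

```python
# Ungrouped mean: serum albumin (g%) of 24 pre-school children
albumin = [2.90, 3.57, 3.73, 3.55, 3.72, 3.88, 2.98, 3.61, 3.75, 3.45,
           3.71, 3.84, 3.30, 3.62, 3.76, 3.38, 3.66, 3.76, 3.43, 3.69,
           3.77, 3.43, 3.68, 3.76]
mean_ungrouped = sum(albumin) / len(albumin)      # 85.93 / 24 = 3.58

# Grouped mean: protein intake of 400 families, x = class mid-point
midpoints = [20, 30, 40, 50, 60, 70, 80]
freqs     = [30, 40, 100, 110, 80, 30, 10]
mean_grouped = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)  # 47.5

print(round(mean_ungrouped, 2), mean_grouped)
```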
MEDIAN
The median is known as a measure of location; it tells where the data are. It is an average which divides the series into two equal halves. The median is the middle observation when the series is arranged in ascending or descending order. When there is an even number of observations, the arithmetic mean of the middle two observations is taken as the median.

Ungrouped Data
Ex: Serum albumin levels (g%) of 24 pre-school children:

2.90  3.57  3.73  3.55  3.72  3.88
2.98  3.61  3.75  3.45  3.71  3.84
3.30  3.62  3.76  3.38  3.66  3.76
3.43  3.69  3.77  3.43  3.68  3.76

Arranging all 24 values in ascending order of magnitude gives:

2.90  2.98  3.30  3.38  3.43  3.43  3.45  3.55
3.57  3.61  3.62  3.66  3.68  3.69  3.71  3.72
3.73  3.75  3.76  3.76  3.76  3.77  3.84  3.88

The 12th value is 3.66 and the 13th is 3.68; the median is the average of these two, (3.66 + 3.68)/2 = 3.67 g%.

Grouped Data

Median = L + ((n/2 - F) / f) × C

where
L = lower limit of the median class
n = total number of observations (final cumulative frequency)
F = cumulative frequency prior to the median class
f = actual frequency of the median class
C = class interval of the median class

Ex: Protein intake of 400 families

Protein Intake/Day (g)   No. of Families   Cumulative Frequency
15-25                          30                  30
25-35                          40                  70
35-45                         100                 170
45-55                         110                 280
55-65                          80                 360
65-75                          30                 390
75-85                          10                 400 = n
Total                         400

Steps:
1. Find the cumulative frequencies.
2. Find the median class (the class containing the n/2-th observation).
3. Apply the formula.

Procedure: n = 400, so n/2 = 400/2 = 200. The value 200 lies between the cumulative frequencies 170 and 280, so we take the higher cumulative frequency, 280. The class interval corresponding to the cumulative frequency 280 is 45-55, so 45-55 is the median class. Then:
L = 45 (lower limit of the median class)
f = 110 (actual frequency of the median class)
F = 170 (cumulative frequency prior to the median class)
C = 10 (class interval of the median class)

Median = 45 + ((200 - 170) / 110) × 10 = 45 + 2.73 = 47.73 g

The MEDIAN is not affected to a great degree by extreme values. However, the median does not use all the information in the data, so it can be shown to be less efficient than the mean, which does use all the values.

MODE
The mode is the most frequently occurring observation.

Ungrouped Data
Ex: Serum albumin levels (g%) of 16 pre-school children:

3.57  3.76  3.73  3.55  3.72  3.76  3.55  3.76
3.43  3.55  3.57  3.76  3.72  3.55  3.76  3.57

x   3.43  3.55  3.57  3.72  3.73  3.76
f    1     4     3     2     1     5

Here the observation 3.76 occurs most often, hence the mode is 3.76. The mode can also be estimated by the empirical formula: Mode ≈ 3 × Median - 2 × Mean.

Grouped Data

Mode = L_M + (Δ1 / (Δ1 + Δ2)) × C

where
L_M = lower limit of the modal class
Δ1 = difference between the frequency of the modal class and the preceding class (f1 - f0)
Δ2 = difference between the frequency of the modal class and the succeeding class (f1 - f2)
C = class interval of the modal class
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class

Ex: Protein intake of 400 families

Protein Intake/Day (g)   No. of Families
(Class Interval)              (f)
15-25                          30
25-35                          40
35-45                         100
45-55                         110
55-65                          80
65-75                          30
75-85                          10

The highest frequency (f1) is 110; the corresponding class, 45-55, is the modal class. Therefore 45 is the lower limit of the modal class, f0 = 100 (frequency preceding the modal class), f1 = 110, f2 = 80 (frequency succeeding the modal class), and C = 10.

Applying the formula (a code sketch follows):
Mode = 45 + ((110 - 100) / ((110 - 100) + (110 - 80))) × 10 = 45 + (10/40) × 10 = 47.5 g

The mode is unaffected by extreme values. It is a positional average and can be located easily by inspection.
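The grouped median and mode formulas above translate directly into code. The helper functions below are illustrative: they assume equal-width classes given by their lower limits, and they reproduce the protein-intake results.

```python
def grouped_median(lowers, freqs, width):
    """Median = L + ((n/2 - F)/f) * C for equal-width classes."""
    n = sum(freqs)
    cum = 0  # cumulative frequency prior to the current class
    for lower, f in zip(lowers, freqs):
        if cum + f >= n / 2:
            return lower + ((n / 2 - cum) / f) * width
        cum += f

def grouped_mode(lowers, freqs, width):
    """Mode = L + (d1/(d1+d2)) * C, with d1 = f1-f0 and d2 = f1-f2."""
    i = freqs.index(max(freqs))                     # modal class
    f0 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1, d2 = freqs[i] - f0, freqs[i] - f2
    return lowers[i] + (d1 / (d1 + d2)) * width

lowers = [15, 25, 35, 45, 55, 65, 75]
freqs  = [30, 40, 100, 110, 80, 30, 10]
print(grouped_median(lowers, freqs, 10))  # ≈ 47.73 g
print(grouped_mode(lowers, freqs, 10))    # 47.5 g
```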
Measures of Dispersion/Variation
Centering constants are representative values of a series, but they do not express the range of normalness. Centering constants together with measures of variation help us understand the data better than the centering constants alone.
"The degree to which numerical data tend to spread about an average value is called the variation or dispersion of the data."

Example: Suppose there are three series of nine items each, as follows:

        Series A   Series B   Series C
           40         36          1
           40         37          9
           40         38         20
           40         39         30
           40         40         40
           40         41         50
           40         42         60
           40         43         70
           40         44         80
Total     360        360        360
Mean       40         40         40

In the first series, A, the mean is 40 and the value of every item is identical. The items are not at all scattered, and the mean fully discloses the characteristics of this distribution.

In the second series, B, though the mean is 40, the items of the series all have different values. But the items are not very scattered, as the minimum value of the series is 36 and the maximum is 44. In this case too the mean is a good representative of the series: it cannot replace each item, yet the difference between the mean and the other items is not very significant.

In the third series, C, the mean is 40 and the values of the items are different and widely scattered: the minimum value of the series is 1 and the maximum is 80. Here the average does not satisfactorily represent the individual items of the group.

For a correct analysis of these three series we must therefore study something more than their averages, because the averages are identical yet the series differ widely from each other in their formation.

Range
The range is defined as the difference between the largest and the smallest value of the variable in a series. Its value depends only upon the two extreme observations and tells nothing about the other observations.

Ungrouped Data
Ex: Hemoglobin values (g%) of 26 normal children:

11.8  12.9  12.4  13.3  13.8  11.4  12.3  11.7  12.9  12.2
10.4  10.8  12.7  13.2  11.6  12.0  12.2  14.2  10.8  10.5
11.6  13.5  12.2  11.2  12.6  13.0

The lowest value in these observations is 10.4 and the highest is 14.2. Therefore the range is 14.2 - 10.4 = 3.8 g% (often reported as the interval 10.4-14.2 g%).

Grouped Data
Ex: Protein intake of 400 families

Protein Intake/Day (g)   No. of Families
15-25                          30
25-35                          40
35-45                         100
45-55                         110
55-65                          80
65-75                          30
75-85                          10
Total                         400

For a frequency distribution table the exact range cannot be found, but we can take the lowest and highest values of the class intervals approximately. Thus the range is 15 g - 85 g.

Interquartile Range
The interquartile range of a group of observations is the interval between the values of the upper quartile and the lower quartile for that group. The lower quartile is the value below which 25% of the observations fall; the upper quartile is the value above which 25% of the observations fall. This measure gives the range which covers the middle 50% of the observations in the group. Unlike the range, the value given by this measure is unaffected by rare extreme values, which makes it a good measure of dispersion.

Ungrouped Data
Ex: Interquartile range for the hemoglobin values (g%) of the 26 normal children above. Arranging the observations in ascending order of magnitude:

10.4  10.5  10.8  10.8  11.2  11.4  11.6  11.6  11.7
11.8  12.0  12.2  12.2  12.2  12.3  12.4  12.6  12.7
12.9  12.9  13.0  13.2  13.3  13.5  13.8  14.2

The lower quartile Q1 is 11.6, i.e., about 25% of the observations fall below 11.6. The upper quartile Q3 is 12.9, i.e., nearly 25% of the observations lie above 12.9. Therefore the interquartile range is 11.6-12.9 g% (a sketch for computing quartiles follows).

To locate the quartiles, the series of observations is divided into two halves and the median is located. If n is an even number, the medians of both halves are located, treating each half as an independent series. If n is an odd number, the median of the whole series participates in locating the medians of both the upper and the lower halves. The interval from the lower median to the upper median is the interquartile range, and it contains the middle 50% of the observations. It is a better indicator of variation than the range.
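The range and quartiles of the ungrouped example can be verified with NumPy (assumed available). Note that np.percentile interpolates between observations; for this data set it happens to agree with the hand method.

```python
import numpy as np

hb = [11.8, 12.9, 12.4, 13.3, 13.8, 11.4, 12.3, 11.7, 12.9, 12.2,
      10.4, 10.8, 12.7, 13.2, 11.6, 12.0, 12.2, 14.2, 10.8, 10.5,
      11.6, 13.5, 12.2, 11.2, 12.6, 13.0]

rng = max(hb) - min(hb)                 # range: 14.2 - 10.4 = 3.8
q1, q3 = np.percentile(hb, [25, 75])    # lower and upper quartiles: 11.6, 12.9
print(rng, q1, q3, q3 - q1)             # interquartile range = q3 - q1
```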
Grouped Data: To obtain the quartile deviation from grouped data one has first to obtain Q1 and Q3:

Q1 = L + ((n/4 - F) / f) × C
Q3 = L + ((3n/4 - F) / f) × C

where, for the class in question (the Q1 class or the Q3 class),
L = lower limit of the class
n = total number of observations (final cumulative frequency)
F = cumulative frequency prior to the class
f = actual frequency of the class
C = class interval
The Q1 class contains the n/4-th observation and the Q3 class the 3n/4-th observation.

Ex: Protein intake of 400 families

Protein Intake/Day (g)   No. of Families   Cumulative Frequency
15-25                          30                  30
25-35                          40                  70
35-45                         100                 170
45-55                         110                 280
55-65                          80                 360
65-75                          30                 390
75-85                          10                 400 = n
Total                         400

Find the Q1 class using n/4 = 400/4 = 100. The value 100 lies between the cumulative frequencies 70 and 170, so we take the higher cumulative frequency, 170. The corresponding class interval is 35-45, so 35-45 is the Q1 class, with L = 35, f = 100, F = 70 and C = 10.
Q1 = 35 + ((100 - 70) / 100) × 10 = 38 g

Find the Q3 class using 3n/4 = 1200/4 = 300. The value 300 lies between the cumulative frequencies 280 and 360, so we take the higher cumulative frequency, 360. The corresponding class interval is 55-65, so 55-65 is the Q3 class, with L = 55, f = 80, F = 280 and C = 10.
Q3 = 55 + ((300 - 280) / 80) × 10 = 57.5 g

The quartile deviation is then Q.D. = (Q3 - Q1) / 2 = (57.5 - 38) / 2 = 9.75 g.

Mean Deviation
The mean deviation is the arithmetic mean of the deviations of the observations from the arithmetic mean, ignoring the sign of these deviations. The mean deviation is based on all the observations in the group.

Ungrouped Data

M.D. = Σ|x - x̄| / n

where |x - x̄| is the difference between the value of an observation and the arithmetic mean ignoring the sign, and n is the total number of observations.

Ex: Hemoglobin values (g%) of 12 subjects

Hb Level (g%)   Deviation from mean (without sign)
7.2                  2.1
7.6                  1.7
7.8                  1.5
8.6                  0.7
8.9                  0.4
9.2                  0.1
9.4                  0.1
9.8                  0.5
10.0                 0.7
10.6                 1.3
11.2                 1.9
11.6                 2.3
Mean = 9.3       Total of the deviations (without sign) = 13.3

M.D. = 13.3 / 12 = 1.11 g%

Grouped Data

M.D. = Σf|x - x̄| / Σf

Ex: Protein intake of 400 families (mean = 47.5 g)

Protein Intake/Day (g)   No. of families   Mid-point   |x - x̄|   f|x - x̄|
(Class Interval)              (f)             (x)
15-25                          30              20         27.5        825
25-35                          40              30         17.5        700
35-45                         100              40          7.5        750
45-55                         110              50          2.5        275
55-65                          80              60         12.5       1000
65-75                          30              70         22.5        675
75-85                          10              80         32.5        325
Total                         400                                    4550

M.D. = 4550 / 400 = 11.375 ≈ 11.4 g (a code sketch follows)
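A short sketch reproducing both mean-deviation calculations above (ungrouped with the exact mean 9.325, grouped with class mid-points):

```python
hb = [7.2, 7.6, 7.8, 8.6, 8.9, 9.2, 9.4, 9.8, 10.0, 10.6, 11.2, 11.6]

mean = sum(hb) / len(hb)                               # 9.325 (9.3 when rounded)
md = sum(abs(x - mean) for x in hb) / len(hb)          # ≈ 1.11

# Grouped version: weight each |x - mean| by its class frequency
mid  = [20, 30, 40, 50, 60, 70, 80]
freq = [30, 40, 100, 110, 80, 30, 10]
gmean = sum(f * x for f, x in zip(freq, mid)) / sum(freq)             # 47.5
gmd = sum(f * abs(x - gmean) for f, x in zip(freq, mid)) / sum(freq)  # 11.375

print(round(md, 2), gmd)
```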
Standard Deviation
The standard deviation is the square root of the average of the squared deviations of the observations from the arithmetic mean. The deviation from the mean is taken without its sign in calculating the mean deviation, but in calculating the standard deviation it is squared. The standard deviation of a population is usually denoted by σ, and that of a sample by S.

Ungrouped Data
If the sample size is more than 30:
S = √( Σ(x - x̄)² / n )
If the sample size is less than 30:
S = √( Σ(x - x̄)² / (n - 1) )

Ex: Hemoglobin values (g%) of 12 subjects

Hb Level (g%)   Deviation from mean   Square of deviation
7.2                  -2.1                  4.41
7.6                  -1.7                  2.89
7.8                  -1.5                  2.25
8.6                  -0.7                  0.49
8.9                  -0.4                  0.16
9.2                  -0.1                  0.01
9.4                  +0.1                  0.01
9.8                  +0.5                  0.25
10.0                 +0.7                  0.49
10.6                 +1.3                  1.69
11.2                 +1.9                  3.61
11.6                 +2.3                  5.29
Mean = 9.3                      Σ(x - x̄)² = 21.55

Since n = 12 (less than 30):
S = √(21.55 / 11) = 1.40 g%

Grouped Data
If the sample size is more than 30:
S = √( Σf(x - x̄)² / Σf )
If the sample size is less than 30:
S = √( Σf(x - x̄)² / (Σf - 1) )

Ex: Protein intake of 400 families (mean = 47.5 g)

Protein Intake/Day (g)  No. of families  Mid-point   Deviation   Squared     f × squared
(Class Interval)             (f)         of C.I.(x)  from mean   deviation   deviation
15-25                         30             20        -27.5       756.25      22687.5
25-35                         40             30        -17.5       306.25      12250.0
35-45                        100             40         -7.5        56.25       5625.0
45-55                        110             50          2.5         6.25        687.5
55-65                         80             60         12.5       156.25      12500.0
65-75                         30             70         22.5       506.25      15187.5
75-85                         10             80         32.5      1056.25      10562.5
Total                        400                                              79500.0

Since Σf = 400 (more than 30):
S = √(79500 / 400) = √198.75 = 14.1 g

Variance
The variance is the square of the standard deviation. Being a square, it is expressed in the square of the unit of measurement rather than in the original unit.

Coefficient of Variation
The coefficient of variation is the ratio of the S.D. to the mean, expressed as a percentage:
C.V. = (S.D. / Mean) × 100
C.V. is useful for comparing variation in two characteristics with different units of measurement, such as height and weight, Hb% and ESR, etc. The following example uses the coefficient of variation to compare variability in different characteristics (a code sketch follows):

Character     No. recorded   Arith. Mean   Range         Stand. Dev.   Coeff. of Var.
Height (cm)       33            164.6      142.2-180.3      7.64            4.7%
Weight (kg)       33             43.1       22.0-55.1       6.48           15.0%
Brain (g)         14           1317.0      1100-2335       296.1           22.5%
Heart (g)         33            249.5       110-1000       150.8           60.4%
Liver (g)         33           1205.0       540-2500       376.3           31.2%
Spleen (g)        32            367.2        53-2700       561.4          152.9%
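The ungrouped standard deviation and coefficient of variation can be checked with Python's statistics module; statistics.stdev divides by n - 1, matching the small-sample formula used above.

```python
import statistics

hb = [7.2, 7.6, 7.8, 8.6, 8.9, 9.2, 9.4, 9.8, 10.0, 10.6, 11.2, 11.6]

mean = statistics.mean(hb)
s = statistics.stdev(hb)        # divides by n-1, appropriate for n < 30
cv = s / mean * 100             # coefficient of variation, in %

print(f"mean={mean:.2f}  S={s:.2f}  CV={cv:.1f}%")  # S ≈ 1.40, CV ≈ 15.0%
```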
NORMAL DISTRIBUTION
The normal distribution is the most commonly observed probability distribution. The mathematicians de Moivre and Laplace used this distribution in the 1700s. In the early 1800s the German mathematician and physicist Karl Gauss used it to analyze astronomical data, and it consequently became known among scientists as the Gaussian distribution.

The sampling distribution formed by actually taking samples from a population is called the observed sampling distribution. In many situations the theoretical sampling distributions are very close approximations of the observed sampling distribution, so the necessary evaluation of a sample can be done from the theoretical distribution using mathematical models.

The normal distribution is a symmetrical distribution and is fundamental to many tests of significance. Its two parameters are the mean and the standard deviation (σ). The normal distribution curve is a symmetrical, bell-shaped curve; because its shape resembles a bell, it is sometimes referred to as the "bell curve".

Bell Curve Characteristics
The bell curve has the following characteristics:
- Symmetric
- Unimodal
- Extends to +/- infinity
- Described by two parameters, the mean and the standard deviation

Characteristics of the Normal Curve:
- The highest point of the frequency distribution represents the mean, median and mode; in the normal distribution curve the mean, median and mode are identical.
- The frequency of measurements increases from one side, reaches a peak, and declines exactly as it mounted.
- The curve is symmetrical and bell shaped.

Relationship between the Normal Curve and the Standard Deviation (see the sketch after this list):
- The 1st and 3rd quartiles lie at ±0.6745 standard deviations from the mean (the semi-interquartile range), covering 50% of the area.
- ±1 standard deviation covers 68.27% of the area.
- ±2 standard deviations cover 95.45% of the area.
- ±3 standard deviations cover 99.73% of the area; only 0.27% of the area remains outside.
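The areas quoted above can be verified from the standard normal distribution using SciPy (assumed available):

```python
from scipy.stats import norm

# Area of the standard normal curve within k standard deviations of the mean
for k in (0.6745, 1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"mean ± {k} SD covers {area * 100:.2f}% of the area")
# 0.6745 -> ~50%, 1 -> 68.27%, 2 -> 95.45%, 3 -> 99.73%
```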
Probability
Probability is the possibility of occurrence of an event or a permutation/combination of events. In the science of genetics it is uncertain whether a given offspring will be male or female, but in the long run it is known approximately what percentage of offspring will be male and what percentage female. This long-term regularity provides a measure of the amount of chance, which is denoted by probability.

Chance is measured on a probability scale having zero at one end and unity at the other. The zero end represents "absolute impossibility" and the unity end "absolute certainty". In complex situations the evaluation of probability has to be based on observational or experimental evidence. For example, if we want to know the probability of success of a surgical procedure, a review of past experience of this procedure under similar conditions provides the data for estimating this probability. The longer the series we have, the closer the estimate will be to the true value.

The probability scale has a range of 0 to 1. P = 0 means that there is absolutely no chance that the observed difference could be due to sampling variation; P = 1 means that the observed difference between the two samples is entirely due to sampling variation. In a given case P lies between 0 and 1. If P = 0.4, the chance that the given difference is due to sampling variation is 4 in 10; the counterpart of this statement is that the chance that the observed difference is not due to sampling variation is 1 - 0.4 = 0.6, i.e., 6 in 10.

Test of Significance
The essence of any test of significance in biostatistics is to find the P value and draw the inference. It is customary to accept that a difference is due to chance (i.e., sampling variation) if P is 0.05 or more; the observed difference between the samples under study is then said to be "statistically not significant". If the P value is less than 0.05, the observed difference is considered to be due not to sampling variation but to some real difference in the samples themselves; the observed difference is then said to be "statistically significant".

Null Hypothesis
The null hypothesis is another important concept in statistics. In any test of significance we start with the hypothesis (assumption) that "the observed difference in the samples under study is due to sampling variation" and proceed to prove or disprove it. The essence of any test of significance is to calculate the probability; it is customary to accept the null hypothesis if the probability value is 0.05 or more. With every null hypothesis there is an alternate hypothesis, usually that "the observed difference in the samples is not due to sampling variation, but due to a real difference between the samples"; this, in fact, is the objective of the study. A null hypothesis is thus a hypothesis that is tested for possible rejection under the assumption that it is true.

Starting from the hypothesis (assumption), the decision is one of two:
- Accept the alternate hypothesis: the observed difference in the samples is not due to sampling variation, but due to a difference between the samples.
- Accept the null hypothesis: the observed difference in the samples is due to sampling variation.

If the alternate hypothesis is accepted, the null hypothesis is automatically rejected. However, if the null hypothesis is accepted, two possibilities exist:
1. The alternate hypothesis is rejected, or
2. The sample size may be inadequate to detect the difference.

Determination of significance
Based on the data, different types of tests help us determine whether observed differences between samples are actually due to chance or are really significant.

Level of significance
Statistical tests fix the probability at a certain level, called the level of significance. The commonly used levels of significance are 5% (0.05) and 1% (0.01). If we choose the 5% level of significance, it implies that in 5 out of 100 such cases we risk declaring a chance difference significant; in other words, we are 95% confident. The desired level of significance is always fixed in advance, before applying the test.

Interpretation
If the calculated value is less than the table value, the null hypothesis is accepted, the alternate hypothesis is rejected, and the difference between the two means is statistically not significant. If the calculated value is more than the table value, the alternate hypothesis is accepted, the null hypothesis is rejected, and the difference between the two means is statistically significant. Normally the 5% level of significance (α = 0.05) is used in testing a hypothesis and taking a decision, unless some other level of significance is specifically stated.

t-Test
Mr. William Gosset (1908) introduced the statistical tool called the t-test. Gosset's pen name was 'Student', hence the test is called Student's t-test, or the t-ratio, because it is the ratio of the difference between two means to its standard error. In the t-test we make a choice between two alternatives:
i. To accept the null hypothesis (no difference between the two means), or
ii. To reject the null hypothesis, i.e., the difference between the means of the two samples is statistically significant.

Determination of significance
The probability of occurrence of any calculated value of 't' is determined by comparing it with the value given in the 't' table for the combined degrees of freedom derived from the number of observations in the samples under study. If the calculated value of 't' exceeds the value given at P = 0.05 (5% level) in the table, it is said to be significant; if it is less than the table value, it is not significant.

Degrees of Freedom
The degrees of freedom are the quantity in the denominator, which is less than the independent number of observations in a sample. In the paired t-test, df = n - 1; in the unpaired t-test, df = n1 + n2 - 2 (where n1 and n2 are the numbers of observations in the two series).

The t-test is an estimate of the extent to which the values in a small set of data deviate from the mean; it is used to determine the variation within a set of data and to compare two sets of data.
Unpaired t-Test: two groups
Ex: The Hb% of 10 pulmonary TB patients (x1) and 12 comparable controls (x2) is given below.
TB patients: 9.0, 8.6, 7.5, 8.0, 7.3, 8.0, 7.0, 9.0, 8.0, 8.6 (n1 = 10, m1 = 8.1)
Controls: 9.5, 9.0, 7.7, 8.8, 8.0, 9.0, 8.1, 9.2, 8.5, 8.6, 9.0, 10.0 (n2 = 12, m2 = 8.78)

Step 1: Calculation of the pooled S.D. (PSD)

x1      x1 - m1   (x1 - m1)²       x2       x2 - m2   (x2 - m2)²
9.0       0.9       0.81           9.5        0.72      0.5184
8.6       0.5       0.25           9.0        0.22      0.0484
7.5      -0.6       0.36           7.7       -1.08      1.1664
8.0      -0.1       0.01           8.8        0.02      0.0004
7.3      -0.8       0.64           8.0       -0.78      0.6084
8.0      -0.1       0.01           9.0        0.22      0.0484
7.0      -1.1       1.21           8.1       -0.68      0.4624
9.0       0.9       0.81           9.2        0.42      0.1764
8.0      -0.1       0.01           8.5       -0.28      0.0784
8.6       0.5       0.25           8.6       -0.18      0.0324
                                   9.0        0.22      0.0484
                                  10.0        1.22      1.4884
m1 = 8.1   Σ(x1 - m1)² = 4.36    m2 = 8.78   Σ(x2 - m2)² = 4.67

PSD = √( (Σ(x1 - m1)² + Σ(x2 - m2)²) / (n1 + n2 - 2) ) = √(9.03 / 20) = 0.67

Step 2: Calculation of the standard error of the difference
SE = PSD × √(1/n1 + 1/n2) = 0.67 × √(1/10 + 1/12) = 0.288

Step 3: Calculation of 't'
t = (m2 - m1) / SE = (8.78 - 8.1) / 0.288 = 2.36

Step 4: Calculation of degrees of freedom
The degrees of freedom are the quantity in the denominator, which is one less than the independent number of observations in each sample:
df = n1 + n2 - 2 = 10 + 12 - 2 = 20

Step 5: Finding the table value
The level of significance is 5%, so the probability is 0.05 (see Statistical Tables). The table 't' value for 20 df at a probability of 0.05 is 2.09.

Interpretation
The calculated 't' value (2.36) is more than the table 't' value (2.09), so the null hypothesis is rejected, the alternate hypothesis is accepted, and the difference between the two means is statistically significant. In other words, Hb% is affected and is significantly lower in TB patients than in controls.

Paired t-Test
Ex: The antihypertensive effect of a drug was tested on 15 individuals. The recordings of diastolic blood pressure (mm Hg) are shown in the table (md is the mean of the differences d).

S.No   Before Trt (x1)   After Trt (x2)   Diff (d) = x1 - x2   d - md   (d - md)²
1            96                90                 6             -7.6      57.76
2            98                92                 6             -7.6      57.76
3           110               100                10             -3.6      12.96
4           112               100                12             -1.6       2.56
5           118                98                20              6.4      40.96
6           120               100                20              6.4      40.96
7           140               100                40             26.4     696.96
8           102                90                12             -1.6       2.56
9            98                88                10             -3.6      12.96
10          124               126                -2            -15.6     243.36
11          118               120                -2            -15.6     243.36
12          120               100                20              6.4      40.96
13          122               100                22              8.4      70.56
14          120                98                22              8.4      70.56
15           98                90                 8             -5.6      31.36
Total      1696              1492               204                      Σ(d - md)² = 1625.6
Mean                                      md = 204/15 = 13.6

Step 1: Calculate the standard deviation of the differences (SD)
SD = √( Σ(d - md)² / (n - 1) ) = √(1625.6 / 14) = 10.77

Step 2: Calculate the standard error of the mean difference
SE = SD / √n = 10.77 / √15 = 2.78

Step 3: Calculate 't'
t = md / SE = 13.6 / 2.78 = 4.89

Step 4: Calculate the degrees of freedom
df = n - 1 = 14

Step 5: Find the table value
The level of significance is 5%, so the probability is 0.05 (see Statistical Tables). The table 't' value for 14 df at a probability of 0.05 is 2.14.

Interpretation
The calculated 't' value (4.89) is more than the table 't' value (2.14), so the null hypothesis is rejected, the alternate hypothesis is accepted, and the difference between the two means is statistically significant. In other words, the difference between the before and after values is considered statistically significant (a code sketch of both t-tests follows).
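Both worked t-tests can be reproduced with SciPy. In the sketch below, ttest_ind performs the pooled-variance unpaired test and ttest_rel the paired test; the sign of t depends on the order of the arguments.

```python
from scipy import stats

# Unpaired t-test: Hb% of TB patients vs controls (data above)
tb = [9.0, 8.6, 7.5, 8.0, 7.3, 8.0, 7.0, 9.0, 8.0, 8.6]
controls = [9.5, 9.0, 7.7, 8.8, 8.0, 9.0, 8.1, 9.2, 8.5, 8.6, 9.0, 10.0]
t_unpaired, p_unpaired = stats.ttest_ind(tb, controls)   # pooled-variance test

# Paired t-test: diastolic BP before vs after treatment (data above)
before = [96, 98, 110, 112, 118, 120, 140, 102, 98, 124, 118, 120, 122, 120, 98]
after  = [90, 92, 100, 100, 98, 100, 100, 90, 88, 126, 120, 100, 100, 98, 90]
t_paired, p_paired = stats.ttest_rel(before, after)

print(f"unpaired: t={t_unpaired:.2f}, p={p_unpaired:.3f}")  # |t| ≈ 2.36, p < 0.05
print(f"paired  : t={t_paired:.2f}, p={p_paired:.4f}")      # t ≈ 4.89, p < 0.05
```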
ANOVA (Analysis of Variance)
ANOVA is used to examine the significance of the difference among more than two sample means at the same time, for example when we want to compare more than two populations, such as the yield of a crop from several varieties of seeds. With this technique one can draw inferences about whether the samples have been drawn from populations having the same mean. The test is also called the "F" test, after R. A. Fisher, who developed a systematic procedure for the analysis of variation in 1920. It consists of classifying and cross-classifying statistical results and testing whether the means of specified classifications differ significantly. For example, if five fertilizers are applied to five plots of paddy, we may be interested in finding out whether the effects of these fertilizers on the yields differ significantly; ANOVA answers this type of problem. It enables us to analyze the total variation into components which may be attributed to various 'sources' or 'causes', and it provides meaningful comparisons of sample data classified according to two or more variables.

Types of ANOVA
The analysis of variance is classified into:
a) One-Way Classification: Only one factor is considered and its effect on the elementary units is studied, i.e., the data are classified according to only one criterion. Ex: Yield of a crop affected by type of seed only.
b) Two-Way Classification: Two (or more) independent factors have an effect on the response variable of interest. Ex: Yield of a crop affected by type of seed as well as type of fertilizer.

One-Way Classification
In one-way classification the data are classified according to only one criterion. The null hypothesis is that the arithmetic means of the populations from which the K samples are randomly drawn are equal to one another.

Principle: We take two estimates of the population variance, one based on the variance between samples and the other on the variance within samples. These two estimates are then compared with the 'F' test:

F = Mean square between samples / Mean square within samples

The value of F is compared with the F-limit for the given degrees of freedom. If the calculated F value exceeds the F-table value, we can say that there are significant differences between the sample means.

Ex: A certain manure was used on four plots of land, A, B, C and D. Four beds were prepared in each plot and the manure applied. The output of the crop in the beds of plots A, B, C and D is given below. Using ANOVA, find out whether the difference in the means of the production of the plots is significant or not (a code sketch follows this worked example).

Land A   Land B   Land C   Land D
  6        15       9        8
  8        10       3       12
 10         4       7        1
  8         7       1        3

Step 1: Total sum of all the items of the various samples (here 4 samples):
T = (6+8+10+8) + (15+10+4+7) + (9+3+7+1) + (8+12+1+3) = 32 + 36 + 20 + 24 = 112

Step 2: Correction factor = T² / N, where N is the number of items:
CF = 112² / 16 = 784

Step 3: Total sum of squares (total SS):
Total SS = (sum of squares of all items) - CF = 1012 - 784 = 228

Step 4: Sum of squares between samples (SS-between):
SS-between = (32²/4 + 36²/4 + 20²/4 + 24²/4) - CF = 824 - 784 = 40

Step 5: Sum of squares within samples (SS-within):
SS-within = Total SS - SS-between = [value of step 3] - [value of step 4] = 228 - 40 = 188

Step 6: Make the ANOVA table

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)
Between samples               40                   3  (c - 1)             40/3 = 13.33
Within samples               188                  12  (N - c)            188/12 = 15.67
Total                        228                  15

Step 7: Find the F value
F = 13.33 / 15.67 = 0.851
The table value of F for n1 = 3 and n2 = 12 at the 5% level of significance is 3.49.

Inference
The calculated value (0.851) is less than the table value (3.49). Therefore the difference in the means of the production of the plots is not significant: the null hypothesis is accepted and the alternate hypothesis rejected.
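The same one-way analysis can be reproduced with SciPy's f_oneway (assumed available):

```python
from scipy import stats

# Crop output in the four beds of each plot (data above)
A = [6, 8, 10, 8]
B = [15, 10, 4, 7]
C = [9, 3, 7, 1]
D = [8, 12, 1, 3]

f, p = stats.f_oneway(A, B, C, D)
print(f"F = {f:.3f}, p = {p:.3f}")   # F ≈ 0.851; p > 0.05, so not significant
```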
Two-Way ANOVA (Analysis of Variance in Two-Way Classification)
Two-way ANOVA is used when the data are classified on the basis of two or more factors; in a two-way classification the data are classified according to two different criteria. The quantities involved are:

SSC - sum of squares between columns
SSR - sum of squares between rows
SSE - sum of squares due to error (residual)
MSC - mean square between columns
MSR - mean square between rows
MSE - mean square due to error
SST - total sum of squares

The sum of squares for the source 'residual' (SSE) is obtained by subtracting the sum of squares between columns (SSC) and between rows (SSR) from the total sum of squares (SST):
SSE = SST - (SSC + SSR)

Ex: Set up a two-way ANOVA table for the following per-acre production data for sorghum (a code sketch follows this worked example).

Name of Fertilizer    Variety of Sorghum Seeds
                      Co.1   Co.5   Co.9
Urea                    6      5      5
Ammonium Sulphate       7      5      4
Zinc Sulphate           3      3      3
Potash                  8      7      4

Solution:
Step 1: Calculate the total:
T = 6 + 7 + 3 + 8 + 5 + 5 + 3 + 7 + 5 + 4 + 3 + 4 = 60; number of items N = 12

Step 2: Correction factor = T² / N = 60² / 12 = 300

Step 3: Sum of the squares of all items:
6² + 7² + 3² + 8² + 5² + 5² + 3² + 7² + 5² + 4² + 3² + 4² = 332

Step 4: Total sum of squares (SST):
SST = (sum of squares of all items) - correction factor = 332 - 300 = 32

Step 5: Sum of squares between varieties of sorghum seeds (columns). First find the row and column totals:

Name of Fertilizer        Co.1       Co.5       Co.9      Total
Urea (x1)                   6          5          5       16 = Σx1
Ammonium Sulphate (x2)      7          5          4       16 = Σx2
Zinc Sulphate (x3)          3          3          3        9 = Σx3
Potash (x4)                 8          7          4       19 = Σx4
Total                 ΣCo.1 = 24  ΣCo.5 = 20  ΣCo.9 = 16     60

SSC = (24² + 20² + 16²)/4 - 300 = 308 - 300 = 8

Step 6: Sum of squares between fertilizers (rows):
SSR = (16² + 16² + 9² + 19²)/3 - 300 = 318 - 300 = 18

Step 7: Sum of squares for error (SSE):
SSE = SST - (SSC + SSR) = [value of step 4] - [value of step 5 + value of step 6] = 32 - (8 + 18) = 6

Step 8: Degrees of freedom:
d.f. for total variance = (c × r) - 1 = (3 × 4) - 1 = 11
d.f. for variance between columns = c - 1 = 3 - 1 = 2
d.f. for variance between rows = r - 1 = 4 - 1 = 3
d.f. for residual variance = (c - 1)(r - 1) = (3 - 1)(4 - 1) = 6

Step 9: Mean squares:
MSC = SSC / (c - 1) = 8 / 2 = 4
MSR = SSR / (r - 1) = 18 / 3 = 6
MSE = SSE / (c - 1)(r - 1) = 6 / 6 = 1

Step 10: Set up the two-way ANOVA table:

Source of Variation              SS   d.f.   MS   F-ratio
Between columns (varieties)       8     2     4   MSC/MSE = 4.0
Between rows (fertilizers)       18     3     6   MSR/MSE = 6.0
Residual (error)                  6     6     1
Total                            32    11

Step 11: Inference:
Since the F-ratio concerning the varieties of sorghum seeds (4.0) is less than the table value (F for 2 and 6 d.f. at 5% = 5.14), the differences concerning the varieties of sorghum seeds are insignificant at 5%: the null hypothesis is accepted and the alternate hypothesis rejected. But the differences concerning the fertilizers are significant at 5%, because the calculated F value (6.0) is more than the table value (F for 3 and 6 d.f. at 5% = 4.76): the alternate hypothesis is accepted and the null hypothesis rejected.
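The two-way computation follows the steps above almost line for line; the NumPy sketch below reproduces the F-ratios of 4.0 and 6.0.

```python
import numpy as np

# Per-acre sorghum production: rows = fertilizers, columns = seed varieties
data = np.array([[6, 5, 5],
                 [7, 5, 4],
                 [3, 3, 3],
                 [8, 7, 4]])
r, c = data.shape
cf = data.sum() ** 2 / data.size                    # correction factor: 300
sst = (data ** 2).sum() - cf                        # total SS: 32
ssc = (data.sum(axis=0) ** 2).sum() / r - cf        # between columns: 8
ssr = (data.sum(axis=1) ** 2).sum() / c - cf        # between rows: 18
sse = sst - ssc - ssr                               # residual: 6

msc, msr, mse = ssc / (c - 1), ssr / (r - 1), sse / ((c - 1) * (r - 1))
print(f"F(columns) = {msc / mse:.1f}, F(rows) = {msr / mse:.1f}")  # 4.0 and 6.0
```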
Chi-Square Test
The chi-square test (χ²) is a measure of the difference between actual and expected frequencies. In sampling studies there is never a perfect coincidence between expected and observed frequencies; chi-square measures this discrepancy. If there is no difference at all between the actual and expected frequencies, chi-square is zero. Thus the chi-square test describes the discrepancy between theory and observation. The test in its modern form was introduced by Karl Pearson in 1900; R. A. Fisher later refined its use, particularly the treatment of degrees of freedom. Chi-square is written with the Greek letter χ (chi), pronounced "ki".

Formula:
χ² = Σ( (O - E)² / E )
where O = observed frequency and E = expected frequency.

From this equation the chi-square value will be zero if O = E in every class; owing to chance error this practically never happens, and the observed result is judged against the number of degrees of freedom (d.f.) and the critical level of probability P (0.05).

Degrees of Freedom (d.f.): While comparing the calculated value of chi-square with the tabulated value, we must determine the degrees of freedom, which are calculated from the number of classes: the number of degrees of freedom in a chi-square test is the number of classes minus one. In a contingency table (a table showing the association of attributes), the degrees of freedom are calculated as
d.f. = (r - 1)(c - 1)
where r = number of rows and c = number of columns in the table.

Ex: The RBC count (lakh/mm³) and Hb% (g/100 ml) of 500 persons of a test locality were recorded as follows. Is there any significant relation between RBC count and Hb%? Find out by the chi-square method (a code sketch follows).

RBC count        Hb% Above Normal   Hb% Below Normal   Total
Above Normal           85                 75             160
Below Normal          165                175             340
Total                 250                250             500

O      E (expected frequency)     O - E   (O - E)²   (O - E)²/E
85     (250 × 160)/500 = 80         5        25        25/80 = 0.31
165    (250 × 340)/500 = 170       -5        25       25/170 = 0.15
75     (250 × 160)/500 = 80        -5        25        25/80 = 0.31
175    (250 × 340)/500 = 170        5        25       25/170 = 0.15

χ² = 0.31 + 0.15 + 0.31 + 0.15 ≈ 0.92

Degrees of freedom: d.f. = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1

Inference: At the 5% significance level (0.05) the table value of χ² at 1 d.f. is 3.84. The calculated value (≈ 0.92) is less than the table value (3.84), so the association between RBC count and Hb% is not significant: the data are consistent with Hb% and RBC count being independent of each other. The null hypothesis is accepted and the alternate hypothesis rejected.
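The chi-square example can be reproduced with SciPy. Note that chi2_contingency applies Yates' continuity correction to 2×2 tables by default, so correction=False is passed to match the hand calculation.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[85, 75],     # RBC above normal
                     [165, 175]])  # RBC below normal

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2f}")  # chi2 ≈ 0.92, p ≈ 0.34
print(expected)                                        # [[80, 80], [170, 170]]
```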
Correlation & Regression

Correlation (r)
Definition: The statistical tool for measuring the degree of relationship between two variables, i.e. whether a change in one variable results in a positive or negative change in the other, is known as correlation.

Kinds of correlation: positive, negative and zero correlation.

Positive or direct correlation: When the values of the two variables deviate together, the condition is known as positive correlation. For instance, body weight increasing with increasing height is a positive relationship.

Negative or inverse correlation: If one variable increases (or decreases) and the other decreases (or increases), they are said to be negatively correlated. For example, if temperature increases and the lipid content of the body of a sample decreases, it is a case of negative correlation.

Zero correlation: If the variation of one variable has no relation to the variation in the other, it is called zero correlation.

Coefficient of correlation: The extent or degree of relationship between the variables is measured in terms of a parameter called the coefficient of correlation. It is denoted by "r", with -1 ≤ r ≤ 1.

Properties of the coefficient of correlation:
(1) It is a measure of the closeness of the relationship between the two variables.
(2) It lies between -1 and +1, i.e., -1 ≤ r ≤ 1.
(3) The correlation is perfect and positive if r = 1, and perfect and negative if r = -1.
(4) If r = 0, there is no correlation between the two variables and they are said to be independent.

Correlation measures the linear association (or relationship) between two quantitative variables. A positive sign indicates positive correlation: if one variable increases, the other also increases (as with height and weight). A negative sign indicates negative correlation: as one variable increases, the other decreases. The magnitude of r indicates the degree of correlation: the higher the value of r, the higher the correlation. An r of 0.0 indicates no correlation at all, an r of 1.0 perfect positive correlation, and an r of -1.0 perfect negative correlation.

For calculating r we must have n pairs of measurements of x and y. Traditionally y is the dependent variable and x the independent variable. Ex: In measuring the correlation between the height and weight of 10 persons, height is x and weight is y.

Having obtained n pairs of observations of x and y, the correlation coefficient (r) is calculated by:

r = (nΣXY - ΣX ΣY) / √( (nΣX² - (ΣX)²)(nΣY² - (ΣY)²) )

Ex: Height and weight of 10 individuals

Ht cm (X)   Wt kg (Y)     XY        X²       Y²
150            52         7800     22500    2704
160            58         9280     25600    3364
170            71        12070     28900    5041
175            74        12950     30625    5476
155            58         8990     24025    3364
165            61        10065     27225    3721
172            70        12040     29584    4900
179            75        13425     32041    5625
154            56         8624     23716    3136
163            60         9780     26569    3600
ΣX = 1643   ΣY = 635   ΣXY = 105024   ΣX² = 270785   ΣY² = 40931

x̄ = 164.3, ȳ = 63.5

Correlation coefficient:
r = (10 × 105024 - 1643 × 635) / √( (10 × 270785 - 1643²)(10 × 40931 - 635²) )
  = 6935 / √(8401 × 6085) ≈ 0.97

Inference: There is a strong positive correlation between the height and weight of these persons; the two variables are highly correlated.

Regression Analysis
Regression denotes the estimation or prediction of the average value of one variable for a specified value of the other variable. It is expressed by the regression coefficient 'b', also called the "slope". The slope is the vertical distance divided by the horizontal distance between any two points on the regression line, i.e., the rate of change along the regression line. It is calculated by:

b = (nΣXY - ΣX ΣY) / (nΣX² - (ΣX)²)

Ex: Using the same height and weight data of the 10 individuals as in the correlation example above:

b = 6935 / 8401 = 0.8255

The regression coefficient is used to estimate y for an unknown value of x using the equation y = a + bx, where
a = ȳ - b x̄ = 63.5 - 0.8255 × 164.3 = -72.13

Having obtained these values, Y (weight) can be estimated (predicted) for any known value of X (height) with the help of the equation Y = a + bX.
Ex: For a height of 175 cm, the weight will be -72.13 + 0.8255 × 175 ≈ 72.33 kg.

Regression Line: Calculate Y for two extreme values of X (say X1 and X2) using the equation Y = a + bX, to get estimates of the corresponding values of Y (say Y1 and Y2). In the above example X1 = 150 cm and X2 = 179 cm, and the estimates of Y1 and Y2 are 51.70 kg and 75.63 kg respectively. We thus get two points:
1. X1 = 150 cm, Y1 = 51.70 kg
2. X2 = 179 cm, Y2 = 75.63 kg
Plot these points on a graph and join them with a straight line: this is the regression line. Estimates of Y for any known value of X can then be obtained from the regression line (a code sketch follows).
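Both r and the regression line can be obtained with NumPy; the sketch below reproduces the values computed above.

```python
import numpy as np

height = np.array([150, 160, 170, 175, 155, 165, 172, 179, 154, 163])
weight = np.array([52, 58, 71, 74, 58, 61, 70, 75, 56, 60])

r = np.corrcoef(height, weight)[0, 1]    # correlation coefficient, ≈ 0.97
b, a = np.polyfit(height, weight, 1)     # slope ≈ 0.8255, intercept ≈ -72.13

print(f"r = {r:.2f}, b = {b:.4f}, a = {a:.2f}")
print(f"predicted weight at 175 cm: {a + b * 175:.2f} kg")  # ≈ 72.33
```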
Tables

Students' 't' Table: distribution of 't' values (Student's distribution)

        Probabilities
df     0.05     0.01     0.001
10     2.23     3.17     4.59
11     2.20     3.10     4.44
12     2.18     3.05     4.32
13     2.16     3.01     4.22
14     2.14     2.98     4.14
15     2.13     2.95     4.07
16     2.12     2.92     4.02
17     2.11     2.90     3.96
18     2.10     2.88     3.92
19     2.09     2.86     3.88
20     2.09     2.84     3.85
21     2.08     2.83     3.82
22     2.07     2.82     3.79
23     2.07     2.81     3.77
24     2.06     2.80     3.74
25     2.06     2.79     3.72

Chi-Square Table: the probabilities of exceeding different chi-square values for degrees of freedom from 1 to 15 when the expected hypothesis is true

df     0.05     0.01     0.001
1      3.84     6.64    10.83
2      5.99     9.21    13.82
3      7.82    11.35    16.27
4      9.49    13.28    18.47
5     11.07    15.09    20.52
6     12.59    16.81    22.46
7     14.07    18.48    24.32
8     15.51    20.09    26.13
9     16.92    21.67    27.88
10    18.31    23.21    29.59
11    19.68    24.73    31.26
12    21.03    26.22    32.91
13    22.36    27.69    34.53
14    23.69    29.14    36.12
15    25.00    30.58    37.70

F Table (5% level): n1 = degrees of freedom of the numerator (columns), n2 = degrees of freedom of the denominator (rows)

n2\n1       1          2          3          4          5          6          7          8
1       161.4476   199.5000   215.7073   224.5832   230.1619   233.9860   236.7684   238.8827
2        18.5128    19.0000    19.1643    19.2468    19.2964    19.3295    19.3532    19.3710
3        10.1280     9.5521     9.2766     9.1172     9.0135     8.9406     8.8867     8.8452
4         7.7086     6.9443     6.5914     6.3882     6.2561     6.1631     6.0942     6.0410
5         6.6079     5.7861     5.4095     5.1922     5.0503     4.9503     4.8759     4.8183
6         5.9874     5.1433     4.7571     4.5337     4.3874     4.2839     4.2067     4.1468
7         5.5914     4.7374     4.3468     4.1203     3.9715     3.8660     3.7870     3.7257
8         5.3177     4.4590     4.0662     3.8379     3.6875     3.5806     3.5005     3.4381
9         5.1174     4.2565     3.8625     3.6331     3.4817     3.3738     3.2927     3.2296
10        4.9646     4.1028     3.7083     3.4780     3.3258     3.2172     3.1355     3.0717
11        4.8443     3.9823     3.5874     3.3567     3.2039     3.0946     3.0123     2.9480
12        4.7472     3.8853     3.4903     3.2592     3.1059     2.9961     2.9134     2.8486
13        4.6672     3.8056     3.4105     3.1791     3.0254     2.9153     2.8321     2.7669
14        4.6001     3.7389     3.3439     3.1122     2.9582     2.8477     2.7642     2.6987
15        4.5431     3.6823     3.2874     3.0556     2.9013     2.7905     2.7066     2.6408
16        4.4940     3.6337     3.2389     3.0069     2.8524     2.7413     2.6572     2.5911
17        4.4513     3.5915     3.1968     2.9647     2.8100     2.6987     2.6143     2.5480
18        4.4139     3.5546     3.1599     2.9277     2.7729     2.6613     2.5767     2.5102
19        4.3807     3.5219     3.1274     2.8951     2.7401     2.6283     2.5435     2.4768
20        4.3512     3.4928     3.0984     2.8661     2.7109     2.5990     2.5140     2.4471
21        4.3248     3.4668     3.0725     2.8401     2.6848     2.5727     2.4876     2.4205
22        4.3009     3.4434     3.0491     2.8167     2.6613     2.5491     2.4638     2.3965
23        4.2793     3.4221     3.0280     2.7955     2.6400     2.5277     2.4422     2.3748
24        4.2597     3.4028     3.0088     2.7763     2.6207     2.5082     2.4226     2.3551
25        4.2417     3.3852     2.9912     2.7587     2.6030     2.4904     2.4047     2.3371
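For degrees of freedom not listed here, critical values like those in these tables can be generated with SciPy (assumed available); note that the t values correspond to two-tailed probabilities.

```python
from scipy import stats

# Critical values at the 5% level, matching the tables above
print(stats.t.ppf(1 - 0.05 / 2, 20))    # two-tailed t, 20 df -> 2.086 (table: 2.09)
print(stats.chi2.ppf(1 - 0.05, 1))      # chi-square, 1 df    -> 3.841 (table: 3.84)
print(stats.f.ppf(1 - 0.05, 3, 12))     # F with (3, 12) df   -> 3.490 (table: 3.49)
```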