Uses of Biostatistics in Epidemiology (1)
Amornrath Podhipak, Ph.D.
Department of Epidemiology, Faculty of Public Health, Mahidol University
2006

Medical doctors and public health personnel
Why statistics? Why computers? Why software?
They are tools for calculation.

Why do we need “statistics” in medicine and public health (particularly in epidemiology)?

* Medicine is becoming increasingly quantitative in describing a condition.
  Most malaria patients are infected with P. falciparum. -> 82.5% had P. falciparum.
  Those patients look pale. -> Haemoglobin level was 9.89 g/dL, on average.
  Epidemiology is concerned with describing disease patterns in a group of people.
  Descriptive statistics give a clearer picture of what we want to describe.

* The answer to a research question needs to be more definite.
  Is the new treatment better? How much better? In what aspect? Is there any evidence? Could it be a real difference?
  Inferential statistics give an answer in a world of uncertainty.

Before using statistics, we need some kind of measurement in order to get more detailed information.

Measurement of characteristics (variables vs. constants)

4 scales of measurement
Qualitative variables
- Nominal scale (group classification only)
- Ordinal scale (classification with ordering / ranking)
Quantitative variables
- Interval scale (magnitude + constant distance between points)
- Ratio scale (magnitude + constant distance between points + true zero)

Characteristics of a person that we may want to measure:
Intelligent? Handsome? Married? HIV? BP: 140/90. Income: 100,000. Weight: 80 kg. Height: 160 cm.

Nominal scale, e.g. Female and Male coded as 1 and 2
Values have no meaning.

Ordinal scale, e.g. ranks 1, 2, 3
Equal distance between points does not reflect equal interval value.

Interval scale, e.g. degrees Celsius: 0 10 20 30
The freezing point of water was chosen to be zero degrees Celsius. It is not the true ZERO temperature (no heat).
Equal distance between points means equal interval value.

Ratio scale, e.g. weight: 0 10 20 30
True ZERO (nothing here).
Equal distance between points means equal interval value.
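The difference between the interval and ratio scales above can be illustrated with a little arithmetic. The sketch below is only an illustration: the helper name `celsius_to_kelvin` and the example temperatures and weights are assumptions, not values from the slides. It shows that a ratio of two Celsius readings changes when the same temperatures are re-expressed in kelvin (because 0 degrees Celsius is not a true zero), while a ratio of two weights is the same in any unit.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to kelvin (a scale with a true zero)."""
    return celsius + 273.15


# Hypothetical example readings (not from the slides).
t1_c, t2_c = 10.0, 20.0

# Interval scale: differences are meaningful and survive a change of zero point.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Interval scale: ratios are NOT meaningful; they depend on where zero was placed.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

# Ratio scale: weight has a true zero, so the ratio is the same in kg or g.
w1_kg, w2_kg = 40.0, 80.0
ratio_kg = w2_kg / w1_kg
ratio_g = (w2_kg * 1000) / (w1_kg * 1000)

print(f"Temperature difference: {diff_c:.1f} C, {diff_k:.1f} K")
print(f"Temperature ratio: {ratio_c:.3f} in Celsius, {ratio_k:.3f} in kelvin")
print(f"Weight ratio: {ratio_kg:.3f} in kg, {ratio_g:.3f} in g")
```

So "20 degrees is twice as hot as 10 degrees" is not a valid statement, whereas "80 kg is twice as heavy as 40 kg" is.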
Questionnaire (TB and passive smoking)
Sex: [ ] Male [ ] Female
Education: [ ] 1-6 yr [ ] 7-9 yr [ ] 9+ yr
Family income: ……………………. Baht/month
Passive smoking: ……...

Record form
Result from tuberculin test: ……………………. mm
X-ray: [ ] +ve [ ] -ve
Weight: …………. kg
Height: ………………….. cm

Variable (characteristic being measured)   Result of measurement                          Type
Marital status                             single/married/divorced                        nominal
gender                                     male/female                                    nominal
smoking                                    yes/no                                         nominal
smoking                                    nonsmoker/light smoker/moderate smoker/        ordinal
                                           heavy smoker
smoking                                    number of cig/day                              ratio
feeling of pain                            yes/no                                         nominal
feeling of pain                            none/light/moderate/high                       ordinal
feeling of pain                            0 ---------> 10                                ordinal
attitude toward selective abortion         strongly agree/agree/not sure/disagree/        ordinal
                                           strongly disagree
blood pressure                             mmHg                                           ratio
temperature                                degrees Celsius                                interval
weight                                     gram                                           ratio
tumor stage                                I, II, III, IV                                 ordinal

Quantitative (numeric, metric) variables are classified as
- continuous: can take all values in an interval, e.g. weight, temperature, etc.
- discrete: can take only certain values (often integer values), e.g. parity, number of sex partners, etc.

Continuous data can be categorised into groups, for which one needs to define the “lower boundary” and “upper boundary” of each value (or class). The boundaries lie halfway between adjacent recorded values:

Recorded values: 120 121 122 123 124 125 126 127
boundaries: 120.5, 121.5, 122.5, 123.5, 124.5 …

Recorded values: 120.1 120.2 120.3 120.4 120.5 120.6 120.7 120.8
boundaries: 120.15, 120.25, 120.35, 120.45, 120.55 …

Recorded values: 120.11 120.12 120.13 120.14 120.15 120.16 120.17 120.18
boundaries: 120.115, 120.125, 120.135, 120.145, 120.155 …

Descriptive statistics - a way to summarize a dataset (a group of measurements)
Example: Height of 100 children, 10-12 years of age.
140 123 140 151 155 147 134 151 132 134
140 142 138 138 134 158 141 151 141 138
140 161 132 141 130 155 140 140 140 141
136 155 142 125 146 135 153 140 142 130
141 129 155 123 135 141 142 141 165 141
123 130 125 134 139 136 127 147 153 132
125 139 136 135 134 136 147 139 146 140
134 129 129 135 142 147 142 134 134 138
125 134 136 135 139 139 146 140 151 127
129 130 153 130 149 132 127 149 151 129

What are the values that best describe the height of these 100 persons?

1) Rearrange the data (sorted in ascending order, reading down each column):
123 129 132 134 136 139 140 142 147 153
123 129 132 134 136 139 140 142 147 153
123 129 132 134 136 140 141 142 147 153
125 129 132 135 138 140 141 142 149 155
125 129 134 135 138 140 141 142 149 155
125 130 134 135 138 140 141 142 151 155
125 130 134 135 138 140 141 146 151 155
127 130 134 135 139 140 141 146 151 158
127 130 134 136 139 140 141 146 151 161
127 130 134 136 139 140 141 147 151 165

Minimum = 123
Maximum = 165
Range (max - min) = 42
Median (value in the middle) = 139
Mode (most repeated value) = 140

2) Present in a table (frequency distribution)
The class boundaries depend on the boundaries of the recorded values.

Class boundaries   Height (cm)   Midpoint (X)   f = frequency
119.5-124.5        120-124       122             3
124.5-129.5        125-129       127            12
129.5-134.5        130-134       132            18
134.5-139.5        135-139       137            19
139.5-144.5        140-144       142            24
144.5-149.5        145-149       147             9
149.5-154.5        150-154       152             8
154.5-159.5        155-159       157             5
159.5-164.5        160-164       162             1
164.5-169.5        165-169       167             1

3) Present in a graph (histogram)
[Histogram: frequency (0 to 25) on the vertical axis against height (cm), 120 to 170, on the horizontal axis]

Methods of data presentation
1. Table
2. Graph
- line graph
- bar chart
- pie chart
- scatter plot
- area graph
- error bar
- histogram

Another set of values for describing a dataset is the MEAN and the STANDARD DEVIATION.
The mean indicates the location of the data.
The standard deviation indicates (roughly) how scattered the data are.
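Steps 1 and 2 of the height example above can be reproduced with a short Python sketch. The list `heights` holds the 100 raw values from the example; the variable names, the use of the standard `statistics` module, and the assumption that heights were recorded to the nearest 1 cm (so that class boundaries sit 0.5 cm outside the class limits) are choices made here for illustration.

```python
from statistics import median, multimode

# The 100 raw heights (cm) from the example.
heights = [
    140, 123, 140, 151, 155, 147, 134, 151, 132, 134,
    140, 142, 138, 138, 134, 158, 141, 151, 141, 138,
    140, 161, 132, 141, 130, 155, 140, 140, 140, 141,
    136, 155, 142, 125, 146, 135, 153, 140, 142, 130,
    141, 129, 155, 123, 135, 141, 142, 141, 165, 141,
    123, 130, 125, 134, 139, 136, 127, 147, 153, 132,
    125, 139, 136, 135, 134, 136, 147, 139, 146, 140,
    134, 129, 129, 135, 142, 147, 142, 134, 134, 138,
    125, 134, 136, 135, 139, 139, 146, 140, 151, 127,
    129, 130, 153, 130, 149, 132, 127, 149, 151, 129,
]

# Step 1: simple summary values.
minimum = min(heights)
maximum = max(heights)
value_range = maximum - minimum
middle = median(heights)          # average of the two middle values when n is even
most_common = multimode(heights)  # list of the most repeated value(s)

# Step 2: frequency distribution with classes 120-124, 125-129, ..., 165-169.
class_width = 5
unit = 1  # assumed recording precision: nearest 1 cm
table = []
for lower in range(120, 170, class_width):
    upper = lower + class_width - unit
    boundaries = (lower - unit / 2, upper + unit / 2)
    midpoint = (lower + upper) / 2
    frequency = sum(lower <= h <= upper for h in heights)
    table.append((boundaries, (lower, upper), midpoint, frequency))

print(f"min={minimum} max={maximum} range={value_range} "
      f"median={middle} mode={most_common}")
for boundaries, limits, midpoint, frequency in table:
    print(f"{boundaries[0]}-{boundaries[1]}  {limits[0]}-{limits[1]}  "
          f"{midpoint:.0f}  {frequency}")
```

Counting by class limits (120-124) or by class boundaries (119.5-124.5) gives the same frequencies here, because every recorded value is a whole number of centimetres.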
Example: Dataset 1: Age of 6 children
4 4 4 4 4 4
Mean = 4.0 years, sd = 0 years (no variation)

Example: Dataset 2: Age of 6 children
2 2 4 4 6 6
Mean = 4.0 years, sd = 1.79 years (with variation)

Or, another example:
The average body height of these children was 138.9 cm with a standard deviation of 8.9 cm.
The average body height of these children was 138.9 cm with a standard deviation of 0.2 cm.

If we categorize the data into a qualitative variable (tall/short), a proportion would then be calculated.

Descriptive statistics (proportion and/or percentage)
Most of the children were less than 150 cm tall.
85% of them had a height of less than 150 cm.

A final note on defining a variable and a measurement
Important things to consider before making any measurement:

1. Do we measure the right thing?
   Example: fatty food and CVD

2. What is the tool that can actually measure what we want to measure?
   Morphology indicators (measure):
   - % standard weight
   - body mass index (wt/ht^2)
   - triceps skinfold thickness
   - weight for age
   - weight for height
   - etc.
   Food intake (ask)
   Protein calorie intake (ask & calculate)

3. How valid is the instrument?
   Does the questionnaire actually get the fatty food intake information? (scope of questions, recall of subjects, certainty of reported amount of food, variability of ingredients, etc.)
   Does the information obtained actually reflect fatty food intake?

4. How precise is the instrument?
   Does the information precisely estimate the amount of fatty food intake for each individual?

In summary:
Statistics (and epidemiology) deals with a group of persons (the bigger the group, the better the result), not with one individual patient.
We look for the characteristics which are most common in the group.

Descriptive statistics are used for explaining our sample (or findings), e.g.
Most of the patients were anemic.
80% of them had a haemoglobin level of less than 10 g/dL.
The average haemoglobin level was 9.5 g/dL with a standard deviation of 1.5 g/dL.

Inferential statistics (infer to the general population of interest)
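The mean and standard deviation quoted for the two six-child age datasets earlier in this section can be checked with a short Python sketch. The function name `sample_sd` is hypothetical, and the n - 1 divisor is an assumption: the slides do not state which formula was used, but the sample formula reproduces the quoted sd of 1.79 years for Dataset 2 (dividing by n would give about 1.63).

```python
from math import sqrt
from statistics import mean, stdev


def sample_sd(values):
    """Sample standard deviation: sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(values)
    centre = sum(values) / n
    squared_deviations = [(x - centre) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))


dataset_1 = [4, 4, 4, 4, 4, 4]  # no variation
dataset_2 = [2, 2, 4, 4, 6, 6]  # same mean, with variation

mean_1, sd_1 = mean(dataset_1), sample_sd(dataset_1)
mean_2, sd_2 = mean(dataset_2), sample_sd(dataset_2)

# Cross-check the hand-written formula against the standard library.
library_sd_2 = stdev(dataset_2)

print(f"Dataset 1: mean = {mean_1:.1f} years, sd = {sd_1:.2f} years")
print(f"Dataset 2: mean = {mean_2:.1f} years, sd = {sd_2:.2f} years")
```

Both datasets have the same mean, so the mean alone cannot distinguish them; only the standard deviation shows that the ages in Dataset 2 are spread out.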