ENG-7: Biostatistics

Introduction

Statistics: collect, organize, classify and summarize DATA in order to perform analyses and interpretation, produce forecasts and predictions, or make decisions and define policies.

Inferential statistics consists of the tools that are used to make inferences about a population. For this we often use probabilistic models, which are mathematical models to deal with uncertainty.

Data are collected to explore a population, the target of our study; the data collected form a sample.

A random sample is a sample from a population taken in such a way that every sample of size $n$ has an equal probability of selection.

Examples:

1. Who will win the next election?
2. Producing the best possible gasoline by changing the levels of a given set of components.
3. Determining which of two toothpastes is more effective in preventing tooth decay.
4. Forecasting the amount of rainfall for the next winter.
5. Monitoring patients with heart disease and deciding which factors affect their health.
6. Optimizing the performance of a computer by tuning the parameters of the OS.

Types of data

• Quantitative data. Correspond to observations measured on a numerical scale.
  – Continuous variables. Can assume any value in some interval of real numbers. They are usually related to measurements of physical quantities.
  – Discrete variables. Can assume only a countable number of values. They are usually related to counts.
• Qualitative data. Correspond to observations classified in groups or categories.
• Ranked (ordinal) data. Observations can be classified within categories that have a natural ordering.

How data are collected

• Observational studies. Collect data with little or no control over possible affecting factors.
• Designed experiments. Data are collected by means of an experiment where the most important factors are subject to control.
• Survey samples. Data are collected from a finite population, carefully considering its structure.

Population parameters

The descriptive characteristics of a population can be summarized by some population parameters. These are numerical measures that are typical of each population. Usual examples are the central tendency of the population or its dispersion. Parameters are usually unknown quantities.

Parameters are estimated using values from a sample. Sample values are used to obtain a statistic. Actually, any function of a sample is called a statistic. The observed value of a statistic depends on how the sample is obtained, thus statistics are random variables.

The mean

The population mean is defined as the sum of the values of the variable under study divided by the number of objects in the population. It is usually referred to using the letter $\mu$.

If a population has five elements, $X_1 = 0$, $X_2 = -1$, $X_3 = 5$, $X_4 = 2.4$, $X_5 = -0.7$, then the mean is

$$\mu = \frac{0 - 1 + 5 + 2.4 - 0.7}{5} = 1.14$$

In general we write

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

To estimate $\mu$ we take a sample of size $n$, $Y_1, \ldots, Y_n$, and take its average

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

The median

As a measure of the central tendency of a population, the mean can be seriously affected by very large or very small observations. The mean salary of the employees of a company may not give a good idea of the income of the average worker, since a small number of very high salaries will pull the average up. A more robust measure of central tendency is the median. This is defined as follows:

1. Order the elements of the population.
2. Define the depth of $X_i$ as its position relative to the nearest extreme.
3. When $N$ is an odd number, the population median is the observation of maximum depth.
4. If $N$ is an even number, the median is the average of the two observations with maximum depth.

Consider the circumferences at chest height (CCH) of 15 maple trees. Then $d = (15 + 1)/2 = 8$ is the maximum depth.

CCH     18   21   22   29   29   36   37   38
Depth    1    2    3    4    5    6    7    8
CCH     56   59   66   70   88   88   93  120
Depth    7    6    5    4    3    2    1

For a population of 12 cypress trees we take the average of the two observations of depth 6, $(56 + 68)/2 = 62$.

CCH     17   19   31   39   48   56
Depth    1    2    3    4    5    6
CCH     68   73   73   75   80  122
Depth    6    5    4    3    2    1

The range

Consider the two samples of weights (in kg) of albacore tuna in the table:

Sample 1    8.9   9.6  11.2   9.4   9.9  10.9  10.4  11.0   9.7
Sample 2    3.1  17.0   9.9   5.1  18.0   3.8  10.0   2.9  21.2

They have the same mean (10.11 kg) and the same median (9.9 kg). But they are not identical, since the observations are scattered in different ways.

One measure of the dispersion of a population is given by the range. The range is defined as the difference between the maximum and the minimum. If we denote the ordered sample by $Y_{(i)}$, then the sample range is

$$Y_{(n)} - Y_{(1)}$$

Sample 1 has a range of 2.3 kg; for sample 2 the range is 18.3 kg.

The variance

As a measure of dispersion we can consider the difference between each observation and the mean. Let $z_i = Y_i - \bar{Y}$. These differences are called deviates. Deviates have the following property:

$$\sum_{i=1}^{n} z_i = \sum_{i=1}^{n} (Y_i - \bar{Y}) = \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} \bar{Y} = \sum_{i=1}^{n} Y_i - n\bar{Y} = 0$$

A positive measure of dispersion can be obtained by considering the squares of the deviates. The sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1} = \frac{\sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}}{n-1}$$

is an estimator of the population variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N},$$

the average squared deviation of the observations from the mean.

For the second sample of the previous table we have that

$$\sum Y_i = 91, \qquad \sum Y_i^2 = 1318.92, \qquad n = 9$$

so

$$s^2 = \frac{1318.92 - \frac{(91)^2}{9}}{9 - 1} = 49.851 \text{ kg}^2$$

Notice that the units of dispersion in this example are kg$^2$ and not kg.

To obtain a measure of dispersion in the scale of the original data we compute the standard deviation as the square root of the variance. So, for our example, $s = 7.06$ kg.
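
The worked examples in these notes (the mean of the five-element population, the median of the 12 cypress trees, and the range, variance and standard deviation of the tuna samples) can be checked with a short Python sketch. The function and variable names below (mean, depths, median_by_depth, sample_range, sample_variance, population, cypress, tuna_1, tuna_2) are illustrative choices and not part of the course material; the data are the values quoted in the notes.

```python
import math


def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def depths(n):
    """Depth of each position in an ordered list of length n:
    its rank counted from the nearest extreme (1 at both ends)."""
    return [min(i, n + 1 - i) for i in range(1, n + 1)]


def median_by_depth(values):
    """Median as the average of the observation(s) of maximum depth."""
    ordered = sorted(values)
    d = depths(len(ordered))
    deepest = [y for y, k in zip(ordered, d) if k == max(d)]
    return sum(deepest) / len(deepest)


def sample_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


def sample_variance(values):
    """Shortcut formula: (sum of Y^2 - (sum of Y)^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_sq = sum(y * y for y in values)
    return (total_sq - total ** 2 / n) / (n - 1)


# Data quoted in the notes.
population = [0, -1, 5, 2.4, -0.7]
cypress = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]
tuna_1 = [8.9, 9.6, 11.2, 9.4, 9.9, 10.9, 10.4, 11.0, 9.7]
tuna_2 = [3.1, 17.0, 9.9, 5.1, 18.0, 3.8, 10.0, 2.9, 21.2]

print(round(mean(population), 2))                    # 1.14
print(median_by_depth(cypress))                      # 62.0
print(round(sample_range(tuna_1), 1))                # 2.3
print(round(sample_range(tuna_2), 1))                # 18.3
print(round(sample_variance(tuna_2), 3))             # 49.851
print(round(math.sqrt(sample_variance(tuna_2)), 2))  # 7.06

# The deviates from the sample mean sum to zero (up to rounding error).
ybar = mean(tuna_2)
print(abs(sum(y - ybar for y in tuna_2)) < 1e-9)     # True
```

The median is computed through depths on purpose, to mirror the definition in the notes; for everyday use the standard library's statistics.median gives the same result.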