CHAPTER 8
Describing Variation and Distribution of Data

"Variability is the law of life, as no two faces are the same, so no two bodies are alike, and no two individuals react alike and behave alike under the abnormal conditions which we know as disease."
- William Osler

VARIABLE
A measure of a single characteristic that can vary.

CAUSES OF VARIATION
- Biologic differences: can result from many factors, such as genes, nutrition, environmental exposures, age, sex, and race.
- Presence or absence of disease, and the stage or extent of disease.
  Example: Cancer of the cervix may be in situ, localized, invasive, or metastatic.
- Different conditions of measurement: these often account for the variations observed in medical data and include factors such as time of day, ambient temperature or noise, and the presence of fatigue or anxiety in the patient.
  Example: Blood pressure is higher with anxiety or following exercise and lower after sleep.
- Different techniques of measurement: these can produce different results.
  Example: A blood pressure measurement obtained with an intraarterial catheter may differ from a measurement obtained with an arm cuff.
- Measurement error: this can also cause variation.
  Example: Two different blood pressure cuffs of the same size may give different measurements in the same patient because one of the cuffs is defective.

VARIATIONS
- Some types of variation distort data systematically in one direction. This form of distortion is called systematic error and can introduce bias.
  Example: Measuring and weighing patients while they are wearing shoes.
- Other types of variation are random and are called random error. Random error makes some readings too high and others too low; it is not systematic and does not introduce bias.
  Example: Slight, inevitable inaccuracies in obtaining any measurement, such as blood pressure.
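The difference between systematic and random error can be illustrated with a small simulation. This is only a sketch: the true weight, the shoe offset, and the size of the random noise are hypothetical values chosen for illustration and do not come from the chapter.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# All three values below are hypothetical, chosen only for illustration.
TRUE_WEIGHT_KG = 70.0   # assumed true weight of one patient
SHOE_OFFSET_KG = 1.0    # assumed extra weight from shoes (systematic error)
NOISE_SD_KG = 0.5       # assumed spread of random measurement error
N_READINGS = 10_000

# Random error only: readings scatter above and below the true value.
random_only = [TRUE_WEIGHT_KG + rng.gauss(0, NOISE_SD_KG)
               for _ in range(N_READINGS)]

# Systematic plus random error: every reading is shifted in one direction.
with_shoes = [TRUE_WEIGHT_KG + SHOE_OFFSET_KG + rng.gauss(0, NOISE_SD_KG)
              for _ in range(N_READINGS)]

# Bias = average reading minus the true value.
bias_random_only = statistics.mean(random_only) - TRUE_WEIGHT_KG
bias_with_shoes = statistics.mean(with_shoes) - TRUE_WEIGHT_KG

print(f"Average error, random error only: {bias_random_only:+.3f} kg")
print(f"Average error, weighed with shoes: {bias_with_shoes:+.3f} kg")
```

With many readings, the random errors tend to cancel and the average error stays close to 0, whereas the shoe offset remains in the average no matter how many readings are taken.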
Statistics and Variables

Quantitative and Qualitative Data
- A quantitative characteristic, such as a systolic blood pressure measurement or a serum sodium level, is characterized using a defined, continuous measurement scale.
- A qualitative characteristic, such as coloration of the skin, is described by its features, generally in words rather than numbers.

Types of Variables
- Nominal variables
- Dichotomous (binary) variables
- Ordinal (ranked) variables
- Continuous (dimensional) variables
- Ratio variables
- Risks and proportions

Nominal Variables
Naming or categoric variables that are not based on measurement scales or rank order.
Examples:
- Blood groups (O, A, B, and AB)
- Occupations
- Food groups
- Skin color, even when a number is assigned to each color (e.g., 1 is bluish purple, 2 is red, 3 is white, 4 is blue, and 5 is yellow)

Dichotomous (Binary) Variables
Dichotomous (Greek, "cut into two"): a variable with only two levels.
Example: Investigators might choose to create a variable with only two levels: normal skin color (coded as 1) and abnormal skin color (coded as 2).
In many cases, dichotomous variables inadequately describe the information needed.
Example: A study of heart murmurs may need
- Dichotomous data on a murmur's timing (e.g., systolic or diastolic)
- Nominal data on its location (e.g., aortic valve area) and character (e.g., rough)
- Ordinal data on its loudness (e.g., grade III)
Dichotomous, nominal, and often ordinal variables are referred to as discrete variables because the number of possible values they can take is countable.

Ordinal (Ranked) Variables
Data that can be characterized in terms of three or more qualitative values that have a clearly implied direction from better to worse.
Examples:
- Satisfaction with care ("very satisfied", "fairly satisfied", "not satisfied")
- Amount of swelling in a patient's legs ("none", or 1+, 2+, 3+, or 4+)
- Pain (absent, mild, moderate, or severe; or a scale of 0 to 10, where 0 is no pain and 10 is the worst imaginable pain)

Continuous (Dimensional) Variables
Data that are measured on continuous (dimensional) measurement scales. Continuous data show not only the position of the different observations relative to each other, but also the extent to which one observation differs from another.
Examples:
- Patients' heights, weights, systolic and diastolic blood pressures, and serum glucose levels

Ratio Variables
If a continuous scale has a true 0 point, the variables derived from it can be called ratio variables.
- The Kelvin temperature scale is a ratio scale because 0 degrees on this scale is absolute 0.
- The centigrade temperature scale is a continuous scale, but not a ratio scale, because 0 degrees on this scale does not mean the absence of heat.

Examples of the Different Types of Data
Information Content | Variable Type | Examples
Higher | Ratio | Temperature (Kelvin); blood pressure
Higher | Continuous (dimensional) | Temperature (Fahrenheit)
Higher | Ordinal (ranked) | Edema = 3+ out of 5; perceived quality of care = good/fair/poor
Higher | Binary (dichotomous) | Gender = male/female; heart murmur = present/absent
Lower | Nominal | Blood type; skin color

Risks and Proportions as Variables
- Risk is the conditional probability of an event (e.g., death or disease) in a defined population in a defined period.
- Risks and proportions are variables created by the ratio of counts in the numerator to counts in the denominator.
- Risks and proportions can be analyzed using the statistical methods for continuous variables.

COUNTS AND UNITS OF OBSERVATION
The unit of observation is the person or thing from which the data originated.
Examples:
- Persons
- Animals
- Cells
The units of observation may be arranged in a frequency table, with one characteristic on the x-axis and another on the y-axis.

TABLE 8.2 Standard 2x2 Table Showing Gender of 71 Participants and Whether Serum Total Cholesterol Was Checked
CHOLESTEROL LEVEL (NO. OF PARTICIPANTS)
Gender | Checked | Not Checked | Total
Female | 17 (63%) | 10 (37%) | 27 (100%)
Male | 25 (57%) | 19 (43%) | 44 (100%)
Total | 42 (59%) | 29 (41%) | 71 (100%)
Data from unpublished findings in a sample of 71 young adults in Connecticut.

Combining Data
The conversion of a continuous variable to an ordinal variable by grouping units with similar values together.
Example: Individual birth weights of infants can be converted to ranges of birth weights.
Advantage: Percentages can be calculated for each group, for example to show mortality and survival rates.
Disadvantage: Loss of individual information.

FREQUENCY DISTRIBUTIONS

Frequency Distributions of Continuous Variables
A frequency distribution can be shown by creating a table that lists the values of the variable according to the frequency with which each value occurs.
[Figure: histogram of the number of persons (y-axis) by serum level of total cholesterol in mg/dL (x-axis)]

Range of a Variable
Range: the distance between the lowest and highest observations of the variable.
Example: In the table of serum levels of total cholesterol reported in 71 participants, the cholesterol levels vary from 124 mg/dL to 264 mg/dL.
Range = 264 - 124 = 140 mg/dL

Real and Theoretical Frequency Distributions
- Real frequency distributions are those obtained from actual data or a sample.
- Theoretical frequency distributions are calculated using assumptions about the population from which the sample was obtained.
  - Normal distribution

NORMAL DISTRIBUTION
- It is also called the Gaussian distribution (after Johann Karl Gauss).
- It is bell-shaped.
- Bell-shaped curves are often used to represent the expected or theoretical distribution of the observations (the height of the curve on the y-axis) for the different possible values on a measurement scale (on the x-axis).
- In a normal (Gaussian) distribution, mean = median = mode, and the distribution is symmetrical.

Parameters of a Frequency Distribution
Frequency distributions from continuous data are defined by two types of descriptors, known as parameters:
- Measures of central tendency
- Measures of dispersion

Measures of Central Tendency
Examining a distribution:
- First step: look for the central tendency of the observations.
- Next step: examine in detail the mode, the median, and the mean.

Mode
The most commonly observed value. A frequency distribution typically has a mode at more than one value.
Example: In the table of serum levels of total cholesterol reported in 71 participants, the most commonly observed cholesterol levels (each with four observations) are 171 mg/dL and 180 mg/dL.

Median
The middle observation when the data have been arranged in order from the lowest value to the highest value.
Example: In the table shown, the median value is 178 mg/dL.
When there is an even number of observations, the median is considered to lie halfway between the two middle observations.
Example: In the table, the two middle observations are the 13th and 14th observations.
The corresponding values for these are 57 and 58 mg/dL.

Initial HDL cholesterol values (mg/dL) of participants (N = 26):
31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58, 60, 60, 62, 63, 64, 67, 69, 70, 77, 81, and 90
Median (mg/dL): (57 + 58)/2 = 57.5

The median value is also called the 50th percentile observation because 50% of the observations lie at that value or below.

Mean
The average value, or the sum (∑) of all the observed values (x_i) divided by the total number of observations (N), where the subscript i means "the value of x for individual i, where i ranges from 1 to N".

$\bar{x} = \dfrac{\sum x_i}{N}$

Example: For the same 26 HDL cholesterol values, the sum of the values is 1496 mg/dL.
Mean, or x̅ (mg/dL): 1496/26 = 57.5

Measures of Dispersion
After the central tendency of a frequency distribution is determined, the next step is to determine how spread out (dispersed) the numbers are. Measures of dispersion are
- Based on percentiles
- Based on the mean

Measures of Dispersion Based on Percentiles
A percentile of a distribution is a point at which a certain percentage of the observations lie below the indicated point when all the observations are ranked in descending order.
Examples:
- The median, discussed previously, is the 50th percentile because 50% of the observations are below it.
- The 75th percentile is the point at or below which 75% of the observations lie.
- The 25th percentile is the point at or below which 25% of the observations lie.

Measures of Dispersion Based on the Mean
Three measures of dispersion are based on the mean:
- Mean deviation
- Variance
- Standard deviation

Mean Absolute Deviation
This is seldom used, but it helps define the concept of dispersion.

$\text{Mean deviation} = \dfrac{\sum |x_i - \bar{x}|}{N}$

It does not have the mathematical properties needed to serve as the basis for many statistical tests. Variance has become the fundamental measure of dispersion.
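The central-tendency figures quoted above for the 26 HDL cholesterol values, together with the mean absolute deviation, can be checked with a short Python sketch. The variable names are illustrative, and the sketch uses only Python's standard `statistics` module (`multimode` requires Python 3.8 or later).

```python
import statistics

# Initial HDL cholesterol values (mg/dL) of the 26 participants
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57,
       58, 58, 60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

n = len(hdl)                          # number of observations, N
total = sum(hdl)                      # sum of the x_i
mean = total / n                      # x-bar = sum(x_i) / N
median = statistics.median(hdl)       # halfway between the 13th and 14th values
modes = statistics.multimode(hdl)     # every value tied for most frequent

# Mean absolute deviation: average distance of each value from the mean
mean_abs_dev = sum(abs(x - mean) for x in hdl) / n

print(f"N = {n}, sum = {total} mg/dL")
print(f"Mean = {mean:.1f} mg/dL")
print(f"Median = {median} mg/dL")
print(f"Mode(s) = {modes} mg/dL")
print(f"Mean absolute deviation = {mean_abs_dev:.2f} mg/dL")
```

The mean absolute deviation takes absolute values because the signed deviations from the mean always sum to zero, which would make an unsigned average meaningless as a measure of spread.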
Variance
The fundamental measure of dispersion in statistics that are based on the normal distribution. It is the sum of the squared deviations from the mean, divided by the number of observations minus 1.

$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{N - 1}$

- $s^2$ = symbol for the variance calculated from the observed data
- $N - 1$ = degrees of freedom
- $\sum (x_i - \bar{x})^2$ = numerator of the variance, an extremely important measure in statistics. It is usually called either the sum of squares (SS) or the total sum of squares (TSS).

How to compute the variance? For the 26 HDL cholesterol values in Table 8.4:
$s^2 = 4298.46 / 25 = 171.94$ (mg/dL)²

Standard Deviation

$s = \sqrt{s^2}$

- It is the square root of the variance.
- It is used to describe the amount of spread in the frequency distribution.
- It can be thought of as an average of the deviations from the mean.

How to compute the standard deviation? Take the square root of the variance computed above:
$s = \sqrt{171.94} = 13.1$ mg/dL

TABLE 8.4 Raw Data and Results of Calculations in Study of Serum Levels of High-Density Lipoprotein (HDL) Cholesterol in 26 Participants
Parameter | Raw Data or Results of Calculation
No. of observations, or N | 26
Initial HDL cholesterol values (mg/dL) of participants | 31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58, 60, 60, 62, 63, 64, 67, 69, 70, 77, 81, and 90
Highest value (mg/dL) | 90
Lowest value (mg/dL) | 31
Mode (mg/dL) | 47, 48, 58, and 60
Median (mg/dL) | (57 + 58)/2 = 57.5
Sum of the values, or sum of x_i (mg/dL) | 1496
Mean, or x̅ (mg/dL) | 1496/26 = 57.5
Range (mg/dL) | 90 - 31 = 59
Interquartile range (mg/dL) | 64 - 48 = 16
Sum of (x_i - x̅)², or TSS | 4298.46 (mg/dL)²
Variance, or s² | 171.94 (mg/dL)²
Standard deviation, or s | √171.94 = 13.1 mg/dL

Problems in Analyzing a Frequency Distribution
In a normal (Gaussian) distribution, the following holds true:
- mean = median = mode
- The distribution is symmetrical.

Skewness
A horizontal stretching of a frequency distribution to one side or the other, so that one tail of observations is longer and has more observations than the other tail.
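Returning to the figures in Table 8.4, the total sum of squares, variance, and standard deviation can be reproduced step by step. The sketch below is illustrative only; it follows the N - 1 formula given for the variance and cross-checks the result against Python's standard `statistics.variance`, which also divides by N - 1. It also compares the mean with the median, since the two coincide in a symmetrical distribution.

```python
import math
import statistics

# Initial HDL cholesterol values (mg/dL) from Table 8.4
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57,
       58, 58, 60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

n = len(hdl)
mean = sum(hdl) / n

# Total sum of squares (TSS): sum of squared deviations from the mean
tss = sum((x - mean) ** 2 for x in hdl)

# Sample variance: TSS divided by the degrees of freedom, N - 1
variance = tss / (n - 1)

# Standard deviation: square root of the variance
sd = math.sqrt(variance)

# Difference between mean and median; near 0 for a symmetrical distribution
mean_minus_median = mean - statistics.median(hdl)

print(f"TSS = {tss:.2f} (mg/dL)^2")
print(f"Variance = {variance:.2f} (mg/dL)^2")
print(f"Standard deviation = {sd:.1f} mg/dL")
print(f"Mean - median = {mean_minus_median:.2f} mg/dL")
```

Because the deviations are squared, the variance is expressed in squared units; taking the square root returns the standard deviation to the original units of mg/dL.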
A distribution may be skewed to the left or skewed to the right.
- Skewed to the left: a histogram or frequency polygon has a longer tail on the left side of the diagram. This is a negatively skewed distribution.
- Skewed to the right: a histogram or frequency polygon has a longer tail on the right side of the diagram. This is a positively skewed distribution.

Kurtosis
Characterized by a vertical stretching or flattening of the frequency distribution.
Figure panels: C. Abnormal peaking; D. Abnormal flattening

Thank you very much for listening!

Reference:
Elmore, J. G., Wild, D. M. G., Nelson, H. D., & Katz, D. L. (2020). Jekel's Epidemiology, Biostatistics, Preventive Medicine, and Public Health. Elsevier.