MBC7201 Statistical Data Analysis Explain the principles underlying the various statistical methods used for data analysis Course Objectives Analyze scientific data using various statistical approaches Apply statistical models during their experimental design process Report their data collected in a scientific way backed up by statistical analysis To summarize data into meaningful interpretable outputs To assess underlying relationships between factors/ attributes of interest Objectives of statistical analysis To help interpret quantitative results Descriptive statistics • Organise • Summarise • Simplify • Describe and present data Inferential statistics • Generalise from samples to populations • Hypothesis testing • Make predictions Some definitions Statistics: Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions. • Biostatistics: statistical processes and methods applied to the collection, analysis, and interpretation of biological data and especially data relating to human biology, health, and medicine. Some definitions Parameter eg µ, σ Statistic eg x(bar), s2 Variable Attribute ⚫ ⚫ Gender Attribute Male Female Variable: Characteristic or attribute that can assume different values Random Variable: values are determined by chance. Data Qualitative data Quantitative data Non-numerical Numerical variables “I even one time helped the doctor during a surgery operation and this was so exciting to me” “that was the most interesting part of it, you really feel what you have been learning, and it was interesting especially in the theatre” Discrete 5 mosquitos 3 colonies Continuous 11.54 g/dl Hb 3.652 µg/ml CRP Hemoglobin status Variable Attributes Values Relationship Normal Mild anemia 13 11 Severe anemia 6 Qualities of Variables ⚫ ⚫ Exhaustive -- Should include all possible answerable responses. Mutually exclusive -- No respondent should be able to have two attributes of a variable simultaneously (for example, male vs female, employed vs. unemployed). Levels of Measurement • Measurement is the process of assigning numbers to quantities ― This permits data to be categorised, quantified and/or analysed in order that meaningful conclusions can be drawn • The level of measurement: ― helps you decide what statistical analysis is appropriate on the values that were assigned ― helps you decide how to interpret the data from that variable Levels of Measurement 1. Nominal Scale Lowest Level 2. Ordinal Scale 3. Interval Scale 4. Ratio Scale Highest Level Nominal Scale • A measure of identity or category • Useful for quantifying qualitative data • Categories must be both mutually exclusive and comprehensive/exhaustive • Provides no information regarding either order or magnitude Ordinal Scale • When attributes can be rank-ordered… • Provides no information regarding magnitude • Distances between attributes do not have any meaning Table 2 Background characteristics of respondents Percent distribution of women and men age 15-49 by selected background characteristics, Uganda DHS 2016 UDHS 2016 UDHS 2016 Interval Scale ⚫ ⚫ ⚫ ⚫ ⚫ When distance between attributes has meaning, for example, temperature (in Celsius) - distance from 30-40 is same as distance from 70-80 No “true zero.” For example, there is no such thing as “no temperature,” at least not with Celsius. Zero does not mean the absence of value, but is actually another number used on the scale, like 0 degrees Celsius. Negative numbers also have meaning. Without a true zero, it is impossible to compute ratios: ⚫ ⚫ ⚫ 80 C degrees is not twice as hot as 40C Convert to Farenheiht Convert to K Ratio Scale ⚫ ⚫ ⚫ Has an absolute zero that is meaningful Can construct a meaningful ratio (fraction), for example, number of clients in past six months It is meaningful to say that “...we had twice as many clients in this period as we did in the previous six months” The Hierarchy of Levels Absolute zero Distance is meaningful Attributes can be ordered Attributes are only named; “weakest” Nominal, Ordinal, Interval or Ratio? 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. Blood lactate concentration (mmol l-1) Profile of Mood States (Scale 1-7) Heart Rate (beats min-1) Blood Group Year of Birth (AD) Atmospheric pressure (mmHg) The absorbance of a mixture of Biuret reagent with serum samples Hemoglobin (g/dL) Anemia (mild, moderate, severe) pH using a pH meter (Omar) pH using litmus paper (Omar) Enzyme activity (Katal) (Steven Ogaba) Vitamin A (IU) (Amos) The kelvin (symbol K) is the base unit of temperature in the International System of Units (SI). The Kelvin scale is an absolute thermodynamic temperature scale using as its null point absolute zero, the temperature at which all thermal motion ceases in the classical description of thermodynamics. The Kelvin scale is named after the Belfast-born, Glasgow University engineer and physicist William Lord Kelvin (1824– 1907), who wrote of the need for an "absolute thermometric scale". Unlike the degree Fahrenheit and degree Celsius, the kelvin is not referred to or typeset as a degree. Descriptive Statistics Descriptive Statistics are used by researchers to report on populations and samples (but mostly samples) Descriptive Statistics Summary descriptions of measurements (variables) taken from a group of people By summarizing information, Descriptive Statistics speed up and simplify comprehension of a group’s characteristics Sample vs. Population Population Saple Descriptive Statistics Which Group is Smarter? Class A: IQs of 13 Students 102 115 128 109 131 89 98 106 140 119 93 97 110 Class B: IQs of 13 Students 127 162 131 103 96 111 80 109 93 87 120 105 109 Each individual may be different. If you try to understand a group by remembering the qualities of each member, you become overwhelmed and fail to understand the group. Descriptive Statistics Which group is smarter now? Class A: Average IQ 110.54 Class B: Average IQ 110.23 They are roughly the same! • With a summary descriptive statistic, it is much easier to answer our question. Types of descriptive statistics Organize and Present Data Summarize Data • Tables • Graphs ―Bar Chart ―Histogram ―Stem and Leaf Plot ―Box plot* ―Scatter plot* • Central Tendency • Variation (including dispersion) Frequency Distribution Frequency Distribution of IQ for Two Classes IQ Frequency 82.00 87.00 89.00 93.00 96.00 97.00 98.00 102.00 103.00 105.00 106.00 107.00 109.00 111.00 115.00 119.00 120.00 127.00 128.00 131.00 140.00 162.00 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 Total 24 Relative Frequency Distribution of IQ for Two Classes IQ Frequency Percent Valid Percent Cumulative Percent 82.00 87.00 89.00 93.00 96.00 97.00 98.00 102.00 103.00 105.00 106.00 107.00 109.00 111.00 115.00 119.00 120.00 127.00 128.00 131.00 140.00 162.00 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 4.2 4.2 4.2 8.3 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 8.3 4.2 4.2 4.2 4.2 4.2 8.3 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 8.3 4.2 4.2 4.2 8.3 12.5 20.8 25.0 29.2 33.3 37.5 41.7 45.8 50.0 54.2 58.3 62.5 66.7 70.8 75.0 79.2 83.3 91.7 95.8 100.0 Total 24 100.0 100.0 Grouped Relative Frequency Distribution Relative Frequency Distribution of IQ for Two Classes IQ Frequency 80 – 89 90 – 99 100 – 109 110 – 119 120 – 129 130 – 139 140 – 149 150 and over Total 3 5 6 3 3 2 1 1 24 Percent Cumulative Percent 12.5 20.8 25.0 12.5 12.5 8.3 4.2 4.2 12.5 33.3 58.3 70.8 83.3 91.6 95.8 100.0 100.0 100.0 Histogram 6 5 Frequency 4 3 2 1 Mean = 110.4583 Std. Dev. = 19.00338 N = 24 0 80.00 100.00 120.00 IQ Based on the IQ data 140.00 160.00 Bar Graph Effect of an intervention to introduce OFSP on vitamin A intake IP = Intensive program; RP = Reduced program Stem and Leaf Plot Stem and Leaf Plot of IQ for Two Classes Stem 8 9 10 11 12 13 14 15 16 Leaf 279 3678 235679 159 078 1 0 2 Key 8 2 means 82 Summarizing data Central Tendency Variation (or Measure of Location or the “Middle Values” of the Groups) (or Measure of Dispersion or Summary of Differences within Groups) • Mean • Median • Mode • • • • Range Interquartile Range Variance Standard Deviation Mean Class A: IQs of 13 Students 102 115 128 109 131 89 98 106 140 119 93 97 110 Σ Yi = 1437 Y-barA = Σ Yi = 1437 = 110.54 n 13 Class B: IQs of 13 Students 127 162 131 103 96 111 80 109 93 87 120 105 109 Σ Yi = 1433 Y-barB = Σ Yi = 1433 = 110.23 n 13 Mean The mean is the “balance point” Each person’s score is like 1 pound placed at the score’s position on a see-saw. Below, on a 200 cm see-saw, the mean equals 110, the place on the see-saw where a fulcrum finds balance: 1 lb at 93 cm 17 units below 1 lb at 106 cm 1 lb at 131 cm 110 cm 4 units below 21 units above 0 units The scale is balanced because… 17 + 4 on the left = 21 on the right Mean 1. Means can be badly affected by outliers (data points with extreme values unlike the rest) 2. Outliers can make the mean a bad measure of central tendency or common experience Income in Uganda Sudhir Ruparelia All of us Mean Outlier Median • The middle value when a variable’s values are ranked in order; the point that divides a distribution into two equal halves. • When data are listed in order, the median is the point at which 50% of the cases are above and 50% below it. • The 50th percentile. Median Class A--IQs of 13 Students 89 93 97 98 102 106 109 110 115 119 128 131 140 Median = 109 (six cases above, six below) 89 93 97 98 102 106 109 110 115 119 128 131 140 If the first student were to drop out of Class A, there would be a new median: Median = 109.5 109 + 110 = 219/2 = 109.5 (six cases above, six below) Median • The median is unaffected by outliers, making it a better measure of central tendency, better describing the “typical person” than the mean when data are skewed. • If the recorded values for a variable form a symmetric distribution, the median and mean are identical. • In skewed data, the mean lies further toward the skew than the median. Symmetric Skewed Mean Mean Median Median Mode • The most common data point • The mode may not be at the center of a distribution The combined IQ scores for Classes A & B: 80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162 The mode!! Bimodal distributions Mode 1. 2. 3. It may give you the most likely experience rather than the “typical” or “central” experience. In symmetric distributions, the mean, median, and mode are the same. In skewed data, the mean and median lie further toward the skew than the mode. Symmetric Skewed Mean Median Mode Mode Median Mean Summarizing data Central Tendency Variation (or Measure of Location or the “Middle Values” of the Groups) (or Measure of Dispersion or Summary of Differences within Groups) ✓Mean ✓Median ✓Mode • • • • Range Interquartile Range Variance Standard Deviation Interquartile Range • A quartile is the value that marks one of the divisions that breaks a series of values into four equal parts. • The median is a quartile and divides the cases in half. • 25th percentile is a quartile that divides the first ¼ of cases from the latter ¾. • 75th percentile is a quartile that divides the first ¾ of cases from the latter ¼. • The interquartile range is the distance or range between the 25th percentile and the 75th percentile. Below, what is the interquartile range? 25% of cases 0 25% 250 25% 500 750 25% of cases 1000 Interquartile range (IQR) IQR = Q3 – Q1 Variance A measure of the spread of the recorded values on a variable. A measure of dispersion. The larger the variance, the further the individual cases are from the mean. Mean The smaller the variance, the closer the individual scores are to the mean. Mean Variance Calculating variance starts with a“deviation.” A deviation is the distance away from the mean of a case’sscore. Yi – Y-bar • We want to add the individual deviations to get total deviation, but if we were to do that, we would get zero every time. Why? • We need a way to eliminate negative signs. Squaring the deviations will eliminate negative signs... (Yi – Y-bar)2 Variance When the squared deviations are all added together, you get what we call the “Sum of Squares.” Sum of Squares (SS) = Σ (Yi – Y-bar)2 SS = (Y1 – Y-bar)2 + (Y2 – Y-bar)2 + . . . + (Yn – Y-bar)2 The approximate average sum of squares is the variance • SS/N = Variance for a population • SS/n-1 = Variance for a sample • Variance = Σ(Yi – Y-bar)2 / n –1 Standard Deviation To convert variance into something of meaning, we use standard deviation. The square root of the variance reveals the average deviation of the observations from the mean Summary of steps s.d. = Σ(Yi – Y-bar)2 n-1 • Variance – Deviation from mean for each case – Square the deviations – Sum of squares – Divide by n-1 • Standard deviation – Square root of the variance Standard Deviation 1. Larger s.d. = greater amounts of variation around the mean. For example: 19 2. 3. 4. 25 Y = 25 s.d. = 3 31 13 25 37 Y = 25 s.d. = 6 s.d. = 0 only when all values are the same (this means you have a constant and not a“variable”) If you were to “rescale” a variable, the s.d. would change by the same magnitude—if we changed units above so the mean equaled 250, the s.d. on the left would be 30, and on the right, 60 Like the mean, the s.d. will be inflated by an outlier case value. Summarizing data Central Tendency Variation (or Measure of Location or the “Middle Values” of the Groups) (or Measure of Dispersion or Summary of Differences within Groups) ✓Mean ✓Median ✓Mode ✓Range ✓Interquartile Range ✓Variance ✓Standard Deviation Box-Plots* A way to graphically portray almost all the descriptive statistics at once Outliers • Mild outlier: any point beyond an inner fence on either side • Extreme outlier: any point beyond an outer (± 3*IQR) fence Box plots C-Reactive protein as a prognostic indicator in hospitalized patients with COVID-19 Scatter diagrams